Major Publishers Block Internet Archive Over Concerns About AI Scraping
Several prominent publications, including The New York Times and The Guardian, have taken steps to block the Internet Archive's access to their content. This move is aimed at preventing AI scrapers from using the archive's collections as a workaround to bypass article restrictions.
The issue revolves around AI companies' use of bots that rely on structured databases of content, such as those provided by the Internet Archive. These bots scrape articles for use as training material for large language models, and publishers worry that such access hands their competitors an unfair advantage.
In a statement, Robert Hahn, head of business affairs and licensing at The Guardian, noted that AI businesses seek readily available databases of content to feed into their systems. "The Internet Archive's API would have been an obvious place for them to plug their own machines into and suck out the IP," he said.
Other publications, such as The Financial Times and Reddit, have also implemented measures to selectively block how the Internet Archive catalogs their material. These steps aim to prevent AI companies from accessing content without authorization.
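One common mechanism for the kind of selective blocking described above is a site's `robots.txt` file, which tells compliant crawlers what they may fetch. The sketch below is a hypothetical excerpt, assuming the publisher targets the Internet Archive's crawler by its historical `ia_archiver` user-agent string; the article does not specify which technical measures the publishers actually used.

```
# Hypothetical robots.txt excerpt (illustrative only):
# deny the Internet Archive's crawler while leaving other bots unaffected.
User-agent: ia_archiver
Disallow: /

# All other crawlers remain allowed by default.
User-agent: *
Allow: /
```

Note that `robots.txt` is a voluntary convention rather than an enforcement mechanism, so publishers may pair it with server-side blocking.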
A growing number of publishers are also suing AI businesses over content access. Several major outlets, including The New York Times, The Wall Street Journal, and the Chicago Tribune, have taken legal action against companies such as OpenAI, Microsoft, Perplexity, and Cohere.
While some media outlets have negotiated financial deals with AI companies in exchange for allowing their libraries to be used as training material, others are taking a more aggressive approach. The divide highlights the ongoing tension between the publishing industry and the rapidly evolving world of artificial intelligence.