‘Preserving the web is not the problem. Losing it is,’ says the director of the Wayback Machine

The Internet Archive, the organization that runs the handy archival tool the Wayback Machine, has reportedly seen sites such as Reddit, The New York Times, and The Guardian block its crawlers over concerns about AI scraping. Mark Graham, director of the Wayback Machine, thinks “These concerns are understandable, but unfounded.”

In a new blog post on Techdirt titled “Preserving The Web Is Not The Problem. Losing It Is”, Graham affirms that “The Wayback Machine is built for human readers.” The post lays out why many site owners are wary of AI scrapers, and describes the measures the Internet Archive has put in place to stop those bots.

Graham says the Internet Archive uses rate limiting, filtering, and monitoring to stop large-scale bots from stealing all that data. He also says the team keeps an eye out for new ways of abusing archival sites, so it can head off future mass scraping.
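The rate limiting Graham mentions is conceptually simple. As a rough illustration only (a minimal sketch of a generic per-client sliding-window limiter with made-up thresholds, not the Internet Archive's actual system), a throttle of this kind blocks any single client that makes far more requests than a human reader plausibly would:

```python
import time
from collections import defaultdict, deque

# Hypothetical per-client sliding-window rate limiter. A rough sketch of the
# kind of throttling Graham describes, with made-up thresholds; it is not the
# Internet Archive's actual implementation.
WINDOW_SECONDS = 60   # look-back window
MAX_REQUESTS = 120    # requests allowed per client within that window

_request_log = defaultdict(deque)  # client_id -> timestamps of recent requests

def allow_request(client_id: str, now: float | None = None) -> bool:
    """Return True if the client is under the limit, False if it should be throttled."""
    now = time.monotonic() if now is None else now
    log = _request_log[client_id]
    # Drop timestamps that have aged out of the window.
    while log and now - log[0] > WINDOW_SECONDS:
        log.popleft()
    if len(log) >= MAX_REQUESTS:
        return False  # bulk scrapers trip this long before human readers do
    log.append(now)
    return True
```

A real deployment would pair something like this with the user-agent filtering and traffic monitoring Graham also mentions.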

He goes on to say, “Whatever legitimate concerns people may have about generative AI, libraries are not the problem, and blocking access to web archives is not the solution; doing so risks serious harm to the public record.”

Citing tech policy writer Mike Masnick, Graham argues that “blocking preservation efforts risks a profound unintended consequence.” In other words, if archival tools aren’t allowed to preserve sites on the internet, the historical record becomes far easier to tamper with for future generations.


Graham also argues that blocking archival sites means “Journalists lose tools for accountability. Researchers lose evidence. The web becomes more fragile and more fragmented, and history becomes easier to rewrite.”

One point the post doesn’t address is that some websites, particularly those with paywalls like The New York Times, may want to block archival tools so readers can’t use them to sidestep those restrictions. The Wayback Machine may work as a kind of digital library, but it’s one with no limit on the number of copies it can lend out, and no easy way of knowing who’s borrowing them.

There’s a push and pull here: archival tools need access to sites to preserve the history and integrity of the internet, while some of those sites need paying readers to keep doing their work. AI hoovering up that work only makes the tension worse, even if the Internet Archive reckons it can handle those nasty scrapers.

