With the rise of AI, web crawlers are suddenly controversial – The Verge
February 19, 2024
For decades, the main focus of robots.txt was on search engines; you’d let them scrape your site and in exchange they’d promise to send people back to you. Now AI has changed the equation: companies around the web are using your site and its data to build massive sets of training data, in order to build models and products that may not acknowledge your existence at all. The robots.txt file governs a give and take; AI feels to many like all take and no give. But there’s now so much money in AI, and the technological state of the art is changing so fast that many site owners can’t keep up. And the fundamental agreement behind robots.txt, and the web as a whole — which for so long amounted to “everybody just be cool” — may not be able to keep up either.
Source: With the rise of AI, web crawlers are suddenly controversial – The Verge
The web is a public good: it is available to all (“nonexcludable”) and can be enjoyed over and over again by anyone without diminishing the benefits it delivers to others (“nonrival”) [IMF].
It’s also a social contract in which, among other things, just about everyone who creates anything online gives search engines access in return for the utility of search. For free. While the benefits, particularly for larger publishers, haven’t always been equal, search engines have long been a key driver of traffic, the lifeblood of those publishers.
But now large language models like the GPT family have been trained in no small part on this web content. And the reciprocal benefit to those who created the “training data” is a lot harder to see. When a service answers a question based in part on content you created, and you get no traffic in return, it’s fair to ask why this social contract should stand.
This excellent article by The Verge looks at the history of robots.txt, which it turns out is over 30 years old. These simple files outline a site’s policy toward ‘robots’ (like search engine and AI crawlers) and what counts as acceptable use. But as the article observes, robots.txt is not a contract. It’s (most likely) not legally enforceable.
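To make that concrete, here is a minimal sketch of how a well-behaved crawler might consult such a file, using Python's standard `urllib.robotparser`. The sample policy, the `can_crawl` helper, and the example.com URL are all made up for illustration; `GPTBot` and `Googlebot` are the user-agent names those crawlers publish. Nothing here stops a crawler that simply never checks.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: turn away one AI crawler, welcome everyone else.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the policy text permits user_agent to fetch url.

    Illustrative helper, not part of any library. It only reports what
    the file asks for; honoring the answer is up to the crawler.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "Googlebot"):
        allowed = can_crawl(SAMPLE_ROBOTS_TXT, agent, "https://example.com/post")
        print(agent, "allowed" if allowed else "disallowed")
```

The whole arrangement rests on the crawler choosing to run a check like this before it fetches anything, which is exactly why the file works as a request and not as a lock.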
So what happens now? We’re at the very beginning of finding out.