Reddit, Google, and the Real Cost of the AI Data Rush

28.07.2024 12:00

Thecut.com

Photo-Illustration: Intelligencer; Photo: Getty Images

As of this past week, the only major search engine that still includes Reddit is Google. You won’t find recent Reddit links in Microsoft’s Bing, or get useful results from the platform on privacy-centric search engine DuckDuckGo. For most people, this change won’t matter much in the short term — most people, after all, use Google, which hovers at somewhere around 90 percent of global search market share, and they can still visit Reddit directly — but it’s a weird one. Reddit is a platform with deep connections to the web, beginning its life as a link aggregator and growing into an online community with more than 80 million daily active users (the more the better). Why, according to a report in 404 Media, is it suddenly battening its hatches?

As is often the case when tech companies are behaving strangely or unpredictably these days, the answer has something to do with AI. Earlier this year, Google entered into a deal with Reddit, reportedly valued at $60 million per year, to license the site’s data for training AI models. In recent months, Google watchers have also noticed an increase in Reddit posts showing up in search results, with user comments ranking highly in a wide range of queries. While these stories are related, they’re also somewhat independent: Google, like others in AI, wants to license data so it can build better models and avoid getting sued; Google users were already adding “Reddit” to search queries as a sort of search quality hack, so the company was in some sense following their lead.

It’s where the two stories overlap that things are starting to break down. Search engines gather relevant and up-to-date information by crawling the web with bots, indexing what they find, and ranking it for relevance to users’ searches. Websites have some control over whether and how this crawling takes place, and there are various reasons they might refuse crawling of part of all of their sites (a private individual might want to keep an old blog online but unsearchable, while a company like Facebook might want Google users to be able to find profiles but not search their contents). For decades, though, crawling was part of a straightforward and mutually beneficial arrangement. Search engines gathered lots of users by offering a useful service; websites allowed and even catered to search engine crawling in order to connect with those audiences.

In the last few years, though, crawling has assumed an additional purpose. Those robots that are indexing your site and reading all your data aren’t just building a search index. They might be building an AI model, too. This, for many websites, is very much not part of the deal. As David Pierce writes at The Verge, the sudden pivot from search index to AI training means that “the basic social contract of the web is falling apart.” A mutually beneficial arrangement is being replaced by an extractive one, driven by frantic and unilateral actions of startups and tech giants alike.

At first, the consequences of this breakdown were relatively contained, with large websites and platforms — owned by big companies like Facebook and Amazon — explicitly blocking crawlers from firms like OpenAI. This clarity didn’t last long. Google is all-in on AI. Bing is owned by Microsoft, which is OpenAI’s largest investor and partner. Suddenly, all the search companies were also AI companies, and there were new crawlers in the mix. All crawling was AI crawling — to assume otherwise would be naive. The sudden harvest was apparent to anyone paying attention to their traffic stats. The bots were scraping everything they could.

In response to 404 Media’s reporting, critics have made the case that Google — a company that otherwise seemed spooked by past and potential antitrust enforcement — is nonetheless buying an unfair advantage for a product that’s already nearly a monopoly, and they have a point: Without Reddit, one of the largest repositories of authentic human text on the internet, smaller search engines can’t compete.

But the story isn’t complete without Reddit, which is the company actually doing the blocking. (Microsoft, for its part, has confirmed its crawlers have been prohibited.) As a Reddit spokesperson told The Verge:

This is not at all related to our recent partnership with Google. We have been in discussions with multiple search engines. We have been unable to reach agreements with all of them, since some are unable or unwilling to make enforceable promises regarding their use of Reddit content, including their use for AI.

This is practical behavior by the leadership of Reddit, a public company with a duty to its shareholders, but also plainly bad for the general public: in addition to reducing access for users who don’t want to use Google, it further subordinates Reddit to Google’s specific search incentives, which are changing fast anyway, in part because of AI; already, spammers are polluting popular threads on Reddit, which is suddenly getting enormous amounts of traffic from Google, in hopes of getting more visibility in search results.

Google, like Reddit, owes its existence and success to the principles and practices of the open web, but exclusive arrangements like these mark the end of that long and incredibly fruitful era. They’re also a sign of things to come. The web was already in rough shape, reduced over the last 15 years by the rise of walled-off platforms, battered by advertising consolidation, and polluted by a glut of content from the AI products that used it for training. The rise of AI scraping threatens to finish the job, collapsing a flawed but enormously successful, decades-long experiment in open networking and human communication to a set of antagonistic contracts between warring tech firms.

More screen time