
Cloudflare moves to end free, endless AI scraping with one-click blocking

Cloudflare may charge an app store-like fee for its AI-scraping data marketplace.

Ars Technica

Cloudflare announced new tools Monday that it claims will help end the era of endless AI scraping by giving all sites on its network the power to block bots in one click.

That will help stop the firehose of unrestricted AI scraping, but, perhaps even more intriguing to content creators everywhere, Cloudflare says it will also make it easier to identify which content bots scan most, so that sites can eventually wall off access and charge bots to scrape their most valuable content. To pave the way for that future, Cloudflare is also creating a marketplace where all sites can negotiate content deals based on more granular AI audits of their sites.

These tools, Cloudflare's blog said, give content creators "for the first time" ways "to quickly and easily understand how AI model providers are using their content, and then take control of whether and how the models are able to access it."

That's necessary for content creators because the rise of generative AI has made it harder to value their content, Cloudflare suggested in a longer blog explaining the tools.

Previously, sites could distinguish between approving access to helpful bots that drive traffic, like search engine crawlers, and denying access to bad bots that try to take down sites or scrape sensitive or competitive data.

But now, "Large Language Models (LLMs) and other generative tools created a murkier third category" of bots, Cloudflare said, that don't perfectly fit in either category. They don't "necessarily drive traffic" like a good bot, but they also don't try to steal sensitive data like a bad bot, so many site operators don't have a clear way to think about the "value exchange" of allowing AI scraping, Cloudflare said.

That's a problem because enabling all scraping could hurt content creators in the long run, Cloudflare predicted.

"Many sites allowed these AI crawlers to scan their content because these crawlers, for the most part, looked like 'good' bots—only for the result to mean less traffic to their site as their content is repackaged in AI-written answers," Cloudflare said.

All this unrestricted AI scraping "poses a risk to an open Internet," Cloudflare warned, proposing that its tools could set a new industry standard for how content is scraped online.

How to block bots in one click

Increasingly, creators fighting to control what happens with their content have been pushed to either sue AI companies to block unwanted scraping, as The New York Times has, or put content behind paywalls, decreasing public access to information.

While some big publishers have been striking content deals with AI companies to license content, Cloudflare is hoping new tools will help to level the playing field for everyone. That way, "there can be a transparent exchange between the websites that want greater control over their content, and the AI model providers that require fresh data sources, so that everyone benefits," Cloudflare said.

Today, Cloudflare site operators can stop manually blocking each AI bot one by one and instead choose to "block all AI bots in one click," Cloudflare said.

They can do this by visiting the Bots section under the Security tab of the Cloudflare dashboard, then clicking a blue link in the top-right corner "to configure how Cloudflare’s proxy handles bot traffic," Cloudflare said. On that screen, operators can easily "toggle the button in the 'Block AI Scrapers and Crawlers' card to the 'On' position," blocking everything and giving content creators time to strategize what access they want to re-enable, if any.
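Cloudflare's toggle enforces blocking at its proxy, so it works even against crawlers that ignore polite requests. By contrast, the manual, per-bot approach it replaces typically meant listing each crawler's user-agent token in robots.txt, which compliant bots honor only voluntarily. A sketch of that older approach, using real published crawler tokens (the list is illustrative, not exhaustive, and is not part of Cloudflare's tooling):

```
# robots.txt — advisory only; non-compliant crawlers can ignore it
User-agent: GPTBot
User-agent: CCBot
User-agent: ClaudeBot
User-agent: Google-Extended
Disallow: /
```

Because new AI crawlers appear faster than such lists can be updated, a network-level block like Cloudflare's removes the need to track each token by hand.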

Beyond just blocking bots, operators can also conduct AI audits, quickly analyzing which sections of their sites are scanned most by which bots. From there, operators can decide which scraping is allowed and use sophisticated controls to decide which bots can scrape which parts of their sites.
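The kind of audit Cloudflare describes can be roughly approximated by any site operator from ordinary server access logs: group requests by crawler user agent and path, then count. A minimal sketch in Python, assuming the common Nginx/Apache "combined" log format (the bot tokens and sample log lines are illustrative, and this is not Cloudflare's implementation):

```python
import re
from collections import Counter

# Matches the Nginx/Apache "combined" log format, capturing the
# requested path and the user-agent string.
LOG_PATTERN = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (?P<path>\S+) [^"]*" '
    r'\d{3} \d+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Illustrative user-agent tokens of known AI crawlers.
AI_BOT_TOKENS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")

def audit(log_lines):
    """Count requests per (bot, path) for known AI crawler user agents."""
    counts = Counter()
    for line in log_lines:
        m = LOG_PATTERN.match(line)
        if not m:
            continue
        agent = m.group("agent")
        for token in AI_BOT_TOKENS:
            if token in agent:
                counts[(token, m.group("path"))] += 1
    return counts

# Hypothetical sample log lines for demonstration.
sample = [
    '1.2.3.4 - - [01/Jul/2024:00:00:00 +0000] "GET /articles/ai HTTP/1.1" '
    '200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '1.2.3.5 - - [01/Jul/2024:00:00:01 +0000] "GET /articles/ai HTTP/1.1" '
    '200 512 "-" "CCBot/2.0"',
]
print(audit(sample).most_common())
```

Cloudflare's version has the advantage of seeing traffic across its whole network, so it can identify and verify crawlers that a single site's logs cannot.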

"For some teams, the decision will be to allow the bots associated with AI search engines to scan their Internet properties because those tools can still drive traffic to the site," Cloudflare's blog explained. "Other organizations might sign deals with a specific model provider, and they want to allow any type of bot from that provider to access their content."

For publishers already playing whack-a-mole with bots, a key perk would be the ability to write rules restricting bots that scrape sites for both "good" and "bad" purposes, keeping the good scraping while throwing away the bad.
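Cloudflare already exposes this kind of selective control through its rules language, which can match on request attributes such as the user agent. A sketch of the idea (the field names are drawn from Cloudflare's rules language, but the specific expressions and the GPTBot example are assumptions, not taken from Cloudflare's announcement):

```
# Hypothetical Cloudflare custom-rule expressions (rules language);
# comments are annotations here, not valid rule syntax.

# Block a specific AI crawler by its user-agent token:
(http.user_agent contains "GPTBot")    -> action: Block

# Let verified bots (e.g., search-engine crawlers) through,
# while a blanket AI-bot block handles everything else:
(cf.client.bot)                        -> action: Skip
```

Rules like these only help when a bot behaves consistently, which is exactly the problem with crawlers that mix "good" and "bad" scraping under one user agent.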

Perhaps the most frustrating bot for publishers today is the Googlebot, which scrapes sites to populate search results as well as to train AI to generate Google search AI overviews that could negatively impact traffic to source sites by summarizing content. Publishers currently have no way of opting out of training models fueling Google's AI overviews without losing visibility in search results, and Cloudflare's tools won't be able to get publishers out of that uncomfortable position, Cloudflare CEO Matthew Prince confirmed to Ars.

For site operators tempted to block all AI scraping in one click, the risk of cutting off the Googlebot and inadvertently causing dips in search traffic may be a compelling reason not to use Cloudflare's solution.

However, Prince expects "that Google's practices over the long term won't be sustainable" and "that Cloudflare will be a part of getting Google and other folks that are like Google" to give creators "much more granular control over" how bots like the Googlebot scrape the web to train AI.

Prince told Ars that while Google solves its "philosophical" internal question of whether the Googlebot's scraping is for search or for AI, a technical solution to block one bot from certain kinds of scraping will likely soon emerge. And in the meantime, "there can also be a legal solution" that "can rely on contract law" based on improving sites' terms of service.

Not every site would, of course, be able to afford a lawsuit to challenge AI scraping, but to help creators better defend themselves, Cloudflare drafted "model terms of use that every content creator can add to their sites to legally protect their rights as sites gain more control over AI scraping." With these terms, sites could perhaps more easily dispute any restricted scraping discovered through Cloudflare's analytics tools.

"One way or another, Google is going to get forced to be more fine-grained here," Prince predicted.

Getting paid for AI scraping isn’t just for big publishers

Using the AI audit data, content creators will soon also be able to access a new feature to "reliably set a fair price for their content that is used by AI companies for model training and retrieval augmented generation (RAG)," Cloudflare's blog said.

"Site owners will have the ability to set a price for their site, or sections of their site, and to then charge model providers based on their scans and the price you have set," Cloudflare's blog said.

That price-setting feature is still in development, but site operators can join a waitlist to help Cloudflare test it out, "based on the date they first joined Cloudflare," the company said.

Prince told Ars that it's currently unclear how this marketplace will work or if Cloudflare will charge for access.

If Cloudflare is simply matching AI companies with content creators to facilitate deals, Cloudflare may never charge a fee, considering it a feature of its overall services package.

But if Cloudflare ends up processing and sending payments, Cloudflare would likely charge a fee, similar to how an app store takes a cut when processing app payments. Before that could happen, Prince said, Cloudflare would have to research Know Your Customer and anti-money laundering laws, but Prince signaled that Cloudflare doesn't consider those regulatory hurdles to be a deterrent.

"It might not be that we charge a fee specifically" for the marketplace, Prince told Ars, "but my hunch is that we would, especially because we want to service the 40 million-plus websites that use Cloudflare today. Figuring out how to get payments to all of them, it will be a challenging, although solvable task, and we would need to at least recoup what the costs of doing that are."

While this tool is pitched as a way for "sites of any size" to be "fairly compensated" for content, Cloudflare expects it could also become a go-to tool for publishers currently left out of early AI licensing deals and seeking to negotiate similar terms.

According to Cloudflare's explainer blog, the price-setting tool "will provide advanced analytics to understand metrics that are commonly used" when major publishers negotiate content deals with AI companies. (Ars Technica parent company Condé Nast struck such a deal with OpenAI in August.) Those deals depend on having data "about the frequency of scanning and the type of content that can be accessed," Cloudflare said, and now all site operators will have access to that data.

This potentially makes it easier for big and small sites to negotiate in terms that AI companies already understand. Further, any site can generate a report auditing AI activity to alter deals as their content strategies and AI technologies evolve.

"In the future," Cloudflare claimed, "even the largest content creators will benefit from Cloudflare’s seamless price setting and transaction flow, making it easy for model providers to find fresh content to scan they may otherwise be blocked from, and content providers to take control and be paid for the value they create."

In the announcement blog, Prince promised that Cloudflare's tools would help set a new standard for how content creators are compensated for AI training on their works.

“AI will dramatically change content online, and we must all decide together what its future will look like,” Prince said. "With Cloudflare's scale and global infrastructure, we believe we can provide the tools and set the standards to give websites, publishers, and content creators control and fair compensation for their contribution to the Internet, while still enabling AI model providers to innovate."