Behind The IAB Tech Lab’s New Initiative To Deal With AI Scraping And Publisher Revenue Loss

In June, the IAB Tech Lab proposed a new initiative to create guardrails around how AI bots are permitted to access content, with an emphasis on publisher monetization.

It’s hoping that its new solution will get publishers back on their feet – and keep them there.

Publishers are like “the plankton of the digital media ecosystem,” said IAB Tech Lab CEO Anthony Katsur.

Every living thing in an aquatic environment depends on plankton. If they die out, the rest of the ocean goes down with them. And if publishers collapse, that would be an “extinction-level event” for digital media, Katsur said.

Many publishers are still managing to stay afloat, but the water is choppy, with traffic falling off the metaphorical cliff and no safety harness in sight.

A life raft for publishers

The IAB Tech Lab’s initiative, currently called the LLM Content Ingest API Initiative (“which we need to rename,” Katsur joked; it’s “a mouthful”) can be broken down into four major components.

The first is access controls, which determine who is allowed to access a publisher’s content in the first place.
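
For context, the bluntest access-control tool publishers have today is robots.txt, which is advisory rather than enforceable – presumably part of what stronger access controls would address. A minimal sketch (GPTBot and CCBot are real crawler user agents; the specific policy is illustrative):

```
# robots.txt – a minimal sketch of crawler access control.
# GPTBot is OpenAI's crawler; CCBot is Common Crawl's.
# Note: robots.txt is advisory only; bots can ignore it.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /archive/

User-agent: *
Allow: /
```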

Once controls are established, access terms come into play, such as licensing models and content tiers. Under the IAB Tech Lab’s guidelines, content would be segregated into tiers based on relevance and value.

“Your archival content from 10 years ago is not worth as much as your late-breaking news or your interview with Taylor Swift,” Katsur said.
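
The spec itself hasn’t been published, so any schema is speculative, but a tiering scheme along those lines might look something like this (tier names and rates are invented for illustration):

```python
# Hypothetical content tiers, priced per 1,000 tokens ingested.
# Names and rates are invented; the IAB Tech Lab has not
# published a pricing schema.
CONTENT_TIERS = {
    "breaking_news": {"max_age_days": 2,    "usd_per_1k_tokens": 0.50},
    "evergreen":     {"max_age_days": 365,  "usd_per_1k_tokens": 0.10},
    "archival":      {"max_age_days": None, "usd_per_1k_tokens": 0.02},
}
```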

The guidelines would also mandate logging the use of content, which Katsur defines as “tracking and recording when and how publisher content is accessed or used by an LLM or AI system,” so publishers can accurately invoice and track usage of their data.
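
A usage-log record along those lines might capture who accessed what, when and for what purpose (all field names here are hypothetical, not part of any published spec):

```python
# Hypothetical usage-log entry for invoicing and auditing.
# Field names are illustrative only.
log_entry = {
    "publisher_id": "pub-1234",
    "content_id": "article-5678",
    "tier": "breaking_news",
    "bot_user_agent": "GPTBot/1.0",
    "accessed_at": "2025-07-15T14:02:11Z",
    "use": "inference",      # vs. "training"
    "tokens_served": 1850,
}
```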

Content logging ties into the final part of the initiative, which Katsur believes is the most important facet: tokenization. Tokenization involves breaking content down into smaller units made up of words, parts of words, punctuation or metadata, Katsur said. These units, called tokens, are used to train LLMs and generate their responses. Publisher content gets tokenized and uniquely assigned to each publisher.

Then, “using the logging and reporting functions that we are proposing,” he explained, publishers can see exactly how the information scraped from their sites is being used.
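
Production models tokenize with subword schemes like byte-pair encoding, but a deliberately minimal Python sketch of the underlying idea – split content into tokens and keep each one attributed to the publisher it came from – might look like this (the attribution scheme is an illustration here, not the Tech Lab’s published design):

```python
import re

def tokenize(text: str) -> list[str]:
    """Naive tokenizer: words and punctuation marks.
    Real LLMs use subword schemes such as byte-pair encoding."""
    return re.findall(r"\w+|[^\w\s]", text)

def attribute_tokens(publisher_id: str, text: str) -> list[tuple[str, str]]:
    """Pair every token with the publisher it was scraped from,
    so downstream logging can track and invoice per use."""
    return [(publisher_id, tok) for tok in tokenize(text)]

tokens = attribute_tokens("pub-1234", "Taylor Swift announced a new tour.")
# [('pub-1234', 'Taylor'), ('pub-1234', 'Swift'), ...]
```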

Tokenization is useful for brands, too, so they can see what is being said about their products and by whom. Many LLMs scrape sites like Reddit, for example, and parrot back what they find as fact – despite the information often being outdated, if not outright incorrect.

As AI continues to make a name for itself in search, a set of guidelines like the LLM Content Ingest API Initiative (looking forward to that new name) is the best way to ensure that query responses are accurate, Katsur said, and that publishers – and with them, the rest of the ad tech ecosystem – continue to thrive.

The big picture

But let’s zoom out.

What actually happens when a bot scrapes a website?

First, it’s important to note that AI isn’t born with limitless knowledge. It has to get that knowledge from somewhere. That’s why AI bots mine websites, which are vast troves of information.

Sometimes, scraping is one-and-done. When a query is for something straightforward, like a chocolate chip cookie recipe, a bot typically won’t need to continue scraping a site for more updated information, Katsur explained, since a cookie recipe doesn’t generally update or evolve. And once an AI model has a good recipe, it can feed it (no pun intended) to the hundreds of thousands of people requesting it.

It’s not guaranteed that after a page is scraped once it never will be scraped again. There is a common misconception “that once an LLM crawls, it stores all the data and never crawls again,” said Katsur. The IAB Tech Lab’s research has shown that crawlers will recrawl content they have already accessed.

Still, getting paid for a handful of additional scrapes of the same page doesn’t come close to matching the scale of the pay-per-visit model that publishers are used to.

With a pay-per-crawl model, a publisher gets paid when a bot pulls information from its site – and that’s basically the end of the story. No matter how many of a generative AI search engine’s users benefit from that information down the line, the publisher only gets paid once per scrape.

Pay per query, on the other hand, is more similar to the way publishers currently drive revenue, and is the model favored by the IAB Tech Lab. “Now you’re getting paid per use,” said Katsur, “which is similar to getting paid per visit.”

“Pay per query scales,” he said. “Pay per crawl does not.”
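
Some rough arithmetic shows why. Suppose a single crawl of an article ends up answering 100,000 user queries (every number below is invented to illustrate the shape of the two models, not an actual rate):

```python
# Illustrative only: rates and query volume are invented.
queries_served = 100_000      # queries answered from one crawled article
pay_per_crawl_rate = 0.01     # USD, paid once at crawl time
pay_per_query_rate = 0.0001   # USD, paid every time the content is used

crawl_revenue = pay_per_crawl_rate                   # $0.01, flat
query_revenue = pay_per_query_rate * queries_served  # $10.00, grows with usage

print(f"pay per crawl: ${crawl_revenue:.2f}")   # pay per crawl: $0.01
print(f"pay per query: ${query_revenue:.2f}")   # pay per query: $10.00
```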

Problem is, even pay per crawl isn’t guaranteed. Plenty of bots are scraping sites without providing any compensation and, technically, that’s allowed – for now.

But that seems to be changing, as more companies develop models that put publisher monetization at the forefront.

Earlier this month, Cloudflare implemented a new pay-per-crawl model that gives publishers free rein over the access they provide to bots. Publishers can grant full access, block all scraping or opt into the new pay-per-crawl model, which requires bots to share payment information so they can be charged for each scrape.
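
Mechanically, Cloudflare’s design revives the long-dormant HTTP 402 “Payment Required” status code: a bot with no payment arrangement gets a 402 and a price, and can retry declaring what it’s willing to pay. A simplified sketch of that handshake (the flow and header names follow Cloudflare’s public announcement as we understand it; treat the details as assumptions):

```python
PRICE_PER_CRAWL_USD = 0.01  # set by the publisher; value invented

def handle_crawl_request(headers: dict) -> tuple[int, dict]:
    """Simplified pay-per-crawl handshake, loosely modeled on
    Cloudflare's announced HTTP 402 design."""
    offered = headers.get("crawler-max-price")
    if offered is not None and float(offered) >= PRICE_PER_CRAWL_USD:
        # Charge the crawler's registered payment method and serve the page.
        return 200, {"crawler-charged": str(PRICE_PER_CRAWL_USD)}
    # No (or insufficient) payment intent: quote the price and refuse.
    return 402, {"crawler-price": str(PRICE_PER_CRAWL_USD)}

status, _ = handle_crawl_request({})                             # 402
status, _ = handle_crawl_request({"crawler-max-price": "0.02"})  # 200
```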

That’s something – although, until this sort of model is widely adopted, publisher traffic is still in serious danger.

But, hey, along with the LLM Content Ingest API Initiative, it’s definitely a start.
