Behind The IAB Tech Lab’s New Initiative To Deal With AI Scraping And Publisher Revenue Loss

In June, the IAB Tech Lab proposed a new initiative to create guardrails around how AI bots are permitted to access content, with an emphasis on publisher monetization.

It’s hoping that its new solution will get publishers back on their feet – and keep them there.

Publishers are like “the plankton of the digital media ecosystem,” said IAB Tech Lab CEO Anthony Katsur.

Every living thing in an aquatic environment depends on plankton. If they die out, the rest of the ocean goes down with them. And if publishers collapse, that would be an “extinction-level event” for digital media, Katsur said.

Many publishers are still managing to stay afloat, but the water is choppy, with traffic falling off the metaphorical cliff and no safety harness in sight.

A life raft for publishers

The IAB Tech Lab’s initiative, currently called the LLM Content Ingest API Initiative (“which we need to rename,” Katsur joked; it’s “a mouthful”) can be broken down into four major components.

The first is access controls, which determine who is allowed to access a publisher’s content in the first place.
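
For context, the bluntest access-control tool publishers have today is robots.txt, which is advisory rather than enforceable – presumably part of what stronger access controls would address. A minimal sketch (GPTBot and CCBot are real crawler user agents; the specific policy is illustrative):

```
# robots.txt – a minimal sketch of crawler access control.
# GPTBot is OpenAI's crawler; CCBot is Common Crawl's.
# Note: robots.txt is advisory only; bots can ignore it.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /archive/

User-agent: *
Allow: /
```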

Once controls are established, access terms come into play, such as licensing models and content tiers. Under the IAB Tech Lab’s guidelines, content would be segregated into tiers based on relevance and value.

“Your archival content from 10 years ago is not worth as much as your late-breaking news or your interview with Taylor Swift,” Katsur said.
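
The spec itself hasn’t been published, so any schema is speculative, but a tiering scheme along those lines might look something like this (tier names and rates are invented for illustration):

```python
# Hypothetical content tiers, priced per 1,000 tokens ingested.
# Names and rates are invented; the IAB Tech Lab has not
# published a pricing schema.
CONTENT_TIERS = {
    "breaking_news": {"max_age_days": 2,    "usd_per_1k_tokens": 0.50},
    "evergreen":     {"max_age_days": 365,  "usd_per_1k_tokens": 0.10},
    "archival":      {"max_age_days": None, "usd_per_1k_tokens": 0.02},
}
```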

The guidelines would also mandate logging the use of content, which Katsur defines as “tracking and recording when and how publisher content is accessed or used by an LLM or AI system,” so publishers can accurately invoice and track usage of their data.
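
A usage-log record along those lines might capture who accessed what, when and for what purpose (all field names here are hypothetical, not part of any published spec):

```python
# Hypothetical usage-log entry for invoicing and auditing.
# Field names are illustrative only.
log_entry = {
    "publisher_id": "pub-1234",
    "content_id": "article-5678",
    "tier": "breaking_news",
    "bot_user_agent": "GPTBot/1.0",
    "accessed_at": "2025-07-15T14:02:11Z",
    "use": "inference",      # vs. "training"
    "tokens_served": 1850,
}
```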

Content logging ties into the final part of the initiative, which Katsur believes is the most important facet: tokenization. Tokenization involves breaking content down into smaller units made up of words, parts of words, punctuation or metadata, Katsur said. These units, called tokens, are used to train LLMs and generate their responses. Publisher content gets tokenized and uniquely assigned to each publisher.

Then, “using the logging and reporting functions that we are proposing,” he explained, publishers can see exactly how the information scraped from their sites is being used.
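
Production models tokenize with subword schemes like byte-pair encoding, but a deliberately minimal Python sketch of the underlying idea – split content into tokens and keep each one attributed to the publisher it came from – might look like this (the attribution scheme is an illustration here, not the Tech Lab’s published design):

```python
import re

def tokenize(text: str) -> list[str]:
    """Naive tokenizer: words and punctuation marks.
    Real LLMs use subword schemes such as byte-pair encoding."""
    return re.findall(r"\w+|[^\w\s]", text)

def attribute_tokens(publisher_id: str, text: str) -> list[tuple[str, str]]:
    """Pair every token with the publisher it was scraped from,
    so downstream logging can track and invoice per use."""
    return [(publisher_id, tok) for tok in tokenize(text)]

tokens = attribute_tokens("pub-1234", "Taylor Swift announced a new tour.")
# [('pub-1234', 'Taylor'), ('pub-1234', 'Swift'), ...]
```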

Tokenization is useful for brands, too, so they can see what is being said about their products and by whom. Many LLMs scrape sites like Reddit, for example, and parrot back what they find as fact – despite the information often being outdated, if not outright incorrect.

As AI continues to make a name for itself in search, a set of guidelines like the LLM Content Ingest API Initiative (looking forward to that new name) is the best way to ensure that query responses are accurate, Katsur said, and that publishers – and with them, the rest of the ad tech ecosystem – continue to thrive.

The big picture

But let’s zoom out.

What actually happens when a bot scrapes a website?

First, it’s important to note that AI isn’t born with limitless knowledge. It has to get that knowledge from somewhere. That’s why AI bots mine websites, which are vast troves of information.

Sometimes, scraping is one-and-done. When a query is for something straightforward, like a chocolate chip cookie recipe, a bot typically won’t need to continue scraping a site for more updated information, Katsur explained, since a cookie recipe doesn’t generally update or evolve. And once an AI model has a good recipe, it can feed it (no pun intended) to the hundreds of thousands of people requesting it.

It’s not guaranteed that after a page is scraped once it never will be scraped again. There is a common misconception “that once an LLM crawls, it stores all the data and never crawls again,” said Katsur. The IAB Tech Lab’s research has shown that crawlers will recrawl content they have already accessed.

Still, getting paid for a handful of additional scrapes of the same page doesn’t come close to matching the scale of the pay-per-visit model that publishers are used to.

With a pay-per-crawl model, a publisher gets paid when a bot pulls information from its site – and that’s basically the end of the story. No matter how many of a generative AI search engine’s users benefit from that information down the line, the publisher only gets paid once per scrape.

Pay per query, on the other hand, is more similar to the way publishers currently drive revenue, and is the model favored by the IAB Tech Lab. “Now you’re getting paid per use,” said Katsur, “which is similar to getting paid per visit.”

“Pay per query scales,” he said. “Pay per crawl does not.”
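
Some rough arithmetic shows why. Suppose a single crawl of an article ends up answering 100,000 user queries (every number below is invented to illustrate the shape of the two models, not an actual rate):

```python
# Illustrative only: rates and query volume are invented.
queries_served = 100_000      # queries answered from one crawled article
pay_per_crawl_rate = 0.01     # USD, paid once at crawl time
pay_per_query_rate = 0.0001   # USD, paid every time the content is used

crawl_revenue = pay_per_crawl_rate                   # $0.01, flat
query_revenue = pay_per_query_rate * queries_served  # $10.00, grows with usage

print(f"pay per crawl: ${crawl_revenue:.2f}")   # pay per crawl: $0.01
print(f"pay per query: ${query_revenue:.2f}")   # pay per query: $10.00
```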

Problem is, even pay per crawl isn’t guaranteed. Plenty of bots are scraping sites without providing any compensation and, technically, that’s allowed – for now.

But that seems to be changing, as more companies develop models that put publisher monetization at the forefront.

Earlier this month, Cloudflare implemented a new pay-per-crawl model that gives publishers free rein over the access they provide to bots. Publishers can grant full access, block all scraping or opt into the new pay-per-crawl model, which requires bots to share payment information so they can be charged for each scrape.
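
Mechanically, Cloudflare’s design revives the long-dormant HTTP 402 “Payment Required” status code: a bot with no payment arrangement gets a 402 and a price, and can retry declaring what it’s willing to pay. A simplified sketch of that handshake (the flow and header names follow Cloudflare’s public announcement as we understand it; treat the details as assumptions):

```python
PRICE_PER_CRAWL_USD = 0.01  # set by the publisher; value invented

def handle_crawl_request(headers: dict) -> tuple[int, dict]:
    """Simplified pay-per-crawl handshake, loosely modeled on
    Cloudflare's announced HTTP 402 design."""
    offered = headers.get("crawler-max-price")
    if offered is not None and float(offered) >= PRICE_PER_CRAWL_USD:
        # Charge the crawler's registered payment method and serve the page.
        return 200, {"crawler-charged": str(PRICE_PER_CRAWL_USD)}
    # No (or insufficient) payment intent: quote the price and refuse.
    return 402, {"crawler-price": str(PRICE_PER_CRAWL_USD)}

status, _ = handle_crawl_request({})                             # 402
status, _ = handle_crawl_request({"crawler-max-price": "0.02"})  # 200
```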

That’s something – although, until this sort of model is widely adopted, publisher traffic is still in serious danger.

But, hey, along with the LLM Content Ingest API Initiative, it’s definitely a start.
