šŸ‘Øā€šŸ’»"> “the secret list of websites” – Chris Coyier
Nothing Special   »   [go: up one dir, main page]

“the secret list of websites”

The Washington Post does research to figure out which websites were used to train Google’s AI model:

To look inside this black box, we analyzed Google's C4 data set, a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs, called large language models, including Google's T5 and Facebook's LLaMA.

Inside the secret list of websites that make AI like ChatGPT sound smart

My largest corpus of writing to date is on the web at css-tricks.com (along with many other writers), so naturally, I’m interested in seeing if it was used. (Plus, I’ve been seeing people post their rank as a weird nerd flex, so I’m following suit.)

CSS-Tricks.com ranks 8,182nd among the websites used to train Google's AI model.

Despite Google's employees' serious misgivings (just little stuff like the information presented leading to "serious injury or death"), Google has publicly launched its Bard tool and is very serious about investing in AI.

Me, I just think it’s fuckin’ rude.

Google is a portal to the web. Google is an amazing tool for finding relevant websites to go to. That was useful when it was made, and it has done nothing but grow in usefulness. Google should be encouraging and fighting for the open web. But now they're like, actually we're just going to suck up your website, put it in a blender with all other websites, and spit out word smoothies for people instead of sending them to your website.

And while doing that, they aren’t:

  • Telling authors their content is being used to train models
  • Telling users where the output came from
  • Offering any measure of how reliable or correct the output is

So, I’m critical. It’s irresponsible.

But I'm not a neo-Luddite or whatever on this. It's all certainly interesting. I like that these tools are almost immediately useful and overflowing with use cases. Heck, I needed a quick CSS rainbow gradient the other day, and the output from Bard was quick and useful. I'm a paying GitHub Copilot customer and I'm 100% sure it makes me a faster and better coder. I'm nervous about lots of things related to (massive air quotes) "AI" but I'm hopeful it can do some good.
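For the record, the gradient was something like this. This is a from-memory sketch, not Bard's exact output:

    /* rainbow gradient sketch (hypothetical, not Bard's actual output) */
    .rainbow {
      background: linear-gradient(
        to right,
        red, orange, yellow, green, blue, indigo, violet
      );
    }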

On being critical though, here’s Manuel Moreale:

… I do enjoy reading news and discussions when politics and technology are both involved. I especially enjoy reading people’s perspectives on these topics. One thing I’m noticing more and more though, is that most people are quick to point out what’s wrong about something, but almost never offer solutions or alternatives.

And that is because complaining or pointing fingers is the easy part. Figuring out alternatives is hard.

Criticising is the easy part

So here’s what I’d like to see done:

  • Stop firing ethics people. What is it, three times now?
  • Be very open about what content a model is trained on, and at least allow people to opt out. Better yet, opt in.
  • Credit and link to the sources directly in the output where possible.
  • Operate this part of the business as profit-neutral.

🤘

CodePen

I work on CodePen! I'd highly suggest you have a PRO account on CodePen, as it buys you private Pens, media uploads, realtime collaboration, and more.

Get CodePen PRO

4 responses to ““the secret list of websites””

  1. Matt says:

    I particularly like the idea of crediting sources where it's feasible. You shouldn't have to rely on digging from the Washington Post to see if your work was used for training after the fact. Bit too much of a haveibeenpwned.com vibe for my comfort.

  2. Seirdy says:

    I added an entry to my robots.txt to block ChatGPT's crawler (a sketch follows below), but blocking crawling isn't the same as blocking indexing; it looks like Google chose to use the Common Crawl for this and sidestep the need to do crawling of its own. That's a strange decision; after all, Google has a much larger proprietary index at its disposal.

    A "secret list of websites" was an ironic choice of words, given that this originates from the Common Crawl. It's sad to see Common Crawl (ab)used for this, but I suppose we should have seen it coming.

    I know Google tells authors how to qualify/disqualify from rich results, but I don't see any docs for opting a site out of LLM/Bard training.

    (POSSE note from https://seirdy.one/notes/2023/04/21/opting-out-of-llm-indexing/)
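    For reference, the entry is roughly this (ChatGPT-User is the user-agent token OpenAI documents for ChatGPT's browsing; a robots.txt rule only helps if the crawler honors it):

        # robots.txt sketch; assumes the crawler respects OpenAI's documented ChatGPT-User token
        User-agent: ChatGPT-User
        Disallow: /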

  3. FWIW my understanding of the C4 model is that it’s not Google’s, but an independent crawler foundation’s: https://commoncrawl.org/

    (Made this mistake myself after reading the Post’s article and then had to correct)

    Or am I missing some secret affiliation?

  4. magicsofa says:

    Manuel Moreale has a point when it comes to things that NEED solutions. I just feel that having bots digest and regurgitate all of our content doesn’t solve any problems. I don’t think making you a “better and faster coder” is a pressing problem either, considering that the AI solution is actually just creating a dependency rather than making you, as a lone person, better. If you learn from it, fine, but that’s YOU learning from the code in front of you, regardless of how it was spawned. Additionally, the resource cost of generating your rainbow CSS must be greater than the resource cost of you spending five or ten extra minutes figuring it out yourself?

    Imagine you tell someone that monopolies degrade free market forces, and then they respond with “OK but what are the alternatives?” And you say, obviously, no more monopolies. To which they reply “That’s not a solution!” As if there is some need for mega-corporations to be able to control the market…
