The Dark Side of AI: Uncovering the Controversial Practices Fueling Artificial Intelligence

The AI data land grab: How tech giants are amassing staggering amounts of data to fuel their AI models, and what it means for our online privacy and security.
Photo by Adam Birkett on Unsplash

As I delve into the world of artificial intelligence, I am struck by the sheer scale and implications of the historic data land grab happening in the AI sector. According to researcher Kate Crawford, AI is the largest superstructure ever built by humans, requiring immense human labor, natural resources, and staggering amounts of data. But how are tech giants like Meta and Google amassing this data?

The AI data land grab: A race to the bottom

It’s increasingly clear that we’re in the middle of a historic land grab for all the data that has ever been created by humanity. So where is all this data coming from, and how are these companies getting access to it? Well, first, they’re clearly scraping the public internet. It’s safe to say that if anything you’ve done has been posted publicly to the internet, it’s inside the training data of at least one of these models.

“AI is the largest superstructure that our species has ever built.” - Kate Crawford

But this scraping also probably includes a large amount of copyrighted data, or data that isn’t necessarily publicly available. They’re probably getting behind paywalls too, as we’ll find out soon enough as the New York Times lawsuit against OpenAI works its way through the system. And according to the New York Times, Google discovered that OpenAI was scraping YouTube but didn’t reveal it or push back publicly, because Google was scraping all of YouTube itself and didn’t want that getting out.
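To make concrete what “scraping the public internet” looks like in practice, here is a minimal, illustrative sketch of a single-page text scraper using only Python’s standard library. It is not any company’s actual pipeline: the URL is a placeholder, real training-data crawlers operate across billions of pages, and whether they honor robots.txt at all is precisely what is in dispute.

```python
# Illustrative sketch only: fetch one public page and extract its visible
# text, the raw material that ends up in training corpora at massive scale.
import urllib.request
import urllib.robotparser
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> blocks."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def scrape_page(url: str) -> str:
    """Return the visible text of one public page, if robots.txt allows it."""
    parts = url.split("/", 3)  # ["https:", "", "example.com", ...]
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(f"{parts[0]}//{parts[2]}/robots.txt")
    robots.read()
    if not robots.can_fetch("*", url):
        return ""  # a polite crawler stops here; not every crawler is polite
    with urllib.request.urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)

if __name__ == "__main__":
    # example.com is a placeholder standing in for any public page
    print(scrape_page("https://example.com")[:500])
```

Multiply this loop by billions of URLs and a fleet of crawlers, and you have the basic mechanics of the data land grab described above.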

Data brokers: The middlemen of the AI data land grab

Second, all of these companies are purchasing or licensing data. This includes licensing news through agreements with publishers, buying data from data brokers, and acquiring, or striking deals with, companies that hold rich data sets. Meta, for example, was considering buying the publisher Simon & Schuster just for access to its copyrighted books to train its LLM.

The companies that already hold rich data sets themselves are obviously at an advantage here, and in particular that means Meta and Google. Meta uses all the public data that has ever been entered into its systems. And Meta has said that even if you don’t use its products at all, your data could still be in its systems, either because it was purchased from outside sources or because you simply appeared in, say, an Instagram photo, in which case your face is now being used to train its AI.

The AI data collection machine: A never-ending cycle

So where does this all leave us, citizens and users of the internet? Well, one thing is clear: we can’t opt out of this data collection and data use. The opt-out tool Meta provides is hidden and complicated to use, and it requires you to supply proof that your data has been used to train Meta’s AI systems before the company will consider removing it from its data sets. These are not the kind of user tools we should expect in democratic societies.

Regulation: The only way forward for AI accountability

So it’s pretty clear that we’re going to need to do three things. First, we’re going to need to scale up our journalism. This is exactly why we have investigative journalism: to hold the powerful governments, actors, and corporations in our society to account. Journalism needs to dig deep into who is collecting what data, how these models are being trained, and how they’re being built on data collected from our lives and our online experiences.

Second, the lawsuits are going to need to work their way through the system, and the discovery that comes with them should be revealing. The New York Times’ lawsuit, to take just one of many against OpenAI, will surely reveal whether paywalled journalism sits within the training data of these AI systems.

And finally, there is absolutely no doubt that we need regulation to provide transparency and accountability around the data collection that is driving AI. Meta recently announced, for example, that it was going to use data it had collected on EU citizens to train its LLM. Immediately after the Irish Data Protection Commission pushed back, Meta announced it would pause this activity. This is why we need regulation: people who live in countries or jurisdictions with strong data protection rules and AI transparency regimes will ultimately be better protected.