AI: The Data Problem

AI is a complex and contentious topic. In some respects, it has the capacity to improve our lives. At the same time, it can harm (and arguably has harmed) many of the people it’s ostensibly supposed to help. The online discourse around it is heavily polarized, and discussions on social media often become heated arguments with little to no resolution. The subject is a battleground, and in many cases we seem to eschew the nuance we need to have meaningful discussions about it.

Technology has fascinated me since I was a child. Human innovation changes the world for better or worse, and that fact is both awesome and horrifying to me. The pace at which we iterate on technology seems exponential. At the turn of the millennium, after we’d survived Y2K, I couldn’t have imagined how far we would come in 25 years. Back then, the idea that a machine could react to human input in any way that mimicked human intelligence, let alone produce visual art or music, was basically something out of Star Trek. It was a fantasy that I yearned for: the idea that one could hold a conversation with a computer to bounce ideas off of it, or collaborate with it to create something beyond our own ability. Now that it’s more reality than fantasy, I find myself torn.

The technology that makes AI possible requires more information than many people realize. Companies like OpenAI train their products on staggering amounts of data scraped from millions of sources across the internet, including text, images, audio, and video. AI could not hold human-like conversations if it didn’t have a large number of examples of how humans talk and respond to one another. ChatGPT could not summarize Moby Dick if it didn’t have access to that work, or to summaries of it. Midjourney could not create an accurate image of Mickey Mouse if it didn’t have many examples of what Mickey Mouse looks like. My point is that, like us, AI needs something to learn from in order to produce something useful or accurate.

This is where things get ethically murky, if not downright dark.

It’s no secret that AI companies play fast and loose when it comes to gathering data to train their products. While we don’t know exactly how many bots are scraping the web for the purposes of training AI, we do know that a lot of them do their job aggressively. This is why I can ask ChatGPT who ‘Jedai Saboteur’ is or what the website ‘Paradox Inversion’ is about and get a fairly accurate response. It’s why, with the right prompt (and sometimes a workaround), AI can rewrite a section of prose in the style of a prolific author. Or, create art in the style of a production company such as Studio Ghibli. Or, make a video of cats jumping off of diving boards in the Olympic games.

You get what I’m saying.

These facts, in and of themselves, aren’t what I and many others take issue with. After all, we as humans could not summarize a book or website, or recreate a piece of art in a certain style, unless we had examples of those works. The issue lies in how the data is obtained, what it’s used for, and whether or not the companies profiting from it are crediting or compensating the creators of the work that trained their products.

I use the word ‘products’ here because the vast majority of people use AI through one of many companies that either focus mainly on AI products (such as OpenAI, Anthropic, Midjourney, Suno, etc.) or use AI to enhance their existing products (such as Google, Facebook, TikTok, Microsoft, Apple, etc.). While these companies use AI to provide us a service, they do so with the expectation that the service returns a profit for them. That’s basic business, and I’m not arguing that entities providing a service must do so for free. What I am arguing, though, is that these companies (some more so than others) are profiting from data they technically have no right to profit from, while failing or outright refusing to compensate the sources from which their training data comes.

Returning to my previous point: if an AI training set doesn’t include, say, Stephen King’s works, it would be unable to produce works in his style. As far as I understand, he’s not vocally opposed to his works being used in this way without his explicit permission, and he’s not alone in his sentiment. The opposite is true for many creatives as well— they either know or suspect their work has been utilized without their permission for profit, and they are justifiably upset. There is a mix of high and low profile creators in both camps, encompassing people who make millions, people who scrape by, and those somewhere in the middle.

Some of AI’s strongest proponents argue that this technology ‘democratizes’ art, giving people access to a tool that lets them create works they otherwise could not create on their own: a drawing, for instance, that they don’t have the skill, time, or resources to draw (or the money to commission from an artist who can). To them, AI breaks down a barrier they themselves are unable or perhaps unwilling to break down. It is, to them, a tool and little more.

On the other hand, those in opposition argue that the technology is one that facilitates infringement upon the work they have poured themselves into creating. Again, AI is incapable of producing output without vast amounts of input, and that input includes potentially millions of works that were never intended to be used as training data for AI products. Their particular reasons for opposition vary, but some of the most common come back to the point that they were never asked, compensated, or credited in any way for their work being used for these products. They feel their work has been stolen, and in many cases, it has.

Speaking in terms of American copyright, the creator of a work holds the copyright to it from the time of creation. If you write a novel, for instance, you own the copyright to it. This generally extends to what you create and put out into the world via the internet, though it should be noted that the vast majority of online platforms require you to grant them a license to your work if you post it. This isn’t particularly insidious in my opinion. It’s what’s required to allow them to host what you share, so other people can see it. AI training sets are somewhat of a different beast, however, because in innumerable cases there is no agreement between the creators of the scraped works and the companies doing the scraping. It just happens, and there’s little recourse for the creator. In many cases, creators may not even be aware their work has been used. Companies like OpenAI have been (and likely will continue to be) sued over copyright infringement as it relates to training their AI models. OpenAI itself has stated that it needs access to these copyrighted works to provide a competitive product. Meta has been outed for training its LLaMA model on data from sources that pirate and host copyrighted works by published authors (and Meta has been sued for it).

Many AI companies’ training data is proprietary, meaning people outside of those companies have no access to it. That means the average person, and even the creators of the work on which these companies train their models, have no way to confirm that a given work has been used. This makes it even harder for creators, especially lesser-known ones, to take action when their work is infringed upon. Because of the proprietary nature of these training sets, the companies that own them have no obligation to grant credit or compensation as things currently stand. Some large entities such as Disney, whose intellectual property is distinct and well known, have a stronger ability to take action when an image generator can perfectly recreate Mickey Mouse, but a self-published writer with few works and much shallower pockets has a much harder time proving any wrongdoing or taking these companies to court.

I like the idea, the dream, I had of what AI could be. What it’s turning out to be is something else entirely, at least in the creative sphere. This isn’t a machine running millions of simulations and happening to fall upon certain results. It’s based on data that was created by people, most of whom could not have imagined that their works, whatever those works may be, would be used by companies with billions of dollars in profit and backing to turn a further profit without even telling them their work would be used. The word that comes to mind when I think of how the technology is maturing is ‘predatory’. What worries me most is how quickly so many have come to regard some of these practices as ‘fine’ or ‘necessary’ simply because they are prevalent.

What all of this has shown me is that an uncomfortable number of people either don’t understand how copyright works, don’t care, or believe that copyright only exists to protect large corporations and entities. To that last point, I can understand that perception. When some relatively unknown creator fights to protect the right to their work, it doesn’t usually make the news. When Disney does, it makes headlines, so I can absolutely see where this sentiment comes from. I still find it misguided, and in the end it benefits the same multibillion-dollar entities that these people take issue with. Some folks might think that without copyright, they stand a chance to compete with these major brands, but they overlook the fact that the copyrights they hold might be the only thing preventing such brands from stealing their greatest works and marketing them as their own. Even now, with copyright and intellectual property laws in place, it still happens.

What also deeply worries me are some of the analogies drawn by proponents of AI. From what I’ve been passionately told when criticizing elements of AI on social media, if I have a problem with it, I should also have a problem with cameras, image editing suites, digital audio workstations, autocorrect; the list goes on. None of these analogies addresses the scale at which these companies scrape data they technically have no right to, and on their own, these technologies don’t rely on billions of data points to function. These tools have worked without requiring copyright infringement on a massive scale, even though AI has been adopted within them in recent years.

I don’t believe the underlying technology of AI is outright evil or predatory. I believe there are ways to apply the technology to improve our lives as a whole. The scientific potential alone is greater than I may be able to conceptualize. There is high potential for AI to augment our creative processes and allow us to focus on creating the best work possible. I don’t have a problem with having a technology that acts as a collaborator, an entertainer, or even something to pass the time with. What I do have a problem with is what rights have been trampled to get us here, and how it seems that we may be incapable of reining in the beast that’s been unleashed.