The internet is not a free-for-all—we shouldn't let big tech companies wish copyright out of existence
This week: I just so happened to be listening to Mustafa Suleyman's book, The Coming Wave: AI, Power and the 21st Century's Greatest Dilemma, in which the DeepMind co-founder goes into his thoughts on AI and the "technological revolution" he suggests has already begun.
When a generative AI system creates an image or some text, it all begins with training. Without an understanding of how words are statistically related to one another, or without knowledge of what an image is showing, a generative AI cannot successfully produce either. The image generated by an AI might be a new work in itself, a complete original, though it's influenced by real works, millions of them, owned by millions of people.
How AI companies, or the firms which create datasets used by AI systems, continue to collect data is a source of much contention—an uncomfortable truth hanging over AI's exponential growth. Many AI firms have quietly assumed a position of acting as though they're allowed to use data freely from the web—be it images, videos, or text. Without this justification, they'd be stuck having to actually pay for the content they're using, threatening said growth. Meanwhile, artists, content creators, journalists, bloggers, producers, novelists, coders, developers, musicians, and many more argue that's absolute hogwash.
This split is best exemplified by comments made during a CNBC interview at Aspen Ideas Festival (via The Verge) by the CEO of Microsoft AI, Mustafa Suleyman.
Suleyman is at the centre of AI development today. Not only is he leading Microsoft's AI efforts, he co-founded DeepMind, which was later bought by Google, and drove Google's AI efforts, too. He's had a large part to play in how two of the largest tech firms on the planet deliver their AI systems. I've been listening to the audiobook of Suleyman's book, The Coming Wave, these past few weeks, as he's someone well informed and with a lot to say about how AI has affected, and will affect, our daily lives.
So, I say this with the utmost respect to a pioneer in his field: I believe his idea of a "social contract" for the internet is complete nonsense.
Suleyman, when asked by CNBC's Andrew Ross Sorkin whether AI companies have "effectively stolen the world's IP", had this to say:
"It's a very fair argument. With respect to content that is already on the open web, the social contract of that content since the '90s has been that it is fair use. Anyone can copy it, recreate with it, reproduce with it. That has been freeware, if you like. That's been the understanding."
Except that isn't the understanding. At least not mine, anyways, and if you've been taking content freely from anywhere on the internet this whole time, I have some very bad news for you.
If we ignore the fact that freeware is already a thing and, no, not everything on the internet is freeware—just think of the ramifications for a moment if it were so, especially for Suleyman's own employer, Microsoft—there's further legalese to prevent a free-for-all online.
There's something called copyright, which here in the UK was enshrined into law through the Copyright, Designs and Patents Act 1988. As a journalist, I have to be very conscious of the right I have to use anything on the internet, otherwise I may (rightly) be forced to pay a very large sum of money to the copyright holder.
Let's not get too into the weeds with this (he says, not even halfway through a 2,000 word column), but generally copyright law covers "original literary, dramatic, musical, or artistic works." That includes all manner of text, too, not just novels or short stories, and protection usually lasts for 70 years after the creator's death. The rights are initially assigned to the "first owner" or creator of that work.
Copyright is automatically applied, meaning someone need not register to get it, but only applies to original works.
Some argue the creations of generative AI are original works, and therefore qualify for automatic copyright. To whom you'd grant that automatic copyright is a tricky question. When animals have taken photos of themselves (search 'monkey selfie'), our very human laws haven't quite known what to make of it. We actually ended up with a ruling in 2014 by the United States Copyright Office stating that works by non-humans are not copyrightable. That's despite a human playing a pivotal role in setting up the entire thing, which could have implications for AI-generated art, not least because that same ruling applies similar constraints to works created by a computer.
Whether you own the copyright to the art you prompted through a generative AI system, even finessing those prompts to get it just right, is an ongoing debate. However, US courts currently rule against granting copyright in these instances, and have even barred award-winning artwork from copyright.
But this is a tangent. Let's focus back on the use of copyrighted works for training purposes because clearly copyright has something to do with the mass collection and use of images, videos, and text, without permission, for an AI system likely run by a private business for commercial gain.
Within UK law, the copyright owner (automatically the author or creator, or employer of said author or creator) gets to say who can use its images and how. It's easy to waive your rights to images—I might see you've posted an image of a fun PC mod and message to ask if I have your permission to use it on PC Gamer, for example. If you say yes, providing I give you sufficient attribution for your work, everyone is happy and life moves on.
If I don't ask your permission and subsequently take the image or a "substantial" part of it (which some do, no doubt about that), then upon finding out that I've encroached on your copyright, you could demand I remove the offending material, sue for damages, or even get an injunction banning me from publishing it or repeating the offence.
This has been the case since the act was introduced in the UK in 1988—which I'd add was before the internet was a big deal. Similar protections also exist around the world, including the US and EU.
So there's really no excuse for saying we've all been living in some kind of wild west where anything goes on the internet. It doesn't. AI companies just want that to be the case, and they are fighting to protect their own interests.
There are a few defences for taking copyrighted works without permission in UK law. These mostly come under something called fair dealing. Fair use in the US is a similar concept but different in practice and applicability; as a UK national, it's mostly fair dealing that covers my actions. There are a few versions of fair dealing: one covers reporting of current events, another covers review or criticism, and quotations and parody are also covered. Unless AI is actually a big joke, that last one won't offer much of a defence.
Neither will the rest. They don't cover photographs, for one, which are proactively defended in the law. They also require a user to not take unfair commercial advantage of the copyright owner's works and to use only what's necessary for the defined purpose. They also frequently require sufficient acknowledgement. None of this is the done thing in generative AI.
Suleyman does tend to agree that some publishers have the right to not share their content, and that this right has already been trampled on, as he explained to CNBC (which, by the way, I can quote thanks to fair dealing):
"There's a separate category where a website or a publisher or a news organisation had explicitly said do not scrape or crawl me for any other reason than indexing me so other people can find that content. That's a grey area and I think that's going to work its way through the courts."
"So far, some people have taken that information. I don't know who, who hasn't, but that's going to get litigated and I think that's rightly so."
Except that the one form of content that doesn't generally come under copyright law is actually the news article.
I'm frustrated by the moves from Google and Microsoft to use AI to summarise my articles into little regurgitated bites that threaten to destroy the business of the internet, but I wouldn't want to argue that's copyright infringement in court. It's known as "lifting" a story when you take key information from something published by another and republish it yourself. Providing you don't use the same words and layout—you don't take the piss, basically—it's legally fine to do under existing law.
Plenty of publishers will argue against AI systems on the finer points of these systems and what constitutes lifting and what's just taking without asking and without fair recompense. See the New York Times vs. OpenAI case. I'll leave that to the lawyers. My argument is that, legal or not, an AI summarising stories with no kickback for the people working to create those stories will ultimately do a lot more harm than good in the long run.
Simply put, I don't understand the argument from Suleyman here. Maybe it's a degree of wishful thinking from someone inside the AI inner circle looking out, or maybe he's looking around the internet and seeing some sort of wild west without any rules? But that's not the case, even considering the common exceptions to copyright law we'll get to in a moment.
Copyright infringement happens all the time on the web, and it's a debasement of both our rights as creators to not have our stuff nicked and the value of the content itself. Does that mean we should just lie down, admit defeat, and let an AI system or dataset crawler rewrite the rules so that copyright need not apply to them? I don't think so.
There are some measures coming into place to try to defend copyright in a world obsessed with AI. The EU has introduced the Artificial Intelligence (AI) Act, which includes a transparency requirement for "publishing summaries of copyrighted data used in training" and rules on compliance with EU copyright law, much of which is similar to that of the UK.
The EU does also include some get-outs, though, allowing for data mining of copyrighted works in some instances. One allows the use in research and by cultural heritage institutions, and the other means rights holders can opt out of further use by other organisations, including for machine learning. How exactly one opts out is, uh, not entirely clear.
The UK has something similar in place, as an exception to the 1988 Act, which allows data mining for non-commercial use. This is generally not considered a viable defence for large AI firms with public, and commercial, generative AI systems. The UK Government had also planned another exception, following the sudden popularity of AI systems, though that has since fallen through. That's probably to the benefit of people in the UK, who are technically safe from data mining for commercial purposes, but not to the benefit of the AI firms hoping to scrape data from within the UK's borders.
The exact ways in which companies hope to circumvent these limitations or how these laws look in practice are matters that lawyers, civil servants and politicians will have to debate for years to come. Though generally I just want to make it clear that these arguments exist because of copyright law—not for a lack of it.
By acting as though these rules don't apply to them, and putting pressure on governments to make allowances for AI due to the significant amount of money AI promises to deliver, AI firms have largely gotten away with it to date. Though I'd hold they're mostly running on a strategy of "it's easier to ask for forgiveness than to ask for permission" and have been for a couple of years now. They might continue to get away with it, too. By the time we've got to grips with copyright claims and whether they even exist for AI, will it be possible to untrain AI systems already trained on datasets filled to the brim with copyrighted content? Oops, turns out we can't really do that very well.
"What a pity," the AI exec might say.
It's my take, as a person who creates for the internet and without any claims of being a copyright lawyer, that in the creation of any loopholes for the purposes of data mining we may end up with one rule for big AI firms and another for regular folk like you and me, with the presumed benefits of AI-generated art trained on the hard graft of your own creative work deemed too valuable to human existence to be held back by petty copyright infringement. It could feel like that, or we could hold the AI companies, some worth billions of dollars, to account for the copyrighted content they're benefiting from.
If copyright owners don't manage to fend off AI, what will become of the internet, or "open web", as we know it? Will an artist want to publish anything online? Will social media platforms arise with the promise to be 'AI-proof'? Will the internet become more siloed as a result, split off into smaller communities off the beaten track and away from the prying eyes of Google, Microsoft, and crawlers sent out by dataset companies?
Because, after all, it's not just my words in an article that an AI might look fondly upon, or even someone's publicly published artwork, but perhaps your wedding photos or your smutty fanfiction. And what then of that AI-generated advert for something you don't agree with, one that looks like you or sounds like you? What of your ability to get it removed or your likeness untrained? Now, that sounds a lot more like the pandemonium free-for-all that Suleyman believes has been happening this entire time.