Some believe that generative AI training is copyright infringement. We look at fair use to see if that’s the case.
Note: This does not constitute legal advice. It’s just someone reading into resources and coming to personal conclusions.
There’s been a long-standing, and incorrect, belief that generative AI training is copyright infringement. This has led to some bizarre interpretations, such as the claim that reading is copyright infringement, or that watching streams or videos is copyright infringement. What amazes me is that both journalists for major news websites and lawyers alike have come to those conclusions simply because AI is involved. I’ve expressed skepticism about both arguments, but I thought I’d dive deeper into US fair use doctrine to better showcase why.
Generally speaking, fair use is considered an exception to copying and reusing copyrighted works.
For instance, when a news reporter uses a company’s logo for illustrative purposes in a report, that journalist is not in violation of copyright law because there is a fair use defence. In that case, the work is being reused for journalistic purposes.
Another example is a professor using a clip from, say, a movie to discuss a particular topic and to highlight some of their points. Again, the clip was used for educational purposes, and the professor is not liable for copyright infringement for the sole act of showing it (there is some weirdness when it comes to the use of DRM in the movie, but I digress).
In each instance, a portion of the original work is used in the new piece of content being created. The company logo is being used to create a journalistic piece. That movie clip is being used to create a lecture. For those familiar with copyright and fair use, there is nothing particularly earth-shattering to see here.
For those in the legal profession, this next part should be very familiar if you specialize in copyright law in any way. There are four factors used to determine whether a use of a copyrighted work qualifies as fair use. Those factors are as follows:
Factor 1: The Purpose and Character of the Use
Factor 2: The Nature of the Copyrighted Work
Factor 3: The Amount and Substantiality of the Portion Used
Factor 4: The Effect of the Use on the Potential Market for or Value of the Work
A more detailed description of each factor can be found on the Columbia University website. So, let’s talk about each factor as it applies to general generative AI.
Factor 1: The Purpose and Character of the Use
This can be a little murky because the purpose isn’t explicitly explored for this particular case. Here, we are talking about an AI training on this kind of content, which the university, very understandably, didn’t address. Of course, as mentioned before, fair use generally assumes the presence of the original work. With that in mind, we can note that the original work isn’t even present in the output of a generative AI. Instead, the AI learned facts (which do not fall under copyright), general writing styles (also not under copyright), or even how a video is supposed to look (also not protected by copyright). From that, it creates an entirely different work.
This leads to the character of the use of the work. A court is going to look at how transformative a work is in the first place. Is it a summary of the original work? How much has that work changed? Is it an entirely new work? Generally speaking, because none of the original content is found in the output of a generative AI program, the output is highly transformative: it is an entirely new work altogether.
As a result, there is a very strong argument for factor 1.
Factor 2: The Nature of the Copyrighted Work
It’s hard to really assess this side of things because the original works are wide-ranging, spanning everything from fiction to non-fiction. What I will say, however, is that because the original work isn’t even present, there is at least a case to be made here, even if this factor largely doesn’t apply.
Factor 3: The Amount and Substantiality of the Portion Used
This is probably the strongest element of a fair use defence, and it is something I have already alluded to a couple of times now. While there is no set percentage or proportion of a work, this factor presumes that a portion of the original work is present in the output of the AI. The thing is, absolutely none of the original work can be found in the output of your standard generative AI program.
Now, there is the argument that the “heart” of the work is present. To put it one way, if the portion of the original work used is substantial enough to make it pointless to consume the original afterwards, then an argument can be made that an excessive amount of the original has been used in the new work. Once again, though, since none of the original work is present in the output, that generally doesn’t apply.
So, for instance, if an original work said something like, “An election was held on November 3, 2020. Joe Biden won that election and became president at that time.”, and a generative AI created an output that read something like, “On November 3, 2020, an election was held. Joe Biden won that election and became president.”, one might mistakenly believe that the heart of the original work was taken and that this is, therefore, copyright infringement. The problem with that argument is that the original work is stating a fact. The AI output, in turn, is also stating a fact. Facts cannot be copyrighted. Likewise, if a generative AI read facts off of a news article, reworded them, and simply restated those facts, there really is no copyright infringement case to be made.
Ultimately, the output features none of the original work in question. As such, there is a pretty airtight case to be made that generative AI very easily satisfies the third factor.
Factor 4: The Effect of the Use on the Potential Market for or Value of the Work
This factor is probably best assessed on a case-by-case basis, but generally speaking, for a number of generative AI systems, a good case can be made that the fourth factor weighs in favour of the output. For example, if a video game maker produces a video game and puts it up for sale, and a generative AI produces a 20-second clip that sort of looks like the original game, would consumers generally conclude that there is no point in buying that video game because that 20-second clip was generated? A strong case can be made that the answer is “no”. A 20-second video clip is by no means replacing the original product.
Likewise, if a short summary is made of a news article, and two or three sentences could replace the entire article, what kind of article is the original anyway? If anything, an argument could be made that the summary is an advertisement encouraging people to click through and read the original article. In fact, journalists and media companies use summaries all the time when sharing articles on social media, precisely to get clicks in the first place. Generally speaking, a brief summary in and of itself does not replace the original work in the greater market.
The Case for Fair Use
So, in walking our way through the factors of fair use, you can see a general pattern of how the output of an AI would qualify for fair use. Generally speaking, the output is transformative and doesn’t actually reproduce the original work at all. Instead, what the AI is generally doing is creating a whole new work after learning how to write effectively. The output might restate facts, but facts aren’t copyrightable. It might create summaries, but again, that’s not really infringing on copyright.
AI Output Is Not Protected By Copyright
I decided to include this in the article because it adds some interesting context. Specifically, the question is this: “OK, on the flip side, is what the AI is producing also protected by copyright?” This is actually an area that the courts have looked at. The conclusion? Absolutely not. Back in 2022, there was an effort by an AI creator to have AI output protected by both patents and copyright. The result? The courts repeatedly rejected those claims. From TechDirt:
Stephen Thaler is a man on a mission. It’s not a very good mission, but it’s a mission. He created something called DABUS (Device for the Autonomous Bootstrapping of Unified Sentience) and claims that it’s creating things, for which he has tried to file for patents and copyrights around the globe, with his mission being to have DABUS named as the inventor or author. This is dumb for many reasons. The purpose of copyright and patents are to incentivize the creation of these things, by providing to the inventor or author a limited time monopoly, allowing them to, in theory, use that monopoly to make some money, thereby making the entire inventing/authoring process worthwhile. An AI doesn’t need such an incentive. And this is why patents and copyright only are given to persons and not animals or AI.
But Thaler has spent years trying to get patent offices around the world to give DABUS a patent. I’ll just note here that if Thaler is correct, then it seems to me that he shouldn’t be able to do this at all, as it’s not his invention to patent. It’s DABUS’s. And unless DABUS has hired Thaler to seek a patent, it’s a little unclear to me why Thaler has any say here.
Either way, Thaler’s somewhat quixotic quest continues to fail. The EU Patent Office rejected his application. The Australian patent office similarly rejected his request. In that case, a court sided with Thaler after he sued the Australian patent office, and said that his AI could be named as an inventor, but thankfully an appeals court set aside that ruling a few months ago. In the US, Thaler/DABUS keeps on losing as well. Last fall, he lost in court as he tried to overturn the USPTO ruling, and then earlier this year, the US Copyright Office also rejected his copyright attempt (something it has done a few times before). In June, he sued the Copyright Office over this, which seems like a long shot.
And now, he’s also lost his appeal of the ruling in the patent case. CAFC, the Court of Appeals for the Federal Circuit — the appeals court that handles all patent appeals — has rejected Thaler’s request just like basically every other patent and copyright office, and nearly all courts.
So, if a work is not protected by copyright, then it technically falls into the public domain. This means that if someone really likes the output of a particular AI and reuses it wholesale, they are free to do so. After all, there is no copyright protection for a work generated by AI. As far as I’m concerned, that’s great news, since it will actually help humans generate new works; the public domain has been stagnating for some time thanks to the repeated extensions of copyright terms.
This alone should help people panic a lot less about AI creating content in the first place.
Conclusions
What you see above is a huge reason why I’m skeptical that a claim can be made that what AI is producing is copyright infringement. The original work is not really present in these AI outputs. Yes, the AI is able to take inspiration from original copyrighted works, but being inspired in and of itself is not an act of copyright infringement (nor should it be). Facts cannot be copyrighted, either, so rewording a series of facts is also not an act of copyright infringement. What’s more, if it’s any comfort, the output of AI is also not protected by copyright because the work was not generated by a human in the first place.
As a result of all of this, it’s very easy to see the argument that the output falls well within the bounds of fair use. The argument that it’s transformative is extremely strong as far as I can tell.
For those who dislike AI and think copyright will save them, well, I’ve got bad news. As far as I can tell, people who use copyright to fight generative AI should expect to run into a legal dead end. Can laws be changed to say that generative AI programs need license agreements before reading or examining copyrighted works? Sure. Has that happened? No. A lawsuit filed against large language models (LLMs) has to work with existing law and precedent, and the way existing law works, it’s not looking good for those filing the complaints.