Mainstream media just keeps digging as they try another really bad idea: suing OpenAI for copyright infringement.
Are you reading this news article right now? Are you comprehending the words on your screen? If so, as the legal theory goes, you are infringing on my copyrighted work, “scraping” my content, and circumventing my technological protection measures. As a result, it is well within my rights to sue you for tens of thousands of dollars for every article you happened to read, even though I freely published my work for all to see. I’d also be entitled to attorney’s fees.
It doesn’t really make a whole lot of sense, does it? Since when is reading a freely posted news article “copyright infringement”? Yet, that is exactly the kind of legal argument that the mainstream media is putting forward in a copyright infringement lawsuit against OpenAI. The only difference here is that it’s not a human that’s doing the reading and comprehending, but rather, a Large Language Model (LLM). As far as I’m concerned, that… really doesn’t change anything.
The press release posted by the mainstream media really is a sight to behold. Here’s what it reads, in part:
News media companies invest hundreds of millions of dollars into reporting Canadians’ critical stories, undertaking investigations and original reporting, and distributing media in both official languages in every province and territory across this country. The content that Canadian news media companies produce is fact-checked, sourced and reliable, producing trusted news and information by, for, and about Canadians. This requires significant investment, and the content produced by news media companies is protected by copyright.
News media companies welcome technological innovations. However, all participants must follow the law, and any use of intellectual property must be on fair terms.
OpenAI regularly breaches copyright and online terms of use by scraping large swaths of content from Canadian media to help develop its products, such as ChatGPT. OpenAI is capitalizing and profiting from the use of this content, without getting permission or compensating content owners.
OpenAI’s public statements that it is somehow fair or in the public interest for them to use other companies’ intellectual property for their own commercial gain is wrong. Journalism is in the public interest. OpenAI using other companies’ journalism for their own commercial gain is not. It’s illegal.
First of all, OpenAI’s actual response was that its use constitutes “fair dealing”, a specific legal doctrine. Reducing that argument to a claim that the use is somehow “fair” is misleading. Honestly, though, when it comes to the mainstream media lying to or misleading the public on various technology issues, that sort of thing has become fairly routine in recent years, so this misleading statement is nothing new.
As for the rest of the press release, I saw what was written and looked at it sideways. It doesn’t really make a whole lot of sense to me because I’m struggling to find the part where OpenAI has engaged in activity that “breaches copyright”. The press release isn’t all that clear. If OpenAI is simply reading freely distributed news content, that’s not an act of copyright infringement. That easily falls within fair dealing. Heck, regurgitating samples collected by reading the works also falls within the realm of fair dealing. I mean, I just did that above by copying a sample from the press release. Legally speaking, that’s not copyright infringement.
So, I decided to read the lawsuit (PDF) itself thinking that the press release simply glossed over the actual point that rises to the level of copyright infringement. As it turns out, the mainstream media really is legally trying to argue that reading is copyright infringement:
4. The Defendants (collectively, “OpenAI”) are the creators, proponents, and operators of a series of artificial intelligence (“AI”) products, including their “Generative Pre-trained Transformer” (“GPT”) models, commercialized under the product name ChatGPT. ChatGPT is a mass-marketed large language model designed to provide natural-sounding text responses to user prompts, in a manner that mirrors human communication. OpenAI’s GPT models work by using pattern recognition developed through the analysis of enormous quantities of text data.
5. To obtain the significant quantities of text data needed to develop their GPT models, OpenAI deliberately “scrapes” (i.e., accesses and copies) content from the News Media Companies’ websites, web-based applications, and/or the websites of their Third Party Partners (defined below). It then uses that proprietary content to develop its GPT models, without consent or authorization. OpenAI also augments its models on an ongoing basis by accessing, copying, and/or scraping the News Media Companies’ content in response to user prompts.
6. OpenAI has taken large swaths of valuable work, indiscriminately and without regard for copyright protection or the contractual Terms of Use applicable to the misappropriated content. The misappropriated content includes works that the News Media Companies own or exclusively license (the “Owned Works”) as well as works that they non-exclusively license from other third parties (the “Licensed Works”) (together, the “Works”). Through its conduct, OpenAI has and continues to:
(a) Infringe, authorize, and/or induce the infringement of the News Media Companies’ copyright in its Owned Works;
(b) Circumvent the technological protection measures employed by the News Media Companies and/or their Third Party Partners to protect the Works from unauthorized access; and,
(c) Breach the Terms of Use of the News Media Companies’ Websites.
Um, am I being punked here? What’s presented here does not show copyright infringement. It shows an LLM reading material. Yes, wording like using the content to “train” might add complexity to the situation, but “training” really is just a term for a machine reading material. Unless there is a wildly different interpretation here, the same could be said for the term “scrape” in this situation. If a web service scrapes paywalled content, then posts it up for free afterwards, then you would have copyright infringement. Yet, what is likely being done here is that OpenAI is simply “scraping” (AKA reading) the material and using it to learn how to put words together. Again, I’m not really seeing the copyright infringement here. They aren’t freely redistributing whole articles.
Some of the sites mentioned in the claim do have paywalled material (such as the Globe and Mail). If the paywalled material is being read, and OpenAI paid for access to that material in the first place, then it’s rather hard to find any claim for copyright infringement. When you pay for the material, you pay for a license to read the material. That’s… what normal humans do.
Now, moving on to the claim of circumventing a technological protection measure (TPM). Nothing in that sample of text leading up to those points rises to the level of circumventing a TPM. The claim above only says that the LLM was reading the news media material. In order to claim that a TPM was circumvented, there would have to be an act that actively breaks said TPM. That was not spelled out there in any way. It was just thrown in seemingly for shiggles.
The only thing I can see as even remotely plausible here is breaking the terms of use for the news sites. If the websites in question quietly put into their terms and conditions that the content appearing on their site shall not be used for training an AI (or something along those lines) and OpenAI unknowingly breached that, it’s possible there’s a case here. The problem for the mainstream media companies, however, is that damages for breach of contract are generally less than statutory damages for copyright infringement. So, if they are hoping for a huge payday, they’ll likely be disappointed there.
The funny thing is that the mainstream media argues that because OpenAI made money at some point, that makes them the bad guys here:
9. OpenAI has capitalized on the commercial success of its GPT models, building an expansive suite of GPT-based products and services, and raising significant capital—all without obtaining a valid licence from any of the News Media Companies. In doing so, OpenAI has been substantially and unjustly enriched to the detriment of the News Media Companies.
10. The News Media Companies accordingly seek damages and/or disgorgement to compensate them for the wrongful misappropriation of their Works, as well as permanent injunctive relief to prevent OpenAI from continuing with its unlawful conduct.
I… hate to break it to the mainstream media, but people make money while utilizing fair dealing all the time. News media companies themselves use things like logos from companies they are reporting on in their reports and broadcasts. What’s more, those media companies (gasp) made money off of that as well. You know what? That use is legal. It’s called “fair dealing”. In that example, the logo is reused without express authorization for the purpose of news reporting. The use is transformative: instead of simply distributing the logo in question, it is being used for illustrative purposes to create a new work. In short, making money off of a work that relies on fair dealing is not only legal, but common practice in multiple industries and sectors.
So, I decided to dig further into the lawsuit to see if I can get any insight that changes my mind on any of this. As I dug deeper, I found this:
31. Commencing at different times, each of the News Media Companies have also employed web-based exclusion protocols on their respective Websites, such as the Robot Exclusion Protocol (i.e., robots.txt), which is a standard used by websites to prevent the unauthorized scraping of data from the entirety or designated portions of a website. These exclusion protocols and account and subscription-based restrictions all serve to prevent unauthorized access to their Works.
Um… what?
OK, let’s back this up here. First of all, last I checked, robots.txt is not legally binding. It is simply a document that basically says “please”. Even if you put in a specific rule for OpenAI to not read your material, and OpenAI read that material anyway, sure you could be mad about it, but it’s not grounds for a lawsuit.
Second, robots.txt is generally meant for search engines like Google. If you don’t want your website to appear on Google, you can use robots.txt to be excluded from indexing. That’s generally what most web developers use robots.txt for.
Third, OpenAI itself says that it uses robots.txt and even gives detailed instructions on what tags to use if you don’t want OpenAI to crawl your site if you really are freaked out about it. The worst case scenario is that OpenAI crawled something it shouldn’t have. In that case, you can ask to have that data removed from its dataset. Otherwise, it’s entirely possible that the websites in question didn’t properly implement their robots.txt.
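As a sketch of what that opt-out looks like in practice: OpenAI documents a “GPTBot” user agent for its crawler, and a site that doesn’t want it visiting can publish rules along these lines in its robots.txt (the blanket disallow here is just an illustration, not any particular site’s actual file):

```
User-agent: GPTBot
Disallow: /
```

Again, this is a request, not a barrier; it only works if the crawler reads the file and chooses to honour it.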
Fourth, as argued further down in the lawsuit, not abiding by robots.txt is not circumventing a TPM. You really have to contort reality to come up with that conclusion. Simply put, robots.txt doesn’t encrypt anything, nor could you call it an “effective” protection measure. It’s the equivalent of putting $20 in a clear glass jar in the middle of a public space, taping a note onto it saying, “not for use of strangers”, and calling that high level security. It’s extremely silly for anyone to declare otherwise.
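The advisory nature of robots.txt is visible in how crawlers actually process it: the rules are parsed client-side, and nothing enforces them. A minimal sketch using Python’s standard-library parser (the “GPTBot” and “OtherBot” names and the example URL are just placeholders):

```python
from urllib import robotparser

# Hypothetical robots.txt rules for illustration: the site asks one
# crawler ("GPTBot") to stay out entirely, and says nothing about others.
rules = [
    "User-agent: GPTBot",
    "Disallow: /",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Nothing here *prevents* access; a crawler has to voluntarily fetch
# the file, check the rules, and decide to honour them.
print(rp.can_fetch("GPTBot", "https://example.com/article"))    # False
print(rp.can_fetch("OtherBot", "https://example.com/article"))  # True
```

In other words, the “protection” consists entirely of the visitor asking itself for permission, which is a long way from any measure that “effectively” controls access.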
Another nugget I was able to find in the lawsuit is this:
69. OpenAI has been, and continues to be, enriched at the expense of the News Media Companies, including by unlawfully obtaining and using the Works for free. The News Media Companies have been correspondingly deprived. There is no juristic reason for OpenAI’s enrichment at the expense of the News Media Companies. OpenAI is accordingly liable for unjust enrichment.
I’m… really not sure how that can even be shown here. OpenAI simply provides answers. It is by no means replacing the original works. After all, a well known weakness in LLMs is that they can’t provide relevant details on a current affairs event in the first place. Further, the lawsuit doesn’t show any evidence that OpenAI is reproducing any specific reports the LLM read. The only argument I can possibly think of here is that huge swaths of people would otherwise be reading articles from five years ago or more, and that OpenAI is depriving the mainstream media of that traffic by coughing up, uh, facts about what happened. Even then, last I checked, facts can’t be copyrighted.
So, having read both the press release and the lawsuit, I’m left with a question I can’t really answer. That question is, “what’s the case?” Ultimately, I can only conclude that this is a garbage lawsuit.
I think the strongest points I can read here are that OpenAI allegedly circumvented a paywall and that they violated the sites’ terms and conditions.
With respect to the former, you have to show evidence that they did so (the lawsuit doesn’t provide that). What’s more, you need to show that the paywalled content was reproduced and distributed without authorization (which the lawsuit does not do). The act of reading said material does not rise to the level of copyright infringement.
With respect to the latter, again, breaching the terms of use of a website is not, in and of itself, copyright infringement. You’d have to fall back on contract law (and, unless I’m missing something, the lawsuit doesn’t argue that what OpenAI did amounts to a breach of contract; instead, it argues that OpenAI is violating copyright law, which is, well, a different law). I can see an amended lawsuit being remotely possible, but even then, I’m not sure the damages involved would make such a lawsuit worth it.
Personally, in a sensible court setting, I don’t see this lawsuit going anywhere. Some media organizations have argued that similar lawsuits were filed in the US and that Canadian media companies are simply doing the same thing in Canada. Of all the times I saw that argument, I never once saw them mention the results of those lawsuits. That’s probably because those lawsuits didn’t do so well in court. Here’s one example from Techdirt:
I get that a lot of people don’t like the big AI companies and how they scrape the web. But these copyright lawsuits being filed against them are absolute garbage. And you want that to be the case, because if it goes the other way, it will do real damage to the open web by further entrenching the largest companies. If you don’t like the AI companies find another path, because copyright is not the answer.
So far, we’ve seen that these cases aren’t doing all that well, though many are still ongoing.
Last week, a judge tossed out one of the early ones against OpenAI, brought by Raw Story and Alternet.
Part of the problem is that these lawsuits assume, incorrectly, that these AI services really are, as some people falsely call them, “plagiarism machines.” The assumption is that they’re just copying everything and then handing out snippets of it.
But that’s not how it works. It is much more akin to reading all these works and then being able to make suggestions based on an understanding of how similar things kinda look, though from memory, not from having access to the originals.
Finally, the judge basically says, “Look, I get it, you’re upset that ChatGPT read your stuff, but you don’t have an actual legal claim here.”
Let us be clear about what is really at stake here. The alleged injury for which Plaintiffs truly seek redress is not the exclusion of CMI from Defendants’ training sets, but rather Defendants’ use of Plaintiffs’ articles to develop ChatGPT without compensation to Plaintiffs. See Compl. ¶ 57 (“The OpenAI Defendants have acknowledged that use of copyright-protected works to train ChatGPT requires a license to that content, and in some instances, have entered licensing agreements with large copyright owners … They are also in licensing talks with other copyright owners in the news industry, but have offered no compensation to Plaintiffs.”). Whether or not that type of injury satisfies the injury-in-fact requirement, it is not the type of harm that has been “elevated” by Section 1202(b)(1) of the DMCA. See Spokeo, 578 U.S. at 341 (Congress may “elevate to the status of legally cognizable injuries, de facto injuries that were previously inadequate in law.”). Whether there is another statute or legal theory that does elevate this type of harm remains to be seen. But that question is not before the Court today.
While the judge dismisses the case and allows them to try again with an amended complaint, it would appear that she is skeptical they could do so with any reasonable chance of success:
In the event of dismissal Plaintiffs seek leave to file an amended complaint. I cannot ascertain whether amendment would be futile without seeing a proposed amended pleading. I am skeptical about Plaintiffs’ ability to allege a cognizable injury but, at least as to injunctive relief, I am prepared to consider an amended pleading.
If you are looking to the lawsuits in the US as a sign that potential Canadian lawsuits could be successful, the above example is not exactly a ringing endorsement. After all, the US DMCA is a very strict copyright law that many would argue is tougher than Canadian copyright laws – and the lawsuits are getting tossed even under the DMCA.
If anything, the Canadian lawsuit really reeks of emotion and desperation. The Online News Act has caused significant damage to the Canadian news sector as a whole. This is thanks to the unmitigated greed of the mainstream media, who think everyone everywhere else owes them simply because they exist. After strings of bankruptcies and stories of significant losses when the Online News Act predictably blew up in their faces, this lawsuit seems to be a desperate attempt to extract money from elsewhere across the internet. Unless some wild new (and very bad) interpretation of copyright law magically comes out of the woodwork in all of this, I can only see this as yet another scheme blowing up in the mainstream media’s faces. But hey, no bad idea is to be left behind, so why not waste even more money by filing stupid lawsuits.
(via @MGeist)
There’s a difference between a human reading an article and machines gorging themselves on the sum total of human content, text and imagery and audio and video, multiple times over, in order to create climate-killing “AI” products whose main purpose is to further enrich the billionaire owners of the machines.
The “AI” products from garbage piles like OpenAI, Google, and more have just served to make my daily use of the Internet worse.
I know you have a vendetta against mainstream media, but corpos with machines that do stuff like scan webpages countless times per second to where they can effectively simulate a Denial Of Service attack on smaller sites? They’re not the same thing as regular human readers, nor are they worth defending.
I have a vendetta against stupid people/organizations actively doing things to make the situation with civil rights and the free and open internet worse (i.e. the major record labels throughout the 2000s that thought they could sue their way back into the 80s). It just so happens that mainstream media has been doing a spectacular job at fitting the bill thanks, in part, to them living in their own reality bubble.
You may not like the concept of AI and may want to smash any possible instance the moment you can, but the reality is that using copyright law against it is a non-starter. The method is failing spectacularly in the US and will very likely go down in a similar manner in Canada. Reading a copyrighted work does not give rise to a copyright violation.
If the complaint was the bandwidth used to read that material, then file a lawsuit to recover the costs of the bandwidth consumed. That’s not what the mainstream media did here. They basically sued on the (very faulty) grounds that reading and comprehending text is copyright infringement. They also hilariously argued that a robots.txt is legally binding (it’s not). Their only hope, as far as I’m concerned, is that they get a judge who simply doesn’t understand technology at all and gives a bad ruling. That would basically give OpenAI a slam dunk case to launch an appeal.
“Reading a copyrighted work does not give rise to a copyright violation.”
The machines don’t read. They are not people. The material isn’t being “read” or “watched” or “listened to” in any human way. The material is being processed as an input into products such as chatbots/LLMs and image generators that are making the owners billions in investor dollars while the artists, writers, journalists and musicians whose corpus of work is mulched up into those products don’t see a lick of it. They don’t see any money from it, and are told they should just do the equivalent of “nerd harder”.
Why should fair use protect the ability for billionaires to gorge their machines on the Internet at large to make these products?
Your wealth status, generally speaking, does not change how the law is applied, only your effectiveness of fighting in court (offer may not be valid in the US).
Copyright law does not distinguish between a machine or a human reading the material. If that’s problematic for you, then lobby for changes to copyright law.
Further, by disagreeing with my opinion on reading not giving rise to a copyright violation, you are also disagreeing with at least one judge who oversaw one of these lawsuits. I’ll quote the opinion again:
So, you don’t even have to take my word for it.
I mean, you are free to believe that reading is copyright infringement and that a computer makes all the legal difference here, but so far, that argument is not doing well in the American court system.
1) The billionaires are not the ones scraping the Internet. The machines they have are the ones doing it, and those machines aren’t people.
2) It’s not actual reading like humans do. The attempted “well you’re just gonna outlaw reading!” gotcha is disingenuous. The machine companies depend on skewed analogies and anthropomorphizing of their machines to get people (including judges) to think that mass-trawling and refreshing web pages countless times a second to create a product is the same thing as me opening up a website and reading through it.
3) An argument not doing well in the nightmarescape of the American court system does not help your own argument the way you think it does.