Here come the lawyers.
Last week, the New York Times sued Microsoft and OpenAI, a company in which Microsoft has invested $13 billion and counting, for copyright violations. The Times claims Microsoft’s genAI-based Copilot and OpenAI’s ChatGPT, which powers Copilot, were trained on millions of its articles without permission.
It goes on to argue that those tools (and Microsoft’s search engine, Bing) “now compete with the news outlet as a source of reliable information.”
The Times isn’t seeking a specific amount of damages – yet. Ultimately, though, it wants a lot — “billions of dollars in statutory and actual damages” — because of the “unlawful copying and use of The Times’s uniquely valuable works.”
Beyond that, the filing demands that Microsoft and OpenAI destroy both the datasets used to train the tools and the tools themselves.
This isn’t the first lawsuit claiming AI companies violated copyrights in building their chatbots, and it won’t be the last. But it is the Big Kahuna – the Times is among the best-known newspapers in the world and the gold standard in journalism. And its move could prove to be among the most influential lawsuits of the computer and internet age, perhaps the most influential.
That’s because the outcome could well determine the future of generative AI.
Who’s right here? Is the Times just grubbing for money, and using the lawsuit to negotiate a better rights deal with Microsoft and OpenAI for use of its articles? Or is it standing up for the rights of all copyright holders, no matter how small, against the onslaught of the AI titans?
What’s in the lawsuit?
To get a better understanding of what’s involved, let’s first take a closer look at the underlying technology and the suit itself. GenAI chatbots like Copilot and ChatGPT are built on large language models (LLMs), which must be trained on tremendous amounts of data to be effective and useful. The more data, the better. And just as important is the quality of that data: the better the data, the better the genAI results.
Microsoft and OpenAI use content available on the internet to train their tools, regardless of whether that content is public domain information, open source data, or copyrighted material; it all gets ingested by the great, hungry maw of genAI. That means millions and millions of articles from the Times and myriad other publications are used for training.
Microsoft and OpenAI contend that those articles and all other copyrighted material are covered by the fair use doctrine. Fair use is an exceedingly complicated and confusing legal concept, and there’s an unending stream of lawsuits that determine what’s fair use and what isn’t. It’s widely open to interpretation.
That’s why the Times lawsuit is so important. It will determine whether all genAI tools, not just those owned by Microsoft and OpenAI, can continue to be trained on copyrighted material. (Copyrighted content is highly valuable because it tends to be the broadest and most accurate. And there’s lots of it.)
Fair use of copyrighted material generally falls into two categories: commentary and parody. In other words, use of the material must be “transformative”: it can’t just copy the copyrighted material; it has to transform it in some way.
So, for example, someone writing a review of a novel can quote several lines to make a point. In a news report, fair use lets you summarize an article about a medical research report and quote briefly from it.
Microsoft and OpenAI say their use of copyrighted material is transformative; they contend the chatbots’ output transforms the original content into something different. The Times suit claims there’s no real transformation, and that what Microsoft and OpenAI are doing is outright theft. It claims the companies are stealing not just Times content but the Times’s audience as well, and making billions of dollars from it. If people can get all the newspaper’s information for free from a chatbot, the suit alleges, they will have no need to read the Times either online or in print.
This paragraph sums up the Times’s contentions: “There is nothing ‘transformative’ about using The Times’s content without payment to create products that substitute for The Times and steal audiences away from it. Because the outputs of Defendants’ GenAI models compete with and closely mimic the inputs used to train them, copying Times works for that purpose is not fair use.”
The suit offers plenty of evidence for its claims. The most egregious examples are the many instances in which ChatGPT outright plagiarizes articles, including a Pulitzer Prize-winning, five-part, 18-month investigation into predatory lending practices in New York City’s taxi industry. The suit charges: “OpenAI had no role in the creation of this content, yet with minimal prompting, will recite large portions of it verbatim.”
For its part, OpenAI on Monday accused the Times of intentionally manipulating prompts to get ChatGPT to regurgitate its content. “Even when using such prompts, our models don’t typically behave the way The New York Times insinuates, which suggests they either instructed the model to regurgitate or cherry-picked their examples from many attempts,” the company said in a blog post.
It’s not just plagiarism that’s a problem. The Times notes that it spends a tremendous amount of money and effort on its news organization, and that if people can get its breaking news for free – even if it’s paraphrased by a chatbot – they will have no need to read the newspaper.
Beyond that, the publisher found that the Microsoft and OpenAI chatbots take information from the newspaper’s Wirecutter product-review site, publish it, and strip out the referral links that earn the Times revenue.
“Defendants have not only copied Times content, but also altered the content by removing links to the products, thereby depriving The Times of the opportunity to receive referral revenue and appropriating that opportunity for Defendants,” the lawsuit argues.
So, who’s right?
This is not a difficult call. The answer is simple: the Times is right, and Microsoft and OpenAI are wrong. Microsoft and OpenAI are getting a free ride, using copyrighted material that takes a tremendous amount of time and money to create in order to reap big profits. If the court rules against the Times, copyright holders everywhere — from giants like the Times to individual writers, artists, photographers and others — will struggle to survive while Microsoft, OpenAI and other AI makers get fat with profits.
One of the great ironies of this suit is that a young Bill Gates complained mightily when Microsoft’s first product, a version of BASIC for the Altair 8800 personal computer, was being pirated rather than paid for.
This was in 1975, when the idea of paying for software was anathema to most people who used the first personal computers. An idealistic share-and-share-alike ethos ruled, especially among members of the influential Homebrew Computer Club.
So an angry Gates sent his “Open Letter to Hobbyists” to the Homebrew Computer Club and to computer-related publications. He wrote, in part:
“The amount of royalties we have received from sales to hobbyists makes the time spent on Altair BASIC worth less than $2 an hour. As the majority of hobbyists must be aware, most of you steal your software…. Who cares if the people who work on it get paid?
“Who can afford to do professional work for nothing? What hobbyist can put [three] man-years into programming, finding all bugs, documenting his product and distribute for free? …Most directly, the thing you do is theft.”
There’s not much difference between what Gates was complaining about and what Microsoft is doing now. Gates was right back then. Microsoft and OpenAI are wrong right now. They should either come to an agreement with the Times and other copyright holders or retrain their AI in a way that doesn’t violate copyright laws. And the same holds for all other AI creators as well.
This story originally appeared on Computerworld