Generative AI training data sets are now trackable – and often legally complicated

October 26, 2023

A new online tool allows users to identify, track and learn about the legal status of training data sets for generative AI, and a quick glance shows that many may have licensing issues.

The tool, dubbed the Data Provenance Explorer, is the result of a joint effort between machine learning and legal experts from MIT, generative AI API provider Cohere, and 11 other organizations — Harvard Law School, Carnegie Mellon University and Apple are all among the contributors. The Data Provenance Explorer lets researchers, journalists and anyone else search through thousands of AI training databases and trace the “lineage” of widely used data sets.

The idea is to provide a way to explore the sometimes murky world of training data used to develop generative AI. In an official statement announcing the Data Provenance Explorer, the team behind it described a “data transparency crisis” that could complicate the development and commercial use of generative AI systems.

Crowdsourced data sets lack licenses

“Crowdsourced aggregators like GitHub, Papers with Code, and many of the open source LLMs [large language models] trained from data on these aggregators, have an extremely high proportion of missing data licenses … ranging from 72% to 83%,” the group said. “In addition, the licenses that are assigned by crowdsourced aggregators frequently allow broader use than the original intent expressed by the authors of a data set.”

The need for responsibly developed AI is something that the industry appears to be well aware of, according to Kathy Lange, a research director for IDC. The headlong rush to deploy generative AI has created a public focus on the safe and legal use of data, she said.

“Understanding the provenance of the data; how it was collected, processed, and transformed can impact the trust in AI model results,” Lange said. “AI vendors prioritizing data provenance will have a leg-up in the market for customers requiring transparency, accountability, and compliance initiatives.”

In some respects, AI training data has become a battleground. Lange highlighted the recent introduction of the Nightshade tool, which subtly alters digital art in ways meant to confuse AI models whose creators attempt to use copyrighted works as training data. Authors and other copyright holders have also begun to take legal action against the use of their works in generative AI training; comedian and author Sarah Silverman is among those suing OpenAI for this reason. The legal landscape for those claims, however, remains unsettled.

This story originally appeared on Computerworld
