Study Exposes Lack of Transparency in AI Training Datasets
Hidden Issues in AI Training Data
Researchers often train large language models on massive datasets compiled from thousands of web sources. These datasets are regularly mixed and reused across projects, yet essential information such as data origin, licensing terms, and usage restrictions is frequently lost or overlooked. This raises ethical and legal concerns and increases the risk of producing AI models that are biased or misused.
A poorly labeled dataset might mislead developers into applying it incorrectly. For example, using misclassified data in a model for healthcare or legal decisions could lead to flawed outcomes. Worse, if the dataset includes unidentified sources, it may embed unseen biases, skewing predictions unfairly.
Systematic Audit Reveals Shocking Gaps
To address these concerns, a multidisciplinary team of researchers from MIT and other global institutions audited over 1,800 text datasets from popular hosting platforms. They found that:
- Over 70% of datasets lacked any license information.
- More than half included incorrect or misleading metadata.
In response, the team developed the Data Provenance Explorer, a user-friendly tool that automatically generates summaries about a dataset’s creators, licenses, and permitted uses.
Professor Alex “Sandy” Pentland, head of the Human Dynamics Group at MIT Media Lab, emphasized the significance of this tool, saying it could guide both AI developers and regulators in making more informed decisions about AI deployment.
Benefits of Data Transparency
The Data Provenance Explorer helps AI practitioners select datasets that align with their model goals, improving output quality in real-world applications such as customer support or loan assessments. According to co-lead author Robert Mahari, a Harvard Law School JD candidate and MIT graduate student, knowing a model’s data origin is critical. Misunderstanding or misattribution can seriously undermine transparency and accountability.
Co-lead author Shayne Longpre, also a graduate student at MIT Media Lab, collaborated with experts from several prestigious institutions, including:
- University of California at Irvine
- University of Lille
- University of Colorado at Boulder
- Olin College
- Carnegie Mellon University
- Contextual AI
- ML Commons
- Tidelift
Focus on Fine-Tuning Datasets
The researchers focused on fine-tuning datasets, which are often used to adapt general-purpose models to niche applications like question-answering systems. These datasets, typically created by companies or academics for specific use cases, often lose their original licensing details when they are combined into larger, crowdsourced collections.
Mahari pointed out that ignoring or misrepresenting license restrictions could lead to costly setbacks. For instance, a company might unknowingly use restricted data, forcing them to discard an otherwise functional model.
Longpre noted, “People can end up training models where they don’t even understand the capabilities, concerns, or risks… all rooted in the data.”
What is Data Provenance?
Data provenance encompasses a dataset’s full history: its origin, licenses, creators, and intended uses. The research team reverse-engineered datasets to uncover missing details and managed to reduce the percentage of unspecified licenses from over 70% to about 30%.
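As a rough illustration of what such a history amounts to in practice, a dataset's provenance can be thought of as a structured record. The sketch below is a hypothetical representation (the field names are illustrative, not the schema the team actually used), showing how an unspecified license, the gap the audit found in over 70% of datasets, becomes easy to flag:

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical provenance record; field names are illustrative,
# not the schema used by the Data Provenance Explorer.
@dataclass
class ProvenanceRecord:
    name: str
    creators: list = field(default_factory=list)
    license: Optional[str] = None      # e.g. "CC-BY-4.0"; None means unspecified
    intended_uses: list = field(default_factory=list)

    def license_unspecified(self) -> bool:
        """True when no license has been traced for this dataset."""
        return self.license is None

record = ProvenanceRecord(name="example-qa-set", creators=["Example Lab"])
print(record.license_unspecified())  # True until a license is traced
```

Reverse-engineering a dataset, in these terms, means filling in fields like `license` and `creators` that were dropped somewhere along the chain of reuse.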
They also discovered that:
- Licenses assigned by hosting platforms were often more permissive than those specified by the datasets' original creators.
- Most dataset developers were from the Global North.
This raises concerns about geographic bias. For example, a Turkish-language dataset built primarily by American or Chinese contributors might fail to capture cultural nuances accurately.
Mahari warned, “We almost delude ourselves into thinking the datasets are more diverse than they are.”
A Rise in Licensing Restrictions
Interestingly, the team noted a sharp increase in licensing restrictions in datasets created during 2023 and 2024. They attribute this to growing concerns over the unauthorized commercial use of academic data.
Introducing the Data Provenance Explorer
To make dataset analysis more accessible, the researchers built the Data Provenance Explorer. The tool offers:
- Automatic summaries of dataset attributes
- Filters based on user-defined standards
- Data provenance cards showing key dataset features
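To make the idea of filtering on user-defined standards concrete, here is a minimal sketch, assuming datasets are described by plain metadata dictionaries. The keys (`license`, `commercial_use`) and the function itself are assumptions for illustration, not the Explorer's real API:

```python
# Hypothetical filter over dataset metadata; the keys "license" and
# "commercial_use" are assumptions, not the Data Provenance Explorer's API.
def filter_datasets(datasets, require_license=True, allow_commercial=None):
    selected = []
    for ds in datasets:
        if require_license and ds.get("license") is None:
            continue  # skip datasets whose license was never specified
        if allow_commercial is not None and ds.get("commercial_use") != allow_commercial:
            continue  # skip datasets whose commercial-use terms don't match
        selected.append(ds)
    return selected

catalog = [
    {"name": "qa-set-a", "license": "CC-BY-4.0", "commercial_use": True},
    {"name": "qa-set-b", "license": None, "commercial_use": True},
    {"name": "qa-set-c", "license": "CC-BY-NC-4.0", "commercial_use": False},
]
print([d["name"] for d in filter_datasets(catalog, allow_commercial=True)])
# ['qa-set-a']
```

A filter like this is only as good as the metadata behind it, which is exactly why the team's work to recover missing license information matters.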
Mahari hopes this tool will help researchers and developers make smarter, more ethical choices about dataset selection.
The team also aims to extend this work to multimodal datasets, including audio and video, and to examine how web platform terms of service align with dataset use.
Moving Toward Greater Accountability
The researchers are actively engaging with regulators to explore copyright issues surrounding data fine-tuning. Longpre stressed the need for transparency from the outset:
“We need data provenance and transparency from the outset… to make it easier for others to derive these insights.”