A study reveals that datasets used to train large language models frequently lack transparency.
Researchers employ massive dataset collections that combine varied data from thousands of web sources to train larger, more sophisticated language models. However, as these datasets are mixed and remixed into various collections, crucial details regarding their original sources and usage restrictions are often lost or obscured.
In addition to raising ethical and legal concerns, this can degrade a model’s performance. For example, if a dataset is misclassified, someone developing a machine-learning model for a task could inadvertently use data that are not intended for that purpose.
Furthermore, if a model is used with data from unidentified sources, it may contain biases that lead to unjust predictions.
In an effort to improve data transparency, an interdisciplinary team of researchers from MIT and other institutions launched a systematic audit of more than 1,800 text datasets on popular hosting platforms.
They found that more than half of these datasets contained erroneous information, and more than 70% lacked license information altogether.
Based on these findings, they developed the Data Provenance Explorer, a user-friendly tool that automatically generates concise summaries of a dataset’s creators, licenses, and permitted uses.
“These kinds of resources can support the responsible development of AI and assist regulators and practitioners in making decisions regarding the implementation of AI,” says Alex “Sandy” Pentland, an MIT professor who leads the Human Dynamics Group at the Media Lab.
He is also a co-author of an open-access paper about the project, which will be published in Nature Machine Intelligence.
By enabling users to choose training datasets that align with their models’ goals, the Data Provenance Explorer can help AI practitioners build more effective models. This could improve the accuracy of AI models in practical applications, such as answering customer inquiries or assessing loan applications.
Knowing what data a model was trained on is one of the best ways to understand its capabilities and limitations. Co-lead author Robert Mahari, an MIT graduate student and JD candidate at Harvard Law School, points out that misattribution of data and confusion about its source pose serious obstacles to transparency.
Co-lead author Shayne Longpre, a graduate student in the Media Lab, and Sara Hooker, who leads the Cohere for AI research lab, wrote the paper with Mahari and Pentland, along with colleagues at MIT, the University of California at Irvine, the University of Lille in France, the University of Colorado at Boulder, Olin College, Carnegie Mellon University, Contextual AI, ML Commons, and Tidelift.
Focus on fine-tuning
Researchers often use a technique called fine-tuning to improve a general language model’s performance on a particular task, such as answering questions. For this, they build carefully curated datasets designed to boost the model’s performance on that one task.
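As a rough illustration (not the study’s own code), a typical fine-tuning run using the Hugging Face Transformers library might look like the sketch below. The dataset name and its column layout are hypothetical placeholders, and this is exactly the step where a dataset’s license and permitted uses come into play.

```python
# A minimal fine-tuning sketch using Hugging Face Transformers.
# "some_curated_dataset" is a hypothetical placeholder; a real run must
# first confirm that the dataset's license permits this use.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

# Assumes the dataset provides "text" and "label" columns.
dataset = load_dataset("some_curated_dataset")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetune-out", num_train_epochs=1),
    train_dataset=tokenized["train"],
)
trainer.train()  # adapts the general model to the curated task data
```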
The MIT researchers focused on these fine-tuning datasets, which are often created by researchers, academic organizations, or companies and licensed for particular uses.
When crowdsourced aggregators combine these datasets into larger collections for further fine-tuning, some of the original license information is often left behind.
“These licenses ought to matter, and they should be enforceable,” Mahari argues.
For example, if a dataset’s licensing terms are wrong or missing, someone could spend a great deal of time and money building a model they are later forced to take down because some of the training data contained private information.
“People can end up training models where they don’t even understand the capabilities, concerns, or risk of those models, which ultimately stem from the data,” Longpre explains.
The researchers define data provenance as the combination of a dataset’s sourcing, creation, and licensing history, along with its characteristics.
They then developed a structured auditing procedure to trace the provenance of more than 1,800 text datasets from popular online repositories.
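As a simplified sketch of the idea (not the authors’ actual pipeline), each audited dataset can be represented as a small record, with checks for missing licenses and for mismatches between the license named by the creators and the one shown by the hosting platform. All field names here are assumptions.

```python
# A hypothetical provenance record and two audit checks, illustrating the
# kinds of gaps the study looked for; not the researchers' actual code.
from dataclasses import dataclass, field

@dataclass
class ProvenanceRecord:
    name: str
    creators: list = field(default_factory=list)     # who built the dataset
    source_urls: list = field(default_factory=list)  # where the text came from
    license: str = "unspecified"        # license named by the creators
    repo_license: str = "unspecified"   # license shown on the hosting platform

def audit(records):
    """Count datasets with no usable license, and flag cases where the
    hosting platform reports a different license than the creators did."""
    unspecified = [r for r in records if r.license == "unspecified"]
    mismatched = [r for r in records
                  if r.license != "unspecified" and r.repo_license != r.license]
    return len(unspecified), mismatched
```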
After finding that more than 70% of these datasets carried “unspecified” licenses that omitted much of this information, the researchers worked backward to fill in the blanks. Their efforts reduced the share of datasets with “unspecified” licenses to roughly 30%.
Their analysis also showed that the licenses assigned by the repositories were frequently less restrictive than the correct licenses.
Furthermore, they discovered that nearly all dataset creators were concentrated in the global north, which could limit a model’s performance if it is trained for deployment in a different region. For example, Mahari explains, a Turkish-language dataset created mostly by people in the United States and China might not contain any culturally significant elements.
“We almost delude ourselves into thinking the datasets are more diverse than they actually are,” he says.
Interestingly, the researchers also observed a sharp increase in restrictions placed on datasets created in 2023 and 2024, which may reflect researchers’ concerns that their datasets could be used for unintended commercial purposes.
An easy-to-use tool
To help others obtain this information without the need for a manual audit, the researchers built the Data Provenance Explorer. In addition to sorting and filtering datasets based on certain criteria, the tool lets users download a data provenance card that provides a concise, structured overview of a dataset’s attributes.
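The public Explorer has its own interface, so purely as an illustration of the idea, filtering datasets by permitted use and printing a brief provenance card might look like the following. Every field name and value here is a hypothetical assumption.

```python
# Illustrative only: mimics the idea of filtering datasets by license terms
# and rendering a short "provenance card". All fields are made-up examples.
datasets = [
    {"name": "example_qa", "license": "CC-BY-4.0",
     "creators": ["Univ. A"], "allowed_uses": ["research", "commercial"]},
    {"name": "example_chat", "license": "unspecified",
     "creators": ["Lab B"], "allowed_uses": []},
]

def provenance_card(ds):
    """Render a concise, structured summary of a dataset's attributes."""
    return (f"Dataset: {ds['name']}\n"
            f"  Creators: {', '.join(ds['creators'])}\n"
            f"  License:  {ds['license']}\n"
            f"  Allowed uses: {', '.join(ds['allowed_uses']) or 'unknown'}")

# Keep only datasets whose license clearly permits commercial use.
for ds in datasets:
    if "commercial" in ds["allowed_uses"]:
        print(provenance_card(ds))
```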
“We are hoping this is a step, not just to understand the landscape, but also help people going forward to make more informed choices about what data they are training on,” Mahari adds.
In the future, the researchers hope to extend their analysis to investigate data provenance for multimodal data, including audio and video. They also aim to study how datasets mirror the terms of service of the websites that act as data sources.
As their work progresses, they are also reaching out to regulators to discuss their findings and the distinctive copyright implications of fine-tuning data.
“We need data provenance and transparency from the outset, when people are creating and releasing these datasets, to make it easier for others to derive these insights,” Longpre argues.
Provided by the Massachusetts Institute of Technology