The Data Nutrition Project

Helping data scientists understand what's inside datasets before they're used in machine learning.

Problem

Artificial intelligence (AI) systems built on incomplete or biased data often produce problematic outcomes, with unintended negative consequences for the very communities that are already marginalized, underserved, or underrepresented. Yet there are few, if any, standard methods of data analysis for checking the ‘health’ of data, particularly before model development.

Our Approach

With the aim of mitigating harms caused by automated decision-making systems, the Dataset Nutrition Label tool enhances the context, content, and legibility of datasets. Drawing on the analogy of the Nutrition Facts label on food, the Label highlights the “ingredients” of a dataset to help shed light on how (or whether) the dataset is healthy for use.

The Dataset Nutrition Label is a diagnostic framework that lowers the barrier to standardized data analysis by providing a distilled yet comprehensive overview of dataset “ingredients” before AI model development. The framework is designed around the data practitioner's workflow: it presents potential use cases for the data alongside alerts or flags that highlight known issues and possible mitigation strategies.

The Label is intended to drive robust data analysis practices by making it easier and faster for data scientists to interrogate and select datasets; increase the overall quality of models by driving the use of better and more appropriate datasets for those models; and enable the creation and publishing of responsible datasets by those who collect, clean, and publish data.

At-a-Glance Information

The web-based Label presents information in three distinct panes: Label Overview, Objectives & Alerts, and Dataset Info.

The Label Overview provides overall dataset information including known issues (alerts) and indicators (badges) for key questions such as whether the data has undergone ethical review.
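
As a rough illustration of the kind of information the Label Overview carries, the sketch below models alerts and badges as simple Python data classes. It is a minimal sketch, not the project's actual schema: the class names (Alert, LabelOverview), the field names, the badge keys, and the example values are all hypothetical and chosen only to mirror the description above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Alert:
    """A known issue with a dataset, with an optional mitigation (hypothetical shape)."""
    title: str
    mitigation: str = ""


@dataclass
class LabelOverview:
    """Hypothetical stand-in for the information shown in the Label Overview pane."""
    dataset_name: str
    alerts: list[Alert] = field(default_factory=list)
    # Badges answer key yes/no questions; the keys used here are illustrative only.
    badges: dict[str, bool] = field(default_factory=dict)

    def summary(self) -> str:
        """Return a one-line summary: alert count plus the badges answered 'yes'."""
        earned = sorted(name for name, ok in self.badges.items() if ok)
        badge_text = ", ".join(earned) or "none"
        return f"{self.dataset_name}: {len(self.alerts)} alert(s); badges: {badge_text}"


# Illustrative values only; they do not describe any real dataset.
overview = LabelOverview(
    dataset_name="example-dataset",
    alerts=[Alert("Some groups are under-represented", "Collect more data or reweight")],
    badges={"ethical_review": True, "human_subjects": False},
)
print(overview.summary())
```

Keeping alerts and badges as separate fields mirrors the distinction drawn above between known issues and yes/no indicators; a real Label would likely carry far more detail than this sketch.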

Upcoming Work

The Data Nutrition Project is a research organization and product development team composed of technologists, designers, academics, and scientists. Together, we are excited to continue the work of driving better AI through the exploration and development of practical tools.

Since launching the second generation of the Dataset Nutrition Label in early 2021, the team has turned its focus to a number of initiatives for this year and beyond:

  • Publish a Labels Library. Create additional Labels for high-impact datasets often used to train AI.
  • Conduct user research. Work with data scientists to refine the utility and legibility of the Label.
  • Improve our tools. Build a semi-automated Label Maker and a Label Comparison tool.
  • Research the landscape. Continue research on the broader ecosystem of labeling and algorithmic accountability, and make the findings broadly available to academics and policymakers.
