Lack of diversity in data collection has caused significant failures in machine learning (ML) applications. While ML developers perform post-collection interventions, these are time-consuming and rarely comprehensive. Thus, new methods for tracking and managing data collection, iteration, and model training are needed to assess whether datasets reflect real-world variability. We present designing data, an iterative, bias-mitigating approach to data collection that links HCI concepts with ML techniques. Our process includes (1) Pre-collection planning, to reflexively elicit and document expected data distributions; (2) Recruitment monitoring, to systematically encourage sampling diversity; and (3) Data familiarization, to identify samples that are unfamiliar to a model using out-of-distribution (OOD) methods. We evaluate designing data through our own data collection and applied ML case studies. We find that models trained on “targeted” datasets generalize better across intersectional groups than models trained on similarly sized but less targeted datasets, and that data familiarity is effective for debugging datasets.
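
To make the data familiarization step concrete, the sketch below illustrates one common OOD approach: score new samples by their density under a model fit to embeddings of the collected data, and flag low-density samples as unfamiliar. The specific choices here (a Gaussian mixture on embedding features, a 5% quantile threshold) are illustrative assumptions, not necessarily the method used in the paper.

```python
# A minimal sketch of data familiarization via density-based OOD scoring.
# Assumes samples have already been embedded as fixed-length feature vectors.
import numpy as np
from sklearn.mixture import GaussianMixture


def fit_familiarity_model(train_features: np.ndarray, n_components: int = 8) -> GaussianMixture:
    """Fit a density model on embeddings of the collected dataset."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="full", random_state=0)
    gmm.fit(train_features)
    return gmm


def flag_unfamiliar(model: GaussianMixture, features: np.ndarray, quantile: float = 0.05):
    """Flag the lowest-density (least familiar) samples for review.

    Higher log-likelihood from score_samples = more familiar to the model.
    """
    scores = model.score_samples(features)
    threshold = np.quantile(scores, quantile)
    return scores < threshold, scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = rng.normal(size=(1000, 16))              # stand-in for training embeddings
    new = np.vstack([
        rng.normal(size=(95, 16)),                   # in-distribution samples
        rng.normal(loc=6.0, size=(5, 16)),           # shifted, "unfamiliar" samples
    ])
    gmm = fit_familiarity_model(train)
    mask, scores = flag_unfamiliar(gmm, new)
    print(f"Flagged {mask.sum()} of {len(new)} samples as unfamiliar")
```

In a collection workflow, flagged samples might surface either coverage gaps to target in further recruitment or mislabeled and corrupted records to debug, which is how a familiarity score can serve both the monitoring and debugging uses described above.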
