The Dataset Doesn’t Lie (But It Doesn’t Tell the Whole Truth)

By RACHAEL J. WEBB and MAGGIE GRADY

Gender Shades Series

This assignment is intended to cultivate a habit of evidence-based AI skepticism. By experiencing firsthand the difficulty of the task that facial recognition systems are asked to perform, and by auditing an AI’s own argumentation against primary research data, students develop the capacity to evaluate AI-generated content not as a finished product, but as a draft subject to human scrutiny.

Overview

In 2018, Joy Buolamwini, a researcher at the MIT Media Lab and founder of the Algorithmic Justice League, published a landmark study with Timnit Gebru that would reverberate across the technology industry, the legal community, and beyond. In “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification” (Buolamwini & Gebru, 2018), the authors showed that three leading commercial gender classification systems, built by Microsoft, IBM, and Face++, misclassified darker-skinned women at dramatically higher rates than lighter-skinned men. While error rates on lighter-skinned male faces never exceeded 0.8%, error rates on darker-skinned female faces reached as high as 34.7%. The disparity was not a bug. It was a consequence of the data used to train the systems.

Buolamwini’s methodology centered on a benchmark she constructed herself: the Pilot Parliaments Benchmark (PPB), a dataset of 1,270 parliamentarian headshots balanced across gender and the Fitzpatrick skin tone scale, a dermatological classification system that ranges from Type I (lightest) to Type VI (darkest). Crucially, Buolamwini did not rely solely on automated classification to label gender in her dataset. She coded each image by hand, a process that underscored one of the project’s most provocative findings: even humans, examining a simple headshot in isolation, frequently struggle to determine gender with confidence. Performing this task reliably and equitably is far harder for AI systems than it might first appear.

The Gender Shades dataset is publicly accessible at gs.ajl.org. Visitors to the site can browse the benchmark images and examine disaggregated performance data for each of the commercial systems Buolamwini evaluated. Since its publication, the study has prompted vendor responses, federal legislative hearings, and ongoing debate about whether facial recognition technology should be deployed in high-stakes settings at all—and if so, under what conditions. Buolamwini expanded this body of work in her 2023 book Unmasking AI: My Mission to Protect What Is Human in a World of Machines, a memoir-driven account of algorithmic bias and the human cost of invisible errors (Buolamwini, 2023).

Pre-Work

  • Watch Joy Buolamwini’s TED Talk, “How I’m fighting bias in algorithms” (Buolamwini, 2016). The talk is approximately nine minutes long. Take notes on the specific examples of bias Buolamwini describes and the language she uses to frame the problem.
  • Read the short overview on the Gender Shades project page at gs.ajl.org before your session. Do not examine the benchmark results yet—you will explore those during class as part of the assignment.

Assignment Instructions

This assignment unfolds in three phases. Each phase builds directly on the one before it. Read all three phases before beginning.

Phase 1

  • Navigate to gs.ajl.org and open the benchmark image gallery. Browse the headshot photographs in the dataset without yet reviewing any performance benchmarks.
  • Select any 20 headshot images at random. For each image, record your own gender classification—male, female, or uncertain—alongside a brief note on your level of confidence (high, medium, or low). Do this quickly and instinctively, as a facial recognition system would. Take a screenshot or keep a simple running log.
  • After completing your 20 classifications, step back and reflect: How often were you uncertain? What visual features were you relying on? Did any images give you pause in ways you did not expect? Write two to three sentences capturing your honest reaction before moving on.

Phase 2

  • Now, examine the Gender Shades benchmark results for the three commercial systems. Pay close attention to how performance differs across the four demographic subgroups: darker females, darker males, lighter females, and lighter males. Note which subgroup has the highest error rate and which has the lowest.
  • Identify a specific institutional context within your own discipline where facial recognition technology could plausibly be deployed. Examples might include: a hospital emergency department using facial recognition to verify patient identity; a university using it to monitor exam integrity; a court system using it to match suspects to video footage; or a K–12 school using it for building access. Choose a context that feels real and consequential to you.
  • Open a generative AI tool of your choice (such as Claude, ChatGPT, or Gemini) and prompt it to generate arguments both for and against deploying facial recognition technology in the institutional context you identified. Be specific in your prompt: name the setting, the population it would affect, and the stated purpose of the technology. You may revise and re-prompt as many times as you like to get a thorough response.
  • Print, copy, or screenshot the AI’s full response so you can reference it in Phase 3. Save your exact prompt as well.
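
For students who prefer to keep their running log in a script rather than on paper, the tallying and comparison steps above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the log format, the function names `summarize_log` and `error_rate_gap`, and every number shown are hypothetical placeholders, not part of the Gender Shades materials and not the published results.

```python
from collections import Counter

# Hypothetical log format: one (label, confidence) pair per image.
# The allowed values mirror the wording of the assignment steps.
LABELS = {"male", "female", "uncertain"}
CONFIDENCE = {"high", "medium", "low"}


def summarize_log(log):
    """Count labels and confidence levels in a classification log."""
    for label, confidence in log:
        if label not in LABELS or confidence not in CONFIDENCE:
            raise ValueError(f"unexpected entry: {(label, confidence)}")
    labels = Counter(label for label, _ in log)
    confidences = Counter(confidence for _, confidence in log)
    # Share of images marked "uncertain" (0.0 for an empty log).
    uncertain_share = labels["uncertain"] / len(log) if log else 0.0
    return labels, confidences, uncertain_share


def error_rate_gap(error_rates):
    """Return (worst subgroup, best subgroup, gap in percentage points)."""
    worst = max(error_rates, key=error_rates.get)
    best = min(error_rates, key=error_rates.get)
    return worst, best, error_rates[worst] - error_rates[best]


if __name__ == "__main__":
    # Placeholder entries only; replace with your own 20 classifications.
    example_log = [("female", "high"), ("uncertain", "low"), ("male", "medium")]
    labels, confidences, share = summarize_log(example_log)
    print(labels, confidences, f"{share:.0%} uncertain")

    # Placeholder percentages, NOT the published results. Copy the real
    # figures for one commercial system from gs.ajl.org before using this.
    placeholder_rates = {
        "darker females": 30.0,
        "darker males": 10.0,
        "lighter females": 5.0,
        "lighter males": 1.0,
    }
    print(error_rate_gap(placeholder_rates))
```

Running the comparison once per commercial system, with the real figures entered by hand, gives a concrete percentage-point gap that can be cited in the written reflection.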

Phase 3

Write a short reflection of 400–600 words that does three things:

  • Takes a clear position on whether facial recognition technology should be deployed in the institutional context you selected. Your position must be supported by specific evidence from the Gender Shades benchmark data.
  • Identifies at least one specific place in the AI’s generated argumentation where the tool was factually incomplete, misleading, or insufficiently specific about the disparate impacts documented by Buolamwini and Gebru (2018). Quote the AI’s language directly and explain precisely what it got wrong or left out.
  • Connects your personal experience from Phase 1—your own difficulty or confidence in classifying gender from headshots—to the broader argument you are making about AI deployment in your field.

Grading Rubric

The following criteria reflect the core learning outcomes demonstrated through this assignment: students’ ability to critically examine AI-generated claims using empirical evidence, recognize and evaluate algorithmic bias, engage directly with primary data sources, and connect personal observations to broader ethical and institutional questions surrounding AI deployment.

Student completed the manual classification exercise and wrote a genuine, specific reflection on the experience. Evidence of honest uncertainty is valued over false confidence.

Student’s institutional context is clearly defined, and the prompt given to the AI is specific enough to elicit substantive argumentation on both sides.

Student’s position is clearly stated and directly supported by specific benchmark figures from the Gender Shades dataset. Vague references to “bias” without data citations do not meet this criterion.

Student identifies a concrete gap or inaccuracy in the AI’s argumentation and quotes the AI’s language directly. The critique is analytical, not impressionistic.

Student draws a meaningful link between their Phase 1 experience and their broader argument. This connection should feel specific to the student’s own encounter with the dataset, not generic.

References

Algorithmic Justice League. (2018). Gender Shades. https://gs.ajl.org/

Buolamwini, J. (2016). How I’m fighting bias in algorithms [TED Talk]. TED Conferences. https://www.ted.com/talks/joy_buolamwini_how_i_m_fighting_bias_in_algorithms

Buolamwini, J. (2023). Unmasking AI: My mission to protect what is human in a world of machines. Random House.

Buolamwini, J., & Gebru, T. (2018). Gender shades: Intersectional accuracy disparities in commercial gender classification. Proceedings of Machine Learning Research, 81, 77–91.
