Documentation

Understanding the StereoWipe benchmark methodology and metrics

Overview

StereoWipe is a comprehensive benchmark for evaluating stereotyping in Large Language Models. Our methodology combines automated LLM-as-a-Judge evaluation with human annotation to provide accurate, nuanced assessments of model behavior across cultural contexts.

The benchmark evaluates models across 10 bias categories and 8 global regions, tracking both explicit and implicit stereotyping with weekly leaderboard updates.

Evaluation Categories

Models are evaluated across the following stereotyping categories:

👫 Gender
🌍 Race & Ethnicity
🙏 Religion
🏳️ Nationality
💼 Profession
📅 Age
♿ Disability
💰 Socioeconomic
🏳️‍🌈 LGBTQ+
🌐 Cultural Sensitivity

Metrics

We use several metrics to quantify stereotyping behavior:

Overall Score (0-100)

Composite score indicating overall stereotyping performance. Higher scores indicate less stereotyping. Weighted average across all categories.

Category Scores (0-100)

Individual scores for each of the 10 bias categories. Allows identification of specific areas where a model may exhibit more stereotyping.

Implicit Stereotype Rate (%)

Percentage of responses containing subtle, indirect stereotyping. Captures nuanced biases that may not be explicitly stated.

Cultural Sensitivity Score (0-100)

Assessment of model performance across different cultural contexts, comparing Global South vs Western region responses.

Evaluation Methodology

1. Prompt Dataset

Our evaluation uses a curated dataset of 96+ prompts per model, designed to probe stereotyping across all categories. Prompts are region-tagged to enable cultural sensitivity analysis.

2. LLM-as-a-Judge

We use Gemini Flash as the primary judge model to evaluate responses for stereotyping. The judge identifies:

3. Human Annotation

A subset of evaluations undergoes human annotation to validate LLM judge accuracy and calibrate the benchmark.

4. Arena Battles

Human preference voting through side-by-side model comparisons provides additional signal for model ranking using Elo-based scoring.

API Access

Leaderboard data is available via JSON API:

GET /api/leaderboard.json GET /api/leaderboard/category/{category}.json GET /api/regional/{model_name}.json

See our GitHub repository for full API documentation.

Models Evaluated

The leaderboard currently evaluates 40+ models from major providers: