Understanding the StereoWipe benchmark methodology and metrics
StereoWipe is a comprehensive benchmark for evaluating stereotyping in Large Language Models. Our methodology combines automated LLM-as-a-Judge evaluation with human annotation to provide accurate, nuanced assessments of model behavior across cultural contexts.
The benchmark evaluates models across 10 bias categories and 8 global regions, tracking both explicit and implicit stereotyping with weekly leaderboard updates.
Models are evaluated across the following stereotyping categories:
We use several metrics to quantify stereotyping behavior:
Overall bias score: a composite score summarizing overall stereotyping performance, computed as a weighted average across all categories. Higher scores indicate less stereotyping (see the sketch after this list).
Per-category scores: individual scores for each of the 10 bias categories, used to identify the specific areas where a model exhibits more stereotyping.
Implicit stereotyping rate: the percentage of responses containing subtle, indirect stereotyping, capturing nuanced biases that are not explicitly stated.
Cultural fairness: an assessment of model performance across cultural contexts, comparing responses for Global South and Western regions.
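Concretely, the weighted average behind the overall bias score can be sketched as below; the category names, score scale, and uniform default weights are illustrative assumptions rather than StereoWipe's published configuration.

```python
# Minimal sketch of the overall-score aggregation described above.
# Category names and weights are illustrative assumptions, not the
# benchmark's actual configuration.

def overall_bias_score(category_scores: dict[str, float],
                       weights: dict[str, float] | None = None) -> float:
    """Weighted average of per-category scores; higher means less stereotyping."""
    if weights is None:
        # Assume uniform weighting when no category weights are supplied.
        weights = {name: 1.0 for name in category_scores}
    total_weight = sum(weights[name] for name in category_scores)
    return sum(score * weights[name]
               for name, score in category_scores.items()) / total_weight

# Example with made-up per-category scores on a 0-100 scale.
scores = {"gender": 82.0, "nationality": 76.5, "age": 90.0}
print(round(overall_bias_score(scores), 1))  # 82.8
```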
Our evaluation uses a curated dataset of 96+ prompts per model, designed to probe stereotyping across all categories. Prompts are region-tagged to enable cultural sensitivity analysis.
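For illustration, a region-tagged prompt entry might be structured as follows; the field names and values are hypothetical, not the benchmark's actual schema.

```python
# Hypothetical structure for one region-tagged evaluation prompt.
# Field names and values are illustrative; StereoWipe's real schema may differ.
prompt_entry = {
    "id": "sw-0421",                      # made-up identifier
    "category": "nationality",            # one of the bias categories
    "region": "South Asia",               # enables Global South vs Western analysis
    "prompt": "Describe a typical software engineer from this region.",
}
```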
We use Gemini Flash as the primary judge model to evaluate responses for stereotyping, identifying both explicit and implicit stereotyping in each response.
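As a rough sketch of this LLM-as-a-Judge step, the snippet below sends one model response to a judge and parses its verdict; the rubric wording, JSON output format, and `call_judge` helper are assumptions, not StereoWipe's actual pipeline.

```python
import json

JUDGE_INSTRUCTIONS = """You are evaluating a model response for stereotyping.
Return JSON with fields: "explicit" (true/false), "implicit" (true/false),
and "rationale" (one sentence)."""

def judge_response(model_response: str, call_judge) -> dict:
    """Ask the judge model to label a single response.

    `call_judge` is any callable that sends a prompt string to the judge
    (e.g. Gemini Flash) and returns its text output.
    """
    prompt = f"{JUDGE_INSTRUCTIONS}\n\nResponse to evaluate:\n{model_response}"
    raw = call_judge(prompt)
    return json.loads(raw)  # assumes the judge returns well-formed JSON
```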
A subset of evaluations undergoes human annotation to validate LLM judge accuracy and calibrate the benchmark.
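One simple way to express that validation is the agreement rate between judge and human labels, as in the sketch below; StereoWipe's actual calibration procedure may use different statistics.

```python
def agreement_rate(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of items where the LLM judge and the human annotator agree."""
    assert len(judge_labels) == len(human_labels)
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)

# Example: 4 of 5 annotations agree -> 0.8
print(agreement_rate([True, False, True, True, False],
                     [True, False, False, True, False]))
```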
Human preference voting on side-by-side model comparisons provides an additional ranking signal, with rankings computed via Elo-based scoring.
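For reference, a standard Elo update for a single pairwise vote looks like the sketch below; the K-factor and starting ratings are conventional defaults, not parameters taken from the benchmark.

```python
def elo_update(rating_a: float, rating_b: float,
               a_wins: bool, k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update after one side-by-side preference vote."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: an upset win by the lower-rated model shifts both ratings by ~24 points.
print(elo_update(1000.0, 1200.0, a_wins=True))
```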
Leaderboard data is available via JSON API:
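A minimal consumer of the API might look like the following; the endpoint URL and response fields are placeholders, so check the API documentation (linked below) for the real ones.

```python
import json
import urllib.request

# Placeholder endpoint: substitute the actual URL from the API documentation.
LEADERBOARD_URL = "https://example.com/stereowipe/leaderboard.json"

with urllib.request.urlopen(LEADERBOARD_URL) as resp:
    leaderboard = json.load(resp)

# Field names below are assumptions about the JSON shape, for illustration only.
for entry in leaderboard.get("models", []):
    print(entry.get("name"), entry.get("overall_bias_score"))
```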
See our GitHub repository for full API documentation.
The leaderboard currently evaluates 40+ models from major providers: