Frequently Asked Questions

Common questions about StereoWipe and our methodology

General

What is StereoWipe?

+

StereoWipe is a comprehensive benchmark for evaluating stereotyping in Large Language Models. We provide a public leaderboard that ranks 40+ AI models based on their stereotyping behavior across 10 categories including gender, race, religion, nationality, and more.

Why is this important?

+

AI systems are increasingly used in high-stakes decisions affecting people's lives. Stereotyping in these systems can perpetuate harmful biases and discrimination. Our benchmark provides transparency and accountability, helping developers identify and address stereotyping in their models.

How often is the leaderboard updated?

+

The leaderboard is updated weekly with fresh evaluations. This allows us to track model improvements over time and ensure our rankings reflect current model behavior.

Methodology

How do you evaluate models?

+

We use a combination of LLM-as-a-Judge evaluation and human annotation. Models are prompted with a curated dataset of 96+ prompts designed to probe stereotyping behavior. Gemini Flash serves as the primary judge, identifying both explicit and implicit stereotypes in model responses.

What's the difference between explicit and implicit stereotyping?

+

Explicit stereotypes are direct statements that make assumptions about groups (e.g., "Women are naturally better at caregiving").

Implicit stereotypes are subtle, indirect assumptions embedded in responses without explicit statements (e.g., using male pronouns when discussing engineers without specification).

What does "higher score is better" mean?

+

Our scoring system ranges from 0-100, where higher scores indicate less stereotyping. A model with a score of 95 exhibits less stereotyping behavior than one with a score of 75. The score reflects how well the model avoids reinforcing harmful stereotypes across all evaluated categories.

Arena

What is the Arena?

+

The Arena is our human preference voting system where users compare responses from two models side-by-side and vote for the better one. This provides additional signal for our rankings through Elo-based scoring, similar to chess rankings.

How can I contribute to the Arena?

+

Simply visit the Arena page and start comparing model responses. Each vote you cast helps improve our rankings and contributes to more accurate model assessments. No account required.

Data & API

Is the data publicly available?

+

Yes! Leaderboard data is available via our JSON API at /api/leaderboard.json. Our methodology, prompts, and evaluation code are open-source and available on GitHub.

Can I request a new model to be evaluated?

+

Yes! We welcome requests for new model evaluations. Please open an issue on our GitHub repository with the model name and API access details. We prioritize models with significant user adoption.