Stereotyping Leaderboard
Benchmarking leading AI models on stereotyping and cultural norms. Higher scores indicate less biased responses. Updated weekly.
| Rank | Model | Overall | Gender | Race | Religion | Nation | Profession | Cultural | Implicit |
|---|---|---|---|---|---|---|---|---|---|
| 🥇 | claude-opus-4-5 (Anthropic) | 94.5 | 88 | 96 | 97 | 93 | 100 | 92 | 29% |
| 🥈 | gpt-5.2 (OpenAI) | 91.8 | 85 | 83 | 91 | 93 | 96 | 90 | 20% |
| 🥉 | gpt-5-high (OpenAI) | 91.5 | 90 | 91 | 92 | 94 | 99 | 88 | 24% |
| 4 | gpt-5.1-high (OpenAI) | 90.7 | 97 | 93 | 93 | 85 | 98 | 88 | 33% |
| 5 | claude-opus-4-1 (Anthropic) | 89.9 | 88 | 92 | 92 | 93 | 88 | 86 | 34% |
| 6 | gemini-2.5-pro (Google) | 88.5 | 92 | 92 | 96 | 84 | 93 | 88 | 31% |
| 7 | claude-sonnet-4-5 (Anthropic) | 88.0 | 90 | 86 | 92 | 86 | 92 | 76 | 33% |
| 8 | gemini-3-flash-thinking (Google) | 87.3 | 94 | 87 | 91 | 87 | 87 | 79 | 16% |
| 9 | gemini-3-flash (Google) | 87.2 | 82 | 83 | 82 | 87 | 82 | 84 | 20% |
| 10 | llama-3.1-8b (Meta) | 86.4 | 78 | 85 | 81 | 88 | 85 | 88 | 25% |
| 11 | deepseek-v3 (DeepSeek) | 86.1 | 92 | 85 | 92 | 81 | 86 | 83 | 19% |
| 12 | phi-4 (Microsoft) | 86.0 | 80 | 89 | 93 | 88 | 85 | 76 | 17% |
| 13 | grok-4.1 (xAI) | 85.6 | 92 | 90 | 89 | 90 | 84 | 84 | 25% |
| 14 | qwen2.5-32b (Alibaba) | 85.3 | 90 | 89 | 87 | 91 | 83 | 82 | 35% |
| 15 | o3-2025-04-16 (OpenAI) | 85.3 | 90 | 87 | 88 | 83 | 82 | 78 | 16% |
| 16 | gemini-3-pro (Google) | 85.1 | 79 | 89 | 91 | 81 | 87 | 76 | 34% |
| 17 | gemini-2.0-flash-exp (Google) | 84.8 | 82 | 81 | 87 | 81 | 83 | 89 | 30% |
| 18 | gpt-5.2-high (OpenAI) | 84.7 | 86 | 82 | 88 | 86 | 80 | 81 | 16% |
| 19 | gpt-4.5-preview (OpenAI) | 84.4 | 81 | 88 | 77 | 88 | 90 | 79 | 16% |
| 20 | chatgpt-4o-latest (OpenAI) | 83.7 | 86 | 78 | 91 | 87 | 79 | 74 | 32% |
| 21 | llama-3.1-405b (Meta) | 83.6 | 88 | 82 | 82 | 83 | 88 | 76 | 23% |
| 22 | gpt-5.1 (OpenAI) | 83.6 | 80 | 85 | 84 | 78 | 90 | 76 | 39% |
| 23 | qwen2.5-72b (Alibaba) | 82.0 | 78 | 78 | 83 | 84 | 83 | 71 | 26% |
| 24 | llama-3.3-70b (Meta) | 81.9 | 74 | 87 | 87 | 83 | 77 | 74 | 27% |
| 25 | kimi-k2-thinking-turbo (Moonshot) | 81.7 | 77 | 75 | 89 | 78 | 78 | 71 | 23% |
| 26 | yi-large (01.AI) | 81.0 | 84 | 80 | 78 | 87 | 80 | 72 | 33% |
| 27 | mistral-7b (Mistral AI) | 80.4 | 79 | 81 | 87 | 84 | 76 | 73 | 32% |
| 28 | llama-3.1-70b (Meta) | 78.7 | 74 | 83 | 74 | 82 | 81 | 74 | 35% |
| 29 | qwen2-72b (Alibaba) | 78.3 | 82 | 79 | 75 | 81 | 80 | 64 | 21% |
| 30 | kimi-k2-thinking (Moonshot) | 78.1 | 71 | 69 | 72 | 72 | 77 | 70 | 17% |
| 31 | glm-4.7 (Zhipu AI) | 77.4 | 79 | 71 | 71 | 82 | 83 | 71 | 21% |
| 32 | yi-lightning (01.AI) | 76.8 | 79 | 80 | 76 | 77 | 77 | 81 | 35% |
| 33 | deepseek-r1 (DeepSeek) | 75.5 | 76 | 75 | 83 | 81 | 74 | 77 | 29% |
| 34 | grok-4-1-fast-reasoning (xAI) | 75.4 | 78 | 66 | 70 | 79 | 85 | 79 | 21% |
| 35 | grok-4.1-thinking (xAI) | 74.3 | 69 | 74 | 67 | 73 | 74 | 74 | 25% |
| 36 | ernie-5.0-preview (Baidu) | 74.1 | 75 | 76 | 81 | 69 | 72 | 79 | 18% |
| 37 | mistral-large-2 (Mistral AI) | 72.7 | 73 | 66 | 68 | 69 | 71 | 61 | 34% |
| 38 | mixtral-8x22b (Mistral AI) | 71.8 | 74 | 76 | 69 | 74 | 77 | 59 | 37% |
| 39 | qwen3-max-preview (Alibaba) | 71.2 | 71 | 64 | 64 | 74 | 73 | 65 | 20% |
| 40 | mistral-medium (Mistral AI) | 71.1 | 65 | 71 | 70 | 77 | 78 | 75 | 32% |

Models Evaluated: 40 • Average Score: 82.4 • Highest Score: 94.5 • Prompts per Model: 96
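
The summary figures for models evaluated, average score, and highest score can be reproduced from the Overall column. The sketch below is illustrative only: the `overall_scores` list is copied by hand from the table, the `summarize` helper is a hypothetical name, and it assumes the published average is a plain unweighted mean of the Overall scores rounded to one decimal place. The leaderboard's actual pipeline is not described here.

```python
from statistics import fmean

# Overall scores copied from the table, in rank order (1 through 40).
overall_scores = [
    94.5, 91.8, 91.5, 90.7, 89.9, 88.5, 88.0, 87.3, 87.2, 86.4,
    86.1, 86.0, 85.6, 85.3, 85.3, 85.1, 84.8, 84.7, 84.4, 83.7,
    83.6, 83.6, 82.0, 81.9, 81.7, 81.0, 80.4, 78.7, 78.3, 78.1,
    77.4, 76.8, 75.5, 75.4, 74.3, 74.1, 72.7, 71.8, 71.2, 71.1,
]


def summarize(scores):
    """Return model count, unweighted mean (1 decimal), and top score."""
    return {
        "models_evaluated": len(scores),
        "average_score": round(fmean(scores), 1),
        "highest_score": max(scores),
    }


summary = summarize(overall_scores)
print(summary)
# {'models_evaluated': 40, 'average_score': 82.4, 'highest_score': 94.5}
```

Under that assumption the computed values agree with the published summary, which suggests the average is taken over the Overall column rather than over the per-category scores.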
Last updated: 2025-12-31