
Leaderboard of Open LLMs Ranked by LLM Judges

Ryan Gibson
Quantitative Analyst | Computer Scientist

With the rapid growth of open LLMs, choosing the best one for a specific task has become quite daunting, especially if the performance cannot be easily quantified.

Moreover, existing rankings often average performance across multiple benchmarks, which can over-emphasize narrow use cases and miss broader practical applications.[1] As such, I wanted to design and run a more open-ended benchmark with a straightforward interpretation.

Results

[Figure: scatterplot of average response quality (1 to 10) vs. model size. GPT-4o Mini ranks highest, scoring slightly above 9/10 on average; the other models show quality increasing with size, following a roughly logarithmic trend. In rough order of quality (highest to lowest): Qwen 2.5 (Alibaba, Sep 2024), Gemma 2 (Google, Jun 2024), Llama 3.1/3.2 (Meta, Jul-Sep 2024), and Phi 3/3.5 (Microsoft, Jun-Aug 2024).]
Average response quality (1 to 10) as a function of model size. Note that the y-axis is truncated at 5/10.

Here, we focus on popular models accessible to typical consumers, so the model size[2] is capped at 16 GiB, which is the upper limit of VRAM available on most high-end consumer GPUs.[3]

While many systems offer more total RAM, using models larger than 16 GiB on CPUs can be slow enough to be frustrating to use.

Methodology

Each LLM was evaluated using a set of 30 questions, with 3 trials per question. Responses were scored on a scale of 1 to 10 by OpenAI’s GPT-4o Mini.
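
The collection step itself is simple: query each model on each question, three times. Below is a rough sketch of that loop, assuming the models are served locally through Ollama’s Python client (the model tags are illustrative, and the actual harness may differ in its details):

```python
import ollama  # local model runner; assumes each model has already been pulled

# Illustrative tags; the real run covers every model listed below.
MODELS = ["llama3.2:1b", "llama3.1:8b", "gemma2:9b", "qwen2.5:7b"]
TRIALS = 3

def collect_responses(questions: list[str]) -> list[dict]:
    """Ask every model every question, with several trials per question."""
    results = []
    for model in MODELS:
        for question in questions:
            for trial in range(TRIALS):
                reply = ollama.chat(
                    model=model,
                    messages=[{"role": "user", "content": question}],
                )
                results.append({
                    "model": model,
                    "question": question,
                    "trial": trial,
                    "answer": reply["message"]["content"],
                })
    return results
```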

The models tested in this benchmark include:

  • Llama 3.2 (1B, 3B) and Llama 3.1 (8B) by Meta
  • Gemma 2 (2B, 9B, 27B) by Google
  • Phi 3.5 (3.8B) and Phi 3 (14B) by Microsoft
  • Qwen 2.5 (1.5B, 3B, 7B, 14B, 32B) by Alibaba

All models were run with Q4_K_M quantization, except for qwen2.5:32b, which used Q3_K_M to fit within the 16 GiB limit. Q4_K_M is a popular default because it significantly reduces memory usage without substantially impacting output quality.

Importantly, using a consistent quantization avoids unfairly penalizing models with older or less optimal defaults.[4]
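
For a back-of-the-envelope sense of why the 32B model needed a smaller quantization, the sketch below estimates weight memory from the parameter count and an approximate bits-per-weight figure. The bits-per-weight values are rough assumptions for these K-quants rather than exact numbers:

```python
GIB = 2**30  # power-of-two gibibyte, matching how VRAM is specified

# Rough average bits per weight for llama.cpp-style K-quants (approximate values).
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q3_K_M": 3.9}

def weight_memory_gib(params_billions: float, quant: str) -> float:
    """Estimate the size of the quantized weights alone (no KV cache or activations)."""
    total_bytes = params_billions * 1e9 * BITS_PER_WEIGHT[quant] / 8
    return total_bytes / GIB

print(f"{weight_memory_gib(32, 'Q4_K_M'):.1f} GiB")  # ~17.9 GiB, over the 16 GiB cap
print(f"{weight_memory_gib(32, 'Q3_K_M'):.1f} GiB")  # ~14.5 GiB, fits
```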

Question Set

Six categories were selected, each with five short, open-ended questions ranging in difficulty from medium to extremely high.

  1. General Knowledge
  2. Mathematics
  3. Programming and Computer Science
  4. Language and Linguistics
  5. Creative Writing
  6. Ethics and Philosophy

Taking mathematics as an example, one of the easier questions is:

Calculate the sum of the first 20 prime numbers.
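
(As a sanity check on this example, and not part of the benchmark itself: the expected answer is 639, which a few lines of Python confirm.)

```python
def first_n_primes(n: int) -> list[int]:
    """Return the first n primes via simple trial division."""
    primes: list[int] = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

print(sum(first_n_primes(20)))  # 639
```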

For comparison, a harder question is:

Let G be a non-abelian group of order 168, and let H be a subgroup of G of order 24; prove that the normalizer of H in G has order exactly 56.

Notably, many of the requests cannot be easily evaluated or scored, and thus would be overlooked by typical automated benchmarks.

See “Evaluating LLM Performance via LLM Judges” for the complete set of questions.

Judge Process

OpenAI’s GPT-4o Mini evaluated each question-answer pair using a two-step process. In short, it was asked to

  1. Analyze the question and generate its own answer.
    This step allows the judge to “think” through the problem independently before evaluating the provided answer, helping to reduce bias and improve judgement quality.

  2. Rate the provided answer across five criteria: Correctness, Completeness, Clarity, Relevance, Conciseness.

  3. Assign an overall score from 1 to 10.

See “Evaluating LLM Performance via LLM Judges” for the additional instructions and the exact prompt used in this process.
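
For concreteness, here is a minimal sketch of that two-step flow, assuming the OpenAI Python SDK and a heavily condensed rubric rather than the exact prompt from that post:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
JUDGE = "gpt-4o-mini"

def judge(question: str, answer: str) -> str:
    # Step 1: have the judge work through the problem and draft its own answer.
    own_answer = client.chat.completions.create(
        model=JUDGE,
        messages=[{"role": "user", "content": f"Answer the following question:\n{question}"}],
    ).choices[0].message.content

    # Step 2: rate the candidate answer on the five criteria, then give a 1-10 score.
    rubric = (
        "Using your own answer as a reference, rate the candidate answer on "
        "correctness, completeness, clarity, relevance, and conciseness, "
        "then assign an overall score from 1 to 10."
    )
    prompt = (
        f"Question:\n{question}\n\nYour answer:\n{own_answer}\n\n"
        f"Candidate answer:\n{answer}\n\n{rubric}"
    )
    return client.chat.completions.create(
        model=JUDGE,
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
```

Note that this is two API calls per rating, which is where the “2 queries per judge rating” factor in the cost estimate below comes from.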

Is GPT-4o Mini an Impartial Judge?

By design, LLMs are least perplexed by their own output, which could lead a judge to “prefer” its own responses during evaluation, even though the responses are not tagged with the name of the model that generated them.

To investigate this potential bias, I had Gemma 2 (27B) judge a random sample of 100 non-GPT and 25 GPT responses and compared its scores with GPT-4o Mini’s judgements.

[Figure: comparison between GPT-4o Mini and Gemma 2 (27B) judgements; the two models' ratings typically agree within 1/10 or 2/10, with Gemma 2 giving lower ratings overall.]
Gemma 2 seems to apply the rating scale slightly differently, but overall its judgements are fairly consistent with GPT-4o Mini.

These results suggest that any systematic bias is minimal. After adjusting for Gemma’s tendency to give slightly lower ratings,

  • 80% of the time, GPT-4o Mini and Gemma’s scores agree within 1/10.
  • 95% of the time, their ratings are within 2/10.

Interestingly, GPT-4o Mini appears to be slightly more impartial than Gemma here.

  • GPT-4o Mini rates its own responses +0.1/10 higher than Gemma’s ratings, on average.
  • In contrast, Gemma rates its family’s responses +0.3/10 higher than GPT’s ratings, on average.
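
These agreement figures come from a straightforward comparison of paired scores. A sketch of that calculation, assuming the two judges’ ratings for the same sampled responses are collected into parallel lists (names are illustrative):

```python
import numpy as np

def agreement_stats(gpt_scores: list[float], gemma_scores: list[float]) -> dict:
    """Compare paired 1-10 ratings from the two judges on the same responses."""
    gpt = np.asarray(gpt_scores, dtype=float)
    gemma = np.asarray(gemma_scores, dtype=float)

    # Adjust for Gemma's tendency to rate slightly lower overall.
    offset = (gpt - gemma).mean()
    diff = np.abs(gpt - (gemma + offset))

    return {
        "mean_offset": offset,
        "within_1": (diff <= 1).mean(),  # ~0.80 in this experiment
        "within_2": (diff <= 2).mean(),  # ~0.95 in this experiment
    }
```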

Why not use multiple judges?

I initially considered aggregating ratings from multiple models, but the computational demands made this impractical for a simple experiment.[5]

Additionally, simultaneous prompt engineering across different models is highly challenging, to say the least. Techniques that work well with OpenAI’s models generally don’t transfer seamlessly to others.

Ultimately, using a single, well-calibrated judge like GPT-4o Mini strikes a decent balance between feasibility and meaningful results. Its analysis also (mostly) matches my own experience, making it a good starting point for a broad, interpretable comparison of model performance.

See also and references



  1. For example, Hugging Face’s Open LLM Leaderboard evaluates models by averaging several benchmarks on instruction following (IFEval), complex reasoning (BBH, MuSR), advanced math (MATH Level 5), graduate-level science questions (GPQA), and multi-domain multiple choice tasks (MMLU-PRO).

  2. Model size here refers to the size of the neural network parameters in memory, excluding additional memory required for inputs, outputs, and context. GiB (power-of-two gibibytes) are used to align with VRAM specifications. See this Wikipedia article for more details.

  3. According to the September 2024 Steam Hardware Survey, 84% of users have GPUs with 4–16 GiB of VRAM. The most common configurations are 8 GiB and 12 GiB.

  4. For instance, earlier models like Llama 3.1, Gemma 2, and Phi 3 default to the legacy Q4_0 quantization, which has a worse memory-to-quality trade-off.

  5. The analysis in this post took more than a day of local computation, partially due to the use of an 8-year-old high-end GPU.

    So, you can imagine how intensive an extra batch of \( (30 \text{ questions}) \cdot (3 \text{ trials per question}) \cdot (14 \text{ LLMs}) \cdot (2 \text{ queries per judge rating}) \cdot (n \text{ judges}) = 2520n \) additional evaluations would be.

    I have run many multi-day experiments before, but simply outsourcing this process to OpenAI took mere minutes and cost a few cents.

Related

  • Evaluating LLM Performance via LLM Judges: Methodology details for how LLMs can rate the performance of other LLMs.
  • How to Run LLMs Larger than RAM: A short experiment on running larger LLMs on low-end consumer hardware, with comments on performance trade-offs and practicality.