
Leaderboard of Open LLMs Ranked by LLM Judges

Machine-Learning Large-Language-Models Leaderboard Evaluation
Ryan Gibson
Quantitative Analyst | Computer Scientist

With the rapid growth of open LLMs, choosing the best one for a specific task has become quite daunting, especially if the performance cannot be easily quantified.

Moreover, existing rankings often average performance across multiple benchmarks, which can over-emphasize narrow use cases and miss broader practical applications.[1] As such, I wanted to design and run a more open-ended benchmark with a straightforward interpretation.

Results

[Figure: Scatterplot of LLM quality vs. model size. GPT-4o Mini ranks highest, scoring slightly above 9/10 on average. The other models show increasing quality with size, following a roughly logarithmic trend. In order of rough quality (highest to lowest): Qwen 2.5 (Alibaba, Sep 2024), Gemma 2 (Google, Jun 2024), Llama 3.1/3.2 (Meta, Jul-Sep 2024), and Phi 3/3.5 (Microsoft, Jun-Aug 2024).]
Average response quality (1 to 10) as a function of model size. Note that the y-axis is truncated at 5/10.

Here, we focus on popular models accessible to typical consumers, so the model size[2] is capped at 16 GiB, which is the upper limit of VRAM available on most high-end consumer GPUs.[3]

While many systems offer more total RAM, running models larger than 16 GiB on the CPU is usually slow enough to be frustrating in practice.
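As a rough sanity check on that cutoff, the in-memory size of a quantized model can be estimated from its parameter count and the quantization's average bits per weight. The parameter counts and bits-per-weight figures below are approximate (and real GGUF files carry some extra metadata), so treat this as a back-of-the-envelope sketch rather than exact numbers:

```python
# Back-of-the-envelope estimate of quantized model sizes vs. a 16 GiB budget.
# Bits-per-weight values are rough llama.cpp averages and parameter counts are
# approximate; real GGUF files also include metadata, so actual sizes differ.

GIB = 2**30

def weight_size_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return params_billions * 1e9 * bits_per_weight / 8 / GIB

for name, params, bpw in [
    ("qwen2.5:32b @ Q4_K_M", 32.5, 4.85),  # too large for 16 GiB
    ("qwen2.5:32b @ Q3_K_M", 32.5, 3.9),   # fits, hence the one exception below
    ("gemma2:27b  @ Q4_K_M", 27.2, 4.85),  # fits
]:
    size = weight_size_gib(params, bpw)
    verdict = "fits in" if size <= 16 else "exceeds"
    print(f"{name}: ~{size:.1f} GiB ({verdict} 16 GiB)")
```

This is also why the largest model in the benchmark needs a smaller quantization, as noted in the methodology below.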

Methodology

Each LLM was evaluated using a set of 30 questions, with 3 trials per question. Responses were scored on a scale of 1 to 10 by OpenAI’s GPT-4o Mini.
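The post doesn't spell out the collection harness, but conceptually it is a nested loop over models, questions, and trials. A minimal sketch, assuming the open models are served locally through Ollama's Python client (the model tags and question list here are placeholders, not the full benchmark):

```python
import json
import ollama  # assumes the open models are served locally via Ollama

MODELS = ["llama3.2:1b", "llama3.2:3b", "gemma2:9b", "qwen2.5:14b"]  # placeholder subset
QUESTIONS = ["Calculate the sum of the first 20 prime numbers."]     # placeholder subset
TRIALS = 3

responses = []
for model in MODELS:
    for q_id, question in enumerate(QUESTIONS):
        for trial in range(TRIALS):
            reply = ollama.chat(
                model=model,
                messages=[{"role": "user", "content": question}],
            )
            responses.append({
                "model": model,
                "question_id": q_id,
                "trial": trial,
                "answer": reply["message"]["content"],
            })

# Each (model, question, trial) answer is then sent to the judge for a 1-10 score.
with open("responses.json", "w") as f:
    json.dump(responses, f, indent=2)
```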

The models tested in this benchmark include:

  • Llama 3.2 (1B, 3B) and Llama 3.1 (8B) by Meta
  • Gemma 2 (2B, 9B, 27B) by Google
  • Phi 3.5 (3.8B) and Phi 3 (14B) by Microsoft
  • Qwen 2.5 (1.5B, 3B, 7B, 14B, 32B) by Alibaba

All models were run with Q4_K_M quantization, except for qwen2.5:32b, which used Q3_K_M to fit within the 16 GiB limit. Q4_K_M is a popular default because it significantly reduces memory usage without substantially impacting output quality.

Importantly, using a consistent quantization avoids unfairly penalizing models with older or less optimal defaults.[4]
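The post doesn't list the exact model tags, but one way to enforce a consistent quantization is to pin explicit quantized tags when pulling each model. A sketch along these lines, again assuming Ollama (the tag names are illustrative and may not match the Ollama library exactly):

```python
import ollama

# Explicit quantization tags so every model uses Q4_K_M, except the 32B model,
# which drops to Q3_K_M to stay under the 16 GiB limit. Tags are illustrative.
MODEL_TAGS = [
    "llama3.2:3b-instruct-q4_K_M",
    "llama3.1:8b-instruct-q4_K_M",
    "gemma2:27b-instruct-q4_K_M",
    "qwen2.5:14b-instruct-q4_K_M",
    "qwen2.5:32b-instruct-q3_K_M",  # the one exception
]

for tag in MODEL_TAGS:
    print(f"Pulling {tag} ...")
    ollama.pull(tag)
```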

Question Set

Six categories were selected, each with five short, open-ended questions ranging in difficulty from medium to extremely high.

  1. General Knowledge
  2. Mathematics
  3. Programming and Computer Science
  4. Language and Linguistics
  5. Creative Writing
  6. Ethics and Philosophy

Taking mathematics as an example, one of the easier questions is:

Calculate the sum of the first 20 prime numbers.

For comparison, a harder question is:

Let G be a non-abelian group of order 168, and let H be a subgroup of G of order 24; prove that the normalizer of H in G has order exactly 56.

Notably, many of the requests cannot be easily evaluated or scored, and thus would be overlooked by typical automated benchmarks.

See “Evaluating LLM Performance via LLM Judges” for the complete set of questions.
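To make the setup concrete, the question set can be represented as a simple mapping from category to questions. The structure below is hypothetical, and only the two mathematics examples quoted above are filled in; the full 30-question set is in the companion post:

```python
# Hypothetical layout of the question set: 6 categories x 5 questions each.
# Only the two mathematics examples quoted above are included here.
QUESTION_SET = {
    "General Knowledge": [],
    "Mathematics": [
        "Calculate the sum of the first 20 prime numbers.",
        "Let G be a non-abelian group of order 168, and let H be a subgroup "
        "of G of order 24; prove that the normalizer of H in G has order "
        "exactly 56.",
    ],
    "Programming and Computer Science": [],
    "Language and Linguistics": [],
    "Creative Writing": [],
    "Ethics and Philosophy": [],
}

assert len(QUESTION_SET) == 6  # 6 categories x 5 questions = 30 in the full set
```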

Judge Process

OpenAI’s GPT-4o Mini evaluated each question-answer pair using a two-step process. In short, it was asked to:

  1. Analyze the question and generate its own answer.
    This step allows the judge to “think” through the problem independently before evaluating the provided answer, helping to reduce bias and improve judgement quality.

  2. Rate the provided answer across five criteria (Correctness, Completeness, Clarity, Relevance, and Conciseness) and assign an overall score from 1 to 10.

See “Evaluating LLM Performance via LLM Judges” for the additional instructions and the exact prompt used in this process.
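That post documents the exact wording; purely as an illustration, the two API calls behind each rating might look roughly like this with the OpenAI Python client (the prompt text and structure here are my own approximation, not the original):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
JUDGE_MODEL = "gpt-4o-mini"

def judge(question: str, answer: str) -> str:
    # Step 1: let the judge work through the problem on its own first.
    own_attempt = client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[{"role": "user",
                   "content": f"Answer the following question:\n\n{question}"}],
    ).choices[0].message.content

    # Step 2: rate the candidate answer across the five criteria and assign
    # an overall 1-10 score, with the judge's own attempt as context.
    rating_prompt = (
        f"Question:\n{question}\n\n"
        f"Your own answer (for reference):\n{own_attempt}\n\n"
        f"Candidate answer to evaluate:\n{answer}\n\n"
        "Rate the candidate answer on correctness, completeness, clarity, "
        "relevance, and conciseness, then give an overall score from 1 to 10."
    )
    return client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[{"role": "user", "content": rating_prompt}],
    ).choices[0].message.content
```

The two requests per rating are also where the factor of 2 in footnote 5’s evaluation-count estimate comes from.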

Is GPT-4o Mini an Impartial Judge?

By design, LLMs are least perplexed by their own output, which could lead them to “prefer” their own responses during evaluation, even though the responses are not tagged with the name of the model that generated them.

To investigate this potential bias, I had Gemma 2 (27B) judge a random sample of 100 non-GPT and 25 GPT responses and compared its scores with GPT-4o Mini’s judgements.

A scatterplot "Comparison Between GPT-4o Mini and Gemma 2 (27B) Judgements" showing that the two models' ratings
typically agree within 1/10 or 2/10. Gemma 2 tends to give lower ratings overall.
Gemma 2 seems to apply the rating scale slightly differently, but overall its judgements are fairly consistent with GPT-4o Mini.

These results suggest that any systematic bias is minimal. After adjusting for Gemma’s tendency to give slightly lower ratings,

  • 80% of the time, GPT-4o Mini’s and Gemma’s scores agree within 1 point (out of 10).
  • 95% of the time, their ratings are within 2 points.

Interestingly, GPT-4o Mini appears to be slightly more impartial than Gemma here.

  • GPT-4o Mini rates its own responses an average of 0.1 points higher than Gemma rates them.
  • In contrast, Gemma rates its own model family’s responses an average of 0.3 points higher than GPT-4o Mini rates them.
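Measuring that agreement is straightforward once both judges have scored the same sample of responses. A minimal sketch of the offset-adjusted comparison (the arrays in the usage example are made up, not the actual scores):

```python
import numpy as np

def judge_agreement(gpt_scores, gemma_scores):
    """Fraction of paired ratings within 1 and 2 points, after removing
    the mean offset between the two judges."""
    gpt = np.asarray(gpt_scores, dtype=float)
    gemma = np.asarray(gemma_scores, dtype=float)

    offset = np.mean(gpt - gemma)          # Gemma's tendency to rate lower
    diff = np.abs(gpt - (gemma + offset))  # offset-adjusted disagreement

    return {
        "mean_offset": offset,
        "within_1": np.mean(diff <= 1),
        "within_2": np.mean(diff <= 2),
    }

# Placeholder usage with made-up scores for the same five responses:
print(judge_agreement([9, 7, 8, 6, 10], [8, 7, 7, 6, 9]))
```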

Why not use multiple judges?

I initially considered aggregating ratings from multiple models, but the computational demands made this impractical for a simple experiment.[5]

Additionally, prompt engineering for several judge models at once is highly challenging, to say the least. Techniques that work well with OpenAI’s models generally don’t transfer seamlessly to others.

Ultimately, using a single, well-calibrated judge like GPT-4o Mini strikes a decent balance between feasibility and meaningful results. Its analysis also (mostly) matches my own experience, making it a good starting point for a broad, interpretable comparison of model performance.

See also and references



  1. For example, Hugging Face’s Open LLM Leaderboard evaluates models by averaging several benchmarks on instruction following (IFEval), complex reasoning (BBH, MuSR), advanced math (MATH Level 5), graduate-level science questions (GPQA), and multi-domain multiple choice tasks (MMLU-PRO). ↩︎

  2. Model size here refers to the size of the neural network parameters in memory, excluding additional memory required for inputs, outputs, and context. GiB (power-of-two gibibytes) are used to align with VRAM specifications. See this Wikipedia article for more details. ↩︎

  3. According to the September 2024 Steam Hardware Survey, 84% of users have GPUs with 4–16 GiB of VRAM. The most common configurations are 8 GiB and 12 GiB. ↩︎

  4. For instance, earlier models like Llama 3.1, Gemma 2, and Phi 3 default to the legacy Q4_0 quantization, which has a worse memory-to-quality trade-off. ↩︎

  5. The analysis in this post took more than a day of local computation, partially due to the use of an 8-year-old high-end GPU.

    So, you can imagine how intensive an extra batch of \( (30 \text{ questions}) \cdot (3 \text{ trials per question}) \cdot (14 \text{ LLMs}) \cdot (2 \text{ queries per judge rating}) \cdot (n \text{ judges}) = 2520n \) additional evaluations would be.

    I have run many multi-day experiments before, but simply outsourcing this process to OpenAI took mere minutes and cost a few cents. ↩︎
