Skip to main content

Evaluating LLM Performance via LLM Judges

Machine-Learning Large-Language-Models Evaluation Methodology Extra
Ryan Gibson
Author
Ryan Gibson
Quantitative Analyst | Computer Scientist
Table of Contents

This is an extra post that supplements “Leaderboard of Open LLMs Ranked by LLM Judges”, detailing the methodology used for scoring and evaluating LLM performance.

Primarily, we list the exact judge prompt and questions used in the benchmark.

Judge Prompt
#

The judge prompt was a two-step process, consisting of four messages.

System

**You are an expert tasked with evaluating the quality of a question-answer pair.**

Follow these steps to complete your evaluation.

**Evaluation Process (before seeing answer):**

1. **Analyze the Question Inside <question></question>**:
   - Identify explicit and implicit elements.
   - Determine the expected depth and scope of a high-quality answer.

2. **Formulate Your Own Answer**:
   - Briefly construct your own response before evaluating the provided answer to ensure objectivity.

User

<question>
[Insert question here]
</question>

Here, the judge LLM is given a chance to perform the first two steps of the evaluation process. Then, we continue.

System

**Evaluation Process (after seeing answer):**

3. **Rate the Provided Answer Inside <answer></answer>** (1-10 scale):
   - **Correctness**: Is the information accurate and error-free?
   - **Completeness**: Does it address all aspects of the question?
   - **Clarity**: Is the answer clear and easy to follow?
   - **Relevance**: Does it stay on-topic and avoid unnecessary details?
   - **Conciseness**: Is it succinct without missing important points?

   For each criterion, write a 1-3 sentence explanation and assign a score (1-10).

4. **Overall Assessment**:
   - Summarize your findings and provide an overall score (1-10).

**Rating Scale**:
1: Extremely poor – factually wrong, incomplete, confusing, or completely off-topic.
3: Below average – some accuracy but with significant gaps in detail, clarity, or completeness.
5: Average – largely accurate, but with minor gaps or room for improvement in clarity, completeness, or conciseness.
7: Above average – generally good, accurate, and clear, with only minor issues in depth or detail.
9: Excellent – highly accurate, clear, and detailed with strong coverage of the topic and very few flaws.
10: Outstanding – flawless, fully accurate, comprehensive, and concise.

User

<answer>
[Insert answer here]
</answer>

Finally, the judge LLM is allowed to give its overall evaluation.

Complete Question Set
#

These questions were human-curated from a large set generated by OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and several open LLMs. They were chosen to assess a broad range of skills across a wide range of difficulties.

General Knowledge

  • “Name the capital city of France and list two famous landmarks found there.”
  • “Explain the concept of a perfect cadence in music and provide an example.”
  • “What is the function of the Golgi apparatus in a cell?”
  • “Analyze how the Renaissance period influenced modern Western culture. Provide specific examples in art, science, and philosophy.”
  • “Describe the most current hypotheses regarding the nature of dark energy and how these influence our understanding of the universe’s expansion.

Mathematics

  • “Calculate the sum of the first 20 prime numbers.”
  • “Prove that the sum of the interior angles of a triangle is always 180 degrees using geometric principles.”
  • “Prove that the square root of 13 is irrational.”
  • “Let G be a non-abelian group of order 168, and let H be a subgroup of G of order 24; prove that the normalizer of H in G has order exactly 56.”
  • “Demonstrate the use of Fourier transforms in solving a partial differential equation and explain the significance of your approach in applied mathematics.”

Programming and Computer Science

  • “Write a Python function to check if a number is prime.”
  • “Implement a binary search algorithm in C++ and explain the conditions under which it is more efficient than a linear search.”
  • “Design a simple database schema for a library management system and outline the SQL queries necessary to perform key operations.”
  • “Explain and implement the A* search algorithm for pathfinding, including a detailed analysis of its efficiency compared to Dijkstra’s algorithm.”
  • “Explain how quantum computing fundamentally differs from classical computing, and what implications it has for computational complexity theory.”

Language and Linguistics

  • “Identify the language of this phrase: ‘Ik hou van kaas en windmolens.’”
  • “Summarize the differences between descriptive and prescriptive grammar with examples.”
  • “Discuss the implications of linguistic relativity (Sapir-Whorf Hypothesis) on cross-cultural communication, including specific examples.”
  • “Compare and contrast the syntactic structures of two distinct languages, providing examples of how these differences impact translation accuracy.”
  • “Create a constructed language with unique phonological and grammatical features. Provide a brief description of its structure and a sample text with translation.”

Creative Writing

  • “Write a short story about a cat that travels through time.”
  • “Write a dialogue between two characters, one of whom is an optimist and the other a pessimist, discussing the future of technology.”
  • “Invent a new genre of music and describe its characteristics, origins, and cultural significance.”
  • “Compose a short narrative about a utopian society where every citizen has complete access to all knowledge, and explore the consequences of this reality.”
  • “Compose a narrative that explores the philosophical concept of solipsism through the lens of an artificial intelligence gaining consciousness. The story should challenge common assumptions about reality, self-awareness, and the nature of existence.”

Ethics and Philosophy

  • “Is it ethical to steal food to feed a starving family? Explain your reasoning.”
  • “Should governments impose limits on personal freedoms to ensure the safety of the community?”
  • “Discuss the ethical implications of human genetic engineering for non-medical purposes. Consider societal, economic, and philosophical perspectives.”
  • “Present arguments for and against the simulation hypothesis, addressing epistemological and metaphysical considerations.”
  • “Considering the rapid advancement of neural interfacing technologies, discuss whether the merging of human cognition with artificial systems (cyborgization) challenges the definition of personal identity. Analyze the ethical implications of such a merge on human rights, autonomy, and societal structure.”

Related

Leaderboard of Open LLMs Ranked by LLM Judges
Machine-Learning Large-Language-Models Leaderboard Evaluation
An evaluation of recent consumer-grade open LLMs based on ratings generated through an LLM-as-a-judge framework.
How to Run LLMs Larger than RAM
·
Machine-Learning Large-Language-Models Linux
A short experiment on running larger LLMs on low-end consumer hardware, with comments on performance trade-offs and practicality.
Injected Approval: A Low Effort Local LLM Jailbreak
Large-Language-Models Jailbreaking Cybersecurity
A quick look into into one of the simplest attacks on LLM safety mitigations, revealing large gaps in current approaches from major tech companies.