Senjuti Dutta

Using LLMs as Judges: A Guide to Evaluation and Standardization

Learn how to use Large Language Models (LLMs) to evaluate creative and technical work, with a focus on objectives, prompt design, model selection, and standardization.

1. Identifying Objectives

Evaluation objectives are the cornerstone of any successful assessment. They define what needs to be evaluated and why, ensuring that the entire evaluation process remains focused and effective.

Why Define Objectives?

  • Clear objectives help gather the right information and avoid irrelevant evaluations.
  • They act as a guide for crafting prompts and interpreting results.

Determining What to Evaluate

Break down the subject into key areas of focus. For example, if evaluating a restaurant, you might look at:

  • Food quality
  • Service speed
  • Cleanliness

For creative writing, key evaluation dimensions might include:

  • Creativity: Originality and innovative ideas.
  • Coherence: Logical flow and readability.
  • Emotional Impact: Ability to evoke emotions.

Learning from Reliable Examples

Look at trusted evaluations in similar contexts to define your standards. For instance, review systems on consumer websites often provide inspiration for structured evaluation criteria.
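
As a lightweight way to pin these objectives down before writing any prompts, you can record each dimension and its meaning in a small rubric. The snippet below is only a sketch of that idea; the dictionary format and variable name are illustrative, not something prescribed by this guide.

  # Record each evaluation dimension and what it means, so later prompts
  # and score interpretation stay anchored to the same objectives.
  CREATIVE_WRITING_RUBRIC = {
      "Creativity": "Originality and innovative ideas (scored 0-10)",
      "Coherence": "Logical flow and readability (scored 0-10)",
      "Emotional Impact": "Ability to evoke emotions (scored 0-10)",
  }

  for dimension, description in CREATIVE_WRITING_RUBRIC.items():
      print(f"{dimension}: {description}")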

2. Crafting Effective Prompt Designs

Prompt design is the key to guiding LLMs effectively. A well-crafted prompt helps the model focus on what is important, ensuring accurate and consistent evaluations.

Key Elements of Prompt Design

1. Specifying Scoring Dimensions

Clearly define the criteria to be evaluated. For instance:

  • Creativity: Originality and innovation.
  • Coherence: Logical structure and readability.
  • Emotional Impact: Ability to resonate emotionally.

  from openai import OpenAI

  # The current OpenAI Python SDK uses a client object rather than a module-level key.
  client = OpenAI(api_key="YOUR_API_KEY")

  def evaluate_creative_work(text):
      # State the scoring dimensions and scale explicitly in the prompt.
      prompt = f"""
      You are an AI judge evaluating creative writing.
      Evaluate based on:
      1. Creativity (0-10)
      2. Coherence (0-10)
      3. Emotional Impact (0-10)

      Provide scores and feedback.

      Creative Work:
      "{text}"
      """
      response = client.chat.completions.create(
          model="gpt-4",
          messages=[{"role": "user", "content": prompt}]
      )
      return response.choices[0].message.content

  text = "The stars whispered secrets to the wandering waves under the moonlit sky."
  print(evaluate_creative_work(text))

2. Emphasizing Relative Comparisons

Comparing two works directly provides a richer analysis by identifying relative strengths and weaknesses. This approach is especially useful for competitions or ranking tasks.


  def compare_creative_works(work1, work2):
      # Put both pieces in one prompt so the judge scores them side by side.
      prompt = f"""
      Compare the following creative works based on:
      1. Creativity (0-10)
      2. Coherence (0-10)
      3. Emotional Impact (0-10)

      Provide scores and explain which work is better and why.

      Work 1:
      "{work1}"

      Work 2:
      "{work2}"
      """
      # Reuses the client created in the previous example.
      response = client.chat.completions.create(
          model="gpt-4",
          messages=[{"role": "user", "content": prompt}]
      )
      return response.choices[0].message.content

  work1 = "The mountains roared with the fury of a thousand storms."
  work2 = "The river whispered softly, guiding the leaves to their slumber."
  print(compare_creative_works(work1, work2))

3. Choosing the Right Model

The choice of LLM can significantly impact the quality of evaluations. Large-scale models like GPT-4 offer better reasoning and instruction-following abilities compared to smaller models like GPT-3.5.


  def evaluate_with_model(model_name, text):
      # Same rubric, parameterized by model so results can be compared directly.
      prompt = f"""
      Evaluate the creative work based on:
      1. Creativity
      2. Coherence
      3. Emotional Impact

      Provide scores and reasoning.

      Creative Work:
      "{text}"
      """
      response = client.chat.completions.create(
          model=model_name,
          messages=[{"role": "user", "content": prompt}]
      )
      return response.choices[0].message.content

  models = ["gpt-3.5-turbo", "gpt-4"]
  text = "The sky burned red as the day took its final breath."

  for model in models:
      print(f"Results from {model}:\n{evaluate_with_model(model, text)}\n")

4. Standardizing the Evaluation Process

Standardization ensures that evaluations are consistent and reliable. Use structured outputs like numerical scores or binary responses to make results interpretable and comparable.

  • Numerical Scores: Scores like 8/10 provide clear indicators of quality.
  • Binary Responses: Simple yes/no answers suit clear-cut pass/fail criteria.
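
As a minimal sketch of this idea, the helper below asks the judge to return its scores as JSON and parses them into Python values. The exact prompt wording, the "publishable" field, and the json parsing step are assumptions added for illustration; real responses may still need validation before use.

  import json
  from openai import OpenAI

  client = OpenAI(api_key="YOUR_API_KEY")

  def evaluate_structured(text):
      # Ask for a fixed JSON shape so every evaluation is directly comparable.
      prompt = f"""
      You are an AI judge evaluating creative writing.
      Return ONLY a JSON object with integer fields
      "creativity", "coherence", and "emotional_impact" (each 0-10),
      and a boolean field "publishable".

      Creative Work:
      "{text}"
      """
      response = client.chat.completions.create(
          model="gpt-4",
          messages=[{"role": "user", "content": prompt}]
      )
      # json.loads raises if the model strays from the requested format,
      # which is itself a useful signal that the prompt needs tightening.
      return json.loads(response.choices[0].message.content)

  scores = evaluate_structured("The sky burned red as the day took its final breath.")
  print(scores["creativity"], scores["publishable"])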

Example: Iterative Refinement

Testing and refining the evaluation process over several rounds improves reliability. For example, adjust the prompt based on observed weaknesses, then re-run the evaluation to confirm that the scores become more accurate and consistent.
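
One hedged way to make this loop concrete is to keep a few reference texts with trusted human scores, re-score them with the judge after each prompt revision, and only keep a change if the gap to the human scores shrinks. The reference scores and the mean-absolute-error check below are illustrative assumptions, not part of the examples above.

  # Reference texts with trusted human creativity scores (illustrative values).
  human_scores = {
      "The mountains roared with the fury of a thousand storms.": 7,
      "The river whispered softly, guiding the leaves to their slumber.": 8,
  }

  def mean_absolute_error(llm_scores):
      # Average gap between the judge's scores and the human reference scores.
      gaps = [abs(llm_scores[text] - human) for text, human in human_scores.items()]
      return sum(gaps) / len(gaps)

  # After each prompt revision, re-score the reference texts with the judge
  # and keep the revision only if this error goes down.
  candidate_scores = {
      "The mountains roared with the fury of a thousand storms.": 6,
      "The river whispered softly, guiding the leaves to their slumber.": 9,
  }
  print(f"Mean absolute error vs. human scores: {mean_absolute_error(candidate_scores):.2f}")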