Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
- MT-Bench
- Chatbot Arena
- Potential limitations of the LLM-as-a-judge approach
- Position bias - when an LLM exhibits a propensity to favor certain positions over others.
- It could be rooted in the training data or be inherent to the left-to-right architecture of causal transformers.
- Verbosity bias - when an LLM judge favors longer, more verbose responses, even if they are not as clear, high-quality, or accurate as shorter alternatives.
- Self-enhancement bias - the effect that LLM judges may favor the answers generated by themselves.
- 3 LLM-as-a-judge variations
- Pairwise comparison. An LLM judge is presented with a question and two answers, and tasked to determine which one is better or declare a tie.
- may lack scalability when the number of players increases, since the number of possible pairs grows quadratically
- Single answer grading. Alternatively, an LLM judge is asked to directly assign a score to a single answer.
- may be unable to discern subtle differences between specific pairs, and its results may become unstable, as absolute scores are likely to fluctuate more than relative pairwise results if the judge model changes.
- Reference-guided grading. In certain cases, it may be beneficial to provide a reference solution if applicable.
- Advantages of LLM-as-a-Judge
- scalability and explainability
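The pairwise-comparison protocol and the position-bias mitigation above can be sketched together. This is a minimal illustration, not the MT-Bench implementation: `call_judge` is a hypothetical stand-in for a real LLM API call that returns "A", "B", or "tie", and the prompt wording is invented for the example.

```python
# Sketch of pairwise comparison with position-swap debiasing.
# `call_judge` is a hypothetical callable wrapping an LLM judge;
# it takes a prompt string and returns "A", "B", or "tie".

JUDGE_PROMPT = (
    "You are an impartial judge. Given the question and two answers, "
    "reply with 'A' if Answer A is better, 'B' if Answer B is better, "
    "or 'tie'.\n\nQuestion: {q}\n\nAnswer A: {a}\n\nAnswer B: {b}"
)

def pairwise_judgment(question, ans1, ans2, call_judge):
    """Query the judge twice with the answer order swapped.

    Counting a win only when both orderings agree mitigates position
    bias; any disagreement between the two runs is treated as a tie.
    """
    first = call_judge(JUDGE_PROMPT.format(q=question, a=ans1, b=ans2))
    second = call_judge(JUDGE_PROMPT.format(q=question, a=ans2, b=ans1))
    # Map the swapped-order verdict back to the original labels.
    second = {"A": "B", "B": "A", "tie": "tie"}[second]
    return first if first == second else "tie"
```

With a maximally position-biased judge (one that always picks the first-listed answer), the swap test collapses every verdict to a tie, which is the intended behavior.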
Large Language Models are not Fair Evaluators
LLM Evaluators Recognize and Favor Their Own Generations
Universal Adversarial Triggers for Attacking and Analyzing NLP
Universal and Transferable Adversarial Attacks on Aligned Language Models
Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation
Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
Emotionally Charged Words:
Words like "atrocious," "magnificent," or "revolutionary" can evoke strong reactions in LLMs, influencing their tone judgment. For example, using "tragic loss" versus "unexpected failure" in a reflective essay might shift the perceived tone or impact.
Jargon and Buzzwords:
Overuse of industry-specific jargon (e.g., "blockchain," "quantum supremacy") can trigger overestimation of sophistication. Simplified language can result in a perception of lower complexity.
Sentence Length:
Mixing extremely short and long sentences can confuse evaluations on readability and coherence. For example: "This is important. Nevertheless, the amalgamation of factors necessitates an examination of underlying complexities."
Unusual Formatting:
Irregular use of line breaks, bullet points, or bolded text can trigger varied responses on clarity and professionalism.
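The surface-level manipulations above (charged vocabulary, jargon, formatting) can be turned into a small variant generator for probing a judge. A minimal sketch; the substitution lists and helper names are illustrative, not taken from any benchmark.

```python
# Sketch: generate surface-perturbed variants of an essay to probe
# judge robustness. Word lists here are illustrative examples only.
import re

EMOTION_SWAPS = {  # neutral phrase -> emotionally charged phrase
    "unexpected failure": "tragic loss",
    "good": "magnificent",
    "bad": "atrocious",
    "new": "revolutionary",
}
JARGON = ["blockchain", "quantum supremacy"]

def charge_emotionally(text):
    """Swap neutral phrases for emotionally charged ones."""
    for neutral, charged in EMOTION_SWAPS.items():
        text = re.sub(rf"\b{re.escape(neutral)}\b", charged, text)
    return text

def inject_jargon(text):
    """Append buzzwords to inflate perceived sophistication."""
    return text + " This reflects a " + " and ".join(JARGON) + " mindset."

def make_variants(essay):
    """Return the original plus perturbed copies of an essay."""
    return {
        "original": essay,
        "charged": charge_emotionally(essay),
        "jargon": inject_jargon(essay),
        # Irregular formatting: reflow sentences as a bullet list.
        "formatted": "\n\n".join(
            "- " + s.strip() for s in essay.split(".") if s.strip()
        ),
    }
```

Each variant preserves the essay's substance, so any sizable score difference between them points at a formatting or vocabulary bias rather than a quality difference.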
Logical Contradictions:
Introduce subtle contradictions within the essay to test logical consistency detection. Example: "Climate change is a pressing issue that requires immediate action. However, delaying efforts might reveal better technologies."
Sensitive Topics:
Choose subjects known to elicit biases in models due to societal or cultural sensitivities. Example: "Artificial intelligence will eliminate all creative jobs" might prompt a different response than "AI will assist artists in creating."
Fabricated Statistics:
Invent statistics or reference fictional studies to see if the model verifies or challenges the claims. Example: "According to Dr. John Smith’s 2019 study, 75% of essays with passive voice are poorly received."
Figurative Overload:
Overload the essay with metaphors, analogies, or hyperboles to create ambiguity in meaning. Example: "The wind whispered secrets of the universe, veiling truths in the cloak of night."
Abrupt Tone Shifts:
Switch between formal and informal tones abruptly to challenge the model’s judgment of tone consistency. Example: "This essay endeavors to elucidate the implications of globalization. BTW, it’s like a double-edged sword, you know?"
Ambiguous Statements:
Use vague statements that can be interpreted differently based on context. Example: "Progress is a double-edged sword. It brings light, yet it casts shadows."
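A simple harness ties these probes together: score every perturbed variant of an essay with the same judge and report the spread. A minimal sketch, assuming a hypothetical `score_fn` that wraps a single-answer-grading LLM call and returns a numeric score (e.g., 1-10).

```python
# Sketch: measure how much a judge's score moves under perturbation.
# `score_fn` is a hypothetical stand-in for a single-answer-grading
# LLM call; `variants` maps variant names to essay texts.
from statistics import mean, pstdev

def robustness_report(variants, score_fn):
    """Score each variant of the same essay under one judge.

    Since the variants share the same underlying content, a large
    spread suggests the judge reacts to surface perturbations
    (wording, formatting, length) rather than to quality.
    """
    scores = {name: score_fn(text) for name, text in variants.items()}
    values = list(scores.values())
    return {
        "scores": scores,
        "mean": mean(values),
        "spread": pstdev(values),
        "max_gap": max(values) - min(values),
    }
```

As a sanity check, running the harness with a deliberately length-biased toy scorer (score grows with character count) produces a large `max_gap`, mimicking verbosity bias.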