user
~25 tokens
What's the current best practice for evaluating LLM output quality beyond simple metrics like BLEU or ROUGE?
assistant
~200 tokens
The field has moved significantly beyond n-gram overlap metrics. Here's the current landscape:
**1. LLM-as-Judge (most practical)** — Use a strong model to evaluate outputs. Pairwise comparison is more reliable than absolute scoring.
**2. Human Evaluation Frameworks** — Chatbot Arena (LMSYS) collects blind pairwise comparisons from real users; RLHF-style preference rankings serve the same role with recruited annotators.
**3. Task-Specific Benchmarks** — factual accuracy (FActScore), instruction following (IFEval), safety (red-teaming), code (HumanEval/MBPP, execution-based).
**4. Composite Metrics** — AlpacaEval 2.0 (length-controlled LLM judge), MT-Bench (multi-turn coherence).
**Practical recommendation**: Start with LLM-as-Judge using pairwise comparisons and a detailed rubric. Add human evaluation for high-stakes decisions.
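The execution-based code benchmarks in point 3 are worth a concrete illustration, since they sidestep text-similarity metrics entirely: you run the generated code against unit tests. A minimal sketch (the `solution` name, candidate source, and test cases are illustrative, not from any real benchmark):

```python
# Minimal sketch of an execution-based code eval in the HumanEval/MBPP style.
# The candidate source and test cases below are illustrative placeholders.

def run_candidate(candidate_src: str, test_cases: list[tuple]) -> bool:
    """Execute generated code and check it against unit tests (pass@1 style)."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # NOTE: sandbox untrusted code in production
        solution = namespace["solution"]
        return all(solution(*args) == expected for args, expected in test_cases)
    except Exception:
        return False  # any crash or wrong answer counts as a failure

# Example: a model-generated answer to "return the sum of a list"
generated = "def solution(xs):\n    return sum(xs)"
tests = [(([1, 2, 3],), 6), (([],), 0)]
print(run_candidate(generated, tests))  # → True
```

Real harnesses add timeouts and process isolation, but the pass/fail-by-execution core is this simple.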
user
~16 tokens
How do I set up LLM-as-Judge practically? What prompts work?
assistant
~160 tokens
Here's a battle-tested setup:
**Step 1: Define your rubric** — Accuracy (1-5), Completeness (1-5), Clarity (1-5), Usefulness (1-5)
**Step 2: Judge prompt (pairwise)** — Present question + both responses, ask judge to analyze strengths/weaknesses then verdict.
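A sketch of what that pairwise prompt can look like — the exact wording here is illustrative, not a standard template, and the verdict format (`A` / `B` / `TIE` on the last line) is an assumption you'd adapt to your parser:

```python
# One possible pairwise judge prompt; wording is illustrative, not canonical.
JUDGE_PROMPT = """You are an impartial evaluator.

[Question]
{question}

[Response A]
{response_a}

[Response B]
{response_b}

First, analyze the strengths and weaknesses of each response against this
rubric: accuracy, completeness, clarity, usefulness (1-5 each).
Then output a final verdict on its own last line: "A", "B", or "TIE".
"""

prompt = JUDGE_PROMPT.format(
    question="What does HTTP 404 mean?",
    response_a="The server could not find the requested resource.",
    response_b="It means the internet is down.",
)
```

Asking for the analysis *before* the verdict matters: it forces the judge to commit to reasons first rather than rationalizing a snap judgment.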
**Step 3: Mitigate biases** — Run each comparison TWICE with A/B swapped. Only count if both runs agree. Use temperature=0.
**Step 4: Calibrate** — Have humans evaluate 50-100 examples. Measure agreement. If <80%, refine your rubric.
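The calibration check in Step 4 is just percent agreement between human and judge verdicts on the labeled sample; the labels below are made-up examples:

```python
# Calibration sketch: raw agreement between human and judge verdicts on a
# small labeled set, with the 80% threshold from the step above.

def agreement_rate(human: list[str], judge: list[str]) -> float:
    assert len(human) == len(judge) and human, "need matched, non-empty labels"
    matches = sum(h == j for h, j in zip(human, judge))
    return matches / len(human)

human_labels = ["A", "B", "TIE", "A", "B"]
judge_labels = ["A", "B", "A",   "A", "B"]
rate = agreement_rate(human_labels, judge_labels)
print(f"{rate:.0%}")  # 4 of 5 agree → 80%
if rate < 0.80:
    print("Refine the rubric and re-run calibration.")
```

Raw agreement is the simplest choice; if your verdict classes are imbalanced, a chance-corrected statistic like Cohen's kappa gives a fairer picture.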
Cost: ~$0.02-0.05 per evaluation with GPT-4. For 1000 evaluations: ~$20-50.