LLM Evaluation - Accuracy, Latency, Performance
Evaluating an LLM for production is not a "one-and-done" task; it's a balancing act between three conflicting pillars: Accuracy, Performance, and Latency.
As a professional in this space, you should be aware that optimizing one often degrades the others (e.g., more aggressive quantization improves latency but can tank accuracy). Here is the no-nonsense breakdown of how to measure each pillar and the frameworks that actually matter.
1. Accuracy: The "Is it Smart?" Pillar
Accuracy in LLMs is elusive because "ground truth" is often subjective. You must move beyond simple string matching to semantic and model-based evaluation.
Core Metrics
Traditional (Lexical): BLEU, ROUGE, METEOR. Good for translation/summarization, but blind to meaning.
Semantic Similarity: BERTScore or cosine similarity on embeddings. This checks if the meaning matches, even if the words don't (see the sketch after the metric lists below).
LLM-as-a-Judge: Using a stronger model (like GPT-4o) to grade a smaller model (like Llama-3 8B). This is currently the most popular "gold standard" for automated eval.
RAG-Specific Metrics:
Faithfulness: Is the answer derived only from the retrieved context?
Answer Relevancy: Does the answer actually address the user's prompt?
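As a quick illustration of the embedding route, here is a minimal sketch assuming the sentence-transformers library and the all-MiniLM-L6-v2 model (both are illustrative choices, not requirements). It scores whether a candidate answer means the same thing as a reference, even with zero word overlap.

```python
# Semantic similarity sketch: cosine similarity between sentence embeddings.
# Assumes `pip install sentence-transformers`; the model choice is illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

reference = "The refund will arrive within 5 business days."
candidate = "You should receive your money back in about a week."

# Encode both texts into dense vectors and compare them.
emb_ref, emb_cand = model.encode([reference, candidate], convert_to_tensor=True)
score = util.cos_sim(emb_ref, emb_cand).item()  # ~1.0 = near-identical meaning

print(f"Semantic similarity: {score:.2f}")  # high despite little word overlap
```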
The "Pro" Frameworks
DeepEval: The most developer-centric. It uses "G-Eval" (LLM-as-a-judge) and integrates directly into pytest (a framework-free version of that pattern is sketched after this list).
RAGAs: The industry standard specifically for Retrieval-Augmented Generation.
LM-Eval-Harness: If you want to compare your model against academic benchmarks (MMLU, GSM8K), use this. It’s what everyone uses for the Hugging Face Open LLM Leaderboard.
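If you want the LLM-as-a-judge pattern running in CI without committing to a framework yet, here is a minimal pytest-style sketch. It assumes the official openai Python client with GPT-4o as the judge; the rubric, the passing threshold, and the my_chatbot() system under test are hypothetical placeholders.

```python
# test_quality.py -- run with `pytest`. Sketch of LLM-as-a-judge in a CI test.
# Assumes: `pip install openai pytest`, OPENAI_API_KEY set, and a hypothetical
# my_chatbot() function that returns the answer under test.
from openai import OpenAI

client = OpenAI()

def judge_score(question: str, answer: str) -> int:
    """Ask a stronger model to grade the answer from 1 (bad) to 5 (excellent)."""
    prompt = (
        f"Question: {question}\nAnswer: {answer}\n"
        "Grade the answer's correctness and helpfulness from 1 to 5. "
        "Reply with only the number."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return int(response.choices[0].message.content.strip())

def test_refund_policy_answer():
    question = "How long do refunds take?"
    answer = my_chatbot(question)  # hypothetical system under test
    assert judge_score(question, answer) >= 4  # fail the build on weak answers
```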
2. Latency: The "Is it Fast?" Pillar
In a chat-based world, users care more about the start of the message than the total time.
The Metrics That Matter
TTFT (Time to First Token): The time from "Enter" to the first word appearing.
Target: <200ms for "instant" feel.
TPOT (Time Per Output Token): The speed of the "typing" effect.
Target: 30-50 ms/token. Humans read at roughly 5-10 tokens/sec, so once streaming comfortably outpaces reading speed, further per-token gains are largely wasted on the human eye.
E2E Latency: Total time for the full response.
How to Measure (Python)
import time

start = time.perf_counter()
# ... trigger streaming inference and block until the first token arrives ...
first_token_received = time.perf_counter()

ttft_ms = (first_token_received - start) * 1000  # Time to First Token, in ms
print(f"TTFT: {ttft_ms:.1f} ms")
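That skeleton only captures TTFT. The slightly fuller sketch below also derives TPOT and E2E latency, assuming a hypothetical stream_tokens(prompt) generator that yields tokens as your inference server streams them.

```python
# Sketch: measure TTFT, TPOT, and E2E latency from a streaming generator.
# stream_tokens(prompt) is a hypothetical placeholder for your client's
# streaming call (vLLM, TGI, OpenAI, etc.) that yields one token at a time.
import time

def measure_latency(prompt: str) -> dict:
    start = time.perf_counter()
    token_times = []
    for _ in stream_tokens(prompt):          # hypothetical streaming call
        token_times.append(time.perf_counter())
    end = token_times[-1]

    ttft_ms = (token_times[0] - start) * 1000
    # TPOT: average gap between tokens after the first one.
    tpot_ms = ((end - token_times[0]) / max(len(token_times) - 1, 1)) * 1000
    return {
        "ttft_ms": ttft_ms,
        "tpot_ms": tpot_ms,
        "e2e_ms": (end - start) * 1000,
    }
```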
3. Performance & Throughput: The "Is it Scalable?" Pillar
Performance is about system efficiency—how many users can you serve without the server melting?
Key Metrics
Throughput (TPS): Total tokens per second across all concurrent users. This determines your hardware ROI (see the load-test sketch after this list).
RPS (Requests Per Second): How many distinct requests the system completes per second.
GPU Utilization: Whether you are memory-bound or compute-bound.
LLM inference is almost always memory-bandwidth bound during the decoding phase.
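A rough way to observe throughput and RPS is to fire a batch of concurrent requests and divide the totals by wall-clock time. A minimal sketch using a thread pool, where generate(prompt) is a hypothetical blocking call that returns the token count of its completion:

```python
# Sketch: aggregate throughput (tokens/sec) and RPS under concurrent load.
# generate(prompt) is a hypothetical blocking call returning the token count
# of its completion; swap in your real client.
import time
from concurrent.futures import ThreadPoolExecutor

def load_test(prompts: list[str], concurrency: int = 16) -> None:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        token_counts = list(pool.map(generate, prompts))
    elapsed = time.perf_counter() - start

    print(f"RPS:        {len(prompts) / elapsed:.2f} requests/sec")
    print(f"Throughput: {sum(token_counts) / elapsed:.1f} tokens/sec")
```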
Comparison of Top Eval Frameworks (2026)
| Framework | Best For | Logic Style | CI/CD Ready? |
| --- | --- | --- | --- |
| DeepEval | Production & Agents | LLM-as-a-Judge | Yes (Pytest) |
| RAGAs | RAG Pipelines | Reference-free | Yes |
| LM-Eval-Harness | Academic Benchmarking | Deterministic | No (Manual) |
| Weights & Biases | Experiment Tracking | Visual/Comparative | Yes |
Brutal Reality Check: Common Evaluation Flaws
Over-reliance on BLEU/ROUGE: A model can have a high BLEU score and still be factually wrong. Never use these as your only metric for a chatbot.
Self-Preference Bias: If you use Llama-3 to grade Llama-3, it will tend to give itself higher scores. Always use a significantly stronger model (GPT-4o, Claude 3.5) as the judge.
Ignoring the "Cold Start": Latency tests often ignore the time it takes to load the model into VRAM. Ensure your benchmarks reflect the "warm" production state (see the warm-up sketch below).
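A simple guard against the cold-start trap: run a few throwaway requests first and only record timings once the model and caches are warm. A sketch, again with a hypothetical generate() call:

```python
# Sketch: discard warm-up runs so benchmarks reflect the "warm" production state.
# generate(prompt) is the same hypothetical inference call as above.
import time

WARMUP_RUNS = 3    # absorb model loading, kernel compilation, cache fills
MEASURED_RUNS = 20

for _ in range(WARMUP_RUNS):
    generate("warm-up prompt")           # timings intentionally ignored

latencies = []
for _ in range(MEASURED_RUNS):
    start = time.perf_counter()
    generate("benchmark prompt")
    latencies.append((time.perf_counter() - start) * 1000)

print(f"Approx. p50 latency: {sorted(latencies)[len(latencies) // 2]:.1f} ms")
```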