Evals
Evaluate each step! Eval should be part of the product specification, along with objectives.
Evaluation must serve a clear purpose and have a clear metric, even for (seemingly) less rigid tasks such as summarization. For a summarization task, ask: what impact do we want the summary to create? If the answer is action items, then use action items as the evaluation metric.
Product metrics and eval metrics are related but not necessarily equal.
Eval frameworks/libraries are helpful, but in the end real customer evaluations are what really matter.
Evals/assertions:
- Code-based deterministic unit tests (e.g., pytest)
- Human judge
- LLM as a judge: side-by-side comparisons, multiple judges, periodic random checks for alignment with a human judge (e.g., model response, model critique, model decision, human critique, human decision, human revised response)
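A minimal LLM-as-a-judge sketch for side-by-side comparison, assuming the OpenAI Python client; the judge prompt, model name, and JSON shape are illustrative assumptions, not a prescribed setup.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are comparing two candidate responses to the same user query.

Query: {query}

Response A: {a}

Response B: {b}

Reply with JSON: {{"winner": "A" | "B" | "tie", "critique": "<one sentence>"}}"""

def judge_side_by_side(query: str, a: str, b: str, judge_model: str = "gpt-4o") -> dict:
    """Ask a judge model which response is better; keep the critique so a human
    can periodically re-judge a random sample and check alignment."""
    completion = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(query=query, a=a, b=b)}],
        response_format={"type": "json_object"},
    )
    return json.loads(completion.choices[0].message.content)
```

Periodically sample these judgments and have a human re-judge the same pairs to check alignment.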
Break evals into different scenarios.
Use an LLM to generate synthetic test data for evals.
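A hedged sketch of synthetic test-case generation with a capable model; the prompt and the {query, expected} schema are assumptions, not a required format.

```python
import json
from openai import OpenAI

client = OpenAI()

def generate_test_cases(feature_description: str, n: int = 10, model: str = "gpt-4o") -> list[dict]:
    """Generate n synthetic user queries plus expected behavior for an eval set."""
    prompt = (
        f"Generate {n} realistic user queries for the feature below, each with a short "
        f"description of the expected behavior.\nFeature: {feature_description}\n"
        'Reply with JSON: {"cases": [{"query": "...", "expected": "..."}]}'
    )
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(completion.choices[0].message.content)["cases"]
```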
Log testing results and analyze / visualize them.
Model comparison via A/B testing: randomly select which model serves each request and measure human ratings.
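A minimal sketch of random model assignment for A/B comparison; call_model is a stub standing in for the real serving path, and the rating scale is illustrative.

```python
import random
from collections import defaultdict

MODELS = ["model_a", "model_b"]   # hypothetical model identifiers
ratings = defaultdict(list)       # model -> human ratings (e.g., 1-5)

def call_model(model: str, query: str) -> str:
    """Stub standing in for the real inference call."""
    return f"[{model}] response to: {query}"

def serve(query: str) -> tuple[str, str]:
    """Randomly pick which model serves this request; return (model, response)."""
    model = random.choice(MODELS)
    return model, call_model(model, query)

def record_rating(model: str, rating: int) -> None:
    ratings[model].append(rating)

def summary() -> dict:
    """Mean human rating per model."""
    return {m: sum(r) / len(r) for m, r in ratings.items() if r}
```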
Hamel’s blog on evals for fine-tuning and his step-by-step workflow example:
- run the result through an existing model to check it; e.g., if the output is an image, we can run it through GPT-4o
- L1 eval (assertions that remove invalid data samples). For example, if we are generating code, we can check the validity of the syntax (see the syntax-check sketch after this list).
- synthetic sample data generation (e.g., with a highly capable model)
- data preprocessing
- training
- inference sanity check
- L2 eval (remove bad samples)
- iterative curation to improve the dataset quality
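For the L1 assertions above, a minimal sketch of a code-validity filter, assuming the generated artifact is Python source.

```python
import ast

def is_valid_python(sample: str) -> bool:
    """L1 assertion: keep only samples that parse as valid Python."""
    try:
        ast.parse(sample)
        return True
    except SyntaxError:
        return False

samples = ["def add(a, b):\n    return a + b", "def broken(:"]
clean = [s for s in samples if is_valid_python(s)]   # drops the second, invalid sample
```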
Inspect_ai
Allaire’s talk, “Inspect, an OSS framework for LLM evals”, and slides.
$ git clone https://github.com/UKGovernmentBEIS/inspect_ai.git
$ cd inspect_ai
$ python3 -m venv venv
$ source venv/bin/activate
$ pip install -e ".[dev]"
We need to set the eval model as an environment variable: export INSPECT_EVAL_MODEL="openai/gpt-4o"
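A minimal task sketch based on the Inspect documentation; parameter names (e.g., solver vs. plan) have changed across versions, so treat it as illustrative rather than definitive.

```python
# hello_eval.py -- run with: inspect eval hello_eval.py
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def hello_eval():
    return Task(
        dataset=[
            Sample(
                input="What is the capital of France? Answer with one word.",
                target="Paris",
            )
        ],
        solver=generate(),   # call the model configured via INSPECT_EVAL_MODEL
        scorer=match(),      # score by matching the target string
    )
```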
Evaluating RAG
RAG is search-based generation (SBG).
Use real queries and check the search results: does the system retrieve the desired documents? Do this before worrying about specific RAG techniques such as chunking, re-ranking, and different types of indexing.
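A minimal sketch of that check, assuming a hand-labeled set of query → relevant-document IDs and a search callable from your own retrieval stack (both hypothetical here).

```python
def recall_at_k(labeled_queries: dict[str, set[str]], search, k: int = 5) -> float:
    """Fraction of real queries whose desired documents all show up in the top-k results.

    labeled_queries: query -> set of relevant document IDs (hand-labeled)
    search: callable(query, k) -> list of document IDs from your retrieval stack
    """
    hits = 0
    for query, relevant_ids in labeled_queries.items():
        retrieved = set(search(query, k))
        if relevant_ids <= retrieved:   # every desired doc was retrieved
            hits += 1
    return hits / len(labeled_queries)

# Toy usage; the lambda stands in for the real search stack.
labeled = {"refund policy": {"doc_17"}, "reset password": {"doc_3", "doc_8"}}
print(recall_at_k(labeled, search=lambda q, k: ["doc_17", "doc_2"]))   # 0.5
```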
Frameworks: semantic or lexical retrieval scores are not necessarily calibrated, so don’t be overconfident about the scores.
Agent
Planning: state-machine-style planning can be evaluated as a classifier over the next step’s choices. Also evaluate the quality of prompts generated at each stage, if applicable.
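A sketch of scoring next-step choices as a classification problem with scikit-learn; the step labels are illustrative.

```python
from sklearn.metrics import classification_report

# Expected vs. predicted next steps for a state-machine planner (illustrative labels).
expected  = ["search", "search", "answer", "clarify", "answer"]
predicted = ["search", "answer", "answer", "clarify", "answer"]

print(classification_report(expected, predicted, zero_division=0))
```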
Tasks: require structured outputs so they can be validated programmatically.
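A validation sketch with Pydantic; the MeetingSummary schema is an assumed example of what a task might be required to return.

```python
from pydantic import BaseModel, ValidationError

class ActionItem(BaseModel):
    description: str
    owner: str

class MeetingSummary(BaseModel):
    decisions: list[str]
    action_items: list[ActionItem]

def passes_schema(raw_json: str) -> bool:
    """Assertion: the task's output must parse into the expected schema."""
    try:
        MeetingSummary.model_validate_json(raw_json)
        return True
    except ValidationError:
        return False
```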
Don’t pass too much context (or everything) to the final stage of the agent workflow; use the minimum context needed.
Step-by-step agent evaluation, e.g., a meeting notes summarizer:
- Extract key decisions, action items, and owners, and verify them (a classification problem with precision/recall; see the sketch after this list)
- Check factual consistency (classification)
- Rewrite into bullet-point summaries (writing quality, information density)
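A sketch of scoring extracted action items against a human-labeled reference with precision and recall; the string normalization is deliberately naive.

```python
def precision_recall(predicted: set[str], reference: set[str]) -> tuple[float, float]:
    """Compare extracted action items against a human-labeled reference set."""
    pred = {p.strip().lower() for p in predicted}
    ref = {r.strip().lower() for r in reference}
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    return precision, recall

p, r = precision_recall(
    predicted={"Alice to send the report", "Book the venue"},
    reference={"alice to send the report", "book the venue", "schedule follow-up"},
)
# p == 1.0, r == 2/3
```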
Production workflow
Expose endpoints at the production workflow stages directly for evals! (This avoids drift from a replicated system that doesn’t keep everything in sync.)
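A sketch of exposing one production stage for evals, assuming a FastAPI service; the route and the retriever stub are hypothetical.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

def production_retriever(query: str, k: int) -> list[str]:
    """Stub standing in for the real production retrieval stage."""
    return [f"doc_{i}" for i in range(k)]

class RetrieveRequest(BaseModel):
    query: str
    k: int = 5

@app.post("/eval/retrieve")
def retrieve_stage(req: RetrieveRequest) -> dict:
    # Same code path production uses, so evals don't drift against a replica.
    return {"doc_ids": production_retriever(req.query, req.k)}
```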
Logging, tracing, and debugging
Event trace: a sequence of events, commonly saved in JSONL format. You may build a UI to visualize and process the trace logs (e.g., with Shiny, Streamlit, or Gradio), but you can also use existing tools:
- Commercial: LangSmith, W&B Weave, Braintrust, Pydantic Logfire
- OSS: Inspect, OpenLLMetry
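A minimal sketch of appending event traces to a JSONL file and reading them back; the field names are illustrative.

```python
import json
import time

TRACE_PATH = "trace.jsonl"

def log_event(event_type: str, payload: dict) -> None:
    """Append one event (timestamp, type, payload) as a single JSON line."""
    record = {"ts": time.time(), "type": event_type, "payload": payload}
    with open(TRACE_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")

def load_trace(path: str = TRACE_PATH) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

log_event("llm_call", {"model": "gpt-4o", "prompt_tokens": 812})
log_event("tool_call", {"tool": "search", "query": "quarterly revenue"})
```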
Run scheduled notebooks for eval tests. Target evals that succeed 60-70% of the time; evals that always pass carry little signal.