Evals
Evaluate each step! Eval should be part of the product specification, along with objectives.
Evaluation must serve a clear purpose and have a clear metric, even for (seemingly) less rigid tasks such as summarization. For a summarization task, ask: what impact do we want the summary to create? If the answer is action items, then use action items as the evaluation metric.
Product metrics and eval metrics are related but not necessarily equal.
Eval frameworks/libraries are helpful, but in the end real customer evaluations are what really matter.
Evals/assertions:
- Code-based deterministic unit tests (e.g., pytest)
- Human judge
- LLM as a judge: side-by-side comparisons, multiple judges, periodic random checks for alignment with a human judge (e.g., model response, model critique, model decision, human critique, human decision, human revised response)
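A minimal LLM-as-a-judge sketch for side-by-side comparison, assuming the OpenAI Python client; the judge prompt, model name, and JSON shape are illustrative assumptions, not a prescribed setup.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are comparing two candidate responses to the same user query.

Query: {query}

Response A: {a}

Response B: {b}

Reply with JSON: {{"winner": "A" | "B" | "tie", "critique": "<one sentence>"}}"""

def judge_side_by_side(query: str, a: str, b: str, judge_model: str = "gpt-4o") -> dict:
    """Ask a judge model which response is better; keep the critique so a human
    can periodically re-judge a random sample and check alignment."""
    completion = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(query=query, a=a, b=b)}],
        response_format={"type": "json_object"},
    )
    return json.loads(completion.choices[0].message.content)
```

Periodically sample these judgments and have a human re-judge the same pairs to check alignment.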
Break evals into different scenarios.
Use an LLM to generate synthetic test data for evals.
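A hedged sketch of synthetic test-case generation with a capable model; the prompt and the {query, expected} schema are assumptions, not a required format.

```python
import json
from openai import OpenAI

client = OpenAI()

def generate_test_cases(feature_description: str, n: int = 10, model: str = "gpt-4o") -> list[dict]:
    """Generate n synthetic user queries plus expected behavior for an eval set."""
    prompt = (
        f"Generate {n} realistic user queries for the feature below, each with a short "
        f"description of the expected behavior.\nFeature: {feature_description}\n"
        'Reply with JSON: {"cases": [{"query": "...", "expected": "..."}]}'
    )
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(completion.choices[0].message.content)["cases"]
```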
Log testing results and analyze / visualize them.
Model comparison via A/B testing: randomly select which model serves each request and measure human ratings.
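A minimal sketch of random model assignment for A/B comparison; call_model is a stub standing in for the real serving path, and the rating scale is illustrative.

```python
import random
from collections import defaultdict

MODELS = ["model_a", "model_b"]   # hypothetical model identifiers
ratings = defaultdict(list)       # model -> human ratings (e.g., 1-5)

def call_model(model: str, query: str) -> str:
    """Stub standing in for the real inference call."""
    return f"[{model}] response to: {query}"

def serve(query: str) -> tuple[str, str]:
    """Randomly pick which model serves this request; return (model, response)."""
    model = random.choice(MODELS)
    return model, call_model(model, query)

def record_rating(model: str, rating: int) -> None:
    ratings[model].append(rating)

def summary() -> dict:
    """Mean human rating per model."""
    return {m: sum(r) / len(r) for m, r in ratings.items() if r}
```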
Hamel’s blog on evals for fine-tuning and his step-by-step workflow example:
- run the result through an existing model to check it; e.g., if the output is an image, we can run it through GPT-4o
- L1 eval (assertions that remove invalid data samples). For example, if we are generating code, we can check the validity of the syntax (see the syntax-check sketch after this list).
- synthetic sample data generation (e.g., with a highly capable model)
- data preprocessing
- training
- inference sanity check
- L2 eval (remove bad samples)
- iterative curation to improve the dataset quality
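For the L1 assertions above, a minimal sketch of a code-validity filter, assuming the generated artifact is Python source.

```python
import ast

def is_valid_python(sample: str) -> bool:
    """L1 assertion: keep only samples that parse as valid Python."""
    try:
        ast.parse(sample)
        return True
    except SyntaxError:
        return False

samples = ["def add(a, b):\n    return a + b", "def broken(:"]
clean = [s for s in samples if is_valid_python(s)]   # drops the second, invalid sample
```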
Inspect_ai
Allaire’s talk, “Inspect, an OSS framework for LLM evals”, and slides.
$ git clone https://github.com/UKGovernmentBEIS/inspect_ai.git
$ cd inspect_ai
$ python3 -m venv venv
$ source venv/bin/activate
$ pip install -e ".[dev]"
We need to set the eval model as an environment variable: export INSPECT_EVAL_MODEL="openai/gpt-4o"
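A minimal task sketch based on the Inspect documentation; parameter names (e.g., solver vs. plan) have changed across versions, so treat it as illustrative rather than definitive.

```python
# hello_eval.py -- run with: inspect eval hello_eval.py
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def hello_eval():
    return Task(
        dataset=[
            Sample(
                input="What is the capital of France? Answer with one word.",
                target="Paris",
            )
        ],
        solver=generate(),   # call the model configured via INSPECT_EVAL_MODEL
        scorer=match(),      # score by matching the target string
    )
```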
Evaluating RAG
RAG is search-based generation (SBG).
Use real queries and check the search results: does the system retrieve the desired documents? Do this before worrying about specific RAG techniques such as chunking, re-ranking, and different types of indexing.
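A minimal sketch of that check, assuming a hand-labeled set of query → relevant-document IDs and a search callable from your own retrieval stack (both hypothetical here).

```python
def recall_at_k(labeled_queries: dict[str, set[str]], search, k: int = 5) -> float:
    """Fraction of real queries whose desired documents all show up in the top-k results.

    labeled_queries: query -> set of relevant document IDs (hand-labeled)
    search: callable(query, k) -> list of document IDs from your retrieval stack
    """
    hits = 0
    for query, relevant_ids in labeled_queries.items():
        retrieved = set(search(query, k))
        if relevant_ids <= retrieved:   # every desired doc was retrieved
            hits += 1
    return hits / len(labeled_queries)

# Toy usage; the lambda stands in for the real search stack.
labeled = {"refund policy": {"doc_17"}, "reset password": {"doc_3", "doc_8"}}
print(recall_at_k(labeled, search=lambda q, k: ["doc_17", "doc_2"]))   # 0.5
```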
Frameworks: semantic or lexical retrieval scores are not necessarily calibrated, so don’t be overconfident about the scores.
Agent
Planning: state-machine-style planning can be evaluated as a classifier over the next step’s choices. Also evaluate the quality of prompts generated at each stage, if applicable.
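A sketch of scoring next-step choices as a classification problem with scikit-learn; the step labels are illustrative.

```python
from sklearn.metrics import classification_report

# Expected vs. predicted next steps for a state-machine planner (illustrative labels).
expected  = ["search", "search", "answer", "clarify", "answer"]
predicted = ["search", "answer", "answer", "clarify", "answer"]

print(classification_report(expected, predicted, zero_division=0))
```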
Tasks: require structured outputs so they can be validated programmatically.
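A validation sketch with Pydantic; the MeetingSummary schema is an assumed example of what a task might be required to return.

```python
from pydantic import BaseModel, ValidationError

class ActionItem(BaseModel):
    description: str
    owner: str

class MeetingSummary(BaseModel):
    decisions: list[str]
    action_items: list[ActionItem]

def passes_schema(raw_json: str) -> bool:
    """Assertion: the task's output must parse into the expected schema."""
    try:
        MeetingSummary.model_validate_json(raw_json)
        return True
    except ValidationError:
        return False
```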
Don’t pass too much context (or everything) to the final stage of the agent workflow; use the minimum context needed.
Step-by-step agent evaluation, e.g., a meeting notes summarizer:
- Extract key decisions, action items, and owners, and verify them (a classification problem with precision/recall; see the sketch after this list)
- Check factual consistency (classification)
- Rewrite into bullet-point summaries (writing quality, information density)
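A sketch of scoring extracted action items against a human-labeled reference with precision and recall; the string normalization is deliberately naive.

```python
def precision_recall(predicted: set[str], reference: set[str]) -> tuple[float, float]:
    """Compare extracted action items against a human-labeled reference set."""
    pred = {p.strip().lower() for p in predicted}
    ref = {r.strip().lower() for r in reference}
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    return precision, recall

p, r = precision_recall(
    predicted={"Alice to send the report", "Book the venue"},
    reference={"alice to send the report", "book the venue", "schedule follow-up"},
)
# p == 1.0, r == 2/3
```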
Production workflow
Expose endpoints at the production workflow stages directly for evals! (This avoids drift from a replicated system that doesn’t keep everything in sync.)
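A sketch of exposing one production stage for evals, assuming a FastAPI service; the route and the retriever stub are hypothetical.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

def production_retriever(query: str, k: int) -> list[str]:
    """Stub standing in for the real production retrieval stage."""
    return [f"doc_{i}" for i in range(k)]

class RetrieveRequest(BaseModel):
    query: str
    k: int = 5

@app.post("/eval/retrieve")
def retrieve_stage(req: RetrieveRequest) -> dict:
    # Same code path production uses, so evals don't drift against a replica.
    return {"doc_ids": production_retriever(req.query, req.k)}
```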
Logging, tracing, and debugging
Event trace: a sequence of events, commonly saved in JSONL format. You may build a UI to visualize and process the trace logs (e.g., with Shiny, Streamlit, or Gradio), but you can also use existing tools:
- Commercial: LangSmith, W&B Weave, Braintrust, Pydantic Logfire
- OSS: Inspect, OpenLLMetry
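A minimal sketch of appending event traces to a JSONL file and reading them back; the field names are illustrative.

```python
import json
import time

TRACE_PATH = "trace.jsonl"

def log_event(event_type: str, payload: dict) -> None:
    """Append one event (timestamp, type, payload) as a single JSON line."""
    record = {"ts": time.time(), "type": event_type, "payload": payload}
    with open(TRACE_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")

def load_trace(path: str = TRACE_PATH) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

log_event("llm_call", {"model": "gpt-4o", "prompt_tokens": 812})
log_event("tool_call", {"tool": "search", "query": "quarterly revenue"})
```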
Run scheduled notebooks for eval tests. Target evals that succeed 60-70% of the time; evals that always pass carry little signal.