# Bonus Unit 2 — Observability & Evaluation

## Overview
As agents become more complex, observability — the ability to inspect what the agent did and why — becomes essential for debugging and improvement.
## Key questions observability answers
- Which tool calls were made, in what order?
- How long did each step take?
- Where did the agent fail or hallucinate?
- Which sub-task caused a wrong final answer?
## Tracing with OpenTelemetry
smolagents integrates with OpenTelemetry-compatible backends (Langfuse, Arize Phoenix, etc.):

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

from openinference.instrumentation.smolagents import SmolagentsInstrumentor

# Configure your exporter (e.g. Langfuse, Phoenix).
# ConsoleSpanExporter is a stand-in that prints spans to stdout,
# handy for checking locally that instrumentation works.
exporter = ConsoleSpanExporter()

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))

# From here on, every agent run emits spans through the provider
SmolagentsInstrumentor().instrument(tracer_provider=provider)
```
## Evaluation metrics
| Metric | Description |
|---|---|
| Exact match | Is the final answer exactly correct? |
| F1 / ROUGE | Partial credit for text overlap |
| Tool accuracy | Did the agent call the right tools? |
| Steps to answer | Efficiency of the trajectory |
| Cost | Total tokens consumed |
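The first two metrics in the table can be sketched in plain Python. The sketch below assumes SQuAD-style token-level F1 (harmonic mean of precision and recall over tokens shared between prediction and reference); the function names are my own.

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> bool:
    """Exact match after trivial whitespace/case normalization."""
    return prediction.strip().lower() == reference.strip().lower()

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1: partial credit for overlapping tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))                        # → True
print(round(token_f1("the capital is Paris", "Paris"), 2))  # → 0.4
```

Exact match is strict but unambiguous; token F1 rewards answers that contain the reference plus extra words, which is usually what you want for free-form agent output.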
## LLM-as-judge
When exact match is too strict for free-form answers, a second model can grade them:

```python
from smolagents import HfApiModel

judge_model = HfApiModel(model_id="meta-llama/Meta-Llama-3-8B-Instruct")

def llm_judge(question: str, answer: str, reference: str) -> bool:
    """Ask a judge model whether the agent's answer matches the reference."""
    prompt = f"""Question: {question}
Reference answer: {reference}
Agent answer: {answer}
Is the agent answer correct? Reply only YES or NO."""
    # smolagents models take a list of chat messages and return a
    # ChatMessage; read the text from its .content attribute
    response = judge_model([{"role": "user", "content": prompt}])
    return "YES" in response.content.upper()
```
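A judge is most useful when aggregated over a labeled set. Below is a minimal harness for that; the `exact_judge` stand-in replaces the LLM call so the sketch runs offline — swap in `llm_judge` once a judge model is configured. The dataset field names are my own convention.

```python
from typing import Callable

def evaluate(dataset: list[dict], judge: Callable[[str, str, str], bool]) -> float:
    """Return the fraction of examples the judge marks correct."""
    verdicts = [
        judge(ex["question"], ex["answer"], ex["reference"]) for ex in dataset
    ]
    return sum(verdicts) / len(verdicts)

# Stand-in judge for testing the harness; replace with llm_judge
def exact_judge(question: str, answer: str, reference: str) -> bool:
    return answer.strip().lower() == reference.strip().lower()

dataset = [
    {"question": "Capital of France?", "answer": "Paris", "reference": "Paris"},
    {"question": "2 + 2?", "answer": "5", "reference": "4"},
]

print(evaluate(dataset, exact_judge))  # → 0.5
```

Keeping the judge behind a plain callable makes it easy to compare an LLM judge against cheap heuristics on the same dataset.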
## Notes & experiments
Add your observability setup and evaluation results here.