An AI feature without observability is not just difficult to debug. It is a liability.
Not in a metaphorical sense. In a legal sense. EU AI Act Article 15 places requirements on the accuracy, robustness, and cybersecurity of high-risk AI systems — and the central instrument for meeting those requirements is logging and audit trails. A feature that does not log what it does, who it does it for, and what the output was, cannot document compliance.
But compliance is not the only driver; there is also operational reality. AI systems fail differently from traditional software. They fail softly: they give wrong answers instead of throwing exceptions. Without traces, you do not know when this happens, why it happens, or whether it happens systematically for a specific user group, on specific input types, or at specific times of day.
Observability for AI is not a nice-to-have. It is the foundation for running AI features responsibly.
What observability is (and what it is not)
Observability for AI features is not application performance monitoring. It is not just "is request time under 2 seconds?" It is a structured record of:
What the model received — prompt text, system instructions, context data, token count.
What the model responded — output text, structured output, any tool calls, output token count.
What it cost — input tokens, output tokens, cached tokens, estimated price.
When it happened — timestamp, latency, model version, prompt version.
Who it concerned — user ID (hashed), workspace ID, feature ID.
Whether it succeeded — success/failure, error code, model confidence (where available).
Together, these constitute an audit trail. They make it possible to answer questions such as: "What did the system tell user X about topic Y on that day?", "Is there a pattern in failed calls in the past week?", and "Has model confidence dropped since we changed the prompt version?"
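The six categories above can be captured in a single record type. A minimal sketch in TypeScript — field names are illustrative, not a fixed Langfuse or AI SDK schema:

```typescript
// Illustrative trace record covering the six categories above.
// Field names are assumptions, not a prescribed Langfuse/AI SDK schema.
interface AiTrace {
  // What the model received
  promptText: string;
  inputTokens: number;
  // What the model responded
  outputText: string;
  outputTokens: number;
  // What it cost
  cachedTokens: number;
  estimatedCostUsd: number;
  // When it happened
  timestamp: string; // ISO 8601
  latencyMs: number;
  modelVersion: string;
  promptVersion: string;
  // Who it concerned
  userIdHash: string; // hashed, never plain text
  workspaceId: string;
  functionId: string;
  // Whether it succeeded
  success: boolean;
  errorCode?: string;
  confidence?: number; // 0.00 to 1.00, where the model reports one
}

// A trace is auditable only if the identity and versioning fields
// needed to answer "what, when, for whom, with which prompt" are present.
function isAuditable(t: AiTrace): boolean {
  return (
    t.functionId.length > 0 &&
    t.modelVersion.length > 0 &&
    t.promptVersion.length > 0 &&
    t.userIdHash.length > 0
  );
}
```

The point of the type is not the exact fields but the completeness check: a trace missing any of the versioning or identity fields cannot answer the audit questions above.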
An AI system without traces is a black box. A black box cannot be audited. Something that cannot be audited cannot be regulated. From the AI Act's perspective, this is not a technical choice — it is a compliance question.
The trace structure: functionId, tokens, latency
The practical implementation starts with defining a consistent trace structure across all AI calls in the system.
The minimum pattern uses five fields:
await generateText({
  model: models.fast,
  messages: [...],
  experimental_telemetry: {
    isEnabled: true,
    functionId: "capability.enrich.v2", // semantic ID, not system ID
    metadata: {
      workspaceId: ctx.workspaceId,
      userId: hash(ctx.userId), // never plain-text user ID
      promptVersion: "v2.3.1", // semantic versioning
      inputType: "capability-description",
    },
  },
})
functionId is the most important field. It is what makes it possible to group traces across calls and identify patterns. Use semantic names, not system IDs: capability.enrich.v2 tells you more than ai_call_4892.
Versions in promptVersion are critical for correlating output quality changes with prompt changes. Without them, it is impossible to answer: "When did output start changing, and what did we change?"
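The hash(ctx.userId) call in the example can be a one-way digest, so traces remain groupable per user without ever storing the raw ID. A minimal sketch using Node's built-in crypto module — the salt handling is an assumption, and in practice the secret would come from configuration:

```typescript
import { createHash } from "node:crypto";

// Illustrative stand-in for a configured secret; never hard-code in production.
const HASH_SALT = process.env.TRACE_HASH_SALT ?? "dev-only-salt";

// One-way, deterministic hash: the same userId always maps to the same
// value, so traces can be grouped per user without exposing the ID itself.
function hashUserId(userId: string): string {
  return createHash("sha256")
    .update(HASH_SALT)
    .update(userId)
    .digest("hex");
}
```

Determinism is the property that matters here: a random salt per call would protect the ID but destroy the ability to group traces by user.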
ADR-0001 metadata fields in practice
ADR-0001 (Spekir's data architecture standard) defines a set of mandatory fields for all tables that store AI output. These fields are not bureaucracy — they are the data part of the audit trail.
For an output register (example: capability description generated by AI) the field structure looks like this:
ai_model_version text -- "claude-sonnet-4-6-2026-01"
ai_prompt_version text -- "v2.3.1"
ai_confidence numeric(3,2) -- 0.00 to 1.00
ai_classified_at timestamptz -- when it was generated
ai_needs_review boolean -- flagged for human review
ai_trace_id text -- Langfuse trace ID
ai_input_tokens integer -- actual token consumption
ai_output_tokens integer -- actual token consumption
ai_cached_tokens integer -- tokens read from cache
ai_cost_usd numeric(10,6) -- estimated cost
These fields make it possible to:
Answer audit questions: "What was the model version when this capability description was generated, and which prompt version were we using?"
Identify stale output: "Which capabilities have AI output generated with a model version older than six months?"
Estimate costs per workspace or per feature: "What does capability enrichment cost us per workspace per month?"
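The token fields make ai_cost_usd a derived value. A minimal sketch of the estimate, assuming illustrative per-million-token prices — the real numbers depend on the model's price sheet and are not claimed here:

```typescript
// Illustrative per-million-token prices; substitute the actual price
// sheet for the model in use. Cached input tokens are assumed to be
// billed at a fraction of the normal input price.
const PRICES_PER_MTOK = { input: 3.0, output: 15.0, cachedRead: 0.3 };

function estimateCostUsd(
  inputTokens: number,
  outputTokens: number,
  cachedTokens: number, // subset of inputTokens that was read from cache
): number {
  const freshInput = inputTokens - cachedTokens;
  const usd =
    (freshInput * PRICES_PER_MTOK.input +
      cachedTokens * PRICES_PER_MTOK.cachedRead +
      outputTokens * PRICES_PER_MTOK.output) /
    1_000_000;
  return Number(usd.toFixed(6)); // matches numeric(10,6) in the data model
}
```

Storing the estimate per row, rather than recomputing it later, is deliberate: prices change, and the audit trail should record what the call cost at the time it was made.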
EU AI Act Article 15 in practice
Article 15 of the EU AI Act imposes three requirements on high-risk AI systems: accuracy, robustness, and cybersecurity. What does this mean operationally?
Accuracy: The system must produce output that is correct relative to its purpose, within a defined error rate. This requires measuring output quality continuously — not just at launch, but over time. Without traces, you cannot measure.
Robustness: The system must handle failure scenarios without producing harmful outputs. This requires being able to identify when the system produces unexpected output and what the specific input characteristics were. Without traces, you cannot identify patterns.
Cybersecurity: The system must be protected against prompt injection and adversarial inputs. This requires logging inputs that are markedly different from expected patterns. Without input logging, you cannot detect attack patterns.
Article 15 is not a requirement to implement a specific system. It is a requirement to have control over the system's behaviour. Observability is the instrument that gives you that control.
Implementation plan: four steps
Observability for AI features can be implemented progressively. Not everything at once.
Step 1 — Basic telemetry: Enable experimental_telemetry on all generateText/streamText calls with consistent functionId names. If you use Langfuse, this is sufficient to start seeing traces in the Langfuse console.
Step 2 — Metadata fields in the data model: Add ai_model_version, ai_prompt_version, ai_classified_at, and ai_needs_review to tables that store AI output. Start with the most important features.
Step 3 — Token tracking: Add ai_input_tokens, ai_output_tokens, ai_cached_tokens, and ai_cost_usd to the data model. Use these for per-workspace cost tracking and for identifying calls with unexpected token consumption.
Step 4 — AI review flag: Implement a pattern that sets ai_needs_review = true based on rules (low confidence, short output, token consumption above threshold). Build a review queue that exposes these to a human reviewer.
Each step adds observability that can be used independently of the others.
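Step 4 can start as a pure rule function evaluated after each call. A minimal sketch — the thresholds are illustrative assumptions and would be tuned per feature in practice:

```typescript
// Rule-based review flag (Step 4). Thresholds are illustrative
// assumptions, not recommendations; tune them per feature.
interface ReviewInput {
  confidence: number | null; // null when the model reports no confidence
  outputTokens: number;
  totalTokens: number; // input + output
}

function needsReview(r: ReviewInput): boolean {
  if (r.confidence !== null && r.confidence < 0.7) return true; // low confidence
  if (r.outputTokens < 20) return true; // suspiciously short output
  if (r.totalTokens > 50_000) return true; // unexpected token consumption
  return false;
}
```

Keeping the rules in one pure function makes them trivial to test and to version alongside the prompt, so a change in review volume can be traced back to a rule change rather than guessed at.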
What observability does not solve
It is important to delineate what observability is an instrument for, and what requires other measures.
Observability shows what happened. It does not tell you whether it was the right thing to do. A model that systematically gives low confidence scores on a specific input type will show up in traces — but traces do not tell you whether it is a prompt problem, a model problem, or a data problem.
Traces are a diagnostic instrument, not a solution in themselves. They reduce the time it takes to reach the right diagnosis — but the diagnosis still requires human interpretation.
What to do tomorrow
Observability is an investment with returns from day one. Three steps to get started:
Week 1: Enable experimental_telemetry on all existing AI calls. Define a consistent functionId naming scheme. Verify that traces appear in Langfuse.
Week 2: Add ai_model_version, ai_prompt_version, and ai_classified_at to the database tables that store AI output. Start with the most important features.
Week 3: Implement the ai_needs_review flag with one simple rule (example: ai_confidence < 0.7). Build a minimal review list that exposes the flagged output.
Build it in steps. But start today. Every call without a trace is a diagnostic and learning opportunity lost.
References
[1] Regulation (EU) 2024/1689 of the European Parliament and of the Council on artificial intelligence (AI Act), Article 15 — Accuracy, robustness and cybersecurity, available at eur-lex.europa.eu.
[2] Langfuse, "AI Observability & Analytics", available at langfuse.com/docs (accessed 2026-04-23).
[3] Anthropic, "Usage Metadata and Caching Statistics", available at docs.anthropic.com/en/api/messages (accessed 2026-04-23).