An AI feature without observability is not just difficult to debug. It is a liability.
Not in a metaphorical sense. In a legal sense. EU AI Act Article 15 places requirements on the accuracy, robustness, and cybersecurity of high-risk AI systems — and the central instrument for meeting those requirements is logging and audit trails. A feature that does not log what it does, who it does it for, and what the output was, cannot document compliance.
But compliance is not the only driver; there is also operational reality. AI systems fail differently from traditional software. They fail softly: they give wrong answers instead of throwing exceptions. Without traces, you do not know when this happens, why it happens, or whether it happens systematically for a specific user group, on specific input types, or at specific times of day.
Observability for AI is not a nice-to-have. It is the foundation for running AI features responsibly.
What observability is (and what it is not)
Observability for AI features is not application performance monitoring. It is not just "is request time under 2 seconds?" It is a structured record of:
What the model received — prompt text, system instructions, context data, token count.
What the model responded — output text, structured output, any tool calls, output token count.
What it cost — input tokens, output tokens, cached tokens, estimated price.
When it happened — timestamp, latency, model version, prompt version.
Who it concerned — user ID (hashed), workspace ID, feature ID.
Whether it succeeded — success/failure, error code, model confidence (where available).
Together, these constitute an audit trail. They make it possible to answer questions such as: "What did the system tell user X about topic Y on that day?", "Is there a pattern in failed calls in the past week?", and "Has model confidence dropped since we changed the prompt version?"
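The six categories above can be captured in a single record type. A minimal sketch in TypeScript — field names are illustrative, not a fixed Langfuse or AI SDK schema:

```typescript
// Illustrative trace record covering the six categories above.
// Field names are assumptions, not a prescribed Langfuse/AI SDK schema.
interface AiTrace {
  // What the model received
  promptText: string;
  inputTokens: number;
  // What the model responded
  outputText: string;
  outputTokens: number;
  // What it cost
  cachedTokens: number;
  estimatedCostUsd: number;
  // When it happened
  timestamp: string; // ISO 8601
  latencyMs: number;
  modelVersion: string;
  promptVersion: string;
  // Who it concerned
  userIdHash: string; // hashed, never plain text
  workspaceId: string;
  functionId: string;
  // Whether it succeeded
  success: boolean;
  errorCode?: string;
  confidence?: number; // 0.00 to 1.00, where the model reports one
}

// A trace is auditable only if the identity and versioning fields
// needed to answer "what, when, for whom, with which prompt" are present.
function isAuditable(t: AiTrace): boolean {
  return (
    t.functionId.length > 0 &&
    t.modelVersion.length > 0 &&
    t.promptVersion.length > 0 &&
    t.userIdHash.length > 0
  );
}
```

The point of the type is not the exact fields but the completeness check: a trace missing any of the versioning or identity fields cannot answer the audit questions above.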
An AI system without traces is a black box. A black box cannot be audited. Something that cannot be audited cannot be regulated. From the AI Act's perspective, this is not a technical choice — it is a compliance question.
The trace structure: functionId, tokens, latency
The practical implementation starts with defining a consistent trace structure across all AI calls in the system.
The minimum pattern uses five fields:
await generateText({
  model: models.fast,
  messages: [...],
  experimental_telemetry: {
    isEnabled: true,
    functionId: "capability.enrich.v2", // semantic ID, not system ID
    metadata: {
      workspaceId: ctx.workspaceId,
      userId: hash(ctx.userId), // never plain-text user ID
      promptVersion: "v2.3.1", // semantic versioning
      inputType: "capability-description",
    },
  },
})
functionId is the most important field. It is what makes it possible to group traces across calls and identify patterns. Use semantic names, not system IDs: capability.enrich.v2 tells you more than ai_call_4892.
Versions in promptVersion are critical for correlating output quality changes with prompt changes. Without them, it is impossible to answer: "When did output start changing, and what did we change?"
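The hash(ctx.userId) call in the example can be a one-way digest, so traces remain groupable per user without ever storing the raw ID. A minimal sketch using Node's built-in crypto module — the salt handling is an assumption, and in practice the secret would come from configuration:

```typescript
import { createHash } from "node:crypto";

// Illustrative stand-in for a configured secret; never hard-code in production.
const HASH_SALT = process.env.TRACE_HASH_SALT ?? "dev-only-salt";

// One-way, deterministic hash: the same userId always maps to the same
// value, so traces can be grouped per user without exposing the ID itself.
function hashUserId(userId: string): string {
  return createHash("sha256")
    .update(HASH_SALT)
    .update(userId)
    .digest("hex");
}
```

Determinism is the property that matters here: a random salt per call would protect the ID but destroy the ability to group traces by user.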
ADR-0001 metadata fields in practice
ADR-0001 (Spekir's data architecture standard) defines a set of mandatory fields for all tables that store AI output. These fields are not bureaucracy — they are the data part of the audit trail.
For an output register (example: capability description generated by AI) the field structure looks like this:
ai_model_version text -- "claude-sonnet-4-6-2026-01"
ai_prompt_version text -- "v2.3.1"
ai_confidence numeric(3,2) -- 0.00 to 1.00
ai_classified_at timestamptz -- when it was generated
ai_needs_review boolean -- flagged for human review
ai_trace_id text -- Langfuse trace ID
ai_input_tokens integer -- actual token consumption
ai_output_tokens integer -- actual token consumption
ai_cached_tokens integer -- tokens read from cache
ai_cost_usd numeric(10,6) -- estimated cost
These fields make it possible to:
Answer audit questions: "What was the model version when this capability description was generated, and which prompt version were we using?"
Identify stale output: "Which capabilities have AI output generated with a model version older than six months?"
Estimate costs per workspace or per feature: "What does capability enrichment cost us per workspace per month?"
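The token fields make ai_cost_usd a derived value. A minimal sketch of the estimate, assuming illustrative per-million-token prices — the real numbers depend on the model's price sheet and are not claimed here:

```typescript
// Illustrative per-million-token prices; substitute the actual price
// sheet for the model in use. Cached input tokens are assumed to be
// billed at a fraction of the normal input price.
const PRICES_PER_MTOK = { input: 3.0, output: 15.0, cachedRead: 0.3 };

function estimateCostUsd(
  inputTokens: number,
  outputTokens: number,
  cachedTokens: number, // subset of inputTokens that was read from cache
): number {
  const freshInput = inputTokens - cachedTokens;
  const usd =
    (freshInput * PRICES_PER_MTOK.input +
      cachedTokens * PRICES_PER_MTOK.cachedRead +
      outputTokens * PRICES_PER_MTOK.output) /
    1_000_000;
  return Number(usd.toFixed(6)); // matches numeric(10,6) in the data model
}
```

Storing the estimate per row, rather than recomputing it later, is deliberate: prices change, and the audit trail should record what the call cost at the time it was made.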
EU AI Act Article 15 in practice
Article 15 of the EU AI Act imposes three requirements on high-risk AI systems: accuracy, robustness, and cybersecurity. What does this mean operationally?
Accuracy: The system must produce output that is correct relative to its purpose, within a defined error rate. This requires measuring output quality continuously — not just at launch, but over time. Without traces, you cannot measure.
Robustness: The system must handle failure scenarios without producing harmful outputs. This requires being able to identify when the system produces unexpected output and what the specific input characteristics were. Without traces, you cannot identify patterns.
Cybersecurity: The system must be protected against prompt injection and adversarial inputs. This requires logging inputs that are markedly different from expected patterns. Without input logging, you cannot detect attack patterns.
Article 15 is not a requirement to implement a specific system. It is a requirement to have control over the system's behaviour. Observability is the instrument that gives you that control.
Implementation plan: four steps
Observability for AI features can be implemented progressively. Not everything at once.
Step 1 — Basic telemetry: Enable experimental_telemetry on all generateText/streamText calls with consistent functionId names. If you use Langfuse, this is sufficient to start seeing traces in the Langfuse console.
Step 2 — Metadata fields in the data model: Add ai_model_version, ai_prompt_version, ai_classified_at, and ai_needs_review to tables that store AI output. Start with the most important features.
Step 3 — Token tracking: Add ai_input_tokens, ai_output_tokens, ai_cached_tokens, and ai_cost_usd to the data model. Use these for per-workspace cost tracking and for identifying calls with unexpected token consumption.
Step 4 — AI review flag: Implement a pattern that sets ai_needs_review = true based on rules (low confidence, short output, token consumption above threshold). Build a review queue that exposes these to a human reviewer.
Each step adds observability that can be used independently of the others.
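Step 4 can start as a pure rule function evaluated after each call. A minimal sketch — the thresholds are illustrative assumptions and would be tuned per feature in practice:

```typescript
// Rule-based review flag (Step 4). Thresholds are illustrative
// assumptions, not recommendations; tune them per feature.
interface ReviewInput {
  confidence: number | null; // null when the model reports no confidence
  outputTokens: number;
  totalTokens: number; // input + output
}

function needsReview(r: ReviewInput): boolean {
  if (r.confidence !== null && r.confidence < 0.7) return true; // low confidence
  if (r.outputTokens < 20) return true; // suspiciously short output
  if (r.totalTokens > 50_000) return true; // unexpected token consumption
  return false;
}
```

Keeping the rules in one pure function makes them trivial to test and to version alongside the prompt, so a change in review volume can be traced back to a rule change rather than guessed at.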
What observability does not solve
It is important to delineate what observability is an instrument for, and what requires other measures.
Observability shows what happened. It does not tell you whether it was the right thing to do. A model that systematically gives low confidence scores on a specific input type will show up in traces — but traces do not tell you whether it is a prompt problem, a model problem, or a data problem.
Traces are a diagnostic instrument, not a solution in themselves. They reduce the time it takes to reach the right diagnosis — but the diagnosis still requires human interpretation.
What to do tomorrow
Observability is an investment with returns from day one. Three steps to get started:
Week 1: Enable experimental_telemetry on all existing AI calls. Define a consistent functionId naming scheme. Verify that traces appear in Langfuse.
Week 2: Add ai_model_version, ai_prompt_version, and ai_classified_at to the database tables that store AI output. Start with the most important features.
Week 3: Implement the ai_needs_review flag with one simple rule (example: ai_confidence < 0.7). Build a minimal review list that exposes the flagged output.
Build it in steps. But start today. Every call without a trace is a diagnostic and learning opportunity lost.
References
[1] Regulation (EU) 2024/1689 of the European Parliament and of the Council on artificial intelligence (AI Act), Article 15 — Accuracy, robustness and cybersecurity, available at eur-lex.europa.eu.
[2] Langfuse, "AI Observability & Analytics", available at langfuse.com/docs (accessed 2026-04-23).
[3] Anthropic, "Usage Metadata and Caching Statistics", available at docs.anthropic.com/en/api/messages (accessed 2026-04-23).