Most healthcare AI proposals fall apart at the audit log. The model is fine, the retrieval is fine, the UI is fine — and then the procurement reader asks how a subpoenaed regulator would replay any single clinical decision back to the prompt that generated it, and the proposal goes back to the drawing board. The audit log is not a feature; it is the substrate the rest of the system has to be designed against. For DRIS v1.0 the audit log architecture was the first thing we designed and the last thing we changed.
What HIPAA actually demands
HIPAA does not specify an audit-log schema. What it requires is that a covered entity can produce, on request, evidence of who accessed protected health information, when, what action they performed, and whether the action was authorised. GDPR Article 30 adds the requirement of a record of processing activities. The EU AI Act adds the requirement that high-risk AI systems maintain logs sufficient to support the assessment of compliance. Together these obligations imply a log that is (a) complete, (b) tamper-evident, (c) per-tenant isolated, (d) replay-able to the level of the underlying inference, and (e) retainable for a documented period under access controls. The technical implementation is open; the requirements are not.
What a multi-agent system makes harder
A single-call LLM application has a simple audit shape: one request in, one response out, log the pair. An agentic system breaks that shape. A single user request can fan out to multiple tool calls, each with its own retrieval, each with its own intermediate prompt, each with its own routing decision. By the time the system emits a clinical assertion, the path between the user request and the assertion may pass through ten or fifteen distinct inference and tool steps. A regulator who wants to replay the clinical decision has to be able to reconstruct that path exactly.
The first design we tried — log the final assertion plus a redacted summary of the intermediate steps — failed clinical governance review on the first pass. The reviewer's question was direct: if a patient was harmed by a clinical assertion this system emitted, can you produce, in a subpoena-ready form, the full causal chain that led to it? With a redacted summary, the answer is no. With a complete signed log, the answer is yes.
The schema we ended up with
Every agent step writes an event. An event has a fixed shape:
{
  "event_id": "uuid-v4",
  "tenant_id": "uuid-v4",
  "session_id": "uuid-v4",
  "user_id": "uuid-v4-hashed",
  "step_index": 42,
  "parent_step": 41,
  "timestamp": "ISO-8601-with-nanos",
  "actor": {
    "kind": "model" | "tool" | "router" | "validator",
    "model": "claude-sonnet-4-6",
    "tool": "openfda.query",
    "version": "v2.3.1"
  },
  "input": {
    "prompt": "...",
    "context": ["doc_id_1", "doc_id_2"],
    "params": {...}
  },
  "output": {
    "raw": "...",
    "structured": {...},
    "tokens": {"in": 412, "out": 88}
  },
  "validation": {
    "schema": "clinical_assertion.v3",
    "passed": true,
    "errors": []
  },
  "routing": {
    "next_step": "tool_call_openfda",
    "decision": "high_confidence_match",
    "rationale": "..."
  },
  "signature": "sha256:..."
}
The signature is computed over a canonical JSON serialisation of the event minus the signature field itself, using a per-tenant signing key rotated quarterly. The signature is verified at write time and at read time. Tampering breaks the verification.
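As a rough illustration of the signing step, the sketch below serialises an event without its signature field and computes a keyed digest over the result. It rests on several assumptions that the text above does not confirm: the keyed construction is taken to be HMAC-SHA256, the canonical form is approximated with sorted keys and compact separators, and the names `canonical_bytes`, `sign_event` and `verify_event` are hypothetical.

```python
import hashlib
import hmac
import json


def canonical_bytes(event: dict) -> bytes:
    """Serialise the event, minus its signature field, in a stable byte form."""
    unsigned = {k: v for k, v in event.items() if k != "signature"}
    # Assumption: sorted keys plus compact separators stand in for a real
    # canonicalisation scheme such as RFC 8785 (JCS).
    return json.dumps(
        unsigned, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def sign_event(event: dict, tenant_key: bytes) -> dict:
    """Return a copy of the event carrying a signature from the tenant key."""
    # Assumption: HMAC-SHA256 is the keyed construction behind "sha256:...".
    digest = hmac.new(tenant_key, canonical_bytes(event), hashlib.sha256).hexdigest()
    return {**event, "signature": f"sha256:{digest}"}


def verify_event(event: dict, tenant_key: bytes) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = sign_event(event, tenant_key)["signature"]
    return hmac.compare_digest(expected, str(event.get("signature", "")))
```

Calling `verify_event` once when the event is written and again whenever it is read mirrors the two verification points described above. Changing any signed field after the fact makes the recomputed digest differ from the stored one.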
Per-tenant isolation in practice
The audit log is partitioned by tenant at the database level using PostgreSQL Row Level Security. A tenant's audit log is readable only by the tenant's authenticated users and by the Phoenix Group governance role under explicit audit-mode access. The governance access itself is logged into a meta-audit table — the audit log auditing the audit-log access. The signing key is per-tenant and stored in a hardware-backed key store; the Phoenix Group operations role does not have access to the tenant signing keys.
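The access rule itself can be modelled in a few lines. The sketch below is not Row Level Security, which would be expressed as database policies. It is an in-memory stand-in for the same rule, with hypothetical names (`Reader`, `AuditStore`, and the role strings `tenant_user` and `governance`), intended only to show that a governance read under audit mode leaves its own meta-audit record.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class Reader:
    user_id: str
    role: str                        # hypothetical roles: "tenant_user", "governance"
    tenant_id: Optional[str] = None  # governance readers belong to no tenant
    audit_mode: bool = False         # explicit audit-mode flag for governance reads


@dataclass
class AuditStore:
    events: List[dict] = field(default_factory=list)
    meta_audit: List[dict] = field(default_factory=list)

    def read(self, reader: Reader, tenant_id: str) -> List[dict]:
        """Return one tenant's events if the reader is allowed to see them."""
        own_tenant = reader.role == "tenant_user" and reader.tenant_id == tenant_id
        governance = reader.role == "governance" and reader.audit_mode
        if not (own_tenant or governance):
            raise PermissionError(f"{reader.user_id} may not read tenant {tenant_id}")
        if governance:
            # Governance access is itself recorded in the meta-audit table.
            self.meta_audit.append({
                "reader": reader.user_id,
                "tenant_id": tenant_id,
                "timestamp": datetime.now(timezone.utc).isoformat(),
            })
        return [e for e in self.events if e["tenant_id"] == tenant_id]
```

In this sketch a governance reader without `audit_mode` set is refused in the same way as a user from another tenant.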
Replay-ability
Given a session_id, the audit log produces the full event sequence in chronological order. Each event references its parent step. The chain reconstructs the causal tree of inference, tool calls, validation decisions, and routing choices that led to the final clinical assertion. The replay is deterministic up to model temperature; for clinical assertions, the operating temperature is zero and the model version is pinned per session, so the replay is exactly reproducible.
For a subpoenaed replay, the operations team produces the event sequence plus the snapshot of the retrieval corpus at the timestamp of the session. The retrieval corpus snapshot is stored as a per-day immutable index reference; the system retains daily snapshots for the documented retention period. Together the event sequence and the retrieval snapshot are sufficient to reproduce the inference path exactly.
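The parent-step chain lends itself to a simple reconstruction. The sketch below assumes that `step_index` increases in chronological order within a session and that the root step carries a `parent_step` of `None`. Neither detail is stated above, and the function names are hypothetical.

```python
from typing import List


def session_events(events: List[dict], session_id: str) -> List[dict]:
    """All events for one session, ordered by step index."""
    return sorted(
        (e for e in events if e["session_id"] == session_id),
        key=lambda e: e["step_index"],
    )


def causal_chain(events: List[dict], session_id: str, final_step: int) -> List[dict]:
    """Walk parent_step links from the final assertion back to the root step."""
    by_step = {e["step_index"]: e for e in session_events(events, session_id)}
    chain = []
    step = final_step
    while step is not None:
        if step not in by_step:
            raise KeyError(f"session {session_id} has no step {step}")
        event = by_step[step]
        parent = event.get("parent_step")
        # Assumption: a parent always precedes its child, which also rules out cycles.
        if parent is not None and parent >= step:
            raise ValueError(f"step {step} has a non-preceding parent {parent}")
        chain.append(event)
        step = parent
    chain.reverse()  # root first, final assertion last
    return chain
```

`session_events` gives the full chronological sequence, while `causal_chain` isolates the single path that led to one assertion. A reviewer tracing a specific decision would usually start from that path.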
What we deliberately do not log
The audit log does not store unredacted PHI. The user input is stored with a tokenised reference to the per-tenant PHI vault; the PHI itself sits behind a separate access-controlled surface with its own access log. The audit log can prove that a particular PHI record was accessed; it does not reproduce the record itself. This separation lets the audit log be analysed without exposing PHI, and it lets a PHI record be deleted from the vault in response to a data-subject erasure request without breaking the audit chain. The signature over the event references the PHI by hash, not by content, so the signature survives PHI deletion.
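A small sketch shows why the signature survives deletion: the event carries only a token and a hash, so removing the vault entry changes nothing that was signed. The vault is modelled here as a plain dictionary, HMAC-SHA256 is again assumed as the signing construction, and `store_phi` and `sign` are hypothetical names.

```python
import hashlib
import hmac
import json
import uuid


def store_phi(vault: dict, phi_text: str) -> dict:
    """Put PHI in the vault and return the reference that the audit event keeps."""
    token = str(uuid.uuid4())
    vault[token] = phi_text
    # Assumption: a plain SHA-256 is used here for brevity; a production system
    # might prefer a keyed or salted hash for low-entropy PHI.
    phi_hash = hashlib.sha256(phi_text.encode("utf-8")).hexdigest()
    return {"phi_token": token, "phi_sha256": phi_hash}


def sign(event: dict, tenant_key: bytes) -> str:
    """Keyed digest over the event without its signature field."""
    unsigned = {k: v for k, v in event.items() if k != "signature"}
    body = json.dumps(unsigned, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return "sha256:" + hmac.new(tenant_key, body, hashlib.sha256).hexdigest()


vault = {}
key = b"hypothetical-tenant-key"
ref = store_phi(vault, "example patient note")
event = {"event_id": "e1", "input": {"prompt_ref": ref}}
event["signature"] = sign(event, key)

# Erasing the PHI removes the vault entry only; the signed event is untouched.
del vault[ref["phi_token"]]
still_valid = sign(event, key) == event["signature"]
```

After the deletion the event still proves that a specific record was referenced, but the record's content can no longer be recovered from the vault.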
The latency cost
The audit log adds approximately 8 milliseconds per event at p50 and 24 milliseconds at p99 in our production deployment. With an average of nine events per clinical assertion, that contributes about 72 milliseconds to an agent response when every write lands at p50, and up to 216 milliseconds when every write lands at p99. DRIS v1.0 reports sub-1.5-second median agent response latency end to end; the audit log is a substantial fraction of that budget and we paid it deliberately. A system that cannot survive a clinical governance review has no inference budget worth discussing.
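The arithmetic behind those figures is short enough to write out. The sketch assumes that audit writes sit sequentially on the critical path and uses 1.5 seconds as the upper bound of the stated latency budget; the constant names are illustrative.

```python
EVENTS_PER_ASSERTION = 9   # average events per clinical assertion
P50_MS = 8                 # per-event audit write latency at p50
P99_MS = 24                # per-event audit write latency at p99
BUDGET_MS = 1500           # upper bound of the sub-1.5-second median budget

# Assumption: writes are sequential, so per-event costs simply add up.
low_ms = EVENTS_PER_ASSERTION * P50_MS    # 72 ms
high_ms = EVENTS_PER_ASSERTION * P99_MS   # 216 ms

low_share = low_ms / BUDGET_MS            # 0.048, about 5% of the budget
high_share = high_ms / BUDGET_MS          # 0.144, about 14% of the budget
```

If the writes were batched or moved off the critical path, the added latency would fall below these figures, so they are best read as an upper estimate under the sequential assumption.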
The architectural lesson
The audit log is not the last layer you add. It is the first layer you design. Every other architectural decision in the system has to be expressible in audit events that a regulator can verify. Build the audit log first; build everything else against the audit log.
For DRIS that meant the tool surface had to be a small set of well-typed tools, because every tool call has to produce a structured audit event. It meant the agent had to be a deterministic-routed system rather than a free-roaming planner, because a free-roaming planner produces audit events that a reviewer cannot interpret. It meant the clinical assertion surface had to be schema-grade, because a free-form assertion cannot be replay-verified against a schema validator. The audit log shaped the entire architecture, and that is the right relationship.
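To make the routing point concrete, the sketch below uses a fixed lookup table, so that every routing decision maps to exactly one next step and produces a routing block a reviewer can read. Only the `high_confidence_match` to `tool_call_openfda` pair comes from the event schema above. The other entries, the `ROUTES` table and the `route` function are hypothetical.

```python
# Fixed routing table: (actor kind, decision) -> next step.
ROUTES = {
    ("router", "high_confidence_match"): "tool_call_openfda",
    # Hypothetical entries, included only to show the shape of the table.
    ("router", "low_confidence_match"): "request_clarification",
    ("validator", "schema_failed"): "halt_and_report",
}


def route(actor_kind: str, decision: str) -> dict:
    """Resolve the next step from the table and return an auditable routing block."""
    key = (actor_kind, decision)
    if key not in ROUTES:
        # An unknown decision is an error, never an improvised plan.
        raise ValueError(f"no route defined for {key}")
    return {
        "next_step": ROUTES[key],
        "decision": decision,
        "rationale": f"table lookup for {actor_kind}/{decision}",
    }
```

Because the table is finite and versionable, a reviewer can enumerate every path the agent is able to take. A free-roaming planner offers no equivalent list.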
Closing
Healthcare AI under HIPAA is buildable. The path is to design the audit log first and let it constrain the rest. The DRIS v1.0 audit log survived clinical governance review on its first pass after this design landed, and across its production lifetime to date it has carried zero unredacted PHI, zero assertions outside the schema, and zero unsigned events. That is the standard the system holds itself to, and it is the standard a regulator can verify against.