3 Commits

Author SHA1 Message Date
Ronni Baslund 4d9e906ec1 feat(audit): cold-storage archival to S3 (Phase 4)
Final piece of the audit work. Events older than the hot retention window
move to S3-compatible object storage with signed manifests. Production uses
Hetzner Object Storage; dev uses a MinIO container with the same API.

Infra (infrastructure/docker-compose):
  - New `minio` service exposing the S3 API at minio:9000 + admin console at
    minio.dezky.local. Healthchecked. Bucket-init sidecar runs `mc mb` once
    to create `dezky-audit`; safe to re-run.
  - .env adds MINIO_ROOT_USER + MINIO_ROOT_PASSWORD.
  - platform-api env: AUDIT_COLD_{ENDPOINT,REGION,BUCKET,ACCESS_KEY,SECRET_KEY}
    + AUDIT_HOT_RETENTION_DAYS=90 + ARCHIVE_ENABLED=false (dormant in dev;
    operator UI's "Run archive now" bypasses this gate). AUDIT_COLD_SSE
    opts into SSE-S3 — left unset in dev because MinIO without a KMS rejects
    AES256 PUTs with "KMS is not configured".
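
A minimal sketch of how platform-api might read these variables. The
`loadColdConfig` helper and the `ColdConfig` shape are hypothetical (not taken
from this commit); only the env var names come from the list above. SSE is
treated as opt-in, so an unset AUDIT_COLD_SSE sends no encryption header.

```typescript
// Hypothetical config loader: the function name and return shape are
// assumptions for illustration; the env var names are from the commit.
interface ColdConfig {
  endpoint: string;
  region: string;
  bucket: string;
  accessKey: string;
  secretKey: string;
  hotRetentionDays: number;
  archiveEnabled: boolean;
  sse: "AES256" | undefined;
}

function loadColdConfig(env: Record<string, string | undefined>): ColdConfig {
  // Fail fast on a missing required variable instead of at first upload.
  const need = (name: string): string => {
    const value = env[name];
    if (!value) throw new Error(`missing env var ${name}`);
    return value;
  };
  return {
    endpoint: need("AUDIT_COLD_ENDPOINT"),
    region: need("AUDIT_COLD_REGION"),
    bucket: need("AUDIT_COLD_BUCKET"),
    accessKey: need("AUDIT_COLD_ACCESS_KEY"),
    secretKey: need("AUDIT_COLD_SECRET_KEY"),
    hotRetentionDays: Number(env.AUDIT_HOT_RETENTION_DAYS ?? "90"),
    // Dormant unless explicitly enabled.
    archiveEnabled: env.ARCHIVE_ENABLED === "true",
    // Opt-in: only request SSE-S3 when the operator asks for it.
    sse: env.AUDIT_COLD_SSE === "true" ? "AES256" : undefined,
  };
}
```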

Platform-api (services/platform-api/src/cold/):
  - cold-storage.client.ts: thin @aws-sdk/client-s3 wrapper — put/head/list.
    forcePathStyle=true so MinIO and Hetzner both work; same code, env-swap.
  - archive.service.ts: runOnce() selects chained events with at < cutoff →
    serializes to JSONL → gzip → sha256s → uploads JSONL + signed manifest
    → HEAD-confirms both objects exist → records an ArchiveBatch doc → only
    then deletes from hot Mongo. Crash-safe: a failed upload leaves events
    in hot. Manifest uses the Phase 3 AUDIT_SIGNING_KEY (HMAC-SHA-256), so
    archives + checkpoints share one trust chain. The ARCHIVE_ENABLED gate
    is bypassable via { override: true } for the operator UI's force-run.
  - archive.worker.ts: hourly tick guarded by configured run-hour-UTC
    (default 03:00) + day-guard so the same UTC day doesn't archive twice.
    Disabled until ARCHIVE_ENABLED=true.
  - archive-batch.schema.ts: { archivedAt, startSeq, endSeq, eventCount,
    manifestSha256, jsonlKey, manifestKey, bytesUncompressed }. The
    manifest sha256 stored in Mongo lets us detect manifest tampering
    without downloading the actual manifest.
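
The JSONL → gzip → sha256 → signed-manifest steps above can be sketched roughly
as follows. This is an illustration, not the code in archive.service.ts: the
`buildArchive` name, the manifest field layout, the choice to hash the
compressed bytes, and signing the JSON-serialized manifest body are all
assumptions.

```typescript
import { createHash, createHmac } from "node:crypto";
import { gzipSync } from "node:zlib";

// Hypothetical event shape; real audit events carry more fields.
interface ArchivedEvent { seq: number; at: string; action: string; hash: string }

const sha256Hex = (data: Buffer | string): string =>
  createHash("sha256").update(data).digest("hex");

function buildArchive(events: ArchivedEvent[], signingKeyHex: string) {
  if (events.length === 0) throw new Error("nothing to archive");

  // One JSON document per line, then gzip the whole payload.
  const jsonl = Buffer.from(events.map((e) => JSON.stringify(e)).join("\n") + "\n", "utf8");
  const gz = gzipSync(jsonl);

  // Assumed manifest layout: describes the batch and pins the payload hash.
  const unsigned = {
    startSeq: events[0].seq,
    endSeq: events[events.length - 1].seq,
    eventCount: events.length,
    bytesUncompressed: jsonl.length,
    jsonlSha256: sha256Hex(gz), // assumption: hash covers the gzipped bytes
    sigAlg: "HMAC-SHA-256",
  };

  // Sign with the same key the checkpoints use (32-byte hex secret).
  const signature = createHmac("sha256", Buffer.from(signingKeyHex, "hex"))
    .update(JSON.stringify(unsigned))
    .digest("hex");

  const manifest = { ...unsigned, signature };
  const manifestBody = Buffer.from(JSON.stringify(manifest), "utf8");

  // manifestSha256 is what an ArchiveBatch doc would store in Mongo.
  return { gz, manifest, manifestBody, manifestSha256: sha256Hex(manifestBody) };
}
```

Keeping the manifest hash next to the batch record means a later comparison
only needs the hash of the stored manifest object, not a re-derivation from the
events.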

Audit module additions:
  - audit.controller.ts: GET /audit/archives, POST /audit/archive/run,
    /audit/verify now reports { oldestHotSeq, highestArchivedSeq } so the
    UI shows the tier boundary.

Operator UI (apps/operator):
  - 2 new proxies: /api/audit/archives + /api/audit/archive/run (force
    override=true). Both behind operator auth via the existing platformApi
    helper.
  - audit.vue: new "Cold storage" card with batch table (archived-at, seq
    range, event count, size, truncated manifest sha256), "Run archive
    now" button + per-run result line.

Smoke-tested end-to-end:
  - 7 chained events in hot. /api/audit/archive/run → ok=true, batchId
    returned. JSONL + manifest both exist in MinIO (verified via mc ls +
    mc cat). Mongo's chained set went 7 → 0. Verify reports
    highestArchivedSeq=1446 (seq numbers are still consumed when an
    Authentik event is rejected as a duplicate key, hence the gaps).
    Operator /audit panel shows the batch with manifest hash 1d8263…
  - First attempt with SSE-S3 enabled failed cleanly (MinIO KMS not
    configured) — archive service correctly left events in hot Mongo.
    Made SSE opt-in via AUDIT_COLD_SSE=true; prod turns it on.
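
The "failed cleanly" behavior follows from the ordering: nothing is deleted from
hot storage until both uploads are confirmed. A rough sketch of that ordering,
with hypothetical `ColdStore` / `HotStore` interfaces and an
`archiveThenDelete` function standing in for the real client and service:

```typescript
// Hypothetical interfaces; the real code wraps @aws-sdk/client-s3 and Mongo.
interface ColdStore {
  put(key: string, body: Uint8Array): Promise<void>;
  head(key: string): Promise<boolean>;
}
interface HotStore {
  recordBatch(batch: { startSeq: number; endSeq: number; jsonlKey: string; manifestKey: string }): Promise<void>;
  deleteSeqRange(startSeq: number, endSeq: number): Promise<void>;
}
interface Batch {
  startSeq: number;
  endSeq: number;
  jsonlKey: string;
  manifestKey: string;
  jsonl: Uint8Array;
  manifest: Uint8Array;
}

async function archiveThenDelete(
  cold: ColdStore,
  hot: HotStore,
  batch: Batch,
): Promise<{ ok: true } | { ok: false; error: string }> {
  try {
    // 1. Upload payload and manifest; any rejection aborts before deletion.
    await cold.put(batch.jsonlKey, batch.jsonl);
    await cold.put(batch.manifestKey, batch.manifest);

    // 2. HEAD-confirm both objects actually exist.
    const confirmed = (await cold.head(batch.jsonlKey)) && (await cold.head(batch.manifestKey));
    if (!confirmed) throw new Error("upload not confirmed");

    // 3. Record the batch, and only then delete from hot storage.
    await hot.recordBatch({
      startSeq: batch.startSeq,
      endSeq: batch.endSeq,
      jsonlKey: batch.jsonlKey,
      manifestKey: batch.manifestKey,
    });
    await hot.deleteSeqRange(batch.startSeq, batch.endSeq);
    return { ok: true };
  } catch (err) {
    // Events stay in hot storage; the next run can retry the same range.
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}
```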

Out of scope (each could be its own session):
  - Restore-to-hot endpoint (today: download from S3 + offline query)
  - Client-side encryption (today: SSE-S3 in prod, none in dev)
  - Multi-region replication
  - Soft TTL safety net (defense-in-depth on top of app-managed deletion)

This completes the four-phase audit log work:
  1. platform-api as audit hub
  2. External system ingest (Authentik / Stalwart / OCIS)
  3. Hash-chain + signed checkpoints (tamper evidence)
  4. Cold-storage archival (retention without unbounded Mongo growth)
2026-05-24 21:03:41 +02:00
Ronni Baslund 9435baa09d feat(audit): hash-chain tamper evidence + signed checkpoints (Phase 3)
The audit log now carries cryptographic chain-of-custody. Every chained
event references the previous event's sha256, and periodic checkpoints
sign the head with HMAC-SHA-256. An attacker who modifies a historical
row must also forge every checkpoint signature past it — which requires
the AUDIT_SIGNING_KEY, kept outside Mongo.
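
To illustrate the chaining idea, here is a small sketch in which each event's
hash covers its own fields plus the previous event's hash. The `canonical`,
`hashEvent`, and `append` names and the reduced event shape are assumptions for
the example; the real canonical form lives in canonical.ts and may differ.

```typescript
import { createHash } from "node:crypto";

// Reduced, hypothetical event shape; real events carry many more fields.
interface ChainedEvent {
  seq: number;
  action: string;
  prevHash: string | null;
  hash: string;
}

// Stable JSON: object keys are sorted so equal events serialize identically.
function canonical(value: unknown): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(canonical).join(",")}]`;
  const obj = value as Record<string, unknown>;
  const keys = Object.keys(obj).filter((k) => obj[k] !== undefined).sort();
  return `{${keys.map((k) => `${JSON.stringify(k)}:${canonical(obj[k])}`).join(",")}}`;
}

function hashEvent(body: Omit<ChainedEvent, "hash">): string {
  return createHash("sha256").update(canonical(body)).digest("hex");
}

// Appends an event whose prevHash points at the current chain head.
function append(chain: ChainedEvent[], action: string): ChainedEvent {
  const prev = chain[chain.length - 1];
  const body = {
    seq: prev ? prev.seq + 1 : 1,
    action,
    prevHash: prev ? prev.hash : null,
  };
  const event: ChainedEvent = { ...body, hash: hashEvent(body) };
  chain.push(event);
  return event;
}
```

Because `prevHash` is part of the hashed body, changing any earlier row changes
every hash after it, which is what the signed checkpoints then pin down.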

Schema (services/platform-api/src/schemas/):
  - audit-event.schema.ts: new `seq` (monotonic) + `chained` (flag set on
    Phase 3 or later rows) + `prevHash` + `hash`. Unique index on seq with
    a partial filter so pre-Phase-3 rows don't collide on null.
  - audit-counter.schema.ts: single doc `_id='audit_seq'`, incremented
    atomically by findOneAndUpdate($inc).
  - audit-checkpoint.schema.ts: { at, headSeq, headHash, signature,
    sigAlg, reason }. Reason ∈ {startup, interval, threshold, manual}.

Audit module (services/platform-api/src/audit/):
  - canonical.ts: stable JSON form + hashCanonical (sha256) +
    checkpointSignature (HMAC-SHA-256) + verifyCheckpointSignature
    (timingSafeEqual). Single source of truth for hash inputs — schema
    additions land here at the same time as the field.
  - audit.service.ts: record() now allocates seq → looks up lastHash() →
    computes hash → inserts. Per-process write mutex serializes the
    allocate+lookup so concurrent writers don't both chain off the same
    predecessor. Documented multi-instance caveat (needs Mongo replica
    set + transactions OR a distributed lock).
  - checkpoint.service.ts: scheduler triggers on startup + every 5min
    + threshold of 100 events accumulated. Skips when no new chained
    events since the last anchor.
  - verifier.service.ts: walks chain in seq order, recomputes each
    hash, validates checkpoint signatures. Returns a precise break:
    'event-hash-mismatch' (in-place modification),
    'event-prev-hash-mismatch' (insertion/deletion), or
    'checkpoint-signature-mismatch'.
  - audit.controller.ts: GET /audit/verify, GET /audit/checkpoint/latest,
    POST /audit/checkpoint (manual force).
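
The per-process write mutex mentioned for audit.service.ts could look roughly
like the promise-chain lock below. `WriteMutex`, `record`, and the in-memory
`chainLog` are hypothetical stand-ins; the point is only that allocate + lookup
+ insert run as one serialized critical section within a single process.

```typescript
// Promise-chain mutex: each task starts only after the previous one settles.
class WriteMutex {
  private tail: Promise<void> = Promise.resolve();

  run<T>(task: () => Promise<T>): Promise<T> {
    const result = this.tail.then(task);
    // Keep the queue moving even if a task rejects.
    this.tail = result.then(() => undefined, () => undefined);
    return result;
  }
}

// In-memory stand-ins for the counter doc and the events collection.
const mutex = new WriteMutex();
const chainLog: { seq: number; prevSeq: number | null }[] = [];
let counter = 0;
const tick = () => new Promise<void>((resolve) => setTimeout(resolve, 0));

async function record(): Promise<void> {
  await mutex.run(async () => {
    const seq = ++counter;                      // allocate seq
    const last = chainLog[chainLog.length - 1]; // look up the predecessor
    await tick();                               // simulated I/O gap
    chainLog.push({ seq, prevSeq: last ? last.seq : null });
  });
}
```

Without the mutex, two concurrent `record()` calls could both read the same
predecessor during the I/O gap. This only protects a single process, which is
why the multi-instance caveat above still applies.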

Operator UI (apps/operator/):
  - 3 new proxies under /api/audit/{verify, checkpoint/latest, checkpoint}.
  - pages/audit.vue: new "Tamper evidence" card with "Force checkpoint"
    + "Verify chain" buttons. Header shows live head seq; result line
    shows verified count or a precise break (kind + seq + expected vs
    actual hash). Background tinted green/red on ok/broken.

Env (.env + docker-compose.yml):
  - new AUDIT_SIGNING_KEY (32-byte hex HMAC secret). Prod swaps this for
    ed25519 from an HSM/KMS; verifier code stays the same because sigAlg
    is on the checkpoint doc.
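
A sketch of why storing sigAlg on the checkpoint keeps the verifier stable: the
verifier can dispatch on the field. The function names match those listed for
canonical.ts, but the bodies, the `"HMAC-SHA-256"` string value, and the signed
payload format (`headSeq:headHash`) are assumptions for illustration.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Reduced checkpoint shape; `at` and `reason` are omitted here.
interface Checkpoint {
  headSeq: number;
  headHash: string;
  sigAlg: string;
  signature: string;
}

// Assumed payload format: "<headSeq>:<headHash>", keyed with the hex secret.
function checkpointSignature(headSeq: number, headHash: string, keyHex: string): string {
  return createHmac("sha256", Buffer.from(keyHex, "hex"))
    .update(`${headSeq}:${headHash}`)
    .digest("hex");
}

function verifyCheckpointSignature(cp: Checkpoint, keyHex: string): boolean {
  // Dispatch on the stored algorithm; an ed25519 branch would slot in here.
  if (cp.sigAlg !== "HMAC-SHA-256") {
    throw new Error(`unsupported sigAlg ${cp.sigAlg}`);
  }
  const expected = Buffer.from(checkpointSignature(cp.headSeq, cp.headHash, keyHex), "hex");
  const actual = Buffer.from(cp.signature, "hex");
  // timingSafeEqual throws on unequal lengths, so check length first.
  return expected.length === actual.length && timingSafeEqual(expected, actual);
}
```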

Smoke-tested three verify states against a clean chain of 5 events:
  - normal verify: ok=true, 5/5 events verified, 1 checkpoint signed
  - modified seq=3 in Mongo directly: verify returns ok=false with
    break = { kind: 'event-hash-mismatch', seq: 3, expected, actual }
  - restored, nuked checkpoint signature: break = { kind:
    'checkpoint-signature-mismatch', headSeq: 5 }
  - operator UI's verify panel reflects all three states correctly.
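
A simplified sketch of how a chain walk can distinguish the two event-level
break kinds. `verifyChain`, `makeChain`, and the reduced event shape are
hypothetical, checkpoint validation is left out, and the walk assumes the chain
starts at its first event (prevHash null).

```typescript
import { createHash } from "node:crypto";

interface Ev { seq: number; action: string; prevHash: string | null; hash: string }

type Break =
  | { kind: "event-hash-mismatch"; seq: number; expected: string; actual: string }
  | { kind: "event-prev-hash-mismatch"; seq: number; expected: string | null; actual: string | null };

// Stand-in for the canonical hash; the real input covers all event fields.
const hashOf = (e: Omit<Ev, "hash">): string =>
  createHash("sha256").update(JSON.stringify([e.seq, e.action, e.prevHash])).digest("hex");

function makeChain(actions: string[]): Ev[] {
  const chain: Ev[] = [];
  for (const action of actions) {
    const prev = chain[chain.length - 1];
    const body = { seq: chain.length + 1, action, prevHash: prev ? prev.hash : null };
    chain.push({ ...body, hash: hashOf(body) });
  }
  return chain;
}

function verifyChain(events: Ev[]): { ok: true; verified: number } | { ok: false; break: Break } {
  let prevHash: string | null = null;
  const ordered = [...events].sort((a, b) => a.seq - b.seq);
  for (const e of ordered) {
    // A broken link means a row was inserted or removed before this one.
    if (e.prevHash !== prevHash) {
      return { ok: false, break: { kind: "event-prev-hash-mismatch", seq: e.seq, expected: prevHash, actual: e.prevHash } };
    }
    // A stale stored hash means this row's fields were edited in place.
    const expected = hashOf(e);
    if (e.hash !== expected) {
      return { ok: false, break: { kind: "event-hash-mismatch", seq: e.seq, expected, actual: e.hash } };
    }
    prevHash = e.hash;
  }
  return { ok: true, verified: ordered.length };
}
```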

Legacy data: pre-Phase-3 events stay `chained: false` and are excluded
from the chain walk. Retroactive chaining of historical entries is a
one-off migration script we can run if we ever care to.

Out of scope (Phase 4 etc.):
  - TTL + cold-storage archival to Hetzner Object Storage
  - GDPR right-to-erasure tooling
  - ed25519 / HSM signing (swap is well-defined; sigAlg field is ready)
  - Multi-instance write coordination (Mongo transaction OR distributed
    lock when we scale platform-api beyond 1 replica)
2026-05-24 20:43:54 +02:00
Ronni Baslund 02341d8ba5 feat(audit): platform-api audit log + operator UI wired to real events
Phase 1 of the audit work — capture everything we control today, ingest from
external systems (Authentik / OCIS / Stalwart) in a later phase. The mock
OP_AUDIT fixture is gone; both the /audit page and Overview's activity card
now show real events recorded by AuditService.record() in platform-api.

Schema (services/platform-api/src/schemas/audit-event.schema.ts):
  AuditEvent { at, actorType, actorId, actorEmail, actorIp, action, outcome,
    resourceType, resourceId, resourceName, tenantSlug, partnerSlug, source,
    metadata, prevHash, hash }
  Indexes: {at:-1}, {tenantSlug,at:-1}, {actorId,at:-1}, {action,at:-1}.
  prevHash/hash are nullable for now; hash-chain tamper evidence is a later phase.

AuditService:
  - record(): best-effort write. Swallows errors so that a logging failure
    does not fail the underlying mutation that already succeeded. Failures
    are surfaced via Logger.
  - list() — filters: since/until/before, action (exact OR prefix match
    via leading-anchor regex), tenantSlug, partnerSlug, actorEmail, outcome,
    free-text q across action/resourceName/actorEmail/tenantSlug, limit
    (default 100, max 500). Cursor pagination via `before`.
  - No UPDATE/DELETE surface — entries are append-only by construction.
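
A rough in-memory sketch of the action filter and `before` cursor described for
list(). The `list` signature and `Row` shape here are hypothetical, and it
assumes `at` values are ISO-8601 UTC strings so that string comparison matches
time order. A single leading-anchor regex covers both exact and prefix matches.

```typescript
interface Row { at: string; action: string }

// Escape regex metacharacters so "tenant." matches a literal dot.
const escapeRegex = (s: string): string => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

function list(rows: Row[], q: { action?: string; before?: string; limit?: number }): Row[] {
  // Default 100, clamped to the 1..500 range.
  const limit = Math.min(Math.max(q.limit ?? 100, 1), 500);
  // "^tenant\." matches "tenant.suspended"; "^flag\.updated" matches exactly that verb.
  const actionRe = q.action ? new RegExp("^" + escapeRegex(q.action)) : null;
  return rows
    .filter((r) => !actionRe || actionRe.test(r.action))
    .filter((r) => !q.before || r.at < q.before) // cursor: strictly older than `before`
    .sort((a, b) => (a.at < b.at ? 1 : a.at > b.at ? -1 : 0)) // newest first
    .slice(0, limit);
}
```

To page, a caller would pass the `at` of the last row it received as `before`
on the next request.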

AuditController: GET /audit, behind JwtAuthGuard + OperatorGuard. No mutations
exposed; entries written internally by other modules.

X-Forwarded-For threading:
  - apps/operator/server/utils/platform-api.ts forwards the originating
    client IP to platform-api so audit entries carry a real address.
  - services/platform-api/src/auth/client-ip.ts extracts leftmost
    X-Forwarded-For, falls back to socket.remoteAddress.
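
A minimal sketch of the extraction described for client-ip.ts. The
`RequestLike` type is a hypothetical stand-in for the framework's request
object, and the body is an illustration rather than the committed code.

```typescript
// Minimal stand-in for the HTTP request; only the fields used here.
interface RequestLike {
  headers: Record<string, string | string[] | undefined>;
  socket: { remoteAddress?: string };
}

function clientIp(req: RequestLike): string | undefined {
  const raw = req.headers["x-forwarded-for"];
  // Node may expose repeated headers as an array; take the first entry.
  const header = Array.isArray(raw) ? raw[0] : raw;
  // "client, proxy1, proxy2": the leftmost entry is the originating client.
  const leftmost = header?.split(",")[0]?.trim();
  return leftmost || req.socket.remoteAddress;
}
```

The leftmost entry is only as trustworthy as the first proxy that sets it, so
this presumably relies on the operator app being the sole caller of
platform-api.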

Instrumented mutations (every one threads actor + IP through):
  Tenants: create, update, softDelete, setStatus(suspend/resume)
  Partners: create, update, terminate
  Flags:   create, update (incl. flag.killed verb when state=off+note=kill-switch),
           remove
  Users:   deactivate

Each controller resolves the User doc via ActorService, extracts IP via
clientIp(req), and passes { userId, email, ip } as AuditActor to the service.
FlagsService's local ActorRef collapses to AuditActor so flag history and the
audit log share one shape.

Operator UI:
  - /api/audit proxy that forwards query params verbatim
  - types/audit.ts
  - pages/audit.vue: real list with quick-pick action chips (All/Tenants/
    Partners/Flags/Users), outcome filter, free-text search, "Load older
    events" cursor pagination
  - pages/index.vue: Overview activity card swaps mock OP_AUDIT for the
    same /api/audit endpoint, rows link into /audit
  - data/fixtures.ts: OP_AUDIT / AuditEntry / AuditTone exports removed

Verified end-to-end: suspended + resumed acme, flipped oci_versioning through
rollout → kill → on, then /audit returned all 5 events with the right action
verbs (tenant.suspended, tenant.resumed, flag.updated, flag.killed,
flag.updated), actor admin@dezky.local, IP 192.168.65.1. Filters (action
prefix + free-text q) narrow correctly.

Out of scope for this commit (each gets its own conversation):
  - Authentik / OCIS / Stalwart ingest adapters (Phase 2)
  - Hash-chain tamper evidence (Phase 3)
  - TTL + cold-storage archival to Hetzner Object Storage (Phase 4)
  - GDPR right-to-erasure tooling
2026-05-24 19:50:24 +02:00