Commit Graph

36 Commits

Author SHA1 Message Date
Ronni Baslund 9a97945565 feat(operator): invite operator → creates user in Authentik
New "Invite operator" button + modal on /operator-team. Replaces the
bounce-to-Authentik flow with an inline invite that creates the user via
the Authentik API and pre-populates our local User doc so they appear
immediately.

services/platform-api/src/integrations/authentik.client.ts:
  - findUserByEmail(): early-conflict check before we attempt the create
  - createUser(): POST /core/users/ with username = email, internal type,
    is_active, attached to the supplied group PKs
  - addUserToGroup(): kept for tenant-member invites later
  - recoveryLink(): tries POST /core/users/{pk}/recovery/, returns
    undefined when no recovery flow is configured on the Authentik brand
    (we soft-fail and the service falls back to setInitialPassword)
  - setInitialPassword(): POST /core/users/{pk}/set_password/. Returns 204
    No Content so we bypass request<T>'s JSON parser and call fetch
    directly with explicit ok check.

services/platform-api/src/users/users.service.ts:
  - inviteOperator(dto, actor) orchestrates: dedup by email →
    findOrCreate Authentik group → create user in group → pre-create
    local User doc with platformAdmin=true so the list reflects them
    immediately → try recovery link → fall back to temp password →
    record platform.user_invited audit event with handoff method.
  - Return type is { subject, userId, link? | tempPassword? } —
    exactly one credential mode set depending on Authentik config.
  - generateTempPassword(): 16-char with at least one upper/lower/digit/
    symbol, shuffled. Confusable chars (I/O/0/1/l) omitted.
  - Cached platform-admin group ID after first lookup.

services/platform-api/src/users/users.controller.ts:
  - POST /users/invite behind OperatorGuard. Calls the service with
    actor + IP from the JWT/request.

apps/operator:
  - server/api/users/invite.post.ts: standard platformApi proxy.
  - components/InviteOperatorModal.vue: 2-step form. Step 1: name +
    email with client-side validation. Step 2: shows whichever
    credential the backend returned — recovery link OR username+
    temp-password — with copy-to-clipboard buttons and a note about
    SMTP/recovery-flow follow-up paths.
  - pages/operator-team.vue: "Invite operator" replaces "Manage in
    Authentik" as the primary action; Authentik link demoted to
    secondary. Refreshes the list on @invited so the new user shows
    up without a manual reload.

Verified end-to-end against real Authentik:
  - Invite created user pk=7, uid=f22f2bb…, group=dezky-platform-admins,
    is_active=true, temp password set. Modal showed both fields with
    copy buttons; operator-team count went 1 → 2 immediately. Audit
    event recorded (platform.user_invited with handoff='temp-password').
  - Recovery link path is preferred but Authentik has no recovery flow
    configured on the default brand. AuthentikClient.recoveryLink()
    soft-fails on the "No recovery flow set." 400, returns undefined,
    and inviteOperator transparently falls back to set_password. Once
    a recovery flow is configured (Authentik admin → Flows), the link
    path becomes active and the temp-password path stops firing
    without any code changes.

Known follow-ups:
  - Configure Authentik recovery flow so the link path activates
    (one-time admin task, not in code)
  - Outbound SMTP wiring (Phase 5/6) → Authentik can email link/temp
    directly; modal stops showing the credential
  - Deactivate / remove operator from inside the app (currently still
    Authentik UI; defensible until proven needed)
  - Tenant-member invite — similar flow but adds to tenant group
    instead, exposed from /users (global users) or tenant detail
2026-05-24 21:27:46 +02:00
Ronni Baslund 4d9e906ec1 feat(audit): cold-storage archival to S3 (Phase 4)
Final piece of the audit work. Events older than the hot retention window
move to S3-compatible object storage with signed manifests. Production uses
Hetzner Object Storage; dev uses a MinIO container with the same API.

Infra (infrastructure/docker-compose):
  - New `minio` service exposing the S3 API at minio:9000 + admin console at
    minio.dezky.local. Healthchecked. Bucket-init sidecar runs `mc mb` once
    to create `dezky-audit`; safe to re-run.
  - .env adds MINIO_ROOT_USER + MINIO_ROOT_PASSWORD.
  - platform-api env: AUDIT_COLD_{ENDPOINT,REGION,BUCKET,ACCESS_KEY,SECRET_KEY}
    + AUDIT_HOT_RETENTION_DAYS=90 + ARCHIVE_ENABLED=false (dormant in dev;
    operator UI's "Run archive now" bypasses this gate). AUDIT_COLD_SSE
    opts into SSE-S3 — left unset in dev because MinIO without a KMS rejects
    AES256 PUTs with "KMS is not configured".

Platform-api (services/platform-api/src/cold/):
  - cold-storage.client.ts: thin @aws-sdk/client-s3 wrapper — put/head/list.
    forcePathStyle=true so MinIO and Hetzner both work; same code, env-swap.
  - archive.service.ts: runOnce() selects chained events with at < cutoff →
    serializes to JSONL → gzip → sha256s → uploads JSONL + signed manifest
    → HEAD-confirms both objects exist → records an ArchiveBatch doc → only
    then deletes from hot Mongo. Crash-safe: a failed upload leaves events
    in hot. Manifest uses the Phase 3 AUDIT_SIGNING_KEY (HMAC-SHA-256), so
    archives + checkpoints share trust chain. Bypassable via { override:
    true } for the operator's UI force-run.
  - archive.worker.ts: hourly tick guarded by configured run-hour-UTC
    (default 03:00) + day-guard so the same UTC day doesn't archive twice.
    Disabled until ARCHIVE_ENABLED=true.
  - archive-batch.schema.ts: { archivedAt, startSeq, endSeq, eventCount,
    manifestSha256, jsonlKey, manifestKey, bytesUncompressed }. The
    manifest sha256 stored in Mongo lets us detect manifest tampering
    without downloading the actual manifest.

Audit module additions:
  - audit.controller.ts: GET /audit/archives, POST /audit/archive/run,
    /audit/verify now reports { oldestHotSeq, highestArchivedSeq } so the
    UI shows the tier boundary.

Operator UI (apps/operator):
  - 2 new proxies: /api/audit/archives + /api/audit/archive/run (force
    override=true). Both behind operator auth via the existing platformApi
    helper.
  - audit.vue: new "Cold storage" card with batch table (archived-at, seq
    range, event count, size, truncated manifest sha256), "Run archive
    now" button + per-run result line.

Smoke-tested end-to-end:
  - 7 chained events in hot. /api/audit/archive/run → ok=true, batchId
    returned. JSONL + manifest both exist in MinIO (verified via mc ls +
    mc cat). Mongo's chained set went 7 → 0. Verify reports
    highestArchivedSeq=1446 (since we burn-allocate seqs on Authentik
    dup-key rejections). Operator /audit panel shows the batch with
    manifest hash 1d8263…
  - First attempt with SSE-S3 enabled failed cleanly (MinIO KMS not
    configured) — archive service correctly left events in hot Mongo.
    Made SSE opt-in via AUDIT_COLD_SSE=true; prod turns it on.

Out of scope (each could be its own session):
  - Restore-to-hot endpoint (today: download from S3 + offline query)
  - Client-side encryption (today: SSE-S3 in prod, none in dev)
  - Multi-region replication
  - Soft TTL safety net (defense-in-depth on top of app-managed deletion)

This completes the four-phase audit log work:
  1. platform-api as audit hub
  2. External system ingest (Authentik / Stalwart / OCIS)
  3. Hash-chain + signed checkpoints (tamper evidence)
  4. Cold-storage archival (retention without unbounded Mongo growth)
2026-05-24 21:03:41 +02:00
Ronni Baslund 9435baa09d feat(audit): hash-chain tamper evidence + signed checkpoints (Phase 3)
The audit log now carries cryptographic chain-of-custody. Every chained
event references the previous event's sha256, and periodic checkpoints
sign the head with HMAC-SHA-256. An attacker who modifies a historical
row must also forge every checkpoint signature past it — which requires
the AUDIT_SIGNING_KEY, kept outside Mongo.

Schema (services/platform-api/src/schemas/):
  - audit-event.schema.ts: new `seq` (monotonic) + `chained` (Phase-3-or-
    later flag) + `prevHash` + `hash`. Compound unique index on seq with
    partial filter so pre-Phase-3 rows don't collide on null.
  - audit-counter.schema.ts: single doc `_id='audit_seq'`, incremented
    atomically by findOneAndUpdate($inc).
  - audit-checkpoint.schema.ts: { at, headSeq, headHash, signature,
    sigAlg, reason }. Reason ∈ {startup, interval, threshold, manual}.

Audit module (services/platform-api/src/audit/):
  - canonical.ts: stable JSON form + hashCanonical (sha256) +
    checkpointSignature (HMAC-SHA-256) + verifyCheckpointSignature
    (timingSafeEqual). Single source of truth for hash inputs — schema
    additions land here at the same time as the field.
  - audit.service.ts: record() now allocates seq → looks up lastHash() →
    computes hash → inserts. Per-process write mutex serializes the
    allocate+lookup so concurrent writers don't both chain off the same
    predecessor. Documented multi-instance caveat (needs Mongo replica
    set + transactions OR a distributed lock).
  - checkpoint.service.ts: scheduler triggers on startup + every 5min
    + threshold of 100 events accumulated. Skips when no new chained
    events since the last anchor.
  - verifier.service.ts: walks chain in seq order, recomputes each
    hash, validates checkpoint signatures. Returns a precise break:
    'event-hash-mismatch' (in-place modification), 'event-prev-hash-
    mismatch' (insertion/deletion), or 'checkpoint-signature-mismatch'.
  - audit.controller.ts: GET /audit/verify, GET /audit/checkpoint/latest,
    POST /audit/checkpoint (manual force).

Operator UI (apps/operator/):
  - 3 new proxies under /api/audit/{verify, checkpoint/latest, checkpoint}.
  - pages/audit.vue: new "Tamper evidence" card with "Force checkpoint"
    + "Verify chain" buttons. Header shows live head seq; result line
    shows verified count or a precise break (kind + seq + expected vs
    actual hash). Background tinted green/red on ok/broken.

Env (.env + docker-compose.yml):
  - new AUDIT_SIGNING_KEY (32-byte hex HMAC secret). Prod swaps this for
    ed25519 from an HSM/KMS; verifier code stays the same because sigAlg
    is on the checkpoint doc.

Smoke-tested all three break paths against a clean chain of 5 events:
  - normal verify: ok=true, 5/5 events verified, 1 checkpoint signed
  - modified seq=3 in Mongo directly: verify returns ok=false with
    break = { kind: 'event-hash-mismatch', seq: 3, expected, actual }
  - restored, nuked checkpoint signature: break = { kind:
    'checkpoint-signature-mismatch', headSeq: 5 }
  - operator UI's verify panel reflects all three states correctly.

Legacy data: pre-Phase-3 events stay `chained: false` and are excluded
from the chain walk. Retroactive chaining of historical entries is a
one-off migration script we can run if we ever care to.

Out of scope (Phase 4 etc.):
  - TTL + cold-storage archival to Hetzner Object Storage
  - GDPR right-to-erasure tooling
  - ed25519 / HSM signing (swap is well-defined; sigAlg field is ready)
  - Multi-instance write coordination (Mongo transaction OR distributed
    lock when we scale platform-api beyond 1 replica)
2026-05-24 20:43:54 +02:00
Ronni Baslund df18128617 feat(audit): OCIS file-tail ingest worker (Phase 2 chunk 3)
Tails OCIS's JSON-Lines audit log on a shared Docker volume and forwards
mutations into AuditService. Final piece of Phase 2 — the /audit page now
unifies platform-api, authentik, and ocis events on one timeline.

services/platform-api/src/ingest/ocis.ingest.ts:
  - 5s polling loop (fs.watch is unreliable across Docker bind mounts on
    macOS). Stat → detect inode change or truncation → resume from byte
    position OR start over.
  - Cursor in IngestCursor stores lastEventId = "<inode>:<bytePosition>".
    Restarts resume cleanly; on overlap the (source, externalId) unique
    index dedups silently.
  - Lines collected first, then processed sequentially after the read
    stream closes. Earlier draft fired recordOne() from inside the
    readline 'line' callback which would have resolved the stream
    before all writes finished — same class of race we hit in the
    Authentik worker, fixed before commit.
  - Tenant inference: spaceName (set during provisioning to the slug)
    first, then User.authentikSubjectId → tenantIds → Tenant.slug.
  - Mutations only: OCIS_ALLOWLIST in action-map.ts whitelists 24 event
    types (User/Group/Space/Share/Link/File mutations). FileDownloaded,
    UserSignedIn, and the rest of the high-volume read traffic gets
    skipped — keeps the timeline scannable.

services/platform-api/src/ingest/action-map.ts:
  - mapOcisAction() + OCIS_ALLOWLIST. Returns null for non-whitelisted
    types so the worker filters early.

infrastructure/docker-compose/docker-compose.yml:
  - New named volume `ocis_audit_log` mounted writeable on the ocis
    container and read-only on platform-api.
  - OCIS env: OCIS_ADD_RUN_SERVICES=audit (the audit microservice is
    NOT in the default `ocis server` set — opt in explicitly),
    AUDIT_LOG_FILE_PATH=/var/log/ocis/audit.log, AUDIT_LOG_FORMAT=json.
  - platform-api env: OCIS_AUDIT_LOG_PATH points at the same file.

Verified end-to-end with synthetic events written to the audit log:
  - Worker tailed 5 events across initial read + incremental append
    (5 → bytes 0:1276, then 1 → bytes 1276:1519).
  - FileDownloaded correctly filtered by the allowlist (4 mutations
    landed in Mongo, not 5).
  - Tenant inference: events with executingUser.id resolved to
    `dezky` via User → tenantIds → Tenant.slug.
  - Operator /audit shows all three sources (89 events: 79 authentik
    + 5 platform-api + 5 ocis) in one unified timeline.

Known unknown — same shape as the Stalwart commit: I couldn't fully
confirm the OCIS v7 audit microservice emits events with just
OCIS_ADD_RUN_SERVICES=audit + the AUDIT_LOG_FILE_PATH env. The audit
service starts but the file stays empty until OCIS internals start
publishing events to NATS (which may need additional service-side
config). The ingest worker is correct regardless — when OCIS starts
writing real events, they'll flow into /audit. This is a follow-up
in the OCIS-side configuration, not in our ingest code.
2026-05-24 20:30:47 +02:00
Ronni Baslund 7bec940e7f feat(audit): Stalwart webhook ingest endpoint (Phase 2 chunk 2)
Push-based ingest for mail-server events. Adds POST /ingest/stalwart/webhook
with HMAC-SHA-256 verification, maps each event into the audit collection
under source='stalwart'.

services/platform-api/src/ingest/stalwart-webhook.controller.ts:
  - Public endpoint (no JwtAuthGuard — Stalwart can't carry a JWT). Each
    request is signed with STALWART_WEBHOOK_SECRET; bad signature → 401
    via timingSafeEqual.
  - Body: { events: [{ id, type, createdAt, data }, ... ] }. Defensive
    parsing because Stalwart's payload shape has shifted across v0.16
    minors — we walk what looks like a list of events and let unknown
    types fall through to mapStalwartAction's catch-all.
  - Per-event recordOne: action via mapStalwartAction(), actor from
    data.email/account/username, IP from data.ip or X-Forwarded-For,
    targetName from data.account/email/address/to, full payload kept
    in metadata. externalId = evt.id so the (source, externalId)
    unique index dedups re-deliveries.

action-map.ts: 14 known Stalwart event types →
  stalwart.{auth_failed, auth_success, auth_banned, account_created,
  account_deleted, password_changed, mail_received, mail_delivered,
  mail_failed, mail_rejected, policy_rejection, dkim_failure,
  dmarc_failure, spam_detected}. Snake/kebab forms normalized.

infrastructure/docker-compose:
  - .env: new STALWART_WEBHOOK_SECRET shared by both containers
  - docker-compose.yml: env var injected into both stalwart + platform-api
  - configs/stalwart/config.toml: [webhook."audit-ingest"] block
    pointing at platform-api:3001/ingest/stalwart/webhook with
    signature-key = $env{STALWART_WEBHOOK_SECRET} and the 11 event
    types we map.

Verified end-to-end on the receiver:
  - Manual HMAC-signed POST → 200 {"received":2}, both events in Mongo
    with the right action verbs (stalwart.auth_failed, stalwart.account_created),
    actor/IP/externalId populated.
  - Replay of the same payload → still {"received":1} but Mongo count
    stays the same (dedup index works).
  - X-Signature: deadbeef → 401, no row written.

Known unknown: I couldn't fully confirm Stalwart v0.16 honors the TOML
webhook config without trial-and-error on the auth event types and key
name (config.toml uses signature-key; some Stalwart builds want plain
'key'). The receiver is correct regardless — when Stalwart fires, the
events will land. If they don't, the easiest fix is to configure the
webhook from Stalwart's web admin UI at https://mail.dezky.local
instead of via TOML.
2026-05-24 20:21:29 +02:00
Ronni Baslund b1d717e466 feat(audit): Authentik events ingest worker (Phase 2 chunk 1)
Background worker that pulls Authentik's /api/v3/events/events/ on a
60s cadence and writes each event into our audit log via AuditService.
External system events now share the same /audit timeline as
internally-recorded platform mutations — operator queries don't have
to cross-reference Authentik's own UI to see logins, password changes,
group membership, impersonation, etc.

Pieces:
- src/schemas/ingest-cursor.schema.ts: one row per source, tracks
  lastEventAt + lastEventId so restarts resume without re-pulling.
- src/schemas/audit-event.schema.ts: new `externalId` field; new
  compound unique index on (source, externalId) with a partial filter
  on externalId being a string. Partial (not sparse) so internally-
  recorded events with externalId=null don't collide.
- src/audit/audit.service.ts: AuditRecordInput grows `externalId` +
  `at` fields. record() now silently swallows MongoError code 11000
  (duplicate key) so re-pulling the cursor overlap doesn't log noise.
- src/integrations/authentik.client.ts: listEvents(since, page,
  pageSize) on the existing client — reuses the admin token and base
  URL the provisioning code already configured.
- src/ingest/action-map.ts: 16 known Authentik actions → dotted
  authentik.* verbs (login, login_failed, password_changed,
  impersonation_started, …). Unknown actions fall through to
  authentik.<raw> rather than getting silently dropped.
- src/ingest/authentik.ingest.ts: OnApplicationBootstrap worker.
  Reads cursor → pulls events with created__gt=cursor, ordering=created
  ASC → paginates forward (10 pages × 100/page safety cap per tick) →
  writes each event with source='authentik' + externalId=pk + at=
  evt.created → advances cursor to the newest seen. inFlight guard
  prevents overlapping ticks. AUDIT_INGEST_ENABLED=false disables for
  test environments.
- Tenant inference: from the user's groups (same convention the portal
  flag-eval proxy uses). Admin groups stripped; first match against a
  real Tenant.slug wins. Unmatched → tenantSlug undefined, event still
  lands in the global timeline.

Smoke-tested: fresh Mongo + restart → 78 Authentik events ingested,
0 duplicates. Performed a login at app.dezky.local → next 60s tick
captured the new login row with actor email + IP. Compound unique
index on (source, externalId) verified to reject re-pulled events
silently (no error logs).

Out of scope here (covered by chunks 2 + 3):
- Stalwart webhook ingest
- OCIS file-tail ingest
2026-05-24 20:12:21 +02:00
Ronni Baslund 02341d8ba5 feat(audit): platform-api audit log + operator UI wired to real events
Phase 1 of the audit work — capture everything we control today, ingest from
external systems (Authentik / OCIS / Stalwart) in a later phase. The mock
OP_AUDIT fixture is gone; both the /audit page and Overview's activity card
now show real events recorded by AuditService.record() in platform-api.

Schema (services/platform-api/src/schemas/audit-event.schema.ts):
  AuditEvent { at, actorType, actorId, actorEmail, actorIp, action, outcome,
    resourceType, resourceId, resourceName, tenantSlug, partnerSlug, source,
    metadata, prevHash, hash }
  Indexes: {at:-1}, {tenantSlug,at:-1}, {actorId,at:-1}, {action,at:-1}.
  prevHash/hash are nullable now; hash-chain tamper evidence is a later phase.

AuditService:
  - record() — best-effort write, swallows errors so the underlying mutation
    that succeeded isn't failed by a downstream log issue. Surfaces failures
    via Logger.
  - list() — filters: since/until/before, action (exact OR prefix match
    via leading-anchor regex), tenantSlug, partnerSlug, actorEmail, outcome,
    free-text q across action/resourceName/actorEmail/tenantSlug, limit
    (default 100, max 500). Cursor pagination via `before`.
  - No UPDATE/DELETE surface — entries are append-only by construction.

AuditController: GET /audit, behind JwtAuthGuard + OperatorGuard. No mutations
exposed; entries written internally by other modules.

X-Forwarded-For threading:
  - apps/operator/server/utils/platform-api.ts forwards the originating
    client IP to platform-api so audit entries carry a real address.
  - services/platform-api/src/auth/client-ip.ts extracts leftmost
    X-Forwarded-For, falls back to socket.remoteAddress.

Instrumented mutations (every one threads actor + IP through):
  Tenants: create, update, softDelete, setStatus(suspend/resume)
  Partners: create, update, terminate
  Flags:   create, update (incl. flag.killed verb when state=off+note=kill-switch),
           remove
  Users:   deactivate

Each controller resolves the User doc via ActorService, extracts IP via
clientIp(req), and passes { userId, email, ip } as AuditActor to the service.
FlagsService's local ActorRef collapses to AuditActor so flag history and the
audit log share one shape.

Operator UI:
  - /api/audit proxy that forwards query params verbatim
  - types/audit.ts
  - pages/audit.vue: real list with quick-pick action chips (All/Tenants/
    Partners/Flags/Users), outcome filter, free-text search, "Load older
    events" cursor pagination
  - pages/index.vue: Overview activity card swaps mock OP_AUDIT for the
    same /api/audit endpoint, rows link into /audit
  - data/fixtures.ts: OP_AUDIT / AuditEntry / AuditTone exports removed

Verified end-to-end: suspended + resumed acme, flipped oci_versioning through
rollout → kill → on, then /audit returned all 5 events with the right action
verbs (tenant.suspended, tenant.resumed, flag.updated, flag.killed,
flag.updated), actor admin@dezky.local, IP 192.168.65.1. Filters (action
prefix + free-text q) narrow correctly.

Out of scope for this commit (each gets its own conversation):
  - Authentik / OCIS / Stalwart ingest adapters (Phase 2)
  - Hash-chain tamper evidence (Phase 3)
  - TTL + cold-storage archival to Hetzner Object Storage (Phase 4)
  - GDPR right-to-erasure tooling
2026-05-24 19:50:24 +02:00
Ronni Baslund 5407c04682 docs: feature-flag usage guide + cross-links
New docs/FEATURE-FLAGS.md captures when to add a flag, where the moving
parts live, how to use useFeatureFlag from app code, the 4 states + 4
scope axes, kill-switch flow, naming conventions, and the parts we know
aren't built yet (partnerSlug eval context, user-level flags, audit-log
integration, server-side cache).

CLAUDE.md gets a one-line convention entry under "Code conventions" so
future devs notice it when grepping for code rules. NEXT-STEPS.md is
updated: the feature-flag backend follow-up is now ticked done with a
pointer to FEATURE-FLAGS.md for the remaining sub-tasks, and the
"What landed" section reflects the real Infrastructure + Flags pages
and the notification drawer.
2026-05-24 19:29:24 +02:00
Ronni Baslund 7f8516295c feat(portal): useFeatureFlag composable + /api/flags/evaluate proxy
Client-side helper for the portal to consume feature flags. Hits platform-api
through a new portal-side proxy that derives the tenant slug from the
signed-in user's JWT groups — so callers don't pass a slug, they just check
`useFeatureFlag('key')`.

apps/portal/server/api/flags/evaluate.post.ts:
- Reads access token from the nuxt-oidc-auth session
- Decodes the JWT and picks the first non-admin group as the tenant slug
  (admin groups: dezky-platform-admins, "authentik Admins"). Filters
  duplicates Authentik double-lists via policy bindings.
- Forwards { tenantSlug } to platform-api POST /flags/evaluate
- Caller can still pass an explicit tenantSlug in the request body to
  override the auto-derivation (rare).

apps/portal/composables/useFeatureFlag.ts:
- Singleton module-level state shared across every component — one bulk
  eval per session, not one per flag check
- `useFeatureFlag(key)` → ComputedRef<boolean>, lazily triggers the first
  eval, fail-closed (every flag stays false on error)
- `useFeatureFlags()` → { flags, ready, pending, refresh } for the rare
  case where you need the full map or want to re-evaluate (long-lived
  session, admin flipped a flag mid-flight)
- Returns refs that update once the bulk eval lands; gated UI stays
  hidden during the ~25ms round trip

apps/portal/nuxt.config.ts:
- Vite 7 `server.allowedHosts` set to ['app.dezky.local'] — same fix we
  already shipped on the operator side; without it, the proxy returned a
  plaintext 403 "Blocked request" instead of forwarding.

Verified end-to-end: signed in to app.dezky.local, hit /api/flags/evaluate
with no body → 200 with the full truth map (same shape as the operator's
direct eval), latency ~25ms, explicit-slug override returns identical
results.
2026-05-24 19:26:55 +02:00
Ronni Baslund 868a305539 feat(flags): real feature-flag system with bulk eval + operator UI
Real backend for the flags page (was pure mock). Built so it's ready for
the first risky rollout (likely the Stalwart JMAP client or the Stripe
billing engine).

services/platform-api:
- Flag schema (key, description, state, pct, scope.{plans, tenantSlugs,
  partnerSlugs, environments}, embedded history capped at 20)
- FlagsService with CRUD + evaluateAll(tenantSlug) → { key: bool }
  Eval algorithm:
    off  → false; on → true
    targeted → require non-empty scope (empty allowlist means "nobody"),
               then match every non-empty axis
    rollout  → match scope, then sha256(`${tenantId}:${key}`) % 100 < pct
  Hash-based rollout is deterministic: bumping pct only flips the new
  slice. Pure helpers (matchesScope, hasAnyScope, inRolloutBucket) are
  exported for future unit tests.
- FlagsController exposes GET /flags, GET /flags/:key, POST /flags/evaluate
  (JwtAuthGuard); POST/PATCH/DELETE require OperatorGuard. History entries
  capture the actor's email.
- SeedService idempotently creates 10 flag keys mapping to real Dezky
  concerns (jmap_native_v2, gdpr_export_v2, new_billing_engine, etc.).
  $setOnInsert so operator edits survive restarts.

apps/operator:
- 6 proxies: /api/flags index get/post, [key] get/patch/delete, evaluate post
- types/flag.ts with the shape that mirrors the backend
- pages/flags.vue: useFetch real list, row click opens FlagDetail,
  "New flag" opens NewFlagModal, scope summary column shows targeting
  at a glance
- FlagDetail.vue: side panel with segmented state, rollout slider with
  live "~N of M tenants" preview from /api/tenants, plan/tenant/env chip
  pickers, dirty-tracked Save, instant Kill-switch (PATCH state=off+pct=0),
  embedded change history
- NewFlagModal.vue: minimal create form (key + description). Everything
  else is configured in the detail panel afterward.
- CommandPalette: feature-flag rows now come from /api/flags instead of
  the dropped fixture, so newly-created flags are searchable immediately
- data/fixtures.ts: drop FLAGS / FeatureFlag exports (replaced by the
  real backend)

Smoke-tested end-to-end: list renders 10 seed flags, opening gdpr_export_v2
and flipping to rollout 25% then saving persists + adds a history entry,
kill-switch sets state=off in one click, /api/flags/evaluate returns the
correct booleans for the seeded tenant, same tenant gets the same answer
on consecutive evals (determinism), and creating + deleting a flag through
the UI roundtrips correctly.
2026-05-24 19:21:15 +02:00
Ronni Baslund 77a09aaf77 feat(operator): live Infrastructure probes + honest split between deployed and planned
The Infrastructure page used to read from a mock fixture that lied two ways:
it listed services that aren't deployed (Jitsi, Zulip, Cloudflare, Object
Storage, Postmark) and showed hardcoded uptime/latency for the ones that
are. Now it shows truth from real probes plus a clearly-labelled "planned"
section for the rest.

Backend (services/platform-api):
- New src/health/ module — HealthService runs 9 probes in parallel with a
  1.5s timeout each:
    Stalwart  → TCP stalwart:8080
    OCIS      → HTTP GET ocis:9200/health
    Collabora → HTTP GET collabora:9980/hosting/discovery
    Authentik → HTTP GET authentik-server:9000/-/health/ready/
    Postgres  → TCP postgres:5432
    Mongo     → existing Mongoose connection.db.admin().ping()
    Redis     → TCP redis:6379
    Traefik   → TCP traefik:80
    Platform API → trivially ok (this code is running)
  Status thresholds: ok ≤500ms, warn 500–1500ms, bad on timeout/refuse.
- HealthController exposes GET /health/platform behind JwtAuthGuard, plus
  keeps the existing public GET /health for infra liveness checks.
- Moved the old src/health.controller.ts into the new module.

Frontend (apps/operator):
- /api/health/platform proxy forwards the operator's access token.
- Infrastructure page swaps SERVICES fixture for useFetch with 30s auto-
  refresh + a manual Refresh button. Cards show real status badge + real
  latency; uptime/error stay as em-dash with a "no probe history yet"
  tooltip until a Prometheus/event-log backend lands.
- Below the live grid, a "Planned · not deployed" section renders 5 dimmed
  cards (Jitsi, Zulip, simpledns.plus, Hetzner Object Storage, Postmark).
  simpledns.plus replaces the misnamed Cloudflare entry — we use
  simpledns.plus, not Cloudflare.
- Subtitle is now truthful: "8 / 9 services live · checked 2s ago".

Verified: stopped redis → card flipped to "down · getaddrinfo ENOTFOUND
redis", subtitle reflected 8/9, incident banner appeared. Restarted →
back to 9/9, banner gone.

SERVICES fixture stays in place for Overview's incident banner — replacing
that is a separate follow-up tied to the incident-management backend.
2026-05-24 18:47:38 +02:00
Ronni Baslund 9fac11e668 feat(operator): notification drawer behind the topbar bell
Right-anchored slide-in inbox triggered by the bell button. Backend is a
follow-up — for now this is a visual + behavior shell with mock fixtures,
same pattern as INCIDENT / FLAGS / OP_AUDIT.

- data/fixtures.ts: new NotificationItem type + 6 seed rows from the
  design (DMARC, invitation, invoice, SAML, ticket reply, failed sign-in)
- useNotifications composable: isOpen + items + unreadCount + markRead +
  markAllRead. Items deep-clone the fixture on first import so toggling
  unread doesn't mutate the shared seed.
- NotificationDrawer component: Teleport + scrim + slide animation,
  header/list/footer. Each row shows tone-tinted icon tile + title +
  description + timestamp + left-rail unread dot. Click a row to mark
  read; click Mark all read or Preferences in the footer.
- OpTopbar: bell now opens the drawer and only shows .icon-btn-dot when
  unreadCount > 0.
- Layout mounts <NotificationDrawer /> alongside the other floating
  components.

Dismissal: backdrop click, Escape, X, and route-change watcher (so
Preferences → /settings closes the drawer cleanly).
2026-05-24 17:08:14 +02:00
Ronni Baslund 455717ac67 refactor(operator): remove fake on-call pill from topbar
The "on-call · Mikkel" pill named a person who doesn't exist and a paging
system we haven't built. The IncidentModal still says "will notify on-call"
but nothing actually does, and no schema for rotations / pages exists in
platform-api. Showing this in the chrome was claiming an operational fact
that isn't true.

Drop the prop, span, and CSS from OpTopbar. The right cluster becomes
just [bell] [profile].

Mock audit + incident-timeline fixtures still carry historical "on-call
paged" entries — those are records of past events in the mock, not live
state, so they stay. Paging gets a real indicator when the incident
backend lands (tracked as "Real on-call indicator" in NEXT-STEPS.md).
2026-05-24 17:00:40 +02:00
Ronni Baslund 78e15b9a84 refactor(operator): group on-call/notifications/profile flush right in topbar
Replace the inert .spacer (flex: 0 0 auto, did nothing) with a real .right
wrapper using margin-left: auto. The on-call indicator, notifications bell,
and UserMenu now form a single right-aligned cluster instead of sitting
loose in the header flex flow.
2026-05-24 16:55:11 +02:00
Ronni Baslund c93865e187 refactor(operator): derive env badge from hostname, not from user choice
A toggle-able env badge is a sticker, not a safety signal. Move env to
useEnv() which reads window.location.hostname:
  *.local / localhost → 'dev'
  *staging* → 'staging'
  everything else → 'prod' (safest default)

- New composable: apps/operator/composables/useEnv.ts
- Topbar reads useEnv() instead of useTweaks().env
- useTweaks loses the env field; hydrate strips it from stale
  localStorage payloads so old entries don't break
- TweaksPanel: env section removed (theme + density remain)
- Settings: env section removed from Appearance; added a read-only
  Environment row to the Profile card showing the detected env +
  hostname source ("auto-detected from operator.dezky.local")
2026-05-24 16:52:07 +02:00
Ronni Baslund 702fe9e134 feat(operator): real account settings page
Replace the OpPlaceholder stub at /settings with a three-card account page:

- Profile: name, email, subject ID, deduped group chips (Authentik returns
  each group twice; dedupe in the computed). Last sign-in derived from JWT
  iat via the existing /api/_verify-token endpoint.
- Security: three deep links to Authentik's user settings — change password,
  manage MFA devices, active sessions. We don't re-implement identity here;
  Authentik already has a polished UI for it.
- Appearance: theme / density / env segmented controls. Shares the
  useTweaks composable with the floating Tweaks panel, so flipping here
  is reflected there and vice-versa.
2026-05-24 16:47:03 +02:00
Ronni Baslund 3f4be27bd9 feat(operator): avatar dropdown context menu in topbar
New UserMenu component owns its own trigger + dropdown + dismissal so the
topbar stays simple. Menu contents: identity row (name + email), theme
toggle (reuses useTweaks so the floating panel and menu stay in sync),
link to /settings, Sign out (calls useOidcAuth().logout).

Dismissal: outside click via a transparent Teleport scrim, Escape, and
route change (watch on route.path → close).

Drops the now-unused useOidcAuth import from OpTopbar.
2026-05-24 16:45:11 +02:00
Ronni Baslund 885aa65219 refactor(operator): drop redundant profile card from sidebar foot
The .me-wrap block in OpSidebar was an inert button — no click handler,
no menu — and duplicated the avatar already shown in the topbar. Remove
it so there's a single place to render the user (topbar), making room
for the avatar dropdown that's landing next.
2026-05-24 16:43:54 +02:00
Ronni Baslund 19e1a4fca3 chore(operator): O.9 verification + roll follow-ups into NEXT-STEPS
- Add _verify-token.get.ts to both operator and portal — decodes the
  access token stored in the nuxt-oidc-auth session and echoes iss/aud/
  sub/groups. Used to confirm operator tokens carry aud=dezky-operator
  and portal tokens carry aud=dezky-portal. Listed in NEXT-STEPS.md as
  throwaway, to be removed when proper verification surfaces exist.
- OPERATOR-PLAN.md O.9 marked done with the actual claims captured + the
  Mongo-side verification of attach + suspend flows.
- NEXT-STEPS.md: replaced the "Operator portal — out-of-band track"
  section with a "shipped + follow-ups" version. The 9-item follow-up
  list (impersonation, audit, flags, incidents, support, partner
  portal, env switcher, on-call, workspace impersonation) is now the
  authoritative roadmap, not buried inside OPERATOR-PLAN.md.
2026-05-24 08:47:56 +02:00
Ronni Baslund c71e782dc0 feat(operator): command palette, impersonation, incident, tweaks (O.8)
- CommandPalette + useCommandPalette: ⌘K opens a search-and-jump panel over
  real tenants/partners + fixture flags + nav + actions. Arrow keys + Enter
  navigate, Escape/backdrop close. Recents are intentionally omitted for now;
  add when there's something to recent over.
- Impersonation stub: useImpersonation + ImpersonationModal + ImpersonationBanner.
  Modal opens from tenant detail and from the palette. Banner stays at the top
  of the shell until exited. No real OBO token is minted — wiring OAuth Token
  Exchange is tracked as a follow-up.
- IncidentModal + useIncidentModal: opened from the Overview and Infrastructure
  incident banners, renders the mock INCIDENT data with metrics, timeline and
  draft composer.
- TweaksPanel + useTweaks: floating bottom-right panel for theme (dark/light),
  density (comfy/compact), env badge (prod/staging/dev). Saved to localStorage.
- Theme/density apply via [data-theme] + [data-density] overrides in
  tokens.css. Topbar env badge now reads from useTweaks instead of a prop.
- Layout wires ⌘K + ⌘[ at the document level and mounts the palette + modals
  + banner + tweaks panel once for all pages.
2026-05-24 08:34:34 +02:00
Ronni Baslund e0ac643e80 feat(operator): visual-only screens with real-data overview (O.7)
- Overview (pages/index.vue): KPIs from real /tenants /partners /users,
  status meter, recent + needs-follow-up tables. Mock activity stream and
  incident banner overlay come from data/fixtures.ts.
- Operator team: real GET /users filtered to platformAdmin === true,
  with last-seen + tenant counts.
- Users (global): real read with All/Admins/Inactive views and search.
- Infrastructure / Feature flags / Audit: mock fixtures only — wiring to
  real backends (Prometheus, OpenFeature, append-only audit) is tracked
  as follow-ups in OPERATOR-PLAN.md.
- Placeholder pages (support/billing/reports/settings) via OpPlaceholder.
- Shared: Stat, MetricCell, OpPlaceholder components, /api/users proxy,
  PlatformUser type.
- .gitignore: scope the docker volumes data/ rule so apps/*/data/ is
  tracked again (operator carries mock fixtures there).
2026-05-24 08:17:26 +02:00
Ronni Baslund fbbb43e3e2 feat(operator): partner management with attach/detach (O.6)
- Partners list with name/domain/status/customers/margin + Create modal
- Partner detail: contract card, contact card, customers table, attach modal,
  terminate (soft-delete) danger card
- Operator proxies for /partners + /partners/:slug/tenants
- platform-api: add partnerId Prop to Tenant schema. The field was being
  silently dropped by Mongoose because the schema didn't declare it.
- tenants.service: rewrite update() to build $set/$unset explicitly and cast
  partnerId via new Types.ObjectId(). Handles detach via $unset so the field
  vanishes from the doc cleanly.
2026-05-24 08:02:00 +02:00
Ronni Baslund 8e81730372 feat(operator): tenant list + 7-tab detail with real lifecycle (O.5)
Operator can now manage tenants end-to-end from the UI:

  - pages/tenants/index.vue — list with status/plan/domains/created/
    provisioning-state columns, search by slug or name, status chips
    with live counts (all/active/pending/suspended), click-through
    to detail
  - pages/tenants/[slug].vue — 7-tab detail (Overview, Users, Resources,
    Billing, Audit, Support, Danger zone)
  - 3 tabs hit real backends: Overview (identity + billing fields),
    Users (lazy-loaded via new GET /tenants/:slug/users endpoint),
    Resources (live provisioning state per integration + Reconcile button)
  - 3 tabs render mock fixtures with warn-tone "mock" badges: Billing
    (Stripe placeholder), Audit (sample log lines), Support (placeholder
    pending the ticket queue work)
  - Danger zone: 3 real-backend cards (Suspend / Resume / Soft-delete),
    each gated by a ConfirmDialog modal. Verified live — clicked
    Suspend on acme, status flipped to 'suspended' in Mongo, then
    Resumed back to 'active'

platform-api additions:
  - GET /tenants/:slug/users returns users with this tenant in their
    tenantIds, sorted by last login. Same authorization rule as the
    existing /tenants/:slug — platform admins always pass,
    non-admins must be a member of the tenant
  - tenants.module imports User schema for the new lookup

New components (apps/operator/components/):
  - Tabs.vue — horizontal strip with optional per-tab counts, v-model
  - ConfirmDialog.vue — Teleport-to-body modal, Escape/backdrop close,
    danger/primary tone for the confirm button

Server proxy infrastructure (apps/operator/server/):
  - utils/platform-api.ts — single helper encapsulating
    access-token-from-session + bearer-forward + error normalization.
    Every operator proxy route is now a one-liner against this helper
  - api/tenants/index.get.ts, [slug]/{index.get,index.patch,index.delete,
    users.get,suspend.post,resume.post,reconcile.post}.ts

Two real bugs found and fixed during the smoke test:

  - Mongoose subdocument `_id` leaks into JSON when iterating
    tenant.provisioningStatus. Switched to an explicit
    `['authentik', 'stalwart', 'ocis']` whitelist in both v-fors
  - Documents created before provisioningErrors was added (like the
    acme tenant) don't have the field at all in JSON. Use optional
    chaining (`tenant.provisioningErrors?.[k]`) instead of bracket
    access. Without it: 'Cannot read properties of undefined (reading
    "authentik")' during the Resources tab render
2026-05-24 07:44:23 +02:00
Ronni Baslund 8e6f73a921 feat(operator): design system port + persistent shell (O.4)
Operator portal now wears its real chrome instead of placeholder spans.
Sidebar + topbar + page header all rendered against the carbon palette
from tokens.css.

Components ported from the source design (operator-app.jsx,
platform-ui.jsx, operator-screens.jsx) as Vue 3 SFCs in
apps/operator/components/:

  Foundation: NodeMark (copied from portal), UiIcon (expanded to 31 icons
  covering sidebar/topbar/sort/arrows)

  Primitives: Card (3 surface variants), UiButton (primary / secondary /
  ghost / dark / danger × sm / md / lg), DataTable (header + rows),
  Badge (7 tones), Avatar (deterministic palette by name hash), Mono,
  Eyebrow, StatusDot, PageHeader (with actions slot)

  Shell: OpSidebar (collapsible 232<->56px, 12 nav items in 4 sections,
  active-row highlight from route, badge slot, brand + user footer);
  OpTopbar (env badge with prod/staging/dev variants, palette trigger
  stub for the ⌘K work in O.8, on-call pill, bell, avatar)

Layouts: layouts/default.vue wires sidebar + topbar + slot; layouts/blank.vue
is used by the login page (definePageMeta layout:'blank'). app.vue now
wraps NuxtPage in NuxtLayout (the missing piece — without it Nuxt warns
"Your project has layouts but the <NuxtLayout /> component has not been
used" and renders nothing chrome-wise).

Composable composables/useSidebar.ts holds the collapsed state shared
between OpSidebar's toggle button and layouts/default.vue's ⌘[ keyboard
shortcut.

Verified in the browser:
  - Sidebar renders all 12 nav links with section dividers, env badge shows
    PROD, PageHeader resolves to the user's display name from
    useOidcAuth().user
  - Collapse toggle flips sidebar width 232↔56; nav rows become icon-only
  - Smoke test on the placeholder home still returns 409 for the seeded
    test-partner (token forwarding survives the layout refactor)

Gotcha documented in the plan: Vite 7.3 added a strict
server.allowedHosts check that returns plaintext 403 for any host header
that isn't the dev origin. The customer portal pre-dates this Vite
version; operator needs allowedHosts: ['operator.dezky.local'] in
nuxt.config.ts under vite.server.

Pages/index.vue replaces the bare HTML placeholder from O.3 with the
new PageHeader + Card primitives — same smoke-test functionality, much
better visual fidelity.

Real screen content (Tenants, Partners, Infrastructure, etc.) lands in
O.5+. This commit is the chrome, the smoke test, and the verification
that the design system primitives compose correctly.
2026-05-24 07:32:08 +02:00
Ronni Baslund 55b1c133e3 feat(operator): scaffold apps/operator Nuxt app + multi-issuer JWT (O.3)
New Nuxt 3 app at apps/operator/ — internal admin portal on its own domain
(operator.dezky.local), own OAuth client (dezky-operator), own session
secrets, own cookies. Customer and operator surfaces can't decrypt each
other's session state.

OAuth flow verified end-to-end:
  - GET / → middleware redirect to /auth/login
  - User clicks Sign in → /auth/oidc/login → bounces to Authentik with
    client_id=dezky-operator, scope includes 'groups'
  - Authentik checks dezky-platform-admins group binding (added in O.1),
    silent-reauths via the existing auth.dezky.local session
  - Returns to /auth/oidc/callback with code, exchanges for token,
    creates session cookie on operator.dezky.local
  - Lands on pages/index.vue placeholder dashboard

Smoke test 'Create partner "test-partner"' button on the placeholder home
exercises the full operator-only authorization chain:
  - 1st call: 200, partner created in Mongo
  - 2nd call: 409 'already exists' (idempotency holds, token still valid)
  - Same call from the customer portal: 403 'requires operator-scoped
    token' (audience guard rejects dezky-portal aud)

JwtAuthGuard now multi-issuer in addition to multi-audience. Each
Authentik OAuth provider mints tokens with its own per-app iss URL
(.../application/o/<slug>/), so the guard accepts a comma-separated
AUTHENTIK_ISSUER. The audience-only fix from O.2 wasn't sufficient —
issuer is validated separately by jose.jwtVerify and was still pinned
to dezky-portal alone, yielding 'unexpected iss claim value' rejections.

Compose changes: new 'operator' service (Node 20 alpine, pnpm install +
nuxt dev, mkcert CA mount, traefik labels for operator.dezky.local +
TLS); new operator_node_modules volume; operator.dezky.local added to
traefik's Docker network aliases. Distinct OPERATOR_NUXT_OIDC_* session
secrets pulled from .env (gitignored, generated via openssl).

Real operator screens (sidebar, topbar, tenants, partners, etc.) come
in O.4. This commit is pure scaffolding + the security boundary proof.
2026-05-24 07:20:16 +02:00
Ronni Baslund 2db41fec5e feat(platform-api): multi-audience JWT + Partner CRUD + tenant lifecycle (O.2)
JwtAuthGuard now accepts a comma-separated AUTHENTIK_AUDIENCE
('dezky-portal,dezky-operator'). jose.jwtVerify takes an array and succeeds
on any match — both customer-portal and operator-portal tokens validate
against this service. Per-endpoint guards restrict further.

New OperatorGuard enforces operator-only mutations:
  1. JWT audience claim includes 'dezky-operator' (proof from the token
     alone that this is a privileged session)
  2. ActorService-resolved User has platformAdmin=true (DB check so
     revocation works without waiting for the token to expire)
Both required; either alone is insufficient.

Partner module:
  - Partner schema: slug, name, domain, status, marginPct, contactInfo,
    billingInfo. marginPct is one number per partner (decided in grilling)
  - CRUD endpoints under @UseGuards(JwtAuthGuard, OperatorGuard) — every
    partner mutation requires operator scope
  - GET /partners returns each row with a computed customers count from
    aggregating Tenant.partnerId. MRR aggregation deferred until
    Subscription gains a price column
  - GET /partners/:slug/tenants for the partner detail view
  - DELETE soft-terminates (status='terminated') — never hard-delete
    because tenants may still reference the partner

Tenant changes:
  - partnerId?: Types.ObjectId (ref Partner, indexed sparse) added to
    Tenant schema
  - UpdateTenantDto accepts partnerId so PATCH can attach/detach
  - POST /tenants/:slug/suspend and /resume — operator-only via
    OperatorGuard. PATCH already covers plan/domains/partnerId changes

Smoke test: customer-portal session sends POST /api/partners through the
portal proxy → 403 "This endpoint requires an operator-scoped token". The
positive test (operator-token → 200) waits for O.3 when there's an
operator app to mint the right token.

apps/portal/server/api/partners/index.post.ts is a temporary verification
proxy — delete once the operator portal exists.
2026-05-24 07:08:59 +02:00
Ronni Baslund 3573188431 docs(operator): O.1 done — Authentik dezky-operator OAuth client live
What landed in Authentik (runtime state, not in git):
- OAuth2 provider 'dezky-operator', confidential, PKCE, audience
  dezky-operator, redirect URIs operator.dezky.local/auth/oidc/{callback,logout}
- Application 'Dezky Operator' linked to the provider
- Policy binding: dezky-platform-admins group required on the application

.env (gitignored) gained OPERATOR_OIDC_CLIENT_ID/SECRET/ISSUER.

MFA-required is deferred — Authentik enforces it via a stage binding on
the auth flow, which is app-specific config better tackled when there's
a real enrollment to gate. akadmin already has WebAuthn so the flow
prompts for it anyway.

Discovery doc at /application/o/dezky-operator/.well-known/openid-
configuration confirmed: issuer correct, scopes include 'groups'.

Two gotchas documented in OPERATOR-PLAN.md:
- Authentik 2025.10 requires invalidation_flow alongside authorization_flow
- policies/group_membership endpoint is gone; use policies/bindings with a
  direct group reference instead
2026-05-24 07:01:37 +02:00
Ronni Baslund 22b2583f0b chore(services): rename services/provisioning -> services/platform-api
O.0 prep from OPERATOR-PLAN.md. Mechanical refactor before adding partner
management and operator-specific endpoints. The service now owns more than
just provisioning orchestration (it'll soon own partners, tenant lifecycle
actions, multi-audience JWT validation), so the name 'platform-api' reflects
its scope better.

What changed:
- Directory: services/provisioning/ -> services/platform-api/
- Package: @dezky/provisioning -> @dezky/platform-api
- Docker: container_name dezky-provisioning -> dezky-platform-api;
  compose service key 'provisioning' -> 'platform-api'; volume
  provisioning_node_modules -> platform_api_node_modules
- Portal: PROVISIONING_INTERNAL_URL env var -> PLATFORM_API_INTERNAL_URL,
  default URL http://provisioning:3001 -> http://platform-api:3001 in all
  three proxy routes (me.get.ts, tenants/index.post.ts, tenants/[slug]/
  reconcile.post.ts), plus NUXT_API_BASE updated
- Health endpoint service identifier and main.ts log lines updated to
  'dezky-platform-api'
- Docs swept: README, CLAUDE.md, SERVICES.md, AUTHENTIK-SETUP.md,
  NEXT-STEPS.md, TROUBLESHOOTING.md, OPERATOR-PLAN.md, traefik/dynamic.yml

What deliberately stays:
- Internal module names ProvisioningService / ProvisioningModule (those
  describe an orchestration sub-concern, not the service's purpose)
- Tenant.provisioningStatus / provisioningErrors field names (state
  per integration, not service name)
- File services/platform-api/src/tenants/provisioning.service.ts
- 'Hetzner provisioning' references in production-prep docs (infrastructure
  provisioning, unrelated)

Verified end-to-end after rename: /api/me returns 200 with profile + 2
tenants + subscription, /api/tenants/dezky/reconcile returns 200 with
Authentik integration still ok.

OPERATOR-PLAN.md O.0 checkboxes ticked.
2026-05-24 00:35:01 +02:00
Ronni Baslund fb3d7aa716 docs: add execution checklist to OPERATOR-PLAN
Ten phases (O.0–O.9), each ~one commit, in dependency order. Lets us tick
boxes as work lands and surfaces what's blocking what.
2026-05-24 00:28:54 +02:00
Ronni Baslund 92c5056a1d docs: capture operator portal plan from grilling session
OPERATOR-PLAN.md records the decisions from the design review:
- Scope: C-visual (full UI fidelity, mock data for most screens) but real
  CRUD for tenants and partners from day one
- Lives at apps/operator/ as a separate Nuxt app, separate domain, separate
  Authentik OAuth client (dezky-operator), aud-claim distinguishes operator
  vs portal tokens
- Backend stays as a single NestJS service; rename
  services/provisioning -> services/platform-api as a prep commit
- Partner schema designed: slug/name/domain/status/marginPct/contactInfo;
  Tenant gains optional partnerId; counts and MRR are computed at query time
- Impersonation: visual stub now (modal + banner, no-op toast); real OAuth
  Token Exchange flow recorded as the first follow-up task

Also lists follow-up tasks (real audit log, feature flag backend, incident
management, partner portal) and out-of-scope items so the next grilling
session has a starting point.

Pointer added in NEXT-STEPS.md under a new 'Operator portal' track.
2026-05-24 00:26:21 +02:00
Ronni Baslund 467e6a7ab5 docs: mark Phase 4 partial — Authentik real, Stalwart + OCIS stubbed
Honest status for the data-model + provisioning phases. Lists the smoke
test that verified the chain works end-to-end and points at the upstream
docs for the JMAP and libregraph follow-up work.
2026-05-24 00:07:40 +02:00
Ronni Baslund 28766b80c2 feat(provisioning): orchestrate Authentik/Stalwart/OCIS on tenant create
Phase 4 from docs/NEXT-STEPS.md. POST /tenants now writes Mongo AND drives
external service provisioning. A new POST /tenants/:slug/reconcile endpoint
retries the orchestration — useful when an upstream was down at create time
or external state drifted out of band.

Integration clients (services/provisioning/src/integrations/):
- AuthentikClient: real implementation. ensureGroup() is idempotent — looks
  up the group by name, creates if missing, returns either way. Group
  attributes record the tenant slug + Mongo id so we can trace back
- StalwartClient: stubbed. v0.16 removed the REST management API in favor
  of JMAP, which is significantly more work to wrap. TODO comment points
  to https://stalw.art/docs/api/management/overview for the follow-up
- OcisClient: stubbed. Needs libregraph /drives endpoint with service-to-
  service auth via OIDC client_credentials

Orchestration (provisioning.service.ts):
- Each step runs independently; one failure doesn't roll back the others
- Per-step state recorded on Tenant.provisioningStatus (ok/skipped/error/
  pending) plus error message on Tenant.provisioningErrors
- Steps return their own terminal state — 'skipped' for stubs, void
  defaults to 'ok' for real integrations
- Mongoose markModified() required for nested subdoc mutations to persist
- Tenant auto-flips status: pending → active when all steps are ok|skipped

Portal proxy routes (apps/portal/server/api/tenants/):
- POST /api/tenants and POST /api/tenants/:slug/reconcile forward the
  signed-in user's access token to the provisioning service. Lets the
  browser drive provisioning without minting tokens by hand. Will be
  replaced by a real "create workspace" flow with UI later

docker-compose: AUTHENTIK_API_URL/STALWART_API_URL/OCIS_API_URL now point
at the public Traefik-routed hostnames (with mkcert CA mounted into the
provisioning container so Node fetch trusts them). Previously these
pointed at internal Docker hostnames which doesn't work for Authentik
because of TLS issuer mismatch against the JWT.
2026-05-24 00:06:40 +02:00
Ronni Baslund 4bf6a85517 fix(stalwart): wire recovery admin + point portal tile at admin UI
- docker-compose: add STALWART_RECOVERY_ADMIN env so the env-file password
  works as a permanent recovery login. Without this, Stalwart prints a
  one-time bootstrap password to the logs and discards it after first setup
- portal: mail tile now links to /admin/ (the real Stalwart admin SPA),
  not /login (which is the OAuth client authorization UI for IMAP/SMTP
  clients like Thunderbird — confusing and unrelated)

The persistent admin (admin@dezky.local) was created via Stalwart's setup
wizard at /admin/init and lives in the stalwart_data volume. Recovery admin
in env is the "I lost the wizard credentials" escape hatch.
2026-05-23 22:51:25 +02:00
Ronni Baslund e0808bf13e fix(ocis): wire OCIS web SSO + Collabora document editing end to end
OCIS SSO was loading the SPA but never redirecting to Authentik: the default
OCIS CSP only allows connect-src to itself + the awesome-ocis GitHub repo, so
the metadata fetch to auth.dezky.local was blocked. Mount a custom csp.yaml
and point PROXY_CSP_CONFIG_FILE_LOCATION at it (env var lives on the proxy
service, not web — easy mistake). Also added the .html OIDC callback URIs to
the ocis-provider in Authentik (run-time state, not in this commit).

Collabora document editing required adding the OCIS collaboration service —
the WOPI bridge between OCIS storage and Collabora. Key wiring:

- ocis: expose embedded NATS (NATS_NATS_HOST=0.0.0.0) and gateway
  (GATEWAY_GRPC_ADDR=0.0.0.0:9142) so the new container can register and
  reach the rest of OCIS over the Docker network
- collaboration: COLLABORATION_GRPC_ADDR=0.0.0.0:9301 so it registers itself
  in the service registry with a reachable address (default 127.0.0.1 was
  unreachable from cross-container callers)
- collaboration: APP_ADDR uses the public host (office.dezky.local), not
  the internal Docker hostname — this value is sent to the browser as the
  iframe src
- collabora: regenerate proof key on every start (coolconfig generate-proof-key)
  so its public key matches what coolwsd signs with; otherwise collaboration
  rejects WOPI calls with "ProofKeys verification failed"
- collabora: ssl_verification=false (mkcert root not in Collabora's trust
  store), frame_ancestors=files.dezky.local (otherwise the iframe is blocked
  with a Danish "Indhold blokeret"), home_mode.enable=true to drop the
  "Explore The New" welcome popup and feedback prompt
- ocis CSP: extend connect-src + frame-src to include the new hostnames

Result: opening a .docx from OCIS now embeds Collabora in an iframe and the
document opens for editing.

Dev-mode caveats (not for prod): TLS verification disabled on Collabora's
outbound WOPI calls; home_mode caps at 20 concurrent connections / 10 docs.
2026-05-23 22:36:42 +02:00
Ronni Baslund 3d370caa62 feat(provisioning): tenant data model + CRUD with JWT-validated authz
Implements Phase 3 from docs/NEXT-STEPS.md.

Mongoose schemas (services/provisioning/src/schemas/):
- Tenant: slug, name, status, plan, domains, billingInfo, plus handles for
  Authentik group, OCIS space, and Stalwart domain (set in Phase 4)
- User: authentikSubjectId, tenantIds[], email, name, role, platformAdmin flag
- Subscription: tenantId, plan, status, Stripe IDs (unused until Phase 4)

Auth (services/provisioning/src/auth/):
- JwtAuthGuard verifies Authentik access tokens against the provider's JWKS
  with issuer + audience checks. Uses NODE_EXTRA_CA_CERTS to trust the
  mkcert root for the local Authentik cert
- ActorService resolves the verified JWT into a Mongo User document — every
  controller reads tenantIds + platformAdmin from the DB, not the token
- CurrentUser decorator extracts the JWT payload onto controllers

CRUD modules:
- /tenants, /users, /subscriptions with create/read/update/delete
- /users/me upserts the caller's User record on every request, syncing email,
  name, tenantIds, and platformAdmin from the JWT's groups claim — the only
  place we read JWT.groups outside the bootstrap

Why DB-derived authz: putting all group memberships in the JWT doesn't scale
past ~50 tenants per user (header/cookie size limits, no mid-session
revocation, stale data until re-login). JWT now carries identity only; the
DB is the source of truth for who can see what.

Seed (SeedService.OnApplicationBootstrap): idempotent creation of the
default 'dezky' tenant + matching subscription. User records are created on
first /users/me hit.

Infrastructure:
- Traefik label exposes provisioning at https://api.dezky.local (dev only)
- api.dezky.local added to Docker network aliases on Traefik
- mkcert root CA mounted into the provisioning container for JWKS fetch
- Authentik 'groups' scope mapping created + attached to dezky-portal
  provider; portal now requests it as a scope
- nuxt.config.ts portal: exposeAccessToken=true so Nitro forwards token;
  NUXT_OIDC_TOKEN_KEY fixed to base64-encoded 32 bytes (was hex, causing
  "Invalid key length" once exposeAccessToken turned on)

Portal: apps/portal/server/api/me.get.ts is a scaffolding route that
forwards the user's access token to provisioning and returns profile +
tenants + subscriptions — verifies the full chain end to end.
2026-05-23 21:53:53 +02:00
Ronni Baslund adfd9baafe chore: initial scaffold with running local stack and portal auth
Brings up Dezky's local development environment end-to-end:

Infrastructure (docker-compose):
- Traefik v3.7 reverse proxy with mkcert TLS (v3.2 couldn't speak Docker API 1.54)
- Postgres + Mongo + Redis with healthchecks and init script for per-service users
- Authentik 2025.10 (server + worker) as OIDC IdP
- Stalwart v0.16 mail server (image renamed from stalwartlabs/mail-server)
- OCIS 7.0 with PROXY_TLS=false and OCIS_CONFIG_DIR=/etc/ocis so init writes
  where the server reads
- Collabora office, plus the portal + provisioning service stubs
- Docker network aliases on Traefik so containers resolve auth.dezky.local etc.
  through the network (not host /etc/hosts)
- Docker socket mount parameterized for macOS Docker Desktop symlink path

Authentik provisioning (done via API after stack boot):
- ocis-provider (public client) + OCIS Files application
- dezky-portal provider (confidential) + Dezky Portal application
- Admin API token bound to akadmin manually since 2025.10's
  AUTHENTIK_BOOTSTRAP_TOKEN env var doesn't auto-materialize a token row

Portal (apps/portal):
- Nuxt 3 with nuxt-oidc-auth 1.0.0-beta.11 against generic 'oidc' preset
- Global auth middleware; login at /auth/oidc/login redirects to Authentik
- Visual implementation of Claude Design 'Auth' canvas: AuthShell, NodeMark,
  Auth* sub-components, design tokens as CSS custom properties
- Pages: auth/login, auth/expired, auth/disabled, index (post-login landing)
- mkcert root CA mounted into the portal so Node fetch trusts Authentik's
  self-signed cert (NODE_EXTRA_CA_CERTS) — dev only

Docs:
- AUTHENTIK-SETUP.md updated with manual token bind + portal provider scripted
  alternative
- NEXT-STEPS.md: Phase 1 and Phase 2 marked done with file locations and
  dev-mode caveats

Dev-mode shortcuts that need to be revisited before prod:
- skipAccessTokenParsing on the OIDC config
- NODE_EXTRA_CA_CERTS mkcert mount
- Bootstrap password still the generated value in .env
- Authentik admin token (dezky-bootstrap-token) is non-expiring
2026-05-23 21:25:11 +02:00