dezky

Author	SHA1	Message	Date
Ronni Baslund	df18128617	feat(audit): OCIS file-tail ingest worker (Phase 2 chunk 3) Tails OCIS's JSON-Lines audit log on a shared Docker volume and forwards mutations into AuditService. Final piece of Phase 2 — the /audit page now unifies platform-api, authentik, and ocis events on one timeline. services/platform-api/src/ingest/ocis.ingest.ts: - 5s polling loop (fs.watch is unreliable across Docker bind mounts on macOS). Stat → detect inode change or truncation → resume from byte position OR start over. - Cursor in IngestCursor stores lastEventId = "<inode>:<bytePosition>". Restarts resume cleanly; on overlap the (source, externalId) unique index dedups silently. - Lines collected first, then processed sequentially after the read stream closes. Earlier draft fired recordOne() from inside the readline 'line' callback which would have resolved the stream before all writes finished — same class of race we hit in the Authentik worker, fixed before commit. - Tenant inference: spaceName (set during provisioning to the slug) first, then User.authentikSubjectId → tenantIds → Tenant.slug. - Mutations only: OCIS_ALLOWLIST in action-map.ts whitelists 24 event types (User/Group/Space/Share/Link/File mutations). FileDownloaded, UserSignedIn, and the rest of the high-volume read traffic gets skipped — keeps the timeline scannable. services/platform-api/src/ingest/action-map.ts: - mapOcisAction() + OCIS_ALLOWLIST. Returns null for non-whitelisted types so the worker filters early. infrastructure/docker-compose/docker-compose.yml: - New named volume `ocis_audit_log` mounted writeable on the ocis container and read-only on platform-api. - OCIS env: OCIS_ADD_RUN_SERVICES=audit (the audit microservice is NOT in the default `ocis server` set — opt in explicitly), AUDIT_LOG_FILE_PATH=/var/log/ocis/audit.log, AUDIT_LOG_FORMAT=json. - platform-api env: OCIS_AUDIT_LOG_PATH points at the same file. 
Verified end-to-end with synthetic events written to the audit log: - Worker tailed 5 events across initial read + incremental append (5 → bytes 0:1276, then 1 → bytes 1276:1519). - FileDownloaded correctly filtered by the allowlist (4 mutations landed in Mongo, not 5). - Tenant inference: events with executingUser.id resolved to `dezky` via User → tenantIds → Tenant.slug. - Operator /audit shows all three sources (89 events: 79 authentik + 5 platform-api + 5 ocis) in one unified timeline. Known unknown — same shape as the Stalwart commit: I couldn't fully confirm the OCIS v7 audit microservice emits events with just OCIS_ADD_RUN_SERVICES=audit + the AUDIT_LOG_FILE_PATH env. The audit service starts but the file stays empty until OCIS internals start publishing events to NATS (which may need additional service-side config). The ingest worker is correct regardless — when OCIS starts writing real events, they'll flow into /audit. This is a follow-up in the OCIS-side configuration, not in our ingest code.	2026-05-24 20:30:47 +02:00
Ronni Baslund	7bec940e7f	feat(audit): Stalwart webhook ingest endpoint (Phase 2 chunk 2) Push-based ingest for mail-server events. Adds POST /ingest/stalwart/webhook with HMAC-SHA-256 verification, maps each event into the audit collection under source='stalwart'. services/platform-api/src/ingest/stalwart-webhook.controller.ts: - Public endpoint (no JwtAuthGuard — Stalwart can't carry a JWT). Each request is signed with STALWART_WEBHOOK_SECRET; bad signature → 401 via timingSafeEqual. - Body: { events: [{ id, type, createdAt, data }, ... ] }. Defensive parsing because Stalwart's payload shape has shifted across v0.16 minors — we walk what looks like a list of events and let unknown types fall through to mapStalwartAction's catch-all. - Per-event recordOne: action via mapStalwartAction(), actor from data.email/account/username, IP from data.ip or X-Forwarded-For, targetName from data.account/email/address/to, full payload kept in metadata. externalId = evt.id so the (source, externalId) unique index dedups re-deliveries. action-map.ts: 14 known Stalwart event types → stalwart.{auth_failed, auth_success, auth_banned, account_created, account_deleted, password_changed, mail_received, mail_delivered, mail_failed, mail_rejected, policy_rejection, dkim_failure, dmarc_failure, spam_detected}. Snake/kebab forms normalized. infrastructure/docker-compose: - .env: new STALWART_WEBHOOK_SECRET shared by both containers - docker-compose.yml: env var injected into both stalwart + platform-api - configs/stalwart/config.toml: [webhook."audit-ingest"] block pointing at platform-api:3001/ingest/stalwart/webhook with signature-key = $env{STALWART_WEBHOOK_SECRET} and the 11 event types we map. Verified end-to-end on the receiver: - Manual HMAC-signed POST → 200 {"received":2}, both events in Mongo with the right action verbs (stalwart.auth_failed, stalwart.account_created), actor/IP/externalId populated. 
- Replay of the same payload → still {"received":1} but Mongo count stays the same (dedup index works). - X-Signature: deadbeef → 401, no row written. Known unknown: I couldn't fully confirm Stalwart v0.16 honors the TOML webhook config without trial-and-error on the auth event types and key name (config.toml uses signature-key; some Stalwart builds want plain 'key'). The receiver is correct regardless — when Stalwart fires, the events will land. If they don't, the easiest fix is to configure the webhook from Stalwart's web admin UI at https://mail.dezky.local instead of via TOML.	2026-05-24 20:21:29 +02:00
Ronni Baslund	b1d717e466	feat(audit): Authentik events ingest worker (Phase 2 chunk 1) Background worker that pulls Authentik's /api/v3/events/events/ on a 60s cadence and writes each event into our audit log via AuditService. External system events now share the same /audit timeline as internally-recorded platform mutations, so operator queries don't have to cross-reference Authentik's own UI to see logins, password changes, group membership, impersonation, etc. Pieces: - src/schemas/ingest-cursor.schema.ts: one row per source, tracks lastEventAt + lastEventId so restarts resume without re-pulling. - src/schemas/audit-event.schema.ts: new `externalId` field; new compound unique index on (source, externalId) with a partial filter on externalId being a string. Partial (not sparse) so internally-recorded events with externalId=null don't collide. - src/audit/audit.service.ts: AuditRecordInput grows `externalId` + `at` fields. record() now silently swallows MongoError code 11000 (duplicate key) so re-pulling the cursor overlap doesn't log noise. - src/integrations/authentik.client.ts: listEvents(since, page, pageSize) on the existing client; reuses the admin token and base URL the provisioning code already configured. - src/ingest/action-map.ts: 16 known Authentik actions → dotted authentik.* verbs (login, login_failed, password_changed, impersonation_started, …). Unknown actions fall through to authentik.<raw> rather than getting silently dropped. - src/ingest/authentik.ingest.ts: OnApplicationBootstrap worker. Reads cursor → pulls events with created__gt=cursor, ordering=created ASC → paginates forward (10 pages × 100/page safety cap per tick) → writes each event with source='authentik' + externalId=pk + at=evt.created → advances cursor to the newest seen. inFlight guard prevents overlapping ticks. AUDIT_INGEST_ENABLED=false disables for test environments. - Tenant inference: from the user's groups (same convention the portal flag-eval proxy uses). 
Admin groups stripped; first match against a real Tenant.slug wins. Unmatched → tenantSlug undefined, event still lands in the global timeline. Smoke-tested: fresh Mongo + restart → 78 Authentik events ingested, 0 duplicates. Performed a login at app.dezky.local → next 60s tick captured the new login row with actor email + IP. Compound unique index on (source, externalId) verified to reject re-pulled events silently (no error logs). Out of scope here (covered by chunks 2 + 3): - Stalwart webhook ingest - OCIS file-tail ingest	2026-05-24 20:12:21 +02:00
Ronni Baslund	02341d8ba5	feat(audit): platform-api audit log + operator UI wired to real events Phase 1 of the audit work — capture everything we control today, ingest from external systems (Authentik / OCIS / Stalwart) in a later phase. The mock OP_AUDIT fixture is gone; both the /audit page and Overview's activity card now show real events recorded by AuditService.record() in platform-api. Schema (services/platform-api/src/schemas/audit-event.schema.ts): AuditEvent { at, actorType, actorId, actorEmail, actorIp, action, outcome, resourceType, resourceId, resourceName, tenantSlug, partnerSlug, source, metadata, prevHash, hash } Indexes: {at:-1}, {tenantSlug,at:-1}, {actorId,at:-1}, {action,at:-1}. prevHash/hash are nullable now; hash-chain tamper evidence is a later phase. AuditService: - record() — best-effort write, swallows errors so the underlying mutation that succeeded isn't failed by a downstream log issue. Surfaces failures via Logger. - list() — filters: since/until/before, action (exact OR prefix match via leading-anchor regex), tenantSlug, partnerSlug, actorEmail, outcome, free-text q across action/resourceName/actorEmail/tenantSlug, limit (default 100, max 500). Cursor pagination via `before`. - No UPDATE/DELETE surface — entries are append-only by construction. AuditController: GET /audit, behind JwtAuthGuard + OperatorGuard. No mutations exposed; entries written internally by other modules. X-Forwarded-For threading: - apps/operator/server/utils/platform-api.ts forwards the originating client IP to platform-api so audit entries carry a real address. - services/platform-api/src/auth/client-ip.ts extracts leftmost X-Forwarded-For, falls back to socket.remoteAddress. Instrumented mutations (every one threads actor + IP through): Tenants: create, update, softDelete, setStatus(suspend/resume) Partners: create, update, terminate Flags: create, update (incl. 
flag.killed verb when state=off+note=kill-switch), remove; Users: deactivate. Each controller resolves the User doc via ActorService, extracts IP via clientIp(req), and passes { userId, email, ip } as AuditActor to the service. FlagsService's local ActorRef collapses to AuditActor so flag history and the audit log share one shape. Operator UI: - /api/audit proxy that forwards query params verbatim - types/audit.ts - pages/audit.vue: real list with quick-pick action chips (All/Tenants/Partners/Flags/Users), outcome filter, free-text search, "Load older events" cursor pagination - pages/index.vue: Overview activity card swaps mock OP_AUDIT for the same /api/audit endpoint, rows link into /audit - data/fixtures.ts: OP_AUDIT / AuditEntry / AuditTone exports removed. Verified end-to-end: suspended + resumed acme, flipped oci_versioning through rollout → kill → on, then /audit returned all 5 events with the right action verbs (tenant.suspended, tenant.resumed, flag.updated, flag.killed, flag.updated), actor admin@dezky.local, IP 192.168.65.1. Filters (action prefix + free-text q) narrow correctly. Out of scope for this commit (each gets its own conversation): - Authentik / OCIS / Stalwart ingest adapters (Phase 2) - Hash-chain tamper evidence (Phase 3) - TTL + cold-storage archival to Hetzner Object Storage (Phase 4) - GDPR right-to-erasure tooling	2026-05-24 19:50:24 +02:00
Ronni Baslund	868a305539	feat(flags): real feature-flag system with bulk eval + operator UI Real backend for the flags page (was pure mock). Built so it's ready for the first risky rollout (likely the Stalwart JMAP client or the Stripe billing engine). services/platform-api: - Flag schema (key, description, state, pct, scope.{plans, tenantSlugs, partnerSlugs, environments}, embedded history capped at 20) - FlagsService with CRUD + evaluateAll(tenantSlug) → { key: bool }. Eval algorithm: off → false; on → true; targeted → require non-empty scope (empty allowlist means "nobody"), then match every non-empty axis; rollout → match scope, then sha256(`${tenantId}:${key}`) % 100 < pct. Hash-based rollout is deterministic: bumping pct only flips the new slice. Pure helpers (matchesScope, hasAnyScope, inRolloutBucket) are exported for future unit tests. - FlagsController exposes GET /flags, GET /flags/:key, POST /flags/evaluate (JwtAuthGuard); POST/PATCH/DELETE require OperatorGuard. History entries capture the actor's email. - SeedService idempotently creates 10 flag keys mapping to real Dezky concerns (jmap_native_v2, gdpr_export_v2, new_billing_engine, etc.). $setOnInsert so operator edits survive restarts. apps/operator: - 6 proxies: /api/flags index get/post, [key] get/patch/delete, evaluate post - types/flag.ts with the shape that mirrors the backend - pages/flags.vue: useFetch real list, row click opens FlagDetail, "New flag" opens NewFlagModal, scope summary column shows targeting at a glance - FlagDetail.vue: side panel with segmented state, rollout slider with live "~N of M tenants" preview from /api/tenants, plan/tenant/env chip pickers, dirty-tracked Save, instant Kill-switch (PATCH state=off+pct=0), embedded change history - NewFlagModal.vue: minimal create form (key + description). Everything else is configured in the detail panel afterward. 
- CommandPalette: feature-flag rows now come from /api/flags instead of the dropped fixture, so newly-created flags are searchable immediately - data/fixtures.ts: drop FLAGS / FeatureFlag exports (replaced by the real backend) Smoke-tested end-to-end: list renders 10 seed flags, opening gdpr_export_v2 and flipping to rollout 25% then saving persists + adds a history entry, kill-switch sets state=off in one click, /api/flags/evaluate returns the correct booleans for the seeded tenant, same tenant gets the same answer on consecutive evals (determinism), and creating + deleting a flag through the UI roundtrips correctly.	2026-05-24 19:21:15 +02:00
Ronni Baslund	77a09aaf77	feat(operator): live Infrastructure probes + honest split between deployed and planned The Infrastructure page used to read from a mock fixture that lied two ways: it listed services that aren't deployed (Jitsi, Zulip, Cloudflare, Object Storage, Postmark) and showed hardcoded uptime/latency for the ones that are. Now it shows truth from real probes plus a clearly-labelled "planned" section for the rest. Backend (services/platform-api): - New src/health/ module. HealthService runs 9 probes in parallel with a 1.5s timeout each: Stalwart → TCP stalwart:8080; OCIS → HTTP GET ocis:9200/health; Collabora → HTTP GET collabora:9980/hosting/discovery; Authentik → HTTP GET authentik-server:9000/-/health/ready/; Postgres → TCP postgres:5432; Mongo → existing Mongoose connection.db.admin().ping(); Redis → TCP redis:6379; Traefik → TCP traefik:80; Platform API → trivially ok (this code is running). Status thresholds: ok ≤500ms, warn 500–1500ms, bad on timeout/refuse. - HealthController exposes GET /health/platform behind JwtAuthGuard, plus keeps the existing public GET /health for infra liveness checks. - Moved the old src/health.controller.ts into the new module. Frontend (apps/operator): - /api/health/platform proxy forwards the operator's access token. - Infrastructure page swaps SERVICES fixture for useFetch with 30s auto-refresh + a manual Refresh button. Cards show real status badge + real latency; uptime/error stay as em-dash with a "no probe history yet" tooltip until a Prometheus/event-log backend lands. - Below the live grid, a "Planned · not deployed" section renders 5 dimmed cards (Jitsi, Zulip, simpledns.plus, Hetzner Object Storage, Postmark). simpledns.plus replaces the misnamed Cloudflare entry; we use simpledns.plus, not Cloudflare. - Subtitle is now truthful: "8 / 9 services live · checked 2s ago". Verified: stopped redis → card flipped to "down · getaddrinfo ENOTFOUND redis", subtitle reflected 8/9, incident banner appeared. 
Restarted → back to 9/9, banner gone. SERVICES fixture stays in place for Overview's incident banner — replacing that is a separate follow-up tied to the incident-management backend.	2026-05-24 18:47:38 +02:00
Ronni Baslund	fbbb43e3e2	feat(operator): partner management with attach/detach (O.6) - Partners list with name/domain/status/customers/margin + Create modal - Partner detail: contract card, contact card, customers table, attach modal, terminate (soft-delete) danger card - Operator proxies for /partners + /partners/:slug/tenants - platform-api: add partnerId Prop to Tenant schema. The field was being silently dropped by Mongoose because the schema didn't declare it. - tenants.service: rewrite update() to build $set/$unset explicitly and cast partnerId via new Types.ObjectId(). Handles detach via $unset so the field vanishes from the doc cleanly.	2026-05-24 08:02:00 +02:00
Ronni Baslund	8e81730372	feat(operator): tenant list + 7-tab detail with real lifecycle (O.5) Operator can now manage tenants end-to-end from the UI: - pages/tenants/index.vue: list with status/plan/domains/created/provisioning-state columns, search by slug or name, status chips with live counts (all/active/pending/suspended), click-through to detail - pages/tenants/[slug].vue: 7-tab detail (Overview, Users, Resources, Billing, Audit, Support, Danger zone) - 3 tabs hit real backends: Overview (identity + billing fields), Users (lazy-loaded via new GET /tenants/:slug/users endpoint), Resources (live provisioning state per integration + Reconcile button) - 3 tabs render mock fixtures with warn-tone "mock" badges: Billing (Stripe placeholder), Audit (sample log lines), Support (placeholder pending the ticket queue work) - Danger zone: 3 real-backend cards (Suspend / Resume / Soft-delete), each gated by a ConfirmDialog modal. Verified live: clicked Suspend on acme, status flipped to 'suspended' in Mongo, then Resumed back to 'active'. platform-api additions: - GET /tenants/:slug/users returns users with this tenant in their tenantIds, sorted by last login. Same authorization rule as the existing /tenants/:slug; platform admins always pass, non-admins must be a member of the tenant - tenants.module imports User schema for the new lookup. New components (apps/operator/components/): - Tabs.vue: horizontal strip with optional per-tab counts, v-model - ConfirmDialog.vue: Teleport-to-body modal, Escape/backdrop close, danger/primary tone for the confirm button. Server proxy infrastructure (apps/operator/server/): - utils/platform-api.ts: single helper encapsulating access-token-from-session + bearer-forward + error normalization. 
Every operator proxy route is now a one-liner against this helper - api/tenants/index.get.ts, [slug]/{index.get,index.patch,index.delete, users.get,suspend.post,resume.post,reconcile.post}.ts Two real bugs found and fixed during the smoke test: - Mongoose subdocument `_id` leaks into JSON when iterating tenant.provisioningStatus. Switched to an explicit `['authentik', 'stalwart', 'ocis']` whitelist in both v-fors - Documents created before provisioningErrors was added (like the acme tenant) don't have the field at all in JSON. Use optional chaining (`tenant.provisioningErrors?.[k]`) instead of bracket access. Without it: 'Cannot read properties of undefined (reading "authentik")' during the Resources tab render	2026-05-24 07:44:23 +02:00
Ronni Baslund	55b1c133e3	feat(operator): scaffold apps/operator Nuxt app + multi-issuer JWT (O.3) New Nuxt 3 app at apps/operator/ — internal admin portal on its own domain (operator.dezky.local), own OAuth client (dezky-operator), own session secrets, own cookies. Customer and operator surfaces can't decrypt each other's session state. OAuth flow verified end-to-end: - GET / → middleware redirect to /auth/login - User clicks Sign in → /auth/oidc/login → bounces to Authentik with client_id=dezky-operator, scope includes 'groups' - Authentik checks dezky-platform-admins group binding (added in O.1), silent-reauths via the existing auth.dezky.local session - Returns to /auth/oidc/callback with code, exchanges for token, creates session cookie on operator.dezky.local - Lands on pages/index.vue placeholder dashboard Smoke test 'Create partner "test-partner"' button on the placeholder home exercises the full operator-only authorization chain: - 1st call: 200, partner created in Mongo - 2nd call: 409 'already exists' (idempotency holds, token still valid) - Same call from the customer portal: 403 'requires operator-scoped token' (audience guard rejects dezky-portal aud) JwtAuthGuard now multi-issuer in addition to multi-audience. Each Authentik OAuth provider mints tokens with its own per-app iss URL (.../application/o/<slug>/), so the guard accepts a comma-separated AUTHENTIK_ISSUER. The audience-only fix from O.2 wasn't sufficient — issuer is validated separately by jose.jwtVerify and was still pinned to dezky-portal alone, yielding 'unexpected iss claim value' rejections. Compose changes: new 'operator' service (Node 20 alpine, pnpm install + nuxt dev, mkcert CA mount, traefik labels for operator.dezky.local + TLS); new operator_node_modules volume; operator.dezky.local added to traefik's Docker network aliases. Distinct OPERATOR_NUXT_OIDC_* session secrets pulled from .env (gitignored, generated via openssl). 
Real operator screens (sidebar, topbar, tenants, partners, etc.) come in O.4. This commit is pure scaffolding + the security boundary proof.	2026-05-24 07:20:16 +02:00
Ronni Baslund	2db41fec5e	feat(platform-api): multi-audience JWT + Partner CRUD + tenant lifecycle (O.2) JwtAuthGuard now accepts a comma-separated AUTHENTIK_AUDIENCE ('dezky-portal,dezky-operator'). jose.jwtVerify takes an array and succeeds on any match, so both customer-portal and operator-portal tokens validate against this service. Per-endpoint guards restrict further. New OperatorGuard enforces operator-only mutations: 1. JWT audience claim includes 'dezky-operator' (proof from the token alone that this is a privileged session) 2. ActorService-resolved User has platformAdmin=true (DB check so revocation works without waiting for the token to expire) Both required; either alone is insufficient. Partner module: - Partner schema: slug, name, domain, status, marginPct, contactInfo, billingInfo. marginPct is one number per partner (decided in grilling) - CRUD endpoints under @UseGuards(JwtAuthGuard, OperatorGuard); every partner mutation requires operator scope - GET /partners returns each row with a computed customers count from aggregating Tenant.partnerId. MRR aggregation deferred until Subscription gains a price column - GET /partners/:slug/tenants for the partner detail view - DELETE soft-terminates (status='terminated'); never hard-delete because tenants may still reference the partner Tenant changes: - partnerId?: Types.ObjectId (ref Partner, indexed sparse) added to Tenant schema - UpdateTenantDto accepts partnerId so PATCH can attach/detach - POST /tenants/:slug/suspend and /resume, operator-only via OperatorGuard. PATCH already covers plan/domains/partnerId changes Smoke test: customer-portal session sends POST /api/partners through the portal proxy → 403 "This endpoint requires an operator-scoped token". The positive test (operator-token → 200) waits for O.3 when there's an operator app to mint the right token. apps/portal/server/api/partners/index.post.ts is a temporary verification proxy; delete once the operator portal exists.	2026-05-24 07:08:59 +02:00
Ronni Baslund	22b2583f0b	chore(services): rename services/provisioning -> services/platform-api O.0 prep from OPERATOR-PLAN.md. Mechanical refactor before adding partner management and operator-specific endpoints. The service now owns more than just provisioning orchestration (it'll soon own partners, tenant lifecycle actions, multi-audience JWT validation), so the name 'platform-api' reflects its scope better. What changed: - Directory: services/provisioning/ -> services/platform-api/ - Package: @dezky/provisioning -> @dezky/platform-api - Docker: container_name dezky-provisioning -> dezky-platform-api; compose service key 'provisioning' -> 'platform-api'; volume provisioning_node_modules -> platform_api_node_modules - Portal: PROVISIONING_INTERNAL_URL env var -> PLATFORM_API_INTERNAL_URL, default URL http://provisioning:3001 -> http://platform-api:3001 in all three proxy routes (me.get.ts, tenants/index.post.ts, tenants/[slug]/reconcile.post.ts), plus NUXT_API_BASE updated - Health endpoint service identifier and main.ts log lines updated to 'dezky-platform-api' - Docs swept: README, CLAUDE.md, SERVICES.md, AUTHENTIK-SETUP.md, NEXT-STEPS.md, TROUBLESHOOTING.md, OPERATOR-PLAN.md, traefik/dynamic.yml What deliberately stays: - Internal module names ProvisioningService / ProvisioningModule (those describe an orchestration sub-concern, not the service's purpose) - Tenant.provisioningStatus / provisioningErrors field names (state per integration, not service name) - File services/platform-api/src/tenants/provisioning.service.ts - 'Hetzner provisioning' references in production-prep docs (infrastructure provisioning, unrelated) Verified end-to-end after rename: /api/me returns 200 with profile + 2 tenants + subscription, /api/tenants/dezky/reconcile returns 200 with Authentik integration still ok. OPERATOR-PLAN.md O.0 checkboxes ticked.	2026-05-24 00:35:01 +02:00
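
Sketch for 868a305539 (flags): the rollout rule is given there as sha256(`${tenantId}:${key}`) % 100 < pct, with a helper named inRolloutBucket. The commit does not show the helper's body or signature, so what follows is only one plausible reading. In particular, reducing the digest to a number by reading its first four bytes as an unsigned 32-bit integer is an assumption made here, not something the log confirms.

```typescript
import { createHash } from "node:crypto";

// Hypothetical signature: the commit only names inRolloutBucket.
// Assumption: the first 4 bytes of the SHA-256 digest are read as an
// unsigned 32-bit big-endian integer before taking the remainder.
function inRolloutBucket(tenantId: string, key: string, pct: number): boolean {
  const digest = createHash("sha256").update(`${tenantId}:${key}`).digest();
  const bucket = digest.readUInt32BE(0) % 100; // stable value in 0..99
  return bucket < pct;
}

// The bucket depends only on (tenantId, key), so raising pct can only add
// tenants to the enabled set; nobody who was already in drops out.
function enabledTenants(tenantIds: string[], key: string, pct: number): string[] {
  return tenantIds.filter((id) => inRolloutBucket(id, key, pct));
}
```

Under these assumptions, enabledTenants(ids, key, 25) is always a subset of enabledTenants(ids, key, 50), which matches the commit's note that bumping pct only flips the new slice.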

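
Sketch for 7bec940e7f (Stalwart webhook): the receiver is described as verifying an HMAC-SHA-256 signature and rejecting bad ones via timingSafeEqual. A minimal version of that check could look like the following. The function name verifySignature is hypothetical, and treating the X-Signature value as a hex-encoded digest of the raw request body is an assumption; the commit does not state the encoding.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical helper, not the controller's actual code.
// Assumption: the signature header carries a hex-encoded HMAC-SHA-256
// of the raw request body, keyed with STALWART_WEBHOOK_SECRET.
function verifySignature(rawBody: string, signatureHex: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws when lengths differ, so reject those up front.
  if (received.length !== expected.length) return false;
  return timingSafeEqual(received, expected);
}
```

A caller would map a false result to the 401 response the commit describes. A short value such as deadbeef decodes to 4 bytes and fails the length check before any comparison runs.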
11 Commits
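
Sketch for df18128617 (OCIS file-tail): the worker stores its cursor as "<inode>:<bytePosition>" and, on each poll, either resumes from the saved byte position or starts over after an inode change or truncation. The decision can be expressed as a few pure functions. All names below (TailCursor, parseCursor, formatCursor, resumeOffset) are hypothetical and not taken from ocis.ingest.ts; the stat argument mirrors the ino and size fields of Node's fs.Stats.

```typescript
interface TailCursor {
  inode: number;
  position: number;
}

// Parse the "<inode>:<bytePosition>" form; anything else counts as no cursor.
function parseCursor(lastEventId: string | null): TailCursor | null {
  if (!lastEventId) return null;
  const m = /^(\d+):(\d+)$/.exec(lastEventId);
  return m ? { inode: Number(m[1]), position: Number(m[2]) } : null;
}

function formatCursor(cursor: TailCursor): string {
  return `${cursor.inode}:${cursor.position}`;
}

// Pick the byte offset for the next read.
function resumeOffset(cursor: TailCursor | null, stat: { ino: number; size: number }): number {
  if (!cursor) return 0; // first run: read from the top
  if (cursor.inode !== stat.ino) return 0; // file replaced or rotated
  if (stat.size < cursor.position) return 0; // file truncated in place
  return cursor.position; // same file, continue where we stopped
}
```

Starting over re-reads lines that may already be stored; per the commit, the (source, externalId) unique index absorbs that overlap.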