The shipped policies.ini has maxattsize='' and PHP 8's string/int
comparison semantics make SyncProvisioning's '>-1' check fail, so
every Provision command 500'd ('Invalid policies!') and the device
got 449 retry-after-provisioning on FolderSync/Sync/SendMail forever
— account added but no folders, no sent mail, no calendar. dezky
doesn't manage device policies (no remote-wipe promise), so turn
PROVISIONING off rather than patching the policy file.
BackendCombined login is all-or-nothing, and Stalwart returns 404 for
/dav/card/<account>/ when the account's default address book was never
created (it doesn't auto-create on the gateway's PROPFIND the way the
calendar home worked) — so CardDAV killed every otherwise-successful
EAS login. Exchange accounts now bundle mail + calendar; contacts stay
on the Apple-profile CardDAV path. Re-enable BackendCardDAV once
platform-api provisions a default address book at mailbox creation.
Copy/docs aligned: portal hint, SERVICES.md, website FAQ (da+en).
include/z_caldav.php needs XMLDocument.php from AWL (Andrew's Web
Libraries); the Debian z-push packages pull php-awl in automatically
but bookworm dropped the package, so vendor it from upstream at
r0.65 into /usr/share/awl/inc (already on Z-Push's include_path).
Only surfaces on *authenticated* requests: combined login hits IMAP
first, so fake-credential smoke tests never reach the CalDAV class.
Hardening from the same incident: a build-time class-load smoke test
fails the image if any backend dependency is missing, and
zend.exception_ignore_args stops uncaught fatals from logging the
raw passwords Z-Push passes through Logon().
The replacement imap.config.php omitted constants the backend
references unconditionally — SYSTEM_MIME_TYPES_MAPPING 500'd every
authenticated request (backend construction, before login, so the
unauthenticated 401 smoke tests never hit it). Define all remaining
upstream constants with their defaults so the replacement file can
never be narrower than the template it replaces.
Wraps Stalwart in EAS so iOS/Android native Mail/Calendar 'Exchange'
accounts get two-way mail+calendar+contacts sync (BackendCombined:
IMAP + CalDAV /dav/cal/%l/ + CardDAV, credentials pass through).
- services/zpush: Z-Push 2.6.4 (AGPLv3, see LICENSE-NOTES.md) on
php:8.2-apache-bookworm (trixie dropped libc-client); PHP 8 sysv
sprintf fatal sed-patched; autodiscover dispatcher answers
mobilesync schema, proxies outlook schema to Stalwart unchanged
- prod: zpush Deployment (replicas:1, Recreate — file sync state),
/Microsoft-Server-ActiveSync Ingress on mail.dezky.eu (no redirect,
POST-heavy), autodiscover.dezky.eu repointed to the dispatcher,
selectorless stalwart-imaps/-smtps Services (host-Stalwart is
implicit-TLS only: 993/465, no plain 143/587 — verified on node1)
- CI: build+deploy zpush like the other apps
EAS tops out at 14.1: covers native mobile clients, NOT the Outlook
mobile app (needs 16.1) and not new Outlook for Windows (no EAS).
DAV was internal-only (the node's :443 is Traefik's). New mail-dav
Ingress routes /.well-known/caldav, /.well-known/carddav and /dav on
mail.dezky.eu through to Stalwart — with the HTTPS-redirect middleware
(safe for DAV's GET/PROPFIND; kept OFF the autodiscover Ingress whose
POSTs don't survive redirects). The _caldavs/_carddavs SRV records are
now legitimate, so the Domains page surfaces them, and the Apple
.mobileconfig gains CalDAV + CardDAV payloads: one install sets up Mail,
Calendar and Contacts on Mac/iPhone. Stalwart's STALWART_PUBLIC_URL is
set to https://mail.dezky.eu on the host (discovery documents).
Outlook autodiscovers via POST https://autodiscover.<domain>/autodiscover/
autodiscover.xml and Thunderbird via autoconfig.<domain>/mail/
config-v1.1.xml — Stalwart serves both (verified, answers carry
mail.dezky.eu:993/465) but its HTTP listener wasn't reachable from
outside (the node's :443 is Traefik's). New exact-path-only Ingress
routes JUST those discovery endpoints to host-Stalwart via a selectorless
Service + Endpoints on the cni0 gateway; the admin/management surface
stays internal, and there's no HTTPS-redirect middleware because
Thunderbird probes plain HTTP and Outlook POSTs.
Domains page now also lists the autoconfig/autodiscover CNAMEs under the
autodiscovery slot (CNAME verified against the mail host; a bare A record
warns instead of failing). Customer-domain autodiscovery (per-domain
certs + automated Ingress) is a follow-up.
Mail clients could never autoconfigure: Stalwart's zone file contains the
_imaps/_submissions/_pop3s SRV records but classify() dropped everything
except mx/spf/dkim/dmarc, so customers never saw them and every client
needed manual server entry. New 'autodiscovery' record kind: classified
from the zone (only the services actually reachable in prod — the
_jmap/_caldavs SRVs target :443 which Traefik owns, deferred to the
webmail story), verified via resolveSrv (missing=bad, wrong target=warn),
shown as an OPTIONAL slot on the portal Domains page that never gates the
domain status or the records-to-fix nag.
Also fixed on the live server via management JMAP (x:SystemSettings):
hostname was the machine name node1.dezky.eu from the v0.16 auto-bootstrap
— MX/SRV targets and the SMTP banner now say mail.dezky.eu, and the LE
x:Certificate is set as defaultCertificateId.
Identifying the company tenant by slug in env was fragile — every
purge/recreate changed the slug (or id) and the apex guard chased reality
through three config flips in one day. The identity now lives ON the
tenant document: isPlatformTenant, operator-set from the tenant page
(single holder — setting it clears the flag everywhere else), guarded so
tenant admins can't set it on themselves through the shared PATCH route.
The dezky.eu apex guard reads the flag; PLATFORM_TENANT_SLUG is gone.
Dev seed flags its seeded tenant. config-rev 5 rolls platform-api.
The minimal create modal silently dropped adminName/adminEmail — the invite
only existed in the partner wizard's server path. Operator now gets the
same 5-step wizard UX (organization, domain, first admin, plan with live
price catalog, review) composed client-side: POST /tenants creates +
provisions, then POST /users/invite-tenant-admin (new, operator-only —
lives in UsersModule because UsersModule already imports TenantsModule and
the reverse would be circular) runs the same inviteTenantAdmin flow the
partner gets, and the result view hands over the single-use recovery link
or temp password. Tenant detail page gains an Invite admin action for
retries/successors. PLATFORM_TENANT_SLUG back to 'dezky' (the recreated
company tenant) + config-rev bump to roll platform-api.
Soft-delete kept the slug occupied forever — no way to remove a test tenant
and reuse its name, and external resources lingered. DELETE
/tenants/:slug/purge (platform-admin only, two-step: refuses anything not
already soft-deleted) tears down the Stalwart service + customer domains
(never the platform apex — the management admin account lives there) and
the Authentik group, then removes domains/subscriptions/invoices/user
links/the tenant doc. Audit trail is kept. Operator detail page shows a
'Purge permanently' card once a tenant is soft-deleted.
The company tenant ended up as slug dezky-aps (the seeded 'dezky' tenant was
deleted), so the hardcoded apex allowance for slug 'dezky' would have
rejected adding dezky.eu to the real tenant. PLATFORM_TENANT_SLUG env
(default 'dezky') now names the only tenant allowed to claim the
PLATFORM_TENANT_DOMAIN apex.
dezky.eu doubles as the platform's infrastructure domain AND the company's
own employee mail domain (added to the dezky tenant via the normal Domains
flow). Guard rails in DomainsService.add:
- a domain already used by ANY other workspace is rejected — Stalwart's
idempotent ensureDomain would otherwise silently share one mail domain
(and its mailboxes) between tenants
- the PLATFORM_TENANT_DOMAIN apex is claimable only by the dezky tenant;
everything under it (per-tenant service domains, auth/api/mail/* infra
hosts) is reserved outright
Set PLATFORM_TENANT_DOMAIN=dezky.eu in the prod ConfigMap (was unset, so
prod service domains would have been {slug}.dezky.local) and align the
seeded dezky tenant's display domain with the environment.
The operator infrastructure page probed docker-compose hostnames
(stalwart/postgres/redis/traefik…) which don't resolve in k3s — 7 of 9
services showed down. Probe targets now come from HEALTH_* env vars with
the compose names as dev defaults; platform-api-config.yaml sets the
in-cluster/host addresses. 'disabled' omits a service from the report —
used for OCIS/Collabora until the files tier is deployed.
The apps were wired for the dev (.local) environment. Drive the base URLs from
env so one build serves dev and prod (.eu):
- portal nuxt.config: OIDC authorization/token/userinfo/discovery URLs +
redirectUri now derive from NUXT_PUBLIC_AUTH_URL / NUXT_PUBLIC_PORTAL_URL
(+ PORTAL_OIDC_APP_SLUG); .local defaults keep dev working with no env.
- portal sign-out handler: end-session + post-logout URLs env-driven.
- portal scheduling page: booking base/host from runtimeConfig.public.bookingUrl
(NUXT_PUBLIC_BOOKING_URL).
- platform-api: tenant mail domain suffix from PLATFORM_TENANT_DOMAIN (dezky.eu
in prod), defaulting to dezky.local.
(booking needs no change — its only .local ref is the dev-server allowedHosts.)
Rebuild the /admin/users detail drawer from a read-only profile into an
editable, Office 365-style panel with four sections:
- Username & mail: read-only primary for mailbox users; editable sign-in
(Authentik-only) for mailbox-less identities; "Create mailbox" provisions
a Stalwart inbox for an external-login admin
- Aliases: list/add/remove mailbox aliases (Stalwart), domain-scoped
- Role: member/admin toggle with a primary-account lock (owner, mailbox-less
bootstrap admin, self) and a last-admin guard
- Contact information: display name, first/last name, phone, alternative
email — mirrored best-effort to Authentik attributes + mailbox name
Ownership transfer: "Make owner" (row menu + drawer) plus an owner-side
"Transfer ownership" picker, gated to tenant admins / platform admins so a
departed owner can be replaced; promotes the target and demotes the prior
owner to admin.
Backend (platform-api): contact fields on User; AuthentikClient.updateUser;
StalwartClient.setMailboxName; UsersService updateTenantMember,
changeMemberPrimaryEmail, list/add/removeMemberAlias, createMailboxForMember,
transferOwnership; new DTOs and tenant-member routes. All mutations audited.
Portal: Nuxt proxies for the new endpoints + extended TenantUserDoc.
Surface pending/calendar_failed booking states in the admin bookings list with
proper status badges (failed shows the last calendar error as a tooltip), and
add an operator "Retry now" action. The retry re-drives the same Stalwart
calendar write (confirm + attendee email on success); for a terminal
calendar_failed booking it re-claims the slot lock atomically first and refuses
if the time was taken in the meantime, so a manual retry can never double-book.
A failed Stalwart calendar write during confirmation no longer deletes the
booking + SlotLock. The booking stays 'pending' with its lock retained, and a
new @Cron worker (every 2 min, max 5 attempts by default) re-drives the write:
on success it promotes to 'confirmed' and sends the confirmation email; after
the cap it moves to the terminal 'calendar_failed' state and releases the lock.
Tracks calendarWriteAttempts + lastCalendarError on the Booking. The public
confirm endpoint still throws 503 on a failed first write (preserving the DoD:
never surface a confirmed booking without a calendar event); the pending row is
left for the background retry to finish.
Customer-admin Mail settings backed by Stalwart JMAP: per-tenant aliases
(extra addresses routing to a mailbox) and distribution lists (one address
fanning out to many recipients). Adds StalwartClient x:Alias/x:MailingList
methods, a tenant-scoped MailController/MailService, the portal Mail settings
page and its proxy routes, and the mailboxAddress field on TenantUserDoc.
Removes the old mock mail data now that the page reads live data.
Wire the mail/identity stack to real Stalwart/Authentik/OCIS provisioning,
replacing the mocked Domains and Users pages.
Domains (customer-admin):
- StalwartClient: real JMAP management (v0.16 dropped REST) — create/list/delete
email domains via x:Domain at the internal http://stalwart:8080 listener;
DKIM auto-generated; the records to publish are read from the domain's
dnsZoneFile. Gated by STALWART_PROVISIONING_ENABLED.
- New Domain collection + DomainsModule: add/list/recheck/set-DMARC/remove,
tenant-membership-gated and audited.
- DnsVerifierService: verifies MX/SPF/DKIM/DMARC/ownership against a public
resolver (1.1.1.1/8.8.8.8) and diffs them against the expected records.
- Remove is guarded: refuses while accounts/aliases/mailing lists still use the
domain (via Stalwart referential integrity).
- Domains page + add wizard on real data; sidebar badge counts domains needing
attention.
Users & groups (customer-admin):
- Create a member provisioned across Authentik SSO, a Stalwart mailbox on the
tenant's primary domain, and OCIS — returning a one-time password.
- Lifecycle: suspend/resume (Authentik is_active + freeze the mailbox via
account permissions, original password preserved), force-logout (terminate
sessions, filtered client-side so it can never end other users' sessions),
reset password (new one-time password on SSO + mailbox), and remove (tear down
mailbox + SSO identity + OCIS + doc; mailbox-in-use aware for multi-tenant
users). Self-suspend / self-force-logout are blocked.
Infra: point platform-api at the internal Stalwart listener; document the new
STALWART_/provisioning vars in .env.example.
GET /tenants/:slug/users now returns a tenant-scoped `tenantRole`
(resolved server-side via roleForTenant), and the operator tenant page
displays it instead of the global `role` — so a user who is admin here
but member elsewhere reads correctly in this tenant's context. The
global `role` field is kept intact for other consumers.
A user who is admin in one tenant but member in another must read
'admin' for this tenant — use roleForTenant() rather than the global
u.role fallback when building the tenant users list.
The Storage page + endpoint landed earlier but had no working OCIS
backend credential. OCIS has no service-account/client-credentials grant
and trusts a single issuer, and basic auth resolves no user in our
external-IdP setup — so authenticate OcisClient via an OIDC
refresh-token bootstrap instead:
- One-time headless login of svc-platform-api against the ocis provider
(public client ocis-web, issuer .../o/ocis/) yields a refresh token,
persisted in Mongo (ocis_credentials) and rotated on every use.
- OcisClient mints access tokens with the refresh_token grant; the
service user holds the OCIS admin role (OCIS_ADMIN_USER_ID) so
libregraph ListAllDrives works.
- scripts/bootstrap-ocis.mjs re-runs the bootstrap if the token lapses.
- Dashboard Plan card gains a storage capacity bar beside seats;
hidden when storage is unavailable.
- compose + .env.example: OCIS service OIDC env and admin user id.
- docs/NEXT-STEPS: document the mechanism and the dead-end alternatives.
Security & audit (admin)
- Audit log: real, tenant-scoped — widened GET /tenants/:slug/audit with
q/action/outcome/actorEmail/since/before; UI gains search, outcome + time
filters, action chips, cursor pagination, and client-side CSV export.
- Security policy: new tenant.securityPolicy (mfaMode, session idle/absolute,
allowedCountries, ipAllowlist) + PATCH /tenants/:slug/security-policy
(membership-gated, audited). Editable, labelled by enforcement status.
- MFA: live enrollment overview via GET /tenants/:slug/mfa-status
(Authentik countAuthenticators per member).
- SSO apps (Dezky as IdP): real Authentik OIDC provider + application CRUD,
scoped to the tenant group. New AuthentikClient methods (provider/app/binding
+ flow/key/scope discovery), TenantSsoApp schema, TenantSsoService (rollback
on partial failure; client secret never stored), GET/POST/DELETE
/tenants/:slug/sso-apps. Validated end-to-end against live Authentik.
- Deferred: shared-flow MFA/geo/session enforcement (global auth-flow blast
radius) — to be done as its own reviewed change.
Bundled in-progress work that shares the same files (kept together so the tree
stays green):
- Storage page: StorageService + GET /tenants/:slug/storage (OCIS-backed),
storage.get proxy, storage.vue.
- Per-tenant roles: User.tenantRoles + MeProfile.tenantRoles plumbing.
Access & navigation
- Gate partner-mode strictly to partner staff so admins/end-users never inherit
leftover partner-view state; purge stale session entry on hydrate.
- Role-driven admin entry: useMe.isTenantAdmin, Admin/Personal tiles in the app
launcher, and an /admin route guard in the global middleware (fail closed).
- Drop the duplicate user identity block from the sidebar footer.
Admin pages on real data
- New tenant-scoped, membership-gated endpoints: GET /tenants/:slug/{audit,users,
invoices}; useTenant composable resolves the active workspace + subscription.
- Dashboard: real seats, spend (cycle-normalized + minor-units), plan, renewal,
and recent audit; unbacked sections removed.
- Users & groups: real members; Groups/Invitations/Service accounts shown as
honest "coming soon".
- Subscription & invoices: real plan hero, invoice history, and billing details.
Stripe payment method (Elements + SetupIntent)
- StripeClient: publishable key + getDefaultCard/createSetupIntent/setDefaultCard.
- CustomerBillingController + BillingService methods (ensure-customer on demand).
- Portal: PaymentMethodModal, useStripeJs (CDN load), proxies; hidePostalCode.
Editable billing details & whitelabel branding
- PATCH /tenants/:slug/billing-info (narrow: company/VAT/country/email).
- TenantBranding schema/service + GET/PUT /tenants/:slug/branding: real product
name, accent colour, and per-tenant email-template overrides.
- Branding preview + sidebar workspace mark wired to real name/plan/seats/colour
with YIQ auto-contrast (readableOn util).
Session resilience
- Request offline_access so Authentik issues a refresh token (automaticRefresh).
- Silent refresh + single retry on 401 for writes (useApiFetch, incl. partner
pages) and reads (useMe.fetchMe) — no redirect, no lost input.
- Modal backdrop closes only on press+release on the backdrop (no more
drag-select-to-close).
Editing a catalog amount now propagates beyond MongoDB. Stripe Prices are
immutable, so each changed currency mints a fresh Stripe Price at the new
amount, overwrites the cached stripePriceIds[currency] (which also fixes the
stale-price bug for new subscriptions), and repoints every live subscription
on that (row, currency) onto it with proration_behavior 'none' — the new
amount takes effect at the customer's next billing cycle, no mid-cycle charge.
The per-seat snapshot is refreshed so MRR reflects the go-forward rate.
Before committing the edit, the operator sees a warning with the affected
customer count, driven by a new GET /prices/:id/impact endpoint. Per-sub
failures are logged, never fatal; Stripe-disabled rows still re-snapshot.
Add BillingService.generatePayouts: idempotent per-partner/month/currency snapshot of gross MRR x marginPct into Payout rows (never rewrites a paid row), plus platformPayouts(). A PayoutWorker generates the current month daily (and on boot; PAYOUTS_AUTOGEN=false to disable). Operator endpoints GET /billing/payouts + POST /billing/payouts/generate, an operator payouts ledger table with a Generate button, and the proxy routes. The partner Payouts tab now shows real data.
Wire the Stripe lifecycle into TenantsService (best-effort, gated on stripe.enabled): on create open a Stripe customer and, for a priced plan with seats >= 1, lazily create a Stripe Product+Price (persisted to Price.stripePriceIds[currency]) and a send_invoice subscription; mirror seat changes to the subscription quantity; pause/resume on suspend/resume; cancel on delete. A Stripe failure never blocks the tenant — the local Subscription stays the source of truth for derived MRR.
Add a lazy/guarded Stripe client (boots without keys), Invoice/Payout schemas, per-currency Price.stripePriceIds, and a BillingService deriving partner/platform summaries, invoices and a partner-cut payout ledger. Partner and operator billing controllers plus a signature-verified Stripe webhook (Fastify raw body). Frontend: partner and operator billing pages and the operator tenant billing/audit tabs on real data. Gated behind new_billing_engine and BILLING_STRIPE_ENABLED; live money paths stay off until keys are set.
Partner reports — health cohorts, revenue-by-plan, top customers, signup/churn cohorts, plus saved custom reports (create/list/delete). Operator platform-wide reports (MRR, revenue by plan, top tenants, growth). Replaces the reports fixtures in both apps.
Backend (platform-api): computed tenant health plus industry/brandColor; partner-scoped tenant update/suspend/resume guarded by assertPartnerOwnsTenant; enriched partner users (MFA + access level) with invite/remove; partner settings and whitelabel branding persistence; Authentik authenticator counting and group removal. Audit on every mutation.
Frontend (portal): all five partner pages on real data — dashboard alerts, customers edit/suspend, team MFA/access with invite/remove, editable settings, branding fetch/save.
Operator: dashboard and infrastructure service health driven by real liveness probes; fabricated uptime/p95/error-rate removed.
@IsOptional() only skips validation for null/undefined; an empty string
still trips @IsEmail(). When the operator UI saves billingInfo with a
blank contactEmail, the request 400'd with 'must be an email'. Coerce
'' → undefined on every @IsEmail-decorated field via a shared
@Transform so blank inputs round-trip cleanly.
partner.updated events previously recorded only the field names that
changed (metadata.changes). Now they record metadata.diff — a
{ field: { from, to } } map — by reading the partner before the
findOneAndUpdate and comparing serialized values. Only fields that
actually differ make it into the diff, so a save-without-changes
records an empty diff instead of every DTO key.
The operator audit row's expanded panel renders the diff as a small
inline table (field · from → to). Older audit rows that still carry
metadata.changes fall back to the original chip layout so historical
events stay readable.
Toggle the partner detail cards from read-only to editable in place. Edit
button in the PageHeader flips to Cancel + Save changes; cards expose
text inputs for name/domain/contact/billing, a 4-option segmented
control for status, and a 0–100 range slider for marginPct. Save sends
a PATCH diff (only fields that actually changed), refreshes the page
data, and exits edit mode. Cancel with unsaved changes confirms first.
Also tightens audit metadata: previously `Object.keys(dto)` on the
ValidationPipe-instantiated DTO listed every @IsOptional() field, even
when the request body didn't touch them. The partner.updated audit
event now records only the keys the operator actually sent.
Two related fixes that together close the "no recovery flow" gap behind
the invite-operator feature.
1. SeedService now provisions an Authentik recovery flow on every boot.
Without this, /core/users/{pk}/recovery/ returns 400 "No recovery flow
set." and our invite endpoint silently falls back to setting a plaintext
temp password — operationally fine in dev but not appropriate for prod.
ensureRecoveryFlow() (in seed.service.ts):
- Check if a flow with designation='recovery' already exists → no-op
- Otherwise create one with slug='default-dezky-recovery'
(designation='recovery', authentication='none' so the link token
is the only auth needed)
- Bind three default Authentik stages to it in order:
10: default-authentication-identification (auto-skipped when the
recovery token already pins a user; lets the flow also work
for self-service "forgot password" entry)
20: default-password-change-prompt
30: default-password-change-write
- PATCH the default brand's flow_recovery to point at the new flow
- Wrapped in .catch(warn) so an Authentik blip during boot doesn't
crash platform-api — next restart retries.
AuthentikClient additions:
- findRecoveryFlow(), getDefaultBrand(), findStageByName(),
createFlow(), bindStageToFlow(), setBrandRecoveryFlow().
IntegrationsModule pulled into SeedModule so SeedService can use
AuthentikClient.
2. Temp-password fallback path now marks the password expired so
Authentik forces a change on next login. Closes the window where an
operator's plaintext share could outlive the new user's first session.
AuthentikClient.markPasswordExpired(userPk):
- GET user → merge attributes.passwordExpired=true +
passwordExpiredAt=now → PATCH back
- Read-modify-write because Authentik PATCH replaces nested objects
and we don't want to clobber other attributes
UsersService.inviteOperator() calls it on the fallback branch only —
the recovery-link path doesn't need it (clicking the link sets a
fresh password through the flow anyway).
Verified end-to-end:
- Boot → recovery flow auto-provisioned with three correctly-ordered
stage bindings, default brand patched to flow_recovery=<new pk>.
- Re-invite test user → modal now shows a single recovery link
starting with https://auth.dezky.local/if/flow/default-dezky-
recovery/?flow_token=... (no temp password fallback).
- Operator-team list still updates to include the new user
immediately via the pre-created local User doc.
Known follow-ups:
- Enforce MFA enrollment in the recovery flow (add an authenticator
stage). Deferred — locks users out if they lose the second factor
on day one. Better to fire MFA from a separate "MFA required" stage
on subsequent logins for platform admins.
- Outbound SMTP (Phase 5/6) so Authentik emails the recovery link
directly and the modal hides it.
New "Invite operator" button + modal on /operator-team. Replaces the
bounce-to-Authentik flow with an inline invite that creates the user via
the Authentik API and pre-populates our local User doc so they appear
immediately.
services/platform-api/src/integrations/authentik.client.ts:
- findUserByEmail(): early-conflict check before we attempt the create
- createUser(): POST /core/users/ with username = email, internal type,
is_active, attached to the supplied group PKs
- addUserToGroup(): kept for tenant-member invites later
- recoveryLink(): tries POST /core/users/{pk}/recovery/, returns
undefined when no recovery flow is configured on the Authentik brand
(we soft-fail and the service falls back to setInitialPassword)
- setInitialPassword(): POST /core/users/{pk}/set_password/. Returns 204
No Content so we bypass request<T>'s JSON parser and call fetch
directly with explicit ok check.
services/platform-api/src/users/users.service.ts:
- inviteOperator(dto, actor) orchestrates: dedup by email →
findOrCreate Authentik group → create user in group → pre-create
local User doc with platformAdmin=true so the list reflects them
immediately → try recovery link → fall back to temp password →
record platform.user_invited audit event with handoff method.
- Return type is { subject, userId, link? | tempPassword? } —
exactly one credential mode set depending on Authentik config.
- generateTempPassword(): 16-char with at least one upper/lower/digit/
symbol, shuffled. Confusable chars (I/O/0/1/l) omitted.
- Cached platform-admin group ID after first lookup.
services/platform-api/src/users/users.controller.ts:
- POST /users/invite behind OperatorGuard. Calls the service with
actor + IP from the JWT/request.
apps/operator:
- server/api/users/invite.post.ts: standard platformApi proxy.
- components/InviteOperatorModal.vue: 2-step form. Step 1: name +
email with client-side validation. Step 2: shows whichever
credential the backend returned — recovery link OR username+
temp-password — with copy-to-clipboard buttons and a note about
SMTP/recovery-flow follow-up paths.
- pages/operator-team.vue: "Invite operator" replaces "Manage in
Authentik" as the primary action; Authentik link demoted to
secondary. Refreshes the list on @invited so the new user shows
up without a manual reload.
Verified end-to-end against real Authentik:
- Invite created user pk=7, uid=f22f2bb…, group=dezky-platform-admins,
is_active=true, temp password set. Modal showed both fields with
copy buttons; operator-team count went 1 → 2 immediately. Audit
event recorded (platform.user_invited with handoff='temp-password').
- Recovery link path is preferred but Authentik has no recovery flow
configured on the default brand. AuthentikClient.recoveryLink()
soft-fails on the "No recovery flow set." 400, returns undefined,
and inviteOperator transparently falls back to set_password. Once
a recovery flow is configured (Authentik admin → Flows), the link
path becomes active and the temp-password path stops firing
without any code changes.
Known follow-ups:
- Configure Authentik recovery flow so the link path activates
(one-time admin task, not in code)
- Outbound SMTP wiring (Phase 5/6) → Authentik can email link/temp
directly; modal stops showing the credential
- Deactivate / remove operator from inside the app (currently still
Authentik UI; defensible until proven needed)
- Tenant-member invite — similar flow but adds to tenant group
instead, exposed from /users (global users) or tenant detail
Final piece of the audit work. Events older than the hot retention window
move to S3-compatible object storage with signed manifests. Production uses
Hetzner Object Storage; dev uses a MinIO container with the same API.
Infra (infrastructure/docker-compose):
- New `minio` service exposing the S3 API at minio:9000 + admin console at
minio.dezky.local. Healthchecked. Bucket-init sidecar runs `mc mb` once
to create `dezky-audit`; safe to re-run.
- .env adds MINIO_ROOT_USER + MINIO_ROOT_PASSWORD.
- platform-api env: AUDIT_COLD_{ENDPOINT,REGION,BUCKET,ACCESS_KEY,SECRET_KEY}
+ AUDIT_HOT_RETENTION_DAYS=90 + ARCHIVE_ENABLED=false (dormant in dev;
operator UI's "Run archive now" bypasses this gate). AUDIT_COLD_SSE
opts into SSE-S3 — left unset in dev because MinIO without a KMS rejects
AES256 PUTs with "KMS is not configured".
Platform-api (services/platform-api/src/cold/):
- cold-storage.client.ts: thin @aws-sdk/client-s3 wrapper — put/head/list.
forcePathStyle=true so MinIO and Hetzner both work; same code, env-swap.
- archive.service.ts: runOnce() selects chained events with at < cutoff →
serializes to JSONL → gzip → sha256s → uploads JSONL + signed manifest
→ HEAD-confirms both objects exist → records an ArchiveBatch doc → only
then deletes from hot Mongo. Crash-safe: a failed upload leaves events
in hot. Manifest uses the Phase 3 AUDIT_SIGNING_KEY (HMAC-SHA-256), so
archives + checkpoints share trust chain. Bypassable via { override:
true } for the operator's UI force-run.
- archive.worker.ts: hourly tick guarded by configured run-hour-UTC
(default 03:00) + day-guard so the same UTC day doesn't archive twice.
Disabled until ARCHIVE_ENABLED=true.
- archive-batch.schema.ts: { archivedAt, startSeq, endSeq, eventCount,
manifestSha256, jsonlKey, manifestKey, bytesUncompressed }. The
manifest sha256 stored in Mongo lets us detect manifest tampering
without downloading the actual manifest.
Audit module additions:
- audit.controller.ts: GET /audit/archives, POST /audit/archive/run,
/audit/verify now reports { oldestHotSeq, highestArchivedSeq } so the
UI shows the tier boundary.
Operator UI (apps/operator):
- 2 new proxies: /api/audit/archives + /api/audit/archive/run (force
override=true). Both behind operator auth via the existing platformApi
helper.
- audit.vue: new "Cold storage" card with batch table (archived-at, seq
range, event count, size, truncated manifest sha256), "Run archive
now" button + per-run result line.
Smoke-tested end-to-end:
- 7 chained events in hot. /api/audit/archive/run → ok=true, batchId
returned. JSONL + manifest both exist in MinIO (verified via mc ls +
mc cat). Mongo's chained set went 7 → 0. Verify reports
highestArchivedSeq=1446 (since we burn-allocate seqs on Authentik
dup-key rejections). Operator /audit panel shows the batch with
manifest hash 1d8263…
- First attempt with SSE-S3 enabled failed cleanly (MinIO KMS not
configured) — archive service correctly left events in hot Mongo.
Made SSE opt-in via AUDIT_COLD_SSE=true; prod turns it on.
Out of scope (each could be its own session):
- Restore-to-hot endpoint (today: download from S3 + offline query)
- Client-side encryption (today: SSE-S3 in prod, none in dev)
- Multi-region replication
- Soft TTL safety net (defense-in-depth on top of app-managed deletion)
This completes the four-phase audit log work:
1. platform-api as audit hub
2. External system ingest (Authentik / Stalwart / OCIS)
3. Hash-chain + signed checkpoints (tamper evidence)
4. Cold-storage archival (retention without unbounded Mongo growth)
The audit log now carries cryptographic chain-of-custody. Every chained
event references the previous event's sha256, and periodic checkpoints
sign the head with HMAC-SHA-256. An attacker who modifies a historical
row must also forge every checkpoint signature past it — which requires
the AUDIT_SIGNING_KEY, kept outside Mongo.
Schema (services/platform-api/src/schemas/):
- audit-event.schema.ts: new `seq` (monotonic) + `chained` (Phase-3-or-
later flag) + `prevHash` + `hash`. Compound unique index on seq with
partial filter so pre-Phase-3 rows don't collide on null.
- audit-counter.schema.ts: single doc `_id='audit_seq'`, incremented
atomically by findOneAndUpdate($inc).
- audit-checkpoint.schema.ts: { at, headSeq, headHash, signature,
sigAlg, reason }. Reason ∈ {startup, interval, threshold, manual}.
Audit module (services/platform-api/src/audit/):
- canonical.ts: stable JSON form + hashCanonical (sha256) +
checkpointSignature (HMAC-SHA-256) + verifyCheckpointSignature
(timingSafeEqual). Single source of truth for hash inputs — schema
additions land here at the same time as the field.
- audit.service.ts: record() now allocates seq → looks up lastHash() →
computes hash → inserts. Per-process write mutex serializes the
allocate+lookup so concurrent writers don't both chain off the same
predecessor. Documented multi-instance caveat (needs Mongo replica
set + transactions OR a distributed lock).
- checkpoint.service.ts: scheduler triggers on startup + every 5min
+ threshold of 100 events accumulated. Skips when no new chained
events since the last anchor.
- verifier.service.ts: walks chain in seq order, recomputes each
hash, validates checkpoint signatures. Returns a precise break:
'event-hash-mismatch' (in-place modification), 'event-prev-hash-
mismatch' (insertion/deletion), or 'checkpoint-signature-mismatch'.
- audit.controller.ts: GET /audit/verify, GET /audit/checkpoint/latest,
POST /audit/checkpoint (manual force).
Operator UI (apps/operator/):
- 3 new proxies under /api/audit/{verify, checkpoint/latest, checkpoint}.
- pages/audit.vue: new "Tamper evidence" card with "Force checkpoint"
+ "Verify chain" buttons. Header shows live head seq; result line
shows verified count or a precise break (kind + seq + expected vs
actual hash). Background tinted green/red on ok/broken.
Env (.env + docker-compose.yml):
- new AUDIT_SIGNING_KEY (32-byte hex HMAC secret). Prod swaps this for
ed25519 from an HSM/KMS; verifier code stays the same because sigAlg
is on the checkpoint doc.
Smoke-tested all three break paths against a clean chain of 5 events:
- normal verify: ok=true, 5/5 events verified, 1 checkpoint signed
- modified seq=3 in Mongo directly: verify returns ok=false with
break = { kind: 'event-hash-mismatch', seq: 3, expected, actual }
- restored, nuked checkpoint signature: break = { kind:
'checkpoint-signature-mismatch', headSeq: 5 }
- operator UI's verify panel reflects all three states correctly.
Legacy data: pre-Phase-3 events stay `chained: false` and are excluded
from the chain walk. Retroactive chaining of historical entries is a
one-off migration script we can run if we ever care to.
Out of scope (Phase 4 etc.):
- TTL + cold-storage archival to Hetzner Object Storage
- GDPR right-to-erasure tooling
- ed25519 / HSM signing (swap is well-defined; sigAlg field is ready)
- Multi-instance write coordination (Mongo transaction OR distributed
lock when we scale platform-api beyond 1 replica)
Tails OCIS's JSON-Lines audit log on a shared Docker volume and forwards
mutations into AuditService. Final piece of Phase 2 — the /audit page now
unifies platform-api, authentik, and ocis events on one timeline.
services/platform-api/src/ingest/ocis.ingest.ts:
- 5s polling loop (fs.watch is unreliable across Docker bind mounts on
macOS). Stat → detect inode change or truncation → resume from byte
position OR start over.
- Cursor in IngestCursor stores lastEventId = "<inode>:<bytePosition>".
Restarts resume cleanly; on overlap the (source, externalId) unique
index dedups silently.
- Lines collected first, then processed sequentially after the read
stream closes. Earlier draft fired recordOne() from inside the
readline 'line' callback which would have resolved the stream
before all writes finished — same class of race we hit in the
Authentik worker, fixed before commit.
- Tenant inference: spaceName (set during provisioning to the slug)
first, then User.authentikSubjectId → tenantIds → Tenant.slug.
- Mutations only: OCIS_ALLOWLIST in action-map.ts whitelists 24 event
types (User/Group/Space/Share/Link/File mutations). FileDownloaded,
UserSignedIn, and the rest of the high-volume read traffic gets
skipped — keeps the timeline scannable.
services/platform-api/src/ingest/action-map.ts:
- mapOcisAction() + OCIS_ALLOWLIST. Returns null for non-whitelisted
types so the worker filters early.
infrastructure/docker-compose/docker-compose.yml:
- New named volume `ocis_audit_log` mounted writeable on the ocis
container and read-only on platform-api.
- OCIS env: OCIS_ADD_RUN_SERVICES=audit (the audit microservice is
NOT in the default `ocis server` set — opt in explicitly),
AUDIT_LOG_FILE_PATH=/var/log/ocis/audit.log, AUDIT_LOG_FORMAT=json.
- platform-api env: OCIS_AUDIT_LOG_PATH points at the same file.
Verified end-to-end with synthetic events written to the audit log:
- Worker tailed 5 events across initial read + incremental append
(5 → bytes 0:1276, then 1 → bytes 1276:1519).
- FileDownloaded correctly filtered by the allowlist (4 mutations
landed in Mongo, not 5).
- Tenant inference: events with executingUser.id resolved to
`dezky` via User → tenantIds → Tenant.slug.
- Operator /audit shows all three sources (89 events: 79 authentik
+ 5 platform-api + 5 ocis) in one unified timeline.
Known unknown — same shape as the Stalwart commit: I couldn't fully
confirm the OCIS v7 audit microservice emits events with just
OCIS_ADD_RUN_SERVICES=audit + the AUDIT_LOG_FILE_PATH env. The audit
service starts but the file stays empty until OCIS internals start
publishing events to NATS (which may need additional service-side
config). The ingest worker is correct regardless — when OCIS starts
writing real events, they'll flow into /audit. This is a follow-up
in the OCIS-side configuration, not in our ingest code.