The operator infrastructure page probed docker-compose hostnames
(stalwart/postgres/redis/traefik…) which don't resolve in k3s, so 7 of 9
services showed as down. Probe targets now come from HEALTH_* env vars with
the compose names as dev defaults; platform-api-config.yaml sets the
in-cluster/host addresses. 'disabled' omits a service from the report —
used for OCIS/Collabora until the files tier is deployed.
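A minimal sketch of the lookup described above, assuming one HEALTH_<SERVICE> variable per service. The service list, ports and variable suffixes are illustrative guesses; only the HEALTH_* prefix, the compose-name fallback and the 'disabled' value come from this change.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical dev defaults: docker-compose service names with assumed ports.
const DEV_DEFAULTS: Record<string, string> = {
  stalwart: "stalwart:8080",
  postgres: "postgres:5432",
  redis: "redis:6379",
  traefik: "traefik:80",
  ocis: "ocis:9200",
};

interface ProbeTarget {
  service: string;
  address: string;
}

function resolveProbeTargets(env: Env): ProbeTarget[] {
  const targets: ProbeTarget[] = [];
  for (const [service, fallback] of Object.entries(DEV_DEFAULTS)) {
    const raw = env[`HEALTH_${service.toUpperCase()}`]?.trim();
    // 'disabled' drops the service from the report instead of showing it down.
    if (raw === "disabled") continue;
    // Unset or empty falls back to the compose hostname (dev).
    targets.push({ service, address: raw || fallback });
  }
  return targets;
}
```

With HEALTH_OCIS=disabled the OCIS row disappears from the report, while any service without an override keeps probing its compose name.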
Operator sign-out hardcoded the dev Authentik end-session URL, so prod
logout landed on auth.dezky.local. Mirror the portal's env-driven pattern
(NUXT_PUBLIC_AUTH_URL/NUXT_PUBLIC_OPERATOR_URL with .local fallbacks).
Expose authUrl/operatorUrl via public runtimeConfig and use them for the
Authentik admin links and the cosmetic host labels (sidebar, eyebrows,
auth-page hints). Portal: signed-out + webmail copy now derive their hosts
from runtime config (new public.mailUrl, NUXT_PUBLIC_MAIL_URL in prod).
nuxt-oidc-auth registers its own 'oidc' storage mount at build, so
storage.mount('oidc', …) at runtime threw 'already mounted at oidc:' and
crash-looped the new pods. Unmount the memory mount first.
nuxt-oidc-auth persists sessions via useStorage('oidc'), whose default
mount is per-pod memory — broken at >1 replica (random 401s) and every
deploy logged all users out. A nitro plugin now mounts 'oidc' on the
dezky-data Redis (db 1, app-prefixed keys, 14d TTL) when SESSION_REDIS_URL
is set; dev keeps the memory driver with no Redis required. Replicas back
to 2 for both apps.
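The driver choice can be sketched as a pure function, shown here without the nitro plugin wiring. The option names and the key-prefix format are assumptions for illustration; SESSION_REDIS_URL, db 1 and the 14-day TTL are taken from the description above.

```typescript
type Env = Record<string, string | undefined>;

type SessionStorage =
  | { driver: "memory" }
  | { driver: "redis"; url: string; db: number; base: string; ttlSeconds: number };

const FOURTEEN_DAYS_SECONDS = 14 * 24 * 60 * 60;

// `app` is a hypothetical per-app key prefix (e.g. "portal" or "operator").
function sessionStorageFor(app: string, env: Env): SessionStorage {
  const url = env["SESSION_REDIS_URL"]?.trim();
  // Dev: no Redis configured, keep the per-process memory driver.
  if (!url) return { driver: "memory" };
  return {
    driver: "redis",
    url,
    db: 1,
    base: `${app}:oidc`, // assumed prefix shape
    ttlSeconds: FOURTEEN_DAYS_SECONDS,
  };
}
```

Keeping the decision in one function makes the dev and prod paths easy to compare: the same build uses memory storage locally and shared Redis once the variable is set.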
nuxt-oidc-auth stores sessions in per-pod memory. With 2 replicas, any
request balanced to the pod that didn't handle the login got a 401; in practice
roughly half of all operator API calls failed after sign-in. One replica
until sessions move to shared storage (nitro storage on the dezky-data
Redis), then scale back up. Already scaled live; this pins the manifests so
the next deploy doesn't undo it.
The operator could list and inspect tenants but had no create flow — tenant
creation only existed as the partner-portal wizard, which always attaches a
partnerId. Platform-api's POST /tenants (platform-admin only, no partner
field) was already built for this; add the missing UI: a New tenant modal on
the tenants page (slug, name, plan/cycle/currency/seats, optional primary
mail domain + first-admin invite) and the server proxy route. Operator-created
tenants are direct customers; attach a partner later if needed.
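As an illustration of what the modal could send to the proxy route, here is a sketch of the request body builder. All field names are hypothetical (the actual DTO is not shown in this change); the point is that optional fields are dropped when blank and no partner field is ever sent.

```typescript
// Hypothetical form shape; the field list mirrors the modal described above.
interface NewTenantForm {
  slug: string;
  name: string;
  plan: string;
  cycle: string;
  currency: string;
  seats: number;
  primaryDomain?: string;
  firstAdminEmail?: string;
}

function buildCreateTenantBody(form: NewTenantForm): Record<string, unknown> {
  const body: Record<string, unknown> = {
    slug: form.slug.trim(),
    name: form.name.trim(),
    plan: form.plan,
    cycle: form.cycle,
    currency: form.currency,
    seats: form.seats,
  };
  // Optional fields are only included when the operator filled them in.
  const domain = form.primaryDomain?.trim();
  if (domain) body["primaryDomain"] = domain;
  const invite = form.firstAdminEmail?.trim();
  if (invite) body["firstAdminEmail"] = invite;
  // Deliberately no partnerId: operator-created tenants are direct customers.
  return body;
}
```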
CI builds the Nuxt images with no env, so nuxt.config bakes empty OIDC
client creds and .local Authentik URLs into runtimeConfig — sign-in
dead-ended on the app's own /auth/login. Nitro env overrides only apply
when the var name matches the runtimeConfig path
(oidc.providers.oidc.* -> NUXT_OIDC_PROVIDERS_OIDC_*), so production
secrets need that second set of names; the plain NUXT_OIDC_* ones only
work in dev. Also pin NUXT_OIDC_TOKEN_KEY/AUTH_SESSION_SECRET so sessions
survive pod restarts. Live secrets patched on the cluster accordingly.
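The name mapping can be checked with a small helper. This is a sketch that mirrors the convention described above (NUXT_ prefix, path segments joined with underscores, camelCase split and upper-cased), not Nuxt's own implementation; the clientId key is an assumed example.

```typescript
// Derive the env var name that overrides a given runtimeConfig path.
function runtimeConfigEnvName(path: string): string {
  const segments = path.split(".").map((segment) =>
    // camelCase -> CAMEL_CASE
    segment.replace(/([a-z0-9])([A-Z])/g, "$1_$2").toUpperCase(),
  );
  return ["NUXT", ...segments].join("_");
}

console.log(runtimeConfigEnvName("oidc.providers.oidc.clientId"));
// NUXT_OIDC_PROVIDERS_OIDC_CLIENT_ID
console.log(runtimeConfigEnvName("public.authUrl"));
// NUXT_PUBLIC_AUTH_URL
```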
Bring the runbook up to the 2026-06-10 state: app tier + CI/CD added to current
state, a Deploy flow section (push to main = release, rollback, break-glass,
required Gitea secrets), reproduce steps 8-9 (app tier secrets+apply, CI
runner + ci-deployer with the runner gotchas), per-router ACME-safe redirect
instead of the old global one, platform-api key read-back for Bitwarden, and
a pruned TODO list.
gitea/runner can only bind-mount a UNIX-socket docker host into job
containers — the old tcp://localhost:2376 + TLS daemon address cannot be
mounted, so build jobs still had no docker API. Share dind's
/var/run/docker.sock with the runner via a /var/run emptyDir and drop the
DOCKER_HOST/TLS env; the runner auto-finds the socket and the bind path
resolves inside dind where the socket lives.
gitea/runner 1.x no longer auto-mounts the docker daemon into job
containers (act_runner 0.2.x did), so 'docker build' in the build jobs
failed with 'cannot connect to /var/run/docker.sock'. Setting
container.docker_host to "" restores find-and-mount.
The per-job GITHUB_TOKEN is no longer accepted by the container registry's
/v2/ basic-auth endpoint since the act_runner -> gitea/runner switch (login
fails 'unauthorized' before push). Use a personal access token with package
read+write scope, provided as the REGISTRY_TOKEN repo secret.
Gitea 1.26 never marked finished jobs complete with the deprecated
act_runner 0.2.11: the runner ran the job, logged 'Job succeeded' and freed
its slot, but Gitea kept the job 'Running' forever, so dependent jobs
(build -> deploy) were never dispatched. gitea/runner is the successor
project; config, env vars and the .runner registration file are unchanged.
Push to main = release: after build, a deploy job pins each app image to the
commit SHA (kustomize edit set image), kubectl-applies fleet/apps and waits
for the rollouts. The runner already runs in-cluster, so it reaches the API
server on the in-cluster service IP with a kubeconfig for the new ci-deployer
ServiceAccount (namespace-scoped admin, KUBECONFIG_B64 repo secret).
The drafted Flux sync/image-automation layer is removed — a GitOps controller
plus bot tag-bump commits is more machinery than a single-node cluster needs.
Sortable image tags and $imagepolicy markers go with it.
Also: per-router ACME-safe HTTP->HTTPS redirects for the app ingresses,
platform-api prod config completed (Authentik JWT/JWKS + admin API, Stalwart
via the cni0 gateway IP, OCIS/cold-storage placeholders until those tiers
exist) and the secrets template/README updated to match.
- Dockerfile for the operator app (same pattern as portal/booking).
- Env-driven auth/app base URLs in nuxt.config so one build serves
dev (.local) and production (.eu).
- Deployment + Service + Ingress on operator.dezky.eu.
- Add operator to the typecheck matrix.
- Pin the helm-controller chart version (unset = silent latest upgrades) and
move the image tag under global.image per the 2026.5 chart layout.
- Authentik 2026.5 enforces a per-provider grant_types allowlist; an empty list
rejected every authorize request. Allow authorization_code + refresh_token
for portal and operator providers.
- Fix the portal redirect URI to the nuxt-oidc-auth callback path.
- Serve the auth ingress on :80 with a per-router HTTPS redirect so the
cert-manager HTTP-01 solver keeps working.
After typecheck + test pass on main, build portal/booking/platform-api images
(matrix) via the dind sidecar and push to git.lastcloud.io tagged latest + SHA.
Auth uses the runner's job token against the same Gitea instance.
actions/setup-node writes node into a tool-cache shared across concurrent jobs;
with capacity>1 one job execs node while another writes it → "/usr/bin/env:
'node': Text file busy". The catthehacker runner image already ships node 24,
and corepack (bundled) reads each app's packageManager — so setup-node is
unneeded. Removing it eliminates the shared-cache race.
Add an act_runner config.yaml (ConfigMap, CONFIG_FILE env): capacity 4 so the
typecheck matrix + image builds run in parallel instead of one-at-a-time, and
cache.enabled: false (we removed the setup-node cache; the cache server isn't
reachable from the DinD job containers anyway).
timeToMin destructured [h, m] from t.split(':').map(Number); under
noUncheckedIndexedAccess those are number|undefined, so `h * 60` errored. Use
default-value destructuring ([h = 0, m = 0]). Surfaced now that the Gitea runner
actually runs the typecheck job (it never ran before).
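For reference, the fixed helper looks roughly like this (a sketch reconstructed from the description; the surrounding module is not shown):

```typescript
// Convert "HH:MM" to minutes since midnight.
// Under noUncheckedIndexedAccess, array elements are `number | undefined`,
// so the destructuring defaults narrow both bindings to `number`.
function timeToMin(t: string): number {
  const [h = 0, m = 0] = t.split(":").map(Number);
  return h * 60 + m;
}
```

Note that the defaults only cover missing parts (for example "9" yields 540); a non-numeric part still produces NaN, since a default does not replace NaN.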
pnpm/action-setup@v4 ran at the repo root (uses: steps ignore
defaults.run.working-directory) where there is no package.json, so it couldn't
read the pnpm version → "No pnpm version specified". Use corepack (bundled with
node) in the install step, which reads each app's own packageManager — matching
the Dockerfiles. Verified in the runner's container: corepack enable + frozen
install succeeds for every app.
pnpm/action-setup ran with no version: `uses:` steps ignore
defaults.run.working-directory, so it executed at the repo root, which has no
package.json (per-app monorepo) → "No pnpm version specified". Pin version: 9
explicitly. Also drop setup-node's `cache: pnpm` — the act_runner cache server
isn't reachable from the DinD job containers, and the install is fast anyway.
The apps were wired for the dev (.local) environment. Drive the base URLs from
env so one build serves dev and prod (.eu):
- portal nuxt.config: OIDC authorization/token/userinfo/discovery URLs +
redirectUri now derive from NUXT_PUBLIC_AUTH_URL / NUXT_PUBLIC_PORTAL_URL
(+ PORTAL_OIDC_APP_SLUG); .local defaults keep dev working with no env.
- portal sign-out handler: end-session + post-logout URLs env-driven.
- portal scheduling page: booking base/host from runtimeConfig.public.bookingUrl
(NUXT_PUBLIC_BOOKING_URL).
- platform-api: tenant mail domain suffix from PLATFORM_TENANT_DOMAIN (dezky.eu
in prod), defaulting to dezky.local.
(booking needs no change — its only .local ref is the dev-server allowedHosts.)
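A sketch of the portal derivation, assuming Authentik's usual /application/o/ endpoint layout and an /auth/oidc/callback redirect path. The default hosts, the default slug and those paths are assumptions for illustration; the three variable names come from the list above.

```typescript
type Env = Record<string, string | undefined>;

const stripTrailingSlash = (url: string): string => url.replace(/\/+$/, "");

function portalOidcUrls(env: Env) {
  // .local defaults keep dev working with no env (hosts assumed here).
  const authUrl = stripTrailingSlash(env["NUXT_PUBLIC_AUTH_URL"] || "https://auth.dezky.local");
  const portalUrl = stripTrailingSlash(env["NUXT_PUBLIC_PORTAL_URL"] || "https://app.dezky.local");
  const slug = env["PORTAL_OIDC_APP_SLUG"] || "dezky-portal";
  return {
    authorizationUrl: `${authUrl}/application/o/authorize/`,
    tokenUrl: `${authUrl}/application/o/token/`,
    userinfoUrl: `${authUrl}/application/o/userinfo/`,
    discoveryUrl: `${authUrl}/application/o/${slug}/.well-known/openid-configuration`,
    redirectUri: `${portalUrl}/auth/oidc/callback`, // assumed callback path
  };
}
```

Using `||` rather than `??` means an empty variable (as in a CI build with no env) also falls back to the dev default instead of producing a relative URL.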
Self-registering act_runner on node1 with a privileged docker:dind sidecar so
workflow jobs can build + push app images (k3s has containerd only, no Docker
daemon). Labels ubuntu-latest + docker; state persisted on a Longhorn PVC. The
registration token is applied out-of-band as the gitea-runner-token Secret
(not in git). Verified: runner declared successfully, dind API up.
pg_dumpall (all Postgres DBs + roles) and mongodump (all Mongo DBs) write
gzipped dumps to the hostPath /opt/dezky-backup/dumps at 02:50/02:52 UTC, which
the host Restic job (03:20) ships to the Storage Box. Each keeps the last 7
local dumps; Restic holds the real off-box retention.
- pods run as root (hostPath dir is root-owned, as is the host Restic reader)
- mongo job uses bash (mongo:7 /bin/sh is dash → no pipefail)
- creds from postgres-secret / mongo-secret via secretKeyRef
Verified: both jobs Complete, dumps present on the host
(postgres-all ~2.2MB w/ Authentik data, mongo archive).
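The keep-last-7 rule is implemented in the job scripts; as an illustration of the selection logic only, here is a sketch that assumes dump file names embed a sortable timestamp (the naming pattern below is hypothetical).

```typescript
// Return the dump files that fall outside the newest `keep` entries.
// Assumes names sort chronologically, e.g. "postgres-all-2026-06-09.sql.gz".
function dumpsToPrune(fileNames: string[], keep = 7): string[] {
  const newestFirst = [...fileNames].sort().reverse();
  return newestFirst.slice(keep);
}
```

Pruning locally is safe in this setup because the off-box history lives in Restic, so the local directory only needs enough dumps to cover a quick restore.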
Three fixes found bringing up backups on node1:
- restic.env wrote BACKUP_PATHS/RETENTION unquoted → sourcing ran a path as a
command ("Is a directory"); now quoted.
- ssh config was written to $BACKUP_HOME/.ssh/config, but restic runs as root
and its ssh resolves ~ from the passwd db (not $HOME), so it reads
/root/.ssh/config — write the Storage Box block there. Also
StrictHostKeyChecking=no + UserKnownHostsFile=/dev/null (safe: restic encrypts
before upload; fixes flaky Storage Box host-key verification).
- Storage Box SFTP lands in /home, so the repo path needs the /home prefix
(absolute /dezky hit the root-owned chroot parent → SSH_FX_FAILURE).
Verified: repo initialized, nightly snapshot of mail store + Stalwart config +
etcd snapshots + dumps dir, `restic check` clean, retention applied.
v0.16 dropped TOML config. The host service now boots from a tiny config.json
that describes only the datastore (RocksDB); all other settings live in the DB
(web UI / stalwart-cli / platform-api JMAP).
- add stalwart/config.json (RocksDb datastore at /opt/stalwart/data)
- install.sh: install config.json instead of config.toml
- stalwart-mail.service: --config points at config.json
- README: document the v0.16 model + remaining DB-side config + DNS/PTR
Verified: Stalwart 0.16.8 runs on node1 with default mail listeners + the :8080
management server. config.toml retained as a reference for the DB settings.
- install.sh: default repo stalwartlabs/mail-server -> stalwartlabs/stalwart
(renamed), and select the exact /stalwart-<target>.tar.gz asset excluding the
foundationdb build (head -n1 could grab the wrong one).
- config.toml: $env{...} -> %{env:...}% (correct Stalwart macro syntax).
KNOWN ISSUE: Stalwart v0.16 removed TOML config (single config.json datastore +
everything else in the DB via CLI/UI), so this config.toml does not load on
0.16.8 ("Failed to parse data store settings"). Needs either a pinned pre-0.16
version or a migration to the v0.16 config model. Binary is installed; the
service is stopped pending that decision.
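The macro rewrite in config.toml is mechanical. A sketch of the substitution, useful if the same conversion has to be applied to other pre-0.16 config files (the setting name in the example is hypothetical):

```typescript
// Rewrite `$env{NAME}` placeholders to the `%{env:NAME}%` macro form.
function migrateEnvMacros(toml: string): string {
  return toml.replace(/\$env\{([A-Za-z_][A-Za-z0-9_]*)\}/g, "%{env:$1}%");
}

console.log(migrateEnvMacros('secret = "$env{ADMIN_SECRET}"'));
// secret = "%{env:ADMIN_SECRET}%"
```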
Brand CSS only reaches the flow shadow DOM via CSS vars (colors), not the
logo/favicon (deeper shadow root) or the "Powered by authentik" footer (light
DOM). So, dev-style: serve real dezky assets + sed the bundle.
- web-assets/: dezky-logo.svg, dezky-favicon.svg, dezky-bg.svg (carbon).
- server-rebrand.py: patches the authentik-server Deployment with an
initContainer that copies /web/dist to an emptyDir, drops the svgs into
assets/icons, and seds "Powered by authentik" -> "Powered by Dezky".
- brand.yaml: branding_logo / branding_favicon / branding_default_flow_background
point at the served svgs; auth-flow title "Welcome to Dezky"; signal-green CSS.
Verified live: login now matches dev (logo, title, carbon bg, green button,
favicon, Powered by Dezky). Durability caveat documented (reverts on helm
upgrade).
branding_logo / branding_default_flow_background are file-path fields (reject
data URIs), so the dezky logo + carbon background are injected via the brand's
custom CSS (data URIs allowed there): logo replaces the authentik wordmark,
background overrides the forest. Auth-flow title -> "Welcome to Dezky".
Signal-green primary button retained.
Mirror the dev Authentik config in prod via blueprints, applied & successful on
node1:
- brand.yaml: dezky branding on the default brand (title + signal-green custom
CSS) — login page now in dezky colors.
- portal-application.yaml / operator-application.yaml: dezky-portal &
dezky-operator OIDC apps/providers (prod redirect URLs) + the
dezky-platform-admins group & operator access policy.
Two 2026.5 gotchas handled + documented in README:
- invalidation_flow is now REQUIRED on OAuth2 providers (added via !Find).
- ConfigMap mounts are symlinks (discovery can't read them) → worker uses an
initContainer that copies them to an emptyDir as real files. (chart
worker.volumes didn't apply on this version; patch reverts on helm upgrade —
noted as a durability TODO.)
Client secrets (PORTAL/OPERATOR_OIDC_CLIENT_SECRET) live in authentik-secret;
the apps must reuse them.
Adds the production cluster foundation (authored + applied live on node1):
- cert-manager via the k3s HelmChart controller + letsencrypt staging/prod
ClusterIssuers (HTTP-01 / Traefik).
- Longhorn config for single-node (values: replica=1, default StorageClass,
Retain) + backup-to-Hetzner-Object-Storage credential template.
- In-cluster data tier (dezky-data): Postgres 16 (with Authentik+OCIS DB init),
MongoDB 7, Redis 7 as StatefulSets on Longhorn, + secret template.
- bootstrap.sh: install open-iscsi/nfs-common + enable iscsid (Longhorn prereq).
- RUNBOOK.md: full reproducible node1 build order.
Real secrets are generated on-box and kept in Bitwarden — never in git.
- Request offline_access for the ocis-web client (WEB_OIDC_SCOPE) so the web
SPA gets a refresh token and renews silently instead of dropping the session
(no surprise logouts; the "no permission to upload" symptom was the
expired-token state). The ocis-provider already has the offline_access scope
mapping; its access-token validity is bumped 5m → 1h (refresh 30d).
- Flatten the remaining brand gradients in index.html: the active sidebar
highlight (.oc-background-primary-gradient) and primary buttons
(.oc-button-primary-filled) are now solid carbon (text stays light/readable).
- Document the offline_access + token-validity provider settings in
AUTHENTIK-SETUP.md (the provider lives in Authentik's DB, not git).
Skin OCIS web in the dezky brand so users don't see ownCloud/Infinite Scale.
- Custom theme.json (WEB_UI_THEME_PATH + WEB_ASSET_THEMES_PATH): dezky name,
slogan, logos (light wordmark for the dark top bar, dark wordmark for the
light login, favicon), and the full dezky palette — carbon chrome, signal
yellow as a sparing accent, paper/bone surfaces, dezky semantic colours
- Pin the light theme as default (single variant) so OS-dark / auto-system
always resolves to it
- Override only index.html via WEB_ASSET_CORE_PATH (OCIS falls back to the
embedded core per-file): hide the ".versions" footer ("Infinite Scale … /
ownCloud Web UI …") and set the pre-hydration <title>/theme-color to dezky
Apache-2.0 lets us drop the ownCloud marks without trademark fees. NOTE:
index.html pins the built bundle hashes — refresh it after an OCIS image bump.
The "Jump to" launcher only navigated for the internal tiles (Personal /
Admin / Partner); every external app (Mail, Drev, Møder, …) just fired a
toast and never opened. Hosts were also hardcoded to *.dezky.com, with Drev
pointing at a vanity drev. subdomain instead of the real OCIS host.
- Open external apps in a new tab at https://<host>.<baseDomain>
- Derive the base domain from the portal's own hostname so links resolve in
every environment (app.dezky.local → dezky.local, app.dezky.com → dezky.com)
- Map Drev → files (OCIS); mail/meet/chat/cal/contacts/docs use their service
subdomain
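The host derivation can be sketched as follows. The app keys and the rule of dropping the first hostname label are assumptions that match the examples above; the actual launcher code may differ.

```typescript
// App key -> service subdomain. Drev is the only one that differs (OCIS).
const SUBDOMAIN_BY_APP: Record<string, string> = {
  drev: "files",
  mail: "mail",
  meet: "meet",
  chat: "chat",
  cal: "cal",
  contacts: "contacts",
  docs: "docs",
};

// app.dezky.local -> dezky.local, app.dezky.com -> dezky.com
function baseDomainFromHost(hostname: string): string {
  const labels = hostname.split(".");
  return labels.length > 2 ? labels.slice(1).join(".") : hostname;
}

function externalAppUrl(app: string, portalHostname: string): string | undefined {
  const subdomain = SUBDOMAIN_BY_APP[app];
  if (!subdomain) return undefined; // internal tiles are routed in-app
  return `https://${subdomain}.${baseDomainFromHost(portalHostname)}`;
}
```

In the browser the result would be opened with something like `window.open(url, "_blank", "noopener")`, passing `window.location.hostname` as the portal hostname.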
Rebuild the /admin/users detail drawer from a read-only profile into an
editable, Office 365-style panel with four sections:
- Username & mail: read-only primary for mailbox users; editable sign-in
(Authentik-only) for mailbox-less identities; "Create mailbox" provisions
a Stalwart inbox for an external-login admin
- Aliases: list/add/remove mailbox aliases (Stalwart), domain-scoped
- Role: member/admin toggle with a primary-account lock (owner, mailbox-less
bootstrap admin, self) and a last-admin guard
- Contact information: display name, first/last name, phone, alternative
email — mirrored best-effort to Authentik attributes + mailbox name
Ownership transfer: "Make owner" (row menu + drawer) plus an owner-side
"Transfer ownership" picker, gated to tenant admins / platform admins so a
departed owner can be replaced; promotes the target and demotes the prior
owner to admin.
Backend (platform-api): contact fields on User; AuthentikClient.updateUser;
StalwartClient.setMailboxName; UsersService updateTenantMember,
changeMemberPrimaryEmail, list/add/removeMemberAlias, createMailboxForMember,
transferOwnership; new DTOs and tenant-member routes. All mutations audited.
Portal: Nuxt proxies for the new endpoints + extended TenantUserDoc.
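The ownership transfer rule can be illustrated with a pure function. Types and names here are hypothetical and the real UsersService.transferOwnership also persists and audits the change; this sketch only shows the gate plus the promote/demote step.

```typescript
type Role = "owner" | "admin" | "member";

interface Member {
  id: string;
  role: Role;
}

interface Actor {
  isTenantAdmin: boolean;
  isPlatformAdmin: boolean;
}

function transferOwnership(members: Member[], targetId: string, actor: Actor): Member[] {
  // Gated to tenant admins / platform admins so a departed owner can be replaced.
  if (!actor.isTenantAdmin && !actor.isPlatformAdmin) {
    throw new Error("forbidden");
  }
  if (!members.some((m) => m.id === targetId)) {
    throw new Error("target is not a tenant member");
  }
  return members.map((m): Member => {
    if (m.id === targetId) return { ...m, role: "owner" }; // promote target
    if (m.role === "owner") return { ...m, role: "admin" }; // demote prior owner
    return m;
  });
}
```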
Surface pending/calendar_failed booking states in the admin bookings list with
proper status badges (failed shows the last calendar error as a tooltip), and
add an operator "Retry now" action. The retry re-drives the same Stalwart
calendar write (confirm + attendee email on success); for a terminal
calendar_failed booking it re-claims the slot lock atomically first and refuses
if the time was taken in the meantime, so a manual retry can never double-book.
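A sketch of the retry guard, using an in-memory Map as a stand-in for the slot lock. All names are hypothetical, and the real lock is assumed to be an atomic operation in shared storage; the check-and-set below is only atomic because it runs in a single process.

```typescript
type Status = "pending" | "calendar_failed" | "confirmed";

interface Booking {
  id: string;
  slot: string;
  status: Status;
  lastError?: string | undefined;
}

// Stand-in for the slot lock: slot -> id of the booking holding it.
const slotLocks = new Map<string, string>();

function claimSlot(slot: string, bookingId: string): boolean {
  const holder = slotLocks.get(slot);
  if (holder !== undefined && holder !== bookingId) return false;
  slotLocks.set(slot, bookingId);
  return true;
}

function retryBooking(booking: Booking, writeCalendar: (b: Booking) => void): Booking {
  if (booking.status === "confirmed") return booking;
  // A terminal failure is assumed to have released its lock, so re-claim it
  // first and refuse if another booking took the time in the meantime.
  const reclaimed = booking.status === "calendar_failed";
  if (reclaimed && !claimSlot(booking.slot, booking.id)) {
    throw new Error("slot was taken in the meantime");
  }
  try {
    writeCalendar(booking); // same calendar write as the original attempt
    return { ...booking, status: "confirmed", lastError: undefined };
  } catch (err) {
    if (reclaimed) slotLocks.delete(booking.slot); // give the slot back
    const message = err instanceof Error ? err.message : String(err);
    return { ...booking, status: "calendar_failed", lastError: message };
  }
}
```

Claiming before writing is what prevents the double-booking: the calendar write only happens once this booking is the sole holder of the slot.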