Commit Graph

22 Commits

Author SHA1 Message Date
Ronni Baslund 3b9b06a99b docs(runbook): app tier + push-to-deploy CI/CD flow
ci / typecheck (map[dir:apps/booking name:booking]) (push) Successful in 20s
ci / typecheck (map[dir:apps/operator name:operator]) (push) Successful in 23s
ci / typecheck (map[dir:apps/website name:website]) (push) Successful in 20s
ci / typecheck (map[dir:apps/portal name:portal]) (push) Successful in 26s
ci / typecheck (map[dir:services/platform-api name:platform-api]) (push) Successful in 22s
ci / test (push) Successful in 32s
ci / build (map[dir:apps/booking name:booking]) (push) Successful in 9s
ci / build (map[dir:apps/operator name:operator]) (push) Successful in 9s
ci / build (map[dir:apps/portal name:portal]) (push) Successful in 6s
ci / build (map[dir:services/platform-api name:platform-api]) (push) Successful in 5s
ci / deploy (push) Successful in 41s
Bring the runbook up to the 2026-06-10 state: app tier + CI/CD in current
state, a Deploy flow section (push to main = release, rollback, break-glass,
required Gitea secrets), reproduce steps 8-9 (app tier secrets+apply, CI
runner + ci-deployer with the runner gotchas), per-router ACME-safe redirect
instead of the old global one, platform-api key read-back for Bitwarden, and
a pruned TODO list.
2026-06-10 12:19:47 +02:00
Ronni Baslund 9a58e486e3 docs(fleet): note verified push-to-deploy pipeline
ci / typecheck (map[dir:apps/booking name:booking]) (push) Successful in 21s
ci / typecheck (map[dir:apps/operator name:operator]) (push) Successful in 23s
ci / typecheck (map[dir:apps/website name:website]) (push) Successful in 21s
ci / typecheck (map[dir:apps/portal name:portal]) (push) Successful in 26s
ci / typecheck (map[dir:services/platform-api name:platform-api]) (push) Successful in 23s
ci / test (push) Successful in 31s
ci / build (map[dir:apps/operator name:operator]) (push) Successful in 9s
ci / build (map[dir:apps/booking name:booking]) (push) Successful in 9s
ci / build (map[dir:apps/portal name:portal]) (push) Successful in 6s
ci / build (map[dir:services/platform-api name:platform-api]) (push) Successful in 5s
ci / deploy (push) Successful in 41s
2026-06-10 09:20:18 +02:00
Ronni Baslund 323c46fba1 fix(ci): share dind's unix socket with the runner (jobs need a mountable docker host)
ci / typecheck (map[dir:apps/booking name:booking]) (push) Successful in 42s
ci / typecheck (map[dir:apps/operator name:operator]) (push) Successful in 45s
ci / typecheck (map[dir:apps/website name:website]) (push) Successful in 21s
ci / typecheck (map[dir:apps/portal name:portal]) (push) Successful in 26s
ci / typecheck (map[dir:services/platform-api name:platform-api]) (push) Successful in 20s
ci / test (push) Successful in 32s
ci / build (map[dir:apps/booking name:booking]) (push) Successful in 34s
ci / build (map[dir:apps/operator name:operator]) (push) Successful in 46s
ci / build (map[dir:services/platform-api name:platform-api]) (push) Successful in 35s
ci / build (map[dir:apps/portal name:portal]) (push) Successful in 49s
ci / deploy (push) Successful in 45s
gitea/runner can only bind-mount a UNIX-socket docker host into job
containers — the old tcp://localhost:2376 + TLS daemon address cannot be
mounted, so build jobs still had no docker API. Share dind's
/var/run/docker.sock with the runner via a /var/run emptyDir and drop the
DOCKER_HOST/TLS env; the runner auto-finds the socket and the bind path
resolves inside dind where the socket lives.
2026-06-10 08:51:44 +02:00
Ronni Baslund 1114be6c93 fix(ci): expose the dind docker host to job containers
ci / typecheck (map[dir:apps/booking name:booking]) (push) Successful in 45s
ci / typecheck (map[dir:apps/operator name:operator]) (push) Successful in 50s
ci / build (map[dir:apps/operator name:operator]) (push) Failing after 5s
ci / deploy (push) Has been skipped
ci / typecheck (map[dir:apps/portal name:portal]) (push) Successful in 27s
ci / typecheck (map[dir:apps/website name:website]) (push) Successful in 23s
ci / typecheck (map[dir:services/platform-api name:platform-api]) (push) Successful in 24s
ci / test (push) Successful in 35s
ci / build (map[dir:apps/booking name:booking]) (push) Failing after 7s
ci / build (map[dir:apps/portal name:portal]) (push) Failing after 5s
ci / build (map[dir:services/platform-api name:platform-api]) (push) Failing after 6s
gitea/runner 1.x no longer auto-mounts the docker daemon into job
containers (act_runner 0.2.x did), so 'docker build' in the build jobs
failed with 'cannot connect to /var/run/docker.sock'. container.docker_host
"" restores find-and-mount.
2026-06-10 08:34:54 +02:00
Ronni Baslund ec707643d6 fix(ci): act_runner 0.2.11 -> gitea/runner 1.0.8
ci / typecheck (map[dir:apps/booking name:booking]) (push) Successful in 45s
ci / typecheck (map[dir:apps/operator name:operator]) (push) Successful in 48s
ci / typecheck (map[dir:apps/website name:website]) (push) Successful in 23s
ci / typecheck (map[dir:apps/portal name:portal]) (push) Successful in 28s
ci / typecheck (map[dir:services/platform-api name:platform-api]) (push) Successful in 20s
ci / test (push) Successful in 33s
ci / build (map[dir:apps/booking name:booking]) (push) Failing after 5s
ci / build (map[dir:apps/operator name:operator]) (push) Failing after 6s
ci / build (map[dir:apps/portal name:portal]) (push) Failing after 5s
ci / build (map[dir:services/platform-api name:platform-api]) (push) Failing after 5s
ci / deploy (push) Has been skipped
Gitea 1.26 never marked finished jobs complete with the deprecated
act_runner 0.2.11: the runner ran the job, logged 'Job succeeded' and freed
its slot, but Gitea kept the job 'Running' forever, so dependent jobs
(build -> deploy) were never dispatched. gitea/runner is the successor
project; config, env vars and the .runner registration file are unchanged.
2026-06-10 08:02:40 +02:00
Ronni Baslund c60937c5cb feat(ci): deploy to k3s straight from the pipeline (drop Flux plan)
ci / build (map[dir:apps/booking name:booking]) (push) Has been cancelled
ci / build (map[dir:apps/operator name:operator]) (push) Has been cancelled
ci / build (map[dir:apps/portal name:portal]) (push) Has been cancelled
ci / build (map[dir:services/platform-api name:platform-api]) (push) Has been cancelled
ci / deploy (push) Has been cancelled
ci / typecheck (map[dir:apps/booking name:booking]) (push) Has been cancelled
ci / typecheck (map[dir:apps/operator name:operator]) (push) Has been cancelled
ci / typecheck (map[dir:apps/portal name:portal]) (push) Has been cancelled
ci / typecheck (map[dir:apps/website name:website]) (push) Has been cancelled
ci / typecheck (map[dir:services/platform-api name:platform-api]) (push) Has been cancelled
ci / test (push) Has been cancelled
Push to main = release: after build, a deploy job pins each app image to the
commit SHA (kustomize edit set image), kubectl-applies fleet/apps and waits
for the rollouts. The runner already runs in-cluster, so it reaches the API
server on the in-cluster service IP with a kubeconfig for the new ci-deployer
ServiceAccount (namespace-scoped admin, KUBECONFIG_B64 repo secret).

The drafted Flux sync/image-automation layer is removed — a GitOps controller
plus bot tag-bump commits is more machinery than a single-node cluster needs.
Sortable image tags and $imagepolicy markers go with it.

Also: per-router ACME-safe HTTP->HTTPS redirects for the app ingresses,
platform-api prod config completed (Authentik JWT/JWKS + admin API, Stalwart
via the cni0 gateway IP, OCIS/cold-storage placeholders until those tiers
exist) and the secrets template/README updated to match.
2026-06-10 07:53:55 +02:00
Ronni Baslund 52e0f5e375 feat(operator): production build + k3s deployment
- Dockerfile for the operator app (same pattern as portal/booking).
- Env-driven auth/app base URLs in nuxt.config so one build serves
  dev (.local) and production (.eu).
- Deployment + Service + Ingress on operator.dezky.eu.
- Add operator to the typecheck matrix.
2026-06-10 07:53:55 +02:00
Ronni Baslund d02eb5ec50 fix(authentik): pin chart 2026.5.2, grant_types allowlist, portal redirect URI
- Pin the helm-controller chart version (unset = silent latest upgrades) and
  move the image tag under global.image per the 2026.5 chart layout.
- Authentik 2026.5 enforces a per-provider grant_types allowlist; empty list
  rejected every authorize request. Allow authorization_code + refresh_token
  for portal and operator providers.
- Fix the portal redirect URI to the nuxt-oidc-auth callback path.
- Serve the auth ingress on :80 with a per-router HTTPS redirect so the
  cert-manager HTTP-01 solver keeps working.
2026-06-10 07:53:49 +02:00
Ronni Baslund 4c5fdde787 fix(infra): docker:24-dind + capacity 2 (fix moby cgroup-v2 teardown deadlock that hung 'Complete job') 2026-06-08 22:56:21 +02:00
Ronni Baslund aef0f44915 chore(infra): act_runner capacity 4 + disable cache server
ci / typecheck (map[dir:services/platform-api name:platform-api]) (push) Successful in 31s
ci / typecheck (map[dir:apps/booking name:booking]) (push) Has been cancelled
ci / typecheck (map[dir:apps/portal name:portal]) (push) Has been cancelled
ci / typecheck (map[dir:apps/website name:website]) (push) Has been cancelled
ci / test (push) Has been cancelled
Add an act_runner config.yaml (ConfigMap, CONFIG_FILE env): capacity 4 so the
typecheck matrix + image builds run in parallel instead of one-at-a-time, and
cache.enabled: false (we removed the setup-node cache; the cache server isn't
reachable from the DinD job containers anyway).
2026-06-08 22:46:43 +02:00
Ronni Baslund f331e3c1e6 feat(infra): in-cluster Gitea Actions runner (act_runner + dind)
ci / typecheck (map[dir:apps/booking name:booking]) (push) Has been cancelled
ci / typecheck (map[dir:apps/portal name:portal]) (push) Has been cancelled
ci / typecheck (map[dir:apps/website name:website]) (push) Has been cancelled
ci / typecheck (map[dir:services/platform-api name:platform-api]) (push) Has been cancelled
ci / test (push) Has been cancelled
Self-registering act_runner on node1 with a privileged docker:dind sidecar so
workflow jobs can build + push app images (k3s has containerd only, no Docker
daemon). Labels ubuntu-latest + docker; state persisted on a Longhorn PVC. The
registration token is applied out-of-band as the gitea-runner-token Secret
(not in git). Verified: runner declared successfully, dind API up.
2026-06-08 22:13:38 +02:00
Ronni Baslund a27c238c76 feat(infra): nightly DB-dump CronJobs feeding the Restic backup
ci / typecheck (map[dir:apps/booking name:booking]) (push) Has been cancelled
ci / typecheck (map[dir:apps/portal name:portal]) (push) Has been cancelled
ci / typecheck (map[dir:apps/website name:website]) (push) Has been cancelled
ci / typecheck (map[dir:services/platform-api name:platform-api]) (push) Has been cancelled
ci / test (push) Has been cancelled
pg_dumpall (all Postgres DBs + roles) and mongodump (all Mongo DBs) write
gzipped dumps to the hostPath /opt/dezky-backup/dumps at 02:50/02:52 UTC, which
the host Restic job (03:20) ships to the Storage Box. Each keeps the last 7
local dumps; Restic holds the real off-box retention.

- pods run as root (hostPath dir is root-owned, as is the host Restic reader)
- mongo job uses bash (mongo:7 /bin/sh is dash → no pipefail)
- creds from postgres-secret / mongo-secret via secretKeyRef

Verified: both jobs Complete, dumps present on the host
(postgres-all ~2.2MB w/ Authentik data, mongo archive).
2026-06-08 21:55:14 +02:00
Ronni Baslund 861212831d fix(infra): restic→Storage Box backups working end-to-end
ci / typecheck (map[dir:apps/booking name:booking]) (push) Has been cancelled
ci / typecheck (map[dir:apps/portal name:portal]) (push) Has been cancelled
ci / typecheck (map[dir:apps/website name:website]) (push) Has been cancelled
ci / typecheck (map[dir:services/platform-api name:platform-api]) (push) Has been cancelled
ci / test (push) Has been cancelled
Three fixes found bringing up backups on node1:
- restic.env wrote BACKUP_PATHS/RETENTION unquoted → sourcing ran a path as a
  command ("Is a directory"); now quoted.
- ssh config was written to $BACKUP_HOME/.ssh/config, but restic runs as root
  and its ssh resolves ~ from the passwd db (not $HOME), so it reads
  /root/.ssh/config — write the Storage Box block there. Also
  StrictHostKeyChecking=no + UserKnownHostsFile=/dev/null (safe: restic encrypts
  before upload; fixes flaky Storage Box host-key verification).
- Storage Box SFTP lands in /home, so the repo path needs the /home prefix
  (absolute /dezky hit the root-owned chroot parent → SSH_FX_FAILURE).

Verified: repo initialized, nightly snapshot of mail store + Stalwart config +
etcd snapshots + dumps dir, `restic check` clean, retention applied.
2026-06-08 21:46:49 +02:00
Ronni Baslund 9d075343c5 feat(infra): migrate Stalwart to the v0.16 config model (config.json)
ci / typecheck (map[dir:apps/booking name:booking]) (push) Has been cancelled
ci / typecheck (map[dir:apps/portal name:portal]) (push) Has been cancelled
ci / typecheck (map[dir:apps/website name:website]) (push) Has been cancelled
ci / typecheck (map[dir:services/platform-api name:platform-api]) (push) Has been cancelled
ci / test (push) Has been cancelled
v0.16 dropped TOML config. The host service now boots from a tiny config.json
that describes only the datastore (RocksDB); all other settings live in the DB
(web UI / stalwart-cli / platform-api JMAP).

- add stalwart/config.json (RocksDb datastore at /opt/stalwart/data)
- install.sh: install config.json instead of config.toml
- stalwart-mail.service: --config points at config.json
- README: document the v0.16 model + remaining DB-side config + DNS/PTR

Verified: Stalwart 0.16.8 runs on node1 with default mail listeners + the :8080
management server. config.toml retained as a reference for the DB settings.
2026-06-08 21:02:17 +02:00
Ronni Baslund 149eb0b020 fix(infra): Stalwart installer — repo rename + exact asset; flag 0.16 config break
ci / typecheck (map[dir:apps/booking name:booking]) (push) Has been cancelled
ci / typecheck (map[dir:apps/portal name:portal]) (push) Has been cancelled
ci / typecheck (map[dir:apps/website name:website]) (push) Has been cancelled
ci / typecheck (map[dir:services/platform-api name:platform-api]) (push) Has been cancelled
ci / test (push) Has been cancelled
- install.sh: default repo stalwartlabs/mail-server -> stalwartlabs/stalwart
  (renamed), and select the exact /stalwart-<target>.tar.gz asset excluding the
  foundationdb build (head -n1 could grab the wrong one).
- config.toml: $env{...} -> %{env:...}% (correct Stalwart macro syntax).

KNOWN ISSUE: Stalwart v0.16 removed TOML config (single config.json datastore +
everything else in the DB via CLI/UI), so this config.toml does not load on
0.16.8 ("Failed to parse data store settings"). Needs either a pinned pre-0.16
version or a migration to the v0.16 config model. Binary is installed; the
service is stopped pending that decision.
2026-06-08 20:51:56 +02:00
Ronni Baslund 326b626fc6 feat(infra): full dezky rebrand of Authentik login (logo, favicon, bg, footer)
ci / typecheck (map[dir:services/platform-api name:platform-api]) (push) Has been cancelled
ci / test (push) Has been cancelled
ci / typecheck (map[dir:apps/portal name:portal]) (push) Has been cancelled
ci / typecheck (map[dir:apps/website name:website]) (push) Has been cancelled
ci / typecheck (map[dir:apps/booking name:booking]) (push) Has been cancelled
Brand CSS only reaches the flow shadow DOM via CSS vars (colors), not the
logo/favicon (deeper shadow root) or the "Powered by authentik" footer (light
DOM). So, dev-style: serve real dezky assets + sed the bundle.

- web-assets/: dezky-logo.svg, dezky-favicon.svg, dezky-bg.svg (carbon).
- server-rebrand.py: patches the authentik-server Deployment with an
  initContainer that copies /web/dist to an emptyDir, drops the svgs into
  assets/icons, and seds "Powered by authentik" -> "Powered by Dezky".
- brand.yaml: branding_logo / branding_favicon / branding_default_flow_background
  point at the served svgs; auth-flow title "Welcome to Dezky"; signal-green CSS.

Verified live: login now matches dev (logo, title, carbon bg, green button,
favicon, Powered by Dezky). Durability caveat documented (reverts on helm
upgrade).
2026-06-08 20:36:01 +02:00
Ronni Baslund 99cd86cd3a feat(infra): full dezky branding on Authentik (logo, carbon bg, flow title)
ci / typecheck (map[dir:apps/booking name:booking]) (push) Failing after 7s
ci / test (push) Failing after 7s
ci / typecheck (map[dir:services/platform-api name:platform-api]) (push) Failing after 6s
ci / typecheck (map[dir:apps/portal name:portal]) (push) Has been cancelled
ci / typecheck (map[dir:apps/website name:website]) (push) Has been cancelled
branding_logo / branding_default_flow_background are file-path fields (reject
data URIs), so the dezky logo + carbon background are injected via the brand's
custom CSS (data URIs allowed there): logo replaces the authentik wordmark,
background overrides the forest. Auth-flow title -> "Welcome to Dezky".
Signal-green primary button retained.
2026-06-08 19:54:44 +02:00
Ronni Baslund db1354a151 feat(infra): Authentik blueprints (portal+operator OIDC, dezky brand)
ci / typecheck (map[dir:apps/booking name:booking]) (push) Failing after 6s
ci / typecheck (map[dir:apps/portal name:portal]) (push) Has been cancelled
ci / typecheck (map[dir:apps/website name:website]) (push) Has been cancelled
ci / test (push) Has been cancelled
ci / typecheck (map[dir:services/platform-api name:platform-api]) (push) Has been cancelled
Mirror the dev Authentik config in prod via blueprints, applied & successful on
node1:
- brand.yaml: dezky branding on the default brand (title + signal-green custom
  CSS) — login page now in dezky colors.
- portal-application.yaml / operator-application.yaml: dezky-portal &
  dezky-operator OIDC apps/providers (prod redirect URLs) + the
  dezky-platform-admins group & operator access policy.

Two 2026.5 gotchas handled + documented in README:
- invalidation_flow is now REQUIRED on OAuth2 providers (added via !Find).
- ConfigMap mounts are symlinks (discovery can't read them) → worker uses an
  initContainer that copies them to an emptyDir as real files. (chart
  worker.volumes didn't apply on this version; patch reverts on helm upgrade —
  noted as a durability TODO.)

Client secrets (PORTAL/OPERATOR_OIDC_CLIENT_SECRET) live in authentik-secret;
the apps must reuse them.
2026-06-08 19:46:48 +02:00
Ronni Baslund 406e2ca78b feat(infra): deploy Authentik (auth.dezky.eu) + global HTTP→HTTPS redirect
ci / typecheck (map[dir:apps/booking name:booking]) (push) Has been cancelled
ci / typecheck (map[dir:apps/website name:website]) (push) Has been cancelled
ci / typecheck (map[dir:apps/portal name:portal]) (push) Has been cancelled
ci / typecheck (map[dir:services/platform-api name:platform-api]) (push) Has been cancelled
ci / test (push) Has been cancelled
- Authentik on the in-cluster Postgres/Redis (mirrors the dev compose config:
  external DB/Redis, error-reporting off, update-check off, bootstrap admin),
  via the k3s Helm controller; Ingress + cert-manager letsencrypt-prod. Live at
  https://auth.dezky.eu (image 2026.5.2). Secrets generated on-box (Bitwarden).
- Traefik HelmChartConfig: global :80 -> :443 (308) redirect via
  additionalArguments (to=:443, HTTP-01-safe).
- RUNBOOK updated.

Deferred (mirror remaining dev bits): OIDC app blueprints (portal/operator with
prod URLs) + the cosmetic "Powered by Dezky" rebrand.
2026-06-08 19:00:07 +02:00
Ronni Baslund 153d7053ca feat(infra): k3s foundation — cert-manager, Longhorn config, in-cluster data tier
ci / typecheck (map[dir:apps/website name:website]) (push) Failing after 10m58s
ci / typecheck (map[dir:apps/portal name:portal]) (push) Failing after 11m56s
ci / typecheck (map[dir:apps/booking name:booking]) (push) Failing after 14m0s
ci / typecheck (map[dir:services/platform-api name:platform-api]) (push) Has been cancelled
ci / test (push) Has been cancelled
Adds the production cluster foundation (authored + applied live on node1):
- cert-manager via the k3s HelmChart controller + letsencrypt staging/prod
  ClusterIssuers (HTTP-01 / Traefik).
- Longhorn config for single-node (values: replica=1, default StorageClass,
  Retain) + backup-to-Hetzner-Object-Storage credential template.
- In-cluster data tier (dezky-data): Postgres 16 (with Authentik+OCIS DB init),
  MongoDB 7, Redis 7 as StatefulSets on Longhorn, + secret template.
- bootstrap.sh: install open-iscsi/nfs-common + enable iscsid (Longhorn prereq).
- RUNBOOK.md: full reproducible node1 build order.

Real secrets are generated on-box and kept in Bitwarden — never in git.
2026-06-08 18:39:31 +02:00
Ronni Baslund 35bc7b6c31 chore(infra): production manifests + CI for scheduling apps
ci / typecheck (map[dir:apps/booking name:booking]) (push) Has been cancelled
ci / typecheck (map[dir:apps/portal name:portal]) (push) Has been cancelled
ci / typecheck (map[dir:apps/website name:website]) (push) Has been cancelled
ci / typecheck (map[dir:services/platform-api name:platform-api]) (push) Has been cancelled
ci / test (push) Has been cancelled
2026-06-07 09:27:44 +02:00
Ronni Baslund 3831c85285 feat(infra): production host bootstrap and bare-metal Stalwart scaffolding
Host provisioning for the single-server production target: SSH + firewall
hardening (nftables allowlist), k3s node registration, bare-metal Stalwart
install with systemd units and TLS cert-sync from the cluster secret, and
Restic encrypted backup/restore (primary + DR) with timer units. Host-specific
secrets live in config.env (gitignored); config.env.example is the template.
Also gitignores MemPalace per-project files.
2026-06-07 00:19:48 +02:00