# Dezky production — node1 build runbook

The actual, reproducible order used to stand up **node1.dezky.eu** (Hetzner AX41, `46.4.78.187`, Ubuntu 24.04). If the box is lost, follow this top to bottom to rebuild it. Per-layer detail lives in `host/README.md`, `fleet/cert-manager/`, `fleet/longhorn/`, `fleet/data/`.

> Secrets are **never** in git. They're generated with `openssl rand -hex 24`
> and stored in **Bitwarden**. See "Secrets" below for how to read the live
> values back out of the cluster.

## Current state (built 2026-06-08, app tier + CI/CD 2026-06-10)

- **Host:** hardened via `host/bootstrap.sh` — `dezky` admin user, **key-only SSH** (no root, no passwords), k3s-safe nftables firewall (SSH/6443 → mgmt IPs `46.32.144.38`/`46.32.144.45`; 80/443+mail → world), fail2ban, unattended-upgrades, `open-iscsi`+`iscsid` (Longhorn prereq). `dezky` has **NOPASSWD sudo** (`/etc/sudoers.d/90-dezky`).
- **k3s** v1.33.11 — single node (control-plane/etcd/worker), registered in Rancher (`91.99.122.153`).
- **Longhorn** — default StorageClass, `numberOfReplicas: 1` (single node).
- **cert-manager** + `letsencrypt-staging` / `letsencrypt-prod` (HTTP-01/Traefik).
- **Data tier** (`dezky-data` ns) — Postgres 16, Mongo 7, Redis 7 as StatefulSets on Longhorn PVCs. Postgres holds the `authentik` + `ocis` DBs.
- **Authentik** (`dezky-auth` ns) — live at https://auth.dezky.eu (LE cert), chart pinned `2026.5.2`, on our Postgres/Redis. Portal + operator OIDC app blueprints applied (`fleet/authentik/blueprints/`).
- **Stalwart** (host, not k3s) — mail on the bare host; JMAP management API reachable from pods at `http://10.42.0.1:8080` (cni0 gateway).
- **Traefik** — per-router HTTP→HTTPS redirect via `redirectScheme` Middleware on each Ingress (`web,websecure` entrypoints). **No global entrypoint redirect** — that breaks cert-manager HTTP-01 (`fleet/traefik/`).
- **App tier** (`dezky-apps` ns) — portal (`app.dezky.eu`), platform-api (`api.dezky.eu`), booking (`booking.dezky.eu`), operator (`operator.dezky.eu`). See `fleet/README.md`.
- **CI/CD** (`gitea-runner` ns) — in-cluster `gitea/runner:1.0.8` + dind sidecar. **Push to main = deploy** (see "Deploy flow" below).
- **Registry hygiene** — Gitea package cleanup rule (user-level, Container type): keep newest 5 versions per image + `latest`, remove older than 7 days. Applied by Gitea's daily cleanup cron.
- **Monitoring** — HetrixTools (Ronni's account):
  - 11 uptime monitors via API: HTTPS on the five apps + Gitea w/ SSL verify, ping, IMAPS/SMTPS/port-25 TCP; 1-min checks from ams/fra/lon, alert after 2 fails. Port 25 is a TCP check ON PURPOSE: Stalwart's DNSBL screening rejects HetrixTools' probe IPs, so an SMTP-protocol check reads down while real MTAs are fine.
  - Blacklist monitors on dezky.eu + 46.4.78.187.
  - The Linux server agent on node1 (root mode, per-minute cron in `/etc/hetrixtools/`; watches stalwart/k3s/dockerd processes, mdadm RAID, NVMe SMART via smartmontools).
  - Re-create monitors via their v2 API (`uptime/add`, Type 9 = server agent — hidden in the new UI); agent install: `hetrixtools_install.sh 1 "stalwart,k3s,dockerd" 1 1`.

## Deploy flow (day-to-day)

Push to `main` on Gitea → `.gitea/workflows/ci.yml` runs in-cluster: **typecheck + test → docker build + push** (each app image tagged `:latest` + the commit SHA, to `git.lastcloud.io/ronnibaslund/dezky/`) → **deploy** (`kustomize edit set image` pins the SHA, `kubectl apply -k fleet/apps`, waits for rollouts). No GitOps controller, no manual steps. Push-to-live is ~2 min with a warm build cache, 5–10 min after a runner pod restart (the dind layer cache is an emptyDir).

- **Watch:** repo → Actions in Gitea, or `kubectl -n dezky-apps get deploy -o wide` (image column shows the SHA).
- **Rollback:** re-run an older green run from the Gitea Actions UI, or `kubectl -n dezky-apps set image deploy/ =git.lastcloud.io/ronnibaslund/dezky/:`.
- **Break-glass (runner down):** `kubectl apply -k fleet/apps/` by hand — manifests reference `:latest`.
- **Gitea Actions secrets** (repo Settings → Actions → Secrets): `KUBECONFIG_B64` (ci-deployer kubeconfig, see step 9) and `REGISTRY_TOKEN` (Gitea PAT with package read+write — the per-job GITHUB_TOKEN is NOT accepted by the container registry).

## Reproduce from scratch

### 1. Host layer

```bash
# from laptop
scp -r infrastructure/production/host root@:/opt/dezky-host
# copy/fill config.env on the box (gitignored — MGMT IPs, ADMIN_SSH_PUBKEY,
# RANCHER_* token/checksum, STALWART_*, RESTIC_*)
ssh root@ 'cd /opt/dezky-host && ./bootstrap.sh'
# set a console/sudo password for the admin user, then (optional) NOPASSWD:
ssh root@ 'passwd dezky'
ssh dezky@ "echo 'dezky ALL=(ALL) NOPASSWD:ALL' | sudo tee /etc/sudoers.d/90-dezky && sudo chmod 0440 /etc/sudoers.d/90-dezky"
```

### 2. k3s + kubectl access

```bash
ssh dezky@
sudo /opt/dezky-host/k3s/register.sh   # joins the Rancher Custom (K3s) cluster
kubectl --kubeconfig /etc/rancher/k3s/k3s.yaml get nodes   # -> Ready
# give dezky a kubeconfig:
mkdir -p ~/.kube && sudo install -m 600 -o dezky -g dezky /etc/rancher/k3s/k3s.yaml ~/.kube/config
```

### 3. Longhorn (storage)

```bash
sudo apt-get install -y open-iscsi nfs-common && sudo systemctl enable --now iscsid   # (bootstrap.sh does this now)
helm repo add longhorn https://charts.longhorn.io && helm repo update
helm install longhorn longhorn/longhorn -n longhorn-system --create-namespace \
  --version 1.12.0 -f fleet/longhorn/values.yaml   # replica=1, default class
# one default SC only:
kubectl patch storageclass local-path -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
kubectl -n longhorn-system patch settings.longhorn.io default-replica-count --type=merge -p '{"value":"1"}'
kubectl get storageclass   # only 'longhorn (default)'
```

### 4. cert-manager + issuers

```bash
kubectl apply -f fleet/cert-manager/cert-manager.yaml
kubectl -n cert-manager rollout status deploy/cert-manager-webhook --timeout=180s
kubectl apply -f fleet/cert-manager/cluster-issuer.yaml
kubectl get clusterissuer   # both READY=True
```

### 5. Data tier

```bash
kubectl create namespace dezky-data --dry-run=client -o yaml | kubectl apply -f -
# secrets — generate fresh, store in Bitwarden:
kubectl -n dezky-data create secret generic postgres-secret \
  --from-literal=POSTGRES_PASSWORD=$(openssl rand -hex 24) \
  --from-literal=AUTHENTIK_DB_PASSWORD=$(openssl rand -hex 24) \
  --from-literal=OCIS_DB_PASSWORD=$(openssl rand -hex 24)
kubectl -n dezky-data create secret generic mongo-secret \
  --from-literal=root-username=dezky --from-literal=root-password=$(openssl rand -hex 24)
kubectl -n dezky-data create secret generic redis-secret \
  --from-literal=REDIS_PASSWORD=$(openssl rand -hex 24)
kubectl apply -k fleet/data/
kubectl -n dezky-data get pods,pvc   # all Running, PVCs Bound on longhorn
```

### 6. Authentik (IdP)

See `fleet/authentik/README.md`. Create `dezky-auth` ns + `authentik-secret` (DB/Redis pw read back from dezky-data so they match; SECRET_KEY + bootstrap generated), then `kubectl apply -f fleet/authentik/helmchart.yaml`.
Reachable at https://auth.dezky.eu; first login `akadmin` / `AUTHENTIK_BOOTSTRAP_PASSWORD`.

### 7. Traefik — per-router HTTPS redirect (ACME-safe)

```bash
# NO global entrypoint redirect — it would 301 the HTTP-01 challenge before
# cert-manager's solver router can answer it. Redirect lives per-Ingress via
# a redirectScheme Middleware instead (applied with each tier's kustomize).
kubectl apply -f fleet/traefik/helmchartconfig.yaml
kubectl -n kube-system delete job helm-install-traefik   # force the controller to re-run with merged values
# verify: curl -sI http://app.dezky.eu -> 301 https://...   AND new certs still issue
```

### 8. App tier (portal · platform-api · booking · operator)

```bash
# Secrets first (out-of-band, values from Bitwarden / generated — see
# fleet/README.md "Required env / secrets" + fleet/apps/secrets.example.yaml):
#   portal-secrets, booking-secrets, operator-secrets, platform-api-secrets
kubectl apply -k fleet/apps/
kubectl -n dezky-apps get pods   # all Running once images exist in the registry
```

### 9. CI runner + push-to-deploy

```bash
# In-cluster Gitea Actions runner (gitea/runner + privileged dind sidecar).
# Registration token from Gitea: Settings → Actions → Runners → Create token.
kubectl create namespace gitea-runner --dry-run=client -o yaml | kubectl apply -f -
kubectl -n gitea-runner create secret generic gitea-runner-token \
  --from-literal=token=
kubectl apply -f fleet/ci/gitea-runner.yaml

# Deploy ServiceAccount + kubeconfig for the pipeline's deploy job:
kubectl apply -f fleet/ci/ci-deployer.yaml
# mint the kubeconfig (full recipe in fleet/README.md "Deploy") and store it
# as the KUBECONFIG_B64 repo secret; create a Gitea PAT with package
# read+write and store as REGISTRY_TOKEN.

# Gotchas baked into fleet/ci/gitea-runner.yaml — don't "simplify" them away:
#   - gitea/runner 1.x (NOT act_runner 0.2.x: Gitea 1.26 never marks its jobs
#     complete, which freezes runs at "Complete job").
#   - dind shares /var/run with the runner: jobs can only get a docker host
#     by bind-mounting a UNIX socket (tcp://+TLS can't be mounted).
#   - docker:24-dind (moby 27 has a cgroup-v2 teardown deadlock).
```

## Secrets — read live values for Bitwarden

```bash
k(){ kubectl -n dezky-data get secret "$1" -o jsonpath="{.data.$2}" | base64 -d; echo; }
k postgres-secret POSTGRES_PASSWORD
k postgres-secret AUTHENTIK_DB_PASSWORD   # must match Authentik's DB config
k postgres-secret OCIS_DB_PASSWORD        # must match OCIS's DB config
k mongo-secret root-password
k redis-secret REDIS_PASSWORD
a(){ kubectl -n dezky-apps get secret platform-api-secrets -o jsonpath="{.data.$1}" | base64 -d; echo; }
a SCHEDULING_CREDENTIAL_KEY   # AES key for stored scheduling creds — losing it orphans them
a AUDIT_SIGNING_KEY           # audit hash-chain key — rotation closes the segment
```

## Still TODO (next layers)

1. **OCIS** (files) — uses the `ocis` Postgres DB + Hetzner Object Storage (S3). platform-api already carries placeholder `OCIS_*` config (`fleet/apps/platform-api-config.yaml`) — swap in real values when live.
2. **Audit cold storage** — Hetzner Object Storage bucket + real `AUDIT_COLD_*` keys in `platform-api-secrets`; flip `ARCHIVE_ENABLED`.
3. **Backups** — Longhorn → Hetzner Object Storage (`fleet/longhorn/README.md`), plus host Restic for the mail store + etcd snapshots, plus pg_dump/mongodump CronJobs.
4. **Stripe live keys** — billing is dark-launched off (`BILLING_STRIPE_ENABLED: "false"` in the app config).

Done since first build: ✅ Authentik + OIDC blueprints · ✅ Stalwart on the host · ✅ app tier (incl. operator) · ✅ CI/CD push-to-deploy · ✅ DNS A records (`api`/`app`/`booking`/`auth`/`mail`/`operator`).dezky.eu.

## Access cheatsheet

- SSH: `ssh dezky@46.4.78.187` (key only). Root SSH disabled.
- kubectl: works as `dezky` (kubeconfig at `~/.kube/config`).
- Out-of-band if locked out: Hetzner Robot KVM/LARA or Rescue System.
- The `level=warning … 50-rancher.yaml: permission denied` from kubectl is harmless noise (k3s kubectl probing a root-only config dir).
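
Once kubectl works as `dezky`, the verification commands quoted in steps 2 to 8 can be replayed in one pass after a rebuild. The sketch below is an illustration, not something shipped in the repo: the `smoke` function and the `DRY_RUN` variable are hypothetical names, and the command list is simply copied from this runbook. It assumes the kubeconfig from step 2 is in place, and it defaults to printing the commands rather than running them.

```bash
#!/usr/bin/env bash
# Sketch only: replays the runbook's verification commands in build order.
# `smoke` and DRY_RUN are hypothetical names, not something shipped in the repo.
set -u

DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands, 0 = execute them

smoke() {
  failed=0
  while IFS= read -r cmd; do
    if [ "$DRY_RUN" = "1" ]; then
      printf 'would run: %s\n' "$cmd"
    else
      printf '== %s\n' "$cmd"
      # Unquoted on purpose: each entry is a plain command line without quoting.
      # stdin is redirected so a command cannot swallow the rest of the list.
      $cmd </dev/null || failed=$((failed + 1))
    fi
  done <<'EOF'
kubectl get nodes
kubectl get storageclass
kubectl get clusterissuer
kubectl -n dezky-data get pods,pvc
kubectl -n dezky-apps get pods
curl -sI http://app.dezky.eu
EOF
  # Exit status is the number of commands that failed (0 in dry-run mode).
  return "$failed"
}

smoke
```

Run it with `DRY_RUN=0` on node1 to execute the checks. The script only reports whether each command exited cleanly; it does not interpret the output, so conditions such as "both READY=True" or the 301 redirect still need to be read by eye.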