Ronni Baslund 2e3c0f9188
docs(runbook): monitoring update — TCP-25 rationale + blacklist monitors
2026-06-11 11:49:00 +02:00

# Dezky production — node1 build runbook
The actual, reproducible order used to stand up **node1.dezky.eu** (Hetzner
AX41, `46.4.78.187`, Ubuntu 24.04). If the box is lost, follow this top to
bottom to rebuild it. Per-layer detail lives in `host/README.md`,
`fleet/cert-manager/`, `fleet/longhorn/`, `fleet/data/`.
> Secrets are **never** in git. They're generated with `openssl rand -hex 24`
> and stored in **Bitwarden**. See "Secrets" below for how to read the live
> values back out of the cluster.
## Current state (built 2026-06-08, app tier + CI/CD 2026-06-10)
- **Host:** hardened via `host/bootstrap.sh`: `dezky` admin user, **key-only
SSH** (no root, no passwords), k3s-safe nftables firewall (SSH/6443 → mgmt
IPs `46.32.144.38`/`46.32.144.45`; 80/443+mail → world), fail2ban,
unattended-upgrades, `open-iscsi`+`iscsid` (Longhorn prereq).
`dezky` has **NOPASSWD sudo** (`/etc/sudoers.d/90-dezky`).
- **k3s** v1.33.11 — single node (control-plane/etcd/worker), registered in
Rancher (`91.99.122.153`).
- **Longhorn** — default StorageClass, `numberOfReplicas: 1` (single node).
- **cert-manager** + `letsencrypt-staging` / `letsencrypt-prod` (HTTP-01/Traefik).
- **Data tier** (`dezky-data` ns) — Postgres 16, Mongo 7, Redis 7 as
StatefulSets on Longhorn PVCs. Postgres holds the `authentik` + `ocis` DBs.
- **Authentik** (`dezky-auth` ns) — live at https://auth.dezky.eu (LE cert),
chart pinned `2026.5.2`, on our Postgres/Redis. Portal + operator OIDC app
blueprints applied (`fleet/authentik/blueprints/`).
- **Stalwart** (host, not k3s) — mail on the bare host; JMAP management API
reachable from pods at `http://10.42.0.1:8080` (cni0 gateway).
- **Traefik** — per-router HTTP→HTTPS redirect via `redirectScheme`
Middleware on each Ingress (`web,websecure` entrypoints). **No global
entrypoint redirect** — that breaks cert-manager HTTP-01 (`fleet/traefik/`).
- **App tier** (`dezky-apps` ns) — portal (`app.dezky.eu`), platform-api
(`api.dezky.eu`), booking (`booking.dezky.eu`), operator
(`operator.dezky.eu`). See `fleet/README.md`.
- **CI/CD** (`gitea-runner` ns) — in-cluster `gitea/runner:1.0.8` + dind
sidecar. **Push to main = deploy** (see "Deploy flow" below).
- **Registry hygiene** — Gitea package cleanup rule (user-level, Container
type): keep newest 5 versions per image + `latest`, remove older than 7
days. Applied by Gitea's daily cleanup cron.
- **Monitoring:** HetrixTools (Ronni's account).
  - 11 uptime monitors created via API: HTTPS on the five apps + Gitea (with
    SSL verify), ping, and IMAPS/SMTPS/port-25 TCP. 1-min checks from
    ams/fra/lon, alert after 2 fails.
  - Port 25 is a TCP check ON PURPOSE: Stalwart's DNSBL screening rejects
    HetrixTools' probe IPs, so an SMTP-protocol check reads down while real
    MTAs are fine.
  - Blacklist monitors on dezky.eu + 46.4.78.187.
  - Linux server agent on node1 (root mode, per-minute cron in
    `/etc/hetrixtools/`; watches stalwart/k3s/dockerd processes, mdadm RAID,
    NVMe SMART via smartmontools).
  - Re-create monitors via their v2 API (`uptime/add`; Type 9 = server agent,
    hidden in the new UI). Agent install:
    `hetrixtools_install.sh <server_id from API response> 1 "stalwart,k3s,dockerd" 1 1`
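
The port-25 rationale above rests on the difference between opening a TCP connection and holding an SMTP conversation. The sketch below shows what a plain TCP probe amounts to, using bash's `/dev/tcp`. The helper name `tcp_check` is hypothetical, and this is an illustration rather than how HetrixTools implements its check.

```bash
#!/usr/bin/env bash
# tcp_check HOST PORT [TIMEOUT_SECONDS]
# Succeeds if a TCP connection can be opened. It sends no protocol data, so
# nothing that happens later in an SMTP dialogue can affect the result.
# Hypothetical helper for illustration only.
tcp_check() {
  local host=$1 port=$2 secs=${3:-5}
  # bash treats /dev/tcp/HOST/PORT as a socket; `timeout` bounds the connect.
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Usage (from any machine allowed to reach the mail ports):
#   tcp_check node1.dezky.eu 25 && echo "port 25 accepts connections"
```

A check like this only confirms that the listener is up, which is the trade-off the runbook accepts for port 25.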
## Deploy flow (day-to-day)
Push to `main` on Gitea → `.gitea/workflows/ci.yml` runs in-cluster:
**typecheck + test → docker build + push** (each app image tagged `:latest` +
the commit SHA, to `git.lastcloud.io/ronnibaslund/dezky/<app>`) → **deploy**
(`kustomize edit set image` pins the SHA, `kubectl apply -k fleet/apps`,
waits for rollouts). No GitOps controller, no manual steps. Push-to-live is
~2 min with a warm build cache, 5-10 min after a runner pod restart (the dind
layer cache is an emptyDir).
- **Watch:** repo → Actions in Gitea, or
`kubectl -n dezky-apps get deploy -o wide` (image column shows the SHA).
- **Rollback:** re-run an older green run from the Gitea Actions UI, or
`kubectl -n dezky-apps set image deploy/<app> <app>=git.lastcloud.io/ronnibaslund/dezky/<app>:<old-sha>`.
- **Break-glass (runner down):** `kubectl apply -k fleet/apps/` by hand —
manifests reference `:latest`.
- **Gitea Actions secrets** (repo Settings → Actions → Secrets):
`KUBECONFIG_B64` (ci-deployer kubeconfig, see step 9) and `REGISTRY_TOKEN`
(Gitea PAT with package read+write — the per-job GITHUB_TOKEN is NOT
accepted by the container registry).
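
The deploy step described above can be sketched as a pair of shell helpers. This is a minimal sketch, not a copy of `.gitea/workflows/ci.yml`: the function names are hypothetical, and it assumes each Deployment and image is named after its app (as the rollback command suggests) and that the commit SHA is used verbatim as the tag.

```bash
#!/usr/bin/env bash
# Registry prefix as given in the deploy flow description.
REGISTRY="git.lastcloud.io/ronnibaslund/dezky"

# image_ref APP TAG -> fully qualified image reference.
image_ref() { printf '%s/%s:%s\n' "$REGISTRY" "$1" "$2"; }

# deploy_sha SHA APP...
# Pins each app image to SHA in fleet/apps, applies the kustomization and
# waits for the rollouts. Assumption: Deployments are named after the app.
deploy_sha() {
  local sha=$1; shift
  local app
  (
    cd fleet/apps || exit 1
    for app in "$@"; do
      # Rewrites the images: entry in kustomization.yaml in place.
      kustomize edit set image "$REGISTRY/$app=$(image_ref "$app" "$sha")"
    done
  ) || return 1
  kubectl apply -k fleet/apps || return 1
  for app in "$@"; do
    kubectl -n dezky-apps rollout status "deploy/$app" --timeout=180s || return 1
  done
}

# Usage (app names are assumed from the app tier list):
#   deploy_sha <commit-sha> portal platform-api booking operator
```

Because `kustomize edit set image` modifies `kustomization.yaml` on disk, running this outside CI leaves a local change that should not be committed by accident.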
## Reproduce from scratch
### 1. Host layer
```bash
# from laptop
scp -r infrastructure/production/host root@<ip>:/opt/dezky-host
# copy/fill config.env on the box (gitignored — MGMT IPs, ADMIN_SSH_PUBKEY,
# RANCHER_* token/checksum, STALWART_*, RESTIC_*)
ssh root@<ip> 'cd /opt/dezky-host && ./bootstrap.sh'
# set a console/sudo password for the admin user, then (optional) NOPASSWD:
ssh root@<ip> 'passwd dezky'
ssh dezky@<ip> "echo 'dezky ALL=(ALL) NOPASSWD:ALL' | sudo tee /etc/sudoers.d/90-dezky && sudo chmod 0440 /etc/sudoers.d/90-dezky"
```
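
After the host layer is up, it can help to confirm the SSH hardening before closing the root session. The sketch below parses the effective sshd configuration printed by `sshd -T`. The helper `check_sshd_hardening` is hypothetical; `bootstrap.sh` may verify this differently or not at all.

```bash
#!/usr/bin/env bash
# check_sshd_hardening: reads `sshd -T` output on stdin and succeeds only if
# root login and password authentication are both disabled.
# Hypothetical helper, not part of host/bootstrap.sh.
check_sshd_hardening() {
  local cfg
  cfg=$(tr '[:upper:]' '[:lower:]')   # normalise case of keywords and values
  grep -qx 'permitrootlogin no' <<<"$cfg" &&
    grep -qx 'passwordauthentication no' <<<"$cfg"
}

# Usage on the box:
#   sudo sshd -T | check_sshd_hardening && echo "SSH is key-only, no root"
#   sudo visudo -cf /etc/sudoers.d/90-dezky   # syntax-check the sudoers drop-in
```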
### 2. k3s + kubectl access
```bash
ssh dezky@<ip>
sudo /opt/dezky-host/k3s/register.sh # joins the Rancher Custom (K3s) cluster
kubectl --kubeconfig /etc/rancher/k3s/k3s.yaml get nodes # -> Ready
# give dezky a kubeconfig:
mkdir -p ~/.kube && sudo install -m 600 -o dezky -g dezky /etc/rancher/k3s/k3s.yaml ~/.kube/config
```
### 3. Longhorn (storage)
```bash
sudo apt-get install -y open-iscsi nfs-common && sudo systemctl enable --now iscsid # (bootstrap.sh does this now)
helm repo add longhorn https://charts.longhorn.io && helm repo update
helm install longhorn longhorn/longhorn -n longhorn-system --create-namespace \
--version 1.12.0 -f fleet/longhorn/values.yaml # replica=1, default class
# one default SC only:
kubectl patch storageclass local-path -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
kubectl -n longhorn-system patch settings.longhorn.io default-replica-count --type=merge -p '{"value":"1"}'
kubectl get storageclass # only 'longhorn (default)'
```
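
The "one default SC only" requirement can be checked mechanically instead of by eye. A minimal sketch, assuming the usual `kubectl get storageclass --no-headers` layout where the default class carries a `(default)` marker in the second column; the helper name is hypothetical.

```bash
#!/usr/bin/env bash
# default_storage_classes: reads `kubectl get storageclass --no-headers`
# output on stdin and prints the name of every class marked (default).
default_storage_classes() { awk '$2 == "(default)" { print $1 }'; }

# Usage: exactly one line, "longhorn", is the expected result.
#   kubectl get storageclass --no-headers | default_storage_classes
```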
### 4. cert-manager + issuers
```bash
kubectl apply -f fleet/cert-manager/cert-manager.yaml
kubectl -n cert-manager rollout status deploy/cert-manager-webhook --timeout=180s
kubectl apply -f fleet/cert-manager/cluster-issuer.yaml
kubectl get clusterissuer # both READY=True
```
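
Once both issuers report READY, a throwaway certificate from `letsencrypt-staging` is a cheap way to prove the HTTP-01 path end to end without touching production rate limits. The sketch below only prints the manifest; the resource name `staging-smoke-test` and its secret name are hypothetical, and the host must already resolve to the node.

```bash
#!/usr/bin/env bash
# staging_cert_manifest HOST [NAMESPACE]
# Prints a cert-manager Certificate requesting HOST from the
# letsencrypt-staging ClusterIssuer. Names below are hypothetical.
staging_cert_manifest() {
  local host=$1 ns=${2:-default}
  cat <<EOF
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: staging-smoke-test
  namespace: ${ns}
spec:
  secretName: staging-smoke-test-tls
  dnsNames:
    - ${host}
  issuerRef:
    name: letsencrypt-staging
    kind: ClusterIssuer
EOF
}

# Usage:
#   staging_cert_manifest <host> | kubectl apply -f -
#   kubectl wait --for=condition=Ready certificate/staging-smoke-test --timeout=300s
#   kubectl delete certificate staging-smoke-test
#   kubectl delete secret staging-smoke-test-tls
```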
### 5. Data tier
```bash
kubectl create namespace dezky-data --dry-run=client -o yaml | kubectl apply -f -
# secrets — generate fresh, store in Bitwarden:
kubectl -n dezky-data create secret generic postgres-secret \
--from-literal=POSTGRES_PASSWORD=$(openssl rand -hex 24) \
--from-literal=AUTHENTIK_DB_PASSWORD=$(openssl rand -hex 24) \
--from-literal=OCIS_DB_PASSWORD=$(openssl rand -hex 24)
kubectl -n dezky-data create secret generic mongo-secret \
--from-literal=root-username=dezky --from-literal=root-password=$(openssl rand -hex 24)
kubectl -n dezky-data create secret generic redis-secret \
--from-literal=REDIS_PASSWORD=$(openssl rand -hex 24)
kubectl apply -k fleet/data/
kubectl -n dezky-data get pods,pvc # all Running, PVCs Bound on longhorn
```
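
The commands above generate each password inline, so the value only exists inside the cluster until it is read back. An alternative sketch, assuming you prefer to capture the value first: `openssl rand -hex 24` emits 24 random bytes as 48 hex characters (192 bits), which can be held in a variable and used for both Bitwarden and the Secret. The wrapper name `gen_secret` is hypothetical.

```bash
#!/usr/bin/env bash
# gen_secret: 24 random bytes rendered as 48 lowercase hex characters.
gen_secret() { openssl rand -hex 24; }

# Generate once so the same value can be stored in Bitwarden and in the cluster.
REDIS_PASSWORD=$(gen_secret)
echo "generated ${#REDIS_PASSWORD} characters"   # prints: generated 48 characters

# Usage:
#   kubectl -n dezky-data create secret generic redis-secret \
#     --from-literal=REDIS_PASSWORD="$REDIS_PASSWORD"
```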
### 6. Authentik (IdP)
See `fleet/authentik/README.md`. Create `dezky-auth` ns + `authentik-secret`
(DB/Redis pw read back from dezky-data so they match; SECRET_KEY + bootstrap
generated), then `kubectl apply -f fleet/authentik/helmchart.yaml`. Reachable at
https://auth.dezky.eu; first login `akadmin` / `AUTHENTIK_BOOTSTRAP_PASSWORD`.
### 7. Traefik — per-router HTTPS redirect (ACME-safe)
```bash
# NO global entrypoint redirect — it would 301 the HTTP-01 challenge before
# cert-manager's solver router can answer it. Redirect lives per-Ingress via
# a redirectScheme Middleware instead (applied with each tier's kustomize).
kubectl apply -f fleet/traefik/helmchartconfig.yaml
kubectl -n kube-system delete job helm-install-traefik # force the controller to re-run with merged values
# verify: curl -sI http://app.dezky.eu -> 301 https://... AND new certs still issue
```
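
The `curl -sI` verification mentioned in the last comment can be turned into a pass/fail check. A minimal sketch: the helper `is_https_redirect` is hypothetical and accepts either 301 or 308 as a permanent redirect.

```bash
#!/usr/bin/env bash
# is_https_redirect: reads HTTP response headers on stdin (e.g. from
# `curl -sI`) and succeeds if the status is 301 or 308 and the Location
# header points at an https:// URL.
is_https_redirect() {
  awk '
    { sub(/\r$/, "") }                                   # strip CRLF endings
    NR == 1 && ($2 == "301" || $2 == "308") { redirect = 1 }
    tolower($1) == "location:" && $2 ~ /^https:\/\// { https = 1 }
    END { exit !(redirect && https) }
  '
}

# Usage:
#   curl -sI http://app.dezky.eu | is_https_redirect && echo "redirect OK"
```

This covers only the redirect half of the verification; confirming that new certificates still issue requires an actual issuance.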
### 8. App tier (portal · platform-api · booking · operator)
```bash
# Secrets first (out-of-band, values from Bitwarden / generated — see
# fleet/README.md "Required env / secrets" + fleet/apps/secrets.example.yaml):
# portal-secrets, booking-secrets, operator-secrets, platform-api-secrets
kubectl apply -k fleet/apps/
kubectl -n dezky-apps get pods # all Running once images exist in the registry
```
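
Since the app manifests depend on four out-of-band secrets, a quick pre-flight check avoids pods stuck waiting on a missing one. A minimal sketch with a hypothetical helper; the secret names are the ones listed in the step above.

```bash
#!/usr/bin/env bash
# Secret names listed in step 8.
REQUIRED_SECRETS=(portal-secrets booking-secrets operator-secrets platform-api-secrets)

# missing_secrets: reads existing secret names on stdin (one per line) and
# prints every required name that is absent.
missing_secrets() {
  local existing name
  existing=$(cat)
  for name in "${REQUIRED_SECRETS[@]}"; do
    grep -qxF -- "$name" <<<"$existing" || echo "$name"
  done
}

# Usage: no output means all four secrets exist.
#   kubectl -n dezky-apps get secret -o name | sed 's|^secret/||' | missing_secrets
```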
### 9. CI runner + push-to-deploy
```bash
# In-cluster Gitea Actions runner (gitea/runner + privileged dind sidecar).
# Registration token from Gitea: Settings → Actions → Runners → Create token.
kubectl create namespace gitea-runner --dry-run=client -o yaml | kubectl apply -f -
kubectl -n gitea-runner create secret generic gitea-runner-token \
--from-literal=token=<registration token>
kubectl apply -f fleet/ci/gitea-runner.yaml
# Deploy ServiceAccount + kubeconfig for the pipeline's deploy job:
kubectl apply -f fleet/ci/ci-deployer.yaml
# mint the kubeconfig (full recipe in fleet/README.md "Deploy") and store it
# as the KUBECONFIG_B64 repo secret; create a Gitea PAT with package
# read+write and store as REGISTRY_TOKEN.
# Gotchas baked into fleet/ci/gitea-runner.yaml — don't "simplify" them away:
# - gitea/runner 1.x (NOT act_runner 0.2.x: Gitea 1.26 never marks its jobs
# complete, which freezes runs at "Complete job").
# - dind shares /var/run with the runner: jobs can only get a docker host
# by bind-mounting a UNIX socket (tcp://+TLS can't be mounted).
# - docker:24-dind (moby 27 has a cgroup-v2 teardown deadlock).
```
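
For the `KUBECONFIG_B64` secret, the kubeconfig has to survive a round trip through base64 unchanged. The sketch below shows one way to encode it as a single line and decode it again with owner-only permissions. Both helper names and the file name `ci-deployer.kubeconfig` are hypothetical, and how `ci.yml` actually consumes the secret is not shown in this runbook.

```bash
#!/usr/bin/env bash
# encode_kubeconfig FILE -> single-line base64, suitable for a repo secret.
encode_kubeconfig() { base64 < "$1" | tr -d '\n'; }

# decode_kubeconfig B64 FILE -> writes the kubeconfig readable by owner only.
decode_kubeconfig() {
  ( umask 077; printf '%s' "$1" | base64 -d > "$2" )
}

# Usage:
#   encode_kubeconfig ci-deployer.kubeconfig            # paste as KUBECONFIG_B64
#   decode_kubeconfig "$KUBECONFIG_B64" "$HOME/.kube/config"   # in a deploy job
```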
## Secrets — read live values for Bitwarden
```bash
k(){ kubectl -n dezky-data get secret "$1" -o jsonpath="{.data.$2}" | base64 -d; echo; }
k postgres-secret POSTGRES_PASSWORD
k postgres-secret AUTHENTIK_DB_PASSWORD # must match Authentik's DB config
k postgres-secret OCIS_DB_PASSWORD # must match OCIS's DB config
k mongo-secret root-password
k redis-secret REDIS_PASSWORD
a(){ kubectl -n dezky-apps get secret platform-api-secrets -o jsonpath="{.data.$1}" | base64 -d; echo; }
a SCHEDULING_CREDENTIAL_KEY # AES key for stored scheduling creds — losing it orphans them
a AUDIT_SIGNING_KEY # audit hash-chain key — rotation closes the segment
```
## Still TODO (next layers)
1. **OCIS** (files) — uses the `ocis` Postgres DB + Hetzner Object Storage
(S3). platform-api already carries placeholder `OCIS_*` config
(`fleet/apps/platform-api-config.yaml`) — swap in real values when live.
2. **Audit cold storage** — Hetzner Object Storage bucket + real
`AUDIT_COLD_*` keys in `platform-api-secrets`; flip `ARCHIVE_ENABLED`.
3. **Backups** — Longhorn → Hetzner Object Storage (`fleet/longhorn/README.md`),
plus host Restic for the mail store + etcd snapshots, plus pg_dump/mongodump
CronJobs.
4. **Stripe live keys** — billing is dark-launched off
(`BILLING_STRIPE_ENABLED: "false"` in the app config).
Done since first build: ✅ Authentik + OIDC blueprints · ✅ Stalwart on the
host · ✅ app tier (incl. operator) · ✅ CI/CD push-to-deploy · ✅ DNS A
records (`api`/`app`/`booking`/`auth`/`mail`/`operator`).dezky.eu.
## Access cheatsheet
- SSH: `ssh dezky@46.4.78.187` (key only). Root SSH disabled.
- kubectl: works as `dezky` (kubeconfig at `~/.kube/config`).
- Out-of-band if locked out: Hetzner Robot KVM/LARA or Rescue System.
- The `level=warning … 50-rancher.yaml: permission denied` from kubectl is
harmless noise (k3s kubectl probing a root-only config dir).