# Dezky production — node1 build runbook

The actual, reproducible order used to stand up **node1.dezky.eu** (Hetzner
AX41, `46.4.78.187`, Ubuntu 24.04). If the box is lost, follow this top to
bottom to rebuild it. Per-layer detail lives in `host/README.md`,
`fleet/cert-manager/`, `fleet/longhorn/`, `fleet/data/`.

> Secrets are **never** in git. They're generated with `openssl rand -hex 24`
> and stored in **Bitwarden**. See "Secrets" below for how to read the live
> values back out of the cluster.

## Current state (built 2026-06-08, app tier + CI/CD 2026-06-10)

- **Host:** hardened via `host/bootstrap.sh` — `dezky` admin user, **key-only
  SSH** (no root, no passwords), k3s-safe nftables firewall (SSH/6443 → mgmt
  IPs `46.32.144.38`/`46.32.144.45`; 80/443+mail → world), fail2ban,
  unattended-upgrades, `open-iscsi`+`iscsid` (Longhorn prereq).
  `dezky` has **NOPASSWD sudo** (`/etc/sudoers.d/90-dezky`).
- **k3s** v1.33.11 — single node (control-plane/etcd/worker), registered in
  Rancher (`91.99.122.153`).
- **Longhorn** — default StorageClass, `numberOfReplicas: 1` (single node).
- **cert-manager** + `letsencrypt-staging` / `letsencrypt-prod` (HTTP-01/Traefik).
- **Data tier** (`dezky-data` ns) — Postgres 16, Mongo 7, Redis 7 as
  StatefulSets on Longhorn PVCs. Postgres holds the `authentik` + `ocis` DBs.
- **Authentik** (`dezky-auth` ns) — live at https://auth.dezky.eu (LE cert),
  chart pinned `2026.5.2`, on our Postgres/Redis. Portal + operator OIDC app
  blueprints applied (`fleet/authentik/blueprints/`).
- **Stalwart** (host, not k3s) — mail on the bare host; JMAP management API
  reachable from pods at `http://10.42.0.1:8080` (cni0 gateway).
- **Traefik** — per-router HTTP→HTTPS redirect via `redirectScheme`
  Middleware on each Ingress (`web,websecure` entrypoints). **No global
  entrypoint redirect** — that breaks cert-manager HTTP-01 (`fleet/traefik/`).
- **App tier** (`dezky-apps` ns) — portal (`app.dezky.eu`), platform-api
  (`api.dezky.eu`), booking (`booking.dezky.eu`), operator
  (`operator.dezky.eu`). See `fleet/README.md`.
- **CI/CD** (`gitea-runner` ns) — in-cluster `gitea/runner:1.0.8` + dind
  sidecar. **Push to main = deploy** (see "Deploy flow" below).
- **Registry hygiene** — Gitea package cleanup rule (user-level, Container
  type): keep newest 5 versions per image + `latest`, remove older than 7
  days. Applied by Gitea's daily cleanup cron.
- **Monitoring** — HetrixTools (Ronni's account): 11 uptime monitors created
  via the API (HTTPS on the five apps + Gitea with SSL verification, ping,
  IMAPS/SMTPS TCP, SMTP protocol on :25; 1-min checks from ams/fra/lon, alert
  after 2 fails), plus the Linux server agent on node1 (root mode, per-minute
  cron in `/etc/hetrixtools/`; watches the stalwart/k3s/dockerd processes,
  mdadm RAID, and NVMe SMART via smartmontools). Re-create monitors via the
  HetrixTools v2 API (`uptime/add`; Type 9 = server agent, which is hidden in
  the new UI). Agent install:
  `hetrixtools_install.sh <server_id from API response> 1 "stalwart,k3s,dockerd" 1 1`.

## Deploy flow (day-to-day)

Push to `main` on Gitea → `.gitea/workflows/ci.yml` runs in-cluster:
**typecheck + test → docker build + push** (each app image tagged `:latest` +
the commit SHA, to `git.lastcloud.io/ronnibaslund/dezky/<app>`) → **deploy**
(`kustomize edit set image` pins the SHA, `kubectl apply -k fleet/apps`,
waits for rollouts). No GitOps controller, no manual steps. Push-to-live is
~2 min with a warm build cache, 5–10 min after a runner pod restart (the dind
layer cache is an emptyDir).

- **Watch:** repo → Actions in Gitea, or
  `kubectl -n dezky-apps get deploy -o wide` (image column shows the SHA).
- **Rollback:** re-run an older green run from the Gitea Actions UI, or
  `kubectl -n dezky-apps set image deploy/<app> <app>=git.lastcloud.io/ronnibaslund/dezky/<app>:<old-sha>`.
- **Break-glass (runner down):** `kubectl apply -k fleet/apps/` by hand. The
  committed manifests reference `:latest`, so this deploys the newest pushed
  image rather than a pinned SHA.
- **Gitea Actions secrets** (repo Settings → Actions → Secrets):
  `KUBECONFIG_B64` (ci-deployer kubeconfig, see step 9) and `REGISTRY_TOKEN`
  (Gitea PAT with package read+write; the per-job `GITHUB_TOKEN` is NOT
  accepted by the container registry).

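As a rough sketch of the rollback bullet, the helpers below build the SHA-pinned image reference and print the matching `kubectl set image` command instead of running it. The function names and the example SHA `0123abc` are hypothetical; the registry path and namespace are the ones named in this section.

```bash
# Sketch only: prints commands, never touches the cluster.
REGISTRY="git.lastcloud.io/ronnibaslund/dezky"   # registry path from the deploy flow

# image_ref APP SHA -> image reference pinned to a commit SHA
image_ref() {
  printf '%s/%s:%s\n' "$REGISTRY" "$1" "$2"
}

# rollback_cmd APP SHA -> the kubectl command that pins deploy/APP to SHA
# (assumes the container is named after the app, as in the rollback bullet)
rollback_cmd() {
  printf 'kubectl -n dezky-apps set image deploy/%s %s=%s\n' \
    "$1" "$1" "$(image_ref "$1" "$2")"
}

# Example with a hypothetical SHA:
rollback_cmd portal 0123abc
# prints: kubectl -n dezky-apps set image deploy/portal portal=git.lastcloud.io/ronnibaslund/dezky/portal:0123abc
```

Review the printed command, then paste it into a shell that has the `dezky` kubeconfig.
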
## Reproduce from scratch

### 1. Host layer
```bash
# from laptop
scp -r infrastructure/production/host root@<ip>:/opt/dezky-host
# copy/fill config.env on the box (gitignored — MGMT IPs, ADMIN_SSH_PUBKEY,
# RANCHER_* token/checksum, STALWART_*, RESTIC_*)
ssh root@<ip> 'cd /opt/dezky-host && ./bootstrap.sh'
# set a console/sudo password for the admin user, then (optional) NOPASSWD.
# -t allocates a TTY so passwd and sudo can prompt for the password:
ssh -t root@<ip> 'passwd dezky'
ssh -t dezky@<ip> "echo 'dezky ALL=(ALL) NOPASSWD:ALL' | sudo tee /etc/sudoers.d/90-dezky && sudo chmod 0440 /etc/sudoers.d/90-dezky"
```

### 2. k3s + kubectl access
```bash
ssh dezky@<ip>
sudo /opt/dezky-host/k3s/register.sh                            # joins the Rancher Custom (K3s) cluster
# the k3s kubeconfig is root-only until it is copied below, hence sudo:
sudo kubectl --kubeconfig /etc/rancher/k3s/k3s.yaml get nodes   # -> Ready
# give dezky a kubeconfig:
mkdir -p ~/.kube && sudo install -m 600 -o dezky -g dezky /etc/rancher/k3s/k3s.yaml ~/.kube/config
```

### 3. Longhorn (storage)
```bash
sudo apt-get install -y open-iscsi nfs-common && sudo systemctl enable --now iscsid   # (bootstrap.sh does this now)
helm repo add longhorn https://charts.longhorn.io && helm repo update
helm install longhorn longhorn/longhorn -n longhorn-system --create-namespace \
  --version 1.12.0 -f fleet/longhorn/values.yaml      # replica=1, default class
# one default SC only:
kubectl patch storageclass local-path -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
kubectl -n longhorn-system patch settings.longhorn.io default-replica-count --type=merge -p '{"value":"1"}'
kubectl get storageclass     # only 'longhorn (default)'
```

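The "one default SC only" check can be scripted instead of eyeballed. A minimal sketch, assuming the usual `kubectl get storageclass` table output where the default class carries a `(default)` suffix; the helper name `count_default_sc` and the sample rows are illustrative, not captured from node1.

```bash
# count_default_sc: reads `kubectl get storageclass` text on stdin and prints
# how many rows carry the "(default)" marker.
count_default_sc() {
  # grep -c exits non-zero when the count is 0, so mask the exit status
  grep -c '(default)' || true
}

# Illustrative sample (not real output from the cluster):
sample='longhorn (default)   driver.longhorn.io      Delete   Immediate              true    5m
local-path           rancher.io/local-path   Delete   WaitForFirstConsumer   false   9m'

printf '%s\n' "$sample" | count_default_sc   # prints 1
```

On the box this would be `kubectl get storageclass --no-headers | count_default_sc`; anything other than `1` suggests the `local-path` patch did not take effect.
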
### 4. cert-manager + issuers
```bash
kubectl apply -f fleet/cert-manager/cert-manager.yaml
kubectl -n cert-manager rollout status deploy/cert-manager-webhook --timeout=180s
kubectl apply -f fleet/cert-manager/cluster-issuer.yaml
kubectl get clusterissuer     # both READY=True
```

### 5. Data tier
```bash
kubectl create namespace dezky-data --dry-run=client -o yaml | kubectl apply -f -
# secrets — generate fresh, store in Bitwarden:
kubectl -n dezky-data create secret generic postgres-secret \
  --from-literal=POSTGRES_PASSWORD=$(openssl rand -hex 24) \
  --from-literal=AUTHENTIK_DB_PASSWORD=$(openssl rand -hex 24) \
  --from-literal=OCIS_DB_PASSWORD=$(openssl rand -hex 24)
kubectl -n dezky-data create secret generic mongo-secret \
  --from-literal=root-username=dezky --from-literal=root-password=$(openssl rand -hex 24)
kubectl -n dezky-data create secret generic redis-secret \
  --from-literal=REDIS_PASSWORD=$(openssl rand -hex 24)
kubectl apply -k fleet/data/
kubectl -n dezky-data get pods,pvc     # all Running, PVCs Bound on longhorn
```

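Since every secret above comes from `openssl rand -hex 24`, each value should be exactly 48 lowercase hex characters (24 random bytes, two hex digits per byte). The sketch below is a small sanity check for values pasted to or from Bitwarden; the helper names `gen_secret` and `is_valid_secret` are hypothetical.

```bash
# gen_secret: same generator the runbook uses (24 random bytes as hex).
gen_secret() {
  openssl rand -hex 24
}

# is_valid_secret VALUE: succeeds only for exactly 48 lowercase hex characters.
is_valid_secret() {
  [ "${#1}" -eq 48 ] || return 1
  case "$1" in
    *[!0-9a-f]*) return 1 ;;   # any character outside 0-9a-f is rejected
  esac
  return 0
}
```

For example, `is_valid_secret "$(gen_secret)" && echo ok` is expected to print `ok`, while a value that was truncated or picked up whitespace during copy-paste fails the check.
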
### 6. Authentik (IdP)
See `fleet/authentik/README.md`. Create the `dezky-auth` namespace and
`authentik-secret` (DB and Redis passwords read back from `dezky-data` so they
match; `SECRET_KEY` and the bootstrap credentials are freshly generated), then
`kubectl apply -f fleet/authentik/helmchart.yaml`. Reachable at
https://auth.dezky.eu; first login is `akadmin` with the
`AUTHENTIK_BOOTSTRAP_PASSWORD` value.

### 7. Traefik — per-router HTTPS redirect (ACME-safe)
```bash
# NO global entrypoint redirect — it would 301 the HTTP-01 challenge before
# cert-manager's solver router can answer it. Redirect lives per-Ingress via
# a redirectScheme Middleware instead (applied with each tier's kustomize).
kubectl apply -f fleet/traefik/helmchartconfig.yaml
kubectl -n kube-system delete job helm-install-traefik   # force the controller to re-run with merged values
# verify: curl -sI http://app.dezky.eu -> 301 https://...  AND new certs still issue
```

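For orientation, a per-Ingress redirect of this kind usually consists of a Traefik `Middleware` plus an annotation on the Ingress. The sketch below only prints such a manifest. The name `redirect-https` and both helper functions are hypothetical (the real objects live in each tier's kustomize), and the `traefik.io/v1alpha1` API group assumes Traefik v3.

```bash
# redirect_middleware NAMESPACE -> prints a redirectScheme Middleware manifest
# (the name "redirect-https" is a placeholder, not the repo's actual name)
redirect_middleware() {
  cat <<EOF
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: redirect-https
  namespace: $1
spec:
  redirectScheme:
    scheme: https
    permanent: true
EOF
}

# middleware_annotation NAMESPACE -> value for the Ingress annotation
#   traefik.ingress.kubernetes.io/router.middlewares
# Traefik expects <namespace>-<name>@kubernetescrd.
middleware_annotation() {
  printf '%s-redirect-https@kubernetescrd\n' "$1"
}
```

`redirect_middleware dezky-apps | kubectl apply -f -` would create the object, and `middleware_annotation dezky-apps` gives the string to set on the Ingress. Because the redirect is attached per router, cert-manager's temporary solver Ingress is not affected by it.
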
### 8. App tier (portal · platform-api · booking · operator)
```bash
# Secrets first (out-of-band, values from Bitwarden / generated — see
# fleet/README.md "Required env / secrets" + fleet/apps/secrets.example.yaml):
#   portal-secrets, booking-secrets, operator-secrets, platform-api-secrets
kubectl apply -k fleet/apps/
kubectl -n dezky-apps get pods     # all Running once images exist in the registry
```

### 9. CI runner + push-to-deploy
```bash
# In-cluster Gitea Actions runner (gitea/runner + privileged dind sidecar).
# Registration token from Gitea: Settings → Actions → Runners → Create token.
kubectl create namespace gitea-runner --dry-run=client -o yaml | kubectl apply -f -
kubectl -n gitea-runner create secret generic gitea-runner-token \
  --from-literal=token=<registration token>
kubectl apply -f fleet/ci/gitea-runner.yaml

# Deploy ServiceAccount + kubeconfig for the pipeline's deploy job:
kubectl apply -f fleet/ci/ci-deployer.yaml
# mint the kubeconfig (full recipe in fleet/README.md "Deploy") and store it
# as the KUBECONFIG_B64 repo secret; create a Gitea PAT with package
# read+write and store as REGISTRY_TOKEN.

# Gotchas baked into fleet/ci/gitea-runner.yaml — don't "simplify" them away:
#  - gitea/runner 1.x (NOT act_runner 0.2.x: Gitea 1.26 never marks its jobs
#    complete, which freezes runs at "Complete job").
#  - dind shares /var/run with the runner: jobs can only get a docker host
#    by bind-mounting a UNIX socket (tcp://+TLS can't be mounted).
#  - docker:24-dind (moby 27 has a cgroup-v2 teardown deadlock).
```

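The `_B64` suffix suggests the kubeconfig is stored base64-encoded on a single line. The sketch below shows only that encode/decode round trip; minting the ci-deployer kubeconfig itself is covered by the recipe in `fleet/README.md`, and how the workflow decodes the secret is an assumption here. Both helper names are hypothetical.

```bash
# encode_kubeconfig FILE -> single-line base64, suitable for pasting into a
# repo secret (newlines stripped so the output is the same on GNU and BSD base64)
encode_kubeconfig() {
  base64 < "$1" | tr -d '\n'
}

# decode_kubeconfig: reads base64 on stdin, writes the original file content
decode_kubeconfig() {
  base64 -d
}
```

A quick check before saving the secret: `encode_kubeconfig ci-deployer.kubeconfig | decode_kubeconfig | diff - ci-deployer.kubeconfig` should print nothing (the file name is a placeholder).
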
## Secrets — read live values for Bitwarden

```bash
k(){ kubectl -n dezky-data get secret "$1" -o jsonpath="{.data.$2}" | base64 -d; echo; }
k postgres-secret POSTGRES_PASSWORD
k postgres-secret AUTHENTIK_DB_PASSWORD   # must match Authentik's DB config
k postgres-secret OCIS_DB_PASSWORD        # must match OCIS's DB config
k mongo-secret    root-password
k redis-secret    REDIS_PASSWORD

a(){ kubectl -n dezky-apps get secret platform-api-secrets -o jsonpath="{.data.$1}" | base64 -d; echo; }
a SCHEDULING_CREDENTIAL_KEY   # AES key for stored scheduling creds — losing it orphans them
a AUDIT_SIGNING_KEY           # audit hash-chain key — rotation closes the segment
```

## Still TODO (next layers)

1. **OCIS** (files) — uses the `ocis` Postgres DB + Hetzner Object Storage
   (S3). platform-api already carries placeholder `OCIS_*` config
   (`fleet/apps/platform-api-config.yaml`) — swap in real values when live.
2. **Audit cold storage** — Hetzner Object Storage bucket + real
   `AUDIT_COLD_*` keys in `platform-api-secrets`; flip `ARCHIVE_ENABLED`.
3. **Backups** — Longhorn → Hetzner Object Storage (`fleet/longhorn/README.md`),
   plus host Restic for the mail store + etcd snapshots, plus pg_dump/mongodump
   CronJobs.
4. **Stripe live keys** — billing is dark-launched and currently switched off
   (`BILLING_STRIPE_ENABLED: "false"` in the app config).

Done since first build: ✅ Authentik + OIDC blueprints · ✅ Stalwart on the
host · ✅ app tier (incl. operator) · ✅ CI/CD push-to-deploy · ✅ DNS A
records for `api`/`app`/`booking`/`auth`/`mail`/`operator` under `dezky.eu`.

## Access cheatsheet
- SSH: `ssh dezky@46.4.78.187` (key only). Root SSH disabled.
- kubectl: works as `dezky` (kubeconfig at `~/.kube/config`).
- Out-of-band if locked out: Hetzner Robot KVM/LARA or Rescue System.
- The `level=warning … 50-rancher.yaml: permission denied` from kubectl is
  harmless noise (k3s kubectl probing a root-only config dir).