From 153d7053ca3dbf91d592c47ebec8546fcb0e8f8d Mon Sep 17 00:00:00 2001 From: Ronni Baslund Date: Mon, 8 Jun 2026 18:39:31 +0200 Subject: [PATCH] =?UTF-8?q?feat(infra):=20k3s=20foundation=20=E2=80=94=20c?= =?UTF-8?q?ert-manager,=20Longhorn=20config,=20in-cluster=20data=20tier?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds the production cluster foundation (authored + applied live on node1): - cert-manager via the k3s HelmChart controller + letsencrypt staging/prod ClusterIssuers (HTTP-01 / Traefik). - Longhorn config for single-node (values: replica=1, default StorageClass, Retain) + backup-to-Hetzner-Object-Storage credential template. - In-cluster data tier (dezky-data): Postgres 16 (with Authentik+OCIS DB init), MongoDB 7, Redis 7 as StatefulSets on Longhorn, + secret template. - bootstrap.sh: install open-iscsi/nfs-common + enable iscsid (Longhorn prereq). - RUNBOOK.md: full reproducible node1 build order. Real secrets are generated on-box and kept in Bitwarden — never in git. 
--- .gitignore | 3 + infrastructure/production/RUNBOOK.md | 114 ++++++++++++++++++ .../production/fleet/cert-manager/README.md | 29 +++++ .../fleet/cert-manager/cert-manager.yaml | 37 ++++++ .../fleet/cert-manager/cluster-issuer.yaml | 43 +++++++ .../production/fleet/data/README.md | 49 ++++++++ .../production/fleet/data/kustomization.yaml | 12 ++ .../production/fleet/data/mongodb.yaml | 78 ++++++++++++ .../production/fleet/data/namespace.yaml | 6 + .../production/fleet/data/postgres-init.yaml | 20 +++ .../production/fleet/data/postgres.yaml | 82 +++++++++++++ .../production/fleet/data/redis.yaml | 78 ++++++++++++ .../fleet/data/secrets.example.yaml | 39 ++++++ .../production/fleet/longhorn/README.md | 68 +++++++++++ .../fleet/longhorn/backup-secret.example.yaml | 28 +++++ .../production/fleet/longhorn/values.yaml | 42 +++++++ infrastructure/production/host/bootstrap.sh | 6 +- 17 files changed, 733 insertions(+), 1 deletion(-) create mode 100644 infrastructure/production/RUNBOOK.md create mode 100644 infrastructure/production/fleet/cert-manager/README.md create mode 100644 infrastructure/production/fleet/cert-manager/cert-manager.yaml create mode 100644 infrastructure/production/fleet/cert-manager/cluster-issuer.yaml create mode 100644 infrastructure/production/fleet/data/README.md create mode 100644 infrastructure/production/fleet/data/kustomization.yaml create mode 100644 infrastructure/production/fleet/data/mongodb.yaml create mode 100644 infrastructure/production/fleet/data/namespace.yaml create mode 100644 infrastructure/production/fleet/data/postgres-init.yaml create mode 100644 infrastructure/production/fleet/data/postgres.yaml create mode 100644 infrastructure/production/fleet/data/redis.yaml create mode 100644 infrastructure/production/fleet/data/secrets.example.yaml create mode 100644 infrastructure/production/fleet/longhorn/README.md create mode 100644 infrastructure/production/fleet/longhorn/backup-secret.example.yaml create mode 100644 
infrastructure/production/fleet/longhorn/values.yaml diff --git a/.gitignore b/.gitignore index 16a7577..2275037 100644 --- a/.gitignore +++ b/.gitignore @@ -37,6 +37,9 @@ data/ # But keep app-level data/ dirs — operator carries mock fixtures there. !apps/*/data/ !apps/*/data/** +# ...and the production fleet data-tier manifests (k8s YAML, not volume data). +!infrastructure/production/fleet/data/ +!infrastructure/production/fleet/data/** # Coverage coverage/ diff --git a/infrastructure/production/RUNBOOK.md b/infrastructure/production/RUNBOOK.md new file mode 100644 index 0000000..15bf4bc --- /dev/null +++ b/infrastructure/production/RUNBOOK.md @@ -0,0 +1,114 @@ +# Dezky production — node1 build runbook + +The actual, reproducible order used to stand up **node1.dezky.eu** (Hetzner +AX41, `46.4.78.187`, Ubuntu 24.04). If the box is lost, follow this top to +bottom to rebuild it. Per-layer detail lives in `host/README.md`, +`fleet/cert-manager/`, `fleet/longhorn/`, `fleet/data/`. + +> Secrets are **never** in git. They're generated with `openssl rand -hex 24` +> and stored in **Bitwarden**. See "Secrets" below for how to read the live +> values back out of the cluster. + +## Current state (built 2026-06-08) + +- **Host:** hardened via `host/bootstrap.sh` — `dezky` admin user, **key-only + SSH** (no root, no passwords), k3s-safe nftables firewall (SSH/6443 → mgmt + IPs `46.32.144.38`/`46.32.144.45`; 80/443+mail → world), fail2ban, + unattended-upgrades, `open-iscsi`+`iscsid` (Longhorn prereq). + `dezky` has **NOPASSWD sudo** (`/etc/sudoers.d/90-dezky`). +- **k3s** v1.33.11 — single node (control-plane/etcd/worker), registered in + Rancher (`91.99.122.153`). +- **Longhorn** — default StorageClass, `numberOfReplicas: 1` (single node). +- **cert-manager** + `letsencrypt-staging` / `letsencrypt-prod` (HTTP-01/Traefik). +- **Data tier** (`dezky-data` ns) — Postgres 16, Mongo 7, Redis 7 as + StatefulSets on Longhorn PVCs. Postgres holds the `authentik` + `ocis` DBs. 
+ +## Reproduce from scratch + +### 1. Host layer +```bash +# from laptop +scp -r infrastructure/production/host root@:/opt/dezky-host +# copy/fill config.env on the box (gitignored — MGMT IPs, ADMIN_SSH_PUBKEY, +# RANCHER_* token/checksum, STALWART_*, RESTIC_*) +ssh root@ 'cd /opt/dezky-host && ./bootstrap.sh' +# set a console/sudo password for the admin user, then (optional) NOPASSWD: +ssh root@ 'passwd dezky' +ssh dezky@ "echo 'dezky ALL=(ALL) NOPASSWD:ALL' | sudo tee /etc/sudoers.d/90-dezky && sudo chmod 0440 /etc/sudoers.d/90-dezky" +``` + +### 2. k3s + kubectl access +```bash +ssh dezky@ +sudo /opt/dezky-host/k3s/register.sh # joins the Rancher Custom (K3s) cluster +kubectl --kubeconfig /etc/rancher/k3s/k3s.yaml get nodes # -> Ready +# give dezky a kubeconfig: +mkdir -p ~/.kube && sudo install -m 600 -o dezky -g dezky /etc/rancher/k3s/k3s.yaml ~/.kube/config +``` + +### 3. Longhorn (storage) +```bash +sudo apt-get install -y open-iscsi nfs-common && sudo systemctl enable --now iscsid # (bootstrap.sh does this now) +helm repo add longhorn https://charts.longhorn.io && helm repo update +helm install longhorn longhorn/longhorn -n longhorn-system --create-namespace \ + --version 1.12.0 -f fleet/longhorn/values.yaml # replica=1, default class +# one default SC only: +kubectl patch storageclass local-path -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}' +kubectl -n longhorn-system patch settings.longhorn.io default-replica-count --type=merge -p '{"value":"1"}' +kubectl get storageclass # only 'longhorn (default)' +``` + +### 4. cert-manager + issuers +```bash +kubectl apply -f fleet/cert-manager/cert-manager.yaml +kubectl -n cert-manager rollout status deploy/cert-manager-webhook --timeout=180s +kubectl apply -f fleet/cert-manager/cluster-issuer.yaml +kubectl get clusterissuer # both READY=True +``` + +### 5. 
Data tier +```bash +kubectl create namespace dezky-data --dry-run=client -o yaml | kubectl apply -f - +# secrets — generate fresh, store in Bitwarden: +kubectl -n dezky-data create secret generic postgres-secret \ + --from-literal=POSTGRES_PASSWORD=$(openssl rand -hex 24) \ + --from-literal=AUTHENTIK_DB_PASSWORD=$(openssl rand -hex 24) \ + --from-literal=OCIS_DB_PASSWORD=$(openssl rand -hex 24) +kubectl -n dezky-data create secret generic mongo-secret \ + --from-literal=root-username=dezky --from-literal=root-password=$(openssl rand -hex 24) +kubectl -n dezky-data create secret generic redis-secret \ + --from-literal=REDIS_PASSWORD=$(openssl rand -hex 24) +kubectl apply -k fleet/data/ +kubectl -n dezky-data get pods,pvc # all Running, PVCs Bound on longhorn +``` + +## Secrets — read live values for Bitwarden + +```bash +k(){ kubectl -n dezky-data get secret "$1" -o jsonpath="{.data.$2}" | base64 -d; echo; } +k postgres-secret POSTGRES_PASSWORD +k postgres-secret AUTHENTIK_DB_PASSWORD # must match Authentik's DB config +k postgres-secret OCIS_DB_PASSWORD # must match OCIS's DB config +k mongo-secret root-password +k redis-secret REDIS_PASSWORD +``` + +## Still TODO (next layers) + +1. **Authentik** (`auth.dezky.eu`) — OIDC for the portal; uses the `authentik` + Postgres DB + Redis. +2. **OCIS** (files) — uses the `ocis` Postgres DB + Hetzner Object Storage (S3). +3. **Apps** — `fleet/apps/` (portal · platform-api · booking) + their secrets. +4. **Stalwart** (host) — `host/stalwart/install.sh`; needs DNS + PTR. +5. **Backups** — Longhorn → Hetzner Object Storage (`fleet/longhorn/README.md`), + plus host Restic for the mail store + etcd snapshots, plus pg_dump/mongodump + CronJobs. +6. **DNS** — A records `api`/`app`/`booking`/`auth`/`mail`.dezky.eu → 46.4.78.187, + and PTR for mail. + +## Access cheatsheet +- SSH: `ssh dezky@46.4.78.187` (key only). Root SSH disabled. +- kubectl: works as `dezky` (kubeconfig at `~/.kube/config`). 
+- Out-of-band if locked out: Hetzner Robot KVM/LARA or Rescue System. +- The `level=warning … 50-rancher.yaml: permission denied` from kubectl is + harmless noise (k3s kubectl probing a root-only config dir). diff --git a/infrastructure/production/fleet/cert-manager/README.md b/infrastructure/production/fleet/cert-manager/README.md new file mode 100644 index 0000000..fbcc922 --- /dev/null +++ b/infrastructure/production/fleet/cert-manager/README.md @@ -0,0 +1,29 @@ +# fleet/cert-manager — TLS for the cluster + +cert-manager + ACME ClusterIssuers. Installs via the **k3s built-in Helm +controller** (no Helm CLI needed), then defines `letsencrypt-staging` and +`letsencrypt-prod` (HTTP-01 through the bundled Traefik). + +## Apply order (matters — issuers need the CRDs first) + +```bash +# 1) Install cert-manager +kubectl apply -f cert-manager.yaml + +# 2) Wait until it's up (CRDs + webhook ready) +kubectl -n cert-manager rollout status deploy/cert-manager-webhook --timeout=180s +kubectl -n cert-manager get pods + +# 3) Create the issuers +kubectl apply -f cluster-issuer.yaml +kubectl get clusterissuer # both should report READY=True +``` + +## Notes +- ACME email is `info@dezky.eu` — change in `cluster-issuer.yaml` if needed. +- **Test with `letsencrypt-staging` first** (set an Ingress annotation + `cert-manager.io/cluster-issuer: letsencrypt-staging`) to avoid burning the + strict prod rate limits, then switch the apps to `letsencrypt-prod`. +- HTTP-01 requires each hostname's DNS A record → `46.4.78.187` and port 80 + open (already true). A cert won't issue until DNS resolves. +- The app Ingresses (`fleet/apps/`) already reference `letsencrypt-prod`. 
diff --git a/infrastructure/production/fleet/cert-manager/cert-manager.yaml b/infrastructure/production/fleet/cert-manager/cert-manager.yaml new file mode 100644 index 0000000..29e00a8 --- /dev/null +++ b/infrastructure/production/fleet/cert-manager/cert-manager.yaml @@ -0,0 +1,37 @@ +# cert-manager, installed via the k3s built-in Helm controller +# (helm.cattle.io/v1). k3s watches HelmChart resources in any namespace and +# runs a `helm install` Job for them — no Helm CLI needed on your laptop. +# +# The chart installs its own CRDs (crds.enabled=true). Apply this first and +# wait for the cert-manager pods to be Running/Ready before applying the +# ClusterIssuers (cluster-issuer.yaml) — the issuers need the CRDs + webhook. +apiVersion: helm.cattle.io/v1 +kind: HelmChart +metadata: + name: cert-manager + namespace: kube-system +spec: + repo: https://charts.jetstack.io + chart: cert-manager + # Pin a version; bump to the latest stable when you upgrade. + version: v1.16.2 + targetNamespace: cert-manager + createNamespace: true + valuesContent: |- + crds: + enabled: true + # Single-node box — keep the footprint modest. + resources: + requests: + cpu: 10m + memory: 64Mi + webhook: + resources: + requests: + cpu: 10m + memory: 32Mi + cainjector: + resources: + requests: + cpu: 10m + memory: 64Mi diff --git a/infrastructure/production/fleet/cert-manager/cluster-issuer.yaml b/infrastructure/production/fleet/cert-manager/cluster-issuer.yaml new file mode 100644 index 0000000..a27e5bf --- /dev/null +++ b/infrastructure/production/fleet/cert-manager/cluster-issuer.yaml @@ -0,0 +1,43 @@ +# ACME ClusterIssuers (HTTP-01 via the k3s-bundled Traefik ingress). +# +# Apply ONLY after cert-manager is Running: +# kubectl -n cert-manager rollout status deploy/cert-manager-webhook +# +# Two issuers: +# - letsencrypt-staging : use while testing (high rate limits, UNTRUSTED +# certs). Point an Ingress at this first to prove the HTTP-01 flow works. 
+# - letsencrypt-prod : the real one the app Ingresses reference. Switch to +# it once staging issues cleanly, to avoid burning Let's Encrypt's strict +# prod rate limits on misconfigurations. +# +# HTTP-01 needs the hostname to resolve to this box (DNS A record -> 46.4.78.187) +# and port 80 reachable — both are already true (firewall opens 80 to the world). +apiVersion: cert-manager.io/v1 +kind: ClusterIssuer +metadata: + name: letsencrypt-staging +spec: + acme: + server: https://acme-staging-v02.api.letsencrypt.org/directory + email: info@dezky.eu + privateKeySecretRef: + name: letsencrypt-staging-account-key + solvers: + - http01: + ingress: + class: traefik +--- +apiVersion: cert-manager.io/v1 +kind: ClusterIssuer +metadata: + name: letsencrypt-prod +spec: + acme: + server: https://acme-v02.api.letsencrypt.org/directory + email: info@dezky.eu + privateKeySecretRef: + name: letsencrypt-prod-account-key + solvers: + - http01: + ingress: + class: traefik diff --git a/infrastructure/production/fleet/data/README.md b/infrastructure/production/fleet/data/README.md new file mode 100644 index 0000000..d5d7a9c --- /dev/null +++ b/infrastructure/production/fleet/data/README.md @@ -0,0 +1,49 @@ +# fleet/data — in-cluster data tier + +PostgreSQL 16 (Authentik + OCIS), MongoDB 7 (portal/platform-api) and Redis 7 +(cache/sessions) as single-node StatefulSets on **Longhorn** volumes +(`storageClassName: longhorn` — see `../longhorn/`), in the `dezky-data` +namespace. Mirrors the dev docker-compose stack. Self-hosted on the box — no +external/managed DBs (EU-sovereign). + +> Prereq: Longhorn must be installed and its `longhorn` StorageClass present +> before applying these (the PVCs request it). See `../longhorn/README.md`. + +Stable in-cluster DNS: +- `postgres.dezky-data.svc.cluster.local:5432` +- `mongo.dezky-data.svc.cluster.local:27017` +- `redis.dezky-data.svc.cluster.local:6379` + +## Apply + +```bash +# 1) Secrets first (out-of-band — NOT in git). 
Generate values with openssl. +cp secrets.example.yaml /tmp/data-secrets.yaml +$EDITOR /tmp/data-secrets.yaml # fill every REPLACE_* (openssl rand -hex 24) +kubectl create namespace dezky-data --dry-run=client -o yaml | kubectl apply -f - +kubectl apply -f /tmp/data-secrets.yaml && rm /tmp/data-secrets.yaml + +# 2) The data tier +kubectl apply -k . + +# 3) Watch them come up +kubectl -n dezky-data rollout status statefulset/postgres +kubectl -n dezky-data rollout status statefulset/mongo +kubectl -n dezky-data rollout status statefulset/redis +kubectl -n dezky-data get pods,pvc +``` + +## Notes +- **Postgres init runs once** (empty data dir): `postgres-init` ConfigMap + creates the `authentik` + `ocis` databases/roles using + `AUTHENTIK_DB_PASSWORD` / `OCIS_DB_PASSWORD` from the secret. If you change + those passwords later, alter the roles in SQL — re-init won't re-run on an + existing volume. +- Store all generated passwords in **Bitwarden**. `AUTHENTIK_DB_PASSWORD` / + `OCIS_DB_PASSWORD` must match what you later give Authentik and OCIS. +- **Backups:** Longhorn snapshots + backs these volumes up to Hetzner Object + Storage (S3) — see `../longhorn/README.md`. Block snapshots of a live DB are + crash-consistent at best, so also run `pg_dump`/`mongodump` CronJobs (added + next) into a Longhorn PVC; restore from those logical dumps, not the raw + data dirs. +- Single replica each — fine for one node. HA/replicas are a later concern. diff --git a/infrastructure/production/fleet/data/kustomization.yaml b/infrastructure/production/fleet/data/kustomization.yaml new file mode 100644 index 0000000..423a32a --- /dev/null +++ b/infrastructure/production/fleet/data/kustomization.yaml @@ -0,0 +1,12 @@ +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization +namespace: dezky-data + +# Non-secret resources only. Real secrets (secrets.example.yaml) are applied +# out-of-band and deliberately NOT listed here — same pattern as apps/. 
+resources: + - namespace.yaml + - postgres-init.yaml + - postgres.yaml + - mongodb.yaml + - redis.yaml diff --git a/infrastructure/production/fleet/data/mongodb.yaml b/infrastructure/production/fleet/data/mongodb.yaml new file mode 100644 index 0000000..ece0970 --- /dev/null +++ b/infrastructure/production/fleet/data/mongodb.yaml @@ -0,0 +1,78 @@ +# MongoDB 7 — portal / platform-api application data (mirrors the dev stack). +# Single-node StatefulSet on a Longhorn volume. App DBs/collections are +# created by the apps on first use; root creds come from mongo-secret. +apiVersion: v1 +kind: Service +metadata: + name: mongo + namespace: dezky-data +spec: + clusterIP: None # headless: stable DNS mongo.dezky-data:27017 + selector: + app: mongo + ports: + - name: mongo + port: 27017 + targetPort: 27017 +--- +apiVersion: apps/v1 +kind: StatefulSet +metadata: + name: mongo + namespace: dezky-data +spec: + serviceName: mongo + replicas: 1 + selector: + matchLabels: + app: mongo + template: + metadata: + labels: + app: mongo + spec: + containers: + - name: mongo + image: mongo:7 + args: ["--bind_ip_all"] + ports: + - containerPort: 27017 + env: + - name: MONGO_INITDB_ROOT_USERNAME + valueFrom: + secretKeyRef: + name: mongo-secret + key: root-username + - name: MONGO_INITDB_ROOT_PASSWORD + valueFrom: + secretKeyRef: + name: mongo-secret + key: root-password + volumeMounts: + - name: data + mountPath: /data/db + resources: + requests: + cpu: 100m + memory: 256Mi + limits: + memory: 1Gi + readinessProbe: + exec: + command: ["mongosh", "--quiet", "--eval", "db.adminCommand('ping')"] + initialDelaySeconds: 15 + periodSeconds: 10 + livenessProbe: + exec: + command: ["mongosh", "--quiet", "--eval", "db.adminCommand('ping')"] + initialDelaySeconds: 30 + periodSeconds: 20 + volumeClaimTemplates: + - metadata: + name: data + spec: + accessModes: ["ReadWriteOnce"] + storageClassName: longhorn + resources: + requests: + storage: 20Gi diff --git 
a/infrastructure/production/fleet/data/namespace.yaml b/infrastructure/production/fleet/data/namespace.yaml new file mode 100644 index 0000000..1d7e5bc --- /dev/null +++ b/infrastructure/production/fleet/data/namespace.yaml @@ -0,0 +1,6 @@ +apiVersion: v1 +kind: Namespace +metadata: + name: dezky-data + labels: + app.kubernetes.io/part-of: dezky diff --git a/infrastructure/production/fleet/data/postgres-init.yaml b/infrastructure/production/fleet/data/postgres-init.yaml new file mode 100644 index 0000000..e302a2a --- /dev/null +++ b/infrastructure/production/fleet/data/postgres-init.yaml @@ -0,0 +1,20 @@ +# Runs once, on first Postgres init (empty data dir), via the official image's +# /docker-entrypoint-initdb.d hook. Creates the per-service databases + roles +# Authentik and OCIS need. Passwords come from the postgres-secret env (see +# secrets.example.yaml) — never hard-code them here. +apiVersion: v1 +kind: ConfigMap +metadata: + name: postgres-init + namespace: dezky-data +data: + 10-extra-databases.sh: | + #!/bin/bash + set -euo pipefail + psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" <<-EOSQL + CREATE ROLE authentik LOGIN PASSWORD '${AUTHENTIK_DB_PASSWORD}'; + CREATE DATABASE authentik OWNER authentik; + + CREATE ROLE ocis LOGIN PASSWORD '${OCIS_DB_PASSWORD}'; + CREATE DATABASE ocis OWNER ocis; + EOSQL diff --git a/infrastructure/production/fleet/data/postgres.yaml b/infrastructure/production/fleet/data/postgres.yaml new file mode 100644 index 0000000..c5c4c7f --- /dev/null +++ b/infrastructure/production/fleet/data/postgres.yaml @@ -0,0 +1,82 @@ +# PostgreSQL 16 — shared RDBMS for Authentik + OCIS (mirrors the dev stack). +# Single-node StatefulSet on a Longhorn volume. Logical dumps for backup +# are added by a pg_dump CronJob into a Longhorn PVC (see ../longhorn/README.md). 
+apiVersion: v1 +kind: Service +metadata: + name: postgres + namespace: dezky-data +spec: + clusterIP: None # headless: stable DNS postgres.dezky-data:5432 + selector: + app: postgres + ports: + - name: postgres + port: 5432 + targetPort: 5432 +--- +apiVersion: apps/v1 +kind: StatefulSet +metadata: + name: postgres + namespace: dezky-data +spec: + serviceName: postgres + replicas: 1 + selector: + matchLabels: + app: postgres + template: + metadata: + labels: + app: postgres + spec: + # No fsGroup needed: the postgres image entrypoint runs as root and + # chowns PGDATA to the postgres user before stepping down. + containers: + - name: postgres + image: postgres:16-alpine + ports: + - containerPort: 5432 + env: + - name: POSTGRES_USER + value: postgres + - name: PGDATA + value: /var/lib/postgresql/data/pgdata # subdir avoids lost+found clash + envFrom: + - secretRef: + name: postgres-secret # POSTGRES_PASSWORD, AUTHENTIK_DB_PASSWORD, OCIS_DB_PASSWORD + volumeMounts: + - name: data + mountPath: /var/lib/postgresql/data + - name: init + mountPath: /docker-entrypoint-initdb.d + resources: + requests: + cpu: 100m + memory: 256Mi + limits: + memory: 1Gi + readinessProbe: + exec: + command: ["pg_isready", "-U", "postgres"] + initialDelaySeconds: 10 + periodSeconds: 10 + livenessProbe: + exec: + command: ["pg_isready", "-U", "postgres"] + initialDelaySeconds: 30 + periodSeconds: 20 + volumes: + - name: init + configMap: + name: postgres-init + volumeClaimTemplates: + - metadata: + name: data + spec: + accessModes: ["ReadWriteOnce"] + storageClassName: longhorn + resources: + requests: + storage: 10Gi diff --git a/infrastructure/production/fleet/data/redis.yaml b/infrastructure/production/fleet/data/redis.yaml new file mode 100644 index 0000000..0a1b198 --- /dev/null +++ b/infrastructure/production/fleet/data/redis.yaml @@ -0,0 +1,78 @@ +# Redis 7 — cache / session store (Authentik, and available to the apps). 
+# Password-protected (requirepass) even in-cluster; AOF persistence on a small +# Longhorn volume so sessions survive restarts. +apiVersion: v1 +kind: Service +metadata: + name: redis + namespace: dezky-data +spec: + clusterIP: None # headless: stable DNS redis.dezky-data:6379 + selector: + app: redis + ports: + - name: redis + port: 6379 + targetPort: 6379 +--- +apiVersion: apps/v1 +kind: StatefulSet +metadata: + name: redis + namespace: dezky-data +spec: + serviceName: redis + replicas: 1 + selector: + matchLabels: + app: redis + template: + metadata: + labels: + app: redis + spec: + containers: + - name: redis + image: redis:7-alpine + command: ["redis-server"] + args: + - "--requirepass" + - "$(REDIS_PASSWORD)" + - "--appendonly" + - "yes" + ports: + - containerPort: 6379 + env: + - name: REDIS_PASSWORD + valueFrom: + secretKeyRef: + name: redis-secret + key: REDIS_PASSWORD + volumeMounts: + - name: data + mountPath: /data + resources: + requests: + cpu: 50m + memory: 64Mi + limits: + memory: 256Mi + readinessProbe: + exec: + command: ["sh", "-c", 'redis-cli -a "$REDIS_PASSWORD" ping'] + initialDelaySeconds: 5 + periodSeconds: 10 + livenessProbe: + exec: + command: ["sh", "-c", 'redis-cli -a "$REDIS_PASSWORD" ping'] + initialDelaySeconds: 15 + periodSeconds: 20 + volumeClaimTemplates: + - metadata: + name: data + spec: + accessModes: ["ReadWriteOnce"] + storageClassName: longhorn + resources: + requests: + storage: 2Gi diff --git a/infrastructure/production/fleet/data/secrets.example.yaml b/infrastructure/production/fleet/data/secrets.example.yaml new file mode 100644 index 0000000..bdbba63 --- /dev/null +++ b/infrastructure/production/fleet/data/secrets.example.yaml @@ -0,0 +1,39 @@ +# SECRET TEMPLATE for the data tier — copy, fill, apply OUT-OF-BAND. +# NEVER commit real values. Excluded from kustomization.yaml on purpose. 
+# +# cp secrets.example.yaml /tmp/data-secrets.yaml +# # fill every REPLACE_* (openssl rand -hex 24) +# kubectl apply -f /tmp/data-secrets.yaml && rm /tmp/data-secrets.yaml +# +# Record these in Bitwarden — losing them locks you out of the DBs. The +# AUTHENTIK_DB_PASSWORD / OCIS_DB_PASSWORD must match what you give Authentik +# and OCIS in their own configs. +apiVersion: v1 +kind: Secret +metadata: + name: postgres-secret + namespace: dezky-data +type: Opaque +stringData: + POSTGRES_PASSWORD: REPLACE_superuser_pw # openssl rand -hex 24 + AUTHENTIK_DB_PASSWORD: REPLACE_authentik_pw # openssl rand -hex 24 + OCIS_DB_PASSWORD: REPLACE_ocis_pw # openssl rand -hex 24 +--- +apiVersion: v1 +kind: Secret +metadata: + name: mongo-secret + namespace: dezky-data +type: Opaque +stringData: + root-username: dezky + root-password: REPLACE_mongo_root_pw # openssl rand -hex 24 +--- +apiVersion: v1 +kind: Secret +metadata: + name: redis-secret + namespace: dezky-data +type: Opaque +stringData: + REDIS_PASSWORD: REPLACE_redis_pw # openssl rand -hex 24 diff --git a/infrastructure/production/fleet/longhorn/README.md b/infrastructure/production/fleet/longhorn/README.md new file mode 100644 index 0000000..fa3c225 --- /dev/null +++ b/infrastructure/production/fleet/longhorn/README.md @@ -0,0 +1,68 @@ +# fleet/longhorn — block storage for the data tier + +Longhorn provides the `longhorn` StorageClass that the data tier (Postgres / +Mongo / Redis) and other stateful apps use. Single node for now (replica = 1): +durability is the same as local disk, but you gain **snapshots** and **off-box +backups to Hetzner Object Storage**, plus a clean path to multi-node later. + +You install Longhorn; this dir holds the **config** (`values.yaml`) + the backup +credential template. + +## 1. Host prerequisite (every node) +`open-iscsi` + a running `iscsid`, and `nfs-common`. 
Already baked into +`../../host/bootstrap.sh` — but the node is already bootstrapped, so install it +**now** on node1: +```bash +sudo apt-get install -y open-iscsi nfs-common +sudo systemctl enable --now iscsid +systemctl is-active iscsid # -> active +``` +(Optional but recommended) run Longhorn's environment check before installing: +```bash +curl -sSfL https://raw.githubusercontent.com/longhorn/longhorn/v1.12.0/scripts/environment_check.sh | bash +``` + +## 2. Install (your step) with this config +```bash +helm repo add longhorn https://charts.longhorn.io && helm repo update +helm install longhorn longhorn/longhorn \ + -n longhorn-system --create-namespace \ + --version 1.12.0 -f values.yaml +kubectl -n longhorn-system rollout status deploy/longhorn-driver-deployer +kubectl get storageclass # 'longhorn' present + (default) +``` + +## 3. Make Longhorn the only default StorageClass +`values.yaml` sets Longhorn as default — now drop k3s's local-path default so +there aren't two: +```bash +kubectl patch storageclass local-path \ + -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}' +kubectl get storageclass # only 'longhorn' shows (default) +``` + +## 4. Backups → Hetzner Object Storage (S3) +1. In Hetzner: create a bucket (e.g. `dezky-longhorn`) + an S3 key pair; note the + endpoint (`https://fsn1.your-objectstorage.com`). +2. Fill + apply `backup-secret.example.yaml` (creds → Bitwarden). +3. Set the backup target (UI: **Settings → General**, or uncomment in + `values.yaml` + upgrade): + - Backup Target: `s3://dezky-longhorn@fsn1/` + - Backup Target Credential Secret: `longhorn-backup-secret` +4. Add a **RecurringJob** (UI → Recurring Job, or a `RecurringJob` CR): e.g. a + nightly `backup` with retention 14, applied to the `default` volume group so + every PV is backed up off-box. 
+ +## How this changes the backup story +Longhorn now owns volume-level snapshots + S3 backups, so the host `restic` +layer no longer needs to capture `/var/lib/rancher/k3s/storage` (local-path). +Keep restic for the **host** bits (Stalwart mail store, k3s etcd snapshots), and +still take **logical DB dumps** (`pg_dump`/`mongodump`) into a Longhorn PVC — +Longhorn backs that up to S3 and a logical dump is what you actually restore +from. (Crash-consistent block snapshots of a live DB are a last resort.) + +## Notes +- Bump `defaultReplicaCount` to 2–3 in `values.yaml` (helm upgrade) once more + nodes join; Longhorn rebalances. +- The UI Ingress is intentionally **off** — it's full storage admin. Gate it + behind an IP allowlist or Authentik before exposing it. diff --git a/infrastructure/production/fleet/longhorn/backup-secret.example.yaml b/infrastructure/production/fleet/longhorn/backup-secret.example.yaml new file mode 100644 index 0000000..d72b511 --- /dev/null +++ b/infrastructure/production/fleet/longhorn/backup-secret.example.yaml @@ -0,0 +1,28 @@ +# Longhorn backup target credentials → Hetzner Object Storage (S3-compatible). +# Template — fill + apply OUT-OF-BAND, never commit real keys. Store the keys +# in Bitwarden. +# +# 1. Create a bucket (e.g. dezky-longhorn) + an S3 key pair in Hetzner Cloud +# Console → Object Storage. Note the endpoint, e.g.: +# Falkenstein https://fsn1.your-objectstorage.com +# Nuremberg https://nbg1.your-objectstorage.com +# Helsinki https://hel1.your-objectstorage.com +# 2. Fill this and apply: +# kubectl apply -f /tmp/longhorn-backup-secret.yaml +# 3. Set the backup target (UI: Settings → General, or in values.yaml): +# Backup Target: s3://dezky-longhorn@fsn1/ +# Backup Target Credential: longhorn-backup-secret +# (The "@fsn1" region tag is just a label for non-AWS S3; the real endpoint +# comes from AWS_ENDPOINTS below.) 
+apiVersion: v1 +kind: Secret +metadata: + name: longhorn-backup-secret + namespace: longhorn-system +type: Opaque +stringData: + AWS_ACCESS_KEY_ID: REPLACE_hetzner_s3_access_key + AWS_SECRET_ACCESS_KEY: REPLACE_hetzner_s3_secret_key + AWS_ENDPOINTS: https://fsn1.your-objectstorage.com + # Hetzner Object Storage uses virtual-hosted-style addressing. + VIRTUAL_HOSTED_STYLE: "true" diff --git a/infrastructure/production/fleet/longhorn/values.yaml b/infrastructure/production/fleet/longhorn/values.yaml new file mode 100644 index 0000000..8f1f7e5 --- /dev/null +++ b/infrastructure/production/fleet/longhorn/values.yaml @@ -0,0 +1,42 @@ +# Longhorn Helm values — single-node config for the dezky AX41 (node1). +# You install Longhorn; feed it these values, e.g.: +# +# helm repo add longhorn https://charts.longhorn.io && helm repo update +# helm install longhorn longhorn/longhorn \ +# -n longhorn-system --create-namespace \ +# --version 1.12.0 -f values.yaml +# +# (Or paste this into Rancher → Apps → Longhorn → Edit YAML.) +# +# Host prereq (added to bootstrap.sh): open-iscsi + a running iscsid + nfs-common +# on EVERY node. Verify: `systemctl is-active iscsid` → active. + +defaultSettings: + # Single node → 1 replica. No cross-node redundancy yet (durability is the + # same as local disk, but you gain snapshots + off-box backups). Bump to 2–3 + # once you add nodes and Longhorn will rebalance. + defaultReplicaCount: 1 + # Replica data lives here on the AX41 NVMe. + defaultDataPath: /var/lib/longhorn + # Don't pack the disk to 100%. + storageMinimalAvailablePercentage: 15 + storageOverProvisioningPercentage: 100 + # Tidy up orphaned replicas automatically. + orphanResourceAutoDeletion: "replica-data" + # ── Backups → Hetzner Object Storage (set after creating the bucket+secret; + # see README). Can also be set in the UI under Settings → General. 
── + # backupTarget: s3://dezky-longhorn@fsn1/ + # backupTargetCredentialSecret: longhorn-backup-secret + +persistence: + # Make Longhorn the DEFAULT StorageClass so PVCs land on it automatically. + # ALSO unset local-path's default flag (one default only — see README). + defaultClass: true + defaultClassReplicaCount: 1 + # Databases: keep the volume if a PVC is deleted, until you reclaim it by hand. + reclaimPolicy: Retain + +# The Longhorn UI is full storage admin — keep its Ingress OFF until you decide +# how to protect it (IP allowlist at Traefik, or behind Authentik forward-auth). +ingress: + enabled: false diff --git a/infrastructure/production/host/bootstrap.sh b/infrastructure/production/host/bootstrap.sh index 32402f9..ff42612 100755 --- a/infrastructure/production/host/bootstrap.sh +++ b/infrastructure/production/host/bootstrap.sh @@ -63,8 +63,12 @@ apt-get upgrade -y -qq apt-get install -y -qq \ nftables fail2ban unattended-upgrades apt-listchanges \ curl ca-certificates gnupg htop tmux vim chrony \ + open-iscsi nfs-common \ >/dev/null -ok "Base packages installed." +# Longhorn requires a running iscsid on every node; nfs-common is needed for +# RWX volumes / NFS backup targets. +systemctl enable --now iscsid >/dev/null 2>&1 || true +ok "Base packages installed (incl. Longhorn prereqs: open-iscsi, nfs-common)." # ── Step 2: hostname + timezone + time sync ──────────────────────────────── info "Step 2: Hostname, timezone (UTC), time sync..."