Ronni Baslund 9d075343c5
feat(infra): migrate Stalwart to the v0.16 config model (config.json)
v0.16 dropped TOML config. The host service now boots from a tiny config.json
that describes only the datastore (RocksDB); all other settings live in the DB
(web UI / stalwart-cli / platform-api JMAP).

- add stalwart/config.json (RocksDb datastore at /opt/stalwart/data)
- install.sh: install config.json instead of config.toml
- stalwart-mail.service: --config points at config.json
- README: document the v0.16 model + remaining DB-side config + DNS/PTR

Verified: Stalwart 0.16.8 runs on node1 with default mail listeners + the :8080
management server. config.toml retained as a reference for the DB settings.
2026-06-08 21:02:17 +02:00

Dezky production — host layer

OS baseline + firewall for the bare-metal Hetzner AX41 that runs the k3s node. This layer is everything that lives on the host (outside Kubernetes): hardening, the k3s-safe firewall, k3s registration, Stalwart mail, and Restic backups.

Everything inside the cluster is managed by Fleet/Rancher once k3s is up. This host layer is the part Fleet can't manage, so it is applied over SSH from reviewed scripts.

Files

File | Purpose
config.env.example | Template for host-specific values
config.env | Real values (gitignored). The source of truth lives only on your machine/box
bootstrap.sh | One-shot OS hardening: user, SSH, sysctl, swap, fail2ban, auto-updates, firewall
firewall/firewall.sh | Renders + applies the k3s-safe nftables ruleset (idempotent)
firewall/dezky-firewall.service | systemd unit; reapplies our table on boot, never flushes globally
k3s/register.sh | Registers the node into Rancher (Custom k3s cluster); secrets from config.env
stalwart/install.sh | Installs Stalwart as a hardened host service (binary, units, secrets, bootstrap cert)
stalwart/config.json | v0.16 boot config; describes only the datastore (RocksDB at /opt/stalwart/data)
stalwart/config.toml | Pre-v0.16 Stalwart config (mail ports on host, JMAP on internal 8080); kept as a reference for the DB-side settings, not loaded by v0.16
stalwart/stalwart-mail.service | systemd unit; non-root + CAP_NET_BIND_SERVICE for low ports; --config points at config.json
stalwart/cert-sync.sh + *.service/*.timer | Pulls the cert-manager mail cert into Stalwart, reloads on change
restic/install.sh | Sets up Restic, the backup SSH key/config, env, and the nightly timer
restic/backup.sh | Backup → primary Storage Box, retention, then copy → Helsinki DR
restic/restore.sh | List/restore snapshots (run drills!)
restic/dezky-backup.service + .timer | Nightly 03:20 UTC backup

The firewall model (read this)

k3s, kube-proxy and flannel manage their own nftables tables (ip/ip6: filter, nat, mangle). The classic mistake is running ufw/firewalld or nft flush ruleset, which wipes or fights those rules and breaks pod networking.

So instead:

  • We own a single dedicated table — inet dezky_fw — with only an INPUT chain (default drop). Separate tables coexist; a packet is dropped if any base chain drops it, so our default-drop INPUT gates host-bound traffic while k3s keeps owning FORWARD/NAT untouched.
  • We explicitly accept the pod (10.42.0.0/16) and service (10.43.0.0/16) CIDRs and the CNI interfaces (cni0, flannel.1) so cluster↔host traffic (API server, kubelet, CoreDNS) is never dropped.
  • We never flush ruleset. The systemd unit's ExecStop removes only our table.
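
The model above can be sketched as a small render function. This is an illustration only, not the contents of firewall.sh: the function name render_dezky_fw is hypothetical, the loopback and ICMP accepts are assumptions, and the IPv6 allowlist is left out for brevity. Ports and CIDRs are the ones listed in this README.

```shell
# Hypothetical sketch: render a ruleset that touches only our own table.
# $1 = comma-separated management IPv4 allowlist (MGMT_ALLOW_V4 in config.env).
render_dezky_fw() {
  cat <<EOF
# Declare-then-delete keeps re-applying idempotent and leaves other tables alone.
table inet dezky_fw
delete table inet dezky_fw

table inet dezky_fw {
  chain input {
    type filter hook input priority filter; policy drop;

    ct state established,related accept
    iifname "lo" accept
    meta l4proto { icmp, ipv6-icmp } accept   # assumption: ICMP/ICMPv6 allowed

    # Cluster <-> host traffic: CNI interfaces plus pod and service CIDRs.
    iifname { "cni0", "flannel.1" } accept
    ip saddr { 10.42.0.0/16, 10.43.0.0/16 } accept

    # Public surfaces: web + ACME, then mail.
    tcp dport { 80, 443 } accept
    tcp dport { 25, 143, 465, 587, 993, 4190 } accept

    # Management: SSH and the k3s API, allowlisted sources only.
    ip saddr { $1 } tcp dport { 22, 6443 } accept
  }
}
EOF
}
```

Written to a file, the output can be syntax-checked with nft -c -f <file> before it is applied with nft -f <file>. Because the file only declares, deletes, and redefines inet dezky_fw, the tables that k3s, kube-proxy and flannel own are never touched.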

Access policy

Surface | Ports | Who
Web + ACME | 80, 443 | World (customers)
Mail | 25, 465, 587, 143, 993, 4190 | World
SSH | 22 | MGMT_ALLOW_V4/V6 only
k3s API | 6443 | MGMT_ALLOW_V4/V6 only

Current management allowlist: home 46.32.144.38, office 46.32.144.45.

The Rancher plane (91.99.122.153) needs no inbound rule — the cluster agent dials out to Rancher over 443, so replies ride the established/related fast-path.

Apply order

Prereqs: AX41 provisioned with Debian 12 (bookworm), reachable as root. config.env filled in — in particular ADMIN_SSH_PUBKEY and SERVER_PUBLIC_IPV4 (still TODO until the box exists).

# From your laptop:
scp -r infrastructure/production/host root@<server-ip>:/opt/dezky-host

# On the server:
ssh root@<server-ip>
cd /opt/dezky-host
# config.env is gitignored, so copy it up separately or recreate it here:
#   cp config.env.example config.env && nano config.env
./bootstrap.sh

bootstrap.sh creates your admin user and installs your key before it disables root/password SSH, so the order is lockout-safe. It's idempotent — re-run anytime.
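
The lockout-safe ordering can be sketched as a guard that refuses to tighten sshd until the admin key is verifiably in place. This is a minimal sketch, not the code in bootstrap.sh: the function names and the idea of writing an sshd drop-in file are assumptions made for illustration.

```shell
# Succeeds only if the authorized_keys file ($1) exists, is non-empty,
# and contains the expected public key ($2) as a whole line.
key_installed() {
  [ -s "$1" ] && grep -qxF -- "$2" "$1"
}

# Write the hardening drop-in ($3) only after the key check passes.
# Returns 1 and writes nothing when the key is missing.
harden_sshd_if_safe() {
  if ! key_installed "$1" "$2"; then
    echo "refusing to lock down SSH: admin key not installed" >&2
    return 1
  fi
  printf '%s\n' 'PermitRootLogin no' 'PasswordAuthentication no' > "$3"
}
```

Because the check runs on every invocation, re-running the script stays safe even if an earlier run was interrupted before the key was installed.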

To touch only the firewall later:

sudo ./firewall/firewall.sh --dry-run   # preview the ruleset
sudo ./firewall/firewall.sh             # render, validate, apply, install unit

Then register into Rancher

Once the host is hardened, register the node as a Custom k3s cluster (create the cluster in Rancher first, choosing the K3s distribution, then paste its token/checksum into config.env):

sudo ./k3s/register.sh                  # downloads agent installer, joins cluster
journalctl -u rancher-system-agent -f   # follow provisioning

Rancher is currently reached by IP, so the installer is fetched with --insecure; the agent's ongoing link is still verified via --ca-checksum. Give Rancher a real hostname + cert later to drop the insecure fetch.
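
Since the installer fetch is unverified, it can help to check the CA checksum by hand before pasting it into config.env. The sketch below assumes the checksum is the SHA-256 digest of Rancher's CA certificate file; the function name is hypothetical and how you obtain the CA file is up to you.

```shell
# Succeeds when the SHA-256 of the CA file ($1) equals the expected
# hex digest ($2). Fails on an unreadable file or an empty digest.
ca_checksum_matches() {
  [ -r "$1" ] && [ -n "$2" ] || return 1
  local actual
  actual="$(sha256sum "$1" | awk '{print $1}')"
  [ "$actual" = "$2" ]
}
```

A mismatch here means the CA you downloaded is not the one the checksum was issued for, so registration should not proceed.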

Then install Stalwart (mail)

sudo ./stalwart/install.sh         # binary + systemd + bootstrap cert
systemctl status stalwart-mail

Requires STALWART_ADMIN_PASSWORD + STALWART_WEBHOOK_SECRET in config.env (openssl rand -hex 24 / -hex 32). See the mail topology below.
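
A small helper for generating both values, as a sketch. It assumes the 24-byte value is meant for the admin password and the 32-byte value for the webhook secret (the order in which they are listed above); gen_hex_secret is a hypothetical name, and the /dev/urandom fallback is only there for hosts without openssl.

```shell
# Print N random bytes ($1) as 2*N lowercase hex characters.
gen_hex_secret() {
  if command -v openssl >/dev/null 2>&1; then
    openssl rand -hex "$1"
  else
    # Fallback (assumption: /dev/urandom is available).
    head -c "$1" /dev/urandom | od -An -tx1 | tr -d ' \n'
    echo
  fi
}

# Lines to paste into config.env.
printf 'STALWART_ADMIN_PASSWORD=%s\n' "$(gen_hex_secret 24)"
printf 'STALWART_WEBHOOK_SECRET=%s\n' "$(gen_hex_secret 32)"
```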

Mail (Stalwart) topology

Stalwart runs on the host, not in k3s — mail must keep flowing regardless of cluster state, and SMTP/IMAP want the real public IP for reputation. The single public IP forces a deliberate split with Traefik:

Concern | Owner | Detail
Mail protocol ports (25/465/587/143/993/4190) | Stalwart (host) | Bound on the public IP; opened to the world by the firewall
Web/JMAP for mail.dezky.eu:443 | Traefik (k3s) | Terminates TLS, reverse-proxies to Stalwart's internal :8080
ACME / TLS issuance | cert-manager (k3s) | Issues mail.dezky.eu via HTTP-01; Stalwart runs no ACME (80/443 are Traefik's)
Cert delivery to mail ports | cert-sync.sh (host) | Reads the cluster TLS secret via local kubeconfig, reloads Stalwart on change
Storage | RocksDB on host disk | Intentionally independent of the in-cluster Postgres
Domain/DKIM provisioning | platform-api (k3s) | JMAP management API at http://<node>:8080/jmap, Basic auth
Audit webhook | Stalwart → platform-api | POSTs to https://api.dezky.eu/ingest/..., HMAC-signed
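
To make the HMAC-signed webhook concrete, here is a sketch of how a receiver could check a signature from the shell. This README does not say which hash or header is used, so HMAC-SHA256 over the raw request body is an assumption, and both function names are hypothetical.

```shell
# Hex HMAC-SHA256 of stdin, keyed with the shared secret ($1).
hmac_sha256_hex() {
  openssl dgst -sha256 -hmac "$1" | awk '{print $NF}'
}

# Succeeds when the received hex signature ($3) matches the body ($2)
# under the shared secret ($1). An empty signature never matches.
webhook_signature_ok() {
  local expected
  expected="$(printf '%s' "$2" | hmac_sha256_hex "$1")"
  [ -n "$3" ] && [ "$expected" = "$3" ]
}
```

This is for debugging a delivery by hand. It passes the secret on the command line and uses a plain string comparison, so it is not a substitute for a constant-time check inside platform-api.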

platform-api Fleet env (must match the host's config.env):

STALWART_API_URL=http://<node-internal-ip>:8080
STALWART_ADMIN_USER=admin
STALWART_ADMIN_PASSWORD=<same as host STALWART_ADMIN_PASSWORD>
STALWART_WEBHOOK_SECRET=<same as host STALWART_WEBHOOK_SECRET>
STALWART_PROVISIONING_ENABLED=true

The firewall already lets the k3s pod CIDR reach host :8080 while blocking the world, so no extra rule is needed.
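
Because the two secrets must match on both sides, a quick comparison catches drift early. The sketch below assumes both sides are available as plain KEY=value files (no quoting, no export prefix); the function names and the idea of keeping a local copy of the platform-api values are hypothetical.

```shell
# Print the value of KEY ($2) from an env file ($1); last assignment wins.
env_get() {
  sed -n "s/^$2=//p" "$1" | tail -n 1
}

# Compare the shared Stalwart secrets between the host config.env ($1)
# and a copy of the platform-api env ($2). A missing or differing value
# is reported on stderr and makes the function return non-zero.
check_shared_secrets() {
  local rc=0 key a b
  for key in STALWART_ADMIN_PASSWORD STALWART_WEBHOOK_SECRET; do
    a="$(env_get "$1" "$key")"
    b="$(env_get "$2" "$key")"
    if [ -z "$a" ] || [ "$a" != "$b" ]; then
      echo "mismatch: $key" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

Only the key name is printed on a mismatch, so the secrets themselves never end up in terminal scrollback or logs.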

Forward dependency: cert-sync.sh needs the fleet layer to create the mail/mail-tls cert secret. Until then Stalwart serves the self-signed bootstrap cert install.sh generated; the timer swaps in the real cert automatically once it exists.
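
The "reload only on change" behaviour can be sketched with a compare-then-replace helper. This is an assumption about the approach, not the code in cert-sync.sh, and install_if_changed is a hypothetical name.

```shell
# Install the new file ($1) over the live file ($2) only when contents differ.
# Returns 0 if the live file was updated, 1 if it was already current,
# 2 if the copy failed (for example, the source is missing).
install_if_changed() {
  if [ -f "$2" ] && cmp -s "$1" "$2"; then
    return 1
  fi
  # Copy next to the destination, then rename, so a reader never sees
  # a half-written certificate.
  cp "$1" "$2.tmp" && mv "$2.tmp" "$2" || return 2
}
```

A timer-driven caller would trigger the Stalwart reload only when this returns 0, which keeps the steady-state run a no-op.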

Finally, backups

sudo ./restic/install.sh           # restic + key + nightly timer
# upload the printed public key to BOTH Storage Boxes (port 23), then:
sudo ./restic/install.sh           # re-run to init the repos
sudo /opt/dezky-backup/backup.sh   # first backup (or wait for 03:20 UTC)

Needs RESTIC_PASSWORD + BACKUP_PRIMARY_REPO (+ BACKUP_DR_REPO) in config.env. See backups below.

Backups (Restic)

Nightly at 03:20 UTC: back up to the primary Storage Box, apply retention, run restic check, then make a dedup-aware copy to the Helsinki DR box.

What | Why
/opt/stalwart/data + /etc | Mail store (RocksDB) + config: the crown jewels
/var/lib/rancher/k3s/server/db/snapshots | k3s etcd snapshots (cluster state)
/var/lib/rancher/k3s/storage | local-path PVCs, incl. where fleet pg_dump/mongodump CronJobs land

  • Retention: 7 daily · 4 weekly · 6 monthly (tunable via BACKUP_RETENTION).
  • Storage Box quirk: SSH/SFTP on port 23, key auth. A single ssh-config wildcard covers both boxes, so one key + restic copy mirrors primary → DR.
  • Encryption: repos are Restic-encrypted with RESTIC_PASSWORD. Store it offline — losing it makes every backup unrecoverable.
  • Alerting: set BACKUP_HEALTHCHECK_URL (e.g. healthchecks.io) for a dead-man's switch — get paged when a nightly run is missed, not when you need to restore.
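
The nightly sequence maps onto four restic invocations. The sketch below only prints them as a plan; it is not backup.sh, the function name is hypothetical, and the retention flags hard-code the defaults above instead of parsing BACKUP_RETENTION. It assumes restic 0.14 or newer, where copy takes --from-repo (older releases used --repo2).

```shell
# Print the nightly sequence as commands, one per line (a dry-run plan).
# $1 = primary repo, $2 = DR repo, remaining args = paths to back up.
backup_plan() {
  local primary="$1" dr="$2"
  shift 2
  echo "restic -r $primary backup $*"
  echo "restic -r $primary forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prune"
  echo "restic -r $primary check"
  # copy runs against the destination repo and pulls from the source.
  echo "restic -r $dr copy --from-repo $primary"
}
```

When run for real, the copy step also needs the source repository's password, for example via the RESTIC_FROM_PASSWORD environment variable.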

Database consistency: live DB files in PVCs are crash-consistent at best. The reliable path is logical dumps — the fleet layer adds pg_dump / mongodump CronJobs that write into a backup PVC under /var/lib/rancher/k3s/storage, which Restic then captures. Restore those dumps, not the raw data dirs.

Run restore drills. A backup you've never restored isn't a backup:

sudo /opt/dezky-backup/restore.sh snapshots
sudo /opt/dezky-backup/restore.sh restore latest /tmp/restore-test

⚠️ Lockout safety

  • Always open a second SSH session and confirm access before closing the one you ran bootstrap in.
  • Management is pinned to home + office IPs. Residential IPs can change — if yours does, you'll be locked out of SSH/6443 (public services stay up).
  • Break-glass: Hetzner's KVM/LARA console (Robot panel) is out-of-band and bypasses the firewall entirely. From there you can edit /etc/nftables.d/dezky-fw.nft or update config.env + re-run firewall.sh.
  • If your IP changes often, widen MGMT_ALLOW_V4 to a small prefix, or we add a WireGuard bastion later.

Verifying after apply

sudo nft list table inet dezky_fw     # our rules
sudo nft list ruleset | grep -c KUBE  # k3s rules still present (>0 once k3s runs)
sudo systemctl status dezky-firewall  # enabled + active (exited)
sudo fail2ban-client status sshd      # jail active
# From a NON-allowlisted network, `ssh` should hang/timeout; 443 should work.

Host layer status

Complete: hardening · firewall · k3s registration · Stalwart · backups.

Next is the Fleet/GitOps layer (infrastructure/production/fleet/): cert-manager + ClusterIssuer, ingress, the data tier (Postgres/Mongo/Redis), Authentik, OCIS + Collabora, and portal + platform-api — plus the mail/mail-tls cert and the DB-dump CronJobs this layer's cert-sync and backups depend on.

Stalwart v0.16 — config model change (IMPORTANT)

v0.16 removed TOML configuration. The host service now boots from stalwart/config.json — a tiny file describing ONLY the datastore (RocksDB at /opt/stalwart/data). Every other setting (listeners, authentication, TLS, domains, DKIM, spam, webhooks) is stored in the DB and managed via the web admin UI, stalwart-cli, or platform-api over JMAP. stalwart/config.toml is kept as a reference for the settings to recreate in the DB; it is NOT loaded by v0.16.

Status (node1): Stalwart 0.16.8 installed + running with default listeners (25/465/587/143/993/4190 + management on :8080). Still to configure (DB-side):

  • Fallback admin password (so platform-api can authenticate) + the audit webhook.
  • TLS for mail.dezky.eu — Stalwart's own ACME, or rework cert-sync.sh to feed the cert-manager cert into the v0.16 DB cert model.
  • Domains / DKIM — provisioned by platform-api over JMAP.

Then publish DNS (MX, SPF, DKIM, DMARC) and set the server's PTR/rDNS record to mail.dezky.eu.
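
Once the PTR is set, it is worth confirming that the reverse lookup really names the mail host. A minimal sketch of the comparison, with a hypothetical function name:

```shell
# Succeeds when a PTR answer ($1) names the expected host ($2).
# Ignores case and the trailing root dot that DNS tools print.
ptr_matches() {
  local got want
  got="$(printf '%s' "$1" | tr 'A-Z' 'a-z')"
  want="$(printf '%s' "$2" | tr 'A-Z' 'a-z')"
  [ -n "$got" ] && [ "${got%.}" = "${want%.}" ]
}
```

Typical use, assuming dig is installed: ptr_matches "$(dig +short -x <server-ip>)" mail.dezky.eu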