feat(operator): live Infrastructure probes + honest split between deployed and planned

The Infrastructure page used to read from a mock fixture that lied in two
ways: it listed services that aren't deployed (Jitsi, Zulip, Cloudflare,
Object Storage, Postmark) and showed hardcoded uptime/latency for the ones
that are. It now shows results from real probes plus a clearly labelled
"planned" section for the rest.

Backend (services/platform-api):
- New src/health/ module — HealthService runs 9 probes in parallel with a
  1.5s timeout each:
    Stalwart  → TCP stalwart:8080
    OCIS      → HTTP GET ocis:9200/health
    Collabora → HTTP GET collabora:9980/hosting/discovery
    Authentik → HTTP GET authentik-server:9000/-/health/ready/
    Postgres  → TCP postgres:5432
    Mongo     → existing Mongoose connection.db.admin().ping()
    Redis     → TCP redis:6379
    Traefik   → TCP traefik:80
    Platform API → trivially ok (this code is running)
  Status thresholds: ok ≤500ms, warn >500ms up to the 1.5s timeout, bad on
  timeout or refused connection.
- HealthController exposes GET /health/platform behind JwtAuthGuard and
  keeps the existing public GET /health for infra liveness checks.
- Moved the old src/health.controller.ts into the new module.
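
For reference, the probe loop described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/health/: the names classify, tcpCheck, runProbe and runAll are hypothetical, and only the 1.5s timeout, the 500ms threshold and the run-in-parallel behaviour come from this commit message.

```typescript
import * as net from 'node:net'

type Status = 'ok' | 'warn' | 'bad'

interface ProbeResult {
  id: string
  status: Status
  latencyMs: number | null // null when the probe failed or timed out
  error?: string
}

const TIMEOUT_MS = 1500 // per-probe timeout from the commit message
const OK_MAX_MS = 500 // at or below this latency a service counts as "ok"

// Map a measured latency (or a failure, passed as null) onto a status.
function classify(latencyMs: number | null): Status {
  if (latencyMs === null) return 'bad'
  return latencyMs <= OK_MAX_MS ? 'ok' : 'warn'
}

// Hypothetical TCP check: resolves once the connection is established,
// rejects on a socket error or when the socket times out.
function tcpCheck(host: string, port: number, timeoutMs = TIMEOUT_MS): Promise<void> {
  return new Promise((resolve, reject) => {
    const socket = net.connect({ host, port })
    socket.setTimeout(timeoutMs)
    socket.once('connect', () => { socket.destroy(); resolve() })
    socket.once('timeout', () => { socket.destroy(); reject(new Error('timeout')) })
    socket.once('error', (err) => { socket.destroy(); reject(err) })
  })
}

// Time a single check and race it against a timer so that a hung
// dependency cannot stall the whole report.
async function runProbe(
  id: string,
  check: () => Promise<void>,
  timeoutMs = TIMEOUT_MS,
): Promise<ProbeResult> {
  const started = Date.now()
  let timer: ReturnType<typeof setTimeout> | undefined
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error('timeout')), timeoutMs)
  })
  try {
    await Promise.race([check(), timeout])
    const latencyMs = Date.now() - started
    return { id, status: classify(latencyMs), latencyMs }
  } catch (err) {
    const error = err instanceof Error ? err.message : String(err)
    return { id, status: 'bad', latencyMs: null, error }
  } finally {
    if (timer !== undefined) clearTimeout(timer)
  }
}

// Run every probe in parallel; one slow or failing probe does not delay the others.
function runAll(probes: Record<string, () => Promise<void>>): Promise<ProbeResult[]> {
  return Promise.all(Object.entries(probes).map(([id, check]) => runProbe(id, check)))
}
```

A call such as runAll({ redis: () => tcpCheck('redis', 6379), postgres: () => tcpCheck('postgres', 5432) }) would then yield one ProbeResult per service. Failures are returned as results rather than thrown, so the endpoint can always answer with a full report.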

Frontend (apps/operator):
- /api/health/platform proxy forwards the operator's access token.
- Infrastructure page swaps SERVICES fixture for useFetch with 30s auto-
  refresh + a manual Refresh button. Cards show real status badge + real
  latency; uptime/error stay as em-dash with a "no probe history yet"
  tooltip until a Prometheus/event-log backend lands.
- Below the live grid, a "Planned · not deployed" section renders 5 dimmed
  cards (Jitsi, Zulip, simpledns.plus, Hetzner Object Storage, Postmark).
  simpledns.plus replaces the misnamed Cloudflare entry — we use
  simpledns.plus, not Cloudflare.
- Subtitle is now truthful: "8 / 9 services live · checked 2s ago".
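
The subtitle string could be derived from the probe results along these lines. This is a sketch only: formatSubtitle and ProbeSummary are hypothetical names, and counting "warn" as live is an assumption this commit message does not state.

```typescript
interface ProbeSummary {
  status: 'ok' | 'warn' | 'bad'
}

// Build the "N / M services live · checked Xs ago" subtitle.
// Assumption: a service in "warn" still counts as live; only "bad" does not.
function formatSubtitle(
  results: ProbeSummary[],
  checkedAt: Date,
  now: Date = new Date(),
): string {
  const live = results.filter((r) => r.status !== 'bad').length
  const ageSec = Math.max(0, Math.round((now.getTime() - checkedAt.getTime()) / 1000))
  return `${live} / ${results.length} services live · checked ${ageSec}s ago`
}
```

Passing now explicitly keeps the function pure, which makes the "checked Xs ago" part easy to test without faking the clock.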

Verified: stopped redis → card flipped to "down · getaddrinfo ENOTFOUND
redis", subtitle reflected 8/9, incident banner appeared. Restarted →
back to 9/9, banner gone.

SERVICES fixture stays in place for Overview's incident banner — replacing
that is a separate follow-up tied to the incident-management backend.
Ronni Baslund
2026-05-24 18:47:38 +02:00
parent 9fac11e668
commit 77a09aaf77
8 changed files with 316 additions and 43 deletions
@@ -108,6 +108,24 @@ export const OP_AUDIT: AuditEntry[] = [
{ id: 'op_8811', when: '09:30:00', actor: 'Anne Baslund', role: 'platform admin', action: 'tos.published', target: 'v2026.05 · all tenants', tenant: '—', ip: '10.0.4.18', tone: 'info' },
]

// Services in the design that haven't been deployed yet. Surfaced as a
// separate "Planned" section on the Infrastructure page so the operator sees
// honest deployment state instead of a fake all-green grid.
export interface PlannedService {
  id: string
  name: string
  role: string
  note: string
}
export const PLANNED_SERVICES: PlannedService[] = [
  { id: 'jitsi', name: 'Jitsi', role: 'Video meetings', note: 'Lands with docker-compose.optional.yml (Phase 7)' },
  { id: 'zulip', name: 'Zulip', role: 'Team chat', note: 'Lands with docker-compose.optional.yml (Phase 7)' },
  { id: 'dns', name: 'simpledns.plus', role: 'DNS · authoritative', note: 'External SaaS · prod only' },
  { id: 'objstore', name: 'Hetzner Object Storage', role: 'Files · S3 backend for OCIS', note: 'External · prod only' },
  { id: 'smtp-out', name: 'Postmark', role: 'Outbound SMTP · transactional email', note: 'External SaaS · prod only' },
]

export type NotificationKind = 'security' | 'user' | 'billing' | 'integration' | 'support' | 'signin'
export type NotificationTone = 'warn' | 'info' | 'neutral' | 'ok' | 'bad'
export interface NotificationItem {