diff --git a/CLAUDE.md b/CLAUDE.md
index ea6fa2b..8ec976d 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -201,6 +201,11 @@ These choices were made deliberately after extensive license/architecture resear
 - **Prefer prose comments** over heavy JSDoc — explain *why*, not *what*
 - **MongoDB** for portal app data (consistent with Målerportal, TurtleLootLine)
 - **PostgreSQL** for services that require it (Authentik, OCIS)
+- **Feature flags ship through `useFeatureFlag('key')`**, NOT hardcoded
+  `if (env === ...)` checks. Risky / plan-gated / kill-switchable features
+  go behind a flag. See [`docs/FEATURE-FLAGS.md`](./docs/FEATURE-FLAGS.md)
+  for when to add one, how to use the composable, and the 4 states / 4 scope
+  axes.
 
 ### Production target (for reference, not deploy now)
 
diff --git a/docs/FEATURE-FLAGS.md b/docs/FEATURE-FLAGS.md
new file mode 100644
index 0000000..b5e619d
--- /dev/null
+++ b/docs/FEATURE-FLAGS.md
@@ -0,0 +1,128 @@
+# Feature flags
+
+Dezky has a real, tenant-aware feature flag system. Use it whenever you ship
+something that should roll out incrementally, be gated per plan/tenant, or
+needs an instant kill switch in production. Don't push risky behavior behind
+hardcoded `if (env === ...)` checks — flip a flag instead.
+
+## When to add a flag
+
+- The change can break things for real customers and you want a kill switch
+- You want to ship to internal / friendly tenants first
+- The feature is gated by plan tier (Pro/Enterprise)
+- You're doing trunk-based development on a feature that takes more than
+  one PR to land
+- Compliance-sensitive features (GDPR export, retention, audit) — kill
+  switch is mandatory
+
+When you **don't** need one: pure UI tweaks, bug fixes, anything that's safe
+to release to everyone at once.
+
+## Where it lives
+
+| Layer | Path | What it does |
+|---|---|---|
+| Schema + service | `services/platform-api/src/flags/` | CRUD + bulk eval (hash-based rollout) |
+| Operator UI | `apps/operator/pages/flags.vue` + `components/FlagDetail.vue` | List, side panel, kill-switch, change history |
+| Portal helper | `apps/portal/composables/useFeatureFlag.ts` | What you'll import from app code |
+| Seed | `services/platform-api/src/seed/seed.service.ts` (`FLAG_SEEDS`) | The 10 flags created on bootstrap |
+
+## Using a flag from app code
+
+In the customer portal:
+
+```vue
+
+
+
+```
+
+- One bulk eval per session — the composable shares a module-level cache.
+- Fail-closed: every flag stays `false` if the eval call errors.
+- The returned ref is reactive — gated UI stays hidden during the ~25ms
+  round-trip and appears when the answer lands.
+
+For multi-flag panels or long-lived sessions:
+
+```ts
+const { flags, ready, refresh } = useFeatureFlags()
+```
+
+The composable's tenant context comes from the signed-in user's JWT — no
+slug parameter. Operator-side checks (where there's no "current tenant")
+go directly through `POST /api/flags/evaluate` with an explicit
+`{ tenantSlug }`.
+
+## Adding a new flag
+
+1. **Add to the seed list** in
+   `services/platform-api/src/seed/seed.service.ts → FLAG_SEEDS`. This
+   documents what the flag is for and ensures every environment gets it
+   on bootstrap. State defaults to `off` for safety.
+2. **Restart platform-api** (or wait for HMR + the bootstrap hook). New
+   keys are upserted via `$setOnInsert` so existing operator edits
+   survive.
+3. **Open `https://operator.dezky.local/flags`**, click the row, set
+   targeting/rollout, save.
+4. **Reference the key** from app code via `useFeatureFlag('your_key')`.
+
+Alternative: create the flag directly through the operator UI's
+"New flag" button. The seed list is for keys that should always exist;
+the UI is for ad-hoc experiments.
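Reviewer sketch (not part of this patch): the bullets under "Using a flag from app code" describe one bulk eval shared through a module-level cache, with every flag reading `false` if the call fails. A minimal, framework-free TypeScript illustration of that pattern might look like the following. `createFlagStore`, `FlagMap`, and the injected `fetchFlags` function are hypothetical names; the real `useFeatureFlag` composable and the response shape of `POST /api/flags/evaluate` are not shown in this diff and may differ.

```typescript
// Hypothetical shapes: assumed for illustration, not taken from the patch.
type FlagMap = Record<string, boolean>;
type FetchFlags = () => Promise<FlagMap>;

function createFlagStore(fetchFlags: FetchFlags) {
  // A single shared promise, so every caller reuses the same bulk eval.
  let cached: Promise<FlagMap> | null = null;

  function load(): Promise<FlagMap> {
    if (cached === null) {
      // Fail-closed: any error (sync or async) collapses to "no flags enabled".
      cached = Promise.resolve()
        .then(fetchFlags)
        .catch((): FlagMap => ({}));
    }
    return cached;
  }

  return {
    // Unknown keys and failed evals both read as false.
    async isEnabled(key: string): Promise<boolean> {
      const flags = await load();
      return flags[key] === true;
    },
    // Drops the cached answer and triggers a fresh bulk eval.
    refresh(): Promise<FlagMap> {
      cached = null;
      return load();
    },
  };
}
```

In the portal, the injected function would presumably wrap the bulk eval call and the resulting boolean would feed a reactive ref; both of those parts are assumptions here.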
+
+## The 4 states
+
+| State | Meaning |
+|---|---|
+| `off` | Disabled for everyone, ignores scope. Default kill-switch state. |
+| `on` | Enabled for everyone, ignores scope. |
+| `targeted` | Explicit allowlist. Requires non-empty scope — empty allowlist evaluates to false ("nobody is on the list yet"). |
+| `rollout` | Scope filter + deterministic hash bucket. `sha256("${tenantId}:${flagKey}") % 100 < pct`. Same tenant always gets the same answer until `pct` changes, so bumping 25→50 only flips the new slice. |
+
+## The 4 scope axes (all optional, AND-ed when set)
+
+- **plans** — `['pro', 'enterprise']`
+- **tenantSlugs** — explicit allowlist of tenants
+- **partnerSlugs** — partner-level pilots (not wired into eval context yet)
+- **environments** — `['prod', 'staging']`
+
+Empty list on an axis = "no restriction on this axis".
+
+## Kill switch
+
+One click in the operator UI flips a flag to `state: 'off'` + `pct: 0` and
+appends a `kill-switch` history entry. Use it when something's misbehaving
+in production and you need it dark immediately. Then triage at leisure.
+
+## Conventions
+
+- **Keys** are snake_case, lowercase, start with a letter. Match the regex
+  in `CreateFlagDto`: `^[a-z][a-z0-9_]{1,62}[a-z0-9]$`.
+- **One flag per intent**. Don't reuse `new_thing_v2` for unrelated
+  features — name them separately.
+- **Delete flags** once a feature is `on` for everyone and you've removed
+  the legacy branch. Stale flags rot fast.
+- **Don't gate auth, billing-critical, or audit-logging code** behind a
+  flag where `false` would silently skip security work. Flags should
+  pick between two correct paths, not enable correctness.
+
+## What's not built yet
+
+- **partnerSlug eval context** — the schema axis exists but the service
+  doesn't currently hydrate `ctx.partnerSlug` from the tenant doc.
+  Add when the first partner-gated flag actually needs it.
+- **User-level flags** — eval is tenant-level only. If you need
+  per-individual gating (e.g. internal preview for specific staff),
+  combine `targeted` + a synthetic single-user tenant for now.
+- **Audit log integration** — flag changes write to embedded `history`
+  on the flag doc, capped at 20. Switch to the real audit collection
+  once that exists.
+- **Server-side cache** — `evaluateAll` re-reads all flags from Mongo
+  on every call. With ~10–50 flags this is fine; if a service ends up
+  evaluating per-request and flag count grows, add a small TTL cache
+  (~5s) in `FlagsService`.
diff --git a/docs/NEXT-STEPS.md b/docs/NEXT-STEPS.md
index 17c13d6..4c4c176 100644
--- a/docs/NEXT-STEPS.md
+++ b/docs/NEXT-STEPS.md
@@ -152,11 +152,18 @@ What landed:
 - Audience-aware JwtAuthGuard accepts both `dezky-portal` and `dezky-operator`
 - `Partner` schema + CRUD endpoints, `Tenant.partnerId` ref
 - Tenant lifecycle (suspend / resume) gated by OperatorGuard
+- **Real Infrastructure live-probes** — `GET /health/platform` runs TCP +
+  HTTP probes against every neighbouring service; UI splits "Live" vs
+  "Planned" with honest status.
+- **Real feature-flag system** — `Flag` schema + CRUD + bulk eval +
+  operator UI + `useFeatureFlag` composable in the portal. Hash-based
+  deterministic rollout. See [`FEATURE-FLAGS.md`](./FEATURE-FLAGS.md).
 - Operator UI: Overview (real KPIs), Tenants (7-tab detail w/ Danger),
-  Partners (attach/detach), Users, Operator team. Visual-only Infrastructure,
-  Feature flags, Audit. Placeholders for Support/Billing/Reports/Settings.
+  Partners (attach/detach), Users, Operator team, real Infrastructure,
+  real Feature flags. Visual-only Audit. Placeholders for
+  Support/Billing/Reports/Settings.
 - Interactions: ⌘K command palette, impersonation stub (modal + banner),
-  incident modal, tweaks panel (theme/density/env)
+  incident modal, tweaks panel, **notification drawer**.
 
 ### Follow-ups before operator hits production
 
@@ -168,8 +175,10 @@ In rough priority order — bulk lifted from OPERATOR-PLAN.md:
 - [ ] **Real audit log collection** — `platform_audit` Mongo collection,
   written by platform-api on every privileged action; stream from there
   instead of `data/fixtures.ts`
-- [ ] **Feature flag backend** — `Flag` schema + per-tenant rollout state
-  + a tiny flag-eval client every service imports
+- [x] **Feature flag backend** — shipped. See
+  [`FEATURE-FLAGS.md`](./FEATURE-FLAGS.md). Remaining sub-tasks:
+  partnerSlug eval context, user-level flags, audit-log integration,
+  server-side cache (all called out in that doc).
 - [ ] **Incident management backend** — `Incident` schema + paging
   (PagerDuty / OpsGenie / custom). Until then, IncidentModal is mock.
 - [ ] **Support ticket queue** — `SupportTicket` schema + email-in