ADR-105: Scenario-driven TLS via an in-VM Traefik router
Context
ADR-104 converged the provisioning contract across install paths but
explicitly deferred TLS: "install.sh still owns concerns the others don't
(SSL/Let's Encrypt, macvlan, host shim); convergence targets provisioning, not
those." This ADR picks up the certificate carve-out — but the investigation
turned up a better shape than the one first drafted.
Two needs that converge on the same component
- HTTPS parity across install paths. All cert logic lives in install.sh only (SECTION 9, ~460 lines: acme.sh/certbot, four modes, nginx config gen, compose overlay, renewal cron). The modern operator.sh init / appliance path cannot stand up TLS: it merely consumes a docker-compose.ssl.yml overlay if one exists (common.sh:142, start-platform.sh:140) and exposes operator.sh recert, which dispatches to operator/lib/recert.sh, a file that does not exist (operator.sh:906). Deploy via the appliance path and you get HTTP only.
- A consistent service endpoint. The appliance exposes management, web, and API surfaces. The CLI, FUSE driver, MCP server, and any other API client need a stable endpoint regardless of internal container ports. nginx embedded in the web container cannot cleanly be that single front door.
The realization
We were about to hand-build what a reverse proxy already provides: a DNS-provider registry, a renewal cron, and nginx TLS config. An in-VM router supplies all of it: ACME (HTTP-01, TLS-ALPN-01, and DNS-01 via lego, which ships ~100 providers including porkbun), automatic renewal, and host/path/port routing. Adopting one turns "build a certificate system" into "configure one and wire our app to it": less code, and the code we keep is glue and policy, not ACME protocol or crypto.
Deployment is not one-size-fits-all
The right TLS posture depends on where the appliance runs. There are five real scenarios with materially different network exposure and credential trust.
Decision
Adopt Traefik (MIT) as the appliance's in-VM router, drive its TLS behavior
from an explicit deployment scenario, and keep the four mode names as
Traefik certResolver configurations. The surviving thesis from this ADR's first
draft holds — converge cert management, four modes, offload is an EXTERNAL_URL
contract, self-signed default, trust-posture by exposure — only the mechanism
changes from acme.sh/nginx to Traefik/lego.
1. Deployment scenarios (the spine)
| Scenario | Network / IP | TLS terminator | Cert posture | DNS auth on appliance |
|---|---|---|---|---|
| dev | containers on a dev box, no VM | none | HTTP only | none |
| private | appliance VM on trusted LAN, real name → private IP; may offer DHCP/route/DNS | in-VM Traefik | manual (cert issued off-box) / self-signed / internal DNS-01 | none (preferred); account-key tolerable only here |
| internet | single VM on a cloud, public IP | in-VM Traefik | Let's Encrypt HTTP-01 / TLS-ALPN-01 (no DNS key) | none |
| proxied | inside an env with an LB/reverse proxy | edge | offload (EXTERNAL_URL, plain HTTP in VM) | none |
| public-nat | public but behind NAT / Cloudflare | in-VM Traefik or tunnel | DNS-01 via challenge delegation (CNAME _acme-challenge → acme-dns) or tunnel-terminated | challenge-only (never zone key) |
operator init selects/derives the scenario; the scenario derives the mode. The
operator chooses where it runs, not which of four cert modes to reason about.
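A minimal sketch of the "scenario derives the mode" step, assuming a shell helper in the style of the existing operator scripts. The function name scenario_to_tls_mode is hypothetical, and where the table lists several postures for one scenario the single default chosen here (selfsigned for private, letsencrypt for public-nat) is an assumption, not a decision recorded in this ADR.

```shell
# Hypothetical helper: map a deployment scenario to one default TLS mode.
# Mode names follow the ADR; the per-scenario defaults are assumptions.
scenario_to_tls_mode() {
    case "$1" in
        dev)        echo "none" ;;         # HTTP only, no Traefik
        private)    echo "selfsigned" ;;   # manual is the off-box alternative
        internet)   echo "letsencrypt" ;;  # HTTP-01 / TLS-ALPN-01, no DNS key
        proxied)    echo "offload" ;;      # edge terminates TLS
        public-nat) echo "letsencrypt" ;;  # DNS-01 via challenge delegation
        *)
            echo "unknown scenario: $1" >&2
            return 1
            ;;
    esac
}
```

An init wizard could call this once and let an explicit operator override win afterwards, which keeps the scenario as the primary question while still allowing manual in private.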
2. Traefik is the in-VM router
Present in every appliance scenario (absent in dev, HTTP-only in proxied). A single ingress routes by host/path to management / web / API, giving the stable endpoint the CLI / FUSE / MCP need. It replaces nginx-in-the-web-container for ingress (the web image goes back to just serving static assets).
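The single-ingress idea can be sketched as a small generator for a Traefik dynamic (file-provider) config. This is an illustration only: emit_dynamic_config is a hypothetical name, the backend URLs are caller-supplied placeholders, and only the three paths and the api port (8000) come from this ADR's topology diagram.

```shell
# Hypothetical helper: print a Traefik dynamic config that routes by path
# prefix to three backends. Arguments: web URL, api URL, management URL.
emit_dynamic_config() {
    web_url=$1
    api_url=$2
    mgmt_url=$3
    # Unquoted heredoc: variables expand, and \` yields a literal backtick.
    cat <<EOF
http:
  routers:
    web:
      rule: "PathPrefix(\`/\`)"
      service: web
    api:
      rule: "PathPrefix(\`/api\`)"
      service: api
    management:
      rule: "PathPrefix(\`/mgmt\`)"
      service: management
  services:
    web:
      loadBalancer:
        servers:
          - url: "${web_url}"
    api:
      loadBalancer:
        servers:
          - url: "${api_url}"
    management:
      loadBalancer:
        servers:
          - url: "${mgmt_url}"
EOF
}

# Example call; the web and management hosts/ports are placeholders.
emit_dynamic_config "http://web:8080" "http://api:8000" "http://management:9000"
```

By default Traefik orders routers by rule length, so the longer /api and /mgmt rules should take precedence over the catch-all / rule; an explicit priority could be added if that ordering ever needs to be pinned.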
3. Modes are Traefik certResolver configs
| Mode | Traefik mechanism | Default? |
|---|---|---|
| selfsigned | default/generated cert served by Traefik | yes (appliance handles itself) |
| letsencrypt | ACME resolver: httpChallenge / tlsChallenge / dnsChallenge (lego) | no |
| manual | file provider points at an operator-supplied cert+key | no |
| offload | Traefik in HTTP-only mode; edge terminates | no |
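As a rough illustration of the letsencrypt row, the sketch below prints a Traefik static-config certificatesResolvers fragment for one of the three challenge types. The helper name emit_acme_resolver, the resolver name le, the storage path, and the entry point name web are all assumptions made for the example; the challenge keys themselves (tlsChallenge, httpChallenge, dnsChallenge with a lego provider) are Traefik's own.

```shell
# Hypothetical helper: print an ACME resolver fragment for Traefik static config.
# Arguments: challenge (tls|http|dns), contact email, lego provider (dns only).
emit_acme_resolver() {
    challenge=$1
    email=$2
    provider=${3:-}
    case "$challenge" in
        tls)
            challenge_yaml='      tlsChallenge: {}'
            ;;
        http)
            # Assumes an entry point named "web" listening on :80.
            challenge_yaml='      httpChallenge:
        entryPoint: web'
            ;;
        dns)
            [ -n "$provider" ] || { echo "dns challenge needs a lego provider" >&2; return 1; }
            challenge_yaml="      dnsChallenge:
        provider: ${provider}"
            ;;
        *)
            echo "unknown challenge: $challenge" >&2
            return 1
            ;;
    esac
    cat <<EOF
certificatesResolvers:
  le:
    acme:
      email: "${email}"
      storage: /letsencrypt/acme.json
${challenge_yaml}
EOF
}
```

For example, emit_acme_resolver dns ops@example.com porkbun would name porkbun as the lego provider; the provider's credentials would still have to reach Traefik through its environment, which is exactly the exposure the trust-posture section constrains.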
4. lego is the DNS provider factory
The hand-built provider registry dissolves. A DNS-01 deployment names a lego
provider (porkbun, cloudflare, …) and passes credentials to the resolver.
porkbun is native. Adding a provider is lego's concern, not ours.
5. Lifecycle: Traefik owns renewal; recert.sh shrinks
Per-mode ownership, now mostly delegated to Traefik:
| Mode | Issue | Renew | Owner |
|---|---|---|---|
| selfsigned | Traefik | regenerate near expiry | appliance (Traefik) |
| letsencrypt | Traefik ACME | Traefik auto-renews | appliance (Traefik) |
| manual | operator, off-box | operator re-supplies (Traefik file-provider hot-reloads) | operator; appliance warns near expiry |
| offload | upstream | upstream | upstream (recert is a no-op) |
operator/lib/recert.sh becomes thin: trigger/verify for manual, no-op for
offload; Traefik handles letsencrypt/selfsigned internally. It still makes
operator.sh recert real, but it is no longer a renewal engine.
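A thin dispatcher of the kind described above might look like the following sketch. It is not the shipped recert.sh: the function name, the default certificate filename under docker/certs/, the 30-day warning window, and the treatment of a none mode are assumptions. The expiry check uses openssl x509 -checkend, which exits non-zero when the certificate expires within the given number of seconds.

```shell
# Hypothetical sketch of a thin recert dispatcher keyed on the TLS mode.
# Arguments: mode, cert path (manual only), warning window in days.
recert_dispatch() {
    mode=$1
    cert_file=${2:-docker/certs/fullchain.pem}   # filename is a placeholder
    warn_days=${3:-30}                           # window is a placeholder
    case "$mode" in
        letsencrypt|selfsigned)
            echo "recert: Traefik owns renewal for ${mode}; nothing to do"
            ;;
        offload|none)
            # offload: upstream renews. none: no cert at all (assumed handling).
            echo "recert: no certificate managed on this box for ${mode}"
            ;;
        manual)
            if [ ! -f "$cert_file" ]; then
                echo "recert: no certificate found at ${cert_file}" >&2
                return 1
            fi
            if openssl x509 -checkend $((warn_days * 86400)) -noout \
                    -in "$cert_file" >/dev/null; then
                echo "recert: ${cert_file} is valid for more than ${warn_days} days"
            else
                echo "recert: ${cert_file} expires within ${warn_days} days; re-supply it" >&2
                return 2
            fi
            ;;
        *)
            echo "recert: unknown TLS mode: ${mode}" >&2
            return 1
            ;;
    esac
}
```

Distinct exit codes for "missing" and "expiring soon" would let a scheduled check turn the second case into the near-expiry warning the ownership table calls for.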
6. Offload = EXTERNAL_URL contract, proxy-agnostic, Traefik as reference edge
"Behind a proxy" is a first-class mode whose contract is: the operator declares
the public URL including scheme (EXTERNAL_URL=https://kg.example.com), and
the app derives scheme-sensitive outputs (OAuth redirect_uri, cookies, links)
from it while trusting X-Forwarded-Proto/Host/For. Concretely this fixes
headless-init.sh:496 (http://${WEB_HOSTNAME}/callback — scheme hardcoded) and
the API-side OAuth client registration.
The contract is proxy-agnostic (works behind nginx / Caddy / ALB / Cloudflare). Because the appliance speaks Traefik internally, we recommend (not require) Traefik as the reference edge and ship a drop-in router+service snippet — the offload handoff becomes Traefik-to-Traefik with matching forwarded-header defaults. Anyone else writes their own five lines against the same contract.
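The core of the contract is that scheme-sensitive outputs come from EXTERNAL_URL rather than from a hardcoded http://. A minimal sketch, assuming a hypothetical helper named oauth_redirect_uri and reusing the /callback path from the headless-init.sh line cited above:

```shell
# Hypothetical helper: derive the OAuth redirect URI from EXTERNAL_URL.
# Rejects values without an explicit scheme so http/https is never guessed.
oauth_redirect_uri() {
    external_url=${1%/}   # tolerate a single trailing slash
    case "$external_url" in
        http://?*|https://?*)
            printf '%s/callback\n' "$external_url"
            ;;
        *)
            echo "EXTERNAL_URL must start with http:// or https://: $1" >&2
            return 1
            ;;
    esac
}

# Example using the value from the text above.
oauth_redirect_uri "https://kg.example.com"
```

Failing loudly on a scheme-less value is a deliberate choice in this sketch: silently defaulting to http is the kind of mismatch the contract exists to remove.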
7. Trust posture by exposure (credential blast radius)
The unacceptable risk is a network-exposed box holding a credential that can
rewrite the whole DNS zone (porkbun keys are account-wide — no per-zone
scoping). So: DNS authority on the appliance is zero by default and downgradable
to zero in every public scenario. An account-wide key is tolerable only in
private (contained, private-IP blast radius), and even there off-box issuance
(manual mode) is preferred. Public scenarios use HTTP-01 / TLS-ALPN (no key),
offload (no cert), or challenge delegation (a credential that can do
nothing but answer one challenge). --manage-dns self-FQDN, if ever built, is
opt-in and private-only; lego manages challenge TXT only, never A records.
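The trust rule can be expressed as a small policy check. This is a sketch under assumptions: the function name and the three credential class labels (none, challenge-only, account-key) are illustrative, and allowing a challenge-only credential in private is inferred from the rule that the stronger account-wide key is tolerated there.

```shell
# Hypothetical policy check: may this scenario hold this class of DNS credential?
# Returns 0 when allowed, 1 when the combination violates the trust posture.
dns_credential_allowed() {
    scenario=$1
    credential=$2
    case "${scenario}:${credential}" in
        *:none)                    return 0 ;;  # zero DNS authority is always fine
        private:account-key)       return 0 ;;  # tolerable only on a trusted LAN
        private:challenge-only)    return 0 ;;  # assumed: weaker than account-key
        public-nat:challenge-only) return 0 ;;  # challenge delegation, never zone key
        *)                         return 1 ;;
    esac
}

# Example: refuse to configure a DNS-01 resolver with an account-wide key
# on an internet-facing appliance.
if ! dns_credential_allowed internet account-key; then
    echo "refusing account-wide DNS key for scenario internet" >&2
fi
```

Keeping the allowed combinations as an explicit allowlist means any new scenario or credential class is denied until someone decides otherwise.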
8. What we still own (the glue)
Traefik is not magic. We own: scenario → Traefik config generation (the new
core task, replacing the cert-factory extraction); the topology change (add
Traefik, route web/api behind it); EXTERNAL_URL + the OAuth redirect-scheme
fix (our app's bug, not Traefik's); policy/trust enforcement (which resolver
per scenario, warn on account-wide key when EXTERNAL_URL is public, the off-box
manual path); the offload snippet; and the wizard/flags UX.
Licensing & commercial posture
Traefik Proxy core and lego are MIT — permissive, irrevocable on shipped versions, no network-copyleft. A managed-fleet/SaaS built on the bundled Traefik incurs only an attribution obligation; no reconsideration needed. Traefik Enterprise/Hub are optional fleet-scale products (centralized control plane, distributed ACME, API gateway) — the legitimate, additive purchase for a proxied/offload edge at fleet scale, not a forced unlock of baseline function. This is the anti-Neo4j: the core capability we depend on is fully MIT and complete. The stack's only copyleft component is Garage (AGPL-3.0) — a deliberate, accepted choice; run upstream-unmodified it imposes only "offer source." Project stance: maximize openness; the moat is how to use the system, not the code or hosting. Dependency choices optimize for openness and rug-pull-resistance, never code protection.
Diagrams
In-VM topology — one router, one stable endpoint
How the pieces relate: every client reaches the platform through a single Traefik ingress, which routes by path to web / api / management regardless of internal ports. This is what gives the CLI / FUSE / MCP a consistent endpoint.
flowchart LR
B[Browser]
C[kg CLI]
F[FUSE driver]
M[MCP server]
B --> T
C --> T
F --> T
M --> T
subgraph vm[Appliance VM]
T[Traefik router<br>:80 / :443<br>TLS termination + routing]
T -->|/| W[web — static assets]
T -->|/api| A[api :8000]
T -->|/mgmt| MG[management]
A --> P[(Postgres + AGE)]
A --> G[(Garage S3)]
end
Scenario → mode selection — how the choice functions
The operator declares where it runs; that derives the cert mode and who terminates TLS. No one reasons about four cert modes cold.
flowchart TD
Q1{Where does it run?}
Q1 -->|dev box, no VM| DEV[dev<br>HTTP only · no Traefik]
Q1 -->|appliance VM| Q2{Network exposure?}
Q2 -->|trusted LAN<br>private IP| PRIV[private<br>Traefik + manual / self-signed<br>cert issued off-box]
Q2 -->|behind LB / proxy| PROX[proxied<br>Traefik HTTP-only<br>edge terminates · EXTERNAL_URL]
Q2 -->|public IP<br>:80 reachable| INET[internet<br>Traefik + LE HTTP-01 / TLS-ALPN]
Q2 -->|public<br>behind NAT / CF| NAT[public-nat<br>Traefik + DNS-01 delegation<br>or tunnel]
Offload handoff — why EXTERNAL_URL is load-bearing
In proxied, the edge holds the cert and terminates TLS; the appliance speaks
plain HTTP but must still emit https URLs. The contract makes that correct.
sequenceDiagram
participant U as Browser (https)
participant E as Edge proxy
participant T as in-VM Traefik (http)
participant A as api
U->>E: HTTPS request
Note over E: terminates TLS, holds the cert
E->>T: HTTP + X-Forwarded-Proto: https
T->>A: route /api (plain HTTP)
A-->>U: redirect_uri derived from EXTERNAL_URL (https)
Consequences
Positive
- Less code, not more: the DNS-provider registry, renewal cron, and nginx TLS config are deleted; Traefik/lego own ACME, renewal, and routing.
- One stable ingress for management/web/API → consistent CLI/FUSE/MCP endpoint.
- HTTPS parity across all install paths; operator.sh recert stops dangling.
- Scenarios make mode-selection defensible instead of asking operators to reason about four cert modes cold.
- All load-bearing dependencies stay permissively licensed.
Negative
- Topology change: Traefik replacing nginx-in-web ingress touches the appliance control plane (ADR-103/104) and how static assets are served. This is a deliberate re-plumb, not a behavior-preserving refactor.
- A new component (Traefik) in the appliance image.
- install.sh's existing, working SSL section is replaced, not extracted; it must not regress existing production installs (e.g. cube's prior install).
Neutral
- Mode names and the four-mode taxonomy are unchanged; the offload EXTERNAL_URL contract and trust posture survive from the first draft intact.
- Self-managed DNS A-records (DDNS) remain out of scope: lego does challenge TXT only.
Alternatives Considered
- Hand-rolled acme.sh/nginx cert factory (this ADR's first draft). Rejected: rebuilds what Traefik provides and leaves us maintaining security-sensitive ACME and renewal code.
- Per-mode certs without a router. Rejected: solves TLS but not the consistent-endpoint need; we'd still bolt on routing separately.
- Require Traefik at the edge for offload. Rejected: couples offload to one proxy; the contract must stay proxy-agnostic.
- Caddy instead of Traefik (Apache-2.0, native ACME, on-demand TLS). A genuine peer; rejected for now on Traefik's provider ecosystem and existing operator familiarity. Revisit if the open-core optics ever bite — both are permissive.
Migration Note (non-binding sketch)
Incremental, each step shippable:
1. Add a Traefik service to the appliance topology; route web + API behind it (HTTP first, no TLS). This proves routing without touching certs.
2. Generate Traefik static + dynamic config from scenario + EXTERNAL_URL (this is the new shared module: "configure-traefik," not "configure-ssl").
3. Wire certResolvers: self-signed default; LE http/tlsChallenge; dnsChallenge via lego; manual file provider.
4. Introduce EXTERNAL_URL; fix the OAuth redirect scheme (headless-init.sh:496 and API client registration); honor X-Forwarded-Proto.
5. Thin operator/lib/recert.sh: manual re-supply + offload no-op; Traefik owns LE/self-signed renewal. operator.sh recert works.
6. Replace install.sh SECTION 9 with the shared generator; verify no regression for existing standalone installs.
7. Ship the offload Traefik snippet + scenario docs.
8. Deploy cube as private / manual (cert issued off-box on north; no DNS key on cube). This is the first real exercise, and a faithful rehearsal of proxied.
This revises the acme.sh/nginx mechanism of this same ADR; the decision's thesis is unchanged.
Implementation status
- Step 1 (done, PR #517): in-VM Traefik HTTP router, ROUTER_MODE=traefik, docker-compose.traefik.yml + nginx.router.conf; appliance CI asserts the unified ingress.
- Step 4 (done, PR2 commit A): EXTERNAL_URL (scheme+host) as the single source of public identity; OAuth redirect + web VITE_* derive from it, fixing the http/https registration mismatch (was headless-init.sh:515).
- Steps 3, 5 (done, PR2 commits B, C): TLS_MODE=none/selfsigned/manual/letsencrypt/offload. selfsigned = Traefik default cert; manual = file provider over an operator-supplied cert (docker/certs/); letsencrypt = ACME TLS-ALPN-01 (HTTP-01 / DNS-01-via-lego documented as opt-in); offload = HTTP in-VM + EXTERNAL_URL=https. operator/lib/recert.sh is the thin verify/no-op dispatcher. Appliance CI exercises the selfsigned HTTPS path end-to-end (:80→:443 redirect + https web/api).
- Steps 2, 6 (deferred): the install.sh SECTION 9 convergence and a single scenario→config generator are not yet folded in; the modes ship as composable Traefik overlays selected by TLS_MODE for now.
- Step 8 (pending): cube deploy as private / manual (cert issued off-box on north; no DNS key on cube).
DNS-01 on the appliance stays opt-in, private-only (§7): the wired Let's
Encrypt default is secretless TLS-ALPN-01, and cube uses manual precisely so no
DNS credential ever lands on the box.