Backup Object Format

Format version: kg-backup/2
Authority: Normative byte-level reference for the portable backup object, required by ADR-102 §5. ADR-102 fixes the principles; this page fixes the bytes.

Purpose and scope

A backup object is a portable, self-describing serialization of a Kappa Graph knowledge graph's primary inputs — the data that cannot be recomputed. It is designed to round-trip into a destination that shares none of the source's internal indices: a different PostgreSQL cluster, a different platform version, or a different embedding configuration.

The anti-coupling principle (ADR-102 §5): nothing in the payload may reference a source-local index whose meaning the destination cannot reconstruct. AGE's internal id()/graphid is OID-coupled and never appears; neither does any source-platform row id. All identity is carried as either an app-assigned string id (concept_id, source_id, …) or a portable, human-meaningful descriptor declared in the header.

This spec defines the logical object: the header, the dictionaries, and the bulk record shapes. It is transport- and container-agnostic — the same logical object is what the JSON manifest carries and what backup_archive.py wraps alongside Garage media bytes.

Top-level structure

A backup object is a single document with two regions in order:

┌──────────────────────────────────────────────────────────────┐
│ HEADER  — declarative dictionary of portable descriptors,     │
│           declared ONCE (format version, source, profiles,    │
│           vocabulary, epoch_kinds, actors, ontologies, schema) │
├──────────────────────────────────────────────────────────────┤
│ BULK    — record streams that reference HEADER entries by      │
│           compact integer index, NEVER by repeating strings    │
│           (concepts, sources, instances, relationships,        │
│            vocabulary, graph_epochs)                           │
└──────────────────────────────────────────────────────────────┘

{
  "header": { ... },
  "bulk":   { ... }
}

The HEADER is read in full before any bulk record, so every dictionary a bulk record might reference is already resolved. A consumer that cannot interpret the HEADER (see Versioning) MUST refuse the object rather than partially apply the bulk.

The HEADER is a dictionary of portable descriptors, each declared exactly once. Repeated values that would otherwise appear across tens of thousands of bulk records live here as dictionaries; bulk records cite them by integer index (see Dictionary / interning rule).

{
  "header": {
    "format_version": "kg-backup/2",

    "source": {
      "platform": "knowledge-graph-system",
      "version":  "1.7.3"
    },

    "exported_at": "2026-06-01T17:42:08Z",

    "schema_version": 76,

    "embedding_profiles": [
      {
        "identity":           "openai:text-embedding-3-small@1536",
        "vector_space":       "openai-3-small",
        "image_vector_space": null,
        "name":               "default-openai",
        "multimodal":         false
      },
      {
        "identity":           "nomic:nomic-embed-text-v1.5@768",
        "vector_space":       "nomic-v1.5",
        "image_vector_space": "siglip2-base@1024",
        "name":               "local-multimodal-rig",
        "multimodal":         false
      }
    ],

    "default_embedding_profile": 0,

    "relationship_vocabulary": [
      {
        "relationship_type":  "IMPLIES",
        "description":        "...",
        "category":           "logical",
        "added_by":           "system",
        "added_at":           "2026-01-04T00:00:00Z",
        "usage_count":        4210,
        "is_active":          true,
        "is_builtin":         true,
        "synonyms":           null,
        "deprecation_reason": null,
        "direction_semantics":"directional",
        "embedding_model":    "openai:text-embedding-3-small@1536",
        "embedding_generated_at": "2026-01-04T00:00:00Z",
        "embedding":          [ 0.013, -0.041, "..." ]
      }
    ],

    "epoch_kinds": [
      { "kind": "ingestion", "semantic_wallclock": true,  "description": "..." },
      { "kind": "edit",      "semantic_wallclock": true,  "description": "..." },
      { "kind": "reasoning", "semantic_wallclock": false, "description": "..." },
      { "kind": "annealing", "semantic_wallclock": false, "description": "..." }
    ],

    "actors": [
      "system",
      "user:aaronsb",
      "agent:session-9f3c"
    ],

    "content_types": [
      "text/plain",
      "application/pdf",
      "image/png"
    ],

    "ontologies": [
      {
        "name":                      "Philosophy Corpus",
        "default_embedding_profile": 0
      },
      {
        "name":                      "Vision Notes",
        "default_embedding_profile": 1
      }
    ]
  }
}

Header field reference

Field	Type	Meaning
`format_version`	string	Always `kg-backup/2` for this spec. The single compatibility negotiation token.
`source.platform`	string	Producing platform identifier.
`source.version`	string	Producing platform version. Informational; not used for gating.
`exported_at`	string	Export instant, ISO-8601 with explicit `Z` (UTC).
`schema_version`	integer	Highest applied DB migration at export time (`BackupFormat.get_schema_version` reads from `kg_api.schema_migrations`). Informational compatibility hint.
`embedding_profiles[]`	array	Portable embedding-profile descriptors. The dictionary that concept and vocabulary embeddings reference by index. See Embedding-profile identity string.
`default_embedding_profile`	integer	Backup-level default profile index — top of the cascade. See Cascading-default resolution.
`relationship_vocabulary[]`	array	Vocabulary dictionary; also the bulk `vocabulary` rows. Edge `type` fields reference indices into this array.
`epoch_kinds[]`	array	The `kg_api.graph_epoch_kinds` lookup rows (migration 064).
`actors[]`	array of strings	Distinct actor identifiers referenced by epoch rows, interned.
`content_types[]`	array of strings	Distinct MIME types referenced by sources, interned.
`ontologies[]`	array	Ontology descriptors, each with its own default embedding-profile index.

Embedding-profile identity string

Each embedding_profiles[] entry is a portable identity, never a source-local row id. The canonical identity string is:

{provider}:{model}@{dims}

Examples: openai:text-embedding-3-small@1536, nomic:nomic-embed-text-v1.5@768.

This is the value ADR-102 §6 reads to decide keep-vs-recompute on restore: if the target's active profile resolves to the same identity, carried vectors are kept; if not, they are in the wrong space and MUST be regenerated.

The descriptor is derived from export_embedding_profile() (api/app/lib/embedding_config.py) and the kg_api.embedding_profile schema (migrations 055 and 075):

Profile field	Source
`identity`	`{text_provider}:{text_model_name}@{text_dimensions}` — the universal text/prose space (ADR-803, migration 075).
`vector_space`	`embedding_profile.vector_space` — compatibility key for the universal text space. Two profiles with the same `vector_space` produce comparable text embeddings.
`image_vector_space`	`embedding_profile.image_vector_space` (migration 075). The independent same-modality image-index space, formatted `{image_provider}:{image_model_name}@{image_dimensions}` (or its `vector_space` tag). `null` for text-only profiles. Never compared to the text `vector_space`.
`name`	`embedding_profile.name`. Informational, not an identity.
`multimodal`	`embedding_profile.multimodal`. When true the text model also serves the image role and `image_vector_space` is `null`.

The graph has one universal text/prose embedding space (concepts, edges, docs, image-prose). A modality's native embedding (the image vector on a Source) is an independent same-modality search index with its own space and dimensions (ADR-803 / migration 075). The header carries both so a destination can decide keep-vs-recompute per space.

Dictionary / interning rule

The HEADER holds the dictionaries; bulk records reference entries by compact integer index, never by repeating a string. A concept's embedding-profile reference is an integer index into header.embedding_profiles[]; an edge's type is an integer index into header.relationship_vocabulary[]; an epoch row's kind/actor are indices into header.epoch_kinds[] / header.actors[]; a source's content_type is an index into header.content_types[].

This is dictionary/interning encoding: the model string openai:…@1536 is written once in the header, not restated across every concept record.

Cascading-default resolution order

The embedding-profile reference for a record cascades. A record states only its override; absent an override it inherits from a parent scope. The tiers available depend on whether the record is ontology-scoped.

Ontology-scoped records (sources, and any media keyed by them) — 3-tier cascade, most-specific-wins:

1. record-level override     (source.embedding_profile)          — if present
2. ontology-level default    (ontologies[i].default_embedding_profile,
                              matched by the record's document)   — else
3. backup-level default      (header.default_embedding_profile)  — else

Concepts — 2-tier cascade, no ontology tier:

1. record-level override     (concept.embedding_profile)         — if present
2. backup-level default      (header.default_embedding_profile)  — else

Concepts skip the ontology tier because a concept is cross-ontology by design: it associates with ontologies only indirectly via APPEARS → Source{document} and the same concept may appear in several ontologies. It therefore has no single home ontology to inherit a profile from, and forcing one would encode a lossy arbitrary pick into the format (ADR-102 P2). In practice one embedding profile is active per backup, so header.default_embedding_profile covers virtually all concepts; a per-concept embedding_profile override handles the rare mixed-profile backup.

Resolution rules:

A backup with one uniform embedding profile declares it once as header.default_embedding_profile and emits no per-record refs.
A mixed backup declares per-ontology defaults (for sources) and per-concept overrides only where a concept deviates from the backup default.

A consumer resolves a record's effective profile by walking its tiers in order and taking the first index present. The resolved index then keys header.embedding_profiles[] for the keep-vs-recompute decision (ADR-102 §6).

BULK records

The bulk region holds the primary-input record streams. Field lists below are grounded in api/lib/serialization/exporter.py (DataExporter) and api/lib/serialization/format.py (BackupFormat, KgBackupV2Reader), plus the ADR-102 §3 epoch additions.

{
  "bulk": {
    "concepts":      [ ... ],
    "sources":       [ ... ],
    "instances":     [ ... ],
    "evidence":      [ ... ],
    "relationships": [ ... ],
    "vocabulary":    [ ... ],
    "graph_epochs":  [ ... ],
    "ontologies":    [ ... ],
    "scoped_by":     [ ... ],
    "anchored_by":   [ ... ]
  }
}

ontologies, scoped_by, and anchored_by are additive streams. A reader tolerates their absence — older backups predate them and restore with no ontology layer. They are not interned (small cardinality); see Ontology layer.

concepts

Field	Type	Notes
`concept_id`	string	App-assigned. Preserved 1:1 (CLONE) or remapped (adjacent MERGE).
`label`	string
`search_terms`	array of strings
`embedding`	array of floats	The text/prose vector. Interpreted in the space named by the resolved embedding-profile ref. May be recomputed on restore if the target profile differs (ADR-102 §6).
`created_at_epoch`	integer	Epoch `event_id` of first appearance (ADR-203). New in `kg-backup/2`; absent in the legacy `1.0` exporter.
`last_seen_epoch`	integer	Epoch `event_id` of most recent re-evidence. New in `kg-backup/2`.
`embedding_profile`	integer (optional)	Override only. Index into `header.embedding_profiles[]`. Omitted when the ontology/backup default applies.

sources

Field	Type	Notes
`source_id`	string	App-assigned. Media keys are re-derived from this on restore (ADR-102 §7), not carried.
`document`	string	Ontology/document name.
`file_path`	string	Original ingest path.
`paragraph`	integer	Ordinal within document.
`full_text`	string	The source prose — a primary input, always carried.
`garage_key`	string (optional)	Present only when set. Sources predating ADR-307 omit it. Informational; restore reconstructs the key from IDs rather than trusting it.
`content_type`	integer (optional)	Index into `header.content_types[]` (interned, replacing the raw MIME string the legacy exporter emitted inline).
`storage_key`	string (optional)	Image/media storage key. Present only when set. Like `garage_key`, reconstructed on restore.

instances

Instances are normalized: one record per instance node, unique by instance_id, carrying no concept_id. An instance belongs to exactly one source (FROM_SOURCE) but may be evidenced by many concepts (EVIDENCED_BY is M:N) — those links live in the separate evidence stream, so the quote text is stored once rather than repeated per evidenced concept.

(The legacy kg-backup/1 exporter denormalized this, emitting one instance row per concept with concept_id inline and the quote duplicated.)

Field	Type	Notes
`instance_id`	string	App-assigned, unique within the backup. Participates in ID remapping.
`quote`	string	Evidence quote.
`source_id`	string	The originating source (`(i)-[:FROM_SOURCE]->(s)`). Participates in ID remapping.
`created_at_event_id`	integer	FK into `graph_epochs.event_id` (ADR-203). New in `kg-backup/2`. In faithful epoch mode it is remapped through the event-ID map; in simple mode all instances are restamped with the single restore event (ADR-102 §3). The legacy exporter dropped this field entirely.

evidence

The evidence stream carries the M:N Concept→Instance EVIDENCED_BY links, one record per link. Restore reconstructs the EVIDENCED_BY edges from it.

Field	Type	Notes
`concept_id`	string	The evidencing concept. Must exist in `concepts[]`. Participates in ID remapping.
`instance_id`	string	The evidenced instance. Must exist in `instances[]`. Participates in ID remapping.

relationships

Field	Type	Notes
`from`	string	Source `concept_id`. Participates in ID remapping.
`to`	string	Target `concept_id`. Participates in ID remapping.
`type`	integer	Index into `header.relationship_vocabulary[]` (interned; the legacy exporter wrote the raw type string per edge). Resolves to the dynamic edge label on restore.
`properties`	object	Free-form edge properties. See `learned_id` and ID remapping.

The `learned_id` edge property participates in ID remapping

Edge properties is generally free-form, but one key is load-bearing for referential integrity: learned_id. Edges minted from agent-learned knowledge carry { learned_id: <source_id> } (see api/app/lib/age_client/query.py — CREATE (c1)-[r:{type} {learned_id: $source_id}]->(c2)).

learned_id is a source_id by another name. It therefore participates in ID remapping exactly like instances[].source_id: in adjacent MERGE mode, when a source's source_id is remapped to a new UID, every edge property learned_id referencing the old value MUST be rewritten through the same old→new ID map.

A consumer that treats properties as opaque will silently orphan the learned-knowledge linkage (and break delete_learned_relationships, which matches ()-[r {learned_id: $learned_id}]-()). Implementations MUST enumerate learned_id in the reference-remap pass (ADR-102 Consequences: "a missed class silently orphans relationships").

vocabulary

The vocabulary rows mirror DataExporter.export_vocabulary (kg_api.relationship_vocabulary). The same rows are surfaced in header.relationship_vocabulary[] so edge type refs can resolve; the bulk vocabulary stream is the import payload — it reconciles against the target's vocabulary during rehydration (ADR-102 §6).

Field	Type
`relationship_type`	string
`description`	string
`category`	string
`added_by`	string
`added_at`	ISO-8601Z string
`usage_count`	integer
`is_active`	boolean
`is_builtin`	boolean
`synonyms`	array of strings or null
`deprecation_reason`	string or null
`direction_semantics`	string or null
`embedding_model`	string — identity form `{provider}:{model}@{dims}`
`embedding_generated_at`	ISO-8601Z string or null
`embedding`	array of floats or null

graph_epochs — faithful epoch mode only

Present only when the backup is produced for faithful epoch replay (ADR-102 §3, CLONE-only). Omitted entirely in simple mode. These are the kg_api.graph_epochs log rows (migration 063).

Field	Type	Notes
`event_id`	integer	Source-local logical-time id. On faithful restore it is recreated as a new id in the target range; the old→new mapping is applied to every `instances[].created_at_event_id`.
`occurred_at`	ISO-8601Z string	Wall-clock axis; preserved.
`kind`	integer	Index into `header.epoch_kinds[]`.
`actor`	integer or null	Index into `header.actors[]`.
`counter_after`	integer or null	`graph_change_counter` snapshot (ADR-114 cross-ref). Informational.
`metadata`	object	Free-form event metadata.

Faithful replay is coherent only when identity is preserved 1:1 (empty target). In MERGE the epoch collapses to one restore event (simple mode), so the source's graph_epochs rows are not carried.

ontologies, scoped_by, anchored_by — ontology layer

:Ontology nodes are first-class primary inputs: they carry their own embedding (in the concept space), lifecycle_state, and curator metadata that nothing post-restore can reconstruct. Backups that omit them silently drop kg ontology list and the catalog browse tree on restore, and turn frozen ontologies back into writable ones.

These three streams carry the nodes and their two edge classes so the layer round-trips faithfully. They are not interned (ontology cardinality is tiny — tens, not tens of thousands) and are additive: a backup taken before they existed simply lacks the keys, and the reader yields empty.

ontologies — one record per :Ontology node:

Field	Type	Notes
`ontology_id`	string	App-assigned (`ont_<uuid>`). Carried unchanged by ID remapping (its own id space — not a concept/source/instance id). Restored as a property, not the MERGE key.
`name`	string	The ontology name; matches `sources[].document`. The natural key — restore MERGEs on `name` (consistent with `ensure_ontology_exists` and the edge MATCHes), so a same-named ontology with a different `ontology_id` converges to one node rather than duplicating. Never minted or remapped.
`description`	string	Curator description. May be empty.
`embedding`	array of floats	Ontology vector, same space as concepts.
`search_terms`	array of strings	Alternative names for similarity matching.
`lifecycle_state`	string	`active` \| `pinned` \| `frozen`. Load-bearing — a lost `frozen` silently re-opens an ontology to writes.
`creation_epoch`	integer or null	Global epoch when created.
`created_by`	string or null	Creating user (ADR-200).

scoped_by — (:Source)-[:SCOPED_BY]->(:Ontology) membership (the source of truth; Source.document is a denormalized cache):

Field	Type	Notes
`source_id`	string	Participates in ID remapping (source map).
`ontology`	string	The `ontologies[].name` it belongs to. Not remapped.

anchored_by — (:Ontology)-[:ANCHORED_BY]->(:Concept) founding-concept provenance (ADR-200):

Field	Type	Notes
`ontology`	string	The `ontologies[].name`. Not remapped.
`concept_id`	string	Participates in ID remapping (concept map).

The independent validator (scripts/development/lint/lint_backup.py) enforces integrity for all three: duplicate ontology_id/name, duplicate edges, scoped_by/anchored_by endpoints that reference an existing source/concept/ontology, and the ontology embedding dimension against the backup-default profile (E_DUP_ONTOLOGY_* / E_DUP_SCOPED_BY / E_DUP_ANCHORED_BY / E_SCOPED_* / E_ANCHORED_* / E_ONTOLOGY_EMBEDDING_DIM). kg admin verify uses this to round-trip-check the ontology layer without performing a restore.

Exclusions: derived products are not in the backup

Per ADR-102 §4 (primary-in / derived-out), the backup carries primary inputs only. The following are explicitly excluded and are regenerated post-restore against the true post-restore graph state — never serialized:

Projections (ADR-114, projections/…) — derived embedding-landscape snapshots.
Artifacts / scores (ADR-116, artifacts/…) — polarity analyses, grounding results, epistemic scores, and other computed derivations.
Grounding caches and the catalog index.

A derived product is a function of global graph state; introducing any foreign node invalidates it wholesale, so a carried copy would be stale-yet-fresh-looking (silent corruption). Carrying none of them also designs out the fragile per-type concept-ID payload rewriter entirely (ADR-102 §4). The freshness machinery marks derivations stale on restore (record_mutation advances the epoch), so rehydration recomputes them in dependency order — embeddings → vocabulary → scores (ADR-102 §6).

Image and source media bytes are primary inputs (the model consumed them), so they are always backed up — carried in the archive alongside this object, keyed by re-derivable source_id/content keys (ADR-102 §4, §7). This distinguishes them from projections, which are recomputable and excluded.

Versioning and compatibility negotiation

header.format_version is the single negotiation token. It is a slash-delimited {family}/{major} string; this spec defines kg-backup/2.

Producer. Writes its native format_version and a HEADER that satisfies the contract for that version. A producer never downgrades silently.

Consumer. On open, reads header.format_version first and:

Exact major match (kg-backup/2) — accept and apply.
Known family, lower major — refuse. This platform is single-path: there is exactly one backup model and no backwards-compatibility / upcasting layer (ADR-102 P3). The pre-ADR-102 1.0 JSON shape (flat version/type/data, no header) was a prototype and has been removed — no producer emits it and no consumer reads it.
Known family, higher major — refuse. A kg-backup/2 consumer MUST NOT attempt a kg-backup/3 object; it may not understand new required HEADER dictionaries, and partially applying primary inputs is unsafe (ADR-102 §8: restore passes through transiently inconsistent states).
Unknown family — refuse.

KgBackupV2Reader enforces this: it refuses any object whose format_version is not kg-backup/2.

schema_version and source.version are informational and never gate acceptance — only format_version does. The major component bumps on any change that an older consumer could not faithfully interpret: a new required HEADER dictionary, changed cascade semantics, or a changed bulk record shape. Additive, back-compatible HEADER fields that a 2 consumer can safely ignore do not require a major bump.

ADR-102 §5 recommends format_version also be discoverable via API (for example, a format-version endpoint) so producer and consumer negotiate compatibility before transfer. The concrete endpoint shape is an API-contract concern outside this spec.

References

ADR-102 — Portable Backup and Restore with Clone/Merge Semantics (§3 epoch reconciliation; §4 primary-in/derived-out; §5 self-describing versioned header; §6 rehydration; §7 storage keys; §8 partial-apply safety).
ADR-203 — Graph epoch event log (graph_epochs, created_at_event_id).
ADR-803 — Independent image vector space (migration 075).
ADR-114 / ADR-116 — Projections / artifacts (excluded derived products).
ADR-107 — Prior backup/restore streaming architecture; partly superseded by ADR-102.
Source: api/lib/serialization/exporter.py (DataExporter), api/lib/serialization/format.py (BackupFormat, KgBackupV2Reader), api/app/lib/embedding_config.py (export_embedding_profile), migrations 055_embedding_profile.sql, 075_decouple_image_embedding_space.sql, 063_graph_epoch_events.sql, 064_graph_epoch_kinds_lookup.sql.