Backup Object Format

Format version: kg-backup/2
Authority: Normative byte-level reference for the portable backup object, required by ADR-102 §5. ADR-102 fixes the principles; this page fixes the bytes.


Purpose and scope

A backup object is a portable, self-describing serialization of a Kappa Graph knowledge graph's primary inputs — the data that cannot be recomputed. It is designed to round-trip into a destination that shares none of the source's internal indices: a different PostgreSQL cluster, a different platform version, or a different embedding configuration.

The anti-coupling principle (ADR-102 §5): nothing in the payload may reference a source-local index whose meaning the destination cannot reconstruct. AGE's internal id()/graphid is OID-coupled and never appears; neither does any source-platform row id. All identity is carried as either an app-assigned string id (concept_id, source_id, …) or a portable, human-meaningful descriptor declared in the header.

This spec defines the logical object: the header, the dictionaries, and the bulk record shapes. It is transport- and container-agnostic — the same logical object is what the JSON manifest carries and what backup_archive.py wraps alongside Garage media bytes.


Top-level structure

A backup object is a single document with two regions in order:

┌──────────────────────────────────────────────────────────────┐
│ HEADER  — declarative dictionary of portable descriptors,     │
│           declared ONCE (format version, source, profiles,    │
│           vocabulary, epoch_kinds, actors, ontologies, schema) │
├──────────────────────────────────────────────────────────────┤
│ BULK    — record streams that reference HEADER entries by      │
│           compact integer index, NEVER by repeating strings    │
│           (concepts, sources, instances, relationships,        │
│            vocabulary, graph_epochs)                           │
└──────────────────────────────────────────────────────────────┘

{
  "header": { ... },
  "bulk":   { ... }
}

The HEADER is read in full before any bulk record, so every dictionary a bulk record might reference is already resolved. A consumer that cannot interpret the HEADER (see Versioning) MUST refuse the object rather than partially apply the bulk.


HEADER

The HEADER is a dictionary of portable descriptors, each declared exactly once. Repeated values that would otherwise appear across tens of thousands of bulk records live here as dictionaries; bulk records cite them by integer index (see Dictionary / interning rule).

{
  "header": {
    "format_version": "kg-backup/2",

    "source": {
      "platform": "knowledge-graph-system",
      "version":  "1.7.3"
    },

    "exported_at": "2026-06-01T17:42:08Z",

    "schema_version": 76,

    "embedding_profiles": [
      {
        "identity":           "openai:text-embedding-3-small@1536",
        "vector_space":       "openai-3-small",
        "image_vector_space": null,
        "name":               "default-openai",
        "multimodal":         false
      },
      {
        "identity":           "nomic:nomic-embed-text-v1.5@768",
        "vector_space":       "nomic-v1.5",
        "image_vector_space": "siglip2-base@1024",
        "name":               "local-multimodal-rig",
        "multimodal":         false
      }
    ],

    "default_embedding_profile": 0,

    "relationship_vocabulary": [
      {
        "relationship_type":  "IMPLIES",
        "description":        "...",
        "category":           "logical",
        "added_by":           "system",
        "added_at":           "2026-01-04T00:00:00Z",
        "usage_count":        4210,
        "is_active":          true,
        "is_builtin":         true,
        "synonyms":           null,
        "deprecation_reason": null,
        "direction_semantics":"directional",
        "embedding_model":    "openai:text-embedding-3-small@1536",
        "embedding_generated_at": "2026-01-04T00:00:00Z",
        "embedding":          [ 0.013, -0.041, "..." ]
      }
    ],

    "epoch_kinds": [
      { "kind": "ingestion", "semantic_wallclock": true,  "description": "..." },
      { "kind": "edit",      "semantic_wallclock": true,  "description": "..." },
      { "kind": "reasoning", "semantic_wallclock": false, "description": "..." },
      { "kind": "annealing", "semantic_wallclock": false, "description": "..." }
    ],

    "actors": [
      "system",
      "user:aaronsb",
      "agent:session-9f3c"
    ],

    "content_types": [
      "text/plain",
      "application/pdf",
      "image/png"
    ],

    "ontologies": [
      {
        "name":                      "Philosophy Corpus",
        "default_embedding_profile": 0
      },
      {
        "name":                      "Vision Notes",
        "default_embedding_profile": 1
      }
    ]
  }
}

Header field reference

| Field | Type | Meaning |
| --- | --- | --- |
| format_version | string | Always kg-backup/2 for this spec. The single compatibility negotiation token. |
| source.platform | string | Producing platform identifier. |
| source.version | string | Producing platform version. Informational; not used for gating. |
| exported_at | string | Export instant, ISO-8601 with explicit Z (UTC). |
| schema_version | integer | Highest applied DB migration at export time (BackupFormat.get_schema_version reads from kg_api.schema_migrations). Informational compatibility hint. |
| embedding_profiles[] | array | Portable embedding-profile descriptors. The dictionary that concept and vocabulary embeddings reference by index. See Embedding-profile identity string. |
| default_embedding_profile | integer | Backup-level default profile index: the outermost, last-resort tier of the cascade. See Cascading-default resolution order. |
| relationship_vocabulary[] | array | Vocabulary dictionary; also the bulk vocabulary rows. Edge type fields reference indices into this array. |
| epoch_kinds[] | array | The kg_api.graph_epoch_kinds lookup rows (migration 064). |
| actors[] | array of strings | Distinct actor identifiers referenced by epoch rows, interned. |
| content_types[] | array of strings | Distinct MIME types referenced by sources, interned. |
| ontologies[] | array | Ontology descriptors, each with its own default embedding-profile index. |

Embedding-profile identity string

Each embedding_profiles[] entry is a portable identity, never a source-local row id. The canonical identity string is:

{provider}:{model}@{dims}

Examples: openai:text-embedding-3-small@1536, nomic:nomic-embed-text-v1.5@768.

This is the value ADR-102 §6 reads to decide keep-vs-recompute on restore: if the target's active profile resolves to the same identity, carried vectors are kept; if not, they are in the wrong space and MUST be regenerated.
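The keep-vs-recompute comparison above can be sketched as follows. This is a minimal illustration, not the platform's implementation: the names `ProfileIdentity`, `parse_identity`, and `keep_vectors` are hypothetical, and the sketch assumes that "resolves to the same identity" means all three components match.

```python
from typing import NamedTuple


class ProfileIdentity(NamedTuple):
    """Parsed form of '{provider}:{model}@{dims}' (hypothetical helper type)."""
    provider: str
    model: str
    dims: int


def parse_identity(identity: str) -> ProfileIdentity:
    """Split a canonical identity string into provider, model, and dims."""
    # The provider ends at the first ':'; the dims start after the last '@',
    # so a model name may itself contain ':' or '@'.
    provider, colon, rest = identity.partition(":")
    model, at, dims = rest.rpartition("@")
    if not (colon and at and provider and model and dims.isdigit()):
        raise ValueError(f"not a canonical identity string: {identity!r}")
    return ProfileIdentity(provider, model, int(dims))


def keep_vectors(carried_identity: str, target_identity: str) -> bool:
    """True when carried vectors are already in the target's space."""
    return parse_identity(carried_identity) == parse_identity(target_identity)
```

A caller would regenerate embeddings whenever `keep_vectors(...)` returns `False`, since the carried vectors are then in the wrong space.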

The descriptor is derived from export_embedding_profile() (api/app/lib/embedding_config.py) and the kg_api.embedding_profile schema (migrations 055 and 075):

| Profile field | Source |
| --- | --- |
| identity | {text_provider}:{text_model_name}@{text_dimensions} — the universal text/prose space (ADR-803, migration 075). |
| vector_space | embedding_profile.vector_space — compatibility key for the universal text space. Two profiles with the same vector_space produce comparable text embeddings. |
| image_vector_space | embedding_profile.image_vector_space (migration 075). The independent same-modality image-index space, formatted {image_provider}:{image_model_name}@{image_dimensions} (or its vector_space tag). null for text-only profiles. Never compared to the text vector_space. |
| name | embedding_profile.name. Informational, not an identity. |
| multimodal | embedding_profile.multimodal. When true the text model also serves the image role and image_vector_space is null. |

The graph has one universal text/prose embedding space (concepts, edges, docs, image-prose). A modality's native embedding (the image vector on a Source) is an independent same-modality search index with its own space and dimensions (ADR-803 / migration 075). The header carries both so a destination can decide keep-vs-recompute per space.


Dictionary / interning rule

The HEADER holds the dictionaries; bulk records reference entries by compact integer index, never by repeating a string. A concept's embedding-profile reference is an integer index into header.embedding_profiles[]; an edge's type is an integer index into header.relationship_vocabulary[]; an epoch row's kind/actor are indices into header.epoch_kinds[] / header.actors[]; a source's content_type is an index into header.content_types[].

This is dictionary/interning encoding: the model string openai:…@1536 is written once in the header, not restated across every concept record.
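Both sides of the interning rule can be sketched in a few lines. The helper names `intern` and `deref` are hypothetical and not taken from the exporter; the sketch assumes a consumer should reject an out-of-range index rather than guess.

```python
def intern(table: list, positions: dict, value: str) -> int:
    """Producer side: return the index of value, appending it on first sight."""
    if value not in positions:
        positions[value] = len(table)
        table.append(value)
    return positions[value]


def deref(header: dict, dictionary: str, index: int):
    """Consumer side: resolve a bulk record's integer ref against the HEADER."""
    entries = header[dictionary]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(index, bool) or not isinstance(index, int):
        raise ValueError(f"{dictionary} ref must be an integer, got {index!r}")
    if not 0 <= index < len(entries):
        raise ValueError(f"{dictionary} ref {index} is out of range")
    return entries[index]
```

For example, a producer interning "text/plain", "image/png", "text/plain" in that order would emit the indices 0, 1, 0 and write each MIME string once in header.content_types[].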

Cascading-default resolution order

The embedding-profile reference for a record cascades. A record states only its override; absent an override it inherits from a parent scope. The tiers available depend on whether the record is ontology-scoped.

Ontology-scoped records (sources, and any media keyed by them) — 3-tier cascade, most-specific-wins:

1. record-level override     (source.embedding_profile)          — if present
2. ontology-level default    (ontologies[i].default_embedding_profile,
                              matched by the record's document)   — else
3. backup-level default      (header.default_embedding_profile)  — else

Concepts — 2-tier cascade, no ontology tier:

1. record-level override     (concept.embedding_profile)         — if present
2. backup-level default      (header.default_embedding_profile)  — else

Concepts skip the ontology tier because a concept is cross-ontology by design: it associates with ontologies only indirectly via APPEARS → Source{document} and the same concept may appear in several ontologies. It therefore has no single home ontology to inherit a profile from, and forcing one would encode a lossy arbitrary pick into the format (ADR-102 P2). In practice one embedding profile is active per backup, so header.default_embedding_profile covers virtually all concepts; a per-concept embedding_profile override handles the rare mixed-profile backup.

Resolution rules:

  • A backup with one uniform embedding profile declares it once as header.default_embedding_profile and emits no per-record refs.
  • A mixed backup declares per-ontology defaults (for sources) and per-concept overrides only where a concept deviates from the backup default.

A consumer resolves a record's effective profile by walking its tiers in order and taking the first index present. The resolved index then keys header.embedding_profiles[] for the keep-vs-recompute decision (ADR-102 §6).
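The tier walk described above could look roughly like this. The function name and the `ontology_scoped` flag are illustrative assumptions; the field names come from the header and record shapes in this spec.

```python
def resolve_profile_index(header: dict, record: dict, *, ontology_scoped: bool) -> int:
    """Walk the cascade and return the first embedding-profile index present."""
    # Tier 1: record-level override. Compare against None, because index 0
    # is a valid override and would be skipped by a plain truthiness test.
    override = record.get("embedding_profile")
    if override is not None:
        return override

    # Tier 2 (sources only): the default of the ontology named by the
    # record's document. Concepts skip this tier entirely.
    if ontology_scoped:
        for ontology in header.get("ontologies", []):
            if ontology["name"] == record.get("document"):
                default = ontology.get("default_embedding_profile")
                if default is not None:
                    return default
                break

    # Last tier: the backup-level default.
    return header["default_embedding_profile"]
```

With the example header from this page, a source whose document is "Vision Notes" resolves to index 1, while a concept with no override resolves to the backup default, index 0.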


BULK records

The bulk region holds the primary-input record streams. Field lists below are grounded in api/lib/serialization/exporter.py (DataExporter) and api/lib/serialization/format.py (BackupFormat, KgBackupV2Reader), plus the ADR-102 §3 epoch additions.

{
  "bulk": {
    "concepts":      [ ... ],
    "sources":       [ ... ],
    "instances":     [ ... ],
    "evidence":      [ ... ],
    "relationships": [ ... ],
    "vocabulary":    [ ... ],
    "graph_epochs":  [ ... ],
    "ontologies":    [ ... ],
    "scoped_by":     [ ... ],
    "anchored_by":   [ ... ]
  }
}

ontologies, scoped_by, and anchored_by are additive streams. A reader tolerates their absence — older backups predate them and restore with no ontology layer. They are not interned (small cardinality); see Ontology layer.
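Tolerating the absence of the additive streams amounts to defaulting each missing key to an empty list. A minimal sketch, with a hypothetical helper name:

```python
ADDITIVE_STREAMS = ("ontologies", "scoped_by", "anchored_by")


def read_additive_streams(bulk: dict) -> dict:
    """Return the three additive streams, yielding [] for any that are absent."""
    return {name: list(bulk.get(name) or []) for name in ADDITIVE_STREAMS}
```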

concepts

| Field | Type | Notes |
| --- | --- | --- |
| concept_id | string | App-assigned. Preserved 1:1 (CLONE) or remapped (adjacent MERGE). |
| label | string | |
| search_terms | array of strings | |
| embedding | array of floats | The text/prose vector. Interpreted in the space named by the resolved embedding-profile ref. May be recomputed on restore if the target profile differs (ADR-102 §6). |
| created_at_epoch | integer | Epoch event_id of first appearance (ADR-203). New in kg-backup/2; absent in the legacy 1.0 exporter. |
| last_seen_epoch | integer | Epoch event_id of most recent re-evidence. New in kg-backup/2. |
| embedding_profile | integer (optional) | Override only. Index into header.embedding_profiles[]. Omitted when the backup default applies (concepts have no ontology tier). |

sources

| Field | Type | Notes |
| --- | --- | --- |
| source_id | string | App-assigned. Media keys are re-derived from this on restore (ADR-102 §7), not carried. |
| document | string | Ontology/document name. |
| file_path | string | Original ingest path. |
| paragraph | integer | Ordinal within document. |
| full_text | string | The source prose — a primary input, always carried. |
| garage_key | string (optional) | Present only when set. Sources predating ADR-307 omit it. Informational; restore reconstructs the key from IDs rather than trusting it. |
| content_type | integer (optional) | Index into header.content_types[] (interned, replacing the raw MIME string the legacy exporter emitted inline). |
| storage_key | string (optional) | Image/media storage key. Present only when set. Like garage_key, reconstructed on restore. |

instances

Instances are normalized: one record per instance node, unique by instance_id, carrying no concept_id. An instance belongs to exactly one source (FROM_SOURCE) but may be evidenced by many concepts (EVIDENCED_BY is M:N) — those links live in the separate evidence stream, so the quote text is stored once rather than repeated per evidenced concept.

(The legacy 1.0 exporter denormalized this, emitting one instance row per concept with concept_id inline and the quote duplicated.)

| Field | Type | Notes |
| --- | --- | --- |
| instance_id | string | App-assigned, unique within the backup. Participates in ID remapping. |
| quote | string | Evidence quote. |
| source_id | string | The originating source ((i)-[:FROM_SOURCE]->(s)). Participates in ID remapping. |
| created_at_event_id | integer | FK into graph_epochs.event_id (ADR-203). New in kg-backup/2. In faithful epoch mode it is remapped through the event-ID map; in simple mode all instances are restamped with the single restore event (ADR-102 §3). The legacy exporter dropped this field entirely. |

evidence

The evidence stream carries the M:N Concept→Instance EVIDENCED_BY links, one record per link. Restore reconstructs the EVIDENCED_BY edges from it.

| Field | Type | Notes |
| --- | --- | --- |
| concept_id | string | The evidencing concept. Must exist in concepts[]. Participates in ID remapping. |
| instance_id | string | The evidenced instance. Must exist in instances[]. Participates in ID remapping. |
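As an illustration of how the normalized instances and evidence streams fit together, the sketch below checks the two "must exist" constraints and returns the (concept_id, instance_id) pairs from which EVIDENCED_BY edges would be rebuilt. The function name and the choice to raise on a dangling reference are assumptions; the actual restore path may report such errors differently.

```python
def evidence_edges(bulk: dict) -> list:
    """Validate evidence links and return (concept_id, instance_id) pairs."""
    concept_ids = {c["concept_id"] for c in bulk.get("concepts", [])}
    instance_ids = {i["instance_id"] for i in bulk.get("instances", [])}

    edges = []
    for link in bulk.get("evidence", []):
        concept_id, instance_id = link["concept_id"], link["instance_id"]
        if concept_id not in concept_ids:
            raise ValueError(f"evidence references unknown concept {concept_id!r}")
        if instance_id not in instance_ids:
            raise ValueError(f"evidence references unknown instance {instance_id!r}")
        edges.append((concept_id, instance_id))
    return edges
```

Because the quote lives on the instance record, two concepts evidenced by the same instance produce two pairs here but only one copy of the quote text in the backup.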

relationships

| Field | Type | Notes |
| --- | --- | --- |
| from | string | Source concept_id. Participates in ID remapping. |
| to | string | Target concept_id. Participates in ID remapping. |
| type | integer | Index into header.relationship_vocabulary[] (interned; the legacy exporter wrote the raw type string per edge). Resolves to the dynamic edge label on restore. |
| properties | object | Free-form edge properties. See learned_id and ID remapping. |

The learned_id edge property participates in ID remapping

The edge properties object is generally free-form, but one key is load-bearing for referential integrity: learned_id. Edges minted from agent-learned knowledge carry { learned_id: <source_id> } (see api/app/lib/age_client/query.py: CREATE (c1)-[r:{type} {learned_id: $source_id}]->(c2)).

learned_id is a source_id by another name. It therefore participates in ID remapping exactly like instances[].source_id: in adjacent MERGE mode, when a source's source_id is remapped to a new UID, every edge property learned_id referencing the old value MUST be rewritten through the same old→new ID map.

A consumer that treats properties as opaque will silently orphan the learned-knowledge linkage (and break delete_learned_relationships, which matches ()-[r {learned_id: $learned_id}]-()). Implementations MUST enumerate learned_id in the reference-remap pass (ADR-102 Consequences: "a missed class silently orphans relationships").
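A reference-remap pass that includes learned_id might look like the following sketch. The function name and the dict-based old→new maps are illustrative assumptions; only the rule itself (rewrite from, to, and properties.learned_id through the matching map) comes from this spec.

```python
def remap_relationship(edge: dict, concept_map: dict, source_map: dict) -> dict:
    """Return a copy of edge with every ID reference rewritten old -> new."""
    properties = dict(edge.get("properties") or {})

    # learned_id is a source_id by another name, so it goes through the
    # source map, not the concept map. Other property keys stay untouched.
    if "learned_id" in properties:
        old = properties["learned_id"]
        properties["learned_id"] = source_map.get(old, old)

    return {
        **edge,
        "from": concept_map.get(edge["from"], edge["from"]),
        "to": concept_map.get(edge["to"], edge["to"]),
        "properties": properties,
    }
```

IDs absent from a map are passed through unchanged, which matches CLONE mode where identity is preserved 1:1 and both maps can be empty.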

vocabulary

The vocabulary rows mirror DataExporter.export_vocabulary (kg_api.relationship_vocabulary). The same rows are surfaced in header.relationship_vocabulary[] so edge type refs can resolve; the bulk vocabulary stream is the import payload — it reconciles against the target's vocabulary during rehydration (ADR-102 §6).

| Field | Type |
| --- | --- |
| relationship_type | string |
| description | string |
| category | string |
| added_by | string |
| added_at | ISO-8601Z string |
| usage_count | integer |
| is_active | boolean |
| is_builtin | boolean |
| synonyms | array of strings or null |
| deprecation_reason | string or null |
| direction_semantics | string or null |
| embedding_model | string — identity form {provider}:{model}@{dims} |
| embedding_generated_at | ISO-8601Z string or null |
| embedding | array of floats or null |

graph_epochs — faithful epoch mode only

Present only when the backup is produced for faithful epoch replay (ADR-102 §3, CLONE-only). Omitted entirely in simple mode. These are the kg_api.graph_epochs log rows (migration 063).

| Field | Type | Notes |
| --- | --- | --- |
| event_id | integer | Source-local logical-time id. On faithful restore it is recreated as a new id in the target range; the old→new mapping is applied to every instances[].created_at_event_id. |
| occurred_at | ISO-8601Z string | Wall-clock axis; preserved. |
| kind | integer | Index into header.epoch_kinds[]. |
| actor | integer or null | Index into header.actors[]. |
| counter_after | integer or null | graph_change_counter snapshot (ADR-114 cross-ref). Informational. |
| metadata | object | Free-form event metadata. |

Faithful replay is coherent only when identity is preserved 1:1 (empty target). In MERGE the epoch collapses to one restore event (simple mode), so the source's graph_epochs rows are not carried.
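The two epoch modes can be contrasted in a short sketch. It assumes, purely for illustration, that the target hands out contiguous event ids starting at `first_target_id`; a real target would take new ids from its own sequence. The function names are hypothetical.

```python
def remap_faithful(graph_epochs: list, instances: list, first_target_id: int):
    """Faithful mode: recreate epochs in source order and remap instance FKs."""
    ordered = sorted(graph_epochs, key=lambda e: e["event_id"])
    # old event_id -> new event_id, preserving the source's logical order.
    event_map = {e["event_id"]: first_target_id + n for n, e in enumerate(ordered)}

    new_epochs = [{**e, "event_id": event_map[e["event_id"]]} for e in ordered]
    # A KeyError here means an instance cites an event the backup never carried.
    new_instances = [
        {**inst, "created_at_event_id": event_map[inst["created_at_event_id"]]}
        for inst in instances
    ]
    return new_epochs, new_instances


def restamp_simple(instances: list, restore_event_id: int) -> list:
    """Simple mode: every instance is stamped with the single restore event."""
    return [{**inst, "created_at_event_id": restore_event_id} for inst in instances]
```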

ontologies, scoped_by, anchored_by — ontology layer

:Ontology nodes are first-class primary inputs: they carry their own embedding (in the concept space), lifecycle_state, and curator metadata that nothing post-restore can reconstruct. A backup that omits them silently leaves kg ontology list and the catalog browse tree empty on restore, and turns frozen ontologies back into writable ones.

These three streams carry the nodes and their two edge classes so the layer round-trips faithfully. They are not interned (ontology cardinality is tiny — tens, not tens of thousands) and are additive: a backup taken before they existed simply lacks the keys, and the reader yields empty.

ontologies — one record per :Ontology node:

| Field | Type | Notes |
| --- | --- | --- |
| ontology_id | string | App-assigned (ont_<uuid>). Carried unchanged by ID remapping (its own id space — not a concept/source/instance id). Restored as a property, not the MERGE key. |
| name | string | The ontology name; matches sources[].document. The natural key — restore MERGEs on name (consistent with ensure_ontology_exists and the edge MATCHes), so a same-named ontology with a different ontology_id converges to one node rather than duplicating. Never minted or remapped. |
| description | string | Curator description. May be empty. |
| embedding | array of floats | Ontology vector, same space as concepts. |
| search_terms | array of strings | Alternative names for similarity matching. |
| lifecycle_state | string | One of active, pinned, or frozen. Load-bearing — a lost frozen silently re-opens an ontology to writes. |
| creation_epoch | integer or null | Global epoch when created. |
| created_by | string or null | Creating user (ADR-200). |

scoped_by: one record per (:Source)-[:SCOPED_BY]->(:Ontology) membership edge (the source of truth; Source.document is a denormalized cache):

| Field | Type | Notes |
| --- | --- | --- |
| source_id | string | Participates in ID remapping (source map). |
| ontology | string | The ontologies[].name it belongs to. Not remapped. |

anchored_by: one record per (:Ontology)-[:ANCHORED_BY]->(:Concept) founding-concept provenance edge (ADR-200):

| Field | Type | Notes |
| --- | --- | --- |
| ontology | string | The ontologies[].name. Not remapped. |
| concept_id | string | Participates in ID remapping (concept map). |

The independent validator (scripts/development/lint/lint_backup.py) enforces integrity for all three streams: it flags duplicate ontology_id/name values and duplicate edges, scoped_by/anchored_by endpoints that do not reference an existing source/concept/ontology, and an ontology embedding dimension that does not match the backup-default profile (E_DUP_ONTOLOGY_* / E_DUP_SCOPED_BY / E_DUP_ANCHORED_BY / E_SCOPED_* / E_ANCHORED_* / E_ONTOLOGY_EMBEDDING_DIM). kg admin verify uses this to round-trip-check the ontology layer without performing a restore.
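To make the duplicate-edge checks concrete, here is a rough sketch of that one class of finding. It is not the code in lint_backup.py: the function name, the return shape, and the assumption that an edge's identity is the pair of its two fields are all illustrative. Only the two error codes are taken from the list above.

```python
from collections import Counter


def duplicate_edge_findings(bulk: dict) -> list:
    """Return (error_code, edge) tuples for edges that appear more than once."""
    scoped = Counter(
        (row["source_id"], row["ontology"]) for row in bulk.get("scoped_by", [])
    )
    anchored = Counter(
        (row["ontology"], row["concept_id"]) for row in bulk.get("anchored_by", [])
    )

    findings = [("E_DUP_SCOPED_BY", edge) for edge, n in scoped.items() if n > 1]
    findings += [("E_DUP_ANCHORED_BY", edge) for edge, n in anchored.items() if n > 1]
    return findings
```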


Exclusions: derived products are not in the backup

Per ADR-102 §4 (primary-in / derived-out), the backup carries primary inputs only. The following are explicitly excluded and are regenerated post-restore against the true post-restore graph state — never serialized:

  • Projections (ADR-114, projections/…) — derived embedding-landscape snapshots.
  • Artifacts / scores (ADR-116, artifacts/…) — polarity analyses, grounding results, epistemic scores, and other computed derivations.
  • Grounding caches and the catalog index.

A derived product is a function of global graph state; introducing any foreign node invalidates it wholesale, so a carried copy would be stale-yet-fresh-looking (silent corruption). Carrying none of them also designs out the fragile per-type concept-ID payload rewriter entirely (ADR-102 §4). The freshness machinery marks derivations stale on restore (record_mutation advances the epoch), so rehydration recomputes them in dependency order — embeddings → vocabulary → scores (ADR-102 §6).

Image and source media bytes are primary inputs (the model consumed them), so they are always backed up — carried in the archive alongside this object, keyed by re-derivable source_id/content keys (ADR-102 §4, §7). This distinguishes them from projections, which are recomputable and excluded.


Versioning and compatibility negotiation

header.format_version is the single negotiation token. It is a slash-delimited {family}/{major} string; this spec defines kg-backup/2.

Producer. Writes its native format_version and a HEADER that satisfies the contract for that version. A producer never downgrades silently.

Consumer. On open, reads header.format_version first and:

  1. Exact major match (kg-backup/2) — accept and apply.
  2. Known family, lower major — refuse. This platform is single-path: there is exactly one backup model and no backwards-compatibility / upcasting layer (ADR-102 P3). The pre-ADR-102 1.0 JSON shape (flat version/type/data, no header) was a prototype and has been removed — no producer emits it and no consumer reads it.
  3. Known family, higher major — refuse. A kg-backup/2 consumer MUST NOT attempt a kg-backup/3 object; it may not understand new required HEADER dictionaries, and partially applying primary inputs is unsafe (ADR-102 §8: restore passes through transiently inconsistent states).
  4. Unknown family — refuse.

KgBackupV2Reader enforces this: it refuses any object whose format_version is not kg-backup/2.
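The four consumer outcomes can be sketched as a single gate that runs before any bulk record is touched. This is an illustration of the rule, not the KgBackupV2Reader source; the exception class and function name are hypothetical.

```python
SUPPORTED = "kg-backup/2"


class UnsupportedBackupFormat(Exception):
    """Raised when a consumer must refuse a backup object."""


def check_format_version(header: dict) -> None:
    """Accept only an exact kg-backup/2 token; refuse everything else."""
    token = header.get("format_version")
    if token == SUPPORTED:
        return  # 1. exact major match: accept and apply

    family, slash, major = (token if isinstance(token, str) else "").partition("/")
    supported_family, _, supported_major = SUPPORTED.partition("/")

    # 4. unknown family (also covers a missing or malformed token)
    if family != supported_family or not slash or not major.isdigit():
        raise UnsupportedBackupFormat(f"unknown family or malformed token: {token!r}")
    # 2. known family, lower major: there is no upcasting layer
    if int(major) < int(supported_major):
        raise UnsupportedBackupFormat(f"lower major is not supported: {token!r}")
    # 3. known family, higher major (or a non-canonical spelling of the token)
    raise UnsupportedBackupFormat(f"higher major is not supported: {token!r}")
```

Note that schema_version and source.version are deliberately not consulted, since only format_version gates acceptance.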

schema_version and source.version are informational and never gate acceptance — only format_version does. The major component bumps on any change that an older consumer could not faithfully interpret: a new required HEADER dictionary, changed cascade semantics, or a changed bulk record shape. Additive, back-compatible HEADER fields that a kg-backup/2 consumer can safely ignore do not require a major bump.

ADR-102 §5 recommends format_version also be discoverable via API (for example, a format-version endpoint) so producer and consumer negotiate compatibility before transfer. The concrete endpoint shape is an API-contract concern outside this spec.


References

  • ADR-102 — Portable Backup and Restore with Clone/Merge Semantics (§3 epoch reconciliation; §4 primary-in/derived-out; §5 self-describing versioned header; §6 rehydration; §7 storage keys; §8 partial-apply safety).
  • ADR-203 — Graph epoch event log (graph_epochs, created_at_event_id).
  • ADR-803 — Independent image vector space (migration 075).
  • ADR-114 / ADR-116 — Projections / artifacts (excluded derived products).
  • ADR-107 — Prior backup/restore streaming architecture; partly superseded by ADR-102.
  • Source: api/lib/serialization/exporter.py (DataExporter), api/lib/serialization/format.py (BackupFormat, KgBackupV2Reader), api/app/lib/embedding_config.py (export_embedding_profile), migrations 055_embedding_profile.sql, 075_decouple_image_embedding_space.sql, 063_graph_epoch_events.sql, 064_graph_epoch_kinds_lookup.sql.