ADR-069: Semantic FUSE Filesystem
Status: Proposed
Date: 2025-11-28
Related ADRs: ADR-055 (Sharding), ADR-048 (GraphQueryFacade)
"Everything is a file" - Traditional Unix Philosophy "Everything is a file, but which file depends on what you're thinking about" - Semantic Unix Philosophy
Overview
Traditional filesystems force you to organize knowledge in rigid hierarchies: one directory, one path, one canonical location. But knowledge doesn't work that way. A document about embedding models is simultaneously about AI architecture, operational procedures, and bug fixes. Why should it live in only one folder?
The knowledge graph already solves this by letting concepts exist in multiple semantic contexts. But accessing it requires custom tools: CLI commands, web interfaces, MCP integration. Unix users already have powerful tools (grep, find, diff, tar) that they know intimately, but these tools can't touch the graph.
This ADR proposes exposing the knowledge graph as a FUSE (Filesystem in Userspace) mount point, turning standard Unix tools into knowledge graph explorers. Type cd /mnt/knowledge/embedding-models/ and you're executing a semantic query. Run ls and you see concepts with similarity scores. Use grep -r across multiple mounted shards and you're running distributed queries. The same concepts appear in multiple "directories" because they belong to multiple contexts. The filesystem adapts to your exploration patterns, making knowledge navigation feel like browsing files, except that the files organize themselves based on what they mean.
Abstract
This ADR proposes exposing the knowledge graph as a FUSE (Filesystem in Userspace) mount point, enabling semantic navigation and querying through standard Unix tools (ls, cd, cat, grep, find). Like /sys/ or /proc/, this is a partial filesystem that implements only operations that make semantic sense, providing a familiar interface to knowledge graph exploration.
Context
The Problem: Hierarchies Don't Fit Knowledge
Traditional filesystems organize knowledge through rigid hierarchies: one directory, one canonical path per document.
But knowledge doesn't fit in trees. ADR-068 is simultaneously:
- An architecture decision
- A guide for operators
- An embedding system reference
- A bug fix chronicle
- A compatibility management strategy
Why force it into one directory when it semantically belongs in multiple conceptual spaces?
The Opportunity: FUSE as Semantic Interface
The knowledge graph already provides:
- Semantic search (vector similarity)
- Relationship traversal (graph navigation)
- Multi-ontology federation (shard/facet architecture from ADR-055)
- Cross-domain linking (automatic concept merging)
FUSE could expose these capabilities through filesystem metaphors that users already understand.
Architectural Validation
This proposal underwent external peer review to validate feasibility against the existing codebase. Key findings:
- Architectural Fit: The FUSE operations map directly to existing services without requiring new core logic
  - `ls` (semantic query) → `QueryService.build_search_query`
  - `cd relationships/` (graph traversal) → `QueryService.build_concept_details_query`
  - Write operations → existing async ingestion pipeline
- Implementation Feasibility: High; essentially re-skinning existing services into the FUSE protocol
- Discovery Value: Solves the "I don't know what to search for" problem by allowing users to browse valid semantic pathways
- Standard Tool Integration: Turns every Unix utility (`grep`, `diff`, `tar`) into a knowledge graph tool for free
The review validated this is a "rigorous application of the 'everything is a file' philosophy to high-dimensional data," not a cursed hack.
Performance and Consistency Engineering
External research on high-dimensional semantic file systems identified critical engineering considerations that our architecture already addresses:
1. The Write Latency Trap (Mitigated)
- Risk: Synchronous embedding generation (15-50ms+) and graph linking (seconds) would block write() syscalls, hanging applications
- Our Solution: Asynchronous worker pattern (ADR-014) with job queue
  - Writes accepted immediately to staging area
  - Background workers handle chunking, embedding, concept matching
  - POSIX-compliant write performance maintained

2. The Read (ls) Bottleneck (Mitigated)
- Risk: Fresh vector searches or clustering on every readdir would cause sluggish directory listings
- Our Solution: Query-time retrieval with caching
  - 100-200ms retrieval target (realistic for vector search + graph traversal)
  - PostgreSQL connection pooling for concurrent queries
  - Directory structure is deterministic (ontology-based), not emergent clustering
  - FUSE implementation will cache directory listings with configurable TTL

3. POSIX Stability via Deterministic Structure (Addressed)
- Risk: Purely emergent clustering causes "cluster jitter": files randomly moving between folders as content shifts
- Our Solution: Stable four-level hierarchy (Shard → Facet → Ontology → Concepts)
  - Paths are deterministic based on ontology assignment
  - Concepts appear in multiple semantic query directories (intentional non-determinism)
  - But underlying storage location is stable (ontology-scoped)

4. Eventual Consistency Gap (Acknowledged)
- Risk: Async processing creates delay between write and appearance in semantic directories
- Mitigation: Virtual README.md in empty query results (see Future Extensions)
  - Explains why results are empty
  - Suggests alternative queries or lower thresholds
- Future: "Processing" indicator for in-flight ingestion

5. Connection Pool Saturation (Addressed)
- Risk: "Thundering herd" when a user pastes 1,000 files: every readdir hammers the database
- Our Solution:
  - PostgreSQL connection pooling (existing infrastructure)
  - FUSE TTL-based caching (mount option: cache_ttl=60)
  - Query rate limiting at API layer
  - Batch ingestion queuing (ADR-014 job scheduler)
Verdict: The architecture decouples high-latency "thinking" (AI processing) from low-latency "acting" (filesystem I/O), which research validates as the primary requirement for functional semantic filesystems.
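As a concrete illustration, here is a minimal sketch of the decoupled write path from item 1, assuming a hypothetical JobQueue and staging directory rather than the actual ADR-014 interfaces:

import os
import time
import uuid

class WritePath:
    """Decouple low-latency write() from high-latency AI processing."""

    def __init__(self, staging_dir, job_queue):
        self.staging_dir = staging_dir  # fast local disk, not the graph
        self.job_queue = job_queue      # hypothetical ADR-014-style queue

    def write(self, path, data, offset):
        # Accept the bytes immediately; no embedding or linking here.
        # (Offset handling elided for brevity.)
        staged = os.path.join(self.staging_dir, uuid.uuid4().hex)
        with open(staged, "wb") as f:
            f.write(data)
        # Background workers pick the job up later: chunking, embedding,
        # concept matching, graph integration.
        self.job_queue.enqueue({
            "kind": "ingest",
            "staged_file": staged,
            "target_path": path,
            "submitted_at": time.time(),
        })
        return len(data)  # POSIX-friendly: write() returns promptly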
Related Work: Other Semantic File Systems
This proposal builds on a rich history of semantic filesystems, though none have applied the "Directory = Query" metaphor to vector embeddings and probabilistic similarity.
1. Logic & Query-Based Systems (Direct Ancestors)
Semantic File System (SFS) - MIT, 1991
- Concept: Original implementation of "transducers" extracting attributes from files
- Innovation: Virtual directories interpreted as queries (/sfs/author/jdoe dynamically generated)
- Limitation: Attribute-based (key-value pairs), not semantic
- Our Extension: Replace discrete attributes with continuous similarity scores
Tagsistant - Linux/FUSE
- Concept: Directory nesting for boolean logic operations
- Innovation: Path as query language (/tags/music/+/rock/ for AND operations)
- Similarity: The /+/ operator is conceptually similar to our relationship traversal
- Our Extension: Replace boolean logic with semantic similarity thresholds
JOINFS
- Concept: Dynamic directories populated by metadata query matching
- Innovation: mkdir "format=mp3" creates persistent searches
- Similarity: Query definition via directory creation (like our approach)
- Our Extension: Semantic queries vs. exact metadata matching
2. Tag-Based Systems (Modern Implementations)
TMSU (Tag My Sh*t Up)
- Concept: SQLite-backed FUSE mount with explicit tagging
- Architecture: Standard "FUSE + Database" pattern we follow
- Similarity: Files exist in multiple paths (/mnt/tmsu/tags/music/mp3/)
- Difference: Deterministic (file is tagged or not), no similarity threshold
- Our Extension: Probabilistic membership based on semantic similarity
TagFS / SemFS
- Concept: RDF triples for tag storage (graph-like structure)
- Similarity: Graph backend architecture (closer to our Knowledge Graph than SQL)
- Difference: Explicit RDF relationships vs. emergent semantic relationships
- Our Extension: Vector embeddings replace RDF triples
3. Partial POSIX Precedents
Google Cloud FUSE / rclone
- Precedent: Explicitly documents "Limitations and differences from POSIX"
- Validation: Large-scale ML workloads accept non-compliance for utility
- Similar Violations: Directories disappear, non-deterministic caching, eventual consistency
- Our Justification: If users accept this for cloud storage, they'll accept it for semantic navigation
Comparison Table
| Feature | Tagsistant | TMSU | MIT SFS (1991) | ADR-069 (This Proposal) |
|---|---|---|---|---|
| Organization | Boolean Logic | Explicit Tags | Key-Value Attributes | Vector Embeddings |
| Navigation | /tag1/+/tag2/ | /tag1/tag2/ | /author/name/ | /query/threshold/ |
| Determinism | Deterministic | Deterministic | Deterministic | Probabilistic |
| Backend | SQL/Dedup | SQLite | Transducers | Vector DB + LLM |
| Write Behavior | Tags file | Tags file | Indexing | Ingest & Grounding |
| Membership Model | Binary (tagged/not) | Binary | Binary | Continuous (similarity score) |
The Key Innovation
Existing systems: Map discrete values (tags, attributes) → directories
- File either has tag "music" or it doesn't
- Boolean membership: true/false
- Deterministic listings

Our proposal: Map continuous values (similarity scores) → directories
- Concept has 73.5% similarity to query "embedding models"
- Probabilistic membership: threshold-dependent
- Non-deterministic listings (similarity changes as graph evolves)
This is the specific innovation that justifies the "POSIX violations" in our design - we're not just organizing files by metadata, we're navigating high-dimensional semantic space through a filesystem interface.
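The membership rule can be stated in a few lines. A toy sketch (pure Python; field names illustrative):

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def ls_semantic(concepts, query_vec, threshold):
    """Directory listing = concepts whose similarity clears the threshold."""
    scored = [(c["label"], cosine(query_vec, c["vec"])) for c in concepts]
    return sorted(
        [(label, s) for label, s in scored if s >= threshold],
        key=lambda pair: pair[1],
        reverse=True,
    )

# Same concepts, different thresholds -> different "directories":
#   ls_semantic(concepts, q, 0.8)  -> few results (precision)
#   ls_semantic(concepts, q, 0.6)  -> many results (recall)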
The Proposal
Mount Point
The knowledge graph mounts at /mnt/knowledge/ (path configurable per mount).
Directory Structure
Directories are semantic queries, not static folders:
/mnt/knowledge/
├── embedding-regeneration/              # Concepts matching "embedding regeneration"
│   ├── unified-regeneration.concept     (79.8% similarity)
│   ├── compatibility-checking.concept   (75.2% similarity)
│   └── model-migration.concept          (78.5% similarity)
├── ai-models/                           # Concepts matching "ai models"
│   ├── embedding-models.concept         (89.6% similarity)
│   ├── unified-regeneration.concept     (64.5% similarity)  # Same file!
│   └── ai-capabilities.concept          (70.6% similarity)
└── search/
    ├── 0.7/                             # 70% similarity threshold
    │   └── embedding+models/
    ├── 0.8/                             # 80% similarity threshold
    │   └── embedding+models/            # Fewer results
    └── 0.6/                             # 60% similarity threshold
        └── embedding+models/            # More results
File Format
Concept files are dynamically generated:
# Unified Embedding Regeneration
**ID:** sha256:95454_chunk1_76de0274
**Ontologies:** ADR-068-Phase4-Implementation, AI-Applications
**Similarity:** 79.8% (to directory query: "embedding regeneration")
**Grounding:** Weak (0.168, 17%)
**Diversity:** 39.2% (10 related concepts)
## Description
A system for regenerating vector embeddings across all graph text entities,
ensuring compatibility and proper namespace organization.
## Evidence
### Source 1: ADR-068-Phase4-Implementation (para 1)
The knowledge graph system needed a unified approach to regenerating vector
embeddings across all graph text entities (concepts, sources, and vocabulary)...
### Source 2: AI-Applications (para 1)
A unified embedding regeneration system addresses this challenge by treating
all embedded entities consistently...
## Relationships
→ INCLUDES compatibility-checking.concept
→ REQUIRES embedding-management-endpoints.concept
→ VALIDATES testing-verification.concept
← SUPPORTS bug-fix-source-regeneration.concept
## Navigate
ls ../ai-models/ # See related concepts in different semantic space
cd relationships/includes/ # Traverse by relationship type
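A sketch of how such a file could be rendered (the format_concept_markdown helper referenced in the implementation sketch below); the concept fields are assumed from the example above, not a confirmed schema:

def format_concept_markdown(concept, query=None, similarity=None):
    """Render a concept in the layout shown above (sketch only)."""
    lines = [
        f"# {concept.label}",
        f"**ID:** {concept.id}",
        f"**Ontologies:** {', '.join(concept.ontologies)}",
    ]
    if query is not None and similarity is not None:
        lines.append(f'**Similarity:** {similarity:.1%} (to directory query: "{query}")')
    lines += ["", "## Description", concept.description, "", "## Relationships"]
    for rel in concept.relationships:
        arrow = "→" if rel.outgoing else "←"
        lines.append(f"{arrow} {rel.type} {rel.target}.concept")
    return "\n".join(lines) + "\n"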
Relationship Navigation
Traverse the graph via relationships:
$ cd /mnt/knowledge/embedding-regeneration/unified-regeneration/
$ ls relationships/
includes/ requires/ validates/ supported-by/
$ cd relationships/includes/
$ ls
compatibility-checking.concept
$ cat compatibility-checking.concept # Full concept description
Search Interface
$ cd /mnt/knowledge/search/0.75/
$ mkdir "embedding+migration+compatibility" # Creates query directory!
$ cd "embedding+migration+compatibility"/
$ ls # Results ranked by similarity
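In FUSE terms, mkdir under search/ merely registers the query and readdir executes it. A sketch using fusepy (the kg client and _parse helper are hypothetical):

import errno
from fuse import FuseOSError, Operations

class SearchDir(Operations):
    def __init__(self):
        self.queries = {}  # (threshold, query) -> saved search

    def mkdir(self, path, mode):
        # e.g. path = "/search/0.75/embedding+migration+compatibility"
        parts = path.strip("/").split("/")
        if len(parts) != 3 or parts[0] != "search":
            raise FuseOSError(errno.EPERM)
        threshold, query = float(parts[1]), parts[2].replace("+", " ")
        self.queries[(threshold, query)] = True  # persist the saved search

    def readdir(self, path, fh):
        # Listing the directory executes the search, ranked by similarity
        threshold, query = self._parse(path)  # hypothetical helper
        results = kg.search(query, threshold=threshold)
        return [".", ".."] + [f"{c.label}.concept" for c in results]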
POSIX Violations (Features!)
1. Non-Deterministic Directory Listings
$ ls /mnt/knowledge/embedding-models/
unified-regeneration.concept
compatibility-checking.concept
model-migration.concept
# New concept added to graph elsewhere...
$ ls /mnt/knowledge/embedding-models/
unified-regeneration.concept
compatibility-checking.concept
model-migration.concept
embedding-architecture.concept # New! Without touching this directory!
Why it's beautiful: Your filesystem stays current with your knowledge, automatically.
2. Multiple Canonical Paths
$ pwd
/mnt/knowledge/embedding-regeneration/unified-regeneration.concept
$ cat unified-regeneration.concept
# ... reads file ...
$ pwd # From the file's perspective
/mnt/knowledge/ai-models/unified-regeneration.concept
# Both are correct! The file exists in multiple semantic spaces!
Why it's beautiful: Concepts belong to multiple contexts simultaneously.
3. Read-Influenced Writes
$ cat concept-a.concept
$ cat concept-b.concept
# Graph notices correlation...
$ ls # Now concept-c appears because semantic relatedness!
concept-a.concept
concept-b.concept
concept-c.concept # β Appeared based on your read pattern
Why it's beautiful: The filesystem adapts to your workflow.
4. Relationship-Based Symlinks That Aren't Symlinks
$ ls -l /mnt/knowledge/embedding-regeneration/
lrwxrwxrwx compatibility -> [INCLUDES] ../compatibility-checking/
lrwxrwxrwx testing -> [VALIDATES] ../testing-verification/
# These aren't real symlinks, they're semantic relationships!
# Different relationship types could render differently!
Why it's beautiful: Explicit relationship semantics instead of opaque links.
5. Threshold-Dependent Paths
$ cd /mnt/knowledge/search/0.8/ai+models/
$ ls | wc -l
12
$ cd ../0.7/ai+models/ # Same query, lower threshold
$ ls | wc -l
27
$ cd ../0.9/ai+models/ # Higher threshold
$ ls | wc -l
5
Why it's beautiful: Precision vs. recall as a filesystem operation!
6. Temporal Inconsistency
$ stat unified-regeneration.concept
Modified: 2025-11-29 03:59:57 # When concept was created
$ cat unified-regeneration.concept # Read it
$ stat unified-regeneration.concept
Modified: 2025-11-29 04:15:32 # NOW! Because grounding updated!
Why it's beautiful: Living knowledge, not static files.
Use Cases Where This Is Actually Useful
1. Exploratory Research
# Start with a concept
cd /mnt/knowledge/embedding-models/
# Navigate by relationships
cd unified-regeneration/relationships/requires/
# Follow to related concepts
cd compatibility-checking/relationships/includes/
# Emerge somewhere totally different but semantically connected!
pwd
# /mnt/knowledge/ai-models/compatibility-checking/relationships/includes/
2. Context-Aware Documentation
# You're working on AI models
cd /workspace/ai-stuff/
# Mount context-aware knowledge
ln -s /mnt/knowledge/ai-models/ ./docs
# Everything in ./docs is semantically relevant to AI!
3. Semantic Grep
# Traditional grep
grep -r "embedding" /docs/
# Returns every file mentioning "embedding" (thousands of false positives)
# Semantic filesystem
ls /mnt/knowledge/search/0.8/embedding/
# Returns only concepts semantically related to embedding at 80% threshold
4. AI-Assisted Workflows
# What concepts relate to what I'm working on?
git log --oneline -1
# fix: compatibility checking for embeddings
ls /mnt/knowledge/compatibility+checking/relationships/
requires/ includes/ supports/ related-to/
# Oh, it requires these other concepts!
cd requires/
ls
embedding-models.concept
model-migration.concept
Practical Applications That Sound Insane But Actually Work
TAR as Temporal Snapshots
# Capture your research state RIGHT NOW
tar czf research-$(date +%s).tar.gz /mnt/knowledge/embedding-models/
# Three months later: graph has evolved, new concepts exist
tar czf research-$(date +%s).tar.gz /mnt/knowledge/embedding-models/
# DIFFERENT tar contents!
# Same "directory", different semantic space!
# Each tarball is a temporal snapshot of the knowledge graph
Why this works: The filesystem is a view of the knowledge graph at a point in time. TAR captures that view. Different views = different archives. Version your knowledge semantically!
Practical use:
- Archive research findings before pivoting
- Create snapshots before major refactoring
- Share "knowledge packs" with collaborators
- Restore previous understanding states
Living Documentation in Development Workspaces
# Your project workspace
cd /workspace/my-ai-project/
# Symlink semantic knowledge as documentation
ln -s /mnt/knowledge/my-project/ ./docs
# Claude Code (or any IDE) can now:
cat docs/architecture/api-design.concept # Read current architecture
ls docs/relationships/SUPPORTS/ # See what supports this design
grep -r "performance" docs/ # Semantic search in docs!
# As you work and ingest commit messages:
git commit -m "feat: add caching layer"
kg ingest commit HEAD -o my-project
# Moments later:
ls ./docs/
# NEW concepts appear automatically!
# caching-layer.concept
# performance-optimization.concept
Why this works: The symlink points to a semantic query. The query results update as the graph evolves. Your documentation becomes a living, self-organizing entity.
Claude Code integration:
# Claude can literally read your knowledge graph
<Read file="docs/api-design.concept">
# Gets: full concept, relationships, evidence, grounding metrics
# Not just static markdown
# Claude can explore relationships
cd docs/api-design/relationships/REQUIRES/
# Discovers dependencies automatically
Bidirectional Ingestion
# Write support makes this a full knowledge management system
echo "# New Architecture Decision
We're adopting GraphQL for the API layer because..." > /mnt/knowledge/my-project/adr-070.md
# File write triggers:
# 1. Document chunking
# 2. LLM concept extraction
# 3. Semantic matching against existing concepts
# 4. Relationship discovery
# 5. Graph integration
# Seconds later:
ls /mnt/knowledge/api-design/
# adr-070-graphql-adoption.concept appears!
# Batch ingestion:
cp docs/*.md /mnt/knowledge/my-project/
# Processes all files, discovers cross-document relationships automatically
Why this works: Every write is an ingestion trigger. The filesystem becomes a natural interface for knowledge capture.
Anti-pattern prevention:
# Only accept markdown/text
cp binary-file.exe /mnt/knowledge/
# Error: unsupported file type
# Prevent knowledge pollution
cp spam.txt /mnt/knowledge/my-project/
# Ingests but low grounding, won't pollute semantic queries
Build System Integration
# Makefile that depends on semantic queries
API_DOCS := $(shell ls /mnt/knowledge/api-endpoints/*.concept)
docs/api.html: $(API_DOCS)
	kg export --format html /mnt/knowledge/api-endpoints/ > $@
# When new API concepts appear (from code ingestion):
# - Build automatically detects new .concept files
# - Regenerates documentation
# - No manual tracking needed
Why this works: The filesystem exposes semantic queries as file paths. Build tools already know how to depend on file paths.
CI/CD integration:
# GitHub Actions
- name: Check documentation coverage
  run: |
    concept_count=$(ls /mnt/knowledge/my-project/*.concept | wc -l)
    if [ $concept_count -lt 50 ]; then
      echo "Warning: Only $concept_count concepts documented"
    fi
Event-Driven Workflows
# Watch for knowledge graph changes
fswatch /mnt/knowledge/my-project/ | while read event; do
  echo "Knowledge updated: $event"
  kg admin embedding regenerate --type concept --only-missing
done
# Trigger notifications when concepts appear
inotifywait -m /mnt/knowledge/security-vulnerabilities/ -e create |
  while read dir action file; do
    notify-send "Security Alert" "New vulnerability concept: $file"
  done
Why this works: Filesystem events map to knowledge graph updates. Standard Linux tools (inotify, fswatch) become knowledge graph event listeners.
Knowledge-driven automation:
# When AI research concepts appear, trigger model retraining
ls /mnt/knowledge/ai-research/*.concept | entr make train-model
# When architecture concepts change, validate against constraints
ls /mnt/knowledge/architecture/*.concept | entr ./validate-architecture.sh
Diff-Based Knowledge Evolution Tracking
# Semantic diff across time
tar czf snapshot-before.tar.gz /mnt/knowledge/my-research/
# ... three months of work ...
tar czf snapshot-after.tar.gz /mnt/knowledge/my-research/
tar xzf snapshot-before.tar.gz -C /tmp/before/
tar xzf snapshot-after.tar.gz -C /tmp/after/
diff -r /tmp/before/ /tmp/after/
# Shows concept evolution:
# - New concepts (+ files)
# - Strengthened concepts (modified files with higher grounding)
# - Abandoned concepts (- files, fell below similarity threshold)
Why this works: Concepts are files. Files can be diffed. Knowledge evolution becomes visible through standard Unix tools.
Architecture and Hierarchy
Important: This Is NOT a Full Filesystem
Like /sys/ or /proc/, this is a partial filesystem that exposes a specific interface (knowledge graphs) through filesystem semantics. It only implements operations that make semantic sense.
What works:
- ls (semantic query)
- cd (navigate semantic space)
- cat (read concept)
- find / grep (search)
- echo > / cp (ingest)
- tar (snapshot)
- stat (metadata)
What doesn't work (and won't):
- mv (concepts don't "move" in semantic space)
- chmod / chown (use facet-level RBAC instead)
- ln -s (maybe future: create relationships)
- touch (timestamps are semantic, not file-based)
- dd (nonsensical for semantic content)
- Most other file operations that assume static files
This is a feature, not a limitation. Don't pretend to be a full filesystem. Be an excellent semantic interface.
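In practice this means the FUSE layer should refuse non-semantic operations explicitly rather than fake them. A sketch (fusepy-style):

import errno
from fuse import FuseOSError, Operations

class PartialFS(Operations):
    """Partial filesystem: reject operations with no semantic meaning."""

    def rename(self, old, new):
        raise FuseOSError(errno.ENOTSUP)   # mv: concepts don't "move"

    def chmod(self, path, mode):
        raise FuseOSError(errno.ENOTSUP)   # use facet-level RBAC instead

    def chown(self, path, uid, gid):
        raise FuseOSError(errno.ENOTSUP)

    def utimens(self, path, times=None):
        raise FuseOSError(errno.ENOTSUP)   # touch: timestamps are semantic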
The Four-Level Model
The semantic filesystem has a clear hierarchy that maps infrastructure to semantic content:
Shard (infrastructure: database + API + resources)
└── Facet (logical grouping of related ontologies)
    └── Ontology (specific knowledge domain)
        └── Concepts (semantic content)
Why this hierarchy matters:
| Level | Purpose | Example | Isolation |
|---|---|---|---|
| Shard | Physical deployment instance | shard-research, shard-production | Infrastructure (separate databases) |
| Facet | Logical grouping for organization/RBAC | academic, industrial, engineering | Access control & resource limits |
| Ontology | Knowledge domain namespace | ai-research, api-docs, patents | Semantic namespace |
| Concepts | Individual semantic units | embedding-models.concept | Content |
Directory Structure
/mnt/knowledge/
├── shard-research/                    # Shard: research infrastructure
│   ├── academic/                      # Facet: academic research group
│   │   ├── ai-research/               # Ontology: AI papers
│   │   │   └── embedding-models.concept
│   │   ├── neuroscience/              # Ontology: neuroscience papers
│   │   └── ml-papers/                 # Ontology: ML literature
│   │
│   └── industrial/                    # Facet: industrial R&D group
│       ├── patents/                   # Ontology: patent filings
│       └── prototypes/                # Ontology: prototype docs
│
├── shard-production/                  # Shard: production infrastructure
│   ├── engineering/                   # Facet: engineering team
│   │   ├── api-docs/                  # Ontology: API documentation
│   │   ├── architecture/              # Ontology: architecture decisions
│   │   └── runbooks/                  # Ontology: operational runbooks
│   │
│   └── compliance/                    # Facet: compliance team
│       ├── gdpr/                      # Ontology: GDPR documentation
│       └── soc2/                      # Ontology: SOC2 compliance
│
└── shard-partners/                    # Shard: partner infrastructure (remote)
    └── shared/                        # Facet: shared knowledge
        └── api-integration/           # Ontology: integration docs
Why Facets?
Facets provide logical organization within a shard without requiring separate infrastructure:
- Access Control Boundaries
- Resource Isolation
- Namespace Management
- Organizational Clarity
Mount Options at Different Levels
# Mount entire shard (all facets, all ontologies)
mount -t fuse.knowledge-graph \
-o api_url=http://localhost:8000 \
-o client_id=fuse-client \
-o client_secret=$FUSE_SECRET \
-o shard=research \
/dev/knowledge /mnt/knowledge/research
ls /mnt/knowledge/research/
academic/ industrial/
# Mount specific facet (all ontologies in facet)
mount -t fuse.knowledge-graph \
-o client_id=fuse-client,client_secret=$FUSE_SECRET \
-o shard=research,facet=academic \
/dev/knowledge /mnt/knowledge/academic
ls /mnt/knowledge/academic/
ai-research/ neuroscience/ ml-papers/
# Mount specific ontology (direct semantic access)
mount -t fuse.knowledge-graph \
-o client_id=fuse-client,client_secret=$FUSE_SECRET \
-o shard=research,facet=academic,ontology=ai-research \
/dev/knowledge /mnt/knowledge/ai-research
ls /mnt/knowledge/ai-research/
# Shows semantic query space directly
embedding-models/ neural-networks/ transformers/
Note: All mount operations use OAuth client authentication (ADR-054). The same client credentials work across FUSE, MCP server, and CLI - they're all clients of the same API backend.
Cross-Shard, Cross-Facet Queries
Standard Unix tools traverse the hierarchy automatically:
# Search across all mounted shards, facets, and ontologies
find /mnt/knowledge/ -name "*.concept" | grep "embedding"
# Traverses:
# 1. Shards (local + remote)
#    ├── shard-research (local FUSE → local PostgreSQL)
#    └── shard-partners (SSHFS → remote FUSE → remote PostgreSQL)
#
# 2. Facets within each shard
#    ├── academic
#    ├── industrial
#    └── shared
#
# 3. Ontologies within each facet
#    ├── ai-research
#    ├── patents
#    └── api-integration
#
# 4. Semantic queries within each ontology
#    └── embedding-models.concept (found!)
# All through standard Unix tooling!
The magic: find and grep don't know about:
- Knowledge graphs
- Semantic queries
- Shard boundaries
- Local vs. remote mounts
They just traverse directories and read files. The abstraction is perfect.
Distributed Queries Across Mount Boundaries
# Mount local shards
mount -t fuse.knowledge-graph -o shard=research /dev/knowledge /mnt/local/research
mount -t fuse.knowledge-graph -o shard=production /dev/knowledge /mnt/local/production
# Mount remote shards via SSH
sshfs partner-a@remote:/mnt/knowledge/shared /mnt/remote/partner-a
sshfs partner-b@remote:/mnt/knowledge/public /mnt/remote/partner-b
# Now grep across ALL of them:
grep -r "API compatibility" /mnt/{local,remote}/*/
# What actually happens:
# 1. grep traverses /mnt/local/research/
#    → FUSE reads local database
#    → Returns concept files as text
#
# 2. grep traverses /mnt/local/production/
#    → FUSE reads local database
#    → Returns concept files as text
#
# 3. grep traverses /mnt/remote/partner-a/
#    → SSHFS sends reads over SSH
#    → Remote FUSE reads remote database
#    → SSH returns concept files as text
#
# 4. grep traverses /mnt/remote/partner-b/
#    → Same: SSHFS → SSH → remote FUSE → remote database
# Result: distributed semantic search across multiple knowledge graphs
# Using only: grep, mount, and sshfs
# No special distributed query protocol needed
This is profound: Standard Unix tools become distributed knowledge graph query engines simply by mounting semantic filesystems at different paths.
Write Operations Respect Hierarchy
cd /mnt/knowledge/research/academic/ai-research/embedding-models/
# Write here → ingests into:
# - Shard: research
# - Facet: academic
# - Ontology: ai-research
# - Context: embedding-models (semantic query)
echo "# Quantization Techniques..." > quantization.md
# Concept appears in:
# ✓ /mnt/knowledge/research/academic/ai-research/
# ✗ NOT in /mnt/knowledge/research/industrial/patents/
# Same shard, different facet = isolated
Federation and Discovery
# Local shard (FUSE → local knowledge graph)
mount -t fuse.knowledge-graph -o shard=research /dev/knowledge /mnt/local
# Remote shard (SSHFS → remote FUSE → remote knowledge graph)
sshfs partner@partner.com:/mnt/knowledge/shared \
/mnt/remote
# Now find operates across BOTH:
find /mnt/{local,remote}/ -name "*.concept" | grep "api"
# Returns concepts from:
# - Local research shard (all facets)
# - Remote partner shard (shared facet)
# Distributed knowledge graph queries via standard Unix tools!
Path Semantics
Every path encodes the full context:
/mnt/knowledge/shard-research/academic/ai-research/embedding-models/quantization.concept
      │              │            │          │             │               │
      │              │            │          │             │               └─ Concept (semantic entity)
      │              │            │          │             └─ Semantic query context
      │              │            │          └─ Ontology (knowledge domain)
      │              │            └─ Facet (logical group)
      │              └─ Shard (infrastructure)
      └─ Mount point
Deterministic structure, semantic content.
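Decoding that context is mechanical. A sketch of the path parser (the dataclass and field names are illustrative):

from dataclasses import dataclass
from typing import Optional

@dataclass
class PathContext:
    shard: str
    facet: str
    ontology: str
    query: Optional[str] = None    # semantic query directory, if present
    concept: Optional[str] = None  # concept file, if present

def parse_path(path: str) -> PathContext:
    """Map a path under the mount point onto the four-level model."""
    parts = [p for p in path.strip("/").split("/") if p]
    if len(parts) < 3:
        raise ValueError("expected at least shard/facet/ontology")
    return PathContext(
        shard=parts[0],
        facet=parts[1],
        ontology=parts[2],
        query=parts[3] if len(parts) > 3 else None,
        concept=parts[4] if len(parts) > 4 else None,
    )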
Implementation Sketch
Technology Stack
- FUSE: Filesystem in Userspace (client interface)
- Backend: FastAPI REST API server
- Query Engine: Semantic search API (part of backend)
- Cache: TTL-based concept cache (fights non-determinism slightly)
Note: The FUSE filesystem is a client interface, just like the MCP server, CLI, and web interface. All clients communicate with the same FastAPI backend.
Basic Operations
from stat import S_IFREG
from fuse import Operations  # fusepy

class SemanticFS(Operations):
    def readdir(self, path, fh):
        """List directory = semantic query"""
        query = path_to_query(path)
        concepts = kg.search(query, threshold=0.7)
        return ['.', '..'] + [f"{c.id}.concept" for c in concepts]

    def read(self, path, size, offset, fh):
        """Read file = get concept details"""
        concept_id = path_to_concept_id(path)
        concept = kg.get_concept(concept_id)
        data = format_concept_markdown(concept).encode()
        return data[offset:offset + size]

    def getattr(self, path, fh=None):
        """Stat file = concept metadata"""
        concept = kg.get_concept(path_to_concept_id(path))
        return {
            'st_mode': S_IFREG | 0o444,        # Read-only
            'st_size': len(concept.description),
            'st_mtime': concept.last_updated,  # Changes with grounding!
        }
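The TTL cache mentioned in the stack could be as simple as a dict keyed by path; a sketch:

import time

class TTLCache:
    """Cache readdir results so repeated ls doesn't re-run vector search."""

    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self.entries = {}  # path -> (expires_at, listing)

    def get(self, path):
        hit = self.entries.get(path)
        if hit and hit[0] > time.monotonic():
            return hit[1]
        return None

    def put(self, path, listing):
        self.entries[path] = (time.monotonic() + self.ttl, listing)

# In readdir():
#   listing = cache.get(path)
#   if listing is None:
#       listing = run_semantic_query(path)   # hypothetical helper
#       cache.put(path, listing)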
Mount Options
mount -t fuse.knowledge-graph \
  -o api_url=http://localhost:8000 \
  -o client_id=fuse-client \
  -o client_secret=$FUSE_SECRET \
  -o threshold=0.75 \
  -o cache_ttl=60 \
  -o relationship_links=true \
  -o dynamic_discovery=true \
  /dev/knowledge /mnt/knowledge

# api_url:             API server endpoint
# client_id:           OAuth client ID (ADR-054)
# client_secret:       OAuth client secret
# threshold:           Default similarity threshold
# cache_ttl:           Cache concepts for 60s
# relationship_links:  Show relationship symlinks
# dynamic_discovery:   Concepts appear based on access patterns
Authentication: FUSE authenticates as an OAuth client (ADR-054), just like the MCP server and CLI. The same client credentials can be shared across all client interfaces, or each can have its own client ID for granular access control.
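Token acquisition would follow the standard OAuth2 client-credentials flow. A sketch using requests (the /oauth/token path is an assumption, not something ADR-054 confirms):

import time
import requests

class OAuthClient:
    """Fetch and refresh a bearer token for the FUSE client."""

    def __init__(self, api_url, client_id, client_secret):
        self.token_url = f"{api_url}/oauth/token"  # assumed endpoint
        self.client_id = client_id
        self.client_secret = client_secret
        self._token, self._expires = None, 0.0

    def token(self):
        if self._token is None or time.monotonic() >= self._expires:
            resp = requests.post(self.token_url, data={
                "grant_type": "client_credentials",
                "client_id": self.client_id,
                "client_secret": self.client_secret,
            })
            resp.raise_for_status()
            body = resp.json()
            self._token = body["access_token"]
            # Refresh slightly before expiry
            self._expires = time.monotonic() + body.get("expires_in", 300) - 30
        return self._token

# Every FUSE -> API request then carries:
#   headers={"Authorization": f"Bearer {client.token()}"}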
Alternative: rclone Backend Implementation
Instead of writing a custom FUSE driver, implement as an rclone backend.
Why rclone?
- rclone already handles FUSE mounting, caching, config management
- Implement knowledge graph as "just another backend" (like S3, Google Drive)
- Get interop between knowledge graphs and cloud storage for free
- Users already understand rclone's model
Implementation:
// rclone backend for knowledge graphs
package kg

import (
	"context"
	"io"
	"strings"

	"github.com/rclone/rclone/fs"
)

func init() {
	fs.Register(&fs.RegInfo{
		Name:        "kg",
		Description: "Knowledge Graph Backend",
		NewFs:       NewFs,
		Options: []fs.Option{{
			Name:    "api_url",
			Default: "http://localhost:8000",
		}, {
			Name: "shard",
		}, {
			Name: "client_id",
			Help: "OAuth client ID (ADR-054)",
		}, {
			Name: "client_secret",
			Help: "OAuth client secret",
		}},
	})
}

// List directory = semantic query
func (f *Fs) List(ctx context.Context, dir string) (entries fs.DirEntries, err error) {
	_, ontology, query := parsePath(dir)
	concepts, err := f.client.Search(ctx, query, ontology)
	if err != nil {
		return nil, err
	}
	for _, concept := range concepts {
		entries = append(entries, conceptToEntry(concept))
	}
	return entries, nil
}

// Open file = read concept as markdown
func (o *Object) Open(ctx context.Context) (io.ReadCloser, error) {
	concept, err := o.fs.client.GetConcept(ctx, o.conceptID)
	if err != nil {
		return nil, err
	}
	markdown := formatConceptMarkdown(concept)
	return io.NopCloser(strings.NewReader(markdown)), nil
}

// Put file = ingest into knowledge graph
func (f *Fs) Put(ctx context.Context, in io.Reader, src fs.ObjectInfo) (fs.Object, error) {
	data, err := io.ReadAll(in)
	if err != nil {
		return nil, err
	}
	facet, ontology, _ := parsePath(src.Remote())
	result, err := f.client.Ingest(ctx, data, ontology, facet)
	if err != nil {
		return nil, err
	}
	return &Object{...}, nil // fields populated from result (elided)
}
Usage:
# Configure knowledge graph backend (OAuth client authentication)
rclone config create kg-research kg \
api_url=http://localhost:8000 \
shard=research \
client_id=rclone-client \
client_secret=$RCLONE_SECRET
# Mount it
rclone mount kg-research:academic/ai-research /mnt/knowledge
# Works like any rclone mount
ls /mnt/knowledge/
cat /mnt/knowledge/embedding-models.concept
echo "new idea" > /mnt/knowledge/new-concept.md
Note: Uses same OAuth client authentication (ADR-054) as MCP server and CLI. The same client credentials can be reused, or rclone can have its own client ID for separate access control policies.
Bonus: Cross-Backend Operations
# Backup knowledge graph to S3
rclone sync kg-research: s3:backup/kg-snapshot/
# Ingest Google Drive docs into knowledge graph
rclone copy gdrive:Papers/ kg-research:academic/papers/
# Sync between knowledge graph shards
rclone sync kg-shard-a: kg-shard-b:
# Export concepts to git repository
rclone sync kg-research: /tmp/kg-export/
cd /tmp/kg-export && git init && git add . && git commit
# Use rclone browser GUI to explore knowledge graph
rclone rcd --rc-web-gui
Benefits:
- Don't write FUSE layer (rclone handles it)
- Get caching, retry logic, rate limiting for free
- Instant interop with cloud storage backends
- Existing rclone user base understands the model
- rclone browser GUI works automatically
Implementation effort: Minimal backend (List/Read/Write) could be prototyped in a weekend.
Why This Will Make Unix Admins Angry
The Angry Tweets We Expect
"This violates everything POSIX stands for. Files shouldn't magically appear and disappear."
Yes. That's the point. Knowledge isn't static.
"How am I supposed to backup a filesystem where
targives different results each time?"
You backup the knowledge graph, not the filesystem. The filesystem is a view of knowledge.
"My scripts depend on deterministic
lsoutput!"
Your scripts are thinking in hierarchies. Think in semantics instead.
"
find . -name '*.concept' | wc -lreturns different numbers!"
Correct! The number of concepts matching your context changes as you explore.
"This breaks
rsync!"
Have you considered that maybe rsync should understand semantic similarity? π€
The rclone Defense
"This is just like rclone for Google Drive!"
Yes. Exactly. And millions of people use rclone daily despite its POSIX violations.
rclone for Google Drive exhibits:
- Non-deterministic listings: Files appear/disappear as others edit shared drives
- Multiple canonical paths: Same file accessible via /MyDrive/ and /SharedDrives/ (Google's "Add to My Drive")
- Eventually consistent: Write a file, read might return old content (API sync lag)
- Weird metadata: Fake Unix permissions from Google's ACLs, timestamps from cloud provider
- Partial POSIX: No symlinks, no memory mapping, fake chmod/chown
People accept this because the abstraction is useful.
Semantic FUSE is actually BETTER than rclone:
| Aspect | rclone (Google Drive) | Semantic FUSE |
|---|---|---|
| Non-determinism | Network sync (unpredictable) | Semantic relevance (intentional) |
| Multiple paths | Google's sharing model (confusing) | Semantic contexts (by design) |
| Performance | Network latency, API rate limits | Local database (consistent) |
| Metadata | Fake Unix perms from ACLs (awkward) | Native semantic data (grounding, similarity) |
| Consistency | Eventually consistent (network) | Immediately consistent (local) |
rclone documentation literally says:
"Note that many operations are not fully POSIX compliant. This is an inherent limitation of cloud storage systems."
Our documentation:
"Note that many operations are not fully POSIX compliant. This is an inherent limitation of exposing semantic graphs as filesystems."
Same energy. Same usefulness. Same tradeoffs.
If you accept rclone's weirdness for the convenience of grep-ing Google Drive, you'll accept semantic FUSE's weirdness for the convenience of grep-ing knowledge graphs.
The Defenses We Don't Care About
"But the POSIX specification says..."
The POSIX specification doesn't account for semantic knowledge graphs. Times change.
"This would break every tool!"
Good! Those tools assume files are in trees. Knowledge isn't a tree.
"What about make? What about git?"
Don't use this for source code. Use it for knowledge about source code.
"This is cursed."
Yes. Beautifully cursed. Like all the best ideas.
Practical Limitations
What This Is NOT Good For
- Source code version control (use git)
- Binary file storage (use object storage)
- High-performance computing (use tmpfs)
- Traditional backups (use the graph's native backup)
- Anything requiring determinism (use a real filesystem)
What This IS Good For
- Research and exploration
- Documentation navigation
- Semantic code search
- Learning domain knowledge
- Following conceptual trails
- AI-assisted development workflows
Future Extensions
Write Support
$ mkdir /mnt/knowledge/my-new-concept/
$ echo "Description: A revolutionary new idea..." > description.md
$ echo "Ontology: MyProject" > .ontology
# Automatically ingested and linked!
Relationship Creation
$ ln -s ../target-concept.concept relationship/supports/
# Creates SUPPORTS relationship in the graph!
Query Operators
$ cd /mnt/knowledge/search/AND/embedding+models/
$ cd /mnt/knowledge/search/OR/ai+ml/
$ cd /mnt/knowledge/search/NOT/embedding-models/
Grounding Filters
$ cd /mnt/knowledge/grounding/strong/
$ cd /mnt/knowledge/grounding/weak/
Decision
Implement knowledge graph access as a FUSE filesystem with the following design choices:
- Partial Filesystem Model - Like `/sys/` or `/proc/`, implement only semantically meaningful operations
  - Support: `ls` (query), `cd` (navigate), `cat` (read), `grep`/`find` (search), `echo`/`cp` (ingest), `tar` (snapshot)
  - Do not support: `mv`, `chmod`, `chown`, `touch`, `dd` (operations that don't map to semantic concepts)
- Four-Level Hierarchy - Map infrastructure to semantics:
  - Shard (infrastructure: database + API + resources)
  - Facet (logical grouping: RBAC + resource isolation)
  - Ontology (knowledge domain namespace)
  - Concepts (semantic content)
- Directory Creation = Semantic Query - Users create directories with query names
  - `mkdir "embedding models"` defines a semantic query
  - `cd embedding-models/` executes the query
  - `ls` shows concepts matching the query at the configured similarity threshold
- Relationship Navigation - Concepts expose a `relationships/` subdirectory
  - `cd concept.concept/relationships/SUPPORTS/` traverses graph edges
  - Path represents traversal history (deterministic structure, semantic content)
- Write = Ingest - File writes trigger automatic ingestion
  - `echo "content" > file.md` ingests into the current ontology/facet context
  - File may not reappear with the same name (concept extraction determines the label)
  - Embraces non-determinism as a feature (concepts appear based on semantic relevance)
- Implementation Options - Two paths forward:
  - Option A: Custom FUSE driver in Python (full control, more code)
  - Option B: rclone backend in Go (leverage existing infrastructure, instant interop)
Consequences
Benefits
1. Familiar Interface for Semantic Exploration
- Users already understand cd, ls, cat, grep
- No need to learn custom query language or web UI
- Standard Unix tools become knowledge graph query engines
2. Distributed Queries via Standard Tools
# Transparently searches local + remote shards
find /mnt/knowledge/ -name "*.concept" | grep "pattern"
# - Local shards: FUSE β local PostgreSQL
# - Remote shards: SSHFS β SSH β remote FUSE β remote PostgreSQL
3. Cross-Backend Interoperability (if rclone implementation)
# Backup knowledge graph to S3
rclone sync kg:research s3:backup/
# Ingest from Google Drive
rclone copy gdrive:Papers/ kg:research/papers/
# Export to git repository
rclone sync kg:research /tmp/export/
4. TAR as Temporal Snapshots
tar czf snapshot-$(date +%s).tar.gz /mnt/knowledge/my-research/
# Same path, different contents over time
# Version your semantic space
5. Living Documentation in Workspaces
ln -s /mnt/knowledge/my-project/ ./docs
# Documentation auto-updates as concepts evolve
# Claude Code can read semantic graph directly
Drawbacks
1. Non-Determinism Can Be Confusing
- ls results change as graph evolves
- Same query returns different results over time
- Mitigation: Clear documentation, caching options, embrace as feature
2. POSIX Violations Require Education
- Many standard file operations won't work
- Users expect traditional filesystem behavior
- Mitigation: Follow rclone precedent, document limitations clearly

3. Performance Considerations
- Semantic queries slower than filesystem metadata operations
- Graph traversal can be expensive for deep relationships
- Mitigation: Caching layer, configurable similarity thresholds, limit traversal depth

4. Implementation Complexity
- Custom FUSE: ~2000-3000 lines of Python
- rclone backend: ~500-1000 lines of Go + API wrapper
- Either requires ongoing maintenance
Risks
1. User Confusion
- Non-deterministic behavior violates expectations
- Mitigation: Clear "partial filesystem" designation, precedent from rclone

2. Performance at Scale
- Large knowledge graphs may be slow
- Mitigation: Shard/facet architecture limits query scope

3. Adoption Barrier
- Requires FUSE support, mount permissions
- Mitigation: Provide alternative interfaces (web UI, CLI, MCP)
Alternatives Considered
1. WebDAV/HTTP Filesystem
Pros: Cross-platform, no FUSE required, browser-compatible
Cons: Poorer performance, limited caching, no local integration
Decision: FUSE provides better Unix integration, can add WebDAV later
2. Git-Like Interface
Pros: Familiar to developers, built-in versioning, distributed
Cons: Concepts aren't commits, relationships aren't branches, poor semantic fit
Decision: Git is for version control, not semantic navigation
3. Custom CLI Only
Pros: Full control, no filesystem abstraction mismatch
Cons: Users must learn new commands, can't use standard Unix tools
Decision: CLI exists (kg command), FUSE adds complementary interface
4. SQL/GraphQL Query Interface
Pros: Powerful queries, precise results, standard protocols
Cons: Requires learning query language, no filesystem metaphor benefits
Decision: APIs exist, FUSE provides filesystem convenience layer
5. Database-as-Filesystem (Direct PostgreSQL Mount)
Pros: Tools exist (pgfuse), direct database access
Cons: Exposes tables/rows, not semantic concepts, wrong abstraction level
Decision: Need semantic layer, not raw database access
Implementation Recommendation
Update (Post Peer Review): After architectural review, we are strongly leaning toward Python FUSE (Option A) for the MVP, though not yet committed.
Reconsidering Python FUSE (Option A)
Advantages for our specific architecture:
- Shared Logic Layer - All core services (`QueryService`, `EmbeddingModel`, `GraphQueryFacade`) are Python
  - Can import services directly without HTTP overhead
  - Zero-latency local operations during development
  - No schema drift between FUSE layer and graph layer
- Complex Traversal Support - Deep graph schema knowledge (ADR-048)
  - Relationship navigation requires VocabType awareness
  - Dynamic relationship discovery easier in Python
  - Access to full graph context without API round-trips
- Tight Integration - Same runtime as API server
  - Can mount on same machine as database for testing
  - Direct access to PostgreSQL connection pool
  - Shared caching layer with existing services
Implementation with pyfuse3:
import pyfuse3
from api.services.query_service import QueryService

class SemanticFS(pyfuse3.Operations):
    def __init__(self):
        super().__init__()
        self.query_service = QueryService()  # Direct import!

    async def readdir(self, inode, off, token):
        # Direct service call, no HTTP
        # (query derived from the inode -> path mapping, elided here)
        concepts = await self.query_service.execute_search(query, threshold=0.7)
        for concept in concepts:
            pyfuse3.readdir_reply(token, f"{concept.label}.concept", ...)
When to use rclone instead (Option B):
- Remote mounting (laptop → cloud server)
- OAuth management for remote instances
- Cross-backend sync requirements (knowledge graph ↔ S3/Google Drive)
- Deployment to users unfamiliar with Python infrastructure
Current stance: Prototype with Python FUSE for local/development use. Both implementations may coexist - Python for tight integration, rclone for remote access and OAuth workflows.
Authentication (applies to both approaches): Both Python FUSE and rclone implementations use the same OAuth client authentication system (ADR-054) as the MCP server and CLI. This means:
- Same client credentials can be shared across all client interfaces
- Consistent authentication flow regardless of client type
- Granular access control via separate client IDs if needed
- FUSE authenticates to the API server just like any other client
Future Extensions
Core Features
- Relationship-based symbolic links (`ln -s concept relationships/SUPPORTS/`)
- Query operators (`/search/AND/`, `/search/OR/`, `/search/NOT/`)
- Grounding filters (`/grounding/strong/`, `/grounding/weak/`)
- Write support for relationship creation
- Multi-shard federated views
Usability Enhancements (From Peer Review)
1. Empty Directory Problem Solution
When semantic queries return no results, generate a virtual README.md explaining why:
mkdir /mnt/knowledge/research/unicorn-physics/
ls /mnt/knowledge/research/unicorn-physics/
# Empty directory - no matching concepts
cat /mnt/knowledge/research/unicorn-physics/README.md
# Query 'unicorn physics' (Threshold: 0.7) matched 0 concepts in ontology 'research'.
#
# Suggestions:
# - Lower threshold: /mnt/knowledge/search/0.5/unicorn+physics/
# - Try broader query: /mnt/knowledge/research/physics/
# - Check available ontologies: ls /mnt/knowledge/
Benefits: Users understand empty results instead of wondering if the system is broken.
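Generating the virtual file is straightforward; a sketch (wording mirrors the example above):

def virtual_readme(query, threshold, ontology):
    """Synthesized README.md for an empty semantic query directory."""
    slug = query.replace(" ", "+")
    return (
        f"Query '{query}' (Threshold: {threshold}) matched 0 concepts "
        f"in ontology '{ontology}'.\n\n"
        "Suggestions:\n"
        f"- Lower threshold: /mnt/knowledge/search/0.5/{slug}/\n"
        "- Check available ontologies: ls /mnt/knowledge/\n"
    )

# In readdir(): if the query matches nothing, list ["README.md"];
# in read(): serve virtual_readme(...) for that path.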
2. Tarball Snapshots with Temporal Metadata
Include a .manifest file in every tarball to enable "time travel":
tar czf snapshot-$(date +%s).tar.gz /mnt/knowledge/research/
tar tzf snapshot-*.tar.gz | head -5
.manifest
embedding-models.concept
neural-networks.concept
...
cat .manifest
{
"snapshot_timestamp": "2025-11-28T23:45:00Z",
"graph_revision": "a3b2c1d4",
"shard": "research",
"facet": "academic",
"ontology": "ai-research",
"query_threshold": 0.7,
"concept_count": 127,
"embedding_model": "nomic-ai/nomic-embed-text-v1.5"
}
Benefits:
- Restore semantic state from snapshots
- Track knowledge evolution over time
- Debug "why did this concept disappear?"
3. RBAC Integration via Filesystem Permissions
Map filesystem permission bits to OAuth scopes from ADR-054/055:
ls -l /mnt/knowledge/shard-production/
drwxr-xr-x engineering/ # User has write:engineering scope
drwxr-xr-- compliance/ # User has read:compliance scope (no write)
d--------- finance/ # User has no access
# Attempting to write without scope:
echo "test" > /mnt/knowledge/shard-production/compliance/test.md
# Permission denied (requires write:compliance scope)
Implementation: Check OAuth scopes during FUSE access() and open() operations.
Benefits:
- Familiar Unix permission model
- Natural RBAC enforcement
- Tools like ls -l show access levels automatically
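A sketch of the scope-to-mode mapping (scope names follow the example above):

import stat

def facet_mode(scopes, facet):
    """Map OAuth scopes to Unix permission bits for a facet directory."""
    mode = stat.S_IFDIR
    if f"read:{facet}" in scopes:
        mode |= 0o500  # r-x for the mounting user
    if f"write:{facet}" in scopes:
        mode |= 0o200  # w
    return mode        # no scopes at all -> d---------

# getattr() for /shard-production/<facet> returns st_mode = facet_mode(...);
# open()/access() deny with EACCES when the required scope is missing.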
References
Implementation Tools
Related Semantic File Systems
- Semantic File System (SFS) - Gifford et al., MIT, 1991 - Original virtual directories as queries
- Tagsistant - Linux FUSE semantic filesystem with boolean logic
- TMSU - Tag My Sh*t Up - Modern SQLite-backed tagging filesystem
- Google Cloud Storage FUSE - Example of widely-used partial POSIX compliance
Internal Architecture
- ADR-055: Sharding and facet architecture
- ADR-048: Query safety and namespace isolation
- ADR-054: OAuth client management
Knowledge doesn't fit in trees. It forms graphs. Your filesystem should too. 🌳→🕸️