.grafeo Container Format Specification
The .grafeo file is the single-file persistence format for Grafeo databases. It stores data in typed sections, each independently addressable, checksummed, and (for index sections) memory-mappable.
File Layout
| Offset | Size | Contents |
|---|---|---|
| 0x0000 | 4 KiB | FileHeader (magic, version, page size) |
| 0x1000 | 4 KiB | DbHeader H1 (iteration, checksum, metadata) |
| 0x2000 | 4 KiB | DbHeader H2 (alternating crash-safe copy) |
| 0x3000 | 4 KiB | Section Directory (type/offset/length/CRC entries) |
| 0x4000+ | varies | Section data (page-aligned per section) |
Total header overhead: 16 KiB. All regions are page-aligned (4 KiB boundaries).
FileHeader (0x0000, 4 KiB)
Written once at database creation. Never modified afterwards.
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| 0 | 4 | [u8; 4] | magic | 0x47524146 ("GRAF") |
| 4 | 4 | u32 LE | format_version | 1 (current) |
| 8 | 4 | u32 LE | page_size | Always 4096 |
| 12 | 8 | u64 LE | creation_timestamp_ms | Unix epoch milliseconds |
| 20 | 32 | [u8; 32] | creator_version | UTF-8 Grafeo version, zero-padded |
| 52 | 4044 | - | (reserved) | Zero-filled |
The header is serialized with bincode and zero-padded to 4 KiB.
Validation on open:
- magic must equal b"GRAF" (reject otherwise)
- format_version must be <= FORMAT_VERSION (reject unknown future versions)
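The two validation rules can be sketched as a small parser over the first 52 bytes of the file. This is a minimal illustration, not the Grafeo implementation: it reads fields at the fixed offsets from the table, assuming the bincode encoding in use matches that layout. `HeaderError`, `FileHeader` (as a Rust struct), and `parse_file_header` are hypothetical names.

```rust
/// Current container format version (from the table above).
const FORMAT_VERSION: u32 = 1;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum HeaderError {
    TooShort,
    BadMagic,
    UnsupportedVersion(u32),
}

/// FileHeader fields decoded from their fixed offsets.
#[derive(Debug, PartialEq)]
struct FileHeader {
    format_version: u32,
    page_size: u32,
    creation_timestamp_ms: u64,
    creator_version: String,
}

fn le_u32(b: &[u8]) -> u32 {
    u32::from_le_bytes([b[0], b[1], b[2], b[3]])
}

fn le_u64(b: &[u8]) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(&b[..8]);
    u64::from_le_bytes(a)
}

/// Parse and validate the non-reserved part of the header page.
fn parse_file_header(buf: &[u8]) -> Result<FileHeader, HeaderError> {
    if buf.len() < 52 {
        return Err(HeaderError::TooShort);
    }
    // Rule 1: magic must equal b"GRAF".
    if &buf[0..4] != b"GRAF" {
        return Err(HeaderError::BadMagic);
    }
    // Rule 2: reject versions newer than this binary understands.
    let format_version = le_u32(&buf[4..8]);
    if format_version > FORMAT_VERSION {
        return Err(HeaderError::UnsupportedVersion(format_version));
    }
    // creator_version is zero-padded UTF-8, so cut at the first zero byte.
    let raw = &buf[20..52];
    let end = raw.iter().position(|&b| b == 0).unwrap_or(raw.len());
    Ok(FileHeader {
        format_version,
        page_size: le_u32(&buf[8..12]),
        creation_timestamp_ms: le_u64(&buf[12..20]),
        creator_version: String::from_utf8_lossy(&raw[..end]).into_owned(),
    })
}
```

Checking the magic before the version means a non-Grafeo file is reported as such rather than as an unsupported version.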
DbHeader H1/H2 (0x1000 and 0x2000, 4 KiB each)
Two alternating header slots provide crash safety. On each checkpoint, the inactive slot is overwritten with the new state, then fsynced. If the process crashes mid-write, the other slot still contains valid metadata.
| Field | Size | Type | Description |
|---|---|---|---|
| iteration | 8 | u64 LE | Monotonic counter, higher = current |
| checksum | 4 | u32 LE | CRC-32 of section directory (v2) or snapshot (v1) |
| snapshot_length | 8 | u64 LE | 0 for v2 section format, >0 for v1 blob format |
| epoch | 8 | u64 LE | MVCC epoch at checkpoint |
| transaction_id | 8 | u64 LE | Last committed transaction ID |
| node_count | 8 | u64 LE | LPG node count |
| edge_count | 8 | u64 LE | LPG edge count |
| timestamp_ms | 8 | u64 LE | Checkpoint timestamp (Unix epoch ms) |
| (reserved) | ~3940 | - | Zero-filled to 4 KiB |
Active header selection: On open, read both H1 and H2. The header with the higher iteration value is the active state. If both are empty (iteration == 0), the database has never been checkpointed.
v1/v2 detection: If the active header has snapshot_length > 0, the file uses the v1 blob format (a single bincode snapshot starting at DATA_OFFSET). If snapshot_length == 0 and iteration > 0, the file uses the v2 section format with a section directory at 0x3000.
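The selection and detection rules above fit in two small functions. The sketch below is illustrative only: `DbHeader` here carries just the two fields the rules depend on, `Format` is a hypothetical enum, and resolving equal iteration values in favour of H1 is an assumption the text does not specify. The text also does not say how a slot torn by a crash is detected, so that check is omitted.

```rust
/// Only the DbHeader fields needed to pick the active slot and the format.
#[derive(Debug, Clone, Copy, PartialEq)]
struct DbHeader {
    iteration: u64,
    snapshot_length: u64,
}

/// Hypothetical result type for format detection.
#[derive(Debug, PartialEq)]
enum Format {
    /// Both slots have iteration == 0.
    NeverCheckpointed,
    /// Single bincode snapshot starting at DATA_OFFSET.
    V1Blob { snapshot_length: u64 },
    /// Section directory at 0x3000.
    V2Sections,
}

/// The slot with the higher iteration is active.
/// Assumption: on a tie, H1 is returned.
fn select_active(h1: DbHeader, h2: DbHeader) -> DbHeader {
    if h2.iteration > h1.iteration {
        h2
    } else {
        h1
    }
}

/// Apply the v1/v2 detection rule to the active header.
fn detect_format(active: &DbHeader) -> Format {
    if active.snapshot_length > 0 {
        Format::V1Blob { snapshot_length: active.snapshot_length }
    } else if active.iteration > 0 {
        Format::V2Sections
    } else {
        Format::NeverCheckpointed
    }
}
```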
Section Directory (0x3000, 4 KiB)
A fixed-size page containing an 8-byte directory header followed by an array of section entries. Each entry is 32 bytes. Maximum capacity: 127 sections ((4096 - 8) / 32, rounded down).
Directory Header
| Offset | Size | Type | Field |
|---|---|---|---|
| 0 | 4 | u32 LE | entry_count |
| 4 | 4 | u32 LE | reserved (zero) |
Directory Entry (32 bytes each, starting at offset 8)
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| 0 | 4 | u32 LE | section_type | Section type ID (see table below) |
| 4 | 1 | u8 | version | Per-section format version |
| 5 | 1 | u8 | flags | Bit 0: required, Bit 1: mmap-able |
| 6 | 2 | u16 LE | reserved | Zero |
| 8 | 8 | u64 LE | offset | Byte offset from file start |
| 16 | 8 | u64 LE | length | Byte length of section data |
| 24 | 4 | u32 LE | checksum | CRC-32 of section data |
| 28 | 4 | u32 LE | reserved | Zero |
Remaining bytes after the last entry are zero-filled to 4 KiB.
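As an illustration of the layout above, the following sketch decodes a directory page into entries using the offsets from the two tables. `DirEntry` and `parse_directory` are hypothetical names, and rejecting an `entry_count` above 127 is an assumption about how a reader might defend against a corrupt page.

```rust
const PAGE_SIZE: usize = 4096;
const DIR_HEADER_SIZE: usize = 8;
const ENTRY_SIZE: usize = 32;
/// (4096 - 8) / 32 with integer division.
const MAX_ENTRIES: usize = (PAGE_SIZE - DIR_HEADER_SIZE) / ENTRY_SIZE;

/// One decoded directory entry (reserved fields are skipped).
#[derive(Debug, PartialEq)]
struct DirEntry {
    section_type: u32,
    version: u8,
    flags: u8,
    offset: u64,
    length: u64,
    checksum: u32,
}

fn le_u32(b: &[u8]) -> u32 {
    u32::from_le_bytes([b[0], b[1], b[2], b[3]])
}

fn le_u64(b: &[u8]) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(&b[..8]);
    u64::from_le_bytes(a)
}

/// Decode the directory page; returns None if the page is short
/// or entry_count exceeds the page capacity.
fn parse_directory(page: &[u8]) -> Option<Vec<DirEntry>> {
    if page.len() < PAGE_SIZE {
        return None;
    }
    let count = le_u32(&page[0..4]) as usize;
    if count > MAX_ENTRIES {
        return None;
    }
    let mut entries = Vec::with_capacity(count);
    for i in 0..count {
        let start = DIR_HEADER_SIZE + i * ENTRY_SIZE;
        let e = &page[start..start + ENTRY_SIZE];
        entries.push(DirEntry {
            section_type: le_u32(&e[0..4]),
            version: e[4],
            flags: e[5],
            offset: le_u64(&e[8..16]),
            length: le_u64(&e[16..24]),
            checksum: le_u32(&e[24..28]),
        });
    }
    Some(entries)
}
```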
Section Types
| Value | Name | Required | Mmap-able | Description |
|---|---|---|---|---|
| 1 | CATALOG | yes | no | Schema defs, index metadata, epoch, config |
| 2 | LPG_STORE | yes | no | Nodes, edges, properties, named graphs |
| 3 | RDF_STORE | no | no | RDF triples, named graphs |
| 10 | VECTOR_STORE | no | yes | Embeddings + HNSW topology |
| 11 | TEXT_INDEX | no | yes | BM25 postings + term dictionary |
| 12 | RDF_RING | no | yes | Wavelet trees + dictionary |
| 20 | PROPERTY_INDEX | no | yes | Property hash/btree indexes |
Type ranges:
- 1-9: Data sections (authoritative, cannot be rebuilt)
- 10-19: Index sections (derived, can be rebuilt from data)
- 20+: Acceleration structures (PROPERTY_INDEX, ID 20, is the only one assigned so far)
Flags:
- Bit 0 (required): If set, older binaries that don't recognize this section type must refuse to open the file. If clear, the section can be safely skipped (the database opens without that index).
- Bit 1 (mmap-able): If set, the section uses a fixed binary layout suitable for zero-copy memory-mapped access. If clear, the section must be deserialized into RAM (bincode format).
Empty sections are omitted from the directory entirely. If no RDF data exists, there is no RDF_STORE entry.
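The flag semantics can be sketched as a decision a reader makes per directory entry. This is an illustration under assumptions: `Action`, `is_known`, and `action_for` are hypothetical names, and the set of known type IDs is simply the one listed in the Section Types table.

```rust
const FLAG_REQUIRED: u8 = 1 << 0;
const FLAG_MMAP_ABLE: u8 = 1 << 1;

/// What a reader does with one directory entry (hypothetical type).
#[derive(Debug, PartialEq)]
enum Action {
    Load,
    Skip,
    RefuseOpen,
}

/// Section type IDs from the Section Types table.
fn is_known(section_type: u32) -> bool {
    matches!(section_type, 1 | 2 | 3 | 10 | 11 | 12 | 20)
}

/// Unknown sections are skipped unless bit 0 marks them as required.
fn action_for(section_type: u32, flags: u8) -> Action {
    if is_known(section_type) {
        Action::Load
    } else if flags & FLAG_REQUIRED != 0 {
        Action::RefuseOpen
    } else {
        Action::Skip
    }
}

/// Bit 1: the section may be memory-mapped instead of deserialized.
fn is_mmap_able(flags: u8) -> bool {
    flags & FLAG_MMAP_ABLE != 0
}
```

For example, an older binary that meets an unknown type 13 with only the mmap-able bit set would skip it and open the database without that index.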
Section Data (0x4000+)
Sections are written sequentially after the directory, each starting at a page-aligned (4 KiB) offset. The next section starts at the first 4 KiB boundary at or after the end of the previous section.
0x4000 [CATALOG data ................] pad
0x5000 [LPG_STORE data ..............] pad
0xA000 [VECTOR_STORE data ...........] pad
...
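The offsets in the layout above follow from rounding each section's end up to the next page boundary. A minimal sketch, with hypothetical helper names `align_up` and `layout`, and with section lengths chosen only to reproduce the example offsets:

```rust
const PAGE_SIZE: u64 = 4096;
/// First byte after the section directory.
const DATA_START: u64 = 0x4000;

/// Round an offset up to the next 4 KiB boundary (no-op if already aligned).
fn align_up(offset: u64) -> u64 {
    (offset + PAGE_SIZE - 1) / PAGE_SIZE * PAGE_SIZE
}

/// Given section lengths in write order, return each section's start offset.
fn layout(lengths: &[u64]) -> Vec<u64> {
    let mut offsets = Vec::with_capacity(lengths.len());
    let mut next = DATA_START;
    for &len in lengths {
        offsets.push(next);
        // The following section begins at the boundary at or after this one's end.
        next = align_up(next + len);
    }
    offsets
}
```

With a 1,000-byte catalog, an 18,000-byte LPG store, and any vector store, `layout(&[1000, 18000, 50])` yields 0x4000, 0x5000, and 0xA000, matching the diagram.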
Data Section Encoding (Catalog, LPG, RDF)
Data sections use bincode serialization (standard configuration). They are fully deserialized into RAM on load. The internal format is version-specific (the version byte in the directory entry allows independent evolution).
Index Section Encoding (Vector, Text, Ring, Property)
Index sections currently use bincode serialization (version 1). Future versions may switch to fixed binary layouts for zero-copy mmap access. The mmap_able flag indicates whether the section can be memory-mapped after being written.
Checkpoint Flow
Checkpoint(reason):
1. Collect target sections based on reason:
   - Explicit: all sections (dirty or clean)
   - Periodic: dirty sections only (skip if none dirty)
   - Eviction: lowest-priority dirty section only
2. For each target section:
   a. Serialize section data to bytes (Section::serialize())
   b. Compute CRC-32
3. Write sections to new page-aligned offsets in the file
4. Build section directory with updated entries
5. Write section directory at 0x3000
6. Build new DbHeader (increment iteration, set checksum/counts)
7. Write DbHeader to inactive slot (H1 or H2)
8. fsync
9. (Engine) truncate WAL
Dirty tracking: Each section has an is_dirty() flag. Mutations in the store mark the corresponding section dirty. Periodic checkpoints skip sections that haven't changed since the last flush.
Crash safety: If the process crashes at any point during steps 3-7, the active header still points to the previous valid state. The dual-header alternation ensures atomicity of the commit point (step 8).
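Step 1 of the flow, choosing which sections to write, can be sketched as follows. The names `Reason`, `SectionState`, and `targets` are hypothetical, and so is the `priority` field: the text mentions a lowest-priority dirty section but does not define how priority is represented, so a plain number where smaller means lower priority is assumed here.

```rust
/// Why the checkpoint was triggered.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Reason {
    Explicit,
    Periodic,
    Eviction,
}

/// Per-section bookkeeping assumed for this sketch.
#[derive(Debug, Clone, PartialEq)]
struct SectionState {
    section_type: u32,
    dirty: bool,
    /// Hypothetical: smaller value = lower priority = evicted first.
    priority: u8,
}

/// Return the section type IDs to write for the given reason.
fn targets(reason: Reason, sections: &[SectionState]) -> Vec<u32> {
    match reason {
        // All sections, dirty or clean.
        Reason::Explicit => sections.iter().map(|s| s.section_type).collect(),
        // Dirty sections only; an empty result means the checkpoint is skipped.
        Reason::Periodic => sections
            .iter()
            .filter(|s| s.dirty)
            .map(|s| s.section_type)
            .collect(),
        // At most one section: the lowest-priority dirty one.
        Reason::Eviction => sections
            .iter()
            .filter(|s| s.dirty)
            .min_by_key(|s| s.priority)
            .map(|s| s.section_type)
            .into_iter()
            .collect(),
    }
}
```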
Memory-Mapped Section Access
After a checkpoint, index sections with flags.mmap_able = true can be memory-mapped for zero-copy read access. This is the foundation for tiered storage: when RAM is scarce, index sections are flushed to the container and served via mmap instead of keeping the full data in heap memory.
Lifecycle:
- Engine flushes dirty sections via checkpoint
- Engine calls mmap_section() for index sections
- CRC-32 is verified against the mmap'd bytes (also warms page cache)
- Engine drops in-memory copy of the section data
- Reads go through the mmap (OS page cache manages eviction)
- Before next checkpoint: drop all mmaps, then write
Platform note: On Windows, the OS rejects writes to a file with active memory mappings (error 1224). All mmap handles must be dropped before write_sections(). On Linux/macOS, writes succeed with active mappings, but the drop-before-write lifecycle is used on all platforms for consistency.
Recovery
Open database:
1. Read FileHeader at 0x0000, validate magic and format_version
2. Read both DbHeaders (H1 at 0x1000, H2 at 0x2000)
3. Select active header (higher iteration)
4. Detect format:
   - snapshot_length > 0: v1 blob format (read snapshot at DATA_OFFSET)
   - snapshot_length == 0 && iteration > 0: v2 section format
5. For v2: read section directory at 0x3000
6. For each directory entry:
   a. Read section data at entry.offset
   b. Verify CRC-32
   c. Deserialize into RAM (or mmap if configured as ForceDisk)
7. If sidecar WAL exists: replay committed transactions since last flush
8. Database is ready
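Step 6b can be illustrated with a dependency-free CRC-32. The specification says CRC-32 without naming a polynomial; this sketch assumes the common IEEE 802.3 variant (reflected polynomial 0xEDB88320, as used by zlib and PNG). A bitwise loop is used for brevity, where a real implementation would more likely use a table-driven or hardware-accelerated routine. `verify_section` is a hypothetical helper.

```rust
/// Bitwise CRC-32 (assumed variant: IEEE 802.3, reflected, init and
/// final XOR of 0xFFFFFFFF).
fn crc32(data: &[u8]) -> u32 {
    let mut crc: u32 = 0xFFFF_FFFF;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // mask is all ones when the low bit is set, otherwise zero.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Compare a section's bytes against the checksum in its directory entry.
fn verify_section(data: &[u8], expected: u32) -> bool {
    crc32(data) == expected
}
```

The standard check value for this variant is `crc32(b"123456789") == 0xCBF43926`, which makes a convenient self-test.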
Periodic Checkpoints
When Config::checkpoint_interval is set, a background thread periodically flushes sections to the container. This bounds the WAL size and limits data loss on crash to at most one interval.
The timer polls a shutdown flag every 100 ms. On database close, the timer is stopped before the final checkpoint to prevent races.
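A timer of this shape can be sketched with a background thread and an atomic shutdown flag. This is not the Grafeo implementation: `CheckpointTimer` and its methods are hypothetical, and the poll period is a parameter here (the text fixes it at 100 ms) so the sketch can be exercised quickly.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::{Duration, Instant};

/// Hypothetical handle for a periodic checkpoint thread.
struct CheckpointTimer {
    shutdown: Arc<AtomicBool>,
    handle: JoinHandle<()>,
}

impl CheckpointTimer {
    /// Spawn a thread that wakes every `poll`, checks the shutdown flag,
    /// and runs `checkpoint` once `interval` has elapsed since the last run.
    fn start<F>(interval: Duration, poll: Duration, mut checkpoint: F) -> Self
    where
        F: FnMut() + Send + 'static,
    {
        let shutdown = Arc::new(AtomicBool::new(false));
        let flag = Arc::clone(&shutdown);
        let handle = thread::spawn(move || {
            let mut last = Instant::now();
            while !flag.load(Ordering::Acquire) {
                thread::sleep(poll);
                // Re-check the flag so no checkpoint starts after shutdown.
                if last.elapsed() >= interval && !flag.load(Ordering::Acquire) {
                    checkpoint();
                    last = Instant::now();
                }
            }
        });
        CheckpointTimer { shutdown, handle }
    }

    /// Signal shutdown and wait for the thread to exit. Once this returns,
    /// the caller can run the final checkpoint without racing the timer.
    fn stop(self) {
        self.shutdown.store(true, Ordering::Release);
        let _ = self.handle.join();
    }
}
```

Polling in short steps rather than sleeping for the whole interval keeps shutdown latency bounded by one poll period.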
File Locking
- Exclusive lock on create/open (read-write mode): prevents concurrent writers on the same file.
- Shared lock on open (read-only mode): allows multiple concurrent readers.
- Locks are released on close or drop.
Size Estimates
| Component | Size |
|---|---|
| Fixed overhead (headers + directory) | 16 KiB |
| Empty database (headers + empty catalog + LPG) | ~20 KiB |
| Per-section overhead | 32 bytes (directory entry) + page alignment padding |
| Typical 10K-node LPG | ~1-5 MB |
| 1M-vector HNSW index (384-dim, f32) | ~1.5 GB |
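The vector index figure can be sanity-checked with simple arithmetic. The helper below (a hypothetical name) counts only the raw embedding payload; HNSW neighbor lists and any serialization overhead come on top, so the table's ~1.5 GB appears to be dominated by the embeddings themselves.

```rust
/// Bytes needed for the raw embeddings alone, excluding graph topology.
fn raw_vector_bytes(count: u64, dims: u64, bytes_per_component: u64) -> u64 {
    count * dims * bytes_per_component
}
```

For 1,000,000 vectors of 384 f32 components this gives 1,536,000,000 bytes, roughly 1.5 GB.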
Version History
| Version | Format | Notes |
|---|---|---|
| v1 (0.5.0-0.5.34) | Monolithic blob at DATA_OFFSET | Single bincode snapshot |
| v2 (0.5.35+) | Section-based with directory at 0x3000 | Independent sections, mmap support |