Skip to content

.grafeo Container Format Specification

The .grafeo file is the single-file persistence format for Grafeo databases. It stores data in typed sections, each independently addressable, checksummed, and (for index sections) memory-mappable.

File Layout

Offset    Size     Contents
────────────────────────────────────────────────────
0x0000    4 KiB    FileHeader (magic, version, page size)
0x1000    4 KiB    DbHeader H1 (iteration, checksum, metadata)
0x2000    4 KiB    DbHeader H2 (alternating crash-safe copy)
0x3000    4 KiB    Section Directory (type/offset/length/CRC entries)
0x4000+   varies   Section data (page-aligned per section)

Total header overhead: 16 KiB. All regions are page-aligned (4 KiB boundaries).


FileHeader (0x0000, 4 KiB)

Written once at database creation. Never modified afterwards.

Offset Size Type Field Description
0 4 [u8; 4] magic 0x47524146 ("GRAF")
4 4 u32 LE format_version 1 (current)
8 4 u32 LE page_size Always 4096
12 8 u64 LE creation_timestamp_ms Unix epoch milliseconds
20 32 [u8; 32] creator_version UTF-8 Grafeo version, zero-padded
52 4044 - (reserved) Zero-filled

The header is serialized with bincode and zero-padded to 4 KiB.

Validation on open:

  • magic must equal b"GRAF" (reject otherwise)
  • format_version must be <= FORMAT_VERSION (reject unknown future versions)

DbHeader H1/H2 (0x1000 and 0x2000, 4 KiB each)

Two alternating header slots provide crash safety. On each checkpoint, the inactive slot is overwritten with the new state, then fsynced. If the process crashes mid-write, the other slot still contains valid metadata.

Field Size Type Description
iteration 8 u64 LE Monotonic counter, higher = current
checksum 4 u32 LE CRC-32 of section directory (v2) or snapshot (v1)
snapshot_length 8 u64 LE 0 for v2 section format, >0 for v1 blob format
epoch 8 u64 LE MVCC epoch at checkpoint
transaction_id 8 u64 LE Last committed transaction ID
node_count 8 u64 LE LPG node count
edge_count 8 u64 LE LPG edge count
timestamp_ms 8 u64 LE Checkpoint timestamp (Unix epoch ms)
(reserved) ~3940 - Zero-filled to 4 KiB

Active header selection: On open, read both H1 and H2. The header with the higher iteration value is the active state. If both are empty (iteration == 0), the database has never been checkpointed.

v1/v2 detection: If the active header has snapshot_length > 0, the file uses the v1 blob format (a single bincode snapshot starting at DATA_OFFSET). If snapshot_length == 0 and iteration > 0, the file uses the v2 section format with a section directory at 0x3000.


Section Directory (0x3000, 4 KiB)

A fixed-size page containing an array of section entries. Each entry is 32 bytes. Maximum capacity: 127 sections ((4096 - 8) / 32).

Directory Header

Offset Size Type Field
0 4 u32 LE entry_count
4 4 u32 LE reserved (zero)

Directory Entry (32 bytes each, starting at offset 8)

Offset Size Type Field Description
0 4 u32 LE section_type Section type ID (see table below)
4 1 u8 version Per-section format version
5 1 u8 flags Bit 0: required, Bit 1: mmap-able
6 2 u16 LE reserved Zero
8 8 u64 LE offset Byte offset from file start
16 8 u64 LE length Byte length of section data
24 4 u32 LE checksum CRC-32 of section data
28 4 u32 LE reserved Zero

Remaining bytes after the last entry are zero-filled to 4 KiB.


Section Types

Value Name Required Mmap-able Description
1 CATALOG yes no Schema defs, index metadata, epoch, config
2 LPG_STORE yes no Nodes, edges, properties, named graphs
3 RDF_STORE no no RDF triples, named graphs
10 VECTOR_STORE no yes Embeddings + HNSW topology
11 TEXT_INDEX no yes BM25 postings + term dictionary
12 RDF_RING no yes Wavelet trees + dictionary
20 PROPERTY_INDEX no yes Property hash/btree indexes

Type ranges:

  • 1-9: Data sections (authoritative, cannot be rebuilt)
  • 10-19: Index sections (derived, can be rebuilt from data)
  • 20+: Reserved for acceleration structures

Flags:

  • Bit 0 (required): If set, older binaries that don't recognize this section type must refuse to open the file. If clear, the section can be safely skipped (the database opens without that index).
  • Bit 1 (mmap-able): If set, the section uses a fixed binary layout suitable for zero-copy memory-mapped access. If clear, the section must be deserialized into RAM (bincode format).

Empty sections are omitted from the directory entirely. If no RDF data exists, there is no RDF_STORE entry.


Section Data (0x4000+)

Sections are written sequentially after the directory, each starting at a page-aligned (4 KiB) offset. The next section starts at the first 4 KiB boundary after the previous section ends.

0x4000  [CATALOG data ................] pad
0x5000  [LPG_STORE data ..............] pad
0xA000  [VECTOR_STORE data ...........] pad
...

Data Section Encoding (Catalog, LPG, RDF)

Data sections use bincode serialization (standard configuration). They are fully deserialized into RAM on load. The internal format is version-specific (the version byte in the directory entry allows independent evolution).

Index Section Encoding (Vector, Text, Ring, Property)

Index sections use bincode serialization currently (version 1). Future versions may switch to fixed binary layouts for zero-copy mmap access. The mmap_able flag indicates whether the section can be memory-mapped after being written.


Checkpoint Flow

Checkpoint(reason):
  1. Collect target sections based on reason:
     - Explicit:   all sections (dirty or clean)
     - Periodic:   dirty sections only (skip if none dirty)
     - Eviction:   lowest-priority dirty section only
  2. For each target section:
     a. Serialize section data to bytes (Section::serialize())
     b. Compute CRC-32
  3. Write sections to new page-aligned offsets in the file
  4. Build section directory with updated entries
  5. Write section directory at 0x3000
  6. Build new DbHeader (increment iteration, set checksum/counts)
  7. Write DbHeader to inactive slot (H1 or H2)
  8. fsync
  9. (Engine) truncate WAL

Dirty tracking: Each section has an is_dirty() flag. Mutations in the store mark the corresponding section dirty. Periodic checkpoints skip sections that haven't changed since the last flush.

Crash safety: If the process crashes at any point during steps 3-7, the active header still points to the previous valid state. The dual-header alternation ensures atomicity of the commit point (step 8).


Memory-Mapped Section Access

After a checkpoint, index sections with flags.mmap_able = true can be memory-mapped for zero-copy read access. This is the foundation for tiered storage: when RAM is scarce, index sections are flushed to the container and served via mmap instead of keeping the full data in heap memory.

Lifecycle:

  1. Engine flushes dirty sections via checkpoint
  2. Engine calls mmap_section() for index sections
  3. CRC-32 is verified against the mmap'd bytes (also warms page cache)
  4. Engine drops in-memory copy of the section data
  5. Reads go through the mmap (OS page cache manages eviction)
  6. Before next checkpoint: drop all mmaps, then write

Platform note: On Windows, the OS rejects writes to a file with active memory mappings (error 1224). All mmap handles must be dropped before write_sections(). On Linux/macOS, writes succeed with active mappings but the drop-before-write lifecycle is used on all platforms for consistency.


Recovery

Open database:
  1. Read FileHeader at 0x0000, validate magic and format_version
  2. Read both DbHeaders (H1 at 0x1000, H2 at 0x2000)
  3. Select active header (highest iteration)
  4. Detect format:
     - snapshot_length > 0: v1 blob format (read snapshot at DATA_OFFSET)
     - snapshot_length == 0 && iteration > 0: v2 section format
  5. For v2: read section directory at 0x3000
  6. For each directory entry:
     a. Read section data at entry.offset
     b. Verify CRC-32
     c. Deserialize into RAM (or mmap if configured as ForceDisk)
  7. If sidecar WAL exists: replay committed transactions since last flush
  8. Database is ready

Periodic Checkpoints

When Config::checkpoint_interval is set, a background thread periodically flushes sections to the container. This bounds the WAL size and limits data loss on crash to at most one interval.

The timer polls a shutdown flag every 100 ms. On database close, the timer is stopped before the final checkpoint to prevent races.


File Locking

  • Exclusive lock on create/open (read-write mode): prevents concurrent writers on the same file.
  • Shared lock on open (read-only mode): allows multiple concurrent readers.
  • Locks are released on close or drop.

Size Estimates

Component Size
Fixed overhead (headers + directory) 16 KiB
Empty database (headers + empty catalog + LPG) ~20 KiB
Per-section overhead 32 bytes (directory entry) + page alignment padding
Typical 10K-node LPG ~1-5 MB
1M-vector HNSW index (384-dim, f32) ~1.5 GB

Version History

Version Format Notes
v1 (0.5.0-0.5.34) Monolithic blob at DATA_OFFSET Single bincode snapshot
v2 (0.5.35+) Section-based with directory at 0x3000 Independent sections, mmap support