Query Optimization¶

Grafeo uses cost-based optimization to select efficient query plans.

Optimizer Pipeline¶

graph LR
    LP[Logical Plan] --> RW[Rewrite Rules]
    RW --> CE[Cardinality Estimation]
    CE --> CBO[Cost-Based Selection]
    CBO --> PP[Physical Plan]

Cost Model¶

The cost model estimates execution cost for plan selection.

Component	Weight	Description
CPU	1.0	Computation cost
I/O	10.0	Disk access cost
Memory	0.5	Memory allocation

Total Cost = CPU_cost * cpu_weight
           + IO_cost * io_weight
           + Mem_cost * mem_weight

Operator	Cost Formula
Scan	rows * column_count
Filter	input_rows * selectivity
Hash Join	build_rows + probe_rows
Sort	rows * log(rows)

The cost model uses real fanout derived from graph statistics (average degree, label cardinalities, edge-type frequencies) instead of hardcoded defaults. This leads to better plan selection for traversal-heavy queries, especially on graphs with skewed degree distributions.

Cardinality Estimation¶

Accurate cardinality estimation is crucial for plan selection.

Statistic	Purpose
Element count	Base cardinality (nodes/edges per label)
Distinct count	Join estimation
Histograms	Range selectivity
Null fraction	Null handling

// Equality predicate
selectivity = 1 / distinct_count

// Range predicate
selectivity = (high - low) / (max - min)

// Join
output_rows = (rows_a * rows_b) / max(distinct_a, distinct_b)

Statistics are collected automatically by the query engine during graph operations. Grafeo tracks per-label and per-property statistics for cardinality estimation.

Join Ordering (DPccp)¶

Graph pattern matching translates to multi-way joins (one per edge in the pattern). DPccp (Dynamic Programming connected complement pairs) efficiently enumerates only connected subgraph pairs, pruning the exponential search space while guaranteeing an optimal join order. Greedy algorithms miss better plans; exhaustive DP is too slow for large patterns. DPccp hits the sweet spot.

For n relations, the number of possible join orders grows explosively:

Relations	Possible Orders
3	12
5	1,680
10	~17 billion

DPccp works by:

Enumerating all connected subgraphs
Finding the optimal join for each subgraph
Building larger plans from smaller optimal plans
Pruning dominated plans

Join Type Selection¶

Join Type	Best When
Hash Join	Large inputs, equality predicates
Nested Loop	Small inner, indexed
Merge Join	Sorted inputs

RDF Query Optimization¶

SPARQL queries over the RDF triple store use a specialized optimization path built on per-predicate statistics and, when available, the Ring Index.

RDF Statistics¶

The RdfStatistics collector iterates all triples to compute:

Statistic	Scope	Purpose
Total triples	Store-wide	Base cardinality for unbound patterns
Subject/predicate/object counts	Store-wide	Domain sizes for join selectivity
Per-predicate triple count	Per predicate	Cardinality of `?s :pred ?o` patterns
Distinct subjects per predicate	Per predicate	Fan-in estimation
Distinct objects per predicate	Per predicate	Fan-out estimation
Functional / inverse-functional flags	Per predicate	Detects 1:1 and N:1 relationships
Object type distribution	Per predicate	Type-aware selectivity
Subject and object histograms	Store-wide	Range and filter selectivity

Statistics Caching¶

Computing statistics requires a full scan of the triple store, which is expensive for large datasets. Grafeo caches the result behind get_or_collect_statistics():

Fast path: if a cached Arc<RdfStatistics> exists, return it immediately (read lock only)
Slow path: compute statistics, store in cache, return
Invalidation: any mutation (insert, delete, bulk load) clears the cache

This ensures that a burst of read queries pays the collection cost at most once.

Triple Pattern Cardinality¶

The optimizer estimates how many triples each pattern will produce:

Pattern             Estimate
------------------------------------------------------------
?s ?p ?o            total_triples (full scan)
<alix> ?p ?o        total_triples / subject_count
?s <knows> ?o       predicate_stats.triple_count
?s ?p <gus>         total_triples / object_count
<alix> <knows> ?o   predicate_stats.avg_objects_per_subject
?s <knows> <gus>    predicate_stats.avg_subjects_per_object
<alix> ?p <gus>     predicate_count (rare pattern)
<alix> <knows> <gus> 1.0 (existence check)

Join selectivity between two patterns sharing a variable is estimated as 1 / domain_size, where the domain is the number of distinct values for the shared variable's position (subject, predicate, or object).

Ring Index Integration¶

When the ring-index feature is enabled, the planner replaces statistical estimates with exact counts from the Ring's wavelet trees. This changes cardinality estimation from an approximation to a precise lookup in O(log sigma) time:

// Statistical estimate: 10.0 (default for unknown predicates)
// Ring exact count:     847  (wavelet tree rank operation)

Exact counts improve join ordering decisions, especially when predicates have highly skewed cardinalities that statistical defaults cannot capture.

Cost-Based Join Reordering¶

For multi-way joins (queries with multiple triple patterns), the planner:

Estimates cardinality for each pattern (via Ring counts or statistics)
Sorts patterns by ascending cardinality
Folds left-to-right with pairwise hash joins, building on the smallest intermediate results first

When the Ring Index is available and all patterns are simple triple scans without LANG() / DATATYPE() dependencies, the planner upgrades to a leapfrog worst-case optimal join (WCOJ) that intersects sorted streams directly without materializing intermediates.

Dictionary Encoding¶

The Ring Index's term dictionary maps every unique RDF term to a 32-bit integer ID. This provides compact in-memory representation during query execution: wavelet tree operations work on integer IDs rather than variable-length strings, and term resolution back to strings happens only at the result boundary.