DuckDB Integration¶
Combine Grafeo for graph storage and traversal with DuckDB for analytical SQL queries. Data flows between the two as Apache Arrow Tables with zero copies: no serialization, no parsing, no intermediate formats.
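At its simplest, the round trip is a handful of lines. A minimal sketch (the setup below builds a richer graph to work with):
import duckdb
import grafeo

db = grafeo.GrafeoDB()
db.execute("INSERT (:Person {name: 'Alix', age: 30})")

nodes = db.nodes_to_arrow()                       # graph -> Arrow Table
duckdb.sql("SELECT name, age FROM nodes").show()  # DuckDB scans it in place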
Topics Covered¶
- Exporting graph data to Arrow Tables
- Querying nodes and edges with DuckDB SQL
- Joining nodes and edges for connectivity analysis
- Combining GQL pattern matching with SQL aggregation
- RDF/SPARQL results in DuckDB
- Writing results to Parquet for data lake workflows
Setup¶
Build a Social Graph¶
import duckdb
import grafeo

db = grafeo.GrafeoDB()
# People in three cities
db.execute("""
INSERT (:Person {name: 'Alix', age: 30, city: 'Amsterdam'})
INSERT (:Person {name: 'Gus', age: 35, city: 'Berlin'})
INSERT (:Person {name: 'Vincent', age: 28, city: 'Paris'})
INSERT (:Person {name: 'Jules', age: 42, city: 'Amsterdam'})
INSERT (:Person {name: 'Mia', age: 26, city: 'Berlin'})
INSERT (:Person {name: 'Butch', age: 38, city: 'Paris'})
INSERT (:Person {name: 'Django', age: 31, city: 'Amsterdam'})
INSERT (:Person {name: 'Shosanna', age: 29, city: 'Berlin'})
""")
# Companies
db.execute("""
INSERT (:Company {name: 'GraphTech', industry: 'Technology', founded: 2018})
INSERT (:Company {name: 'DataFlow', industry: 'Technology', founded: 2020})
INSERT (:Company {name: 'CanalBrew', industry: 'Food & Drink', founded: 2015})
""")
# Relationships
db.execute("""
MATCH (a:Person {name: 'Alix'}), (b:Person {name: 'Gus'})
INSERT (a)-[:KNOWS {since: 2019}]->(b)
""")
db.execute("""
MATCH (a:Person {name: 'Alix'}), (b:Person {name: 'Jules'})
INSERT (a)-[:KNOWS {since: 2020}]->(b)
""")
db.execute("""
MATCH (a:Person {name: 'Gus'}), (b:Person {name: 'Vincent'})
INSERT (a)-[:KNOWS {since: 2021}]->(b)
""")
db.execute("""
MATCH (a:Person {name: 'Vincent'}), (b:Person {name: 'Butch'})
INSERT (a)-[:KNOWS {since: 2018}]->(b)
""")
db.execute("""
MATCH (a:Person {name: 'Jules'}), (b:Person {name: 'Django'})
INSERT (a)-[:KNOWS {since: 2022}]->(b)
""")
db.execute("""
MATCH (a:Person {name: 'Mia'}), (b:Person {name: 'Shosanna'})
INSERT (a)-[:KNOWS {since: 2023}]->(b)
""")
db.execute("""
MATCH (a:Person {name: 'Django'}), (b:Person {name: 'Shosanna'})
INSERT (a)-[:KNOWS {since: 2021}]->(b)
""")
# Employment edges
db.execute("""
MATCH (p:Person {name: 'Alix'}), (c:Company {name: 'GraphTech'})
INSERT (p)-[:WORKS_AT {role: 'Engineer', started: 2019}]->(c)
""")
db.execute("""
MATCH (p:Person {name: 'Gus'}), (c:Company {name: 'GraphTech'})
INSERT (p)-[:WORKS_AT {role: 'Manager', started: 2018}]->(c)
""")
db.execute("""
MATCH (p:Person {name: 'Vincent'}), (c:Company {name: 'DataFlow'})
INSERT (p)-[:WORKS_AT {role: 'Analyst', started: 2021}]->(c)
""")
db.execute("""
MATCH (p:Person {name: 'Jules'}), (c:Company {name: 'CanalBrew'})
INSERT (p)-[:WORKS_AT {role: 'Owner', started: 2015}]->(c)
""")
db.execute("""
MATCH (p:Person {name: 'Mia'}), (c:Company {name: 'DataFlow'})
INSERT (p)-[:WORKS_AT {role: 'Designer', started: 2022}]->(c)
""")
print(f"Graph: {db.node_count} nodes, {db.edge_count} edges")
Query Nodes with SQL¶
Export all nodes as an Arrow Table and query them directly in DuckDB:
nodes = db.nodes_to_arrow()
print(nodes.schema)
# id: uint64
# labels: list<item: string>
# name: string
# age: int64
# city: string
# industry: string
# founded: int64
DuckDB can query Arrow Tables in place, with no data copying:
result = duckdb.sql("""
SELECT city, COUNT(*) AS count, ROUND(AVG(age), 1) AS avg_age
FROM nodes
WHERE list_contains(labels, 'Person')
GROUP BY city
ORDER BY count DESC
""")
print(result)
┌───────────┬───────┬─────────┐
│ city │ count │ avg_age │
│ varchar │ int64 │ double │
├───────────┼───────┼─────────┤
│ Amsterdam │ 3 │ 34.3 │
│ Berlin │ 3 │ 30.0 │
│ Paris │ 2 │ 33.0 │
└───────────┴───────┴─────────┘
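When you want to keep processing in Python, a DuckDB relation can be materialized back into Arrow or pandas via its standard .arrow() and .df() methods:
table = result.arrow()  # back to a pyarrow.Table
df = result.df()        # or a pandas DataFrame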
Label Distribution¶
result = duckdb.sql("""
SELECT
unnest(labels) AS label,
COUNT(*) AS count
FROM nodes
GROUP BY label
ORDER BY count DESC
""")
print(result)
┌─────────┬───────┐
│ label │ count │
│ varchar │ int64 │
├─────────┼───────┤
│ Person │ 8 │
│ Company │ 3 │
└─────────┴───────┘
Edge Analytics¶
Export edges and register both tables in a DuckDB connection for joins:
edges = db.edges_to_arrow()
print(edges.schema)
# id: uint64
# type: string
# source: uint64
# target: uint64
# since: int64
# role: string
# started: int64

conn = duckdb.connect()
conn.register("nodes", nodes)
conn.register("edges", edges)
Most Connected People¶
Count outgoing and incoming KNOWS relationships per person:
result = conn.sql("""
WITH connections AS (
SELECT source AS node_id, COUNT(*) AS degree
FROM edges
WHERE type = 'KNOWS'
GROUP BY source
UNION ALL
SELECT target AS node_id, COUNT(*) AS degree
FROM edges
WHERE type = 'KNOWS'
GROUP BY target
)
SELECT n.name, SUM(c.degree) AS total_connections
FROM connections c
JOIN nodes n ON c.node_id = n.id
GROUP BY n.name
ORDER BY total_connections DESC
LIMIT 5
""")
print(result)
┌──────────┬───────────────────┐
│ name │ total_connections │
│ varchar │ int128 │
├──────────┼───────────────────┤
│ Django │ 2 │
│ Shosanna │ 2 │
│ Alix │ 2 │
│ Gus │ 2 │
│ Vincent │ 2 │
└──────────┴───────────────────┘
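Note that this query only counts people who have at least one KNOWS edge. If you also want zero-connection people in the result, a LEFT JOIN from nodes keeps them with a count of 0. A sketch:
result = conn.sql("""
    SELECT n.name, COUNT(e.id) AS total_connections
    FROM nodes n
    LEFT JOIN edges e
      ON (e.source = n.id OR e.target = n.id) AND e.type = 'KNOWS'
    WHERE list_contains(n.labels, 'Person')
    GROUP BY n.name
    ORDER BY total_connections DESC
""")
print(result)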
Company Headcount¶
result = conn.sql("""
SELECT
c.name AS company,
c.industry,
COUNT(*) AS employees,
ROUND(AVG(p.age), 1) AS avg_employee_age
FROM edges e
JOIN nodes p ON e.source = p.id
JOIN nodes c ON e.target = c.id
WHERE e.type = 'WORKS_AT'
GROUP BY c.name, c.industry
ORDER BY employees DESC
""")
print(result)
┌───────────┬──────────────┬───────────┬──────────────────┐
│ company │ industry │ employees │ avg_employee_age │
│ varchar │ varchar │ int64 │ double │
├───────────┼──────────────┼───────────┼──────────────────┤
│ GraphTech │ Technology │ 2 │ 32.5 │
│ DataFlow │ Technology │ 2 │ 27.0 │
│ CanalBrew │ Food & Drink │ 1 │ 42.0 │
└───────────┴──────────────┴───────────┴──────────────────┘
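Group filters work the same way; for example, HAVING keeps only companies with more than one employee. A sketch reusing the registered tables:
result = conn.sql("""
    SELECT c.name AS company, COUNT(*) AS employees
    FROM edges e
    JOIN nodes c ON e.target = c.id
    WHERE e.type = 'WORKS_AT'
    GROUP BY c.name
    HAVING COUNT(*) > 1
""")
print(result)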
Graph-to-Warehouse Pattern¶
Use Grafeo for graph pattern matching, then hand the results to DuckDB for aggregation. This combines the strengths of both: Grafeo handles traversals and pattern matching, DuckDB handles analytical SQL.
Step 1: Pattern match in Grafeo¶
# Find all two-hop connections: who knows someone who knows someone?
result = db.execute("""
MATCH (a:Person)-[:KNOWS]->(b:Person)-[:KNOWS]->(c:Person)
WHERE a <> c
RETURN a.name AS person, b.name AS via, c.name AS reaches
""")
# Export the result to an Arrow Table
friend_paths = result.to_arrow()
print(f"Found {friend_paths.num_rows} two-hop paths")
Step 2: Analyze in DuckDB¶
result = duckdb.sql("""
SELECT
person,
COUNT(DISTINCT reaches) AS reachable_people,
LIST(DISTINCT reaches) AS who
FROM friend_paths
GROUP BY person
ORDER BY reachable_people DESC
""")
print(result)
This pattern works well for any graph-then-aggregate workflow:
- MATCH a subgraph pattern with GQL (paths, neighborhoods, motifs)
- Export the tabular result to Arrow via result.to_arrow()
- Analyze in DuckDB with window functions, pivots, GROUP BY
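The whole workflow fits in a small helper. A sketch with a hypothetical graph_then_aggregate function (it assumes, as above, that query results expose to_arrow()):
def graph_then_aggregate(db, gql_query, sql_query, table_name="matched"):
    """Run a GQL pattern match, then aggregate the Arrow result in DuckDB."""
    matched = db.execute(gql_query).to_arrow()
    conn = duckdb.connect()
    conn.register(table_name, matched)
    return conn.sql(sql_query)

# Usage: count two-hop reach per person
reach = graph_then_aggregate(
    db,
    "MATCH (a:Person)-[:KNOWS]->(b:Person)-[:KNOWS]->(c:Person) "
    "WHERE a <> c RETURN a.name AS person, c.name AS reaches",
    "SELECT person, COUNT(DISTINCT reaches) AS n FROM matched GROUP BY person",
)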
RDF/SPARQL + DuckDB¶
If you work with RDF data, you can run SPARQL queries in Grafeo and analyze the results in DuckDB.
Load RDF triples¶
db_rdf = grafeo.GrafeoDB()
db_rdf.execute_sparql("""
PREFIX ex: <http://example.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
INSERT DATA {
ex:alix a foaf:Person ; foaf:name "Alix" ; ex:city "Amsterdam" ; ex:age 30 .
ex:gus a foaf:Person ; foaf:name "Gus" ; ex:city "Berlin" ; ex:age 35 .
ex:vincent a foaf:Person ; foaf:name "Vincent" ; ex:city "Paris" ; ex:age 28 .
ex:jules a foaf:Person ; foaf:name "Jules" ; ex:city "Amsterdam" ; ex:age 42 .
ex:alix foaf:knows ex:gus .
ex:alix foaf:knows ex:jules .
ex:gus foaf:knows ex:vincent .
}
""")
Query with SPARQL, analyze with SQL¶
result = db_rdf.execute_sparql("""
PREFIX ex: <http://example.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?city ?age
WHERE {
?person a foaf:Person ;
foaf:name ?name ;
ex:city ?city ;
ex:age ?age .
}
""")
# Export SPARQL results to Arrow
people_table = result.to_arrow()
# Aggregate in DuckDB
stats = duckdb.sql("""
SELECT city, COUNT(*) AS count, AVG(age) AS avg_age
FROM people_table
GROUP BY city
ORDER BY count DESC
""")
print(stats)
The SPARQL query handles the graph pattern matching (triple patterns, OPTIONAL joins, property paths). DuckDB handles the analytical heavy lifting (aggregation, window functions, HAVING clauses).
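For instance, ranking people by age within each city is a window function, something SPARQL 1.1 has no direct equivalent for, and one line of SQL over the exported table:
result = duckdb.sql("""
    SELECT name, city, age,
           RANK() OVER (PARTITION BY city ORDER BY age DESC) AS age_rank
    FROM people_table
""")
print(result)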
Export to Parquet¶
DuckDB can write query results directly to Parquet files, turning your graph data into data lake assets:
conn = duckdb.connect()
conn.register("nodes", db.nodes_to_arrow())
conn.register("edges", db.edges_to_arrow())
# Export a summary table to Parquet
conn.sql("""
COPY (
SELECT n.name, n.city, n.age, COUNT(e.id) AS connections
FROM nodes n
LEFT JOIN edges e ON n.id = e.source AND e.type = 'KNOWS'
WHERE list_contains(n.labels, 'Person')
GROUP BY n.name, n.city, n.age
ORDER BY connections DESC
) TO 'people_connections.parquet' (FORMAT PARQUET)
""")
This Parquet file can then be loaded into any data warehouse, shared with colleagues, or queried later with DuckDB, Spark, or Polars.
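A quick way to verify the export (or consume it later) is to query the file path directly; DuckDB reads Parquet files referenced in a FROM clause:
duckdb.sql("SELECT * FROM 'people_connections.parquet' LIMIT 5").show()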
Performance Notes¶
| Aspect | Detail |
|---|---|
| Zero-copy transfer | Arrow Tables pass to DuckDB without copying data. The column pointers are shared directly. |
| Scale | For graphs with hundreds of thousands of nodes, Arrow export is 10-100x faster than row-by-row conversion through pandas. |
| Memory | DuckDB scans the Arrow Table in-place. No second copy of the data exists in memory. |
| Predicate pushdown | DuckDB pushes filters down into Arrow scans, reading only the columns and rows it needs. |
| Parallelism | DuckDB's vectorized engine processes Arrow batches in parallel across CPU cores. |
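To see pushdown in action, EXPLAIN a filtered query. The exact plan output varies by DuckDB version, but the Arrow scan operator should list only the projected columns and the pushed-down filter:
duckdb.sql("EXPLAIN SELECT name FROM nodes WHERE city = 'Berlin'").show()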
When to Use Each¶
| Task | Use |
|---|---|
| Graph traversal, shortest path, pattern matching | Grafeo (GQL) |
| Aggregation, GROUP BY, window functions, pivots | DuckDB (SQL) |
| RDF triple patterns, OPTIONAL, UNION | Grafeo (SPARQL) |
| Export to Parquet, CSV, JSON | DuckDB |
| Subgraph extraction then analytics | Grafeo then DuckDB |
| Full-text or vector search | Grafeo |
Next Steps¶
- NetworkX Integration for graph algorithms and visualization
- Graph Visualization for interactive exploration with anywidget-graph
- Fraud Detection for a practical graph analytics example
- Python API for the full Python binding reference