Arrow Export

Export graph data as columnar Arrow tables for fast analytics with PyArrow, Polars, DuckDB and pandas.

When to use Arrow export vs nodes_df()/edges_df()

| Method | Best for |
| --- | --- |
| `nodes_to_arrow()` / `edges_to_arrow()` | Large graphs, DuckDB integration, zero-copy sharing |
| `nodes_to_polars()` / `edges_to_polars()` | Polars-native workflows, lazy evaluation |
| `nodes_to_pandas()` / `edges_to_pandas()` | pandas users who want the Arrow fast path explicitly |
| `nodes_df()` / `edges_df()` | Quick exploration (auto-uses Arrow when pyarrow is installed) |

At scale (100K+ nodes), the Arrow path is typically 10-100x faster than element-by-element export, because the RecordBatch is built in Rust and handed to Python as a single IPC buffer instead of one object per element.

Setup

uv add grafeo pyarrow polars pandas duckdb

Build a Sample Graph

from grafeo import GrafeoDB

db = GrafeoDB()

db.execute("""
    INSERT (:Person {name: 'Alix', age: 30, city: 'Amsterdam'})
    INSERT (:Person {name: 'Gus', age: 28, city: 'Berlin'})
    INSERT (:Person {name: 'Vincent', age: 35, city: 'Paris'})
    INSERT (:Person {name: 'Jules', age: 32, city: 'Amsterdam'})
    INSERT (:Person {name: 'Mia', age: 27, city: 'Berlin'})
    INSERT (:Company {name: 'Acme', founded: 2015})
    INSERT (:Company {name: 'Globex', founded: 2020})
""")

db.execute("""
    MATCH (a:Person {name: 'Alix'}), (g:Person {name: 'Gus'})
    INSERT (a)-[:KNOWS {since: 2019}]->(g)
""")
db.execute("""
    MATCH (a:Person {name: 'Alix'}), (c:Company {name: 'Acme'})
    INSERT (a)-[:WORKS_AT {role: 'Engineer'}]->(c)
""")
db.execute("""
    MATCH (g:Person {name: 'Gus'}), (c:Company {name: 'Globex'})
    INSERT (g)-[:WORKS_AT {role: 'Designer'}]->(c)
""")
db.execute("""
    MATCH (v:Person {name: 'Vincent'}), (j:Person {name: 'Jules'})
    INSERT (v)-[:KNOWS {since: 2021}]->(j)
""")

PyArrow: Filtering and DuckDB Integration

nodes_to_arrow() returns a pyarrow.Table with columns: id (uint64), labels (list<utf8>), plus one column per property key.

import pyarrow.compute as pc

table = db.nodes_to_arrow()
print(table.schema)
print(f"{table.num_rows} nodes exported")

Output

id: uint64
labels: list<item: string>
name: string
age: int64
city: string
founded: int64
7 nodes exported
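
Because there is one column per property key, a property that a node does not carry presumably appears as a null in that column, which is why `founded` sits alongside `age` in the schema above. The following is a minimal pure-Python sketch of that row-to-column pivot. The `pivot_properties` helper, the record layout and the node ids are hypothetical and only for illustration; the real export is done in Rust.

```python
def pivot_properties(nodes):
    """Pivot per-node property dicts into columns, filling gaps with None."""
    # Collect property keys in first-seen order.
    keys = []
    for node in nodes:
        for key in node["properties"]:
            if key not in keys:
                keys.append(key)

    # Fixed columns first, then one column per property key.
    columns = {
        "id": [node["id"] for node in nodes],
        "labels": [node["labels"] for node in nodes],
    }
    for key in keys:
        columns[key] = [node["properties"].get(key) for node in nodes]
    return columns


# Hand-written stand-ins for two nodes of the sample graph (ids are illustrative).
sample_nodes = [
    {"id": 1, "labels": ["Person"],
     "properties": {"name": "Alix", "age": 30, "city": "Amsterdam"}},
    {"id": 6, "labels": ["Company"],
     "properties": {"name": "Acme", "founded": 2015}},
]

columns = pivot_properties(sample_nodes)
print(columns["age"])      # the company has no age
print(columns["founded"])  # the person has no founded year
```

Every column ends up with the same length, so downstream filters and aggregations have to account for nulls in rows where a property does not apply.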

Filter directly on the Arrow table:

# People in Amsterdam
mask = pc.equal(table.column("city"), "Amsterdam")
amsterdam = table.filter(mask)
print(amsterdam.to_pandas()[["name", "city"]])

Output

    name       city
0   Alix  Amsterdam
1  Jules  Amsterdam

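Query with DuckDB, which scans the Arrow table in place by looking up the Python variable name (zero-copy, no data movement):

import duckdb

# TABLE is a reserved word in SQL, so refer to the Arrow table under another name
nodes_table = table

result = duckdb.sql("""
    SELECT name, age, city
    FROM nodes_table
    WHERE age >= 30
    ORDER BY age DESC
""")
print(result.fetchdf())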

Output

      name  age       city
0  Vincent   35      Paris
1    Jules   32  Amsterdam
2     Alix   30  Amsterdam

Polars: Lazy Evaluation and Filtering

nodes_to_polars() returns a polars.DataFrame directly, without requiring pyarrow.

import polars as pl

df = db.nodes_to_polars()
print(df)

Output

shape: (7, 6)
+-----+-----------+---------+------+-----------+---------+
| id  | labels    | name    | age  | city      | founded |
| u64 | list[str] | str     | i64  | str       | i64     |
+-----+-----------+---------+------+-----------+---------+
| 1   | [Person]  | Alix    | 30   | Amsterdam | null    |
| 2   | [Person]  | Gus     | 28   | Berlin    | null    |
| ...                                                    |
+-----+-----------+---------+------+-----------+---------+

Polars lazy evaluation for efficient multi-step pipelines:

young_berliners = (
    df.lazy()
    .filter(pl.col("city") == "Berlin")
    .filter(pl.col("age") < 30)
    .select("name", "age")
    .collect()
)
print(young_berliners)

Output

shape: (2, 2)
+------+-----+
| name | age |
| str  | i64 |
+------+-----+
| Gus  | 28  |
| Mia  | 27  |
+------+-----+

pandas: Direct DataFrame Access

nodes_to_pandas() builds the Arrow table in Rust and converts to pandas in one step:

df = db.nodes_to_pandas()
print(df.groupby("city")["age"].mean())

Output

city
Amsterdam    31.0
Berlin       27.5
Paris        35.0
Name: age, dtype: float64

Edge Export

edges_to_arrow() returns columns: id (uint64), type (utf8), source (uint64), target (uint64), plus one column per property key.

edges_table = db.edges_to_arrow()
print(edges_table.schema)

Output

id: uint64
type: string
source: uint64
target: uint64
since: int64
role: string

With Polars:

edges_df = db.edges_to_polars()

# All KNOWS relationships
knows = edges_df.filter(pl.col("type") == "KNOWS")
print(knows.select("source", "target", "since"))

With DuckDB (join nodes and edges for a full view):

nodes = db.nodes_to_arrow()
edges = db.edges_to_arrow()

result = duckdb.sql("""
    SELECT n1.name AS from_person, n2.name AS to_person, e.since
    FROM edges e
    JOIN nodes n1 ON e.source = n1.id
    JOIN nodes n2 ON e.target = n2.id
    WHERE e.type = 'KNOWS'
""")
print(result.fetchdf())

Output

  from_person to_person  since
0        Alix       Gus   2019
1     Vincent     Jules   2021

Performance Note

The Arrow export path builds a single RecordBatch in Rust and serializes it as IPC bytes. Python receives the buffer and deserializes it in one call with no per-element PyO3 crossings. On graphs with 100K+ entities, this is typically 10-100x faster than the row-by-row nodes_df()/edges_df() fallback.

The existing nodes_df()/edges_df() methods auto-detect pyarrow at runtime and use the Arrow fast path when available. Installing pyarrow speeds up all DataFrame exports without code changes.

Next Steps

Performance Baselines for detailed throughput numbers
Graph Visualization example for interactive exploration
NetworkX Integration for graph algorithm workflows