Vector Quantization¶

Quantization compresses vectors to reduce memory usage while maintaining search quality. Grafeo supports three quantization methods with different compression-recall tradeoffs.

Overview¶

Method	Compression	Recall	Best For
Scalar (SQ)	4x	~97%	General use, high recall
Binary (BQ)	32x	~85%	Fast filtering, massive datasets
Product (PQ)	8-192x	~90-95%	Large datasets, memory-constrained

Scalar Quantization¶

Scalar quantization converts each f32 (4 bytes) to u8 (1 byte), achieving 4x compression with minimal recall loss.

How It Works¶

Training: Learn min/max values per dimension from sample vectors
Quantization: Map each f32 value to 0-255 range
Search: Use asymmetric distance (f32 query vs u8 stored)

Usage¶

from grafeo import ScalarQuantizer

# Train on sample vectors
vectors = [doc.embedding for doc in documents[:1000]]
quantizer = ScalarQuantizer.train(vectors)

# Quantize vectors
quantized = quantizer.quantize(embedding)  # Returns List[int] (u8 values)

# Compute distance
distance = quantizer.distance(quantized_a, quantized_b)

# Dequantize (approximate reconstruction)
reconstructed = quantizer.dequantize(quantized)

Performance¶

Compression: 4x (384 dims: 1536 → 384 bytes)
Recall: ~97% at k=10
Distance computation: ~424 ns (vs ~38 ns for f32)

Binary Quantization¶

Binary quantization converts each f32 to a single bit, achieving 32x compression. Best for fast pre-filtering with rescoring.

How It Works¶

Quantization: bit = 1 if value > 0 else 0
Distance: Hamming distance (popcount of XOR)
Use: Fast candidate filtering, then rescore top candidates

Usage¶

from grafeo import BinaryQuantizer

# Quantize (no training needed)
binary_vec = BinaryQuantizer.quantize(embedding)  # Returns List[int] (packed u64)

# Hamming distance (very fast with SIMD)
distance = BinaryQuantizer.hamming_distance(binary_a, binary_b)

Performance¶

Compression: 32x (384 dims: 1536 → 48 bytes)
Recall: ~85% at k=10 (higher with rescoring)
Distance computation: ~50 ns (SIMD popcount)

Product Quantization¶

Product quantization (PQ) divides vectors into subvectors and quantizes each using a learned codebook. Achieves high compression with good recall.

How It Works¶

Training: Use k-means to learn K centroids for each of M subvectors
Quantization: Store M codes (indices into centroid tables)
Distance: Asymmetric Distance Computation (ADC) via lookup tables

Configuration¶

Parameter	Typical Values	Effect
`num_subvectors` (M)	8, 16, 32, 48	More = better recall, less compression
`num_centroids` (K)	256 (max)	Usually 256 for u8 codes
`iterations`	10-20	K-means iterations

Compression Ratio¶

compression_ratio = (dimensions * 4) / num_subvectors

# Examples for 384 dimensions:
# M=8:  384*4/8  = 192x compression
# M=16: 384*4/16 = 96x compression
# M=48: 384*4/48 = 32x compression

Usage¶

from grafeo import ProductQuantizer

# Training vectors (should be representative sample)
training_vectors = [doc.embedding for doc in sample_docs]

# Train quantizer
# - 8 subvectors (48 dims each for 384-dim vectors)
# - 256 centroids per subvector
# - 10 k-means iterations
quantizer = ProductQuantizer.train(
    vectors=training_vectors,
    num_subvectors=8,
    num_centroids=256,
    iterations=10
)

# Quantize to M codes
codes = quantizer.quantize(embedding)  # Returns List[int] of length M

# Fast distance computation using precomputed table
table = quantizer.build_distance_table(query)
distance = quantizer.distance_with_table(table, codes)  # ~4.5 ns!

# Or direct asymmetric distance (builds table internally)
distance = quantizer.asymmetric_distance(query, codes)

# Approximate reconstruction
reconstructed = quantizer.reconstruct(codes)

Performance¶

Compression: 8-192x depending on M
Recall: ~90-95% at k=10
Distance computation: 4.5 ns with precomputed table (6x faster than raw!)

Choosing a Quantization Method¶

Decision Tree¶

Is memory the primary constraint?
├── No → Use Scalar Quantization (best recall)
└── Yes → How much compression is needed?
    ├── 4x is enough → Scalar Quantization
    ├── 8-50x needed → Product Quantization
    └── 32x+ needed → Binary Quantization (with rescoring)

Comparison for 1M 384-dim Vectors¶

Method	Memory	Recall@10	Search Time
None (f32)	1.5 GB	100%	Baseline
Scalar	384 MB	~97%	~1.1x
PQ8	8 MB	~92%	~0.8x
PQ48	48 MB	~95%	~0.9x
Binary	48 MB	~85%	~0.5x

Next Steps¶

Python API - Complete Python bindings reference
HNSW Index - Index configuration details