Vector Quantization¶
Quantization compresses vectors to reduce memory usage while maintaining search quality. Grafeo supports three quantization methods with different compression-recall tradeoffs.
Overview¶
| Method | Compression | Recall | Best For |
|---|---|---|---|
| Scalar (SQ) | 4x | ~97% | General use, high recall |
| Binary (BQ) | 32x | ~85% | Fast filtering, massive datasets |
| Product (PQ) | 8-192x | ~90-95% | Large datasets, memory-constrained |
Scalar Quantization¶
Scalar quantization converts each f32 (4 bytes) to u8 (1 byte), achieving 4x compression with minimal recall loss.
How It Works¶
- Training: Learn min/max values per dimension from sample vectors
- Quantization: Map each f32 value to 0-255 range
- Search: Use asymmetric distance (f32 query vs u8 stored)
Usage¶
from grafeo import ScalarQuantizer
# Train on sample vectors
vectors = [doc.embedding for doc in documents[:1000]]
quantizer = ScalarQuantizer.train(vectors)
# Quantize vectors
quantized = quantizer.quantize(embedding) # Returns List[int] (u8 values)
# Compute distance
distance = quantizer.distance(quantized_a, quantized_b)
# Dequantize (approximate reconstruction)
reconstructed = quantizer.dequantize(quantized)
Performance¶
- Compression: 4x (384 dims: 1536 → 384 bytes)
- Recall: ~97% at k=10
- Distance computation: ~424 ns (vs ~38 ns for f32)
Binary Quantization¶
Binary quantization converts each f32 to a single bit, achieving 32x compression. Best for fast pre-filtering with rescoring.
How It Works¶
- Quantization:
bit = 1 if value > 0 else 0 - Distance: Hamming distance (popcount of XOR)
- Use: Fast candidate filtering, then rescore top candidates
Usage¶
from grafeo import BinaryQuantizer
# Quantize (no training needed)
binary_vec = BinaryQuantizer.quantize(embedding) # Returns List[int] (packed u64)
# Hamming distance (very fast with SIMD)
distance = BinaryQuantizer.hamming_distance(binary_a, binary_b)
Performance¶
- Compression: 32x (384 dims: 1536 → 48 bytes)
- Recall: ~85% at k=10 (higher with rescoring)
- Distance computation: ~50 ns (SIMD popcount)
Product Quantization¶
Product quantization (PQ) divides vectors into subvectors and quantizes each using a learned codebook. Achieves high compression with good recall.
How It Works¶
- Training: Use k-means to learn K centroids for each of M subvectors
- Quantization: Store M codes (indices into centroid tables)
- Distance: Asymmetric Distance Computation (ADC) via lookup tables
Configuration¶
| Parameter | Typical Values | Effect |
|---|---|---|
num_subvectors (M) | 8, 16, 32, 48 | More = better recall, less compression |
num_centroids (K) | 256 (max) | Usually 256 for u8 codes |
iterations | 10-20 | K-means iterations |
Compression Ratio¶
compression_ratio = (dimensions * 4) / num_subvectors
# Examples for 384 dimensions:
# M=8: 384*4/8 = 192x compression
# M=16: 384*4/16 = 96x compression
# M=48: 384*4/48 = 32x compression
Usage¶
from grafeo import ProductQuantizer
# Training vectors (should be representative sample)
training_vectors = [doc.embedding for doc in sample_docs]
# Train quantizer
# - 8 subvectors (48 dims each for 384-dim vectors)
# - 256 centroids per subvector
# - 10 k-means iterations
quantizer = ProductQuantizer.train(
vectors=training_vectors,
num_subvectors=8,
num_centroids=256,
iterations=10
)
# Quantize to M codes
codes = quantizer.quantize(embedding) # Returns List[int] of length M
# Fast distance computation using precomputed table
table = quantizer.build_distance_table(query)
distance = quantizer.distance_with_table(table, codes) # ~4.5 ns!
# Or direct asymmetric distance (builds table internally)
distance = quantizer.asymmetric_distance(query, codes)
# Approximate reconstruction
reconstructed = quantizer.reconstruct(codes)
Performance¶
- Compression: 8-192x depending on M
- Recall: ~90-95% at k=10
- Distance computation: 4.5 ns with precomputed table (6x faster than raw!)
Choosing a Quantization Method¶
Decision Tree¶
Is memory the primary constraint?
├── No → Use Scalar Quantization (best recall)
└── Yes → How much compression is needed?
├── 4x is enough → Scalar Quantization
├── 8-50x needed → Product Quantization
└── 32x+ needed → Binary Quantization (with rescoring)
Comparison for 1M 384-dim Vectors¶
| Method | Memory | Recall@10 | Search Time |
|---|---|---|---|
| None (f32) | 1.5 GB | 100% | Baseline |
| Scalar | 384 MB | ~97% | ~1.1x |
| PQ8 | 8 MB | ~92% | ~0.8x |
| PQ48 | 48 MB | ~95% | ~0.9x |
| Binary | 48 MB | ~85% | ~0.5x |
Next Steps¶
- Python API - Complete Python bindings reference
- HNSW Index - Index configuration details