Caching Architecture¶
This document describes the caching architecture in obspec-utils, the design decisions behind it, and guidance on when to use different caching strategies.
Overview¶
obspec-utils provides caching at two levels:
- Reader-level caching: Temporary, scoped to a single file reader's lifetime
- Store-level caching: Shared across all consumers of a store
These serve different phases of a typical VirtualiZarr workflow and have different trade-offs.
Two-Phase Workflow¶
When working with virtual datasets (e.g., via VirtualiZarr), data access happens in two distinct phases:
Phase 1: Parsing (Reader-Level)¶
┌─────────────────────────────────────────────────────────────┐
│ PARSING PHASE │
│ (per-file, short-lived) │
│ │
│ open_virtual_dataset() │
│ │ │
│ ▼ │
│ Reader ──────► Store ──────► Network │
│ │ │
│ └── Reader-level caching │
│ - Caches full file for HDF5/NetCDF parsing │
│ - Released when reader closes │
│ - Isolated per reader instance │
└─────────────────────────────────────────────────────────────┘
During parsing:
- A reader opens the file and parses headers/metadata
- The file may be read multiple times during parsing (HDF5 structure traversal)
- Once parsing completes, the reader closes and cache is released
- Result: Virtual dataset with chunk references (not actual data)
Reader-level caching is appropriate here because:
- File metadata is read once per file, then discarded
- Cache lifecycle matches reader lifecycle (automatic cleanup)
- No cross-contamination between different files being parsed
- Memory is freed immediately when parsing completes
Phase 2: Data Access (Store-Level)¶
┌─────────────────────────────────────────────────────────────┐
│ DATA ACCESS PHASE │
│ (shared, long-lived) │
│ │
│ xarray computation / .load() / .compute() │
│ │ │
│ ▼ │
│ ManifestStore ──────► Store ──────► Network │
│ │ │
│ └── Store-level caching │
│ - Shared across calls │
│ - Persists across │
│ different consumers │
└─────────────────────────────────────────────────────────────┘
During data access:
- Computations trigger reads of actual chunk data
- Same chunks may be accessed multiple times (overlapping operations, retries)
- Cache should persist and be shared across different access patterns
Store-level caching is appropriate here because:
- Blocks may be re-read by different computations
- Cache should be shared across all consumers of the store
- Lifecycle is independent of any single reader
Caching Implementations¶
Reader-Level: EagerStoreReader¶
EagerStoreReader loads the entire file into memory on construction:
from obspec_utils.readers import EagerStoreReader
# File is fully loaded into memory
reader = EagerStoreReader(store, "file.nc")
# All reads served from memory
data = reader.read(1000)
reader.seek(0)
more_data = reader.read(500)
# Cache released when reader closes
reader.close()
Characteristics:
- Fetches file using concurrent
get_ranges()for speed - Caches in
BytesIObuffer - Cache is isolated to this reader instance
- Memory freed on
close()or context manager exit
When to use:
- Parsing HDF5/NetCDF files (need random access during parsing)
- Small-to-medium files that fit in memory
- When you'll read most of the file anyway
Reader-Level: BlockStoreReader¶
BlockStoreReader uses block-based LRU caching:
from obspec_utils.readers import BlockStoreReader
reader = BlockStoreReader(
store, "file.nc",
block_size=256 * 1024, # 256 KB blocks
max_cached_blocks=64, # Up to 64 blocks cached
)
# Blocks fetched on demand via get_ranges()
data = reader.read(1000)
Characteristics:
- Bounded memory usage:
block_size * max_cached_blocks - LRU eviction when cache is full
- Good for sparse/random access patterns
When to use:
- Large files with sparse access
- Memory-constrained environments
- Unknown access patterns
Reader-Level: BufferedStoreReader¶
BufferedStoreReader provides read-ahead buffering for sequential access:
from obspec_utils.readers import BufferedStoreReader
reader = BufferedStoreReader(store, "file.nc", buffer_size=1024 * 1024)
# Sequential reads benefit from buffering
while block := reader.read(4096):
process(block)
Characteristics:
- Position-aware buffering (read-ahead)
- Best for sequential/streaming access
- Minimal memory overhead
When to use:
- Sequential file processing
- Streaming workloads
- When you won't revisit earlier data
Store-Level: CachingReadableStore¶
CachingReadableStore wraps any store to cache full objects:
from obspec_utils.wrappers import CachingReadableStore
# Wrap the store with caching
cached_store = CachingReadableStore(
store,
max_size=256 * 1024 * 1024, # 256 MB cache
)
# All consumers share the same cache
data1 = cached_store.get("file1.nc") # Fetched from network, cached
data2 = cached_store.get("file1.nc") # Served from cache
Characteristics:
- Caches full objects (entire files)
- LRU eviction when
max_sizeexceeded - Thread-safe (works with
ThreadPoolExecutor) - Shared across all consumers of the wrapped store
When to use:
- Multiple consumers reading the same files
- Repeated access to small-to-medium files
- When store-level sharing is beneficial
Limitations:
- Not shared across processes (Dask workers, ProcessPoolExecutor)
- Each process maintains its own cache
- Full-object granularity (not ideal for partial reads of large files)
Distributed Considerations¶
Threading (Shared Memory)¶
With ThreadPoolExecutor or similar:
- Store-level caching IS shared across threads
- All threads benefit from the same cache
- Thread-safe implementations required (provided)
Multi-Process (Separate Memory)¶
With ProcessPoolExecutor, Dask distributed, or Lithops:
- Each worker process has its own memory space
- Store wrappers are serialized and copied to each worker
- Caches are NOT shared across workers
- Each worker maintains an independent cache
This is typically acceptable when:
- Workloads are partitioned by file (each worker processes different files)
- The alternative (no caching) would be worse
For workloads requiring cross-worker cache sharing, consider:
- External caching (Redis, memcached)
- Shared filesystem caching
- Restructuring workloads to minimize cross-worker file access
Pickling and Serialization¶
CachingReadableStore supports Python's pickle protocol for use with multiprocessing and distributed frameworks. When a CachingReadableStore is pickled and unpickled (e.g., sent to a worker process), it is recreated with an empty cache.
import pickle
from obspec_utils.wrappers import CachingReadableStore
# Main process: create and populate cache
cached_store = CachingReadableStore(store, max_size=256 * 1024 * 1024)
cached_store.get("file1.nc") # Cached
cached_store.get("file2.nc") # Cached
print(cached_store.cache_size) # Non-zero
# Simulate sending to worker (pickle roundtrip)
restored = pickle.loads(pickle.dumps(cached_store))
# Worker receives store with empty cache
print(restored.cache_size) # 0
print(restored._max_size) # 256 * 1024 * 1024 (preserved)
Design rationale:
-
Cache contents are not serialized: Serializing the full cache would defeat the purpose of distributed processing—workers would receive potentially huge payloads, and the data may not even be relevant to their partition.
-
Fresh cache per worker: Each worker builds its own cache based on its workload. For file-partitioned workloads (common in data processing), this is optimal—each worker caches only the files it processes.
-
Configuration is preserved: The
max_sizeand underlying store are preserved, so workers use the same caching policy as the main process.
Requirements for pickling:
The underlying store (_store) must also be picklable. For cloud stores, this typically means using stores that can be reconstructed from configuration:
# Works: store can be pickled (configuration-based)
from obstore.store import S3Store
s3_store = S3Store(bucket="my-bucket", region="us-east-1")
cached = CachingReadableStore(s3_store)
pickle.dumps(cached) # OK
# May not work: some Rust-backed stores aren't picklable
from obstore.store import MemoryStore
mem_store = MemoryStore()
cached = CachingReadableStore(mem_store)
pickle.dumps(cached) # TypeError: cannot pickle 'MemoryStore' object
Distributed Usage Patterns¶
Pattern 1: File-Partitioned Workloads (Recommended)¶
When each worker processes a distinct set of files, per-worker caching works well:
from concurrent.futures import ProcessPoolExecutor
from obspec_utils.wrappers import CachingReadableStore
def process_files(cached_store, file_paths):
"""Each worker gets its own cache, processes its own files."""
results = []
for path in file_paths:
# First access: fetch from network, cache locally
data = cached_store.get(path)
# Subsequent accesses to same file: served from cache
result = analyze(data)
results.append(result)
return results
# Create cached store in main process
store = S3Store(bucket="my-bucket")
cached_store = CachingReadableStore(store, max_size=512 * 1024 * 1024)
# Partition files across workers
all_files = ["file1.nc", "file2.nc", "file3.nc", "file4.nc"]
partitions = [all_files[:2], all_files[2:]]
with ProcessPoolExecutor(max_workers=2) as executor:
futures = [
executor.submit(process_files, cached_store, partition)
for partition in partitions
]
results = [f.result() for f in futures]
Pattern 2: Dask Distributed¶
With Dask, the cached store is serialized to each worker:
import dask
from dask.distributed import Client
from obspec_utils.wrappers import CachingReadableStore
client = Client()
store = S3Store(bucket="my-bucket")
cached_store = CachingReadableStore(store)
@dask.delayed
def process_file(cached_store, path):
# Worker receives cached_store with empty cache
# Cache builds up as this worker processes files
data = cached_store.get(path)
return analyze(data)
tasks = [process_file(cached_store, f) for f in file_list]
results = dask.compute(*tasks)
Decision Guide¶
Which reader should I use?¶
| Access Pattern | Recommended Reader |
|---|---|
| Parse HDF5/NetCDF file | EagerStoreReader |
| Sequential streaming | BufferedStoreReader |
| Sparse random access | BlockStoreReader |
| Unknown pattern, large file | BlockStoreReader |
| Small file, repeated access | EagerStoreReader |
Should I use store-level caching?¶
| Scenario | Recommendation |
|---|---|
| Single-threaded, repeated file access | Yes, CachingReadableStore |
| Multi-threaded, shared files | Yes, CachingReadableStore |
| Distributed workers, partitioned by file | Optional (per-worker cache) |
| Distributed workers, shared files | Consider external caching |
| One-time file processing | No (use reader-level only) |
Store Wrappers¶
SplittingReadableStore¶
SplittingReadableStore accelerates get() by splitting large requests into concurrent get_ranges():
from obspec_utils.wrappers import SplittingReadableStore
fast_store = SplittingReadableStore(
store,
request_size=12 * 1024 * 1024, # 12 MB per request
max_concurrent_requests=18,
)
This extracts the concurrent fetching logic from EagerStoreReader into a composable wrapper. It composes naturally with CachingReadableStore:
from obspec_utils.wrappers import CachingReadableStore, SplittingReadableStore
# Compose: fast concurrent fetches + caching
store = S3Store(bucket="my-bucket")
store = SplittingReadableStore(store) # Split large fetches
store = CachingReadableStore(store) # Cache results
# First get(): concurrent fetch -> cache
# Second get(): served from cache
Characteristics:
- Only affects
get()andget_async()- range requests pass through unchanged - Requires
head()support to determine file size (falls back to single request otherwise) - Tuned for cloud storage (12 MB blocks, 18 concurrent requests by default)
Summary¶
| Layer | Scope | Lifetime | Use Case |
|---|---|---|---|
| Reader-level | Per-file, per-instance | Reader lifetime | Parsing phase |
| Store-level | Shared across consumers | Application lifetime | Data access phase |
The two-level architecture reflects the reality that:
- Parsing benefits from isolated, temporary caching (reader-level)
- Data access benefits from shared, persistent caching (store-level)
- Distributed settings require understanding that in-memory caches are per-process