Readers
obspec_utils.readers.BufferedStoreReader
A file-like reader with buffered on-demand reads.
This class provides a file-like interface (read, seek, tell) on top of any
object store. The reader uses get_range() calls to fetch data on-demand,
with optional read-ahead buffering for efficiency.
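The read-ahead idea described above can be sketched in a few lines. This is a minimal illustration, not the obspec_utils implementation: `InMemoryStore`, `SketchBufferedReader`, the `size` and `buffer_size` parameters, and the keyword-only `get_range(path, start=..., length=...)` signature are all assumptions made for the example.

```python
import io


class InMemoryStore:
    """Hypothetical stand-in for an object store; only get_range() is modeled."""

    def __init__(self, objects):
        self._objects = objects
        self.calls = []  # (start, length) of every request, for inspection

    def get_range(self, path, *, start, length):
        self.calls.append((start, length))
        return self._objects[path][start:start + length]


class SketchBufferedReader:
    """Illustrative file-like reader with a single read-ahead buffer."""

    def __init__(self, store, path, size, buffer_size=8):
        self._store, self._path, self._size = store, path, size
        self._buffer_size = buffer_size
        self._pos = 0
        self._buf = b""
        self._buf_start = 0

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        base = {io.SEEK_SET: 0, io.SEEK_CUR: self._pos, io.SEEK_END: self._size}[whence]
        self._pos = max(0, base + offset)
        return self._pos

    def read(self, n):
        # Clamp the request to the bytes remaining in the file.
        n = max(0, min(n, self._size - self._pos))
        if n == 0:
            return b""
        buf_end = self._buf_start + len(self._buf)
        if not (self._buf_start <= self._pos and self._pos + n <= buf_end):
            # Buffer miss: fetch the requested bytes plus read-ahead in one call.
            length = min(max(n, self._buffer_size), self._size - self._pos)
            self._buf = self._store.get_range(self._path, start=self._pos, length=length)
            self._buf_start = self._pos
        lo = self._pos - self._buf_start
        data = self._buf[lo:lo + n]
        self._pos += n
        return data


store = InMemoryStore({"data.bin": b"abcdefghijklmnop"})
reader = SketchBufferedReader(store, "data.bin", size=16, buffer_size=8)
first = reader.read(3)   # buffer miss: one get_range() call for 8 bytes
second = reader.read(3)  # served from the buffer, no new request
```

In this sketch a backward seek to a position before the buffer discards the buffered bytes and forces a new request, which is why the pattern suits mostly forward reads.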
When to Use
Use BufferedStoreReader when:
- Mostly sequential reading: Best for workloads that read forward through a file and only rarely seek backward.
- Simple use cases: When you need a basic file-like interface without caching or concurrent fetching.
- Streaming data: Processing data as it arrives without loading the full file into memory.
Consider alternatives when:
- You need to read the entire file anyway → use EagerStoreReader
- You have many non-contiguous reads → use BlockStoreReader
- You'll repeatedly access the same regions → use EagerStoreReader or BlockStoreReader
See Also
- EagerStoreReader : Loads entire file into memory for fast random access.
- BlockStoreReader : Block-based reader with LRU caching for sparse access.
obspec_utils.readers.EagerStoreReader
A file-like reader that eagerly loads the entire file into memory.
This reader fetches the complete file on first access and then serves all subsequent reads from the in-memory cache. Useful for files that will be read multiple times or when seeking is frequent.
By default, the file is fetched using concurrent range requests via
get_ranges(), which can significantly improve load time for large files.
The defaults (12 MB request size, max 18 concurrent requests) are tuned for
cloud storage. The file size is determined automatically via a HEAD request.
The concurrent fetching strategy is based on Icechunk's approach: github.com/earth-mover/icechunk/blob/main/icechunk/src/storage/mod.rs
When to Use
Use EagerStoreReader when:
- Reading the entire file: When you know you'll need most or all of the file's contents.
- Repeated random access: After the initial load, any byte is accessible with no network latency.
- Small to medium files: Files that fit comfortably in memory.
- Concurrent initial fetch: The default settings use concurrent requests for faster download on cloud storage.
Consider alternatives when:
- You only need a small portion of a large file → use BlockStoreReader
- Memory is constrained → use BlockStoreReader (bounded cache) or BufferedStoreReader
- You're streaming sequentially and won't revisit data → use BufferedStoreReader
See Also
- BufferedStoreReader : On-demand reads with read-ahead buffering.
- BlockStoreReader : Block-based reader with LRU caching for sparse access.
Store
__exit__
__exit__(exc_type, exc_val, exc_tb) -> None
Exit the context manager and close the reader.
__init__
__init__(
store: Store,
path: str,
request_size: int = 12 * 1024 * 1024,
file_size: int | None = None,
max_concurrent_requests: int = 18,
) -> None
Create an eager reader that loads the entire file into memory.
The file is fetched immediately and cached in memory.
Parameters:
- store (Store) –
- path (str) – The path to the file within the store.
- request_size (int, default: 12 * 1024 * 1024) – Target size for each concurrent range request in bytes. Default is 12 MB, tuned for cloud storage throughput. The file will be divided into parts of this size and fetched using get_ranges().
- file_size (int | None, default: None) – File size in bytes. If not provided, the size is determined via store.head(). Pass this to skip the HEAD request if you already know the file size.
- max_concurrent_requests (int, default: 18) – Maximum number of concurrent range requests. Default is 18. If the file would require more requests than this, request sizes are increased to fit within this limit.
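The interaction between request_size and max_concurrent_requests can be illustrated with a little arithmetic. The sketch below is one plausible way to split a file into ranges under the rule described above; `plan_ranges` is a hypothetical helper written for this example, and the library's actual splitting logic may differ in detail.

```python
def plan_ranges(file_size, request_size=12 * 1024 * 1024, max_concurrent_requests=18):
    """Return (start, end) byte ranges, end exclusive, that cover the file."""
    if file_size <= 0:
        return []
    # Integer ceiling division: how many parts of request_size are needed.
    n_requests = (file_size + request_size - 1) // request_size
    if n_requests > max_concurrent_requests:
        # Too many parts: grow each part so the count fits within the limit.
        n_requests = max_concurrent_requests
        request_size = (file_size + n_requests - 1) // n_requests
    return [
        (start, min(start + request_size, file_size))
        for start in range(0, file_size, request_size)
    ]


# A 100-byte file with 30-byte requests needs 4 parts, under the limit.
small = plan_ranges(100, request_size=30, max_concurrent_requests=18)

# A 1000-byte file with 10-byte requests would need 100 parts, so the
# part size grows to 250 bytes to stay within 4 requests.
capped = plan_ranges(1000, request_size=10, max_concurrent_requests=4)
```

With the documented defaults, a 1 GiB file would need 86 parts of 12 * 1024 * 1024 bytes, so under this sketch the part size grows until the file fits in 18 requests.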
obspec_utils.readers.BlockStoreReader
A file-like reader that uses concurrent range requests for efficient block fetching.
This reader divides the file into fixed-size blocks and uses get_ranges()
to fetch multiple blocks with concurrency. An LRU cache stores recently accessed blocks
to avoid redundant fetches.
This is particularly efficient for workloads that access multiple non-contiguous regions of a file.
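A rough sketch of the block-plus-LRU pattern follows. It is illustrative only and not the obspec_utils implementation: `SketchBlockCache` and the `fetch_blocks` callback (which stands in for a batched get_ranges() call) are hypothetical names introduced for the example.

```python
from collections import OrderedDict


class SketchBlockCache:
    """Illustrative fixed-size block reader with an LRU cache."""

    def __init__(self, fetch_blocks, block_size, max_cached_blocks):
        # fetch_blocks is a hypothetical callable: list of block indices -> list of bytes.
        self._fetch_blocks = fetch_blocks
        self._block_size = block_size
        self._max = max_cached_blocks
        self._cache = OrderedDict()  # block index -> block bytes, oldest first

    def read(self, offset, length):
        if length <= 0:
            return b""
        first = offset // self._block_size
        last = (offset + length - 1) // self._block_size
        needed = list(range(first, last + 1))
        missing = [i for i in needed if i not in self._cache]
        if missing:
            # One batched call for all missing blocks.
            for i, block in zip(missing, self._fetch_blocks(missing)):
                self._cache[i] = block
        blocks = []
        for i in needed:
            self._cache.move_to_end(i)  # mark as most recently used
            blocks.append(self._cache[i])
        while len(self._cache) > self._max:
            self._cache.popitem(last=False)  # evict the least recently used block
        data = b"".join(blocks)
        skip = offset - first * self._block_size
        return data[skip:skip + length]


DATA = bytes(range(32))
requests = []


def fetch_blocks(indices):
    requests.append(list(indices))
    return [DATA[i * 4:(i + 1) * 4] for i in indices]


cache = SketchBlockCache(fetch_blocks, block_size=4, max_cached_blocks=2)
chunk = cache.read(2, 4)  # spans blocks 0 and 1, fetched in one batch
```

A read that falls entirely inside cached blocks issues no request, while a read of a block that has already been evicted fetches it again.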
When to Use
Use BlockStoreReader when:
- Sparse access patterns: Reading many non-contiguous regions of a file.
- Large files with partial reads: When you only need portions of a large file and don't want to load it all into memory.
- Memory-constrained environments: The LRU cache has bounded memory usage (block_size * max_cached_blocks), regardless of file size.
- Unknown access patterns: When you don't know upfront which parts of the file you'll need.
Consider alternatives when:
- You'll read the entire file anyway → use EagerStoreReader
- Access is purely sequential → use BufferedStoreReader
- You need repeated access to more data than fits in the cache → use EagerStoreReader to avoid re-fetching evicted blocks
See Also
- BufferedStoreReader : On-demand reads with read-ahead buffering.
- EagerStoreReader : Loads entire file into memory for fast random access.
Store
__exit__
__exit__(exc_type, exc_val, exc_tb) -> None
Exit the context manager and close the reader.
__init__
__init__(
store: Store,
path: str,
block_size: int = 1024 * 1024,
max_cached_blocks: int = 64,
) -> None
Create a block-based reader with LRU caching.
Parameters:
- store (Store) –
- path (str) – The path to the file within the store.
- block_size (int, default: 1024 * 1024) – Size of each block in bytes. Default is 1 MB, tuned for cloud object stores where HTTP request overhead is significant. Smaller blocks mean more granular caching but more requests.
- max_cached_blocks (int, default: 64) – Maximum number of blocks to keep in the LRU cache. Default is 64, giving a 64 MB cache with the default block size.
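The trade-off between block_size and request count, and the cache bound of block_size * max_cached_blocks, can be checked with simple arithmetic. The helpers below (`cache_bound_bytes`, `blocks_touched`) are hypothetical names used only for this illustration.

```python
def cache_bound_bytes(block_size=1024 * 1024, max_cached_blocks=64):
    """Upper bound on cached bytes: block_size * max_cached_blocks."""
    return block_size * max_cached_blocks


def blocks_touched(offset, length, block_size):
    """Number of fixed-size blocks a read of `length` bytes at `offset` overlaps."""
    if length <= 0:
        return 0
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return last - first + 1


MIB = 1024 * 1024
default_bound = cache_bound_bytes()  # 64 blocks of 1 MiB each

# The same 10 MiB read touches more blocks as the block size shrinks.
with_1mib_blocks = blocks_touched(0, 10 * MIB, MIB)
with_256kib_blocks = blocks_touched(0, 10 * MIB, 256 * 1024)
```

A read that straddles a block boundary touches both neighboring blocks even if it is only a few bytes long, so small block sizes help most when reads are small and scattered.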