Skip to content

Glob Implementation Design

This document describes the design of obspec_utils.glob, which provides glob pattern matching for object stores using the obspec List primitive.

Overview

The glob module provides functions to match paths against glob patterns, similar to fsspec.glob, pathlib.glob, and glob.glob. It enables users to find objects in stores using familiar wildcard patterns like data/**/*.nc.

API Design

Two-Function Approach

We provide two separate functions rather than a single function with a detail kwarg:

from obspec_utils import glob, glob_objects

# Get paths only
paths = list(glob(store, "data/**/*.nc"))
# ['data/2024/file1.nc', 'data/2024/01/file2.nc', ...]

# Get full metadata
for obj in glob_objects(store, "data/**/*.nc"):
    print(f"{obj['path']}: {obj['size']} bytes")

Rationale:

Approach Typing API Clarity
Two functions Clean return types Explicit intent
Single function with kwarg Requires @overload decorators Runtime-dependent return type

Following Python's "explicit is better than implicit" philosophy, two functions provide:

  • Clean typing — each function has a single return type
  • Discoverability — both options visible in autocomplete
  • No ambiguity — return type known at call site

Function Matrix

Function Protocol Returns
glob obspec.List Iterator[str]
glob_objects obspec.List Iterator[ObjectMeta]
glob_async obspec.ListAsync AsyncIterator[str]
glob_objects_async obspec.ListAsync AsyncIterator[ObjectMeta]

Protocol Requirements

Following obspec's philosophy, we use obspec.List and obspec.ListAsync directly rather than defining wrapper protocols:

from obspec import List

def glob(store: List, pattern: str) -> Iterator[str]:
    ...

This keeps the API minimal and avoids unnecessary abstraction layers.

Pattern Support

The glob functions support standard Unix-style glob patterns:

Pattern Meaning Example
* Matches any characters within a single path segment data/*.nc matches data/file.nc but not data/sub/file.nc
** Matches any number of path segments (recursive) data/**/*.nc matches data/a/b/c/file.nc
? Matches exactly one character file?.nc matches file1.nc but not file10.nc
[abc] Matches characters in set file[123].nc matches file1.nc, file2.nc, file3.nc
[a-z] Matches characters in range file[a-c].nc matches filea.nc, fileb.nc, filec.nc
[!abc] Matches characters NOT in set file[!0-9].nc matches filea.nc but not file1.nc

Implementation Algorithm

1. Prefix Extraction

Extract the literal prefix from the pattern to optimize the list() call:

GLOB_CHARS = frozenset('*?[')

def _parse_pattern(pattern: str) -> tuple[str, str]:
    """Find the longest prefix without glob characters.

    The prefix must end at a path separator boundary to work with
    obspec's segment-based prefix matching.
    """
    for i, char in enumerate(pattern):
        if char in GLOB_CHARS:
            prefix_end = pattern.rfind('/', 0, i) + 1
            return pattern[:prefix_end], pattern[prefix_end:]

    # No glob chars - use parent directory as prefix
    last_slash = pattern.rfind('/')
    if last_slash >= 0:
        return pattern[:last_slash + 1], pattern[last_slash + 1:]
    return "", pattern

Examples: - data/2024/**/*.nc → prefix data/2024/, remaining **/*.nc - data/*.nc → prefix data/, remaining *.nc - **/*.nc → prefix "", remaining **/*.nc - data/file.nc → prefix data/, remaining file.nc (literal path) - file.nc → prefix "", remaining file.nc (no directory)

2. Pattern Compilation

Convert the glob pattern to a compiled regex using a segment-by-segment approach inspired by CPython's glob.translate():

import re

def _compile_pattern(pattern: str) -> re.Pattern[str]:
    """
    Convert glob pattern to regex, processing segment by segment.

    Inspired by CPython 3.13+ glob.translate() but simplified for
    object stores (/ separator only, no hidden file handling).
    """
    segments = pattern.split('/')
    regex_parts = []

    i = 0
    while i < len(segments):
        segment = segments[i]
        is_last = (i == len(segments) - 1)

        if segment == '**':
            # Skip consecutive ** segments
            while i + 1 < len(segments) and segments[i + 1] == '**':
                i += 1
            is_last = (i == len(segments) - 1)

            if is_last:
                # ** at end: match everything remaining
                regex_parts.append('.*')
            else:
                # ** in middle: match zero or more segments
                regex_parts.append('(?:.+/)?')
        else:
            # Convert segment with wildcards
            segment_regex = _translate_segment(segment)
            if is_last:
                regex_parts.append(segment_regex)
            else:
                regex_parts.append(segment_regex + '/')

        i += 1

    return re.compile(''.join(regex_parts) + r'\Z')

def _translate_segment(segment: str) -> str:
    """Translate a single path segment (no /) to regex."""
    # Handle *, ?, [abc], [!abc], [a-z] and literal characters
    # * -> [^/]* (any chars except /)
    # ? -> [^/] (single char except /)
    # [...] -> [...] (character class, passed through)
    ...

Key design choices (inspired by CPython glob.translate()):

Pattern Regex Rationale
* [^/]* Match any chars within segment (not across /)
** (middle) (?:.+/)? Match zero or more complete segments
** (end) .* Match everything remaining
? [^/] Match single char within segment
[abc] [abc] Character class (passed through)
[!abc] [^abc] Negated character class

Differences from CPython: - Object stores use / only (no os.sep handling) - No hidden file handling (object stores don't have this concept) - Simpler implementation focused on object store paths

3. List and Filter

def _glob_impl(store: List, pattern: str) -> Iterator[ObjectMeta]:
    list_prefix, _ = _parse_pattern(pattern)
    compiled = _compile_pattern(pattern)

    for chunk in store.list(prefix=list_prefix if list_prefix else None):
        for obj in chunk:
            if compiled.match(obj["path"]):
                yield obj

Note: The compiled pattern includes \Z anchor at the end, so match() (which anchors at the start) effectively performs a full match. This is more efficient than fullmatch() in some regex engines.

Behavior Comparison

Feature obspec_utils.glob fsspec.glob pathlib.glob glob.glob
Returns Iterator[str] or Iterator[ObjectMeta] list[str] or dict Iterator[Path] list[str]
* matches / No No No No
** recursive Yes (always) Yes Yes (always) Yes (if recursive=True)
Hidden files Matched Matched Matched Only if pattern starts with .
Case sensitive Yes (always) Platform-dependent Platform-dependent Platform-dependent
Directories Not included Yes (withdirs) Yes Yes
maxdepth Not supported Yes No No
Metadata glob_objects() detail=True No No
Streaming Yes (iterator) No (returns list) Yes (iterator) No (returns list)

Key Differences and Rationale

1. Two functions instead of detail kwarg

obspec_utils fsspec
glob() returns Iterator[str] glob() returns list[str]
glob_objects() returns Iterator[ObjectMeta] glob(..., detail=True) returns dict

Rationale: fsspec uses a runtime detail parameter that changes the return type, requiring @overload decorators for proper typing. Two separate functions provide: - Clean static typing without runtime-dependent return types - Better IDE autocomplete and type inference - Follows Python's "explicit is better than implicit"

2. No maxdepth parameter

Rationale: The obspec List primitive is always recursive—there's no way to request a shallow listing. Adding maxdepth would require: - Counting path segments in every result - Post-filtering results that exceed the depth limit - No performance benefit since all objects are fetched anyway

If depth limiting is needed, users can post-filter:

max_depth = 2
results = [p for p in glob(store, "**/*.nc") if p.count("/") <= max_depth]

3. Always case-sensitive

Rationale: Object stores (S3, GCS, Azure Blob) treat paths as case-sensitive. Unlike filesystems where case sensitivity varies by platform (case-insensitive on Windows/macOS, case-sensitive on Linux), object stores are consistent. Matching this behavior avoids surprises when patterns work locally but fail in production.

4. No directory results

Rationale: Object stores don't have real directories—only objects with /-separated paths. What appears as a "directory" is just a common prefix. fsspec's withdirs=True returns these pseudo-directories, but: - They don't exist as separate entities with metadata - Including them would require using ListWithDelimiter and merging results - Most use cases want actual objects, not prefixes

5. Streaming results (iterator vs list)

obspec_utils fsspec
Iterator[str] (lazy) list[str] (eager)

Rationale: Object store listings can return millions of objects. fsspec materializes all results into a list before returning, which: - Blocks until all pages are fetched - Consumes memory proportional to result count - Can't process results incrementally

Returning an iterator enables: - Processing results as they arrive - Early termination (e.g., "find first 10 matches") - Bounded memory usage regardless of result count

6. Pattern always required

Unlike fsspec.glob() which accepts "bucket/**" to list everything, obspec_utils.glob requires a pattern. To list all objects, use store.list() directly.

Usage Examples

Basic Patterns

from obspec_utils import glob, glob_objects

# Find all NetCDF files in a directory
paths = list(glob(store, "data/2024/*.nc"))

# Find all NetCDF files recursively
paths = list(glob(store, "data/**/*.nc"))

# Find files with single-character suffix
paths = list(glob(store, "data/file?.nc"))

# Find files matching character set
paths = list(glob(store, "data/[abc]*.nc"))

With Metadata

# Get file sizes for matching objects
total_size = sum(obj["size"] for obj in glob_objects(store, "data/**/*.nc"))

# Find recently modified files
from datetime import datetime, timedelta, timezone
cutoff = datetime.now(timezone.utc) - timedelta(days=7)
recent = [
    obj for obj in glob_objects(store, "data/**/*.nc")
    if obj["last_modified"] > cutoff
]

Async Usage

async def process_files(store):
    async for path in glob_async(store, "data/**/*.nc"):
        await process(path)

Dependencies

  • obspec — for List, ListAsync, and ObjectMeta types
  • re — standard library regex

No new external dependencies required. We implement our own translate() function rather than using fnmatch.translate() to properly handle path separators and ** patterns.