Finding Files on Cloud Storage

This guide shows how to discover and list files stored in cloud object storage.

Listing Files in a Directory

To see what files exist in a specific location, use the store's list() method with a prefix:

from obstore.store import S3Store

# Access public AWS Open Data
store = S3Store(
    bucket="nasanex",
    aws_region="us-west-2",
    skip_signature=True,
)

# List files in a specific directory
prefix = "NEX-GDDP/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/"
files = []
for chunk in store.list(prefix=prefix):
    files.extend(chunk)

print(f"Found {len(files)} files in {prefix}")
print("\nFirst 5 files:")
for f in files[:5]:
    print(f"  {f['path'].split('/')[-1]}")

Found 3986 files in NEX-GDDP/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/

First 5 files:
  tasmax_day_BCSD_rcp85_r1i1p1_ACCESS1-0_2006.json
  tasmax_day_BCSD_rcp85_r1i1p1_ACCESS1-0_2006.nc
  tasmax_day_BCSD_rcp85_r1i1p1_ACCESS1-0_2007.json
  tasmax_day_BCSD_rcp85_r1i1p1_ACCESS1-0_2007.nc
  tasmax_day_BCSD_rcp85_r1i1p1_ACCESS1-0_2008.json
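
Because list() yields results in chunks, and this directory holds both .json and .nc files, it can be handy to flatten the chunks and filter in one pass. The sketch below uses made-up chunks in place of a real store.list() call, assuming only that each entry is a dict with a "path" key as in the example above:

```python
from itertools import chain

# Made-up chunks standing in for what store.list(prefix=...) yields
chunks = [
    [{"path": "dir/a_2006.json"}, {"path": "dir/a_2006.nc"}],
    [{"path": "dir/a_2007.json"}, {"path": "dir/a_2007.nc"}],
]

# Flatten the chunks lazily and keep only the NetCDF files
nc_paths = [
    entry["path"]
    for entry in chain.from_iterable(chunks)
    if entry["path"].endswith(".nc")
]
print(nc_paths)
```

With a real store, the chunks list would be replaced by the iterator returned from store.list(prefix=prefix).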

Use the class methods rather than obstore top-level functions

When using obspec_utils wrappers like CachingReadableStore, call methods directly on the store (e.g., store.list()) rather than using obstore functions (e.g., obstore.list(store)). The wrappers implement the obspec protocol, which decouples them from any specific store implementation. Obstore's top-level functions are tied to the stores that obstore itself implements, so they do not work with the obspec-based wrappers provided by obspec-utils.
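
The idea of coding against a protocol rather than a concrete class can be illustrated with a small, self-contained sketch. SupportsList, InMemoryStore, LoggingStore, and count_files are hypothetical names invented for this example; they are not part of obspec, obstore, or obspec-utils:

```python
from typing import Iterator, Protocol, Sequence

# Shape of a listing result in this sketch: an iterator of chunks of dicts
Chunks = Iterator[list[dict]]


class SupportsList(Protocol):
    """Hypothetical protocol: anything with a list(prefix) method."""

    def list(self, prefix: str) -> Chunks: ...


class InMemoryStore:
    """Hypothetical store backed by a plain sequence of paths."""

    def __init__(self, paths: Sequence[str]) -> None:
        self._paths = paths

    def list(self, prefix: str) -> Chunks:
        # Yield a single chunk containing every matching entry
        yield [{"path": p} for p in self._paths if p.startswith(prefix)]


class LoggingStore:
    """Hypothetical wrapper that records each prefix before delegating."""

    def __init__(self, inner: SupportsList) -> None:
        self.inner = inner
        self.calls = []

    def list(self, prefix: str) -> Chunks:
        self.calls.append(prefix)
        return self.inner.list(prefix)


def count_files(store: SupportsList, prefix: str) -> int:
    # Relies only on the method, so plain stores and wrappers both work
    return sum(len(chunk) for chunk in store.list(prefix))


base = InMemoryStore(["data/a.nc", "data/b.nc", "other/c.nc"])
wrapped = LoggingStore(base)
print(count_files(base, "data/"), count_files(wrapped, "data/"))
```

A function that instead checked for a specific class such as InMemoryStore would reject LoggingStore, which mirrors why method calls are the safer choice with wrappers.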

Finding Files Matching a Pattern

When you need files matching specific criteria (e.g., all files from year 2100), use glob:

from obspec_utils import glob

# Find all NetCDF files for year 2100
paths = list(glob(store, "NEX-GDDP/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/*_2100.nc"))
print(f"Found {len(paths)} files for 2100:")
for path in paths[:5]:
    print(f"  {path.split('/')[-1]}")

Found 19 files for 2100:
  tasmax_day_BCSD_rcp85_r1i1p1_ACCESS1-0_2100.nc
  tasmax_day_BCSD_rcp85_r1i1p1_BNU-ESM_2100.nc
  tasmax_day_BCSD_rcp85_r1i1p1_CCSM4_2100.nc
  tasmax_day_BCSD_rcp85_r1i1p1_CESM1-BGC_2100.nc
  tasmax_day_BCSD_rcp85_r1i1p1_CNRM-CM5_2100.nc

Pattern Syntax

| Pattern | Matches                       | Example                                        |
|---------|-------------------------------|------------------------------------------------|
| *       | Any characters in one segment | *_2100.nc matches any model for 2100           |
| **      | Any number of segments        | data/**/*.nc matches all .nc files recursively |
| ?       | Exactly one character         | *_209?.nc matches 2090-2099                    |
| [abc]   | Any character in set          | *_209[012].nc matches 2090, 2091, 2092         |
| [a-z]   | Any character in range        | *_209[0-5].nc matches 2090-2095                |
| [!abc]  | Any character NOT in set      | *_209[!9].nc excludes 2099                     |
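
One way to make these rules concrete is to translate a pattern into a regular expression. The sketch below only illustrates the semantics listed above; glob_to_regex and matches are hypothetical helpers, not how obspec_utils implements matching, and edge cases such as escaped characters are ignored:

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a glob pattern into a compiled regex (illustrative only)."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            # '**/' spans zero or more whole path segments
            out.append("(?:[^/]+/)*")
            i += 3
        elif pattern.startswith("**", i):
            # A bare '**' matches anything, including '/'
            out.append(".*")
            i += 2
        elif pattern[i] == "*":
            # '*' stays inside a single path segment
            out.append("[^/]*")
            i += 1
        elif pattern[i] == "?":
            # '?' is exactly one character within a segment
            out.append("[^/]")
            i += 1
        elif pattern[i] == "[" and pattern.find("]", i + 1) != -1:
            end = pattern.find("]", i + 1)
            body = pattern[i + 1:end]
            if body.startswith("!"):
                body = "^" + body[1:]  # glob negation becomes regex negation
            out.append("[" + body + "]")
            i = end + 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))


def matches(pattern: str, path: str) -> bool:
    return glob_to_regex(pattern).fullmatch(path) is not None


print(matches("*_209?.nc", "x_2095.nc"))       # True
print(matches("*_209[!9].nc", "x_2099.nc"))    # False
print(matches("data/**/*.nc", "data/a/b.nc"))  # True
print(matches("*_2100.nc", "sub/x_2100.nc"))   # False: '*' does not cross '/'
```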

More Pattern Examples

# Match a range of years (2090-2099) using ?
paths = list(glob(store, "NEX-GDDP/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/*_inmcm4_209?.nc"))
print(f"Years 2090-2099: {len(paths)} files")
for p in paths[-4:]:  # Show last 4 (2096-2099)
    print(f"  {p.split('/')[-1]}")

# Match specific years using character range
paths = list(glob(store, "NEX-GDDP/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/*_inmcm4_209[5-9].nc"))
print(f"\nYears 2095-2099: {len(paths)} files")
for p in paths:
    print(f"  {p.split('/')[-1]}")

Years 2090-2099: 10 files
  tasmax_day_BCSD_rcp85_r1i1p1_inmcm4_2096.nc
  tasmax_day_BCSD_rcp85_r1i1p1_inmcm4_2097.nc
  tasmax_day_BCSD_rcp85_r1i1p1_inmcm4_2098.nc
  tasmax_day_BCSD_rcp85_r1i1p1_inmcm4_2099.nc

Years 2095-2099: 5 files
  tasmax_day_BCSD_rcp85_r1i1p1_inmcm4_2095.nc
  tasmax_day_BCSD_rcp85_r1i1p1_inmcm4_2096.nc
  tasmax_day_BCSD_rcp85_r1i1p1_inmcm4_2097.nc
  tasmax_day_BCSD_rcp85_r1i1p1_inmcm4_2098.nc
  tasmax_day_BCSD_rcp85_r1i1p1_inmcm4_2099.nc

Getting File Sizes and Dates

To get metadata (size, last modified time) along with paths, use glob_objects:

from obspec_utils import glob_objects

# Get metadata for matching files
objects = list(glob_objects(store, "NEX-GDDP/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/*_2100.nc"))

# Calculate total size
total_bytes = sum(obj["size"] for obj in objects)
print(f"Total: {total_bytes / 1e9:.2f} GB across {len(objects)} files")

# Show details for a few files
print("\nSample files:")
for obj in objects[:3]:
    print(f"  {obj['path'].split('/')[-1]}")
    print(f"    Size: {obj['size'] / 1e6:.1f} MB")
    print(f"    Modified: {obj['last_modified'].date()}")

Total: 14.99 GB across 19 files

Sample files:
  tasmax_day_BCSD_rcp85_r1i1p1_ACCESS1-0_2100.nc
    Size: 800.3 MB
    Modified: 2015-06-10
  tasmax_day_BCSD_rcp85_r1i1p1_BNU-ESM_2100.nc
    Size: 773.4 MB
    Modified: 2015-06-11
  tasmax_day_BCSD_rcp85_r1i1p1_CCSM4_2100.nc
    Size: 798.2 MB
    Modified: 2015-06-11
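
Once metadata is available, it can be aggregated however you like. As a rough sketch, the snippet below totals sizes per climate model, assuming the model name is the second-to-last underscore-separated part of the file name. The records and their sizes are made up for illustration, and model_name is a hypothetical helper:

```python
from collections import defaultdict

# Made-up records that mimic the "path" and "size" keys used above
objects = [
    {"path": "v1.0/tasmax_day_BCSD_rcp85_r1i1p1_ACCESS1-0_2099.nc", "size": 800},
    {"path": "v1.0/tasmax_day_BCSD_rcp85_r1i1p1_ACCESS1-0_2100.nc", "size": 810},
    {"path": "v1.0/tasmax_day_BCSD_rcp85_r1i1p1_CCSM4_2100.nc", "size": 790},
]


def model_name(path: str) -> str:
    """Extract the model name, e.g. 'CCSM4', from a file path."""
    filename = path.split("/")[-1]
    return filename.split("_")[-2]


# Sum the size of every object belonging to the same model
totals = defaultdict(int)
for obj in objects:
    totals[model_name(obj["path"])] += obj["size"]

for model, size in sorted(totals.items()):
    print(f"{model}: {size} bytes")
```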

Improving Performance

Listing files in cloud storage requires network requests. The more files the server needs to enumerate, the slower the operation. Here's how to keep searches fast.

Use Specific Prefixes

The glob function automatically extracts the longest literal prefix from your pattern to minimize the files the server must enumerate:

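| Pattern                | Server lists from  | Files enumerated       |
|------------------------|--------------------|------------------------|
| data/2024/january/*.nc | data/2024/january/ | Only January files     |
| data/2024/*/*.nc       | data/2024/         | All of 2024            |
| data/**/*.nc           | data/              | Everything under data/ |
| **/*.nc                | (root)             | Entire bucket          |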

Move literal path segments before wildcards when possible:

# Slower: wildcard early means listing more files
glob(store, "NEX-GDDP/**/tasmax/**/v1.0/*_2100.nc")

# Faster: specific prefix narrows the listing
glob(store, "NEX-GDDP/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/*_2100.nc")
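
The prefix extraction can be approximated in a few lines of Python. literal_prefix is a hypothetical helper written for this guide, assuming the prefix is cut at path-segment boundaries; the library's own logic may differ in detail:

```python
def literal_prefix(pattern: str) -> str:
    """Return the leading directory part of a pattern that has no wildcards."""
    wildcard_chars = set("*?[")
    literal = []
    # The final segment names files rather than a directory, so skip it
    for segment in pattern.split("/")[:-1]:
        if wildcard_chars & set(segment):
            break  # first wildcard segment ends the literal prefix
        literal.append(segment)
    return "/".join(literal) + "/" if literal else ""


for pattern in [
    "data/2024/january/*.nc",
    "data/2024/*/*.nc",
    "data/**/*.nc",
    "**/*.nc",
]:
    print(f"{pattern!r} -> {literal_prefix(pattern)!r}")
```

Running a pattern through a helper like this before a search is a quick way to see how much of the bucket would need to be listed.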

Process Results Lazily

Both glob and glob_objects return iterators, so you can process results as they arrive without loading all paths into memory:

# Stop after finding 3 files (doesn't load all results)
count = 0
for path in glob(store, "NEX-GDDP/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/*_2100.nc"):
    print(f"Found: {path.split('/')[-1]}")
    count += 1
    if count >= 3:
        break

Found: tasmax_day_BCSD_rcp85_r1i1p1_ACCESS1-0_2100.nc
Found: tasmax_day_BCSD_rcp85_r1i1p1_BNU-ESM_2100.nc
Found: tasmax_day_BCSD_rcp85_r1i1p1_CCSM4_2100.nc
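
The same early exit can be written with itertools.islice, which works on any iterator. In the sketch below, fake_glob is a hypothetical stand-in generator (not the real glob) that records what it yields, so the laziness is visible without a network connection:

```python
from itertools import islice
from typing import Iterator

produced = []  # records which paths the generator actually yielded


def fake_glob() -> Iterator[str]:
    """Hypothetical stand-in for glob(): yields made-up paths one at a time."""
    for year in range(2006, 2101):
        path = f"tasmax_day_example_{year}.nc"
        produced.append(path)
        yield path


# Take only the first three results; the rest are never generated
first_three = list(islice(fake_glob(), 3))
print(first_three)
print(f"Generated {len(produced)} of 95 possible paths")
```

With the real function, the equivalent would be wrapping the glob(store, pattern) call in islice.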

Async Usage

For async contexts, use glob_async and glob_objects_async:

import asyncio
from obspec_utils import glob_async

async def find_recent_years():
    paths = []
    async for path in glob_async(store, "NEX-GDDP/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/*_inmcm4_209?.nc"):
        paths.append(path)
    return paths

paths = asyncio.run(find_recent_years())
print(f"Found {len(paths)} files asynchronously")

Found 10 files asynchronously
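
One reason to reach for the async variants is running several searches concurrently. The sketch below shows the general asyncio.gather pattern using fake_glob_async, a hypothetical stand-in async generator that yields made-up strings. Whether concurrent listing is actually faster against a given store is an assumption to verify, not something this example demonstrates:

```python
import asyncio
from typing import AsyncIterator


async def fake_glob_async(pattern: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for glob_async: yields two made-up paths."""
    for suffix in ("a", "b"):
        await asyncio.sleep(0)  # pretend to wait on the network
        yield f"{pattern}:{suffix}"


async def collect(results: AsyncIterator[str]) -> list[str]:
    # Drain an async iterator into a list
    return [item async for item in results]


async def search_many(patterns: list[str]) -> list[list[str]]:
    # Start one search per pattern and wait for all of them;
    # gather returns results in the same order as the inputs
    return await asyncio.gather(
        *(collect(fake_glob_async(p)) for p in patterns)
    )


results = asyncio.run(search_many(["*_2099.nc", "*_2100.nc"]))
for found in results:
    print(found)
```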