DiskCache

class sdgx.cachers.disk_cache.DiskCache(cache_dir: str | Path | None = None, identity: str | None = None, *args, **kwargs)

Bases: Cacher

Cacher that caches data on disk in Parquet format.

Parameters:
  • blocksize (int) – The blocksize of the cache.

  • cache_dir (str | Path | None, optional) – The directory where the cache will be stored. Defaults to None.

  • identity (str | None, optional) – The identity of the data source. Defaults to None.

Todo

  • Add partial cache when blocksize > chunksize

  • Improve cache invalidation

  • Improve performance if blocksize > chunksize

_get_cache_filename(offset: int) → Path

Get the cache filename for the given offset.

_refresh(offset: int, data: DataFrame) → None

Refresh the cache by writing data to the cache file in Parquet format.

clear_cache() → None

Clear all cache in cache_dir.

clear_invalid_cache()

Clear all cache in cache_dir.

TODO: Improve cache invalidation

is_cached(offset: int) → bool

Check whether the data is cached by checking whether the cache file exists.
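The existence check can be pictured as a path lookup keyed by offset. The following is a minimal stand-alone sketch, not the library's code: the `<offset>.parquet` naming scheme and the helper names `cache_filename` and `is_cached` are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def cache_filename(cache_dir: Path, offset: int) -> Path:
    # Hypothetical naming scheme: one file per starting offset.
    return cache_dir / f"{offset}.parquet"


def is_cached(cache_dir: Path, offset: int) -> bool:
    # A block counts as cached when its file exists on disk.
    return cache_filename(cache_dir, offset).exists()


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp)
    before = is_cached(cache_dir, 0)
    # Stand-in for writing a real Parquet file at offset 0.
    cache_filename(cache_dir, 0).touch()
    after = is_cached(cache_dir, 0)
    other = is_cached(cache_dir, 100)
```

Note that a check of this kind only tests for the file's presence; it says nothing about whether the cached content is still valid, which is presumably what the cache invalidation Todo refers to.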

iter(chunksize: int, data_connector: DataConnector) → Generator[DataFrame, None, None]

Load data from data_connector or cache in chunks.
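Chunked iteration of this kind is commonly built as a generator that calls a load function at increasing offsets. The sketch below illustrates that pattern under assumptions: plain lists stand in for DataFrames, `iter_chunks` and `load_from_source` are hypothetical names, and the stopping rule (stop on an empty or short chunk) is a guess rather than the documented behavior.

```python
from typing import Callable, Iterator, List

Rows = List[int]


def iter_chunks(load: Callable[[int, int], Rows], chunksize: int) -> Iterator[Rows]:
    offset = 0
    while True:
        chunk = load(offset, chunksize)
        if not chunk:
            # Nothing left to read.
            break
        yield chunk
        if len(chunk) < chunksize:
            # A short chunk means the source is exhausted (assumed rule).
            break
        offset += len(chunk)


source = list(range(7))


def load_from_source(offset: int, chunksize: int) -> Rows:
    # Hypothetical loader: slices an in-memory list instead of reading a connector.
    return source[offset:offset + chunksize]


chunks = list(iter_chunks(load_from_source, 3))
```

With seven rows and a chunk size of three, the generator yields two full chunks followed by one chunk holding the remaining row.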

load(offset: int, chunksize: int, data_connector: DataConnector) → DataFrame

Load data from data_connector or the cache.
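"Load from data_connector or cache" describes a read-through cache: return the cached block if one exists, otherwise read from the source and store the result. The sketch below shows that pattern only. It is not the library's implementation: `ListConnector` is a hypothetical stand-in for a DataConnector, lists replace DataFrames, and JSON files replace Parquet so the example runs with the standard library alone.

```python
import json
import tempfile
from pathlib import Path
from typing import List


class ListConnector:
    """Hypothetical stand-in for a DataConnector; counts how often it is read."""

    def __init__(self, rows: List[int]) -> None:
        self.rows = rows
        self.reads = 0

    def read(self, offset: int, limit: int) -> List[int]:
        self.reads += 1
        return self.rows[offset:offset + limit]


def load(cache_dir: Path, offset: int, chunksize: int, connector: ListConnector) -> List[int]:
    # Assumed layout: one cache file per offset (JSON here, Parquet in DiskCache).
    path = cache_dir / f"{offset}.json"
    if path.exists():
        # Cache hit: serve from disk without touching the connector.
        return json.loads(path.read_text())[:chunksize]
    # Cache miss: read from the source, then write the block to the cache.
    data = connector.read(offset, chunksize)
    path.write_text(json.dumps(data))
    return data


with tempfile.TemporaryDirectory() as tmp:
    connector = ListConnector(list(range(10)))
    first = load(Path(tmp), 0, 4, connector)
    second = load(Path(tmp), 0, 4, connector)
    reads = connector.reads
```

The second call returns the same rows as the first, but the connector is read only once because the block is served from the cache file.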

load_all(data_connector: DataConnector) → DataFrame

Load all data from data_connector or cache