Data Loader

class sdgx.data_loader.DataLoader(data_connector: DataConnector, chunksize: int = 10000, cacher: Cacher | str | type[Cacher] | None = None, cacher_kwargs: None | dict[str, Any] = None, identity: str | None = None)

Bases: object

Combine Cacher and DataConnector to load data in an efficient way.

The default Cacher is DiskCache. Use cacher or cache_mode to specify a different Cacher.

A GeneratorConnector must be combined with a Cacher: the cache is warmed up from the generator so that the data supports random access.
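
The warm-up idea can be illustrated with a minimal, standard-library-only sketch. It is not the sdgx implementation: `WarmedChunkCache` and its methods are hypothetical names, an in-memory dict stands in for a disk cache, and plain lists of integers stand in for DataFrames. The point is only that a one-pass generator, once drained into fixed-size chunks, can then be sliced at arbitrary positions.

```python
from typing import Callable, Dict, Iterable, List


class WarmedChunkCache:
    """Hypothetical illustration: drain a generator once, then serve slices."""

    def __init__(
        self,
        generator_factory: Callable[[], Iterable[List[int]]],
        chunksize: int = 4,
    ) -> None:
        self.chunksize = chunksize
        self._chunks: Dict[int, List[int]] = {}  # chunk start offset -> rows
        self._length = 0
        self._warm_up(generator_factory())

    def _warm_up(self, source: Iterable[List[int]]) -> None:
        # Regroup whatever batch sizes the generator yields into fixed-size chunks.
        buffer: List[int] = []
        for batch in source:
            buffer.extend(batch)
            while len(buffer) >= self.chunksize:
                self._store(buffer[: self.chunksize])
                buffer = buffer[self.chunksize:]
        if buffer:
            self._store(buffer)  # the last chunk may be shorter

    def _store(self, rows: List[int]) -> None:
        self._chunks[self._length] = rows
        self._length += len(rows)

    def __len__(self) -> int:
        return self._length

    def __getitem__(self, key: slice) -> List[int]:
        start, stop, step = key.indices(self._length)
        if step != 1:
            raise ValueError("this sketch only supports contiguous slices")
        rows: List[int] = []
        # Visit only the chunks that overlap [start, stop).
        first = (start // self.chunksize) * self.chunksize
        for offset in range(first, stop, self.chunksize):
            chunk = self._chunks[offset]
            rows.extend(chunk[max(start - offset, 0): min(stop - offset, len(chunk))])
        return rows


def rows():
    # A one-pass source: yields one row at a time and cannot be indexed.
    for i in range(10):
        yield [i]


cache = WarmedChunkCache(rows, chunksize=4)
print(len(cache))   # 10
print(cache[3:6])   # [3, 4, 5]
```

In this sketch the slice `cache[3:6]` touches two cached chunks without re-running the generator, which is the property a generator on its own cannot offer.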

Parameters:
  • data_connector (DataConnector) – The data connector

  • chunksize (int, optional) – The chunk size used by the cacher. Defaults to 10000.

  • cacher (Cacher, optional) – The cacher. Defaults to None.

  • cache_mode (str, optional) – The cache mode (the cacher's name). Defaults to “DiskCache”; see DiskCache for more information.

  • cacher_kwargs (dict, optional) – The keyword arguments passed to the cacher. Defaults to None.

  • identity (str, optional) – The identity of the data source. When using GeneratorConnector, it can point to the original data source, which makes it possible to work with MetadataCombiner.
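
The signature above allows the cacher to be given as an instance, a class, a name, or None. How sdgx resolves these internally is not shown on this page; the following is only a sketch of one common way to handle such an argument. Every name in it (`BaseCache`, `InMemoryCache`, `OnDiskCache`, `CACHE_REGISTRY`, `resolve_cache`) is hypothetical and not part of sdgx.

```python
from typing import Any, Dict, Optional, Type, Union


class BaseCache:
    """Hypothetical stand-in for a cacher base class."""

    def __init__(self, **kwargs: Any) -> None:
        self.options = kwargs


class InMemoryCache(BaseCache):
    pass


class OnDiskCache(BaseCache):
    pass


# Hypothetical name -> class registry used to look up cachers by name.
CACHE_REGISTRY: Dict[str, Type[BaseCache]] = {
    "InMemoryCache": InMemoryCache,
    "OnDiskCache": OnDiskCache,
}

CacheSpec = Union[BaseCache, str, Type[BaseCache], None]


def resolve_cache(
    spec: CacheSpec = None,
    kwargs: Optional[Dict[str, Any]] = None,
    default: str = "OnDiskCache",
) -> BaseCache:
    """Turn an instance, class, name, or None into a cache instance."""
    kwargs = kwargs or {}
    if isinstance(spec, BaseCache):
        return spec  # already constructed: use as-is
    if isinstance(spec, type) and issubclass(spec, BaseCache):
        return spec(**kwargs)  # a class: instantiate with the given kwargs
    if spec is None:
        spec = default  # nothing given: fall back to the default name
    if isinstance(spec, str):
        if spec not in CACHE_REGISTRY:
            raise ValueError(f"unknown cache name: {spec!r}")
        return CACHE_REGISTRY[spec](**kwargs)
    raise TypeError(f"unsupported cache specification: {spec!r}")


default_cache = resolve_cache()
named_cache = resolve_cache("InMemoryCache", {"size": 3})
```

Accepting all four forms keeps the simple case short (pass nothing, or a name) while still letting callers inject a fully configured instance.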

Example

Load and cache data from an existing CSV file or another data source.

from sdgx.data_loader import DataLoader
from sdgx.data_connectors.csv_connector import CsvConnector
from sdgx.utils import download_demo_data

dataset_csv = download_demo_data()
data_connector = CsvConnector(path=dataset_csv)

# Use DataConnector to initialize

dataloader = DataLoader(data_connector)

# Access data

dataloader.load_all()  # This will read all data from csv, and cache it.
dataloader.load_all()  # This will read all data from cache.

dataloader[:10]  # dataloader supports slicing

for df in dataloader.iter():  # dataloader supports iteration
    print(df.shape)

Advanced usage:

Load and cache data from a generator.

from typing import Generator

import pandas as pd

from sdgx.data_loader import DataLoader
from sdgx.data_connectors.generator_connector import GeneratorConnector

def generator() -> Generator[pd.DataFrame, None, None]:
    for i in range(100):
        yield pd.DataFrame({"a": [i], "b": [i]})

data_connector = GeneratorConnector(generator)

# Use DataConnector to initialize.
# A generator does not support random access, but caching makes it possible.
dataloader = DataLoader(data_connector)

# Access data
dataloader.load_all()  # This will read all data from the cache, which is warmed up from the generator.
dataloader.load_all()  # This will read all data from the cache again.

dataloader[:10]  # dataloader supports slicing

for df in dataloader.iter():  # dataloader supports iteration
    print(df.shape)

DEFAULT_CACHER_INITIAL

alias of DiskCache

columns() → list

Peek at the columns.

Returns:

Names of the columns.

Return type:

list

finalize(clear_cache=False) → None

Finalize the dataloader.

iter() → Generator[DataFrame, None, None]

Load data from the cache in chunks.
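
Chunked loading can be illustrated with a small standard-library sketch. It assumes nothing about how sdgx implements `iter()`: `iter_chunks` is a hypothetical helper, and plain integers stand in for rows. It shows how a chunk size splits a data source into fixed-size pieces, with a shorter final piece when the total is not a multiple of the chunk size.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_chunks(rows: Iterable[T], chunksize: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `chunksize` items from `rows`."""
    if chunksize <= 0:
        raise ValueError("chunksize must be positive")
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunksize))
        if not chunk:
            return  # source exhausted
        yield chunk


chunks = list(iter_chunks(range(10), chunksize=4))
sizes = [len(chunk) for chunk in chunks]
print(sizes)  # [4, 4, 2]

# Concatenating the chunks recovers the full data set.
everything = [row for chunk in chunks for row in chunk]
```

Iterating chunk by chunk keeps at most one chunk in memory at a time, whereas loading everything at once requires the whole data set to fit in memory.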

keys() → list

Same as columns().

load_all() → DataFrame

Load all data from cache.

property shape