Data Loader
- class sdgx.data_loader.DataLoader(data_connector: DataConnector, chunksize: int = 10000, cacher: Cacher | str | type[Cacher] | None = None, cacher_kwargs: None | dict[str, Any] = None, identity: str | None = None)
Bases: object
Combine Cacher and DataConnector to load data efficiently.
The default Cacher is DiskCache. Use cacher or cache_mode to specify a Cacher.
GeneratorConnector must be combined with a Cacher; the cache is warmed up for the generator to support random access.
- Parameters:
data_connector (DataConnector) – The data connector
chunksize (int, optional) – The chunksize of the cacher. Defaults to 10000.
cacher (Cacher, optional) – The cacher. Defaults to None.
cache_mode (str, optional) – The cache mode (the cacher’s name). Defaults to “DiskCache”; see DiskCache for more information.
cacher_kwargs (dict, optional) – The kwargs for the cacher. Defaults to None.
identity (str, optional) – The identity of the data source. When using GeneratorConnector, it can point to the original data source, which makes it possible to work with MetadataCombiner.
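To make the chunksize parameter more concrete, here is a minimal standalone sketch of chunked iteration. It is not sdgx's implementation: the helper `iter_chunks` is hypothetical, and plain lists of row dictionaries stand in for DataFrames.

```python
from typing import Iterator, List, Sequence


def iter_chunks(rows: Sequence[dict], chunksize: int = 10000) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `chunksize` rows (hypothetical helper)."""
    if chunksize <= 0:
        raise ValueError("chunksize must be positive")
    for start in range(0, len(rows), chunksize):
        yield list(rows[start:start + chunksize])


# 25 rows split with chunksize=10: two full chunks and one partial chunk.
rows = [{"a": i, "b": i} for i in range(25)]
chunk_sizes = [len(chunk) for chunk in iter_chunks(rows, chunksize=10)]
print(chunk_sizes)  # [10, 10, 5]
```

A smaller chunksize keeps less data in memory per step at the cost of more chunks to read and write.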
Example
Load and cache data from an existing CSV file or other data source.
```python
from sdgx.data_loader import DataLoader
from sdgx.data_connectors.csv_connector import CsvConnector
from sdgx.utils import download_demo_data

dataset_csv = download_demo_data()
data_connector = CsvConnector(path=dataset_csv)

# Use DataConnector to initialize
dataloader = DataLoader(data_connector)

# Access data
dataloader.load_all()  # This will read all data from csv, and cache it.
dataloader.load_all()  # This will read all data from cache.
dataloader[:10]  # dataloader supports slicing
for df in dataloader.iter():  # dataloader supports iteration
    print(df.shape)
```
Advanced usage:
Load and cache data from a generator.
```python
from typing import Generator

import pandas as pd

from sdgx.data_loader import DataLoader
from sdgx.data_connectors.generator_connector import GeneratorConnector

def generator() -> Generator[pd.DataFrame, None, None]:
    for i in range(100):
        yield pd.DataFrame({"a": [i], "b": [i]})

data_connector = GeneratorConnector(generator)

# Use DataConnector to initialize.
# Generators do not support random access, but caching makes it possible.
dataloader = DataLoader(data_connector)

# Access data
dataloader.load_all()  # This will read all data from cache
dataloader.load_all()  # This will read all data from cache.
dataloader[:10]  # dataloader supports slicing
for df in dataloader.iter():  # dataloader supports iteration
    print(df.shape)
```
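The idea of warming up a cache so that a generator supports random access can be illustrated without sdgx. The sketch below is a simplified assumption about the general technique, not the library's code: `CachedGenerator` is a hypothetical class that drains the generator once into memory, whereas DataLoader delegates storage to a Cacher such as DiskCache.

```python
from typing import Callable, Iterator, List


class CachedGenerator:
    """Hypothetical stand-in: drains a generator once so rows can be indexed."""

    def __init__(self, generator_factory: Callable[[], Iterator[dict]]):
        # Warm up the cache eagerly; a generator can only be consumed once.
        self._rows: List[dict] = list(generator_factory())

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(self, key):
        # Integer indices and slices are both served from the cached rows.
        return self._rows[key]


def generator() -> Iterator[dict]:
    for i in range(100):
        yield {"a": i, "b": i}


cached = CachedGenerator(generator)
first_ten = cached[:10]
print(len(cached), first_ten[-1])  # 100 {'a': 9, 'b': 9}
```

Because the rows live in the cache after warm-up, the same slice can be requested repeatedly, which a bare generator cannot offer.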
- property shape