Synthesizer
- class sdgx.synthesizer.Synthesizer(model: str | SynthesizerModel | type[SynthesizerModel], model_path: None | str | Path = None, model_kwargs: None | dict[str, Any] = None, metadata: None | Metadata = None, metadata_path: None | str | Path = None, data_connector: None | str | DataConnector | type[DataConnector] = None, data_connector_kwargs: None | dict[str, Any] = None, raw_data_loaders_kwargs: None | dict[str, Any] = None, processed_data_loaders_kwargs: None | dict[str, Any] = None, data_processors: None | list[str | DataProcessor | type[DataProcessor]] = None, data_processors_kwargs: None | dict[str, Any] = None)
Bases: object
Synthesizer is the high-level interface for synthesizing data.
Several usage examples are provided in our GitHub repository.
- Parameters:
model (str | SynthesizerModel | type[SynthesizerModel]) – The name of the model or the model itself. The type of the model must be SynthesizerModel. When model is a string, it must be registered in ModelManager.
model_path (str | Path, optional) – The path to the model file. Defaults to None. Used to load the model if model is a string or a type of SynthesizerModel.
model_kwargs (dict[str, Any], optional) – The keyword arguments for the model. Defaults to None.
metadata (Metadata, optional) – The metadata to use. Defaults to None.
metadata_path (str | Path, optional) – The path to the metadata file. Defaults to None. Used to load the metadata if metadata is None.
data_connector (DataConnector | type[DataConnector] | str, optional) – The data connector to use. Defaults to None. When data_connector is a string, it must be registered in DataConnectorManager.
data_connector_kwargs (dict[str, Any], optional) – The keyword arguments for data connectors. Defaults to None.
raw_data_loaders_kwargs (dict[str, Any], optional) – The keyword arguments for raw data loaders. Defaults to None.
processed_data_loaders_kwargs (dict[str, Any], optional) – The keyword arguments for processed data loaders. Defaults to None.
data_processors (list[str | DataProcessor | type[DataProcessor]], optional) – The data processors to use. Defaults to None. When a data processor is given as a string, it must be registered in DataProcessorManager.
data_processors_kwargs (dict[str, dict[str, Any]], optional) – The keyword arguments for data processors. Defaults to None.
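Several parameters above accept a registered name, a class, or a ready-made instance. The following is a minimal sketch of how such an argument could be resolved. It does not use sdgx: `BaseModel`, `DemoModel`, `REGISTRY`, and `resolve` are hypothetical stand-ins, and the choice to ignore keyword arguments when an instance is passed is an assumption, not documented sdgx behaviour.

```python
from typing import Any, Dict, Optional, Type, Union


class BaseModel:
    """Hypothetical stand-in for a model base class."""

    def __init__(self, epochs: int = 10) -> None:
        self.epochs = epochs


class DemoModel(BaseModel):
    """Hypothetical concrete model."""


# Hypothetical registry mapping names to classes (a stand-in for a manager).
REGISTRY: Dict[str, Type[BaseModel]] = {"demo": DemoModel}


def resolve(
    spec: Union[str, BaseModel, Type[BaseModel]],
    kwargs: Optional[Dict[str, Any]] = None,
) -> BaseModel:
    """Turn a name, class, or instance into an instance."""
    if isinstance(spec, BaseModel):
        # Already an instance: returned as-is (kwargs ignored by assumption).
        return spec
    if isinstance(spec, str):
        # A name must be present in the registry.
        try:
            spec = REGISTRY[spec]
        except KeyError:
            raise ValueError(f"{spec!r} is not registered") from None
    # At this point spec is a class: build it with the given kwargs.
    return spec(**(kwargs or {}))


by_name = resolve("demo", {"epochs": 1})
by_class = resolve(DemoModel)
instance = DemoModel(epochs=3)
by_instance = resolve(instance)
```

In this sketch an unknown name fails early with a `ValueError`, which mirrors the requirement that string arguments be registered beforehand.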
Example
    from sdgx.data_connectors.csv_connector import CsvConnector
    from sdgx.models.ml.single_table.ctgan import CTGANSynthesizerModel
    from sdgx.synthesizer import Synthesizer
    from sdgx.utils import download_demo_data

    dataset_csv = download_demo_data()
    data_connector = CsvConnector(path=dataset_csv)
    synthesizer = Synthesizer(
        model=CTGANSynthesizerModel(epochs=1),  # For quick demo
        data_connector=data_connector,
    )
    synthesizer.fit()
    sampled_data = synthesizer.sample(1000)
- METADATA_SAVE_NAME = 'metadata.json'
Default name for the metadata file.
- MODEL_SAVE_DIR = 'model'
Default name for the model directory.
- _sample_once(count: int, model_sample_args: None | dict[str, Any] = None) -> DataFrame
Sample data once.
DataProcessors may drop some broken data after reverse_convert, so we oversample first and then take the first count samples.
Todo
Using an adaptive scale for oversampling would be better for performance.
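The oversample-then-truncate idea described above can be sketched in plain Python. This is an illustration under stated assumptions, not the sdgx implementation: `sample_once`, `fake_draw`, the `scale` factor of 1.2, and the retry loop are all hypothetical, and lists of dicts stand in for DataFrames.

```python
import math
from typing import Callable, List, Optional

Row = Optional[dict]  # None stands in for a row dropped as broken


def sample_once(
    count: int,
    draw: Callable[[int], List[Row]],
    scale: float = 1.2,
    max_rounds: int = 10,
) -> List[dict]:
    """Oversample by `scale`, drop broken rows, keep the first `count`."""
    kept: List[dict] = []
    for _ in range(max_rounds):
        missing = count - len(kept)
        if missing <= 0:
            break
        # Ask for more rows than are still needed (assumed fixed scale).
        batch = draw(math.ceil(missing * scale))
        kept.extend(row for row in batch if row is not None)
    # Truncate so the caller never receives more than `count` rows.
    return kept[:count]


def fake_draw(n: int) -> List[Row]:
    # Hypothetical model: every fifth row comes back broken (None).
    return [None if i % 5 == 4 else {"id": i} for i in range(n)]


rows = sample_once(10, fake_draw)
```

A fixed scale wastes work when few rows are dropped and needs extra rounds when many are, which is the motivation for the adaptive scale mentioned in the Todo.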
- cleanup()
Clean up resources. This makes the model unavailable and clears the cache.
It is useful when the Synthesizer object is no longer needed and may be holding large resources such as GPUs.
- fit(metadata: None | Metadata = None, inspector_max_chunk: int = 10, metadata_include_inspectors: None | list[str] = None, metadata_exclude_inspectors: None | list[str] = None, inspector_init_kwargs: None | dict[str, Any] = None, model_fit_kwargs: None | dict[str, Any] = None)
Fit the synthesizer with metadata and data processors.
Raw data will be loaded from the dataloader and processed by the data processors in a Generator. The Generator, which provides the processed data, will be wrapped into a DataLoader, aka ProcessedDataLoader. The ProcessedDataLoader will be used to fit the model.
For more information about DataLoaders, please refer to DataLoader.
For more information about DataProcessors, please refer to DataProcessor.
For more information about DataConnectors, please refer to DataConnector, especially GeneratorConnector.
- Parameters:
metadata (Metadata, optional) – The metadata to use. Defaults to None. If None, it will be inferred from the dataloader with the from_dataloader() method.
inspector_max_chunk (int, optional) – The maximum number of chunks to inspect. Defaults to 10.
metadata_include_inspectors (list[str], optional) – The list of metadata inspectors to include. Defaults to None.
metadata_exclude_inspectors (list[str], optional) – The list of metadata inspectors to exclude. Defaults to None.
inspector_init_kwargs (dict[str, Any], optional) – The keyword arguments for metadata inspectors. Defaults to None.
model_fit_kwargs (dict[str, Any], optional) – The keyword arguments for model.fit. Defaults to None.
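The lazy processing step that fit describes, where raw chunks pass through the data processors inside a generator, can be sketched as follows. This is a simplified illustration that does not use sdgx: `processed_chunks`, `drop_missing`, and `add_flag` are hypothetical names, and lists of dicts stand in for the chunks a real data loader would provide.

```python
from typing import Callable, Iterable, Iterator, List

Chunk = List[dict]


def processed_chunks(
    raw_chunks: Iterable[Chunk],
    processors: List[Callable[[Chunk], Chunk]],
) -> Iterator[Chunk]:
    """Lazily apply each processor, in order, to every raw chunk."""
    for chunk in raw_chunks:
        for process in processors:
            chunk = process(chunk)
        # Nothing is computed until the consumer asks for the next chunk.
        yield chunk


def drop_missing(chunk: Chunk) -> Chunk:
    # Hypothetical processor: remove rows with a missing age.
    return [row for row in chunk if row["age"] is not None]


def add_flag(chunk: Chunk) -> Chunk:
    # Hypothetical processor: derive a new column from an existing one.
    return [{**row, "adult": row["age"] >= 18} for row in chunk]


raw = [
    [{"age": 30}, {"age": None}],
    [{"age": 12}],
]
stream = processed_chunks(raw, [drop_missing, add_flag])
result = list(stream)
```

Because the generator produces one processed chunk at a time, a consumer can iterate over it without holding the whole processed dataset in memory.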
- classmethod load(load_dir: str | Path, model: str | type[SynthesizerModel], metadata: None | Metadata = None, data_connector: None | str | DataConnector | type[DataConnector] = None, data_connector_kwargs: None | dict[str, Any] = None, raw_data_loaders_kwargs: None | dict[str, Any] = None, processed_data_loaders_kwargs: None | dict[str, Any] = None, data_processors: None | list[str | DataProcessor | type[DataProcessor]] = None, data_processors_kwargs: None | dict[str, dict[str, Any]] = None, model_kwargs=None) -> Synthesizer
Load metadata and model, allowing the Synthesizer to be rebuilt for fine-tuning or other use cases.
model is required because not every model supports saving and loading via pickle.
- Parameters:
load_dir (str | Path) – The directory to load the model.
model (str | type[SynthesizerModel]) – The name of the model or the model itself. The type of the model must be SynthesizerModel. When model is a string, it must be registered in ModelManager.
metadata (Metadata, optional) – The metadata to use. Defaults to None.
data_connector (DataConnector | type[DataConnector] | str, optional) – The data connector to use. Defaults to None. When data_connector is a string, it must be registered in DataConnectorManager.
data_connector_kwargs (dict[str, Any], optional) – The keyword arguments for data connectors. Defaults to None.
raw_data_loaders_kwargs (dict[str, Any], optional) – The keyword arguments for raw data loaders. Defaults to None.
processed_data_loaders_kwargs (dict[str, Any], optional) – The keyword arguments for processed data loaders. Defaults to None.
data_processors (list[str | DataProcessor | type[DataProcessor]], optional) – The data processors to use. Defaults to None. When a data processor is given as a string, it must be registered in DataProcessorManager.
data_processors_kwargs (dict[str, dict[str, Any]], optional) – The keyword arguments for data processors. Defaults to None.
- Returns:
The synthesizer instance.
- Return type:
Synthesizer
- sample(count: int, chunksize: None | int = None, metadata: None | Metadata = None, model_sample_args: None | dict[str, Any] = None) -> DataFrame | Generator[DataFrame, None, None]
Sample data from the synthesizer.
- Parameters:
count (int) – The number of samples to generate.
chunksize (int, optional) – The chunk size to use. Defaults to None. If not None, the data will be sampled in chunks and a generator that yields chunks of samples will be returned.
metadata (Metadata, optional) – The metadata to use. Defaults to None. If None, the metadata from fit will be used first.
model_sample_args (dict[str, Any], optional) – The keyword arguments for model.sample. Defaults to None.
- Returns:
The sampled data. When chunksize is not None, it will be a generator.
- Return type:
pd.DataFrame | Generator[pd.DataFrame, None, None]
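The two return shapes of sample, a single result or a generator of chunks, can be illustrated with a small stand-alone sketch. It does not call sdgx: `sample`, `draw`, and the list-of-dicts rows are hypothetical stand-ins for the synthesizer, the model, and DataFrames, and rejecting a non-positive chunksize is an assumption of this sketch.

```python
from typing import Callable, Iterator, List, Optional, Union

Rows = List[dict]


def sample(
    count: int,
    draw: Callable[[int], Rows],
    chunksize: Optional[int] = None,
) -> Union[Rows, Iterator[Rows]]:
    """Return all rows at once, or a generator of chunks if chunksize is set."""
    if chunksize is None:
        return draw(count)
    if chunksize <= 0:
        raise ValueError("chunksize must be a positive integer")

    def chunks() -> Iterator[Rows]:
        remaining = count
        while remaining > 0:
            # The last chunk may be smaller than chunksize.
            size = min(chunksize, remaining)
            yield draw(size)
            remaining -= size

    # The generator lives in a nested function so that the outer function
    # can still return a plain list when chunksize is None.
    return chunks()


def draw(n: int) -> Rows:
    # Hypothetical model that produces n placeholder rows.
    return [{"value": i} for i in range(n)]


all_rows = sample(10, draw)
chunk_sizes = [len(chunk) for chunk in sample(10, draw, chunksize=4)]
```

Callers therefore need to check which mode they requested: iterating over the chunked result yields one batch at a time, while the unchunked result is already complete.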