Synthesizer

Bases: object

Synthesizer is the high level interface for synthesizing data.

Several usage examples are provided in our GitHub repository.

Parameters:

model (str | SynthesizerModel | type[SynthesizerModel]) – The name of the model or the model itself. Type of model must be SynthesizerModel. When model is a string, it must be registered in ModelManager.
model_path (str | Path, optional) – The path to the model file. Defaults to None. Used to load the model if model is a string or type of SynthesizerModel.
model_kwargs (dict[str, Any], optional) – The keyword arguments for model. Defaults to None.
metadata (Metadata, optional) – The metadata to use. Defaults to None.
metadata_path (str | Path, optional) – The path to the metadata file. Defaults to None. Used to load the metadata if metadata is None.
data_connector (DataConnector | type[DataConnector] | str, optional) – The data connector to use. Defaults to None. When data_connector is a string, it must be registered in DataConnectorManager.
data_connector_kwargs (dict[str, Any], optional) – The keyword arguments for data connectors. Defaults to None.
raw_data_loaders_kwargs (dict[str, Any], optional) – The keyword arguments for raw data loaders. Defaults to None.
processed_data_loaders_kwargs (dict[str, Any], optional) – The keyword arguments for processed data loaders. Defaults to None.
data_processors (list[str | DataProcessor | type[DataProcessor]], optional) – The data processors to use. Defaults to None. When an entry is a string, it must be registered in DataProcessorManager.
data_processors_kwargs (dict[str, dict[str, Any]], optional) – The keyword arguments for data processors. Defaults to None.
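Several parameters above accept a registered name, a class, or a ready-made instance. The following is a minimal sketch of how such an argument could be resolved. It is not the sdgx implementation: the stand-in SynthesizerModel class, DemoModel, REGISTRY and resolve_model are all hypothetical names introduced for illustration.

```python
from __future__ import annotations

from typing import Any


class SynthesizerModel:
    """Stand-in base class for this sketch; not the real sdgx class."""

    def __init__(self, **kwargs: Any) -> None:
        self.kwargs = kwargs


class DemoModel(SynthesizerModel):
    """Hypothetical model used only for illustration."""


# Hypothetical registry mapping names to model classes.
REGISTRY: dict[str, type[SynthesizerModel]] = {"demo": DemoModel}


def resolve_model(
    model: str | SynthesizerModel | type[SynthesizerModel],
    model_kwargs: dict[str, Any] | None = None,
) -> SynthesizerModel:
    """Return a model instance from a name, a class, or an instance."""
    if isinstance(model, SynthesizerModel):
        # Already constructed: return it unchanged, kwargs are not applied.
        return model
    if isinstance(model, str):
        # A string is only valid if it has been registered.
        if model not in REGISTRY:
            raise KeyError(f"model {model!r} is not registered")
        model = REGISTRY[model]
    if isinstance(model, type) and issubclass(model, SynthesizerModel):
        # A class is instantiated with the supplied keyword arguments.
        return model(**(model_kwargs or {}))
    raise TypeError("model must be a name, a SynthesizerModel, or a subclass of it")
```

In this sketch the keyword arguments only matter when a name or a class is passed, since an instance has already been constructed.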

Example

from sdgx.data_connectors.csv_connector import CsvConnector
from sdgx.models.ml.single_table.ctgan import CTGANSynthesizerModel
from sdgx.synthesizer import Synthesizer
from sdgx.utils import download_demo_data

dataset_csv = download_demo_data()
data_connector = CsvConnector(path=dataset_csv)
synthesizer = Synthesizer(
    model=CTGANSynthesizerModel(epochs=1),  # For quick demo
    data_connector=data_connector,
)
synthesizer.fit()
sampled_data = synthesizer.sample(1000)

METADATA_SAVE_NAME = 'metadata.json'  #: Default name for the metadata file

MODEL_SAVE_DIR = 'model'  #: Default name for the model directory

_sample_once(count: int, model_sample_args: None | dict[str, Any] = None) → DataFrame

Sample data once.

DataProcessors may drop some broken rows after reverse_convert, so we oversample first and then keep the first count samples.

Todo

Using an adaptive scale for oversampling would improve performance.
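The oversample-then-truncate idea can be sketched with plain Python lists. This is only an illustration under stated assumptions: the function name sample_once, the fixed scale of 1.2, and the generate and is_valid callables are hypothetical and do not reflect the actual sdgx code.

```python
from __future__ import annotations

import math
from typing import Callable


def sample_once(
    count: int,
    generate: Callable[[int], list[dict]],
    is_valid: Callable[[dict], bool],
    scale: float = 1.2,  # assumed fixed oversampling factor
) -> list[dict]:
    """Request more rows than needed, drop broken ones, keep the first `count`."""
    requested = math.ceil(count * scale)
    rows = [row for row in generate(requested) if is_valid(row)]
    # May return fewer than `count` rows if too many were dropped.
    return rows[:count]


def demo_generate(n: int) -> list[dict]:
    """Toy generator: every fifth row is broken (age is None)."""
    return [{"age": None if i % 5 == 4 else 20 + i} for i in range(n)]


def demo_is_valid(row: dict) -> bool:
    return row["age"] is not None


rows = sample_once(10, demo_generate, demo_is_valid)
```

A fixed scale can still leave the result short when many rows are dropped, in which case a caller would have to sample again to top up.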

cleanup()

Clean up resources. This makes the model unavailable and clears the cache.

It is useful when the Synthesizer object is no longer needed and may be holding large resources such as GPUs.

fit(metadata: None | Metadata = None, inspector_max_chunk: int = 10, metadata_include_inspectors: None | list[str] = None, metadata_exclude_inspectors: None | list[str] = None, inspector_init_kwargs: None | dict[str, Any] = None, model_fit_kwargs: None | dict[str, Any] = None)

Fit the synthesizer with metadata and data processors.

Raw data will be loaded from the dataloader and processed by the data processors in a Generator. The Generator, which yields the processed data, will be wrapped into a DataLoader, aka ProcessedDataLoader. The ProcessedDataLoader will be used to fit the model.

For more information about DataLoaders, please refer to the DataLoader.

For more information about DataProcessors, please refer to the DataProcessor.

For more information about DataConnectors, please refer to the DataConnector, especially the GeneratorConnector.
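The generator-based processing described above can be pictured as follows. This is a simplified sketch, not the sdgx pipeline: processed_chunks is a hypothetical helper, and the processors are modelled as plain callables rather than DataProcessor objects.

```python
from __future__ import annotations

from typing import Callable, Iterable, Iterator


def processed_chunks(
    raw_chunks: Iterable[list],
    processors: list[Callable[[list], list]],
) -> Iterator[list]:
    """Lazily apply every processor, in order, to each raw chunk."""
    for chunk in raw_chunks:
        for process in processors:
            chunk = process(chunk)
        # Nothing is processed until the consumer asks for the next chunk.
        yield chunk


def double(chunk: list) -> list:
    """Toy processor: double every value in the chunk."""
    return [value * 2 for value in chunk]


raw = ([i, i + 1] for i in range(3))
first = next(processed_chunks(raw, [double]))
```

Because chunks are produced on demand, the processed dataset never has to be held in memory as a whole before fitting.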

Parameters:

metadata (Metadata, optional) – The metadata to use. Defaults to None. If None, it will be inferred from the dataloader with the from_dataloader() method.
inspector_max_chunk (int, optional) – The maximum number of chunks to inspect. Defaults to 10.
metadata_include_inspectors (list[str], optional) – The list of metadata inspectors to include. Defaults to None.
metadata_exclude_inspectors (list[str], optional) – The list of metadata inspectors to exclude. Defaults to None.
inspector_init_kwargs (dict[str, Any], optional) – The keyword arguments for metadata inspectors. Defaults to None.
model_fit_kwargs (dict[str, Any], optional) – The keyword arguments for model.fit. Defaults to None.

Load metadata and model, allowing the Synthesizer to be rebuilt for fine-tuning or other use cases.

The model argument is required because not every model supports saving and loading via pickle.

Parameters:

load_dir (str | Path) – The directory to load the model.
model (str | type[SynthesizerModel]) – The name of the model or the model itself. Type of model must be SynthesizerModel. When model is a string, it must be registered in ModelManager.
metadata (Metadata, optional) – The metadata to use. Defaults to None.
data_connector (DataConnector | type[DataConnector] | str, optional) – The data connector to use. Defaults to None. When data_connector is a string, it must be registered in DataConnectorManager.
data_connector_kwargs (dict[str, Any], optional) – The keyword arguments for data connectors. Defaults to None.
raw_data_loaders_kwargs (dict[str, Any], optional) – The keyword arguments for raw data loaders. Defaults to None.
processed_data_loaders_kwargs (dict[str, Any], optional) – The keyword arguments for processed data loaders. Defaults to None.
data_processors (list[str | DataProcessor | type[DataProcessor]], optional) – The data processors to use. Defaults to None. When an entry is a string, it must be registered in DataProcessorManager.
data_processors_kwargs (dict[str, dict[str, Any]], optional) – The keyword arguments for data processors. Defaults to None.

Returns:

The synthesizer instance.

Return type:

Synthesizer

sample(count: int, chunksize: None | int = None, metadata: None | Metadata = None, model_sample_args: None | dict[str, Any] = None) → DataFrame | Generator[DataFrame, None, None]

Sample data from the synthesizer.

Parameters:

count (int) – The number of samples to generate.
chunksize (int, optional) – The chunk size to use. Defaults to None. If not None, the data will be sampled in chunks and a generator that yields chunks of samples will be returned.
metadata (Metadata, optional) – The metadata to use. Defaults to None. If None, the metadata from fit is used first.
model_sample_args (dict[str, Any], optional) – The keyword arguments for model.sample. Defaults to None.

Returns:

The sampled data. When chunksize is not None, it will be a generator.

Return type:

pd.DataFrame | Generator[pd.DataFrame, None, None]
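The two return shapes can be illustrated with a small stand-alone sketch. It uses lists of dicts instead of DataFrames, and the helper _generate is hypothetical; the real method delegates to the fitted model.

```python
from __future__ import annotations

from typing import Iterator


def _generate(n: int, start: int = 0) -> list[dict]:
    """Hypothetical stand-in for the model's sampling call."""
    return [{"id": start + i} for i in range(n)]


def sample(count: int, chunksize: int | None = None) -> list[dict] | Iterator[list[dict]]:
    """Return all rows at once, or a generator of chunks when chunksize is set."""
    if chunksize is None:
        return _generate(count)
    if chunksize <= 0:
        raise ValueError("chunksize must be a positive integer")

    def chunks() -> Iterator[list[dict]]:
        produced = 0
        while produced < count:
            # The final chunk may be smaller than chunksize.
            size = min(chunksize, count - produced)
            yield _generate(size, start=produced)
            produced += size

    # The generator lives in an inner function so that the non-chunked
    # branch can still return a plain value.
    return chunks()


all_rows = sample(5)
chunk_sizes = [len(chunk) for chunk in sample(5, chunksize=2)]
```

Callers that pass chunksize need to iterate over the result rather than treat it as a single table.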

save(save_dir: str | Path) → Path

Dump metadata and model to files.

Parameters:

save_dir (str | Path) – The directory to save the model.

Returns:

The directory the synthesizer was saved to.

Return type:

Path
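To show why loading needs the model class when pickle is not an option, here is a toy sketch of a save and load round trip. The directory layout (a metadata file plus a model subdirectory named after the two class constants) is an assumption, and ToyModel, save_synthesizer and load_synthesizer are hypothetical names, not sdgx APIs.

```python
from __future__ import annotations

import json
from pathlib import Path

METADATA_SAVE_NAME = "metadata.json"
MODEL_SAVE_DIR = "model"


class ToyModel:
    """Hypothetical model with its own, non-pickle persistence format."""

    def __init__(self, weights: list[float] | None = None) -> None:
        self.weights = weights or []

    def save(self, save_dir: Path) -> None:
        save_dir.mkdir(parents=True, exist_ok=True)
        (save_dir / "weights.json").write_text(json.dumps(self.weights))

    @classmethod
    def load(cls, save_dir: Path) -> ToyModel:
        return cls(json.loads((save_dir / "weights.json").read_text()))


def save_synthesizer(save_dir: str | Path, metadata: dict, model: ToyModel) -> Path:
    """Write metadata and model under save_dir and return that directory."""
    save_dir = Path(save_dir)
    save_dir.mkdir(parents=True, exist_ok=True)
    (save_dir / METADATA_SAVE_NAME).write_text(json.dumps(metadata))
    model.save(save_dir / MODEL_SAVE_DIR)
    return save_dir


def load_synthesizer(load_dir: str | Path, model_cls: type[ToyModel]) -> tuple[dict, ToyModel]:
    """Read metadata, then let the given model class restore its own files."""
    load_dir = Path(load_dir)
    metadata = json.loads((load_dir / METADATA_SAVE_NAME).read_text())
    model = model_cls.load(load_dir / MODEL_SAVE_DIR)
    return metadata, model
```

The loader cannot guess which class wrote the model directory, which is why the class is passed in explicitly in this sketch.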
