CTGANSynthesizerModel
- class sdgx.models.ml.single_table.ctgan.CTGANSynthesizerModel(embedding_dim=128, generator_dim=(256, 256), discriminator_dim=(256, 256), generator_lr=0.0002, generator_decay=1e-06, discriminator_lr=0.0002, discriminator_decay=1e-06, batch_size=500, discriminator_steps=1, log_frequency=True, epochs=300, pac=10, device='cpu')
Bases: MLSynthesizerModel, BatchedSynthesizer
Modified from sdgx.models.components.sdv_ctgan.synthesizers.ctgan.CTGANSynthesizer. This is a CTGANSynthesizer that exposes the SynthesizerModel interface and supports chunked fitting.
This is the core class of the CTGAN project, where the different components are orchestrated together. For more details about the process, see the paper [Modeling Tabular data using Conditional GAN](https://arxiv.org/abs/1907.00503).
- Parameters:
embedding_dim (int) – Size of the random sample passed to the Generator. Defaults to 128.
generator_dim (tuple or list of ints) – Size of the output samples for each one of the Residuals. A Residual Layer will be created for each one of the values provided. Defaults to (256, 256).
discriminator_dim (tuple or list of ints) – Size of the output samples for each one of the Discriminator Layers. A Linear Layer will be created for each one of the values provided. Defaults to (256, 256).
generator_lr (float) – Learning rate for the generator. Defaults to 2e-4.
generator_decay (float) – Generator weight decay for the Adam Optimizer. Defaults to 1e-6.
discriminator_lr (float) – Learning rate for the discriminator. Defaults to 2e-4.
discriminator_decay (float) – Discriminator weight decay for the Adam Optimizer. Defaults to 1e-6.
batch_size (int) – Number of data samples to process in each step. Defaults to 500.
discriminator_steps (int) – Number of discriminator updates to perform for each generator update. The WGAN paper (https://arxiv.org/abs/1701.07875) uses 5; the default here is 1 to match the original CTGAN implementation.
log_frequency (boolean) – Whether to use log frequency of categorical levels in conditional sampling. Defaults to True.
epochs (int) – Number of training epochs. Defaults to 300.
pac (int) – Number of samples to group together when applying the discriminator. Defaults to 10.
device (str) – Device to run the training on. Defaults to 'cpu'; 'cuda' is preferred when a GPU is available.
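To make the effect of log_frequency concrete, the sketch below compares conditioning probabilities proportional to raw category counts with probabilities proportional to log(count + 1). The log(count + 1) weighting is an assumption based on the upstream CTGAN data sampler, and `category_probabilities` is a hypothetical helper, not part of the sdgx API.

```python
import math
from collections import Counter


def category_probabilities(values, log_frequency=True):
    """Return {category: probability} for picking a conditioning category.

    Hypothetical helper for illustration; not part of the sdgx API.
    """
    counts = Counter(values)
    if log_frequency:
        # Assumed weighting: log(count + 1) dampens dominant categories.
        weights = {cat: math.log(n + 1) for cat, n in counts.items()}
    else:
        # Plain empirical frequencies.
        weights = {cat: float(n) for cat, n in counts.items()}
    total = sum(weights.values())
    return {cat: w / total for cat, w in weights.items()}


# A heavily imbalanced column: 97 rows of "a", 3 rows of "b".
column = ["a"] * 97 + ["b"] * 3
raw = category_probabilities(column, log_frequency=False)
logged = category_probabilities(column, log_frequency=True)
print(raw["b"], logged["b"])
```

With the log weighting, the rare category "b" is chosen as a condition far more often than its raw 3% share, which is the point of the option: rare levels get more training signal.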
- MODEL_SAVE_NAME = 'ctgan.pkl'
- _filter_discrete_columns(train_data: List[str], discrete_columns: List[str])
Filters PII columns out of the discrete columns. For now, PII columns can only be discrete. PII values are produced by the PII generator rather than synthesized by the model.
This method also determines when to stop model fitting, namely when the original data consists entirely of discrete columns and all of those discrete columns are PII.
- For train_data, there are three possibilities for the column types:
train_data = valid_discrete + valid_continue
train_data = valid_continue
train_data = valid_discrete
For discrete_columns, discrete_columns = invalid_discrete (PII) + valid_discrete.
- Thus, valid_discrete = discrete_columns - invalid_discrete
= Set.intersection(train_data, discrete_columns)
Thus, original_data_is_all_PII holds when discrete_columns is not empty and train_data is empty.
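The set reasoning above can be sketched in a few lines. This is an illustration only, assuming train_data lists the column names that remain after PII columns have been removed, so the valid discrete columns are the names that appear in both lists. `filter_discrete_columns` is a hypothetical stand-in, not the sdgx implementation.

```python
from typing import List, Tuple


def filter_discrete_columns(
    train_columns: List[str], discrete_columns: List[str]
) -> Tuple[List[str], bool]:
    """Return (valid_discrete, original_data_is_all_pii).

    Hypothetical stand-in for illustration, not the sdgx implementation.
    Assumes PII columns were already removed from train_columns.
    """
    train_set = set(train_columns)
    # Discrete columns still present in the training data are the non-PII ones.
    valid_discrete = [c for c in discrete_columns if c in train_set]
    # Everything was PII if discrete columns were declared but no
    # training columns remain.
    all_pii = bool(discrete_columns) and not train_columns
    return valid_discrete, all_pii


mixed = filter_discrete_columns(["age", "city"], ["city", "email"])
only_pii = filter_discrete_columns([], ["email", "phone"])
print(mixed)
print(only_pii)
```

In the first call "email" is treated as PII because it is declared discrete but absent from the training columns; in the second call nothing is left to fit.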
- _fit(*args, **kwargs)
- static _gumbel_softmax(logits, tau=1, hard=False, eps=1e-10, dim=-1)
Deals with the instability of the gumbel_softmax for older versions of torch.
For more details about the issue: https://drive.google.com/file/d/1AA5wPfZ1kquaRtVruCd6BiYZGcDeNxyP/view?usp=sharing
- Parameters:
logits ([…, num_features]) – Unnormalized log probabilities
tau – Non-negative scalar temperature
hard (bool) – If True, the returned samples will be discretized as one-hot vectors, but will be differentiated as if it is the soft sample in autograd
dim (int) – A dimension along which softmax will be computed. Default: -1.
- Returns:
Sampled tensor of same shape as logits from the Gumbel-Softmax distribution.
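For readers unfamiliar with the technique, here is a minimal plain-Python sketch of Gumbel-Softmax sampling over a single vector of logits. It only illustrates the roles of tau and hard; the actual method operates on torch tensors and preserves gradients, which this sketch does not attempt. The function name and the `rng` parameter are illustrative assumptions.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, hard=False, rng=random):
    """Sample from the Gumbel-Softmax distribution over a 1-D list of logits.

    Plain-Python sketch; no autograd, no batching. tau must be positive.
    """
    noisy = []
    for logit in logits:
        u = rng.random()
        while u == 0.0:  # avoid log(0)
            u = rng.random()
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed, tempered logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    if not hard:
        return soft
    # hard=True: discretize to a one-hot vector at the arg max.
    winner = soft.index(max(soft))
    return [1.0 if i == winner else 0.0 for i in range(len(soft))]


random.seed(0)
soft_sample = gumbel_softmax([2.0, 0.5, -1.0], tau=0.5)
hard_sample = gumbel_softmax([2.0, 0.5, -1.0], tau=0.5, hard=True)
print(soft_sample)
print(hard_sample)
```

Lower values of tau push the soft sample closer to one-hot, while higher values flatten it toward uniform.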
- _pre_fit(dataloader: DataLoader, discrete_columns: list[str] | None = None, metadata: Metadata | None = None)
- _sample(*args, **kwargs)
- _validate_discrete_columns(train_data, discrete_columns)
Check whether discrete_columns exists in train_data.
- Parameters:
train_data (numpy.ndarray or pandas.DataFrame or list) – Training data. It must be a 2-dimensional numpy array or a pandas.DataFrame.
discrete_columns (list-like) – List of discrete columns to be used to generate the Conditional Vector. If train_data is a NumPy array, this list should contain the integer indices of the columns. Otherwise, if it is a pandas.DataFrame, this list should contain the column names.
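A check of this kind might look like the sketch below, which works on a plain collection of column identifiers so that it covers both cases: column names for a DataFrame, or integer indices for an array. `validate_discrete_columns` and its error message are hypothetical, not the sdgx implementation.

```python
def validate_discrete_columns(column_ids, discrete_columns):
    """Raise ValueError if any discrete column is missing from column_ids.

    Hypothetical helper for illustration. For a DataFrame, column_ids would
    be its column names; for a 2-D array, range(number_of_columns).
    """
    available = set(column_ids)
    missing = [c for c in discrete_columns if c not in available]
    if missing:
        raise ValueError(f"Invalid columns found: {missing}")


# DataFrame-style check with column names.
validate_discrete_columns(["age", "city", "income"], ["city"])

# Array-style check with integer indices for a 3-column array.
validate_discrete_columns(range(3), [0, 2])

# A missing column is reported instead of failing later during training.
try:
    validate_discrete_columns(["age", "city"], ["city", "country"])
except ValueError as err:
    message = str(err)
    print(message)
```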
- fit(metadata: Metadata, dataloader: DataLoader, epochs=None, *args, **kwargs)
Fit the model using the given metadata and dataloader.
- Parameters:
metadata (Metadata) – The metadata to use.
dataloader (DataLoader) – The dataloader to use.
- classmethod load(save_dir: str | Path, device: str | None = None) → CTGANSynthesizerModel
Load model from file.
- Parameters:
save_dir (str | Path) – The directory to load the model from.
- sample(count: int, *args, **kwargs) → DataFrame
Sample data from the model.
- Parameters:
count (int) – The number of samples to generate.
- Returns:
The generated data.
- Return type:
pd.DataFrame
- class sdgx.models.ml.single_table.ctgan.Discriminator(input_dim, discriminator_dim, pac=10)
Bases: Module
Discriminator for the CTGAN.
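The pac argument groups samples before they reach the discriminator. The plain-Python sketch below shows the grouping, assuming (as in the upstream CTGAN code) that every pac consecutive rows are concatenated into a single row of width input_dim * pac, which requires the number of rows to be divisible by pac. `pack_rows` is a hypothetical helper used only for illustration.

```python
def pack_rows(rows, pac=10):
    """Concatenate every `pac` consecutive rows into one longer row.

    Hypothetical helper; mirrors reshaping a (n, d) batch to (n / pac, d * pac).
    """
    if len(rows) % pac != 0:
        raise ValueError("number of rows must be divisible by pac")
    packed = []
    for start in range(0, len(rows), pac):
        group = rows[start:start + pac]
        # Flatten the group of rows into a single row.
        packed.append([value for row in group for value in row])
    return packed


batch = [[i, i + 0.5] for i in range(20)]  # 20 rows, input_dim = 2
packed = pack_rows(batch, pac=10)
print(len(packed), len(packed[0]))
```

Under this assumption, a batch of 20 rows with pac=10 is judged as 2 packed samples of width 20, so batch_size should be chosen as a multiple of pac.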
- class sdgx.models.ml.single_table.ctgan.Generator(embedding_dim, generator_dim, data_dim)
Bases: Module
Generator for the CTGAN.
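To see how embedding_dim, generator_dim, and data_dim interact, the sketch below does the dimension bookkeeping only, with no neural network code. It assumes the upstream CTGAN design, in which each residual layer concatenates its input to its output, so the feature width grows by each value in generator_dim before a final linear layer maps to data_dim. `generator_layer_shapes` and the data_dim value of 40 are hypothetical.

```python
def generator_layer_shapes(embedding_dim, generator_dim, data_dim):
    """Return (layer_kind, input_width, output_width) for each layer.

    Hypothetical helper; assumes residual layers concatenate input and output.
    """
    shapes = []
    dim = embedding_dim
    for width in generator_dim:
        # The layer produces `width` features and keeps the `dim` it received.
        shapes.append(("Residual", dim, dim + width))
        dim += width
    # Final projection to the width of the transformed data.
    shapes.append(("Linear", dim, data_dim))
    return shapes


shapes = generator_layer_shapes(128, (256, 256), 40)
for kind, in_width, out_width in shapes:
    print(kind, in_width, out_width)
```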