CTGANSynthesizerModel

class sdgx.models.ml.single_table.ctgan.CTGANSynthesizerModel(embedding_dim=128, generator_dim=(256, 256), discriminator_dim=(256, 256), generator_lr=0.0002, generator_decay=1e-06, discriminator_lr=0.0002, discriminator_decay=1e-06, batch_size=500, discriminator_steps=1, log_frequency=True, epochs=300, pac=10, device='cpu')

Bases: MLSynthesizerModel, BatchedSynthesizer

Modified from sdgx.models.components.sdv_ctgan.synthesizers.ctgan.CTGANSynthesizer. It is a CTGANSynthesizer that exposes the SynthesizerModel interface and supports chunked fitting.

This is the core class of the CTGAN project, where the different components are orchestrated together. For more details about the process, please check the [Modeling Tabular data using Conditional GAN](https://arxiv.org/abs/1907.00503) paper.

Parameters:
  • embedding_dim (int) – Size of the random sample passed to the Generator. Defaults to 128.

  • generator_dim (tuple or list of ints) – Size of the output samples for each one of the Residuals. A Residual Layer will be created for each one of the values provided. Defaults to (256, 256).

  • discriminator_dim (tuple or list of ints) – Size of the output samples for each one of the Discriminator Layers. A Linear Layer will be created for each one of the values provided. Defaults to (256, 256).

  • generator_lr (float) – Learning rate for the generator. Defaults to 2e-4.

  • generator_decay (float) – Generator weight decay for the Adam Optimizer. Defaults to 1e-6.

  • discriminator_lr (float) – Learning rate for the discriminator. Defaults to 2e-4.

  • discriminator_decay (float) – Discriminator weight decay for the Adam Optimizer. Defaults to 1e-6.

  • batch_size (int) – Number of data samples to process in each step. Defaults to 500.

  • discriminator_steps (int) – Number of discriminator updates per generator update. The WGAN paper (https://arxiv.org/abs/1701.07875) uses 5; the default here is 1, matching the original CTGAN implementation.

  • log_frequency (boolean) – Whether to use log frequency of categorical levels in conditional sampling. Defaults to True.

  • epochs (int) – Number of training epochs. Defaults to 300.

  • pac (int) – Number of samples to group together when applying the discriminator. Defaults to 10.

  • device (str) – Device to run the training on. Prefer ‘cuda’ when a GPU is available. Defaults to ‘cpu’.
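
To illustrate what log_frequency changes, the sketch below compares raw and log-scaled category probabilities for one discrete column. It assumes the upstream CTGAN convention of weighting each category by log(count + 1); the helper name `category_probs` is hypothetical and not part of sdgx.

```python
import math
from collections import Counter

def category_probs(values, log_frequency=True):
    """Return a sampling probability for each category in `values`."""
    counts = Counter(values)
    # Assumption: log-scaling uses log(count + 1), as in upstream CTGAN.
    # Rare categories then receive a larger share than their raw frequency.
    weights = {
        key: (math.log(count + 1) if log_frequency else count)
        for key, count in counts.items()
    }
    total = sum(weights.values())
    return {key: weight / total for key, weight in weights.items()}

column = ["a"] * 90 + ["b"] * 10
raw = category_probs(column, log_frequency=False)
logged = category_probs(column, log_frequency=True)
print(raw)
print(logged)
```

With log scaling, the rare category "b" is drawn more often than its 10% raw share, which is the intent of the flag during conditional sampling.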

MODEL_SAVE_NAME = 'ctgan.pkl'

_apply_activate(data)

Apply proper activation function to the output of the generator.

_cond_loss(data, c, m)

Compute the cross entropy loss on the fixed discrete column.
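
A minimal pure-Python sketch of a masked cross-entropy of this kind follows. It assumes, as in upstream CTGAN, that `c` is the condition vector and `m` is a mask marking which discrete column was conditioned on. The function names and the nested-list layout are hypothetical; the real method works on tensors.

```python
import math

def cross_entropy(logits, target_index):
    """Cross entropy of one logit vector against a class index."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target_index]

def cond_loss(batch_logits, batch_targets, batch_mask):
    """Masked cross entropy, summed over columns and averaged over rows.

    batch_logits[r][j]  : generator logits for discrete column j in row r
    batch_targets[r][j] : category index requested by the condition vector
    batch_mask[r][j]    : 1 if column j was the conditioned column, else 0
    """
    total = 0.0
    for logits_row, target_row, mask_row in zip(batch_logits, batch_targets, batch_mask):
        for logits, target, mask in zip(logits_row, target_row, mask_row):
            total += mask * cross_entropy(logits, target)
    return total / len(batch_logits)

# One row, two discrete columns; only the first column was conditioned on.
loss = cond_loss(
    batch_logits=[[[0.0, 0.0], [5.0, 0.0, 0.0]]],
    batch_targets=[[0, 0]],
    batch_mask=[[1, 0]],
)
print(loss)
```

Only the conditioned column contributes, so the generator is penalized when it fails to reproduce the requested category.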

_filter_discrete_columns(train_data: List[str], discrete_columns: List[str])

Filter out PII columns. For now, PII columns can only be discrete. PII values are produced by the PII generator rather than synthesized by the model, so they are excluded here.

This method also decides when model fitting must stop: when the original data consists entirely of discrete columns and all of those columns are PII.

For train_data, there are three possibilities for the column types:
  • train_data = valid_discrete + valid_continue

  • train_data = valid_continue

  • train_data = valid_discrete

For discrete_columns, discrete_columns = invalid_discrete (PII) + valid_discrete.

Thus, valid_discrete = discrete_columns - invalid_discrete

= Set.intersection(train_data, discrete_columns)

The original data is all PII (original_data_is_all_PII) when discrete_columns is not empty and train_data is empty.
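
The column bookkeeping described above can be sketched with plain lists of column names. This is an illustration only: `filter_discrete_columns` is a hypothetical stand-alone helper, and the use of `ValueError` is an assumption, since the exception the real method raises is not stated here.

```python
def filter_discrete_columns(train_columns, discrete_columns):
    """Return the discrete columns that are still present in the training data."""
    if discrete_columns and not train_columns:
        # Every original column was a discrete PII column: nothing to fit.
        raise ValueError("All columns are PII; nothing left to train on.")
    present = set(train_columns)
    # Keep the caller's ordering while dropping filtered-out (PII) columns.
    return [col for col in discrete_columns if col in present]

# "email" is PII and has already been removed from the training columns.
valid = filter_discrete_columns(["age", "city"], ["city", "email"])
print(valid)  # ['city']
```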

_fit(*args, **kwargs)

static _gumbel_softmax(logits, tau=1, hard=False, eps=1e-10, dim=-1)

Deals with the instability of the gumbel_softmax for older versions of torch.

For more details about the issue: https://drive.google.com/file/d/1AA5wPfZ1kquaRtVruCd6BiYZGcDeNxyP/view?usp=sharing

Parameters:
  • logits ([..., num_features]) – Unnormalized log probabilities

  • tau – Non-negative scalar temperature

  • hard (bool) – If True, the returned samples will be discretized as one-hot vectors, but will be differentiated as if it is the soft sample in autograd

  • dim (int) – A dimension along which softmax will be computed. Default: -1.

Returns:

Sampled tensor of same shape as logits from the Gumbel-Softmax distribution.
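
For intuition, here is a pure-Python sketch of Gumbel-Softmax sampling for a single logit vector. It is not the sdgx or torch implementation: the function name and the `rng` argument are hypothetical, and the straight-through gradient behaviour of hard=True cannot be shown without autograd.

```python
import math
import random

def gumbel_softmax(logits, tau=1.0, hard=False, rng=random):
    """Draw one Gumbel-Softmax sample for a single logit vector."""
    noisy = []
    for x in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1 - 1e-12)  # keep U away from 0 and 1
        noisy.append((x - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    if not hard:
        return soft
    # hard=True: return a one-hot vector at the arg-max position.
    winner = soft.index(max(soft))
    return [1.0 if i == winner else 0.0 for i in range(len(soft))]

random.seed(0)
soft_sample = gumbel_softmax([1.0, 2.0, 3.0], tau=0.5)
one_hot = gumbel_softmax([1.0, 2.0, 3.0], tau=0.5, hard=True)
```

Lower values of tau push the soft sample closer to one-hot, while larger values flatten it toward uniform.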

_pre_fit(dataloader: DataLoader, discrete_columns: list[str] | None = None, metadata: Metadata | None = None)

_sample(*args, **kwargs)

_validate_discrete_columns(train_data, discrete_columns)

Check whether all discrete_columns exist in train_data.

Parameters:
  • train_data (numpy.ndarray or pandas.DataFrame or list) – Training Data. It must be a 2-dimensional numpy array or a pandas.DataFrame.

  • discrete_columns (list-like) – List of discrete columns to be used to generate the Conditional Vector. If train_data is a Numpy array, this list should contain the integer indices of the columns. Otherwise, if it is a pandas.DataFrame, this list should contain the column names.
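
The two cases above (column names for a DataFrame, integer indices for an array) can be sketched without pandas or NumPy. The helper `find_invalid_columns` and its arguments are hypothetical stand-ins: `column_names` plays the role of DataFrame columns and `n_columns` the role of an array's column count.

```python
def find_invalid_columns(discrete_columns, column_names=None, n_columns=None):
    """Return the entries of discrete_columns that do not exist in the data."""
    if column_names is not None:
        # DataFrame-like case: discrete columns are given by name.
        known = set(column_names)
        return [col for col in discrete_columns if col not in known]
    if n_columns is not None:
        # Array-like case: discrete columns are given by integer index.
        return [col for col in discrete_columns if not 0 <= col < n_columns]
    raise TypeError("provide either column_names or n_columns")

by_name = find_invalid_columns(["city", "zip"], column_names=["age", "city"])
by_index = find_invalid_columns([0, 3, -1], n_columns=3)
print(by_name)   # ['zip']
print(by_index)  # [3, -1]
```

A validator built on this would raise an error whenever the returned list is non-empty.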

fit(metadata: Metadata, dataloader: DataLoader, epochs=None, *args, **kwargs)

Fit the model using the given metadata and dataloader.

Parameters:
  • metadata (Metadata) – The metadata to use.

  • dataloader (DataLoader) – The dataloader to use.

classmethod load(save_dir: str | Path, device: str | None = None) → CTGANSynthesizerModel

Load model from file.

Parameters:

save_dir (str | Path) – The directory to load the model from.

sample(count: int, *args, **kwargs) → DataFrame

Sample data from the model.

Parameters:

count (int) – The number of samples to generate.

Returns:

The generated data.

Return type:

pd.DataFrame

save(save_dir: str | Path)

Dump model to file.

Parameters:

save_dir (str | Path) – The directory to save the model.

set_device(device)

Set the device to be used (‘GPU’ or ‘CPU’).

class sdgx.models.ml.single_table.ctgan.Discriminator(input_dim, discriminator_dim, pac=10)

Bases: Module

Discriminator for the CTGAN.

calc_gradient_penalty(real_data, fake_data, device='cpu', pac=10, lambda_=10)

Compute the gradient penalty.
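
For reference, the penalty is presumably the standard WGAN-GP term. This is an assumption based on the lambda_ argument and on upstream CTGAN, not something this page confirms. With real_data written as x, fake_data as x̃, and lambda_ as λ, it reads:

```latex
\hat{x} = \alpha x + (1 - \alpha)\,\tilde{x}, \qquad \alpha \sim U(0, 1)

\mathrm{GP} = \lambda \, \mathbb{E}_{\hat{x}}
  \left[ \left( \lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1 \right)^2 \right]
```

Here D is the discriminator and the norm is taken over the gradient with respect to the interpolated input. In upstream CTGAN the gradient is grouped into blocks of pac samples before the norm is computed.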

forward(input_)

Apply the Discriminator to the input_.
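
The effect of pac on the discriminator input can be sketched as a row-packing step. The sketch assumes the upstream CTGAN behaviour of reshaping a (batch, dim) input to (batch / pac, dim * pac); `pack_rows` is a hypothetical helper operating on nested lists rather than tensors.

```python
def pack_rows(rows, pac):
    """Concatenate every `pac` consecutive rows into one longer row."""
    if len(rows) % pac != 0:
        raise ValueError("number of rows must be divisible by pac")
    packed = []
    for start in range(0, len(rows), pac):
        group = rows[start:start + pac]
        # Flatten the group so the discriminator judges `pac` samples jointly.
        packed.append([value for row in group for value in row])
    return packed

batch = [[1, 2], [3, 4], [5, 6], [7, 8]]
packed = pack_rows(batch, pac=2)
print(packed)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

This is why the number of rows passed to the discriminator has to be a multiple of pac.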

class sdgx.models.ml.single_table.ctgan.Generator(embedding_dim, generator_dim, data_dim)

Bases: Module

Generator for the CTGAN.

forward(input_)

Apply the Generator to the input_.

class sdgx.models.ml.single_table.ctgan.Residual(i, o)

Bases: Module

Residual layer for the CTGAN.

forward(input_)

Apply the Residual layer to the input_.
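
To see how Residual layers shape the Generator, the sketch below computes the input and output width of each linear layer. It assumes the upstream CTGAN design, in which a Residual(i, o) layer concatenates its o-wide output with its i-wide input. The function name and data_dim=40 are hypothetical.

```python
def generator_layer_shapes(embedding_dim, generator_dim, data_dim):
    """List (in_features, out_features) for each linear layer in the stack."""
    shapes = []
    dim = embedding_dim
    for width in generator_dim:
        shapes.append((dim, width))
        # Assumption: the residual output is concatenated with its input,
        # so the next layer sees dim + width features.
        dim += width
    # Final projection from the accumulated width to the data dimension.
    shapes.append((dim, data_dim))
    return shapes

shapes = generator_layer_shapes(128, (256, 256), data_dim=40)
print(shapes)  # [(128, 256), (384, 256), (640, 40)]
```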

sdgx.models.ml.single_table.ctgan.register(manager)