SDG: Synthetic Data Generator
The Synthetic Data Generator (SDG) is a specialized framework designed to generate high-quality structured tabular data. It incorporates a wide range of single-table and multi-table data synthesis algorithms, as well as LLM-based synthetic data generation models.
Synthetic data is generated by machines from real data, metadata, and algorithms. It does not contain any sensitive information, yet it retains the essential characteristics of the original data. There is no one-to-one correspondence between synthetic records and real records, making it exempt from privacy regulations such as GDPR and ADPPA. This eliminates the risk of privacy breaches in practical applications.
High-quality synthetic data can be safely utilized across various domains including data sharing, model training and debugging, system development and testing, etc.
Our code, issues, and pull requests are all hosted on GitHub. Feel free to contact us if you have any questions.
Installation
You can install our Python package with pip:
pip install sdgx
Or use the pre-built Docker image to quickly try the latest features:
docker pull idsteam/sdgx:latest
To use a GPU for synthesis, you may need to follow PyTorch's GPU installation guide.
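Before training, it can help to confirm whether PyTorch can see a GPU. The snippet below is a minimal sketch, not part of SDG itself: `gpu_status` is a hypothetical helper name, and the check relies only on PyTorch's `torch.cuda.is_available()` and `torch.cuda.get_device_name()`. It reports gracefully if PyTorch is not installed.

```python
import importlib.util


def gpu_status() -> str:
    """Return a short, human-readable description of GPU availability.

    Hypothetical helper for illustration; it is not part of the sdgx package.
    """
    # Avoid a hard dependency: only import torch if it is installed.
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed"
    import torch

    if torch.cuda.is_available():
        # get_device_name(0) reports the name of the first visible CUDA device.
        return f"CUDA GPU available: {torch.cuda.get_device_name(0)}"
    return "PyTorch is installed, but no CUDA GPU is visible"


print(gpu_status())
```

If no CUDA GPU is reported, the installed PyTorch build may be CPU-only, which is the case the installation guide addresses.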
Quick demo
"""
Example for CTGAN
"""
from sdgx.data_connectors.csv_connector import CsvConnector
from sdgx.models.ml.single_table.ctgan import CTGANSynthesizerModel
from sdgx.synthesizer import Synthesizer
from sdgx.utils import download_demo_data
# This will download demo data to ./dataset
dataset_csv = download_demo_data()
# Create data connector for csv file
data_connector = CsvConnector(path=dataset_csv)
# Initialize synthesizer, use CTGAN model
synthesizer = Synthesizer(
    model=CTGANSynthesizerModel(epochs=1),  # For quick demo
    data_connector=data_connector,
)
# Fit the model
synthesizer.fit()
# Sample
sampled_data = synthesizer.sample(1000)
print(sampled_data)
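One quick way to check that sampled data keeps the broad characteristics of the original is to compare simple statistics column by column. The following is a minimal, standard-library sketch under stated assumptions: the helper names `numeric_column_means` and `compare_means` are hypothetical, the toy rows are illustrative values rather than SDG output, and each table is represented as a list of row dictionaries. If `sampled_data` is a pandas DataFrame, `sampled_data.to_dict("records")` would produce rows in that shape.

```python
import statistics


def numeric_column_means(rows):
    """Return {column: mean} for columns whose values are all numeric."""
    means = {}
    if not rows:
        return means
    for column in rows[0]:
        values = [row[column] for row in rows]
        # Skip non-numeric columns; bool is excluded because it subclasses int.
        if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
            means[column] = statistics.fmean(values)
    return means


def compare_means(real_rows, synthetic_rows):
    """Return {column: (real_mean, synthetic_mean)} for shared numeric columns."""
    real = numeric_column_means(real_rows)
    synthetic = numeric_column_means(synthetic_rows)
    return {c: (real[c], synthetic[c]) for c in real if c in synthetic}


# Toy rows standing in for real and sampled data (illustrative values only).
real_rows = [{"age": 30, "job": "clerk"}, {"age": 50, "job": "nurse"}]
synthetic_rows = [{"age": 35, "job": "nurse"}, {"age": 45, "job": "clerk"}]
print(compare_means(real_rows, synthetic_rows))
```

Matching means are only a coarse signal. A model trained for a single epoch, as in the quick demo, should not be expected to match the original distributions closely.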
We provide user guides with many examples for researchers, scientists, and developers. Learn more if you are interested!