Article
Tags: synthetic-data, data-augmentation, privacy, tabular-data, python, sdv, machine-learning, data-privacy
Generate Synthetic Tabular Data with SDV
Use the Synthetic Data Vault (SDV) library to create artificial tabular data that mirrors a real dataset's statistical properties. This is useful for augmenting small datasets, reducing exposure of sensitive records, or balancing classes for model training.
beginner · 15 min · 5 steps
The play
- Install SDV: Install the Synthetic Data Vault (SDV) library with pip (`pip install sdv`). The package contains the tools needed to model and sample synthetic data for single tables, multiple tables, or time series.
- Load Data and Define Metadata: Load your data into a pandas DataFrame. SDV can automatically infer the metadata (data types, etc.), but for best results, you can define it manually. We'll use a built-in demo dataset for this example.
- Choose and Train a Synthesizer: A synthesizer is the model that learns the patterns in your data. SDV offers several models. We'll use the `GaussianCopulaSynthesizer` for this example, a classic and reliable choice for single-table data. The `.fit()` method trains the model on your real data.
- Generate Synthetic Data: Once the synthesizer is trained, use the `.sample()` method to generate new, synthetic data. You can specify how many rows of artificial data you need.
- Evaluate the Synthetic Data: Assess the quality of your synthetic data by comparing it to the original. SDV's evaluation functions provide a quality score, indicating how well the synthetic data captures the statistical properties and correlations of the real data. A score closer to 1.0 is better.
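To give a feel for what a Gaussian copula model does when it "learns the patterns" in step 3, here is a toy two-column version in plain Python. It is a simplified sketch, not SDV's implementation: it assumes rank-based (empirical) marginals, and the helper names (`to_normal_scores`, `pearson`, `sample_copula`) and the sample values are invented for illustration.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal used for the copula transform


def to_normal_scores(values):
    """Map each value to a standard-normal score via its rank (empirical CDF)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        # rank / (n + 1) keeps the probability strictly inside (0, 1)
        scores[idx] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def sample_copula(col_a, col_b, num_rows, seed=0):
    """Fit a two-column Gaussian copula and draw num_rows synthetic pairs."""
    rng = random.Random(seed)
    # "Training": estimate the correlation in normal-score space
    rho = pearson(to_normal_scores(col_a), to_normal_scores(col_b))
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)

    def from_quantile(sorted_vals, u):
        # Inverse empirical CDF: pick the observed value at quantile u
        idx = min(int(u * len(sorted_vals)), len(sorted_vals) - 1)
        return sorted_vals[idx]

    rows = []
    for _ in range(num_rows):
        z1 = rng.gauss(0, 1)
        # Correlate the second draw with the first (2x2 Cholesky factor)
        z2 = rho * z1 + max(0.0, 1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        rows.append((from_quantile(sorted_a, STD_NORMAL.cdf(z1)),
                     from_quantile(sorted_b, STD_NORMAL.cdf(z2))))
    return rows


# Hypothetical toy columns: study hours and exam scores
hours = [1, 2, 3, 4, 5, 6, 7, 8]
exam_scores = [52, 55, 61, 60, 70, 74, 79, 85]
synthetic = sample_copula(hours, exam_scores, num_rows=200)
print(synthetic[:5])
```

The sampled pairs are new combinations of observed values that keep the positive relationship between the two columns. A full synthesizer also has to handle categorical columns, missing values, and more than two columns, which this sketch leaves out.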
Starter code
import pandas as pd
from sdv.datasets.demo import download_demo
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.evaluation.single_table import evaluate_quality
# 1. Load a demo dataset together with its metadata
real_data, metadata = download_demo(
    modality='single_table',
    dataset_name='student_placements'
)
print("--- Original Data Sample ---")
print(real_data.head())
# 2. Create a synthesizer and train it on the real data
# The demo ships with its own metadata; for your own DataFrame, detect or define it first
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)
# 3. Generate a new table of synthetic data
print("\n--- Generating Synthetic Data ---")
synthetic_data = synthesizer.sample(num_rows=500)
print("\n--- Synthetic Data Sample ---")
print(synthetic_data.head())
# 4. Evaluate the quality of the synthetic data
quality_report = evaluate_quality(
real_data,
synthetic_data,
synthesizer.get_metadata()
)
print(f"\nData Quality Score: {quality_report.get_score()*100:.2f}%")
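# get_properties() returns a DataFrame with 'Property' and 'Score' columns
properties = quality_report.get_properties()
column_shapes_score = properties.loc[properties['Property'] == 'Column Shapes', 'Score'].iloc[0]
print(f"Column Shapes Score: {column_shapes_score*100:.2f}%")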
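To build intuition for what a 'Column Shapes' style score measures, the sketch below compares one real column with one synthetic column in plain Python. It assumes the score is built from a Kolmogorov-Smirnov complement for numeric columns and a total variation complement for categorical ones, which is how the SDMetrics documentation describes it; treat that as an assumption to verify. The function names `ks_complement` and `tv_complement` are illustrative, not SDV's API.

```python
from collections import Counter


def ks_complement(real, synthetic):
    """1 minus the largest gap between the two empirical CDFs (numeric columns)."""
    points = sorted(set(real) | set(synthetic))
    max_gap = 0.0
    for p in points:
        cdf_real = sum(v <= p for v in real) / len(real)
        cdf_syn = sum(v <= p for v in synthetic) / len(synthetic)
        max_gap = max(max_gap, abs(cdf_real - cdf_syn))
    return 1.0 - max_gap


def tv_complement(real, synthetic):
    """1 minus the total variation distance between category frequencies."""
    freq_real, freq_syn = Counter(real), Counter(synthetic)
    categories = set(freq_real) | set(freq_syn)
    distance = 0.5 * sum(
        abs(freq_real[c] / len(real) - freq_syn[c] / len(synthetic))
        for c in categories
    )
    return 1.0 - distance


# Identical columns score 1.0; completely disjoint columns score 0.0
same_numeric = ks_complement([1, 2, 3], [1, 2, 3])
disjoint_numeric = ks_complement([1, 2, 3], [10, 11, 12])
partial_categorical = tv_complement(['a', 'a', 'b', 'b'], ['a', 'b', 'b', 'b'])
print(same_numeric, disjoint_numeric, partial_categorical)
```

Both functions return values between 0 and 1, which matches the "closer to 1.0 is better" reading of the quality score.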