Synthetic Data Generation with AI: A Hands‑On Tutorial

A practical, end-to-end tutorial for generating, evaluating, and governing synthetic data for ML using Python, SDV, and sdmetrics.

ASOasis

Overview

Synthetic data is no longer a curiosity—it’s a practical lever to accelerate model development, reduce privacy risk, and overcome data scarcity. In this tutorial, you’ll learn how to generate, evaluate, and govern synthetic data for analytics and machine learning. We’ll start with a tabular, business-style example (think subscriptions/transactions), then cover evaluation (utility and privacy), and finish with guidance for time series, images, and text.

What you’ll build:

  • A reproducible pipeline that creates realistic synthetic tabular data with a GAN-based model.
  • An evaluation harness (statistical similarity + Train-on-Synthetic, Test-on-Real).
  • Lightweight privacy checks and governance best practices.

Tools used (Python): pandas, Faker, SDV (CTGAN), sdmetrics, scikit-learn, seaborn/matplotlib.

When to use synthetic data

  • Data is limited, imbalanced, or slow to collect (e.g., rare fraud cases).
  • Privacy or compliance blocks access to sensitive records.
  • You need safe data to prototype schemas, pipelines, dashboards, or train baseline models.
  • You must share data across teams/partners without exposing individuals.

Caveat: Synthetic data is not a silver bullet. It should preserve statistical structure and task performance while reducing re-identification risk. You’ll validate both utility and privacy below.

Installation and setup

python -m venv .venv && source .venv/bin/activate  # or .venv\Scripts\activate on Windows
pip install pandas numpy faker sdv sdmetrics scikit-learn seaborn matplotlib

Set a global seed for reproducibility:

import random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

# SDV's CTGAN runs on PyTorch; seed it too when it is importable
try:
    import torch
    torch.manual_seed(SEED)
except ImportError:
    pass

Step 1: Create a small “real-like” seed dataset

In practice, you’d start from real data that you’re allowed to process (e.g., in a secure environment). For teaching purposes, we’ll fabricate a small seed using Faker and hand-crafted relationships. This seed will drive the generator to learn plausible patterns.

import pandas as pd
from faker import Faker
import numpy as np

fake = Faker()
Faker.seed(SEED)

N = 5000
countries = ["US", "CA", "DE", "FR", "IN", "BR", "AU", "JP"]
devices = ["mobile", "desktop", "tablet"]

rows = []
for i in range(N):
    country = np.random.choice(countries, p=[0.35,0.08,0.08,0.07,0.18,0.1,0.07,0.07])
    device = np.random.choice(devices, p=[0.65,0.3,0.05])
    age = int(np.clip(np.random.normal(34, 9), 18, 80))
    income = max(18000, np.random.lognormal(mean=10.5, sigma=0.4))  # skewed
    sessions_30d = np.random.poisson(lam=12 if device=="mobile" else 8)
    avg_session_min = np.clip(np.random.normal(6 if device=="mobile" else 9, 2), 1, 60)
    # Latent propensity: older + higher income + more sessions => more likely subscriber
    logit = -2.0 + 0.015*(age-30) + 0.00003*(income-30000) + 0.04*(sessions_30d) + (0.3 if country in ["US","DE","JP"] else 0)
    is_subscriber = int(np.random.rand() < 1/(1+np.exp(-logit)))
    rows.append({
        "user_id": f"U{i:06d}",
        "signup_date": fake.date_between(start_date='-3y', end_date='today'),
        "country": country,
        "device": device,
        "age": age,
        "annual_income": round(income, 2),
        "sessions_last_30d": sessions_30d,
        "avg_session_min": round(avg_session_min,2),
        "is_subscriber": is_subscriber,
    })

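df = pd.DataFrame(rows)
# Faker returns datetime.date objects; convert them to a proper datetime64 column
df["signup_date"] = pd.to_datetime(df["signup_date"])
df.head()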

Pro tip: If you’re allowed to use a small real sample (e.g., 1–5%) inside a secure environment, the generator often learns richer structure than from fully fabricated seeds.
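To make the latent propensity in the seed generator concrete, here is a small worked example that reuses the coefficients from the loop above. The helper name subscribe_probability and the example user are illustrative assumptions, not part of the pipeline.

```python
import math

def subscribe_probability(age, income, sessions_30d, country):
    """Subscription probability implied by the seed generator's logit."""
    logit = (
        -2.0
        + 0.015 * (age - 30)
        + 0.00003 * (income - 30000)
        + 0.04 * sessions_30d
        + (0.3 if country in ["US", "DE", "JP"] else 0.0)
    )
    # Logistic (sigmoid) link turns the logit into a probability
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical user: 40 years old, 50,000 income, 12 sessions, based in the US
p_us = subscribe_probability(40, 50_000, 12, "US")
# Same user in a country without the +0.3 bump
p_in = subscribe_probability(40, 50_000, 12, "IN")
print(round(p_us, 3), round(p_in, 3))
```

For the US user the logit is -2.0 + 0.15 + 0.6 + 0.48 + 0.3 = -0.47, so the probability sits a little below 40%. A good synthesizer should reproduce that kind of country effect.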

Step 2: Model selection in a nutshell

Common choices for tabular data:

  • Copulas/GaussianCopula: fast, interpretable marginals + correlations; good baseline.
  • CTGAN/TVAE (SDV): neural models that capture complex non-linear relationships.
  • Rule/Agent-based simulators: embed domain logic and constraints explicitly.

We’ll use CTGAN from SDV’s single-table API because it handles mixed data types and non-linearities well.
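For intuition about the copula baseline, the idea of "marginals + correlations" can be sketched in plain NumPy. This is a simplified illustration under stated assumptions (numeric columns only, empirical marginals), not SDV's GaussianCopulaSynthesizer; the function name fit_sample_gaussian_copula is hypothetical.

```python
import numpy as np
from statistics import NormalDist

_norm = NormalDist()
_ppf = np.vectorize(_norm.inv_cdf)  # standard normal quantile function
_cdf = np.vectorize(_norm.cdf)      # standard normal CDF

def fit_sample_gaussian_copula(data, n_samples, seed=0):
    """Fit a Gaussian copula with empirical marginals and sample from it."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n, d = data.shape
    # 1) Ranks -> uniforms in (0, 1) -> normal scores
    ranks = data.argsort(axis=0).argsort(axis=0) + 1
    z = _ppf(ranks / (n + 1))
    # 2) Dependence structure: correlation of the normal scores
    corr = np.corrcoef(z, rowvar=False)
    # 3) Sample correlated normals and map them back to uniforms
    z_new = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)
    u_new = _cdf(z_new)
    # 4) Invert each marginal through its empirical quantiles
    cols = [np.quantile(data[:, j], u_new[:, j]) for j in range(d)]
    return np.column_stack(cols)

# Toy "real" data: a skewed column and a second column that depends on it
rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=0.5, size=2000)
y = 3 * x + rng.normal(0, 0.5, size=2000)
real = np.column_stack([x, y])

synth = fit_sample_gaussian_copula(real, n_samples=1000)
print("real corr:", round(float(np.corrcoef(real, rowvar=False)[0, 1]), 3))
print("synthetic corr:", round(float(np.corrcoef(synth, rowvar=False)[0, 1]), 3))
```

The sketch keeps each column's shape and the pairwise correlation, which is why copulas make a fast baseline. It cannot capture relationships that a single correlation matrix does not describe, and that is where CTGAN and TVAE tend to help.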

Step 3: Define metadata and train a synthesizer

from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

# Infer metadata from the dataframe
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(df)

# Optionally, tighten semantics and constraints
metadata.update_column("user_id", sdtype="id")
metadata.set_primary_key("user_id")
metadata.update_column("signup_date", sdtype="datetime")
metadata.update_column("country", sdtype="categorical")
metadata.update_column("device", sdtype="categorical")

# Train CTGAN
synth = CTGANSynthesizer(
    metadata,
    epochs=300,
    batch_size=500,  # must be even and divisible by CTGAN's pac setting (default 10)
    verbose=True,
    enforce_min_max_values=True,
    enforce_rounding=True,
)

synth.fit(df)

# Sample new records
synthetic = synth.sample(num_rows=50_000)
synthetic.head()

Notes:

  • enforce_min_max_values and enforce_rounding help avoid out-of-range/decimal drift.
  • Increase epochs if your quality metrics plateau low; watch for overfitting.

Step 4: Quick look and sanity checks

import seaborn as sns, matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(12,8))  # one panel per numeric column
cols = ["age","annual_income","sessions_last_30d","avg_session_min"]
for i, c in enumerate(cols):
    ax = axes[i//2, i%2]
    sns.kdeplot(df[c], fill=True, label='real', ax=ax)
    sns.kdeplot(synthetic[c], fill=True, label='synthetic', ax=ax)
    ax.set_title(c)
    ax.legend()
plt.tight_layout(); plt.show()

print("Real subscriber rate:", df.is_subscriber.mean())
print("Synthetic subscriber rate:", synthetic.is_subscriber.mean())
print("Top countries (real):\n", df.country.value_counts(normalize=True).head())
print("Top countries (synthetic):\n", synthetic.country.value_counts(normalize=True).head())

You want close—but not identical—distributions, plus preserved relationships (e.g., higher sessions correlating with subscription).
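One way to check preserved relationships rather than only marginals is to compare correlation matrices. The sketch below uses toy stand-ins for the real and synthetic frames; correlation_gap is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def correlation_gap(real, synth, cols):
    """Absolute difference between real and synthetic Pearson correlations."""
    return (real[cols].corr() - synth[cols].corr()).abs()

# Toy "real" data where more sessions raise the subscription probability
rng = np.random.default_rng(0)
sessions = rng.poisson(10, size=2000)
real_toy = pd.DataFrame({
    "sessions_last_30d": sessions,
    "is_subscriber": (rng.random(2000) < 0.1 + 0.03 * sessions).astype(int),
})

# A deliberately bad synthetic set: same marginals, relationship shuffled away
bad_synth = real_toy.copy()
bad_synth["is_subscriber"] = rng.permutation(bad_synth["is_subscriber"].to_numpy())

cols = ["sessions_last_30d", "is_subscriber"]
gap = correlation_gap(real_toy, bad_synth, cols)
print(gap.round(3))
```

The shuffled set would pass every per-column plot above, yet the off-diagonal gap exposes the lost relationship. With the tutorial's data, calling correlation_gap(df, synthetic, cols) on the numeric columns gives the same view.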

Step 5: Formal evaluation with sdmetrics

We’ll compute a quality report (column shapes and column-pair trends) and a diagnostic report (basic validity and structure checks; the exact property names depend on your sdmetrics version).

from sdmetrics.reports.single_table import QualityReport, DiagnosticReport

# sdmetrics expects the metadata as a plain dict
meta_dict = metadata.to_dict()

qr = QualityReport()
qr.generate(real_data=df, synthetic_data=synthetic, metadata=meta_dict)
print("Quality score:", round(qr.get_score(), 3))

# Drill into properties (useful for prioritizing fixes)
print(qr.get_properties())  # one row per property with its score
print(qr.get_details(property_name="Column Shapes").head())

# Diagnostics
dr = DiagnosticReport()
dr.generate(real_data=df, synthetic_data=synthetic, metadata=meta_dict)
print("Diagnostic summary:")
print(dr.get_properties())

Interpretation tips:

  • Low univariate scores: tune epochs, batch size, or try GaussianCopulaSynthesizer as a baseline.
  • Poor correlation structure: CTGAN/TVAE often beat copulas here; consider more training.
  • Coverage issues: add constraints or condition on underrepresented categories when sampling.
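When a univariate score is low, it helps to know what it measures. sdmetrics documents its column-shape metrics as KSComplement (numeric columns) and TVComplement (categorical columns); the functions below are simplified re-implementations for intuition only, not the library's code.

```python
import numpy as np

def ks_complement(real, synth):
    """1 - KS statistic: the largest gap between the two empirical CDFs."""
    real, synth = np.sort(real), np.sort(synth)
    grid = np.concatenate([real, synth])
    cdf_r = np.searchsorted(real, grid, side="right") / len(real)
    cdf_s = np.searchsorted(synth, grid, side="right") / len(synth)
    return 1.0 - float(np.max(np.abs(cdf_r - cdf_s)))

def tv_complement(real, synth):
    """1 - total variation distance between category frequencies."""
    real, synth = list(real), list(synth)
    cats = set(real) | set(synth)
    tvd = 0.5 * sum(
        abs(real.count(c) / len(real) - synth.count(c) / len(synth)) for c in cats
    )
    return 1.0 - tvd

# Identical samples score 1.0, non-overlapping samples score 0.0
print(ks_complement([0, 1, 2], [0, 1, 2]), ks_complement([0, 1, 2], [10, 11, 12]))
# A device mix that drifted from 50/50 to 25/75
print(tv_complement(["mobile", "mobile", "desktop", "desktop"],
                    ["mobile", "desktop", "desktop", "desktop"]))
```

Both scores live in [0, 1], so a column scoring 0.75 has a quarter of its probability mass in the wrong place. That is usually a more actionable reading than the overall average.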

Step 6: Utility via Train-on-Synthetic, Test-on-Real (TSTR)

If your model trained on synthetic can generalize to real holdout data, the synthetic data captured task-relevant structure.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score

TARGET = "is_subscriber"
X_real = df.drop(columns=[TARGET])
y_real = df[TARGET]

# Train/test split on REAL data (test is always real)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_real, y_real, test_size=0.3, random_state=SEED, stratify=y_real)

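# Prepare a model pipeline
# user_id and signup_date are identifiers/timestamps, not predictive features, so exclude them
feature_frame = Xr_train.drop(columns=["user_id", "signup_date"])
num_cols = feature_frame.select_dtypes(include="number").columns.tolist()
cat_cols = feature_frame.select_dtypes(include=["object","category"]).columns.tolist()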

pre = ColumnTransformer([
    ("num", "passthrough", num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

clf = Pipeline([
    ("pre", pre),
    ("lr", LogisticRegression(max_iter=1000))
])

# Train on SYNTHETIC
Xs_train = synthetic.drop(columns=[TARGET])
ys_train = synthetic[TARGET]
clf.fit(Xs_train, ys_train)

# Evaluate on REAL holdout
proba = clf.predict_proba(Xr_test)[:,1]
print("TSTR ROC-AUC:", round(roc_auc_score(yr_test, proba), 3))
print("TSTR PR-AUC:", round(average_precision_score(yr_test, proba), 3))

Benchmark: Compare against a model trained on the same-size real subset (TRTR). The gap (TRTR - TSTR) should be small for the synthetic data to be “good enough” for your use case.
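The TRTR comparison can be sketched as a small helper that trains the same model twice on equal-size samples. The toy data comes from make_classification and the "synthetic" set is just a noisy copy of the real training rows, both assumptions for illustration; tstr_gap is a hypothetical name.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def _take(data, idx):
    # Works for both NumPy arrays and pandas objects
    return data.iloc[idx] if hasattr(data, "iloc") else data[idx]

def tstr_gap(model, X_syn, y_syn, X_real, y_real, X_test, y_test, seed=42):
    """ROC-AUC for TRTR and TSTR on equal-size training sets, plus their gap."""
    n = min(len(X_syn), len(X_real))
    rng = np.random.default_rng(seed)
    syn_idx = rng.choice(len(X_syn), size=n, replace=False)
    real_idx = rng.choice(len(X_real), size=n, replace=False)
    scores = {}
    runs = [("TRTR", X_real, y_real, real_idx), ("TSTR", X_syn, y_syn, syn_idx)]
    for name, X, y, idx in runs:
        fitted = clone(model).fit(_take(X, idx), _take(y, idx))
        scores[name] = roc_auc_score(y_test, fitted.predict_proba(X_test)[:, 1])
    scores["gap"] = scores["TRTR"] - scores["TSTR"]
    return scores

# Toy stand-ins for real train/test data and a synthetic training set
X, y = make_classification(n_samples=2000, n_features=6, n_informative=4,
                           n_clusters_per_class=1, class_sep=1.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
noise = np.random.default_rng(0).normal(scale=0.3, size=X_train.shape)
X_fake, y_fake = X_train + noise, y_train

scores = tstr_gap(LogisticRegression(max_iter=1000),
                  X_fake, y_fake, X_train, y_train, X_test, y_test)
print({k: round(v, 3) for k, v in scores.items()})
```

With the tutorial's objects the call would be tstr_gap(clf, Xs_train, ys_train, Xr_train, yr_train, Xr_test, yr_test). What counts as a small gap is a project decision; agree on it before looking at the numbers.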

Step 7: Lightweight privacy checks

Synthetic data reduces but does not eliminate privacy risk. Add guardrails:

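  1. Train a real-vs-synthetic discriminator. If it easily separates the two (AUC well above 0.5), the synthetic data carries artifacts that set it apart from real data. An AUC near 0.5 indicates realism but does not rule out memorized rows, so pair this check with the near-duplicate search that follows.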
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

real_labeled = Xr_train.copy(); real_labeled["label"] = 1
syn_labeled = Xs_train.sample(len(Xr_train), random_state=SEED).copy(); syn_labeled["label"] = 0
mix = pd.concat([real_labeled, syn_labeled], ignore_index=True)

y = mix.pop("label")
X = mix

pre2 = ColumnTransformer([
    ("num", "passthrough", [c for c in X.columns if c in num_cols]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), [c for c in X.columns if c in cat_cols]),
])

rf = Pipeline([("pre", pre2), ("rf", RandomForestClassifier(n_estimators=200, random_state=SEED))])
rf.fit(X, y)
auc = roc_auc_score(y, rf.predict_proba(X)[:,1])
print("Real-vs-Synthetic AUC (in-sample):", round(auc,3))

Ideally, use cross-validation and holdout to avoid optimistic bias. AUC near 0.5 suggests the synthetic records are not trivially distinguishable.
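A cross-validated version of the discriminator check might look like the sketch below. It works on already-numeric arrays with toy data; discriminator_auc_cv is a hypothetical helper name.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

def discriminator_auc_cv(real_X, synth_X, seed=42, folds=5):
    """Out-of-fold AUC of a classifier that tries to tell real from synthetic."""
    X = np.vstack([real_X, synth_X])
    y = np.concatenate([np.ones(len(real_X)), np.zeros(len(synth_X))])
    cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    rf = RandomForestClassifier(n_estimators=100, random_state=seed)
    # Each row is scored by a model that never saw it during training
    proba = cross_val_predict(rf, X, y, cv=cv, method="predict_proba")[:, 1]
    return roc_auc_score(y, proba)

rng = np.random.default_rng(0)
real_toy = rng.normal(size=(400, 3))
good_synth = rng.normal(size=(400, 3))           # same distribution as real
bad_synth = rng.normal(loc=1.5, size=(400, 3))   # visibly shifted

auc_good = discriminator_auc_cv(real_toy, good_synth)
auc_bad = discriminator_auc_cv(real_toy, bad_synth)
print(round(auc_good, 3), round(auc_bad, 3))
```

An in-sample random forest scores close to 1.0 on almost any input, which is why out-of-fold predictions are the ones worth reading. For the tutorial's mixed-type data, the same cross_val_predict call can take the rf pipeline together with X and y.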

  2. Near-duplicate search. Flag synthetic rows that are dangerously close to any real row in high-dimensional space (after encoding). You can compute k-NN distances and alert on values below a small epsilon.
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Encode both datasets with one preprocessor fitted on the real data.
# Numeric columns are standardized so a wide-range feature (income) does not dominate distances;
# identifier-like columns are left out because they would make every row look unique.
knn_cat = [c for c in cat_cols if c not in ("user_id", "signup_date")]
dist_pre = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), knn_cat),
])
Xr_enc = dist_pre.fit_transform(X_real)
Xs_enc = dist_pre.transform(Xs_train)

# Densify if the encoder returned sparse matrices
Xs_mat = Xs_enc.toarray() if hasattr(Xs_enc, 'toarray') else Xs_enc
Xr_mat = Xr_enc.toarray() if hasattr(Xr_enc, 'toarray') else Xr_enc

nn = NearestNeighbors(n_neighbors=1, metric="euclidean").fit(Xr_mat)
dists, _ = nn.kneighbors(Xs_mat)
print("Min/Median/95th NN distance:", float(dists.min()), float(np.median(dists)), float(np.quantile(dists, 0.95)))

Set policy thresholds (e.g., min distance > 1.0 in standardized space) to automatically reject batches.
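A batch-level policy check could be sketched as follows. The 0.05 threshold, the toy data, and the helper name nn_distance_report are illustrative assumptions, not recommended values.

```python
import numpy as np

def nn_distance_report(real, synth, min_allowed):
    """Standardize with real-data statistics, then find each synthetic row's nearest real row."""
    mu, sigma = real.mean(axis=0), real.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # guard against constant columns
    r = (real - mu) / sigma
    s = (synth - mu) / sigma
    # Brute-force pairwise distances; fine for small batches
    d = np.sqrt(((s[:, None, :] - r[None, :, :]) ** 2).sum(axis=2)).min(axis=1)
    too_close = d < min_allowed
    return {
        "min": float(d.min()),
        "n_too_close": int(too_close.sum()),
        "accept": not too_close.any(),
    }

rng = np.random.default_rng(0)
real_toy = rng.normal(size=(200, 4))
clean_batch = rng.normal(size=(100, 4))
leaky_batch = clean_batch.copy()
leaky_batch[0] = real_toy[0]  # simulate one memorized record

clean_report = nn_distance_report(real_toy, clean_batch, min_allowed=0.05)
leaky_report = nn_distance_report(real_toy, leaky_batch, min_allowed=0.05)
print(clean_report)
print(leaky_report)
```

Rejecting the whole batch on a single hit is the conservative choice. Dropping only the offending rows is an alternative, but it should be logged so the release notes stay honest.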

  3. Governance controls.
  • Remove direct identifiers and rare quasi-identifiers before training.
  • Apply category bucketing (e.g., top-N countries + “OTHER”).
  • Document synthesis settings, seeds, and intended uses; distribute a data card.
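Category bucketing is a one-liner in pandas. The helper name bucket_top_n and the tiny example series are assumptions for illustration.

```python
import pandas as pd

def bucket_top_n(series, n, other="OTHER"):
    """Keep the n most frequent categories and fold the rest into one bucket."""
    top = series.value_counts().nlargest(n).index
    return series.where(series.isin(top), other)

countries = pd.Series(["US"] * 5 + ["IN"] * 3 + ["DE"] * 2 + ["JP", "BR"])
bucketed = bucket_top_n(countries, n=2)
print(bucketed.value_counts().to_dict())
```

Apply the bucketing before fitting the synthesizer, so rare values never reach the model in the first place.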

Advanced: For formal guarantees, train with differential privacy (DP-SGD) or DP-enabled synthesizers. This trades some utility for provable privacy, controlled by the epsilon budget: lower epsilon means stronger privacy and usually lower utility. Start with a moderate epsilon (e.g., 2–8), measure TSTR, and tighten from there.
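To get a feel for the epsilon trade-off, the sketch below applies the Laplace mechanism to a single count. This is the simplest DP building block, shown only to illustrate how noise grows as epsilon shrinks; it is not DP-SGD, and the count of 1200 is a made-up value.

```python
import numpy as np

def laplace_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity / epsilon."""
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

rng = np.random.default_rng(42)
true_subscribers = 1200  # hypothetical count
mean_abs_error = {}
for eps in (0.5, 2.0, 8.0):
    errors = [abs(laplace_count(true_subscribers, eps, rng) - true_subscribers)
              for _ in range(5000)]
    mean_abs_error[eps] = float(np.mean(errors))
    print(f"epsilon={eps}: mean abs error ~ {mean_abs_error[eps]:.2f}")
```

The expected absolute error equals the noise scale, 1 / epsilon for a count, so halving epsilon doubles the noise. DP training of a synthesizer follows the same direction of trade-off, though the accounting is more involved.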

Step 8: Conditioning and targeted oversampling

You can guide the sampler to target underrepresented segments (e.g., rare countries or positive labels) to address class imbalance for downstream training.

Conceptually:

  • Fit on full data.
  • Sample more rows conditioned on desired columns.
  • Rebalance your training set.

Depending on your synthesizer’s API, you can pass a conditions DataFrame to sample additional rows where, for example, country == “IN” and is_subscriber == 1.
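The rebalancing step can be planned before any sampling happens. The sketch below computes how many rows to request per class; rebalance_conditions is a hypothetical helper, and the dictionaries it returns mirror the num_rows and column_values arguments that SDV 1.x documents for sdv.sampling.Condition (passed to sample_from_conditions). Check your installed version before relying on that mapping.

```python
import pandas as pd

def rebalance_conditions(df, target, extra_values=None):
    """Rows to request per class so every class matches the largest one."""
    counts = df[target].value_counts()
    goal = counts.max()
    conditions = []
    for value, count in counts.items():
        deficit = int(goal - count)
        if deficit > 0:
            # Convert NumPy scalars to plain Python values
            value = value.item() if hasattr(value, "item") else value
            column_values = {target: value, **(extra_values or {})}
            conditions.append({"num_rows": deficit, "column_values": column_values})
    return conditions

# Toy frame: 8 non-subscribers, 2 subscribers
toy = pd.DataFrame({"is_subscriber": [0] * 8 + [1] * 2})
plan = rebalance_conditions(toy, "is_subscriber", extra_values={"country": "IN"})
print(plan)
```

Keep the conditioned rows flagged in your training set. Oversampled segments change the base rate, so any probability a downstream model outputs needs recalibration against real data.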

Troubleshooting quality

  • Mode collapse (synthetic rows look too similar to one another): lower learning rate or increase noise/regularization; ensure category entropy is preserved.
  • Poor long-tail coverage: increase epochs, use conditional sampling, or switch to copulas for better marginal tails.
  • Leaked decimals/out-of-range values: enable rounding and min/max enforcement; add column constraints.
  • Label leakage: re-check feature derivations so labels aren’t trivial functions of features.
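Checking that category entropy is preserved takes only a few lines. The helper names and the toy columns below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def category_entropy(series):
    """Shannon entropy (in bits) of a categorical column."""
    p = series.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

def entropy_ratio(real, synth):
    """Synthetic entropy relative to real; values far below 1 hint at mode collapse."""
    return category_entropy(synth) / category_entropy(real)

real_col = pd.Series(["US", "CA", "DE", "FR"] * 25)           # evenly spread
collapsed = pd.Series(["US"] * 97 + ["CA", "DE", "FR"])      # almost one mode

print(round(category_entropy(real_col), 3))
print(round(entropy_ratio(real_col, collapsed), 3))
```

The ratio is undefined when the real column has a single value, so skip constant columns. A ratio well below 1 on several columns at once is a stronger signal than one low value.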

Extending beyond tabular

Time series

  • Preserve temporal dependencies, seasonality, and calendar effects.
  • Use sliding-window conditioning (context windows) and calendar features (dow, holiday flags).
  • Evaluate with forecasting metrics (MAE/MAPE) via TSTR: train a forecaster on synthetic series, test on real.
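Sliding-window conditioning and calendar features can be prepared with pandas before any generator is involved. The series below is a toy ramp, and make_windows is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def make_windows(series, context, horizon=1):
    """Turn a series into (context window, future value) training pairs."""
    values = series.to_numpy()
    X, y = [], []
    for start in range(len(values) - context - horizon + 1):
        X.append(values[start:start + context])
        y.append(values[start + context + horizon - 1])
    return np.array(X), np.array(y)

idx = pd.date_range("2024-01-01", periods=14, freq="D")
ts = pd.Series(np.arange(14, dtype=float), index=idx)

# Calendar features aligned with the series index
calendar = pd.DataFrame(
    {"dow": idx.dayofweek, "is_weekend": idx.dayofweek >= 5}, index=idx
)

X, y = make_windows(ts, context=7)
print(X.shape, y.shape)
print(calendar.head(3))
```

The same windows serve both sides of a TSTR run: fit a forecaster on windows cut from synthetic series, then score it on windows cut from real ones.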

Computer vision

  • Use diffusion models to create class-balanced augmentations; keep lighting/background diversity.
  • Validate with task metrics (mAP/IoU) on real validation sets and run nearest-neighbor checks in an embedding space (e.g., CLIP) to spot near-copies.
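The nearest-neighbor check in embedding space reduces to a cosine-similarity lookup once embeddings exist. In the sketch below the embeddings are random stand-ins (in practice they would come from an image encoder such as CLIP), and the 0.98 threshold is an assumed value.

```python
import numpy as np

def max_cosine_to_real(real_emb, synth_emb):
    """For each synthetic embedding, the highest cosine similarity to any real one."""
    r = real_emb / np.linalg.norm(real_emb, axis=1, keepdims=True)
    s = synth_emb / np.linalg.norm(synth_emb, axis=1, keepdims=True)
    return (s @ r.T).max(axis=1)

rng = np.random.default_rng(0)
real_emb = rng.normal(size=(50, 64))    # stand-in for real image embeddings
synth_emb = rng.normal(size=(10, 64))   # stand-in for generated image embeddings
# Simulate a near-copy: generated image 3 is almost identical to real image 7
synth_emb[3] = real_emb[7] + rng.normal(scale=0.01, size=64)

sims = max_cosine_to_real(real_emb, synth_emb)
flagged = np.where(sims > 0.98)[0]
print(flagged.tolist())
```

Flagged images deserve a visual review before deletion, because very high similarity can also come from legitimately repetitive content such as product shots on a plain background.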

NLP/text

  • For classification/NER augmentation, prompt a language model with schema and style constraints; include label-conditioned generation.
  • De-duplicate against the training corpus and public benchmarks; use lexical + embedding similarity thresholds.
  • Evaluate with TSTR and calibration metrics; watch for label drift.
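The lexical half of the de-duplication step can be sketched with word-trigram Jaccard similarity using only the standard library. The example sentences and the 0.8 threshold are illustrative assumptions; an embedding-based check would complement it.

```python
def shingles(text, n=3):
    """Set of lowercase word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity between the shingle sets of two texts."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

train_corpus = ["the delivery arrived two days late and the box was damaged"]
generated = [
    "The delivery arrived two days late and the box was damaged badly",
    "great battery life and a very sharp screen",
]

THRESHOLD = 0.8  # assumed cut-off; tune on your own data
kept = [
    g for g in generated
    if max(jaccard(g, t) for t in train_corpus) < THRESHOLD
]
print(kept)
```

The first generated sentence shares nine of ten trigrams with a training example and is dropped, while the second has no overlap and stays.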

Packaging, versioning, and MLOps integration

  • Config-as-code: store synthesizer type, hyperparameters, constraints, and seed in YAML.
  • Determinism: set seeds across numpy/torch; log library versions and GPU/CPU info.
  • Data versioning: tag each synthetic batch with a semantic version and a lineage pointer to the real data snapshot hash (kept in a secure registry, not shared externally).
  • CI checks: block releases if utility or privacy metrics fall below thresholds.
  • Obtain approvals for allowed purposes; ensure your license/DUA permits synthetic generation.
  • Avoid harmful bias amplification: evaluate metrics per subgroup (e.g., equal opportunity, demographic parity) using real test sets.
  • Be transparent: label datasets as synthetic; provide intended-use and limitations.
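A CI gate and a config fingerprint can be sketched with the standard library alone. Every metric name and threshold below is a hypothetical policy value, and release_gate and config_fingerprint are illustrative helper names.

```python
import hashlib
import json

THRESHOLDS = {  # hypothetical policy values
    "quality_score": ("min", 0.80),
    "tstr_auc_gap": ("max", 0.05),
    "discriminator_auc": ("max", 0.75),
    "min_nn_distance": ("min", 0.05),
}

def release_gate(metrics, thresholds=THRESHOLDS):
    """Return a list of human-readable failures; an empty list means release."""
    failures = []
    for name, (kind, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif kind == "min" and value < limit:
            failures.append(f"{name}: {value} is below {limit}")
        elif kind == "max" and value > limit:
            failures.append(f"{name}: {value} is above {limit}")
    return failures

def config_fingerprint(config):
    """Short, order-independent hash of the synthesis settings for lineage tags."""
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

config = {"synthesizer": "CTGAN", "epochs": 300, "seed": 42}
good = {"quality_score": 0.86, "tstr_auc_gap": 0.03,
        "discriminator_auc": 0.62, "min_nn_distance": 0.4}
bad = dict(good, discriminator_auc=0.93)

print(config_fingerprint(config))
print(release_gate(good))
print(release_gate(bad))
```

In a CI job, a non-empty failure list would translate into a non-zero exit code, and the fingerprint would be stored next to the batch's version tag.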

Summary checklist

  • Define target use (analytics, prototyping, model training) and success metrics (TSTR, similarity).
  • Remove direct identifiers; bucket rare categories.
  • Train a baseline (GaussianCopula), then CTGAN/TVAE if needed.
  • Evaluate with sdmetrics + TSTR; iterate.
  • Run privacy checks (discriminator AUC, k-NN distance) and set policy thresholds.
  • Document, version, and govern releases.

Next steps

  • Add constraints (monotonicity, ranges, inter-column rules) to improve realism.
  • Try alternative synthesizers for your data: TVAE, GaussianCopula, or domain simulators.
  • Pilot DP-enabled training if your threat model requires formal guarantees.

With these foundations, you can deliver high-utility, lower-risk datasets that unblock experimentation and accelerate model delivery—without waiting months for access to sensitive production tables.
