Danger

You are looking at the documentation for an older version of the SDV! We are no longer supporting or maintaining this version of the software. Click here to go to the new docs pages.
The TabularPreset is a tabular model that comes with pre-configured settings. This is meant for users who want to get started with using synthetic data and spend less time worrying about which model to choose or how to tune its parameters.
Note
We are currently beta testing our speed-optimized machine learning preset. Help us by testing the model and filing issues for any bugs or feature requests you may have.
The FAST_ML preset is our first preset. It uses machine learning (ML) to model your data while optimizing for the modeling time. This is a great choice if it’s your first time using the SDV for a large custom dataset or if you’re exploring the benefits of using ML to create synthetic data.
This preset optimizes for the modeling time while still applying machine learning to model and generate synthetic data.
Your synthetic data will capture correlations that exist between the columns of the original data.
Your synthetic data will adhere to the basic statistical properties of the original columns: min/max values, averages and standard deviations.
While other SDV models may create higher quality synthetic data, they will take longer. Using the FAST_ML preset allows you to get started with ML to create synthetic data right now.
To use this preset, you must have:
Your data, loaded as a pandas DataFrame, and
(Optional but strongly recommended) A metadata file that describes the columns of your dataset
For this guide, we’ll load the demo data and metadata from the SDV. This data contains information about students, including their grades, major and work experience.
```python
In [1]: from sdv.demo import load_tabular_demo

In [2]: metadata, data = load_tabular_demo('student_placements', metadata=True)

In [3]: data.head()
Out[3]:
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec  mba_perc   salary placed start_date   end_date  duration
0       17264      M        67.00      91.00  Commerce        58.00    Sci&Tech            False                 0                55.0   Mkt&HR     58.80  27000.0   True 2020-07-23 2020-10-12       3.0
1       17265      M        79.33      78.33   Science        77.48    Sci&Tech             True                 1                86.5  Mkt&Fin     66.28  20000.0   True 2020-01-11 2020-04-09       3.0
2       17266      M        65.00      68.00      Arts        64.00   Comm&Mgmt            False                 0                75.0  Mkt&Fin     57.80  25000.0   True 2020-01-26 2020-07-13       6.0
3       17267      M        56.00      52.00   Science        52.00    Sci&Tech            False                 0                66.0   Mkt&HR     59.43      NaN  False        NaT        NaT       NaN
4       17268      M        85.80      73.60  Commerce        73.30   Comm&Mgmt            False                 0                96.8  Mkt&Fin     55.50  42500.0   True 2020-07-04 2020-09-27       3.0
```
If you want to use your custom dataset, you can load it using pandas. For example, if your data is available as a CSV file, you can use the read_csv method.
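As a sketch of loading a custom dataset (this example reads from an in-memory buffer so it is self-contained; in practice you would pass a file path such as 'my_data.csv'):

```python
import io

import pandas as pd

# In practice you would pass a file path, e.g. pd.read_csv('my_data.csv').
# Here we read from an in-memory buffer so the snippet is self-contained.
csv_text = """student_id,gender,salary
17264,M,27000.0
17265,M,20000.0
"""
data = pd.read_csv(io.StringIO(csv_text))
print(data.shape)  # (2, 3)
```

The resulting DataFrame is what you will pass to the model's fit method later in this guide.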
You can write your metadata as a dictionary. Follow the Metadata guide to create a dictionary for a single table. For example, the metadata for our table looks something like this:
```python
{
    'fields': {
        'start_date': {'type': 'datetime', 'format': '%Y-%m-%d'},
        'end_date': {'type': 'datetime', 'format': '%Y-%m-%d'},
        'salary': {'type': 'numerical', 'subtype': 'integer'},
        'duration': {'type': 'categorical'},
        'student_id': {'type': 'id', 'subtype': 'integer'},
        'high_perc': {'type': 'numerical', 'subtype': 'float'},
        'high_spec': {'type': 'categorical'},
        'mba_spec': {'type': 'categorical'},
        'second_perc': {'type': 'numerical', 'subtype': 'float'},
        'gender': {'type': 'categorical'},
        'degree_perc': {'type': 'numerical', 'subtype': 'float'},
        'placed': {'type': 'boolean'},
        'experience_years': {'type': 'numerical', 'subtype': 'integer'},
        'employability_perc': {'type': 'numerical', 'subtype': 'float'},
        'mba_perc': {'type': 'numerical', 'subtype': 'float'},
        'work_experience': {'type': 'boolean'},
        'degree_type': {'type': 'categorical'}
    },
    'constraints': [],
    'primary_key': 'student_id'
}
```
Pass in your metadata to create the TabularPreset FAST_ML model.
```python
In [4]: from sdv.lite import TabularPreset

# Use the FAST_ML preset to optimize for modeling time
In [5]: model = TabularPreset(name='FAST_ML', metadata=metadata)
```
Then, simply pass in your data to train the model.
```python
In [6]: model.fit(data)
```
The modeling step is optimized for speed. The exact time it takes depends on several factors including the number of rows, columns and distinct categories in categorical columns. As a rough benchmark, our analysis shows that:
Datasets with around 100K rows and 50-100 columns will take a few minutes to model
Larger datasets with around 1M rows and hundreds of columns may take closer to an hour
After you are finished modeling, you can save the fitted model and load it in again for future use.
```python
# save the model in a new file
In [7]: model.save('fast_ml_model.pkl')

# later, you can load it in again
In [8]: model = TabularPreset.load('fast_ml_model.pkl')
```
Once you have your model, you can begin to create synthetic data. Use the sample method and pass in the number of rows you want to synthesize.
```python
In [9]: synthetic_data = model.sample(num_rows=100)

In [10]: synthetic_data.head()
Out[10]:
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc   salary placed start_date   end_date  duration
0           0      M    77.856718  72.820654  Commerce    62.668844   Comm&Mgmt             True                 2           63.072758  Mkt&Fin  66.975043      NaN   True        NaT        NaT       NaN
1           1      F    48.974286  58.695613  Commerce    60.767601   Comm&Mgmt            False                 0           75.802412  Mkt&Fin  71.636136      NaN  False        NaT        NaT      12.0
2           2      M    76.844911  77.697082   Science    63.887735      Others             True                 1           66.080579  Mkt&Fin  59.494052      NaN   True 2020-02-06 2020-05-30       NaN
3           3      F    60.830093  72.588383   Science    74.019364   Comm&Mgmt             True                 1           60.130465  Mkt&Fin  65.954341      NaN   True        NaT        NaT       3.0
4           4      F    68.373581  73.035314  Commerce    65.815315   Comm&Mgmt            False                 0           60.689325  Mkt&Fin  64.818992  21004.0   True 2019-12-30 2020-09-30      12.0
```
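One quick way to sanity-check that the synthetic data preserves basic statistical properties is to compare pandas summary statistics side by side. This sketch uses toy stand-in frames; in practice you would use the real data table and the synthetic_data table from the steps above:

```python
import pandas as pd

# Toy stand-ins; in practice use the real `data` and the sampled
# `synthetic_data` DataFrames from the steps above.
real = pd.DataFrame({'salary': [27000.0, 20000.0, 25000.0, 42500.0]})
synthetic = pd.DataFrame({'salary': [21004.0, 29712.0, 25903.0, 38806.0]})

# describe() reports count, mean, std, min, quartiles and max,
# covering the basic properties the preset aims to preserve.
comparison = pd.concat(
    {'real': real['salary'].describe(), 'synthetic': synthetic['salary'].describe()},
    axis=1,
)
print(comparison)
```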
To create large amounts of synthetic data, provide a batch_size. This breaks the sampling up into multiple batches and shows a progress bar. Use the output_file_path parameter to write results to a file.
```python
In [11]: model.sample(num_rows=1_000_000, batch_size=10_000, output_file_path='synthetic_data.csv')
Out[11]:
        student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc   salary placed start_date   end_date  duration
0                0      F    66.076808  70.167282  Commerce    61.123129   Comm&Mgmt            False                 0           90.352363  Mkt&Fin  52.009704  37292.0   True 2020-03-20        NaT       NaN
1                1      M    81.378810  84.665910  Commerce    70.314368   Comm&Mgmt            False                 1           76.510793   Mkt&HR  64.003891  29712.0   True 2020-03-18 2020-07-02       3.0
2                2      M    74.672915  48.998928  Commerce    65.581655   Comm&Mgmt            False                 0           75.429694   Mkt&HR  55.613480  20998.0   True 2020-02-26 2019-12-24       3.0
3                3      M    80.600229  86.230726  Commerce    64.245977   Comm&Mgmt            False                 0           86.365788  Mkt&Fin  68.133274      NaN   True        NaT        NaT       6.0
4                4      M    52.836250  56.600862   Science    50.000000   Comm&Mgmt            False                 0           68.515163  Mkt&Fin  58.932920  25903.0  False        NaT 2020-03-15       NaN
...            ...    ...          ...        ...       ...          ...         ...              ...               ...                 ...      ...        ...      ...    ...        ...        ...       ...
999995        9995      M    80.032156  86.618673   Science    68.168072      Others            False                 0           76.156154  Mkt&Fin  73.580754  20000.0   True 2019-12-28 2020-08-15       NaN
999996        9996      M    66.834672  68.129867   Science    62.890542   Comm&Mgmt            False                 1           86.144812  Mkt&Fin  56.229785  29926.0   True 2019-11-17 2020-08-14       NaN
999997        9997      M    79.327632  56.517645  Commerce    74.299468   Comm&Mgmt            False                 0           76.707949  Mkt&Fin  59.972965  38806.0   True 2020-02-13 2020-08-16       NaN
999998        9998      M    52.243179  57.016449  Commerce    71.633991    Sci&Tech            False                 0           52.177788  Mkt&Fin  58.332636  20774.0  False        NaT 2020-07-10       6.0
999999        9999      F    72.408903  63.813712   Science    72.117249   Comm&Mgmt            False                 0           98.000000  Mkt&Fin  72.292758      NaN   True        NaT        NaT       NaN

[1000000 rows x 17 columns]
```
The model generates new synthetic data – synthetic rows that do not correspond to any rows in the original data. But sometimes you may want to fix some values.
For example, you might only be interested in synthesizing science and commerce students with work experience. Using conditional sampling, you can specify the exact, fixed values that you need. The SDV model will then synthesize the rest of the data.
First, use the Condition object to specify the exact values you want. You provide a dictionary mapping column names to the exact values you want, along with the number of rows to synthesize.
```python
In [12]: from sdv.sampling.tabular import Condition

# 100 science students with work experience
In [13]: science_students = Condition(
   ....:     column_values={'high_spec': 'Science', 'work_experience': True},
   ....:     num_rows=100)

# 200 commerce students with work experience
In [14]: commerce_students = Condition(
   ....:     column_values={'high_spec': 'Commerce', 'work_experience': True},
   ....:     num_rows=200)
```
You can now use the sample_conditions function and pass in a list of conditions.
```python
In [15]: all_conditions = [science_students, commerce_students]

In [16]: model.sample_conditions(conditions=all_conditions)
Out[16]:
     student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc   salary placed start_date   end_date  duration
0             0      F    52.532758  49.561970   Science    51.140416   Comm&Mgmt             True                 1           74.854705  Mkt&Fin  61.185634  23372.0  False 2019-12-23        NaT       NaN
1             1      M    67.391777  67.272080   Science    70.670060    Sci&Tech             True                 1           60.662458  Mkt&Fin  60.500428      NaN   True 2020-06-22 2020-08-21       NaN
2             2      M    75.295397  73.627289   Science    75.030994   Comm&Mgmt             True                 1           75.446925  Mkt&Fin  66.893770  24174.0   True 2020-04-04 2020-07-31       NaN
3             3      M    43.837698  49.099764   Science    50.000000   Comm&Mgmt             True                 1           61.788954   Mkt&HR  52.526468      NaN  False        NaT        NaT       6.0
4             4      F    64.153389  57.316688   Science    64.189365   Comm&Mgmt             True                 1           72.684235  Mkt&Fin  75.674615  31780.0  False        NaT        NaT       NaN
..          ...    ...          ...        ...       ...          ...         ...              ...               ...                 ...      ...        ...      ...    ...        ...        ...       ...
295         295      F    80.681077  70.690053  Commerce    68.522117   Comm&Mgmt             True                 1           83.895190   Mkt&HR  66.331646  30946.0  False 2020-01-24 2020-04-12       3.0
296         296      M    71.552698  63.254244  Commerce    72.997475   Comm&Mgmt             True                 1           85.743597  Mkt&Fin  63.853970  31098.0   True 2020-01-02        NaT       NaN
297         297      F    79.950761  82.388268  Commerce    61.265158   Comm&Mgmt             True                 1           75.722719  Mkt&Fin  59.528437      NaN   True        NaT 2020-09-10       NaN
298         298      M    68.525135  78.184983  Commerce    76.519461   Comm&Mgmt             True                 1           78.799635  Mkt&Fin  62.282153  30818.0   True 2019-12-21 2020-07-22       NaN
299         299      M    43.719921  41.350703  Commerce    58.875628   Comm&Mgmt             True                 1           67.797344  Mkt&Fin  51.898341  28077.0  False 2020-02-18        NaT       NaN

[300 rows x 17 columns]
```
A constraint is a logical business rule that must be met by every row in your dataset.
In most cases, the preset is able to learn a general trend and create synthetic data where most of the rows follow the rule. Use a constraint if you want to enforce that all of the rows must follow the rule.
In our dataset, we have a constraint: all the numerical values in the duration column must be divisible by 3. We can describe this using a FixedIncrements constraint.
```python
In [17]: from sdv.constraints import FixedIncrements

# define the constraint on the duration column
In [18]: duration_constraint = FixedIncrements(
   ....:     column_name='duration',
   ....:     increment_value=3,
   ....: )
```
You can pass constraints into the preset when creating your model.
```python
In [19]: constrained_model = TabularPreset(
   ....:     name='FAST_ML',
   ....:     metadata=metadata,
   ....:     constraints=[duration_constraint],
   ....: )

In [20]: constrained_model.fit(data)
```
When you sample from the model, the synthetic data will follow the constraints.
```python
In [21]: constrained_synthetic_data = constrained_model.sample(num_rows=1_000)

In [22]: constrained_synthetic_data.head(10)
Out[22]:
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc   salary placed start_date   end_date  duration
0           0      M    64.592829  73.630461  Commerce    74.138359   Comm&Mgmt            False                 0           87.675480  Mkt&Fin  66.020486  34853.0   True 2020-03-03 2020-05-12       3.0
1           1      M    70.870053  85.546354  Commerce    70.087021   Comm&Mgmt            False                 1           50.000000  Mkt&Fin  61.551202  22689.0   True 2020-04-29 2020-09-03       3.0
2           2      M    66.417049  68.707972  Commerce    68.907223   Comm&Mgmt             True                 1           84.787113  Mkt&Fin  61.278468  20004.0   True 2020-03-19 2020-09-14       6.0
3           3      M    78.246489  70.774004   Science    62.064204    Sci&Tech             True                 0           96.113512  Mkt&Fin  73.842500      NaN  False        NaT        NaT       3.0
4           4      M    58.706323  58.070675  Commerce    64.180914   Comm&Mgmt            False                 0           74.148656   Mkt&HR  54.104841      NaN  False        NaT        NaT       3.0
5           5      M    51.897824  60.005120  Commerce    54.509286   Comm&Mgmt            False                 0           74.693255   Mkt&HR  55.548106      NaN  False        NaT        NaT       NaN
6           6      M    81.948772  89.730430  Commerce    81.614766   Comm&Mgmt            False                 0           94.340478  Mkt&Fin  71.759414  31500.0   True 2020-02-17 2020-07-28       3.0
7           7      M    78.950939  70.579714   Science    77.318894   Comm&Mgmt             True                 1           50.000000  Mkt&Fin  62.273381  21770.0   True 2020-04-03 2020-11-09       NaN
8           8      M    71.176417  64.074905  Commerce    72.918041   Comm&Mgmt            False                 0           52.857051  Mkt&Fin  58.704327  20000.0   True 2020-01-23 2020-04-11       3.0
9           9      M    75.734755  68.945450   Science    63.842584   Comm&Mgmt            False                 0           56.267666   Mkt&HR  71.709611  33248.0   True 2019-10-03 2020-06-14       NaN
```
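You can verify the constraint directly: every non-null value in the duration column should be a multiple of the increment. A minimal check with pandas (shown here on a toy stand-in frame; in practice you would run it on the constrained_synthetic_data table from above):

```python
import pandas as pd

# Toy stand-in for the constrained output; in practice use
# `constrained_synthetic_data` from the steps above.
df = pd.DataFrame({'duration': [3.0, 6.0, None, 3.0, 12.0]})

# NaN values are allowed; every value that is present must be divisible by 3.
durations = df['duration'].dropna()
all_multiples = bool((durations % 3 == 0).all())
print(all_multiples)  # True
```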
To read more about defining constraints, see the Handling Constraints User Guide.
The SDV (Synthetic Data Vault) is an open source project built & maintained by DataCebo. It is free to use under the MIT License.
For other resources see our: GitHub, Docs, Blog.