Problem Description

Some of the parameters in the gretel-synthetics implementation in SDGym can cause the model to fail during evaluation, and others could be tuned to generate better synthetic data (details below).
Expected behavior
In `sdgym/synthesizers/gretel.py` there are a few updates I'd recommend making:

- Add `learning_rate` as a parameter, defaulting to `0.001` as per the Gretel docs.
- Add `field_cluster_size` as a tunable parameter.
- In `batcher.generate_all_batch_lines`, set a higher default `max_invalid`. For larger datasets, the default value of `1000` can cause the model to terminate unnecessarily during sampling.
- Set `epochs` to a standard value (e.g. 100); there's no need to derive epochs from the number of columns. Early stopping and a validation set can be used to prevent overfitting.
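To make the proposal concrete, here is a minimal sketch of the suggested defaults as a plain dict, kept independent of any particular gretel-synthetics version. The specific values for `max_invalid` and `field_cluster_size` below are illustrative placeholders, not values from the Gretel docs:

```python
# Suggested parameter defaults for the Gretel synthesizer in SDGym.
# Expressed as a plain dict so it does not depend on the exact
# gretel-synthetics config API; names mirror the list above.
SUGGESTED_GRETEL_PARAMS = {
    "learning_rate": 0.001,    # default per Gretel docs
    "epochs": 100,             # fixed standard value; rely on early stopping,
                               # not the column count, to avoid overfitting
    "max_invalid": 5000,       # hypothetical raised limit: the default of 1000
                               # can abort sampling prematurely on large datasets
    "field_cluster_size": 20,  # hypothetical default; exposed as a tunable
                               # parameter rather than hard-coded
}
```

These would be surfaced as constructor arguments on the SDGym synthesizer, so baseline runs keep today's behavior while benchmarks can override them.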
Additional context
I'm happy to submit a PR with fixes and to compare against baseline config for tests, let me know if this would be okay. Cheers!
Hi @zredlined, thanks for taking a look at the Gretel implementation in SDGym! Please feel free to submit a PR with the updates you have proposed. You can link it to this issue and we'll be happy to take a look.