Cannot use any scalar constraint (ScalarRange, ScalarInequality) with datetime columns that are stored as ints #2324

Open
npatki opened this issue Dec 17, 2024 · 0 comments
Labels
bug Something isn't working feature:constraints Related to inputting rules or business logic

npatki commented Dec 17, 2024

Environment Details

  • SDV version: 1.17.2 (latest)

Error Description

In some cases, I may have a datetime column (listed as sdtype datetime in my metadata) that is stored as an integer. For example:

  • I may be only storing the year component, so the column would contain integers such as 2024, 2023, ...
  • Or I may be storing the year, month, and day without any separators (YYYYMMDD format), so the column would contain integers such as 20240101, 20231231, ...
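
To make the two storage patterns concrete, here is a minimal sketch using only the Python standard library (not SDV). It shows that such integers are unambiguous dates once paired with the matching format string. The helper name `parse_int_date` is hypothetical and only for illustration.

```python
from datetime import datetime

def parse_int_date(value, datetime_format):
    """Hypothetical helper: parse an integer-stored date with a format string."""
    # strptime only accepts text, so the integer is cast to str first
    return datetime.strptime(str(value), datetime_format)

# Year-only storage, e.g. 2024 with format '%Y'
year_only = parse_int_date(2024, '%Y')

# Year, month, and day without separators, e.g. 20231231 with format '%Y%m%d'
full_date = parse_int_date(20231231, '%Y%m%d')

print(year_only.year)
print(full_date.date())
```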

In such cases, I can generally fit and sample synthetic data. However, if I try to add either of the scalar constraints (ScalarRange, ScalarInequality) to my synthesizer, I get an InvalidConstraintsError.

Steps to reproduce

import pandas as pd
from sdv.metadata import Metadata
from sdv.single_table import GaussianCopulaSynthesizer

data = pd.DataFrame(data={
    'col_A': [20201123, 20210125, 20240713, 20230219, 20220104, 20230502, 20211210, 20220101],
    'col_B': ['Yes', 'No', 'No', 'No', 'No', 'Yes', 'Yes', 'No']})

metadata = Metadata.load_from_dict({
    'tables': {
        'table': {
            'columns': {
                'col_A': { 'sdtype': 'datetime', 'datetime_format': '%Y%m%d' },
                'col_B': { 'sdtype': 'categorical'}}}}})

my_constraint = {
    'constraint_class': 'ScalarRange',
    'constraint_parameters': {
        'column_name': 'col_A',
        'low_value': 20200101,
        'high_value': 20241231 }}

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.add_constraints([my_constraint])

Output:

/usr/local/lib/python3.10/dist-packages/sdv/data_processing/data_processor.py in add_constraints(self, constraints)
    328 
    329         if errors:
--> 330             raise InvalidConstraintsError(errors)
    331 
    332         self._constraints_list.extend(validated_constraints)

InvalidConstraintsError: The provided constraint is invalid:
Both 'high_value' and 'low_value' must be a datetime string of the right format

Workaround

Currently, these constraints are set up to assume that datetime values must always be represented as strings. So a simple workaround would be to cast the relevant column to a string and then supply the constraint values as strings too.

# cast all the values in col_A to strings
data['col_A'] = data['col_A'].astype(str)

# supply the values in the constraint as strings
my_constraint = {
    'constraint_class': 'ScalarRange',
    'constraint_parameters': {
        'column_name': 'col_A',
        'low_value': '20200101',
        'high_value': '20241231' }}

# now it should be possible to add the constraint, fit, and sample synthetic data
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.add_constraints([my_constraint])
synthesizer.fit(data)
synthesizer.sample(num_rows=5)
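
If the rest of a pipeline expects integers, the string column can presumably be cast back after sampling. The sketch below assumes the synthetic column comes back as strings in the same `%Y%m%d` format as the training data. The `synthetic_data` frame is a hand-written stand-in for the sampled output, not real SDV output.

```python
import pandas as pd

# Stand-in for the result of synthesizer.sample(...); values are made up
synthetic_data = pd.DataFrame({
    'col_A': ['20210314', '20230628', '20220909'],
    'col_B': ['No', 'Yes', 'No']})

# Cast the datetime column back to the integer storage used by the real data
synthetic_data['col_A'] = synthetic_data['col_A'].astype(int)

print(synthetic_data['col_A'].tolist())
```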

Other Info

  • One important thing to consider is whether the ScalarRange or ScalarInequality constraint is absolutely needed. By default, all SDV synthesizers will enforce the min/max values that are observed in the data. If this default is kept, then you do not need to add any constraints.
  • Constraints are currently set up so that they do not have access to the metadata. In Q1, we plan to streamline constraints as part of a larger project, which means updating this part of the SDV workflow. SDV synthesizers will have access to metadata at the time of validating a constraint. So the "correct" approach here will be to validate against the metadata -- i.e. just check that the sdtype is numerical or datetime in the metadata, regardless of how the data is actually stored (int, string, etc.)
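
A rough sketch of what metadata-based validation could look like is below. This is not SDV's implementation. The function name `validate_scalar_against_metadata` and its behavior are assumptions made for illustration, and the column metadata is passed in as a plain dict shaped like the one in the reproduction steps.

```python
from datetime import datetime

def validate_scalar_against_metadata(column_metadata, value):
    """Hypothetical check: decide validity from metadata, not from storage type."""
    sdtype = column_metadata.get('sdtype')
    if sdtype == 'numerical':
        # bool is a subclass of int, so exclude it explicitly
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if sdtype == 'datetime':
        datetime_format = column_metadata.get('datetime_format')
        if datetime_format is None:
            return False
        try:
            # str() lets both 20200101 and '20200101' be checked the same way
            datetime.strptime(str(value), datetime_format)
        except ValueError:
            return False
        return True
    # Other sdtypes are not valid targets for a scalar range in this sketch
    return False

col_A = {'sdtype': 'datetime', 'datetime_format': '%Y%m%d'}
print(validate_scalar_against_metadata(col_A, 20200101))
print(validate_scalar_against_metadata(col_A, '20241231'))
print(validate_scalar_against_metadata(col_A, '2020-01-01'))
```

With a check along these lines, the integer values from the reproduction steps and the string values from the workaround would both be accepted for col_A, while a value in a different format would be rejected.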
@npatki npatki added bug Something isn't working feature:constraints Related to inputting rules or business logic labels Dec 17, 2024