When data is in an SAS file, pandas reads in missing values as empty strings #2313

Open
npatki opened this issue Dec 2, 2024 · 0 comments
Labels
bug Something isn't working

npatki commented Dec 2, 2024

I'm filing this issue on behalf of a Slack user.

Environment Details

  • SDV version: 1.17.2 (latest)

Error Description

I have a SAS file that I am loading into Python using the pandas.read_sas function. When I do this, the missing values in my datetime column are loaded in as empty strings ('') rather than the np.nan values that SDV expects. This causes the SDV synthesizer to crash when fitting.

Steps to reproduce

Reading in the data:

import pandas as pd

dm = pd.read_sas('dm.xpt', encoding="utf-8")

Making sure the metadata correctly identifies datetime columns:

metadata.update_columns(
    column_names=['RFSTDTC', 'RFENDTC', 'DMDTC'],
    sdtype='datetime',
    datetime_format='%Y-%m-%d',
    table_name='Clinical trial participants demographic data'
)

However, because the missing values are read in as empty strings, any SDV synthesizer will throw an error when fitting (originating from validate):

from sdv.single_table import GaussianCopulaSynthesizer

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(dm)

InvalidDataError: The provided data does not match the metadata:
Invalid values found for datetime column 'RFSTDTC': [''].

Workaround

To work around this, replace the empty strings with np.nan before fitting any SDV synthesizer.

import numpy as np

dm.replace('', np.nan, inplace=True)

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(dm)
...
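The blanket replace above changes empty strings in every column, including text columns where an empty string may be a legitimate value. A narrower variant, sketched here on a small made-up DataFrame that stands in for the output of pd.read_sas, limits the replacement to the datetime columns. The sample values and the 'ARM' column are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Stand-in for the DataFrame returned by pd.read_sas (values are made up).
dm = pd.DataFrame({
    'USUBJID': ['001', '002', '003'],
    'RFSTDTC': ['2020-01-15', '', '2020-03-02'],
    'ARM': ['A', '', 'B'],
})

# Columns declared as sdtype='datetime' in the metadata.
datetime_columns = ['RFSTDTC']

# Replace empty strings with NaN in the datetime columns only;
# other columns keep their empty strings.
dm[datetime_columns] = dm[datetime_columns].replace('', np.nan)
```

After this, only 'RFSTDTC' has a missing value, and the empty string in 'ARM' is left as it was.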

Additional Details

If the original data is in CSV format, then pd.read_csv will read in missing values as np.nan. This is compatible with SDV.
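For comparison, here is a minimal sketch of the CSV behavior, using an in-memory CSV with made-up values instead of a real file. With default settings, pd.read_csv treats an empty field as missing.

```python
import io

import pandas as pd

# In-memory CSV with one empty RFSTDTC field (values are made up).
csv_text = "USUBJID,RFSTDTC\n001,2020-01-15\n002,\n003,2020-03-02\n"

# Read USUBJID as text so the leading zeros are preserved.
df = pd.read_csv(io.StringIO(csv_text), dtype={'USUBJID': str})

# The empty field is loaded as a missing value, not as ''.
print(df['RFSTDTC'].isna().tolist())  # [False, True, False]
```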

To solve the issue with SAS file format, we need to make a few decisions first:

  • Should we support the case where missing datetime values are denoted with empty strings, or should we only support np.nan values?
  • If we decide to only support np.nan values, then it would be nice to provide a warning and a conversion function to take care of this before fitting.
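As a rough illustration of the second option, the sketch below shows what a warning plus conversion step could look like on the user side. The function name empty_strings_to_nan and its signature are hypothetical and are not part of SDV. Treating whitespace-only strings as missing is also an assumption.

```python
import warnings

import numpy as np
import pandas as pd


def empty_strings_to_nan(data, datetime_columns):
    """Hypothetical helper (not an SDV API): return a copy of `data` in which
    empty or whitespace-only strings in `datetime_columns` become NaN."""
    converted = data.copy()
    for column in datetime_columns:
        # True where the value is a string containing nothing but whitespace.
        is_blank = converted[column].map(
            lambda value: isinstance(value, str) and value.strip() == ''
        )
        n_blank = int(is_blank.sum())
        if n_blank:
            warnings.warn(
                f"Column '{column}' has {n_blank} empty string(s); "
                "converting them to NaN.",
                UserWarning,
            )
            converted.loc[is_blank, column] = np.nan
    return converted


# Example with made-up values standing in for the SAS data.
sample = pd.DataFrame({'RFSTDTC': ['2020-01-15', '', '  ']})
cleaned = empty_strings_to_nan(sample, ['RFSTDTC'])
```

The helper returns a copy rather than modifying its input, so the original DataFrame stays available for comparison.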
@npatki npatki added the bug Something isn't working label Dec 2, 2024
@npatki npatki changed the title When data is in a SAS file, pandas reads in missing values as empty strings When data is in an SAS file, pandas reads in missing values as empty strings Dec 2, 2024