How to pinpoint problematic table columns when getting metadata from multiple tables #2323

Open
yimingli opened this issue Dec 16, 2024 · 4 comments
Labels
question: General question about the software
under discussion: Issue is currently being discussed

Comments

@yimingli

Environment details

If you are already running SDV, please indicate the following details about the environment in
which you are running it:

  • SDV version: 1.17.2
  • Python version: 3.10.15
  • Operating System: macOS 15.2

Problem description

Running metadata = Metadata.detect_from_dataframes(data=real_data) raises TypeError: unhashable type: 'dict'.

What I already tried

I thought the error indicates that some dataframe columns are of the 'dict' type, so I used the following snippet to check, and none of the columns are dicts.

for table_id, df in real_data.items():
    dict_columns = [
        col for col in df.columns
        if df[col].apply(lambda x: isinstance(x, dict)).any()
    ]
    print(f"Dict columns for {table_id}: {array_columns}")

I have two questions:

  1. What does the error unhashable type: 'dict' mean?
  2. If it means some columns are problematic, how do I quickly narrow it down to the table and column? I'm working with multiple tables.

Thank you!

Trace:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[108], line 2
      1 #%%
----> 2 metadata = Metadata.detect_from_dataframes(data=real_data)

File ~/.pyenv/versions/sdv/lib/python3.10/site-packages/sdv/metadata/metadata.py:84, in Metadata.detect_from_dataframes(cls, data)
     82 metadata = Metadata()
     83 for table_name, dataframe in data.items():
---> 84     metadata.detect_table_from_dataframe(table_name, dataframe)
     86 metadata._detect_relationships(data)
     87 return metadata

File ~/.pyenv/versions/sdv/lib/python3.10/site-packages/sdv/metadata/multi_table.py:547, in MultiTableMetadata.detect_table_from_dataframe(self, table_name, data)
    545 self._validate_table_not_detected(table_name)
    546 table = SingleTableMetadata()
--> 547 table._detect_columns(data)
    548 self.tables[table_name] = table
    549 self._log_detected_table(table)

File ~/.pyenv/versions/sdv/lib/python3.10/site-packages/sdv/metadata/single_table.py:620, in SingleTableMetadata._detect_columns(self, data)
    617     sdtype = self._determine_sdtype_for_numbers(column_data)
    619 elif dtype == 'O':
--> 620     sdtype = self._determine_sdtype_for_objects(column_data)
    622 if sdtype is None:
    623     raise InvalidMetadataError(
    624         f"Unsupported data type for column '{field}' (kind: {dtype})."
    625         "The valid data types are: 'object', 'int', 'float', 'datetime', 'bool'."
    626     )

File ~/.pyenv/versions/sdv/lib/python3.10/site-packages/sdv/metadata/single_table.py:520, in SingleTableMetadata._determine_sdtype_for_objects(self, data)
    518     sdtype = 'categorical'
    519 else:
--> 520     unique_values = data.nunique()
    521     if unique_values == len(data):
    522         sdtype = 'id'

File ~/.pyenv/versions/sdv/lib/python3.10/site-packages/pandas/core/base.py:1063, in IndexOpsMixin.nunique(self, dropna)
   1028 @final
   1029 def nunique(self, dropna: bool = True) -> int:
   1030     """
   1031     Return number of unique elements in the object.
   1032 
   (...)
   1061     4
   1062     """
-> 1063     uniqs = self.unique()
   1064     if dropna:
   1065         uniqs = remove_na_arraylike(uniqs)

File ~/.pyenv/versions/sdv/lib/python3.10/site-packages/pandas/core/series.py:2407, in Series.unique(self)
   2344 def unique(self) -> ArrayLike:  # pylint: disable=useless-parent-delegation
   2345     """
   2346     Return unique values of Series object.
   2347 
   (...)
   2405     Categories (3, object): ['a' < 'b' < 'c']
   2406     """
-> 2407     return super().unique()

File ~/.pyenv/versions/sdv/lib/python3.10/site-packages/pandas/core/base.py:1025, in IndexOpsMixin.unique(self)
   1023     result = values.unique()
   1024 else:
-> 1025     result = algorithms.unique1d(values)
   1026 return result

File ~/.pyenv/versions/sdv/lib/python3.10/site-packages/pandas/core/algorithms.py:401, in unique(values)
    307 def unique(values):
    308     """
    309     Return unique values based on a hash table.
    310 
   (...)
    399     array([('a', 'b'), ('b', 'a'), ('a', 'c')], dtype=object)
    400     """
--> 401     return unique_with_mask(values)

File ~/.pyenv/versions/sdv/lib/python3.10/site-packages/pandas/core/algorithms.py:440, in unique_with_mask(values, mask)
    438 table = hashtable(len(values))
    439 if mask is None:
--> 440     uniques = table.unique(values)
    441     uniques = _reconstruct_data(uniques, original.dtype, original)
    442     return uniques

File pandas/_libs/hashtable_class_helper.pxi:7248, in pandas._libs.hashtable.PyObjectHashTable.unique()

File pandas/_libs/hashtable_class_helper.pxi:7195, in pandas._libs.hashtable.PyObjectHashTable._unique()

TypeError: unhashable type: 'dict'
@yimingli
Author

yimingli commented Dec 16, 2024

Ah, I had a typo in the print statement of the Python snippet (array_columns should be dict_columns). After fixing the typo, I did find a few dict columns in one of the tables.

So my first question is resolved. The second question/feature request is whether there's a way to include the table name and column name in the trace for easier debugging.
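Until the traceback carries that context, one possible workaround is to probe each object column with the same pandas call that fails in the trace (Series.nunique) and record where it raises. This is only a sketch: find_unhashable_columns and sample_data are hypothetical names used for illustration, not part of SDV, and the sample tables stand in for real_data.

```python
import pandas as pd


def find_unhashable_columns(tables):
    """Map each table name to the object columns whose values cannot be hashed."""
    problems = {}
    for table_name, df in tables.items():
        bad_columns = []
        for col in df.columns:
            series = df[col]
            if series.dtype != object:
                continue  # only object columns can hold dicts, lists, etc.
            try:
                series.nunique()  # the pandas call that fails in the traceback
            except TypeError:
                bad_columns.append(col)
        if bad_columns:
            problems[table_name] = bad_columns
    return problems


# Hypothetical sample data standing in for real_data
sample_data = {
    "users": pd.DataFrame({"user_id": [1, 2], "name": ["a", "b"]}),
    "events": pd.DataFrame({"event_id": [10, 11], "payload": [{"k": 1}, {"k": 2}]}),
}
print(find_unhashable_columns(sample_data))  # {'events': ['payload']}
```

Unlike an isinstance(x, dict) check, this also flags other unhashable values such as lists or sets, since it catches the TypeError itself rather than testing for one specific type.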

@npatki
Contributor

npatki commented Dec 17, 2024

Hi @yimingli, thanks for filing this issue. It seems like you have resolved your overall issue of figuring out which columns were problematic?

We can certainly track a feature request for a better debugging experience. Before we are able to make an update, it would be helpful to better understand what is going on -- as I have never before seen a case where the column labels (aka column names) are dictionaries instead of strings.

Would you be able to explain more about your original data format? How did you load the data into Python to create the real_data dictionary? I am curious if this is specific to a particular type of database or data storage service.

@yimingli
Author

I have never before seen a case where the column labels (aka column names) are dictionaries instead of strings

Not the column names, but the column values are dictionaries. Column names are strings. Does this help?
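To make the distinction concrete, here is a minimal reproduction outside SDV, assuming only pandas. The column name payload and the values are made up for illustration. The column label is an ordinary string, but each cell holds a dict, and nunique() has to hash every value to count the distinct ones.

```python
import pandas as pd

# The column *name* is a plain string; the column *values* are dicts.
column = pd.Series([{"a": 1}, {"a": 2}], name="payload")

try:
    column.nunique()  # pandas hashes each value to find the unique ones
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

Dicts are mutable and therefore not hashable in Python, so the call should fail with the same TypeError that appears at the bottom of the trace.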

@npatki
Contributor

npatki commented Dec 20, 2024

Hi @yimingli, thank you for clarifying this. I filed a feature request #2327 for surfacing the name of the column/table to you when the metadata detection crashes, so you won't have this problem next time.

In the meantime, we're always looking for feedback to support your use case. I'm wondering if SDV offering official support for dictionary values in your data would be a useful feature for you? If so, I'm curious as to what kind of info is being stored in dictionary format. Do all the dictionaries in each of the data cells have the same key/value pairs? Any more info you can provide would be helpful. Thanks.
