How to pinpoint problematic table columns when getting metadata from multiple tables #2323

yimingli · 2024-12-16T21:22:30Z

Environment details


SDV version: 1.17.2
Python version: 3.10.15
Operating System: macOS 15.2

Problem description

When running metadata = Metadata.detect_from_dataframes(data=real_data), I got the error TypeError: unhashable type: 'dict'.

What I already tried

I assumed the error meant that some DataFrame columns contain dict values, so I ran the following snippet to check, but it reported no dict columns.

for table_id, df in real_data.items():
    dict_columns = [
        col for col in df.columns
        if df[col].apply(lambda x: isinstance(x, dict)).any()
    ]
    print(f"Dict columns for {table_id}: {array_columns}")

I have 2 questions:

1. What does the error unhashable type: 'dict' mean?
2. If it means some columns have issues, how do I quickly narrow it down to the table and column? (I'm working with multiple tables.)

Thank you!

Trace:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[108], line 2
      1 #%%
----> 2 metadata = Metadata.detect_from_dataframes(data=real_data)

File ~/.pyenv/versions/sdv/lib/python3.10/site-packages/sdv/metadata/metadata.py:84, in Metadata.detect_from_dataframes(cls, data)
     82 metadata = Metadata()
     83 for table_name, dataframe in data.items():
---> 84     metadata.detect_table_from_dataframe(table_name, dataframe)
     86 metadata._detect_relationships(data)
     87 return metadata

File ~/.pyenv/versions/sdv/lib/python3.10/site-packages/sdv/metadata/multi_table.py:547, in MultiTableMetadata.detect_table_from_dataframe(self, table_name, data)
    545 self._validate_table_not_detected(table_name)
    546 table = SingleTableMetadata()
--> 547 table._detect_columns(data)
    548 self.tables[table_name] = table
    549 self._log_detected_table(table)

File ~/.pyenv/versions/sdv/lib/python3.10/site-packages/sdv/metadata/single_table.py:620, in SingleTableMetadata._detect_columns(self, data)
    617     sdtype = self._determine_sdtype_for_numbers(column_data)
    619 elif dtype == 'O':
--> 620     sdtype = self._determine_sdtype_for_objects(column_data)
    622 if sdtype is None:
    623     raise InvalidMetadataError(
    624         f"Unsupported data type for column '{field}' (kind: {dtype})."
    625         "The valid data types are: 'object', 'int', 'float', 'datetime', 'bool'."
    626     )

File ~/.pyenv/versions/sdv/lib/python3.10/site-packages/sdv/metadata/single_table.py:520, in SingleTableMetadata._determine_sdtype_for_objects(self, data)
    518     sdtype = 'categorical'
    519 else:
--> 520     unique_values = data.nunique()
    521     if unique_values == len(data):
    522         sdtype = 'id'

File ~/.pyenv/versions/sdv/lib/python3.10/site-packages/pandas/core/base.py:1063, in IndexOpsMixin.nunique(self, dropna)
   1028 @final
   1029 def nunique(self, dropna: bool = True) -> int:
   1030     """
   1031     Return number of unique elements in the object.
   1032 
   (...)
   1061     4
   1062     """
-> 1063     uniqs = self.unique()
   1064     if dropna:
   1065         uniqs = remove_na_arraylike(uniqs)

File ~/.pyenv/versions/sdv/lib/python3.10/site-packages/pandas/core/series.py:2407, in Series.unique(self)
   2344 def unique(self) -> ArrayLike:  # pylint: disable=useless-parent-delegation
   2345     """
   2346     Return unique values of Series object.
   2347 
   (...)
   2405     Categories (3, object): ['a' < 'b' < 'c']
   2406     """
-> 2407     return super().unique()

File ~/.pyenv/versions/sdv/lib/python3.10/site-packages/pandas/core/base.py:1025, in IndexOpsMixin.unique(self)
   1023     result = values.unique()
   1024 else:
-> 1025     result = algorithms.unique1d(values)
   1026 return result

File ~/.pyenv/versions/sdv/lib/python3.10/site-packages/pandas/core/algorithms.py:401, in unique(values)
    307 def unique(values):
    308     """
    309     Return unique values based on a hash table.
    310 
   (...)
    399     array([('a', 'b'), ('b', 'a'), ('a', 'c')], dtype=object)
    400     """
--> 401     return unique_with_mask(values)

File ~/.pyenv/versions/sdv/lib/python3.10/site-packages/pandas/core/algorithms.py:440, in unique_with_mask(values, mask)
    438 table = hashtable(len(values))
    439 if mask is None:
--> 440     uniques = table.unique(values)
    441     uniques = _reconstruct_data(uniques, original.dtype, original)
    442     return uniques

File pandas/_libs/hashtable_class_helper.pxi:7248, in pandas._libs.hashtable.PyObjectHashTable.unique()

File pandas/_libs/hashtable_class_helper.pxi:7195, in pandas._libs.hashtable.PyObjectHashTable._unique()

TypeError: unhashable type: 'dict'
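
For readers following the trace: the last frames are in pandas, not SDV. Series.nunique() builds a hash table of the values, so any cell holding an unhashable object (a dict, list, or set) raises this TypeError. A minimal sketch that reproduces it, assuming a made-up column of dict values (the data here is hypothetical, not from the reporter's tables):

```python
import pandas as pd

# Hypothetical column whose cells hold dicts (made-up data for illustration).
column = pd.Series([{"a": 1}, {"a": 2}], dtype="object")

try:
    # Same call that fails in SingleTableMetadata._determine_sdtype_for_objects.
    column.nunique()
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```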


yimingli · 2024-12-16T21:52:34Z

Ah, I had a typo in the print statement of the Python snippet (array_columns should be dict_columns). After fixing the typo, I did find a few dict columns in one of the tables.

So my first question is resolved. The second question/feature request is whether there's a way to include the table name and column name in the trace for easier debugging.
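
Until the trace includes that information, one way to narrow things down is to scan every object column in every table for values that cannot be hashed. The sketch below is an illustration only: find_unhashable_columns and the example tables are hypothetical names and data, not part of SDV. It tests hash() directly rather than isinstance(x, dict), so it also catches lists and sets, which would fail in the same way.

```python
import pandas as pd


def _is_hashable(value):
    """Return True if hash(value) succeeds."""
    try:
        hash(value)
    except TypeError:
        return False
    return True


def find_unhashable_columns(tables):
    """Map (table_name, column_name) to the sorted type names of unhashable values.

    Hypothetical helper, not an SDV function.
    """
    problems = {}
    for table_name, dataframe in tables.items():
        for column_name, series in dataframe.items():
            if series.dtype != object:
                continue  # only object columns can hold dicts, lists, etc.
            bad_types = {type(v).__name__ for v in series if not _is_hashable(v)}
            if bad_types:
                problems[(table_name, column_name)] = sorted(bad_types)
    return problems


# Made-up tables standing in for real_data.
example_tables = {
    "users": pd.DataFrame({"id": [1, 2], "name": ["a", "b"]}),
    "events": pd.DataFrame(
        {"id": [1, 2], "payload": [{"k": 1}, None], "tags": [["x"], ["y"]]}
    ),
}

problems = find_unhashable_columns(example_tables)
for (table_name, column_name), type_names in problems.items():
    print(f"{table_name}.{column_name}: unhashable types {type_names}")
```

Running the helper on real_data before calling Metadata.detect_from_dataframes should list the same columns the corrected snippet found, along with the table each one belongs to.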

npatki · 2024-12-17T16:28:57Z

Hi @yimingli, thanks for filing this issue. It seems like you have resolved your overall issue of figuring out which columns were problematic?

We can certainly track a feature request for a better debugging experience. Before we are able to make an update, it would be helpful to better understand what is going on -- as I have never before seen a case where the column labels (aka column names) are dictionaries instead of strings.

Would you be able to explain more about your original data format? How did you load the data into Python to create the real_data dictionary? I am curious if this is specific to a particular type of database or data storage service.

yimingli · 2024-12-18T04:48:39Z

> I have never before seen a case where the column labels (aka column names) are dictionaries instead of strings

It's not the column names but the column values that are dictionaries. The column names are strings. Does this help?

npatki · 2024-12-20T19:39:38Z

Hi @yimingli, thank you for clarifying this. I filed a feature request #2327 for surfacing the name of the column/table to you when the metadata detection crashes, so you won't have this problem next time.
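
Until #2327 is addressed, a rough interim approach is to run detection one table at a time and re-raise any failure with the table name attached. The sketch below is an assumption-laden illustration: detect_with_table_context and fake_detect are hypothetical names, and fake_detect is only a stand-in that mimics the failing nunique() call so the example runs without SDV installed.

```python
import pandas as pd


def detect_with_table_context(tables, detect_one):
    """Call detect_one(table_name, dataframe) per table, naming the table on failure.

    Hypothetical wrapper, not an SDV function.
    """
    for table_name, dataframe in tables.items():
        try:
            detect_one(table_name, dataframe)
        except Exception as exc:
            raise RuntimeError(
                f"Metadata detection failed for table '{table_name}': "
                f"{type(exc).__name__}: {exc}"
            ) from exc


def fake_detect(table_name, dataframe):
    """Stand-in for per-table detection: counts unique values in each column."""
    for _, series in dataframe.items():
        series.nunique()


# Made-up tables standing in for real_data.
example_tables = {
    "users": pd.DataFrame({"id": [1, 2], "name": ["a", "b"]}),
    "events": pd.DataFrame({"id": [1, 2], "payload": [{"k": 1}, {"k": 2}]}),
}

try:
    detect_with_table_context(example_tables, fake_detect)
    failure_message = None
except RuntimeError as exc:
    failure_message = str(exc)

print(failure_message)
```

Judging from the trace above, Metadata.detect_from_dataframes calls detect_table_from_dataframe(table_name, dataframe) on a fresh Metadata() for each table, so passing that bound method as detect_one may work in SDV 1.17.2. That is an inference from the traceback and has not been verified here. The wrapper only identifies the table; finding the column still requires a per-column check.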

In the meantime, we're always looking for feedback to support your use case. I'm wondering if SDV offering official support for dictionary values in your data would be a useful feature for you? If so, I'm curious as to what kind of info is being stored in dictionary format. Do all the dictionaries in each of the data cells have the same key/value pairs? Any more info you can provide would be helpful. Thanks.

yimingli added the labels "new" (Automatic label applied to new issues) and "question" (General question about the software) on Dec 16, 2024

npatki added the label "under discussion" (Issue is currently being discussed) and removed the label "new" on Dec 17, 2024

npatki mentioned this issue on Dec 20, 2024 in "Surface more detailed error info when detecting metadata from dataframes" #2327 (Open)
