The `ContingencySimilarity` metric should be able to discretize continuous columns #700

npatki · 2024-12-26T23:02:10Z

Problem Description

The ContingencySimilarity metric is run whenever I am generating a quality report. According to the Quality report documentation, the report will discretize any numerical columns before applying the metric.

However, in some cases I am doing analysis of my own and would like to run the ContingencySimilarity metric in a standalone way. In these cases, I am unable to take advantage of the discretization (binning), so I cannot easily apply this metric to compare a continuous column (datetime, numerical) with a categorical column.
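To illustrate why binning matters, here is a minimal sketch in plain Python. It is not the SDMetrics implementation: it assumes the metric scores a pair of columns as 1 minus the total variation distance between their normalized contingency tables, and the function name, the toy data, and the decade-based rounding are all hypothetical.

```python
from collections import Counter


def contingency_similarity(real_pairs, synthetic_pairs):
    """Return 1 minus the total variation distance between two
    normalized contingency tables built from (value_a, value_b) pairs."""
    real_freq = Counter(real_pairs)
    synth_freq = Counter(synthetic_pairs)
    n_real, n_synth = len(real_pairs), len(synthetic_pairs)

    # Sum absolute differences in relative frequency over every cell seen
    # in either table (a Counter returns 0 for a missing cell).
    total_diff = sum(
        abs(real_freq[pair] / n_real - synth_freq[pair] / n_synth)
        for pair in set(real_freq) | set(synth_freq)
    )
    return 1 - 0.5 * total_diff


# Hypothetical toy data: (age, plan) pairs where age is continuous.
real = [(23.1, 'A'), (35.7, 'B'), (41.2, 'A'), (58.9, 'B')]
synthetic = [(23.4, 'A'), (35.2, 'B'), (41.9, 'A'), (58.3, 'B')]

# No continuous value matches exactly, so the two tables share no cells.
unbinned_score = contingency_similarity(real, synthetic)


def by_decade(pairs):
    """Crude stand-in for binning: group ages by decade."""
    return [(int(age // 10), plan) for age, plan in pairs]


# After grouping, the same data lands in identical cells.
binned_score = contingency_similarity(by_decade(real), by_decade(synthetic))
```

In this toy case the unbinned comparison scores 0 even though the synthetic data closely tracks the real data, while the grouped comparison scores 1.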

Expected behavior

Add the following parameters to the ContingencySimilarity metric:

- `continuous_column_names`, a list of column names that represent continuous values
  - (default) `None`, meaning that the metric will not discretize any column
  - Optionally, the user can provide a list of columns to discretize before running the metric. The column names in this list should match the column names in the real and synthetic data.
- `num_discrete_bins`, or the number of bins to create for the continuous columns
  - (default) `10`: Discretize continuous columns into 10 bins

If there are columns to discretize, then the metric should apply a similar binning process as the quality report.

1. Determine the bins based on the real data only. Use the parameters to determine the number of bins and which column(s) to bin.
2. Make the min value of the lowest bin -inf and the max value of the highest bin +inf. That way, if the synthetic data goes outside the boundary of the real data, it can still be discretized appropriately.
3. Run the metric on the discretized data.
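The binning steps above can be sketched as follows. This is a dependency-free illustration under stated assumptions, not the quality report's actual logic: it assumes equal-width bins, and the helper names `fit_bin_edges` and `discretize` are hypothetical. Datetime columns would need converting to a numeric representation first.

```python
from bisect import bisect_right
from math import inf


def fit_bin_edges(real_values, num_discrete_bins=10):
    """Compute bin edges from the real data only (equal-width is an assumption)."""
    low, high = min(real_values), max(real_values)
    width = (high - low) / num_discrete_bins
    # Keep only the interior edges; the outermost edges are opened to
    # -inf and +inf so out-of-range synthetic values still get a bin.
    inner = [low + i * width for i in range(1, num_discrete_bins)]
    return [-inf] + inner + [inf]


def discretize(values, edges):
    """Map each value to the 0-based index of the bin it falls in."""
    inner = edges[1:-1]
    return [bisect_right(inner, value) for value in values]


# Hypothetical toy column.
real_column = [0.0, 2.5, 5.0, 7.5, 10.0]
synthetic_column = [-3.0, 4.5, 12.0]  # two values fall outside the real range

edges = fit_bin_edges(real_column, num_discrete_bins=5)
real_bins = discretize(real_column, edges)
synthetic_bins = discretize(synthetic_column, edges)
```

Because the edges are fitted once on the real data and reused for the synthetic data, both columns share the same bin labels, and the resulting integer labels can be passed to the metric like any other categorical column.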

Additional context

The new parameters we would add to ContingencySimilarity are only there for standalone, one-off uses of this metric.

Nothing needs to change in the quality report. For performance reasons, the quality report will choose to do its own binning first (across all columns) before calling ContingencySimilarity. We should continue doing this, as we want the quality report to run fast.

However, we may want to refactor the discretization logic so that both the ContingencySimilarity metric and Quality report can make use of it.

npatki added the `feature request` label on Dec 26, 2024