The ContingencySimilarity metric should be able to discretize continuous columns
#700
Labels
feature request
Problem Description
The ContingencySimilarity metric is run whenever I am generating a quality report. According to the Quality report documentation, the report will discretize any numerical columns before applying the metric.
However, in some cases I am doing my own analysis and would like to run the ContingencySimilarity metric standalone. In these cases, I cannot take advantage of the discretization (binning), so I cannot easily apply this metric to compare a continuous column (datetime, numerical) with a categorical column.
Expected behavior
Add the following parameters to the ContingencySimilarity metric:
- continuous_column_names, a list of column names that represent continuous values
  - None, meaning that the metric will not discretize any column
- num_discrete_bins, or the number of bins to create for the continuous columns
  - 10: Discretize continuous columns into 10 bins

If there are columns to discretize, then the metric should apply a similar binning process as the quality report.
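To make the request concrete, here is a rough sketch of the behavior I have in mind, written with pandas and NumPy only. The helper names discretize_columns and contingency_similarity are hypothetical and are not part of the SDMetrics API. The sketch assumes equal-width bins computed from the real data and handles numerical columns only (datetimes would first need converting to numbers). The score is assumed to be 1 minus the total variation distance between the normalized contingency tables. The actual binning in the quality report may differ in its details.

```python
import numpy as np
import pandas as pd


def discretize_columns(real, synthetic, continuous_column_names=None, num_discrete_bins=10):
    """Hypothetical helper: bin continuous columns using edges computed from the real data."""
    real, synthetic = real.copy(), synthetic.copy()
    # None (or an empty list) means that no column is discretized.
    for name in continuous_column_names or []:
        edges = np.histogram_bin_edges(real[name].dropna(), bins=num_discrete_bins)
        edges = edges.astype(float)
        # Open the outer edges so synthetic values outside the real range still get a bin.
        edges[0], edges[-1] = -np.inf, np.inf
        real[name] = np.digitize(real[name], bins=edges)
        synthetic[name] = np.digitize(synthetic[name], bins=edges)
    return real, synthetic


def contingency_similarity(real, synthetic):
    """Assumed score: 1 minus the total variation distance between contingency tables."""
    columns = list(real.columns)
    real_freq = real.groupby(columns).size() / len(real)
    synthetic_freq = synthetic.groupby(columns).size() / len(synthetic)
    # Category combinations missing from one table count as frequency 0.
    diff = real_freq.sub(synthetic_freq, fill_value=0).abs()
    return 1 - 0.5 * diff.sum()


real = pd.DataFrame({'age': [10, 20, 30, 40], 'group': ['a', 'a', 'b', 'b']})
synthetic = pd.DataFrame({'age': [12, 18, 33, 39], 'group': ['a', 'b', 'b', 'b']})

real_binned, synthetic_binned = discretize_columns(
    real, synthetic, continuous_column_names=['age'], num_discrete_bins=2
)
print(contingency_similarity(real_binned, synthetic_binned))
```

Computing the bin edges from the real data only and reusing them for the synthetic data keeps the two contingency tables comparable, which is the property I would want from the standalone metric.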
Additional context
The new parameters we would add to ContingencySimilarity are only there for standalone, one-off uses of this metric.
Nothing needs to change in the quality report. For performance reasons, the quality report does its own binning first (across all columns) before calling ContingencySimilarity. We should continue doing this, as we want the quality report to run fast.
However, we may want to refactor the discretization logic so that both the ContingencySimilarity metric and Quality report can make use of it.