Adds support to specify categorical features in lgbm learner #197
Conversation
LightGBM can offer good accuracy when using its native categorical features. Unlike simple one-hot encoding, LightGBM can find the optimal split over categorical features, which can provide much better accuracy than a one-hot encoding solution. You can learn about this option in: https://github.com/microsoft/LightGBM/blob/master/docs/Advanced-Topics.rst#categorical-feature-support and https://github.com/Microsoft/LightGBM/blob/v3.3.1/docs/Parameters.rst
It is a `Union[List[str], str]`.
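Since the argument is typed as `Union[List[str], str]` (the string form being LightGBM's `'auto'`), a small validation helper could normalize it before handing it to LightGBM. This is a hypothetical sketch, not code from the PR; the function name is made up for illustration:

```python
from typing import List, Union


def normalize_categorical_features(
    categorical_features: Union[List[str], str]
) -> Union[List[str], str]:
    """Hypothetical helper: validate a categorical_features argument.

    LightGBM's Dataset accepts either the string 'auto' (infer
    categorical columns from pandas 'category' dtypes) or an explicit
    list of feature names.
    """
    if isinstance(categorical_features, str):
        if categorical_features != "auto":
            raise ValueError(
                "the only supported string value is 'auto', got %r"
                % categorical_features
            )
        return categorical_features
    # Copy into a plain list so callers can safely pass tuples, etc.
    return list(categorical_features)
```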
Codecov Report

```
@@           Coverage Diff           @@
##           master     #197   +/-   ##
=======================================
  Coverage   94.24%   94.24%
=======================================
  Files          32       32
  Lines        1928     1928
  Branches      258      258
=======================================
  Hits         1817     1817
  Misses         76       76
  Partials       35       35
```

Continue to review full report at Codecov.
Previously, the source was the underlying numpy array, but in order to allow `categorical_feature='auto'` we need to pass a DataFrame. I'm worried that the change below might alter the behavior of existing models (due to the use of pandas plus the `categorical_feature='auto'` setting):

```diff
- dtrain = lgbm.Dataset(df[features].values, label=df[target], feature_name=list(map(str, features)), weight=weights,
-                       silent=True)
+ dtrain = lgbm.Dataset(df[features], label=df[target], feature_name=list(map(str, features)), weight=weights,
+                       silent=True, categorical_feature=categorical_features)
```
I'd say this change is for the better, because it's the same behavior as using LightGBM directly. However, taking a closer look at the code, I see that when predicting, `.values` is still used (fklearn/src/fklearn/training/regression.py, line 487 in a3de09a), so it'll definitely cause some headaches if we don't change it there as well. Yet for SHAP the DataFrame is used (fklearn/src/fklearn/training/regression.py, line 492 in a3de09a).
I think it'd be best to use the DataFrame everywhere so as not to cause any surprises for the user, and using `.values` isn't always more efficient than the DataFrame. Also, the DataFrame allows using the categorical features in their "raw" form; i.e., if we leave the `.values` there, the user will always have to convert them to integer codes.
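To illustrate the point about "raw" categorical values: a pandas `category` column carries its dtype through the DataFrame, while `.values` flattens everything into a plain object array, which is why the user would otherwise have to pre-encode categories as integer codes. A minimal pandas-only sketch (no LightGBM call, just the dtype behavior):

```python
import pandas as pd

df = pd.DataFrame({
    "color": pd.Categorical(["red", "blue", "red"]),
    "price": [1.0, 2.0, 3.0],
})

# Passing the DataFrame keeps the 'category' dtype, which is what
# categorical_feature='auto' relies on to detect categorical columns.
print(df["color"].dtype)            # category

# Passing .values loses that information: the mixed dtypes collapse
# into a plain object ndarray of raw strings and floats.
arr = df[["color", "price"]].values
print(arr.dtype)                    # object

# With .values the user would have to pre-encode categories manually:
codes = df["color"].cat.codes.tolist()
print(codes)                        # [1, 0, 1]  ('blue' -> 0, 'red' -> 1)
```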
Agree with the above comments; this change itself looks good, but it's better to review the prediction path as well.
Uses the DataFrame everywhere possible.
Also adds a unittest.
Hi guys! The recent commits:
Fantastic
Status: READY
Todo list
Background context
Older versions of LightGBM used to allow passing `categorical_feature` through the `params` argument, but recent versions raise a warning for this. It is dubious (to me) whether the warning is misleading and the option still works. A LightGBM contributor said that the correct way to pass categorical features is through the Dataset object.
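To make the distinction concrete, here is a sketch of the two call shapes, built as plain dictionaries so it runs without LightGBM installed. The keyword names (`categorical_feature`, `feature_name`, `objective`) follow the LightGBM API; the variable names are made up for illustration:

```python
features = ["color", "size", "price"]
categorical_features = ["color", "size"]

# Older pattern: categorical_feature tucked into the training params.
# Recent LightGBM versions warn about this (whether the option still
# takes effect despite the warning is unclear, per the note above).
params_old = {
    "objective": "binary",
    "categorical_feature": categorical_features,
}

# Recommended pattern: pass it to the Dataset constructor instead,
# e.g. lgbm.Dataset(df[features], ..., **dataset_kwargs).
dataset_kwargs = {
    "feature_name": features,
    "categorical_feature": categorical_features,
}
```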
Description of the changes proposed in the pull request
Adds a new option `categorical_features` to `lgbm_classification_learner`. It allows a list of column names that should be treated as categorical features. Further instructions can be found in https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst