24.05.2018
- 'Ordinary' training vs. Gradient Boosting Machine
- XGBoost vs. LightGBM
- Boosting techniques
- Basic learning control parameters
- Observation sampling techniques
- Feature sampling techniques
- Optimizations for categorical features
- Dealing with missing data
- Dealing with big data
- Parallelizations
- Dealing with GPU
- Summary
Assume that we have:

formula | meaning, examples |
---|---|
$M^0$ | a neural network with fixed structure and initial weights |

For given $t \geq 0$:

formula | meaning, examples |
---|---|
$M^t$ | model after $t$ iterations |
$\hat{y}_i^t = M^t(x_i)$ | predictions |
$\ell(y_i, \hat{y}_i^t)$ | prediction errors, e.g. $(\hat{y}_i^t - y_i)^2$ |
$L = \frac{1}{n}\sum_{i=1}^n \ell(y_i, \hat{y}_i^t)$ | loss function, e.g. MSE for the above |
$g = \nabla_w L$ | gradients with respect to the weights $w$ |
$w \leftarrow w - \eta g$ | SGD step with learning rate $\eta$ |
$M^{t+1}$ | model after $t+1$ iterations |
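A minimal sketch of this loop, with a linear model standing in for the network and plain (non-stochastic) gradient descent for simplicity; all data and values below are made up:

```python
import numpy as np

# Linear model y_hat = X @ w standing in for M^t; the loss is the MSE.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # observations x_i
y = X @ np.array([1.0, -2.0, 0.5])       # targets y_i

w = np.zeros(3)                          # M^0: initial weights
eta = 0.1                                # learning rate

for t in range(200):
    y_hat = X @ w                        # predictions of M^t
    g = X.T @ (y_hat - y) / len(y)       # gradient of the MSE w.r.t. w
    w -= eta * g                         # gradient step -> M^{t+1}

print(np.round(w, 3))                    # approaches [1.0, -2.0, 0.5]
```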
For the Gradient Boosting Machine (GBM), instead, assume that we have:

formula | meaning, examples |
---|---|
$f^0$ | constant model |
$M^0 = f^0$ | constant model |

For given $t \geq 0$:

formula | meaning, examples |
---|---|
$M^t = f^0 + f^1 + \ldots + f^t$ | model after $t$ iterations |
$\hat{y}_i^t = M^t(x_i)$ | predictions |
$\ell(y_i, \hat{y}_i^t)$ | prediction errors, e.g. $(\hat{y}_i^t - y_i)^2$ |
$g_i$ | gradients, e.g. $g_i = \hat{y}_i^t - y_i$ for the square loss |
$f^{t+1}$ | a new learner: we approximate $f^{t+1}(x_i) \approx -g_i$ |
$M^{t+1} = M^t + \eta f^{t+1}$ | model after $t+1$ iterations, with learning rate $\eta$ |
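For contrast, a minimal sketch of the GBM loop above for the square loss, using sklearn regression trees as the learners (toy data made up):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = np.sin(4 * X[:, 0]) + X[:, 1] ** 2

eta = 0.1
f0 = y.mean()                              # f^0: constant model
y_hat = np.full_like(y, f0)                # predictions of M^0
trees = []                                 # kept to form M^t

for t in range(100):
    g = y_hat - y                          # gradients for the square loss
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, -g)                        # f^{t+1} approximates -g_i
    trees.append(tree)
    y_hat += eta * tree.predict(X)         # M^{t+1} = M^t + eta * f^{t+1}

print(round(np.mean((y_hat - y) ** 2), 5))  # training MSE shrinks with t
```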
Gradient Boosting Machine (GBM) using decision trees as the learner is called Gradient Boosting Decision Tree (GBDT).

General notes regarding GBDT:

- A smaller learning rate requires more iterations.
- Trees with $r$ leaves are able to model interactions between at most $r-1$ features. Therefore, e.g., if you bet that 5 features interact with each other, don't use trees with maximal depth 2 (they have at most 4 leaves, so they capture interactions of at most 3 features).
formula | meaning, examples |
---|---|
$x_{i,j}$ | features |
$g_i, h_i$ | gradients and hessians, e.g. $g_i = \hat{y}_i^t - y_i$, $h_i = 1$ for the square loss |
$i_1, \ldots, i_m$ | indices of observations in the current node, sorted by feature $j$ |
$I_L, I_R$ | indices of observations going to the left and right node, respectively |
$\mathrm{Gain} = \frac{1}{2}\left[\frac{\left(\sum_{i\in I_L} g_i\right)^2}{\sum_{i\in I_L} h_i + \lambda} + \frac{\left(\sum_{i\in I_R} g_i\right)^2}{\sum_{i\in I_R} h_i + \lambda} - \frac{\left(\sum_{i\in I_L\cup I_R} g_i\right)^2}{\sum_{i\in I_L\cup I_R} h_i + \lambda}\right] - \gamma$ | gain from the split |
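A sketch of exact split finding on one feature with this notation, assuming the second-order gain formula above (as in the XGBoost paper); `lam` and `gam` stand for the regularization constants $\lambda$ and $\gamma$, and the helper name `best_split` is made up:

```python
import numpy as np

def best_split(x, g, h, lam=1.0, gam=0.0):
    """Scan all split points of one feature, return (best gain, threshold)."""
    order = np.argsort(x)            # i_1, ..., i_m: indices sorted by the feature
    g, h = g[order], h[order]
    G, H = g.sum(), h.sum()          # sums over the whole node
    GL, HL = np.cumsum(g)[:-1], np.cumsum(h)[:-1]   # sums over I_L per split
    GR, HR = G - GL, H - HL                         # sums over I_R per split
    gain = 0.5 * (GL**2 / (HL + lam)
                  + GR**2 / (HR + lam)
                  - G**2 / (H + lam)) - gam
    best = int(np.argmax(gain))
    return gain[best], x[order][best]               # threshold: last point in I_L

rng = np.random.default_rng(0)
x = rng.normal(size=50)
g = np.where(x > 0, -1.0, 1.0) + rng.normal(scale=0.1, size=50)
h = np.ones(50)                      # h_i = 1 for the square loss
print(best_split(x, g, h))           # finds a threshold near 0
```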
According to the LightGBM paper, there are two main innovations in LightGBM compared to XGBoost: GOSS and EFB.
Lists of parameters for XGBoost and LightGBM are taken from here and here, respectively.

- LightGBM paper (messy): https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf
- Complete documentation of LightGBM: https://media.readthedocs.org/pdf/lightgbm/latest/lightgbm.pdf
- Magnificent parameters comparison: https://sites.google.com/view/lauraepp/parameters
- Comparison with Python snippets: https://towardsdatascience.com/catboost-vs-light-gbm-vs-xgboost-5f93620723db
Below I explain some non-obvious parameters.
XGBoost | LightGBM | meaning |
---|---|---|
`booster=` | `boosting=` | boosting technique |
`gbtree` (default) | `gbdt` (default) | GBDT |
`gblinear` | --- | linear learners |
`dart` | `dart` | DART |
--- | `goss` | GOSS |
In each iteration, DART uses only a random subset of the learners built so far; the rest are dropped.
Additional parameters:
XGBoost | LightGBM |
---|---|
`sample_type` | `uniform_drop` |
`rate_drop` | `drop_rate` |
`skip_drop` | `skip_drop` |
`normalize_type` | --- |
`one_drop` | --- |
--- | `max_drop` |
--- | `xgboost_dart_mode` |
--- | `drop_seed` |
Given $M^t = f^0 + f^1 + \ldots + f^t$:

- Sample a subset $D$ from $\{1,\ldots,t\}$.
- Perform a GBM iteration, but for $\tilde{M}^t = f^0 + \sum_{j\in D} f^j$, obtaining $\tilde{f}^{t+1}$.
- Let $k = t - |D|$ (the number of dropped learners).
- XGBoost and LightGBM with `xgboost_dart_mode=True`: put $$M^{t+1}=\sum_{j\in D}f^j+\frac{k}{k+1}\sum_{j\notin D}f^j+\frac{1}{k+1}\tilde{f}^{t+1}.$$
- LightGBM with `xgboost_dart_mode=False` (default): put $$M^{t+1}=\sum_{j\in D}f^j+\frac{k}{k+\eta}\sum_{j\notin D}f^j+\frac{\eta}{k+\eta}\tilde{f}^{t+1}.$$
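A minimal sketch of how the DART parameters above can be passed to both libraries; the toy regression data is made up for illustration:

```python
import numpy as np
import xgboost as xgb
import lightgbm as lgb

# Toy regression data, just for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=500)

xgb_params = {
    "booster": "dart",
    "rate_drop": 0.1,             # probability of dropping a learner
    "skip_drop": 0.5,             # probability of skipping dropout in an iteration
    "sample_type": "uniform",
    "normalize_type": "tree",
}
lgb_params = {
    "boosting": "dart",
    "drop_rate": 0.1,
    "skip_drop": 0.5,
    "max_drop": 50,               # upper bound on the number of dropped learners
    "xgboost_dart_mode": False,   # default normalization, see the formula above
}

xgb_model = xgb.train(xgb_params, xgb.DMatrix(X, label=y), num_boost_round=100)
lgb_model = lgb.train(lgb_params, lgb.Dataset(X, label=y), num_boost_round=100)
```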
Uses the sampled observations with the greatest gradients, and also some other observations, to build the next tree.
Additional parameters:
LightGBM | meaning |
---|---|
`top_rate` (default `0.2`) | fraction of large-gradient data |
`other_rate` (default `0.1`) | fraction of non-large-gradient data |
Given $(x_i, y_i)_{i=1}^n$:

- Compute gradients $g_i$ and hessians $h_i$ for each $i$.
- Let:
  - $a = $ `top_rate` $\cdot\, n$ (number of sampled observations with large gradients),
  - $b = $ `other_rate` $\cdot\, n$ (number of sampled observations with non-large gradients).
- Compute $g_i \cdot h_i$ for all $i$. For example, for the square loss function we have $g_i = \hat{y}_i^t - y_i$, $h_i = 1$.
- Take the $a$ observations with the greatest $g_i \cdot h_i$.
- Also sample $b$ other observations, obtaining $(x_i)_{i\in S}$, and multiply their gradients and hessians by $\frac{n-a}{b}$ (that is, by $\frac{1-\text{top\_rate}}{\text{other\_rate}}$) to compensate for the undersampling.
- Build the next tree $f^{t+1}$ with use of the sampled observations $(x_i)_{i\in S}$ only.
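A minimal numpy sketch of this sampling step (not library code; the helper name `goss_sample` is made up, and the sort uses the absolute value $|g_i \cdot h_i|$ since the magnitude is what matters):

```python
import numpy as np

def goss_sample(g, h, top_rate=0.2, other_rate=0.1, seed=0):
    """Sketch of GOSS: keep the a largest-|g*h| observations, plus a random
    b of the rest with their gradients/hessians scaled by (n - a) / b."""
    rng = np.random.default_rng(seed)
    n = len(g)
    a = int(top_rate * n)
    b = int(other_rate * n)
    order = np.argsort(-np.abs(g * h))            # descending by |g_i * h_i|
    top = order[:a]                               # large-gradient observations
    rest = rng.choice(order[a:], size=b, replace=False)
    weights = np.concatenate([np.ones(a), np.full(b, (n - a) / b)])
    idx = np.concatenate([top, rest])             # the set S
    # The tree is then built on (x_i) for i in S, using
    # g[idx] * weights and h[idx] * weights.
    return idx, weights

# Example: square loss, so h_i = 1.
g = np.random.default_rng(1).normal(size=1000)
h = np.ones_like(g)
idx, w = goss_sample(g, h)
print(len(idx), w.min(), w.max())                 # 300 kept, weights 1.0 and 8.0
```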
XGBoost | LightGBM | meaning |
---|---|---|
`eta` (default `0.3`) | `learning_rate` (default `0.1`) | learning rate $\eta$ |
`alpha` (default `0`) | `lambda_l1` (default `0`) | L1 regularization |
`lambda` (default `0`) | `lambda_l2` (default `0`) | L2 regularization |
`max_depth` (default `6`) | `max_depth` (default `-1`) | maximal depth of a tree |
`gamma` (default `0`) | `min_split_gain` (default `0`) | minimal gain to perform a split |
... | ... | ... |
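A sketch of equivalent settings on both sides; the values are arbitrary examples:

```python
# Each line pairs an XGBoost parameter with its LightGBM counterpart
# from the table above.
xgb_params = {
    "eta": 0.1,           # learning rate
    "alpha": 0.0,         # L1 regularization
    "lambda": 1.0,        # L2 regularization
    "max_depth": 6,       # maximal depth of a tree
    "gamma": 0.0,         # minimal gain to perform a split
}
lgb_params = {
    "learning_rate": 0.1,
    "lambda_l1": 0.0,
    "lambda_l2": 1.0,
    "max_depth": 6,       # default -1 means no depth limit
    "min_split_gain": 0.0,
}
```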
XGBoost | LightGBM | meaning |
---|---|---|
`subsample` (default `1`) | `bagging_fraction` (default `1`) | fraction of observations used for building each tree |
--- | `bagging_freq` | performs bagging every `bagging_freq` iterations |
--- | `bagging_seed` | random seed |
`max_bin` (default `256`) | `max_bin` (default `255`) | maximum number of histogram bins |
`tree_method=` | --- | split algorithm |
`auto` (default) | --- | automatic |
`exact` | --- | all the split points checked |
`approx` | --- | histogram algorithm, histograms rebuilt once per iteration |
`hist` | default | histogram algorithm, histograms built once |
`gpu_exact` | --- | `exact` for GPU |
`gpu_hist` | --- | `hist` for GPU |
--- | `boosting=goss` | GOSS |
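A sketch of row-sampling settings; note that in LightGBM, `bagging_fraction` only takes effect when `bagging_freq` is positive:

```python
xgb_params = {
    "subsample": 0.8,        # each tree sees a random 80% of the rows
    "tree_method": "hist",   # histogram-based split finding
    "max_bin": 255,
}
lgb_params = {
    "bagging_fraction": 0.8,
    "bagging_freq": 1,       # re-sample the rows every iteration
    "bagging_seed": 3,
    "max_bin": 255,
}
```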
XGBoost | LightGBM | meaning |
---|---|---|
`colsample_bytree` (default `1`) | `feature_fraction` (default `1`) | column sampling, once per tree |
`colsample_bylevel` (default `1`) | --- | column sampling, once per node |
--- | `feature_fraction_seed` | random seed |
`enable_feature_grouping` (default `0`) | --- | ? |
--- | `enable_bundle` (default `True`) | EFB |
Bundles mutually exclusive features into a single feature.
Note: this also bundles one-hot-encoded features back into a single one, so one-hot encoding is useless.
LightGBM offers some; see the parameters `categorical_feature`, `min_data_per_group`, `max_cat_threshold`, `cat_smooth`, `cat_l2` and `max_cat_to_onehot`.
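A sketch of passing integer-coded categorical features to LightGBM directly, instead of one-hot-encoding them (toy data made up):

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(0, 10, size=500).astype(float),  # integer-coded categorical feature
    rng.normal(size=500),                         # ordinary numerical feature
])
y = (X[:, 0] % 3) + X[:, 1]

# Column 0 is declared categorical, so no one-hot encoding is needed.
train = lgb.Dataset(X, label=y, categorical_feature=[0])
model = lgb.train({"objective": "regression"}, train, num_boost_round=50)
```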
XGBoost | LightGBM | meaning |
---|---|---|
default | `use_missing` (default `True`) | special handling of missing values |
--- | `zero_as_missing` (default `False`) | treats 0 as a missing value |
--- | `is_sparse` (default `True`) | optimizations for sparse data |

In XGBoost, missing values are ignored during split finding; afterwards, all of them go to either the left or the right node (whichever gives the better gain). With the default `use_missing=True`, LightGBM does the same.
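A short sketch showing that NaNs can simply be left in the data; with the default `use_missing=True` they are routed to the better child of each split (toy data made up):

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
X[rng.random(X.shape) < 0.1] = np.nan      # 10% missing values, left as NaN
y = np.where(np.isnan(X[:, 0]), 1.0, X[:, 0])

# No imputation step: missing values are handled natively.
model = lgb.train({"objective": "regression"},
                  lgb.Dataset(X, label=y), num_boost_round=50)
```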
XGBoost | LightGBM | meaning |
---|---|---|
--- | `two_round` (default `False`) | whether to avoid loading the whole dataset into memory |

When the data is bigger than the memory, set `two_round=True`. Otherwise, leave it as `False` for better speed.
XGBoost | LightGBM | meaning |
---|---|---|
`nthread` | `num_threads` | number of threads |
`updater` | `tree_learner` | the choice of tree learner |
- While setting `num_threads`, follow the number of real CPU cores, not logical ones.
- Do not set `num_threads` too large if your dataset is small (i.e. do not use 64 threads for a dataset with 10,000 rows).
- Be aware that a task manager or any similar CPU monitoring tool might report cores not being fully utilized. This is normal.
For LightGBM, the default value of `tree_learner` is `serial`. There are also three parallel scenarios prepared:

| | #data is small | #data is large |
|---|---|---|
| #feature is small | `feature` | `data` |
| #feature is large | `feature` | `voting` |
For parallel learning, you should not use all CPU cores, since this causes poor performance of the network communication. See also: http://lightgbm.readthedocs.io/en/latest/Parallel-Learning-Guide.html
Both XGBoost and LightGBM support MPI.
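A sketch of these settings; the core count is an arbitrary example, and the scenario is picked from the table above (large #data, small #feature, hence `data` parallel):

```python
lgb_params = {
    "num_threads": 8,         # number of real CPU cores, not logical ones
    "tree_learner": "data",   # large #data, small #feature (see the table)
}
xgb_params = {
    "nthread": 8,
}
```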
XGBoost | LightGBM | meaning |
---|---|---|
`predictor=` | `device=` | choice of device |
`cpu_predictor` (default) | `cpu` (default) | CPU |
`gpu_predictor` | `gpu` | GPU |
--- | `gpu_use_dp` (default `False`) | whether to use 64-bit floating point |
- In LightGBM with `device=gpu`, it is recommended to use a smaller `max_bin` (e.g. `63`) to get a better speedup.
- For better speed, the GPU uses 32-bit floating point for summation by default, which may affect the accuracy for some tasks. You can set `gpu_use_dp=True` to enable 64-bit floating point, but it will slow down the training.
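A sketch of a GPU configuration for LightGBM following these notes (requires a GPU build of LightGBM):

```python
lgb_params = {
    "device": "gpu",
    "max_bin": 63,          # smaller max_bin recommended on GPU
    "gpu_use_dp": False,    # 32-bit sums by default; True trades speed for accuracy
}
```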
- Try LightGBM, because why not.
- Try `boosting=goss`.
- Sparse features are bundled automatically (`enable_bundle=True` by default), so one-hot encoding is useless.
- Specify categorical features: `categorical_feature=`.
- Leave the default `use_missing=True`. Set `is_sparse=False` if your data are dense.
- Set `two_round=True` only if your data do not fit in memory.
- If you have several real CPU cores, set `num_threads` to their number.
- If your data are big, try the parallelization scenarios.
- You can build the GPU version of LightGBM. Then use a smaller `max_bin`. Set `gpu_use_dp=True` only if the accuracy is worse than on CPU.