
Custom MAE Python Loss Function Does Not Match C++ Results #6744

Open
OUStudent opened this issue Dec 10, 2024 · 7 comments
@OUStudent
OUStudent commented Dec 10, 2024

Description

An MAE loss function implemented in Python does not match the results of the C++ MAE implementation, despite being (1) mathematically correct (mae_loss_v1), (2) adjusted to use a hessian of ones (mae_loss_v2), and (3) modeled on the C++ implementation (mae_loss_v3).

Reproducible example

This Python code trains four regressors: one using mae_loss_v1 (mathematically correct, with a hessian of zeros), one using mae_loss_v2 (which returns a hessian of ones), one using mae_loss_v3 (modeled on the C++ implementation), and one using the built-in "mae" C++ loss function.

Mathematically, the gradient of MAE ($\frac{1}{n}\sum |y-\hat y|$, where $\hat y$ is the model output and $y$ is the target) is $\frac{\partial}{\partial \hat y} = -\mathrm{sign}(y - \hat y)$, and the hessian is $\frac{\partial^2}{\partial \hat y^2} = 0$.

The negative sign in front of the sign operation can be removed if MAE is rearranged to $\frac{1}{n}\sum |\hat y-y|$, since then $\frac{\partial}{\partial \hat y} = \mathrm{sign}(\hat y-y)$.
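That sign equivalence is easy to sanity-check with a few lines of numpy (my own snippet, not from the original report):

```python
import numpy as np

# Check that the two gradient forms are identical elementwise.
rng = np.random.default_rng(0)
y = rng.normal(size=5)        # targets
y_hat = rng.normal(size=5)    # model outputs

grad_a = -np.sign(y - y_hat)  # gradient of (1/n) * sum(|y - y_hat|)
grad_b = np.sign(y_hat - y)   # same gradient after rearranging the loss

assert np.array_equal(grad_a, grad_b)
```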

I based mae_loss_v3 off the C++ code below:
https://github.com/microsoft/LightGBM/blob/master/src/objective/regression_objective.hpp#L207-L234
Specifically:

const double diff = score[i] - label_[i];
gradients[i] = static_cast<score_t>(Common::Sign(diff));
hessians[i] = 1.0f;  // mathematically incorrect, as the true hessian is 0 (emphasis mine)
from sklearn.datasets import make_regression
import lightgbm as lgb
import numpy as np

print(lgb.__version__)

x, y = make_regression(n_samples=100, n_features=3, noise=1, random_state=42)

def mae_loss_v1(y_true, y_pred):  # mathematically correct
    grad = -np.sign(y_true - y_pred)
    hess = np.zeros(y_true.shape)
    return grad, hess

def mae_loss_v2(y_true, y_pred):
    grad = -np.sign(y_true - y_pred)
    hess = np.ones(y_true.shape)   # same as mae_loss_v1 except with ones as the hessian
    return grad, hess

def mae_loss_v3(y_true, y_pred):
    grad = np.sign(y_true - y_pred)
    hess = np.ones(y_true.shape)   # based on the C++ implementation, which has no negative sign in front of the gradient
    return grad, hess

def mae_metric(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

model = lgb.LGBMRegressor(objective=mae_loss_v1)
model.fit(x, y)
print(mae_metric(y, model.predict(x)))

model = lgb.LGBMRegressor(objective=mae_loss_v2)
model.fit(x, y)
print(mae_metric(y, model.predict(x)))

model = lgb.LGBMRegressor(objective=mae_loss_v3)
model.fit(x, y)
print(mae_metric(y, model.predict(x)))

model = lgb.LGBMRegressor(objective="mae")
model.fit(x, y)
print(mae_metric(y, model.predict(x)))

Printed results are

4.5.0              # LightGBM version
62.75745920285339  # mathematically correct MAE (mae_loss_v1)
56.65167879409149  # hessian of ones (mae_loss_v2)
69.47952394285281  # based on C++ MAE (mae_loss_v3)
16.44239336178542  # built-in C++ MAE

Environment info

LightGBM version 4.5.0
sklearn version 1.5.1

Additional Comments

There has been a long-standing problem of loss functions implemented on the Python side not matching C++ results. Most issues have no clear answer. If one cannot match the C++ results with a Python-side implementation of a loss function because of some obscure internal calculation, that needs to be communicated much better than it currently is.

@jmoralez
Collaborator

Hey @OUStudent, thanks for using LightGBM.

> Most issues have no clear answer.

Can you link them here? I'm pretty sure the answer in all of them is the init_score, which you're not setting in your example.

@OUStudent
Author

OUStudent commented Dec 11, 2024

> Can you link them here? I'm pretty sure the answer in all of them is the init_score, which you're not setting in your example.

Here is one example of a similar issue that was marked as completed, yet no definitive answer was given (#2128).

You also mention the init_score. After searching around, I understand that these are the initial values predicted by the model for the first round. However, I am finding a lack of documentation on (1) what values to set them to for each loss function (if these are being set somewhere in C++, it should be reported in the documentation), (2) a concrete example in the LightGBM documentation of how to use init_score together with a working Python loss function that matches the C++ code, and (3) a statement in the documentation that init_score is required in order to match the C++ code.

For the MAE loss function, what values should I set the init_score to? I have already tried an array of 1's and the mean of the target, but neither matches.

@shiyu1994
Collaborator

Hi, @OUStudent, thanks for using LightGBM.
For MAE as implemented in LightGBM, there are two key facts that make it differ from the customized implementations in your examples.

  1. The initial score. The median of the labels is used as the initial score for all samples.

     double BoostFromScore(int) const override {
       const double alpha = 0.5;
       if (weights_ != nullptr) {
         #define data_reader(i) (label_[i])
         #define weight_reader(i) (weights_[i])
         WeightedPercentileFun(label_t, data_reader, weight_reader, num_data_, alpha);
         #undef data_reader
         #undef weight_reader
       } else {
         #define data_reader(i) (label_[i])
         PercentileFun(label_t, data_reader, num_data_, alpha);
         #undef data_reader
       }
     }
  2. The refitting operation after a tree is grown with the MAE gradients and hessians (the hessians being all 1's).

     double RenewTreeOutput(double, std::function<double(const label_t*, int)> residual_getter,
                            const data_size_t* index_mapper,
                            const data_size_t* bagging_mapper,
                            data_size_t num_data_in_leaf) const override {
       const double alpha = 0.5;
       if (weights_ == nullptr) {
         if (bagging_mapper == nullptr) {
           #define data_reader(i) (residual_getter(label_, index_mapper[i]))
           PercentileFun(double, data_reader, num_data_in_leaf, alpha);
           #undef data_reader
         } else {
           #define data_reader(i) (residual_getter(label_, bagging_mapper[index_mapper[i]]))
           PercentileFun(double, data_reader, num_data_in_leaf, alpha);
           #undef data_reader
         }
       } else {
         if (bagging_mapper == nullptr) {
           #define data_reader(i) (residual_getter(label_, index_mapper[i]))
           #define weight_reader(i) (weights_[index_mapper[i]])
           WeightedPercentileFun(double, data_reader, weight_reader, num_data_in_leaf, alpha);
           #undef data_reader
           #undef weight_reader
         } else {
           #define data_reader(i) (residual_getter(label_, bagging_mapper[index_mapper[i]]))
           #define weight_reader(i) (weights_[bagging_mapper[index_mapper[i]]])
           WeightedPercentileFun(double, data_reader, weight_reader, num_data_in_leaf, alpha);
           #undef data_reader
           #undef weight_reader
         }
       }
     }

@OUStudent
Author

OUStudent commented Dec 11, 2024

@shiyu1994 Thank you. I believe (1) the initial score being the median of the labels should be better documented. However, for (2), I am struggling to understand how I can change that on my end on the Python side. Am I able to change this? If yes, can you please give a basic example of how to do so?

@shiyu1994
Collaborator

For (2), it is not trivial to implement with the Python interface alone. What I can suggest is using continued training for every iteration: get the leaf index prediction for every training data sample with pred_leaf=True in the predict method, and recalculate the leaf values as in the RenewTreeOutput method above.

@OUStudent
Author

@shiyu1994 Are you able to give a basic example of how to do so? Is there any documentation on how to do this in general for custom loss functions?

@jameslamb @jmoralez Why is there an advertisement on the ability to implement custom loss functions for LightGBM in Python when one cannot even implement the most basic of machine learning loss functions (MAE in this example) using the advertised tools without jumping through hoops?

Why is there such a lack of documentation on how to actually replicate loss functions in Python?

@jameslamb
Collaborator

> Why is there such a lack of documentation on how to actually replicate loss functions in Python?

We know that work needs to be done, it just hasn't been yet. It's tracked in #6440, you can subscribe to that if you'd like to be notified of changes.

> Why is there an advertisement on the ability to implement custom loss functions for LightGBM in Python when one cannot even implement the most basic ... without jumping through hoops?

It is not true that you "cannot implement" MAE through a custom objective function in lightgbm... you implemented 3 different versions of it in a few lines of Python code in the original post in this issue.

"implement MAE" and "match the behavior of LightGBM's C++ implementation of MAE" are different tasks. Math that might seem "basic" symbolically has more than one possible "correct" representation in software like this library, with different tradeoffs for speed, memory-efficiency, and precision. The "hoops" you're referring to come from trying to imitate LightGBM's particular choices, made in the face of those tradeoffs, in your own code.

Notice that despite you saying LightGBM's implementation is not "mathematically correct", it outperforms the Python one that you described as "mathematically correct". If you'd like to learn why, there's some more information on using the median in the ways @shiyu1994 described here: https://explained.ai/gradient-boosting/L1-loss.html#sec:1.3
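As a quick numeric illustration of that point (my own snippet): among all constant predictions, the one minimizing mean absolute error is the median, which is exactly what the BoostFromScore code above computes.

```python
import numpy as np

# Grid-search the constant c that minimizes mean(|y - c|).
rng = np.random.default_rng(42)
y = rng.normal(loc=5.0, scale=2.0, size=1001)

candidates = np.linspace(y.min(), y.max(), 5001)
mae = np.mean(np.abs(y[None, :] - candidates[:, None]), axis=1)
best = candidates[np.argmin(mae)]

# the grid minimizer lands (up to grid spacing) on the sample median
assert abs(best - np.median(y)) < 0.01
```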

> Printed results are

Your example code makes exact comparisons of results on a small dataset. That can be misleading... LightGBM's defaults for preventing overfitting, and some sources of nondeterminism (on by default in exchange for speed), mean those differences are driven by factors other than the objective function.

I adjusted your example to use these parameters:

common_params = {
    # turn off multi-threading (can lead to different results from numerical precision differences)
    "n_jobs": 1,
    # always use the same type of Dataset construction
    "force_row_wise": True,
    # prefer the slower but deterministic form of some other operations
    "deterministic": True,
    # allow leaf nodes that only match a single sample
    "min_data_in_leaf": 1,
    # use the same set of pseudorandom numbers for anything requiring randomness inside LightGBM
    "seed": 708
}

And ran that with Python 3.11 and lightgbm 4.5.0.
from sklearn.datasets import make_regression
import lightgbm as lgb
import numpy as np

print(lgb.__version__)

x, y = make_regression(n_samples=1_000, n_features=3, random_state=42)

def mae_loss_v1(y_true, y_pred):
    grad = -np.sign(y_true - y_pred)
    hess = np.zeros(y_true.shape)
    return grad, hess

def mae_loss_v2(y_true, y_pred):
    grad = -np.sign(y_true - y_pred)
    hess = np.ones(y_true.shape)
    return grad, hess

def mae_loss_v3(y_true, y_pred):
    grad = np.sign(y_true - y_pred)
    hess = np.ones(y_true.shape)
    return grad, hess

def mae_metric(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

common_params = {
    # turn off multi-threading (can lead to different results from numerical precision differences)
    "n_jobs": 1,
    # always use the same type of Dataset construction
    "force_row_wise": True,
    # prefer the slower but deterministic form of some other operations
    "deterministic": True,
    # allow leaf nodes that only match a single sample
    "min_data_in_leaf": 1,
    # use the same set of pseudorandom numbers for anything requiring randomness inside LightGBM
    "seed": 708
}

model = lgb.LGBMRegressor(**common_params, objective=mae_loss_v1)
model.fit(x, y)
print(f"mae_loss_v1: {mae_metric(y, model.predict(x))}")
# mae_loss_v1: 101.29960471190125

model = lgb.LGBMRegressor(**common_params, objective=mae_loss_v2)
model.fit(x, y)
print(f"mae_loss_v2: {mae_metric(y, model.predict(x))}")
# mae_loss_v2: 92.10924897938341

model = lgb.LGBMRegressor(**common_params, objective=mae_loss_v3)
model.fit(x, y)
print(f"mae_loss_v3: {mae_metric(y, model.predict(x))}")
# mae_loss_v3: 110.91077035541966

model = lgb.LGBMRegressor(**common_params, objective="mae")
model.fit(x, y)
print(f"built-in 'mae': {mae_metric(y, model.predict(x))}")
# built-in 'mae': 3.140136718338369

Saw these results:

mae_loss_v1: 101.29960471190125
mae_loss_v2: 92.10924897938341
mae_loss_v3: 110.91077035541966
built-in 'mae': 3.140136718338369

There is also work planned to make deterministic=True sufficient to replace all these other parameters for getting deterministic results. You can subscribe to that too if you'd like: #6731
