Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python-package] do not copy column-major numpy arrays when creating Dataset #6721

Merged
merged 12 commits into from
Dec 10, 2024

Conversation

jmoralez
Copy link
Collaborator

@jmoralez jmoralez commented Nov 12, 2024

Changes the logic to flatten the data array to consider its layout:

  • If the array is Fortran contiguous, then it's flattened as such and passed to the Dataset constructor indicating that the array is column-major, which avoids copying it to make it row-major.
  • If the array is C contiguous, no copy is made (current behavior).
  • If the array is not contiguous because it was sliced, then a copy is made to make it C-contiguous (current behavior).

I ran some quick tests with an array of (100k, 50) and the timings to construct the dataset from a C-contiguous and an F-contiguous arrays are roughly the same, so this doesn't introduce extra latency, just avoids the copy.

@jmoralez jmoralez changed the title do not copy column-major numpy arrays when creating Dataset WIP: [python-package] do not copy column-major numpy arrays when creating Dataset Nov 12, 2024
@jmoralez jmoralez marked this pull request as ready for review November 14, 2024 19:31
@jmoralez jmoralez changed the title WIP: [python-package] do not copy column-major numpy arrays when creating Dataset [python-package] do not copy column-major numpy arrays when creating Dataset Nov 14, 2024
@jmoralez
Copy link
Collaborator Author

This could also be done for the predict portion, i.e.

def __inner_predict_np2d(
self,
mat: np.ndarray,
start_iteration: int,
num_iteration: int,
predict_type: int,
preds: Optional[np.ndarray],
) -> Tuple[np.ndarray, int]:
if mat.dtype == np.float32 or mat.dtype == np.float64:
data = np.asarray(mat.reshape(mat.size), dtype=mat.dtype)
else: # change non-float data to float data, need to copy
data = np.array(mat.reshape(mat.size), dtype=np.float32)
ptr_data, type_ptr_data, _ = _c_float_array(data)
n_preds = self.__get_num_preds(
start_iteration=start_iteration,
num_iteration=num_iteration,
nrow=mat.shape[0],
predict_type=predict_type,
)
if preds is None:
preds = np.empty(n_preds, dtype=np.float64)
elif len(preds.shape) != 1 or len(preds) != n_preds:
raise ValueError("Wrong length of pre-allocated predict array")
out_num_preds = ctypes.c_int64(0)
_safe_call(
_LIB.LGBM_BoosterPredictForMat(
self._handle,
ptr_data,
ctypes.c_int(type_ptr_data),
ctypes.c_int32(mat.shape[0]),
ctypes.c_int32(mat.shape[1]),
ctypes.c_int(_C_API_IS_ROW_MAJOR),
ctypes.c_int(predict_type),
ctypes.c_int(start_iteration),
ctypes.c_int(num_iteration),
_c_str(self.pred_parameter),
ctypes.byref(out_num_preds),
preds.ctypes.data_as(ctypes.POINTER(ctypes.c_double)),
)
)

Please let me know if it's ok to include it here or in a separate PR.

Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thank you for very useful improvement!

I just left a couple of non-blocking comments below for your consideration.

Comment on lines 196 to 199
if mat.flags["F_CONTIGUOUS"]:
order = "F"
else:
order = "C"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just want to highlight that arrays can be C-contiguous and F-contiguous simultaneously.
https://stackoverflow.com/q/60230562

Arrays can be both C-style and Fortran-style contiguous simultaneously. This is clear for 1-dimensional arrays, but can also be true for higher dimensional arrays.
https://numpy.org/doc/stable/reference/generated/numpy.ndarray.flags.html

Seems it's correctly handled here.

python-package/lightgbm/basic.py Outdated Show resolved Hide resolved
python-package/lightgbm/basic.py Outdated Show resolved Hide resolved
tests/python_package_test/test_basic.py Outdated Show resolved Hide resolved
@StrikerRUS
Copy link
Collaborator

Please let me know if it's ok to include it here or in a separate PR.

I believe it would be better to have a follow-up PR for prediction after merging this one.

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great to me! I agree with all of @StrikerRUS 's comments, so won't approve until those are addressed. But I don't have any additional comments.

Thanks for working on this, and really nice tests!

@@ -4611,3 +4612,22 @@ def test_bagging_by_query_in_lambdarank():
ndcg_score_no_bagging_by_query = gbm_no_bagging_by_query.best_score["valid_0"]["ndcg@5"]
assert ndcg_score_bagging_by_query >= ndcg_score - 0.1
assert ndcg_score_no_bagging_by_query >= ndcg_score - 0.1


def test_equal_datasets_from_row_major_and_col_major_data(tmp_path):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this test now doesn't include train(), I think it should go to test_basic.py file.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry. Moved in 38c6786

@StrikerRUS
Copy link
Collaborator

@jameslamb

I agree with all of @StrikerRUS 's comments, so won't approve until those are addressed.

Would you like to take a look after the recent commits?

@jameslamb jameslamb self-requested a review December 8, 2024 04:38
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look great to me, thank you!

I looked at the failing Dask tests... they're unrelated to this PR. Documented in #6739.

@jmoralez do you have time in the next few days to investigate that?

@jmoralez
Copy link
Collaborator Author

jmoralez commented Dec 9, 2024

Sure, I'll take a look.

@StrikerRUS StrikerRUS merged commit ae76aad into master Dec 10, 2024
48 checks passed
@StrikerRUS StrikerRUS deleted the no-copy-np-col-major branch December 10, 2024 09:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants