[python-package] do not copy column-major numpy arrays when creating Dataset #6721

jmoralez · 2024-11-12T22:54:19Z

Changes the logic that flattens the data array so that it takes the array's memory layout into account:

- If the array is Fortran-contiguous, it is flattened in column-major order and passed to the Dataset constructor with a flag indicating that it is column-major, which avoids copying it to make it row-major.
- If the array is C-contiguous, no copy is made (current behavior).
- If the array is not contiguous because it was sliced, a copy is made to make it C-contiguous (current behavior).
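
The three cases above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `flatten_2d`, `ROW_MAJOR` and `COL_MAJOR` are hypothetical names, and the sketch leaves out the dtype conversion that the real code also has to handle.

```python
import numpy as np

# Hypothetical layout flags for illustration only; not LightGBM's constants.
COL_MAJOR = 0
ROW_MAJOR = 1


def flatten_2d(mat):
    """Flatten a 2-D array, copying only when it is not contiguous."""
    if mat.flags["F_CONTIGUOUS"]:
        order, layout = "F", COL_MAJOR
    else:
        order, layout = "C", ROW_MAJOR
    # np.asarray returns the input unchanged when it is already contiguous
    # in the requested order; otherwise it makes a contiguous copy.
    data = np.asarray(mat, order=order)
    # ravel returns a view when the data is contiguous in `order`.
    return data.ravel(order=order), layout


c_arr = np.arange(12, dtype=np.float64).reshape(3, 4)  # C-contiguous
f_arr = np.asfortranarray(c_arr)                       # F-contiguous
sliced = np.arange(24, dtype=np.float64).reshape(4, 6)[:, ::2]  # neither

flat_c, layout_c = flatten_2d(c_arr)
flat_f, layout_f = flatten_2d(f_arr)
flat_s, layout_s = flatten_2d(sliced)

print(np.shares_memory(flat_c, c_arr))   # no copy for C-contiguous input
print(np.shares_memory(flat_f, f_arr))   # no copy for F-contiguous input
print(np.shares_memory(flat_s, sliced))  # sliced input had to be copied
```

`np.shares_memory` is used here only to check whether the flattened result is a view of the input or a copy.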

I ran some quick tests with an array of shape (100k, 50), and the timings to construct the dataset from C-contiguous and F-contiguous arrays are roughly the same, so this doesn't introduce extra latency; it just avoids the copy.

jmoralez · 2024-11-14T19:44:08Z

This could also be done for the predict portion, i.e.

LightGBM/python-package/lightgbm/basic.py

Lines 1268 to 1307 in 5151fe8

    
    def __inner_predict_np2d(
        self,
        mat: np.ndarray,
        start_iteration: int,
        num_iteration: int,
        predict_type: int,
        preds: Optional[np.ndarray],
    ) -> Tuple[np.ndarray, int]:
        if mat.dtype == np.float32 or mat.dtype == np.float64:
            data = np.asarray(mat.reshape(mat.size), dtype=mat.dtype)
        else:  # change non-float data to float data, need to copy
            data = np.array(mat.reshape(mat.size), dtype=np.float32)
        ptr_data, type_ptr_data, _ = _c_float_array(data)
        n_preds = self.__get_num_preds(
            start_iteration=start_iteration,
            num_iteration=num_iteration,
            nrow=mat.shape[0],
            predict_type=predict_type,
        )
        if preds is None:
            preds = np.empty(n_preds, dtype=np.float64)
        elif len(preds.shape) != 1 or len(preds) != n_preds:
            raise ValueError("Wrong length of pre-allocated predict array")
        out_num_preds = ctypes.c_int64(0)
        _safe_call(
            _LIB.LGBM_BoosterPredictForMat(
                self._handle,
                ptr_data,
                ctypes.c_int(type_ptr_data),
                ctypes.c_int32(mat.shape[0]),
                ctypes.c_int32(mat.shape[1]),
                ctypes.c_int(_C_API_IS_ROW_MAJOR),
                ctypes.c_int(predict_type),
                ctypes.c_int(start_iteration),
                ctypes.c_int(num_iteration),
                _c_str(self.pred_parameter),
                ctypes.byref(out_num_preds),
                preds.ctypes.data_as(ctypes.POINTER(ctypes.c_double)),
            )
        )

Please let me know if it's ok to include it here or in a separate PR.

StrikerRUS

LGTM! Thank you for a very useful improvement!

I just left a couple of non-blocking comments below for your consideration.

StrikerRUS · 2024-11-23T16:50:52Z

python-package/lightgbm/basic.py

+    if mat.flags["F_CONTIGUOUS"]:
+        order = "F"
+    else:
+        order = "C"


Just want to highlight that arrays can be C-contiguous and F-contiguous simultaneously.
https://stackoverflow.com/q/60230562

> Arrays can be both C-style and Fortran-style contiguous simultaneously. This is clear for 1-dimensional arrays, but can also be true for higher dimensional arrays.
https://numpy.org/doc/stable/reference/generated/numpy.ndarray.flags.html

Seems it's correctly handled here.
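
For readers who want to see the point about simultaneous contiguity concretely, here is a small NumPy check. The helper name `layouts` is made up for this illustration and is not part of the PR.

```python
import numpy as np


def layouts(a):
    """Return (is_c_contiguous, is_f_contiguous) for a NumPy array."""
    return bool(a.flags["C_CONTIGUOUS"]), bool(a.flags["F_CONTIGUOUS"])


vec = np.zeros(5)       # 1-D array
row = np.zeros((1, 5))  # 2-D array with a single row
mat = np.zeros((3, 4))  # ordinary C-ordered matrix

print(layouts(vec))    # (True, True)
print(layouts(row))    # (True, True)
print(layouts(mat))    # (True, False)
print(layouts(mat.T))  # (False, True)
```

For an array that reports both flags, flattening in either order visits the elements in the same sequence, so checking `F_CONTIGUOUS` first does not change the flattened data in that case.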

python-package/lightgbm/basic.py

tests/python_package_test/test_basic.py

StrikerRUS · 2024-11-23T17:13:14Z

> Please let me know if it's ok to include it here or in a separate PR.

I believe it would be better to have a follow-up PR for prediction after merging this one.

tests/python_package_test/test_engine.py

jameslamb

This looks great to me! I agree with all of @StrikerRUS's comments, so I won't approve until those are addressed. But I don't have any additional comments.

Thanks for working on this, and really nice tests!

StrikerRUS · 2024-11-29T09:24:08Z

tests/python_package_test/test_engine.py

@@ -4611,3 +4612,22 @@ def test_bagging_by_query_in_lambdarank():
    ndcg_score_no_bagging_by_query = gbm_no_bagging_by_query.best_score["valid_0"]["ndcg@5"]
    assert ndcg_score_bagging_by_query >= ndcg_score - 0.1
    assert ndcg_score_no_bagging_by_query >= ndcg_score - 0.1
+
+
+def test_equal_datasets_from_row_major_and_col_major_data(tmp_path):


As this test no longer calls train(), I think it should go in the test_basic.py file.

Sorry. Moved in 38c6786

StrikerRUS · 2024-12-05T20:48:47Z

@jameslamb

> I agree with all of @StrikerRUS's comments, so won't approve until those are addressed.

Would you like to take a look after the recent commits?

jameslamb

Changes look great to me, thank you!

I looked at the failing Dask tests... they're unrelated to this PR. Documented in #6739.

@jmoralez do you have time in the next few days to investigate that?

jmoralez · 2024-12-09T16:49:00Z

Sure, I'll take a look.

do not copy column-major numpy arrays when creating Dataset

e3cc120

jmoralez changed the title ~~do not copy column-major numpy arrays when creating Dataset~~ WIP: [python-package] do not copy column-major numpy arrays when creating Dataset Nov 12, 2024

jmoralez added 2 commits November 12, 2024 17:47

fix logic

84607e3

lint

0d61224

jmoralez marked this pull request as ready for review November 14, 2024 19:31

jmoralez requested review from guolinke, jameslamb, shiyu1994, borchero and StrikerRUS as code owners November 14, 2024 19:31

jmoralez changed the title ~~WIP: [python-package] do not copy column-major numpy arrays when creating Dataset~~ [python-package] do not copy column-major numpy arrays when creating Dataset Nov 14, 2024

jmoralez added feature efficiency and removed feature labels Nov 14, 2024

Merge branch 'master' into no-copy-np-col-major

f6d58af

StrikerRUS approved these changes Nov 23, 2024

StrikerRUS reviewed Nov 24, 2024

tests/python_package_test/test_engine.py (outdated)

jameslamb reviewed Nov 27, 2024

jmoralez added 2 commits November 28, 2024 17:43

code review

f7df7ad

update test

50dda90

StrikerRUS reviewed Nov 29, 2024

jmoralez and others added 5 commits November 29, 2024 09:20

move dataset test to basic

38c6786

increase features

524f45e

assert single layout

95a21a4

Merge branch 'master' into no-copy-np-col-major

0d86338

Merge branch 'master' into no-copy-np-col-major

eb60e6e

jameslamb self-requested a review December 8, 2024 04:38

jameslamb approved these changes Dec 8, 2024

Merge branch 'master' into no-copy-np-col-major

1812257

StrikerRUS merged commit ae76aad into master Dec 10, 2024
48 checks passed

StrikerRUS deleted the no-copy-np-col-major branch December 10, 2024 09:11

jmoralez mentioned this pull request Dec 12, 2024

[python-package] do not copy column-major numpy arrays when predicting #6751

Merged
