[python-package] do not copy column-major numpy arrays when creating Dataset #6721
This could also be done for the prediction path, i.e. LightGBM/python-package/lightgbm/basic.py, lines 1268 to 1307 in 5151fe8.
Please let me know whether it's OK to include it here or in a separate PR.
LGTM! Thank you for this very useful improvement!
I just left a couple of non-blocking comments below for your consideration.
python-package/lightgbm/basic.py
Outdated
    if mat.flags["F_CONTIGUOUS"]:
        order = "F"
    else:
        order = "C"
Just want to highlight that arrays can be C-contiguous and F-contiguous simultaneously.
https://stackoverflow.com/q/60230562
Arrays can be both C-style and Fortran-style contiguous simultaneously. This is clear for 1-dimensional arrays, but can also be true for higher dimensional arrays.
https://numpy.org/doc/stable/reference/generated/numpy.ndarray.flags.html
Seems it's correctly handled here.
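To illustrate the point about simultaneous contiguity, here is a minimal sketch. The `pick_order` helper is hypothetical and only mirrors the shape of the reviewed snippet; it is not the code in the PR. It shows that a single-row 2-D array reports both flags, and that flattening it in either order yields the same sequence, so preferring "F" in that case should be harmless.

```python
import numpy as np


def pick_order(mat: np.ndarray) -> str:
    # Hypothetical helper mirroring the reviewed snippet: prefer "F" when the
    # array is F-contiguous, otherwise fall back to "C".
    return "F" if mat.flags["F_CONTIGUOUS"] else "C"


single_row = np.arange(5.0).reshape(1, 5)  # both C- and F-contiguous
c_mat = np.arange(12.0).reshape(3, 4)      # C-contiguous only
f_mat = np.asfortranarray(c_mat)           # F-contiguous only

both_flags = (
    single_row.flags["C_CONTIGUOUS"],
    single_row.flags["F_CONTIGUOUS"],
)

# When both flags are set, flattening in either order gives the same sequence,
# so the choice of order does not change the flattened data.
same_sequence = np.array_equal(
    single_row.ravel(order="C"), single_row.ravel(order="F")
)

orders = {
    name: pick_order(arr)
    for name, arr in [("single_row", single_row), ("c_mat", c_mat), ("f_mat", f_mat)]
}
print(both_flags, same_sequence, orders)
```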
I believe it would be better to have a follow-up PR for prediction after merging this one.
This looks great to me! I agree with all of @StrikerRUS's comments, so I won't approve until those are addressed. But I don't have any additional comments.
Thanks for working on this, and really nice tests!
@@ -4611,3 +4612,22 @@ def test_bagging_by_query_in_lambdarank():
    ndcg_score_no_bagging_by_query = gbm_no_bagging_by_query.best_score["valid_0"]["ndcg@5"]
    assert ndcg_score_bagging_by_query >= ndcg_score - 0.1
    assert ndcg_score_no_bagging_by_query >= ndcg_score - 0.1


def test_equal_datasets_from_row_major_and_col_major_data(tmp_path):
As this test no longer includes train(), I think it should go in the test_basic.py file.
Sorry. Moved in 38c6786
Would you like to take a look after the recent commits?
Sure, I'll take a look.
Changes the logic to flatten the data array to consider its layout:
I ran some quick tests with an array of shape (100k, 50), and the timings to construct the dataset from C-contiguous and F-contiguous arrays are roughly the same, so this doesn't introduce extra latency; it just avoids the copy.
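As a rough illustration of why flattening in the array's own layout avoids a copy, the sketch below uses a hypothetical `flatten_in_native_order` helper. It is an assumption about the general technique, not the code from this PR, and the array size is smaller than the one used in the timings above. Flattening an F-contiguous array with order "F" returns a view of the same memory, whereas forcing order "C" has to copy.

```python
import numpy as np


def flatten_in_native_order(mat: np.ndarray):
    # Hypothetical helper: flatten a 2-D array using the order that matches
    # its memory layout, and report which order was used so the consumer
    # can interpret the flat buffer correctly.
    order = "F" if mat.flags["F_CONTIGUOUS"] else "C"
    return mat.ravel(order=order), order


rng = np.random.default_rng(0)
c_data = rng.random((1000, 50))     # row-major (C-contiguous)
f_data = np.asfortranarray(c_data)  # column-major (F-contiguous)

c_flat, c_order = flatten_in_native_order(c_data)
f_flat, f_order = flatten_in_native_order(f_data)

# Forcing row-major flattening on column-major data requires a copy.
forced_c = f_data.ravel(order="C")

print(c_order, np.shares_memory(c_flat, c_data))  # view of the original buffer
print(f_order, np.shares_memory(f_flat, f_data))  # view of the original buffer
print(np.shares_memory(forced_c, f_data))         # separate buffer (copied)
```

Whoever consumes the flat buffer still has to be told which order was used, since the same values are laid out differently in the two cases.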