
Store numpy objects to be compatible across numpy 2.0 #782

Open · wants to merge 7 commits into master
Conversation

@sjoerd-bouma (Collaborator) commented Dec 19, 2024

This fixes numpy 2.0 cross-compatibility by using numpy.save and numpy.load internally. As pickle also stores the function used to unpickle the stored data, this should be fully backwards compatible (i.e. reading old .nur files is unchanged).
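For illustration, a minimal sketch of the approach, with hypothetical helper names (`_pack_array` / `_unpack_array`) that are not the actual PR code: arrays are written in numpy's own `.npy` format, and only the resulting bytes end up in the pickle stream, so the pickle itself no longer contains numpy internals that changed between 1.x and 2.x.

```python
import io
import pickle
import numpy as np

def _pack_array(arr):
    """Serialize an array in numpy's .npy format (stable across numpy 1.x/2.x)."""
    buf = io.BytesIO()
    np.save(buf, arr, allow_pickle=False)
    return buf.getvalue()

def _unpack_array(data):
    """Inverse of _pack_array."""
    return np.load(io.BytesIO(data), allow_pickle=False)

# the packed bytes, not the array object, go into the pickled event structure
trace = np.arange(2048, dtype=np.float64)
blob = pickle.dumps({"trace": _pack_array(trace)})
restored = _unpack_array(pickle.loads(blob)["trace"])
assert np.array_equal(trace, restored)
```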

There is some performance impact on reading (writing is essentially unchanged), but at least on my laptop (SSD) the file IO still dominates. So possibly on superfast PCIe SSDs the performance impact is noticeable, but the .nur format is probably not optimal for very high throughput in general.

This replaces #720. My proposal is to merge this to master as a hotfix first, and then into develop afterwards.

Medium to long term, there are other upgrades we could consider: safe(r) pickling (see https://docs.python.org/3/library/pickle.html#restricting-globals), and, before I thought of this approach, I had gotten halfway towards using h5py to store NuRadio events. I think it may still be worth finishing that up - there would be benefits for security, readability and potentially performance. But both of these can be discussed and implemented in future PRs.
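For reference, the restricted-globals pattern from the linked Python docs looks roughly like the sketch below. The allow-list is hypothetical - a real version would have to enumerate the globals that legitimately occur in .nur files.

```python
import builtins
import io
import pickle

# Hypothetical allow-list: a handful of safe builtins; a real implementation
# would add whatever globals the stored NuRadio objects actually reference.
_SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}

class RestrictedUnpickler(pickle.Unpickler):
    """Refuse to resolve any global that is not explicitly allow-listed."""

    def find_class(self, module, name):
        if module == "builtins" and name in _SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")

def restricted_loads(data):
    """Drop-in replacement for pickle.loads with a restricted namespace."""
    return RestrictedUnpickler(io.BytesIO(data)).load()
```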

@sjoerd-bouma (Collaborator, Author) commented:

So I probably didn't pick the ideal test file originally - it had no numpy ints or strings, and very long traces. Although I've now (I think) fixed the missing numpy scalars, I've also done some more extensive benchmarking... and it turns out that for smaller trace lengths the overhead is significant.

Benchmark result on master:

```
WARNING:test-io:Write speed:   4.08 ms / event (17500 events total).
WARNING:test-io:Read speed :   0.57 ms / event (17500 events total).
```

...and on the current branch:

```
WARNING:test-io:Write speed:   3.68 ms / event (17500 events total).
WARNING:test-io:Read speed :   2.35 ms / event (17500 events total).
```

The reason seems to be that numpy.load has a relatively high constant overhead per call, i.e. it only becomes competitive with pickle.loads for very large arrays. I've tried to benchmark that as well:

[benchmark plot: load time of numpy.load vs pickle.loads as a function of array size]

(I certainly don't trust some of the features of this plot - who knows what's going on in the background with memory layout, CPU caches etc. - but hopefully the overall trends can be believed. See also https://stackoverflow.com/a/58942584 for an independent, more exhaustive benchmark.)
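For context, a minimal sketch of this kind of micro-benchmark (not the exact script behind the plot above) could look like:

```python
import io
import pickle
import timeit

import numpy as np

def bench_load(n, repeats=500):
    """Average deserialization time of numpy.load vs pickle.loads for length-n arrays."""
    arr = np.random.default_rng(0).standard_normal(n)

    buf = io.BytesIO()
    np.save(buf, arr, allow_pickle=False)
    npy_bytes = buf.getvalue()
    pkl_bytes = pickle.dumps(arr)

    t_npy = timeit.timeit(lambda: np.load(io.BytesIO(npy_bytes), allow_pickle=False), number=repeats)
    t_pkl = timeit.timeit(lambda: pickle.loads(pkl_bytes), number=repeats)
    return t_npy / repeats, t_pkl / repeats

for n in (10, 1_000, 100_000, 1_000_000):
    t_npy, t_pkl = bench_load(n)
    print(f"n={n:>9,}: numpy.load {t_npy * 1e6:9.1f} us, pickle.loads {t_pkl * 1e6:9.1f} us")
```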

The short version is that pickle is much, much faster at loading small arrays, and as we currently store traces individually (each class has its own serialize function), we will probably not be able to do better unless we decide to store multiple (potentially many/all) traces 'together'. In principle traces can have arbitrary shapes, though, so that probably comes with its own difficulties and performance penalties. So we probably have to come back to this after Christmas.
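To illustrate the 'store traces together' idea, here is a hypothetical sketch (pack_traces / unpack_traces are made-up names, not existing NuRadio functions) that bundles several traces of different lengths into a single .npz blob, so numpy.load's constant overhead is paid once per bundle rather than once per trace:

```python
import io
import numpy as np

def pack_traces(traces):
    """Bundle a dict of (arbitrarily shaped) arrays into a single .npz blob."""
    buf = io.BytesIO()
    np.savez(buf, **{str(key): value for key, value in traces.items()})
    return buf.getvalue()

def unpack_traces(blob):
    """Inverse of pack_traces."""
    with np.load(io.BytesIO(blob)) as archive:
        return {key: archive[key] for key in archive.files}

traces = {f"ch{i}": np.random.standard_normal(256 + 32 * i) for i in range(4)}
restored = unpack_traces(pack_traces(traces))
assert all(np.array_equal(traces[key], restored[key]) for key in traces)
```

Whether the extra bookkeeping (mapping traces back to their channels/stations) would be worth it is exactly the kind of question that would need profiling against the per-event layout of the .nur format.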
