
Store numpy objects to be compatible across numpy 2.0 #782

Open · wants to merge 7 commits into master
Conversation

@sjoerd-bouma (Collaborator) commented Dec 19, 2024

This fixes numpy 2.0 cross-compatibility by using numpy.save and numpy.load internally. As pickle also stores the function used to unpickle the stored data, this should be fully backwards compatible (i.e. reading old .nur files is unchanged).
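For illustration, a minimal sketch of the approach, with hypothetical helper names (`_pack_array` / `_unpack_array`) that are not the actual PR code: arrays are written in numpy's own `.npy` format, and only the resulting bytes end up in the pickle stream, so the pickle itself no longer contains numpy internals that changed between 1.x and 2.x.

```python
import io
import pickle
import numpy as np

def _pack_array(arr):
    """Serialize an array in numpy's .npy format (stable across numpy 1.x/2.x)."""
    buf = io.BytesIO()
    np.save(buf, arr, allow_pickle=False)
    return buf.getvalue()

def _unpack_array(data):
    """Inverse of _pack_array."""
    return np.load(io.BytesIO(data), allow_pickle=False)

# the packed bytes, not the array object, go into the pickled event structure
trace = np.arange(2048, dtype=np.float64)
blob = pickle.dumps({"trace": _pack_array(trace)})
restored = _unpack_array(pickle.loads(blob)["trace"])
assert np.array_equal(trace, restored)
```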

There is some performance impact on reading (writing is essentially unchanged), but at least on my laptop (SSD) the file IO still dominates. So possibly on superfast PCIe SSDs the performance impact is noticeable, but the .nur format is probably not optimal for very high throughput in general.

This replaces #720. My proposal is to merge this to master as a hotfix first, and then into develop afterwards.

Medium to long term, there are other upgrades we could consider: safe(r) pickling (see https://docs.python.org/3/library/pickle.html#restricting-globals), and, before I thought of this approach, I had gotten halfway towards using h5py to store NuRadio events. I think it may still be worth finishing that up - there would be benefits for security, readability and potentially performance. But both of these can be discussed and implemented in future PRs.
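For reference, the restricted-globals pattern from the linked Python docs looks roughly like the sketch below. The allow-list is hypothetical - a real version would have to enumerate the globals that legitimately occur in .nur files.

```python
import builtins
import io
import pickle

# Hypothetical allow-list: a handful of safe builtins; a real implementation
# would add whatever globals the stored NuRadio objects actually reference.
_SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}

class RestrictedUnpickler(pickle.Unpickler):
    """Refuse to resolve any global that is not explicitly allow-listed."""

    def find_class(self, module, name):
        if module == "builtins" and name in _SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")

def restricted_loads(data):
    """Drop-in replacement for pickle.loads with a restricted namespace."""
    return RestrictedUnpickler(io.BytesIO(data)).load()
```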

@sjoerd-bouma (Collaborator, Author) commented:

So I probably didn't pick the ideal test file originally - it had no numpy ints or strings, and very long traces. Although I've now (I think) fixed the missing numpy scalars, I've also done some more extensive benchmarking... and it turns out that for smaller trace lengths the overhead is significant.

Benchmark result on master:

```
WARNING:test-io:Write speed:   4.08 ms / event (17500 events total).
WARNING:test-io:Read speed :   0.57 ms / event (17500 events total).
```

...and on the current branch:

```
WARNING:test-io:Write speed:   3.68 ms / event (17500 events total).
WARNING:test-io:Read speed :   2.35 ms / event (17500 events total).
```

The reason seems to be that numpy.load has a relatively high constant overhead per call, i.e. it only becomes competitive with pickle.loads for very large arrays. I've tried to benchmark that as well:

[benchmark plot: load time of numpy.load vs pickle.loads as a function of array size]

(I certainly don't trust some of the features of this plot - who knows what's going on in the background with memory layout, CPU caches etc. - but hopefully the overall trends can be believed. See also https://stackoverflow.com/a/58942584 for an independent, more exhaustive benchmark.)
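For context, a minimal sketch of this kind of micro-benchmark (not the exact script behind the plot above) could look like:

```python
import io
import pickle
import timeit

import numpy as np

def bench_load(n, repeats=500):
    """Average deserialization time of numpy.load vs pickle.loads for length-n arrays."""
    arr = np.random.default_rng(0).standard_normal(n)

    buf = io.BytesIO()
    np.save(buf, arr, allow_pickle=False)
    npy_bytes = buf.getvalue()
    pkl_bytes = pickle.dumps(arr)

    t_npy = timeit.timeit(lambda: np.load(io.BytesIO(npy_bytes), allow_pickle=False), number=repeats)
    t_pkl = timeit.timeit(lambda: pickle.loads(pkl_bytes), number=repeats)
    return t_npy / repeats, t_pkl / repeats

for n in (10, 1_000, 100_000, 1_000_000):
    t_npy, t_pkl = bench_load(n)
    print(f"n={n:>9,}: numpy.load {t_npy * 1e6:9.1f} us, pickle.loads {t_pkl * 1e6:9.1f} us")
```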

The short version is that pickle is much, much faster at loading small arrays, and as we currently store traces individually (each class has its own serialize function), we will probably not be able to do better unless we decide to store multiple (potentially many/all) traces 'together'. In principle traces can have arbitrary shapes, though, so that probably comes with its own difficulties and performance penalties. So we probably have to come back to this after Christmas.
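To illustrate the 'store traces together' idea, here is a hypothetical sketch (pack_traces / unpack_traces are made-up names, not existing NuRadio functions) that bundles several traces of different lengths into a single .npz blob, so numpy.load's constant overhead is paid once per bundle rather than once per trace:

```python
import io
import numpy as np

def pack_traces(traces):
    """Bundle a dict of (arbitrarily shaped) arrays into a single .npz blob."""
    buf = io.BytesIO()
    np.savez(buf, **{str(key): value for key, value in traces.items()})
    return buf.getvalue()

def unpack_traces(blob):
    """Inverse of pack_traces."""
    with np.load(io.BytesIO(blob)) as archive:
        return {key: archive[key] for key in archive.files}

traces = {f"ch{i}": np.random.standard_normal(256 + 32 * i) for i in range(4)}
restored = unpack_traces(pack_traces(traces))
assert all(np.array_equal(traces[key], restored[key]) for key in traces)
```

Whether the extra bookkeeping (mapping traces back to their channels/stations) would be worth it is exactly the kind of question that would need profiling against the per-event layout of the .nur format.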
