How to serialize models #12
Replies: 36 comments
-
I guess in joblib you can use the `compress` option to make a single file: https://pythonhosted.org/joblib/persistence.html
-
However, reading this article (https://pythonhosted.org/joblib/generated/joblib.dump.html), I don't think the parameter is boolean; rather, it is an integer from 0 to 9.
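For what it's worth, a minimal sketch of what that looks like (the model and file name here are just placeholders):

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier().fit(X, y)

# compress takes an integer from 0 (no compression) to 9 (maximum);
# any nonzero level also writes everything into a single file.
joblib.dump(model, "model.joblib", compress=3)

restored = joblib.load("model.joblib")
```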
-
Ah, that looks really useful. I did notice that joblib pickles are not supported across Python versions. Does that mean that if someone built a scikit-learn model with Python 2 it cannot be loaded by someone running Python 3? Should we worry about that, or can it be easily solved?
-
Where did you read that?
-
@zardaloop At the bottom of the link you posted :)
-
Well, I guess you really need to rethink this, because joblib is only for local storage, and that's all it does. Even for scikit-learn to be able to rebuild a model with a future version, additional metadata needs to be stored along with the pickled model, which contains: the training data (e.g. a reference to an immutable snapshot), the Python source code used to generate the model, the versions of scikit-learn and its dependencies, and the cross-validation score obtained on the training data.
-
Therefore, as Matthias recommended, I also think pickle is your best bet. But you need to make sure to include the metadata along with the pickled model so it can work in a future version of scikit-learn 😊
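A minimal sketch of bundling such metadata next to the pickle (the field names and file layout are illustrative assumptions, not an OpenML or scikit-learn convention):

```python
import json
import pickle
import platform

import sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier().fit(X, y)

# Serialize the fitted model itself.
with open("model.pkl", "wb") as f:
    pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)

# Store versioning metadata next to it, so a future reader can
# reconstruct a compatible environment before unpickling.
metadata = {
    "python_version": platform.python_version(),
    "sklearn_version": sklearn.__version__,
    "model_class": type(model).__name__,
}
with open("model_meta.json", "w") as f:
    json.dump(metadata, f, indent=2)
```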
-
I'm not sure if it's possible to easily read pickles written with Python 2 in Python 3 and vice versa. Given that Python 2 is getting less and less used, one might consider not supporting it at all. Besides that, @zardaloop has a valid point that storing sklearn models is not that easy, and I don't think sklearn has a common way to solve this issue except storing all the metadata, as @zardaloop suggested. We should have a look at this in the new year.
-
I think joblib will do single-file exports soon; maybe for the moment pickle is enough. Be sure to use the latest pickle protocol, because the default results in much larger files (at least in Python 2, not sure about Python 3).

Both joblib and pickle have the issue that they serialize a class without the corresponding class definition. So it is only guaranteed that a model will work and give the same result when using the exact same code it was created with. To make sure a result is entirely reproducible, the "easiest" way is to use Docker containers or similar virtual environments (conda envs might be enough) with the exact same version of everything.

What is your exact use case? If you want to load a model that "works", having the same scikit-learn version is sufficient. Even if the learning of a model, and therefore the serialization, didn't change between versions, it could be that a bug in the prediction code was fixed. So even if you can load a model from an older version, it is not ensured that you get similar predictions.

Hope that helps. This is a tricky issue. Feel free to ping me on these discussions; I don't generally follow the tracker at the moment, but I'm happy to give input.
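To illustrate the protocol point, a quick sketch comparing the default old protocol with the highest one (model and sizes are illustrative):

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier().fit(X, y)

# Protocol 0 (the old Python 2 default) is a verbose ASCII format;
# HIGHEST_PROTOCOL opts into the most compact binary format the
# running interpreter supports.
old = pickle.dumps(model, protocol=0)
new = pickle.dumps(model, protocol=pickle.HIGHEST_PROTOCOL)
print(len(old), len(new))  # the protocol-0 pickle is noticeably larger
```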
-
Thanks, Andreas, for your valuable input. When it comes down to sharing the […] The reproducibility discussion is equally important though, and we should […]
-
The best I can do at the moment is to offer advice on what not to do: don't use pickle! Here's a summary as to why: http://eev.ee/blog/2015/10/15/dont-use-pickle-use-camel/ I'm not sure what one should use instead, though... still trying to figure that out myself.
-
Interesting as that blog post is, do we really have an alternative right […] Incidentally, what causes pickles to break? Will they still break if one […] Practically speaking, for the experiments that I want to run now, is it […]
-
If you think that, you overestimate our resources by a lot. We haven't been able to provide better backward compatibility, even with pickle.

Well, getting the same predictions for the same instances can really only be guaranteed with a full container (because of BLAS issues etc.). If your system is reasonably static, storing the scikit-learn version will work as an intermediate solution. But once your hosting provider upgrades their distribution, you might be in trouble.

We haven't done Docker containers for reproducibility. We use Travis, CircleCI and AppVeyor for continuous integration, but we don't really have a need to create highly reproducible environments.
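Building on the "store the scikit-learn version" suggestion, a load-time check could look roughly like this (the file names and metadata layout are assumptions carried over from the earlier sketch):

```python
import json
import pickle
import warnings

import sklearn

# Read the metadata that was stored next to the pickle.
with open("model_meta.json") as f:
    metadata = json.load(f)

# Warn if the current environment differs from the training one;
# predictions may then differ subtly even if loading succeeds.
if metadata["sklearn_version"] != sklearn.__version__:
    warnings.warn(
        f"Model was trained with scikit-learn "
        f"{metadata['sklearn_version']}, but "
        f"{sklearn.__version__} is installed."
    )

with open("model.pkl", "rb") as f:
    model = pickle.load(f)
```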
-
I think pickle or joblib or dill + conda is the best solution for now, with pickle or joblib or dill + conda + Docker as the optimal upgrade.
-
@mikecroucher asked me to comment. I'm a Python old-hand but know nothing of scikit-learn, so what I have to say is slanted more towards generic Python advice. To be able to answer a question like "is pickle adequate" we have to be able to pin down some requirements. For example, is it required that: […]

I would guess that various people would want all of these in some combination, so the real issue is how much you want to pay (in money, time, and tears) for each of these things. Additionally, there are various semantic issues. For example: I might be able to load the model, but it gives different predictions, yet the predictions are different only in ways that are unimportant (for example, a few ULP). @amueller seems to be aware of these. With that in mind, […]
-
Just FTR, since I was asked: I don't know enough about conda to have a reliable opinion, but if it can be used to record all versions of all software in use (as @amueller suggests), then that's a good start.
-
@joaquinvanschoren Ok, if people can submit their models, then you would need them to use conda and submit their conda environment config with the model. That is not terribly hard and probably the most feasible way. There might still be minor differences due to the OS, but the only way to avoid those is to have every user work in a virtual machine (or Docker container) and provide the virtual machine along with the model. That is way more complicated, and probably not worth the effort. @drj11 conda is basically a cross-platform package manager that ships binaries (unlike pip), mostly for Python and related scientific software.
-
btw, you might be interested in ReproZip and hyperos, which are two approaches to creating reproducible environments (but they are kinda alpha-stage, iirc). Conda or Docker seem the better choices for now. If someone wrote a custom transformer (which probably most interesting models have), you have some code that is not part of a standard package. So in addition to the environment config you get from conda, and the state you get from pickle, you also need access to the source of the custom part.
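To make the custom-transformer problem concrete, here is a small sketch (the transformer itself is a made-up example). Pickle stores only a reference to the class by module path, not the class body, so unpickling in another environment fails unless that source is importable there:

```python
import pickle

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


class LogScaler(BaseEstimator, TransformerMixin):
    """A hypothetical custom transformer defined in user code."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.log1p(X)


pipe = make_pipeline(LogScaler(), LogisticRegression())

# The pickle records only a reference like "__main__.LogScaler";
# the class body itself is NOT serialized. Unpickling on a machine
# without this source raises an AttributeError/ModuleNotFoundError.
blob = pickle.dumps(pipe)
```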
-
@zardaloop has asked me to comment here. I am not very familiar with the situation, so my comment will be generic. I don't have much experience with serialization, so I can't comment on […]
-
It would be great to rekindle this discussion, because it looks like it was converging towards a good solution, and storing models in OpenML would be very useful. Would a conda + joblib/dill/pickle approach work? Even if it covers a large percentage of use cases, it would make many people happy :)
-
Another thing I'd like to mention: security. Pickle is a bit insecure, and I am very hesitant to put a solution based on pickle in the Python package. See here.
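For anyone unfamiliar with why pickle is considered insecure: unpickling can execute arbitrary code. A classic minimal demonstration (purely illustrative, do not run on anything you care about):

```python
import os
import pickle


class Exploit:
    # pickle calls __reduce__ to decide how to reconstruct the
    # object; returning (callable, args) makes pickle.loads call
    # that callable, here an arbitrary shell command.
    def __reduce__(self):
        return (os.system, ("echo 'arbitrary code ran on unpickle'",))


payload = pickle.dumps(Exploit())
pickle.loads(payload)  # runs the shell command: never unpickle untrusted data
```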
-
Is there any (scientific or practical) use case in which storing models becomes relevant? The only thing I can think of is when a new test set of data becomes available, so the model can be re-evaluated on it. However, that unfortunately rarely happens.
-
I agree it is challenging, but I would really love to track the models I'm building. Maybe not during a large-scale benchmark, but there are plenty of other cases where I either want to look at the models to better understand what they are doing, or share them so that other people can learn from them and reuse them.
-
I recently talked to Matei (MLflow). They use a simple format which is just a file containing the model (could be a pickle) plus some metadata on how to read it in. It is probably best to leave this to the user. The Python API should just retrieve the file and the metadata telling the user what to do with it. Reading in models will probably only be done occasionally.
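A rough sketch of that kind of "opaque file plus metadata" layout (all field names here are invented for illustration, not MLflow's actual MLmodel schema):

```python
import json
import pickle

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier().fit(X, y)

# The model artifact itself stays opaque to the server...
with open("model.bin", "wb") as f:
    pickle.dump(model, f)

# ...and a small metadata file tells the consumer how to read it.
meta = {
    "format": "pickle",          # how the bytes should be deserialized
    "flavor": "sklearn",         # which library produced the model
    "load_with": "pickle.load",  # hint for the consumer
}
with open("model.json", "w") as f:
    json.dump(meta, f, indent=2)
```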
-
One more thing to keep in mind is file size. Running the default scikit-learn random forest on the popular EEG-Eye-State dataset (1471) results in 7.5 MB:

```python
import pickle

import openml
import sklearn.ensemble

data = openml.datasets.get_dataset(1471)
X, y = data.get_data(target=data.default_target_attribute)
rf = sklearn.ensemble.RandomForestClassifier()
rf.fit(X, y)
string = pickle.dumps(rf)
len(string) / 1024. / 1024.
# 7.461672782897949
```

The most popular task on that dataset has ~85k runs; assuming that only 1 percent of these are random forests, that would require at least 6.3 GB. If you increased the number of trees from 10 to something reasonable, this space requirement would grow drastically.
-
Hi everyone! I've been thinking a lot about this issue these past days, slightly more related to operationalization, pipeline reuse (e.g. eval), retraining, and complete reproducibility. I remembered from Joaquin that this was a hot question for OpenML, and this thread was a great read/help! I'm perfectly aware of the security implications and the overall versioning issues of loaded resources, but even so, pipelines really solve so many of the issues that were bothering me... (if only they could be slightly easier to work with :) )

One additional problem, as mentioned above by @amueller: custom transformers. If we have to track the actual code for these, it is hard to see how this could be properly operationalized (and it is very error-prone). I did some tests with cloudpickle (dill will probably do similar?), and it seems to persist everything that is needed: no need to save/track any custom transformer code, and it can load multiple pipelines with no problem. Everything seems really straightforward: save pipeline, (on a new kernel) load, predict, refit; it just works. Huge flexibility, e.g. eval on new refits. I also did some experiments on mixing a sequential preparation flow, but in a fit/transform-compatible way... (seems too good to be true... what do you think?)
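A minimal sketch of the cloudpickle round-trip described above (the transformer is a made-up example; cloudpickle serializes classes defined in `__main__` by value, so the loading side does not need the source):

```python
import cloudpickle
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


class ClipOutliers(BaseEstimator, TransformerMixin):
    """A made-up custom transformer to exercise serialization."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.clip(X, -3, 3)


X, y = load_iris(return_X_y=True)
pipe = make_pipeline(ClipOutliers(), LogisticRegression(max_iter=1000)).fit(X, y)

# Unlike plain pickle, cloudpickle serializes the code of classes
# defined in __main__, so the blob is self-contained.
with open("pipeline.cpkl", "wb") as f:
    cloudpickle.dump(pipe, f)

# In a new kernel/process (without the ClipOutliers source):
with open("pipeline.cpkl", "rb") as f:
    restored = cloudpickle.load(f)
print(restored.predict(X[:5]))
```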
-
Similar concept, using dill.
-
P.S. Like mentioned above, the size and the number of runs will probably be a challenge for OpenML. Note also that if the pipeline was really a grid-search fit, then refitting would be rather expensive. :)
-
For serialization, the ONNX format might also be relevant (cf. https://github.com/onnx/onnxmltools).
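For context, converting a fitted scikit-learn model with skl2onnx (the scikit-learn converter that the linked onnxmltools builds on) looks roughly like this; treat it as a sketch, since the exact API has shifted across versions:

```python
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier().fit(X, y)

# Declare the input signature: a float tensor with n_features columns.
initial_types = [("input", FloatTensorType([None, X.shape[1]]))]
onx = convert_sklearn(model, initial_types=initial_types)

# The result is a portable protobuf file, independent of pickle
# and of the Python/scikit-learn versions used for training.
with open("model.onnx", "wb") as f:
    f.write(onx.SerializeToString())
```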
-
More of a developer-to-developer question: we are working on exporting scikit-learn runs, but we are unsure what the best way is to share learned models. At first sight, creating a pickle is the best and most general way to go. Matthias confirms that this works with scikit-learn SVMs, even though the files can get large for large datasets.
However, scikit-learn recommends using joblib because it is more efficient: http://scikit-learn.org/stable/modules/model_persistence.html
The problem here is that it creates a bunch of files in a folder. This is much harder to share, and sending many, many files to the OpenML server for every single run seems unwieldy and error-prone.
Would creating a single pickle file still be the best way forward, or is there a better solution?