How to serialize models #12

joaquinvanschoren · 2015-12-23T22:14:20Z

joaquinvanschoren
Dec 23, 2015
Maintainer

More of a developer-to-developer question: we are working on exporting scikit-learn runs, but we are unsure what is the best way to share learned models. On first sight, creating a pickle is the best and most general way to go. Matthias confirms that this works with scikit-learn SVMs, even though the files can get large for large datasets.

However, scikit-learn recommends to use joblib because it is more efficient: http://scikit-learn.org/stable/modules/model_persistence.html

The problem here is that it creates a bunch of files in a folder. This is much harder to share, and sending many many files to the OpenML server for every single run seems unwieldy and error-prone.

Would creating a single pickle file still be the best way forward, or is there a better solution?

zardaloop · 2015-12-23T22:47:03Z

zardaloop
Dec 23, 2015

I guess in joblib you can use the compress option to make a single file https://pythonhosted.org/joblib/persistence.html
Would that answer your question ?

0 replies

zardaloop · 2015-12-23T23:01:08Z

zardaloop
Dec 23, 2015

However by reading this article (https://pythonhosted.org/joblib/generated/joblib.dump.html) I don't think the parameter is boolean, instead it is an integer between 0 to 9.

0 replies

joaquinvanschoren · 2015-12-23T23:02:34Z

joaquinvanschoren
Dec 23, 2015
Maintainer Author

Ah, that looks really useful.

I did notice that joblib pickles are not supported across python versions. That means that if someone built a scikit-learn model with Python 2 it cannot be loaded by someone running it on Python 3? Should we worry about that or can it be easily solved?

0 replies

zardaloop · 2015-12-24T06:46:14Z

zardaloop
Dec 24, 2015

Where did you read that?

0 replies

joaquinvanschoren · 2015-12-24T10:18:19Z

joaquinvanschoren
Dec 24, 2015
Maintainer Author

@zardaloop On the bottom of the link you posted :)
https://pythonhosted.org/joblib/persistence.html

0 replies

zardaloop · 2015-12-24T11:00:32Z

zardaloop
Dec 24, 2015

Well I guess you really need to rethink about this . Because joblib is only for local storage and that's all about it. Eeven scikit-learn to be able to rebuild a model with its future version needs additional metadata along with the pickled model which contains :
The training data, e.g. a reference to a immutable snapshot
The python source code used to generate the model
The versions of scikit-learn and its dependencies
The cross validation score obtained on the training data
http://scikit-learn.org/stable/modules/model_persistence.html

0 replies

zardaloop · 2015-12-24T11:07:30Z

zardaloop
Dec 24, 2015

Therefore as Matthias recomended I also think pickle is your best bet. But you need to make sure to include the metadata along with the pickled model so it can work in the future version of scikit-learn 😊

0 replies

mfeurer · 2015-12-24T11:41:18Z

mfeurer
Dec 24, 2015
Maintainer

I'm not sure if it's possible to easily read pickles which were done with python2 in python3 and vice versa. Given that python2 is getting less and less used, one might think of not supporting it at all.

Besides that, @zardaloop has a valid point that storing sklearn models is not that easy and I don't think sklearn has a common way to solve this issue except storing all metadata as @zardaloop suggested. We should have a look at this in the new year.

0 replies

amueller · 2015-12-29T16:15:02Z

amueller
Dec 29, 2015

I think joblib will do single-file exports soon. Maybe for the moment pickle is enough. Be sure to use the latest protocol of pickle, because the default results in much larger files (at least in python2, not sure about python3).

Both joblib and pickle have the issue that they serialize a class, without the corresponding class definition. So it is only guaranteed that a model will work and give the same result when using the exact same code it was created with.
We try to keep conflicts in loading to a minimum, but the trees frequently change their internal representation.

To make sure a result is entirely reproducible, the "easiest" way is to use docker containers or similar virtual environments (conda envs might be enough) with the exact same version of everything.

What is your exact use case?
A big question is whether you want the model to "work" or want the exact same results. Changing the numpy or scipy version, or changing the BLAS, might give different results. So If you want the exact same results, that's hard to achieve without some very controlled environment.

If you want to load a model that "works", having the same scikit-learn version is sufficient.

Even if the learning of a model, and therefore the serialization didn't change between versions, it could be that a bug in the prediction code was fixed. So even if you can load a model from an older version, it is not ensured that you get similar predictions.

Hope that helps. This is a tricky issue. Feel free to ping me on these discussions, I don't generally follow the tracker atm, but I'm happy to give input.

0 replies

joaquinvanschoren · 2015-12-29T22:01:46Z

joaquinvanschoren
Dec 29, 2015
Maintainer Author

Thanks, Andreas, for your valuable input. When it comes down to sharing the
model itself, it is sufficient that it just works (will be able to give the
same predictions given the same instances). It seems then that storing the
exact scikit version in the run, and storing the pickle, it the most
workable solution.

The reproducibility discussion is equally important though, and we should
look into this when sharing flows. We are currently thinking of storing
just the python script that creates the model given a task, with
meta-information such as the scikit-learn version, but a docker container
would be a better solution (and we are exploring the same thing for R right
now). We could generate those for each major scikit-learn version? Do you
have experience with this in the scikit-learn team?

On Tue, Dec 29, 2015 at 5:15 PM Andreas Mueller [email protected]
wrote:

I think joblib will do single-file exports soon. Maybe for the moment
pickle is enough. Be sure to use the latest protocol of pickle, because the
default results in much larger files (at least in python2, not sure about
python3).

Both joblib and pickle have the issue that they serialize a class, without
the corresponding class definition. So it is only guaranteed that a model
will work and give the same result when using the exact same code it was
created with.
We try to keep conflicts in loading to a minimum, but the trees frequently
change their internal representation.

To make sure a result is entirely reproducible, the "easiest" way is to
use docker containers or similar virtual environments (conda envs might be
enough) with the exact same version of everything.

What is your exact use case?
A big question is whether you want the model to "work" or want the exact
same results. Changing the numpy or scipy version, or changing the BLAS,
might give different results. So If you want the exact same results, that's
hard to achieve without some very controlled environment.

If you want to load a model that "works", having the same scikit-learn
version is sufficient.

Even if the learning of a model, and therefore the serialization didn't
change between versions, it could be that a bug in the prediction code was
fixed. So even if you can load a model from an older version, it is not
ensured that you get similar predictions.

Hope that helps. This is a tricky issue. Feel free to ping me on these
discussions, I don't generally follow the tracker atm, but I'm happy to
give input.

—
Reply to this email directly or view it on GitHub
https://github.com/openml/python/issues/21#issuecomment-167823391.

0 replies

mikecroucher · 2015-12-30T10:31:27Z

mikecroucher
Dec 30, 2015

The best I can do at the moment is to offer advice on what not to do. Don't use pickle!

Here's a summary as to why

http://eev.ee/blog/2015/10/15/dont-use-pickle-use-camel/

I'm not sure what one should use instead though...still trying to figure that out myself.

0 replies

joaquinvanschoren · 2015-12-30T22:44:30Z

joaquinvanschoren
Dec 30, 2015
Maintainer Author

Interesting as that blog post is, do we really have an alternative right
now? A library like scikit-learn could likely come up with something
better, but expecting this for everyone running ML experiments in Python
seems a tall order?

Incidentally, what causes pickles to break? Will they still break if one
also provides a docker container with an environment in which they work?

Practically speaking, for the experiments that I want to run now, is it
ok to use pickle until something better comes along?

On Wed, Dec 30, 2015 at 11:31 AM Mike Croucher [email protected]
wrote:

The best I can do at the moment is to offer advice on what not to do.
Don't use pickle!

Here's a summary as to why

http://eev.ee/blog/2015/10/15/dont-use-pickle-use-camel/

I'm not sure what one should use instead though...still trying to figure
that out myself.

—
Reply to this email directly or view it on GitHub
https://github.com/openml/python/issues/21#issuecomment-167974433.

0 replies

amueller · 2016-01-04T16:24:17Z

amueller
Jan 4, 2016

A library like scikit-learn could likely come up with something better

If you think that, you overestimate our resources by a lot. We haven't been able to provide better backward compatibility, even with pickle.

When it comes down to sharing the model itself, it is sufficient that it just works (will be able to give the same predictions given the same instances)

Well given the same predictions given the same instances can really only be guaranteed with a full container (because of blas issues etc). If you system is reasonably static, storing the scikit-learn version will work as an intermediate solution. But once your hosting provider upgrades their distribution, you might be in trouble.
A conda environment is reasonably save, I think.

We haven't done docker containers for reproducibility. We use travis and circleci and appveyor for continuous integration. But we don't really have a need to create highly reproducible environments.

0 replies

amueller · 2016-01-04T16:25:18Z

amueller
Jan 4, 2016

I think pickle or joblib or dill + conda is the best solution for now, with pickle or joblib or dill + conda + docker the optimum upgrade.

0 replies

drj11 · 2016-01-05T11:10:21Z

drj11
Jan 5, 2016

@mikecroucher asked me to comment. I'm a Python old-hand, but know nothing of scikit-learn, so what I have to say is slanted more towards generic Python advice.

To be able to answer a question like "is pickle adequate" we have to be able to pin down some requirements. For example, is it required that:

basic persistence: I can persist a model and load it into a later session of the same software configuration;
forward persistence: I can persist a model and load it into a later session that uses newer versions of the software;
backward persistence: I can persist a model and load it into a later session that uses older versions of the software;
portable persistence: I can persist a model and transfer that persisted model for someone else to load into a later session.

I would guess that various people would want all of these in some combination, so the real issue is how much do you want to pay (in money, time, and tears) for each of these things.

Additionally, there are various semantic issues. For example: I might be able to load the model, but it gives different predictions, but the predictions are different only in ways that are unimportant (for example, a few ULP). @amueller seems to be aware of these.

With that in mind, pickle is terrible for all of those requirements except basic persistence. Loading a pickle runs arbitrary code, so you should never download and open a pickle. Pickles are extremely brittle (many reasons, but for example, they refer to classes by their module location, so if you reorganise your files for an internal class, everything breaks), so are next to useless for providing forwards or backwards compatibility.

0 replies

drj11 · 2016-01-06T10:15:37Z

drj11
Jan 6, 2016

Just FTR since I was asked: I don't know enough about conda to have a reliable opinion, but if it can be used to record all versions of all software in use (as @amueller suggests), then that's a good start.

0 replies

amueller · 2016-01-06T16:41:11Z

amueller
Jan 6, 2016

@joaquinvanschoren Ok, if people can submit their models, then you would need them to use conda and submit their conda environment config with the model. That is not terribly hard and probably the most feasible way.

There might still be minor differences due to OS, but the only way to avoid those is to have every user work in a virtual machine (or docker container) and provides the virtual machine with the model. That is way more complicated, and probably not worth the effort.

@drj11 conda is basically a cross-platform package manager that ships binaries (unlike pip), mostly for python and related scientific software.

0 replies

amueller · 2016-01-06T16:57:41Z

amueller
Jan 6, 2016

btw, you might be interested in reprozip and hyperos which are two approaches to create reproducible environments (but they are kinda alpha-stage iirc). Conda or docker seem the better choices for now.
One downside of conda is that it does not necessarily capture all dependencies.

If someone wrote a custom transformer (which probably most interesting models have), you have some code part that is not a standard package. So in addition to the environment config you get from conda, and the state you get from pickle, you also need to have access to the source of the custom part.

0 replies

asmeurer · 2016-01-12T17:04:44Z

asmeurer
Jan 12, 2016

@zardaloop has asked me to comment here. I am not very familiar with the situation so my comment will be generic. I don't have much experience with serialization, so I can't comment on
that. As for creating a conda package, I can tell you that it is a good fit if the packaged files are read-only and can be installed to a location in the library prefix (the conda environment). If this is not the case, then conda packages are not a good fit.

0 replies

joaquinvanschoren · 2018-03-17T10:03:11Z

joaquinvanschoren
Mar 17, 2018
Maintainer Author

It would be great to rekindle this discussion, because it looks like it was converging towards a good solution, and storing models in OpenML would be very useful.

Would a conda+joblib/dill/pickle approach work? Even if it covers a large percentage of use cases it would make many people happy :)
@ameuller What do you think of reprozip and hyperos 2 years later?

0 replies

mfeurer · 2018-03-19T08:13:40Z

mfeurer
Mar 19, 2018
Maintainer

Another thing I'd like to mention: security. Pickle is a bit insecure and I am very hesitant putting a solution based on pickle in the python package. See here.

0 replies

rizplate · 2018-06-15T21:03:06Z

rizplate
Jun 15, 2018

+1

0 replies

janvanrijn · 2018-06-15T21:43:25Z

janvanrijn
Jun 15, 2018
Maintainer

and storing models in OpenML would be very useful.

Is there any (scientific or praktical) use case in which storing models becomes relevant? The only thing that I can think of is when a new test set of data becomes available, the model can be reevaluated on this. However, this unfortunately rarely happens.

0 replies

joaquinvanschoren · 2018-06-16T21:11:52Z

joaquinvanschoren
Jun 16, 2018
Maintainer Author

Generally looking into what your model looks like (visualize a tree, count support vectors, looking at learned weights,...)
Transfer learning, few-shot learning
Streaming data (where new data actually comes in)
Making predictions for new test instances (granted, this does not happen for older public datasets, but it does happen all the time in real applications)

I agree it is challenging, but I would really love to track the models I'm building. Maybe not during a large-scale benchmark, but there are plenty of other cases where I either want to look at the models to better understand what they are doing or share them so that other people may learn from them and reuse them.

0 replies

joaquinvanschoren · 2018-06-16T21:20:46Z

joaquinvanschoren
Jun 16, 2018
Maintainer Author

I recently talked to Matei (MLflow). They use a simple format which is just a file containing the model (could be a pickle) and some meta-data on how to read it in.

It is probably best to leave this to the user. The python API should just retrieve the file and meta-data to tell the user what to do with it. Reading in models will probably be done rather occasionally.

0 replies

mfeurer · 2018-06-18T07:33:26Z

mfeurer
Jun 18, 2018
Maintainer

One more thing to keep in mind is file size. Running the default scikit-learn random forest on the popular EEG-Eye-State dataset (1471) results in 7.5MB:

In [1]: %paste
import openml
import sklearn.ensemble
import pickle
data = openml.datasets.get_dataset(1471)
X, y = data.get_data(target=data.default_target_attribute)
rf = sklearn.ensemble.RandomForestClassifier()
rf.fit(X, y)
string = pickle.dumps(rf)
len(string) / 1024. / 1024.

Out[1]: 7.461672782897949

The most popular task on that dataset has ~85k runs, assuming that only 1 percent of these are random forests, that would require at least 6.3GB. If you would increase the tree size from 10 trees to something reasonable, this space requirement would grow drastically.

0 replies

rquintino · 2018-08-31T23:18:03Z

rquintino
Aug 31, 2018

Hi everyone! thinking a lot on this issue these past days, slightly more related to operationalization, pipeline reuse (ex: eval), retraining, & complete reproducibility. Remembered from Joaquin that this was an hot question for openml, & this was a great read/help!

I'm perfectly aware of the security implications and overal versioning issues of loaded resources, but even so, really pipelines solve so much of the issues that were bothering me.... (if only they could be slightly easier to work with :) )

Adding one additional problem, like mentioned above by @amueller , customtransformers. if we have to track the actual code for these... hard to see how can this could be proper operationalized. (and very error prone)

I did some tests with cloudpickle (dill probably will do similar?), and it seems to persist everything that is needed. No need to save/track any customtransformer code. Can load multiple pipelines with no problem. Everything seems really straightforward, save pipeline, (on new kernel) load, predict, refit, just works. Huge flexibility, ex: eval on new refits.

I also did some experiments on mixing sequential preparation flow, but in a fit/transform compatible way....
(sample below, or you can test in binder here https://mybinder.org/v2/gh/DevScope/ai-lab/master?filepath=notebooks%2Fdeconstructing%20-pipelines )

(seems to good to be true... what do you think?
ps-anyone knows if the actual code is "recoverable" from the saved cloudpickle?)
thanks!

class FitState():
    def __init__(self):
        pass
    
class PrepPipeline(BaseEstimator, TransformerMixin):
    
    def __init__(self,impute_age=True,impute_cabin=True,
                 add_missing_indicators=True,
                 train_filter="",copy=True,notes=None):
        self.impute_age=impute_age
        self.notes=notes
        self.copy=copy
        self.train_filter=train_filter
        self.impute_cabin=impute_cabin
        self.add_missing_indicators=add_missing_indicators
    
    def fit(self, X, y=None):
        print("Fitting...")
        self.fit_state=FitState()
        self.prepare(X=X,y=y,fit=True)
        return self

    def transform(self, X,y=None):
        assert isinstance(X, pd.DataFrame)
        print("Transforming...")
        return self.prepare(X=X,y=y,fit=False)
    
    def show_params(self):
        print("fit_state",vars(self.fit_state))
        print("params",self.get_params())
        
    # Experiment is reduce class overhead, bring related fit & transform closer, no models without pipelines
    def prepare(self,X,y=None,fit=False):
        print(f"Notes: {self.notes}")
        
        fit_state=self.fit_state
        if (self.copy):
            X=X.copy()
        
        # Fit only steps, ex: filtering, drop cols 
        if (fit):
            # Probably a very bad idea... thinking on it...
            if self.train_filter:
                X.query(self.train_filter,inplace=True)
        
        if (self.add_missing_indicators):
            if fit:
                fit_state.cols_with_nas=X.columns[X.isna().any()].tolist()
            X=pd.concat([X,X[fit_state.cols_with_nas].isnull().astype(int).add_suffix('_missing')],axis=1)
            
        # A typical titanic prep step (grabbed few ones from kaggle kernels)   
        if (self.impute_age):
            if fit:
                fit_state.impute_age=X.Age.median()

            X.Age.fillna(fit_state.impute_age,inplace=True)

        # Another one
        if (self.impute_cabin):
            if fit:
                fit_state.impute_cabin=X.Cabin.mode()[0]
            X.Cabin.fillna(fit_state.impute_cabin,inplace=True)
            
        return X


prep_pipeline=PrepPipeline(impute_age=True,impute_cabin=True, copy=False,train_filter="Sex=='female'",notes="test1")
X=prep_pipeline.fit_transform(df_full.copy())
prep_pipeline.show_params()
print(X.info())

0 replies

rquintino · 2018-08-31T23:25:07Z

rquintino
Aug 31, 2018

Similar concept, using dill
https://www.analyticsvidhya.com/blog/2017/09/machine-learning-models-as-apis-using-flask/

0 replies

rquintino · 2018-08-31T23:36:50Z

rquintino
Aug 31, 2018

ps-lik mentioned above,probably the size and amount of runs will be a challenge for openml,
nevertheless really interesting that, when saving the full pipelines (complete flow with all prep/model), we can refit/predict with new train/test folds at any time, ex: refresh leaderboard.

noting that if the pipeline was really a gridsearch fit, then refitting would be rather expensive. :)

0 replies

rth · 2018-09-17T10:22:50Z

rth
Sep 17, 2018

For serialization, onnx format might also be relevant (cf https://github.com/onnx/onnxmltools)

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OpenML

How to serialize models #12

{{title}}

Replies: 36 comments

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

How to serialize models #12

joaquinvanschoren Dec 23, 2015 Maintainer

Replies: 36 comments

joaquinvanschoren Dec 23, 2015 Maintainer Author

joaquinvanschoren Dec 24, 2015 Maintainer Author

mfeurer Dec 24, 2015 Maintainer

joaquinvanschoren Dec 29, 2015 Maintainer Author

joaquinvanschoren Dec 30, 2015 Maintainer Author

joaquinvanschoren Mar 17, 2018 Maintainer Author

mfeurer Mar 19, 2018 Maintainer

janvanrijn Jun 15, 2018 Maintainer

joaquinvanschoren Jun 16, 2018 Maintainer Author

joaquinvanschoren Jun 16, 2018 Maintainer Author

mfeurer Jun 18, 2018 Maintainer

joaquinvanschoren
Dec 23, 2015
Maintainer

joaquinvanschoren
Dec 23, 2015
Maintainer Author

joaquinvanschoren
Dec 24, 2015
Maintainer Author

mfeurer
Dec 24, 2015
Maintainer

joaquinvanschoren
Dec 29, 2015
Maintainer Author

joaquinvanschoren
Dec 30, 2015
Maintainer Author

joaquinvanschoren
Mar 17, 2018
Maintainer Author

mfeurer
Mar 19, 2018
Maintainer

janvanrijn
Jun 15, 2018
Maintainer

joaquinvanschoren
Jun 16, 2018
Maintainer Author

joaquinvanschoren
Jun 16, 2018
Maintainer Author

mfeurer
Jun 18, 2018
Maintainer