Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

surrogate model #77

Draft
wants to merge 18 commits into
base: main
Choose a base branch
from
Draft

surrogate model #77

wants to merge 18 commits into from

Conversation

aklawonn
Copy link
Collaborator

@aklawonn aklawonn commented Apr 27, 2022

Implements the code structure for using surrogate models (meta models, response surfaces) with probeye. This structure is up for discussion, so please feel free to comment and propose changes. In order to check the currently intended use of a surrogate model, check out this integration test. When finished, this fixes #78.

@aklawonn aklawonn marked this pull request as draft April 27, 2022 08:39
@aklawonn aklawonn requested a review from joergfunger April 27, 2022 08:39
@aklawonn aklawonn added the enhancement New feature or request label Apr 27, 2022
@aklawonn aklawonn requested a review from atulag0711 April 27, 2022 11:37
@@ -211,6 +219,11 @@ def forward_models(self) -> dict:
"""Access self._forward_models from outside via self.forward_models."""
return self._forward_models

@property
def surrogate_models(self) -> dict:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does the surrogate model has to be stored at the inference problem? An alternative would be

my_metamodel = gp_forwardmodel(orig_forward_model, additional_parameters_of_gp)

and then the user can decide on either using the metamodel

problem.add_forward_model(my_metamodel)

or the original forward model

problem.add_forward_model(orig_forward_model)

@atulag0711
Copy link
Collaborator

Thanks Alex. The integration example looks nice and makes sense to me. However, fit method in probeye context would be difficult. I will complete my implementation, push here and then we can discuss.

@merijnpepijndebakker
Copy link

Thanks for this. In my opinion the structure you propose makes a lot of sense, i.e. consider the surrogate model as a separate class and as model object. One remark is that we also have different ways to sample new datapoints to generate the surrogate. How would you connect the surrogate with the sampling methods?

@joergfunger
Copy link
Member

joergfunger commented May 16, 2022

I would think the sampling could be done separately. So we would either have a sample_from_prior functionality, or any other sampler that takes as input a parameter estimation problem and returns the samples. Then these samples would be passed to the metamodel, which would then evalute the samples and build the metamodel. You could certainly perform the computation of the outputs (so the forward model output that is to be metamodeled) as a separate independent step, however doing it inside the metamodel allows for an adaptation (so if the metamodel requires new samples, they could just be added whenever the model is evaluated).

Copy link
Member

@joergfunger joergfunger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the integration. Can we somehow decouple the specific implementations from harlow from the general surrogate interface and put the specific implementations for harlow into a surrogate subdirectory? This way other metamodels could be integrated as well.

from probeye.definition.forward_model import ForwardModelBase

# Harlow imports
from harlow.sampling import Sampler
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should that be in a very general definition of the interface (rather than in a specific implementation using harlow).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, I have separated the general definition of the interface (SamplerBase) from the particular implementation for harlow (HarlowSampler(SamplerBase)).


def __init__(
self,
problem: InverseProblem,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do we need that complex interface? In particular, why do we need the inverseProblem? What about the sampler, shouldn't the samples directly be passed?

problem: InverseProblem,
forward_model: ForwardModelBase,
sampler: Sampler,
surrogate_model: Surrogate,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also Harlow specific? How would we integrate different surrogate models (apart from Harlow)?

# TODO: Implement check that the expensive forward model
# has been added to the problem
self.input_sensor_data = dict()
for exp_name, exp in self.problem.experiments.items():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't all this information stored in the forward model?


"""

self.surrogate.fit(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have to store all the fit points and then call the fit routine with these internal variables instead of providing a fit method that requires the input of sampling/training points.

self.sampler.fit_points_x, self.sampler.fit_points_y, **kwargs
)

def sample(self, **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be done outside of the surrogate? What is the sampler part of the surrogate?

Copy link
Member

@joergfunger joergfunger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not review the complete code, but I realized one main design decision that I would like to dicuss. You are surrogating the inference problem (with all parameters), but shouldn't we surrogate each forward model and only use the inference problem to define bounds/distributions of the parameters?

examples/example_surrogate_hartman6D.py Show resolved Hide resolved

# An iterative sampler. Here we pass the surrogate ForwardModel directly to the sampler. However, it is
# also possible to pass a surrogate model that will be included in a forward model after fitting.
harlow_sampler = HarlowSampler(problem, forward_model, LatinHypercube, surrogate_model)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is LatinHypercube here? could you document that somehow?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is the harlow implementation of LHS. I will try to improve the documentation and fix the docstrings asap.

examples/example_surrogate_hartman6D.py Show resolved Hide resolved
@@ -408,6 +408,37 @@ def change_constant(self, prm_name: str, new_value: Union[int, float]):
value=new_value
)

def get_latent_prior_hyperparameters(self, prm_name: str) -> list:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem definition in probeye allows a hierarchical definition, with parameter priors having hyperparameters that itself could have again parameters, that again could have hyperparameters (though I don't think that is often used with more than a single level of hierarchy). So in the case of two levels of hierarchy, what would that return?
In addition, what is the difference to the existing methods here So why do you need for individual parameters the hyper parameters?

Returns
-------
prms
Contains <local parameter name> : <(global) parameter value> pairs. If a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure of I understand that, there is a local and global name, but the value is the same (so neither local or global)


# make sure that all parameters are one-dimensional; it is not straight forward
# how to do LHS for general multivariate parameters
for prm_name in self.problem.latent_prms:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we loop over the latent_prms of the inverse problem, or rather of the original forward model? In case of a single forward model, that is identical, but in case of multiple models, I don't think we should surrogate all models in parallel but rather split that into surrogates for each model.

The considered inverse problem.
"""

def __init__(self, problem: InverseProblem):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The design question if we surrogate the inverse problem, or each individual forward model. I would opt for the latter. The inverse problem will only provide in a somehow automatic way the bounds.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The newer version of LatinHypercubeSampler in sampling.py (which I simply copied from this implementation by Alex) is derived from SamplerBase. We could ensure that the sampling and surrogating is done for each individual forward model by specifying the forward model as an input argument in SamplerBase.

E.g. this part:

class SamplerBase:
    """
    Base class for probeye samplers
    """

    def __init__(self, problem: InverseProblem):

would become:

class SamplerBase:
    """
    Base class for probeye samplers
    """

    def __init__(self, problem: InverseProblem, forward_model: ForwardModelBase):

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the challenge, however it is also tricky to hide that. What do we do in case of a parametric prior (where e.g. the std. dev is again a parameter).

@JanKoune
Copy link
Collaborator

JanKoune commented Sep 16, 2022

Hi, thank you for the feedback. It may be best to provide a brief description of the changes and the intended layout of the sampling and surrogating interface before addressing your individual comments:

  1. Surrogating and sampling are now separated by defining the SurrogateBase and SamplerBase classes, from which the user is expected to derive their own implementation.
  2. Both base classes should be as simple as possible for flexibility to allow for different approaches to surrogating and sampling. E.g. in a simple case we may want to draw LHS samples from the prior and manually fit a surrogate model which we then use to define a forward model, or in a more complex case we may want to pass the surrogate model to the sampler to perform iterative fitting and sampling.
  3. The main use of the SurrogateBase class is to copy the interface of a given forward model, and provide a template for surrogate models by implementing "empty" methods that are typically useful for surrogating (e.g. fit and predict).
  4. The SamplerBase is meant to read the inverse problem and forward model definitions and extract the information that is needed by the sampler (e.g. parameter priors and bounds, and input/output shapes). Indeed it would be best to do the surrogating for each forward model and not for the entire inverse problem. This could be enforced by making the forward model an input argument to SamplerBase, ensuring that a sampler is tied to a specific forward model.
  5. Most of the functionality for getting the problem and forward model information is not yet implemented in the base class. I will try to get it working asap.
  6. The base classes are now decoupled from harlow.

Copy link
Collaborator

@danielandresarcones danielandresarcones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to develop the interface in such a way that separates completely the sampler and surrogate from the inverse and forward problem. The separated base classes work for that, but still require knowledge of the requirements of the ForwardModel. I woudl suggest something like this, where the SurrogateAndSampler interface controls the flow of information with the inverse problem and between sampler and surrogate, while acting as the ForwardModel. This would allow to add the necessary utilities in the interface instead of having to redefine them for every surrogate. More complex interactions such as adaptative samplers can be implemented as children. The data format between the interface and the individual wrappers should be decided beforehand.

Additionally, the wrappers are to be implemented individually, allowing the user to adapt the inputs to their specific solver. If, for example, a surrogate needs information of a specific forward model, that can be requested in the wrapper and introduced in the initialization, encapsulating each of the implementations individually.

from harlow.surrogating import Surrogate
# external imports
from harlow.sampling import Sampler as HarlowSamplerBase
from harlow.surrogating import Surrogate as HarlowSurrogateBase
from harlow.utils.helper_functions import latin_hypercube_sampling

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not very fond of having external imports in the same script where the base class is defined. I would personally keep the base class separated from the derived ones, creating one new script for each group of samplers.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done for convenience initially, and it would be trivial to separate the specific implementations from the base class if necessary later.

self.priors = copy.deepcopy(self.problem.priors)
for prior_template in self.problem.priors.values():
prm_name = prior_template.ref_prm
self.priors[prm_name] = translate_prior(prior_template)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the sampler should have to treat the priors itself, probably better in the interface.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the challenge with the implementation is that we strictly separated between the problem definition and compute functions (such as evaluating the prior with e.g. a scipy pdf evaluation, or a pytorch pdf evaluation), i.e. the problem definition cannot perform a sampling from the prior a priori. Thus, we would somehow have to create a solver just for sampling the prior using e.g. LHS. Not sure what exactly would be the recommended option, I see two options

  1. The sampler just get's the distributions and performs the (initial) sampling internally. This would mean to extract the distributions with all the relevant parameters from the inference problem.
  2. We create a dummy sampler that is able to sample from the prior and extracts that samples as a first set of points to the surrogate.
    In both situations, the intervals for the variables have to be transferred to the surrogate sampling either by having a check_bounds in the inference problem, or by "copying" or extracting these information from inference problem.

sampler: Sampler,
surrogate_model: Surrogate,
sampler: HarlowSamplerBase,
surrogate_model: HarlowSurrogateBase,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally a harlow sampler should be usable with a non-harlow surrogate, but it is not the case. Probably more a "harlow" thing than a "probeye" one.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Harlow offers an abstract surrogate class which can be used to define new surrogates that can be used with the harlow samplers. Since most samplers need access to the surrogate (for fitting and making predictions iteratively), I don't think its feasible to make them work with any surrogate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why exactly should that not work with other surrogates? But I somehow agree that the sampler and the surrogate are connected and maybe we should actually follow a suggestions from Daniel to put that into a same base class.

raise NotImplementedError


class HarlowSurrogate(SurrogateModelBase):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would separate this from the base class.

"""

def __init__(
self, name: str, surrogate: HarlowSurrogateBase, forward_model: ForwardModelBase
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The base class should be general instead of forcing to input a harlow surrogate.
From my point of view, this base class should just coordinate the input and output of data to the surrogate model as well as its format.

Copy link
Collaborator

@JanKoune JanKoune Sep 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, thanks for catching this! I will fix this in the next update.

@JanKoune
Copy link
Collaborator

JanKoune commented Sep 16, 2022

Thank you for the suggestions @danielandresarcones. Some notes:

  1. What is the benefit of creating surrogating and sampling wrapper base classes plus an interface class? Why not use the __init__ method of SamplerBase to obtain all the information needed for surrogating and sampling from InverseProblem and ForwardModel?
  2. Ideally, with the current structure of SamplerBase and SurrogateBase the user would not need any knowledge of the InverseProblem and ForwardModel internals since extracting the necessary information would be taken care of in the __init__ of SamplerBase, although this is not yet implemented.
  3. I like the idea of having the interface also act as the forward model. However, it seems that there is a lot of processing of the forward model that happens internally in probeye and could potentially cause problems (e.g. the issue with deepcopy) if the forward model becomes too complex. It could also be that this is fine, but I tried this approach initially and encountered some issues which is why I opted to keep SurrogateBase as light as possible.

@danielandresarcones
Copy link
Collaborator

  1. My idea behind it is to separate functions and allow an easier integration with external generic samplers/surrogates, unless we decide to go always through harlow. Using the __init__ from SamplerBase means that the interactions between surrogate and sampler are controled by the sampler, and that this one will have to store the information regarding datasets, formats and parameters, as well as implementing the extra utilities that we may need. If we want to use a surrogate with a previously generated dataset with no sampling at all, having to define a sampler seems counterintuitive. In the same way, derived samplers shouldn't be able to modify the utilities concerning the ForwardModel, such as comparing the output of the surrogate model or generating the covariance matrices.
  2. The derived class would not have to modify it, but it will still be passed through the __init__ and accessible, as well as having to generate a copy of it. I consider encapsulating the sampler and surrogate from the inverse and forward problems the safest option.
  3. That is a good point, although it is something to consider.

@JanKoune
Copy link
Collaborator

I don't see anything in the current implementation tying us to Harlow or necessitating the use of a sampler to fit a surrogate model to an existing dataset. It seems to me that the two approaches are mostly functionally equivalent, with the main differences being:

  • Having a separate interface class vs. having SamplerBase act as the interface
  • Defining the forward model as part of the interface vs. defining it as part of the surrogate model

I do not have a strong preference for either approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Surrogate models
7 participants