
Surrogate scaling #315

Merged 26 commits on Jul 24, 2024 (changes shown from 22 commits).

Commits
`d9aefe5` Remove current scaling functionality (AdrianSosic, Jul 9, 2024)
`369da45` Make to_tensor also handle numpy arrays (AdrianSosic, Jul 16, 2024)
`0ede1cc` Replace param_bounds_comp with comp_rep_bounds (AdrianSosic, Jul 16, 2024)
`00c40ae` Draft input scaling mechanism (AdrianSosic, Jul 17, 2024)
`79f8f44` Introduce ScalerProtocol class (AdrianSosic, Jul 19, 2024)
`24f2c49` Make transformation return a dataframe (AdrianSosic, Jul 19, 2024)
`2938c48` Update streamlit dev script (AdrianSosic, Jul 19, 2024)
`ae1a366` Fix handling of dropped columns in ColumnTransformer (AdrianSosic, Jul 19, 2024)
`5068148` Remove obsolete TODO note (AdrianSosic, Jul 19, 2024)
`fb14927` Make surrogate scaling work with continuous parameters (AdrianSosic, Jul 19, 2024)
`c3a4cc6` Rename _get_parameter_scaler to _make_parameter_scaler (AdrianSosic, Jul 19, 2024)
`64b5450` Draft output scaling mechanism (AdrianSosic, Jul 22, 2024)
`6dad04a` Silence warning by allowing extra columns (AdrianSosic, Jul 22, 2024)
`25e356a` Improve signatures (AdrianSosic, Jul 22, 2024)
`2a2849b` Harmonize terminology (AdrianSosic, Jul 22, 2024)
`920b079` Update test for empty bounds (AdrianSosic, Jul 22, 2024)
`cdf6688` Fix import order (AdrianSosic, Jul 22, 2024)
`6e052f7` Decide for transformation approach (AdrianSosic, Jul 22, 2024)
`ef84a35` Update docstrings (AdrianSosic, Jul 22, 2024)
`2b3dcab` Remove separate scaling logic from GPs (AdrianSosic, Jul 23, 2024)
`161bddb` Rename ScalerProtocol to ParameterScalerProtocol (AdrianSosic, Jul 23, 2024)
`e7f3f67` Update CHANGELOG.md (AdrianSosic, Jul 23, 2024)
`21953d4` Replace literal return type with None (AdrianSosic, Jul 23, 2024)
`536a3a8` Implement workaround to circumvent ColumnTransformer limitations (AdrianSosic, Jul 24, 2024)
`b88b3ba` Improve code grouping (AdrianSosic, Jul 24, 2024)
`1619bd7` Remove register_custom_architecture decorator (AdrianSosic, Jul 24, 2024)
11 changes: 8 additions & 3 deletions CHANGELOG.md
@@ -17,6 +17,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `Surrogate` models now operate on dataframes in experimental representation instead of
tensors in computational representation
- `Surrogate.posterior` now returns a `Posterior` object
- `param_bounds_comp` of `SearchSpace`, `SubspaceDiscrete` and `SubspaceContinuous` has
been replaced with `comp_rep_bounds`, which returns a dataframe

### Added
- `Surrogate` base class now exposes a `to_botorch` method
@@ -33,6 +35,10 @@ - `_optional` subpackage for managing optional dependencies
- `transform` methods of `SearchSpace`, `SubspaceDiscrete` and `SubspaceContinuous`
now take additional `allow_missing` and `allow_extra` keyword arguments
- `GaussianSurrogate` base class for surrogate models with Gaussian posteriors
- `comp_rep_columns` property for `Parameter`, `SearchSpace`, `SubspaceDiscrete`
and `SubspaceContinuous` classes
- Reworked mechanisms for surrogate input/output scaling configurable per class
- `ParameterScalerProtocol` class for enabling user-defined input scaling mechanisms
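The `ParameterScalerProtocol` entry above suggests a structural-typing approach to user-defined input scaling. The sketch below is a hypothetical illustration (the `ScalerLike` name and method signatures are assumptions, not taken from the library) of how a scaler protocol plus a trivial conforming implementation could look:

```python
from typing import Protocol, runtime_checkable

import pandas as pd


@runtime_checkable
class ScalerLike(Protocol):
    """Hypothetical sketch of an sklearn-style scaler protocol (names assumed)."""

    def fit(self, df: pd.DataFrame) -> None: ...

    def transform(self, df: pd.DataFrame) -> pd.DataFrame: ...


class IdentityScaler:
    """Trivial conforming implementation: returns the input unchanged."""

    def fit(self, df: pd.DataFrame) -> None:
        pass  # nothing to learn

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        return df
```

Because the protocol is structural, any object exposing `fit` and `transform` would qualify, without subclassing anything from the library.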

### Changed
- Passing an `Objective` to `Campaign` is now optional
@@ -44,11 +50,12 @@ - `_optional` subpackage for managing optional dependencies
- Context information required by `Surrogate` models is now cleanly encapsulated into
a `context` object passed to `Surrogate._fit`
- Fallback models created by `catch_constant_targets` are stored outside of surrogate
- `to_tensor` now also handles `numpy` arrays
- `GaussianProcessSurrogate` no longer uses a separate scaling approach
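The `to_tensor` entry above says the helper now accepts `numpy` arrays in addition to dataframes. A hedged, torch-free sketch of that kind of normalization (the real helper returns torch tensors; `to_array` is an invented name used here only to keep the sketch dependency-light):

```python
import numpy as np
import pandas as pd


def to_array(x) -> np.ndarray:
    """Normalize dataframes and array-likes to a float64 numpy array."""
    if isinstance(x, pd.DataFrame):
        x = x.values  # unwrap the dataframe's underlying array
    return np.asarray(x, dtype=np.float64)
```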

### Removed
- Support for Python 3.9 removed due to new [BoTorch requirements](https://github.com/pytorch/botorch/pull/2293)
and guidelines from [Scientific Python](https://scientific-python.org/specs/spec-0000/)
- `register_custom_architecture` decorator
- `Scalar` and `DefaultScaler` classes

### Fixed
@@ -64,8 +71,6 @@ - `_optional` subpackage for managing optional dependencies
- Passing a dataframe via the `data` argument to the `transform` methods of
`SearchSpace`, `SubspaceDiscrete` and `SubspaceContinuous` is no longer possible.
The dataframe must now be passed as positional argument.
- Role of `register_custom_architecture` has been taken over by
`baybe.surrogates.base.SurrogateProtocol`

## [0.9.1] - 2024-06-04
### Changed
2 changes: 1 addition & 1 deletion baybe/acquisition/base.py
@@ -52,7 +52,7 @@ def to_botorch(
params_dict = filter_attributes(object=self, callable_=acqf_cls.__init__)

train_x = surrogate.transform_inputs(measurements)
-    train_y = surrogate.transform_targets(measurements)
+    train_y = surrogate.transform_outputs(measurements)

signature_params = signature(acqf_cls).parameters
additional_params = {}
19 changes: 15 additions & 4 deletions baybe/parameters/base.py
@@ -48,10 +48,6 @@ def is_in_range(self, item: Any) -> bool:
``True`` if the item is within the parameter range, ``False`` otherwise.
"""

-    @abstractmethod
-    def summary(self) -> dict:
-        """Return a custom summarization of the parameter."""

def __str__(self) -> str:
return str(self.summary())

@@ -65,6 +61,15 @@ def is_discrete(self) -> bool:
"""Boolean indicating if this is a discrete parameter."""
return isinstance(self, DiscreteParameter)

+    @property
+    @abstractmethod
+    def comp_rep_columns(self) -> tuple[str, ...]:
+        """The columns spanning the computational representation."""
+
+    @abstractmethod
+    def summary(self) -> dict:
+        """Return a custom summarization of the parameter."""


@define(frozen=True, slots=False)
class DiscreteParameter(Parameter, ABC):
@@ -84,8 +89,14 @@ def values(self) -> tuple:
@cached_property
@abstractmethod
def comp_df(self) -> pd.DataFrame:
# TODO: Should be renamed to `comp_rep`
"""Return the computational representation of the parameter."""

@property
def comp_rep_columns(self) -> tuple[str, ...]: # noqa: D102
# See base class.
return tuple(self.comp_df.columns)

def is_in_range(self, item: Any) -> bool: # noqa: D102
# See base class.
return item in self.values
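The new `comp_rep_columns` property for discrete parameters simply exposes the columns of the parameter's `comp_df`. A minimal sketch of that relationship, using an invented two-column encoding:

```python
import pandas as pd

# Hypothetical stand-in for a discrete parameter's computational representation:
# one row per parameter value, one column per encoding feature.
comp_df = pd.DataFrame({"enc_1": [0.1, 0.2], "enc_2": [1.0, 0.0]})

# The columns spanning the computational representation, as in the diff above.
comp_rep_columns = tuple(comp_df.columns)
```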
5 changes: 5 additions & 0 deletions baybe/parameters/numerical.py
@@ -132,6 +132,11 @@ def is_in_range(self, item: float) -> bool:  # noqa: D102

return self.bounds.contains(item)

@property
def comp_rep_columns(self) -> tuple[str, ...]: # noqa: D102
# See base class.
return (self.name,)

def summary(self) -> dict: # noqa: D102
# See base class.
param_dict = dict(
4 changes: 2 additions & 2 deletions baybe/recommenders/pure/bayesian/botorch.py
@@ -156,7 +156,7 @@ def _recommend_continuous(

points, _ = optimize_acqf(
acq_function=self._botorch_acqf,
-        bounds=torch.from_numpy(subspace_continuous.param_bounds_comp),
+        bounds=torch.from_numpy(subspace_continuous.comp_rep_bounds.values),
q=batch_size,
num_restarts=5, # TODO make choice for num_restarts
raw_samples=10, # TODO make choice for raw_samples
@@ -244,7 +244,7 @@ def _recommend_hybrid(
# Actual call of the BoTorch optimization routine
points, _ = optimize_acqf_mixed(
acq_function=self._botorch_acqf,
-        bounds=torch.from_numpy(searchspace.param_bounds_comp),
+        bounds=torch.from_numpy(searchspace.comp_rep_bounds.values),
q=batch_size,
num_restarts=5, # TODO make choice for num_restarts
raw_samples=10, # TODO make choice for raw_samples
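Both call sites pass the bounds to BoTorch's optimizers as a tensor built from the dataframe's underlying array. A small sketch (made-up parameter names, torch omitted to keep the sketch dependency-light) of the expected 2 x d layout, where row 0 holds the lower and row 1 the upper bounds:

```python
import pandas as pd

# Bounds frame in the new format: one column per parameter, rows "min"/"max".
bounds_df = pd.DataFrame({"x": (0.0, 1.0), "y": (-1.0, 1.0)}, index=["min", "max"])

# `.values` yields the (2, d) array that `torch.from_numpy(...)` would wrap.
bounds_array = bounds_df.values
```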
132 changes: 0 additions & 132 deletions baybe/scaler.py

This file was deleted.

23 changes: 14 additions & 9 deletions baybe/searchspace/continuous.py
@@ -30,7 +30,6 @@
from baybe.serialization import SerialMixin, converter, select_constructor_hook
from baybe.utils.basic import to_tuple
from baybe.utils.dataframe import pretty_print_df
-from baybe.utils.numerical import DTypeFloatNumpy

if TYPE_CHECKING:
from baybe.searchspace.core import SearchSpace
@@ -211,11 +210,17 @@ def param_names(self) -> tuple[str, ...]:
return tuple(p.name for p in self.parameters)

@property
-    def param_bounds_comp(self) -> np.ndarray:
-        """Return bounds as numpy array."""
-        if not self.parameters:
-            return np.empty((2, 0), dtype=DTypeFloatNumpy)
-        return np.stack([p.bounds.to_ndarray() for p in self.parameters]).T
+    def comp_rep_columns(self) -> tuple[str, ...]:
+        """The columns spanning the computational representation."""
+        return tuple(chain.from_iterable(p.comp_rep_columns for p in self.parameters))
+
+    @property
+    def comp_rep_bounds(self) -> pd.DataFrame:
+        """The minimum and maximum values of the computational representation."""
+        return pd.DataFrame(
+            {p.name: p.bounds.to_tuple() for p in self.parameters},
+            index=["min", "max"],
+        )

def _drop_parameters(self, parameter_names: Collection[str]) -> SubspaceContinuous:
"""Create a copy of the subspace with certain parameters removed.
@@ -324,10 +329,10 @@ def sample_uniform(self, batch_size: int = 1) -> pd.DataFrame:
and len(self.constraints_lin_ineq) == 0
and len(self.constraints_cardinality) == 0
):
-            return self._sample_from_bounds(batch_size, self.param_bounds_comp)
+            return self._sample_from_bounds(batch_size, self.comp_rep_bounds.values)

if len(self.constraints_cardinality) == 0:
-            return self._sample_from_polytope(batch_size, self.param_bounds_comp)
+            return self._sample_from_polytope(batch_size, self.comp_rep_bounds.values)

return self._sample_from_polytope_with_cardinality_constraints(batch_size)

@@ -453,7 +458,7 @@ def sample_from_full_factorial(self, batch_size: int = 1) -> pd.DataFrame:
def full_factorial(self) -> pd.DataFrame:
"""Get the full factorial of the continuous space."""
index = pd.MultiIndex.from_product(
-            self.param_bounds_comp.T.tolist(), names=self.param_names
+            self.comp_rep_bounds.values.T.tolist(), names=self.param_names
)

return pd.DataFrame(index=index).reset_index()
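The continuous subspace's `comp_rep_columns` property above flattens the per-parameter column tuples into a single tuple via `chain.from_iterable`. A minimal sketch of that pattern with invented column names:

```python
from itertools import chain

# One tuple of computational-representation columns per parameter: a continuous
# parameter contributes its own name, an encoded parameter several columns.
per_param_columns = [("x",), ("mol_enc_1", "mol_enc_2")]

# Flatten into a single tuple spanning the whole subspace.
all_columns = tuple(chain.from_iterable(per_param_columns))
```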
14 changes: 9 additions & 5 deletions baybe/searchspace/core.py
@@ -7,7 +7,6 @@
from enum import Enum
from typing import cast

-import numpy as np
import pandas as pd
from attr import define, field

@@ -244,10 +243,15 @@ def contains_rdkit(self) -> bool:
)

@property
-    def param_bounds_comp(self) -> np.ndarray:
-        """Return bounds as tensor."""
-        return np.hstack(
-            [self.discrete.param_bounds_comp, self.continuous.param_bounds_comp]
-        )
+    def comp_rep_columns(self) -> tuple[str, ...]:
+        """The columns spanning the computational representation."""
+        return self.discrete.comp_rep_columns + self.continuous.comp_rep_columns
+
+    @property
+    def comp_rep_bounds(self) -> pd.DataFrame:
+        """The minimum and maximum values of the computational representation."""
+        return pd.concat(
+            [self.discrete.comp_rep_bounds, self.continuous.comp_rep_bounds], axis=1
+        )

@property
28 changes: 11 additions & 17 deletions baybe/searchspace/discrete.py
@@ -537,27 +537,21 @@ def is_empty(self) -> bool:
return len(self.parameters) == 0

@property
-    def param_bounds_comp(self) -> np.ndarray:
-        """Return bounds as tensor.
-
-        Take bounds from the parameter definitions, but discards bounds belonging to
-        columns that were filtered out during the creation of the space.
-        """
-        if not self.parameters:
-            return np.empty((2, 0))
-        bounds = np.hstack(
-            [
-                np.vstack([p.comp_df[col].min(), p.comp_df[col].max()])
-                for p in self.parameters
-                for col in p.comp_df
-                if col in self.comp_rep.columns
-            ]
-        )
-        return bounds
+    def comp_rep_columns(self) -> tuple[str, ...]:
+        """The columns spanning the computational representation."""
+        # We go via `comp_rep` here instead of using the columns of the individual
+        # parameters because the search space potentially uses only a subset of the
+        # columns due to decorrelation
+        return tuple(self.comp_rep.columns)
+
+    @property
+    def comp_rep_bounds(self) -> pd.DataFrame:
+        """The minimum and maximum values of the computational representation."""
+        return pd.DataFrame({"min": self.comp_rep.min(), "max": self.comp_rep.max()}).T

@staticmethod
def estimate_product_space_size(
-        parameters: Sequence[DiscreteParameter]
+        parameters: Sequence[DiscreteParameter],
) -> MemorySize:
"""Estimate an upper bound for the memory size of a product space.

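The discrete `comp_rep_bounds` shown above derives the bounds directly from the (possibly decorrelated) `comp_rep` frame rather than from the individual parameters. A sketch of the column-wise min/max pattern with toy data:

```python
import pandas as pd

# Toy computational representation: rows are candidates, columns are features.
comp_rep = pd.DataFrame({"a": [1.0, 3.0, 2.0], "b": [0.0, -1.0, 4.0]})

# Column-wise min/max, transposed so rows are "min"/"max" and columns features.
bounds = pd.DataFrame({"min": comp_rep.min(), "max": comp_rep.max()}).T
```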