chore: docs, linting, version bump (#39)
FBruzzesi authored Apr 30, 2024
1 parent b4b2e00 commit 6a16939
Showing 9 changed files with 98 additions and 65 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -13,7 +13,7 @@ repos:
- id: check-ast
- id: check-added-large-files
- repo: https://github.com/astral-sh/ruff-pre-commit
-rev: v0.3.4 # Ruff version.
+rev: v0.4.0 # Ruff version.
hooks:
- id: ruff # Run the linter.
args: [--fix, timebasedcv, tests]

71 changes: 46 additions & 25 deletions README.md

@@ -24,41 +24,60 @@ This codebase is experimental and is working for my use cases. It is very probab

The current implementation of [scikit-learn TimeSeriesSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html) lacks the flexibility of having multiple samples within the same time period/unit.

-This codebase addresses such problem by providing a cross validation strategy based on a time period rather than the number of samples. This is useful when the data is time dependent, and the model should be trained on past data and tested on future data, independently from the number of observations present within a given time period.
+This codebase addresses such a problem by providing a cross validation strategy based on a **time period** rather than the number of samples. This is useful when the data is time dependent and the model should be trained on past data and tested on future data, independently of the number of observations present within a given time period.

Temporal data leakage is an issue and we want to prevent that from happening!

We introduce two main classes:

-- [`TimeBasedSplit`](https://fbruzzesi.github.io/timebasedcv/api/timebasedsplit/#timebasedcv.timebasedsplit.TimeBasedSplit): a class that allows to define a time based split with a given frequency, train size, test size, gap, stride and window type. It's core method `split` requires to pass a time series as input to create the boolean masks for train and test from the instance information defined above. Therefore it is not compatible with [scikit-learn CV Splitters](https://scikit-learn.org/stable/common_pitfalls.html#id3).
-- [`TimeBasedCVSplitter`](https://fbruzzesi.github.io/timebasedcv/api/timebasedsplit/#timebasedcv.timebasedsplit.TimeBasedCVSplitter): a class that conforms with scikit-learn CV Splitters but requires to pass the time series as input to the instance. That is because a CV Splitter needs to know a priori the number of splits and the `split` method shouldn't take any extra arguments as input other than the arrays to split.
+- [`TimeBasedSplit`](https://fbruzzesi.github.io/timebasedcv/api/timebasedsplit/#timebasedcv.timebasedsplit.TimeBasedSplit) lets you define a time-based split with a given frequency, train size, test size, gap, stride and window type (see the sketch after this list). Its core method `split` requires a time series as input, from which it builds the boolean masks for train and test using the instance information defined above. Therefore it is not compatible with [scikit-learn CV Splitters](https://scikit-learn.org/stable/common_pitfalls.html#id3).
+- [`TimeBasedCVSplitter`](https://fbruzzesi.github.io/timebasedcv/api/timebasedsplit/#timebasedcv.timebasedsplit.TimeBasedCVSplitter) conforms with scikit-learn CV Splitters but requires the time series to be passed to the instance. That is because a CV Splitter needs to know a priori the number of splits, and its `split` method shouldn't take any extra arguments other than the arrays to split.

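To make the distinction concrete, here is a minimal sketch of the `TimeBasedSplit` flow described above. It is illustrative rather than a verbatim excerpt: it assumes the constructor arguments are named `frequency`, `train_size`, `forecast_size`, `gap`, `stride` and `window` (as in the library documentation) and that `split` takes the arrays followed by a `time_series` keyword.

```python
import numpy as np
import pandas as pd

from timebasedcv import TimeBasedSplit

# 60 days of data, one observation per day to keep the sketch small.
time_series = pd.Series(pd.date_range("2023-01-01", periods=60, freq="D"))
X = np.random.default_rng(42).normal(size=(60, 2))

tbs = TimeBasedSplit(
    frequency="days",  # the time unit the sizes below are expressed in
    train_size=30,
    forecast_size=7,
    gap=0,
    stride=7,
    window="rolling",
)

# `split` builds boolean masks from `time_series`, hence the extra keyword
# argument that a scikit-learn CV splitter would not accept.
for X_train, X_forecast in tbs.split(X, time_series=time_series):
    print(f"Train: {X_train.shape}, Forecast: {X_forecast.shape}")
```
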
## Installation

**timebasedcv** is a published Python package on [pypi](https://pypi.org/), therefore it can be installed directly via pip, as well as from source using pip and git, or with a local clone:

-- **pip** (suggested):
-
-```bash
-python -m pip install timebasedcv
-```
-
-- **pip + source/git**:
-
-```bash
-python -m pip install git+https://github.com/FBruzzesi/timebasedcv.git
-```
-
-- **local clone**:
-
-```bash
-git clone https://github.com/FBruzzesi/timebasedcv.git
-cd timebasedcv
-python -m pip install .
-```
+<details open>
+
+<summary> <b>pip</b> (suggested)</summary>
+
+```bash
+python -m pip install timebasedcv
+```
+
+</details>
+
+<details closed>
+
+<summary> <b>pip + source/git</b></summary>
+
+```bash
+python -m pip install git+https://github.com/FBruzzesi/timebasedcv.git
+```
+
+</details>
+
+<details closed>
+
+<summary> <b>local clone</b></summary>
+
+```bash
+git clone https://github.com/FBruzzesi/timebasedcv.git
+cd timebasedcv
+python -m pip install .
+```
+
+</details>

## Dependencies

As of **timebasedcv v0.1.0**, the only two dependencies are [`numpy`](https://numpy.org/doc/stable/index.html) and [`narwhals>=0.7.15`](https://marcogorelli.github.io/narwhals/).

The latter provides a compatibility layer between polars, pandas and other dataframe libraries. Therefore, as long as narwhals supports such a dataframe object, so do we.

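For example, the following sketch feeds the same splitter a pandas and then a polars series (polars installed separately); it assumes the narwhals-based compatibility described above carries over to `split` and that `gap`, `stride` and `window` have sensible defaults.

```python
import pandas as pd
import polars as pl

from timebasedcv import TimeBasedSplit

dates = pd.Series(pd.date_range("2023-01-01", periods=60, freq="D"))
values = pd.Series(range(60))

tbs = TimeBasedSplit(frequency="days", train_size=30, forecast_size=7)

# pandas series in, pandas slices out for each split.
for train, forecast in tbs.split(values, time_series=dates):
    print(len(train), len(forecast))

# polars series are assumed to work the same way through the narwhals layer.
for train, forecast in tbs.split(pl.from_pandas(values), time_series=pl.from_pandas(dates)):
    print(len(train), len(forecast))
```
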
## Quickstart

-As a **quickstart**, you can use the following code snippet to get started.
-Consider checkout out the [Getting Started](https://fbruzzesi.github.io/timebasedcv/getting-started/) section of for a detailed guide on how to use the library.
+The following code snippet is all you need to get started, yet consider checking out the [Getting Started](https://fbruzzesi.github.io/timebasedcv/getting-started/) section of the documentation for a detailed guide on how to use the library.

First, let's generate some data with a different number of points per day:

@@ -80,13 +99,15 @@ df = pd.concat([

time_series, X = df["time"], df["value"]
df.set_index("time").resample("D").count().head(5)
+```

-# time value
-# 2023-01-01 14
-# 2023-01-02 2
-# 2023-01-03 22
-# 2023-01-04 11
-# 2023-01-05 1
+```terminal
+time value
+2023-01-01 14
+2023-01-02 2
+2023-01-03 22
+2023-01-04 11
+2023-01-05 1
```

Now let's run the split with a given frequency, train size, test size, gap, stride and window type:
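
The snippet itself is collapsed in this view. A plausible sketch of it, reusing `X` and `time_series` from above and mirroring the loop in the Getting Started diff further down (parameter names assumed), would be:

```python
from timebasedcv import TimeBasedSplit

tbs = TimeBasedSplit(
    frequency="days",
    train_size=30,
    forecast_size=7,
    gap=0,
    stride=3,
    window="rolling",
)

# X and time_series come from the snippet above.
for X_train, X_forecast in tbs.split(X, time_series=time_series):
    print(f"Train: {X_train.shape}, Forecast: {X_forecast.shape}")
```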

30 changes: 18 additions & 12 deletions docs/getting-started.md

@@ -52,11 +52,12 @@ print(f"Number of splits: {tbs.n_splits_of(time_series=time_series)}")

for X_train, X_forecast, y_train, y_forecast in tbs.split(X, y, time_series=time_series):
print(f"Train: {X_train.shape}, Forecast: {X_forecast.shape}")

-# Train: (30, 2), Forecast: (7, 2)
-# Train: (30, 2), Forecast: (7, 2)
-# ...
-# Train: (30, 2), Forecast: (7, 2)
+```
+
+```terminal
+Train: (30, 2), Forecast: (7, 2)
+Train: (30, 2), Forecast: (7, 2)
+...
+Train: (30, 2), Forecast: (7, 2)
```

Another optional parameter that can be passed to the `split` method is `return_splitstate`. If `True`, the method will return a [`SplitState`](api/splitstate.md) dataclass which contains the "split" points for training and test, namely `train_start`, `train_end`, `forecast_start` and `forecast_end`. These can be useful if a particular logic needs to be applied to the data before training and/or forecasting.
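
A hedged sketch of how that might be used, assuming the `SplitState` is yielded as the last element of each item when `return_splitstate=True`:

```python
for split in tbs.split(X, y, time_series=time_series, return_splitstate=True):
    *arrays, split_state = split  # assumption: the SplitState comes last
    # The four split points can drive custom preprocessing before fitting.
    print(split_state.train_start, split_state.train_end, split_state.forecast_start, split_state.forecast_end)
```
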
@@ -106,7 +107,10 @@ random_search_cv = RandomizedSearchCV(
).fit(X, y)

random_search_cv.best_params_
-# {'positive': True, 'fit_intercept': False, 'alpha': 0.1}
+```
+
+```terminal
+{'positive': True, 'fit_intercept': False, 'alpha': 0.1}
```

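The construction of `cv` is collapsed in the hunk above. A sketch of how the pieces plausibly fit together, assuming `TimeBasedCVSplitter` is importable from the top-level package and accepts the same frequency/size/gap/stride/window arguments as `TimeBasedSplit` plus the `time_series` itself:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

from timebasedcv import TimeBasedCVSplitter

rng = np.random.default_rng(42)
time_series = pd.Series(pd.date_range("2023-01-01", "2023-03-31", freq="D"))
n = len(time_series)
X, y = rng.normal(size=(n, 2)), rng.normal(size=n)

cv = TimeBasedCVSplitter(
    frequency="days",
    train_size=30,
    forecast_size=7,
    gap=0,
    stride=3,
    window="rolling",
    time_series=time_series,  # passed at init so the number of splits is known a priori
)

# Ridge exposes the three hyperparameters seen in `best_params_` above.
param_distributions = {
    "alpha": [0.1, 0.5, 1.0],
    "fit_intercept": [True, False],
    "positive": [True, False],
}

random_search_cv = RandomizedSearchCV(
    estimator=Ridge(),
    param_distributions=param_distributions,
    cv=cv,
    n_iter=5,
    random_state=42,
).fit(X, y)

print(random_search_cv.best_params_)
```
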
## Examples of Cross Validation
@@ -133,13 +137,15 @@ df = pd.concat([

time_series, X = df["time"], df["value"]
df.set_index("time").resample("D").count().head(5)
+```

-# time value
-# 2023-01-01 14
-# 2023-01-02 2
-# 2023-01-03 22
-# 2023-01-04 11
-# 2023-01-05 1
+```terminal
+time value
+2023-01-01 14
+2023-01-02 2
+2023-01-03 22
+2023-01-04 11
+2023-01-05 1
```

As we can see, every day has a different number of points.

10 changes: 6 additions & 4 deletions docs/index.md

@@ -25,18 +25,20 @@ This codebase is experimental and is working for my use cases. It is very probab

The current implementation of [scikit-learn TimeSeriesSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html) lacks the flexibility of having multiple samples within the same time period/unit.

-This codebase addresses such problem by providing a cross validation strategy based on a time period rather than the number of samples. This is useful when the data is time dependent, and the model should be trained on past data and tested on future data, independently from the number of observations present within a given time period.
+This codebase addresses such a problem by providing a cross validation strategy based on a **time period** rather than the number of samples. This is useful when the data is time dependent and the model should be trained on past data and tested on future data, independently of the number of observations present within a given time period.

Temporal data leakage is an issue and we want to prevent that from happening!

We introduce two main classes:

-- [`TimeBasedSplit`](api/timebasedsplit.md#timebasedcv.timebasedsplit.TimeBasedSplit): a class that allows to define a time based split with a given frequency, train size, test size, gap, stride and window type. It's core method `split` requires to pass a time series as input to create the boolean masks for train and test from the instance information defined above. Therefore it is not compatible with [scikit-learn CV Splitters](https://scikit-learn.org/stable/common_pitfalls.html#id3).
-- [`TimeBasedCVSplitter`](api/timebasedsplit.md#timebasedcv.timebasedsplit.TimeBasedCVSplitter): a class that conforms with scikit-learn CV Splitters but requires to pass the time series as input to the instance. That is because a CV Splitter needs to know a priori the number of splits and the `split` method shouldn't take any extra arguments as input other than the arrays to split.
+- [`TimeBasedSplit`](api/timebasedsplit.md#timebasedcv.timebasedsplit.TimeBasedSplit) lets you define a time-based split with a given frequency, train size, test size, gap, stride and window type (see the short sketch after this list). Its core method `split` requires a time series as input to create the boolean masks for train and test from the instance information defined above. Therefore it is not compatible with [scikit-learn CV Splitters](https://scikit-learn.org/stable/common_pitfalls.html#id3).
+- [`TimeBasedCVSplitter`](api/timebasedsplit.md#timebasedcv.timebasedsplit.TimeBasedCVSplitter) conforms with scikit-learn CV Splitters but requires the time series to be passed to the instance. That is because a CV Splitter needs to know a priori the number of splits, and its `split` method shouldn't take any extra arguments other than the arrays to split.

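As a quick illustration of the window type mentioned in the first bullet, a minimal sketch, assuming the `window` argument accepts `"rolling"` and `"expanding"`:

```python
from timebasedcv import TimeBasedSplit

# Rolling: the train window keeps a fixed length (30 days here) and slides forward.
rolling = TimeBasedSplit(frequency="days", train_size=30, forecast_size=7, window="rolling")

# Expanding: the train window keeps its start fixed and grows at every split,
# so `train_size` acts as the minimum length (assumed semantics).
expanding = TimeBasedSplit(frequency="days", train_size=30, forecast_size=7, window="expanding")
```
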
## Installation

**timebasedcv** is a published Python package on [pypi](https://pypi.org/), therefore it can be installed directly via pip, as well as from source using pip and git, or with a local clone:

=== "pip"
=== "pip (suggested)"

```bash
python -m pip install timebasedcv

14 changes: 11 additions & 3 deletions pyproject.toml

@@ -4,7 +4,7 @@ build-backend = "hatchling.build"

[project]
name = "timebasedcv"
version = "0.0.2"
version = "0.1.0"
description = "Time based cross validation"

license = {file = "LICENSE"}
@@ -45,7 +45,7 @@ dev = [
]

lint = [
"ruff>=0.1.6"
"ruff>=0.4.0"
]

docs = [
@@ -76,7 +76,15 @@ packages = ["timebasedcv"]
line-length = 120

[tool.ruff.lint]
extend-select = ["I"]
extend-select = [
"E",
"F",
"I",
# "N", # pep8-naming
"W",
"PERF",
"RUF",
]
ignore = [
"E731", # do not assign a `lambda` expression, use a `def`
]

2 changes: 1 addition & 1 deletion tests/utils/backends_test.py

@@ -1,9 +1,9 @@
from contextlib import nullcontext as does_not_raise

+import narwhals as nw
import numpy as np
import pandas as pd
import pytest
-import narwhals as nw

from timebasedcv.utils._backends import (
BACKEND_TO_INDEXING_METHOD,

9 changes: 6 additions & 3 deletions timebasedcv/timebasedsplit.py

@@ -2,8 +2,8 @@
from datetime import timedelta
from itertools import chain
from typing import Generator, Tuple, Union, get_args
-import narwhals as nw

+import narwhals as nw
import numpy as np

from timebasedcv.splitstate import SplitState
@@ -371,7 +371,7 @@ def split(
ts_shape = time_series.shape
if len(ts_shape) != 1:
raise ValueError(f"Time series must be 1-dimensional. Got {len(ts_shape)} dimensions.")

arrays = tuple([nw.from_native(array, eager_only=True, allow_series=True, strict=False) for array in arrays])
time_series = nw.from_native(time_series, series_only=True, strict=False)
a0 = arrays[0]
@@ -396,7 +396,10 @@

train_forecast_arrays = tuple(
chain.from_iterable(
-(nw.to_native(_idx_method(_arr, train_mask), strict=False), nw.to_native(_idx_method(_arr, forecast_mask), strict=False))
+(
+    nw.to_native(_idx_method(_arr, train_mask), strict=False),
+    nw.to_native(_idx_method(_arr, forecast_mask), strict=False),
+)
for _arr, _idx_method in zip(arrays, _index_methods)
)
)
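
To see what the reformatted expression computes, here is a small standalone sketch with plain lists standing in for the narwhals-wrapped arrays and a trivial function standing in for the backend-specific indexing method (illustrative only):

```python
from itertools import chain


def take(arr, mask):
    # Stand-in for the backend-specific indexing method.
    return [value for value, keep in zip(arr, mask) if keep]


arrays = (["x1", "x2", "x3"], ["y1", "y2", "y3"])
index_methods = (take, take)
train_mask, forecast_mask = [True, True, False], [False, False, True]

# For every array, emit its train slice then its forecast slice, flattened into one tuple.
train_forecast_arrays = tuple(
    chain.from_iterable(
        (idx_method(arr, train_mask), idx_method(arr, forecast_mask))
        for arr, idx_method in zip(arrays, index_methods)
    )
)

print(train_forecast_arrays)
# (['x1', 'x2'], ['x3'], ['y1', 'y2'], ['y3']) i.e. X_train, X_forecast, y_train, y_forecast
```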

2 changes: 1 addition & 1 deletion timebasedcv/utils/_backends.py

@@ -1,7 +1,7 @@
from typing import Callable, Dict, TypeVar

-import numpy as np
import narwhals as nw
+import numpy as np


def default_indexing_method(arr, mask):

23 changes: 8 additions & 15 deletions timebasedcv/utils/_types.py

@@ -2,7 +2,7 @@

import sys
from datetime import date, datetime
-from typing import Literal, Protocol, Tuple, TypeVar, Union, TYPE_CHECKING
+from typing import TYPE_CHECKING, Literal, Protocol, Tuple, TypeVar, Union

if sys.version_info >= (3, 10):
from typing import TypeAlias # pragma: no cover
@@ -45,23 +45,17 @@ def max(self: Self) -> T:
...

@property
-def shape(self: Self) -> Tuple[int]:
-    ...
+def shape(self: Self) -> Tuple[int]: ...

-def __lt__(self: Self, other: Union[T, SeriesLike[T]]) -> SeriesLike[bool]:
-    ...
+def __lt__(self: Self, other: Union[T, SeriesLike[T]]) -> SeriesLike[bool]: ...

-def __gt__(self: Self, other: Union[T, SeriesLike[T]]) -> SeriesLike[bool]:
-    ...
+def __gt__(self: Self, other: Union[T, SeriesLike[T]]) -> SeriesLike[bool]: ...

-def __le__(self: Self, other: Union[T, SeriesLike[T]]) -> SeriesLike[bool]:
-    ...
+def __le__(self: Self, other: Union[T, SeriesLike[T]]) -> SeriesLike[bool]: ...

-def __ge__(self: Self, other: Union[T, SeriesLike[T]]) -> SeriesLike[bool]:
-    ...
+def __ge__(self: Self, other: Union[T, SeriesLike[T]]) -> SeriesLike[bool]: ...

-def __and__(self: SeriesLike[bool], other: SeriesLike[bool]) -> SeriesLike[bool]:
-    ...
+def __and__(self: SeriesLike[bool], other: SeriesLike[bool]) -> SeriesLike[bool]: ...


T_co = TypeVar("T_co", covariant=True)
@@ -76,5 +70,4 @@ class TensorLike(Protocol[T_co]):
"""

@property
-def shape(self: Self) -> Tuple[int, ...]:
-    ...
+def shape(self: Self) -> Tuple[int, ...]: ...
