Unit Tests #19

Closed · wants to merge 55 commits

Commits (55)
- 13c75ad create first unit test with pytest (diegomarvid, May 18, 2024)
- bbd87f0 add unit testing gh workflow (diegomarvid, May 18, 2024)
- d10df8d add `EncodeStep` tests (diegomarvid, May 18, 2024)
- 96da6c8 minor refactor (diegomarvid, May 18, 2024)
- 8123c43 add `TabularSplit` tests. `group_by_columns` test is `FAILING` (diegomarvid, May 18, 2024)
- 3a654e5 add test for `CalculateFeaturesStep` (diegomarvid, May 18, 2024)
- 591a349 ensure pytest fixtures in first functions for encoding and tabular tests (diegomarvid, May 18, 2024)
- 7c793c2 update encoding tests (diegomarvid, May 18, 2024)
- ce010ee remove `MAPE` metric and add `CalculateMetrics` tests (diegomarvid, May 18, 2024)
- 97108c2 add `CleanStep` tests (diegomarvid, May 18, 2024)
- 09481f3 fix unused import in calculate metrics test (diegomarvid, May 18, 2024)
- 5d06dd4 add `pipeline` test and fix important errors (diegomarvid, May 19, 2024)
- 86c9a6b install poetry with extras in unit testing workflow (diegomarvid, May 19, 2024)
- a5cb0b9 add test for `ModelStep` (diegomarvid, May 19, 2024)
- 74a9f4c add tests for `ModelRegistry` (diegomarvid, May 19, 2024)
- 03b33ba add tests for `StepRegistry` (diegomarvid, May 19, 2024)
- 5fdb73e delete unused files and functions (diegomarvid, May 19, 2024)
- dd63e7d add `log_experiment` tests in pipeline tests (diegomarvid, May 19, 2024)
- efbbd11 add ames housing testing performance to unit tests (diegomarvid, May 19, 2024)
- 7bef194 fix style checks (diegomarvid, May 19, 2024)
- 7ff55f2 Merge branch 'main' into unit-test (froukje, May 27, 2024)
- 4e29c42 style checks (froukje, May 27, 2024)
- 8180f25 adapt mlflow tests & log (froukje, May 27, 2024)
- c9c2eb1 Update unit-testing.yaml (froukje, May 27, 2024)
- 0e73a31 test workflow (froukje, May 28, 2024)
- c048c7d test workflow (froukje, May 28, 2024)
- b5f13cd Update readme (ovejabu, May 28, 2024)
- e5b7ae8 Update readme warning (ovejabu, May 29, 2024)
- 1a6aa62 create first unit test with pytest (diegomarvid, May 18, 2024)
- 5493994 add unit testing gh workflow (diegomarvid, May 18, 2024)
- b8578ea add `EncodeStep` tests (diegomarvid, May 18, 2024)
- 3301385 minor refactor (diegomarvid, May 18, 2024)
- 104ed26 add `TabularSplit` tests. `group_by_columns` test is `FAILING` (diegomarvid, May 18, 2024)
- fee81a4 add test for `CalculateFeaturesStep` (diegomarvid, May 18, 2024)
- 474c60d ensure pytest fixtures in first functions for encoding and tabular tests (diegomarvid, May 18, 2024)
- 6fc4c55 update encoding tests (diegomarvid, May 18, 2024)
- 643f511 remove `MAPE` metric and add `CalculateMetrics` tests (diegomarvid, May 18, 2024)
- 559cad7 add `CleanStep` tests (diegomarvid, May 18, 2024)
- 7238a70 fix unused import in calculate metrics test (diegomarvid, May 18, 2024)
- 3f48dc9 add `pipeline` test and fix important errors (diegomarvid, May 19, 2024)
- a72a32d install poetry with extras in unit testing workflow (diegomarvid, May 19, 2024)
- 2a9e414 add test for `ModelStep` (diegomarvid, May 19, 2024)
- b710e10 add tests for `ModelRegistry` (diegomarvid, May 19, 2024)
- 4331029 add tests for `StepRegistry` (diegomarvid, May 19, 2024)
- f40adc2 delete unused files and functions (diegomarvid, May 19, 2024)
- f0b7807 add `log_experiment` tests in pipeline tests (diegomarvid, May 19, 2024)
- 9a751aa add ames housing testing performance to unit tests (diegomarvid, May 19, 2024)
- 75f7869 fix style checks (diegomarvid, May 19, 2024)
- fe3f73d style checks (froukje, May 27, 2024)
- c02cc74 adapt mlflow tests & log (froukje, May 27, 2024)
- e8e6b6e Update unit-testing.yaml (froukje, May 27, 2024)
- 33eb9b9 test workflow (froukje, May 28, 2024)
- 7ba2558 test workflow (froukje, May 28, 2024)
- 6178249 Merge branch 'unit-test' of github.com:tryolabs/pipeline-lib into uni… (froukje, May 30, 2024)
- 53e1c02 merge conflicts (froukje, May 30, 2024)
33 changes: 33 additions & 0 deletions .github/workflows/unit-testing.yaml

@@ -0,0 +1,33 @@

```yaml
name: Unit testing

on: push

jobs:
  unit-testing:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repo
        uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4

      - name: Install Poetry
        uses: snok/install-poetry@v1
        with:
          virtualenvs-create: true
          virtualenvs-in-project: true

      # Allow loading a cached venv created in a previous run if the lock file is identical
      - name: Load cached venv if it exists
        id: venv-cache
        uses: actions/cache@v3
        with:
          path: .venv
          key: venv-${{ runner.os }}-${{ hashFiles('**/poetry.lock', '**/pyproject.toml') }}

      - name: Install dependencies
        run: poetry install --no-interaction --extras all_models

      - name: Run tests
        run: poetry run pytest
```
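The cache step exposes a `cache-hit` output that this workflow never consults, so `poetry install` runs unconditionally even when `.venv` was restored. A minimal sketch of how the install step could be gated on that output, reusing the `venv-cache` id from above (a common pattern with `actions/cache`, not something this PR does):

```yaml
      # Hypothetical refinement: skip `poetry install` when the cached .venv
      # was restored for an identical poetry.lock / pyproject.toml.
      - name: Install dependencies
        if: steps.venv-cache.outputs.cache-hit != 'true'
        run: poetry install --no-interaction --extras all_models
```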
5 changes: 5 additions & 0 deletions .vscode/settings.json

```diff
@@ -54,4 +54,9 @@
   // TODO: this setting is showing a deprecation warning. Maybe we should drop it?
   "jupyter.generateSVGPlots": true,
   "autoDocstring.docstringFormat": "numpy",
+  "python.testing.pytestArgs": [
+    "tests"
+  ],
+  "python.testing.unittestEnabled": false,
+  "python.testing.pytestEnabled": true,
 }
```
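These settings point VS Code's test explorer at pytest with `tests` as the root directory. The equivalent invocation from a terminal inside the Poetry environment would be:

```bash
# Same pytest arguments as the VS Code configuration above
poetry run pytest tests
```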
162 changes: 80 additions & 82 deletions README.md

````diff
@@ -1,20 +1,21 @@
-# Pipeline Library
+# ML-GARDEN

-The Pipeline Library is a powerful and flexible tool designed to simplify the creation and management of machine learning pipelines. It provides a high-level interface for defining and executing pipelines, allowing users to focus on the core aspects of their machine learning projects. The library currently supports XGBoost models, with plans to expand support for more models in the future.
+ml-garden is a pipeline library that simplifies the creation and management of machine learning projects. It offers a high-level interface for defining and executing pipelines, allowing users to focus on their projects without getting lost in details. It currently supports XGBoost models for regression tasks on tabular data, with plans to expand support for more models in the future.
+The key components of the pipeline include Pipeline Steps, which are predefined steps connected to pass information through a data container; a Config File for setting pipeline steps and parameters; and a Data Container for storing and transferring essential data and results throughout the pipeline, facilitating effective data processing and analysis in machine learning projects.

 > [!WARNING]
-> This library is in the early stages of development and is not yet ready for production use. The API and functionality may change without notice. Use at your own risk.
+> Please be advised that this library is currently in the early stages of development and is not recommended for production use at this time. The API and functionality of the library may undergo changes without prior notice. Kindly use it at your own discretion and be aware of the associated risks.

 ## Features

-* Intuitive and easy-to-use API for defining pipeline steps and configurations
-* Support for various data loading formats, including CSV and Parquet
-* Flexible data preprocessing steps, such as data cleaning, feature calculation, and encoding
-* Seamless integration with XGBoost for model training and prediction
-* Hyperparameter optimization using Optuna for fine-tuning models
-* Evaluation metrics calculation and reporting
-* Explainable AI (XAI) dashboard for model interpretability
-* Extensible architecture for adding custom pipeline steps
+- Intuitive and easy-to-use API for defining pipeline steps and configurations
+- Support for various data loading formats, including CSV and Parquet
+- Flexible data preprocessing steps, such as data cleaning, feature calculation, and encoding
+- Seamless integration with XGBoost for model training and prediction
+- Hyperparameter optimization using Optuna for fine-tuning models
+- Evaluation metrics calculation and reporting
+- Explainable AI (XAI) dashboard for model interpretability
+- Extensible architecture for adding custom pipeline steps

 ## Installation

@@ -23,95 +24,92 @@ To install the Pipeline Library, you need to have Python 3.9 or higher and Poetr
 1. Clone the repository:

    ```bash
-   git clone https://github.com/tryolabs/pipeline-lib.git
+   git clone https://github.com/tryolabs/ml-garden.git
    ```

 2. Navigate to the project directory:

-```bash
-cd pipeline-lib
-```
+   ```bash
+   cd ml-garden
+   ```

 3. Install the dependencies using Poetry:

-```bash
-poetry install
-```
+   ```bash
+   poetry install
+   ```

-If you want to include optional dependencies, you can specify the extras:
+   If you want to include optional dependencies, you can specify the extras:

-```bash
-poetry install --extras "xgboost"
-```
+   ```bash
+   poetry install --extras "xgboost"
+   ```

-or
+   or

-```bash
-poetry install --extras "all_models"
-```
+   ```bash
+   poetry install --extras "all_models"
+   ```
````

## Usage

Here's an example of how to use the library to run an XGBoost pipeline:

1. Create a `config.json` file with the following content:
```json
{
  "pipeline": {
    "name": "XGBoostTrainingPipeline",
    "description": "Training pipeline for XGBoost models.",
    "parameters": {
      "save_data_path": "ames_housing.pkl",
      "target": "SalePrice",
      "tracking": {
        "experiment": "ames_housing",
        "run": "baseline"
      }
    },
    "steps": [
      {
        "step_type": "GenerateStep",
        "parameters": {
          "train_path": "examples/ames_housing/data/train.csv",
          "predict_path": "examples/ames_housing/data/test.csv",
          "drop_columns": [
            "Id"
          ]
        }
      },
      {
        "step_type": "TabularSplitStep",
        "parameters": {
          "train_percentage": 0.7,
          "validation_percentage": 0.2,
          "test_percentage": 0.1
        }
      },
      {
        "step_type": "CleanStep"
      },
      {
        "step_type": "EncodeStep"
      },
      {
        "step_type": "ModelStep",
        "parameters": {
          "model_class": "XGBoost"
        }
      },
      {
        "step_type": "CalculateMetricsStep"
      },
      {
        "step_type": "ExplainerDashboardStep",
        "parameters": {
          "enable_step": false
        }
      }
    ]
  }
}
```

````diff
@@ -120,7 +118,7 @@ Here's an example of how to use the library to run an XGBoost pipeline:
 ```python
 import logging

-from pipeline_lib import Pipeline
+from ml_garden import Pipeline

 logging.basicConfig(level=logging.INFO)
````
@@ -143,14 +141,14 @@ This will use the DataFrame provided in code, not needing the `predict_path` fil
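The diff collapses the rest of the usage example. A minimal sketch of how such a pipeline might be driven end to end, where `from_json`, `train`, and `predict` are assumed method names inferred from the config file and the surrounding text, not confirmed by this diff:

```python
import logging

import pandas as pd

from ml_garden import Pipeline  # module name assumed from the package rename

logging.basicConfig(level=logging.INFO)

# Build the pipeline from the JSON config shown above (constructor name assumed).
pipeline = Pipeline.from_json("config.json")

# Run the training pipeline end to end.
pipeline.train()

# Predict on a DataFrame supplied in code, so no `predict_path` is needed,
# matching the behavior described in the surrounding README text.
df = pd.read_csv("examples/ames_housing/data/test.csv")
predictions = pipeline.predict(df)
```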

The library allows users to define custom steps for data generation, cleaning, and preprocessing, which can be seamlessly integrated into the pipeline.

## Performance and Memory Profiling

We've added pyinstrument and memray as development dependencies for optimizing the performance and memory usage of the library. Refer to each tool's documentation for usage notes, and see the sketch after this list for typical invocations:

- [memray](https://github.com/bloomberg/memray?tab=readme-ov-file#usage)
- [pyinstrument](https://pyinstrument.readthedocs.io/en/latest/guide.html#profile-a-python-cli-command)
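A rough sketch of pointing both profilers at the test suite; the command shapes follow the linked docs, and `tests` is assumed to be this repo's test directory:

```bash
# Wall-clock profiling of the test suite with pyinstrument
poetry run pyinstrument -m pytest tests

# Memory profiling with memray: record an allocation trace, then render it
poetry run memray run --output pytest-profile.bin -m pytest tests
poetry run memray flamegraph pytest-profile.bin  # writes an HTML flame graph
```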


## Contributing

Contributions to the Pipeline Library are welcome! If you encounter any issues, have suggestions for improvements, or want to add new features, please open an issue or submit a pull request on the GitHub repository.
47 changes: 0 additions & 47 deletions init.sh

This file was deleted.

60 changes: 0 additions & 60 deletions pipeline_lib/core/data_container.py

```diff
@@ -10,7 +10,6 @@

 import dill as pickle
 import pandas as pd
-import yaml
 from explainerdashboard.explainers import BaseExplainer
 from sklearn.compose import ColumnTransformer

@@ -246,65 +245,6 @@ def from_pickle(
         new_container.logger.info(f"{cls.__name__} loaded from {file_path}")
         return new_container

-    @classmethod
-    def from_json(cls, file_path: str) -> DataContainer:
-        """
-        Create a new DataContainer instance from a JSON file.
-
-        Parameters
-        ----------
-        file_path : str
-            The path to the JSON file containing the configurations.
-
-        Returns
-        -------
-        DataContainer
-            A new instance of DataContainer populated with the data from the JSON file.
-        """
-        # Check file is a JSON file
-        if not file_path.endswith(".json"):
-            raise ValueError(f"File {file_path} is not a JSON file")
-
-        with open(file_path, "r") as f:
-            data = json.load(f)
-
-        # The loaded data is used as the initial data for the DataContainer instance
-        return cls(initial_data=data)
-
-    @classmethod
-    def from_yaml(cls, file_path: str) -> DataContainer:
-        """
-        Create a new DataContainer instance from a YAML file.
-
-        Parameters
-        ----------
-        file_path : str
-            The path to the YAML file containing the configurations.
-
-        Returns
-        -------
-        DataContainer
-            A new instance of DataContainer populated with the data from the YAML file.
-
-        Raises
-        ------
-        ValueError
-            If the provided file is not a YAML file.
-        """
-        # Check if the file has a .yaml or .yml extension
-        if not (file_path.endswith(".yaml") or file_path.endswith(".yml")):
-            raise ValueError(f"The file {file_path} is not a YAML file.")
-
-        try:
-            with open(file_path, "r") as f:
-                data = yaml.safe_load(f)
-        except yaml.YAMLError as e:
-            # Handle cases where the file content is not valid YAML
-            raise ValueError(f"Error parsing YAML from {file_path}: {e}")
-
-        # The loaded data is used as the initial data for the DataContainer instance
-        return cls(initial_data=data)
-
     @property
     def clean(self) -> pd.DataFrame:
         """
```