Unit Tests #19

Closed · wants to merge 55 commits

Commits (55)
- 13c75ad create first unit test with pytest (diegomarvid, May 18, 2024)
- bbd87f0 add unit testing gh workflow (diegomarvid, May 18, 2024)
- d10df8d add `EncodeStep` tests (diegomarvid, May 18, 2024)
- 96da6c8 minor refactor (diegomarvid, May 18, 2024)
- 8123c43 add `TabularSplit` tests. `group_by_columns` test is `FAILING` (diegomarvid, May 18, 2024)
- 3a654e5 add test for `CalculateFeaturesStep` (diegomarvid, May 18, 2024)
- 591a349 ensure pytest fixtures in first functions for encoding and tabular tests (diegomarvid, May 18, 2024)
- 7c793c2 update encoding tests (diegomarvid, May 18, 2024)
- ce010ee remove `MAPE` metric and add `CalculateMetrics` tests (diegomarvid, May 18, 2024)
- 97108c2 add `CleanStep` tests (diegomarvid, May 18, 2024)
- 09481f3 fix unused import in calculate metrics test (diegomarvid, May 18, 2024)
- 5d06dd4 add `pipeline` test and fix important errors (diegomarvid, May 19, 2024)
- 86c9a6b install poetry with extras in unit testing workflow (diegomarvid, May 19, 2024)
- a5cb0b9 add test for `ModelStep` (diegomarvid, May 19, 2024)
- 74a9f4c add tests for `ModelRegistry` (diegomarvid, May 19, 2024)
- 03b33ba add tests for `StepRegistry` (diegomarvid, May 19, 2024)
- 5fdb73e delete unused files and functions (diegomarvid, May 19, 2024)
- dd63e7d add `log_experiment` tests in pipeline tests (diegomarvid, May 19, 2024)
- efbbd11 add ames housing testing performance to unit tests (diegomarvid, May 19, 2024)
- 7bef194 fix style checks (diegomarvid, May 19, 2024)
- 7ff55f2 Merge branch 'main' into unit-test (froukje, May 27, 2024)
- 4e29c42 style checks (froukje, May 27, 2024)
- 8180f25 adapt mlflow tests & log (froukje, May 27, 2024)
- c9c2eb1 Update unit-testing.yaml (froukje, May 27, 2024)
- 0e73a31 test workflow (froukje, May 28, 2024)
- c048c7d test workflow (froukje, May 28, 2024)
- b5f13cd Update readme (ovejabu, May 28, 2024)
- e5b7ae8 Update readme warning (ovejabu, May 29, 2024)
- 1a6aa62 create first unit test with pytest (diegomarvid, May 18, 2024)
- 5493994 add unit testing gh workflow (diegomarvid, May 18, 2024)
- b8578ea add `EncodeStep` tests (diegomarvid, May 18, 2024)
- 3301385 minor refactor (diegomarvid, May 18, 2024)
- 104ed26 add `TabularSplit` tests. `group_by_columns` test is `FAILING` (diegomarvid, May 18, 2024)
- fee81a4 add test for `CalculateFeaturesStep` (diegomarvid, May 18, 2024)
- 474c60d ensure pytest fixtures in first functions for encoding and tabular tests (diegomarvid, May 18, 2024)
- 6fc4c55 update encoding tests (diegomarvid, May 18, 2024)
- 643f511 remove `MAPE` metric and add `CalculateMetrics` tests (diegomarvid, May 18, 2024)
- 559cad7 add `CleanStep` tests (diegomarvid, May 18, 2024)
- 7238a70 fix unused import in calculate metrics test (diegomarvid, May 18, 2024)
- 3f48dc9 add `pipeline` test and fix important errors (diegomarvid, May 19, 2024)
- a72a32d install poetry with extras in unit testing workflow (diegomarvid, May 19, 2024)
- 2a9e414 add test for `ModelStep` (diegomarvid, May 19, 2024)
- b710e10 add tests for `ModelRegistry` (diegomarvid, May 19, 2024)
- 4331029 add tests for `StepRegistry` (diegomarvid, May 19, 2024)
- f40adc2 delete unused files and functions (diegomarvid, May 19, 2024)
- f0b7807 add `log_experiment` tests in pipeline tests (diegomarvid, May 19, 2024)
- 9a751aa add ames housing testing performance to unit tests (diegomarvid, May 19, 2024)
- 75f7869 fix style checks (diegomarvid, May 19, 2024)
- fe3f73d style checks (froukje, May 27, 2024)
- c02cc74 adapt mlflow tests & log (froukje, May 27, 2024)
- e8e6b6e Update unit-testing.yaml (froukje, May 27, 2024)
- 33eb9b9 test workflow (froukje, May 28, 2024)
- 7ba2558 test workflow (froukje, May 28, 2024)
- 6178249 Merge branch 'unit-test' of github.com:tryolabs/pipeline-lib into uni… (froukje, May 30, 2024)
- 53e1c02 merge conflicts (froukje, May 30, 2024)
33 changes: 33 additions & 0 deletions .github/workflows/unit-testing.yaml

@@ -0,0 +1,33 @@

```yaml
name: Unit testing

on: push

jobs:
  unit-testing:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repo
        uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4

      - name: Install Poetry
        uses: snok/install-poetry@v1
        with:
          virtualenvs-create: true
          virtualenvs-in-project: true

      # Allow loading a cached venv created in a previous run if the lock file is identical
      - name: Load cached venv if it exists
        id: venv-cache
        uses: actions/cache@v3
        with:
          path: .venv
          key: venv-${{ runner.os }}-${{ hashFiles('**/poetry.lock', '**/pyproject.toml') }}

      - name: Install dependencies
        run: poetry install --no-interaction --extras all_models

      - name: Run tests
        run: poetry run pytest
```
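The cache step exposes a `cache-hit` output that this workflow never consults, so `poetry install` runs unconditionally even when `.venv` was restored. A minimal sketch of how the install step could be gated on that output, reusing the `venv-cache` id from above (a common pattern with `actions/cache`, not something this PR does):

```yaml
      # Hypothetical refinement: skip `poetry install` when the cached .venv
      # was restored for an identical poetry.lock / pyproject.toml.
      - name: Install dependencies
        if: steps.venv-cache.outputs.cache-hit != 'true'
        run: poetry install --no-interaction --extras all_models
```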
5 changes: 5 additions & 0 deletions .vscode/settings.json

```diff
@@ -54,4 +54,9 @@
   // TODO: this setting is showing a deprecation warning. Maybe we should drop it?
   "jupyter.generateSVGPlots": true,
   "autoDocstring.docstringFormat": "numpy",
+  "python.testing.pytestArgs": [
+    "tests"
+  ],
+  "python.testing.unittestEnabled": false,
+  "python.testing.pytestEnabled": true,
 }
```
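These settings point VS Code's test explorer at pytest with `tests` as the root directory. The equivalent invocation from a terminal inside the Poetry environment would be:

```bash
# Same pytest arguments as the VS Code configuration above
poetry run pytest tests
```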
162 changes: 80 additions & 82 deletions README.md

````diff
@@ -1,20 +1,21 @@
-# Pipeline Library
+# ML-GARDEN

-The Pipeline Library is a powerful and flexible tool designed to simplify the creation and management of machine learning pipelines. It provides a high-level interface for defining and executing pipelines, allowing users to focus on the core aspects of their machine learning projects. The library currently supports XGBoost models, with plans to expand support for more models in the future.
+ml-garden is a pipeline library that simplifies the creation and management of machine learning projects. It offers a high-level interface for defining and executing pipelines, allowing users to focus on their projects without getting lost in details. It currently supports XGBoost models for regression tasks on tabular data, with plans to expand support for more models in the future.
+The key components of the pipeline include Pipeline Steps, which are predefined steps connected to pass information through a data container; a Config File for setting pipeline steps and parameters; and a Data Container for storing and transferring essential data and results throughout the pipeline, facilitating effective data processing and analysis in machine learning projects.

 > [!WARNING]
-> This library is in the early stages of development and is not yet ready for production use. The API and functionality may change without notice. Use at your own risk.
+> Please be advised that this library is currently in the early stages of development and is not recommended for production use at this time. The API and functionality of the library may undergo changes without prior notice. Kindly use it at your own discretion and be aware of the associated risks.

 ## Features

-* Intuitive and easy-to-use API for defining pipeline steps and configurations
-* Support for various data loading formats, including CSV and Parquet
-* Flexible data preprocessing steps, such as data cleaning, feature calculation, and encoding
-* Seamless integration with XGBoost for model training and prediction
-* Hyperparameter optimization using Optuna for fine-tuning models
-* Evaluation metrics calculation and reporting
-* Explainable AI (XAI) dashboard for model interpretability
-* Extensible architecture for adding custom pipeline steps
+- Intuitive and easy-to-use API for defining pipeline steps and configurations
+- Support for various data loading formats, including CSV and Parquet
+- Flexible data preprocessing steps, such as data cleaning, feature calculation, and encoding
+- Seamless integration with XGBoost for model training and prediction
+- Hyperparameter optimization using Optuna for fine-tuning models
+- Evaluation metrics calculation and reporting
+- Explainable AI (XAI) dashboard for model interpretability
+- Extensible architecture for adding custom pipeline steps

 ## Installation

@@ -23,95 +24,92 @@ To install the Pipeline Library, you need to have Python 3.9 or higher and Poetr
 1. Clone the repository:

    ```bash
-   git clone https://github.com/tryolabs/pipeline-lib.git
+   git clone https://github.com/tryolabs/ml-garden.git
    ```

 2. Navigate to the project directory:

-```bash
-cd pipeline-lib
-```
+   ```bash
+   cd ml-garden
+   ```

 3. Install the dependencies using Poetry:

-```bash
-poetry install
-```
+   ```bash
+   poetry install
+   ```

-If you want to include optional dependencies, you can specify the extras:
+   If you want to include optional dependencies, you can specify the extras:

-```bash
-poetry install --extras "xgboost"
-```
+   ```bash
+   poetry install --extras "xgboost"
+   ```

-or
+   or

-```bash
-poetry install --extras "all_models"
-```
+   ```bash
+   poetry install --extras "all_models"
+   ```
````

## Usage

Here's an example of how to use the library to run an XGBoost pipeline:

1. Create a `config.json` file with the following content:
```json
{
  "pipeline": {
    "name": "XGBoostTrainingPipeline",
    "description": "Training pipeline for XGBoost models.",
    "parameters": {
      "save_data_path": "ames_housing.pkl",
      "target": "SalePrice",
      "tracking": {
        "experiment": "ames_housing",
        "run": "baseline"
      }
    },
    "steps": [
      {
        "step_type": "GenerateStep",
        "parameters": {
          "train_path": "examples/ames_housing/data/train.csv",
          "predict_path": "examples/ames_housing/data/test.csv",
          "drop_columns": [
            "Id"
          ]
        }
      },
      {
        "step_type": "TabularSplitStep",
        "parameters": {
          "train_percentage": 0.7,
          "validation_percentage": 0.2,
          "test_percentage": 0.1
        }
      },
      {
        "step_type": "CleanStep"
      },
      {
        "step_type": "EncodeStep"
      },
      {
        "step_type": "ModelStep",
        "parameters": {
          "model_class": "XGBoost"
        }
      },
      {
        "step_type": "CalculateMetricsStep"
      },
      {
        "step_type": "ExplainerDashboardStep",
        "parameters": {
          "enable_step": false
        }
      }
    ]
  }
}
```

````diff
@@ -120,7 +118,7 @@ Here's an example of how to use the library to run an XGBoost pipeline:
 ```python
 import logging

-from pipeline_lib import Pipeline
+from ml_garden import Pipeline

 logging.basicConfig(level=logging.INFO)
````
@@ -143,14 +141,14 @@ This will use the DataFrame provided in code, not needing the `predict_path` fil
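The diff collapses the rest of the usage example. A minimal sketch of how such a pipeline might be driven end to end, where `from_json`, `train`, and `predict` are assumed method names inferred from the config file and the surrounding text, not confirmed by this diff:

```python
import logging

import pandas as pd

from ml_garden import Pipeline  # module name assumed from the package rename

logging.basicConfig(level=logging.INFO)

# Build the pipeline from the JSON config shown above (constructor name assumed).
pipeline = Pipeline.from_json("config.json")

# Run the training pipeline end to end.
pipeline.train()

# Predict on a DataFrame supplied in code, so no `predict_path` is needed,
# matching the behavior described in the surrounding README text.
df = pd.read_csv("examples/ames_housing/data/test.csv")
predictions = pipeline.predict(df)
```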

The library allows users to define custom steps for data generation, cleaning, and preprocessing, which can be seamlessly integrated into the pipeline.

## Performance and Memory Profiling

We've added pyinstrument and memray as development dependencies for optimizing the performance and memory usage of the library. Refer to each tool's documentation for usage notes, and see the sketch after this list for typical invocations:

- [memray](https://github.com/bloomberg/memray?tab=readme-ov-file#usage)
- [pyinstrument](https://pyinstrument.readthedocs.io/en/latest/guide.html#profile-a-python-cli-command)
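A rough sketch of pointing both profilers at the test suite; the command shapes follow the linked docs, and `tests` is assumed to be this repo's test directory:

```bash
# Wall-clock profiling of the test suite with pyinstrument
poetry run pyinstrument -m pytest tests

# Memory profiling with memray: record an allocation trace, then render it
poetry run memray run --output pytest-profile.bin -m pytest tests
poetry run memray flamegraph pytest-profile.bin  # writes an HTML flame graph
```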


## Contributing

Contributions to the Pipeline Library are welcome! If you encounter any issues, have suggestions for improvements, or want to add new features, please open an issue or submit a pull request on the GitHub repository.
47 changes: 0 additions & 47 deletions init.sh

This file was deleted.

60 changes: 0 additions & 60 deletions pipeline_lib/core/data_container.py

```diff
@@ -10,7 +10,6 @@

 import dill as pickle
 import pandas as pd
-import yaml
 from explainerdashboard.explainers import BaseExplainer
 from sklearn.compose import ColumnTransformer

@@ -246,65 +245,6 @@ def from_pickle(
         new_container.logger.info(f"{cls.__name__} loaded from {file_path}")
         return new_container

-    @classmethod
-    def from_json(cls, file_path: str) -> DataContainer:
-        """
-        Create a new DataContainer instance from a JSON file.
-
-        Parameters
-        ----------
-        file_path : str
-            The path to the JSON file containing the configurations.
-
-        Returns
-        -------
-        DataContainer
-            A new instance of DataContainer populated with the data from the JSON file.
-        """
-        # Check file is a JSON file
-        if not file_path.endswith(".json"):
-            raise ValueError(f"File {file_path} is not a JSON file")
-
-        with open(file_path, "r") as f:
-            data = json.load(f)
-
-        # The loaded data is used as the initial data for the DataContainer instance
-        return cls(initial_data=data)
-
-    @classmethod
-    def from_yaml(cls, file_path: str) -> DataContainer:
-        """
-        Create a new DataContainer instance from a YAML file.
-
-        Parameters
-        ----------
-        file_path : str
-            The path to the YAML file containing the configurations.
-
-        Returns
-        -------
-        DataContainer
-            A new instance of DataContainer populated with the data from the YAML file.
-
-        Raises
-        ------
-        ValueError
-            If the provided file is not a YAML file.
-        """
-        # Check if the file has a .yaml or .yml extension
-        if not (file_path.endswith(".yaml") or file_path.endswith(".yml")):
-            raise ValueError(f"The file {file_path} is not a YAML file.")
-
-        try:
-            with open(file_path, "r") as f:
-                data = yaml.safe_load(f)
-        except yaml.YAMLError as e:
-            # Handle cases where the file content is not valid YAML
-            raise ValueError(f"Error parsing YAML from {file_path}: {e}")
-
-        # The loaded data is used as the initial data for the DataContainer instance
-        return cls(initial_data=data)
-
     @property
     def clean(self) -> pd.DataFrame:
         """
```