Exploring machine learning engineering datasets, tools, and approaches

The purpose of this repo is to create a machine learning pipeline to analyze multi-modal biomedical data, implementing MLOps best practices and deploying in a cloud environment.

Project components:

Data preparation:
- Use a dataset from The Cancer Genome Atlas (TCGA) Program. Specifically, I use data from the following paper on uterine carcinosarcoma: https://www.cell.com/cancer-cell/fulltext/S1535-6108(17)30053-3
- Preprocess and clean the data using Python libraries.
Model development:
- Implement a simple neural network model using PyTorch to predict patient remission (notebook).
- Start with clinical features before integrating genomic data.
MLOps Pipeline:
- Set up a version control system using Git for code management (this repo).
- Implement experiment tracking and model versioning using Weights and Biases.
- Set up a virtual environment, which can also be used in combination with a Dockerfile for reproducibility.
Cloud Deployment:
- Deploy the model on AWS SageMaker.
- Implement an API endpoint for model inference.
- Note: I am currently troubleshooting this step.
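
As a rough illustration of the data preparation step, the sketch below cleans a small clinical table using only the standard library. The column names (`patient_id`, `age`, `stage`, `remission`) and all values are made up for illustration. They are not the actual TCGA fields, and the real preprocessing in this repo may well use other libraries.

```python
import csv
import io

# Hypothetical clinical table: column names and values are invented for this sketch.
RAW_CSV = """patient_id,age,stage,remission
P1, 61 ,II,yes
P2,,III,no
P3,58,I,
P4,70,IV,YES
"""


def clean_clinical_rows(text):
    """Return rows with stripped fields, a numeric age, and a 0/1 remission label.

    Rows without a usable remission label are dropped; a missing age becomes None.
    """
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        # Strip stray whitespace and treat missing fields as empty strings
        row = {key: (value or "").strip() for key, value in row.items()}
        label = row["remission"].lower()
        if label not in {"yes", "no"}:
            continue  # an unlabeled patient cannot be used for supervised training
        cleaned.append(
            {
                "patient_id": row["patient_id"],
                "age": float(row["age"]) if row["age"] else None,
                "stage": row["stage"],
                "remission": 1 if label == "yes" else 0,
            }
        )
    return cleaned


rows = clean_clinical_rows(RAW_CSV)
```

Keeping a missing age as `None` rather than dropping the row leaves the imputation decision to a later step.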

Setup

I prefer to install packages with mamba. I installed it with Miniforge release 24.9.0-0.

I cloned data-science-base-repo, renamed it to wandb-explore, then updated the environment YAML file to be specific to this project.

git clone https://github.com/benslack19/data-science-base-repo.git wandb-explore
cd wandb-explore

Create mamba environment:

mamba env create --file=environments/wandb.yml

Install pre-commit hooks:

pre-commit install

Usage

Activate the project environment:

mamba activate wandb
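
To confirm from inside Python that the expected environment is active, a small check like the one below can help. It is a sketch that relies on conda-style activation setting the `CONDA_DEFAULT_ENV` variable; the helper names are hypothetical and not part of this repo.

```python
import os

EXPECTED_ENV = "wandb"  # environment name used in the activation command above


def active_env_name(environ=None):
    """Return the active conda/mamba environment name, or None if none is set."""
    env = os.environ if environ is None else environ
    return env.get("CONDA_DEFAULT_ENV")


def is_expected_env(environ=None, expected=EXPECTED_ENV):
    """Return True when the active environment matches the expected name."""
    return active_env_name(environ) == expected


if __name__ == "__main__":
    print("active environment:", active_env_name())
    print("matches expected:", is_expected_env())
```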

- Code!
- Commit with Conventional Commits.
- Run pre-commit on files:

# all files
pre-commit run --all-files
# or a specific file
pre-commit run --files ./project/utils.py

Or run specific tools:

# ruff only
ruff check .
ruff format . 

# mypy only
mypy .
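
For the Conventional Commits step mentioned above, the following sketch checks whether a commit header has the `type(scope)!: description` shape. It is a simplified reading of the convention, not a full validator, and whether this repo's pre-commit configuration enforces the format is not stated here. The list of allowed types beyond `feat` and `fix` is an assumption based on common usage.

```python
import re

# Header shape: type, optional (scope), optional "!" marking a breaking change,
# then ": " followed by a description.
HEADER_PATTERN = re.compile(
    r"^(?P<type>[a-z]+)"
    r"(?:\((?P<scope>[^()\s]+)\))?"
    r"(?P<breaking>!)?"
    r": (?P<description>\S.*)$"
)

# "feat" and "fix" are the commonly cited core types; the rest are assumed additions.
COMMON_TYPES = {
    "feat", "fix", "docs", "style", "refactor",
    "perf", "test", "build", "ci", "chore",
}


def parse_commit_header(message):
    """Return the parsed header fields, or None if the first line does not match."""
    lines = message.splitlines()
    if not lines:
        return None
    match = HEADER_PATTERN.match(lines[0])
    if match is None or match["type"] not in COMMON_TYPES:
        return None
    return {
        "type": match["type"],
        "scope": match["scope"],
        "breaking": match["breaking"] is not None,
        "description": match["description"],
    }
```

A message such as `feat(model): add dropout layer` parses successfully, while a free-form message such as `updated notebook` returns `None`.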

Formatting notebooks

In the settings.json for the JupyterLab Code Formatter extension:

{
  "preferences": {
    "default_formatter": {
      "python": ["ruff", "black"]
    }
  },
  "formatOnSave": true
}

Note that ruff does not autofix long line lengths. Therefore, it helps to use ruff in combination with black.

Formatting scripts using VSCode

Install the Ruff extension.
Install the Black formatter extension.
Ensure there are no potential conflicts for extensions (e.g. isort is disabled).

# Other configurations
If using iTerm2, it can help to enable word-by-word cursor movement under `Settings -> Profiles -> Keys -> Key Mappings`, as explained [here](https://stackoverflow.com/questions/81272/how-to-move-the-cursor-word-by-word-in-the-os-x-terminal) and [here](https://coderwall.com/p/h6yfda/use-and-to-jump-forwards-backwards-words-in-iterm-2-on-os-x).


## Examples to verify formatting

Paste these examples as a way to verify formatting in your notebook or script is working.

```python
# formatting should alphabetize this list of packages
import seaborn as sns
import pandas as pd
import numpy as np

# formatting should change this long list so that each element is on its own line
test_list = [ "apple", "banana", "orange", "apple", "banana", "orange", "apple", "banana", "orange"]
```

Using docker image with colima

Colima runs containers without requiring Docker Desktop.

Testing local deployment required pointing the Docker client at Colima's socket:

import os

import docker

# Path to the Docker socket exposed by Colima's default profile
colima_socket = f"unix:///Users/{os.getlogin()}/.colima/default/docker.sock"

# Set the Docker host to Colima's socket
os.environ["DOCKER_HOST"] = colima_socket

# Create a Docker client that talks to Colima
client = docker.from_env(environment={"DOCKER_HOST": colima_socket})

# pt_local_builder is assumed to be defined elsewhere (not shown here)
predictor_local = pt_local_builder.deploy(
    docker_settings={"DOCKER_HOST": colima_socket}
)
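
The socket path above is repeated in several places, so one option is to build it in a single helper and only set `DOCKER_HOST` when the socket actually exists. The sketch below assumes Colima's `~/.colima/<profile>/docker.sock` layout and uses `Path.home()` instead of the hard-coded `/Users/...` prefix. The function names are hypothetical.

```python
import os
from pathlib import Path


def colima_socket(profile="default", home=None):
    """Return the expected path of the Docker socket for a Colima profile."""
    base = Path(home) if home is not None else Path.home()
    return base / ".colima" / profile / "docker.sock"


def colima_docker_host(profile="default", home=None):
    """Return the DOCKER_HOST URL pointing at the Colima socket."""
    return f"unix://{colima_socket(profile, home)}"


def use_colima(profile="default", home=None):
    """Set DOCKER_HOST to the Colima socket if it exists; return True on success."""
    socket_path = colima_socket(profile, home)
    if not socket_path.exists():
        return False  # Colima is probably not running for this profile
    os.environ["DOCKER_HOST"] = f"unix://{socket_path}"
    return True
```

Calling `use_colima()` before creating the Docker client gives an early, explicit signal when Colima has not been started.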

To do

Add tests?
