Skip to content

benslack19/wandb-explore

Repository files navigation

Exploring machine learning engineering datasets, tools, and approaches

The purpose of this repo is to create a machine learning pipeline to analyze multi-modal biomedical data, implementing MLOps best practices and deploying in a cloud environment.

Project components:

  1. Data preparation
  2. Model development
    • Implement a simple neural network model using PyTorch to predict patient remission (notebook)).
    • Start with clinical features before integrating genomic data.
  3. MLOps Pipeline:
    • Set up a version control system using Git for code management (this repo).
    • Implement experiment tracking and model versioning using Weights and Biases
    • Set up a virtual environment which can also be used in combination with a Dockerfile for reproducibility.
  4. Cloud Deployment:
    • Deploy the model on AWS SageMaker.
    • Implement an API endpoint for model inference.
    • Note: I am currently troubleshooting this step.

Setup

I preferred to install packages with mamba. I installed with miniforge release Release 24.9.0-0.

  1. I cloned data-science-base-repo, renamed to wandb-explore, then updated the environment yaml file to be specific to this project.
git clone https://github.com/benslack19/data-science-base-repo.git wandb-explore
cd wandb-explore
  1. Create mamba environment:

mamba env create --file=environments/wandb.yml

  1. Install pre-commit hooks:
pre-commit install

Usage

  • Activate the base environment for data science work:

mamba activate wandb

# all files
pre-commit run --all-files
# or a specific file
pre-commit run --files ./project/utils.py
  • Or run specific tools:
# ruff only
ruff check .
ruff format . 

# mypy only
mypy .

Formatting notebooks

In the settings.json within jupyter lab code formatter:

{
  "preferences": {
    "default_formatter": {
      "python": ["ruff", "black"]
    }
  },
  "formatOnSave": true
}

Note that ruff does not autofix long line lengths. Therefore, it helps to use ruff in combination with black.

Formatting scripts using VSCode

# Other configurations
If using iterm2, it might be helpful to allow for this option for moving the cursor by going to `Settings -> Profiles -> Keys -> Key Mappings` as explained [here](https://stackoverflow.com/questions/81272/how-to-move-the-cursor-word-by-word-in-the-os-x-terminal) and [here](https://coderwall.com/p/h6yfda/use-and-to-jump-forwards-backwards-words-in-iterm-2-on-os-x).


## Examples to verify formatting

Paste these examples as a way to verify formatting in your notebook or script is working.

```python
# formatting should alphabetize this list of packages
import seaborn as sns
import pandas as pd
import numpy as np

# formatting should change this long list so that each element is on its own line
test_list = [ "apple", "banana", "orange", "apple", "banana", "orange", "apple", "banana", "orange"]

Using docker image with colima

Colima allows use of running containers while obviating the use of Docker Desktop.

Some settings were necessary for testing of local deployment.

import os
import docker

# Set the Docker host to Colima's socket
os.environ["DOCKER_HOST"] = (
    f"unix:///Users/{os.getlogin()}/.colima/default/docker.sock"
)

client = docker.from_env(
    environment={
        "DOCKER_HOST": f"unix:///Users/{os.getlogin()}/.colima/default/docker.sock"
    }
)

predictor_local = pt_local_builder.deploy(
    docker_settings={
        "DOCKER_HOST": f"unix:///Users/{os.getlogin()}/.colima/default/docker.sock"
    }
)

To do

  • Add tests?

About

A repo to learn Weights and Biases

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages