
Generating Training Data Using Weak Supervision for the NASA PeTaL Project

The NASA PeTaL (Periodic Table of Life) Project is an open source artificial intelligence design tool that leverages data and information from nature and technology to advance biomimicry R&D.

Links

Overview

The aim of this project is to use Snorkel to build a training set of labeled biomimicry papers. Our goal is to train a classifier over the data that can predict which label a given biomimicry paper should receive. We have access to a large amount of unlabeled data, but training a classifier requires labeled data, and labeling by hand for real-world applications is often prohibitively slow and expensive. In these cases, we can turn to a weak supervision approach, using labeling functions (LFs) in Snorkel. LFs are noisy, programmatic rules and heuristics that assign labels to unlabeled training data.
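As a concrete illustration, a keyword LF for the 'attach permanently' function might look like the sketch below. The label id and the abstract field are assumptions based on the data files described later; the actual LFs in this repository are generated programmatically by create_labeling_functions.py.

```python
from snorkel.labeling import labeling_function

ABSTAIN = -1             # default vote when an LF has no opinion
ATTACH_PERMANENTLY = 0   # assumed label id from biomimicry_functions_enumerated.csv

@labeling_function()
def lf_biological_adhesive(x):
    # Vote for 'attach permanently' if the abstract mentions a biological adhesive;
    # otherwise abstain and let the other LFs (and the label model) decide.
    if "biological adhesive" in x.abstract.lower():
        return ATTACH_PERMANENTLY
    return ABSTAIN
```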

This repository contains the scripts, notebooks, data, and docs used to build a training set with the Snorkel system.

An overview of the Snorkel system. (1) Subject matter experts (SMEs) write labeling functions (LFs) that express weak supervision sources like distant supervision, patterns, and heuristics. (2) Snorkel applies the LFs over unlabeled data and learns a generative model to combine the LFs' outputs into probabilistic labels. (3) Snorkel uses these labels to train a discriminative classification model, such as a deep neural network. Adapted from Ratner et al. (2017).

This README was last updated on 24 February 2022.

Files

snorkel.environment.yml Conda environment for running Snorkel with the required dependencies.

petal_snorkel_train_golden.py Main file for running Snorkel.

biomimicry_function_rules.csv contains rules for 40 of the 100 biomimicry functions.

biomimicry_functions_enumerated.csv contains all 100 of the biomimicry functions labeled 0-99.

create_labeling_functions.py File to create keyword labeling functions (LFs).

utils.py Data cleaning and train/test split of the data.

snorkel_spam_test folder containing all the files needed to run a short test of snorkel using a spam YouTube dataset.

Getting Started

Environment and setup

Snorkel requires Python 3.6 or later. The entire conda environment for running Snorkel can be found in snorkel.environment.yml.

Running Snorkel

snorkel_spam_test

Get a sense of how Snorkel works and run a quick data labeling tutorial using a YouTube spam comments dataset. More info can be found here: https://www.snorkel.org/use-cases/01-spam-tutorial

labeled_data.csv

Sample dataset of labeled biomimicry data. Includes: DOI, URL, title, abstract, journal, and level 1/2/3 biomimicry labels.

biomimicry_functions_enumerated.csv

Contains all 100 biomimicry functions labeled 0-99. These numbers are what Snorkel recognizes in place of a biomimicry function, e.g. 'attach_permanently' = 0.
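A minimal sketch of loading this mapping with pandas (the column names here are assumptions; check the CSV header before relying on them):

```python
import pandas as pd

# Assumed columns: "function" (name) and "label" (integer id 0-99).
funcs = pd.read_csv("biomimicry_functions_enumerated.csv")
label_map = dict(zip(funcs["function"], funcs["label"]))

print(label_map.get("attach_permanently"))  # expected: 0
```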

biomimicry_function_rules.csv

Contains 661 rules representing 40 of the 100 biomimicry functions. For example, the function 'attach permanently' contains keyword rules such as 'attach firmly', 'biological adhesive', and 'biological glue'.

utils.py

Takes in data from labeled_data.csv, applies a -1 'abstain' label to each row as a default, and performs a train/test split of the data.
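A rough sketch of that step, assuming pandas and scikit-learn and a hypothetical 'label' column (the real split proportions live in utils.py):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

ABSTAIN = -1

df = pd.read_csv("labeled_data.csv")
df["label"] = ABSTAIN  # every row starts as 'abstain' until an LF votes otherwise

# Illustrative 80/20 split; adjust to match utils.py.
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
```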

create_labeling_functions.py

Creates keyword labeling functions (LFs) for every rule in biomimicry_function_rules.csv.
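The usual Snorkel pattern for this is a small LF factory. The sketch below assumes the rules CSV exposes a 'keyword' column and a numeric 'label' column, and that each paper row has an abstract field; the real column names may differ.

```python
import pandas as pd
from snorkel.labeling import LabelingFunction

ABSTAIN = -1

def keyword_lookup(x, keyword, label):
    # Vote for the rule's function if its keyword appears in the abstract.
    return label if keyword in x.abstract.lower() else ABSTAIN

def make_keyword_lf(keyword, label):
    return LabelingFunction(
        name=f"keyword_{keyword.replace(' ', '_')}",
        f=keyword_lookup,
        resources=dict(keyword=keyword, label=label),
    )

# Assumed column names; adjust to match biomimicry_function_rules.csv.
rules = pd.read_csv("biomimicry_function_rules.csv")
lfs = [make_keyword_lf(r.keyword, r.label) for r in rules.itertuples()]
```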

petal_snorkel_train_Alex.py

Trains Snorkel on a small subset of the golden JSON papers and returns a prediction for each label.

petal_snorkel_train_golden.py

Trains Snorkel on the golden JSON papers and returns a prediction for each label.
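Under the hood this follows the standard Snorkel workflow: apply the LFs to the training set, fit a label model over the resulting vote matrix, and predict one label per paper. The sketch below is illustrative only; df_train and lfs come from the steps above, and the hyperparameters are not necessarily those used in the script.

```python
from snorkel.labeling import PandasLFApplier, LFAnalysis
from snorkel.labeling.model import LabelModel

# Apply every keyword LF to the training DataFrame -> one column of votes per LF.
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)

# Optional: inspect coverage, overlap, and conflict of the LFs.
print(LFAnalysis(L=L_train, lfs=lfs).lf_summary())

# Combine the noisy votes into probabilistic labels over the 100 functions.
label_model = LabelModel(cardinality=100, verbose=True)
label_model.fit(L_train=L_train, n_epochs=500, log_freq=100, seed=123)

preds = label_model.predict(L=L_train, tie_break_policy="abstain")
```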

More Information

Notable papers

Future Work

  • Write LFs for the remaining 60 biomimicry functions.
  • Include regular-expression labeling functions to increase coverage (a sketch of such an LF is shown below).
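A regex LF of that kind might look like the following sketch (label id and field name are assumptions, as above):

```python
import re
from snorkel.labeling import labeling_function

ABSTAIN = -1
ATTACH_PERMANENTLY = 0  # assumed label id from biomimicry_functions_enumerated.csv

@labeling_function()
def lf_regex_adhesive(x):
    # One pattern covers "biological adhesive(s)" and "biological glue(s)",
    # which plain keyword matching would need two separate rules for.
    if re.search(r"biological (adhesive|glue)s?", x.abstract, flags=re.IGNORECASE):
        return ATTACH_PERMANENTLY
    return ABSTAIN
```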

Contact

For questions contact Alexandra Ralevski ([email protected])
