The NASA PeTaL (Periodic Table of Life) Project is an open source artificial intelligence design tool that leverages data and information from nature and technology to advance biomimicry R&D.
The aim of this project is to use Snorkel to build a training set of labeled biomimicry papers. Our goal is to train a classifier over the data that can predict what label a given biomimicry paper should receive. We have access to a large amount of unlabeled data, but training a classifier requires labeled data, and labeling by hand for real-world applications is often prohibitively slow and expensive. In these cases, we can turn to a weak supervision approach, using labeling functions (LFs) in Snorkel. LFs are noisy, programmatic rules and heuristics that assign labels to unlabeled training data.
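To make this concrete, here is a minimal sketch of a keyword LF for the 'attach_permanently' function (label 0). This is an illustration rather than code from the repository; it assumes each paper exposes an abstract field and borrows keyword rules like those in biomimicry_function_rules.csv.

```python
from snorkel.labeling import labeling_function

ABSTAIN = -1
ATTACH_PERMANENTLY = 0  # per biomimicry_functions_enumerated.csv

@labeling_function()
def lf_attach_permanently(x):
    # Vote 'attach_permanently' if the abstract mentions one of the keyword
    # rules for that function; otherwise abstain and let other LFs decide.
    keywords = ["attach firmly", "biological adhesive", "biological glue"]
    if any(kw in x.abstract.lower() for kw in keywords):
        return ATTACH_PERMANENTLY
    return ABSTAIN
```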
This repository contains the scripts, notebooks, data, and docs used to build a training set with the Snorkel system.
An overview of the Snorkel system. (1) Subject matter experts (SMEs) write labeling functions (LFs) that express weak supervision sources like distant supervision, patterns, and heuristics. (2) Snorkel applies the LFs over unlabeled data and learns a generative model to combine the LFs' outputs into probabilistic labels. (3) Snorkel uses these labels to train a discriminative classification model, such as a deep neural network. Adapted from Ratner et al. (2017).
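A minimal sketch of that three-step pipeline with the Snorkel API is shown below; the file name, the abstract column, the single toy LF, and the cardinality of 100 are assumptions for illustration rather than the project's exact code.

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN = -1
ATTACH_PERMANENTLY = 0

# (1) SMEs write labeling functions, e.g. a simple keyword heuristic.
@labeling_function()
def lf_biological_glue(x):
    return ATTACH_PERMANENTLY if "biological glue" in x.abstract.lower() else ABSTAIN

# (2) Apply the LFs over the unlabeled papers and learn a label model that
#     combines their noisy, overlapping votes into probabilistic labels.
df_train = pd.read_csv("labeled_data.csv")             # file and 'abstract' column assumed
applier = PandasLFApplier(lfs=[lf_biological_glue])
L_train = applier.apply(df=df_train)                    # label matrix, one column per LF

label_model = LabelModel(cardinality=100, verbose=True)  # 100 biomimicry functions
label_model.fit(L_train=L_train, n_epochs=500, seed=123)
probs_train = label_model.predict_proba(L=L_train)

# (3) probs_train can then be used to train a discriminative classifier,
#     e.g. a neural network over the abstracts.
```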
This README was last updated on 24 February 2022.
snorkel.environment.yml
Conda environment for running Snorkel with the required dependencies.
petal_snorkel_train_golden.py
Main script for running Snorkel.
biomimicry_function_rules.csv
Contains rules for 40 of the 100 biomimicry functions.
biomimicry_functions_enumerated.csv
Contains all 100 of the biomimicry functions, labeled 0-99.
create_labeling_functions.py
Script to create keyword labeling functions (LFs).
utils.py
Data cleaning and train/test split of the data.
snorkel_spam_test
Folder containing all the files needed to run a short test of Snorkel using a YouTube spam dataset.
Snorkel requires Python 3.6 or later. The entire conda environment for running Snorkel can be found in snorkel.environment.yml and can be created with `conda env create -f snorkel.environment.yml`.
Get a sense of how Snorkel works and run a quick data labeling tutorial using a YouTube spam comments dataset. More info can be found here: https://www.snorkel.org/use-cases/01-spam-tutorial
Sample dataset of labeled biomimicry data. Includes: DOI, URL, title, abstract, journal, and level 1/2/3 biomimicry labels.
Contains all 100 biomimicry functions, labeled 0-99. These numbers are what Snorkel recognizes in place of a biomimicry function, e.g. 'attach_permanently' = 0.
Contains 661 rules representing 40 of the 100 biomimicry functions. For example, the function 'attach permanently' contains keyword rules such as 'attach firmly', 'biological adhesive', and 'biological glue'.
Takes in data from labeled_data.csv, applies a -1 'abstain' label to each row as a default, and performs a train/test split of the data.
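A rough sketch of what that preprocessing could look like; the column name, split ratio, and helper name are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

ABSTAIN = -1

def load_and_split(path="labeled_data.csv", test_size=0.2, seed=123):
    """Load the labeled papers, default every row to 'abstain', and split."""
    df = pd.read_csv(path)
    df["label"] = ABSTAIN  # default label for every row before LFs are applied
    df_train, df_test = train_test_split(df, test_size=test_size, random_state=seed)
    return df_train, df_test
```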
Creates keyword labeling functions (LFs) for every rule in biomimicry_function_rules.csv.
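A hedged sketch of how each keyword rule could be turned into a Snorkel LabelingFunction; the CSV column names ('function', 'rule', 'label') and the abstract field are assumptions.

```python
import pandas as pd
from snorkel.labeling import LabelingFunction

ABSTAIN = -1

def keyword_lookup(x, keyword, label):
    # Vote for `label` if the keyword rule appears in the paper's abstract.
    return label if keyword in x.abstract.lower() else ABSTAIN

def make_keyword_lf(keyword, label):
    return LabelingFunction(
        name=f"keyword_{keyword.replace(' ', '_')}",
        f=keyword_lookup,
        resources=dict(keyword=keyword, label=label),
    )

# One LF per rule in biomimicry_function_rules.csv, mapped to its function's
# integer label from biomimicry_functions_enumerated.csv.
rules = pd.read_csv("biomimicry_function_rules.csv")
functions = pd.read_csv("biomimicry_functions_enumerated.csv")
name_to_id = dict(zip(functions["function"], functions["label"]))

lfs = [make_keyword_lf(row["rule"], name_to_id[row["function"]])
       for _, row in rules.iterrows()]
```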
Trains Snorkel on a small subset of the golden JSON papers and returns a predicted label for each paper.
Trains Snorkel on the golden JSON papers and returns a predicted label for each paper.
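For illustration, the integer predictions can be mapped back to human-readable biomimicry functions via biomimicry_functions_enumerated.csv; the column names and toy prediction values below are assumptions.

```python
import numpy as np
import pandas as pd

# Suppose `preds` holds the label model's integer prediction for each paper
# (e.g. from LabelModel.predict), with -1 meaning the model abstained.
preds = np.array([0, 5, 42, -1])  # toy values

functions = pd.read_csv("biomimicry_functions_enumerated.csv")     # column names assumed
id_to_name = dict(zip(functions["label"], functions["function"]))  # e.g. 0 -> 'attach_permanently'
predicted_functions = [id_to_name.get(p, "abstain") for p in preds]
```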
- Snorkel: Rapid Training Data Creation with Weak Supervision
- Data Programming: Creating Large Training Sets, Quickly
- Practical Weak Supervision
- Write LFs for the remaining 60 biomimicry functions.
- Include regular-expression labeling functions to increase coverage (see the sketch below).
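For example, a regex LF might look like the following sketch; the abstract field and the label constant are assumptions, not code from this repository.

```python
import re
from snorkel.labeling import labeling_function

ABSTAIN = -1
ATTACH_PERMANENTLY = 0  # per biomimicry_functions_enumerated.csv

@labeling_function()
def lf_regex_biological_adhesive(x):
    # Match 'biological adhesive/adhesion' and close variants that a plain
    # keyword lookup would miss (e.g. 'bio-inspired adhesion').
    pattern = r"\b(biological|bio-?inspired)\s+adhes(ive|ion)\b"
    if re.search(pattern, x.abstract, flags=re.IGNORECASE):
        return ATTACH_PERMANENTLY
    return ABSTAIN
```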
For questions contact Alexandra Ralevski ([email protected])