The NASA PeTaL (Periodic Table of Life) Project is an open source artificial intelligence design tool that leverages data and information from nature and technology to advance biomimicry R&D.
The aim of this project is to use Snorkel to build a training set of labeled biomimicry papers. Our goal is to train a classifier over the data that can predict what label a given biomimicry paper should receive. We have access to a large amount of unlabeled data, but training a classifier requires labeled data, and labeling by hand for real-world applications is often prohibitively slow and expensive. In these cases, we can turn to a weak supervision approach, using labeling functions (LFs) in Snorkel. LFs are noisy, programmatic rules and heuristics that assign labels to unlabeled training data.
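To make this concrete, here is a minimal sketch of a keyword LF for the 'attach_permanently' function (label 0). This is an illustration rather than code from the repository; it assumes each paper exposes an abstract field and borrows keyword rules like those in biomimicry_function_rules.csv.

```python
from snorkel.labeling import labeling_function

ABSTAIN = -1
ATTACH_PERMANENTLY = 0  # per biomimicry_functions_enumerated.csv

@labeling_function()
def lf_attach_permanently(x):
    # Vote 'attach_permanently' if the abstract mentions one of the keyword
    # rules for that function; otherwise abstain and let other LFs decide.
    keywords = ["attach firmly", "biological adhesive", "biological glue"]
    if any(kw in x.abstract.lower() for kw in keywords):
        return ATTACH_PERMANENTLY
    return ABSTAIN
```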
This repository contains the scripts, notebooks, data, and docs used to build a training set with the Snorkel system.
An overview of the Snorkel system. (1) Subject matter experts (SMEs) write labeling functions (LFs) that express weak supervision sources like distant supervision, patterns, and heuristics. (2) Snorkel applies the LFs over unlabeled data and learns a generative model to combine the LFs' outputs into probabilistic labels. (3) Snorkel uses these labels to train a discriminative classification model, such as a deep neural network. Adapted from Ratner et al. (2017).
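A minimal sketch of that three-step pipeline with the Snorkel API is shown below; the file name, the abstract column, the single toy LF, and the cardinality of 100 are assumptions for illustration rather than the project's exact code.

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN = -1
ATTACH_PERMANENTLY = 0

# (1) SMEs write labeling functions, e.g. a simple keyword heuristic.
@labeling_function()
def lf_biological_glue(x):
    return ATTACH_PERMANENTLY if "biological glue" in x.abstract.lower() else ABSTAIN

# (2) Apply the LFs over the unlabeled papers and learn a label model that
#     combines their noisy, overlapping votes into probabilistic labels.
df_train = pd.read_csv("labeled_data.csv")             # file and 'abstract' column assumed
applier = PandasLFApplier(lfs=[lf_biological_glue])
L_train = applier.apply(df=df_train)                    # label matrix, one column per LF

label_model = LabelModel(cardinality=100, verbose=True)  # 100 biomimicry functions
label_model.fit(L_train=L_train, n_epochs=500, seed=123)
probs_train = label_model.predict_proba(L=L_train)

# (3) probs_train can then be used to train a discriminative classifier,
#     e.g. a neural network over the abstracts.
```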
This README was last updated on 24 February 2022.
snorkel.environment.yml
Conda environment for running Snorkel with the required dependencies.
petal_snorkel_train_golden.py
Main script for running Snorkel.
biomimicry_function_rules.csv
Contains rules for 40 of the 100 biomimicry functions.
biomimicry_functions_enumerated.csv
Contains all 100 of the biomimicry functions, labeled 0-99.
create_labeling_functions.py
Script to create keyword labeling functions (LFs).
utils.py
Data cleaning and train/test split of the data.
snorkel_spam_test
Folder containing all the files needed to run a short test of Snorkel using a YouTube spam dataset.
Snorkel requires Python 3.6 or later. The entire conda environment for running Snorkel can be found in snorkel.environment.yml and can be created with `conda env create -f snorkel.environment.yml`.
Get a sense of how Snorkel works and run a quick data labeling tutorial using a YouTube spam comments dataset. More info can be found here: https://www.snorkel.org/use-cases/01-spam-tutorial
Sample dataset of labeled biomimicry data. Includes: DOI, URL, title, abstract, journal, and level 1/2/3 biomimicry labels.
Contains all 100 biomimicry functions, labeled 0-99. These numbers are what Snorkel recognizes in place of a biomimicry function, e.g. 'attach_permanently' = 0.
Contains 661 rules representing 40 of the 100 biomimicry functions. For example, the function 'attach permanently' contains keyword rules such as 'attach firmly', 'biological adhesive', and 'biological glue'.
Takes in data from labeled_data.csv, applies a -1 'abstain' label to each row as a default, and performs a train/test split of the data.
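A rough sketch of what that preprocessing could look like; the column name, split ratio, and helper name are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

ABSTAIN = -1

def load_and_split(path="labeled_data.csv", test_size=0.2, seed=123):
    """Load the labeled papers, default every row to 'abstain', and split."""
    df = pd.read_csv(path)
    df["label"] = ABSTAIN  # default label for every row before LFs are applied
    df_train, df_test = train_test_split(df, test_size=test_size, random_state=seed)
    return df_train, df_test
```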
Creates keyword labeling functions (LFs) for every rule in biomimicry_function_rules.csv.
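A hedged sketch of how each keyword rule could be turned into a Snorkel LabelingFunction; the CSV column names ('function', 'rule', 'label') and the abstract field are assumptions.

```python
import pandas as pd
from snorkel.labeling import LabelingFunction

ABSTAIN = -1

def keyword_lookup(x, keyword, label):
    # Vote for `label` if the keyword rule appears in the paper's abstract.
    return label if keyword in x.abstract.lower() else ABSTAIN

def make_keyword_lf(keyword, label):
    return LabelingFunction(
        name=f"keyword_{keyword.replace(' ', '_')}",
        f=keyword_lookup,
        resources=dict(keyword=keyword, label=label),
    )

# One LF per rule in biomimicry_function_rules.csv, mapped to its function's
# integer label from biomimicry_functions_enumerated.csv.
rules = pd.read_csv("biomimicry_function_rules.csv")
functions = pd.read_csv("biomimicry_functions_enumerated.csv")
name_to_id = dict(zip(functions["function"], functions["label"]))

lfs = [make_keyword_lf(row["rule"], name_to_id[row["function"]])
       for _, row in rules.iterrows()]
```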
Trains Snorkel on a small subset of the golden JSON papers and returns a predicted label for each paper.
Trains Snorkel on the golden JSON papers and returns a predicted label for each paper.
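For illustration, the integer predictions can be mapped back to human-readable biomimicry functions via biomimicry_functions_enumerated.csv; the column names and toy prediction values below are assumptions.

```python
import numpy as np
import pandas as pd

# Suppose `preds` holds the label model's integer prediction for each paper
# (e.g. from LabelModel.predict), with -1 meaning the model abstained.
preds = np.array([0, 5, 42, -1])  # toy values

functions = pd.read_csv("biomimicry_functions_enumerated.csv")     # column names assumed
id_to_name = dict(zip(functions["label"], functions["function"]))  # e.g. 0 -> 'attach_permanently'
predicted_functions = [id_to_name.get(p, "abstain") for p in preds]
```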
- Snorkel: Rapid Training Data Creation with Weak Supervision
- Data Programming: Creating Large Training Sets, Quickly
- Practical Weak Supervision
- Write LFs for the remaining 60 biomimicry functions.
- Include regular-expression labeling functions to increase coverage (see the sketch below).
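For example, a regex LF might look like the following sketch; the abstract field and the label constant are assumptions, not code from this repository.

```python
import re
from snorkel.labeling import labeling_function

ABSTAIN = -1
ATTACH_PERMANENTLY = 0  # per biomimicry_functions_enumerated.csv

@labeling_function()
def lf_regex_biological_adhesive(x):
    # Match 'biological adhesive/adhesion' and close variants that a plain
    # keyword lookup would miss (e.g. 'bio-inspired adhesion').
    pattern = r"\b(biological|bio-?inspired)\s+adhes(ive|ion)\b"
    if re.search(pattern, x.abstract, flags=re.IGNORECASE):
        return ATTACH_PERMANENTLY
    return ABSTAIN
```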
For questions contact Alexandra Ralevski ([email protected])