This project uses:
- Rocker Group's Tidyverse R 4.0 Ubuntu 18 LTS Docker container image
- data from the TERRA-REF project accessed through the traits R package
- JAGS for Gibbs-sampled MCMC modeling
- causalnex to implement the NO TEARS directed acyclic graph structure learning algorithm as described here
  - causalnex has dependencies: pandas, sklearn, and igraph
The goal is to develop a causal Bayesian network, also known as a Bayesian Belief Network (BBN), that predicts growth rate as a phenotype from the Sorghum bicolor biomass accumulation panel.
This analysis produces a causal inference Bayesian Belief Network similar to Judea Pearl's work, where the nodes (vertices) of the network represent variables and the edges (arcs) represent dependencies supported by conditional probabilities.
To run any part of this analysis, it is recommended that you have Docker installed on the host machine, or that you use singularity-ce to run the containers on high performance computing clusters.
- All R scripts detailed below, including the growth rate modeling, can be run with the container image cyversevice/rstudio-bayes-cpu:4.0-ubuntu-jags
- All Python code runs from the command line with this Docker container image and is written so that this repository is mounted as a volume in the container at /work/phenophasebbn/
  - Ex. docker run --rm -it -v /local/path/to/phenophasebbn/:/work/phenophasebbn/ rbartelme/pytorch-causalnex:0.10.0 python /work/phenophasebbn/bbn/bbn_structure.py (See note below)
  - The current Dockerfile for this image is contained in this repository at /causal_nex/Dockerfile
- A JupyterLab Docker container image has been created to facilitate the exploration of the python codebase
In order to speed up the directed acyclic graph generation for the Bayesian Belief Network, an initial graph was instantiated using lists of tuples that reference the edge/node connections and directions outlined in the conceptual diagram above.
NOTE: Learning the graph structure without any expert knowledge graph encodings via the NO TEARS implementation in causalnex without GPU acceleration is a computationally intensive process and may not solve the graph structure with the Sorghum gene data included in these analyses.
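The expert-knowledge encoding described above amounts to a list of (parent, child) tuples. The sketch below is a minimal illustration of that idea in plain Python, not the repository's code: the node names are hypothetical placeholders, and `is_dag` is a helper introduced here to check that a hand-written edge list is acyclic before it is handed to a structure model.

```python
from collections import defaultdict, deque

# Hypothetical node names for illustration only; the real edge list
# is defined in /bbn/bbn_structure.py.
expert_edges = [
    ("season", "temperature"),
    ("temperature", "growth_rate"),
    ("genotype", "snp_abundance"),
    ("snp_abundance", "growth_rate"),
]


def is_dag(edges):
    """Return True if the directed (parent, child) edges contain no cycle.

    Uses Kahn's algorithm: repeatedly remove nodes with no incoming edges.
    If every node can be removed, the graph is acyclic.
    """
    children = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
        nodes.update((parent, child))

    queue = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return visited == len(nodes)
```

Checking acyclicity up front is cheap and catches a mistyped edge direction before any expensive structure learning or CPD fitting starts.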
How the contents of this repository were used to generate the analysis.
1. Processing raw data:
- Weather & phenotype data processing:
  - Code: /bnprocess_functional.R
  - Exports (TSV):
    - /season4_combined.txt
    - /season6_combined.txt
    - /ksu_combined.txt (No longer used in final analysis)
- Genomic Data:
  - Code to process the SNP frequency by Sorghum bicolor gene table from this repository can be found in /genomic_preprocessing/snp_normalization.R
  - Exports (TSV): /genomic_preprocessing/genewise_snp_relative_abundance.txt, where the relative abundance of single nucleotide polymorphisms is calculated relative to the Sorghum bicolor biomass accumulation panel population
- Development work:
  - Notes and pseudocode are in /sandbox/ and /bnprocess_mac.R
2. Model Growth Rate by Sorghum bicolor Cultivar using JAGS in R:
- /jags/ contains the dev code for the growth rate modeling below; these scripts & files are used in the bbn structure learning model
  - Full logistic growth rate modeling by Jessica Guo
  - Summary plots of the logistic growth models can be found in /data_figs/
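For readers unfamiliar with logistic growth models, the sketch below shows one common three-parameter form and the maximum growth rate it implies. The parameterization and the names `y_max`, `rate`, and `t_mid` are assumptions made for illustration; the JAGS model in /jags/ may be parameterized differently.

```python
import math


def logistic_value(t, y_max, rate, t_mid):
    """Trait value at time t under a three-parameter logistic curve.

    y_max : upper asymptote of the trait
    rate  : intrinsic growth rate (per unit time)
    t_mid : time of the inflection point, where the value is y_max / 2
    """
    return y_max / (1.0 + math.exp(-rate * (t - t_mid)))


def max_growth_rate(y_max, rate):
    """Maximum absolute growth rate, reached at the inflection point.

    Follows from dy/dt = rate * y * (1 - y / y_max) evaluated at y = y_max / 2.
    """
    return y_max * rate / 4.0
```

Under this form a single fitted curve per cultivar yields one summary growth rate, which is the kind of per-cultivar quantity that can then serve as a node in the network.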
3. Prepare dataset for structure learning in R & Python:
- Join genomic, environmental, and phenotypic data
  - This is done with the Rscript /bbn/join_datasets.R
- Exports: /bbn/rgr_snp_joined.csv
4. BBN Structure Learning in Python with NO TEARS algorithm:
- Ingest joined data /bbn/rgr_snp_joined.csv and learn structure with: /bbn/bbn_structure.py
  - Process categorical data with LabelEncoder from scikit-learn
  - Encode expert knowledge into graph structure via a list of tuples in the first invocation of StructureModel()
    - png exported as /bbn/init_graph.png (as of 10-25-2021 this takes a long time to write the png and is commented out of the code; the pickle of this graph is available at /bbn/expert_sm.pickle for the CPD fitting in step #5 after unpickling the structure model binary)
  - Optional: learn graph structure with NO TEARS using the from_pandas function from causalnex, blacklisting spurious node + edge connections with a second list of tuples
- Exports:
  - categorical label encodings for genotype (or cultivar) /bbn/genotype_map.json & /bbn/season_map.json
  - Currently stuck solving graph structure, so only expert knowledge encoded graph is available
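The label-encoding export in this step can be pictured as follows. This is a plain-Python sketch that assigns integer codes to labels in sorted order, which is the same ordering scikit-learn's LabelEncoder uses for its classes. The cultivar names are hypothetical, `build_label_map` is a helper introduced here, and the actual layout of genotype_map.json (label to code, or code to label) is not specified in this README, so the label-to-code direction is an assumption.

```python
import json


def build_label_map(values):
    """Map each distinct label to an integer code, assigned in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}


# Hypothetical cultivar labels, for illustration only.
genotypes = ["cultivar_b", "cultivar_a", "cultivar_b", "cultivar_c"]

genotype_map = build_label_map(genotypes)
encoded = [genotype_map[g] for g in genotypes]

# Serialized form, comparable to what a *_map.json export could contain.
genotype_json = json.dumps(genotype_map, indent=2)
```

Saving the mapping next to the encoded data makes it possible to translate integer node states back to cultivar or season names when inspecting the fitted network.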
5. Discretized Data Mapping & Conditional Probability Distribution Fitting:
- Import Bayesian Network by structure model pickle
- Instantiate Bayesian network with BayesianNetwork() from causalnex
- Map continuous variables into categories
- A detailed checklist of these steps can be found in this GitHub issue
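To make the last two ideas concrete, the sketch below maps continuous values into categories with fixed thresholds and then estimates a conditional probability table by counting. It is a plain-Python illustration of the concept, not the causalnex API; the variable names, thresholds, and data values are all hypothetical.

```python
from bisect import bisect_right
from collections import Counter, defaultdict


def discretize(value, thresholds):
    """Return the bin index for value given sorted thresholds.

    Bin 0 holds values below thresholds[0]; a value equal to a
    threshold falls into the upper bin.
    """
    return bisect_right(thresholds, value)


def fit_cpd(parent_states, child_states):
    """Estimate P(child | parent) from paired observations by counting."""
    counts = defaultdict(Counter)
    for parent, child in zip(parent_states, child_states):
        counts[parent][child] += 1
    return {
        parent: {
            child: n / sum(child_counts.values())
            for child, n in child_counts.items()
        }
        for parent, child_counts in counts.items()
    }


# Hypothetical observations, for illustration only.
temperature = [18.0, 22.5, 31.0, 29.5, 19.0, 33.0]
growth_rate = [0.02, 0.05, 0.09, 0.08, 0.03, 0.04]

temp_bins = [discretize(v, [20.0, 30.0]) for v in temperature]  # 0=low, 1=mid, 2=high
rate_bins = [discretize(v, [0.06]) for v in growth_rate]        # 0=slow, 1=fast

cpd = fit_cpd(temp_bins, rate_bins)
```

Counting gives the maximum likelihood estimate; with sparse data some parent states will have few or no observations, which is one reason library implementations commonly offer estimators with priors.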