Get Started

tidyomics is an open project to develop and integrate software and documentation to enable a tidy data analysis framework for omics data objects. tidyomics enables the use of familiar tidyverse verbs (select, filter, mutate, etc.) to manipulate rich data objects in the Bioconductor ecosystem. Importantly, the data objects are not modified; instead, tidyomics provides a tidy interface for working on the native objects, leveraging existing Bioconductor classes and algorithms.

tidyomics is a set of R packages by an international group of developers.

tidyomics allows for code such as the following:

single_cell_data |>
  filter(Phase == "G1") |>
  ggplot(aes(UMAP_1, UMAP_2, color=score)) + 
  geom_point()

(filter single cells in G1 phase and plot UMAP coordinates)

or

chip_seq_peaks |>
  filter(FDR < 0.01) |>
  join_overlap_inner(promoters) |>
  group_by(promoter_type) |>
  summarize(ave_score = mean(score))

(compute average score by the type of promoter overlap for significant peaks)

Installer

Core tidyomics packages can be installed and loaded with the tidyomics package. See the following URL for details and instructions:

https://github.com/tidyomics/tidyomics


Key Tidyomics Packages

Here we list the packages that provide a tidy data interface to manipulate native Bioconductor objects. The tidyomics project also involves other convenience packages listed below.

| Package | Intro | GitHub | Description |
|---|---|---|---|
| tidySummarizedExperiment | Vignette | GitHub | Tidy manipulation of SummarizedExperiment objects |
| tidySingleCellExperiment | Vignette | GitHub | Tidy manipulation of SingleCellExperiment objects |
| tidySeurat | Vignette | GitHub | Tidy manipulation of Seurat objects |
| tidySpatialExperiment | Vignette | GitHub | Tidy manipulation of SpatialExperiment objects |
| tidytof | Vignette | GitHub | Tidy manipulation of high-dimensional cytometry data |
| plyranges | Vignette | GitHub | Tidy manipulation of genomic ranges |
| plyinteractions | Vignette | GitHub | Tidy manipulation of genomic interactions |

Consult each package homepage for a description of recent changes.

Note that many of these packages have more than one vignette, which you can find by navigating the package main page.

Convenience packages

| Package | Intro | GitHub | Description |
|---|---|---|---|
| tidybulk | Vignette | GitHub | Tidy bulk RNA-seq data analysis |
| nullranges | Vignette | GitHub | Generation of null genomic range sets |
| easylift | Vignette | GitHub | Perform genomic liftover |

Comparison to base R

As the tidyomics packages offer an interface to underlying R/Bioconductor function evaluations, operations carried out in tidyomics can also be performed with base R/Bioconductor. The benefit of the tidyomics approach is often in the readability, interpretability, and extensibility of code, gained by eliminating temporary variables, square bracket indexing ([...,...]), and control code (e.g. for, if/else, apply/sapply, etc.).

For example, a filtering and grouping operation on a SummarizedExperiment object in tidyomics would look like:

data |>
  filter(score > 0) |>
  group_by(gene_class) |>
  summarize(mean_count = mean(counts))

In comparison, we can obtain the same result with base R/Bioconductor, but with more variables and some control code:

subdata <- data[rowData(data)$score > 0,]
gene_classes <- levels(rowData(subdata)$gene_class)
mean_count <- numeric(length(gene_classes))
for (i in seq_along(gene_classes)) {
  tmp_idx <- rowData(subdata)$gene_class == gene_classes[i]
  mean_count[i] <- mean(assay(subdata, "counts")[tmp_idx,])
}

This can be improved a bit if you know some more base R functions. Here is a base R alternative making use of subset and aggregate, h/t Martin Morgan:

subdata <- subset(data, score > 0)
aggregate(
  as.vector(assay(subdata)),
  list(rep(rowData(subdata)$gene_class, ncol(subdata))),
  mean
)

Even still, the tidyomics version above (filter, group_by, summarize) is likely the easiest to read and extend if the analyst wants to do additional operations, and is the easiest to directly pipe into a plot or a printed table.

For exploring this example, you can define data as follows:

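library(SummarizedExperiment)

# Simulated 10 x 10 counts matrix with per-row score and gene_class annotations
set.seed(5)
data <- SummarizedExperiment(
  assays = list(counts =
    matrix(rnorm(100), 10, 10,
    dimnames = list(letters[1:10], letters[1:10]))
  ),
  rowData = DataFrame(
    score = rnorm(10),
    gene_class = factor(rep(1:3, c(3, 3, 4)))
  )
)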
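
As a rough check that the two base R versions above compute the same per-class means, the following self-contained sketch rebuilds the example object and compares the loop result with the aggregate result. The names loop_means, agg, and agreement are introduced here for illustration only, and the sketch assumes that at least one row passes the score > 0 filter.

```r
library(SummarizedExperiment)

# Rebuild the example object from this section
set.seed(5)
data <- SummarizedExperiment(
  assays = list(counts = matrix(
    rnorm(100), 10, 10,
    dimnames = list(letters[1:10], letters[1:10])
  )),
  rowData = DataFrame(
    score = rnorm(10),
    gene_class = factor(rep(1:3, c(3, 3, 4)))
  )
)

# Version 1: square-bracket subsetting and an explicit loop over gene classes
subdata <- data[rowData(data)$score > 0, ]
gene_classes <- levels(rowData(subdata)$gene_class)
loop_means <- numeric(length(gene_classes))
names(loop_means) <- gene_classes
for (i in seq_along(gene_classes)) {
  tmp_idx <- rowData(subdata)$gene_class == gene_classes[i]
  loop_means[i] <- mean(assay(subdata, "counts")[tmp_idx, ])
}

# Version 2: subset() and aggregate()
subdata2 <- subset(data, score > 0)
agg <- aggregate(
  as.vector(assay(subdata2)),
  list(gene_class = rep(rowData(subdata2)$gene_class, ncol(subdata2))),
  mean
)

# aggregate() drops classes with no remaining rows, so compare by class label
agreement <- all.equal(
  unname(loop_means[as.character(agg$gene_class)]),
  agg$x
)
agreement
```

Note that the loop reports NaN for a gene class that has no rows left after filtering, whereas aggregate omits that class, which is why the comparison is matched on the class label.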

Comparison to Bioconductor

A key innovation in Bioconductor is the use of object-oriented programming and specific data structures. As described in Gentleman et al. (2004),

An exprSet is a data structure that binds together array-based expression measurements with covariate and administrative data for a collection of [experiments]... [its] design facilitates a three-tier architecture for providing analysis tools for new microarray platforms: low-level data are bridged to high-level analysis manipulations via the exprSet structure.

In Bioconductor, rich, structured data about experiments is maintained throughout analyses by passing data objects from one method to another. For example, estimateDispersions adds dispersion information to the rowData slot of a DESeqDataSet, which is a sub-class of SummarizedExperiment and therefore inherits the structure and methods of that class. The structure of the data is preserved after running the function (like many Bioconductor methods, it is an endomorphic function).
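
A minimal sketch of what "endomorphic" means here, using only SummarizedExperiment. This is not the DESeq2 implementation: add_row_mean is a hypothetical helper invented for this illustration, standing in for a method that stores a per-row result in rowData and returns an object of the same class it received.

```r
library(SummarizedExperiment)

# Hypothetical helper (not part of any package): adds a per-row summary
# to rowData and returns the modified object, leaving its class unchanged
add_row_mean <- function(se) {
  rowData(se)$row_mean <- rowMeans(assay(se, "counts"))
  se
}

# Small toy object with 3 features and 2 samples
se <- SummarizedExperiment(
  assays = list(counts = matrix(
    1:6, nrow = 3,
    dimnames = list(c("g1", "g2", "g3"), c("s1", "s2"))
  ))
)

se2 <- add_row_mean(se)

# Same class in, same class out; the new information lives in rowData
same_class <- identical(class(se), class(se2))
new_columns <- colnames(rowData(se2))
```

Because the return value is still a SummarizedExperiment, it can be passed directly to the next method or piped into further steps without any conversion.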

The goal of tidyomics is to preserve the object-oriented programming style and structure of Bioconductor data objects, while allowing users to manipulate these data objects with expressive commands familiar to tidyverse users.

Tidyomics aims to allow users to flexibly explore and plot biological datasets, by combining simple functions with human-readable names in a modular fashion to perform complex operations, including grouping and summarization tasks. Operations should still be performed with comparable efficiency to the underlying base R/Bioconductor code.

Tutorials

News

Talks

Tidyomics paper

Getting Help

We value community feedback and collaboration, and are happy to help you get started. Join the ongoing discussion, or you can ask specific questions about code on the support site.

Get Involved

[Figure: diagram of the tidyomics community]

The tidyomics organization is open to new members and contributions; it is an effort of many developers in the Bioconductor community and beyond.

  • See our tidyomics open challenges project to see what we are currently working on
  • Issues tagged with good first issue are those that developers think would be good for a new developer to start working on
  • Read over our Guidelines for contributing
  • Read over our Code of Conduct
  • As with new users, we encourage new developers to join our Slack Channel, #tidiness_in_bioc. Most of the tidyomics developers are active there, and we are happy to talk through updates and PRs, or to give guidance on developing a new package in this space.