Merge branch 'release-0.14.0'

3DGenomes · Nov 9, 2018 · 13c7041
2 parents f00fb4b + 0845e23
commit 13c7041
Showing 274 changed files with 97 additions and 1,536 deletions.
diff --git a/.gitattributes b/.gitattributes
@@ -1,2 +1,4 @@
 *.tsv.gz filter=lfs diff=lfs merge=lfs -text
 *.tsv filter=lfs diff=lfs merge=lfs -text
+*.dat.gz filter=lfs diff=lfs merge=lfs -text
+*.RData filter=lfs diff=lfs merge=lfs -text
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -4,6 +4,13 @@ All notable changes to *binless* will be documented in this file.
 The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
 for versions 0.x of binless, minor releases might break backwards compatibility.
 
+## [0.14.0]
+### Changed
+- Tutorials now use the smaller SEMA3C dataset, which can be run quickly on a
+  laptop
+### Removed
+- Old article/ folder
+
 ## [0.13.0]
 ### Added
 - use 64 bit integers in gfl_graph_fl to expand the index limit in lasso.
@@ -121,7 +128,8 @@ for versions 0.x of binless, minor releases might break backwards compatibility.
 - Initial commit
 
 
-[0.13.0]: ../../compare/v0.12.0...HEAD
+[0.14.0]: ../../compare/v0.13.0...HEAD
+[0.13.0]: ../../compare/v0.12.0...v0.13.0
 [0.12.0]: ../../compare/v0.11.0...v0.12.0
 [0.11.0]: ../../compare/v0.10.2...v0.11.0
 [0.10.2]: ../../compare/v0.10.1...v0.10.2

diff --git a/README.md b/README.md
@@ -14,6 +14,8 @@ install.packages("devtools")
 devtools::install_github("3DGenomes/binless",subdir="binless")
 ```
 
+Installation should take about 10 minutes.
+
 #### Manual installation
 You can also install it manually as follows:
 
@@ -29,11 +31,16 @@ Binless uses the following packages: `data.table`, `Hmisc`, `foreach`,
 `doParallel`, `MASS`, `matrixStats`, `ggplot2`, `dplyr`, `Matrix`, `quadprog`,
 `scales`, `utils`
 
+Binless has been developed and tested on a MacBook Pro (2015) and on a CentOS 7
+Linux workstation with 128 GB of RAM and 32 cores. Resource usage can range from modest
+(fast binless will run on a laptop for loci <10Mb) to huge (fast binless on
+human chromosome 1 at 5kb base resolution requires about 500 GB of RAM).
+
 ### How does it work?
 
 In the `example/` folder, we provide plots and files to perform a normalization,
 taken from publicly available data (Rao *et al.*, 2014). Alternatively you can
-use your own data.  Start with something not too large, for example 2Mb. If you
+use your own data.  Start with something not too large, for example 1Mb. If you
 want a quick and dirty overview, skip to the *Fast binless* section. Otherwise,
 read on.
 
@@ -55,16 +62,15 @@ on the `CSnorm` object you built at the previous step. Once normalized, datasets
 can be combined, and signal and difference detection can be performed.  **This
 is the full-blown version of the algorithm, with statistically
 significant output**. Note that this is a beta version, so check for updates
-frequently.
+regularly.
 
 ### Fast binless
 
 See `fast_binless.R`. Here, we implemented a fast approximation with fixed
-fusion penalty and an approximate decay. You can either use a `CSnorm` object
+fusion penalty and approximate decay and bias terms. You can either use a `CSnorm` object
 produced at the preprocessing stage, or directly provide the binned raw matrix.
 **This is a fast and approximate version of the full algorithm, so you will not
-get statistically significant output**, and it might not look as *smooth* as the
-full-blown algorithm. But you can try out a whole chromosome ;)
+get statistically significant output**. But you can try out a whole chromosome ;)
 
 ### Base-resolution (arrow) plots
 
@@ -108,7 +114,7 @@ the following columns
 1. `re.up2`
 1. `re.dn2`
 
-Binned raw matrix (used for fast binless): tab or space-separated text file
+Binned raw matrix (used for fast binless): tab, comma or space-separated text file
 containing multiple datasets. The first line is a header that must start with
 `"name" "bin1" "pos1" "bin2" "pos2" "distance" "observed" "nobs"`. Optionally, more columns
 can be added but make sure their column names are different.
@@ -126,6 +132,3 @@ provide them as integers starting at 1 (i.e. use 1 for the first dataset, 2 for
 Also, **you must have pos2 >= pos1, and the data must be sorted by name, pos1 and pos2, in that order**.
 
 
-
-
-
diff --git a/arrow_plot.R b/arrow_plot.R
@@ -16,11 +16,14 @@ library(binless)
 #if you use a Mac, use gzcat instead of zcat, or provide the path to an uncompressed file
 #refer to README.md for a description of the tsv file format
 #Use nrows optional argument if you only want to read parts of the file
-data=read_tsv("zcat example/GM12878_MboI_HICall_FOXP1ext.tsv.gz")
+data=read_tsv("zcat example/GM12878_MboI_HICall_SEMA3C.tsv.gz")
 
 #plot the whole region at 10kb resolution
 plot_binned(data, resolution=10000, b1=data[,min(rbegin1)], e1=data[,max(rend2)])
 
+#plot the whole region at 5kb resolution
+plot_binned(data, resolution=5000, b1=data[,min(rbegin1)], e1=data[,max(rend2)])
+
 #arrow plots need a category column. we can add a dummy one
 data[,category:="NA"]
 #plot a 20kb subset of it with base resolution (arrow plot)
@@ -41,3 +44,8 @@ data = categorize_by_new_type(data, dangling.L = c(0), dangling.R = c(3), maxlen
 #plot the same region as before, with the new colours
 plot_raw(data, b1=data[,min(rbegin1)+50000], e1=data[,min(rbegin1)+70000])
 
+#plot a region that's further away from the diagonal
+plot_raw(data, b1=data[,min(rbegin1)+120000], e1=data[,min(rbegin1)+130000],
+         b2=data[,min(rbegin1)+900000], e2=data[,min(rbegin1)+910000])
+
+
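
The README hunk above requires the binned raw matrix to start with a fixed set of columns, to satisfy pos2 >= pos1, and to be sorted by name, pos1 and pos2. As a rough sketch of how one might check these constraints before running fast binless, the base R helper below could be used. `check_binned_matrix` is a hypothetical name and is not part of the binless package, the toy values are made up for illustration, and dataset names are assumed to be integer-coded as the README suggests.

```r
# Hypothetical helper, NOT part of binless: checks the constraints that
# README.md states for the binned raw matrix used by fast binless.
required_cols <- c("name", "bin1", "pos1", "bin2", "pos2",
                   "distance", "observed", "nobs")

check_binned_matrix <- function(mat) {
  # the header must start with the eight required columns, in this order
  stopifnot(ncol(mat) >= length(required_cols),
            identical(names(mat)[seq_along(required_cols)], required_cols))
  # optional extra columns must not reuse an existing column name
  stopifnot(anyDuplicated(names(mat)) == 0)
  # pos2 must never be smaller than pos1
  stopifnot(all(mat$pos2 >= mat$pos1))
  # rows must already be sorted by name, then pos1, then pos2
  # (order() leaves ties in their original position)
  ord <- order(mat$name, mat$pos1, mat$pos2)
  stopifnot(identical(ord, seq_len(nrow(mat))))
  invisible(TRUE)
}

# Toy input with made-up values; dataset names are integers starting at 1
mat <- data.frame(name = c(1L, 1L, 2L),
                  bin1 = c(1L, 1L, 1L), pos1 = c(0, 0, 0),
                  bin2 = c(1L, 2L, 1L), pos2 = c(0, 5000, 0),
                  distance = c(0, 5000, 0),
                  observed = c(10L, 3L, 7L),
                  nobs = c(1L, 1L, 1L))
check_binned_matrix(mat)
```

The function stops with an error at the first violated constraint and returns `TRUE` invisibly otherwise, so it can be placed directly before the call that consumes the matrix.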