From 40ba919af803b0836386e4387dcc1549df0c405f Mon Sep 17 00:00:00 2001
From: medmaca
Date: Thu, 19 Dec 2019 10:52:54 +0000
Subject: [PATCH] Added new version of PGCNA.

---
 PGCNA/README.md                        |  339 ++++
 pgcna-multi.py => PGCNA/pgcna-multi.py | 2416 ++++++++++++------------
 pgcna.py => PGCNA/pgcna.py             | 1412 +++++++-------
 PGCNA2/README.md                       |  335 ++++
 PGCNA2/pgcna2.py                       | 1124 +++++++++++
 README.md                              |  322 +---
 6 files changed, 3718 insertions(+), 2230 deletions(-)
 create mode 100644 PGCNA/README.md
 rename pgcna-multi.py => PGCNA/pgcna-multi.py (97%)
 rename pgcna.py => PGCNA/pgcna.py (97%)
 create mode 100644 PGCNA2/README.md
 create mode 100644 PGCNA2/pgcna2.py

diff --git a/PGCNA/README.md b/PGCNA/README.md
new file mode 100644
index 0000000..4dcd9cb
--- /dev/null
+++ b/PGCNA/README.md
@@ -0,0 +1,339 @@

# PGCNA (Parsimonious Gene Correlation Network Analysis)

## Introduction

PGCNA is a gene correlation network analysis approach that is computationally simple yet yields stable and biologically meaningful modules, and allows visualisation of very large networks, showing substructure and relationships that are normally hard to see. The parsimonious approach, retaining only the 3 most correlated edges per gene, results in a vast reduction in network complexity, meaning that large networks can be clustered quickly and reduced to a small output file that can be used by downstream software.

## Citation

For more details see:
[Care, M.A., Westhead, D.R., Tooze, R.M., 2019. Parsimonious Gene Correlation Network Analysis (PGCNA): a tool to define modular gene co-expression for refined molecular stratification in cancer. npj Syst. Biol. Appl. 5, 13. https://doi.org/10.1038/s41540-019-0090-7](https://www.nature.com/articles/s41540-019-0090-7).

Please cite this when using PGCNA.

### Paper website

PGCNA paper website: [http://pgcna.gets-it.net](http://pgcna.gets-it.net)

----------

## Requirements

Python 2.7 and the following non-standard packages: numpy, scipy (for pgcna-multi.py) and h5py.
I recommend the Anaconda distribution ([https://www.anaconda.com/download/](https://www.anaconda.com/download/)), which comes with numpy, scipy and h5py included. This script has been tested on Windows 10 (Python 2.7.14, numpy 1.15.4, scipy 1.1.0 and h5py 2.8.0) and Linux CentOS 6.9 (Python 2.7.13, numpy 1.13.1, scipy 0.19.1 and h5py 2.7.0).

### Optional, but recommended

Pgcna.py / pgcna-multi.py can be run without generating modules (**--noFastUF**) if all you require is output for network visualisation tools (e.g. Gephi/Cytoscape). For single data-set analysis use pgcna.py, which has fewer non-standard package requirements and uses less memory.

#### Fast unfolding of communities in large networks (Louvain)

To carry out clustering you will require the additional package **Louvain** ([Blondel, V.D., Guillaume, J.-L., Lambiotte, R., and Lefebvre, E. (2008). **Fast unfolding of communities in large networks**. J. Stat. Mech. Theory Exp. 2008, P10008.](https://arxiv.org/abs/0803.0476)). This can be downloaded here: [https://sourceforge.net/projects/louvain/files/louvain-generic.tar.gz/download](https://sourceforge.net/projects/louvain/files/louvain-generic.tar.gz/download). Please make sure that you've downloaded v0.3 (see the package README.txt).
##### Example installation

On Linux CentOS:
```
$ wget https://sourceforge.net/projects/louvain/files/louvain-generic.tar.gz
$ tar -xvzf louvain-generic.tar.gz
$ cd louvain-generic
$ make
```
Copy **louvain**, **convert** and **hierarchy** into your bin folder.

#### Gephi

For visualising the output of PGCNA we highly recommend using Gephi ([https://gephi.org/](https://gephi.org/)), which is able to lay out large networks very efficiently. It has the added bonus that it includes the **louvain/FastUnfold** method, so you can quickly visualise the modularity of the network. See **Examples** below for details of loading the output from pgcna.py into Gephi and analysing it.

----------

## Installation

Using github:

```
git clone https://github.com/medmaca/PGCNA.git
```

Manual download: [https://github.com/medmaca/PGCNA/archive/master.zip](https://github.com/medmaca/PGCNA/archive/master.zip)

----------

## Overview

We have endeavoured to make the process of using PGCNA as simple as possible; there are relatively few user choices. The default parameters will give good results, and most users will only want to alter the **--corrChunk** option, depending on the amount of available RAM (see **Common options**).

### Note on terminology

The louvain/FastUnfold method **clusters** the network: each run produces a **clustering** of the data, and each clustering contains individual **clusters/modules** (the terms are used interchangeably). Due to a random start position, each louvain/FastUnfold run produces a different result; these results are highly related but differ in their exact gene/module groupings. PGCNA uses the louvain/FastUnfold modularity score to rank the different clusterings of the data (number set with the -n/--fuNumber parameter) and selects the best (number set with the -r/--fuRetain parameter) for output.

## Usage

### Single vs multiple data-set analysis

**Pgcna.py** is for processing **single data-sets**; should you want to use **multiple data-sets** to generate a median correlation file (as used in the paper) then use **pgcna-multi.py**. Many of the parameters are shared between the two scripts, with pgcna-multi.py gaining a few additional options (highlighted below). It is possible to analyse single data-sets with pgcna-multi.py, and this will result in exactly the same output as pgcna.py.

### Input file

#### Single data-set analysis

The input file for **pgcna.py** should be an expression file with unique identifiers (genes/probes etc.) in the first column and the expression values across the samples in the remaining columns. Redundancy of identifiers is not allowed and must be dealt with before processing with PGCNA. The default is a tab-separated file with a single header line; this can be altered with the --fileSep and --headerL parameters, respectively.
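Because PGCNA exits as soon as it meets a duplicate identifier, it can save time to sanity-check an expression file before a long run. Below is a minimal stand-alone sketch (not part of PGCNA; check_expression_file and expression.txt are illustrative names), assuming the default tab separator and one header line:

```
import csv

def check_expression_file(path, sep="\t", header_lines=1):
    """Pre-flight check: unique identifiers in column 1 and a
    consistent number of expression columns on every row."""
    seen = set()
    n_samples = None
    with open(path) as handle:
        for i, row in enumerate(csv.reader(handle, delimiter=sep)):
            if i < header_lines:
                continue  # skip header line(s)
            gene, values = row[0], row[1:]
            if gene in seen:
                raise ValueError("Duplicate identifier: " + gene)
            seen.add(gene)
            if n_samples is None:
                n_samples = len(values)
            elif len(values) != n_samples:
                raise ValueError("Ragged row at line " + str(i + 1))
    print("OK: " + str(len(seen)) + " unique genes, " + str(n_samples) + " samples")

check_expression_file("expression.txt")
```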
#### Multiple data-set analysis

When running **pgcna-multi.py** the -d/--dataF option should point to a folder containing multiple data-sets (rather than a single data-set). These files should be in the same format as those used in a single data-set analysis (i.e. unique identifiers in the first column, remaining columns containing expression values).

The data-set folder also needs an additional file (default name #FileInfo.txt; set by -m/--metaInf) containing a tab-separated list of the file names to process (1st column) and the number of header lines, to allow for meta data, in those files (2nd column). This file is also expected to have a header, e.g.:
```
Filename	HeaderLines
FileName1.txt	3
FileName2.txt	1
FileName3.txt	12
```

### Basic run

PGCNA run via the **pgcna.py** script:

On Windows
```
python pgcna.py -w workFolderPath -d expressionFilePath
```

On Linux
```
./pgcna.py -w workFolderPath -d expressionFilePath
```

PGCNA run via the **pgcna-multi.py** script:

On Windows
```
python pgcna-multi.py -w workFolderPath -d expressionFilesFolderPath
```

On Linux
```
./pgcna-multi.py -w workFolderPath -d expressionFilesFolderPath
```

### Common options

#### Alter RAM requirements (--corrChunk)

Within the PGCNA method, the step that requires the most RAM is calculating the pairwise correlations. For a small number of genes (<=20,000) this is easily carried out in memory, but the requirement increases dramatically with increasing gene/probe numbers: 2x10^4 = 3GB, 4x10^4 = 12GB, 6x10^4 = 27GB, 8x10^4 = 48GB, 1x10^5 = 75GB etc. PGCNA carries out only a small portion of the processing in memory, breaking the large correlation problem into chunks. This means that with default settings (--corrChunk 5000) PGCNA can process a 2x10^5 matrix in <1GB of RAM, rather than the 300GB that would be required in memory alone.

When processing larger matrices the user can choose to increase **--corrChunk** to utilise more memory for calculating the correlation. While this will speed up the correlation step, note that this is only one step in the PGCNA process, so the total time saving may be minor.

Setting PGCNA to use a correlation chunk size of 20,000:

```
./pgcna.py -w workFolderPath -d expressionFilePath --corrChunk 2e4
```
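For readers curious how the chunking keeps memory flat, the sketch below mirrors the approach used in the scripts (a rank transform for a fast, ties-ignoring Spearman, then normalised dot products computed block by block), but holds the full result in memory for brevity where PGCNA instead streams each chunk to an HDF5 file; chunked_correlation is an illustrative name, not a PGCNA function:

```
import numpy as np

def chunked_correlation(npA, chunk=5000, use_pearson=False):
    """Pairwise row correlations computed block by block."""
    npA = npA.astype(np.float64)
    if not use_pearson:
        # Rank transform (ties not averaged, the same shortcut PGCNA uses)
        npA = npA.argsort(axis=1).argsort(axis=1).astype(np.float64)
    # Centre and normalise rows, so a dot product of two rows = correlation
    npA -= npA.mean(axis=1)[:, None]
    npA /= np.sqrt((npA * npA).sum(axis=1))[:, None]
    n = npA.shape[0]
    corr = np.empty((n, n))  # PGCNA writes each block to HDF5 instead
    for r in range(0, n, chunk):
        for c in range(0, n, chunk):
            corr[r:r + chunk, c:c + chunk] = np.dot(npA[r:r + chunk],
                                                    npA[c:c + chunk].T)
    return corr

data = np.random.rand(100, 20)  # 100 genes x 20 samples
corr = chunked_correlation(data, chunk=32)
```

Only two chunk-sized blocks of rows are ever multiplied at once, which is why the peak RAM depends on --corrChunk rather than on the total gene count.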
#### Retain all genes

With default settings PGCNA will only retain the top 80% most variable genes in the expression data for analysis. If the input expression file has been pre-filtered to remove invariant genes you may want to process all of the data; this can be accomplished using the -f/--retainF parameter:

```
./pgcna.py -w workFolderPath -d expressionFilePath --retainF 1
```

#### Run without clustering

If you don't have the **louvain** package installed and wish to generate output for downstream tools, you can still run PGCNA with the following command:

```
./pgcna.py -w workFolderPath -d expressionFilePath --noFastUF
```

#### Changing input file separator and header size

The default input format is a tab ("\t") separated file with a single header line. This can be changed with the **--fileSep** and **--headerL** parameters, so for a .csv file with 3 header lines:

```
./pgcna.py -w workFolderPath -d expressionFilePath --fileSep "," --headerL 3
```
Note: for pgcna-multi.py the number of header lines is set using the meta-information file (default #FileInfo.txt) rather than --headerL. In addition, all the data-sets to be processed must use the same separator, so for a folder of .csv files:
```
./pgcna-multi.py -w workFolderPath -d expressionFilesFolderPath --fileSep ","
```

#### Changing the number of retained edges

The default is to retain the top 3 most correlated edges per gene, and in our paper we show that there is no benefit from increasing this. However, should users wish to increase this they can do so using the **--edgePG** parameter. Increasing the number of edges will result in the **louvain/fastUnfold** method generating fewer clusters/modules, yet these clusters will be supersets of the clusters/modules produced with fewer edges. Note: when expression data contains groups of highly correlated genes it is possible for the default 3 edges to generate orphan modules (modules disconnected from the rest of the network); in this situation increasing the number of edges should reconnect these to the rest of the network.

To run PGCNA retaining 5 edges:

```
./pgcna.py -w workFolderPath -d expressionFilePath --edgePG 5
```

#### Louvain/FastUnfold options

By default PGCNA will run 100 clusterings of the data using louvain and will then process the best one (judged by its louvain modularity score). Users may wish to increase both the number of clusterings of the network that are carried out and the number retained for downstream processing.

For instance, to run 10,000 clusterings and retain the top 10:

```
./pgcna.py -w workFolderPath -d expressionFilePath -n 1e4 -r 10
```

#### pgcna-multi.py specific options

pgcna-multi.py merges the correlations across data-sets by calculating a gene's median correlation across all data-sets that contain that gene. During this process pgcna-multi.py can, if the user wants, output a folder containing a single file per gene in the median correlation matrix. Each file shows the correlations between that single gene and all other genes across the different data-sets, allowing the user to see whether the data-sets have a good level of agreement, and potentially highlighting outliers.

To output all single gene correlation files, use **--singleCorr**. Note: this **significantly** increases the time to construct the median correlation matrix:

```
./pgcna-multi.py -w workFolderPath -d expressionFilesFolderPath --singleCorr
```

If you want to limit the single gene correlation output to a subset of the total genes, you can use the **--singleCorrL** option. This uses a file in the --workFolder (default corrGenes.txt; set by --singleCorrListF) to provide a list of genes to limit the output to. If --singleCorrL is set and corrGenes.txt is missing, it will be created and the default gene (IRF4) included. Using --singleCorrL is highly recommended over --singleCorr if you're only interested in a small number of genes, due to the significant speed increase it provides.

```
./pgcna-multi.py -w workFolderPath -d expressionFilesFolderPath --singleCorrL
```

See **Output** for information on all downstream files.
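As a sketch of reading these outputs back, the snippet below pulls the top-ranked partners from one single gene correlation file. The column layout follows the header the script writes (Gene, NumDataSets^MedianRho, NumDataSets, MedianRho, ...), rows are already ranked by the scaled correlation; the path is shortened for illustration (the real files sit in first-letter subfolders under the --corrMatFS folder), top_partners is an illustrative name, and the "rt" gzip mode shown is Python 3 usage (use "rb" on Python 2.7):

```
import csv
import gzip
from itertools import islice

def top_partners(single_corr_path, n=20):
    """Return (gene, MedianRho) for the first n rows of a
    --singleCorr/--singleCorrL output file."""
    with gzip.open(single_corr_path, "rt") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader)  # header: Gene, NumDataSets^MedianRho, NumDataSets, MedianRho, ...
        return [(row[0], float(row[3])) for row in islice(reader, n)]

# e.g. the IRF4 file written with the default --retainF of 0.8
for gene, median_rho in top_partners("IRF4_corr_RetainF0.8.txt.gz"):
    print(gene, median_rho)
```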
----------

## Parameters

Complete list of parameters and their default values.

You can also use the built-in help flag:
```
./pgcna.py -h
```

### Required

|Parameter|Description|Default Value|single/multi specific|
|---|---|---|---|
|-w, --workFolder|Base folder for all output|**Required**| |
|-d, --dataF|Expression data file path|**Required**|pgcna.py --> single file, pgcna-multi.py --> folder of files|

### Input file related

|Parameter|Description|Default Value|single/multi specific|
|---|---|---|---|
|-s, --fileSep|Separator used in expression file|"\t"| |
|--headerL|Number of header lines before expression values|1|pgcna.py only|
|-m, --metaInf|File containing information about data files to process (must be in same folder as --dataF)|#FileInfo.txt|pgcna-multi.py only|
|-f, --retainF|Retain gene fraction -- keeping most variant genes|0.8| |
|-g, --geneFrac|Fraction of expression files a gene needs to be present in to be included in the median correlation matrix|1/3|pgcna-multi.py only|
|-e, --edgePG|Edges to keep per gene -- **recommend leaving as default**|3| |

### Folders

|Parameter|Description|Default Value|single/multi specific|
|---|---|---|---|
|--outF|Root output folder|"PGCNA"| |
|--corrMatF|Correlation matrix folder -- where HDF5 files are stored|"CORR_MATRIX"| |
|--corrMatFS|Folder for single gene correlation files|"CORR_MATRIX_SG"|pgcna-multi.py only|
|--gephiF|Folder to store files for Gephi|"GEPHI"| |
|--fastUF|Folder for fast unfolding clustering output|"FAST_UNFOLD"| |

### Control usage

|Parameter|Description|Default Value|single/multi specific|
|---|---|---|---|
|--noFastUF|Flag -- don't run Fast Unfolding clustering but complete everything else|False| |
|--usePearson|Flag -- calculate Pearson's correlation coefficient instead of Spearman's rank correlation|False| |
|--keepBigF|Flag -- retain big HDF5 files after finishing, if needed for independent downstream analysis|False|For pgcna-multi.py --keepBigF retains the median correlation HDF5|
|--keepBigFA|Flag -- retain **ALL** big HDF5 files after finishing, if needed for independent downstream analysis|False|pgcna-multi.py only|
|--corrChunk|Size of chunk to split the correlation problem over -- higher values will speed up the correlation calculation at the cost of RAM|5000| |
|--ignoreDuplicates|Flag -- ignore correlation duplicates when cutting the top --edgePG genes; faster if set, but may miss genes if correlations are duplicated|False| |
|--singleCorr|Flag -- output individual gene correlation files -- warning: this generates 1 file per gene in the final correlation matrix|False|pgcna-multi.py only|
|--singleCorrL|Flag -- output individual gene correlation files -- limited to those in --singleCorrListF|False|pgcna-multi.py only|
|--singleCorrListF|If --singleCorrL is set then create single correlation files for all genes in this file (must be within --workFolder)|corrGenes.txt|pgcna-multi.py only|

### Clustering related

|Parameter|Description|Default Value|single/multi specific|
|---|---|---|---|
|-n, --fuNumber|Number of times to run the **louvain/FastUnfold** method|100| |
|-r, --fuRetain|How many of the best (based on louvain modularity) clusterings to process|1| |
|--fuRenumberStart|For the retained clusterings (--fuRetain), what value to start numbering them from|1| |
|--tOutFold|FastUnfold trees output folder|"Trees"| |
|--fTOutFold|FastUnfold final retained (--fuRetain) trees output folder|"TreesF"| |
|--cOutFold|FastUnfold clusters folder|"Clusters"| |
|--cOutTxtFold|FastUnfold clusters mapped back to genes|"ClustersTxt"| |
|--cOutListsFold|FastUnfold clusters mapped back to genes and split into individual text files|"ClustersLists"| |
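To make the --edgePG and --ignoreDuplicates rows above concrete: internally the scripts keep each row's top edgePG + 1 correlations (the extra slot allows for the gene's self-correlation on the matrix diagonal) and zero the rest using np.argpartition. A minimal sketch of that fast path, which, like the --ignoreDuplicates option, does not handle tied correlation values; keep_top_edges is an illustrative name:

```
import numpy as np

def keep_top_edges(row, edge_pg=3):
    """Zero every entry except the (edge_pg + 1) largest in a
    correlation-matrix row."""
    order = np.argpartition(row, -(edge_pg + 1))
    out = row.copy()
    out[order[:-(edge_pg + 1)]] = 0  # all but the top edge_pg + 1 values
    return out

row = np.array([1.0, 0.9, 0.1, 0.4, 0.8, 0.2])  # row[0] = self-correlation
print(keep_top_edges(row))  # [1.  0.9  0.  0.4  0.8  0. ]
```

Because only the partition point matters (not a full sort), this stays fast even for rows with 10^5 entries.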
### Output

If PGCNA is run using default settings, the root folder (-w, --workFolder) will contain the following:

* EPG3 (Edge per gene 3 folder)
    * *CorrelationSettingsInfo.txt : timestamped file containing the parameters used for the run.
    * CORR_MATRIX (if --keepBigF) : contains the temporary files used for calculating correlations
        * *_RetainF*.h5 : the HDF5 file of all pairwise correlations after edge reduction
        * *_RetainF*_Genes.txt : genes that remain after -f/--retainF filtering, ordered by descending standard deviation
    * CORR_MATRIX_SG (if --singleCorr/--singleCorrL) : contains gzipped single gene correlation files split into first-letter subfolders
    * FAST_UNFOLD : root folder for output from fast unfold
        * *retainF1.0_EPG3 : folder specific to an input data file
            * Clusters : clusterings output by the FastUnfold **hierarchy** tool
            * **ClustersLists** : contains subfolders for the -r/--fuRetain number of "best" clusterings. Each subfolder contains the genes split across the clusters/modules (M1...M*.txt)
            * **ClustersTxt** : contains the -r/--fuRetain number of "best" clusterings as individual *.csv files
            * GeneIntMap.txt : mapping of genes to numbers, required to get back to gene names after processing with **louvain**
            * modScores.txt : modularity scores across all clusterings -- used to rank clusterings and select the "best" to retain (-r/--fuRetain)
            * sym_edges.bin : binary version of sym_edges.txt, output by the FastUnfold **convert** tool and required by **louvain**
            * sym_edges.txt : gene pairs (encoded as numbers, see GeneIntMap.txt) along with their correlation score
            * sym_edges.weights : edge weights, output by the FastUnfold **convert** tool and required by **louvain**
            * Trees : trees output by **louvain**
            * TreesF : contains the -r/--fuRetain number of "best" trees
    * **GEPHI** : contains files for processing with the Gephi package ([https://gephi.org/](https://gephi.org/))
        * *_RetainF1.0_EPG3_Edges.tsv : tab-separated file of edges
        * *_RetainF1.0_EPG3_Nodes.tsv : tab-separated file of nodes

The most important folders for users are highlighted in **bold**.
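The ClustersTxt files are plain csv and easy to consume downstream. A small sketch (load_clustering is an illustrative name; "1.csv" is the best retained clustering under the default --fuRenumberStart of 1):

```
import csv
from collections import defaultdict

def load_clustering(csv_path):
    """Read a ClustersTxt file ("Id,Modularity Class" rows) into a
    {module: [genes]} dictionary."""
    modules = defaultdict(list)
    with open(csv_path) as handle:
        reader = csv.reader(handle)
        next(reader)  # skip the "Id,Modularity Class" header
        for gene, module in reader:
            modules[module].append(gene)
    return modules

for module, genes in sorted(load_clustering("1.csv").items(),
                            key=lambda kv: int(kv[0])):
    print("M" + module + ": " + str(len(genes)) + " genes")
```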
## Examples

### Gephi Visualisation

Visualising the output from PGCNA in Gephi is quite straightforward (correct for Gephi v0.9.2):

Using data from the **EPG3/GEPHI** folder:

1. Open Gephi
2. Select File/"New Project"
3. Select "Data Table" (tab above screen centre)
4. Select "Import Spreadsheet"
    1. Select *_RetainF1.0_EPG3_Nodes.tsv, make sure "Import as" is set to "Nodes table", click **Next** and then **Finish**. It should import with no issues; select **OK**
    2. Repeat with *_RetainF1.0_EPG3_Edges.tsv, making sure "Import as" is set to "Edges table", click **Next** and then **Finish**. It should import with no issues; select **OK**
5. Click "Graph" (tab above screen centre) -- you should now see a massive blob of nodes in the screen centre
6. Under Statistics/Network Overview (right of screen) select **Modularity**. This will run the built-in version of the louvain/fastUnfold method.
    1. Each run will generate a different clustering; the higher the score, the better the clustering is perceived to be.
7. Select Appearance/Partition (left of screen) and select "Modularity Class" from the drop-down list
    1. Click the Palette button in the bottom right of this panel and select Generate. Untick "Limit number of colors" and then select **OK**.
    2. If you want, you can manually change any of the resultant colours by left clicking on them and dragging the mouse around.
    3. When you're happy with the range of colours select the **Apply** button.
8. To lay out the network:
    1. Under **Layout** (left middle of screen) select **ForceAtlas2**
    2. For big networks set the following: "Approximate Repulsion" selected, Scaling < 1 (often as small as 0.1), "LinLog mode" selected and, importantly, "Prevent Overlap" **not** selected.
    3. Wait until the network layout has finished (you may need to alter **Scaling** if all nodes are pushed to the edge of the screen).
    4. Once you're happy with the network, set "Prevent Overlap" to selected to finally prevent node overlaps.

The Gephi version of louvain/fastUnfold is older than that used by PGCNA, so if you've run PGCNA with clustering included you should use the results contained in **EPG3/FAST_UNFOLD/\*retainF1.0_EPG3/ClustersTxt**. Assuming you've already loaded the network (see above):

1. Select "Data Table" (tab above screen centre)
2. Select "Import Spreadsheet"
    1. Within ClustersTxt choose a *.csv file to import
    2. Make sure "Import as" is set to "Nodes table", click **Next**, make sure **Modularity Class** is set to integer, then click **Finish**.
3. Follow step 7 above to set the appearance with the new Modularity Class data.

## Feedback and questions

If you have any queries or notice any bugs please email me at **m.a.care@leeds.ac.uk** (please include PGCNA in the subject heading).

diff --git a/pgcna-multi.py b/PGCNA/pgcna-multi.py
similarity index 97%
rename from pgcna-multi.py
rename to PGCNA/pgcna-multi.py
index ab156e3..534bbe6 100644
--- a/pgcna-multi.py
+++ b/PGCNA/pgcna-multi.py
@@ -1,1208 +1,1208 @@
-#!/usr/bin/env python
-"""
-This program is free software: you can redistribute it and/or modify
-it under the terms of the GNU General Public License as published by
-the Free Software Foundation, either version 3 of the License, or
-(at your option) any later version.
-
-This program is distributed in the hope that it will be useful,
-but WITHOUT ANY WARRANTY; without even the implied warranty of
-MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
-GNU General Public License for more details.
-
-You should have received a copy of the GNU General Public License
-along with this program. If not, see <http://www.gnu.org/licenses/>.
-
-Author: Matthew Care
-
-A memory efficient implementation of PGCNA, that requires very little RAM
-to process very large expression data-sets.
-
-This version is for processing multiple data-sets, via calculating a
-median correlation matrix.
-""" -from __future__ import print_function - -import sys -import os -import argparse -import shutil -import string -from datetime import datetime -from collections import defaultdict -import re -import gzip -import glob -from subprocess import Popen, PIPE, STDOUT -import numpy as np -import numpy.ma as ma -import scipy.stats as stats -import h5py - - -##----------------------------------------Create argparse argments----------------------------------------## -parser = argparse.ArgumentParser() -# Required -parser.add_argument("-w","--workFolder",help="Work Folder [REQUIRED]",default=None) -parser.add_argument("-d","--dataF",help="Expression data folder path [REQUIRED]",default=None) - -# Optional -parser.add_argument("-m","--metaInf",help="File containing information about data files (must be in same folder as --dataF) [#FileInfo.txt]",default="#FileInfo.txt") -parser.add_argument("-s","--fileSep",help="Separator used in expression file(s) [\t]",default="\t") -parser.add_argument("-f","--retainF",help="Retain gene fraction -- keeping most variant genes [0.8]",type=float,default=0.8) -parser.add_argument("-g","--geneFrac",help="Fraction of files a gene needs to be present in to be included in median corr matrix [1/3]",type=float,default=1/3.0) -parser.add_argument("-e","--edgePG",help="Edges to keep per gene [3] -- Highly recommend leaving as default",type=int,default=3) - -# Not often changed -parser.add_argument("--outF",help="Root output folder [PGCNA]",default="PGCNA") -parser.add_argument("--corrMatF",help="Correlation matrix folder [CORR_MATRIX]",default="CORR_MATRIX") -parser.add_argument("--corrMatFS",help="Folder for single gene correlation files [CORR_MATRIX_SG]",default="CORR_MATRIX_SG") -parser.add_argument("--gephiF",help="Folder to store files for Gephi [GEPHI]",default="GEPHI") -parser.add_argument("--fastUF",help="Fast Unfolding folder [FAST_UNFOLD]",default="FAST_UNFOLD") - -# Flags -parser.add_argument("--noFastUF",help="Don't try to run Fast Unfolding Clustering, but complete everything else [False] -- Flag",action="store_true") -parser.add_argument("--usePearson",help="Use Pearson Correlation instead of Spearman [False] -- Flag",action="store_true") -parser.add_argument("--keepBigFA",help="Keep ALL big HDF5 files after finishing [False] -- Flag",action="store_true") -parser.add_argument("--keepBigF",help="Keep median correlations HDF5 files after finishing [False] -- Flag",action="store_true") -parser.add_argument("--ignoreDuplicates",help="Ignore correlation duplicates when cutting top --edgePG genes [False] -- Flag, faster if set",action="store_true") -parser.add_argument("--singleCorr",help="Output individual gene correlation files -- Warning this generates 1 file per gene in final correlation matrix. 
[False] -- Flag",action="store_true") -parser.add_argument("--singleCorrL",help="Output individual gene correlation files -- limited to those in --singleCorrListF [False] -- Flag",action="store_true") - -# Single corr related -parser.add_argument("--singleCorrListF",help="If --singleCorrL is set then create single correlation files for all those in this file (must be within --workFolder)",default="corrGenes.txt") - -# Correlation related -parser.add_argument("--corrChunk",dest="corrChunk",help="Size of chunk (rows) to split correlation problem over [5000] -- Higher will speed up correlation calculation at cost of RAM",default=5000,type=float) - -# FastUnfolding specific -parser.add_argument("-n","--fuNumber",dest="fuRunNum",help="Number of times to run [100]",default=100,type=float) -parser.add_argument("-r","--fuRetain",dest="fuRetainNum",help="Retain top [1] clusterings",default=1,type=float) -parser.add_argument("--fuRenumberStart",dest="fuRenumberStart",help="FU clustering re-numbering start [1]",default=1,type=int) -parser.add_argument("--tOutFold",dest="tOutFold",help="Trees Ouput folder [Trees]",default="Trees") -parser.add_argument("--fTOutFold",dest="fTOutFold",help="Final retained Trees Ouput folder [TreesF]",default="TreesF") -parser.add_argument("--cOutFold",dest="cOutFold",help="Clusters out folder [Clusters]",default="Clusters") -parser.add_argument("--cOutTxtFold",dest="cOutTxtFold",help="Clusters out folder - mapped back to genes [ClustersTxt]",default="ClustersTxt") -parser.add_argument("--cOutListsFold",dest="cOutListsFold",help="Clusters out folder - split into lists [ClustersLists]",default="ClustersLists") - - -########################################################################################### -args = parser.parse_args() -if not args.workFolder or not args.dataF: - print("\n\nNeed to specifiy REQUIRED variables see help (-h)") - sys.exit() - -########################################################################################### -OUT_FOLDER = os.path.join(args.workFolder,args.outF,"EPG" + str(args.edgePG)) -if not os.path.exists(OUT_FOLDER): - os.makedirs(OUT_FOLDER) - -########################################################################################### -args.metaInf = os.path.join(args.dataF,args.metaInf) -########################################################################################### -if not args.noFastUF: - def testCommandOut(outR,testFail): - if re.search("command not found",outR): # /bin/sh: louvainn: command not found - return True - elif re.search("not recognized as an internal or external command",outR): - return True - - # Make sure that if Fast Unfolding is to be run it exists on the path - testCommand = "louvain -h" - p = Popen(testCommand,shell=True,stdout=PIPE,stderr=STDOUT) - stdout,stderr = p.communicate() - - testCommand = "convert" - p = Popen(testCommand,shell=True,stdout=PIPE,stderr=STDOUT) - stdout2,stderr = p.communicate() - - testCommand = "hierarchy" - p = Popen(testCommand,shell=True,stdout=PIPE,stderr=STDOUT) - stdout3,stderr = p.communicate() - - testFail = False - testFail = testCommandOut(stdout,testFail) - testFail = testCommandOut(stdout2,testFail) - testFail = testCommandOut(stdout3,testFail) - - if testFail: - print("\nCouldn't run Fast unfold commands (louvainn, convert and hierarchy) must be in your path/environment (to skip clustering step use --noFastUF flag), see readme for more information") - sys.exit() -########################################################################################### -# 
Tidy input -# Decode fileSep for passed "\t" -args.fileSep = args.fileSep.decode('string-escape') -args.corrChunk = int(round(args.corrChunk,0)) -args.fuRunNum = int(round(args.fuRunNum,0)) -args.fuRetainNum = int(round(args.fuRetainNum,0)) -##----------------------------------------LOGGER----------------------------------------## - - -class multicaster(object): - def __init__(self, filelist): - self.filelist = filelist - - def write(self, str): - for f in self.filelist: - f.write(str) - - -def concatenateAsString(joinWith,*args): - temp = [str(x) for x in args] - return joinWith.join(temp) - -print("##----------------------------------------",str(datetime.now()),"----------------------------------------##",sep="") -# print out settings -settingInf = concatenateAsString( - "\n", - "##----------------------------------------Arguments----------------------------------------##", - "# Required", - "WorkFolder [-w,--workFolder] = " + args.workFolder, - "Expresion data file path [-d,--dataF] = " + args.dataF, - "\n# Optional", - "Meta info file describing expression files [-m, --metaInf] = " + args.metaInf, - "Separator used in expression file [-s, --fileSep] = " + args.fileSep, - "Retain gene fraction [-f, --retainF] = " + str(args.retainF), - "Fraction of expression files gene required in to be retained [-g, --geneFrac] = " + str(args.geneFrac), - "Edges to retain per gene [-e, --edgePG] = " + str(args.edgePG), - "\n# Not changed often" - "Root output folder [--outF] = " + args.outF, - "Correlation matrix folder [--corrMatF] = " + args.corrMatF, - "Single Gene Correlation files folder [--corrMatFS] = " + args.corrMatFS, - "Gephi files folder [--gephiF] = " + args.gephiF, - "Fast unfolding folder [--fastUF] = " + args.fastUF, - "\n# Flags", - "Don't run Fast Unfolding clustering [--noFastUF] = " + str(args.noFastUF), - "Use Pearson Correlation [--usePearson] = " + str(args.usePearson), - "Keep big HDF5 files after run [--keepBigFA] = " + str(args.keepBigFA), - "Keep median correlations HDF5 files after run [--keepBigF] = " + str(args.keepBigF), - "Ignore correlation duplicates when cutting top --edgePG genes [--ignoreDuplicates] = " + str(args.ignoreDuplicates), - "Output individual gene correlation files [--singleCorr] = " + str(args.singleCorr), - "Output individual gene correlation files for select list (--singleCorrListF) [--singleCorrL] = " + str(args.singleCorrL), - "\n# Single gene correlation options", - "List of genes to process if --singleCorrL [--singleCorrListF]:\t" + str(args.singleCorrListF), - "\n# Correlation Options", - "Chunk size (rows) to split correlation over [--corrChunk]:\t" + str(args.corrChunk), - "\n# Fast-Unfolding Specific", - "Cluster output folder -- Clusters split into lists [--cOutListsFold]:\t" + str(args.cOutListsFold), - "Run number [-n, --fuNumber]:\t" + str(args.fuRunNum), - "Retain top number of clusterings [-r, --fuRetain]:\t" + str(args.fuRetainNum), - "Fast-Unfolding clustering renumber start [--fuRenumberStart]:\t" + str(args.fuRenumberStart), - "Tree output folder [--tOutFold]:\t" + str(args.tOutFold), - "Final retained tree output folder [--fTOutFold]:\t" + str(args.fTOutFold), - "Cluster output folder [--cOutFold]:\t" + str(args.cOutFold), - "Cluster output folder -- mapped back to genes [--cOutTxtFold]:\t" + str(args.cOutTxtFold), -) - -settingsF = open(os.path.join(OUT_FOLDER,str(datetime.now()).replace(" ","-").replace(":",".")[:-7] + "_CorrelationSettingsInfo.txt"),"w") -print(settingInf,file=settingsF) -settingsF.close() 
-########################################################################################### - - -##----------------------------------------Methods----------------------------------------## - - -def makeSafeFilename(inputFilename): - # Declare safe characters here - safechars = string.letters + string.digits + "~-_.@#" - try: - return filter(lambda c: c in safechars, inputFilename) - except: - return "" - pass - - -def make_key_naturalSort(): - """ - A factory function: creates a key function to use in sort. - Sort data naturally - """ - def nSort(s): - convert = lambda text: int(text) if text.isdigit() else text - alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key)] - - return alphanum_key(s) - return nSort - - -def listToPercentiles(x,inverse=False): - data = stats.rankdata(x, "average")/float(len(x)) - - if inverse: - return (1-data) * 100 - else: - return data * 100 - - -def dataSpread(x,fixNeg=True): - """ - Returns the min, Q1 (25%), median (Q2), Q3 (75%), max, IQR, Quartile Coefficient of Dispersion and IQR/Median (CV like) - - As QCOD can't cope with negative numbers if fixNeg==True will solve the problem by simply adding the abs(min) value to - all values. So new minimum will be zero. - """ - - if fixNeg: - if min(x) < 0: - x = np.array(x) - x = x + abs(min(x)) - - q1 = float(np.percentile(x,25,interpolation="lower")) - q2 = np.percentile(x,50) - q3 = float(np.percentile(x,75,interpolation="higher")) - - if (q2 == 0) or ((q3+q1) == 0): - return min(x), q1, q2, q3, max(x), (q3-q1), 0,0 - else: - return min(x), q1, q2, q3, max(x), (q3-q1), (q3-q1)/(q3+q1), (q3-q1)/q2 - - -def mad(arr): - """ Median Absolute Deviation: a "Robust" version of standard deviation. - Indices variabililty of the sample. - """ - arr = np.array(arr) - med = np.median(arr) - return np.median(np.abs(arr - med)) - - -def loadCorrGeneList(workF,listFN,defaultG="IRF4"): - - print("\nLoading list of genes for single correlation file output") - fileP = os.path.join(workF,listFN) - - if not os.path.exists(fileP): - print("\t\t",fileP,"does not exist, will create file and add default gene (",defaultG,")") - outF = open(fileP,"w") - print(defaultG,file=outF) - outF.close() - - corrG = {defaultG:1} - - else: - corrG = {} - for line in open(fileP): - line = line.rstrip() - - corrG[line] = 1 - - print("\tTotal genes:",len(corrG)) - - return corrG - - -def generateGeneMeta(outF,fileN,geneMetaInfo,genesNP,npA): - """ - Generate meta information for gene - """ - ########################################################################################### - def expToPercentiles(expToPerc,expA): - return [expToPerc[e] for e in expA] - - ########################################################################################### - print("\t\t# Generate gene meta information : ",end="") - - expressionVals = npA.flatten() - - # Convert from expression values to percentiles - expPercentiles = listToPercentiles(expressionVals,inverse=False) - - # Calculate mapping from expression to percentile - expToPerc = {} - for i,e in enumerate(expressionVals): - expToPerc[e] = expPercentiles[i] - - outF = open(os.path.join(outF,fileN + "_metaInf.txt"),"w") - for i,gene in enumerate(genesNP): - # Convert from expression values to percentiles, so can work out MAD of percentiles - genePercentiles = expToPercentiles(expToPerc,npA[i]) - - # min, Q1 (25%), median (Q2), Q3 (75%), max, IQR, Quartile Coefficient of Dispersion and IQR/Median (CV like) - try: - minE,q1E,q2E,q3E,maxE,iqrE,qcodE,iqrME = dataSpread(npA[i]) - except: - 
print("Issue with :",gene,npA[i]) - sys.exit() - - medianPercentiles = np.median(genePercentiles) - print(gene,q2E,qcodE,medianPercentiles,sep="\t",file=outF) - - geneMetaInfo[gene].append([q2E,qcodE,medianPercentiles]) - - outF.close() - print("done") - - -def mergeMetaInfo(geneMetaInfo): - print("\n\tMerging gene meta information") - medianMetaInfo = {} - - for gene in geneMetaInfo: - medians, qcods, percentiles = zip(*geneMetaInfo[gene]) - # Store Median_QCOD(varWithin), MedianPercentile, MADpercentile(VarAcross) - medianMetaInfo[gene] = [np.median(qcods),np.median(percentiles),mad(percentiles)] - - return medianMetaInfo - - -def loadMeta(metaFile,dataF,splitBy="\t",headerL=1): - - print("\n\nLoad meta-file (",os.path.basename(metaFile),")",sep="") - if not os.path.exists(metaFile): - print("\t\t# Meta file (",metaFile,") does not exist!",sep="") - sys.exit() - - header = headerL - fileInfo = {} - for line in open(metaFile): - cols = line.rstrip().split(splitBy) - - if header: - header -= 1 - continue - - if line.rstrip() == "": - continue - - totalPath = os.path.join(dataF,cols[0]) - if not os.path.exists(totalPath): - print("\t\t# File (",totalPath,") does not exist!, won't add to fileInfo",sep="") - else: - try: - fileInfo[totalPath] = int(cols[1]) - except: - print("Meta file line (",line.rstrip(),") is not formed properly, skipping") - - print("\tLoaded information on:",len(fileInfo),"files") - return fileInfo - - -def getGenes(metaInf,splitBy="\t",geneFrac=1/3.0,retainF=0.8): - """ - First pass over expression data-files to find the set of genes we will be - working with after filtering each data-set by retainF and then only keeping - genes that are present in geneFrac - """ - natSort = make_key_naturalSort() - - geneCount = defaultdict(int) - - print("\nLoad information on genes present per expression array:") - for fileP in sorted(metaInf,key=natSort): - fileN = os.path.basename(fileP) - - print("\t",fileN,end="") - - headerL = metaInf[fileP] - - ############################# - # Read in data file - seenGenes = {} - genes = [] - expression = [] - - print("\tReading expression data file:",fileN) - - for i, line in enumerate(open(fileP)): - cols = line.rstrip().split(splitBy) - - if headerL: - headerL -= 1 - continue - - genes.append(cols[0]) - if not cols[0] in seenGenes: - seenGenes[cols[0]] = 1 - else: - print("\n\nERROR: Duplicate gene (",cols[0],") in expression data (",fileN,"), expects unique identifiers!\nPlease remove duplicates and re-run\n",sep="") - sys.exit() - expression.append(cols[1:]) - - print("\t\tTotal genes = ",len(genes)) - ############################# - - ############################# - print("\t\t# Generate Numpy matrices") - # Move genes to numpy array, so can be sorted along with the rest - genesNP = np.empty((len(genes)),dtype=str) - genesNP = np.array(genes) - - # Create numpy array - nrow = len(expression[0]) - ncol = len(expression) - npA = np.empty((ncol,nrow),dtype=object) - for i,eSet in enumerate(expression): - npA[i] = eSet - - # Calculate SD for each gene - npSD = np.std(npA.astype(np.float64),axis=1) - - # Sort by SD - sortI = np.argsort(npSD)[::-1] # Sort ascending - # Sort matrix by index - npA = npA[sortI,:] - - # Sort genes by index - genesNP = genesNP[sortI] - - # Only retain top fract # - # Work out index to cut at - cutPos = int(round(retainF * genesNP.shape[0],0)) - print("\t\tRetain Fraction (",retainF,") Cut at: ",cutPos,sep="") - print("\t\t\tPre-cut shape:",npA.shape) - - # Retain most variant - npA = 
npA[:cutPos,].astype(np.float64) - genesNP = genesNP[:cutPos] - print("\t\t\tPost-cut shape:",npA.shape) - - # Keep track of seen genes - for gene in genesNP: - geneCount[gene] += 1 - - print("\n\tTotal unique genes/identifiers:",len(geneCount)) - - requiredFileNum = int(round(geneFrac * len(metaInf),0)) - print("\tRetaining genes present in ",round(geneFrac,2)," (>=",requiredFileNum,") of files : ",end="",sep="") - - genesToKeep = {} - for g in geneCount: - if geneCount[g] >= requiredFileNum: - genesToKeep[g] = 1 - - print(len(genesToKeep)) - - return genesToKeep,requiredFileNum - - -def loadHDF5files(outF,hdf5Paths): - - hdf5Files = {} - for fileN in hdf5Paths: - hdf5Files[fileN] = h5py.File(fileN,'r') - - return hdf5Files - - -def generateMasks(sortedKeepGenes,genesPerHDF5,hdf5paths): - - print("\tGenerating index mappings and masks to speed up combining correlations") - - masks = np.zeros((len(hdf5paths),len(sortedKeepGenes))) - - for i,g in enumerate(sortedKeepGenes): - for j,hd5 in enumerate(hdf5paths): - if g not in genesPerHDF5[hd5]: - masks[j][i] = 1 # Mask missing values - return masks - - -def generateCorrMatrix(workF,metaInf,genesToKeep,requiredFileNum,singleCorrList,geneFrac=1/3.0,retainF=0.8,corrMatF="CORR_MATRIX",corrMatFS="CORR_MATRIX_SG",usePearson=False,corrChunk=5000,splitBy="\t",decimalP=3,printEveryDiv=10,keepBigFA=False,singleCorr=False, - singleCorrL=False): - """ - Memory efficient method for generating all pairwise correlations of genes (rows) across a set of samples (columns). Uses HDF5 files to greatly - reduce memory useage, keeping most data residing on the hard disk. - """ - print("\n\nGenerate correlations for expression data:") - natSort = make_key_naturalSort() - sortedKeepGenes = sorted(genesToKeep,key=natSort) - - # Find mapping of keep genes - keepGeneMapping = {} - for i, g in enumerate(sortedKeepGenes): - keepGeneMapping[g] = i - - # Create subfolder - outF = os.path.join(workF,corrMatF + "_GMF" + str(requiredFileNum)) - if not os.path.exists(outF): - os.makedirs(outF) - - if singleCorr or singleCorrL: - outFS = os.path.join(workF,corrMatFS + "_GMF" + str(requiredFileNum)) - if not os.path.exists(outF): - os.makedirs(outF) - - geneMetaInfo = defaultdict(list) # Per data-set store gene meta information - genePosPerHDF5 = defaultdict(dict) # Mapping of location of gene in HDF5 file - perHDF5index = defaultdict(list) # Mapping of gene to where they should be in final matrix - genesPerHDF5 = defaultdict(dict) # Keep tally of which genes are in which HDF5 file - hdf5paths = [] - - print("\n\tCalculating correlations for data file:") - for fileP in sorted(metaInf,key=natSort): - fileN = os.path.basename(fileP) - - print("\t",fileN,end="") - - headerL = metaInf[fileP] - ############################# - # Read in data file - seenGenes = {} - genes = [] - expression = [] - - print("\tReading expression data file:",fileN) - - for i, line in enumerate(open(fileP)): - cols = line.rstrip().split(splitBy) - - if headerL: - headerL -= 1 - continue - - # Only retain required genes - if cols[0] not in genesToKeep: - continue - - genes.append(cols[0]) - if not cols[0] in seenGenes: - seenGenes[cols[0]] = 1 - else: - print("\n\nERROR: Duplicate gene (",cols[0],") in expression data (",fileN,"), expects unique identifiers!\nPlease remove duplicates and re-run\n",sep="") - sys.exit() - expression.append(cols[1:]) - - print("\t\tTotal genes = ",len(genes)) - ############################# - - ############################# - print("\t\t# Generate Numpy matrices") - # Move 
genes to numpy array - genesNP = np.empty((len(genes)),dtype=str) - genesNP = np.array(genes) - - # Store position of gene in matrix for this file and create mapping index for HDF5 files - outMatrixHDF5 = os.path.join(outF,fileN + "_RetainF" + str(retainF) + ".h5") - hdf5paths.append(outMatrixHDF5) - - for i,g in enumerate(genesNP): - perHDF5index[outMatrixHDF5].append(keepGeneMapping[g]) - genePosPerHDF5[g][outMatrixHDF5] = i - genesPerHDF5[outMatrixHDF5][g] = 1 - - # Create numpy array - nrow = len(expression[0]) - ncol = len(expression) - npA = np.zeros((ncol,nrow),dtype=np.float64) - for i,eSet in enumerate(expression): - npA[i] = eSet - - print("\t\t\tMarix shape:",npA.shape) - - # Output genes - outMatrixN = os.path.join(outF,fileN + "_RetainF" + str(retainF) + "_Genes.txt") - np.savetxt(outMatrixN,genesNP,fmt="%s",delimiter="\t") - - # Generate gene meta information - generateGeneMeta(outF,fileN + "_RetainF" + str(retainF),geneMetaInfo,genesNP,npA) - - ####################################### - # Calculate correlations - print("\t\t# Calculating correlations using HDF5 file to save memory") - rowN, colN = npA.shape - - if os.path.exists(outMatrixHDF5): - print("\t\tAlready exists -- skipping") - continue - with h5py.File(outMatrixHDF5, "w") as f: - h5 = f.create_dataset("corr", (rowN,rowN), dtype="f8") - - # # Load into memory - print("\t\t\tCreating HDF5 file") - h5 = h5py.File(outMatrixHDF5,'r+') - - if not usePearson: - # Note this isn't a perfect Spearman, as ties are not dealt with, but it is fast, and given we're only - # Retaining the most correlated edges will have almost no effect on the output. - npA = npA.argsort(axis=1).argsort(axis=1).astype(np.float64) - - # subtract means from the input data - npA -= np.mean(npA, axis=1)[:,None] - - # normalize the data - npA /= np.sqrt(np.sum(npA*npA, axis=1))[:,None] - - # Calculate correlations per chunk - print("\t\t\tCalculating correlations for Chunk:") - for r in range(0, rowN, corrChunk): - print("\t\t\t",r) - for c in range(0, rowN, corrChunk): - r1 = r + corrChunk - c1 = c + corrChunk - chunk1 = npA[r:r1] - chunk2 = npA[c:c1] - h5["corr"][r:r1, c:c1] = np.dot(chunk1, chunk2.T) - - # # Write output out for debugging - # finalCorr = np.copy(h5["corr"][:]) - # outMatrixN = os.path.join(outF,fileN + "_RetainF" + str(retainF) + "_Corr") - # stringFormat = "%." + str(decimalP) + "f" - # # stringFormat2 = "%." 
+ str(decimalPpVal) + "f" - # np.savetxt(outMatrixN + ".txt",finalCorr,fmt=stringFormat,delimiter="\t") - - # Remove matrices to save memory - del npA - del h5 - - # Calculate median gene meta information - medianMetaInfo = mergeMetaInfo(geneMetaInfo) - - - ########################################################################################### - ##----------------------------Calculate Median Correlations------------------------------## - ########################################################################################### - # Calculate median correlation matrix - print("\nCalculating median correlations") - if len(hdf5paths) == 1: - # If we're only analysing a single data-set - if not (singleCorr or singleCorrL): - print("\t\tOnly single data-set analysed, skipping generating median correlations") - # No need to calculate median correlations, just return path to HDF5 file and genes matrix - return outMatrixHDF5, genesNP - else: - print("\t\tOnly single data-set analysed, but --singeCorr/--singleCorrL so will proceed with output") - - printEvery = int(round(len(genesToKeep)/float(printEveryDiv),0)) - printE = printEvery - tell = printE - count = 0 - - # Create HDF5 median correlation matrix - outMatrixHDF5 = os.path.join(outF,"#Median_RetainF" + str(retainF) + ".h5") - with h5py.File(outMatrixHDF5, "w") as f: - h5 = f.create_dataset("corr", (len(genesToKeep),len(genesToKeep)), dtype="f8") - h5 = h5py.File(outMatrixHDF5,'r+') - - # Load HDF5 correlation files - hdf5Files = loadHDF5files(outF,hdf5paths) - - # Output genes - outMatrixN = os.path.join(outF,"#Median_RetainF" + str(retainF) + "_Genes.txt") - np.savetxt(outMatrixN,sortedKeepGenes,fmt="%s",delimiter="\t") - - # Get masks - maskMappings = generateMasks(sortedKeepGenes,genesPerHDF5,hdf5paths) - - print("\tCalculating median correlations (report every 1/",printEveryDiv,"th of total):",sep="") - if singleCorr or singleCorrL: - print("\t\tOutputting single gene correlation files:") - - for genePos, gene in enumerate(sortedKeepGenes): - # print("\t\t",gene1) - # Inform user of position - count += 1 - if count == printE: - printE = printE + tell - if singleCorr: - print("\n\t\tProcessed:",count,end="\n\n") - else: - print("\t\t",count) - - rowsPerHDF5 = {} - maskPos = [] - dataSetNames = [] - # Grab row for gene across files - for i,hdf5 in enumerate(hdf5paths): - try: - rowsPerHDF5[hdf5] = hdf5Files[hdf5]["corr"][genePosPerHDF5[gene][hdf5]] - maskPos.append(i) - dataSetNames.append(os.path.basename(hdf5)[:-3]) - except: - pass - - # Second pass - # Convert to numpy array - npA = np.full((len(rowsPerHDF5),len(sortedKeepGenes)),-10,dtype=np.float64) # Missing set to -10 - for i,hdf5 in enumerate(sorted(rowsPerHDF5,key=natSort)): - npA[i][perHDF5index[hdf5]] = rowsPerHDF5[hdf5] # Use indexes to place in correct location - - # Get appropriate masks - tempMask = [] - for i in maskPos: - tempMask.append(maskMappings[i]) - - npAMasked = ma.masked_array(npA,mask=tempMask) - - # Generate medians - medianRowCorr = np.copy(ma.median(npAMasked,axis=0)) - h5["corr"][genePos] = medianRowCorr - - ########################################################################################### - ##-------------------------------SINGLE GENE CORR----------------------------------------## - ########################################################################################### - def mergeCorrData(gene,corrInf,medianMetaInfo,medianCorr,missingVal=-10): - finalCorrs = [] - dataSetCount = 0 - - for corr in corrInf: - if (corr <= missingVal).all(): - - 
finalCorrs.append("") - else: - finalCorrs.append(str(round(corr,decimalP))) - dataSetCount += 1 - - roundedMeta = map(str,[round(origV,decimalP) for origV in medianMetaInfo[gene]]) - scaledCorr = dataSetCount ** medianCorr - finalInfo = gene + "\t" + str(round(scaledCorr,decimalP)) + "\t" + str(dataSetCount) + "\t" + str(round(medianCorr,decimalP)) + "\t" + "\t".join(roundedMeta) + "\t" + "\t".join(finalCorrs) - - return scaledCorr,finalInfo - ################################### - if singleCorr or singleCorrL: - - if singleCorrL: # Only output genes in list - if gene not in singleCorrList: - continue - - subFolder = os.path.join(outFS,gene[0]) - singleCorrFName = os.path.join(subFolder,makeSafeFilename(gene) + "_corr_RetainF" + str(retainF) + ".txt.gz") - if os.path.exists(singleCorrFName): - continue - - print("\t\t\tSingle Corr:",makeSafeFilename(gene)) - - # Make subfolder - if not os.path.exists(subFolder): - os.makedirs(subFolder) - - singleCorrF = gzip.open(singleCorrFName,"wb",compresslevel=9) - dataSetsH = "\t".join(dataSetNames) - - if usePearson: - print("Gene\tNumDataSets^MedianPCC\tNumDataSets\tMedianPCC\tMedian_QCODexpression(VarWithin)\tMedianPercentile\tMADpercentile(VarAcross)",dataSetsH,sep="\t",file=singleCorrF) - else: - print("Gene\tNumDataSets^MedianRho\tNumDataSets\tMedianRho\tMedian_QCODexpression(VarWithin)\tMedianPercentile\tMADpercentile(VarAcross)",dataSetsH,sep="\t",file=singleCorrF) - - rankedByCorr = defaultdict(list) - for i,g in enumerate(sortedKeepGenes): - scaledCorr, info = mergeCorrData(g,npA.T[i],medianMetaInfo,medianRowCorr[keepGeneMapping[g]]) - rankedByCorr[scaledCorr].append(info) - - # Rank by scaledCorr - for sCorr in sorted(rankedByCorr,reverse=True): - for info in rankedByCorr[sCorr]: - print(info,file=singleCorrF) - singleCorrF.close() - - ########################################################################################### - - # # Write output out for debugging - # finalCorr = np.copy(h5["corr"][:]) - # outMatrixN = os.path.join(outF,"#Median_RetainF" + str(retainF) + "_Corr") - # stringFormat = "%." 
+ str(decimalP) + "f" - # np.savetxt(outMatrixN + ".txt",finalCorr,fmt=stringFormat,delimiter="\t") - ########################################################################################### - ########################################################################################### - - # Remove all single HDF5 files unless requested to keep them - if not keepBigFA: - del hdf5Files - for hdf5P in hdf5paths: - os.remove(hdf5P) - - # # Return path to HDF5 file and genes matrix - return outMatrixHDF5, sortedKeepGenes - - -def reduceEdges(workF,dataF,gephiF,corrh5,genesM,retainF=0.8,edgePG=3,printEveryDiv=10,corrMatF=None,keepBigFA=False,keepBigF=False,ignoreDuplicates=False): - """ - Reduce edges in correlation matrix, only retaining edgePG maximum correlated genes per row - """ - - def bottomToZero(npA,n=1): - """ - Set everything below n to zero - """ - topI = np.argpartition(npA,-n) - npA[topI[:-n]] = 0 - return npA - - def bottomToZeroWithDuplicates(npA,n=1): - """ - Set everything below n to zero, - but deal with duplicates - """ - unique = np.unique(npA) - topIunique = np.argpartition(unique,-n)[-n:] - - toKeep = [] - for val in unique[topIunique]: - toKeep.extend(np.where(npA == val)[0]) - - # Mask and reverse - mask = np.ones(len(npA),np.bool) - mask[toKeep] = 0 - npA[mask] = 0 - - return npA - - fileN = os.path.basename(dataF) - - print("\tLoad HDF5 file") - # Load old HDF5 - h5 = h5py.File(corrh5,'r+') - rowN, colN = h5["corr"].shape - - printEvery = int(round(rowN/float(printEveryDiv),0)) - print("\nReduces edges to (",edgePG,") per gene:",sep="") - printE = printEvery - tell = printE - count = 0 - - print("\tWorking (report every 1/",printEveryDiv,"th of total):",sep="") - for i, row in enumerate(h5["corr"]): - # Inform user of position - count += 1 - if count == printE: - printE = printE + tell - print("\t\t",count) - - if ignoreDuplicates: - h5["corr"][i] = bottomToZero(row,edgePG + 1) - else: - h5["corr"][i] = bottomToZeroWithDuplicates(row,edgePG + 1) - - # Write output out for debugging - # finalCorr = np.copy(h5EPG["corr"][:]) - # outMatrixN = os.path.join(outF,fileN + "_RetainF" + str(retainF) + "_NonSym") - # stringFormat = "%." + str(3) + "f" - # # stringFormat2 = "%." + str(decimalPpVal) + "f" - # np.savetxt(outMatrixN + ".txt",finalCorr,fmt=stringFormat,delimiter="\t") - - ########################################################################################### - - print("\nGenerating files for Gephi network visualization tool") - # Create subfolder - outF = os.path.join(workF,gephiF) - if not os.path.exists(outF): - os.makedirs(outF) - - # First output list of genes (nodes) - print("\tCreate node file") - nodesFP = os.path.join(outF,fileN + "_RetainF" + str(retainF) + "_EPG" + str(edgePG) + "_Nodes.tsv") - nodesFile = open(nodesFP,"w") - print("Id\tLabel\tCluster",file=nodesFile) - for gene in genesM: - print(gene,gene,"NA",file=nodesFile,sep="\t") - nodesFile.close() - - # Second, output list of edges - print("\tCreate edges file (report every 1/",printEveryDiv,"th of total):",sep="") - edgesFP = os.path.join(outF,fileN + "_RetainF" + str(retainF) + "_EPG" + str(edgePG) + "_Edges.tsv") - edgesFile = open(edgesFP,"w") - print("Source\tTarget\tWeight\tType\tfromAltName\ttoAltName",file=edgesFile) - - # Finally output edges - printE = printEvery - tell = printE - count = 0 - seen = defaultdict() - printedLines = 0 - - for i,row in enumerate(h5["corr"]): - for j,item in enumerate(row): - if not (i == j): # Don't output self edges... 
- geneO = genesM[i] - geneT = genesM[j] - if not (geneO + "-@@@-" + geneT) in seen: - # Output info - if not item == 0: - print(geneO,geneT,item,"undirected",geneO,geneT,sep="\t",file=edgesFile) - printedLines += 1 - - # Store the fact that we've see this and it's equivalent pair - seen[geneO + "-@@@-" + geneT] = 1 - seen[geneT + "-@@@-" + geneO] = 1 - - count += 1 - if count == printEvery: - printEvery = printEvery + tell - print("\t\t",count,"Genes",", Edges:",printedLines) - - edgesFile.close() - print("\n\t\tTotal printed (",printedLines,") edges",sep="") - - if not (keepBigFA or keepBigF): - h5.close() - shutil.rmtree(os.path.join(workF,corrMatF)) - - return edgesFP, nodesFP - - -def prepareFastUnfold(workF,dataF,fastUF,edgesFP,retainF=0.8,edgePG=3,splitBy="\t",header=1): - print("\nPreparing files for clustering using Fast Unfolding method") - - fileN = os.path.basename(dataF) - # Create subfolder - outF = os.path.join(workF,fastUF,fileN + "_retainF" + str(retainF) + "_EPG" + str(edgePG)) - if not os.path.exists(outF): - os.makedirs(outF) - - nodePairs = defaultdict(dict) - naturalKey = make_key_naturalSort() - - nameToInt = {} - nameIntCount = 0 - - edgeNum = 0 - for line in open(edgesFP): - cols = line.rstrip().split(splitBy) - - if header: - header -= 1 - continue - - gene1 = cols[0] - gene2 = cols[1] - edgeNum += 1 - - # Convert from name to int - if gene1 in nameToInt: - gene1 = nameToInt[gene1] - else: - nameToInt[gene1] = str(nameIntCount) - gene1 = str(nameIntCount) - nameIntCount += 1 - - if gene2 in nameToInt: - gene2 = nameToInt[gene2] - else: - nameToInt[gene2] = str(nameIntCount) - gene2 = str(nameIntCount) - nameIntCount += 1 - - ################################# - - nodePairs[gene1][gene2] = cols[2] - - print("\tTotal edges =",edgeNum) - - outFilePath = os.path.join(outF,"sym_edges.txt") - outFile = open(outFilePath,"w") - for node in sorted(nodePairs,key=naturalKey): - for node2 in sorted(nodePairs[node],key=naturalKey): - print(node,node2,nodePairs[node][node2],sep="\t",file=outFile) - outFile.close() - - # Write out mapping between genes and ints - outFilePath2 = os.path.join(outF,"GeneIntMap.txt") - outFile2 = open(outFilePath2,"w") - for gene in sorted(nameToInt,key=naturalKey): - print(gene,nameToInt[gene],sep="\t",file=outFile2) - outFile2.close() - - return outFilePath, outFilePath2 - - -def runFastUF(symEdgesF,geneIntMapF,tOutFold,fTOutFold,cOutFold,cOutTxtFold,cOutListsFold,runNum=10,retainNum=1,printEveryDiv=10): - MOD_SCORES = "modScores.txt" - - print("\nRunning Fast Unfolding") - baseFold,symEdgesFile = os.path.split(symEdgesF) - - # Make folders - T_OUT_FOLDER = os.path.join(baseFold,tOutFold) - TF_OUT_FOLDER = os.path.join(baseFold,fTOutFold) - C_OUT_FOLDER = os.path.join(baseFold,cOutFold) - C_OUT_TXT_FOLDER = os.path.join(baseFold,cOutTxtFold) - C_OUT_LISTS_FOLDER = os.path.join(baseFold,cOutListsFold) - - # Remove old - if os.path.exists(TF_OUT_FOLDER): - shutil.rmtree(TF_OUT_FOLDER) - if os.path.exists(T_OUT_FOLDER): - shutil.rmtree(T_OUT_FOLDER) - if os.path.exists(C_OUT_FOLDER): - shutil.rmtree(C_OUT_FOLDER) - if os.path.exists(C_OUT_TXT_FOLDER): - shutil.rmtree(C_OUT_TXT_FOLDER) - if os.path.exists(C_OUT_LISTS_FOLDER): - shutil.rmtree(C_OUT_LISTS_FOLDER) - - # Replace - if not os.path.exists(T_OUT_FOLDER): - os.makedirs(T_OUT_FOLDER) - if not os.path.exists(TF_OUT_FOLDER): - os.makedirs(TF_OUT_FOLDER) - if not os.path.exists(C_OUT_FOLDER): - os.makedirs(C_OUT_FOLDER) - if not os.path.exists(C_OUT_TXT_FOLDER): - os.makedirs(C_OUT_TXT_FOLDER) - if 
not os.path.exists(C_OUT_LISTS_FOLDER): - os.makedirs(C_OUT_LISTS_FOLDER) - - ##################################### - - # Fast Unfold commands - def convertBinary(iF): - command = "convert -i " + iF + " -o " + iF[:-3]+"bin " + " -w " + iF[:-3] + "weights" - p = Popen(command,shell=True,stdout=PIPE,stderr=STDOUT) - stdout,stderr = p.communicate() - - def cluster(wF,iF,outFolder,runNum,PRINT_EVERY): - modScores = [] - - printEvery = PRINT_EVERY - tell = printEvery - count = 0 - - for i in range(runNum): - command = "louvain " + iF[:-3]+"bin -w " + iF[:-3] + "weights -l -1 > " + os.path.join(outFolder,str(i)) - p = Popen(command,shell=True,stdout=PIPE,stderr=STDOUT) - stdout,stderr = p.communicate() - modScores.append(stdout.rstrip()) - - # Notify position - count += 1 - if count == printEvery: - printEvery = printEvery + tell - print("\t",count) - - outFile = open(os.path.join(wF,MOD_SCORES),"w") - for i,score in enumerate(modScores): - print(i,score,sep="\t",file=outFile) - outFile.close() - - def selectClusterings(wF,modS,retainNumber,treeFold,finalTreeFold,splitBy="\t"): - clusters = [] - scores = [] - - for line in open(os.path.join(wF,modS)): - cols = line.rstrip().split(splitBy) - - clusters.append(cols[0]) - scores.append(float(cols[1])) - - j = zip(scores,clusters) - jSorted = sorted(j,reverse=True) - - scoresS,clustersS = zip(*jSorted) - retained = clustersS[:retainNumber] - - for clst in retained: - shutil.copy(os.path.join(treeFold,clst),os.path.join(finalTreeFold,clst)) - - def outputClusters(finalTreeFold,clustFold): - for path in glob.glob(os.path.join(finalTreeFold,"*")): - fileName = os.path.basename(path) - command = "hierarchy " + path - p = Popen(command,shell=True,stdout=PIPE,stderr=STDOUT) - stdout,stderr = p.communicate() - - bottomLevel = int(stdout.split("\n")[0].split(":")[-1].rstrip()) - 1 - - command = "hierarchy " + path + " -l " + str(bottomLevel) + " > " + os.path.join(clustFold,fileName) - p = Popen(command,shell=True,stdout=PIPE,stderr=STDOUT) - stdout,stderr = p.communicate() - - def readMappings(mF,splitBy="\t"): - - intToGene = {} - - for line in open(mF): - cols = line.rstrip().split(splitBy) - - if cols[0] in intToGene: - print("ALREADY SEEN",cols[0],"SHOULD BE NON-REDUNDANT!") - sys.exit() - - intToGene[cols[1]] = cols[0] - - return intToGene - - def processClusters(mapping,inFold,outFold,outListsFold,renumber_start=1,splitBy=" "): - - natKey = make_key_naturalSort() - - fileNumber = renumber_start - for path in sorted(glob.glob(os.path.join(inFold,"*")),key=natKey): - outFile = open(os.path.join(outFold,str(fileNumber) + ".csv"),"w") - - # Create folder to store split gene lists in - listFold = os.path.join(outListsFold,"Clust" + str(fileNumber)) - if not os.path.exists(listFold): - os.makedirs(listFold) - - fileNumber += 1 - - print("Id,Modularity Class",file=outFile) - byClust = defaultdict(list) - - for line in open(path): - cols = line.rstrip().split(splitBy) - - gene, clust = cols - clust = str(int(clust) + 1) # Start numbering from 1 not 0 - if gene in mapping: - byClust[clust].append(mapping[gene]) - else: - print("Gene (",gene,") is missing from mapping file!") - - for clst in sorted(byClust,key=natKey): - tempLF = open(os.path.join(listFold,"M" + str(clst) + ".txt"),"w") - for gene in sorted(byClust[clst],key=natKey): - print(gene,clst,sep=",",file=outFile) - print(gene,file=tempLF) - tempLF.close() - outFile.close() - - ##################################### - ##################################### - - print("\tConvert file to binary") - 
convertBinary(symEdgesF) - - print("\tRunning clusterings") - printEvery = int(round(runNum/float(printEveryDiv),0)) - - cluster(baseFold,symEdgesF,T_OUT_FOLDER,runNum,printEvery) - - print("\tRetrieving best",args.fuRetainNum,"clusterings") - selectClusterings(baseFold,MOD_SCORES,retainNum,T_OUT_FOLDER,TF_OUT_FOLDER) - - print("\tOutput clusterings from bottom level") - outputClusters(TF_OUT_FOLDER,C_OUT_FOLDER) - - # Post-process results back to human readable files - print("\tMap nodes back to genes and split into lists") - mappings = readMappings(geneIntMapF) - processClusters(mappings,C_OUT_FOLDER,C_OUT_TXT_FOLDER,C_OUT_LISTS_FOLDER,renumber_start=args.fuRenumberStart) - - - - -########################################################################################### - -def main(finishT="Finished!"): - - if args.singleCorrL: - singleCorrList = loadCorrGeneList(args.workFolder,args.singleCorrListF) - else: - singleCorrList = None - - # Load meta-file detailing list of files to process - metaInf = loadMeta(args.metaInf,args.dataF) - - # Find genes present in each data-set - genesToKeep, requireFileNum = getGenes(metaInf,splitBy=args.fileSep,geneFrac=args.geneFrac,retainF=args.retainF) - - # Generate correlation HDF5 files - corrh5,genesM = generateCorrMatrix(OUT_FOLDER,metaInf,genesToKeep,requireFileNum,singleCorrList,geneFrac=args.geneFrac,retainF=args.retainF,corrMatF=args.corrMatF,corrMatFS=args.corrMatFS,usePearson=args.usePearson,corrChunk=args.corrChunk, - splitBy=args.fileSep,keepBigFA=args.keepBigFA,singleCorr=args.singleCorr,singleCorrL=args.singleCorrL) - - # Reduce edges - edgesFP,nodesFP = reduceEdges(OUT_FOLDER,args.dataF,args.gephiF,corrh5,genesM,retainF=args.retainF,edgePG=args.edgePG,corrMatF=args.corrMatF + "_GMF" + str(requireFileNum),keepBigFA=args.keepBigFA,keepBigF=args.keepBigF,ignoreDuplicates=args.ignoreDuplicates) - - # Prepare and run Fast Unfolding Of Communities in Large Networks (https://sourceforge.net/projects/louvain/files/louvain-generic.tar.gz/download) - outFP1,outFP2 = prepareFastUnfold(OUT_FOLDER,args.dataF,args.fastUF,edgesFP,retainF=args.retainF,edgePG=args.edgePG) - - # Run Fast Unfolding - if not args.noFastUF: - runFastUF(outFP1,outFP2,args.tOutFold,args.fTOutFold,args.cOutFold,args.cOutTxtFold,args.cOutListsFold,runNum=args.fuRunNum,retainNum=args.fuRetainNum) - else: - print("\nSkipping Fast-Unfolding clustering and downstream output as --noFastUF is set") - print(finishT) - -if __name__ == '__main__': - main() +#!/usr/bin/env python +""" +This program is free software: you can redistribute it and/or modify +it under the terms of the GNU General Public License as published by +the Free Software Foundation, either version 3 of the License, or +(at your option) any later version. + +This program is distributed in the hope that it will be useful, +but WITHOUT ANY WARRANTY; without even the implied warranty of +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +GNU General Public License for more details. + +You should have received a copy of the GNU General Public License +along with this program. If not, see . + +Author: Matthew Care + +A memory efficient implementation of PGCNA, that requires very little RAM +to process very large expression data-sets. + +This version is for processing multiple data-sets, via calculating a +median correlation matrix. 
+""" +from __future__ import print_function + +import sys +import os +import argparse +import shutil +import string +from datetime import datetime +from collections import defaultdict +import re +import gzip +import glob +from subprocess import Popen, PIPE, STDOUT +import numpy as np +import numpy.ma as ma +import scipy.stats as stats +import h5py + + +##----------------------------------------Create argparse argments----------------------------------------## +parser = argparse.ArgumentParser() +# Required +parser.add_argument("-w","--workFolder",help="Work Folder [REQUIRED]",default=None) +parser.add_argument("-d","--dataF",help="Expression data folder path [REQUIRED]",default=None) + +# Optional +parser.add_argument("-m","--metaInf",help="File containing information about data files (must be in same folder as --dataF) [#FileInfo.txt]",default="#FileInfo.txt") +parser.add_argument("-s","--fileSep",help="Separator used in expression file(s) [\t]",default="\t") +parser.add_argument("-f","--retainF",help="Retain gene fraction -- keeping most variant genes [0.8]",type=float,default=0.8) +parser.add_argument("-g","--geneFrac",help="Fraction of files a gene needs to be present in to be included in median corr matrix [1/3]",type=float,default=1/3.0) +parser.add_argument("-e","--edgePG",help="Edges to keep per gene [3] -- Highly recommend leaving as default",type=int,default=3) + +# Not often changed +parser.add_argument("--outF",help="Root output folder [PGCNA]",default="PGCNA") +parser.add_argument("--corrMatF",help="Correlation matrix folder [CORR_MATRIX]",default="CORR_MATRIX") +parser.add_argument("--corrMatFS",help="Folder for single gene correlation files [CORR_MATRIX_SG]",default="CORR_MATRIX_SG") +parser.add_argument("--gephiF",help="Folder to store files for Gephi [GEPHI]",default="GEPHI") +parser.add_argument("--fastUF",help="Fast Unfolding folder [FAST_UNFOLD]",default="FAST_UNFOLD") + +# Flags +parser.add_argument("--noFastUF",help="Don't try to run Fast Unfolding Clustering, but complete everything else [False] -- Flag",action="store_true") +parser.add_argument("--usePearson",help="Use Pearson Correlation instead of Spearman [False] -- Flag",action="store_true") +parser.add_argument("--keepBigFA",help="Keep ALL big HDF5 files after finishing [False] -- Flag",action="store_true") +parser.add_argument("--keepBigF",help="Keep median correlations HDF5 files after finishing [False] -- Flag",action="store_true") +parser.add_argument("--ignoreDuplicates",help="Ignore correlation duplicates when cutting top --edgePG genes [False] -- Flag, faster if set",action="store_true") +parser.add_argument("--singleCorr",help="Output individual gene correlation files -- Warning this generates 1 file per gene in final correlation matrix. 
[False] -- Flag",action="store_true")
+parser.add_argument("--singleCorrL",help="Output individual gene correlation files -- limited to those in --singleCorrListF [False] -- Flag",action="store_true")
+
+# Single corr related
+parser.add_argument("--singleCorrListF",help="If --singleCorrL is set then create single correlation files for all those in this file (must be within --workFolder)",default="corrGenes.txt")
+
+# Correlation related
+parser.add_argument("--corrChunk",dest="corrChunk",help="Size of chunk (rows) to split correlation problem over [5000] -- Higher will speed up correlation calculation at cost of RAM",default=5000,type=float)
+
+# FastUnfolding specific
+parser.add_argument("-n","--fuNumber",dest="fuRunNum",help="Number of times to run [100]",default=100,type=float)
+parser.add_argument("-r","--fuRetain",dest="fuRetainNum",help="Retain top [1] clusterings",default=1,type=float)
+parser.add_argument("--fuRenumberStart",dest="fuRenumberStart",help="FU clustering re-numbering start [1]",default=1,type=int)
+parser.add_argument("--tOutFold",dest="tOutFold",help="Trees Output folder [Trees]",default="Trees")
+parser.add_argument("--fTOutFold",dest="fTOutFold",help="Final retained Trees Output folder [TreesF]",default="TreesF")
+parser.add_argument("--cOutFold",dest="cOutFold",help="Clusters out folder [Clusters]",default="Clusters")
+parser.add_argument("--cOutTxtFold",dest="cOutTxtFold",help="Clusters out folder - mapped back to genes [ClustersTxt]",default="ClustersTxt")
+parser.add_argument("--cOutListsFold",dest="cOutListsFold",help="Clusters out folder - split into lists [ClustersLists]",default="ClustersLists")
+
+
+###########################################################################################
+args = parser.parse_args()
+if not args.workFolder or not args.dataF:
+	print("\n\nNeed to specify REQUIRED variables, see help (-h)")
+	sys.exit()
+
+###########################################################################################
+OUT_FOLDER = os.path.join(args.workFolder,args.outF,"EPG" + str(args.edgePG))
+if not os.path.exists(OUT_FOLDER):
+	os.makedirs(OUT_FOLDER)
+
+###########################################################################################
+args.metaInf = os.path.join(args.dataF,args.metaInf)
+###########################################################################################
+if not args.noFastUF:
+	def testCommandOut(outR,testFail):
+		if re.search("command not found",outR):  # /bin/sh: louvain: command not found
+			return True
+		elif re.search("not recognized as an internal or external command",outR):
+			return True
+		return testFail  # Preserve a failure detected on an earlier command
+
+	# Make sure that if Fast Unfolding is to be run it exists on the path
+	testCommand = "louvain -h"
+	p = Popen(testCommand,shell=True,stdout=PIPE,stderr=STDOUT)
+	stdout,stderr = p.communicate()
+
+	testCommand = "convert"
+	p = Popen(testCommand,shell=True,stdout=PIPE,stderr=STDOUT)
+	stdout2,stderr = p.communicate()
+
+	testCommand = "hierarchy"
+	p = Popen(testCommand,shell=True,stdout=PIPE,stderr=STDOUT)
+	stdout3,stderr = p.communicate()
+
+	testFail = False
+	testFail = testCommandOut(stdout,testFail)
+	testFail = testCommandOut(stdout2,testFail)
+	testFail = testCommandOut(stdout3,testFail)
+
+	if testFail:
+		print("\nCouldn't run the Fast Unfold commands (louvain, convert and hierarchy); these must be on your path/environment (to skip the clustering step use the --noFastUF flag), see readme for more information")
+		sys.exit()
+###########################################################################################
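
> Editor's aside on the PATH test above: a stricter existence check is possible with the Python 2.7 standard library. A minimal sketch only, not part of the patch (`distutils.spawn.find_executable` is stdlib in 2.7):

```
from __future__ import print_function
from distutils.spawn import find_executable

# Locate the three Fast Unfolding binaries directly, rather than grepping
# shell output for "command not found".
for tool in ("louvain", "convert", "hierarchy"):
	if find_executable(tool) is None:
		print(tool, "not found on PATH -- install louvain-generic or use --noFastUF")
```
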
+# Tidy input
+# Decode fileSep for passed "\t"
+args.fileSep = args.fileSep.decode('string-escape')
+args.corrChunk = int(round(args.corrChunk,0))
+args.fuRunNum = int(round(args.fuRunNum,0))
+args.fuRetainNum = int(round(args.fuRetainNum,0))
+##----------------------------------------LOGGER----------------------------------------##
+
+
+class multicaster(object):
+	def __init__(self, filelist):
+		self.filelist = filelist
+
+	def write(self, str):
+		for f in self.filelist:
+			f.write(str)
+
+
+def concatenateAsString(joinWith,*args):
+	temp = [str(x) for x in args]
+	return joinWith.join(temp)
+
+print("##----------------------------------------",str(datetime.now()),"----------------------------------------##",sep="")
+# print out settings
+settingInf = concatenateAsString(
+	"\n",
+	"##----------------------------------------Arguments----------------------------------------##",
+	"# Required",
+	"WorkFolder [-w,--workFolder] = " + args.workFolder,
+	"Expression data folder path [-d,--dataF] = " + args.dataF,
+	"\n# Optional",
+	"Meta info file describing expression files [-m, --metaInf] = " + args.metaInf,
+	"Separator used in expression file [-s, --fileSep] = " + args.fileSep,
+	"Retain gene fraction [-f, --retainF] = " + str(args.retainF),
+	"Fraction of expression files gene required in to be retained [-g, --geneFrac] = " + str(args.geneFrac),
+	"Edges to retain per gene [-e, --edgePG] = " + str(args.edgePG),
+	"\n# Not changed often",
+	"Root output folder [--outF] = " + args.outF,
+	"Correlation matrix folder [--corrMatF] = " + args.corrMatF,
+	"Single Gene Correlation files folder [--corrMatFS] = " + args.corrMatFS,
+	"Gephi files folder [--gephiF] = " + args.gephiF,
+	"Fast unfolding folder [--fastUF] = " + args.fastUF,
+	"\n# Flags",
+	"Don't run Fast Unfolding clustering [--noFastUF] = " + str(args.noFastUF),
+	"Use Pearson Correlation [--usePearson] = " + str(args.usePearson),
+	"Keep big HDF5 files after run [--keepBigFA] = " + str(args.keepBigFA),
+	"Keep median correlations HDF5 files after run [--keepBigF] = " + str(args.keepBigF),
+	"Ignore correlation duplicates when cutting top --edgePG genes [--ignoreDuplicates] = " + str(args.ignoreDuplicates),
+	"Output individual gene correlation files [--singleCorr] = " + str(args.singleCorr),
+	"Output individual gene correlation files for select list (--singleCorrListF) [--singleCorrL] = " + str(args.singleCorrL),
+	"\n# Single gene correlation options",
+	"List of genes to process if --singleCorrL [--singleCorrListF]:\t" + str(args.singleCorrListF),
+	"\n# Correlation Options",
+	"Chunk size (rows) to split correlation over [--corrChunk]:\t" + str(args.corrChunk),
+	"\n# Fast-Unfolding Specific",
+	"Cluster output folder -- Clusters split into lists [--cOutListsFold]:\t" + str(args.cOutListsFold),
+	"Run number [-n, --fuNumber]:\t" + str(args.fuRunNum),
+	"Retain top number of clusterings [-r, --fuRetain]:\t" + str(args.fuRetainNum),
+	"Fast-Unfolding clustering renumber start [--fuRenumberStart]:\t" + str(args.fuRenumberStart),
+	"Tree output folder [--tOutFold]:\t" + str(args.tOutFold),
+	"Final retained tree output folder [--fTOutFold]:\t" + str(args.fTOutFold),
+	"Cluster output folder [--cOutFold]:\t" + str(args.cOutFold),
+	"Cluster output folder -- mapped back to genes [--cOutTxtFold]:\t" + str(args.cOutTxtFold),
+)
+
+settingsF = open(os.path.join(OUT_FOLDER,str(datetime.now()).replace(" ","-").replace(":",".")[:-7] + "_CorrelationSettingsInfo.txt"),"w")
+print(settingInf,file=settingsF)
+settingsF.close()
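
> Editor's aside on the `decode('string-escape')` call above: that codec is Python 2-only (the scripts target Python 2.7, per the README). A hedged sketch of a version-agnostic equivalent; `unescape_separator` is an illustrative name, not part of the patch:

```
from __future__ import print_function

def unescape_separator(sep):
	# Turn a command-line literal like "\t" (two characters) into a real tab.
	try:
		return sep.decode("string-escape")                    # Python 2
	except AttributeError:
		return bytes(sep, "utf-8").decode("unicode_escape")   # Python 3

print(unescape_separator("\\t") == "\t")  # True
```
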
+########################################################################################### + + +##----------------------------------------Methods----------------------------------------## + + +def makeSafeFilename(inputFilename): + # Declare safe characters here + safechars = string.letters + string.digits + "~-_.@#" + try: + return filter(lambda c: c in safechars, inputFilename) + except: + return "" + pass + + +def make_key_naturalSort(): + """ + A factory function: creates a key function to use in sort. + Sort data naturally + """ + def nSort(s): + convert = lambda text: int(text) if text.isdigit() else text + alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key)] + + return alphanum_key(s) + return nSort + + +def listToPercentiles(x,inverse=False): + data = stats.rankdata(x, "average")/float(len(x)) + + if inverse: + return (1-data) * 100 + else: + return data * 100 + + +def dataSpread(x,fixNeg=True): + """ + Returns the min, Q1 (25%), median (Q2), Q3 (75%), max, IQR, Quartile Coefficient of Dispersion and IQR/Median (CV like) + + As QCOD can't cope with negative numbers if fixNeg==True will solve the problem by simply adding the abs(min) value to + all values. So new minimum will be zero. + """ + + if fixNeg: + if min(x) < 0: + x = np.array(x) + x = x + abs(min(x)) + + q1 = float(np.percentile(x,25,interpolation="lower")) + q2 = np.percentile(x,50) + q3 = float(np.percentile(x,75,interpolation="higher")) + + if (q2 == 0) or ((q3+q1) == 0): + return min(x), q1, q2, q3, max(x), (q3-q1), 0,0 + else: + return min(x), q1, q2, q3, max(x), (q3-q1), (q3-q1)/(q3+q1), (q3-q1)/q2 + + +def mad(arr): + """ Median Absolute Deviation: a "Robust" version of standard deviation. + Indices variabililty of the sample. + """ + arr = np.array(arr) + med = np.median(arr) + return np.median(np.abs(arr - med)) + + +def loadCorrGeneList(workF,listFN,defaultG="IRF4"): + + print("\nLoading list of genes for single correlation file output") + fileP = os.path.join(workF,listFN) + + if not os.path.exists(fileP): + print("\t\t",fileP,"does not exist, will create file and add default gene (",defaultG,")") + outF = open(fileP,"w") + print(defaultG,file=outF) + outF.close() + + corrG = {defaultG:1} + + else: + corrG = {} + for line in open(fileP): + line = line.rstrip() + + corrG[line] = 1 + + print("\tTotal genes:",len(corrG)) + + return corrG + + +def generateGeneMeta(outF,fileN,geneMetaInfo,genesNP,npA): + """ + Generate meta information for gene + """ + ########################################################################################### + def expToPercentiles(expToPerc,expA): + return [expToPerc[e] for e in expA] + + ########################################################################################### + print("\t\t# Generate gene meta information : ",end="") + + expressionVals = npA.flatten() + + # Convert from expression values to percentiles + expPercentiles = listToPercentiles(expressionVals,inverse=False) + + # Calculate mapping from expression to percentile + expToPerc = {} + for i,e in enumerate(expressionVals): + expToPerc[e] = expPercentiles[i] + + outF = open(os.path.join(outF,fileN + "_metaInf.txt"),"w") + for i,gene in enumerate(genesNP): + # Convert from expression values to percentiles, so can work out MAD of percentiles + genePercentiles = expToPercentiles(expToPerc,npA[i]) + + # min, Q1 (25%), median (Q2), Q3 (75%), max, IQR, Quartile Coefficient of Dispersion and IQR/Median (CV like) + try: + minE,q1E,q2E,q3E,maxE,iqrE,qcodE,iqrME = dataSpread(npA[i]) + except: + 
print("Issue with :",gene,npA[i]) + sys.exit() + + medianPercentiles = np.median(genePercentiles) + print(gene,q2E,qcodE,medianPercentiles,sep="\t",file=outF) + + geneMetaInfo[gene].append([q2E,qcodE,medianPercentiles]) + + outF.close() + print("done") + + +def mergeMetaInfo(geneMetaInfo): + print("\n\tMerging gene meta information") + medianMetaInfo = {} + + for gene in geneMetaInfo: + medians, qcods, percentiles = zip(*geneMetaInfo[gene]) + # Store Median_QCOD(varWithin), MedianPercentile, MADpercentile(VarAcross) + medianMetaInfo[gene] = [np.median(qcods),np.median(percentiles),mad(percentiles)] + + return medianMetaInfo + + +def loadMeta(metaFile,dataF,splitBy="\t",headerL=1): + + print("\n\nLoad meta-file (",os.path.basename(metaFile),")",sep="") + if not os.path.exists(metaFile): + print("\t\t# Meta file (",metaFile,") does not exist!",sep="") + sys.exit() + + header = headerL + fileInfo = {} + for line in open(metaFile): + cols = line.rstrip().split(splitBy) + + if header: + header -= 1 + continue + + if line.rstrip() == "": + continue + + totalPath = os.path.join(dataF,cols[0]) + if not os.path.exists(totalPath): + print("\t\t# File (",totalPath,") does not exist!, won't add to fileInfo",sep="") + else: + try: + fileInfo[totalPath] = int(cols[1]) + except: + print("Meta file line (",line.rstrip(),") is not formed properly, skipping") + + print("\tLoaded information on:",len(fileInfo),"files") + return fileInfo + + +def getGenes(metaInf,splitBy="\t",geneFrac=1/3.0,retainF=0.8): + """ + First pass over expression data-files to find the set of genes we will be + working with after filtering each data-set by retainF and then only keeping + genes that are present in geneFrac + """ + natSort = make_key_naturalSort() + + geneCount = defaultdict(int) + + print("\nLoad information on genes present per expression array:") + for fileP in sorted(metaInf,key=natSort): + fileN = os.path.basename(fileP) + + print("\t",fileN,end="") + + headerL = metaInf[fileP] + + ############################# + # Read in data file + seenGenes = {} + genes = [] + expression = [] + + print("\tReading expression data file:",fileN) + + for i, line in enumerate(open(fileP)): + cols = line.rstrip().split(splitBy) + + if headerL: + headerL -= 1 + continue + + genes.append(cols[0]) + if not cols[0] in seenGenes: + seenGenes[cols[0]] = 1 + else: + print("\n\nERROR: Duplicate gene (",cols[0],") in expression data (",fileN,"), expects unique identifiers!\nPlease remove duplicates and re-run\n",sep="") + sys.exit() + expression.append(cols[1:]) + + print("\t\tTotal genes = ",len(genes)) + ############################# + + ############################# + print("\t\t# Generate Numpy matrices") + # Move genes to numpy array, so can be sorted along with the rest + genesNP = np.empty((len(genes)),dtype=str) + genesNP = np.array(genes) + + # Create numpy array + nrow = len(expression[0]) + ncol = len(expression) + npA = np.empty((ncol,nrow),dtype=object) + for i,eSet in enumerate(expression): + npA[i] = eSet + + # Calculate SD for each gene + npSD = np.std(npA.astype(np.float64),axis=1) + + # Sort by SD + sortI = np.argsort(npSD)[::-1] # Sort ascending + # Sort matrix by index + npA = npA[sortI,:] + + # Sort genes by index + genesNP = genesNP[sortI] + + # Only retain top fract # + # Work out index to cut at + cutPos = int(round(retainF * genesNP.shape[0],0)) + print("\t\tRetain Fraction (",retainF,") Cut at: ",cutPos,sep="") + print("\t\t\tPre-cut shape:",npA.shape) + + # Retain most variant + npA = 
npA[:cutPos,].astype(np.float64)
+		genesNP = genesNP[:cutPos]
+		print("\t\t\tPost-cut shape:",npA.shape)
+
+		# Keep track of seen genes
+		for gene in genesNP:
+			geneCount[gene] += 1
+
+	print("\n\tTotal unique genes/identifiers:",len(geneCount))
+
+	requiredFileNum = int(round(geneFrac * len(metaInf),0))
+	print("\tRetaining genes present in ",round(geneFrac,2)," (>=",requiredFileNum,") of files : ",end="",sep="")
+
+	genesToKeep = {}
+	for g in geneCount:
+		if geneCount[g] >= requiredFileNum:
+			genesToKeep[g] = 1
+
+	print(len(genesToKeep))
+
+	return genesToKeep,requiredFileNum
+
+
+def loadHDF5files(outF,hdf5Paths):
+
+	hdf5Files = {}
+	for fileN in hdf5Paths:
+		hdf5Files[fileN] = h5py.File(fileN,'r')
+
+	return hdf5Files
+
+
+def generateMasks(sortedKeepGenes,genesPerHDF5,hdf5paths):
+
+	print("\tGenerating index mappings and masks to speed up combining correlations")
+
+	masks = np.zeros((len(hdf5paths),len(sortedKeepGenes)))
+
+	for i,g in enumerate(sortedKeepGenes):
+		for j,hd5 in enumerate(hdf5paths):
+			if g not in genesPerHDF5[hd5]:
+				masks[j][i] = 1  # Mask missing values
+	return masks
+
+
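> Editor's aside: the per-gene median across data sets (computed inside `generateCorrMatrix`, next) relies on these masks so that genes absent from a data set are ignored rather than treated as low correlations. A toy sketch with made-up numbers:

```
from __future__ import print_function
import numpy as np
import numpy.ma as ma

# Two data sets x three genes; -10 marks a gene missing from data set 1,
# and the mask (1 = missing) keeps it out of the median.
rows = np.array([[0.8, 0.5, -10.0],
                 [0.6, 0.7, 0.9]])
mask = np.array([[0, 0, 1],
                 [0, 0, 0]])
print(ma.median(ma.masked_array(rows, mask=mask), axis=0))  # [0.7 0.6 0.9]
```
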
+def generateCorrMatrix(workF,metaInf,genesToKeep,requiredFileNum,singleCorrList,geneFrac=1/3.0,retainF=0.8,corrMatF="CORR_MATRIX",corrMatFS="CORR_MATRIX_SG",usePearson=False,corrChunk=5000,splitBy="\t",decimalP=3,printEveryDiv=10,keepBigFA=False,singleCorr=False,
+	singleCorrL=False):
+	"""
+	Memory efficient method for generating all pairwise correlations of genes (rows) across a set of samples (columns). Uses HDF5 files to greatly
+	reduce memory usage, keeping most data residing on the hard disk.
+	"""
+	print("\n\nGenerate correlations for expression data:")
+	natSort = make_key_naturalSort()
+	sortedKeepGenes = sorted(genesToKeep,key=natSort)
+
+	# Find mapping of keep genes
+	keepGeneMapping = {}
+	for i, g in enumerate(sortedKeepGenes):
+		keepGeneMapping[g] = i
+
+	# Create subfolder
+	outF = os.path.join(workF,corrMatF + "_GMF" + str(requiredFileNum))
+	if not os.path.exists(outF):
+		os.makedirs(outF)
+
+	if singleCorr or singleCorrL:
+		outFS = os.path.join(workF,corrMatFS + "_GMF" + str(requiredFileNum))
+		if not os.path.exists(outFS):
+			os.makedirs(outFS)
+
+	geneMetaInfo = defaultdict(list)  # Per data-set store gene meta information
+	genePosPerHDF5 = defaultdict(dict)  # Mapping of location of gene in HDF5 file
+	perHDF5index = defaultdict(list)  # Mapping of gene to where they should be in final matrix
+	genesPerHDF5 = defaultdict(dict)  # Keep tally of which genes are in which HDF5 file
+	hdf5paths = []
+
+	print("\n\tCalculating correlations for data file:")
+	for fileP in sorted(metaInf,key=natSort):
+		fileN = os.path.basename(fileP)
+
+		print("\t",fileN,end="")
+
+		headerL = metaInf[fileP]
+		#############################
+		# Read in data file
+		seenGenes = {}
+		genes = []
+		expression = []
+
+		print("\tReading expression data file:",fileN)
+
+		for i, line in enumerate(open(fileP)):
+			cols = line.rstrip().split(splitBy)
+
+			if headerL:
+				headerL -= 1
+				continue
+
+			# Only retain required genes
+			if cols[0] not in genesToKeep:
+				continue
+
+			genes.append(cols[0])
+			if not cols[0] in seenGenes:
+				seenGenes[cols[0]] = 1
+			else:
+				print("\n\nERROR: Duplicate gene (",cols[0],") in expression data (",fileN,"), expects unique identifiers!\nPlease remove duplicates and re-run\n",sep="")
+				sys.exit()
+			expression.append(cols[1:])
+
+		print("\t\tTotal genes = ",len(genes))
+		#############################
+
+		#############################
+		print("\t\t# Generate Numpy matrices")
+		# Move genes to numpy array
+		genesNP = np.empty((len(genes)),dtype=str)
+		genesNP = np.array(genes)
+
+		# Store position of gene in matrix for this file and create mapping index for HDF5 files
+		outMatrixHDF5 = os.path.join(outF,fileN + "_RetainF" + str(retainF) + ".h5")
+		hdf5paths.append(outMatrixHDF5)
+
+		for i,g in enumerate(genesNP):
+			perHDF5index[outMatrixHDF5].append(keepGeneMapping[g])
+			genePosPerHDF5[g][outMatrixHDF5] = i
+			genesPerHDF5[outMatrixHDF5][g] = 1
+
+		# Create numpy array
+		nrow = len(expression[0])
+		ncol = len(expression)
+		npA = np.zeros((ncol,nrow),dtype=np.float64)
+		for i,eSet in enumerate(expression):
+			npA[i] = eSet
+
+		print("\t\t\tMatrix shape:",npA.shape)
+
+		# Output genes
+		outMatrixN = os.path.join(outF,fileN + "_RetainF" + str(retainF) + "_Genes.txt")
+		np.savetxt(outMatrixN,genesNP,fmt="%s",delimiter="\t")
+
+		# Generate gene meta information
+		generateGeneMeta(outF,fileN + "_RetainF" + str(retainF),geneMetaInfo,genesNP,npA)
+
+		#######################################
+		# Calculate correlations
+		print("\t\t# Calculating correlations using HDF5 file to save memory")
+		rowN, colN = npA.shape
+
+		if os.path.exists(outMatrixHDF5):
+			print("\t\tAlready exists -- skipping")
+			continue
+		with h5py.File(outMatrixHDF5, "w") as f:
+			h5 = f.create_dataset("corr", (rowN,rowN), dtype="f8")
+
+		# Open the HDF5 file for in-place writing (data stays on disk)
+		print("\t\t\tCreating HDF5 file")
+		h5 = h5py.File(outMatrixHDF5,'r+')
+
+		if not usePearson:
+			# Note this isn't a perfect Spearman, as ties are not dealt with, but it is fast, and given we're only
+			# retaining the most correlated edges it will have almost no effect on the output.
+			npA = npA.argsort(axis=1).argsort(axis=1).astype(np.float64)
+
+		# subtract means from the input data
+		npA -= np.mean(npA, axis=1)[:,None]
+
+		# normalize the data
+		npA /= np.sqrt(np.sum(npA*npA, axis=1))[:,None]
+
+		# Calculate correlations per chunk
+		print("\t\t\tCalculating correlations for Chunk:")
+		for r in range(0, rowN, corrChunk):
+			print("\t\t\t",r)
+			for c in range(0, rowN, corrChunk):
+				r1 = r + corrChunk
+				c1 = c + corrChunk
+				chunk1 = npA[r:r1]
+				chunk2 = npA[c:c1]
+				h5["corr"][r:r1, c:c1] = np.dot(chunk1, chunk2.T)
+
+		# # Write output out for debugging
+		# finalCorr = np.copy(h5["corr"][:])
+		# outMatrixN = os.path.join(outF,fileN + "_RetainF" + str(retainF) + "_Corr")
+		# stringFormat = "%." + str(decimalP) + "f"
+		# # stringFormat2 = "%." 
+ str(decimalPpVal) + "f" + # np.savetxt(outMatrixN + ".txt",finalCorr,fmt=stringFormat,delimiter="\t") + + # Remove matrices to save memory + del npA + del h5 + + # Calculate median gene meta information + medianMetaInfo = mergeMetaInfo(geneMetaInfo) + + + ########################################################################################### + ##----------------------------Calculate Median Correlations------------------------------## + ########################################################################################### + # Calculate median correlation matrix + print("\nCalculating median correlations") + if len(hdf5paths) == 1: + # If we're only analysing a single data-set + if not (singleCorr or singleCorrL): + print("\t\tOnly single data-set analysed, skipping generating median correlations") + # No need to calculate median correlations, just return path to HDF5 file and genes matrix + return outMatrixHDF5, genesNP + else: + print("\t\tOnly single data-set analysed, but --singeCorr/--singleCorrL so will proceed with output") + + printEvery = int(round(len(genesToKeep)/float(printEveryDiv),0)) + printE = printEvery + tell = printE + count = 0 + + # Create HDF5 median correlation matrix + outMatrixHDF5 = os.path.join(outF,"#Median_RetainF" + str(retainF) + ".h5") + with h5py.File(outMatrixHDF5, "w") as f: + h5 = f.create_dataset("corr", (len(genesToKeep),len(genesToKeep)), dtype="f8") + h5 = h5py.File(outMatrixHDF5,'r+') + + # Load HDF5 correlation files + hdf5Files = loadHDF5files(outF,hdf5paths) + + # Output genes + outMatrixN = os.path.join(outF,"#Median_RetainF" + str(retainF) + "_Genes.txt") + np.savetxt(outMatrixN,sortedKeepGenes,fmt="%s",delimiter="\t") + + # Get masks + maskMappings = generateMasks(sortedKeepGenes,genesPerHDF5,hdf5paths) + + print("\tCalculating median correlations (report every 1/",printEveryDiv,"th of total):",sep="") + if singleCorr or singleCorrL: + print("\t\tOutputting single gene correlation files:") + + for genePos, gene in enumerate(sortedKeepGenes): + # print("\t\t",gene1) + # Inform user of position + count += 1 + if count == printE: + printE = printE + tell + if singleCorr: + print("\n\t\tProcessed:",count,end="\n\n") + else: + print("\t\t",count) + + rowsPerHDF5 = {} + maskPos = [] + dataSetNames = [] + # Grab row for gene across files + for i,hdf5 in enumerate(hdf5paths): + try: + rowsPerHDF5[hdf5] = hdf5Files[hdf5]["corr"][genePosPerHDF5[gene][hdf5]] + maskPos.append(i) + dataSetNames.append(os.path.basename(hdf5)[:-3]) + except: + pass + + # Second pass + # Convert to numpy array + npA = np.full((len(rowsPerHDF5),len(sortedKeepGenes)),-10,dtype=np.float64) # Missing set to -10 + for i,hdf5 in enumerate(sorted(rowsPerHDF5,key=natSort)): + npA[i][perHDF5index[hdf5]] = rowsPerHDF5[hdf5] # Use indexes to place in correct location + + # Get appropriate masks + tempMask = [] + for i in maskPos: + tempMask.append(maskMappings[i]) + + npAMasked = ma.masked_array(npA,mask=tempMask) + + # Generate medians + medianRowCorr = np.copy(ma.median(npAMasked,axis=0)) + h5["corr"][genePos] = medianRowCorr + + ########################################################################################### + ##-------------------------------SINGLE GENE CORR----------------------------------------## + ########################################################################################### + def mergeCorrData(gene,corrInf,medianMetaInfo,medianCorr,missingVal=-10): + finalCorrs = [] + dataSetCount = 0 + + for corr in corrInf: + if (corr <= missingVal).all(): + + 
finalCorrs.append("") + else: + finalCorrs.append(str(round(corr,decimalP))) + dataSetCount += 1 + + roundedMeta = map(str,[round(origV,decimalP) for origV in medianMetaInfo[gene]]) + scaledCorr = dataSetCount ** medianCorr + finalInfo = gene + "\t" + str(round(scaledCorr,decimalP)) + "\t" + str(dataSetCount) + "\t" + str(round(medianCorr,decimalP)) + "\t" + "\t".join(roundedMeta) + "\t" + "\t".join(finalCorrs) + + return scaledCorr,finalInfo + ################################### + if singleCorr or singleCorrL: + + if singleCorrL: # Only output genes in list + if gene not in singleCorrList: + continue + + subFolder = os.path.join(outFS,gene[0]) + singleCorrFName = os.path.join(subFolder,makeSafeFilename(gene) + "_corr_RetainF" + str(retainF) + ".txt.gz") + if os.path.exists(singleCorrFName): + continue + + print("\t\t\tSingle Corr:",makeSafeFilename(gene)) + + # Make subfolder + if not os.path.exists(subFolder): + os.makedirs(subFolder) + + singleCorrF = gzip.open(singleCorrFName,"wb",compresslevel=9) + dataSetsH = "\t".join(dataSetNames) + + if usePearson: + print("Gene\tNumDataSets^MedianPCC\tNumDataSets\tMedianPCC\tMedian_QCODexpression(VarWithin)\tMedianPercentile\tMADpercentile(VarAcross)",dataSetsH,sep="\t",file=singleCorrF) + else: + print("Gene\tNumDataSets^MedianRho\tNumDataSets\tMedianRho\tMedian_QCODexpression(VarWithin)\tMedianPercentile\tMADpercentile(VarAcross)",dataSetsH,sep="\t",file=singleCorrF) + + rankedByCorr = defaultdict(list) + for i,g in enumerate(sortedKeepGenes): + scaledCorr, info = mergeCorrData(g,npA.T[i],medianMetaInfo,medianRowCorr[keepGeneMapping[g]]) + rankedByCorr[scaledCorr].append(info) + + # Rank by scaledCorr + for sCorr in sorted(rankedByCorr,reverse=True): + for info in rankedByCorr[sCorr]: + print(info,file=singleCorrF) + singleCorrF.close() + + ########################################################################################### + + # # Write output out for debugging + # finalCorr = np.copy(h5["corr"][:]) + # outMatrixN = os.path.join(outF,"#Median_RetainF" + str(retainF) + "_Corr") + # stringFormat = "%." 
+ str(decimalP) + "f" + # np.savetxt(outMatrixN + ".txt",finalCorr,fmt=stringFormat,delimiter="\t") + ########################################################################################### + ########################################################################################### + + # Remove all single HDF5 files unless requested to keep them + if not keepBigFA: + del hdf5Files + for hdf5P in hdf5paths: + os.remove(hdf5P) + + # # Return path to HDF5 file and genes matrix + return outMatrixHDF5, sortedKeepGenes + + +def reduceEdges(workF,dataF,gephiF,corrh5,genesM,retainF=0.8,edgePG=3,printEveryDiv=10,corrMatF=None,keepBigFA=False,keepBigF=False,ignoreDuplicates=False): + """ + Reduce edges in correlation matrix, only retaining edgePG maximum correlated genes per row + """ + + def bottomToZero(npA,n=1): + """ + Set everything below n to zero + """ + topI = np.argpartition(npA,-n) + npA[topI[:-n]] = 0 + return npA + + def bottomToZeroWithDuplicates(npA,n=1): + """ + Set everything below n to zero, + but deal with duplicates + """ + unique = np.unique(npA) + topIunique = np.argpartition(unique,-n)[-n:] + + toKeep = [] + for val in unique[topIunique]: + toKeep.extend(np.where(npA == val)[0]) + + # Mask and reverse + mask = np.ones(len(npA),np.bool) + mask[toKeep] = 0 + npA[mask] = 0 + + return npA + + fileN = os.path.basename(dataF) + + print("\tLoad HDF5 file") + # Load old HDF5 + h5 = h5py.File(corrh5,'r+') + rowN, colN = h5["corr"].shape + + printEvery = int(round(rowN/float(printEveryDiv),0)) + print("\nReduces edges to (",edgePG,") per gene:",sep="") + printE = printEvery + tell = printE + count = 0 + + print("\tWorking (report every 1/",printEveryDiv,"th of total):",sep="") + for i, row in enumerate(h5["corr"]): + # Inform user of position + count += 1 + if count == printE: + printE = printE + tell + print("\t\t",count) + + if ignoreDuplicates: + h5["corr"][i] = bottomToZero(row,edgePG + 1) + else: + h5["corr"][i] = bottomToZeroWithDuplicates(row,edgePG + 1) + + # Write output out for debugging + # finalCorr = np.copy(h5EPG["corr"][:]) + # outMatrixN = os.path.join(outF,fileN + "_RetainF" + str(retainF) + "_NonSym") + # stringFormat = "%." + str(3) + "f" + # # stringFormat2 = "%." + str(decimalPpVal) + "f" + # np.savetxt(outMatrixN + ".txt",finalCorr,fmt=stringFormat,delimiter="\t") + + ########################################################################################### + + print("\nGenerating files for Gephi network visualization tool") + # Create subfolder + outF = os.path.join(workF,gephiF) + if not os.path.exists(outF): + os.makedirs(outF) + + # First output list of genes (nodes) + print("\tCreate node file") + nodesFP = os.path.join(outF,fileN + "_RetainF" + str(retainF) + "_EPG" + str(edgePG) + "_Nodes.tsv") + nodesFile = open(nodesFP,"w") + print("Id\tLabel\tCluster",file=nodesFile) + for gene in genesM: + print(gene,gene,"NA",file=nodesFile,sep="\t") + nodesFile.close() + + # Second, output list of edges + print("\tCreate edges file (report every 1/",printEveryDiv,"th of total):",sep="") + edgesFP = os.path.join(outF,fileN + "_RetainF" + str(retainF) + "_EPG" + str(edgePG) + "_Edges.tsv") + edgesFile = open(edgesFP,"w") + print("Source\tTarget\tWeight\tType\tfromAltName\ttoAltName",file=edgesFile) + + # Finally output edges + printE = printEvery + tell = printE + count = 0 + seen = defaultdict() + printedLines = 0 + + for i,row in enumerate(h5["corr"]): + for j,item in enumerate(row): + if not (i == j): # Don't output self edges... 
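+				# Editor's note (sketch): at this point each row holds at most
+				# edgePG + 1 non-zero values -- bottomToZero keeps the top
+				# edgePG + 1 because the gene's self-correlation (1.0) always
+				# survives the cut, e.g. np.argpartition([1.0, 0.91, 0.15, 0.88,
+				# 0.40, 0.93], -4) retains {1.0, 0.93, 0.91, 0.88}; the i == j
+				# test above is what finally drops the self edge.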
+ geneO = genesM[i] + geneT = genesM[j] + if not (geneO + "-@@@-" + geneT) in seen: + # Output info + if not item == 0: + print(geneO,geneT,item,"undirected",geneO,geneT,sep="\t",file=edgesFile) + printedLines += 1 + + # Store the fact that we've see this and it's equivalent pair + seen[geneO + "-@@@-" + geneT] = 1 + seen[geneT + "-@@@-" + geneO] = 1 + + count += 1 + if count == printEvery: + printEvery = printEvery + tell + print("\t\t",count,"Genes",", Edges:",printedLines) + + edgesFile.close() + print("\n\t\tTotal printed (",printedLines,") edges",sep="") + + if not (keepBigFA or keepBigF): + h5.close() + shutil.rmtree(os.path.join(workF,corrMatF)) + + return edgesFP, nodesFP + + +def prepareFastUnfold(workF,dataF,fastUF,edgesFP,retainF=0.8,edgePG=3,splitBy="\t",header=1): + print("\nPreparing files for clustering using Fast Unfolding method") + + fileN = os.path.basename(dataF) + # Create subfolder + outF = os.path.join(workF,fastUF,fileN + "_retainF" + str(retainF) + "_EPG" + str(edgePG)) + if not os.path.exists(outF): + os.makedirs(outF) + + nodePairs = defaultdict(dict) + naturalKey = make_key_naturalSort() + + nameToInt = {} + nameIntCount = 0 + + edgeNum = 0 + for line in open(edgesFP): + cols = line.rstrip().split(splitBy) + + if header: + header -= 1 + continue + + gene1 = cols[0] + gene2 = cols[1] + edgeNum += 1 + + # Convert from name to int + if gene1 in nameToInt: + gene1 = nameToInt[gene1] + else: + nameToInt[gene1] = str(nameIntCount) + gene1 = str(nameIntCount) + nameIntCount += 1 + + if gene2 in nameToInt: + gene2 = nameToInt[gene2] + else: + nameToInt[gene2] = str(nameIntCount) + gene2 = str(nameIntCount) + nameIntCount += 1 + + ################################# + + nodePairs[gene1][gene2] = cols[2] + + print("\tTotal edges =",edgeNum) + + outFilePath = os.path.join(outF,"sym_edges.txt") + outFile = open(outFilePath,"w") + for node in sorted(nodePairs,key=naturalKey): + for node2 in sorted(nodePairs[node],key=naturalKey): + print(node,node2,nodePairs[node][node2],sep="\t",file=outFile) + outFile.close() + + # Write out mapping between genes and ints + outFilePath2 = os.path.join(outF,"GeneIntMap.txt") + outFile2 = open(outFilePath2,"w") + for gene in sorted(nameToInt,key=naturalKey): + print(gene,nameToInt[gene],sep="\t",file=outFile2) + outFile2.close() + + return outFilePath, outFilePath2 + + +def runFastUF(symEdgesF,geneIntMapF,tOutFold,fTOutFold,cOutFold,cOutTxtFold,cOutListsFold,runNum=10,retainNum=1,printEveryDiv=10): + MOD_SCORES = "modScores.txt" + + print("\nRunning Fast Unfolding") + baseFold,symEdgesFile = os.path.split(symEdgesF) + + # Make folders + T_OUT_FOLDER = os.path.join(baseFold,tOutFold) + TF_OUT_FOLDER = os.path.join(baseFold,fTOutFold) + C_OUT_FOLDER = os.path.join(baseFold,cOutFold) + C_OUT_TXT_FOLDER = os.path.join(baseFold,cOutTxtFold) + C_OUT_LISTS_FOLDER = os.path.join(baseFold,cOutListsFold) + + # Remove old + if os.path.exists(TF_OUT_FOLDER): + shutil.rmtree(TF_OUT_FOLDER) + if os.path.exists(T_OUT_FOLDER): + shutil.rmtree(T_OUT_FOLDER) + if os.path.exists(C_OUT_FOLDER): + shutil.rmtree(C_OUT_FOLDER) + if os.path.exists(C_OUT_TXT_FOLDER): + shutil.rmtree(C_OUT_TXT_FOLDER) + if os.path.exists(C_OUT_LISTS_FOLDER): + shutil.rmtree(C_OUT_LISTS_FOLDER) + + # Replace + if not os.path.exists(T_OUT_FOLDER): + os.makedirs(T_OUT_FOLDER) + if not os.path.exists(TF_OUT_FOLDER): + os.makedirs(TF_OUT_FOLDER) + if not os.path.exists(C_OUT_FOLDER): + os.makedirs(C_OUT_FOLDER) + if not os.path.exists(C_OUT_TXT_FOLDER): + os.makedirs(C_OUT_TXT_FOLDER) + if 
not os.path.exists(C_OUT_LISTS_FOLDER): + os.makedirs(C_OUT_LISTS_FOLDER) + + ##################################### + + # Fast Unfold commands + def convertBinary(iF): + command = "convert -i " + iF + " -o " + iF[:-3]+"bin " + " -w " + iF[:-3] + "weights" + p = Popen(command,shell=True,stdout=PIPE,stderr=STDOUT) + stdout,stderr = p.communicate() + + def cluster(wF,iF,outFolder,runNum,PRINT_EVERY): + modScores = [] + + printEvery = PRINT_EVERY + tell = printEvery + count = 0 + + for i in range(runNum): + command = "louvain " + iF[:-3]+"bin -w " + iF[:-3] + "weights -l -1 > " + os.path.join(outFolder,str(i)) + p = Popen(command,shell=True,stdout=PIPE,stderr=STDOUT) + stdout,stderr = p.communicate() + modScores.append(stdout.rstrip()) + + # Notify position + count += 1 + if count == printEvery: + printEvery = printEvery + tell + print("\t",count) + + outFile = open(os.path.join(wF,MOD_SCORES),"w") + for i,score in enumerate(modScores): + print(i,score,sep="\t",file=outFile) + outFile.close() + + def selectClusterings(wF,modS,retainNumber,treeFold,finalTreeFold,splitBy="\t"): + clusters = [] + scores = [] + + for line in open(os.path.join(wF,modS)): + cols = line.rstrip().split(splitBy) + + clusters.append(cols[0]) + scores.append(float(cols[1])) + + j = zip(scores,clusters) + jSorted = sorted(j,reverse=True) + + scoresS,clustersS = zip(*jSorted) + retained = clustersS[:retainNumber] + + for clst in retained: + shutil.copy(os.path.join(treeFold,clst),os.path.join(finalTreeFold,clst)) + + def outputClusters(finalTreeFold,clustFold): + for path in glob.glob(os.path.join(finalTreeFold,"*")): + fileName = os.path.basename(path) + command = "hierarchy " + path + p = Popen(command,shell=True,stdout=PIPE,stderr=STDOUT) + stdout,stderr = p.communicate() + + bottomLevel = int(stdout.split("\n")[0].split(":")[-1].rstrip()) - 1 + + command = "hierarchy " + path + " -l " + str(bottomLevel) + " > " + os.path.join(clustFold,fileName) + p = Popen(command,shell=True,stdout=PIPE,stderr=STDOUT) + stdout,stderr = p.communicate() + + def readMappings(mF,splitBy="\t"): + + intToGene = {} + + for line in open(mF): + cols = line.rstrip().split(splitBy) + + if cols[0] in intToGene: + print("ALREADY SEEN",cols[0],"SHOULD BE NON-REDUNDANT!") + sys.exit() + + intToGene[cols[1]] = cols[0] + + return intToGene + + def processClusters(mapping,inFold,outFold,outListsFold,renumber_start=1,splitBy=" "): + + natKey = make_key_naturalSort() + + fileNumber = renumber_start + for path in sorted(glob.glob(os.path.join(inFold,"*")),key=natKey): + outFile = open(os.path.join(outFold,str(fileNumber) + ".csv"),"w") + + # Create folder to store split gene lists in + listFold = os.path.join(outListsFold,"Clust" + str(fileNumber)) + if not os.path.exists(listFold): + os.makedirs(listFold) + + fileNumber += 1 + + print("Id,Modularity Class",file=outFile) + byClust = defaultdict(list) + + for line in open(path): + cols = line.rstrip().split(splitBy) + + gene, clust = cols + clust = str(int(clust) + 1) # Start numbering from 1 not 0 + if gene in mapping: + byClust[clust].append(mapping[gene]) + else: + print("Gene (",gene,") is missing from mapping file!") + + for clst in sorted(byClust,key=natKey): + tempLF = open(os.path.join(listFold,"M" + str(clst) + ".txt"),"w") + for gene in sorted(byClust[clst],key=natKey): + print(gene,clst,sep=",",file=outFile) + print(gene,file=tempLF) + tempLF.close() + outFile.close() + + ##################################### + ##################################### + + print("\tConvert file to binary") + 
convertBinary(symEdgesF) + + print("\tRunning clusterings") + printEvery = int(round(runNum/float(printEveryDiv),0)) + + cluster(baseFold,symEdgesF,T_OUT_FOLDER,runNum,printEvery) + + print("\tRetrieving best",args.fuRetainNum,"clusterings") + selectClusterings(baseFold,MOD_SCORES,retainNum,T_OUT_FOLDER,TF_OUT_FOLDER) + + print("\tOutput clusterings from bottom level") + outputClusters(TF_OUT_FOLDER,C_OUT_FOLDER) + + # Post-process results back to human readable files + print("\tMap nodes back to genes and split into lists") + mappings = readMappings(geneIntMapF) + processClusters(mappings,C_OUT_FOLDER,C_OUT_TXT_FOLDER,C_OUT_LISTS_FOLDER,renumber_start=args.fuRenumberStart) + + + + +########################################################################################### + +def main(finishT="Finished!"): + + if args.singleCorrL: + singleCorrList = loadCorrGeneList(args.workFolder,args.singleCorrListF) + else: + singleCorrList = None + + # Load meta-file detailing list of files to process + metaInf = loadMeta(args.metaInf,args.dataF) + + # Find genes present in each data-set + genesToKeep, requireFileNum = getGenes(metaInf,splitBy=args.fileSep,geneFrac=args.geneFrac,retainF=args.retainF) + + # Generate correlation HDF5 files + corrh5,genesM = generateCorrMatrix(OUT_FOLDER,metaInf,genesToKeep,requireFileNum,singleCorrList,geneFrac=args.geneFrac,retainF=args.retainF,corrMatF=args.corrMatF,corrMatFS=args.corrMatFS,usePearson=args.usePearson,corrChunk=args.corrChunk, + splitBy=args.fileSep,keepBigFA=args.keepBigFA,singleCorr=args.singleCorr,singleCorrL=args.singleCorrL) + + # Reduce edges + edgesFP,nodesFP = reduceEdges(OUT_FOLDER,args.dataF,args.gephiF,corrh5,genesM,retainF=args.retainF,edgePG=args.edgePG,corrMatF=args.corrMatF + "_GMF" + str(requireFileNum),keepBigFA=args.keepBigFA,keepBigF=args.keepBigF,ignoreDuplicates=args.ignoreDuplicates) + + # Prepare and run Fast Unfolding Of Communities in Large Networks (https://sourceforge.net/projects/louvain/files/louvain-generic.tar.gz/download) + outFP1,outFP2 = prepareFastUnfold(OUT_FOLDER,args.dataF,args.fastUF,edgesFP,retainF=args.retainF,edgePG=args.edgePG) + + # Run Fast Unfolding + if not args.noFastUF: + runFastUF(outFP1,outFP2,args.tOutFold,args.fTOutFold,args.cOutFold,args.cOutTxtFold,args.cOutListsFold,runNum=args.fuRunNum,retainNum=args.fuRetainNum) + else: + print("\nSkipping Fast-Unfolding clustering and downstream output as --noFastUF is set") + print(finishT) + +if __name__ == '__main__': + main() diff --git a/pgcna.py b/PGCNA/pgcna.py similarity index 97% rename from pgcna.py rename to PGCNA/pgcna.py index 32df55b..3fc9b54 100644 --- a/pgcna.py +++ b/PGCNA/pgcna.py @@ -1,707 +1,707 @@ -#!/usr/bin/env python -""" -This program is free software: you can redistribute it and/or modify -it under the terms of the GNU General Public License as published by -the Free Software Foundation, either version 3 of the License, or -(at your option) any later version. - -This program is distributed in the hope that it will be useful, -but WITHOUT ANY WARRANTY; without even the implied warranty of -MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the -GNU General Public License for more details. - -You should have received a copy of the GNU General Public License -along with this program. If not, see . - -Author: Matthew Care - -A memory efficient implementation of PGCNA, that requires very little RAM -to process very large expression data-sets. 
-""" -from __future__ import print_function - -import sys -import os -import argparse -import shutil -from datetime import datetime -from collections import defaultdict -import re -import glob -from subprocess import Popen, PIPE, STDOUT -import numpy as np -import h5py - - -##----------------------------------------Create argparse argments----------------------------------------## -parser = argparse.ArgumentParser() -# Required -parser.add_argument("-w","--workFolder",help="Work Folder [REQUIRED]",default=None) -parser.add_argument("-d","--dataF",help="Expression data file path [REQUIRED]",default=None) - -# Optional -parser.add_argument("-s","--fileSep",help="Separator used in expression file [\t]",default="\t") -parser.add_argument("--headerL",help="Number of header lines before expression values [1]",default=1,type=int) -parser.add_argument("-f","--retainF",help="Retain gene fraction -- keeping most variant genes [0.8]",type=float,default=0.8) -parser.add_argument("-e","--edgePG",help="Edges to keep per gene [3] -- Highly recommend leaving as default",type=int,default=3) - -# Not often changed -parser.add_argument("--outF",help="Root output folder [PGCNA]",default="PGCNA") -parser.add_argument("--corrMatF",help="Correlation matrix folder [CORR_MATRIX]",default="CORR_MATRIX") -parser.add_argument("--gephiF",help="Folder to store files for Gephi [GEPHI]",default="GEPHI") -parser.add_argument("--fastUF",help="Fast Unfolding folder [FAST_UNFOLD]",default="FAST_UNFOLD") - -# Flags -parser.add_argument("--noFastUF",help="Don't try to run Fast Unfolding Clustering, but complete everything else [False] -- Flag",action="store_true") -parser.add_argument("--usePearson",help="Use Pearson Correlation instead of Spearman [False] -- Flag",action="store_true") -parser.add_argument("--keepBigF",help="Keep big HDF5 files after finishing [False] -- Flag",action="store_true") -parser.add_argument("--ignoreDuplicates",help="Ignore correlation duplicates when cutting top --edgePG genes [False] -- Flag, faster if set",action="store_true") - -# Correlation related -parser.add_argument("--corrChunk",dest="corrChunk",help="Size of chunk (rows) to split correlation problem over [5000] -- Higher will speed up correlation calculation at cost of RAM",default=5000,type=float) - -# FastUnfolding specific -parser.add_argument("-n","--fuNumber",dest="fuRunNum",help="Number of times to run [100]",default=100,type=float) -parser.add_argument("-r","--fuRetain",dest="fuRetainNum",help="Retain top [1] clusterings",default=1,type=float) -parser.add_argument("--fuRenumberStart",dest="fuRenumberStart",help="FU clustering re-numbering start [1]",default=1,type=int) -parser.add_argument("--tOutFold",dest="tOutFold",help="Trees Ouput folder [Trees]",default="Trees") -parser.add_argument("--fTOutFold",dest="fTOutFold",help="Final retained Trees Ouput folder [TreesF]",default="TreesF") -parser.add_argument("--cOutFold",dest="cOutFold",help="Clusters out folder [Clusters]",default="Clusters") -parser.add_argument("--cOutTxtFold",dest="cOutTxtFold",help="Clusters out folder - mapped back to genes [ClustersTxt]",default="ClustersTxt") -parser.add_argument("--cOutListsFold",dest="cOutListsFold",help="Clusters out folder - split into lists [ClustersLists]",default="ClustersLists") - - -########################################################################################### -args = parser.parse_args() -if not args.workFolder or not args.dataF: - print("\n\nNeed to specifiy REQUIRED variables see help (-h)") - sys.exit() - 
-########################################################################################### -OUT_FOLDER = os.path.join(args.workFolder,args.outF,"EPG" + str(args.edgePG)) -if not os.path.exists(OUT_FOLDER): - os.makedirs(OUT_FOLDER) -########################################################################################### -########################################################################################### -if not args.noFastUF: - def testCommandOut(outR,testFail): - if re.search("command not found",outR): # /bin/sh: louvainn: command not found - return True - elif re.search("not recognized as an internal or external command",outR): - return True - - # Make sure that if Fast Unfolding is to be run it exists on the path - testCommand = "louvain -h" - p = Popen(testCommand,shell=True,stdout=PIPE,stderr=STDOUT) - stdout,stderr = p.communicate() - - testCommand = "convert" - p = Popen(testCommand,shell=True,stdout=PIPE,stderr=STDOUT) - stdout2,stderr = p.communicate() - - testCommand = "hierarchy" - p = Popen(testCommand,shell=True,stdout=PIPE,stderr=STDOUT) - stdout3,stderr = p.communicate() - - testFail = False - testFail = testCommandOut(stdout,testFail) - testFail = testCommandOut(stdout2,testFail) - testFail = testCommandOut(stdout3,testFail) - - if testFail: - print("\nCouldn't run Fast unfold commands (louvainn, convert and hierarchy) must be in your path/environment (to skip clustering step use --noFastUF flag), see readme for more information") - sys.exit() -########################################################################################### -# Tidy input -# Decode fileSep for passed "\t" -args.fileSep = args.fileSep.decode('string-escape') -args.corrChunk = int(round(args.corrChunk,0)) -args.fuRunNum = int(round(args.fuRunNum,0)) -args.fuRetainNum = int(round(args.fuRetainNum,0)) -##----------------------------------------LOGGER----------------------------------------## - - -class multicaster(object): - def __init__(self, filelist): - self.filelist = filelist - - def write(self, str): - for f in self.filelist: - f.write(str) - - -def concatenateAsString(joinWith,*args): - temp = [str(x) for x in args] - return joinWith.join(temp) - -print("##----------------------------------------",str(datetime.now()),"----------------------------------------##",sep="") -# print out settings -settingInf = concatenateAsString( - "\n", - "##----------------------------------------Arguments----------------------------------------##", - "# Required", - "WorkFolder [-w,--workFolder] = " + args.workFolder, - "Expresion data file path [-d,--dataF] = " + args.dataF, - "\n# Optional", - "Separator used in expression file [-s, --fileSep] = " + args.fileSep, - "Number of header lines in expression file [--headerL] = " + str(args.headerL), - "Retain gene fraction [-f, --retainF] = " + str(args.retainF), - "Edges to retain per gene [-e, --edgePG] = " + str(args.edgePG), - "\n# Not changed often" - "Root output folder [--outF] = " + args.outF, - "Correlation matrix folder [--corrMatF] = " + args.corrMatF, - "Gephi files folder [--gephiF] = " + args.gephiF, - "Fast unfolding folder [--fastUF] = " + args.fastUF, - "\n# Flags", - "Don't run Fast Unfolding clustering [--noFastUF] = " + str(args.noFastUF), - "Use Pearson Correlation [--usePearson] = " + str(args.usePearson), - "Keep big HDF5 files after run [--keepBigF] = " + str(args.keepBigF), - "Ignore correlation duplicates when cutting top --edgePG genes [--ignoreDuplicates] = " + str(args.ignoreDuplicates), - "\n# Correlation Options", - "Chunk 
size (rows) to split correlation over [--corrChunk]:\t" + str(args.corrChunk), - "\n# Fast-Unfolding Specific", - "Cluster output folder -- Clusters split into lists [--cOutListsFold]:\t" + str(args.cOutListsFold), - "Run number [-n, --fuNumber]:\t" + str(args.fuRunNum), - "Retain top number of clusterings [-r, --fuRetain]:\t" + str(args.fuRetainNum), - "Fast-Unfolding clustering renumber start [--fuRenumberStart]:\t" + str(args.fuRenumberStart), - "Tree output folder [--tOutFold]:\t" + str(args.tOutFold), - "Final retained tree output folder [--fTOutFold]:\t" + str(args.fTOutFold), - "Cluster output folder [--cOutFold]:\t" + str(args.cOutFold), - "Cluster output folder -- mapped back to genes [--cOutTxtFold]:\t" + str(args.cOutTxtFold), -) - -settingsF = open(os.path.join(OUT_FOLDER,str(datetime.now()).replace(" ","-").replace(":",".")[:-7] + "_CorrelationSettingsInfo.txt"),"w") -print(settingInf,file=settingsF) -settingsF.close() -########################################################################################### - - -##----------------------------------------Methods----------------------------------------## - -def make_key_naturalSort(): - """ - A factory function: creates a key function to use in sort. - Sort data naturally - """ - def nSort(s): - convert = lambda text: int(text) if text.isdigit() else text - alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key)] - - return alphanum_key(s) - return nSort - - -def generateCorrMatrix(workF,dataF,retainF=0.8,corrMatF="CORR_MATRIX",usePearson=False,corrChunk=5000,splitBy="\t",headerL=1,decimalP=3,decimalPpVal=5): - """ - Memory efficient method for generating all pairwise correlations of genes (rows) across a set of samples (columns). Uses HDF5 files to greatly - reduce memory useage, keeping most data residing on the hard disk. 
- """ - print("\n\nGenerate correlations for expression data:") - - - - # Create subfolder - outF = os.path.join(workF,corrMatF) - if not os.path.exists(outF): - os.makedirs(outF) - - ############################# - # Read in data file - seenGenes = {} - genes = [] - expression = [] - header = headerL - fileN = os.path.basename(dataF) - print("\tReading expression data file:",fileN) - - for i, line in enumerate(open(dataF)): - cols = line.rstrip().split(splitBy) - - if header: - header -= 1 - continue - - genes.append(cols[0]) - if not cols[0] in seenGenes: - seenGenes[cols[0]] = 1 - else: - print("\n\nERROR: Duplicate gene (",cols[0],") in expression data (",fileN,"), expects unique identifiers!\nPlease remove duplicates and re-run\n",sep="") - sys.exit() - expression.append(cols[1:]) - - print("\t\tTotal genes = ",len(genes)) - ############################# - - ############################# - print("\tGenerate Numpy matrices") - # Move genes to numpy array, so can be sorted along with the rest - genesNP = np.empty((len(genes)),dtype=str) - genesNP = np.array(genes) - - # Create numpy array - nrow = len(expression[0]) - ncol = len(expression) - npA = np.empty((ncol,nrow),dtype=object) - for i,eSet in enumerate(expression): - npA[i] = eSet - - # Calculate SD for each gene - npSD = np.std(npA.astype(np.float64),axis=1) - - # Sort by SD - sortI = np.argsort(npSD)[::-1] # Sort ascending - # Sort matrix by index - npA = npA[sortI,:] - - # Sort genes by index - genesNP = genesNP[sortI] - - # Only retain top fract # - # Work out index to cut at - cutPos = int(round(retainF * genesNP.shape[0],0)) - print("\t\tRetain Fraction (",retainF,") Cut at: ",cutPos,sep="") - print("\t\t\tPre-cut shape:",npA.shape) - - # Retain most variant - npA = npA[:cutPos,].astype(np.float64) - genesNP = genesNP[:cutPos] - print("\t\t\tPost-cut shape:",npA.shape) - - # Output genes - outMatrixN = os.path.join(outF,fileN + "_RetainF" + str(retainF) + "_Genes.txt") - np.savetxt(outMatrixN,genesNP,fmt="%s",delimiter="\t") - - ####################################### - # Calculate correlations - print("\tCalculating correlations using HDF5 file to save memory") - rowN, colN = npA.shape - - outMatrixHDF5 = os.path.join(outF,fileN + "_RetainF" + str(retainF) + ".h5") - with h5py.File(outMatrixHDF5, "w") as f: - h5 = f.create_dataset("corr", (rowN,rowN), dtype="f8") - - # # Load into memory - print("\t\tCreating HDF5 file") - h5 = h5py.File(outMatrixHDF5,'r+') - - if not usePearson: - # Note this isn't a perfect Spearman, as ties are not dealt with, but it is fast, and given we're only - # Retaining the most correlated edges will have almost no effect on the output. - npA = npA.argsort(axis=1).argsort(axis=1).astype(np.float64) - - # subtract means from the input data - npA -= np.mean(npA, axis=1)[:,None] - - # normalize the data - npA /= np.sqrt(np.sum(npA*npA, axis=1))[:,None] - - # Calculate correlations per chunk - print("\t\tCalculating correlations for Chunk:") - for r in range(0, rowN, corrChunk): - print("\t\t\t",r) - for c in range(0, rowN, corrChunk): - r1 = r + corrChunk - c1 = c + corrChunk - chunk1 = npA[r:r1] - chunk2 = npA[c:c1] - h5["corr"][r:r1, c:c1] = np.dot(chunk1, chunk2.T) - - # # Write output out for debugging - # finalCorr = np.copy(h5["corr"][:]) - # outMatrixN = os.path.join(outF,fileN + "_RetainF" + str(retainF) + "_Corr") - # stringFormat = "%." + str(decimalP) + "f" - # # stringFormat2 = "%." 
+ str(decimalPpVal) + "f" - # np.savetxt(outMatrixN + ".txt",finalCorr,fmt=stringFormat,delimiter="\t") - - # Return path to HDF5 file and genes matrix - return outMatrixHDF5, genesNP - - -def reduceEdges(workF,dataF,gephiF,corrh5,genesM,retainF=0.8,edgePG=3,printEveryDiv=10,corrMatF=None,keepBigF=False,ignoreDuplicates=False): - """ - Reduce edges in correlation matrix, only retaining edgePG maximum correlated genes per row - """ - - def bottomToZero(npA,n=1): - """ - Set everything below n to zero - """ - topI = np.argpartition(npA,-n) - npA[topI[:-n]] = 0 - return npA - - def bottomToZeroWithDuplicates(npA,n=1): - """ - Set everything below n to zero, - but deal with duplicates - """ - unique = np.unique(npA) - topIunique = np.argpartition(unique,-n)[-n:] - - toKeep = [] - for val in unique[topIunique]: - toKeep.extend(np.where(npA == val)[0]) - - # Mask and reverse - mask = np.ones(len(npA),np.bool) - mask[toKeep] = 0 - npA[mask] = 0 - - return npA - - fileN = os.path.basename(dataF) - - print("\tLoad HDF5 file") - # Load old HDF5 - h5 = h5py.File(corrh5,'r+') - rowN, colN = h5["corr"].shape - - printEvery = int(round(rowN/float(printEveryDiv),0)) - print("\nReduces edges to (",edgePG,") per gene:",sep="") - printE = printEvery - tell = printE - count = 0 - - print("\tWorking (report every 1/",printEveryDiv,"th of total):",sep="") - for i, row in enumerate(h5["corr"]): - # Inform user of position - count += 1 - if count == printE: - printE = printE + tell - print("\t\t",count) - - if ignoreDuplicates: - h5["corr"][i] = bottomToZero(row,edgePG + 1) - else: - h5["corr"][i] = bottomToZeroWithDuplicates(row,edgePG + 1) - - # Write output out for debugging - # finalCorr = np.copy(h5EPG["corr"][:]) - # outMatrixN = os.path.join(outF,fileN + "_RetainF" + str(retainF) + "_NonSym") - # stringFormat = "%." + str(3) + "f" - # # stringFormat2 = "%." + str(decimalPpVal) + "f" - # np.savetxt(outMatrixN + ".txt",finalCorr,fmt=stringFormat,delimiter="\t") - - ########################################################################################### - - print("\nGenerating files for Gephi network visualization tool") - # Create subfolder - outF = os.path.join(workF,gephiF) - if not os.path.exists(outF): - os.makedirs(outF) - - # First output list of genes (nodes) - print("\tCreate node file") - nodesFP = os.path.join(outF,fileN + "_RetainF" + str(retainF) + "_EPG" + str(edgePG) + "_Nodes.tsv") - nodesFile = open(nodesFP,"w") - print("Id\tLabel\tCluster",file=nodesFile) - for gene in genesM: - print(gene,gene,"NA",file=nodesFile,sep="\t") - nodesFile.close() - - # Second, output list of edges - print("\tCreate edges file (report every 1/",printEveryDiv,"th of total):",sep="") - edgesFP = os.path.join(outF,fileN + "_RetainF" + str(retainF) + "_EPG" + str(edgePG) + "_Edges.tsv") - edgesFile = open(edgesFP,"w") - print("Source\tTarget\tWeight\tType\tfromAltName\ttoAltName",file=edgesFile) - - # Finally output edges - printE = printEvery - tell = printE - count = 0 - seen = defaultdict() - printedLines = 0 - - for i,row in enumerate(h5["corr"]): - for j,item in enumerate(row): - if not (i == j): # Don't output self edges... 
- geneO = genesM[i] - geneT = genesM[j] - if not (geneO + "-@@@-" + geneT) in seen: - # Output info - if not item == 0: - print(geneO,geneT,item,"undirected",geneO,geneT,sep="\t",file=edgesFile) - printedLines += 1 - - # Store the fact that we've see this and it's equivalent pair - seen[geneO + "-@@@-" + geneT] = 1 - seen[geneT + "-@@@-" + geneO] = 1 - - count += 1 - if count == printEvery: - printEvery = printEvery + tell - print("\t\t",count,"Genes",", Edges:",printedLines) - - edgesFile.close() - print("\n\t\tTotal printed (",printedLines,") edges",sep="") - - if not keepBigF: - h5.close() - shutil.rmtree(os.path.join(workF,corrMatF)) - - return edgesFP, nodesFP - - -def prepareFastUnfold(workF,dataF,fastUF,edgesFP,retainF=0.8,edgePG=3,splitBy="\t",header=1): - print("\nPreparing files for clustering using Fast Unfolding method") - - fileN = os.path.basename(dataF) - # Create subfolder - outF = os.path.join(workF,fastUF,fileN + "_retainF" + str(retainF) + "_EPG" + str(edgePG)) - if not os.path.exists(outF): - os.makedirs(outF) - - nodePairs = defaultdict(dict) - naturalKey = make_key_naturalSort() - - nameToInt = {} - nameIntCount = 0 - - edgeNum = 0 - for line in open(edgesFP): - cols = line.rstrip().split(splitBy) - - if header: - header -= 1 - continue - - gene1 = cols[0] - gene2 = cols[1] - edgeNum += 1 - - # Convert from name to int - if gene1 in nameToInt: - gene1 = nameToInt[gene1] - else: - nameToInt[gene1] = str(nameIntCount) - gene1 = str(nameIntCount) - nameIntCount += 1 - - if gene2 in nameToInt: - gene2 = nameToInt[gene2] - else: - nameToInt[gene2] = str(nameIntCount) - gene2 = str(nameIntCount) - nameIntCount += 1 - - ################################# - - nodePairs[gene1][gene2] = cols[2] - - print("\tTotal edges =",edgeNum) - - outFilePath = os.path.join(outF,"sym_edges.txt") - outFile = open(outFilePath,"w") - for node in sorted(nodePairs,key=naturalKey): - for node2 in sorted(nodePairs[node],key=naturalKey): - print(node,node2,nodePairs[node][node2],sep="\t",file=outFile) - outFile.close() - - # Write out mapping between genes and ints - outFilePath2 = os.path.join(outF,"GeneIntMap.txt") - outFile2 = open(outFilePath2,"w") - for gene in sorted(nameToInt,key=naturalKey): - print(gene,nameToInt[gene],sep="\t",file=outFile2) - outFile2.close() - - return outFilePath, outFilePath2 - - -def runFastUF(symEdgesF,geneIntMapF,tOutFold,fTOutFold,cOutFold,cOutTxtFold,cOutListsFold,runNum=10,retainNum=1,printEveryDiv=10): - MOD_SCORES = "modScores.txt" - - print("\nRunning Fast Unfolding") - baseFold,symEdgesFile = os.path.split(symEdgesF) - - # Make folders - T_OUT_FOLDER = os.path.join(baseFold,tOutFold) - TF_OUT_FOLDER = os.path.join(baseFold,fTOutFold) - C_OUT_FOLDER = os.path.join(baseFold,cOutFold) - C_OUT_TXT_FOLDER = os.path.join(baseFold,cOutTxtFold) - C_OUT_LISTS_FOLDER = os.path.join(baseFold,cOutListsFold) - - # Remove old - if os.path.exists(TF_OUT_FOLDER): - shutil.rmtree(TF_OUT_FOLDER) - if os.path.exists(T_OUT_FOLDER): - shutil.rmtree(T_OUT_FOLDER) - if os.path.exists(C_OUT_FOLDER): - shutil.rmtree(C_OUT_FOLDER) - if os.path.exists(C_OUT_TXT_FOLDER): - shutil.rmtree(C_OUT_TXT_FOLDER) - if os.path.exists(C_OUT_LISTS_FOLDER): - shutil.rmtree(C_OUT_LISTS_FOLDER) - - # Replace - if not os.path.exists(T_OUT_FOLDER): - os.makedirs(T_OUT_FOLDER) - if not os.path.exists(TF_OUT_FOLDER): - os.makedirs(TF_OUT_FOLDER) - if not os.path.exists(C_OUT_FOLDER): - os.makedirs(C_OUT_FOLDER) - if not os.path.exists(C_OUT_TXT_FOLDER): - os.makedirs(C_OUT_TXT_FOLDER) - if not 
os.path.exists(C_OUT_LISTS_FOLDER): - os.makedirs(C_OUT_LISTS_FOLDER) - - ##################################### - - # Fast Unfold commands - def convertBinary(iF): - command = "convert -i " + iF + " -o " + iF[:-3]+"bin " + " -w " + iF[:-3] + "weights" - p = Popen(command,shell=True,stdout=PIPE,stderr=STDOUT) - stdout,stderr = p.communicate() - - def cluster(wF,iF,outFolder,runNum,PRINT_EVERY): - modScores = [] - - printEvery = PRINT_EVERY - tell = printEvery - count = 0 - - for i in range(runNum): - command = "louvain " + iF[:-3]+"bin -w " + iF[:-3] + "weights -l -1 > " + os.path.join(outFolder,str(i)) - p = Popen(command,shell=True,stdout=PIPE,stderr=STDOUT) - stdout,stderr = p.communicate() - modScores.append(stdout.rstrip()) - - # Notify position - count += 1 - if count == printEvery: - printEvery = printEvery + tell - print("\t",count) - - outFile = open(os.path.join(wF,MOD_SCORES),"w") - for i,score in enumerate(modScores): - print(i,score,sep="\t",file=outFile) - outFile.close() - - def selectClusterings(wF,modS,retainNumber,treeFold,finalTreeFold,splitBy="\t"): - clusters = [] - scores = [] - - for line in open(os.path.join(wF,modS)): - cols = line.rstrip().split(splitBy) - - clusters.append(cols[0]) - scores.append(float(cols[1])) - - j = zip(scores,clusters) - jSorted = sorted(j,reverse=True) - - scoresS,clustersS = zip(*jSorted) - retained = clustersS[:retainNumber] - - for clst in retained: - shutil.copy(os.path.join(treeFold,clst),os.path.join(finalTreeFold,clst)) - - def outputClusters(finalTreeFold,clustFold): - for path in glob.glob(os.path.join(finalTreeFold,"*")): - fileName = os.path.basename(path) - command = "hierarchy " + path - p = Popen(command,shell=True,stdout=PIPE,stderr=STDOUT) - stdout,stderr = p.communicate() - - bottomLevel = int(stdout.split("\n")[0].split(":")[-1].rstrip()) - 1 - - command = "hierarchy " + path + " -l " + str(bottomLevel) + " > " + os.path.join(clustFold,fileName) - p = Popen(command,shell=True,stdout=PIPE,stderr=STDOUT) - stdout,stderr = p.communicate() - - def readMappings(mF,splitBy="\t"): - - intToGene = {} - - for line in open(mF): - cols = line.rstrip().split(splitBy) - - if cols[0] in intToGene: - print("ALREADY SEEN",cols[0],"SHOULD BE NON-REDUNDANT!") - sys.exit() - - intToGene[cols[1]] = cols[0] - - return intToGene - - def processClusters(mapping,inFold,outFold,outListsFold,renumber_start=1,splitBy=" "): - - natKey = make_key_naturalSort() - - fileNumber = renumber_start - for path in sorted(glob.glob(os.path.join(inFold,"*")),key=natKey): - outFile = open(os.path.join(outFold,str(fileNumber) + ".csv"),"w") - - # Create folder to store split gene lists in - listFold = os.path.join(outListsFold,"Clust" + str(fileNumber)) - if not os.path.exists(listFold): - os.makedirs(listFold) - - fileNumber += 1 - - print("Id,Modularity Class",file=outFile) - byClust = defaultdict(list) - - for line in open(path): - cols = line.rstrip().split(splitBy) - - gene, clust = cols - clust = str(int(clust) + 1) # Start numbering from 1 not 0 - if gene in mapping: - byClust[clust].append(mapping[gene]) - else: - print("Gene (",gene,") is missing from mapping file!") - - for clst in sorted(byClust,key=natKey): - tempLF = open(os.path.join(listFold,"M" + str(clst) + ".txt"),"w") - for gene in sorted(byClust[clst],key=natKey): - print(gene,clst,sep=",",file=outFile) - print(gene,file=tempLF) - tempLF.close() - outFile.close() - - ##################################### - ##################################### - - print("\tConvert file to binary") - 
convertBinary(symEdgesF)
-
-	print("\tRunning clusterings")
-	printEvery = int(round(runNum/float(printEveryDiv),0))
-
-	cluster(baseFold,symEdgesF,T_OUT_FOLDER,runNum,printEvery)
-
-	print("\tRetrieving best",args.fuRetainNum,"clusterings")
-	selectClusterings(baseFold,MOD_SCORES,retainNum,T_OUT_FOLDER,TF_OUT_FOLDER)
-
-	print("\tOutput clusterings from bottom level")
-	outputClusters(TF_OUT_FOLDER,C_OUT_FOLDER)
-
-	# Post-process results back to human readable files
-	print("\tMap nodes back to genes and split into lists")
-	mappings = readMappings(geneIntMapF)
-	processClusters(mappings,C_OUT_FOLDER,C_OUT_TXT_FOLDER,C_OUT_LISTS_FOLDER,renumber_start=args.fuRenumberStart)
-
-
-###########################################################################################
-
-def main(finishT="Finished!"):
-
-	# Calculate correlations
-	corrh5,genesM = generateCorrMatrix(OUT_FOLDER,args.dataF,retainF=args.retainF,corrMatF=args.corrMatF,usePearson=args.usePearson,corrChunk=args.corrChunk,splitBy=args.fileSep,headerL=args.headerL)
-
-	# Reduce edges
-	edgesFP,nodesFP = reduceEdges(OUT_FOLDER,args.dataF,args.gephiF,corrh5,genesM,retainF=args.retainF,edgePG=args.edgePG,corrMatF=args.corrMatF,keepBigF=args.keepBigF,ignoreDuplicates=args.ignoreDuplicates)
-
-	# Prepare and run Fast Unfolding Of Communities in Large Networks (https://sourceforge.net/projects/louvain/files/louvain-generic.tar.gz/download)
-	outFP1,outFP2 = prepareFastUnfold(OUT_FOLDER,args.dataF,args.fastUF,edgesFP,retainF=args.retainF,edgePG=args.edgePG)
-
-	# Run Fast Unfolding
-	if not args.noFastUF:
-		runFastUF(outFP1,outFP2,args.tOutFold,args.fTOutFold,args.cOutFold,args.cOutTxtFold,args.cOutListsFold,runNum=args.fuRunNum,retainNum=args.fuRetainNum)
-	else:
-		print("\nSkipping Fast-Unfolding clustering and downstream output as --noFastUF is set")
-	print(finishT)
-
-
-if __name__ == '__main__':
+#!/usr/bin/env python
+"""
+This program is free software: you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation, either version 3 of the License, or
+(at your option) any later version.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with this program. If not, see <http://www.gnu.org/licenses/>.
+
+Author: Matthew Care
+
+A memory efficient implementation of PGCNA, that requires very little RAM
+to process very large expression data-sets.
+
+""" +from __future__ import print_function + +import sys +import os +import argparse +import shutil +from datetime import datetime +from collections import defaultdict +import re +import glob +from subprocess import Popen, PIPE, STDOUT +import numpy as np +import h5py + + +##----------------------------------------Create argparse argments----------------------------------------## +parser = argparse.ArgumentParser() +# Required +parser.add_argument("-w","--workFolder",help="Work Folder [REQUIRED]",default=None) +parser.add_argument("-d","--dataF",help="Expression data file path [REQUIRED]",default=None) + +# Optional +parser.add_argument("-s","--fileSep",help="Separator used in expression file [\t]",default="\t") +parser.add_argument("--headerL",help="Number of header lines before expression values [1]",default=1,type=int) +parser.add_argument("-f","--retainF",help="Retain gene fraction -- keeping most variant genes [0.8]",type=float,default=0.8) +parser.add_argument("-e","--edgePG",help="Edges to keep per gene [3] -- Highly recommend leaving as default",type=int,default=3) + +# Not often changed +parser.add_argument("--outF",help="Root output folder [PGCNA]",default="PGCNA") +parser.add_argument("--corrMatF",help="Correlation matrix folder [CORR_MATRIX]",default="CORR_MATRIX") +parser.add_argument("--gephiF",help="Folder to store files for Gephi [GEPHI]",default="GEPHI") +parser.add_argument("--fastUF",help="Fast Unfolding folder [FAST_UNFOLD]",default="FAST_UNFOLD") + +# Flags +parser.add_argument("--noFastUF",help="Don't try to run Fast Unfolding Clustering, but complete everything else [False] -- Flag",action="store_true") +parser.add_argument("--usePearson",help="Use Pearson Correlation instead of Spearman [False] -- Flag",action="store_true") +parser.add_argument("--keepBigF",help="Keep big HDF5 files after finishing [False] -- Flag",action="store_true") +parser.add_argument("--ignoreDuplicates",help="Ignore correlation duplicates when cutting top --edgePG genes [False] -- Flag, faster if set",action="store_true") + +# Correlation related +parser.add_argument("--corrChunk",dest="corrChunk",help="Size of chunk (rows) to split correlation problem over [5000] -- Higher will speed up correlation calculation at cost of RAM",default=5000,type=float) + +# FastUnfolding specific +parser.add_argument("-n","--fuNumber",dest="fuRunNum",help="Number of times to run [100]",default=100,type=float) +parser.add_argument("-r","--fuRetain",dest="fuRetainNum",help="Retain top [1] clusterings",default=1,type=float) +parser.add_argument("--fuRenumberStart",dest="fuRenumberStart",help="FU clustering re-numbering start [1]",default=1,type=int) +parser.add_argument("--tOutFold",dest="tOutFold",help="Trees Ouput folder [Trees]",default="Trees") +parser.add_argument("--fTOutFold",dest="fTOutFold",help="Final retained Trees Ouput folder [TreesF]",default="TreesF") +parser.add_argument("--cOutFold",dest="cOutFold",help="Clusters out folder [Clusters]",default="Clusters") +parser.add_argument("--cOutTxtFold",dest="cOutTxtFold",help="Clusters out folder - mapped back to genes [ClustersTxt]",default="ClustersTxt") +parser.add_argument("--cOutListsFold",dest="cOutListsFold",help="Clusters out folder - split into lists [ClustersLists]",default="ClustersLists") + + +########################################################################################### +args = parser.parse_args() +if not args.workFolder or not args.dataF: + print("\n\nNeed to specifiy REQUIRED variables see help (-h)") + sys.exit() + 
+###########################################################################################
+OUT_FOLDER = os.path.join(args.workFolder,args.outF,"EPG" + str(args.edgePG))
+if not os.path.exists(OUT_FOLDER):
+	os.makedirs(OUT_FOLDER)
+###########################################################################################
+###########################################################################################
+if not args.noFastUF:
+	def testCommandOut(outR,testFail):
+		if re.search("command not found",outR):  # e.g. /bin/sh: louvain: command not found
+			return True
+		elif re.search("not recognized as an internal or external command",outR):
+			return True
+		return testFail
+
+	# Make sure that if Fast Unfolding is to be run it exists on the path
+	testCommand = "louvain -h"
+	p = Popen(testCommand,shell=True,stdout=PIPE,stderr=STDOUT)
+	stdout,stderr = p.communicate()
+
+	testCommand = "convert"
+	p = Popen(testCommand,shell=True,stdout=PIPE,stderr=STDOUT)
+	stdout2,stderr = p.communicate()
+
+	testCommand = "hierarchy"
+	p = Popen(testCommand,shell=True,stdout=PIPE,stderr=STDOUT)
+	stdout3,stderr = p.communicate()
+
+	testFail = False
+	testFail = testCommandOut(stdout,testFail)
+	testFail = testCommandOut(stdout2,testFail)
+	testFail = testCommandOut(stdout3,testFail)
+
+	if testFail:
+		print("\nCouldn't run the Fast Unfold commands; louvain, convert and hierarchy must be in your path/environment (to skip the clustering step use the --noFastUF flag), see readme for more information")
+		sys.exit()
+###########################################################################################
+# Tidy input
+# Decode fileSep for passed "\t"
+args.fileSep = args.fileSep.decode('string-escape')
+args.corrChunk = int(round(args.corrChunk,0))
+args.fuRunNum = int(round(args.fuRunNum,0))
+args.fuRetainNum = int(round(args.fuRetainNum,0))
+##----------------------------------------LOGGER----------------------------------------##
+
+
+class multicaster(object):
+	def __init__(self, filelist):
+		self.filelist = filelist
+
+	def write(self, str):
+		for f in self.filelist:
+			f.write(str)
+
+
+def concatenateAsString(joinWith,*args):
+	temp = [str(x) for x in args]
+	return joinWith.join(temp)
+
+print("##----------------------------------------",str(datetime.now()),"----------------------------------------##",sep="")
+# print out settings
+settingInf = concatenateAsString(
+	"\n",
+	"##----------------------------------------Arguments----------------------------------------##",
+	"# Required",
+	"WorkFolder [-w,--workFolder] = " + args.workFolder,
+	"Expression data file path [-d,--dataF] = " + args.dataF,
+	"\n# Optional",
+	"Separator used in expression file [-s, --fileSep] = " + args.fileSep,
+	"Number of header lines in expression file [--headerL] = " + str(args.headerL),
+	"Retain gene fraction [-f, --retainF] = " + str(args.retainF),
+	"Edges to retain per gene [-e, --edgePG] = " + str(args.edgePG),
+	"\n# Not changed often",
+	"Root output folder [--outF] = " + args.outF,
+	"Correlation matrix folder [--corrMatF] = " + args.corrMatF,
+	"Gephi files folder [--gephiF] = " + args.gephiF,
+	"Fast unfolding folder [--fastUF] = " + args.fastUF,
+	"\n# Flags",
+	"Don't run Fast Unfolding clustering [--noFastUF] = " + str(args.noFastUF),
+	"Use Pearson Correlation [--usePearson] = " + str(args.usePearson),
+	"Keep big HDF5 files after run [--keepBigF] = " + str(args.keepBigF),
+	"Ignore correlation duplicates when cutting top --edgePG genes [--ignoreDuplicates] = " + str(args.ignoreDuplicates),
+	"\n# Correlation Options",
+	"Chunk size (rows) to split correlation over [--corrChunk]:\t" + str(args.corrChunk),
+	"\n# Fast-Unfolding Specific",
+	"Cluster output folder -- Clusters split into lists [--cOutListsFold]:\t" + str(args.cOutListsFold),
+	"Run number [-n, --fuNumber]:\t" + str(args.fuRunNum),
+	"Retain top number of clusterings [-r, --fuRetain]:\t" + str(args.fuRetainNum),
+	"Fast-Unfolding clustering renumber start [--fuRenumberStart]:\t" + str(args.fuRenumberStart),
+	"Tree output folder [--tOutFold]:\t" + str(args.tOutFold),
+	"Final retained tree output folder [--fTOutFold]:\t" + str(args.fTOutFold),
+	"Cluster output folder [--cOutFold]:\t" + str(args.cOutFold),
+	"Cluster output folder -- mapped back to genes [--cOutTxtFold]:\t" + str(args.cOutTxtFold),
+)
+
+settingsF = open(os.path.join(OUT_FOLDER,str(datetime.now()).replace(" ","-").replace(":",".")[:-7] + "_CorrelationSettingsInfo.txt"),"w")
+print(settingInf,file=settingsF)
+settingsF.close()
+###########################################################################################
+
+
+##----------------------------------------Methods----------------------------------------##
+
+def make_key_naturalSort():
+	"""
+	A factory function: creates a key function to use in sort.
+	Sort data naturally
+	"""
+	def nSort(s):
+		convert = lambda text: int(text) if text.isdigit() else text
+		alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key)]
+
+		return alphanum_key(s)
+	return nSort
+
+
+def generateCorrMatrix(workF,dataF,retainF=0.8,corrMatF="CORR_MATRIX",usePearson=False,corrChunk=5000,splitBy="\t",headerL=1,decimalP=3,decimalPpVal=5):
+	"""
+	Memory efficient method for generating all pairwise correlations of genes (rows) across a set of samples (columns).  Uses HDF5 files to greatly
+	reduce memory usage, keeping most data residing on the hard disk.
+	"""
+	print("\n\nGenerate correlations for expression data:")
+
+
+
+	# Create subfolder
+	outF = os.path.join(workF,corrMatF)
+	if not os.path.exists(outF):
+		os.makedirs(outF)
+
+	#############################
+	# Read in data file
+	seenGenes = {}
+	genes = []
+	expression = []
+	header = headerL
+	fileN = os.path.basename(dataF)
+	print("\tReading expression data file:",fileN)
+
+	for i, line in enumerate(open(dataF)):
+		cols = line.rstrip().split(splitBy)
+
+		if header:
+			header -= 1
+			continue
+
+		genes.append(cols[0])
+		if not cols[0] in seenGenes:
+			seenGenes[cols[0]] = 1
+		else:
+			print("\n\nERROR: Duplicate gene (",cols[0],") in expression data (",fileN,"), expects unique identifiers!\nPlease remove duplicates and re-run\n",sep="")
+			sys.exit()
+		expression.append(cols[1:])
+
+	print("\t\tTotal genes = ",len(genes))
+	#############################
+
+	#############################
+	print("\tGenerate Numpy matrices")
+	# Move genes to numpy array, so can be sorted along with the rest
+	genesNP = np.empty((len(genes)),dtype=str)
+	genesNP = np.array(genes)
+
+	# Create numpy array
+	nrow = len(expression[0])
+	ncol = len(expression)
+	npA = np.empty((ncol,nrow),dtype=object)
+	for i,eSet in enumerate(expression):
+		npA[i] = eSet
+
+	# Calculate SD for each gene
+	npSD = np.std(npA.astype(np.float64),axis=1)
+
+	# Sort by SD
+	sortI = np.argsort(npSD)[::-1]  # Sort descending (most variant genes first)
+	# Sort matrix by index
+	npA = npA[sortI,:]
+
+	# Sort genes by index
+	genesNP = genesNP[sortI]
+
+	# Only retain top fract #
+	# Work out index to cut at
+	cutPos = int(round(retainF * genesNP.shape[0],0))
+	print("\t\tRetain Fraction (",retainF,") Cut at: ",cutPos,sep="")
+	print("\t\t\tPre-cut shape:",npA.shape)
+
+	# Retain most variant
+	npA = npA[:cutPos,].astype(np.float64)
+	genesNP = genesNP[:cutPos]
+	print("\t\t\tPost-cut shape:",npA.shape)
+
+	# Output genes
+	outMatrixN = os.path.join(outF,fileN + "_RetainF" + str(retainF) + "_Genes.txt")
+	np.savetxt(outMatrixN,genesNP,fmt="%s",delimiter="\t")
+
+	#######################################
+	# Calculate correlations
+	print("\tCalculating correlations using HDF5 file to save memory")
+	rowN, colN = npA.shape
+
+	outMatrixHDF5 = os.path.join(outF,fileN + "_RetainF" + str(retainF) + ".h5")
+	with h5py.File(outMatrixHDF5, "w") as f:
+		h5 = f.create_dataset("corr", (rowN,rowN), dtype="f8")
+
+	# # Load into memory
+	print("\t\tCreating HDF5 file")
+	h5 = h5py.File(outMatrixHDF5,'r+')
+
+	if not usePearson:
+		# Note this isn't a perfect Spearman, as ties are not dealt with, but it is fast, and given we're only
+		# retaining the most correlated edges it will have almost no effect on the output.
+		npA = npA.argsort(axis=1).argsort(axis=1).astype(np.float64)
+
+	# subtract means from the input data
+	npA -= np.mean(npA, axis=1)[:,None]
+
+	# normalize the data
+	npA /= np.sqrt(np.sum(npA*npA, axis=1))[:,None]
+
+	# Calculate correlations per chunk
+	print("\t\tCalculating correlations for Chunk:")
+	for r in range(0, rowN, corrChunk):
+		print("\t\t\t",r)
+		for c in range(0, rowN, corrChunk):
+			r1 = r + corrChunk
+			c1 = c + corrChunk
+			chunk1 = npA[r:r1]
+			chunk2 = npA[c:c1]
+			h5["corr"][r:r1, c:c1] = np.dot(chunk1, chunk2.T)
+
+	# # Write output out for debugging
+	# finalCorr = np.copy(h5["corr"][:])
+	# outMatrixN = os.path.join(outF,fileN + "_RetainF" + str(retainF) + "_Corr")
+	# stringFormat = "%." + str(decimalP) + "f"
+	# # stringFormat2 = "%." + str(decimalPpVal) + "f"
+	# np.savetxt(outMatrixN + ".txt",finalCorr,fmt=stringFormat,delimiter="\t")
+
+	# Return path to HDF5 file and genes matrix
+	return outMatrixHDF5, genesNP
+
+
+def reduceEdges(workF,dataF,gephiF,corrh5,genesM,retainF=0.8,edgePG=3,printEveryDiv=10,corrMatF=None,keepBigF=False,ignoreDuplicates=False):
+	"""
+	Reduce edges in correlation matrix, only retaining edgePG maximum correlated genes per row
+	"""
+
+	def bottomToZero(npA,n=1):
+		"""
+		Set everything below n to zero
+		"""
+		topI = np.argpartition(npA,-n)
+		npA[topI[:-n]] = 0
+		return npA
+
+	def bottomToZeroWithDuplicates(npA,n=1):
+		"""
+		Set everything below n to zero,
+		but deal with duplicates
+		"""
+		unique = np.unique(npA)
+		topIunique = np.argpartition(unique,-n)[-n:]
+
+		toKeep = []
+		for val in unique[topIunique]:
+			toKeep.extend(np.where(npA == val)[0])
+
+		# Mask and reverse
+		mask = np.ones(len(npA),np.bool)
+		mask[toKeep] = 0
+		npA[mask] = 0
+
+		return npA
+
+	fileN = os.path.basename(dataF)
+
+	print("\tLoad HDF5 file")
+	# Load old HDF5
+	h5 = h5py.File(corrh5,'r+')
+	rowN, colN = h5["corr"].shape
+
+	printEvery = int(round(rowN/float(printEveryDiv),0))
+	print("\nReducing edges to (",edgePG,") per gene:",sep="")
+	printE = printEvery
+	tell = printE
+	count = 0
+
+	print("\tWorking (report every 1/",printEveryDiv,"th of total):",sep="")
+	for i, row in enumerate(h5["corr"]):
+		# Inform user of position
+		count += 1
+		if count == printE:
+			printE = printE + tell
+			print("\t\t",count)
+
+		if ignoreDuplicates:
+			h5["corr"][i] = bottomToZero(row,edgePG + 1)
+		else:
+			h5["corr"][i] = bottomToZeroWithDuplicates(row,edgePG + 1)
+
+	# Write output out for debugging
+	# finalCorr = np.copy(h5EPG["corr"][:])
+	# outMatrixN = os.path.join(outF,fileN + "_RetainF" + str(retainF) + "_NonSym")
+	# stringFormat = "%." + str(3) + "f"
+	# # stringFormat2 = "%." + str(decimalPpVal) + "f"
+	# np.savetxt(outMatrixN + ".txt",finalCorr,fmt=stringFormat,delimiter="\t")
+
+	###########################################################################################
+
+	print("\nGenerating files for Gephi network visualization tool")
+	# Create subfolder
+	outF = os.path.join(workF,gephiF)
+	if not os.path.exists(outF):
+		os.makedirs(outF)
+
+	# First output list of genes (nodes)
+	print("\tCreate node file")
+	nodesFP = os.path.join(outF,fileN + "_RetainF" + str(retainF) + "_EPG" + str(edgePG) + "_Nodes.tsv")
+	nodesFile = open(nodesFP,"w")
+	print("Id\tLabel\tCluster",file=nodesFile)
+	for gene in genesM:
+		print(gene,gene,"NA",file=nodesFile,sep="\t")
+	nodesFile.close()
+
+	# Second, output list of edges
+	print("\tCreate edges file (report every 1/",printEveryDiv,"th of total):",sep="")
+	edgesFP = os.path.join(outF,fileN + "_RetainF" + str(retainF) + "_EPG" + str(edgePG) + "_Edges.tsv")
+	edgesFile = open(edgesFP,"w")
+	print("Source\tTarget\tWeight\tType\tfromAltName\ttoAltName",file=edgesFile)
+
+	# Finally output edges
+	printE = printEvery
+	tell = printE
+	count = 0
+	seen = defaultdict()
+	printedLines = 0
+
+	for i,row in enumerate(h5["corr"]):
+		for j,item in enumerate(row):
+			if not (i == j):  # Don't output self edges...
+				geneO = genesM[i]
+				geneT = genesM[j]
+				if not (geneO + "-@@@-" + geneT) in seen:
+					# Output info
+					if not item == 0:
+						print(geneO,geneT,item,"undirected",geneO,geneT,sep="\t",file=edgesFile)
+						printedLines += 1
+
+					# Store the fact that we've seen this and its equivalent pair
+					seen[geneO + "-@@@-" + geneT] = 1
+					seen[geneT + "-@@@-" + geneO] = 1
+
+		count += 1
+		if count == printEvery:
+			printEvery = printEvery + tell
+			print("\t\t",count,"Genes",", Edges:",printedLines)
+
+	edgesFile.close()
+	print("\n\t\tTotal printed (",printedLines,") edges",sep="")
+
+	if not keepBigF:
+		h5.close()
+		shutil.rmtree(os.path.join(workF,corrMatF))
+
+	return edgesFP, nodesFP
+
+
+def prepareFastUnfold(workF,dataF,fastUF,edgesFP,retainF=0.8,edgePG=3,splitBy="\t",header=1):
+	print("\nPreparing files for clustering using Fast Unfolding method")
+
+	fileN = os.path.basename(dataF)
+	# Create subfolder
+	outF = os.path.join(workF,fastUF,fileN + "_retainF" + str(retainF) + "_EPG" + str(edgePG))
+	if not os.path.exists(outF):
+		os.makedirs(outF)
+
+	nodePairs = defaultdict(dict)
+	naturalKey = make_key_naturalSort()
+
+	nameToInt = {}
+	nameIntCount = 0
+
+	edgeNum = 0
+	for line in open(edgesFP):
+		cols = line.rstrip().split(splitBy)
+
+		if header:
+			header -= 1
+			continue
+
+		gene1 = cols[0]
+		gene2 = cols[1]
+		edgeNum += 1
+
+		# Convert from name to int
+		if gene1 in nameToInt:
+			gene1 = nameToInt[gene1]
+		else:
+			nameToInt[gene1] = str(nameIntCount)
+			gene1 = str(nameIntCount)
+			nameIntCount += 1
+
+		if gene2 in nameToInt:
+			gene2 = nameToInt[gene2]
+		else:
+			nameToInt[gene2] = str(nameIntCount)
+			gene2 = str(nameIntCount)
+			nameIntCount += 1
+
+		#################################
+
+		nodePairs[gene1][gene2] = cols[2]
+
+	print("\tTotal edges =",edgeNum)
+
+	outFilePath = os.path.join(outF,"sym_edges.txt")
+	outFile = open(outFilePath,"w")
+	for node in sorted(nodePairs,key=naturalKey):
+		for node2 in sorted(nodePairs[node],key=naturalKey):
+			print(node,node2,nodePairs[node][node2],sep="\t",file=outFile)
+	outFile.close()
+
+	# Write out mapping between genes and ints
+	outFilePath2 = os.path.join(outF,"GeneIntMap.txt")
+	outFile2 = open(outFilePath2,"w")
+	for gene in sorted(nameToInt,key=naturalKey):
+		print(gene,nameToInt[gene],sep="\t",file=outFile2)
+	outFile2.close()
+
+	return outFilePath, outFilePath2
+
+
+def runFastUF(symEdgesF,geneIntMapF,tOutFold,fTOutFold,cOutFold,cOutTxtFold,cOutListsFold,runNum=10,retainNum=1,printEveryDiv=10):
+	MOD_SCORES = "modScores.txt"
+
+	print("\nRunning Fast Unfolding")
+	baseFold,symEdgesFile = os.path.split(symEdgesF)
+
+	# Make folders
+	T_OUT_FOLDER = os.path.join(baseFold,tOutFold)
+	TF_OUT_FOLDER = os.path.join(baseFold,fTOutFold)
+	C_OUT_FOLDER = os.path.join(baseFold,cOutFold)
+	C_OUT_TXT_FOLDER = os.path.join(baseFold,cOutTxtFold)
+	C_OUT_LISTS_FOLDER = os.path.join(baseFold,cOutListsFold)
+
+	# Remove old
+	if os.path.exists(TF_OUT_FOLDER):
+		shutil.rmtree(TF_OUT_FOLDER)
+	if os.path.exists(T_OUT_FOLDER):
+		shutil.rmtree(T_OUT_FOLDER)
+	if os.path.exists(C_OUT_FOLDER):
+		shutil.rmtree(C_OUT_FOLDER)
+	if os.path.exists(C_OUT_TXT_FOLDER):
+		shutil.rmtree(C_OUT_TXT_FOLDER)
+	if os.path.exists(C_OUT_LISTS_FOLDER):
+		shutil.rmtree(C_OUT_LISTS_FOLDER)
+
+	# Replace
+	if not os.path.exists(T_OUT_FOLDER):
+		os.makedirs(T_OUT_FOLDER)
+	if not os.path.exists(TF_OUT_FOLDER):
+		os.makedirs(TF_OUT_FOLDER)
+	if not os.path.exists(C_OUT_FOLDER):
+		os.makedirs(C_OUT_FOLDER)
+	if not os.path.exists(C_OUT_TXT_FOLDER):
+		os.makedirs(C_OUT_TXT_FOLDER)
+	if not
os.path.exists(C_OUT_LISTS_FOLDER): + os.makedirs(C_OUT_LISTS_FOLDER) + + ##################################### + + # Fast Unfold commands + def convertBinary(iF): + command = "convert -i " + iF + " -o " + iF[:-3]+"bin " + " -w " + iF[:-3] + "weights" + p = Popen(command,shell=True,stdout=PIPE,stderr=STDOUT) + stdout,stderr = p.communicate() + + def cluster(wF,iF,outFolder,runNum,PRINT_EVERY): + modScores = [] + + printEvery = PRINT_EVERY + tell = printEvery + count = 0 + + for i in range(runNum): + command = "louvain " + iF[:-3]+"bin -w " + iF[:-3] + "weights -l -1 > " + os.path.join(outFolder,str(i)) + p = Popen(command,shell=True,stdout=PIPE,stderr=STDOUT) + stdout,stderr = p.communicate() + modScores.append(stdout.rstrip()) + + # Notify position + count += 1 + if count == printEvery: + printEvery = printEvery + tell + print("\t",count) + + outFile = open(os.path.join(wF,MOD_SCORES),"w") + for i,score in enumerate(modScores): + print(i,score,sep="\t",file=outFile) + outFile.close() + + def selectClusterings(wF,modS,retainNumber,treeFold,finalTreeFold,splitBy="\t"): + clusters = [] + scores = [] + + for line in open(os.path.join(wF,modS)): + cols = line.rstrip().split(splitBy) + + clusters.append(cols[0]) + scores.append(float(cols[1])) + + j = zip(scores,clusters) + jSorted = sorted(j,reverse=True) + + scoresS,clustersS = zip(*jSorted) + retained = clustersS[:retainNumber] + + for clst in retained: + shutil.copy(os.path.join(treeFold,clst),os.path.join(finalTreeFold,clst)) + + def outputClusters(finalTreeFold,clustFold): + for path in glob.glob(os.path.join(finalTreeFold,"*")): + fileName = os.path.basename(path) + command = "hierarchy " + path + p = Popen(command,shell=True,stdout=PIPE,stderr=STDOUT) + stdout,stderr = p.communicate() + + bottomLevel = int(stdout.split("\n")[0].split(":")[-1].rstrip()) - 1 + + command = "hierarchy " + path + " -l " + str(bottomLevel) + " > " + os.path.join(clustFold,fileName) + p = Popen(command,shell=True,stdout=PIPE,stderr=STDOUT) + stdout,stderr = p.communicate() + + def readMappings(mF,splitBy="\t"): + + intToGene = {} + + for line in open(mF): + cols = line.rstrip().split(splitBy) + + if cols[0] in intToGene: + print("ALREADY SEEN",cols[0],"SHOULD BE NON-REDUNDANT!") + sys.exit() + + intToGene[cols[1]] = cols[0] + + return intToGene + + def processClusters(mapping,inFold,outFold,outListsFold,renumber_start=1,splitBy=" "): + + natKey = make_key_naturalSort() + + fileNumber = renumber_start + for path in sorted(glob.glob(os.path.join(inFold,"*")),key=natKey): + outFile = open(os.path.join(outFold,str(fileNumber) + ".csv"),"w") + + # Create folder to store split gene lists in + listFold = os.path.join(outListsFold,"Clust" + str(fileNumber)) + if not os.path.exists(listFold): + os.makedirs(listFold) + + fileNumber += 1 + + print("Id,Modularity Class",file=outFile) + byClust = defaultdict(list) + + for line in open(path): + cols = line.rstrip().split(splitBy) + + gene, clust = cols + clust = str(int(clust) + 1) # Start numbering from 1 not 0 + if gene in mapping: + byClust[clust].append(mapping[gene]) + else: + print("Gene (",gene,") is missing from mapping file!") + + for clst in sorted(byClust,key=natKey): + tempLF = open(os.path.join(listFold,"M" + str(clst) + ".txt"),"w") + for gene in sorted(byClust[clst],key=natKey): + print(gene,clst,sep=",",file=outFile) + print(gene,file=tempLF) + tempLF.close() + outFile.close() + + ##################################### + ##################################### + + print("\tConvert file to binary") + 
convertBinary(symEdgesF) + + print("\tRunning clusterings") + printEvery = int(round(runNum/float(printEveryDiv),0)) + + cluster(baseFold,symEdgesF,T_OUT_FOLDER,runNum,printEvery) + + print("\tRetrieving best",args.fuRetainNum,"clusterings") + selectClusterings(baseFold,MOD_SCORES,retainNum,T_OUT_FOLDER,TF_OUT_FOLDER) + + print("\tOutput clusterings from bottom level") + outputClusters(TF_OUT_FOLDER,C_OUT_FOLDER) + + # Post-process results back to human readable files + print("\tMap nodes back to genes and split into lists") + mappings = readMappings(geneIntMapF) + processClusters(mappings,C_OUT_FOLDER,C_OUT_TXT_FOLDER,C_OUT_LISTS_FOLDER,renumber_start=args.fuRenumberStart) + + +########################################################################################### + +def main(finishT="Finished!"): + + # Calculate correlations + corrh5,genesM = generateCorrMatrix(OUT_FOLDER,args.dataF,retainF=args.retainF,corrMatF=args.corrMatF,usePearson=args.usePearson,corrChunk=args.corrChunk,splitBy=args.fileSep,headerL=args.headerL) + + # Reduce edges + edgesFP,nodesFP = reduceEdges(OUT_FOLDER,args.dataF,args.gephiF,corrh5,genesM,retainF=args.retainF,edgePG=args.edgePG,corrMatF=args.corrMatF,keepBigF=args.keepBigF,ignoreDuplicates=args.ignoreDuplicates) + + # Prepare and run Fast Unfolding Of Communities in Large Networks (https://sourceforge.net/projects/louvain/files/louvain-generic.tar.gz/download) + outFP1,outFP2 = prepareFastUnfold(OUT_FOLDER,args.dataF,args.fastUF,edgesFP,retainF=args.retainF,edgePG=args.edgePG) + + # Run Fast Unfolding + if not args.noFastUF: + runFastUF(outFP1,outFP2,args.tOutFold,args.fTOutFold,args.cOutFold,args.cOutTxtFold,args.cOutListsFold,runNum=args.fuRunNum,retainNum=args.fuRetainNum) + else: + print("\nSkipping Fast-Unfolding clustering and downstream output as --noFastUF is set") + print(finishT) + + +if __name__ == '__main__': main() \ No newline at end of file diff --git a/PGCNA2/README.md b/PGCNA2/README.md new file mode 100644 index 0000000..f944bd1 --- /dev/null +++ b/PGCNA2/README.md @@ -0,0 +1,335 @@ +# PGCNA2 (Parsimonious Gene Correlation Network Analysis version 2) + +## Introduction + +PGCNA is a gene correlation network analysis approach that is computationally simple yet yields stable and biologically meaningful modules and allows visualisation of very large networks, showing substructure and relationships that are normally hard to see. The parsimonious approach, retaining the 3 most correlated edges per gene, results in a vast reduction in network complexity meaning that large networks can be clustered quickly and reduced to a small output file that can be used by downstream software. + +## Citation + +For more details see: +[Care, M.A., Westhead, D.R., Tooze, R.M., 2019. Parsimonious Gene Correlation Network Analysis (PGCNA): a tool to define modular gene co-expression for refined molecular stratification in cancer. npj Syst. Biol. Appl. 5, 13. https://doi.org/10.1038/s41540-019-0090-7](https://www.nature.com/articles/s41540-019-0090-7). + +Please cite this when using PGCNA. + +### Paper website + +PGCNA paper website: [http://pgcna.gets-it.net](http://pgcna.gets-it.net) + + +---------- + +## PGCNA2 + +**pgcna2.py** is a modified version of **pgcna-multi.py** that uses **Leidenalg** instead of the **Louvain** community detection algorithm. Leidenalg has been shown to produce better clustering solutions which guarantee well-connected communities (for more details see [Traag, V. A., Waltman, L., & van Eck, N. J. (2019). 
From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports, 9(1). https://doi.org/10.1038/s41598-019-41695-z](https://www.nature.com/articles/s41598-019-41695-z)).
+
+## Requirements
+
+Due to the inclusion of the leidenalg community detection method, pgcna2.py requires Python 3 and the following non-standard packages: numpy, scipy, h5py (and for clustering: python-igraph and leidenalg).
+We recommend the Anaconda distribution ([https://www.anaconda.com/download/](https://www.anaconda.com/download/)), which comes with numpy, scipy and h5py included and makes installing python-igraph and leidenalg simple. This script has been tested on Windows 10 (Python 3.7.3, numpy 1.16.4, scipy 1.2.1, h5py 2.9.0, python-igraph 0.7.1 and leidenalg 0.7.0) and Linux CentOS 6.9 (Python 3.7.3, numpy 1.16.4, scipy 1.3.0, h5py 2.9.0, python-igraph 0.7.1 and leidenalg 0.7.0).
+
+### Optional, but recommended
+
+pgcna2.py can be run without generating modules (**--noLeidenalg**) if all you require is output for network visualisation tools (e.g. Gephi/Cytoscape).
+
+#### Community detection with leidenalg
+
+To carry out clustering you will require the additional python packages **python-igraph** and **leidenalg**.
+
+##### Example installation
+
+On both Windows/Linux:
+
+Presuming that you're using Anaconda python (**must be a Py3 environment**) these can be installed with:
+```
+conda install -c vtraag python-igraph
+conda install -c vtraag leidenalg
+```
+
+
+#### Gephi
+
+For visualising the output of PGCNA we highly recommend using Gephi ([https://gephi.org/](https://gephi.org/)), which is able to lay out large networks very efficiently. It has the added bonus that it includes the **louvain/FastUnfold** method to quickly visualise the modularity of the network (however, we would recommend the leidenalg results over those produced by Gephi). See **Examples** below for details of loading the output from pgcna2.py into Gephi and analysing it.
+
+
+----------
+
+
+## Installation
+
+Using github:
+
+```
+git clone https://github.com/medmaca/PGCNA.git
+```
+
+Manual download: [https://github.com/medmaca/PGCNA/archive/master.zip](https://github.com/medmaca/PGCNA/archive/master.zip)
+
+
+----------
+
+
+## Overview
+
+We have endeavoured to make the process of using PGCNA as simple as possible; there are relatively few user choices. The default parameters will give good results and most users will only want to alter the **--corrChunk** option depending on the amount of available RAM (see **Common options**).
+
+### Note on terminology
+
+The leidenalg method **clusters** the network; each run will produce a **clustering** of the data and each of these will contain individual **clusters/modules** (the terms are used interchangeably). Each time leidenalg is run, due to a random start position, it will produce a different result, and though these results are highly related they differ in their exact gene/module groupings. PGCNA uses the leidenalg modularity score to rank the different clusterings of the data (number set with **-n**, **--laNumber**) and selects the best percentage (set with the **-b**, **--laBestPerc** parameter) for output.
+
+## Usage
+
+### Data-set analysis
+
+**pgcna2.py** can process both single and multiple data-sets (as used in the original paper).
+
+### Input file
+
+When running **pgcna2** the **-d/--dataF** option should point to a folder containing one or more data-sets. The input file(s) should be expression file(s) that have the format of unique identifiers (genes/probes etc.) in the first column, with the remaining columns containing the expression values across the samples. Redundancy of identifiers is not allowed and must be dealt with before processing with PGCNA. The data-set folder **also needs an additional file** (default name #FileInfo.txt; set by **-m/--metaInf**) that contains a tab separated list of the file names to process (1st column) and the number of header lines (to allow for meta data) in those files (2nd column). This file is also expected to have a header, e.g.:
+```
+Filename	HeaderLines
+FileName1.txt	3
+FileName2.txt	1
+FileName3.txt	12
+```
+
+### Basic run
+
+PGCNA run via **pgcna2.py** script:
+
+
+On Windows
+```
+python3 pgcna2.py -w workFolderPath -d expressionFilesFolderPath
+```
+
+On Linux
+```
+./pgcna2.py -w workFolderPath -d expressionFilesFolderPath
+```
+
+### Common options
+
+#### Alter RAM requirements (--corrChunk)
+
+Within the PGCNA method the step that requires the most RAM is calculating the pairwise correlations. For a small number of genes (<=20,000) this is easily carried out in memory, but the requirement increases dramatically with increasing gene/probe numbers: 2x10^4 = 3GB, 4x10^4 = 12GB, 6x10^4 = 27GB, 8x10^4 = 48GB, 1x10^5 = 75GB etc. PGCNA carries out only a small portion of processing in memory, breaking the large correlation problem into chunks. This means that with default settings (--corrChunk 5000) PGCNA can process a 2x10^5 matrix in <1GB of RAM, rather than the 300GB that would be required using memory alone.
+
+When processing larger matrices the user can choose to increase **--corrChunk** if they want to utilise more memory for calculating the correlation. While this will speed up the correlation step, it should be noted that this is only one step in the PGCNA process, so the total time saving may be minor.
+
+Setting PGCNA to have a correlation chunk size of 20,000
+
+```
+./pgcna2.py -w workFolderPath -d expressionFilesFolderPath --corrChunk 2e4
+```
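+
+For the curious, the chunking can be sketched in a few lines of NumPy/h5py. This is a minimal illustration of the approach the PGCNA scripts use for their correlation step, not the scripts' actual code; the function name and toy sizes here are ours:
+
+```python
+import numpy as np
+import h5py
+
+
+def chunked_corr(data, h5path, chunk=5000):
+    """Write the full gene x gene correlation matrix to HDF5, holding only
+    two row-chunks in RAM at a time (assumes no zero-variance rows)."""
+    n = data.shape[0]
+    # Standardise rows so that the dot product of two rows is their Pearson correlation
+    data = data - data.mean(axis=1)[:, None]
+    data /= np.sqrt((data * data).sum(axis=1))[:, None]
+    with h5py.File(h5path, "w") as f:
+        corr = f.create_dataset("corr", (n, n), dtype="f8")
+        for r in range(0, n, chunk):
+            for c in range(0, n, chunk):
+                corr[r:r + chunk, c:c + chunk] = np.dot(data[r:r + chunk], data[c:c + chunk].T)
+
+
+chunked_corr(np.random.rand(200, 30), "toyCorr.h5", chunk=50)
+```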
+
+#### Retain all genes
+
+With default settings PGCNA will only retain the top 80% most variable genes in the expression data for analysis. If the input expression files have been pre-filtered to remove invariant genes you may want to process all of the data; this can be accomplished using the -f/--retainF parameter:
+
+```
+./pgcna2.py -w workFolderPath -d expressionFilesFolderPath --retainF 1
+```
+
+#### Run without clustering
+
+If you don't have the **python-igraph/leidenalg** packages installed and wish to generate output for downstream tools you can still run PGCNA with the following command:
+
+```
+./pgcna2.py -w workFolderPath -d expressionFilesFolderPath --noLeidenalg
+```
+
+#### Changing input file separator and header size
+
+The default input format is a tab ("\t") separated file with a single header line. This can be changed with the **--fileSep** parameter, and the number of header lines is set using the meta-information file (default #FileInfo.txt). In addition, all the data-sets to be processed **must** have the same separator, so for a folder of comma-separated files:
+```
+./pgcna2.py -w workFolderPath -d expressionFilesFolderPath --fileSep ","
+```
+
+#### Changing the number of retained edges
+
+The default is to retain the top 3 most correlated edges per gene, and in our paper we show that there is no benefit from increasing this. However, should users wish to increase this they can do so using the **--edgePG** parameter. Increasing the number of edges will result in the **leidenalg** method generating fewer clusters/modules, yet these clusters will be supersets of the clusters/modules produced with fewer edges. Note: when expression data contains groups of highly correlated genes it is possible for the default 3 edges to generate orphan modules (modules disconnected from the rest of the network); in this situation increasing the number of edges should reconnect these to the rest of the network.
+
+To run PGCNA retaining 5 edges
+
+```
+./pgcna2.py -w workFolderPath -d expressionFilesFolderPath --edgePG 5
+```
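+
+Under the hood this edge reduction is just a per-row "top-k" cut of the correlation matrix (in pgcna.py this is the `bottomToZero` helper, called with edgePG + 1 because each row still contains the gene's self-correlation). A minimal stand-alone sketch with made-up numbers:
+
+```python
+import numpy as np
+
+
+def keep_top_k(row, k=3):
+    """Zero everything except the k largest values in a 1-D array
+    (duplicate correlations ignored, as with --ignoreDuplicates)."""
+    idx = np.argpartition(row, -k)  # last k indices hold the k largest values
+    row[idx[:-k]] = 0
+    return row
+
+
+print(keep_top_k(np.array([0.2, 0.9, 0.1, 0.7, 0.5]), k=3))
+# -> [ 0.   0.9  0.   0.7  0.5]
+```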
+
+#### Leidenalg options
+
+By default PGCNA will run 100 clusterings of the data using leidenalg and will retain the best 10% (judged by leidenalg modularity score). Users may wish to increase both the number of clusterings of the network that are carried out and the fraction retained for downstream processing (copied into the lBaseFold/BEST folder).
+
+For instance to run 10,000 clusterings and retain the top 20%:
+
+```
+./pgcna2.py -w workFolderPath -d expressionFilesFolderPath -n 1e4 -b 20
+```
+
+#### Options specific to processing multiple expression files
+
+If pgcna2.py is processing multiple expression files it merges the correlations across data-sets by calculating a gene's median correlation across all data-sets that contain that gene. During this process pgcna2.py can, if requested, output a folder containing a single file per gene in the median correlation matrix. This file shows the correlations between that single gene and all other genes across the different data-sets, allowing the user to see if the data-sets have a good level of agreement, and potentially highlighting outliers.
+
+To output all single gene correlation files (**--singleCorr**). Note: this **significantly** increases the time to construct the median correlation matrix:
+
+```
+./pgcna2.py -w workFolderPath -d expressionFilesFolderPath --singleCorr
+```
+
+If you want to limit the single gene correlations output to a subset of the total genes, you can use the **--singleCorrL** option. This uses a file in the **--workFolder** (default corrGenes.txt; set by **--singleCorrListF**) to provide a list of genes to limit the output to. If **--singleCorrL** is set and corrGenes.txt is missing it will be created and the default gene (IRF4) included. Using **--singleCorrL** is highly recommended over **--singleCorr** if you're only interested in a small number of genes, due to the significant speed increase it provides.
+
+```
+./pgcna2.py -w workFolderPath -d expressionFilesFolderPath --singleCorrL
+```
+
+See **Output** for information on all downstream files.
+
+----------
+
+
+## Parameters
+
+Complete list of parameters and their default values.
+
+You can also use the built-in help flag:
+```
+./pgcna2.py -h
+```
+
+### Required
+
+|Parameter|Description|Default Value|
+|---|---|---|
+|-w, --workFolder|Base folder for all output|**Required**|
+|-d, --dataF|Path to folder of expression file(s)|**Required**|
+
+### Input file related
+
+|Parameter|Description|Default Value|Multi specific|
+|---|---|---|---|
+|-s, --fileSep|Separator used in expression file|"\t"| |
+|-m, --metaInf|File containing information about data files to process (must be in same folder as --dataF)|#FileInfo.txt| |
+|-f, --retainF|Retain gene fraction -- keeping most variant genes|0.8| |
+|-g, --geneFrac|Fraction of expression files a gene needs to be present in to be included in median correlation matrix|1/3|Yes|
+
+### Edge reduction related
+
+|Parameter|Description|Default Value|
+|---|---|---|
+|-e, --edgePG|Edges to keep per gene -- **Recommend leaving as default**|3|
+|-r, --roundL|Decimal places to round correlations to before edge reduction|3|
+
+### Folders
+
+|Parameter|Description|Default Value|Multi specific|
+|---|---|---|---|
+|--outF|Root output folder|"PGCNA"| |
+|--corrMatF|Correlation matrix folder -- where HDF5 files are stored|"CORR_MATRIX"| |
+|--corrMatFS|Folder for single gene correlation files|"CORR_MATRIX_SG"|Yes|
+|--gephiF|Folder to store files for Gephi|"GEPHI"| |
+
+### Control usage
+
+|Parameter|Description|Default Value|Multi specific|
+|---|---|---|---|
+|--noLeidenalg|Flag -- don't run **leidenalg** clustering but complete everything else|False| |
+|--usePearson|Flag -- Instead of Spearman's ranked correlation coefficient calculate Pearson's|False| |
+|--keepBigF|Flag -- Retain big HDF5 files after finishing, if needed for independent downstream analysis|False|For multiple files --keepBigF retains median corr HDF5|
+|--keepBigFA|Flag -- Retain **ALL** big HDF5 files after finishing, if needed for independent downstream analysis|False|Yes|
+|--corrChunk|Size of chunk to split correlation problem over -- higher values will speed up correlation calculation at the cost of RAM|5000| |
+|--ignoreDuplicates|Flag -- Ignore correlation duplicates when cutting top --edgePG genes, faster if set but may miss genes if correlations are duplicated|False| |
+|--singleCorr|Flag -- Output individual gene correlation files -- Warning this generates 1 file per gene in final correlation matrix|False|Yes|
+|--singleCorrL|Flag -- Output individual gene correlation files -- limited to those in --singleCorrListF|False|Yes|
+|--singleCorrListF|If --singleCorrL is set then create single correlation files for all those in this file (must be within --workFolder)|corrGenes.txt|Yes|
+
+### Clustering related
+
+|Parameter|Description|Default Value|
+|---|---|---|
+|-n, --laNumber|Number of times to run **leidenalg** method|100|
+|-b, --laBestPerc|Percentage of best (based on leidenalg modularity) clusterings to copy to lBaseFold/BEST|10|
+|--lBaseFold|Base folder for leidenalg|"LEIDENALG"|
+|--lClustTxtFold|Leidenalg Clusters text folder|ClustersTxt|
+|--lClustListFold|Leidenalg Clusters module list folder|ClustersLists|
+
+### Output
+
+If PGCNA is run using default settings, the root folder (-w, --workFolder) will contain the following:
+
+* **EPG3** (Edge per gene 3 folder)
+	* *CorrelationSettingsInfo.txt : timestamped file containing the parameters used for the run.
+	* **CORR_MATRIX_GMF*** (if --keepBigF) : contains the temporary files used for calculating correlations
+		* *_RetainF*.h5 : The HDF5 file of all pairwise correlations after edge reduction
+		* *_RetainF*_Genes.txt : Genes that remain after -f/--retainF filtering, ordered by descending standard deviation
+	* **CORR_MATRIX_SG_GMF*** (if --singleCorr/--singleCorrL) : contains gzipped single gene correlation files split into first letter subfolders
+	* **GEPHI** : contains files for processing with the Gephi package ([https://gephi.org/](https://gephi.org/))
+		* *_RetainF1.0_EPG3_Edges.tsv.gz : Tab separated file of edges
+		* *_RetainF1.0_EPG3_Nodes.tsv.gz : Tab separated file of nodes
+	* **LEIDENALG** : Root folder for output from leidenalg clustering
+		* **ALL** : contains all results from clustering
+			* **ClustersLists** : Contains one folder per clustering, each of which contains one text file per module/cluster containing that module's genes.
+			* **ClustersTxt** : Contains one *.csv file per clustering, each of which contains info on the gene/module pairings.
+		* **BEST** : contains the best **-b/--laBestPerc** percentage results (judged by leidenalg modularity score)
+			* **ClustersLists** : Contains one folder per clustering, each of which contains one text file per module/cluster containing that module's genes.
+			* **ClustersTxt** : Contains one *.csv file per clustering, each of which contains info on the gene/module pairings.
+		* moduleInfoAll.txt : Contains meta information about clusterings for the data in the **ALL** folder
+		* moduleInfoBest.txt : Contains meta information about clusterings for the data in the **BEST** folder
+
+
+## Examples
+
+### Gephi Visualisation
+
+Visualising the output from PGCNA in Gephi is quite straightforward (correct as of Gephi v0.9.2):
+
+Using data from the **EPG3/GEPHI** folder
+
+1. Open Gephi
+2. Select File/"New Project"
+3. Select "Data Table" (tab above screen centre)
+4. Select "Import Spreadsheet"
+	1. Select *_RetainF1.0_EPG3_Nodes.tsv , make sure "Import as": "Nodes table", click **Next** and then **Finish**. Hopefully it will import with no issues; select **OK**
+	2. Repeat with *_RetainF1.0_EPG3_Edges.tsv, making sure that "Import as" : "Edges table", click **Next** and then **Finish**. Hopefully it will import with no issues; select **OK**
+5. Click "Graph" (tab above screen centre) -- you should now see a massive blob of nodes in the screen centre
+6. Under Statistics/NetworkOverview (right of screen) select **Modularity**. This will run the built-in version of the louvain/fastUnfold method.
+	1. Each run will generate a different clustering; the higher the score, the better the clustering is perceived to be.
+7. Select Appearance/Partition (left of screen) and select "Modularity Class" from the drop-down list
+	1. Click the Palette button in the bottom right of this panel and select Generate. Untick "Limit number of colors" and then select **OK**.
+	2. If you want you can manually change any of the resultant colours by left clicking on them and dragging the mouse around.
+	3. When you're happy with the range of colours select the **Apply** button.
+8. To lay out the network:
+	1. Under **Layout** (left middle of screen) select **ForceAtlas2**
+	2. For big networks set the following ("Approximate Repulsion":selected, Scaling:<1 often as small as 0.1, "LinLog mode":selected and importantly "Prevent Overlap":**not** selected).
+	3. Wait until the network layout has finished (You may need to alter **Scaling** if all nodes are pushed to the edge of the screen.)
+	4. Once you're happy with the network set "Prevent Overlap":selected to finally prevent node overlaps.
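+
+Before (or after) importing into Gephi it can be worth sanity-checking the edge table from Python. A short illustrative snippet (the file name is a placeholder for whatever your run produced):
+
+```python
+import csv
+import gzip
+
+# Placeholder name -- substitute your own *_Edges.tsv.gz file
+with gzip.open("myData_RetainF1.0_EPG3_Edges.tsv.gz", "rt") as f:
+    edges = list(csv.DictReader(f, delimiter="\t"))
+
+print(len(edges), "edges")
+print(edges[0]["Source"], edges[0]["Target"], edges[0]["Weight"])
+# If the edge count is much greater than --edgePG x the number of nodes,
+# see Troubleshooting below.
+```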
+
+The Gephi version of louvain/fastUnfold is inferior to the leidenalg method used by PGCNA, so if you've run PGCNA with clustering included, you should use the results contained in **EPG3/LEIDENALG/BEST/ClustersTxt**. Assuming you've already loaded the network (see above):
+
+1. Select "Data Table" (tab above screen centre)
+2. Select "Import Spreadsheet"
+	1. Within ClustersTxt choose a *.csv file to import
+	2. Make sure "Import as": "Nodes table", click **Next**, make sure **Modularity Class** is set to integer then click **Finish**.
+3. Follow step 7 above to set appearance with the new Modularity Class data.
+
+## Troubleshooting
+
+### Too many edges
+
+If after running pgcna2.py the number of edges in the network is >> **-e/--edgePG** x nodes (retained genes) then there are two likely causes:
+
+1. The data-set has many highly related genes/probes that have identical correlations, and thus aren't reduced during edge reduction.
+2. The data-set being analysed is very sparse in nature (e.g. from single-cell RNA-seq).
+
+In both cases setting the **--usePearson** option will potentially improve the result. This is because PGCNA uses an approximation of Spearman's correlation to reduce memory overhead and decrease run time. However, this approximation doesn't deal with tied ranks and thus will have issues when there are blocks of highly correlated (or indeed sparse) data: for example, the tied values [0, 0, 0, 1, 2] are given the ordinal ranks [1, 2, 3, 4, 5] by the approximation, rather than the averaged ranks [2, 2, 2, 4, 5] that a true Spearman would use.
+
+However, if you are suffering from this issue, we would suggest that you first explore your data-set to try to understand why you have so many redundant (highly correlated) genes/probes. In addition, while pgcna2.py will generate meaningful results for scRNA-seq data (with --usePearson), it has not been developed with that in mind and thus we'd recommend you seek another approach for such data.
+
+## Feedback and questions
+
+If you have any queries or notice any bugs please email me at **m.a.care@leeds.ac.uk** (please include PGCNA in the subject heading).
diff --git a/PGCNA2/pgcna2.py b/PGCNA2/pgcna2.py
new file mode 100644
index 0000000..b9f68a2
--- /dev/null
+++ b/PGCNA2/pgcna2.py
@@ -0,0 +1,1124 @@
+#!/usr/bin/env python
+"""
+This program is free software: you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation, either version 3 of the License, or
+(at your option) any later version.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with this program. If not, see <http://www.gnu.org/licenses/>.
+
+Author: Matthew Care
+
+A memory-efficient implementation of PGCNA that requires very little RAM
+to process very large expression data-sets.
+
+This version is for processing multiple data-sets, by calculating a
+median correlation matrix.
+
+# PGCNA2 changes #
+This is a modification of the original pgcna-multi.py that uses the
+improved Leidenalg community detection method (https://arxiv.org/abs/1810.08473)
+over the Louvain community detection method used originally.
+ +""" +import sys +import os +import argparse +import shutil +import string +from datetime import datetime +from collections import defaultdict +import re +import gzip +import random +import numpy as np +import numpy.ma as ma +import scipy.stats as stats +import h5py + +##----------------------------------------Create argparse argments----------------------------------------## +parser = argparse.ArgumentParser() +# Required +parser.add_argument("-w", "--workFolder", help="Work Folder [REQUIRED]", default=None) +parser.add_argument("-d", "--dataF", help="Expression data folder path [REQUIRED]", default=None) + +# Optional +parser.add_argument("-m", "--metaInf", help="File containing information about data files (must be in same folder as --dataF) [#FileInfo.txt]", default="#FileInfo.txt") +parser.add_argument("-s", "--fileSep", help="Separator used in expression file(s) [\t]", default="\t") +parser.add_argument("-f", "--retainF", help="Retain gene fraction -- keeping most variant genes [0.8]", type=float, default=0.8) +parser.add_argument("-g", "--geneFrac", help="Fraction of files a gene needs to be present in to be included in median corr matrix [1/3]", type=float, default=1/3.0) +parser.add_argument("-e", "--edgePG", help="Edges to keep per gene [3] -- Highly recommend leaving as default", type=int, default=3) +parser.add_argument("-r", "--roundL", help="Decimal places to round to before edge reduction [3]", type=int, default=3) + +# Not often changed +parser.add_argument("--outF", help="Root output folder [PGCNA]", default="PGCNA") +parser.add_argument("--corrMatF", help="Correlation matrix folder [CORR_MATRIX]", default="CORR_MATRIX") +parser.add_argument("--corrMatFS", help="Folder for single gene correlation files [CORR_MATRIX_SG]", default="CORR_MATRIX_SG") +parser.add_argument("--gephiF", help="Folder to store files for Gephi [GEPHI]", default="GEPHI") + +# Flags +parser.add_argument("--noLeidenalg", help="Don't try to run Leidenalg clustering, but complete everything else [False] -- Flag", action="store_true") +parser.add_argument("--usePearson", help="Use Pearson Correlation instead of Spearman [False] -- Flag", action="store_true") +parser.add_argument("--keepBigFA", help="Keep ALL big HDF5 files after finishing [False] -- Flag", action="store_true") +parser.add_argument("--keepBigF", help="Keep median correlations HDF5 files after finishing [False] -- Flag", action="store_true") +parser.add_argument("--ignoreDuplicates", help="Ignore correlation duplicates when cutting top --edgePG genes [False] -- Flag, faster if set", action="store_true") +parser.add_argument("--singleCorr", help="Output individual gene correlation files -- Warning this generates 1 file per gene in final correlation matrix. 
[False] -- Flag", action="store_true")
+parser.add_argument("--singleCorrL", help="Output individual gene correlation files -- limited to those in --singleCorrListF [False] -- Flag", action="store_true")
+
+# Single corr related
+parser.add_argument("--singleCorrListF", help="If --singleCorrL is set then create single correlation files for all those in this file (must be within --workFolder)", default="corrGenes.txt")
+
+# Correlation related
+parser.add_argument("--corrChunk", dest="corrChunk", help="Size of chunk (rows) to split correlation problem over [5000] -- Higher will speed up correlation calculation at cost of RAM", default=5000, type=float)
+
+# Leidenalg (community detection) specific
+parser.add_argument("-n", "--laNumber", dest="laRunNum", help="Number of times to run Leidenalg [100]", default=100, type=int)
+parser.add_argument("-b", "--laBestPerc", dest="laBestPerc", help='Copy top [10] %% of clusterings into lBaseFold/BEST folder', default=10, type=int)
+parser.add_argument("--lBaseFold", dest="lBaseFold", help="Leidenalg base folder [LEIDENALG]", default="LEIDENALG")
+parser.add_argument("--lClustTxtFold", dest="lClustTxtFold", help="Leidenalg Clusters text folder [ClustersTxt]", default="ClustersTxt")
+parser.add_argument("--lClustListFold", dest="lClustListFold", help="Leidenalg Clusters module list folder [ClustersLists]", default="ClustersLists")
+
+
+###########################################################################################
+args = parser.parse_args()
+if not args.workFolder or not args.dataF:
+	print("\n\nNeed to specify REQUIRED variables, see help (-h)")
+	sys.exit()
+###########################################################################################
+OUT_FOLDER = os.path.join(args.workFolder, args.outF, "EPG" + str(args.edgePG))
+if not os.path.exists(OUT_FOLDER):
+	os.makedirs(OUT_FOLDER)
+
+###########################################################################################
+args.metaInf = os.path.join(args.dataF, args.metaInf)
+###########################################################################################
+if not args.noLeidenalg:
+	try:
+		import igraph as ig
+	except ModuleNotFoundError:
+		print("\nCould not import the igraph module, make sure it is installed (e.g. conda install -c vtraag python-igraph). To skip the clustering step use the --noLeidenalg flag, see readme for more information\n")
+		sys.exit()
+
+	try:
+		import leidenalg as la
+	except ModuleNotFoundError:
+		print("\nCould not import the leidenalg module, make sure it is installed (e.g. conda install -c vtraag leidenalg). To skip the clustering step use the --noLeidenalg flag, see readme for more information\n")
+		sys.exit()
+
+
+###########################################################################################
+# Tidy input
+# Decode fileSep so a literal "\t" passed on the command line becomes a real tab
+args.fileSep = args.fileSep.encode().decode('unicode_escape')  # string-escape in py2
+args.corrChunk = int(round(args.corrChunk, 0))
+args.laRunNum = int(round(args.laRunNum, 0))
+
+##----------------------------------------LOGGER----------------------------------------##
+
+
+class multicaster(object):
+	"""Write the same text to several file handles at once"""
+	def __init__(self, filelist):
+		self.filelist = filelist
+
+	def write(self, str):
+		for f in self.filelist:
+			f.write(str)
+
+
+def concatenateAsString(joinWith, *args):
+	# Join any mix of values as strings, e.g. ("\n", "a", 1) -> "a\n1"
+	temp = [str(x) for x in args]
+	return joinWith.join(temp)
+
+print("##----------------------------------------", str(datetime.now()), "----------------------------------------##", sep="")
+# print out settings
+settingInf = concatenateAsString(
+	"\n",
+	"##----------------------------------------Arguments----------------------------------------##",
+	"# Required",
+	"WorkFolder [-w,--workFolder] = " + args.workFolder,
+	"Expression data file path [-d,--dataF] = " + args.dataF,
+
+	"\n# Optional",
+	"Meta info file describing expression files [-m, --metaInf] = " + args.metaInf,
+	"Separator used in expression file [-s, --fileSep] = " + args.fileSep,
+	"Retain gene fraction [-f, --retainF] = " + str(args.retainF),
+	"Fraction of expression files gene required in to be retained [-g, --geneFrac] = " + str(args.geneFrac),
+	"Edges to retain per gene [-e, --edgePG] = " + str(args.edgePG),
+	"Decimal places to round to before edge reduction [-r, --roundL] = " + str(args.roundL),
+
+	"\n# Main Folders",
+	"Root output folder [--outF] = " + args.outF,
+	"Correlation matrix folder [--corrMatF] = " + args.corrMatF,
+	"Single Gene Correlation files folder [--corrMatFS] = " + args.corrMatFS,
+	"Gephi files folder [--gephiF] = " + args.gephiF,
+
+	"\n# Flags",
+	"Don't run Leidenalg clustering [--noLeidenalg] = " + str(args.noLeidenalg),
+	"Use Pearson Correlation [--usePearson] = " + str(args.usePearson),
+	"Keep all big HDF5 files after run [--keepBigFA] = " + str(args.keepBigFA),
+	"Keep median correlations HDF5 files after run [--keepBigF] = " + str(args.keepBigF),
+	"Ignore correlation duplicates when cutting top --edgePG genes [--ignoreDuplicates] = " + str(args.ignoreDuplicates),
+	"Output individual gene correlation files [--singleCorr] = " + str(args.singleCorr),
+	"Output individual gene correlation files for select list (--singleCorrListF) [--singleCorrL] = " + str(args.singleCorrL),
+
+	"\n# Single gene correlation options",
+	"List of genes to process if --singleCorrL [--singleCorrListF]:\t" + str(args.singleCorrListF),
+
+	"\n# Correlation Options",
+	"Chunk size (rows) to split correlation over [--corrChunk]:\t" + str(args.corrChunk),
+
+	"\n# Leidenalg Specific",
+	"Run number [-n, --laNumber]:\t" + str(args.laRunNum),
+	"Copy top % of clusterings into *_BEST folder [-b, --laBestPerc]:\t" + str(args.laBestPerc),
+	"Base folder [--lBaseFold] = " + str(args.lBaseFold),
+	"Clusters text folder [--lClustTxtFold] = " + str(args.lClustTxtFold),
+	"Clusters List folder [--lClustListFold] = " + str(args.lClustListFold),
+
+)
+
+settingsF = open(os.path.join(OUT_FOLDER, str(datetime.now()).replace(" ", "-").replace(":", ".")[:-7] + "_CorrelationSettingsInfo.txt"), "w")
+print(settingInf, file=settingsF)
+settingsF.close()
+
+###########################################################################################
+
+
+##----------------------------------------Methods----------------------------------------##
+
+
+def makeSafeFilename(inputFilename):
+	# Declare safe characters here
+	safechars = string.ascii_letters + string.digits + "~-_.@#"  # string.letters in py2
+	try:
+		return "".join(list(filter(lambda c: c in safechars, inputFilename)))
+	except:
+		return ""
+
+
+def make_key_naturalSort():
+	"""
+	A factory function: creates a key function to use in sort.
+	Sort data naturally, e.g. ["M1", "M10", "M2"] sorts as ["M1", "M2", "M10"]
+	"""
+	def nSort(s):
+		convert = lambda text: int(text) if text.isdigit() else text
+		alphanum_key = lambda key: [convert(c) for c in re.split('([0-9]+)', key)]
+
+		return alphanum_key(s)
+	return nSort
+
+
+def listToPercentiles(x, inverse=False):
+	data = stats.rankdata(x, "average")/float(len(x))
+
+	if inverse:
+		return (1-data) * 100
+	else:
+		return data * 100
+
+
+def dataSpread(x):
+	"""
+	Returns the min, Q1 (25%), median (Q2), Q3 (75%), max, IQR, Quartile Coefficient of Dispersion and IQR/Median (CV like)
+	"""
+
+	q1 = float(np.percentile(x, 25, interpolation="lower"))
+	q2 = np.percentile(x, 50)
+	q3 = float(np.percentile(x, 75, interpolation="higher"))
+
+	if (q2 == 0) or ((q3+q1) == 0):
+		return min(x), q1, q2, q3, max(x), abs(q3-q1), 0, 0
+	else:
+		return min(x), q1, q2, q3, max(x), abs(q3-q1), abs((q3-q1)/(q3+q1)), abs((q3-q1)/q2)
+
+
+def mad(arr):
+	""" Median Absolute Deviation: a "Robust" version of standard deviation.
+		Indicates variability of the sample.
+	"""
+	arr = np.array(arr)
+	med = np.median(arr)
+	return np.median(np.abs(arr - med))
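+
+
+# Illustrative example of the spread measures above (not part of the pipeline):
+# for x = [1, 2, 3, 4, 100], dataSpread(x) returns min=1, Q1=2, Q2=3, Q3=4,
+# max=100, IQR=2, QCOD~0.33 and IQR/median~0.67, while mad(x) == 1.0. Unlike
+# np.std(x) (~39), these robust measures are barely moved by the outlier,
+# which is why they are used for the per-gene meta information below.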
+
+
+def loadCorrGeneList(workF, listFN, defaultG="IRF4"):
+
+	print("\nLoading list of genes for single correlation file output")
+	fileP = os.path.join(workF, listFN)
+
+	if not os.path.exists(fileP):
+		print("\t\t", fileP, "does not exist, will create file and add default gene (", defaultG, ")")
+		outF = open(fileP, "w")
+		print(defaultG, file=outF)
+		outF.close()
+
+		corrG = {defaultG: 1}
+
+	else:
+		corrG = {}
+		for line in open(fileP):
+			line = line.rstrip()
+
+			corrG[line] = 1
+
+	print("\tTotal genes:", len(corrG))
+
+	return corrG
+
+
+def generateGeneMeta(outF, fileN, geneMetaInfo, genesNP, npA):
+	"""
+	Generate meta information for gene
+	"""
+	###########################################################################################
+	def expToPercentiles(expToPerc, expA):
+		return [expToPerc[e] for e in expA]
+
+	###########################################################################################
+	print("\t\t# Generate gene meta information : ", end="")
+
+	expressionVals = npA.flatten()
+
+	# Convert from expression values to percentiles
+	expPercentiles = listToPercentiles(expressionVals, inverse=False)
+
+	# Calculate mapping from expression to percentile
+	expToPerc = {}
+	for i, e in enumerate(expressionVals):
+		expToPerc[e] = expPercentiles[i]
+
+	outF = open(os.path.join(outF, fileN + "_metaInf.txt"), "w")
+	for i, gene in enumerate(genesNP):
+		# Convert from expression values to percentiles, so can work out MAD of percentiles
+		genePercentiles = expToPercentiles(expToPerc, npA[i])
+
+		# min, Q1 (25%), median (Q2), Q3 (75%), max, IQR, Quartile Coefficient of Dispersion and IQR/Median (CV like)
+		try:
+			minE, q1E, q2E, q3E, maxE, iqrE, qcodE, iqrME = dataSpread(npA[i])
+		except:
+			print("Issue with :", gene, npA[i])
+			sys.exit()
+
+		medianPercentiles = np.median(genePercentiles)
+		print(gene, q2E, qcodE, medianPercentiles, sep="\t", file=outF)
+
+		geneMetaInfo[gene].append([q2E, qcodE, medianPercentiles])
+
+	outF.close()
+	print("done")
+
+
+def mergeMetaInfo(geneMetaInfo):
+	print("\n\tMerging gene meta information")
+	medianMetaInfo = {}
+
+	for gene in geneMetaInfo:
+		medians, qcods, percentiles = zip(*geneMetaInfo[gene])
+		# Store Median_QCOD(varWithin), MedianPercentile, MADpercentile(VarAcross)
+		medianMetaInfo[gene] = [np.median(qcods), np.median(percentiles), mad(percentiles)]
+
+	return medianMetaInfo
+
+
+def loadMeta(metaFile, dataF, splitBy="\t", headerL=1):
+
+	print("\n\nLoad meta-file (", os.path.basename(metaFile), ")", sep="")
+	if not os.path.exists(metaFile):
+		print("\t\t# Meta file (", metaFile, ") does not exist!", sep="")
+		sys.exit()
+
+	header = headerL
+	fileInfo = {}
+	for line in open(metaFile):
+		cols = line.rstrip().split(splitBy)
+
+		if header:
+			header -= 1
+			continue
+
+		if line.rstrip() == "":
+			continue
+
+		totalPath = os.path.join(dataF, cols[0])
+		if not os.path.exists(totalPath):
+			print("\t\t# File (", totalPath, ") does not exist!, won't add to fileInfo", sep="")
+		else:
+			try:
+				fileInfo[totalPath] = int(cols[1])
+			except:
+				print("Meta file line (", line.rstrip(), ") is not formed properly, skipping")
+
+	print("\tLoaded information on:", len(fileInfo), "files")
+	return fileInfo
+
+
+def getGenes(metaInf, splitBy="\t", geneFrac=1/3.0, retainF=0.8):
+	"""
+	First pass over expression data-files to find the set of genes we will be
+	working with after filtering each data-set by retainF and then only keeping
+	genes that are present in geneFrac
+	"""
+	natSort = make_key_naturalSort()
+
+	geneCount = defaultdict(int)
+
+	print("\nLoad information on genes present per expression array:")
+	for fileP in sorted(metaInf, key=natSort):
+		fileN = os.path.basename(fileP)
+
+		print("\t", fileN, end="")
+
+		headerL = metaInf[fileP]
+		#############################
+		# Read in data file
+		seenGenes = {}
+		genes = []
+		expression = []
+
+		print("\tReading expression data file:", fileN)
+
+		for i, line in enumerate(open(fileP)):
+			cols = line.rstrip().replace('"', '').replace("'", '').split(splitBy)
+
+			if headerL:
+				headerL -= 1
+				continue
+
+			genes.append(cols[0])
+			if not cols[0] in seenGenes:
+				seenGenes[cols[0]] = 1
+			else:
+				print("\n\nERROR: Duplicate gene (", cols[0], ") in expression data (", fileN, "), expects unique identifiers!\nPlease remove duplicates and re-run\n", sep="")
+				sys.exit()
+			expression.append(cols[1:])
+
+		print("\t\tTotal genes = ", len(genes))
+		#############################
+
+		#############################
+		print("\t\t# Generate Numpy matrices")
+		# Move genes to numpy array, so can be sorted along with the rest
+		genesNP = np.array(genes)
+
+		# Create numpy array
+		nrow = len(expression[0])
+		ncol = len(expression)
+		npA = np.empty((ncol, nrow), dtype=object)
+		for i, eSet in enumerate(expression):
+			npA[i] = eSet
+
+		# Calculate SD for each gene
+		npSD = np.std(npA.astype(np.float64), axis=1)
+
+		# Sort by SD
+		sortI = np.argsort(npSD)[::-1]  # Sort descending (most variant genes first)
+		# Sort matrix by index
+		npA = npA[sortI, :]
+
+		# Sort genes by index
+		genesNP = genesNP[sortI]
+
+		# Only retain top fract #
+		# Work out index to cut at
+		cutPos = int(round(retainF * genesNP.shape[0], 0))
+		print("\t\tRetain Fraction (", retainF, ") Cut at: ", cutPos, sep="")
+		print("\t\t\tPre-cut shape:", npA.shape)
+
+		# Retain most variant
+		npA = npA[:cutPos, ].astype(np.float64)
+		genesNP = genesNP[:cutPos]
+		print("\t\t\tPost-cut shape:", npA.shape)
+
+		# Keep track of seen genes
+		for gene in genesNP:
+			geneCount[gene] += 1
+
+	print("\n\tTotal unique genes/identifiers:", len(geneCount))
+
+	requiredFileNum = int(round(geneFrac * len(metaInf), 0))
+	print("\tRetaining genes present in ", round(geneFrac, 2), " (>=", requiredFileNum, ") of files : ", end="", sep="")
+
+	genesToKeep = {}
+	for g in geneCount:
+		if geneCount[g] >= requiredFileNum:
+			genesToKeep[g] = 1
+
+	print(len(genesToKeep))
+
+	return genesToKeep, requiredFileNum
+
+
+def loadHDF5files(outF, hdf5Paths):
+
+	hdf5Files = {}
+	for fileN in hdf5Paths:
+		hdf5Files[fileN] = h5py.File(fileN, 'r')
+
+	return hdf5Files
+
+
+def generateMasks(sortedKeepGenes, genesPerHDF5, hdf5paths):
+
+	print("\tGenerating index mappings and masks to speed up combining correlations")
+
+	masks = np.zeros((len(hdf5paths), len(sortedKeepGenes)))
+	for i, g in enumerate(sortedKeepGenes):
+		for j, hd5 in enumerate(hdf5paths):
+			if g not in genesPerHDF5[hd5]:
+				masks[j][i] = 1  # Mask missing values
+	return masks
+
+
+def generateCorrMatrix(workF, metaInf, genesToKeep, requiredFileNum, singleCorrList, geneFrac=1/3.0, retainF=0.8, corrMatF="CORR_MATRIX", corrMatFS="CORR_MATRIX_SG", usePearson=False, corrChunk=5000, splitBy="\t", decimalP=3, printEveryDiv=10, keepBigFA=False, singleCorr=False, singleCorrL=False):
+	"""
+	Memory efficient method for generating all pairwise correlations of genes (rows) across a set of samples (columns). Uses HDF5 files to greatly
+	reduce memory usage, keeping most data residing on the hard disk.
+	"""
+	print("\n\nGenerate correlations for expression data:")
+	natSort = make_key_naturalSort()
+	sortedKeepGenes = sorted(genesToKeep, key=natSort)
+
+	# Find mapping of keep genes
+	keepGeneMapping = {}
+	for i, g in enumerate(sortedKeepGenes):
+		keepGeneMapping[g] = i
+
+	# Create subfolder
+	outF = os.path.join(workF, corrMatF + "_GMF" + str(requiredFileNum))
+	if not os.path.exists(outF):
+		os.makedirs(outF)
+
+	if singleCorr or singleCorrL:
+		outFS = os.path.join(workF, corrMatFS + "_GMF" + str(requiredFileNum))
+		if not os.path.exists(outFS):
+			os.makedirs(outFS)
+
+	geneMetaInfo = defaultdict(list)  # Per data-set store gene meta information
+	genePosPerHDF5 = defaultdict(dict)  # Mapping of location of gene in HDF5 file
+	perHDF5index = defaultdict(list)  # Mapping of gene to where they should be in final matrix
+	genesPerHDF5 = defaultdict(dict)  # Keep tally of which genes are in which HDF5 file
+	hdf5paths = []
+
+	print("\n\tCalculating correlations for data file:")
+	for fileP in sorted(metaInf, key=natSort):
+		fileN = os.path.basename(fileP)
+
+		print("\t", fileN, end="")
+
+		headerL = metaInf[fileP]
+		#############################
+		# Read in data file
+		seenGenes = {}
+		genes = []
+		expression = []
+
+		print("\tReading expression data file:", fileN)
+
+		for i, line in enumerate(open(fileP)):
+			cols = line.rstrip().replace('"', '').replace("'", '').split(splitBy)
+
+			if headerL:
+				headerL -= 1
+				continue
+
+			# Only retain required genes
+			if cols[0] not in genesToKeep:
+				continue
+
+			genes.append(cols[0])
+			if not cols[0] in seenGenes:
+				seenGenes[cols[0]] = 1
+			else:
+				print("\n\nERROR: Duplicate gene (", cols[0], ") in expression data (", fileN, "), expects unique identifiers!\nPlease remove duplicates and re-run\n", sep="")
+				sys.exit()
+			expression.append(cols[1:])
+
+		print("\t\tTotal genes = ", len(genes))
+		#############################
+
+		#############################
+		print("\t\t# Generate Numpy matrices")
+		# Move genes to numpy array
+		genesNP = np.array(genes)
+
+		# Store position of gene in matrix for this file and create mapping index for HDF5 files
+		outMatrixHDF5 = os.path.join(outF, fileN + "_RetainF" + str(retainF) + ".h5")
+		hdf5paths.append(outMatrixHDF5)
+
+		for i, g in enumerate(genesNP):
+			perHDF5index[outMatrixHDF5].append(keepGeneMapping[g])
+			genePosPerHDF5[g][outMatrixHDF5] = i
+			genesPerHDF5[outMatrixHDF5][g] = 1
+
+		# Create numpy array
+		nrow = len(expression[0])
+		ncol = len(expression)
+		npA = np.zeros((ncol, nrow), dtype=np.float64)
+		for i, eSet in enumerate(expression):
+			npA[i] = eSet
+
+		print("\t\t\tMatrix shape:", npA.shape)
+
+		# Output genes
+		outMatrixN = os.path.join(outF, fileN + "_RetainF" + str(retainF) + "_Genes.txt")
+		np.savetxt(outMatrixN, genesNP, fmt="%s", delimiter="\t")
+
+		# Generate gene meta information
+		generateGeneMeta(outF, fileN + "_RetainF" + str(retainF), geneMetaInfo, genesNP, npA)
+
+		#######################################
+		# Calculate correlations
+		print("\t\t# Calculating correlations using HDF5 file to save memory")
+		rowN, colN = npA.shape
+
+		if os.path.exists(outMatrixHDF5):
+			print("\t\tAlready exists -- skipping")
+			continue
+		with h5py.File(outMatrixHDF5, "w") as f:
+			h5 = f.create_dataset("corr", (rowN, rowN), dtype="f8")
+
+		# # Load into memory
+		print("\t\t\tCreating HDF5 file")
+		h5 = h5py.File(outMatrixHDF5, 'r+')
+
+		if not usePearson:
+			# Note this isn't a perfect Spearman, as ties are not dealt with, but it is fast, and given we're only
+			# retaining the most correlated edges it will have almost no effect on the output.
+			npA = npA.argsort(axis=1).argsort(axis=1).astype(np.float64)
+
+		# subtract means from the input data
+		npA -= np.mean(npA, axis=1)[:, None]
+
+		# normalize the data
+		npA /= np.sqrt(np.sum(npA*npA, axis=1))[:, None]
+
+		# Calculate correlations per chunk
+		print("\t\t\tCalculating correlations for Chunk:")
+		for r in range(0, rowN, corrChunk):
+			print("\t\t\t", r)
+			for c in range(0, rowN, corrChunk):
+				r1 = r + corrChunk
+				c1 = c + corrChunk
+				chunk1 = npA[r:r1]
+				chunk2 = npA[c:c1]
+				h5["corr"][r:r1, c:c1] = np.dot(chunk1, chunk2.T)
+
+		# # Write output out for debugging
+		# finalCorr = np.copy(h5["corr"][:])
+		# outMatrixN = os.path.join(outF,fileN + "_RetainF" + str(retainF) + "_Corr")
+		# stringFormat = "%." + str(decimalP) + "f"
+		# # stringFormat2 = "%." + str(decimalPpVal) + "f"
+		# np.savetxt(outMatrixN + ".txt",finalCorr,fmt=stringFormat,delimiter="\t")
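+
+		# Worked illustration of the ordinal-rank trick above (assumed values,
+		# not executed): a row [0.3, 0.1, 0.2] becomes ranks [2, 0, 1] after
+		# argsort().argsort(), and Pearson on such ranks equals Spearman when
+		# there are no ties. With heavy ties, e.g. [0, 0, 0, 1, 2] -> [0, 1, 2, 3, 4]
+		# rather than the averaged ranks [1, 1, 1, 3, 4], the approximation drifts
+		# from a true Spearman, which is why sparse data may benefit from
+		# --usePearson (see README Troubleshooting).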
+
+		# Remove matrices to save memory
+		del npA
+		del h5
+
+	# Calculate median gene meta information
+	medianMetaInfo = mergeMetaInfo(geneMetaInfo)
+
+
+	###########################################################################################
+	##----------------------------Calculate Median Correlations------------------------------##
+	###########################################################################################
+	# Calculate median correlation matrix
+	print("\nCalculating median correlations")
+	outMatrixHDF5_orig = outMatrixHDF5  # Retain in case only single data-set
+	outMatrixHDF5 = os.path.join(outF, "#Median_RetainF" + str(retainF) + ".h5")
+	if not os.path.exists(outMatrixHDF5):
+		if len(hdf5paths) == 1:
+			# If we're only analysing a single data-set
+			if not (singleCorr or singleCorrL):
+				print("\t\tOnly single data-set analysed, skipping generating median correlations")
+				# No need to calculate median correlations, just return path to HDF5 file and genes matrix
+				return outMatrixHDF5_orig, genesNP
+			else:
+				print("\t\tOnly single data-set analysed, but --singleCorr/--singleCorrL so will proceed with output")
+
+		printEvery = int(round(len(genesToKeep)/float(printEveryDiv), 0))
+		printE = printEvery
+		tell = printE
+		count = 0
+
+		# Create HDF5 median correlation matrix
+
+		with h5py.File(outMatrixHDF5, "w") as f:
+			h5 = f.create_dataset("corr", (len(genesToKeep), len(genesToKeep)), dtype="f8")
+		h5 = h5py.File(outMatrixHDF5, 'r+')
+
+		# Load HDF5 correlation files
+		hdf5Files = loadHDF5files(outF, hdf5paths)
+
+		# Output genes
+		outMatrixN = os.path.join(outF, "#Median_RetainF" + str(retainF) + "_Genes.txt")
+		np.savetxt(outMatrixN, sortedKeepGenes, fmt="%s", delimiter="\t")
+
+		# Get masks
+		maskMappings = generateMasks(sortedKeepGenes, genesPerHDF5, hdf5paths)
+
+		print("\tCalculating median correlations (report every 1/", printEveryDiv, "th of total):", sep="")
+		if singleCorr or singleCorrL:
+			print("\t\tOutputting single gene correlation files:")
+
+		for genePos, gene in enumerate(sortedKeepGenes):
+			# print("\t\t",gene1)
+			# Inform user of position
+			count += 1
+			if count == printE:
+				printE = printE + tell
+				if singleCorr:
+					print("\n\t\tProcessed:", count, end="\n\n")
+				else:
+					print("\t\t", count)
+
+			rowsPerHDF5 = {}
+			maskPos = []
+			dataSetNames = []
+			# Grab row for gene across files
+			for i, hdf5 in enumerate(hdf5paths):
+				try:
+					rowsPerHDF5[hdf5] = hdf5Files[hdf5]["corr"][genePosPerHDF5[gene][hdf5]]
+					maskPos.append(i)
+					dataSetNames.append(os.path.basename(hdf5)[:-3])
+				except:
+					pass
+
+			# Second pass
+			# Convert to numpy array
+			npA = np.full((len(rowsPerHDF5), len(sortedKeepGenes)), -10, dtype=np.float64)  # Missing set to -10
+			for i, hdf5 in enumerate(sorted(rowsPerHDF5, key=natSort)):
+				npA[i][perHDF5index[hdf5]] = rowsPerHDF5[hdf5]  # Use indexes to place in correct location
+
+			# Get appropriate masks
+			tempMask = []
+			for i in maskPos:
+				tempMask.append(maskMappings[i])
+
+			npAMasked = ma.masked_array(npA, mask=tempMask)
+
+			# Generate medians
+			medianRowCorr = np.copy(ma.median(npAMasked, axis=0))
+			h5["corr"][genePos] = medianRowCorr
+
+			###########################################################################################
+			##-------------------------------SINGLE GENE CORR----------------------------------------##
+			###########################################################################################
+			def mergeCorrData(gene, 
corrInf, medianMetaInfo, medianCorr, missingVal=-10): + finalCorrs = [] + dataSetCount = 0 + + for corr in corrInf: + if (corr <= missingVal).all(): + finalCorrs.append("") + else: + finalCorrs.append(str(round(corr, decimalP))) + dataSetCount += 1 + + roundedMeta = map(str, [round(origV, decimalP) for origV in medianMetaInfo[gene]]) + scaledCorr = dataSetCount ** medianCorr + finalInfo = gene + "\t" + str(round(scaledCorr, decimalP)) + "\t" + str(dataSetCount) + "\t" + str(round(medianCorr, decimalP)) + "\t" + "\t".join(roundedMeta) + "\t" + "\t".join(finalCorrs) + + return scaledCorr, finalInfo + ################################### + if singleCorr or singleCorrL: + if singleCorrL: # Only output genes in list + if gene not in singleCorrList: + continue + + subFolder = os.path.join(outFS, gene[0]) + singleCorrFName = os.path.join(subFolder, str(makeSafeFilename(gene)) + "_corr_RetainF" + str(retainF) + ".txt.gz") + if os.path.exists(singleCorrFName): + continue + + print("\t\t\tSingle Corr:", str(makeSafeFilename(gene))) + + # Make subfolder + if not os.path.exists(subFolder): + os.makedirs(subFolder) + + singleCorrF = gzip.open(singleCorrFName, "wt", compresslevel=9) # "wb" in py2 + dataSetsH = "\t".join(dataSetNames) + + if usePearson: + print("Gene\tNumDataSets^MedianPCC\tNumDataSets\tMedianPCC\tMedian_QCODexpression(VarWithin)\tMedianPercentile\tMADpercentile(VarAcross)", dataSetsH, sep="\t", file=singleCorrF) + else: + print("Gene\tNumDataSets^MedianRho\tNumDataSets\tMedianRho\tMedian_QCODexpression(VarWithin)\tMedianPercentile\tMADpercentile(VarAcross)", dataSetsH, sep="\t", file=singleCorrF) + + rankedByCorr = defaultdict(list) + for i, g in enumerate(sortedKeepGenes): + scaledCorr, info = mergeCorrData(g, npA.T[i], medianMetaInfo, medianRowCorr[keepGeneMapping[g]]) + rankedByCorr[scaledCorr].append(info) + + # Rank by scaledCorr + for sCorr in sorted(rankedByCorr, reverse=True): + for info in rankedByCorr[sCorr]: + print(info, file=singleCorrF) + singleCorrF.close() + + ########################################################################################### + # # Write output out for debugging + # finalCorr = np.copy(h5["corr"][:]) + # outMatrixN = os.path.join(outF,"#Median_RetainF" + str(retainF) + "_Corr") + # stringFormat = "%." 
+ str(decimalP) + "f" + # np.savetxt(outMatrixN + ".txt",finalCorr,fmt=stringFormat,delimiter="\t") + ########################################################################################### + ########################################################################################### + + # Remove all single HDF5 files unless requested to keep them + if not keepBigFA: + del hdf5Files + for hdf5P in hdf5paths: + os.remove(hdf5P) + else: + print("\t\tAlready exists -- skipping") + + # # Return path to HDF5 file and genes matrix + return outMatrixHDF5, sortedKeepGenes + + +def reduceEdges(workF, dataF, gephiF, corrh5, genesM, retainF=0.8, edgePG=3, printEveryDiv=10, corrMatF=None, keepBigFA=False, keepBigF=False, ignoreDuplicates=False, roundL=3): + """ + Reduce edges in correlation matrix, only retaining edgePG maximum correlated genes per row + """ + + def bottomToZero(npA, n=1): + """ + Set everything below n to zero + """ + topI = np.argpartition(npA, -n) + npA[topI[:-n]] = 0 + return npA + + def bottomToZeroWithDuplicates(npA, n=1): + """ + Set everything below n to zero, + but deal with duplicates + """ + unique = np.unique(npA) + + uniqueGTzero = len(unique[unique > 0]) + if n > uniqueGTzero: + # Deal with edgePG extending into negative correlations + n = uniqueGTzero + + topIunique = np.argpartition(unique, -n)[-n:] + toKeep = [] + for val in unique[topIunique]: + toKeep.extend(np.where(npA == val)[0]) + + # Mask and reverse + mask = np.ones(len(npA), np.bool) + mask[toKeep] = 0 + npA[mask] = 0 + + return npA + + ########################################################################################### + # Setup # + print("\nReduces edges to (", edgePG, ") per gene:", sep="") + fileN = os.path.basename(dataF) + # Create subfolder + outF = os.path.join(workF, gephiF) + if not os.path.exists(outF): + os.makedirs(outF) + nodesFP = os.path.join(outF, fileN + "_RetainF" + str(retainF) + "_EPG" + str(edgePG) + "_Nodes.tsv.gz") + edgesFP = os.path.join(outF, fileN + "_RetainF" + str(retainF) + "_EPG" + str(edgePG) + "_Edges.tsv.gz") + if os.path.exists(nodesFP) and os.path.exists(edgesFP): + print("\t\tEdge reduction files already exist, skipping") + return edgesFP, nodesFP + ########################################################################################### + + print("\tLoad HDF5 file") + # Load old HDF5 + h5 = h5py.File(corrh5, 'r+') + rowN, colN = h5["corr"].shape + + printEvery = int(round(rowN/float(printEveryDiv), 0)) + + printE = printEvery + tell = printE + count = 0 + + print("\tWorking (report every 1/", printEveryDiv, "th of total):", sep="") + for i, row in enumerate(h5["corr"]): + # Inform user of position + count += 1 + if count == printE: + printE = printE + tell + print("\t\t", count) + + # Before edge reduction round correlations + row = np.round(row,decimals=roundL) + + if ignoreDuplicates: + h5["corr"][i] = bottomToZero(row, edgePG + 1) + else: + h5["corr"][i] = bottomToZeroWithDuplicates(row, edgePG + 1) + + # Write output out for debugging + # finalCorr = np.copy(h5EPG["corr"][:]) + # outMatrixN = os.path.join(outF,fileN + "_RetainF" + str(retainF) + "_NonSym") + # stringFormat = "%." + str(3) + "f" + # # stringFormat2 = "%." 
+ str(decimalPpVal) + "f" + # np.savetxt(outMatrixN + ".txt",finalCorr,fmt=stringFormat,delimiter="\t") + + ########################################################################################### + print("\nGenerating files for Gephi network visualization tool") + + + # First output list of genes (nodes) + print("\tCreate node file") + + nodesFile = gzip.open(nodesFP, "wt", compresslevel=9) + print("Id\tLabel\tCluster", file=nodesFile) + for gene in genesM: + print(gene, gene, "NA", file=nodesFile, sep="\t") + nodesFile.close() + + # Second, output list of edges + print("\tCreate edges file (report every 1/", printEveryDiv, "th of total):", sep="") + edgesFile = gzip.open(edgesFP, "wt", compresslevel=9) + print("Source\tTarget\tWeight\tType\tfromAltName\ttoAltName", file=edgesFile) + + # Finally output edges + printE = printEvery + tell = printE + count = 0 + seen = defaultdict() + printedLines = 0 + + for i, row in enumerate(h5["corr"]): + for j, item in enumerate(row): + if not (i == j): # Don't output self edges... + geneO = genesM[i] + geneT = genesM[j] + if not (geneO + "-@@@-" + geneT) in seen: + # Output info + if not item == 0: + print(geneO, geneT, item, "undirected", geneO, geneT, sep="\t", file=edgesFile) + printedLines += 1 + + # Store the fact that we've see this and it's equivalent pair + seen[geneO + "-@@@-" + geneT] = 1 + seen[geneT + "-@@@-" + geneO] = 1 + + count += 1 + if count == printEvery: + printEvery = printEvery + tell + print("\t\t", count, "Genes", ", Edges:", printedLines) + + edgesFile.close() + print("\n\t\tTotal printed (", printedLines, ") edges", sep="") + + if not (keepBigFA or keepBigF): + h5.close() + shutil.rmtree(os.path.join(workF, corrMatF)) + + return edgesFP, nodesFP + + +def runleidenalgF(wF, edgesFP, baseFold="LEIDENALG", outFoldTorig="ClustersTxt", outFoldLorig="ClustersLists", runNum=10, bestPerc=10, printEveryDiv=10, removeExisting=True, allFold="ALL", bestFold="BEST"): + """ + Convert edges into igraph (https://igraph.org/python/) format and then carry out community detection using the python Leidenalg package (https://github.com/vtraag/leidenalg). This is an improvement + upon the Louvain method used in pgcna.py/pgcna-multi.py that ensures that all communities are well connected within the network. 
+ """ + ########################################################################################### + + def convertGephiToIgraph(gephiEdgeFile, splitBy="\t", header=1, outN="igraphG.temp"): + """ + Convert edge file into format for import to igraph + """ + baseFold, fileName = os.path.split(gephiEdgeFile) + outFileP = os.path.join(baseFold, outN) + outFile = open(outFileP, "w") + + for line in gzip.open(gephiEdgeFile): + line = line.decode() + if header: + header -= 1 + continue + + cols = line.split(splitBy) + print(" ".join(cols[0:3]), file=outFile) + + outFile.close() + return outFileP + + def outputModules(outFoldT, outFoldL, partition, runNum=1): + # Natural sort + natSort = make_key_naturalSort() + + names = partition.graph.vs["name"] + members = partition.membership + namePerMod = defaultdict(dict) + + for n, m in zip(names, members): + namePerMod[int(m)+1][n] = 1 + + # Output module lists + outFileT = open(os.path.join(outFoldT, str(runNum)+".csv"), "w") + print("Id,Modularity Class", file=outFileT) + for m in sorted(namePerMod): + outFileL = open(os.path.join(outFoldL, "M" + str(m) + ".txt"), "w") + for n in sorted(namePerMod[m], key=natSort): + print(n, file=outFileL) + print(n, m, sep=",", file=outFileT) + outFileL.close() + outFileT.close() + return namePerMod + ########################################################################################### + + print("\nGenerating Leidenalg clusterings (n=", runNum, ")", sep="") + + # Create output folder + if removeExisting: + baseL = os.path.join(wF, baseFold) + if os.path.exists(baseL): + shutil.rmtree(baseL) + + outFoldT = os.path.join(wF, baseFold, allFold, outFoldTorig) + outFoldL = os.path.join(wF, baseFold, allFold, outFoldLorig) + if not os.path.exists(outFoldT): + os.makedirs(outFoldT) + if not os.path.exists(outFoldL): + os.makedirs(outFoldL) + + tempP = convertGephiToIgraph(edgesFP) + # Input graph + print("\tConverting gephi edge file --> igraph edge file") + g = ig.Graph().Read_Ncol(tempP, directed=False) + os.remove(tempP) + + # Get weights + graphWeights = g.es["weight"] + + infoPerClustering = [] + modInfo = [] + modInfoScores = defaultdict(list) + + printEvery = int(round(runNum/float(printEveryDiv), 0)) + printE = printEvery + tell = printE + count = 0 + + print("\tWorking (report every 1/", printEveryDiv, "th of total):", sep="") + for rNum in range(1, runNum + 1): + # Make sub folder + outFoldLS = os.path.join(outFoldL, "Clust" + str(rNum)) + if not os.path.exists(outFoldLS): + os.makedirs(outFoldLS) + + seed = random.randint(0, 1e9) # Generate random seed as leidenalg doesn't appear to do this (even though it says it does) + partition = la.find_partition(g, la.ModularityVertexPartition, weights=graphWeights, n_iterations=-1, seed=seed) # n_iterations=-1 to converge on best local result + modScore = partition.modularity + + modInfo.append([modScore, partition.sizes()]) + modInfoScores[modScore].append(rNum) + # Output cluster membership and store for generating co-asociation matrix + infoPerClustering.append(outputModules(outFoldT, outFoldLS, partition, runNum=rNum)) + # Inform user of position + count += 1 + if count == printE: + printE = printE + tell + # Give some feedback about quality of results as working + print("\t\t#", count, " ModScore:", round(modScore, 3), " ModNum:", len(partition.sizes()), " ModSizes:", partition.sizes()) + + # Print out information on mod scores/sizes + modInfoF = open(os.path.join(wF, baseFold, "moduleInfoAll.txt"), "w") + 
print("Mod#\tModularityScore\tAvgModSize\tModuleNum\tModuleSizes", file=modInfoF) + for i, inf in enumerate(modInfo): + print(i+1, inf[0], sum(inf[1])/float(len(inf[1])), len(inf[1]), inf[1], sep="\t", file=modInfoF) + modInfoF.close() + + if runNum > 1: + ########################################################################################### + # Copy best results (based on modularity score) to bestFold folder. + # Create output folder + outFoldTB = os.path.join(wF, baseFold, bestFold, outFoldTorig) + outFoldLB = os.path.join(wF, baseFold, bestFold, outFoldLorig) + if not os.path.exists(outFoldTB): + os.makedirs(outFoldTB) + if not os.path.exists(outFoldLB): + os.makedirs(outFoldLB) + + infoPerClusteringBest = [] + copyNumber = int(round(runNum * (bestPerc/100.0), 0)) + if copyNumber == 0: + copyNumber = 1 + + print("\tWill copy Best ", bestPerc, "% (n=", copyNumber, ") results to ", os.path.join(baseFold, bestFold), sep="") + copyNum = 0 + + modInfoF = open(os.path.join(wF, baseFold, "moduleInfoBest.txt"), "w") + print("Mod#\tModularityScore\tAvgModSize\tModuleNum\tModuleSizes", file=modInfoF) + for mScore in sorted(modInfoScores, reverse=True): + if copyNum >= copyNumber: + break + for clustNum in modInfoScores[mScore]: + if copyNum >= copyNumber: + break + + # Write best module information + inf = modInfo[clustNum - 1] + print(clustNum, inf[0], sum(inf[1])/float(len(inf[1])), len(inf[1]), inf[1], sep="\t", file=modInfoF) + + # Store best info per cluster info + infoPerClusteringBest.append(infoPerClustering[clustNum - 1]) + + # Copy best results to bestFold + textName = str(clustNum) + ".csv" + shutil.copy(os.path.join(outFoldT, textName), os.path.join(outFoldTB, textName)) + listName = "Clust" + str(clustNum) + shutil.copytree(os.path.join(outFoldL, listName), os.path.join(outFoldLB, listName)) + + # Increment number copied + copyNum += 1 + + modInfoF.close() + return partition.graph.vs["name"], infoPerClusteringBest + else: + return partition.graph.vs["name"], infoPerClustering + + + +########################################################################################### + + +def main(finishT="Finished!"): + + if args.singleCorrL: + singleCorrList = loadCorrGeneList(args.workFolder, args.singleCorrListF) + else: + singleCorrList = None + + # Load meta-file detailing list of files to process + metaInf = loadMeta(args.metaInf, args.dataF) + + # Find genes present in each data-set + genesToKeep, requireFileNum = getGenes(metaInf, splitBy=args.fileSep, geneFrac=args.geneFrac, retainF=args.retainF) + + # Generate correlation HDF5 files + corrh5, genesM = generateCorrMatrix(OUT_FOLDER, metaInf, genesToKeep, requireFileNum, singleCorrList, geneFrac=args.geneFrac, retainF=args.retainF, corrMatF=args.corrMatF, corrMatFS=args.corrMatFS, usePearson=args.usePearson, corrChunk=args.corrChunk, splitBy=args.fileSep, + keepBigFA=args.keepBigFA, singleCorr=args.singleCorr, singleCorrL=args.singleCorrL) + + # Reduce edges + edgesFP, nodesFP = reduceEdges(OUT_FOLDER, args.dataF, args.gephiF, corrh5, genesM, retainF=args.retainF, edgePG=args.edgePG, corrMatF=args.corrMatF + "_GMF" + str(requireFileNum), keepBigFA=args.keepBigFA, keepBigF=args.keepBigF, ignoreDuplicates=args.ignoreDuplicates,roundL=args.roundL) + + # Run Leidenalg + if not args.noLeidenalg: + names, infoPerClustering = runleidenalgF(OUT_FOLDER, edgesFP, runNum=args.laRunNum, bestPerc=args.laBestPerc, baseFold=args.lBaseFold, outFoldTorig=args.lClustTxtFold, outFoldLorig=args.lClustListFold) + + else: + print("\nSkipping Leidenalg 
community detection and downstream output as --noLeidenalg is set")
+	print(finishT)
+
+if __name__ == '__main__':
+	main()
diff --git a/README.md b/README.md
index 01c7267..4bccdc9 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-# PGCNA (Parsimonious Gene Correlation Network Analysis)
+# PGCNA2 (Parsimonious Gene Correlation Network Analysis)
 
 ## Introduction
 
@@ -15,324 +15,14 @@ Please cite this when using PGCNA.
 
 PGCNA paper website: [http://pgcna.gets-it.net](http://pgcna.gets-it.net)
 
+## PGCNA and PGCNA2
 
-----------
+There are two versions of PGCNA:
 
-## Requirements
+1. **PGCNA**, the original version of PGCNA (used in the original [manuscript](https://www.nature.com/articles/s41540-019-0090-7)) : [PGCNA](https://github.com/medmaca/PGCNA/tree/master/PGCNA).
+2. **PGCNA2**, a version of PGCNA that uses an improved community detection algorithm: [PGCNA2](https://github.com/medmaca/PGCNA/tree/master/PGCNA2).
 
-Python 2.7 and the following non-standard packages : numpy, scipy (for pgcna-multi.py) and h5py.
-I recommend the Anaconda distribution ([https://www.anaconda.com/download/](https://www.anaconda.com/download/)), which comes with numpy, scipy and h5py included. This script has been tested on Windows 10 (Python 2.7.14, numpy 1.15.4, scipy 1.1.0 and h5py 2.8.0) and Linux CentOS 6.9 (python 2.7.13, numpy 1.13.1, scipy 0.19.1 and h5py 2.7.0).
-
-### Optional, but recommended
-
-Pgcna.py / pgcna-multi.py can be run without generating modules (**--noFastUF**) if all you require is output for network visualisation tools (e.g. Gephi/Cytoscape). For single data-set analysis use pgcna.py, which has less non-standard package requirements and uses less memory.
-
-#### Fast unfolding of communities in large networks (Louvain)
-
-To carry out clustering you will require the additional package **Louvain** ([Blondel, V.D., Guillaume, J.-L., Lambiotte, R., and Lefebvre, E. (2008). **Fast unfolding of communities in large networks**. J. Stat. Mech. Theory Exp. 2008, P10008.](https://arxiv.org/abs/0803.0476)). This can be downloaded here: [https://sourceforge.net/projects/louvain/files/louvain-generic.tar.gz/download](https://sourceforge.net/projects/louvain/files/louvain-generic.tar.gz/download). Please make sure that you've downloaded v0.3 (see package README.txt).
-
-##### Example installation
-
-On Linux CentOS
-```
-$ wget https://sourceforge.net/projects/louvain/files/louvain-generic.tar.gz
-$ tar -xvzf louvain-generic.tar.gz
-$ cd louvain-generic
-$ make
-```
-Copy **louvain**, **convert** and **hierarchy** into your bin folder.
-
-#### Gephi
-
-For visualising the output of PGCNA we highly recommend using Gephi ([https://gephi.org/](https://gephi.org/)) which is able to layout large networks very efficiently. It has the added bonus that it includes the **louvain/FastUnfold** method to quickly visualise the modularity of the network. See **Examples** below for details of loading the output from pgcna.py into Gephi and analysing it.
-
-
-----------
-
-
-## Installation
-
-Using Mercurial:
-
-```
-hg clone https://mcare@bitbucket.org/mcare/pythonscripts-pgcna
-```
-
-Manual download: [https://bitbucket.org/mcare/pythonscripts-pgcna/downloads/](https://bitbucket.org/mcare/pythonscripts-pgcna/downloads/)
-
-
-----------
-
-
-## Overview
-
-We have endeavoured to make the process of using PGCNA as simple as possible, there are relatively few user choices. 
The default parameters will give good results and most users will only want to alter the **--corrChunk** option depending on amount of available RAM (see **Common options**) - -### Note on terminology - -The louvain/FastUnfold method **clusters** the network, each run will produce a **clustering** of the data and each of these will contain individual **clusters/modules** (the terms are used interchangeably). Each time louvain/FastUnfold is run, due to a random start position, it will produce a different result and though these results are highly related they differ in exact gene/module groupings. PGCNA used the louvain/FastUnfold modularity score to rank the different clusterings of the data (number set with -n, --fuNumber parameter) and select the best (number set with -r, --fuRetain parameter) for output. - -## Usage - -### Single vs multiple data-set analysis - -**Pgcna.py** is for processing **single data-sets**, should you want to use **multiple data-sets** to generate a median correlation file (as used in the paper) then use **pgcna-multi.py**. Many of the parameters are shared between the two scripts, with pgcna-multi.py gaining a few additional options (highlighted below). It is possible to analyse single data-sets with pgcna-multi.py, and this will result in exactly the same output as pgcna.py. - -### Input file - -#### Single data-set analysis - -The input file for **pgcna.py** should be an expression file that has the format of unique identifiers (genes/probes etc.) in the first column, with the remaining columns containing the expression values across the samples. Redundancy of identifiers is not allowed and must be dealt with before processing with PGCNA. The default is for a tab separated file with a single line of header -- however, this can be altered with the (--fileSep and --headerL parameters respectively). - -#### Multiple data-set analysis - -When running **pgcna-multi.py** the -d/--dataF option should point to a folder containing multiple data-sets (rather than a single data-set). These files should be the same format as those used in a single data-set analysis (i.e. unique identifiers in the first column, remaining columns containing expression values). The data-set folder also needs an additional file (default name #FileInfo.txt; set by -m/--metaInf) that contains a tab separated list of the file names to process (1st column) and the number of header lines (to allow for meta data) in those files (2nd column), this also is expected to have a header, e.g.: -``` -Filename HeaderLines -FileName1.txt 3 -FileName2.txt 1 -FileName3.txt 12 -``` - -### Basic run - -PGCNA run via **pgcna.py** script: - - -On windows -``` -python pgcna.py -w workFolderPath -d expressionFilePath -``` - -On linux -``` -./pgcna.py -w workFolderPath -d expressionFilePath -``` - -PGCNA run via **pgcna-multi.py** script: - - -On windows -``` -python pgcna-multi.py -w workFolderPath -d expressionFilesFolderPath -``` - -On linux -``` -./pgcna-multi.py -w workFolderPath -d expressionFilesFolderPath -``` - -### Common options - -#### Alter RAM requirements (--corrChunk) - -Within the PGCNA method the step that requires the most RAM is calculating the pairwise correlations. For a small number of genes (<=20,000) this is easily carried out in memory, but this increases dramatically with increasing gene/probe numbers: 2x10^4 = 3GB, 4x10^4 = 12GB, 6x10^4 = 27GB, 8x10^4 = 48GB, 1x10^5 = 75GB etc. 
PGCNA carries out only a small portion of processing in memory, breaking the large correlation problem into chunks. This means that with default settings (--corrChunk 5000) PGCNA can process a 2x10^5 matrix in <1GB or RAM, rather than the 300GB that would be required using memory alone. - -When processing larger matrices the user can choose to increase **--corrChunk** if they want to utilise more memory for calculating the correlation. While this will speed up the correlation step, it should be noted that as this is only one step in the PGCNA process and thus the total time saving may be minor. - -Setting PGCNA to have a correlation chunk size of 20,000 - -``` -./pgcna.py -w workFolderPath -d expressionFilePath --corrChunk 2e4 -``` - -#### Retain all genes - -With default setting PGCNA will only retain the top 80% most variable genes in the expression data for analysis. If the input expression files has been pre-filtered to remove invariant genes you may want to process all of the data, this can be acomplished using the -f/--retainF parameters: - -``` -./pgcna.py -w workFolderPath -d expressionFilePath --retainF 1 -``` - -#### Run without clustering - -If you don't have the **louvain** package installed and wish to generate output for downstream tools you can still run PGCNA with the following command: - -``` -./pgcna.py -w workFolderPath -d expressionFilePath --noFastUF -``` - -#### Changing input file separator and header size - -The default input format is a tab ("\t") separated file with a single header line. This can be changed with the **--fileSep** and **--headerL** parameters, so for a .csv file with 3 header lines: - -``` -./pgcna.py -w workFolderPath -d expressionFilePath --fileSep "," --headerL 3 -``` -Note: for pgcna-multi.py the number of header lines is set using the meta-information file (default #FileInfo.txt) rather than --headerL. In addition all the data-sets to be processed must have the same separator, so for a folder of csv separated files: -``` -./pgcna-multi.py -w workFolderPath -d expressionFilesFolderPath --fileSep "," -``` - -#### Changing the number of retained edges - -The default is to retain the top 3 most correlated edges per gene, and in our paper we show that there is no benefit from increasing this. However, should users wish to increase this they can using the **--edgePG** parameter. Increasing the number of edges will result in the **louvain/fastUnfold** method generating fewer clusters/module, yet these clusters will be super sets of clusters/modules produced with fewer edges. Note: when expression data contains groups of highly correlated genes it is possible for the default 3 edges to generate orphan modules (modules disconnected from the rest of the network), in this situation increasing the number of edges should reconnect these to the total network. - -To run PGCNA retaining 5 edges - -``` -./pgcna.py -w workFolderPath -d expressionFilePath --edgePG 5 -``` - -#### Louvain/FastUnfold options - -By default PGCNA will run 100 clusterings of the data using louvain and will then process the best one (judged by louvain modularity score). Users may wish to increase both the number of clusterings of the network that are carried out and the fraction retained for downstream processing. 
- -For instance to run 10,000 clusterings and retain the top 10: - -``` -./pgcna.py -w workFolderPath -d expressionFilePath -n 1e4 -r 10 -``` - -#### pgcna-multi.py specific options - -pgcna-multi.py merges the correlations across data-sets by calculating a gene's median correlation across all data-sets that contain that gene. During this process if the user wants pgcna-multi.py can output a folder containing a single file per gene in the median correlation matrix. This file shows the correlations between that single gene and all other genes across the different data-sets, allowing the user to see if the data-sets have a good level of agreement, and potentially highlighting outliers. - -To output all single gene correlation files (**--singleCorr**). Note: this **significantly** increases the time to construct the median correlation matrix: - -``` -./pgcna-multi.py -w workFolderPath -d expressionFilesFolderPath --singleCorr -``` - -If you want to limit the single gene correlations output to a subset of the total genes, you can use the **--singleCorrL** option. This uses a file in the --workFolder (default corrGenes.txt; set by --singleCorrListF) to provide a list of genes to limit the output to. If --singleCorrL is set and corrGenes.txt is missing it will be created and the default gene (IRF4) included. Using --singleCorrL is highly recommended over --singleCorr if you're only interested in a small number of genes due to the significant speed increase it provides. - -``` -./pgcna-multi.py -w workFolderPath -d expressionFilesFolderPath --singleCorrL -``` - -See **Output** for information on all downstream files. - ----------- - - -## Parameters - -Complete list of parameters and their default values. - -Can also use built in help flag: -``` -./pgcna.py -h -``` - -### Required - -|Parameter|Description|Default Value|single/multi specific| -|---|---|---|---| -|-w, --workFolder|Base folder for all output|**Required**| | -|-d, --dataF|Expression data file path|**Required**|pgcna.py --> single file, pgcna-multi.py --> folder of files| - -### Input file related - -|Parameter|Description|Default Value|single/multi specific| -|---|---|---|---| -|-s, --fileSep|Separator used in expression file|"\t"| | -|--headerL|Number of header lines before expression values|1|pgcna.py only| -|-m, --metaInf|File containing information about data files to process (must be in same folder as --dataF)|#FileInfo.txt|pgcna-multi.py only| -|-f, --retainF|Retain gene fraction -- keeping most variant genes|0.8| | -|-g, --geneFrac|Fraction of expression files a gene needs to be present in to be included in median correlation matrix|1/3|pgcna-multi.py only| -|-e, --edgePG|Edges to keep per gene -- **Recommend leaving as default**|3| | - -### Folders - -|Parameter|Description|Default Value|single/multi specific| -|---|---|---|---| -|--outF|Root output folder|"PGCNA"| | -|--corrMatF|Correlation matrix folder -- where HDF5 files are stored|"CORR_MATRIX"| | -|--corrMatFS|Folder for single gene correlation files|"CORR_MATRIX_SG"|pgcna-multi.py only| -|--gephiF|Folder to store files for Gephi|"GEPHI"| | -|--fastUF|Folder for fast unfolding clustering output|"FAST_UNFOLD"| | - -### Control usage - -|Parameter|Description|Default Value|single/multi specific| -|---|---|---|---| -|--noFastUF|Flag -- don't run Fast Unfolding clustering but complete everything else|False| | -|--usePearson|Flag -- Instead of Spearman's ranked correlation coefficient calculate Pearson's|False| | -|--keepBigF|Flag -- Retain big HDF5 files after finishing, 
----------


## Parameters

Complete list of parameters and their default values.

You can also use the built-in help flag:
```
./pgcna.py -h
```

### Required

|Parameter|Description|Default Value|single/multi specific|
|---|---|---|---|
|-w, --workFolder|Base folder for all output|**Required**| |
|-d, --dataF|Expression data file path|**Required**|pgcna.py --> single file, pgcna-multi.py --> folder of files|

### Input file related

|Parameter|Description|Default Value|single/multi specific|
|---|---|---|---|
|-s, --fileSep|Separator used in expression file|"\t"| |
|--headerL|Number of header lines before expression values|1|pgcna.py only|
|-m, --metaInf|File containing information about data files to process (must be in same folder as --dataF)|#FileInfo.txt|pgcna-multi.py only|
|-f, --retainF|Retain gene fraction -- keeping most variant genes|0.8| |
|-g, --geneFrac|Fraction of expression files a gene needs to be present in to be included in the median correlation matrix|1/3|pgcna-multi.py only|
|-e, --edgePG|Edges to keep per gene -- **Recommend leaving as default**|3| |

### Folders

|Parameter|Description|Default Value|single/multi specific|
|---|---|---|---|
|--outF|Root output folder|"PGCNA"| |
|--corrMatF|Correlation matrix folder -- where HDF5 files are stored|"CORR_MATRIX"| |
|--corrMatFS|Folder for single gene correlation files|"CORR_MATRIX_SG"|pgcna-multi.py only|
|--gephiF|Folder to store files for Gephi|"GEPHI"| |
|--fastUF|Folder for fast unfolding clustering output|"FAST_UNFOLD"| |

### Control usage

|Parameter|Description|Default Value|single/multi specific|
|---|---|---|---|
|--noFastUF|Flag -- don't run Fast Unfolding clustering but complete everything else|False| |
|--usePearson|Flag -- Instead of Spearman's rank correlation coefficient, calculate Pearson's|False| |
|--keepBigF|Flag -- Retain big HDF5 files after finishing, if needed for independent downstream analysis|False|For pgcna-multi.py --keepBigF retains the median corr HDF5|
|--keepBigFA|Flag -- Retain **ALL** big HDF5 files after finishing, if needed for independent downstream analysis|False|pgcna-multi.py only|
|--corrChunk|Size of chunk to split correlation problem over -- higher values will speed up correlation calculation at the cost of RAM|5000| |
|--ignoreDuplicates|Flag -- Ignore correlation duplicates when cutting top --edgePG genes; faster if set, but may miss genes if correlations are duplicated|False| |
|--singleCorr|Flag -- Output individual gene correlation files -- Warning: this generates 1 file per gene in the final correlation matrix|False|pgcna-multi.py only|
|--singleCorrL|Flag -- Output individual gene correlation files -- limited to those in --singleCorrListF|False|pgcna-multi.py only|
|--singleCorrListF|If --singleCorrL is set then create single correlation files for all genes listed in this file (must be within --workFolder)|corrGenes.txt|pgcna-multi.py only|

### Clustering related

|Parameter|Description|Default Value|single/multi specific|
|---|---|---|---|
|-n, --fuNumber|Number of times to run the **louvain/FastUnfold** method|100| |
|-r, --fuRetain|How many of the best (based on louvain modularity) clusterings to process|1| |
|--fuRenumberStart|For the clusters retained (--fuRetain), what value to start numbering them from|1| |
|--tOutFold|FastUnfold trees output folder|"Trees"| |
|--fTOutFold|FastUnfold final retained (--fuRetain) trees output folder|"TreesF"| |
|--cOutFold|FastUnfold clusters folder|"Clusters"| |
|--cOutTxtFold|FastUnfold clusters mapped back to genes|"ClustersTxt"| |
|--cOutListsFold|FastUnfold clusters mapped back to genes and split into individual text files|"ClustersLists"| |


### Output

If PGCNA is run using default settings, the root folder (-w, --workFolder) will contain the following:

* EPG3 (Edge per gene 3 folder)
    * *CorrelationSettingsInfo.txt : timestamped file containing the parameters used for the run
    * CORR_MATRIX (if --keepBigF) : contains the temporary files used for calculating correlations
        * *_RetainF*.h5 : the HDF5 file of all pairwise correlations after edge reduction
        * *_RetainF*_Genes.txt : genes that remain after -f/--retainF filtering, ordered by descending standard deviation
    * CORR_MATRIX_SG (if --singleCorr/--singleCorrL) : contains gzipped single gene correlation files split into first-letter subfolders
    * FAST_UNFOLD : root folder for output from fast unfold
        * *retainF1.0_EPG3 : folder specific to an input data file
            * Clusters : clusterings output by the FastUnfold **hierarchy** tool
            * **ClustersLists** : contains subfolders for the -r/--fuRetain number of "best" clusterings. Each subfolder contains genes split across the clusters/modules (M1...M*.txt)
            * **ClustersTxt** : contains the -r/--fuRetain number of "best" clusterings as individual *.csv files
            * GeneIntMap.txt : mapping of genes to numbers, required to get back to gene names after processing with **louvain**
            * modScores.txt : modularity scores across all clusterings -- used to rank the clusterings and select the "best" to retain (-r/--fuRetain)
            * sym_edges.bin : binary version of sym_edges.txt, output by the FastUnfold **convert** tool and required by **louvain**
            * sym_edges.txt : gene pairs (encoded as numbers, see GeneIntMap.txt) along with their correlation score
            * sym_edges.weights : edge weights, output by the FastUnfold **convert** tool and required by **louvain**
            * Trees : trees output by **louvain**
            * TreesF : contains the -r/--fuRetain number of "best" trees
    * **GEPHI** : contains files for processing with the Gephi package ([https://gephi.org/](https://gephi.org/))
        * *_RetainF1.0_EPG3_Edges.tsv : tab separated file of edges
        * *_RetainF1.0_EPG3_Nodes.tsv : tab separated file of nodes

The most important folders for users are highlighted in **bold**.
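If you want to pick up the module memberships programmatically after a run, something like the following sketch works. The clustering subfolder name ("best1") and the one-gene-per-line format of the M*.txt files are assumptions made for illustration, so check them against your own output:

```
import glob
import os

# Hypothetical path to one retained "best" clustering -- adjust to
# match the subfolder actually created under ClustersLists.
best = "EPG3/FAST_UNFOLD/myData_retainF1.0_EPG3/ClustersLists/best1"

for path in sorted(glob.glob(os.path.join(best, "M*.txt"))):
    with open(path) as handle:
        # assumed format: one gene identifier per line
        genes = [line.strip() for line in handle if line.strip()]
    print("%s: %d genes" % (os.path.basename(path), len(genes)))
```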
## Examples

### Gephi Visualisation

Visualising the output from PGCNA in Gephi is quite straightforward (correct for Gephi v0.9.2):

Using data from the **EPG3/GEPHI** folder:

1. Open Gephi
2. Select File/"New Project"
3. Select "Data Table" (tab above screen centre)
4. Select "Import Spreadsheet"
    1. Select *_RetainF1.0_EPG3_Nodes.tsv, make sure "Import as" is set to "Nodes table", click **Next** and then **Finish**. It should import with no issues; select **OK**
    2. Repeat with *_RetainF1.0_EPG3_Edges.tsv, making sure "Import as" is set to "Edges table", click **Next** and then **Finish**. It should import with no issues; select **OK**
5. Click "Graph" (tab above screen centre) -- you should now see a massive blob of nodes in the screen centre
6. Under Statistics/Network Overview (right of screen) select **Modularity**. This will run the built-in version of the louvain/fastUnfold method.
    1. Each run will generate a different clustering; the higher the score, the better the clustering is perceived to be.
7. Select Appearance/Partition (left of screen) and select "Modularity Class" from the drop-down list
    1. Click the Palette button in the bottom right of this panel and select Generate. Untick "Limit number of colors" and then select **OK**.
    2. If you want, you can manually change any of the resultant colours by left-clicking on them and dragging the mouse around.
    3. When you're happy with the range of colours, click the **Apply** button.
8. To lay out the network:
    1. Under **Layout** (left middle of screen) select **ForceAtlas2**
    2. For big networks set the following: "Approximate Repulsion" selected; "Scaling" <1, often as small as 0.1; "LinLog mode" selected; and, importantly, "Prevent Overlap" **not** selected.
    3. Wait until the network layout has finished (you may need to alter **Scaling** if all nodes are pushed to the edge of the screen).
    4. Once you're happy with the network, select "Prevent Overlap" to finally prevent node overlaps.

The Gephi version of louvain/fastUnfold is older than that used by PGCNA, so if you've run PGCNA with clustering included you should use the results contained in **EPG3/FAST_UNFOLD/\*retainF1.0_EPG3/ClustersTxt** instead. Assuming you've already loaded the network (see above):

1. Select "Data Table" (tab above screen centre)
2. Select "Import Spreadsheet"
    1. Within ClustersTxt choose a *.csv file to import
    2. Make sure "Import as" is set to "Nodes table", click **Next**, make sure **Modularity Class** is set to integer, then click **Finish**.
3. Follow step 7 above to set the appearance with the new Modularity Class data.

We highly recommend using **PGCNA2**.

## Feedback and questions