This page presents the different codes and functions developped for the multiscale analysis of 3C data of the organism Escherichia coli presented in the article Functional partition of a bacterial chromosome through the interplay of nucleoid-associated proteins and condensin by Virginia S. Lioy, Axel Cournac, Martial Marbouty, Stéphane Duigou, Julien Mozziconacci, Olivier Espéli, Frédéric Boccard, Romain Koszul. The codes presented here should allow to exactly reproduce the various plots present in the main manuscript and supplementary data.
For queries or help getting these running, you can send email or open an issue at the github repository.
- Dependencies
- Raw data extraction and alignment
- Building of the contacts map
- Correlation with other data
- Scalogram visualization tool
- Directionality Index
- Correlation between transcription and contacts
- 3D structure
- Ratio of contacts
Scripts and codes can be run on OS X and other Unix-based systems. It basically requires to have Python installed on your machine which is commonly installed on Unix-based systems. For windows, you can have a look to https://www.python.org/downloads/windows/. Then, a few python modules are necessary for diverses operations on arrays and vizualisation.
- Numpy
- Matplotlib (>=1.0)
- Scipy
- Biopython
Data can be dowloaded on Short Read Archive server at the following address http://www.ncbi.nlm.nih.gov/sra.
A SRA executable called fastq-dump from SRA can be used to extract and split both mates of a library (to use it, you can go with your terminal to the directory containg the executables files by using the bash command cd).Then the program can be used like this: /fastq-dump library_identification --split-3 -O /path_to_a_directory
./fastq-dump SRR639031 --split-3 -O /run/media/axel/EColi_data/
For the alignement step, we will use the sofware Bowtie2 and an iterative procedure like the one of [hiclib] (https://bitbucket.org/mirnylab/hiclib).
We process the pairs of reads so that every read has a mapping quality superior to 30. Here, some lines that can be used to do this task:
# Keeping only the columns of the sam file that contain necessary information:
awk '{print $1,$3,$4,$2,$5;}' p1.sam > p1.sam.0
awk '{print $1,$3,$4,$2,$5;}' p2.sam > p2.sam.0
# Sort according to the read identification to have both mates in the same order
# if sort does not have -V option try -d
sort -V -k1 p1.sam.0 > p1.sam.0.sorted
sort -V -k1 p2.sam.0 > p2.sam.0.sorted
# Pairing of both mates in a single file
paste p1.sam.0.sorted p2.sam.0.sorted > p1_p2_merged
# Removal of intermediar files
rm p1.sam.0.sorted
rm p2.sam.0.sorted
# Filtering of paires of reads that both have a Mapping Quality above 30
awk '{if($1 eq $6 && $5>= 30 && $10 >= 30) print $2,$3,$4,$7,$8,$9}' p1_p2_merged > output_alignment_idpt.dat
# Removal of intermediar file
rm p1_p2_merged
At this stage, you should have a file containing these information:
chr1 104180 16 chr1 104057 0
chr1 3570510 16 chr1 3570450 0
chr1 4255981 0 chr1 4256104 16
chr1 159457 16 chr1 159370 0
chr1 4113710 16 chr1 4113584 0
chr1 4259849 16 chr1 4259818 0
chr1 3753874 0 chr1 3754001 16
chr1 2856270 16 chr1 2856124 0
chr1 4134782 16 chr1 4134678 0
chr1 corresponds here to the chromosome of Escherichi coli genome. We used MG1655 reference genome (GenBank: U00096.2, total length 4639675). We name this file output_alignment_idpt.dat.ind3.
We then assigned eahc read to its corresponding restriction fragment as described previously in https://github.com/axelcournac/3C_tutorial.
To build the contact map and and filtered the non informative events, we use the python code 3Cevents_MATRICE.py 3Cevents_MATRICE.py
:
python 3Cevents_MATRICE.py output_alignment_idpt.dat.ind3 5000 WT_rep1_5kb
The first argument corresponds to the path of the file named output_alignment_idpt.dat.ind3 containing the informations of the mapped pairs of reads. The second argument is the size of the bin, here 5000 bp. We generally use this resolution for the whole study which is a good compromise between resolution and signa robustness. The third argument is the name of the prefixe for the file of the contacts maps.
The generated picture is and corresponds to the first figure of the manuscript. To plot the contact map and save in various formats, we use the python code plot_mat_temp.py
python plot_mat_temp.py mat_temp_WT_rep1_5000.txt WT_rep1_5000
The scalogram tool allows to vizualise the dispersion of the contacts signal along the spatial scales. The functions are implemented in the code multi_scale_domainogram_FILES2_dom3_3plots.py
python multi_scale_domainogram_FILES2_dom3_3plots.py mat_temp_WT_rep1_5000.txt WT_rep1_5000
The second argument is the prefixe for names of output files.
To correlate 3C contacts and recombination prevuously generated in Valens et al., EMBO 2004, we used the python code recombination_3C_Correlation.py
We use a similar code to confront geometrical distances mesaured from microscopy and the ones extracted from 3D structure built with 3C contacts data: distance_structure.py
To correlate MSD (Mean square Displacements) from time lapse microscopy experiments and cumulative contacts signal, the code correlation_MSD_compaction_Marco.py was used.
We computed at two different scales: at 400 kb scale (macrodomains) and 100 kb scale (CIDs). The computation was done on the correlation matrices (to attenutate fluctuations in the signals) as previously described.
To put in evidence the correlation bewteen transcription and 3C contacts at short range, we use the python code correlation_transcription_3C.py
We use gaussian function to smooth both signals and attenuate the fluctuations. This code generates the following figure:
To plot the genomic distance law for different groups of bins classified according to their transcription level, we first compute the distribution of log2 number of reads from the transcriptome for every bins. We divided the bins into 3 groups: poorly expressed bins with transcription level < 7, moderately expressed with transcription level > 7 and < 10 and highly expressed with transcription level > 10.
chip=loadtxt("hist_EV-4_TGACCA_L002_R1_001.MQ30.hist5000")
#plot(log(chip[:,1]) );
c,h,w=my_histo.my_histo( log(chip[:,1]), 100);
plt.bar(c, h, align='center', width=w)
plt.bar(c[c<7], h[c<7], align='center', width=w,color="green");
plt.bar(c[c>10], h[c>10], align='center', width=w,color="red");
plt.xlabel("Transcription level (log2 of number of reads)");
plt.ylabel("Number of occurences (bins of 5kb))");
plt.show();
We then computed the genomic distance on the normalised matrice (poor interacting bins were removed during this process) with the following lines of code:
import distance_law3
import scn
MAT_INT = loadtxt('mat_temp_WT_fused_4banks_5000.txt')
matscn1= scn.scn_func(MAT_INT,100);
# 2) with rna-seq from Olivier Espeli
# very expressed:
indices_highly=np.where( log(chip[:,1]) > 10);
#indices_highly=indices_highly[0];
mat_highly= MAT_INT[indices_highly[0],:];
mat_highly.shape;
dih=distance_law3.dist_law(mat_highly,indices_highly[0]);
# poorly expressed:
indices_poor=np.where( log(chip[:,1]) < 7);
#indices_poor=indices_poor[0];
mat_poor= MAT_INT[indices_poor[0],:];
mat_poor.shape
dip=distance_law3.dist_law(mat_poor,indices_poor[0]);
# normaly expressed:
#indices_norm=np.where( (log(chip) > 6) and (log(chip) < 9) );
indices_norm = np.where( logical_and( (log(chip[:,1]) > 7) , (log(chip[:,1]) < 10) ) );
#indices_poor=indices_poor[0];
mat_norm= MAT_INT[indices_norm[0],:];
mat_norm.shape
din=distance_law3.dist_law(mat_norm,indices_norm[0]);
plot(dih,color="red",label="very transcribed",linewidth=4.0);
plot(din,color="blue",label="moderately transcribed",linewidth=4.0);
plot(dip,color="green",label="poorly transcribed",linewidth=4.0);
xlabel("Genomic distance (in bins of 5kb)");
ylabel("Number of reads per possible distance")
legend();
grid();
This gives the following graph showing that transcription process increases contacts at short scales.
For the construction of 3D structure, we processed the matrice by removing the outliers elements. We computed the genomic distance law and removed points outside the mean + 2 std using the function 'filter_dist'. We use the algorithm Shrec3D with the modification that the lwa to convert contact frequencies into geometricla distance is d=(1/f^0.5). We use home made pymol script to generate the 3D picture including the strong contacts by adding links between monomers.
bg_color white
#cmd.load("/home/axel/Bureau/script_pymol/3Dcoor_ecoli_5kb_3_2.xyz","mov")
#cmd.load("/home/axel/Bureau/files_for_shrec3D/3Dcoor_ecoli_3.xyz","mov")
cmd.load("/home/axel/Bureau/files_for_shrec3D/3Dcoor_ecoli_WT_Hiseq_WT_MM_30C_BC164_2.xyz","mov")
cmd.load("/home/axel/Bureau/files_for_shrec3D/3Dcoor_ecoli_matp_2.xyz","mov")
#n = 864
n = 908
set sphere_scale, 2, (all)
show sphere
for idx in range(0,n):cmd.bond('index %s' % str(idx), 'index %s' % str(idx+1))
# for idx in range(0,n):cmd.unbond('index %s' % str(idx), 'index %s' % str(idx+1))
#inFile = open("/home/axel/Bureau/script_pymol/cumulative3C_200kb.txt", 'r')
inFile = open("/home/axel/Bureau/files_for_shrec3D/cumulative_signal_200kb_matrice_WT1_for_shreck.txt")
inFile = open("/home/axel/Bureau/files_for_shrec3D/cumulative_signal_200kb_WT_MM_30C_BC164_SCN_despeckled.txt")
inFile = open("/home/axel/Bureau/files_for_shrec3D/cumulative_signal_200kb_matP.txt01")
# create the global, stored array
stored.newB = []
# read the new B factors from file
for line in inFile.readlines(): stored.newB.append( float(line) )
# close the input file
inFile.close()
# clear out the old B Factors
alter mov, b=0.0
# update the B Factors with new properties
alter mov and n. A, b=stored.newB.pop(0)
# color the protein based on the new B Factors of the alpha carbons
cmd.spectrum("b", "blue_white_red", "mov and n. A")
set sphere_scale, 3, (all)
show sphere
set_bond stick_radius, 1.0, all
We computed the ratio of contacts between mutant and correponding WT by averaging the concacts made at a certain distance for mutant and WT normalised contact maps and then we took the log2 ratio. The computation is implemented in the code RATIO_CONTACTS_2dtype.py
python RATIO_CONTACTS_2dtype.py mat_temp_5000_WT.dat mat_temp_5000_MatP.dat MatP
To plot on short scales:
s=s[range(s.shape[0]-1,-1,-1),:];
s2=s[range(0,8),:];
extent=(0, 928, 0, 80)
imshow( s2[range(s2.shape[0]-1,-1,-1),:],interpolation="none", extent=extent, vmin=-1.0, vmax=1.5,cmap="seismic")
colorbar();
Here an example for the short scales zoom of the ratio of contacts for HNS/WT:
To analyse plasmids contacts signal along the main Escherichia coli genome and plot the signal centered around matS sites, we used the R code plasmid_genome_norm4.R.
The enrichment analysis of the CIDs borders was carried out thanks to the R script borders_statistics5.R.