Explore and prepare data for further analysis of E-G network with genes involved in hemochromatosis¶
Requirements¶
Data availability¶
All data are downloaded from Nasser et al. 2021 and available in /work2/project/regenet/results/multi/abc.model/Nasser2021
on Genotoul.
AllPredictions.AvgHiC.ABC0.015.minus150.ForABCPaperV3.txt
contains predictions made with the ABC model over 131 biosamples from Nasser et al. 2021.
wget ftp://ftp.broadinstitute.org/outgoing/lincRNA/ABC/AllPredictions.AvgHiC.ABC0.015.minus150.ForABCPaperV3.txt.gz -P /work2/project/regenet/results/multi/abc.model/Nasser2021/
List of biosamples linked to liver:
hepatocyte-ENCODE
HepG2-Roadmap # human liver cancer cell line
liver-ENCODE
List of biosamples linked to intestine:
large_intestine_fetal-Roadmap
small_intestine_fetal-Roadmap
List of biosamples linked to colon:
#HCT116-ENCODE # colon
#HT29 # colon
#sigmoid_colon-ENCODE # sigmoid colon
#transverse_colon-ENCODE # transverse colon
Output¶
At the end of the day, we want to compute bedpe
files the columns of which should be as follows:
"chrom1", "start1", "end1", "chrom2", "start2", "end2", "name", "ABC.score", "strand1", "strand2", "biosample", "gene", "distance", "tissue"(, ...)
where "1" refers to enhancers and "2" refers to genes (either its TSS or the most upstream base of its most probable promoter according to what we want to do with the bedpe).
bedtools
¶
One can load the appropriate bedtools module using:
conda activate base && module load bioinfo/bedtools-2.27.1
Main remarks¶
On merges of enhancers¶
- When, for a single biosample, we replace the enhancers by the "merged" enhancers (whatever they are) and we merge the resulting E-G pairs in which the merged enhancers overlap, we will have to merge ABC predictions eventually. This raises the following question: how to combine the ABC scores? Well, strictly speaking, there are no easy rigorous way to combine two ABC scores such that the score obtained is an ABC score too, even if we take the overlap into account (see next paragraph) because it could happen that the denominator in the two ABC scores are different, and would be different also if one looked at the merged enhancers from the beginning.
Here we simply take the mean of the two ABC scores.
A slightly better approach could be to do as follows (do a drawing to re-understand when reading again): let E1 and E3 be the two enhancers, such that the intersection of them is F2 ; and denote F1 = E1-F2, F3 = E3-F2 and F = F1+F2+F3 = union of E1 and E3. Consider the following ratios, the sum of which is 1: r1 = F1/F, r2 = F2/F, r3 = F3/F. Then, the new ABC score could read as follows: ABC = r1 x ABC1 + r3 x ABC3 + r2 x (ABC1+ABC3)/2.
Extracting data of interest¶
head -n 1 AllPredictions.AvgHiC.ABC0.015.minus150.ForABCPaperV3.txt
1 chr 2 start 3 end 4 name 5 class 6 activity_base 7 TargetGene 8 TargetGeneTSS 9 TargetGeneExpression 10 TargetGenePromoterActivityQuantile 11 TargetGeneIsExpressed 12 distance 13 isSelfPromoter 14 hic_contact 15 powerlaw_contact 16 powerlaw_contact_reference 17 hic_contact_pl_scaled 18 hic_pseudocount 19 hic_contact_pl_scaled_adj 20 ABC.Score.Numerator 21 ABC.Score 22 powerlaw.Score.Numerator 23 powerlaw.Score 24 CellType
wc -l AllPredictions.AvgHiC.ABC0.015.minus150.ForABCPaperV3.txt
7,717,393
Main filters¶
First we verified that all genes in AllPredictions.AvgHiC.ABC0.015.minus150.ForABCPaperV3.txt
are expressed.
tail -n+2 AllPredictions.AvgHiC.ABC0.015.minus150.ForABCPaperV3.txt |awk -F "\t" '{_[$11]++} END{for(u in _){print u, _[u]}}'
True 7717392
OK.
Now we verify that all ABC scores are above 0.015 as expected:
tail -n+2 AllPredictions.AvgHiC.ABC0.015.minus150.ForABCPaperV3.txt |awk -F "\t" '{if($21>=0.015){print $0}}' |wc -l
7,717,392
OK.
We see that in a lot (> 1 million) of those element-gene pairs, the element is actually the promoter of the considered gene:
tail -n+2 AllPredictions.AvgHiC.ABC0.015.minus150.ForABCPaperV3.txt |awk 'BEGIN{FS="\t"; OFS="\t"} {_[$13]++} END{for(u in _){print u, _[u]}}'
True 1177021 False 6540371
Let's keep only the rows corresponding to enhancers and store the results, with the fields of interest, in a bedpe file.
conda activate base && module load bioinfo/bedtools-2.27.1
tail -n +2 AllPredictions.AvgHiC.ABC0.015.minus150.ForABCPaperV3.txt |awk 'BEGIN{FS="\t"; OFS="\t"} {if($13=="False"){print $1, $2, $3, $1, $8, $8, $4"::"$7, $21, ".", ".", $24, $7, $12}}' |bedtools sort -faidx chr_sizes -i stdin > Nasser2021ABCPredictions.all_biosamples.all_putative_enhancers.sorted.bedpe
wc -l Nasser2021ABCPredictions.all_biosamples.all_putative_enhancers.bedpe
6,540,371
OK, this is the expected length.
In the resulting Nasser2021ABCPredictions.all_biosamples.all_putative_enhancers.sorted.bedpe
bedpe file, there are no more self-promoters, and the candidate elements are distributed as follows (a promoter of a gene G1 can be an enhancer for gene G2):
tail -n +2 AllPredictions.AvgHiC.ABC0.015.minus150.ForABCPaperV3.txt |awk -F "\t" '{if($13=="False"){_[$5]++}} END{for(u in _){print u, _[u]}}'
intergenic 2758413 promoter 224350 genic 3557608
Now we extract the list of all enhancers involved in the predictions of interest for all 131 biosamples:
awk 'BEGIN{FS="\t"; OFS="\t"} {print $1, $2, $3}' Nasser2021ABCPredictions.all_biosamples.all_putative_enhancers.sorted.bedpe |uniq > list_all_enhancers.bed
Contains 2,463,310 enhancers.
And we define the list of merged enhancers among those 131 biosamples, so that there are no overlaps in the resulting list:
bedtools merge -i list_all_enhancers.bed > list_all_enhancers.merged.bed
Contains 269,254 enhancers.
Now let's replace all enhancers coordinates in the result, by the coordinates of the corresponding merged enhancers in list_all_enhancers.merged.bed
.
WARNING: it is absolutely necessary to have at least 32 GB of memory to run the code below. If the memory is not sufficient, the arrays indexed by $11":::"$7
are not going to be stored but no error message will be yielded, which could make it pretty hard to debug the code.
srun --mem=32G --pty bash
conda activate base && module load bioinfo/bedtools-2.27.1
Note that the "." we add at the very last column is for compatibility with scrips used in downstream analysis, as when we select particular biosamples, this fields will contain the list of tissues of interest.
bedtools intersect -wb -a Nasser2021ABCPredictions.all_biosamples.all_putative_enhancers.sorted.bedpe -b list_all_enhancers.merged.bed |awk 'BEGIN{FS="\t"; OFS="\t"} {split($7,l1,":"); print $14, $15, $16, $4, $5, $6, $14":"$15"-"$16"::"l1[4], $8, $9, $10, $11, $12, $13}' |awk 'BEGIN{FS="\t"; OFS="\t"} {lines[$11":::"$7]=$0; score[$11":::"$7]+=$8; nb[$11":::"$7]++} END{for(u in lines){split(u,parts,":::"); split(lines[u],l,"\t"); print l[1], l[2], l[3], l[4], l[5], l[6], parts[2], score[u]/nb[u], l[9], l[10], l[11], l[12], l[13], "."}}' |bedtools sort -faidx chr_sizes -i stdin > Nasser2021ABCPredictions.all_biosamples.all_putative_enhancers.merged_enhancers.sorted.bedpe
After a pretty long time:
wc -l Nasser2021ABCPredictions.all_biosamples.all_putative_enhancers.merged_enhancers.sorted.new.bedpe
6,086,345
To ensure that the Awk array has not been "cropped out" for insufficient memory reasons, we executed the same code with twice as much max memory - eg 64 GB - to see if there is a difference (as it simply gives nothing with a small max memory) and it gave the same result => OK
Note that by construction, Nasser2021ABCPredictions.all_biosamples.all_putative_enhancers.merged_enhancers.sorted.bedpe
contains almost duplicated rows when identical E-G are found in different biosamples, the only difference being the ABC score, the biosample and the distance fields.
We will also need the corresponding a similar files where all unique E-G correspond to unique lines with $11 is the list of biosamples, $13 the list of initial E-TSS distances (before replacing the enhancers by merged enhancers) and $14 the list of tissues + other additional fields are defined, see header of the result.
For that purpose, we use the pandas
library from Python, which is more efficient. We use the script reduce_df.py
, the content of which is:
import pandas as pd
import numpy as np
import argparse
import sys
# Create the parser
parser = argparse.ArgumentParser(description="<input file>, <output file>")
# Add the arguments
parser.add_argument("-i",
metavar="path",
type=str,
required=True,
help="path to input file")
parser.add_argument("-o",
metavar="path",
type=str,
required=True,
help="path to output file")
args = parser.parse_args()
input_path = args.i
output_path = args.o
colnames = ["chrom1", "start1", "end1", "chrom2", "start2", "end2", "name",
"ABC.score", "strand1", "strand2", "biosample", "gene", "original_distance", "tissue"]
print("Loading input...")
eg_pairs = pd.read_csv(input_path, sep='\t', header=None, dtype='str', engine='c', names = colnames)
print("Done.")
eg_pairs[["start1", "end1", "start2", "end2", "ABC.score", "original_distance"]] = eg_pairs[["start1", "end1", "start2", "end2", "ABC.score", "original_distance"]].apply(pd.to_numeric)
eg_pairs[["ABC.score", "original_distance"]].astype(float, copy=False)
# For testing purposes we keep only a few rows
#eg_pairs = eg_pairs.iloc[:10000,:]
column_identical_for_unique_EG_pair = ["chrom1", "start1", "end1", "chrom2", "start2", "end2", "name",
"strand1", "strand2", "gene"]
print("Reducing input (this may take a while)...")
eg_reduced = (
eg_pairs.groupby(column_identical_for_unique_EG_pair)
.agg({"ABC.score": list, "biosample": list, "original_distance": list, "tissue": list})
.reset_index()
)
print("Done.")
eg_reduced.rename({"ABC.score": "ABC.scores", "biosample": "biosamples", "original_distance": "original_distances", "tissue": "tissues"}, axis=1, inplace=True)
print("Adding more columns...")
eg_reduced["ABC.mean"] = eg_reduced["ABC.scores"].apply(np.mean)
eg_reduced["ABC.min"] = eg_reduced["ABC.scores"].apply(np.min)
eg_reduced["ABC.max"] = eg_reduced["ABC.scores"].apply(np.max)
eg_reduced["original_distance.mean"] = eg_reduced["original_distances"].apply(np.mean)
eg_reduced["original_distance.min"] = eg_reduced["original_distances"].apply(np.min)
eg_reduced["original_distance.max"] = eg_reduced["original_distances"].apply(np.max)
print("Done.")
print("Preparing output...")
eg_reduced["ABC.scores"] = eg_reduced["ABC.scores"].apply(lambda x: ','.join(list(map(str, x))))
eg_reduced["original_distances"] = eg_reduced["original_distances"].apply(lambda x: ','.join(list(map(str, x))))
eg_reduced["biosamples"] = eg_reduced["biosamples"].apply(lambda x: ','.join(x))
eg_reduced["tissues"] = eg_reduced["tissues"].apply(lambda x: ','.join(x))
print("Done")
print("Writing output... Don't interrupt...")
eg_reduced.to_csv(output_path, sep='\t', header=True, index=False)
print("Done.")
sys.exit(0)
# requires a lot of memory, better use at least 32GB
python reduce_df.py -i Nasser2021ABCPredictions.all_biosamples.all_putative_enhancers.merged_enhancers.sorted.bedpe -o Nasser2021ABCPredictions.all_biosamples.all_putative_enhancers.merged_enhancers.sorted.uniques_eg.bedpe
wc -l Nasser2021ABCPredictions.all_biosamples.all_putative_enhancers.merged_enhancers.sorted.bedpe
6,086,345
wc -l Nasser2021ABCPredictions.all_biosamples.all_putative_enhancers.merged_enhancers.sorted.uniques_eg.bedpe
1,041,155
Select biosamples¶
All liver and intestine biosamples¶
The five biosamples of interest are:
hepatocyte-ENCODE HepG2-Roadmap # human liver cancer cell line liver-ENCODE large_intestine_fetal-Roadmap small_intestine_fetal-Roadmap
We extract corresponding data:
awk 'BEGIN{FS="\t"; OFS="\t"} {bool=0; if($11~/^(hepatocyte-ENCODE|HepG2-Roadmap|liver-ENCODE)$/){bool=1; tissue="liver"} else if ($11~/^(large_intestine_fetal-Roadmap|small_intestine_fetal-Roadmap)$/){bool=1; tissue="intestine"}; if(bool){print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, tissue}}' Nasser2021ABCPredictions.all_biosamples.all_putative_enhancers.merged_enhancers.sorted.bedpe > Nasser2021ABCPredictions.liver_and_intestine.all_putative_enhancers.merged_enhancers.sorted.bedpe
awk -F "\t" '{tissues[$14]++} END{for(u in tissues){print u, tissues[u]}}' Nasser2021ABCPredictions.liver_and_intestine.all_putative_enhancers.merged_enhancers.sorted.bedpe
liver 145132 intestine 101488
Note that by construction, Nasser2021ABCPredictions.liver_and_intestine.all_putative_enhancers.merged_enhancers.sorted.bedpe
contains almost duplicated rows when identical E-G are found in different biosamples, the only difference being the ABC score, the biosample and the distance fields.
Just as before, we use reduce_df.py
to compute a bedpe in which all E-G pairs correspond to unique lines with $11 is the list of biosamples, $13 the list of initial E-TSS distances (before replacing the enhancers by merged enhancers) and $14 the list of tissues + other additional fields are defined, see header of the result.
python reduce_df.py -i Nasser2021ABCPredictions.liver_and_intestine.all_putative_enhancers.merged_enhancers.sorted.bedpe -o Nasser2021ABCPredictions.liver_and_intestine.all_putative_enhancers.merged_enhancers.sorted.uniques_eg.bedpe
wc -l Nasser2021ABCPredictions.liver_and_intestine.all_putative_enhancers.merged_enhancers.sorted.bedpe
246,620
wc -l Nasser2021ABCPredictions.liver_and_intestine.all_putative_enhancers.merged_enhancers.sorted.uniques_eg.bedpe
147,748
We do the same with unmerged (original) enhancers:
awk 'BEGIN{FS="\t"; OFS="\t"} {bool=0; if($11~/^(hepatocyte-ENCODE|HepG2-Roadmap|liver-ENCODE)$/){bool=1; tissue="liver"} else if ($11~/^(large_intestine_fetal-Roadmap|small_intestine_fetal-Roadmap)$/){bool=1; tissue="intestine"}; if(bool){print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, tissue}}' Nasser2021ABCPredictions.all_biosamples.all_putative_enhancers.sorted.bedpe > Nasser2021ABCPredictions.liver_and_intestine.all_putative_enhancers.sorted.bedpe
awk -F "\t" '{tissues[$14]++} END{for(u in tissues){print u, tissues[u]}}' Nasser2021ABCPredictions.liver_and_intestine.all_putative_enhancers.sorted.bedpe
liver 157901 intestine 108221
Just as before, we use reduce_df.py
to compute a bedpe in which all E-G pairs correspond to unique lines with $11 is the list of biosamples, $13 the list of initial E-TSS distances (before replacing the enhancers by merged enhancers) and $14 the list of tissues + other additional fields are defined, see header of the result.
python reduce_df.py -i Nasser2021ABCPredictions.liver_and_intestine.all_putative_enhancers.sorted.bedpe -o Nasser2021ABCPredictions.liver_and_intestine.all_putative_enhancers.sorted.uniques_eg.bedpe
wc -l Nasser2021ABCPredictions.liver_and_intestine.all_putative_enhancers.sorted.bedpe
266,122
wc -l Nasser2021ABCPredictions.liver_and_intestine.all_putative_enhancers.sorted.uniques_eg.bedpe
265,975
And finally, we do the same again with enhancers merged only over the selected biosamples, ie the 5 liver and intestine biosamples:
awk 'BEGIN{FS="\t"; OFS="\t"} {print $1, $2, $3}' Nasser2021ABCPredictions.liver_and_intestine.all_putative_enhancers.sorted.bedpe |uniq > list_enhancers.liver_and_intestine.bed
Contains 102,996 enhancers.
bedtools merge -i list_enhancers.liver_and_intestine.bed > list_enhancers.liver_and_intestine.merged.bed
Contains 57,398 enhancers.
So we replace all enhancers coordinates in the result, by that of the corresponding merged enhancers in list_enhancers.liver_and_intestine.merged.bed
.
bedtools intersect -wb -a Nasser2021ABCPredictions.liver_and_intestine.all_putative_enhancers.sorted.bedpe -b list_enhancers.liver_and_intestine.merged.bed |awk 'BEGIN{FS="\t"; OFS="\t"} {split($7,l1,":"); print $15, $16, $17, $4, $5, $6, $15":"$16"-"$17"::"l1[4], $8, $9, $10, $11, $12, $13, $14}' |awk 'BEGIN{FS="\t"; OFS="\t"} {lines[$11":::"$7]=$0; score[$11":::"$7]+=$8; nb[$11":::"$7]++} END{for(u in lines){split(u,parts,":::"); split(lines[u],l,"\t"); print l[1], l[2], l[3], l[4], l[5], l[6], parts[2], score[u]/nb[u], l[9], l[10], l[11], l[12], l[13], l[14]}}' |bedtools sort -faidx chr_sizes -i stdin > Nasser2021ABCPredictions.liver_and_intestine.all_putative_enhancers.merged_liver_and_intestine_enhancers.sorted.bedpe
awk -F "\t" '{tissues[$14]++} END{for(u in tissues){print u, tissues[u]}}' Nasser2021ABCPredictions.liver_and_intestine.all_putative_enhancers.merged_liver_and_intestine_enhancers.sorted.bedpe
liver 153404 intestine 105800
Just as before, we use reduce_df.py
to compute a bedpe in which all E-G pairs correspond to unique lines with $11 is the list of biosamples, $13 the list of initial E-TSS distances (before replacing the enhancers by merged enhancers) and $14 the list of tissues + other additional fields are defined, see header of the result.
python reduce_df.py -i Nasser2021ABCPredictions.liver_and_intestine.all_putative_enhancers.merged_liver_and_intestine_enhancers.sorted.bedpe -o Nasser2021ABCPredictions.liver_and_intestine.all_putative_enhancers.merged_liver_and_intestine_enhancers.sorted.uniques_eg.bedpe
wc -l Nasser2021ABCPredictions.liver_and_intestine.all_putative_enhancers.merged_liver_and_intestine_enhancers.sorted.bedp
259,204
wc -l Nasser2021ABCPredictions.liver_and_intestine.all_putative_enhancers.merged_liver_and_intestine_enhancers.sorted.uniques_eg.bedpe
159,675
All liver biosamples¶
awk '$14=="liver"' Nasser2021ABCPredictions.liver_and_intestine.all_putative_enhancers.merged_enhancers.sorted.bedpe > Nasser2021ABCPredictions.liver.all_putative_enhancers.merged_enhancers.sorted.bedpe
OK.
Just as before, we use reduce_df.py
to compute a bedpe in which all E-G pairs correspond to unique lines with $11 is the list of biosamples, $13 the list of initial E-TSS distances (before replacing the enhancers by merged enhancers) and $14 the list of tissues + other additional fields are defined, see header of the result.
python reduce_df.py -i Nasser2021ABCPredictions.liver.all_putative_enhancers.merged_enhancers.sorted.bedpe -o Nasser2021ABCPredictions.liver.all_putative_enhancers.merged_enhancers.sorted.uniques_egbedpe
We do the same with unmerged (original) enhancers:
awk '$14=="liver"' Nasser2021ABCPredictions.liver_and_intestine.all_putative_enhancers.sorted.bedpe > Nasser2021ABCPredictions.liver.all_putative_enhancers.sorted.bedpe
Just as before, we use reduce_df.py
to compute a bedpe in which all E-G pairs correspond to unique lines with $11 is the list of biosamples, $13 the list of initial E-TSS distances (before replacing the enhancers by merged enhancers) and $14 the list of tissues + other additional fields are defined, see header of the result.
python reduce_df.py -i Nasser2021ABCPredictions.liver.all_putative_enhancers.sorted.bedpe -o Nasser2021ABCPredictions.liver.all_putative_enhancers.sorted.uniques_eg.bedpe
And finally, we do the same again with enhancers merged only over the selected biosamples, ie the 3 liver biosamples:
awk 'BEGIN{FS="\t"; OFS="\t"} {print $1, $2, $3}' Nasser2021ABCPredictions.liver.all_putative_enhancers.sorted.bedpe |uniq > list_enhancers.liver.bed
Contains 61,838 enhancers.
bedtools merge -i list_enhancers.liver.bed > list_enhancers.liver.merged.bed
Contains 44,979 enhancers.
So we replace all enhancers coordinates in the result, by that of the corresponding merged enhancers in list_enhancers.liver.merged.bed
.
bedtools intersect -wb -a Nasser2021ABCPredictions.liver.all_putative_enhancers.sorted.bedpe -b list_enhancers.liver.merged.bed |awk 'BEGIN{FS="\t"; OFS="\t"} {split($7,l1,":"); print $15, $16, $17, $4, $5, $6, $15":"$16"-"$17"::"l1[4], $8, $9, $10, $11, $12, $13, $14}' |awk 'BEGIN{FS="\t"; OFS="\t"} {lines[$11":::"$7]=$0; score[$11":::"$7]+=$8; nb[$11":::"$7]++} END{for(u in lines){split(u,parts,":::"); split(lines[u],l,"\t"); print l[1], l[2], l[3], l[4], l[5], l[6], parts[2], score[u]/nb[u], l[9], l[10], l[11], l[12], l[13], l[14]}}' |bedtools sort -faidx chr_sizes -i stdin > Nasser2021ABCPredictions.liver.all_putative_enhancers.merged_liver_enhancers.sorted.bedpe
awk -F "\t" '{tissues[$14]++} END{for(u in tissues){print u, tissues[u]}}' Nasser2021ABCPredictions.liver.all_putative_enhancers.merged_liver_enhancers.sorted.bedpe
liver 155293
Just as before, we use reduce_df.py
to compute a bedpe in which all E-G pairs correspond to unique lines with $11 is the list of biosamples, $13 the list of initial E-TSS distances (before replacing the enhancers by merged enhancers) and $14 the list of tissues + other additional fields are defined, see header of the result.
python reduce_df.py -i Nasser2021ABCPredictions.liver.all_putative_enhancers.merged_liver_enhancers.sorted.bedpe -o Nasser2021ABCPredictions.liver.all_putative_enhancers.merged_liver_enhancers.sorted.uniques_eg.bedpe
All intestine biosamples¶
awk '$14=="intestine"' Nasser2021ABCPredictions.liver_and_intestine.all_putative_enhancers.merged_enhancers.sorted.bedpe > Nasser2021ABCPredictions.intestine.all_putative_enhancers.merged_enhancers.sorted.bedpe
OK.
Just as before, we use reduce_df.py
to compute a bedpe in which all E-G pairs correspond to unique lines with $11 is the list of biosamples, $13 the list of initial E-TSS distances (before replacing the enhancers by merged enhancers) and $14 the list of tissues + other additional fields are defined, see header of the result.
python reduce_df.py -i Nasser2021ABCPredictions.intestine.all_putative_enhancers.merged_enhancers.sorted.bedpe -o Nasser2021ABCPredictions.intestine.all_putative_enhancers.merged_enhancers.sorted.uniques_eg.bedpe
awk '$14=="intestine"' Nasser2021ABCPredictions.liver_and_intestine.all_putative_enhancers.sorted.bedpe > Nasser2021ABCPredictions.intestine.all_putative_enhancers.sorted.bedpe
Just as before, we use reduce_df.py
to compute a bedpe in which all E-G pairs correspond to unique lines with $11 is the list of biosamples, $13 the list of initial E-TSS distances (before replacing the enhancers by merged enhancers) and $14 the list of tissues + other additional fields are defined, see header of the result.
python reduce_df.py -i Nasser2021ABCPredictions.intestine.all_putative_enhancers.sorted.bedpe -o Nasser2021ABCPredictions.intestine.all_putative_enhancers.sorted.uniques_eg.bedpe
And finally, we do the same again with enhancers merged only over the selected biosamples, ie the 2 intestine biosamples:
awk 'BEGIN{FS="\t"; OFS="\t"} {print $1, $2, $3}' Nasser2021ABCPredictions.intestine.all_putative_enhancers.sorted.bedpe |uniq > list_enhancers.intestine.bed
Contains 41,102 enhancers.
bedtools merge -i list_enhancers.intestine.bed > list_enhancers.intestine.merged.bed
Contains 27,102 enhancers.
So we replace all enhancers coordinates in the result, by that of the corresponding merged enhancers in list_enhancers.liver.merged.bed
.
bedtools intersect -wb -a Nasser2021ABCPredictions.intestine.all_putative_enhancers.sorted.bedpe -b list_enhancers.intestine.merged.bed |awk 'BEGIN{FS="\t"; OFS="\t"} {split($7,l1,":"); print $15, $16, $17, $4, $5, $6, $15":"$16"-"$17"::"l1[4], $8, $9, $10, $11, $12, $13, $14}' |awk 'BEGIN{FS="\t"; OFS="\t"} {lines[$11":::"$7]=$0; score[$11":::"$7]+=$8; nb[$11":::"$7]++} END{for(u in lines){split(u,parts,":::"); split(lines[u],l,"\t"); print l[1], l[2], l[3], l[4], l[5], l[6], parts[2], score[u]/nb[u], l[9], l[10], l[11], l[12], l[13], l[14]}}' |bedtools sort -faidx chr_sizes -i stdin > Nasser2021ABCPredictions.intestine.all_putative_enhancers.merged_intestine_enhancers.sorted.bedpe
awk -F "\t" '{tissues[$14]++} END{for(u in tissues){print u, tissues[u]}}' Nasser2021ABCPredictions.intestine.all_putative_enhancers.merged_intestine_enhancers.sorted.bedpe
intestine 106883
Just as before, we use reduce_df.py
to compute a bedpe in which all E-G pairs correspond to unique lines with $11 is the list of biosamples, $13 the list of initial E-TSS distances (before replacing the enhancers by merged enhancers) and $14 the list of tissues + other additional fields are defined, see header of the result.
python reduce_df.py -i Nasser2021ABCPredictions.intestine.all_putative_enhancers.merged_intestine_enhancers.sorted.bedpe -o Nasser2021ABCPredictions.intestine.all_putative_enhancers.merged_intestine_enhancers.sorted.uniques_eg.bedpe
All 5 separate biosamples of interest¶
hepatocyte-ENCODE¶
awk '$11=="hepatocyte-ENCODE"' Nasser2021ABCPredictions.liver_and_intestine.all_putative_enhancers.merged_enhancers.sorted.bedpe > Nasser2021ABCPredictions.hepatocyte-ENCODE.all_putative_enhancers.merged_enhancers.sorted.bedpe
OK.
awk '$11=="hepatocyte-ENCODE"' Nasser2021ABCPredictions.liver_and_intestine.all_putative_enhancers.sorted.bedpe > Nasser2021ABCPredictions.hepatocyte-ENCODE.all_putative_enhancers.sorted.bedpe
And finally, we do the same again with enhancers merged only over the selected biosample:
awk 'BEGIN{FS="\t"; OFS="\t"} {print $1, $2, $3}' Nasser2021ABCPredictions.hepatocyte-ENCODE.all_putative_enhancers.sorted.bedpe |uniq > list_enhancers.hepatocyte-ENCODE.bed
Contains 23,436 enhancers.
bedtools merge -i list_enhancers.hepatocyte-ENCODE.bed > list_enhancers.hepatocyte-ENCODE.merged.bed
Contains 23,436 enhancers => indeed, it seems natural that enhancers in one biosample only do not overlap each other.
So we do not need to replace all enhancers coordinates in the result, by that of the corresponding merged enhancers in list_enhancers.hepatocyte-ENCODE.merged.bed
.
liver-ENCODE¶
awk '$11=="liver-ENCODE"' Nasser2021ABCPredictions.liver_and_intestine.all_putative_enhancers.merged_enhancers.sorted.bedpe > Nasser2021ABCPredictions.liver-ENCODE.all_putative_enhancers.merged_enhancers.sorted.bedpe
OK.
awk '$11=="liver-ENCODE"' Nasser2021ABCPredictions.liver_and_intestine.all_putative_enhancers.sorted.bedpe > Nasser2021ABCPredictions.liver-ENCODE.all_putative_enhancers.sorted.bedpe
And finally, we do the same again with enhancers merged only over the selected biosample:
awk 'BEGIN{FS="\t"; OFS="\t"} {print $1, $2, $3}' Nasser2021ABCPredictions.liver-ENCODE.all_putative_enhancers.sorted.bedpe |uniq > list_enhancers.liver-ENCODE.bed
Contains 18,623 enhancers.
bedtools merge -i list_enhancers.liver-ENCODE.bed > list_enhancers.liver-ENCODE.merged.bed
Contains 18,623 enhancers => indeed, it seems natural that enhancers in one biosample only do not overlap each other.
So we do not need to replace all enhancers coordinates in the result, by that of the corresponding merged enhancers in list_enhancers.liver-ENCODE.merged.bed
.
HepG2-Roadmap¶
awk '$11=="HepG2-Roadmap"' Nasser2021ABCPredictions.liver_and_intestine.all_putative_enhancers.merged_enhancers.sorted.bedpe > Nasser2021ABCPredictions.HepG2-Roadmap.all_putative_enhancers.merged_enhancers.sorted.bedpe
OK.
awk '$11=="HepG2-Roadmap"' Nasser2021ABCPredictions.liver_and_intestine.all_putative_enhancers.sorted.bedpe > Nasser2021ABCPredictions.HepG2-Roadmap.all_putative_enhancers.sorted.bedpe
And finally, we do the same again with enhancers merged only over the selected biosample:
awk 'BEGIN{FS="\t"; OFS="\t"} {print $1, $2, $3}' Nasser2021ABCPredictions.HepG2-Roadmap.all_putative_enhancers.sorted.bedpe |uniq > list_enhancers.HepG2-Roadmap.bed
Contains 19,766 enhancers.
bedtools merge -i list_enhancers.HepG2-Roadmap.bed > list_enhancers.HepG2-Roadmap.merged.bed
Contains 19,766 enhancers => indeed, it seems natural that enhancers in one biosample only do not overlap each other.
So we do not need to replace all enhancers coordinates in the result, by that of the corresponding merged enhancers in list_enhancers.HepG2-Roadmap.merged.bed
.
large_intestine_fetal-Roadmap¶
awk '$11=="large_intestine_fetal-Roadmap"' Nasser2021ABCPredictions.liver_and_intestine.all_putative_enhancers.merged_enhancers.sorted.bedpe > Nasser2021ABCPredictions.large_intestine_fetal-Roadmap.all_putative_enhancers.merged_enhancers.sorted.bedpe
OK.
awk '$11=="large_intestine_fetal-Roadmap"' Nasser2021ABCPredictions.liver_and_intestine.all_putative_enhancers.sorted.bedpe > Nasser2021ABCPredictions.large_intestine_fetal-Roadmap.all_putative_enhancers.sorted.bedpe
And finally, we do the same again with enhancers merged only over the selected biosample:
awk 'BEGIN{FS="\t"; OFS="\t"} {print $1, $2, $3}' Nasser2021ABCPredictions.large_intestine_fetal-Roadmap.all_putative_enhancers.sorted.bedpe |uniq > list_enhancers.large_intestine_fetal-Roadmap.bed
Contains 20,517 enhancers.
bedtools merge -i list_enhancers.large_intestine_fetal-Roadmap.bed > list_enhancers.large_intestine_fetal-Roadmap.merged.bed
Contains 20,517 enhancers => indeed, it seems natural that enhancers in one biosample only do not overlap each other.
So we do not need to replace all enhancers coordinates in the result, by that of the corresponding merged enhancers in list_enhancers.large_intestine_fetal-Roadmap.merged.bed
.
small_intestine_fetal-Roadmap¶
awk '$11=="small_intestine_fetal-Roadmap"' Nasser2021ABCPredictions.liver_and_intestine.all_putative_enhancers.merged_enhancers.sorted.bedpe > Nasser2021ABCPredictions.small_intestine_fetal-Roadmap.all_putative_enhancers.merged_enhancers.sorted.bedpe
OK.
awk '$11=="small_intestine_fetal-Roadmap"' Nasser2021ABCPredictions.liver_and_intestine.all_putative_enhancers.sorted.bedpe > Nasser2021ABCPredictions.small_intestine_fetal-Roadmap.all_putative_enhancers.sorted.bedpe
And finally, we do the same again with enhancers merged only over the selected biosample:
awk 'BEGIN{FS="\t"; OFS="\t"} {print $1, $2, $3}' Nasser2021ABCPredictions.small_intestine_fetal-Roadmap.all_putative_enhancers.sorted.bedpe |uniq > list_enhancers.small_intestine_fetal-Roadmap.bed
Contains 20,555 enhancers.
bedtools merge -i list_enhancers.small_intestine_fetal-Roadmap.bed > list_enhancers.small_intestine_fetal-Roadmap.merged.bed
Contains 20,555 enhancers => indeed, it seems natural that enhancers in one biosample only do not overlap each other.
So we do not need to replace all enhancers coordinates in the result, by that of the corresponding merged enhancers in list_enhancers.small_intestine_fetal-Roadmap.merged.bed
.
All biosamples¶
We already did it in section Main filters.
Nasser2021ABCPredictions.all_biosamples.all_putative_enhancers.merged_enhancers.sorted.bedpe
for merged enhancersNasser2021ABCPredictions.all_biosamples.all_putative_enhancers.merged_enhancers.sorted.uniques_eg.bedpe
for merged enhancers (with 2 rows = 2 distinct unique E-G pairs)Nasser2021ABCPredictions.all_biosamples.all_putative_enhancers.sorted.bedpe
for unmerged (original) enhancers
Investigate putative regulatory elements¶
Small R script to compute summary statistics¶
See compute_summary_stats_enhancer_list.R
. Content:
options <- commandArgs(trailingOnly = TRUE) enhancers = as.data.frame(read.table(options[1], sep = "\t")) enhancers$length = abs(enhancers$V3-enhancers$V2) summary(enhancers$length)
We need to load a module containing R to be able to launch the script:
module load system/R-4.1.1_gcc-9.3.0
We can execute the R script with:
Rscript compute_summary_stats_enhancer_list.R <relative_path_to_enhancers_list>
ccRE¶
We take the ccRE from /work2/project/regenet/results/multi/bengi/Annotations/hg19-cCREs.bed
.
cp /work2/project/regenet/results/multi/bengi/Annotations/hg19-cCREs.bed .
awk -F "\t" '$6~/(^Enhancer-like$)/' hg19-cCREs.bed > /work2/project/regenet/results/multi/abc.model/Nasser2021/ccRE-ELS.bed
989,712 ccRE-ELS
Min. 1st Qu. Median Mean 3rd Qu. Max.
50.0 247.0 352.0 423.4 519.0 16633.0
Merged enhancers of all 131 biosamples¶
Original enhancers¶
Min. 1st Qu. Median Mean 3rd Qu. Max.
200.0 200.0 308.0 481.7 637.0 6991.0
Merged enhancers¶
Min. 1st Qu. Median Mean 3rd Qu. Max.
200.0 288.0 631.0 807.8 1078.0 11616.0
Overlap with ccRE-ELS¶
wc -l list_all_enhancers.merged.bed
269,254
bedtools intersect -c -a list_all_enhancers.merged.bed -b ccRE-ELS.bed |awk 'BEGIN{FS="\t"; OFS="\t"} {nbs[$4]++} END{for(u in nbs){print u, nbs[u]}}' |sort -nrk2,2
1 112356 0 85937 2 44230 3 16616 4 6135 5 2319 6 934 7 388 8 164 9 78 10 43 11 24 12 14 13 7 14 4 17 3 16 2
85,937 (32%) merged putative enhancers across all 131 biosamples, do not match any ccRE-ELS.
wc -l list_all_enhancers.bed
2,463,310
bedtools intersect -c -a list_all_enhancers.bed -b ccRE-ELS.bed |awk 'BEGIN{FS="\t"; OFS="\t"} {nbs[$4]++} END{for(u in nbs){print u, nbs[u]}}' |sort -nrk2,2
0 1117560 1 1054002 2 221538 3 52355 4 12610 5 3471 6 1025 7 422 8 180 9 69 10 38 11 26 12 9 13 2 17 1 16 1 14 1
Overlap with all ccRE¶
bedtools intersect -c -a list_all_enhancers.merged.bed -b hg19-cCREs.bed |awk 'BEGIN{FS="\t"; OFS="\t"} {nbs[$4]++} END{for(u in nbs){print u, nbs[u]}}' |sort -nrk2,2
1 144690 2 58313 0 25709 3 23341 4 9498 5 4016 6 1854 7 899 8 422 9 235 10 114 11 63 12 41 13 24 14 11 15 6 17 5 16 5 20 3 18 3 22 1 19 1
Only 25,709 (9%) merged putative enhancers across all 131 biosamples, do not match any ccRE (vs 32% against ccRE-ELS), suggesting that a lot of those "putative enhancers" are actually other types of regulatory element as defined by Moore/ENCODE.
Merged enhancers of all liver and intestine biosamples¶
Original enhancers¶
Min. 1st Qu. Median Mean 3rd Qu. Max.
200.0 200.0 353.0 490.7 644.0 4626.0
Merged enhancers (merged across 131 biosamples)¶
bedtools intersect -wb -a list_enhancers.liver_and_intestine.bed -b list_all_enhancers.merged.bed |awk 'BEGIN{FS="\t"; OFS="\t"} {print $4, $5, $6}' |bedtools sort -faidx chr_sizes -i stdin |uniq > list_enhancers.liver_and_intestine.merged131.bed
wc -l list_enhancers.liver_and_intestine.bed 102996 wc -l list_enhancers.liver_and_intestine.merged131.bed 50881
Min. 1st Qu. Median Mean 3rd Qu. Max.
200 520 982 1231 1668 11616
Merged enhancers (merged across 5 liver and intestine biosamples)¶
Min. 1st Qu. Median Mean 3rd Qu. Max.
200.0 200.0 423.0 574.6 756.0 6298.0
Overlap with ccRE-ELS¶
bedtools intersect -c -a list_enhancers.liver_and_intestine.merged.bed -b ccRE-ELS.bed |awk 'BEGIN{FS="\t"; OFS="\t"} {nbs[$4]++} END{for(u in nbs){print u, nbs[u]}}' |sort -nrk2,2
0 26768 1 22443 2 6037 3 1564 4 398 5 135 6 33 7 10 9 6 10 2 8 1 16 1
47% do not overlap any ccRE-ELS !
Merged enhancers of all liver biosamples¶
Original enhancers¶
Min. 1st Qu. Median Mean 3rd Qu. Max.
200.0 200.0 296.0 455.3 585.0 4626.0
Merged enhancers (merged across 131 biosamples)¶
bedtools intersect -wb -a list_enhancers.liver.bed -b list_all_enhancers.merged.bed |awk 'BEGIN{FS="\t"; OFS="\t"} {print $4, $5, $6}' |bedtools sort -faidx chr_sizes -i stdin |uniq > list_enhancers.liver.merged131.bed
wc -l list_enhancers.liver.bed 61838 wc -l list_enhancers.liver.merged131.bed 39477
Min. 1st Qu. Median Mean 3rd Qu. Max.
200 528 1027 1292 1780 11616
Merged enhancers (across all 3 liver biosamples)¶
Min. 1st Qu. Median Mean 3rd Qu. Max.
200.0 200.0 348.0 507.1 650.0 4737.0
Overlap with ccRE-ELS¶
bedtools intersect -c -a list_enhancers.liver.merged.bed -b ccRE-ELS.bed |awk 'BEGIN{FS="\t"; OFS="\t"} {nbs[$4]++} END{for(u in nbs){print u, nbs[u]}}' |sort -nrk2,2
0 22717 1 17340 2 3802 3 851 4 184 5 64 6 11 9 5 7 3 8 1 10 1
51% do not overlap any ccRE-ELS !
Merged enhancers of all intestine biosamples¶
Original enhancers¶
Min. 1st Qu. Median Mean 3rd Qu. Max.
200 200 428 543 724 3679
Merged enhancers (merged across 131 biosamples)¶
bedtools intersect -wb -a list_enhancers.intestine.bed -b list_all_enhancers.merged.bed |awk 'BEGIN{FS="\t"; OFS="\t"} {print $4, $5, $6}' |bedtools sort -faidx chr_sizes -i stdin |uniq > list_enhancers.intestine.merged131.bed
wc -l list_enhancers.intestine.bed 41102 wc -l list_enhancers.intestine.merged131.bed 24492
Min. 1st Qu. Median Mean 3rd Qu. Max.
200 710 1209 1469 1978 11073
Merged enhancers (across all 2 intestine biosamples)¶
Min. 1st Qu. Median Mean 3rd Qu. Max.
200.0 221.0 488.0 606.8 814.0 4461.0