Skip to content

Explore and prepare data for further analysis of E-G network with genes involved in hemochromatosis, based on CHiC data

Requirements

Data availability

All data can be found here:

/work2/project/regenet/results/phic/homo_sapiens/hg19/jung.ren.2019/LI11/score2/pall/LI11.pall.score2.gninfo.bedpe

cd /work2/project/regenet/results/phic/homo_sapiens/hg19/jung.ren.2019/LI11/score2/pall/
mkdir networks && cd networks
ln -s ../LI11.pall.score2.gninfo.bedpe raw_data_liver.full.bedpe

bedtools

One can load the appropriate modules with:

conda activate base && module load bioinfo/bedtools-2.27.1  

Summary statistics with R

We need to load a module containing R to be able to launch the compute_summary_stats_enhancer_list.R script:

module load system/R-4.1.1_gcc-9.3.0
cp /work2/project/regenet/results/multi/abc.model/Nasser2021/compute_summary_stats_enhancer_list.R .

We can execute the R script with:

Rscript compute_summary_stats_enhancer_list.R <relative_path_to_enhancers_list>

Extracting data of interest

Note that the LI11 repository already corresponds to putative E-G pairs in liver cells. The list of all available cell types can be found here; /work2/project/regenet/results/phic/homo_sapiens/hg19/jung.ren.2019/metadata.tsv.

There are 12 columns:

head -n 1 raw_data_liver.full.bedpe
chr1  943049  965801  chr1    1183838 1209657 chr1:943049:965801,chr1:1183838:1209657 2.76719343781528..  pp  11837   22511   7   242322  1.9015  3.6813  1.1317  2.1959  2   chr1:948802:949920:+:ENSG00000187608.5:ISG15,chr1:955502:991496:+:ENSG00000188157.9:AGRN,   1   chr1:1189288:1209265:-:ENSG00000160087.16:UBE2J2,   elt2B   218037
1 chr1 # promoter of the gene
2 start1 # 0-based
3 end1 # 1-based
4 chr2 # elt2 (putative enhancer)
5 start2 # 0-based
6 end2 # 1-based
7 short name <chr1:start1:end1, chr2:start2:end2> # <prom_coord, elt2_coord>
8 strand1 # promoter
9 strand2 # elt2
10 score (does not matter because all p-values are small enough in those files)
11 type of relation (pp or po)  # I think that
                              # pp stands for promoter-promoter
                              # po stands for promoter-other
12-19 does not matter # cf mail "sens des champs des *gninfo.bedpe.gz" for details
20 number of genes whose tss+-500bp overlaps the prom parts
21 list of such genes in the form <chr1:start1:end1:strand1:gene_id:gene_symbol,chr2:...,...>
22 number of genes whose tss+-500bp overlaps the elt2 part
23 the list of such genes in the same form as for prom (21)
24 type of elt2 # two possible values:
              # elt2A if the element 2 is never a promoter anywhere else in the file
              # elt2B otherwise, that is if the element 2 also appears as a promoter somewhere else in the file
25 interfragment distance (from end to beg)
awk -F "\t" '{types[$24]++} END{for(u in types){print u, types[u]}}' raw_data_liver.full.bedpe |sort -nrk2,2
elt2A 33553
elt2B 4706

Exploration

Are there self-promoters ? If so, we have to remove them for comparison purpose with ABC data

awk -F "\t" '$24~/(^elt2B$)/ {if($1":"$2":"$3==$4":"$5":"$6){print $0}}' raw_data_liver.bedpe |wc -l
0

=> No. OK.

Main filters

Remove entries for which no gene TSS+-500 bp overlaps the promoter

This is the case for 46 entires:

awk -F "\t" '{nb[$20]++} END{for(u in nb){print u, nb[u]}}' raw_data_liver.full.bedpe |sort -nrk2,2
1 30065
2 6672
3 1114
4 286
0 46
5 39
6 30
7 7
awk -F "\t" '$20>0' raw_data_liver.full.bedpe > temp.bedpe
mv temp.bedpe raw_data_liver.full.bedpe
awk -F "\t" '{nb[$20]++} END{for(u in nb){print u, nb[u]}}' raw_data_liver.full.bedpe |sort -nrk2,2
1 30065
2 6672
3 1114
4 286
5 39
6 30
7 7

Deal with pp type of relation

Well there should be no problem keeping them:

awk -F "\t" '{type[$11]++} END{for(u in type){print u, type[u]}}' raw_data_liver.full.bedpe |sort -nrk2,2
po 33506
pp 4707

Duplicates entries for which the E-Prom pair is associated to multiple E-G pairs, and modify file structure

awk -F "\t" 'BEGIN{OFS="\t"} {split($21,genes,","); i=0; n=length(genes)-1; while(i<n){i=i+1; gene=genes[i]; split(gene,parts,":"); entry[$4"\t"$5"\t"$6"\t"parts[1]"\t"parts[2]"\t"parts[3]"\t"parts[6]"\t"parts[5]]=$4"\t"$5"\t"$6"\t"parts[1]"\t"parts[2]"\t"parts[3]"\t"$4":"$5"-"$6"::"parts[5]"::"parts[6]"\t"$8"\t"$10"\t"$9"\tliver\t"parts[6]"\t"$25"\t"parts[5]}} END{for(u in entry){print entry[u]}}' raw_data_liver.full.bedpe > liver.all_putative_enhancers.bedpe

OK.

Warning:

  • in raw_data_liver.full.bedpe, element 1 is the promoter and element 2 is the putative enhancer
  • in liver.all_putative_enhancers.bedpe, element 1 is the putative enhancer and element 2 is the gene

Also, let us note that there are some duplicates, ie in a few cases (279), multiple entries in the original raw file, correspond to the same "id" (coordinates of the putative enhancer + coordinates, id and name of the gene) in the new file:

awk -F "\t" 'BEGIN{OFS="\t"} {split($21,genes,","); i=0; n=length(genes)-1; while(i<n){i=i+1; gene=genes[i]; split(gene,parts,":"); entry[$4":"$5":"$6"::"parts[1]":"parts[2]":"parts[3]"::"parts[6]"::"parts[5]]++}} END{for(u in entry){if(entry[u]>1){print u, entry[u]}}}' raw_data_liver.full.bedpe |sort -nrk2,2 |wc -l

279

More precisely, there are 2 new entries that both match 3 old entries, and 277 new entries matching 2 old entries each.

The reason must be (I did not verify though) that for some given putative enhancer, let us say for E, there are multiple promoters in the original file, let's say P1 and P2, each one overlapping the same TSS+- 500 bp of a gene G.

Well, for now we don't care.

cp /work2/project/regenet/results/multi/abc.model/Nasser2021/chr_sizes chr_sizes
bedtools sort -faidx chr_sizes -i liver.all_putative_enhancers.bedpe > liver.all_putative_enhancers.sorted.bedpe

Extract enhancers

Now we extract the list of all enhancers in our liver biosample:

awk 'BEGIN{FS="\t"; OFS="\t"} {print $1, $2, $3}' liver.all_putative_enhancers.sorted.bedpe |uniq > list_all_enhancers.bed

OK. Contains 31,749 enhancers.

And we define the list of merged enhancers, such that none are overlapping:

bedtools merge -i list_all_enhancers.bed > list_all_enhancers.merged_liver.bed

It did not change anything, so there was no overlap, which is expected as we are focused on only 1 biosample.

rm list_all_enhancers.merged_liver.bed

Here are the summary statistics on those enhancers:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1111    4247    5596    7007    7756 1553733

Note that they are pretty large!

Overlap between enhancers and ccRE-ELS

We take the ccRE from /work2/project/regenet/results/multi/bengi/Annotations/hg19-cCREs.bed.

ln -s /work2/project/regenet/results/multi/bengi/Annotations/hg19-cCREs.bed .
awk -F "\t" '$6~/(^Enhancer-like$)/' hg19-cCREs.bed > ccRE-ELS.bed

989,712 ccRE-ELS

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   50.0   247.0   352.0   423.4   519.0 16633.0
bedtools intersect -c -a list_all_enhancers.bed -b ccRE-ELS.bed |awk 'BEGIN{FS="\t"; OFS="\t"} {nbs[$4]++} END{for(u in nbs){print u, nbs[u]}}' |sort -nrk2,2   
0 6149
1 4355
2 4238
3 3869
4 3302
5 2650
6 1975
7 1402
8 1040
9 744
10    568
11    363
12    247
13    176
14    148
15    105
16    90
17    73
18    48
19    34
20    31
22    19
23    18
21    18
25    13
24    13
26    9
34    7
30    7
28    7
27    5
32    4
29    4
33    3
49    2
39    2
37    2
36    2
31    2
51    1
42    1
41    1
40    1
35    1

6,149 (19%) of putative enhancers, do not match any ccRE-ELS. Only 4,355 (14%) of putative enhancers match exactly one ccRE-ELS.

Overlap between enhancers and all ccRE

bedtools intersect -c -a list_all_enhancers.bed -b hg19-cCREs.bed |awk 'BEGIN{FS="\t"; OFS="\t"} {nbs[$4]++} END{for(u in nbs){print u, nbs[u]}}' |sort -nrk2,2
3 3962
2 3805
4 3629
1 3524
0 3184
5 3010
6 2443
7 1757
8 1397
9 998
10    785
11    589
12    435
13    351
14    292
15    237
16    198
17    162
19    131
18    120
20    93
21    76
22    75
24    63
23    56
26    36
28    34
25    34
29    32
27    25
30    24
33    22
31    22
32    16
36    13
35    13
38    12
37    12
34    12
41    11
40    6
39    5
53    4
52    4
51    4
45    4
42    4
50    3
49    3
44    3
43    3
48    2
46    2
79    1
76    1
74    1
69    1
64    1
61    1
60    1
58    1
57    1
55    1
54    1
47    1

3,184 (10%) of all putative enhancers, do not match any ccRE.

Intersect enhancers with Nasser2021 enhancers

TODO. Not essential.

Keep only putative enhancers that overlap at least one ccRE-ELS

At the end of the day we chose to cast out putative enhancers that do not overlap any ccRE-ELS, in order to remove element-Gene pairs in which the element is highly suspected not to be an enhancer. Note that we have not done so for Nasser2021 enhancers (instead, for Nasser2021 enhancers, we will create a confidence label on the resulting predictions according to whether the enhancer overlaps a ccRE-ELS or not), but for the ones here it is necessary given the important size of elements: if they do not overlap any ccRE-ELS despite their large size, they are very unlikely to be actual enhancers.

Doing so, we lost 19% of our enhancers, resulting in a list of 31,749−6,149 = 25,600 enhancers:

bedtools intersect -wa -a list_all_enhancers.bed -b ccRE-ELS.bed |uniq > list_all_enhancers.overlapping_ccRE-ELS.bed
wc -l list_all_enhancers.overlapping_ccRE-ELS.bed

25,600

Then, over 48,038 initial putative E-G pairs in liver, there are 39,252 remaining pairs for which the enhancer overlaps at least one ccRE-ELS.

awk -F "\t" 'BEGIN{OFS="\t"} {if(NR==FNR){found[$1":"$2"-"$3]++; next}; if(found[$1":"$2"-"$3]){print $0}}' list_all_enhancers.overlapping_ccRE-ELS.bed liver.all_putative_enhancers.sorted.bedpe > liver.all_putative_enhancers.overlapping_ccRE-ELS.sorted.bedpe
wc -l liver.all_putative_enhancers.overlapping_ccRE-ELS.sorted.bedpe

39,252

The file liver.all_putative_enhancers.overlapping_ccRE-ELS.sorted.bedpe is the one we will use from now.