Are the 10 genes of interest more connected to other genes, than genes are in general?¶
To answer this question, we need to compute:
-
A table indicating for each gene of our annotation, the list of enhancers regulating it (easy with Awk)
-
Another table indicating for each enhancer of our annotation, the list of genes regulated by it (easy with Awk)
Then, we compute, with R:
- The distribution of the number of connections made by our 10 genes, with R (well, with 10 genes only, better compute only mean, min and max)
- The distribution of the number of connections made by all genes
All genes¶
Compute tables¶
cd .../networks_hemochromatosis/data
awk 'BEGIN{FS="\t"; OFS="\t"} {split($7,parts,"::"); e=parts[1]; g=parts[2]; if(!new[e,g]++){if(reg[g]){reg[g]=reg[g]","e} else {reg[g]=e}}} END{for(g in reg){print g, reg[g]}}' <(head -n 1000 Nasser2021ABCPredictions.all_biosamples.all_putative_enhancers.merged_enhancers.sorted.bedpe) |head
awk 'BEGIN{FS="\t"; OFS="\t"} {split($7,parts,"::"); e=parts[1]; g=parts[2]; if(!new[e,g]++){if(reg[g]){reg[g]=reg[g]","e} else {reg[g]=e}}} END{for(g in reg){print g, reg[g]}}' Nasser2021ABCPredictions.all_biosamples.all_putative_enhancers.merged_enhancers.sorted.bedpe |sort -dk1,1 > g_to_regE.tsv
awk 'BEGIN{FS="\t"; OFS="\t"} {split($7,parts,"::"); e=parts[1]; g=parts[2]; if(!new[e,g]++){if(reg[e]){reg[e]=reg[e]","g} else {reg[e]=g}}} END{for(e in reg){print e, reg[e]}}' <(head -n 1000 Nasser2021ABCPredictions.all_biosamples.all_putative_enhancers.merged_enhancers.sorted.bedpe) |head
awk 'BEGIN{FS="\t"; OFS="\t"} {split($7,parts,"::"); e=parts[1]; g=parts[2]; if(!new[e,g]++){if(reg[e]){reg[e]=reg[e]","g} else {reg[e]=g}}} END{for(e in reg){print e, reg[e]}}' Nasser2021ABCPredictions.all_biosamples.all_putative_enhancers.merged_enhancers.sorted.bedpe > e_to_regG.tsv
By the way, we also computed a table giving, for each enhancer, the number of genes it is connected to.
awk 'BEGIN{FS="\t"; OFS="\t"} {if(!new[$7]++){split($7,parts,"::"); nb[parts[1]]++}} END{for(e in nb){print e, nb[e]}}' <(head -n 1000 Nasser2021ABCPredictions.all_biosamples.all_putative_enhancers.merged_enhancers.sorted.bedpe) |head
awk 'BEGIN{FS="\t"; OFS="\t"} {if(!new[$7]++){split($7,parts,"::"); nb[parts[1]]++}} END{for(e in nb){print e, nb[e]}}' Nasser2021ABCPredictions.all_biosamples.all_putative_enhancers.merged_enhancers.sorted.bedpe > e_to_nbG.tsv
Compute distributions¶
cd .../shared/automne_2021/networks_hemochromatosis/
We use the script compute_avg_nb_of_connections_per_gene.R
, the content of which is:
rm(list = ls())
options <- commandArgs(trailingOnly = TRUE)
table1_file = options[1]
table2_file = options[2]
print("Loading input...")
if (length(options)==3) {
gene_source_file = options[3]
gene.source = read.table(gene_source_file, sep = "\t")[,c(1)]
} else{
gene.source = NULL
}
gE = as.data.frame(read.table(table1_file, sep = "\t"))
eG = as.data.frame(read.table(table2_file, sep = "\t"))
print("Done.")
colnames(gE) = c("gene", "enhancer.list")
colnames(eG) = c("enhancer", "gene.list")
# For development purpose only, reduce the dimensions:
#gE = gE[1:10000,]
#eG = eG[1:10000,]
if(!is.null(gene.source)){
gE = gE[gE$gene %in% gene.source,]
}
print("Computing the list of targeted genes for each gene...")
gE$target.genes.list = lapply(gE$enhancer.list, function(liste){
list_of_list_of_genes = eG[eG$enhancer %in% unlist(strsplit(liste,',')),]$gene.list
list_of_genes = Reduce(function(x,y) {
paste(unlist(sort(unique(c(strsplit(x,',')[[1]],strsplit(y,',')[[1]])))), collapse=',')
}, list_of_list_of_genes)
return(list_of_genes)
})
gE$target.genes.list = as.character(gE$target.genes.list)
print("Done.")
print("Counting the number of targeted genes for each gene...")
gE$target.genes.nb = lapply(gE$target.genes.list, function(liste){
if(liste=="NULL"){
res = 0}
else{
res = length(unlist(strsplit(liste, ',')))}
return(res)
})
gE$target.genes.nb = as.numeric(gE$target.genes.nb)
print("Done...")
print("Summary statistics of number of target genes per gene:")
summary(gE$target.genes.nb)
Well I should have done this with Python and pandas
rather than R! The execution is very slow:
Rscript compute_avg_nb_of_connections_per_gene.R data/g_to_regE.tsv data/e_to_regG.tsv 10genes.list
Min. 1st Qu. Median Mean 3rd Qu. Max. 1.00 36.00 53.00 61.84 82.00 237.00
10 initial genes¶
Rscript compute_avg_nb_of_connections_per_gene.R data/g_to_regE.tsv data/e_to_regG.tsv 10genes.list
Min. 1st Qu. Median Mean 3rd Qu. Max. 28.0 40.0 72.0 77.7 101.8 160.0
Coding genes¶
Extract full lists of coding / non-coding genes from RefSeq annotation¶
We go back on Genotoul.
cd /work2/project/regenet/results/multi/abc.model/Nasser2021/
wc -l all_genes_in_eg_pairs_131_biosamples.list
23,220
Our RefSeq annotation is in /work2/project/regenet/workspace/thoellinger/data/GRCh37/GRCh37.p13/genes.gtf
(private). We copy it here as genes.refseq.gtf
cp /work2/project/regenet/workspace/thoellinger/data/GRCh37/GRCh37.p13/genes.gtf genes.refseq.gtf
which contains 48,531 lines.
Here are the types of genes it contains:
awk 'BEGIN{FS="\t"; OFS="\t"} {if($9~/(gene_biotype)/){split($9,fields,/(gene_biotype)/); split(fields[2],subf,"\""); biotypes[subf[2]]++}} END{for(u in biotypes){print u, biotypes[u]}}' genes.refseq.gtf |sort -nrk2,2
protein_coding 21663 pseudogene 15537 lncRNA 5706 miRNA 2091 transcribed_pseudogene 1268 tRNA 660 snoRNA 593 V_segment 307 V_segment_pseudogene 280 J_segment 111 snRNA 76 D_segment 60 C_region 32 guide_RNA 31 other 27 rRNA 23 antisense_RNA 19 misc_RNA 13 J_segment_pseudogene 11 C_region_pseudogene 7 Y_RNA 4 vault_RNA 4 scRNA 4 telomerase_RNA 1 RNase_P_RNA 1 RNase_MRP_RNA 1 ncRNA_pseudogene 1
So let's extract the list of protein coding genes only:
awk -F "\t" '$9~/(gene_biotype)/ {split($9,fields,/(gene_biotype)/); split(fields[2],sub1,"\""); if(sub1[2]=="protein_coding"){if($9~/(gene_id)/){split($9,fields,/(gene_id)/); split(fields[2],sub2,"\""); print sub2[2]}}}' genes.refseq.gtf |sort -dk1,1 |uniq > genes.protein_coding.refseq.list
Contains the 21,663 unique protein coding genes, OK.
Lets also extract the non-protein coding genes while we're here:
awk -F "\t" '$9~/(gene_biotype)/ {split($9,fields,/(gene_biotype)/); split(fields[2],sub1,"\""); if(sub1[2]!="protein_coding"){if($9~/(gene_id)/){split($9,fields,/(gene_id)/); split(fields[2],sub2,"\""); print sub2[2]}}}' genes.refseq.gtf |sort -dk1,1 |uniq > genes.not_protein_coding.refseq.list
Contains the 26,868 unique non -protein-coding genes.
And finally let's simply extract the list of all genes of the annotation
awk -F "\t" '$9~/(gene_id)/ {split($9,fields,/(gene_id)/); split(fields[2],subf,"\""); print subf[2]}' genes.refseq.gtf |sort -dk1,1 |uniq > genes.refseq.list
Contains the 48,531 unique genes in our RefSeq annotation.
Keep Nasser2021 genes that are found in our RefSeq annotation¶
Now let's extract from all the 23,220 genes in the 131 biosamples, the ones contained in the annotation:
awk -F "\t" 'NR==FNR {genes[$1]++; next}; genes[$1]' genes.refseq.list all_genes_in_eg_pairs_131_biosamples.list > all_genes_in_eg_pairs_131_biosamples.in_annotation.list
Contains 21,845 genes.
Extract lists of coding / non-coding genes both in our RefSeq annotation and in Nasser2021 list of genes¶
awk -F "\t" 'NR==FNR {genes[$1]++; next}; genes[$1]' genes.protein_coding.refseq.list all_genes_in_eg_pairs_131_biosamples.list > all_genes_in_eg_pairs_131_biosamples.in_annotation.protein_coding.list
Contains 17,665 genes.
awk -F "\t" 'NR==FNR {genes[$1]++; next}; genes[$1]' genes.not_protein_coding.refseq.list all_genes_in_eg_pairs_131_biosamples.list > all_genes_in_eg_pairs_131_biosamples.in_annotation.non_protein_coding.list
Compute statistics on number of targeted genes¶
TODO
Non coding genes¶
TODO