Evaluation of Moore et al 's DNase-expression correlation method over Fulco et al CRISPRi-FlowFISH validation dataset¶
Prerequisites¶
First we need to download the curated DNase counts from http://big.databio.org/papers/RED/supplement/ (see p. 14 of the BENGI paper).
Correlation-Methods/ └── Sheffield └── Signal-Matrices ├── dhs112_v3_int.bed ├── dhs112_v3_int.bed.gz └── exp112.bed
Note that, in the Python code, we will have to use the appropriate argument for the column 87 of exp112.bed
, corresponding to K562, to be selected instead of the one corresponding to GM12878. Fortunately, it suffices to provide "K562" as an argument for this purpose.
Partial reimplementation of Run-Sheffield.sh
¶
First of all we made a copy of all scripts, ie we duplicated Scripts
folder in a new folder which we named local_Scripts
.
We adapted of local_Scripts/Unsupervised-Methods/Run-Sheffield.sh
in local_Scripts/Unsupervised-Methods/CRISPRiFF_Run-Sheffield.sh
to use the CRISPRi-FlowFISH validation dataset.
#!/bin/bash
#################
## DEPENDENCIES #
#################
# bedtools
# python2
# scipy
#################
## INPUT ARGS: ##
#################
# $1: `intermediate_output` relative_path to parent folder for intermediate results.
# Folder need not exist yet.
# WARNING: some inputs are not passed as args, see code below!
#################
#### OUTPUT #####
#################
# els.bed: input data (DNase, expression) are intersected with ccREs
# Then, the result is sorted and uniqued.
# enhancer-matrix.bed
# genes
# genesFull
# Inserm computer
#workDir=~/Documents
# Personal computer
#workDir=~/Documents/INSERM
# Genotoul
workDir=/work2/project/regenet/workspace/thoellinger
ccres=$workDir/BENGI/Benchmark/Annotations/hg19-cCREs.bed
gnid_gname=/work2/project/fragencode/data/species.bk/homo_sapiens/hg19.gencv19/homo_sapiens.gnid.gnname.tsv
# One must create the following repository before launching the script:
signalDir=$workDir/BENGI/Correlation-Methods/Sheffield/
featureDir=$signalDir/Signal-Matrices # must contain DNase counts and gene expression
# "dhs112_v3_int.bed", "exp112.bed"
# see (http://big.databio.org/papers/RED/supplement/)
intermediate_outputs_dir=$1 # All output that are not final results are
# going to be stored here. Need not exist before launching the
# script
intermediate_outputs=$intermediate_outputs_dir/K562.CRISPRi.FlowFISH
genes=$intermediate_outputs/genes # need not exist yet
genesFull=$intermediate_outputs/genesFull # need not exist yet
els=$intermediate_outputs/els.bed # need not exist yet
enhancer_matrix=$intermediate_outputs/enhancer-matrix.txt # need not exist yet
biosample=K562
echo $biosample
train=/work2/project/regenet/workspace/thoellinger/CRISPRi_FlowFISH/k562/3672_ccRE_gene_pairs.fulco.txt
scriptDir=$workDir/BENGI/local_Scripts/Unsupervised-Methods
outputDir=$signalDir/Results
mkdir -p $outputDir
mkdir -p $intermediate_outputs
# dhs112_v3_int.bed (http://big.databio.org/papers/RED/supplement/) is space-delimited
# so we first replace spaces by tabs. It contains the DNase counts over whole genome
# for 112 cell types.
# exp112.bed: idem. It contains the gene expressions for 112 cell types.
if [[ ! -f $els ]]; then
echo "Finding ccRE-dELS in input data..."
awk 'FNR==NR {x[$1];next} ($5 in x)' $train $ccres | awk '{print $1 "\t" \
$2 "\t" $3 "\t" $5}' | sort -u > $els
fi
# Warning: the following requires a lot of memory
if [[ ! -f $featureDir/dhs112_v3_int_tab.bed ]]; then
echo "dhs112 => dhs112_tab..."
awk 'BEGIN{} {n=split($0, line, " "); ORS="\n"; print line[0]; ORS="\t"; for(u=1; u<n; u++){print line[u]}; ORS=""; print line[n]}' $featureDir/dhs112_v3_int.bed > $featureDir/dhs112_v3_int_tab.bed
fi
if [[ ! -f $enhancer_matrix ]]; then
echo "Building enhancer matrix (DNase counts for all the input cell types) using bedtools... This can take several minutes..."
bedtools intersect -wo -a $els -b $featureDir/dhs112_v3_int_tab.bed > $enhancer_matrix
fi
# enhancer-matrix.txt contains:
# $1 to $4: enhancer infos, bed-like format
# $5 to $7: for the current enhancer, chr, start and end of an overlapping DHS.
# $8 to $(n_columns-1): respective DNase counts for all the 112
# cell types encountered in dhs112_v3_int_tab.bed
# $n_columns: length (in bp) of the overlap
if [[ ! -f $genes ]]; then
echo "Finding all genes in input..."
cat $train | awk '{print $2}' | sort -u > $genes
fi
if [[ ! -f $genesFull ]]; then
echo "Computing gene list (<id> <name>)..."
awk 'FNR==NR {x[$1];next} ($1 in x)' $genes $gnid_gname > $genesFull
fi
if [[ ! -f $featureDir/exp112_tab.bed ]]; then
echo "Adapting expression file format..."
awk '{n=split($0, line, " "); ORS="\n"; if(NR==1){ORS=""}; print line[0]; ORS="\t"; for(u=1; u<n; u++){print line[u]}; ORS=""; print line[n]}' $featureDir/exp112.bed > $featureDir/exp112_tab.bed
fi
# we replaced `$stats` by NOTHING as we do not use this argument.
# "$biosample" is not a window dressing argument, it is the key of a dictionnary defined in the Python script, which provide the correct
# column of "exp112_tab.bed"
echo "Running python script... This can take several minutes..."
python $scriptDir/sheffield.correlation.py $featureDir/exp112_tab.bed $genesFull $enhancer_matrix NOTHING $train $biosample |sort -t $'\t' -k 5,6 > $outputDir/$biosample.CRISPRi-FlowFISH-Correlation.txt
Partial reimplementation of sheffield.correlation.py
¶
We adapted of local_Scripts/Unsupervised-Methods/sheffield.correlation.py
in local_Scripts/Unsupervised-Methods/CRISPRiFF_sheffield.correlation.py.sh
to use the CRISPRi-FlowFISH validation dataset.
Actually, we had nothing to change wrt our previous partial reimplementations.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys, scipy, math
from scipy import stats
import numpy as np # to reconstruct statDict "by hand"
from itertools import groupby
# We use all_equal to prevent a division-by-zero error in Calculate_Correlation
def all_equal(iterable):
g = groupby(iterable)
return next(g, True) and not next(g, False)
def Process_Biosample(biosample):
biosampleDict={"GM12878":28, "HeLa":50, "K562":88,"HMEC":59,"NHEK":100}
if biosample in biosampleDict:
return biosampleDict[biosample]
else:
return "no"
def Calculate_Correlation(array1, array2):
if not (all_equal(array1) or all_equal(array2)):
stat=stats.pearsonr(array1, array2)[0]
else:
stat = float('NaN')
return stat
def Create_Gene_Dict(genes):
geneDict={}
genes.next() #skips header
for line in genes:
line=line.rstrip().split("\t")
geneDict[line[3]]=[float(i) for i in line[5:]]
return geneDict
def Create_ELS_Dict(els, column):
elsDictA={}
elsDictB={}
for line in els:
line=line.rstrip().split("\t")
if column == "no":
score=max(float(i) for i in line[7:-1])
else:
score=float(line[column])
if line[3] not in elsDictA:
elsDictA[line[3]]=[float(line[-1]),score] # [size of overlap, counts fot the cell type considered]
elsDictB[line[3]]=[float(i) for i in line[7:-1]] # [size of overlap, [counts for all cell types]]
elif elsDictA[line[3]][0] < float(line[-1]):
# See comment below first. Here, same kind of thing, but it is about maximizing the overlap.
# Note that first the overlap is maximized, and only then the counts are maximized.
# This has SEVERE IMPLICATIONS: if there are 1000 counts for a candidate enhancer with which the
# overlap is 149, and at best 3 counts for enhancers with overlap 150, the ones with overlap 150 and 3
# counts only are kept...
# To prevent such a "side effect", one may replace the following "elif" with an "if"
# I personnaly think this side effect is undesirable, but here we keep the "elif"
# to obtain the same results as Moore et al.
elsDictA[line[3]]=[float(line[-1]),score]
elsDictB[line[3]]=[float(i) for i in line[7:-1]]
elif elsDictA[line[3]][1] < score: # note that all peak widths are 150
# if counts for another open chromatin region overlapping this enhancer has already been written,
# but are fewer than for the current open chromatin region overlapping this same enhancer,
# then we overwrite the counts with these new ones.
# At the end of the day, this results in taking into account, for each enhancer, only the open chromatin
# region that maximize the counts of DNase over a given enhancer ; although the selection is made among
# regions that maximize the overlap only, see comment above.
elsDictA[line[3]]=[float(line[-1]),score]
elsDictB[line[3]]=[float(i) for i in line[7:-1]]
# Well at the end of the day, elsDictB[key="enhancer id"] = [max overlap with a DHS, max counts in this DHS]
return elsDictB, elsDictA
def Create_Symbol_Dict(symbols):
symbolDict={}
for line in symbols:
line=line.rstrip().split("\t")
symbolDict[line[0]]=line[1]
return symbolDict
def Create_Stat_Dict(stats):
# For now we know that:
# stat (argv[4]) must contain at least 3 columns
# first column is a gene id (see for loop at the end of the code)
# 2nd and 3rd columns are float / real numbers.
# Both are later compared to correlations.
statDict={}
for line in stats:
line=line.rstrip().split("\t")
statDict[line[0].rstrip()]=[float(line[1]),float(line[2])]
return statDict
genes=open(sys.argv[1]) # gene expression
geneDict=Create_Gene_Dict(genes)
genes.close()
# geneDict[key = "gene NAME (not id)"] = [expression in cell line 1, expression in cell line 2, etc]
symbols=open(sys.argv[2]) # gene annotation
symbolDict=Create_Symbol_Dict(symbols)
symbols.close()
column=Process_Biosample(sys.argv[6]) # the correct column is 28 for GM12878
els=open(sys.argv[3]) # ccRE expression
elsDict, test =Create_ELS_Dict(els,column)
els.close()
# elsDict[key="enhancer id"] = [max overlap of this enhancer with a DHS, max counts for this max overlap]
# For now we try to rebuild statArray directly in this script.
#stat=open(sys.argv[4])
#statArray = Create_Stat_Dict(stat)
#stat.close()
# On essaye ci-dessous
# On le fait volontairement d'une maniere très peu optimisée, c'est juste pour tester, en évitant d'induire d'autres sources
# potentielles de problèmes
pairs=open(sys.argv[5]) # BENGI benchmark
all_cor = {}
for line in pairs:
line=line.rstrip().split("\t")
els=line[0] # candidate enhancer id
gene=symbolDict[line[1].rstrip()] # name (not id) of candidate paired gene
if els in elsDict and gene in geneDict:
if gene in all_cor:
all_cor[gene].append(Calculate_Correlation(elsDict[els],geneDict[gene]))
else:
all_cor[gene] = [Calculate_Correlation(elsDict[els],geneDict[gene])]
pairs.close()
# Now for each gene name of gene id found in the benchmark pairs, all_cor[gene] should
# be the list of the correlations of its expression accross all cell lines, with
# the counts of DNase accross all cell lines, for enhancers it is paired with
pairs=open(sys.argv[5]) # BENGI benchmark
statArray = {}
for line in pairs: #inutile il suffit de boucler sur tous les genes trouves, une seule fois chacun, qu'on aurait du enregistrer precedemment deja
line=line.rstrip().split("\t")
els=line[0] # candidate enhancer id
gene=symbolDict[line[1].rstrip()] # name (not id) of candidate paired gene
if els in elsDict and gene in geneDict:
if gene not in statArray:
statArray[gene] = [np.mean(all_cor[gene]), np.std(all_cor[gene])]
pairs.close()
pairs=open(sys.argv[5]) # BENGI benchmark
for line in pairs:
line=line.rstrip().split("\t")
els=line[0] # candidate enhancer
gene=symbolDict[line[1].rstrip()] # name (not id) of candidate paired gene
if els in elsDict and gene in geneDict:
cor=Calculate_Correlation(elsDict[els],geneDict[gene]) # correlation between counts of DNase over els and gene expression accross the 112 biosamples
# cor est la corrélation entre les counts max de DNase sur els, et l'expression du gène appairé avec cet els.
# On a dit que :
# statArray devait être issu d'un fichier de trois colonnes qui est tel que :
# La clef est un id de gene (visiblement plutôt un NOM de gène ! - pourquoi pas...)
# La première valeur est, pour chaque gène, la corrélation moyenne qu'on trouve entre ce gène et les ccRE auxquels il est appairé
# La deuxième valeur est, pour chaque gène, l'écart-type de ces mêmes corrélations.
if math.isnan(cor):
cor=0
if statArray[gene][1] != 0:
Z=(cor-statArray[gene][0])/statArray[gene][1]
else:
Z=0
p=stats.norm.sf(abs(Z))*2
print line[2], "\t", cor, "\t", p, "\t", Z, "\t", els, "\t", line[1]
else:
pass
print line[2], "\t", -100, "\t", 1, "\t", -100, "\t", els, "\t", line[1]
Running the code for BENGI benchmarks over GM12878¶
If working on Genotoul, start with:
conda activate base && conda activate abcmodel && conda activate py2 && module load bioinfo/bedtools-2.27.1
and before launching the script, do not forget (a lot of memory is required):
srun --mem=8G --pty bash
Now it suffices to run Run-Sheffield.sh
over the CRISPRi-FlowFISH validation dataset:
./local_Scripts/Unsupervised-Methods/CRISPRiFF_Run-Sheffield.sh Correlation-Methods/Sheffield
./local_Scripts/Unsupervised-Methods/Run-Sheffield.sh GM12878.CHiC v3 Correlation-Methods/Sheffield
K562 Finding ccRE-dELS in input data... Building enhancer matrix (DNase counts for all the input cell types) using bedtools... This can take several minutes... Finding all genes in input... Computing gene list (
)... Running python script... This can take several minutes...
Analysis with R¶
Code¶
.badCode {
background-color: #C9DDE4;
}
library(knitr)
## Global options
options(max.print="75")
opts_chunk$set(echo=TRUE,
cache=FALSE,
prompt=FALSE,
tidy=TRUE,
comment=NA,
message=FALSE,
warning=FALSE,
class.source="badCode")
opts_knit$set(width=75)
library(ggplot2)
library(ggpubr) # for ggarrange
library(dplyr) # for bind_rows
# Tools for precision-recall : (see https://classeval.wordpress.com/tools-for-roc-and-precision-recall/)
library(precrec)
#library(ROCR)
#library(pROC)
#library(PRROC)
rm(list = ls())
# Personal
work_dir = "~/Documents/INSERM/"
# Inserm
#work_dir = "~/Documents/"
#####################
## Distance method ##
#####################
path_to_results = paste(work_dir, "BENGI/Correlation-Methods/Sheffield/Results/", sep='')
file_names = c("K562.CRISPRi-FlowFISH-Correlation.txt")
short_names = c('CRiFF')
nb_files = length(file_names)
colnames <- c('interaction', 'cor', 'p', 'Z', 'ccRE', 'gene')
results <- sapply(file_names, simplify=FALSE, function(file_name){
Df <- read.table(paste(path_to_results, as.character(file_name), sep=''), sep='\t')
Df[[1]] <- factor(Df[[1]], levels=c(0,1), labels=c("no interaction", "interaction"))
names(Df) <- colnames
return(Df)
})
names(results) <- short_names
#library(dplyr)
All_results <- bind_rows(results, .id = 'method')
ggplot(aes(y = cor, x = method, fill = interaction), data = All_results) + geom_boxplot()
sscurves_dnase_expression <- list()
sscurves_dnase_expression <- sapply(results, simplify=FALSE, function(Df){
evalmod(scores = Df$cor, labels = Df$interaction) # comes with "precrec" library
})
#library(ggplot2)
p1 <- autoplot(sscurves_dnase_expression[[1]], curvetype = c("PRC")) + ggtitle(paste(short_names[1], signif(attr(sscurves_dnase_expression[[1]], 'auc')[[4]][2], digits=2), sep = " AUPR="))
# ggarrange comes with library('ggpubr')
figure <- ggarrange(p1,
ncol = 1, nrow = 1)
figure