NOTEBOOK: Convert BENGI txt to custom bedpe containing enhancer and gene loci + TSS lists¶
Jill E. Moore, Henry E. Pratt, Michael J. Purcaro and Zhiping Weng
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1924-8
How to use this notebook?¶
First, make a copy of this notebook on your computer / cluster. The notebook can be found at scripts/notebooks/ipynb/.
To use Jupyter Notebook through ssh tunneling, one can refer to the following tutorial: guidebooks/notebooks_ssh
Then, one should only have to carefully modify the content of the "Set working directory" section and execute the notebook cell by cell, in order. After each cell has run, check for errors before executing the next one.
Introduction¶
In this notebook we reprocess the BENGI datasets into .bedpe files containing comprehensive yet concise information for further statistical analysis with R.
Note: Moore et al. used only 54,846 genes (or at least provided TSS annotations for 54,846 genes) out of the 57,820 genes (coding + non-coding, of which 20,345 are protein-coding) present in our whole GENCODE annotation, and we did not find any straightforward way to determine which filter they applied to exclude the remaining 2,974 genes. Moreover, even when restricting to these 54,846 genes, we find 175,554 TSSs in our GENCODE annotation (out of a total of 178,758 TSSs), whereas the TSS list provided by Moore et al. contains only 167,147 TSSs. Again, we could not determine where this difference comes from.
In this notebook we use the TSS list provided by Moore et al., not our own.
Dependencies¶
BENGI: Benchmark of Enhancer-Gene Interactions.
One needs to download the BENGI datasets from the Weng lab, then unzip all gzipped files in BENGI/Benchmark/Annotations/:
gzip -cd GENCODEv19-TSSs.bed.gz > GENCODEv19-TSSs.bed
gzip -cd hg19-cCREs.bed.gz > hg19-cCREs.bed
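Equivalently, the decompression can be done from Python; a minimal sketch, assuming the current working directory is BENGI/Benchmark/Annotations/ (adjust the paths otherwise):
import gzip, shutil
# same effect as the two gzip -cd commands above
for name in ("GENCODEv19-TSSs.bed", "hg19-cCREs.bed"):
    with gzip.open(name + ".gz", "rb") as fin, open(name, "wb") as fout:
        shutil.copyfileobj(fin, fout)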
Data import¶
Set working directory¶
# Inserm:
#root_dir = "/home/thoellinger/Documents/"
# Personal:
#root_dir = "/home/hoellinger/Documents/INSERM/"
#annotation_dir =
# Genotoul:
annotation_dir = "/work2/project/fragencode/data/species.bk/homo_sapiens/hg19.gencv19/"
work_dir = "/work2/project/regenet/results/multi/bengi/"
# When not on Genotoul:
#work_dir = root_dir+"BENGI/Benchmark/"
#annotation_dir = root_dir+"data/homo_sapiens/hg19.gencv19/"
path_to_TSSs_annotation = work_dir+"Annotations/GENCODEv19-TSSs.bed" # provided by Moore et al
path_to_ccREs_annotation = work_dir+"Annotations/hg19-cCREs.bed" # provided by Moore et al
path_to_genes_annotation = annotation_dir+"homo_sapiens.gtf" # full hg19 gencv19 annotation, processed below
path_to_benchmarks = work_dir+"All-Pairs.Natural-Ratio/"
nb_benchmarks = 6
file_names = list()
# benchmark files names without extensions
file_names.append("GM12878.CHiC-Benchmark.v3")
file_names.append("GM12878.CTCF-ChIAPET-Benchmark.v3")
file_names.append("GM12878.GEUVADIS-Benchmark.v3")
file_names.append("GM12878.GTEx-Benchmark.v3")
file_names.append("GM12878.HiC-Benchmark.v3")
file_names.append("GM12878.RNAPII-ChIAPET-Benchmark.v3")
# short custom names for benchmarks, same order as above
names = ["CHiC", "CTCF", "GEUVADIS", "GTEx", "HiC", "RNAPII"]
# Nothing should need to be changed below this line
# ---------------------------------------------
col_names = ["ccRE", "gene", "interaction", "CV"] # column names in benchmarks
names_TSSs = ["chr", "start", "end", "transcript_id", "score", "strand", "gene_id"] # column names for TSSs annotation. "score" column is not used
names_ccREs = ["chr", "start", "end", "rDHS", "ccRE", "group"] # column names for ccREs. neither "rDHS" nor "group" are used.
files_list = list()
for k in range(nb_benchmarks):
    files_list.append(path_to_benchmarks+file_names[k]+".txt")
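Before importing anything, one may want to check that all input files defined above actually exist; a small sketch:
import os
# fail early if a benchmark or annotation file is missing
for path in files_list + [path_to_TSSs_annotation, path_to_ccREs_annotation, path_to_genes_annotation]:
    assert os.path.isfile(path), "missing input file: " + path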
Import standard libraries¶
import pandas as pd
Import files¶
benchmarks = list()
for k in range(nb_benchmarks):
    benchmarks.append(pd.read_csv(files_list[k], sep='\t', header=None, names=col_names, dtype='str', engine='c'))
TSSs = pd.read_csv(path_to_TSSs_annotation, sep='\t', header=None, names=names_TSSs, dtype='str', engine='c')
ccREs = pd.read_csv(path_to_ccREs_annotation, sep='\t', header=None, names=names_ccREs, dtype='str', engine='c')
# full GTF annotation; skiprows=5 skips the GTF "##" header lines
unprocessed_genes = pd.read_csv(path_to_genes_annotation, sep='\t', header=None, dtype='str', engine='c', skiprows=5)
# keep only "gene" features
unprocessed_genes = unprocessed_genes[unprocessed_genes.iloc[:,2]=="gene"]
# convert gene start positions from 1-based (GTF) to 0-based
unprocessed_genes.iloc[:,3] = (unprocessed_genes.iloc[:,3].astype(int) - 1).astype(str)
# shift TSS start and end coordinates by -1 as well
TSSs.iloc[:,[1, 2]] = (TSSs.iloc[:,[1, 2]].astype(int) - 1).astype(str)
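The counts discussed in the introduction can be cross-checked directly on the imported tables; a small sketch (the expected values are 57,820 annotated genes, and 54,846 genes / 167,147 TSSs in the list provided by Moore et al.):
# genes in the full annotation, then genes and TSSs in Moore et al.'s TSS list
print(len(unprocessed_genes), TSSs['gene_id'].nunique(), len(TSSs))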
Data reprocessing¶
genes annotation¶
# The following filter keeps all gene types found in `hg19.gencv19/homo_sapiens.gtf`
# One can easily remove any unwanted gene type
list_of_gene_types = ["protein_coding", "pseudogene", "lincRNA", "antisense", "processed_transcript", "miRNA", "misc_RNA", "snRNA", "snoRNA", "polymorphic_pseudogene", "sense_intronic", "rRNA", "sense_overlapping", "IG_V_gene", "TR_V_gene", "IG_V_pseudogene", "TR_J_gene", "IG_C_gene", "IG_D_gene", "3prime_overlapping_ncrna", "TR_V_pseudogene", "IG_J_gene", "Mt_tRNA", "TR_C_gene", "IG_C_pseudogene", "TR_J_pseudogene", "TR_D_gene", "IG_J_pseudogene", "Mt_rRNA"]
filter_valid_genes = unprocessed_genes[8].str.split('; ').str[2].str.split(' \"').str[1].str.rstrip('\"').isin(list_of_gene_types)
unprocessed_genes = unprocessed_genes[filter_valid_genes]
len(unprocessed_genes)
57820
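The positional split above assumes that gene_type is always the third attribute of a GTF record. As an order-independent cross-check, one can extract the field by name with a regular expression; a small sketch:
# extract gene_type by name rather than by position and verify that every
# remaining gene carries one of the expected gene types
gene_types = unprocessed_genes[8].str.extract(r'gene_type "([^"]+)"', expand=False)
assert gene_types.isin(list_of_gene_types).all()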
col_genes = ["chr","start", "end", "gene_id", "strand"]
last_cols_unprocessed_genes = unprocessed_genes[8].str.split('; ').str[0].str.split(' \"')
genes = pd.concat([unprocessed_genes.iloc[:, [0,3,4]], last_cols_unprocessed_genes.str[1].str.rstrip('\"').to_frame(), unprocessed_genes.iloc[:,6]], axis=1)
genes.columns = col_genes
genes.head(1)
 | chr | start | end | gene_id | strand
---|---|---|---|---|---
0 | chr1 | 11868 | 14412 | ENSG00000223972.4 | +
len(genes)
57820
# make sure there are no scaffolds
#print(list(genes['chr'].drop_duplicates()))
# make sure gene ids are not degenerate
#print(len(genes['gene_id'].drop_duplicates()))
#genes.iloc[test] # find degenerate ones
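The commented checks above can also be run as assertions; a small sketch (the scaffold test assumes scaffold names contain an underscore, e.g. chrUn_gl000220):
# no degenerate gene ids, no scaffolds
assert genes['gene_id'].is_unique, "degenerate gene ids"
assert not genes['chr'].str.contains('_').any(), "scaffolds present"
print(len(genes), "genes on", genes['chr'].nunique(), "chromosomes")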
TSSs annotation -> one list of TSSs per gene id¶
names_compacted_TSSs = [names_TSSs[6], "TSSs"]
TSSs_lists = TSSs.iloc[:,[1, 6]]
TSSs_lists = TSSs_lists.groupby(TSSs_lists.columns[1])[TSSs_lists.columns[0]].apply(lambda x: ','.join(x)).to_frame().reset_index()
TSSs_lists.columns = names_compacted_TSSs
TSSs_lists.head(1)
 | gene_id | TSSs
---|---|---
0 | ENSG00000000003.10 | 99891685,99891802,99894987
len(TSSs_lists)
54846
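If needed downstream, the comma-separated TSSs field can be turned back into per-gene lists of integer positions; a small sketch (not required for the export below):
# map each gene_id to its list of TSS positions
tss_of = dict(zip(TSSs_lists['gene_id'], TSSs_lists['TSSs'].str.split(',')))
print([int(p) for p in tss_of['ENSG00000000003.10']])  # [99891685, 99891802, 99894987]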
Gather all useful information in 1 dataframe per file¶
names_bedpe=['chrom1', 'start1', 'end1', 'chrom2', 'start2', 'end2', 'name', 'interaction', 'strand1', 'strand2', 'TSSs']
# chrom1, start1, end1 refer to genes; chrom2, start2, end2 to candidate regulatory elements (ccREs)
new_names = {col_genes[0]: names_bedpe[0], col_genes[1]: names_bedpe[1], col_genes[2]: names_bedpe[2], col_genes[3]: names_bedpe[6], col_genes[4]: names_bedpe[8]}
bedpe = [None]*nb_benchmarks
for k in range(nb_benchmarks):
    # inner-join genes with the benchmark pairs (CV column dropped), then with the
    # per-gene TSS lists, then left-join the ccRE coordinates (renamed chrom2/start2/end2)
    bedpe[k] = genes.copy().merge(benchmarks[k].drop(col_names[3], axis=1), how='inner', left_on=col_genes[3], right_on=col_names[1]).merge(TSSs_lists, how='inner', left_on=col_genes[3], right_on=names_TSSs[-1]).merge(ccREs.iloc[:,[0,1,2,4]].rename(columns={names_ccREs[0]: names_bedpe[3], names_ccREs[1]: names_bedpe[4], names_ccREs[2]: names_bedpe[5]}), how='left', left_on=col_names[0], right_on=names_ccREs[4])
    # build the "name" field as <gene_id>:<ccRE accession>
    bedpe[k].iloc[:,3] += ':'+bedpe[k].iloc[:,5].map(str)
    # drop the now redundant "ccRE" and "gene" columns
    bedpe[k].drop([col_names[0], col_names[1]], axis=1, inplace=True)
    # strip trailing whitespace from the interaction labels
    bedpe[k][col_names[2]] = bedpe[k][col_names[2]].str.rstrip(' ')
    # duplicate the gene strand as strand2 (ccREs are unstranded)
    bedpe[k].insert(5, names_bedpe[9], bedpe[k].iloc[:,4])
    bedpe[k].rename(columns=new_names, inplace=True)
    # reorder columns to match the bedpe layout defined in names_bedpe
    cols = bedpe[k].columns.tolist()
    new_cols = cols[:3] + cols[-3:] + [cols[3], cols[6]] + cols[4:6] + [cols[7]]
    bedpe[k] = bedpe[k][new_cols]
bedpe[0]
 | chrom1 | start1 | end1 | chrom2 | start2 | end2 | name | interaction | strand1 | strand2 | TSSs
---|---|---|---|---|---|---|---|---|---|---|---
0 | chr1 | 536815 | 659930 | chr1 | 927848 | 928157 | ENSG00000230021.3:EH37E0064164 | 0 | - | - | 655529,655573,655579,655579,659929
1 | chr1 | 621058 | 622053 | chr1 | 927848 | 928157 | ENSG00000185097.2:EH37E0064164 | 0 | - | - | 622052
2 | chr1 | 657471 | 660283 | chr1 | 927848 | 928157 | ENSG00000229376.3:EH37E0064164 | 0 | + | + | 657471
3 | chr1 | 677192 | 685396 | chr1 | 927848 | 928157 | ENSG00000235373.1:EH37E0064164 | 0 | - | - | 685395
4 | chr1 | 677192 | 685396 | chr1 | 974131 | 974485 | ENSG00000235373.1:EH37E0064194 | 0 | - | - | 685395
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
375723 | chrX | 154687177 | 154687276 | chrX | 154459477 | 154459854 | ENSG00000221603.1:EH37E1053818 | 0 | + | + | 154687177
375724 | chrX | 154689079 | 154689596 | chrX | 154459477 | 154459854 | ENSG00000185978.4:EH37E1053818 | 0 | - | - | 154689595
375725 | chrX | 154695630 | 154841277 | chrX | 154459477 | 154459854 | ENSG00000224533.3:EH37E1053818 | 0 | + | + | 154695630,154696200,154718983
375726 | chrX | 154697946 | 154716707 | chrX | 154459477 | 154459854 | ENSG00000225393.1:EH37E1053818 | 0 | + | + | 154697946
375727 | chrX | 154719775 | 154899605 | chrX | 154459477 | 154459854 | ENSG00000185973.6:EH37E1053818 | 0 | - | - | 154722149,154774937,154842537,154842555,154842...
375728 rows × 11 columns
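Before exporting, a quick per-benchmark summary can help spot obvious problems; a small sketch (positives are the pairs whose interaction field equals '1'):
# number of pairs and fraction of positives in each benchmark
for k in range(nb_benchmarks):
    n = len(bedpe[k])
    n_pos = (bedpe[k]['interaction'] == '1').sum()
    print("%s: %d pairs, %d positives (%.1f%%)" % (names[k], n, n_pos, 100.0*n_pos/n))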
Export new dataframes to bedpe¶
#for k in range(nb_benchmarks):
#    bedpe[k].to_csv(path_to_benchmarks+file_names[k]+".new.bedpe", sep='\t', header=False, index=False)
#    with open(path_to_benchmarks+file_names[k]+'.header.new.bedpe.txt', 'w') as f:
#        f.write("%s\n" % ', '.join(list(bedpe[k].columns)))
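To reload an exported file in a later session together with its column names, the header file written alongside can be reused; a small sketch (assuming the export above has been uncommented and run):
# reload one exported bedpe with the column names stored in its header file
k = 0
with open(path_to_benchmarks+file_names[k]+'.header.new.bedpe.txt') as f:
    cols = f.read().strip().split(', ')
reloaded = pd.read_csv(path_to_benchmarks+file_names[k]+".new.bedpe", sep='\t', header=None, names=cols, dtype='str')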