NOTEBOOK: Convert BENGI txt to custom bedpe containing enhancer and gene loci + TSS lists¶
Jill E. Moore, Henry E. Pratt, Michael J. Purcaro and Zhiping Weng
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1924-8
How to use this notebook?¶
First, make a copy of this notebook on your computer / cluster. The notebook can be found at scripts/notebooks/ipynb/.
To use Jupyter Notebook through ssh tunneling, one can refer to the following tutorial: guidebooks/notebooks_ssh
Then, one should only have to carefully modify the content of the "Set working directory" section and execute the notebook cell by cell, in order. After each cell has run, check for errors before executing the next one.
Introduction¶
In this notebook we reprocess the BENGI datasets into .bedpe files containing comprehensive yet concise information for further statistical analysis with R.
Note: Moore et al. used only 54,846 genes (or at least provided TSS annotations for 54,846 genes) out of the 57,820 genes (coding + non-coding, of which 20,345 are protein-coding) present in our whole GENCODE annotation, and we did not find any straightforward way to determine which filter they applied to exclude the remaining 2,974 genes. Moreover, even when restricting to these 54,846 genes, we find 175,554 TSSs in our GENCODE annotation (out of a total of 178,758 TSSs), whereas the TSS list provided by Moore et al. contains only 167,147 TSSs. Again, we could not determine where this difference comes from.
In this notebook we use the TSS list provided by Moore et al., not our own.
Dependencies¶
BENGI: Benchmark of Enhancer-Gene Interactions.
One needs to download the BENGI datasets from the Weng lab, then unzip all gzipped files in BENGI/Benchmark/Annotations/:
gzip -cd GENCODEv19-TSSs.bed.gz > GENCODEv19-TSSs.bed
gzip -cd hg19-cCREs.bed.gz > hg19-cCREs.bed
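Equivalently, the decompression can be done from Python; a minimal sketch, assuming the current working directory is BENGI/Benchmark/Annotations/ (adjust the paths otherwise):
import gzip, shutil
# same effect as the two gzip -cd commands above
for name in ("GENCODEv19-TSSs.bed", "hg19-cCREs.bed"):
    with gzip.open(name + ".gz", "rb") as fin, open(name, "wb") as fout:
        shutil.copyfileobj(fin, fout)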
Data import¶
Set working directory¶
# Inserm:
#root_dir = "/home/thoellinger/Documents/"
# Personal:
#root_dir = "/home/hoellinger/Documents/INSERM/"
#annotation_dir =
# Genotoul:
annotation_dir = "/work2/project/fragencode/data/species.bk/homo_sapiens/hg19.gencv19/"
work_dir = "/work2/project/regenet/results/multi/bengi/"
# When not on Genotoul:
#work_dir = root_dir+"BENGI/Benchmark/"
#annotation_dir = root_dir+"data/homo_sapiens/hg19.gencv19/"
path_to_TSSs_annotation = work_dir+"Annotations/GENCODEv19-TSSs.bed" # provided by Moore et al
path_to_ccREs_annotation = work_dir+"Annotations/hg19-cCREs.bed" # provided by Moore et al
path_to_genes_annotation = annotation_dir+"homo_sapiens.gtf" # full hg19 gencv19 annotation, processed below
path_to_benchmarks = work_dir+"All-Pairs.Natural-Ratio/"
nb_benchmarks = 6
file_names = list()
# benchmark files names without extensions
file_names.append("GM12878.CHiC-Benchmark.v3")
file_names.append("GM12878.CTCF-ChIAPET-Benchmark.v3")
file_names.append("GM12878.GEUVADIS-Benchmark.v3")
file_names.append("GM12878.GTEx-Benchmark.v3")
file_names.append("GM12878.HiC-Benchmark.v3")
file_names.append("GM12878.RNAPII-ChIAPET-Benchmark.v3")
# short custom names for benchmarks, same order as above
names = ["CHiC", "CTCF", "GEUVADIS", "GTEx", "HiC", "RNAPII"]
# Nothing should need to be changed below this line
# ---------------------------------------------
col_names = ["ccRE", "gene", "interaction", "CV"] # column names in benchmarks
names_TSSs = ["chr", "start", "end", "transcript_id", "score", "strand", "gene_id"] # column names for TSSs annotation. "score" column is not used
names_ccREs = ["chr", "start", "end", "rDHS", "ccRE", "group"] # column names for ccREs. neither "rDHS" nor "group" are used.
files_list = list()
for k in range(nb_benchmarks):
    files_list.append(path_to_benchmarks+file_names[k]+".txt")
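Before importing anything, one may want to check that all input files defined above actually exist; a small sketch:
import os
# fail early if a benchmark or annotation file is missing
for path in files_list + [path_to_TSSs_annotation, path_to_ccREs_annotation, path_to_genes_annotation]:
    assert os.path.isfile(path), "missing input file: " + path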
Import standard libraries¶
import pandas as pd
Import files¶
benchmarks = list()
for k in range(nb_benchmarks):
    benchmarks.append(pd.read_csv(files_list[k], sep='\t', header=None, names=col_names, dtype='str', engine='c'))
TSSs = pd.read_csv(path_to_TSSs_annotation, sep='\t', header=None, names=names_TSSs, dtype='str', engine='c')
ccREs = pd.read_csv(path_to_ccREs_annotation, sep='\t', header=None, names=names_ccREs, dtype='str', engine='c')
# full GTF annotation; skiprows=5 skips the GTF "##" header lines
unprocessed_genes = pd.read_csv(path_to_genes_annotation, sep='\t', header=None, dtype='str', engine='c', skiprows=5)
# keep only "gene" features
unprocessed_genes = unprocessed_genes[unprocessed_genes.iloc[:,2]=="gene"]
# convert gene start positions from 1-based (GTF) to 0-based
unprocessed_genes.iloc[:,3] = (unprocessed_genes.iloc[:,3].astype(int) - 1).astype(str)
# shift TSS start and end coordinates by -1 as well
TSSs.iloc[:,[1, 2]] = (TSSs.iloc[:,[1, 2]].astype(int) - 1).astype(str)
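The counts discussed in the introduction can be cross-checked directly on the imported tables; a small sketch (the expected values are 57,820 annotated genes, and 54,846 genes / 167,147 TSSs in the list provided by Moore et al.):
# genes in the full annotation, then genes and TSSs in Moore et al.'s TSS list
print(len(unprocessed_genes), TSSs['gene_id'].nunique(), len(TSSs))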
Data reprocessing¶
genes annotation¶
# The following filter keeps all gene types found in `hg19.gencv19/homo_sapiens.gtf`
# One can easily remove any unwanted gene type
list_of_gene_types = ["protein_coding", "pseudogene", "lincRNA", "antisense", "processed_transcript", "miRNA", "misc_RNA", "snRNA", "snoRNA", "polymorphic_pseudogene", "sense_intronic", "rRNA", "sense_overlapping", "IG_V_gene", "TR_V_gene", "IG_V_pseudogene", "TR_J_gene", "IG_C_gene", "IG_D_gene", "3prime_overlapping_ncrna", "TR_V_pseudogene", "IG_J_gene", "Mt_tRNA", "TR_C_gene", "IG_C_pseudogene", "TR_J_pseudogene", "TR_D_gene", "IG_J_pseudogene", "Mt_rRNA"]
filter_valid_genes = unprocessed_genes[8].str.split('; ').str[2].str.split(' \"').str[1].str.rstrip('\"').isin(list_of_gene_types)
unprocessed_genes = unprocessed_genes[filter_valid_genes]
len(unprocessed_genes)
57820
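The positional split above assumes that gene_type is always the third attribute of a GTF record. As an order-independent cross-check, one can extract the field by name with a regular expression; a small sketch:
# extract gene_type by name rather than by position and verify that every
# remaining gene carries one of the expected gene types
gene_types = unprocessed_genes[8].str.extract(r'gene_type "([^"]+)"', expand=False)
assert gene_types.isin(list_of_gene_types).all()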
col_genes = ["chr","start", "end", "gene_id", "strand"]
last_cols_unprocessed_genes = unprocessed_genes[8].str.split('; ').str[0].str.split(' \"')
genes = pd.concat([unprocessed_genes.iloc[:, [0,3,4]], last_cols_unprocessed_genes.str[1].str.rstrip('\"').to_frame(), unprocessed_genes.iloc[:,6]], axis=1)
genes.columns = col_genes
genes.head(1)
 | chr | start | end | gene_id | strand
---|---|---|---|---|---
0 | chr1 | 11868 | 14412 | ENSG00000223972.4 | +
len(genes)
57820
# make sure there are no scaffolds
#print(list(genes['chr'].drop_duplicates()))
# make sure gene ids are not degenerate
#print(len(genes['gene_id'].drop_duplicates()))
#genes.iloc[test] # find degenerate ones
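The commented checks above can also be run as assertions; a small sketch (the scaffold test assumes scaffold names contain an underscore, e.g. chrUn_gl000220):
# no degenerate gene ids, no scaffolds
assert genes['gene_id'].is_unique, "degenerate gene ids"
assert not genes['chr'].str.contains('_').any(), "scaffolds present"
print(len(genes), "genes on", genes['chr'].nunique(), "chromosomes")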
TSSs annotation -> one list of TSSs per gene id¶
names_compacted_TSSs = [names_TSSs[6], "TSSs"]
TSSs_lists = TSSs.iloc[:,[1, 6]]
TSSs_lists = TSSs_lists.groupby(TSSs_lists.columns[1])[TSSs_lists.columns[0]].apply(lambda x: ','.join(x)).to_frame().reset_index()
TSSs_lists.columns = names_compacted_TSSs
TSSs_lists.head(1)
 | gene_id | TSSs
---|---|---
0 | ENSG00000000003.10 | 99891685,99891802,99894987
len(TSSs_lists)
54846
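If needed downstream, the comma-separated TSSs field can be turned back into per-gene lists of integer positions; a small sketch (not required for the export below):
# map each gene_id to its list of TSS positions
tss_of = dict(zip(TSSs_lists['gene_id'], TSSs_lists['TSSs'].str.split(',')))
print([int(p) for p in tss_of['ENSG00000000003.10']])  # [99891685, 99891802, 99894987]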
Gather all useful information in 1 dataframe per file¶
names_bedpe=['chrom1', 'start1', 'end1', 'chrom2', 'start2', 'end2', 'name', 'interaction', 'strand1', 'strand2', 'TSSs']
# chrom1, start1, end1 refer to genes; chrom2, start2, end2 to candidate regulatory elements (ccREs)
new_names = {col_genes[0]: names_bedpe[0], col_genes[1]: names_bedpe[1], col_genes[2]: names_bedpe[2], col_genes[3]: names_bedpe[6], col_genes[4]: names_bedpe[8]}
bedpe = [None]*nb_benchmarks
for k in range(nb_benchmarks):
    # inner-join genes with the benchmark pairs (CV column dropped), then with the
    # per-gene TSS lists, then left-join the ccRE coordinates (renamed chrom2/start2/end2)
    bedpe[k] = genes.copy().merge(benchmarks[k].drop(col_names[3], axis=1), how='inner', left_on=col_genes[3], right_on=col_names[1]).merge(TSSs_lists, how='inner', left_on=col_genes[3], right_on=names_TSSs[-1]).merge(ccREs.iloc[:,[0,1,2,4]].rename(columns={names_ccREs[0]: names_bedpe[3], names_ccREs[1]: names_bedpe[4], names_ccREs[2]: names_bedpe[5]}), how='left', left_on=col_names[0], right_on=names_ccREs[4])
    # build the "name" field as <gene_id>:<ccRE accession>
    bedpe[k].iloc[:,3] += ':'+bedpe[k].iloc[:,5].map(str)
    # drop the now redundant "ccRE" and "gene" columns
    bedpe[k].drop([col_names[0], col_names[1]], axis=1, inplace=True)
    # strip trailing whitespace from the interaction labels
    bedpe[k][col_names[2]] = bedpe[k][col_names[2]].str.rstrip(' ')
    # duplicate the gene strand as strand2 (ccREs are unstranded)
    bedpe[k].insert(5, names_bedpe[9], bedpe[k].iloc[:,4])
    bedpe[k].rename(columns=new_names, inplace=True)
    # reorder columns to match the bedpe layout defined in names_bedpe
    cols = bedpe[k].columns.tolist()
    new_cols = cols[:3] + cols[-3:] + [cols[3], cols[6]] + cols[4:6] + [cols[7]]
    bedpe[k] = bedpe[k][new_cols]
bedpe[0]
 | chrom1 | start1 | end1 | chrom2 | start2 | end2 | name | interaction | strand1 | strand2 | TSSs
---|---|---|---|---|---|---|---|---|---|---|---
0 | chr1 | 536815 | 659930 | chr1 | 927848 | 928157 | ENSG00000230021.3:EH37E0064164 | 0 | - | - | 655529,655573,655579,655579,659929
1 | chr1 | 621058 | 622053 | chr1 | 927848 | 928157 | ENSG00000185097.2:EH37E0064164 | 0 | - | - | 622052
2 | chr1 | 657471 | 660283 | chr1 | 927848 | 928157 | ENSG00000229376.3:EH37E0064164 | 0 | + | + | 657471
3 | chr1 | 677192 | 685396 | chr1 | 927848 | 928157 | ENSG00000235373.1:EH37E0064164 | 0 | - | - | 685395
4 | chr1 | 677192 | 685396 | chr1 | 974131 | 974485 | ENSG00000235373.1:EH37E0064194 | 0 | - | - | 685395
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
375723 | chrX | 154687177 | 154687276 | chrX | 154459477 | 154459854 | ENSG00000221603.1:EH37E1053818 | 0 | + | + | 154687177
375724 | chrX | 154689079 | 154689596 | chrX | 154459477 | 154459854 | ENSG00000185978.4:EH37E1053818 | 0 | - | - | 154689595
375725 | chrX | 154695630 | 154841277 | chrX | 154459477 | 154459854 | ENSG00000224533.3:EH37E1053818 | 0 | + | + | 154695630,154696200,154718983
375726 | chrX | 154697946 | 154716707 | chrX | 154459477 | 154459854 | ENSG00000225393.1:EH37E1053818 | 0 | + | + | 154697946
375727 | chrX | 154719775 | 154899605 | chrX | 154459477 | 154459854 | ENSG00000185973.6:EH37E1053818 | 0 | - | - | 154722149,154774937,154842537,154842555,154842...
375728 rows × 11 columns
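Before exporting, a quick per-benchmark summary can help spot obvious problems; a small sketch (positives are the pairs whose interaction field equals '1'):
# number of pairs and fraction of positives in each benchmark
for k in range(nb_benchmarks):
    n = len(bedpe[k])
    n_pos = (bedpe[k]['interaction'] == '1').sum()
    print("%s: %d pairs, %d positives (%.1f%%)" % (names[k], n, n_pos, 100.0*n_pos/n))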
Export new dataframes to bedpe¶
#for k in range(nb_benchmarks):
#    bedpe[k].to_csv(path_to_benchmarks+file_names[k]+".new.bedpe", sep='\t', header=False, index=False)
#    with open(path_to_benchmarks+file_names[k]+'.header.new.bedpe.txt', 'w') as f:
#        f.write("%s\n" % ', '.join(list(bedpe[k].columns)))
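To reload an exported file in a later session together with its column names, the header file written alongside can be reused; a small sketch (assuming the export above has been uncommented and run):
# reload one exported bedpe with the column names stored in its header file
k = 0
with open(path_to_benchmarks+file_names[k]+'.header.new.bedpe.txt') as f:
    cols = f.read().strip().split(', ')
reloaded = pd.read_csv(path_to_benchmarks+file_names[k]+".new.bedpe", sep='\t', header=None, names=cols, dtype='str')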