Beyond the exome

DFG-funded Research Group (FOR 2841)

P05: A comprehensive repository of regulatory elements and their variations in human disease

Open positions: 1 PhD student

Principal investigators: Prof. Dr. Ulf Leser and Prof. Dr. Dominik Seelow

A comprehensive and high-quality account of the current state of knowledge of human gene regulation is a prerequisite for the design of future experiments and fundamental for all subprojects in this research unit. Regulatory data are currently dispersed over a multitude of databases with different scopes and extents. The problem is compounded because the majority of generated data are only available within scientific publications. The gene regulatory effect of variants within regulatory elements is particularly difficult to identify. Such data can either be searched manually, an error-prone and exceedingly time-consuming process, or extracted semi-automatically by means of information extraction (IE) algorithms, also called biomedical text mining (BMT). We will use BMT to develop IDBReg: a comprehensive repository of regulatory elements, their effects on gene expression, the impact of variants on gene expression, and the association of these variants with human diseases. The repository will integrate data from existing publicly available databases and information extracted from the scientific literature. We will research novel methods for harmonizing the heterogeneous representation of regulatory information in different databases, for extracting and curating relevant information from the biomedical literature, and for storing all data together in a unified database schema. Data obtained by IE will be made available in two flavors: (i) a large data set extracted automatically using advanced language processing and machine learning methods, and (ii) a smaller data set of higher quality consisting of text-mined results that were subject to expert curation. The project will develop novel IE methods based on deep learning algorithms combined with a new curation framework employing active learning. Our results will be incorporated into the popular variant effect prediction tool MutationTaster.