PhD project: Representation Learning for Corpus-level Biomedical Relation Extraction

Researchers are currently producing so many publications that it is impossible to keep up with the boom of discoveries, even within a single field. Biomedical information extraction (IE) encompasses methods that aim to automatically collect biomedical knowledge from the scientific literature. These techniques are considered crucial for efficient access to published results at a scale that can cope with scientific progress. IE is essential in database curation, the construction of comprehensive models of pathways and cells, and fields such as Personalised Medicine. A key task for IE is the extraction of relationships between entities, such as drugs or proteins that interact with each other in a pathway or cell. While considerable progress in IE has been made over the past two decades, deficits remain: almost all techniques have focused on extracting relationships from single sentences or single articles.

All sentence- and article-based methods suffer from severe disadvantages by design. A single record rarely provides enough evidence to establish the biological validity of a relationship, as the experimental evidence might be weak or limited to a very specific context. Statements in texts may be more speculative than confirmative, and different articles often contradict each other. Experts therefore usually (a) try to acquire a comprehensive picture of the published state of the art for any given question, and (b) need to include information from other sources when making informed decisions about relationships. There is no consensus on the best way to achieve this automatically. A solution will require finding suitable ways to encode the knowledge contained in large collections of texts and designing efficient approaches to integrate different kinds of information (e.g. textual, numerical, categorical and molecular data) that originate from various sources.

In my PhD project I will contribute to this question by examining, harnessing and combining multiple information sources, such as the entire corpus of literature available through PubMed and additional knowledge base information, with the aim of improving the extraction of information on biomedical relationships. My approach is fundamentally different from traditional approaches: I will classify relations on a global, corpus-based level instead of the sentence- or article-based level currently in use. In particular, I want to explore representation learning techniques: instead of explicitly, manually modelling the connections between biomedical concepts, I will apply methods capable of learning adequate representations for these concepts by exploring correlations in large collections of (textual) data.

Corpora and Datasets

• SCARE: German Corpus for Aspect-based Sentiment Analysis in App-Reviews (2016)

The SCARE corpus consists of fine-grained annotations for mobile application reviews from the Google Play Store. For each user review, the mentioned application aspects (e.g., the design or the usability) are annotated, as are the subjective phrases that evaluate these aspects. In addition, the polarity (positive, negative or neutral) of each subjective phrase is recorded, as well as the relationship of each aspect to the main app under discussion. In total, the corpus consists of 1,760 German application reviews with 2,487 aspects and 3,959 subjective phrases.

For further information see Website


  • BMC Medical Informatics and Decision Making, 2022, Website
  • Nature Scientific Reports, 2021, Website
  • The 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021, Website
  • 30th ACM International Conference on Information and Knowledge Management (CIKM 2021), 2021, Website
  • The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL), 2021, Website
  • European Chapter of the Association for Computational Linguistics (EACL), 2021, Website
  • BioNLP 2021 Workshop on Biomedical Natural Language Processing, 2021, Website
  • Health Informatics Journal (HIJ), 2021, Website
  • Journal on Data Semantics, 2020, Website
  • BioNLP 2020 Workshop on Biomedical Natural Language Processing, 2020, Website
  • BMC Bioinformatics, 2019|2020, Website