DeepCurate

1. June 2021

DeepCurate is a project using machine learning to transfer life science research results from the specialist literature into a structured, machine-readable form. The project is funded by the federal Ministry of Education and Research (BMBF) for three years, starting in January 2020.

DeepCurate: OCR representation of a scanned document (left) and automatically extracted database records (right).

Experimental data from biochemical reactions and their reaction kinetic properties are very important for research in biotechnology, medical treatment methods or diagnostics. Most of this data is published in conventional literature, not or only weakly structured. The current practice is the manual extraction and curation by human experts. SABIO-RK, developed by the SDBV group, is a curated database on biochemical reactions and their reaction kinetic properties.

In an innovative approach, DeepCurate combines different sources that are used in the curation process. DeepCurate introduces database-to-paper backprojection to define the exact data location of exctracted data. For each SABIO-RK data item, the corresponding text locations are found, thus creating rich and highly precise training data for machine-learning-based data extraction methods. Subsequently these will be used together with other training data to improve Deep Learning based NLP methods. The purpose is to integrate these findings into curation pipelines of SABIO-RK and other systems.

Contributors:

Wolfgang Müller (SDBV)
Michael Strube (NLP)

Members:

Sucheta Ghosh
Mark-Christoph Müller
Maja Rey

Publications:

Müller M, Ghosh S, Rey M, Wittig U, Müller W, Strube M (2020). Reconstructing Manual Information Extraction with DB-to-Document Backprojection: Experiments in the Life Science Domain, In Proceedings of the First Workshop on Scholarly Document Processing, Online, November 2020, pp. 81-90. DOI: 10.18653/v1/2020.sdp-1.9  

Müller M, Ghosh S, Wittig U, Rey M (2021). Word-Level Alignment of Paper Documents with their Electronic Full-Text Counterparts, In Proceedings of the 20th SIGBioMed Workshop on Biomedical Language Processing, BioNLP 2021, Online, June 11, 2021. https://arxiv.org/abs/2104.14925

Switch to the German homepage or stay on this page