Extracting Knowledge from Biomedical Literature

Key Tools and Research Outputs from the Computable Knowledge Project

In 2017, the University of Massachusetts Amherst’s Information Extraction and Synthesis Laboratory (IESL) was awarded the Computable Knowledge grant. In partnership with CZI’s Meta team, this work advanced the state of the art in extracting knowledge from scientific publications and explored new ways to construct and reason over scientific knowledge bases.

Biomedical research papers are published at a staggering rate. Every day, more than 4,000 new papers are posted to services such as PubMed and bioRxiv. Amidst the current coronavirus pandemic, the world is more aware than ever of the need to accelerate scientific progress. When scientists can quickly and comprehensively understand the advancing landscape of research, they can more rapidly make life-changing or life-saving discoveries—some of which require deep technical connections between disparate fields.

Understanding the vast and ever-growing technical literature is, of course, no small task. Tools for exploring the literature are crucial. Search engine-based technologies allow researchers to find papers that are topically related to their queries, but a search can still surface hundreds of relevant papers every day, well beyond any single researcher’s ability to read and understand.

We would like to automatically build a navigable map of biomedical science, one that supports efficient connection-finding, natural exploration of alternative routes, and deeper understanding of waypoints. Gathering evidence from each paper like a surveyor and map-maker, we collect sightings of entities (proteins, diseases, genes, etc.) as well as the relationships that connect those entities. Example relationships include gene-disease associations (such as biomarkers and genetic variations) and protein-protein interactions.

The extraction of these facts provides the basis for building knowledge graphs of science. Knowledge graphs are structured data stores that hold entities (such as MERS, a disease, or azathioprine, a chemical) and the relationships between those entities, such as biomarkers, interactions, or up-regulation. These knowledge graphs facilitate efficient indexing and thereby support the navigation of facts embedded in the technical language of the literature. At a glance, we can observe various properties of entities, such as the interactions of a drug or the co-morbidities of a disease. We can provide an indexing mechanism that lets users find papers by searching for specific entities rather than by text-based search, which may miss some of the aliases of an entity’s name. We can also provide the ability to retrieve papers, or even to directly answer questions, based on the relationships in the knowledge graph.
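
As a rough illustration of the structure described above, the following Python sketch stores entities with their aliases and typed relationships, and supports lookup by any alias. The class and method names (KnowledgeGraph, add_entity, add_relation, find, neighbors), the identifiers, and the example edge are hypothetical and chosen only for illustration; they are not taken from the project's software or from any real ontology.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Toy in-memory knowledge graph: entities, aliases, typed relations."""

    def __init__(self):
        self.entities = {}                   # entity id -> {"name", "type"}
        self.alias_index = defaultdict(set)  # lowercased alias -> entity ids
        self.edges = defaultdict(set)        # entity id -> {(relation, target id)}

    def add_entity(self, entity_id, name, entity_type, aliases=()):
        self.entities[entity_id] = {"name": name, "type": entity_type}
        for alias in (name, *aliases):
            self.alias_index[alias.lower()].add(entity_id)

    def add_relation(self, source_id, relation, target_id):
        self.edges[source_id].add((relation, target_id))

    def find(self, text):
        """Return ids of all entities known under this name or alias."""
        return sorted(self.alias_index.get(text.lower(), set()))

    def neighbors(self, entity_id, relation=None):
        """Return (relation, target id) pairs, optionally filtered by relation."""
        return sorted(
            (rel, tgt)
            for rel, tgt in self.edges.get(entity_id, set())
            if relation is None or rel == relation
        )


kg = KnowledgeGraph()
# Placeholder identifiers (D1, C1, D2), not real ontology ids.
kg.add_entity("D1", "Middle East respiratory syndrome", "disease", aliases=["MERS"])
kg.add_entity("C1", "azathioprine", "chemical")
kg.add_entity("D2", "rheumatoid arthritis", "disease")
# Illustrative edge only; a real graph would draw edges from curated or extracted facts.
kg.add_relation("C1", "treats", "D2")

print(kg.find("mers"))                 # lookup by alias, case-insensitive
print(kg.neighbors("C1", "treats"))    # relationships of one entity
```

Because the index is keyed by alias rather than by raw text, a search for "MERS" and a search for the full disease name resolve to the same entity, which is the behavior a plain text search can miss.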

An example knowledge graph consisting of genes, proteins, and diseases using relationships from DisGeNET v7 and NCBI Gene. | Image provided by the University of Massachusetts Amherst.

The need for these knowledge graphs goes far beyond indexing for search. Structured knowledge combined with text is the basis for reasoning over knowledge graphs, that is, for deriving new facts and inferences in order to answer questions and queries posed to the system. An intelligent system would be able to suggest connections, highlight relationships between entities, and surface links that would otherwise be difficult for scientists to discover given the volume of the literature.

We are working on ways to automatically build and update such knowledge graphs directly from the text of the biomedical literature. To do so, we need tools that can ingest large volumes of technical content and use machine learning to produce structured knowledge.

Given the text of a research paper, we need to identify all of the spans of text that refer to entities. Identifying these entity mentions allows us to determine whether the surrounding text describes any relationship between the entities. It also allows us to record which entities are mentioned in which document for the sake of building a semantic index. We developed several tools for discovering these biomedical entities, including ones that leverage multiple, disparate sets of labeled data.
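
To make the notion of a mention span concrete, here is a minimal sketch that returns character offsets for each mention it finds. It uses a simple dictionary lookup as a stand-in for the learned taggers described above, so it is not the project's method; the function name find_mentions, the small lexicon, and the example sentence are all assumptions made for illustration.

```python
import re


def find_mentions(text, lexicon):
    """Return (start, end, surface, type) for each lexicon term found in text.

    Longer terms are tried first, so "coagulation factor IX" wins over the
    shorter, overlapping term "factor IX".
    """
    mentions = []
    taken = [False] * len(text)  # marks characters already covered by a mention
    for term in sorted(lexicon, key=len, reverse=True):
        # Require non-word characters (or text boundaries) around the term.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            start, end = match.span()
            if any(taken[start:end]):
                continue  # overlaps a longer mention that was already recorded
            for i in range(start, end):
                taken[i] = True
            mentions.append((start, end, text[start:end], lexicon[term]))
    return sorted(mentions)


# Hypothetical lexicon and made-up sentence, for illustration only.
lexicon = {
    "chikungunya virus": "virus",
    "CHIKV": "virus",
    "coagulation factor IX": "protein",
    "factor IX": "protein",
}
text = "We studied Chikungunya virus (CHIKV) and coagulation factor IX."
mentions = find_mentions(text, lexicon)
for start, end, surface, entity_type in mentions:
    print(start, end, surface, entity_type)
```

Recording offsets rather than just matched strings is what allows later steps to look at the text surrounding a pair of mentions and to record which document each mention came from.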

These entity mentions are inherently ambiguous: one entity can appear under several names, and one name can refer to several entities. For instance, the Chikungunya virus may also be referred to as CHIKV, and F9 can refer to the Coagulation factor IX in a number of different species. We need to resolve the ambiguity of these entity mentions by linking them to knowledge graphs or ontologies, such as UMLS. This is necessary both to correctly associate documents with their mentioned entities in a semantic index and to correctly attribute relationships between entities.
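
The linking step can be sketched in two parts: look up the candidate entities for a mention in an alias table, then choose among them using the surrounding text. The example below is a deliberately simple version that scores candidates by keyword overlap with the context. The function name link_mention, the identifiers, and the keyword sets are hypothetical; a real system would draw candidates from an ontology such as UMLS and would typically use a learned scoring model instead.

```python
import re


def link_mention(mention, context, alias_table, entity_keywords):
    """Pick the candidate entity whose keywords overlap most with the context.

    Returns None when the alias table has no candidate for the mention.
    """
    candidates = alias_table.get(mention.lower(), [])
    if not candidates:
        return None
    context_tokens = set(re.findall(r"\w+", context.lower()))
    # On ties, max() keeps the first candidate listed in the alias table.
    return max(candidates, key=lambda eid: len(entity_keywords[eid] & context_tokens))


# Made-up identifiers and keyword sets, for illustration only.
alias_table = {
    "chikv": ["VIRUS:chikungunya"],
    "chikungunya virus": ["VIRUS:chikungunya"],
    "f9": ["GENE:F9-human", "GENE:F9-mouse"],
}
entity_keywords = {
    "VIRUS:chikungunya": {"virus", "infection"},
    "GENE:F9-human": {"human", "patient", "patients"},
    "GENE:F9-mouse": {"mouse", "mice", "murine"},
}

linked = link_mention(
    "F9", "F9 expression was measured in knockout mice.", alias_table, entity_keywords
)
print(linked)
```

Both "CHIKV" and "Chikungunya virus" map to the same identifier here, while "F9" is resolved by its context. A mention with no candidates returns None, which is the case where the entity may be missing from the knowledge base altogether.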

Entity mentions identified in the abstract of a PubMed article showing entity resolution decisions for these mentions as well as relationships between entities. | Image provided by the University of Massachusetts Amherst.

However, it may be the case that these mentions refer to entities that are not present in any existing knowledge base. For instance, the mentions may refer to newly emerging entities or missing information in existing resources. We hope to discover new entities and add these to the knowledge graph. To this end, we have developed general purpose incremental clustering algorithms. We have also developed proof-of-concept systems specifically for biomedical text. Our work attempts to operate on the continuous stream of newly arriving biomedical research, adaptively determining when new entities are arriving and efficiently reconsidering past work in the context of the new data. We have investigated how humans can interact with the entity resolution decisions.
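
The idea of deciding, as mentions arrive, whether each one belongs to a known entity or starts a new one can be sketched with a greedy online clusterer. This is a simplification under stated assumptions: it compares bags of context tokens with Jaccard similarity, uses an arbitrary threshold, and never revisits earlier decisions, whereas the algorithms described above also reconsider past work as new data arrives. The class name IncrementalClusterer and the token sets are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class IncrementalClusterer:
    """Greedy online clustering: join the most similar cluster or start a new one."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # assumed value, would need tuning on real data
        self.clusters = []          # each cluster: {"tokens": set, "members": list}

    def add(self, mention, context_tokens):
        """Assign one arriving mention; return (cluster index, is_new_entity)."""
        tokens = set(context_tokens)
        best_index, best_score = None, 0.0
        for index, cluster in enumerate(self.clusters):
            score = jaccard(tokens, cluster["tokens"])
            if score > best_score:
                best_index, best_score = index, score
        if best_index is not None and best_score >= self.threshold:
            cluster = self.clusters[best_index]
            cluster["tokens"] |= tokens
            cluster["members"].append(mention)
            return best_index, False
        self.clusters.append({"tokens": tokens, "members": [mention]})
        return len(self.clusters) - 1, True


clusterer = IncrementalClusterer(threshold=0.3)
# Context tokens are hand-picked for illustration, not extracted from real papers.
stream = [
    ("SARS-CoV-2", {"coronavirus", "spike", "ace2"}),
    ("2019-nCoV", {"coronavirus", "spike", "wuhan"}),
    ("CHIKV", {"alphavirus", "mosquito", "arthralgia"}),
]
assignments = [clusterer.add(mention, tokens) for mention, tokens in stream]
print(assignments)
print([cluster["members"] for cluster in clusterer.clusters])
```

In this toy stream the second mention shares enough context with the first to join its cluster, while the third starts a new cluster, which is how a previously unseen entity would surface.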

We hope to extract meaningful relationships between entities from the text of scientific documents. We train models that classify which relations are expressed in paragraph-level context surrounding pairs of entity mentions. We also develop methods that learn how to reason over these extracted relationships using case-based reasoning. Case-based reasoning makes inferences by retrieving and adapting inferences from similar situations in the past, providing not only a robust framework for reasoning, but also a framework that can provide some explanation for why the systems made such inferences. Case-based reasoning also provides interpretable predictions and can be used in open-world settings where data is continuously added and removed from knowledge graphs.
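
The retrieve-and-adapt pattern of case-based reasoning can be sketched over a small graph of triples. To answer a query of the form (entity, relation, ?), the sketch below retrieves other entities for which the relation is already known, collects the two-step relation paths that connect each of them to its known answer, and then follows those same paths from the query entity. This is a simplified illustration, not the project's implementation; the function names, the two-hop limit, the similarity measure (shared relation types), and the placeholder triples are all assumptions.

```python
from collections import defaultdict


def build_graph(triples):
    """Index (head, relation, tail) triples as head -> relation -> set of tails."""
    graph = defaultdict(lambda: defaultdict(set))
    for head, relation, tail in triples:
        graph[head][relation].add(tail)
    return graph


def follow(graph, start, path):
    """Return the entities reached from start by following a sequence of relations."""
    frontier = {start}
    for relation in path:
        frontier = {
            tail
            for node in frontier
            for tail in graph.get(node, {}).get(relation, ())
        }
    return frontier


def two_hop_paths(graph, start, target, skip_relation):
    """Relation paths of length two from start to target, ignoring the query relation."""
    paths = []
    for first, mids in graph.get(start, {}).items():
        if first == skip_relation:
            continue
        for mid in mids:
            for second, tails in graph.get(mid, {}).items():
                if target in tails:
                    paths.append((first, second))
    return paths


def cbr_predict(graph, entity, relation, num_cases=3):
    """Answer (entity, relation, ?) by reusing paths that worked for similar cases."""
    query_relations = set(graph.get(entity, {}))
    # Retrieve: entities with a known answer, ranked by shared relation types.
    cases = [e for e in graph if e != entity and relation in graph[e]]
    cases.sort(key=lambda e: len(set(graph[e]) & query_relations), reverse=True)
    support = defaultdict(list)  # candidate answer -> [(case, path), ...]
    for case in cases[:num_cases]:
        for answer in graph[case][relation]:
            # Reuse: apply each path that explains the known answer to the query.
            for path in two_hop_paths(graph, case, answer, relation):
                for candidate in follow(graph, entity, path):
                    support[candidate].append((case, path))
    return sorted(support.items(), key=lambda item: len(item[1]), reverse=True)


# Placeholder entities and relations; these are not real biomedical facts.
graph = build_graph([
    ("drugA", "inhibits", "geneX"),
    ("geneX", "associated_with", "diseaseP"),
    ("drugA", "may_treat", "diseaseP"),
    ("drugB", "inhibits", "geneY"),
    ("geneY", "associated_with", "diseaseQ"),
])
predictions = cbr_predict(graph, "drugB", "may_treat")
print(predictions)
```

Each prediction is returned together with the earlier case and the relation path that produced it, which is the kind of explanation the paragraph above refers to. Because nothing is trained, adding or removing triples changes the predictions immediately.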

We are also exploring the use of unsupervised pre-training methods to improve question-answering tasks in the biomedical domain. We introduced a new pre-training task that consists of corrupting a given passage by randomly replacing a mention of a biomedical entity and training the model to locate the corrupted mention, given the uncorrupted one. This joint project by CZI and UMass Amherst produced a highly ranked system in the BioASQ competition Task 8b Phase B.
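
One way to picture how a training example for such a task might be constructed is sketched below: pick one entity mention in a tokenized passage, swap in a different entity, and record where the swap happened. The function name make_corruption_example, the whitespace tokenization, the small entity vocabulary, and the choice to sample the replacement uniformly are assumptions made for illustration; the sketch covers only data construction, not the model or its training.

```python
import random


def make_corruption_example(tokens, mention_spans, entity_vocab, rng):
    """Replace one entity mention with a randomly chosen different entity.

    Returns (corrupted_tokens, query, answer_span). The query is the original,
    uncorrupted mention; answer_span is the token span of the replacement that
    a model would be trained to locate.
    """
    start, end = rng.choice(mention_spans)
    original = tokens[start:end]
    choices = [entity for entity in entity_vocab if entity != original]
    replacement = rng.choice(choices)
    corrupted = tokens[:start] + replacement + tokens[end:]
    return corrupted, original, (start, start + len(replacement))


# Made-up passage; mention spans are (start, end) token offsets, end exclusive.
tokens = "azathioprine is a chemical and MERS is a disease".split()
mention_spans = [(0, 1), (5, 6)]
entity_vocab = [
    ["azathioprine"],
    ["MERS"],
    ["Chikungunya", "virus"],
    ["coagulation", "factor", "IX"],
]

rng = random.Random(0)  # seeded so that repeated runs build the same example
corrupted, query, (span_start, span_end) = make_corruption_example(
    tokens, mention_spans, entity_vocab, rng
)
print(" ".join(corrupted))
print("query:", " ".join(query))
print("answer span:", (span_start, span_end))
```

Examples of this kind need no manual labels, since the location of the corruption is known by construction, which is what makes the task suitable for unsupervised pre-training.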

To show how such information could be made browsable, we developed KDCOVID, a system for searching the CORD-19 dataset, working with collaborators at CMU, Google, University of Toronto, and ETH Zurich. The system provides a search and QA tool that shows evidence from the full text of documents and highlights entity links and relations for the entities involved. For each retrieved document, it also provides a browsable knowledge graph expressing some of the key relationships between the entities of interest. KDCOVID and the resources mentioned above are all available as open source software.

We are excited that the tools, resources, and infrastructure being built through this collaboration between UMass Amherst and CZI will benefit researchers around the globe working on knowledge discovery to help accelerate science.

Learn more about CZI’s Open Science program and how we’re working to reduce barriers to knowledge discovery and access.

Andrew McCallum is a Distinguished Professor and Director of the Center for Data Science in the College of Information and Computer Sciences at the University of Massachusetts Amherst. He has published over 300 papers in many areas of artificial intelligence, including natural language processing, machine learning, data mining, and reinforcement learning; his work has received over 70,000 citations. In the early 2000s he was Vice President of Research and Development at WhizBang Labs, a 170-person start-up company that used machine learning for information extraction from the Web. He was named an AAAI Fellow in 2009 and an ACM Fellow in 2017. He was the Program Co-chair for the International Conference on Machine Learning (ICML) 2008, its General Chair in 2012, and from 2013 to 2017 was the President of the International Machine Learning Society. He organized the first workshop on Automated Knowledge Base Construction in 2009, and was the instigator and General Chair of the first international conference on Automated Knowledge Base Construction in 2019. He is also the creator of OpenReview.net, an open platform aiming to revolutionize scientific peer review.

Nicholas Monath is a PhD student in computer science at the University of Massachusetts Amherst, advised by Professor Andrew McCallum, in the Information Extraction and Synthesis Laboratory. His research focuses on machine learning and natural language processing, in particular, scalable and online methods for clustering and entity resolution.

Alex Wade is Technical Program Manager for the Chan Zuckerberg Initiative’s Open Science and Meta programs, working to build programs and technology to support open, reproducible, and accessible research. Alex manages the internal data science and research projects as well as technical and research partnerships.



Chan Zuckerberg Initiative Science

Supporting the science and technology that will make it possible to cure, prevent, or manage all diseases by the end of the century.