A knowledge graph is a graph data structure that captures knowledge extracted from various sources and connects related entities into an interlinked network. A common source of information for knowledge graphs is text. The text-to-graph approach builds a structured representation of this knowledge and provides a bird's-eye view of the entire corpus.
Humans generate text-rich documents all the time: online survey forms, medical reports, insurance documents, and more. Knowledge graphs can capture this information as triples and embed it within nodes and edges, but extracting it from raw text is challenging.
In this article, we will discuss the rich knowledge that text documents contain, how knowledge graphs leverage this information, and how transformer-based models aid the text-to-graph pipeline.
Knowledge Graph Generation From Text
Conventional data resides in relational databases. This data lives within tables, and each entity relates to another via a row-and-column structure. That defined structure makes it straightforward to build semantics for a knowledge graph from tabular data. Text, on the other hand, is unstructured and requires extra effort to derive meaningful context and relationships from the entities within it.
Organizations generate large amounts of information in emails, letters, contracts, etc. These documents hold vital information that can add value to the company but remain unused. While humans can comprehend information in words and sentences, engineers must train computers separately for the same task. Take the following sentence:
“John is an engineer. He works at the Smith’s Engineering Firm in Washington.”
It tells us about John (name of the person), who is an engineer (occupation). John works at the Smith’s Engineering Firm (organization) in Washington (location). While reading this sentence, our brain distinguishes the different entities and determines the role each plays to build meaning out of it. Before building a knowledge graph, computers must perform the same steps: break the text down and tag the various entities it contains. Let’s talk about these steps in detail below.
- Coreference Resolution
Text often contains pronouns that reference previously mentioned entities. In their original form, these pronouns are a unique entity for the machine and cannot be related to any existing term. Coreference resolution is a Natural Language Processing (NLP) technique that works to identify such references and replace them with the original named entity.
Using the example from the last section, we see that the second sentence uses the pronoun ‘He’. ‘He’ refers to ‘John’; hence we replace it with the corresponding entity. The final sentence becomes the following:
“John is an engineer. <John> works at the Smith’s Engineering Firm in Washington.”
Coreference resolution removes ambiguity from text and provides meaning to every entity.
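As a toy sketch of this step, the Python function below replaces a standalone pronoun with a given antecedent. Real coreference systems infer the antecedent from context rather than taking it as a parameter; the function name and interface here are illustrative, not from any particular library.

```python
def resolve_pronouns(text, antecedent, pronouns=("He", "She", "They")):
    """Replace standalone pronouns with the given antecedent (toy example).

    A real coreference model would detect which entity each pronoun refers
    to; here the antecedent is supplied by the caller for illustration.
    """
    resolved = []
    for token in text.split():
        core = token.strip(".,!?")  # drop trailing punctuation for matching
        if core in pronouns:
            token = antecedent + token[len(core):]  # keep any punctuation
        resolved.append(token)
    return " ".join(resolved)

sentence = ("John is an engineer. "
            "He works at the Smith's Engineering Firm in Washington.")
print(resolve_pronouns(sentence, "John"))
# John is an engineer. John works at the Smith's Engineering Firm in Washington.
```

After this substitution, every mention of the person is an explicit named entity that later steps can tag and link.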
- Entity Recognition
Named entity recognition (NER) is a sub-domain of NLP that labels all the subjects and objects within a text based on contextual information. NER analyzes the entire text to identify all possible named entities within it. Next, it uses a machine learning model to classify these entities into predefined classes. Some common examples of these classes are:
- Person
- Organization
- Location
- Date and time
The illustration below shows how an NER model would tag our example text.
With the help of NER, our machine now understands that the text contains a person (John), an occupation (engineer), an organization’s name (Smith’s Engineering Firm), and a location (Washington).
For NER, contextual information is essential. In our example sentence, Washington could be a person or a place, since the word has both uses. The additional context from the text helps resolve the ambiguity and classify the term as a place.
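A minimal way to illustrate entity tagging is a gazetteer lookup: match known entity strings against the text and attach a class label. This is only a sketch; a trained NER model classifies unseen entities from context rather than from a fixed dictionary, and the `GAZETTEER` table below is made up for the running example.

```python
# Hypothetical gazetteer for the running example; a production NER model
# learns these classes from labeled data instead of a fixed lookup.
GAZETTEER = {
    "John": "PERSON",
    "engineer": "OCCUPATION",
    "Smith's Engineering Firm": "ORGANIZATION",
    "Washington": "LOCATION",  # ambiguous without context: person or place
}

def tag_entities(text):
    """Return (entity, label) pairs for gazetteer entries found in the text."""
    found = []
    # Try longer entity strings first so multi-word names win over substrings.
    for entity in sorted(GAZETTEER, key=len, reverse=True):
        if entity in text:
            found.append((entity, GAZETTEER[entity]))
    return found

sentence = ("John is an engineer. "
            "John works at the Smith's Engineering Firm in Washington.")
for entity, label in tag_entities(sentence):
    print(f"{entity} -> {label}")
```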
NER provides us with individual entities, but these need to be linked to build a semantic network.
- Relationship Extraction
Relationship extraction identifies and extracts semantic relationships from a text. Semantic relationships bring meaning to the text by linking entities and creating a flow between the words.
One possible approach to extracting relationships is to define rules. The algorithm matches pre-defined rules against the recognized entities in the text and uses linking phrases to determine relationships. An example could be “<person> lives in <location>”. This rule dictates that <person> and <location> have a semantic relationship defined by the words “lives in”. A rule-based approach’s performance depends on the rules created, and it is error-prone: text contains many variations, and enumerating every possible rule is nearly impossible.
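The rule-based approach can be sketched with a few regular expressions over the resolved example sentence. Each rule pairs a pattern with a relation name; the patterns and relation labels below are illustrative, not a standard scheme.

```python
import re

# Illustrative rules: (regex with two capture groups, relation label).
# Real rule sets operate over tagged entity spans, not raw surface text.
RULES = [
    (r"(\w+) is an (\w+)\.", "is_a"),
    (r"(\w+) works at the (.+?) in", "works_at"),
    (r"works at the (.+?) in (\w+)\.", "located_in"),
]

def extract_relations(text):
    """Return (head, relation, tail) triples matched by the rules."""
    triples = []
    for pattern, relation in RULES:
        match = re.search(pattern, text)
        if match:
            triples.append((match.group(1), relation, match.group(2)))
    return triples

sentence = ("John is an engineer. "
            "John works at the Smith's Engineering Firm in Washington.")
print(extract_relations(sentence))
# [('John', 'is_a', 'engineer'),
#  ('John', 'works_at', "Smith's Engineering Firm"),
#  ("Smith's Engineering Firm", 'located_in', 'Washington')]
```

Note how brittle this is: rephrasing the sentence as “John is employed by…” would break every rule, which is exactly the limitation described above.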
A more interesting approach is supervised machine learning. NLP models trained on relation extraction datasets can accurately predict the semantics contained within a text. Some common datasets include FewRel 2.0 and the New York Times Corpus. It is also important to note that the semantics of a text depend on the context of the written material: a model trained on medical notes might perform poorly on other kinds of text.
Text to Graph
By this point, we have all the elements for knowledge graph generation from text. Our named entities will form the graph's nodes, and the relationships between them will serve as the semantic connections represented on the graph's edges.
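A minimal sketch of this assembly step stores each extracted triple in an adjacency map, with entities as nodes and relation labels on the edges. The triples are taken from the article's running example.

```python
from collections import defaultdict

# Triples as produced by the extraction steps above (the running example).
triples = [
    ("John", "is_a", "engineer"),
    ("John", "works_at", "Smith's Engineering Firm"),
    ("Smith's Engineering Firm", "located_in", "Washington"),
]

# Entities become nodes; each outgoing edge carries its relation label.
graph = defaultdict(list)
for head, relation, tail in triples:
    graph[head].append((relation, tail))
    graph.setdefault(tail, [])  # ensure leaf entities also exist as nodes

print(sorted(graph))  # all nodes in the graph
print(graph["John"])  # [('is_a', 'engineer'), ('works_at', "Smith's Engineering Firm")]
```

Even this tiny structure already supports the bird's-eye queries a knowledge graph is built for, such as "everything we know about John".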
The entity linkages can be used to generate an independent knowledge graph. This approach creates a proprietary graph for your organization; however, it is important to note that you will need sufficient information from your text dataset to build a usable data structure.
An often preferred option is to link the extracted entities to a pre-defined knowledge base. One can think of this as a form of transfer learning: the newly extracted entities are algorithmically plugged into an existing knowledge graph. This approach has several benefits:
- It enhances the knowledge of the existing KG.
- It creates additional links to the extracted entities.
- It saves time by providing a knowledge-rich graph with little effort.
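The linking step can be sketched as a merge: if an extracted entity matches a node already in the base graph, its new edges attach there instead of creating a duplicate node. The `base_kg` contents and the `merge` helper below are hypothetical, made up for the running example.

```python
# Hypothetical pre-existing knowledge base: each entity maps to its
# outgoing (relation, target) edges.
base_kg = {
    "Washington": [("type", "City"), ("located_in", "United States")],
}

# Triples extracted from our sample text.
extracted = [
    ("John", "works_at", "Smith's Engineering Firm"),
    ("Smith's Engineering Firm", "located_in", "Washington"),
]

def merge(kg, triples):
    """Plug extracted triples into an existing KG, reusing matching nodes."""
    for head, relation, tail in triples:
        kg.setdefault(head, [])
        kg.setdefault(tail, [])
        if (relation, tail) not in kg[head]:  # avoid duplicate edges
            kg[head].append((relation, tail))
    return kg

merged = merge(base_kg, extracted)
# "Washington" keeps its original edges, so the new firm node inherits a
# path to "United States" through the shared node.
print(merged["Smith's Engineering Firm"])  # [('located_in', 'Washington')]
print(merged["Washington"])
```

Because "Washington" already existed in the base graph, the extracted firm gains links (city type, country) that were never stated in the source text, which is the knowledge-enrichment benefit listed above.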
Final KG from our sample text
A knowledge graph extracts and delivers value from unstructured data. Google uses knowledge graphs to enhance the user's search experience: with a KG, the search engine can infer the intent behind search queries to provide better results and present additional relevant information. Knowledge graphs built from medical documents can create semantic patient profiles to derive valuable information such as probable medical conditions and their treatments.
Transformer Models for a Text to Graph Pipeline
Transformers are attention-based models that demonstrate excellent performance on NLP tasks. They specialize in sequential data and have served as the basis for many state-of-the-art models for text-based problems. A knowledge graph pipeline includes multiple NLP problems, such as entity recognition and relationship extraction. Transformer-based models enhance each of these stages, and the graph generated from their outputs contains more robust relations.
A major roadblock for many NLP projects is the unavailability of labeled datasets. Luckily, several transformer-based pre-trained open-source models are available online. These models can be re-trained on a smaller dataset, making them better suited to your particular problem. Researchers from SIFT have demonstrated building a knowledge graph using transformer models: using pre-trained models in their architecture, they achieved impressive basic-level reasoning results, and further improvements are possible with fine-tuning.
Exploring Biomedical Text With Wisecube
Wisecube packs a suite of advanced NLP techniques that analyze biomedical information and extract valuable insights. It provides a user-friendly interface to access and scan millions of biomedical publications, reports, and clinical trials to form a rich knowledge base. Customers can further extend the Wisecube knowledge base by integrating their proprietary data and analyzing hidden patterns.
Wisecube has partnered with some market-leading data and biomedical firms, including VueData, Roche, and John Snow Labs. These collaborations ensure that our products are backed by some of the greatest minds in the industry.