The application of language models, such as ChatGPT, has been a game-changer in various domains, including biomedicine. However, the potential of data-driven healthcare extends far beyond ChatGPT alone. 

In this article, we will explore new frontiers for intelligent biomedical data exploration and decision-making. We will discuss the importance of building robust biomedical knowledge graphs and the impact of combining them with advanced language models. Continue reading to discover how this collaborative approach propels biomedicine into a new era of transformative discoveries.

Limitations of Building Biomedical Knowledge Graphs

Knowledge graph construction can be complex, especially when dealing with diverse and large-scale biomedical data sources. The process involves various steps, such as data collection, mapping, inference, natural language processing (NLP), and loading into a graph database, which can be intricate and time-consuming.

The multistep nature of knowledge graph creation, from data collection to loading into a graph database, can lead to delays in operationalization. Integrating and transforming data into a cohesive knowledge graph often requires careful planning and execution, which can slow down the deployment of knowledge-driven biomedical applications.

Building Robust Knowledge Graphs Using Graphster

Graphster is an open-source knowledge graph library designed for scalable, end-to-end construction and querying of knowledge graphs from unstructured and structured source data.

The Graphster library utilizes Spark-based technology to extract information such as mentions and relationships from a collection of documents. This process results in the creation of a raw knowledge graph. To ensure accuracy, it links extracted mentions to entities in external knowledge bases like Wikidata and further enriches the graph with factual data from Wikidata. The resulting knowledge graph can then be natively queried using SPARQL, enabling efficient and flexible information retrieval for insightful data exploration.

Graphster streamlines the process of building knowledge graphs in the biomedical domain, making it an invaluable tool for organizing and extracting valuable insights from vast amounts of biomedical information.

How Does Graphster Simplify Knowledge Graph Construction?

Graphster simplifies the knowledge graph construction process in several ways, making it accessible and practical for users:

  • Simplified Data References: Graphster's simplified references to data elements within the knowledge graph improve readability and reusability. This makes it easier for users to understand and work with the graph's content.
  • Connecting Disjointed Data Sets: By linking entities and information from different data sources, Graphster resolves the issue of disjointed data and promotes a holistic understanding of the biomedical landscape.
  • Enriching Context: Linking mentions to entities in Wikidata and enriching the graph with additional facts allows for a deeper analysis of relationships and patterns within the biomedical knowledge graph.
  • Optimized Natural Language Querying: With the ability to natively query the knowledge graph using SPARQL, Graphster provides optimized natural language querying capabilities. This streamlines the information retrieval process, making it easier for users to pose complex queries and obtain relevant results.

Key Processes of Building a Knowledge Graph Using Graphster

The following key processes of robust knowledge graph construction drive Graphster's ability to offer a comprehensive view of biomedical data.

1. Subsetting

This process involves creating a subgraph from a larger, more general knowledge graph. Graphster leverages subsetting to allow users to extract relevant subsets of data for specific analysis or research purposes.
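As a rough illustration of subsetting, the sketch below keeps only the triples reachable within a fixed number of hops from a set of seed entities. It is not Graphster's API; the subset function, the toy triples, and their identifiers are assumptions made for this example.

```python
from collections import deque

# Toy (subject, predicate, object) triples; identifiers are illustrative, not real URIs.
TRIPLES = [
    ("EGFR", "encodes", "EGFR_protein"),
    ("EGFR_protein", "has_function", "kinase_activity"),
    ("gefitinib", "targets", "EGFR_protein"),
    ("Paris", "capital_of", "France"),
]

def subset(triples, seeds, max_hops):
    """Return the triples within max_hops of the seeds (edges treated as undirected)."""
    touching = {}
    for triple in triples:
        touching.setdefault(triple[0], []).append(triple)
        touching.setdefault(triple[2], []).append(triple)

    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    kept = set()
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand past the hop limit
        for triple in touching.get(node, []):
            kept.add(triple)
            for neighbor in (triple[0], triple[2]):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append((neighbor, depth + 1))
    return kept

# Keep everything within two hops of the EGFR gene; the unrelated geography triple is dropped.
biomedical_subgraph = subset(TRIPLES, seeds=["EGFR"], max_hops=2)
```

The hop limit is the main design choice here: a small limit gives a tight, focused subgraph, while a larger one pulls in more context at the cost of size.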

2. Data Fusion

Data fusion involves taking structured data that has been cleaned, transformed, or extracted and mapping it into the schema of the knowledge graph. Graphster enables this fusion by efficiently matching entities in the new data to existing entities in the graph and mapping the relationships and properties to appropriate predicates.
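A minimal sketch of this fusion step follows, assuming a simple label lookup for entity matching and a hand-written column-to-predicate mapping. The fuse function, the ex: identifiers, and the table schema are hypothetical and do not reflect Graphster's actual interface.

```python
# Entities already in the graph, keyed by normalized label.
# The ex: identifiers are placeholders, not real URIs.
GRAPH_ENTITIES = {"bloom syndrome": "ex:BloomSyndrome"}

# Hypothetical mapping from source-table columns to graph predicates.
COLUMN_TO_PREDICATE = {
    "inheritance": "ex:modeOfInheritance",
    "gene": "ex:associatedGene",
}

def fuse(rows, key_column):
    """Map table rows onto graph triples; return (triples, unmatched labels)."""
    triples, unmatched = [], []
    for row in rows:
        label = row[key_column].strip().lower()
        subject = GRAPH_ENTITIES.get(label)
        if subject is None:
            # No existing entity matches, so flag the row instead of guessing.
            unmatched.append(row[key_column])
            continue
        for column, predicate in COLUMN_TO_PREDICATE.items():
            if row.get(column):  # skip empty cells
                triples.append((subject, predicate, row[column]))
    return triples, unmatched

ROWS = [
    {"disease": "Bloom Syndrome ", "inheritance": "autosomal recessive", "gene": "BLM"},
    {"disease": "Unlisted disorder", "inheritance": "unknown", "gene": ""},
]
new_triples, unmatched = fuse(ROWS, key_column="disease")
```

Returning the unmatched labels separately keeps the fusion step auditable: rows that fail entity matching can be reviewed rather than silently dropped or attached to the wrong node.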

3. Link Prediction

Graphster offers advanced network analytics capabilities to synthesize new insights from the available data. Through global and targeted link prediction techniques, the library can identify potential connections and missing links, enhancing the overall accuracy and completeness of the knowledge graph.
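The article does not say which link prediction algorithms Graphster uses. To make the idea concrete, here is a sketch of one classic heuristic, Jaccard similarity over shared neighbors, applied to a toy interaction network. The node names and the predict_links function are illustrative assumptions.

```python
from itertools import combinations

# Toy undirected interaction network with placeholder node names.
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D"), ("D", "E")]

def build_adjacency(edges):
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency

def jaccard(adjacency, a, b):
    """Shared neighbors divided by all neighbors of either node."""
    union = adjacency[a] | adjacency[b]
    return len(adjacency[a] & adjacency[b]) / len(union) if union else 0.0

def predict_links(edges, top_k=3):
    """Rank currently unlinked node pairs by neighbor overlap."""
    adjacency = build_adjacency(edges)
    candidates = [
        (jaccard(adjacency, a, b), a, b)
        for a, b in combinations(sorted(adjacency), 2)
        if b not in adjacency[a]  # only score pairs with no existing edge
    ]
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))
    return [c for c in candidates if c[0] > 0][:top_k]

predicted = predict_links(EDGES)
```

Here A and D share two neighbors (B and C) but are not directly linked, so that pair ranks first. A targeted variant would score only pairs involving a chosen node rather than every pair in the graph.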

The Role of Large Language Models in Augmenting Knowledge Discovery

While building a robust knowledge graph is crucial for organizing biomedical information, it is only the first step in the data-driven landscape. To unlock the full potential of the vast amount of data available, there is a pressing need to combine the transformative capabilities of large language models (LLMs) and knowledge graphs.

By harnessing LLMs' natural language understanding and the structured wealth of knowledge graphs, researchers can access a deeper level of insights and drive data-powered innovation in biomedicine.

How Can LLMs & Knowledge Graphs Work Together?

Integrating large language models and knowledge graphs creates a powerful tool that leverages both technologies' strengths. By combining the context-rich structured data of graphs with the natural language processing capabilities of LLMs, innovative approaches for data exploration and knowledge discovery emerge.

Here are some of the ways LLMs and knowledge graphs can work together to revolutionize information retrieval and decision-making processes:

Direct Approach: Knowledge Graphs as Context For Large Language Models

Using knowledge graphs as a contextual source of information, LLMs can gain access to structured knowledge, enhancing their language understanding and content generation. This reduces the likelihood of LLMs generating inaccurate or hallucinated information, as they are constrained by the factual information present in the graph. Consequently, LLMs can offer more accurate and contextually relevant responses, making them invaluable tools for information retrieval and natural language understanding tasks. However, this approach carries high computation costs, and performance efficiency may decrease as the size of the knowledge graph increases.

Direct interactions of LLMs and Knowledge Graphs using graphs as a contextual source of information
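The direct approach can be sketched as a prompt-building step: pull the facts that mention entities from the question and place them ahead of the question itself. The retrieval rule (substring match on entity labels), the prompt wording, and the toy triples below are assumptions for illustration; no LLM is called.

```python
# Toy triples with plain-text labels; a real graph would use URIs with labels attached.
TRIPLES = [
    ("gefitinib", "targets", "EGFR"),
    ("EGFR", "encodes", "epidermal growth factor receptor"),
    ("Bloom syndrome", "mode of inheritance", "autosomal recessive"),
]

def retrieve_facts(question, triples, max_facts=5):
    """Keep triples whose subject or object is mentioned in the question."""
    text = question.lower()
    hits = [t for t in triples if t[0].lower() in text or t[2].lower() in text]
    # Capping the number of facts keeps the prompt size (and cost) bounded as the graph grows.
    return hits[:max_facts]

def build_prompt(question, triples):
    facts = retrieve_facts(question, triples)
    fact_lines = "\n".join(f"- {s} {p} {o}" for s, p, o in facts)
    return (
        "Answer using only the facts below. If they are insufficient, say so.\n"
        f"Facts:\n{fact_lines}\n"
        f"Question: {question}"
    )

prompt = build_prompt("What is the cellular target of gefitinib?", TRIPLES)
```

The max_facts cap reflects the cost concern above: the more of the graph that is pushed into the prompt, the more expensive each call becomes.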

Indirect Interaction: LLMs as Query Generators for Knowledge Graphs

An alternative way for LLMs to work with knowledge graphs involves posing natural language questions to LLMs, which process and convert them into SPARQL queries. These queries are then executed on the knowledge graph, which is based on the Resource Description Framework (RDF) data model, to retrieve relevant information and generate appropriate responses. This approach simplifies the querying process, makes graphs more accessible to a broader audience, and allows for more complex querying. However, it requires more specific training data to ensure accurate SPARQL query generation, making it harder to achieve optimal results.

Indirect interaction of LLMs and Knowledge Graphs using LLMs as query generators to retrieve information from graph

Challenges of Using LLMs to Generate SPARQL Queries

While leveraging language models to generate SPARQL queries for knowledge graphs offers significant advantages in terms of accessibility and natural language interaction, it also presents several challenges that must be addressed to ensure accurate and reliable query generation.

  • Hallucinating Graph Entities: LLMs have been known to exhibit hallucination, a phenomenon where they generate information that does not exist in the input data or knowledge graph. When generating SPARQL queries, this can lead to the inclusion of non-existent graph entities, potentially resulting in inaccurate or irrelevant query results.
  • Incorrect Predicates: LLMs may also generate SPARQL queries with incorrect predicates, causing queries to be formulated based on relationships that do not align with the actual structure of the graph. This can lead to misleading or erroneous query outcomes.
  • Malformed Queries that Don't Follow the Graph Schema: The generated SPARQL queries may not always adhere to the schema and structure of the knowledge graph. Malformed queries may fail to retrieve the desired information or raise errors when executed on the graph, leading to suboptimal query performance and reduced data exploration capabilities.
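One way to catch these three failure modes before a query reaches the graph is a simple vocabulary check. The sketch below is a toy checker, assuming a known set of entity and predicate identifiers (the ex: names are made up) and queries made of plain triple patterns; it is not a full SPARQL parser.

```python
import re

# Hypothetical graph vocabulary; in practice this would be read from the graph itself.
KNOWN_ENTITIES = {"ex:BloomSyndrome", "ex:EGFR"}
KNOWN_PREDICATES = {"ex:modeOfInheritance", "ex:encodes"}

# Matches simple "subject predicate object ." patterns (no literals, paths, or filters).
TRIPLE_PATTERN = re.compile(r"(\S+)\s+(\S+)\s+(\S+)\s*\.")

def check_query(sparql):
    """Return a list of problems found in an LLM-generated query (empty if none)."""
    match = re.search(r"\{(.*)\}", sparql, re.DOTALL)
    if match is None:
        return ["malformed query: no { ... } block found"]
    problems = []
    for s, p, o in TRIPLE_PATTERN.findall(match.group(1)):
        if not p.startswith("?") and p not in KNOWN_PREDICATES:
            problems.append(f"unknown predicate: {p}")
        for node in (s, o):
            if not node.startswith("?") and node not in KNOWN_ENTITIES:
                problems.append(f"unknown entity: {node}")
    return problems

good = "SELECT ?mode WHERE { ex:BloomSyndrome ex:modeOfInheritance ?mode . }"
hallucinated = "SELECT ?mode WHERE { ex:BloomsDisease ex:inheritedAs ?mode . }"
malformed = "SELECT ?mode WHERE ex:BloomSyndrome"
```

Calling check_query on the second query reports both the invented entity and the invented predicate, and the third is rejected for lacking a graph pattern block.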

How to Solve LLM’s Problem of Converting NL to SPARQL?

Addressing the challenges of converting natural language queries to SPARQL requires an intermediate representation that can bridge the two languages. This intermediary framework should ensure accurate mapping of natural language queries to SPARQL queries by incorporating a structured and standardized representation.

1. Set Builder Notation

Set Builder Notation (SBN) is a mathematical notation that describes a set by stating the properties its members must satisfy. For example, { x | x is a gene associated with Autism Spectrum Disorder } denotes the set of all such genes. As an intermediate representation, SBN can structure queries into a standardized format while retaining their natural language elements. It provides a clear and well-defined way to express natural language queries, making the subsequent transformation to SPARQL more straightforward.

How SBN Solves LLM Issues:

  • Requires Less Training Data: As SBN does not involve training LLMs on SPARQL and graph entities, it reduces the need for extensive training data, simplifying the learning process for LLMs.
  • Relieves Hallucination Problem: Because LLMs do not need to produce graph entities when using SBN, the risk of generating nonexistent information is mitigated, addressing the hallucination issue commonly observed in LLM-generated answers.

2. Entity Linking 

Entity linking refers to mapping natural language labels to specific entities in the knowledge graph by converting them into uniform resource identifiers (URIs). It is a technique that assists in handling the variation of model responses when transforming natural language to set builder notation to enable accurate SPARQL query construction. 

How Entity Linking Solves LLM Issues:

  • Accounts for LLM Variation: Entity linking allows for incorporating heuristics that help in natural language to SPARQL conversion, making it more adept at handling variations in LLM-generated queries.
  • Implicit Information Extraction: By linking entities in the natural language query to specific knowledge graph URIs, entity linking can extract implicit information that might not be explicitly stated in the natural language question. This helps in handling complex queries that require additional contextual understanding, such as queries containing entities that share a name across multiple instances (for example, a gene found in multiple species).

Natural Language to SPARQL conversion workflow
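The two steps can be sketched together: the LLM emits a set-builder style query that contains only natural language labels, and a separate entity-linking step swaps those labels for graph identifiers before the SPARQL text is assembled. The article does not specify the exact SBN format or linker, so the SetBuilderQuery structure, the lookup tables, and the ex: identifiers below are assumptions for illustration.

```python
from dataclasses import dataclass

# Hypothetical label-to-URI tables; a real linker would search graph labels and aliases
# and apply heuristics (for example, preferring the human gene when a name is ambiguous).
ENTITY_URIS = {"bloom syndrome": "ex:BloomSyndrome"}
PREDICATE_URIS = {"mode of inheritance": "ex:modeOfInheritance"}

@dataclass
class SetBuilderQuery:
    variable: str      # the "x" in { x | ... }
    conditions: list   # (subject, relation, object) tuples using plain-text labels

def link(label, table):
    """Resolve a natural language label to a graph identifier, or fail loudly."""
    key = label.strip().lower()
    if key not in table:
        raise KeyError(f"no graph match for label: {label!r}")
    return table[key]

def to_sparql(query):
    var = f"?{query.variable}"
    lines = []
    for subj, rel, obj in query.conditions:
        s = var if subj == query.variable else link(subj, ENTITY_URIS)
        o = var if obj == query.variable else link(obj, ENTITY_URIS)
        lines.append(f"  {s} {link(rel, PREDICATE_URIS)} {o} .")
    return f"SELECT {var} WHERE {{\n" + "\n".join(lines) + "\n}"

# { x | "Bloom Syndrome" has "mode of inheritance" x }
sbn = SetBuilderQuery("x", [("Bloom Syndrome", "mode of inheritance", "x")])
sparql = to_sparql(sbn)
```

Because the LLM only ever writes labels, an unknown label surfaces as a linking error rather than as an invented URI inside an otherwise valid-looking query.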

Levels of Difficulty Encountered in NL to SPARQL Conversion

Converting natural language queries to SPARQL poses various challenges due to the inherent differences between the two languages. As the complexity of natural language queries increases, the difficulty of mapping them to corresponding SPARQL queries grows significantly. 

Here are the different levels of difficulty encountered in natural language to SPARQL conversion and the unique challenges each level presents:

  • Single-hop: The simplest level of difficulty, where the natural language query can be directly mapped to a single SPARQL query. The conversion involves straightforward translation, and the natural language query corresponds to one specific query in the knowledge graph.

Example: What is the mode of inheritance of Bloom syndrome?

  • Heuristic Multi-hop / Implied Multi-hop: This level involves more complex natural language queries that require a combination of multiple SPARQL queries to retrieve the desired information. The conversion process may require heuristics or implied relationships between entities in the natural language query to construct multi-hop SPARQL queries.

Example: What genes are associated with Autism Spectrum Disorder?

  • Multi-hop: At this level, the natural language query necessitates explicit and multiple SPARQL queries with distinct intermediate steps to reach the final result. The conversion requires identifying and organizing these interconnected steps to navigate the knowledge graph effectively.

Example: What are the molecular functions of the protein encoded by the gene EGFR?

  • Statement to Statement: The most challenging level, where the natural language query involves complex reasoning and inference across the knowledge graph. Converting such queries requires understanding and translating natural language statements into SPARQL queries encompassing intricate relationships and complex reasoning steps.

Example: What is the cellular target of gefitinib?

As the complexity increases from single-hop to statement-to-statement, sophisticated intermediate representations and intelligent reasoning mechanisms are needed to achieve accurate and efficient query generation.
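To make the difference between a single hop and multiple hops concrete, the sketch below answers two of the example questions over a tiny in-memory triple set. The triples use plain labels in place of URIs, and the follow helper is an illustrative stand-in for executing SPARQL against a real graph.

```python
# Toy triples with plain-text labels standing in for graph URIs.
TRIPLES = {
    ("Bloom syndrome", "mode of inheritance", "autosomal recessive"),
    ("EGFR gene", "encodes", "EGFR protein"),
    ("EGFR protein", "molecular function", "protein tyrosine kinase activity"),
}

def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return sorted(o for s, p, o in TRIPLES if s == subject and p == predicate)

def follow(start, path):
    """Follow a chain of predicates from a start node, one hop per predicate."""
    frontier = [start]
    for predicate in path:
        frontier = sorted({o for node in frontier for o in objects(node, predicate)})
    return frontier

# Single-hop: one triple pattern answers the question.
single_hop = follow("Bloom syndrome", ["mode of inheritance"])

# Multi-hop: the gene must first be resolved to its protein, then to the protein's functions.
multi_hop = follow("EGFR gene", ["encodes", "molecular function"])
```

In SPARQL 1.1, a fixed chain like the second one can be written as a sequence property path using the / operator between predicates. The harder levels arise when the chain is not spelled out in the question and has to be inferred.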

Using NL-SPARQL Conversion to Power a Federated Q&A System

Having gained insights into the challenges and levels of difficulty encountered in natural language to SPARQL conversion, we can now delve into how this conversion can be used to power federated Question and Answer (Q&A) systems. By leveraging the combined capabilities of knowledge graphs, language models, and SPARQL, federated Q&A systems offer a powerful solution for intelligent information retrieval and decision-making in a wide range of domains. This querying solution has the potential to revolutionize the way we access and utilize knowledge from vast and disparate sources.

Federated Querying

Federated querying is an information retrieval technique in which multiple autonomous and distributed data sources are queried simultaneously, and the results are integrated into a unified response. It enables information retrieval from diverse and geographically distributed databases or knowledge bases, providing a more comprehensive data view and enabling efficient data explorations. Federated querying is particularly valuable in scenarios where information is scattered across multiple sources and where centralizing all data in a single repository may not be practical or feasible.
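A minimal sketch of the pattern: send the same query to several independent sources in parallel, tolerate sources that fail, and merge the results while recording where each fact came from. The three source functions are stand-ins invented for this example; in a SPARQL setting, the SERVICE keyword defined by SPARQL 1.1 Federated Query plays a comparable role across endpoints.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in sources; real ones would be remote databases or SPARQL endpoints.
def gene_catalog(term):
    return [("EGFR", "encodes", "EGFR protein")] if term == "EGFR" else []

def drug_catalog(term):
    if term != "EGFR":
        return []
    return [("gefitinib", "targets", "EGFR"), ("EGFR", "encodes", "EGFR protein")]

def offline_source(term):
    raise ConnectionError("source unavailable")

SOURCES = {"genes": gene_catalog, "drugs": drug_catalog, "offline": offline_source}

def federated_query(term, sources):
    """Query every source concurrently and merge facts, keeping provenance."""
    merged = {}
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = {name: pool.submit(fn, term) for name, fn in sources.items()}
        for name, future in futures.items():
            try:
                facts = future.result(timeout=5)
            except Exception:
                continue  # one unavailable source should not sink the whole query
            for fact in facts:
                merged.setdefault(fact, []).append(name)
    return merged

results = federated_query("EGFR", SOURCES)
```

Keying the merged results by fact removes duplicates across sources, and the list of source names per fact preserves provenance, which matters when sources disagree or differ in reliability.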

The biomedical industry can specifically benefit from federated querying as biomedical data is often distributed across various databases, research repositories, and clinical records, making it scattered and challenging to access in a unified manner. Federated querying provides a powerful solution by efficiently retrieving and consolidating relevant biomedical information from these disparate sources, enabling a holistic approach to biomedical research and improving patient care outcomes.

How Wisecube Leveraged NL-SPARQL for Combining Graph and Text Q&A

Realizing the importance of a federated Q&A system for the biomedical domain, Wisecube is developing a transformative federated querying solution to take biomedical data exploration to new heights. Harnessing the power of NL-SPARQL conversion as the bridge between natural language and structured queries, Wisecube offers a solution with the combined strengths of Graph Q&A and Text Q&A.

To comprehend the need for this holistic approach, it is important to understand the strengths that Graph and Text Q&A bring to the system:

1. Leveraging NL-SPARQL Conversion for Structured Data

The Graph-based Q&A system is based on the NL-SPARQL conversion process, bridging the gap between natural language and structured SPARQL queries. By transforming NL queries into SPARQL, a graph Q&A system can efficiently retrieve and process data from the knowledge graph, a rich source of structured biomedical information. The knowledge graph, a comprehensive representation of biomedical knowledge, contains vast amounts of data that may not be explicitly mentioned in any documents, making it ideal for answering complex queries with deep contextual understanding.

2. Incorporating Text Q&A for Corpus Data and Document Information

While the knowledge graph delivers structured data for explicit biomedical relationships, there remains a need for Text Q&A to incorporate corpus data and document information. Such data may not be easily extracted or represented in the graph but provides valuable insights that complement the knowledge graph. Text Q&A effectively bridges the gap by utilizing natural language processing and machine learning to extract information from a diverse corpus of biomedical literature and unstructured documents, enriching the Q&A system with context and insights that might not be present in the graph.

Flowchart illustrating the flow of a query in a Graph + Text Q&A system
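The exact flow of Wisecube's system is not described here, so the following is only a generic sketch of how a question might be answered from both a graph and a text corpus. The graph side is stubbed with a lookup table in place of NL-SPARQL conversion and execution, and the text side uses simple keyword overlap; every name and data item is an assumption for illustration.

```python
import re

# Stand-in for NL-to-SPARQL conversion plus graph execution.
GRAPH_ANSWERS = {"what is the cellular target of gefitinib?": ["EGFR"]}

# Tiny stand-in corpus; a real system would index biomedical literature.
CORPUS = [
    "Gefitinib is an inhibitor of the EGFR tyrosine kinase.",
    "Bloom syndrome is inherited in an autosomal recessive manner.",
]
STOPWORDS = {"what", "is", "the", "of", "a", "an", "in"}

def keywords(text):
    return set(re.findall(r"\w+", text.lower())) - STOPWORDS

def graph_qa(question):
    return GRAPH_ANSWERS.get(question.strip().lower(), [])

def text_qa(question, top_k=1):
    """Return the passages sharing the most keywords with the question."""
    wanted = keywords(question)
    scored = [(len(wanted & keywords(doc)), doc) for doc in CORPUS]
    scored.sort(key=lambda pair: -pair[0])
    return [doc for score, doc in scored[:top_k] if score > 0]

def answer(question):
    # Return both kinds of evidence so structured facts and supporting text sit side by side.
    return {"graph": graph_qa(question), "text": text_qa(question)}

both = answer("What is the cellular target of gefitinib?")
text_only = answer("What is the mode of inheritance of Bloom syndrome?")
```

The second question shows the complementary case: the stubbed graph has no answer, but the corpus still returns a relevant passage.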

Through NL-SPARQL conversion and the strategic integration of text Q&A and graph Q&A, Wisecube is set to take biomedicine beyond ChatGPT. By combining their strengths, Wisecube offers a holistic approach to biomedical data exploration, providing researchers and healthcare professionals with comprehensive insights that would be challenging to achieve using a single approach.

A federated biomedical Q&A system has the potential to revolutionize biomedical data discovery, fostering a data-driven landscape that enhances biomedical research and healthcare practices for a healthier future.

Get in touch with us to learn more!