The world of artificial intelligence has been buzzing lately with the emergence of large language models (LLMs), in particular ChatGPT (GPT stands for Generative Pre-trained Transformer). These models are trained on vast amounts of text data and can interpret natural language and generate human-like responses. With such capabilities, it's no wonder that researchers and developers are exploring new ways to integrate LLMs into existing platforms.
At Wisecube, a knowledge graph company, we believe that integrating LLMs with knowledge graphs is an important step towards the future of knowledge. Knowledge graphs, which are collections of interlinked data points that represent real-world entities and their relationships, are typically accessed via graph query languages. At Wisecube, we work with SPARQL, a powerful query language whose syntax resembles that of the popular SQL language. SPARQL is the standard query language for RDF data, recommended by the World Wide Web Consortium (W3C), and it powers the querying of many knowledge graphs.
However, learning a new language like SPARQL can be intimidating, especially for those who are new to knowledge graphs. To address this challenge, Wisecube has developed an intuitive interface for building complex SPARQL queries visually. With our visual query builder, users can easily create and customize queries without having to learn the syntax and intricacies of SPARQL.
Despite the availability of tools like the visual query builder, there is still a growing need to turn natural language questions into SPARQL queries. This need arises from the fact that knowledge graphs are becoming more integrated into dashboards and analytics, and users require a more natural way to interact with them.
To address this need, researchers and developers have been exploring ways to integrate LLMs with SPARQL queries. This involves training LLMs to understand natural language questions and generate SPARQL queries that accurately retrieve the desired information from the knowledge graph. At Wisecube, we have been closely following these developments.
Here we summarize tips and perspectives from online material on converting natural language questions into SPARQL, with a special focus on Wikidata, a massive, public-domain knowledge graph that we integrate into our platform.
GPTs, LLMs and generation of database queries
As natural language processing continues to advance, the possibilities of what language models can do seem almost limitless. One exciting development in this field is the ability of large language models (LLMs) to generate database queries, such as SQL queries, with ease.
OpenAI, for example, has highlighted SQL generation as one of the applications of LLMs. With the recent addition of GPT plugins, it has become clear that GPTs can work with other query languages as well. In fact, Stephen Wolfram, the creator of the Wolfram|Alpha platform, has showcased how GPTs can now generate queries for Wolfram|Alpha.
At the forefront of this development are language models that are capable of writing, sending, and parsing database queries, a technology called “agents”. One of the most popular toolkits for building such agents is LangChain, which allows developers to combine language models with external data. This means that it's possible to create agents that can perform complex tasks, such as retrieving data from databases and generating reports, with just a few lines of code.
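To make the "write, send, parse" loop concrete, here is a minimal sketch in plain Python. It does not use LangChain or any real model: `run_query_agent`, `fake_model`, and `fake_endpoint` are hypothetical names we introduce for illustration, and the stand-in model and endpoint exist only so the example runs offline.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def run_query_agent(
    question: str,
    write_query: Callable[[str, str], str],
    send_query: Callable[[str], List[Row]],
    max_attempts: int = 2,
) -> List[Row]:
    """Write a query, send it, and retry with the error message on failure."""
    feedback = ""
    for _ in range(max_attempts):
        query = write_query(question, feedback)  # step 1: the model writes
        try:
            return send_query(query)  # steps 2 and 3: send, then parse rows
        except ValueError as error:
            feedback = str(error)  # pass the error back to the model
    raise RuntimeError(f"No valid query after {max_attempts} attempts: {feedback}")


# Hypothetical stand-ins (assumptions) so the sketch runs offline.
def fake_model(question: str, feedback: str) -> str:
    # The first draft is deliberately broken; the retry fixes it.
    if not feedback:
        return "SELEC ?item WHERE { ?item wdt:P31 wd:Q12136 }"
    return "SELECT ?item WHERE { ?item wdt:P31 wd:Q12136 } LIMIT 1"


def fake_endpoint(query: str) -> List[Row]:
    # A real endpoint would execute the query; this one only checks the keyword.
    if not query.startswith("SELECT"):
        raise ValueError("syntax error: query must start with SELECT")
    return [{"item": "example-result"}]


rows = run_query_agent("Which items are diseases?", fake_model, fake_endpoint)
print(rows)
```

In a real agent, the two callables would wrap an LLM call and a database or SPARQL endpoint. Feeding the error message back into the next attempt is one simple way to let the model correct its own query.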
Generation of SPARQL
Given the ability of LLMs to generate SQL queries and other types of database queries, it's reasonable to expect that they could also excel at generating SPARQL queries. In fact, early research in this area has shown promising results.
At Wisecube, we have been closely following developments in this area, especially in relation to Wikidata SPARQL queries. Wikidata is a massive, public-domain knowledge graph that is widely used in research and data analysis. At Wisecube, we integrate Wikidata into our platform to provide our users with access to a vast array of knowledge. See the example below on how ChatGPT handles a request for a Wikidata query:
For example, in the experiment above, an LLM generated a SPARQL query to retrieve information about bipolar disorder from Wikidata. While it correctly identified some of the predicates and Wikidata identifiers, it incorrectly identified the identifier for bipolar disorder. This demonstrates the ongoing challenge of inferring identifiers from natural language inputs, especially in knowledge graphs with numeric or complex identifiers.
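One way to guard against wrong identifiers is to pull every identifier out of a generated query and check it against a trusted source before running the query. The sketch below is a minimal, offline version of that idea. The function names and the small `known_labels` table are our own assumptions, and `Q999999999` is a deliberately made-up identifier standing in for a hallucinated one. A real check would look labels up in Wikidata itself.

```python
import re
from typing import Dict, List

# Matches Wikidata entity (Q...) and property (P...) identifiers written with
# the common wd:, wdt:, p:, ps: and pq: prefixes.
ID_PATTERN = re.compile(r"\b(?:wd|wdt|p|ps|pq):([QP]\d+)\b")


def extract_wikidata_ids(sparql: str) -> List[str]:
    """Return the distinct identifiers in a query, in order of first appearance."""
    seen: List[str] = []
    for match in ID_PATTERN.finditer(sparql):
        identifier = match.group(1)
        if identifier not in seen:
            seen.append(identifier)
    return seen


def unverified_ids(sparql: str, known_labels: Dict[str, str]) -> List[str]:
    """Return identifiers used in the query that are missing from a trusted table."""
    return [i for i in extract_wikidata_ids(sparql) if i not in known_labels]


# Illustrative trusted table (assumption: a real one would come from Wikidata).
known_labels = {"P31": "instance of", "Q12136": "disease"}

generated_query = """
SELECT ?item WHERE {
  ?item wdt:P31 wd:Q12136 .
  ?item wdt:P31 wd:Q999999999 .
}
"""

print(extract_wikidata_ids(generated_query))
print(unverified_ids(generated_query, known_labels))
```

Any identifier flagged this way can be shown to the user with its label, or sent back to the model for another attempt, instead of silently returning wrong results.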
Despite these challenges, there are several exciting initiatives underway that aim to improve the interface between SPARQL generation and LLMs. One example is the Monarch Initiative, part of the Open Biomedical Ontologies Foundry, which is exploring the extraction of RDF semantic triples from biomedical text using LLMs. They have developed a tool called OntoGPT that prompts the model to generate the CamelCase labels of entities, which are then converted to identifiers in a later step.
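The two-step idea, labels first and identifiers later, can be sketched as follows. This is not OntoGPT's code: `camel_case_to_label`, `ground_label`, and the tiny `label_to_id` table are hypothetical names and data we use for illustration, and a real pipeline would ground labels against an ontology or knowledge graph rather than a dictionary.

```python
import re
from typing import Dict, Optional


def camel_case_to_label(name: str) -> str:
    """Split a CamelCase name into a lower-case, space-separated label."""
    # Acronyms, capitalized words, lower-case runs, and digit runs become tokens.
    words = re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+", name)
    return " ".join(word.lower() for word in words)


def ground_label(name: str, label_to_id: Dict[str, str]) -> Optional[str]:
    """Look up the identifier for a CamelCase name, or None if it is unknown."""
    return label_to_id.get(camel_case_to_label(name))


# Illustrative lookup table (assumption); the values are Wikidata identifiers.
label_to_id = {"disease": "Q12136", "human": "Q5"}

print(camel_case_to_label("BipolarDisorder"))        # bipolar disorder
print(ground_label("Disease", label_to_id))          # Q12136
print(ground_label("BipolarDisorder", label_to_id))  # None: not in the table
```

Keeping the lookup outside the model means an unknown label surfaces as an explicit `None` rather than as a plausible-looking but wrong identifier.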
Another interesting resource for those interested in the intersection of LLMs and knowledge graphs is Kurt Cagle's blog, “The Cagle Report”. While he hasn't written specifically on SPARQL generation, he has outlined many tricks to make ChatGPT generate useful triples for knowledge graphs. In one recent post, he makes a compelling argument for why we should use knowledge graphs instead of blindly relying on LLMs alone.
“(I see) knowledge graphs as being short-term memory – dynamic, malleable, precise, addressable – while machine learning is information processed by sleep – more contextual, somewhat more amorphous, less addressable, but capable of easier querying. (...) memories come from knowledge that can maintain provenance, authority, and local context and, most importantly, can benefit from stewardship and governance. This is where knowledge graphs are perfectly positioned.”
Coming back to SPARQL generation, one example is the work of Rony and colleagues, who have recently developed SGPT, a model that uses GPT to generate SPARQL. They use the simpler GPT-2 model and consider two situations: converting the natural language question directly to SPARQL, and converting the question to SPARQL while also providing the entities that will be needed. This separation of concerns might help reduce hallucination, for example. They train the knowledge graph directly into the model, and mention that while they use GPT-2, any transformer-decoder-based LM can be used with their system.
The SGPT team also explores SPARQL generation in the context of Wikidata. They see some degree of hallucination but report good performance. While they do not provide a public-facing tool, their work demonstrates the feasibility of generating SPARQL even using medium-sized language models like GPT-2. With the much larger models that have since appeared, the possibilities for generating high-quality SPARQL queries from natural language inputs are even more exciting.
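To illustrate the difference between the two situations, here is a minimal sketch of how a model input could be assembled with and without pre-linked entities. The text layout and the `build_model_input` helper are our own assumptions, not SGPT's actual input format.

```python
from typing import Dict, Optional


def build_model_input(question: str, entities: Optional[Dict[str, str]] = None) -> str:
    """Serialize a question, and optionally its linked entities, as model input."""
    lines = [f"Question: {question.strip()}"]
    if entities:
        # Second situation: the entities are resolved beforehand and handed over.
        linked = "; ".join(f"{label} = wd:{qid}" for label, qid in entities.items())
        lines.append(f"Entities: {linked}")
    lines.append("SPARQL:")
    return "\n".join(lines)


# First situation: the model must infer every identifier on its own.
print(build_model_input("Which items are diseases?"))

# Second situation: the identifier for "disease" is supplied with the question.
print(build_model_input("Which items are diseases?", {"disease": "Q12136"}))
```

In the second form, the model only has to get the query structure right, since the identifiers come from a separate entity-linking step.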
Another notable example is a blog post by Fan Li, in which he describes a workflow for integrating ChatGPT with knowledge graphs, including SPARQL generation. Li demonstrates how he used ChatGPT to retrieve articles containing the word "heart" and published in the 2000s from the Open Citations Meta database. While the process involved splitting the SPARQL generation steps into several small chunks and encountered some challenges, Li's work shows the potential for using LLMs to map natural language questions to structured ontologies.
All these new works with GPT follow recent advances in transformer-based neural networks. Before the explosion of GPT, BERT was the leading tool. It is important to mention the work of Gu et al. (2020) in building the crowd-curated GrailQA dataset and their BERT-based model, which generates intermediate S-expressions before producing SPARQL queries. So, while GPT brings some novelties to the table, many people (including us at Wisecube!) have been working on this task for some time already, and we can expect advances to be relatively quick. Be ready for novelty!
To summarize, the integration of LLMs and SPARQL queries represents a significant step forward in the field of knowledge graphs. The development of tools like Wisecube's visual query builder, combined with ongoing research into LLM-powered natural language processing, is creating a more intuitive and user-friendly experience for interacting with knowledge graphs. As LLMs and their integration with knowledge graphs continue to evolve and improve, we can expect to see even more exciting developments in the coming years.
Overall, the integration of LLMs and SPARQL queries is poised to revolutionize the way we interact with knowledge graphs, making it easier than ever before to extract insights and gain a deeper understanding of complex data. As a company at the forefront of this exciting field, Wisecube is committed to exploring new and innovative ways to integrate LLMs and other cutting-edge technologies into our platform, helping our users make sense of data and insights in a more intuitive and effective way.