Graphs and Language


A rising tide lifts all boats, and the recent advances in LLMs are no exception. In this blog post, we will explore how Knowledge Graphs can benefit from LLMs, and vice versa.

Where do KGs fit with LLMs?

Where do Knowledge Graphs fit with Large Language Models?

In particular, Knowledge Graphs can ground LLMs with facts using Graph RAG, which can be cheaper than Vector RAG. We'll look at a 10-line code example in LlamaIndex and see how easy it is to start. LLMs can in turn help automate KG construction, which has been a bottleneck in the past. Finally, Graphs can provide your Domain Experts with an interface to supervise your AI systems.

Note: this is a written version of a talk I gave at the AI in Production online conference on February 15th, 2024. You can watch the talk here.

A trip down memory lane at Spacy IRL 2019

I've been working with Natural Language Processing for a few years now, and I've seen the rise of Large Language Models. The start of my NLP and Graphs work dates back to 2018, applied to the Sports Media domain when I worked as a Machine Learning Engineer at OneFootball, a football media company from Berlin, Germany.

As a practitioner, I remember that time well because it was a time of great change in the NLP field. We were moving from the era of rule-based systems and word embeddings to the era of deep learning, and then from LSTM-based models like ELMo or ULMFiT to a slew of models based on the Transformer architecture. I was one of the lucky few who could attend the Spacy IRL 2019 conference in Berlin. There were corporate training workshops followed by talks about Transformers, conversational AI assistants, and applied NLP in finance or media.

Spacy IRL 2019
Spacy IRL 2019 keynote by Sebastian Ruder
Me at Spacy IRL

Me standing in the background, at Spacy IRL, networking with a ghost

In his keynote, The missing elements in NLP (spaCy IRL 2019), Yoav Goldberg predicted that the next big development would be to enable non-experts to use NLP. He was right ✅. He thought we would get there by humans writing rules aided by Deep Learning, resulting in transparent and debuggable models. He was wrong ❌. We got there with chat, and we now have less transparent and less debuggable models. We moved further right and down on his chart (see below) to a place deeper than Deep Learning. The jury is still out on whether we can move towards more transparent models that work for non-experts and with little data.

Spacy IRL Keynote

Yoav Goldberg: The missing elements in NLP (spaCy IRL 2019)

At my employer at the time, OneFootball, a football media company publishing in 12 languages with 10 million monthly active users, we used NLP to assist our newsroom and unlock new product features. I built systems to extract entities and relations from football articles, tag the news, and recommend articles to users. I shared some of that work in a previous talk at a Berlin NLP meetup. We had a medium amount of data, not a lot. We had partial labels in the form of "retags". We also could not pay for much compute. So we had to be creative. It was the realm of Applied NLP.

That's where I stumbled upon the beautiful world of Graphs, specifically the great work from my now friend Paco Nathan with his library pytextrank. Graphs (along with rule-based matchers, weak supervision, and other NLP tricks I applied over the years) helped me work with little annotated data and incorporate declarative knowledge from domain experts while building a system that could be used and maintained by non-experts, with some level of human+machine collaboration. We shipped a much better tagging system and a new recommendation system, and I was hooked.

Today with the rise of LLMs, I see a lot of potential to combine the two worlds of Graphs and LLMs, and I want to share that with you.

1. Fact grounding with Graph RAG

1.1 Fine-tuning vs Retrieval-Augmented Generation

The first place where Graphs and LLMs meet is in the area of fact grounding. LLMs suffer from a few issues like hallucination, knowledge cut-off, bias, and lack of control. To circumvent those issues, people have turned to their available domain data. In particular, two approaches emerged: Fine Tuning and Retrieval-Augmented Generation (RAG).

In his talk LLMs in Production at the AI Conference 3 months ago, Dr. Waleed Kadous, Chief Scientist at Anyscale, sheds some light on navigating the trade-offs between the two approaches. "Fine-tuning is for form, not facts", he says. "RAG is for facts".

Fine-tuning will get easier and cheaper. Open-source libraries like OpenAccess-AI-Collective/axolotl and huggingface/trl already make this process easier. But it's still resource-intensive and requires more NLP maturity as a business. RAG, on the other hand, is more accessible.

According to this Hacker News thread from 2 months ago, Ask HN: How do I train a custom LLM/ChatGPT on my documents in Dec 2023?, the vast majority of practitioners are indeed using RAG rather than fine-tuning.

1.2 Vector RAG vs Graph RAG

When people say RAG, they usually mean Vector RAG, which is a retrieval system based on a Vector Database. In their blog post and accompanying notebook tutorial, NebulaGraph introduces an alternative that they call Graph RAG, which is a retrieval system based on a Graph Database (disclaimer: they are a Graph database vendor). They show that the facts retrieved by the RAG system will vary based on the chosen architecture.

They also show in a separate tutorial part of the LlamaIndex docs that Graph RAG is more concise and hence cheaper in terms of tokens than Vector RAG.

1.3 RAG Zoo

To make sense of the different RAG architectures, consider the following diagrams I created:

Differences and similarities of the RAG architectures

In all cases, we ask a question in natural language QNL and we get an answer in natural language ANL. In all cases, there is some kind of Encoding model that extracts structure from the question, coupled with some kind of Generator model ("Answer Gen") that generates the answer.

Vector RAG embeds the query (usually with a smaller model than the LLM; something like FlagEmbedding or any of the small models at the top of the Hugging Face Embeddings Leaderboard) into a vector embedding vQ. It then retrieves the top-k document chunks from the Vector DB that are closest to vQ and returns those as vectors and chunks (vj, Cj). Those are passed along with QNL as context to the LLM, which generates the answer ANL.
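To make the retrieval step concrete, here is a minimal sketch of top-k retrieval by cosine similarity. It is not how any particular Vector DB works internally: the three-dimensional vectors, the chunk texts, and the function names are all made up for illustration, and a real system would get its vectors from an embedding model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_chunks(v_q, index, k=2):
    """Return the k (vector, chunk) pairs closest to the query vector v_q."""
    return sorted(index, key=lambda pair: cosine(v_q, pair[0]), reverse=True)[:k]

# Hypothetical toy index: hand-written vectors stand in for real embeddings.
index = [
    ([0.9, 0.1, 0.0], "Peter Quill is the leader of the Guardians."),
    ([0.1, 0.9, 0.0], "James Gunn directed the film."),
    ([0.0, 0.1, 0.9], "The film was released in 2023."),
]

v_q = [0.8, 0.2, 0.0]  # pretend embedding of the question below
retrieved = top_k_chunks(v_q, index, k=2)

# The retrieved chunks become the context handed to the LLM with the question.
context = "\n".join(chunk for _, chunk in retrieved)
prompt = f"Context:\n{context}\n\nQuestion: Tell me about Peter Quill.\nAnswer:"
```

The prompt string is what would be sent to the Answer Gen model; the generation call itself is left out on purpose.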

Graph RAG extracts the keywords ki from the query and retrieves triples from the graph that match the keyword. It then passes the triples (sj, pj, oj) along with QNL to the LLM, which generates the answer ANL.
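As a rough sketch of that flow, the snippet below matches keywords against a tiny in-memory list of triples. The triples, the helper names, and the naive keyword extraction (a lookup of known entity names in the question) are assumptions for illustration; frameworks typically ask an LLM to extract the keywords and query a real graph store.

```python
# Hypothetical toy graph: a list of (subject, predicate, object) triples.
TRIPLES = [
    ("Peter Quill", "is leader of", "Guardians of the Galaxy"),
    ("Peter Quill", "is played by", "Chris Pratt"),
    ("James Gunn", "directed", "Guardians of the Galaxy Vol. 3"),
]

def extract_keywords(question, entities):
    """Naive keyword extraction: keep known entity names found in the question."""
    lowered = question.lower()
    return [e for e in entities if e.lower() in lowered]

def retrieve_triples(keywords, triples):
    """Return triples whose subject or object matches one of the keywords."""
    wanted = {k.lower() for k in keywords}
    return [t for t in triples if t[0].lower() in wanted or t[2].lower() in wanted]

entities = sorted({t[0] for t in TRIPLES} | {t[2] for t in TRIPLES})
question = "Tell me about Peter Quill."
keywords = extract_keywords(question, entities)
matched = retrieve_triples(keywords, TRIPLES)

# The matched triples are serialised and passed to the LLM with the question.
context = "\n".join(f"({s}, {p}, {o})" for s, p, o in matched)
prompt = f"Triples:\n{context}\n\nQuestion: {question}\nAnswer:"
```

Note how short the context is compared with full document chunks: a handful of triples rather than paragraphs of text.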

Structured RAG uses a Generator model (LLM or smaller fine-tuned model) to generate a query in the database's query language. It could generate a SQL query for an RDBMS or a Cypher query for a Graph DB. For example, let's imagine we query an RDBMS: the model generates QSQL, which is then passed to the database to retrieve the answer. We denote this answer ASQL, but note that it consists of the data records that result from running QSQL in the database. ASQL and QNL are then passed to the LLM to generate ANL.
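Here is a small sketch of the Structured RAG flow using Python's built-in sqlite3 module. The `generate_sql` function is a hypothetical stand-in for the Generator model and simply returns a hard-coded query; the table and its rows are toy data.

```python
import sqlite3

def generate_sql(question):
    """Stand-in for the Generator model: returns a hard-coded query (hypothetical)."""
    return "SELECT title, director FROM movies WHERE director = 'James Gunn' ORDER BY year"

# Toy in-memory database standing in for the RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, director TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        ("Guardians of the Galaxy", "James Gunn", 2014),
        ("Guardians of the Galaxy Vol. 2", "James Gunn", 2017),
        ("Guardians of the Galaxy Vol. 3", "James Gunn", 2023),
    ],
)

q_nl = "Which movies did James Gunn direct?"
q_sql = generate_sql(q_nl)               # the generated database query
a_sql = conn.execute(q_sql).fetchall()   # data records, not yet a natural-language answer

# Records and question go to the LLM, which writes the final answer.
prompt = f"Records: {a_sql}\n\nQuestion: {q_nl}\nAnswer:"
```

In a real system you would probably want to validate or sandbox the generated query before running it, for example by using a read-only connection.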

In the case of Hybrid RAG, the system uses a combination of the above. There are multiple hybridisation techniques that go beyond the scope of this blog post. The simple idea is that you pass more context to the LLM for Answer Gen, and you let it use its summarisation strength to generate the answer.
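The simplest possible form of hybridisation, concatenating the contexts from two retrievers into one prompt, could look like the sketch below. The function name, the `[chunk]` and `[triple]` labels, and the `max_items` cap are my own choices for illustration, not a standard.

```python
def build_hybrid_prompt(question, chunks, triples, max_items=3):
    """Merge vector chunks and graph triples into a single context block."""
    lines = [f"[chunk] {c}" for c in chunks[:max_items]]
    lines += [f"[triple] ({s}, {p}, {o})" for s, p, o in triples[:max_items]]
    context = "\n".join(lines)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Toy inputs standing in for the outputs of a vector and a graph retriever.
chunks = ["Peter Quill is the leader of the Guardians."]
triples = [("Peter Quill", "is played by", "Chris Pratt")]
prompt = build_hybrid_prompt("Tell me about Peter Quill.", chunks, triples)
```

Labelling each context line with its origin is optional, but it makes it easier to trace which retriever contributed to an answer.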

1.4 Graph RAG implementation in LlamaIndex

And now for the code. With the current frameworks, we can build a Graph RAG system in about 10 lines of Python (imports aside).

# Written against the pre-0.10 llama_index API (ServiceContext era)
from llama_index.llms import Ollama
from llama_index import ServiceContext, StorageContext, KnowledgeGraphIndex
from llama_index.retrievers import KGTableRetriever
from llama_index.graph_stores import Neo4jGraphStore
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.data_structs.data_structs import KG
from IPython.display import Markdown, display

# LLM served locally by Ollama, plus a small local embedding model
llm = Ollama(model='mistral', base_url="http://localhost:11434")
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local:BAAI/bge-small-en")

# Connect to the local Neo4j database that holds the Knowledge Graph
graph_store = Neo4jGraphStore(username="neo4j", password="password", url="bolt://localhost:7687", database="neo4j")
storage_context = StorageContext.from_defaults(graph_store=graph_store)

# Wrap the existing graph in an index and retrieve triples by keyword
kg_index = KnowledgeGraphIndex(index_struct=KG(index_id="vector"), service_context=service_context, storage_context=storage_context)
graph_rag_retriever = KGTableRetriever(index=kg_index, retriever_mode="keyword")

# The query engine passes the retrieved triples and the question to the LLM
kg_rag_query_engine = RetrieverQueryEngine.from_args(retriever=graph_rag_retriever, service_context=service_context)

response_graph_rag = kg_rag_query_engine.query("Tell me about Peter Quill.")
display(Markdown(str(response_graph_rag)))

This snippet assumes you have Ollama serving the mistral model and a Neo4j database running locally. It also assumes you already have a Knowledge Graph in your Neo4j database. If you don't, the next section covers how to build one.

2. KG construction

2.1 Building a Knowledge Graph

Before conducting inference, you need to index your data either in a Vector DB or a Graph DB.

Indexing DBs
Indexing architectures for RAG

The equivalent of chunking and embedding documents for Vector RAG is extracting triples for Graph RAG. Triples are of the form (s, p, o) where s is the subject, p is the predicate, and o is the object. Subjects and objects are entities, and predicates are relationships.

There are a few ways to extract triples from text, but the most common way is to use a combination of a Named Entity Recogniser (NER) and a Relation Extractor (RE). NER will extract entities like "Peter Quill" and "Guardians of the Galaxy vol 3", and RE will extract relationships like "plays role in" and "directed by".

There are fine-tuned models specialised in RE like REBEL, but people have started using LLMs to extract triples. Here is the default prompt of LlamaIndex for RE:

Some text is provided below. Given the text, extract up to {max_knowledge_triplets}
knowledge triplets in the form of (subject, predicate, object). Avoid stopwords.
Text: Alice is Bob's mother.
Triplets: (Alice, is mother of, Bob)
Text: Philz is a coffee shop founded in Berkeley in 1982.
Triplets:
(Philz, is, coffee shop)
(Philz, founded in, Berkeley)
(Philz, founded in, 1982)
Text: {text}
Triplets:

The issue with this approach is that first you have to parse the chat output with regexes, and second you have no control over the quality of entities or relationships extracted.
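To illustrate the first point, here is a minimal sketch of what parsing that output with a regex can look like. The pattern and the `parse_triplets` helper are my own, written for this post; they are not LlamaIndex's actual parser.

```python
import re

# One triplet per line: "(subject, predicate, object)". The object may contain commas.
TRIPLET_RE = re.compile(r"^\(([^,()]+),([^,()]+),([^()]+)\)$")

def parse_triplets(llm_output, max_triplets=10):
    """Extract (subject, predicate, object) tuples, skipping lines that do not match."""
    triplets = []
    for line in llm_output.splitlines():
        match = TRIPLET_RE.match(line.strip())
        if not match:
            continue  # chatty filler or malformed line
        s, p, o = (part.strip() for part in match.groups())
        if s and p and o:
            triplets.append((s, p, o))
    return triplets[:max_triplets]

# Hypothetical raw chat output, including filler and a malformed line.
raw = """Triplets:
(Philz, is, coffee shop)
(Philz, founded in, Berkeley)
Sure! Here is one more:
(Philz, founded in, 1982)
(broken line
"""
parsed = parse_triplets(raw)
```

The parser can only tell whether a line looks like a triplet. It says nothing about whether the entities or relationships are any good, which is the second and harder problem.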

2.2 KG construction implementation in LlamaIndex

With LlamaIndex, however, you can build a KG in about 10 lines of Python using the following code snippet:

from llama_index.llms import Ollama
from llama_index import ServiceContext, StorageContext, KnowledgeGraphIndex
from llama_index.graph_stores import Neo4jGraphStore
from llama_index import download_loader

# LLM served locally by Ollama, plus a small local embedding model
llm = Ollama(model='mistral', base_url="http://localhost:11434")
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local:BAAI/bge-small-en")

# The extracted triples are written to the local Neo4j database
graph_store = Neo4jGraphStore(username="neo4j", password="password", url="bolt://localhost:7687", database="neo4j")
storage_context = StorageContext.from_defaults(graph_store=graph_store)

# Load the Wikipedia page to index
loader = download_loader("WikipediaReader")()
documents = loader.load_data(pages=['Guardians of the Galaxy Vol. 3'], auto_suggest=False)

# Extract triples from the documents with the LLM and store them in the graph
kg_index = KnowledgeGraphIndex.from_documents(
    documents,
    storage_context=storage_context,
    service_context=service_context,
)

2.3 Example failure modes of LLM-based KG construction

However, if we have a look at the resulting KG for the movie "Guardians of the Galaxy vol 3", we can note a few issues.

Example constructed KG with LLM
Neo4j Bloom screenshot of a KG constructed with an LLM

Here is a table overview of the issues

| # | Issue | Expected behaviour | Possible mitigation |
|---|-------|--------------------|---------------------|
| 1 | "Peter Quill / star-lord" vs "Quill" or "Guardians of the Galaxy" vs "Vol. 3" are separate entities | Different synonyms should still disambiguate to the same entity | Entity Linking systems are used to disambiguate entities via collected "surface forms" |
| 2 | "plays role in" and "is part of the cast in" are different relationships that mean the same thing | Relationships should be consistent or, even better, matching a provided controlled vocabulary | Relation Extraction systems are used to extract standardised relationships |
| 3 | Triples (Quill, speaks uncensored language in, Guardians of the Galaxy) and (James Gunn, could not imagine, Guardians of the Galaxy) are imprecise | If a triple is found, it should resolve to the most important information. In this case (Quill, is present in, Guardians of the Galaxy) or (James Gunn, directed, Guardians of the Galaxy) | Could be mitigated by using a controlled vocabulary for relationships |
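As a toy illustration of the mitigations listed above, the sketch below normalises raw triples with a surface-form lookup and a controlled vocabulary of relationships. Both dictionaries are hand-written assumptions (including the choice to map "Guardians of the Galaxy" to the Vol. 3 film); a real Entity Linking or Relation Extraction system would learn or curate them at scale.

```python
# Hypothetical surface forms and controlled vocabulary, for illustration only.
SURFACE_FORMS = {
    "quill": "Peter Quill",
    "peter quill / star-lord": "Peter Quill",
    "vol. 3": "Guardians of the Galaxy Vol. 3",
    "guardians of the galaxy": "Guardians of the Galaxy Vol. 3",
}
RELATION_VOCAB = {
    "plays role in": "appears in",
    "is part of the cast in": "appears in",
    "directed": "directed",
}

def normalise(triple):
    """Map a raw triple to canonical entities and relation, or None if the relation is unknown."""
    s, p, o = triple
    relation = RELATION_VOCAB.get(p.strip().lower())
    if relation is None:
        return None  # drop relations outside the controlled vocabulary
    subject = SURFACE_FORMS.get(s.strip().lower(), s.strip())
    obj = SURFACE_FORMS.get(o.strip().lower(), o.strip())
    return (subject, relation, obj)

raw_triples = [
    ("Quill", "plays role in", "Vol. 3"),
    ("Peter Quill / star-lord", "is part of the cast in", "Guardians of the Galaxy"),
    ("James Gunn", "could not imagine", "Guardians of the Galaxy"),
]

# A set removes the duplicates that appear once synonyms are merged.
clean = {t for t in map(normalise, raw_triples) if t is not None}
```

With these toy inputs, the first two raw triples collapse into a single canonical one and the imprecise third triple is dropped. Dropping is the bluntest option; you could instead route unknown relations to a domain expert for review.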

This is to be compared with the Wikidata graph labelled by humans, which looks like this:

Human-labelled KG

Human-labelled KG in Wikidata generated with metaphacts

2.4 Towards better KG construction

So where do we go from here? KGs are difficult to construct and evolve by nature, which makes it hard for existing KG methods to generate new facts and represent unseen knowledge. The paper Unifying Large Language Models and Knowledge Graphs: A Roadmap provides a good overview of the current state of the art and the challenges ahead. It describes KG construction as follows:

Knowledge graph construction involves creating a structured representation of knowledge within a specific domain. This includes identifying entities and their relationships with each other. The process of knowledge graph construction typically involves multiple stages, including 1) entity discovery, 2) coreference resolution, and 3) relation extraction. Fig 19 presents the general framework of applying LLMs for each stage in KG construction. More recent approaches have explored 4) end-to-end knowledge graph construction, which involves constructing a complete knowledge graph in one step or directly 5) distilling knowledge graphs from LLMs.

This framework is summarised in the following figure from the paper:

The general framework of LLM-based KG construction

I've seen only a few projects that have tried to tackle this problem: DerwenAI/textgraphs and IBM/zshot.

3. Unlock Experts

3.1 Human vs AI

The final place where Graphs and LLMs meet is Human+Machine collaboration. Who doesn't love a "Human vs AI" story? News headlines about "AGI" or "ChatGPT passing the bar exam" are everywhere.

Human vs AI
Human vs AI news headlines

I would encourage the reader to have a look at this answer from the AI Snake Oil newsletter. They make a good point that models like ChatGPT memorise the solutions rather than reason about them, which makes exams a bad way to compare humans with machines.

Going beyond memorisation, there are whole areas of research on generalisation, reasoning, planning, and representation learning, and graphs can help with all of them.

3.2 Human + Machine: Visualisation

Rather than against each other, I'm interested in ways Humans and Machines can work together. In particular, how do humans understand and debug black-box models?

One key project that, in my opinion, moved the needle there was the whatlies paper from Vincent Warmerdam, 2020. He used UMAP on embeddings to reveal quality issues in LLMs, and built a framework for others to audit their embeddings rather than blindly trust them.

Similarly, Graph Databases come with a lot of visualisation tools out of the box. For example, they add context through colour, metadata, and different layout algorithms (force-based, Sankey).

Warmerdam, 2020: whatlies in embeddings
Neo4j Bloom

3.3 Human + Machine: Human in the Loop

Finally, how do we address the lack of control of Deep Learning models, and how do we incorporate declarative knowledge from domain experts?

I like to refer to the phrase "the proof is in the pudding", and by that, I mean that the value of a piece of tech must be judged based on its results in production. And when we look at production systems, we see that LLMs or Deep Learning models are not used in isolation, but rather within Human-in-the-Loop systems.

In a project and paper from 2 weeks ago, Google describes how it has started using language models to help it spot and fix bugs in its C/C++, Java, and Go code. The results have been encouraging: it has recently started using an LLM based on its Gemini model to “successfully fix 15% of sanitiser bugs discovered during unit tests, resulting in hundreds of bugs patched”. Though the 15% fix rate sounds relatively small, it has a big effect at Google scale. The bug pipeline yields better-than-human fixes: “approximately 95% of the commits sent to code owners were accepted without discussion,” Google writes. “This was a higher acceptance rate than human-generated code changes, which often provoke questions and comments”.

The key takeaway here for me has to do with their architecture:

AI-powered patching
AI-powered patching at Google

They built it with an LLM, but they also combined the LLM with smaller, more specific AI models and, more importantly, with a double human filter on top. In other words, humans are working with machines.
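To show the general shape of such a Human-in-the-Loop system, here is a deliberately simplified sketch. It is not Google's implementation: the function, the automated check, and the two reviewers are hypothetical stand-ins for a smaller model and a double human filter.

```python
def review_pipeline(candidates, auto_check, human_reviews):
    """Keep only candidates that pass the automated check and every human review."""
    accepted = []
    for candidate in candidates:
        if not auto_check(candidate):
            continue  # filtered out by the smaller, more specific model
        if all(review(candidate) for review in human_reviews):
            accepted.append(candidate)
    return accepted

# Hypothetical LLM-generated patches and stand-in filters.
candidates = ["patch-1", "patch-2", "patch-3"]
auto_check = lambda patch: patch != "patch-2"   # stands in for an automated check
human_reviews = [
    lambda patch: True,                          # first human reviewer approves everything
    lambda patch: patch != "patch-3",            # second human reviewer rejects one patch
]
accepted = review_pipeline(candidates, auto_check, human_reviews)
```

The point of the design is the ordering: cheap automated filters run first, so that scarce human attention is only spent on candidates that are already plausible.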


I remember those 2019 days vividly, moving from LSTMs to Transformers, and we thought that was Deep Learning. Now, with LLMs, we've reached what I would describe as Abysmal Learning. And I like this image because it can mean both "extremely deep" as well as "profoundly bad".

More than ever, we need more control, more transparency, and ways for humans to work with machines. In this blog post, we've seen a few ways in which Graphs and LLMs can work together to help with that, and I'm excited to see what the future holds.

Abysmal Learning
Deeper than Deep Learning: Abysmal Learning


  1. Language, Graphs, and AI in industry - Paco Nathan - Jan, 2024
  2. Graph ML meets Language Models - Paco Nathan - Oct 25, 2023
  3. [2306.08302] Unifying Large Language Models and Knowledge Graphs: A Roadmap
  4. GitHub - RManLuo/Awesome-LLM-KG: Awesome papers about unifying LLMs and KGs - Jun 14, 2023
  5. Evaluating LLMs is a minefield
  6. GPT-4 and professional benchmarks: the wrong answer to the wrong question - AI Snake Oil - Oct 4, 2023
  7. AI-powered patching: the future of automated vulnerability fixes - Google Security - Jan 31, 2024
  8. Graph & Geometric ML in 2024: Where We Are and What’s Next (Part II — Applications) | by Michael Galkin - Jan 16, 2024
  9. [2312.02783] Large Language Models on Graphs: A Comprehensive Survey - Dec 5, 2023
  10. ULTRA: Foundation Models for Knowledge Graph Reasoning | by Michael Galkin | Towards Data Science - Nov 3, 2023
  11. Fine Tuning Is For Form, Not Facts | Anyscale - July 5, 2023
  12. GenAI Stack Walkthrough: Behind the Scenes With Neo4j, LangChain, and Ollama in Docker - Oct 05, 2023
  13. NebulaGraph Launches Industry-First Graph RAG: Retrieval-Augmented Generation with LLM Based on Knowledge Graphs - Sep 6, 2023
  14. RAG Using Unstructured Data & Role of Knowledge Graphs | Kùzu - Jan 15, 2024
  15. Constructing knowledge graphs from text using OpenAI functions | by Tomaz Bratanic - Oct 20, 2023
  16. Knowledge graph from unstructured text | by Noah Mayerhofer | Neo4j Developer Blog - Sep 21, 2023