Dec 19, 2024

5 min read

Knowledge Graph(s) and LLM-Based Ontologies Have a Very Good Shot at Unlocking GenAI in Production

Authors

Julien Hobeika

Pierrick Cudonnec

Alexander Fred-Ojala

In the wake of hype around ChatGPT, enterprises began dedicating budgets to generative AI (genAI) adoption. As they played around with genAI and its potential use cases, they quickly realised that, to be truly useful and effective, these models need to interact with their proprietary, “private” company data. This realization has sparked a surge of companies focused on integrating AI with enterprise data — some develop solutions to fine-tune models directly in-house, while others look to retrieval-augmented generation (RAG) as a way to incorporate company knowledge into LLM analysis and outputs.

However, the deployment of “AI in production” is yet to happen. One of the main hurdles is that the quality of the data used to feed the models is lacking. Companies like Deasie and Unstructured are tackling this problem by focusing on metadata labelling to crack the “data foundation layer”.

There is also another promising approach gaining attention: knowledge graphs (KGs).


Graph databases are the solution to the “data foundation layer” and will unlock the value catch-up for KGs

Graph databases have failed to capture as much value as their relational counterparts (see chart). At the scale of a single business, no graph database has achieved the same widespread adoption or impact of platforms like Snowflake or Databricks.

One key reason is that relational databases power apps at scale (e.g. CRMs, ERPs, etc.), while graph databases do not. This is largely because no KG architecture has been able to capture and update enough data to comprehensively and accurately map a business.

We believe that graph databases are the solution to enabling AI apps to move into production in the enterprise market, because they can solve the “data foundation layer” problem. The reason is that LLMs are a powerful way to build ontologies and semantic layers. Let’s explore why.

Graph databases overcame some of the limits of relational databases and powered “early” AI apps, but they remain limited

When Leonhard Euler (1707–1783) came up with the foundation of graph theory — while trying to prove that the Königsberg bridge problem had no solution — he likely had no idea of the immense business potential of graph databases.

In the 1970s the relational database model emerged and became the dominant paradigm for decades. Relational databases excelled at handling structured tabular data — and they still do, as exemplified by Snowflake. One of the limitations of relational databases is that they struggle with complex relationships due to their reliance on joins and normalization.

Enter graph databases, which represent data as nodes (entities) and edges (relationships), mirroring graph theory. Neo4j, which recently surpassed $200m in ARR, is probably the most famous example of a graph database company.

Graph databases are a good solution to problems that involve real-world and contextual/relationship data in domains such as supply chains, recommendation systems, and fraud detection. These domains often involve complex relationships that are cumbersome to model and query in traditional databases. Graph databases were designed to represent and query these relationships natively — some great startups actually use knowledge graphs as a powerful way to build their products, such as Causaly and Spread.

Problems: Graph databases have failed to scale, and genAI use cases are still not in production in enterprises

Scaling graph databases to handle massive amounts of data is a well-documented challenge. Moreover, they are notoriously hard to productize because of ontologies. In short, ontologies define the key entities (nodes) in a domain, their attributes, and the relationships (edges) connecting them. Until LLMs came about, ontology design was largely a manual process, often handled by consultants. This manual reliance is one of the reasons why Palantir’s Foundry, which uses a graph-based model, became valuable but was very hard to scale as a product.
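To make the term concrete, here is a minimal sketch (in Python, using an invented supply-chain domain) of what an ontology boils down to in practice: entity types, their attributes, and the relationship types allowed between them.

```python
# A minimal, illustrative ontology: entity types, their attributes, and the
# relationship types allowed between them. The supply-chain domain is invented.
from dataclasses import dataclass, field

@dataclass
class EntityType:
    name: str                                   # e.g. "Supplier"
    attributes: list[str] = field(default_factory=list)

@dataclass
class RelationType:
    name: str                                   # e.g. "SUPPLIES"
    source: str                                 # entity type the edge starts from
    target: str                                 # entity type the edge points to

supply_chain_ontology = {
    "entities": [
        EntityType("Supplier", ["name", "country", "risk_score"]),
        EntityType("Component", ["sku", "category"]),
        EntityType("Product", ["sku", "launch_date"]),
    ],
    "relations": [
        RelationType("SUPPLIES", source="Supplier", target="Component"),
        RelationType("PART_OF", source="Component", target="Product"),
    ],
}
```

Hand-crafting and maintaining definitions like these across an entire business is exactly the work that, until recently, fell to consultants.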

LLMs and transformer-based models are particularly well suited to defining ontologies. First, they excel at capturing semantic relationships (via the attention mechanism). Second, LLMs are also really good at creating vector representations (embeddings) that effectively capture the hierarchical, semantic, and logical relationships between entities. And third, they are easy to update, a must-have feature for governance as well as for fields that change over time. As a result, LLMs are a powerful way to build ontologies, enabling knowledge graphs to be built and updated much faster.
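As an illustration of that idea, the sketch below asks an LLM to propose an ontology from a handful of sample documents. The prompt, the model name, and the expected JSON shape are assumptions on our part, not a prescribed recipe.

```python
# Illustrative sketch: ask an LLM to propose an ontology (entities, attributes,
# relations) from sample documents. Prompt, model and JSON shape are assumptions.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def propose_ontology(sample_docs: list[str]) -> dict:
    prompt = (
        "From the documents below, propose an ontology as JSON with two keys: "
        "'entities' (each with a name and a list of attributes) and 'relations' "
        "(each with a name, a source entity type and a target entity type).\n\n"
        + "\n---\n".join(sample_docs)
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```

Because the output is structured, the same call can be re-run as the underlying data evolves, which is what makes updates and governance tractable.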

After the ontology has been “cracked”, it becomes possible to infer the graph representation of an entity — be it a business unit, process or company. However, another traditional limitation of knowledge graphs lies in the definition and scaling of symbolic rules, a process that has also been largely manual. Symbolic rules refer to the logic and constraints that dictate how the elements within a graph interact. They enhance the graph’s reasoning capabilities and enable the inference of new knowledge. Just as LLMs are a powerful way to define ontologies, they are also excellent at defining symbolic rules.
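As a toy example of what a symbolic rule looks like once the graph exists, the sketch below (using networkx, with invented entities and an invented rule) infers a new edge from two existing ones.

```python
# Toy symbolic rule over a knowledge graph: if A SUPPLIES B and B is PART_OF C,
# infer an INDIRECTLY_SUPPLIES edge from A to C. Entities and rule are invented.
import networkx as nx

kg = nx.DiGraph()
kg.add_edge("AcmeMetals", "Gearbox", relation="SUPPLIES")
kg.add_edge("Gearbox", "TractorX", relation="PART_OF")

def apply_supply_rule(graph: nx.DiGraph) -> None:
    for a, b, d1 in list(graph.edges(data=True)):
        if d1["relation"] != "SUPPLIES":
            continue
        for _, c, d2 in graph.edges(b, data=True):
            if d2["relation"] == "PART_OF":
                graph.add_edge(a, c, relation="INDIRECTLY_SUPPLIES")

apply_supply_rule(kg)
print(kg.get_edge_data("AcmeMetals", "TractorX"))  # {'relation': 'INDIRECTLY_SUPPLIES'}
```

Writing rules like this by hand for every domain is what made scaling hard; having an LLM draft them changes the economics.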

To recap, LLMs not only turbocharge the creation of knowledge graphs by streamlining ontology development, but are also really good at inferring the symbolic rules. What are the implications?

The opportunity: Putting AI into production for the enterprise market

Taking a step back, one of the main reasons why AI has struggled to get into production in enterprises is probably the data layer. Simply put, most enterprise data, as it stands, is just not ready for LLM production. This is why companies like Deasie are gaining traction — their focus on solving the “data foundation layer” is key to enabling LLM adoption. In the US, Unstructured is also a strong contender, addressing the challenge that “80% of enterprise data exists in difficult-to-use formats like HTML, PDF, CSV, PNG, PPTX, and more” (as stated on their website) by extracting and transforming complex data for use with every major vector database and LLM framework.

It turns out that the combination of KGs and LLMs is another compelling and potentially more powerful solution to the data foundation problem. This approach allows data to be mapped and embedded at a company or use case level, unlocking a wide range of AI and RAG use cases in the enterprise market.

Why Microsoft created GraphRAG (now open source)

Microsoft’s GraphRAG (now open source) uses community detection algorithms to help structure data into clusters and their relationships.
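The snippet below is a rough illustration of that clustering step, using networkx's modularity-based community detection on an invented entity graph; GraphRAG's actual pipeline is more involved, but the idea is the same.

```python
# Rough illustration of community detection: cluster an entity graph so that each
# cluster can later be summarised and queried as a unit. Entities are invented;
# GraphRAG's real pipeline is more involved than this.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

g = nx.Graph()
g.add_edges_from([
    ("invoice", "payment"), ("payment", "vendor"), ("vendor", "contract"),
    ("patient", "diagnosis"), ("diagnosis", "treatment"),
])

for i, community in enumerate(greedy_modularity_communities(g)):
    print(f"cluster {i}: {sorted(community)}")
```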

As a reminder, RAG (retrieval-augmented generation) is a way for a model to utilize relevant data for specific queries, without requiring the LLM to have been trained on, or to remember, that data. This is a powerful technique for ingesting, for example, private and proprietary data into the LLM's input prompt.

Here is a simplified explanation of how RAG works (a minimal code sketch follows the list):

1. The user provides a prompt.

2. The prompt is embedded into vectors using a pre-trained embedding model, like BERT.

3. The RAG application searches for similar vectors in a pre-embedded index of private data.

4. The LLM integrates the retrieved knowledge into its input prompt and/or answer.
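Below is a minimal sketch of those four steps, assuming sentence-transformers as the embedding model; the documents, model name, and the final LLM call are illustrative rather than a prescribed stack.

```python
# Minimal sketch of the four RAG steps above, assuming sentence-transformers for
# the embedding model. Documents and model name are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
documents = ["Q3 revenue grew 12% in EMEA.", "The new vendor contract starts in May."]
doc_vectors = model.encode(documents, normalize_embeddings=True)   # pre-embedded private index

def build_augmented_prompt(question: str, top_k: int = 1) -> str:
    q = model.encode([question], normalize_embeddings=True)[0]     # step 2: embed the prompt
    scores = doc_vectors @ q                                       # step 3: cosine-similarity search
    context = "\n".join(documents[i] for i in np.argsort(-scores)[:top_k])
    return f"Context:\n{context}\n\nQuestion: {question}"          # step 4: feed this to the LLM

print(build_augmented_prompt("How did EMEA perform last quarter?"))
```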

So far, the dominant approach for RAG has been baseline RAG (searching only for semantically similar text vectors), which has a notable limitation: it struggles to connect the dots between two concepts that are semantically distant but logically close. Capturing the logic that goes beyond the semantic is what KGs excel at. Enter Microsoft GraphRAG.

Instead of searching only for similar text vectors, GraphRAG explores relationships in the graph (using graph embeddings or graph-specific algorithms) to retrieve connected nodes or subgraphs. In short, it captures the context and relationships much better, and is therefore well positioned to provide higher quality and more correct outputs compared to traditional RAG methods.
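Here is a hand-wavy sketch of that retrieval step, with an invented graph: once a seed entity has been located (for example via vector similarity), its neighbourhood is pulled in so that logically related but semantically distant facts travel into the prompt together. This illustrates the idea only; it is not Microsoft's implementation.

```python
# Illustration of graph-based retrieval: expand from a seed entity to its
# neighbourhood so related facts are retrieved together. Invented graph; this is
# not Microsoft's GraphRAG code.
import networkx as nx

kg = nx.Graph()
kg.add_edge("Acme Corp", "Project Falcon", relation="SPONSORS")
kg.add_edge("Project Falcon", "Berlin office", relation="RUN_FROM")
kg.add_edge("Berlin office", "Jane Doe", relation="MANAGED_BY")

def retrieve_subgraph(seed: str, hops: int = 2) -> list[str]:
    neighbourhood = nx.ego_graph(kg, seed, radius=hops)
    return [f"{u} -[{d['relation']}]-> {v}" for u, v, d in neighbourhood.edges(data=True)]

# "Who manages the office behind Acme's project?" is hard for similarity search
# alone, but a two-hop expansion from the seed surfaces the full chain of facts.
print(retrieve_subgraph("Acme Corp"))
```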

Microsoft GraphRAG was open sourced in June 2024 (45-second announcement: here). However, whilst groundbreaking, it remains more of a promising solution than a proven one that has delivered enterprise value. GraphRAG seems unable to scale the user inputs needed for proper ontology definition, which limits the graph's connections to co-occurrences in documents. Setting up the semantic layer, or ontology, is still somewhat cumbersome: it is resource intensive (experts, compute), which makes it hard to scale and maintain, and it struggles to automatically decide which entities and relations are crucial. As an open-source project, it also lacks the go-to-market force usually required for adoption by the enterprise market. Nevertheless, the technique has further confirmed growing market interest in robust RAG-based solutions, validating both the timing and the appetite.

There are three promising ways to solve for AI in production

1. Metadata labeling at scale: This approach is compelling, but metadata labeling does not necessarily capture relationship links (a semantic focus that does not capture the logic). Scale AI ($1bn Series F) is probably one of the largest companies to have followed this approach. Yet its human-based labeling approach limits its scalability (and therefore its margin profile).

2. Building a single knowledge graph on an enterprise data lake: The second approach envisions constructing one unique knowledge graph on top of one enterprise data lake as the way to win the market. This is the approach of US-based RelationalAI ($115m raised). In essence, it amounts to establishing control over both the graph layer and a single semantic layer for the whole enterprise. It might be relevant in some cases, but it is also hard to scale and potentially not tailored enough to the use cases being built in the enterprise.

3. Customizing multiple graphs to address specific enterprise needs: A more scalable and flexible solution is likely the orchestration of multiple smaller knowledge graphs, tailored to specific use cases. In this model, LLMs are used to infer graph architectures and symbolic layers for each use case, which in turn helps build the genAI applications. This enables enterprises to create a network of smaller, specialized graphs tailored to each use case or cluster of ontologies, making applications more accurate, easier to govern, and interoperable.

This third approach represents a significant opportunity for entrepreneurs. Startups in this space would ideally target the large enterprise market, where building one large, unique knowledge graph is too challenging and too generic, and where multiple KGs make sense. From our perspective, the core market for product-market fit and scale would be in regulated environments with heterogeneous systems, where terminologies play a significant role. Ideally, this company also has a US focus, as infrastructure startups need market velocity to scale.

Market potential

At scale, this company could capture $bns of ARR, positioning itself as a critical infrastructure layer between enterprise data and AI apps. As KGs catch up on the >$60bn relational database market and as AI gets into production, this company could capture a significant share of the overall genAI market.

Open questions on the space: 5 things we don’t know (yet)

1. What are the limitations of abstracting the ontology with a GraphRAG-like approach? If the winning approach (assuming there is one) includes humans in the loop for ontology definition, what is the right way to include domain knowledge?

2. What is the right starting point from a market perspective — horizontal or vertical?

3. How will the market actually split across the use cases that work best with RelationalRAG, GraphRAG (without ontology), and multiple ontology-based knowledge graphs?

4. Assuming building multiple knowledge graphs is the right approach, how do we identify and optimise the links and limits between them?

5. It is likely that some use cases require mixing public and private data. Which layer will capture the place where the “magic” happens?

If you are a founder building in this space with this paradigm in mind, feel free to reach out at julien.hobeika@eqtventures.com or pierrick@eqtventures.com. We would love to chat.
