Knowledge Graphs and LLMs in Action you own this product

Build AI systems using connected data

Alessandro Negro with Vlastimil Kus, Giuseppe Futia and Fabio Montagna
Forewords by Maxime Labonne, Khalifeh AlJadda

October 2025
ISBN 9781633439894
472 pages

Included with a Manning Online subscription

printed in black & white

available in Russian, Simplified Chinese

catalog / Data Science / Machine Learning / Knowledge Graphs

resources: Source code Supplememental material Book forum Source code on Github Register your pBook for a free eBook

table of content

Part 1 Foundations of hybrid intelligent systems

1 Knowledge graphs and LLMs: A killer combination

1.1 Knowledge graphs

1.2 Large language models

1.3 KGs and LLMs: Stronger together

1.4 The paradigm shift in data-driven applications

1.4.1 The four pillars of knowledge graphs

1.5 Building data-driven applications using KGs and LLMs

1.5.1 Example use case: Drug discovery and development

1.5.2 Example use case: Conversational AI for customer support

1.5.3 Deciding whether to use a KG

1.6 Knowledge graph technologies

1.6.1 Taxonomies and ontologies

1.7 How do we teach KGs and LLMs?

2 Intelligent systems: A hybrid approach

2.1 What is intelligence?

2.2 Designing an intelligent system

2.2.1 What is an intelligent system?

2.2.2 Categories of intelligent systems

2.2.3 Characteristics of an intelligent system

2.3 Knowledge acquisition and representation

2.4 Reasoning

2.5 Reasoning engines

2.5.1 Limitations of a pure deductive reasoning engine

2.5.2 Using inductive reasoning and ML

2.5.3 The role of LLMs in the reasoning engine

2.6 A KG approach to IASs

Part 2 Building knowledge graphs from structured data sources

3 Create your first knowledge graph from ontologies

3.1 Knowledge graph building: Warmup

3.1.1 Business and domain understanding

3.1.2 Data understanding

3.2 Understanding knowledge graph technologies

3.2.1 RDF or LPG? A goal-driven discussion

3.2.2 Representing edge properties with RDF and LPG

3.3 Building a knowledge graph

3.3.1 Ontology ingestion and processing with neosemantics

3.3.2 Annotation ingestion and processing

3.4 Querying the data

3.5 Reasoning over the KG

4 From simple networks to multisource integration

4.1 Biomedical knowledge graphs and applications

4.2 Multi-omic applications of KGs

4.2.1 Creating a KG from the PPI and protein-disease networks

4.2.2 High-level analysis of the resulting KGs

4.2.3 Domain-specific analysis of the PPI and disease KG

4.3 Pharmaceutical applications of KGs

4.3.1 Deep analysis of the Hetionet knowledge graph

4.3.2 LLM-assisted interpretation of pathway analysis results

4.4 Clinical applications of KGs

4.4.1 LLM-guided clinical decision support analysis

Part 3 Building knowledge graphs from text

5 Extracting domain-specific knowledge from unstructured data

5.1 The archives challenge

5.2 Key concepts of knowledge extraction

5.2.1 Recognizing named entities

5.2.2 Extracting relations

5.3 Building KGs with large language models

5.3.1 Using LLMs

5.3.2 Prompt engineering examples

5.3.3 Prompt engineering guidelines

5.3.4 KG building: Traditional NLP or LLMs?

6 Building knowledge graphs with large language models

6.1 Transforming an archive to a KG

6.1.1 Graph modeling

6.1.2 Creating a metagraph

6.1.3 Normalization and cleansing

6.1.4 Graph-based entity resolution

6.2 Intellectual network analysis: The value of graphs

6.3 Next steps in the Rockefeller Archive Center project

6.4 The value of knowledge graphs in the LLM era

7 Named entity disambiguation

7.1 From recognition to disambiguation

7.2 Understanding named entity disambiguation

7.3 Domain-based NED and LLMs

7.4 Business and domain understanding

7.4.1 Context

7.4.2 Use case definition

7.5 Understanding the data

7.5.1 Unstructured data

7.5.2 Domain ontologies

7.6 Building a SoHO knowledge graph

7.6.1 Defining the schema

7.6.2 Processing and ingesting documents

7.6.3 Disambiguating and ingesting medical entities

7.6.4 Processing, loading, and mapping ontologies

7.6.5 Generating entity co-occurrences

7.7 KG-based use cases

7.7.1 Conceptual search

7.7.2 Structured knowledge-based search

7.7.3 KG-based interpretability and discovery

7.7.4 Uncovering new knowledge

8 NED with open LLMs and domain ontologies

8.1 Understanding limitations of traditional NED systems

8.2 Ingesting the domain ontology

8.3 Setting up the model with Ollama and Llama 3.1 8B

8.4 End-to-end NED process

8.4.1 Named entity recognition

8.4.2 Candidate selection

8.4.3 Candidate disambiguation

8.5 Conclusions

Part 4 Machine learning on knowledge graphs

9 Machine learning on knowledge graphs: A primer approach

9.1 Machine learning on graphs: Why?

9.2 Machine learning on graphs: What?

9.2.1 Node classification

9.2.2 Link prediction (a.k.a. relationship prediction)

9.2.3 Clustering and community detection

9.2.4 Graph classification

9.3 Machine learning on graphs: How?

9.3.1 Node classification and link prediction

9.3.2 Graph classification

9.3.3 Graph clustering

10 Graph feature engineering: Manual and semiautomated approaches

10.1 Manual node features

10.1.1 Degree

10.1.2 Triangles

10.1.3 Density

10.1.4 Geodesic (or shortest) path

10.1.5 Closeness

10.1.6 Betweenness

10.1.7 PageRank

10.1.8 Prediction

10.2 Manual relationship features

10.2.1 Node-based representation

10.2.2 Path-based features

10.3 Semiautomated feature extraction

10.3.1 Performing ReFeX manually

10.3.2 Performing ReFeX automatically with code

11 Graph representation learning and graph neural networks

11.1 Embeddings in graph representation learning

11.1.1 Understanding graph embeddings: From discrete to continuous

11.1.2 Real-world applications and examples

11.2 The encoder–decoder model

11.2.1 The encoder: Converting graph structure to vectors

11.2.2 The decoder: Reconstructing graph properties

11.2.3 The power of the framework

11.2.4 Node2Vec: An example of an encoder–decoder framework

11.3 Shallow embeddings: A first approach to graph representation

11.3.1 Understanding shallow embeddings

11.3.2 Limitations of shallow embeddings

11.4 Embeddings in knowledge graphs

11.4.1 Loss function

11.4.2 Multirelationship decoder

11.5 Message passing and graph neural networks

11.5.1 The message-passing framework: A neural conversation

11.5.2 Motivation and intuition: Why message passing works

11.5.3 The basic GNN model

11.5.4 Message passing with self-loops

11.6 Generalized aggregation and update methods

11.6.1 Neighborhood normalization

11.6.2 Neighborhood attention

11.6.3 Multihead attention and transformer connections

11.6.4 Generalized update methods

11.7 The synergy of GNNs and LLMs

12 Node classification and link prediction with GNNs

12.1 Node classification for anti-money laundering applications

12.1.1 Input data

12.1.2 Graph processor: Data preparation

12.1.3 Graph processor: Homogeneous PyG graph

12.1.4 Encoder–decoder architecture

12.1.5 Evaluation and analysis

12.2 Link prediction for movie recommendations

12.2.1 Input data

12.2.2 Graph processor: Data preparation

12.2.3 Graph processor: Heterogeneous PyG graph

12.2.4 Encoder–decoder architecture

12.2.5 Evaluation and analysis

Part 5 Information retrieval with knowledge graphs and LLMs

13 Knowledge graph–powered retrieval-augmented generation

13.1 AI agents

13.2 Chatting with the LLM

13.3 Challenges in the production environment

13.4 Chatting with the AI about private data

13.4.1 Retrieval-augmented generation

13.4.2 Vector-based RAG limitations

13.4.3 Graph RAG

13.4.4 Reasoning agents

13.4.5 Let’s chat with our KG

14 Asking a KG questions with natural language

14.1 Querying a knowledge graph in the policing domain

14.1.1 Enabling domain experts with knowledge graphs

14.2 RAG for KG querying: Capabilities and challenges

14.2.1 RAG effectiveness with complete context

14.2.2 RAG fragility with incomplete retrieval

14.3 Schema-based approach for querying KGs

14.3.1 Understanding and using graph schemas

14.4 Think like an expert: Using metadata for enhanced querying

14.5 Intent detection: Understanding user expectations

14.5.1 Classifying by visualization type

14.5.2 Is it data, documentation, or just complaining?

14.6 From schema to LLM-ready context

14.6.1 Schema extraction and representation

14.6.2 Enriching schemas with descriptive annotations

14.6.3 A practical approach to schema representation

14.7 It’s time to think: Understanding LLM reasoning

14.7.1 The order matters: Answer first vs. reasoning first

14.7.2 Thinking in queries: From text to Cypher

14.7.3 Structuring output for reliable query generation

14.8 Response summarization: From results to insights

15 Building a QA agent with LangGraph

15.1 Building the LangGraph pipeline

15.1.1 System architecture overview

15.1.2 Configuring pipeline components

15.1.3 Schema translation service

15.1.4 State management design

15.1.5 Pipeline agent implementation

15.1.6 Pipeline integration layer

15.2 Streamlit application

15.2.1 Application overview

15.2.2 LangGraph integration

15.3 Expert-emulating investigation

15.3.1 Identifying the initial case

15.3.2 Spatial analysis of surveillance coverage

15.3.3 Vehicle pattern detection

15.3.4 Context-aware request refinement

15.3.5 Historical record analysis

15.4 Future directions and enhancements

15.4.1 Learning from use

15.4.2 Enhancing core capabilities

15.4.3 Advanced evolution paths

Appendixes

Appendix A: Introduction to graphs

A.1 What is a graph?

A.2 Graphs as models of networks

A.3 Representing graphs

Appendix B: Neo4j

B.1 Introduction to Neo4j

B.2 Installing Neo4j

B.2.1 Installing a Neo4j server

B.2.2 Neo4j Desktop installation

B.3 Cypher

B.4 Installing plugins

B.4.1 Installing APOC Core

B.4.2 GDS installation

B.5 Cleaning

Appendix C: Building knowledge graphs from structured sources

C.1 MicroRNA–disease association: Warmup

C.1.1 Key concepts

C.1.2 Business understanding

C.1.3 Data understanding

C.2 Building the miRNA knowledge graph

C.2.1 Importing known miRNA–disease connections

C.2.2 Importing the disease ontology

C.2.3 Importing miRNA information

C.3 Exploring and analyzing the miRNA KG

Appendix D: references

Overview

8 NED with open LLMs and domain ontologies

This chapter introduces a practical, domain-agnostic approach to Named Entity Disambiguation that marries open large language models with rich domain ontologies. It begins by motivating the need to go beyond traditional tools such as ScispaCy, which, while strong in biomedical contexts, can be hard to update, are tied to a single domain, and underuse the relational knowledge embedded in ontologies. The proposed solution runs locally on consumer hardware using an open model (Llama 3.1 8B via Ollama) and leverages SNOMED as a reference ontology, addressing privacy, latency, and extensibility while explicitly exploiting ontology structure during disambiguation.

The implementation loads SNOMED into Neo4j, normalizing many relationship types under a single relationship with a type property and introducing a dedicated hierarchical link to propagate semantic categories from top-level concepts down the taxonomy. These propagated categories feed an ontology-guided NER step in which the LLM is prompted to extract only ontology-derived entity types; character offsets are then computed deterministically in post-processing. Candidate selection does not rely on the LLM: it uses Neo4j full-text search constrained by the NER labels to retrieve plausible SNOMED candidates for each mention (e.g., multiple interpretations for “Zika”), establishing a focused pool for the final decision.

Candidate disambiguation combines graph algorithms and LLM reasoning in three steps: shortest-path detection between candidates within the ontology (with hub filtering and limited hops for relevance), path-to-text translation that turns the discovered graph paths into natural-language statements, and summarization that condenses many sentences into a compact context. The final LLM prompt uses this distilled, ontology-grounded context to pick the best candidate per mention—e.g., preferring “Congenital Zika virus infection” when “microcephaly” co-occurs. The result is an end-to-end NED workflow that is accurate in sparse or ambiguous contexts, portable across domains with suitable ontologies, privacy-friendly when run locally, and extensible with improvements such as vector-based candidate retrieval.

A sample of the hierarchical structure of the SNOMED ontology. Leveraging this hierarchical structure, nodes located on a deeper level, such as “Ecallantide” and “Retinopathy associated with AIDS”, can be categorized using the information from the first-level nodes, such as “Pharmaceutical product” and “Disease”, which represent the archetypal entities of the ontology.

A comprehensive workflow for a Named Entity Disambiguation (NED) system designed to leverage Large Language Models (LLMs) and domain-specific ontologies, such as SNOMED, for biomedical text processing. This workflow is organized into distinct stages, each involving various processes and interactions between the LLM and domain ontology to disambiguate entities within input text accurately.

The first stage of NED is Named Entity Recognition (NER). In our scenario, the collection of named entities is derived directly from the ontology. In SNOMED, the categories are defined by the first-level nodes of the ontology, whose information is propagated to all the other nodes.

The second stage of NED is the Candidate Selection. In this scenario, for each entity mention detected in the previous step, this stage retrieves a collection of potential candidates that can refer to each mention. The current implementation employs a full-text search but can be potentially extended with more advanced techniques.

The third stage of NED is the Candidate Disambiguation. The key goal is to select the best match among all the possible candidates identified in the previous step. To reach this goal, the disambiguation phase exploits an advanced approach combining graph-based algorithms (shortest path detection) and LLMs.

NED Candidate Disambiguation is divided into three main phases: (1) Shortest Path Detection between all the candidates related to the different entity mentions in the sentence. (2) Path-to-text Translation to transform detected paths into natural language sentences. (3) Textual paths summarization to summarize all the natural language sentences into a unique and valuable piece of text useful for the disambiguation

Summary

Named Entity Disambiguation (NED) is essential for accurately identifying and distinguishing entities in complex domains, particularly in biomedical text.
Traditional Natural Language Processing (NLP) tools such as ScispaCy have some limitations:

they can not be used in diverse domains, they can not leverage the relationships between entities, and their reference knowledge can not be extended and updated.

The combination of general-purpose Large Language Models (LLMs) and domain ontologies allows us to address these issues: LLMs can be driven by the continuously updated knowledge incorporated by the ontology and leverage its relational structure.
To reach this goal, we can deploy a flexible end-to-end process for NED, including multiple phases involving LLMs and domain ontologies, such as Named Entity Recognition (NER), NED Candidate Selection, and NED Candidate Disambiguation.
To fully leverage the capabilities of LLMs combined with the graph dimension of domain ontologies, the disambiguation phase is divided into three different stages:

Shortest-path detection.
Path-to-text translation.
Textual path summarization.

Future NED applications can leverage this framework and adapt to different domains, which are characterized by rich ontologies describing the relational nature of their specific entities.

FAQ

What limitations of traditional NED tools like ScispaCy does this chapter address?

ScispaCy is domain-specific (biomedical), hard to expand/update with new entities and aliases, and it underutilizes knowledge base structure by not leveraging relationships and paths. The chapter addresses these gaps by combining open LLMs with domain ontologies and graph algorithms to use hierarchical and relational context during disambiguation.

Why combine open LLMs with domain ontologies for NED?

LLMs provide broad linguistic competence, while ontologies like SNOMED contribute precise, curated structure. Together they: 1) generalize across domains; 2) ground decisions in canonical concepts; 3) exploit hierarchical/relational context; 4) improve accuracy under sparse local context; and 5) can run locally for privacy with tools like Ollama.

What is SNOMED, and which files are ingested?

SNOMED CT is a comprehensive, multilingual clinical terminology with 450k+ concepts and rich relationships. The chapter ingests two TSV files: 1) sct2_Description_Full-en_*.txt for names and aliases; 2) sct2_Relationship_Full_*.txt for relationships (triples and metadata).

How is SNOMED modeled and loaded in Neo4j?

- Nodes: SnomedEntity(id, name, aliases)
- Relationships: a generic SNOMED_RELATION with type and id; plus a dedicated SNOMED_IS_A for hierarchy
- Indexes/constraints: unique id, indexes for name, relation id/type, and a full-text name index
- Names/aliases are added to both nodes and relationships from the description file.

How are NER categories derived from SNOMED?

First-level ontology nodes (e.g., Disease, Organism, Substance, Event) are propagated down the SNOMED_IS_A hierarchy using APOC traversal. The collected types (n.type) become the allowed labels in the NER prompt, ensuring entity extraction aligns with ontology categories.

How does candidate selection work, and why isn’t the LLM used here?

Candidate selection uses Neo4j full-text search on SnomedEntity names/aliases with fuzzy matching, filtered by the NER label to limit scope. It returns (snomed_id, name) candidates per mention. LLMs aren’t used because the ontology is too large for prompts and we want exact retrieval from the ground-truth KB rather than model priors.

What is the shortest-path strategy for disambiguation?

For candidates across co-occurring mentions in a sentence: 1) compute allShortestPaths (1–2 hops) between candidate pairs; 2) exclude high-degree “hub” nodes via GDS degree to avoid generic links; 3) format paths with relationship directions and types; these paths become evidence used downstream in LLM prompts.

How are graph paths turned into text, and why summarize them?

Paths like (A)-[:REL]->(B)<-[:REL]-(C) are translated by an LLM into clear sentences that preserve exact entity names. A second prompt summarizes multiple sentences into a concise “context” to reduce token load and highlight salient relationships that guide the final disambiguation.

How do I run Llama 3.1 8B locally with Ollama and call it from Python?

- Install and run: “ollama serve”, then “ollama pull llama3.1:latest”
- Use OpenAI-compatible Chat Completions via base_url http://localhost:11434/v1 and model "llama3.1:latest"
- Set temperature low (e.g., 0) for deterministic NED outputs.

What enhancements and practical tips does the chapter suggest?

- Add vector search to boost candidate recall alongside full-text
- Tune hub filtering thresholds and path lengths; cache computed paths
- Post-process character offsets deterministically to fix LLM span errors
- Extend to other domains by swapping in rich ontologies
- Evaluate with standard NED metrics; incrementally update aliases and entities; leverage multilingual capabilities of Llama 3.1 where helpful.

pro $24.99 per month

access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
choose one free eBook per month to keep
exclusive 50% discount on all purchases
renews monthly, pause or cancel renewal anytime

lite $19.99 per month

access to all Manning books, including MEAPs!

team

5, 10 or 20 seats+ for your team - learn more

eBook

pdf, ePub, online

$64.99 $48.74

you save $16.25 (25%)

include audio $24.99 $18.74

pro $24.99 per month

access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
choose one free eBook per month to keep
exclusive 50% discount on all purchases
renews monthly, pause or cancel renewal anytime

lite $19.99 per month

access to all Manning books, including MEAPs!

team

5, 10 or 20 seats+ for your team - learn more

eBook

$64.99 $48.74

you save $16.25 (25%)

include audio $24.99 $18.74

eBook

pdf, ePub, online

$64.99 $48.74

you save $16.25 (25%)

include audio $24.99 $18.74

pro $24.99 per month

access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
choose one free eBook per month to keep
exclusive 50% discount on all purchases
renews monthly, pause or cancel renewal anytime

lite $19.99 per month

access to all Manning books, including MEAPs!

team

5, 10 or 20 seats+ for your team - learn more