Knowledge Graphs and LLMs in Action

Build AI systems using connected data

Alessandro Negro with Vlastimil Kus, Giuseppe Futia and Fabio Montagna
Forewords by Maxime Labonne, Khalifeh AlJadda

October 2025
ISBN 9781633439894
472 pages

Included with a Manning Online subscription

printed in black & white

available in Russian, Simplified Chinese



table of contents

Part 1 Foundations of hybrid intelligent systems

1 Knowledge graphs and LLMs: A killer combination

1.1 Knowledge graphs

1.2 Large language models

1.3 KGs and LLMs: Stronger together

1.4 The paradigm shift in data-driven applications

1.4.1 The four pillars of knowledge graphs

1.5 Building data-driven applications using KGs and LLMs

1.5.1 Example use case: Drug discovery and development

1.5.2 Example use case: Conversational AI for customer support

1.5.3 Deciding whether to use a KG

1.6 Knowledge graph technologies

1.6.1 Taxonomies and ontologies

1.7 How do we teach KGs and LLMs?

2 Intelligent systems: A hybrid approach

2.1 What is intelligence?

2.2 Designing an intelligent system

2.2.1 What is an intelligent system?

2.2.2 Categories of intelligent systems

2.2.3 Characteristics of an intelligent system

2.3 Knowledge acquisition and representation

2.4 Reasoning

2.5 Reasoning engines

2.5.1 Limitations of a pure deductive reasoning engine

2.5.2 Using inductive reasoning and ML

2.5.3 The role of LLMs in the reasoning engine

2.6 A KG approach to IASs

Part 2 Building knowledge graphs from structured data sources

3 Create your first knowledge graph from ontologies

3.1 Knowledge graph building: Warmup

3.1.1 Business and domain understanding

3.1.2 Data understanding

3.2 Understanding knowledge graph technologies

3.2.1 RDF or LPG? A goal-driven discussion

3.2.2 Representing edge properties with RDF and LPG

3.3 Building a knowledge graph

3.3.1 Ontology ingestion and processing with neosemantics

3.3.2 Annotation ingestion and processing

3.4 Querying the data

3.5 Reasoning over the KG

4 From simple networks to multisource integration

4.1 Biomedical knowledge graphs and applications

4.2 Multi-omic applications of KGs

4.2.1 Creating a KG from the PPI and protein-disease networks

4.2.2 High-level analysis of the resulting KGs

4.2.3 Domain-specific analysis of the PPI and disease KG

4.3 Pharmaceutical applications of KGs

4.3.1 Deep analysis of the Hetionet knowledge graph

4.3.2 LLM-assisted interpretation of pathway analysis results

4.4 Clinical applications of KGs

4.4.1 LLM-guided clinical decision support analysis

Part 3 Building knowledge graphs from text

5 Extracting domain-specific knowledge from unstructured data

5.1 The archives challenge

5.2 Key concepts of knowledge extraction

5.2.1 Recognizing named entities

5.2.2 Extracting relations

5.3 Building KGs with large language models

5.3.1 Using LLMs

5.3.2 Prompt engineering examples

5.3.3 Prompt engineering guidelines

5.3.4 KG building: Traditional NLP or LLMs?

6 Building knowledge graphs with large language models

6.1 Transforming an archive to a KG

6.1.1 Graph modeling

6.1.2 Creating a metagraph

6.1.3 Normalization and cleansing

6.1.4 Graph-based entity resolution

6.2 Intellectual network analysis: The value of graphs

6.3 Next steps in the Rockefeller Archive Center project

6.4 The value of knowledge graphs in the LLM era

7 Named entity disambiguation

7.1 From recognition to disambiguation

7.2 Understanding named entity disambiguation

7.3 Domain-based NED and LLMs

7.4 Business and domain understanding

7.4.1 Context

7.4.2 Use case definition

7.5 Understanding the data

7.5.1 Unstructured data

7.5.2 Domain ontologies

7.6 Building a SoHO knowledge graph

7.6.1 Defining the schema

7.6.2 Processing and ingesting documents

7.6.3 Disambiguating and ingesting medical entities

7.6.4 Processing, loading, and mapping ontologies

7.6.5 Generating entity co-occurrences

7.7 KG-based use cases

7.7.1 Conceptual search

7.7.2 Structured knowledge-based search

7.7.3 KG-based interpretability and discovery

7.7.4 Uncovering new knowledge

8 NED with open LLMs and domain ontologies

8.1 Understanding limitations of traditional NED systems

8.2 Ingesting the domain ontology

8.3 Setting up the model with Ollama and Llama 3.1 8B

8.4 End-to-end NED process

8.4.1 Named entity recognition

8.4.2 Candidate selection

8.4.3 Candidate disambiguation

8.5 Conclusions

Part 4 Machine learning on knowledge graphs

9 Machine learning on knowledge graphs: A primer approach

9.1 Machine learning on graphs: Why?

9.2 Machine learning on graphs: What?

9.2.1 Node classification

9.2.2 Link prediction (a.k.a. relationship prediction)

9.2.3 Clustering and community detection

9.2.4 Graph classification

9.3 Machine learning on graphs: How?

9.3.1 Node classification and link prediction

9.3.2 Graph classification

9.3.3 Graph clustering

10 Graph feature engineering: Manual and semiautomated approaches

10.1 Manual node features

10.1.1 Degree

10.1.2 Triangles

10.1.3 Density

10.1.4 Geodesic (or shortest) path

10.1.5 Closeness

10.1.6 Betweenness

10.1.7 PageRank

10.1.8 Prediction

10.2 Manual relationship features

10.2.1 Node-based representation

10.2.2 Path-based features

10.3 Semiautomated feature extraction

10.3.1 Performing ReFeX manually

10.3.2 Performing ReFeX automatically with code

11 Graph representation learning and graph neural networks

11.1 Embeddings in graph representation learning

11.1.1 Understanding graph embeddings: From discrete to continuous

11.1.2 Real-world applications and examples

11.2 The encoder–decoder model

11.2.1 The encoder: Converting graph structure to vectors

11.2.2 The decoder: Reconstructing graph properties

11.2.3 The power of the framework

11.2.4 Node2Vec: An example of an encoder–decoder framework

11.3 Shallow embeddings: A first approach to graph representation

11.3.1 Understanding shallow embeddings

11.3.2 Limitations of shallow embeddings

11.4 Embeddings in knowledge graphs

11.4.1 Loss function

11.4.2 Multirelationship decoder

11.5 Message passing and graph neural networks

11.5.1 The message-passing framework: A neural conversation

11.5.2 Motivation and intuition: Why message passing works

11.5.3 The basic GNN model

11.5.4 Message passing with self-loops

11.6 Generalized aggregation and update methods

11.6.1 Neighborhood normalization

11.6.2 Neighborhood attention

11.6.3 Multihead attention and transformer connections

11.6.4 Generalized update methods

11.7 The synergy of GNNs and LLMs

12 Node classification and link prediction with GNNs

12.1 Node classification for anti-money laundering applications

12.1.1 Input data

12.1.2 Graph processor: Data preparation

12.1.3 Graph processor: Homogeneous PyG graph

12.1.4 Encoder–decoder architecture

12.1.5 Evaluation and analysis

12.2 Link prediction for movie recommendations

12.2.1 Input data

12.2.2 Graph processor: Data preparation

12.2.3 Graph processor: Heterogeneous PyG graph

12.2.4 Encoder–decoder architecture

12.2.5 Evaluation and analysis

Part 5 Information retrieval with knowledge graphs and LLMs

13 Knowledge graph–powered retrieval-augmented generation

13.1 AI agents

13.2 Chatting with the LLM

13.3 Challenges in the production environment

13.4 Chatting with the AI about private data

13.4.1 Retrieval-augmented generation

13.4.2 Vector-based RAG limitations

13.4.3 Graph RAG

13.4.4 Reasoning agents

13.4.5 Let’s chat with our KG

14 Asking a KG questions with natural language

14.1 Querying a knowledge graph in the policing domain

14.1.1 Enabling domain experts with knowledge graphs

14.2 RAG for KG querying: Capabilities and challenges

14.2.1 RAG effectiveness with complete context

14.2.2 RAG fragility with incomplete retrieval

14.3 Schema-based approach for querying KGs

14.3.1 Understanding and using graph schemas

14.4 Think like an expert: Using metadata for enhanced querying

14.5 Intent detection: Understanding user expectations

14.5.1 Classifying by visualization type

14.5.2 Is it data, documentation, or just complaining?

14.6 From schema to LLM-ready context

14.6.1 Schema extraction and representation

14.6.2 Enriching schemas with descriptive annotations

14.6.3 A practical approach to schema representation

14.7 It’s time to think: Understanding LLM reasoning

14.7.1 The order matters: Answer first vs. reasoning first

14.7.2 Thinking in queries: From text to Cypher

14.7.3 Structuring output for reliable query generation

14.8 Response summarization: From results to insights

15 Building a QA agent with LangGraph

15.1 Building the LangGraph pipeline

15.1.1 System architecture overview

15.1.2 Configuring pipeline components

15.1.3 Schema translation service

15.1.4 State management design

15.1.5 Pipeline agent implementation

15.1.6 Pipeline integration layer

15.2 Streamlit application

15.2.1 Application overview

15.2.2 LangGraph integration

15.3 Expert-emulating investigation

15.3.1 Identifying the initial case

15.3.2 Spatial analysis of surveillance coverage

15.3.3 Vehicle pattern detection

15.3.4 Context-aware request refinement

15.3.5 Historical record analysis

15.4 Future directions and enhancements

15.4.1 Learning from use

15.4.2 Enhancing core capabilities

15.4.3 Advanced evolution paths

Appendixes

Appendix A: Introduction to graphs

A.1 What is a graph?

A.2 Graphs as models of networks

A.3 Representing graphs

Appendix B: Neo4j

B.1 Introduction to Neo4j

B.2 Installing Neo4j

B.2.1 Installing a Neo4j server

B.2.2 Neo4j Desktop installation

B.3 Cypher

B.4 Installing plugins

B.4.1 Installing APOC Core

B.4.2 GDS installation

B.5 Cleaning

Appendix C: Building knowledge graphs from structured sources

C.1 MicroRNA–disease association: Warmup

C.1.1 Key concepts

C.1.2 Business understanding

C.1.3 Data understanding

C.2 Building the miRNA knowledge graph

C.2.1 Importing known miRNA–disease connections

C.2.2 Importing the disease ontology

C.2.3 Importing miRNA information

C.3 Exploring and analyzing the miRNA KG

Appendix D: References

Overview

10 Graph feature engineering: manual and semi-automated approaches

This chapter explains how to turn graph elements into machine‑learnable vectors and frames feature engineering along a spectrum from manual to semi‑automated to fully automated approaches. It emphasizes the core trade‑off between interpretability and efficiency: hand‑crafted features are transparent and auditable, while more automated pipelines scale better but tend to be less explainable. The chapter focuses on techniques that keep humans in the loop—both to inject domain knowledge and to produce features that are understandable by analysts and amenable to reasoning by Large Language Models.

For nodes, the manual toolbox progresses from local to global structure: degrees (including domain‑aware variants such as fraud and legit degree), triangles (fraudulent, semi‑fraudulent, legitimate), egonet density, distances to specific node classes (e.g., shortest paths to fraudsters and path counts within 1–3 hops), and centralities like closeness, betweenness, and PageRank (plus a fraud‑weighted variant). These features are assembled into tabular datasets and used with standard classifiers (for example, logistic regression) to detect suspicious behavior while preserving explainability. For relationships, two strategies are presented: node‑based combinations that merge endpoint vectors (concatenation, average, L1/L2, Hadamard) and path‑based features that capture connection patterns via metapaths. To reduce hub bias in path counts, Degree‑Weighted Path Count (DWPC) down‑weights highly connected intermediates, enabling effective link prediction in domains such as drug repurposing. The chapter also shows how LLMs can accelerate this work by generating Cypher queries and scaffolding feature extraction code from high‑level specifications.

To bridge manual work and full automation, the chapter introduces ReFeX, a semi‑automated method that recursively aggregates local and egonet features (using sum/mean across neighborhoods) and prunes redundancy via correlation checks and binning. ReFeX yields interpretable, consistent, and transferable “regional” features that scale to larger graphs and remain robust across snapshots, while still benefiting from expert oversight. Although it focuses on topology and may omit attribute or edge‑type nuances without customization, ReFeX offers a practical middle ground when domain knowledge is limited or rapid, explainable feature generation is needed. The chapter closes by positioning these techniques as a solid, transparent foundation for downstream tasks in fraud detection and drug discovery, and as a springboard toward the fully automated representation learning methods that follow.

An example of a fraudulent social network. The white nodes represent legitimate people or people whose fraud status is unknown. The black nodes represent fraudsters.

Node feature extraction: using metrics and graph algorithms to transform nodes into numerical feature vectors that capture their key characteristics.

An example of three connected nodes forming a triangle (or not). If each node is connected to the other two, they constitute a triangle.

A typical relationship prediction process, compared with the node classification process presented in the previous chapter.

Hetionet schema as presented in chapter 4. Refer to that chapter for details on the project and the schema.

Metagraph and metapaths examples from Hetionet.

Metapath-based features definition for the relationships between compounds and diseases.

ReFeX converts each node into a vector representing the node’s topological features at different scales [8].

The same simple fraudulent network as in Figure 10.1. In this case, the color of the nodes will be ignored since ReFeX doesn’t consider the type of the nodes.

Visual description of a few iterations of the ReFeX process. At each iteration, each node passes its values (degree, previous sums, etc.) to its neighbors, which aggregate them.

References

[1] A. Negro, Graph-Powered Machine Learning. Shelter Island, NY, USA: Manning Publications, 2021.
[2] F. Fouss, M. Saerens, and M. Shimbo, Algorithms and Models for Network Data and Link Analysis. Cambridge, UK: Cambridge University Press, 2016. Accessed: May 15, 2024. [Online]. Available: https://doi.org/10.1017/CBO9781316418321
[3] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 4th ed. Cambridge, MA, USA: MIT Press, 2022.
[4] B. Baesens, V. Van Vlasselaer, and W. Verbeke, Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection. Hoboken, NJ, USA: Wiley, 2015.
[5] D. S. Himmelstein, A. Lizee, C. Hessler, et al., "Systematic integration of biomedical knowledge prioritizes drugs for repurposing," eLife, vol. 6, Art. no. e26726, Oct. 2017. Accessed: May 15, 2024. [Online]. Available: https://doi.org/10.7554/eLife.26726
[6] D. S. Himmelstein, "Our hetnet edge prediction methodology: the modeling framework for Project Rephetio," ThinkLab, 2016. Accessed: May 15, 2024. [Online]. Available: https://doi.org/10.15363/thinklab.d210
[7] D. S. Himmelstein, "Dhimmel/Learn V1.0: The Machine Learning Repository For Project Rephetio," Zenodo, 2017. Accessed: May 15, 2024. [Online]. Available: https://doi.org/10.5281/zenodo.268654
[8] K. Henderson, B. Gallagher, L. Li, L. Akoglu, T. Eliassi-Rad, H. Tong, and C. Faloutsos, "It's Who You Know: Graph Mining Using Recursive Structural Features," in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, 2011, pp. 663-671. Accessed: May 15, 2024. [Online]. Available: https://doi.org/10.1145/2020408.2020512

FAQ

What is graph feature engineering and why is it necessary for machine learning on graphs?

Most ML algorithms (logistic regression, random forests, deep nets) require numeric vectors, not raw graph structures. Graph feature engineering transforms nodes, relationships, or whole graphs into vectors that capture structural properties (for example, degree, triangles, centrality). The quality and relevance of these vectors strongly influence downstream performance for tasks like node classification, link prediction, and graph-level analysis.

How do manual, semi-automated, and fully automated approaches compare, and what are the trade-offs?

- Manual features: highly interpretable and domain-aligned, but time-consuming and bespoke.
- Semi-automated features (for example, ReFeX): generate broad, interpretable structural features with minimal manual effort; good balance of coverage and transparency.
- Fully automated (representation learning, covered later): highly scalable and task-adaptive, but often less interpretable.

In practice, choose based on interpretability needs, available expertise, data scale, and time-to-value.

What’s the difference between local and global node features?

- Local features: derived from a node’s immediate neighborhood (egonet) or small-radius neighborhoods. Examples: total degree, triangles, egonet density, counts of paths within 1–3 hops to specific node types (for example, fraudsters).
- Global features: capture a node’s role in the broader network. Examples: betweenness, closeness, PageRank, eigenvector centrality. These reflect influence, reachability, and connectivity patterns beyond the local vicinity.

How do degree-based features help in fraud detection?

Degree measures are simple yet powerful signals:
- total_degree: number of neighbors
- fraud_degree and legit_degree: neighbors split by known labels (for example, fraudster vs legitimate).
If a node is directly connected to many fraudsters (high fraud_degree), its risk of being fraudulent or influenced by fraud increases. This domain-aware refinement of degree often improves classification accuracy.
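
As a rough illustration of the three degree features, here is a minimal Python sketch. The toy edge list, the `FRAUDSTERS` set, and the function names are hypothetical and chosen only for this example; unlabeled nodes are simply treated as legitimate.

```python
from collections import defaultdict

# Hypothetical toy network: undirected edges between people.
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"),
         ("D", "E"), ("C", "E"), ("E", "F")]
# Assumed known labels; every other node is treated as legitimate.
FRAUDSTERS = {"C", "E"}


def build_adjacency(edges):
    """Return {node: set of neighbors} for an undirected edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def degree_features(adj, fraudsters):
    """Split each node's degree into fraud and legitimate neighbors."""
    features = {}
    for node, neighbors in adj.items():
        fraud_degree = len(neighbors & fraudsters)
        features[node] = {
            "total_degree": len(neighbors),
            "fraud_degree": fraud_degree,
            "legit_degree": len(neighbors) - fraud_degree,
        }
    return features


adj = build_adjacency(EDGES)
features = degree_features(adj, FRAUDSTERS)
print(features["D"])
```

In this toy graph, node D has only two neighbors, but both are known fraudsters, which is exactly the kind of signal that total_degree alone would hide.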

What do triangle-based features capture and how are they specialized for fraud use cases?

Triangles indicate tightly knit local structure (two of a node's neighbors are also connected to each other). Specializations help model contagion or influence:
- total_triangles: all triangles a node participates in
- fraud_triangles: triangles where both other nodes are fraudsters
- legit_triangles: triangles where both other nodes are legitimate
- semi_fraud_triangles: mixed triangles (one fraudster, one legitimate)
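
The four triangle counts can be sketched in plain Python as follows. The toy graph and all names are hypothetical; the classification looks only at the two other nodes in each triangle, as in the list above.

```python
from itertools import combinations

# Hypothetical toy network and labels, for illustration only.
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"),
         ("D", "E"), ("C", "E"), ("E", "F")]
FRAUDSTERS = {"C", "E"}


def build_adjacency(edges):
    """Return {node: set of neighbors} for an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def triangle_features(adj, fraudsters):
    """Count the triangles around each node, split by the labels of the other two nodes."""
    # Index = number of fraudsters among the two other nodes.
    keys = ("legit_triangles", "semi_fraud_triangles", "fraud_triangles")
    features = {}
    for node, neighbors in adj.items():
        counts = {"total_triangles": 0, "fraud_triangles": 0,
                  "legit_triangles": 0, "semi_fraud_triangles": 0}
        for u, v in combinations(sorted(neighbors), 2):
            if v not in adj[u]:
                continue  # the two neighbors are not connected: no triangle
            counts["total_triangles"] += 1
            n_fraud = (u in fraudsters) + (v in fraudsters)
            counts[keys[n_fraud]] += 1
        features[node] = counts
    return features


adj = build_adjacency(EDGES)
triangles = triangle_features(adj, FRAUDSTERS)
print(triangles["C"])
```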

How is egonet density computed and when is it useful?

Egonet density measures how interconnected a node’s neighborhood is. For an egonet with N nodes and M observed edges, density = M / [N(N − 1)/2]. High density signals cohesive, potentially high-trust clusters; lower density suggests sparse, broker-like neighborhoods. In fraud analysis, density complements degree and triangles to describe local cohesion.
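
The density formula can be checked with a short Python sketch. It assumes the egonet contains the node itself plus its direct neighbors, and that M counts every edge among those nodes; the toy graph and function names are hypothetical.

```python
# Hypothetical toy network, for illustration only.
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"),
         ("D", "E"), ("C", "E"), ("E", "F")]


def build_adjacency(edges):
    """Return {node: set of neighbors} for an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def egonet_density(adj, node):
    """density = M / [N(N - 1)/2] over the node and its direct neighbors."""
    members = adj[node] | {node}
    n = len(members)
    if n < 2:
        return 0.0  # an isolated node has no possible edges
    # Each undirected edge inside the egonet is seen from both endpoints.
    m = sum(len(adj[u] & members) for u in members) // 2
    return m / (n * (n - 1) / 2)


adj = build_adjacency(EDGES)
density_c = egonet_density(adj, "C")
density_f = egonet_density(adj, "F")
print(density_c, density_f)
```

Here the egonet of C has N = 5 nodes and M = 6 edges out of 10 possible, so its density is 0.6, while the two-node egonet of F is fully connected.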

What are geodesic (shortest-path) features and why do they matter?

Geodesic features capture how close a node is to risky or influential nodes and the breadth of short connections:
- geodesic_path: shortest distance to any fraudster (or other target class)
- #1-hop, #2-hop, #3-hop paths to fraudsters: counts of paths within 1–3 hops
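
A minimal sketch of these features in Python follows. It makes two assumptions that the text does not spell out: the geodesic distance is measured to the nearest fraudster other than the node itself, and a "k-hop path" is a simple path (no repeated nodes) of exactly k edges that ends at a fraudster. The toy graph and function names are hypothetical.

```python
from collections import deque

# Hypothetical toy network and labels, for illustration only.
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"),
         ("D", "E"), ("C", "E"), ("E", "F")]
FRAUDSTERS = {"C", "E"}


def build_adjacency(edges):
    """Return {node: set of neighbors} for an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def distance_to_nearest(adj, source, targets):
    """BFS hop count from source to the closest other target; None if unreachable."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node != source and node in targets:
            return dist
        for nb in adj[node]:
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, dist + 1))
    return None


def count_paths(adj, source, targets, hops):
    """Count simple paths of exactly `hops` edges from source that end at a target."""
    def walk(node, remaining, visited):
        if remaining == 0:
            return 1 if node in targets else 0
        return sum(walk(nb, remaining - 1, visited | {nb})
                   for nb in adj[node] if nb not in visited)
    return walk(source, hops, {source})


adj = build_adjacency(EDGES)
geodesic_a = distance_to_nearest(adj, "A", FRAUDSTERS)
hop_counts_a = [count_paths(adj, "A", FRAUDSTERS, k) for k in (1, 2, 3)]
print(geodesic_a, hop_counts_a)
```

Exhaustive path enumeration grows quickly with the hop limit, which is one reason to keep the radius small (1 to 3 hops) on real graphs.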

How do closeness, betweenness, and PageRank differ and when should I use each?

- Closeness: inverse of the average shortest-path distance to reachable nodes. High closeness means a node can reach others quickly; useful for identifying fast spreaders/influencers.
- Betweenness: fraction of shortest paths between other node pairs that pass through the node. High betweenness flags bridges or bottlenecks; useful for controlling information flow and spotting chokepoints (potentially key facilitators of fraud).
- PageRank: importance derived from connections to other important nodes. A fraud-weighted variant boosts influence from known fraudsters, surfacing nodes with disproportionate exposure to risky peers.
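
The sketch below illustrates closeness and PageRank in plain Python (betweenness is omitted for brevity). The fraud-weighted variant is modeled here as a personalized PageRank whose restart vector is concentrated on known fraudsters; that is one plausible reading, not necessarily the chapter's exact formulation. The toy graph, the damping value of 0.85, and all names are assumptions.

```python
from collections import deque

# Hypothetical toy network and labels, for illustration only.
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"),
         ("D", "E"), ("C", "E"), ("E", "F")]
FRAUDSTERS = {"C", "E"}


def build_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def closeness(adj, node):
    """Inverse of the average BFS distance from node to the nodes it can reach."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for nb in adj[current]:
            if nb not in dist:
                dist[nb] = dist[current] + 1
                queue.append(nb)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


def pagerank(adj, restart=None, damping=0.85, iterations=100):
    """Power iteration; `restart` is the teleport distribution (uniform by default)."""
    nodes = sorted(adj)
    if restart is None:
        restart = {n: 1.0 for n in nodes}
    norm = sum(restart.values())
    restart = {n: restart.get(n, 0.0) / norm for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        new_rank = {n: (1 - damping) * restart[n] for n in nodes}
        for n in nodes:
            if adj[n]:
                share = damping * rank[n] / len(adj[n])
                for nb in adj[n]:
                    new_rank[nb] += share
            else:  # dangling node: send its mass back through the restart vector
                for m in nodes:
                    new_rank[m] += damping * rank[n] * restart[m]
        rank = new_rank
    return rank


adj = build_adjacency(EDGES)
closeness_c = closeness(adj, "C")
uniform_rank = pagerank(adj)
fraud_rank = pagerank(adj, restart={n: 1.0 for n in FRAUDSTERS})
print(closeness_c, max(uniform_rank, key=uniform_rank.get))
```

Changing only the restart vector keeps the algorithm identical while shifting importance toward nodes that are well connected to the labeled fraudsters.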

How can I represent relationships (links) for link prediction: node-based vs path-based?

Two main strategies:
- Node-based combination: build an edge vector by combining the two node vectors u and v with operators like concatenation, average, L1 (|u − v|), L2 ((u − v)^2), or Hadamard (u * v), all applied element-wise. Simple and fast; performance depends on node feature quality.
- Path-based features: describe structural connectivity between nodes using counts/patterns of paths and metapaths. In heterogeneous graphs (for example, drug–gene–disease), degree-weighted path count (DWPC) discounts paths through hubs and yields more meaningful signals than raw path counts.
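
Both strategies can be sketched in a few lines of Python. The first function applies the element-wise operators to two node vectors. The second computes a simplified DWPC: each path along a metapath contributes the product, over its edges, of (degree of source × degree of target) raised to the power −w, with degrees counted per edge type. The damping exponent w = 0.4 follows the value reported for Project Rephetio [5]; the toy drug–gene–disease data, the edge-type names, and the function names are hypothetical.

```python
def edge_features(u, v):
    """Combine two equal-length node vectors into candidate edge vectors."""
    return {
        "concat": u + v,
        "average": [(a + b) / 2 for a, b in zip(u, v)],
        "l1": [abs(a - b) for a, b in zip(u, v)],
        "l2": [(a - b) ** 2 for a, b in zip(u, v)],
        "hadamard": [a * b for a, b in zip(u, v)],
    }


def dwpc(typed_edges, source, target, metapath, damping=0.4):
    """Sum of degree-damped path weights along `metapath` (a list of edge types)."""
    adj = {}  # (edge_type, node) -> neighbors reachable through that edge type
    for etype, pairs in typed_edges.items():
        for a, b in pairs:
            adj.setdefault((etype, a), set()).add(b)
            adj.setdefault((etype, b), set()).add(a)

    def walk(node, step, visited, weight):
        if step == len(metapath):
            return weight if node == target else 0.0
        etype = metapath[step]
        total = 0.0
        for nb in adj.get((etype, node), ()):
            if nb in visited:
                continue  # keep paths simple
            degrees = len(adj[(etype, node)]) * len(adj[(etype, nb)])
            total += walk(nb, step + 1, visited | {nb}, weight * degrees ** -damping)
        return total

    return walk(source, 0, {source}, 1.0)


# Hypothetical toy data: geneHub is a highly connected intermediate node.
TYPED_EDGES = {
    "binds": {("drug1", "geneA"), ("drug1", "geneHub"),
              ("drug2", "geneHub"), ("drug3", "geneHub")},
    "associates": {("geneA", "disease1"), ("geneHub", "disease1"),
                   ("geneHub", "disease2"), ("geneHub", "disease3")},
}
METAPATH = ["binds", "associates"]

combined = edge_features([2.0, 1.0], [4.0, 3.0])
raw_count = dwpc(TYPED_EDGES, "drug1", "disease1", METAPATH, damping=0.0)
score = dwpc(TYPED_EDGES, "drug1", "disease1", METAPATH)
print(combined["hadamard"], raw_count, score)
```

With the damping exponent set to 0, the function reduces to a raw path count (two paths here). With damping, the path through geneHub contributes less than the path through the more specific geneA.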

What is ReFeX and why is it a strong semi-automated option?

ReFeX (Recursive Feature eXtraction) automatically generates interpretable structural features by:
- Extracting local features (for example, degree),
- Adding egonet features (for example, egonet size, edges, density),
- Recursively aggregating neighbor features using sum/mean across iterations (regional patterns),
- Pruning highly correlated or redundant features.
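
The recursion can be illustrated with a deliberately simplified Python sketch. It is not the published ReFeX algorithm [8]: pruning here uses a plain Pearson correlation threshold instead of binning, and only two base features are used. The toy graph, the 0.95 threshold, and all names are assumptions.

```python
# Hypothetical toy network, for illustration only.
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"),
         ("D", "E"), ("C", "E"), ("E", "F")]


def build_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (vx * vy) ** 0.5


def is_redundant(candidate, kept_columns, threshold):
    """A candidate is dropped if it is constant or too correlated with a kept column."""
    if len(set(candidate)) == 1:
        return True
    return any(abs(pearson(candidate, col)) >= threshold for col in kept_columns)


def refex_sketch(adj, iterations=2, threshold=0.95):
    nodes = sorted(adj)
    index = {n: i for i, n in enumerate(nodes)}
    # Base features: degree and number of edges inside the egonet.
    features = {
        "degree": [len(adj[n]) for n in nodes],
        "egonet_edges": [
            sum(len(adj[u] & (adj[n] | {n})) for u in adj[n] | {n}) // 2
            for n in nodes
        ],
    }
    frontier = list(features)
    for _ in range(iterations):
        added = []
        for name in frontier:
            column = features[name]
            for agg in ("sum", "mean"):
                values = []
                for n in nodes:
                    nb_vals = [column[index[nb]] for nb in adj[n]]
                    total = sum(nb_vals)
                    values.append(total if agg == "sum"
                                  else (total / len(nb_vals) if nb_vals else 0.0))
                if not is_redundant(values, list(features.values()), threshold):
                    features[f"{agg}({name})"] = values
                    added.append(f"{agg}({name})")
        if not added:
            break  # nothing new survived pruning, so stop recursing
        frontier = added
    return nodes, features


adj = build_adjacency(EDGES)
nodes, features = refex_sketch(adj)
print(list(features))
```

Each retained column name records how it was derived (for example, an aggregate of an aggregate), which is what keeps the resulting "regional" features readable by an analyst.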

pro $24.99 per month

access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
choose one free eBook per month to keep
exclusive 50% discount on all purchases
renews monthly, pause or cancel renewal anytime

lite $19.99 per month

access to all Manning books, including MEAPs!

team

5, 10 or 20 seats+ for your team - learn more

eBook

pdf, ePub, online

$64.99 $48.74

you save $16.25 (25%)

include audio $24.99 $18.74
