Graph Data
A guide to graph data structures in analytics, used to model complex relationships and interconnected networks such as social graphs, fraud rings, and supply chains, where the connections are as important as the entities themselves.
When the Relationships Are the Data
Relational databases are excellent at storing entities and linking them via foreign keys. If you want to know all the orders placed by a specific customer, a simple SQL JOIN between the customers table and the orders table answers the question efficiently.
However, relational databases struggle when the questions involve deep or recursive relationships, indirect connections, or pathfinding. If you want to ask “Which of Customer A’s friends bought the same product as Customer B’s friends?”, the required SQL query chains several self-joins (or a recursive common table expression), and its cost grows rapidly with every additional hop, often becoming impractical as the dataset grows.
Graph data structures are designed specifically to solve these highly connected problems. In a graph model, data is not stored in rigid tables. Instead, it is stored as Nodes (entities like people, products, or accounts) and Edges (the relationships connecting them, such as “KNOWS”, “PURCHASED”, or “TRANSFERRED_FUNDS_TO”).
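To make the node and edge model concrete, here is a minimal in-memory sketch in Python. The PropertyGraph class, its method names, and the sample people and products are all hypothetical and exist only for illustration; a real graph database stores and indexes this structure very differently.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyGraph:
    """Tiny in-memory graph: nodes carry properties, edges carry a label."""
    nodes: dict = field(default_factory=dict)      # node_id -> properties
    out_edges: dict = field(default_factory=dict)  # node_id -> [(label, target_id)]

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, source, label, target):
        # Edges are directed and typed, e.g. ("alice", "KNOWS", "bob").
        self.out_edges.setdefault(source, []).append((label, target))

    def neighbors(self, node_id, label):
        """Return targets reachable from node_id over edges with this label."""
        return [t for (l, t) in self.out_edges.get(node_id, []) if l == label]


g = PropertyGraph()
g.add_node("alice", kind="person")
g.add_node("bob", kind="person")
g.add_node("laptop", kind="product")
g.add_edge("alice", "KNOWS", "bob")
g.add_edge("bob", "PURCHASED", "laptop")

# Two-hop question: what did the people Alice knows purchase?
friends_purchases = [
    product
    for friend in g.neighbors("alice", "KNOWS")
    for product in g.neighbors(friend, "PURCHASED")
]
print(friends_purchases)
```

Each extra hop is one more lookup from a node to its edges, rather than another join across an entire table.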
Common Graph Use Cases
Graph analytics shines when the connections between entities are the primary source of analytical value:
Fraud Detection: Fraud rings often involve complex networks of synthetic identities transferring small amounts of money between seemingly unrelated accounts. A graph query can easily identify circular money flows or “islands” of accounts that only transact with each other, exposing the ring in ways traditional tabular analysis cannot.
Recommendation Engines: “Customers who bought this also bought…” is a classic graph problem. The engine traverses the graph from a Target Product, outward to all Customers who purchased it, and outward again to all other Products those Customers purchased, returning the most heavily connected items.
Supply Chain Traceability: In a complex manufacturing supply chain, tracing a specific contaminated raw material component through multiple sub-assemblies to find all the final products affected requires traversing a deep, multi-level bill-of-materials graph.
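Two of the traversals described above can be sketched in a few lines of plain Python. The purchase pairs, the bill-of-materials mapping, and the function names are hypothetical sample data and helpers, not taken from any particular system.

```python
from collections import Counter, deque

# Hypothetical purchase edges: (customer, product).
purchases = [
    ("c1", "tent"), ("c1", "stove"), ("c1", "lantern"),
    ("c2", "tent"), ("c2", "stove"),
    ("c3", "tent"), ("c3", "boots"),
    ("c4", "boots"),
]


def also_bought(target, purchases, top_n=3):
    """Rank other products by how many purchase edges link them to buyers of `target`."""
    # Hop 1: from the target product out to its customers.
    buyers = {c for c, p in purchases if p == target}
    # Hop 2: from those customers out to their other products.
    counts = Counter(p for c, p in purchases if c in buyers and p != target)
    return counts.most_common(top_n)


# Hypothetical bill of materials: item -> assemblies that directly contain it.
used_in = {
    "resin_lot_7": ["housing_a", "gasket_b"],
    "housing_a": ["pump_x"],
    "gasket_b": ["pump_x", "dishwasher_model_2"],
    "pump_x": ["washer_model_1"],
    "washer_model_1": [],
    "dishwasher_model_2": [],
}


def affected_items(start, used_in):
    """Breadth-first traversal returning everything downstream of `start`."""
    seen, queue = set(), deque([start])
    while queue:
        item = queue.popleft()
        for parent in used_in.get(item, []):
            if parent not in seen:  # shared sub-assemblies are visited once
                seen.add(parent)
                queue.append(parent)
    return seen


def final_products(start, used_in):
    """Downstream items that are not themselves used in anything else."""
    return {i for i in affected_items(start, used_in) if not used_in.get(i)}


print(also_bought("tent", purchases))
print(sorted(final_products("resin_lot_7", used_in)))
```

The recommendation is a fixed two-hop traversal, while the traceability query keeps walking until no further levels remain, which is the case that is awkward to express as a fixed number of joins.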

Processing Graphs in the Modern Data Stack
Historically, graph analytics typically required migrating data into specialized, often expensive operational graph databases like Neo4j.
While operational graph databases are still necessary for real-time, transactional graph queries, the modern data lakehouse supports massive-scale analytical graph processing without moving the data.
Using frameworks like GraphX (part of Apache Spark), data engineers can take standard tabular data from Iceberg (e.g., an accounts table and a transactions table), logically project those tables as a graph of nodes and edges in memory, and run distributed graph algorithms across the cluster.
Algorithms like PageRank (to find the most influential nodes in a network), Connected Components (to find isolated clusters), and Shortest Path can be executed over billions of edges stored in the lakehouse. The resulting analytical scores (e.g., a “Fraud Risk Score” based on network proximity to known bad actors) are then written back to standard Iceberg tables for consumption by BI tools.
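The projection step can be illustrated without a cluster. The sketch below uses plain Python lists of dictionaries as a stand-in for the two tables; the column names (account_id, from_account, to_account, amount) and the project_graph helper are assumptions made for this example. It is not GraphX code, which builds its graph from distributed vertex and edge collections in Spark.

```python
# Hypothetical rows standing in for an accounts table and a transactions table.
accounts = [
    {"account_id": 1, "owner": "A. Rivera"},
    {"account_id": 2, "owner": "B. Chen"},
    {"account_id": 3, "owner": "C. Okafor"},
]
transactions = [
    {"txn_id": 100, "from_account": 1, "to_account": 2, "amount": 250.0},
    {"txn_id": 101, "from_account": 2, "to_account": 3, "amount": 240.0},
    {"txn_id": 102, "from_account": 3, "to_account": 1, "amount": 230.0},
    {"txn_id": 103, "from_account": 3, "to_account": 9, "amount": 10.0},
]


def project_graph(accounts, transactions):
    """Project two tables into (vertices, edges); the source rows are left untouched."""
    # Each account row becomes a vertex keyed by its id.
    vertices = {row["account_id"]: row for row in accounts}
    # Each transaction row becomes a directed edge with its own properties.
    edges = [
        (t["from_account"], t["to_account"], {"txn_id": t["txn_id"], "amount": t["amount"]})
        for t in transactions
        # Skip edges whose endpoints are missing from the accounts table.
        if t["from_account"] in vertices and t["to_account"] in vertices
    ]
    return vertices, edges


vertices, edges = project_graph(accounts, transactions)
print(len(vertices), "vertices,", len(edges), "edges")
```

The tables remain the system of record; the graph is only a view over them, built when an algorithm needs it.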
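As a single-machine illustration of two of these ideas, the sketch below labels connected components and derives a proximity-based score from hop distance to known bad accounts. The edge list, the known_bad set, and the 1 / (1 + hops) scoring formula are all hypothetical choices for this example; a production pipeline would run distributed equivalents and would likely use a richer scoring model.

```python
from collections import defaultdict, deque

# Hypothetical undirected transaction edges between account ids.
sample_edges = [(1, 2), (2, 3), (3, 1), (4, 5), (6, 7), (7, 8)]
known_bad = {3}  # hypothetical set of confirmed fraudulent accounts


def build_adjacency(edges):
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def connected_components(adj):
    """Label each node with the smallest node id in its component."""
    labels = {}
    for start in sorted(adj):
        if start in labels:
            continue
        labels[start] = start
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in labels:
                    labels[nxt] = start
                    queue.append(nxt)
    return labels


def hops_to_bad_actor(adj, known_bad):
    """Multi-source BFS: fewest hops from each node to any known bad account."""
    dist = {b: 0 for b in known_bad if b in adj}
    queue = deque(dist)
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def risk_scores(adj, known_bad):
    """Hypothetical score: 1 / (1 + hops); unreachable accounts score 0.0."""
    dist = hops_to_bad_actor(adj, known_bad)
    return {n: (1.0 / (1 + dist[n]) if n in dist else 0.0) for n in adj}


adj = build_adjacency(sample_edges)
components = connected_components(adj)
scores = risk_scores(adj, known_bad)
print(components)
print(scores)
```

The resulting dictionaries map naturally onto rows of (account_id, component_id, risk_score), which is the tabular shape that would be written back for BI tools.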
Learn More
To dive deeper into these architectures and master the modern data ecosystem, check out the comprehensive books by Alex Merced available in our Books section.