Book name: Graph Databases: New Opportunities for Connected Data
Authors – Ian Robinson, Jim Webber and Emil Eifrem
Publisher – O’Reilly Media
The book consists of seven main chapters. The Foreword, Preface and
Appendix also contain useful information.
The Foreword is written by Emil Eifrem, who is a co-author of the book
as well as CEO and co-founder of Neo Technology. He points out that
graphs are present everywhere and describes the birth of graph
databases. The story of the experiments that led to the graph model is
an interesting read.
The Preface highlights the importance of graph databases in today’s
world and in business. It also briefly covers the origins of graph
theory, how its applications have evolved, and how widely graph
databases are used and can be used.
Chapter 1 is an introduction to graphs, graph theory and graph
databases. It points out that the book is about graph databases and
not about graph theory. A simple definition is given: a graph is a
collection of vertices and edges or, put another way, a set of nodes
and the relationships that connect them. The labeled property graph
model is explained with a Twitter example.
The graph space is divided into two parts: graph databases and graph
compute engines.
Graph databases are characterized by two properties:
The underlying storage – Some graph databases use native graph
storage, whereas others use another type of storage, such as a
relational or object-oriented database.
The processing engine – Some graph databases use native graph
processing, which leverages index-free adjacency, whereas others use
techniques in which nodes do not point to each other directly.
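The difference between the two processing approaches can be sketched
in a few lines of Python. This is only an illustration, not how any
particular database is implemented, and the names Node, connect,
neighbors_native and neighbors_via_index are hypothetical. In the first
approach each node holds direct references to its neighbors (the idea
behind index-free adjacency); in the second, neighbors are found by
consulting a separate, global lookup structure.

```python
class Node:
    """A node that keeps direct references to its neighbors."""

    def __init__(self, name):
        self.name = name
        self.neighbors = []  # direct pointers to adjacent Node objects


def connect(a, b):
    """Create a directed relationship a -> b by storing a reference on a."""
    a.neighbors.append(b)


def neighbors_native(node):
    """Index-free adjacency: follow the pointers stored on the node itself."""
    return [n.name for n in node.neighbors]


def neighbors_via_index(name, edge_index):
    """Non-native alternative: look the node up in a global structure."""
    return list(edge_index.get(name, []))


# Hypothetical sample data.
alice, bob, carol = Node("alice"), Node("bob"), Node("carol")
connect(alice, bob)
connect(alice, carol)

# The same relationships held in a separate, global index.
edge_index = {"alice": ["bob", "carol"]}

print(neighbors_native(alice))
print(neighbors_via_index("alice", edge_index))
```

Both calls return the same neighbors; what differs is where the
adjacency information lives and therefore how it is reached.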
A graph compute engine is a technology that enables global graph
computational algorithms to be run against large datasets. Different
types of graph compute engine exist, such as in-memory/single-machine
engines and distributed engines.
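As a small illustration of what "global" means here, the Python sketch
below computes one figure over an entire graph: the average number of
relationships per node. It is a toy example on made-up data, assuming
an undirected edge list, and the function name average_degree is
hypothetical; a real compute engine would run this kind of whole-graph
calculation over far larger datasets.

```python
def average_degree(edges):
    """Average number of relationships per node in an undirected edge list."""
    degree = {}
    for a, b in edges:
        # Every relationship counts once for each of its two endpoints.
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return sum(degree.values()) / len(degree) if degree else 0.0


# Hypothetical sample graph.
edges = [("alice", "bob"), ("alice", "carol"), ("bob", "carol"), ("carol", "dave")]
print(average_degree(edges))
```

The calculation has to visit every relationship in the graph, which is
what distinguishes it from a query that starts at one node and explores
only its neighborhood.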
The power of graph databases lies in the following characteristics:
Performance
Flexibility
Agility
Chapter 2 discusses different options for storing connected data. It
addresses the limitations of relational and NoSQL databases, mainly
their inability to model the ad hoc, exceptional relationships found
in the real world and the performance problems that come with joins
and denormalization. It then describes how graph databases overcome
these limitations, using a Twitter example.
The following table compares performance for a search query at
different depths in the graph:
Depth | RDBMS execution time (s) | Neo4j execution time (s) | Records returned
2 | 0.016 | 0.01 | ~2,500
3 | 30.267 | 0.168 | ~110,000
4 | 1543.505 | 1.359 | ~600,000
5 | Unfinished | 2.132 | ~800,000
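The depth in the table is the number of hops followed from a starting
person. As a rough illustration only (this is not the benchmark code,
and the tiny social graph and the function name friends_within_depth
are made up for the example), a breadth-first traversal that collects
everyone reachable within a given number of hops could look like this
in Python:

```python
from collections import deque


def friends_within_depth(graph, start, max_depth):
    """Return the set of people reachable from start in at most max_depth hops."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for friend in graph.get(person, []):
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, depth + 1))
    seen.discard(start)  # the starting person is not their own friend
    return seen


# Hypothetical sample data: each person maps to the people they are linked to.
graph = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave", "erin"],
    "dave": ["frank"],
    "erin": [],
    "frank": [],
}
print(sorted(friends_within_depth(graph, "alice", 2)))
```

Going one level deeper only expands the people reached at the previous
level, so the work done depends on the part of the graph actually
visited.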
Chapter 3 is an important chapter of the book, as it discusses data
modeling with graphs. It describes modeling as follows: modeling is an
abstracting activity motivated by a particular need or goal. Graph
representations are no different in this respect. What perhaps
differentiates them from many other data modeling techniques, however,
is the close affinity between the logical and physical models. The
chapter lists the salient features of the labeled property graph,
which is made up of nodes, relationships, properties and labels.
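The four building blocks can be pictured with a small Python sketch.
It is a simplified in-memory picture for illustration, not the storage
format of any database; the class names, the sample users and the
"since" property are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    labels: set                                     # e.g. {"User"}
    properties: dict = field(default_factory=dict)  # key-value pairs


@dataclass
class Relationship:
    start: Node   # relationships are directed: start -> end
    end: Node
    name: str     # relationships are named, e.g. "FOLLOWS"
    properties: dict = field(default_factory=dict)


# Hypothetical sample data.
billy = Node({"User"}, {"name": "Billy"})
ruth = Node({"User"}, {"name": "Ruth"})
follows = Relationship(billy, ruth, "FOLLOWS", {"since": 2013})

print(follows.start.properties["name"], follows.name, follows.end.properties["name"])
```

Nodes carry labels and properties, while a relationship has a name, a
direction (a start node and an end node) and, optionally, properties
of its own.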
The chapter then introduces Cypher, a graph database query language.
It is designed to be easily read and understood by developers,
database professionals, and business stakeholders. Its ease of use
derives from the fact that it is in accord with the way we intuitively
describe graphs using diagrams. Cypher enables a user (or an
application acting on behalf of a user) to ask the database to find
data that matches a specific pattern. Cypher supports the following
clauses for querying the graph database:
MATCH
WHERE
RETURN
CREATE
CREATE UNIQUE
MERGE
DELETE
SET
FOREACH
UNION
WITH
START
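To give a feel for how the first three clauses fit together, the
Cypher query in the comment below asks for the names of everyone a
given user follows. The Python underneath is only an in-memory
imitation of that query over a list of triples, written for
illustration; it is not how Cypher is executed, node labels are left
out for brevity, and the data and the function name followed_by are
hypothetical.

```python
# Cypher query being imitated (illustrative):
#   MATCH (a:User)-[:FOLLOWS]->(b:User)
#   WHERE a.name = 'Billy'
#   RETURN b.name

# Each relationship is stored as (start_name, relationship_name, end_name).
relationships = [
    ("Billy", "FOLLOWS", "Ruth"),
    ("Billy", "FOLLOWS", "Harry"),
    ("Ruth", "FOLLOWS", "Harry"),
    ("Harry", "BLOCKS", "Billy"),
]


def followed_by(name, rels):
    """Names of the users that `name` follows."""
    return [
        end                          # RETURN b.name
        for start, rel, end in rels  # MATCH (a)-[rel]->(b)
        if rel == "FOLLOWS"          # ... restricted to FOLLOWS relationships
        and start == name            # WHERE a.name = <name>
    ]


print(followed_by("Billy", relationships))
```

MATCH describes the shape of the pattern, WHERE narrows it down, and
RETURN picks out which parts of each match to hand back.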
The use of these clauses is demonstrated on a small graph or
subgraph.
A section is dedicated to comparing relational and graph modeling. An
example from the systems management domain is modeled first the
relational way and then the graph way.
Two techniques for testing the graph model are explained. The first,
and simplest, is just to check that the graph reads well. We pick a
start node, and then follow relationships to other nodes, reading each
node’s labels and each relationship’s name as we go. To further
increase our confidence, we also need to consider the queries we’ll
run on the graph. Here we adopt a "design for queryability" mindset.
Both techniques are illustrated with an example.
The next section is about cross-domain models. It notes that business
insight often depends on understanding the hidden network effects at
play in a complex value chain. To generate this understanding, we need
to join domains together without distorting or sacrificing the details
particular to each domain. Property graphs provide a solution here.
Using a property graph, we can model a value chain as a graph of
graphs in which specific relationships connect and distinguish
constituent subdomains.
Common modeling pitfalls are explained in the next section. The loss
of information caused by poor modeling is illustrated with an email
provenance problem. Evolving the domain is also discussed: new facts
and compositions are added as new nodes and relationships rather than
by changing the existing model.
Identifying nodes and relationships is explained in terms of design
for queryability:
Describe the client or end-user
goals that motivate our model.
Rewrite these goals as questions
to ask of our domain.
Identify the entities and the
relationships that appear in these questions.
Translate these entities and
relationships into Cypher path expressions.
Express the questions we want to
ask of our domain as graph patterns using path expressions similar
to the ones we used to model the domain.
By examining the language we use to
describe our domain, we can very quickly identify the core elements
in our graph:
Common nouns become labels: “user”
and “email,” for example, become the labels User and Email.
Verbs that take an object become
relationship names: “sent” and “wrote”, for example, become
SENT and WROTE.
A proper noun – a person or
company’s name, for example – refers to an instance of a thing,
which we model as a node, using one or more properties to capture
that thing’s attributes.
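These naming rules are mechanical enough to sketch in Python. The
helper names below (to_label, to_relationship_name, path_expression)
are hypothetical, and the capitalization conventions are simply the
ones visible in the examples above (User, Email, SENT, WROTE); the
output is a Cypher-style path expression built as a plain string.

```python
def to_label(common_noun):
    """'user' -> 'User': a common noun becomes a label."""
    return common_noun.strip().capitalize()


def to_relationship_name(verb):
    """'sent' -> 'SENT': a verb that takes an object becomes a relationship name."""
    return verb.strip().upper().replace(" ", "_")


def path_expression(subject_noun, verb, object_noun):
    """Build a Cypher-style path expression from a noun-verb-noun phrase."""
    return "(:{})-[:{}]->(:{})".format(
        to_label(subject_noun),
        to_relationship_name(verb),
        to_label(object_noun),
    )


print(path_expression("user", "sent", "email"))
```

For the phrase "user sent email" this produces (:User)-[:SENT]->(:Email).
A proper noun such as a person’s name would not appear in the pattern
itself; it would be stored as a property on a node carrying the User
label.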
In the section Avoiding Anti-Patterns, common mistakes in identifying
nodes and relationships are highlighted. The point is illustrated with
the nouns “google” and “email”, which are now commonly used as
verbs.