Saturday, January 9, 2016

Book KTA (Key Take Away): Graph Databases – NEW OPPORTUNITIES FOR CONNECTED DATA - Part 1

Book name: Graph Databases – NEW OPPORTUNITIES FOR CONNECTED DATA
Authors – Ian Robinson, Jim Webber and Emil Eifrem
Publisher – O’REILLY MEDIA
The book can be downloaded for free from http://neo4j.com/books/

This book consists of seven main chapters. The Foreword, Preface, and Appendix also contain useful information.

The Foreword is written by Emil Eifrem, who is a co-author of this book as well as the CEO and co-founder of Neo Technology. He points out that graphs are present everywhere and describes the birth of graph databases. The story of the experiments that led to the graph model and to graph databases makes for interesting reading.

The Preface highlights the importance of graph databases in today’s world and in business. It also briefly covers graph theory, its applications, and how it evolved, as well as how widely graph databases are used or can be used.

Chapter 1 is an introduction to graphs, graph theory, and graph databases. It also points out that the book is about graph databases and not about graph theory. A simple definition is given: a graph is a collection of vertices and edges or, in other words, a set of nodes and the relationships that connect them. The Labeled Property Graph Model is explained with a Twitter example.
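To make the definition concrete, here is a minimal Python sketch of a labeled property graph. It is only an illustration, not how Neo4j stores data: the class names, the user names, and the FOLLOWS relationships are all made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node with labels (its roles) and key-value properties."""
    labels: set
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A named, directed relationship with a start node and an end node."""
    name: str
    start: Node
    end: Node
    properties: dict = field(default_factory=dict)


# Hypothetical Twitter-style data: users who follow each other.
billy = Node({"User"}, {"name": "Billy"})
harry = Node({"User"}, {"name": "Harry"})
ruth = Node({"User"}, {"name": "Ruth"})

relationships = [
    Relationship("FOLLOWS", billy, harry),
    Relationship("FOLLOWS", harry, billy),
    Relationship("FOLLOWS", ruth, harry),
]


def followers_of(user):
    """Return the sorted names of users whose FOLLOWS relationship ends at `user`."""
    return sorted(
        rel.start.properties["name"]
        for rel in relationships
        if rel.name == "FOLLOWS" and rel.end is user
    )


print(followers_of(harry))
```

Nodes carry labels and properties, while relationships carry a name and a direction, which are the four building blocks of the model.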
The graph space is divided into two parts – graph databases and graph compute engines.
Graph databases are characterized by two properties –
  1. The underlying storage – Some graph databases use native graph storage, whereas others use another type of storage, such as a relational or object-oriented database.
  2. The processing engine – Some graph databases use native graph processing, which leverages index-free adjacency, whereas others use techniques in which nodes do not point to each other directly.
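The difference between the two processing styles can be sketched in Python. This is a simplified illustration under my own assumptions: the Node class, the edge table, and the names are hypothetical, and the linear scan merely stands in for whatever global lookup structure a non-native engine would really use.

```python
class Node:
    """A node that keeps direct references to its adjacent nodes."""

    def __init__(self, name):
        self.name = name
        self.outgoing = []  # direct references to neighbouring Node objects

    def connect(self, other):
        self.outgoing.append(other)


def neighbours_index_free(node):
    """Index-free adjacency: follow the node's own references, no global lookup."""
    return [neighbour.name for neighbour in node.outgoing]


def neighbours_via_global_lookup(name, edge_table):
    """Non-native style: search a global edge table for rows that start at `name`."""
    return [dst for src, dst in edge_table if src == name]


# The same tiny graph in both representations.
alice, bob, carol = Node("Alice"), Node("Bob"), Node("Carol")
alice.connect(bob)
alice.connect(carol)
bob.connect(carol)

edge_table = [("Alice", "Bob"), ("Alice", "Carol"), ("Bob", "Carol")]

print(neighbours_index_free(alice))
print(neighbours_via_global_lookup("Alice", edge_table))
```

Both calls return the same neighbours. The point is that the first one only touches the node it starts from, while the second has to consult a structure that covers the whole graph.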
A graph compute engine is a technology that enables global graph computational algorithms to be run against large datasets. Several types of graph compute engine exist, such as in-memory/single-machine graph compute engines and distributed graph compute engines.
The power of graph databases lies in the following characteristics:
  • Performance
  • Flexibility
  • Agility
Chapter 2 discusses different options for storing connected data. It addresses the limitations of relational and NoSQL databases, mainly their inability to model the ad hoc, exceptional relationships found in the real world and the performance issues caused by joins and denormalization. It also describes how graph databases overcome these limitations, using a Twitter example.
The following table compares the execution time of a search query at different depths in the graph –
Depth   RDBMS execution time (s)   Neo4j execution time (s)   Records returned
2       0.016                      0.01                       ~2,500
3       30.267                     0.168                      ~110,000
4       1543.505                   1.359                      ~600,000
5       Unfinished                 2.132                      ~800,000
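To see why the number of records grows so quickly with depth, here is a small Python sketch of a depth-limited traversal. It is not the benchmark query from the book: the function name and the friendship data are invented for illustration.

```python
def reachable_within(friends, start, max_depth):
    """Return everyone reachable from `start` in at most `max_depth` hops."""
    seen = {start}
    frontier = [start]
    for _ in range(max_depth):
        next_frontier = []
        for person in frontier:
            for friend in friends.get(person, []):
                if friend not in seen:
                    seen.add(friend)
                    next_frontier.append(friend)
        frontier = next_frontier
    return seen - {start}


# Hypothetical friendship data, stored as an adjacency list.
friends = {
    "Alice": ["Bob", "Carol"],
    "Bob": ["Dave"],
    "Carol": ["Dave", "Erin"],
    "Dave": ["Frank"],
    "Erin": [],
    "Frank": [],
}

for depth in (1, 2, 3):
    print(depth, len(reachable_within(friends, "Alice", depth)))
```

Each extra level of depth expands the frontier by following relationships from the nodes found so far, so the work depends on the part of the graph that is actually visited.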

Chapter 3 is an important chapter in this book, as it discusses data modeling with graphs. It explains modeling as – Modeling is an abstracting activity motivated by a particular need or goal. Graph representations are no different in this respect. What perhaps differentiates them from many other data modeling techniques, however, is the close affinity between the logical and physical models. It lists the salient features of the labeled property graph, which is made up of nodes, relationships, properties, and labels.
The rest of the chapter is dedicated to the graph database query language Cypher. It is designed to be easily read and understood by developers, database professionals, and business stakeholders. Its ease of use derives from the fact that it is in accord with the way we intuitively describe graphs using diagrams. Cypher enables a user (or an application acting on behalf of a user) to ask the database to find data that matches a specific pattern. Cypher supports the following clauses for querying the graph database:
  • MATCH
  • WHERE
  • RETURN
  • CREATE
  • CREATE UNIQUE
  • MERGE
  • DELETE
  • SET
  • FOREACH
  • UNION
  • WITH
  • START
The use of these clauses is demonstrated on a small example graph or subgraph.
A section is dedicated to comparing relational and graph modeling. An example from the systems management domain is modeled both the relational way and the graph way.
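As a rough idea of what MATCH, WHERE and RETURN do together, the Python sketch below imitates a simple pattern match over an in-memory list of relationships. It is only an analogy, not how Cypher is executed, and the data and the helper name are my own.

```python
# The Cypher query being imitated (illustrative, run against hypothetical data):
#   MATCH (a)-[:FOLLOWS]->(b)
#   WHERE a.name = 'Billy'
#   RETURN b.name

# Each tuple is (start node name, relationship name, end node name).
edges = [
    ("Billy", "FOLLOWS", "Harry"),
    ("Harry", "FOLLOWS", "Billy"),
    ("Ruth", "FOLLOWS", "Harry"),
    ("Billy", "FOLLOWS", "Ruth"),
]


def match(edges, rel_name):
    """MATCH: yield every (a, b) pair connected by a relationship named rel_name."""
    for start, name, end in edges:
        if name == rel_name:
            yield start, end


# WHERE filters the matched pairs; RETURN projects the part we are interested in.
followed_by_billy = [b for a, b in match(edges, "FOLLOWS") if a == "Billy"]

print(followed_by_billy)
```

The pattern describes the shape of the data we are looking for, and the filter and projection narrow the matches down to the answer.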
Two techniques for testing the graph model are explained. The first, and simplest, is just to check that the graph reads well. We pick a start node, and then follow relationships to other nodes, reading each node’s labels and each relationship’s name as we go. To further increase our confidence, we also need to consider the queries we’ll run on the graph. Here we adopt a design for queryability mindset. Both techniques are illustrated with an example.
The next section is about cross-domain models. It notes that business insight often depends on us understanding the hidden network effects at play in a complex value chain. To generate this understanding, we need to join domains together without distorting or sacrificing the details particular to each domain. Property graphs provide a solution here. Using a property graph, we can model a value chain as a graph of graphs in which specific relationships connect and distinguish constituent subdomains.
Common modeling pitfalls are explained in the next section. Loss of information due to bad modeling is illustrated with the email provenance problem. Evolving the domain is also discussed: new facts and compositions are added as new nodes and relationships rather than by changing the existing model.
Identifying nodes and relationships is explained as design for queryability:
  1. Describe the client or end-user goals that motivate our model.
  2. Rewrite these goals as questions to ask of our domain.
  3. Identify the entities and the relationships that appear in these questions.
  4. Translate these entities and relationships into Cypher path expressions.
  5. Express the questions we want to ask of our domain as graph patterns using path expressions similar to the ones we used to model the domain.
By examining the language we use to describe our domain, we can very quickly identify the core elements in our graph:
  • Common nouns become labels: “user” and “email,” for example, become the labels User and Email.
  • Verbs that take an object become relationship names: “sent” and “wrote,” for example, become SENT and WROTE.
  • A proper noun – a person or company’s name, for example – refers to an instance of a thing, which we model as a node, using one or more properties to capture that thing’s attributes.
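The naming rules above can be sketched as a few Python helpers that turn a plain-language statement into a Cypher-style path expression. The function names and the formatting choices (CamelCase labels, underscores in relationship names) are my own assumptions, not something the book prescribes.

```python
def to_label(common_noun):
    """Turn a common noun into a label: 'user' -> 'User'."""
    return "".join(word.capitalize() for word in common_noun.split())


def to_relationship_name(verb):
    """Turn a verb into a relationship name: 'sent' -> 'SENT'."""
    return "_".join(verb.upper().split())


def to_path_expression(subject, verb, obj):
    """Combine a subject noun, a verb and an object noun into a path expression."""
    return (
        f"(:{to_label(subject)})"
        f"-[:{to_relationship_name(verb)}]->"
        f"(:{to_label(obj)})"
    )


print(to_path_expression("user", "sent", "email"))
```

For the statement "a user sent an email" this prints (:User)-[:SENT]->(:Email), which is the kind of path expression used in steps 4 and 5 of the list above.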
The section Avoiding Anti-Patterns highlights common mistakes in identifying nodes and relationships. It is explained with the nouns “google” and “email,” which are commonly used as verbs nowadays.
