Saturday, January 9, 2016

Book KTA (Key Take Away): Graph Databases – NEW OPPORTUNITIES FOR CONNECTED DATA - Part 3

Book name: Graph Databases – NEW OPPORTUNITIES FOR CONNECTED DATA
Authors – Ian Robinson, Jim Webber and Emil Eifrem
Publisher – O’REILLY MEDIA

Book can be downloaded for free from here - http://neo4j.com/books/

Chapter 6 is about Graph Database Internals, which discusses the implementation of graph databases. It covers the most common architectures, using the Neo4j graph database architecture for discussion.
Native Graph Processing – A database engine that utilizes index-free adjacency is one in which each node maintains direct references to its adjacent nodes. Each node therefore acts as a micro-index of other nearby nodes, which is much cheaper than using global indexes. A nonnative graph database engine, in contrast, uses (global) indexes to link nodes together. There is also a good explanation of how index-free adjacency leads to low-cost “joins”.
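To make the contrast concrete, here is a minimal Python sketch of the index-free adjacency idea (an illustration only, not Neo4j's implementation): each node object holds direct references to its neighbors, so a traversal follows pointers instead of consulting a global index.

```python
class Node:
    """A node that stores direct references to its neighbors (index-free adjacency)."""
    def __init__(self, name):
        self.name = name
        self.neighbors = []  # direct references act as a micro-index

    def connect(self, other):
        # Add an undirected edge between the two nodes.
        self.neighbors.append(other)
        other.neighbors.append(self)

# Build a tiny graph: alice -- bob -- carol
alice, bob, carol = Node("alice"), Node("bob"), Node("carol")
alice.connect(bob)
bob.connect(carol)

# Traversing from alice to her friends-of-friends follows pointers directly,
# with no global index lookup per hop.
friends_of_friends = {n.name for f in alice.neighbors for n in f.neighbors} - {"alice"}
print(friends_of_friends)  # {'carol'}
```

A nonnative engine would instead perform an index lookup for every hop, which grows more expensive as the traversal deepens.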
Native Graph Storage – Neo4j stores graph data in a number of different store files. Each store file contains the data for a specific part of the graph (e.g., there are separate stores for nodes, relationships, labels, and properties). The chapter then explains the Neo4j node and relationship store file record structures in detail.
Programmatic APIs – The following APIs are discussed:
  • Kernel API: This allows user code to listen to transactions as they flow through the kernel, and thereafter to react (or not) based on the data content and lifecycle stage of the transaction.
  • Core API: This is an imperative Java API that exposes the graph primitives of nodes, relationships, properties, and labels to the user. When used for reads, the API is lazily evaluated, meaning that relationships are only traversed as and when the calling code demands the next node.
  • Traversal Framework: A declarative Java API which enables the user to specify a set of constraints that limit the parts of the graph the traversal is allowed to visit.
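The lazy-read behavior described for the Core API can be illustrated in plain Python (a sketch of the general idea using a generator, not the actual Neo4j Java API): neighbors are only visited when the calling code demands the next node.

```python
def traverse(adjacency, start):
    """Lazily yield nodes reachable from start, breadth-first.
    Nothing beyond the yielded node has been visited until the caller
    asks for the next one."""
    seen = {start}
    frontier = [start]
    while frontier:
        node = frontier.pop(0)
        yield node
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append(neighbor)

# A made-up adjacency map for illustration.
adjacency = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
walk = traverse(adjacency, "a")
print(next(walk))  # a
print(next(walk))  # b
```

Because `traverse` is a generator, stopping after two calls to `next` means the rest of the graph is never touched, which is the essence of lazy evaluation.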
In the next section, the following nonfunctional characteristics are discussed in detail:
  • Transactions (How transactions are implemented in Neo4j)
  • Recoverability
  • Availability (Replication)
  • Scale – Capacity (graph size), Latency (response time), Read and Write throughput.
Chapter 7 is Predictive Analysis with Graph Theory, which examines some analytical techniques and algorithms for processing graph data.
The following search/path-finding algorithms are explained in brief:
  1. Depth- and Breadth- First Search
  2. Path-Finding with Dijkstra’s Algorithm
  3. The A* (A-star) Algorithm
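As a sketch of how one of these algorithms works, here is a minimal Dijkstra's shortest-path implementation in Python (the graph and its weights are invented for illustration):

```python
import heapq

def dijkstra(graph, source):
    """Return the cost of the cheapest path from source to every reachable node.
    graph: dict mapping node -> list of (neighbor, weight) pairs."""
    costs = {source: 0}
    queue = [(0, source)]  # priority queue ordered by path cost
    while queue:
        cost, node = heapq.heappop(queue)
        if cost > costs.get(node, float("inf")):
            continue  # stale queue entry, a cheaper path was already found
        for neighbor, weight in graph.get(node, []):
            new_cost = cost + weight
            if new_cost < costs.get(neighbor, float("inf")):
                costs[neighbor] = new_cost
                heapq.heappush(queue, (new_cost, neighbor))
    return costs

graph = {
    "A": [("B", 1), ("C", 4)],
    "B": [("C", 2), ("D", 5)],
    "C": [("D", 1)],
    "D": [],
}
print(dijkstra(graph, "A"))  # {'A': 0, 'B': 1, 'C': 3, 'D': 4}
```

A* extends this by adding a heuristic estimate of the remaining distance to the priority used in the queue, steering the search toward the goal.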
In the next section, Graph Theory and Predictive Modeling is explained with the following points:
  1. Triadic Closures – A triadic closure is a common property of social graphs, where we observe that if two nodes are connected via a path involving a third node, there is an increased likelihood that the two nodes will become directly connected at some point in the future.
  2. Structural Balance – the balance of positive and negative relationships between the nodes of a graph.
  3. Local Bridges – a connection that links two otherwise unconnected sub-graphs.
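Triadic-closure candidates can be found mechanically: look for pairs of nodes that share a common neighbor but are not yet directly connected. A small Python sketch, using a made-up friendship graph:

```python
from itertools import combinations

def closure_candidates(adjacency):
    """Return pairs of nodes that share a neighbor but are not directly
    connected, i.e. likely future connections by triadic closure."""
    candidates = set()
    for node, neighbors in adjacency.items():
        for a, b in combinations(sorted(neighbors), 2):
            if b not in adjacency[a]:
                candidates.add((a, b))
    return candidates

# alice knows bob and carol, but bob and carol do not yet know each other.
adjacency = {
    "alice": {"bob", "carol"},
    "bob": {"alice"},
    "carol": {"alice"},
}
print(closure_candidates(adjacency))  # {('bob', 'carol')}
```

This is the basic mechanism behind friend-suggestion features: the open triangle (bob, carol) is predicted to close over time.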
The book also has an Appendix that gives a NOSQL overview. Readers new to NOSQL should read this overview first for a better understanding of the book.

Book KTA (Key Take Away): Graph Databases – NEW OPPORTUNITIES FOR CONNECTED DATA - Part 2

Book name: Graph Databases – NEW OPPORTUNITIES FOR CONNECTED DATA
Authors – Ian Robinson, Jim Webber and Emil Eifrem
Publisher – O’REILLY MEDIA

Book can be downloaded for free from here - http://neo4j.com/books/

Chapter 4 is Building a Graph Database Application, which is a core chapter of this book. It discusses some of the practical issues of working with a graph database. It also mentions, from experience, that graph database applications are highly amenable to being developed using the evolutionary, incremental, and iterative software development practices in widespread use today.
It starts with data modeling, which was discussed in detail in Chapter 3 and is demonstrated here with a different example. An interesting concept explained here is Timeline Trees, which can be built if we need to find all the events that have occurred over a specific period.
Versioning is explained as follows – a versioned graph enables us to recover the state of the graph at a particular point in time. A versioning scheme is possible in graph models in which nodes and relationships are timestamped and archived whenever they are modified. The downside of such versioning schemes is that they leak into any queries written against the graph, adding a layer of complexity to even the simplest query.
The next section is about Application Architecture, which discusses embedded mode and server mode architectures.
In embedded mode, Neo4j runs in the same process as the application. Embedded mode is ideal for hardware devices, desktop applications, and for incorporating in application servers. Some of the advantages of embedded mode are –
  • Low latency
  • Choice of APIs
  • Explicit transactions
  • JVM only
  • GC behaviors
  • Database lifecycle
Server mode is the most common means of deploying the database today. At the heart of each server is an embedded instance of Neo4j. Some of the characteristics and trade-offs of server mode are –
  • REST API
  • Platform independence
  • Scaling independence
  • Isolation from application GC behaviors
  • Network overhead
  • Transaction state
The following strategies for clustering Neo4j are considered:
  1. Replication – Writes are done on master as well as one or more slaves.
  2. Buffer writes using queues – In high write load scenarios, writes to the cluster can be buffered in a queue.
  3. Global clusters – In Neo4j, it is possible to install a multi-region cluster in multiple data centers and on cloud platforms. A multi-region cluster enables us to service reads from the portion of the cluster geographically closest to the client.
Load balancing needs to be considered to maximize throughput and reduce latency when using a clustered graph database. The following options can be considered:
  • Separate read traffic from write traffic
  • Cache sharding
  • Read your own writes
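The cache sharding option can be sketched in a few lines of Python (a simplified illustration, not a real load balancer): requests about the same entity are consistently routed to the same instance, so each instance's page cache stays warm for its own portion of the graph.

```python
import hashlib

def route(user_id, instances):
    """Consistently route a request for a given user to the same instance,
    keeping that instance's cache warm for that user's part of the graph."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return instances[int(digest, 16) % len(instances)]

# Hypothetical cluster instance names, for illustration only.
instances = ["neo4j-1", "neo4j-2", "neo4j-3"]
target = route(42, instances)
print(target == route(42, instances))  # True: routing is sticky
```

A production setup would typically use consistent hashing so that adding or removing an instance does not reshuffle every route, but the sticky-routing principle is the same.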
The next section is dedicated to testing the graph database application. The following techniques are discussed:
  • Test-Driven Data Model Development
  • Performance Testing - Query performance tests, Application performance tests.
In the Capacity Planning section, planning for production deployment is discussed. It describes that estimating production needs depends on different factors, such as graph size, query performance, and the number of expected users and their behaviors. The following optimization criteria are also discussed, based on business needs:
  • Cost
  • Performance
  • Redundancy
  • Load
The performance criterion is discussed in detail with the following points –
  1. Calculating the cost of graph database performance – this depends on the database stack and on the size of the graph, i.e., whether it fits in memory or not, since this impacts hardware selection
  2. Performance optimization options:
  • Increase the JVM heap size
  • Increase the percentage of the store mapped into the page caches
  • Invest in faster disks – SSDs or enterprise flash hardware.
The Redundancy section explains that planning for redundancy requires determining how many instances in a cluster we can afford to lose while keeping the application up and running.
Optimizing for load is mentioned as the trickiest part of capacity planning. A rule of thumb is given as –
Number of concurrent requests = (1000 / average request time (in milliseconds)) * number of cores per machine * number of machines
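Plugging illustrative numbers into this rule of thumb (the figures below are invented for the example, not from the book):

```python
def concurrent_request_capacity(avg_request_ms, cores_per_machine, machines):
    """Rule-of-thumb cluster throughput: requests each core can serve per
    second (1000 / average request time in ms), times total cores."""
    return (1000 / avg_request_ms) * cores_per_machine * machines

# e.g. 20 ms average request time, 8 cores per machine, 3 machines
print(concurrent_request_capacity(20, 8, 3))  # 1200.0
```

So under these assumed figures, the cluster could sustain roughly 1,200 concurrent requests per second before latency starts to degrade.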
The next section is about Importing and Bulk Loading Data, which explains different tools and commands for the initial import of data and for batch loading from legacy databases or external systems.

Chapter 5 is about Graphs in the Real World. It starts with the reasons organizations choose graph databases, as follows:
  • “Minutes to milliseconds” performance
  • Drastically accelerated development cycles
  • Extreme business responsiveness
  • Enterprise ready
Then, some common use cases are discussed in detail. It is mentioned that these use cases are taken from real-world production systems.
  1. Social – understanding people's behavior based on their connections, e.g., Facebook.
  2. Recommendations – Understanding connections between people and things.
  3. Geo – Finding best route.
  4. Master Data Management – identifying hierarchies, master data metadata, and master data models, and facilitating their modeling, storage, and querying.
  5. Network and Data Center Management – a graph representation of the network that enables cataloging assets, visualizing how they are deployed, and identifying the dependencies between them.
  6. Authorization and Access Control (Communications) – Identifying relation/connection between parties (users) and resources (files, products, services, network devices, etc).
In the next section, the following three real-world examples are explained in detail –
  1. Social Recommendations (Professional Social Network) – explained with a LinkedIn-like site.
  2. Authorization and Access Control – explained with an international communications service company that sells communication products and services to its customers.
  3. Geospatial and Logistics – explained with a courier service example.

Book KTA (Key Take Away): Graph Databases – NEW OPPORTUNITIES FOR CONNECTED DATA - Part 1

Book name: Graph Databases – NEW OPPORTUNITIES FOR CONNECTED DATA
Authors – Ian Robinson, Jim Webber and Emil Eifrem
Publisher – O’REILLY MEDIA
Book can be downloaded from here for free - http://neo4j.com/books/

This book consists of seven main chapters. There is also good information in the Foreword, Preface, and Appendix.

The Foreword is written by Emil Eifrem, who is an author of this book as well as CEO and co-founder of Neo Technology. He mentions the presence of different graphs everywhere and describes the birth of graph databases. It is interesting to read the story of the experiments in graph model development that led to the birth of graph databases.

The Preface highlights the importance of graph databases in today's world and business. It also briefly covers the existence and application of graph theory and how it evolved, as well as how widely graph databases are used and can be used.

Chapter 1 is an introduction to graphs, graph theory, and graph databases. It also points out that the book is about graph databases, not graph theory. A simple definition is given: a graph is a collection of vertices and edges, or a set of nodes and the relationships that connect them. The Labeled Property Graph Model is explained with a Twitter example.
The graph space is divided into two parts – graph databases and graph compute engines.
Graph databases are characterized by two properties, as below –
  1. The underlying storage – Some graph databases use native graph storage, whereas others use another type of storage, such as a relational or object-oriented database.
  2. The processing engine – Some graph databases use native graph processing, which leverages index-free adjacency, whereas others use techniques in which nodes do not point directly to each other.
A graph compute engine is a technology that enables global graph computational algorithms to be run against large datasets. A variety of different types of graph compute engines exist, such as in-memory/single-machine graph compute engines and distributed graph compute engines.
The power of a graph database lies in the following characteristics:
  • Performance
  • Flexibility
  • Agility
Chapter 2 discusses different options for storing data. It addresses the limitations of relational and NoSQL databases, mainly highlighting their inability to model the ad hoc, exceptional relationships of the real world, and the performance issues caused by joins and denormalization. It also describes how graph databases overcome these limitations, using a Twitter example.
The following table shows a performance comparison for a search query at different depths in the graph –
Depth | RDBMS execution time (s) | Neo4j execution time (s) | Records returned
------|--------------------------|--------------------------|-----------------
  2   | 0.016                    | 0.01                     | ~2,500
  3   | 30.267                   | 0.168                    | ~110,000
  4   | 1543.505                 | 1.359                    | ~600,000
  5   | Unfinished               | 2.132                    | ~800,000

Chapter 3 is an important chapter in this book, as it discusses Data Modeling with Graphs. It explains modeling as an abstracting activity motivated by a particular need or goal. Graph representations are no different in this respect. What perhaps differentiates them from many other data modeling techniques, however, is the close affinity between the logical and physical models. It lists the salient features of the labeled property graph, which is made up of nodes, relationships, properties, and labels.
The rest of the chapter is dedicated to the graph database query language Cypher. It is designed to be easily read and understood by developers, database professionals, and business stakeholders. Its ease of use derives from the fact that it is in accord with the way we intuitively describe graphs using diagrams. Cypher enables a user (or an application acting on behalf of a user) to ask the database to find data that matches a specific pattern. Cypher supports the following clauses to query the graph database:
  • MATCH
  • WHERE
  • RETURN
  • CREATE
  • CREATE UNIQUE
  • MERGE
  • DELETE
  • SET
  • FOREACH
  • UNION
  • WITH
  • START
The use of these clauses is demonstrated with examples on a small graph or sub-graph.
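As an illustration of how a few of these clauses combine, here is a made-up friend-of-friend query (not one of the book's examples; the `User` label and `FRIEND` relationship are assumptions for illustration):

```cypher
// Find people Alice's friends know, whom Alice does not yet know herself
MATCH (alice:User {name: 'Alice'})-[:FRIEND]->(friend)-[:FRIEND]->(fof)
WHERE NOT (alice)-[:FRIEND]->(fof) AND fof <> alice
RETURN DISTINCT fof.name
```

The MATCH clause describes the graph pattern diagrammatically, WHERE filters the matches, and RETURN projects the result, which is what makes Cypher queries read much like the diagrams they describe.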
A section is dedicated to comparing relational and graph modeling. An example from the systems management domain is considered, modeled first in a relational way and then in a graph way.
Testing the Graph Model is explained with two techniques. The first, and simplest, is just to check that the graph reads well: we pick a start node, and then follow relationships to other nodes, reading each node's labels and each relationship's name as we go. To further increase our confidence, we also need to consider the queries we'll run on the graph; here we adopt a design-for-queryability mindset. Both techniques are explained with an example.
Next section is about Cross-Domain Models, which mentions, business insight often depends on us understanding the hidden network effects at play in a complex value chain. To generate this understanding, we need to join domains together without distorting or sacrificing the details particular to each domain. Property graphs provide a solution here. Using a property graph, we can model a value chain as a graph of graphs in which specific relationships connect and distinguish constituent subdomains.
Common modeling pitfalls are explained in the next section. Loss of information due to bad modeling is illustrated with the email provenance problem. Evolving a domain is also discussed, favoring the addition of new facts and compositions as new nodes and relationships over changing the existing model.
Identifying Nodes and Relationships is explained as a design-for-queryability process:
  1. Describe the client or end-user goals that motivate our model.
  2. Rewrite these goals as questions to ask of our domain.
  3. Identify the entities and the relationships that appear in these questions.
  4. Translate these entities and relationships into Cypher path expressions.
  5. Express the questions we want to ask of our domain as graph patterns using path expressions similar to the ones we used to model the domain.
By examining the language we use to describe our domain, we can very quickly identify the core elements in our graph:
  • Common nouns become labels: “user” and “email,” for example, become the labels User and Email.
  • Verbs that take an object become relationship names: “sent” and “wrote,” for example, become SENT and WROTE.
  • A proper noun – a person or company’s name, for example – refers to an instance of a thing, which we model as a node, using one or more properties to capture that thing’s attributes.
In the section Avoiding Anti-Patterns, common mistakes in identifying nodes and relationships are highlighted. This is explained with the example of the nouns “google” and “email”, which are commonly used as verbs nowadays.