Big Data - Turning Point: Book KTA (Key Take Away): Graph Databases – NEW OPPORTUNITIES FOR CONNECTED DATA

Book name: Graph Databases – NEW OPPORTUNITIES FOR CONNECTED DATA

Authors – Ian Robinson, Jim Webber and Emil Eifrem

Publisher – O’REILLY MEDIA

The book can be downloaded for free from http://neo4j.com/books/

Chapter 4, Building a Graph Database Application, is the core chapter of the book. It discusses some of the practical issues of working with a graph database. The authors also note from experience that graph database applications are highly amenable to the evolutionary, incremental, and iterative software development practices in widespread use today.

It starts with Data Modeling, which was discussed in detail in Chapter 3 and is demonstrated here with a different example. An interesting concept explained here is the timeline tree, which can be built when we need to find all the events that occurred over a specific period.
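The timeline tree idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a nested dictionary (year, then month, then day) stands in for the tree of nodes and relationships that a graph database would hold, and the names `timeline`, `add_event`, and `events_between` are hypothetical, not taken from the book.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical stand-in for a timeline tree: year -> month -> day -> events.
# In a graph database each level would be a node linked by relationships.
timeline = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))


def add_event(day, name):
    """Attach an event to its day, creating year/month/day branches on demand."""
    timeline[day.year][day.month][day.day].append(name)


def events_between(start, end):
    """Walk the tree one day at a time and collect events in [start, end]."""
    found = []
    current = start
    while current <= end:
        # .get() reads without creating empty branches in the defaultdicts.
        days = timeline.get(current.year, {}).get(current.month, {})
        found.extend(days.get(current.day, []))
        current += timedelta(days=1)
    return found


add_event(date(2016, 1, 8), "order placed")
add_event(date(2016, 1, 9), "order shipped")
add_event(date(2016, 2, 1), "order delivered")

print(events_between(date(2016, 1, 1), date(2016, 1, 31)))
# ['order placed', 'order shipped']
```

Because events hang off day-level entries, a query for a period only visits the branches for that period instead of scanning every event.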

Versioning is explained next: a versioned graph lets us recover the state of the graph at a particular point in time. Graph models can support a versioning scheme in which nodes and relationships are timestamped and archived whenever they are modified. The downside of such schemes is that they leak into any query written against the graph, adding a layer of complexity to even the simplest query.
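A small sketch may make the "leak" concrete. It assumes a simplified model in which each property value carries a validity interval of logical timestamps; the `Version` class and the `write` and `read` helpers are hypothetical and are not the book's scheme.

```python
from dataclasses import dataclass

FOREVER = float("inf")  # marks a version that is still current


@dataclass
class Version:
    key: str
    value: str
    valid_from: int          # hypothetical logical timestamp
    valid_to: float = FOREVER


history = []  # every version ever written, current and archived


def write(key, value, now):
    """Archive the current version of key (if any), then append the new one."""
    for version in history:
        if version.key == key and version.valid_to == FOREVER:
            version.valid_to = now
    history.append(Version(key, value, now))


def read(key, at):
    """Even a simple lookup has to carry the time filter."""
    for version in history:
        if version.key == key and version.valid_from <= at < version.valid_to:
            return version.value
    return None


write("alice.city", "London", now=1)
write("alice.city", "Paris", now=5)

print(read("alice.city", at=3))  # London
print(read("alice.city", at=7))  # Paris
```

Note that `read` cannot simply return the value for a key: the `valid_from <= at < valid_to` condition has to appear in every query, which is the added complexity described above.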

The next section is about Application Architecture, and discusses the embedded mode and server mode architectures.

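In the embedded mode architecture, Neo4j runs in the same process as the application. Embedded mode is ideal for hardware devices, desktop applications, and for incorporating in application servers. The points to weigh for embedded mode are listed below; the first three are advantages, and the last three (JVM only, GC behaviors, database lifecycle) are things to bear in mind –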

Low latency
Choice of APIs
Explicit transactions
JVM only
GC behaviors
Database lifecycle

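Server mode is the most common means of deploying the database today. At the heart of each server is an embedded instance of Neo4j. The points to weigh for server mode are listed below; the first four are benefits, and the last two (network overhead, transaction state) are considerations –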

REST API
Platform independence
Scaling independence
Isolation from application GC behaviors
Network overhead
Transaction state

The following strategies can be considered when clustering Neo4j:

Replication – Writes are applied on the master as well as on one or more slaves.
Buffer writes using queues – In high write load scenarios, writes to the cluster can be buffered in a queue.
Global clusters – In Neo4j, it is possible to install a multi-region cluster in multiple data centers and on cloud platforms. A multi-region cluster enables us to service reads from the portion of the cluster geographically closest to the client.
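The "buffer writes using queues" strategy above can be sketched as follows. This is a minimal illustration, assuming an in-process `queue.Queue` in place of a durable message queue and a plain list (`applied`) in place of the database; `submit` and `drain` are hypothetical names.

```python
import queue

write_buffer = queue.Queue()  # assumption: stands in for a durable message queue
applied = []                  # assumption: stands in for the graph database


def submit(write):
    """Clients enqueue writes instead of sending them straight to the cluster."""
    write_buffer.put(write)


def drain(batch_size):
    """Take up to batch_size writes off the queue and apply them as one batch."""
    batch = []
    while len(batch) < batch_size:
        try:
            batch.append(write_buffer.get_nowait())
        except queue.Empty:
            break
    applied.extend(batch)  # in a real system: one transaction per batch
    return len(batch)


for i in range(5):
    submit({"create_node": i})

print(drain(batch_size=3))  # 3
print(drain(batch_size=3))  # 2
```

The queue absorbs bursts of writes, and the worker that calls `drain` controls how fast and in what batch size they reach the database.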

Load balancing needs to be considered to maximize throughput and reduce latency when using a clustered graph database. The following options can be considered:

Separate read traffic from write traffic
Cache sharding
Read your own writes
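The first two options can be combined in a simple routing function. The sketch below is an assumption-laden illustration: the instance names are made up, reads are sent to slaves only for simplicity, and a stable hash of the user id is used as the sharding key. It does not address the "read your own writes" option.

```python
import hashlib

# Hypothetical instance names; a real deployment would sit behind a load balancer.
MASTER = "master"
SLAVES = ["slave-1", "slave-2", "slave-3"]


def route(request_type, user_id):
    """Send writes to the master and shard reads across slaves by user id."""
    if request_type == "write":
        return MASTER
    # A stable hash keeps each user on the same slave, so that slave's
    # cache stays warm for that user's part of the graph.
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return SLAVES[int(digest, 16) % len(SLAVES)]


print(route("write", "alice"))  # master
print(route("read", "alice") == route("read", "alice"))  # True
```

Python's built-in `hash()` is deliberately avoided here because it is randomized per process for strings, which would send the same user to different instances after a restart.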

The next section is dedicated to testing the graph database application. The following techniques are discussed:

Test-Driven Data Model Development
Performance Testing – query performance tests and application performance tests.
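As a rough illustration of test-driven data model development, the sketch below builds a small sample graph and asserts on the result of a query against it. Everything here is hypothetical: the sample graph is a plain dictionary, and `friends_of_friends` is a stand-in for a real database query.

```python
# Hypothetical sample graph: person -> set of direct friends.
SAMPLE_GRAPH = {
    "alice": {"bob"},
    "bob": {"alice", "carol"},
    "carol": {"bob"},
}


def friends_of_friends(graph, person):
    """Return people reachable in two hops who are not already direct friends."""
    direct = graph.get(person, set())
    two_hops = set()
    for friend in direct:
        two_hops |= graph.get(friend, set())
    return two_hops - direct - {person}


def test_finds_two_hop_neighbours():
    assert friends_of_friends(SAMPLE_GRAPH, "alice") == {"carol"}


def test_unknown_person_has_no_suggestions():
    assert friends_of_friends(SAMPLE_GRAPH, "dave") == set()


# Run the tests directly; a test runner such as pytest would discover them too.
test_finds_two_hop_neighbours()
test_unknown_person_has_no_suggestions()
print("all tests passed")
```

Keeping the sample graph small makes each expected result easy to verify by hand, so a failing test points at the model or the query rather than at the test data.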

The Capacity Planning section discusses planning for production deployment. Estimating production needs depends on several factors, such as graph size, query performance, the number of expected users, and their behaviors. The following optimization criteria are also discussed, to be prioritized according to business needs:

Cost
Performance
Redundancy
Load

The performance criterion is discussed in detail, with the following points –

Calculating the cost of graph database performance – this depends on the database stack and on the size of the graph, in particular whether it fits in memory, since this affects hardware selection.
Performance optimization options:

Increase the JVM heap size
Increase the percentage of the store mapped into the page cache
Invest in faster disks – SSDs or enterprise flash hardware.

The Redundancy section explains that planning for redundancy requires determining how many instances in a cluster we can afford to lose while keeping the application up and running.

Optimizing for load is described as the trickiest part of capacity planning. The following rule of thumb is given –

Number of concurrent requests = (1000 / average request time (in milliseconds)) * number of cores per machine * number of machines
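The rule of thumb is easy to turn into a small calculator. The figures used below (50 ms average request time, 8 cores, 3 machines, a target of 1000 concurrent requests) are made-up inputs for illustration, and `machines_needed` is simply the same formula solved for the number of machines.

```python
import math


def concurrent_requests(avg_request_ms, cores_per_machine, machines):
    """Rule of thumb: (1000 / avg request time in ms) * cores * machines."""
    return (1000 / avg_request_ms) * cores_per_machine * machines


def machines_needed(target_requests, avg_request_ms, cores_per_machine):
    """Invert the rule of thumb and round up to whole machines."""
    per_machine = (1000 / avg_request_ms) * cores_per_machine
    return math.ceil(target_requests / per_machine)


print(concurrent_requests(avg_request_ms=50, cores_per_machine=8, machines=3))
# 480.0
print(machines_needed(target_requests=1000, avg_request_ms=50, cores_per_machine=8))
# 7
```

With a 50 ms average request, each core handles about 20 requests per second, so one 8-core machine handles about 160 and three machines about 480.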

The next section is about Importing and Bulk Loading Data. It explains the tools and commands available for the initial data import and for batch loading from legacy databases or external systems.
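The general shape of batch loading can be sketched without any particular tool. The example below assumes a tiny CSV export held in a string (`LEGACY_CSV`) and a hypothetical `batches` helper; it shows only the batching step, not the book's import tools or commands.

```python
import csv
import io

# Hypothetical export from a legacy system.
LEGACY_CSV = """id,name
1,Alice
2,Bob
3,Carol
"""


def batches(rows, size):
    """Yield lists of at most `size` rows so each batch can be loaded together."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


reader = csv.DictReader(io.StringIO(LEGACY_CSV))
for batch in batches(reader, size=2):
    print([row["name"] for row in batch])
# ['Alice', 'Bob']
# ['Carol']
```

Loading in batches keeps memory use bounded and lets each batch be committed separately, at the cost of having to handle a failure part-way through the import.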

Chapter 5 is about Graphs in the Real World. It starts with the reasons organizations choose graph databases:

“Minutes to milliseconds” performance
Drastically accelerated development cycles
Extreme business responsiveness
Enterprise ready

Then some common use cases are discussed in detail. The book notes that these use cases are taken from real-world production systems.

Social – Understanding people's behavior based on their connections, e.g. Facebook.
Recommendations – Understanding connections between people and things.
Geo – Finding the best route.
Master Data Management – Identifying hierarchies, master data metadata, and master data models, and facilitating modeling, storing, and querying.
Network and Data Center Management – A graph representation of a network makes it possible to catalog assets, visualize how they are deployed, and identify the dependencies between them.
Authorization and Access Control (Communications) – Identifying the relationships between parties (users) and resources (files, products, services, network devices, etc.).

In the next section, the following three real-world examples are explained in detail –

Social Recommendations (Professional Social Network) – This example is explained with a LinkedIn-like site.
Authorization and Access Control – This is explained with an international communications services company that sells communication products and services to its customers.
Geospatial and Logistics – This is explained with a courier service example.
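The route-finding idea behind the geospatial and logistics example can be illustrated with a textbook shortest-path search. This is not the book's implementation: it is plain Dijkstra's algorithm over a hypothetical delivery network (`ROUTES`), with made-up locations and travel costs.

```python
import heapq

# Hypothetical delivery network: location -> {neighbour: travel cost}.
ROUTES = {
    "depot": {"hub-a": 4, "hub-b": 1},
    "hub-a": {"customer": 1},
    "hub-b": {"hub-a": 2, "customer": 5},
    "customer": {},
}


def cheapest_route(graph, start, goal):
    """Dijkstra's algorithm: return (total cost, path), or None if unreachable."""
    frontier = [(0, start, [start])]  # priority queue ordered by cost so far
    settled = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for neighbour, weight in graph[node].items():
            if neighbour not in settled:
                heapq.heappush(
                    frontier, (cost + weight, neighbour, path + [neighbour])
                )
    return None


print(cheapest_route(ROUTES, "depot", "customer"))
# (4, ['depot', 'hub-b', 'hub-a', 'customer'])
```

The direct-looking route through hub-a costs 5, while the detour through hub-b costs 4, which is the kind of answer that is awkward to get from tabular data and natural to get from a graph.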

Saturday, January 9, 2016

Book KTA (Key Take Away): Graph Databases – NEW OPPORTUNITIES FOR CONNECTED DATA - Part 2
