Book name: Graph Databases: New Opportunities for Connected Data
Authors: Ian Robinson, Jim Webber, and Emil Eifrem
Publisher: O'Reilly Media
The book can be downloaded for free from http://neo4j.com/books/
Chapter 4, Building a Graph Database Application, is the core chapter of this book. It discusses some of the practical issues of working with a graph database. Drawing on the authors' experience, it also notes that graph database applications are highly amenable to being developed using the evolutionary, incremental, and iterative software development practices in widespread use today.
The chapter starts with data modeling, which was discussed in detail in Chapter 3 and is demonstrated here with a different example. An interesting concept explained here is the timeline tree, which can be built when we need to find all the events that have occurred over a specific period.
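To make the timeline tree idea concrete, here is a minimal in-memory sketch in Python. It is not the book's Cypher model: the class name `TimelineTree` and its methods are hypothetical, and nested dictionaries stand in for the year, month, and day nodes that a graph would link to event nodes.

```python
from collections import defaultdict
from datetime import date


class TimelineTree:
    """Toy timeline tree: root -> year -> month -> day -> events (hypothetical names)."""

    def __init__(self):
        # Three nested dictionary levels stand in for Year, Month and Day nodes.
        self.years = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))

    def add_event(self, when: date, event: str) -> None:
        """Attach an event to the day node for the given date."""
        self.years[when.year][when.month][when.day].append(event)

    def events_between(self, start: date, end: date) -> list:
        """Return events whose day lies in [start, end], in chronological order."""
        found = []
        for year in sorted(self.years):
            if year < start.year or year > end.year:
                continue  # skip a whole year subtree that cannot match
            for month in sorted(self.years[year]):
                for day in sorted(self.years[year][month]):
                    if start <= date(year, month, day) <= end:
                        found.extend(self.years[year][month][day])
        return found
```

Finding all events in a period then becomes a walk down the tree that can skip whole subtrees, rather than a scan over every event.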
Versioning is explained as follows: a versioned graph enables us to recover the state of the graph at a particular point in time. A versioning scheme can be built into the graph model by timestamping and archiving nodes and relationships whenever they are modified. The downside of such versioning schemes is that they leak into any queries written against the graph, adding a layer of complexity to even the simplest query.
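The effect of timestamping and archiving can be sketched in a few lines of Python. This is only an illustration under simple assumptions (integer timestamps, one node, a hypothetical `VersionedNode` class), not the scheme from the book. Note how every read has to carry a timestamp, which is the "leak" into queries mentioned above.

```python
class VersionedNode:
    """Toy versioned node: every update archives a timestamped snapshot."""

    def __init__(self):
        # List of (timestamp, properties) pairs in increasing timestamp order.
        self._versions = []

    def update(self, timestamp: int, **properties) -> None:
        """Record a new version that merges the changes into the latest state."""
        if self._versions and timestamp <= self._versions[-1][0]:
            raise ValueError("timestamps must be strictly increasing")
        current = dict(self._versions[-1][1]) if self._versions else {}
        current.update(properties)
        self._versions.append((timestamp, current))

    def state_at(self, timestamp: int):
        """Return the properties as of the given time, or None if not yet created."""
        state = None
        for ts, props in self._versions:
            if ts <= timestamp:
                state = props
            else:
                break
        return dict(state) if state is not None else None


node = VersionedNode()
node.update(10, name="Alice", city="Oslo")
node.update(20, city="London")
print(node.state_at(15))  # the state before the move to London
```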
The next section is about application architecture; it discusses the embedded and server deployment modes.
In embedded mode, Neo4j runs in the same process as the application. Embedded mode is ideal for hardware devices, desktop applications, and for incorporating in application servers. Some of the advantages of embedded mode are:
- Low latency
- Choice of APIs
- Explicit transactions
Embedded mode also comes with some points to bear in mind:
- JVM only
- GC behaviors
- Database lifecycle
Server mode is the most common means of deploying the database today. At the heart of each server is an embedded instance of Neo4j. Some of the benefits of server mode are:
- REST API
- Platform independence
- Scaling independence
- Isolation from application GC behaviors
Server mode also comes with some points to bear in mind:
- Network overhead
- Transaction state
The following strategies are worth considering when clustering Neo4j:
- Replication: writes are committed on the master and replicated to one or more slaves.
- Buffer writes using queues: in high write load scenarios, writes to the cluster can be buffered in a queue.
- Global clusters: in Neo4j, it is possible to install a multi-region cluster across multiple data centers and cloud platforms. A multi-region cluster enables us to service reads from the portion of the cluster geographically closest to the client.
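As a rough illustration of buffering writes in a queue, the Python sketch below collects pending writes and hands them over in bounded batches. The function name `drain_batch` and the batch size are assumptions made for this example; a real deployment would use a durable message queue and a worker that applies each batch to the cluster.

```python
from queue import Empty, Queue


def drain_batch(buffer: Queue, max_batch: int) -> list:
    """Take up to max_batch pending writes off the queue without blocking."""
    batch = []
    while len(batch) < max_batch:
        try:
            batch.append(buffer.get_nowait())
        except Empty:
            break  # queue is empty, return what we have
    return batch


# Producers enqueue writes as they arrive ...
pending = Queue()
for write_id in range(5):
    pending.put(write_id)

# ... and a single worker applies them to the database in small batches.
first = drain_batch(pending, 3)
second = drain_batch(pending, 3)
```

The queue absorbs bursts, so the cluster sees a steadier write rate than the application produces.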
Load balancing needs to be considered to maximize throughput and reduce latency when using a clustered graph database. The following options can be considered:
- Separate read traffic from write traffic
- Cache sharding
- Read your own writes
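Cache sharding relies on routing requests consistently, so that requests touching the same part of the graph tend to reach the same instance, where that part is likely already in memory. A minimal sketch in Python, assuming a hypothetical routing key such as a user ID and made-up instance names:

```python
import hashlib


def pick_instance(routing_key: str, instances: list) -> str:
    """Map a routing key to one instance, always the same one for the same key."""
    if not instances:
        raise ValueError("at least one instance is required")
    # A stable hash (unlike Python's built-in hash()) gives the same
    # answer across processes and restarts.
    digest = hashlib.sha256(routing_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(instances)
    return instances[index]


instances = ["db-1", "db-2", "db-3"]  # hypothetical instance names
target = pick_instance("user-42", instances)
```

Simple modulo routing like this reshuffles most keys when the number of instances changes; that trade-off is acceptable for a sketch but worth keeping in mind.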
The next section is dedicated to testing the graph database application. The following techniques are discussed:
- Test-driven data model development
- Performance testing: query performance tests and application performance tests
The Capacity Planning section discusses planning for production deployment. It explains that estimating production needs depends on factors such as graph size, query performance, and the number of expected users and their behaviors. The following optimization criteria are also discussed, to be weighed according to business needs:
- Cost
- Performance
- Redundancy
- Load
The performance criterion is discussed in detail, with the following points:
- Calculating the cost of graph database performance: this depends on the database stack and on whether the graph fits in memory, since that affects hardware selection.
- Performance optimization options:
  - Increase the JVM heap size
  - Increase the percentage of the store mapped into the page caches
  - Invest in faster disks: SSDs or enterprise flash hardware
The Redundancy section explains that planning for redundancy requires determining how many instances in a cluster we can afford to lose while keeping the application up and running.
Optimizing for load is described as the trickiest part of capacity planning. A rule of thumb is given as:
Number of concurrent requests = (1000 / average request time (in milliseconds)) * number of cores per machine * number of machines
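To see the rule of thumb in action, here it is as a small Python function. The input numbers are made up for illustration and do not come from the book.

```python
def concurrent_requests(avg_request_ms: float, cores_per_machine: int, machines: int) -> float:
    """Rule-of-thumb capacity estimate from the formula above."""
    if avg_request_ms <= 0:
        raise ValueError("average request time must be positive")
    return (1000 / avg_request_ms) * cores_per_machine * machines


# Hypothetical cluster: 50 ms average request, 8 cores per machine, 3 machines.
estimate = concurrent_requests(50, 8, 3)
print(estimate)  # (1000 / 50) * 8 * 3 = 480.0
```

Halving the average request time doubles the estimate, which is why query performance work pays off directly in capacity.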
The next section is about importing and bulk loading data; it explains the tools and commands for the initial import and for batch loading from legacy databases or external systems.
Chapter 5, Graphs in the Real World, starts with the reasons organizations choose graph databases:
- “Minutes to milliseconds” performance
- Drastically accelerated development cycles
- Extreme business responsiveness
- Enterprise ready
Then some common use cases are discussed in detail. The chapter notes that these use cases are taken from real-world production systems.
- Social: understanding people's behavior based on their connections, e.g. Facebook.
- Recommendations: understanding connections between people and things.
- Geo: finding the best route.
- Master Data Management: identifying hierarchies, master data metadata, and master data models, and facilitating their modeling, storage, and querying.
- Network and Data Center Management: a graph representation of a network makes it possible to catalog assets, visualize how they are deployed, and identify the dependencies between them.
- Authorization and Access Control (Communications): identifying the relationships between parties (users) and resources (files, products, services, network devices, etc.).
In the next section, the following three real-world examples are explained in detail:
- Social Recommendations (Professional Social Network): explained with a LinkedIn-like site.
- Authorization and Access Control: explained with an international communications services company that sells communication products and services to its customers.
- Geospatial and Logistics: explained with a courier service example.