In Memory Data Grid Technologies

After winning a CSC Leading Edge Forum (LEF) research grant, I (Paul Colmer) wanted to publish some of the highlights of my research to share with the wider technology community.

What is an In Memory Data Grid?

It is not an in-memory relational database, a NOSQL database or a relational database.  It is a different breed of software datastore.

In summary an IMDG is an ‘off the shelf’ software product that exhibits the following characteristics:

The data model is distributed across many servers in a single location or across multiple locations.  This distribution is known as a data fabric.  This distributed model is known as a ‘shared nothing’ architecture.

  • All servers can be active in each site.
  • All data is stored in the RAM of the servers.
  • Servers can be added or removed non-disruptively, to increase the amount of RAM available.
  • The data model is non-relational and is object-based. 
  • Distributed applications written on the .NET and Java application platforms are supported.
  • The data fabric is resilient, allowing non-disruptive automated detection and recovery of a single server or multiple servers.

There are also hardware appliances that exhibit all these characteristics.  I use the term in-memory data grid appliance to describe this group of products and these were excluded from my research.

There are six products in the market that I would consider for a proof of concept, or as a starting point for a product selection and evaluation: 

  • VMware Gemfire                                                (Java)
  • Oracle Coherence                                             (Java)
  • Alachisoft NCache                                             (.Net)
  • Gigaspaces XAP Elastic Caching Edition           (Java)
  • Hazelcast                                                          (Java)
  • Scaleout StateServer                                         (.Net)

 And here are the rest of products available in the market now, that I consider IMDGs:

  • IBM eXtreme Scale
  • Terracotta Enterprise Suite
  • Jboss (Redhat) Infinispan

 Relative newcomers to this space, and worthy of watching closely are Microsoft and Tibco.

Why would I want an In Memory Data Grid? 

Let’s compare this with our old friend the traditional relational database:

  • Performance – using RAM is faster than using disk.  No need to try and predict what data will be used next.  It’s already in memory to use.
  • Data Structure – using a key/value store allows greater flexibility for the application developer.  The data model and application code are inextricably linked.  More so than a relational structure.
  • Operations – Scalability and resiliency are easy to provide and maintain.  Software / hardware upgrades can be performed non-disruptively.

How does an In Memory Data Grid map to real business benefits?

  • Competitive Advantage – businesses will make better decisions faster.
  • Safety – businesses can improve the quality of their decision-making.
  • Productivity – improved business process efficiency reduces waster and likely to improve profitability.
  • Improved Customer Experience – provides the basis for a faster, reliable web service which is a strong differentiator in the online business sector.

How do use an In Memory Data Grid?

  1. Simply install your servers in a single site or across multiple sites.  Each group of servers within a site is referred to as a cluster.
  2. Install the IMDG software on all the servers and choose the appropriate topology for the product.  For multi-site operations I always recommend a partitioned and replicated cache.
  3. Setup your APIs, or GUI interfaces to allow replicated between the various servers.
  4. Develop your data model and the business logic around the model.

With a partitioned and replicated cache, you simply partition the cache on the servers that best suits the business needs to trying to fulfil, and the replicated part ensures there are sufficient copies across all the servers.  This means that if a server dies, there is no effect on the business service.  Providing you have provisioned enough capacity of course.

The key here is to design a topology that mitigates all business risk, so that if a server or a site is inoperable, the service keeps running seamlessly in the background. 

There are also some tough decisions you may need to make regarding data consistency vs performance.  You can trade the performance to improve data consistency and vice versa.

Are there any proven use cases for In Memory Data Grid adoption?

Oh yes, and if you’re a competitor in these markets, you may want to rethink your solution.

Financial Services: Improve decision-making, profitability and market competitiveness through increased performance in financial stock-trading markets. Reduction in processing times from 60 minutes to 60 seconds.

Online Retailer: Providing a highly available, easily maintainable and scalable solution for 3+ million visitors per month in the online card retailer market.

Aviation: Three-site active / active / active flight booking system for a major European budget-airline carrier. Three sites are London, Dublin and Frankfurt.

Check out the VMware Gemfire and Alachisoft NCache websites for more details on these proven use cases.

About the Author:

Paul Colmer is a technology consultant working for CSC and director and active professional musician for Music4Film.net.  He specialises in Cloud Computing, Social Business and Solution Architecture. He is based in Brisbane, Australia. http://www.linkedin.com/pub/paul-colmer/6/894/539

High Scalability – 8 Commonly Used Scalable System Design Patterns

Ricky Ho in Scalable System Design Patterns has created a great list of scalability patterns along with very well done explanatory graphics. A summary of the patterns are:

  1. Load Balancer – a dispatcher determines which worker instance will handle a request based on different policies.
  2. Scatter and Gather – a dispatcher multicasts requests to all workers in a pool. Each worker will compute a local result and send it back to the dispatcher, who will consolidate them into a single response and then send back to the client.
  3. Result Cache – a dispatcher will first lookup if the request has been made before and try to find the previous result to return, in order to save the actual execution.
  4. Shared Space – all workers monitors information from the shared space and contributes partial knowledge back to the blackboard. The information is continuously enriched until a solution is reached.
  5. Pipe and Filter – all workers connected by pipes across which data flows.
  6. MapReduce –  targets batch jobs where disk I/O is the major bottleneck. It use a distributed file system so that disk I/O can be done in parallel.
  7. Bulk Synchronous Parallel – a  lock-step execution across all workers, coordinated by a master.
  8. Execution Orchestrator – an intelligent scheduler / orchestrator schedules ready-to-run tasks (based on a dependency graph) across a clusters of dumb workers.

High Scalability – High Scalability – 6 Strategies for Scaling BBC iPlayer

The BBC’s iPlayer site averages 8 million page views a day for 1.3 million users. Technical Architect Simon Frost describes how they scaled their site in Scaling the BBC iPlayer to handle demand:

  1. Use frameworks. Frameworks support component based development which makes it convenient for team development, but can introduce delays that have to be minimized. Zend/PHP is used because it supports components and is easy to recruit for.  MySQL is used for program metadata. CouchDB is used for key-value access for fast read/write of user-focused data.
  2. Prove architecture before building it. Eliminate guesswork by coming up with alternate architectures and create prototypes to determine which option works best. Balance performance with factors like ease of development.
  3. Cache a lot. Data is cached in memcached for a few seconds to minutes. Short cache invalidation periods keep the data up to date for the users, but even these short periods make a huge difference in performance. Caching doesn’t have to be for a long time to see a benefit. Varnish is used to cache HTML pages. Much of the invalidation is time or action-based (e.g. someone adds a new favourite).
  4. Break the page into personalised and standard components. A common main page is created so that it can be cached separately from personalized data. This gives faster smoother viewing experience. Personalized elements are loaded using Ajax. Varnish’s flexible caching policies are used to cache these elements. User favorite lists are cached for as little as a few minutes.
  5. Use lots of servers. Scaling is horizontal using a pool of servers. Web servers are stateless. Pages are served out of two data centers for high availability.
  6. Test the site before we launching. Load test to track down and fix problems before users see them.

    High Scalability – High Scalability – Pomegranate – Storing Billions and Billions of Tiny Little Files

    Pomegranate is a novel distributed file system built over distributed tabular storage that acts an awful lot like a NoSQL system. It’s targeted at increasing the performance of tiny object access in order to support applications like online photo and micro-blog services, which require high concurrency, high throughput, and low latency. Their tests seem to indicate it works:

    We have demonstrate that file system over tabular storage performs well for highly concurrent access. In our test cluster, we observed linearly increased more than 100,000 aggregate read and write requests served per second (RPS). 

    Rather than sitting atop the file system like almost every other K-V store, Pomegranate is baked into file system. The idea is that the file system API is common to every platform so it wouldn’t require a separate API to use. Every application could use it out of the box.

    The features of Pomegranate are:

    • It handles billions of small files efficiently, even in one directory;
    • It provide separate and scalable caching layer, which can be snapshot-able;
    • The storage layer uses log structured store to absorb small file writes to utilize the disk bandwidth;
    • Build a global namespace for both small files and large files;
    • Columnar storage to exploit temporal and spatial locality;
    • Distributed extendible hash to index metadata;
    • Snapshot-able and reconfigurable caching to increase parallelism and tolerant failures;
    • Pomegranate should be the first file system that is built over tabular storage, and the building experience should be worthy for file system community. 

    Can Ma, who leads the research on Pomegranate, was kind enough to agree to a short interview.

    Can you please give an overview of the architecture and what you are doing that’s cool and different?

    Basically, there is no distributed or parallel file system that can handle billions of small files efficiently. However, we can foresee that web applications(such as email, photo, and even video), and bio-computing(gene sequencing) need massive small file accesses. Meanwhile, file system API is general enough and well understood for most programmers.

    Thus, we want to built a file system to manage billions of small files, and provide high throughput of concurrent accesses. Although Pomegranate is designed for accesses to small files, it support large files either. It is built on top of other distributed file systems, such as Lustre, and only manage the namespace and small files. We just want to stand on ”the Shoulders of Giants”. See the figure bellow:

    Pomegranate has many Metadata Servers and Metadata Storage Servers to serve metadata requests and small file read/write requests. The MDSs are just a caching layer, which load metadata from storage and commit memory snapshots to storage. The core of Pomegranate is a distributed tabular storage system called xTable. It supports key indexed multi-column lookups. We use distributed extendible hash to locate server from the key, because extendible hash is more adaptive to scale up and down. 

    In file systems, directory table and inode table are always separated to support two different types of lookup. Lookups by pathname are handled by directory table, while lookups by inode number are handled by inode table. It is nontrivial to consistently update these two indexes, especially in a distributed file system. Meanwhile, using two indexes has increased the lookup latency, which is unacceptable for accessing tiny files. Typically, there are in memory caches for dentry and inode, however, the caches can’t easily extend. Modifying metadata has to update multiple locations. To keep consistency, operation log is introduced. While, operation log is always a serial point for request flows.

    Pomegranate use a table-like directory structure to merge directory table and inode table. Two different types of lookup are unified to lookups by key. For file system, the key is the hash value of dentry name. Hash conflicts are resolved by a global unique id for each file. For each update, we just need to search and update one table. To eliminate the operations log, we design and support memory snapshot to get a consistent image. The dirty regions of each snapshot can be written to storage safely without considering concurrent modifications.(The concurrent updates are COWed.)

    However, there are some complex file system operations such as mkdir, rmdir, hard link, and rename that should be considered. These ops have to update at least two tables. We implement a reliable multisite update service to propagate deltas from one table to another. For example, on mkdir, we propagate the delta(“nlink +1”) to the parent table. 

    Are there any single points of failure? 

    There is no SPOF in design. We use cluster of MDSs to serve metadata request. If one MDS crashed, the requests are redirected to other MDSs(consistent hash and heartbeats are used). Metadata and small files are replicated to multiple nodes either. However, this replication is triggered by external sync tools which is asynchronous to the writes.

    Small files have usually been the death of filesystems because of directory structure maintenance. How do you get around that?

    Yep, it is deadly slow for small file access in traditional file systems. We replace the traditional directory table (B+ tree or hash tree) to distributed extendible hash table. The dentry name and inode metadata are treated as columns of the table. Lookups from clients are sent(or routed if needs) to the correct MDS. Thus, to access a small file, we just need to access one table row to find the file location. We keep each small file stored sequentially in native file system. As a result, one I/O access can serve a small file read.

    What posix apis are supported? Can files be locked, mapped, symlinks, etc?

    At present, the POSIX support is progressing. We do support symlinks, mmap access. While, flock is not supported. 

    Why do a kernel level file system rather than a K-V store on top?

    Our initial objective is to implement a file system to support more existing applications. While, we do support K/V interface on top of xTable now. See the figure of architecture, the AMC client is the key/value client for Pomegranate. We support simple predicates on key or value, for example we support “select * from table where key < 10 and ‘xyz’ in value” to get the k/v pairs that value contains “xyz” and key < 10.

    How does it compare to other distributed filesystems?

    We want to compare the small file performance with other file systems. However, we have not tested it yet. We will do it in the next month. Although, we believe most distributed file systems can not handle massive small file accesses efficiently.

    Are indexes and any sort of queries supported?

    For now, these supports has not be properly considered yet. We plan to consider range query next.

    Does it work across datacenters, that is, how does it deal with latency?

    Pomegranate only works in a datacenter. WAN support has not been considered yet.

    It looks like you use an in-memory architecture for speed. Can you talk about that?

    We use a dedicated memory cache layer for speed. Table rows are grouped as table slices. In memory, the table sl
    ices are hashed in to a local extendible hash table both for performance and space consumption. Shown by the bellow figure,

    Clients issue request by hash the file name and lookup in the bitmap. Then, using a consistent hash ring to locate the cache server(MDS) or storage server(MDSL). Each update firstly gets the *opened* transaction group, and can just apply to the in memory table row. Each transaction group changing is atomic. After all the pending updates are finished, the transaction group can be committed to storage safely. This approach is similar as Sun’s ZFS.

    How is high availability handled?

    Well, the central server for consistent hash ring management and failure coordinator should be replicated by Paxos algorithm. We plan to use ZooKeeper for high available central service.
    Other components are designed to be fault tolerant. Crashes of MDS and MDSL can be configured as recovered immediately by routing requests to new servers (by selecting the next point in consistent hash ring).

    Operationally, how does it work? How are nodes added into the system?

    Adding nodes to the caching layer is simple. The central server (R2) add the new node to the consistent hash ring. All the cache servers should act on this change and just invalidate their cached table slices if they will be managed by the new node. Requests from clients are routed to the new server, and a CH ring change notification will piggyback to client to pull the new ring from the center server.

    How do you handle large files? Is it suitable for streaming video?

    As described earlier, large files are relayed to other distributed file systems. Our caching layer will not be polluted by the streaming video data.

    Anything else you would like to add?

    Another figure for interaction of  Pomegranate components.