Nati Shalom’s Blog: NoCAP


October 16, 2010

NoCAP

In the past few months I was involved in many of the NoSQL discussions. I must admit that I really enjoyed those discussions, as it felt that we finally started to break away from the “one size fits all” dogma and look at data management solutions in a more pragmatic manner. That in itself sparks lots of interesting and innovative ideas that can revolutionize the entire database market, such as the introduction of the document model, map-reduce, and the new query semantics that come with it. As with any new movement, we seem to be going through the classic hype cycle. Right now it seems to me that we’re getting close to the peak of that hype. One of the challenges that I see when a technology reaches the peak of its hype is that people stop questioning the reason for doing things and jump on the new technology just because X did it. NoSQL is no different in that regard.


In this post I wanted to spend some time on the CAP theorem and clarify some of the confusion that I often see when people associate CAP with scalability without fully understanding the implications that come with it and the alternative approaches.

I chose to name this post NoCAP specifically to illustrate the idea that you can achieve scalability without compromising on consistency, at least not to the degree that many of the disk-based NoSQL implementations impose.

Recap on CAP

Quoting the definition from Wikipedia:

The CAP theorem, also known as Brewer’s theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

  • Consistency

    (all nodes see the same data at the same time)

  • Availability

    (node failures do not prevent survivors from continuing to operate)

  • Partition Tolerance

    (the system continues to operate despite arbitrary message loss)

CAP and NoSQL

Many of the disk-based NoSQL implementations originated from the need to deal with write scalability. This was largely due to changes in traffic behavior, mainly a result of social networking, in which most of the content is generated by the users and not by the site owner.

In a traditional database approach, achieving data consistency requires synchronous writes to disk and distributed transactions (known as the ACID properties).

It was clear that the demand for write scalability would conflict with the traditional approaches for achieving consistency (synchronous writes to a central disk and distributed transactions).

The solution to that was to:

1. Break the centralized disk access by partitioning the data across distributed nodes.
2. Achieve high availability through redundancy (replication of the data to multiple nodes).
3. Use asynchronous replication to reduce the write latency.
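
To make point 1 a bit more concrete, here is a minimal sketch in Java (with hypothetical interfaces, not any specific product’s API) of how a client-side router can map each key to one of the partitions so that the write load is spread across nodes:

```java
import java.util.List;

// Minimal sketch of hash-based partitioning (point 1 above).
// PartitionNode and its write() method are hypothetical placeholders
// for whatever node/client abstraction a real data store exposes.
public class PartitionRouter {
    private final List<PartitionNode> partitions;

    public PartitionRouter(List<PartitionNode> partitions) {
        this.partitions = partitions;
    }

    // Route a write to the partition that owns the key, so each node
    // handles only its share of the write load.
    public void write(String key, byte[] value) {
        int index = Math.floorMod(key.hashCode(), partitions.size());
        partitions.get(index).write(key, value);
    }
}

interface PartitionNode {
    void write(String key, byte[] value);
}
```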

The assumption behind point 3 above is going to be the focus of this post.

Graphical representation of the CAP theorem. (Source)

 

The Consistency Challenge

One of the common assumptions behind many of the NoSQL implementations is that to achieve write scalability we need to push as many operations as possible from the write path into a background process, so that we can minimize the time in which a user transaction is blocked on a write.

The implication is that with asynchronous writes we lose consistency between write and read operations, i.e. a read operation can return an older version than that of the latest write.

There are different algorithms that were developed to address this type of inconsistency challenge, often referred to as Eventual Consistency.

For those interested in more information in that regard, I would recommend looking at Jeremiah Peschka’s post Consistency models in non-relational DBs. Jeremiah provides a good (and short!) summary of the CAP theorem, the Eventual Consistency model, and other common principles that come with it, such as BASE (Basically Available, Soft-state, Eventually consistent), NRW, and vector clocks.
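
As a quick illustration of the NRW idea mentioned above, the usual rule of thumb is that reads are guaranteed to overlap with the latest write only when R + W > N (where N is the number of replicas, W is the number of replicas that must acknowledge a write, and R is the number of replicas consulted on a read). A minimal sketch of that check:

```java
// Sketch of the NRW rule of thumb: with N replicas, a write acknowledged
// by W of them and a read consulting R of them are guaranteed to overlap
// on at least one up-to-date replica only when R + W > N.
public final class QuorumCheck {
    public static boolean isStronglyConsistent(int n, int w, int r) {
        return r + w > n;
    }

    public static void main(String[] args) {
        System.out.println(isStronglyConsistent(3, 2, 2)); // true  - overlapping quorums
        System.out.println(isStronglyConsistent(3, 1, 1)); // false - eventual consistency only
    }
}
```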

Do we really need Eventual Consistency to achieve write scalability?

Before I dive into this topic, I wanted to start with a quick introduction to the term “scalability”, which is often used interchangeably with throughput. Quoting Steve Haines:

The terms “performance” and “scalability” are commonly used interchangeably, but the two are distinct: performance measures the speed with which a single request can be executed, while scalability measures the ability of a request to maintain its performance under increasing load

(See my previous post in that regard: The true meaning of linear scalability.)

In our specific case, that means that write scalability is delivered primarily through points 1 and 2 above (1 - break the centralized disk access by partitioning the data across distributed nodes; 2 - achieve high availability through redundancy and replication of the data to multiple nodes), whereas point 3 (use asynchronous replication to those replicas to avoid the replication overhead on write) is mostly related to write throughput and latency, not scalability. Which brings me to the point behind this post:

Eventual consistency has little or no direct impact on write scalability.

To be more specific, my argument is that it is quite often enough to break our data model into partitions (a.k.a. shards) and break away from the centralized disk model to achieve write scalability. In many cases we may find that we can achieve sufficient throughput and latency just by doing that.

We should consider the use of asynchronous write algorithms to optimize write performance and latency, but due to the inherent complexity that comes with them we should consider that only after we have tried simpler alternatives such as database shards, flash disks, or memory-based devices.

Achieving write throughput without compromising consistency or scalability

The diagram below illustrates one example of how we can achieve write scalability and throughput without compromising on consistency.

[Diagram: data partitioned across nodes, each partition served from an in-memory primary with a synchronous in-memory replica and an asynchronous write to long-term storage]

As with the previous examples, we break our data into partitions to spread the write load between nodes. To achieve high throughput we use in-memory storage instead of disk. As in-memory devices tend to be significantly faster and more concurrent than disk, and since network speed is no longer a bottleneck, we can achieve high throughput and low latency even when we use synchronous writes to the replica.

The only place in which we use asynchronous writes is the write to the long-term storage (disk). As user transactions don’t access the long-term storage directly through the read or write path, they are not exposed to the potential inconsistency between the memory storage and the long-term storage. The long-term storage can be any of the disk-based alternatives, from a standard SQL database to any of the existing disk-based NoSQL engines.
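
Here is a minimal sketch of that write path (the interfaces are illustrative placeholders, not a specific product’s API): the primary updates its memory and its backup synchronously, and only the write to the long-term storage is handed off to a background thread.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of a partition's write path: synchronous in-memory replication,
// asynchronous write-behind to long-term (disk) storage.
public class PartitionWritePath {
    private final InMemoryStore primary;
    private final InMemoryStore backup;          // synchronous replica
    private final BlockingQueue<Entry> writeBehindQueue = new LinkedBlockingQueue<>();

    public PartitionWritePath(InMemoryStore primary, InMemoryStore backup, LongTermStore disk) {
        this.primary = primary;
        this.backup = backup;
        // Background thread drains the queue to disk; readers never touch it.
        Thread flusher = new Thread(() -> {
            try {
                while (true) {
                    disk.persist(writeBehindQueue.take());
                }
            } catch (InterruptedException ignored) {
                Thread.currentThread().interrupt();
            }
        });
        flusher.setDaemon(true);
        flusher.start();
    }

    public void write(Entry e) {
        primary.put(e);          // in-memory, fast
        backup.put(e);           // synchronous replication over the network
        writeBehindQueue.add(e); // asynchronous write to long-term storage
        // The user transaction returns here; reads are served from memory,
        // so they never observe the lag of the disk write.
    }

    public Entry read(String key) {
        return primary.get(key);
    }
}

record Entry(String key, byte[] value) {}

interface InMemoryStore { void put(Entry e); Entry get(String key); }
interface LongTermStore { void persist(Entry e); }
```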

The other benefit of this approach is that it is significantly simpler: simpler not just in terms of development, but also simpler to maintain compared with the Eventual Consistency alternatives. In distributed systems, simplicity often correlates with reliability and deterministic behavior.

 

Final words

It is important to note that in this post I was referring mostly to the C in CAP and not CAP in its broad definition. My point was not to say don’t use solutions that are based on the CAP/Eventual Consistency model, but rather to say don’t jump on Eventual Consistency based solutions before you have considered the implications and the alternative approaches. There are potentially simpler approaches to deal with write scalability, such as using database shards or In-Memory Data Grids.

As we’re reaching the age of tera-scale devices such as Cisco UCS, where we can get huge capacities of memory, network, and compute power in a single box, the areas in which we can consider putting our entire data set in memory get significantly broader, as we can easily store terabytes of data in just a few boxes. The case of Foursquare’s MongoDB outage is interesting in that regard. 10gen’s CEO Dwight Merriman argued that the entire set of data actually needs to be served completely in memory:

For various reasons the entire DB is accessed frequently so the working set is basically its entire size. Because of this, the memory requirements for this database were the total size of data in the database. If the database size exceeded the RAM on the machine, the machine would thrash, generating more I/O requests than the four disks could service.

It is a common misconception to think that putting part of the data in an LRU-based cache on top of disk-based storage could yield better performance, as noted in the Stanford research The Case for RAMClouds:

..even a 1% miss ratio for a DRAM cache costs a factor of 10x in performance. A caching approach makes the deceptive suggestion that “a few cache misses are OK” and lures programmers into configurations where system performance is poor..
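
To make that point concrete, here is a back-of-the-envelope calculation (the latency figures are illustrative assumptions, not measurements) showing how even a 1% miss ratio drags the effective access time toward disk latency:

```java
// Back-of-the-envelope effective access time for a DRAM cache over disk.
// The latency figures below are illustrative assumptions only.
public final class CacheMissCost {
    public static void main(String[] args) {
        double ramLatencyUs = 5;        // ~5 microseconds for an in-memory read
        double diskLatencyUs = 5_000;   // ~5 milliseconds for a random disk read
        double missRatio = 0.01;        // "only" 1% of reads miss the cache

        double effective = (1 - missRatio) * ramLatencyUs + missRatio * diskLatencyUs;
        System.out.printf("effective latency: %.1f us (%.1fx slower than pure RAM)%n",
                effective, effective / ramLatencyUs);
        // Prints roughly 55 us, i.e. about 11x slower than a pure in-memory read,
        // which matches the "1% miss ratio costs a factor of 10x" observation.
    }
}
```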

In Foursquare’s case, using a pure In-Memory Data Grid as a front end and disk-based storage as long-term storage could potentially work better, with significantly lower maintenance overhead and higher determinism. The amount of data in this specific case (<100GB) shouldn’t be hard to fit into a single UCS box or a few EC2 boxes.

 

References

  • Henry Robinson, CAP Confusion: Problems with “partition tolerance” (Cloudera blog): http://www.cloudera.com/blog/2010/04/cap-confusion-problems-with-partition-tolerance/

  • Daniel Abadi, Problems with CAP, and Yahoo’s little known NoSQL system: http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html


Comments

pron said…

Nice post, Nati, but I believe you are a bit misleading regarding the CAP theorem. CAP applies to all distributed solutions, including the one you’ve described (which does not sacrifice consistency). However, that solution does sacrifice partition tolerance. Asynchronous writes are not only used for reducing latencies (i.e. performance), but also to provide partition tolerance. It’s simply a matter of whether you prefer CAP’s CA or AP.

Nati Shalom said in reply to pron…

Pron

You have a point here

My point was more to do with the Eventual Consistency part of CAP and not CAP in its broad definition.
I’ll add that clarification to the post.

Asynchronous writes are not only used for reducing latencies (i.e. performance), but also to provide partition tolerance

I see how partition tolerance relates to the fact that I need to maintain redundant replicas. I fail to see the direct connection to whether or not the replication is synchronous.

Quoting the Wikipedia definition:

Partition Tolerance (the system continues to operate despite arbitrary message loss)

In the proposed alternative that I was pointing to above, you can achieve partition tolerance by the fact that there is more than one replica of the data available somewhere on the network. So when a node fails or becomes unavailable, the system continues to serve write/read requests through one of the redundant nodes.

My point was that you don’t have to give up consistency to achieve partition tolerance at that level.

pron said in reply to Nati Shalom

“My point was that you don’t have to give up consistency to achieve partition tolerance at that level.”
But you do. By definition, synchronous replication means that the master has to wait for the slave to acknowledge receiving the update. If update messages are lost (a partition is created in the network), the master in effect ceases to function. Both master and slave(s) stop serving queries, because they cannot guarantee consistent responses. With eventual-consistency, nodes can still serve queries even if data replication is temporarily hampered – some nodes will simply return query results that don’t reflect the latest updates. This is precisely the consistency/partition-tolerance trade-off.

Nati Shalom said in reply to pron…

Pron

Let me refine my response.

Synchronous replication behaves differently than what you described. Synchronous replication (with GigaSpaces in mind) works as follows when it comes to a network partition or node failure:

1) Network partition between primary and replica node: if a replica node fails, the partition that is still available continues to serve user requests and logs all the transactions into a log buffer until the communication is restored. The failed node restores its state upon startup from the current state of the available node and by replaying the log (to fill in the gap accumulated since it started its recovery process).

2) If a primary node fails, users are routed to an alternate node that becomes the new primary. The recovery process of the failed primary is exactly the same as I mentioned above, as it basically becomes the new backup when it comes back alive.

In this model the asynchronous replication happens only during the recovery process and not on a continuous basis, and thus you don’t have to compromise between consistency and partition tolerance to the degree that you would otherwise.
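
As a rough sketch of points 1 and 2 above (the names are illustrative, not GigaSpaces’ actual API), the available node keeps serving requests and appends updates to a redo log while its peer is unreachable, and replays that log once communication is restored:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of a primary that buffers updates while its replica is unreachable
// and replays them once communication is restored. All names are illustrative.
public class ResilientPrimary {
    /** Hypothetical replica endpoint; replicate() may throw if the network is down. */
    interface Replica { void replicate(String update); }

    private final Deque<String> redoLog = new ArrayDeque<>();
    private final Replica replica;
    private boolean replicaReachable = true;

    public ResilientPrimary(Replica replica) {
        this.replica = replica;
    }

    public synchronized void write(String update) {
        applyLocally(update);                // the available node keeps serving requests
        if (replicaReachable) {
            try {
                replica.replicate(update);   // normal case: synchronous replication
            } catch (RuntimeException networkError) {
                replicaReachable = false;    // partition or node failure detected
                redoLog.add(update);         // start buffering
            }
        } else {
            redoLog.add(update);             // keep logging until the replica is back
        }
    }

    // Called when the failed node is reachable again: after it has copied the
    // primary's current state, replay the buffered log to close the gap.
    public synchronized void onReplicaRecovered() {
        while (!redoLog.isEmpty()) {
            replica.replicate(redoLog.poll());
        }
        replicaReachable = true;
    }

    private void applyLocally(String update) { /* update the in-memory store */ }
}
```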

HTH
Nati S.

pron said in reply to Nati Shalom

Nati, you cannot escape the CAP theorem, and I think you may be adding to the confusion rather than clarifying it. I invite you to read http://www.cloudera.com/blog/2010/04/cap-confusion-problems-with-partition-tolerance/.

What your (perfectly reasonable) solution does is create a quorum between nodes that contain the “shard”. If you have just two of them, a master and a slave, then in the case of a two-node failure, your system is not available (there is no access to the data contained only in those two nodes). In order to survive larger failures you would require a larger quorum, say 3 nodes, but then, because writes are synchronous, write latency MUST suffer, no matter whether the data is stored on disk or in RAM. The write transaction must wait for all shard nodes to acknowledge the operation. This kind of system is called CA (perhaps a bit confusingly), or, one that in the event of a partition becomes unavailable. Again, this may be a perfectly good solution but it is not a “simpler solution”, but rather one to a problem with different requirements. It is absolutely not a viable solution for systems that require continued operation in the event of many-node failures, which must make use of data stores such as Dynamo or Cassandra, just as Dynamo and Cassandra cannot be used by systems that require perfect consistency.
You may, however, argue that node failures are, more often than not, disk failures, so a RAM-only data store is unlikely to incur multiple-node failures.

Nati Shalom said in reply to pron…

Pron

Thanks for the pointer. The definition that I was referring to in this post seems to be more consistent with that of Henry Robinson than that of Dr. Michael Stonebraker, so I hope that we’re in agreement here.

In order to survive larger failure you would require a larger quorum, say 3 nodes, but then, because writes are synchronous, write latency MUST suffer

That’s not entirely true. Writes to the replicas happen in parallel to one another, i.e. they are not sequential; therefore the latency is not proportional to the number of replicas but to the response time of the slowest node in the replica group.
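
A minimal sketch of that point (with illustrative interfaces): if the synchronous write is fanned out to all replicas in parallel and the caller waits for all acknowledgments, the added latency is roughly that of the slowest replica, not the sum across replicas.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: replicate a write to all replicas in parallel and wait for every
// acknowledgment. Latency is bounded by the slowest replica, not the sum.
public class ParallelReplicator {
    /** Hypothetical replica client; write() blocks until the replica acknowledges. */
    interface ReplicaClient { void write(String key, byte[] value); }

    private final List<ReplicaClient> replicas;
    private final ExecutorService pool = Executors.newCachedThreadPool();

    public ParallelReplicator(List<ReplicaClient> replicas) {
        this.replicas = replicas;
    }

    public void replicate(String key, byte[] value) {
        CompletableFuture<?>[] acks = replicas.stream()
                .map(r -> CompletableFuture.runAsync(() -> r.write(key, value), pool))
                .toArray(CompletableFuture[]::new);
        CompletableFuture.allOf(acks).join(); // wait for the slowest acknowledgment
    }
}
```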

There are even more efficient ways to deal with large failures than to keep many redundant nodes active all the time and pay the overhead in bandwidth, capacity, and complexity (a replication factor of 3 means that for every GB of data we have 2GB of redundant information; with large data systems that can turn into huge numbers and cost).

A better way, IMO, is to add more backups on demand, i.e. only when one of the nodes fails. So you are guaranteed to have two nodes always available (with the exception of losing the two nodes at the SAME time; statistically I would argue that this has the same chance as losing two data centers at the same time, as in most cases the two nodes would live in separate data centers).

In GigaSpaces there is also an option to mix synchronous and asynchronous replicas under the same group. This is in fact how we store data into long-term storage, so you have more options to control the tradeoff between throughput and redundancy in that regard.

Over the years of dealing with those tradeoffs we found that even in the most mission-critical systems you can think of, customers chose the backup-on-demand option rather than having more backups, even though they could.

pron said in reply to Nati Shalom

Another required read for anyone interested in distributed systems and the CAP theorem is this: http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html

Dr. Abadi from Yale University explains that there are really only two types of operating mode for distributed systems: CA/CP (which are exactly the same thing) and AP.
I, too, believe that AP solutions (that only guarantee eventual consistency) are relevant only for extremely large, “web-scale” applications with thousands of nodes and millions of concurrent users (like Amazon and Facebook), provided they can live without strict consistency, and possibly for smaller applications where consistency is not at all crucial, and where dealing with eventual-consistency bears absolutely no additional development complexity. For most enterprise applications, CA(=CP) solutions should certainly be considered first.
Not surprisingly, the first well-known AP data-stores, Dynamo and Cassandra (which is largely based on Dynamo), were created by Amazon and Facebook. They were the first to face a problem big enough where CA was simply inadequate and AP’s eventual consistency was, luckily, acceptable. Amazon certainly tackled a new problem with an ingenious solution. But, again, if you’re not Amazon, Facebook, Twitter or LinkedIn, CA data-stores are probably a better fit.

Nati Shalom said in reply to pron…

Thanks, I’ll add the two references to the reference section of this post.

Nati Shalom said in reply to pron…

And BTW, I’m having a hard time agreeing with the classification around CA vs. CP. As Daniel Abadi noted here:

CP and CA are essentially identical. So in reality, there are only two types of systems: CP/CA and AP. I.e., if there is a partition, does the system give up availability or consistency? Having three letters in CAP and saying you can pick any two does nothing but confuse this point.

Dehora said…

Nati, I agree with most of this, but are you setting up a strawman here? I don’t see how EC has any bearing. MongoDB is not in the family of databases that use EC as a strategy. It’s sharded, it performs delayed writes, and it uses synchronous memory writes across nodes to emulate durability. Each shard’s master is the sole write master for its data. A read replica is elected in the event the master fails. It’s much closer to your diagram of a disk-backed memory grid – add a master lookup node, and you’ve drawn MongoDB. It’s not a system where you can just write to any node in the cluster and have the nodes make the data consistent later, which is what an EC strategy provides.

“Eventual consistency have no direct impact on write scalability .”

At worst that’s incorrect, at best it needs to be qualified. If you allow copies of a record to be concurrently updated then you have improved the write scalability of the system as there are no coordination bottlenecks on the write path. At the cost of consistency for readers. If you only allow a single copy of the record to exist and force processes to compete for access to that record, then scalability is reduced. And notably the cost of consistency for readers still has not been removed (reduced but not removed). As soon as you cache, this dilemma exists.

The outage issue seems more to do with internal detail, albeit architecturally significant detail – the way indexes and data are structured, how they get paged, the on-disk architecture. Clearly there’s an issue here, or the answer to the problem would be to buy more RAM, not change MongoDB to degrade and rebalance better. But as you’ve realized by mentioning UCS, you can’t easily just buy more RAM and put it into a public cloud machine; you are limited to buying more machines. So in that configuration rebalancing is a must-have.

Nati Shalom said in reply to Dehora

Hi Dehora,
Thanks for the detailed response – I appreciate it!

MongoDB is not in the family of databases that use EC as a strategy. It’s sharded, it performs delayed writes, and it uses synchronous memory writes across nodes to emulate durability. Each shard’s master is the sole write master for its data. A read replica is elected in the event the master fails

I was reading a series of posts on MongoDB’s consistency model – namely this reference.
You seem to be right in saying that even though MongoDB comes with knobs to control consistency, it comes with strong consistency as the default setup.

I’ll remove that point from my post.
Again thanks for pointing this out.

If you allow copies of a record to be concurrently updated then you have improved the write scalability of the system as there are no coordination bottlenecks on the write path. At the cost of consistency for readers.

For that specific scenario, can’t I just use optimistic locking on that single record?

Can you point to a scenario where parallel updates on the same record through multiple nodes couldn’t be addressed through more relaxed locking on the same node?
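
As a rough sketch of what I mean by optimistic locking on a single record (illustrative code, not a specific product’s API): each record carries a version, and an update only succeeds if the version it read is still the current one, so concurrent writers on the same record are detected and retried rather than coordinated up front.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of per-record optimistic locking: an update succeeds only if the
// record's version has not changed since it was read.
public class OptimisticStore {
    public record Versioned(long version, String value) {}

    private final Map<String, Versioned> data = new ConcurrentHashMap<>();

    public void create(String key, String value) {
        data.putIfAbsent(key, new Versioned(0, value));
    }

    public Versioned read(String key) {
        return data.get(key);
    }

    /** Returns false if another writer got in first; the caller re-reads and retries. */
    public boolean update(String key, Versioned expected, String newValue) {
        Versioned next = new Versioned(expected.version() + 1, newValue);
        return data.replace(key, expected, next); // atomic compare-and-set on the record
    }
}
```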

From what I’ve seen, in most cases where Eventual Consistency was brought up, it was rarely used to address concurrent update contention on the same record, but rather as a general means to address write scalability through asynchronous writes.

To be more accurate, I think that I’ll change my wording to:

“Eventual consistency has little or no direct impact on write scalability.”

The outage issue seems more to do with internal detail, albeit architecturally significant detail – the way indexes and data are structured, how they get paged

The point that I was trying to make was the same point that was brought up in the Stanford research. Putting even a large part of your data in a cache is always going to lead to complexity and non-deterministic behavior. For example, if most of your queries happen to do things like max() or avg(), they will always end up going down to disk even if 90% of your data set is in memory.
The impact of going to disk for just the 10% is not linear, i.e. it wouldn’t yield a 10% degradation but something closer to 10x.

My point was that if your data needs to be served in-memory, then having a disk-based solution with an LRU cache as the front end is just going to make your application more complex. A better approach in this case is to put the entire data set in memory and use the memory storage as the system of record, as with an In-Memory Data Grid.

The size of the data in question seemed to be within the range that could easily fit that approach.

“I chose to name this post NoCAP specifically to illustrate the idea that you can achieve scalability without compromising on consistency, at least not to the degree that many of the disk-based NoSQL implementations impose.”

“It is important to note that in this post I was referring mostly to the C in CAP and not CAP in its broad definition.”

“To be more accurate, I think that I’ll change my wording to: Eventual consistency has little or no direct impact on write scalability.”

With all due respect, the main thing which is eventually consistent here is your post. I suggest using the term IDontUnderstandCAP rather than NoCAP (hint: if you decide to ignore partitioning in a solution, the CAP theorem continues to be valid).

I think the whole point comes from the fact that you’re confusing availability with partition tolerance. The fact that if a server is unavailable the request is routed to some other server is something that makes your distributed system highly available (and adds to the A of CAP). That is not partition tolerance.

Partition tolerance comes into play from the fact that if the backbone between Europe and the US breaks, Amazon, for example, now has two systems (the two partitions, the European one and the US one) that keep on going, but they are now inconsistent (there’s no consistency on stock availability between the two systems, as they cannot communicate the updates resulting from client actions). But eventually they will be.

Dependence on synchronous communication means you’ll lose availability or partition tolerance, depending on the strategy you decide to take in case the communication medium over which the synchronous communication works breaks.

