Wednesday, August 12, 2009

Distributed Data Structures: ConcurrentDistributedMap

Concurrent Distributed Data Structures?

Many challenges exist when developing a high scale multi-node application. Our goal at Terracotta is to take on those challenges in ways that remove them from the plate of those architecting and developing applications and place them squarely on our shoulders.

In order to accomplish such a lofty goal we first had to create some core pieces of infrastructure on which many higher order abstractions could be built. One such "piece" is our ConcurrentDistributedMap. This data structure is a fundemental piece of our Distributed Cache, our Hibernate product and our Web Sessions product and is also available for use in custom solutions for those using Terracotta as a platform.


Challenges and Tradeoffs

Developing a data structure that is Distributed as well as Concurrent and Coherent has very different trade-offs from developing for a single JVM. If one took a standard concurrent data structure like ConcurrentHashMap and just clustered it "as is" one would likely run into performance and memory efficiency issues. Even a really cool concurrent data structure like Cliff Click's Non Blocking Hash Map would not do well if one used the algorithms without thought in a coherent cluster.

The challenge is that the trade-offs change when you add the latency of a network and data locality in the middle of the game. In normal concurrent data structures you care about:

- How long you hold locks
- How much is locked while you hold it.
- CPU usage
- Memory Usage and Object creation

In the clustered case you add the following:

Lock locality - Is the lock you need already held on the local machine or do you need to go get it over the network. If you need to go get it how long does that take. While a little of the question of "How long does it take to get the lock" exists on a multi-cpu single machine it's not nearly to the same degree.

Data locality - Is the data I need already local or do I need to go get it. If I need to get it how long does that take

Data change rate - How much clustered data am I changing and how long does it take to send it around? Also, do I send it around?

Data size - In a clustered world one often uses data structures that don't fit entirely in a single node. One has to take pains to control the size and amount of the data in each JVM for efficiency.

There are other implementation specific/point in time issues like number of locks and their cost but those can mostly be optimized away at the platform level.


Single JVM ConcurrentHashMap

ConcurrentHashMap adds concurrency by collecting groups of entries into segments. Those segments are grouped together both from a lock perspective, they share a lock, and from a physical space perspective, all entries in a segment are generally in one collection. In a single JVM the only risk of sharing a lock between the entries is that one can contend on the in-memory speed look-ups. This is a very effective way to handle large numbers of threads making highly contended gets and puts to the map. If one runs into contention with this kind of data structures one can just up the number of segments in the Map.


Concurrent Map In A Clustered World

In a clustered world problems occur with a data structure like this. First, getting a lock or an object can be either in-memory speed or take many times in-memory speed depending on whether it has recently been accessed locally. In some cases this is no problem and in some cases it's pretty bad. It's also a space issue. If a segment is brought in as a whole and it's entries are in that segment strictly because of it's hashCode then the natural partitioning of the app's usage won't help save space by only loading the entries needed locally. Instead it will load the needed objects and anything else in it's segments. This elimenates the benefits of any natural or forced locality that occurs in a multi-node application.


Use-Case Analysis

In order to highlight some of the pro's and con's of CHM (ConcurrentHashMap) I'm going to vet it against a few use-cases.

Use-case 1 - An 8 node app sharing a clustered ConcurrentHashMap

All the data in the map is read only and it's used in all nodes evenly and the data fits entirely in a single JVM's heap.

GOOD NEWS! you will be fine with a regular clustered ConcurrentHashMap. Lets look at why.

1) All data will be loaded everywhere so unnecessary faulting (the act of pulling a data item into a node) won't be happening
2) All locks will be read locks and will be local everywhere so your latency will be nice and low (Due to greedy locks)
3) Won't have contention on the segments because reads are pretty much concurrent

Use-case 2 - The same as use-case 1 but now the map data is bigger than memory and you have a sticky load balancer.

Some good and some bad:

1) Since data is batched into segments by hash code and your load balancer hashes on something completely different than your map hashes on you will end up loading data into each node that is not needed. This is a result of the ConcurrentHashMap segmenting strategy.

2) Locks will still be fine because it's all read and read locks are very concurrent so segment contention won't be an issue.

So the memory manager may be doing unnecessary work and whether you will be in trouble depends on how big the ConcurrentHashMap is

Use-case 3 - Same as use-case 2 with the exception that now we are doing 50 percent writes. Something similar to caching conversations.

1) Still have the above problem of loading unneeded batches
2) But now, due to the writes, you are also maintaining the state of the objects that have unnecessarily poor locality in all the nodes where they don't belong.
3) Now you have a locking problem. While writing an entry to a segment you are blocking people in other nodes from reading or writing to that segment adding some serious latency. Plus the locks are getting pulled around to different nodes because even though your load balancer provides locality it is on a different dimension that of the internals of the map and is therefore not helpful.

Reviewing the problems highlighted by use case 3:

- Lock hopping leading to slow lock retrieval
- Lock contention due to grouping of multiple unrelated entries with locks.
- Faulting and Memory wasting due to unfortunate segmenting of data
- Broadcasting of changes or invalidations to nodes that shouldn't care


What did we do?

We built a specialty highly concurrent map tuned for distribution and the above challenges call ConcurrentDistributedMap.


Locking:
Instead of breaking things down into segments for locking we lock on the individual keys in the map. This gives the same correctness guarantees while giving the maximum concurrency. This drastically reduces lock hopping and contention and provides in-memory lock speeds most of the time.


Segmenting:
The segments go away completely. Key Value pairs are managed on an individual basis so no unnecessary faulting occurs.


Broadcasting and invalidation:
The above, plus an efficient memory manager means that values are only faulted into nodes where they are used. Since those values aren't in all nodes anymore invalidation and or broadcasting of changes for those entries is no longer needed.

This data structure takes excellent advantage of any natural partitioning that may occur at the application level.


Summary

Building a fast, coherent, concurrent, distributed data-structure requires thinking about an extended set of concerns. However, if one pays attention to the issues it is possible to create a highly useful solution. To learn more check out the ConcurrentDistributedMap described above.


Additional Reading:




For more information on Terracotta's distributed data structures one can always look here:


13 comments:

Ophir Radnitz said...

Great post, thank you.

How would ConcurrentDistributedMap behave in a single node in comparison with ConcurrentHashMap?

Ophir

Steve Harris said...

Thanks! We have some numbers. I'll try to dig them up and post them.

bob pasker said...

Oracle beat the crap out of sybase with it's "row-level locking Performs better in larger deployments" mantra. sybAse took 10 years to move from blocks to rows and by that time it was over.

Dimitris Andreou said...

I'm probably missing some key assumptions and realities of terracotta, but anyway. I don't see how locking individual objects could work, unless you canonicalize/intern them. Is this what's going on?

Steve Harris said...

That is correct. The first version we did was actually a ConcurrentStringKeyedMap and that worked off of essentially the string keys. Once we moved to ConcurrentDistributedMap with any class of key we use some tricks to essentially canonicalize objects with non-predictable hashcodes.

Dimitris Andreou said...

Care to elaborate some of those tricks then? Because it sounds a bit like a chicken-and-egg problem, from a cursory look. A concurrent distributed map could be used to canonicalize the objects so to make the concurrent distributed map work, but that would be problematic :)

kravets said...

Mr. DSO Guy

I am a huge fan of tim-concurrent-collections and about to use it to implement a fast hashtable-like lookup of objects in a massive POJO Graph with medium to long term persistence by reachability from a super static.

One missing feature is the "weak" variant of your excellent ConcurrentStringMap or ability to use WeakReferences in TC 3.0+

For details please see my post to TC Forums here:

http://forums.terracotta.org/forums/posts/list/0/2434.page

Steve Harris said...

Dimitris,

Sorry, been away for a few days. I did some digging because I couldn't remember. Their are 3 usecases that I can think of with keys:

1) Literal - this is the easy one because it's hashCode is defined by the JVM spec and will be stable across JVM. So nothing special to do here at all

2) User defined .equals() method and hashCode() method. This works fine as long as the calculated hashCode is based of of hashCodes of literals. The default behavior is to lock on the hashCode of the object so this will work fine

3) hashCode of key based on system identity hash code of not literal. For this usecase one has to define a custom locking strategy which is part of the API to ConcurrentDistributedMap

Hope that helps

Dimitris Andreou said...

Thanks Steve.

>The default behavior is to lock on the hashCode of the object

I have trouble understanding what you mean by this. Lock the hashCode? Do you mean locking an Integer object representing that hashcode?? (But that would require interning Integer objects of arbitrary size?!)

Cheers

Steve Harris said...

Don't think of it in terms of whether you can synchronize on a primitive. We have our own lock manager and the lock is essentially held for the integer primitive value.

Dimitris Andreou said...

When one locks an integer, using your lock manager, is it guaranteed to have a different lock? (i.e. no contention for locking different integers).

If yes, wow, lots of locks!

If no, this is lock striping, which is what ConcurrentHashMap does in the first place...

I hope my confusion is clear. I can't really interpret the passage:

>Locking:Instead of breaking things down into segments for locking we lock on the individual keys in the map. This gives the same correctness guarantees while giving the maximum concurrency.

Anyway, I understand I shouldn't use that much time of yours, so feel free to ignore this :) thanks anyway.

Steve Harris said...

Yes, it can end up being a LOT of locks but extremely high concurrency.

Dimitris Andreou said...

Ok. I merely express my disbelief, and I'm done: ints alone could need as many as 2^32 locks, if each one of them is mapped to a different lock. And obviously arrays can't even be than long, but only 2^31 - 1, and such an array would be several GBs worth of heap by itself. So even if you're not very direct about it, I regard it a very safe bet that you *are* using lock striping, to reduce the number of locks. Which is basically CHM's strategy as well. I don't see any reason why a plain CHM wouldn't attain as much concurrency as you claim by simply increasing the number of segments.

Thanks,
Dimitris