Tuesday, May 06, 2008

An Optimization For Garbage Collectors...

For the last few days I have been thinking a lot about GC as Terracotta moves towards our first major rewrite of that subsystem. Lots of relatively large changes have been bouncing around in my head as I read papers, blogs and talk to people. Maybe I'll blog about those later but one pretty simple one occurred to me. I have a theory that most Shared Objects are actually only directly referenced by one parent object (I haven't run stats on this yet so I might be full of it). I started from wondering whether we could take advantage of this to improve the efficiency of GC. Here is what I came up with:

  • We can keep a Set of Object ID's for objects that only have one direct reference to it. We have an implementation of a compressed Set of ID's so this can be quite space efficient.
  • If an Object gets a second reference to it then it is removed from that Set
  • If that one reference is removed and in the Terracotta world that object is not reachable from a client or in a non-terracotta world it is not reachable from the stack then the object is garbage.
  • If an Object has no references but is reachable from the stack or is still on a client then add it to the no-refs Set so that when those two things are no longer true they can be marked as garbage or if the object is re-referenced it can be accounted for properly.
  • You can also recurse through the objects that the new garbage object referenced doing the same check.
One might be wondering, "Does steve think he just invented reference counting?" Nope, I don't and I haven't decided if this idea is any better than just having a first phase of garbage collection based solely on reference counting. I'm just theorizing that it might be. I don't even know that I invented this shortcut. Most real world GC's are hybrids of multiple approaches that best fit the set of restrictions and limitations faced in the environment. This one seems like it could improve things significantly in real world apps and potentially drastically reduce the required frequency of full GC's in our world without adding too much overhead.

Anyway, blast away :-)


Daniel Ehrenberg said...

Well, you can't have garbage collection based solely on reference counting, because then you don't collect inter-JVM cycles. If you didn't feel like collecting cycles, you could also just maintain a remembered set on each JVM of inward-pointing references, and use these as roots. (I guess these amount to the same thing, when you come down to it.)

Anyway, the biggest problem I see with this approach is that you have to store this big set somewhere, if I understand what you're saying. Maintaining this set would require a lot of communication. But you're talking about, in a broader sense, a sort of hybrid between reference counting and tracing collection.

http://citeseer.ist.psu.edu/plainfosse95survey.html has lots of good information about the state of the art (as of 13 years ago, but still relevant) in distributed GC algorithms. It's much easier to read than the corresponding chapter in the blue book. One interesting approach, which I don't completely understand yet, is weighted reference counting, which avoids sending increment and decrement messages around the network (as your algorithm would require). This can, then, be combined with certain tracing techniques. The easiest is that a tracing collection is run when the heap fills up, but that paper describes a number of better options that don't pause the whole system.

Daniel Ehrenberg said...

Also: how do objects get removed from the set if a server crashes?

Steve Harris said...


A few points:

This isn't a replacement for the current garbage collector. This is a shortcut for collecting a certain class of garbage. Would allow us to run the full GC less often.

Compressed sets (sets that take advantage of ranges) can be quite small, and we can also persist it in a btree.

This has nothing to do with Distributed garbage collection. This is for improving the single node case. It can be extended to the multi-node case by using the same strategies we'll use for multi-node GC (remember sets etc)

thanks for the feedback

Steve Harris said...

If the set is non-persistent then everything is removed if the server crashes :-).

Though it would actually better if server crashes don't impact the set. We want it to be long lived

Steve Harris said...

Oh, and thanks for the link! Will check it out.