Wednesday, December 30, 2009

From Performance to Scale - A Software Story... Part I

When building most applications, whether with JEE, Spring, Grails or plain old Tomcat, there is a series of common tools that get used. HttpSession manages the state of the web portion, an ORM talks to the database, a cache holds intermediate application state and a scheduler makes things happen at a certain time, place or interval. Out of the box it is quite easy to get started with these components. However, as an application makes its way into production it often needs improved performance, followed by HA and then scale-out.

This blog is about a journey. It's about taking an application from performance all the way to scale-out without rewrites and using the components that you already have.

In the beginning it's just the Application

The good news is that whether you're a JBoss person, a Spring person, a Grails person or a Tomcat person, you're really using the same key components. Almost everyone in these environments is using Hibernate as an ORM, the HttpSession API from the Servlet spec, Ehcache for caching/performance and Quartz for scheduling. So while there is no official standard, the consensus is pretty clear. If it were me I would start from Grails, which is built on Spring, Ehcache, Quartz and Hibernate but doesn't require the two weeks of picking, deeply understanding and piecing together components that most other approaches require.

Round 1, Performance

So you've written your single node application using the usual components. Turns out, it's slow. Now what? Well the first thing you need to do is understand why it's slow at a high level. I usually check the following obvious things:

1) CPU bound - Check the CPU stats on all nodes (including the DB). Nowadays you also need to check whether a single core is maxed out while the others sit idle, which could mean the app needs more parallelism

2) GC bound - Use verbose GC (-verbose:gc) or your favorite tool

3) Database bottleneck - A number of tools and tricks exist to see if your database is the bottleneck. My favorite ways are:
- Track and monitor query times and/or look at the Hibernate statistics
- Take thread dumps. If all threads are waiting on DB calls something is wrong
- Check resource utilization on the database machine

4) I also like to isolate operations and see which ones are slow or create slowness. This can tell you a lot about where to focus.
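
Here is a rough sketch of how a few of these checks could be done from inside the JVM using only the standard library. It is a minimal illustration, not a replacement for a real profiler. The class QuickDiagnostics, its method names and the "com.example.jdbc." prefix are all made up for this example; you would substitute the package of your own JDBC driver.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper class; none of these names come from a real library.
class QuickDiagnostics {

    // Share of JVM uptime spent in garbage collection so far, clamped to [0, 1].
    static double gcTimeFraction() {
        long gcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                gcMillis += t;
            }
        }
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        if (uptimeMillis <= 0) {
            return 0.0;
        }
        return Math.min(1.0, (double) gcMillis / uptimeMillis);
    }

    // Number of live threads with at least one stack frame in a class whose
    // name starts with the given prefix (for example a JDBC driver package).
    static int countThreadsIn(String classNamePrefix) {
        int count = 0;
        for (StackTraceElement[] stack : Thread.getAllStackTraces().values()) {
            for (StackTraceElement frame : stack) {
                if (frame.getClassName().startsWith(classNamePrefix)) {
                    count++;
                    break; // count each thread only once
                }
            }
        }
        return count;
    }

    // Runs one operation in isolation and returns the elapsed wall-clock milliseconds.
    static long timeMillis(Runnable operation) {
        long start = System.nanoTime();
        operation.run();
        return (System.nanoTime() - start) / 1_000_000L;
    }

    public static void main(String[] args) {
        System.out.printf("GC share of uptime: %.3f%n", gcTimeFraction());
        // "com.example.jdbc." is a placeholder; substitute your driver's package.
        System.out.println("Threads inside driver code: " + countThreadsIn("com.example.jdbc."));
        long ms = timeMillis(() -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i);
            }
        });
        System.out.println("Sample operation took " + ms + " ms");
    }
}
```

Looking at stack frames rather than thread states is deliberate: a thread waiting on a database socket usually shows up as RUNNABLE in a thread dump, so counting BLOCKED threads alone would miss it.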

OK, so now you know why things are slow, and you're in the same boat as about 90 percent of the world: the database is the number one bottleneck.

Tuning the ORM

Many start by generating the schema with ORM tools. This usually ends up being a completely normalized database that requires expensive queries. At this stage of development it's almost certain that this is where your performance problems will be. First rule out the usual mistakes, like a missing or bad JDBC connection pool (use c3p0 and make sure you're seeing parallelism; many of the others are either useless or broken). The next thing to do is to try Hibernate 2nd level caching. If your application is read heavy this can have an amazing impact on performance. Entity caching is generally the way to start. The most common gotcha is the cache getting invalidated by custom SQL. It is a little known fact that if you execute custom SQL updates through Hibernate it clears the cache, since Hibernate can't tell which cached data the statement touched, making the cache useless. If you are going to do query caching you MUST READ THIS BLOG.
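
To make the pooling and 2nd level cache settings a bit more concrete, here is a sketch of the Hibernate 3.x style properties involved, built as a plain java.util.Properties so it runs without Hibernate on the classpath. The numbers are illustrative guesses, not tuned recommendations, and the cache provider class name depends on which Hibernate and Ehcache versions you run, so treat it as an assumption to check against your own setup. The class name CacheAndPoolSettings is made up for this example.

```java
import java.util.Properties;

// Hypothetical holder for settings you might hand to Hibernate's
// Configuration.setProperties(...). All values are illustrative only.
class CacheAndPoolSettings {

    static Properties hibernateProperties() {
        Properties p = new Properties();

        // c3p0 connection pool: setting these tells Hibernate to use c3p0.
        p.setProperty("hibernate.c3p0.min_size", "5");
        p.setProperty("hibernate.c3p0.max_size", "20");
        p.setProperty("hibernate.c3p0.timeout", "300");       // idle seconds before a connection is dropped
        p.setProperty("hibernate.c3p0.max_statements", "50"); // prepared statement cache size

        // 2nd level (entity) cache backed by Ehcache.
        p.setProperty("hibernate.cache.use_second_level_cache", "true");
        // Assumption: provider class name varies by Hibernate/Ehcache version.
        p.setProperty("hibernate.cache.provider_class", "net.sf.ehcache.hibernate.EhCacheProvider");

        // Leave query caching off until you understand its invalidation rules.
        p.setProperty("hibernate.cache.use_query_cache", "false");

        // Collect statistics so you can see cache hit rates and query times.
        p.setProperty("hibernate.generate_statistics", "true");
        return p;
    }

    public static void main(String[] args) {
        hibernateProperties().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Turning the cache on globally is only half of it. Each entity you want cached still has to be marked cacheable in its mapping, which is why starting with a handful of read-heavy entities is the easier path.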

Now you've got Hibernate 2nd level caching turned on and you've tuned it. The database isn't the bottleneck anymore, but you're seeing a lot of GC and thread dumps show contention on Hibernate itself, especially when displaying medium to large tables of data on the screen.

Plus, it turns out that updating the database on every intermediate operation is pounding the database's locks, CPU and disk. It's also made my schema really unwieldy. What do I do now?

Stepping up to Caching

If you still have performance problems after 2nd level caching, it is usually time to look at application level caching, which can be extremely helpful. If you're seeing:

a) A lot of garbage creation around calculating results
b) CPU usage or contention when getting information
c) Latency from IO when retrieving information from disk, a web service or a DB
d) Your database being pounded with fine grained updates of information that you only need for a short time

You'll likely want to start caching. Some things to cache include:

1) Heavily used reference data from the database - states, zip codes, product IDs etc.
2) Intermediate state - If a series of operations occurs but all you care about is the end result, cache until the end result is reached, then put it in the DB.
3) Heavily reused calculated values - "Like totally searching for Britney Spears."
4) Temporary values that are only needed for a set amount of time - user sessions or data that is needed at a certain time of day.
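
As a rough illustration of the reference data and temporary value cases, here is a minimal read-through cache with a time-to-live, written against the standard library only. In a real application you would more likely reach for Ehcache; this sketch just shows the idea. The class ReadThroughCache, the fake clock and the "name-of-" loader are all invented for the example, with the loader standing in for a database lookup.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical minimal cache: loads a value on first use, serves it from
// memory until the time-to-live runs out, then loads it again.
class ReadThroughCache<K, V> {

    private static final class Entry<T> {
        final T value;
        final long expiresAtMillis;

        Entry(T value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final ConcurrentMap<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // e.g. a database lookup
    private final long ttlMillis;
    private final LongSupplier clock;    // injectable so expiry can be tested

    ReadThroughCache(Function<K, V> loader, long ttlMillis, LongSupplier clock) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    V get(K key) {
        final long now = clock.getAsLong();
        // compute() runs atomically per key, so concurrent callers for the
        // same key trigger at most one load at a time.
        Entry<V> entry = entries.compute(key, (k, old) -> {
            if (old != null && old.expiresAtMillis > now) {
                return old; // still fresh
            }
            return new Entry<V>(loader.apply(k), now + ttlMillis);
        });
        return entry.value;
    }

    // Drop a key explicitly, e.g. after the underlying row changes.
    void invalidate(K key) {
        entries.remove(key);
    }

    public static void main(String[] args) {
        AtomicInteger loads = new AtomicInteger();
        long[] fakeNow = {0L};
        ReadThroughCache<String, String> states = new ReadThroughCache<>(
                code -> {
                    loads.incrementAndGet(); // stands in for a DB round trip
                    return "name-of-" + code;
                },
                1000L,
                () -> fakeNow[0]);

        states.get("CA");
        states.get("CA");
        System.out.println("loads after two reads: " + loads.get()); // prints 1

        fakeNow[0] = 1500L; // move past the 1000 ms time-to-live
        states.get("CA");
        System.out.println("loads after expiry: " + loads.get());    // prints 2
    }
}
```

The clock is passed in rather than read from System.currentTimeMillis() so that expiry can be exercised without sleeping. Note that this sketch never evicts entries that are not read again, so it only suits small, bounded key sets like the reference data above.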

Hopefully, between the Hibernate caching and tuning and the application data caching, you've now got most of your performance issues under control.

Your application probably does some scheduling and uses HTTP sessions, but these rarely create performance problems on a single node.

Part II - Now that it's performing, what about availability?
Part III - One node isn't enough anymore, I need scale-out
