Performance By The Numbers: June 2012

Wednesday, June 20, 2012

Bullet Graphs

My last few posts here have been high-level stuff. It's important to think about the big picture. But, it's also important to think about the little picture. So, this week, I want to talk a little bit about just that – little pictures. Graphs, actually. We're making a few changes in how we report some of our results. You'll be seeing these graphs in upcoming projects. So, I wanted to take a little time and describe how they work and what patterns we can see in them. They have a lot more power than it would seem, for something this simple.

The Definition

The bullet graph was developed by Stephen Few. Few is a consultant in the area of business intelligence and data visualization. What he was trying to build was a way to visualize a data set that would play nice with how we see and perceive information and still pack a lot of information into a small space. (The circular dials that are so common in the dashboard metaphor, pretty much, do neither of these things.) His web site (http://www.perceptualedge.com) is well worth some time to read.

But, I digress. Let's talk some more about bullet charts. I've cooked up a few examples of different things that we see in the charts the way that we present them. The data is properly randomized, so we'll get realistic fluctuations, instead of pristine curves.

There are 4 primary elements to a bullet chart.

1. The scale. For the bullet graph to work, there'll always be a numeric scale attached. Usually it starts at 0. But, for our examples today, it runs from 3-18.

2. The background. The background on a bullet chart is important. That is where we encode the "good/better/best" sort of information that we need to make sense of it. These graphs are tuned to be readable, even when printed. The light grey color is SLA. The dark grey color is the goal for that measurement. On this graph, and all of today's examples, we have an "SLA" of 16 and a goal of 5.

3. The measure we are interested in. That would be the thin, black bar down the middle. Sometimes this bar will be red. We use the black bar for the 95%-ile response time. This is a time when being "outside the box" is not a good thing. In the example above, we would be over SLA.

4. The secondary measure. This is the white diamond – the level bubble – floating in the middle of the black bar. We put the median on the bubble.
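For anyone who wants to see the arithmetic behind items 3 and 4, here is a minimal sketch in Python. The function name `bullet_summary`, the status labels, and the sample data are all made up for illustration; only the goal of 5 and the SLA of 16 come from the examples in this post. It computes the two measures we plot (the median and the 95%-ile) and says which background band the 95%-ile bar ends in.

```python
import statistics


def bullet_summary(samples, goal, sla):
    """Return the two measures a bullet graph plots, plus a coarse status.

    The status labels are illustrative, not part of any standard.
    """
    median = statistics.median(samples)
    # quantiles(n=20) returns 19 cut points at 5%, 10%, ..., 95%;
    # the last one is the 95th percentile.
    p95 = statistics.quantiles(samples, n=20)[-1]
    if p95 <= goal:
        status = "meets goal"
    elif p95 <= sla:
        status = "within SLA"
    else:
        status = "over SLA"
    return {"median": median, "p95": p95, "status": status}


if __name__ == "__main__":
    # Made-up response times on the same 3-18 scale used in this post.
    times = [6, 7, 7, 8, 8, 9, 9, 10, 10, 11,
             11, 12, 12, 13, 14, 15, 16, 17, 17, 18]
    print(bullet_summary(times, goal=5, sla=16))
```

Note that `statistics.quantiles` interpolates between data points, so other tools may report a slightly different 95%-ile for the same data.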

That seems simple enough, right? The fact that it is so simple is one of the things that I like about this particular chart. You can process "good", "bad" or "say what?" in a blink, without having to parse monstrous tables of data. One of the great ironies of data visualization is the notion of "pretty". People fret over "why do I need to spend time making a pretty chart?" We don't. Not really. What we need is simple and clear. It just happens that things that our brains scan as "clear", they also scan as "pretty". Go figure.

Patterns

Let's take a look at a couple of examples and see what else we can see in these little charts.

Our first chart, Data Set A, is pretty typical in a number of ways. We have our goal and requirement bands in there. The 95%-ile bar, though, is right at the requirement. That is something that we would want to investigate further. The other thing to notice is that the median bubble is riding right about the middle of the chart. Let's take a look at a histogram of the data along those same buckets and see how it looks.

Here, I've taken the histogram of the data that went into building the bullet chart and aligned it with the bullet chart itself. The green and red lines just help us to see where the two graphs align. We know that half of the data has to be below the bubble (and the green line) – that's the definition of median. I promised random data, with all of its warts and bumps, and here it is. There are a few bumps in the histogram. But, overall, it has the sort of shape that we would expect from a normal distribution – big hump in the middle, median line bisects the hump, tails on both ends. We have a few items above the 95%-ile, just like we would expect. Overall, this is the sort of pattern that you'd expect to see.
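If you want to try the same alignment on your own data, a rough sketch follows. It assumes one-unit buckets across the 3-18 scale used in today's examples; the function names and the randomly generated sample data are mine, not something from our reporting tools. It counts the measurements into buckets along the bullet graph's scale and marks the bucket that holds the median.

```python
import random
import statistics


def bucket_counts(samples, low, high, width=1):
    """Count samples into fixed-width buckets spanning [low, high)."""
    n_buckets = int((high - low) / width)
    counts = [0] * n_buckets
    for x in samples:
        # Values outside the scale are clamped into the first or last bucket.
        i = min(max(int((x - low) / width), 0), n_buckets - 1)
        counts[i] += 1
    return counts


def print_histogram(samples, low=3, high=18, width=1):
    """Print a sideways text histogram and flag the bucket holding the median."""
    median = statistics.median(samples)
    for i, count in enumerate(bucket_counts(samples, low, high, width)):
        start = low + i * width
        marker = " <- median" if start <= median < start + width else ""
        print(f"{start:>4}-{start + width:<4} {'#' * count}{marker}")


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the picture is repeatable
    data = [rng.gauss(10, 2.5) for _ in range(200)]
    print_histogram(data)
```

With roughly normal data like this, the median marker should land in or near the tallest row, which is the "median bisects the hump" pattern described above.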

So, let's take a look at a couple of not-so-normal patterns.

We've seen this graph before. Right away, we would want to look into a transaction that looked like this because it is over SLA. But, there is something else interesting here as well. Look at the median bubble. See how it is riding way to the right? That's a pattern that warrants some investigation. So, let's put it beside its frequency distribution graph and see what we see.

Now, we can see something more interesting is going on. To begin with, the histogram is skewed to the right. The green line still bisects that graph; it has to. But, our measurements have a definite floor. Nothing measured here was less than 8. Why is that? That sort of pattern is characteristic of some sort of timeout or other failure going on. Find what is causing THAT, and you'll make a big change in this result.

Let's take a look at one more.

Under normal circumstances, a transaction with this profile would not get investigated. It is below SLA, and the median looks pretty good. Odds are that any reasonable test would turn up much more interesting things to investigate than this. Still, the pattern underlying this graph is something that we see often enough in things that don't meet SLA. So, let's take a look at it and see what we see.

Notice how the curve is skewed to the left. Sure, half to the left of the green line and half to the right. We know that. But, the big grouping is to the left. The ones to the right are "like too little jam spread over too much bread". When we see this sort of pattern, it always means that there is something fundamentally different about the transactions that fell in the big grouping to the left as opposed to the thin group to the right. What's the difference? What is it that is either making the one group clump, or spreading the other group out? If you can find that answer – and it is usually just one variable – then you really have a handle on this transaction and on how to either improve the transaction or improve the test.

Friday, June 15, 2012

Riding The Line Ride

A while back, my family and I spent a couple of weeks in Orlando, hanging out at Disney World. We hadn't been since my two oldest boys were toddlers (they are in college now). It was a blast. Of course, the rides and shows and stuff at any amusement park are just the punctuation in the story. The majority of the story is about riding the Line Ride.

You know the Line Ride, right? It is the one that you get on before you get on anything else. There is one out in front of every coaster, show or water ride. It must be the most popular thing in the park, because no matter how hot the day or crowded the building, we still hop right on the next Line Ride we see.

On our first day in the park, we were standing in line and it set me to wondering if two cars on one coaster track really qualified as a single queue/2 server model, or did the fact that there was only 1 track, and one car travelling at a time, mean that it really should be single queue/single server and the second car was just an optimization for loading and unloading the track. I had just about concluded that it should be single queue/single server when a more important question occurred to me – what sort of sicko stands around at Disney World thinking about queuing models?

The question has been bouncing around in my head ever since I got back. (The first question, that is, not the second one.) So, to finally exorcise this notion once and for all, I thought I'd put together a few quick thoughts on queuing and why we get so worried when we see even a little bit of it.

The Simple View

Without fail, when you talk about queuing theory, what comes out next is a whole rack of fancy equations, and a lot of interesting simulations based on a bunch of assumptions. For our purposes, a much simpler approach will, I think, be clearer and just as effective at what we need to get at.

First, let's think about the thing that we are modeling. I'm going to talk about an oversimplified model of a basic web server. The life cycle of a request looks something like this:

1. A request arrives

2. If we have a processor available, it gets processed. If not, it gets queued.

3. Our request gets processed.

4. A reply is sent out.

Each of these steps has a parameter or two, and a couple of assumptions that we'll use to simplify our model. The first of these is the rate at which requests arrive for us to process. We'll call the number of requests that arrive per second our arrival rate. Requests arrive at random. But, the equations let us work with the average arrival rate. So, here is the first variable that we can set for our model: the arrival rate.

The next step involves our queue itself. The simplest form of the queuing equations assumes that there is an infinite queue. But, that is never really the case. So, let's set our maximum queue length to 100 entries. As we'll see in a bit, even a small queue like that is more than enough to see the effect of even a little queuing. Having a finite queue introduces another thing that we want to track, though. That is, if a request arrives and the queue is full, then it gets an error response. In queuing parlance, the critical measure here is the number of requests that "balk". For a web server, this is what it means when we get an HTTP 503 response. So, there are two measures that we'll want to look at here: the average queue length for any given arrival rate, and the % of requests that get an error because the queue was full.

Our third step is the request getting processed. Here we have two things that we want to think about. The first is how long does it take to process an average request? Just to simplify things, let's say that in this model we can process one request in one second. It's an average – some will go faster and others slower. But, it is a simple assumption that we can make and still keep things fairly general. The second question to ask is the one from the roller coaster line – how many "servers" do we have? By servers, we don't mean physical Windows machines, but things servicing requests. That is, how many processors are we looking at? A typical web services machine is configured with 4 CPUs. So, let's call it 4 servers for our model. That means that while it takes 1 second to process 1 request, we can still process up to 4 requests per second because they are independent and running on separate servers. Our usual measurement here is Utilization. That is, what percent of the time is the system busy working on a request.

Our final step, sending the reply, isn't really a time that we measure. It's just our marker for when we are done with this request. So, there is nothing to do there.

We have one measurement that is left that we want to talk about, and it is the most common, most talked about one of all. That is Response Time. Response Time is the total amount of time that a request spends in the system – including the processing time and the amount of time it spends waiting on a queue somewhere.

Before we look at the results, let's review. Our requests take one second each to process. We have one queue feeding 4 processors. We have a maximum queue length of 100 entries. (For anyone who wants to dig deeper into the equations and models and the like, all of this means that ours is rightly specified as an M/M/4/104 arrangement.) Our key measures are Response Time, Utilization, Queue Length, and the % of requests that end in a queue full error. What we want to do is to vary the average arrival rate of our requests and see what impact this has on our key measures. (By the way, the queuing equations give average results. That is to say, average queue length, utilization and the like. That is why we talk about "sustained queuing" as opposed to a few things showing up on the queue for a moment and then being cleared.)

Results

So, let's start our arrival rate at 1 request/second. We'll increment it from there by 0.5 r/s and see where it takes us. The results of all of that come out like this:

Arrival Rate (req/s)	Response Time (s)	Utilization	Avg. Queue Length	% Queue Full Error
1	1	25%	0	0%
1.5	1	37.5%	0	0%
2	1.1	50%	0	0%
2.5	1.2	62.5%	0.5	0%
3	1.5	75%	1.5	0%
3.5	2.5	87.5%	5	0%
4	13.4	99%	49	1%
4.5	24	100%	92	11%
5	25	100%	96	20%
5.5	25.3	100%	97	27%
6	25.5	100%	98	33%
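For anyone who wants to play with the numbers, here is a minimal sketch of how a table like this can be computed. It assumes the textbook steady-state formulas for a queue with random (exponential) arrivals and service times, 4 servers and room for 104 requests in total, plus Little's Law for the response time. The function name `mmck` and its parameter names are my own. Rounded, its output should land on or very near the values above.

```python
from math import factorial


def mmck(arrival_rate, service_rate=1.0, servers=4, queue_cap=100):
    """Steady-state averages for an M/M/c/K queue (K = servers + queue_cap)."""
    c = servers
    k = servers + queue_cap
    a = arrival_rate / service_rate      # offered load
    rho = a / c                          # offered load per server

    # Unnormalized probability of having n requests in the system.
    weights = []
    for n in range(k + 1):
        if n < c:
            weights.append(a ** n / factorial(n))
        else:
            weights.append(a ** c / factorial(c) * rho ** (n - c))
    total = sum(weights)
    probs = [w / total for w in weights]

    p_full = probs[k]                    # an arrival finds the queue full
    avg_queue = sum((n - c) * p for n, p in enumerate(probs) if n > c)
    avg_busy = sum(min(n, c) * p for n, p in enumerate(probs))
    accepted_rate = arrival_rate * (1.0 - p_full)
    # Little's Law: time in system = average number in system / throughput.
    response_time = (avg_queue + avg_busy) / accepted_rate

    return {
        "response_time": response_time,
        "utilization": avg_busy / c,
        "avg_queue_length": avg_queue,
        "p_queue_full": p_full,
    }


if __name__ == "__main__":
    for i in range(11):
        rate = 1.0 + 0.5 * i
        r = mmck(rate)
        print(f"{rate:4.1f}  {r['response_time']:5.1f}  {r['utilization']:6.1%}  "
              f"{r['avg_queue_length']:5.1f}  {r['p_queue_full']:5.1%}")
```

The response time here is averaged over accepted requests only. The ones that get the queue full error never spend any time in the system.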

With an arrival rate of 1 request/second, we have 1 second response time, no queuing and the system is only 25% utilized. That is to say, there are 4 processors available to take care of, on average, only 1 request at a time. 3 of them are sitting idle and there is no reason for any request to wait. The response time is just the amount of time it takes for the request to get processed.

Between 1 and 2.5 requests/second, our utilization and response time are pretty close to linear. There is still no noticeable queuing and the system isn't looking all that busy.

But, watch what happens when we get to 3 requests/second. All of a sudden, we've got sustained queuing. Our response time is up to 150% of where it started. But, the system is only 75% busy. What's going on here? Remember that our requests arrive at random. So, there will be times when a bunch arrive and get queued up. And, there will be lulls in the arrival rate when we have a chance to drain the queue that built up in the busy time. Even when the system is averaging at only 75% of capacity, the quiet times are coming too infrequently for us to keep the queue drained.
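If the "bursts and lulls" explanation feels hand-wavy, a small simulation makes it concrete. This is only a sketch under my own assumptions: exponentially distributed gaps between arrivals and exponentially distributed service times, first-come-first-served, and no queue limit (at 3 requests/second a 100-entry queue should almost never fill anyway). The function name and its parameters are made up for illustration.

```python
import heapq
import random


def simulate(arrival_rate, service_rate=1.0, servers=4,
             n_requests=50_000, seed=1):
    """Simulate a first-come-first-served queue feeding several servers.

    Returns (average wait in queue, fraction of requests that had to wait).
    """
    rng = random.Random(seed)
    free_at = [0.0] * servers        # min-heap of times each server goes idle
    now = 0.0
    total_wait = 0.0
    delayed = 0
    for _ in range(n_requests):
        now += rng.expovariate(arrival_rate)      # next random arrival
        earliest = heapq.heappop(free_at)         # first server to free up
        start = max(now, earliest)                # wait if every server is busy
        if start > now:
            delayed += 1
        total_wait += start - now
        heapq.heappush(free_at, start + rng.expovariate(service_rate))
    return total_wait / n_requests, delayed / n_requests


if __name__ == "__main__":
    for rate in (1.0, 2.0, 3.0, 3.5):
        wait, frac = simulate(rate)
        print(f"{rate} req/s: avg wait {wait:.2f}s, {frac:.0%} of requests queued")
```

In my understanding of the theory, at 3 requests/second roughly half of the requests should find all four servers busy, even though the servers are idle a quarter of the time on average.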

Intuitively, you'd think that 4 requests/second is a full load and that at that rate, the machine would stay busy but there wouldn't be a lot of queuing. This is one of those times when the intuitive answer isn't even close. Sure enough, we have almost 100% utilization at that load. But our "1 request takes 1 second" notion is no longer even close to our response time. Requests still average 1 second to process, but the average response time for those requests is 13.4 seconds. All that remaining time is spent sitting on a queue. With an increase of only 0.5 requests/second, we've gone from an average queue length of 5, to 49 – almost an order of magnitude.

You can see from the table how bad the response times get as we increase the arrival rate beyond the rate that we can process them. But, rather than talking through those, let's look at them graphically. Response time first….

Notice how our curve looks nice and flat at the start, then rises rapidly, even with fairly small increments in the arrival rate. With a few more data points, and a much longer maximum queue, you'd be able to see just how steep this hockey-stick curve gets: in the textbook infinite-queue model, the waiting time grows in proportion to 1/(1 - utilization), which heads toward infinity as utilization approaches 100%. But, also notice how the response time seems to flatten out after a bit. We'll talk about that more in a moment.

Utilization isn't surprising at all. It increases linearly with the incoming load, and then flattens out. After all, you can't use more than 100% of something.

Average queue length follows the same sort of hockey-stick curve that response time does. Nice and flat, riding right at zero until we hit that magic threshold, then it zooms up. Queue length, though, flattens out just the same way that response time does. But, it is flattening out as it approaches the maximum length of the queue. And, right there is where we get our clue to why those two curves flatten out. How many people are in line ahead of you is one of the key factors in how long you'll be waiting in line. So, sure, the response time curve will flatten out when the average queue length starts to hit its maximum. But, a maximum response time is not exactly a cause for celebration. You see, a full queue means that requests are getting turned away.

What surprised me the first time I saw all of this was how low the utilization is when queuing starts, and how quickly queuing impacts response time. Basically, response time stays pretty flat and resource utilization follows a nice, linear sort of curve right up until the point where we start queuing. After that, it changes fast. If you really want to have good response time (including enough capacity to recover from the occasional burst), the sweet spot is at about 60% utilization. If our transactions take longer to process, it just means that we have to take them at a lower rate. Queuing will still start at around 60% utilization. That'll just happen at a lower arrival rate than it would if the transactions ran quicker.

Friday, June 8, 2012

The Power Of Context

Here's a silly idea: a machine that can calculate mathematical functions. I'm not talking about addition and subtraction here. I'm talking about hard stuff like logarithms and trig functions.

In the 21st century, the only thing silly about that last paragraph is the suggestion that it is a silly idea. Now, you can walk into any department store and, for about what it would cost you to go out to lunch, buy a thing the size of your hand that will do all of that and more. But, in 1837, at a time when the steam engine was one of the most complex machines around, a "difference engine" was crazy talk. (In fact, that's exactly what they thought of poor Charles Babbage. It wasn't until very recently that someone was able to build a working Babbage Engine.)

The point I'm trying to make here is that the value of an idea (or a process or a test) depends on the context in which it resides. In the context of the early 19th century, a computing machine was too radical to be believed. In the context of the early 21st century, it is so commonplace as to be outright boring. But, right in between, somewhere around the 1940s, that's the context in which it was genius.

The Context-Driven School of Testing is a collection of principles, first collected by Cem Kaner, James Bach and Brian Marick. Those principles are:

1. The value of any practice depends on its context.

2. There are good practices in context, but there are no best practices.

3. People, working together, are the most important part of any project's context.

4. Projects unfold over time in ways that are often not predictable.

5. The product is a solution. If the problem isn't solved, the product doesn't work.

6. Good software testing is a challenging intellectual process.

7. Only through judgment and skill, exercised cooperatively throughout the entire project, are we able to do the right things at the right times to effectively test our products.

They go on to write "The essential value of any test lies in its ability to provide information (i.e. to reduce uncertainty)". And that is the notion that I want to focus on in today's post. (Check out the link above. Their examples are well worth a read.)

We Performance folks have all manner of tools and techniques at our disposal. We have load generators and databases and reporting utilities and spreadsheets and homegrown whatnots of every shape and flavor. We can generate more tables of numbers than anyone could ever pound through. But, what was the question? If they aren't answering some essential question, they are just 1's and 0's.

The basic question, and the one that everyone thinks of first, is "Does this application meet SLA?" Sure. That is the big question. But, there is more to it than that. I'll argue that anyone with any experience testing a given application will be able to set up a performance run in which the application will not meet its response time SLA. Hit it too hard. Hit it with too many of one particular transaction. Arrange the data for one screen in just the right way. We've all seen enough to have a few dirty tricks up our sleeves.

But, hold on a minute. What is the difference between a "dirty trick" and a "clean kill?" To get to what we consider a reasonable test, we need to understand and appropriately control for the variables that we know about. We have to hit the application at the right rate. We have to get the mix of transactions right. We have to know that the data being used is about what the application can expect to see in Production. When designing and scripting a performance test, we put a lot of effort into modeling user behavior and into calibrating our scenarios to be sure that we know what the correct settings for these variables are and that we are generating a load that will match that.

Which brings us to another question: how confident are you in that model? If an application is in Production, we can use web logs or application logs or other such measures to create our model for user behavior. In that case, we can be pretty confident that we are modeling how users actually use an application. New applications, or others where we don't have those independent logs, leave us with projections and other guesses. How confident are we in those? They are much better than nothing. Still, one thing that we can count on is that the real world is much stranger than any sane person would think up. No matter how careful our projections are, real users will find a different way to use the application.

I tested an application, once upon a time, where the projections included that X tasks would be created in a day. It went on to say that 40% of those created in a day would then be edited. Marking a task complete counted as editing it. It took a couple of weeks of testing the application daily to create a backlog of tasks that buried the response time of the application. (The requirement was later clarified to say that closing a task was a different activity, and that no backlog should be created.) The context of the application usage impacts the requirements, the design of the application and its tests, and the final results.

But, it took a couple of weeks to build up enough of a backlog for that to matter, which brings us to another question that we need to answer. We say "the test ran within tolerance" when talking about how well we hit the various rate and mix targets. What we're talking about is that if the goal is 1000 transactions in an hour, hitting 998 or 1002 does not invalidate the result. Or does it? Where is the boundary between "clean kill" and "dirty trick"? Typically, we'll treat this question as a matter of how well the scenario did what we intended it to do.

But, when we are uncertain about our usage model (and, let's face it, every model has some uncertainty to it), we have to consider a second meaning to the "What is within tolerance?" question. Using the example above, if 1002 is acceptable, how about 1020? Or 1200? Or, maybe it doesn't break until we push 2000 in an hour. In other words, we have to ask "How sensitive is the application to deviations from the model?" And, "What is the risk of the application seeing the conditions that it is sensitive to in Production?" Those are the questions that lead us into the more interesting scenarios. To answer those questions, we have to take all of those variables we were talking about earlier – like hit rate and mix and data size and user counts and everything else we uncover along the way – and we have to start spinning those dials.
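To make "spinning those dials" a little more concrete, here is a rough sketch of a sensitivity sweep over one dial, the hourly transaction rate. Everything in it is hypothetical: `first_breaking_rate` is a name I made up, and `toy_response_time` is a stand-in for kicking off a real test run and reading back a measurement. The idea is simply to step the rate upward and note where the result first crosses the SLA.

```python
def first_breaking_rate(run_test, rates, sla):
    """Return the lowest rate whose measured result exceeds the SLA, or None.

    run_test is any callable that takes a rate and returns a response time.
    """
    for rate in sorted(rates):
        if run_test(rate) > sla:
            return rate
    return None


def toy_response_time(rate, capacity=2000.0, base=1.0):
    """Hypothetical stand-in for a real test run.

    Response time climbs steeply as the rate nears a made-up capacity.
    """
    if rate >= capacity:
        return float("inf")
    return base / (1.0 - rate / capacity)


if __name__ == "__main__":
    candidate_rates = [1000, 1002, 1020, 1200, 1500, 1800, 2000]
    breaking = first_breaking_rate(toy_response_time, candidate_rates, sla=4.0)
    print(f"First rate over SLA: {breaking}")
```

The gap between the modeled rate and the first breaking rate is one way to express how much error in the usage model the application can tolerate. The same loop works for any other dial (mix, data size, user count) as long as you change one at a time.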

The Question

We start with the simple, obvious question: "Does this application meet SLA?" We know that there are circumstances under which it will not meet its response time requirements. And, we know that there will be circumstances where it will meet those requirements. We have our projections and our models for how the application will be used, and we generally have an idea where those models are soft. In short, we take our original question and adjust it to account for the context in which the application will run. In the end, the question that we are asking is one word (and a world of meaning) different from our original question. In the end, we are asking,

WHEN does the application meet SLA?

Friday, June 1, 2012

Of Slinkys And Bathtubs

You think that because you understand "one" that you must therefore understand "two" because one and one make two. But you forget that you must also understand "and".
-- Sufi teaching story

Have you ever been hit by one of those ideas that just changed the world? I don't mean an "interesting insight". I mean one of those things that so profoundly shifted the way you look at things that you never go back. I want to share an idea with you today that did that for me. It seems a simple thing, at first. The best ones always do. But, follow the thread all the way to the end. Its implications are powerful stuff.

Very early in my career, my first mentor handed me a copy of a book. He smiled his subversive smile and said "Read this." Jack Anderson, my mentor, was a particularly easygoing fellow. A recommendation like "read this" was akin to walking down a mountain with a pair of tablets. The book was "An Introduction To General Systems Thinking" by Gerald Weinberg.

The Slinky

We start with a slinky. As a kid, this was just one of my all-time favorite toys. I can still idle away hours just sloshing it back and forth from one hand to another. Now, we could break this slinky down and talk all about Hooke's Law and spring tensions and such. But, we don't need to do that. It's just a slinky.

Take your slinky and hold the top half of it still with one hand. Put your other hand, I'll call it the left, under the slinky. Move your left hand. The slinky will bob and bounce up and down for quite a while. Fun game. What made the slinky bounce like that? The temptation is to say it was moving your hand. But, try the same experiment with the box the slinky came in, and you'll see your hand really had nothing to do with it. The bounce came from a property which is intrinsic to the slinky. Your hand was an external force which served to either inhibit (when it was there) or allow (when you moved it) the expression of an innate behavior of the slinky.

Slinkys are like Tiggers. They are always bouncy. But, sometimes, they are between bounces.

The Bathtub

Let's take another example: a bathtub.

Imagine a bathtub. Any bathtub will do; I like the big claw foot monsters like my old house had. To complete this experiment properly, we may want to imagine an ample supply of towels as well. Fill the bathtub about half way up. Now, leave the water running and pop the plug out of the bottom of the bathtub.

One of three things will happen next. If the drain is taking water out faster than the spigot is putting it in, the tub will eventually empty. If the water is coming in faster than it is going out, it will eventually overflow. Or, if we fiddle with the knobs very carefully, we can reach a state called dynamic equilibrium where the level of the water stays just the same, even though the water itself is constantly moving through the tub.
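Here is a toy version of the experiment, for those who would rather not mop. It is a deliberately crude sketch: I assume constant inflow and outflow rates (a real drain runs faster when the tub is fuller), and the function name, units and numbers are all made up. It steps the water level forward in time and reports which of the three outcomes we get.

```python
def run_tub(inflow, outflow, level=50.0, capacity=100.0,
            dt=1.0, max_steps=10_000):
    """Step a bathtub's water level forward and report what happens.

    Returns (outcome, steps taken), where outcome is "empties",
    "overflows" or "equilibrium".
    """
    for step in range(1, max_steps + 1):
        # Net change is whatever the spigot adds minus what the drain removes.
        level += (inflow - outflow) * dt
        if level <= 0.0:
            return "empties", step
        if level >= capacity:
            return "overflows", step
    # The level never hit either limit within the time we watched.
    return "equilibrium", max_steps


if __name__ == "__main__":
    print(run_tub(inflow=2.0, outflow=3.0))   # drain wins
    print(run_tub(inflow=3.0, outflow=2.0))   # spigot wins
    print(run_tub(inflow=2.0, outflow=2.0))   # dynamic equilibrium
```

Notice that even when the spigot or the drain "wins", nothing visible happens for quite a few steps. The imbalance is there from the first moment, but the symptom shows up much later.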

The Big Insight

Our bathtub and our slinky are systems. Each has parts – the spigot, the drain, and the basin, for instance. Each of the parts interacts with the others in particular ways (the spigot puts water into the basin, the drain takes it out). And, the net behavior of the system depends on all of the parts, and the individual interactions. Studying just the spigot won't explain the behavior of the system any more than studying only the drain will. You have to understand the system, as a system, in order to understand it at all.

But, looking at the big picture, there are some interesting behaviors that we see.

Let's think about when the tub overflows. We can get towels and a mop and even a ShopVac to clean it up. But, the water on the floor isn't the problem. It's just a symptom. (If all we do is clean up the water, we'll always be cleaning up the water because we haven't done anything about where the water is coming from.) The real problem, the thing we need to fix, happened when the dishwasher downstairs shut off. This removed a drain on the house's water pressure, allowing more water to flow to the tub. Now, our perfect equilibrium setting is delivering just a bit too much water. We didn't see the symptom then, because the water level was just half way at that point. One important aspect of systems like this is that the symptoms, the problems, may not show up until long after the occurrence of their root cause. And, that root cause may fall in one of the limiting factors, which is itself a side effect of a completely different system.

To fix this overflow, we can turn the spigot down. But, even if we turn it off entirely, the tub will not immediately be empty. It will take some time to drain. We can't dial the spigot to its perfect equilibrium level at this point – or our equilibrium point will be just at the spilling over stage. We have to dial it down to well below optimum, let it drain, then ease it back up gradually. Another important aspect of systems is that they have inertia. For this reason, you can't just set a value and be done. They require care and tuning and either constant adjustment or internal feedback mechanisms to deal with changing conditions.

Let's review for a moment. Inputs, outputs, queues in between, systems with inertia, constant tuning, problems that show up long after their root cause has come and gone, issues where the root cause of a problem may not lie in the machine or software that exhibits the problem at all... does any of this sound familiar? From where I'm sitting, systems thinking is the very heart and soul of performance work.

Further Reading

Books, theses, dissertations…whole libraries have been written on the topic of systems thinking and analyzing things from a systems point of view. Many of them are particularly dense works. But, I've found two to be decent and interesting reads. One is the original Weinberg book, above. But, even its silver anniversary reprint edition can be a little hard to find sometimes. The other, the one I'm reading now, is "Thinking In Systems" by Donella H. Meadows. Using simple things like bathtubs and used car lots, she communicates the important points along with some effective tools for dealing with them.

I can, and at some point probably will, carry on for hours about some of the tools and techniques that come with the systems approach. But, for now, I'll just give time for the idea to take root. Let me know what you think.