Thursday, May 31, 2012

Taming A CPU Report


We are overwhelmed by information, not because there is too much, but because we don't know how to tame it. Information lies stagnant in rapidly expanding pools as our ability to collect and warehouse it increases, but our ability to make sense of and communicate it remains inert, largely without notice. (Stephen Few, Perceptual Edge)
That quote just blew me away the first time that I read it. We have loads and loads of data – numbers, measurements, other stuff from all manner of sources. But, how do we make sense of it? How do we distill it, removing the bits that we don't matter, without losing any of the parts that we really need to be worried about?
This series is going to be about how to make sense of a few key measurements. These are all things we collect, or can collect, from a perfmon log. Some of what I want to talk about is patterns that we've seen in Stress or in Production. I straddle an interesting line when it comes to analysis of things like this. On the one hand, the fundamental notion of evidence requires that we have hard numbers to talk about. On the other, I tend to think visually, in terms of patterns. So, some of this will be nice graphs and equations and numbers. Other bits will be visual patterns – the intuitive gut check. A pattern like that isn't enough to say what a problem is, and only seldom can say that we definitely have a problem. But, it is enough to point us in the direction of something worth investigating further.
The things I'll mention here are not exhaustive by any means. They are the most common, or the things that came to mind when I started gathering data and writing this article. If you have another idea, or another pattern, or see where I botched one of these, let me hear from you. The more eyes we have on this stuff, the better.
This first one is pretty straight-forward. We all know about CPU utilization. As the series goes on, though, we'll add more counters and look for interactions between them.
Characterizing CPU Utilization
The first question to answer here is how to we characterize CPU utilization? How do we talk about it and where do those numbers really come from?
Where it comes from is a particular perfmon object: Processor(*)\% Processor Time. There is an instance of this counter for each CPU on the machine, plus an aggregate one labeled "_Total". Except where otherwise specified, we'll be talking about Processor(_Total)\% Processor Time, which is the average busy time across all of the CPU's on this machine.
"Average busy time"….that reminds me of a wee detail that we need to mention before going any further. Different perfmon counters reports their values in different ways. Some are snapshots at the instant that the reading was taken. Some are running counts that increase over time. Altogether, there are something like 29 different classes of perfmon counter, each with its own particular way of reporting whatever it has seen. %CPU actually reports the average busy time between readings. That is, hit it every 15 seconds, and what you are really seeing is a 15 second average.
Let's start by taking a look at a set of readings for this counter over time.

The x-axis is time, or the individual readings that we took. The y-axis is the % busy time for this machine. The first thing we generally do to talk about %CPU is to talk about the average across some period of time. Or, graphically…

That is, CPU averaged 10.2% during this period. But, this graph is pretty tame. Its peaks are only as high as 14%. Still, we don't engineer to the average CPU. We engineer to "sustained peak". That is, not the very worst case, but what is the upper bound most of the time? So, we have to find a way to characterize "upper bound, most of the time". This is where we use a bit of statistics. Let's start by graphing what the quality folks refer to as an upper and lower control limit (UCL and LCL). The upper control limit is just the mean plus 1 standard deviation. (The lower control limit is the mean minus 1 deviation.)

The UCL on this graph (mean + 1 deviation) includes a little over 70% of the measurements that we took. That makes it a nice measurement for "sustained peak" – the place where most of the times we'll be below, and when we do go above, there'll be enough time to recover. We're still looking at the average across all 2, or 4 or 8 or whatever CPUs on the machine. But, it gives us a good picture of how busy things are. That UCL is what we'll engineer machines to.
Applications
Having a nice, mathematical way to talk about sustained CPU utilization is, by itself, a useful thing. But, this article is about patterns and things we can see in these graphs. So, let's take a look at a few pictures and see what we see. We're looking both for patterns in the graph that we can use, and for things that the UCL/LCL lines buy us.
Let's consider a typical story. We start with a particular app and load. It shows us a %CPU of about 25%, with a little variation. We make a change – either a different load pattern, a change to the code, whatever. The average %CPU still looks to be about 25%, but it isn't so flat or so well behaved anymore. Like this:

The blue line, to the left of the big black divider, is %CPU before the change. To the right of the divider (in green) is %CPU after the change. It looks like %CPU went up a bit, but how do we treat the sustained load now?

Average CPU went up just a little bit – from 25% to 27%. But, the sustained CPU – the UCL that we have to work with -- skipped from 27% to 32%.
So, what do we do with the new information? Other than just reporting it, the next question I see is why is the new load so much more variable – why those spikes? Understanding what went on at those times could help us pin down a problem transaction. It could be that this is just the behavior of the changed application and is not something to be worried about. But, it warrants a little investigation.
Case #1
Let's try a few more variations and see what we can see. Sometimes, we'll include the UCL and LCL boundaries. Sometimes, we'll be able to see what we want to see without them. The goal here isn't to pin point a problem. It's just to get to the next question to ask. Here's a pattern that we've see a few times.

We're humming along just fine, all is well. Then, either all at once, or over a fairly short period, %CPU changes dramatically and stays high. What next?
At this point in the story, I'm asking one big question…and a couple of little ones to help me answer the big one. The big question is, "what changed?" What happened in those few minutes that caused the CPU to go up like that?
One thing to look in to is whether all of the CPU's are running hot, or just one of them. If they are all running high like this, then the most likely explanation is that something went wrong with the load scenario. Either we are ramping users in too quickly, or the session database filled up (causing users to fail and lots of hits to happen very quickly). When all of the CPUs change at once, it is almost always something external to the system under test, and we need to investigate things that would increase the load like that.
Sometimes, though, we'll see another case. We'll look at it and see that 1 CPU is running full out. But, the others are still running at a nice, easy 10%. The only way to get into that position is if 1 thread is locked in a loop somewhere. That thread is on just 1 CPU, and is keeping it busy. The others are handling all of the new traffic. This is generally an application issue, and it is time to take a dump of that process and see what it is busy doing.
Case #2
Let's try another one and see if there is anything interesting going on there.

%CPU is mostly down around the 5% or less range. But, there are these occasional spikes to above 20%.
First question: is this something that we need to be concerned about? I will argue that it is. To begin with, without the spikes, the sustained peak on this application would be down around 5%. As is, we would plan more over twice as much CPU usage. If we are down in the 5%-10% range, that probably means that this is on a shared server, which means the difference between 5% and 10% could mean more servers. So, yeah, even though it is small, we want to at least understand why those spikes are there.
The next question we'd ask is "So, what is going on with those spikes?" Is this a low volume app, and that is just when it happened to be processing a transaction? That could be. That's worth looking at.
But, look again at the spikes. Those aren't really all that random. They happen at regular intervals. Plot another graph, on the same time scale, with hits/sec or transactions/sec. This could be a case where the scenario is marching. That's a problem with the test, not the application.
Case #3
Let's look at one more. This time, though, we'll introduce a new counter to help us understand what is really going on. Overall CPU utilization looks like this:

At first glance, whether we would be concerned about an application with this CPU profile depends, mostly, on what sort of application and which server we are looking at. If this is a web services server, that graph is right in expectation. If this is a thin client UI server, running in isolation….one app running 30-40% CPU on one of those machines would be a big deal. That would be something we'd want to look at.
Let's dig a little bit deeper and see what else we can see. Processor(*)\% Processor Time is the aggregate of all of the time being spent by all of the CPUs on this server. That includes both the time spent working on user processes and the time spent in the kernel. The counter for kernel time alone is Processor(*)\% Privileged Time. Let's plot %Privileged Time on the same graph.

This is the profile that we expect to see. The kernel has some work to do, and some things are done at kernel level (like the http service). But, the majority of the time being spent is happening in the application. If the graph looks like this, there is little for is except to go figure out where all that CPU time is going.
How do we do that? Processor(*)\% Processor Time tracks processor utilization on a per-processor basis. There is another counter collection of interest as well. Process(*)\% Processor Time tracks processor utilization on a per-PROCESS basis. That, coupled with process explorer, can usually nail down at least which process is chewing up the CPU. (It is usually a w3wp.exe process, meaning it is the web app itself. But, there have been times when a support process would decide to get busy. So, it is worth checking.)
But, we've seen another profile lately as well. Once in a while, when we plot % Privileged Time, it will look like this:

This graph is a problem. This is telling us that the majority of the time being spent on this server is being spent in the kernel. This almost always means that there is some sort of hardware problem going on. In this particular case, there was an issue with one of the CPUs on the machine that was generating excessive interrupts. This sort of hardware problem is usually also associated with high values for the System\Context Switches/sec counter.
What other patterns do you see? Let me know.

Saturday, May 26, 2012

Telling The Story


A few months ago, while doing the research for the most recent edition of our performance spreadsheets, I noticed an interesting phenomenon. If I searched for "performance testing", or "performance reporting" or anything like that, I came up with a lot of LoadRunner graphs or perfmon graphs, or things that looked a whole lot like them. The problem is that I was looking for a better, clearer way to communicate the information. LR graphs have their place, and can be useful. But, clear, concise communication is not their forte.
On the other hand, if I followed the links for things like "business intelligence", or even better "visual business intelligence", I struck gold. Yeah, I'm a geek. I not only admit it, I'm right proud of it. "Business Intelligence" was not the first place I'd've thought to look for information on how to better present technical data. It just didn't occur to me. But, let me pass on a quote from Perceptual Edge:
We are overwhelmed by information, not because there is too much, but because we don't know how to tame it. Information lies stagnant in rapidly expanding pools as our ability to collect and warehouse it increases, but our ability to make sense of and communicate it remains inert, largely without notice.
Computers speed the process of information handling, but they don't tell us what the information means or how to communicate its meaning to decision makers. These skills are not intuitive; they rely largely on analysis and presentation skills that must be learned.
Sound familiar? It sounded to me a whole lot like reporting on a performance test – great rivers of data, and no paddle in sight. It got me hooked. I've been tracking some of the material in this area for a while now, looking for tidbits to help communicate the things that we see. This week I noticed an article from Stephen Few on what he calls "Statistical Narrative" – that is, communicating the important parts of a whole lot of data by means of a story.
I was going to quote a few bits here, and add a few comments – the usual sort of blog activity. But, the more I started working on it, the more I realized that there really wasn't anything there to pare down. So, instead, I present the link. If you want a few pointers on better presentations and better ways to communicate what we need to communicate, this is the paper to download.
If you check it out, let me know what you think.

Friday, May 18, 2012

On The Granularity Of Observations


I've got too much to think about, too much to figure out.
Stuck between hope and doubt, it's too much to think about.
    -- Todd Snider, "Statistician's Blues"
I want to talk today about the difference between averaged data and actual observations. It's a thing that you've probably heard me go on about before. But, it's important, and if we don't understand what is really going on with some of our reports, can easily lead us to some very wrong conclusions. We're going to delve into some statistics along the way, so I want to start this discussion, right up front, with an example. We could do this with just about anything that can be measured -- response times, daily temperatures, dice, whatever. To keep this article as general as possible, I'm just going to throw dice: two, basic, 6-sided dice.
We'll throw them 100 times, record the results and then start to work. That is, our raw data will be 100 numbers ranging from 2 to 12. Next we are going to start taking groups of these numbers and averaging them. We'll average every 4, then every 5, then every 10, and finally every 20. We want to see what all of this averaging does to our results.
Who cares?
It would be easy, at this point, to say this doesn't have anything to do with the numbers that we get out of a performance test. But, hang on a minute. LoadRunner gives us two sets of things. There is the summary page – the big table of numbers at the start. It is calculated based on the raw data. But, that's not real useful for comparing results or for getting some of the other reports that we like. So, we snag a copy of what it labels the "Average Transaction Response Time" graph and do some statistics from there. But, that is giving us just what it says – averages across some period of time.
How long is that period of time? Well, that depends. LoadRunner can give us 1-second data for any test. But, that is not the default. The default is, I think, based on the length of the test and always works out to a power of 2. 64 second data seems pretty common. I have seen it default to 128 seconds or even as high as 1024 seconds. (15 minutes is only 900 seconds, by the way.)
So, if we are not careful with what LoadRunner is really telling us, we could be making decisions about milliseconds based on 15-minute averages.
Raw Data
So, we've rolled our dice and now have 100 numbers between 2 and 12. What can we say about those numbers? Well, we can calculate a few basic statistics, like we always do. And, we can draw ourselves a little histogram (also like we usually do) to see how they are distributed. Let's see how that looks.
Count
100
Minimum
2
Maximum
12
Average
6.37
StDev
2.25
95%-ile
11.00

Range, deviation, all the usual stuff. What does that tell us about how the numbers are distributed? Not as much as we would like. So, let's take a look at a histogram of that data.

The fact that we threw 2 dice gives us a particular sort of shape to this graph – the famous normal curve. Does that matter to us at all? Not at this point, it doesn't. This graph could have been any shape that we want it to be, and what comes next would be just the same.
A Little Statistics (And A Little History)
What this is all going to come back to is a statistical notion called the Central Limit Theorem. The Central Limit Theorem, in its pure form, is all about probability and the sums of randomly distributed variables. Fortunately, its intuitive form is a lot more straightforward. Suppose I start with a bunch of otherwise random response times. Then, I group every 4 (or 5 or whatever) of them and report the average of each group. The Central Limit Theorem tells us two things about those averages:
1.     The average of each group will tend towards the average of the whole set. Intuitively, the larger I make each group, the closer to the population average each number will be.
2.     The numbers that I report (that is, the group averages) will, as a group themselves, be normally distributed. Even if the raw data was not normally distributed, the group averages will be. And, they will be distributed around the mean of the entire population.
The Central Limit Theorem was worked out over the course of decades by a crowd of mathematicians. The history bit of it comes back to the fact that a number of them – Pascal being one – eventually stopped work on it because of its association with exactly what we are using to demonstrate it: dice.
On With The Show
So, what happens when I start taking groups of these original, raw numbers and averaging them? Let's start with groups of 4. We'll divide it up into 25 groups of 4 and see what we get.

Raw
average every 4
Count
100
25
Minimum
2
4.0
Maximum
12
9.0
Average
6.37
6.37
StDev
2.25
1.28
95%-ile
11.00
8.40

And, our histogram…

4 isn't all that big of a group. It still leaves us with 25 data points to look at, right? Let's look at that a little closer. Along the way, we lost several points (both high and low) from the range of our data. Because of that, our 95%-ile shrank by almost 3 points. That could be a big deal for a lot of our tests.
Let's try a little bit bigger grouping and see what happens. Let's see what happens when we average them in groups of 5.

Raw
average every 4
average every 5
Count
100
25
20
Minimum
2
4.0
4.0
Maximum
12
9.0
8.4
Average
6.37
6.37
6.37
StDev
2.25
1.28
1.05
95%-ile
11.00
8.40
7.64


A little change in the granularity of our numbers gives us a little change in the range of our results. We shave a point or so off of the range and off of the 95%-ile.
Really, ever since the first time we started averaging, we aren't calculating a percentile of the real data any more. We're calculating a percentile of an average. And, on these couple of graphs, we're starting to see why that is such a bad thing to do.
Going To Extremes
I mentioned before that on a large test, LoadRunner will aggregate things, by default, to 1024 seconds. That's 17 minutes and 4 seconds. So, let's see what happens when we push the group size of our averages to higher numbers. How about 10?

Raw
average every 4
average every 5
average every 10
Count
100
25
20
10
Minimum
2
4.0
4.0
5.6
Maximum
12
9.0
8.4
7.2
Average
6.37
6.37
6.37
6.37
StDev
2.25
1.28
1.05
0.53
95%-ile
11.00
8.40
7.64
7.11


Depending on the hit rate of our test, that's still not as many points in a given average as LR will give us for just an hour long test run. Even that is the, relatively conservative, 64 second block. So, let's push it one more time and see what happens at groups of 20.

Raw
average every 4
average every 5
average every 10
average every 20
Count
100
25
20
10
5
Minimum
2
4.0
4.0
5.6
6.2
Maximum
12
9.0
8.4
7.2
6.6
Average
6.37
6.37
6.37
6.37
6.37
StDev
2.25
1.28
1.05
0.53
0.14
95%-ile
11.00
8.40
7.64
7.11
6.52

By this point in the story, our histogram is ridiculous. It looks like…

Notice what happened at 10 and 20, though. Each time, we trimmed a little bit more from the range of our data. In the end, our 95%-ile and the average were too close to call. We'd lost all of the information that could have been used to diagnose a potential issue. It's just gone.
Speaking of the average, though, take a look at what happened to the average as we pushed to averages of larger and larger groups…

Raw
average every 4
average every 5
average every 10
average every 20
Count
100
25
20
10
5
Minimum
2
4.0
4.0
5.6
6.2
Maximum
12
9.0
8.4
7.2
6.6
Average
6.37
6.37
6.37
6.37
6.37
StDev
2.25
1.28
1.05
0.53
0.14
95%-ile
11.00
8.40
7.64
7.11
6.52

See how much it changed? Alright. See how much it didn't change? That's the central limit theorem in action.
Why does all of this matter? Suppose you have an SLA of "95% less than 8 seconds". Did you meet your SLA?