We are overwhelmed by information, not because there is too much, but because we don't know how to tame it. Information lies stagnant in rapidly expanding pools as our ability to collect and warehouse it increases, but our ability to make sense of and communicate it remains inert, largely without notice. (Stephen Few, Perceptual Edge)
That quote just blew me away the first time that I read it. We have loads and loads of data – numbers, measurements, other stuff from all manner of sources. But, how do we make sense of it? How do we distill it, removing the bits that we don't matter, without losing any of the parts that we really need to be worried about?
This series is going to be about how to make sense of a few key measurements. These are all things we collect, or can collect, from a perfmon log. Some of what I want to talk about is patterns that we've seen in Stress or in Production. I straddle an interesting line when it comes to analysis of things like this. On the one hand, the fundamental notion of evidence requires that we have hard numbers to talk about. On the other, I tend to think visually, in terms of patterns. So, some of this will be nice graphs and equations and numbers. Other bits will be visual patterns – the intuitive gut check. A pattern like that isn't enough to say what a problem is, and only seldom can say that we definitely have a problem. But, it is enough to point us in the direction of something worth investigating further.
The things I'll mention here are not exhaustive by any means. They are the most common, or the things that came to mind when I started gathering data and writing this article. If you have another idea, or another pattern, or see where I botched one of these, let me hear from you. The more eyes we have on this stuff, the better.
This first one is pretty straight-forward. We all know about CPU utilization. As the series goes on, though, we'll add more counters and look for interactions between them.
Characterizing CPU Utilization
The first question to answer here is how to we characterize CPU utilization? How do we talk about it and where do those numbers really come from?
Where it comes from is a particular perfmon object: Processor(*)\% Processor Time. There is an instance of this counter for each CPU on the machine, plus an aggregate one labeled "_Total". Except where otherwise specified, we'll be talking about Processor(_Total)\% Processor Time, which is the average busy time across all of the CPU's on this machine.
"Average busy time"….that reminds me of a wee detail that we need to mention before going any further. Different perfmon counters reports their values in different ways. Some are snapshots at the instant that the reading was taken. Some are running counts that increase over time. Altogether, there are something like 29 different classes of perfmon counter, each with its own particular way of reporting whatever it has seen. %CPU actually reports the average busy time between readings. That is, hit it every 15 seconds, and what you are really seeing is a 15 second average.
Let's start by taking a look at a set of readings for this counter over time.
The x-axis is time, or the individual readings that we took. The y-axis is the % busy time for this machine. The first thing we generally do to talk about %CPU is to talk about the average across some period of time. Or, graphically…
That is, CPU averaged 10.2% during this period. But, this graph is pretty tame. Its peaks are only as high as 14%. Still, we don't engineer to the average CPU. We engineer to "sustained peak". That is, not the very worst case, but what is the upper bound most of the time? So, we have to find a way to characterize "upper bound, most of the time". This is where we use a bit of statistics. Let's start by graphing what the quality folks refer to as an upper and lower control limit (UCL and LCL). The upper control limit is just the mean plus 1 standard deviation. (The lower control limit is the mean minus 1 deviation.)
The UCL on this graph (mean + 1 deviation) includes a little over 70% of the measurements that we took. That makes it a nice measurement for "sustained peak" – the place where most of the times we'll be below, and when we do go above, there'll be enough time to recover. We're still looking at the average across all 2, or 4 or 8 or whatever CPUs on the machine. But, it gives us a good picture of how busy things are. That UCL is what we'll engineer machines to.
Applications
Having a nice, mathematical way to talk about sustained CPU utilization is, by itself, a useful thing. But, this article is about patterns and things we can see in these graphs. So, let's take a look at a few pictures and see what we see. We're looking both for patterns in the graph that we can use, and for things that the UCL/LCL lines buy us.
Let's consider a typical story. We start with a particular app and load. It shows us a %CPU of about 25%, with a little variation. We make a change – either a different load pattern, a change to the code, whatever. The average %CPU still looks to be about 25%, but it isn't so flat or so well behaved anymore. Like this:
The blue line, to the left of the big black divider, is %CPU before the change. To the right of the divider (in green) is %CPU after the change. It looks like %CPU went up a bit, but how do we treat the sustained load now?
Average CPU went up just a little bit – from 25% to 27%. But, the sustained CPU – the UCL that we have to work with -- skipped from 27% to 32%.
So, what do we do with the new information? Other than just reporting it, the next question I see is why is the new load so much more variable – why those spikes? Understanding what went on at those times could help us pin down a problem transaction. It could be that this is just the behavior of the changed application and is not something to be worried about. But, it warrants a little investigation.
Case #1
Let's try a few more variations and see what we can see. Sometimes, we'll include the UCL and LCL boundaries. Sometimes, we'll be able to see what we want to see without them. The goal here isn't to pin point a problem. It's just to get to the next question to ask. Here's a pattern that we've see a few times.
We're humming along just fine, all is well. Then, either all at once, or over a fairly short period, %CPU changes dramatically and stays high. What next?
At this point in the story, I'm asking one big question…and a couple of little ones to help me answer the big one. The big question is, "what changed?" What happened in those few minutes that caused the CPU to go up like that?
One thing to look in to is whether all of the CPU's are running hot, or just one of them. If they are all running high like this, then the most likely explanation is that something went wrong with the load scenario. Either we are ramping users in too quickly, or the session database filled up (causing users to fail and lots of hits to happen very quickly). When all of the CPUs change at once, it is almost always something external to the system under test, and we need to investigate things that would increase the load like that.
Sometimes, though, we'll see another case. We'll look at it and see that 1 CPU is running full out. But, the others are still running at a nice, easy 10%. The only way to get into that position is if 1 thread is locked in a loop somewhere. That thread is on just 1 CPU, and is keeping it busy. The others are handling all of the new traffic. This is generally an application issue, and it is time to take a dump of that process and see what it is busy doing.
Case #2
Let's try another one and see if there is anything interesting going on there.
%CPU is mostly down around the 5% or less range. But, there are these occasional spikes to above 20%.
First question: is this something that we need to be concerned about? I will argue that it is. To begin with, without the spikes, the sustained peak on this application would be down around 5%. As is, we would plan more over twice as much CPU usage. If we are down in the 5%-10% range, that probably means that this is on a shared server, which means the difference between 5% and 10% could mean more servers. So, yeah, even though it is small, we want to at least understand why those spikes are there.
The next question we'd ask is "So, what is going on with those spikes?" Is this a low volume app, and that is just when it happened to be processing a transaction? That could be. That's worth looking at.
But, look again at the spikes. Those aren't really all that random. They happen at regular intervals. Plot another graph, on the same time scale, with hits/sec or transactions/sec. This could be a case where the scenario is marching. That's a problem with the test, not the application.
Case #3
Let's look at one more. This time, though, we'll introduce a new counter to help us understand what is really going on. Overall CPU utilization looks like this:
At first glance, whether we would be concerned about an application with this CPU profile depends, mostly, on what sort of application and which server we are looking at. If this is a web services server, that graph is right in expectation. If this is a thin client UI server, running in isolation….one app running 30-40% CPU on one of those machines would be a big deal. That would be something we'd want to look at.
Let's dig a little bit deeper and see what else we can see. Processor(*)\% Processor Time is the aggregate of all of the time being spent by all of the CPUs on this server. That includes both the time spent working on user processes and the time spent in the kernel. The counter for kernel time alone is Processor(*)\% Privileged Time. Let's plot %Privileged Time on the same graph.
This is the profile that we expect to see. The kernel has some work to do, and some things are done at kernel level (like the http service). But, the majority of the time being spent is happening in the application. If the graph looks like this, there is little for is except to go figure out where all that CPU time is going.
How do we do that? Processor(*)\% Processor Time tracks processor utilization on a per-processor basis. There is another counter collection of interest as well. Process(*)\% Processor Time tracks processor utilization on a per-PROCESS basis. That, coupled with process explorer, can usually nail down at least which process is chewing up the CPU. (It is usually a w3wp.exe process, meaning it is the web app itself. But, there have been times when a support process would decide to get busy. So, it is worth checking.)
But, we've seen another profile lately as well. Once in a while, when we plot % Privileged Time, it will look like this:
This graph is a problem. This is telling us that the majority of the time being spent on this server is being spent in the kernel. This almost always means that there is some sort of hardware problem going on. In this particular case, there was an issue with one of the CPUs on the machine that was generating excessive interrupts. This sort of hardware problem is usually also associated with high values for the System\Context Switches/sec counter.
What other patterns do you see? Let me know.














