I've got too much to think about, too much to figure out.
Stuck between hope and doubt, it's too much to think about.
-- Todd Snider, "Statistician's Blues"
I want to talk today about the difference between averaged data and actual observations. It's a thing that you've probably heard me go on about before. But it's important: if we don't understand what is really going on with some of our reports, the averaging can easily lead us to some very wrong conclusions. We're going to delve into some statistics along the way, so I want to start this discussion, right up front, with an example. We could do this with just about anything that can be measured -- response times, daily temperatures, dice, whatever. To keep this article as general as possible, I'm just going to throw dice: two basic 6-sided dice.
We'll throw them 100 times, record the results and then start to work. That is, our raw data will be 100 numbers ranging from 2 to 12. Next we are going to start taking groups of these numbers and averaging them. We'll average every 4, then every 5, then every 10, and finally every 20. We want to see what all of this averaging does to our results.
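For anyone who wants to follow along without a pair of dice, here is a minimal sketch of the experiment in Python. The seed and the helper names (`roll_two_dice`, `group_averages`) are my own assumptions, so the numbers it prints will not match the tables in this article.

```python
import random
from statistics import mean

def roll_two_dice(n, seed=None):
    """Return n totals, each the sum of two fair six-sided dice."""
    rng = random.Random(seed)
    return [rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n)]

def group_averages(values, size):
    """Average consecutive, non-overlapping groups of `size` values.

    A trailing partial group, if any, is dropped.
    """
    full = len(values) - len(values) % size
    return [mean(values[i:i + size]) for i in range(0, full, size)]

rolls = roll_two_dice(100, seed=1)  # the seed is arbitrary
for size in (4, 5, 10, 20):
    averaged = group_averages(rolls, size)
    print(size, len(averaged), min(averaged), max(averaged))
```

Each printed line gives the group size, how many averages are left, and the smallest and largest of those averages.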
Who cares?
It would be easy, at this point, to say this doesn't have anything to do with the numbers that we get out of a performance test. But, hang on a minute. LoadRunner gives us two sets of things. There is the summary page – the big table of numbers at the start. It is calculated based on the raw data. But, that's not real useful for comparing results or for getting some of the other reports that we like. So, we snag a copy of what it labels the "Average Transaction Response Time" graph and do some statistics from there. But, that is giving us just what it says – averages across some period of time.
How long is that period of time? Well, that depends. LoadRunner can give us 1-second data for any test. But, that is not the default. The default is, I think, based on the length of the test and always works out to a power of 2. 64-second data seems pretty common. I have seen it default to 128 seconds or even as high as 1024 seconds. (15 minutes is only 900 seconds, by the way.)
So, if we are not careful with what LoadRunner is really telling us, we could be making decisions about milliseconds based on 15-minute averages.
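To put a rough size on that, here is a small sketch of how many raw observations get folded into each point on the graph. The transaction rate is a made-up number for illustration, and the granularities are simply the ones mentioned above; nothing here comes from LoadRunner itself.

```python
ASSUMED_TX_PER_SECOND = 5  # hypothetical rate, purely for illustration

def observations_per_point(granularity_seconds, tx_per_second):
    """Number of raw response times averaged into one plotted point."""
    return granularity_seconds * tx_per_second

for granularity in (1, 64, 128, 1024):
    minutes, seconds = divmod(granularity, 60)
    count = observations_per_point(granularity, ASSUMED_TX_PER_SECOND)
    print(f"{granularity:5d} s ({minutes:2d} min {seconds:2d} s): "
          f"{count} observations per point")
```

At the assumed rate, a single 1024-second point stands in for thousands of individual measurements.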
Raw Data
So, we've rolled our dice and now have 100 numbers between 2 and 12. What can we say about those numbers? Well, we can calculate a few basic statistics, like we always do. And, we can draw ourselves a little histogram (also like we usually do) to see how they are distributed. Let's see how that looks.
| Statistic | Raw   |
| Count     | 100   |
| Minimum   | 2     |
| Maximum   | 12    |
| Average   | 6.37  |
| StDev     | 2.25  |
| 95%-ile   | 11.00 |
Range, deviation, all the usual stuff. What does that tell us about how the numbers are distributed? Not as much as we would like. So, let's take a look at a histogram of that data.
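The same summary and a rough text histogram can be sketched in a few lines of Python. This is an illustration on freshly simulated rolls with an arbitrary seed, not the data behind the table above. I am also assuming a sample standard deviation and an interpolated ("inclusive") percentile, which is what a spreadsheet would typically give.

```python
import random
from collections import Counter
from statistics import mean, stdev, quantiles

rng = random.Random(2)  # arbitrary seed; results differ from the article's
rolls = [rng.randint(1, 6) + rng.randint(1, 6) for _ in range(100)]

def percentile(values, pct):
    """Interpolated percentile (inclusive method) for pct in 1..99."""
    return quantiles(values, n=100, method="inclusive")[pct - 1]

summary = {
    "Count": len(rolls),
    "Minimum": min(rolls),
    "Maximum": max(rolls),
    "Average": mean(rolls),
    "StDev": stdev(rolls),
    "95%-ile": percentile(rolls, 95),
}
for name, value in summary.items():
    print(f"{name:8} {value:.2f}")

# One row per possible total, one '#' per roll that produced it.
counts = Counter(rolls)
for total in range(2, 13):
    print(f"{total:2} {'#' * counts[total]}")
```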
The fact that we threw 2 dice gives us a particular sort of shape to this graph: a symmetric, triangular peak around 7 that looks a lot like the famous normal curve, even though it isn't exactly one. Does that matter to us at all? Not at this point, it doesn't. This graph could have been any shape that we want it to be, and what comes next would be just the same.
A Little Statistics (And A Little History)
What this is all going to come back to is a statistical notion called the Central Limit Theorem. The Central Limit Theorem, in its pure form, is all about probability and the sums of randomly distributed variables. Fortunately, its intuitive form is a lot more straightforward. Suppose I start with a bunch of otherwise random response times. Then, I group every 4 (or 5 or whatever) of them and report the average of each group. The Central Limit Theorem tells us two things about those averages:
1. The average of each group will tend towards the average of the whole set. Intuitively, the larger I make each group, the closer to the population average each number will be. The spread of the group averages shrinks roughly with the square root of the group size.
2. The numbers that I report (that is, the group averages) will, as a group themselves, be approximately normally distributed, and more closely so as the groups get larger. Even if the raw data was not normally distributed, the group averages will be. And, they will be distributed around the mean of the entire population.
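Both points are easy to check by simulation. The sketch below uses a much larger sample than our 100 rolls (100,000, with an arbitrary seed) so that the pattern is not drowned out by noise. It compares the standard deviation of the group averages with the textbook prediction: the raw standard deviation divided by the square root of the group size.

```python
import random
from math import sqrt
from statistics import mean, pstdev

def spread_of_averages(values, size):
    """Return (mean, standard deviation) of consecutive group averages."""
    averages = [sum(values[i:i + size]) / size
                for i in range(0, len(values) - size + 1, size)]
    return mean(averages), pstdev(averages)

rng = random.Random(3)  # arbitrary seed
rolls = [rng.randint(1, 6) + rng.randint(1, 6) for _ in range(100_000)]
overall_mean, overall_sd = mean(rolls), pstdev(rolls)

print(f"raw: mean {overall_mean:.3f}, stdev {overall_sd:.3f}")
print("size    mean   stdev  predicted")
for size in (4, 5, 10, 20):
    group_mean, group_sd = spread_of_averages(rolls, size)
    print(f"{size:4} {group_mean:7.3f} {group_sd:7.3f} "
          f"{overall_sd / sqrt(size):10.3f}")
```

The mean column barely moves, while the stdev column tracks the prediction closely. With only 100 rolls, as in this article, the match is much rougher.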
The Central Limit Theorem was worked out over the course of decades by a crowd of mathematicians. The history bit of it comes back to the fact that a number of them – Pascal being one – eventually stopped work on it because of its association with exactly what we are using to demonstrate it: dice.
On With The Show
So, what happens when I start taking groups of these original, raw numbers and averaging them? Let's start with groups of 4. We'll divide it up into 25 groups of 4 and see what we get.
| Statistic | Raw   | average every 4 |
| Count     | 100   | 25              |
| Minimum   | 2     | 4.0             |
| Maximum   | 12    | 9.0             |
| Average   | 6.37  | 6.37            |
| StDev     | 2.25  | 1.28            |
| 95%-ile   | 11.00 | 8.40            |
And, our histogram…
4 isn't all that big of a group. It still leaves us with 25 data points to look at, right? Let's look at that a little closer. Along the way, we lost several points (both high and low) from the range of our data. Because of that, our 95%-ile shrank by almost 3 points. That could be a big deal for a lot of our tests.
Let's try a slightly bigger grouping and see what happens when we average them in groups of 5.
| Statistic | Raw   | average every 4 | average every 5 |
| Count     | 100   | 25              | 20              |
| Minimum   | 2     | 4.0             | 4.0             |
| Maximum   | 12    | 9.0             | 8.4             |
| Average   | 6.37  | 6.37            | 6.37            |
| StDev     | 2.25  | 1.28            | 1.05            |
| 95%-ile   | 11.00 | 8.40            | 7.64            |
A little change in the granularity of our numbers gives us a little change in the range of our results. We shave about half a point off of the range (the maximum drops from 9.0 to 8.4) and about three-quarters of a point off of the 95%-ile (8.40 down to 7.64).
Really, ever since the first time we started averaging, we aren't calculating a percentile of the real data any more. We're calculating a percentile of an average. And, on these couple of graphs, we're starting to see why that is such a bad thing to do.
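Here is a small sketch of that effect: the same rolls, with the 95%-ile taken first from the raw values and then from group averages of increasing size. The seed, the helper names and the interpolated percentile method are my assumptions, so expect the same general trend as the tables rather than the same numbers.

```python
import random
from statistics import quantiles

def p95(values):
    """Interpolated 95th percentile (inclusive method)."""
    return quantiles(values, n=100, method="inclusive")[94]

def group_averages(values, size):
    """Average consecutive, non-overlapping groups of `size` values."""
    return [sum(values[i:i + size]) / size
            for i in range(0, len(values) - size + 1, size)]

rng = random.Random(4)  # arbitrary seed
rolls = [rng.randint(1, 6) + rng.randint(1, 6) for _ in range(100)]

print(f"raw data          95%-ile = {p95(rolls):.2f}")
for size in (4, 5, 10, 20):
    averaged = group_averages(rolls, size)
    print(f"average every {size:<3} 95%-ile = {p95(averaged):.2f}")
```

Only the first line is a percentile of real observations. Every line after it is a percentile of averages.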
Going To Extremes
I mentioned before that on a large test, LoadRunner may aggregate things, by default, to as much as 1024 seconds. That's 17 minutes and 4 seconds. So, let's see what happens when we push the group size of our averages to higher numbers. How about 10?
| Statistic | Raw   | average every 4 | average every 5 | average every 10 |
| Count     | 100   | 25              | 20              | 10               |
| Minimum   | 2     | 4.0             | 4.0             | 5.6              |
| Maximum   | 12    | 9.0             | 8.4             | 7.2              |
| Average   | 6.37  | 6.37            | 6.37            | 6.37             |
| StDev     | 2.25  | 1.28            | 1.05            | 0.53             |
| 95%-ile   | 11.00 | 8.40            | 7.64            | 7.11             |
Depending on the hit rate of our test, 10 is still fewer points per average than LR will fold together for just an hour-long test run. And that is with the relatively conservative 64-second block. So, let's push it one more time and see what happens at groups of 20.
| Statistic | Raw   | average every 4 | average every 5 | average every 10 | average every 20 |
| Count     | 100   | 25              | 20              | 10               | 5                |
| Minimum   | 2     | 4.0             | 4.0             | 5.6              | 6.2              |
| Maximum   | 12    | 9.0             | 8.4             | 7.2              | 6.6              |
| Average   | 6.37  | 6.37            | 6.37            | 6.37             | 6.37             |
| StDev     | 2.25  | 1.28            | 1.05            | 0.53             | 0.14             |
| 95%-ile   | 11.00 | 8.40            | 7.64            | 7.11             | 6.52             |
By this point in the story, our histogram is ridiculous. It looks like…
Notice what happened at 10 and 20, though. Each time, we trimmed a little bit more from the range of our data. In the end, our 95%-ile and the average were too close to call. We'd lost all of the information that could have been used to diagnose a potential issue. It's just gone.
Speaking of the average, though, take a look at what happened to the average as we pushed to averages of larger and larger groups…
| Statistic | Raw   | average every 4 | average every 5 | average every 10 | average every 20 |
| Count     | 100   | 25              | 20              | 10               | 5                |
| Minimum   | 2     | 4.0             | 4.0             | 5.6              | 6.2              |
| Maximum   | 12    | 9.0             | 8.4             | 7.2              | 6.6              |
| Average   | 6.37  | 6.37            | 6.37            | 6.37             | 6.37             |
| StDev     | 2.25  | 1.28            | 1.05            | 0.53             | 0.14             |
| 95%-ile   | 11.00 | 8.40            | 7.64            | 7.11             | 6.52             |
See how much it changed? Alright. See how much it didn't change? That's the Central Limit Theorem in action.
Why does all of this matter? Suppose you have an SLA of "95% less than 8 seconds". Did you meet your SLA?
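One way to see the answer is to run the 95%-ile figures from the tables above through that SLA, treating the dice totals as seconds. The sketch below assumes that "95% less than 8 seconds" can be read as "the 95th percentile is below 8", which is a simplification. The percentile values are copied from the tables; the names are mine.

```python
SLA_LIMIT_SECONDS = 8.0

# 95%-ile values taken from the tables in this article.
p95_by_grouping = {
    "raw": 11.00,
    "average every 4": 8.40,
    "average every 5": 7.64,
    "average every 10": 7.11,
    "average every 20": 6.52,
}

def meets_sla(p95, limit=SLA_LIMIT_SECONDS):
    """Read '95% less than limit' as: the 95th percentile is below limit."""
    return p95 < limit

for label, p95 in p95_by_grouping.items():
    verdict = "met" if meets_sla(p95) else "MISSED"
    print(f"{label:17} {p95:5.2f}  {verdict}")
```

Same dice, same rolls: the raw data misses the SLA badly, while every grouping of 5 or more appears to meet it.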




