Performance By The Numbers: Apples, Oranges and Stout

One of the things that I like to do when I'm not at the office is to make wine. It is the ultimate lazy man's hobby. I mix a bunch of stuff up, and wait a week. Pour it into a big bottle and come back in a month or two. If I miss it by a day – or a quarter – it doesn't matter. Wine is very forgiving that way. I've never made beer, but friends of mine have. From what I hear, it is a terribly fiddly concoction requiring careful temperatures and measurements and timings and the like. That's just not my idea of fun.

But, the two things do have a couple of things in common -- the important bits all rely on things that are alive or have recently been alive. A small change here or there can result in all manner of changes in the end product. So, comparing one batch to the next can really be the proverbial "comparing apples to oranges". This batch is different than that batch which is nothing at all like the batch of the same stuff from last year. For a fellow working in his basement, like me, that's part of the fun. For a commercial brewery, that's a big problem.

Guinness Breweries were faced with exactly this problem all the way back in the early 1900s. So, Claude Guinness (owner and operator at the time) did something radical. He went over to Cambridge and to Oxford and started hiring people to come apply a little biochemistry and statistics to the processes that they used to make beer. One of these bright fellows was a man by the name of William Sealy Gosset, who developed the wee bit of statistics that I want to write about today.

Gosset's problem was how to tell if a particular batch of beer was up to the Guinness standards. You can measure, you can test, you can do all sorts of things. But, in the end, you are left with a need to compare a whole bunch of casks of beer and tell whether they are OK or not. It is the earliest record of statistical process control that I know of. Gosset eventually published his work, but because of intellectual property issues did it under the pen name of "Student." And, Student's T-test was born.

The Problem

Gosset's problem is, in fact, our problem as well. Performance tests, by necessity, involve a fair amount of randomization. Run enough transactions, with random data, and you'll eventually start to form a pattern. Run more of them, and even though the individual runs are distinct, the patterns of response times will start to look more and more alike. We run with one configuration, we make a change, we run the test again. Now, we have two sets of response times, and we want to answer the question "Did the change make a difference?" In other words, we need to compare two sets of performance data. And, when we compare them, we need to be able to make some sort of determination whether they are similar enough to be considered the same (i.e. the change didn't have an impact) or they are different enough that they aren't the same (i.e. that configuration change made a noticeable impact).

In this article, I plan to answer two often asked questions. The first is how we actually compare two runs and what we mean when we say the differences are "significant". The second is "Well, how much would this have to change before you'd consider it significant?" (The first question is much easier to answer than the second, as you'll see.)

Variance or Variability

When we talk about two runs, or a transaction that doesn't show as different, you'll hear us talk about the variability of the transaction. Simply put, given a bunch of hits to the same screen in the same run, how much does the response time vary under different conditions and with different data? Some transactions are very consistent – they respond in about the same amount of time every time we hit them. It might be fast or it might be slow, but it is at least consistent. Others will vary a lot. Does a claim have 1 bill or 1000 bills? That question becomes particularly relevant when looking at payments, for instance.

We talk about variability. Mathematically, this is called variance. Variance is just a measure of how much the individual data points in the set differ from their average. Variance is defined as the mean of the squares minus the square of the mean. For instance, consider the numbers 1, 2, 3, 4. The mean of the squares is (1*1 + 2*2 + 3*3 + 4*4)/4 = 7.5. The mean is (1 + 2 + 3 + 4)/4 = 2.5. The square of the mean is 2.5 * 2.5 = 6.25. So, the variance is 7.5 - 6.25 = 1.25. (You'll also hear us talk about the standard deviation. The standard deviation is a more common term, and is just the square root of the variance.) To see more of what these statistics look like graphically, check out this article.
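If you'd rather see that arithmetic as code, here is a minimal Python sketch of the same calculation. The function name is my own invention for illustration; `pvariance` is the standard library's population variance and is included only as a cross-check.

```python
from math import sqrt
from statistics import mean, pvariance


def variance_by_definition(values):
    """Population variance: mean of the squares minus the square of the mean."""
    mean_of_squares = mean(v * v for v in values)
    square_of_mean = mean(values) ** 2
    return mean_of_squares - square_of_mean


data = [1, 2, 3, 4]
variance = variance_by_definition(data)
std_dev = sqrt(variance)  # standard deviation is the square root of the variance

print(variance)         # 1.25
print(pvariance(data))  # 1.25, the library agrees
```

For real response-time data the library function is the safer choice, since subtracting two large, nearly equal numbers can lose precision.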

The important thing here is what it means to the frequency distribution curve when we have a large or small variance.

Here is the frequency distribution of a curve with a small variance. For a response time measure, the X-axis on this graph is the response time, the Y-axis is how frequently (either raw count or % of total) that value occurred. Notice how the tails of the curve are small, and the main body is very narrow. The smaller the variance, the narrower this curve. If I pick any item at random from this set, how confident would I be that it would be between, say, 20 and 30? I'd feel pretty good about that one. That's the highest part of the curve. There just isn't enough outside of the center to worry me.

By contrast, here is a curve drawn with the same range and mean as the first one. The difference is that this data set has a large variance. In this one the tails are thick and the body of the curve is very wide. Same question: if I pick any item at random from this set, how confident would I be that it would be between 20 and 30? Not so much, really. Oh sure, it's the high part of the curve, but there is an awful lot of area under that curve that is outside of the range that I picked.

That's why we say that the smaller the variance (the smaller the standard deviation), the more confident you can be in a particular test.
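To put rough numbers on that "between 20 and 30" question, here is a small sketch. It assumes both curves are normal with a mean of 25, and the standard deviations (2 for the narrow curve, 10 for the wide one) are values I made up for illustration, not measurements from the graphs.

```python
from statistics import NormalDist


def chance_between(mean, std_dev, low, high):
    """Probability that a normally distributed value lands in [low, high]."""
    dist = NormalDist(mu=mean, sigma=std_dev)
    return dist.cdf(high) - dist.cdf(low)


# Hypothetical curves: same mean, very different spread
narrow = chance_between(25, 2, 20, 30)
wide = chance_between(25, 10, 20, 30)

print(f"small variance: {narrow:.2f}")  # about 0.99
print(f"large variance: {wide:.2f}")    # about 0.38
```

Same mean, same range, and the odds drop from nearly certain to worse than a coin flip just because the curve got wider.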

What Are The Odds?

When it really comes down to it, statistics and probability are intertwined. It's hard to tell where one stops and the other starts, or even if there is such a point. Even with plenty of data, you'll almost never be able to say anything with absolute certainty. What we can do is to make a statement with a sufficiently high probability that we are confident in it. (Yes, that really is what a "confidence level" means.)

Take a look at the frequency distributions of two data sets and this will become a lot clearer, I think.

Let's start with these two runs. Here we have two data sets, each with a nice, small variance. Statistically, what we are really asking when we say "Are these results different?" is "What are the odds that this result happened by chance?" What do you suppose the chances are that these two runs are genuinely different? I'd say they're pretty good. The means (the high spots) differ by a lot, relative to the rest of the curve. In fact, one way to visualize the question is to think of the chances that these two runs are the same as being the intersection of the areas under each curve (that little bubble from about 30 to about 36). In this case, there is a small range with a small height (a small area), meaning the chances that these two things are the same are pretty small.

We would say that this difference is statistically significant.

Let's look at another pair of curves.

Here we have the distributions for two runs of a transaction with a large variance. Now, what can we say about whether these two runs are different? Not much, really. Another way to think of distinguishing the two runs is "If I pick an item at random, what are the odds it appears in both sets?" Based on the intersection between those curves, the chances of that happening would be pretty high. So, turn it around the other way. What are the odds that these two data sets are different? Not real good.

We would say that this difference is not statistically significant.
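The "intersection of the areas" picture can be sketched in a few lines. The means and standard deviations below are invented to resemble the two pairs of curves; the overlap area is only a way to visualize the idea and is not the probability that a T-test reports.

```python
from statistics import NormalDist

# Hypothetical pair with small variance: means far apart relative to the spread
small_a = NormalDist(mu=25, sigma=3)
small_b = NormalDist(mu=40, sigma=3)

# Hypothetical pair with large variance: same means, much wider curves
large_a = NormalDist(mu=25, sigma=15)
large_b = NormalDist(mu=40, sigma=15)

# overlap() returns the shared area under the two curves, from 0.0 to 1.0
small_overlap = small_a.overlap(small_b)
large_overlap = large_a.overlap(large_b)

print(f"small variance overlap: {small_overlap:.2f}")  # roughly 0.01
print(f"large variance overlap: {large_overlap:.2f}")  # roughly 0.62
```

The means differ by the same 15 in both pairs. Only the spread changed, and that alone takes the shared area from almost nothing to well over half.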

There is an important point here that we really need to talk about for a moment. In each case, we are talking about the probability that the difference in these two data sets happened by chance. In the first graph, with the small variance, the probability that it happened by chance is small enough that we can call them different. In the second graph, the large variance one, the probability that this result happened by chance is pretty high. In this case, it might be that they are different, or it might be that they are the same. There really just isn't enough information to tell. Remember, in the case of a high variance, "difference is not statistically significant" doesn't mean "no difference"; it means "can't say for sure."

The T Test

The T-test works by calculating a value based on the means, variances and sample sizes of the two data sets. The curve for this equation actually approximates a normal curve itself. So, to get a probability factor, there is a table look up to get to the final probability that a difference this large happened by chance. Fortunately, all of this calculation comes from a function built into Excel. We are testing to 95% confidence. So, if this value pops out as < 5%, we can say that the difference is statistically significant.

One small digression is worth a moment here. Anyone paying close attention to the statistics may have noticed that so far, all of my examples have been normally distributed and have had the same variances. In real test runs, neither of those is necessarily the case. The Central Limit Theorem will let us wiggle around the "normally distributed" part to an extent. The difference in variances, though, is a bit stickier. Fortunately, there is a slight variation on the classic Student's T-test, called Welch's T-test, that is designed for tests where the sample sizes and variances are not equal. This test is also known to be particularly tolerant of distributions that are not strictly on the normal curve. What we actually calculate is a Welch's T-test for our checks.
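For the curious, here is a rough sketch of what a Welch's T-test computes. It is not the Excel function we actually use. The function name is hypothetical, and the p-value here comes from the normal curve rather than a proper t-distribution lookup, which is an assumption that is only reasonable when the degrees of freedom are large.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_t_test(a, b):
    """Return (t, degrees_of_freedom, approx_p) for two samples.

    Assumes each sample has at least 2 values and is not perfectly constant.
    """
    n_a, n_b = len(a), len(b)
    # Sample variance (n - 1 denominator) divided by sample size, per data set
    se2_a = variance(a) / n_a
    se2_b = variance(b) / n_b
    # Welch's t statistic: difference of means over the combined standard error
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    # Two-tailed p-value via the normal curve (approximation, see note above)
    approx_p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, df, approx_p


# Hypothetical response times in seconds, far too few for a real comparison
before = [1.0, 2.0, 3.0, 4.0, 5.0]
after = [2.0, 4.0, 6.0, 8.0, 10.0]
t, df, approx_p = welch_t_test(before, after)
print(f"t = {t:.2f}, df = {df:.1f}, approximate p = {approx_p:.3f}")
```

With samples this tiny the normal shortcut understates the p-value, so treat the last number as illustration only. A spreadsheet or statistics library will do the t-distribution lookup properly.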

This is where we get into talking about sample sizes. That is, how many hits on a given transaction do we need in order to be able to compare them? The absolute minimum (and the threshold below which the spreadsheet will not give you an answer) is 5 hits in each data set, and the two sets together must have at least 40. Below that, the results of the T-test can't be relied upon. That is the bare minimum, though. The more hits you have, the more accurate the comparison will be. The particular statistics that we report on each data set, though, have their own requirements. Individual data sets should have at least 30 hits in each one. For cases where you suspect the distribution of the response times does not approximate normal, you'll want more hits for each data set. In that case, especially, treat the "30 in each" rule as a minimum. Still, the more the merrier.
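Those thresholds are simple enough to write down as a check. This is only a sketch of the rules as stated above; the function name and the wording of the three results are my own.

```python
def sample_size_status(hits_a, hits_b):
    """Apply the sample-size rules: 5 each and 40 combined, then 30 each."""
    if min(hits_a, hits_b) < 5 or hits_a + hits_b < 40:
        return "too few hits to compare"
    if min(hits_a, hits_b) < 30:
        return "comparable, but per-run statistics are shaky"
    return "ok"


print(sample_size_status(4, 100))  # too few hits to compare
print(sample_size_status(10, 35))  # comparable, but per-run statistics are shaky
print(sample_size_status(30, 30))  # ok
```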

The Reality Checks

There is one additional filter that we add to our data before we say that two test runs are different. You see, for a nicely consistent transaction, even a small difference may show up in a T-test as being statistically significant. But, we're talking about perceived user response times. A change of 0.01 seconds might be statistically significant, but will anyone care? So, we add what we call a "reality check". Once a transaction passes the T-test as being statistically significant, we look at the difference between the means and the 95th percentiles of the two runs. We will only report a change as being significant if either of these differences is at least 0.25 seconds AND the two data sets pass a T-test.
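Put together, the two filters look something like the sketch below. The function and parameter names are hypothetical, and it takes the p-value, means and 95th percentiles as inputs rather than computing them.

```python
def is_reportable(p_value, mean_old, mean_new, p95_old, p95_new,
                  alpha=0.05, min_shift=0.25):
    """True only if the change passes the T-test AND the reality check."""
    # T-test filter: testing to 95% confidence means p must be under 5%
    passes_t_test = p_value < alpha
    # Reality check: the mean or the 95th percentile moved by at least 0.25 s
    passes_reality_check = (abs(mean_new - mean_old) >= min_shift
                            or abs(p95_new - p95_old) >= min_shift)
    return passes_t_test and passes_reality_check


# Statistically significant, but nobody would notice a 0.01 second change
print(is_reportable(0.001, 1.00, 1.01, 1.50, 1.52))  # False
# Significant and big enough to matter
print(is_reportable(0.001, 1.0, 1.5, 2.0, 2.1))      # True
# Big shift, but too noisy to pass the T-test
print(is_reportable(0.30, 1.0, 2.0, 2.0, 3.0))       # False
```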

While I'm here, let me mention the other summaries that we use for distance testing. For that, we report what we call Hot Spots. Hot Spots are things that should get attention first. For transactions that make it through both of the previous filters (that is, have a 0.25 second difference and pass a T-test), if the old value was within SLA but the new one was not, we'll report that transaction as a Hot Spot. If it passes the T-test and the reality check, but does not cross SLA (either because it began over SLA or because it ended still below SLA), we'll report it as interesting, meaning that it may be worth looking at, but isn't as high a priority as the Hot Spots.
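As a sketch, the triage goes like this. The names are hypothetical, and I'm assuming "within SLA" means at or below the SLA value and that the old and new values are whichever statistic the SLA is written against.

```python
def classify(reportable, old_value, new_value, sla):
    """Sort a transaction into Hot Spot, interesting, or not reported."""
    if not reportable:
        # Failed the T-test or the reality check, so nothing to report
        return "not reported"
    if old_value <= sla < new_value:
        # Was within SLA, now over it: look at this one first
        return "Hot Spot"
    # Changed for real, but did not cross the SLA line
    return "interesting"


print(classify(True, 1.8, 2.4, sla=2.0))   # Hot Spot
print(classify(True, 2.5, 3.0, sla=2.0))   # interesting (began over SLA)
print(classify(True, 1.0, 1.5, sla=2.0))   # interesting (ended below SLA)
print(classify(False, 1.8, 2.4, sla=2.0))  # not reported
```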

One other check that we do here is important. It's not a statistical test, but more of a gut feel sort of thing. We try to control every variable that we reasonably can. But, even then, there is just going to be a certain amount of stuff that we can't control. Stuff happens. In the same way that individual transactions vary, the aggregated numbers will tend to vary between runs. Same script, same data, same versions and configs, and we can still expect some variation between runs. That's normal. Often, we'll repeat the same test like that several times – not typically as a monitored test, just as an informal thing on our own. The reason for all of that repetition is to get a feeling for just how much things vary between runs. It helps us to understand just how much weight to put on the comparison of two individual test results.

What Does It All Mean?

So, let's go back to our original questions, and add one new one.

What does it mean when we say a difference is significant? It means that the results were sufficiently different to pass a T-test and that the differences were more than 0.25 seconds.

How big would a difference have to be in order to be considered significant? That's not such an easy question. The general rule of thumb is that the means should differ by more than two standard deviations. But, even that answer depends on the variances and sample sizes of the two sets of measurements. We start with a distribution curve (which isn't linear), then calculate what amounts to a double integral off of that (the area under this curve and the area under that curve). That's why we can't give a number, even a number for each transaction, that says "anything more than this is significant". That's just not how the math works.
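To see how slippery that number is, here is a rough sketch of the smallest gap between means that would clear a 95% test, once the standard deviations and sample sizes are already known. It uses the normal curve in place of the t-distribution (an assumption that needs large samples), and the function name and all the input values are made up.

```python
from math import sqrt
from statistics import NormalDist


def smallest_significant_gap(sd_a, n_a, sd_b, n_b, confidence=0.95):
    """Approximate gap in means needed to reach the given confidence."""
    # Two-tailed critical value, about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Scale by the combined standard error of the two means
    return z * sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)


consistent = smallest_significant_gap(0.1, 100, 0.1, 100)
noisy = smallest_significant_gap(2.0, 100, 2.0, 100)
noisy_more_hits = smallest_significant_gap(2.0, 1000, 2.0, 1000)

print(f"consistent transaction: {consistent:.2f} s")       # about 0.03
print(f"noisy transaction:      {noisy:.2f} s")            # about 0.55
print(f"noisy, 10x the hits:    {noisy_more_hits:.2f} s")  # about 0.18
```

The answer swings by a factor of twenty across those three cases, and we don't know the variances or hit counts of a run until after we've run it.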

What is all of this not telling me? Imagine we run a test, then make a change in the application, then test again. We reasonably infer that the change caused the difference. That inference is beyond the scope of our statistics. A T-test tells us the probability that two sets of data are different. It does not tell us why they are different.

Performance By The Numbers

Tuesday, May 15, 2012