Performance By The Numbers: The Power Of Context

Here's a silly idea: a machine that can calculate mathematical functions. I'm not talking about addition and subtraction here. I'm talking about hard stuff like logarithms and trig functions.

In the 21st century, the only thing silly about that last paragraph is the suggestion that it is a silly idea. Now, you can walk into any department store and, for about what it would cost you to go out to lunch, buy a thing the size of your hand that will do all of that and more. But, in 1822, at a time when the steam engine was one of the most complex machines around, a "difference engine" was crazy talk. (In fact, that's exactly what they thought of poor Charles Babbage. It wasn't until very recently that someone was able to build a working Babbage Engine.)

The point I'm trying to make here is that the value of an idea (or a process or a test) depends on the context in which it resides. In the context of the early 19th century, a computing machine was too radical to be believed. In the context of the early 21st century, it is so commonplace as to be outright boring. But, right in between, somewhere around the 1940s, that's the context in which it was genius.

The Context-Driven School of Testing is built on a set of principles, first collected by Cem Kaner, James Bach and Brian Marick. Those principles are:

1. The value of any practice depends on its context.

2. There are good practices in context, but there are no best practices.

3. People, working together, are the most important part of any project's context.

4. Projects unfold over time in ways that are often not predictable.

5. The product is a solution. If the problem isn't solved, the product doesn't work.

6. Good software testing is a challenging intellectual process.

7. Only through judgment and skill, exercised cooperatively throughout the entire project, are we able to do the right things at the right times to effectively test our products.

They go on to write "The essential value of any test lies in its ability to provide information (i.e. to reduce uncertainty)". And that is the notion that I want to focus on in today's post. (Check out the link above. Their examples are well worth a read.)

We Performance folks have all manner of tools and techniques at our disposal. We have load generators and databases and reporting utilities and spreadsheets and homegrown whatnots of every shape and flavor. We can generate more tables of numbers than anyone could ever pound through. But, what was the question? If those numbers aren't answering some essential question, they are just 1s and 0s.

The basic question, and the one that everyone thinks of first, is "Does this application meet SLA?" Sure. That is the big question. But, there is more to it than that. I'll argue that anyone with any experience testing a given application will be able to set up a performance run in which the application will not meet its response time SLA. Hit it too hard. Hit it with too many of one particular transaction. Arrange the data for one screen in just the right way. We've all seen enough to have a few dirty tricks up our sleeves.

But, hold on a minute. What is the difference between a "dirty trick" and a "clean kill?" To get to what we consider a reasonable test, we need to understand and appropriately control for the variables that we know about. We have to hit the application at the right rate. We have to get the mix of transactions right. We have to know that the data being used is about what the application can expect to see in Production. When designing and scripting a performance test, we put a lot of effort into modeling user behavior and into calibrating our scenarios to be sure that we know what the correct settings for these variables are and that we are generating a load that will match that.
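To make "getting the mix right" a little more concrete, here is a minimal sketch of driving a scenario from a target mix and then checking what the run actually produced. The transaction names and percentages are made up for illustration; in a real test they would come from your usage model.

```python
import random

# Hypothetical target mix; names and weights are illustrative only.
TRANSACTION_MIX = {"view_list": 0.60, "create_task": 0.30, "edit_task": 0.10}

def pick_transactions(mix, count, seed=None):
    """Draw `count` transaction names at random, weighted by the target mix."""
    rng = random.Random(seed)
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=count)

def observed_mix(transactions):
    """Return the fraction of the run that each transaction actually made up."""
    total = len(transactions)
    return {name: transactions.count(name) / total for name in set(transactions)}

if __name__ == "__main__":
    run = pick_transactions(TRANSACTION_MIX, 1000, seed=42)
    print(observed_mix(run))
```

Comparing the observed mix against the target is the calibration step: if the two disagree by much, the scenario is not testing the model you think it is.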

Which brings us to another question: how confident are you in that model? If an application is in Production, we can use web logs or application logs or other such measures to create our model for user behavior. In that case, we can be pretty confident that we are modeling how users actually use an application. New applications, or others where we don't have those independent logs, leave us with projections and other guesses. How confident are we in those? They are much better than nothing. Still, one thing that we can count on is that the real world is much stranger than any sane person would think up. No matter how careful our projections are, real users will find a different way to use the application.

I once tested an application where the projections said that X tasks would be created in a day, and that 40% of those tasks would then be edited. Marking a task complete counted as editing it. It took a couple of weeks of testing the application daily to create a backlog of tasks that buried the response time of the application. (The requirement was later clarified to say that closing a task was a different activity, and that no backlog should be created.) The context in which the application is used affects the requirements, the design of the application and its tests, and the final results.
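The arithmetic behind that backlog is worth a quick sketch. Assume the best case, where every one of those edits is a task being closed: that still leaves 60% of each day's tasks open. The 1000 tasks per day below is a made-up stand-in for X, and the function name is mine, not anything from the project.

```python
def open_backlog(created_per_day, closed_fraction, days):
    """Open tasks left after `days`, assuming only `closed_fraction` of each
    day's new tasks is ever closed and older tasks are never revisited."""
    left_open_per_day = created_per_day * (1 - closed_fraction)
    return round(left_open_per_day * days)

if __name__ == "__main__":
    # 1000 tasks/day is a hypothetical value for the "X" in the projection.
    for day in (1, 5, 10):
        print(day, open_backlog(1000, 0.40, day))
```

The backlog grows in a straight line with every day of testing, which is why a single run looked fine and a couple of weeks of daily runs did not.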

But, it took a couple of weeks to build up enough of a backlog for that to matter, which brings us to another question that we need to answer. We say "the test ran within tolerance" when talking about how well we hit the various rate and mix targets. What we're talking about is that if the goal is 1000 transactions in an hour, hitting 998 or 1002 does not invalidate the result. Or does it? Where is the boundary between "clean kill" and "dirty trick"? Typically, we'll treat this question as a matter of how well the scenario did what we intended it to do.
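For the first meaning of "within tolerance," the check itself is simple. Here is a minimal sketch; the 1% band is an assumption for illustration, not a standard, and the right number depends on your context.

```python
def within_tolerance(actual, target, tolerance_pct):
    """True if `actual` is within `tolerance_pct` percent of `target`."""
    allowed = target * tolerance_pct / 100
    return abs(actual - target) <= allowed

if __name__ == "__main__":
    target = 1000  # transactions per hour, as in the example in the text
    for actual in (998, 1002, 1020):
        print(actual, within_tolerance(actual, target, tolerance_pct=1.0))
```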

But, when we are uncertain about our usage model (and, let's face it, every model has some uncertainty to it), we have to consider a second meaning to the "What is within tolerance?" question. Using the example above, if 1002 is acceptable, how about 1020? Or 1200? Or, maybe it doesn't break until we push 2000 in an hour. In other words, we have to ask "How sensitive is the application to deviations from the model?" And, "What is the risk of the application seeing the conditions that it is sensitive to in Production?" Those are the questions that lead us into the more interesting scenarios. To answer those questions, we have to take all of those variables we were talking about earlier – like hit rate and mix and data size and user counts and everything else we uncover along the way – and we have to start spinning those dials.
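Spinning one of those dials might look something like the sketch below: step the hit rate upward and note the first rate at which the response time misses the SLA. The response-time function here is a toy single-queue (M/M/1) formula standing in for a real test run, and the 1.8 second service time and 5 second SLA are invented for the example. In practice, `measure` would kick off an actual run at that rate and return what you observed.

```python
import math

def modeled_response_time(rate_per_hour, service_seconds=1.8):
    """Toy single-queue (M/M/1) estimate of mean response time in seconds.

    A stand-in for a real test run; the 1.8 s service time is invented.
    """
    utilization = rate_per_hour * service_seconds / 3600
    if utilization >= 1:
        return math.inf  # the queue never drains at this rate
    return service_seconds / (1 - utilization)

def first_breaking_rate(rates, sla_seconds, measure=modeled_response_time):
    """Return the lowest rate in `rates` that misses the SLA, or None."""
    for rate in sorted(rates):
        if measure(rate) > sla_seconds:
            return rate
    return None

if __name__ == "__main__":
    rates = [1000, 1020, 1200, 2000]
    for rate in rates:
        print(rate, round(modeled_response_time(rate), 2))
    print("breaks at:", first_breaking_rate(rates, sla_seconds=5.0))
```

The same loop works for any other dial (mix, data size, user count): hold everything else at the modeled value, sweep one variable, and record where the answer changes.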

The Question

We start with the simple, obvious question: "Does this application meet SLA?" We know that there are circumstances under which it will not meet its response time requirements. And, we know that there will be circumstances where it will meet those requirements. We have our projections and our models for how the application will be used, and we generally have an idea where those models are soft. In short, we take our original question and adjust it to account for the context in which the application will run. The question that we end up with is one word (and a world of meaning) different from our original question. In the end, we are asking,

WHEN does the application meet SLA?

Friday, June 8, 2012