Tuesday, February 26, 2013

Isolating The Wild Queue


Sometimes searching for the bottleneck in an application is a little bit like searching for Bigfoot.  Someone told you there is one out there.  You think you know what the tracks are supposed to look like.  But, just how are you going to go about finding the evidence to narrow down his hiding place?

Quite some time ago, I posted an article about queuing and how it looks from a few graphs and what it means to us in Performance.  We’ve seen a few issues now, on at least 3 applications on 2 platforms that I know of, that all tie back to some flavor of queuing.  So, I wanted to take a few minutes and walk through an example of isolating a problem related to queuing.  There are a couple of characteristic patterns here – symptoms that can show up (and have shown up) on many different applications and platforms.

The Problem

The initial problem statement is the one we always hear – high end-user response times.  We need to find and correct the root cause.  A little bit more digging tells us that the problem is not being seen all the time.  At night, when traffic is low, response times are pretty zippy.   But, the more load there is on the system, the worse they get.

The application in question is a multi-tiered sort of app – the sort we deal with every day.  The end user hits a web server (or some other front-end server).  This application does not call database services directly.  Instead, it calls out to a variety of web services to do its work and gather its data.  The web services call out to a database server, which accesses the data and returns it.


 I’ve simplified this down to just 1 machine (and one sort of machine) at each tier.  One of the interesting lessons in this particular investigation is just how much we can constrain the investigation using what I call “back of the envelope” sort of tools.  That is, just the basic things that we can quickly draw, without getting bogged down in too many details.

Investigation – With Crayons

Scott Barber used to have a tag line that he used in some of his training.  It went, essentially, “Crayons Before Calculus”.  Sure, we could start with all sorts of fancy tools and obscure perfmon counters.  Sometimes, that’s just the only way to go.  But, before we pull out the calculus, start with crayons.  Start with the simple things, the things that are easy to do and easy to interpret.  Use them for as long as you can, right until they stop being useful and you really have to pull out the fancy stuff.  That’s how we’re going to approach this investigation – with crayons.

So, let’s start trying to narrow this thing down.   The first question we asked on this one is “where is the time being spent?”  That is, which tier is the one we want to focus on?

To answer this question, we start with just a few high-level tools.  Our DBA took a short trace at the database server and measured the response times for the various stored procedures in our database.  We also took a look at the IIS logs at the web services server and either the IIS logs or the application logs at the web server.

 From this we learned a few things…

  1. The database is seeing a lot of traffic.  Each web service hit generates multiple calls to the database, so that is to be expected.  All in all, the stored procedures at the database are measuring time in double digit microseconds.  Sometimes a little higher, but those are very much the exceptions.  At that speed, it doesn’t seem like the database would be the culprit.
  2. The web services are seeing a fair amount of traffic as well.  That also is expected, since one hit to the web server generates multiple web service calls.  According to IIS, the response times here are on the order of 50-150 milliseconds.  Given the number of calls it makes to the database on the back end, that’s as fast as you could reasonably expect.  Again, there are exceptions, but far and away, the bulk of them fall into that range.
  3. The web server is seeing a LOT of traffic.  It is seeing more incoming connections, actually, than any of the servers further back in the chain.  Response times here are measured in high order seconds and even minutes.  

#3 is our first big, important clue.  Sure, each web server hit generates multiple web services hits.  But, there aren’t anywhere near enough callouts from the web server to the web services server to justify a jump from “fractions of a second” out to “minutes”.  Given this, we can start to focus on the web server as the place where the problems are happening.
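For anyone who would rather script this first pass than eyeball the logs, here is a minimal sketch of the idea. It assumes the logs are in the W3C extended format that IIS writes, with a `#Fields:` header line and a `time-taken` column in milliseconds. The function name and the sample rows are made up for illustration, and this is not the LogParser query we actually ran.

```python
from statistics import median


def time_taken_summary(lines):
    """Summarise the time-taken column (milliseconds) of a W3C extended log."""
    fields = []
    samples = []
    for raw in lines:
        line = raw.strip()
        if line.startswith("#Fields:"):
            # The header names the columns for the rows that follow.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or "time-taken" not in fields:
            continue
        parts = line.split()
        if len(parts) != len(fields):
            continue  # skip malformed rows
        samples.append(int(parts[fields.index("time-taken")]))
    if not samples:
        return None
    samples.sort()
    rank = (95 * len(samples) + 99) // 100  # nearest-rank 95th percentile
    return {
        "hits": len(samples),
        "median_ms": median(samples),
        "p95_ms": samples[rank - 1],
    }


# Hypothetical sample rows, for illustration only.
sample_log = [
    "#Fields: date time cs-method cs-uri-stem sc-status time-taken",
    "2013-02-26 10:00:01 GET /svc/a 200 62",
    "2013-02-26 10:00:01 GET /svc/b 200 140",
    "2013-02-26 10:00:02 GET /svc/a 200 95",
    "2013-02-26 10:00:03 GET /svc/c 200 51",
]
print(time_taken_summary(sample_log))
```

Run the same summary against the logs from each tier and put the numbers side by side. The tier where the median jumps by orders of magnitude is the one to look at first.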


So, naturally, the next thing to do is to start looking at resource utilization at the web server and see what we are seeing there.  This, however, starts to generate more questions than answers….
  1. First, we look at the web server.  That’s where the problem is, right?  We have high traffic here.  Knowing that response time was slow, we built this box out to be a big, physical machine with a bunch of CPUs and RAM and other goodies.  Problem is, perfmon tells us that most of that extra hardware isn’t being used.  CPU is less than 10% busy.  Disk is idle.  There is not much paging going on.  Even the network counters (output queue and the like) are telling us that it’s not network bound.
  2. Well, if the web server isn’t busy, what is it?  Maybe the response time numbers misled us.  So, we start taking a look at the servers on the back end.  The web services server has a little CPU time going on.  But, it too is more idle than busy.  Memory and our other usual counters all look good.  We notice that, while there is a lot of traffic here, there is less traffic here than at the web server.  That’s curious.  Why would that be?  We’ll come back to that question in a moment.
  3. To complete the cycle, we go look at the database server.  It’s busy, of course.  But, CPU, memory, even disk utilization are all in the range of what you’d expect from a well behaved server of this sort.  Not to mention, its response times are so small it’s hard to imagine a way to improve them even if this were the culprit.
  4. Our usual perfmon counters aren’t helping us much at this point.  They are telling us that everything is fine – even when we know it isn’t.  The traffic pattern between the web server and the services server does seem odd, though.  When you can’t find the issue on the servers, look off of the servers.  So, at this point, we involve the network folks.  A bunch of packet traces later, we learn that there are not any significant network errors or issues going on.  Our initial thought that there is less traffic at the web services layer than there should be is proven to be correct, though.  For a while, we investigate possible issues with BigIP, but that also does not prove to be the problem.  (In the real investigation, we found that load wasn’t being distributed evenly between the web services servers.  But, even at that, it couldn’t explain the response time differences we were seeing.)
At this point in the story, it looks like we have a serious mystery on our hands.  Response times and our usual perfmon counters are in violent disagreement.

High response times, and low utilization….that tells us a lot.  In fact, that is one of the characteristic symptoms of a whole class of problems.  That particular pattern is one of the early warning signs that we are blocking on some resource.  Something, somewhere, is trying to use or lock some limited resource and is blocking until it becomes available.   

 The Shape Of The Curve

Our next step is to take this issue into the lab and try to isolate just what it is that we are blocking on.  Once again, we want to narrow it down to where we think the blockage is most likely to be seen.

It is entirely possible that there is a conflict between this application hitting that database and some other application hitting the same database.  If that were the case, if the database were the resource being locked, we would see the big response time jump at the web services server – that is, at the first layer that accesses the locked resource.  It’s possible this is the issue.  But, given that our first big jump in response times is at a server which does not access the database directly, it doesn’t seem likely.

So, we are going to look at the first tier where response times take a mysterious increase – we are going to load the web server.

By now, everyone has heard me talk about “the shape of the curve”.  (Folks are probably tired of hearing me talk about it, in fact.)  But, in a case like this, the shape of the curve is exactly what we want to see.  We know that at low load levels, response times are good.  We know that at very high levels, response times are shockingly bad.  So, we want to create a condition where we start low and increase the load until we start seeing the slower response times…and a little beyond.  This isn’t just about “finding the break point”.  That’s important.  And, when we find that point, we’ll take a ton of measurements and probably some crash dumps and the like.  But, let’s watch what happens along the way.

We start with a very low load – just 1 vuser.  Then, we add users one or two at a time and we watch response times.  There will be a lot of variation in those times, of course.  This is one of those cases where the average response time is sufficient to tell us what we need to know.  We collect that info, and plot response times as a function of the number of vusers we have in the system.
Now, we’re getting some place.  Notice how the response times start off very low and very flat.  On a smaller scale, they would look linear.  (In fact, had we stopped at just 10 or 12 vusers, we’d have concluded that response times were linear.)  But, then, at some point in the test….at some point, as we are adding load….those same response times go non-linear.   Dramatically, non-linear.

While running the tests to collect this data, we also noticed another interesting behavior.  After a point, the hits per second in the test weren’t increasing.  Plotting that, as a function of vusers, we see this graph….

At first, adding vusers increases the number of hits per second that we can get out of the system.  But, after a bit, those increases start to level out.  In this particular case, they hit a hard limit at about 20 h/s.

Seeing this, we naturally set about trying all sorts of things to validate this result – we tried hitting more and different load generators; we tried several tests built around determining if the load generation tool itself was the bottleneck.  In each case, no matter how we ran it or how we twisted the conditions, the result was consistent.  These two graphs did not change shape.

The General Case

What do we have so far?  We have a case, in Production, where our typical perfmon counters seem to disagree with our actual response times.  And, we have corroborating evidence from the lab showing these two curves.

It’s not often that I point at one thing and say “Remember this.”  I’m doing that right now.  Look at the shape of those two curves.  Remember those two curves.  Those two curves are as good as DNA for identifying this class of problem.  There is only one thing that I know of that can create both of these symptoms – queuing.  We are blocking on something, and sitting on a queue until it is available.
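If you want to convince yourself that one blocked resource really can produce both curves, a back-of-the-envelope model is enough. The sketch below uses the standard asymptotic bounds for a closed system: throughput can't exceed either what the vusers can offer or what the limited resource can serve, and Little's Law turns that throughput into a response time. The numbers (a 0.1 second service time and 2 slots) are purely hypothetical, and a real system will round off the corner rather than turn this sharply.

```python
def closed_system_curves(max_vusers, service_time_s, slots, think_time_s=0.0):
    """Best-case throughput and response time for 1..max_vusers virtual users.

    slots is the number of requests the limited resource can serve at once.
    Returns a list of (vusers, hits_per_second, response_time_s) tuples.
    """
    rows = []
    ceiling = slots / service_time_s  # the resource cannot go faster than this
    for vusers in range(1, max_vusers + 1):
        offered = vusers / (service_time_s + think_time_s)
        throughput = min(offered, ceiling)
        # Little's Law for a closed system: N = X * (R + Z)
        response = vusers / throughput - think_time_s
        rows.append((vusers, throughput, response))
    return rows


# Hypothetical numbers, chosen only to show the shape of the curves.
for vusers, hits, resp in closed_system_curves(40, 0.1, 2)[::8]:
    print(f"{vusers:>3} vusers  {hits:5.1f} h/s  {resp:5.2f} s")
```

Throughput climbs, then goes dead flat. Response time sits flat, then climbs in step with every vuser added past the knee. Those are the same two shapes, with nothing in the model except a queue.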

The only thing left to do is to figure out what the resource is that we are blocking on.

Nailing Down the Specifics

Now that we know that we can reproduce the issue in the lab, and that we don’t need the extreme load conditions that we see in Production to do it, we need to figure out what it is that we are blocking on.

The big ticket items are all right out – processor queue, network queue, disk queue.  Those things are part of our first level perfmon collection set.  You would think that at this point, we would want to run another test with a bunch of obscure perfmon counters – grab the .NET LocksAndThreads counters, and the SQL Locks counters and anything else that looks like it counts locks.  In my experience, though, those things will tell us that we are blocking on something, but they won’t tell us WHAT we are blocking on.

The fastest way I know to answer that question is to use DebugDiag to pull a core dump.  I won’t go into the details of how to work with DebugDiag here – that would be a whole blog by itself.  But, I will talk about what we found when we did that, because it is also an issue that we’ve seen on several different platforms and applications.

When we pulled the dump, and asked DebugDiag to do some analysis on it, we got back an interesting message.  It said that multiple threads were “waiting on a web service call, but not waiting on the server”.  What?  We need to pull that error message apart.
  •  “Waiting on a web service call…” – OK, the  application tried to make a web service call and is waiting on that. 
  • “…but not waiting on the server” – Hmmm….so, we tried to make a call, but haven’t actually sent the request yet…and that’s where it is waiting.
Aha!

Thrice upon a time, I saw a problem that behaved much like this.  It seems that web service calls go through the same library that handles browser requests.  (Makes sense to reuse such a thing, really.  After all, they are all just web requests.)  Browsers have to be good citizens.  There may be zillions of people running browsers, and any one individual isn’t going to have very high throughput requirements.  So, the WinInet library limits the number of connections that one process can have to any particular host or domain.  

But, servers play by a whole different set of rules.  Web servers have high throughput requirements – they need to be able to manage tons of web service calls in order to service those zillions of users with their little browsers.  (At least we hope our web server sees that kind of traffic.)  And in order to do that, it is going to have to increase that limit somehow.  Fortunately, MS provided a way to override this default setting and increase the number of connections that we can have out to other servers.
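To get a feel for what a small per-host connection limit does to a busy server, here is a toy simulation. It is a sketch under stated assumptions, not the library's actual behavior: every call takes the same fixed time, requests are served first-come first-served, and the limit, the timings and the function name are all invented for illustration.

```python
import heapq


def simulate_connection_limit(arrival_times, limit, service_time):
    """Return how long each request waits for a free connection.

    arrival_times: when each outbound call is attempted (seconds).
    limit: maximum number of simultaneous connections to the host.
    service_time: how long the remote call takes once it is sent.
    """
    free_at = [0.0] * limit  # min-heap of times at which each connection frees up
    heapq.heapify(free_at)
    waits = []
    for arrival in sorted(arrival_times):
        earliest = heapq.heappop(free_at)
        start = max(arrival, earliest)  # blocked until a connection is free
        waits.append(start - arrival)
        heapq.heappush(free_at, start + service_time)
    return waits


# Hypothetical burst: six calls at once, each taking 0.1 s on the wire.
burst = [0.0] * 6
print(simulate_connection_limit(burst, limit=2, service_time=0.1))
print(simulate_connection_limit(burst, limit=6, service_time=0.1))
```

With a limit of 2, the later calls spend more time waiting for a connection than they spend on the wire. With a limit of 6, nobody waits at all. The remote server is equally fast in both runs, which is why it never showed up as the culprit.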

Wrapping Up

Did you notice what we just did?  

We started with the crayons – the simplest tools at our disposal – and used them to narrow down the search.  We were able to narrow the investigation from “runs slow under load” all the way down to “blocking on an internal resource at THIS server” with nothing fancier than LogParser and Excel.  No obscure counters or fancy tools.  We just needed the high level, easily available stuff and a little understanding of the shape of things.

We let the simple tools guide us to the places where we really needed to pull out the microscopes and other fancy gear.  And, only when we hit a point where we absolutely had no other way to get more data, did we pull out the big hairy stuff.

In each of the cases where we used this technique and changed this setting, the throughput of that first level application improved dramatically.  

Of course, it also opens up the flood gates on the back-end servers.  But, that’s another story.

Wednesday, June 20, 2012

Bullet Graphs


My last few posts here have been high-level stuff. It's important to think about the big picture. But, it's also important to think about the little picture. So, this week, I want to talk a little bit about just that – little pictures. Graphs, actually. We're making a few changes in how we report some of our results. You'll be seeing these graphs in upcoming projects. So, I wanted to take a little time and describe how they work and what patterns we can see in them. They have a lot more power than it would seem, for something this simple.
The Definition
The bullet graph was developed by Stephen Few. Few is a consultant in the area of business intelligence and data visualization. What he was trying to build was a way to visualize a data set that would play nice with how we see and perceive information and still pack a lot of information into a small space. (The circular dials that are so common in the dashboard metaphor, pretty much, do neither of these things.) His web site (http://www.perceptualedge.com) is well worth some time to read.
But, I digress. Let's talk some more about bullet charts. I've cooked up a few examples of different things that we see in the charts the way that we present them. The data is properly randomized, so we'll get realistic fluctuations, instead of pristine curves.

There are 4 primary elements to a bullet chart.
1.     The scale. For the bullet graph to work, there'll always be a numeric scale attached. Usually it starts at 0. But, for our examples today, it runs from 3-18.
2.     The background. The background on a bullet chart is important. That is where we encode the "good/better/best" sort of information that we need to make sense of it. These graphs are tuned to be readable, even when printed. The light grey color is SLA. The dark grey color is the goal for that measurement. On this graph, and all of today's examples, we have an "SLA" of 16 and a goal of 5.
3.     The measure we are interested in. That would be the thin, black bar down the middle. Sometimes this bar will be red. We use the black bar for the 95%-ile response time. This is a time when being "outside the box" is not a good thing. In the example above, we would be over SLA.
4.     The secondary measure. This is the white diamond – the level bubble – floating in the middle of the black bar. We put the median on the bubble.
That seems simple enough, right? The fact that it is so simple is one of the things that I like about this particular chart. You can process "good", "bad" or "say what?" in a blink, without having to parse monstrous tables of data. One of the great ironies of data visualization is the notion of "pretty". People fret over "why do I need to spend time making a pretty chart?" We don't. Not really. What we need is simple and clear. It just happens that things that our brains scan as "clear", they also scan as "pretty". Go figure.
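The numbers behind one of these charts are easy to compute by hand. Here is a minimal sketch, assuming a nearest-rank 95th percentile (other percentile definitions give slightly different answers) and made-up sample data. The function name and the status labels are mine, not part of Few's specification.

```python
from statistics import median


def bullet_summary(samples, goal, sla):
    """Compute the two measures drawn on the bullet graph and a quick verdict."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # nearest-rank 95th percentile
    p95 = ordered[rank - 1]
    if p95 > sla:
        status = "over SLA"
    elif p95 > goal:
        status = "within SLA, over goal"
    else:
        status = "within goal"
    return {"median": median(ordered), "p95": p95, "status": status}


# Made-up response times, with the goal of 5 and SLA of 16 used in the examples.
print(bullet_summary(range(1, 21), goal=5, sla=16))
```

The 95%-ile becomes the black bar, the median becomes the bubble, and the verdict is just where the end of the bar lands against the two grey bands.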
Patterns
Let's take a look at a couple of examples and see what else we can see in these little charts.

Our first chart, Data Set A, is pretty typical in a number of ways. We have our goal and requirement bands in there. The 95%-ile bar, though, is right at the requirement. That is something that we would want to investigate further. The other thing to notice is that the median bubble is riding right about the middle of the chart. Let's take a look at a histogram of the data along those same buckets and see how it looks.

Here, I've taken the histogram of the data that went into building the bullet chart and aligned it with the bullet chart itself. The green and red lines just help us to see where the two graphs align. We know that half of the data has to be below the bubble (and the green line) – that's the definition of median. I promised random data, with all of its warts and bumps, and here it is. There are a few bumps in the histogram. But, overall, it has the sort of shape that we would expect from a normal distribution – big hump in the middle, median line bisects the hump, tails on both ends. We have a few items above the 95%-ile, just like we would expect. Overall, this is the sort of pattern that you'd expect to see.
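Lining a histogram up with the bullet chart only works if both use the same scale. A small sketch of the bucketing, assuming one-unit buckets over the 3-18 scale used in these examples (the helper names and the sample values are hypothetical):

```python
def histogram(samples, low, high, bucket_width=1):
    """Count samples into equal-width buckets covering [low, high]."""
    n_buckets = (high - low) // bucket_width
    counts = [0] * n_buckets
    for value in samples:
        if low <= value < high:
            counts[int((value - low) // bucket_width)] += 1
        elif value == high:
            counts[-1] += 1  # the top edge belongs to the last bucket
    return counts


def render(counts, low, bucket_width=1):
    """Draw the counts as rows of '#' so the shape can be seen in plain text."""
    return "\n".join(
        f"{low + i * bucket_width:>3} | {'#' * count}"
        for i, count in enumerate(counts)
    )


# Made-up measurements on the 3-18 scale.
data = [5, 7, 8, 8, 9, 9, 9, 10, 10, 11, 12, 14]
print(render(histogram(data, 3, 18), 3))
```

Values outside the scale are simply dropped here, so check for them separately before trusting the picture.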
So, let's take a look at a couple of not-so-normal patterns.

We've seen this graph before. Right away, we would want to look into a transaction that looked like this because it is over SLA. But, there is something else interesting here as well. Look at the median bubble. See how it is riding way to the right? That's a pattern that warrants some investigation. So, let's put it beside its frequency distribution graph and see what we see.

Now, we can see something more interesting is going on. To begin with, the histogram is skewed to the right. The green line still bisects that graph, it has to. But, our measurements have a definite floor. Nothing measured here was less than 8. Why is that? That sort of pattern is characteristic of some sort of timeout or other failure going on. Find what is causing THAT, and you'll make a big change in this result.
Let's take a look at one more.

Under normal circumstances, a transaction with this profile would not get investigated. It is below SLA, the median looks pretty good. Odds-on chance any reasonable test would turn up much more interesting things to investigate than this. Still, the pattern underlying this graph is something that we see often enough in things that don't meet SLA. So, let's take a look at it and see what we see.

Notice how the curve is skewed to the left. Sure, half to the left of the green line and half to the right. We know that. But, the big grouping is to the left. The ones to the right are "like too little jam spread over too much bread". When we see this sort of pattern, it always means that there is something fundamentally different about the transactions that fell in the big grouping to the left as opposed to the thin group to the right. What's the difference? What is it that is either making the one group clump, or spreading the other group out? If you can find that answer – and it is usually just one variable – then you really have a handle on this transaction and on how to either improve the transaction or improve the test.

Friday, June 15, 2012

Riding The Line Ride



A while back, my family and I spent a couple of weeks in Orlando, hanging out at Disney World. We hadn't been since my two oldest boys were toddlers (they are in college now). It was a blast. Of course, the rides and shows and stuff at any amusement park are just the punctuation in the story. The majority of the story is about riding the Line Ride.
You know the Line Ride, right? It is the one that you get on before you get on anything else. There is one out in front of every coaster, show or water ride. It must be the most popular thing in the park, because no matter how hot the day or crowded the building, we still hop right on the next Line Ride we see.
On our first day in the park, we were standing in line and it set me to wondering if two cars on one coaster track really qualified as a single queue/2 server model, or did the fact that there was only 1 track, and one car travelling at a time, mean that it really should be single queue/single server and the second car was just an optimization for loading and unloading the track. I had just about concluded that it should be single queue/single server when a more important question occurred to me – what sort of sicko stands around at Disney World thinking about queuing models?
The question has been bouncing around in my head ever since I got back. (The first question, that is, not the second one.) So, to finally exorcise this notion once and for all, I thought I'd put together a few quick thoughts on queuing and why we get so worried when we see even a little bit of it.
The Simple View
Without fail, when you talk about queuing theory, what comes out next is a whole rack of fancy equations, and a lot of interesting simulations based on a bunch of assumptions. For our purposes, a much simpler approach will, I think, be clearer and just as effective at what we need to get at.
First, let's think about the thing that we are modeling. I'm going to talk about an oversimplified model of a basic web server. The life cycle of a request looks something like this:
1.     A request arrives
2.     If we have a processor available, it gets processed. If not, it gets queued.
3.     Our request gets processed.
4.     A reply is sent out.
Each of these steps has a parameter or two, and a couple of assumptions that we'll use to simplify our model. The first of these is the rate at which requests arrive for us to process. We'll call the number of requests that arrive per second our arrival rate. Requests arrive at random. But, the equations let us use the average arrival rate to work with. Here we'll have a variable that we can set for our model – the arrival rate.
The next step involves our queue itself. The simplest form of the queuing equations assumes that there is an infinite queue. But, that is never really the case. So, let's set our maximum queue length to 100 entries. As we'll see in a bit, even a small queue like that is more than enough to see the effect of even a little queuing. Having a finite queue introduces another thing that we want to track, though. That is, if a request arrives and the queue is full, then it gets an error response. In queuing parlance, the critical measure here is the number of requests that "balk". For a web server, this is what it means when we get an HTTP 503 response. So, the two measures that we'll want to look at here: the average queue length for any given arrival rate and the % of requests that get an error because the queue was full.
Our third step is the request getting processed. Here we have two things that we want to think about. The first is how long does it take to process an average request? Just to simplify things, let's say that in this model we can process one request in one second. It's an average – some will go faster and others slower. But, it is a simple assumption that we can make and still keep things fairly general. The second question to ask is the one from the roller coaster line – how many "servers" do we have? By servers, we don't mean physical Windows machines, but things servicing requests. That is, how many processors are we looking at? A typical web services machine is configured with 4 CPUs. So, let's call it 4 servers for our model. That means that while it takes 1 second to process 1 request, we can still process up to 4 requests per second because they are independent and running on separate servers. Our usual measurement here is Utilization. That is, what percent of the time is the system busy working on a request.
Our final step, sending the reply, isn't really a time that we measure. It's just our marker for when we are done with this request. So, there is nothing to do there.
We have one measurement that is left that we want to talk about, and it is the most common, most talked about one of all. That is Response Time. Response Time is the total amount of time that a request spends in the system – including the processing time and the amount of time it spends waiting on a queue somewhere.
Before we look at the results, let's review. Our requests take one second each to process. We have one queue feeding 4 processors. We have a maximum queue length of 100 entries. (For anyone who wants to dig deeper into the equations and models and the like, all of this means that ours is rightly specified as an M/M/4/104 arrangement.) Our key measures are Response Time, Utilization, Queue Length, and the % of requests that end in a queue full error. What we want to do is to vary the average arrival rate of our requests and see what impact this has on our key measures. (By the way, the queuing equations give average results. That is to say, average queue length, utilization and the like. That is why we talk about "sustained queuing" as opposed to a few things showing up on the queue for a moment and then being cleared.)
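For anyone who does want to dig into the equations, the model above can be worked through with the standard textbook steady-state formulas for an M/M/c/K queue. The sketch below is my own reconstruction under that assumption. The function and key names are invented, and the defaults simply encode the model described here: 1 second per request, 4 servers, and 100 queue slots.

```python
from math import factorial


def mmck(arrival_rate, service_rate=1.0, servers=4, queue_limit=100):
    """Steady-state averages for an M/M/c/K queue (K = servers + queue_limit)."""
    capacity = servers + queue_limit
    offered = arrival_rate / service_rate  # offered load, in servers' worth of work
    per_server = offered / servers
    base = offered ** servers / factorial(servers)

    # Unnormalised probability of having n requests in the system.
    weights = []
    for n in range(capacity + 1):
        if n <= servers:
            weights.append(offered ** n / factorial(n))
        else:
            weights.append(base * per_server ** (n - servers))
    total = sum(weights)

    p_full = weights[capacity] / total  # arrivals that find the queue full
    accepted_rate = arrival_rate * (1.0 - p_full)
    queue_length = sum(
        (n - servers) * w for n, w in enumerate(weights) if n > servers
    ) / total

    return {
        # Service time plus average wait (Little's Law on the queue).
        "response_time": 1.0 / service_rate + queue_length / accepted_rate,
        "utilization": accepted_rate / (servers * service_rate),
        "queue_length": queue_length,
        "pct_queue_full": 100.0 * p_full,
    }


for rate in (1.0, 3.0, 4.0, 5.0):
    r = mmck(rate)
    print(f"{rate:>4} r/s  resp {r['response_time']:5.1f} s  "
          f"util {r['utilization']:4.0%}  queue {r['queue_length']:5.1f}  "
          f"full {r['pct_queue_full']:4.1f}%")
```

Change `servers`, `queue_limit` or `service_rate` to see how the knee moves. The response time includes only accepted requests, since a request that gets turned away never spends any time in the system.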
Results
So, let's start our arrival rate at 1 request/second. We'll increment it from there by 0.5 r/s and see where it takes us. The results of all of that come out like this:
Arrival Rate (r/s)   Response Time (s)   Utilization   Avg. Queue Length   % Queue Full Error
1                    1                   25%           0                   0
1.5                  1                   37.50%        0                   0
2                    1.1                 50%           0                   0
2.5                  1.2                 62.50%        0.5                 0
3                    1.5                 75%           1.5                 0
3.5                  2.5                 87.50%        5                   0
4                    13.4                99%           49                  1%
4.5                  24                  100%          92                  11%
5                    25                  100%          96                  20%
5.5                  25.3                100%          97                  27%
6                    25.5                100%          98                  33%
       
With an arrival rate of 1 request/second, we have 1 second response time, no queuing and the system is only 25% utilized. That is to say, that there are 4 processors waiting to take care of only 1 request at a time. 3 of them are sitting idle and there is no reason for any request to wait. The response time is just the amount of time it takes for the request to get processed.
Between 1 and 2.5 requests/second, our utilization and response time are pretty close to linear. There is still no noticeable queuing and the system isn't looking all that busy.
But, watch what happens when we get to 3 requests/second. All of a sudden, we've got sustained queuing. Our response time is up to 150% of where it started. But, the system is only 75% busy. What's going on here? Remember that our requests arrive at random. So, there will be times when a bunch arrive and get queued up. And, there will be lulls in the arrival rate when we have a chance to drain the queue that built up in the busy time. Even when the system is averaging at only 75% of capacity, the quiet times are coming too infrequently for us to keep the queue drained.
Intuitively, you'd think that 4 requests/second is a full load and that at that rate, the machine would stay busy but there wouldn't be a lot of queuing. This is one of those times when the intuitive answer isn't even close. Sure enough, we have almost 100% utilization at that load. But our "1 request takes 1 second" notion is no longer even close to our response time. Requests still average 1 second to process, but the average response time for those requests is 13.4 seconds. All that remaining time is spent sitting on a queue. With an increase of only 0.5 requests/second, we've gone from an average queue length of 5, to 49 – almost an order of magnitude.
You can see from the table how bad the response times get as we increase the arrival rate beyond the rate that we can process them. But, rather than talking through those, let's look at them graphically. Response time first….

Notice how our curve looks nice and flat at the start, then rises rapidly, even with fairly small increments in the arrival rate. With a few more data points, and a much longer maximum queue, you'd be able to see that this curve climbs faster and faster as the arrival rate approaches what the processors can handle. (In the infinite-queue equations, it grows without bound as utilization nears 100%.) But, also notice how the response time seems to flatten out after a bit. We'll talk about that more in a moment.

Utilization isn't surprising at all. It increases linearly with the incoming load, and then flattens out. After all, you can't use more than 100% of something.

Average queue length follows the same sort of curve that response time does. Nice and flat, riding right at zero until we hit that magic threshold, then it zooms up. Queue length, though, flattens out just the same way that response time does. But, it is flattening out as it approaches the maximum length of the queue. And, right there is where we get our clue to why those two curves flatten out. How many people are in line ahead of you is one of the key factors in how long you'll be waiting in line. So, sure, the response time curve will flatten out when the average queue length starts to hit its maximum. But, a maximum response time is not exactly a cause for celebration. You see, a full queue means that requests are getting turned away.

The thing about all of this that surprised me the first time I saw this was how low the queuing starts, and how quickly it impacts response time. Basically, response time stays pretty flat and resource utilization follows a nice, linear sort of curve right up until the point where we start queuing. After that, it changes fast. If you really want to have good response time (including enough capacity to recover from the occasional burst), the sweet spot is at about 60% utilization. If our transactions take longer to process, it just means that we have to take them at a lower rate. Queuing will still start at around 60% utilization. That'll just happen at a lower arrival rate than it would if the transactions ran quicker.

Friday, June 8, 2012

The Power Of Context


Here's a silly idea: a machine that can calculate mathematical functions. I'm not talking about addition and subtraction here. I'm talking about hard stuff like logarithms and trig functions.

In the 21st century, the only thing silly about that last paragraph is the suggestion that it is a silly idea. Now, you can walk into any department store and, for about what it would cost you to go out to lunch, buy a thing the size of your hand that will do all of that and more. But, in 1837, at a time when the steam engine was one of the most complex machines around, a "difference engine" was crazy talk. (In fact, that's exactly what they thought of poor Charles Babbage. It wasn't until very recently that someone was able to build a working Babbage Engine.)
The point I'm trying to make here is that the value of an idea (or a process or a test) depends on the context in which it resides. In the context of the early 19th century, a computing machine was too radical to be believed. In the context of the early 21st century, it is so commonplace as to be outright boring. But, right in between, somewhere around the 1940s, that's the context in which it was genius.
The Context-Driven School of Testing is a collection of principles, first collected by Cem Kaner, James Bach, and Brian Marick. Those principles are:
1. The value of any practice depends on its context.
2. There are good practices in context, but there are no best practices.
3. People, working together, are the most important part of any project's context.
4. Projects unfold over time in ways that are often not predictable.
5. The product is a solution. If the problem isn't solved, the product doesn't work.
6. Good software testing is a challenging intellectual process.
7. Only through judgment and skill, exercised cooperatively throughout the entire project, are we able to do the right things at the right times to effectively test our products.
They go on to write "The essential value of any test lies in its ability to provide information (i.e. to reduce uncertainty)". And that is the notion that I want to focus on in today's post. (Check out the link above. Their examples are well worth a read.)
42
We Performance folks have all manner of tools and techniques at our disposal. We have load generators and databases and reporting utilities and spreadsheets and homegrown whatnots of every shape and flavor. We can generate more tables of numbers than anyone could ever pound through. But, what was the question? If they aren't answering some essential question, they are just 1's and 0's.
The basic question, and the one that everyone thinks of first, is "Does this application meet SLA?" Sure. That is the big question. But, there is more to it than that. I'll argue that anyone with any experience testing a given application will be able to set up a performance run in which the application will not meet its response time SLA. Hit it too hard. Hit it with too many of one particular transaction. Arrange the data for one screen in just the right way. We've all seen enough to have a few dirty tricks up our sleeves.
But, hold on a minute. What is the difference between a "dirty trick" and a "clean kill?" To get to what we consider a reasonable test, we need to understand and appropriately control for the variables that we know about. We have to hit the application at the right rate. We have to get the mix of transactions right. We have to know that the data being used is about what the application can expect to see in Production. When designing and scripting a performance test, we put a lot of effort into modeling user behavior and into calibrating our scenarios to be sure that we know what the correct settings for these variables are and that we are generating a load that will match that.
Which brings us to another question: how confident are you in that model? If an application is in Production, we can use web logs or application logs or other such measures to create our model for user behavior. In that case, we can be pretty confident that we are modeling how users actually use an application. New applications, or others where we don't have those independent logs, leave us with projections and other guesses. How confident are we in those? They are much better than nothing. Still, one thing that we can count on is that the real world is much stranger than any sane person would think up. No matter how careful our projections are, real users will find a different way to use the application.
I tested an application, once upon a time, where the projections included that X tasks would be created in a day. It went on to say that 40% of those created in a day would then be edited. Marking a task complete counted as editing it. It took a couple of weeks of testing the application daily to create a backlog of tasks that buried the response time of the application. (The requirement was later clarified to say that closing a task was a different activity, and that no backlog should be created.) The context of the application usage impacts the requirements, the design of the application and its tests, and the final results.
But, it took a couple of weeks to build up enough of a backlog for that to matter, which brings us to another question that we need to answer. We say "the test ran within tolerance" when talking about how well we hit the various rate and mix targets. What we're talking about is that if the goal is 1000 transactions in an hour, hitting 998 or 1002 does not invalidate the result. Or does it? Where is the boundary between "clean kill" and "dirty trick"? Typically, we'll treat this question as a matter of how well the scenario did what we intended it to do.
But, when we are uncertain about our usage model (and, let's face it, every model has some uncertainty to it), we have to consider a second meaning to the "What is within tolerance?" question. Using the example above, if 1002 is acceptable, how about 1020? Or 1200? Or, maybe it doesn't break until we push 2000 in an hour. In other words, we have to ask "How sensitive is the application to deviations from the model?" And, "What is the risk of the application seeing the conditions that it is sensitive to in Production?" Those are the questions that lead us into the more interesting scenarios. To answer those questions, we have to take all of those variables we were talking about earlier – like hit rate and mix and data size and user counts and everything else we uncover along the way – and we have to start spinning those dials.
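Spinning one of those dials can be sketched in a few lines. This is only an illustration: modeled_response_time is a stand-in for an actual test run, and the 2-second service time, the 5-second SLA, and the M/M/1 formula behind it are all assumptions I made up for the example. In a real sweep, each rate would be a separate load test and the response time would be measured.

```python
import math

SERVICE_TIME_S = 2.0   # assumed average processing time per transaction
SLA_S = 5.0            # assumed response time limit


def modeled_response_time(transactions_per_hour):
    """Stand-in for a real test run: an M/M/1 estimate, not a measurement."""
    utilization = transactions_per_hour / 3600.0 * SERVICE_TIME_S
    if utilization >= 1.0:
        return math.inf
    return SERVICE_TIME_S / (1.0 - utilization)


def sweep(rates, measure=modeled_response_time, sla=SLA_S):
    """Return (rate, response_time, meets_sla) for each rate tried."""
    results = []
    for rate in rates:
        resp = measure(rate)
        results.append((rate, resp, resp <= sla))
    return results


def first_failure(results):
    """Lowest rate in the sweep that missed the SLA, or None if all passed."""
    failing = [rate for rate, _, ok in results if not ok]
    return min(failing) if failing else None


if __name__ == "__main__":
    results = sweep([1000, 1002, 1020, 1200, 2000])
    for rate, resp, ok in results:
        print(f"{rate:5d}/hour  {resp:6.2f} s  "
              f"{'meets SLA' if ok else 'misses SLA'}")
    print("first failing rate:", first_failure(results))
```

With these made-up numbers, 1000, 1002 and 1020 per hour all pass and 1200 does not, so the next step would be to narrow the search between 1020 and 1200, and then ask how likely Production is to land in that range.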
The Question
We start with the simple, obvious question: "Does this application meet SLA?" We know that there are circumstances under which it will not meet its response time requirements. And, we know that there will be circumstances where it will meet those requirements. We have our projections and our models for how the application will be used, and we generally have an idea where those models are soft. In short, we take our original question and adjust it to account for the context in which the application will run. In the end, the question that we are asking is one word (and a world of meaning) different from our original question. In the end, we are asking,
WHEN does the application meet SLA?