Performance By The Numbers: Isolating The Wild Queue

Sometimes searching for the bottleneck in an application is a little bit like searching for Bigfoot. Someone told you there is one out there. You think you know what the tracks are supposed to look like. But, just how are you going to go about finding the evidence to narrow down his hiding place?

Quite some time ago, I posted an article about queuing and how it looks from a few graphs and what it means to us in Performance. We’ve seen a few issues now, on at least 3 applications on 2 platforms that I know of, that all tie back to some flavor of queuing. So, I wanted to take a few minutes and walk through an example of isolating a problem related to queuing. There are a couple of characteristic patterns here – symptoms that can show up (and have shown up) on many different applications and platforms.

The Problem

The initial problem statement is the one we always hear – high end-user response times. We need to find and correct the root cause. A little bit more digging tells us that the problem is not being seen all the time. At night, when traffic is low, response times are pretty zippy. But, the more load there is on the system, the worse they get.

The application in question is a multi-tiered sort of app – the sort we deal with every day. The end user hits a web server (or some other front-end server). This application does not call database services directly. Instead, it calls out to a variety of web services to do its work and gather its data. The web services call out to a database server, which accesses the data and returns it.

I’ve simplified this down to just 1 machine (and one sort of machine) at each tier. One of the interesting lessons in this particular investigation is just how much we can constrain the investigation using what I call “back of the envelope” sort of tools. That is, just the basic things that we can quickly draw, without getting bogged down in too many details.

Investigation – With Crayons

Scott Barber used to have a tag line that he used in some of his training. It went, essentially, “Crayons Before Calculus”. Sure, we could start with all sorts of fancy tools and obscure perfmon counters. Sometimes, that’s just the only way to go. But, before we pull out the calculus, start with crayons. Start with the simple things, the things that are easy to do and easy to interpret. Use them for as long as you can, right until they stop being useful and you really have to pull out the fancy stuff. That’s how we’re going to approach this investigation – with crayons.

So, let’s start trying to narrow this thing down. The first question we asked on this one is “where is the time being spent?” That is, which tier is the one we want to focus on?

To answer this question, we start with just a few high-level tools. Our DBA took a short trace at the database server and measured the response times for the various stored procedures in our database. We also took a look at the IIS logs at the web services server and either the IIS logs or the application logs at the web server.

From this we learned a few things…

The database is seeing a lot of traffic. Each web service hit generates multiple calls to the database, so that is to be expected. All in all, the stored procedures at the database are measuring time in double digit microseconds. Sometimes a little higher, but those are very much the exceptions. At that speed, it doesn’t seem like the database would be the culprit.
The web services are seeing a fair amount of traffic as well. That also is expected, since one hit to the web server generates multiple web service calls. According to IIS, the response times here are on the order of 50-150 milliseconds. Given the number of calls it makes to the database on the back end, that’s as fast as you could reasonably expect. Again, there are exceptions, but far and away, the bulk of them fall into that range.
The web server is seeing a LOT of traffic. It is seeing more incoming connections, actually, than any of the servers further back in the chain. Response times here are measured in high order seconds and even minutes.

#3 is our first big, important clue. Sure, each web server hit generates multiple web services hits. But, there aren’t anywhere near enough callouts from the web server to the web services server to justify a jump from “fractions of a second” out to “minutes”. Given this, we can start to focus on the web server as the place where the problems are happening.

So, naturally, the next thing to do is to start looking at resource utilization at the web server and see what we are seeing there. This, however, starts to generate more questions that answers….

First, we look at the web server. That’s where the problem is, right? We have high traffic here. Knowing that response time was slow, we built this box out to be a big, physical machine with a bunch of CPUs and RAM and other goodies. Problem is, perfmon tells us that most of that extra hardware isn’t being used. CPU is less than 10% busy. Disk is idle. There is not much paging going on. Even the network counters (output queue and the like) are telling us that it’s not network bound.
Well, if the web server isn’t busy, what is it? Maybe the response time numbers misled us. So, we start taking a look at the servers on the back end. The web services server has a little CPU time going on. But, it too is more idle than busy. Memory and our other usual counters all look good. We notice that, while there is a lot of traffic here, there is less traffic here than at the web server. That’s curious. Why would that be? We’ll come back to that question in a moment.
To complete the cycle, we go look at the database server. It’s busy, of course. But, CPU, memory, even disk utilization are all in the range of what you’d expect from a well behaved server of this sort. Not to mention, it’s response times are so small it’s hard to imagine a way to improve them even if this were the culprit.
Our usual perfmon counters aren’t helping us much at this point. They are telling us that everything is fine – even when we know it isn’t. The traffic pattern between the web server and the services server does seem odd, though. When you can’t find the issue on the servers, look off of the servers. So, at this point, we involve the network folks. A bunch of packet traces later, we learn that there are not any significant network errors or issues going on. Our initial thought that there is less traffic at the web services layer than there should be is proven to be correct, though. For a while, we investigate possible issues with BigIP, but that also does not prove to be the problem. (In the real investigation, we found that load wasn’t being distributed evenly between the web services servers. But, even at that, it couldn’t explain the response time differences we were seeing.)

At this point in the story, it looks like we have a serious mystery on our hands. Response times and our usual perfmon counters are in violent disagreement.

High response times, and low utilization….that tells us a lot. In fact, that is one of the characteristic symptoms of a whole class of problems. That particular pattern is one of the early warning signs that we are blocking on some resource. Something, somewhere, is trying to use or lock some limited resource and is blocking until it becomes available.

The Shape Of The Curve

Our next step is to take this issue into the lab and try to isolate just what it is that we are blocking on. Once again, we want to narrow it down to where we think the blockage is most likely to be seen.

It is entirely possible that there is a conflict between this application hitting that database and some other application hitting the same database. If that were the case, if the database were the resource being locked, we would see the big response time jump at the web services server – that is, at the first layer that accesses the locked resource. It’s possible this is the issue. But, given that our first big jump in response times is at a server which does not access the database directly, it doesn’t seem likely.

So, we are going to look at the first tier where response times take a mysterious increase – we are going to load the web server.

By now, everyone has heard me talk about “the shape of the curve”. (Folks are probably tired of hearing me talk about it, in fact.) But, in a case like this, the shape of the curve is exactly what we want to see. We know that at low load levels, response times are good. We know that at very high levels, response times are shockingly bad. So, we want to create a condition where we start low and increase the load until we start seeing the slower response times…and a little beyond. This isn’t just about “finding the break point”. That’s important. And, when we find that point, we’ll take a ton of measurements and probably some crash dumps and the like. But, let’s watch what happens along the way.

We start with a very low load – just 1 vuser. Then, we add users one or two at the time and we watch response times. There will be a lot of variation in those times, of course. This is one of those causes where the average response time is sufficient to tell us what we need to know. We collect that info, and plot response times as a function of the number of vusers we have in the system.

Now, we’re getting some place. Notice how the response times start off very low and very flat. On a smaller scale, they would look linear. (In fact, had we stopped at just 10 or 12 vusers, we’d have concluded that response times were linear.) But, then, at some point in the test….at some point, as we are adding load….those same response times go non-linear. Dramatically, non-linear.

While running the tests to collect this data, we also noticed another interesting behavior. After a point, the hits per second in the test weren’t increasing. Plotting that, as a function of vusers, we see this graph….

At first, adding vusers increases the number of hits per second that we can get out of the system. But, after a bit, those increases start to level out. In this particular case, they hit a hard limit at about 20 h/s.

Seeing this, we naturally set about trying all sorts of things to validate this result – we tried hitting more and different load generators; we tried several tests built around determining if the load generation tool itself was the bottleneck. In each case, no matter how we ran it or how we twisted the conditions, the result was consistent. These two graphs did not change shape.

The General Case

What do we have so far? We have a case, in Production, where our typical perfmon counters seem to disagree with our actual response times. And, we have corroborating evidence from the lab showing these two curves.

It’s not often that I point at one thing and say “Remember this.” I’m doing that right now. Look at the shape of those two curves. Remember those two curves. Those two curves are as good as DNA for identifying this class of problem. There is only one thing that I know of that can create both of these symptoms – queuing. We are blocking on something, and sitting on a queue until it is available.

The only thing left to do is to figure out what the resource is that we are blocking on.

Nailing Down the Specifics

Now that we know that we can reproduce the issue in the lab, and that we don’t need the extreme load conditions that we see in Production to do it, we need to figure out what it is that we are blocking on.

The big ticket items are all right out – processor queue, network queue, disk queue. Those things are part of our first level perfmon collection set. You would think that at this point, we would want to run another test with a bunch of obscure perfmon counters – grab the .NET LocksAndThreads counters, and the SQL Locks counters and anything else that looks like it counts locks. In my experience, though, those things will tell us that we are blocking on something, but they won’t tell us WHAT we are blocking on.

The fastest way I know to answer that question is to use DebugDiag to pull a core dump. I won’t go into the details of how to work with DebugDiag here – that would be a whole blog by itself. But, I will talk about what we found when we did that, because it is also an issue that we’ve seen on several different platforms and applications.

When we pulled the dump, and asked DebugDiag to do some analysis on it, we got back an interesting message. It said that multiple threads were “waiting on a web service call, but not waiting on the server”. What? We need to pull that error message apart.

“Waiting on a web service call…” – OK, the application tried to make a web service call and is waiting on that.
“…but not waiting on the server” – Hmmm….so, we tried to make a call, but haven’t actually sent the request yet…and that’s where it is waiting.

Aha!

Thrice upon a time, I saw a problem that behaved much like this. It seems that web service calls go through the same library that handles browser requests. (Makes sense to reuse such a thing, really. After all, they are all just web requests.) Browsers have to be good citizens. There may be zillions of people running browsers, and any one individual isn’t going to have very high throughput requirements. So, the WinInet library limits the number of connections that one process can have to any particular host or domain.

But, servers play by a whole different set of rules. Web servers have high throughput requirements – they need to be able to manage tons of web service calls in order to service those zillions of users with their little browsers. (At least we hope our web server sees that kind of traffic.) And in order to do that, it is going to have to increase that limit somehow. Fortunately, MS provided a way to override this default setting and increase the number of connections that we can have out to other servers.

Wrapping Up

Did you notice what we just did?

We started with the crayons – the simplest tools at our disposal – and used them to narrow down the search. We were able to narrow the investigation from “runs slow under load” all the way down to “blocking on an internal resource at THIS server” with nothing fancier than LogParser and Excel. No obscure counters or fancy tools. We just needed the high level, easily available stuff and a little understanding of the shape of things.

We let the simple tools guide us to the places where we really needed to pull out the microscopes and other fancy gear. And, only when we hit a point where we absolutely had no other way to get more data, did we pull out the big hairy stuff.

In each of the cases that we used this technique, and changed this setting, the throughput of that first level application improved dramatically.

Of course, it also opens up the flood gates on the back-end servers. But, that’s a another story.

Performance By The Numbers

Tuesday, February 26, 2013

Isolating The Wild Queue