A while back, my family and I spent a couple of weeks in Orlando, hanging out at Disney World. We hadn't been since my two oldest boys were toddlers (they are in college now). It was a blast. Of course, the rides and shows and stuff at any amusement park are just the punctuation in the story. The majority of the story is about riding the Line Ride.
You know the Line Ride, right? It is the one that you get one before you get on anything else. There is one out in front of every coaster, show or water ride. It must be the most popular thing in the park, because no matter how hot the day or crowded the building, we still hop right on the next Line Ride we see.
On our first day in the park, we were standing in line and it set me to wondering if two cars on one coaster track really qualified as a single queue/2 server model, or did the fact that there was only 1 track, and one car travelling at a time, mean that it really should be single queue/single server and the second car was just an optimization for loading and unloading the track. I had just about concluded that it should be single queue/single server when a more important question occurred to me – what sort of sicko stands around at Disney World thinking about queuing models?
The question has been bouncing around in my head ever since I got back. (The first question, that is, not the second one.) So, to finally exorcise this notion once and for all, I thought I'd put together a few quick thoughts on queuing and why we get so worried when we see even a little bit of it.
The Simple View
Without fail, when you talk about queuing theory, what comes out next is a whole rack of fancy equations, and a lot of interesting simulations based on a bunch of assumptions. For our purposes, a much simpler approach will, I think, be clearer and just as effective at what we need to get at.
First, let's think about the thing that we are modeling. I'm going to talk about an over simplified model of a basic web server. The life cycle of a request looks something like this:
1. A request arrives
2. If we have a processor available, it gets processed. If not, it gets queued.
3. Our request gets processed.
4. A reply is sent out.
Each of these steps has a parameter or two, and a couple of assumptions that we'll use to simplify our model. The first of these is the rate at which requests arrive for us to process. We'll call the number of requests that arrive per second our arrival rate. Requests arrive at random. But, the equations let us use the average arrival rate to work with. Here we'll have a variable that we can set for our model – the arrival rate.
The next step involves our queue itself. The simplest form of the queuing equations assumes that there is an infinite queue. But, that is never really the case. So, let's set our maximum queue length to 100 entries. As we'll see in a bit, even a small queue like that is more than enough to see the effect of even a little queuing. Having a finite queue introduces another thing that we want to track, though. That is, if a request arrives and the queue is full, then it gets an error response. In queuing parlance, the critical measure here is the number of requests that "balk". For a web server, this is what it means when we get an HTTP 503 response. So, the two measures that we'll want to look at here: the average queue length for any given arrival rate and the % of requests that get an error because the queue was full.
Our third step is the request getting processed. Here we have two things that we want to think about. The first is how long does it take to process an average request? Just to simplify things, let's say that in this model we can process one request in one second. It's an average – some will go faster and others slower. But, it is a simple assumption that we can make and still keep things fairly general. The second question to ask is the one from the roller coaster line – how many "servers" do we have? By servers, we don't mean physical Windows machines, but things servicing requests. That is, how many processors are we looking at? A typical web services machine is configured with 4 CPUs. So, let's call it 4 servers for our model. That means that while it takes 1 second to process 1 request, we can still process up to 4 requests per second because they are independent and running on separate servers. Our usual measurement here is Utilization. That is, what percent of the time is the system busy working on a request.
Our final step, sending the reply, isn't really a time that we measure. It's just our marker for when we are done with this request. So, there is nothing to do there.
We have one measurement that is left that we want to talk about, and it is the most common, most talked about one of all. That is Response Time. Response Time is the total amount of time that a request spends in the system – including the processing time and the amount of time it spends waiting on a queue somewhere.
Before we look at the results, let's review. Our requests take one second each to process. We have one queue feeding 4 processors. We have a maximum queue length of 100 entries. (For anyone who wants to dig deeper into the equations and models and the like, all of this means that ours is rightly specified as an M/M/4/104 arrangement.) Our key measures are Response Time, Utilization, Queue Length, and the % of requests that end in a queue full error. What we want to do is to vary the average arrival rate of our requests and see what impact this has on our key measures. (By the way, the queuing equations give average results. That is to say, average queue length, utilization and the like. That is why we talk about "sustained queuing" as opposed to a few things showing up on the queue for a moment and then being cleared.)
Results
So, let's start our arrival rate at 1 request/second. We'll increment it from there by 0.5 r/s and see where it takes us. The results of all of that come out like this:
Arrival Rate
|
Response Time
|
Utilization
|
Avg. Queue Length
|
% Queue Full Error
|
1
|
1
|
25%
|
0
|
0
|
1.5
|
1
|
37.50%
|
0
|
0
|
2
|
1.1
|
50%
|
0
|
0
|
2.5
|
1.2
|
62.50%
|
0.5
|
0
|
3
|
1.5
|
75%
|
1.5
|
0
|
3.5
|
2.5
|
87.50%
|
5
|
0
|
4
|
13.4
|
99%
|
49
|
1%
|
4.5
|
24
|
100%
|
92
|
11%
|
5
|
25
|
100%
|
96
|
20%
|
5.5
|
25.3
|
100%
|
97
|
27%
|
6
|
25.5
|
100%
|
98
|
33%
|
With an arrival rate of 1 request/second, we have 1 second response time, no queuing and the system is only 25% utilized. That is to say, that there are 4 processors waiting to take care of only 1 request at a time. 3 of them are sitting idle and there is no reason for any request to wait. The response time is just the amount of time it takes for the request to get processed.
Between 1 and 2.5 requests/second, our utilization and response time are pretty close to linear. There is still no noticeable queuing and the system isn't looking all that busy.
But, watch what happens when we get to 3 requests/second. All of a sudden, we've got sustained queuing. Our response time is up to 150% of where it started. But, the system is only 75% busy. What's going on here? Remember that our requests arrive at random. So, there will be times when a bunch arrive and get queued up. And, there will be lulls in the arrival rate when we have a chance to drain the queue that built up in the busy time. Even when the system is averaging at only 75% of capacity, the quiet times are coming too infrequently for us to keep the queue drained.
Intuitively, you'd think that 4 requests/second is a full load and that at that rate, the machine would stay busy but there wouldn't be a lot of queuing. This is one of those times when the intuitive answer isn't even close. Sure enough, we have almost 100% utilization at that load. But our "1 request takes 1 second" notion is no longer even close to our response time. Requests still average 1 second to process, but the average response time for those requests is 13.4 seconds. All that remaining time is spent sitting on a queue. With an increase of only 0.5 requests/second, we've gone from an average queue length of 5, to 49 – almost an order of magnitude.
You can see from the table how bad the response times get as we increase the arrival rate beyond the rate that we can process them. But, rather than talking through those, let's look at them graphically. Response time first….
Notice how our curve looks nice and flat at the start, then rises rapidly, even with fairly small increments in the arrival rate. With a few more data points, and a much longer maximum queue you'd be able to see that this curve is, in fact, an exponential one. But, also notice how the response time seems to flatten out after a bit. We'll talk about that more in a moment.
Utilization isn't surprising at all. It increases linearly with the incoming load, and then flattens out. After all, you can't use more than 100% of something.
Average queue length follows the same sort of exponential curve that response time does. Nice and flat, riding right at zero until we hit that magic threshold, then it zooms up. Queue length, though, flattens out just the same way that response time does. But, it is flattening out as it approaches the maximum length of the queue. And, right there is where we get our clue to why those two curves flatten out. How many people are in line ahead of you is one of the key factors in how long you'll be waiting in line. So, sure, the response time curve will flatten out when the average queue length starts to hit its maximum. But, a maximum response time is not exactly a cause for celebration. You see, a full queue means that requests are getting turned away.
The thing about all of this that surprised me the first time I saw this was how low the queuing starts, and how quickly it impacts response time. Basically, response time stays pretty flat and resource utilization follows a nice, linear sort of curve right up until the point where we start queuing. After that, it changes fast. If you really want to have good response time (including enough capacity to recover from the occasional burst), the sweet spot is at about 60% utilization. If our transactions take longer to process, it just means that we have to take them at a lower rate. Queuing will still start at around 60% utilization. That'll just happen at a lower arrival rate than it would if the transactions ran quicker.





No comments:
Post a Comment