Application Performance Data Visualization « Capacity Management Tips & Tricks

Performance Data Visualization I

This is the first post of a series on application performance data visualization. I’ll cover a simple but extremely useful technique that allows you to visualize the application behaviour against different workloads.

Before we get our hands dirty I need to explain a concept called service time (S).

Service time is the time that a resource in your system is busy during a given transaction. The resource may be physical, like a disk or a processor, or logical like an application process or a thread.

If your system is, let’s say, a store, you can think of the counter as a resource in your system. The time the counter takes to check out would be its service time.

It’s easy to understand that the counter will take more time to check out a hundred items than it takes to handle just a couple of items, so its service time is directly related to the amount of work (or load) it is subjected to. We’ll not delve into details like resource sharing for now, but keep in mind that in a real system frequently the service time is not related only to the load of an individual resource.

When I first started to figure out the importance of the service time in application performance evaluation, I was amazed to discover that many popular softwares had no way to register the service time for a given transaction while others did it in a poor and almost useless way.

As an example, nginx would log the time to serve a request in seconds before version 0.5.19. Consider that an HTTP request usually takes less than a single second for most websites and you can figure how useless this is.

Currently, nginx has a variable called $request_time that allows you to log the time it takes to serve a request in milliseconds. The Apache Web Server has a directive called %D that logs the time it took to serve a request in microseconds. You can find more information about the Apache log directives in the mod_log_config documentation.

Unfortunately neither of these options is enabled in the default configuration files so you need to manually configure these options. I’ll use Apache 2.2.3 running on a CentOS 5.5 system as a testbed for this article.

While it’s beyond the scope of this article to describe the Apache configuration directives, I’ll try to make it simple for someone not familiar with the software to understand what we’re doing.

I’ll edit the Apache configuration file located at /etc/httpd/conf/httpd.conf (remember to make a backup of the original file before commiting any changes) and look for the existent LogFormat directives.

There are four pre-configured LogFormat directives in the default configuration file:

LogFormat “%h %l %u %t ”%r” %>s %b ”%{Referer}i” ”%{User-Agent}i”” combined

LogFormat “%h %l %u %t ”%r” %>s %b” common

LogFormat “%{Referer}i -> %U” referer

LogFormat “%{User-agent}i” agent

Most sites use the “combined” format or some variation of it, so I’ll create a new log format called “combinedcap” by simply copying the combined format and adding the %D directive. Then I’ll change the CustomLog directive so it will use our new log format:

LogFormat “%h %l %u %t ”%r” %>s %b ”%{Referer}i” ”%{User-Agent}i” %D” combinedcap
CustomLog logs/access_log combinedcap

If you’re not in a production server starting the httpd should be pretty straight forward, we don’t need any further configuration. Just make sure you allow incoming connections to the TCP port 80 if you’re running any kind of firewall (CentOS will enforce security constraints by default).

In a standard setup your log files should be in /etc/httpd/logs/. The access_log file should look like this (I’ll highlight the information we’ll need for our analisys):

192.168.98.1 – – [14/Aug/2010:23:47:36 -0300] “GET / HTTP/1.1″ 403 5043 “-” “Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-us) AppleWebKit/533.17.8 (KHTML, like Gecko) Version/5.0.1 Safari/533.17.8″ 1523

192.168.98.1 – – [14/Aug/2010:23:47:36 -0300] “GET /icons/apache_pb.gif HTTP/1.1″ 200 2326 “http://192.168.98.129/” “Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-us) AppleWebKit/533.17.8 (KHTML, like Gecko) Version/5.0.1 Safari/533.17.8″ 37566

192.168.98.1 – – [14/Aug/2010:23:47:36 -0300] “GET /icons/powered_by_rh.png HTTP/1.1″ 200 1213 “http://192.168.98.129/” “Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-us) AppleWebKit/533.17.8 (KHTML, like Gecko) Version/5.0.1 Safari/533.17.8″ 62461

192.168.98.1 – – [14/Aug/2010:23:47:36 -0300] “GET /favicon.ico HTTP/1.1″ 404 288 “http://192.168.98.129/” “Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-us) AppleWebKit/533.17.8 (KHTML, like Gecko) Version/5.0.1 Safari/533.17.8″ 2172

So now we have the time the server receied the request and how long it took to serve it. In this example I’ve used the standard Apache 2 Test Page, a static HTML with some images. The longest request took 0,037 seconds and the fastest took 0,0015 seconds.

Obviously using data from a local server will be not be useful unless for benchmarking purposes. We’ll use production data from a live website for our analisys. This data was collected for one day on a relatively busy website.

In order to harness all this data we’ll use some Perl scripts and R. If you feel like using Python or any other programming language, stay at ease and don’t forget to share it !

You can download the Perl script that I use to summarize the Apache log files here.

Once you have enough data (I’d recommend at least a couple of days on a small site), simply run the script and your output should look something like this (I’ll hide diagnostics/progress indicators for the sake of clarity):

gzip -dc logs/accesslog.20100701* | ./apache_svc.pl 2>/dev/null | tee apache_svc_20100701.out

rate;svctime;freq

32;45316.739953;423

127;209501.459075;562

90;148895.554217;166

71;115292.206897;174

206;355824.196262;214

118;196627.652339;791

102;172595.483986;281

200;345448.035616;365

18;24767.897799;1272

125;207119.412238;621

16;22090.723684;1292

44;68436.582031;256

55;86475.967213;305

84;135554.740506;158

27;36318.533965;633

233;406275.000000;3

161;278944.878049;328

Now we have lots of data do chew on, my resulting output file has a total of 229 data lines plus the header. Let’s take a closer look at a randomly chosen line, like “32;45316.739953;423″:

  • The average service time for an arrival rate of 32 hits per second is roughly 0,045 seconds
  • During the period that was analyzed there was 423 individual seconds where we had 32 hits

I’ll import this data to R to see what kind of information we can extract from the data:

> svc <- read.csv("apache_svc_20100701.out",
sep = ";")> summary(svc)rate          svctime            freqMin.   :  1.0   Min.   :  1570   Min.   :   1.01st Qu.: 64.0   1st Qu.:102105   1st Qu.: 180.0Median :121.0   Median :200154   Median : 279.0Mean   :121.4   Mean   :205012   Mean   : 377.33rd Qu.:178.0   3rd Qu.:310029   3rd Qu.: 559.0Max.   :308.0   Max.   :451029   Max.   :1301.0

I’ll begin by plotting a simple scatter plot that correlates the arrival rate with the service time, this gives us an idea of how the server handles traffic. Note that I’ve converted the Service time to seconds simply by multiplying it by 10E-6.

So here we can see an almost linear relation between the arrival rate and service time, and a slightly irregular behavior when we reach more than 200 hits per second. Bear in mind that this may be caused by the lack of samples, so you may filter the data and remove service time values when you don’t have enough data samples, there’s no rule of the thumb for that.

Now let’s say you want to know how is your system workload. I’ll use the frequency field in order to visualize how many times the server was subjected to a certain arrival rate:

Conclusions

Service time and arrival rate can be very useful in order to determine the behavior of your system in different workloads, as well as help you to discover what is the current load that your system is subjected to. The techniques presented in this article can be used for almost any system as long as it provides a way to register the service time.

It’s important to note that it takes some effort to realize how to extract information from the raw log files, but once you understand the data there’s a lot of useful information that may help you in capacity planning and even in troubleshooting some quirky behaviour in your systems.

Feel free to contact me if you need any further information, I appreciate your feedback.

Deixe um comentário

Deixe um comentário

Preencha os seus dados abaixo ou clique em um ícone para log in:

Logo do WordPress.com

Você está comentando utilizando sua conta WordPress.com. Sair /  Alterar )

Foto do Facebook

Você está comentando utilizando sua conta Facebook. Sair /  Alterar )

Conectando a %s

%d blogueiros gostam disto: