In Linux, most services log information about their performance in simple text files. The shell offers versatile text-processing tools to parse, sort, and search this information.
Let’s look at some of these tools, what they do, and how they can be connected into a complete workflow using shell commands. We can use shell redirections and pipes (most importantly the pipe, |) to connect them.
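As a minimal sketch (the file name here is just a placeholder), the pipe takes whatever the command on its left prints and feeds it to the command on its right:
$ grep 'something' example.log | sort
Every example below is built by chaining commands this way.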
We’ll need to use regular expressions to search and match parts of the logs; you can read more about them here: The basics of regular expressions.
Apache logs
Webservers log all requests into human-readable text files (one line per request) that by default include the visitor’s IP address, the request URL, the referrer URL (where the request came from, if available), and other parameters. Fields are separated by a single space.
Logs are rotated daily or weekly depending on system settings, so the log folder will likely contain older log files too, sometimes compressed (filenames ending in .gz or .bz2).
Visitor IP addresses are personal information and should never be publicly disclosed; the examples below show mangled IPs (replaced with xxxxx) to avoid privacy issues.
The magic of grep
Grep is a line-filtering tool: it processes text files and returns only the matching lines. It’s also capable of counting matches, but most often it’s used to filter text files for further processing.
For example, to find all the POST requests, we can use the following line:
$ grep ' "POST' access.log
xxxxx.203 - - [18/May/2021:08:40:24 +0200] "POST /wp-admin/admin-ajax.php HTTP/2.0" 200 118 "https://techtipbits.com/wp-admin/" "Mozilla/5.0"
xxxxx.203 - - [18/May/2021:08:40:24 +0200] "POST /wp-admin/admin-ajax.php HTTP/2.0" 200 74 "https://techtipbits.com/wp-admin/" "Mozilla/5.0"
Or to count all the POST requests (grep -c):
$ grep -c ' "POST' access.log
68
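The commands below refer to fields by their position, so it helps to break one of the POST lines above into its space-separated fields (this assumes the default combined log format; note how the quotes and brackets stay glued to the fields they surround):
1: xxxxx.203 (visitor IP, mangled)
2: - (remote logname, usually empty)
3: - (remote user, usually empty)
4: [18/May/2021:08:40:24 (date and time)
5: +0200] (timezone offset)
6: "POST (request method)
7: /wp-admin/admin-ajax.php (request URL)
8: HTTP/2.0" (protocol)
9: 200 (response code)
10: 118 (response size in bytes)
11: "https://techtipbits.com/wp-admin/" (referrer)
12: "Mozilla/5.0" (user agent)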
Slicing lines with cut
The “cut” command is useful for extracting parts of a log file. It works by selecting character positions or fields delimited by a specific character. There is one drawback: it can’t handle consecutive delimiters, but luckily Apache log fields are separated by a single space each.
When working on Apache log files, we’ll need to set the delimiter to a space using the -d" " parameter (or -d\ with an escaped space – that’s why there are two spaces after the backslash in that form: the first is the delimiter itself, the second separates it from the next argument).
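As a quick sketch, the two forms are interchangeable – both of these commands read the request URL field from the same file (note the two spaces after the backslash in the second one):
$ cut -f7 -d" " access.log
$ cut -f7 -d\  access.log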
For example, to extract the protocol of every request (field 8, like HTTP/1.1 or HTTP/2.0), we could use the following line:
$ cut -f8 -d" " techtipbits-access.log
HTTP/1.1"
HTTP/1.1"
HTTP/1.1"
HTTP/1.1"
Or to combine it with the previous one, let’s look at all the result codes (field 9) of POST requests:
$ grep ' "POST' techtipbits-access.log | cut -f9 -d" "
200
200
200
403
200
403
200
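Grep works anywhere in a pipeline, so we can keep filtering. As a rough sketch on the same file, this prints the full log lines of the POST requests that were rejected with a 403 (the ' 403 ' pattern could in rare cases also match other fields, like the response size, so treat it as a quick filter rather than an exact one):
$ grep ' "POST' techtipbits-access.log | grep ' 403 '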
Sorting and counting uniques
To get a breakdown of the different values (for example, the number of each result code in the sample above), we’ll need to sort the results and count the unique values. It’s essential to sort them first because the utility that counts uniques only works on consecutive repeats.
This example demonstrates the issue when we’re using “uniq -c” to count repeats:
$ grep ' "POST' techtipbits-access.log | cut -f9 -d" " | uniq -c
3 200
1 403
1 200
1 403
62 200
It only merges consecutive identical lines (three 200 responses, then one 403, then another 200, and so on), but we need a total for each value. Sorting the results before counting uniques fixes this problem easily:
$ grep ' "POST' techtipbits-access.log | cut -f9 -d" " | sort | uniq -c
66 200
2 403
Parsing multiple files at once
Until now we’ve worked on one file only. Frequently (especially in the case of Apache logs) log files are split between days or weeks, and it’s convenient to work on them together. This is done with the “cat” command, which simply prints the contents of all the files given as parameters, so we can feed lines from many files into the same pipeline at once.
In our case, the past log files are all compressed (this is frequently the case); the “zcat” command uncompresses files before printing them. To handle non-compressed files as well (so we can work on both), we can use “zcat -f”, which transparently works for compressed and uncompressed files alike.
Let’s assume that all the log file names contain the pattern “access.log”. Some of them are named “techtipbits-access.log”, others look like “techtipbits-access.log.XX.gz”, where XX is a number or a date.
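This already lets us repeat an earlier count across every log file at once. As a quick sketch (assuming the file name pattern above), this counts the POST requests in the current and all the rotated logs together:
$ zcat -f *access.log* | grep -c ' "POST'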
To find out how many different URLs were accessed, we can cat all the log files, cut the URL field (number 7), then sort them, remove duplicates, and finally count the number of lines. “sort -u” sorts lines and removes duplicates (“u” stands for unique); “wc -l” counts the number of lines in its input.
$ zcat -f *access.log* | cut -f7 -d" " | sort -u | wc -l
2162
Using sed to edit lines
Sed is a stream editor capable of doing string replacements using regular expressions. It has extensive capabilities beyond simple replacements, but for now we’ll use it to search and replace parts of the log file to simplify processing.
One common task is to extract the day from the date/time that’s saved with each request in the logs. To accomplish this, we’ll use sed 's/SOURCE/DEST/', which replaces SOURCE with DEST.
In the example below, the pattern ":<anything>" gets replaced with nothing, so essentially everything after the first colon is removed. The same effect could be achieved with "cut -f1 -d:" (shown after the output below), but that doesn’t work if there is extra information after the date field that should be retained, so we’ll use sed for now.
$ cut -f4 -d" " access.log
[18/May/2021:00:42:30
[18/May/2021:00:42:32
[18/May/2021:00:48:29
[18/May/2021:01:57:13
...
$ cut -f4 -d" " access.log | sed 's/:.*//'
[18/May/2021
[18/May/2021
[18/May/2021
[18/May/2021
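For completeness, here is the cut-based alternative mentioned above – splitting on the colon and keeping the first field prints exactly the same dates:
$ cut -f4 -d" " access.log | cut -f1 -d: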
Grouping lines to calculate breakdowns
To combine all of the examples, let’s calculate group sums: the number of different URLs downloaded on each day. To achieve this, we’ll need to extract the dates and URLs, then remove duplicates. This way we’ll be left with one line per URL for each date, like DATE1 URL1, DATE1 URL2, DATE2 URL1, etc. Then we count the number of lines for each day by removing the URLs and counting consecutive identical lines, which now contain dates only (uniq -c).
Cut is capable of returning multiple fields (set by the -f parameter), so this will give us the list of datetimes and URLs:
$ cut -f4,7 -d" " access.log
[18/May/2021:00:42:30 /robots.txt
[18/May/2021:00:42:32 /category-sitemap.xml
[18/May/2021:01:57:13 /.env
[18/May/2021:01:57:13 /wp-content/
Now we need to remove the time part of the date to be able to group them. We’re using sed to search each line for a string that starts with a colon followed by anything but a space ([^ ]*), and replace it with an empty string.
$ cut -f4,7 -d" " access.log | sed 's/:[^ ]*//'
[18/May/2021 /robots.txt
[18/May/2021 /category-sitemap.xml
[18/May/2021 /author/techtipbits/
[18/May/2021 /.env
[18/May/2021 /wp-content/
There is a square bracket at the beginning of each line; we can remove it with sed by adding "s/^\[//" – ^ means the beginning of the line in regex, and the bracket has to be escaped because it’s a special character. We can add multiple expressions to the same sed command by using semicolons to separate them.
$ cut -f4,7 -d" " access.log | sed 's/:[^ ]*//;s/^\[//'
18/May/2021 /robots.txt
18/May/2021 /category-sitemap.xml
18/May/2021 /author/techtipbits/
18/May/2021 /.env
18/May/2021 /wp-content/
Now we can sort and remove duplicates (sort -u), cut the URLs off the end (cut -f1 -d" "), then count the repeated dates (uniq -c), and we’re done:
$ cut -f4,7 -d" " access.log | sed 's/:[^ ]*//;s/^\[//' | sort -u | cut -f1 -d" " | uniq -c
180 18/May/2021
Obviously, this is much more useful when done across multiple log files – prepend the pipeline with a cat command (and drop the file name from cut) to read lines from all the files at once. We’ll use “zcat -f” because some of the older log files (but not all of them) are compressed.
$ zcat -f *access.log* | cut -f4,7 -d" " | sed 's/:[^ ]*//;s/^\[//' | sort -u | cut -f1 -d" " | uniq -c
300 17/May/2021
180 18/May/2021
396 23/Apr/2021
490 24/Apr/2021
336 25/Apr/2021
Sorting lines by numeric values
Each Apache log file line includes the response code from the webserver (field 9), which is normally “200” in case of successfully completed requests, 301 in case of redirects, 404 for not found errors, etc. Let’s find out what the most common result codes are, and sort them by the number of occurrences.
Sorting the codes and then counting unique values gives us the breakdown we need; sorting that output again in reverse numeric order (sort -rn) puts the most common ones on top.
$ cut -f9 -d" " testaccess.log | sort | uniq -c | sort -rn
122314 200
4452 302
3339 404
1256 403
1191 301
275 201
100 304
26 400
9 401
5 405
3 409
1 500
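As a closing sketch that combines everything above (assuming the same file name pattern as before), this lists every requested URL across all the log files together with the number of times it was hit, most popular first:
$ zcat -f *access.log* | cut -f7 -d" " | sort | uniq -c | sort -rn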