Processing Server Logs

New Site Architecture

Yesterday I started migrating from a dedicated VPS to Amazon’s S3 for hosting.[1] The goal of this site since 2011 has been to use entirely static files, eliminating a lot of overhead and complication when it comes to deployment and maintenance. Apache and nginx make for simple deployments, but maintaining them and the underlying server is still overhead. As I write this, the server has 41 security updates waiting to be installed and Ubuntu 10.04 LTS reaches end of life in April. This called for a simpler solution.

Hosting a website on Amazon S3 is very simple. Essentially you set up a bucket called www.mysite.com and create a CNAME record pointing that subdomain at the bucket’s website endpoint. Instant static website served out of an S3 bucket. Check out the static website on S3 documentation if you’re interested in learning more.

Putting aside the actual implementation details, I created an S3 bucket and put Fastly in front of it. After uploading all the static files and running a quick test on fast.joshuakehn.com, everything looked good, but I wasn’t certain I hadn’t forgotten anything from the old server. The easiest way to check was to grab all the old access logs, comb through them for status codes and URLs, and hit the new endpoint with them.

I collected all of the access logs from Apache and compiled them into a tarball to easily download. Here’s a sample from those files so the format is better understood:

184.106.100.25 - - [15/Mar/2015:18:29:34 +0000] "GET /atom.xml HTTP/1.1" 200 88954 "-" "Feedbin - 1 subscribers"
184.106.100.25 - - [15/Mar/2015:18:35:53 +0000] "GET /2014/8/19/perfect-rib-eye-steak.html HTTP/1.1" 200 5256 "https://www.google.co.uk/" "Mozilla/5.0 (iPad; CPU OS 8_1_2 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B440 Safari/600.1.4"
184.106.100.25 - - [15/Mar/2015:18:35:53 +0000] "GET /theme/css/style.css HTTP/1.1" 200 3372 "http://joshuakehn.com/2014/8/19/perfect-rib-eye-steak.html" "Mozilla/5.0 (iPad; CPU OS 8_1_2 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B440 Safari/600.1.4"
184.106.100.25 - - [15/Mar/2015:18:35:53 +0000] "GET /assets/images/steak/IMG_5117.jpg HTTP/1.1" 200 118043 "http://joshuakehn.com/2014/8/19/perfect-rib-eye-steak.html" "Mozilla/5.0 (iPad; CPU OS 8_1_2 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B440 Safari/600.1.4"
184.106.100.25 - - [15/Mar/2015:18:35:53 +0000] "GET /theme/js/jquery-2.1.1.min.js HTTP/1.1" 200 29876 "http://joshuakehn.com/2014/8/19/perfect-rib-eye-steak.html" "Mozilla/5.0 (iPad; CPU OS 8_1_2 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B440 Safari/600.1.4"
184.106.100.25 - - [15/Mar/2015:18:35:53 +0000] "GET /assets/images/steak/DSC03215.jpg HTTP/1.1" 200 127138 "http://joshuakehn.com/2014/8/19/perfect-rib-eye-steak.html" "Mozilla/5.0 (iPad; CPU OS 8_1_2 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B440 Safari/600.1.4"
184.106.100.25 - - [15/Mar/2015:18:35:53 +0000] "GET /assets/images/steak/DSC03218.jpg HTTP/1.1" 200 127322 "http://joshuakehn.com/2014/8/19/perfect-rib-eye-steak.html" "Mozilla/5.0 (iPad; CPU OS 8_1_2 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B440 Safari/600.1.4"
184.106.100.25 - - [15/Mar/2015:18:35:53 +0000] "GET /assets/images/steak/DSC03213.jpg HTTP/1.1" 200 127718 "http://joshuakehn.com/2014/8/19/perfect-rib-eye-steak.html" "Mozilla/5.0 (iPad; CPU OS 8_1_2 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B440 Safari/600.1.4"
184.106.100.25 - - [15/Mar/2015:18:35:53 +0000] "GET /assets/images/steak/DSC03217.jpg HTTP/1.1" 200 176650 "http://joshuakehn.com/2014/8/19/perfect-rib-eye-steak.html" "Mozilla/5.0 (iPad; CPU OS 8_1_2 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B440 Safari/600.1.4"
184.106.100.25 - - [15/Mar/2015:18:35:54 +0000] "GET /assets/images/steak/DSC03220.jpg HTTP/1.1" 200 144367 "http://joshuakehn.com/2014/8/19/perfect-rib-eye-steak.html" "Mozilla/5.0 (iPad; CPU OS 8_1_2 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B440 Safari/600.1.4"
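For reference, here is a minimal sketch of pulling those fields apart in Python. It assumes every line follows the combined log format shown in the sample above. The regular expression and the `parse_line` helper are my own illustration, not something the rest of this post depends on.

```python
import re

# Pattern for the Apache "combined" log format shown in the sample above.
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it doesn't match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    return fields


# First line of the sample above.
sample = (
    '184.106.100.25 - - [15/Mar/2015:18:29:34 +0000] '
    '"GET /atom.xml HTTP/1.1" 200 88954 "-" "Feedbin - 1 subscribers"'
)
entry = parse_line(sample)
print(entry["status"], entry["path"])  # 200 /atom.xml
```

Only the status and path matter for the check described here; the other named groups are there to make the format easier to read.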

After downloading the access logs I combined them into a single file and started cleaning out the junk. Junk here means requests I knew didn’t need to be checked, or requests that were obviously repeated multiple times.

$ <* > composite_logs
$ < composite_logs | sed '/HTTP-Monitor/d' > filtered_logs
$ < filtered_logs | ag 'HTTP/1\.(0|1)" (200|301|404) ' > clean_status_lines
$ < clean_status_lines | awk '{ print $9, $7 }' | sort | uniq > unique_urls
$ < unique_urls | sed "/[?'\"#]/d" > unique_urls_2

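You can collapse these commands into a single pipeline with something like this. The filter on quotes and query strings has to run after awk, since every raw log line contains double quotes:

$ <* | sed '/HTTP-Monitor/d' | ag 'HTTP/1\.(0|1)" (200|301|404) ' | awk '{ print $9, $7 }' | sed "/[?'\"#]/d" | sort | uniq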

The awk field numbers are specific to my Apache log file format. Column 9 is the status code and column 7 is the URL.
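The same filtering can be sketched in Python, which makes the column mapping explicit. This is a rough equivalent under the assumption that fields are separated by whitespace exactly as awk sees them. The `unique_urls` function is a name I made up for illustration, and the last two sample lines are invented to exercise the filters.

```python
import re

# Same status/protocol filter that the ag step applies.
STATUS_RE = re.compile(r'HTTP/1\.(0|1)" (200|301|404) ')
# Characters the last sed step is meant to reject in a URL.
JUNK_CHARS = set("?'\"#")


def unique_urls(lines):
    """Return sorted, de-duplicated "status url" strings from raw log lines."""
    seen = set()
    for line in lines:
        if "HTTP-Monitor" in line:
            continue
        if not STATUS_RE.search(line):
            continue
        fields = line.split()
        # awk counts fields from 1, Python from 0: $9 -> fields[8], $7 -> fields[6].
        status, url = fields[8], fields[6]
        if JUNK_CHARS & set(url):
            continue
        seen.add("{} {}".format(status, url))
    return sorted(seen)


prefix = '184.106.100.25 - - [15/Mar/2015:18:29:34 +0000] '
sample = [
    prefix + '"GET /atom.xml HTTP/1.1" 200 88954 "-" "Feedbin - 1 subscribers"',
    prefix + '"GET /atom.xml HTTP/1.1" 200 88954 "-" "Feedbin - 1 subscribers"',
    # Hypothetical lines: a query-string request and a monitoring hit.
    prefix + '"GET /search?q=steak HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    prefix + '"GET / HTTP/1.1" 200 1024 "-" "HTTP-Monitor/1.1"',
]
print(unique_urls(sample))  # ['200 /atom.xml']
```

One difference worth knowing: Python sorts by code point, while the shell `sort` follows the current locale, so the ordering of the two outputs may not match line for line.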

At this point I have a clean list of URLs which looks something like the following:

$ head -n 10 unique_urls_2
200 /
200 /2010/
200 /2010/10/
200 /2010/10/1/Sorting-Algorithms.html
200 /2010/10/11/
200 /2010/10/11/HTML-5-Privacy-Concerns.html
200 /2010/10/15/
200 /2010/10/15/Mobile-Stylesheets.html
200 /2010/10/20/

With only 4,354 URLs to check I didn’t need to be fancy with the request matching process. I wrote a quick Python script that extracts the status code and URL from each line, makes the request, and prints an error if the status code doesn’t match the expectation.

#!/usr/bin/env python3
import requests

BASE_URL = "http://fast.joshuakehn.com"

with open("unique_urls_2") as f:
    for line in f:
        # Each line is "<status> <path>", as produced by the awk step.
        expect_status, url = line.strip().split(" ", 1)
        # Don't follow redirects, otherwise an expected 301 is
        # reported with the status of the page it redirects to.
        res = requests.get(BASE_URL + url, allow_redirects=False)

        if res.status_code != int(expect_status):
            print("{} -> {} != {}".format(res.status_code, url, expect_status))
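If you want to try the comparison logic without hitting the network, one option is to pass the fetching step in as a function. This is only a sketch of that idea: `find_mismatches`, `fetch_status`, and the `fake_site` table are hypothetical names and data, not part of the script I actually ran.

```python
def find_mismatches(lines, fetch_status):
    """Compare expected statuses against what fetch_status(url) returns.

    fetch_status is a hypothetical callable that takes a URL path and
    returns an integer status code.
    """
    mismatches = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the input file
        expect_status, url = line.split(" ", 1)
        actual = fetch_status(url)
        if actual != int(expect_status):
            mismatches.append((url, int(expect_status), actual))
    return mismatches


# Stand-in for the network: a made-up table of what the new host returns.
fake_site = {"/": 200, "/2010/": 200, "/old-page.html": 404}

expected = ["200 /", "200 /2010/", "301 /old-page.html"]
result = find_mismatches(expected, lambda url: fake_site.get(url, 404))
print(result)  # [('/old-page.html', 301, 404)]
```

For a real run, the callable could wrap `requests.get(..., allow_redirects=False).status_code` against the new host.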

  1. There will be more details about this, especially around Fastly/Varnish configuration. ↩︎

written March 20th, 2015
