SquidGuard and DansGuardian are the most popular tools for URL blocking. But are they worth the headache? squidGuard requires a full Squid deployment, which is daunting by itself: Squid is a complicated piece of software, made more so by tacking squidGuard onto it. DansGuardian is also complicated and, in my opinion, sketchy. It takes a very long time to start up if the URL list is sorted; the instructions actually tell you to un-sort the list just so that its sorting algorithm runs faster. This was unbelievable to me.

The Multnomah Education Service District (MESD) produces a URL blacklist as part of its squidGuard usage and promotion effort. It contains categories for several types of objectionable content. My goal was to take portions of this list and use it to filter web access.

My environment consists of Internet access NATed through PF on OpenBSD 5.3. The system runs on a PC Engines ALIX box, an inexpensive machine with 256MB of RAM, three 10/100 Ethernet interfaces, and a low-power 500MHz AMD Geode processor.

Instead of using a larger and more complicated system like Squid, or DansGuardian (which also requires an external proxy), I decided to try relayd. After reviewing the manual page and looking at some other examples in the wild, I came up with the following simple configuration:

PF:

    pass in quick log inet proto tcp to port 80 divert-to 127.0.0.1 port 8080

relayd.conf:

    http protocol httpfilter {
        # Return HTML error pages
        return error
        header change "Connection" to "close"
        # Block requests to unwanted hosts
        request url filter file "/etc/blacklist.txt"
    }

    relay httpproxy {
        listen on 127.0.0.1 port 8080
        protocol "httpfilter"
        forward to destination
    }

blacklist.txt:

    youtube.com

This didn't do anything; it didn't block youtube.com at all. I looked more closely at the example and noticed that there was a trailing slash. So next, I tried this blacklist.txt:

    youtube.com/

I reloaded relayd. With this small example blacklist.txt, I was able to successfully block access to YouTube. This wasn't my goal, just a test, but it worked perfectly: it blocked www.youtube.com, youtube.com, and any longer URL based on either domain name.

Next, I took the MESD blacklist and converted it to the blacklist.txt format, like so:

    ftp http://squidguard.mesd.k12.or.us/blacklists.tgz
    tar xzf blacklists.tgz
    cp blacklists/porn/urls blacklist.txt
    cat blacklists/ads/urls >> blacklist.txt
    cat blacklists/spyware/urls >> blacklist.txt
    while read line; do echo "$line/" >> blacklist.txt; done < blacklists/porn/domains
    while read line; do echo "$line/" >> blacklist.txt; done < blacklists/ads/domains
    while read line; do echo "$line/" >> blacklist.txt; done < blacklists/spyware/domains

Now I had a perfectly formatted blacklist.txt: 20.2MB in size, with 1.08 million entries. I tried relayctl reload to load it into relayd's running configuration. My ALIX box ran out of memory. (I run a swapless system using flashrd and a 4GB CF card.)

I moved the relayd config and blacklist over to my Core 2 Duo laptop and loaded the blacklist there. I found that relayd peaks at around 1500MB of RAM while building the red-black tree for a list this large; it took around 100 seconds on the Core 2 Duo. Once relayd settled down, it was left using 300MB of RAM to hold the RB tree. This proved my ALIX box was never going to handle the blacklist, but it didn't solve my problem.

I told Stuart Henderson what I was up to. He suggested I use route-to on the router, and peruse /usr/local/share/doc/pkg-readmes/squid-* for clues. His advice was perfect.
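As far as I can tell, the squid pkg-readme describes the same interception pattern: the router uses route-to to push port 80 traffic at a filtering host, and the filtering host diverts it into the local proxy. On the filtering host, the PF rule is essentially the divert-to rule shown earlier; a minimal sketch, assuming em0 is that machine's LAN interface (the interface name is my assumption, not something from the readme):

    # on the filtering host: hand routed-in web traffic to relayd
    pass in quick on em0 inet proto tcp to port 80 divert-to 127.0.0.1 port 8080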
Having never used PF's route-to keyword before, I didn't realize how powerful it was. I moved the relayd configuration and blacklist over to my Core i3 desktop, which sits mostly idle. It loaded the 1.08 million URL list into an RB tree in about 10 seconds!

Now I had PF on the Core i3 host at 192.168.1.10 directing all port 80 traffic to relayd listening on 127.0.0.1 port 8080, with relayd configured to use the 1.08 million URL filter. Then I configured the ALIX box to redirect port 80 traffic over to the Core i3:

    pass in quick on vr2 inet proto tcp from !192.168.1.10 to port 80 route-to (vr2 192.168.1.10)

That was it. Now the Core i3 host, with IP forwarding still disabled, was able to discern the destination IP and splice itself into web requests, intercepting based on the URL alone! I was also impressed with how fast relayd's RB tree populated on the Core i3.

In theory, a tuned relayd system can handle much more traffic than an equivalent system running DansGuardian or squidGuard, because OpenBSD 5.3 uses TCP socket splicing, which provides zero-copy TCP proxying. This is the same technique HAProxy uses on Linux to saturate a single 10Gbps Ethernet link with CPU time to spare. Until DansGuardian or squidGuard adopt socket splicing, they simply can't compete.

Relayd is an extensible framework, so content filtering beyond the headers is possible, but as far as I can tell it is not yet implemented. It would not work with socket splicing, but it would be a feature on par with DansGuardian and squidGuard. Another feature I'd like would be redirecting the user to a specific URL, rather than a generic error page, should they try to access a filtered site. Relayd might even be able to do this already, but I don't see an obvious way. (Perhaps a custom error page with a META redirect?)

Finally, relayd can also do header and URL inspection for SSL with Reyk's full MITM patch. See: http://www.reykfloeter.com/post/41814177050/relayd-ssl-interception This is quite slick; it even splices certificates! I haven't tried it yet, but it is certainly a useful feature for some environments. It doesn't carry the speed advantage of TCP socket splicing, and it adds encryption overhead, but that's the only way to effectively do the full deed.
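A footnote on the block-page question above: while I don't see a way to make relayd issue a redirect, relayd.conf(5) does allow restyling the built-in error page by giving return error a CSS string (the style string below is the example from the man page), which at least makes the block page recognizable:

    http protocol httpfilter {
        # return a styled HTML error page instead of the plain default
        return error style "body { background: #a00000; color: white; }"
        header change "Connection" to "close"
        request url filter file "/etc/blacklist.txt"
    }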