Chapter 10. Transparent Caching

Table of Contents
The Problem with Transparency
The Transparent Caching Process
Network Layout
Filtering Traffic
Kernel Redirection (not done)
Squid Settings (not done)

When disk caching is implemented in an operating system kernel, all applications automatically see the benefit: the caching happens without their knowledge. Since the operating system ensures that on-disk copies of data are always the same as the cached copies, the data that an application reads is never out of date.

With web caching, however, there is a chance that the original data can change without the cache knowing. Squid uses refresh patterns (described in Chapter 11) to decide when a cached object is no longer fresh enough to serve. If these rules keep objects for too long, you could end up serving stale objects to clients. Even if the rules are perfect, an incorrectly configured origin server could cause Squid to return old objects. Because users could retrieve an out-of-date page, you should not implement caching without their knowledge.
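To make the risk of stale objects concrete, here is a minimal sketch of a freshness check built from a minimum age, a percentage of the object's age when it was fetched, and a maximum age. It is a simplified illustration and not Squid's actual code: the names RefreshRule and is_fresh, and all of the example values, are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional

DAY = 24 * 60 * 60  # seconds in a day


@dataclass
class RefreshRule:
    """Hypothetical freshness rule; the field names are illustrative only."""
    min_age: float   # seconds an object is always considered fresh
    percent: float   # allowed fraction of the object's age at fetch time
    max_age: float   # seconds after which an object is always stale


def is_fresh(rule: RefreshRule, now: float, fetched_at: float,
             last_modified: Optional[float]) -> bool:
    """Return True if the cached copy may be served without revalidation."""
    age = now - fetched_at
    if age > rule.max_age:
        return False          # older than the hard upper limit
    if age <= rule.min_age:
        return True           # younger than the guaranteed minimum
    if last_modified is None:
        return False          # nothing to base a heuristic on
    age_at_fetch = fetched_at - last_modified
    if age_at_fetch <= 0:
        return False          # object changed at or after fetch time
    # Fresh while the time in cache is a small fraction of how old the
    # object already was when it was fetched.
    return age / age_at_fetch < rule.percent


# A page last modified 10 days before it was fetched at time 0.
careful = RefreshRule(min_age=0, percent=0.2, max_age=3 * DAY)
greedy = RefreshRule(min_age=0, percent=1.0, max_age=30 * DAY)

now = 2.5 * DAY
careful_result = is_fresh(careful, now, 0, -10 * DAY)
greedy_result = is_fresh(greedy, now, 0, -10 * DAY)
```

With these example values the careful rule treats the object as stale after two and a half days, while the greedy rule would still serve it. If the page changed on the origin server in the meantime, clients behind the greedy rule would see the old copy.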

Squid can be configured to act transparently. In this mode, clients do not configure their browsers to access the cache; instead, Squid transparently picks up the appropriate packets and caches the requests. This solves the biggest problem with caching: getting users to use the cache server. Users hardly ever know how to configure their browsers to use a cache, which means that support staff have to spend time with every user getting them to change their settings. Some users are worried about their privacy. Others think that, because the cache is a host between them and the Internet, it must be slower (this is certainly not the case, as a few tests with the client program will show).

However, transparent caching isn't really transparent. Setting up the cache is transparent, but using it isn't. Users will notice a difference in error messages, and even the progress bars that browsers show can behave differently.

The Problem with Transparency

When Squid transparently caches a site, the source IP address of the connection changes: the request comes from the cache server rather than the client machine. This can play havoc with web sites that use IP-address authentication (such sites only allow requests from a small set of IP addresses, rather than authenticating requests with a name and password).

Since the cache changes the source IP address of the connection, some servers may deny access to legitimate users. In many cases, this will cost users money (they may be paying for the service, or they may use the information on that site to make money).
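The failure is easy to picture from the web site's side. The following minimal sketch shows how an IP-authenticating server might check the source address of a connection. The function name, the allowed network, and both addresses are hypothetical values taken from documentation address ranges, not from any real site.

```python
import ipaddress

# Hypothetical list of networks that the web site's operator has
# agreed to serve (an assumption for this example).
ALLOWED_NETWORKS = [ipaddress.ip_network("192.0.2.0/28")]


def is_allowed(source_ip: str) -> bool:
    """Return True if the connection's source address is on the allow list."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in ALLOWED_NETWORKS)


# Hypothetical addresses: a paying customer's machine and the ISP's cache.
CLIENT_IP = "192.0.2.5"
CACHE_IP = "198.51.100.10"

direct = is_allowed(CLIENT_IP)       # request sent straight from the client
via_cache = is_allowed(CACHE_IP)     # same request, intercepted by the cache
```

In this sketch the direct request is accepted and the intercepted one is refused, even though the same user made both. The server only ever sees the address of the machine that opened the connection.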

If you know your network inside out, and know exactly who would be accessing a site like this, there is probably no problem with using transparent caching. If this is the case, though, it might be easier to simply change all of your users' settings.

Dialup ISPs generally have little problem implementing transparent caching, since dialup customers almost always get a different IP address whenever they connect. These customers therefore cannot access sites which require a static IP address anyway, so nothing breaks when their requests start coming from the cache server.

ISPs which transparently cache leased-line customers are the most likely to have problems with IP-authenticating servers. If you are phasing transparency in for such an ISP, you must make sure that your customers know all the implications. They must know how to force a page to refresh, whom to tell if they find out-of-date pages (so that the Squid refresh rules can be changed), and how the source IP address of their requests is going to change. You must not simply install the transparent cache and hope for the best!