Chapter 2. Terminology and Technology

Table of Contents
What Squid is
What Squid is not
Supported Protocols
Inter-Cache Communication Protocols
Firewall Terminology

What Squid is

Squid is a free, high-speed, Internet proxy-caching program. So, what is a "proxy cache"?

According to Project Gutenberg's Online version of Webster's Unabridged Dictionary:

Proxy. An agent that has authority to act for another.

Cache. A hiding place for concealing and preserving provisions which it is inconvenient to carry

Squid acts as an agent, accepting requests from clients (such as browsers) and passes them to the appropriate Internet server. It stores a copy of the returned data in an on-disk cache. The real benefit of Squid emerges when the same data is requested multiple times, since a copy of the on-disk data is returned to the client, speeding up Internet access and saving bandwidth. Small amounts of disk space can have a significant impact on bandwidth usage and browsing speed. (?costs?)

Internet Firewalls (which are used to protect company networks) often have a proxy component. What makes the Squid proxy different from a firewall proxy? Most firewall proxies do not store copies of the returned data, instead they re-fetch requested data from the remote Internet server each time.

Squid differs from firewall proxies in other ways too:

When, in this book, we refer to a 'cache', we are referring to a 'caching proxy' - something that keeps copies of returned data. A 'proxy' on the other hand, is a program which do not cache replies.

The web consists of HTML pages, graphics and sound files (to name but a few!). Since only a very small portion of the web is made up of text, referring to all cached data as pages is misleading. To avoid ambiguity, caches store objects, not pages.

(? trash Many Internet servers support more than one protocol. A given server can support more than one type of query protocol. A web server uses the Hyper Text Transfer Protocol (HTTP) to serve data. An older protocol, the File Transfer Protocol (FTP) often runs on web servers too. Muddling them up would be bad. Caching an FTP response and returning the same data to the client on a subsequent HTTP request would be incorrect. Squid uses the complete URL to uniquely identify everything stored in the cache.

So as to avoid returning out of date data to clients, objects must be expired. Squid therefore allows you to set refresh times for objects, ensuring old data is not returned to clients.

Squid is based on software developed for the Harvest project, which developed their 'cached' (pronounced 'Cache-Dee') as a side project. Squid development is funded by the National Laboratory of Network Research (NLANR), who are in turn funded by the National Science Foundation (NSF). Squid is 'open source' software, and although development is done mainly with NSF funding (??), features are added and bugs fixed by a team of online collaborators.

Why Cache?

In the USA

Small Internet Service Providers (ISPs) cache to reduce their line costs, since a large portion of their operating costs are infrastructural, rather than staff related.

Companies and content providers (such as AOL) have recently started caching. These organizations are not short of bandwidth (indeed, they often have as much bandwidth as a small country), but their customers occasionally see slow response. There are numerous reasons for this:

Origin Server Load

Raw bandwidth is increasing faster than overall computer performance. These days many servers act as a back-end for one site, load balancing incoming requests. Where this is not done, the result is slow response. If you have ever received a call complaining about slow response, you will know the benefit of caching - in many cases the user's mind is already made up: it's your fault.

Quick Abort

Squid can be configured to continue fetching objects (within certain size limits) even although somebody who starts a download aborts it. Since there is a chance of more than one person wanting the same file, it is useful to have a copy of the object in your cache, even if the first user aborts. Where you have plenty of bandwidth, this continued-fetching ensures that you will be a local copy of the object available, just in case someone else wants it. This can dramatically reduce latency, at the cost of higher bandwidth usage.

Peer Congestion

As bandwidth increases, router speed needs to increase at the same rate. Many peering points (where huge volumes of Internet traffic are exchanged) often do not have the router horsepower to support their ever-increasing load. You may invest vast sums of money to maintain a network that stays ahead of the growth curve, only to have all your effort wasted the moment packets move off your network onto a large peering point, or onto another service provider's network.

Traffic spikes

Large sporting, television and political events can cause spikes in Internet traffic. Events like The Olympics, the Soccer World Cup, and the Starr report on the Clinton-Lewinsky issue create large traffic spikes.

You can plan ahead for sports events, but it's difficult to estimate the load that they will eventually cause. If you are a local ISP, and a local team reaches the finals, you are likely to get a huge peak in traffic. Companies can also be affected by traffic spikes, with bulk transfer of large databases or presentations flooding lines at random intervals. Though caching cannot completely solve this problem, it can reduce the impact.

Unreachable sites

If Squid attempts to connect to an origin server, only to find that it is down, it will log an error and return the object (even if there is a chance of sending out-of-date data to the client) from disk. This reduces the impact of a large-scale Intenet outage, and can help when a backhoe digs up a major segment of your network backbone.

Outside of the USA

Outside of the USA, bandwidth is expensive and latency due to very long haul links is high.

Costs

Outside of the USA and Canada, bandwidth is expensive. Saving bandwidth reduces Internet infrastructural costs significantly. Since Internet connectivity is so expensive, ISPs and their customers reduce their bandwidth requirements with caches.

Latency

Although reduction of latency is not normally the major reason for introducing caching in these countries, the problems experienced in the USA are exacerbated by the high latency and lower speed of the lines to the USA.