Cache Auto-config

Client browsers can have all options configured manually, or they can be configured to download an autoconfig file every time they start up. This file provides all of the information about your cache setup. Each URL referenced (be it the URL that you typed, or the URL for a graphic on the page yet to be retrieved) is checked against the list of rules in the file. Keep the list of rules as short as possible; otherwise you could end up slowing down page loads, not at the cache level but at the browser.

Web server config changes for autoconfig files

The original Netscape documentation for the proxy autoconfig file (available at http://home.netscape.com/eng/mozilla/2.0/relnotes/demo/proxy-live.html) suggested the filename proxy.pac for Proxy AutoConfig files. Since it's possible to have a file ending in .pac that is not used for autoconfiguration, browsers require a server returning an autoconfig file to indicate so in the mime type. Most web servers do not automatically recognize the .pac extension as a proxy-autoconfig file, and have to be reconfigured to return the correct mime type (application/x-ns-proxy-autoconfig).

Apache

On some systems Apache already defines the autoconfig mime type. The Apache config file mime.types is used to associate filename extensions with mime types. This file is normally stored in the Apache conf directory. This directory also contains the access.conf and httpd.conf files, which you may be more familiar with editing. As you can probably see, each line of the mime.types file consists of two fields: a mime type on the left, the associated filename extension on the right. Since this file is only read at startup or reconfigure, you will need to send a HUP signal to the parent Apache process for your changes to take effect. The following line should be added to the file, assuming that it is not already included:

application/x-ns-proxy-autoconfig       pac

Example 6-1. Restarting Apache

cd /usr/local/lib/httpd/logs
kill -HUP `cat httpd.pid`

Internet Information Server

(?nothing here yet?)

Netscape

(? or here ?)

Autoconfig Script Coding

The autoconfig file is actually a JavaScript function, put in a file and served by your standard web server program. Don't panic if you don't know JavaScript, since this section acts as a cookbook. Besides, the basic structure of the JavaScript language is quite easy to get the hang of, especially if you have previous programming experience, whether it be in C, Pascal or Perl.

The Hello World! of auto-configuration scripts

If you have learned a programming language, you probably remember one of the most basic programs simply printing the phrase Hello World!. We don't want to print anything when someone tries to go to a page, but the following example is similar to the original Hello World program in that it's the shortest piece of code that does something useful.

The following simply connects direct to the origin server for every URL, just as it would if you had no proxy-cache configured at all.

Example 6-2. A very basic autoconfig file

function FindProxyForURL(url, host) {
	return "DIRECT";
}

The next example gets the browser to connect to the cache server named cache.domain.example on port 3128. If the machine is down for some reason, an error message will be returned to the user.

Example 6-3. Connecting to a cache server

function FindProxyForURL(url, host) {
	return "PROXY cache.domain.example:3128";
}

Example 6-4. Connecting to a cache server, with failover

function FindProxyForURL(url, host) {
	return "PROXY cache.domain.example:3128; DIRECT";
}

As you may be able to guess from the above, returning text with a semicolon (;) splits the answer returned into two sub-strings. If the first cache server is unavailable, the second will be tried. This provides you with a failover mechanism: you can attempt a local proxy server first and, if it is down, try another proxy. If all are down, a direct attempt will be made. After a short period of time, the proxy will be retried.
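To make the splitting concrete, here is a minimal sketch of how a return value could be broken into an ordered list of attempts. The helper name parseProxyList is hypothetical and is not part of any browser; real browsers do this internally and their exact parsing rules may differ.

```javascript
// Hypothetical helper (not a browser API): split an autoconfig return
// value into the ordered list of connection methods it describes.
function parseProxyList(result) {
	var entries = result.split(";");
	var attempts = [];
	for (var i = 0; i < entries.length; i++) {
		// Trim surrounding whitespace from each sub-string
		var entry = entries[i].replace(/^\s+|\s+$/g, "");
		if (entry === "") {
			continue;
		}
		var parts = entry.split(/\s+/);
		attempts.push({
			type: parts[0],                              // PROXY, SOCKS or DIRECT
			address: parts.length > 1 ? parts[1] : null  // host:port, if any
		});
	}
	return attempts;
}

var attempts = parseProxyList("PROXY cache.domain.example:3128; DIRECT");
```

Here attempts holds two entries, in the order in which they would be tried: the cache server first, then a direct connection.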

A third return type is included, for SOCKS proxies, and is in the same format as the HTTP type:

return "SOCKS socks.domain.example:3128";

If you have no intranet, and require no exclusions, you should use the above autoconfig file. Configuring machines with the above autoconfig file allows you to add any exclusions you need in future very easily, since only the file on the web server has to change.

Auto-config functions

Web browsers include various built-in functions to make your autoconfig coding as simple as possible. You don't have to write the code that does a string match of the hostname, since you can use a standard function call to do a match. Not all functions are covered here, since some of them are very rarely used. You can find a complete list of autoconfig functions (with examples) at http://home.netscape.com/eng/mozilla/2.0/relnotes/demo/proxy-live.html.

dnsDomainIs

Checks if a host is in a domain. Returns true if the first argument (normally the variable host, which is passed to the autoconfig function) is in the domain specified in the second argument.

Example 6-5. dnsDomainIs

if (dnsDomainIs(host, ".mydomain.example")) {
	return "DIRECT";
}

You can check more than one domain by using the || JavaScript operator. Since this is a standard JavaScript operator, you can use the layout described in this example with any combination of function calls.

Example 6-6. Using multiple dnsDomainIs calls

if (dnsDomainIs(host,".mydomain.example")||
	dnsDomainIs(host,".anotherdomain.example")) {
		return "DIRECT";
}
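When the list of domains grows, a loop over an array can be easier to maintain than a long chain of || operators. The following is a sketch only: the array directDomains and the helper isDirectDomain are hypothetical names, and the dnsDomainIs defined here is a simple stand-in (assumed to be a plain suffix match) so that the example runs outside a browser. In a real autoconfig file you would delete the stand-in and rely on the built-in function.

```javascript
// Stand-in for the browser's built-in dnsDomainIs, assumed here to be a
// simple suffix match. Remove this definition in a real autoconfig file.
function dnsDomainIs(host, domain) {
	return host.length >= domain.length &&
		host.substring(host.length - domain.length) === domain;
}

// Hypothetical list of domains that should bypass the cache
var directDomains = [".mydomain.example", ".anotherdomain.example"];

// Returns true if host falls inside any of the listed domains
function isDirectDomain(host) {
	for (var i = 0; i < directDomains.length; i++) {
		if (dnsDomainIs(host, directDomains[i])) {
			return true;
		}
	}
	return false;
}

function FindProxyForURL(url, host) {
	if (isDirectDomain(host)) {
		return "DIRECT";
	}
	return "PROXY cache.domain.example:3128; DIRECT";
}
```

Adding a domain then means adding one string to the array rather than editing the condition.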

isInNet

Sometimes you will wish to check if a host is in your local IP address range. To do this, the browser resolves the name to find the IP address. Do not use more than one isInNet call if you can help it: each call causes the browser to resolve the hostname all over again, which takes time. A string of these calls can reduce browser performance noticeably.

The isInNet function takes three arguments: the hostname, and a subnet/netmask pair.

Example 6-7. using the isInNet call

if (isInNet(host, "192.168.0.0", "255.255.0.0")) {
	return "DIRECT";
}
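If the subnet/netmask pair is unfamiliar, the following sketch shows the comparison that such a pair implies: an address is inside the range when it agrees with the subnet in every bit that the netmask sets. The helpers ipToNumber and inNet are hypothetical names written for illustration; they work on dotted IPv4 addresses only, do no DNS lookups, and are not the browser's own implementation.

```javascript
// Convert a dotted IPv4 address such as "192.168.5.7" to a number
function ipToNumber(ip) {
	var octets = ip.split(".");
	var n = 0;
	for (var i = 0; i < 4; i++) {
		n = n * 256 + parseInt(octets[i], 10);
	}
	return n;
}

// True if ip matches subnet in all of the bits selected by mask
function inNet(ip, subnet, mask) {
	var m = ipToNumber(mask);
	return (ipToNumber(ip) & m) === (ipToNumber(subnet) & m);
}

var inside = inNet("192.168.5.7", "192.168.0.0", "255.255.0.0");   // true
var outside = inNet("192.169.5.7", "192.168.0.0", "255.255.0.0");  // false
```

With the netmask 255.255.0.0 only the first two numbers of the address are compared, so every address starting with 192.168 is treated as local.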

isPlainHostName

Simply checks that there is no full-stop in the hostname (the only argument for this call). Many people refer to local machines simply by hostname, since the resolver library will automatically attempt to look up host.domain.example if you simply attempt to connect to host. For example: typing www in your browser should bring up your web site.

Many people connect to internal web servers (such as one sitting on their co-worker's desk) by typing in the hostname of the machine. These connections should not pass through the cache server, so many people use a function like the following:

Example 6-8. using isPlainHostName to decide if the connection should be direct

if (isPlainHostName(host)) {
	return "DIRECT";
} else {
	return "PROXY cache.mydomain.example:3128";
}
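The check itself is simple enough to sketch in one line. The helper looksPlain below is a hypothetical stand-in that mirrors the "no full-stop in the hostname" test described above, so that the routing decision can be tried outside a browser; chooseRoute is likewise a name invented for this illustration.

```javascript
// Hypothetical stand-in for isPlainHostName: true when the name has no dot
function looksPlain(host) {
	return host.indexOf(".") === -1;
}

// Same decision as the example above, using the stand-in
function chooseRoute(host) {
	if (looksPlain(host)) {
		return "DIRECT";
	}
	return "PROXY cache.mydomain.example:3128";
}
```

A user typing just www is sent direct, while www.mydomain.example goes through the cache, since the name contains a full-stop.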

myIpAddress

Returns the IP address of the machine that the browser is running on, requires no arguments.

On a network with more than one cache, your script can use this information to decide which cache to communicate with. In the next subsection we look at different ways of communicating with a local proxy (with minimal manual user intervention), so the example here is comparatively basic. The example below assumes that you have more than two networks: one with a private address range (10.0.0.*), the others with real IP addresses.

If the client machine is in the private address range, it cannot connect directly to the destination server, so if the cache is down for some reason they cannot access the Internet. A machine with a real IP address, on the other hand, should attempt to connect directly to the origin server if the cache is down. (? need to check it will work too! ?).

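Since myIpAddress requires no arguments, we can simply call it, as myIpAddress(), where we would have put host in the isInNet function call.

Example 6-9. myIpAddress

if (isInNet(myIpAddress(), "10.0.0.0", "255.255.255.0")) {
	// Private address: the cache is the only way out
	return "PROXY cache.mydomain.example:3128";
} else {
	// Real IP address: use the cache, but go direct if it is down
	return "PROXY cache.mydomain.example:3128; DIRECT";
}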
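On a network with several caches, the same idea extends to a small table that maps client subnets to caches. The sketch below takes the client address as a parameter so that it can be tried outside a browser; in an autoconfig file you would pass it the result of myIpAddress(). The cache names, the subnets, and the helpers ipToNumber, inNet and proxyForClient are all assumptions made for illustration.

```javascript
// Convert a dotted IPv4 address to a number (illustrative helper)
function ipToNumber(ip) {
	var octets = ip.split(".");
	var n = 0;
	for (var i = 0; i < 4; i++) {
		n = n * 256 + parseInt(octets[i], 10);
	}
	return n;
}

// Netmask comparison on dotted addresses, without any DNS lookup
function inNet(ip, subnet, mask) {
	var m = ipToNumber(mask);
	return (ipToNumber(ip) & m) === (ipToNumber(subnet) & m);
}

// Hypothetical mapping of client subnets to their nearest cache
var cacheBySubnet = [
	{ net: "10.0.0.0", mask: "255.255.255.0", proxy: "PROXY cache1.mydomain.example:3128" },
	{ net: "10.0.1.0", mask: "255.255.255.0", proxy: "PROXY cache2.mydomain.example:3128" }
];

// Pick a cache for the given client address
function proxyForClient(clientIp) {
	for (var i = 0; i < cacheBySubnet.length; i++) {
		var entry = cacheBySubnet[i];
		if (inNet(clientIp, entry.net, entry.mask)) {
			return entry.proxy;
		}
	}
	// Clients outside the private ranges may fall back to a direct connection
	return "PROXY cache.mydomain.example:3128; DIRECT";
}
```

Inside FindProxyForURL this would be used as return proxyForClient(myIpAddress());.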

shExpMatch

The shExpMatch function accepts two arguments: a string and a shell expression. Shell expressions are similar to regular expressions, though they are more limited. This function is often used to check if the url or host variables have a specific word in them.

If you are configuring an ISP-wide script, this function can be quite useful. Since you do not know if a customer will call their machine "intranet" or "intra" or "admin", you can chain many shExpMatch checks together. Note that the example below uses a single "intra*" shell expression to match both "intranet" and "intra.mydomain.example".

Example 6-10. shExpMatch

if (shExpMatch(host, "intra*")||
	shExpMatch(host, "admin*")) {
		return "DIRECT";
} else {
		return "PROXY cache.mydomain.example:3128";
}
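To show what a shell expression means in practice, the following sketch translates one into a regular expression: * stands for any run of characters and ? for exactly one character. The function shellMatch is a hypothetical stand-in written for illustration. It handles only these two wildcards and is not the browser's own shExpMatch, which may support more syntax.

```javascript
// Hypothetical stand-in for shExpMatch: supports only "*" and "?"
function shellMatch(str, pattern) {
	var re = "";
	for (var i = 0; i < pattern.length; i++) {
		var ch = pattern.charAt(i);
		if (ch === "*") {
			re += ".*";      // any run of characters, including none
		} else if (ch === "?") {
			re += ".";       // exactly one character
		} else {
			// Escape characters that are special in regular expressions
			re += ch.replace(/[.\\+^$()|{}\[\]]/g, "\\$&");
		}
	}
	// The whole string must match, not just part of it
	return new RegExp("^" + re + "$").test(str);
}

var a = shellMatch("intranet", "intra*");                // true
var b = shellMatch("intra.mydomain.example", "intra*");  // true
var c = shellMatch("www.mydomain.example", "intra*");    // false
```

Because the pattern must match the whole string, "intra*" matches names that start with intra, not names that merely contain it.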

url.substring

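This function doesn't take the same form as those described above: it is a standard JavaScript string method, called on the url variable itself. Since Squid does not support all possible protocols, you need a way of comparing the first few characters of the destination URL with the list of possible protocols. The function has two arguments. The first is the starting position; the second is the position at which to stop (the character at that position is not included). When the starting position is 0, the second argument is therefore also the number of characters retrieved. Note that (as in C) strings start at position 0, rather than at 1.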

All of this is best demonstrated with an example. The following attempts to connect to the cache for the most common URL types (http, ftp and gopher), but attempts to go directly for protocols that Squid doesn't recognize.

Example 6-11. url.substring

if (url.substring(0, 5) == "http:" ||
	url.substring(0, 4) == "ftp:"||
	url.substring(0, 7) == "gopher:")
		return "PROXY cache.is.co.za:8080; DIRECT";
else
	return "DIRECT";
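The position arguments are easy to get wrong, so here is a short sketch of how substring behaves, followed by a variation that extracts the protocol name once instead of testing each prefix separately. The helper names schemeOf and routeFor, the list cachedSchemes, and the hostnames are assumptions made for illustration.

```javascript
var sample = "gopher://gopher.domain.example/";
var prefix = sample.substring(0, 7);   // "gopher:"  (positions 0 to 6)
var middle = sample.substring(9, 15);  // "gopher"   (positions 9 to 14)

// Hypothetical helper: everything before the first colon, e.g. "http"
function schemeOf(url) {
	var colon = url.indexOf(":");
	if (colon === -1) {
		return "";
	}
	return url.substring(0, colon);
}

// Protocols that should be sent through the cache
var cachedSchemes = ["http", "ftp", "gopher"];

function routeFor(url) {
	var scheme = schemeOf(url);
	for (var i = 0; i < cachedSchemes.length; i++) {
		if (scheme === cachedSchemes[i]) {
			return "PROXY cache.domain.example:3128; DIRECT";
		}
	}
	return "DIRECT";
}
```

Extracting the protocol once avoids having to count the characters in each prefix by hand.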

Example autoconfig files

The main reason that autoconfig files were invented was the sheer number of possible cache setups. It's difficult (or even impossible) to represent all of the possible combinations that an autoconfig file can provide you with.

There is no config file that will work for everyone, so a couple of config files are included here, one of which should suit your setup.

A Small Organization

A small organization is the easiest to create an autoconfig file for. Since you will have a moderately small number of IP addresses, you can use the isInNet function to discover if the destination host is local or not (a large organization, such as an ISP, would need a very long autoconfig file simply because they have many IP address ranges).

Example 6-12. A small organization's proxy config file

function FindProxyForURL(url, host) {
	// We only have one network range, and one DNS request doesn't
	// mean a large slowdown
	if (isInNet(host, "196.4.160.0", "255.255.255.0"))
		return "DIRECT";
	// If it's not local, use the cache server, with automatic
	// connection to the outside in case of problems
	return "PROXY cache.domain.example:3128; DIRECT";
}

A Dialup ISP

Since dialup customers don't have intranet systems, a dialup ISP would have a very straightforward config file. If you wish your customers to connect directly to your web server (why waste the disk space of a cache when you have the origin server rack-mounted above it?), you should use the dnsDomainIs function:

Example 6-13. Dialup ISP autoconfig file

function FindProxyForURL(url, host) {
	// For servers in the local domain, go direct
	if (dnsDomainIs(host, "mydomain.example"))
		return "DIRECT";
	// Otherwise go through the cache server, with fail-over
	return "PROXY cache.mydomain.example:3128; DIRECT";
}

Leased Line ISP

When you are providing a public service, you have no control over what your customers call their machines. You have to handle the generic names (like intranet) and hope that people name their machines according to the de-facto standards.

Example 6-14. Leased line ISP autoconfig file

function FindProxyForURL(url, host) {
	// so that people can type just "intranet" or just "mypc"
	if (isPlainHostName(host))
		return "DIRECT";

	// For servers in our domain, go direct: for announcements etc
	if (dnsDomainIs(host, "mydomain.example"))
		return "DIRECT";
	// since there are many domains, we cannot do them all. We assume
	// that people are going to type "intranet" instead of
	// "intranet.customerdomain.example"
	return "PROXY cache.mydomain.example:3128; DIRECT";
}

(? I need some info on ieak - waiting for people here?)

Cache Array Routing Protocol

Many large ISPs will have more than one cache server. To avoid duplicating objects, these cache servers have to communicate with one another. Consider the following:

cache1 gets a request for an object. It caches the page, and stores it on disk. An hour or so later, cache2 gets a request for the same page. To find a local copy of the object, cache2 has to query the other caches. Add more and more caches, and your number of queries goes up.

If an incoming request for a specific URL only ever went to one cache, your caches would not need to communicate with one another. A client requesting the page http://www.oreilly.com/ would always connect to cache1.

Let's assume that you have 5 caches. Splitting the Internet into five pieces would split the load across the caches almost evenly. How do you split, though? By destination IP address? No, since IP addresses like 19?.*.*.* are much more common than "5.*.*.*". By domain? No again, since a single busy domain like microsoft.com would put a disproportionate share of the load on one cache.

Some of you will know what a hash function is. If not, don't panic: you can still use CARP without knowing the theoretical basis of the algorithms involved.

CARP allows you to split up the Internet by URL (the combination of hostname, path and filename). If you have 5 cache servers, you split up the domain of possible answers into 5 parts. (A hash function returns a number, so we are using the appropriate terms - a domain is not an Internet domain in this context). With a good hashing function, the numbers returned are going to be spread across the 5 parts evenly, which spreads your load perfectly.

If you have a cache which is twice as powerful as your others, you can allocate it more of the domain, and put more load on it.
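The idea of splitting by hash can be sketched in a few lines. Note the assumptions: this is not the hash function or the selection method that CARP itself defines. A simple string hash has been swapped in purely to show how a URL can be mapped to one of several caches, and how a weight gives one cache a larger share. The cache names, the weights, and the helpers hashString and pickCache are hypothetical.

```javascript
// Hypothetical caches; cache1 is "twice as powerful", so it has weight 2
var caches = [
	{ proxy: "PROXY cache1.domain.example:3128", weight: 2 },
	{ proxy: "PROXY cache2.domain.example:3128", weight: 1 },
	{ proxy: "PROXY cache3.domain.example:3128", weight: 1 }
];

// Simple string hash returning a number from 0 to 4294967295.
// This is an illustrative stand-in, NOT the hash used by CARP.
function hashString(s) {
	var h = 0;
	for (var i = 0; i < s.length; i++) {
		h = (h * 31 + s.charCodeAt(i)) % 4294967296;
	}
	return h;
}

// Map a URL to one cache: the hash picks a slot, and each cache owns
// as many slots as its weight
function pickCache(url, cacheList) {
	var total = 0;
	for (var i = 0; i < cacheList.length; i++) {
		total += cacheList[i].weight;
	}
	var slot = hashString(url) % total;
	for (var j = 0; j < cacheList.length; j++) {
		if (slot < cacheList[j].weight) {
			return cacheList[j].proxy;
		}
		slot -= cacheList[j].weight;
	}
	return cacheList[cacheList.length - 1].proxy; // not reached
}

function FindProxyForURL(url, host) {
	return pickCache(url, caches);
}
```

The same URL always hashes to the same number, so it always goes to the same cache, which is what removes the need for the caches to query one another.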

CARP is used by some cache servers (most notably Microsoft Proxy and Squid) to decide which parent cache to send a request to. Browsers can also use CARP to decide which cache to talk to, using a JavaScript auto-config script. For more information (and an example JavaScript script), you should look at the web page http://naragw.sharp.co.jp/sps/