Cache Digests

Cache digests are one of the latest peering developments. Currently they are only supported by Squid, and they have to be turned on at compile time.

Squid keeps it's "list" of objects in an in-memory hash. The hash table (which is based on MD5) helps Squid find out if an object is in the cache without using huge amounts of memory or reading files on disk. Periodically Squid takes this table of objects and summarizes it into a small bitmap (suitable for transfer across a modem). If a bit in the map is on, it means that the object is in the store, if it's off, the object is not. This bitmap/summary is available to other caches, which connect on the HTTP port and request a special URL. If the client cache (the one that just collected the bitmap) wants to know if the server has an object, it simply performs the same mathematical function that generated the values in the bitmap. If the server has the object, the appropriate bit in the bitmap will be defined.

There are various advantages to this idea: if you have a set of loaded caches, you will find that inter-cache communication can use significant amounts of bandwidth. Each request to one cache sparks off a series of requests to all neighboring caches. Each of these queries also causes some server load: the networking stack has to deal with these extra packets, for one thing. With cache-digests, however, load is reduced. The cache digest is generated only once every 10 minutes (the exact value is tunable). The transfer of the digest thus happens fairly seldom, even if the bitmap is rather large (a few 100kbytes is common.) If you were to run 10 caches on the same physical network, however, with each ICP request being a few hundred bytes, the numbers add up. This network load reduction can give your cache time to breathe too, since the kernel will not have to deal with as many small packets.

ICP packets are incredibly simple: they essentially contain only the requested URL. Today, however, a lot of data is transferred in the headers of a request. The contents of a static URL may differ depending on the browser that a user uses, cookie values and more. Since the ICP packet only contains the URL, Squid can only check the URL to see if it has the object, not both the headers and the URL. This can (very occasionally) cause strange problems, with the wrong pages being served. With cache digests, however, the bitmap value depends on both the headers AND the url, which stops these strange hits of objects that are actually generated on-the-fly (normally these pages contain cgi-bin in their path, but some don't, and cause problems.)

Cache digests can generate a small percentage of false hits: since the list of objects is updated only every 10 minutes, your cache could expire an object a second after you download the summarized index. For the next ten minutes, the client cache would believe your server has data that it doesn't. Some five percent of hits may be false, but they are simply retrieved directly from the origin server if this turns out to be the case.