Report on the effect of the Independent Counsel Report on the NLANR Web Caches

Duane Wessels

September 23, 1998



Motivation

On Friday, September 11, the U.S. Congress made public the Attorney General's Independent Counsel Report on President Clinton. Both politicians and the media hyped the fact that the report would be available on the Internet. Apparently, the URLs for the report were announced at least a day in advance, although we did not witness this personally. The official servers were house.gov, loc.gov, and gpo.gov.

Data Collection

NLANR maintains nine high-level Web caches located throughout the U.S. These caches are directly connected to approximately 400 other caches, and indirectly to 1100 worldwide. Collectively, the NLANR caches receive approximately 7,000,000 requests per day from the others.

The following data is generated from the access logs of the NLANR caches for September 11--19. Keep the limitations of these logs in mind as you look at the data. They are not a valid representation of ALL Web traffic on the Internet. Furthermore, they do not accurately represent the traffic of all caches and clients connecting through us. Hopefully, most user requests end up being cache hits in the lower-level caches. Our logs only record the cache misses and refresh requests that those lower layers pass up to us.

We are interested in requests for the Attorney General's Independent Counsel Report on President Clinton. The first step of the data analysis is to isolate those requests from all others in our log files. In this, we are not extremely thorough. We use simple regular expressions to find URLs that we knew or learned about. Specifically, these (case insensitive) patterns are:

	/icreport/
	/starr\.report/
	/starr-report\.aol\.com/
	/report\.yahoo\.com/
	/starrreport/
	/abcnews\.com\/

No doubt you can find more, and maybe we missed some important ones. If so, let us know and we can re-run the analysis.
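
The filtering step can be sketched in a few lines of Python. This is only an illustration, not the actual analysis script. It assumes the slashes around each pattern above are Perl-style delimiters rather than part of the pattern, and the names IC_REPORT_PATTERNS, is_ic_report_url and filter_ic_report are made up for this example.

```python
import re

# Patterns from the list above. Assumption: the surrounding slashes are
# Perl-style delimiters, not part of the pattern itself.
IC_REPORT_PATTERNS = [
    r"icreport",
    r"starr\.report",
    r"starr-report\.aol\.com",
    r"report\.yahoo\.com",
    r"starrreport",
    r"abcnews\.com/",
]

# One combined expression, matched case insensitively.
IC_REPORT_RE = re.compile("|".join(IC_REPORT_PATTERNS), re.IGNORECASE)


def is_ic_report_url(url):
    """Return True if the URL matches any of the known report patterns."""
    return IC_REPORT_RE.search(url) is not None


def filter_ic_report(urls):
    """Keep only the URLs that look like requests for the report."""
    return [u for u in urls if is_ic_report_url(u)]
```

A new pattern reported by a reader would simply be appended to the list before re-running the analysis.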

Request Rate - TOTAL

Note: this plot has been smoothed to make it prettier. The data is averaged at 15 minute intervals. A postscript file is available.

This graph simply shows the number of client requests for IC Report pages. The initial peak is shown clearly. Well before the report is released, people are already trying to access it.
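
As a rough sketch of how such a request rate can be derived, the Python below groups request timestamps into fixed-width bins and divides each count by the bin width. It assumes the timestamps are UNIX times in seconds. The function name request_rate is hypothetical, and any additional smoothing applied to the plot is not reproduced here.

```python
from collections import Counter


def request_rate(timestamps, interval=900):
    """Average requests per second in fixed-width time bins.

    timestamps: iterable of UNIX times in seconds (an assumption about
                the log format).
    interval:   bin width in seconds; 900 corresponds to 15 minutes.
    Returns a dict mapping each bin's start time to its average rate.
    Bins that contain no requests are simply absent from the result.
    """
    counts = Counter(int(t // interval) * interval for t in timestamps)
    return {start: n / interval for start, n in sorted(counts.items())}
```

Passing interval=300 would give the 5 minute averaging used for the per-server plot.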

The Friday afternoon peak is so sharp here because initially, the URLs do not exist. Generally, "404 Not Found" replies are not cached. Thus, every request for the report is a cache miss and these requests filter all the way up to our caches. As the report is made available on the various origin servers, the load drops off sharply because we start having cache hits in the lower levels.

You can see an increase in requests on Monday, the 14th, when people return to work (etc) after the weekend.

The long-term decreasing trend might reflect declining interest among Web users. It might also be due to more cache hits at lower levels.


Request Rate - Per Web Server

Note: this plot has not been smoothed. The data is averaged at 5 minute intervals. A postscript file is available.

This plot also shows requests per second, but breaks it down for some of the more popular origin servers. Interestingly, the large peaks are for the two CNN web servers. The official servers, {loc,house,gpo}.gov have peaks, but much smaller ones. This is only one of the indications that the GOV sites were not able to handle the load. As the expected time of release came and passed, most users probably could not get responses from the GOV servers, and thus turned to other sources.

The large spike for yahoo.com on the 19th might be web-robot (or prefetching) activity.


Service Times

Note: this plot has been smoothed to make it prettier. The data is averaged at one hour intervals. A postscript file is available. Also, note the logarithmic scale on the Y-axis.

This graph shows the MEDIAN per-request service time for cache clients (as measured by our caches). Generally, this is the time elapsed between receiving the client request and writing the last byte of the reply. We use medians instead of means because the distribution usually has a long, heavy tail.
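
To illustrate why the median is the better summary for a long-tailed distribution, here is a small Python sketch. The function name summarize_service_times is hypothetical, and the sample values in the note that follows are invented for illustration. They are not taken from our logs.

```python
from statistics import mean, median


def summarize_service_times(seconds):
    """Return both the median and the mean of a list of service times."""
    return {"median": median(seconds), "mean": mean(seconds)}
```

For the made-up sample [0.2, 0.3, 0.4, 0.5, 120.0], the median stays at 0.4 seconds, while the single slow request pulls the mean above 20 seconds.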

This graph shows some dramatic congestion on Friday afternoon. This congestion could be on the origin servers themselves, or on the network links connecting them.

Again, note the logarithmic Y-axis scale. Normally, our caches measure median service times in the range 0.25-0.5 seconds. On Friday afternoon, some of the servers have very high service times, in the range of tens or hundreds of seconds.

The house.gov and gpo.gov servers seem to have a particularly difficult time handling the load. The loc.gov server struggles at first as well, but recovers after a day or so. Perhaps it was upgraded or additional machines were installed for load-spreading.

The CNNFN server experienced a similar initial peak. Also you can clearly see when the report was added to that site, and when the report was removed.

It would very likely be wrong to attribute the reduction in service time to the effects of caching. Although we have no hard numbers, we guess that about 75% of Web users do not go through caches, so caching could have removed only a small share of the load on the origin servers.


Hit Ratios

Note: this plot has been smoothed to make it prettier. The data is averaged at SIX HOUR intervals. A postscript file is available. Please ignore the "edge effects" from the smoothing algorithm.

This graph shows the hit ratios for each server. After the first day, the hit ratios generally level off. Although not as high as we would like for "hot objects," these hit ratios are pretty good. The three GOV servers achieve 50% or better after a few days.
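
A per-server hit ratio of this kind can be computed as sketched below in Python. The sketch assumes each log record has already been reduced to a (server, result code) pair, and that result codes follow the Squid convention in which every kind of hit contains the substring HIT (TCP_HIT, TCP_REFRESH_HIT, and so on). The function name hit_ratios is made up for this example.

```python
from collections import defaultdict


def hit_ratios(records):
    """Return {server: fraction of requests that were cache hits}.

    records: iterable of (server, result_code) pairs. Assumption: any
             result code containing "HIT" counts as a hit.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for server, result_code in records:
        totals[server] += 1
        if "HIT" in result_code:
            hits[server] += 1
    return {server: hits[server] / totals[server] for server in totals}
```

A server whose replies can never be reused shows up with a ratio at or near zero.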

The most distressing result, however, comes from the two CNN servers. The main CNN server gives a very low hit ratio and the CNNFN server gives essentially no hits. What causes these low hit ratios?

The CNNFN server pre-expires ALL of its responses. For example:

	HTTP/1.0 200 OK
	Server: Netscape-Enterprise/2.01
	Date: Wed, 23 Sep 1998 23:48:06 GMT
	Expires: Wed, 23 Sep 1998 23:48:06 GMT

This alone would not normally be so bad. However, the CNNFN server never returns "304 Not Modified" for an If-Modified-Since validation request.

The CNN server does not include an Expires header in its responses. However, for the report files, it does always set the Last-Modified time to the current time:

	HTTP/1.0 200 OK
	Server: Netscape-Enterprise/2.01
	Date: Wed, 23 Sep 1998 23:50:21 GMT
	Last-Modified: Wed, 23 Sep 1998 23:50:21 GMT

The CNN server does return "better" Last-Modified headers for some other requests, such as images. Additionally, the CNN server does return "304 Not Modified" for validation requests.
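
Replies like the two shown above can be flagged mechanically. The Python sketch below is only an illustration: the function name freshness_problem and its two labels are invented for this example. It reports a reply as troublesome if Expires is not later than Date, or if there is no Expires header and Last-Modified is not earlier than Date.

```python
from email.utils import parsedate_to_datetime


def freshness_problem(headers):
    """Describe why a reply is hard to cache, or return None.

    headers: dict of HTTP reply headers with RFC 1123 style dates.
    The two labels returned are hypothetical names for the two
    behaviors described in the text.
    """
    date = headers.get("Date")
    if date is None:
        return None
    now = parsedate_to_datetime(date)

    expires = headers.get("Expires")
    if expires is not None:
        # The reply is already stale at the moment it is generated.
        if parsedate_to_datetime(expires) <= now:
            return "pre-expired"
        return None

    last_modified = headers.get("Last-Modified")
    if last_modified is not None and parsedate_to_datetime(last_modified) >= now:
        # The object claims to have changed at the moment it was served.
        return "last-modified-now"
    return None
```

Applied to the headers shown above, the CNNFN reply is reported as pre-expired and the CNN reply as having a Last-Modified time equal to its Date.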

Server Domain Names

We found it interesting how quickly some sites assigned special hostnames to the report. Among those we know about:

	DOMAIN				FIRST LOGGED (EDT)
	-----------------------------	-------------------
	icreport.house.gov		Sep 11 13:34 
	icreport.loc.gov		Sep 11 16:35
	starr-report.aol.com		Sep 11 16:53
	report.yahoo.com		Sep 11 17:06
	starrreport.excite.com		Sep 11 20:48
	icreport.access.gpo.gov		Sep 11 21:08
	icreport.lycos.com		Sep 12 14:43
	search.report.yahoo.com		Sep 14 08:39
	starrreport.ap.org		Sep 14 19:53

Client Domains

Where did requests originate? This analysis is highly speculative; all we have is domain names. Hosts in the more popular top-level domains (com, net) could be located in any country.

The table below shows the breakdown of requests by top-level domain. The first column is the percentage of ALL requests to our caches coming from that domain. The second is the percentage of requests for copies of the Independent Counsel Report from the websites referenced above. It seems that the top-level domains typically used by U.S. organizations (com, net, edu, us) have a higher than normal interest in the report pages, while most international domains have a lower than normal interest.

Remember, our log files are not a valid representation of all Web traffic. Connecting to the NLANR caches is entirely voluntary and we can only log accesses from sites which elect to send requests to us. A number of countries, such as the United Kingdom (uk) and Germany (de), have their own national caching systems. They currently do not forward requests to our caches.

	TLD      ALL REQUESTS (%)  IC REPORT REQUESTS (%)
	-------  ----------------  ----------------------
	com             22.851658               33.429243
	id              22.663313               19.279737
	net              7.017079               12.513399
	is               6.827721                1.730136
	th               6.558700                2.177715
	edu              6.138637                8.325341
	kr               4.834703                1.893747
	za               3.257730                3.652092
	pe               1.942889                0.661965
	cr               1.722956                0.955336
	sg               1.659810                0.823695
	ph               1.645584                2.709920
	de               1.386551                1.570287
	ca               1.327626                0.404325
	unknown          1.214713                2.096850
	il               1.140298                0.620592
	au               1.001077                0.855665
	ru               0.791419                0.562294
	ar               0.665279                0.417489
	mx               0.519621                0.073343
	es               0.500825                0.276446
	se               0.480439                0.216267
	cz               0.326270                0.558533
	jp               0.316839                0.067701
	nl               0.198351                0.118477
	co               0.172055                0.169252
	gr               0.150432                0.327221
	ch               0.133177                0.105313
	ro               0.126287                0.139163
	gov              0.123903                0.026328
	no               0.109114                0.223789
	br               0.098089                0.067701
	be               0.090461                0.099671
	pl               0.081722                0.007522
	cl               0.075965                0.007522
	at               0.072529                1.013634
	cn               0.069248                0.030089
	org              0.059942                0.107193
	us               0.049127                0.150447
	mz               0.040580                0.000000
	lk               0.039025                0.099671
	in               0.031349                0.120357
	it               0.027349                0.000000
	pt               0.023815                0.013164
	su               0.014923                0.015045
	nz               0.013419                0.000000
	by               0.006232                0.011283
	mil              0.000986                0.000000
	my               0.000492                0.000000
	am               0.000381                0.000000
	ua               0.000193                0.000000
	tr               0.000113                0.000000
	ge               0.000063                0.000000
	tw               0.000046                0.000000
	uk               0.000039                0.000000
	fr               0.000019                0.000000
	sk               0.000010                0.000000
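
A breakdown like the one in the table above can be produced roughly as follows. This Python sketch assumes client addresses have already been resolved to hostnames where possible, and that the "unknown" row stands for clients with no usable hostname, such as a bare IP address. The function names tld_of and tld_percentages are made up for this example.

```python
from collections import Counter


def tld_of(hostname):
    """Return the top-level domain of a hostname, lowercased.

    Assumption: empty names and names whose last label is numeric
    (bare IP addresses) are counted as "unknown".
    """
    last = hostname.rstrip(".").rsplit(".", 1)[-1].lower()
    if not last or last.isdigit():
        return "unknown"
    return last


def tld_percentages(hostnames):
    """Return {tld: percentage of all requests from that tld}."""
    counts = Counter(tld_of(h) for h in hostnames)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tld: 100.0 * n / total for tld, n in counts.items()}
```

Running this once over all requests and once over only the report requests gives the two columns, which can then be compared domain by domain.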

wessels@ircache.net