On Friday, September 11, the U.S. Congress made public the Attorney General's Independent Counsel Report on President Clinton. Both politicians and the media hyped the fact that the report would be available on the Internet. Apparently, the URLs for the report were announced at least a day in advance, although we did not witness this personally. The official servers were house.gov, loc.gov, and gpo.gov.
NLANR maintains nine high-level Web caches located throughout the U.S. These caches are directly connected to approximately 400 other caches, and indirectly to 1100 worldwide. Collectively, the NLANR caches receive approximately 7,000,000 requests per day from the others. The following data is generated from the access logs of the NLANR caches for September 11-19. Keep the following in mind as you look at the data. These logs are not a valid representation of ALL Web traffic on the Internet. Furthermore, our logs do not accurately represent the traffic of all caches and clients connecting through us. Hopefully, most user requests end up as cache hits at the lower levels; our logs record only the cache misses and refresh requests that those lower layers forward to us.
We are interested in requests for the Attorney General's Independent Counsel Report on President Clinton. The first step of the data analysis is to isolate those requests from all others in our log files. We are not extremely thorough in this. We use simple regular expressions to find URLs that we knew or learned about. Specifically, these (case insensitive) patterns are:
    /icreport/
    /starr\.report/
    /starr-report\.aol\.com/
    /report\.yahoo\.com/
    /starrreport/
    /abcnews\.com\/

No doubt you can find more, and maybe we missed some important ones. If so, let us know and we can re-run the analysis.
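The filtering step can be sketched in a few lines of Python. This is only an illustration, not the script we ran: it assumes the slashes in the patterns above are Perl-style delimiters rather than literal characters, it leaves out the abcnews pattern because only a fragment of it appears in the list, and the names `PATTERNS` and `is_ic_report` are made up for the example.

```python
import re

# The five complete patterns from the list, with the surrounding
# slashes treated as delimiters (an assumption). The abcnews pattern
# is omitted because its full form is not shown.
PATTERNS = [
    r"icreport",
    r"starr\.report",
    r"starr-report\.aol\.com",
    r"report\.yahoo\.com",
    r"starrreport",
]

# One case-insensitive alternation covering every pattern.
IC_REPORT_RE = re.compile("|".join(PATTERNS), re.IGNORECASE)


def is_ic_report(url):
    """Return True if the URL matches any of the report patterns."""
    return IC_REPORT_RE.search(url) is not None
```

A log-processing loop would then keep only the entries whose URL passes `is_ic_report`.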
Note: this plot has been smoothed to make it prettier. The data is averaged at 15 minute intervals. A postscript file is available.
This graph simply shows the number of client requests for IC Report pages. The initial peak is clearly visible. Well before the report is released, people are already trying to access it.
The Friday afternoon peak is so sharp here because initially, the URLs do not exist. Generally, "404 Not Found" replies are not cached. Thus, every request for the report is a cache miss and these requests filter all the way up to our caches. As the report is made available on the various origin servers, the load drops off sharply because we start having cache hits in the lower levels.
You can see an increase in requests on Monday, the 14th, when people return to work (etc) after the weekend.
The long-term decreasing trend might be decreased interest by Web users. It might also be due to cache hits at lower levels.
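For readers who want to reproduce this kind of plot from their own logs, the 15-minute averaging might look like the sketch below. It assumes each log entry carries a Unix timestamp in seconds; the function name and the `bin_seconds` parameter are illustrative, and any additional smoothing applied to the plot is not shown.

```python
from collections import Counter


def requests_per_second(timestamps, bin_seconds=900):
    """Average request rate per fixed-width time bin.

    timestamps  -- iterable of Unix timestamps (seconds)
    bin_seconds -- bin width; 900 seconds is 15 minutes
    Returns {bin_start: requests_per_second}, ordered by bin start.
    """
    # Map every timestamp to the start of the bin that contains it.
    counts = Counter(int(ts) // bin_seconds * bin_seconds for ts in timestamps)
    return {start: n / bin_seconds for start, n in sorted(counts.items())}
```

Passing `bin_seconds=300` gives 5-minute bins instead.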
Note: this plot has not been smoothed. The data is averaged at 5 minute intervals. A postscript file is available.
This plot also shows requests per second, but breaks it down for some of the more popular origin servers. Interestingly, the large peaks are for the two CNN web servers. The official servers, {loc,house,gpo}.gov have peaks, but much smaller ones. This is only one of the indications that the GOV sites were not able to handle the load. As the expected time of release came and passed, most users probably could not get responses from the GOV servers, and thus turned to other sources.
The large spike for yahoo.com on the 19th might be web-robot (or prefetching) activity.
Note: this plot has been smoothed to make it prettier. The data is averaged at one hour intervals. A postscript file is available. Also, note the logarithmic scale on the Y-axis.
This graph shows the MEDIAN per-request service time for cache clients (as measured by our caches). Generally, this is the time elapsed between receiving the client request and writing the last byte of the reply. We use medians instead of means because the distribution usually has a long, heavy tail.
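A small numeric illustration of why the median is the better summary here. The service times below are invented for the example, not taken from our logs: six ordinary requests and one that hangs for two minutes.

```python
from statistics import mean, median

# Hypothetical per-request service times in seconds (made-up values):
# mostly sub-second, with a single very slow request in the tail.
service_times = [0.3, 0.25, 0.4, 0.35, 0.5, 0.3, 120.0]

typical = median(service_times)  # barely affected by the slow request
average = mean(service_times)    # pulled far upward by the slow request

print(f"median = {typical:.2f} s, mean = {average:.2f} s")
```

The single slow request drags the mean to many seconds while the median stays close to what most clients actually experienced.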
This graph shows some dramatic congestion on Friday afternoon. This congestion could be on the origin servers themselves, or on the network links which they connect with.
Again, note the logarithmic Y-axis scale. Normally, our caches measure median service times in the range of 0.25-0.5 seconds. On Friday afternoon, some of the servers have very high service times, in the range of tens or hundreds of seconds.
The house.gov and gpo.gov servers seem to have a particularly difficult time handling the load. The loc.gov server struggles at first as well, but recovers after a day or so. Perhaps it was upgraded or additional machines were installed for load-spreading.
The CNNFN server experienced a similar initial peak. Also you can clearly see when the report was added to that site, and when the report was removed.
It would very likely be wrong to attribute the reduction in service time to the effects of caching. Although we have no hard numbers, probably 75% of Web users do not go through caches.
Note: this plot has been smoothed to make it prettier. The data is averaged at SIX HOUR intervals. A postscript file is available. Please ignore the "edge effects" from the smoothing algorithm.
This graph shows the hit ratios for each server. After the first day, the hit ratios generally level off. Although not as high as we would like for "hot objects," these hit ratios are pretty good. The three GOV servers achieve 50% or better after a few days.
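As a rough sketch, a per-server hit ratio can be computed as hits divided by total requests for that server. The input format here (pairs of server name and a hit flag) is an assumption made for the example; real access logs would first need to be parsed into this shape.

```python
from collections import defaultdict


def hit_ratios(records):
    """Compute the hit ratio for each origin server.

    records -- iterable of (server, is_hit) pairs, where is_hit is True
               when the request was answered from the cache.
    Returns {server: hits / total_requests}.
    """
    hits = defaultdict(int)
    total = defaultdict(int)
    for server, is_hit in records:
        total[server] += 1
        if is_hit:
            hits[server] += 1
    return {server: hits[server] / total[server] for server in total}
```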
The most distressing thing, however, is the two CNN servers. The main CNN server gives a very low hit ratio and the CNNFN server gives essentially no hits. What causes these low hit ratios?
The CNNFN server pre-expires ALL of its responses. For example:
    HTTP/1.0 200 OK
    Server: Netscape-Enterprise/2.01
    Date: Wed, 23 Sep 1998 23:48:06 GMT
    Expires: Wed, 23 Sep 1998 23:48:06 GMT

This alone would not normally be so bad. However, the CNNFN server never returns "304 Not Modified" for an If-Modified-Since validation request.
The CNN server does not include an Expires header in its responses. However, for the report files, it always sets the Last-Modified time to the current time:
    HTTP/1.0 200 OK
    Server: Netscape-Enterprise/2.01
    Date: Wed, 23 Sep 1998 23:50:21 GMT
    Last-Modified: Wed, 23 Sep 1998 23:50:21 GMT

The CNN server does return "better" Last-Modified headers for some other requests, such as images. Additionally, the CNN server does return "304 Not Modified" for validation requests.
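To see why both header combinations defeat caching, consider the sketch below. It assumes a cache that honors Expires when present and otherwise falls back to the common "last-modified factor" heuristic, treating an object as fresh for some fraction of its apparent age. The function name and the 0.1 factor are illustrative, not a description of how our caches are configured.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime


def freshness_lifetime(headers, lm_factor=0.1):
    """Estimate how long a response may be served without revalidation.

    headers   -- dict of response header names to values
    lm_factor -- fraction of (Date - Last-Modified) treated as fresh;
                 0.1 is an assumed, illustrative value
    """
    date = parsedate_to_datetime(headers["Date"])
    if "Expires" in headers:
        # Explicit expiration: fresh until Expires, never negative.
        expires = parsedate_to_datetime(headers["Expires"])
        return max(expires - date, timedelta(0))
    if "Last-Modified" in headers:
        # Heuristic: older objects are assumed to change less often.
        age_at_origin = date - parsedate_to_datetime(headers["Last-Modified"])
        return max(age_at_origin * lm_factor, timedelta(0))
    return timedelta(0)


cnnfn = {"Date": "Wed, 23 Sep 1998 23:48:06 GMT",
         "Expires": "Wed, 23 Sep 1998 23:48:06 GMT"}
cnn = {"Date": "Wed, 23 Sep 1998 23:50:21 GMT",
       "Last-Modified": "Wed, 23 Sep 1998 23:50:21 GMT"}

print(freshness_lifetime(cnnfn))  # zero: expired on arrival
print(freshness_lifetime(cnn))    # zero: apparent age is zero
```

Under these assumptions both responses are stale the moment they arrive, so every later request requires contacting the origin server again.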
We found it interesting how quickly some sites assigned special hostnames to the report. Among those we know about:
| Domain | First logged (EDT) |
|---|---|
| icreport.house.gov | Sep 11 13:34 |
| icreport.loc.gov | Sep 11 16:35 |
| starr-report.aol.com | Sep 11 16:53 |
| report.yahoo.com | Sep 11 17:06 |
| starrreport.excite.com | Sep 11 20:48 |
| icreport.access.gpo.gov | Sep 11 21:08 |
| icreport.lycos.com | Sep 12 14:43 |
| search.report.yahoo.com | Sep 14 08:39 |
| starrreport.ap.org | Sep 14 19:53 |
Where did requests originate? This analysis is highly speculative; all we have is domain names. Many of the more popular top-level domains (com, net) could be located in any country.
The table below shows the breakdown of requests by top-level domain. The first column is the percentage of ALL requests to our caches coming from that domain. The second is the percentage of requests for copies of the Independent Counsel Report from the websites referenced above. It seems that the top-level domains typically composed of U.S. organizations (com, net, edu, us) have a higher than normal interest in the report pages, while international sites have a lower than normal interest.
Remember, our log files are not a valid representation of all Web traffic. Connecting to the NLANR caches is entirely voluntary and we can only log accesses from sites which elect to send requests to us. A number of countries, such as the United Kingdom (uk) and Germany (de) have their own national caching systems. They currently do not forward requests to our caches.
| TLD | All requests (%) | IC Report requests (%) |
|---|---|---|
| com | 22.851658 | 33.429243 |
| id | 22.663313 | 19.279737 |
| net | 7.017079 | 12.513399 |
| is | 6.827721 | 1.730136 |
| th | 6.558700 | 2.177715 |
| edu | 6.138637 | 8.325341 |
| kr | 4.834703 | 1.893747 |
| za | 3.257730 | 3.652092 |
| pe | 1.942889 | 0.661965 |
| cr | 1.722956 | 0.955336 |
| sg | 1.659810 | 0.823695 |
| ph | 1.645584 | 2.709920 |
| de | 1.386551 | 1.570287 |
| ca | 1.327626 | 0.404325 |
| unknown | 1.214713 | 2.096850 |
| il | 1.140298 | 0.620592 |
| au | 1.001077 | 0.855665 |
| ru | 0.791419 | 0.562294 |
| ar | 0.665279 | 0.417489 |
| mx | 0.519621 | 0.073343 |
| es | 0.500825 | 0.276446 |
| se | 0.480439 | 0.216267 |
| cz | 0.326270 | 0.558533 |
| jp | 0.316839 | 0.067701 |
| nl | 0.198351 | 0.118477 |
| co | 0.172055 | 0.169252 |
| gr | 0.150432 | 0.327221 |
| ch | 0.133177 | 0.105313 |
| ro | 0.126287 | 0.139163 |
| gov | 0.123903 | 0.026328 |
| no | 0.109114 | 0.223789 |
| br | 0.098089 | 0.067701 |
| be | 0.090461 | 0.099671 |
| pl | 0.081722 | 0.007522 |
| cl | 0.075965 | 0.007522 |
| at | 0.072529 | 1.013634 |
| cn | 0.069248 | 0.030089 |
| org | 0.059942 | 0.107193 |
| us | 0.049127 | 0.150447 |
| mz | 0.040580 | 0.000000 |
| lk | 0.039025 | 0.099671 |
| in | 0.031349 | 0.120357 |
| it | 0.027349 | 0.000000 |
| pt | 0.023815 | 0.013164 |
| su | 0.014923 | 0.015045 |
| nz | 0.013419 | 0.000000 |
| by | 0.006232 | 0.011283 |
| mil | 0.000986 | 0.000000 |
| my | 0.000492 | 0.000000 |
| am | 0.000381 | 0.000000 |
| ua | 0.000193 | 0.000000 |
| tr | 0.000113 | 0.000000 |
| ge | 0.000063 | 0.000000 |
| tw | 0.000046 | 0.000000 |
| uk | 0.000039 | 0.000000 |
| fr | 0.000019 | 0.000000 |
| sk | 0.000010 | 0.000000 |
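A breakdown like the one in the table above could be produced along these lines. This is a sketch under assumptions: the input is a list of client hostnames, and clients without a usable domain name (for example, bare IP addresses) are counted as "unknown". The function names are made up for the example.

```python
from collections import Counter


def tld_of(hostname):
    """Return the lowercased top-level domain, or 'unknown'."""
    last = hostname.rstrip(".").rsplit(".", 1)[-1].lower()
    # A final label that is not purely alphabetic (such as the last
    # octet of an IP address) is not a TLD.
    return last if last.isalpha() else "unknown"


def tld_percentages(hostnames):
    """Percentage of requests per TLD, most common first."""
    counts = Counter(tld_of(h) for h in hostnames)
    total = sum(counts.values())
    return {tld: 100.0 * n / total for tld, n in counts.most_common()}
```

Running this once over all requests and once over the report requests only gives the two percentage columns.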