This article takes a more technical viewpoint on the caching issues raised here.
Warning – Not for the casual non-technical reader!
The Problem
RevaHealth.com is made up of tens of millions of pages, organised as ‘pretty’ URLs such as:
- /dentists/ireland
- /dentists/ireland/dublin
- /dentists/ireland/dublin/crowns
- /dentists/ireland/dublin/crowns/the-big-clinic
We cache the data for each page in the ASP.NET data cache, which holds the data needed to construct a page in memory. This works well for frequently requested pages, as they have a high cache hit rate. However, building the page from that data is still a fairly heavy code hit, which results in a Time to First Byte of 1.2 to 1.5 seconds.
This wasn’t providing the user experience that we wanted and we were determined to lower it, so we added ASP.NET output page caching with a time to live of an hour. This holds the fully constructed page in the web server's memory so it can be returned to the user almost instantly. This resulted in a Time to First Byte of 0.5 seconds.
This was great.
Or so we thought. Regular testing revealed that even frequently requested pages were rarely in our cache. In fact, only about one in ten pages was in the cache when requested. This wasn’t good. Was the output page caching not working?
Why?
The answer wasn’t so simple. Firstly, RevaHealth.com is very broad and flat. Search results are divided by the type of clinic (dental, cosmetic, etc) and then further subdivided by multiple levels of location (country, county, city & neighbourhood). To make matters worse there are options for further treatment and/or specialization sub-subdivisions. A typical landing URL might well look like this:
http://www.revahealth.com/dentists/uk/west-midlands/birmingham/erdington/implants
Landing pages are almost all ‘long tail’, and the tail is very, very long. With over 50,000 locations in over 200 countries, several dozen clinic types and hundreds of procedures, we have millions of search pages and over 100,000 clinics. We knew our landing pages covered a very broad range, but only when we looked into the figures more closely did we realise just how broad and how flat the site was.
In a typical period 152,072 visitors entered the site through 36,357 pages. Only 66 of those pages had more than 100 hits, and 30,000 had five or fewer hits. So in a typical one-hour period only a few hundred pages were getting a hit in the output cache. The huge bulk of pages were not in the cache when requested.
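To see why a one-hour lifetime leaves most of a long tail uncached, it can help to replay a request log against a simple model of the output cache. The sketch below is an illustration only: the function name, the (timestamp, URL) log format and the one-hour default are assumptions, and it models a fixed expiry counted from the moment a page is rendered, not the exact behaviour of the ASP.NET output cache.

```python
def ttl_cache_hit_rate(requests, ttl_seconds=3600):
    """Return the fraction of requests served from a simple TTL cache.

    `requests` is an iterable of (timestamp_seconds, url) pairs sorted by
    time. This log format is hypothetical, chosen for the illustration.
    """
    cached_at = {}  # url -> time the page was last rendered and cached
    hits = 0
    total = 0
    for timestamp, url in requests:
        total += 1
        stored = cached_at.get(url)
        if stored is not None and timestamp - stored < ttl_seconds:
            hits += 1  # a cached copy exists and has not expired
        else:
            cached_at[url] = timestamp  # miss: render the page and cache it
    return hits / total if total else 0.0
```

A page that is requested only once or twice an hour almost always misses in this model, which is the pattern the figures above describe.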
Looking for Solutions
Clearly a simple remedy would be to extend the cache life beyond an hour. But this has business implications. Firstly, when our customers update their profile, they want to see that change reflected as soon as possible. Asking them to wait more than an hour would not be good.
More importantly, for a site like RevaHealth.com, search ordering and the appearance of results is critical, and search results order updates happen dynamically as patients contact clinics, review clinics and generally interact with the site. So, extending cache time to a level where we get more of the tail into the cache would be very problematic.
We decided to simulate traffic to the site, and to force the most frequently requested pages into the cache on an hourly cycle.
cURL seemed to be the obvious tool to use, as we had some experience with it and it is widely accepted. We generated a list of the top 100,000 most frequently requested URLs and created a cURL script to fetch them all.
Our experiences with cURL
cURL is a feature-rich tool, but we wanted to use it in a pretty simple way – fetch all the pages on the list on an hourly cycle.
The first problem we encountered was that from the command line there is no way to limit the rate of page fetches. We knew we wanted to fetch them at a rate just above 12/second to ensure that the script would complete in an hour. But curl will only set a speed limit in kb/sec. Since our page size varies greatly, this made fixing that speed a case of trial and error. Obviously we didn’t want to fetch too fast and strain the server unnecessarily, or fetch too slow and not complete the list in an hour.
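For comparison, pacing by requests per second instead of by bandwidth is only a few lines in a scripting language. The following is a rough sketch, not what we ran: the function names are made up for illustration, and `fetch` stands for whatever callable actually requests a page.

```python
import time


def required_rate(url_count, window_seconds=3600):
    """Fetches per second needed to get through url_count URLs in the window."""
    return url_count / window_seconds


def fetch_paced(urls, fetch, per_second, sleep=time.sleep, clock=time.monotonic):
    """Call fetch(url) for each URL, starting no more than per_second calls a second.

    `fetch`, `sleep` and `clock` are injectable so the pacing can be tested
    without touching the network or waiting in real time.
    """
    interval = 1.0 / per_second
    next_start = clock()
    for url in urls:
        delay = next_start - clock()
        if delay > 0:
            sleep(delay)  # wait until this request's slot begins
        fetch(url)
        next_start += interval  # schedule the following slot
```

For example, `required_rate(43_200)` returns 12.0, so a list of that size would need exactly 12 fetches a second to finish inside the hour.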
We could have used libCurl in our own server code and set a rate per second there, but we were keen not to have to write code for this, and instead use the command line tool to keep it simple.
Some relatively straightforward trial-and-error tests revealed a rate which would enable the script to finish within the cache time available (one hour).
What was frustrating during this process was that we found no way for cURL to send the actual file data fetched to nul while saving the normal stdout output to a file, or even sending it to the screen. We didn’t want to save the actual output files, which could potentially get very large, but sending them to nul meant normal output was sent to nul too. Equally frustrating when testing was that the normal (non-verbose) output does not show the URL of the page being fetched.
The progress meter shows bytes downloaded, percentage completed, etc, but rather strangely, not the file being fetched, so there’s no easy way to tell your progress through the list of pages you are fetching.
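The behaviour we were after here (throw the body away, but log which URL was fetched) can be sketched in a short script. This is an illustrative sketch under assumptions, not the script we used: the `warm` function and its log format are invented for the example, and `opener` defaults to the standard library's `urllib.request.urlopen`.

```python
import urllib.request


def warm(url, opener=urllib.request.urlopen, log=print):
    """Fetch url, discard the body, and log one line that names the URL."""
    size = 0
    with opener(url) as resp:
        status = resp.status
        while True:
            chunk = resp.read(64 * 1024)
            if not chunk:
                break
            size += len(chunk)  # count the bytes but keep nothing
    log(f"{status} {size:>9} bytes  {url}")
    return size
```

Each fetch then produces a single line showing the status, the size and the URL, which makes it easy to see how far through the list a run has got.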
In the end though we got past all these problems and had a script that worked – or so we thought. In fact, our first run through made no difference to the cache at all. This caused a lot of head scratching until someone looked at the fetched files and we realised that, of course, they were not compressed.
We always return dynamic pages compressed. The output cache stores the gzipped response as a compressed file, so our script, which did not ask for compression, was only putting non-zipped pages into the cache.
Helpfully, cURL allows the HTTP request headers to be set on the command line, so simply adding --header "Accept-Encoding: gzip,deflate" fetched our zipped pages into the cache and testing in Firebug showed that they were being requested by our script.
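The same idea can be expressed in a script: send the Accept-Encoding header a browser would send, then check that the response really came back compressed. The helper names below are hypothetical and the check is deliberately simple; it is a sketch of the idea, not our warming script.

```python
import urllib.request


def build_warm_request(url, accept_encoding="gzip,deflate"):
    """Build a request that asks for a compressed response, as a browser would."""
    return urllib.request.Request(url, headers={"Accept-Encoding": accept_encoding})


def is_compressed(response_headers):
    """True if the response headers report a gzip or deflate encoded body."""
    encoding = response_headers.get("Content-Encoding", "")
    return encoding.strip().lower() in ("gzip", "deflate")
```

Checking `is_compressed` on the first few responses would have shown straight away that our first run was warming the wrong variant of each page.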
We watched memory usage during the build up of the cache, and made some adjustments to allow larger physical memory to be used. At a certain number of pages requested we began to see large page usage, so we scaled back the number of pages being requested and all returned to normal.
Browsers Browsers Browsers
We thought we were done, but one of the oddest things was yet to bite us. Like most developers, we love Firebug and we were checking everything using Firefox, but before we push changes live we do a fairly rigorous check in other browsers. Disaster. Firefox and Chrome were receiving our new cached page but Internet Explorer wasn’t.
Internet Explorer was simply bypassing the cached pages and hitting the code. This was exactly what we were trying to avoid.
The problem was that we were also using GZIP to compress the HTML. It turns out that IE sends a slightly different value in its Accept-Encoding request header than Firefox or Chrome does. Even though they all accepted the same encoding, the web server wouldn’t serve them the same cached copy.
- FF: Accept-Encoding: gzip,deflate
- Chrome: Accept-Encoding: bzip2
- IE: Accept-Encoding: gzip, deflate (note the space)
Essentially, because the browsers were each requesting the same file using very slightly different parameters, the web server treated them as different files.
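The effect is easy to reproduce with a toy cache key. In the sketch below both key functions are hypothetical illustrations, not the web server's real logic: one keys on the raw header string, the other on a normalised token list. The normalisation is an assumption added for the example (it ignores quality values) and is not something we deployed.

```python
def raw_cache_key(path, accept_encoding):
    """Key on the header exactly as sent; any difference means a new entry."""
    return (path, accept_encoding)


def normalized_cache_key(path, accept_encoding):
    """Key on the sorted, lower-cased encoding tokens, ignoring spacing."""
    tokens = sorted(
        token.strip().lower()
        for token in accept_encoding.split(",")
        if token.strip()
    )
    return (path, ",".join(tokens))


firefox = "gzip,deflate"
internet_explorer = "gzip, deflate"  # note the space after the comma

# The raw keys differ, so each browser gets its own cache entry.
assert raw_cache_key("/dentists/ireland", firefox) != raw_cache_key("/dentists/ireland", internet_explorer)
# The normalised keys match, so both browsers would share one entry.
assert normalized_cache_key("/dentists/ireland", firefox) == normalized_cache_key("/dentists/ireland", internet_explorer)
```

A header that lists a different set of encodings would still produce a different key under either scheme.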
The choice was simple, either:
- Cut the size of the cache to 33% and increase the length of it 3x
- Only support some browsers
Unfortunately the commercial reality of choices like this is – ‘Provide the greatest good to the greatest number of users’. This meant only providing cached pages for IE. As a result Firefox and Chrome users have a slightly degraded experience compared to IE users, however this degradation is largely compensated by faster JavaScript engines.
Note: IIS 7 introduces some control that solves this particular issue.
Your War Stories
We’d love to hear about your trials and tribulations getting time to first byte down. Leave a comment below.