AWS tip: Use expires metadata in S3 for smooth updating of some CloudFront content
Sometimes I think everyone in the world uses Amazon Web Services (AWS). That’s of course not true, but a lot of people do use AWS for a wide variety of purposes. One of those purposes is to serve static content from S3 buckets.
The AWS use case I’ll be talking about here is using S3 to host a “static” website that is served over HTTPS through CloudFront, and that website has some content that is updated at somewhat predictable intervals.
If you’re having trouble getting your CloudFront content to update as fast as it updated when you were just using S3 to serve by HTTP, then I might have the solution for you, and it might be a simple matter of adjusting your S3 workflow.
On a static site, the visitor might enter something into a form, but if they refresh the page, the slate is wiped clean, restored to how it was on the previous serve, unless the bucket owner has updated the S3 bucket before the refresh.
But when the S3 content is routed through CloudFront in order to serve it through HTTPS rather than HTTP, there might be delays in updating what the website’s visitors are served.
Depending on your particular use case, this can be a serious problem. How do you solve it? How do you get CloudFront to refresh, in a timely manner, those pages that have more or less predictable update schedules?
Your mileage may vary: the most efficient way might be to attach expires metadata in S3, which I’ll explain later on.
I don’t want to get bogged down in a long-winded description of my particular use case, so I’ll generalize it and fictionalize it a little bit to focus on the pertinent details of S3 and CloudFront.
Let’s say that you run a website that has a page listing upcoming arts and culture events in a city of medium size.
The file listing the upcoming events is upcoming.html. It is updated roughly daily. As events happen, they are removed and copied over to past.html (in turn those will eventually be archived).
As past events are removed from upcoming.html, other events move up the page until either they happen or are canceled. If an event is postponed, it gets moved down the page.
There are hardly any events on Mondays, Tuesdays and Wednesdays, so the page sometimes goes from Sunday to Thursday without an update.
I suppose you could just change the displayed date every day even if you change nothing else on the page (e.g., on Monday, you change the date to Monday’s date even though there are no events for Monday). That would make the update schedule much more predictable.
Regardless, this doesn’t present any problem at all for S3 served over HTTP. When you upload an updated page to S3, it’s available to the website’s visitors almost immediately.
It should be served over HTTPS, even though it doesn’t have any valuable secret information on it, like passwords or bank account numbers. There is no database, relational or otherwise. It doesn’t even have forms.
If someone wants to notify you of an upcoming event to list on the website, they send you an e-mail. If someone wants to donate to help you keep the website up and running, they can do so through PayPal, or you might even have a GoFundMe page set up for the purpose.
Even so, a hacker could try to hijack your website in order to trick visitors into downloading malware on their computers or mobile devices. So yeah, every website should be HTTPS.
Trouble is, an S3 bucket set up for static website hosting can’t serve HTTPS on its own, because the S3 website endpoints only speak HTTP. The content needs to pass through another service, preferably also an AWS service. That’s where CloudFront comes in.
I’m not going to give you a tutorial on CloudFront. Benjamin Rodarte wrote one here on Medium (scroll down to Part III) which is very relevant to this use case.
Before moving on, I’m going to say just one more thing about security. Once you’ve had CloudFront up and running long enough, you can get a listing of the most popular “objects” in a CloudFront distribution.
It sure sent a chill down my spine to see that wp-login.php is a popular page on a website I manage even though that website is not a WordPress website and doesn’t have any PHP pages on it whatsoever.
That tells me that people want to hack into the website for who knows what nefarious purpose.
Once you’ve got your website set up to deliver S3 content through CloudFront, verify your website does show as secure in browsers like Google Chrome and Mozilla Firefox.
The problem is now that when you upload updates to your S3 bucket, they are not always immediately reflected in your visitors’ Web browsers.
As I understand it, CloudFront keeps copies of your S3 content at its “edge” locations. To improve the speed of content delivery, CloudFront does not check with S3 on every request whether a particular file has been updated; by default it keeps serving its cached copy for 24 hours.
Let’s say CloudFront checked your S3 bucket for updates this morning. For whatever reason, you make changes to past2017.html shortly before lunchtime, so if someone goes to that page after lunch, they’ll probably get the old version.
And that’s fine, since not many people are going to be looking at that page anyway. I hope a lot of people are looking at upcoming.html, so it’s important that that page is as up-to-date as possible.
I can then understand your frustration at seeing an old version of upcoming.html appear on your Web browser. If CloudFront can’t predict when you’ll be updating a particular file, it can’t make the best decision as to when to check S3.
So you have to somehow let CloudFront know when a file needs to be updated.
The AWS documentation is thorough, but not always designed to help you find the best solution for your particular use case.
The first solution I found in the AWS documentation was “versioning,” in which you would add version numbers to filenames. I consider this solution to be unsatisfactory for the situation at hand.
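To make the versioning idea concrete, here is a minimal Python sketch. The `_v7` naming scheme and the `versioned_key` helper are my own assumptions for illustration, not something the AWS documentation prescribes.

```python
from pathlib import PurePosixPath


def versioned_key(key: str, version: int) -> str:
    """Insert a version number before the file extension of an S3 key."""
    path = PurePosixPath(key)
    # with_name() swaps only the final path component, so any folder prefix is kept.
    return str(path.with_name(f"{path.stem}_v{version}{path.suffix}"))


print(versioned_key("upcoming.html", 7))         # upcoming_v7.html
print(versioned_key("events/upcoming.html", 8))  # events/upcoming_v8.html
```

Every new version gets a new URL, so each link or bookmark pointing at the page would have to change with every update. That is one plausible reason this approach is awkward for a page like upcoming.html.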
The next solution I found was “invalidation,” in which you tell CloudFront that one or more specific files need to be updated.
Essentially the files in the CloudFront edge locations are declared to be invalid, so the edge locations have to go to S3 to get valid versions.
Okay, that’s cumbersome, but preferable to just sitting around complaining. So if you upload a new version of upcoming.html to your S3 bucket and you notice that it doesn’t change in the browser, you go ahead and invalidate the file in CloudFront.
I first read somewhere that AWS gives you ten free invalidations a month; the CloudFront pricing page actually puts it at the first 1,000 invalidation paths per month, with a fraction of a cent charged for each path after that. Still, the existence of a limit suggests you should be judicious with your invalidations, just like you should be careful not to use more than one automatic transfer from savings to checking per month (likely that’s a much bigger fee than what AWS charges you).
For about two months I was actually invalidating upcoming.html every Thursday and Friday night (for the Friday and Saturday updates) and maybe one or two other days in the month if necessary.
I’m sure there’s a way to get Python involved in this. Sure, I do want to learn how to use scripting in AWS, but Python might be too much for this particular application.
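For what it’s worth, a scripted invalidation does not have to be long. The following is a minimal sketch, assuming the boto3 library is installed and AWS credentials are configured; the helper names, the `manual-` reference prefix and the distribution ID in the usage note are hypothetical.

```python
import time


def build_invalidation_batch(paths, caller_reference=None):
    """Build the InvalidationBatch structure that CloudFront expects."""
    if caller_reference is None:
        # CloudFront wants a unique reference per request; a timestamp is one simple choice.
        caller_reference = f"manual-{int(time.time())}"
    return {
        # Each path starts with "/" and is relative to the distribution root.
        "Paths": {"Quantity": len(paths), "Items": list(paths)},
        "CallerReference": caller_reference,
    }


def invalidate(distribution_id, paths):
    """Ask CloudFront to invalidate the given paths (needs boto3 and credentials)."""
    import boto3  # imported here so the helper above works without boto3 installed

    client = boto3.client("cloudfront")
    return client.create_invalidation(
        DistributionId=distribution_id,
        InvalidationBatch=build_invalidation_batch(paths),
    )


batch = build_invalidation_batch(["/upcoming.html"], caller_reference="example-1")
print(batch)
```

With a real distribution, the call would look something like `invalidate("EXAMPLEDISTID", ["/upcoming.html"])`, where the ID is a placeholder for your own.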
I don’t know how it was that I finally realized that the expires metadata in S3 is the best solution for this use case. I had seen it before in the AWS documentation, but it just hadn’t clicked in my mind.
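Here’s how it works: either as you upload a file to S3 or immediately after, you attach expires metadata to the file; in the S3 console I enter it in the ISO 8601 date format (YYYY-MM-DD). S3 then sends that value out as the HTTP Expires header, so it travels with the file to the CloudFront edge locations, which treat their cached copy as stale once that time has passed.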
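I do this by hand in the S3 console, but the same idea can be sketched in Python. This is only a sketch under assumptions: it uses boto3 for the upload (not something I have set up myself), it treats the example time as UTC for simplicity, and the bucket name in the usage note is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def next_midnight_utc(now):
    """Return the first midnight (UTC) strictly after `now`."""
    now = now.astimezone(timezone.utc)
    tomorrow = now.date() + timedelta(days=1)
    return datetime(tomorrow.year, tomorrow.month, tomorrow.day, tzinfo=timezone.utc)


def http_date(moment):
    """Format a datetime the way an HTTP Expires header is written."""
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


def upload_with_expires(filename, bucket, key, expires):
    """Upload a file to S3 with Expires metadata (needs boto3 and credentials)."""
    import boto3  # imported here so the date helpers work without boto3 installed

    boto3.client("s3").upload_file(
        filename,
        bucket,
        key,
        ExtraArgs={"ContentType": "text/html", "Expires": expires},
    )


now = datetime(2018, 8, 12, 23, 50, tzinfo=timezone.utc)
expires = next_midnight_utc(now)
print(expires.date().isoformat())  # 2018-08-13
print(http_date(expires))          # Mon, 13 Aug 2018 00:00:00 GMT
```

An actual upload would then be something like `upload_with_expires("upcoming.html", "example-bucket", "upcoming.html", expires)`.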
The following screenshots illustrate adding the metadata after choosing the file to upload but before having S3 execute the upload. Imagine this is happening at 11:50 p.m. on Sunday, August 12, 2018.
They’ve added another one in between that I won’t show here; it pertains to read/write access for other users and other AWS accounts. If you don’t need to change anything on that one, just click “Next” again.
At some point shortly after midnight (presumably in the Virginia region), the CloudFront edge locations will go ahead and get the updated file.
Depending on your situation, you might need to do one more invalidation to get the ball rolling.
This is not to say you will never again need to do an invalidation. I’ve done one or two invalidations a month since I started using the expires metadata.
In one instance it was to correct a misspelling. Since then I’ve been more careful, so that my next invalidation is due to an urgent update, not a stupid little typo.
I’m not completely clear on the timeline; I assume it depends on the time zone of the region of your S3 bucket, though HTTP expiry times themselves are always expressed in GMT.
In my experience in the few months that I’ve been using the expires metadata, CloudFront has been fairly prompt about updating the edge locations, and generally much quicker than an invalidation.
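If you would rather check than guess, the response headers tell you what an edge location is serving. CloudFront adds an `X-Cache` header (for example “Hit from cloudfront”) and an `Age` header, and the `Expires` value set in S3 comes through as well. Here is a minimal sketch using only the Python standard library; the function names and the sample header values are made up for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen


def summarize_cache_headers(headers, now):
    """Pick out the cache-related response headers and check the expiry time."""
    lowered = {name.lower(): value for name, value in headers.items()}
    expires_raw = lowered.get("expires")
    expires = parsedate_to_datetime(expires_raw) if expires_raw else None
    return {
        "x_cache": lowered.get("x-cache"),
        "age_seconds": int(lowered["age"]) if "age" in lowered else None,
        "expires": expires,
        "expired": expires is not None and now >= expires,
    }


def fetch_headers(url):
    """Send a HEAD request and return the response headers as a dict."""
    with urlopen(Request(url, method="HEAD")) as response:
        return dict(response.headers.items())


sample = {
    "X-Cache": "Hit from cloudfront",
    "Age": "300",
    "Expires": "Mon, 13 Aug 2018 00:00:00 GMT",
}
summary = summarize_cache_headers(sample, datetime(2018, 8, 13, 0, 5, tzinfo=timezone.utc))
print(summary["x_cache"], summary["age_seconds"], summary["expired"])
```

Against a live site you would pass `fetch_headers("https://your-domain.example/upcoming.html")` (a placeholder URL) and the current time instead of the sample values.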
Having to do an invalidation once in a while is preferable to worrying about overrunning some unknown limit and getting an unexpected AWS bill increase. Not that it would be anything exorbitant.
For some of you, the expires metadata in S3 might not be the best solution. If you have multiple files that each require different update schedules, it might be a bit too time-consuming to put in the metadata.
I’ve already kind of run into that. One time I needed to put two different expiration dates on two separate files. On another occasion I failed to realize two different files needed the same date and could have been uploaded together.
But that’s just a simple matter of adjusting my workflow. For the most part, the expires metadata in S3 has worked pretty well for my use case.
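One way to adjust the workflow is to write down each file’s expiration date first and then group the files that share a date, so they can go up in one batch. A minimal Python sketch follows; the dates in it are hypothetical examples, not my real schedule.

```python
from collections import defaultdict
from datetime import date


def group_by_expiry(schedule):
    """Group filenames by expiration date so each group can be uploaded together."""
    groups = defaultdict(list)
    for filename, expires_on in schedule.items():
        groups[expires_on].append(filename)
    # Sort by date, then by filename, to get a stable upload checklist.
    return {day: sorted(names) for day, names in sorted(groups.items())}


schedule = {
    "upcoming.html": date(2018, 8, 17),
    "past.html": date(2018, 8, 17),
    "past2017.html": date(2018, 9, 1),
}
for day, names in group_by_expiry(schedule).items():
    print(day.isoformat(), names)
```

Each printed line is one upload: pick the listed files together and enter the date once.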
If you know of a more efficient way for this use case, in which content is updated in a fairly predictable way though not at fixed intervals, please leave a comment.
Please also leave a comment if anything above has become terribly outdated rather than just a tiny bit outdated.