On September 10th and 11th of 2012, GitHub suffered what they’re considering a “major” interruption in service. We did not see any direct adverse effects from this until approximately 16:00 EST on September 11th, at which time some of our themes, as well as sites published with our themes, started behaving erratically as some files used in our themes became inaccessible for an 18 to 20 hour time span. While I cannot be 100% certain that the GitHub outage was the cause of the issue, we were using GitHub to distribute those files, so it would seem the correlation was more than coincidental. The GitHub outage has since been resolved (within the last few hours) and our files are once again accessible. We also spent most of last night updating all themes to point to another CDN (Amazon CloudFront) as a proactive measure in the event that our files did not return. As of right now, everything is running, so the urgency for updating your themes and sites is less than it was over the last 20 hours. However, I would like to take this time to address some of the valid and very relevant concerns that have poured in over the last day.
What went wrong?
Honestly, I don’t fully know and I doubt I’m going to get any definitive answers. What I do know is that the files we were serving from GitHub became inaccessible for approximately 20 hours. We were quick to move those files to an enterprise-level CDN and update our themes within the first 6 hours of the crisis, but that did nothing right away for the many thousands of already published sites that were still relying on the lost files.
How did the problem get fixed?
Initially, we moved all of our hosted files from GitHub over to Amazon CloudFront, an enterprise-level CDN, and issued critical updates to the 9 themes that relied on these services. Eventually access to the files on GitHub was restored and all existing sites still using those resources were once again behaving normally.
How will this problem be avoided in the future?
By moving to Amazon CloudFront, the risk of this ever happening again is slim to none. Amazon CloudFront is a service built specifically as a robust CDN and is an enterprise-level service relied on by many of today’s web giants, like Netflix, NASDAQ and Pinterest, to name a few. But as Steppenwolf once sang, nothing is forever. So with that in mind we are discussing a few ideas to ensure our themes have safeguards built in, possibly with an API for user control. As a failsafe, we’ll have themes look internally, at copies of the files stored in the theme itself, when CDN hosted resources become inaccessible. Rest assured we’ll be implementing a plan to prevent such catastrophic failures in the very near future.
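To give a rough idea of what “looking internally” could mean, here is a minimal sketch in TypeScript. It is not the code in our themes, and nothing has been decided yet. The function names, the example URLs and the idea of passing in a loader function are all assumptions made for illustration. The sketch builds an ordered list of places to find a file (CDN first, then the theme’s own copy) and returns the first one that actually loads.

```typescript
// Hypothetical sketch only: these names and paths are illustrative
// assumptions, not the API of any shipping theme.

/** A function that tries to load one URL and resolves to true on success. */
type Loader = (url: string) => Promise<boolean>;

/** Build the ordered list of candidates: the CDN copy first, then the theme's local copy. */
function candidateUrls(file: string, cdnBase: string, localBase: string): string[] {
  // Normalize slashes so "base/" + "/file" never produces a double slash.
  const join = (base: string): string =>
    base.replace(/\/+$/, "") + "/" + file.replace(/^\/+/, "");
  return [join(cdnBase), join(localBase)];
}

/** Try each candidate in order; return the first URL that loads, or null if none do. */
async function loadWithFallback(urls: string[], load: Loader): Promise<string | null> {
  for (const url of urls) {
    try {
      if (await load(url)) {
        return url;
      }
    } catch (err) {
      // A thrown error counts as a failed load; move on to the next candidate.
    }
  }
  return null;
}

// Example with a fake loader that pretends the CDN (a made-up host) is down.
const exampleUrls = candidateUrls(
  "js/theme.js",
  "https://cdn.example.com/themes/",
  "rw_common/themes/mytheme"
);
const cdnIsDown: Loader = async (url) => url.indexOf("https://cdn.example.com") !== 0;
loadWithFallback(exampleUrls, cdnIsDown).then((picked) => {
  console.log("Loaded from:", picked);
});
```

In a real page the loader would be whatever actually fetches the file, for example a function that adds a script tag and reports whether it loaded or errored. Keeping that part separate means the fallback order can be tried out without a browser or a live CDN.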
What is a CDN?
A CDN is a content delivery network that allows web sites to draw from common sources, saving visitors from having to download the same files over and over again. This saves considerable time and bandwidth for the visitor and is less taxing on the website’s resources. Proper CDNs are also globally distributed so that those files can be obtained by the visitor’s computer from a node that is, perhaps, geographically much closer than the servers that the website is hosted on. Spreading the weight, so to speak, can help your website react faster.
Why does my theme require a CDN?
RapidWeaver themes are not easily updated by end users since new themes completely replace old themes. If a user has modified that theme for their purposes then those modifications are instantly lost. Likewise, if a user has made copies of a theme for specific RapidWeaver projects, then those copies are not updated without the end user spending considerable time to make a new copy of the updated theme. We used to work around this by supplying “Delta updates”, or update installers that users could download and run to install only those files which had changed from the last version. This was a less intrusive way to manage critical theme updates. However, in the very near future such installers might not be permissible in Mac OS X, given the direction Apple is taking things with Sandbox and Gatekeeper. In preparation for these operating system changes we had to go back to offering full theme updates that overwrite the entire theme. Since this is not an ideal user experience and since there is no way for us to know whether this method of updates will ever change in RapidWeaver, we took a long hard look at what we could do to make the lives of our customers easier. A CDN allows us to do that.
What are the benefits of using a CDN in my theme?
By serving some of a theme’s most frequently updated files “from the cloud” we can offer continual updates and improvements to themes and sites without the disruption and headache of updating entire themes (and theme copies). So if a line of code needs to be fixed in one of the CDN hosted files, we can change that code here, and every theme and site that uses that code will automatically benefit from it. Oftentimes a bug in our themes can be fixed, globally, in a few minutes with no effort on the end user’s part. Oftentimes the vast majority of users aren’t even aware that there was ever a bug or that one was fixed. Users don’t need to update a theme or republish a site, which is the sort of experience we think is best for users.
What are the drawbacks of using a CDN in my theme?
The most obvious drawback of using a CDN is being dependent on another host, person or service for your site to work. This kind of lack of control can be unnerving to some. Also, much like when your own site goes down, a CDN could possibly go down as well. While this isn’t likely (ahem), catastrophic failures can occur when using CDN services, especially when the service being used as a CDN wasn’t designed to be one.
In conclusion, this was a terrible and likely preventable set of circumstances that we have learned a great deal from. We are taking strides to ensure that nothing of this magnitude ever occurs again. While we did our best to act quickly and minimize the negative impact this had on our customers, there is no denying that this was, at the very least, a troublesome few hours that left many thousands of people confused and, in some cases, panicked. Please know that I am sorry and that I take this very seriously. I will be putting preventative measures into action as soon as humanly possible. Thank you for your continued support and patience.