Last night Shopify’s asset server suffered an outage that our monitoring setup did not detect. Our monitoring normally notifies three staff members via SMS within minutes of any technical issue. Because this issue went undetected, none of our admins were notified, which led to an extended outage of Shopify assets such as images and stylesheets.
The following is a detailed post-mortem from our system administrator, Alex, for the technically inclined:
What happened: yesterday we briefly switched to S3 for asset hosting. At the same time, two additional changes were made: the asset proxy’s hostname was changed (from an EC2-provided default) and monitoring was disabled (because S3 returns 403 Access Denied instead of our usual Shopify Asset not found page, which Pingdom interprets as a failure). We ended up rolling the S3 change back. I reverted the asset proxy changes in a hurry, sitting on a bench on the street on my way home, but I did not revert the hostname or monitoring changes. At some point last night the log rotation script refreshed Squid, which freaked out because, for some reason, it could not resolve its own hostname. That triggered the downtime, and with monitoring still disabled, nobody was alerted.
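For the technically inclined, here is a minimal Python sketch of two checks that would address the gaps described above. It is an illustration only, not the tooling we actually run. The function names, the `EXPECTED_STATUSES` table, and the assumption that our "not found" page is served with a 404 are all hypothetical. The first check confirms that a host can resolve its own hostname before the proxy is reloaded; the second lets a monitor accept a different expected status per backend instead of being switched off.

```python
import socket


def hostname_resolves(hostname, resolver=socket.getaddrinfo):
    """Return True if `hostname` resolves to at least one address.

    `resolver` is injectable so the check can be exercised without DNS.
    """
    try:
        return len(resolver(hostname, None)) > 0
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def check_passes(status_code, expected_statuses):
    """Return True if the HTTP status is one the backend is expected to return."""
    return status_code in expected_statuses


# Hypothetical expectations for a request to a missing asset.
# The 404 for the proxy is an assumption; the 403 for S3 is the
# Access Denied response mentioned in the post-mortem.
EXPECTED_STATUSES = {
    "proxy": {200, 404},
    "s3": {200, 403},
}

if __name__ == "__main__":
    name = socket.gethostname()
    if hostname_resolves(name):
        print(f"{name} resolves; safe to reload the proxy")
    else:
        print(f"{name} does not resolve; skipping reload")
```

A log rotation script could run the first check and skip the reload when it fails, and a monitor configured with the second could have stayed enabled through the S3 switch and the rollback.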
We are very sorry about this, and we are tightening up our monitoring and escalation setup so that a problem like this cannot go undetected again.