Revenge of the Fat Finger
So, yesterday was an interesting day. It started with our monitoring systems lighting up like a Christmas tree when a server that was hosted by Linode mysteriously went down. While we were frantically trying to figure out what was going on, we received this via e-mail:
Support Ticket 7576722 regarding Linode ‘XXX (linodeXXXX)’ has been updated by ‘mXXX’
While performing routine maintenance, our administrators erroneously rebooted the physical host that your Linode resides on. Your Linode will return to its last state shortly (booted or powered off). There is no need to issue a boot job for your Linode at this time.
Please accept our sincere apologies for the interruption in service, and please let us know if you have any questions or concerns.
OK, at least we had a reason for what happened. Then, a bit later, we started getting all kinds of support tickets for seemingly unrelated problems: composing e-mail was freezing (it was trying to pull signature graphics from S3), a restore of client data was going excruciatingly slowly (S3 back end), and some custom scripts integrated into our monitoring system were crashing (they used S3 to pass a JSON file to a third party). It all made sense once we learned of the issue at AWS S3.
And then this morning I read this in the AWS postmortem (https://aws.amazon.com/message/41926/):
At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.
Twice in one day, an incorrectly executed command (essentially a typo) took servers down: a few dozen at Linode and, it seemed, half the Internet in the case of AWS.
It struck me as a reminder that all this technology, which at times seems to get closer to magic every day, still needs us techs in the background making it all go.
At least when we enter the correct commands.