Today, The Storehouse experienced an outage that lasted approximately 12 hours. It was caused by updates performed late the night before and by services restarting during that process.
When you have an application, there are inevitably some things that just need to be done periodically. These aren’t tied directly to user actions, so the quick answer is usually cron. It’s easy to set up, but when it breaks, it can cause subtle issues that may impact your customers or your application.
So I’m a human, and I have outages. My goal is to be more transparent, not only with my customers but also with myself, about why the outage occurred and what I can do to keep it from happening again. From February 16 to 18, Storehouse had a few intermittent outages that each lasted anywhere from 1 to 3 hours.
I rely on Zabbix to keep tabs on all of my machines and to make sure all of The Storehouse is working perfectly. Waking up to 30+ emails from Zabbix is always troubling and a pretty good cause for alarm. Turns out the alerts were fairly innocuous: the sign of a pretty simple issue related to the backup of that VM. I’ll try to outline the steps I used to diagnose it and the lessons I learned along the way.