A subtle problem I ran into was Proxmox VE sometimes unmounting a GlusterFS volume, which made backups to that volume fail. This one was sneaky, though: since the PVE backup program never actually ran, it also never sent an email notifying me of the failure. As a result, backups could fail silently for some time, until I happened to log in and see the errors in the cluster’s log.
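One way to keep a failure like this from staying silent is a pre-flight check that complains loudly when the storage directory isn't actually a mount point. This is a minimal sketch under assumptions, not necessarily how I fixed it: PVE normally mounts GlusterFS storage at /mnt/pve/<storage-id>, and `gluster-backups` below is a hypothetical storage ID.

```python
import os
import sys

# Placeholder path: PVE normally mounts GlusterFS storage at /mnt/pve/<storage-id>;
# "gluster-backups" is a hypothetical storage ID used only for illustration.
BACKUP_MOUNT = "/mnt/pve/gluster-backups"


def check_backup_mount(path: str = BACKUP_MOUNT) -> bool:
    """Return True if `path` is an active mount point; complain loudly otherwise.

    os.path.ismount() returns False for a missing path and for a plain local
    directory, which is what an unmounted storage directory looks like.
    """
    if os.path.ismount(path):
        return True
    # cron normally mails any output to the job owner (or MAILTO),
    # so printing here turns a silent failure into a visible one.
    print(f"ERROR: {path} is not mounted; backups to it will fail", file=sys.stderr)
    return False
```

Calling something like `sys.exit(0 if check_backup_mount() else 1)` from a cron job shortly before the scheduled backup would at least turn an unnoticed failure into an email.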
[Figure: Graph of free memory on a node with a leaking piece of software.]
So I’m human, and I have outages. My goal is to be more transparent, not only with my customers but also with myself, about why an outage occurred and what I can do to keep it from happening again. From February 16 to 18, Storehouse had a few intermittent outages that lasted anywhere from one to three hours.
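This is a scary problem when you’re recovering from an outage of your database machines. If you’re running a Galera cluster and all of its nodes go offline, you’ll need to do a bit of work to restart the cluster safely, which mostly comes down to bootstrapping it from the node with the most recent data so no committed transactions are lost.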
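The key decision is which node to bootstrap from. Each node records its last committed position as `seqno` in grastate.dat in its data directory (commonly /var/lib/mysql/grastate.dat), and the node with the highest `seqno` has the newest data. Here's a minimal sketch of that comparison, assuming you've already gathered each node's grastate.dat contents; the function and node names are hypothetical.

```python
from typing import Dict, Optional


def parse_grastate(text: str) -> Dict[str, str]:
    """Parse the `key: value` lines of a grastate.dat file, skipping comments."""
    state: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            state[key.strip()] = value.strip()
    return state


def pick_bootstrap_node(states: Dict[str, str]) -> Optional[str]:
    """Given {node_name: grastate.dat contents}, return the node with the highest seqno.

    Returns None when there is nothing to compare or when any node reports
    seqno -1 (it did not shut down cleanly), because that node's real position
    has to be recovered first, e.g. with `mysqld --wsrep-recover`.
    """
    seqnos: Dict[str, int] = {}
    for node, text in states.items():
        seqno = int(parse_grastate(text)["seqno"])
        if seqno == -1:
            return None
        seqnos[node] = seqno
    if not seqnos:
        return None
    # Ties are fine: any node holding the highest seqno has the newest data.
    return max(seqnos, key=lambda n: seqnos[n])
```

Once you know the node, the usual procedure on a systemd-based MariaDB install is to set `safe_to_bootstrap: 1` in that node's grastate.dat, start it with `galera_new_cluster`, and then start the remaining nodes normally so they rejoin.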
This is another one of those things that is pretty straightforward but requires pulling together information from several different sources to get things up and running. The goal here is to get Zabbix to monitor our MariaDB server’s status (MariaDB is a drop-in replacement for MySQL, so I’ll refer to either as MariaDB here). There’s a built-in template, but a few other files and settings need to be set up before you can get the juicy data flowing.
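As far as I know, the template's MariaDB items are built on the server's global status variables, the same counters `SHOW GLOBAL STATUS` reports. The sketch below is not how the template collects them; it's a minimal Python stand-in, assuming the `mysqladmin` client is installed and can log in through an option file such as ~/.my.cnf, that pulls those counters and parses them into a dictionary. The function names are hypothetical.

```python
import subprocess
from typing import Dict


def fetch_extended_status() -> str:
    """Run `mysqladmin extended-status`; credentials come from option files such as ~/.my.cnf."""
    result = subprocess.run(
        ["mysqladmin", "extended-status"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_extended_status(output: str) -> Dict[str, str]:
    """Turn the ASCII table printed by `mysqladmin extended-status` into a dict.

    Data rows look like `| Threads_connected | 4 |`; border lines (+---+---+)
    and the `Variable_name | Value` header row are skipped.
    """
    status: Dict[str, str] = {}
    for raw in output.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            continue  # skip +----+ borders and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 2 or cells[0] == "Variable_name":
            continue  # skip the header row and anything malformed
        name, value = cells
        status[name] = value
    return status
```

Something like `status = parse_extended_status(fetch_extended_status())` followed by `print(status["Uptime"])`, run as the same user the Zabbix agent runs as, is a quick way to confirm the agent's credentials work before digging into the template itself.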