Outage September 15 2017

Today, The Storehouse experienced an outage that lasted approximately 12 hours. It was caused by updates performed late the night before and by services restarting during that process. Last night I ran upgrades on the servers that run The Storehouse, including our three Proxmox VE nodes. When the upgrades on these nodes were complete, the nodes had an updated kernel version and needed to restart to use the new kernel. Restarting is usually a painless process in our environment....

September 15, 2017

Staggering Chef Client Runs

One of the new tools I’ve discovered is Chef, which manages the configuration and software on Storehouse’s fleet of virtual machines. Chef makes it really handy to update and track config changes, since everything can be tracked using Git or similar. One issue we ran into was having chef-client run at the same time on multiple machines. This issue is kinda subtle, but makes a lot of sense when you think about it....

July 20, 2017
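
For the chef-client entry above, one common way to keep many machines from running at the same moment is to give each host a stable delay derived from its hostname. The sketch below only illustrates that idea and is not necessarily what the post does; the function name `stagger_offset`, the 30-minute interval, and the hostnames are assumptions made for the example.

```python
import hashlib


def stagger_offset(hostname: str, interval_seconds: int = 1800) -> int:
    """Return a stable delay in [0, interval_seconds) derived from the hostname."""
    # Hashing gives the same offset for a host on every run, while
    # spreading different hosts across the interval.
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % interval_seconds


if __name__ == "__main__":
    # Hypothetical hostnames, used only to show the spread.
    for host in ("web01", "web02", "db01"):
        print(host, stagger_offset(host))
```

chef-client also has its own `interval` and `splay` settings, which add a random delay rather than a fixed per-host one.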

Make a Site Private but Allow Let’s Encrypt

This is a pretty straightforward thing I’ve wanted to do for some time. Basically, I have a number of sites that I use internally. I wanted to get certificates for them via Let’s Encrypt, but I also wanted to keep them restricted to only a few IP addresses. The solution is quite simple and works perfectly. We accomplish this with two .htaccess files: one at the site root to restrict the IP addresses that can access the site, and a second to disable that restriction on the directory where the Let’s Encrypt challenge is stored....

May 22, 2017

Monitoring a Mount Point With Zabbix

A subtle issue I ran into was that Proxmox VE would sometimes unmount a GlusterFS volume and then fail to back up. This issue was a bit sneaky, though: since the PVE backup program wouldn’t execute, it wouldn’t send an email notifying me of the failure. As a result, the backups would fail silently for some time, until I happened to log in and see the errors in the cluster’s log....

March 29, 2017
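
The mount-point check in the entry above can be reduced to a script that prints 1 when a path is mounted and 0 when it is not, which is the kind of numeric value a monitoring item can alert on. This is a minimal sketch rather than the post's actual setup; the function name `mount_status` and the default path `/mnt/pve/gluster` are assumptions.

```python
import os
import sys


def mount_status(path: str) -> int:
    """Return 1 if path is currently a mount point, otherwise 0."""
    return 1 if os.path.ismount(path) else 0


if __name__ == "__main__":
    # Hypothetical default path; pass the real mount point as the first argument.
    target = sys.argv[1] if len(sys.argv) > 1 else "/mnt/pve/gluster"
    print(mount_status(target))
```

A script like this could be exposed to Zabbix through an agent UserParameter, with a trigger that fires when the value drops to 0.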

Outages Feb 16-18 2017

So I’m a human, and I have outages. My goal is to be more transparent, not only with my customers but also with myself, about why an outage occurred and what I can do to keep it from happening again. From February 16 to 18, Storehouse had a few intermittent outages that lasted anywhere from 1 to 3 hours. This post is long overdue; heck, it’s even March now. I don’t have a good excuse for the delay: I knew what caused the outages and had taken corrective action, but I simply put off writing this....

March 1, 2017

MySQL (MariaDB) Galera Cluster Restart

This is a scary problem when you’re recovering from an outage of your database machines. If you’re running a Galera cluster and all of its nodes go offline, you’ll need to do a bit of work to restart the cluster and make it safe. Galera relies on at least one node in your cluster running at all times. If your entire cluster goes offline, you won’t be able to start it again, even with the --wsrep-new-cluster option....

February 5, 2017
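
When a whole Galera cluster has gone down, a usual first step is to work out which node holds the most recent state, typically by comparing the `seqno` value saved in each node's grastate.dat. The sketch below shows that comparison only and may differ from the procedure in the post; the function names, node names, and sample file contents are assumptions.

```python
from typing import Dict, Optional


def parse_grastate(text: str) -> Dict[str, str]:
    """Parse 'key: value' lines from grastate.dat content, skipping comments."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


def pick_bootstrap_node(files: Dict[str, str]) -> Optional[str]:
    """Return the node with the highest saved seqno, or None if none is usable."""
    best_node, best_seqno = None, -1
    for node, text in files.items():
        # A seqno of -1 means the node did not record its position on shutdown.
        seqno = int(parse_grastate(text).get("seqno", "-1"))
        if seqno > best_seqno:
            best_node, best_seqno = node, seqno
    return best_node


if __name__ == "__main__":
    # Hypothetical file contents gathered from three nodes.
    sample = {
        "db1": "# GALERA saved state\nversion: 2.1\nseqno: 1200\n",
        "db2": "# GALERA saved state\nversion: 2.1\nseqno: 1234\n",
        "db3": "# GALERA saved state\nversion: 2.1\nseqno: -1\n",
    }
    print(pick_bootstrap_node(sample))
```

If every node reports -1, the position has to be recovered first (for example with `mysqld --wsrep-recover`) before picking a node to bootstrap from.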

Zabbix MySQL (MariaDB) Monitoring

This is another one of those things that is pretty straightforward but requires pulling together information from a few different sources to get things up and running. The goal here is to get Zabbix to monitor our MariaDB server’s status (MariaDB is a drop-in replacement for MySQL; I’ll refer to either as MariaDB here). There’s a built-in template, but a few other files and settings need to be set up before you can get the juicy data flowing....

January 31, 2017

Proxmox 3 to 4 Upgrade Network Issue

This is a problem that showed itself when upgrading our Proxmox 3.2 nodes to Proxmox 4. About halfway through the upgrade, our network adapters suddenly stopped being able to communicate with any local addresses, but could still ping outside addresses. The cause was a minor config change that gets added pretty stealthily. When this happens, simply add the line “bridge_vlan_aware yes” to the bridge config in /etc/network/interfaces, to make the entire config section resemble:...

January 23, 2017