Following on-and-off outages, Microsoft-owned GitHub explains that the week-long disruption developers had been experiencing was due to resource contention in the service’s MySQL cluster.
While highlighting the root cause of the outages, senior VP of engineering Keith Ballinger goes into detail about each individual outage, including how long the service was down and how the MySQL clusters were connected.
“We again saw a recurrence of load characteristics that caused client connections to fail and again performed a primary failover in order to recover. In order to reduce load, we throttled webhook traffic and will continue to use that as a mitigation to prevent future recurrence during peak load times as we continue to investigate further mitigations.”
Ballinger’s transparent blog post concludes with an apology and a preliminary plan to prevent future MySQL-related outages, which includes “an audit of load patterns for this particular database during peak hours and a series of performance fixes based on these audits.”
As a preventative measure, GitHub will move traffic to other databases to reduce load and speed up failover times, while also scaling its existing infrastructure and hardware.
Microsoft and its related businesses are having a less-than-complimentary week of headlines involving hacks, bribery accusations and Outlook bugs, but at least the company seems to have a handle on the GitHub outages.