The lightning strikes occurred Aug. 13, and the resulting storage system problems weren't fully resolved for five days. Google's post-mortem found room for improvement in both its hardware and its engineering response to the problem.
The outage "is wholly Google's responsibility," the firm said, with no hint that nature, God or the local power grid should share any blame. This clear admission speaks a truth about the data center business: Downtime for any reason, especially at the world's highest-performing data centers, is unacceptable.
About 19% of the data center sites that "experienced a lightning strike experienced a site outage and critical load loss," said Matt Stansberry, a spokesman for the Uptime Institute. The institute, which advises users on reliability issues, maintains a database of abnormal incidents.
"A lightning storm may knock out utility and paralyze engine generators in a single strike," said Stansberry. Uptime recommends that data center managers transfer load to engine generators "upon credible notification of lightning in the area."
Moving to generators when lightning is within three to five miles "is a common protocol," he said.
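As a rough illustration only, the distance rule Stansberry describes could be expressed as a simple check. This is a minimal sketch, not any operator's actual procedure: the function name, the constant and the choice of the five-mile outer bound are all assumptions made for the example.

```python
# Hypothetical threshold: the "three to five miles" range comes from the
# article; using the outer bound (5 miles) is an assumption of this sketch.
TRANSFER_RADIUS_MILES = 5.0


def should_transfer_to_generators(strike_distance_miles: float,
                                  radius_miles: float = TRANSFER_RADIUS_MILES) -> bool:
    """Return True if a reported strike is within the transfer radius."""
    if strike_distance_miles < 0:
        raise ValueError("distance must be non-negative")
    return strike_distance_miles <= radius_miles


if __name__ == "__main__":
    # Example distances (in miles) for reported strikes; values are made up.
    for distance in (12.0, 5.0, 2.5):
        action = "transfer load" if should_transfer_to_generators(distance) else "stay on utility"
        print(f"strike at {distance} miles: {action}")
```

A real facility would feed such a rule from a lightning-detection service and weigh it against generator fuel and run-time limits, none of which is modeled here.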
The Belgium lightning strikes caused "a brief loss of power to storage systems" that host disk capacity for Google Compute Engine (GCE) instances. GCE lets users create and run virtual machines. Customers got errors and, in a "very small fraction" of cases, suffered permanent data loss.
Google thought it was prepared. Its automatic auxiliary systems restored power quickly, and its storage systems were designed with battery backup. But some of those systems "were more susceptible to power failure from extended or repeated battery drain," said the firm in its report on the incident.
After this event, Google's engineers conducted a "wide-ranging review" of the company's data center technology, including electrical distribution, and found areas needing improvement. They include upgrading hardware "to improve cache data retention during transient power loss," as well as "improve[d] response procedures" for its system engineers.
Google is hardly alone in facing this problem. Amazon suffered an outage at a data center in Dublin, Ireland, in 2011.
Google touts its reliability and prepares for the unimaginable, including earthquakes and even public health crises, with planning that "assumes people and services may be unavailable for up to 30 days." (This is planning for a pandemic.)
Google didn't quantify the 0.000001% data loss, but for a company that seeks to make the sum total of the world's knowledge searchable, it still might be enough data to fill a local library or two.
Only Google knows for sure.