Despite weekend reports claiming that Dropbox had been hacked, the company has today released a post-mortem of its downtime that puts the blame on its update and database infrastructure.
The company said that the weekend’s outage was caused by a bug in an update script, which reinstalled a number of machines handling production traffic for photo sharing, camera uploads, and some APIs.
“On Friday at 5.30pm PT, we had a planned maintenance scheduled to upgrade the OS on some of our machines. During this process, the upgrade script checks to make sure there is no active data on the machine before installing the new OS,” wrote Dropbox head of infrastructure Akhil Gupta.
“A subtle bug in the script caused the command to reinstall a small number of active machines. Unfortunately, some master-slave pairs were impacted, which resulted in the site going down.
“Your files were never at risk during the outage.”
Gupta said that Dropbox was able to recover from backups and restore “most functionality” within three hours.
However, the company was not able to restore full functionality until Sunday afternoon, Pacific Time, due to the large size of the MySQL databases that it uses.
Seemingly taken aback by how long the standard tooling took to restore from MySQL backups, Dropbox said it has developed a tool that allows faster restorations by parallelising the replay of binary logs. The company plans to open source the tool in future.
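Dropbox has not published how its tool works, so the following Python sketch only illustrates the general idea of parallelised log replay. It assumes, purely for illustration, that logged statements touching different databases are independent of one another: events are grouped by database, each group is replayed in its original order, and the groups run concurrently. The names `replay_in_parallel` and `apply_event` are hypothetical and are not part of MySQL or of Dropbox's tool.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def replay_in_parallel(events, apply_event, max_workers=4):
    """Replay logged statements, one worker per database at a time.

    events      -- iterable of (database_name, statement) pairs in log order
    apply_event -- hypothetical callback that applies one statement
    Returns a dict mapping each database name to the number of statements replayed.
    """
    # Group statements by database, preserving their order within each group.
    by_db = defaultdict(list)
    for db, stmt in events:
        by_db[db].append(stmt)

    def replay_one(db):
        # Within a single database, statements are applied strictly in log order.
        for stmt in by_db[db]:
            apply_event(db, stmt)
        return db, len(by_db[db])

    # Different databases are replayed concurrently (assumed independent).
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(replay_one, by_db))
```

The trade-off in a scheme like this is that ordering is only guaranteed within each group, so it is only safe where the independence assumption actually holds.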
To prevent the update script from reinstalling active machines in Dropbox’s master and dual-slave database infrastructure, Gupta said that active machines would now be able to ignore such commands.
“Over the past few years, our infrastructure has grown rapidly to support hundreds of millions of users. We routinely upgrade and repurpose our machines. When doing so, we run scripts that remotely verify the production state of each machine,” Gupta said.
“We’ve since added an additional layer of checks that require machines to locally verify their state before executing incoming commands. This enables machines that self-identify as running critical processes to refuse potentially destructive operations.”
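The local check Gupta describes can be pictured with a minimal Python sketch. It is not Dropbox's implementation: the `Machine` class, the command names in `DESTRUCTIVE`, and the idea of tracking critical processes as a set of names are all assumptions made for illustration. The point is only that the machine itself, not the remote script, has the final say on a destructive command.

```python
from dataclasses import dataclass

# Hypothetical names for commands treated as destructive in this sketch.
DESTRUCTIVE = frozenset({"reinstall_os", "wipe_disk"})


@dataclass
class Machine:
    hostname: str
    # Names of critical processes this machine sees running locally
    # (how they are detected is outside the scope of this sketch).
    critical_processes: frozenset

    def is_active(self) -> bool:
        # A machine self-identifies as active if any critical process is running.
        return bool(self.critical_processes)

    def handle(self, command: str) -> str:
        # Refuse destructive commands locally, whatever the remote check concluded.
        if command in DESTRUCTIVE and self.is_active():
            return "refused"
        return "executed"
```

In this sketch a machine running a database process would return "refused" for `reinstall_os`, while an idle spare would carry the command out, so a bug in the remote check alone would no longer be enough to wipe an active machine.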