Ask HN: YouTube down?
From the postmortem: "The incident occurred while Google's network operations team was replacing the routers that link us-central1-c to Google's backbone that connects to the public internet. Google engineers paused the router replacement process after determining that additional cabling would be required to complete the process and decided to start a rollback operation. The rollout and rollback operations utilized a version of workflow that was only compatible with the newer routers. Specifically, rollback was not supported on the older routers."
And the postmortem action item is: "Fix the automated workflows for router replacements to ensure the correct version of workflows are utilized for both types of routers."
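That action item amounts to dispatching the right workflow version by router generation instead of assuming one version fits the whole fleet. A minimal sketch of that idea (all names and versions here are hypothetical, not from the postmortem):

```python
# Hypothetical sketch: register a workflow per router generation and refuse
# to run anything on a platform that has no registered workflow, rather than
# silently applying a workflow that only works on the newer routers.

WORKFLOWS = {
    "new": "replacement-workflow-v2",  # supports both rollout and rollback
    "old": "replacement-workflow-v1",  # legacy workflow for older routers
}

def select_workflow(router_generation: str) -> str:
    """Return the workflow compatible with this router generation."""
    try:
        return WORKFLOWS[router_generation]
    except KeyError:
        # Fail loudly before any change starts, instead of mid-maintenance.
        raise ValueError(f"no workflow registered for {router_generation!r}")
```

The design point is the explicit failure: an unregistered platform stops the maintenance up front instead of discovering the incompatibility after the rollout has begun.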
The action items should have been "1) make this work for these two routers, and 2) make sure no platform ever gets left out again." This shouldn't have happened, because testing both upgrade and rollback on all of your gear should be standard practice. Network gear vendors do this routinely before they ship new gear with upgrade instructions. Google could easily throw together end-to-end automated tests of upgrades and rollbacks, and refuse to perform maintenance until the tests pass.
The bigger postmortem question should be: why was the change allowed at all if the platform didn't support rollback? Additional action item: "3) don't allow changes on platforms that don't support rollback and haven't passed rollback tests."
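The gate proposed above could be as simple as running an upgrade-then-rollback round trip against every platform in the fleet before any maintenance window opens. A toy sketch, with all field names and values invented for illustration:

```python
# Hypothetical pre-maintenance gate: maintenance is allowed only if every
# platform in the fleet passes BOTH an upgrade test and a rollback test.

def upgrade_test(platform: dict) -> bool:
    # Placeholder: in reality this would drive the upgrade workflow in a lab.
    return platform.get("supports_upgrade", False)

def rollback_test(platform: dict) -> bool:
    # Placeholder: the older routers in this incident would return False.
    return platform.get("supports_rollback", False)

def maintenance_allowed(fleet: list[dict]) -> bool:
    """True only if upgrade AND rollback succeed on every platform."""
    return all(upgrade_test(p) and rollback_test(p) for p in fleet)

fleet = [
    {"name": "new-router", "supports_upgrade": True, "supports_rollback": True},
    {"name": "old-router", "supports_upgrade": True, "supports_rollback": False},
]

print(maintenance_allowed(fleet))  # → False: the old routers fail the gate
```

With a check like this wired into the change-approval process, the incompatible rollback would have blocked the change before it started rather than stranding the rollback mid-incident.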
Now, did they need to test rollback? Maybe they don't mind portions of Cloud SQL, Spanner, Storage, Bigtable, and App Engine being down for 41 minutes in one zone. But if they're not even testing rollback for BGP changes, what else aren't they testing?
...Also, lol, they realized in the middle of an upgrade that they didn't have enough network cable? Maybe add an extra action item: "4) count how much network cable you have before you start replacing core routers"