Ask HN: YouTube down?

Unrelated, but about a recent network outage (https://status.cloud.google.com/incident/cloud-networking/18...), the postmortem says:
  The incident occurred while Google's network operations team was replacing
  the routers that link us-central1-c to Google's backbone that connects to
  the public internet. Google engineers paused the router replacement process
  after determining that additional cabling would be required to complete the
  process and decided to start a rollback operation. The rollout and rollback
  operations utilized a version of workflow that was only compatible with the
  newer routers. Specifically, rollback was not supported on the older routers.
And the postmortem action item is:
  Fix the automated workflows for router replacements to ensure the correct
  version of workflows are utilized for both types of routers.
The action items should have been "1) make this work for these two routers, and 2) make sure no platforms ever get left out again".

This shouldn't have happened, because it should be standard practice to test both upgrade and rollback on all your gear. Network gear vendors do this routinely before they ship new gear with upgrade instructions. Google could build end-to-end automated tests of upgrades and rollbacks, and refuse to perform maintenance until those tests pass.

The bigger postmortem question should be: why was the change allowed at all if the platform didn't support rollback? Additional action item: "3) don't allow changes unless every affected platform supports rollback and has passing rollback tests".
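To make action items 1 through 3 concrete, here is a rough sketch of the kind of pre-change gate I mean. It is not how Google's tooling works (I have no idea what that looks like). The platform names, the `TestResult` record, and both function names are made up for illustration. The idea is just that a change is blocked unless every affected platform has a passing test for both rollout and rollback, so a platform with no test at all blocks the change the same way a failing test does.

```python
from dataclasses import dataclass

# Operations that must have a passing test on every platform (assumed set).
REQUIRED_OPERATIONS = ("rollout", "rollback")


@dataclass(frozen=True)
class TestResult:
    """One end-to-end test outcome. Hypothetical record, not a real API."""
    platform: str   # e.g. "old-router" or "new-router" (made-up names)
    operation: str  # "rollout" or "rollback"
    passed: bool


def blocking_reasons(platforms, results):
    """Return the reasons a change must not proceed; empty list means go."""
    passed = {(r.platform, r.operation) for r in results if r.passed}
    reasons = []
    for platform in platforms:
        for operation in REQUIRED_OPERATIONS:
            # A missing test and a failed test are treated identically.
            if (platform, operation) not in passed:
                reasons.append(f"{platform}: no passing {operation} test")
    return reasons


def may_proceed(platforms, results):
    """True only if every platform passed every required operation."""
    return not blocking_reasons(platforms, results)


# Example: rollback was never tested on the older routers.
platforms = ["old-router", "new-router"]
results = [
    TestResult("new-router", "rollout", True),
    TestResult("new-router", "rollback", True),
    TestResult("old-router", "rollout", True),
]
print(blocking_reasons(platforms, results))
# ['old-router: no passing rollback test']
```

Keying the check on the list of affected platforms, rather than on whatever tests happen to exist, is what covers "make sure no platforms ever get left out again": adding a new router type to the list blocks changes until someone writes its tests.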

Now, did they need to test rollback? Maybe they don't mind portions of CloudSQL, Spanner, Storage, BigTable, and AppEngine being down for 41 minutes in one zone. But if they're not even testing rollback for BGP changes, what else aren't they testing?

...Also, lol, they realized in the middle of an upgrade that they didn't have enough network cable? Maybe add an extra action item: "4) count how much network cable you have before you start replacing core routers"