Ask HN: YouTube down?
From the postmortem: "The incident occurred while Google's network operations team was replacing the routers that link us-central1-c to Google's backbone that connects to the public internet. Google engineers paused the router replacement process after determining that additional cabling would be required to complete the process and decided to start a rollback operation. The rollout and rollback operations utilized a version of workflow that was only compatible with the newer routers. Specifically, rollback was not supported on the older routers."
And the postmortem action item is: "Fix the automated workflows for router replacements to ensure the correct version of workflows are utilized for both types of routers."
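That action item amounts to dispatching the right workflow version by router generation instead of assuming one version fits the whole fleet. A minimal sketch of that idea (all names and versions here are hypothetical, not from the postmortem):

```python
# Hypothetical sketch: register a workflow per router generation and refuse
# to run anything on a platform that has no registered workflow, rather than
# silently applying a workflow that only works on the newer routers.

WORKFLOWS = {
    "new": "replacement-workflow-v2",  # supports both rollout and rollback
    "old": "replacement-workflow-v1",  # legacy workflow for older routers
}

def select_workflow(router_generation: str) -> str:
    """Return the workflow compatible with this router generation."""
    try:
        return WORKFLOWS[router_generation]
    except KeyError:
        # Fail loudly before any change starts, instead of mid-maintenance.
        raise ValueError(f"no workflow registered for {router_generation!r}")
```

The design point is the explicit failure: an unregistered platform stops the maintenance up front instead of discovering the incompatibility after the rollout has begun.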
The action items should have been "1) make this work for these two routers, and 2) make sure no platform ever gets left out again." This shouldn't have happened, because testing both upgrade and rollback on all of your gear should be standard practice. Network gear vendors do this routinely before they ship new gear with upgrade instructions. Google could easily throw together end-to-end automated tests of upgrades and rollbacks, and refuse to perform maintenance until the tests pass.
The bigger postmortem question should be: why was the change allowed at all if the platform didn't support rollback? Additional action item: "3) don't allow changes on platforms that don't support rollback and haven't passed rollback tests."
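The gate proposed above could be as simple as running an upgrade-then-rollback round trip against every platform in the fleet before any maintenance window opens. A toy sketch, with all field names and values invented for illustration:

```python
# Hypothetical pre-maintenance gate: maintenance is allowed only if every
# platform in the fleet passes BOTH an upgrade test and a rollback test.

def upgrade_test(platform: dict) -> bool:
    # Placeholder: in reality this would drive the upgrade workflow in a lab.
    return platform.get("supports_upgrade", False)

def rollback_test(platform: dict) -> bool:
    # Placeholder: the older routers in this incident would return False.
    return platform.get("supports_rollback", False)

def maintenance_allowed(fleet: list[dict]) -> bool:
    """True only if upgrade AND rollback succeed on every platform."""
    return all(upgrade_test(p) and rollback_test(p) for p in fleet)

fleet = [
    {"name": "new-router", "supports_upgrade": True, "supports_rollback": True},
    {"name": "old-router", "supports_upgrade": True, "supports_rollback": False},
]

print(maintenance_allowed(fleet))  # → False: the old routers fail the gate
```

With a check like this wired into the change-approval process, the incompatible rollback would have blocked the change before it started rather than stranding the rollback mid-incident.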
Now, did they need to test rollback? Maybe they don't mind portions of Cloud SQL, Spanner, Storage, Bigtable, and App Engine being down for 41 minutes in one zone. But if they're not even testing rollback for BGP changes, what else aren't they testing?
...Also, lol, they realized in the middle of an upgrade that they didn't have enough network cable? Maybe add an extra action item: "4) count how much network cable you have before you start replacing core routers"