Skip to content

Latest commit

 

History

History
105 lines (68 loc) · 3.03 KB

Week11-Mon.md

File metadata and controls

105 lines (68 loc) · 3.03 KB
layout title sched-activation
course
Building reliable systems (Week 11, Monday, March 24)
class="active"

Reliability for large-scale systems

Experiences of James Hamilton, then at Windows Live. ({{site.data.bibliography.hamilton2007.title}})

Three main principles (Bill Hoffman):

  • Expect failures
  • Keep things simple
  • Automate everything

Application design

Keep development, test, and operations staff close to each other

  • DevOps approach: "You built it, you support it"

Failure recovery must be

  • Simple
  • Tested frequently
  • Perhaps hard-fail systems to shut them down

Redundancy

  • Availability of four to five 9s require redundant systems
  • True redundancy means you can hard-fail any system without work loss

Support and deploy only one version

Multi-tenant your services

  • Share hardware cost across all services
  • Give up some system efficiency for operations efficiency

Allow rare human intervention in an emergency

  • Test the scripts with fire drills or Game Days

Admission control---block new work when system is overloaded

  • Do this at interfaces within the system as well

    • For example, allow users to read but not write

Shard the data

  • Arbitrarily adjustable (not just first letter of family name)
    • Use a hash or dictionary to map primary keys to shards
  • Not bound to any real-world object (not "all employees of organization X")

Analyze throughput and latency

  • With background tasks running (garbage collection, etc.)
  • Predict and measure access patterns (especially to databases, but also network and other components)
  • Every call to a subservice should have a timeout
    • Bound the number of restarts
    • Log restarts
  • Dependent (lower-level) servics should commit to the same overall SLAs as their callers

Automatic management

Make your secondary backups synchronous

  • The write is considered complete when both copies are written
    • Cost in longer latency and lower throughput
  • You can instantly route requests to the backup when the primary fails

Build in recovery at higher levels, not lower

  • The service should have redundant messages
    • Don't conceal it lower, at the network protocol
    • TCP doesn't guarantee reliability across sessions
  • Fail fast

Fail services regularly---see next day's class

Operations

Never delete anything

  • Just mark it deleted
  • Make everything configurable ("Feature flags")

The "Big Red Switch"

  • Turns off all non-essential services
  • Use it when the system is melting down (or about to)
  • Resetting the switch should restart stopped services automatically

Control admission into the service

Meter admission into the service when coming up after a failure

Guide to reading for next class

Read all of {{site.data.bibliography.tseitlin2013.title}}.

This article describes Netflix's approach to automatically generating failures of system components. They have created a powerful culture of stress-testing their systems and learning from any failures.