Abstract: Modern software-based services are implemented as distributed systems with complex behavior and failure modes. Many large tech organizations are using experimentation to verify such systems' reliability. Netflix engineers call this approach chaos engineering. They've determined several principles underlying it and have used it to run experiments. This article is part of a theme issue on DevOps.
For me, the most interesting bit of the paper is this: rather than simply asking "is it up or down?", Netflix uses continuous-variable, time-dependent metrics to determine whether a test manipulation has affected system availability. For instance, they have a curve that predicts SPS (stream starts per second) over the course of any 24-hour day, based on past performance. An experiment can then be judged by how far the observed SPS departs from that predicted curve, not by a binary health check.
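To make the idea concrete, here is a minimal Python sketch of comparing an observed metric against a time-of-day baseline built from past days. This is an illustration only, not Netflix's actual method: the function names (`baseline_at`, `deviates`), the mean and standard deviation model, and the `tolerance` threshold are all my assumptions.

```python
from statistics import mean, stdev


def baseline_at(history, slot):
    """Return (mean, stdev) of the metric at one time-of-day slot.

    `history` is a list of past days; each day is a list of metric values
    (for example SPS), one per time slot. Hypothetical data layout.
    """
    samples = [day[slot] for day in history]
    if len(samples) < 2:
        raise ValueError("need at least two past days to estimate spread")
    return mean(samples), stdev(samples)


def deviates(observed, history, slot, tolerance=3.0):
    """True if `observed` is more than `tolerance` standard deviations
    away from the historical mean for that slot (assumed threshold rule)."""
    mu, sigma = baseline_at(history, slot)
    if sigma == 0:
        # No historical variation: any difference counts as a deviation.
        return observed != mu
    return abs(observed - mu) > tolerance * sigma


if __name__ == "__main__":
    # Three made-up past days with two time slots each.
    past_days = [[100, 200], [110, 210], [90, 190]]
    print(deviates(105, past_days, slot=0))  # within the expected band
    print(deviates(60, past_days, slot=0))   # far below the expected band
```

A real system would presumably use a richer forecast (day-of-week effects, trends, smoothing), but the shape is the same: the signal is a continuous measurement checked against an expected value for that time of day, not a yes/no availability flag.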