Despite significant efforts in the field of Autonomic Computing, system operators will still play a critical role in administering Internet services for many years to come. However, very little is known about how system operators work, what tools they use, and how we can make them more efficient. In this paper we study the practices of operators at a large-scale Internet service, Amazon.com, and propose a new set of tools for operators. The first tool lets operators explore the health of system components and the dependencies between them; the second monitors the actions of operators and automatically suggests solutions to recurring problems.
Large-scale Internet services invest a significant amount of time and money to achieve high availability, typically in the form of labor-intensive monitoring and response teams. Despite the high human resource expenditures, software and hardware failures still occur, and the human-intensive approach to monitoring and troubleshooting scales poorly as the service’s complexity and workload grow [1]. We hypothesize that the right kind of visualization and automation can reduce the human effort required for monitoring and repairing failures, allow operators to recognize a recurring problem more quickly, and facilitate better knowledge transfer in quickly growing or high-turnover teams.
One of the authors spent three months working alongside the Amazon.com team responsible for real-time monitoring of hardware and software and for providing monitoring tools to the rest of the company. We collected quantitative data via interviews and surveys and analyzed the trouble ticket database, which contains information about each failure over the last few years. Based on our observations, we identify three challenges that make failures difficult to find and fix, and describe prototypes and an early evaluation of two tools designed to address these challenges.
Download pdf Advanced Tools for Operators at Amazon.com