Troubleshooting
Shamelessly stolen from https://karlkfi.medium.com/devops-troubleshooting-df937724c6cf
But it needs more IMHO.
Original author: Karl Isenberg
Title: DevOps Troubleshooting
Jul 1, 2020
My Preamble
Keep a log of your activity, a journal. Even if the org you are with doesn't have a formal process, you should have a notepad open where you can record timestamps of activities and ideas, yours and others'. Capture outputs, and keep a record of the things you tried, any changes that were made, and other bits of information you collect. For example: "the log showed no commits in the last 4 hours" or "front end traffic patterns were not unusual".
The notes will come in handy 10 minutes later, when you've ruled out the first three ideas and can't remember what Joe said at the beginning of the call (nor can Joe :P).
This can also help latecomers to the issue get up to speed.
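A journal like this doesn't need tooling, a text file works fine. But if you want something scriptable, here is a minimal sketch of the idea. The class name IncidentJournal and its methods are made up for illustration; they are not from the original article or any library.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentJournal:
    """Append-only, timestamped notes for one incident (hypothetical helper)."""
    entries: list = field(default_factory=list)

    def note(self, author, text, when=None):
        # Record who said or did what, and when (UTC "now" unless a time is given).
        when = when or datetime.now(timezone.utc)
        self.entries.append((when, author, text))

    def render(self):
        # One line per entry, oldest first, so latecomers can read top to bottom.
        ordered = sorted(self.entries, key=lambda entry: entry[0])
        return "\n".join(
            f"{when:%H:%M:%S} [{author}] {text}" for when, author, text in ordered
        )
```

Usage could be as simple as calling journal.note("Joe", "the log showed no commits in the last 4 hours") during the call and pasting journal.render() into the channel when someone new joins.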
Start
Many technical people “just know” how to troubleshoot a technical issue, from experience, example, or trial and error, but many of those same highly technical people, when put on the spot, can’t necessarily tell you HOW they troubleshoot.
How do YOU troubleshoot?
Basic Troubleshooting Framework
The obvious answer is, “It depends,” but that’s not very satisfying, unless you can give a host of classes of problems and how to deal with them. Instead, let’s look at some high-level steps that describe how you might approach any technical problem.
- Identify the symptoms
  - error rates?
  - latency?
  - increased resource consumption: CPU, RAM, IO?
  - service up / down?
  - wrong request/response data?
  - too big/small request/response data?
- Gather and examine detailed information
- Hypothesize potential causes
- Verify hypotheses one by one, in order of most likely and/or simplest to fix
- Devise a plan to solve the problem
- Implement the solution plan
- Verify the issue was resolved
- Repeat as needed
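The "most likely and/or simplest to fix" ordering in the steps above can be made concrete. Here is a rough sketch, assuming you can put a guessed likelihood and a guessed checking cost on each idea. The Hypothesis fields and the likelihood-per-minute score are my own assumptions for illustration, not something the original article prescribes.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    description: str
    likelihood: float      # rough guess between 0 and 1 (assumed scale)
    minutes_to_check: int  # rough cost to verify or rule out


def verification_order(hypotheses):
    # Highest likelihood per minute of checking effort goes first,
    # so cheap, plausible ideas are tested before expensive long shots.
    return sorted(
        hypotheses,
        key=lambda h: h.likelihood / max(h.minutes_to_check, 1),
        reverse=True,
    )
```

The numbers will be gut feel in practice. The point is only to force a deliberate choice about what to check first, rather than chasing whichever idea was voiced last.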
Incident Management Objectives
The above steps apply to almost any issue, but what if you’re on-call for a production system? Generally, production readiness includes having an incident management process that people can be trained on and can follow, to ensure the following goals are met:
- Resolve an incident as quickly as possible (small MTTR)
This might be
- Ensure client satisfaction with support quality
This boils down to communicating early, clearly, concisely, truthfully, and often.
- Keep clients and stakeholders up to date on what to expect so they can plan accordingly
This boils down to communicating early, clearly, concisely, truthfully, and often.
- Follow up with analysis and action items to avoid this issue in the future
This might be
Incident Management Framework
One nice thing about a DevOps culture is that the people who are on-call for the production systems are also developers. Having a developer on-call can shorten the resolution time of an incident and sometimes provide a better, more informed solution, because they have more in-depth domain knowledge of the system. Because of this, the process for incident management might look a little different, with fewer scripted runbooks to follow, and more in-depth, iterative troubleshooting.
The following steps describe one way you can combine the Basic Troubleshooting Framework with additional process to meet the Incident Management Objectives.
- Identify the symptoms, scope, impact, and urgency
- Raise the alarm and ask for help
- Gather and examine detailed information (logs, metrics, errors, traces, current state)
- Assign a communications lead (to allow the problem solvers to focus) & communicate status regularly
- Establish a war room and/or video conference, if the severity merits it
- Hypothesize potential causes
- Try to reproduce & record the steps
- Verify hypotheses one by one (or in parallel, if in a group) in order of most likely and/or simplest to fix
- Stem the bleeding quickly, if possible
  For example, you might want to scale up a deployment right away to give yourself some headroom. This can be dangerous if it consumes resources and chokes other systems, so do it with care.
- Brainstorm more hypotheses, if needed
  I call this process scatter/gather. When people come up with ideas, the ideas may scatter all over the place. Before you can act, you must "gather" them and decide which idea to execute on first. Sometimes ideas come in faster than you can properly evaluate them, so prioritizing is important. Are some ideas easy to rule out? Are some ideas "intractable"? You might need to hold on to an idea and revisit it 10 minutes later, so don't lose those ideas. Take notes.
- Identify the chain of causality (root causes)
  This is often tightly linked with generating hypotheses.
- Devise and implement a plan to solve the problem
- Closely monitor the behaviour after a fix has been applied, to quickly detect if the problem resurfaces or side effects occur
- Document the symptoms, validated hypothesis, and changes made
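For the "closely monitor the behaviour" step, it helps to decide up front what "the problem resurfaced" means. Here is a minimal sketch, assuming you sample an error rate at regular intervals after the fix. The function name, the threshold, and the "three samples in a row" rule are illustrative assumptions; tune them to your own system.

```python
def regressed(error_rates, threshold, consecutive=3):
    """Return True if the error rate stays above `threshold` for
    `consecutive` samples in a row (a single spike does not count)."""
    streak = 0
    for rate in error_rates:
        # Extend the streak on a bad sample, reset it on a good one.
        streak = streak + 1 if rate > threshold else 0
        if streak >= consecutive:
            return True
    return False
```

Requiring several bad samples in a row is a simple way to avoid declaring a regression on one noisy data point. Side effects elsewhere (latency, resource consumption) would need their own checks.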
Postmortem
- Have a postmortem to share the knowledge and identify improvements to product and process
- Is there a metric we need to be monitoring that we weren't, one that might have given us some warning or indication that something was up, or about to go down?
Admit mistakes
- Admit mistakes and be honest, but avoid blame, shame, or punishment
Reward overtime
- Reward overtime so that being on-call isn’t just a chore
Praise participants
- Praise participants who played their part well, not just the one who discovered or solved the issue
Not Perfect, But Good Enough
No process is perfect, of course. This one was assembled from years of experience both problem solving and being on-call, but you may have more experience yourself. So feel free to adopt the above steps and modify as needed. If I missed something important or you think they’re out of order, feel free to drop me a note in the comments. I’ll be using this as a reference too, so I may update it as needed.
Cloud Guy. Anthos Solutions Architect at Google (opinions my own). X-Cruise, X-Mesosphere, & X-Pivotal.