Troubleshooting

Shamelessly stolen from https://karlkfi.medium.com/devops-troubleshooting-df937724c6cf

But it needs more IMHO.

Original author: Karl Isenberg

Title: DevOps Troubleshooting

Jul 1, 2020

My Preamble

Keep a log of your activity, a journal. Even if the org you are with doesn't have a formal process, you should have a notepad open where you can record timestamps of activities and ideas, yours and others'. Capture outputs, and keep a record of things you tried, changes that were made, and any other bits of information you collect. For example: "the log showed no commits in the last 4 hours", "front end traffic patterns were not unusual".

The notes will come in handy 10 minutes later when you've ruled out the first three ideas and you can't remember what Joe said at the beginning of the call (nor can Joe :P).

This can also help latecomers to the issue get up to speed.
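
A journal like this can be as simple as a text file, but the idea is easy to sketch in code. The following is a minimal illustration only, assuming an in-memory list; the names `Entry`, `IncidentJournal`, `note`, `by_kind`, and `summary` are hypothetical and not part of any tool mentioned here.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Entry:
    timestamp: datetime
    author: str
    kind: str  # hypothetical labels, e.g. "observation", "idea", "action", "change"
    text: str


@dataclass
class IncidentJournal:
    entries: list = field(default_factory=list)

    def note(self, author, kind, text, when=None):
        """Append a timestamped entry; defaults to the current UTC time."""
        entry = Entry(when or datetime.now(timezone.utc), author, kind, text)
        self.entries.append(entry)
        return entry

    def by_kind(self, kind):
        """Return only the entries of one kind, e.g. everything that was tried."""
        return [e for e in self.entries if e.kind == kind]

    def summary(self):
        """Render a chronological catch-up for people who join late."""
        ordered = sorted(self.entries, key=lambda e: e.timestamp)
        return "\n".join(
            f"{e.timestamp:%H:%M:%S} [{e.kind}] {e.author}: {e.text}" for e in ordered
        )
```

Calling `journal.note("me", "observation", "the log showed no commits in the last 4 hours")` records the entry with the current time, and `journal.summary()` gives latecomers the timeline in order.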

Start

Many technical people “just know” how to troubleshoot a technical issue, from experience, example, or trial and error, but many of those same highly technical people, when put on the spot, can’t necessarily tell you HOW they troubleshoot.

How do YOU troubleshoot?

Basic Troubleshooting Framework

The obvious answer is, “It depends,” but that’s not very satisfying, unless you can give a host of classes of problems and how to deal with them. Instead, let’s look at some high-level steps that describe how you might approach any technical problem.

Identify the symptoms
Gather and examine detailed information
Hypothesize potential causes
Verify hypotheses one by one, in order of most likely and/or simplest to fix
Devise a plan to solve the problem
Implement the solution plan
Verify the issue was resolved
Repeat as needed
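
The "verify one by one, in order" step can be sketched as a sort followed by a loop. This is only one possible reading of "most likely and/or simplest": it assumes you can put a rough likelihood and a rough effort estimate on each hypothesis. The names `Hypothesis`, `verification_order`, and `first_confirmed` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    description: str
    likelihood: float      # rough guess between 0 and 1 (an assumption, not a measurement)
    minutes_to_check: int  # rough effort to confirm or rule out


def verification_order(hypotheses):
    """Most likely first; ties are broken by the cheapest check."""
    return sorted(hypotheses, key=lambda h: (-h.likelihood, h.minutes_to_check))


def first_confirmed(hypotheses, check):
    """Verify one by one; return the first hypothesis that `check` confirms, else None."""
    for hypothesis in verification_order(hypotheses):
        if check(hypothesis):
            return hypothesis
    return None
```

If `first_confirmed` returns `None`, that corresponds to "Repeat as needed": gather more information and come up with new hypotheses.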

Incident Management Objectives

The above steps apply to almost any issue, but what if you’re on-call for a production system? Generally, production readiness includes having an incident management process that people can be trained on and follow, to ensure the following goals are met:

Resolve an incident as quickly as possible (small MTTR)
Ensure client satisfaction with support quality
Keep clients and stakeholders up to date on what to expect so they can plan accordingly
Follow up with analysis and action items to avoid this issue in the future
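
To make the MTTR goal concrete, here is a minimal sketch of how it could be computed. It assumes MTTR means "mean time to resolve", measured from detection to resolution (the term is also expanded as "repair" or "recovery" elsewhere); the function name `mean_time_to_resolve` is hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents):
    """Average of (resolved - detected) over a list of (detected, resolved) pairs."""
    if not incidents:
        raise ValueError("need at least one resolved incident")
    durations = (resolved - detected for detected, resolved in incidents)
    return sum(durations, timedelta()) / len(incidents)
```

Tracking this number over time only helps if the detection and resolution timestamps are recorded consistently for every incident.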

Incident Management Framework

One nice thing about a DevOps culture is that the people who are on-call for the production systems are also developers. Having a developer on-call can shorten the resolution time of an incident and sometimes provide a better, more knowledgeable and informed solution, because they have more in-depth domain knowledge of the system. Because of this, the process for incident management might look a little different, with fewer scripted runbooks to follow, and more in-depth, iterative troubleshooting.

The following steps describe one way you can combine the Basic Troubleshooting Framework with additional process to meet the Incident Management Objectives.

Identify the symptoms, scope, impact, and urgency

Raise the alarm and ask for help

Gather and examine detailed information (logs, metrics, errors, traces, current state)

Assign a communications lead (to allow the problem solvers to focus) & communicate status regularly

Establish a war room and/or video conference, if the severity merits it

Hypothesize potential causes

Try to reproduce & record the steps

Verify hypotheses one by one (or in parallel, if in a group) in order of most likely and/or simplest to fix

Stem the bleeding quickly, if possible.

For example, you might want to scale up a deployment right away to give yourself some headroom. This might be dangerous if it consumes resources that choke other systems, so do this with care.
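
One way to apply that care is to cap the scale-up so it leaves a reserve for everything else. The sketch below is not tied to any orchestrator's API; it assumes you already know the free CPU in the cluster, and the function name, parameters, and millicore units are all hypothetical.

```python
def safe_replica_target(current, desired, millicpu_per_replica,
                        free_millicpu, reserve_millicpu):
    """Largest replica count <= desired that still leaves `reserve_millicpu`
    of the currently free CPU untouched for other systems."""
    if millicpu_per_replica <= 0:
        raise ValueError("millicpu_per_replica must be positive")
    if desired <= current:
        return desired  # scaling down or staying put needs no headroom check
    usable = max(0, free_millicpu - reserve_millicpu)
    extra_affordable = usable // millicpu_per_replica
    return min(desired, current + extra_affordable)
```

If the function returns fewer replicas than you asked for, that is a signal that the scale-up would eat into capacity other systems may need.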

Brainstorm more hypotheses, if needed

I call the process scatter/gather. When people come up with ideas, the ideas may scatter all over the place. Before you act, you must "gather" and decide which idea to execute on first. Sometimes ideas come in faster than you can properly evaluate them, so prioritizing ideas is important. Are some ideas easy to rule out? Are some ideas "intractable"? You might need to hold on to an idea and revisit it 10 minutes later. Don't lose those ideas. Take notes.
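
As a rough illustration of scatter/gather, the sketch below sorts incoming ideas into three buckets and hands them back in a fixed priority. The bucket policy (quick checks first, intractable ideas parked) is one assumption about how to prioritize, and the names `IdeaBoard`, `scatter`, and `gather` are hypothetical.

```python
from collections import deque


class IdeaBoard:
    def __init__(self):
        self.quick = deque()   # easy to rule out: check these first
        self.normal = deque()  # everything else, in arrival order
        self.parked = []       # intractable for now: kept so they are not lost

    def scatter(self, idea, easy_to_rule_out=False, intractable=False):
        """Record an idea as it comes in, without evaluating it yet."""
        if intractable:
            self.parked.append(idea)
        elif easy_to_rule_out:
            self.quick.append(idea)
        else:
            self.normal.append(idea)

    def gather(self):
        """Pick the next idea to act on; None means only parked ideas remain."""
        if self.quick:
            return self.quick.popleft()
        if self.normal:
            return self.normal.popleft()
        return None
```

When `gather()` returns `None`, it is time to revisit `parked` and see whether any of those ideas has become tractable.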

Identify the chain of causality (root causes)

This is often tightly linked with generating hypotheses.

Devise and implement a plan to solve the problem

Closely monitor the behaviour

... after a fix has been applied to quickly detect if the problem resurfaces or side effects occur.
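
A simple form of this monitoring is to watch one metric after the fix and flag a sustained breach. The sketch below assumes you sample an error rate at regular intervals; the threshold, the "three samples in a row" rule, and the name `problem_resurfaced` are illustrative assumptions, not a recommended alerting policy.

```python
def problem_resurfaced(error_rates, threshold, consecutive=3):
    """True if the error rate exceeds `threshold` for `consecutive` samples in a row."""
    run = 0
    for rate in error_rates:
        run = run + 1 if rate > threshold else 0
        if run >= consecutive:
            return True
    return False
```

Requiring several consecutive breaches avoids reacting to a single noisy sample, at the cost of detecting a real regression a little later.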

Document the symptoms, validated hypothesis, and changes made

Postmortem

Have a postmortem to share the knowledge and identify improvements to product and process

Admit mistakes

Admit mistakes and be honest, but avoid blame, shame, or punishment

Reward overtime

Reward overtime so that being on-call isn’t just a chore

Praise participants

Praise participants who played their part well, not just the one who discovered or solved the issue

Not Perfect, But Good Enough

No process is perfect, of course. This one was assembled from years of experience both problem solving and being on-call, but you may have more experience yourself. So feel free to adopt the above steps and modify as needed. If I missed something important or you think they’re out of order, feel free to drop me a note in the comments. I’ll be using this as a reference too, so I may update it as needed.

Cloud Guy. Anthos Solutions Architect at Google (opinions my own). X-Cruise, X-Mesosphere, & X-Pivotal.
