SRE incident response: a troubleshooting template

Nhan Nguyen Duc
4 min read · Oct 9, 2022


What should you do when an incident happens? If you are a new team member, or this is the first time you have come face to face with an incident, this article is for you.

Let’s start!

How to identify the incident?

When I joined my company, I was confused by the alert system. When an alert fires, what does it mean? Is it really important, and do I need to contact my boss or someone else? What can I do? I think every newcomer has these questions, so I will show you how to answer them and identify the incident.

  • What is happening? Find out what is actually going wrong. Maybe the latency of some service has increased, your website cannot be reached, or your system is returning many 500 error responses. Check your system, and at the very least check whether it is still running.
  • Where did this alert come from? Your login service, a database, a proxy, or is the whole system down?
  • How important is it? This is the hardest part for a newcomer. If it is just a maintenance notification but you wake up your whole team at 02:00 AM, that is a disaster. Your judgment will depend on your experience, knowledge, and responsibilities, but here is a basic question you can lean on: is your service or website working? If the whole system is down, it is an emergency and you should certainly call your boss and colleagues. Prepare an emergency contact list in advance, or at least know who to ask.
  • Am I responsible for this alert? If you think confirming and reacting to incidents is only the SRE team’s responsibility, you are completely wrong. SREs are human too: they can’t be online 24/7, and they can make mistakes. So when you see an alert, don’t ignore it. Try to figure out what the alert means and what is happening, or at least ping or mention the SRE or other members related to that alert so they know that “something bad is happening”.
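The questions above can be written down as a small decision function. This is only a minimal sketch: the `Alert` fields, the `Action` names, and the rules themselves are hypothetical and should be replaced by whatever your team has agreed on.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PAGE_TEAM = "call your boss and colleagues now"
    NOTIFY_SRE = "ping the SRE or the service owner"
    NOTE_ONLY = "note it and follow up during working hours"


@dataclass
class Alert:
    source: str                  # where the alert came from, e.g. "login-service"
    whole_system_down: bool      # is the service/website unreachable?
    is_maintenance_notice: bool  # just a notification about maintenance?
    service_degraded: bool       # e.g. higher latency or many 500 responses


def triage(alert: Alert) -> Action:
    """Map the answers to the triage questions to a first reaction.

    The rules are illustrative assumptions, not a standard.
    """
    if alert.whole_system_down:
        return Action.PAGE_TEAM
    if alert.is_maintenance_notice:
        return Action.NOTE_ONLY
    # Degraded or unknown alerts are never ignored: make someone aware.
    return Action.NOTIFY_SRE
```

For example, `triage(Alert("proxy", True, False, False))` returns `Action.PAGE_TEAM`, while a plain maintenance notice returns `Action.NOTE_ONLY` and does not wake anyone up at 02:00 AM.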

If you are an SRE, incident identification is a must-have skill. If your whole system is down and you have to mitigate the impact, you must identify the problem as soon as possible. Some hints:

・If your whole system is down, check the Single Points of Failure (SPOF) first. It may be the proxy, the DB, …

・Check the critical monitoring metrics, such as CPU, memory, and throughput
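The two hints above can be combined into one quick check: look at SPOF components first, then flag any metric that is over its limit. The sketch below assumes you already have the metric values as a dictionary. The `SPOF` set, the `LIMITS` values, and the component names are hypothetical. Throughput is left out because it needs a baseline rather than a fixed limit.

```python
from typing import Dict, List, Tuple

# Hypothetical limits in percent; tune them for your own system.
LIMITS: Dict[str, float] = {"cpu": 90.0, "memory": 85.0}

# Hypothetical set of components that are single points of failure.
SPOF = {"proxy", "db"}


def check_order(components: List[str]) -> List[str]:
    """Put SPOF components first, keeping the original order otherwise."""
    # sorted() is stable, and False sorts before True.
    return sorted(components, key=lambda name: name not in SPOF)


def hot_metrics(
    samples: Dict[str, Dict[str, float]]
) -> List[Tuple[str, str, float]]:
    """Return (component, metric, value) for each metric at or above its limit."""
    findings = []
    for component in check_order(list(samples)):
        for metric, limit in LIMITS.items():
            value = samples[component].get(metric)
            if value is not None and value >= limit:
                findings.append((component, metric, value))
    return findings
```

In a real system the `samples` dictionary would come from your monitoring tool; here it is just plain data so the idea stays visible.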

Once you know what is happening, the next thing to do is mitigate the impact of the incident. At the very least, make your system RUN!

The immediate reaction is very important: it mitigates the impact of the incident and buys you more time to troubleshoot and fix it. For example, if CPU is exhausted and the system is freezing, add more servers and route new requests to them to keep the system running, instead of trying to find the root cause first. Furthermore, you are working as a team, so let your team members know what you are doing. In this situation, a good team leader will divide the work into smaller subtasks and use the whole team’s power to bring the system back as soon as possible.

Some hints to recover the system:

・If the incident is related to software, you can roll back to the most recent working version of the application

・If the incident is related to infrastructure, try upgrading your server specs or adding more servers to keep your system “alive”

・Set up an on-call rotation to monitor the system during the critical period
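For the first hint, “the most recent working version” can be found mechanically if you keep a release history. The sketch below assumes a hypothetical `Release` record with a `healthy` flag that someone on your team maintains. It only picks the rollback target; the rollback itself depends on your deployment tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Release:
    version: str
    healthy: bool  # True if this release ran without an incident


def rollback_target(history: List[Release]) -> Optional[str]:
    """Return the newest healthy release before the current one.

    `history` is ordered oldest to newest; the last entry is the
    release that is running right now (and is assumed to be broken).
    """
    for release in reversed(history[:-1]):
        if release.healthy:
            return release.version
    return None  # nothing safe to roll back to
```

Skipping unhealthy releases matters when the previous deployment was also bad: rolling back one step would only replace one incident with another.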

If you did the previous step well, you will have more time to troubleshoot the problem. It’s time to dive deep into your system, check the logs, find the root causes, and solve them. Debugging, or troubleshooting, is one of the most important skills of a software engineer, SREs included. In the first step you identified what is happening; we call that the “symptom”. From the symptoms, we look into the related services to find the root causes. Sometimes the root causes are complex: a change to one app may be fine on its own, but changes to many apps can conflict with each other and become the root cause of the incident.

To be honest, troubleshooting is very hard, especially for newcomers. To troubleshoot effectively, you need to know your company’s system and sometimes its business, and you need knowledge of the components and how they communicate, as well as the specs and limits of the infrastructure. But here are some hints to try:

・Think by elimination: brainstorm all the services and components that could be related to the incident, then check each one and exclude those that are fine

・Share your ideas with others: sometimes your idea is right, but you don’t have enough knowledge to confirm it. Share it with your colleagues; someone else may be able to help.
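The elimination method from the first hint can be sketched as a loop over candidates, each paired with a check that confirms or excludes it. The candidate names and the checks below are hypothetical placeholders. In practice a check might look at a log, a dashboard, or a health endpoint.

```python
from typing import Callable, Dict, List


def remaining_suspects(candidates: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each check and keep only the candidates that are not confirmed healthy.

    Each check returns True when the component looks fine. A check that
    raises is treated as "could not exclude", so the component stays a suspect.
    """
    suspects = []
    for name, looks_healthy in candidates.items():
        try:
            excluded = looks_healthy()
        except Exception:
            excluded = False
        if not excluded:
            suspects.append(name)
    return suspects


# Hypothetical checks for illustration only.
checks = {
    "proxy": lambda: True,
    "database": lambda: False,
    "login-service": lambda: True,
}
print(remaining_suspects(checks))
```

The short list of remaining suspects is also a good thing to share with colleagues, since it shows what you have already ruled out.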

After troubleshooting and finding the root causes, we can finally find a way to solve them. The solution varies a lot depending on the situation, so I won’t write more about it here. But don’t forget the postmortem!

Conclusion

Incident reaction and response is the responsibility not only of SREs but of all software engineers.

When an incident happens and you don’t know what to do, come back to this article.

Thanks for reading this far. If this article was helpful, let me know :’)
