On Handling Production Failure
Lessons on handling production failures, drawn from a real-life story
Hello friend! 👋
Basma here. Thank you for reading An Engineer's Echo, your weekly publication of stories to equip you with the soft and hard skills to fast-track your growth in software engineering.
Reading Time: ~5 mins.
I keep stressing how important it is to communicate and be open to listening to others at work, as well as helping out whenever possible.
But I failed to live up to that at work yesterday. :( It was all because of a miscommunication on my part, which caused a two-hour delay in fixing a production issue and left another team pretty frustrated with me.
I think it's a common mistake that could happen to anyone, and there's a good lesson in it for all of us.
So, here are the main things to take away from this:
Why it's crucial to handle production problems with care.
How to manage communication between different teams.
Why you should be mindful of your biases.
What happened?
So, here's the deal: another team tagged me in an incident chat about a production bug in a system my team doesn't work on.
They were hinting that the problem might be linked to a PR I had recently deployed, since the error logs seemed to line up with the time I deployed it.
Now, let's rewind to that day. I had one day to wrap up a big project I'd been tackling. This project was a big deal for me: it was the first large-scale one I'd been trusted to scope on the team, and I was eager to prove my seniority.
So, back to the production incident. I read their message, noticed which team it was coming from, and which system they were talking about (I had a hunch at that point, but we'll come back to that later).
Normally, I help whenever someone asks. I try to give as much time and help as I can until we sort out the problem.
But this time, I did something different (without really thinking it through):
I skimmed through my PR again to refresh my memory on what it was all about. Didn't spot any obvious bugs in the logic at first glance.
Then, I shot them a message on Slack saying, “I don't think my PR is the cause of your problem.”
Then I stepped away from Slack for the rest of the day to focus on that big project I mentioned earlier.
I genuinely believed it wasn't my issue, and figured maybe they'd jumped to conclusions about my PR causing the mess.
Two hours later, when I returned to Slack, I saw them mentioning me again, providing evidence that the timing of my deployment lined up with their issue.
Seeing this, a teammate suggested rolling back my PR to troubleshoot. Once it was rolled back, the problem was swiftly resolved. 😱
What’s the moral of that story?
In one line: “NEVER say it’s not your problem!” — even if you think so.
In more detail…
When it's about production, be extra careful and approachable:
The production environment is crucial for any company as it directly impacts its reputation. Therefore, leaders must ensure that the production environment is always healthy.
So, when you've been called to help with a production incident, whether in your team or another:
Own the problem.
Communicate effectively: Instead of saying "I don't think X is my code's fault," say "Let's roll back my code and see if it helps," or "I will test my assumption by writing a quick unit test to confirm whether the use case is a false positive," or anything useful that doesn't make you sound like you're avoiding responsibility.
Listen to your gut feeling: When I sent that message saying "It's not my PR," my gut gave me an uncomfortable feeling that I was probably doing something wrong (avoiding responsibility, not giving it enough time, etc.). I should have listened to that feeling before logging off Slack.
Ensure you don’t leave people hanging: If you need to leave, make it clear, and ensure someone else can cover for your absence.
Learn how to collaborate across teams: Other teams manage their own work, which you may not have context on. If you need that context, don't hesitate to ask for it. Don't shy away from what you don't know; make an effort to learn what's necessary.
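To make the "quick unit test" idea above a bit more concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `normalize_order` stands in for whatever function the PR changed, and the payloads stand in for inputs taken from the incident report. It is not the code from the incident described here, only an illustration of replaying a reported input against the changed code.

```python
import io
import unittest


def normalize_order(order: dict) -> dict:
    """Hypothetical stand-in for the function the PR changed."""
    normalized = dict(order)
    # Assumed behavior of the PR: default a missing quantity to 1.
    normalized.setdefault("quantity", 1)
    return normalized


class ReportedIncidentTest(unittest.TestCase):
    """Replays the input the other team reported (made up for this sketch)."""

    def test_reported_payload_is_left_intact(self):
        reported = {"id": "order-1", "quantity": 3}
        self.assertEqual(normalize_order(reported), reported)

    def test_missing_quantity_gets_default(self):
        self.assertEqual(normalize_order({"id": "order-2"})["quantity"], 1)


def pr_reproduces_issue() -> bool:
    """Run the reproduction tests; True means the PR is a likely culprit."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ReportedIncidentTest)
    # Capture the runner's report so the caller only sees the verdict.
    result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
    return not result.wasSuccessful()
```

A failing run points at the PR; a passing run is only evidence, not proof, so offering a rollback alongside the test is still worth doing.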
Beware of your biases:
Remember when I mentioned ‘I had a hunch’ when I read the team's name?
That ‘hunch’ was a negative bias. I don’t particularly favor what that team does because they're responsible for a product in the company that goes against my beliefs and values—something I consider ‘immoral’.
So, there was a psychological barrier I faced (which I wasn't aware of at the time of the incident) that greatly influenced my unhelpful behavior, completely unlike my usual instinct of trying to help people first.
I won't tell you to ‘not have biases.’ I believe that's unrealistic and inhuman. What I would advise (and tell myself) is to ensure that when you detect these biases, you don't let them affect others.
In other words, what I now realize I should have done in that situation is tell the team that I couldn't assist at the moment because of other pressing tasks, and suggest tagging someone else from my team (such as the person who reviewed the PR in question) who might be able to help.
This story wasn't easy for me to share. But reflecting on our mistakes is key to learning and understanding the impact of our actions.
Next week, I'll discuss the best ways to handle negative or constructive feedback.
Additionally, I'm considering adding a new track to 'An Engineer’s Echo' focusing on mental health, psychology, and self-improvement. Some people have expressed interest in this topic, so feel free to join the conversation and let me know what topics you'd like to explore.
Thank you for reading! I hope you enjoyed it. Let me know by hitting the like button ❤️, which helps others find it on Substack. Share it to spread the love!
Great articles you don’t want to miss:
Design a Live Video Streaming Platform
The 5-5-5 of Active Listening
How to Make Your Code Changes Easier to Review?
That’s it, folks!
Speak next week, or if you’re eager, chat with me at any time (I always reply)!
— Basma
Rollbacks are a great way to quickly test if something is caused by a PR or not.
In a similar situation, I'd have created a test to reproduce the alleged issue with my PR. A test is a win-win: if it fails, I know it was me; if it doesn't, I have evidence that it wasn't.