All Eyes on You: Leading a Software Team Through Production Incidents
All eyes are on you. Production is down. What do you do?
Last Thursday, it happened again. Teams lights up. Alerts are firing. Users can't place policies, and the business team is already pinging you. Within minutes, everyone turns to you — the delivery lead — expecting answers. You don't have any. Not yet.
I've been leading development teams in a microservices environment for a while now, and if there's one thing I know for certain, it's that Production will break. Team members rotate. Tech debt piles up quietly. Some dependency you forgot existed decides to die at the worst possible time. But nothing — and I mean nothing — matches the pressure of a live incident affecting real users while the company watches.
Here's what I've learned from those moments.
Assess before you jump in
Your gut tells you to start debugging immediately. Fight it. First, understand what you're actually dealing with. Full outage or partial degradation? Which users are affected? Since when? Is there a workaround that can buy you an hour?
I've watched investigations go sideways for hours because nobody stopped to ask these basic questions. Everyone just dove into logs. Meanwhile, the actual root cause was sitting somewhere completely different.
Get the right people in the room
In microservices, especially when some of those services are shared with other teams, the root cause almost never lives in your team's domain. You own the API that's throwing errors, sure — but the problem might be an upstream feed, a shared database, or an infrastructure change nobody told you about.
You need to know your dependency map. You need to know who owns what. And you need to know how to get their attention on a Thursday afternoon — because a live outage is a terrible time to be figuring that out.
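None of that has to live in anyone's head. Even a small, version-controlled ownership map goes a long way. Here's a minimal sketch in Python to show the idea; every service name, team, and channel in it is a hypothetical placeholder, and the same thing works just as well as a YAML file or a wiki table, as long as someone keeps it current.

```python
# Hypothetical service ownership map. Names, teams, and channels are
# placeholders; keep the real one in version control next to the code.
SERVICES = {
    "policy-api": {
        "owner": "policy-team",
        "escalation_channel": "#policy-oncall",
        "depends_on": ["pricing-feed", "customer-db"],
    },
    "pricing-feed": {
        "owner": "data-platform",
        "escalation_channel": "#data-platform-oncall",
        "depends_on": [],
    },
    "customer-db": {
        "owner": "infra",
        "escalation_channel": "#infra-oncall",
        "depends_on": [],
    },
}


def who_to_page(service: str) -> list[tuple[str, str]]:
    """Return (owner, escalation channel) for a service and its direct dependencies."""
    entry = SERVICES[service]
    contacts = [(entry["owner"], entry["escalation_channel"])]
    for dep in entry["depends_on"]:
        dep_entry = SERVICES[dep]
        contacts.append((dep_entry["owner"], dep_entry["escalation_channel"]))
    return contacts


if __name__ == "__main__":
    # During an incident: one lookup tells you who to pull into the channel.
    for owner, channel in who_to_page("policy-api"):
        print(f"{owner}: {channel}")
```

The format matters far less than the habit: if the map is reviewed whenever a dependency changes, you're not reverse-engineering ownership while users are locked out.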
One channel, no exceptions
This one cost me real time before I learned it. The moment an investigation starts, set up a single dedicated channel. Everyone posts updates there. No side threads. No DMs. No one is debugging quietly in a corner.
Without this, you spend half the incident repeating context to every new person who joins and trying to piece together what's already been tried from three different conversations. I've seen 30-minute issues stretch to two hours purely because of this.
Update stakeholders even when there's nothing to update
Your business team doesn't need a solution every 30 minutes. They need to know someone competent is handling it. So even when you have nothing new, send a status update: "Still investigating. Three engineers engaged. No ETA yet, next update in 30 minutes."
It feels pointless, but silence is what kills you. Silence makes people anxious, and anxious people start escalating to the wrong person at the wrong time. Regular updates — even empty ones — buy you the breathing room to actually work the problem.
Escalate before you're comfortable with it
If there's no clear path to resolution after 30–60 minutes, escalate. I got this wrong early on. You want to be the person who solved it. You don't want to "bother" leadership. But sitting on a business-impacting issue for too long because you're hoping for a breakthrough is a much worse look than escalating early.
My rule: if you can't clearly articulate what you're going to try next, it's time to escalate. There's no room for ego when production is down.
Do the postmortem, even when you don't want to
Once things stabilize, the temptation is to close the laptop and pretend the day didn't happen. Don't. Run a proper postmortem. Not to blame anyone — but to understand what happened, why it wasn't caught earlier, and what you're going to do so it doesn't happen again.
Last Thursday's postmortem was a humbling one. After all the debugging and the escalations, the root cause was traced back to infrastructure settings that should have been updated back in December. I knew about them. I just never followed up to make sure the change was made. Two months later, that oversight took down Production. That's a tough thing to admit in front of your team and stakeholders, but that's exactly why postmortems matter. If I'd swept it under the rug or pointed fingers elsewhere, the same type of thing would happen again — and next time the trust wouldn't be there.
This is also where you earn back trust with stakeholders. Showing up with a clear root cause analysis and a concrete prevention plan turns what was a bad afternoon into proof that you take reliability seriously.
None of this will prevent things from breaking. But it gives you a framework so that when they do, you're not improvising under pressure. And if your team sees you handle it this way enough times, they'll start doing it too — which means the next time alerts fire on a Thursday, you won't be the only one who knows what to do.