It’s 2:47 PM on a Tuesday. You just ran that script. The one you’ve run a hundred times. Except this time, something’s different.

The monitoring dashboard turns red. Slack notifications start cascading. Your phone buzzes. Then buzzes again. Your manager’s face appears at your desk, or worse, their name pops up in a video call request.

You broke production.

Your stomach drops. Your hands are shaking. You’re mentally calculating how quickly you can update your resume and whether LinkedIn still shows you as “Open to Work” from that time you forgot to turn it off.

Here’s the thing nobody tells you: this moment will happen. If you work in IT long enough—whether you’re on the help desk working toward sysadmin, a DevOps engineer, or a senior architect—you will eventually break something important. The only people who never break production are people who never touch production.

This guide is for that moment. The one happening right now, or the one that’s coming.

Why This Happens to Everyone

Let’s get something straight: breaking production doesn’t mean you’re bad at your job. It usually means the opposite.

The people who cause outages are typically the ones doing real work. They’re deploying code, updating configurations, maintaining infrastructure. The person who never breaks anything is usually the person who never ships anything.

Every major tech company has war stories. Amazon’s S3 outage in 2017 took down half the internet because of a typo in a command. GitLab accidentally deleted a production database and spent hours recovering. Cloudflare’s outage in 2019 was caused by a single regular expression. These weren’t junior engineers making rookie mistakes—these were experienced teams at companies built on reliability.

The difference between these incidents and career-ending disasters isn’t about preventing all mistakes. It’s about how you respond when they happen.

The First 15 Minutes: Stop, Breathe, Act

When you realize you’ve broken something, your brain floods with cortisol and adrenaline. Fight-or-flight kicks in. This is exactly when you need to override your instincts.

Don’t Make It Worse

Your first impulse will be to fix it. To run another command, push another change, do something to undo what just happened. Resist this urge.

The second-worst thing you can do is break production. The worst thing is breaking it twice while trying to fix the first break. Every “quick fix” you attempt without fully understanding what went wrong is another roll of the dice.

Take 30 seconds. Literally count to thirty. Look at what you just did. Look at what’s happening now. Make sure you understand the connection before you touch anything else.

Communicate Immediately

This is where most people fail. The instinct is to hide, to fix it quietly before anyone notices, to pretend it didn’t happen. This never works, and it transforms a technical incident into a trust incident.

The message is simple: “I think I just caused [specific thing]. I’m investigating. More details in [X minutes].”

Send it to your team channel. Send it to your manager if they’re not already in the channel. If there’s an incident channel, use it. Don’t explain yourself. Don’t apologize profusely. Just state what’s happening.

You can send this message even if you’re not 100% sure you caused it. “I ran [command] and [system] went down immediately after. Investigating connection.” That’s honest, helpful, and doesn’t waste time.

Get Help

If you’re the most senior person available, skip this step and start diagnosing. But if there’s someone who knows the system better than you, pull them in immediately.

This isn’t weakness. This is the kind of professional judgment that separates people who have long careers from people who don’t. The ego cost of asking for help is nothing compared to the cost of extended downtime while you fumble alone.

Incident Response: The Mechanics

Once the initial shock passes and you’ve communicated, it’s time to actually fix things. Your organization might have formal incident response procedures. If so, follow them. If not, here’s a framework.

Identify the Blast Radius

What’s actually broken? Not what you’re afraid might be broken—what’s demonstrably not working right now?

Check monitoring. Check logs. Check user reports. Make a list of actual symptoms, not guesses about root causes.

Sometimes you think you broke everything, but you actually broke one thing that’s very visible. Sometimes you think you broke one thing, but the cascading failures are still propagating. You need to know which situation you’re in.
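
If you live in a terminal, a few quick checks can turn “I think everything is down” into a concrete symptom list. A minimal sketch, assuming a systemd service and a Kubernetes cluster; the service name and log path are placeholders, not anything from this guide:

    # Rough blast-radius triage. Names are hypothetical -- substitute what your environment runs.
    journalctl -u api.service --since "15 minutes ago" | grep -ci error   # error volume since the change
    kubectl get pods --all-namespaces | grep -v Running                   # anything not currently healthy
    tail -n 50 /var/log/nginx/error.log                                   # recent errors at the edge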

Rollback vs. Roll Forward

You have two options: undo what you did (rollback) or fix the problem caused by what you did (roll forward).

Rollback when:

  • You have a clear, tested rollback procedure
  • The change was discrete and reversible
  • You’re confident rolling back won’t cause additional problems
  • Time pressure is high and rollback is faster

Roll forward when:

  • Rollback would be as risky as fixing
  • The change included data migrations that can’t be easily undone
  • You’ve already identified the fix and it’s quick
  • Rolling back would lose legitimate changes

If you’re not sure, bias toward rollback. It’s usually safer to get back to a known-good state and then make changes carefully than to push forward under pressure.
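
What rollback actually looks like depends on your stack, but on many platforms it is a small, well-trodden set of commands. A minimal sketch, assuming a Kubernetes deployment named api (a placeholder, not something from this guide):

    kubectl rollout history deployment/api   # confirm which revision is currently live
    kubectl rollout undo deployment/api      # return to the previous revision
    kubectl rollout status deployment/api    # watch until the rollback completes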

Document As You Go

Your future self—and everyone doing the postmortem—will thank you for this. Keep a running log in the incident channel:

  • 2:47 PM - Ran database migration script
  • 2:48 PM - API errors spike in monitoring
  • 2:52 PM - Confirmed migration caused table lock
  • 2:55 PM - Attempting to kill blocking query
  • 3:01 PM - Query killed, errors dropping

This timeline is gold for understanding what happened. It also protects you because it shows you acted professionally and communicated throughout.

Good IT documentation habits pay off most during incidents.

Talking to Your Manager

At some point during or after the incident, you’ll have a conversation with your manager. How this goes depends partly on your manager and partly on you.

What to Say

Lead with facts, not emotions. Not “I’m so sorry, I feel terrible, this is all my fault.” Instead: “I ran X command, which caused Y to happen. We’re currently doing Z to resolve it.”

Your manager probably doesn’t need to hear how bad you feel. They need to know what happened, what’s being done, and whether they need to escalate or communicate to other stakeholders.

After the facts, take ownership without being melodramatic. “This was my change, I should have [specific thing I’ll do differently next time].” That’s different from “I’m the worst engineer ever and I completely failed the team.”

What They’re Actually Thinking

Unless you have a terrible manager—and yes, some managers don’t understand tech—they’re probably not thinking about firing you. They’re thinking about:

  • How do we fix this right now?
  • Who else do I need to loop in?
  • How do I communicate this to my boss?
  • What do we need to change so this doesn’t happen again?

Notice that “punish the person who made the mistake” isn’t on this list. Good managers know that incidents are system failures, not personal failures. If one person’s mistake can bring down production, that’s a process problem.

This doesn’t mean no consequences ever. If you’re careless repeatedly, ignore procedures consistently, or refuse to learn from mistakes, that’s a different conversation. But a single production incident, handled professionally, rarely damages careers at healthy companies.

Red Flags in Manager Responses

Pay attention to how your manager handles this. Their response tells you a lot about whether this is a place you want to stay.

Green flags:

  • Focus on fixing the problem first, discussing later
  • Asking what the team can do to prevent this
  • Treating it as a learning opportunity
  • No yelling, no blame in public channels

Red flags:

  • Public humiliation in Slack or meetings
  • “How could you let this happen?”
  • Immediate threats about your job
  • Blaming you while ignoring systemic issues

If your manager’s response is full of red flags, start thinking about your exit strategy. A blame culture makes everything worse, not just incidents.

The Postmortem: Where Learning Happens

Within a few days of the incident, there should be a postmortem. If your organization doesn’t do these, suggest starting them. If they refuse, that tells you something about the culture.

Blameless, Not Shameless

A good postmortem is blameless—it focuses on systems and processes, not individual fault. But blameless doesn’t mean shameless. You should still own your part in what happened.

“I ran the migration script without checking the table lock behavior” is appropriate. “Someone ran a script” is not—it hides information. “John completely screwed up the database” is also not appropriate—it assigns blame rather than analyzing systems.

The goal is to identify the chain of events factually, then ask: what could we change so this outcome is harder to achieve?

Action Items That Actually Prevent Things

Bad postmortems produce action items like “be more careful” or “don’t make mistakes.” These are useless.

Good action items are specific and systemic:

  • Add a dry-run flag to the migration script
  • Implement a canary deployment process
  • Create a runbook for rolling back database changes
  • Add a monitoring alert when table locks exceed 30 seconds
  • Require two-person review for production database changes

Notice these aren’t about individuals trying harder. They’re about making the dangerous thing harder to do accidentally.
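
To make one of these concrete: the alert on long-held table locks can start life as a single query your monitoring runs on a schedule. A hedged sketch for PostgreSQL, with the threshold and connection details as assumptions rather than anything prescribed here:

    # Lists sessions that have been waiting on a lock for more than 30 seconds.
    psql -X -c "
      SELECT pid, now() - query_start AS waiting_for, left(query, 60) AS query
      FROM pg_stat_activity
      WHERE wait_event_type = 'Lock'
        AND now() - query_start > interval '30 seconds';
    "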

If your postmortem only produces “be more careful” action items, push back. The same conditions that caused this incident will cause the next one.

The Emotional Aftermath

Okay, the technical stuff is handled. The postmortem is done. But you’re still feeling terrible.

This is normal. Even when the incident is resolved, even when nobody’s angry at you, even when you intellectually know that mistakes happen—you might still feel like garbage. For days. Maybe weeks. Work-life balance in IT is already hard without adding “I broke production” to your mental load.

Impostor Syndrome Spike

Breaking production often triggers massive impostor syndrome. “Real engineers don’t break things. I don’t belong here. They’re going to figure out I’m a fraud.”

This is your brain being unhelpful. The reality is that every engineer, including the ones you admire most, has stories of things they broke. The difference is that experienced people have made peace with this reality and learned to manage the anxiety.

If impostor syndrome hits hard after an incident, talk to someone who’s been in the industry a while. Ask them about their worst outage. You’ll find that everyone has one, and most people are willing to share.

Returning to Normal Work

You have to keep working. You have to run commands, push code, make changes. And for a while, every action will feel terrifying.

This passes. The anxiety diminishes with time. But you can help it along by:

  • Starting with low-risk tasks to rebuild confidence
  • Using extra verification steps until trust returns (even if not strictly required)
  • Pairing with someone on the first few production changes
  • Remembering that the precautions implemented after the postmortem make the system safer than before

If the anxiety doesn’t pass—if months later you’re still paralyzed by fear of making changes—that’s worth talking to someone about. It’s not a career problem; it’s a mental health concern that happens to be affecting work.

Building a Career That Survives Failure

The engineers who build long, successful careers aren’t the ones who never fail. They’re the ones who fail well—who respond professionally, learn systematically, and help improve the systems around them.

The Incident Response Resume Booster

Here’s something counterintuitive: handling a production incident well can actually help your career.

When you’re interviewing for senior roles, a question like “Tell me about a time you broke production” is a softball. You get to demonstrate calm under pressure, clear communication during a crisis, ownership without defensiveness, and the ability to turn a failure into a systemic improvement.

An engineer who’s handled real incidents is often a better hire than one who’s never faced a crisis. The former has been tested. The latter is an unknown.

Skills That Reduce Impact

While you can’t prevent all incidents, you can build skills that reduce their frequency and severity.

Understanding your systems deeply means fewer surprises. You know how things connect, what depends on what, where the risky operations live. This comes from deliberately investing in learning, not just doing your assigned work.

Scripting and automation skills let you create tools with safety rails. A well-written bash script can have confirmation prompts, dry-run modes, and rollback procedures built in. Manual commands offer no such protection.
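
A minimal sketch of those safety rails, assuming a hypothetical wrapper around a migration file; the file name, variable names, and prompt wording are illustrative, not a standard tool:

    #!/usr/bin/env bash
    # Hypothetical migration wrapper: supports --dry-run and refuses to run
    # against the target database without an explicit confirmation.
    set -euo pipefail

    MIGRATION_FILE="migrate.sql"          # placeholder path
    DB_URL="${DB_URL:?set DB_URL first}"  # fail loudly if no target is set

    if [[ "${1:-}" == "--dry-run" ]]; then
        echo "[dry-run] Would apply ${MIGRATION_FILE} to ${DB_URL}. No changes made."
        exit 0
    fi

    read -rp "Apply ${MIGRATION_FILE} to ${DB_URL}? Type 'yes' to continue: " answer
    if [[ "${answer}" != "yes" ]]; then
        echo "Aborted. Nothing was changed."
        exit 1
    fi

    psql "${DB_URL}" -f "${MIGRATION_FILE}"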

Practicing these skills outside of crisis moments—in a home lab, or using platforms like Shell Samurai for hands-on command-line practice—builds the muscle memory and confidence that lets you act effectively under pressure.

Git and version control fluency means you can always roll back. If you’re a sysadmin who’s been avoiding Git, incidents like this are why you shouldn’t.
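
In practice, “you can always roll back” often reduces to a couple of commands. A minimal sketch; the commit hash and branch name are placeholders:

    git log --oneline -5            # find the commit that introduced the bad change
    git revert --no-edit abc1234    # create a new commit that undoes it, without rewriting history
    git push origin main            # ship the revert through the same pipeline as any other change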

Monitoring and observability knowledge means you catch problems faster and understand blast radius immediately. The time between “something’s wrong” and “I know exactly what’s wrong” is where damage happens.

Creating a Culture That Handles Failure Well

If you’re in a position to influence culture—as a team lead, manager, or even just a respected voice on your team—you can help create an environment where incidents are handled well.

Share your own failure stories. Normalize talking about things that went wrong. When someone else breaks something, model the response you’d want: focus on fixing, not blaming. Push for systemic fixes in postmortems rather than “try harder” action items.

The goal is a team where people report problems immediately because they know they’ll get help, not punishment. Where postmortems are learning opportunities rather than tribunals. Where the systems improve after each incident instead of just adding warning labels.

This kind of culture makes incidents less frequent (because people raise concerns early), less severe (because issues get reported immediately), and less damaging to careers (because people aren’t destroyed by normal human error).

When It’s Actually Your Fault

Let’s be honest: sometimes incidents aren’t just bad luck. Sometimes you really did screw up in a preventable way.

Maybe you ran a production command without testing it first. Maybe you ignored the change management process because it was slow. Maybe you were distracted, or tired, or overconfident.

This doesn’t change most of the advice above. You still communicate immediately, fix the problem, participate constructively in the postmortem. But your internal processing needs to include genuine accountability.

“I should have tested that first” is different from “testing is too slow.” One leads to actually testing next time. The other leads to repeating the mistake.

The postmortem action items should still be systemic—making it harder for anyone to make the same mistake—but you should also make personal commitments. Not as self-flagellation, but as actual plans you’ll follow.

If you find yourself repeatedly in this category—incidents that happen because you cut corners or ignore processes—that’s worth examining. Are you burnt out? Are the processes genuinely broken? Are you in the wrong role? Sometimes a pattern of “preventable” incidents is a sign of something deeper that needs addressing.

The on-call stress guide and burnout recovery resources might be relevant here. Tired people make more mistakes. Stressed people cut more corners. Fixing the root cause often fixes the incidents.

Two Ways Teams Handle Incidents

You’ve probably experienced that moment where a production issue hits and everyone starts pointing fingers. That energy gets spent on blame instead of recovery.

Compare that to teams where the first response is “how do we fix this?” followed by “how do we prevent this?” No drama. No raised voices. Just engineering.

The difference isn’t that the second team is better at avoiding mistakes. It’s that they’ve decided mistakes are a normal part of complex systems, and their job is to build systems that are resilient to human error rather than demanding superhuman perfection.

If you’re lucky enough to be on a team like the second one, appreciate it. If you’re on a team like the first one, you have a choice: try to change the culture, or find somewhere healthier.

FAQ

How long should I wait before running another production command?

As long as you need to feel genuinely confident, not as punishment. For some people, that’s the next day after the postmortem. For others, it’s a few weeks of working on low-risk tasks first. There’s no universal timeline, but rushing back to prove something often backfires.

Should I mention production incidents in job interviews?

Often, yes. “Tell me about a time something went wrong” is a common interview question, and a well-handled production incident is a great story. Focus on what you learned and what changed afterward, not on defensiveness or blame. The STAR method works well for structuring these answers.

What if my manager wants to fire me after an incident?

First, understand whether this is standard at your company or an unusual response. If single incidents routinely end careers there, that’s a cultural problem—start job searching regardless of immediate outcome. If this seems unusual, ask to understand their specific concerns and what, if anything, would address them. Sometimes managers react in the heat of the moment and calm down. Sometimes the relationship is genuinely damaged. Either way, knowing where you stand helps you make decisions.

How do I rebuild trust with my team after a major incident?

Time and consistency. Show up, do good work, handle subsequent challenges well. Don’t constantly apologize or bring it up—that makes it awkward for everyone. Do demonstrate that you’ve incorporated the lessons. If the postmortem action item was to test things first, visibly test things first. Trust rebuilds through behavior, not words.

What’s the difference between a healthy postmortem and a blame session?

Healthy postmortems focus on the future: “What do we change so this is less likely or less impactful?” Blame sessions focus on the past: “Whose fault was this and how do we make sure they feel bad about it?” If your postmortems feel like interrogations, something is wrong with the culture, not the postmortem format.

Final Thoughts

You broke production. It’s done. The monitoring is back to green, the postmortem is filed, and everyone has moved on except the voice in your head.

Here’s what you need to hear: this doesn’t define you. Not your skills, not your career, not your worth as an engineer.

What defines you is what happens next. Do you learn from this? Do you help fix the systems that made it possible? Do you handle the next incident—there will be a next incident—with more confidence and competence?

The best engineers I know have stories about the time they brought down production. They tell these stories with a mix of “ugh, that sucked” and “here’s what I learned.” That’s where you’re heading.

The goal isn’t to never fail. It’s to fail well, recover quickly, and build systems (technical and personal) that turn failures into improvements.

You’ve got this.