It's 2:47 PM on a Tuesday. You just ran that script. The one you've run a hundred times. Except this time, something's different.
The monitoring dashboard turns red. Slack notifications start cascading. Your phone buzzes. Then buzzes again. Your manager's face appears at your desk, or worse, their name pops up in a video call request.
You broke production.
Your stomach drops. Your hands are shaking. You're mentally calculating how quickly you can update your resume and whether LinkedIn still shows you as "Open to Work" from that time you forgot to turn it off.
Here's the thing nobody tells you: this moment will happen. If you work in IT long enough, whether you're working the help desk on your way to sysadmin, a DevOps engineer, or a senior architect, you will eventually break something important. The only people who never break production are people who never touch production.
This guide is for that moment. The one happening right now, or the one that's coming.
Why This Happens to Everyone
Let's get something straight: breaking production doesn't mean you're bad at your job. It usually means the opposite.
The people who cause outages are typically the ones doing real work. They're deploying code, updating configurations, maintaining infrastructure. The person who never breaks anything is usually the person who never ships anything.
Every major tech company has war stories. Amazon's S3 outage in 2017 took down half the internet because of a typo in a command. GitLab accidentally deleted a production database and spent hours recovering. Cloudflare's outage in 2019 was caused by a single regular expression. These weren't junior engineers making rookie mistakes; these were experienced teams at companies built on reliability.
The difference between these incidents and career-ending disasters isn't about preventing all mistakes. It's about how you respond when they happen.
The First 15 Minutes: Stop, Breathe, Act
When you realize you've broken something, your brain floods with cortisol and adrenaline. Fight-or-flight kicks in. This is exactly when you need to override your instincts.
Don't Make It Worse
Your first impulse will be to fix it. To run another command, push another change, do something to undo what just happened. Resist this urge.
The second-worst thing you can do is break production. The worst thing is breaking it twice while trying to fix the first break. Every "quick fix" you attempt without fully understanding what went wrong is another roll of the dice.
Take 30 seconds. Literally count to thirty. Look at what you just did. Look at whatâs happening now. Make sure you understand the connection before you touch anything else.
Communicate Immediately
This is where most people fail. The instinct is to hide, to fix it quietly before anyone notices, to pretend it didn't happen. This never works, and it transforms a technical incident into a trust incident.
The message is simple: "I think I just caused [specific thing]. I'm investigating. More details in [X minutes]."
Send it to your team channel. Send it to your manager if they're not already in the channel. If there's an incident channel, use it. Don't explain yourself. Don't apologize profusely. Just state what's happening.
You can send this message even if you're not 100% sure you caused it. "I ran [command] and [system] went down immediately after. Investigating connection." That's honest, helpful, and doesn't waste time.
Get Help
If you're the most senior person available, skip this step and start diagnosing. But if there's someone who knows the system better than you, pull them in immediately.
This isn't weakness. This is the professional communication that separates people who have long careers from people who don't. The ego cost of asking for help is nothing compared to the cost of extended downtime while you fumble alone.
Incident Response: The Mechanics
Once the initial shock passes and you've communicated, it's time to actually fix things. Your organization might have formal incident response procedures. If so, follow them. If not, here's a framework.
Identify the Blast Radius
What's actually broken? Not what you're afraid might be broken, but what's demonstrably not working right now.
Check monitoring. Check logs. Check user reports. Make a list of actual symptoms, not guesses about root causes.
Sometimes you think you broke everything, but you actually broke one thing that's very visible. Sometimes you think you broke one thing, but the cascading failures are still propagating. You need to know which situation you're in.
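If you work mostly from a terminal, the first pass might look something like the sketch below. The health endpoint, service name, and use of Kubernetes are assumptions; substitute whatever your environment actually exposes.

```bash
#!/usr/bin/env bash
# Quick blast-radius triage: a minimal sketch, not a runbook.
# The health URL, service unit, and Kubernetes usage are all assumptions.
set -uo pipefail

# Is the service answering at all? (hypothetical health endpoint)
curl -fsS -o /dev/null -w "api health: HTTP %{http_code}\n" "https://api.example.com/healthz" \
  || echo "api health: no response"

# How noisy are the logs right now? (service unit name is a placeholder)
echo "errors in the last 10 minutes:"
journalctl -u api.service --since "10 minutes ago" --no-pager | grep -ci "error" || true

# If you run on Kubernetes, recent events often show how far the failure has spread
kubectl get events --sort-by=.lastTimestamp | tail -n 20
```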
Rollback vs. Roll Forward
You have two options: undo what you did (rollback) or fix the problem caused by what you did (roll forward).
Rollback when:
- You have a clear, tested rollback procedure
- The change was discrete and reversible
- You're confident rolling back won't cause additional problems
- Time pressure is high and rollback is faster
Roll forward when:
- Rollback would be as risky as fixing
- The change included data migrations that can't be easily undone
- You've already identified the fix and it's quick
- Rolling back would lose legitimate changes
If you're not sure, bias toward rollback. It's usually safer to get back to a known-good state and then make changes carefully than to push forward under pressure.
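What a rollback actually looks like depends on how the change shipped. As one hedged illustration, assuming the change was a single git commit deployed from main by CI, or a Kubernetes Deployment rollout, it might be as small as this:

```bash
#!/usr/bin/env bash
# Rollback sketch. Assumes the bad change is a single git commit your pipeline
# deploys from main, and/or a Kubernetes Deployment named "api". Both are placeholders.
set -euo pipefail

BAD_COMMIT="abc1234"   # placeholder: the commit that caused the incident

# Option 1: revert the commit and let CI redeploy the known-good state
git revert --no-edit "$BAD_COMMIT"
git push origin main

# Option 2: if the change was a Kubernetes rollout, step back to the previous revision
kubectl rollout undo deployment/api
kubectl rollout status deployment/api --timeout=120s
```

The specific commands matter less than having a rollback path you understood before the incident started.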
Document As You Go
Your future self (and everyone doing the postmortem) will thank you for this. Keep a running log in the incident channel:
- 2:47 PM - Ran database migration script
- 2:48 PM - API errors spike in monitoring
- 2:52 PM - Confirmed migration caused table lock
- 2:55 PM - Attempting to kill blocking query
- 3:01 PM - Query killed, errors dropping
This timeline is gold for understanding what happened. It also protects you because it shows you acted professionally and communicated throughout.
Good IT documentation habits pay off most during incidents.
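If you're already at a shell, one low-effort way to keep that log honest is a tiny helper that timestamps each note into a local file you can paste into the channel afterward. The filename and format here are arbitrary choices:

```bash
# Minimal incident-note helper; filename and timestamp format are arbitrary.
note() {
    echo "$(date '+%I:%M %p') - $*" | tee -a "incident-$(date +%F).log"
}

note "Ran database migration script"
note "API errors spiking in monitoring"
```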
Talking to Your Manager
At some point during or after the incident, you'll have a conversation with your manager. How this goes depends partly on your manager and partly on you.
What to Say
Lead with facts, not emotions. Not "I'm so sorry, I feel terrible, this is all my fault." Instead: "I ran X command, which caused Y to happen. We're currently doing Z to resolve it."
Your manager probably doesn't need to hear how bad you feel. They need to know what happened, what's being done, and whether they need to escalate or communicate to other stakeholders.
After the facts, take ownership without being melodramatic. "This was my change; I should have [specific thing I'll do differently next time]." That's different from "I'm the worst engineer ever and I completely failed the team."
What They're Actually Thinking
Unless you have a terrible manager (and yes, some managers don't understand tech), they're probably not thinking about firing you. They're thinking about:
- How do we fix this right now?
- Who else do I need to loop in?
- How do I communicate this to my boss?
- What do we need to change so this doesnât happen again?
Notice that "punish the person who made the mistake" isn't on this list. Good managers know that incidents are system failures, not personal failures. If one person's mistake can bring down production, that's a process problem.
This doesn't mean no consequences ever. If you're careless repeatedly, ignore procedures consistently, or refuse to learn from mistakes, that's a different conversation. But a single production incident, handled professionally, rarely damages careers at healthy companies.
Red Flags in Manager Responses
Pay attention to how your manager handles this. Their response tells you a lot about whether this is a place you want to stay.
Green flags:
- Focus on fixing the problem first, discussing later
- Asking what the team can do to prevent this
- Treating it as a learning opportunity
- No yelling, no blame in public channels
Red flags:
- Public humiliation in Slack or meetings
- "How could you let this happen?"
- Immediate threats about your job
- Blaming you while ignoring systemic issues
If your manager's response is full of red flags, start thinking about your exit strategy. A blame culture makes everything worse, not just incidents.
The Postmortem: Where Learning Happens
Within a few days of the incident, there should be a postmortem. If your organization doesn't do these, suggest starting. If they refuse, that tells you something about the culture.
Blameless, Not Shameless
A good postmortem is blameless: it focuses on systems and processes, not individual fault. But blameless doesn't mean shameless. You should still own your part in what happened.
"I ran the migration script without checking the table lock behavior" is appropriate. "Someone ran a script" is not; it hides information. "John completely screwed up the database" is also not appropriate; it assigns blame rather than analyzing systems.
The goal is to identify the chain of events factually, then ask: what could we change so this outcome is harder to achieve?
Action Items That Actually Prevent Things
Bad postmortems produce action items like "be more careful" or "don't make mistakes." These are useless.
Good action items are specific and systemic:
- Add a dry-run flag to the migration script
- Implement a canary deployment process
- Create runbook for [rolling back database changes]
- Add a monitoring alert when table locks exceed 30 seconds (sketched below)
- Require two-person review for production database changes
Notice these aren't about individuals trying harder. They're about making the dangerous thing harder to do accidentally.
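To make one of these concrete, the table-lock alert could start life as nothing fancier than a scheduled query. This sketch assumes PostgreSQL and a DATABASE_URL environment variable; your database, threshold, and alerting pipeline will differ.

```bash
#!/usr/bin/env bash
# Sketch of the check behind a "table locks over 30 seconds" alert.
# Assumes PostgreSQL and a DATABASE_URL environment variable; wire the output
# into whatever alerting you already run.
set -euo pipefail

psql "$DATABASE_URL" --no-align --tuples-only <<'SQL'
SELECT pid,
       now() - query_start AS waiting_for,
       left(query, 60)     AS query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
  AND now() - query_start > interval '30 seconds';
SQL
```

If your postmortem only produces "be more careful" action items, push back. The same conditions that caused this incident will cause the next one.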
The Emotional Aftermath
Okay, the technical stuff is handled. The postmortem is done. But you're still feeling terrible.
This is normal. Even when the incident is resolved, even when nobody's angry at you, even when you intellectually know that mistakes happen, you might still feel like garbage. For days. Maybe weeks. Work-life balance in IT is already hard without adding "I broke production" to your mental load.
Impostor Syndrome Spike
Breaking production often triggers massive impostor syndrome. "Real engineers don't break things. I don't belong here. They're going to figure out I'm a fraud."
This is your brain being unhelpful. The reality is that every engineer, including the ones you admire most, has stories of things they broke. The difference is that experienced people have made peace with this reality and learned to manage the anxiety.
If impostor syndrome hits hard after an incident, talk to someone who's been in the industry a while. Ask them about their worst outage. You'll find that everyone has one, and most people are willing to share.
Returning to Normal Work
You have to keep working. You have to run commands, push code, make changes. And for a while, every action will feel terrifying.
This passes. The anxiety diminishes with time. But you can help it along by:
- Starting with low-risk tasks to rebuild confidence
- Using extra verification steps until trust returns (even if not strictly required)
- Pairing with someone on the first few production changes
- Remembering that the precautions implemented after the postmortem make the system safer than before
If the anxiety doesn't pass, if months later you're still paralyzed by fear of making changes, that's worth talking to someone about. It's not a career problem; it's a mental health concern that happens to be affecting work.
Building a Career That Survives Failure
The engineers who build long, successful careers aren't the ones who never fail. They're the ones who fail well: they respond professionally, learn systematically, and help improve the systems around them.
The Incident Response Resume Booster
Here's something counterintuitive: handling a production incident well can actually help your career.
When you're interviewing for senior roles, a question like "Tell me about a time you broke production" is a softball. You get to demonstrate:
- Technical troubleshooting under pressure
- Professional communication during crisis
- Systematic approach to postmortems
- Ability to implement lasting improvements
- Emotional maturity in difficult situations
An engineer who's handled real incidents is often a better hire than one who's never faced a crisis. The former has been tested. The latter is an unknown.
Skills That Reduce Impact
While you can't prevent all incidents, you can build skills that reduce their frequency and severity.
Understanding your systems deeply means fewer surprises. You know how things connect, what depends on what, where the risky operations live. This comes from deliberately investing in learning, not just doing your assigned work.
Scripting and automation skills let you create tools with safety rails. A well-written bash script can have confirmation prompts, dry-run modes, and rollback procedures built in. Manual commands offer no such protection.
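As a hedged illustration of what those rails can look like, here is a minimal wrapper pattern: a dry-run mode plus an explicit confirmation before anything touches production. The migration command is a placeholder.

```bash
#!/usr/bin/env bash
# Safety-rails sketch: a dry-run flag and a confirmation prompt wrapped around a
# hypothetical production migration. The migration command is a placeholder.
set -euo pipefail

DRY_RUN=false
if [[ "${1:-}" == "--dry-run" ]]; then
    DRY_RUN=true
fi

MIGRATION_CMD=(./migrate.sh --target production)   # placeholder for the real command

if [[ "$DRY_RUN" == true ]]; then
    echo "[dry-run] would run: ${MIGRATION_CMD[*]}"
    exit 0
fi

read -r -p "Run '${MIGRATION_CMD[*]}' against PRODUCTION? Type 'yes' to continue: " answer
if [[ "$answer" != "yes" ]]; then
    echo "Aborted."
    exit 1
fi

"${MIGRATION_CMD[@]}"
```

It won't replace review or canary deploys, but it turns a slip of the fingers into a deliberate decision.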
Practicing these skills outside of crisis moments, in a home lab or on platforms like Shell Samurai for hands-on command-line practice, builds the muscle memory and confidence that let you act effectively under pressure.
Git and version control fluency means you can always roll back. If you're a sysadmin who's been avoiding Git, incidents like this are why you shouldn't.
Monitoring and observability knowledge means you catch problems faster and understand blast radius immediately. The time between "something's wrong" and "I know exactly what's wrong" is where damage happens.
Creating a Culture That Handles Failure Well
If you're in a position to influence culture, whether as a team lead, a manager, or even just a respected voice on your team, you can help create an environment where incidents are handled well.
Share your own failure stories. Normalize talking about things that went wrong. When someone else breaks something, model the response you'd want: focus on fixing, not blaming. Push for systemic fixes in postmortems rather than "try harder" action items.
The goal is a team where people report problems immediately because they know they'll get help, not punishment. Where postmortems are learning opportunities rather than tribunals. Where the systems improve after each incident instead of just adding warning labels.
This kind of culture makes incidents less frequent (because people raise concerns early), less severe (because issues get reported immediately), and less damaging to careers (because people aren't destroyed by normal human error).
When It's Actually Your Fault
Let's be honest: sometimes incidents aren't just bad luck. Sometimes you really did screw up in a preventable way.
Maybe you ran a production command without testing it first. Maybe you ignored the change management process because it was slow. Maybe you were distracted, or tired, or overconfident.
This doesn't change most of the advice above. You still communicate immediately, fix the problem, and participate constructively in the postmortem. But your internal processing needs to include genuine accountability.
"I should have tested that first" is different from "testing is too slow." One leads to actually testing next time. The other leads to repeating the mistake.
The postmortem action items should still be systemic, making it harder for anyone to make the same mistake, but you should also make personal commitments. Not as self-flagellation, but as actual plans you'll follow.
If you find yourself repeatedly in this category (incidents that happen because you cut corners or ignore processes), that's worth examining. Are you burnt out? Are the processes genuinely broken? Are you in the wrong role? Sometimes a pattern of "preventable" incidents is a sign of something deeper that needs addressing.
The on-call stress guide and burnout recovery resources might be relevant here. Tired people make more mistakes. Stressed people cut more corners. Fixing the root cause often fixes the incidents.
The Incident That Changed How I Think About Incidents
You've probably experienced that moment where a production issue hits and everyone starts pointing fingers. That energy gets spent on blame instead of recovery.
Compare that to teams where the first response is "how do we fix this?" followed by "how do we prevent this?" No drama. No raised voices. Just engineering.
The difference isn't that the second team is better at avoiding mistakes. It's that they've decided mistakes are a normal part of complex systems, and their job is to build systems that are resilient to human error rather than demanding superhuman perfection.
If you're lucky enough to be on a team like the second one, appreciate it. If you're on a team like the first one, you have a choice: try to change the culture, or find somewhere healthier.
FAQ
How long should I wait before running another production command?
As long as you need to feel genuinely confident, not as punishment. For some people, that's the next day after the postmortem. For others, it's a few weeks of working on low-risk tasks first. There's no universal timeline, but rushing back to prove something often backfires.
Should I mention production incidents in job interviews?
Often, yes. "Tell me about a time something went wrong" is a common interview question, and a well-handled production incident is a great story. Focus on what you learned and what changed afterward, not on defensiveness or blame. The STAR method works well for structuring these answers.
What if my manager wants to fire me after an incident?
First, understand whether this is standard at your company or an unusual response. If single incidents routinely end careers there, that's a cultural problem; start job searching regardless of the immediate outcome. If this seems unusual, ask to understand their specific concerns and what, if anything, would address them. Sometimes managers react in the heat of the moment and calm down. Sometimes the relationship is genuinely damaged. Either way, knowing where you stand helps you make decisions.
How do I rebuild trust with my team after a major incident?
Time and consistency. Show up, do good work, handle subsequent challenges well. Don't constantly apologize or bring it up; that makes it awkward for everyone. Do demonstrate that you've incorporated the lessons. If the postmortem action item was to test things first, visibly test things first. Trust rebuilds through behavior, not words.
What's the difference between a healthy postmortem and a blame session?
Healthy postmortems focus on the future: "What do we change so this is less likely or less impactful?" Blame sessions focus on the past: "Whose fault was this and how do we make sure they feel bad about it?" If your postmortems feel like interrogations, something is wrong with the culture, not the postmortem format.
Final Thoughts
You broke production. It's done. The monitoring is back to green, the postmortem is filed, and everyone has moved on except the voice in your head.
Here's what you need to hear: this doesn't define you. Not your skills, not your career, not your worth as an engineer.
What defines you is what happens next. Do you learn from this? Do you help fix the systems that made it possible? Do you handle the next incident (there will be a next incident) with more confidence and competence?
The best engineers I know have stories about the time they brought down production. They tell these stories with a mix of "ugh, that sucked" and "here's what I learned." That's where you're heading.
The goal isn't to never fail. It's to fail well, recover quickly, and build systems (technical and personal) that turn failures into improvements.
You've got this.