As we settle into the time of year when we reflect on what we’re grateful for, we tend to focus on important basics like health, family, and friends.
But on a professional level, IT operations (ITOps) practitioners are grateful to avoid disastrous outages that can cause confusion, frustration, lost revenue and damaged reputations. The very last thing ITOps, Network Operations Center (NOC), or Site Reliability Engineering (SRE) teams want while eating their turkey and spending time with their families is to be notified of an outage. Outages can be extremely expensive – $12,913 per minute, in fact, and up to $1.5 million per hour for large organizations.
However, to understand the peace of mind that comes with avoiding downtime, you have to have endured firsthand the pain and anxiety that come with outages. Here are some of the horror stories ITOps pros are grateful to avoid this season.
A case of janky command structure
A longtime IT professional was on duty with three others at 7 p.m. The crew received an alert about a problem affecting the front-end user interface of their global traffic management device. Luckily, there was a runbook for this, hosted in a database, so it looked like the problem would be fixed quickly. One of the team members saw two things to enter: a command and a secondary input. He typed the command and, based on how the runbook looked, waited for the command line to prompt for input, such as “what do you want to restart?”
But the way the command structure was set up, if you didn’t provide input, the device itself would reboot. He typed what he thought was the correct command – “bigstart, restart” – and the entire front-end global traffic manager went down.
As a reminder, this happened in the early evening. The client was a finance company, and the system crashed just as businesses were closing for the day and trying to do their books and other finance-related tasks. Horrible timing, to say the least.
Five minutes into the outage, the ITOps team realized what had happened: the tool they were using for their runbook applied default text wrapping, so what looked like two separate commands was actually just one. Although the outage was relatively short, it came at a critical time and created a chain reaction of headaches. The lesson learned? Make sure your command structure is unambiguous, and that your runbook displays commands exactly as they must be typed.
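One way to make that lesson concrete is a small guard that refuses to pass along a restart command that names no target. The sketch below is illustrative only: `require_target` and `MissingTargetError` are hypothetical names, and the service name `httpd` in the usage note is just an example. It assumes a wrapper script sits between the operator and the device; it is not a description of how the team in the story fixed the problem.

```python
import shlex


class MissingTargetError(ValueError):
    """Raised when a guarded command is entered without an explicit target."""


def require_target(line, verb=("bigstart", "restart")):
    """Split a runbook line into tokens and reject a bare guarded command.

    `verb` is the command prefix to guard (an assumption for this sketch).
    Lines that do not start with `verb` are returned unchanged as tokens.
    """
    tokens = shlex.split(line)
    if tuple(tokens[:len(verb)]) != tuple(verb):
        return tokens  # not a guarded command, nothing to check
    targets = tokens[len(verb):]
    if not targets:
        # Fail loudly instead of letting the device pick its own default.
        raise MissingTargetError(
            "refusing to run %r without naming what to restart" % line
        )
    return tokens
```

With this in place, `require_target("bigstart restart httpd")` returns the token list, while `require_target("bigstart restart")` raises an error before anything reaches the device.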
When Google is your best friend in the middle of the night
For a 15-plus year IT veteran, what seemed like a quiet night shift quickly turned into an anxious nightmare. “I have never found myself panicking so quickly as when the remote terminal I was in suddenly shut down,” he said.
He was trying to restart a service while working on a remote machine, but he inadvertently disabled the network adapter in the process. Calling someone and waking them up in the middle of the night to tell them they’d “atomized” a network card wasn’t ideal, so he and his teammates started digging.
After what he calls “a fair amount of googling,” he was able to find his way to a Dell server and reboot the network adapter from there. The fix took longer than expected, but the problem was finally solved.
His pro tip: “Don’t disable the network adapter on a machine you’re logged into remotely in the middle of the night.” It may seem obvious, but the underlying lesson is to have a contingency plan in place in case something goes terribly wrong.
ITOps: Relying on email used to be great – until it wasn’t
Back when email was the primary means for NOC teams to receive alerts, one longtime IT pro recalls having a teammate whose only job was essentially dispatch: monitoring emails and creating tickets, some for incidents that needed immediate attention and others for those the team could address later. The system worked well enough, but at a large multinational it was a ticking time bomb.
This fear materialized when the company’s entire data center went down.
That was a serious problem in its own right, but the incident also generated so many email alerts that it crashed the company’s Outlook server. “At that point, you are really blind,” the IT pro recalls.
The event happened in the middle of the night, so the on-call team had to reluctantly start waking up their teammates. After the issue was finally resolved, the team developed a sense of humor about it. As they recalled, “We used to joke that we DDoS ourselves with our own alert noise. Good times!”
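The “DDoS ourselves” joke points at a common remedy: collapse repeated alerts for the same problem before they reach anyone’s inbox. The sketch below shows one simple form of this, time-window deduplication. The class name, the alert keys and the five-minute default window are all assumptions made for illustration, not details from the story.

```python
class AlertDeduplicator:
    """Forward the first alert for a key, then suppress repeats for a while."""

    def __init__(self, window_seconds=300.0):
        self.window = window_seconds
        self._last_sent = {}   # alert key -> time the last alert was forwarded
        self.suppressed = {}   # alert key -> number of repeats held back

    def should_send(self, key, now):
        """Return True if an alert with this key should be forwarded at `now`.

        `key` identifies the underlying problem (hypothetical format, e.g.
        "dc1:power"); `now` is a timestamp in seconds supplied by the caller.
        """
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window:
            # Same problem, still inside the window: count it, do not send.
            self.suppressed[key] = self.suppressed.get(key, 0) + 1
            return False
        self._last_sent[key] = now
        return True
```

A dispatcher would call `should_send` for each incoming alert and only email or ticket the ones that return `True`, optionally reporting the `suppressed` counts in a periodic summary so nothing is silently lost.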
Ultimately, the moral of the story is this: every time a hand touches a keyboard, there’s a chance something will go wrong. It’s sometimes unavoidable, of course, but teams that automate and simplify their IT operations processes as much as possible give themselves the best chance of avoiding costly outages – so they can enjoy their Thanksgiving celebrations uninterrupted.
Mohan Kompella is Vice President of Product Marketing at BigPanda.