It all started in January of 2018. My team and I had just completed the build-out of our third datacenter in 12 months. We had just begun moving production workloads when a VMware NSX bug brought everything crashing down. For the next 22 hours we worked to restore operations. That was only the beginning of a very painful year. Over the next several months, we suffered additional datacenter outages due to the likes of a NetApp QoS bug, a Cisco ASA bug, and others.
I spoke with industry peers and heard similar horror stories. It was comforting to hear that I was not alone, but... was this just how it had to be? Are all IT Operations professionals doomed to suffer like this during their careers?
By the spring of 2018, I was spending a considerable amount of time searching for solutions. I asked several industry colleagues what they were doing to mitigate this risk in the hope of finding something to help us. By August, I realized there was no product available to buy. It needed to be built. My team and I cobbled together a semi-automated process that was better than the current industry standard manual processes.
After the initial outages, Cisco offered to set us up with a Technical Account Manager who would pick up the phone and call us whenever a new bug was published. How was this any better than getting more emails in our inboxes? Something about that arrangement didn’t work for me. There had to be a better way.
My team was made up of some of the best and brightest, but like many IT teams, we were understaffed relative to the expectations of the business. They needed something more than spreadsheets and an email inbox to manage the thousands of distinct risks in our environment. I also needed a consolidated view of those risks to drive accountability and remediation.
Finally, in late 2018 at the Gartner Infrastructure and Operations conference, I had the light-bulb moment. I had spoken to everyone who would hear me out. They all had similar stories to mine. I would then describe what we had recently built, and nearly everyone remarked how much better our solution was compared to what they were currently doing. I came to the realization that I could help others avoid the pain I had experienced.
For decades we’ve had monitoring tools that tell us when a hard drive fails or a service in an OS stops working. Why don’t we have a monitoring tool that tells us when the software running the storage array or the database is broken? Vendor forums have posts about such bugs, but then again, how do you even manage risk through forums?
During my research and conversations with industry veterans, some folks would give me a puzzled look and say, “Do you mean a CVE data feed? Because we use Qualys, Rapid7, Tenable, etc.” Many of my peers were confused when I told them, “No, these bugs are not security vulnerabilities, and they don’t appear in any data feed built on the National Vulnerability Database (NVD) or the CVE framework.”
To clarify using the Information Security Triad (Confidentiality, Integrity, Availability) for reference, the bugs that hit us were mostly related to availability and integrity while CVEs are mostly related to confidentiality. But at the end of the day, they were still risks to the organization.
Since 2018, I’ve been very fortunate to speak with many IT colleagues. In each conversation, there is another nugget of valuable information that we use to strengthen our solution. I am proud that BugZero is a solution built by IT, for IT. We know what helps and what does not. We know security concerns are vitally important to businesses and have built security and compliance at the core of our platform.
Fast forward to today: our list of mission-critical IT vendor integrations has reached double digits and counting, with the likes of Cisco, NetApp, VMware, AWS, Microsoft, Red Hat, HPE, MongoDB, Veeam, and Fortinet. These integrations report only the risks currently present in your environment. We call them operational bugs; others call them stability defects, functional bugs, or some similar term.
These risks are filtered, prioritized, and normalized before being inserted into your ITSM solution (we currently support ServiceNow, but are adding other ITSM integrations). This builds a process around something that has historically been handled with email, spreadsheets, vendor portals that are rarely checked, and phone calls from Technical Account Managers. None of those ‘solutions’ are a process that an entire team can rally behind.
Why doesn’t IT Ops have the same level of rigor that SecOps has for CVEs? SecOps has predictive analytics and risk scores, and ML and AI are becoming the norm. Our vision is to combine all of these risks so IT teams can be proactive rather than reactive, achieve better uptime, and ultimately have a better work/life balance than is the norm today. Year over year, companies grow more reliant on IT, and software grows more complex. It’s time for BugZero.
One of the times our IT systems are most at risk is during change. So how does BugZero help you mitigate risk during change? Every day, businesses must address new risks based on newly available data and continuously drive the requisite changes. After all, "Change Management is Management".
Read Change Management
All businesses depend on software. All software has bugs. That software might be storage controller firmware, it may be running your network infrastructure, or it could be your critical enterprise applications. Those bugs might be security vulnerabilities, or the operational defects more commonly known as bugs.
Read Software Bugs
ITIC’s latest research shows the Hourly Cost of Downtime now exceeds $300k for 91% of mid-market and large enterprises. Overall, 44% of mid-sized and large enterprise survey respondents reported that a single hour of downtime can cost their businesses over $1M.
Read Downtime Costs