Founder's Letter

It all started on January 21st, 2018.  My team and I had just completed the build-out of our third datacenter in 12 months.  We had just begun moving production workloads when a VMware NSX bug brought everything crashing down.  For the next 22 hours, we worked to restore operations.  That was only the beginning of a very painful year, one that ultimately resulted in significant personal and professional growth.  Over the next 6 months, we suffered additional datacenter outages due to the likes of a NetApp QoS bug, a Cisco ASA bug, and others.

The Kübler-Ross Grief Cycle is an apt description of what followed.  At first, I went through the denial stage.  I then quickly progressed to anger, where I unfortunately took this frustration out on those around me.  During the bargaining phase, I spoke with industry peers and heard similar horror stories.  It was comforting to hear that I was not alone.  While I don’t believe I was depressed at the time, I was experiencing symptoms of overwhelming helplessness. Was this just how it had to be?  Are all IT Operations professionals doomed to suffer this same pain and anxiety during their careers?

"Are all IT Operations professionals doomed to suffer this same pain and anxiety during their careers?"

By the spring of 2018, I was spending a considerable amount of time searching for solutions.  I asked several industry colleagues what they were doing to mitigate this risk in the hope of finding something, anything, to help us.  By August, I realized there was no product available to buy.  It needed to be built.  My team and I cobbled together something to put a semi-automated process around what has traditionally been a very labor-intensive endeavor.   

After the initial outages, Cisco offered to set us up with a Technical Account Manager who would pick up the phone and call someone when they released a new bug.  How was this any better than getting more emails in our inboxes?  The stubborn but ethical part of my brain said, “Why do we have to pay vendors more money on top of the good money we’re already paying, just so they will tell us when they’ve made a mistake?”  Something about that arrangement didn’t work for me.  There had to be a better way.

My team was made up of some of the best and brightest, and they were extremely hard working.  But, like many IT teams, we were understaffed relative to the expectations of the business.  I know they weren’t being malicious when they missed these catastrophic bug notices.  Frankly, they had too much on their plates and were likely burned out from the 3 datacenter builds in the previous 12 months.  We needed help, and there’s nothing wrong with that realization.  They needed something more than spreadsheets and an email inbox to track and manage the thousands of different risks in our environment.  I also needed a consolidated view of these risks, since I clearly wasn’t going to invade their voicemails and inboxes.


"There had to be a better way."

Finally, in late 2018 at the Gartner Infrastructure and Operations conference, I had the light-bulb moment.  I had spoken to everyone who would hear me out.  They all had similar (and sometimes worse) stories than mine.  I would then describe what we had recently built, and nearly everyone remarked how much better our solution was compared to what they were currently doing.  I came to the realization that I could help others avoid the pain I had experienced, and I was excited by the fulfillment that would bring.

Then came acceptance.  I called my mentor and said, “I think I want to help others, so they don’t have to struggle like I have this past year.”  He had been a successful entrepreneur several times throughout his career and was extremely encouraging.  I had already spoken to him several times that year, detailing the pain we were experiencing.  He understood and knew others in the industry who were struggling with these same challenges.

"I came to the realization that I could help others avoid the pain I had experienced"

For decades, we’ve had monitoring tools that tell us when a hard drive fails or a service in an OS stops working.  Why don’t we have a monitoring tool that tells us when the software running the storage or database is broken?  The vendor forums had postings about such bugs, but they often pertained to different versions than the ones we were running, so searching them was largely wasted effort.  And again, how do you even manage something using forums?

 

During my research and conversations with industry veterans, some folks would give me a puzzled look and say, “Do you mean a CVE data feed?  Because we use Qualys, Rapid7, Tenable, etc.”  So many of my peers were confused when I told them, “No, the bugs were not security vulnerabilities, and they weren’t in any data feed that uses the National Vulnerability Database (NVD) or CVE framework.”  Using the CIA triad (confidentiality, integrity, and availability) for reference, the bugs that hit us were mostly related to availability and integrity, while CVEs are mostly related to confidentiality.  But at the end of the day, they were still risks to the organization.

 

Since 2018, I’ve been very fortunate to speak with many IT colleagues.  In each conversation, there is another nugget of valuable information that we use to strengthen our solution.  I am proud that BugZero has become a solution built by IT, for IT.  We know what helps and what just causes more pain.  We know security concerns are vitally important to businesses and have built security and compliance at the core of our platform. 

 

On occasion, we speak with folks who haven’t suffered an outage from this problem and I’m genuinely jealous.  I typically dig deeper to figure out what they’re doing differently.  What are their staffing levels?  How much time are they spending on this?  Have they cobbled together something?  Is there some solution or service that I couldn’t find back in 2018?  How can they afford to spend millions of dollars on vendor Technical Account Managers?

 

Fast forward to today, and our list of mission-critical IT vendor integrations is in the double digits and counting, with the likes of Cisco, NetApp, VMware, AWS, Microsoft, Red Hat, HPE, MongoDB, Veeam, and Fortinet.  These integrations report only the risks that are currently present in your environment.  We call them operational bugs, but others call them stability defects, functional bugs, or some similar term.  These risks are then filtered, prioritized, and normalized before being inserted into your ITSM solution (we currently support ServiceNow, but are adding other ITSM integrations).  This builds a process around something that has historically been handled with email, spreadsheets, vendor portals that are rarely checked, and phone calls from Technical Account Managers.  None of those ‘solutions’ is a process that an entire team can rally behind.
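
To picture the filter-prioritize-normalize flow described above, here is a minimal sketch in Python.  It is purely illustrative and reflects assumptions on my part: the record fields, the inventory format, the bug identifier, and the ticket payload are all hypothetical, and this is not BugZero’s actual implementation or the ServiceNow API.

    # Illustrative sketch only -- all names and fields below are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class VendorBug:
        vendor: str                # e.g. "Cisco", "NetApp", "VMware"
        bug_id: str                # the vendor's own defect identifier
        affected_versions: set     # versions the vendor lists as affected
        severity: int              # 1 (critical) .. 4 (low)

    # Hypothetical environment inventory: product -> version actually deployed.
    inventory = {"Cisco ASA": "9.16.2", "NetApp ONTAP": "9.11.1"}

    def is_relevant(bug, product):
        # Filter: keep only bugs that apply to a version running in this environment.
        return inventory.get(product) in bug.affected_versions

    def to_ticket(bug, product):
        # Normalize: turn a vendor-specific bug into a generic ITSM ticket payload.
        return {
            "short_description": f"{bug.vendor} operational bug {bug.bug_id} affects {product}",
            "impact": bug.severity,
            "category": "operational_bug",   # distinct from CVE/security findings
        }

    bugs = [VendorBug("Cisco", "CSCvx12345", {"9.16.2", "9.16.3"}, 1)]
    # Prioritize by severity, then filter and normalize into tickets.
    tickets = [to_ticket(b, "Cisco ASA")
               for b in sorted(bugs, key=lambda b: b.severity)
               if is_relevant(b, "Cisco ASA")]
    print(tickets)

The point of the sketch is the process, not the code: relevance is decided against what is actually deployed, and the output lands in the same ticketing workflow the team already uses, rather than in an inbox or a spreadsheet.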

"Our vision is combining all of these risks so IT teams can be more proactive, increase uptime, and ultimately have a better work/life balance than is possible today."

Why doesn’t IT Ops have the same level of rigor that SecOps has for CVEs?  SecOps has predictive analytics and risk scores, and ML and AI are becoming the norm.  Our vision is combining all of these risks so IT teams can be more proactive than reactive, have better uptime, and ultimately have a better work/life balance than is the norm today.  Year over year, companies are becoming more reliant on IT, and software is becoming more complex.  It’s time for BugZero.
