Key Takeaways from SREcon19
We had a blast at this year’s SREcon in Brooklyn, NY. Over the three-day event, we met with integration partners and deepened our understanding of site reliability from a DevOps perspective.
What is SREcon?
SREcon is a conference where Site Reliability Engineers (SREs) gather to discuss and learn about DevOps-related issues. This year’s focus was “comprehension, understandability, and predictability.”
Site Reliability Engineers are tasked with managing alerts and restoring system functionality during outages, as well as working on projects that keep systems from breaking in the first place.
SREcon is held in several locations worldwide every year. If you missed last week’s event in Brooklyn, there are two more this year. Check out the SREcon website to learn more.
Here are some of the key takeaways from this year’s SREcon as they relate to web infrastructure monitoring.
Monitoring is the Foundation of Site Reliability Engineering
Without proactive monitoring, SREs can’t do their jobs. Monitoring not only allows you to fix problems as they happen, but also helps IT teams know when it’s time to replace aging systems or make major development changes.
While SRE teams at large enterprises like Google and Airbnb build their own monitoring systems, third-party monitoring services are faster to deploy and scale. They are the perfect answer for companies that want to get up and running quickly, but don’t have the money or time to build monitoring systems from scratch.
SREs Can Help Prevent Expensive Outages
Every business knows outages cost money. How much depends on a variety of factors, including the length of the outage and the size of the company. While SREs are a long way from preventing every possible outage, a good SRE balances handling black swans (rare, hard-to-predict failures) with project work.
According to Aaron Wieczorek of the United States Digital Service, a 9-day outage at the US Patent and Trademark Office (USPTO) in 2018 cost the government agency $864M. The office lost an estimated $4M per hour.
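As a quick sanity check, the two figures quoted above line up if the loss is treated as a flat hourly rate over the full nine days. The short sketch below shows that arithmetic; the function name `outage_cost` and the flat-rate assumption are ours for illustration and are not from the talk.

```python
# Back-of-the-envelope check of the outage figures quoted above.
# Assumption: the hourly loss is constant for the entire outage.
HOURS_PER_DAY = 24


def outage_cost(days: float, cost_per_hour: float) -> float:
    """Return the total cost of an outage lasting `days` at a flat hourly rate."""
    return days * HOURS_PER_DAY * cost_per_hour


total = outage_cost(days=9, cost_per_hour=4_000_000)
print(f"${total / 1_000_000:,.0f}M")  # prints "$864M"
```

You can plug in your own outage length and hourly revenue to get a rough estimate for your business.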
Not only did the outage cost the agency a substantial amount of money, but customers had to shell out additional funds to patent attorneys who were required to use paper filing methods until systems were restored.
While the USPTO had the resources to recover from the outage, other companies don’t have that advantage. By proactively monitoring your business, you can avoid having to shut your doors when a big outage hits.
Get started with a free 21-day trial from Uptime.com, no credit card required.
Managing Alerts is a Big Deal
The more an SRE has to respond to alerts and put out fires, the less time they have to devote to projects that keep systems from breaking in the first place.
The more checks you create with your monitoring software, the more alerts you’re going to get. While a high volume of alerts may be useful when you’re new to monitoring, it creates a culture of information overload where IT teams are constantly looking for problems to solve instead of coming up with innovative ways to prevent them.
If alerts are irrelevant and don’t require action, IT personnel are more likely to ignore all alerts, and may miss major outages or performance problems.
On the flip side, not having enough checks in place means you may miss out on valuable insights that can help you create a better user experience.
Alert on Uptime, Not Latency
When and how to alert is one of the core issues of site reliability engineering. To reduce the number of alerts SREs have to respond to, Wieczorek recommends that organizations set up alerts on downtime instead of latency.
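One way to picture this recommendation is a rule that pages someone only when a check has actually failed several times in a row, while slow responses are simply collected for later review. The sketch below is a minimal illustration under our own assumptions: the names `CheckResult`, `should_page`, and `slow_checks`, the three-failure rule, and the 2000 ms threshold are all hypothetical, and none of this is taken from the talk or from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    is_up: bool        # did the check succeed?
    latency_ms: float  # response time; recorded, but never paged on


def should_page(results: list[CheckResult], failures_required: int = 3) -> bool:
    """Page only when the most recent `failures_required` checks all failed."""
    if len(results) < failures_required:
        return False
    recent = results[-failures_required:]
    return all(not r.is_up for r in recent)


def slow_checks(results: list[CheckResult], threshold_ms: float = 2000.0) -> list[CheckResult]:
    """Collect slow-but-successful checks for a report instead of a page."""
    return [r for r in results if r.is_up and r.latency_ms > threshold_ms]


# Hypothetical check history: two successes (one slow), then three failures.
history = [
    CheckResult(True, 180.0),
    CheckResult(True, 3500.0),
    CheckResult(False, 0.0),
    CheckResult(False, 0.0),
    CheckResult(False, 0.0),
]

page_now = should_page(history)       # True: three consecutive failures
paged_earlier = should_page(history[:2])  # False: the site was up, just slow
to_review = slow_checks(history)      # one slow check, kept for a report
```

Requiring consecutive failures keeps a single blip from waking anyone up, and routing latency to a report keeps it visible without adding to the alert load.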
It’s an Uphill Battle to Prove the Value of SRE Teams
Though site reliability engineering is a newcomer to DevOps teams, early adopters like Netflix and Airbnb are publishing research and sharing data that show the value of adding SREs to existing IT teams.
According to conference speaker and Circonus CEO Theo Schlossnagle, “The [SRE] community is strong, which is good because we have a lot of maturing to do as we painfully overcome the significant cognitive dissonance in the discipline across organizations.”
USENIX, the nonprofit that sponsors SREcon, banished paywalls in 2008 and provides all of its conference recordings and research free of charge via its website.