10 Web Monitoring Tips for Redundant Systems
As your team grows, so do the rules and regulations you use to keep things organized. The same is true for systems, which grow in complexity as they grow in size.
That complexity is difficult to manage on its own, even without the natural turnover that occurs in tech. Those who built and managed legacy systems eventually go on to bigger and brighter things, either within the company or toward other opportunities.
You can help protect yourself from this kind of knowledge loss with runbooks and good documentation, but it’s going to happen to even the most prepared teams.
Those gaps in system knowledge, coupled with systems that run layers deep, present several challenges for uptime monitoring. Today we’ll dive into those challenges, meet them head on, and look at how Uptime.com can help solve them.
1. Use Tags to Group Checks Together
We’ll begin with some basic tips on organization and management. Working with checks in bulk is much easier when you utilize tags.
Tagging by system, by owner, by geography and more are all effective means for differentiating systems logically. But tags themselves are powerful tools for Uptime.com users, allowing you to quickly assemble reporting, status pages, dashboards, and organize checks.
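To make the idea concrete, here is a minimal sketch of how tags turn a flat list of checks into logical groups. The check records and field names are hypothetical and only illustrate the pattern; they are not the Uptime.com data model.

```python
from collections import defaultdict

# Hypothetical check records; the field names are illustrative only.
checks = [
    {"name": "checkout-api", "tags": ["payments", "team-alpha", "us-east"]},
    {"name": "billing-db", "tags": ["payments", "team-beta", "eu-west"]},
    {"name": "marketing-site", "tags": ["web", "team-alpha", "us-east"]},
]


def group_by_tag(checks):
    """Map each tag to the names of the checks that carry it."""
    groups = defaultdict(list)
    for check in checks:
        for tag in check["tags"]:
            groups[tag].append(check["name"])
    return dict(groups)


groups = group_by_tag(checks)
print(groups["payments"])    # checks that belong to the payments system
print(groups["team-alpha"])  # checks owned by team-alpha
```

Because a check can carry several tags, the same check can appear in a system view, an owner view, and a geography view without being duplicated.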
2. Utilize the Uptime.com REST API
The REST API makes it easy to manage check creation, make edits in bulk, report on metrics with precision, and more. Any user can be authorized to access the API. All you need is a token, and you’re ready to send requests to retrieve or create data.
What can you do with the API?
Almost everything you can do within our user interface. You can’t pay for your subscription, but you can do just about everything else including issuing new user seats, creating contacts, editing status pages, adding incidents, controlling maintenance and more.
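As a rough sketch of what a token-authenticated request might look like, the snippet below builds a request with Python's standard library without sending it. The base URL, the `checks` path, and the `Token` header scheme are assumptions here; confirm them against the API documentation before relying on them. `build_request` is a hypothetical helper, not part of any Uptime.com client.

```python
import json
import urllib.request

API_BASE = "https://uptime.com/api/v1"  # assumed base URL; verify in the API docs


def build_request(path, token, payload=None):
    """Build (but do not send) an authenticated API request."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    request = urllib.request.Request(
        f"{API_BASE}/{path.strip('/')}/",
        data=data,
        method="POST" if payload is not None else "GET",
    )
    request.add_header("Authorization", f"Token {token}")  # assumed header scheme
    request.add_header("Content-Type", "application/json")
    return request


# A read request: no payload, so the method is GET.
list_checks = build_request("checks", token="YOUR_API_TOKEN")
print(list_checks.get_method(), list_checks.full_url)

# To actually send it: urllib.request.urlopen(list_checks)
```

Keeping request construction separate from sending makes it easy to inspect or log what a script is about to do before it touches a live account.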
3. Utilize Bulk Actions for Check Management
Sizable systems necessitate a high volume of checks, and bulk actions give you the tools you need to manage an account at scale.
With bulk actions, you highlight the items you want to change, select what to change, and then you make the change. Simple, fluid management for hundreds of checks.
For convenience, we also include bulk import and export options (the latter is reserved for Premium subscribers).
4. Create Multiple Contacts for Different Team Members
Any team that manages hundreds of checks likely has dozens of people handling systems. Those people need to be informed when something goes down, but you need to make sure the right person gets the right alert.
Don’t stop at “Default” or a catch-all contact for your team. It’s helpful to build out email, third-party integrations, and even SMS or phone contacts for critical systems owners as you start to consider your escalation structure.
Whatever your needs, Uptime.com can accommodate them. Any subscriber can create a nearly unlimited number of contacts and integrations to ensure alerts and downtime data get to where they need to go.
5. Build Checks into Deployment
One of the best long-term steps you can take is to build the check creation process into your deployments. Here are some ideas:
When creating more complex systems, ensure you have specific data selectors you have customized for transaction testing. CSS and XPath selectors offer extreme precision, and function best when elements are unique. As you build your systems, consider how you will test those systems.
Terraform can be very useful for managing your pipeline from production to long-term maintenance. Our Terraform provider allows you to build check creation into your workflow, and it includes tools for management as well.
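One way to picture check creation as a deployment step is a small function that turns a deployment manifest into check definitions. This is a sketch under stated assumptions: the manifest, the field names, and the `checks_for_deployment` helper are all hypothetical, and the output is not a real provider or API schema.

```python
def checks_for_deployment(services, environment):
    """Return one check definition per deployed service.

    The field names are hypothetical, not a real provider schema.
    """
    definitions = []
    for service in services:
        definitions.append({
            "name": f"{environment}-{service['name']}",
            "type": "HTTP",
            "url": service["health_url"],
            # Tag by environment, system, and owner so the check is easy to find later.
            "tags": [environment, service["system"], service["owner"]],
        })
    return definitions


# Hypothetical deployment manifest.
services = [
    {"name": "checkout", "health_url": "https://example.com/checkout/health",
     "system": "payments", "owner": "team-alpha"},
    {"name": "search", "health_url": "https://example.com/search/health",
     "system": "catalog", "owner": "team-beta"},
]

for definition in checks_for_deployment(services, "production"):
    print(definition["name"], definition["tags"])
```

Generating the definitions from the same manifest that drives the deployment means a new service cannot ship without a matching check.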
6. Create Checks as Unknown Parts Fail
Whether you have inherited a system or are a seasoned veteran, you will encounter unknown points of failure. Systems of a certain scale simply have too many moving parts for nothing to slip through the cracks.
As these systems fail, part of learning from those failures is automating how you detect and respond to them, starting with a check for each newly discovered point of failure. An extensive runbook is actually a massive red flag that your team may be failing to automate basic maintenance tasks.
7. Create Checks for Every Level of Failure
Good monitoring layers upon itself for a more comprehensive view into downtime. Not just what is down, but where and when it is failing and any signs leading up to that failure. You get these valuable data points with multiple check types. Here is one common configuration:
- An HTTP(S) check for a high value URL
- A Transaction check built around that high value URL and associated goals
- RUM deployed to that high value URL and its associated steps in the funnel
The result gives you performance data from RUM, transaction data that tells you when and where the site is failing, and a rapid-fire HTTP(S) check for basic up/downtime checks.
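The three-layer configuration above can be sketched as plain data. The type labels, keys, and the `layered_checks` helper are illustrative assumptions, not the fields of a real check definition.

```python
def layered_checks(url, funnel_steps):
    """Describe the three monitoring layers for one high-value URL.

    Keys and type labels are illustrative only.
    """
    return [
        {"type": "HTTP", "url": url,
         "purpose": "fast up/down signal"},
        {"type": "TRANSACTION", "url": url, "steps": list(funnel_steps),
         "purpose": "when and where the flow fails"},
        {"type": "RUM", "url": url, "pages": list(funnel_steps),
         "purpose": "real-user performance data"},
    ]


layers = layered_checks(
    "https://example.com/checkout",
    ["cart", "payment", "confirmation"],
)
for layer in layers:
    print(layer["type"], "-", layer["purpose"])
```

All three layers point at the same URL, so a failure in one can be read against the other two.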
8. Monitor Internal Resources
Most SREs for enterprise-level DevOps likely run multiple web monitoring applications, but are you monitoring internal resources? Not only your disk space, bandwidth, and performance metrics, but your internal systems down to their bits and bytes?
Private Location monitoring may be an effective solution for that glimpse behind the curtain. Security measures may prevent you from allowing an external monitor into your systems. The solution must come from within.
Private locations are like your own Uptime.com servers, and they can run every check type you will find in the Uptime.com suite. How can you use them? Webhook checks are perfect for jobs you need to monitor, while HTTP(S) and API checks can help determine status from critical system endpoints.
9. Escalate for a More Effective Response
Escalations are critical to alert delivery, and to ensure action is taken when it is needed and not a moment before or after. Without escalations, you and your team can find yourselves fatigued by late-night alerting for issues that have already resolved themselves.
If you have been using tags to organize by system, you likely have team members associated with each system. Field alerts to a central contact (Slack and Microsoft Teams channels work well for this), and then escalate to the system owner upon extended downtime.
Weekly SLA reports should help keep you up to date on minor downtime, and escalations will give your system owner the necessary data to resolve extended downtime.
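The escalation pattern described in this section, a central contact first and the system owner only after extended downtime, can be sketched in a few lines. The 15-minute threshold, the contact names, and the `contacts_to_alert` helper are hypothetical; pick values that match your own response expectations.

```python
def contacts_to_alert(minutes_down, system_owner, escalate_after=15,
                      central_contact="ops-channel"):
    """Return who should be alerted for an outage of the given length.

    The central contact always hears about it; the system owner is added
    only once downtime reaches the escalation threshold.
    """
    contacts = [central_contact]
    if minutes_down >= escalate_after:
        contacts.append(system_owner)
    return contacts


print(contacts_to_alert(3, "payments-oncall"))   # short blip: central contact only
print(contacts_to_alert(20, "payments-oncall"))  # extended outage: owner is added
```

A short outage that resolves itself never reaches the owner, which is the point of escalating rather than alerting everyone at once.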
10. Package Multiple Checks into a Status Page Component Group
Tags make it easy to create status page component groups, which are an intuitive way to present groups of interrelated checks. Components can nest multiple checks within a single grouping, so users can check each part of your infrastructure.
This type of grouping can be especially useful for internal status pages, where your team may need insights on which systems are up and running at a glance. Bonus points if you share this data with support and customer service teams, as they are on the frontlines of communicating uptime status to your end users.
Redundant Systems Require Redundancy
When you’re monitoring an onion’s worth of infrastructure, you need an onion’s worth of insights into those moving parts. The deeper into the bits and bytes you can get, the better equipped you’ll be to leverage that data with precision when something does go wrong.
You also won’t know what you don’t already know about your system until it fails. The better prepared you are to monitor for those unforeseen points of collapse, the better protected you are from the next major incident.
Mitigation is key when you’re monitoring systems at scale. Uptime.com makes it easy to group these checks together and manage those complexities.