What is a Network Audit and How can Uptime.com Help?
Scaling sort of sneaks up on you, doesn’t it? One day, you’re carefree, the next you start to notice something is off… Maybe it’s the crashing, or the frequent dips in performance. Could it be the new hire? It’s not DNS. Is it DNS?
Scaling is a natural part of the business process, and your infrastructure will start to change completely as your user base doubles and triples. In the office, you’re celebrating major milestones, but outwardly your user experience is maybe a little… awkward?
Like adolescence, server infrastructure is about knowing yourself. There can be quite a bit of waste built into poorly audited systems, no matter how useful the toolsets that may have initially justified the costs. Remember that you can spend your last dollar and still fail to guarantee 100% survival.
What is a Network Audit?
Put in simple terms, a network audit helps you better understand the evolving beast that is your infrastructure. Smaller companies will benefit greatly from baking these audits into their scaling process, while major enterprises will find them essential for detecting problems and waste.
“CI/CD is the purest expression of Conway’s Law: your CI/CD pipeline will be broken and messy in exactly the same ways your engineering org is broken and messy.” https://t.co/KHB7su003S
John Arundel (@bitfield), May 16, 2020
Thinking about the specifics, a network audit involves carefully mapping your infrastructure in terms of the software, hardware, and resources that ensure the lights stay on.
Basic Needs in a Network Audit
Typically, you begin with a checklist outlining the most important elements. For example, you might start with bandwidth demands or just create a map of the systems up and running at a given time. With this understanding, you can consider needs more specific to your business.
Are you seeing dips in performance that are geographical in nature? Did you find systems you had not previously detected, and are these legacy systems still in use in any major capacity? What about frequency of outages, or total downtime? How are you stacking up against the SLA your company has set forth?
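To make the SLA question concrete, here is a minimal Python sketch that turns a list of outage windows into an uptime percentage and compares it with a target. The function names, the 99.9% default, and the input shape (pairs of start and end times) are assumptions for illustration, not anything Uptime.com prescribes.

```python
from datetime import datetime


def uptime_percent(outages, period_start, period_end):
    """Return uptime as a percentage of the reporting period.

    `outages` is a list of (start, end) datetime pairs. The sketch assumes
    the outages do not overlap one another.
    """
    period = (period_end - period_start).total_seconds()
    if period <= 0:
        raise ValueError("period_end must be after period_start")
    down = 0.0
    for start, end in outages:
        # Clip each outage to the period so partial overlaps count correctly.
        clipped_start = max(start, period_start)
        clipped_end = min(end, period_end)
        if clipped_end > clipped_start:
            down += (clipped_end - clipped_start).total_seconds()
    return 100.0 * (period - down) / period


def meets_sla(outages, period_start, period_end, sla_target=99.9):
    """True when measured uptime is at or above the (assumed) SLA target."""
    return uptime_percent(outages, period_start, period_end) >= sla_target
```

Over a 30-day month, a 99.9% target leaves room for about 43 minutes of downtime, so two outages of 20 and 30 minutes would already put you under it.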
Addressing these needs is a full-time task, so today we want to explain some methods that can help make your life a little easier.
Assisting with your Network Audit
We are going to look at a few important items that cover the fundamentals. Once you have gone through this list, you should have a more developed overview of your network, applications, and associated infrastructure.
Monitor Your Entire Site and Discover Systems
Let’s begin with Monitor Entire Site, a tool that can dig deep to find associated infrastructure you may not be monitoring. Everything from a basic HTTP(S) check to malware/blacklist testing and DNS error detection is possible.
It’s a good idea to periodically check in with this tool, running URLs critical to your application. Doing so will help you zero in on checks you might have missed for a more comprehensive monitoring system.
Audit Ex-Employees and (Re)Configure Escalations
One of the most common questions we receive at Uptime.com support is about removing or otherwise adjusting users who no longer work for a company. Routine auditing will catch these users before they pose a significant security risk. Users may also shuffle around to various groups. They may manage this themselves with tags, but an audit will help uncover elements that otherwise slip under our collective radar.
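One way to approach the ex-employee check is to compare the accounts on your monitoring tool against a current staff roster. The sketch below assumes you have exported both lists as plain email addresses; the function name and inputs are hypothetical.

```python
def find_stale_users(monitoring_users, active_employees):
    """Return monitoring accounts whose email is not on the active roster.

    Both arguments are iterables of email strings. Comparison ignores case
    and surrounding whitespace.
    """
    active = {email.strip().lower() for email in active_employees}
    accounts = {email.strip().lower() for email in monitoring_users}
    return sorted(email for email in accounts if email not in active)
```

Anything the function returns is a candidate for review rather than automatic removal, since shared mailboxes and contractor accounts may be legitimate.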
Additionally, we recommend auditing your escalation system. Are alerts arriving in a timely manner, and can you act on them? It is true that small outages may still affect your SLA, but when you cannot act on these outages you are wasting effort. Audits will help you consider these types of challenges carefully.
There are multiple reasons an integration may stop working. In these situations, Uptime.com will pause the integration and send an email informing you of the error and requesting you re-configure your integration.
But these errors are most common when the integration is consistently generating data (such as performance metrics). If an alert integration suddenly stops working, the failure can slip by unnoticed and lead to missed alerts.
You can test your existing integrations quickly. Click Notifications>Contacts, then click Actions>Test next to each contact. You don’t need to limit this to integrations either. If you are concerned any contact is not receiving specific alerts, use testing to check.
Audit Your Alerts
Sometimes we get so caught up in “making it work” that we forget to examine the root cause. A network audit presents us with the opportunity to right this wrong.
Want to know the fastest way to see that a system is having difficulty? Click Reports>Alerts, and search for a check’s name or service number. At a glance, you will see how often Uptime.com issued an alert and the duration of each outage. Only you know your threshold for downtime, but if you see too many alerts in a short period of time, you have a lead you can investigate for optimization.
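If you export your alert history, the “too many alerts in a short period” test can be sketched in a few lines of Python. The input shape (check name paired with the alert start time), the seven-day window, and the threshold of three are all assumptions you would tune to your own tolerance.

```python
from collections import Counter
from datetime import datetime, timedelta


def noisy_checks(alerts, window_end, window=timedelta(days=7), threshold=3):
    """Return {check_name: alert_count} for checks that alerted often.

    `alerts` is a list of (check_name, started_at) pairs. Only alerts that
    started inside the window ending at `window_end` are counted.
    """
    window_start = window_end - window
    counts = Counter(
        name for name, started_at in alerts
        if window_start <= started_at <= window_end
    )
    return {name: n for name, n in counts.items() if n >= threshold}
```

Checks that show up here are leads, not verdicts: a noisy check may point to a flaky system or simply to a threshold that is set too tight.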
Advanced Network Audit Tips
Well, hello there “new you”. Allow us to give you some ideas on some advanced ways to explore this new body of infrastructure.
Auditing for Violations and Mistakes
Everyone makes mistakes, sometimes in violation of established protocol. Whether for corrective or security reasons, a thorough audit includes review of the Uptime.com Audit Log.
Available to both Business and Enterprise users, the Audit Log allows administrators to search by user ID, service name or ID, and more. You can track changes made to a specific check, or review user actions for a period. If you need to know when a change was made, the audit log is your best source.
Use Multiple Checks
Monitoring is more comprehensive when you include additional angles on potential trouble spots. Let’s say we have a specific server that governs an important API endpoint, such as one for backups. If we only monitor this endpoint, we only know whether it is responding. We don’t know why it’s down. Sure, we might get an error code or miss an expected response. If we used an API check, we might even be able to see which step failed.
That’s great for visibility, and we do need to know if it’s up or down. But we improve observability when we understand more about the problem from outage data. For example, can we use a Ping ICMP check to monitor the server that governs this process? Is that box alive and well?
Can we use an HTTP(S) check configured to a faster interval that will catch these problems before our API check will? These are the kinds of questions that keep SREs up far too late.
Depending on which of these various checks goes down, we understand more about the problem. If we lose the endpoint and our ping check, we can start to rule out some bogus leads faster.
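The reasoning above can be sketched as a small decision function. The three inputs stand for the current state of a ping check, an HTTP(S) check, and an API check on the same server; the function name and the wording of each conclusion are illustrative assumptions, and a real triage would also account for hosts that block ICMP.

```python
def triage(ping_up, http_up, api_up):
    """Suggest where to look first, based on which checks are failing."""
    if not ping_up:
        # Nothing above the network layer can be trusted if the host is gone.
        return "host or network: the box itself is unreachable"
    if not http_up:
        return "web server or TLS: the host answers ping but HTTP(S) fails"
    if not api_up:
        return "application: the server responds but the API transaction fails"
    return "all checks passing"
```

For example, `triage(ping_up=True, http_up=True, api_up=False)` points at the application rather than the box, which is one bogus lead ruled out before anyone logs in to the server.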
Response Time and Outages
A check’s response time does not always indicate a problem, but it is always worth keeping an eye on. A status page, either public or private, is one of the best sources you have as an SRE for seeing these vital statistics in a centralized space. Second to status pages are custom dashboards, which provide an at-a-glance overview.
You can review average response time, or dive into checks for a glimpse into how each performs.
We also recommend the Uptime.com REST API, which you can use to pull metric data into whatever report format you choose.
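As a rough illustration of the reporting step, the sketch below assumes you have already pulled response-time samples (in milliseconds) for each check into a plain dictionary. It does not call the Uptime.com API itself; the function names, the input shape, and the CSV columns are hypothetical.

```python
import csv
import io


def summarize_response_times(samples):
    """Turn {check_name: [response_ms, ...]} into one summary row per check."""
    rows = []
    for name in sorted(samples):
        values = sorted(samples[name])
        if not values:
            continue  # skip checks with no data in the period
        # Nearest-rank 95th percentile, computed with integer arithmetic.
        rank = (95 * len(values) + 99) // 100
        rows.append({
            "check": name,
            "avg_ms": round(sum(values) / len(values), 1),
            "p95_ms": values[rank - 1],
            "max_ms": values[-1],
        })
    return rows


def to_csv(rows):
    """Render summary rows as CSV text for a spreadsheet or report."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["check", "avg_ms", "p95_ms", "max_ms"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Reporting a high percentile next to the average is a deliberate choice here: an average can look healthy while a slow tail is what your users actually notice.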
An audit is a good time to think about internal policies that can help the organization, and your applications, stay more secure. Of course everyone wants to keep customer data safe, but what about the systems and IP that govern your business? Wouldn’t it be just as damaging to have a copycat of your application somewhere out there capitalizing on your work?
Everything from BYOD to passwords is fair game. This may be a good time to look into an SSO provider, such as OneLogin, Okta, or AWS’s own system, to name just a few. SSO solves some of the BYOD headaches, as administrators can easily track access and provision privileges according to an employee’s group or status in the organization.
Get Serious About Network Audits!
There is no shortage of trouble spots a thorough network audit can uncover, and the sooner you adopt these procedures, the more efficiently your organization will run. The great news about a network audit is that it is free-form: you can shape and mold it as you please.
What does your checklist look like?