How Uptime.com Can Help Improve Internal Documentation
An acquaintance of mine works for a company that still uses Windows XP to manage some internal applications. The higher-ups refuse to adopt newer versions, citing costs and technical gaps, and it’s created something of a Pandora’s box when it comes to employee turnover. With no strong internal reference documentation, each new departure leaves IT wondering two things:
- Did we lose the last person knowledgeable in the system’s workings?
- Can we still fix it if it breaks?
This rather amusing conundrum is apparently not an isolated incident. There are varying estimates, but the number of companies and users still running XP is likely in the hundreds of millions. Some of you reading this may have been in diapers when these companies were in their prime of development.
escape room concept:
– you are a software engineeer
– there is a production issue related to a legacy codebase
– no one knows how it works
– various credentials are scattered around the office on post-it notes
– there’s some printouts of git diffs
– you have an hour to fix this
— Walf (@walfieee) January 18, 2018
So what do you do with these legacy systems? How do you:
- Create effective monitoring and oversight
- Ensure future generations can build on and scale these systems
This is a very real problem not unique to XP users. The infrastructure you are building now may not change for years, maybe decades.
Why Internal Documentation Saves Lives
Unlike at my poor acquaintance’s company, losing someone isn’t the end of the world when you have strong internal documentation. It’s a fact of life: devs change jobs. An easy-to-reference document that evolves over time saves effort at every step.
Fortunately, we have some tips and advice you can apply to your monitoring and oversight that will help you prepare for the inevitabilities of life.
Maybe this is bringing it back to fundamentals, but support tends to run smoother when it has lots of documentation to work from. Speaking from experience, when support knows what to look for in the context of an error, your lower tiers can almost always solve problems with minimal escalation.
With Uptime.com, every check has a “Notes” field for dev (or whoever) to document fixes to apply when things go wrong. Click the Advanced tab, and then fill in the Notes field.
Building Documentation into Your Testing
It’s easy to get lost in the act of making something and forget to document how you actually made it. The obvious method is to comment your code, which should also be as human-readable as possible. Impenetrable code is difficult to scale and hurts the next team member who has to fix a problem when you’re no longer there.
However, testing something is a great opportunity to document how it works for your support team. If your support is a part of the QA process, even better. There really are bugs and features, in the sense that some things work the way they do because they are built a certain way. It’s best to minimize the number of instances where a user can willingly cause a problem, but it’s equally important to document… shall we call them “surprise features”?
“If you are ever lucky enough to work with one, you should have a very, very healthy fear of professional testers.”
Support Note Best Practices
Like in code, focus on readability. Your notes should explain the problem (perhaps by defining the status code or the general symptoms), along with direct instruction on what to do.
When this happens, do that.
Pretty simple stuff.
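To make the “when this happens, do that” shape concrete, here is a minimal sketch of a note template. The `SupportNote` class, its field names, and the example 502 scenario are all hypothetical illustrations, not part of Uptime.com; the rendered text is simply something you could paste into a plain-text notes field.

```python
from dataclasses import dataclass


@dataclass
class SupportNote:
    """One 'when this happens, do that' entry (hypothetical structure)."""
    symptom: str  # what support will see, e.g. a status code or error text
    meaning: str  # plain-language explanation of the symptom
    action: str   # the direct instruction to follow

    def render(self) -> str:
        # Keep the layout flat so it pastes cleanly into a plain-text field.
        return (
            f"WHEN: {self.symptom}\n"
            f"MEANS: {self.meaning}\n"
            f"DO: {self.action}"
        )


# Example values are invented for illustration only.
note = SupportNote(
    symptom="HTTP 502 from the checkout page",
    meaning="The proxy got an invalid response from the app server",
    action="Restart the app service; escalate if the error returns",
)
print(note.render())
```

Forcing every note through the same three fields keeps the symptom and the instruction from getting buried in backstory.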
For your notes to be effective, you need to build on our next tip.
Document Repetitive Issues
The killer had a blue jacket on at the time of the murder, the shot had a blue filter, the detectives stumble across a blue car registered to someone who lives in a blue house.
Ok so correlation doesn’t equal causation, but my money’s on Mr. Blue having a bloodied candlestick or a wrench somewhere in that house. And that’s because repeated details tend to speak to a larger picture.
Think of the performance issues and small outages that can precede a DDoS attack. When you notice that one-offs have stopped being one-offs, you have a problem, and an opportunity to squash it before it hurts your usability.
Identifying Real Issues
One of the challenges we face is determining when an issue actually occurred. We can say something failed, but what is something? Was it the connection along the way (the last mile)? Was it in some part of the infrastructure? Was some third-party script causing issues?
Identifying a real issue is very difficult and time-consuming. Wouldn’t it be great to have a living document that identifies the issue and focuses on fixes for it?
Uptime.com has two methods that help here.
The first is alert history, where you can search by objectID or name to locate all alerts for a single check. Review the list and document repeated error codes. Were these issues addressed and are they documented somewhere?
A simple audit like this is a great way to expand your publicly available documentation, and to help your team learn the internal workings of the product.
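The audit described above can be sketched in a few lines. This assumes you have already copied the alerts for your checks into a list of records by hand or by export; the `alerts` data, the field names, and the `repeated_errors` helper are hypothetical and do not reflect any Uptime.com API.

```python
from collections import Counter

# Hypothetical alert records, e.g. transcribed from an alert history view.
alerts = [
    {"check": "checkout-api", "error": "HTTP 502"},
    {"check": "checkout-api", "error": "HTTP 502"},
    {"check": "checkout-api", "error": "Timeout"},
    {"check": "checkout-api", "error": "HTTP 502"},
    {"check": "marketing-site", "error": "Timeout"},
]


def repeated_errors(alerts, check_name, min_count=2):
    """Return the errors seen at least min_count times for one check."""
    counts = Counter(a["error"] for a in alerts if a["check"] == check_name)
    return {error: n for error, n in counts.items() if n >= min_count}


# Every error listed here is a candidate for a documented fix.
print(repeated_errors(alerts, "checkout-api"))
```

Anything that shows up in the result more than once and has no written fix is a good place to start documenting.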
The second method is to review real-time analysis for a check, which offers a similar history of alerts with some other tools. You can review alert details, and view a chronological breakdown of all server outages.
A Quick Use Case
I have a server that my users report has a spotty connection, but I can’t seem to replicate the problem locally. I have to trust my users because they are paying to use my service, but pinpointing the problem is making me lose sleep. My monitoring has a sensitivity of two and I’m not seeing alerts. Before I pull my hair out, I check out real-time analysis.
There, I see that my server actually is failing but only from one or two locations consistently. I can further investigate outages in those regions, maybe my load balancer or my host is having issues.
Most importantly, I can actually see the issue for the first time. So if my servers really do crash a day or so later, I’ll have a head start on researching the issue.
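The reasoning in this use case, spotting a failure that only shows up from certain locations, can be illustrated with a rough sketch. The probe results, location names, and the 50% threshold below are invented for illustration; this is not how Uptime.com computes or exposes its analysis.

```python
from collections import defaultdict

# Hypothetical probe results as (location, succeeded) pairs.
results = [
    ("us-east", True), ("us-east", True), ("us-east", True),
    ("eu-west", False), ("eu-west", False), ("eu-west", True),
    ("ap-south", True), ("ap-south", True), ("ap-south", True),
]


def failure_rate_by_location(results):
    """Return the fraction of failed probes for each location."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for location, ok in results:
        totals[location] += 1
        if not ok:
            failures[location] += 1
    return {loc: failures[loc] / totals[loc] for loc in totals}


rates = failure_rate_by_location(results)
# Assumed cutoff: flag any location failing at least half the time.
suspect = [loc for loc, rate in rates.items() if rate >= 0.5]
print(suspect)
```

A regional breakdown like this explains how users in one area can report a spotty connection while an overall view still looks healthy.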
Tips on Creating Internal Documentation
Here are some tips to help get this right without consuming too much time.
Tag your Interrelated Components
Take the time to tag components by team or by interrelated systems. When one of them breaks, you will have a better idea of how that outage could affect other systems.
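As a sketch of why tagging pays off, the snippet below looks up every component that shares a tag with the one that just broke. The component names, tags, and the `related_components` helper are hypothetical; your own tagging scheme will differ.

```python
# Hypothetical components, each tagged by team and by related system.
component_tags = {
    "checkout-api": {"payments", "team-web"},
    "payment-gateway": {"payments"},
    "marketing-site": {"team-web"},
    "internal-wiki": {"team-it"},
}


def related_components(broken, component_tags):
    """Return the other components sharing at least one tag with `broken`."""
    tags = component_tags[broken]
    return sorted(
        name
        for name, other_tags in component_tags.items()
        if name != broken and tags & other_tags
    )


# When the payment gateway fails, which systems should we check next?
print(related_components("payment-gateway", component_tags))
```

Even a flat mapping like this gives whoever is on call a short list of systems to check before the second alert arrives.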
Choose Your Documents with Purpose
Always ask yourself: is a document here going to save us time? It’s the same question you ask of automation: if I automate this process, is it really saving me time? One consideration with documentation is that it evolves. If you choose to take on documentation, you do so with the understanding that it will grow alongside the features and applications you are building.
Allow for Collaboration
Project managers are likely the gatekeepers for this stuff, but they don’t know it all (despite what they tell themselves). Good documentation leaves the door open for team input.
Run the Gauntlet
Documentation means nothing if you can’t test the results under fire. In live incidents or gameday exercises, your observability review should cover how your team is using its existing documentation, as well as how you can meaningfully expand on it.
When Should I Start Documenting?
Yesterday was a pretty good time. I guess the day before works if you wanted to get a head start. If you’re asking this question, what you really want to know is where to start.
The answer is wherever makes the most sense. Just as companies devise FAQs to stop repeat questions from clogging the support lines, ask yourself which repeated issues documentation could fix. Start with the lowest-hanging fruit and work your way up as time allows. Before you know it, documentation will be baked into everything you do.