Do you rely on your users to tell you about a service outage?
By Head of Engineering, David Sugden
Have you ever been in the situation where your users call in to report an outage with your services? Did you wonder how long your service was down before you got called?
If you answered ‘yes’ then you may want to start tracking TTD (Time to Detect/Discover), which reflects how long the problem existed before you became aware of it. Trending MTTD (the mean TTD across Incidents) over time is a common KPI for Incident management, and provides an objective insight into the effectiveness of your production operations.
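To make the arithmetic concrete, here is a minimal sketch of trending MTTD month by month. The function name `mttd_by_month`, the tuple layout, and the sample timestamps are all illustrative assumptions, not data from a real system; in practice the start and detection times would come from your post-mortem records.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def mttd_by_month(incidents):
    """Return {"YYYY-MM": mean TTD in minutes} for (started_at, detected_at) pairs.

    Incidents are grouped by the month in which they started (an assumption;
    you could equally group by detection date).
    """
    ttd_minutes = defaultdict(list)
    for started_at, detected_at in incidents:
        if detected_at < started_at:
            raise ValueError("detected_at must not precede started_at")
        month = started_at.strftime("%Y-%m")
        ttd_minutes[month].append((detected_at - started_at).total_seconds() / 60)
    return {month: mean(values) for month, values in sorted(ttd_minutes.items())}


# Hypothetical sample data: (when the Incident started, when you became aware of it)
incidents = [
    (datetime(2023, 8, 3, 9, 0), datetime(2023, 8, 3, 9, 40)),     # TTD 40 min
    (datetime(2023, 8, 20, 16, 10), datetime(2023, 8, 20, 16, 30)),  # TTD 20 min
    (datetime(2023, 9, 1, 10, 0), datetime(2023, 9, 1, 10, 12)),   # TTD 12 min
    (datetime(2023, 9, 8, 14, 30), datetime(2023, 9, 8, 15, 0)),   # TTD 30 min
    (datetime(2023, 9, 15, 2, 5), datetime(2023, 9, 15, 2, 23)),   # TTD 18 min
]

trend = mttd_by_month(incidents)
print(trend)  # {'2023-08': 30.0, '2023-09': 20.0}
```

A falling number from one month to the next suggests detection is improving, though with only a handful of Incidents per month a single outlier can move the mean a lot.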
A couple of techniques for driving down MTTD.
Embedding continual improvement can begin by holding blameless post-mortems after every user-impacting Incident. Here you will capture the timeline of events, including when the Incident actually started, not only when you became aware of it. You should even wind the timeline back to events that led to the Incident, such as a change. For every post-mortem, collect some key metrics: TTD (Time to Detect), TTR (Time to Recover/Resolve), how long it took to get the Incident to the correct engineer (i.e. someone who actually had the knowledge and access to fix the issue), how many hand-offs there were along the way, and so on.
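One way to keep these post-mortem metrics consistent is to record the timeline once and derive the numbers from it. The sketch below is only an illustration: the `IncidentTimeline` class, its field names, and the sample times are hypothetical, and it assumes TTR is measured from the start of the Incident (some teams measure it from detection instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class IncidentTimeline:
    """Key timestamps captured in a post-mortem (hypothetical field names)."""
    started_at: datetime           # when the Incident actually began
    detected_at: datetime          # when the team became aware of it
    correct_engineer_at: datetime  # when it reached someone able to fix it
    resolved_at: datetime          # when service was restored
    handoffs: int                  # number of hand-offs along the way

    @property
    def ttd(self) -> timedelta:
        """Time to Detect: start of the Incident to awareness."""
        return self.detected_at - self.started_at

    @property
    def time_to_correct_engineer(self) -> timedelta:
        """Awareness to arrival with the correct engineer."""
        return self.correct_engineer_at - self.detected_at

    @property
    def ttr(self) -> timedelta:
        """Time to Recover, assumed here to run from the start of the Incident."""
        return self.resolved_at - self.started_at


# Hypothetical example Incident
incident = IncidentTimeline(
    started_at=datetime(2023, 9, 1, 10, 0),
    detected_at=datetime(2023, 9, 1, 10, 12),
    correct_engineer_at=datetime(2023, 9, 1, 10, 40),
    resolved_at=datetime(2023, 9, 1, 11, 30),
    handoffs=2,
)

print(incident.ttd)                       # 0:12:00
print(incident.time_to_correct_engineer)  # 0:28:00
print(incident.ttr)                       # 1:30:00
```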
As part of your post-Incident review meeting, it’s crucial to identify root causes and add actions to the Problem backlog that will minimise the likelihood of a recurrence. The team should also use this as an opportunity to review the method of detection. I would always recommend talking about this topic, even if the outage was detected through internal monitoring, because every Incident provides an opportunity to learn, improve, and tune the dials. As a minimum, though, whenever you relied on your users to call in and advise you of an outage, this should be a trigger for the team to follow up and review the effectiveness of the observability, monitoring, and alerting.
Talking of which, let’s briefly consider the service monitoring. How have you decided what to monitor and when to alert on a possible outage? Do you understand the possible risks of service failure, and do you have them covered? Take a look at your critical user journeys and map them across business transactions. To what extent do you have coverage for these through synthetic monitoring? Synthetics can give you an early insight into issues before users encounter them. Hook these checks into an alerting mechanism, taking into consideration frequency and duration to ensure all alerts are actionable, and you can have your engineers investigating before any users call it in. Double points if you can update your Service Status Page automatically when you declare your internal Incident.
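As a rough illustration of the frequency-and-duration point, the sketch below alerts only after a synthetic check has failed several times in a row, so a single blip does not page anyone. Everything here is an assumption for the sake of the example: the `probe` helper, the `FailureStreakAlert` class, and the threshold of three are hypothetical, and a real setup would normally use your monitoring tool's own synthetic checks rather than hand-rolled code.

```python
import urllib.request


def probe(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if a GET to `url` answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except OSError:  # covers URLError, HTTPError (4xx/5xx) and socket timeouts
        return False


class FailureStreakAlert:
    """Fire once when a check has failed `threshold` times in a row."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.streak = 0

    def record(self, ok: bool) -> bool:
        """Record one check result; return True only when the alert should fire."""
        if ok:
            self.streak = 0  # any success resets the streak
            return False
        self.streak += 1
        # Fire exactly once, at the moment the threshold is reached
        return self.streak == self.threshold


# Simulated results from a check run at a fixed interval (no network needed here).
# In a scheduler you would call alert.record(probe("https://example.com/health")),
# where the URL is a placeholder for one of your critical user journeys.
alert = FailureStreakAlert(threshold=3)
results = [True, False, True, False, False, False, False]
fired = [alert.record(ok) for ok in results]
print(fired)  # [False, False, False, False, False, True, False]
```

With a check every minute and a threshold of three, the worst-case contribution to TTD from this rule is roughly three minutes; tightening the interval or the threshold trades faster detection against noisier alerts.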
These are just a couple of recommendations for improving your Mean Time to Detect over time, and are not intended to be an exhaustive set.
Want to learn more about Site Reliability Engineering?
On Wednesday, 27th September we are hosting a technical round table devoted to architecting for resiliency and responding to service issues.
What will be the key takeaways?
- The latest techniques in building reliable services in today’s complex world
- Insights into the benefits of SRE for future projects
- How SRE practices drive effective responses to service-impacting events
- Exchange of ideas with like-minded others
Whether you're passionate about SRE or a Founder or Senior Leader, this round table is the perfect opportunity to share your knowledge and experience while picking up some top tips.
Link to sign up: https://www.eventbrite.co.uk/e/help-my-services-are-on-fire-tickets-685530790047