“An insecure Service is neither reliable nor safe”
Let's explore this statement in a little more detail...
We can summarise the role of the SRE as 'ensuring the smooth operation of a Service for its users.' Of course, this ten-word simplification is not intended to detract from the fact that there are tools, processes and working practices, statistical analysis, and cultural aspects to running an effective SRE function within an organisation.
Site Reliability Engineers help to establish observability platforms, ensuring that there is monitoring of the critical user journeys, resources, and related infrastructure; support the implementation of alerting and a practical & sustainable on-call process for incidents; and contribute to blameless post-mortems and help foster a culture of continual feedback, learning, and improvement.
In addition, they lead best-practice release strategies that allow for the safe rollout of features, such as canarying or blue-green deployments; create and promote objective, data-driven guardrails that enable the fast flow of business value until committed parameters of user expectations are breached (namely via SLOs and error budgets); ensure that the architecture takes into account demand forecasts, scalability and elasticity, resiliency, and durability; and run Service Risk assessments to identify engineering tasks that mitigate the risks most likely to affect the committed SLOs. This is only a summary: it's a wide, heavily engineering-focussed domain with a lot of reference content.
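To make the SLO and error budget guardrail a little more concrete, here is a minimal sketch of the arithmetic. The function names, the request-based SLO, and the simple "budget left means releases continue" rule are illustrative assumptions, not a prescribed policy.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent (negative if overspent).

    slo_target is the committed success ratio, e.g. 0.999 for "99.9% of requests succeed".
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        return 1.0  # no traffic in the window, so nothing has been spent

    # The error budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


def release_allowed(slo_target: float, total_requests: int, failed_requests: int) -> bool:
    """Hypothetical guardrail: keep shipping only while some budget remains."""
    return error_budget_remaining(slo_target, total_requests, failed_requests) > 0.0
```

With a 99.9% target and one million requests in the window, the budget is roughly 1,000 failed requests; 250 failures leaves about three quarters of it, while 1,500 failures overspends it and the guardrail trips.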
At this point, keen eyes will have noticed that we've not mentioned "security."
Many Services are entrusted with sensitive user data, such as personally identifiable, demographic, or financial information. Users have trusted your organisation with their data, and while they expect Services to be "available," they also expect their data to be safe, with data privacy concerns paramount. Granted, I am drawing a parallel between Security, Data Privacy, and Trust to show the intersection with "reliability." That simplification is helpful in this context, but it is not meant to diminish the far wider breadth of these domains.
A Reliable Service is not only one that is available and responsive (the common SLIs being based on error rate and latency); its responses also need to be accurate, correctly formatted, and fresh. To elaborate, would you say that your Service is reliable if it returns yesterday's bank balance? This could happen, for example, because of an as-yet undetected stale in-memory cache caused by a TTL configuration error in production.
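Freshness can be measured like any other SLI. The sketch below assumes, hypothetically, that each response can be logged with two timestamps: when it was served and the point in time its data reflects. Neither field nor the function name comes from a particular monitoring product.

```python
from datetime import datetime, timedelta, timezone
from typing import Sequence, Tuple


def freshness_sli(
    responses: Sequence[Tuple[datetime, datetime]],
    max_age: timedelta,
) -> float:
    """Fraction of responses whose data was no older than max_age when served.

    Each response is a (served_at, data_as_of) pair.
    """
    if not responses:
        return 1.0  # no responses in the window, so nothing was stale

    fresh = sum(
        1 for served_at, data_as_of in responses
        if served_at - data_as_of <= max_age
    )
    return fresh / len(responses)
```

A stale cache like the one described above would show up as a drop in this ratio even while error rate and latency look perfectly healthy.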
While a European user may quickly recognise a USA date-formatting issue and determine that 8/14 means 14th August, they are less likely to detect an issue if they are presented with 8/6. We mean 8th June, correct? Does your organisation consider its Service to be operable if dates are incorrectly formatted and cannot be interpreted by your users? What if this was the date on which you expected payment of an invoice, and you didn't get paid until 6th August? (As a tangent: let's all agree that months should be displayed in full text from here on.)
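As a small illustration of that tangent, here is one way to render a date with the month spelled out. The helper name and the hard-coded English month names are assumptions made to keep the sketch independent of the server's locale settings.

```python
from datetime import date

# Hard-coded so the output does not depend on the process locale.
MONTH_NAMES = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def format_unambiguous(d: date) -> str:
    """Render a date as day, full month name, year."""
    return f"{d.day} {MONTH_NAMES[d.month - 1]} {d.year}"
```

Here `format_unambiguous(date(2023, 6, 8))` gives "8 June 2023", which reads the same way on either side of the Atlantic.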
A more trivial example, perhaps: how important is it if the Family Name is displayed as the Given Name? What if the Previous Address is used as the Current Address? What if this means your organisation ends up sending customer mail to their old residence?
Finally, consider how you would feel about a Service that presents the details of another user. Now think about how you would feel if this was actually due to a SQL injection security flaw in the code.
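To show how such a flaw can expose another user's details, the sketch below contrasts a query built by string interpolation with a parameterised one. It uses Python's built-in sqlite3 module; the accounts table, its columns, and the function names are invented for the example.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (user_id TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)",
        [("alice", 100), ("bob", 250)],
    )
    return conn


def balances_unsafe(conn: sqlite3.Connection, user_id: str) -> list:
    # Vulnerable: user input is pasted straight into the SQL text,
    # so a crafted value can change the meaning of the query.
    query = f"SELECT user_id, balance FROM accounts WHERE user_id = '{user_id}'"
    return conn.execute(query).fetchall()


def balances_safe(conn: sqlite3.Connection, user_id: str) -> list:
    # Parameterised: the driver treats user_id purely as a value.
    query = "SELECT user_id, balance FROM accounts WHERE user_id = ?"
    return conn.execute(query, (user_id,)).fetchall()
```

Passing the input `alice' OR '1'='1` to the unsafe version returns every account in the table, while the safe version simply finds no user with that odd name and returns nothing.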
By narrower definitions, systems exhibiting the above traits may still be considered reliable. Do you agree or disagree? (As a tangent: think about how you could monitor for such issues, which are often caused by misconfigurations and may therefore pass through traditional functional testing cycles.)
Although I have drawn primarily on Data Privacy, the intersection between Reliability and Security should now be coming into focus. If a Service is breached, a likely response will be to take the Service offline. Likewise, if a Service has a critical vulnerability, a decision may be made to take it down until the risk has been mitigated, servers patched, or a new release rolled out.
Another example of that intersection arises during Threat Modelling, when we may explore the risks of a Distributed Denial of Service attack. Typically, we discuss how to architect a service to mitigate such an event caused by a malicious external actor. However, the Threat Model should also explore the risk of an accidental, self-inflicted Denial of Service that is inherent in our own service design. Perhaps we have engineered our solution to include a retry mechanism when it encounters a failure (such as reading a database or calling a downstream service); without exponential backoff, those retries may overload internal services and trigger a cascading failure.
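A retry mechanism with exponential backoff might look something like the sketch below. The function name, default values, and the use of random "full jitter" are illustrative assumptions rather than a recommendation for any particular service; the `sleep` and `rng` parameters exist only so the behaviour can be tested without real waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call operation, retrying on failure with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # The delay ceiling doubles with every failed attempt, up to max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Jitter spreads retries out so clients do not all retry in lockstep.
            sleep(rng() * ceiling)
    raise ValueError("max_attempts must be at least 1")
```

Capping both the number of attempts and the delay matters as much as the doubling: unbounded retries are exactly the self-inflicted load that the Threat Model should be looking for.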
In summary, is a Service that is offline reliable? No.
Does that matter if it was due to an infrastructure outage, lack of resiliency, or an unpatched security vulnerability? Not really.
Should your SREs be interested in Security threats as a risk to Service? Yes.
To conclude, here's a teaser for how designing for Security and Reliability intersects and builds on the quote: everything should be made as simple as possible, but no simpler.
"Keeping system design as simple as possible is one of the best ways to improve your ability to assess both the reliability and the security of a system. A simpler design reduces the attack surface, decreases the potential for unanticipated system interactions, and makes it easier for humans to comprehend and reason about the system. Understandability is especially valuable during emergencies when it can help responders mitigate symptoms quickly and reduce mean time to repair (MTTR). [Ref 1] "
Want to learn more about Site Reliability Engineering?
On Wednesday, 27th September, Axiologik is hosting a technical round table devoted to architecting for resiliency and responding to service issues.
What will be the key takeaways?
- The latest techniques in building reliable services in today's complex world
- Insights into the benefits of SRE for future projects
- How SRE practices drive effective responses to service-impacting events
- Exchange of ideas with like-minded others
Whether you're passionate about SRE or a Founder or Senior Leader, this round table is the perfect opportunity to share your knowledge and experience while picking up some top tips.
Link to sign up: https://www.eventbrite.co.uk/e/help-my-services-are-on-fire-tickets-685530790047
References
[1] Adkins, Heather; et al. Building Secure and Reliable Systems. O'Reilly Media. 2020.