What is Resilient Architecture?
Resilience is the quality of being able to take a hit without damage or, in software terms, how well a system recovers from failure. Most of my study and effort so far has gone into preventing failure in the first place rather than into recovering from it. To call myself an ‘architect’ of resilient software would therefore be an overstatement; I am an enthusiast for it. I will share what I know about resilience and how I have contributed to it in applications, but as for my own position, it must suffice to say that I recognize the necessity of resilience, because failure in some form is inevitable despite all efforts, and that I am eager to learn more about resilient architecture.
It seems to me that the most essential aspects of resilience are anticipating failure, detecting failure, accommodating failure, and correcting failure.

Architecting for reliability aims to prevent failure outright; anticipating failure serves that aim by letting us apply solutions for the kinds of failure we expect, but it also advances resilience by letting us measure and plan for the varieties of failure we cannot prevent. Failures worth anticipating include network failure, out-of-memory failure, a specific service breaking because a new, broken version was pushed, an overburdened service, a security incident, and accidentally granting a junior developer credentials so broad that they drop the production database on their first day.

The aims of failure detection, reaction, and correction are all to minimize the cost of failure. In the ideal case the process can be fully automated, with detection triggering a response that corrects the failure at no cost. An example of this is load balancing combined with auto-scaling of services: an algorithm detects when the load on a service is too high, or is suddenly peaking, and triggers spinning up new instances to which traffic is directed (I sketch what such a check might look like below). Lambda functions work this way by default, and so seamlessly that the risk of a traffic spike is not unreliability but an excessive cloud bill. A less extreme example would be wiring alerts from our frontend logging system (e.g. Sentry) to an algorithm that automatically triggers a rollback if the alerts spike shortly after a release (also sketched below); the cost in that case is a brief period of increased failures for end users and a delay to the released features, but the panic of fixing a production issue is avoided.

By accommodating failure I mean building reactions to failure states into the normal functioning of the application, separate from notifying about and correcting the issue. For example, determining which features can keep working when certain other components are non-functional, as with a local-first application that is meant to retain its utility without networking; the last sketch below shows one small version of this. This also includes informing the user of the failure state so they are not excessively surprised and frustrated.

Finally, correction refers to the effort and time needed to restore functionality. In the case of the dropped production database, such a situation should ideally have been anticipated and the correction process outlined and rehearsed; the actual failure event should not be the moment a developer googles “how to restore PostgreSQL from backups”. Each type of failure has its own methods for detection, accommodation (where possible), and correction, so I won’t go into more detail now.
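To make the auto-scaling idea a little more concrete, here is a minimal sketch of the kind of check a managed autoscaler runs for you behind the scenes. The helpers getAverageCpu, currentInstanceCount, and scaleTo are assumptions standing in for whatever metrics and scaling APIs a given platform exposes, and the thresholds are placeholders; this illustrates the detection-plus-reaction loop, not a real integration.

```typescript
// Hypothetical sketch of a naive scale-up check. All helper names are
// assumed placeholders for your platform's metrics and scaling APIs.

const HIGH_CPU_THRESHOLD = 0.75; // scale up when average CPU exceeds 75%
const MAX_INSTANCES = 20;        // cap instances to keep the cloud bill bounded

async function checkAndScale(
  getAverageCpu: () => Promise<number>,
  currentInstanceCount: () => Promise<number>,
  scaleTo: (n: number) => Promise<void>,
): Promise<void> {
  const cpu = await getAverageCpu();
  const count = await currentInstanceCount();

  if (cpu > HIGH_CPU_THRESHOLD && count < MAX_INSTANCES) {
    // Detection and reaction are fused: high load triggers a new instance
    // before users notice degraded response times.
    await scaleTo(count + 1);
  }
}
```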
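The release-watching idea could look something like the sketch below. The names errorCountSince and triggerRollback are assumptions standing in for whatever the error tracker (e.g. Sentry) and the deployment pipeline actually expose, and the window and threshold are values you would tune for your own traffic.

```typescript
// Hypothetical sketch: roll back automatically if frontend errors spike
// shortly after a release. The callbacks stand in for the error tracker
// and deploy pipeline integrations, which vary by setup.

const RELEASE_WATCH_WINDOW_MS = 15 * 60 * 1000; // watch the first 15 minutes
const ERROR_SPIKE_THRESHOLD = 50;                // error count that counts as a spike

async function watchRelease(
  releasedAt: Date,
  errorCountSince: (since: Date) => Promise<number>,
  triggerRollback: () => Promise<void>,
): Promise<void> {
  const withinWindow = Date.now() - releasedAt.getTime() < RELEASE_WATCH_WINDOW_MS;
  if (!withinWindow) return;

  const errors = await errorCountSince(releasedAt);
  if (errors > ERROR_SPIKE_THRESHOLD) {
    // The cost is a short period of errors plus delayed features;
    // the panic of debugging in production is avoided.
    await triggerRollback();
  }
}
```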
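And here is one small, hypothetical version of accommodating a network failure in a local-first style: the feature falls back to locally cached data and tells the user why, instead of surfacing a blank screen. The loadFromLocalCache and notifyUser helpers are assumed, as is the /api/notes endpoint.

```typescript
// Hypothetical sketch of graceful degradation: keep the feature working in a
// reduced form when the network is unavailable, and tell the user why.

interface Note {
  id: string;
  body: string;
}

async function loadNotes(
  loadFromLocalCache: () => Promise<Note[]>,
  notifyUser: (message: string) => void,
): Promise<Note[]> {
  try {
    const response = await fetch("/api/notes");
    if (!response.ok) throw new Error(`HTTP ${response.status}`);
    return (await response.json()) as Note[];
  } catch {
    // The failure is accommodated rather than surfaced as a broken screen:
    // stale local data is shown and the user is informed of the degraded state.
    notifyUser("You appear to be offline; showing your locally saved notes.");
    return loadFromLocalCache();
  }
}
```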