In Part 1 of the series we looked in detail at what High Availability can actually mean. In this part we look specifically at some of the metrics and tools used in practice to determine a system's availability.
Let's take a look at some example metrics, both genuinely useful and of limited usefulness, for determining availability.
Service Response Time
The time taken for a given service to complete a request and return the response to the requester is a fundamental service availability metric. This class of metrics forms a very widespread and diverse, but also essential, group.
This metric could cover something well understood and often measured, for example the amount of time a DNS service takes to respond to a DNS lookup. A more complex example could be the amount of time a sales order takes to be acknowledged, or some other business-specific process. These different types of service response metrics will generally require different approaches and tools to monitor. A metric like DNS response time is now pretty much 'industry standard' and you'll find it built in to many tools. Another widely deployed 'standard' service metric is HTTP request response time.
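As a minimal sketch of the idea, DNS response time can be sampled with nothing more than the standard library. Note this uses the OS resolver, so local caching will affect the figure; dedicated monitoring tools query DNS servers directly for more precise numbers.

```python
import socket
import time

def dns_response_time(hostname):
    """Time a single name resolution, returning elapsed seconds.

    Goes through the OS resolver via getaddrinfo, so any local
    caching is included in the measurement.
    """
    start = time.perf_counter()
    socket.getaddrinfo(hostname, None)
    return time.perf_counter() - start

# Example: elapsed = dns_response_time("example.com")
```

A real monitoring tool would run this on a schedule, record the samples, and alert on sustained deviation rather than a single slow lookup.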
Service Volume Metrics
The activity levels of services can also be instrumental in determining the overall health, and thus availability, of your system. The number of HTTP requests, 404 errors, concurrent users and database transactions are some examples of service volume metrics.
How do service volume metrics help? Once a baseline is established (the volume a given metric is expected to show at any given time), these metrics can be used to spot service degradation or other anomalies.
If at 6:30PM on a Friday your take-out food delivery e-commerce platform usually has 65,000 concurrent users but is currently only hitting 4,500, this may be a clear indicator of a problem. If a backend database usually handles around 3,500 transactions per second at a given time but is currently recording 15,000 per second with no increase in front-end traffic, then something is clearly amiss.
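The baseline comparison described above boils down to a threshold check. A minimal sketch, where the deviation factor of 3 is an arbitrary illustration rather than a recommended value:

```python
def is_anomalous(current, baseline, factor=3.0):
    """Flag a reading that deviates from its baseline by more than
    `factor` in either direction (factor chosen for illustration)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return current > baseline * factor or current < baseline / factor

# 4,500 users against a Friday-evening baseline of 65,000
print(is_anomalous(4500, 65000))   # True
# 15,000 TPS against a baseline of 3,500
print(is_anomalous(15000, 3500))   # True
```

Production systems typically maintain per-time-slot baselines (weekday evening vs. weekend morning) and use statistical bands rather than a fixed factor, but the principle is the same.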
Service volume metrics can be used as part of availability metrics themselves, for informational and alerting purposes as early signs of trouble, or any combination of the above.
Network Metrics
With all the focus on compute and service metrics, network metrics still serve vital functions. They are often needed to understand how the end-user experience is being affected.
Once upon a time it seemed as if ping was the only metric anybody used to determine anything, from network performance to whether a server was 'up'. Today ping is of limited usefulness, as the response time figure is heavily influenced by network infrastructure and OS handling; the round-trip number cannot tell you much on its own. In 2017, ping absolutely shouldn't be used to determine whether a server or other resource is up.
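A better 'is it up?' check probes the actual service port rather than sending an ICMP echo. A minimal sketch of a TCP connect check:

```python
import socket

def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds.

    A host can answer ping while the service on it is dead;
    connecting to the service port answers the question that
    actually matters.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Even this only proves the port accepts connections; a full health check would issue a real request (an HTTP GET, a DNS query) and validate the response, which is exactly what the monitoring tools discussed below do.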
Jitter is the measure of variability in network packet delivery time. Many realtime applications, such as VOIP, are sensitive to jitter, so awareness of this metric can be useful in some cases.
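One simple way to quantify jitter is the mean absolute difference between consecutive packet delays. This is a simplified sketch; RFC 3550 (RTP) specifies a smoothed variant of the same idea for realtime media.

```python
def jitter(delays_ms):
    """Mean absolute difference between consecutive packet delays.

    A simplified jitter figure: steady delivery gives a low value,
    erratic delivery a high one.
    """
    if len(delays_ms) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(delays_ms, delays_ms[1:])]
    return sum(diffs) / len(diffs)

print(jitter([10.0, 12.0, 11.0]))  # 1.5
```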
Latency is the time it takes for a response to be received after a request is made. Commonly network latency (the time it takes for a packet to be sent and a response to be received) is discussed. However, many different types of latency can be measured to provide better insight into system availability, for example service response latency such as the time it takes for a new account to be opened or an API request to be returned.
There are numerous tools for service metric monitoring. They fall broadly into two categories, although many products incorporate both.
Infrastructure Performance Monitoring
Traditionally these types of tools collected metrics on everything from the humble ping to disk space and disk queue lengths. Infrastructure performance tools encompass the types of 'monitoring systems' that are considered 'traditional'; examples include Nagios and PRTG.
In the AWS space, CloudWatch offers significant monitoring capabilities 'out of the box', and with the ability to add custom metrics it's also possible to take it, on a basic level, into Application Performance Monitoring territory.
Application Performance Monitoring
Application Performance Monitoring, as the name suggests, focuses on the workings of applications themselves. At a high level this can be application response times and the like. However, the heart of APM is gaining insight into the inner workings of the application in real (or near-real) time. APM grants access to metrics from within applications themselves, sometimes even down to how long a given piece of code takes to execute.
NewRelic is probably the best known in this area, but many other products, such as DataDog and ManageEngine, include advanced APM capabilities.
As already mentioned it's important to keep in mind that many traditional infrastructure performance monitoring tools now incorporate significant APM capabilities in their own right.
Outside The Cloud
Monitoring of elements outside of the cloud including on-premise resources and connectivity is essential to achieving the desired level of service availability. Many of the monitoring tools mentioned are capable of monitoring on-premise resources including network connectivity.
In order to ensure that your solution is able to meet its defined availability targets a comprehensive range of metrics should be monitored. These cannot be limited to 'infrastructure' metrics, traditional or otherwise. Metrics monitored must provide coverage of the entire stack as well as extending out to user endpoints and (where in scope) transit links.