No matter what type of application you have, how effectively you can measure its success is dependent on a monitoring environment that takes all aspects of your business into consideration.
Businesses are usually at the mercy of technology departments when it comes to defining how to monitor the health of an application. This creates a siloed view of performance and missed insights into how the workload is actually impacting their overall business. Similarly, you can’t have only Sales, Finance, or Operations determining what defines success. Instead, a holistic approach that integrates the viewpoints of different departments yields an optimal perspective of how your application is really doing.
DEFINING YOUR KPIs
Defining which KPIs to monitor can be challenging at first. Sometimes, it’s easier to ask a consultant like Five Talent to help you define your metrics. In fact, 90% of our clients need some help choosing the right KPIs. We use the AWS Well-Architected Framework to identify monitoring and logging KPIs based on the Five Pillars.
When you’re finished with this process, you should have an average of 5-10 KPIs per department, each with separate benchmarks and monitoring alarms. When serious alerts do occur, you can then socialize them to all departments and coordinate a company-wide response. The value of your monitoring depends on it.
First, let’s take a look at examples of KPIs from different departments.
Without question, the security of your application should be first priority. For Chief Security Officers, KPIs are generally focused on monitoring two types of access – people and programmatic.
Production data access
For compliance requirements (e.g., GDPR, PCI, HIPAA), you’ll need to monitor and report who accesses your production-level data and personal information. Set alarms to alert you when someone outside your approved list is accessing secured data.
It’s more secure to store personally identifiable information (PII) in one location that is only accessed by your application. But if you discover an unknown IP address accessing your location, you’ve got a true security issue. You can use services like Amazon GuardDuty, AWS Shield, and AWS CloudTrail to track these patterns, and automatically alert you when there is a problem.
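As a rough illustration of the "approved list" idea, here is a minimal Python sketch that flags access events from an unknown principal or IP address. The event fields, the role name, and the IP addresses are all hypothetical; in practice the events would come from your audit logs (for example, AWS CloudTrail) and not from a hard-coded list.

```python
from dataclasses import dataclass

# Hypothetical approved principals and source IPs; replace with your own list.
APPROVED_PRINCIPALS = {"app-service-role"}
APPROVED_IPS = {"10.0.1.15"}


@dataclass(frozen=True)
class AccessEvent:
    principal: str  # who made the request
    source_ip: str  # where the request came from
    resource: str   # what was accessed


def unapproved_access(events):
    """Return events whose principal or source IP is not on the approved list."""
    return [
        e for e in events
        if e.principal not in APPROVED_PRINCIPALS
        or e.source_ip not in APPROVED_IPS
    ]


# Illustrative events only: the second one comes from an unknown IP address.
events = [
    AccessEvent("app-service-role", "10.0.1.15", "pii-store"),
    AccessEvent("app-service-role", "203.0.113.7", "pii-store"),
]
flagged = unapproved_access(events)
for event in flagged:
    print(f"ALERT: unapproved access to {event.resource} from {event.source_ip}")
```

Anything returned by `unapproved_access` is a candidate for an alarm and a follow-up review.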
Denial of Service (DoS) attacks
DoS attacks can take down an application instantly, affecting its reliability. If your application receives millions of requests at once, either it is scaling (and you have a cost issue) or it didn’t scale and is crashing. Your Security team needs to understand how to set limits so this doesn’t occur.
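One common way to "set limits" is a request rate limit. The sketch below is a simple sliding-window limiter in Python, shown only to make the concept concrete; the class name and the limit of 3 requests per second are assumptions for illustration, and a production system would more likely rely on a managed service or gateway feature than on hand-written code.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_requests within any window_seconds interval."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._timestamps = deque()  # times of recently accepted requests

    def allow(self, now):
        # Forget accepted requests that have aged out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_requests:
            return False  # over the limit: reject or throttle
        self._timestamps.append(now)
        return True


# Hypothetical limit: 3 requests per second for one client.
limiter = SlidingWindowLimiter(max_requests=3, window_seconds=1.0)
results = [limiter.allow(t) for t in (0.0, 0.1, 0.2, 0.3, 1.05)]
print(results)
```

The number of rejected requests is itself a useful KPI: a sudden spike suggests either an attack or a limit that is set too low for legitimate traffic.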
Finance should have pricing models in place that reflect workload revenue and costs. With scalable architecture, costs can become harder to control, forcing decisions about whether to limit architecture. AWS tools like AWS Cost Explorer can help you determine your costs and variances.
A significant increase in traffic and users drives up costs. After monitoring and analyzing these metrics, Finance may recommend migrating from a monolithic application to a serverless environment using AWS Lambda instead.
The cost to store big files like photos or years of backup data can escalate quickly. Recording and tracking them gives you visibility into trends that may need closer management.
Understanding what it costs to onboard a user informs bigger decisions about how to invest in architecture that supports your application.
Typically, the IT Department is 100% responsible for defining KPIs for an application. As stated above, we strongly recommend that ALL business departments impacted by an application are involved in defining KPIs. For IT, metrics are easy to identify, including tracking reliability and performance. Tools like Amazon CloudWatch are excellent for monitoring system-side performance changes and overall operational health for your application.
Uptime vs Downtime
Metrics such as CPU usage, memory usage, disk I/O, and disk space provide a clear picture of your workload performance.
Analyze your bandwidth usage, the latency of requests, and traffic origins to determine what improvements need to be made to your network environment.
Database query times
In order to identify problems with data retrieval, establish a baseline of query execution. For example, as your data set gets larger or your traffic pattern changes, a report that typically takes 15 seconds to run could slow to 30-45 seconds to execute. This signals an upcoming issue you can proactively address.
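To make the baseline idea concrete, here is a minimal Python sketch that compares a new query time against the median of earlier runs. The sample timings and the 1.5x tolerance are assumptions chosen to mirror the 15-second report example above, not recommended values.

```python
from statistics import median


def is_regression(baseline_samples, observed_seconds, tolerance=1.5):
    """Flag a query whose run time exceeds the baseline median by the tolerance factor."""
    baseline = median(baseline_samples)
    return observed_seconds > baseline * tolerance


# Hypothetical history for a report that normally takes about 15 seconds.
history = [14.0, 15.0, 16.0, 15.0, 15.0]

print(is_regression(history, 16.0))  # within normal variation
print(is_regression(history, 30.0))  # twice the baseline: worth investigating
```

Using the median instead of the mean keeps a single unusually slow run from distorting the baseline.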
If your team isn’t following best practices for getting updates out to production, it can affect reliability and security. Monitoring your operational process gives you key insights into areas for improvement.
How long does it take to publish a feature from the time it’s requested and approved to when it’s deployed and tested? If you have time-to-market pressures, your Ops team needs to know how fast you can respond to feature requests.
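A minimal sketch of how this lead time could be calculated, assuming you record an approval date and a deployment date for each feature. The feature names and dates below are made up for illustration.

```python
from datetime import datetime
from statistics import mean

# Hypothetical feature records; real data would come from your ticketing system.
features = [
    {"name": "export-csv", "approved": datetime(2024, 1, 2), "deployed": datetime(2024, 1, 12)},
    {"name": "sso-login", "approved": datetime(2024, 1, 5), "deployed": datetime(2024, 1, 25)},
]


def lead_time_days(feature):
    """Days between approval and deployment for one feature."""
    return (feature["deployed"] - feature["approved"]).days


average_lead_time = mean(lead_time_days(f) for f in features)
print(f"Average lead time: {average_lead_time} days")
```

Tracking this average over time shows whether your release process is speeding up or slowing down.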
Performance & Usability (internal and external)
How are users experiencing your application? With Amazon CloudWatch and third-party tools from companies like New Relic, you can monitor performance metrics in each of the layers of an application and determine what testing is needed to improve.
Review your backlog frequently with your team to identify slower development or tasks needing clearer definition.
ESTABLISHING BASELINES & ALERTS
After you’ve determined what KPIs you’ll be tracking across departments for your workload, you can start monitoring and analyzing logs to decide on benchmarks. Establishing baselines allows you to evaluate performance thresholds and automate alerts for easier tracking and response.
We recommend companies set up one main dashboard, for example in Amazon CloudWatch, that represents their most critical KPIs with a simple green – yellow – red monitoring approach for alerts. This provides an overall picture of application health. Individual departments can use the same process to delve more deeply into their own KPIs. In addition, we use a third-party tool called BlazeMeter for load testing to see how applications scale.
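The green, yellow, and red approach can be sketched in a few lines of Python. The KPI names, values, and thresholds below are hypothetical, and the sketch assumes that a higher value is always worse; a KPI where lower is worse (such as uptime) would need the comparison reversed.

```python
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def kpi_status(value, warn, critical):
    """Map a KPI value to green, yellow, or red (higher values are worse)."""
    if value >= critical:
        return "red"
    if value >= warn:
        return "yellow"
    return "green"


def overall_status(statuses):
    """The dashboard is only as healthy as its worst KPI."""
    return max(statuses, key=SEVERITY.get)


# Hypothetical KPIs: (current value, warning threshold, critical threshold).
kpis = {
    "cpu_percent": (45, 70, 90),
    "p95_latency_ms": (850, 500, 1000),
    "error_rate_percent": (0.2, 1, 5),
}

statuses = {name: kpi_status(*values) for name, values in kpis.items()}
print(statuses)
print("Overall:", overall_status(statuses.values()))
```

Here a single yellow KPI turns the overall picture yellow, which is the behavior most teams want from a top-level dashboard.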
Next, establish your service level agreements (SLAs) to outline response times to issues. What may be acceptable to the IT Department may not be acceptable to Operations or Sales. Again, this underscores the importance of building a monitoring environment based on feedback from all business departments.
Set Permissions, Establish Ownership
Now that you have your KPIs and your baselines, clearly outline who has permission and responsibility for each monitoring task. We use the RACI Matrix (Responsible – Accountable – Consulted – Informed) as a framework for determining roles and responsibilities, making sure duties are segregated with checks and balances in place. This becomes even more essential when you have multiple outside vendors to manage.
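A RACI matrix is easy to keep as structured data and check automatically. The Python sketch below assumes two common conventions, exactly one Accountable party and at least one Responsible party per task; the task and role names are hypothetical.

```python
# Hypothetical RACI matrix: task -> {role: "R", "A", "C", or "I"}.
raci = {
    "database query alarms": {"IT": "R", "CTO": "A", "Operations": "C", "Security": "I"},
    "production data access review": {"Security": "R", "Finance": "I"},
}


def raci_problems(matrix):
    """Return a description of each task that breaks the assumed RACI conventions."""
    problems = []
    for task, roles in matrix.items():
        codes = list(roles.values())
        if codes.count("A") != 1:
            problems.append(f"{task}: needs exactly one Accountable party")
        if codes.count("R") < 1:
            problems.append(f"{task}: needs at least one Responsible party")
    return problems


for problem in raci_problems(raci):
    print(problem)
```

Running a check like this whenever the matrix changes helps catch monitoring tasks that nobody is accountable for.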
Document Your Playbook
Last, having documentation in place is a discipline that has to be taught and followed. If a problem occurs and no one documents how it was fixed, everyone has to learn it for the first time all over again. At Five Talent, we use a wiki that’s searchable and easy to update.
Monitoring is an ongoing process. Business changes all the time, so take the time for an internal review of your KPIs at least once a quarter to make sure they’re still relevant to your application. Most of all, your KPIs should be prioritized with input from all departments as to what matters most to your business. Carefully watching over your application post-deployment helps ensure the investment in building it was worth it.