
Key Performance Indicators

1.0 Understanding KPIs ("Golden Signals")

Organizations use key performance indicators (KPIs), also known as 'Golden Signals', to gain insight into the health and risk of their business and operations. Different parts of an organization have unique KPIs that measure their respective outcomes. For example, the product team of an eCommerce application might track the ability to process cart orders successfully as its KPI. An on-call operations team might use mean time to detect (MTTD) an incident as its KPI. For the finance team, keeping the cost of resources under budget is an important KPI.

Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) are essential components of service reliability management. This guide outlines best practices for using Amazon CloudWatch and its features to calculate and monitor SLIs, SLOs, and SLAs, with clear and concise examples.

  • SLI (Service Level Indicator): A quantitative measure of a service's performance.
  • SLO (Service Level Objective): The target value for an SLI, representing the desired performance level.
  • SLA (Service Level Agreement): A contract between a service provider and its users specifying the expected level of service.

Examples of common SLIs:

  • Availability: Percentage of time a service is operational
  • Latency: Time taken to fulfill a request
  • Error rate: Percentage of failed requests

2.0 Discover customer and stakeholder requirements (using the template suggested below)

  1. Start with the top question: “What is the business value or business problem in scope for the given workload?” (e.g., payment portal, eCommerce order placement, user registration, data reports, support portal)
  2. Break down the business value into categories such as User Experience (UX), Business Experience (BX), Operational Experience (OpsX), Security Experience (SecX), and Developer Experience (DevX)
  3. Derive core signals, aka “Golden Signals”, for each category; the top signals around UX and BX will typically constitute the business metrics

| ID | Initials | Customer | Business Needs | Measurements | Information Sources | What does good look like? | Alerts | Dashboards | Reports |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| M1 | Example | External End User | User Experience | Response time (Page latency) | Logs / Traces | < 5s for 99.9% | No | Yes | No |
| M2 | Example | Business | Availability | Successful RPS (Requests per second) | Health Check | > 85% in 5 min window | Yes | Yes | Yes |
| M3 | Example | Security | Compliance | Critical non-compliant resources | Config data | < 10 under 15 days | No | Yes | Yes |
| M4 | Example | Developers | Agility | Deployment time | Deployment logs | Always < 10 min | Yes | No | Yes |
| M5 | Example | Operators | Capacity | Queue Depth | App logs/metrics | Always < 10 | Yes | Yes | Yes |

2.1 Golden Signals

| Category | Signal | Notes | References |
| --- | --- | --- | --- |
| UX | Performance (Latency) | See M1 in template | Whitepaper: Availability and Beyond (Measuring latency) |
| BX | Availability | See M2 in template | Whitepaper: Availability and Beyond (Measuring availability) |
| BX | Business Continuity Plan (BCP) | AWS Resilience Hub (ARH) resilience score against defined RTO/RPO | Docs: ARH user guide (Understanding resilience scores) |
| SecX | (Non-)Compliance | See M3 in template | Docs: AWS Control Tower user guide (Compliance status in the console) |
| DevX | Agility | See M4 in template | Docs: DevOps Monitoring Dashboard on AWS (DevOps metrics list) |
| OpsX | Capacity (Quotas) | See M5 in template | Docs: Amazon CloudWatch user guide (Visualizing your service quotas and setting alarms) |
| OpsX | Budget Anomalies | | Docs: AWS Billing and Cost Management (AWS Cost Anomaly Detection); AWS Budgets |

3.0 Top Level Guidance ‘TLG’

3.1 TLG General

  1. Work with business, architecture, and security teams to refine the business, compliance, and governance requirements and ensure they accurately reflect the business needs. This includes establishing recovery-time and recovery-point targets (RTOs, RPOs). Formulate methods to measure these requirements, such as availability and latency (for example, an uptime target could allow a small percentage of faults over a 5-minute window).

  2. Build an effective tagging strategy with a purpose-built schema that aligns to various business functional outcomes. This should especially cover operational observability and incident management.

  3. Where possible, leverage dynamic thresholds for alarms (especially for metrics that do not have baseline KPIs) using CloudWatch anomaly detection, which uses machine learning algorithms to establish baselines. When using AWS services that publish CloudWatch metrics (or other sources such as Prometheus metrics) to configure alarms, consider creating composite alarms to reduce alarm noise. Example: a composite alarm that combines a business metric indicative of availability (tracked by successful requests) with a latency metric, configured to alarm only when both breach a critical threshold during deployments, can be a deterministic indicator of a deployment bug (see the sketch below).
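
A minimal boto3 sketch of this composite-alarm pattern follows; the alarm names, custom namespace (MyApp/Business), metric names, thresholds, and SNS topic ARN are illustrative assumptions rather than prescribed values:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Child alarm 1: business availability tracked by successful requests (hypothetical custom metric).
cloudwatch.put_metric_alarm(
    AlarmName="orders-success-rate-low",
    Namespace="MyApp/Business",            # assumed custom namespace
    MetricName="SuccessfulOrders",
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=100,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",          # no data from the business flow is treated as a problem
)

# Child alarm 2: p99 latency on the same business flow (hypothetical custom metric, seconds).
cloudwatch.put_metric_alarm(
    AlarmName="orders-latency-high",
    Namespace="MyApp/Business",
    MetricName="OrderLatency",
    ExtendedStatistic="p99",
    Period=60,
    EvaluationPeriods=5,
    Threshold=1.0,
    ComparisonOperator="GreaterThanThreshold",
)

# Composite alarm fires only when BOTH child alarms are in ALARM at the same time,
# which suppresses noise from either signal flapping on its own.
cloudwatch.put_composite_alarm(
    AlarmName="orders-availability-and-latency",
    AlarmRule="ALARM(orders-success-rate-low) AND ALARM(orders-latency-high)",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:oncall-topic"],  # hypothetical SNS topic
)
```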

  4. (NOTE: Requires AWS Business Support or higher) AWS publishes events of interest related to your resources using the AWS Health service in the Personal Health Dashboard. Leverage the AWS Health Aware (AHA) framework (which uses AWS Health) to ingest proactive and real-time alerts aggregated across your AWS Organization from a central account (such as a management account). These alerts can be sent to preferred communication platforms such as Slack, and AHA integrates with ITSM tools like ServiceNow and Jira. Image: AWS Health Aware 'AHA'

  5. Leverage Amazon CloudWatch Application Insights to set up the best monitors for your resources and continuously analyze data for signs of problems with your applications. It also provides automated dashboards that surface potential problems with monitored applications, helping you quickly isolate and troubleshoot application and infrastructure issues. Leverage Container Insights to aggregate metrics and logs from containers; it integrates seamlessly with CloudWatch Application Insights. Image: CW Application Insights
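
As a hedged illustration, onboarding an application to CloudWatch Application Insights can be as simple as registering its resource group with boto3; the resource group name below is a placeholder:

```python
import boto3

appinsights = boto3.client("application-insights")

# Register an existing resource group (placeholder name) so Application Insights
# can set up recommended monitors and automated dashboards for its resources.
appinsights.create_application(
    ResourceGroupName="my-ecommerce-resource-group",  # assumed resource group
    AutoConfigEnabled=True,   # ask Application Insights to configure recommended monitors
    OpsCenterEnabled=False,   # optionally create OpsItems for detected problems
)
```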

  6. Leverage AWS Resilience Hub to analyze applications against defined RTOs and RPOs. Validate that availability, latency, and business continuity requirements are met by running controlled experiments with tools such as AWS Fault Injection Simulator. Conduct additional Well-Architected reviews and service-specific deep dives to ensure workloads are designed to meet business requirements following AWS best practices.

  7. For further details, refer to other sections of the AWS Observability Best Practices guidance, the AWS Cloud Adoption Framework: Operations Perspective whitepaper, and the AWS Well-Architected Framework Operational Excellence Pillar whitepaper content on 'Understanding workload health'.

3.2 TLG by Domain (emphasis on business metrics i.e. UX, BX)

Suitable examples are provided below using services such as CloudWatch (CW) (Ref: AWS Services that publish CloudWatch metrics documentation)

3.2.1 Canaries (aka Synthetic transactions) and Real-User Monitoring (RUM)

  • TLG: One of the easiest and most effective ways to understand client/customer experience is to simulate customer traffic with canaries (synthetic transactions), which regularly probe your services and record metrics.

| AWS Service | Feature | Measurement | Metric Example | Notes |
| --- | --- | --- | --- | --- |
| CW | Synthetics | Availability | SuccessPercent (Ex. SuccessPercent > 90 or CW Anomaly Detection for 1 min Period) | Metric Math where m1 is SuccessPercent (CloudWatchSynthetics), if canaries run each weekday 7a-8a: IF(((DAY(m1)<6) AND (HOUR(m1)>=7 AND HOUR(m1)<8)),m1) |
| CW | Synthetics | Availability | VisualMonitoringSuccessPercent (Ex. VisualMonitoringSuccessPercent > 90 for 5 min Period, for UI screenshot comparisons) | Same Metric Math as above if canaries run each weekday 7a-8a. Use when the customer expects the canary to match a predetermined UI screenshot |
| CW | RUM | Response Time | Apdex Score (Ex. NavigationFrustratedCount < 'N' expected value) | |
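
As one way to act on the SuccessPercent row above, the following boto3 sketch creates a static-threshold alarm on a canary's SuccessPercent metric; the canary name and threshold are assumptions:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the canary's SuccessPercent drops to 90% or below in a 1-minute period.
cloudwatch.put_metric_alarm(
    AlarmName="checkout-canary-success-below-90",
    Namespace="CloudWatchSynthetics",
    MetricName="SuccessPercent",
    Dimensions=[{"Name": "CanaryName", "Value": "checkout-flow-canary"}],  # assumed canary name
    Statistic="Average",
    Period=60,
    EvaluationPeriods=1,
    Threshold=90,
    ComparisonOperator="LessThanOrEqualToThreshold",
    TreatMissingData="breaching",   # a canary that stops running should also alarm
)
```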

3.2.2 API Frontend

| AWS Service | Feature | Measurement | Metric Example | Notes |
| --- | --- | --- | --- | --- |
| CloudFront | | Availability | Total error rate (Ex. [Total error rate] < 10 or CW Anomaly Detection for 1 min Period) | Availability as a measure of error rate |
| CloudFront | (Requires turning on additional metrics) | Performance | Cache hit rate (Ex. Cache hit rate < 10 or CW Anomaly Detection for 1 min Period) | |
| Route53 | Health checks (cross-region) | Availability | HealthCheckPercentageHealthy (Ex. [Minimum of HealthCheckPercentageHealthy] > 90 or CW Anomaly Detection for 1 min Period) | |
| Route53 | Health checks | Latency | TimeToFirstByte (Ex. [p99 TimeToFirstByte] < 100 ms or CW Anomaly Detection for 1 min Period) | |
| API Gateway | | Availability | Count (Ex. [((4XXError + 5XXError) / Count) * 100] < 10 or CW Anomaly Detection for 1 min Period) | Availability as a measure of "abandoned" requests |
| API Gateway | | Latency | Latency (or IntegrationLatency, i.e. backend latency) (Ex. p99 Latency < 1 sec or CW Anomaly Detection for 1 min Period) | p99 has greater tolerance than a lower percentile like p90 (p50 is the median) |
| API Gateway | | Performance | CacheHitCount (and misses) (Ex. [CacheMissCount / (CacheHitCount + CacheMissCount) * 100] < 10 or CW Anomaly Detection for 1 min Period) | Performance as a measure of cache misses |
| Application Load Balancer (ALB) | | Availability | RejectedConnectionCount (Ex. [RejectedConnectionCount / (RejectedConnectionCount + RequestCount) * 100] < 10 or CW Anomaly Detection for 1 min Period) | Availability as a measure of requests rejected because the maximum connection limit was breached |
| Application Load Balancer (ALB) | | Latency | TargetResponseTime (Ex. p99 TargetResponseTime < 1 sec or CW Anomaly Detection for 1 min Period) | p99 has greater tolerance than a lower percentile like p90 (p50 is the median) |
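
Several rows above suggest CW Anomaly Detection as an alternative to a static threshold. The boto3 sketch below shows one way to create an anomaly detection alarm on p99 API Gateway Latency; the API name, band width, and evaluation periods are assumptions:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Anomaly detection alarm on p99 Latency for a hypothetical API Gateway API.
cloudwatch.put_metric_alarm(
    AlarmName="api-p99-latency-anomaly",
    ComparisonOperator="GreaterThanUpperThreshold",
    EvaluationPeriods=3,
    ThresholdMetricId="ad1",   # the alarm compares m1 against the band produced by ad1
    Metrics=[
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApiGateway",
                    "MetricName": "Latency",
                    "Dimensions": [{"Name": "ApiName", "Value": "orders-api"}],  # assumed API name
                },
                "Period": 60,
                "Stat": "p99",
            },
        },
        {
            "Id": "ad1",
            # Band width of 2 standard deviations; tune per workload.
            "Expression": "ANOMALY_DETECTION_BAND(m1, 2)",
        },
    ],
)
```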

3.2.3 Serverless

| AWS Service | Feature | Measurement | Metric Example | Notes |
| --- | --- | --- | --- | --- |
| S3 | Request metrics | Availability | AllRequests (Ex. [((4XXErrors + 5XXErrors) / AllRequests) * 100] < 10 or CW Anomaly Detection for 1 min Period) | Availability as a measure of "abandoned" requests |
| S3 | Request metrics (Overall) | Latency | TotalRequestLatency (Ex. [p99 TotalRequestLatency] < 100 ms or CW Anomaly Detection for 1 min Period) | |
| DynamoDB (DDB) | | Availability | ThrottledRequests (Ex. [ThrottledRequests] < 100 or CW Anomaly Detection for 1 min Period) | Availability as a measure of "throttled" requests |
| DynamoDB (DDB) | | Latency | SuccessfulRequestLatency (Ex. [p99 SuccessfulRequestLatency] < 100 ms or CW Anomaly Detection for 1 min Period) | |
| Step Functions | | Availability | ExecutionsFailed (Ex. ExecutionsFailed = 0) | Metric Math (UTC) where m1 is ExecutionsFailed (Step Functions execution): IF(((DAY(m1)<6 OR DAY(m1)==7) AND (HOUR(m1)>=21 OR HOUR(m1)<7)),m1). Assumes a business flow that requires completion of Step Functions executions as a daily operation 9p-7a on weekdays (start-of-day business operations) |

3.2.4 Compute and Containers

| AWS Service | Feature | Measurement | Metric Example | Notes |
| --- | --- | --- | --- | --- |
| EKS | Prometheus metrics | Availability | APIServer Request Success Ratio (ex. Prometheus metric such as APIServer Request Success Ratio) | See best practices for monitoring EKS control plane metrics and EKS observability for details |
| EKS | Prometheus metrics | Performance | apiserver_request_duration_seconds, etcd_request_duration_seconds | |
| ECS | | Availability | Service RUNNING task count | See ECS CW metrics documentation |
| ECS | | Performance | TargetResponseTime (ex. [p99 TargetResponseTime] < 100 ms or CW Anomaly Detection for 1 min Period) | See ECS CW metrics documentation |
| EC2 (.NET Core) | CW Agent Performance Counters | Availability | ASP.NET Application Errors Total/Sec < 'N' | See EC2 CW Application Insights documentation |

3.2.5 Databases (RDS)

| AWS Service | Feature | Measurement | Metric Example | Notes |
| --- | --- | --- | --- | --- |
| RDS Aurora | Performance Insights (PI) | Availability | Average active sessions (Ex. Average active sessions with CW Anomaly Detection for 1 min Period) | See RDS Aurora CW PI documentation |
| RDS Aurora | | Disaster Recovery (DR) | AuroraGlobalDBRPOLag (Ex. AuroraGlobalDBRPOLag < 30000 ms for 1 min Period) | See RDS Aurora CW documentation |
| RDS Aurora | | Performance | Commit Latency, Buffer Cache Hit Ratio, DDL Latency, DML Latency (Ex. Commit Latency with CW Anomaly Detection for 1 min Period) | See RDS Aurora CW PI documentation |
| RDS (MSSQL) | PI | Performance | SQL Compilations (Ex. SQL Compilations > 'M' for 5 min Period) | See RDS CW PI documentation |

4.0 Using Amazon CloudWatch and Metric Math for Calculating SLIs, SLOs, and SLAs

4.1 Amazon CloudWatch and Metric Math

Amazon CloudWatch provides monitoring and observability services for AWS resources. Metric Math allows you to perform calculations using CloudWatch metric data, making it an ideal tool for calculating SLIs, SLOs, and SLAs.

4.1.1 Enabling Detailed Monitoring

Enable Detailed Monitoring for your AWS resources to get 1-minute data granularity, allowing for more accurate SLI calculations.
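
For example, EC2 instances default to 5-minute basic monitoring, and detailed monitoring can be enabled per instance; a minimal boto3 sketch (the instance ID is a placeholder):

```python
import boto3

ec2 = boto3.client("ec2")

# Enable 1-minute (detailed) monitoring for an existing instance.
ec2.monitor_instances(InstanceIds=["i-0123456789abcdef0"])  # placeholder instance ID
```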

4.1.2 Organizing Metrics with Namespaces and Dimensions

Use Namespaces and Dimensions to categorize and filter metrics for easier analysis. For example, use Namespaces to group metrics related to a specific service, and Dimensions to differentiate between various instances of that service.
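
A short sketch of publishing a custom metric under a dedicated namespace with dimensions; all names and values below are illustrative:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Publish a custom metric under a service-specific namespace, using dimensions
# to differentiate environments and service instances.
cloudwatch.put_metric_data(
    Namespace="ECommerce/Checkout",
    MetricData=[
        {
            "MetricName": "SuccessfulOrders",
            "Dimensions": [
                {"Name": "Environment", "Value": "prod"},
                {"Name": "Service", "Value": "cart-api"},
            ],
            "Value": 42,
            "Unit": "Count",
        }
    ],
)
```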

4.2 Calculating SLIs with Metric Math

4.2.1 Availability

To calculate availability, divide the number of successful requests by the total number of requests:

availability = 100 * (successful_requests / total_requests)

Example:

Suppose you have an API Gateway with the following metrics: - 4XXError: Number of 4xx client errors - 5XXError: Number of 5xx server errors - Count: Total number of requests

Use Metric Math to calculate the availability:

availability = 100 * ((Count - 4XXError - 5XXError) / Count)
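
One way to evaluate this expression with the CloudWatch GetMetricData API via boto3 is sketched below; the API name, stage, period, and time range are assumptions:

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

# API name and stage are placeholders; adjust to your API Gateway deployment.
dims = [{"Name": "ApiName", "Value": "orders-api"}, {"Name": "Stage", "Value": "prod"}]

def query(metric_id, metric_name):
    # Sum of the metric per 5-minute period; raw counts are not returned, only the expression.
    return {
        "Id": metric_id,
        "MetricStat": {
            "Metric": {"Namespace": "AWS/ApiGateway", "MetricName": metric_name, "Dimensions": dims},
            "Period": 300,
            "Stat": "Sum",
        },
        "ReturnData": False,
    }

resp = cloudwatch.get_metric_data(
    MetricDataQueries=[
        query("m_count", "Count"),
        query("m_4xx", "4XXError"),
        query("m_5xx", "5XXError"),
        # Availability as defined above: 100 * ((Count - 4XXError - 5XXError) / Count)
        {"Id": "availability", "Expression": "100 * ((m_count - m_4xx - m_5xx) / m_count)"},
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
)
# Only the "availability" query returns data, so it is the single result.
print(resp["MetricDataResults"][0]["Values"])
```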

4.2.2 Latency

To calculate average latency, use the SampleCount and Sum statistics provided by CloudWatch:

average_latency = Sum / SampleCount

Example:

Suppose you have a Lambda function with the following metric: - Duration: Time taken to execute the function

Use Metric Math to calculate the average latency:

average_latency = Duration.Sum / Duration.SampleCount
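
A sketch of the same calculation using GetMetricData and Metric Math; the function name and time range are assumptions:

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

# Function name is a placeholder.
duration_metric = {
    "Namespace": "AWS/Lambda",
    "MetricName": "Duration",
    "Dimensions": [{"Name": "FunctionName", "Value": "process-order"}],
}

resp = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {"Id": "d_sum", "ReturnData": False,
         "MetricStat": {"Metric": duration_metric, "Period": 300, "Stat": "Sum"}},
        {"Id": "d_count", "ReturnData": False,
         "MetricStat": {"Metric": duration_metric, "Period": 300, "Stat": "SampleCount"}},
        # average_latency = Duration.Sum / Duration.SampleCount (milliseconds)
        {"Id": "avg_latency", "Expression": "d_sum / d_count"},
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
)
print(resp["MetricDataResults"][0]["Values"])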

4.2.3 Error Rate

To calculate the error rate, divide the number of failed requests by the total number of requests:

error_rate = 100 * (failed_requests / total_requests)

Example:

Using the API Gateway example from before:

error_rate = 100 * ((4XXError + 5XXError) / Count)

4.4 Defining and Monitoring SLOs

4.4.1 Setting Realistic Targets

Define SLO targets based on user expectations and historical performance data. Set achievable targets to ensure a balance between service reliability and resource utilization.

4.4.2 Monitoring SLOs with CloudWatch

Create CloudWatch Alarms to monitor your SLIs and notify you when they approach or breach SLO targets. This enables you to proactively address issues and maintain service reliability.
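
One way to wire such an alarm, reusing the availability expression from section 4.2.1 and alarming when the SLI drops below a 99.9% SLO target; the API name, stage, periods, and SNS topic are assumptions:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

dims = [{"Name": "ApiName", "Value": "orders-api"}, {"Name": "Stage", "Value": "prod"}]  # placeholders

def stat(metric_id, metric_name):
    return {
        "Id": metric_id,
        "MetricStat": {
            "Metric": {"Namespace": "AWS/ApiGateway", "MetricName": metric_name, "Dimensions": dims},
            "Period": 300,
            "Stat": "Sum",
        },
        "ReturnData": False,
    }

# Alarm when availability stays below the 99.9% SLO target for three consecutive 5-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="orders-api-availability-slo-breach",
    ComparisonOperator="LessThanThreshold",
    Threshold=99.9,
    EvaluationPeriods=3,
    Metrics=[
        stat("m_count", "Count"),
        stat("m_4xx", "4XXError"),
        stat("m_5xx", "5XXError"),
        {"Id": "availability",
         "Expression": "100 * ((m_count - m_4xx - m_5xx) / m_count)",
         "ReturnData": True},   # the expression is the metric the alarm watches
    ],
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:slo-alerts"],  # hypothetical SNS topic
)
```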

4.4.3 Reviewing and Adjusting SLOs

Periodically review your SLOs to ensure they remain relevant as your service evolves. Adjust targets if necessary and communicate any changes to stakeholders.

4.5 Defining and Measuring SLAs

4.5.1 Setting Realistic Targets

Define SLA targets based on historical performance data and user expectations. Set achievable targets to ensure a balance between service reliability and resource utilization.

4.5.2 Monitoring and Alerting

Set up CloudWatch Alarms to monitor SLIs and notify you when they approach or breach SLA targets. This enables you to proactively address issues and maintain service reliability.

4.5.3 Regularly Reviewing SLAs

Periodically review SLAs to ensure they remain relevant as your service evolves. Adjust targets if necessary and communicate any changes to stakeholders.

4.6 Measuring SLA or SLO Performance Over a Set Period

To measure SLA or SLO performance over a set period, such as a calendar month, use CloudWatch metric data with custom time ranges.

Example:

Suppose you have an API Gateway with an SLO target of 99.9% availability. To measure the availability for the month of April, use the following Metric Math expression:

availability = 100 * ((Count - 4XXError - 5XXError) / Count)

Then, configure the CloudWatch metric data query with a custom time range:

  • Start Time: 2023-04-01T00:00:00Z
  • End Time: 2023-04-30T23:59:59Z
  • Period: 2592000 (30 days in seconds)

Finally, use the AVG statistic to calculate the average availability over the month. If the average availability is equal to or greater than the SLO target, you have met your objective.
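
A sketch of this monthly measurement with GetMetricData, using a single 30-day period so the Sum statistics aggregate the whole month before the availability expression is evaluated; the API name and stage are assumptions:

```python
import boto3
from datetime import datetime, timezone

cloudwatch = boto3.client("cloudwatch")

dims = [{"Name": "ApiName", "Value": "orders-api"}, {"Name": "Stage", "Value": "prod"}]  # placeholders

def monthly_sum(metric_id, metric_name):
    return {
        "Id": metric_id,
        "MetricStat": {
            "Metric": {"Namespace": "AWS/ApiGateway", "MetricName": metric_name, "Dimensions": dims},
            "Period": 2592000,  # 30 days in seconds: one datapoint covering the whole month
            "Stat": "Sum",
        },
        "ReturnData": False,
    }

resp = cloudwatch.get_metric_data(
    MetricDataQueries=[
        monthly_sum("m_count", "Count"),
        monthly_sum("m_4xx", "4XXError"),
        monthly_sum("m_5xx", "5XXError"),
        {"Id": "availability", "Expression": "100 * ((m_count - m_4xx - m_5xx) / m_count)"},
    ],
    StartTime=datetime(2023, 4, 1, tzinfo=timezone.utc),
    EndTime=datetime(2023, 4, 30, 23, 59, 59, tzinfo=timezone.utc),
)
# Compare the single monthly value against the 99.9% SLO target.
monthly_availability = resp["MetricDataResults"][0]["Values"][0]
print(f"April availability: {monthly_availability:.3f}% (SLO target: 99.9%)")
```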

5.0 Summary

Key Performance Indicators (KPIs), also known as 'Golden Signals', must align to business and stakeholder requirements. Calculating SLIs, SLOs, and SLAs using Amazon CloudWatch and Metric Math is crucial for managing service reliability. Follow the best practices outlined in this guide to effectively monitor and maintain the performance of your AWS resources. Remember to enable Detailed Monitoring, organize metrics with Namespaces and Dimensions, use Metric Math for SLI calculations, set realistic SLO and SLA targets, and establish monitoring and alerting with CloudWatch Alarms. By applying these best practices, you can ensure optimal service reliability, better resource utilization, and improved customer satisfaction.