
Key Performance Indicators

1.0 Understanding KPIs ("Golden Signals")

Organizations use key performance indicators (KPIs), also known as 'Golden Signals', to gain insight into the health or risk of their business and operations. Different parts of an organization track different KPIs that measure their respective outcomes. For example, the product team of an eCommerce application might track the ability to process cart orders successfully, an on-call operations team might measure mean time to detect (MTTD) an incident, and a finance team might track whether the cost of resources stays under budget.

Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) are essential components of service reliability management. This guide outlines best practices for using Amazon CloudWatch and its features to calculate and monitor SLIs, SLOs, and SLAs, with clear and concise examples.

  • SLI (Service Level Indicator): A quantitative measure of a service's performance.
  • SLO (Service Level Objective): The target value for an SLI, representing the desired performance level.
  • SLA (Service Level Agreement): A contract between a service provider and its users specifying the expected level of service.

Examples of common SLIs:

  • Availability: Percentage of time a service is operational
  • Latency: Time taken to fulfill a request
  • Error rate: Percentage of failed requests

2.0 Discover customer and stakeholder requirements (using the template suggested below)

  1. Start with the top question: “What is the business value or business problem in scope for the given workload?” (e.g., payment portal, eCommerce order placement, user registration, data reports, support portal)
  2. Break down the business value into categories such as User Experience (UX), Business Experience (BX), Operational Experience (OpsX), Security Experience (SecX), and Developer Experience (DevX)
  3. Derive core signals, a.k.a. “Golden Signals”, for each category; the top signals around UX and BX typically constitute the business metrics
| ID | Initials | Customer | Business Needs | Measurements | Information Sources | What does good look like? | Alerts | Dashboards | Reports |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| M1 | Example | External End User | User Experience | Response time (page latency) | Logs / Traces | < 5 s for 99.9% | No | Yes | No |
| M2 | Example | Business | Availability | Successful RPS (requests per second) | Health check | > 85% in 5 min window | Yes | Yes | Yes |
| M3 | Example | Security | Compliance | Critical non-compliant resources | Config data | < 10 under 15 days | No | Yes | Yes |
| M4 | Example | Developers | Agility | Deployment time | Deployment logs | Always < 10 min | Yes | No | Yes |
| M5 | Example | Operators | Capacity | Queue depth | App logs/metrics | Always < 10 | Yes | Yes | Yes |

2.1 Golden Signals

| Category | Signal | Notes | References |
| --- | --- | --- | --- |
| UX | Performance (Latency) | See M1 in template | Whitepaper: Availability and Beyond (Measuring latency) |
| BX | Availability | See M2 in template | Whitepaper: Availability and Beyond (Measuring availability) |
| BX | Business Continuity Plan (BCP) | Amazon Resilience Hub (ARH) resilience score against defined RTO/RPO | Docs: ARH user guide (Understanding resilience scores) |
| SecX | (Non-)Compliance | See M3 in template | Docs: AWS Control Tower user guide (Compliance status in the console) |
| DevX | Agility | See M4 in template | Docs: DevOps Monitoring Dashboard on AWS (DevOps metrics list) |
| OpsX | Capacity (Quotas) | See M5 in template | Docs: Amazon CloudWatch user guide (Visualizing your service quotas and setting alarms) |
| OpsX | Budget Anomalies | | Docs: AWS Billing and Cost Management (AWS Cost Anomaly Detection); AWS Budgets |

3.0 Top Level Guidance (TLG)

3.1 TLG General

  1. Work with business, architecture, and security teams to help refine the business, compliance, and governance requirements and ensure they accurately reflect the business needs. This includes establishing recovery time objectives (RTOs) and recovery point objectives (RPOs). Formulate methods to measure these requirements, such as how availability and latency are measured (e.g., uptime could allow a small percentage of faults over a 5-minute window).

  2. Build an effective tagging strategy with a purpose-built schema that aligns to the various business functional outcomes. This should especially cover operational observability and incident management.

  3. Where possible, leverage dynamic thresholds for alarms (especially for metrics that do not have baseline KPIs) using CloudWatch anomaly detection, which uses machine learning algorithms to establish baselines. When configuring alarms on CloudWatch metrics published by AWS services (or on metrics from other sources such as Prometheus), consider creating composite alarms to reduce alarm noise. Example: a composite alarm that comprises a business metric indicative of availability (tracked by successful requests) and a latency metric, configured to alarm when both breach their critical thresholds during a deployment, can be a deterministic indicator of a deployment bug (see the composite alarm sketch after this list).

  4. (NOTE: Requires AWS Business Support or higher) AWS publishes events of interest related to your resources using the AWS Health service in the Personal Health Dashboard. Leverage the AWS Health Aware (AHA) framework (which uses AWS Health) to ingest proactive and real-time alerts aggregated across your AWS Organization from a central account (such as a management account). These alerts can be sent to preferred communication platforms such as Slack, and AHA integrates with ITSM tools like ServiceNow and Jira. Image: AWS Health Aware 'AHA'

  5. Leverage Amazon CloudWatch Application Insights to set up the best monitors for your resources and continuously analyze data for signs of problems with your applications. It also provides automated dashboards that surface potential problems with monitored applications, helping you quickly isolate and troubleshoot application and infrastructure issues. Leverage Container Insights to aggregate metrics and logs from containers; it integrates seamlessly with CloudWatch Application Insights. Image: CW Application Insights

  6. Leverage AWS Resilience Hub to analyze applications against defined RTOs and RPOs. Validate that availability, latency, and business continuity requirements are met by running controlled experiments with tools like AWS Fault Injection Simulator. Conduct additional Well-Architected reviews and service-specific deep dives to ensure workloads are designed to meet business requirements following AWS best practices.

  7. For further details, refer to other sections of the AWS Observability Best Practices guidance, the AWS Cloud Adoption Framework: Operations Perspective whitepaper, and the AWS Well-Architected Framework Operational Excellence Pillar whitepaper content on 'Understanding workload health'.
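
As a concrete illustration of the composite alarm guidance in item 3 above, the following is a minimal boto3 sketch. The alarm names, SNS topic, and alarm rule are illustrative assumptions, not values prescribed by this guide; it assumes two alarms (one on availability, one on latency) already exist.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Assumes two existing alarms (names are illustrative):
#   "orders-availability-low"  - fires when successful requests drop below a critical threshold
#   "orders-latency-anomaly"   - an anomaly-detection alarm on p99 latency
cloudwatch.put_composite_alarm(
    AlarmName="orders-deployment-health",
    AlarmDescription="Availability AND latency breached together, e.g. during a deployment",
    # Fire only when BOTH underlying alarms are in ALARM state, to reduce alarm noise
    AlarmRule='ALARM("orders-availability-low") AND ALARM("orders-latency-anomaly")',
    ActionsEnabled=True,
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:ops-notifications"],  # replace with your SNS topic
)
```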

3.2 TLG by Domain (emphasis on business metrics i.e. UX, BX)

Suitable examples are provided below using services such as CloudWatch (CW) (Ref: AWS Services that publish CloudWatch metrics documentation)

3.2.1 Canaries (aka Synthetic transactions) and Real-User Monitoring (RUM)

  • TLG: One of the easiest and most effective ways to understand the client/customer experience is to simulate customer traffic with canaries (synthetic transactions), which regularly probe your services and record metrics (see the alarm sketch after the table below).
| AWS Service | Feature | Measurement | Metric | Example | Notes |
| --- | --- | --- | --- | --- | --- |
| CW | Synthetics | Availability | SuccessPercent | Ex. SuccessPercent > 90, or CW Anomaly Detection, for 1 min Period. Metric Math where m1 is SuccessPercent (CloudWatchSynthetics) and canaries run each weekday 7a-8a: `IF(((DAY(m1)<6) AND (HOUR(m1)>=7 AND HOUR(m1)<8)),m1)` | |
| CW | Synthetics | Availability | VisualMonitoringSuccessPercent | Ex. VisualMonitoringSuccessPercent > 90 for 5 min Period for UI screenshot comparisons. Same weekday Metric Math filter as above: `IF(((DAY(m1)<6) AND (HOUR(m1)>=7 AND HOUR(m1)<8)),m1)` | Use when the customer expects the canary to match a predetermined UI screenshot |
| CW | RUM | Response Time | Apdex Score | Ex. Apdex score: NavigationFrustratedCount < 'N' expected value | |
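
To show how the SuccessPercent example above could feed an alarm, here is a minimal boto3 sketch; the canary name, threshold, and evaluation settings are illustrative assumptions.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical canary name; CloudWatch Synthetics publishes SuccessPercent in the
# "CloudWatchSynthetics" namespace with a CanaryName dimension.
cloudwatch.put_metric_alarm(
    AlarmName="checkout-canary-success-below-90",
    Namespace="CloudWatchSynthetics",
    MetricName="SuccessPercent",
    Dimensions=[{"Name": "CanaryName", "Value": "checkout-canary"}],
    Statistic="Average",
    Period=60,                        # 1-minute period, matching the table example
    EvaluationPeriods=5,
    Threshold=90,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="notBreaching",  # the canary may only run during business hours
)
```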

3.2.2 API Frontend

| AWS Service | Feature | Measurement | Metric | Example | Notes |
| --- | --- | --- | --- | --- | --- |
| CloudFront | | Availability | Total error rate | Ex. [Total error rate] < 10 or CW Anomaly Detection for 1 min Period | Availability as a measure of error rate |
| CloudFront | Requires turning on additional metrics | Performance | Cache hit rate | Ex. Cache hit rate < 10 or CW Anomaly Detection for 1 min Period | |
| Route 53 | Health checks | (Cross-region) Availability | HealthCheckPercentageHealthy | Ex. [Minimum of HealthCheckPercentageHealthy] > 90 or CW Anomaly Detection for 1 min Period | |
| Route 53 | Health checks | Latency | TimeToFirstByte | Ex. [p99 TimeToFirstByte] < 100 ms or CW Anomaly Detection for 1 min Period | |
| API Gateway | | Availability | Count | Ex. [((4XXError + 5XXError) / Count) * 100] < 10 or CW Anomaly Detection for 1 min Period | Availability as a measure of "abandoned" requests |
| API Gateway | | Latency | Latency (or IntegrationLatency, i.e. backend latency) | Ex. p99 Latency < 1 sec or CW Anomaly Detection for 1 min Period | p99 allows more tolerance than a lower percentile such as p90 (p50 is the median) |
| API Gateway | | Performance | CacheHitCount (and misses) | Ex. [CacheMissCount / (CacheHitCount + CacheMissCount) * 100] < 10 or CW Anomaly Detection for 1 min Period | Performance as a measure of cache misses |
| Application Load Balancer (ALB) | | Availability | RejectedConnectionCount | Ex. [RejectedConnectionCount / (RejectedConnectionCount + RequestCount) * 100] < 10 or CW Anomaly Detection for 1 min Period | Availability as a measure of requests rejected because the maximum connection count was breached |
| Application Load Balancer (ALB) | | Latency | TargetResponseTime | Ex. p99 TargetResponseTime < 1 sec or CW Anomaly Detection for 1 min Period | p99 allows more tolerance than a lower percentile such as p90 (p50 is the median) |

3.2.3 Serverless

| AWS Service | Feature | Measurement | Metric | Example | Notes |
| --- | --- | --- | --- | --- | --- |
| S3 | Request metrics | Availability | AllRequests | Ex. [((4xxErrors + 5xxErrors) / AllRequests) * 100] < 10 or CW Anomaly Detection for 1 min Period | Availability as a measure of "abandoned" requests |
| S3 | Request metrics | (Overall) Latency | TotalRequestLatency | Ex. [p99 TotalRequestLatency] < 100 ms or CW Anomaly Detection for 1 min Period | |
| DynamoDB (DDB) | | Availability | ThrottledRequests | Ex. [ThrottledRequests] < 100 or CW Anomaly Detection for 1 min Period | Availability as a measure of "throttled" requests |
| DynamoDB (DDB) | | Latency | SuccessfulRequestLatency | Ex. [p99 SuccessfulRequestLatency] < 100 ms or CW Anomaly Detection for 1 min Period | |
| Step Functions | | Availability | ExecutionsFailed | Ex. ExecutionsFailed = 0. Metric Math where m1 is ExecutionsFailed (Step Functions execution), UTC time: `IF(((DAY(m1)<6 OR DAY(m1)==7) AND (HOUR(m1)>=21 OR HOUR(m1)<7)),m1)` | Assumes a business flow that requires Step Functions executions to complete as a daily operation 9p-7a on weekdays (start-of-day business operations) |

3.2.4 Compute and Containers

| AWS Service | Feature | Measurement | Metric | Example | Notes |
| --- | --- | --- | --- | --- | --- |
| EKS | Prometheus metrics | Availability | API server request success ratio | Ex. Prometheus metric such as the API server request success ratio | See best practices for monitoring EKS control plane metrics and EKS observability for details |
| EKS | Prometheus metrics | Performance | apiserver_request_duration_seconds, etcd_request_duration_seconds | Ex. apiserver_request_duration_seconds, etcd_request_duration_seconds | |
| ECS | | Availability | Service RUNNING task count | Ex. Service RUNNING task count | See ECS CW metrics documentation |
| ECS | | Performance | TargetResponseTime | Ex. [p99 TargetResponseTime] < 100 ms or CW Anomaly Detection for 1 min Period | See ECS CW metrics documentation |
| EC2 (.NET Core) | CW Agent Performance Counters | Availability | ASP.NET Application Errors Total/Sec | Ex. ASP.NET Application Errors Total/Sec < 'N' | See EC2 CW Application Insights documentation |

3.2.5 Databases (RDS)

| AWS Service | Feature | Measurement | Metric | Example | Notes |
| --- | --- | --- | --- | --- | --- |
| RDS Aurora | Performance Insights (PI) | Availability | Average active sessions | Ex. Average active sessions with CW Anomaly Detection for 1 min Period | See RDS Aurora CW PI documentation |
| RDS Aurora | | Disaster Recovery (DR) | AuroraGlobalDBRPOLag | Ex. AuroraGlobalDBRPOLag < 30000 ms for 1 min Period | See RDS Aurora CW documentation |
| RDS Aurora | | Performance | Commit Latency, Buffer Cache Hit Ratio, DDL Latency, DML Latency | Ex. Commit Latency with CW Anomaly Detection for 1 min Period | See RDS Aurora CW PI documentation |
| RDS (MSSQL) | PI | Performance | SQL Compilations | Ex. SQL Compilations > 'M' for 5 min Period | See RDS CW PI documentation |

4.0 Using Amazon CloudWatch and Metric Math for Calculating SLIs, SLOs, and SLAs

4.1 Amazon CloudWatch and Metric Math

Amazon CloudWatch provides monitoring and observability services for AWS resources. Metric Math allows you to perform calculations using CloudWatch metric data, making it an ideal tool for calculating SLIs, SLOs, and SLAs.

4.1.1 Enabling Detailed Monitoring

Enable Detailed Monitoring for your AWS resources to get 1-minute data granularity, allowing for more accurate SLI calculations.
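
As a minimal sketch, assuming EC2 instances are the resources in question (other services enable higher-resolution or request-level metrics through their own settings), detailed monitoring can be turned on with boto3; the instance ID is a placeholder.

```python
import boto3

ec2 = boto3.client("ec2")

# Switch the instance from 5-minute basic monitoring to 1-minute detailed monitoring.
ec2.monitor_instances(InstanceIds=["i-0123456789abcdef0"])
```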

4.1.2 Organizing Metrics with Namespaces and Dimensions

Use Namespaces and Dimensions to categorize and filter metrics for easier analysis. For example, use Namespaces to group metrics related to a specific service, and Dimensions to differentiate between various instances of that service.
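
For illustration, a custom business metric could be published under its own Namespace, with Dimensions separating environment and service; the namespace, metric name, and dimension values below are hypothetical.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Group order-processing metrics under one namespace and use dimensions to
# differentiate environments and services.
cloudwatch.put_metric_data(
    Namespace="ECommerce/Orders",              # hypothetical custom namespace
    MetricData=[
        {
            "MetricName": "SuccessfulCartOrders",
            "Dimensions": [
                {"Name": "Environment", "Value": "prod"},
                {"Name": "Service", "Value": "checkout"},
            ],
            "Value": 1,
            "Unit": "Count",
        }
    ],
)
```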

4.2 Calculating SLIs with Metric Math

4.2.1 Availability

To calculate availability, divide the number of successful requests by the total number of requests:

availability = 100 * (successful_requests / total_requests)

Example:

Suppose you have an API Gateway with the following metrics:

  • 4XXError: Number of 4xx client errors
  • 5XXError: Number of 5xx server errors
  • Count: Total number of requests

Use Metric Math to calculate the availability:

availability = 100 * ((Count - 4XXError - 5XXError) / Count)
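
A boto3 sketch of this calculation using the CloudWatch GetMetricData API follows; the API name, period, and time range are illustrative assumptions.

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

# Hypothetical API name; API Gateway metrics live in the AWS/ApiGateway namespace.
def gateway_metric(metric_id, metric_name):
    return {
        "Id": metric_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/ApiGateway",
                "MetricName": metric_name,
                "Dimensions": [{"Name": "ApiName", "Value": "orders-api"}],
            },
            "Period": 300,
            "Stat": "Sum",
        },
        "ReturnData": False,  # only the math expression result is returned
    }

response = cloudwatch.get_metric_data(
    MetricDataQueries=[
        gateway_metric("requests", "Count"),
        gateway_metric("err4xx", "4XXError"),
        gateway_metric("err5xx", "5XXError"),
        {
            "Id": "availability",
            "Expression": "100 * ((requests - err4xx - err5xx) / requests)",
            "Label": "Availability (%)",
        },
    ],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
)

results = {r["Id"]: r for r in response["MetricDataResults"]}
print(results["availability"]["Values"])
```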

4.2.2 Latency

To calculate average latency, use the SampleCount and Sum statistics provided by CloudWatch:

average_latency = Sum / SampleCount

Example:

Suppose you have a Lambda function with the following metric:

  • Duration: Time taken to execute the function

Use Metric Math to calculate the average latency:

average_latency = Duration.Sum / Duration.SampleCount
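
The same approach works for the Lambda Duration metric (reported in milliseconds); in this sketch the function name and time range are placeholders.

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

# Hypothetical function name; Duration is published in the AWS/Lambda namespace.
def duration_stat(metric_id, stat):
    return {
        "Id": metric_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/Lambda",
                "MetricName": "Duration",
                "Dimensions": [{"Name": "FunctionName", "Value": "process-orders"}],
            },
            "Period": 300,
            "Stat": stat,
        },
        "ReturnData": False,
    }

response = cloudwatch.get_metric_data(
    MetricDataQueries=[
        duration_stat("total", "Sum"),
        duration_stat("count", "SampleCount"),
        {"Id": "avg_latency", "Expression": "total / count", "Label": "Average latency (ms)"},
    ],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
)

results = {r["Id"]: r for r in response["MetricDataResults"]}
print(results["avg_latency"]["Values"])
```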

4.2.3 Error Rate

To calculate the error rate, divide the number of failed requests by the total number of requests:

error_rate = 100 * (failed_requests / total_requests)

Example:

Using the API Gateway example from before:

error_rate = 100 * ((4XXError + 5XXError) / Count)

4.4 Defining and Monitoring SLOs

4.4.1 Setting Realistic Targets

Define SLO targets based on user expectations and historical performance data. Set achievable targets to ensure a balance between service reliability and resource utilization.

4.4.2 Monitoring SLOs with CloudWatch

Create CloudWatch Alarms to monitor your SLIs and notify you when they approach or breach SLO targets. This enables you to proactively address issues and maintain service reliability.
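
As one possible implementation, the availability SLI from section 4.2.1 can be monitored with a metric-math alarm; the alarm name, API name, SNS topic, and evaluation settings below are assumptions for illustration.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical API name; reuses the availability expression from section 4.2.1.
def gateway_metric(metric_id, metric_name):
    return {
        "Id": metric_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/ApiGateway",
                "MetricName": metric_name,
                "Dimensions": [{"Name": "ApiName", "Value": "orders-api"}],
            },
            "Period": 300,
            "Stat": "Sum",
        },
        "ReturnData": False,
    }

# Alarm when availability drops below the 99.9% SLO target for three
# consecutive 5-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="orders-api-availability-slo",
    Metrics=[
        gateway_metric("requests", "Count"),
        gateway_metric("err4xx", "4XXError"),
        gateway_metric("err5xx", "5XXError"),
        {
            "Id": "availability",
            "Expression": "100 * ((requests - err4xx - err5xx) / requests)",
            "Label": "Availability (%)",
            "ReturnData": True,
        },
    ],
    EvaluationPeriods=3,
    Threshold=99.9,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:slo-alerts"],  # replace with your SNS topic
)
```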

4.4.3 Reviewing and Adjusting SLOs

Periodically review your SLOs to ensure they remain relevant as your service evolves. Adjust targets if necessary and communicate any changes to stakeholders.

4.5 Defining and Measuring SLAs

4.5.1 Setting Realistic Targets

Define SLA targets based on historical performance data and user expectations. Set achievable targets to ensure a balance between service reliability and resource utilization.

4.5.2 Monitoring and Alerting

Set up CloudWatch Alarms to monitor SLIs and notify you when they approach or breach SLA targets. This enables you to proactively address issues and maintain service reliability.

4.5.3 Regularly Reviewing SLAs

Periodically review SLAs to ensure they remain relevant as your service evolves. Adjust targets if necessary and communicate any changes to stakeholders.

4.6 Measuring SLA or SLO Performance Over a Set Period

To measure SLA or SLO performance over a set period, such as a calendar month, use CloudWatch metric data with custom time ranges.

Example:

Suppose you have an API Gateway with an SLO target of 99.9% availability. To measure the availability for the month of April, use the following Metric Math expression:

availability = 100 * ((Count - 4XXError - 5XXError) / Count)

Then, configure the CloudWatch metric data query with a custom time range:

  • Start Time: 2023-04-01T00:00:00Z
  • End Time: 2023-04-30T23:59:59Z
  • Period: 2592000 (30 days in seconds)

Finally, use the AVG statistic to calculate the average availability over the month. If the average availability is equal to or greater than the SLO target, you have met your objective.
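
Putting this together, a boto3 sketch of the monthly query might look like the following; the API name is a placeholder, and the final line assumes at least one data point is returned for the period.

```python
import boto3
from datetime import datetime, timezone

cloudwatch = boto3.client("cloudwatch")

# Hypothetical API name; one 30-day period yields a single data point for April.
def gateway_metric(metric_id, metric_name):
    return {
        "Id": metric_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/ApiGateway",
                "MetricName": metric_name,
                "Dimensions": [{"Name": "ApiName", "Value": "orders-api"}],
            },
            "Period": 2592000,  # 30 days in seconds
            "Stat": "Sum",
        },
        "ReturnData": False,
    }

response = cloudwatch.get_metric_data(
    MetricDataQueries=[
        gateway_metric("requests", "Count"),
        gateway_metric("err4xx", "4XXError"),
        gateway_metric("err5xx", "5XXError"),
        {
            "Id": "availability",
            "Expression": "100 * ((requests - err4xx - err5xx) / requests)",
            "Label": "April availability (%)",
        },
    ],
    StartTime=datetime(2023, 4, 1, tzinfo=timezone.utc),
    EndTime=datetime(2023, 4, 30, 23, 59, 59, tzinfo=timezone.utc),
)

results = {r["Id"]: r for r in response["MetricDataResults"]}
monthly_availability = results["availability"]["Values"][0]
print(f"April availability: {monthly_availability:.3f}%")
```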

5.0 Summary

Key Performance Indicators (KPIs), a.k.a. 'Golden Signals', must align to business and stakeholder requirements. Calculating SLIs, SLOs, and SLAs using Amazon CloudWatch and Metric Math is crucial for managing service reliability. Follow the best practices outlined in this guide to effectively monitor and maintain the performance of your AWS resources. Remember to enable Detailed Monitoring, organize metrics with Namespaces and Dimensions, use Metric Math for SLI calculations, set realistic SLO and SLA targets, and establish monitoring and alerting with CloudWatch Alarms. By applying these best practices, you can ensure optimal service reliability, better resource utilization, and improved customer satisfaction.