
Cloud Engineer

Welcome to the Observability Best Practices guide for Cloud Engineers! This document provides best practices, tips, and examples for effectively managing your AWS observability resources across different expertise levels, whether you're just starting out or are an experienced Cloud Engineer.


AWS Cost Management 💸

Goal: Optimize your AWS costs by monitoring and controlling your spending.

| Level | Category | Description | Tips & Examples | Additional Notes |
| --- | --- | --- | --- | --- |
| Basic | Track Your Spending | Set up dashboards to monitor how your business activities impact costs | Example: Monitor marketing campaigns' effect on server costs | Pro Tip: Start with basic daily cost tracking. Common Pitfall: Failing to set up alerts |
| Basic | Budget Management | Establish spending limits to measure project costs | Tip: Focus on setting budgets for each department or service | Recommendation: Establish clear budget allocations |
| Intermediate | Resource Tagging | Implement resource tagging to track resource usage by teams and projects | Quick Win: Start with these 3 tags: Project, Environment, Owner | Did You Know? You could save 20-30% after implementing tagging |
| Intermediate | Cost & Usage Visibility | Ensure that you incur only the costs you need and are not overspending on resources you don't need | Example: Set up granular cost dashboards for better tracking | Pro Tip: Take into consideration the different cost optimization tools AWS provides |
| Advanced | Smart Cost Management | Automate tasks that limit unnecessary spending | Example: Power off non-production servers during off hours | Pro Tip: Begin with non-production environments |
| Advanced | Strategic Implementation | Establish KPIs and implement FinOps Foundation principles | Create cost optimization KPIs and track them over time | Pro Tip: Start with the "Unit Economics" KPI: measure your cost per business output (e.g., cost per transaction, cost per customer, or cost per service). Remember: The best KPIs are those that directly tie cloud spending to business outcomes, making it easier to demonstrate ROI and gain buy-in for FinOps initiatives. |
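
The "Unit Economics" KPI from the table comes down to one division: cloud cost over business output. The following is a minimal Python sketch, assuming you already have a monthly cost figure and a count of business outputs from your own reporting. The function name `cost_per_unit` and the sample figures are hypothetical and not part of any AWS API.

```python
from decimal import Decimal


def cost_per_unit(total_cost: Decimal, units: int) -> Decimal:
    """Return cloud cost per business output (e.g. per transaction)."""
    if units <= 0:
        raise ValueError("units must be positive to compute a unit cost")
    # Quantize to four decimal places so small unit costs stay visible.
    return (total_cost / units).quantize(Decimal("0.0001"))


# Hypothetical monthly figures: (cost in USD, transactions processed).
months = {
    "2024-01": (Decimal("12000"), 400_000),
    "2024-02": (Decimal("13500"), 500_000),
}

for month, (cost, transactions) in months.items():
    print(month, cost_per_unit(cost, transactions))
```

In this made-up data, total spend rises while cost per transaction falls from 0.03 to 0.027, which is the kind of trend a unit KPI is meant to surface.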

Recommendations:​

  • Start simple: Begin with basic monitoring and expand to more advanced techniques as you become more comfortable with AWS tools.
  • Use tags effectively: Tagging is one of the most powerful ways to track and allocate costs. Implementing it early can save significant time in the future.
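
To illustrate why consistent tags make cost allocation easier, here is a small Python sketch that groups cost line items by a tag key and collects everything without that tag under `untagged`. The record layout, `line_items`, and `cost_by_tag` are assumptions made for this example; a real billing export will look different.

```python
from collections import defaultdict

# Hypothetical cost line items; a real billing export will have another shape.
line_items = [
    {"cost": 120.0, "tags": {"Project": "checkout", "Environment": "prod", "Owner": "team-a"}},
    {"cost": 45.5, "tags": {"Project": "checkout", "Environment": "dev", "Owner": "team-a"}},
    {"cost": 80.0, "tags": {"Project": "search", "Environment": "prod", "Owner": "team-b"}},
    {"cost": 30.0, "tags": {}},  # untagged resource
]


def cost_by_tag(items, tag_key):
    """Sum cost per value of tag_key; items missing the tag go to 'untagged'."""
    totals = defaultdict(float)
    for item in items:
        value = item["tags"].get(tag_key, "untagged")
        totals[value] += item["cost"]
    return dict(totals)


print(cost_by_tag(line_items, "Project"))
print(cost_by_tag(line_items, "Environment"))
```

The size of the `untagged` bucket is a useful number to track on its own, since it shows how much spend cannot yet be attributed to a team or project.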

AWS Performance & Availability 🚀

Goal: Ensure optimal performance and availability of your AWS-hosted applications.

| Level | Component | Description | Tips & Examples | Additional Notes |
| --- | --- | --- | --- | --- |
| Basic | Watch Your Apps | Aggregate curated historical data and view it alongside other related data | Example: Check if users in different regions experience delays | Common Pitfall: Lack of centralization for your monitoring tools |
| Intermediate | Track Connection Points | Monitor how different parts of your application communicate with each other | Quick Win: Start by tracking the performance of your most critical service | Did You Know? Most outages happen due to service-to-service communication failures |
| Advanced | Test Your Performance | Test and simulate applications from the perspective of your customers to understand their experience | Example: Execute synthetic tests against your application endpoints | Pro Tip: Collect client-side data from user sessions to gain granular performance insights |
| Advanced | Establish and Enforce Agreed Availability Targets | Define and assess SLOs for your applications that establish the acceptable health and availability | Use for real-time monitoring and quick troubleshooting | Pro Tip: Regularly evaluate your organization's observability maturity |
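
An availability SLO becomes actionable once it is expressed as an error budget: the number of failures the target allows over a window. The sketch below shows that arithmetic in Python. The function name and the request counts are hypothetical, and it assumes a request-based SLO rather than a time-based one.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Return the fraction of the error budget still unspent.

    slo_target is the availability objective as a fraction, e.g. 0.999.
    A negative result means the budget is overspent.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    if total <= 0:
        return 1.0  # no traffic in the window, so nothing has been spent
    allowed_failures = (1 - slo_target) * total
    return 1 - failed / allowed_failures


# Hypothetical window: 1,000,000 requests, 250 failures, 99.9% target.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"{remaining:.0%} of the error budget remains")
```

With a 99.9% target, one million requests allow roughly 1,000 failures, so 250 failures leave about three quarters of the budget.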

Recommendations:​

  • Understand user experience: Monitoring only server-side metrics isn't enough. Be sure to track actual user experience globally.
  • Prioritize key services: Begin monitoring your most critical application components and scale monitoring from there.
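
To compare user experience across regions, a tail percentile is often more telling than an average. The sketch below computes a nearest-rank p95 per region from latency samples. The sample values are invented and the helpers `percentile` and `p95_by_region` are illustrative names; it does not reflect how any particular AWS service computes percentiles.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def p95_by_region(samples):
    """Group (region, latency_ms) samples and return the p95 latency per region."""
    grouped = defaultdict(list)
    for region, latency_ms in samples:
        grouped[region].append(latency_ms)
    return {region: percentile(vals, 95) for region, vals in grouped.items()}


# Hypothetical real-user latency samples in milliseconds.
samples = [("us-east-1", ms) for ms in (110, 120, 130, 125, 900)]
samples += [("ap-south-1", ms) for ms in (300, 320, 310, 330, 340)]

print(p95_by_region(samples))
```

In this data the first region has the lower typical latency but the worse tail, a difference that an average alone would blur.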

AWS Security Monitoring 🔒

Goal: Secure your AWS infrastructure by monitoring for security vulnerabilities and incidents.

| Level | Component | Description | Tips & Examples | Additional Notes |
| --- | --- | --- | --- | --- |
| Basic | Central Security Monitoring | Consolidate all security logs in one central place for easy access and analysis | Example: Track all access to sensitive data and resources | Pro Tip: Start by focusing on login attempts and access patterns |
| Intermediate | Expand Telemetry Data Collection | Include additional attributes that help with troubleshooting and auditing sessions | Implementation: Emit telemetry data from your application's backend code | Example: Send the name of the browser from which the user logged in |
| Advanced | Change Monitoring | Track abrupt changes in your workloads from both internal and external sources | Quick Win: Set up alerts for unexpected login patterns or user activity | Common Pitfall: Depending solely on static alarm thresholds |
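
The pitfall of relying only on static thresholds can be illustrated with a band derived from recent history instead of a fixed number. The following Python sketch flags a value that falls outside the mean plus or minus a multiple of the standard deviation. The function `is_anomalous`, the default of three standard deviations, and the sample counts are assumptions for illustration. CloudWatch offers its own anomaly detection for alarms, and this sketch is not a description of how that feature works internally.

```python
from statistics import mean, stdev


def is_anomalous(history, current, num_stdevs=3.0):
    """Flag current if it lies outside mean +/- num_stdevs * stdev of history."""
    if len(history) < 2:
        raise ValueError("need at least two historical points")
    center = mean(history)
    band = num_stdevs * stdev(history)
    return abs(current - center) > band


# Hypothetical failed-login counts for the same hour on previous days.
failed_logins = [12, 15, 11, 14, 13, 16, 12]

print(is_anomalous(failed_logins, 14))  # a value close to the usual range
print(is_anomalous(failed_logins, 60))  # a sudden spike
```

Because the band is recomputed from history, the same code adapts to a service that normally sees 10 failed logins an hour and to one that normally sees 1,000, which a single static threshold cannot do.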

Recommendations:​

  • Prioritize security: Security should never be an afterthought. Start with basic monitoring and progress to more sophisticated configurations.
  • Automate alerts: Setting up automatic alerts for unusual activities helps detect potential threats before they escalate.

User Experience Monitoring 📈

Goal: Optimize user experience by monitoring application usage, speed, and behavior.

| Level | Component | Description | Tips & Examples | Additional Notes |
| --- | --- | --- | --- | --- |
| Basic | Track Page Speed | Monitor how fast your pages load for real users | Example: Identify if your checkout page slows down during peak traffic hours | Pro Tip: Focus on the most important user journeys first |
| Intermediate | Watch User Patterns Affected by External Factors | Track additional elements that can affect how users interact with your service | Example: Internet provider and location. Quick Win: Start by monitoring basic page load times | Did You Know? Small delays in page load times can significantly impact user retention |
| Advanced | Deep Networking Usage Analysis | Evaluate and analyze your network flow activity and status in depth | Example: Network Synthetics and Network Flow Monitor | Track deeper network interactions and user behavior |

Recommendations:​

  • Focus on key actions: Prioritize monitoring for actions that impact revenue or user satisfaction.
  • Monitor real user interactions: Don't rely only on synthetic tests; real user data provides more actionable insights.

Serverless Workload Monitoring ⚡

Goal: Effectively monitor and optimize serverless applications to ensure reliability and cost efficiency.

| Level | Component | Description | Tips & Examples | Additional Notes |
| --- | --- | --- | --- | --- |
| Basic | Lambda Function Best Practices | Monitor core Lambda metrics and execution stats | Example: Track invocations, duration, and error rates. Quick Win: Set up CloudWatch dashboards for Lambda insights | Pro Tip: Monitor cold starts and memory utilization to optimize costs |
| Intermediate | Event Source Monitoring | Track performance of event sources and integrations | Example: Monitor SQS queue depth, API Gateway latency. Quick Win: Set up dead-letter queues for failed events | Did You Know? Proper event source monitoring can prevent cascade failures |
| Advanced | Provided Summarized Insights | Leverage CloudWatch's specialized insight tools to gain automated, detailed analytics about your workload performance, resource utilization, and operational patterns across your serverless and containerized applications. | Example: Lambda Insights, Container Insights | Enable Lambda Insights at the account level using AWS CloudFormation to automatically collect detailed metrics for all new Lambda functions, while using Contributor Insights to identify top-consuming resources and potential bottlenecks. |
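
Cold starts and memory utilization, mentioned in the first row of the table, can be read from the `REPORT` line that Lambda writes to CloudWatch Logs at the end of each invocation. The sketch below parses that line with a regular expression. It assumes the field layout shown in the sample strings (with `Init Duration` present only on cold starts); the request IDs and numbers are made up, and `parse_report` is a hypothetical helper rather than an AWS API.

```python
import re

# Assumed layout of a Lambda REPORT log line; Init Duration is optional.
REPORT_PATTERN = re.compile(
    r"Duration: (?P<duration_ms>[\d.]+) ms\s+"
    r"Billed Duration: (?P<billed_ms>[\d.]+) ms\s+"
    r"Memory Size: (?P<memory_mb>\d+) MB\s+"
    r"Max Memory Used: (?P<used_mb>\d+) MB"
    r"(?:\s+Init Duration: (?P<init_ms>[\d.]+) ms)?"
)


def parse_report(line):
    """Extract duration, memory utilization, and cold-start flag, or None if no match."""
    match = REPORT_PATTERN.search(line)
    if match is None:
        return None
    return {
        "duration_ms": float(match["duration_ms"]),
        "billed_ms": float(match["billed_ms"]),
        "memory_utilization": int(match["used_mb"]) / int(match["memory_mb"]),
        "cold_start": match["init_ms"] is not None,
    }


# Made-up sample lines for a cold and a warm invocation.
cold = ("REPORT RequestId: 00000000-0000-0000-0000-000000000001\tDuration: 102.25 ms\t"
        "Billed Duration: 103 ms\tMemory Size: 128 MB\tMax Memory Used: 64 MB\t"
        "Init Duration: 156.78 ms")
warm = ("REPORT RequestId: 00000000-0000-0000-0000-000000000002\tDuration: 8.50 ms\t"
        "Billed Duration: 9 ms\tMemory Size: 128 MB\tMax Memory Used: 64 MB")

print(parse_report(cold))
print(parse_report(warm))
```

A function that consistently uses only a small fraction of its configured memory may be a candidate for a smaller memory setting, while a high share of cold starts points toward concurrency or packaging changes.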

Recommendations:​

  • Implement structured logging: Use a consistent JSON logging format for better searchability.
  • Monitor concurrency limits: Track function concurrency to prevent throttling.
  • Optimize costs: Set up cost allocation tags and monitor per-function costs.
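
As an illustration of the structured logging recommendation, here is a minimal sketch using only Python's standard `logging` and `json` modules. The `JsonFormatter` class, the `build_logger` helper, and the convention of passing fields through `extra={"context": {...}}` are choices made for this example, not a required or official pattern.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"context": {...}} are merged into the output.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name, stream=None):
    """Return a logger that writes JSON lines to stream (stderr by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated setup
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


logger = build_logger("orders")
logger.info("order processed", extra={"context": {"order_id": "A-123", "duration_ms": 42}})
```

Because every line is a JSON object with stable keys, a log query tool can filter on fields such as `order_id` or `duration_ms` directly instead of pattern matching on free text.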