Why Monitoring Matters for QA
Testing does not end when code reaches production. Monitoring is the continuation of quality assurance in the live environment. No test suite catches every bug, and some issues only appear under real-world traffic patterns, data volumes, and user behaviors.
For QA engineers, monitoring provides:
- Post-deployment validation: Confirm new releases work correctly in production
- Bug detection: Catch issues testing missed — memory leaks, race conditions, edge cases
- Root cause analysis: Correlate test failures with system behavior
- Non-functional validation: Verify performance, availability, and reliability meet requirements
The Three Pillars of Observability
Logs
Discrete events that describe what happened at a specific point in time.
```json
{
  "timestamp": "2024-01-15T10:23:45Z",
  "level": "ERROR",
  "service": "payment-service",
  "message": "Payment processing failed",
  "error": "timeout connecting to payment gateway",
  "userId": "usr_12345",
  "orderId": "ord_67890",
  "duration_ms": 30000
}
```
How QA uses logs:
- Investigating failed test cases by reading application logs
- Finding error patterns after deployment
- Debugging intermittent issues by searching for specific error messages
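Finding error patterns in structured logs can be automated. The sketch below is a minimal, hypothetical example (not tied to any particular log platform) that groups ERROR-level entries by their `error` field, using the same field names as the JSON example above.

```python
import json
from collections import Counter

def find_error_patterns(log_lines):
    """Count ERROR-level structured log entries, grouped by error message."""
    patterns = Counter()
    for line in log_lines:
        entry = json.loads(line)
        if entry.get("level") == "ERROR":
            patterns[entry.get("error", "unknown")] += 1
    return patterns

# Simulated log stream (one JSON object per line)
logs = [
    '{"level": "ERROR", "error": "timeout connecting to payment gateway"}',
    '{"level": "INFO",  "message": "Payment processed"}',
    '{"level": "ERROR", "error": "timeout connecting to payment gateway"}',
]
print(find_error_patterns(logs).most_common(1))
# [('timeout connecting to payment gateway', 2)]
```

In practice a log platform (e.g., the ELK Stack mentioned below) does this aggregation for you; the point is that structured logs make error patterns queryable, while free-text logs do not.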
Metrics
Aggregated measurements over time. Metrics tell you how the system is performing at a high level.
Types of metrics:
- Counter: Total number of events (requests, errors, sales)
- Gauge: Current value (active users, queue depth, CPU usage)
- Histogram: Distribution of values (response time percentiles)
Key metrics for QA:
| Metric | What It Tells You |
|---|---|
| Error rate | Percentage of failed requests |
| Response time (P50/P95/P99) | How fast the application responds |
| Throughput (RPS) | Requests per second being handled |
| Saturation | Resource utilization (CPU, memory, disk) |
| Availability | Uptime percentage |
Traces
Follow a single request as it travels through multiple services. Essential for microservices architectures.
User Request → API Gateway (5ms) → Auth Service (15ms) → Product Service (25ms) → Database (10ms)
Total: 55ms
How QA uses traces:
- Finding which service causes slow responses
- Understanding request flow for test design
- Identifying bottlenecks in the system
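The core idea of a trace is just timed, nested spans. The sketch below is a toy illustration, not a real tracing client (real systems use OpenTelemetry-style SDKs and propagate trace IDs across services); it simulates the request flow above with `time.sleep`.

```python
import time
from contextlib import contextmanager

spans = []  # collected (name, duration_ms) pairs for one request

@contextmanager
def span(name):
    """Time a named step of the request and record it as a span."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((name, (time.perf_counter() - start) * 1000))

# Simulated request flow: gateway wraps the downstream service calls
with span("api-gateway"):
    with span("auth-service"):
        time.sleep(0.015)      # stand-in for a 15ms auth check
    with span("product-service"):
        time.sleep(0.025)      # stand-in for a 25ms product lookup

for name, ms in spans:
    print(f"{name}: {ms:.1f}ms")
```

Because spans nest, the gateway span's duration includes its children, which is exactly what lets a trace viewer show where the time in a slow request went.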
SLIs, SLOs, and SLAs
| Term | Definition | Example |
|---|---|---|
| SLI (Service Level Indicator) | A measurement of service quality | 99.95% of requests succeed |
| SLO (Service Level Objective) | A target for an SLI | “We target 99.9% success rate” |
| SLA (Service Level Agreement) | A contractual obligation | “We guarantee 99.5% uptime” |
QA should help define SLOs and monitor SLIs to ensure the application meets its quality targets.
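An SLO also implies an error budget: the amount of failure the target tolerates per window. The arithmetic is simple enough to sketch:

```python
# Error budget implied by an availability SLO over a 30-day window.
slo = 0.999                             # "three nines" availability target
window_minutes = 30 * 24 * 60           # 43,200 minutes in 30 days
budget_minutes = window_minutes * (1 - slo)
print(f"Allowed downtime: {budget_minutes:.1f} minutes per 30 days")
# Allowed downtime: 43.2 minutes per 30 days
```

When the budget is spent, the usual response is to prioritize reliability work over new features until the SLI recovers; QA has a clear stake in that trade-off.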
Setting Up Alerts
Alert Design Principles
- Alert on symptoms, not causes. Alert on “error rate > 1%” rather than “CPU > 80%.” Users feel symptoms, not causes.
- Set meaningful thresholds. Too sensitive = alert fatigue. Too lenient = missed issues.
- Include actionable context. Every alert should tell you what is happening and what to do.
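The "set meaningful thresholds" principle often pairs a threshold with a hold duration (the `for:` clause in Prometheus rules), so a brief spike does not page anyone. A minimal sketch of that logic, with invented sample values:

```python
def should_fire(samples, threshold, hold):
    """Fire only if the value exceeds the threshold for `hold`
    consecutive samples -- a single spike is ignored."""
    streak = 0
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak >= hold:
            return True
    return False

# One-minute error-rate samples; threshold 1%, hold 5 minutes.
# A lone 3% spike does not alert; five sustained minutes above 1% does.
assert not should_fire([0.002, 0.03, 0.004, 0.002, 0.001], 0.01, hold=5)
assert should_fire([0.02, 0.03, 0.02, 0.05, 0.02], 0.01, hold=5)
```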
Example Alert Configuration
```yaml
# Prometheus alerting rules
groups:
  - name: qa-quality-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1% for 5 minutes"
          runbook: "https://wiki.example.com/runbooks/high-error-rate"
      - alert: SlowResponses
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P95 response time above 500ms"
```
Exercise: Design a Monitoring Dashboard for QA
Create a dashboard specification for monitoring a web application post-deployment. Include the metrics, visualization types, and alert thresholds.
Solution
QA Monitoring Dashboard
Row 1: Health Overview
- Availability (gauge): Current uptime %. Alert < 99.9%
- Error rate (time series): 5xx errors over time. Alert > 1%
- Active users (gauge): Current connected users
Row 2: Performance
- Response time P50/P95/P99 (time series): Response time percentiles. Alert P95 > 500ms
- Throughput (time series): Requests per second
- Slow endpoints (table): Top 10 slowest endpoints
Row 3: Business Metrics
- Conversion rate (time series): Orders / visits. Alert > 10% drop
- Payment success rate (gauge): Successful payments %. Alert < 99%
- Cart abandonment rate (time series): Alert > 20% increase
Row 4: Infrastructure
- CPU/Memory usage (time series): Per service. Alert > 80%
- Pod restarts (counter): Alert > 0 in 15 minutes
- Database connections (gauge): Active vs. pool limit. Alert > 80%
Monitoring Tools
| Tool | Type | Best For |
|---|---|---|
| Prometheus | Metrics | Time-series data collection and alerting |
| Grafana | Visualization | Dashboards for metrics from any data source |
| ELK Stack | Logs | Log aggregation, search, and analysis |
| Jaeger / Zipkin | Traces | Distributed tracing |
| Datadog | All-in-one | Metrics, logs, traces, APM (SaaS) |
| New Relic | All-in-one | APM, infrastructure, logs (SaaS) |
| PagerDuty | Alerting | Incident management and on-call routing |
Key Takeaways
- Monitoring extends QA into production — testing does not stop at deployment
- Use all three pillars — logs for detail, metrics for trends, traces for request flow
- Define SLOs before monitoring — know what “good” looks like before measuring
- Alert on symptoms, not causes — users experience symptoms
- Build QA-specific dashboards — focus on quality metrics, not just infrastructure