Why Monitoring Matters for QA

Testing does not end when code reaches production. Monitoring is the continuation of quality assurance in the live environment. No test suite catches every bug, and some issues only appear under real-world traffic patterns, data volumes, and user behaviors.

For QA engineers, monitoring provides:

  • Post-deployment validation: Confirm new releases work correctly in production
  • Bug detection: Catch issues testing missed — memory leaks, race conditions, edge cases
  • Root cause analysis: Correlate test failures with system behavior
  • Non-functional validation: Verify performance, availability, and reliability meet requirements

The Three Pillars of Observability

Logs

Discrete events that describe what happened at a specific point in time.

{
  "timestamp": "2024-01-15T10:23:45Z",
  "level": "ERROR",
  "service": "payment-service",
  "message": "Payment processing failed",
  "error": "timeout connecting to payment gateway",
  "userId": "usr_12345",
  "orderId": "ord_67890",
  "duration_ms": 30000
}

How QA uses logs:

  • Investigating failed test cases by reading application logs
  • Finding error patterns after deployment
  • Debugging intermittent issues by searching for specific error messages
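Structured logs like the JSON entry above lend themselves to programmatic analysis. As a minimal sketch (assuming one JSON object per line, with `level` and `error` fields as in the sample entry), error patterns can be counted like this:

```python
import json
from collections import Counter

def error_patterns(log_lines):
    """Count ERROR-level entries by their error message."""
    counts = Counter()
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines (e.g. stack traces)
        if entry.get("level") == "ERROR":
            counts[entry.get("error", "unknown")] += 1
    return counts

logs = [
    '{"level": "INFO", "message": "Payment ok"}',
    '{"level": "ERROR", "error": "timeout connecting to payment gateway"}',
    '{"level": "ERROR", "error": "timeout connecting to payment gateway"}',
    '{"level": "ERROR", "error": "card declined"}',
    'not json at all',
]
print(error_patterns(logs).most_common(1))
# → [('timeout connecting to payment gateway', 2)]
```

The same grouping idea is what log platforms such as the ELK Stack do at scale.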

Metrics

Aggregated measurements over time. Metrics tell you how the system is performing at a high level.

Types of metrics:

  • Counter: Total number of events (requests, errors, sales)
  • Gauge: Current value (active users, queue depth, CPU usage)
  • Histogram: Distribution of values (response time percentiles)
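The semantics of the three metric types can be illustrated with a toy in-memory implementation (a sketch, not a real metrics library; real systems such as Prometheus store histograms as buckets rather than raw samples):

```python
import math

class RequestCounter:
    """Counter: monotonically increasing total (e.g. requests served)."""
    def __init__(self):
        self.value = 0
    def inc(self, amount=1):
        self.value += amount

class QueueGauge:
    """Gauge: point-in-time value that can go up or down (e.g. queue depth)."""
    def __init__(self):
        self.value = 0
    def set(self, value):
        self.value = value

class LatencyHistogram:
    """Histogram: records observations so percentiles can be derived."""
    def __init__(self):
        self.samples = []
    def observe(self, value):
        self.samples.append(value)
    def percentile(self, p):
        # nearest-rank percentile over the recorded samples
        ordered = sorted(self.samples)
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

requests = RequestCounter(); requests.inc(); requests.inc()
queue = QueueGauge(); queue.set(7)
latency = LatencyHistogram()
for ms in [12, 15, 18, 20, 25, 30, 45, 60, 80, 300]:
    latency.observe(ms)
print(requests.value, queue.value, latency.percentile(95))
# → 2 7 300
```

Note how the single 300ms outlier dominates P95 while leaving P50 untouched, which is exactly why percentile metrics matter more than averages.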

Key metrics for QA:

Metric                         What It Tells You
Error rate                     Percentage of failed requests
Response time (P50/P95/P99)    How fast the application responds
Throughput (RPS)               Requests per second being handled
Saturation                     Resource utilization (CPU, memory, disk)
Availability                   Uptime percentage

Traces

A trace follows a single request as it travels through multiple services. Traces are essential in microservices architectures, where a single request may touch many components.

User Request → API Gateway (5ms) → Auth Service (15ms) → Product Service (25ms) → Database (10ms)
Total: 55ms

How QA uses traces:

  • Finding which service causes slow responses
  • Understanding request flow for test design
  • Identifying bottlenecks in the system
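The request flow shown above can be approximated with a minimal span recorder (a sketch of the idea behind tracers like Jaeger, not their actual API):

```python
import time
from contextlib import contextmanager

spans = []  # (name, duration_ms) pairs, appended as each span completes

@contextmanager
def span(name):
    """Time the wrapped block and record it, like a trace span."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((name, (time.perf_counter() - start) * 1000))

with span("request"):
    with span("auth-service"):
        time.sleep(0.01)   # stand-in for a downstream call
    with span("product-service"):
        time.sleep(0.02)

for name, ms in spans:
    print(f"{name}: {ms:.1f}ms")
```

Inner spans complete first, and the outer "request" span necessarily lasts at least as long as the sum of its children, which is how a trace pinpoints the slow service.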

SLIs, SLOs, and SLAs

Term                           Definition                        Example
SLI (Service Level Indicator)  A measurement of service quality  99.95% of requests succeed
SLO (Service Level Objective)  A target for an SLI               “We target 99.9% success rate”
SLA (Service Level Agreement)  A contractual obligation          “We guarantee 99.5% uptime”

QA should help define SLOs and monitor SLIs to ensure the application meets its quality targets.
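A useful consequence of an SLO is the error budget: the amount of failure the target still permits. As a minimal sketch (the numbers are illustrative):

```python
def sli_success_rate(total, failed):
    """SLI: the fraction of requests that succeeded."""
    return (total - failed) / total

def error_budget_remaining(total, failed, slo=0.999):
    """Fraction of the allowed-failure budget still unspent (1.0 = untouched).

    With slo=0.999, up to 0.1% of requests may fail before the SLO is breached.
    """
    allowed_failures = total * (1 - slo)
    return 1 - failed / allowed_failures

total, failed = 1_000_000, 400
sli = sli_success_rate(total, failed)
budget = error_budget_remaining(total, failed)
print(f"SLI: {sli:.2%}, error budget remaining: {budget:.0%}")
# → SLI: 99.96%, error budget remaining: 60%
```

When the remaining budget approaches zero, that is a signal to slow down releases and prioritize reliability work over new features.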

Setting Up Alerts

Alert Design Principles

  1. Alert on symptoms, not causes. Alert on “error rate > 1%” rather than “CPU > 80%.” Users feel symptoms, not causes.
  2. Set meaningful thresholds. Too sensitive = alert fatigue. Too lenient = missed issues.
  3. Include actionable context. Every alert should tell you what is happening and what to do.

Example Alert Configuration

# Prometheus alerting rules
groups:
  - name: qa-quality-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1% for 5 minutes"
          runbook: "https://wiki.example.com/runbooks/high-error-rate"

      - alert: SlowResponses
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P95 response time above 500ms"
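The `for:` clause above means the condition must hold continuously before the alert fires, which filters out brief spikes. A simplified sketch of that behavior (assuming one error-rate sample per minute; real Prometheus evaluates at its configured interval):

```python
def evaluate_alert(samples, threshold=0.01, for_minutes=5):
    """Return the minute at which the alert fires, or None.

    The alert fires only after the threshold has been breached
    continuously for `for_minutes`, mirroring the `for:` clause.
    """
    breached_since = None
    for minute, rate in enumerate(samples):
        if rate > threshold:
            if breached_since is None:
                breached_since = minute
            if minute - breached_since + 1 >= for_minutes:
                return minute
        else:
            breached_since = None  # any recovery resets the timer
    return None

# A brief spike (minutes 2-3) does not fire; the sustained breach
# starting at minute 5 fires at minute 9, its fifth consecutive minute.
rates = [0.002, 0.003, 0.02, 0.03, 0.004, 0.02, 0.02, 0.02, 0.02, 0.02]
print(evaluate_alert(rates))
# → 9
```

This is the mechanism that keeps a transient blip from paging someone at 3 a.m. while still catching genuine degradations.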

Exercise: Design a Monitoring Dashboard for QA

Create a dashboard specification for monitoring a web application post-deployment. Include the metrics, visualization types, and alert thresholds.

Solution

QA Monitoring Dashboard

Row 1: Health Overview

  • Availability (gauge): Current uptime %. Alert < 99.9%
  • Error rate (time series): 5xx errors over time. Alert > 1%
  • Active users (gauge): Current connected users

Row 2: Performance

  • Response time P50/P95/P99 (time series): Response time percentiles. Alert P95 > 500ms
  • Throughput (time series): Requests per second
  • Slow endpoints (table): Top 10 slowest endpoints

Row 3: Business Metrics

  • Conversion rate (time series): Orders / visits. Alert > 10% drop
  • Payment success rate (gauge): Successful payments %. Alert < 99%
  • Cart abandonment rate (time series): Alert > 20% increase

Row 4: Infrastructure

  • CPU/Memory usage (time series): Per service. Alert > 80%
  • Pod restarts (counter): Alert > 0 in 15 minutes
  • Database connections (gauge): Active vs. pool limit. Alert > 80%
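A dashboard specification like the one above can also be captured as data, which makes it reviewable and lintable. A hypothetical sketch covering the first two rows (panel names and thresholds are taken from the solution; the structure itself is an illustration, not any tool's real format):

```python
# Declarative sketch of the dashboard spec: each panel records its
# visualization type and, where the solution defines one, an alert rule.
DASHBOARD = {
    "Health Overview": [
        {"metric": "availability", "viz": "gauge", "alert": "< 99.9%"},
        {"metric": "error_rate", "viz": "time series", "alert": "> 1%"},
        {"metric": "active_users", "viz": "gauge", "alert": None},
    ],
    "Performance": [
        {"metric": "response_time_p95", "viz": "time series", "alert": "> 500ms"},
        {"metric": "throughput_rps", "viz": "time series", "alert": None},
        {"metric": "slow_endpoints", "viz": "table", "alert": None},
    ],
}

# A simple lint rule: every row must contain at least one alerting panel.
for row, panels in DASHBOARD.items():
    assert any(p["alert"] for p in panels), f"{row} has no alert"
print("dashboard spec OK")
```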

Monitoring Tools

Tool             Type           Best For
Prometheus       Metrics        Time-series data collection and alerting
Grafana          Visualization  Dashboards for metrics from any data source
ELK Stack        Logs           Log aggregation, search, and analysis
Jaeger / Zipkin  Traces         Distributed tracing
Datadog          All-in-one     Metrics, logs, traces, APM (SaaS)
New Relic        All-in-one     APM, infrastructure, logs (SaaS)
PagerDuty        Alerting       Incident management and on-call routing

Key Takeaways

  1. Monitoring extends QA into production — testing does not stop at deployment
  2. Use all three pillars — logs for detail, metrics for trends, traces for request flow
  3. Define SLOs before monitoring — know what “good” looks like before measuring
  4. Alert on symptoms, not causes — users experience symptoms
  5. Build QA-specific dashboards — focus on quality metrics, not just infrastructure