Why Monitoring Matters for QA
Testing does not end when code reaches production. Monitoring is the continuation of quality assurance in the live environment. No test suite catches every bug, and some issues only appear under real-world traffic patterns, data volumes, and user behaviors.
For QA engineers, monitoring provides:
- Post-deployment validation: Confirm new releases work correctly in production
- Bug detection: Catch issues testing missed — memory leaks, race conditions, edge cases
- Root cause analysis: Correlate test failures with system behavior
- Non-functional validation: Verify performance, availability, and reliability meet requirements
The Three Pillars of Observability
Logs
Discrete events that describe what happened at a specific point in time.
```json
{
  "timestamp": "2024-01-15T10:23:45Z",
  "level": "ERROR",
  "service": "payment-service",
  "message": "Payment processing failed",
  "error": "timeout connecting to payment gateway",
  "userId": "usr_12345",
  "orderId": "ord_67890",
  "duration_ms": 30000
}
```
How QA uses logs:
- Investigating failed test cases by reading application logs
- Finding error patterns after deployment
- Debugging intermittent issues by searching for specific error messages
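Finding error patterns in structured logs can be automated. The sketch below is a minimal, hypothetical example (not tied to any particular log platform) that groups ERROR-level entries by their `error` field, using the same field names as the JSON example above.

```python
import json
from collections import Counter

def find_error_patterns(log_lines):
    """Count ERROR-level structured log entries, grouped by error message."""
    patterns = Counter()
    for line in log_lines:
        entry = json.loads(line)
        if entry.get("level") == "ERROR":
            patterns[entry.get("error", "unknown")] += 1
    return patterns

# Simulated log stream (one JSON object per line)
logs = [
    '{"level": "ERROR", "error": "timeout connecting to payment gateway"}',
    '{"level": "INFO",  "message": "Payment processed"}',
    '{"level": "ERROR", "error": "timeout connecting to payment gateway"}',
]
print(find_error_patterns(logs).most_common(1))
# [('timeout connecting to payment gateway', 2)]
```

In practice a log platform (e.g., the ELK Stack mentioned below) does this aggregation for you; the point is that structured logs make error patterns queryable, while free-text logs do not.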
Metrics
Aggregated measurements over time. Metrics tell you how the system is performing at a high level.
Types of metrics:
- Counter: Total number of events (requests, errors, sales)
- Gauge: Current value (active users, queue depth, CPU usage)
- Histogram: Distribution of values (response time percentiles)
Key metrics for QA:
| Metric | What It Tells You |
|---|---|
| Error rate | Percentage of failed requests |
| Response time (P50/P95/P99) | How fast the application responds |
| Throughput (RPS) | Requests per second being handled |
| Saturation | Resource utilization (CPU, memory, disk) |
| Availability | Uptime percentage |
Traces
Follow a single request as it travels through multiple services. Essential for microservices architectures.
User Request → API Gateway (5ms) → Auth Service (15ms) → Product Service (25ms) → Database (10ms)
Total: 55ms
How QA uses traces:
- Finding which service causes slow responses
- Understanding request flow for test design
- Identifying bottlenecks in the system
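The core idea of a trace is just timed, nested spans. The sketch below is a toy illustration, not a real tracing client (real systems use OpenTelemetry-style SDKs and propagate trace IDs across services); it simulates the request flow above with `time.sleep`.

```python
import time
from contextlib import contextmanager

spans = []  # collected (name, duration_ms) pairs for one request

@contextmanager
def span(name):
    """Time a named step of the request and record it as a span."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((name, (time.perf_counter() - start) * 1000))

# Simulated request flow: gateway wraps the downstream service calls
with span("api-gateway"):
    with span("auth-service"):
        time.sleep(0.015)      # stand-in for a 15ms auth check
    with span("product-service"):
        time.sleep(0.025)      # stand-in for a 25ms product lookup

for name, ms in spans:
    print(f"{name}: {ms:.1f}ms")
```

Because spans nest, the gateway span's duration includes its children, which is exactly what lets a trace viewer show where the time in a slow request went.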
SLIs, SLOs, and SLAs
| Term | Definition | Example |
|---|---|---|
| SLI (Service Level Indicator) | A measurement of service quality | 99.95% of requests succeed |
| SLO (Service Level Objective) | A target for an SLI | “We target 99.9% success rate” |
| SLA (Service Level Agreement) | A contractual obligation | “We guarantee 99.5% uptime” |
QA should help define SLOs and monitor SLIs to ensure the application meets its quality targets.
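An SLO also implies an error budget: the amount of failure the target tolerates per window. The arithmetic is simple enough to sketch:

```python
# Error budget implied by an availability SLO over a 30-day window.
slo = 0.999                             # "three nines" availability target
window_minutes = 30 * 24 * 60           # 43,200 minutes in 30 days
budget_minutes = window_minutes * (1 - slo)
print(f"Allowed downtime: {budget_minutes:.1f} minutes per 30 days")
# Allowed downtime: 43.2 minutes per 30 days
```

When the budget is spent, the usual response is to prioritize reliability work over new features until the SLI recovers; QA has a clear stake in that trade-off.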
Setting Up Alerts
Alert Design Principles
- Alert on symptoms, not causes. Alert on “error rate > 1%” rather than “CPU > 80%.” Users feel symptoms, not causes.
- Set meaningful thresholds. Too sensitive = alert fatigue. Too lenient = missed issues.
- Include actionable context. Every alert should tell you what is happening and what to do.
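The "set meaningful thresholds" principle often pairs a threshold with a hold duration (the `for:` clause in Prometheus rules), so a brief spike does not page anyone. A minimal sketch of that logic, with invented sample values:

```python
def should_fire(samples, threshold, hold):
    """Fire only if the value exceeds the threshold for `hold`
    consecutive samples -- a single spike is ignored."""
    streak = 0
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak >= hold:
            return True
    return False

# One-minute error-rate samples; threshold 1%, hold 5 minutes.
# A lone 3% spike does not alert; five sustained minutes above 1% does.
assert not should_fire([0.002, 0.03, 0.004, 0.002, 0.001], 0.01, hold=5)
assert should_fire([0.02, 0.03, 0.02, 0.05, 0.02], 0.01, hold=5)
```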
Example Alert Configuration
```yaml
# Prometheus alerting rules
groups:
  - name: qa-quality-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1% for 5 minutes"
          runbook: "https://wiki.example.com/runbooks/high-error-rate"
      - alert: SlowResponses
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P95 response time above 500ms"
```
Exercise: Design a Monitoring Dashboard for QA
Create a dashboard specification for monitoring a web application post-deployment. Include the metrics, visualization types, and alert thresholds.
Solution
QA Monitoring Dashboard
Row 1: Health Overview
- Availability (gauge): Current uptime %. Alert < 99.9%
- Error rate (time series): 5xx errors over time. Alert > 1%
- Active users (gauge): Current connected users
Row 2: Performance
- Response time P50/P95/P99 (time series): Response time percentiles. Alert P95 > 500ms
- Throughput (time series): Requests per second
- Slow endpoints (table): Top 10 slowest endpoints
Row 3: Business Metrics
- Conversion rate (time series): Orders / visits. Alert > 10% drop
- Payment success rate (gauge): Successful payments %. Alert < 99%
- Cart abandonment rate (time series): Alert > 20% increase
Row 4: Infrastructure
- CPU/Memory usage (time series): Per service. Alert > 80%
- Pod restarts (counter): Alert > 0 in 15 minutes
- Database connections (gauge): Active vs. pool limit. Alert > 80%
Monitoring Tools
| Tool | Type | Best For |
|---|---|---|
| Prometheus | Metrics | Time-series data collection and alerting |
| Grafana | Visualization | Dashboards for metrics from any data source |
| ELK Stack | Logs | Log aggregation, search, and analysis |
| Jaeger / Zipkin | Traces | Distributed tracing |
| Datadog | All-in-one | Metrics, logs, traces, APM (SaaS) |
| New Relic | All-in-one | APM, infrastructure, logs (SaaS) |
| PagerDuty | Alerting | Incident management and on-call routing |
Key Takeaways
- Monitoring extends QA into production — testing does not stop at deployment
- Use all three pillars — logs for detail, metrics for trends, traces for request flow
- Define SLOs before monitoring — know what “good” looks like before measuring
- Alert on symptoms, not causes — users experience symptoms
- Build QA-specific dashboards — focus on quality metrics, not just infrastructure