The ELK Stack for QA

ELK stands for Elasticsearch, Logstash, and Kibana — three open-source tools that together form a powerful log management platform. For QA engineers, ELK provides the ability to search, analyze, and visualize application logs at scale.

Components

Elasticsearch: The search engine that stores and indexes log data. It allows lightning-fast full-text searches across billions of log entries.

Logstash: The data processing pipeline that ingests logs from various sources, transforms them, and sends them to Elasticsearch. It can parse different log formats, enrich data with metadata, and filter out noise.

Kibana: The visualization layer. It provides a web interface for searching logs, building dashboards, and creating alerts. This is where QA engineers spend most of their time.

Filebeat (one of the Beats, the “B” that extends ELK into the Elastic Stack): A lightweight log shipper installed on application servers that sends logs to Logstash or directly to Elasticsearch.

Kibana for Log Investigation

Searching Logs

The Kibana Discover view is your primary tool for log investigation. The examples below use Lucene query syntax:

# Find all errors in the payment service in the last hour
service: "payment-service" AND level: "ERROR"

# Find timeout errors
message: "timeout" OR message: "timed out"

# Find errors for a specific user
userId: "usr_12345" AND level: "ERROR"

# Find errors during a specific test run
timestamp: [2024-01-15T10:00:00 TO 2024-01-15T10:30:00] AND level: "ERROR"
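The same searches can be issued programmatically, for example from a test teardown hook, by building the equivalent Elasticsearch query DSL body. A minimal sketch (the field names `service`, `level`, and `timestamp` mirror the queries above; adjust them to your actual index mapping):

```python
def error_query(service, start_iso, end_iso, level="ERROR"):
    """Build an Elasticsearch query body equivalent to the Kibana
    searches above: errors for one service in a time window.
    Field names are assumptions -- match your index mapping."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"service": service}},
                    {"term": {"level": level}},
                    {"range": {"timestamp": {"gte": start_iso,
                                             "lte": end_iso}}},
                ]
            }
        }
    }
```

The body can then be passed to any Elasticsearch client's search call against your log index.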

Correlation Workflow

When a test fails, follow this workflow in Kibana:

  1. Note the exact time of the test failure
  2. Search for errors in that time window (±5 minutes)
  3. Filter by service to narrow down the source
  4. Expand the log entry to see full details (stack trace, request ID)
  5. Search by request ID to trace the request across services
  6. Check related services for cascading failures
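Steps 1–2 and 4–5 of this workflow can be scripted so a CI job pre-computes the search window and collects request IDs automatically. A small sketch (the `requestId` field name is an assumption):

```python
from datetime import datetime, timedelta

def failure_window(failure_time_iso, minutes=5):
    """Steps 1-2: build the +/-5 minute search window around a
    test failure timestamp (ISO 8601 string in, ISO strings out)."""
    t = datetime.fromisoformat(failure_time_iso)
    delta = timedelta(minutes=minutes)
    return (t - delta).isoformat(), (t + delta).isoformat()

def request_ids(log_entries):
    """Steps 4-5: collect request IDs from expanded log entries so
    the request can be traced across services."""
    return {e["requestId"] for e in log_entries if "requestId" in e}
```

Each collected request ID then becomes a follow-up Kibana search across all services.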

Building Visualizations

Kibana lets you create visualizations from log data:

  • Line chart: Error count over time (spot trends after deployments)
  • Pie chart: Error distribution by service (which service has the most issues)
  • Data table: Top error messages (most common failures)
  • Metric: Total error count in the last hour
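Behind these visualizations sit Elasticsearch aggregations. A sketch of the aggregation body that would drive the line chart and pie chart above (error count per service, bucketed over time; field names assumed as before):

```python
def errors_by_service_agg(interval="1h"):
    """Aggregation behind the charts above: error count bucketed
    by service (pie chart) and by time (line chart).
    Field names are assumptions -- match your index mapping."""
    return {
        "size": 0,  # we only want aggregation buckets, not documents
        "query": {"term": {"level": "ERROR"}},
        "aggs": {
            "per_service": {
                "terms": {"field": "service", "size": 10},
                "aggs": {
                    "over_time": {
                        "date_histogram": {
                            "field": "timestamp",
                            "fixed_interval": interval,
                        }
                    }
                },
            }
        },
    }
```

Kibana builds this for you in the visualization editor, but knowing the underlying shape helps when a chart shows unexpected numbers.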

Grafana for Metrics Dashboards

Grafana excels at visualizing time-series metrics from Prometheus, InfluxDB, Elasticsearch, and many other data sources.

Building a QA Dashboard

Panel 1: Test Execution Trend
  Query: count of test runs by status (pass/fail) over time
  Type: Time series (stacked)

Panel 2: Flaky Test Rate
  Query: (tests that changed result in consecutive runs) / total tests
  Type: Gauge with threshold colors

Panel 3: Pipeline Duration
  Query: average pipeline execution time by stage
  Type: Bar chart

Panel 4: Deployment Success Rate
  Query: successful deployments / total deployments
  Type: Stat with sparkline

Panel 5: Application Error Rate (post-deployment)
  Query: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
  Type: Time series with alert threshold line
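The Panel 2 query deserves a concrete definition, since “flaky” is easy to measure inconsistently. One common definition, sketched below: a test is flaky if its result changed between any two consecutive runs in the window.

```python
def flaky_rate(history):
    """Panel 2 logic: fraction of tests whose result flipped between
    consecutive runs. `history` maps test name -> results in run
    order, e.g. ["pass", "fail", "pass"]."""
    flaky = sum(
        1
        for results in history.values()
        if any(a != b for a, b in zip(results, results[1:]))
    )
    return flaky / len(history) if history else 0.0
```

The resulting ratio is what the gauge panel displays, with thresholds (say, green below 2%, red above 5%) set in the panel options.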

Annotations

Grafana annotations mark events on time-series graphs. Mark deployments on your metrics graphs to correlate metrics changes with releases:

{
  "time": 1705315200000,
  "text": "Deployed v2.3.1 to production",
  "tags": ["deployment", "production"]
}

This lets you see at a glance that “error rate increased 5 minutes after deployment v2.3.1.”
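In practice the annotation is posted by the CI deploy job. A sketch that builds the payload shown above for Grafana's annotations HTTP API (the URL and token in the comment are hypothetical placeholders):

```python
from datetime import datetime, timezone

def deployment_annotation(version, env, when=None):
    """Build the annotation payload shown above. Grafana's HTTP API
    expects epoch time in milliseconds."""
    when = when or datetime.now(timezone.utc)
    return {
        "time": int(when.timestamp() * 1000),
        "text": f"Deployed {version} to {env}",
        "tags": ["deployment", env],
    }

# A deploy job would POST this, e.g. (hypothetical host and token):
# requests.post("https://grafana.example.com/api/annotations",
#               json=deployment_annotation("v2.3.1", "production"),
#               headers={"Authorization": "Bearer <token>"})
```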

Practical Log Analysis Patterns

Pattern: Post-Deployment Validation

After each deployment, automatically query logs for:

  1. New error types not seen before the deployment
  2. Error rate changes compared to pre-deployment baseline
  3. Slow query warnings or timeout errors
  4. Configuration-related errors
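Checks 1 and 2 are simple set and ratio operations once the two log windows are fetched. A sketch (the `message` field name is an assumption; real implementations usually normalize messages before comparing, e.g. stripping IDs and timestamps):

```python
def new_error_types(baseline_logs, current_logs):
    """Check 1: error messages seen after the deployment but never
    during the pre-deployment baseline window."""
    baseline = {e["message"] for e in baseline_logs}
    return sorted({e["message"] for e in current_logs} - baseline)

def error_rate_change(baseline_errors, baseline_total,
                      current_errors, current_total):
    """Check 2: error-rate delta vs. the pre-deployment baseline
    (positive result means the rate went up)."""
    return current_errors / current_total - baseline_errors / baseline_total
```

A post-deployment gate can then fail the pipeline when `new_error_types` is non-empty or the rate delta exceeds an agreed threshold.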

Pattern: Test Failure Root Cause

When an E2E test fails with a generic error (e.g., “element not found”), the real cause is often in the backend:

  1. Get the timestamp of the test failure
  2. Search backend logs for errors at that time
  3. Common findings: database timeout, null pointer exception, failed third-party API call

Pattern: Performance Regression Detection

Compare log-based latency metrics before and after deployment:

  • Average response time by endpoint
  • P99 response times
  • Database query durations
  • External API call durations
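The before/after comparison can be automated once the latency samples are extracted from the logs. A sketch using the nearest-rank percentile method and a relative-growth threshold (the 20% default is an illustrative choice, not a standard):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of raw latency samples (e.g. ms)."""
    ordered = sorted(samples)
    idx = math.ceil(len(ordered) * p / 100) - 1
    return ordered[idx]

def regressed(before, after, threshold=1.2):
    """Flag a regression when average or P99 latency grew by more
    than `threshold` (default 20%) after the deployment."""
    avg_grew = sum(after) / len(after) > threshold * (sum(before) / len(before))
    p99_grew = percentile(after, 99) > threshold * percentile(before, 99)
    return avg_grew or p99_grew
```

Comparing P99 as well as the average matters: a regression that only affects the slowest requests is invisible in the mean but obvious in the tail.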

Exercise: Investigate a Production Incident Using Logs

Scenario: After a deployment at 14:00, users report that the checkout page is slow. Error rate has increased from 0.5% to 3%. Use log analysis to find the root cause.

Solution

Step 1: Kibana search for errors after deployment

timestamp: [2024-01-15T14:00:00 TO 2024-01-15T14:30:00] AND level: "ERROR"

Result: 247 errors found, mostly from “order-service”

Step 2: Filter to order-service errors

service: "order-service" AND level: "ERROR" AND timestamp: [14:00 TO 14:30]

Result: “Connection timeout to inventory-service” (180 occurrences)

Step 3: Check inventory-service logs

service: "inventory-service" AND level: ("ERROR" OR "WARN") AND timestamp: [14:00 TO 14:30]

Result: “Database connection pool exhausted” (repeated warnings)

Step 4: Check inventory-service database metrics in Grafana

  • Connection pool: 50/50 (maxed out)
  • Active queries: 48 (vs. normal 10-15)
  • Slow queries: 35 queries taking >5s (vs. normal 0)

Root cause: The new deployment introduced a database query without an index. Under production traffic, this query takes 5+ seconds instead of 50ms, exhausting the connection pool. The inventory-service stops responding, causing checkout timeouts.

Fix: Add the missing database index. Roll back the deployment until the fix is ready.

Key Takeaways

  1. ELK is essential for QA log investigation — search and filter logs to find root causes
  2. Grafana visualizes the big picture — dashboards show trends that individual logs cannot
  3. Correlate timestamps — match test failure times with log entries for fast debugging
  4. Mark deployments on dashboards — annotations connect metrics changes to releases
  5. Build reusable queries — save common investigation queries in Kibana for quick access