TL;DR

  • What: Drift detection identifies differences between your IaC definitions and actual deployed infrastructure
  • Why: Manual changes, failed deployments, and external modifications cause configuration drift that leads to incidents
  • Tools: Terraform plan, AWS Config, driftctl, Spacelift, env0
  • Key metric: 100% of resources tracked with drift alerts firing within 1 hour of change
  • Start here: Schedule daily terraform plan runs and alert on any detected changes

In 2025, 67% of production incidents were traced back to configuration drift—differences between intended and actual infrastructure state. When engineers make “quick fixes” through cloud consoles or automation fails silently, infrastructure diverges from code. Drift detection catches these gaps before they cause outages.

This guide covers implementing comprehensive drift detection across your infrastructure. You’ll learn to detect drift with Terraform and specialized tools, set up automated monitoring, and establish processes that prevent drift from occurring.

What you’ll learn:

  • How to implement drift detection with Terraform and AWS Config
  • Automated monitoring and alerting for infrastructure changes
  • Drift remediation strategies and workflows
  • Prevention techniques that eliminate manual changes
  • Best practices from organizations managing thousands of resources

Understanding Infrastructure Drift

What is Infrastructure Drift?

Infrastructure drift occurs when the actual state of deployed resources differs from the desired state defined in your Infrastructure as Code. This gap between code and reality can be caused by:

  • Manual changes through cloud consoles (ClickOps)
  • Failed or partial deployments
  • External systems modifying resources
  • Auto-scaling or self-healing mechanisms
  • Shadow IT creating untracked resources

Why It Matters

Drift creates serious operational risks:

  • Incident risk: Unknown configurations cause unexpected behavior
  • Compliance violations: Manual changes bypass security controls
  • Deployment failures: Terraform apply fails due to state mismatch
  • Audit gaps: Actual infrastructure doesn’t match documented state

Types of Drift

TypeDescriptionExample
Configuration driftResource properties differ from IaCSecurity group rules modified via console
State driftResources exist but aren’t in state fileManually created S3 bucket
Orphaned resourcesResources in state but deleted externallyEC2 instance terminated manually
Shadow resourcesResources not managed by IaC at allDeveloper-created test databases

Implementing Drift Detection with Terraform

Prerequisites

Before starting, ensure you have:

  • Terraform 1.5+ installed
  • Remote state backend configured (S3, Azure Blob, GCS)
  • CI/CD pipeline (GitHub Actions, GitLab CI, Jenkins)
  • Cloud provider credentials with read access

Step 1: Basic Drift Detection with Terraform Plan

The simplest drift detection uses terraform plan:

# Run plan and capture output
terraform plan -detailed-exitcode -out=plan.out

# Exit codes:
# 0 = No changes (no drift)
# 1 = Error
# 2 = Changes detected (drift!)

Automated detection script:

#!/bin/bash
set -e

cd /path/to/terraform

terraform init -input=false
terraform plan -detailed-exitcode -out=plan.out

EXIT_CODE=$?

if [ $EXIT_CODE -eq 2 ]; then
    echo "DRIFT DETECTED!"
    terraform show -json plan.out > drift-report.json
    # Send alert
    curl -X POST "$SLACK_WEBHOOK" \
        -H "Content-Type: application/json" \
        -d '{"text": "Infrastructure drift detected! Review: '$BUILD_URL'"}'
    exit 1
elif [ $EXIT_CODE -eq 1 ]; then
    echo "Terraform plan failed"
    exit 1
else
    echo "No drift detected"
fi

Step 2: Scheduled Drift Detection in CI/CD

GitHub Actions workflow:

name: Drift Detection

on:
  schedule:

    - cron: '0 */6 * * *'  # Every 6 hours
  workflow_dispatch:

jobs:
  drift-check:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        environment: [production, staging]

    steps:

      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.6.0

      - name: Terraform Init
        working-directory: terraform/${{ matrix.environment }}
        run: terraform init

      - name: Check for Drift
        id: plan
        working-directory: terraform/${{ matrix.environment }}
        run: |
          terraform plan -detailed-exitcode -out=plan.out
        continue-on-error: true

      - name: Report Drift
        if: steps.plan.outcome == 'failure'
        uses: slackapi/slack-github-action@v1.24.0
        with:
          payload: |
            {
              "text": "Drift detected in ${{ matrix.environment }}",
              "blocks": [
                {
                  "type": "section",
                  "text": {
                    "type": "mrkdwn",
                    "text": "*Infrastructure Drift Alert*\nEnvironment: ${{ matrix.environment }}\nWorkflow: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
                  }
                }
              ]
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}

Step 3: Detailed Drift Analysis

Parse terraform plan output for specifics:

# Generate JSON plan
terraform show -json plan.out > plan.json

# Extract drifted resources
jq '.resource_changes[] | select(.change.actions | contains(["update"]) or contains(["delete"]))' plan.json

Example drift report:

{
  "address": "aws_security_group.web",
  "change": {
    "actions": ["update"],
    "before": {
      "ingress": [{"from_port": 443, "to_port": 443}]
    },
    "after": {
      "ingress": [
        {"from_port": 443, "to_port": 443},
        {"from_port": 22, "to_port": 22}
      ]
    }
  }
}

Verification

Confirm your setup works:

  • terraform plan runs without errors
  • Scheduled runs execute on time
  • Alerts fire when drift is detected

Advanced Drift Detection Techniques

Technique 1: Using driftctl for Comprehensive Coverage

When to use: When you need to detect resources not managed by Terraform (shadow IT).

Installation and setup:

# Install driftctl
brew install driftctl

# Scan AWS account
driftctl scan

# Output:
# Found 150 resources
# - 120 managed by Terraform
# - 25 unmanaged (drift!)
# - 5 missing from cloud

Integration with CI:

- name: Run driftctl
  run: |
    driftctl scan --from tfstate://terraform.tfstate \
      --output json://drift-results.json

- name: Check for unmanaged resources
  run: |
    UNMANAGED=$(jq '.summary.total_unmanaged' drift-results.json)
    if [ "$UNMANAGED" -gt 0 ]; then
      echo "Found $UNMANAGED unmanaged resources!"
      exit 1
    fi

Benefits:

  • Detects resources Terraform doesn’t know about
  • Identifies truly orphaned resources
  • Provides coverage metrics

Technique 2: AWS Config Rules

Detect drift using AWS-native tools:

# AWS Config rule for required tags
resource "aws_config_config_rule" "required_tags" {
  name = "required-tags"

  source {
    owner             = "AWS"
    source_identifier = "REQUIRED_TAGS"
  }

  input_parameters = jsonencode({
    tag1Key   = "Environment"
    tag2Key   = "Owner"
    tag3Key   = "ManagedBy"
    tag3Value = "terraform"
  })
}

# Custom rule for security group drift
resource "aws_config_config_rule" "security_group_drift" {
  name = "security-group-open-ports"

  source {
    owner             = "AWS"
    source_identifier = "VPC_SG_OPEN_ONLY_TO_AUTHORIZED_PORTS"
  }

  input_parameters = jsonencode({
    authorizedTcpPorts = "443,80"
  })
}

Technique 3: Real-Time Drift Detection with CloudTrail

Detect drift as it happens:

# Lambda function triggered by CloudTrail
import json
import boto3

def lambda_handler(event, context):
    sns = boto3.client('sns')

    for record in event['Records']:
        detail = json.loads(record['body'])

        # Filter for manual console changes
        if detail.get('userIdentity', {}).get('type') == 'IAMUser':
            if 'Console' in detail.get('userAgent', ''):
                # Manual change detected!
                sns.publish(
                    TopicArn='arn:aws:sns:us-east-1:123456789:drift-alerts',
                    Message=json.dumps({
                        'event': detail['eventName'],
                        'user': detail['userIdentity']['userName'],
                        'resource': detail['requestParameters'],
                        'source': 'Console'
                    }),
                    Subject='Manual Infrastructure Change Detected'
                )

    return {'statusCode': 200}

Real-World Examples

Example 1: Netflix Drift Management

Context: Netflix manages infrastructure across multiple AWS accounts with thousands of engineers.

Challenge: Engineers making quick fixes through AWS console caused production incidents.

Solution: Comprehensive drift detection and prevention:

  • Hourly terraform plan runs across all accounts
  • CloudTrail integration for real-time manual change detection
  • Automatic Jira ticket creation for any drift
  • Service owners responsible for remediation within 24 hours

Results:

  • 92% reduction in drift-related incidents
  • Mean time to detect drift: 15 minutes
  • 100% of resources tracked in Terraform

Key Takeaway: 💡 Make drift visible and assign ownership—engineers fix what they’re accountable for.

Example 2: Capital One Zero-Drift Policy

Context: Financial services company with strict compliance requirements.

Challenge: Auditors require proof that production matches IaC definitions.

Solution: Zero-drift enforcement:

  • Console access removed for production accounts
  • All changes require PR and terraform apply
  • Continuous drift scanning with automatic remediation
  • Compliance dashboards showing drift status

Results:

  • Zero manual changes to production in 2 years
  • Audit prep time reduced by 80%
  • Full audit trail for every infrastructure change

Key Takeaway: 💡 Remove the ability to drift—if engineers can’t access the console, they can’t make manual changes.


Best Practices

Do’s ✅

  1. Run drift detection frequently

    • Minimum: daily for production
    • Recommended: every 6 hours
    • Ideal: continuous with CloudTrail integration
  2. Alert on all drift, investigate promptly

    • Set up PagerDuty/Slack alerts
    • Establish SLOs for remediation time
    • Track drift metrics over time
  3. Use remote state with locking

    • Prevent concurrent modifications
    • Enable state versioning for rollback
    • Restrict state access to CI/CD
  4. Import existing resources

    • Don’t leave resources unmanaged
    • Use terraform import for existing infra
    • Document all imported resources

Don’ts ❌

  1. Don’t ignore “expected” drift

    • Auto-scaling changes should be modeled in IaC
    • Self-healing systems need proper configuration
    • Document any accepted drift
  2. Don’t remediate blindly

    • Understand why drift occurred
    • Fix root cause, not just symptoms
    • Manual changes may indicate IaC gaps

Pro Tips 💡

  • Tip 1: Use terraform plan -refresh-only to detect drift without planning changes
  • Tip 2: Tag resources with last-modified metadata for forensics
  • Tip 3: Create separate alerts for different drift severity levels

Common Pitfalls and Solutions

Pitfall 1: Too Many False Positives

Symptoms:

  • Teams ignore drift alerts
  • Auto-scaling triggers constant notifications
  • Legitimate changes flagged as drift

Root Cause: Not accounting for expected state changes in IaC.

Solution:

# Ignore auto-scaling desired count changes
resource "aws_autoscaling_group" "web" {
  name                = "web-asg"
  min_size            = 2
  max_size            = 10
  desired_capacity    = 2

  lifecycle {
    ignore_changes = [desired_capacity]
  }
}

# Ignore tags managed by other systems
resource "aws_instance" "app" {
  # ...

  lifecycle {
    ignore_changes = [
      tags["aws:autoscaling:groupName"],
      tags["kubernetes.io/cluster/*"]
    ]
  }
}

Prevention: Model expected behavior in IaC; use ignore_changes judiciously.

Pitfall 2: State File Corruption

Symptoms:

  • Terraform shows resources as new when they exist
  • Plan shows destroy/recreate for unchanged resources
  • State doesn’t match actual infrastructure

Root Cause: Concurrent runs, manual state edits, or storage issues.

Solution:

  • Enable state locking in backend
  • Never manually edit state files
  • Use state versioning for recovery
# S3 backend with locking and versioning
terraform {
  backend "s3" {
    bucket         = "terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

Prevention: Always use remote state with locking; restrict direct state access.


Tools and Resources

ToolBest ForProsConsPrice
Terraform PlanBasic drift detectionBuilt-in, reliableOnly managed resourcesFree
driftctlShadow IT detectionFinds unmanaged resourcesRequires setupFree
AWS ConfigAWS-native detectionReal-time, native integrationAWS onlyPay per rule
SpaceliftEnterprise IaCFull platform, auto-remediationComplex, expensivePaid
env0GitOps workflowsGood drift detection, cost visibilityRequires platformPaid

Selection Criteria

Choose based on:

  1. Coverage: Just Terraform → native plan; Shadow IT → driftctl
  2. Scale: Small team → free tools; Enterprise → Spacelift/env0
  3. Cloud: Single cloud → native tools; Multi-cloud → Terraform + driftctl

Additional Resources


AI-Assisted Drift Management

Modern AI tools enhance drift detection and remediation:

  • Root cause analysis: AI identifies why drift occurred
  • Remediation suggestions: Generate terraform code to fix drift
  • Pattern detection: Identify recurring drift sources
  • Impact prediction: Assess risk of detected drift

Tools: Firefly, Env0 AI features, custom LLM integrations.


Decision Framework: Drift Detection Strategy

ConsiderationBasic ApproachAdvanced Approach
Team size<5 engineers>5 engineers
Resource count<100 resources>100 resources
ImplementationScheduled terraform planReal-time + driftctl
ResponseManual reviewAutomated remediation
Console accessAllowed with loggingRemoved entirely

Measuring Success

Track these metrics for drift detection effectiveness:

MetricTargetMeasurement
Time to detect drift<1 hourCloudTrail → alert latency
Drift remediation time<24 hoursAlert → PR merged
Resources with drift0%Drift scan results
Unmanaged resources0driftctl scan
Console changes0/monthCloudTrail analysis
Drift-related incidents0/quarterIncident post-mortems

Conclusion

Key Takeaways

  1. Drift detection is essential—you can’t manage what you don’t measure
  2. Schedule frequent scans—daily minimum, hourly preferred
  3. Detect shadow IT—use driftctl to find unmanaged resources
  4. Prevent rather than detect—remove console access where possible

Action Plan

  1. Today: Run terraform plan on your production infrastructure
  2. This Week: Set up scheduled drift detection in CI/CD
  3. This Month: Implement real-time detection and remediation workflows

Official Resources

See Also


How does your team handle infrastructure drift? Share your detection and prevention strategies in the comments.