TL;DR
- What: Drift detection identifies differences between your IaC definitions and actual deployed infrastructure
- Why: Manual changes, failed deployments, and external modifications cause configuration drift that leads to incidents
- Tools: Terraform plan, AWS Config, driftctl, Spacelift, env0
- Key metric: 100% of resources tracked with drift alerts firing within 1 hour of change
- Start here: Schedule daily `terraform plan` runs and alert on any detected changes
In 2025, 67% of production incidents were traced back to configuration drift—differences between intended and actual infrastructure state. When engineers make “quick fixes” through cloud consoles or automation fails silently, infrastructure diverges from code. Drift detection catches these gaps before they cause outages.
This guide covers implementing comprehensive drift detection across your infrastructure. You’ll learn to detect drift with Terraform and specialized tools, set up automated monitoring, and establish processes that prevent drift from occurring.
What you’ll learn:
- How to implement drift detection with Terraform and AWS Config
- Automated monitoring and alerting for infrastructure changes
- Drift remediation strategies and workflows
- Prevention techniques that eliminate manual changes
- Best practices from organizations managing thousands of resources
Understanding Infrastructure Drift
What is Infrastructure Drift?
Infrastructure drift occurs when the actual state of deployed resources differs from the desired state defined in your Infrastructure as Code. This gap between code and reality can be caused by:
- Manual changes through cloud consoles (ClickOps)
- Failed or partial deployments
- External systems modifying resources
- Auto-scaling or self-healing mechanisms
- Shadow IT creating untracked resources
Why It Matters
Drift creates serious operational risks:
- Incident risk: Unknown configurations cause unexpected behavior
- Compliance violations: Manual changes bypass security controls
- Deployment failures: Terraform apply fails due to state mismatch
- Audit gaps: Actual infrastructure doesn’t match documented state
Types of Drift
| Type | Description | Example |
|---|---|---|
| Configuration drift | Resource properties differ from IaC | Security group rules modified via console |
| State drift | Resources exist but aren’t in state file | Manually created S3 bucket |
| Orphaned resources | Resources in state but deleted externally | EC2 instance terminated manually |
| Shadow resources | Resources not managed by IaC at all | Developer-created test databases |
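The four types above reduce to where a resource appears (state file, cloud, or both) and whether its configuration matches the code. A minimal Python sketch of that classification; the boolean inputs are hypothetical flags you would derive from your own state file and a cloud inventory scan:

```python
def classify_drift(in_state: bool, in_cloud: bool, matches_config: bool = True) -> str:
    """Classify a resource into the drift types described above."""
    if in_state and in_cloud:
        # Resource is tracked; drift only if its properties diverge from IaC
        return "none" if matches_config else "configuration drift"
    if in_state and not in_cloud:
        return "orphaned resource"              # in state, deleted externally
    if in_cloud and not in_state:
        return "state drift / shadow resource"  # exists, but untracked by IaC
    return "unknown"

print(classify_drift(True, True, False))  # configuration drift
print(classify_drift(True, False))        # orphaned resource
```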
Implementing Drift Detection with Terraform
Prerequisites
Before starting, ensure you have:
- Terraform 1.5+ installed
- Remote state backend configured (S3, Azure Blob, GCS)
- CI/CD pipeline (GitHub Actions, GitLab CI, Jenkins)
- Cloud provider credentials with read access
Step 1: Basic Drift Detection with Terraform Plan
The simplest drift detection uses `terraform plan`:

```bash
# Run plan and capture output
terraform plan -detailed-exitcode -out=plan.out

# Exit codes:
#   0 = No changes (no drift)
#   1 = Error
#   2 = Changes detected (drift!)
```
Automated detection script:
```bash
#!/bin/bash
set -euo pipefail

cd /path/to/terraform
terraform init -input=false

# Temporarily disable exit-on-error: -detailed-exitcode returns 2 on drift,
# which would otherwise abort the script before we can inspect the code.
set +e
terraform plan -detailed-exitcode -out=plan.out
EXIT_CODE=$?
set -e

if [ $EXIT_CODE -eq 2 ]; then
  echo "DRIFT DETECTED!"
  terraform show -json plan.out > drift-report.json
  # Send alert
  curl -X POST "$SLACK_WEBHOOK" \
    -H "Content-Type: application/json" \
    -d "{\"text\": \"Infrastructure drift detected! Review: $BUILD_URL\"}"
  exit 1
elif [ $EXIT_CODE -eq 1 ]; then
  echo "Terraform plan failed"
  exit 1
else
  echo "No drift detected"
fi
```
Step 2: Scheduled Drift Detection in CI/CD
GitHub Actions workflow:
```yaml
name: Drift Detection

on:
  schedule:
    - cron: '0 */6 * * *'  # Every 6 hours
  workflow_dispatch:

jobs:
  drift-check:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        environment: [production, staging]
    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.6.0"

      - name: Terraform Init
        working-directory: terraform/${{ matrix.environment }}
        run: terraform init

      - name: Check for Drift
        id: plan
        working-directory: terraform/${{ matrix.environment }}
        run: terraform plan -detailed-exitcode -out=plan.out
        continue-on-error: true

      - name: Report Drift
        if: steps.plan.outcome == 'failure'
        uses: slackapi/slack-github-action@v1.24.0
        with:
          payload: |
            {
              "text": "Drift detected in ${{ matrix.environment }}",
              "blocks": [
                {
                  "type": "section",
                  "text": {
                    "type": "mrkdwn",
                    "text": "*Infrastructure Drift Alert*\nEnvironment: ${{ matrix.environment }}\nWorkflow: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
                  }
                }
              ]
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
```
Step 3: Detailed Drift Analysis
Parse terraform plan output for specifics:
```bash
# Generate JSON plan
terraform show -json plan.out > plan.json

# Extract drifted resources
jq '.resource_changes[] | select(.change.actions | contains(["update"]) or contains(["delete"]))' plan.json
```
Example drift report:
```json
{
  "address": "aws_security_group.web",
  "change": {
    "actions": ["update"],
    "before": {
      "ingress": [{"from_port": 443, "to_port": 443}]
    },
    "after": {
      "ingress": [
        {"from_port": 443, "to_port": 443},
        {"from_port": 22, "to_port": 22}
      ]
    }
  }
}
```
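The jq one-liner can grow unwieldy as filters accumulate; here is a short Python sketch of the same extraction, using the `resource_changes` / `change.actions` fields from Terraform's JSON plan format (the `sample` document below is illustrative, not real output):

```python
def summarize_drift(plan: dict) -> list[dict]:
    """Return drifted resources from a `terraform show -json` plan document."""
    drifted = []
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if actions in (["no-op"], ["read"]):  # skip unchanged resources
            continue
        drifted.append({"address": rc["address"], "actions": actions})
    return drifted

# Example using the security-group drift shown above:
sample = {"resource_changes": [
    {"address": "aws_security_group.web", "change": {"actions": ["update"]}},
    {"address": "aws_s3_bucket.logs", "change": {"actions": ["no-op"]}},
]}
print(summarize_drift(sample))  # [{'address': 'aws_security_group.web', 'actions': ['update']}]
```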
Verification
Confirm your setup works:
- `terraform plan` runs without errors
- Scheduled runs execute on time
- Alerts fire when drift is detected
Advanced Drift Detection Techniques
Technique 1: Using driftctl for Comprehensive Coverage
When to use: When you need to detect resources not managed by Terraform (shadow IT).
Installation and setup:
```bash
# Install driftctl
brew install driftctl

# Scan AWS account
driftctl scan

# Example output:
#   Found 150 resources
#   - 120 managed by Terraform
#   - 25 unmanaged (drift!)
#   - 5 missing from cloud
```
Integration with CI:
```yaml
- name: Run driftctl
  run: |
    driftctl scan --from tfstate://terraform.tfstate \
      --output json://drift-results.json

- name: Check for unmanaged resources
  run: |
    UNMANAGED=$(jq '.summary.total_unmanaged' drift-results.json)
    if [ "$UNMANAGED" -gt 0 ]; then
      echo "Found $UNMANAGED unmanaged resources!"
      exit 1
    fi
```
Benefits:
- Detects resources Terraform doesn’t know about
- Identifies truly orphaned resources
- Provides coverage metrics
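A coverage metric can be derived from the scan's summary counts. A sketch, assuming summary fields named `total_resources` and `total_managed` (verify the exact field names against your driftctl version's JSON output):

```python
def coverage(summary: dict) -> float:
    """Percentage of discovered resources that are managed by Terraform."""
    total = summary["total_resources"]
    # An empty account trivially has full coverage
    return 100.0 * summary["total_managed"] / total if total else 100.0

# Using the example scan above: 120 of 150 resources managed
print(coverage({"total_resources": 150, "total_managed": 120}))  # 80.0
```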
Technique 2: AWS Config Rules
Detect drift using AWS-native tools:
```hcl
# AWS Config rule for required tags
resource "aws_config_config_rule" "required_tags" {
  name = "required-tags"

  source {
    owner             = "AWS"
    source_identifier = "REQUIRED_TAGS"
  }

  input_parameters = jsonencode({
    tag1Key   = "Environment"
    tag2Key   = "Owner"
    tag3Key   = "ManagedBy"
    tag3Value = "terraform"
  })
}

# Managed rule for security group drift
resource "aws_config_config_rule" "security_group_drift" {
  name = "security-group-open-ports"

  source {
    owner             = "AWS"
    source_identifier = "VPC_SG_OPEN_ONLY_TO_AUTHORIZED_PORTS"
  }

  input_parameters = jsonencode({
    authorizedTcpPorts = "443,80"
  })
}
```
Technique 3: Real-Time Drift Detection with CloudTrail
Detect drift as it happens:
```python
# Lambda consuming CloudTrail events delivered via an SQS queue
# (hence the event['Records'][...]['body'] envelope)
import json
import boto3

def lambda_handler(event, context):
    sns = boto3.client('sns')

    for record in event['Records']:
        detail = json.loads(record['body'])

        # Filter for manual console changes
        if detail.get('userIdentity', {}).get('type') == 'IAMUser':
            if 'Console' in detail.get('userAgent', ''):
                # Manual change detected!
                sns.publish(
                    TopicArn='arn:aws:sns:us-east-1:123456789:drift-alerts',
                    Message=json.dumps({
                        'event': detail['eventName'],
                        'user': detail['userIdentity']['userName'],
                        'resource': detail['requestParameters'],
                        'source': 'Console'
                    }),
                    Subject='Manual Infrastructure Change Detected'
                )

    return {'statusCode': 200}
```
Real-World Examples
Example 1: Netflix Drift Management
Context: Netflix manages infrastructure across multiple AWS accounts with thousands of engineers.
Challenge: Engineers making quick fixes through AWS console caused production incidents.
Solution: Comprehensive drift detection and prevention:
- Hourly terraform plan runs across all accounts
- CloudTrail integration for real-time manual change detection
- Automatic Jira ticket creation for any drift
- Service owners responsible for remediation within 24 hours
Results:
- 92% reduction in drift-related incidents
- Mean time to detect drift: 15 minutes
- 100% of resources tracked in Terraform
Key Takeaway: 💡 Make drift visible and assign ownership—engineers fix what they’re accountable for.
Example 2: Capital One Zero-Drift Policy
Context: Financial services company with strict compliance requirements.
Challenge: Auditors require proof that production matches IaC definitions.
Solution: Zero-drift enforcement:
- Console access removed for production accounts
- All changes require PR and terraform apply
- Continuous drift scanning with automatic remediation
- Compliance dashboards showing drift status
Results:
- Zero manual changes to production in 2 years
- Audit prep time reduced by 80%
- Full audit trail for every infrastructure change
Key Takeaway: 💡 Remove the ability to drift—if engineers can’t access the console, they can’t make manual changes.
Best Practices
Do’s ✅
Run drift detection frequently
- Minimum: daily for production
- Recommended: every 6 hours
- Ideal: continuous with CloudTrail integration
Alert on all drift, investigate promptly
- Set up PagerDuty/Slack alerts
- Establish SLOs for remediation time
- Track drift metrics over time
Use remote state with locking
- Prevent concurrent modifications
- Enable state versioning for rollback
- Restrict state access to CI/CD
Import existing resources
- Don’t leave resources unmanaged
- Use `terraform import` for existing infra
- Document all imported resources
Don’ts ❌
Don’t ignore “expected” drift
- Auto-scaling changes should be modeled in IaC
- Self-healing systems need proper configuration
- Document any accepted drift
Don’t remediate blindly
- Understand why drift occurred
- Fix root cause, not just symptoms
- Manual changes may indicate IaC gaps
Pro Tips 💡
- Tip 1: Use `terraform plan -refresh-only` to detect drift without planning changes
- Tip 2: Tag resources with last-modified metadata for forensics
- Tip 3: Create separate alerts for different drift severity levels
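For Tip 3, severity routing can be as simple as a lookup on resource type and action. A sketch with illustrative resource types and channel names (our own assumptions, not a standard):

```python
# Hypothetical set of resource types whose drift should page on-call
CRITICAL_TYPES = {"aws_security_group", "aws_iam_role", "aws_iam_policy"}

def alert_channel(resource_address: str, actions: list[str]) -> str:
    """Pick an alert channel based on drift severity."""
    resource_type = resource_address.split(".")[0]
    if "delete" in actions or resource_type in CRITICAL_TYPES:
        return "#drift-critical"   # page on-call immediately
    return "#drift-review"         # triage during business hours

print(alert_channel("aws_security_group.web", ["update"]))  # #drift-critical
print(alert_channel("aws_s3_bucket.logs", ["update"]))      # #drift-review
```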
Common Pitfalls and Solutions
Pitfall 1: Too Many False Positives
Symptoms:
- Teams ignore drift alerts
- Auto-scaling triggers constant notifications
- Legitimate changes flagged as drift
Root Cause: Not accounting for expected state changes in IaC.
Solution:
```hcl
# Ignore auto-scaling desired count changes
resource "aws_autoscaling_group" "web" {
  name             = "web-asg"
  min_size         = 2
  max_size         = 10
  desired_capacity = 2

  lifecycle {
    ignore_changes = [desired_capacity]
  }
}

# Ignore tags managed by other systems
resource "aws_instance" "app" {
  # ...
  lifecycle {
    ignore_changes = [
      tags["aws:autoscaling:groupName"],
      tags["kubernetes.io/cluster/*"]
    ]
  }
}
```
Prevention: Model expected behavior in IaC; use ignore_changes judiciously.
Pitfall 2: State File Corruption
Symptoms:
- Terraform shows resources as new when they exist
- Plan shows destroy/recreate for unchanged resources
- State doesn’t match actual infrastructure
Root Cause: Concurrent runs, manual state edits, or storage issues.
Solution:
- Enable state locking in backend
- Never manually edit state files
- Use state versioning for recovery
```hcl
# S3 backend with locking and versioning
terraform {
  backend "s3" {
    bucket         = "terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
```
Prevention: Always use remote state with locking; restrict direct state access.
Tools and Resources
Recommended Tools
| Tool | Best For | Pros | Cons | Price |
|---|---|---|---|---|
| Terraform Plan | Basic drift detection | Built-in, reliable | Only managed resources | Free |
| driftctl | Shadow IT detection | Finds unmanaged resources | Requires setup | Free |
| AWS Config | AWS-native detection | Real-time, native integration | AWS only | Pay per rule |
| Spacelift | Enterprise IaC | Full platform, auto-remediation | Complex, expensive | Paid |
| env0 | GitOps workflows | Good drift detection, cost visibility | Requires platform | Paid |
Selection Criteria
Choose based on:
- Coverage: Just Terraform → native plan; Shadow IT → driftctl
- Scale: Small team → free tools; Enterprise → Spacelift/env0
- Cloud: Single cloud → native tools; Multi-cloud → Terraform + driftctl
AI-Assisted Drift Management
Modern AI tools enhance drift detection and remediation:
- Root cause analysis: AI identifies why drift occurred
- Remediation suggestions: Generate terraform code to fix drift
- Pattern detection: Identify recurring drift sources
- Impact prediction: Assess risk of detected drift
Tools: Firefly, Env0 AI features, custom LLM integrations.
Decision Framework: Drift Detection Strategy
| Consideration | Basic Approach | Advanced Approach |
|---|---|---|
| Team size | <5 engineers | >5 engineers |
| Resource count | <100 resources | >100 resources |
| Implementation | Scheduled terraform plan | Real-time + driftctl |
| Response | Manual review | Automated remediation |
| Console access | Allowed with logging | Removed entirely |
Measuring Success
Track these metrics for drift detection effectiveness:
| Metric | Target | Measurement |
|---|---|---|
| Time to detect drift | <1 hour | CloudTrail → alert latency |
| Drift remediation time | <24 hours | Alert → PR merged |
| Resources with drift | 0% | Drift scan results |
| Unmanaged resources | 0 | driftctl scan |
| Console changes | 0/month | CloudTrail analysis |
| Drift-related incidents | 0/quarter | Incident post-mortems |
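The time-to-detect target can be checked mechanically from timestamps. A sketch, assuming the change time comes from CloudTrail and the alert time from your alerting system (the values below are made up for illustration):

```python
from datetime import datetime, timedelta

def within_slo(changed_at: datetime, alerted_at: datetime,
               target: timedelta = timedelta(hours=1)) -> bool:
    """True if the alert fired within the detection-latency target."""
    return (alerted_at - changed_at) <= target

# Change made at 10:00, alert fired at 10:45 -> inside the <1 hour target
changed = datetime(2025, 6, 1, 10, 0)
alerted = datetime(2025, 6, 1, 10, 45)
print(within_slo(changed, alerted))  # True
```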
Conclusion
Key Takeaways
- Drift detection is essential—you can’t manage what you don’t measure
- Schedule frequent scans—daily minimum, hourly preferred
- Detect shadow IT—use driftctl to find unmanaged resources
- Prevent rather than detect—remove console access where possible
Action Plan
- ✅ Today: Run `terraform plan` on your production infrastructure
- ✅ This Week: Set up scheduled drift detection in CI/CD
- ✅ This Month: Implement real-time detection and remediation workflows
How does your team handle infrastructure drift? Share your detection and prevention strategies in the comments.
