Infrastructure as Code (IaC) has revolutionized how we manage cloud resources, and Terraform has emerged as the de facto standard for multi-cloud infrastructure provisioning. But with great power comes great responsibility—untested Terraform code can lead to catastrophic production failures, security vulnerabilities, and compliance violations.
Companies like HashiCorp, Spotify, and Uber have developed sophisticated testing strategies that catch issues before they reach production. In this comprehensive guide, you’ll learn how to implement robust validation strategies that ensure your Terraform code is reliable, secure, and maintainable.
Why Terraform Testing Matters
The cost of untested infrastructure code:
```hcl
# This seemingly innocent change destroyed production
resource "aws_s3_bucket" "data" {
  bucket = "company-production-data"

  # Developer thought this would just add versioning...
  force_destroy = true # ⚠️ DANGER: Deletes all objects on destroy!
}
```
A single untested terraform apply with the above code could delete years of customer data. Real incidents include:
- GitLab (2017): Database deletion incident affecting 5,000+ projects
- AWS S3 Outage (2017): Typo in decommissioning script took down major services
- Microsoft Azure (2018): Configuration error caused global authentication failures
Key benefits of Terraform testing:
- Catch syntax errors and misconfigurations before deployment
- Validate security compliance automatically
- Ensure infrastructure changes don’t break existing resources
- Enable confident refactoring and upgrades
- Provide documentation through test cases
Terraform Testing Fundamentals
The Testing Pyramid for Infrastructure
```
         /\
        /  \        E2E Tests (10%)
       /----\       - Full deployment + application tests
      /      \
     /        \     Integration Tests (20%)
    /----------\    - terraform plan testing
   /            \   - terratest
  /              \  Unit Tests (70%)
 /                \ - terraform validate
/------------------\- tflint, checkov
```
1. Static Analysis and Linting
The first line of defense—catch issues without creating any resources:
Terraform Validate
```bash
# Basic syntax and internal consistency check
terraform validate

# Example output for errors:
# Error: Unsupported argument
#   on main.tf line 12:
#   12: instance_types = "t2.micro"
# An argument named "instance_types" is not expected here.
```
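For CI use, `terraform validate` also supports machine-readable output via `terraform validate -json`. A minimal sketch of turning that output into readable log lines follows; the `valid` and `diagnostics` fields are part of Terraform's documented JSON output, while the sample payload here is purely illustrative:

```python
import json

def summarize_validate(raw: str) -> list[str]:
    """Turn `terraform validate -json` output into human-readable lines."""
    report = json.loads(raw)
    lines = []
    for diag in report.get("diagnostics", []):
        rng = diag.get("range") or {}
        where = f"{rng.get('filename', '?')}:{rng.get('start', {}).get('line', '?')}"
        lines.append(f"{diag['severity'].upper()} {where}: {diag['summary']}")
    if report.get("valid"):
        lines.append("OK: configuration is valid")
    return lines

# Sample payload in the shape `terraform validate -json` emits
sample = '''{"valid": false, "error_count": 1, "warning_count": 0,
  "diagnostics": [{"severity": "error",
    "summary": "Unsupported argument",
    "range": {"filename": "main.tf", "start": {"line": 12}}}]}'''

print(summarize_validate(sample))
```

In a pipeline, exit nonzero when the parsed report's `valid` flag is false so the job fails fast.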
TFLint - Advanced Linting
```bash
# Install tflint
curl -s https://raw.githubusercontent.com/terraform-linters/tflint/master/install_linux.sh | bash

# Create .tflint.hcl configuration
cat > .tflint.hcl <<EOF
plugin "aws" {
  enabled = true
  version = "0.27.0"
  source  = "github.com/terraform-linters/tflint-ruleset-aws"
}

rule "terraform_deprecated_interpolation" {
  enabled = true
}

rule "terraform_unused_declarations" {
  enabled = true
}

rule "terraform_naming_convention" {
  enabled = true
  format  = "snake_case"
}

rule "aws_instance_invalid_type" {
  enabled = true
}
EOF

# Run tflint
tflint --init
tflint
```
Example TFLint Output:
```
3 issue(s) found:

Warning: `ami` is missing (aws_instance_invalid_ami)
  on main.tf line 15:
  15: resource "aws_instance" "web" {

Warning: variable "region" is declared but not used (terraform_unused_declarations)
  on variables.tf line 5:
  5: variable "region" {

Error: "t2.mirco" is an invalid instance type (aws_instance_invalid_type)
  on main.tf line 17:
  17: instance_type = "t2.mirco"
```
2. Security Scanning with Checkov
Checkov scans for security and compliance violations:
```bash
# Install checkov
pip3 install checkov

# Scan Terraform files
checkov -d . --framework terraform

# Run a specific check
checkov -d . --check CKV_AWS_8 # Ensure EBS is encrypted

# Output to JSON for CI/CD integration
checkov -d . -o json > security-report.json
```
Example Security Issues Detected:
```
Check: CKV_AWS_8: "Ensure EBS volume is encrypted"
    FAILED for resource: aws_ebs_volume.data
    File: /main.tf:45-52
    Guide: https://docs.bridgecrew.io/docs/bc_aws_general_3

Check: CKV_AWS_20: "Ensure S3 bucket has versioning enabled"
    FAILED for resource: aws_s3_bucket.logs
    File: /main.tf:60-65

Check: CKV_AWS_23: "Ensure Security Group has description"
    FAILED for resource: aws_security_group.web
    File: /main.tf:70-80
```
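The JSON report can also gate a pipeline: fail the build outright on a short list of blocking check IDs and merely warn on the rest. A minimal sketch, assuming the `results.failed_checks` / `check_id` / `resource` fields that checkov's JSON output uses (verify the exact field names against your checkov version):

```python
import json

def gate_on_checkov(report: dict, blocking_ids: set[str]) -> tuple[bool, list[str]]:
    """Fail the build when any blocking check is among the failed checks."""
    findings = []
    for check in report.get("results", {}).get("failed_checks", []):
        check_id = check.get("check_id", "")
        marker = "BLOCKING" if check_id in blocking_ids else "warning"
        findings.append(f"{marker}: {check_id} on {check.get('resource', '?')}")
    passed = not any(f.startswith("BLOCKING") for f in findings)
    return passed, findings

# Minimal report in the assumed shape of `checkov -o json`
report = {"results": {"failed_checks": [
    {"check_id": "CKV_AWS_8", "resource": "aws_ebs_volume.data"},
    {"check_id": "CKV_AWS_23", "resource": "aws_security_group.web"},
]}}

ok, messages = gate_on_checkov(report, blocking_ids={"CKV_AWS_8"})
print(ok)
for m in messages:
    print(m)
```

Wire it to `sys.exit(0 if ok else 1)` after loading `security-report.json` to make encryption violations blocking while leaving lower-severity findings advisory.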
Automated Fix for Common Issues:
```hcl
# Before - Security violations
resource "aws_s3_bucket" "logs" {
  bucket = "company-logs"
  # Missing: versioning, encryption, public access block
}

# After - Security compliant
resource "aws_s3_bucket" "logs" {
  bucket = "company-logs"
}

resource "aws_s3_bucket_versioning" "logs" {
  bucket = aws_s3_bucket.logs.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "logs" {
  bucket = aws_s3_bucket.logs.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "logs" {
  bucket = aws_s3_bucket.logs.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```
3. Plan Testing and Validation
Test what Terraform will do before doing it:
Terraform Plan Analysis
```bash
# Generate plan and save to file
terraform plan -out=tfplan

# Convert binary plan to JSON for analysis
terraform show -json tfplan > tfplan.json

# Analyze plan with jq
jq -r '
  .resource_changes[] |
  select(.change.actions[] | contains("delete")) |
  "⚠️ DELETE: \(.address)"
' tfplan.json

# Output example:
# ⚠️ DELETE: aws_instance.old_server
# ⚠️ DELETE: aws_security_group.deprecated
```
Automated Plan Validation Script:
```python
# validate_plan.py - Prevent dangerous changes
import json
import sys


def validate_terraform_plan(plan_file):
    """Validate Terraform plan for dangerous operations"""
    with open(plan_file) as f:
        plan = json.load(f)

    errors = []
    warnings = []

    for change in plan.get('resource_changes', []):
        address = change['address']
        actions = change['change']['actions']

        # Check for deletions of critical resources
        if 'delete' in actions:
            if 'database' in address or 'rds' in address:
                errors.append(f"🚨 BLOCKED: Attempting to delete database: {address}")
            elif 's3_bucket' in address and 'backup' in address:
                errors.append(f"🚨 BLOCKED: Attempting to delete backup bucket: {address}")
            else:
                warnings.append(f"⚠️ Warning: Deleting resource: {address}")

        # Check for recreation (replace)
        if 'delete' in actions and 'create' in actions:
            if 'aws_instance' in address:
                warnings.append(f"⚠️ Instance will be recreated: {address}")

        # Check for security group rule changes
        # ("after" is null for deletions, so guard against None)
        if 'aws_security_group' in address or 'aws_security_group_rule' in address:
            after = change['change'].get('after') or {}
            for rule in after.get('ingress') or []:
                if rule.get('cidr_blocks') == ['0.0.0.0/0']:
                    errors.append(f"🚨 BLOCKED: Security group allows public access: {address}")

    # Print results
    if errors:
        print("\n❌ VALIDATION FAILED - Critical Issues Found:\n")
        for error in errors:
            print(error)
        return False

    if warnings:
        print("\n⚠️ Warnings (review before applying):\n")
        for warning in warnings:
            print(warning)

    print("\n✅ Plan validation passed")
    return True


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python validate_plan.py tfplan.json")
        sys.exit(1)
    success = validate_terraform_plan(sys.argv[1])
    sys.exit(0 if success else 1)
```
Usage in CI/CD:
```yaml
# .github/workflows/terraform.yml
name: Terraform Validation
on: [pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2

      - name: Terraform Init
        run: terraform init

      - name: Terraform Validate
        run: terraform validate

      - name: Setup TFLint
        uses: terraform-linters/setup-tflint@v3

      - name: Run TFLint
        run: |
          tflint --init
          tflint

      - name: Checkov Security Scan
        uses: bridgecrewio/checkov-action@master
        with:
          directory: .
          framework: terraform

      - name: Terraform Plan
        run: |
          terraform plan -out=tfplan
          terraform show -json tfplan > tfplan.json

      - name: Validate Plan
        run: python3 validate_plan.py tfplan.json
```
Advanced Testing with Terratest
Terratest enables real infrastructure testing using Go:
Setting Up Terratest
```go
// test/terraform_aws_example_test.go
package test

import (
	"testing"
	"time"

	"github.com/gruntwork-io/terratest/modules/aws"
	http_helper "github.com/gruntwork-io/terratest/modules/http-helper"
	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

func TestTerraformWebServer(t *testing.T) {
	t.Parallel()

	// Construct terraform options
	terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
		TerraformDir: "../examples/web-server",
		Vars: map[string]interface{}{
			"instance_type": "t2.micro",
			"environment":   "test",
		},
		EnvVars: map[string]string{
			"AWS_DEFAULT_REGION": "us-east-1",
		},
	})

	// Clean up resources at the end
	defer terraform.Destroy(t, terraformOptions)

	// Deploy infrastructure
	terraform.InitAndApply(t, terraformOptions)

	// Validate outputs
	instanceID := terraform.Output(t, terraformOptions, "instance_id")
	publicIP := terraform.Output(t, terraformOptions, "public_ip")

	// Verify instance exists and is running
	instance := aws.GetEc2Instance(t, instanceID, "us-east-1")
	assert.Equal(t, "running", instance.State.Name)
	assert.Equal(t, "t2.micro", instance.InstanceType)

	// Verify web server responds
	url := "http://" + publicIP + ":8080"
	http_helper.HttpGetWithRetry(
		t,
		url,
		nil,
		200,
		"Hello, World",
		30,
		3*time.Second,
	)
}
```
Testing Module Reusability
```go
// test/terraform_module_test.go
func TestVPCModule(t *testing.T) {
	t.Parallel()

	terraformOptions := &terraform.Options{
		TerraformDir: "../modules/vpc",
		Vars: map[string]interface{}{
			"vpc_cidr": "10.0.0.0/16",
			"azs":      []string{"us-east-1a", "us-east-1b"},
		},
	}

	defer terraform.Destroy(t, terraformOptions)
	terraform.InitAndApply(t, terraformOptions)

	// Validate VPC was created
	vpcID := terraform.Output(t, terraformOptions, "vpc_id")
	vpc := aws.GetVpcById(t, vpcID, "us-east-1")
	assert.Equal(t, "10.0.0.0/16", vpc.CidrBlock)
	assert.True(t, vpc.EnableDnsHostnames)
	assert.True(t, vpc.EnableDnsSupport)

	// Validate subnets
	publicSubnetIDs := terraform.OutputList(t, terraformOptions, "public_subnet_ids")
	assert.Equal(t, 2, len(publicSubnetIDs))

	for _, subnetID := range publicSubnetIDs {
		subnet := aws.GetSubnetById(t, subnetID, "us-east-1")
		assert.True(t, subnet.MapPublicIpOnLaunch)
	}
}
```
Testing Disaster Recovery
```go
// test/terraform_disaster_recovery_test.go
func TestDatabaseFailover(t *testing.T) {
	terraformOptions := &terraform.Options{
		TerraformDir: "../examples/rds-multi-az",
	}

	defer terraform.Destroy(t, terraformOptions)
	terraform.InitAndApply(t, terraformOptions)

	dbEndpoint := terraform.Output(t, terraformOptions, "db_endpoint")
	dbInstanceID := terraform.Output(t, terraformOptions, "db_instance_id")

	// Verify database is accessible
	// (testDatabaseConnection is a project-specific helper; credentials are test-only)
	err := testDatabaseConnection(dbEndpoint, "admin", "password123")
	assert.NoError(t, err)

	// Simulate failover
	aws.RebootRdsInstance(t, dbInstanceID, "us-east-1")

	// Wait for failover to complete
	maxRetries := 10
	timeBetweenRetries := 30 * time.Second

	for i := 0; i < maxRetries; i++ {
		err = testDatabaseConnection(dbEndpoint, "admin", "password123")
		if err == nil {
			t.Logf("Database recovered after %d attempts", i+1)
			return
		}
		time.Sleep(timeBetweenRetries)
	}

	t.Fatal("Database did not recover after failover")
}
```
Real-World Implementation Examples
HashiCorp’s Terraform Module Testing
HashiCorp maintains rigorous testing for their official modules:
Their testing strategy:
- Kitchen-Terraform - Integration testing with multiple providers
- Automated example validation - Every example in docs is tested
- Backward compatibility tests - Ensure upgrades don’t break existing code
- Performance benchmarks - Track plan/apply times
Example from their AWS VPC module:
```go
// Test multiple scenarios
func TestAWSVPCModule(t *testing.T) {
	testCases := []struct {
		name     string
		vars     map[string]interface{}
		validate func(*testing.T, *terraform.Options)
	}{
		{
			name: "SingleNAT",
			vars: map[string]interface{}{
				"enable_nat_gateway": true,
				"single_nat_gateway": true,
			},
			validate: validateSingleNAT,
		},
		{
			name: "MultiNATHighAvailability",
			vars: map[string]interface{}{
				"enable_nat_gateway":     true,
				"single_nat_gateway":     false,
				"one_nat_gateway_per_az": true,
			},
			validate: validateMultiNAT,
		},
	}

	for _, tc := range testCases {
		tc := tc // Capture range variable
		t.Run(tc.name, func(t *testing.T) {
			t.Parallel()
			// Test logic here
		})
	}
}
```
Spotify’s State Management Testing
Spotify tests Terraform state operations to prevent corruption:
```go
// test/state_management_test.go
func TestStateConsistency(t *testing.T) {
	terraformOptions := &terraform.Options{
		TerraformDir: "../infrastructure",
		BackendConfig: map[string]interface{}{
			"bucket": "spotify-terraform-state-test",
			"key":    fmt.Sprintf("test-%d/terraform.tfstate", time.Now().Unix()),
			"region": "us-east-1",
		},
	}

	// Apply infrastructure
	terraform.InitAndApply(t, terraformOptions)

	// Get current state
	state1 := terraform.Show(t, terraformOptions)

	// Apply again (should be no changes)
	terraform.Apply(t, terraformOptions)
	state2 := terraform.Show(t, terraformOptions)

	// States should be identical
	assert.Equal(t, state1, state2, "State changed on re-apply (drift detected)")

	// Cleanup
	terraform.Destroy(t, terraformOptions)
}
```
Uber’s Cost Validation Testing
Uber validates estimated costs before applying changes:
```python
# test_cost_estimate.py
import json
import subprocess


def estimate_terraform_cost(plan_file):
    """Estimate costs using Infracost"""
    result = subprocess.run(
        ['infracost', 'breakdown', '--path', plan_file, '--format', 'json'],
        capture_output=True,
        text=True
    )
    return json.loads(result.stdout)


def test_monthly_cost_under_budget():
    """Ensure infrastructure changes don't exceed budget"""
    # Generate plan
    subprocess.run(['terraform', 'plan', '-out=tfplan'], check=True)

    # Estimate cost
    cost_data = estimate_terraform_cost('tfplan')
    monthly_cost = cost_data['projects'][0]['breakdown']['totalMonthlyCost']

    # Budget limit
    MAX_MONTHLY_COST = 10000.00

    assert float(monthly_cost) <= MAX_MONTHLY_COST, \
        f"Monthly cost ${monthly_cost} exceeds budget ${MAX_MONTHLY_COST}"


def test_cost_increase_reasonable():
    """Ensure changes don't cause unexpected cost spikes"""
    # Get current infrastructure cost (helper defined elsewhere)
    current_cost = get_current_monthly_cost()

    # Get new infrastructure cost
    subprocess.run(['terraform', 'plan', '-out=tfplan'], check=True)
    cost_data = estimate_terraform_cost('tfplan')
    new_cost = float(cost_data['projects'][0]['breakdown']['totalMonthlyCost'])

    # Cost increase should be < 20%
    max_increase = current_cost * 1.20

    assert new_cost <= max_increase, \
        f"Cost increase too large: ${current_cost} -> ${new_cost}"
```
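The `get_current_monthly_cost()` helper the second test calls is left undefined above. A minimal sketch follows, reusing the Infracost JSON fields from the snippet (`projects[0].breakdown.totalMonthlyCost`) and splitting the budget check into a pure function that is easy to unit-test; the helper name and default path are assumptions to adapt:

```python
import json
import subprocess

def get_current_monthly_cost(path: str = ".") -> float:
    """Estimate the currently deployed configuration's monthly cost with Infracost."""
    result = subprocess.run(
        ["infracost", "breakdown", "--path", path, "--format", "json"],
        capture_output=True, text=True, check=True,
    )
    data = json.loads(result.stdout)
    return float(data["projects"][0]["breakdown"]["totalMonthlyCost"])

def within_budget(current: float, new: float, max_increase_pct: float = 20.0) -> bool:
    """Pure guard: allow at most a given percentage increase over current cost."""
    if current == 0:
        return True  # First deployment: nothing to compare against
    return new <= current * (1 + max_increase_pct / 100)

print(within_budget(8000.0, 9500.0))   # 18.75% increase, within the 20% cap
print(within_budget(8000.0, 10000.0))  # 25% increase, over the cap
```

Keeping the comparison in `within_budget` separate from the subprocess call means the budget logic can run in unit tests without Infracost or cloud credentials.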
Best Practices
✅ Pre-Commit Hooks
Catch issues before they reach version control:
```yaml
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.81.0
    hooks:
      - id: terraform_fmt
      - id: terraform_validate
      - id: terraform_docs
      - id: terraform_tflint
        args:
          - --args=--config=__GIT_WORKING_DIR__/.tflint.hcl
      - id: terraform_checkov
        args:
          - --args=--quiet
          - --args=--framework terraform
      - id: terraform_tfsec
```
Install and use:
```bash
# Install pre-commit
pip3 install pre-commit

# Install hooks
pre-commit install

# Run manually
pre-commit run --all-files
```
✅ Staging Environment Testing
Always test in staging before production:
```hcl
# environments/staging/main.tf
module "infrastructure" {
  source = "../../modules/infrastructure"

  environment = "staging"

  # Use smaller instances for cost savings
  instance_type = "t3.small"

  # Enable all logging for debugging
  enable_detailed_monitoring = true
  log_retention_days         = 7

  # Use the same configuration structure as production,
  # but with reduced resources
}
```
Validation workflow:
```bash
#!/bin/bash
# validate-staging.sh
set -e

echo "🧪 Testing in Staging Environment"
cd environments/staging

# 1. Validate configuration
terraform validate

# 2. Security scan
checkov -d . --quiet

# 3. Plan and save
terraform plan -out=staging.tfplan

# 4. Apply to staging
terraform apply staging.tfplan

# 5. Run smoke tests
./smoke-tests.sh

# 6. Run integration tests
go test -v ../test/integration_test.go

# 7. Monitor for 10 minutes
echo "⏰ Monitoring for 10 minutes..."
./monitor-health.sh 600

echo "✅ Staging validation complete"
```
✅ Drift Detection
Detect when infrastructure diverges from code:
```go
// test/drift_detection_test.go
func TestNoDrift(t *testing.T) {
	terraformOptions := &terraform.Options{
		TerraformDir: "../production",
	}

	// Don't apply, just check for drift
	planOutput := terraform.InitAndPlan(t, terraformOptions)

	// Parse the plan summary ("Plan: X to add, Y to change, Z to destroy")
	counts := terraform.GetResourceCount(t, planOutput)
	resourcesChanged := counts.Add + counts.Change + counts.Destroy

	if resourcesChanged > 0 {
		t.Errorf("Drift detected: %d resources would change", resourcesChanged)
		t.Logf("Plan output:\n%s", planOutput)
	}
}
```
Automated drift detection:
```yaml
# .github/workflows/drift-detection.yml
name: Drift Detection
on:
  schedule:
    - cron: '0 */6 * * *' # Every 6 hours

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Check for Drift
        run: |
          cd environments/production
          terraform init
          # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes pending
          set +e
          terraform plan -detailed-exitcode
          status=$?
          set -e
          if [ "$status" -eq 2 ]; then
            echo "⚠️ DRIFT DETECTED IN PRODUCTION"
            # Send alert
            curl -X POST ${{ secrets.SLACK_WEBHOOK }} \
              -H 'Content-Type: application/json' \
              -d '{"text":"🚨 Terraform drift detected in production!"}'
            exit 1
          fi
          exit "$status"
```
✅ Module Versioning and Testing
Test module upgrades before rolling out:
```hcl
# Test with new module version
module "vpc_test" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.0.0" # Testing upgrade from 4.x

  # ... configuration
}
```
Upgrade testing script:
```bash
#!/bin/bash
# test-module-upgrade.sh

OLD_VERSION="4.0.0"
NEW_VERSION="5.0.0"

echo "Testing upgrade: $OLD_VERSION -> $NEW_VERSION"

# Create test environment with old version
# (write both versions to the same file so the module isn't declared twice)
cat > main.tf <<EOF
module "test" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "$OLD_VERSION"

  name = "upgrade-test"
  cidr = "10.0.0.0/16"
}
EOF

terraform init
terraform apply -auto-approve

# Capture state
OLD_STATE=$(terraform show -json)

# Upgrade to new version
cat > main.tf <<EOF
module "test" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "$NEW_VERSION"

  name = "upgrade-test"
  cidr = "10.0.0.0/16"
}
EOF

terraform init -upgrade
terraform plan -out=upgrade.tfplan

# Check for unexpected changes (validate_plan.py expects JSON, not the binary plan)
terraform show -json upgrade.tfplan > upgrade.json
python3 validate_plan.py upgrade.json

terraform apply upgrade.tfplan

echo "✅ Upgrade successful"
```
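Beyond the plan check, the upgrade can silently rename resource addresses, which forces destroy-and-recreate. Comparing the resource addresses in two `terraform show -json` snapshots (before and after `terraform init -upgrade`) surfaces this. A sketch; the module and address names in the sample data are hypothetical:

```python
import json

def diff_resource_addresses(old_state: dict, new_state: dict) -> dict:
    """Compare resource addresses between two `terraform show -json` snapshots."""
    def addresses(state: dict) -> set:
        # One level of child modules is enough for a single-module test fixture
        root = state.get("values", {}).get("root_module", {})
        addrs = {r["address"] for r in root.get("resources", [])}
        for child in root.get("child_modules", []):
            addrs |= {r["address"] for r in child.get("resources", [])}
        return addrs

    old, new = addresses(old_state), addresses(new_state)
    return {"removed": sorted(old - new), "added": sorted(new - old)}

# Hypothetical snapshots captured before and after the module upgrade
old = {"values": {"root_module": {"child_modules": [
    {"resources": [{"address": "module.test.aws_vpc.this"},
                   {"address": "module.test.aws_subnet.public[0]"}]}]}}}
new = {"values": {"root_module": {"child_modules": [
    {"resources": [{"address": "module.test.aws_vpc.this"},
                   {"address": "module.test.aws_subnet.public_subnet[0]"}]}]}}}

print(diff_resource_addresses(old, new))
```

A removed/added pair with matching resource types usually means the new module version renamed the resource; a `moved` block or `terraform state mv` preserves the existing infrastructure instead of recreating it.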
Common Pitfalls and Solutions
⚠️ Testing with Hardcoded Values
Problem: Tests use hardcoded values that don’t reflect real usage.
Solution: Use variables and realistic data:
```go
// BAD - Hardcoded test values
func TestInstance(t *testing.T) {
	terraformOptions := &terraform.Options{
		TerraformDir: "../",
		Vars: map[string]interface{}{
			"instance_type": "t2.micro",
			"ami":           "ami-12345678",
		},
	}
	// ...
}

// GOOD - Realistic, region-aware values
func TestInstance(t *testing.T) {
	region := aws.GetRandomStableRegion(t, nil, nil)
	ami := aws.GetAmazonLinuxAmi(t, region)

	terraformOptions := &terraform.Options{
		TerraformDir: "../",
		Vars: map[string]interface{}{
			"instance_type": "t3.small", // Current generation
			"ami":           ami,
			"region":        region,
		},
	}
	// ...
}
```
⚠️ Not Testing Destroy Operations
Problem: Resources aren’t properly cleaned up.
Solution: Always test destroy:
```go
func TestCompleteLifecycle(t *testing.T) {
	terraformOptions := &terraform.Options{
		TerraformDir: "../",
	}

	// Test create
	terraform.InitAndApply(t, terraformOptions)

	// Verify resources exist
	instanceID := terraform.Output(t, terraformOptions, "instance_id")
	instance := aws.GetEc2Instance(t, instanceID, "us-east-1")
	assert.NotNil(t, instance)

	// Test destroy
	terraform.Destroy(t, terraformOptions)

	// Verify resources are gone
	_, err := aws.GetEc2InstanceE(t, instanceID, "us-east-1")
	assert.Error(t, err, "Instance should not exist after destroy")
}
```
⚠️ Ignoring State File Testing
Problem: State file corruption or inconsistencies go undetected.
Solution: Validate state file integrity:
```python
# test_state_file.py
import json

import boto3


def test_state_file_integrity():
    """Verify Terraform state file is valid and consistent"""
    s3 = boto3.client('s3')

    # Download state file
    response = s3.get_object(
        Bucket='terraform-state-bucket',
        Key='production/terraform.tfstate'
    )
    state = json.loads(response['Body'].read())

    # Validate structure
    assert 'version' in state
    assert 'terraform_version' in state
    assert 'resources' in state

    # Check for empty resources (usually a problem)
    assert len(state['resources']) > 0, "State file has no resources"

    # Validate resource integrity
    for resource in state['resources']:
        assert 'type' in resource
        assert 'name' in resource
        assert 'instances' in resource

        for instance in resource['instances']:
            assert 'attributes' in instance

            # Check critical attributes exist
            if resource['type'] == 'aws_instance':
                assert 'id' in instance['attributes']
                assert 'ami' in instance['attributes']
```
Tools and Frameworks Comparison
Testing Tools Matrix
| Tool | Type | Best For | Learning Curve | Cost |
|---|---|---|---|---|
| terraform validate | Syntax | Basic validation | Very Easy | Free |
| TFLint | Linting | Best practices, cloud-specific rules | Easy | Free |
| Checkov | Security | Security & compliance scanning | Easy | Free |
| Terratest | Integration | Real infrastructure testing | Medium | Free |
| Kitchen-Terraform | Integration | Multi-provider testing | Medium | Free |
| Sentinel | Policy | Enterprise policy as code | Hard | Paid (Terraform Cloud) |
| Infracost | Cost | Cost estimation and optimization | Easy | Free/Paid |
| Terrascan | Security | Multi-cloud security scanning | Easy | Free |
Tool Selection Guide
For Small Teams:
```bash
# Minimal but effective setup
terraform validate
tflint
checkov -d .
```
For Medium Teams:
```bash
# Add integration testing
terraform validate
tflint
checkov -d .
go test -v ./test/... # Terratest
```
For Enterprise:
```bash
# Complete validation pipeline
terraform validate
terraform fmt -check
tflint
checkov -d .
terrascan scan
infracost breakdown --path .
sentinel apply policy/ # If using Terraform Cloud
go test -v -timeout 30m ./test/...
# plus scheduled drift detection
```
Conclusion
Effective Terraform testing isn’t optional—it’s a critical component of reliable infrastructure automation. By implementing the strategies covered in this guide, you can catch issues early, maintain security compliance, and deploy infrastructure changes with confidence.
Key takeaways:
- Layer your testing - Use static analysis, security scanning, plan validation, and integration tests
- Automate everything - Use CI/CD pipelines and pre-commit hooks to enforce standards
- Test in staging first - Always validate changes in a non-production environment
- Monitor for drift - Regularly check that infrastructure matches code
- Version your modules - Test upgrades before rolling them out to production
Next steps:
- Start with basic validation: `terraform validate`, `tflint`, and `checkov`
- Implement pre-commit hooks to catch issues early
- Add Terratest for critical infrastructure components
- Set up automated drift detection
- Build a comprehensive CI/CD pipeline
For more infrastructure testing strategies, explore our guides on Ansible testing, Kubernetes testing, and CI/CD pipeline security.