What Is Data Flow Testing?

Data flow testing focuses on the lifecycle of variables: where they are defined (assigned a value), where they are used (read), and where they are killed (go out of scope or are re-assigned). By tracking these events along execution paths, data flow testing reveals defects that other techniques miss.

While control flow testing asks “which paths does the code take?”, data flow testing asks “what happens to the data along those paths?”

Variable States: Define, Use, Kill

Every variable goes through three states:

Define (d): The variable receives a value.

total = 0           # definition of total
user = get_user()   # definition of user

Use (u): The variable’s value is read. Two types:

  • c-use (computation use): Value used in a calculation: result = total * tax_rate
  • p-use (predicate use): Value used in a condition: if total > 100:

Kill (k): The variable ceases to exist (goes out of scope) or is re-defined.

total = 0        # define total
total = total + 5  # use total (c-use), then kill + redefine total
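All three event kinds appear together in one short function; a minimal sketch (apply_tax is a made-up example, not from any library):

```python
def apply_tax(price):           # define price (parameter)
    total = price               # define total; c-use of price
    if total > 100:             # p-use of total
        total = total * 1.2     # c-use of total, then kill + redefine total
    return total                # c-use of total
```

`apply_tax(50)` returns 50 unchanged, while `apply_tax(200)` takes the branch, kills the original definition of total, and returns the redefined value (200 * 1.2).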

Data Flow Anomalies

Data flow anomalies are suspicious patterns that often indicate bugs:

dd anomaly (define-define)

A variable is defined twice without being used between definitions.

price = get_base_price()     # define
price = get_sale_price()     # define again — first definition is wasted
discount = price * 0.1       # use

The first price assignment is a dead store: either the wrong value is being overwritten (an error) or the first call is unnecessary work.

ur anomaly (undefined-reference: use before definition)

A variable is used before being defined.

def calculate_total():
    total = total + tax    # BUG: total read before it is assigned (raises UnboundLocalError in Python)
    return total

du anomaly (define with no use)

A variable is defined but never used.

def process():
    result = expensive_computation()  # define
    return "done"                     # result never used
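Anomalies like this can be caught mechanically. Below is a coarse sketch using Python's ast module; it ignores scoping, del statements, and attributes, so treat it as a teaching aid rather than a real linter (find_du_anomalies is a made-up helper name):

```python
import ast

def find_du_anomalies(source):
    """Return names that are assigned but never read (du anomaly).

    Coarse approximation: any ast.Name with Store context counts as a
    definition, any ast.Name with Load context as a use; scopes, del,
    and attribute accesses are not modeled.
    """
    assigned, read = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                assigned.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                read.add(node.id)
    return assigned - read

code = """
def process():
    result = expensive_computation()
    return "done"
"""
print(find_du_anomalies(code))   # {'result'}
```

This is essentially what tools like Pyflakes do, with real scope tracking added on top.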

Define-Use Pairs (DU Pairs)

A DU pair is a pair (d, u) where:

  • d is a statement where variable v is defined
  • u is a statement where variable v is used
  • There exists at least one path from d to u that does not re-define v (a definition-clear path)

Example

def process_payment(amount, discount_code):
    price = amount                    # Line 1: define price

    if discount_code == "SAVE10":     # Line 2: use discount_code (p-use)
        discount = 0.10               # Line 3: define discount
    elif discount_code == "SAVE20":   # Line 4: use discount_code (p-use)
        discount = 0.20               # Line 5: define discount
    else:
        discount = 0                  # Line 6: define discount

    final = price * (1 - discount)    # Line 7: use price (c-use), use discount (c-use)
    return final                      # Line 8: use final (c-use)

DU pairs in this example:

  • price: (1, 7)
  • discount_code: (param, 2), (param, 4)
  • discount: (3, 7), (5, 7), (6, 7)
  • final: (7, 8)
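Listing definition and use sites can be partially automated. The sketch below records line numbers per name with Python's ast module; it does not follow control flow, so pairing a def line with a use line gives candidate DU pairs only, not definition-clear paths (def_use_sites is a made-up helper name):

```python
import ast
from collections import defaultdict

def def_use_sites(source):
    """Record definition and use line numbers per variable name.

    Rough sketch: a stored ast.Name counts as a definition, a loaded
    one as a use. Control flow is ignored, so this yields candidate
    DU pairs, not verified definition-clear paths.
    """
    defs, uses = defaultdict(list), defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defs[node.id].append(node.lineno)
            elif isinstance(node.ctx, ast.Load):
                uses[node.id].append(node.lineno)
    return dict(defs), dict(uses)

source = """price = amount
final = price * (1 - discount)"""
d, u = def_use_sites(source)
print(d)   # {'price': [1], 'final': [2]}
print(u)   # {'amount': [1], 'price': [2], 'discount': [2]}
```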

Data Flow Coverage Criteria

From weakest to strongest:

All-Defs Coverage

For every variable definition, at least one DU pair from that definition is covered.

For discount: test at least one of (3,7), (5,7), or (6,7). One test case suffices.

All-Uses Coverage (All-C-Uses/All-P-Uses)

For every variable definition, every reachable use is covered.

For discount: test ALL of (3,7), (5,7), and (6,7). Three test cases needed.
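All-uses coverage for discount can be demonstrated by running each definition through to its use. A sketch, with process_payment repeated from the DU-pair example ("NONE" is an arbitrary code chosen to take the else branch):

```python
import math

def process_payment(amount, discount_code):   # from the DU-pair example
    price = amount
    if discount_code == "SAVE10":
        discount = 0.10
    elif discount_code == "SAVE20":
        discount = 0.20
    else:
        discount = 0
    final = price * (1 - discount)
    return final

# Each case drives a different definition of discount to its use at
# final = price * (1 - discount), covering all three DU pairs.
cases = [
    (100, "SAVE10", 90.0),    # pair (3, 7)
    (100, "SAVE20", 80.0),    # pair (5, 7)
    (100, "NONE", 100.0),     # pair (6, 7)
]
for amount, code, expected in cases:
    assert math.isclose(process_payment(amount, code), expected)
```

All-defs coverage, by contrast, would be satisfied by any single one of these cases.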

All-DU-Paths Coverage

For every DU pair, every definition-clear path between the definition and use is covered.

This is the strongest criterion but may require many tests if there are multiple paths between a definition and its use.

Practical Data Flow Analysis

In practice, you rarely draw formal data flow graphs. Instead, apply data flow thinking:

  1. Trace each variable from creation to last use
  2. Check for anomalies — is anything defined but not used? Used before defined? Defined twice without use?
  3. Ensure all definitions reach uses — does every path from definition to use behave correctly?

Common Data Flow Bugs

Null pointer from conditional definition:

if condition:
    connection = create_connection()
# BUG: connection undefined if condition is False
connection.execute(query)
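One fix is to give connection a definition on every path and guard the use. A self-contained sketch (FakeConnection and run_query are hypothetical stand-ins for the snippet's create_connection and its caller):

```python
class FakeConnection:
    """Hypothetical stand-in for whatever create_connection returns."""
    def execute(self, query):
        return f"executed: {query}"

def run_query(condition, query):
    # Fix: define connection on every path, then guard the use.
    connection = None
    if condition:
        connection = FakeConnection()
    if connection is None:
        return None
    return connection.execute(query)

print(run_query(False, "SELECT 1"))   # None -- no crash on the False path
print(run_query(True, "SELECT 1"))    # executed: SELECT 1
```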

Stale value after re-assignment:

config = load_config("production")
if is_testing:
    config = load_config("test")
# config has correct value here

setup_database(config)
config = load_config("production")  # re-define — why?
setup_cache(config)                 # always uses production config — bug if testing?
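The straightforward repair is to delete the redefinition so one definition of config reaches both uses. A runnable sketch with stub implementations (load_config, setup_database, and setup_cache here are stand-ins recording their calls, not the real functions):

```python
def load_config(env):
    # stand-in for the real load_config in the snippet above
    return {"env": env}

calls = []
def setup_database(config):
    calls.append(("db", config["env"]))
def setup_cache(config):
    calls.append(("cache", config["env"]))

def configure(is_testing):
    # Fix: a single definition of config reaches both uses,
    # so database and cache always see the same environment.
    config = load_config("test" if is_testing else "production")
    setup_database(config)
    setup_cache(config)

configure(is_testing=True)
print(calls)   # [('db', 'test'), ('cache', 'test')]
```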

Exercise: Data Flow Analysis

Problem 1

Identify all DU pairs and data flow anomalies in this function:

def calculate_grade(scores, curve):
    total = 0                        # Line 1
    count = 0                        # Line 2
    average = 0                      # Line 3

    for score in scores:             # Line 4
        total = total + score        # Line 5
        count = count + 1            # Line 6

    if count > 0:                    # Line 7
        average = total / count      # Line 8

    average = average + curve        # Line 9

    if average >= 90:                # Line 10
        grade = "A"                  # Line 11
    elif average >= 80:              # Line 12
        grade = "B"                  # Line 13
    else:
        grade = "C"                  # Line 14

    return grade                     # Line 15
Solution

DU pairs:

  • total: (1, 5) on the first iteration, (5, 5) on later iterations, (5, 8) after the loop; (1, 8) exists statically but is infeasible — if the loop body never runs, count == 0 and line 8 is skipped
  • count: (2, 6), (6, 6), (2, 7), (6, 7), (6, 8); (2, 8) is statically a pair but infeasible for the same reason
  • average: (3, 9) if count == 0, (8, 9) if count > 0, (9, 10), (9, 12)
  • grade: (11, 15), (13, 15), (14, 15)
  • scores: (param, 4)
  • curve: (param, 9)

Anomaly: Line 3 defines average = 0. If the scores list is empty, count == 0 at line 7, so line 8 is skipped. Line 9 then uses the initial average = 0, so average == curve and the grade is computed from the curve alone. This is likely not intended — a dd anomaly on the non-empty path (lines 3 and 8 both define average with no use in between) and a potential logic bug when scores is empty.

Test cases for all-uses:

  # | scores   | curve | Covers
 ---|----------|-------|------------------------------------------------
  1 | [85, 95] |   5   | Loop executes, count > 0, average computed, avg >= 90
  2 | [80, 90] |   0   | Loop executes, 80 <= avg < 90
  3 | [50, 60] |   0   | Loop executes, avg < 80
  4 | []       |  10   | Empty scores, count == 0 path
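These cases can be executed directly against the function from the exercise; a sketch (the middle-band input uses [80, 90] so the pre-curve average lands at 85):

```python
def calculate_grade(scores, curve):   # function from the exercise
    total = 0
    count = 0
    average = 0
    for score in scores:
        total = total + score
        count = count + 1
    if count > 0:
        average = total / count
    average = average + curve
    if average >= 90:
        grade = "A"
    elif average >= 80:
        grade = "B"
    else:
        grade = "C"
    return grade

assert calculate_grade([85, 95], 5) == "A"   # avg 90, +5 curve = 95
assert calculate_grade([80, 90], 0) == "B"   # avg 85, middle band
assert calculate_grade([50, 60], 0) == "C"   # avg 55
assert calculate_grade([], 10) == "C"        # empty path: average == curve == 10
```

The last case exercises the anomaly: with no scores, the grade depends entirely on the curve.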

Problem 2

Find and fix data flow bugs in this code:

def process_order(items, coupon):
    subtotal = 0
    shipping = 0

    for item in items:
        subtotal += item.price * item.quantity

    if subtotal > 50:
        shipping = 0

    if coupon:
        discount = subtotal * coupon.percent / 100

    total = subtotal - discount + shipping
    return total
Solution

Bug 1: ur anomaly — discount used before definition. If coupon is falsy, discount is never defined, but line total = subtotal - discount + shipping uses it. Fix: initialize discount = 0 before the if.

Bug 2: dd anomaly — shipping always 0. shipping is defined as 0, and then conditionally set to 0 again. The else case is missing — presumably shipping should have a non-zero value when subtotal <= 50.

Fixed code:

def process_order(items, coupon):
    subtotal = 0
    discount = 0          # Fix: initialize discount

    for item in items:
        subtotal += item.price * item.quantity

    if subtotal > 50:
        shipping = 0
    else:
        shipping = 9.99   # Fix: non-zero shipping for small orders

    if coupon:
        discount = subtotal * coupon.percent / 100

    total = subtotal - discount + shipping
    return total
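A quick check that the fixes hold on the path that previously crashed (Item and Coupon are hypothetical stand-ins for the real order and coupon types, which the exercise does not define):

```python
from collections import namedtuple

# Hypothetical stand-ins for the order item and coupon types
Item = namedtuple("Item", "price quantity")
Coupon = namedtuple("Coupon", "percent")

def process_order(items, coupon):     # fixed version from above
    subtotal = 0
    discount = 0                      # defined on every path now
    for item in items:
        subtotal += item.price * item.quantity
    if subtotal > 50:
        shipping = 0
    else:
        shipping = 9.99
    if coupon:
        discount = subtotal * coupon.percent / 100
    return subtotal - discount + shipping

# No coupon (the path that raised NameError before) and a small order:
assert process_order([Item(10, 2)], None) == 20 + 9.99
# Coupon applied, free shipping over 50: 60 - 6.0 + 0
assert process_order([Item(30, 2)], Coupon(10)) == 54.0
```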

Tools for Data Flow Analysis

Most data flow analysis happens during static analysis:

  • SonarQube — Detects dead code, unused variables, null pointer risks
  • SpotBugs (Java) — Finds uninitialized reads, dead stores
  • Pylint/Pyflakes (Python) — Reports unused variables, undefined names
  • ESLint (JavaScript) — no-unused-vars, no-undef rules
  • Coverity — Commercial tool with advanced data flow analysis

Key Takeaways

  • Data flow testing tracks variables through define → use → kill lifecycle
  • DU pairs connect variable definitions to their uses along definition-clear paths
  • Three coverage levels: all-defs (weakest), all-uses, all-du-paths (strongest)
  • Data flow anomalies (dd, ur, du) often indicate real bugs
  • Most common real-world bug: a variable defined on only one branch of a conditional, then used unconditionally afterwards
  • Static analysis tools automate much of data flow anomaly detection
  • Apply data flow thinking during code review even without formal tools