2025-05-05·14 min read

Differential testing for smart contracts: comparing implementations to find bugs

By antonio — April 2026

Here's a simple idea that catches surprisingly nasty bugs: take two implementations of the same thing and compare their outputs. If they disagree, at least one of them is wrong.

That's differential testing. It's been a workhorse technique in compiler testing and browser security for decades. It works just as well for smart contracts — maybe even better, because DeFi is full of multiple implementations of the same specs.

Let me show you how to set it up, where it shines, and the real bugs it catches.

What is differential testing?

The concept is straightforward:

  1. You have two (or more) implementations of the same specification
  2. You feed them the same inputs
  3. You compare their outputs
  4. Any difference is a bug in at least one implementation

The power is that you don't need to know what the correct output should be. You just need to know that both implementations should agree. This lets you generate millions of random inputs without writing specific expected outputs for each one.

Input → Implementation A → Output A  ─┐
                                       ├─→ Compare → Mismatch = Bug
Input → Implementation B → Output B  ─┘
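The loop in the diagram can be sketched off-chain in a few lines. This is a minimal illustration, not part of the original post: `isqrt_newton` and `differential_test` are hypothetical names, and the two integer square root implementations stand in for any pair of implementations that should agree.

```python
import math
import random


def isqrt_reference(n: int) -> int:
    """Implementation A: the standard library's exact integer square root."""
    return math.isqrt(n)


def isqrt_newton(n: int) -> int:
    """Implementation B: Newton's iteration on integers."""
    if n < 2:
        return n
    x = 1 << ((n.bit_length() + 1) // 2)  # initial guess, always >= sqrt(n)
    while True:
        y = (x + n // x) // 2
        if y >= x:
            return x
        x = y


def differential_test(impl_a, impl_b, runs: int = 10_000, seed: int = 0):
    """Feed both implementations the same random inputs.

    Returns (input, output_a, output_b) for the first mismatch, or None.
    """
    rng = random.Random(seed)
    for _ in range(runs):
        n = rng.getrandbits(rng.randint(1, 256))  # random width up to 256 bits
        out_a, out_b = impl_a(n), impl_b(n)
        if out_a != out_b:
            return n, out_a, out_b
    return None
```

Note that the harness never states what the square root of any input should be. Swapping in a float-based version such as `lambda n: int(math.sqrt(n))` is enough to make it report a mismatch on large inputs.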

For smart contracts, the "implementations" can be:

  • Two different contracts implementing the same ERC standard
  • A reference implementation vs a gas-tuned version
  • The same contract compiled with different Solidity versions
  • A Solidity implementation vs a Vyper implementation
  • An on-chain contract vs an off-chain simulator

When differential testing makes sense

Not every project needs differential testing. Here's when it's worth the setup cost:

You're building a gas-efficient version of something standard. If you're writing a gas-tuned ERC-20 or a custom AMM based on a known formula, differential testing against the reference implementation catches bugs introduced by the rewrite.

You're migrating between versions. Upgrading from Solidity 0.7 to 0.8? Migrating a Vyper contract to Solidity? Differential testing verifies behavioral equivalence.

You have a spec with multiple implementations. ERC-4626 vaults, ERC-2612 permits, or any standard where multiple teams have written compliant implementations.

You have an off-chain model. Many DeFi protocols have Python or TypeScript models for their math. Differential testing against the on-chain implementation catches precision and rounding bugs.

Setting up differential tests in Foundry

Foundry makes differential testing relatively straightforward. Here's a complete example comparing two AMM implementations.

The setup: two AMM implementations

Say we have a reference AMM and a gas-efficient version:

// ReferenceAMM.sol -- clear, correct, not gas-efficient
contract ReferenceAMM {
    uint256 public reserveA;
    uint256 public reserveB;

    constructor(uint256 _reserveA, uint256 _reserveB) {
        reserveA = _reserveA;
        reserveB = _reserveB;
    }

    function getAmountOut(
        uint256 amountIn,
        bool isTokenA
    ) external view returns (uint256 amountOut) {
        uint256 reserveIn = isTokenA ? reserveA : reserveB;
        uint256 reserveOut = isTokenA ? reserveB : reserveA;

        // Standard constant product formula: x * y = k
        // amountOut = reserveOut - (reserveIn * reserveOut) /
        //             (reserveIn + amountIn)
        // With 0.3% fee
        uint256 amountInWithFee = amountIn * 997;
        uint256 numerator = amountInWithFee * reserveOut;
        uint256 denominator = (reserveIn * 1000) + amountInWithFee;

        amountOut = numerator / denominator;
    }

    function swap(uint256 amountIn, bool isTokenA) external
        returns (uint256 amountOut)
    {
        amountOut = this.getAmountOut(amountIn, isTokenA);

        if (isTokenA) {
            reserveA += amountIn;
            reserveB -= amountOut;
        } else {
            reserveB += amountIn;
            reserveA -= amountOut;
        }
    }
}

// OptimizedAMM.sol -- gas-efficient, uses assembly
contract OptimizedAMM {
    uint256 public reserveA;
    uint256 public reserveB;

    constructor(uint256 _reserveA, uint256 _reserveB) {
        reserveA = _reserveA;
        reserveB = _reserveB;
    }

    function getAmountOut(
        uint256 amountIn,
        bool isTokenA
    ) external view returns (uint256 amountOut) {
        assembly {
            let reserveIn := sload(
                add(reserveA.slot, iszero(isTokenA))
            )
            let reserveOut := sload(
                add(reserveA.slot, iszero(iszero(isTokenA)))
            )

            let amountInWithFee := mul(amountIn, 997)
            let numerator := mul(amountInWithFee, reserveOut)
            let denominator := add(
                mul(reserveIn, 1000), amountInWithFee
            )

            amountOut := div(numerator, denominator)
        }
    }

    function swap(uint256 amountIn, bool isTokenA) external
        returns (uint256 amountOut)
    {
        amountOut = this.getAmountOut(amountIn, isTokenA);

        if (isTokenA) {
            reserveA += amountIn;
            reserveB -= amountOut;
        } else {
            reserveB += amountIn;
            reserveA -= amountOut;
        }
    }
}

The differential fuzz test

// test/DifferentialAMM.t.sol
pragma solidity ^0.8.19;

import "forge-std/Test.sol";
import "../src/ReferenceAMM.sol";
import "../src/OptimizedAMM.sol";

contract DifferentialAMMTest is Test {
    ReferenceAMM ref;
    OptimizedAMM opt;

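    function setUp() public {
        // Same initial state for both. The reserves are deliberately
        // unequal: with equal reserves, a bug that swaps reserveIn and
        // reserveOut would produce identical outputs and go unnoticed.
        ref = new ReferenceAMM(1_000_000e18, 2_000_000e18);
        opt = new OptimizedAMM(1_000_000e18, 2_000_000e18);
    }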

    /// @dev Fuzz test: getAmountOut should match for any input
    function testFuzz_getAmountOut_matches(
        uint256 amountIn,
        bool isTokenA
    ) public view {
        // Bound to reasonable range
        amountIn = bound(amountIn, 1, 1_000_000e18);

        uint256 refOut = ref.getAmountOut(amountIn, isTokenA);
        uint256 optOut = opt.getAmountOut(amountIn, isTokenA);

        assertEq(
            refOut,
            optOut,
            "getAmountOut mismatch between reference and gas-efficient"
        );
    }

    /// @dev Fuzz test: swap sequences should produce same state
    function testFuzz_swapSequence_matches(
        uint256[5] calldata amounts,
        bool[5] calldata directions
    ) public {
        for (uint256 i = 0; i < 5; i++) {
            uint256 amount = bound(amounts[i], 1, 100_000e18);

            uint256 refOut = ref.swap(amount, directions[i]);
            uint256 optOut = opt.swap(amount, directions[i]);

            assertEq(
                refOut,
                optOut,
                string.concat(
                    "Swap output mismatch at step ",
                    vm.toString(i)
                )
            );
        }

        // Final reserves should match exactly
        assertEq(ref.reserveA(), opt.reserveA(), "reserveA mismatch");
        assertEq(ref.reserveB(), opt.reserveB(), "reserveB mismatch");
    }
}

Run it:

forge test --match-contract DifferentialAMMTest -vvv --fuzz-runs 10000

If the assembly optimization has a bug (say, the storage slot calculation for reserveB is off by one), the fuzzer will find inputs where the outputs diverge.
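It can help to model both versions off-chain before fuzzing, because they differ in more than style: the Solidity arithmetic in ReferenceAMM reverts on overflow, while the `mul`, `add` and `div` opcodes in the assembly version wrap modulo 2**256 and return 0 on division by zero. The sketch below is an illustrative Python model of the two getAmountOut formulas; the function names are hypothetical and not part of the contracts above.

```python
UINT256_MAX = 2**256 - 1


def _checked(value: int) -> int:
    """Model Solidity 0.8 checked arithmetic: overflow reverts."""
    if value > UINT256_MAX:
        raise OverflowError("checked arithmetic would revert")
    return value


def get_amount_out_checked(amount_in: int, reserve_in: int, reserve_out: int) -> int:
    """Model of ReferenceAMM.getAmountOut (0.3% fee, floor division)."""
    amount_in_with_fee = _checked(amount_in * 997)
    numerator = _checked(amount_in_with_fee * reserve_out)
    denominator = _checked(_checked(reserve_in * 1000) + amount_in_with_fee)
    return numerator // denominator  # raises ZeroDivisionError, like a revert


def get_amount_out_wrapping(amount_in: int, reserve_in: int, reserve_out: int) -> int:
    """Model of the assembly version: every opcode wraps, div by zero gives 0."""
    mod = 2**256
    amount_in_with_fee = (amount_in * 997) % mod
    numerator = (amount_in_with_fee * reserve_out) % mod
    denominator = (reserve_in * 1000 + amount_in_with_fee) % mod
    return 0 if denominator == 0 else numerator // denominator
```

Inside the bounded range used by the fuzz test the two models agree. Outside it (for example an amountIn around 2**200) the checked model raises while the wrapping model returns a number, which is a divergence worth deciding on deliberately rather than discovering later.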

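Cross-language differential testing with FFI

One of the most powerful applications: comparing your Solidity implementation against a Python or Rust reference using Foundry's FFI. FFI lets a test run arbitrary commands on your machine, so it is off by default: enable it with ffi = true in foundry.toml or pass --ffi to forge test.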

Solidity vs Python math

// test/DifferentialMath.t.sol
pragma solidity ^0.8.19;

import "forge-std/Test.sol";
import "../src/MathLib.sol";

contract DifferentialMathTest is Test {
    MathLib lib;

    function setUp() public {
        lib = new MathLib();
    }

    function testFuzz_sqrt_matchesPython(uint256 x) public {
        x = bound(x, 0, type(uint128).max);

        // Get Solidity result
        uint256 solidityResult = lib.sqrt(x);

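        // Get Python result via FFI. vm.ffi hex-decodes stdout whenever
        // it parses as hex, and a decimal string such as "1234" does, so
        // the script prints a 32-byte hex word that is ABI-decoded here.
        string[] memory cmd = new string[](3);
        cmd[0] = "python3";
        cmd[1] = "-c";
        cmd[2] = string.concat(
            "import math; print(format(math.isqrt(",
            vm.toString(x),
            "), '064x'))"
        );

        bytes memory result = vm.ffi(cmd);
        uint256 pythonResult = abi.decode(result, (uint256));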

        assertEq(
            solidityResult,
            pythonResult,
            string.concat(
                "sqrt mismatch for input ",
                vm.toString(x)
            )
        );
    }

    function testFuzz_expWad_matchesPython(int256 x) public {
        // Bound to range where exp doesn't overflow
        x = bound(x, -42139678854452767551, 135305999368893231589);

        int256 solidityResult = lib.expWad(x);

        string[] memory cmd = new string[](3);
        cmd[0] = "python3";
        cmd[1] = "-c";
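        cmd[2] = string.concat(
            "from decimal import Decimal, getcontext; ",
            "getcontext().prec = 120; ",
            "x = Decimal('",
            vm.toString(x),
            "') / Decimal(10**18); ",
            // Decimal.exp() keeps full precision; math.exp(float(x))
            // would drop everything past roughly 16 digits
            "result = int(x.exp() * Decimal(10**18)); ",
            // Print a 32-byte hex word so vm.ffi's hex decoding is exact
            "print(format(result, '064x'))"
        );

        bytes memory result = vm.ffi(cmd);
        int256 pythonResult = abi.decode(result, (int256));

        // Allow 1 wei of absolute error for tiny results and a relative
        // error otherwise. The 1e-10 tolerance (1e8, where 1e18 = 100%)
        // is a placeholder: use your implementation's documented bound.
        if (stdMath.delta(solidityResult, pythonResult) > 1) {
            assertApproxEqRel(
                solidityResult,
                pythonResult,
                1e8,
                "expWad mismatch"
            );
        }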
    }
}

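This technique catches subtle fixed-point arithmetic bugs that are hard to spot in manual review. Python's decimal module gives you arbitrary precision to compare against, provided the reference computes in Decimal end to end; a single detour through float caps it at roughly 16 significant digits.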
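For anything longer than a one-liner, the reference is easier to maintain as a standalone script that the test invokes through FFI. The sketch below is one possible shape for it, under stated assumptions: the function names are hypothetical, the result is floored, and the precision of 120 digits is chosen to cover results up to about 2**255. It also includes a float-based variant purely to show how much precision that path loses.

```python
import math
from decimal import Decimal, ROUND_FLOOR, getcontext

getcontext().prec = 120  # enough digits for results near 2**255

WAD = 10**18


def exp_wad_reference(x_wad: int) -> int:
    """floor(e**(x_wad / 1e18) * 1e18), computed entirely in Decimal."""
    x = Decimal(x_wad) / Decimal(WAD)
    scaled = x.exp() * Decimal(WAD)
    return int(scaled.to_integral_value(rounding=ROUND_FLOOR))


def exp_wad_float(x_wad: int) -> int:
    """Same quantity routed through a 64-bit float, for comparison only."""
    return int(Decimal(str(math.exp(x_wad / WAD))) * WAD)


def to_ffi_word(value: int) -> str:
    """Encode a non-negative integer as a 32-byte hex word for vm.ffi."""
    if value < 0:
        raise ValueError("only non-negative values are supported")
    return format(value, "064x")
```

For an input of 1e18 the Decimal version returns all 19 significant digits of e, while the float version ends in zeros after about the sixteenth digit. That gap is far larger than a 1 wei tolerance, so a float-based reference would flag a correct contract as broken.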

Cross-version differential testing

Solidity version changes introduce behavioral differences. Some are documented, some aren't.

Solidity 0.7 vs 0.8 behavior

The biggest change was checked arithmetic: overflow and underflow now revert unless the code sits inside an unchecked block. Other operations changed less than people tend to assume, and a differential test is a cheap way to confirm that a port still behaves like the original:

// test/CrossVersion.t.sol
// This test compares behavior between a 0.7-style implementation
// (using unchecked) and a 0.8 implementation
pragma solidity ^0.8.19;

import "forge-std/Test.sol";
// plus imports for LegacyMath and ModernMath from your project

contract CrossVersionTest is Test {
    LegacyMath legacy;   // Uses unchecked blocks to mimic 0.7
    ModernMath modern;    // Standard 0.8 checked arithmetic

    function setUp() public {
        legacy = new LegacyMath();
        modern = new ModernMath();
    }

    function testFuzz_division_behavior(
        uint256 a,
        uint256 b
    ) public {
        // Division by zero fails in both versions, and unchecked does
        // not disable the check. Only the failure mode changed: 0.7 hit
        // an invalid opcode and burned all gas, 0.8 reverts with
        // Panic(0x12). Only inline-assembly div() returns 0.
        if (b == 0) {
            vm.expectRevert();
            modern.divide(a, b);

            // If the legacy port uses assembly and returns 0 here,
            // this expectation fails and surfaces the discrepancy
            vm.expectRevert();
            legacy.divide(a, b);
            return;
        }

        assertEq(
            legacy.divide(a, b),
            modern.divide(a, b),
            "Division result mismatch"
        );
    }

    function testFuzz_shift_behavior(
        uint256 value,
        uint256 shift
    ) public {
        // Shifts are never overflow-checked, in 0.7 or in 0.8: the
        // result is truncated, so shifting a uint256 right by >= 256
        // bits yields 0 in both. Any mismatch is a porting mistake.
        shift = bound(shift, 0, 512);

        assertEq(
            legacy.shiftRight(value, shift),
            modern.shiftRight(value, shift),
            "Shift result mismatch"
        );

        if (shift >= 256) {
            assertEq(
                modern.shiftRight(value, shift),
                0,
                "Shift >= 256 should return 0"
            );
        }
    }
}

This is especially useful during protocol migrations. We've seen bugs introduced during 0.7→0.8 migrations where developers added unchecked blocks in the wrong places, accidentally preserving overflow behavior in functions that should've been checked.
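The misplaced-unchecked failure mode is easy to reproduce in a small model. The sketch below is illustrative only, with hypothetical function names: one addition raises on overflow the way default 0.8 arithmetic reverts with Panic(0x11), the other wraps the way 0.7 arithmetic (or 0.8 inside unchecked) does, and a differential pass over boundary values lists the inputs on which they part ways.

```python
from itertools import product

UINT256_MOD = 2**256


def add_checked(a: int, b: int) -> int:
    """Solidity 0.8 default: overflow reverts (modeled as an exception)."""
    result = a + b
    if result >= UINT256_MOD:
        raise OverflowError("Panic(0x11): arithmetic overflow")
    return result


def add_unchecked(a: int, b: int) -> int:
    """Solidity 0.7, or 0.8 inside unchecked: the result wraps."""
    return (a + b) % UINT256_MOD


def find_divergences(pairs):
    """Return (a, b, wrapped_result) wherever only the checked version fails."""
    divergences = []
    for a, b in pairs:
        try:
            checked = add_checked(a, b)
        except OverflowError:
            divergences.append((a, b, add_unchecked(a, b)))
            continue
        assert checked == add_unchecked(a, b)  # no overflow: must agree
    return divergences


EDGE_VALUES = [0, 1, 2**128, 2**255, 2**256 - 1]
DIVERGENCES = find_divergences(product(EDGE_VALUES, repeat=2))
```

Every entry in `DIVERGENCES` is an input where the migrated function has to make a conscious choice: revert, or keep wrapping. A differential test against the legacy contract turns that choice into an explicit assertion.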

ABI encoding differential tests

ABI encoding bugs are subtle and dangerous. Compare your manual encoding against Solidity's built-in encoder:

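function testFuzz_customEncoding_matchesABI(
    address addr,
    uint256 amount,
    bytes32 id
) public pure {
    // Your custom encoding (maybe for gas optimization):
    // 20 + 32 + 32 = 84 bytes, no padding
    bytes memory custom = abi.encodePacked(
        bytes20(addr),
        bytes32(amount),
        id
    );

    // Standard encoding: three padded 32-byte words
    bytes memory standard = abi.encode(addr, amount, id);

    // The raw bytes SHOULD differ (packed vs padded), so comparing
    // them directly proves nothing. The real test: decode each format
    // with its own decoder and compare the recovered values.
    assertEq(custom.length, 84);

    address customAddr;
    uint256 customAmount;
    bytes32 customId;
    assembly {
        // Data starts 32 bytes past the length word
        customAddr := shr(96, mload(add(custom, 32)))
        customAmount := mload(add(custom, 52))
        customId := mload(add(custom, 84))
    }

    (address stdAddr, uint256 stdAmount, bytes32 stdId) =
        abi.decode(standard, (address, uint256, bytes32));

    assertEq(customAddr, stdAddr);
    assertEq(customAmount, stdAmount);
    assertEq(customId, stdId);
    assertEq(stdAddr, addr);
    assertEq(stdAmount, amount);
    assertEq(stdId, id);
}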
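The byte layouts involved are simple enough to model off-chain, which is useful when the decoder lives in a backend or indexer rather than in a contract. The following is a sketch with hypothetical helper names that mirrors the layout of abi.encode and abi.encodePacked for an (address, uint256, bytes32) triple only; it is not a general ABI codec.

```python
def encode_standard(addr: bytes, amount: int, id_: bytes) -> bytes:
    """Layout of abi.encode(address, uint256, bytes32): three 32-byte words."""
    assert len(addr) == 20 and len(id_) == 32
    return addr.rjust(32, b"\x00") + amount.to_bytes(32, "big") + id_


def decode_standard(data: bytes):
    assert len(data) == 96
    return data[12:32], int.from_bytes(data[32:64], "big"), data[64:96]


def encode_packed(addr: bytes, amount: int, id_: bytes) -> bytes:
    """Layout of abi.encodePacked(bytes20, bytes32, bytes32): 84 bytes."""
    assert len(addr) == 20 and len(id_) == 32
    return addr + amount.to_bytes(32, "big") + id_


def decode_packed(data: bytes):
    assert len(data) == 84
    return data[0:20], int.from_bytes(data[20:52], "big"), data[52:84]


def roundtrips_agree(addr: bytes, amount: int, id_: bytes) -> bool:
    """Differential check: both formats must recover the same values."""
    via_standard = decode_standard(encode_standard(addr, amount, id_))
    via_packed = decode_packed(encode_packed(addr, amount, id_))
    return via_standard == via_packed == (addr, amount, id_)
```

The encoded bytes differ in length (96 versus 84), so the comparison has to happen on decoded values. An off-by-12 slice in either decoder shows up immediately as a round-trip mismatch.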

Real bugs found by differential testing

Let me share some real patterns where differential testing caught issues:

1. Rounding direction discrepancy

A vault's deposit() function rounded shares down (correct, favors the vault), but previewDeposit() rounded up (incorrect, overpromised shares):

function testFuzz_depositPreview_matches(uint256 assets) public {
    assets = bound(assets, 1, 1_000_000e18);

    uint256 previewedShares = vault.previewDeposit(assets);
    uint256 actualShares = vault.deposit(assets, address(this));

    // ERC-4626 spec: previewDeposit MUST return <= actual shares
    assertLe(
        previewedShares,
        actualShares,
        "Preview overpromised shares"
    );
}

The fuzzer found inputs where previewDeposit returned more shares than deposit actually minted. This is a spec violation that can cause accounting bugs in integrating contracts.
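The arithmetic behind this bug fits in a few lines. The sketch below is a simplified model, not the vault's actual code: the function names are hypothetical and it assumes the usual shares = assets * totalSupply / totalAssets conversion. It lists the deposit sizes for which a round-up preview promises more than a round-down deposit delivers.

```python
def shares_round_down(assets: int, total_supply: int, total_assets: int) -> int:
    """What deposit() mints in this model: floor of the exact share amount."""
    return assets * total_supply // total_assets


def shares_round_up(assets: int, total_supply: int, total_assets: int) -> int:
    """What the buggy previewDeposit() reports: ceiling of the exact amount."""
    return (assets * total_supply + total_assets - 1) // total_assets


def overpromised_inputs(total_supply: int, total_assets: int, max_assets: int):
    """Deposit sizes where the preview exceeds the shares actually minted."""
    return [
        assets
        for assets in range(1, max_assets + 1)
        if shares_round_up(assets, total_supply, total_assets)
        > shares_round_down(assets, total_supply, total_assets)
    ]


# Example state: 100 shares backed by 300 assets (3 assets per share)
BAD_INPUTS = overpromised_inputs(total_supply=100, total_assets=300, max_assets=12)
```

With three assets per share, every deposit that is not a multiple of three is overpromised by exactly one share. Exact multiples round identically, which is why a handful of hand-picked round numbers can miss the bug while a fuzzer finds it quickly.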

2. Assembly optimization gone wrong

A hand-rolled mulDiv function in assembly produced incorrect results for specific input ranges near type(uint256).max:

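function testFuzz_mulDiv_reference(
    uint256 a,
    uint256 b,
    uint256 denominator
) public pure {
    denominator = bound(denominator, 1, type(uint256).max);

    // Skip only the cases where the quotient cannot fit in 256 bits,
    // i.e. the high word of the 512-bit product is >= denominator.
    // Skipping every a * b overflow would hide the exact range the
    // hand-rolled version exists for.
    uint256 hi;
    assembly {
        let mm := mulmod(a, b, not(0))
        let lo := mul(a, b)
        hi := sub(sub(mm, lo), lt(mm, lo))
    }
    if (hi >= denominator) return;

    uint256 fast = OptimizedMath.mulDiv(a, b, denominator);
    // Full-precision reference. Assumes OpenZeppelin's Math library is
    // imported; any audited 512-bit mulDiv works. The variable is not
    // called "reference" because that is a reserved word in Solidity.
    uint256 expected = Math.mulDiv(a, b, denominator);

    assertEq(fast, expected, "mulDiv mismatch");
}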

The assembly version had an off-by-one in its high-word multiplication logic. It only triggered when both a and b had specific bit patterns in their upper 128 bits.
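A reference for mulDiv has to handle products wider than 256 bits, because that is the only range where a 512-bit implementation does anything interesting. Python integers are unbounded, so the exact answer is a one-liner. The sketch below uses hypothetical names and also models a naive version whose product wraps at 256 bits, to show how the two diverge once the high word is non-zero.

```python
UINT256_MAX = 2**256 - 1


def mul_div_reference(a: int, b: int, denominator: int) -> int:
    """Exact floor(a * b / denominator) using unbounded integers."""
    if denominator == 0:
        raise ZeroDivisionError("denominator must be non-zero")
    result = a * b // denominator
    if result > UINT256_MAX:
        raise OverflowError("quotient does not fit in 256 bits")
    return result


def mul_div_wrapping(a: int, b: int, denominator: int) -> int:
    """Naive version: the product silently wraps modulo 2**256 first."""
    return ((a * b) % 2**256) // denominator


def split_512(a: int, b: int):
    """High and low 256-bit words of the full product a * b."""
    full = a * b
    return full >> 256, full & UINT256_MAX


def is_valid_case(a: int, b: int, denominator: int) -> bool:
    """The quotient fits in 256 bits exactly when the high word < denominator."""
    high, _ = split_512(a, b)
    return high < denominator
```

For a = b = 2**200 and a denominator of 2**200 the exact answer is 2**200, while the wrapping version returns 0. A fuzz test that discards every input with an overflowing product never reaches cases like this one.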

3. Cross-chain behavior difference

A contract deployed on both Ethereum and Arbitrum produced different results for the same inputs because of PUSH0 opcode availability and different gas costs affecting an internal gas-bounded loop:

function testFuzz_crossChain_equivalence(
    uint256 input
) public {
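    // Fork Ethereum mainnet ("mainnet" and "arbitrum" are RPC aliases
    // defined under [rpc_endpoints] in foundry.toml)
    vm.createSelectFork("mainnet");
    uint256 mainnetResult = target.compute(input);

    // Fork Arbitrum; target must live at the same address there
    vm.createSelectFork("arbitrum");
    uint256 arbResult = target.compute(input);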

    assertEq(
        mainnetResult,
        arbResult,
        "Cross-chain result mismatch"
    );
}

Advanced: differential invariant testing

Combine differential testing with invariant testing for maximum coverage. Instead of comparing single function calls, compare entire operation sequences:

contract DifferentialInvariantTest is Test {
    ReferenceVault refVault;
    OptimizedVault optVault;
    DiffHandler handler;

    function setUp() public {
        refVault = new ReferenceVault(address(asset));
        optVault = new OptimizedVault(address(asset));
        handler = new DiffHandler(refVault, optVault, asset);

        targetContract(address(handler));
    }

    function invariant_stateAlwaysMatches() public view {
        assertEq(
            refVault.totalAssets(),
            optVault.totalAssets(),
            "totalAssets diverged"
        );
        assertEq(
            refVault.totalSupply(),
            optVault.totalSupply(),
            "totalSupply diverged"
        );
    }
}

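// Needs CommonBase, StdCheats and StdUtils from forge-std (Base.sol,
// StdCheats.sol, StdUtils.sol). They provide vm, makeAddr and bound
// without the public functions of Test, which the fuzzer would call.
contract DiffHandler is CommonBase, StdCheats, StdUtils {
    ReferenceVault ref;
    OptimizedVault opt;
    IERC20 asset; // assumed asset type; match your vaults
    address[] actors; // each must hold some asset before the run

    constructor(ReferenceVault _ref, OptimizedVault _opt, IERC20 _asset) {
        ref = _ref;
        opt = _opt;
        asset = _asset;
        actors.push(makeAddr("alice"));
        actors.push(makeAddr("bob"));
    }

    // Every handler function performs the same action on both
    function deposit(uint256 amount, uint256 actorSeed) external {
        address actor = actors[actorSeed % actors.length];
        // Half the balance goes to each vault; bound() reverts if max < min
        uint256 maxAmount = asset.balanceOf(actor) / 2;
        if (maxAmount == 0) return;
        amount = bound(amount, 1, maxAmount);

        // Deposit into both with same params
        vm.startPrank(actor);
        asset.approve(address(ref), amount);
        uint256 refShares = ref.deposit(amount, actor);

        asset.approve(address(opt), amount);
        uint256 optShares = opt.deposit(amount, actor);
        vm.stopPrank();

        require(
            refShares == optShares,
            "Share mismatch on deposit"
        );
    }

    function withdraw(uint256 amount, uint256 actorSeed) external {
        address actor = actors[actorSeed % actors.length];
        uint256 maxRef = ref.maxWithdraw(actor);
        uint256 maxOpt = opt.maxWithdraw(actor);
        require(maxRef == maxOpt, "maxWithdraw mismatch");

        if (maxRef == 0) return;
        amount = bound(amount, 1, maxRef);

        // ERC-4626 withdraw() takes assets and returns shares burned
        vm.startPrank(actor);
        uint256 refShares = ref.withdraw(amount, actor, actor);
        uint256 optShares = opt.withdraw(amount, actor, actor);
        vm.stopPrank();

        require(
            refShares == optShares,
            "Share mismatch on withdraw"
        );
    }
}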

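This catches state divergence that only shows up after specific sequences of operations. The fuzzer generates random sequences of deposits and withdrawals, and the invariant checks that both implementations stay in sync after every call. One caveat: Foundry's invariant runner ignores handler reverts by default, so a failed require in the handler is silently discarded unless you set fail_on_revert = true in the [invariant] section of foundry.toml.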
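The same idea can be prototyped off-chain before wiring up a handler. The sketch below is a toy model under stated assumptions: both vault classes, the `donate` operation (standing in for yield) and the helper names are hypothetical, and the "optimized" vault carries a deliberate rounding bug so there is something to find.

```python
import random


class ReferenceVault:
    """Toy share accounting that rounds shares down on deposit."""

    def __init__(self) -> None:
        self.total_assets = 0
        self.total_supply = 0

    def _shares_for(self, assets: int) -> int:
        if self.total_supply == 0:
            return assets
        return assets * self.total_supply // self.total_assets

    def deposit(self, assets: int) -> int:
        shares = self._shares_for(assets)
        self.total_assets += assets
        self.total_supply += shares
        return shares

    def donate(self, assets: int) -> None:
        """Simulated yield: assets arrive without minting shares."""
        self.total_assets += assets


class RoundUpVault(ReferenceVault):
    """Hypothetical buggy rewrite that rounds shares up instead."""

    def _shares_for(self, assets: int) -> int:
        if self.total_supply == 0:
            return assets
        return -(-(assets * self.total_supply) // self.total_assets)


def random_ops(count: int, seed: int = 0):
    """Random sequence of (operation name, amount) pairs."""
    rng = random.Random(seed)
    return [
        (rng.choice(["deposit", "donate"]), rng.randint(1, 10**6))
        for _ in range(count)
    ]


def first_divergence(vault_a, vault_b, ops):
    """Apply each op to both vaults; return the first step where they differ."""
    for step, (name, amount) in enumerate(ops):
        result_a = getattr(vault_a, name)(amount)
        result_b = getattr(vault_b, name)(amount)
        state_a = (vault_a.total_assets, vault_a.total_supply)
        state_b = (vault_b.total_assets, vault_b.total_supply)
        if result_a != result_b or state_a != state_b:
            return step
    return None
```

A single deposit into either vault looks identical. The two only drift apart once a donation has moved the share price off 1:1 and a later deposit no longer divides evenly, which is exactly the kind of sequence-dependent divergence a stateful comparison is for.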

Practical tips

Start with the pure math. The highest-value differential tests compare mathematical functions: swap calculations, interest accrual, pricing formulas. These are deterministic, easy to test, and where precision bugs hide.

Use Python/Rust for reference. Don't build your reference in Solidity if you can avoid it. Use a language with arbitrary-precision arithmetic. This greatly reduces the risk of both implementations sharing the same bug.

Bound your inputs carefully. Differential testing generates a lot of inputs. If most of them hit trivial code paths (zero amounts, empty arrays), you're wasting cycles. Use Foundry's bound() to focus on interesting ranges.

Log the failing input. When a differential test fails, the specific input that caused divergence is gold. Log it, reproduce it, and understand why the implementations disagree.

Combine with fuzzing. Differential testing tells you what disagrees. Invariant testing and property-based testing tell you what properties should hold. Use both.

For a broader comparison of smart contract fuzzing tools, check our dedicated post.

When not to use differential testing

It's not always the right tool:

  • No reference implementation exists. You're building something novel and there's nothing to compare against.
  • Implementations are intentionally different. If one version adds a fee and the other doesn't, they're supposed to disagree.
  • The performance cost isn't worth it. FFI-based cross-language testing is slow, since every call spawns an external process. For simple contracts, direct property testing is faster and just as effective.

In those cases, stick with standard invariant testing and direct property assertions.

Wrapping up

Differential testing is one of those techniques that's simple in concept but catches bugs that other approaches miss. The insight is that you don't need to know the right answer; you just need two sources that should agree.

In DeFi, where specs get implemented multiple times, where gas-tuned rewrites replace reference code, and where cross-chain deployments must behave identically, differential testing fits naturally.

Set up the comparison. Let the fuzzer generate inputs. Wait for the disagreement. Fix the bug.
