Mutation testing for smart contracts: measure your test suite quality
You've got tests. Maybe even lots of tests. But here's the uncomfortable question: are your tests actually catching bugs, or are they just passing?
Mutation testing answers that question. It's the closest thing we have to a ground truth metric for test suite quality, and it's criminally underused in smart contract development.
What is mutation testing?
The idea is simple. Take your code, inject small deliberate faults (mutations), and check whether your tests catch them. Each mutation creates a "mutant," a slightly broken version of your contract. If your tests fail when running against the mutant, the mutant is "killed." If your tests still pass, the mutant "survived," and that's a problem.
A surviving mutant means there's a specific type of bug your tests wouldn't catch.
Here's a concrete example. Suppose your contract has:
require(amount <= balance, "Insufficient funds");
A mutation operator might change <= to <:
require(amount < balance, "Insufficient funds");
If your tests still pass with this change, it means you don't have a test for the exact case where amount == balance. That's a gap.
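A test that kills this mutant has to hit the boundary exactly: withdraw the full balance. Here's a hedged Foundry sketch (the `vault` contract and its function names are hypothetical, just to illustrate the shape of the test):

```solidity
// Hypothetical sketch: assumes a vault whose withdraw() contains
// require(amount <= balance, "Insufficient funds").
function test_withdrawFullBalance() public {
    vault.deposit(100e18);
    // Succeeds under <=, reverts under the mutated <,
    // so this test kills the mutant.
    vault.withdraw(100e18);
    assertEq(vault.balanceOf(address(this)), 0);
}
```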
Why it matters for smart contracts
Smart contract security has a particular problem: bugs are expensive. A single uncaught edge case can drain millions. Traditional code coverage tells you which lines execute, but it doesn't tell you whether your assertions actually verify the right behavior.
You can have 100% line coverage and still miss critical bugs. Mutation testing exposes the difference between "my tests touch every line" and "my tests would catch a bug on every line."
For DeFi protocols, this is especially relevant:
- Off-by-one errors in boundary conditions
- Wrong comparison operators (< vs <=, >= vs >)
- Missing edge cases in math operations
- State transitions that skip validation
These are exactly the kinds of faults that mutation operators inject.
How mutation testing works
The process has four steps:
- Generate mutants. A tool analyzes your source code and creates modified versions, each with one small change.
- Run tests against each mutant. For every mutant, run your full test suite.
- Classify results. If tests fail, the mutant is killed (good). If tests pass, the mutant survived (bad). If the mutant doesn't compile, it's "stillborn" and gets discarded.
- Calculate mutation score. Killed mutants / Total non-equivalent mutants = your score.
A mutation score of 80% means your tests catch 80% of injected faults. For security-critical code, you want 90%+. Below 70% and your test suite has serious blind spots.
Mutation operators
Common operators for Solidity:
| Operator | What it does | Example |
|---|---|---|
| Relational | Swap comparison operators | < → <=, > → >= |
| Arithmetic | Change math operators | + → -, * → / |
| Logical | Flip boolean logic | && → ||, ! removed |
| Literal | Change constant values | 0 → 1, 1 → 0 |
| Statement deletion | Remove a statement | Delete require(), delete assignment |
| Return value | Change return values | return x → return 0 |
| Condition negation | Negate if conditions | if(x) → if(!x) |
The statement deletion operator is particularly brutal. If deleting a require() doesn't break any test, your tests aren't checking that invariant at all.
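To kill a deleted-require mutant, at least one test must depend on that require firing. A hedged sketch, assuming a hypothetical pause() function guarded by an owner check:

```solidity
// Hypothetical sketch: if a statement-deletion mutant removes
// require(msg.sender == owner, "Not owner") from pause(), this
// test starts failing, which is exactly what kills the mutant.
function test_nonOwnerCannotPause() public {
    vm.prank(address(0xBEEF)); // act as a non-owner
    vm.expectRevert("Not owner");
    pool.pause();
}
```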
Tools for Solidity mutation testing
Gambit (by Certora)
Gambit is the most mature mutation testing tool for Solidity. It's built by the Certora team and works well with Foundry projects.
Install it:
# Download from GitHub releases
# https://github.com/Certora/gambit/releases
Generate mutants:
gambit mutate --solc-remappings "@openzeppelin=node_modules/@openzeppelin" src/Vault.sol
This creates a gambit_out/ directory with all the mutant files and a summary JSON.
vertigo-rs
vertigo-rs is RareSkills' fork of the vertigo mutation testing framework for Solidity. Despite the name, it's written in Python (the "rs" stands for RareSkills, not Rust), and it added Foundry support on top of vertigo's original Truffle and Hardhat support.

```bash
# Clone the repo, then run it from inside your project directory
git clone https://github.com/RareSkills/vertigo-rs
python vertigo-rs/vertigo.py run
```

vertigo-rs runs the full loop: generate mutants, run tests, report results. Less manual work than Gambit if you just want a score.
Custom approach
For full control, you can build your own mutation pipeline. It's not as hard as it sounds:
# 1. Generate mutants (use Gambit or write a simple sed script)
# 2. For each mutant:
# a. Replace original file with mutant
# b. Run forge test
# c. Record pass/fail
# d. Restore original file
# 3. Calculate score
Practical walkthrough: Foundry + Gambit
Let's do a real mutation testing run. Here's our target contract:
```solidity
// src/StakingPool.sol
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

import "@openzeppelin/contracts/token/ERC20/IERC20.sol";

contract StakingPool {
    IERC20 public stakingToken;

    mapping(address => uint256) public stakedBalance;
    mapping(address => uint256) public rewardDebt;

    uint256 public totalStaked;
    uint256 public rewardPerToken;
    uint256 public lastUpdateTime;
    uint256 public rewardRate;

    constructor(address _token, uint256 _rate) {
        stakingToken = IERC20(_token);
        rewardRate = _rate;
        lastUpdateTime = block.timestamp;
    }

    function updateRewards() public {
        if (totalStaked > 0) {
            uint256 elapsed = block.timestamp - lastUpdateTime;
            rewardPerToken += (elapsed * rewardRate * 1e18) / totalStaked;
        }
        lastUpdateTime = block.timestamp;
    }

    function stake(uint256 amount) external {
        require(amount > 0, "Cannot stake zero");
        updateRewards();
        stakedBalance[msg.sender] += amount;
        totalStaked += amount;
        rewardDebt[msg.sender] = rewardPerToken;
        stakingToken.transferFrom(msg.sender, address(this), amount);
    }

    function unstake(uint256 amount) external {
        require(amount > 0, "Cannot unstake zero");
        require(stakedBalance[msg.sender] >= amount, "Insufficient stake");
        updateRewards();
        stakedBalance[msg.sender] -= amount;
        totalStaked -= amount;
        stakingToken.transfer(msg.sender, amount);
    }

    function pendingReward(address user) external view returns (uint256) {
        uint256 currentRewardPerToken = rewardPerToken;
        if (totalStaked > 0) {
            uint256 elapsed = block.timestamp - lastUpdateTime;
            currentRewardPerToken += (elapsed * rewardRate * 1e18) / totalStaked;
        }
        return (stakedBalance[user] * (currentRewardPerToken - rewardDebt[user])) / 1e18;
    }
}
```
And here's a test file:
```solidity
// test/StakingPool.t.sol
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

import "forge-std/Test.sol";
import "../src/StakingPool.sol";

contract MockToken is IERC20 {
    mapping(address => uint256) public override balanceOf;
    mapping(address => mapping(address => uint256)) public override allowance;
    uint256 public override totalSupply;

    function mint(address to, uint256 amount) external {
        balanceOf[to] += amount;
        totalSupply += amount;
    }

    function transfer(address to, uint256 amount) external override returns (bool) {
        balanceOf[msg.sender] -= amount;
        balanceOf[to] += amount;
        return true;
    }

    function transferFrom(address from, address to, uint256 amount) external override returns (bool) {
        allowance[from][msg.sender] -= amount;
        balanceOf[from] -= amount;
        balanceOf[to] += amount;
        return true;
    }

    function approve(address spender, uint256 amount) external override returns (bool) {
        allowance[msg.sender][spender] = amount;
        return true;
    }
}

contract StakingPoolTest is Test {
    StakingPool pool;
    MockToken token;

    function setUp() public {
        token = new MockToken();
        pool = new StakingPool(address(token), 1e18);
        token.mint(address(this), 1_000_000e18);
        token.approve(address(pool), type(uint256).max);
    }

    function test_stake() public {
        pool.stake(100e18);
        assertEq(pool.stakedBalance(address(this)), 100e18);
        assertEq(pool.totalStaked(), 100e18);
    }

    function test_unstake() public {
        pool.stake(100e18);
        pool.unstake(50e18);
        assertEq(pool.stakedBalance(address(this)), 50e18);
    }

    function test_cannotStakeZero() public {
        vm.expectRevert("Cannot stake zero");
        pool.stake(0);
    }

    function test_cannotUnstakeMoreThanBalance() public {
        pool.stake(100e18);
        vm.expectRevert("Insufficient stake");
        pool.unstake(200e18);
    }
}
```
Now run Gambit:
gambit mutate src/StakingPool.sol
Gambit generates mutants. Let's look at what it creates:
```
gambit_out/
  mutants/
    1/   # require(amount > 0) → require(amount >= 0)
    2/   # require(amount > 0) → require(amount < 0)
    3/   # stakedBalance[msg.sender] += amount → stakedBalance[msg.sender] -= amount
    4/   # totalStaked += amount → totalStaked -= amount
    5/   # require(stakedBalance[msg.sender] >= amount) → require(stakedBalance[msg.sender] > amount)
    ...
  gambit_results.json
```
Now test each mutant:
```bash
#!/bin/bash
KILLED=0
SURVIVED=0
TOTAL=0

for mutant_dir in gambit_out/mutants/*/; do
    TOTAL=$((TOTAL + 1))
    mutant_id=$(basename "$mutant_dir")

    # Get the mutant file path from gambit results
    mutant_file=$(jq -r ".[$((mutant_id - 1))].filename" gambit_out/gambit_results.json)
    original_file=$(jq -r ".[$((mutant_id - 1))].original" gambit_out/gambit_results.json)

    # Swap in mutant
    cp "$original_file" "$original_file.bak"
    cp "$mutant_dir/$mutant_file" "$original_file"

    # Run tests quietly; a passing suite means the mutant survived
    if forge test --no-match-test "testFuzz" > /dev/null 2>&1; then
        echo "SURVIVED: Mutant $mutant_id"
        SURVIVED=$((SURVIVED + 1))
    else
        # Note: mutants that fail to compile also land here;
        # exclude those stillborn mutants for an accurate score
        echo "KILLED: Mutant $mutant_id"
        KILLED=$((KILLED + 1))
    fi

    # Restore original
    mv "$original_file.bak" "$original_file"
done

echo ""
echo "=== Mutation Testing Results ==="
echo "Total mutants: $TOTAL"
echo "Killed: $KILLED"
echo "Survived: $SURVIVED"
echo "Mutation score: $(( KILLED * 100 / TOTAL ))%"
```
Interpreting results
After running, you might see something like:
```
KILLED: Mutant 1 (stake require: > 0 → >= 0)
KILLED: Mutant 2 (stake require: > 0 → < 0)
KILLED: Mutant 3 (stakedBalance += → -=)
KILLED: Mutant 4 (totalStaked += → -=)
SURVIVED: Mutant 5 (unstake require: >= amount → > amount)
...
Mutation score: 71%
```
Some mutants survived. Let's look at what one of them tells us.
Mutant 5 survived: changing stakedBalance[msg.sender] >= amount to stakedBalance[msg.sender] > amount didn't break any test. That means we never test the exact boundary: unstaking exactly the staked amount. Fix:
```solidity
function test_unstakeExactBalance() public {
    pool.stake(100e18);
    pool.unstake(100e18);
    assertEq(pool.stakedBalance(address(this)), 0);
    assertEq(pool.totalStaked(), 0);
}
```
Every surviving mutant points at a specific gap. Fix the gap, re-run mutation testing, and watch your score climb.
Improving test suites based on surviving mutants
Here's a systematic approach:
- Sort surviving mutants by location. Cluster them by function. If multiple mutants survive in the same function, your tests for that function are weak across the board.
- Prioritize security-relevant mutations. A surviving mutant in a require() statement or a balance update is much scarier than one in an event emission. Focus on the ones that could lead to fund loss.
- Write targeted tests. Each surviving mutant needs at least one test that specifically exercises the boundary or condition that was mutated. Don't just add random tests. Be surgical.
- Use fuzzing for the hard ones. Some mutations create subtle edge cases that are hard to hit with unit tests. Write a property test with invariant testing instead.
- Re-run after each fix. Mutation testing is iterative. Kill a mutant, check if the new test also kills other mutants (it often does), and repeat until you're above your target score.
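The fuzzing step can be sketched against the StakingPool from the walkthrough. Instead of pinning one boundary, a property test asserts round-trip behavior over arbitrary amounts, which tends to kill several comparison mutants at once. This is a hedged sketch, not part of the original suite:

```solidity
// Property: after staking `amount`, unstaking `amount` always succeeds
// and returns the pool to a clean state for this user.
function testFuzz_stakeThenUnstakeRoundTrips(uint256 amount) public {
    amount = bound(amount, 1, 1_000_000e18); // avoid the zero-revert path
    pool.stake(amount);
    pool.unstake(amount); // hits the >= boundary exactly when amounts match
    assertEq(pool.stakedBalance(address(this)), 0);
    assertEq(pool.totalStaked(), 0);
}
```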
Mutation testing + fuzzing: better together
Mutation testing tells you where your tests are weak. Fuzzing is great at covering those gaps because it automatically explores edge cases.
Here's the workflow:
- Run mutation testing → find surviving mutants
- Write invariant tests that target the weak areas
- Run fuzzing campaigns to exercise those properties
- Re-run mutation testing to verify the gaps are closed
If a mutant survives your fuzzer, you've found a property that's genuinely hard to test. That's useful information. It might point to code that's unnecessarily complex or conditions that are practically unreachable.
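As a sketch of the invariant-testing step, a property over the StakingPool (assuming Foundry's invariant runner with its default call targets) could assert solvency rather than any one behavior:

```solidity
// Hedged sketch: the pool must always hold at least as many tokens as
// it thinks are staked, or some unstake could fail. A mutant that flips
// the subtraction in unstake (totalStaked -= amount → += amount) inflates
// totalStaked while tokens leave, and fuzzed call sequences expose it.
function invariant_poolIsSolvent() public {
    assertGe(token.balanceOf(address(pool)), pool.totalStaked());
}
```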
Common pitfalls
Equivalent mutants. Some mutations don't actually change behavior. For example, changing x * 1 to x * 0 is obviously a real mutant, but changing if (x != 0) to if (x > 0) for a uint256 is equivalent. Both behave identically since uints can't be negative. These inflate your surviving mutant count. Toss them when calculating your score.
Test speed. Running the full test suite per mutant is slow. If you've got 200 mutants and tests take 30 seconds, that's nearly two hours. Speed things up:
- Use --no-match-test "testFuzz" to skip fuzz tests during mutation runs
- Run mutants in parallel
- Focus on critical contracts first
Compilation failures. Some mutants don't compile (e.g., type errors from operator swaps). These are "stillborn" mutants. Exclude them from your score.
Over-testing events. Don't write tests just to kill mutants in event emissions. Events are important for off-chain indexing but rarely affect contract security. Prioritize logic, math, and access control.
Integrating into your workflow
I recommend running mutation testing at two points:
- Before a security review. Get your mutation score up before auditors look at the code. It's embarrassing (and expensive) when an auditor finds a bug that a simple boundary test would've caught.
- After adding new features. Every new function needs tests, and mutation testing tells you if those tests are actually good.
Don't run it on every commit. It's too slow for that. CI is fine for unit tests and short fuzz campaigns. Save mutation testing for dedicated quality checkpoints.
The bottom line
Code coverage lies to you. It tells you which lines ran, not which lines were tested. Mutation testing tells you the truth.
If your mutation score is below 70%, your tests have serious blind spots. Between 70% and 85%, you're doing okay, but there's room to improve. Above 85%, you've got a solid test suite. Above 95%, you're in excellent shape, though getting there takes real effort.
Start small. Pick your most critical contract, run Gambit, kill the surviving mutants, and see how it changes your confidence in your test suite.