LLM-generated invariant properties: what works, what hallucinates, how we use them
By deivitto — April 2026
Everyone wants to know: can you just throw your smart contract at GPT-4 or Claude and get a list of invariant properties to test?
Short answer: kind of. Long answer: it depends enormously on the type of property, the complexity of the protocol, and how much you trust the output without verification.
We've been experimenting with LLM-generated properties for over a year now. We've run thousands of generated candidates through actual fuzzing campaigns. Here's what we've learned — the good, the bad, and the stuff that looks right but is completely wrong.
The workflow: not "generate and ship"
Let me be clear about how we actually use LLMs for property generation. It's not "paste code, get properties, run fuzzer." It's a four-stage pipeline:
Stage 1: LLM generates candidate properties. Feed the contract code, NatSpec, and protocol documentation. Ask for invariant properties in a specific format.
Stage 2: Human reviews candidates. An auditor reads each property and checks: does this make sense? Is it testing what it claims? Does it match the protocol's actual semantics?
Stage 3: Fuzzer validates. Surviving properties get implemented and run against the codebase. The fuzzer tells us if they hold, break immediately (bad property), or break under specific conditions (real finding or edge case).
Stage 4: Human refines. Based on fuzzer results, the auditor adjusts, combines, or replaces properties. This often leads to new properties the LLM never suggested.
The LLM handles maybe 30% of the total work. But it's the right 30%: the tedious boilerplate that would otherwise eat into the time an auditor should spend thinking deeply.
What works well
Standard token properties
LLMs are great at generating ERC-20/721/1155 invariant properties. These patterns are heavily represented in training data, and the properties are well-known:
```solidity
// LLM-generated -- accurate and useful as-is
function invariant_totalSupplyEqualsSumOfBalances() public {
    uint256 sum = 0;
    for (uint256 i = 0; i < actors.length; i++) {
        sum += token.balanceOf(actors[i]);
    }
    assertEq(token.totalSupply(), sum);
}

function invariant_transferPreservesTotalSupply() public {
    // totalSupply should never change due to transfers
    assertEq(token.totalSupply(), ghost_initialTotalSupply);
}

function invariant_balanceNeverExceedsTotalSupply() public {
    for (uint256 i = 0; i < actors.length; i++) {
        assertLe(token.balanceOf(actors[i]), token.totalSupply());
    }
}
```
These are correct, well-formed, and directly usable. The LLM can generate 10-15 ERC-20 properties in seconds, and 80%+ will be valid. They won't find novel bugs alone, but they form a solid baseline.
Access control properties
LLMs handle role-based access control well:
```solidity
// LLM-generated -- good starting point
function invariant_onlyOwnerCanPause() public {
    // If the contract is paused, the owner must have called pause()
    if (vault.paused()) {
        assertTrue(
            ghost_pauseCaller == vault.owner(),
            "Non-owner paused the contract"
        );
    }
}

function invariant_ownerIsNeverZeroAddress() public {
    assertTrue(vault.owner() != address(0));
}
```
Monotonicity properties
Properties about values that should only increase or decrease:
```solidity
// LLM-generated -- usually correct for vaults
function invariant_totalAssetsNeverDecreaseWithoutWithdrawal() public {
    if (!ghost_withdrawalOccurred) {
        assertGe(
            vault.totalAssets(),
            ghost_previousTotalAssets,
            "Total assets decreased without withdrawal"
        );
    }
    ghost_previousTotalAssets = vault.totalAssets();
    ghost_withdrawalOccurred = false;
}
```
Boilerplate setup code
LLMs are also surprisingly good at generating the handler functions and setup code for invariant testing frameworks. Given a contract's interface, they'll produce reasonable setUp() functions and handler stubs:
```solidity
// LLM-generated handler setup -- saves 20-30 minutes
contract VaultHandler is Test {
    Vault vault;
    ERC20 asset;
    address[] actors;

    constructor(Vault _vault, ERC20 _asset) {
        vault = _vault;
        asset = _asset;
        actors = new address[](3);
        actors[0] = makeAddr("alice");
        actors[1] = makeAddr("bob");
        actors[2] = makeAddr("carol");
    }

    function deposit(uint256 actorSeed, uint256 amount) external {
        address actor = actors[actorSeed % actors.length];
        uint256 balance = asset.balanceOf(actor);
        if (balance == 0) return; // bound() reverts when min > max
        amount = bound(amount, 1, balance);
        vm.startPrank(actor);
        asset.approve(address(vault), amount);
        vault.deposit(amount, actor);
        vm.stopPrank();
    }

    function withdraw(uint256 actorSeed, uint256 amount) external {
        address actor = actors[actorSeed % actors.length];
        uint256 maxWithdraw = vault.maxWithdraw(actor);
        if (maxWithdraw == 0) return;
        amount = bound(amount, 1, maxWithdraw);
        vm.prank(actor);
        vault.withdraw(amount, actor, actor);
    }
}
```
This kind of scaffolding is where LLMs save the most time. Writing handler functions is repetitive and mechanical, exactly the kind of work LLMs handle well.
What hallucinates
Now the part that matters more. Here's where LLMs confidently generate properties that are wrong.
Complex DeFi logic
Ask an LLM to generate properties for a lending protocol's liquidation mechanism and you'll get something like:
```solidity
// LLM-generated -- WRONG
function invariant_healthFactorAboveOneAfterLiquidation() public {
    for (uint256 i = 0; i < borrowers.length; i++) {
        if (ghost_wasLiquidated[borrowers[i]]) {
            assertGt(
                pool.healthFactor(borrowers[i]),
                1e18,
                "Health factor still below 1 after liquidation"
            );
        }
    }
}
```
This looks right. It isn't. Many lending protocols allow partial liquidation, where the health factor improves but might not reach 1.0. Some protocols have close factors that limit how much can be liquidated per transaction. The LLM doesn't understand these protocol-specific mechanics; it's pattern-matching from examples it's seen, and those examples don't cover every design choice.
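A property that survives partial liquidations asserts improvement rather than a fixed threshold. Here's a sketch of the hand-corrected version, assuming a hypothetical ghost_healthFactorBefore mapping that the liquidation handler snapshots before each call:

```solidity
// Hand-corrected -- holds under partial liquidations and close factors
function invariant_liquidationImprovesHealthFactor() public {
    for (uint256 i = 0; i < borrowers.length; i++) {
        if (ghost_wasLiquidated[borrowers[i]]) {
            // The health factor must not get worse, even if it
            // stays below 1e18 after a partial liquidation
            assertGe(
                pool.healthFactor(borrowers[i]),
                ghost_healthFactorBefore[borrowers[i]],
                "Liquidation worsened health factor"
            );
        }
    }
}
```

The weaker assertion is still meaningful: a liquidation that makes a position less healthy is a bug in almost any design.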
Cross-contract interactions
LLMs struggle with properties that span multiple contracts:
```solidity
// LLM-generated -- WRONG
function invariant_oracleAlwaysReturnsPositivePrice() public {
    (, int256 price,,,) = oracle.latestRoundData();
    assertGt(price, 0, "Oracle returned non-positive price");
}
```
This property should hold in production. But it doesn't belong in an invariant test: the oracle is an external dependency, and in a fuzzing context you're testing your protocol's behavior, not Chainlink's. The real property is: "when the oracle returns zero or negative, does our protocol handle it safely?"
```solidity
// What it should have been:
function invariant_protocolHandlesOracleFailureGracefully() public {
    // After any sequence of operations (including oracle manipulation),
    // no user should lose funds due to stale/bad oracle data
    for (uint256 i = 0; i < users.length; i++) {
        // A user's redeemable value should never be zero
        // when they have a positive deposit
        if (ghost_deposits[users[i]] > 0) {
            assertGt(
                vault.maxWithdraw(users[i]),
                0,
                "User with deposits can't withdraw"
            );
        }
    }
}
```
The LLM tested the oracle. The auditor tests the protocol's reaction to oracle behavior.
Timing and ordering dependencies
Properties involving time-dependent logic almost always need human correction:
```solidity
// LLM-generated -- SUBTLY WRONG
function invariant_vestingReleasesCorrectly() public {
    uint256 elapsed = block.timestamp - vestingStart;
    uint256 expectedRelease = totalVested * elapsed / vestingDuration;
    assertEq(vesting.released(), expectedRelease);
}
```
This ignores: cliff periods, rounding behavior, how released() tracks what was claimed (not what's claimable), and discrete vs continuous vesting. The LLM generates a linear model because that's the simplest one. Real vesting contracts are messier.
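A human-corrected version usually tests a weaker but true bound instead of an exact linear model. A sketch, assuming a hypothetical cliffDuration variable and that released() tracks claimed (not merely claimable) tokens:

```solidity
// Hand-corrected -- upper bound instead of an exact linear model
function invariant_releasedNeverExceedsVestedAmount() public {
    if (block.timestamp < vestingStart + cliffDuration) {
        // Nothing can have been claimed before the cliff
        assertEq(vesting.released(), 0, "Released before cliff");
    } else {
        uint256 elapsed = block.timestamp - vestingStart;
        if (elapsed > vestingDuration) elapsed = vestingDuration;
        // released() lags what's claimable, so assert an upper bound
        assertLe(
            vesting.released(),
            totalVested * elapsed / vestingDuration,
            "Released more than the schedule allows"
        );
    }
}
```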
Economic attack properties
This is the biggest gap. LLMs can't reason well about economic attacks:
```solidity
// LLM-generated -- MEANINGLESS
function invariant_noFlashLoanAttackPossible() public {
    // ... this isn't even a thing you can express as a simple property
}
```
Economic attacks require understanding incentive structures, market conditions, cross-protocol interactions, and adversarial behavior models. LLMs don't have a model of DeFi economics; they have patterns of code. There's a huge difference.
Success rates: the numbers
Over the past year, here's roughly how LLM-generated properties perform across categories:
| Category | Accuracy (Usable As-Is) | Needs Minor Fix | Wrong/Misleading |
|---|---|---|---|
| ERC-20/721 standard | ~85% | ~10% | ~5% |
| Access control | ~75% | ~15% | ~10% |
| Simple arithmetic | ~70% | ~20% | ~10% |
| Vault/ERC-4626 | ~50% | ~30% | ~20% |
| Lending/borrowing | ~25% | ~35% | ~40% |
| AMM/DEX logic | ~20% | ~30% | ~50% |
| Cross-protocol | ~10% | ~20% | ~70% |
| Economic/incentive | ~5% | ~15% | ~80% |
The pattern is clear: the more protocol-specific the logic, the worse LLMs perform. Standard patterns work. Custom business logic doesn't.
Practical tips for using LLMs
Here's what we've found makes the biggest difference:
1. Give context, not just code
Don't just paste the contract. Include:
- What the protocol does in plain English
- The key invariants you know should hold
- The architecture (which contracts interact with which)
- Known edge cases
The LLM generates better properties when it understands intent, not just syntax.
2. Ask for categories separately
Instead of "generate all invariant properties for this contract," ask:
- "Generate properties about balance conservation"
- "Generate properties about access control"
- "Generate properties about state transitions"
Focused prompts produce focused (and more accurate) results.
3. Ask for counterexamples
After generating a property, ask the LLM: "In what scenario could this property be false even in correct code?" This forces it to think about edge cases and often reveals issues with its own generated properties.
4. Use the LLM to review its own output
Generate properties with one prompt. Then in a separate conversation, present the properties alongside the contract and ask: "Are any of these properties incorrect or misleading?" LLMs are often better at critiquing than generating.
5. Generate in test framework format
Ask for properties in your actual testing framework's syntax (Foundry, Hardhat, etc.) rather than pseudocode. This removes a manual translation step and forces the LLM to think about implementation details like bound() calls and vm.prank().
The dangerous middle ground
Here's what keeps me up at night about LLM-generated properties: the ones that are almost right.
A clearly wrong property gets caught in review. A clearly right property adds value immediately. But a subtly wrong property is the dangerous one: it passes the fuzzer, it looks correct at a glance, and it gives you unearned confidence.
Example from an actual engagement:
```solidity
// LLM-generated, passed fuzzer, looked correct
function invariant_poolSolvency() public {
    assertGe(
        underlying.balanceOf(address(pool)),
        pool.totalDebt(),
        "Pool insolvent"
    );
}
```
Looks right? The pool should have enough tokens to cover all debt.
Except this protocol had a yield strategy that deployed assets to external protocols. The balanceOf check only captures tokens in the pool contract, not tokens deployed to strategies. The correct property needed to account for totalDeployed() as well.
The fuzzer didn't catch it because the test setup didn't include a yield strategy actor. The property passed for thousands of runs. Everyone felt safe. The property was wrong.
This is why Stage 2 (human review) isn't optional. An experienced auditor would've asked: "wait, where's the yield accounting?"
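The corrected property, sketched here assuming a totalDeployed() view that reports assets currently sitting in yield strategies, compares debt against the pool's full asset picture:

```solidity
// Hand-corrected -- counts assets deployed to yield strategies
function invariant_poolSolvencyIncludingStrategies() public {
    assertGe(
        underlying.balanceOf(address(pool)) + pool.totalDeployed(),
        pool.totalDebt(),
        "Pool insolvent"
    );
}
```

Even this version only matters if the test setup includes a handler that actually deploys assets to a strategy; otherwise the second term is always zero and the property degenerates to the original.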
Our actual workflow, step by step
For transparency, here's exactly how we integrate LLMs into our invariant testing workflow:
1. Manual architecture review. We read the contracts and build a mental model first. Never skip this.
2. Write critical properties by hand. The 3-5 most important invariants for the protocol are always human-written. These are the properties that, if violated, mean the protocol is broken.
3. LLM generates candidate expansion. We feed the contract plus our hand-written properties to the LLM and ask it to generate additional properties. This typically produces 20-40 candidates.
4. Human triage. An auditor reviews each candidate: keep, modify, or discard. We usually keep about 40%, modify 25%, and discard 35%.
5. Implement and fuzz. Surviving properties get implemented and run. Broken properties get investigated: is it a bad property or a real bug?
6. Iterate. Based on fuzzer results, we refine properties and generate new ones. This often involves going back to the LLM with specific questions: "This property broke because of X, generate alternative properties that account for that behavior."
Total time saved vs. pure manual: roughly 25-35% on property generation. The bigger win is in the handler/setup code, probably 40-50% time saved there.
What the future looks like
LLMs are getting better at code reasoning. The gap between "standard patterns" and "protocol-specific logic" is narrowing with each model generation. But I don't see LLMs replacing human auditors for property generation anytime soon.
The reason is simple: the hardest properties require understanding intent, not just code. What's the protocol supposed to do? What economic assumptions does it make? What happens when those assumptions break? These are questions that require domain knowledge, adversarial thinking, and context that goes beyond the codebase.
Where I do see LLMs becoming more useful:
- Better boilerplate. As models improve, the accuracy on standard patterns will approach 95%+.
- Property templates. "Here's a lending protocol with these parameters, generate the standard lending properties adapted to this implementation."
- Mutation suggestion. "Here's a property that passed, suggest mutations that might break it." This feeds into mutation testing workflows.
- Coverage gap detection. "Here are my existing properties, what aspects of the protocol aren't covered?"
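To make the mutation idea concrete, here's the kind of variant an LLM can propose for a property that passed. Both mutations and the ghost_feeCharged flag are hypothetical; the point is that each tightening gives the fuzzer a fresh target to falsify:

```solidity
// Original property (passed): assertGe(vault.totalAssets(), vault.totalSupply())
function invariant_mutation_exactBacking() public {
    // Tighten >= to ==: expected to fail once yield accrues, but the
    // failure trace can reveal unaccounted donations or fee leaks
    assertEq(vault.totalAssets(), vault.totalSupply());
}

function invariant_mutation_strictSurplusAfterFees() public {
    // Strengthen to a strict surplus once any fee has been charged
    if (ghost_feeCharged) {
        assertGt(vault.totalAssets(), vault.totalSupply());
    }
}
```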
For more on AI-assisted auditing, check our learning resources. And if you want to see how these techniques fit into a broader audit process, read AI and smart contract audits.
The bottom line
LLMs are a tool. A useful one. They don't replace the auditor's brain; they give the auditor a faster way to generate the starting material.
Use them for standard properties, boilerplate, and brainstorming. Don't trust them for protocol-specific business logic, economic properties, or cross-contract interactions. And always, always run the output through a human and a fuzzer before believing it.
The best invariant properties still come from someone who deeply understands the protocol. LLMs can help that person work faster. That's genuinely useful. It's just not magic.