AI Smart Contract Audits: Beyond the Hype
By Alex · Security researcher
Everyone is talking about AI audits. Most of what passes for "AI auditing" today is superficial — run an LLM over the source code, collect its observations, package them as findings. This is not auditing. It is autocomplete with a security vocabulary.
I want to describe what AI-powered auditing actually looks like when you build it on top of a rigorous foundation: invariant testing, coverage-guided fuzzing, and formal methods. This is what we have built at Recon, and it is fundamentally different from what the industry is selling.
The Problem with "AI Auditing"
The current wave of AI audit tools shares a common flaw: they treat security analysis as a text comprehension problem. Feed the code to a language model, ask it to find vulnerabilities, hope it catches something. This approach has three fatal limitations:
No execution context. A language model reading Solidity does not know what the code does — it knows what the code says. It cannot reason about state transitions across a sequence of transactions. It cannot determine whether a rounding error accumulates over 10,000 operations. It cannot simulate a flash loan attack path through three contracts.
No completeness guarantee. There is no way to measure what an LLM missed. It might flag a reentrancy pattern on one function and overlook an identical pattern two functions down. The output is probabilistic, not systematic.
No reproducibility. Run the same LLM on the same code twice and you may get different results. This makes it impossible to verify coverage, compare audit quality, or build confidence that all paths have been reviewed.
These are not minor limitations. They are the same problems that plague manual auditing — and the entire point of automation should be to solve them, not replicate them faster.
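The accumulation point above can be made concrete. The following toy sketch (not from any real protocol; names and rates are invented for illustration) shows integer truncation that is invisible in a single call but material after 10,000 calls, which is exactly the kind of behavior that cannot be bounded by reading the source:

```python
# Toy illustration: per-operation truncation that is negligible once but
# accumulates over 10,000 operations. An integer-math vault credits interest
# as principal * rate_bps // 10_000, truncating toward zero each time.

def accrue_once(principal: int, rate_bps: int) -> int:
    # Integer division truncates: up to 1 wei of interest is lost per call.
    return principal + principal * rate_bps // 10_000

principal = 10**6  # starting balance in wei
exact = float(principal)
for _ in range(10_000):
    principal = accrue_once(principal, 1)  # 0.01% per step, truncated
    exact *= 1.0001                        # same rate, exact arithmetic

drift = exact - principal
# Each step loses under 1 wei, but the lost wei also stop compounding, so the
# cumulative shortfall grows in a way static reading cannot quantify.
print(f"integer result: {principal}, exact: {exact:.0f}, drift: {drift:.0f}")
```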
What AI-Powered Auditing Should Look Like
At Recon, we start from a different premise. AI is not the auditor. AI is the auditor's infrastructure. It generates the artifacts that make systematic review possible.
Here is the pipeline:
Step 1: Automated Property Generation
Recon Magic analyzes the target codebase and generates invariant properties — executable Solidity assertions that express what the protocol must guarantee. Solvency conditions. Accounting consistency. Access control boundaries. Monotonicity constraints on rates and prices.
This is not pattern matching. The system analyzes the contract's storage layout, its function signatures, its external call graph, and its mathematical operations to produce properties that are specific to the protocol. A lending protocol gets properties about collateralization ratios and liquidation thresholds. A vault gets properties about share-to-asset conversion accuracy. An AMM gets properties about constant product invariance.
The result is a Chimera-based test suite — compatible with Echidna, Medusa, and Foundry — ready to run.
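To make "executable invariant property" concrete, here is a hedged sketch of what two generated properties assert, written against a toy vault model in Python purely for illustration; the actual output described above is Solidity in a Chimera-based suite, and the model and property names here are invented:

```python
# Hypothetical sketch: invariant properties over a toy vault model.
# Real generated properties are Solidity assertions; this only shows
# the shape of what they check.

class ToyVault:
    def __init__(self):
        self.total_assets = 0
        self.total_shares = 0
        self.shares = {}  # user -> share balance

    def deposit(self, user: str, assets: int) -> int:
        # First depositor mints 1:1; later depositors mint pro rata.
        if self.total_shares == 0:
            minted = assets
        else:
            minted = assets * self.total_shares // self.total_assets
        self.total_assets += assets
        self.total_shares += minted
        self.shares[user] = self.shares.get(user, 0) + minted
        return minted

def property_share_accounting(vault: ToyVault) -> bool:
    # Accounting consistency: the per-user ledger must sum to totalShares.
    return sum(vault.shares.values()) == vault.total_shares

def property_solvency(vault: ToyVault) -> bool:
    # Solvency: no shares may exist without backing assets.
    return vault.total_shares == 0 or vault.total_assets > 0

v = ToyVault()
v.deposit("alice", 1_000)
v.deposit("bob", 500)
assert property_share_accounting(v) and property_solvency(v)
```

The point of expressing guarantees this way is that a fuzzer can check them after every call in every randomized sequence, rather than relying on anyone's reading of the code.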
Step 2: Coverage Gap Analysis
Properties alone are not enough. A property suite might cover the protocol's happy path thoroughly while missing entire branches of execution. This is where the AI layer adds its most critical value.
Our system identifies coverage classes — every meaningfully distinct execution path through the code — and cross-references them against the generated properties. Paths that are not exercised by any property are flagged as gaps. The AI then generates additional properties, handler functions, and test configurations specifically designed to reach those uncovered paths.
This is informed by research on coverage-guided fuzzing. The foundational insight from Grieco et al.'s work on Echidna and Wüstholz and Christakis's Harvey is that coverage feedback transforms random testing into directed exploration. We apply the same principle to property generation itself: the AI does not generate properties blindly, it generates them to maximize the coverage of the resulting fuzzing campaign.
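The gap-analysis step reduces to a set difference. This sketch uses invented class and property names (not Recon's internal representation) to show the cross-referencing logic:

```python
# Illustrative sketch, hypothetical names: cross-reference the enumerated
# coverage classes against the paths each generated property is known to
# exercise, and surface the uncovered ones as generation targets.

coverage_classes = {
    "deposit/first_depositor",
    "deposit/subsequent",
    "withdraw/full",
    "withdraw/partial",
    "liquidate/underwater",
}

# Which classes each generated property reaches (from instrumentation).
property_reach = {
    "P1_share_accounting": {"deposit/first_depositor", "deposit/subsequent"},
    "P2_solvency": {"deposit/subsequent", "withdraw/partial"},
}

covered = set().union(*property_reach.values())
gaps = sorted(coverage_classes - covered)
# These gaps drive the next round of property and handler generation.
print(gaps)  # ['liquidate/underwater', 'withdraw/full']
```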
Step 3: Semantic Enrichment
Beyond path coverage, the AI enriches the test suite with semantic classes — the bug-relevant dimensions that pure coverage misses:
- Truncation boundaries at every division and downcast
- Overflow boundaries at every multiplication and shift
- Reentrancy surfaces at every external call
- Oracle manipulation vectors at every price-dependent computation
For each semantic boundary, the system generates targeted properties that distinguish the safe case from the dangerous case. A division that truncates safely at normal values might produce a loss-of-precision exploit at extreme values. The AI ensures both sides of every boundary are tested.
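A truncation boundary makes a good worked example. The sketch below is a toy model, not Recon's actual output: the same division that truncates harmlessly at normal magnitudes rounds an entire deposit to zero once the share price is inflated, and a targeted property catches the dangerous side:

```python
# Hypothetical sketch of a truncation-boundary property (toy model).

def assets_to_shares(assets: int, total_shares: int, total_assets: int) -> int:
    # Integer division truncates toward zero.
    return assets * total_shares // total_assets

def property_nonzero_mint(assets: int, total_shares: int, total_assets: int) -> bool:
    # A nonzero deposit must never mint zero shares.
    return assets == 0 or assets_to_shares(assets, total_shares, total_assets) > 0

# Safe side of the boundary: comparable magnitudes, truncation loses < 1 wei.
assert property_nonzero_mint(10**18, 10**18, 10**18)

# Dangerous side: share price inflated to 2e18 assets per share, so a 1e18
# deposit truncates to zero shares; this is the violation the fuzzer flags.
assert not property_nonzero_mint(10**18, 1, 2 * 10**18)
```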
Step 4: Campaign Execution and Analysis
The generated test suite runs as a full stateful fuzzing campaign — millions of randomized transaction sequences, each checked against every property. Violations are captured with full call sequences for reproduction.
But the AI's job does not end at campaign execution. It analyzes the corpus — the set of inputs the fuzzer explored — to identify patterns. Which functions are under-explored? Which state transitions are rare? Which property violations cluster around specific parameter ranges? This analysis feeds back into property refinement, creating a loop that converges toward complete coverage.
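The campaign loop and corpus tally described above can be sketched in miniature. Everything here is a toy stand-in (the real campaigns run Echidna or Medusa against Solidity handlers): randomized call sequences, a property checked after every call, violations captured with their full sequence, and a corpus counter that would expose under-explored functions:

```python
# Minimal sketch of a stateful fuzzing campaign over a toy vault model.
import random
from collections import Counter

def run_campaign(seed: int, sequences: int, depth: int):
    rng = random.Random(seed)  # fixed seed -> reproducible campaign
    corpus = Counter()
    violations = []
    for _ in range(sequences):
        vault = {"assets": 0, "shares": 0}
        calls = []
        for _ in range(depth):
            fn = rng.choice(["deposit", "withdraw"])
            amt = rng.randrange(0, 10**6)
            calls.append((fn, amt))
            corpus[fn] += 1  # corpus tally for under-exploration analysis
            if fn == "deposit":
                vault["assets"] += amt
                vault["shares"] += amt
            elif vault["shares"] >= amt:
                vault["assets"] -= amt
                vault["shares"] -= amt
            # Property: shares must always be fully backed by assets.
            if vault["assets"] < vault["shares"]:
                violations.append(list(calls))  # full sequence for repro
    return corpus, violations

corpus, violations = run_campaign(seed=0, sequences=200, depth=10)
assert not violations  # the toy invariant holds on this model
# A skewed corpus here would mark a function as under-explored and feed
# back into property and handler generation.
print(dict(corpus))
```

Fixing the seed is what makes the reproducibility claim above operational: the same seed replays the same sequences and the same coverage.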
What This Achieves
The combined system — automated property generation, coverage gap filling, semantic enrichment, and iterative campaign analysis — produces results that neither pure manual review nor naive AI can match:
Measurable coverage. We can report exactly how many coverage classes and semantic classes were tested, and which ones remain open. This is not a percentage bar — it is an enumerated list of every execution path through the protocol.
Reproducible results. The test suite is deterministic. Any auditor can run it, verify the results, and extend it. Two runs on the same code produce the same coverage.
Continuous protection. The generated test suite is a deliverable. It runs in CI. When the protocol changes, the properties catch regressions immediately. This is the opposite of a one-time PDF report that becomes stale the moment code changes.
Human-AI collaboration. The AI generates the coverage map. Human auditors review the flagged edge cases, the semantic boundary conditions, and the cross-function interaction paths. The human focuses on judgment — "is this behavior intended?" — while the AI handles enumeration — "here are all the behaviors."
Why Most "AI Audits" Fall Short
The fundamental error in the current AI audit landscape is treating the LLM as the end product. An LLM's opinion about code is, at best, a hint. It might be correct. It might be hallucinated. You cannot tell the difference without execution.
Our approach uses AI to generate executable artifacts — properties, test suites, coverage reports — that can be verified independently of the AI that produced them. If the AI generates a wrong property, the fuzzer will either never trigger it (dead code, filtered out) or trigger it on a false positive (caught during human review). The system is self-correcting in a way that pure LLM analysis is not.
This is the difference between "AI told us this is safe" and "AI generated a test suite that exercises 1,847 coverage classes across 23 contracts, and no property violations were found in 50 million test iterations." The first is an opinion. The second is evidence.
The Road Ahead
We are pushing this further. The integration of symbolic execution with AI-guided property generation is opening paths to proving certain properties mathematically rather than just testing them probabilistically. The combination of formal verification, fuzzing, and AI-guided test generation may eventually provide something close to a completeness guarantee for known bug classes.
For now, the practical result is already clear: AI-powered auditing, done correctly, finds more bugs, covers more code, and delivers more lasting value than either manual review or naive automation alone.
Want to see what AI-powered invariant testing finds on your protocol? Request an audit and we will show you the coverage map.