2025-06-17 · 15 min read

Towards the Scientific Audit

By Alex · Security researcher

Two auditors review the same codebase. They follow different paths, focus on different functions, apply different heuristics, and produce different reports. One catches a critical rounding bug in the liquidation logic. The other catches a reentrancy vector in the withdrawal flow. Neither catches both.

This is the state of smart contract auditing today. It is artisanal. The quality of an audit depends on who happens to review your code, what they happen to look at, and how much coffee they had that morning. The results are not reproducible, the coverage is not measurable, and the gaps are invisible — until an attacker finds them.

I have been thinking about this problem for a long time. What would it take to make an audit scientific? Not in the vague sense of "more rigorous," but in the specific sense: reproducible, exhaustive, and falsifiable. I want to lay out the framework we have been building at Recon to get there.

This is not a purely theoretical exercise. We have applied this framework across dozens of engagements — from Liquity's stability pool to Corn's vault system — and the results speak for themselves. When we enumerated coverage classes on Corn, we identified a critical insolvency bug within hours that a previous manual audit had missed entirely.

Standardizing Scope

The first step toward a scientific audit is a standardized scope. When we set up a fuzzing harness, we deploy the entire system under test. This is not optional — the fuzzer needs a running system to interact with. But this requirement turns out to be a gift: it forces us to enumerate every contract, map every relation between them, and assign concrete values to every meaningful storage variable.

This is the deployment problem, and solving it produces something invaluable: a complete, executable specification of the system's initial state. Every contract is listed. Every constructor argument is defined. Every proxy points to its implementation. Every role is assigned. Every oracle returns a price.

Where external integrations exist — price feeds, lending pools, bridges — we deploy mock contracts. These mocks do not simulate the real integration faithfully. Instead, they return symbolic values: any price, any balance, any return code. This includes reentrant callbacks. By making the mock's behavior unconstrained, we test the protocol's resilience to the full range of possible external behaviors, not just the happy path.
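In harness form, such an unconstrained mock can be sketched like this (a minimal Python model, not Recon's actual harness; all names are illustrative):

```python
import random

UINT256_MAX = 2**256 - 1

class UnconstrainedOracleMock:
    """Illustrative mock: instead of simulating a real price feed,
    it returns arbitrary uint256 values, biased toward boundary
    cases, so the fuzzer exercises the full range of external
    behavior rather than just 'realistic' prices."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def latest_price(self):
        # Half the time, return an extreme that tends to trigger
        # overflow/underflow and division-by-zero paths.
        if self.rng.random() < 0.5:
            return self.rng.choice([0, 1, UINT256_MAX - 1, UINT256_MAX])
        return self.rng.randrange(0, UINT256_MAX + 1)

oracle = UnconstrainedOracleMock()
prices = [oracle.latest_price() for _ in range(200)]
```

The point is not realism but reach: a protocol that only survives "sane" oracle answers will fail here first.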

This standardized deployment is the foundation. Without it, nothing else works. With it, we have a reproducible starting point that any auditor can reconstruct.

This is more than a convenience. A 2023 analysis by Trail of Bits noted that many audit failures stem from an incomplete understanding of the system's deployment configuration — constructors called with wrong parameters, proxies pointing to stale implementations, roles assigned to the wrong addresses. A standardized deployment eliminates this entire class of errors before the audit even begins.

The Integration Problem

External integrations are the biggest vulnerability surface in DeFi, and they are also the hardest to standardize. When a protocol integrates with Aave, Compound, Uniswap, or any other external system, the behavior of that external system becomes part of the protocol's security assumptions. And the space of possible integrations is effectively infinite — call it the * problem.

Our approach has two parts. First, mocks with symbolic return values cover the general case. If the protocol calls an external contract and uses the return value, our mock can return anything, including values that trigger edge cases the developer never considered. Second, for known dangerous integrations — fee-on-transfer tokens, rebasing tokens, tokens with callbacks — we have simplified implementations that capture the specific gotcha without the full complexity.
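For the second part, a simplified implementation can be as small as this (a Python sketch under assumed token semantics; the 1% fee and all names are invented for illustration):

```python
class FeeOnTransferTokenMock:
    """Hypothetical simplified token capturing one known gotcha:
    the receiver gets less than the amount passed to transfer()."""

    def __init__(self, fee_bps=100):  # 1% fee, purely illustrative
        self.balances = {}
        self.fee_bps = fee_bps

    def mint(self, who, amount):
        self.balances[who] = self.balances.get(who, 0) + amount

    def transfer(self, sender, receiver, amount):
        assert self.balances.get(sender, 0) >= amount, "insufficient balance"
        fee = amount * self.fee_bps // 10_000
        self.balances[sender] -= amount
        self.balances[receiver] = self.balances.get(receiver, 0) + amount - fee
        return True  # reports success even though the receiver got less

token = FeeOnTransferTokenMock()
token.mint("vault", 1_000)
token.transfer("vault", "user", 1_000)
# A protocol that credits the user with the full 1_000 is now insolvent:
assert token.balances["user"] == 990
```

The mock drops the full ERC-20 surface and keeps only the property that breaks naive accounting.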

I will be honest: this remains the biggest open challenge. We cannot enumerate all possible smart contract integrations any more than we can enumerate all possible inputs. But in the absence of *, a standardized deployment lets us test the code as intended — and that already puts us far ahead of ad-hoc review.

In our Centrifuge engagement, the mocks we deployed for external pool interactions returned symbolic share prices. This immediately surfaced a rounding-based cap bypass that exploited the gap between the protocol's assumptions about external return values and what those integrations could actually return. The bug was invisible to manual review because reviewers implicitly assumed "normal" return values.

Coverage Classes

Here is the core thesis. Once you have a valid, standardized scope, you can enumerate every possible path through the code. Not every possible input — that is infinite. Every possible path, meaning every unique combination of branches taken and not taken.

We define three types of coverage classes:

Non-reverting coverage classes: Combinations of conditions that lead to a successful execution. These represent the states the protocol can actually reach. Each unique path through the code — each unique set of branches taken — is one coverage class.

Assertion-breaking coverage classes: Combinations of conditions that cause an assertion (an invariant property) to fail. These are bugs.

Revert-at-line-X coverage classes: Combinations of conditions that cause execution to revert at a specific line. These represent the protocol's defensive checks. Each revert point with each unique reason for reaching it is a separate class.

Formally, each branching decision in the code is a binary variable. The Cartesian product of all decision variables produces a large number of theoretical combinations, but most are infeasible (contradictory conditions), and many lead to the same code being executed. We deduplicate by focusing on unique code covered rather than the state of each individual condition. What matters is not whether variable X was true or false, but which lines of code were actually reached and in what combination.

Formal methods — symbolic execution, abstract interpretation — can extract these coverage classes from source code. The result is a finite, enumerable list of every meaningfully distinct execution path through the system.

To make this concrete: consider a simple function with two if statements. The first checks whether the caller has sufficient balance; the second checks whether the transfer amount exceeds a cap. This function has four theoretical coverage classes (both true, first true + second false, first false + second true, both false), but only three are meaningful — if the balance check fails, the cap check is never reached, so "both false" collapses into the "first false" class. In a real protocol with hundreds of branching conditions, this deduplication is what makes enumeration tractable.
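The two-branch example can be enumerated mechanically. A minimal sketch (the function and its trace mechanism are illustrative, not Recon's implementation):

```python
from itertools import product

def transfer(balance, amount, cap, trace):
    trace.append("check_balance")
    if balance < amount:
        trace.append("revert_insufficient")
        return "revert@balance"
    trace.append("check_cap")
    if amount > cap:
        trace.append("revert_cap")
        return "revert@cap"
    trace.append("do_transfer")
    return "success"

# Drive each theoretical condition combination with a concrete witness.
classes = {}
for balance_ok, under_cap in product([True, False], repeat=2):
    balance, cap = 100, 50
    amount = (10 if under_cap else 60) if balance_ok else 200
    trace = []
    outcome = transfer(balance, amount, cap, trace)
    # Deduplicate on code actually executed, not on the raw condition tuple.
    classes[tuple(trace)] = outcome

assert len(classes) == 3  # "both false" collapsed into "first false"
```

Four condition combinations go in; three coverage classes come out, because the trace for the unreachable combination is identical to an existing one.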

Semantic Classes

Coverage classes tell you which code paths exist. They do not tell you which code paths are dangerous. A line of code that performs integer division is on many paths, but the truncation risk is only relevant when the numerator is not evenly divisible by the denominator. Coverage alone is insensitive to this.

This is where semantic classes come in. We extend coverage classes with additional dimensions that capture bug-relevant behavior:

Truncation classes: Every division and every downcast creates a truncation boundary. The code behaves differently when a / b truncates versus when it divides evenly. Each truncation point splits its coverage class into two semantic classes.

Overflow classes: Every multiplication, addition, and shift operation has an overflow boundary. Pre-Solidity-0.8 this was silent; post-0.8 it reverts. Either way, the behavior near the boundary is distinct from behavior far from it.

Reentrancy classes: Every external call is a potential reentrancy point. The state before the call and the state after the call may be inconsistent if a callback modifies storage. Each external call splits its coverage class into "reentered" and "not reentered" semantic classes.

These semantic dimensions multiply the number of classes, but they capture exactly the categories of bugs that auditors are trained to look for. They make the implicit explicit.
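The truncation dimension can be made concrete with a single division (a hedged sketch; the share-accounting function and its parameters are hypothetical):

```python
def shares_for_deposit(amount, total_shares, total_assets):
    # One covered line, two semantic classes:
    # exact division vs silent truncation.
    return amount * total_shares // total_assets

def truncation_class(amount, total_shares, total_assets):
    exact = (amount * total_shares) % total_assets == 0
    return "exact" if exact else "truncating"

# Same coverage class, different semantic classes:
assert truncation_class(100, 1_000, 1_000) == "exact"
assert truncation_class(100, 1_000, 1_001) == "truncating"

# In the truncating class, value is silently lost on every call:
lost = 100 * 1_000 - shares_for_deposit(100, 1_000, 1_001) * 1_001
assert lost > 0  # the dust that can accumulate over many cycles
```

A coverage tool marks the division line as hit either way; only the semantic split distinguishes the lossless call from the lossy one.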

In practice, semantic classes have been the source of some of our most impactful findings. The Badger DAO remBADGER desynchronization bug was a truncation class issue — repeated deposit/withdraw cycles accumulated rounding errors that eventually made the vault insolvent. A pure coverage analysis would have marked the deposit and withdraw paths as "covered." The semantic dimension of truncation is what distinguished the safe path from the dangerous one.

What Is an Audit, Then?

With coverage classes and semantic classes enumerated, we can define an audit precisely:

Unit Level Review: For each coverage class and each semantic class, a reviewer examines the code path and annotates whether the behavior is correct, potentially dangerous, or definitively buggy. This is exhaustive at the function level — every path through every function is reviewed.

Path Coverage Review: Salient semantic classes from different functions are combined to construct multi-step edge cases. If function A has a truncation class and function B uses A's output in a multiplication, the combination may produce an overflow that neither class reveals in isolation. This is what stateful fuzzing approximates: exploring combinations of function calls to find multi-step vulnerabilities.
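A toy illustration of why combinations matter (an invented vault, not any real protocol): each function rounds down and looks harmless in isolation, but the deposit-then-withdraw sequence leaks value.

```python
class ToyVault:
    def __init__(self, total_shares, total_assets):
        self.total_shares = total_shares
        self.total_assets = total_assets

    def deposit(self, amount):
        # Unit-level review: rounds down by at most a few wei -- "fine".
        shares = amount * self.total_shares // self.total_assets
        self.total_shares += shares
        self.total_assets += amount
        return shares

    def withdraw(self, shares):
        # Unit-level review: also rounds down -- also "fine" in isolation.
        amount = shares * self.total_assets // self.total_shares
        self.total_shares -= shares
        self.total_assets -= amount
        return amount

vault = ToyVault(total_shares=1_000, total_assets=1_003)
recovered = vault.withdraw(vault.deposit(100))
assert recovered < 100  # the two-step path leaks value
```

This is exactly the kind of multi-step edge case that stateful fuzzing finds by chaining calls, and that path coverage review targets deliberately.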

The thesis is this: a reviewer who completes both levels — examining every coverage class, every semantic class, and every salient combination across functions — would find every bug that falls into a known bug class. The audit becomes exhaustive not by relying on human intuition, but by systematically covering a finite, enumerated space.

How Recon Magic Achieves This

This is not just theory. Recon Magic automates the mechanical parts of this framework. It standardizes scope by generating deployment harnesses. It enumerates coverage classes through a combination of symbolic execution and static analysis. It extracts semantic classes by identifying truncation points, overflow boundaries, and reentrancy surfaces in the source code.

The human auditor's job shifts from "read the code and find bugs" to "review each enumerated class and confirm correctness." This is a fundamentally different task — bounded, measurable, and reproducible. Two auditors working from the same enumeration will cover the same ground.

The numbers illustrate the scale. On a typical 3,000 SLOC DeFi protocol, we enumerate between 500 and 2,000 coverage classes and 100 to 400 semantic classes. A manual reviewer working without this enumeration might cover 60-70% of these classes through intuition and experience. The remaining 30-40% — the classes that are reached only through unusual combinations of conditions — are exactly where the critical bugs hide.

Recon Magic does not replace the human reviewer. It gives the reviewer a map of the territory instead of asking them to explore it blind. Every class is tagged, every path is documented, and the reviewer's job becomes confirmation rather than discovery. This is what makes the audit reproducible: a second reviewer, given the same enumeration, reviews the same classes and reaches the same conclusions.

For more on the underlying techniques, see our guides on smart contract auditing, smart contract security, invariant testing, and coverage-guided fuzzing.

The Road Ahead

We are not done. The integration problem remains open. The enumeration of coverage classes can be computationally expensive for large codebases. And the combination space across functions grows combinatorially — heuristics are needed to prioritize which cross-function combinations to examine.

We are also exploring how this framework connects to formal verification. If coverage classes can be extracted symbolically, they can also be verified symbolically — proving that certain classes are unreachable or that certain semantic conditions never produce dangerous values. The combination of enumeration, fuzzing, and formal verification may eventually close the gap entirely.

For now, we are focused on making the enumeration fast, accurate, and actionable. Every engagement teaches us something new about where the framework works well and where it needs refinement.

But the direction is clear. Audits should not depend on who reviews your code. They should depend on what your code does. Every path, every semantic boundary, every combination — enumerated, reviewed, and documented.

This is the methodology behind every Recon engagement. If you want to see how coverage class enumeration and semantic analysis work on your codebase, request an audit — we will show you exactly what systematic security looks like.
