Auto-Generating Fuzzers
Writing fuzzers by hand is cumbersome. Each crate has a public API. What if we could put both of these facts together, and automatically generate a fuzzer that explores the whole crate’s API?
Bucket list of objects
In order to make stateful fuzzers, it is usual to keep each stateful object in a bucket and run functions on these objects.
With this in mind, it would be possible to use a type map to have one bucket per type of object our application is dealing with.
The fuzzer generator could then list the crate’s API and generate one “operation type” per function it can call (or object it can structurally build). When a function takes an object as argument, the corresponding operation would instead take an index, and use it to look up the object in the bucket associated with that type in the type map.
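To make that a bit more concrete, here is a minimal sketch of what the generated harness’s state could look like, assuming a TypeId-keyed map of Box<dyn Any> buckets; the Operation enum and the frobnicate function it mirrors are made up, since the real generator would derive them from the crate’s actual API:

```rust
use std::any::{Any, TypeId};
use std::collections::HashMap;

/// One bucket per concrete type the crate's API deals with.
#[derive(Default)]
struct Buckets {
    map: HashMap<TypeId, Vec<Box<dyn Any>>>,
}

impl Buckets {
    fn push<T: Any>(&mut self, value: T) {
        self.map.entry(TypeId::of::<T>()).or_default().push(Box::new(value));
    }

    /// The fuzzer provides an arbitrary index; reducing it modulo the bucket
    /// length turns any input byte into a reference to a valid object.
    fn get<T: Any>(&self, index: usize) -> Option<&T> {
        let bucket = self.map.get(&TypeId::of::<T>())?;
        if bucket.is_empty() {
            return None;
        }
        // Deref the Box first so we downcast the contents, not the Box itself.
        bucket[index % bucket.len()].as_ref().downcast_ref::<T>()
    }
}

/// One "operation type" per callable function, e.g. for a hypothetical
/// `fn frobnicate(foo: &Foo, n: u32) -> Bar`.
enum Operation {
    NewFoo,
    Frobnicate { foo_index: usize, n: u32 },
}
```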
Enter references
The problem with this scheme is that type maps need the type to implement Any for storage, for safety reasons. And a non-'static reference cannot be Any.
So we need to work around that somehow. The first thing is, in order to track whether an object is already borrowed, we can just store it inside a RefCell. Then, RefCell will do all the work of checking whether an object is ready to be borrowed or whether a borrow is already in progress. But then we still need a way to store the Ref(Mut) somehow, and it carries a '_ lifetime representing its shared borrow of the RefCell. So we’re back to square one.
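Here is a tiny illustration of the problem (the commented-out Any-boxing line is the one that would not compile):

```rust
use std::cell::{Ref, RefCell};

fn main() {
    let cell = RefCell::new(String::from("hello"));

    // The object itself can live behind a RefCell just fine, but the guard
    // borrows the RefCell, so its type is `Ref<'a, String>` for a non-'static 'a...
    let guard: Ref<'_, String> = cell.borrow();

    // ...and Any requires 'static, so this line does not compile:
    // let boxed: Box<dyn std::any::Any> = Box::new(guard);
    assert_eq!(&*guard, "hello");
}
```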
Casual leaking
But good news! We can solve that problem pretty simply. The RefCell itself only ever needs to be borrowed, never moved. So we can just leak it, get a &'static reference to it, and then it’s actually possible to store the Refs.
Well, except for the fact that leaking memory in a fuzzer will definitely end badly: over millions of runs, the leaks add up. To avoid that issue, we can leak not into the global scope, but into an arena that will be freed at the end of the current fuzzing run.
The issue is, we’re back to non-'static references, so they don’t implement Any. No big deal: we can probably just transmute-extend the lifetime here, as we know that literally no reference to anything at all will ever outlive our arena.
… or do we? If we start passing 'static references to the library under fuzzing, it could perfectly well store them in thread-locals, and then poof, there’s UB. So if doing this, it’s important to reduce the lifetime of any reference just after taking it out of the type map, e.g. by using a closure, so that the crate cannot rely on it living any longer than the function call itself. Sure, for now it cannot possibly be a problem, because the function needs to be generic over the lifetime it takes (or we’d know from the signature that it expects an &'static and not even try), but with specialization it’s very possible that this will become a problem.
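Sketched out, the two halves of that dance could look like this; the arena itself is left abstract, and the transmute is exactly the reckless part described above:

```rust
use std::cell::RefCell;

/// Lifetime-extend a reference into the arena so that the `Ref`/`RefMut`
/// guards derived from it become 'static and can live in the type map.
///
/// SAFETY: the caller promises that nothing derived from the returned
/// reference is ever used after the arena backing `cell` is freed.
unsafe fn extend_to_static<T>(cell: &RefCell<T>) -> &'static RefCell<T> {
    unsafe { std::mem::transmute(cell) }
}

/// Shrink the lifetime back down right after pulling a reference out of the
/// type map: the crate under fuzzing only ever sees a fresh, anonymous
/// lifetime, so it cannot legitimately assume it got an &'static.
fn with_object<T, R>(cell: &'static RefCell<T>, f: impl FnOnce(&RefCell<T>) -> R) -> R {
    f(cell)
}
```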
(Oh, and another problem is that I couldn’t easily find a crate that would actually allow me to leak an untyped, non-'static-lifetime T into it and recover an &'arena T. But well, that’s a matter for another day, I guess.)
But I wanna own it
Now the real issue is that it becomes impossible to pass the value inside one of these RefCells by value to anything else again. Sure, there is the RefCell::take(&self) -> T method, but it requires T: Default, which is definitely not an option in the general case. So we do need to use RefCell::into_inner(self) -> T.
But remember, we leaked the RefCell, so we will only ever have an &'arena to it, and it will eventually be dropped when the arena goes out of scope! So we can’t just use into_inner. Instead, we need a way to recover ownership of the RefCell.
The good news is, RefCell tells us whether it is currently being borrowed. We can just use RefCell::try_borrow_mut, and thus know whether anything is currently messing with it. If nothing is, then we can do whatever dark magic we want to get the value back safely out of the leak, like using the imaginary arena’s unsafe method to retrieve the leaked value in exchange for the assertedly-last reference to it. And given the fuzzer code is generated, and these are tests so we can be a bit reckless in our unsafe handling, we just happen to know that the reference in the type map is the last one to it, because any other reference would be held in a Ref or RefMut, and we have just checked that.
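In code, that recovery could look roughly like this; ptr::read stands in for the imaginary arena’s unsafe “give me back the leaked value” method:

```rust
use std::cell::RefCell;

/// Try to recover ownership of the contents of a leaked RefCell.
///
/// SAFETY: the caller must guarantee that `cell` is the last reference to
/// this slot (the one held by the type map), and that the arena never reads
/// or drops that slot again afterwards (otherwise: double drop).
unsafe fn take_back<T>(cell: &'static RefCell<T>) -> Option<T> {
    // If anything still holds a Ref or RefMut, refuse to move the value out.
    if cell.try_borrow_mut().is_err() {
        return None;
    }
    // Bitwise-copy the RefCell out of the arena and unwrap it: this is the
    // "dark magic" part that a real arena API would have to make sound.
    Some(unsafe { std::ptr::read(cell) }.into_inner())
}
```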
And closures?
Closures are hard, in particular because they introduce a new lifetime that definitely cannot be extended to the arena lifetime. So… I’m not sure. Maybe we should introduce another arena per closure run, and make sure that any elements allocated in there die with it? But then we also need a way to get elements out of the closure for the return value, or to be able to proxy references returned by the closure.
What about a GC?
… well, maybe something based on a GC would actually make more sense than all that arena mess, letting the GC handle everything about the end of lifetimes, so we wouldn’t ever need to extend lifetimes? But then we still have the problem of having to store references returned by function calls like Foo::get_bar(&self) -> &Bar.
A dynamic borrow checker
Fundamentally, what this would need is a dynamic borrow checker. Right now, RefCell implements such a dynamic borrow checker, but only outside the static borrow checker: as soon as the RefMut is turned into an &mut, there’s no going back.
What we would need is a dynamic borrow checker that can take in constraints provided by the static borrow checker. In other words, we would need a way to have a function like this:
```rust
unsafe fn BorrowCheckedArena::insert_reference<T, U>(
    &mut self,
    borrowed: Vec<Ptr<T>>,
    borrower: &U,
) -> ConstPtr<U>;
```
This API would assume that borrower’s lifetime only ever depends on the pointers in borrowed, and would store it, allowing the reference to be recovered later from the ConstPtr<U>.
Or even better, something like this, as that would allow for an actually safe API (modulo error handling for “this was already mutably borrowed”, the choice between & and &mut, and other details):
```rust
fn BorrowCheckedArena::run_with_deps<T, U>(
    &mut self,
    deps: Vec<Ptr<T>>,
    run: impl 'static + FnOnce(&[&T]) -> U,
) -> Ptr<U>;
```
Maybe it would be possible to do it using gc::GcCell alongside ouroboros to make the returned GcCellRef be owned by taking a Gc alongside it? But that still would not support use cases that need multiple dependencies, or use cases that return an object that contains a borrow (as opposed to directly a reference).
Limitations
Global state
An intrinsic limitation of this scheme lies in “global state.” Arguably, a library that makes use of global state is bad in the first place. But some libraries do use it, and there needs to be a way for the generated fuzzer to be aware of it and reset the global state between two runs.
This should probably be configured manually on the fuzzer generator. And maybe the fuzzer generator should refuse to build if the user does not at least specify that there is no global state, or provide the code needed to reset it to a known-good initial state, which would be called before every fuzzer run?
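For instance, the configuration could look something like this; every name here is made up, since the generator doesn’t exist yet:

```rust
/// Hypothetical configuration for how the generated fuzzer should handle
/// global state in the crate under test.
enum GlobalState {
    /// The user asserts there is no global state at all.
    None,
    /// A user-provided hook, called by the harness before every run, that
    /// resets all globals to a known-good initial state.
    ResetWith(fn()),
}

/// What the user would write for a crate that keeps, say, a global cache;
/// `clear_global_cache` is a stand-in for whatever the crate actually exposes.
fn global_state_config() -> GlobalState {
    GlobalState::ResetWith(clear_global_cache)
}

fn clear_global_cache() {
    // crate-specific reset code goes here
}
```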
Generics
Another obvious limitation is generics. Any generic function needs to be instantiated, and there is literally an infinite number of types it can be instantiated with (think e.g. Box<Box<Box<Box<()>>>>).
So we need some way of knowing which types to run the generics with. Right now, I think the best option would be to have the user configure the fuzzer generator so it knows, for each kind of generic (e.g. T: mycrate::Trait is one kind of generic, T: mycrate::Trait + Debug is another), which types it should generate with. At least for the types defined in the crate, it’s easy for the fuzzer generator to list them all, but some of them could be generic themselves, so there’s probably no getting around the fact that the user will need to say for which types each generic should be tested.
Oh, and closures are one particularly cursed kind of generic in this regard, because we won’t want to actually store them in the type map, so we’d need special handling here.
Or, sure, the whole thing could work by having the fuzzer actually choose the types and generate code to be compiled and then fuzzed, but… that would certainly be too much craziness, even for me.