A Nice Technique for Systems Programming
Can Rust replace C? This is a question that has been on my mind for many years,
as I created and now am tech lead for upb, a C library for Protocol Buffers.
There is an understandable push to bring memory safety to all parts of the
software stack, and this would suggest a port of upb to Rust.
While I love the premise of Rust, I have long been skeptical that a port of upb
to Rust could preserve the performance and code size characteristics that I and
others have fought so hard to optimize. In fact, this blog entry was
originally going to be an argument for why Rust cannot match C for upb’s use
case.
But I recently discovered a technique that shifted my thinking a lot. I call
it “No-Panic Rust”, and while the technique is clearly not new, I was not
able to find any in-depth discussion of how it works or what problems it
solves. This article is my attempt to fill that gap.
I believe that No-Panic Rust is the key to making Rust a compelling option for
low-level systems programming. I now am optimistic about the possibility of
porting upb to Rust.
Panics are Rust’s mechanism for unrecoverable errors. Anytime our program
encounters an error, we have three basic options for how to handle it:
1. Handle the error immediately (e.g. retry the operation or fall back to plan B).
2. Propagate the error to the caller, who can decide how to handle it.
3. Immediately abort execution.
In Rust, we use Result for (2) and panic!() for (3). When we use Result, it is
considered a “recoverable error”, because the caller can test for the error
and decide how to respond.
With recoverable errors, the potential for error is reflected in the function
signature; a function that returns Result is fallible from the perspective of
the caller. Panics on the other hand present the illusion of infallibility
from an API perspective, but then proceed to handle errors by simply aborting.
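To make the contrast concrete, here is a minimal sketch. The function names and
the ParseError type are made up for illustration and do not come from upb or
any real library.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub struct ParseError;

/// Fallible API: the possibility of failure is visible in the signature,
/// so the caller has to decide what to do about it.
pub fn digit_checked(c: char) -> Result<u32, ParseError> {
    c.to_digit(10).ok_or(ParseError)
}

/// "Infallible" API: the signature promises a u32, but a non-digit input
/// panics instead of reporting an error to the caller.
pub fn digit_or_panic(c: char) -> u32 {
    c.to_digit(10).expect("not a decimal digit")
}
```

Both functions do the same work; only the first one tells the caller, through
its type, that it can fail.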
There is a lot of standard guidance for when to use panic!() vs Result (for
example, here and here), which largely boils down to the idea that panics
should only be used for bugs in the code. I especially like the framing given
in this Reddit post:
[If] your library is the source of a panic, then one of the following
should be true:
- Your library has a bug.
- Your library documents a precondition of a public API item that, when not
  met, causes a panic. Therefore, the user of your library has misused your
  library, and their code has a bug.
If your Rust application panics in response to any user input, then the
following should be true: your application has a bug, whether it be in a
library or in the primary application code.
In this article we are focused on the library case.
If we are trying to port a C library to Rust, we really do not want to
introduce panics in the code, even for unusual error conditions. They cause
many practical problems:
- Code Size: The runtime to handle a panic pulls in about 300 KB of code.
  We pay this cost if even a single panic!() is reachable in the code.
  From a code size perspective, this is a severe overhead, given that the upb
  core is only 30 KB.
- Unrecoverable exit: If a panic is triggered, it takes down the
  entire process. In many applications, this is a severe failure mode
  that libraries should never invoke. Instead, we should return all errors
  to the caller using status codes.
- Runtime overhead: A potential panic implies some kind of runtime check.
  In many cases, the cost of this check will be minimal, but for very small
  and frequently invoked operations, the cost of this check could be
  significant.
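As an illustration of the status-code style, here is a minimal sketch of a
C-callable function that reports every failure through its return value. The
Status enum and div_status are hypothetical names, and the claim that the body
has no panicking path has not been checked against a linked binary.

```rust
/// Hypothetical status codes, laid out like a C enum.
#[repr(C)]
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Status {
    Ok = 0,
    DivideByZero = 1,
    Overflow = 2,
}

/// Divides `a` by `b`, writing the quotient to `out` only on success.
/// The zero divisor is handled explicitly, and `checked_div` reports the
/// i32::MIN / -1 overflow as None rather than panicking.
pub extern "C" fn div_status(a: i32, b: i32, out: &mut i32) -> Status {
    if b == 0 {
        return Status::DivideByZero;
    }
    match a.checked_div(b) {
        Some(q) => {
            *out = q;
            Status::Ok
        }
        None => Status::Overflow,
    }
}
```

By contrast, a plain `a / b` would carry a panic path for both the zero divisor
and the overflow case.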
In the case of upb, I was concerned about all three of these factors. Ideally
we could port upb to Rust without users even noticing. To do that, we want to
maintain the same performance, code size footprint, and error reporting
behavior that the C code has now. Panics get in the way of this ideal.
At some point I realized that it might be possible to ban panics from the
library entirely, which would solve all of these problems at once. That is
when I started getting much more optimistic about porting upb to Rust.
No-Panic Rust is a subset of Rust for which panic!() is unreachable.
Programs written in no-panic Rust are guaranteed never to panic under any
circumstances.
For a library, this means we should be able to build a cdylib that does
not have a panic handler linked into it at all.
We can experiment on godbolt.org to see if we have succeeded or not. Using my
tool Bloaty, we can see if the cdylib binary is >300 KB (suggesting that the
panic handler has been linked in) or <10 KB (suggesting it has not).
Let’s explore this subset a bit. Is “Hello, World” no-panic?
#[no_mangle]
pub extern "C" fn hello_world() {
    println!("Hello, World!"); // Can panic
}
No, per the documentation for println!():
Panics if writing to io::stdout fails.
And indeed, if we try this on Godbolt, we see a big binary.
So println!() is out. If we want to print to stdout, we’ll need to use an API
that does not advertise that panic is possible.
The stdout API looks promising, because it has a write_all() API that returns
a Result, which should allow us to handle errors explicitly:
use std::io::{self, Write};
#[no_mangle]
pub extern "C" fn hello_world() -> bool {
    // write_all() reports failure through its Result instead of panicking.
    let result = io::stdout().write_all(b"Hello, World!\n");
    match result {
        Ok(_) => true,
        Err(_) => false,
    }
}
This seems like it should be no-panic. We are only calling two APIs,
stdout() and write_all(), neither of which documents a potential panic.
But if we try it, we’ll see that panic is indeed reachable in this program
somehow.
From this we have learned that we unfortunately cannot rely on panic
annotations in API documentation to determine a priori whether some Rust code
is no-panic or not. We have to actually try it and observe the results.
How can we diagnose what went wrong? On macOS, the linker has a very handy
option called -why_live, which will print the chain of symbol references
that prevented a symbol from being dead-stripped. We can’t access it on Godbolt
unfortunately, but on macOS we can run this command:
$ RUSTC_LOG=rustc_codegen_ssa::back::link=info \
RUSTFLAGS="-C link-arg=-Wl,-why_live,_rust_panic" \
cargo build --release 2>&1 | rustfilt
This results in the following output, with extraneous details removed:
_core::panicking::panic from [...]
_core::ops::function::FnOnce::call_once from [...]
l_anon.56b0c16dbe4596c74313e318a3dfaa78.520 from [...]
_std::sync::once_lock::OnceLock<T>::initialize from [...]
_std::io::stdio::stdout from [...]
_hello_world from [...]
The panic reference apparently comes from
_core::ops::function::FnOnce::call_once, which is reached from
_std::io::stdio::stdout by way of OnceLock<T>::initialize.
This seems to suggest that Rust’s standard library does not meet the criteria
given above, because it is capable of panicking even in APIs
like std::io::stdout() that do not document a panic-worthy precondition.
This also implies that we need tests that check for the no-panic property.
It’s not enough to check once that the code is no-panic; we need to make sure
it stays no-panic over time, even as our project and our dependencies
evolve.
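One rough way to automate such a check is to reuse the binary-size signal from
above. The sketch below is only a heuristic and not a proof. The function name
and the 100 KB threshold are assumptions; a real project would pick a threshold
from its own known-good build.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Assumed threshold for this sketch. It sits between the small binaries
/// and the >300 KB binaries described in the article; tune it per project.
pub const MAX_NO_PANIC_BYTES: u64 = 100 * 1024;

/// Returns Ok(true) if the file at `path` is small enough that the panic
/// runtime was probably not linked in. I/O errors are returned to the caller.
pub fn looks_no_panic(path: &Path, max_bytes: u64) -> io::Result<bool> {
    let size = fs::metadata(path)?.len();
    Ok(size <= max_bytes)
}
```

A CI job could call looks_no_panic() on the built library after every change
and fail the build when it returns false.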
To get a fully no-panic version of “Hello, World”, we have to reach for the C
library libc. This makes sense, since the C library is generally written to
return all errors as status codes or errno. Unfortunately this means turning
to unsafe:
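extern crate libc;
#[no_mangle]
pub extern "C" fn hello_world() -> bool {
    // printf() needs a NUL-terminated C string, hence the trailing \0.
    const MSG: &str = "Hello, World!\n\0";
    // SAFETY: MSG is NUL-terminated and contains no format specifiers.
    let result = unsafe { libc::printf(MSG.as_ptr() as *const _) };
    // printf() returns a negative value on failure.
    result >= 0
}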
And checking on Godbolt, we see the small binary that confirms that this
library is indeed no-panic.
What about adding two numbers? Is this no-panic?
#[no_mangle]
pub extern "C" fn hello_world(a: i32, b: i32) -> i32 {
    a + b
}
This is a trick question: this is no-panic in opt mode only. For numeric
operations like addition, Rust introduces overflow checks (which panic on
failure) in debug mode, but leaves them out of opt builds.
We can observe this on Godbolt if we add separate panes for opt and non-opt
builds.
This essentially creates a new class of code, which is “no-panic in opt, but
can panic in dbg”.
For the case of upb, this seems like a great option, because it gives us extra
consistency checks in debug mode without suffering the problems of panic in
release builds. It is essentially the Rust equivalent of assert() in C.
Overflow by itself does not represent a safety issue, so we are not giving up
safety by leaving the panics out of opt builds.
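When the overflow behavior should not depend on the build mode, the integer
types offer explicit alternatives. A minimal sketch follows; the function names
are illustrative only.

```rust
/// Wraps around on overflow in every build mode; never panics.
pub fn add_wrapping(a: i32, b: i32) -> i32 {
    a.wrapping_add(b)
}

/// Reports overflow to the caller as None instead of panicking.
pub fn add_checked(a: i32, b: i32) -> Option<i32> {
    a.checked_add(b)
}

/// Spells out the "assert() in C" behavior by hand: the check exists only
/// when debug assertions are enabled, and release builds simply wrap.
pub fn add_debug_checked(a: i32, b: i32) -> i32 {
    debug_assert!(a.checked_add(b).is_some(), "i32 overflow");
    a.wrapping_add(b)
}
```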
What about using standard containers like Vec?
use std::hint::black_box;
#[no_mangle]
pub extern "C" fn hello_world() {
    let vec: Vec<u32> = Vec::new();
    // black_box() keeps the optimizer from discarding the unused Vec.
    black_box(vec);
}
It turns out this is also “opt no-panic” code (perhaps Vec is internally
performing some arithmetic which can overflow).
But once we try to actually push elements into the Vec, we’re squarely out
of no-panic Rust.
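The code for this step is not shown here. The sketch below is an assumption
about what such an experiment might look like, with Vec::push() as the only new
call. Vec::push() is documented to panic if the new capacity exceeds isize::MAX
bytes, and growing the buffer also brings in the allocation-failure path.

```rust
use std::hint::black_box;

/// Pushes a single element and returns the resulting length.
pub fn push_one() -> usize {
    let mut vec: Vec<u32> = Vec::new();
    // May grow the buffer; documented to panic on capacity overflow.
    vec.push(1);
    black_box(vec).len()
}
```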
Vec does have a few APIs that will surface allocation errors instead of
panicking. Theoretically, this code should be no-panic:
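#![feature(vec_push_within_capacity)]
use std::hint::black_box;
#[no_mangle]
pub extern "C" fn hello_world() -> bool {
    let mut vec: Vec<u32> = Vec::new();
    // Reserve space up front; allocation failure is returned as an Err.
    if vec.try_reserve(1).is_err() {
        return false;
    }
    // Nightly-only: returns an Err instead of growing the buffer.
    if vec.push_within_capacity(1).is_err() {
        return false;
    }
    black_box(vec);
    true
}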
This requires the nightly compiler, but I was able to make this work as no-panic
on macOS. For some reason, it did not work with the nightly compiler on Godbolt,
which appears to always include the panic runtime no matter what I do, even for a
trivial library. I was not able to figure out why.
The Rust standard library was not really designed to be no-panic. For example,
memory allocation failure will panic in most cases. If we want to be no-panic,
we will probably have to avoid most of the standard library. Realistically we
will probably want to go fully #![no_std].
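As a rough idea of what a container could look like without the standard
library's allocation paths, here is a sketch of a fixed-capacity buffer whose
push() returns an error when full. FixedBuf is a hypothetical type, not part of
upb or of any crate, and it uses only features available in core.

```rust
/// Hypothetical fixed-capacity buffer: no heap allocation, so there is no
/// allocation-failure path to worry about.
pub struct FixedBuf<const N: usize> {
    items: [u32; N],
    len: usize,
}

impl<const N: usize> FixedBuf<N> {
    pub fn new() -> Self {
        FixedBuf { items: [0; N], len: 0 }
    }

    /// Returns Err(value) when the buffer is full instead of panicking.
    pub fn push(&mut self, value: u32) -> Result<(), u32> {
        // `get_mut` returns None when the index is out of range.
        match self.items.get_mut(self.len) {
            Some(slot) => {
                *slot = value;
                // Overflow-checked in debug builds only; it cannot actually
                // overflow here because len < N at this point.
                self.len += 1;
                Ok(())
            }
            None => Err(value),
        }
    }

    pub fn as_slice(&self) -> &[u32] {
        // `get` avoids the panicking index operator.
        self.items.get(..self.len).unwrap_or(&[])
    }
}
```

The trade-off is that capacity is fixed at compile time, so "out of space"
becomes an ordinary error that the caller must handle.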
Here is another trick question: is this no-panic Rust?
#[no_mangle]
pub extern "C" fn hello_world(data: &[u8]) -> u8 {
    // Guard: after this check, index 0 is known to be in bounds.
    if data.len() < 1 {
        return 0;
    }
    data[0]
}
On one hand, the slice index operation clearly documents that it may panic.
On the other hand, the docs say that this panic will only be triggered if the
index is out of bounds, and we have inserted a guard to ensure that it never
is. So is the panic reachable?
If we use our minds to reason about the code, we would conclude that panic is
unreachable. The compiler is capable of reaching the same conclusion, but only
if we run the optimizer, which can prove through a series of optimizations that
the bounds check will never fail.
So this example ends up being “opt no-panic”, just like our arithmetic
operation, but for an entirely different reason!
This is quite an interesting result that totally changed my thinking about
Rust’s bounds checks.
My previous perspective was that Rust will insert all of these unnecessary
bounds checks, bloating the code and slowing it down for no reason. But our
pre-existing C code is not throwing caution to the wind and hoping for the
best. Every place that we perform an index operation in C, it’s because we
believe we have a proof that the index is in bounds. To avoid the bounds
checks in Rust, we just need to express this proof in a way that the Rust
optimizer can understand. This is what I call the “dance with the optimizer.”
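One safe way to express such a proof is to let pattern matching carry it, so
that there is no index operation at all. A minimal sketch, with illustrative
function names:

```rust
/// The slice pattern proves non-emptiness structurally, so there is no
/// separate bounds check for the optimizer to eliminate.
pub fn first_or_zero(data: &[u8]) -> u8 {
    match data {
        [first, ..] => *first,
        [] => 0,
    }
}

/// The same idea for a two-byte prefix: a short input becomes None.
pub fn first_two(data: &[u8]) -> Option<(u8, u8)> {
    match data {
        [a, b, ..] => Some((*a, *b)),
        _ => None,
    }
}
```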
In the example above, the bounds check is eliminated using only safe code,
but there are other cases where we might need to use unsafe code to help
the optimizer know about program invariants that cannot be easily derived
from the program flow.
For example, consider this (admittedly contrived) program:
pub struct S<'a> {
    data: &'a [u8],
    ofs: usize, // Invariant: ofs < data.len()
}
impl<'a> S<'a> {
    pub fn new(data: &'a [u8]) -> Option<S<'a>> {
        match data.len() {
            0 => None,
            n => Some(S { data, ofs: n - 1 }),
        }
    }
    pub fn get(&self) -> u8 {
        // Bounds-checked: the optimizer cannot see the struct invariant.
        self.data[self.ofs]
    }
}
#[no_mangle]
pub extern "C" fn hello_world(s: &S) -> u8 {
    s.get()
}
In this program, our struct S has an invariant that the offset S::ofs will
always be in bounds. This invariant effectively guarantees that the bounds
check in S::get() will never fail. And we can strongly guarantee that the
invariant holds, because it is enforced by our new() function which is the
only code that sets these struct members.
But the optimizer isn’t capable of reasoning at this level, so it thinks that
the panic is reachable, and keeps the bounds check in the program, even in opt
mode.
To make this no-panic, we need to help the compiler out by reminding it that
this struct invariant holds in the critical path:
use std::hint::assert_unchecked;
pub struct S<'a> {
    data: &'a [u8],
    ofs: usize, // Invariant: ofs < data.len()
}
impl<'a> S<'a> {
    fn check_invariant(&self) {
        // SAFETY: new() is the only constructor and always sets
        // ofs = data.len() - 1 for a non-empty slice.
        unsafe { assert_unchecked(self.ofs < self.data.len()) }
    }
    pub fn new(data: &'a [u8]) -> Option<S<'a>> {
        match data.len() {
            0 => None,
            n => {
                let s = S { data, ofs: n - 1 };
                s.check_invariant();
                Some(s)
            }
        }
    }
    pub fn get(&self) -> u8 {
        // Remind the optimizer of the invariant right before indexing.
        self.check_invariant();
        self.data[self.ofs]
    }
}
#[no_mangle]
pub extern "C" fn hello_world(s: &S) -> u8 {
    s.get()
}
This makes use of std::hint::assert_unchecked, a very sharp tool for making
soundness promises to the compiler. Here we use it to inform the compiler of
our struct invariant. This has the desired effect of making this “opt
no-panic”.
This definitely requires care; we have to be very sure that the predicate we
pass to assert_unchecked is true. Luckily we can fuzz against this assertion
to increase our confidence (in debug mode, assert_unchecked will panic if the
condition is not true). Used judiciously, it can be a powerful tool for
explicitly expressing to Rust the invariants we were relying on to make index
operations safe in C.
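A real fuzzer is out of scope here, but the idea can be sketched as a
deterministic sweep over input sizes that exercises the assertion on every
call. The sweep_lengths() helper is hypothetical, and the struct is repeated so
that the block stands alone. It assumes a toolchain where
std::hint::assert_unchecked is available.

```rust
use std::hint::assert_unchecked;

pub struct S<'a> {
    data: &'a [u8],
    ofs: usize, // Invariant: ofs < data.len()
}

impl<'a> S<'a> {
    fn check_invariant(&self) {
        // SAFETY: new() is the only constructor and sets ofs = len - 1.
        unsafe { assert_unchecked(self.ofs < self.data.len()) }
    }
    pub fn new(data: &'a [u8]) -> Option<S<'a>> {
        match data.len() {
            0 => None,
            n => Some(S { data, ofs: n - 1 }),
        }
    }
    pub fn get(&self) -> u8 {
        self.check_invariant();
        self.data[self.ofs]
    }
}

/// Hypothetical stand-in for a fuzz target: builds an S for every length
/// from 0 to max_len and returns how many non-empty inputs were verified.
pub fn sweep_lengths(max_len: usize) -> usize {
    let mut checked = 0;
    for len in 0..=max_len {
        let bytes = vec![len as u8; len];
        if let Some(s) = S::new(&bytes) {
            assert_eq!(s.get(), len as u8);
            checked += 1;
        }
    }
    checked
}
```

Running this in a debug build is what gives the assertion a chance to fire if
the invariant is ever broken by a later change.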
No-Panic Rust is not for the faint of heart. It requires a lot of careful,
detailed work, and forces you to give up some niceties of Rust, like the
standard library. But if we are diligent, it can give us the performance, code
size, and error reporting behavior of a C library with the extra safety that
comes from Rust.
This extra safety comes from the fact that Rust will automatically insert
bounds checks anywhere it cannot prove that an access is safe. This puts
the burden on us to justify to the compiler in every case why the bounds
check is safe to elide. In some cases this will mean detecting a bounds
violation explicitly and reporting the error to the caller (especially in
parsers, where we do not know whether the input is valid or not). In other
cases, we may know through a program invariant that the index will always
be in-bounds, and we will need to communicate this invariant to Rust.
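For the parser case, a minimal sketch of explicit bounds handling might look
like this. The read_u32_le() function is illustrative and is not taken from
upb.

```rust
/// Reads a little-endian u32 starting at `ofs`, or returns None if the
/// input is too short. The caller decides how to report the error.
pub fn read_u32_le(data: &[u8], ofs: usize) -> Option<u32> {
    // `checked_add` guards against offset overflow; `get` guards the range.
    let end = ofs.checked_add(4)?;
    match data.get(ofs..end)? {
        &[a, b, c, d] => Some(u32::from_le_bytes([a, b, c, d])),
        // Not expected to happen: the range is exactly four bytes long.
        _ => None,
    }
}
```

Here an out-of-bounds read turns into an ordinary None, which a parser can
translate into whatever status code its API uses.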
I should be clear that I have not yet attempted this technique at scale, so
I cannot report on how well it works in practice. For now it is an exciting
future direction for upb, and one that I hope will pay off.
To make this technique practical, we need a tool that can diagnose where a
panic handler was reachable from. The main technique we used in this article
(looking at binary size) does not give us any information about where a panic
came from. On macOS, the -why_live linker option is perfect for this. I
hope other linkers like LLD will add support for this option also. If not, a
standalone tool could be written that analyzes a binary after it’s linked to
find the chain of references that lead to a panic handler.
It would be nice if Rust made it easier to stay within the no-panic subset.
It’s clear that writing no-panic code is not a core use case that the language
focuses on, but there are many situations (embedded, the Linux kernel, etc.) where
we want to avoid panics. It would be nice if functions or even crates could
advertise themselves as no-panic and have the compiler enforce this
transitively. Changing a function from no-panic to panicking would then be an
API-breaking change.