Tracking Down FFI Segfaults in Rust
I was developing safe wrappers for Ceph recently and wanted to put them to use. I wrote some code that talked directly to RADOS (Ceph's object store) to back up data. The first time I ran the program after getting it to compile, it segfaulted. The problem looked like it was occurring in SQLite: the backtrace pointed to SQLite failing to destroy a WHERE clause. I thought, surely this can't be the real problem. This crate has thousands of downloads, and it would have come up by now. I filed a bug on their GitHub anyway and added as much information as I had.
Next I commented out the SQLite code to see if that would help. I then ran into a different segfault, this time where Rust was failing to extend a slice. That really tipped me off that something was majorly broken, most likely memory corruption happening somewhere else and only surfacing later. I started asking questions on #rust. If you haven't signed onto #rust, I highly recommend it. The people are very friendly and knowledgeable. It's probably the best IRC channel that I regularly hang out in.

After adding a ton of print statements I wasn't any closer to figuring out what was happening. I suspected it was my code, but I had no way to prove it. A thought then occurred to me: Rust is compatible with C, so I wondered if Valgrind would be able to hunt this bug down. I asked on #rust and someone clued me in. They said that Valgrind does work with Rust, but you need to change the memory allocator to the system allocator. They were absolutely correct. With jemalloc I was getting inconsistent results. Once I switched to the system allocator, I got some very interesting results.
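For anyone who wants to try the same thing, here is a minimal sketch of selecting the system allocator explicitly. It assumes a toolchain that supports the `#[global_allocator]` attribute; the exact mechanism has varied between Rust versions, and since Rust 1.32 the system allocator is already the default, so this is only needed on older toolchains or when a crate has configured something else. The `build_message` function is a made-up placeholder so there is an allocation for Valgrind to see.

```rust
use std::alloc::System;

// Route every heap allocation in this program through the platform
// allocator (malloc/free on Unix), which Valgrind knows how to intercept.
#[global_allocator]
static GLOBAL: System = System;

// Hypothetical helper, only here so the program allocates something.
fn build_message() -> String {
    let mut s = String::with_capacity(64);
    s.push_str("allocated by the system allocator");
    s
}

fn main() {
    println!("{}", build_message());
}
```

With this in place, running the compiled binary under Valgrind should report on Rust allocations the same way it would for a C program.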
Being the cautious developer that I am, I wrote a function in my FFI code that checked all return codes from Ceph. This function would call strerror and then turn the result into a Rust string. What I didn't realize at the time was that I didn't own the memory that strerror was giving me a pointer to. Valgrind showed that this function was the root of all the weird behavior I was seeing. I did two things to fix the problem. First, I switched to strerror_r, because Ceph uses several threads internally in its library and strerror is not guaranteed to be thread-safe. Second, I allocated some memory on the heap in Rust and copied the string that strerror_r gave me into it. I ran Valgrind again and, just like magic, everything worked! Valgrind really is an amazing tool for finding memory problems.
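To make the fix concrete, here is a rough sketch of the pattern, not my exact code. The name `error_string` and the 256-byte buffer size are assumptions for illustration. The idea is that the caller supplies the buffer to strerror_r, and the text is then copied into a `String` that Rust owns, so nothing on the Rust side ever tries to free memory belonging to libc. (One way to get this wrong, for instance, is handing a libc pointer to something like `CString::from_raw`, which is only meant for pointers that came from `CString::into_raw`.) The declaration below targets the XSI variant of strerror_r; on glibc that symbol is exported as `__xpg_strerror_r`, because the plain name refers to a GNU variant with a different return type.

```rust
use std::ffi::CStr;
use std::os::raw::{c_char, c_int};

unsafe extern "C" {
    // XSI-compliant strerror_r: writes the message into `buf` and
    // returns 0 on success.
    #[cfg_attr(
        all(target_os = "linux", target_env = "gnu"),
        link_name = "__xpg_strerror_r"
    )]
    fn strerror_r(errnum: c_int, buf: *mut c_char, buflen: usize) -> c_int;
}

/// Hypothetical helper: returns an owned description of an errno value.
fn error_string(errnum: c_int) -> String {
    // Zero-initialised, so the buffer is NUL-terminated no matter what.
    let mut buf = [0 as c_char; 256];

    // Safety: `buf` is valid for writes of `buf.len()` bytes.
    let rc = unsafe { strerror_r(errnum, buf.as_mut_ptr(), buf.len()) };
    if rc != 0 {
        return format!("unknown error {}", errnum);
    }

    // Safety: the buffer holds a NUL-terminated C string.
    let msg = unsafe { CStr::from_ptr(buf.as_ptr()) };

    // Copy the bytes into a heap-allocated String owned by Rust.
    msg.to_string_lossy().into_owned()
}

fn main() {
    // 2 is ENOENT on Linux and macOS.
    println!("{}", error_string(2));
}
```

If the C library reports failures as negated errno values, negate them before passing them in. When all you need is the message text, `std::io::Error::from_raw_os_error(code).to_string()` from the standard library avoids the unsafe code entirely.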
The other interesting thing I got out of this experience is that Rust greatly minimizes the surface area for memory problems. For a while I was thinking that building these FFI bindings was a bad idea. But really, the only areas of your code you have to worry about are the unsafe blocks. Once I fixed this one problem, Valgrind returned a perfect report. Writing FFI code can definitely be tricky, but you have a few tools at your disposal. strace, Valgrind, and rust-gdb were extremely helpful in tracking this bug down.