🧪 The 12-Category Probe: Finding Out What Actually Breaks in a Portable Rust Binary
Second post in the one-bin-to-rule-them-all series. Previously: meet the Actually Portable Executable.
In the previous post I had a Rust "hello world" running as a single APE binary, and a vague claim that sync Rust works on six operating systems. That's a claim, not evidence 🤨. This post is the evidence: a probe crate that exercises twelve areas of the Rust standard library, run against a matrix of OSes, and the eleven distinct places where Rust's view of the world quietly drifts from reality when you take the same binary off Linux x86_64 and drop it onto anything else.
Zero of the findings are "compilation error" or "doesn't link". In every single one, the binary builds, the binary runs, and the output is wrong or the program panics mid-run 💥. That's the whole point: cosmocc's job is making the binary boot on every OS, and it does that job. What breaks is above the loader, inside Rust's assumptions about the libc it was built against.
What the probe actually tests
The probe is a single Cargo crate (edition 2024) that prints structured output for twelve categories:
| CAT | Category | What it exercises |
|---|---|---|
| 01 | errno | File::create of missing / permission-denied paths, observe raw_os_error and ErrorKind |
| 02 | O_* flags | O_NONBLOCK, O_CLOEXEC as seen from libc::* (compile-time constants) |
| 03 | signals | SIGHUP, SIGINT, SIGTERM, SIGKILL, SIGUSR1, SIGPIPE, SIGCHLD, SIGSEGV numeric values |
| 04 | stat | fs::metadata(temp_dir).is_dir(), len(), lstat(symlink).is_symlink() |
| 05 | networking | TcpListener::bind, TCP accept loop, UdpSocket::bind |
| 06 | threads | 8-thread Mutex<u64> counter |
| 07 | Command | Command::new("echo"), Command::new("/nonexistent") error kinds |
| 08 | env vars | set_var / var / remove_var with non-ASCII keys and values |
| 09 | fs_times | created, modified, accessed on a freshly-written temp file |
| 10 | CStr / OsStr | Roundtrip of "hello", multi-byte UTF-8, empty string |
| 11 | panic + catch_unwind | Force a panic in a closure, assert catch_unwind returns Err, recover cleanly |
| 12 | time | SystemTime::now() delta across a thread::sleep(10ms), same for Instant::elapsed |
Twelve categories is not "everything std does," but it is "everything which, if it drifts, you will notice within two lines of real Rust code". If these all work the same on six OSes, the odds of any given Rust CLI working are high. If they don't, you learn where it's going to drift before writing a single line of application code.
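To give a feel for the shape of a category, here is a minimal sketch of what a CAT01-style check could look like. The function name `probe_errno` and the path in the usage note are illustrative assumptions, not the probe's actual source.

```rust
use std::fs::File;
use std::io::ErrorKind;

/// Hypothetical CAT01-style check: try to create a file at `path` and report
/// what std saw. Returns `None` if the file was created (no error to inspect),
/// otherwise the raw OS error number and the `ErrorKind` std mapped it to.
fn probe_errno(path: &str) -> Option<(Option<i32>, ErrorKind)> {
    match File::create(path) {
        Ok(_) => None,
        Err(e) => Some((e.raw_os_error(), e.kind())),
    }
}
```

Calling `probe_errno("/definitely/missing/dir/probe.txt")` and printing both halves of the tuple gives one line of structured output per case. The interesting part is comparing the raw number and the kind across hosts.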
The probe is built as a fat APE the same way hello.com was in post 1: cargo +nightly, cosmo target spec, apelink to fuse the x86_64 and aarch64 objects into one file. probe/probe.com is 4.6 MB. One copy of that binary gets run on every target.
Three things I had to fix just to get the probe to build
Getting the probe to link burned about a day before it would even run. Worth noting because these aren't Rust bugs, they're places where cosmopolitan and Rust's expectations don't line up at the linker:
The libc crate gets compiled twice under -Z build-std. rust-ape-example's .cargo/config.toml includes build-std = ["libc", "panic_abort", "std"]. For hello-world that's fine, but the probe directly depends on libc (for O_NONBLOCK, SIGHUP, syscall numbers) and cargo built two copies, one with the private rustc-dep-of-std feature flag on, one without, and refused to link them together:
    error[E0464]: multiple candidates for rlib dependency libc found
Fix: drop libc from the build-std list. cargo then builds a single libc for both std's internal needs and the probe's direct dep.
waitid is not in cosmocc 4.0.2's libc. Current nightly std unconditionally calls libc::waitid(P_PIDFD, ...) from std::sys::pal::unix::linux::pidfd, so anything that touches std::process::Command fails to link:
    undefined reference to `waitid'
Fix: a six-line shim in probe/src/main.rs that defines waitid as a call to libc::syscall(SYS_waitid, ...). cosmo's syscall() does per-OS mapping internally, so the shim works across targets without caring which kernel it ends up on.
__xpg_strerror_r returns negative values on some inputs. Rust std on target_os = "linux" redirects its strerror_r FFI call to __xpg_strerror_r (the XSI convention). The XSI contract says: return 0 on success, positive errno on failure, never negative. cosmocc 4.0.2's __xpg_strerror_r returns negative values in some cases, which trips std's sanity check:
    // in std/src/sys/pal/unix/os.rs (simplified)
    let ret = __xpg_strerror_r(errno, buf.as_mut_ptr(), buf.len());
    if ret < 0 {
        panic!("strerror_r failure");
    }
The first io::Error::to_string() call in CAT01 crashes the whole probe. Fix: another small shim that delegates to cosmo's plain strerror() and always returns 0. Not a real fix, just enough to keep the probe running so we can see the other findings.
These three shims got the probe to build and run, but only against vanilla upstream libc. When I later wanted the real fix, a forked libc with cosmo-aware constants, I hit a second layer of build-system friction: under -Z build-std, [patch.crates-io] in a crate's Cargo.toml does not propagate to std's own libc dependency. std rebuilds its libc from its own sysroot-resident Cargo.toml, which references crates.io directly. To make std pick up the forked libc, the libc dependency inside .../rustup/toolchains/nightly-.../lib/rustlib/src/rust/library/std/Cargo.toml has to be edited to point at a local path. Patching libc alone isn't enough; you also have to patch std's Cargo manifest to use the patched libc. That's a post 3 detail, not a post 2 one, but worth calling out now because it's the kind of thing that silently wastes a day.
The matrix
The same probe.com binary was copied onto six targets and run once on each. Each run starts with a header the probe prints on startup, reporting what rustc thinks the build target is. Here are the first few lines of the run on Windows Server 2022:
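[Listing not preserved: the probe's startup header from the Windows Server 2022 run, reporting os=linux, family=unix, and the current directory /C/Users/Administrator]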
os=linux on Windows. family=unix. The current directory is /C/Users/Administrator, which is cosmopolitan's Unix-path view of C:\Users\Administrator. The binary is insisting, with env!("CARGO_CFG_TARGET_OS") as its witness, that it's running on Linux, while it is demonstrably running on the Windows Server kernel. That's the investigation's premise confirmed: rustc's target pins os=linux at compile time, and nothing inside the binary can find out what it's really running on. Every divergence below is a consequence of code that trusted that os=linux label too much.
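A minimal sketch of why the label can't change at runtime, using only std. The probe's real header code isn't shown in this post, so the function names and output format here are assumptions. The point is that both `std::env::consts::*` and `cfg!` are resolved when the crate is compiled, not when it runs.

```rust
/// Hypothetical header line in the spirit of the probe's startup banner.
/// `OS`, `FAMILY`, and `ARCH` are constants fixed by the compile target.
fn build_target_header() -> String {
    format!(
        "os={} family={} arch={}",
        std::env::consts::OS,
        std::env::consts::FAMILY,
        std::env::consts::ARCH
    )
}

/// `cfg!` expands to a literal `true` or `false` at compile time, so this
/// answers "what was I built for?", never "what am I running on?".
fn compiled_for_linux() -> bool {
    cfg!(target_os = "linux")
}
```

On a binary built for a Linux target, both functions keep saying "linux" no matter which kernel loaded the file.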
Condensed, where ✓ means "matches Linux baseline" and ✗ means "diverges":
| CAT | Linux x64 | Linux arm64 | FreeBSD 14 | OpenBSD 7.4 | Windows 2022 | macOS arm64 |
|---|---|---|---|---|---|---|
| 01 errno ENOENT/EACCES | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| 01 errno EINVAL raw | 22 | 22 | 41 F-002 | 43 F-002 | 87 F-002 | 22 (†) |
| 02 O_* flags | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| 03 signals | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| 04 stat struct | ✓ | ✗ F-001 | ✓ | ✓ | ✓ | ✗ F-001 |
| 05 TcpListener::bind | ✓ | ✓ | ✗ F-003 | ✗ F-003 | ✗ F-003/F-009 | ✗ F-003 |
| 06 threads | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| 07 Command::new | ✓ | ✓ | ✗ F-004 | ✗ F-004 | ✗ F-010 | ✗ F-004 |
| 08 env vars non-ASCII | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| 09 fs_times | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| 10 CStr / OsStr | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| 11 panic + catch_unwind | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| 12 SystemTime delta | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| 12 Instant::elapsed | ✓ | ✓ | ✗ F-005 | ✗ F-011 | ✓ | ✗ F-011 |
(†) macOS's native EINVAL happens to equal Linux's 22, so the raw number "looks right" by coincidence, but the finding still applies. What's returned is the native OS's errno, not a Linux-normalised one.
Working cells make up most of the grid: signals, threads, env vars, CStr/OsStr, panic + unwind, basic errno classification, and filesystem times all work. The divergent ones cluster into a handful of categories, and those are the findings.
F-001: the aarch64 stat layout mystery
The first finding came before the VM matrix. I ran the same probe.com on a Raspberry Pi 4 running Debian 13 (native aarch64, not an emulator) and got this on CAT04:
    Linux x86_64: stat("/tmp").len=780 is_dir=true lstat(symlink).is_symlink=true
    Linux aarch64: stat("/tmp").len=220 is_dir=false lstat(symlink).is_symlink=false
is_dir=false on a directory. is_symlink=false on a symlink that the same probe created two lines earlier. The len=220 isn't zero, it's a plausible number, just the wrong number, almost certainly the value of some other field in struct stat being read from the wrong offset.
S_ISDIR(st_mode) evaluating false means either st_mode is wrong, or st_mode is being read from an offset that points at a different field. The latter fits: rustc compiles against aarch64-unknown-linux-musl's struct stat layout. cosmocc emits something different on aarch64. On x86_64 the two layouts agree by coincidence, because x86_64's stat ABI has less historical churn. On aarch64 they don't.
Running the same probe natively on Apple Silicon turned up the same CAT04 failure (len=6880, is_dir=false, is_symlink=false). That's the clincher: F-001 is arch-level, not OS-level. aarch64 Linux and aarch64 macOS both see it. x86_64 FreeBSD, x86_64 Windows, and x86_64 OpenBSD don't. The divergence lives in the struct stat layout that Rust's libc crate defines for aarch64-*-musl, and cosmocc's aarch64 target doesn't match it.
The fix, which I'll walk through in the next post, is to redefine struct stat on cosmo-aarch64 inside a fork of the libc crate. So F-001 is a single root cause surfacing on every aarch64 cosmo target, rather than several independent problems that happen to look alike.
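To make the "right bytes, wrong offset" mechanism concrete, here is a toy model. The two layouts are hypothetical and heavily simplified: they are not the real cosmo or musl `struct stat`, and the offsets are assumptions chosen only to show how a reader that disagrees with the writer gets a plausible number that fails `S_ISDIR`.

```rust
const S_IFMT: u32 = 0o170000; // file-type mask inside st_mode
const S_IFDIR: u32 = 0o040000; // file-type bits for a directory

// Hypothetical producer layout: nlink is a u64 at offset 16, mode a u32 at 24.
// Hypothetical consumer layout: mode is a u32 at offset 16, nlink a u32 at 20.

/// What the "C side" writes, using the producer layout.
fn write_producer(mode: u32, nlink: u64) -> [u8; 32] {
    let mut buf = [0u8; 32];
    buf[16..24].copy_from_slice(&nlink.to_le_bytes());
    buf[24..28].copy_from_slice(&mode.to_le_bytes());
    buf
}

/// Read a little-endian u32 at `offset`, as a reader with its own idea of
/// the layout would.
fn read_u32(buf: &[u8; 32], offset: usize) -> u32 {
    u32::from_le_bytes([buf[offset], buf[offset + 1], buf[offset + 2], buf[offset + 3]])
}

/// Equivalent of the C macro S_ISDIR.
fn is_dir(mode: u32) -> bool {
    (mode & S_IFMT) == S_IFDIR
}
```

With `write_producer(0o040755, 3)`, reading the mode at offset 24 reports a directory. Reading it at offset 16 returns the link count instead, which is a perfectly reasonable-looking integer and not a directory.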
F-002: errno numbers pass through unnormalised
CAT01 asks std to create a file at a path that doesn't exist, create one at a permission-denied path, and try chmod with an intentionally invalid mode. The intentionally invalid case is where it gets interesting. The raw errno changes from target to target:
| Target | raw_os_error() | What that value means natively |
|---|---|---|
| Linux | 22 | EINVAL |
| FreeBSD | 41 | EPROTOTYPE |
| OpenBSD 7.4 | 43 | EPROTONOSUPPORT |
| Windows | 87 | ERROR_INVALID_PARAMETER |
| macOS | 22 | EINVAL (coincides with Linux) |
Five targets, four different numbers for the same failure, because cosmo doesn't normalise errno values across OSes. It normalises signal numbers (CAT03 is all green), and it translates paths, and it dispatches syscalls, but when the BSD kernel returns EPROTOTYPE=41, Rust's code sees raw=41, and Rust's ErrorKind match (which only knows Linux errno values) classifies it as Uncategorized.
Rust std and most crates either match on e.raw_os_error() { Some(22) => ..., Some(11) => ... } or match on kind() to make portable decisions. Neither pattern holds on cosmo-non-Linux. The numeric match is the more surprising one, because rather than returning Other it silently takes a different branch from the one the author had in mind. Anything that retries on EAGAIN, special-cases EPIPE, or distinguishes EADDRINUSE from EADDRNOTAVAIL behaves differently from what the Linux-flavoured code assumes.
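A small sketch of the numeric-match trap. The function is hypothetical, and the literals are Linux errno values (EAGAIN = 11, EINVAL = 22, EDEADLK = 35). FreeBSD numbers the same names differently: natively, EAGAIN is 35 and EDEADLK is 11. Whether a given call on cosmo surfaces exactly those values is what the probe measures; the sketch only shows what a hard-coded match does when handed one.

```rust
/// What a Linux-flavoured crate might do with `raw_os_error()`.
fn linux_flavoured_decision(raw: i32) -> &'static str {
    match raw {
        11 => "retry",         // EAGAIN on Linux (EDEADLK on FreeBSD)
        22 => "invalid input", // EINVAL on Linux
        35 => "deadlock",      // EDEADLK on Linux (EAGAIN on FreeBSD)
        _ => "unknown",
    }
}
```

Feed it a FreeBSD-native EAGAIN (35) and it reports a deadlock instead of retrying. Feed it 41 and it falls through to the catch-all. Neither case errors; both just take a branch the author never intended.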
F-003: sockets
CAT05's first line is TcpListener::bind("127.0.0.1:0"). On Linux that returns an Ok(listener) with a random high port, and the rest of CAT05 runs an accept loop. On every non-Linux cosmo target, bind errors immediately:
- FreeBSD: raw=41 (EPROTOTYPE)
- OpenBSD 7.4: raw=43 (EPROTONOSUPPORT)
- Windows: raw=87 (ERROR_INVALID_PARAMETER)
- macOS: raw=22 (EINVAL)
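Same pattern as F-002 in the numeric values, but the failure itself is more fundamental: not "wrong errno number," but "socket creation refused by the kernel." The Rust-side cause is that std and libc are passing Linux-numbered constants baked in at compile time. AF_INET, SOCK_STREAM, and IPPROTO_TCP happen to be numbered identically on all of these kernels, so the likelier culprits are the flags and option names around them, such as SOCK_CLOEXEC, SOL_SOCKET, and SO_REUSEADDR, which are not. On a BSD/Windows kernel, those numbers mean something else, so the kernel rejects the call.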
Networking is the highest-impact finding in the matrix. Any Rust APE that opens a socket, inbound or outbound, won't get far on non-Linux cosmo until F-003 is worked through. Ports of ripgrep (post 4) and dog (post 4) depend on F-003 being sorted; xh (post 5) depends on it doubly.
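For reference, the first step of a CAT05-style check fits in a few lines. This is a sketch under assumptions: the function name and output format are mine, not the probe's.

```rust
use std::net::TcpListener;

/// Hypothetical CAT05-style first step: bind an ephemeral loopback port and
/// report either the port the OS picked or the raw error std received.
fn probe_bind() -> String {
    match TcpListener::bind("127.0.0.1:0") {
        Ok(listener) => match listener.local_addr() {
            Ok(addr) => format!("bind ok port={}", addr.port()),
            Err(e) => format!("bind ok, local_addr err raw={:?}", e.raw_os_error()),
        },
        Err(e) => format!("bind err raw={:?} kind={:?}", e.raw_os_error(), e.kind()),
    }
}
```

Everything downstream in a networked program (accept, connect, read, write) sits behind that one call, which is why a failure here blocks so much.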
F-004: Command::new
CAT07 runs Command::new("echo").arg("hi").status() and Command::new("/does/not/exist").status(). Linux handles both correctly (echo runs, nonexistent returns NotFound). On FreeBSD, OpenBSD, and macOS both calls fail with raw=22 (which is EINVAL on those OSes as well as Linux). Somewhere inside std's posix_spawn / fork+exec path, a Linux-numbered flag or struct layout is getting passed to the BSD/macOS kernel, which rejects it.
Windows has a related but distinct behaviour (F-010): Command::new("echo") returns NotFound raw=2 because echo is a cmd builtin, not a standalone PE binary. That one's not really a finding at all; it's a legitimate case of "cosmo's spawn path can't emulate cmd's builtins and probably shouldn't try." But it combines with F-004 to say: anything that shells out on cosmo doesn't work as-is on anything but Linux, for at least two distinct reasons.
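A minimal sketch of the CAT07 shape, with a hypothetical helper name. It runs a program to completion and hands back either its success flag or the raw error and kind that std reported for the spawn failure.

```rust
use std::io::ErrorKind;
use std::process::Command;

/// Hypothetical CAT07-style helper: `Ok(true)` if the program ran and exited
/// successfully, `Ok(false)` if it ran and failed, `Err` if it never started.
fn spawn_status(program: &str) -> Result<bool, (Option<i32>, ErrorKind)> {
    Command::new(program)
        .status()
        .map(|status| status.success())
        .map_err(|e| (e.raw_os_error(), e.kind()))
}
```

On an ordinary native build, `spawn_status("/does/not/exist")` comes back as an `Err` whose kind is `NotFound`. The finding above is that on cosmo-BSD/macOS the same call reports raw=22 instead.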
F-005 and F-011: clocks
CAT12 takes an Instant::now(), does a thread::sleep(Duration::from_millis(10)), then calls Instant::elapsed(). On Linux that returns ~10ms and exits cleanly. On FreeBSD the probe panics inside thread::sleep before elapsed is even reached:
    thread 'main' panicked at library/std/src/sys/thread/unix.rs:582:
    assertion `left == right` failed
      left: 45
     right: 4
F-005. Std calls something like clock_nanosleep(CLOCK_MONOTONIC, ...) and asserts that the only error it can get back is EINTR (the 4 on the right-hand side). Linux's CLOCK_MONOTONIC is 1. FreeBSD's is 4. Std passes 1, FreeBSD's kernel doesn't recognise it, the call returns an error code that std didn't anticipate, and the assertion trips.
On macOS and OpenBSD 7.4 the failure is subtly different. thread::sleep succeeds, but Instant::elapsed() then panics at std/src/sys/pal/unix/time.rs:107:68 with Err(Os { code: 22, kind: InvalidInput }). Different call site, same family: Linux CLOCK_MONOTONIC=1, macOS CLOCK_MONOTONIC=6, OpenBSD CLOCK_MONOTONIC=3. That one's F-011.
F-005 + F-011 are the same category of divergence (clock IDs baked in at compile time) surfacing in two different places in std depending on which kernel you hit. Any long-running Rust program, anything which measures latency or sleeps between iterations, has a rough time on cosmo-BSD/macOS today.
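The clock-ID mismatch reduces to a four-row lookup. The sketch below uses the CLOCK_MONOTONIC values quoted in this post; the enum and function names are hypothetical and exist only to show that a single compiled-in ID is valid on exactly one of the four kernels.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum HostOs {
    Linux,
    FreeBsd,
    OpenBsd,
    MacOs,
}

/// CLOCK_MONOTONIC as each kernel numbers it (values as quoted above).
fn clock_monotonic_for(os: HostOs) -> i32 {
    match os {
        HostOs::Linux => 1,
        HostOs::FreeBsd => 4,
        HostOs::OpenBsd => 3,
        HostOs::MacOs => 6,
    }
}

/// The ID a Linux-targeted build bakes in at compile time.
const CLOCK_MONOTONIC_COMPILED: i32 = 1;

/// Does the baked-in ID mean "monotonic clock" on this host?
fn compiled_id_is_valid_on(os: HostOs) -> bool {
    clock_monotonic_for(os) == CLOCK_MONOTONIC_COMPILED
}
```

Only `HostOs::Linux` passes the check. Where the mismatch shows up (inside sleep or inside elapsed) depends on which std call site reaches the kernel first.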
F-008, F-009, F-010: the Windows trio
F-008 is F-002 with a Windows flavour: errno 87 (ERROR_INVALID_PARAMETER) for CAT01's invalid chmod. Same pattern, different number. Rust's ErrorKind match has no idea 87 means anything.
F-009 is F-003 with a Windows flavour: TcpListener::bind returns raw=87 instead of a listener. Same root cause, different error value. (Rust std on cosmo-Windows is going through the BSD-socket personality, not Winsock directly, because that's how cosmo's libc layer presents the network stack. The Linux-numbered constants still don't match what the underlying Winsock call expects.)
F-010 is Command::new("echo") as described above. The least interesting of the three because it's arguably working as designed. Noted for completeness and then left alone for the rest of the series.
Everything that works
What doesn't show up in the table is as important as what does.
Signals (CAT03) are normalised by cosmo across every OS in the matrix, to their Linux numeric values. SIGHUP=1, SIGINT=2, SIGTERM=15, SIGKILL=9. On native Windows those aren't even real signals in most senses, but cosmo's abstraction layer hands Rust the Linux numbers anyway. Signal-handler code written for Linux behaves predictably on cosmo-Windows.
Threads (CAT06) work everywhere. Eight pthreads racing on a Mutex<u64> finish with the expected final count on every target. That's a lot of libc to get right (pthread_create, pthread_mutex_lock, and the rest of the pthreads surface), and cosmo does.
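For readers who want to reproduce the thread check, a sketch along these lines is enough. The function name and the per-thread increment count are assumptions; the post only specifies eight threads and a `Mutex<u64>`.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical CAT06-style check: `threads` workers each add 1 to a shared
/// counter `increments` times; the return value is the final count.
fn run_counter(threads: u64, increments: u64) -> u64 {
    let counter = Arc::new(Mutex::new(0u64));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..increments {
                    *counter.lock().unwrap() += 1;
                }
            })
        })
        .collect();
    // Wait for every worker before reading the result.
    for handle in handles {
        handle.join().unwrap();
    }
    let total = *counter.lock().unwrap();
    total
}
```

The check passes when `run_counter(8, n)` equals `8 * n`. A lost update or a broken mutex shows up as a smaller number.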
Environment variables (CAT08) roundtrip, including non-ASCII. CStr and OsStr conversions (CAT10) roundtrip. File times (CAT09) work. fs::metadata on x86_64 works across all the x86_64 targets (it's only aarch64 that breaks, via F-001).
And the most surprising green of all: panic::catch_unwind works on every target, including cosmo-Windows. In the usual world, unwinding is one of the most platform-specific mechanisms in any language runtime (Linux uses DWARF-based .eh_frame tables, Windows uses SEH, macOS has compact unwind). You might assume cosmopolitan ships a clever bridge across all three. It does not. libunwind.a in the cosmocc 4.0.2 distribution is a literal 8-byte empty ar archive:
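[Listing not preserved: inspection of libunwind.a from the cosmocc 4.0.2 distribution, showing the 8-byte empty ar archive]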
_Unwind_* symbols in libcosmo.a are also all undefined references, so the cosmo runtime doesn't provide them either. What does provide them is Rust's own unwind crate, rebuilt under -Z build-std (we listed panic_unwind in the build-std list exactly for this). That crate pulls in llvm-libunwind C sources and statically links the lot into the final APE. nm probe.com.dbg | grep " T _Unwind_" shows the full set of _Unwind_* entries defined inside the binary, alongside llvm-libunwind internals like __libunwind_Registers_x86_64_jumpto.
Exactly how that bundled llvm-libunwind finds its unwind tables at runtime on each host, and in particular on a Windows-loaded APE where the native loader is the PE loader rather than an ELF loader, is something I haven't traced to source yet. It's probably some mix of "the cosmopolitan runtime registers frame info at startup" and "the binary's layout keeps the right tables at the right addresses regardless of which loader took the file," but I don't want to fake confidence I don't have. What I can say is: the probe's catch_unwind empirically recovers a panic payload on Linux x64/aarch64, FreeBSD, OpenBSD, Windows, and macOS alike, from a single binary that ships its own unwinder statically. Tracing the last mile of how the unwind tables get found on each host is a rabbit hole for another day.
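The CAT11 check itself is small. A minimal sketch, with a hypothetical function name and panic message, of forcing a panic inside a closure and recovering its payload:

```rust
use std::panic;

/// Hypothetical CAT11-style check: panic inside `catch_unwind` and return the
/// payload as a string if it was recovered, or `None` if nothing panicked or
/// the payload was not a string literal.
fn probe_catch_unwind() -> Option<String> {
    let result = panic::catch_unwind(|| {
        panic!("probe panic");
    });
    match result {
        Ok(()) => None,
        // A panic with a plain string literal carries a `&'static str` payload.
        Err(payload) => payload.downcast_ref::<&str>().map(|s| s.to_string()),
    }
}
```

The default panic hook still prints the message to stderr. What matters is that control returns to the caller with the payload intact, which only happens if the unwinder can walk the stack on that host.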
The pattern
If you stare at the finding list long enough, something becomes obvious: every single finding is the same kind of problem. Not "missing syscall," not "wrong algorithm," not "race condition." It is always:
The Rust code has a constant in it. That constant was set at compile time from Linux's header values. On a cosmo-linked binary running on another kernel, the actual value that constant should have is different.
The pattern is so clean that the fix suggests itself: don't make those values compile-time constants. Let the cosmo runtime populate them at load time based on which OS actually booted the binary. The Rust ecosystem's deep assumption that libc::* is a bag of pub const i32 is a perfectly sensible one for any normal rustc target, but it's the single biggest thing that needs rethinking when the same binary dispatches to multiple kernels. Turn those into extern "C" { static EINVAL: c_int; } and the whole category of findings disappears.
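Here is a rough sketch of the difference in shape, emulated in plain Rust so it runs anywhere. `OnceLock` stands in for a value the cosmo runtime would supply through an extern static; that substitution, along with every name below, is an assumption for illustration and not the actual patch.

```rust
use std::sync::OnceLock;

#[derive(Clone, Copy, Debug, PartialEq)]
enum HostOs {
    Linux,
    FreeBsd,
}

/// Compile-time flavour: usable as a `match` pattern, frozen at build time.
const EAGAIN_CONST: i32 = 11;

/// Runtime flavour: filled in once, after the host OS is known.
static EAGAIN_RUNTIME: OnceLock<i32> = OnceLock::new();

/// Stand-in for what a loader-time hook would do (native EAGAIN values).
fn init_errno_table(os: HostOs) {
    let value = match os {
        HostOs::Linux => 11,
        HostOs::FreeBsd => 35,
    };
    // Ignore a second initialisation attempt; the first value wins.
    let _ = EAGAIN_RUNTIME.set(value);
}

fn eagain() -> i32 {
    *EAGAIN_RUNTIME.get().expect("init_errno_table was not called")
}

/// A runtime value cannot be a `match` pattern, so callers compare instead.
fn should_retry(raw: i32) -> bool {
    raw == eagain()
}
```

The cost is visible in `should_retry`: code that used to write `match raw { EAGAIN => ... }` has to become a comparison or a match guard, because the value is no longer known to the compiler.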
Which is what the next post is about, and the thesis of the whole series 🦀.
Should you bother with a probe like this?
Yes, without reservation, if you're porting anything non-trivial to a toolchain that compiles against a single libc but ships to multiple kernels. Each category is a single function. Writing it took a few hours; the matrix data came back in an hour. Eleven findings for maybe six hours of work, before any real application code was written. Any other debug methodology that delivers that ratio, I'd like to hear about it.
If you're not porting anything and you're just reading along, the useful mental model is: the Rust standard library's portability model assumes one libc per target triple. Cosmopolitan breaks that assumption by linking against one libc that dispatches to multiple kernels at runtime. Every place std encoded "the libc this binary was built against" as a compile-time constant is a potential portability finding. There are more places than you'd guess, and they all have the same shape.
Next post: the fix 🔧. Why const doesn't quite fit, what extern static looks like in Rust, and how a stack of std patches plus one forked libc crate close most of these findings with one idea applied consistently.