How a Bad Random Number Generator Froze Sway

May 21, 2020 Wayland 6 min

Several months ago, I decided to switch from the i3 window manager, which uses the X display protocol, to Sway, which uses the newer Wayland protocol. I made the switch because display-specific workspaces were quite buggy for me in i3 and because I wanted to try something new. At first it went well, really well in fact: workspaces worked perfectly and the tearing I used to have was gone.

All was going well until the last week of April, when I updated Sway and rebooted. Sway froze instantly on startup and kept hold of keyboard input. After some investigation, I found that Waybar seemed to be the problem. Removing it from the config let Sway start up and work normally... or so I thought. At random, Sway would freeze completely, the same way it did on startup, so it was time to debug.

Investigation

Getting a Sway debug log was trivial, by running Sway like so: sway -d 2> ~/sway.log, but it didn't yield much [1]. The ideal solution would be to find out where Sway was hanging with a core dump, but this proved difficult because Sway was frozen and, with it, keyboard input. Thus, resorting to the SysRq shortcuts was necessary. These shortcuts are implemented in the kernel to perform basic, yet important, actions in situations like freezes. Below are some of the most common ones [2]:

Shortcut      Name       Description
Alt+SysRq+r   Unraw      Take control of keyboard back from the display server.
Alt+SysRq+e   Terminate  Send SIGTERM to all processes, allowing them to terminate gracefully.
Alt+SysRq+i   Kill       Send SIGKILL to all processes, forcing them to terminate immediately.
Alt+SysRq+s   Sync       Flush data to disk.
Alt+SysRq+u   Unmount    Unmount and remount all filesystems read-only.
Alt+SysRq+b   Reboot     Reboot.

 
Everything beyond here was aided by the generous help of Xyene. Thank you, if you ever read this.

So, to get a core dump of the sway process, I first used the unraw shortcut, followed by switching to a tty and running the following commands:

gcore <sway pid>
gdb /usr/bin/sway <core file>
bt full

This gave the following backtrace [3], with the important piece being the first line:

0  0x00007fcbb1012c6f in json_c_get_random_seed () at /usr/lib/libjson-c.so.5
1  0x00007fcbb1011fd6 in  () at /usr/lib/libjson-c.so.5
2  0x00007fcbb100c713 in json_object_object_add_ex () at /usr/lib/libjson-c.so.5
3  0x0000561dc53e42ff in ipc_json_describe_bar_config (bar=bar@entry=0x561dc6f0cbb0) at ../sway/sway/ipc-json.c:1013
        __PRETTY_FUNCTION__ = "ipc_json_describe_bar_config"
        json = 0x561dc74da8b0
        gaps = <optimized out>
        colors = <optimized out>
        tray_bindings = <optimized out>
        tray_bind = <optimized out>
# Truncated...

 
What this revealed was that Sway was actually freezing inside json-c, a JSON library that Sway uses. Looking at the source code of json-c, it calls the json_c_get_random_seed function in a loop for as long as the result is -1. If the function only ever returns -1, that loop never ends. So we had found where json-c was freezing, but the question of why remained.

Delving into json_c_get_random_seed, another function called get_rdrand_seed is run to try to get a random number using the RDRAND CPU instruction. This seems fine, except when you take into account that my CPU is an AMD Ryzen 5 3600X... which sometimes has a horribly malfunctioning RDRAND instruction that always returns 0xFFFFFFFFFFFFFFFF (all bits set, which is -1 as a signed integer). This isn't normally an issue because very few programs use RDRAND without checking whether it failed, and many rely on /dev/urandom instead. Both the Linux kernel and systemd check that RDRAND returns a sane random number.

Conclusion

So, to finally fix this glorious bug, Xyene introduced a check into the has_rdrand function (which checks whether to use the RDRAND instruction later on) that disables RDRAND if it returns the same value 10 times in a row. The important section can be seen below:

// Some CPUs advertise RDRAND in CPUID, but return 0xFFFFFFFF
// unconditionally. To avoid locking up later, test RDRAND here. If over
// 10 trials RDRAND has returned the same value, declare it broken.
_has_rdrand = 0;
int prev = get_rdrand_seed();
for (int i = 0; i < 10; i++) {
    int temp = get_rdrand_seed();
    if (temp != prev) {
        _has_rdrand = 1;
        break;
    }

    prev = temp;
}

This leaves the chance of wrongly disabling a correctly functioning RDRAND instruction at a whopping 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000004% (4.68e-95%).

A follow-up patch was also required because of some inline assembly that tried to read the CPUID bit; it can be seen here.


The associated Sway issue for this post is swaywm/sway#5290, along with json-c/json-c#489 and json-c/json-c#590 for the json-c issues.


  1. Debug log available here.

  2. These are the shortcuts needed for a safe reboot, taken from the Arch wiki.

  3. This also required a build of Sway that didn't strip symbols, which I got by using the sway-git package from the AUR. Core dump available here.