Please Restore Our Registers When You’re Done With Them
2022-11-22 14:22:56 Author: randomascii.wordpress.com(查看原文) 阅读量:4 收藏

“Hey, you. Yes you, that function over there. When you’re cleaning up please remember to restore all of my registers. Yes, that one too – what do you think this is, Linux?”

That’s the problem I was dealing with in a nutshell. Functions are required by a platform’s ABI (Application Binary Interface) to preserve certain registers – restoring them if they were used – but the set of registers that must be restored varies between platforms, and the rules on Linux are different from those on Windows. That may be why I encountered register corruption in Chrome on Windows. But let’s take a step back.

I wasA crashed phone showing a Windows BSOD asked to look at a crash bug in Chrome. The crash was strongly correlated with the injection of third-party DLLs into Chrome’s processes (something that we cannot and do not support) so it was highly likely that the crash was caused by those third-party DLLs, but I still wanted to understand what was going on.

Some of my coworkers had investigated the bug previously and added some extra tests, so the crash had been isolated down to this pseudo-code (actual code is here):

while (StillRunning()) {
  DoLotsOfStuff();
  ImportantFunction(std::move(m_ptr));
  CHECK(!m_ptr);
}

The code above represents a long running loop that does a bunch of stuff. At the end of each iteration it calls a function and moves a smart pointer to the function parameters, which should zero out the source pointer. The crash happened when the CHECK statement noticed that the source pointer was, in fact, not zeroed – we intentionally crash then to avoid memory corruption.

The thing that made me want to investigate was the peculiarity of the behavior of the last two lines. How could we zero a pointer on one line of code, and then find it non-zero on the next line? Even if third-party DLLs are the culprit, how could they do that? I’d checked for any modification of the code bytes near the crash and found nothing, so, how? I wanted to understand.

The truth is in the crash dump

As usual I downloaded one of the crash dumps and looked at the assembly-language instructions which implemented the C++ source code. Here, in a mixture of pseudo-code and assembly language, is what I found the code had been translated to (see comment#61 in the bug for details):

xorps xmm7, xmm7   ; Zero register xmm7
while (StillRunning()) {
  DoLotsOfStuff();
  mov rax, QWORD PTR[rsp + 50h]
  movaps QWORD PTR[rsp + 50h], xmm7   ; zero m_ptr
  call ImportantFunction
  CrashIfNonZero(rsp+50h);
}

Apologies for the mixing of metaphors, but the key point is that the compiler decided to zero the XMM7 register (one of the SSE registers) before the loop started. Then, at the end of each loop iteration it would use XMM7 to zero out m_ptr (stored at the address rsp+50h). The compiler expected XMM7 to remain zero. It did not.

I looked at a large set of crash dumps to see if there was any pattern to the values in XMM7. Here are four of the values I found:

  • 96 12 54 91 ca c8 18 ef 98 e8 77 c9 6e 5d ce ee
  • c5 1e 15 13 00 a0 94 5b 37 a5 f3 55 a8 7e 8d 7d
  • 54 39 1f 15 3e bf 13 3e 58 98 fd 6d 64 a3 5a 27
  • 04 df 90 27 02 94 4c ed 73 65 1d 61 af da 33 36

If there is a pattern in these numbers then I sure can’t see it. This randomness is another clue that constrains what the problem could be caused by.

ABIs matter

Mount St. Helens crater - snow near fireThe functions DoLotsOfStuff and ImportantFunction and all of the functions that they call are required by the Windows ABI to preserve XMM7 (this is not the case on Linux). If they use it then they have to restore it. But one of them was not (or the stack location where they preserved the register got trashed, but that seems less likely). In most of the crashes third-party DLLs were present in the Chrome process. These DLLs are presumed to be hooking Chrome or OS functions, and their injected code was presumed to be corrupting XMM7.

I tweeted about this, trying to get theories for how this could happen. Amidst the various replies talking about ISRs, DPCs, and drivers I saw a reply from someone I’d never talked to before saying, basically, “What about this Chromium code?”

I saw this tweet on my home laptop and when I went to look on my work machine the author had already deleted it. My curiosity piqued I reached out to them in a DM. They said that they had thought the code looked suspicious but then realized that the problem was understood by the developers and that the code wasn’t actually compiled in to Chrome on Windows. Such is the challenge of looking for misuse of XMM7 in Chromium’s source code – there are too many references (over 17,000), and most of them are irrelevant.

They then mentioned that they were switching to binary analysis using IDA Pro and had found a couple of functions that were shipping in chrome.dll but weren’t restoring XMM7 – they sent source-code links to what looked like real bugs. This is a time when binary analysis is actually easier than “use the source,” because the machine code has had all macros and #ifdefs processed and represents what actually ships.

I decided to replicate their work using “dumpbin /disasm” and some simple Python code to scan the output. For every function in Chrome (found by looking for global symbols in the disassembly output) my script checked to see whether XMM7 was used without being saved. Initially I checked to see if it was written relative to rsp prior to its first use, but I found it being written relative to rax and rbp as well, so I eventually loosened my heuristics. My script still gives false positives, and may also give false negatives, but it works well enough to be useful.

Despite my initial assumption that this was a third-party bug, my simple script found several suspicious functions. There were roughly three categories of functions where the first use of XMM7 was not restoring it:

  1. Functions such as dav1d_iflipadst_16x8_internal_16bpc_sse4 (from here?) that are internal-use functions by the dav1d library. These functions are all called by wrappers that preserve and restore XMM7, therefore they are fine.
  2. Functions such as __longjmp_internal which by design restores all of the non-volatile registers so that it can return to a previous execution state.
  3. Buggy code that is built in to Chromium

Mauna Loa crater, snow in Hawaii. Okay, I really like volcanoes!Using this crude binary-analysis technique I ultimately found the same two buggy functions in chrome.dll that my twitter accomplice had found.

The function ScaleRowUp2_Bilinear_12_SSSE3 in WebRTC was putting the constant 0x0008000800080008 into XMM7 without first preserving. This is a bug, and could cause crashes, but I knew that this wasn’t the cause of this crash because the XMM7 values I was seeing were highly random. I reported the issue to the author and they filed a bug and fixed it within 24 hours.

DyadicBilinearQuarterDownsampler_sse in openh264 was also using XMM7 without preserving it. Video codecs often deal with values with high entropy so maybe this could have been producing the random values I was seeing (spoiler alert: it wasn’t) and it was certainly wrong. I filed a bug for this and then decided to fix it. Landing this fix came with a couple of challenges:

  1. The bug was in an assembly language file that used multiple macros to ensure cross-platform correctness. Thus I had to figure out (by examining nearby functions) the correct incantations to preserve the registers when needed. It wasn’t too bad, but it’s always weird when I’m writing code in a language that I effectively don’t know at all. Pattern recognition FTW. Anyway, the two line fix worked.
  2. Fixing the bug in openh264 does not immediately help Chrome because Chromium uses a pinned copy of third-party libraries, so I needed to “roll” in the latest version of openh264. Sometimes there is an auto-roller that does this regularly, but openh264 lacked this. It had been six months since the last roll of openh264, and in the interim somebody had moved the public header files to a new directory. Since Chromium and another third-party project (WebRTC) were both including headers from this renamed directory there was an eight step process (one, two, three, four, five, six, seven, eight) to bring in the new version without breaking anything in WebRTC or Chromium. The basic technique was to do conditional includes in WebRTC, plus waiting for auto-rolls of WebRTC into Chromium and vice-versa.

The WebRTC and openh264 issues were genuine bugs, and fixing them will probably prevent future crashes in Chromium, but it made no difference to the issue that I was investigating. The crashes continued. Third-party software continued to be the most likely explanation.

Roadside chapel in FranceThere had been multiple hints as to what type of third-party software could be the problem. It has to be something that creates highly randomized data. There was an apparent correlation with third-party disk encryption software. One customer I was investigating the crashes with was using a particular third-party disk encryption product, and Microsoft had noticed a correlation with tasks that do filesystem work. Attempts were made to contact the vendor.

Update: the vendor (McAfee/Trellix) was contacted and they have released a fix to their Drive Encryption product.

I am glad that the root cause has been fixed, but also, I would like it if developers working on product that use assembly language could audit their code to make sure they are honoring the Windows ABI. This isn’t the first instance of this class of bug, and it won’t be the last.

My motivation

I decided to write this up because I thought that the journey was interesting, but also because the journey is not over. There could be other registers that are not being properly saved and restored in Chromium. There could be other projects that are making this mistake, in some cases perhaps not aware of the differences between the Linux and Windows ABI. Any software rules that are not tested and enforced are inevitably broken and I am not aware of any consistent testing to detect ABI violations. More bugs of this type seem inevitable.

Summary

These crashes started happening around the M91 versions of Chrome. This at first looked like a Chrome bug, but it now seems more likely that it was because the compiler or the Chromium code changed to make it vulnerable to XMM7 register corruption that was already happening in the ecosystem. That is, prior to M91 Chrome wasn’t using XMM7 at all in the RunWorker function (I checked), and starting with M91 the code-gen changed (compiler change?) and the function started relying on XMM7 staying zeroed for hours at a time. So please restore our registers when you’re done with them.

And thanks again to Dougall on twitter for poking at this problem and inspiring me to dig deeper.

Note that there is now a follow-up blog post, explaining how we made the crashes stop even while the register corruption continues.

Twitter announcement

Hacker news discussion


文章来源: https://randomascii.wordpress.com/2022/11/21/please-restore-our-registers-when-youre-done-with-them/
如有侵权请联系:admin#unsafe.sh