Determinism Bugs, Part Two, Kernel32.dll
2022-1-7 02:56:4 Author: randomascii.wordpress.com(查看原文) 阅读量:5 收藏

It was literally the day after I cracked the __FILE__ determinism bug that I hit a completely different build determinism issue. I was asked to investigate why the Chrome build number reported for Chrome crashes on Windows 11 was lagging behind what was reported by winver. For example, Chrome crashes on 10.0.22000.376 were being reported as happening on 10.0.22000.318. After some code spelunking I found that crashpad retrieves the Windows version number from kernel32.dll, so I focused on that.

Aside: crashpad grabs the Windows version number from kernel32.dll instead of using GetVersionExW (which is deprecated, BTW) because the GetVersion* functions will frequently lie about the Windows version for compatibility reasons. For crash reporting we really want the actual-no-lies-we-can-handle-the-truth version number, and kernel32.dll used to be the best way to get this.

That’s when things got weird.

I used chrome://crash/ to trigger a Chrome crash and then loaded the crash into windbg, and looked at the version information for kernel32.dll with the command “lm v m kernel32”:

Windbg version mismatch annotated_thumb[2]

Can you see the problem? kernel32.dll appears to be reporting that it is version .318 and .376. No wonder our crash reporting system is confused!

Then things got weirder. For some reason I looked at the crash dump on a different machine and the results were different. Now kernel32.dll was being reported as version .318 and .347. How can the same crash dump be reporting different version imageinformation? I was starting to feel a bit unhinged, and was starting to think I should resurrect my original plan of going to circus school.

But before pulling out my tight wire, juggling pins, and unicycles I decided to investigate a bit more closely. I attached windbg to a Chrome process on my Windows 11 machine and ran “lm v m kernel32” again. Now it said that it’s version number was consistently .318. Somehow it felt better to know that the first version number was always .318, but the second one depended on the phase of the moon.

At this point it’s important to understand how minidumps and symbol servers work.

A minidump records the minimum information needed in order to diagnose a crash. This includes the contents of the stacks from all threads, a few hundred bytes of memory from wherever registers are pointing, information about all loaded and unloaded modules, and a few other snippets. In all cases the idea is to record as little information as possible while still being able to accurately reproduce as much process state as possible. Some memory (most heap memory and global variables) are not recorded, but it is okay to have some information omitted. It is not, however, okay to have some information which is incorrect.

The minidump only records minimal information about the loaded modules, but a crash-dump analyst wants to be able to load symbols, disassemble all functions, etc., and that is where symbol servers come in. The minidump records enough information about loaded modules (a few hundred bytes) to contain the crucial identifiers which allow the debugger to download the full DLL or EXE and the PDB from the symbol servers where Microsoft and Chrome publish their DLLs, EXEs, and PDBs.

So, windbg loads a minidump, looks at the timestamp, image size, and image name information, and uses that to download the full DLL or EXE files.

So…

Apparently the memory saved in the minidump contains the first version number displayed by windbg for kernel32.dll, so it is consistent. But the second version number comes from the copy of kernel32.dll downloaded from the symbol server, and that was inconsistent.

I then used sysinternal’s sigcheck to look at the kernel32.dll DLLs in the local symbol imageserver cache on my two development machines. It confirmed that they had versions .347 and .376. It was weird that the symbol server was returning a mismatched copy of kernel32.dll, but even weirder that it had returned two mismatched copies. A quick check of the file dates explained that. The .347 version had been retrieved on December 7th, and the .376 version had been retrieved on December 22nd. And suddenly it all made sense.

Microsoft built Windows 11 version .318. It shipped a new kernel32.dll and pushed it to its symbol servers. Then Microsoft built Windows 11 version . 347. It didn’t ship the new kernel32.dll but it pushed it to its symbol servers, overwriting the previous version. Then Microsoft built Windows 11 version .376. Once again it didn’t ship the new kernel32.dll, but it pushed it to its symbol servers.

All three versions of kernel32.dll had the same timestamp, image size, and image name, so they all occupied the same slot in the symbol server, and overwrote each other. The version was in the local symbol server cache depended on when you first retrieved that “version” (timestamp, image size, image name triplet) from the symbol server.

At this point it all made sense except for – why? Why is Microsoft building different versions of kernel32.dll that have the same symbol server identifier?

From a technical point of view it is fairly obvious what is happening. Microsoft has deterministic builds. In order for builds to be deterministic the timestamp can no longer be the actual build time. Instead the timestamp is based on a hash of the code segment and probably some other data, but crucially the timestamp hash does not include the version number. So, if only the version number changes the timestamp stays the same and the version number in the file you retrieve from the symbol server may not match what was on the user’s machine.

That’s pretty annoying, actually. It’s especially annoying if you’re investigating a version-number bug like I was, but even when you’re not doing that it is confusing, and seems to violate the fundamental guarantees of symbol servers.

So, the technical explanation is simple enough, but the design question is perplexing. It seems to me that a fundamental tenet of symbol servers is that the symbol server identifier uniquely identifies a particular file. Once you break that assumption all bets are off. I am saddened that Microsoft has decided to do this, and I hope that they fix this bug. The whole concepts of minidumps and symbol servers is on shaky ground until this is addressed.

These symbol-server overwrites have been going on since at least July 2020 when Michael Maltsev first blogged about them. Surprisingly enough, at least some of the experts in this area at Microsoft appear to not have been aware, and apparently I wasn’t paying close enough attention to notice until now.

You can find the three overlapping versions of kernel32.dll that I have found in this Google Drive folder. The folder also contains the minidump that started this investigation. I filed an issue for this problem on github.

What about Chromium?

I mentioned in the previous blog post that Chrome also has deterministic builds, so we have also had to deal with this problem. The solution that we settled on is to use the timestamp from the last commit as the build timestamp. This isn’t perfect – it means that some binaries that would otherwise be identical are instead slightly different – but I strongly believe that it’s better than the alternative of claiming that files are the same when they aren’t. The commit timestamp solution also means that the timestamp still has meaning as a date, and it means that hash collisions are a complete non-issue.

The initial crashpad bug, wherein the wrong OS version was being reported in crashes, was fixed after some discussion with Microsoft (probably on twitter) to read the current OS version from the registry. At some point the same issue should be fixed in Chromium itself.

About brucedawson

I'm a programmer, working for Google, focusing on optimization and reliability. Nothing's more fun than making code run 10x as fast. Unless it's eliminating large numbers of bugs. I also unicycle. And play (ice) hockey. And sled hockey. And juggle. And worry about whether this blog should have been called randomutf-8. 2010s in review tells more: https://twitter.com/BruceDawson0xB/status/1212101533015298048


文章来源: https://randomascii.wordpress.com/2022/01/06/determinism-bugs-part-two/
如有侵权请联系:admin#unsafe.sh