I tend to launch most programs on my Windows 10 laptop by typing the <Win> key, then a few letters of the program name, and then hitting enter. On my powerful laptop (SSD and 32 GB of RAM) this process usually takes as long as it takes me to type these characters, just a fraction of a second.
Usually.
Sometimes, however, it takes longer. A lot longer. As in, tens of seconds. The slowdowns are unpredictable but recently I was able to record an Event Tracing for Windows (ETW) trace of one of these delays. With a bit of help from people on twitter I was able to analyze the trace and understand why it took about a minute to launch notepad.
Before I get in to the analysis I have two warnings/disclaimers: 1) I have a good understanding of the problem, but I do not have a solution and 2) if you are seeing identical symptoms that doesn’t mean that your root cause is the same as mine, but I will give some hints on how to see if it is.
My analysis of the trace (trace is here, installer for the analysis tools is here, analysis tutorials are here, feel free to follow along) started with my looking at the input events and Window in Focus graph in Windows Performance Analyzer (WPA), both shown below (zoomed in a bit for maximum detail):
The first diamond in the Multi-Input row shows when I pressed the Windows key, with subsequent key presses (including pressing enter) clumped together shortly afterwards. The units on the x-axis are seconds so we can see that all of the typing took about 3/4 of a second.
The input events are injected into the trace by my UIforETW trace-recording tool. These input events are one of the reasons I prefer UIforETW over Microsoft’s trace-recording tools. I’ve hidden the table-view but it lists which keys were pressed, while anonymizing letters and numbers to prevent UIforETW from being a key logger. Input events can be a critical tool in helping know where/when to look in traces to understand what is happening.
The input events help establish context, but the Window in Focus events really tell the tale. We can see that the SearchApp (start menu?) gains focus as soon as I press the <Win> key but then nothing else happens for more than thirteen seconds. That’s the problem, visualized.
The next step is to see what is causing the delay. A quick glance at the CPU Usage (Precise) and Disk Usage graphs showed that the CPU and disk were almost 100% idle, so the start menu must be waiting on something else:
When a process is idle for a while when you wish that it was doing work then the challenge is to figure out what it was waiting on. I looked at the context-switch events in CPU Usage (Precise). Some of the SearchApp threads were named (yay!) but not all of them were and I couldn’t find the main thread to see what it was waiting on, so I had to poke around and hope something became obvious. I zoomed in on the burst of CPU activity just before notepad launched and I noticed that WerFault.exe and wermgr.exe both started getting busy. Correlation is not causation, but it sure is suspicious.
Note that WER stands for Windows Error Reporting – the system that sends crash dumps back to Microsoft for analysis so that software reliability can be improved
Looking at the Processes table showed me that the command line for WerFault.exe was “C:\WINDOWS\system32\WerFault.exe -u -p 17804 -s 2124”. That suggests that Windows Error Reporting was being asked to record information for crashed process 17804, and when I looked in the Processes table for that Process ID (PID) I found “RuntimeBroker.exe <Microsoft.Windows.Search> (17804)”. Well now. Doesn’t that name look relevant?
A look at all of the “Transient” processes (those that started or ended during the trace timeline) was quite revealing:
WerFault.exe and RuntimeBroker.exe (17804) (the top of the two RuntimeBroker.exe processes) were both running when I started recording the trace and both ended at about the same time, and WerFault.exe was handling a crash in RuntimeBroker.exe. Notice also that a new copy of RuntimeBroker.exe starts running when the old copy goes away. Now we’re starting to have an explanation:
Now we have a new question: why is WerFault.exe sitting idle for so long?
I looked at the CPU Usage (Precise) data and saw that WerFault.exe has at least thirteen threads, none of them named (come on Microsoft – thread names are really helpful!) but the main thread was easily identifiable as the one using the most CPU time. I then sorted by Time Since Last and noticed that at one point the main thread had been waiting to run for 15.572 s. In fact it was probably waiting even longer, but the start of its wait was before the start of the trace and therefore unknowable. You can find more details on how to do idle-analysis here.
The stack where the main WerFault.exe thread was waiting for 15.572 s is shown below:
The summary would be that it was waiting in UploadReport.
So now we understand the problem. RuntimeBroker.exe crashed (due to heap corruption, according to the call stack in the RuntimeBroker.exe crash dump, shown to the right) and it took more than 15 seconds to upload the crash dump, presumably due to my flaky hotel WiFi. During this time my start menu was inoperable.
This deserves reiterating. My start menu was hung due to the combination of heap corruption and WerFault.exe deciding that it needed to upload the crash dump before releasing the old process so that a new one could be started.
It took two bugs (the heap corruption and the upload-before-restarting) to make this hang happen, but happen it did.
We can even go deeper. The UploadReport function was blocked for 15.567 s and the Readying Process/Readying Thread Id shows us who ultimately unblocked the function. That turned out to be another WerFault.exe thread which was blocked in some CHttpRequest functions, as show above. That doesn’t add significantly to the understanding of the problem, but does demonstrate nicely how you can trace a hang backwards through multiple processes and threads.
In general if you want to understand why your computer is performing badly you need to record and analyze a trace. However if you want to see if you are hitting this particular problem then there are easier steps that you can follow.
The first step is to configure the local recording of crash dumps. This is a good idea in general because it lets you monitor the stability of your computer over your time.
Then, with crash dumps being recorded if you see a Start Menu hang you can just look in %localappdata%\crashdumps and see if there is a recent RuntimeBroker.exe crash. If so then you are presumably seeing this bug.
Raymond Chen gives several reasons for why Windows Error Reporting doesn’t restart crashed processes before uploading the report (circa 2012) but I don’t find those reasons entirely compelling, especially in the start-menu case. As long as you kill the old process before starting the new one – and answer the DLL-version questions from a crash dump – most of the problems he points out are avoidable. Exponential backoff on process restarts can address the rest. And, the consequences of waiting can, as we have seen, be arbitrarily long start-menu hangs, with no indication that a crash is the problem. There is also some confusion about the behavior – maybe the design has changed in the last ten years.
Heap corruption bugs can be extremely difficult to find and fix, but this one seems like it might be easy. I turned on pageheap on RuntimeBroker.exe, killed the relevant version to get it to restart and apply the pageheap settings and it started crashing every time I opened the start menu. I configured WER to save full crash dumps and soon had a half-dozen crash dumps with full details of what was happening.
The crashes normally happen on this call stack:
With pageheap enabled the crash happens on a very similar call stack, but slightly earlier. The crash happens earlier (and more reliably) because with page heap when you free memory it is unmapped, so dereferencing it reliably causes a crash, instead of reading from the freed memory:
The crash happens when dereferencing [rcx] so I ran its value through the the !heap command (see my pageheap blogpost for details) and got this call stack:
The only complication is that this doesn’t happen on all Windows 10 machines. There seems to be some required state that makes it happen or not and I don’t know what is. I will say that I’m happy to share the crash dumps with anyone at Microsoft who wants to investigate.
I don’t know the code and don’t understand what is happening but I’ve dealt with enough use-after-free bugs to say that this is probably straightforward to investigate and fix using the crash dumps. Although, I got a couple of different crash call stacks so there might be multiple bugs.
I ended my initial twitter thread by saying that these hangs were making me cranky and had me wondering if Windows 10 was abandonware. Since then I’ve been told that people seem to be seeing start-menu hangs on Windows 11, but that is not necessarily the same problem.
To be clear, Microsoft has the technology to record traces on start menu hangs on customer machines. These traces would show roughly the same thing as my trace. They also receive crash dumps from customer machines. They might even have a way of correlating them (if not then they should hook that up). And they created pageheap which makes use-after-free crashes easy to investigate.
So, why hasn’t this been addressed? On my laptop I see that RuntimeBroker.exe has crashed, on average, every second day this year. That is too many start-menu hangs for my tastes. I don’t know how long it has been happening so maybe a fix is on the way – if so that would be great to hear. If not then I will continue to be cranky and I hope that this serves as a good reminder of the importance of using all that fancy telemetry to address issues like this.
Or, maybe I’m just unlucky and I’m one of the few people (not the only person) who is hitting this crash.
In short, I am really pleased with the tools that Microsoft has created and released to let me analyze performance issues such as these. However I wish that I didn’t have to use them so often on Windows itself.