Обсуждение: Windows crash / abort handling

Поиск
Список
Период
Сортировка

Windows crash / abort handling

От
Andres Freund
Дата:
Hi,

As threatened in [1]... For CI, originally in the AIO project but now more
generally, I wanted to get windows backtraces as part of CI. I also was
confused why visual studio's "just in time debugging" (i.e. a window popping
up offering to debug a process when it crashes) didn't work with postgres.

My first attempt was to try to use the existing crashdump stuff in
pgwin32_install_crashdump_handler(). That's not really quite what I want,
because it only handles postmaster rather than any binary, but I thought it'd
be a good start. But outside of toy situations it didn't work for me.

A bunch of debugging later I figured out that the reason neither the
SetUnhandledExceptionFilter() nor JIT debugging works is that the
SEM_NOGPFAULTERRORBOX in the
  SetErrorMode(SEM_FAILCRITICALERRORS | SEM_NOGPFAULTERRORBOX);
we do in startup_hacks() prevents the paths dealing with crashes from being
reached.

The SEM_NOGPFAULTERRORBOX hails from:

commit 27bff7502f04ee01237ed3f5a997748ae43d3a81
Author: Bruce Momjian <bruce@momjian.us>
Date:   2006-06-12 16:17:20 +0000

    Prevent Win32 from displaying a popup box on backend crash.  Instead let
    the postmaster deal with it.

    Magnus Hagander


I actually see error popups despite SEM_NOGPFAULTERRORBOX, at least for paths
reaching abort() (and thus our assertions).

The reason for abort() error boxes not being suppressed appears to be that in
debug mode a separate facility is reponsible for that: [2], [3]

"The default behavior is to print the message. _CALL_REPORTFAULT, if set,
specifies that a Watson crash dump is generated and reported when abort is
called. By default, crash dump reporting is enabled in non-DEBUG builds."

We apparently need _set_abort_behavior(_CALL_REPORTFAULT) to have abort()
behave the same between debug and release builds. [4]


To prevent the error popups we appear to at least need to call
_CrtSetReportMode(). The docs say:

  If you do not call _CrtSetReportMode to define the output destination of
  messages, then the following defaults are in effect:

      Assertion failures and errors are directed to a debug message window.

We can configure it so that that stuff goes to stderr, by calling
    _CrtSetReportMode(_CRT_ASSERT, _CRTDBG_MODE_FILE | _CRTDBG_MODE_DEBUG);
    _CrtSetReportFile(_CRT_ASSERT, _CRTDBG_FILE_STDERR);
(and the same for _CRT_ERROR and perhaps _CRT_WARNING)
which removes the default _CRTDBG_MODE_WNDW.

It's possible that we'd need to do more than this, but this was sufficient to
get crash reports for segfaults and abort() in both assert and release builds,
without seeing an error popup.


To actually get the crash reports I ended up doing the following on the OS
level [5]:

    Set-ItemProperty -Path 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\AeDebug' -Name 'Debugger' -Value
'\"C:\WindowsKits\10\Debuggers\x64\cdb.exe\" -p %ld -e %ld -g -kqm -c \".lines -e; .symfix+ ;.logappend
c:\cirrus\crashlog.txt; !peb; ~*kP ; .logclose ; q \"' ; `
 
    New-ItemProperty -Path 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\AeDebug' -Name 'Auto' -Value 1
-PropertyTypeDWord ; `
 
    Get-ItemProperty -Path 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\AeDebug' -Name Debugger; `

This requires 'cdb' to be present, which is included in the Windows 10 SDK (or
other OS versions, it doesn't appear to have changed much). Whenever there's
an unhandled crash, cdb.exe is invoked with the parameters above, which
appends the crash report to crashlog.txt.

Alternatively we can generate "minidumps" [6], but that doesn't appear to be more
helpful for CI purposes at least - all we'd do is to create a backtrace using
the same tool. But it might be helpful for local development, to e.g. analyze
crashes in more detail.

The above ends up dumping all crashes into a single file, but that can
probably be improved. But cdb is so gnarly that I wanted to stop looking once
I got this far...


Andrew, I wonder if something like this could make sense for windows BF animals?


Greetings,

Andres Freund

[1] https://postgr.es/m/20211001222752.wrz7erzh4cajvgp6%40alap3.anarazel.de
[2] https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/crtsetreportmode?view=msvc-160
[3] https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/set-abort-behavior?view=msvc-160
[4] If anybody can explain to me what the two different parameters to
    _set_abort_behavior() do, I'd be all ears
[5] https://docs.microsoft.com/en-us/windows/win32/debug/configuring-automatic-debugging
[6] https://docs.microsoft.com/en-us/windows/win32/wer/wer-settings



Re: Windows crash / abort handling

От
Craig Ringer
Дата:


On Wed, 6 Oct 2021, 03:30 Andres Freund, <andres@anarazel.de> wrote:
Hi,


My first attempt was to try to use the existing crashdump stuff in
pgwin32_install_crashdump_handler(). That's not really quite what I want,
because it only handles postmaster rather than any binary, but I thought it'd
be a good start. But outside of toy situations it didn't work for me.

Odd. It usually has for me, and definitely not limited to the postmaster. But it will fall down for OOM, smashed stack, and other states where in-process self-debugging is likely to fail.

A bunch of debugging later I figured out that the reason neither the
SetUnhandledExceptionFilter() nor JIT debugging works is that the
SEM_NOGPFAULTERRORBOX in the
  SetErrorMode(SEM_FAILCRITICALERRORS | SEM_NOGPFAULTERRORBOX);
we do in startup_hacks() prevents the paths dealing with crashes from being
reached.

Right.

I patch this out when working on windows because it's a real pain.

I keep meaning to propose that we remove this functionality entirely. It's obsolete. It was introduced back in the days where DrWatson.exe "windows error reporting") used to launch an interactive prompt asking the user what to do when a process crashed. This would block the crashed process from exiting, making everything grind to a halt until the user interacted with the 
UI. Even for a service process.

Not fun on a headless or remote server.

These days Windows handles all this a lot more sensibly, and blocking crash reporting is quite obsolete and unhelpful.

I'd like to just remove it.

If we can't do that I'd like to at least make it optional.

Alternatively we can generate "minidumps" [6], but that doesn't appear to be more
helpful for CI purposes at least - all we'd do is to create a backtrace using
the same tool. But it might be helpful for local development, to e.g. analyze
crashes in more detail.

They're immensely helpful when a bt isn't enough, but BTs are certainly the first step for CI use. 

Re: Windows crash / abort handling

От
Andres Freund
Дата:
Hi,

On 2021-10-06 14:11:51 +0800, Craig Ringer wrote:
> On Wed, 6 Oct 2021, 03:30 Andres Freund, <andres@anarazel.de> wrote:
>
> > Hi,
> >
> >
> > My first attempt was to try to use the existing crashdump stuff in
> > pgwin32_install_crashdump_handler(). That's not really quite what I want,
> > because it only handles postmaster rather than any binary, but I thought
> > it'd
> > be a good start. But outside of toy situations it didn't work for me.
> >
>
> Odd. It usually has for me, and definitely not limited to the postmaster.
> But it will fall down for OOM, smashed stack, and other states where
> in-process self-debugging is likely to fail.

I think it's a question of running debug vs optimized builds. At least in
debug builds it doesn't appear to work because the debug c runtime abort
preempts it.


> I patch this out when working on windows because it's a real pain.
>
> I keep meaning to propose that we remove this functionality entirely. It's
> obsolete. It was introduced back in the days where DrWatson.exe "windows
> error reporting") used to launch an interactive prompt asking the user what
> to do when a process crashed. This would block the crashed process from
> exiting, making everything grind to a halt until the user interacted with
> the
> UI. Even for a service process.

> Not fun on a headless or remote server.

Yea, the way we do it right now is definitely not helpful. Especially because
it doesn't actually prevent the "hang" issue - the CRT boxes at least cause
precisely such stalls.

We've had a few CI hangs due to such errors.


> These days Windows handles all this a lot more sensibly, and blocking crash
> reporting is quite obsolete and unhelpful.

From what I've seen it didn't actually get particularly sensible, just
different and more complicated.

From what I've seen one needs at least:
- _set_abort_behavior(_CALL_REPORTFAULT | _WRITE_ABORT_MSG)
- _set_error_mode(_OUT_TO_STDERR)
- _CrtSetReportMode(_CRT_ASSERT/ERROR, _CRTDBG_MODE_FILE | _CRTDBG_MODE_DEBUG)
- SetErrorMode(SEM_FAILCRITICALERRORS)

There's many things this hocuspocus can be called, but sensible isn't among my
word choices for it.


> I'd like to just remove it.

I think we need to remove the SEM_NOGPFAULTERRORBOX, but I don't think we can
remove the SEM_FAILCRITICALERRORS, and I think we need the rest of the above
to prevent error boxes from happening.


I think we ought to actually apply these to all binaries, not just
postgres. One CI hung was due to psql asserting. But there's currently no easy
hook point for all binaries afaict. If we were to introduce something it
should probably be in pgport? But I'd defer to that to a later step.


I've attached a patch implementing these changes.

I also made one run with that intentionally fail (with an Assert(false)), and
with the changes the debugger is invoked and creates a backtrace etc:
https://cirrus-ci.com/task/5447300246929408
(click on crashlog-> at the top)

Greetings,

Andres Freund

Вложения

Re: Windows crash / abort handling

От
Andrew Dunstan
Дата:
On 10/5/21 15:30, Andres Freund wrote
>
> To actually get the crash reports I ended up doing the following on the OS
> level [5]:
>
>     Set-ItemProperty -Path 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\AeDebug' -Name 'Debugger' -Value
'\"C:\WindowsKits\10\Debuggers\x64\cdb.exe\" -p %ld -e %ld -g -kqm -c \".lines -e; .symfix+ ;.logappend
c:\cirrus\crashlog.txt; !peb; ~*kP ; .logclose ; q \"' ; `
 
>     New-ItemProperty -Path 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\AeDebug' -Name 'Auto' -Value 1
-PropertyTypeDWord ; `
 
>     Get-ItemProperty -Path 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\AeDebug' -Name Debugger; `
>
> This requires 'cdb' to be present, which is included in the Windows 10 SDK (or
> other OS versions, it doesn't appear to have changed much). Whenever there's
> an unhandled crash, cdb.exe is invoked with the parameters above, which
> appends the crash report to crashlog.txt.
>
> Alternatively we can generate "minidumps" [6], but that doesn't appear to be more
> helpful for CI purposes at least - all we'd do is to create a backtrace using
> the same tool. But it might be helpful for local development, to e.g. analyze
> crashes in more detail.
>
> The above ends up dumping all crashes into a single file, but that can
> probably be improved. But cdb is so gnarly that I wanted to stop looking once
> I got this far...
>
>
> Andrew, I wonder if something like this could make sense for windows BF animals?
>


Very possibly. I wonder how well it will work on machines where I have
more than one animal .e.g. lorikeet (cygwin) jacana (msys) and bowerbird
(MSVC) are all on the same machine. Likewise drongo (MSVC) and fairywren
(msys2).


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Windows crash / abort handling

От
Andres Freund
Дата:
Hi,

On 2022-01-10 10:57:00 -0500, Andrew Dunstan wrote:
> On 10/5/21 15:30, Andres Freund wrote
> > The above ends up dumping all crashes into a single file, but that can
> > probably be improved. But cdb is so gnarly that I wanted to stop looking once
> > I got this far...

FWIW, I figured out how to put the dumps into separate files by now...


> > Andrew, I wonder if something like this could make sense for windows BF animals?

> Very possibly. I wonder how well it will work on machines where I have
> more than one animal .e.g. lorikeet (cygwin) jacana (msys) and bowerbird
> (MSVC) are all on the same machine. Likewise drongo (MSVC) and fairywren
> (msys2).

Hm. I can see a few ways to deal with it. Are they running concurrently?
If not then it's easy enough to deal with.

It'd be a bit of a fight with cdb's awfully documented and quirky
scripting [1], but the best solution would probably be to just use an
environment variable from the target process to determine the dump
location. Then each buildfarm config could set a BF_BACKTRACE_LOCATION
variable or such...

[1] So there's !envvar. But that yields a string like
BF_BACKTRACE_LOCATION = value of environment variable when set to an
alias.  And I haven't found an easy way to get rid of the "variablename
= ". There is .foreach /pS [2] which could be used to skip over the
varname =, but that then splits on all whitespaces. Gah.

[2] https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/-foreach

Greetings,

Andres Freund



Re: Windows crash / abort handling

От
Andrew Dunstan
Дата:
On 1/11/22 02:51, Andres Freund wrote:
> Hi,
>
> On 2022-01-10 10:57:00 -0500, Andrew Dunstan wrote:
>> On 10/5/21 15:30, Andres Freund wrote
>>> The above ends up dumping all crashes into a single file, but that can
>>> probably be improved. But cdb is so gnarly that I wanted to stop looking once
>>> I got this far...
> FWIW, I figured out how to put the dumps into separate files by now...
>
>
>>> Andrew, I wonder if something like this could make sense for windows BF animals?
>> Very possibly. I wonder how well it will work on machines where I have
>> more than one animal .e.g. lorikeet (cygwin) jacana (msys) and bowerbird
>> (MSVC) are all on the same machine. Likewise drongo (MSVC) and fairywren
>> (msys2).
> Hm. I can see a few ways to deal with it. Are they running concurrently?
> If not then it's easy enough to deal with.
>
> It'd be a bit of a fight with cdb's awfully documented and quirky
> scripting [1], but the best solution would probably be to just use an
> environment variable from the target process to determine the dump
> location. Then each buildfarm config could set a BF_BACKTRACE_LOCATION
> variable or such...
>
> [1] So there's !envvar. But that yields a string like
> BF_BACKTRACE_LOCATION = value of environment variable when set to an
> alias.  And I haven't found an easy way to get rid of the "variablename
> = ". There is .foreach /pS [2] which could be used to skip over the
> varname =, but that then splits on all whitespaces. Gah.
>
> [2] https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/-foreach
>

Ugly as heck. I normally don't use locations with spaces in them. Let's
assume we don't have to deal with that issue at least.

But in theory these animals could be running in parallel, and in theory
each animal could have more than one branch being run concurrently. In
fact locking them against each other can be difficult/impossible. From
experience, three different perls might not agree on how file locking
works ... In the case of fairywren/drongo I have had to set things up so
that there is a single batch file that runs one job after the other.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Windows crash / abort handling

От
Andres Freund
Дата:
Hi,

On 2022-01-11 12:01:42 -0500, Andrew Dunstan wrote:
> On 1/11/22 02:51, Andres Freund wrote:
> > It'd be a bit of a fight with cdb's awfully documented and quirky
> > scripting [1], but the best solution would probably be to just use an
> > environment variable from the target process to determine the dump
> > location. Then each buildfarm config could set a BF_BACKTRACE_LOCATION
> > variable or such...
> >
> > [1] So there's !envvar. But that yields a string like
> > BF_BACKTRACE_LOCATION = value of environment variable when set to an
> > alias.  And I haven't found an easy way to get rid of the "variablename
> > = ". There is .foreach /pS [2] which could be used to skip over the
> > varname =, but that then splits on all whitespaces. Gah.
> >
> > [2] https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/-foreach
> >
> 
> Ugly as heck.

Indeed. I think I figured it out:

0:000> !envvar frak
        frak = C:\Program Files\Application Verifier\
0:000> ad /q path; .foreach /pS 2 (component {!envvar frak}){ .if (${/d:path}) {aS ${/v:path} ${/f:path} ${component}}
.else{aS ${/v:path} ${component}}}; .block {.echo ${path}}
 
C:\Program Files\Application Verifier\

I mean, no explanation needed, right?


> But in theory these animals could be running in parallel, and in theory
> each animal could have more than one branch being run concurrently. In
> fact locking them against each other can be difficult/impossible.

The environment variable solution directing dumps for each animal / branch to
different directories should take care of that, right?


Do you have a preference where a script file implementing the necessary
cdb.exe commands would reside? It's getting too long to comfortably implement
it inside the registry setting itself... That script would be used by all
branches, rather than be branch specific.

Greetings,

Andres Freund



Re: Windows crash / abort handling

От
Andrew Dunstan
Дата:
On 1/11/22 16:13, Andres Freund wrote:
> Hi,
>
> On 2022-01-11 12:01:42 -0500, Andrew Dunstan wrote:
>> On 1/11/22 02:51, Andres Freund wrote:
>>> It'd be a bit of a fight with cdb's awfully documented and quirky
>>> scripting [1], but the best solution would probably be to just use an
>>> environment variable from the target process to determine the dump
>>> location. Then each buildfarm config could set a BF_BACKTRACE_LOCATION
>>> variable or such...
>>>
>>> [1] So there's !envvar. But that yields a string like
>>> BF_BACKTRACE_LOCATION = value of environment variable when set to an
>>> alias.  And I haven't found an easy way to get rid of the "variablename
>>> = ". There is .foreach /pS [2] which could be used to skip over the
>>> varname =, but that then splits on all whitespaces. Gah.
>>>
>>> [2] https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/-foreach
>>>
>> Ugly as heck.
> Indeed. I think I figured it out:
>
> 0:000> !envvar frak
>         frak = C:\Program Files\Application Verifier\
> 0:000> ad /q path; .foreach /pS 2 (component {!envvar frak}){ .if (${/d:path}) {aS ${/v:path} ${/f:path}
${component}}.else {aS ${/v:path} ${component}}}; .block {.echo ${path}}
 
> C:\Program Files\Application Verifier\
>
> I mean, no explanation needed, right?
>
>
>> But in theory these animals could be running in parallel, and in theory
>> each animal could have more than one branch being run concurrently. In
>> fact locking them against each other can be difficult/impossible.
> The environment variable solution directing dumps for each animal / branch to
> different directories should take care of that, right?
>
>
> Do you have a preference where a script file implementing the necessary
> cdb.exe commands would reside? It's getting too long to comfortably implement
> it inside the registry setting itself... That script would be used by all
> branches, rather than be branch specific.



Well, the buildfarm code sets up a file for gdb in the branch's pgsql.build:


    my $cmdfile = "./gdbcmd";
    my $handle;
    open($handle, '>', $cmdfile) || die "opening $cmdfile: $!";
    print $handle "bt\n";
    print $handle 'p $_siginfo', "\n";
    close($handle);

    my @trace = ("\n\n");

    foreach my $core (@cores)
    {
        my @onetrace =
          run_log("gdb -x $cmdfile --batch $bindir/postgres $core");
        push(@trace,
            "$log_file_marker stack trace: $core $log_file_marker\n",
            @onetrace);
    }


So by analogy we could do something similar for the Windows debugger. Or
we could pick up a file from the repo if we wanted to commit it in
src/tools somewhere.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Windows crash / abort handling

От
Andres Freund
Дата:
Hi,

On 2022-01-09 16:57:04 -0800, Andres Freund wrote:
> I've attached a patch implementing these changes.

Unless somebody is planning to look at this soon, I'm planning to push it to
master. It's too annoying to have these hangs and not see backtraces.


We're going to have to do this in all binaries at some point, but that's
a considerably larger patch...

Greetings,

Andres Freund



Re: Windows crash / abort handling

От
Thomas Munro
Дата:
On Sun, Jan 30, 2022 at 10:02 AM Andres Freund <andres@anarazel.de> wrote:
> On 2022-01-09 16:57:04 -0800, Andres Freund wrote:
> > I've attached a patch implementing these changes.
>
> Unless somebody is planning to look at this soon, I'm planning to push it to
> master. It's too annoying to have these hangs and not see backtraces.

+1, I don't know enough about Windows development to have an opinion
on the approach but we've got to try *something*, these hangs are
terrible.



Re: Windows crash / abort handling

От
Andres Freund
Дата:
Hi,

On 2022-02-02 11:24:19 +1300, Thomas Munro wrote:
> On Sun, Jan 30, 2022 at 10:02 AM Andres Freund <andres@anarazel.de> wrote:
> > On 2022-01-09 16:57:04 -0800, Andres Freund wrote:
> > > I've attached a patch implementing these changes.
> >
> > Unless somebody is planning to look at this soon, I'm planning to push it to
> > master. It's too annoying to have these hangs and not see backtraces.
> 
> +1, I don't know enough about Windows development to have an opinion
> on the approach but we've got to try *something*, these hangs are
> terrible.

I've pushed the patch this thread is about now. Lets see what the buildfarm
says. I only could one windows version.  Separately I've also pushed a patch
to run the windows tests under a timeout. I hope in combination these patches
address the hangs.

Greetings,

Andres Freund



Re: Windows crash / abort handling

От
David Rowley
Дата:
On Thu, 3 Feb 2022 at 15:38, Andres Freund <andres@anarazel.de> wrote:
> I've pushed the patch this thread is about now. Lets see what the buildfarm
> says. I only could one windows version.  Separately I've also pushed a patch
> to run the windows tests under a timeout. I hope in combination these patches
> address the hangs.

I tried this out today on a windows machine I have here. I added some
code to trigger an Assert failure and the options of attaching a
debugger work well. Tested both from running the regression tests and
running a query manually with psql.

Tested on Windows 8.1 with VS2017.

Thanks for working on this.

David