Discussion: "could not reattach to shared memory" on buildfarm member dory
So far, dory has failed three times with essentially identical symptoms:

2018-04-23 19:57:10.624 GMT [2240] FATAL: could not reattach to shared memory (key=0000000000000190, addr=00000000018E0000): error code 487
2018-04-23 15:57:10.657 EDT [8836] ERROR: lost connection to parallel worker
2018-04-23 15:57:10.657 EDT [8836] STATEMENT: select count(*) from tenk1 group by twenty;
2018-04-23 15:57:10.660 EDT [3820] LOG: background worker "parallel worker" (PID 2240) exited with exit code 1

Now how can this be? We've successfully reserved and released the address range we want to use, so it *should* be free at the instant we try to map.

Another thing that seems curious, though it may just be an artifact of not having many data points yet, is that these failures all occurred during pg_upgradecheck. You'd think the "check" and "install-check" steps would be equally vulnerable to the failure.

I guess the good news is that we're seeing this in a reasonably reproducible fashion, so there's some hope of digging down to find out the actual cause.

regards, tom lane
Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> So far, dory has failed three times with essentially identical symptoms:
>
> 2018-04-23 19:57:10.624 GMT [2240] FATAL: could not reattach to shared memory (key=0000000000000190, addr=00000000018E0000): error code 487
> 2018-04-23 15:57:10.657 EDT [8836] ERROR: lost connection to parallel worker
> 2018-04-23 15:57:10.657 EDT [8836] STATEMENT: select count(*) from tenk1 group by twenty;
> 2018-04-23 15:57:10.660 EDT [3820] LOG: background worker "parallel worker" (PID 2240) exited with exit code 1
>
> Now how can this be? We've successfully reserved and released the address
> range we want to use, so it *should* be free at the instant we try to map.

Yeah, that's definitely interesting.

> I guess the good news is that we're seeing this in a reasonably
> reproducible fashion, so there's some hope of digging down to find
> out the actual cause.

I've asked Heath to take a look at the system again and see if there's any Windows logs or such that might help us understand what's happening. AV was disabled on the box, so don't think it's that, at least.

Thanks!

Stephen
On Tue, Apr 24, 2018 at 11:18 AM, Stephen Frost <sfrost@snowman.net> wrote:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> So far, dory has failed three times with essentially identical symptoms:
>>
>> 2018-04-23 19:57:10.624 GMT [2240] FATAL: could not reattach to shared memory (key=0000000000000190, addr=00000000018E0000): error code 487
>> 2018-04-23 15:57:10.657 EDT [8836] ERROR: lost connection to parallel worker
>> 2018-04-23 15:57:10.657 EDT [8836] STATEMENT: select count(*) from tenk1 group by twenty;
>> 2018-04-23 15:57:10.660 EDT [3820] LOG: background worker "parallel worker" (PID 2240) exited with exit code 1
>>
>> Now how can this be? We've successfully reserved and released the address
>> range we want to use, so it *should* be free at the instant we try to map.
>
> Yeah, that's definitely interesting.

I wondered if another thread with the right timing could map something between the VirtualFree() and MapViewOfFileEx() calls, but we don't create the Windows signal handling thread until a bit later. Could there be any other threads active in the process?

Maybe try asking what's mapped there with VirtualQueryEx() on failure?

--
Thomas Munro
http://www.enterprisedb.com
On Tue, Apr 24, 2018 at 11:37:33AM +1200, Thomas Munro wrote:
> I wondered if another thread with the right timing could map something
> between the VirtualFree() and MapViewOfFileEx() calls, but we don't
> create the Windows signal handling thread until a bit later. Could
> there be any other threads active in the process?
>
> Maybe try asking what's mapped there with VirtualQueryEx() on failure?

+1. An implementation of that:
https://www.postgresql.org/message-id/20170403065106.GA2624300%40tornado.leadboat.com
On Tue, Apr 24, 2018 at 1:18 AM, Stephen Frost <sfrost@snowman.net> wrote:
> Greetings,
>
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> So far, dory has failed three times with essentially identical symptoms:
>>
>> 2018-04-23 19:57:10.624 GMT [2240] FATAL: could not reattach to shared memory (key=0000000000000190, addr=00000000018E0000): error code 487
>> 2018-04-23 15:57:10.657 EDT [8836] ERROR: lost connection to parallel worker
>> 2018-04-23 15:57:10.657 EDT [8836] STATEMENT: select count(*) from tenk1 group by twenty;
>> 2018-04-23 15:57:10.660 EDT [3820] LOG: background worker "parallel worker" (PID 2240) exited with exit code 1
>>
>> Now how can this be? We've successfully reserved and released the address
>> range we want to use, so it *should* be free at the instant we try to map.
>
> Yeah, that's definitely interesting.
>
>> I guess the good news is that we're seeing this in a reasonably
>> reproducible fashion, so there's some hope of digging down to find
>> out the actual cause.
>
> I've asked Heath to take a look at the system again and see if there's
> any Windows logs or such that might help us understand what's happening.
> AV was disabled on the box, so don't think it's that, at least.

Disabled or uninstalled?

Back when I was combating Windows AV on a daily basis, this normally did not have the same effect. Just disabling the AV didn't actually remove the parts that caused issues, it just hid them. An actual uninstall is what was required.
On 24 April 2018 at 15:18, Magnus Hagander <magnus@hagander.net> wrote:
> Back when I was combating Windows AV on a daily basis, this normally did not
> have the same effect. Just disabling the AV didn't actually remove the parts
> that caused issues, it just hid them. Actual uninstall is what was required.

Yep. Specifically, they tended to inject kernel hooks and/or load hook DLLs that did funky and often flaky things, often with poor awareness of things like multiple processes opening one file for write at the same time.

I think I heard that MS has cleaned up the AV situation considerably by offering more kernel infrastructure for it, and by restricting what you can do in terms of kernel extensions etc. But I don't know how much.

In any case, do you think dropping a minidump at the point of failure might be informative? It should contain the full memory mapping information. For this purpose we could just create a crashdumps/ directory, then abort() when we detect the error, and have the buildfarm stop processing until someone can grab the tempdir with the dumps, binaries, .pdb files, etc.

src/backend/port/win32/crashdump.c doesn't expose a helper function to do all the dbghelp.dll messing around and create a crashdump; it only allows that to be done via a crash handler. But it might make sense to break out the actual "write a crash dump" part into a separately callable function.

I've looked at doing this before, but always got stuck on the apparent lack of support in gdb or lldb for being used as a library for self-dumping. You can always shell out to gcore, I guess... but ew. Or we can fork() and abort() the forked child like https://github.com/RuntimeTools/gencore does, but again, ew.

I was thinking that maybe the buildfarm could just create crashdumps/ automatically, but then we'd need to have support infrastructure for recording the Pg binaries and .pdb files along with the dumps, rotating them so we don't run out of space, etc.
-- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Greetings,

* Magnus Hagander (magnus@hagander.net) wrote:
> Disabled or uninstalled?

The only AV installed on the system is the "Windows Defender" thing, so it's not some additional AV system.

Thanks!

Stephen
Noah Misch <noah@leadboat.com> writes:
> On Tue, Apr 24, 2018 at 11:37:33AM +1200, Thomas Munro wrote:
>> Maybe try asking what's mapped there with VirtualQueryEx() on failure?

> +1. An implementation of that:
> https://www.postgresql.org/message-id/20170403065106.GA2624300%40tornado.leadboat.com

Not seeing any other work happening here, I pushed a little bit of quick-hack investigation code. This is based on noting that VirtualAllocEx is documented as rounding the allocation up to a page boundary (4K), but there's nothing specific about whether or how much CreateFileMapping or MapViewOfFileEx might round up. The observed failures could be explained if those guys might eat more virtual address space for the same request size as VirtualAllocEx does. This is a stretch, for sure, but given the lack of any other theories we might as well check it.

regards, tom lane
Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Not seeing any other work happening here, I pushed a little bit of
> quick-hack investigation code. [...] This is a stretch, for sure, but
> given the lack of any other theories we might as well check it.

Sounds good to me.

Just as an FYI, there are a couple folks taking a look at the system and trying to figure out what's going on. We've seen an Event ID 1530 error in the Windows Event Log associated with vctip.exe, which Visual Studio was running with the build, but only sometimes. When vctip.exe runs and then finishes, it goes and cleans things up, which seems to be what's triggering the 1530, and that appears to correlate with the failures, but it's hard to say if that's really a smoking gun or just coincidence.

Thanks!

Stephen
[ Thanks to Stephen for cranking up a continuous build loop on dory ]

Noah Misch <noah@leadboat.com> writes:
> On Tue, Apr 24, 2018 at 11:37:33AM +1200, Thomas Munro wrote:
>> Maybe try asking what's mapped there with VirtualQueryEx() on failure?

> +1. An implementation of that:
> https://www.postgresql.org/message-id/20170403065106.GA2624300%40tornado.leadboat.com

So I tried putting in that code, and it turns the problem from something that maybe happens in every third buildfarm run or so, to something that happens at least a dozen times in a single "make check" step. This seems to mean that either EnumProcessModules or GetModuleFileNameEx is itself allocating memory, and sometimes that allocation comes out of the space VirtualFree just freed :-(. So we can't use those functions. We have however proven that no new module gets loaded during VirtualFree or MapViewOfFileEx, so there doesn't seem to be anything more to be learned from them anyway.

What it looks like to me is that MapViewOfFileEx allocates some memory and sometimes that comes out of the wrong place. This is, um, unfortunate. It also appears that VirtualFree might sometimes allocate some memory, and that'd be even more unfortunate, but it's hard to be certain; the blame might well fall on VirtualQuery instead. (Ain't Heisenbugs fun?)

The solution I was thinking about last night was to have PGSharedMemoryReAttach call MapViewOfFileEx to map the shared memory segment at an unspecified address, then unmap it, then call VirtualFree, and finally call MapViewOfFileEx with the real target address. The idea here is to get these various DLLs to set up any memory allocation pools they're going to set up before we risk doing VirtualFree. I am not, at this point, convinced this will fix it :-( ... but I'm not sure what else to try.

In any case, it's still pretty unclear why dory is showing this problem and other buildfarm members are not. whelk for instance seems to be loading all the same DLLs and more besides.

regards, tom lane
Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> [ Thanks to Stephen for cranking up a continuous build loop on dory ]

That was actually Heath, who is also trying to work this issue and has an idea about something else which might help (and some more information about what's happening in the event log). Adding him to the thread so he can (easily) reply with what he's found.

Heath..?

Thanks!

Stephen
On Mon, Apr 30, 2018 at 2:34 PM, Stephen Frost <sfrost@snowman.net> wrote:
> Greetings,
>
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> [ Thanks to Stephen for cranking up a continuous build loop on dory ]
>
> That was actually Heath, who is also trying to work this issue and has
> an idea about something else which might help (and some more information
> about what's happening in the event log). Adding him to the thread so
> he can (easily) reply with what he's found.
>
> Heath..?
>
> Thanks!
>
> Stephen
So what I noticed after adding the '--force' flag was that in the Event Viewer logs there was an Error in the System log stating that "The application-specific permission settings do not grant Local Activation permission for the COM Server application" for the Runtime Broker. So around 2:00 pm today I went and changed the ownership of the registry values to Administrators so I could add the user we are running the builds under to the list of members that have access to it. However, right after I made that change the build was actually broken for me, so I am just now turning the builds back on to run constantly to verify whether this has any effect on the issue of not being able to reattach to the shared memory. I am hoping that this makes things more stable, but I am not confident that these are related.
The change that I attempted prior to this, last Wednesday, was to remove the vctip.exe program from being used, as this was causing issues for us as well. It was producing an Event ID 1530 error stating that "the application that is listed in the event details is leaving the registry handle open", and Windows was helpfully closing any registry values for that user profile; I thought that possibly, in doing so, it was cleaning up the memory that was allocated prior. However, this did not change anything for us.
Any ideas for changes that we could make to help debug or verify would be helpful. We have considered changing it to run everything as SYSTEM, but if possible we would like to avoid this for security reasons. Thank you in advance; I appreciate all the help.
-Heath
Greetings,

* Heath Lord (heath.lord@crunchydata.com) wrote:
> Any ideas or changes that we could do to help debug or verify would be
> helpful. We have considered changing it to run everything as SYSTEM but if
> possible we would like to avoid this for security reasons. Thank you in
> advance and I appreciate all the help.

Just to be clear: there is no longer anything showing up in the event viewer associated with running the builds. There may still be an issue with the system setup or such, but it at least seems less likely for that to be the issue, so I'm thinking that Tom is more likely correct that PG is doing something not quite right here.

Thanks!

Stephen
Heath Lord <heath.lord@crunchydata.com> writes:
> So what I noticed after adding the '--force' flag was that in the Event
> Viewer logs there was an Error in the System log stating that "The
> application-specific permission settings do not grant Local Activation
> permission for the COM Server application" for the Runtime Broker. [...]
> However right after I made that change the build was actually broken for
> me so I am just now turning the builds back on to run constantly.

The build was broken on Windows between about 12:30 and 14:30 EDT, thanks to an unrelated change. Now that that's sorted, dory is still failing :-(.

Moreover, even though I took out the module dump logic, the failure rate is still a good bit higher than it was before, which seems to confirm my fear that this is a Heisenbug: either VirtualQuery itself, or the act of elog'ing a bunch of output, is causing memory allocations to take place that otherwise would not have. I'm hoping that the elog output is to blame for that, and am going to go try to rejigger the code so that we capture the memory maps into space that was allocated before VirtualFree.

> Any ideas or changes that we could do to help debug or verify would be
> helpful. We have considered changing it to run everything as SYSTEM but if
> possible we would like to avoid this for security reasons.

Yeah, that's no solution. Even if it were OK for dory, end users wouldn't necessarily want to run PG that way.

regards, tom lane
I wrote:
> The solution I was thinking about last night was to have
> PGSharedMemoryReAttach call MapViewOfFileEx to map the shared memory
> segment at an unspecified address, then unmap it, then call VirtualFree,
> and finally call MapViewOfFileEx with the real target address. The idea
> here is to get these various DLLs to set up any memory allocation pools
> they're going to set up before we risk doing VirtualFree. I am not,
> at this point, convinced this will fix it :-( ... but I'm not sure what
> else to try.

So the answer is that that doesn't help at all.

It's clear from dory's results that something is causing a 4MB chunk of memory to get reserved in the process's address space, sometimes. It might happen during the main MapViewOfFileEx call, or during the preceding VirtualFree, or, with my map/unmap dance in place, it might happen during that. Frequently it doesn't happen at all, at least not before the point where we've successfully done MapViewOfFileEx. But if it does happen, and the chunk happens to get put in a spot that overlaps where we want to put the shmem block, kaboom.

What seems like a plausible theory at this point is that the apparent asynchronicity is due to the allocation being triggered by a different thread, and the fact that our added monitoring code seems to make the failure more likely can be explained by that code changing the timing. But what thread could it be? It doesn't really look to me like either the signal thread or the timer thread could eat 4MB. syslogger.c also spawns a thread, on Windows, but AFAICS that's not being used in this test configuration. Maybe the reason dory is showing the problem is that something or other is spawning a thread we don't even know about?

I'm going to go put a 1-sec sleep into the beginning of PGSharedMemoryReAttach and see if that changes anything. If I'm right that this is being triggered by another thread, that should allow the other thread to do its thing (at least most of the time), so the failure rate ought to go way down.

Even if that does happen, I'm at a loss for a reasonable way to fix it for real. Is there a way to seize control of a Windows process so that there are no other running threads? Any other ideas?

regards, tom lane
Hi,

On 2018-04-30 20:01:40 -0400, Tom Lane wrote:
> What seems like a plausible theory at this point is that the apparent
> asynchronicity is due to the allocation being triggered by a different
> thread, and the fact that our added monitoring code seems to make the
> failure more likely can be explained by that code changing the timing.
> But what thread could it be? It doesn't really look to me like either
> the signal thread or the timer thread could eat 4MB.

It seems plausible that the underlying allocator allocates larger chunks to serve small allocations. But we don't seem to have started any threads at PGSharedMemoryReAttach() time? So it'd have to be something else that starts threads.

Heath, could you use Process Explorer or such to check which threads are running inside a working backend process?

Greetings,

Andres Freund
On Mon, Apr 30, 2018 at 08:01:40PM -0400, Tom Lane wrote:
> It's clear from dory's results that something is causing a 4MB chunk
> of memory to get reserved in the process's address space, sometimes.
> It might happen during the main MapViewOfFileEx call, or during the
> preceding VirtualFree, or with my map/unmap dance in place, it might
> happen during that. Frequently it doesn't happen at all, at least not
> before the point where we've successfully done MapViewOfFileEx. But
> if it does happen, and the chunk happens to get put in a spot that
> overlaps where we want to put the shmem block, kaboom.
>
> [...] Maybe the reason dory is showing the problem
> is something or other is spawning a thread we don't even know about?

Likely some privileged daemon is creating a thread in every new process. (On Windows, it's not unusual for one process to create a thread in another process.) We don't have good control over that.

> I'm at a loss for a reasonable way to fix it
> for real. Is there a way to seize control of a Windows process so that
> there are no other running threads?

I think not.

> Any other ideas?

PostgreSQL could retry the whole process creation, analogous to internal_forkexec() retries. Have the failed process exit after recording the fact that it couldn't attach. Make the postmaster notice and spawn a replacement. Give up after 100 failed attempts.
Noah Misch <noah@leadboat.com> writes:
> On Mon, Apr 30, 2018 at 08:01:40PM -0400, Tom Lane wrote:
>> What seems like a plausible theory at this point is that the apparent
>> asynchronicity is due to the allocation being triggered by a different
>> thread, and the fact that our added monitoring code seems to make the
>> failure more likely can be explained by that code changing the timing.
>> But what thread could it be? It doesn't really look to me like either
>> the signal thread or the timer thread could eat 4MB. syslogger.c
>> also spawns a thread, on Windows, but AFAICS that's not being used in
>> this test configuration.

The 1-second wait doesn't seem to have changed things much, which puts a hole in the idea that this is triggered by a thread spawned earlier in backend process startup.

>> Maybe the reason dory is showing the problem
>> is something or other is spawning a thread we don't even know about?

> Likely some privileged daemon is creating a thread in every new process.

Yeah, I'm afraid that's the most likely theory at this point; it offers an explanation for why we're seeing this on dory and not other machines. Although if the daemon were responding to process startup, wouldn't the extra wait have given it time to do so? There's still something that doesn't add up here.

>> Any other ideas?

> PostgreSQL could retry the whole process creation, analogous to
> internal_forkexec() retries.

In the absence of any clearer theory about what's causing this, that may be our only recourse. Sure is ugly though.

regards, tom lane
On Tue, May 1, 2018 at 2:59 PM, Noah Misch <noah@leadboat.com> wrote:
> Likely some privileged daemon is creating a thread in every new process. (On
> Windows, it's not unusual for one process to create a thread in another
> process.) We don't have good control over that.

Huh. I was already amazed (as a non-Windows user) by the DSM code that duplicates file handles into the postmaster process without its cooperation, but starting threads is even more amazing. Apparently debuggers do that. Could this be running in some kind of debugger-managed environment or build, perhaps as a result of some core dump capturing mode or something?

https://msdn.microsoft.com/en-us/library/windows/desktop/dd405484(v=vs.85).aspx

Apparently another way to mess with another process's memory map is via "Asynchronous Procedure Calls":

http://blogs.microsoft.co.il/pavely/2017/03/14/injecting-a-dll-without-a-remote-thread/

It looks like that mechanism could allow something either in our own process (perhaps some timer-related thing that we might have set up ourselves or that might be set up by the system?) or in another process to queue actions for our own thread to run at certain points.

https://msdn.microsoft.com/en-us/library/windows/desktop/ms681951(v=vs.85).aspx

--
Thomas Munro
http://www.enterprisedb.com
Andres Freund <andres@anarazel.de> writes:
> Heath, could you use process explorer or such to check which processes
> are running inside a working backend process?

It seems to be possible to enumerate the threads that are present inside a Windows process, although it's not clear to me how much identifying info is available. Perhaps it'd be worth putting in some "dump threads" debugging code, like the "dump modules" code we had in there for a bit?

regards, tom lane
Well, at this point the only thing that's entirely clear is that none of the ideas I had work. I think we are going to be forced to pursue Noah's idea of doing an end-to-end retry. Somebody else will need to take point on that; I lack a Windows environment and have already done a lot more blind patch-pushing than I like in this effort.

I'll revert the debugging code I added to win32_shmem.c, unless someone sees a reason to leave it there awhile longer.

regards, tom lane
On Tue, May 01, 2018 at 11:31:50AM -0400, Tom Lane wrote:
> Well, at this point the only thing that's entirely clear is that none
> of the ideas I had work. I think we are going to be forced to pursue
> Noah's idea of doing an end-to-end retry. Somebody else will need to
> take point on that; I lack a Windows environment and have already done
> a lot more blind patch-pushing than I like in this effort.

Having tried this, I find a choice between performance and complexity. Both of my designs use proc_exit(4) to indicate failure to reattach.

The simpler, slower design has WIN32 internal_forkexec() block until the child reports (via SetEvent()) that it reattached to shared memory. This caused a fivefold reduction in process creation performance[1].

The less-simple, faster design stashes the Port structure and retry count in the BackendList entry, which reaper() uses to retry the fork upon seeing status 4. Notably, this requires new code for regular backends, for bgworkers, and for others. It's currently showing a 30% performance _increase_ on the same benchmark; I can't explain that increase and doubt it will last, but I think it's plausible for the less-simple design to be performance-neutral.

I see these options:

1. Use the simpler design with a GUC, disabled by default, to control whether the new code is active. Mention the GUC in a new errhint() for the "could not reattach to shared memory" error.
2. Like (1), but enable the GUC by default.
3. Like (1), but follow up with a patch to enable the GUC by default in v12 only.
4. In addition to (1), enable retries if the GUC is set _or_ this postmaster has seen at least one child fail to reattach.
5. Use the less-simple design, with retries enabled unconditionally.

I think I prefer (3), with (1) being a close second. My hesitation on (3) is that parallel query has made startup time count even if you use a connection pool, and all the Windows users not needing these retries will see parallel query become that much slower. I dislike (5) for its impact on platform-independent postmaster code. Other opinions?

I'm attaching a mostly-finished patch for the slower design. I tested correctness with -DREATTACH_FORCE_FAIL_PERCENT=99. I'm also attaching a proof-of-concept patch for the faster design. In this proof of concept, the postmaster does not close its copy of a backend socket until the backend exits. Also, bgworkers can change from BGWH_STARTED back to BGWH_NOT_YET_STARTED; core code tolerates this, but external code may not. Those would justify paying some performance to fix. The proof of concept handles bgworkers and regular backends, but it does not handle the startup process, checkpointer, etc. That doesn't affect benchmarking, of course.

nm

[1] This (2 forks per transaction) dropped from 139tps to 27tps:

echo 'select 1' >script
env PGOPTIONS="--default_transaction_isolation=repeatable\\ read --force_parallel_mode=on" pgbench -T15 -j30 -c30 --connect -n -fscript
Noah Misch <noah@leadboat.com> writes:
> Having tried this, I find a choice between performance and complexity. Both
> of my designs use proc_exit(4) to indicate failure to reattach. The simpler,
> slower design has WIN32 internal_forkexec() block until the child reports (via
> SetEvent()) that it reattached to shared memory. This caused a fivefold
> reduction in process creation performance[1].

Ouch.

> The less-simple, faster design
> stashes the Port structure and retry count in the BackendList entry, which
> reaper() uses to retry the fork upon seeing status 4. Notably, this requires
> new code for regular backends, for bgworkers, and for others.

Messy as that is, I think actually the worse problem with it is:

> In this proof of concept, the
> postmaster does not close its copy of a backend socket until the backend
> exits.

That seems unworkable because it would interfere with detection of client connection drops. But since you say this is just a POC, maybe you intended to fix that? It'd probably be all right for the postmaster to hold onto the socket until the new backend reports successful attach, using the same signaling mechanism you had in mind for the other way.

Overall, I agree that neither of these approaches is exactly attractive. We're paying a heck of a lot of performance or complexity to solve a problem that shouldn't even be there, and that we don't understand well. In particular, the theory that some privileged code is injecting a thread into every new process doesn't square with my results at https://www.postgresql.org/message-id/15345.1525145612%40sss.pgh.pa.us

I think our best course of action at this point is to do nothing until we have a clearer understanding of what's actually happening on dory. Perhaps such understanding will yield an idea for a less painful fix.

regards, tom lane
On Mon, Sep 24, 2018 at 01:53:05PM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > In this proof of concept, the
> > postmaster does not close its copy of a backend socket until the backend
> > exits.
>
> That seems unworkable because it would interfere with detection of client
> connection drops.  But since you say this is just a POC, maybe you
> intended to fix that?  It'd probably be all right for the postmaster to
> hold onto the socket until the new backend reports successful attach,
> using the same signaling mechanism you had in mind for the other way.

It wasn't relevant to the concept being proven, so I suspended decisions in
that area.  Arranging for socket closure is a simple matter of programming.

> Overall, I agree that neither of these approaches are exactly attractive.
> We're paying a heck of a lot of performance or complexity to solve a
> problem that shouldn't even be there, and that we don't understand well.
> In particular, the theory that some privileged code is injecting a thread
> into every new process doesn't square with my results at
> https://www.postgresql.org/message-id/15345.1525145612%40sss.pgh.pa.us
>
> I think our best course of action at this point is to do nothing until
> we have a clearer understanding of what's actually happening on dory.
> Perhaps such understanding will yield an idea for a less painful fix.

I see.
On Tue, Sep 25, 2018 at 08:05:12AM -0700, Noah Misch wrote:
> On Mon, Sep 24, 2018 at 01:53:05PM -0400, Tom Lane wrote:
> > Overall, I agree that neither of these approaches are exactly attractive.
> > We're paying a heck of a lot of performance or complexity to solve a
> > problem that shouldn't even be there, and that we don't understand well.
> > In particular, the theory that some privileged code is injecting a thread
> > into every new process doesn't square with my results at
> > https://www.postgresql.org/message-id/15345.1525145612%40sss.pgh.pa.us
> >
> > I think our best course of action at this point is to do nothing until
> > we have a clearer understanding of what's actually happening on dory.
> > Perhaps such understanding will yield an idea for a less painful fix.
>
> I see.

Could one of you having a dory login use
https://live.sysinternals.com/Procmon.exe to capture process events during
backend startup?  The ideal would be one capture where startup failed reattach
and another where it succeeded, but having the successful run alone would be a
good start.  The procedure is roughly this:

- Install PostgreSQL w/ debug symbols.
- Start a postmaster.
- procmon /nomonitor
- procmon "Filter" menu -> Enable Advanced Output
- Ctrl-l, add filter for "Process Name" is "postgres.exe"
- Ctrl-e (starts collecting data)
- psql (leave it running)
- After ~60s, Ctrl-e again in procmon (stops collecting data)
- File -> Save -> PML
- File -> Save -> XML, include stack traces, resolve stack symbols
- Compress the PML and XML files, and mail them here

I'm attaching the data from a system not having the problem.  On this system,
backend startup sees six thread creations:

1. main thread
2. thread created before postgres.exe has control
3. thread created before postgres.exe has control
4. thread created before postgres.exe has control
5. in pgwin32_signal_initialize()
6. in src\backend\port\win32\timer.c:setitimer()

Threads 2-4 exit exactly 30s after creation.  If we fail to reattach to
shared memory, we'll exit before reaching code to start 5 or 6.

It would be quite interesting if dory makes a different number of threads or
if threads 2-4 live some duration other than 30s.  It would also be
interesting if dory has "Load Image" events after postgres.exe code has
started running.  This unaffected system loads mswsock.dll during
read_inheritable_socket().

Thanks,
nm
Attachments
On Sun, Dec 02, 2018 at 09:35:06PM -0800, Noah Misch wrote:
> On Tue, Sep 25, 2018 at 08:05:12AM -0700, Noah Misch wrote:
> > On Mon, Sep 24, 2018 at 01:53:05PM -0400, Tom Lane wrote:
> > > Overall, I agree that neither of these approaches are exactly attractive.
> > > We're paying a heck of a lot of performance or complexity to solve a
> > > problem that shouldn't even be there, and that we don't understand well.
> > > In particular, the theory that some privileged code is injecting a thread
> > > into every new process doesn't square with my results at
> > > https://www.postgresql.org/message-id/15345.1525145612%40sss.pgh.pa.us
> > >
> > > I think our best course of action at this point is to do nothing until
> > > we have a clearer understanding of what's actually happening on dory.
> > > Perhaps such understanding will yield an idea for a less painful fix.
> >
> > I see.
>
> Could one of you having a dory login use
> https://live.sysinternals.com/Procmon.exe to capture process events during
> backend startup?

Ping.

> The ideal would be one capture where startup failed reattach
> and another where it succeeded, but having the successful run alone would be a
> good start.  The procedure is roughly this:
>
> - Install PostgreSQL w/ debug symbols.
> - Start a postmaster.
> - procmon /nomonitor
> - procmon "Filter" menu -> Enable Advanced Output
> - Ctrl-l, add filter for "Process Name" is "postgres.exe"
> - Ctrl-e (starts collecting data)
> - psql (leave it running)
> - After ~60s, Ctrl-e again in procmon (stops collecting data)
> - File -> Save -> PML
> - File -> Save -> XML, include stack traces, resolve stack symbols
> - Compress the PML and XML files, and mail them here
On Thu, Jan 17, 2019 at 3:27 AM Noah Misch <noah@leadboat.com> wrote:
On Sun, Dec 02, 2018 at 09:35:06PM -0800, Noah Misch wrote:
> On Tue, Sep 25, 2018 at 08:05:12AM -0700, Noah Misch wrote:
> > On Mon, Sep 24, 2018 at 01:53:05PM -0400, Tom Lane wrote:
> > > Overall, I agree that neither of these approaches are exactly attractive.
> > > We're paying a heck of a lot of performance or complexity to solve a
> > > problem that shouldn't even be there, and that we don't understand well.
> > > In particular, the theory that some privileged code is injecting a thread
> > > into every new process doesn't square with my results at
> > > https://www.postgresql.org/message-id/15345.1525145612%40sss.pgh.pa.us
> > >
> > > I think our best course of action at this point is to do nothing until
> > > we have a clearer understanding of what's actually happening on dory.
> > > Perhaps such understanding will yield an idea for a less painful fix.
> >
> > I see.
>
> Could one of you having a dory login use
> https://live.sysinternals.com/Procmon.exe to capture process events during
> backend startup?
Ping.
> The ideal would be one capture where startup failed reattach
> and another where it succeeded, but having the successful run alone would be a
> good start. The procedure is roughly this:
>
> - Install PostgreSQL w/ debug symbols.
> - Start a postmaster.
> - procmon /nomonitor
> - procmon "Filter" menu -> Enable Advanced Output
> - Ctrl-l, add filter for "Process Name" is "postgres.exe"
> - Ctrl-e (starts collecting data)
> - psql (leave it running)
> - After ~60s, Ctrl-e again in procmon (stops collecting data)
> - File -> Save -> PML
> - File -> Save -> XML, include stack traces, resolve stack symbols
> - Compress the PML and XML files, and mail them here
Noah,
I apologize for not responding earlier, but we are attempting to capture the information you requested. However, where would you like us to pull the install for PostgreSQL with the debug symbols in it? We are aware of the BigSQL and EDB installers, but are unsure whether these contain the debug symbols. Currently this machine has nothing on it other than the necessary dependencies to build postgres for the community. If you could give us some more information on what you would like us to install in order to gather this information and help debug the issue we are seeing, we would really appreciate it. Also, if there is any additional information you need about what we have installed on the server or how it is configured, please just let us know and we will get that to you as soon as possible. Thank you again for your interest in this issue.
-Heath
On Tue, Jan 29, 2019 at 11:28:56AM -0500, Heath Lord wrote:
> On Thu, Jan 17, 2019 at 3:27 AM Noah Misch <noah@leadboat.com> wrote:
> > On Sun, Dec 02, 2018 at 09:35:06PM -0800, Noah Misch wrote:
> > > Could one of you having a dory login use
> > > https://live.sysinternals.com/Procmon.exe to capture process events
> > > during backend startup?
> > > The ideal would be one capture where startup failed reattach
> > > and another where it succeeded, but having the successful run alone
> > > would be a good start.  The procedure is roughly this:
> > >
> > > - Install PostgreSQL w/ debug symbols.
> > > - Start a postmaster.
> > > - procmon /nomonitor
> > > - procmon "Filter" menu -> Enable Advanced Output
> > > - Ctrl-l, add filter for "Process Name" is "postgres.exe"
> > > - Ctrl-e (starts collecting data)
> > > - psql (leave it running)
> > > - After ~60s, Ctrl-e again in procmon (stops collecting data)
> > > - File -> Save -> PML
> > > - File -> Save -> XML, include stack traces, resolve stack symbols
> > > - Compress the PML and XML files, and mail them here
>
> I apologize for not responding earlier, but we are attempting to capture
> the information you requested.  However, where would you like us to pull
> the install for PostgreSQL with the debug symbols in it?  We are aware of
> the BigSQL and EDB installers, but are unsure whether these contain the
> debug symbols.

Please use a locally-built PostgreSQL as similar as possible to what dory's
buildfarm runs produce, except building with debug symbols.  A
potentially-convenient way to achieve this would be to add CONFIG=>'Debug'
to build_env of your buildfarm configuration, then run the buildfarm client
with --keepall.  By the end of the run, $buildroot/$branch/inst will contain
an installation.  (I haven't actually tested that.)  You can copy that
installation to any convenient place (e.g. your home directory) and use that
copy for the remaining steps.

> Currently this machine has nothing on it other than the necessary
> dependencies to build postgres for the community.  If you could please give
> us some more information on what you would like us to install to gather
> this information to help debug the issue we are seeing we would really
> appreciate it.

At the moment, I think the only other thing you'll need is to download
https://live.sysinternals.com/Procmon.exe.

> Also, if there is any additional information on what we
> have installed on the server or how we are configured please just let us
> know and we will get you that as soon as possible.

Thanks.  I don't have specific questions about that at this time, but feel
free to share any lists or config dumps that seem relevant.
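For anyone repeating this, Noah's buildfarm suggestion corresponds to a change along these lines in the client's build-farm.conf. This excerpt is illustrative and untested (per his own caveat); the surrounding keys vary by animal:

```perl
# Illustrative excerpt from the buildfarm client's build-farm.conf:
build_env => {
    # ... any existing entries stay ...
    CONFIG => 'Debug',    # MSVC debug configuration, retains symbols
},
```

followed by a client run with --keepall so that the installation left under $buildroot/$branch/inst survives the end of the run.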
On Sun, Dec 02, 2018 at 09:35:06PM -0800, Noah Misch wrote:
> On Tue, Sep 25, 2018 at 08:05:12AM -0700, Noah Misch wrote:
> > On Mon, Sep 24, 2018 at 01:53:05PM -0400, Tom Lane wrote:
> > > Overall, I agree that neither of these approaches are exactly attractive.
> > > We're paying a heck of a lot of performance or complexity to solve a
> > > problem that shouldn't even be there, and that we don't understand well.
> > > In particular, the theory that some privileged code is injecting a thread
> > > into every new process doesn't square with my results at
> > > https://www.postgresql.org/message-id/15345.1525145612%40sss.pgh.pa.us
> > >
> > > I think our best course of action at this point is to do nothing until
> > > we have a clearer understanding of what's actually happening on dory.
> > > Perhaps such understanding will yield an idea for a less painful fix.
>
> Could one of you having a dory login use
> https://live.sysinternals.com/Procmon.exe to capture process events during
> backend startup?  The ideal would be one capture where startup failed reattach
> and another where it succeeded, but having the successful run alone would be a
> good start.

Joseph Ayers provided, off-list, the capture from a successful startup.  It
wasn't materially different from the one my system generates, so I abandoned
that line of inquiry.  Having explored other aspects of the problem, I expect
the attached fix will work.

I can reproduce the 4 MiB allocations described in
https://postgr.es/m/29823.1525132900@sss.pgh.pa.us; a few times per
"vcregress check", they emerge in the middle of PGSharedMemoryReAttach().
On my system, there's 5.7 MiB of free address space just before
UsedShmemSegAddr, so the 4 MiB allocation fits in there, and
PGSharedMemoryReAttach() does not fail.  Still, it's easy to imagine that
boring variations between environments could surface dory's problem by
reducing that free 5.7 MiB to, say, 3.9 MiB.

The 4 MiB allocations are stacks for new threads of the default thread
pool[1].  (I confirmed that by observing their size change when I changed
StackReserveSize in MSBuildProject.pm and by checking all stack pointers
with "thread apply all info frame" in gdb.)  The API calls in
PGSharedMemoryReAttach() don't cause the thread creation; it's a timing
coincidence.  Commit 2307868 would have worked around the problem, but
pg_usleep() is essentially a no-op on Windows before
pgwin32_signal_initialize() runs.  (I'll push Assert(pgwin32_signal_event)
to some functions.)

While one fix is to block until all expected threads have started, that
could be notably slow, and I don't know how to implement it cleanly.  I
think a better fix is to arrange for the system to prefer a different
address space region for these thread stacks; for details, see the first
comment the patch adds to win32_shmem.c.  This works here.

> backend startup sees six thread creations:
>
> 1. main thread
> 2. thread created before postgres.exe has control
> 3. thread created before postgres.exe has control
> 4. thread created before postgres.exe has control
> 5. in pgwin32_signal_initialize()
> 6. in src\backend\port\win32\timer.c:setitimer()
>
> Threads 2-4 exit exactly 30s after creation.  If we fail to reattach to
> shared memory, we'll exit before reaching code to start 5 or 6.

Threads 2-4 proved to be worker threads of the default thread pool.

[1] https://docs.microsoft.com/en-us/windows/desktop/ProcThread/thread-pools
Attachments
Noah Misch <noah@leadboat.com> writes:
> I can reproduce the 4 MiB allocations described
> in https://postgr.es/m/29823.1525132900@sss.pgh.pa.us; a few times per
> "vcregress check", they emerge in the middle of PGSharedMemoryReAttach().
> The 4 MiB allocations are stacks for new threads of the default thread
> pool[1].

Hah!  It is great to finally have an understanding of what is happening
here.

I worry that your proposed fix is unstable, in particular this assumption
seems shaky:

> + * ... The idea is that, if the allocator handed out
> + * REGION1 pages before REGION2 pages at one occasion, it will do so whenever
> + * both regions are free.

I wonder whether it's possible to address this by configuring the "default
thread pool" to have only one thread?  It seems like the extra threads are
just adding backend startup overhead to no benefit, since we won't use 'em.

			regards, tom lane
On Tue, Apr 02, 2019 at 10:09:00AM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > I can reproduce the 4 MiB allocations described
> > in https://postgr.es/m/29823.1525132900@sss.pgh.pa.us; a few times per
> > "vcregress check", they emerge in the middle of PGSharedMemoryReAttach().
> > The 4 MiB allocations are stacks for new threads of the default thread
> > pool[1].
>
> Hah!  It is great to finally have an understanding of what is happening
> here.
>
> I worry that your proposed fix is unstable, in particular this assumption
> seems shaky:
>
> > + * ... The idea is that, if the allocator handed out
> > + * REGION1 pages before REGION2 pages at one occasion, it will do so whenever
> > + * both regions are free.

True.  If Windows changes to prefer allocating from the most recently freed
region, shmem-protective-region-v1.patch would cease to help.  It's not
impossible.

> I wonder whether it's possible to address this by configuring the "default
> thread pool" to have only one thread?  It seems like the extra threads are
> just adding backend startup overhead to no benefit, since we won't use 'em.

I didn't find a way to configure the pool's size.

Another option is to reattach shared memory earlier, before the default
thread pool starts.  A Windows application using only the unavoidable DLLs
(kernel32.dll, ntdll.dll, kernelbase.dll) doesn't get a default thread pool;
the pool starts when one loads ws2_32.dll, ucrtbased.dll, etc.  Hence, the
DllMain() function of a DLL that loads early could avoid the problem.
(Cygwin fork() uses that route to remap shared memory, though it also
retries failed forks.)  If we proceed that way, we'd add a tiny pg_shmem.dll
that appears early in the load order, just after the unavoidable DLLs.  It
would extract applicable parameters from the command line and reattach
shared memory.  When it fails, it would set a flag so the usual code can
raise an error.

Does that sound more or less attractive than shmem-protective-region-v1.patch?
Noah Misch <noah@leadboat.com> writes:
> On Tue, Apr 02, 2019 at 10:09:00AM -0400, Tom Lane wrote:
>> I worry that your proposed fix is unstable, in particular this assumption
>> seems shaky:
>>> + * ... The idea is that, if the allocator handed out
>>> + * REGION1 pages before REGION2 pages at one occasion, it will do so whenever
>>> + * both regions are free.

> True.  If Windows changes to prefer allocating from the most recently freed
> region, shmem-protective-region-v1.patch would cease to help.  It's not
> impossible.

>> I wonder whether it's possible to address this by configuring the "default
>> thread pool" to have only one thread?  It seems like the extra threads are
>> just adding backend startup overhead to no benefit, since we won't use 'em.

> I didn't find a way to configure the pool's size.

Seems odd ...

> Another option is to reattach shared memory earlier, before the default thread
> pool starts.  A Windows application using only the unavoidable DLLs
> (kernel32.dll, ntdll.dll, kernelbase.dll) doesn't get a default thread pool;
> the pool starts when one loads ws2_32.dll, ucrtbased.dll, etc.  Hence, the
> DllMain() function of a DLL that loads early could avoid the problem.  (Cygwin
> fork() uses that route to remap shared memory, though it also retries failed
> forks.)  If we proceed that way, we'd add a tiny pg_shmem.dll that appears
> early in the load order, just after the unavoidable DLLs.  It would extract
> applicable parameters from the command line and reattach shared memory.  When
> it fails, it would set a flag so the usual code can raise an error.  Does that
> sound more or less attractive than shmem-protective-region-v1.patch?

Well, it definitely sounds like more work.  Let's not do such work
until we have to.  Your patch has the advantages of being (a) small
and (b) done, and there's much to be said for that.

			regards, tom lane
On Sun, Apr 07, 2019 at 12:43:23AM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On Tue, Apr 02, 2019 at 10:09:00AM -0400, Tom Lane wrote:
> >> I worry that your proposed fix is unstable, in particular this assumption
> >> seems shaky:
> >>> + * ... The idea is that, if the allocator handed out
> >>> + * REGION1 pages before REGION2 pages at one occasion, it will do so whenever
> >>> + * both regions are free.
>
> > True.  If Windows changes to prefer allocating from the most recently freed
> > region, shmem-protective-region-v1.patch would cease to help.  It's not
> > impossible.
>
> >> I wonder whether it's possible to address this by configuring the "default
> >> thread pool" to have only one thread?  It seems like the extra threads are
> >> just adding backend startup overhead to no benefit, since we won't use 'em.
>
> > I didn't find a way to configure the pool's size.
>
> Seems odd ...

I think you're expected to create your own thread pool if you're picky about
the size.  The default thread pool is for non-picky callers and for Windows
libraries to use internally.  The fact that the pool starts before main()
also narrows the use case for configuring it.  If there were an API for
"block until all threads of the pool are alive", that would meet our needs
nicely.  Needless to say, I didn't find that, either.

> > Another option is to reattach shared memory earlier, before the default
> > thread pool starts.  A Windows application using only the unavoidable DLLs
> > (kernel32.dll, ntdll.dll, kernelbase.dll) doesn't get a default thread
> > pool; the pool starts when one loads ws2_32.dll, ucrtbased.dll, etc.
> > Hence, the DllMain() function of a DLL that loads early could avoid the
> > problem.  (Cygwin fork() uses that route to remap shared memory, though it
> > also retries failed forks.)  If we proceed that way, we'd add a tiny
> > pg_shmem.dll that appears early in the load order, just after the
> > unavoidable DLLs.  It would extract applicable parameters from the command
> > line and reattach shared memory.  When it fails, it would set a flag so
> > the usual code can raise an error.  Does that sound more or less
> > attractive than shmem-protective-region-v1.patch?
>
> Well, it definitely sounds like more work.  Let's not do such work
> until we have to.  Your patch has the advantages of being (a) small
> and (b) done, and there's much to be said for that.

Works for me.  Pushed.