Discussion: "could not reattach to shared memory" captured in buildfarm
vaquita has an interesting report today:
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=vaquita&dt=2009-05-01%2020:00:06

Partway through the contrib tests, for absolutely no visible reason whatsoever, connections start to fail with

    FATAL: could not reattach to shared memory (key=364, addr=02920000): 487

We've certainly heard more than a couple of field reports of this from Windows users, but I don't think we've ever seen it in the buildfarm before. (I don't see any similar instances in vaquita's history, anyway.) I assume vaquita's configuration hasn't changed recently (Dave?) so this seems to put the lie to the theory we've taken refuge in that it's caused by bad antivirus software. I don't see that it gets us any closer to a solution though.

			regards, tom lane
On Sat, May 2, 2009 at 4:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I assume vaquita's configuration hasn't changed recently (Dave?)
> so this seems to put the lie to the theory we've taken refuge in
> that it's caused by bad antivirus software. I don't see that it
> gets us any closer to a solution though.

Well, there's a bit of a story there. Vaquita and Baiji are both the same Vista machine running on VMware Server. About a month back, for what seemed like no reason, the guest VM started running at much higher speed than it should - animated cursors started running at double speed, double-clicking became impossible, and the clock started gaining significant amounts of time - to the extent that buildfarm runs were rejected by the server because the finish time was in the future.

I believe I finally fixed this on Friday - from what I can tell, it looks like the Java self-update applet was causing the clock rate on the host to be raised to 1000/1024Hz (this can be done using the multimedia API). This in turn was apparently upsetting VMware. Anyway, long story short, I removed the JVM from the host and everything appears to have returned to normal.

Nothing has changed in the config of the VM itself, though a couple of minor tweaks were made to the VMware configuration - but they were clock-related.

-- 
Dave Page
EnterpriseDB UK: http://www.enterprisedb.com
Tom Lane wrote:
> vaquita has an interesting report today:
> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=vaquita&dt=2009-05-01%2020:00:06
>
> Partway through the contrib tests, for absolutely no visible reason
> whatsoever, connections start to fail with
> FATAL: could not reattach to shared memory (key=364, addr=02920000): 487

Note that 487 is "invalid address", and should not have anything to do with the issues Andrew mentioned (which were about the already-exists error).

Somebody else mentioned, and IIRC I talked to Dave about this before, that this could be because the address is no longer available. The reason for this could be some kind of race condition in the backends starting - the address is available when the postmaster starts and thus it's used, but when a regular backend starts, the memory is used for something else.

One proposed fix is to allocate a fairly large block of memory in the postmaster just before we get the shared memory, and then free it right away. The effect should be to push down the shared memory segment further in the address space.

Comments?

//Magnus
Magnus Hagander <magnus@hagander.net> writes:
> Somebody else mentioned, and IIRC I talked to Dave about this before,
> that this could be because the address is no longer available. The
> reason for this could be some kind of race condition in the backends
> starting - the address is available when the postmaster starts and thus
> it's used, but when a regular backend starts, the memory is used for
> something else.

How is it no longer available, when the new backend is a brand new process? The "race condition" bit seems even sillier --- if there are multiple backends starting, they're each an independent process.

			regards, tom lane
Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> Somebody else mentioned, and IIRC I talked to Dave about this before,
>> that this could be because the address is no longer available. The
>> reason for this could be some kind of race condition in the backends
>> starting - the address is available when the postmaster starts and thus
>> it's used, but when a regular backend starts, the memory is used for
>> something else.
>
> How is it no longer available, when the new backend is a brand new
> process? The "race condition" bit seems even sillier --- if there
> are multiple backends starting, they're each an independent process.

Because some other DLL that was loaded on process startup allocated memory differently - in a different order, or at a different size, or something like that.

I didn't mean a race condition between backends. I meant against a potential other thread started by a loaded DLL for initialization. (Again, things like antivirus are known to do this, and we do see these issues more often if AV is present, for example.)

//Magnus
Magnus Hagander wrote:

> I didn't mean race condition between backends. I meant against a
> potential other thread started by a loaded DLL for initialization.
> (Again, things like antivirus are known to do this, and we do see these
> issues more often if AV is present for example)

I don't understand this. How can memory allocated by a completely separate process affect what happens to a backend? I mean, if an antivirus is running, surely it does not run in the backend's process? Or does it?

-- 
Alvaro Herrera                        http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera wrote:
> Magnus Hagander wrote:
>
>> I didn't mean race condition between backends. I meant against a
>> potential other thread started by a loaded DLL for initialization.
>> (Again, things like antivirus are known to do this, and we do see these
>> issues more often if AV is present for example)
>
> I don't understand this. How can memory allocated by a completely separate
> process affect what happens to a backend? I mean, if an antivirus is running,
> surely it does not run on the backend's process? Or does it?

Anti[something] software regularly injects code into other processes, yes. Either by creating a thread in the process using CreateRemoteThread(), or by using techniques similar to LD_PRELOAD.

//Magnus
Magnus Hagander <magnus@hagander.net> writes:
> One proposed fix is to allocate a fairly large block of memory in the
> postmaster just before we get the shared memory, and then free it right
> away. The effect should be to push down the shared memory segment
> further in the address space.

I have no enthusiasm for doing something like this when we have so little knowledge of what's actually happening. We have *no* idea whether the above could help, or what size of allocation to request. It's not very hard to imagine that the wrong size choice could make things worse rather than better.

It seems to me that what we ought to do now is make a serious effort to gather more data. I came across a suggestion that one could use VirtualQuery() to generate a map of the process address space under Windows. I suggest that we add some code that is executed if the reattach attempt fails and dumps the process address space details to the postmaster log. Dumping the postmaster's address space at the time it successfully creates the shmem segment might be useful for comparison, too.

(A quick look at the VirtualQuery spec indicates that you can't tell very much beyond free/allocated status, though. Maybe there's some other call that would tell more? It'd be really good if we could get the names of DLLs occupying memory ranges, for example.)

			regards, tom lane