Обсуждение: BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram

Поиск
Список
Период
Сортировка
The following bug has been logged online:

Bug reference:      4958
Logged by:          Luke Koops
Email address:      luke.koops@entrust.com
PostgreSQL version: 8.3.7
Operating system:   Windows 2003 EE SP2
Description:        Stats collector hung on WaitForMultipleObjectsEx while
attempting to recv a datagram
Details:

While running a load test with 80 processes connected to postgres, of which
62 were generating load, the stats collector got blocked on
WaitForMultipleObjectsEx which was called with INFINITE timeout.  The stack
trace for the hung thread follows:

-- postgres.exe!mainCRTStartup --
ntoskrnl.exe!KiSwapContext+0x26
ntoskrnl.exe!KiSwapThread+0x2e5
ntoskrnl.exe!KeWaitForSingleObject+0x346
ntoskrnl.exe!KiSuspendThread+0x18
ntoskrnl.exe!KiDeliverApc+0x117
ntoskrnl.exe!KiSwapThread+0x300
ntoskrnl.exe!KeWaitForMultipleObjects+0x3d7
ntoskrnl.exe!ObpWaitForMultipleObjects+0x202
ntoskrnl.exe!NtWaitForMultipleObjects+0xe9
ntoskrnl.exe!KiFastCallEntry+0xfc
ntdll.dll!KiFastSystemCallRet
ntdll.dll!NtWaitForMultipleObjects+0xc
kernel32.dll!WaitForMultipleObjectsEx+0x11a
postgres.exe!pgwin32_waitforsinglesocket+0x1ed
postgres.exe!pgwin32_recv+0x90
postgres.exe!PgstatCollectorMain+0x17f
postgres.exe!SubPostmasterMain+0x33a
postgres.exe!main+0x168
postgres.exe!__tmainCRTStartup+0x10f
kernel32.dll!BaseProcessStart+0x23

Leading up to the hang, the DB was not under high load.  The CPU was running
at 80%, but less than a fifth of that was due to PostgreSQL.  The underlying
disk subsystems were routinely 90% idle.  The txlogs are on a 4 disk RAID-10
array and the data is on a 5 disk RAID-5 array (15K RPM).  This is a
performance test environment for a product that will include PostgreSQL as
an embedded database.

I have the debug symbols for this build, and a process dump of the stats
collector.

I have have not been able to reproduce this problem.  It seems to be
relatively rare, so I'm lucky to have seen it before any of our customers.

Even if the problem cannot be reproduced reliability, I think there is
enough information that would help the community to make corrections in the
code.

After filtering out everything that is expected output, there was nothing
informative in the postgres logs (from 24 hours prior to the hang until 8
hours after).

We are accessing the DB over ODBC.  All of our processes are running on the
same server as PostgreSQL.

The DB is 52GB in size, so I cannot targzupload it.  I estimate that 4-8 GB
of the size is due to the lack of autovacuum after stats collection stopped.
 I doubt that this issue is due to the schema or contents of the DB.

I have verified, using TDIMon, that the back ends are sending the
statistics.  That was a bit redundant since the collector is clearly hung.

Here are the settings in my system that differ from the defaults:
    logging_collector = on
    log_line_prefix = '%m,'
    autovacuum_naptime = 10s
    autovacuum_analyze_scale_factor = 0.05
    autovacuum_vacuum_scale_factor = 0.1
    autovacuum_vacuum_cost_limit = 1000
    log_autovacuum_min_duration = 0

I have customized the vacuum rules for a few "hot" tables that are limited
to 1 or a few rows, but are frequently updated (whenever the system does
anything).  Here is a dump of pg_autovacuum:

 vacrelid | enabled | vac_base_thresh | vac_scale_factor | anl_base_thresh |
anl_scale_factor | vac_
cost_delay | vac_cost_limit | freeze_min_age | freeze_max_age
----------+---------+-----------------+------------------+-----------------+
------------------+-----
-----------+----------------+----------------+----------------
    16507 | t       |               1 |               -1 |               1 |
              -1 |
         0 |         100000 |             -1 |             -1
    16511 | t       |               1 |               -1 |               1 |
              -1 |
         0 |         100000 |             -1 |             -1
    16541 | t       |               1 |               -1 |               1 |
              -1 |
         0 |         100000 |             -1 |             -1
(3 rows)

I created a new cluster on the same system and the stats collector worked
just fine.
Luke Koops wrote:

> Description:        Stats collector hung on WaitForMultipleObjectsEx while
> attempting to recv a datagram
> Details:
>
> While running a load test with 80 processes connected to postgres, of which
> 62 were generating load, the stats collector got blocked on
> WaitForMultipleObjectsEx which was called with INFINITE timeout.  The stack
> trace for the hung thread follows:


> ntdll.dll!NtWaitForMultipleObjects+0xc
> kernel32.dll!WaitForMultipleObjectsEx+0x11a
> postgres.exe!pgwin32_waitforsinglesocket+0x1ed
> postgres.exe!pgwin32_recv+0x90
> postgres.exe!PgstatCollectorMain+0x17f
> postgres.exe!SubPostmasterMain+0x33a
> postgres.exe!main+0x168
> postgres.exe!__tmainCRTStartup+0x10f
> kernel32.dll!BaseProcessStart+0x23

I have seen this problem too.  The process seems stuck for no good
reason.  I wondered at the time if it could be a kernel issue.  I
remember trying to send some data to the collector to verify whether
it'd wake up, but no luck.  (I mean I couldn't find a way to do it on
Windows).

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram

От
Magnus Hagander
Дата:
On Sat, Aug 1, 2009 at 01:49, Alvaro Herrera<alvherre@commandprompt.com> wr=
ote:
> Luke Koops wrote:
>
>> Description: =A0 =A0 =A0 =A0Stats collector hung on WaitForMultipleObjec=
tsEx while
>> attempting to recv a datagram
>> Details:
>>
>> While running a load test with 80 processes connected to postgres, of wh=
ich
>> 62 were generating load, the stats collector got blocked on
>> WaitForMultipleObjectsEx which was called with INFINITE timeout. =A0The =
stack
>> trace for the hung thread follows:
>
>
>> ntdll.dll!NtWaitForMultipleObjects+0xc
>> kernel32.dll!WaitForMultipleObjectsEx+0x11a
>> postgres.exe!pgwin32_waitforsinglesocket+0x1ed
>> postgres.exe!pgwin32_recv+0x90
>> postgres.exe!PgstatCollectorMain+0x17f
>> postgres.exe!SubPostmasterMain+0x33a
>> postgres.exe!main+0x168
>> postgres.exe!__tmainCRTStartup+0x10f
>> kernel32.dll!BaseProcessStart+0x23
>
> I have seen this problem too. =A0The process seems stuck for no good
> reason. =A0I wondered at the time if it could be a kernel issue. =A0I
> remember trying to send some data to the collector to verify whether
> it'd wake up, but no luck. =A0(I mean I couldn't find a way to do it on
> Windows).

I have seen this as well, but only in cases where there has been
broken firewall software or such things involved. I have seen a couple
of reports from the field though.

Anyway, this really is a should-never-happen thing. As soon as a new
packet is sent in, WaitForMultipleObjectsEx() should return right
away. And given that backends regularly send packets over, it
shouldn't be an issue even if we miss one...

To generate packets, you should be able to use for example "nc" that
is available as a win32 binary as well.


--=20
 Magnus Hagander
 Self: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

Re: BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram

От
Nikhil Sontakke
Дата:
Hi,

>>
>>
>>> ntdll.dll!NtWaitForMultipleObjects+0xc
>>> kernel32.dll!WaitForMultipleObjectsEx+0x11a
>>> postgres.exe!pgwin32_waitforsinglesocket+0x1ed
>>> postgres.exe!pgwin32_recv+0x90
>>> postgres.exe!PgstatCollectorMain+0x17f
>>> postgres.exe!SubPostmasterMain+0x33a
>>> postgres.exe!main+0x168
>>> postgres.exe!__tmainCRTStartup+0x10f
>>> kernel32.dll!BaseProcessStart+0x23
>>
>> I have seen this problem too. =A0The process seems stuck for no good
>> reason. =A0I wondered at the time if it could be a kernel issue. =A0I
>> remember trying to send some data to the collector to verify whether
>> it'd wake up, but no luck. =A0(I mean I couldn't find a way to do it on
>> Windows).
>
> I have seen this as well, but only in cases where there has been
> broken firewall software or such things involved. I have seen a couple
> of reports from the field though.
>
> Anyway, this really is a should-never-happen thing. As soon as a new
> packet is sent in, WaitForMultipleObjectsEx() should return right
> away. And given that backends regularly send packets over, it
> shouldn't be an issue even if we miss one...
>

And this fact should lend credence to Alvaro's (as well as mine)
suspicions that it seems to be a Windows kernel issue.

As a consequence, Magnus I was wondering if having a loop similar to
the WRITE handling of waiting for a fixed timeout in a loop (rather
than an INFINITE call to WaitForMultipleObjectsEx) inside the
pgwin32_waitforsinglesocket() function will help for the READ case
too? I believe Teogor Sigaev had raised a similar concern a while back
about it:

http://www.nabble.com/-GENERAL--Stats-collector-frozen--td8569977i20.html

Regards,
Nikhils
--=20
http://www.enterprisedb.com

Re: BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram

От
Magnus Hagander
Дата:
On Mon, Aug 3, 2009 at 15:47, Nikhil
Sontakke<nikhil.sontakke@enterprisedb.com> wrote:
> Hi,
>
>>>
>>>
>>>> ntdll.dll!NtWaitForMultipleObjects+0xc
>>>> kernel32.dll!WaitForMultipleObjectsEx+0x11a
>>>> postgres.exe!pgwin32_waitforsinglesocket+0x1ed
>>>> postgres.exe!pgwin32_recv+0x90
>>>> postgres.exe!PgstatCollectorMain+0x17f
>>>> postgres.exe!SubPostmasterMain+0x33a
>>>> postgres.exe!main+0x168
>>>> postgres.exe!__tmainCRTStartup+0x10f
>>>> kernel32.dll!BaseProcessStart+0x23
>>>
>>> I have seen this problem too. =A0The process seems stuck for no good
>>> reason. =A0I wondered at the time if it could be a kernel issue. =A0I
>>> remember trying to send some data to the collector to verify whether
>>> it'd wake up, but no luck. =A0(I mean I couldn't find a way to do it on
>>> Windows).
>>
>> I have seen this as well, but only in cases where there has been
>> broken firewall software or such things involved. I have seen a couple
>> of reports from the field though.
>>
>> Anyway, this really is a should-never-happen thing. As soon as a new
>> packet is sent in, WaitForMultipleObjectsEx() should return right
>> away. And given that backends regularly send packets over, it
>> shouldn't be an issue even if we miss one...
>>
>
> And this fact should lend credence to Alvaro's (as well as mine)
> suspicions that it seems to be a Windows kernel issue.
>
> As a consequence, Magnus I was wondering if having a loop similar to
> the WRITE handling of waiting for a fixed timeout in a loop (rather
> than an INFINITE call to WaitForMultipleObjectsEx) inside the
> pgwin32_waitforsinglesocket() function will help for the READ case
> too? I believe Teogor Sigaev had raised a similar concern a while back
> about it:
>
> http://www.nabble.com/-GENERAL--Stats-collector-frozen--td8569977i20.html

Maybe. I'm unsure if it's enough to just try another
WaitForSingleObjectEx() on it, or if we need to actually issue a
WSARecv() on it as well. Maybe it would be enough to just change the
INIFINTE on line 318 of socket.c to a fixed value. That will loop down
to WSARecv() which should exit with WSAEWOULDBLOCK which will cause us
to do a short sleep and come back. But we'd have to change the limit
of 5 somehow then, since in theory we should wait forever. Maybe that
outer loop should just be a for(;;), what do you think?

=46rom what I understand, none of you have an environment where you can
reliably reproduce this? That means it's going to be a PITA to try to
figure out if we're actually fixing anything :S


--=20
 Magnus Hagander
 Self: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

Re: BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram

От
Nikhil Sontakke
Дата:
Hi,

>>>>
>>>>> ntdll.dll!NtWaitForMultipleObjects+0xc
>>>>> kernel32.dll!WaitForMultipleObjectsEx+0x11a
>>>>> postgres.exe!pgwin32_waitforsinglesocket+0x1ed
>>>>> postgres.exe!pgwin32_recv+0x90
>>>>> postgres.exe!PgstatCollectorMain+0x17f
>>>>> postgres.exe!SubPostmasterMain+0x33a
>>>>> postgres.exe!main+0x168
>>>>> postgres.exe!__tmainCRTStartup+0x10f
>>>>> kernel32.dll!BaseProcessStart+0x23
>>>>
>>>> I have seen this problem too. =A0The process seems stuck for no good
>>>> reason. =A0I wondered at the time if it could be a kernel issue. =A0I
>>>> remember trying to send some data to the collector to verify whether
>>>> it'd wake up, but no luck. =A0(I mean I couldn't find a way to do it on
>>>> Windows).
>>>
>>> I have seen this as well, but only in cases where there has been
>>> broken firewall software or such things involved. I have seen a couple
>>> of reports from the field though.
>>>
>>> Anyway, this really is a should-never-happen thing. As soon as a new
>>> packet is sent in, WaitForMultipleObjectsEx() should return right
>>> away. And given that backends regularly send packets over, it
>>> shouldn't be an issue even if we miss one...
>>>
>>
>> And this fact should lend credence to Alvaro's (as well as mine)
>> suspicions that it seems to be a Windows kernel issue.
>>
>> As a consequence, Magnus I was wondering if having a loop similar to
>> the WRITE handling of waiting for a fixed timeout in a loop (rather
>> than an INFINITE call to WaitForMultipleObjectsEx) inside the
>> pgwin32_waitforsinglesocket() function will help for the READ case
>> too? I believe Teogor Sigaev had raised a similar concern a while back
>> about it:
>>
>> http://www.nabble.com/-GENERAL--Stats-collector-frozen--td8569977i20.html
>
> Maybe. I'm unsure if it's enough to just try another
> WaitForSingleObjectEx() on it, or if we need to actually issue a
> WSARecv() on it as well. Maybe it would be enough to just change the
> INIFINTE on line 318 of socket.c to a fixed value. That will loop down
> to WSARecv() which should exit with WSAEWOULDBLOCK which will cause us
> to do a short sleep and come back. But we'd have to change the limit
> of 5 somehow then, since in theory we should wait forever. Maybe that
> outer loop should just be a for(;;), what do you think?
>

Yes, line 318 seems to be a much better location to me. If Windows and
it's socket logic behaves properly most of the times :), most of the
calls should come out within the first few tries, so changing 5 to an
infinite loop shouldn't hurt those normal use cases in theory.

OTOH, I was wondering what if we kill the stats collector and on a
restart the socket communication resumes properly. Would that
conclusively mean that it is a flaw in our code?

Regards,
Nikhils

> From what I understand, none of you have an environment where you can
> reliably reproduce this? That means it's going to be a PITA to try to
> figure out if we're actually fixing anything :S
>
>
> --
> =A0Magnus Hagander
> =A0Self: http://www.hagander.net/
> =A0Work: http://www.redpill-linpro.com/
>



--=20
http://www.enterprisedb.com

Re: BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram

От
Magnus Hagander
Дата:
On Mon, Aug 3, 2009 at 16:12, Nikhil
Sontakke<nikhil.sontakke@enterprisedb.com> wrote:
> Hi,
>
>>>>>
>>>>>> ntdll.dll!NtWaitForMultipleObjects+0xc
>>>>>> kernel32.dll!WaitForMultipleObjectsEx+0x11a
>>>>>> postgres.exe!pgwin32_waitforsinglesocket+0x1ed
>>>>>> postgres.exe!pgwin32_recv+0x90
>>>>>> postgres.exe!PgstatCollectorMain+0x17f
>>>>>> postgres.exe!SubPostmasterMain+0x33a
>>>>>> postgres.exe!main+0x168
>>>>>> postgres.exe!__tmainCRTStartup+0x10f
>>>>>> kernel32.dll!BaseProcessStart+0x23
>>>>>
>>>>> I have seen this problem too. =A0The process seems stuck for no good
>>>>> reason. =A0I wondered at the time if it could be a kernel issue. =A0I
>>>>> remember trying to send some data to the collector to verify whether
>>>>> it'd wake up, but no luck. =A0(I mean I couldn't find a way to do it =
on
>>>>> Windows).
>>>>
>>>> I have seen this as well, but only in cases where there has been
>>>> broken firewall software or such things involved. I have seen a couple
>>>> of reports from the field though.
>>>>
>>>> Anyway, this really is a should-never-happen thing. As soon as a new
>>>> packet is sent in, WaitForMultipleObjectsEx() should return right
>>>> away. And given that backends regularly send packets over, it
>>>> shouldn't be an issue even if we miss one...
>>>>
>>>
>>> And this fact should lend credence to Alvaro's (as well as mine)
>>> suspicions that it seems to be a Windows kernel issue.
>>>
>>> As a consequence, Magnus I was wondering if having a loop similar to
>>> the WRITE handling of waiting for a fixed timeout in a loop (rather
>>> than an INFINITE call to WaitForMultipleObjectsEx) inside the
>>> pgwin32_waitforsinglesocket() function will help for the READ case
>>> too? I believe Teogor Sigaev had raised a similar concern a while back
>>> about it:
>>>
>>> http://www.nabble.com/-GENERAL--Stats-collector-frozen--td8569977i20.ht=
ml
>>
>> Maybe. I'm unsure if it's enough to just try another
>> WaitForSingleObjectEx() on it, or if we need to actually issue a
>> WSARecv() on it as well. Maybe it would be enough to just change the
>> INIFINTE on line 318 of socket.c to a fixed value. That will loop down
>> to WSARecv() which should exit with WSAEWOULDBLOCK which will cause us
>> to do a short sleep and come back. But we'd have to change the limit
>> of 5 somehow then, since in theory we should wait forever. Maybe that
>> outer loop should just be a for(;;), what do you think?
>>
>
> Yes, line 318 seems to be a much better location to me. If Windows and
> it's socket logic behaves properly most of the times :), most of the
> calls should come out within the first few tries, so changing 5 to an
> infinite loop shouldn't hurt those normal use cases in theory.
>
> OTOH, I was wondering what if we kill the stats collector and on a
> restart the socket communication resumes properly. Would that
> conclusively mean that it is a flaw in our code?

No, if we kill the stats collector that will destroy all sockets, and
when the new one starts all the sockets it operates on are fresh and
new. So it doesn't show that the flaw is in our code - but it also
doesn't show that it's in the kernel or runtime libraries.


--=20
 Magnus Hagander
 Self: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

Re: BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram

От
Nikhil Sontakke
Дата:
Hi,

>>>>
>>>> And this fact should lend credence to Alvaro's (as well as mine)
>>>> suspicions that it seems to be a Windows kernel issue.
>>>>
>>>> As a consequence, Magnus I was wondering if having a loop similar to
>>>> the WRITE handling of waiting for a fixed timeout in a loop (rather
>>>> than an INFINITE call to WaitForMultipleObjectsEx) inside the
>>>> pgwin32_waitforsinglesocket() function will help for the READ case
>>>> too? I believe Teogor Sigaev had raised a similar concern a while back
>>>> about it:
>>>>
>>>> http://www.nabble.com/-GENERAL--Stats-collector-frozen--td8569977i20.html
>>>
>>> Maybe. I'm unsure if it's enough to just try another
>>> WaitForSingleObjectEx() on it, or if we need to actually issue a
>>> WSARecv() on it as well. Maybe it would be enough to just change the
>>> INIFINTE on line 318 of socket.c to a fixed value. That will loop down
>>> to WSARecv() which should exit with WSAEWOULDBLOCK which will cause us
>>> to do a short sleep and come back. But we'd have to change the limit
>>> of 5 somehow then, since in theory we should wait forever. Maybe that
>>> outer loop should just be a for(;;), what do you think?
>>>
>>
>> Yes, line 318 seems to be a much better location to me. If Windows and
>> it's socket logic behaves properly most of the times :), most of the
>> calls should come out within the first few tries, so changing 5 to an
>> infinite loop shouldn't hurt those normal use cases in theory.
>>
>> OTOH, I was wondering what if we kill the stats collector and on a
>> restart the socket communication resumes properly. Would that
>> conclusively mean that it is a flaw in our code?
>
> No, if we kill the stats collector that will destroy all sockets, and
> when the new one starts all the sockets it operates on are fresh and
> new. So it doesn't show that the flaw is in our code - but it also
> doesn't show that it's in the kernel or runtime libraries.
>

AFAICS in the code, the inherited pgStatSock socket FD remains the
same across the restart of the stats collector process...

Regards,
Nikhils
--
http://www.enterprisedb.com

Re: BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram

От
Magnus Hagander
Дата:
On Mon, Aug 3, 2009 at 16:20, Nikhil
Sontakke<nikhil.sontakke@enterprisedb.com> wrote:
> Hi,
>
>>>>>
>>>>> And this fact should lend credence to Alvaro's (as well as mine)
>>>>> suspicions that it seems to be a Windows kernel issue.
>>>>>
>>>>> As a consequence, Magnus I was wondering if having a loop similar to
>>>>> the WRITE handling of waiting for a fixed timeout in a loop (rather
>>>>> than an INFINITE call to WaitForMultipleObjectsEx) inside the
>>>>> pgwin32_waitforsinglesocket() function will help for the READ case
>>>>> too? I believe Teogor Sigaev had raised a similar concern a while back
>>>>> about it:
>>>>>
>>>>> http://www.nabble.com/-GENERAL--Stats-collector-frozen--td8569977i20.html
>>>>
>>>> Maybe. I'm unsure if it's enough to just try another
>>>> WaitForSingleObjectEx() on it, or if we need to actually issue a
>>>> WSARecv() on it as well. Maybe it would be enough to just change the
>>>> INIFINTE on line 318 of socket.c to a fixed value. That will loop down
>>>> to WSARecv() which should exit with WSAEWOULDBLOCK which will cause us
>>>> to do a short sleep and come back. But we'd have to change the limit
>>>> of 5 somehow then, since in theory we should wait forever. Maybe that
>>>> outer loop should just be a for(;;), what do you think?
>>>>
>>>
>>> Yes, line 318 seems to be a much better location to me. If Windows and
>>> it's socket logic behaves properly most of the times :), most of the
>>> calls should come out within the first few tries, so changing 5 to an
>>> infinite loop shouldn't hurt those normal use cases in theory.
>>>
>>> OTOH, I was wondering what if we kill the stats collector and on a
>>> restart the socket communication resumes properly. Would that
>>> conclusively mean that it is a flaw in our code?
>>
>> No, if we kill the stats collector that will destroy all sockets, and
>> when the new one starts all the sockets it operates on are fresh and
>> new. So it doesn't show that the flaw is in our code - but it also
>> doesn't show that it's in the kernel or runtime libraries.
>>
>
> AFAICS in the code, the inherited pgStatSock socket FD remains the
> same across the restart of the stats collector process...

Partially correct, I think.

Each backend has it's own handle on win32, since we use EXEC_BACKEND
(this includes the "utility processes" like the stats collector). When
we start the new one, we are going to use DuplicateHandle() in
save_backend_variables(). This will therefor get it a new handle, but
they are both pointing to the same kernel object. I don't know if
WaitForMultipleObjectsEx() is going to see these as two different
objects or not, but I think it does.


--
 Magnus Hagander
 Self: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

Re: BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram

От
Nikhil Sontakke
Дата:
Hi,

>>>>>
>>>>> Maybe. I'm unsure if it's enough to just try another
>>>>> WaitForSingleObjectEx() on it, or if we need to actually issue a
>>>>> WSARecv() on it as well. Maybe it would be enough to just change the
>>>>> INIFINTE on line 318 of socket.c to a fixed value. That will loop down
>>>>> to WSARecv() which should exit with WSAEWOULDBLOCK which will cause us
>>>>> to do a short sleep and come back. But we'd have to change the limit
>>>>> of 5 somehow then, since in theory we should wait forever. Maybe that
>>>>> outer loop should just be a for(;;), what do you think?
>>>>>
>>>>
>>>> Yes, line 318 seems to be a much better location to me. If Windows and
>>>> it's socket logic behaves properly most of the times :), most of the
>>>> calls should come out within the first few tries, so changing 5 to an
>>>> infinite loop shouldn't hurt those normal use cases in theory.
>>>>
>>>> OTOH, I was wondering what if we kill the stats collector and on a
>>>> restart the socket communication resumes properly. Would that
>>>> conclusively mean that it is a flaw in our code?
>>>
>>> No, if we kill the stats collector that will destroy all sockets, and
>>> when the new one starts all the sockets it operates on are fresh and
>>> new. So it doesn't show that the flaw is in our code - but it also
>>> doesn't show that it's in the kernel or runtime libraries.
>>>
>>
>> AFAICS in the code, the inherited pgStatSock socket FD remains the
>> same across the restart of the stats collector process...
>
> Partially correct, I think.
>
> Each backend has it's own handle on win32, since we use EXEC_BACKEND
> (this includes the "utility processes" like the stats collector). When
> we start the new one, we are going to use DuplicateHandle() in
> save_backend_variables(). This will therefor get it a new handle, but
> they are both pointing to the same kernel object. I don't know if
> WaitForMultipleObjectsEx() is going to see these as two different
> objects or not, but I think it does.
>

Hmm, got it. Nothing like adding more confusion into the mix :)

Regards,
Nikhils
--
http://www.enterprisedb.com

Re: BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram

От
Luke Koops
Дата:
There was no firewall in place, or more correctly the Windows Firewall is c=
onfigured to be off.  There is no other firewall installed on the system.

To get to this point in the code, the return value from WSARecv() was WSAEW=
OULDBLOCK.  The socket is set for overlapped IO and is a datagram socket.  =
MSDN documentation says that means there are too many outstanding overlappe=
d IO requests.  I don't know if "too many" applies to just this socket or t=
o the system as a whole.  The documentation isn't clear about how to handle=
 the return code in this situation.

We don't need to know if this is a Kernel issue, a bug in winsock, or an un=
documented behaviour.  Regardless, it can be treated as a fault.

Knowing that it is possible for WaitForMultipleObjectsEx to lock up means t=
hat it is not safe to call with an INFINITE timeout.  The workaround that's=
 being discussed is beginning to look like the one at line 172 of socket.c.=
  It's bad enough that there is a WSASend in pgwin32_waitforsinglesocket().=
  I doubt you also want to add a WSARecv.  There should be a cleaner way to=
 handle both of these situations.

I am planning to eventually kill the stats collector and see if that clears=
 up the hanging issue, but I want to keep the system state in place for a b=
it longer in case there is some other diagnostic steps I should try.  I've =
exhausted everything I could think of.

-Luke


-----Original Message-----
From: Nikhil Sontakke [mailto:nikhil.sontakke@enterprisedb.com]
Sent: Monday, August 03, 2009 10:38 AM
To: Magnus Hagander
Cc: Alvaro Herrera; Luke Koops; pgsql-bugs@postgresql.org
Subject: Re: [BUGS] BUG #4958: Stats collector hung on WaitForMultipleObjec=
tsEx while attempting to recv a datagram

Hi,

>>>>>
>>>>> Maybe. I'm unsure if it's enough to just try another
>>>>> WaitForSingleObjectEx() on it, or if we need to actually issue a
>>>>> WSARecv() on it as well. Maybe it would be enough to just change
>>>>> the INIFINTE on line 318 of socket.c to a fixed value. That will
>>>>> loop down to WSARecv() which should exit with WSAEWOULDBLOCK which
>>>>> will cause us to do a short sleep and come back. But we'd have to
>>>>> change the limit of 5 somehow then, since in theory we should wait
>>>>> forever. Maybe that outer loop should just be a for(;;), what do you =
think?
>>>>>
>>>>
>>>> Yes, line 318 seems to be a much better location to me. If Windows
>>>> and it's socket logic behaves properly most of the times :), most
>>>> of the calls should come out within the first few tries, so
>>>> changing 5 to an infinite loop shouldn't hurt those normal use cases i=
n theory.
>>>>
>>>> OTOH, I was wondering what if we kill the stats collector and on a
>>>> restart the socket communication resumes properly. Would that
>>>> conclusively mean that it is a flaw in our code?
>>>
>>> No, if we kill the stats collector that will destroy all sockets,
>>> and when the new one starts all the sockets it operates on are fresh
>>> and new. So it doesn't show that the flaw is in our code - but it
>>> also doesn't show that it's in the kernel or runtime libraries.
>>>
>>
>> AFAICS in the code, the inherited pgStatSock socket FD remains the
>> same across the restart of the stats collector process...
>
> Partially correct, I think.
>
> Each backend has it's own handle on win32, since we use EXEC_BACKEND
> (this includes the "utility processes" like the stats collector). When
> we start the new one, we are going to use DuplicateHandle() in
> save_backend_variables(). This will therefor get it a new handle, but
> they are both pointing to the same kernel object. I don't know if
> WaitForMultipleObjectsEx() is going to see these as two different
> objects or not, but I think it does.
>

Hmm, got it. Nothing like adding more confusion into the mix :)

Regards,
Nikhils
--
http://www.enterprisedb.com

Re: BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram

От
Nikhil Sontakke
Дата:
Hi,

> Knowing that it is possible for WaitForMultipleObjectsEx to lock up means=
 that it is not safe to call with an INFINITE timeout. =A0The workaround th=
at's being discussed is beginning to look like the one at line 172 of socke=
t.c. =A0It's bad enough that there is a WSASend in pgwin32_waitforsinglesoc=
ket(). =A0I doubt you also want to add a WSARecv. =A0There should be a clea=
ner way to handle both of these situations.
>

The change at line 318 in socket.c and using an infinite loop there as
proposed by Magnus, makes much more sense IMO.

> I am planning to eventually kill the stats collector and see if that clea=
rs up the hanging issue, but I want to keep the system state in place for a=
 bit longer in case there is some other diagnostic steps I should try. =A0I=
've exhausted everything I could think of.
>

Yeah it will be interesting to see if the collector starts functioning
fine after the restart. That might hint that the kernel object
representing the socket is maybe fine but would not prove conclusively
that the issue is with PG code because the layer used by
WaitForMultipleObjectsEx might have issues too.

Regards,
Nikhils
--=20
http://www.enterprisedb.com

Re: BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram

От
Luke Koops
Дата:
> Yeah it will be interesting to see if the collector starts
> functioning fine after the restart. That might hint that the
> kernel object representing the socket is maybe fine but would
> not prove conclusively that the issue is with PG code because
> the layer used by WaitForMultipleObjectsEx might have issues too.
This morning I planned to verify that stats collection was still not workin=
g before killing the stats collector and allowing it to restart.  I had a p=
sql session open from the previous day, but I closed it and tried to start =
a new session.  I log each session and wanted to start a new log.

Now, I am unable to start a new psql session.  I get this error on the clie=
nt side:
| psql: could not send startup packet: Connection reset by peer (0x00002746=
/10054)
|
and this error on the server side:
| 2009-08-06 10:48:59.542 EDT,LOG:  could not receive data from client: No =
connection could be made because the target machine actively refused it.
| 2009-08-06 10:48:59.542 EDT,LOG:  incomplete startup packet


I didn't find too much in the archives about this.  It's what happens if yo=
u just connect to 5432 (with telnet for example) and then drop the connecti=
on.

Occasionally (3-6% of the time), I get this on the client side:
| C:\postgres\bin>psql
| psql: server closed the connection unexpectedly
|         This probably means the server terminated abnormally
|         before or while processing the request.
and this on the server side:
| 2009-08-06 11:16:27.933 EDT,LOG:  could not receive data from client: No =
connection could be made because the target machine actively refused it.

When this sequence happens, I can see a backend postgres.exe process start =
up and then exit very quickly.  Note the absence of the "incomplete startup=
 package" message.

Could it be related to the stats collector problem?  The stats collector on=
 this system has been hung for over 6 weeks, so the timing of this problem =
is quite delayed.

I have windbg on this system along with the source and the symbols, so I co=
uld look for anything in the debugger.

-Luke

Re: BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram

От
Alvaro Herrera
Дата:
Luke Koops wrote:

> When this sequence happens, I can see a backend postgres.exe process
> start up and then exit very quickly.  Note the absence of the
> "incomplete startup package" message.
>
> Could it be related to the stats collector problem?  The stats
> collector on this system has been hung for over 6 weeks, so the timing
> of this problem is quite delayed.

Maybe you should try a network sniffer and see if you can spot anything
weird.  I think Wireshark can decode the Postgres protocol.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram

От
Robert Haas
Дата:
On Fri, Jul 31, 2009 at 10:59 AM, Luke Koops <luke.koops@entrust.com> wrote:
> -- postgres.exe!mainCRTStartup --
> ntoskrnl.exe!KiSwapContext+0x26
> ntoskrnl.exe!KiSwapThread+0x2e5
> ntoskrnl.exe!KeWaitForSingleObject+0x346
> ntoskrnl.exe!KiSuspendThread+0x18
> ntoskrnl.exe!KiDeliverApc+0x117
> ntoskrnl.exe!KiSwapThread+0x300
> ntoskrnl.exe!KeWaitForMultipleObjects+0x3d7
> ntoskrnl.exe!ObpWaitForMultipleObjects+0x202
> ntoskrnl.exe!NtWaitForMultipleObjects+0xe9
> ntoskrnl.exe!KiFastCallEntry+0xfc
> ntdll.dll!KiFastSystemCallRet
> ntdll.dll!NtWaitForMultipleObjects+0xc
> kernel32.dll!WaitForMultipleObjectsEx+0x11a
> postgres.exe!pgwin32_waitforsinglesocket+0x1ed
> postgres.exe!pgwin32_recv+0x90
> postgres.exe!PgstatCollectorMain+0x17f
> postgres.exe!SubPostmasterMain+0x33a
> postgres.exe!main+0x168
> postgres.exe!__tmainCRTStartup+0x10f
> kernel32.dll!BaseProcessStart+0x23

We just had a customer hit a very similar problem on 9.1.3, running on
Windows Server 2008 SP2.  They were able to extract the following
stack trace:

ntoskrnl.exe!KiSwapContext+0x7a
ntoskrnl.exe!KiCommitThreadWait+0x1d2
ntoskrnl.exe!KeWaitForMultipleObjects+0x271
ntoskrnl.exe!ObpWaitForMultipleObjects+0x294
ntoskrnl.exe!NtWaitForMultipleObjects+0xe5
ntoskrnl.exe!KiSystemServiceCopyEnd+0x13
ntdll.dll!ZwWaitForMultipleObjects+0xa
KERNELBASE.dll!WaitForMultipleObjectsEx+0xe8
kernel32.dll!WaitForMultipleObjectsExImplementation+0xb3
postgres.exe!pgwin32_waitforsinglesocket+0x26d
postgres.exe!pgwin32_recv+0xf0
postgres.exe!PgstatCollectorMain+0x1cc
postgres.exe!SubPostmasterMain+0x4c2
postgres.exe!main+0x1d0
postgres.exe!__tmainCRTStartup+0x11a
kernel32.dll!BaseThreadInitThunk+0xd
ntdll.dll!RtlUserThreadStart+0x1d

The customer finds that they can reproduce this on a variety of
systems under heavy load.  However, removing the load doesn't fix the
problem; the system continues to spew pgstat wait timeout messages
into the logs.  Autovacuum fails to DTRT due to lack of current stats
and things go downhill rapidly from there.  Terminating the stats
collector process resolves the issue; the postmaster starts a new one
within 60 seconds and after that the pgstat wait timeout messages
cease and vacuuming consequently resumes.

Now, it looks to me like for this stack trace to happen,
PgstatCollectorMain() has got to call pgwin32_waitforsinglesocket (at
line 3002), and that function has to return true, so that got_data
gets set to true.  Then PgstatCollectorMain() will call recv(), which
on Windows will really be pgwin32_recv, which will call
pgwin32_waitforsinglesocket, which must now hang.  The fact that the
first pgwin32_waitforsinglesocket call returned true should mean that
the stats collector socket is ready for read, while the fact that the
second one did not return seems to imply that it's not ready for read,
close, or accept.  So it almost looks like Windows can change its mind
about whether the socket is readable.

Or maybe we're telling it to change its mind.  This sounds an awful
lot like something that could have been caused by the oversights fixed
in commit b85427f2276d02756b558c0024949305ea65aca5.  Was there a
reason we didn't back-patch that?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> We just had a customer hit a very similar problem on 9.1.3, running on
> Windows Server 2008 SP2. ...
> The customer finds that they can reproduce this on a variety of
> systems under heavy load.

> Now, it looks to me like for this stack trace to happen,
> PgstatCollectorMain() has got to call pgwin32_waitforsinglesocket (at
> line 3002), and that function has to return true, so that got_data
> gets set to true.  Then PgstatCollectorMain() will call recv(), which
> on Windows will really be pgwin32_recv, which will call
> pgwin32_waitforsinglesocket, which must now hang.  The fact that the
> first pgwin32_waitforsinglesocket call returned true should mean that
> the stats collector socket is ready for read, while the fact that the
> second one did not return seems to imply that it's not ready for read,
> close, or accept.  So it almost looks like Windows can change its mind
> about whether the socket is readable.

> Or maybe we're telling it to change its mind.  This sounds an awful
> lot like something that could have been caused by the oversights fixed
> in commit b85427f2276d02756b558c0024949305ea65aca5.  Was there a
> reason we didn't back-patch that?

Sure: it was unproven that that fixed anything at all, much less that it
was bug-free enough to be safe to backpatch.  Neither of those things
has changed since May.  If you want you can try making up a 9.1 with
those changes and giving it to this customer to see if it fixes their
problems --- but without some field testing of the sort, I'm pretty
hesitant to put it into back branches.

            regards, tom lane

Re: BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram

От
Robert Haas
Дата:
On Tue, Aug 7, 2012 at 2:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> We just had a customer hit a very similar problem on 9.1.3, running on
>> Windows Server 2008 SP2. ...
>> The customer finds that they can reproduce this on a variety of
>> systems under heavy load.
>
>> Now, it looks to me like for this stack trace to happen,
>> PgstatCollectorMain() has got to call pgwin32_waitforsinglesocket (at
>> line 3002), and that function has to return true, so that got_data
>> gets set to true.  Then PgstatCollectorMain() will call recv(), which
>> on Windows will really be pgwin32_recv, which will call
>> pgwin32_waitforsinglesocket, which must now hang.  The fact that the
>> first pgwin32_waitforsinglesocket call returned true should mean that
>> the stats collector socket is ready for read, while the fact that the
>> second one did not return seems to imply that it's not ready for read,
>> close, or accept.  So it almost looks like Windows can change its mind
>> about whether the socket is readable.
>
>> Or maybe we're telling it to change its mind.  This sounds an awful
>> lot like something that could have been caused by the oversights fixed
>> in commit b85427f2276d02756b558c0024949305ea65aca5.  Was there a
>> reason we didn't back-patch that?
>
> Sure: it was unproven that that fixed anything at all, much less that it
> was bug-free enough to be safe to backpatch.  Neither of those things
> has changed since May.  If you want you can try making up a 9.1 with
> those changes and giving it to this customer to see if it fixes their
> problems --- but without some field testing of the sort, I'm pretty
> hesitant to put it into back branches.

Well, we had the customer try out 9.2beta, and they were unable to
reproduce the issue there.  Woo-hoo.  Does that qualify as sufficient
evidence for back-patching this?

(BTW, I think commit 9b63e9869ffaa4d6d3e8bf45086a765d8f310f1c contains
a thinko in one of the comments: shouldn't "a crock of the first
water" be "a crock of the first order"?)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram

От
Alvaro Herrera
Дата:
Excerpts from Robert Haas's message of mar ago 21 14:54:28 -0400 2012:

> (BTW, I think commit 9b63e9869ffaa4d6d3e8bf45086a765d8f310f1c contains
> a thinko in one of the comments: shouldn't "a crock of the first
> water" be "a crock of the first order"?)

I was interested by that phrase when I saw the commit, and as usual when
I see a phrase from Tom that looks like it might mean something
interesting, I looked it up.  Apparently "of the first water" means "of
the highest quality" or "of the most extreme kind":
http://idioms.thefreedictionary.com/of+the+first+water
(Curiously enough, the example usage for the first sense in that site
mentions some gentleman by name of Tom.)

I have learned a few neat English language tricks by paying careful
attention to Tom's choice of words.  There doesn't seem to be many
people in these lists that use English in just the same way.

--=20
=C3=81lvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Excerpts from Robert Haas's message of mar ago 21 14:54:28 -0400 2012:
>> (BTW, I think commit 9b63e9869ffaa4d6d3e8bf45086a765d8f310f1c contains
>> a thinko in one of the comments: shouldn't "a crock of the first
>> water" be "a crock of the first order"?)

> I was interested by that phrase when I saw the commit, and as usual when
> I see a phrase from Tom that looks like it might mean something
> interesting, I looked it up.  Apparently "of the first water" means "of
> the highest quality" or "of the most extreme kind":
> http://idioms.thefreedictionary.com/of+the+first+water

Yeah, that was intentional not a typo.  I think modern usage of this
phrase is largely confined to gemstones though, as in "diamond of the
first water".
http://www.phrases.org.uk/meanings/137550.html

Meanwhile, back at the ranch: I'm fine with applying that patch now that
it's had some field testing.

            regards, tom lane

Re: BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram

От
Robert Haas
Дата:
On Tue, Aug 21, 2012 at 4:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Meanwhile, back at the ranch: I'm fine with applying that patch now that
> it's had some field testing.

Attached is a version that applies OK to 9.1, after resolving some
conflicts.  Review appreciated.  It wasn't exactly clear to me how to
handle the ereport/elog calls.

What to do pre-9.1 is a bit less clear to me, as the latch machinery
doesn't exist at all in those versions.  Some of the fixes can
probably still be pulled out and applied, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Вложения
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Aug 21, 2012 at 4:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Meanwhile, back at the ranch: I'm fine with applying that patch now that
>> it's had some field testing.

> Attached is a version that applies OK to 9.1, after resolving some
> conflicts.  Review appreciated.  It wasn't exactly clear to me how to
> handle the ereport/elog calls.

It looks sane to me in a quick once-over (bearing in mind I can't test
it).  What's bothering you about the ereport calls?

> What to do pre-9.1 is a bit less clear to me, as the latch machinery
> doesn't exist at all in those versions.  Some of the fixes can
> probably still be pulled out and applied, though.

Presumably the changes in win32/socket.c could still be applied; but
on the other hand it's not clear there's so much need for them pre-9.1.
IIRC most of the point of this patch was to eliminate the tug-of-war
over the socket's event attachment between socket.c and win32_latch.c,
so maybe there's not a bug pre-9.1.

            regards, tom lane

Re: BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram

От
Robert Haas
Дата:
On Wed, Aug 22, 2012 at 10:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Aug 21, 2012 at 4:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Meanwhile, back at the ranch: I'm fine with applying that patch now that
>>> it's had some field testing.
>
>> Attached is a version that applies OK to 9.1, after resolving some
>> conflicts.  Review appreciated.  It wasn't exactly clear to me how to
>> handle the ereport/elog calls.
>
> It looks sane to me in a quick once-over (bearing in mind I can't test
> it).  What's bothering you about the ereport calls?

Well, 9.1 currently has:

errmsg_internal("failed to enumerate network events: %i", (int) GetLastError()))

Prior to your patch, master had:

errmsg_internal("failed to enumerate network events: error code %lu",
GetLastError())

And you changed it to:

elog(ERROR, "failed to enumerate network events: error code %u",
WSAGetLastError())

So obviously there's a conflict there.  In the previously-posted
version, I chose to just take your version, but I'm not sure whether
it would be better -- in the back branches -- to simply replace
GetLastError() with WSAGetLastError() and leave it alone otherwise.

>> What to do pre-9.1 is a bit less clear to me, as the latch machinery
>> doesn't exist at all in those versions.  Some of the fixes can
>> probably still be pulled out and applied, though.
>
> Presumably the changes in win32/socket.c could still be applied; but
> on the other hand it's not clear there's so much need for them pre-9.1.
> IIRC most of the point of this patch was to eliminate the tug-of-war
> over the socket's event attachment between socket.c and win32_latch.c,
> so maybe there's not a bug pre-9.1.

Well, this thread started as a bug report against 8.3.7...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram

От
Alvaro Herrera
Дата:
Excerpts from Robert Haas's message of s=C3=A1b ago 25 00:10:57 -0400 201=
2:

> > It looks sane to me in a quick once-over (bearing in mind I can't tes=
t
> > it).  What's bothering you about the ereport calls?
>=20
> Well, 9.1 currently has:
>=20
> errmsg_internal("failed to enumerate network events: %i", (int) GetLast=
Error()))
>=20
> Prior to your patch, master had:
>=20
> errmsg_internal("failed to enumerate network events: error code %lu",
> GetLastError())
>=20
> And you changed it to:
>=20
> elog(ERROR, "failed to enumerate network events: error code %u",
> WSAGetLastError())
>=20
> So obviously there's a conflict there.  In the previously-posted
> version, I chose to just take your version, but I'm not sure whether
> it would be better -- in the back branches -- to simply replace
> GetLastError() with WSAGetLastError() and leave it alone otherwise.

Note that for translations none of this matters, because
errmsg_internal() does not mark its argument for translation, and
neither does elog().

The change of %i to "error code %lu" was done recently (during 9.2 devel
cycle) to make all error messages involving GetLastError() consistent;
see commit 1af55e275.

--=20
=C3=81lvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram

От
Robert Haas
Дата:
On Wed, Aug 22, 2012 at 10:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It looks sane to me in a quick once-over (bearing in mind I can't test
> it).

Well, bad news.  Dave built an installer off the 9.1 tip, we provided
it to the customer, and they can still reproduce the problem.  It's
not clear whether we've narrowed the scope of the problem without
eliminating it, or whether that fix just didn't help at all.  But they
still can't reproduce it on 9.2.  This suggests that there's some
other difference between 9.1 and 9.2 that is relevant here, but I'm
not sure what.  Any ideas?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram

От
Tom Lane
Дата:
Robert Haas <robertmhaas@gmail.com> writes:
> Well, bad news.  Dave built an installer off the 9.1 tip, we provided
> it to the customer, and they can still reproduce the problem.  It's
> not clear whether we've narrowed the scope of the problem without
> eliminating it, or whether that fix just didn't help at all.  But they
> still can't reproduce it on 9.2.  This suggests that there's some
> other difference between 9.1 and 9.2 that is relevant here, but I'm
> not sure what.  Any ideas?

I guess that's reassuring, in a way: it means that all the thrashing
about we did to get the latch-ified stats collector logic to work on
Windows actually brought us to a better place.

However, for the purposes of fixing the older branches, it's not so
nice.  I think it means that we'd have to back-port some or all of the
latch-related changes in pgstat.c:

commit 966970ed636586b80739c0d21aec7561f0fafedd
commit f1ca51549e9e7045a2db95a61744334f1dbb4d64
commit d461d0502b6e20c6d4ec326ba2faec8212fd5dee
commit 9b63e9869ffaa4d6d3e8bf45086a765d8f310f1c

and maybe even some of these fixes in win32_latch.c:

commit e42a21b9e6c9b9e6346a34b62628d48ff2fc6ddf
commit f40022f1adaeff85b01d98fea38cf5aa16814aa7
commit 8ebc908c575b3378c9d4096412d16d91d107311c

At least, that's where I'd go looking if I were responsible for fixing
this, which is a responsibility I disclaim.  Looking through this
history reminds me how much time I've already wasted on that miserable
excuse for an operating system.

            regards, tom lane