Thread: Why is citext/regress failing on hamerkop?

Why is citext/regress failing on hamerkop?

From:
Thomas Munro
Date:
For example, 'i'::citext = 'İ'::citext fails to be true.

It must now be using UTF-8 (unlike, say, Drongo) and non-C ctype,
given that the test isn't skipped.  This isn't the first time that
we've noticed that Windows doesn't seem to know about İ→i (see [1]),
but I don't think anyone has explained exactly why, yet.  It could be
that it just doesn't know about that in any locale, or that it is
locale-dependent and would only do that for Turkish, the same reason
we skip the test for ICU, or ...
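
A quick way to probe that would be a throwaway C program along these
lines (illustrative only; the locale names are guesses and may need
adjusting for a given Windows install):

    #include <locale.h>
    #include <stdio.h>
    #include <wctype.h>

    int
    main(void)
    {
        /* Locale names are assumptions; adjust for the system under test. */
        const char *locales[] = {"Turkish_Turkey.1254",
                                 "English_United States.1252"};

        for (int i = 0; i < 2; i++)
        {
            if (setlocale(LC_CTYPE, locales[i]) == NULL)
            {
                printf("could not open locale \"%s\"\n", locales[i]);
                continue;
            }
            /* U+0130 is LATIN CAPITAL LETTER I WITH DOT ABOVE ('İ'). */
            printf("%s: towlower(U+0130) = U+%04X\n",
                   locales[i], (unsigned) towlower(0x0130));
        }
        return 0;
    }

If that comes back as U+0130 (unchanged) even under the Turkish locale,
then Windows simply doesn't know about that folding at all.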

Either way, it seems like we'll need to skip that test on Windows if
we want hamerkop to be green.  That can probably be cribbed from
collate.windows.win1252.sql into contrib/citext/sql/citext_utf8.sql's
prelude... I just don't know how to explain it in the comment 'cause I
don't know why.

One new development in Windows-land is that the system now does
actually support UTF-8 in the runtime libraries[2].  You can set it at
a system level, or for an application at build time, or by adding
".UTF-8" to a locale name when opening the locale (apparently much
more like Unix systems, but I don't know what exactly it does).  I
wonder why we see this change now... did hamerkop have that ACP=UTF-8
setting applied on purpose, did computers in Japan start doing that by
default instead of using Shift-JIS, or did it only start picking up
UTF-8 around the time of the Meson change somehow, or around the
initdb-template change?  It's a little hard to tell from the logs.
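
For what it's worth, on a recent UCRT the runtime-library part of that
is as simple as the following sketch (the minimum CRT version is an
assumption on my part; older CRTs just fail the call and return NULL):

    #include <locale.h>
    #include <stdio.h>

    int
    main(void)
    {
        /*
         * Appending ".UTF-8" to the locale name asks the UCRT for a UTF-8
         * code page; with an empty name it applies to the user's default
         * locale.
         */
        char       *loc = setlocale(LC_ALL, ".UTF-8");

        if (loc == NULL)
            printf("this CRT does not support UTF-8 locales\n");
        else
            printf("active locale: %s\n", loc);
        return 0;
    }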

[1]
https://www.postgresql.org/message-id/CAC%2BAXB10p%2BmnJ6wrAEm6jb51%2B8%3DBfYzD%3Dw6ftHRbMjMuSFN3kQ%40mail.gmail.com
[2] https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page



Re: Why is citext/regress failing on hamerkop?

From:
Thomas Munro
Date:
On Sat, May 11, 2024 at 1:14 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Either way, it seems like we'll need to skip that test on Windows if
> we want hamerkop to be green.  That can probably be cribbed from
> collate.windows.win1252.sql into contrib/citext/sql/citext_utf8.sql's
> prelude... I just don't know how to explain it in the comment 'cause I
> don't know why.

Here's a minimal patch like that.

I don't think it's worth back-patching.  Only 15 and 16 could possibly
be affected, I think, because the test wasn't enabled before that.  I
think this is all just a late-appearing blow-up predicted by the
commit that enabled it:

commit c2e8bd27519f47ff56987b30eb34a01969b9a9e8
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Wed Jan 5 13:30:07 2022 -0500

    Enable routine running of citext's UTF8-specific test cases.

    These test cases have been commented out since citext was invented,
    because at the time we had no nice way to deal with tests that
    have restrictions such as requiring UTF8 encoding.  But now we do
    have a convention for that, ie put them into a separate test file
    with an early-exit path.  So let's enable these tests to run when
    their prerequisites are satisfied.

    (We may have to tighten the prerequisites beyond the "encoding = UTF8
    and locale != C" checks made here.  But let's put it on the buildfarm
    and see what blows up.)

Hamerkop is already green on the 15 and 16 branches, apparently
because it's using the pre-meson test stuff that I guess just didn't
run the relevant test.  In other words, nobody would notice the
difference anyway, and a master-only fix would be enough to end this
44-day red streak.

Attachments

Re: Why is citext/regress failing on hamerkop?

From:
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> On Sat, May 11, 2024 at 1:14 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> Either way, it seems like we'll need to skip that test on Windows if
>> we want hamerkop to be green.  That can probably be cribbed from
>> collate.windows.win1252.sql into contrib/citext/sql/citext_utf8.sql's
>> prelude... I just don't know how to explain it in the comment 'cause I
>> don't know why.

> Here's a minimal patch like that.

WFM until some Windows person cares to probe more deeply.

BTW, I've also been wondering why hamerkop has been failing
isolation-check in the 12 and 13 branches for the last six months
or so.  It is surely unrelated to this issue, and it looks like
it must be due to some platform change rather than anything we
committed at the time.

I'm not planning on looking into that question myself, but really
somebody ought to.  Or is Windows just as dead as AIX, in terms of
anybody being willing to put effort into supporting it?

            regards, tom lane



Re: Why is citext/regress failing on hamerkop?

From:
Alexander Lakhin
Date:
Hello Tom,

12.05.2024 08:34, Tom Lane wrote:
> BTW, I've also been wondering why hamerkop has been failing
> isolation-check in the 12 and 13 branches for the last six months
> or so.  It is surely unrelated to this issue, and it looks like
> it must be due to some platform change rather than anything we
> committed at the time.
>
> I'm not planning on looking into that question myself, but really
> somebody ought to.  Or is Windows just as dead as AIX, in terms of
> anybody being willing to put effort into supporting it?

I've reproduced the failure locally with GSS enabled, so I'll try to
figure out what's going on here in the next few days.

Best regards,
Alexander



Re: Why is citext/regress failing on hamerkop?

From:
Andrew Dunstan
Date:


On 2024-05-12 Su 01:34, Tom Lane wrote:
> BTW, I've also been wondering why hamerkop has been failing
> isolation-check in the 12 and 13 branches for the last six months
> or so.  It is surely unrelated to this issue, and it looks like
> it must be due to some platform change rather than anything we
> committed at the time.


Possibly. It looks like this might be the issue:

+Connection 2 failed: could not initiate GSSAPI security context: Unspecified GSS failure.  Minor code may provide more information: Credential cache is empty
+FATAL:  sorry, too many clients already


There are several questions here, including: 

1. why isn't it failing on later branches? 
2. why isn't it failing on drongo (which has more modern compiler and OS)?

I think we'll need the help of the animal owner to dig into the issue.

> I'm not planning on looking into that question myself, but really
> somebody ought to.  Or is Windows just as dead as AIX, in terms of
> anybody being willing to put effort into supporting it?


Well, this is more or less where I came in back in about 2002 :-) I've been trying to help support it ever since, mainly motivated by stubborn persistence more than anything else. Still, I agree that the lack of support for the Windows port from Microsoft over the years has been more than disappointing.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: Why is citext/regress failing on hamerkop?

From:
Andrew Dunstan
Date:


On 2024-05-12 Su 08:26, Andrew Dunstan wrote:


> On 2024-05-12 Su 01:34, Tom Lane wrote:
>> BTW, I've also been wondering why hamerkop has been failing
>> isolation-check in the 12 and 13 branches for the last six months
>> or so.  It is surely unrelated to this issue, and it looks like
>> it must be due to some platform change rather than anything we
>> committed at the time.
>
> Possibly. It looks like this might be the issue:
>
> +Connection 2 failed: could not initiate GSSAPI security context: Unspecified GSS failure.  Minor code may provide more information: Credential cache is empty
> +FATAL:  sorry, too many clients already
>
> There are several questions here, including:
>
> 1. why isn't it failing on later branches?
> 2. why isn't it failing on drongo (which has more modern compiler and OS)?
>
> I think we'll need the help of the animal owner to dig into the issue.


Aha! drongo doesn't have GSSAPI enabled. Will work on that.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: Why is citext/regress failing on hamerkop?

From:
Thomas Munro
Date:
On Mon, May 13, 2024 at 12:26 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> Well, this is more or less where I came in back in about 2002 :-) I've been trying to help support it ever since,
> mainly motivated by stubborn persistence more than anything else. Still, I agree that the lack of support for the
> Windows port from Microsoft over the years has been more than disappointing.

I think "state of the Windows port" would make a good discussion topic
at pgconf.dev (with write-up for those who can't be there).  If there
is interest, I could organise that with a short presentation of the
issues I am aware of so far and possible improvements and
smaller-things-we-could-drop-instead-of-dropping-the-whole-port.  I
would focus on technical stuff, not who-should-be-doing-what, 'cause I
can't make anyone do anything.

For citext_utf8, I pushed cff4e5a3.  Hamerkop runs infrequently, so
here's hoping for 100% green on master by Tuesday or so.



Re: Why is citext/regress failing on hamerkop?

From:
Andrew Dunstan
Date:
On 2024-05-12 Su 18:05, Thomas Munro wrote:
> On Mon, May 13, 2024 at 12:26 AM Andrew Dunstan <andrew@dunslane.net> wrote:
>> Well, this is more or less where I came in back in about 2002 :-) I've been trying to help support it ever since,
>> mainly motivated by stubborn persistence more than anything else. Still, I agree that the lack of support for the
>> Windows port from Microsoft over the years has been more than disappointing.
>
> I think "state of the Windows port" would make a good discussion topic
> at pgconf.dev (with write-up for those who can't be there).  If there
> is interest, I could organise that with a short presentation of the
> issues I am aware of so far and possible improvements and
> smaller-things-we-could-drop-instead-of-dropping-the-whole-port.  I
> would focus on technical stuff, not who-should-be-doing-what, 'cause I
> can't make anyone do anything.
>

+1


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Why is citext/regress failing on hamerkop?

From:
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> For citext_utf8, I pushed cff4e5a3.  Hamerkop runs infrequently, so
> here's hoping for 100% green on master by Tuesday or so.

In the meantime, some off-list investigation by Alexander Lakhin
has turned up a good deal of information about why we're seeing
failures on hamerkop in the back branches.  Summarizing, it
appears that

1. In a GSS-enabled Windows build without any active Kerberos server,
libpq's pg_GSS_have_cred_cache() succeeds, allowing libpq to try to
open a GSSAPI connection, but then gss_init_sec_context() fails,
leading to client-side reports like this:

+Connection 2 failed: could not initiate GSSAPI security context: Unspecified GSS failure.  Minor code may provide more information: Credential cache is empty
+FATAL:  sorry, too many clients already

(The first of these lines comes out during the attempted GSS
connection, the second during the only-slightly-more-successful
non-GSS connection.)  So that's problem number 1: how is it that
gss_acquire_cred() succeeds but then gss_init_sec_context() disclaims
knowledge of any credentials?  Can we find a way to get this failure
to be detected during pg_GSS_have_cred_cache()?  It is mighty
expensive to launch a backend connection that is doomed to fail,
especially when this happens during *every single libpq connection
attempt*.

2. Once gss_init_sec_context() fails, libpq abandons the connection
and starts over; since it has already initiated a GSS handshake
on the connection, there's not much choice.  Although libpq faithfully
issues closesocket() on the abandoned connection, Alexander found
that the connected backend doesn't reliably see that: it may just
sit there until the AuthenticationTimeout elapses (1 minute by
default).  That backend is still consuming a postmaster child slot,
so if this happens on any sizable fraction of test connection
attempts, it's little surprise that we soon get "sorry, too many
clients already" failures.

3. We don't know exactly why hamerkop suddenly started seeing these
failures, but a plausible theory emerges after noting that its
reported time for the successful "make check" step dropped pretty
substantially right when this started.  In the v13 branch, "make
check" was taking 2:18 or more in the several runs right before the
first isolationcheck failure, but 1:40 or less just after.  So it
looks like the animal was moved onto faster hardware.  That feeds
into this problem because, with a successful isolationcheck run
taking more than a minute, there was enough time for some of the
earlier stuck sessions to time out and exit before their
postmaster-child slots were needed.

Alexander confirmed this theory by demonstrating that the main
regression tests in v15 would pass if he limited their speed enough
(by reducing the CPU allowed to a VM) but not at full speed.  So the
buildfarm results suggesting this is only an issue in <= v13 must
be just a timing artifact; the problem is still there.
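
Coming back to question 1: what pg_GSS_have_cred_cache() boils down to
is an attempt to acquire a default initiator credential, roughly along
these lines (a simplified sketch, not the actual libpq code):

    #include <stdbool.h>
    #include <gssapi/gssapi.h>

    /* Simplified sketch of a "do we have a default credential?" probe. */
    static bool
    have_gss_cred(void)
    {
        OM_uint32       major,
                        minor;
        gss_cred_id_t   cred = GSS_C_NO_CREDENTIAL;

        major = gss_acquire_cred(&minor,
                                 GSS_C_NO_NAME,    /* default principal */
                                 0,                /* default lifetime */
                                 GSS_C_NO_OID_SET, /* any mechanism */
                                 GSS_C_INITIATE,   /* client-side usage */
                                 &cred, NULL, NULL);
        if (major != GSS_S_COMPLETE)
            return false;

        gss_release_cred(&minor, &cred);
        return true;
    }

The puzzle is why, on hamerkop, this kind of probe says "yes" while
gss_init_sec_context() immediately complains that the cache is empty.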

Of course, backends waiting till timeout is not a good behavior
independently of what is triggering that, so we have two problems to
solve here, not one.  I have no ideas about the gss_init_sec_context()
failure, but I see a plausible theory about the failure to detect
socket closure in Microsoft's documentation about closesocket() [1]:

    If the l_onoff member of the LINGER structure is zero on a stream
    socket, the closesocket call will return immediately and does not
    receive WSAEWOULDBLOCK whether the socket is blocking or
    nonblocking. However, any data queued for transmission will be
    sent, if possible, before the underlying socket is closed. This is
    also called a graceful disconnect or close. In this case, the
    Windows Sockets provider cannot release the socket and other
    resources for an arbitrary period, thus affecting applications
    that expect to use all available sockets. This is the default
    behavior for a socket.

I'm not sure whether we've got unsent data pending in the problematic
condition, nor why it'd remain unsent if we do (shouldn't the backend
consume it anyway?).  But this has the right odor for an explanation.
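
For concreteness, the two close modes the documentation is contrasting
look roughly like this in Winsock terms (an illustrative sketch only;
this is not something our code does today):

    #include <winsock2.h>

    /*
     * Default, "graceful" close: queued data is still sent if possible,
     * and the provider may hold the socket's resources for a while.
     */
    static void
    close_gracefully(SOCKET s)
    {
        closesocket(s);
    }

    /*
     * Abortive close: l_onoff = 1 with l_linger = 0 makes closesocket()
     * reset the connection immediately, discarding any queued data.
     */
    static void
    close_abortively(SOCKET s)
    {
        LINGER      lin;

        lin.l_onoff = 1;
        lin.l_linger = 0;
        setsockopt(s, SOL_SOCKET, SO_LINGER, (const char *) &lin, sizeof(lin));
        closesocket(s);
    }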

I'm pretty hesitant to touch this area myself, because it looks an
awful lot like commits 6051857fc and ed52c3707, which eventually
had to be reverted.  I think we need a deeper understanding of
exactly what Winsock is doing or not doing before we try to fix it.
I wonder if there are any Microsoft employees around here with
access to the relevant source code.

In the short run, it might be a good idea to deprecate using
--with-gssapi on Windows builds.  A different stopgap idea
could be to drastically reduce the default AuthenticationTimeout,
perhaps only on Windows.

            regards, tom lane

[1] https://learn.microsoft.com/en-us/windows/win32/api/winsock/nf-winsock-closesocket



Re: Why is citext/regress failing on hamerkop?

From:
Thomas Munro
Date:
On Tue, May 14, 2024 at 8:17 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm not sure whether we've got unsent data pending in the problematic
> condition, nor why it'd remain unsent if we do (shouldn't the backend
> consume it anyway?).  But this has the right odor for an explanation.
>
> I'm pretty hesitant to touch this area myself, because it looks an
> awful lot like commits 6051857fc and ed52c3707, which eventually
> had to be reverted.  I think we need a deeper understanding of
> exactly what Winsock is doing or not doing before we try to fix it.

I was beginning to suspect that lingering odour myself.  I haven't
looked at the GSS code but I was imagining that what we have here is
perhaps not unsent data dropped on the floor due to linger policy
(unclean socket close on process exit), but rather that the server
didn't see the socket as ready to read because it lost track of the
FD_CLOSE somewhere because the client closed it gracefully, and our
server-side FD_CLOSE handling has always been a bit suspect.  I wonder
if the GSS code is somehow more prone to brokenness.  One thing we
learned in earlier problems was that abortive/error disconnections
generate FD_CLOSE repeatedly, while graceful ones give you only one.
In other words, if the other end politely calls closesocket(), the
server had better not miss the FD_CLOSE event, because it won't come
again.   That's what

https://commitfest.postgresql.org/46/3523/

is intended to fix.  Does it help here?  Unfortunately that's
unpleasantly complicated and unbackpatchable (keeping a side-table of
socket FDs and event handles, so we don't lose events between the
cracks).



Re: Why is citext/regress failing on hamerkop?

From:
Alexander Lakhin
Date:
13.05.2024 23:17, Tom Lane wrote:
> 3. We don't know exactly why hamerkop suddenly started seeing these
> failures, but a plausible theory emerges after noting that its
> reported time for the successful "make check" step dropped pretty
> substantially right when this started.  In the v13 branch, "make
> check" was taking 2:18 or more in the several runs right before the
> first isolationcheck failure, but 1:40 or less just after.  So it
> looks like the animal was moved onto faster hardware.  That feeds
> into this problem because, with a successful isolationcheck run
> taking more than a minute, there was enough time for some of the
> earlier stuck sessions to time out and exit before their
> postmaster-child slots were needed.

Yes, and one thing I can't explain yet is why REL_14_STABLE+ timings
differ substantially from REL_13_STABLE and older, say, for the check stage:
on REL_14_STABLE the oldest available test log (from 2021-10-30) shows
check (00:03:47) and the newest one (from 2024-05-12) shows check (00:03:18),
whilst on REL_13_STABLE the oldest available log (from 2021-08-06) shows
a check duration of 00:03:00, which then decreased to 00:02:24 (2021-09-22) /
00:02:14 (2021-11-07), and is now 1:40, as you said.

Locally I see more or less the same timings on REL_13_STABLE (34, 28, 27
secs) and on REL_14_STABLE (33, 29, 29 secs).

14.05.2024 03:38, Thomas Munro wrote:
> I was beginning to suspect that lingering odour myself.  I haven't
> looked at the GSS code but I was imagining that what we have here is
> perhaps not unsent data dropped on the floor due to linger policy
> (unclean socket close on process exit), but rather that the server
> didn't see the socket as ready to read because it lost track of the
> FD_CLOSE somewhere because the client closed it gracefully, and our
> server-side FD_CLOSE handling has always been a bit suspect.  I wonder
> if the GSS code is somehow more prone to brokenness.  One thing we
> learned in earlier problems was that abortive/error disconnections
> generate FD_CLOSE repeatedly, while graceful ones give you only one.
> In other words, if the other end politely calls closesocket(), the
> server had better not miss the FD_CLOSE event, because it won't come
> again.   That's what
>
> https://commitfest.postgresql.org/46/3523/
>
> is intended to fix.  Does it help here?  Unfortunately that's
> unpleasantly complicated and unbackpatchable (keeping a side-table of
> socket FDs and event handles, so we don't lose events between the
> cracks).

Yes, that cure helps here too. I've tested it on b282fa88d~1 (the last
state when that patch set can be applied).

An excerpt (all lines related to process 12500) from a failed run log
without the patch set:
2024-05-14 05:57:29.526 UTC [8228:128] DEBUG:  forked new backend, pid=12500 socket=5524
2024-05-14 05:57:29.534 UTC [12500:1] [unknown] LOG:  connection received: host=::1 port=51394
2024-05-14 05:57:29.534 UTC [12500:2] [unknown] LOG: !!!BackendInitialize| before ProcessStartupPacket
2024-05-14 05:57:29.534 UTC [12500:3] [unknown] LOG: !!!ProcessStartupPacket| before secure_open_gssapi(), GSSok: G
2024-05-14 05:57:29.534 UTC [12500:4] [unknown] LOG: !!!secure_open_gssapi| before read_or_wait
2024-05-14 05:57:29.534 UTC [12500:5] [unknown] LOG: !!!read_or_wait| before secure_raw_read(); PqGSSRecvLength: 0, len: 4
2024-05-14 05:57:29.534 UTC [12500:6] [unknown] LOG: !!!read_or_wait| after secure_raw_read: -1, errno: 10035
2024-05-14 05:57:29.534 UTC [12500:7] [unknown] LOG: !!!read_or_wait| before WaitLatchOrSocket()
2024-05-14 05:57:29.549 UTC [12500:8] [unknown] LOG: !!!read_or_wait| after WaitLatchOrSocket
2024-05-14 05:57:29.549 UTC [12500:9] [unknown] LOG: !!!read_or_wait| before secure_raw_read(); PqGSSRecvLength: 0, len: 4
2024-05-14 05:57:29.549 UTC [12500:10] [unknown] LOG: !!!read_or_wait| after secure_raw_read: 0, errno: 10035
2024-05-14 05:57:29.549 UTC [12500:11] [unknown] LOG: !!!read_or_wait| before WaitLatchOrSocket()
...
2024-05-14 05:57:52.024 UTC [8228:3678] DEBUG:  server process (PID 12500) exited with exit code 1
# at the end of the test run

And an excerpt (all lines related to process 11736) from a successful run
log with the patch set applied:
2024-05-14 06:03:57.216 UTC [4524:130] DEBUG:  forked new backend, pid=11736 socket=5540
2024-05-14 06:03:57.226 UTC [11736:1] [unknown] LOG:  connection received: host=::1 port=51914
2024-05-14 06:03:57.226 UTC [11736:2] [unknown] LOG: !!!BackendInitialize| before ProcessStartupPacket
2024-05-14 06:03:57.226 UTC [11736:3] [unknown] LOG: !!!ProcessStartupPacket| before secure_open_gssapi(), GSSok: G
2024-05-14 06:03:57.226 UTC [11736:4] [unknown] LOG: !!!secure_open_gssapi| before read_or_wait
2024-05-14 06:03:57.226 UTC [11736:5] [unknown] LOG: !!!read_or_wait| before secure_raw_read(); PqGSSRecvLength: 0, len: 4
2024-05-14 06:03:57.226 UTC [11736:6] [unknown] LOG: !!!read_or_wait| after secure_raw_read: -1, errno: 10035
2024-05-14 06:03:57.226 UTC [11736:7] [unknown] LOG: !!!read_or_wait| before WaitLatchOrSocket()
2024-05-14 06:03:57.240 UTC [11736:8] [unknown] LOG: !!!read_or_wait| after WaitLatchOrSocket
2024-05-14 06:03:57.240 UTC [11736:9] [unknown] LOG: !!!read_or_wait| before secure_raw_read(); PqGSSRecvLength: 0, len: 4
2024-05-14 06:03:57.240 UTC [11736:10] [unknown] LOG: !!!read_or_wait| after secure_raw_read: 0, errno: 10035
2024-05-14 06:03:57.240 UTC [11736:11] [unknown] LOG: !!!read_or_wait| before WaitLatchOrSocket()
2024-05-14 06:03:57.240 UTC [11736:12] [unknown] LOG: !!!read_or_wait| after WaitLatchOrSocket
2024-05-14 06:03:57.240 UTC [11736:13] [unknown] LOG: !!!secure_open_gssapi| read_or_wait returned -1
2024-05-14 06:03:57.240 UTC [11736:14] [unknown] LOG: !!!ProcessStartupPacket| secure_open_gssapi() returned error
2024-05-14 06:03:57.240 UTC [11736:15] [unknown] LOG: !!!BackendInitialize| after ProcessStartupPacket
2024-05-14 06:03:57.240 UTC [11736:16] [unknown] LOG: !!!BackendInitialize| status: -1
2024-05-14 06:03:57.240 UTC [11736:17] [unknown] DEBUG: shmem_exit(0): 0 before_shmem_exit callbacks to make
2024-05-14 06:03:57.240 UTC [11736:18] [unknown] DEBUG: shmem_exit(0): 0 on_shmem_exit callbacks to make
2024-05-14 06:03:57.240 UTC [11736:19] [unknown] DEBUG: proc_exit(0): 1 callbacks to make
2024-05-14 06:03:57.240 UTC [11736:20] [unknown] DEBUG:  exit(0)
2024-05-14 06:03:57.240 UTC [11736:21] [unknown] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2024-05-14 06:03:57.240 UTC [11736:22] [unknown] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2024-05-14 06:03:57.240 UTC [11736:23] [unknown] DEBUG: proc_exit(-1): 0 callbacks to make
2024-05-14 06:03:57.243 UTC [4524:132] DEBUG:  forked new backend, pid=10536 socket=5540
2024-05-14 06:03:57.243 UTC [4524:133] DEBUG:  server process (PID 11736) exited with exit code 0

Best regards,
Alexander



Re: Why is citext/regress failing on hamerkop?

From:
Tom Lane
Date:
Alexander Lakhin <exclusion@gmail.com> writes:
> 13.05.2024 23:17, Tom Lane wrote:
>> 3. We don't know exactly why hamerkop suddenly started seeing these
>> failures, but a plausible theory emerges after noting that its
>> reported time for the successful "make check" step dropped pretty
>> substantially right when this started.  In the v13 branch, "make
>> check" was taking 2:18 or more in the several runs right before the
>> first isolationcheck failure, but 1:40 or less just after.  So it
>> looks like the animal was moved onto faster hardware.

> Yes, and one thing I can't explain yet, is why REL_14_STABLE+ timings
> substantially differ from REL_13_STABLE-, say, for the check stage:

As I mentioned in our off-list discussion, I have a lingering feeling
that this v14 commit could be affecting the results somehow:

Author: Tom Lane <tgl@sss.pgh.pa.us>
Branch: master Release: REL_14_BR [d5a9a661f] 2020-10-18 12:56:43 -0400

    Update the Winsock API version requested by libpq.
    
    According to Microsoft's documentation, 2.2 has been the current
    version since Windows 98 or so.  Moreover, that's what the Postgres
    backend has been requesting since 2004 (cf commit 4cdf51e64).
    So there seems no reason for libpq to keep asking for 1.1.

I didn't believe at the time that that'd have any noticeable effect,
but maybe it somehow made Winsock play a bit nicer with the GSS
support?

            regards, tom lane



Re: Why is citext/regress failing on hamerkop?

From:
Alexander Lakhin
Date:
14.05.2024 17:38, Tom Lane wrote:
> As I mentioned in our off-list discussion, I have a lingering feeling
> that this v14 commit could be affecting the results somehow:
>
> Author: Tom Lane <tgl@sss.pgh.pa.us>
> Branch: master Release: REL_14_BR [d5a9a661f] 2020-10-18 12:56:43 -0400
>
>      Update the Winsock API version requested by libpq.
>      
>      According to Microsoft's documentation, 2.2 has been the current
>      version since Windows 98 or so.  Moreover, that's what the Postgres
>      backend has been requesting since 2004 (cf commit 4cdf51e64).
>      So there seems no reason for libpq to keep asking for 1.1.
>
> I didn't believe at the time that that'd have any noticeable effect,
> but maybe it somehow made Winsock play a bit nicer with the GSS
> support?

Yes, probably, though maybe not nicer, as the test duration increased?
Still, I can't see the difference locally to check that commit.
I will try other VMs/configurations; maybe I can find a missing factor...

Best regards,
Alexander



Re: Why is citext/regress failing on hamerkop?

From:
Thomas Munro
Date:
On Tue, May 14, 2024 at 9:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> 14.05.2024 03:38, Thomas Munro wrote:
> > I was beginning to suspect that lingering odour myself.  I haven't
> > looked at the GSS code but I was imagining that what we have here is
> > perhaps not unsent data dropped on the floor due to linger policy
> > (unclean socket close on process exit), but rather that the server
> > didn't see the socket as ready to read because it lost track of the
> > FD_CLOSE somewhere because the client closed it gracefully, and our
> > server-side FD_CLOSE handling has always been a bit suspect.  I wonder
> > if the GSS code is somehow more prone to brokenness.  One thing we
> > learned in earlier problems was that abortive/error disconnections
> > generate FD_CLOSE repeatedly, while graceful ones give you only one.
> > In other words, if the other end politely calls closesocket(), the
> > server had better not miss the FD_CLOSE event, because it won't come
> > again.   That's what
> >
> > https://commitfest.postgresql.org/46/3523/
> >
> > is intended to fix.  Does it help here?  Unfortunately that's
> > unpleasantly complicated and unbackpatchable (keeping a side-table of
> > socket FDs and event handles, so we don't lose events between the
> > cracks).
>
> Yes, that cure helps here too. I've tested it on b282fa88d~1 (the last
> state when that patch set can be applied).

Thanks for checking, and generally for your infinite patience with all
these horrible Windows problems.

OK, so we know what the problem is here.  Here is the simplest
solution I know of for that problem.  I have proposed this in the past
and received negative feedback because it's a really gross hack.  But
I don't personally know what else to do about the back-branches (or
even if that complex solution is the right way forward for master).
The attached kludge at least has the [de]merit of being a mirror image
of the kludge that follows it for the "opposite" event.  Does this fix
it?
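
The shape of the idea, sketched here only as an illustration (the
attached patch is the authoritative version), is a pre-wait peek at the
socket so a graceful EOF can't slip through the cracks:

    #include <stdbool.h>
    #include <winsock2.h>

    /*
     * Before sleeping on WL_SOCKET_READABLE, peek at the socket: report it
     * readable if data is pending, or if the peer has already performed a
     * graceful shutdown whose one-and-only FD_CLOSE we may have consumed.
     */
    static bool
    socket_has_input_or_eof(SOCKET s)
    {
        char        c;
        int         rc = recv(s, &c, 1, MSG_PEEK);

        if (rc > 0)
            return true;        /* data is waiting */
        if (rc == 0)
            return true;        /* orderly EOF from the peer */
        /* rc < 0: only "would block" means there is truly nothing to see */
        return WSAGetLastError() != WSAEWOULDBLOCK;
    }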

Attachments

Re: Why is citext/regress failing on hamerkop?

From:
Alexander Lakhin
Date:
15.05.2024 01:26, Thomas Munro wrote:
> OK, so we know what the problem is here.  Here is the simplest
> solution I know of for that problem.  I have proposed this in the past
> and received negative feedback because it's a really gross hack.  But
> I don't personally know what else to do about the back-branches (or
> even if that complex solution is the right way forward for master).
> The attached kludge at least has the [de]merit of being a mirror image
> of the kludge that follows it for the "opposite" event.  Does this fix
> it?

Yes, I see that abandoned GSS connections are closed immediately, as
expected. I have also confirmed that `meson test` with the basic
configuration passes on REL_16_STABLE. So from the outside, the fix
looks good to me.

Thank you for working on this!

Best regards,
Alexander



Re: Why is citext/regress failing on hamerkop?

From:
Thomas Munro
Date:
On Wed, May 15, 2024 at 6:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> 15.05.2024 01:26, Thomas Munro wrote:
> > OK, so we know what the problem is here.  Here is the simplest
> > solution I know of for that problem.  I have proposed this in the past
> > and received negative feedback because it's a really gross hack.  But
> > I don't personally know what else to do about the back-branches (or
> > even if that complex solution is the right way forward for master).
> > The attached kludge at least has the [de]merit of being a mirror image
> > of the kludge that follows it for the "opposite" event.  Does this fix
> > it?
>
> Yes, I see that abandoned GSS connections are closed immediately, as
> expected. I have also confirmed that `meson test` with the basic
> configuration passes on REL_16_STABLE. So from the outside, the fix
> looks good to me.

Alright, unless anyone has an objection or ideas for improvements, I'm
going to go ahead and back-patch this slightly tidied up version
everywhere.

Attachments

Re: Why is citext/regress failing on hamerkop?

From:
Thomas Munro
Date:
On Thu, May 16, 2024 at 9:46 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Alright, unless anyone has an objection or ideas for improvements, I'm
> going to go ahead and back-patch this slightly tidied up version
> everywhere.

Of course as soon as I wrote that I thought of a useful improvement
myself: as far as I can tell, you only need to do the extra poll on
the first wait for WL_SOCKET_READABLE for any given WaitEventSet.  I
don't think it's needed for later waits done by long-lived
WaitEventSet objects, so we can track that with a flag.  That avoids
adding new overhead for regular backend socket waits after
authentication, it's just in these code paths that do a bunch of
WaitLatchOrSocket() calls that we need to consider FD_CLOSE events
lost between the cracks.

I also don't know if the condition should include "&& received == 0".
It probably doesn't make much difference, but by leaving that off we
don't have to wonder how peeking interacts with events, ie if it's a
problem that we didn't do the "reset" step.  Thinking about that, I
realised that I should probably set reset = true in this new return
path, just like the "normal" WL_SOCKET_READABLE notification path,
just to be paranoid.  (Programming computers you don't have requires
extra paranoia.)

Any chance you could test this version please Alexander?

Attachments

Re: Why is citext/regress failing on hamerkop?

From:
Thomas Munro
Date:
On Thu, May 16, 2024 at 10:43 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Any chance you could test this version please Alexander?

Sorry, cancel that.  v3 is not good.  I assume it fixes the GSSAPI
thing and is superficially better, but it doesn't handle code that
calls twice in a row and ignores the first result (I know that
PostgreSQL does that occasionally in a few places), and it's also
broken if someone gets recv() = 0 (EOF), and then decides to wait
anyway.  The only ways I can think of to get full reliable poll()-like
semantics is to do that peek every time, OR the complicated patch
(per-socket-workspace + intercepting recv etc).  So I'm back to v2.



Re: Why is citext/regress failing on hamerkop?

From:
Alexander Lakhin
Date:
Hello Thomas,

16.05.2024 04:32, Thomas Munro wrote:
> On Thu, May 16, 2024 at 10:43 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>> Any chance you could test this version please Alexander?
> Sorry, cancel that.  v3 is not good.  I assume it fixes the GSSAPI
> thing and is superficially better, but it doesn't handle code that
> calls twice in a row and ignores the first result (I know that
> PostgreSQL does that occasionally in a few places), and it's also
> broken if someone gets recv() = 0 (EOF), and then decides to wait
> anyway.  The only ways I can think of to get full reliable poll()-like
> semantics is to do that peek every time, OR the complicated patch
> (per-socket-workspace + intercepting recv etc).  So I'm back to v2.

I've tested v2 and can confirm that it works the same as v1: `vcregress check`
passes with no failures on REL_16_STABLE, and `meson test` with the basic
configuration passes too.

By the way, hamerkop is not configured to enable gssapi for HEAD [1] and
I could not enable gss locally yet (just passing extra_lib_dirs,
extra_include_dirs doesn't work for me).

It looks like we need to find a way to enable it for meson to continue
testing v17+ with GSS on Windows.

[1] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hamerkop&dt=2024-05-12%2011%3A00%3A28&stg=configure

Best regards,
Alexander



Re: Why is citext/regress failing on hamerkop?

From:
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> For citext_utf8, I pushed cff4e5a3.  Hamerkop runs infrequently, so
> here's hoping for 100% green on master by Tuesday or so.

Meanwhile, back at the ranch, it doesn't seem that changed anything:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamerkop&dt=2024-05-16%2011%3A00%3A32

... and now that I look more closely, the reason why it didn't
change anything is that hamerkop is still building 0294df2
on HEAD.  All its other branches are equally stuck at the
end of March.  So this is a flat-out-broken animal, and I
plan to just ignore it until its owner un-sticks it.
(In particular, I think we shouldn't be in a hurry to push
the patch discussed downthread.)

Andrew: maybe the buildfarm server could be made to flag
animals building exceedingly old commits?  This is the second
problem of this sort that I've noticed this month, and you
really have to look closely to realize it's happening.

            regards, tom lane



Re: Why is citext/regress failing on hamerkop?

From:
Andrew Dunstan
Date:
On 2024-05-16 Th 16:18, Tom Lane wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
>> For citext_utf8, I pushed cff4e5a3.  Hamerkop runs infrequently, so
>> here's hoping for 100% green on master by Tuesday or so.
> Meanwhile, back at the ranch, it doesn't seem that changed anything:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamerkop&dt=2024-05-16%2011%3A00%3A32
>
> ... and now that I look more closely, the reason why it didn't
> change anything is that hamerkop is still building 0294df2
> on HEAD.  All its other branches are equally stuck at the
> end of March.  So this is a flat-out-broken animal, and I
> plan to just ignore it until its owner un-sticks it.
> (In particular, I think we shouldn't be in a hurry to push
> the patch discussed downthread.)
>
> Andrew: maybe the buildfarm server could be made to flag
> animals building exceedingly old commits?  This is the second
> problem of this sort that I've noticed this month, and you
> really have to look closely to realize it's happening.
>
>             


Yeah, that should be doable. Since we have the git ref these days we 
should be able to mark it as old, or maybe just reject builds for very 
old commits (the latter would be easier).


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Why is citext/regress failing on hamerkop?

From:
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 2024-05-16 Th 16:18, Tom Lane wrote:
>> Andrew: maybe the buildfarm server could be made to flag
>> animals building exceedingly old commits?  This is the second
>> problem of this sort that I've noticed this month, and you
>> really have to look closely to realize it's happening.

> Yeah, that should be doable. Since we have the git ref these days we 
> should be able to mark it as old, or maybe just reject builds for very 
> old commits (the latter would be easier).

I'd rather have some visible status on the BF dashboard.  Invariably,
with a problem like this, the animal's owner is unaware there's a
problem.  If it's just silently not reporting, then no one else will
notice either, and we effectively lose an animal (despite it still
burning electricity to perform those rejected runs).

            regards, tom lane



Re: Why is citext/regress failing on hamerkop?

From:
Andrew Dunstan
Date:
On 2024-05-16 Th 17:15, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> On 2024-05-16 Th 16:18, Tom Lane wrote:
>>> Andrew: maybe the buildfarm server could be made to flag
>>> animals building exceedingly old commits?  This is the second
>>> problem of this sort that I've noticed this month, and you
>>> really have to look closely to realize it's happening.
>> Yeah, that should be doable. Since we have the git ref these days we
>> should be able to mark it as old, or maybe just reject builds for very
>> old commits (the latter would be easier).
> I'd rather have some visible status on the BF dashboard.  Invariably,
> with a problem like this, the animal's owner is unaware there's a
> problem.  If it's just silently not reporting, then no one else will
> notice either, and we effectively lose an animal (despite it still
> burning electricity to perform those rejected runs).
>
>             


Fair enough. That will mean some database changes and other stuff, so it 
will take a bit longer.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Why is citext/regress failing on hamerkop?

From:
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 2024-05-16 Th 17:15, Tom Lane wrote:
>> I'd rather have some visible status on the BF dashboard.  Invariably,
>> with a problem like this, the animal's owner is unaware there's a
>> problem.  If it's just silently not reporting, then no one else will
>> notice either, and we effectively lose an animal (despite it still
>> burning electricity to perform those rejected runs).

> Fair enough. That will mean some database changes and other stuff, so it 
> will take a bit longer.

Sure, I don't think it's urgent.

            regards, tom lane



Re: Why is citext/regress failing on hamerkop?

From:
Andrew Dunstan
Date:
On 2024-05-16 Th 17:34, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> On 2024-05-16 Th 17:15, Tom Lane wrote:
>>> I'd rather have some visible status on the BF dashboard.  Invariably,
>>> with a problem like this, the animal's owner is unaware there's a
>>> problem.  If it's just silently not reporting, then no one else will
>>> notice either, and we effectively lose an animal (despite it still
>>> burning electricity to perform those rejected runs).
>> Fair enough. That will mean some database changes and other stuff, so it
>> will take a bit longer.
> Sure, I don't think it's urgent.


I've pushed a small change that should just mark with an asterisk any
gitref that is more than 2 days older than the tip of the branch at the
time of reporting.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Why is citext/regress failing on hamerkop?

From:
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> I've pushed a small change, that should just mark with an asterisk any 
> gitref that is more than 2 days older than the tip of the branch at the 
> time of reporting.

Thanks!

            regards, tom lane