Re: REL_15_STABLE: pgbench tests randomly failing on CI, Windows only

Поиск
Список
Период
Сортировка
От Noah Misch
Тема Re: REL_15_STABLE: pgbench tests randomly failing on CI, Windows only
Дата
Msg-id 20231009040846.5e@rfd.leadboat.com
обсуждение исходный текст
Ответ на Re: REL_15_STABLE: pgbench tests randomly failing on CI, Windows only  (Thomas Munro <thomas.munro@gmail.com>)
Ответы Re: REL_15_STABLE: pgbench tests randomly failing on CI, Windows only  ("David G. Johnston" <david.g.johnston@gmail.com>)
Список pgsql-hackers
On Mon, Oct 09, 2023 at 04:22:52PM +1300, Thomas Munro wrote:
> On Mon, Oct 9, 2023 at 3:25 PM Noah Misch <noah@leadboat.com> wrote:
> > The "fd >= FD_SETSIZE" check is irrelevant on Windows.  See comments in the
> > attached patch; in brief, Windows assigns FDs and uses FD_SETSIZE differently.
> > The first associated failure was commit dea12a1 (2023-08-03); as a doc commit,
> > it's an innocent victim.  Bisect blamed 8488bab "ci: Use windows VMs instead
> > of windows containers" (2023-02), long before the failures began.  I'll guess
> > some 2023-08 Windows update or reconfiguration altered file descriptor
> > assignment, hence the onset of failures.  In my tests of v16, the highest file
> > descriptor was 948.  I could make v16 fail by changing --client=5 to
> > --client=90 in the test.  With the attached patch and --client=90, v16 peaked
> > at file descriptor 2040.
> 
> Ohhh.  Thanks for figuring that out.  I'd never noticed that quirk.  I
> didn't/can't test it but the patch looks reasonable after reading the
> referenced docs.

For what it's worth, I did all that testing through CI, using hacks like
https://cirrus-ci.com/task/5352974499708928 to get the relevant information.
I didn't bother trying non-CI, since buildfarm animals aren't failing.

> Maybe instead of just "out of range" I'd say "out of
> range for select()" since otherwise it might seem a little baffling --
> what range?

I was trying to follow this from the style guide:

  Avoid mentioning called function names, either; instead say what the code was trying to do:
  BAD:    open() failed: %m
  BETTER: could not open file %s: %m

I didn't think of any phrasing that clearly explained things without the
reader consulting the code.  I considered these:

  "socket file descriptor out of range: %d" [what range?]
  "socket file descriptor out of range for select(): %d" [style guide violation]
  "socket file descriptor out of range for checking whether ready for reading: %d" [?]
  "socket file descriptor out of range for synchronous I/O multiplexing: %d" [term from POSIX]

> Random water cooler speculation about future ideas:  I wonder
> whether/when we can eventually ditch select() and use WSAPoll() for
> this on Windows, which is supposed to be like poll().  There's a
> comment explaining that we prefer select() because it has a higher
> resolution sleep than poll() (us vs ms), so we wouldn't want to use
> poll() on Unixen, but we know that Windows doesn't even use high
> resolution timers for any user space APIs anyway so that's just not a
> concern on that platform.  The usual reason nobody ever uses WSAPoll()
> is because the Curl guys told[1] everyone that it's terrible in a way
> that would quite specifically break our usage.  But I wonder, because
> the documentation now says "As of Windows 10 version 2004, when a TCP
> socket fails to connect, (POLLHUP | POLLERR | POLLWRNORM) is
> indicated", it *sounds* like it might have been fixed ~3 years ago?
> One idea would be to hide it inside WaitEventSet, and let WaitEventSet
> be used in front end code (we couldn't use the
> WaitForMultipleObjects() version because it's hard-limited to 64
> events, but in front end code we don't need latches and other stuff,
> so we could have a sockets-only version with WSAPoll()).  The idea of
> using WaitEventSet for pgbench on Unix was already mentioned a couple
> of times by others for general scalability reasons -- that way we
> could finish up using a better kernel interface on all supported
> platforms.  We'd probably want to add high resolution time-out support
> (I already posted a patch for that somewhere), or we'd be back to 1ms
> timeouts.
> 
> [1] https://daniel.haxx.se/blog/2012/10/10/wsapoll-is-broken/

Interesting.  I have no current knowledge to add there.



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Thomas Munro
Дата:
Сообщение: Re: REL_15_STABLE: pgbench tests randomly failing on CI, Windows only
Следующее
От: David Rowley
Дата:
Сообщение: Re: pg16: XX000: could not find pathkey item to sort