Re: REL_15_STABLE: pgbench tests randomly failing on CI, Windows only

From: Noah Misch
Subject: Re: REL_15_STABLE: pgbench tests randomly failing on CI, Windows only
Msg-id: 20231009022529.f3@rfd.leadboat.com
In reply to: REL_15_STABLE: pgbench tests randomly failing on CI, Windows only  (Thomas Munro <thomas.munro@gmail.com>)
Responses: Re: REL_15_STABLE: pgbench tests randomly failing on CI, Windows only  (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-hackers

On Mon, Sep 04, 2023 at 03:18:40PM +1200, Thomas Munro wrote:
> Somehow these tests have recently become unstable and have failed a few times:
> 
> https://github.com/postgres/postgres/commits/REL_15_STABLE
> 
> The failures are like:
> 
> [22:32:26.722] # Failed test 'pgbench simple update stdout
> /(?^:builtin: simple update)/'
> [22:32:26.722] # at t/001_pgbench_with_server.pl line 119.
> [22:32:26.722] # 'pgbench (15.4)
> [22:32:26.722] # '
> [22:32:26.722] # doesn't match '(?^:builtin: simple update)'

Fun.  That's a test of "pgbench -C".  The test harness isn't reporting
pgbench's stderr, so I hacked things to get that and the actual file
descriptor values being assigned.  The failure case gets "pgbench: error: too
many client connections for select()" in stderr, from this pgbench.c function:

static void
add_socket_to_set(socket_set *sa, int fd, int idx)
{
    if (fd < 0 || fd >= FD_SETSIZE)
    {
        /*
         * Doing a hard exit here is a bit grotty, but it doesn't seem worth
         * complicating the API to make it less grotty.
         */
        pg_fatal("too many client connections for select()");
    }
    FD_SET(fd, &sa->fds);
    if (fd > sa->maxfd)
        sa->maxfd = fd;
}

The "fd >= FD_SETSIZE" check is irrelevant on Windows.  See comments in the
attached patch; in brief, Windows assigns FDs and uses FD_SETSIZE differently.
The first associated failure was commit dea12a1 (2023-08-03); as a doc commit,
it's an innocent victim.  Bisect blamed 8488bab "ci: Use windows VMs instead
of windows containers" (2023-02), long before the failures began.  I'll guess
some 2023-08 Windows update or reconfiguration altered file descriptor
assignment, hence the onset of failures.  In my tests of v16, the highest file
descriptor was 948.  I could make v16 fail by changing --client=5 to
--client=90 in the test.  With the attached patch and --client=90, v16 peaked
at file descriptor 2040.
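
To illustrate the difference, here is a minimal, self-contained sketch. It is not the attached patch and it does not use the real fd_set; the names DEMO_SETSIZE, bitmap_set and array_set are hypothetical stand-ins. It models a POSIX-style set as one bit per descriptor value, and a Winsock-style set as a counted array of handles, which is how I understand the two layouts. In the first model the limit applies to the descriptor's numeric value; in the second it applies only to how many sockets are in the set.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for FD_SETSIZE; 1024 is only an example value. */
#define DEMO_SETSIZE 1024

/* POSIX-style model: one bit per descriptor value. */
typedef struct
{
    unsigned char bits[DEMO_SETSIZE / 8];
} bitmap_set;

/* Winsock-style model: a counted array of socket handles. */
typedef struct
{
    unsigned int  count;
    unsigned long handles[DEMO_SETSIZE];
} array_set;

/* Fails when the descriptor VALUE does not fit in the bitmap. */
static bool
bitmap_set_add(bitmap_set *s, long fd)
{
    if (fd < 0 || fd >= DEMO_SETSIZE)
        return false;
    s->bits[fd / 8] |= (unsigned char) (1u << (fd % 8));
    return true;
}

/*
 * Fails only when the NUMBER of members reaches the limit; the handle's
 * numeric value is irrelevant.  (This sketch skips the duplicate check
 * that a real implementation would want.)
 */
static bool
array_set_add(array_set *s, unsigned long handle)
{
    if (s->count >= DEMO_SETSIZE)
        return false;
    s->handles[s->count++] = handle;
    return true;
}
```

Under this model, a descriptor value of 948 fits in the bitmap and 2040 does not, while the array-style set accepts 2040 without complaint because it holds only one member. That is why a value-based check can reject a connection on Windows even though the set has plenty of room.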

Thanks,
nm

P.S. Later, we should change test code so the pgbench stderr can't grow an
extra line without that line appearing in test logs.
