Re: conchuela timeouts since 2021-10-09 system upgrade

From Noah Misch
Subject Re: conchuela timeouts since 2021-10-09 system upgrade
Date
Msg-id 20211026134500.GA128912@rfd.leadboat.com
In response to Re: conchuela timeouts since 2021-10-09 system upgrade  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: conchuela timeouts since 2021-10-09 system upgrade  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: conchuela timeouts since 2021-10-09 system upgrade  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
On Tue, Oct 26, 2021 at 02:03:54AM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On Mon, Oct 25, 2021 at 04:59:42PM -0400, Tom Lane wrote:
> >> What I think we should do in these two tests is nuke the use of
> >> background_pgbench entirely; that looks like a solution in search
> >> of a problem, and it seems unnecessary here.  Why not run
> >> the DROP/CREATE/bt_index_check transaction as one of three script
> >> options in the main pgbench run?
> 
> > The author tried that and got deadlocks:
> > https://postgr.es/m/5E041A70-4946-489C-9B6D-764DF627A92D@yandex-team.ru
> 
> Hmm, I guess that's because two concurrent CICs can deadlock against each
> other.  I wonder if we could fix that ... or maybe we could teach pgbench
> that it mustn't launch more than one instance of that script?

Both sound doable, but I don't expect either to fix prairiedog's trouble.

> Or more
> practically, use advisory locks in that script to enforce that only one
> runs at once.

The author did try that.

> So what we have is that libpq thinks it's sent the next DROP INDEX,
> but the backend hasn't seen it.

Thanks for isolating that.

> It's fairly hard to blame that state of affairs on the IPC::Run harness.
> I'm wondering if we might be looking at some timing-dependent corner-case
> bug in the new libpq pipelining code.  Pipelining isn't enabled:
> 
>   pipelineStatus = PQ_PIPELINE_OFF, 
> 
> but that doesn't mean that the pipelining code hasn't been anywhere
> near this command.  I can see
> 
>   cmd_queue_head = 0x300d40, 
>   cmd_queue_tail = 0x300d40, 
>   cmd_queue_recycle = 0x0, 
> 
> (gdb) p *state->con->cmd_queue_head
> $4 = {
>   queryclass = PGQUERY_SIMPLE, 
>   query = 0x3004e0 "DROP INDEX CONCURRENTLY idx;", 
>   next = 0x0
> }
> 
> The trouble with this theory, of course, is "if libpq is busted, why is
> only this test case showing it?".

Agreed, it's not clear how the new tests would reveal a libpq bug that
src/bin/pgbench/t/001_pgbench_with_server.pl has been unable to reveal.  Does
the problem reproduce on v13?

Grasping at straws, background_pgbench does differ by specifying stdin as a
ref to an empty scalar.  I think that makes IPC::Run open a pipe and never
write to it.  The older pgbench tests don't override stdin, so I think that
makes pgbench inherit the original stdin.  Given your pgbench stack trace,
this seems awfully unlikely to be the relevant difference.  If we run out of
ideas, you could try some runs with that difference removed:

--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2110,7 +2110,7 @@ sub background_pgbench
     # IPC::Run would otherwise append to existing contents:
     $$stdout = "" if ref($stdout);
 
-    my $harness = IPC::Run::start \@cmd, '<', \$stdin, '>', $stdout, '2>&1',
+    my $harness = IPC::Run::start \@cmd, '>', $stdout, '2>&1',
       $timer;
 
     return $harness;
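
For readers unfamiliar with the distinction drawn above, here is a minimal sketch of the two stdin setups. It is a Python analogue, not IPC::Run itself: it assumes only that "a ref to an empty scalar" gives the child a pipe on stdin, while omitting the redirection lets the child inherit the parent's stdin. The helper name child_stdin_is_pipe is hypothetical.

```python
import subprocess
import sys

# Child program: print whether its own stdin (fd 0) is a pipe (FIFO).
CHILD = "import os, stat; print(stat.S_ISFIFO(os.fstat(0).st_mode))"


def child_stdin_is_pipe(give_pipe: bool) -> bool:
    """Spawn a child and report whether it sees a pipe on stdin.

    give_pipe=True  : the parent creates a pipe for the child's stdin and
                      sends nothing on it (rough analogue of passing a
                      reference to an empty scalar).
    give_pipe=False : the child inherits whatever stdin the parent has
                      (rough analogue of not overriding stdin).
    """
    stdin = subprocess.PIPE if give_pipe else None
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdin=stdin,
        stdout=subprocess.PIPE,
        text=True,
    )
    out, _ = proc.communicate()  # closes the stdin pipe without writing data
    return out.strip() == "True"


if __name__ == "__main__":
    print("explicit pipe :", child_stdin_is_pipe(True))
    print("inherited     :", child_stdin_is_pipe(False))
```

The inherited case depends on how the parent itself was started, so its result varies between a terminal, cron, and a buildfarm run. That variability is the only point of the sketch; it says nothing about whether the difference matters for the hang.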

> But AFAICS it would take some pretty
> spooky action-at-a-distance for the Perl harness to have caused this.

Agreed.  We'll have to consider the harness innocent for the moment.


