Re: conchuela timeouts since 2021-10-09 system upgrade

From: Tom Lane
Subject: Re: conchuela timeouts since 2021-10-09 system upgrade
Date:
Msg-id: 52098.1635228234@sss.pgh.pa.us
In reply to: Re: conchuela timeouts since 2021-10-09 system upgrade (Noah Misch <noah@leadboat.com>)
Responses: Re: conchuela timeouts since 2021-10-09 system upgrade (Noah Misch <noah@leadboat.com>)
List: pgsql-bugs
Noah Misch <noah@leadboat.com> writes:
> On Mon, Oct 25, 2021 at 04:59:42PM -0400, Tom Lane wrote:
>> What I think we should do in these two tests is nuke the use of
>> background_pgbench entirely; that looks like a solution in search
>> of a problem, and it seems unnecessary here.  Why not run
>> the DROP/CREATE/bt_index_check transaction as one of three script
>> options in the main pgbench run?

> The author tried that and got deadlocks:
> https://postgr.es/m/5E041A70-4946-489C-9B6D-764DF627A92D@yandex-team.ru

Hmm, I guess that's because two concurrent CICs can deadlock against each
other.  I wonder if we could fix that ... or maybe we could teach pgbench
that it mustn't launch more than one instance of that script?  Or more
practically, use advisory locks in that script to enforce that only one
runs at once.
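
Concretely, such a script might look like this (a sketch only: the lock key
42 is arbitrary, and the lock has to be session-level rather than
transaction-level, since CREATE INDEX CONCURRENTLY can't run inside a
transaction block):

    SELECT pg_advisory_lock(42);     -- serialize: only one instance proceeds
    DROP INDEX CONCURRENTLY idx;
    CREATE INDEX CONCURRENTLY idx ON tbl(i);
    SELECT bt_index_check('idx', true);
    SELECT pg_advisory_unlock(42);   -- release for the next taker

pgbench picks at random among the scripts it's given for each transaction,
so the other script options would proceed unimpeded while one client holds
the lock.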

> On prairiedog, the proximate trouble is pgbench getting stuck.  IPC::Run is
> behaving normally given a stuck pgbench.  When pgbench stops sending queries,
> does pg_stat_activity show anything at all running?  If so, are those backends
> waiting on locks?  If not, what's the pgbench stack trace at that time?

The connected backend is idle, waiting for client input:

#0  0x9002ec88 in kevent ()
#1  0x003e3fa8 in WaitEventSetWait (set=0x3802258, timeout=-1, occurred_events=0xbfffe098, nevents=1, wait_event_info=0) at latch.c:1601
#2  0x0028d6a0 in secure_read (port=0x2501f40, ptr=0x7ad1d0, len=8192) at be-secure.c:186
#3  0x002958a4 in pq_recvbuf () at pqcomm.c:957
#4  0x00295994 in pq_getbyte () at pqcomm.c:1000
#5  0x004196a0 in PostgresMain

The last few entries in the postmaster log are:

2021-10-26 01:19:31.122 EDT [1013] 002_cic.pl LOG:  statement: DROP INDEX CONCURRENTLY idx;
2021-10-26 01:19:31.136 EDT [1013] 002_cic.pl LOG:  statement: CREATE INDEX CONCURRENTLY idx ON tbl(i);
2021-10-26 01:19:31.167 EDT [1013] 002_cic.pl LOG:  statement: SELECT bt_index_check('idx',true);

which is consistent with pg_stat_activity:

-[ RECORD 1 ]----+-----------------------------------
datid            | 13019
datname          | postgres
pid              | 1013
leader_pid       |
usesysid         | 10
usename          | tgl
application_name | 002_cic.pl
client_addr      |
client_hostname  |
client_port      | -1
backend_start    | 2021-10-26 01:19:18.49282-04
xact_start       |
query_start      | 2021-10-26 01:19:31.167508-04
state_change     | 2021-10-26 01:19:31.190256-04
wait_event_type  | Client
wait_event       | ClientRead
state            | idle
backend_xid      |
backend_xmin     |
query_id         |
query            | SELECT bt_index_check('idx',true);
backend_type     | client backend
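
For reference, a query along these lines reproduces that (a reconstruction,
not necessarily the exact invocation):

    -- list client backends and what, if anything, each is waiting on
    SELECT pid, state, wait_event_type, wait_event, query
      FROM pg_stat_activity
     WHERE backend_type = 'client backend';

A wait_event_type of 'Lock' there would have meant a lock wait; ClientRead
just means the backend is waiting for the next protocol message.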

The pgbench process is sitting here:

#0  0x9001f488 in select ()
#1  0x0000758c in wait_on_socket_set (sa=0x300bb0, usecs=0) at pgbench.c:7098
#2  0x0000d514 in threadRun (arg=0x300420) at pgbench.c:6776
#3  0x00010c44 in main (argc=-1635084506, argv=0x0) at pgbench.c:6540

which means it thinks it's waiting for input.  The PGconn object is in
this state:

  status = CONNECTION_OK,
  asyncStatus = PGASYNC_BUSY,
  xactStatus = PQTRANS_IDLE,
  ...
  inBuffer = 0x6a000 "T",
  inBufSize = 16384,
  inStart = 0,
  inCursor = 0,
  inEnd = 0,
  outBuffer = 0x6e000 "Q",
  outBufSize = 16384,
  outCount = 0,
  outMsgStart = 1,
  outMsgEnd = 34,

The inBuffer is logically empty, but its contents look like the last input
was the result of the "SELECT bt_index_check" command, consistent with
the postmaster log.  The outBuffer is likewise logically empty, but what
it contains appears to be the next DROP INDEX command:

(gdb) x/64c state->con->outBuffer
0x6e000:        81 'Q'  0 '\0'  0 '\0'  0 '\0'  33 '!'  68 'D'  82 'R'  79 'O'
0x6e008:        80 'P'  32 ' '  73 'I'  78 'N'  68 'D'  69 'E'  88 'X'  32 ' '
0x6e010:        67 'C'  79 'O'  78 'N'  67 'C'  85 'U'  82 'R'  82 'R'  69 'E'
0x6e018:        78 'N'  84 'T'  76 'L'  89 'Y'  32 ' '  105 'i' 100 'd' 120 'x'
0x6e020:        59 ';'  0 '\0'  114 'r' 117 'u' 101 'e' 41 ')'  59 ';'  0 '\0'
0x6e028:        108 'l' 40 '('  105 'i' 41 ')'  59 ';'  0 '\0'  95 '_'  110 'n'
0x6e030:        97 'a'  109 'm' 101 'e' 0 '\0'  48 '0'  48 '0'  50 '2'  95 '_'
0x6e038:        99 'c'  105 'i' 99 'c'  46 '.'  112 'p' 108 'l' 0 '\0'  0 '\0'

So what we have is that libpq thinks it's sent the next DROP INDEX,
but the backend hasn't seen it.

It's fairly hard to blame that state of affairs on the IPC::Run harness.
I'm wondering if we might be looking at some timing-dependent corner-case
bug in the new libpq pipelining code.  Pipelining isn't enabled:

  pipelineStatus = PQ_PIPELINE_OFF,

but that doesn't mean that the pipelining code hasn't been anywhere
near this command.  I can see

  cmd_queue_head = 0x300d40,
  cmd_queue_tail = 0x300d40,
  cmd_queue_recycle = 0x0,

(gdb) p *state->con->cmd_queue_head
$4 = {
  queryclass = PGQUERY_SIMPLE,
  query = 0x3004e0 "DROP INDEX CONCURRENTLY idx;",
  next = 0x0
}

The trouble with this theory, of course, is "if libpq is busted, why is
only this test case showing it?".  But AFAICS it would take some pretty
spooky action-at-a-distance for the Perl harness to have caused this.

            regards, tom lane


