Re: conchuela timeouts since 2021-10-09 system upgrade

From: Tom Lane
Subject: Re: conchuela timeouts since 2021-10-09 system upgrade
Msg-id: 186950.1635296017@sss.pgh.pa.us
In reply to: Re: conchuela timeouts since 2021-10-09 system upgrade  (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-bugs
Thomas Munro <thomas.munro@gmail.com> writes:
> On Wed, Oct 27, 2021 at 3:29 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> [...] I'd be prepared to believe that prairiedog's
>> ancient macOS version has some weird bug preventing kevent() from noticing
>> available data ... but (a) surely conchuela wouldn't share such a bug,
>> and (b) we've been using kevent() for a couple years now, so how come
>> we didn't see this before?

> There was this case soon after our kqueue support landed:
> https://www.postgresql.org/message-id/CA%2BhUKGLzaR5cV0EmZWoVXJDO_XwZpmpQX_sYwCBRE1qLBEcGPQ%40mail.gmail.com

Oh yeah ... that looks like the exact same thing, doesn't it?

I've just finished doing 500 cycles of amcheck's 002_cic.pl without a
failure after recompiling latch.c with -DWAIT_USE_POLL.  With the
kevent code path the failure is hard to reproduce, but not so hard
that 500 clean cycles are plausibly just luck.  So unless we can learn
something from the DragonFly case, I'm inclined to write this off as
an ancient-macOS bug and start using -DWAIT_USE_POLL on prairiedog.
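
(For anyone repeating the experiment: latch.c picks its readiness
primitive with a preprocessor block along these lines.  This is a
paraphrase from memory, not a verbatim quote, and the exact HAVE_*
configure symbols may be spelled differently; the point is that a
manual -DWAIT_USE_POLL in CPPFLAGS lands in the first branch and
suppresses the kqueue autodetection on BSD-ish platforms:)

/* Sketch of latch.c's wait-implementation selection (paraphrase).
 * An explicit -DWAIT_USE_POLL satisfies the first test, so the
 * kqueue default is never chosen. */
#if defined(WAIT_USE_EPOLL) || defined(WAIT_USE_POLL) || \
	defined(WAIT_USE_KQUEUE) || defined(WAIT_USE_WIN32)
/* don't overwrite manual choice */
#elif defined(HAVE_SYS_EPOLL_H)
#define WAIT_USE_EPOLL
#elif defined(HAVE_KQUEUE)
#define WAIT_USE_KQUEUE
#elif defined(HAVE_POLL)
#define WAIT_USE_POLL
#else
#error "no wait set implementation available"
#endif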

> I don't have any ideas about conchuela.  For the next person who
> manages to reproduce this, just to sanity-check what we're passing in
> to kevent(), what do *port and waitfor look like when secure_read()
> blocks in WaitEventSetWait?

Not sure which variables you're referring to there, but here's a more
complete gdb dump from the stuck backend:

(gdb) bt
#0  0x9002ec88 in kevent ()
#1  0x003e3fa8 in WaitEventSetWait (set=0x3802258, timeout=-1, occurred_events=0xbfffe098, nevents=1, wait_event_info=0) at latch.c:1601
#2  0x0028d6a0 in secure_read (port=0x2501f40, ptr=0x7ad1d0, len=8192) at be-secure.c:186
#3  0x002958a4 in pq_recvbuf () at pqcomm.c:957
#4  0x00295994 in pq_getbyte () at pqcomm.c:1000
#5  0x004196a0 in PostgresMain (dbname=0x3805894 "", username=0x789020 "...") at postgres.c:353
#6  0x00363df8 in PostmasterMain (argc=8059652, argv=0x61789299) at postmaster.c:4560
#7  0x00298b6c in main (argc=4, argv=0x2500a30) at main.c:198
(gdb) f 1
#1  0x003e3fa8 in WaitEventSetWait (set=0x3802258, timeout=-1, occurred_events=0xbfffe098, nevents=1, wait_event_info=0) at latch.c:1601
1601            rc = kevent(set->kqueue_fd, NULL, 0,
(gdb) i locals
cur_event = (WaitEvent *) 0x78ed40
cur_kqueue_event = (struct kevent *) 0x0
timeout = {
  tv_sec = 13,
  tv_nsec = 0
}
rc = 4
timeout_p = (struct timespec *) 0x0
rc = 4
returned_events = 4
start_time = {
  tv_sec = -1879041144,
  tv_usec = 58742932
}
cur_time = {
  tv_sec = 59028992,
  tv_usec = 2
}
cur_timeout = -1
(gdb) p *set
$1 = {
  nevents = 3,
  nevents_space = 3,
  events = 0x380227c,
  latch = 0xc954ebc,
  latch_pos = 1,
  exit_on_postmaster_death = 0 '\0',
  kqueue_fd = 4,
  kqueue_ret_events = 0x38022ac,
  report_postmaster_not_running = 0 '\0'
}
(gdb) p set->events[0]
$2 = {
  pos = 0,
  events = 2,
  fd = 7,      # matches MyProcPort->sock
  user_data = 0x0
}
(gdb) p set->events[1]
$3 = {
  pos = 1,
  events = 1,
  fd = -1,
  user_data = 0x0
}
(gdb) p set->events[2]
$4 = {
  pos = 2,
  events = 16,
  fd = 3,
  user_data = 0x0
}
(gdb) p set->kqueue_ret_events[0]
$6 = {
  ident = 7,       # matches MyProcPort->sock
  filter = -1,     # EVFILT_READ
  flags = 1,       # EV_ADD
  fflags = 0,
  data = 40,
  udata = 0x380227c
}
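
(Decoding the "events" fields above against latch.h's flag bits,
quoting the usual definitions from memory:)

/* WaitEvent.events flag bits from latch.h (abridged): */
#define WL_LATCH_SET         (1 << 0)	/* = 1,  set->events[1]: process latch */
#define WL_SOCKET_READABLE   (1 << 1)	/* = 2,  set->events[0]: client socket */
#define WL_SOCKET_WRITEABLE  (1 << 2)	/* = 4 */
#define WL_TIMEOUT           (1 << 3)	/* = 8 */
#define WL_POSTMASTER_DEATH  (1 << 4)	/* = 16, set->events[2]: postmaster pipe */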

So AFAICS, the data we're passing to kevent() is all sane.

It then occurred to me to replicate the case I saw this morning
(control-C'ing the test script) and ktrace all the involved processes.
The perl test script and the pgbench process both just immediately
die on SIGINT:

 25500 pgbench  RET   select -1 errno 4 Interrupted system call
 25500 pgbench  PSIG  SIGINT SIG_DFL

so they are clearly not doing anything to cause the DROP command
to get delivered.  Nonetheless, what the backend sees post-control-C is

 25502 postgres RET   kevent 1
 25502 postgres CALL  recvfrom(0x7,0x7ad1d0,0x2000,0,0,0)
 25502 postgres GIO   fd 7 wrote 34 bytes
       "Q\0\0\0!DROP INDEX CONCURRENTLY idx;\0"
 25502 postgres RET   recvfrom 34/0x22
 ... blah, blah, executing the command ...
 25502 postgres CALL  sendto(0x7,0x241d020,0x16,0,0,0)
 25502 postgres RET   sendto -1 errno 32 Broken pipe
 ... blah, blah, shutting down ...

It's really hard to look at this and not call it a kernel bug.
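
(For anyone who wants to poke at the kernel directly, the pattern at
issue boils down to the standalone sketch below.  This is illustrative
code, not anything from the PostgreSQL tree: it registers EVFILT_READ
on a socket that already has data pending and then blocks in kevent()
with no timeout, which per the ktrace above is exactly the state the
stuck backend is in.  A correct kernel must return immediately here:)

#include <sys/types.h>
#include <sys/event.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Block until "sock" is readable, the same way latch.c's kqueue path
 * waits: register the fd, then wait with a NULL (infinite) timeout. */
static void
wait_readable(int sock)
{
	int			kq = kqueue();
	struct kevent ch;
	struct kevent ev;
	int			rc;

	if (kq < 0)
	{
		perror("kqueue");
		exit(1);
	}

	EV_SET(&ch, sock, EVFILT_READ, EV_ADD, 0, 0, NULL);
	if (kevent(kq, &ch, 1, NULL, 0, NULL) < 0)
	{
		perror("kevent register");
		exit(1);
	}

	/* Equivalent of WaitEventSetWait with timeout = -1. */
	rc = kevent(kq, NULL, 0, &ev, 1, NULL);
	if (rc == 1)
		printf("readable, %ld bytes pending\n", (long) ev.data);
	else
		perror("kevent wait");

	close(kq);
}

int
main(void)
{
	int			sv[2];

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
	{
		perror("socketpair");
		return 1;
	}

	/* Make data available *before* we wait, mirroring the reported
	 * failure mode where kevent() sits on an already-readable fd. */
	if (write(sv[1], "x", 1) != 1)
	{
		perror("write");
		return 1;
	}

	wait_readable(sv[0]);
	return 0;
}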

            regards, tom lane


