Re: conchuela timeouts since 2021-10-09 system upgrade
From:        Tom Lane
Subject:     Re: conchuela timeouts since 2021-10-09 system upgrade
Date:
Msg-id:      186950.1635296017@sss.pgh.pa.us
In reply to: Re: conchuela timeouts since 2021-10-09 system upgrade (Thomas Munro <thomas.munro@gmail.com>)
List:        pgsql-bugs
Thomas Munro <thomas.munro@gmail.com> writes:
> On Wed, Oct 27, 2021 at 3:29 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> [...] I'd be prepared to believe that prairiedog's
>> ancient macOS version has some weird bug preventing kevent() from noticing
>> available data ... but (a) surely conchuela wouldn't share such a bug,
>> and (b) we've been using kevent() for a couple years now, so how come
>> we didn't see this before?

> There was this case soon after our kqueue support landed:
> https://www.postgresql.org/message-id/CA%2BhUKGLzaR5cV0EmZWoVXJDO_XwZpmpQX_sYwCBRE1qLBEcGPQ%40mail.gmail.com

Oh yeah ... that looks like the exact same thing, doesn't it?

I've just finished doing 500 cycles of amcheck's 002_cic.pl without a
failure after recompiling latch.c with -DWAIT_USE_POLL.  With the kevent
code path, it is hard to reproduce but not that hard.  So unless we can
learn something from the DragonFly case, I'm inclined to write this off
as an ancient-macOS bug and start using -DWAIT_USE_POLL on prairiedog.

> I don't have any ideas about conchuela.  For the next person who
> manages to reproduce this, just to sanity-check what we're passing in
> to kevent(), what do *port and waitfor look like when secure_read()
> blocks in WaitEventSetWait?
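[Editorial sketch, not PostgreSQL source: the -DWAIT_USE_POLL build discussed above makes latch.c wait with poll() rather than kevent(). Both APIs answer the same question, "is this fd readable yet?"; the hypothetical helper below shows the poll() form of the wait that the stuck backend is blocked in.]

```python
# Hypothetical sketch (not latch.c itself) of a poll()-based socket wait,
# roughly what -DWAIT_USE_POLL switches WaitEventSetWait over to.
import select
import socket

def wait_for_readable(fd, timeout_ms):
    """Block until fd is readable, loosely like waiting on WL_SOCKET_READABLE."""
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    # Returns a list of (fd, eventmask) pairs; empty list on timeout.
    return poller.poll(timeout_ms)

# A socketpair stands in for the backend/client connection (fd 7 in the dump).
backend, client = socket.socketpair()
client.sendall(b"Q\x00\x00\x00!DROP INDEX CONCURRENTLY idx;\x00")
ready = wait_for_readable(backend.fileno(), 1000)
assert ready and ready[0][1] & select.POLLIN  # data is pending; poll() sees it
```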
Not sure which variables you're referring to there, but here's a more
complete gdb dump from the stuck backend:

(gdb) bt
#0  0x9002ec88 in kevent ()
#1  0x003e3fa8 in WaitEventSetWait (set=0x3802258, timeout=-1, occurred_events=0xbfffe098, nevents=1, wait_event_info=0) at latch.c:1601
#2  0x0028d6a0 in secure_read (port=0x2501f40, ptr=0x7ad1d0, len=8192) at be-secure.c:186
#3  0x002958a4 in pq_recvbuf () at pqcomm.c:957
#4  0x00295994 in pq_getbyte () at pqcomm.c:1000
#5  0x004196a0 in PostgresMain (dbname=0x3805894 "", username=0x789020 "...") at postgres.c:353
#6  0x00363df8 in PostmasterMain (argc=8059652, argv=0x61789299) at postmaster.c:4560
#7  0x00298b6c in main (argc=4, argv=0x2500a30) at main.c:198
(gdb) f 1
#1  0x003e3fa8 in WaitEventSetWait (set=0x3802258, timeout=-1, occurred_events=0xbfffe098, nevents=1, wait_event_info=0) at latch.c:1601
1601                    rc = kevent(set->kqueue_fd, NULL, 0,
(gdb) i locals
cur_event = (WaitEvent *) 0x78ed40
cur_kqueue_event = (struct kevent *) 0x0
timeout = {
  tv_sec = 13,
  tv_nsec = 0
}
rc = 4
timeout_p = (struct timespec *) 0x0
rc = 4
returned_events = 4
start_time = {
  tv_sec = -1879041144,
  tv_usec = 58742932
}
cur_time = {
  tv_sec = 59028992,
  tv_usec = 2
}
cur_timeout = -1
(gdb) p *set
$1 = {
  nevents = 3,
  nevents_space = 3,
  events = 0x380227c,
  latch = 0xc954ebc,
  latch_pos = 1,
  exit_on_postmaster_death = 0 '\0',
  kqueue_fd = 4,
  kqueue_ret_events = 0x38022ac,
  report_postmaster_not_running = 0 '\0'
}
(gdb) p set->events[0]
$2 = {
  pos = 0,
  events = 2,
  fd = 7,              # matches MyProcPort->sock
  user_data = 0x0
}
(gdb) p set->events[1]
$3 = {
  pos = 1,
  events = 1,
  fd = -1,
  user_data = 0x0
}
(gdb) p set->events[2]
$4 = {
  pos = 2,
  events = 16,
  fd = 3,
  user_data = 0x0
}
(gdb) p set->kqueue_ret_events[0]
$6 = {
  ident = 7,           # matches MyProcPort->sock
  filter = -1,         # EVFILT_READ
  flags = 1,           # EV_ADD
  fflags = 0,
  data = 40,
  udata = 0x380227c
}

So AFAICS, the data we're passing to kevent() is all sane.
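[Editorial note: the numeric fields in that last kqueue_ret_events dump can be decoded against the kqueue constants. The sketch below hard-codes the values from BSD's <sys/event.h> (an assumption for illustration, since those are platform headers) and uses the kevent(2) rule that, for EVFILT_READ, `data` is the number of bytes available to read.]

```python
# Decoding set->kqueue_ret_events[0] from the gdb dump, with kqueue
# constant values hard-coded from BSD's <sys/event.h> (assumed here).
EVFILT_READ = -1      # filter: "report when this fd has data to read"
EV_ADD      = 0x0001  # flags: event is registered with the kqueue

returned = {"ident": 7, "filter": -1, "flags": 1, "fflags": 0, "data": 40}

assert returned["ident"] == 7              # matches MyProcPort->sock
assert returned["filter"] == EVFILT_READ   # the read-readiness filter
assert returned["flags"] & EV_ADD          # the event was registered
# Per kevent(2), for EVFILT_READ 'data' is the byte count waiting to be
# read: the kernel recorded 40 bytes pending on fd 7 for this event.
assert returned["data"] == 40
```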
It then occurred to me to replicate the case I saw this morning
(control-C'ing the test script) and ktrace all the involved processes.
The perl test script and the pgbench process both just immediately die
on SIGINT:

 25500 pgbench  RET   select -1 errno 4 Interrupted system call
 25500 pgbench  PSIG  SIGINT SIG_DFL

so they are clearly not doing anything to cause the DROP command to get
delivered.  Nonetheless, what the backend sees post-control-C is

 25502 postgres RET   kevent 1
 25502 postgres CALL  recvfrom(0x7,0x7ad1d0,0x2000,0,0,0)
 25502 postgres GIO   fd 7 wrote 34 bytes
       "Q\0\0\0!DROP INDEX CONCURRENTLY idx;\0"
 25502 postgres RET   recvfrom 34/0x22
... blah, blah, executing the command ...
 25502 postgres CALL  sendto(0x7,0x241d020,0x16,0,0,0)
 25502 postgres RET   sendto -1 errno 32 Broken pipe
... blah, blah, shutting down ...

It's really hard to look at this and not call it a kernel bug.

			regards, tom lane
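[Editorial note: the two errno values in the ktrace decode as EINTR (4, the interrupted select in pgbench) and EPIPE (32, the backend's failed sendto). The sketch below checks the numbering, which is the same on macOS and Linux, and reproduces the backend's "Broken pipe" with a socketpair whose peer has gone away, as pgbench had.]

```python
import errno
import socket

# The errno values seen in the ktrace above:
assert errno.EINTR == 4    # pgbench: "RET select -1 errno 4 Interrupted system call"
assert errno.EPIPE == 32   # backend: "RET sendto -1 errno 32 Broken pipe"

# Like the backend sending its command response after pgbench died:
backend, client = socket.socketpair()
client.close()                    # the client process is gone
failed = None
try:
    backend.sendall(b"response")  # corresponds to the failing sendto()
except OSError as e:
    failed = e.errno
assert failed == errno.EPIPE      # Broken pipe, as in the trace
```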