Обсуждение: BUG #16685: The ecpg thread/descriptor test fails sometimes on Windows
BUG #16685: The ecpg thread/descriptor test fails sometimes on Windows
От
PG Bug reporting form
Дата:
The following bug has been logged on the website: Bug reference: 16685 Logged by: Alexander Lakhin Email address: exclusion@gmail.com PostgreSQL version: 13.0 Operating system: Ubuntu 20.04 Description: When running `vcregress ecpgcheck`, sometimes I get: test thread/descriptor ... stderr FAILED 99 ms regression.diffs contains: --- .../src/interfaces/ecpg/test/expected/thread-descriptor.stderr 2019-12-04 16:05:46 +0300 +++ .../src/interfaces/ecpg/test/results/thread-descriptor.stderr 2020-10-20 10:00:34 +0300 @@ -0,0 +1 @@ +SQL error: descriptor "mydesc" not found on line 31 See also: https://www.postgresql.org/message-id/flat/230799.1603045446%40sss.pgh.pa.us In descriptor.pgc we have: 30: EXEC SQL ALLOCATE DESCRIPTOR mydesc; 31: EXEC SQL DEALLOCATE DESCRIPTOR mydesc; So the mydesc descriptor disappeared somehow just after allocation. `EXEC SQL DEALLOCATE DESCRIPTOR` and `EXEC SQL DEALLOCATE DESCRIPTOR` are implemented in ECPGallocate_desc and ECPGdeallocate_desc in ecpglib\descriptor.c, correspondingly, so I looked into the code. I found that the get_descriptors() function called in ECPGdeallocate_desc sometimes can return null. static struct descriptor * get_descriptors(void) { pthread_once(&descriptor_once, descriptor_key_init); return (struct descriptor *) pthread_getspecific(descriptor_key); } pthread_getspecific(key) implemented on Widnows as TlsGetValue(key); To make the bug reproduction easier, I replaced ecpg_schedule contents with 100 "test: thread/descriptor" lines and ran `vcregress ecpgcheck` in a loop with 100 iterations. And with such setup it takes just several minutes to get a failure. The following debugging code inserted into the ECPGallocate_desc: +++ b/src/interfaces/ecpg/ecpglib/descriptor.c @@ -829,6 +829,17 @@ ECPGallocate_desc(int line, const char *name) } strcpy(new->name, name); set_descriptors(new); + + long initialdk = descriptor_key; + for (int n = 0; n < 1000; n++) { + void *new1 = TlsGetValue(descriptor_key); + if (!new1) { + DWORD lasterr = GetLastError(); + fprintf(stdout, "TlsGetValue() returned null on iteration %d, error: %d, descriptor_key: %d, initial descriptor_key: %d.\n", + n, lasterr, descriptor_key, initialdk); + exit(2); + } + } return true; } shows on a failure: TlsGetValue() returned null on iteration 209, error: 0, descriptor_key: 28, initial descriptor_key: 0. or TlsGetValue() returned null on iteration: 369, error: 0, descriptor_key: 28, initial descriptor_key: 0 So the descriptor_key changed after set_descriptors(new), and following get_descriptors() would return null as seen on a test failure.
PG Bug reporting form <noreply@postgresql.org> writes: > I found that the get_descriptors() function called in ECPGdeallocate_desc > sometimes can return null. > The following debugging code inserted into the ECPGallocate_desc: > ... > shows on a failure: > TlsGetValue() returned null on iteration 209, error: 0, descriptor_key: 28, > initial descriptor_key: 0. > or > TlsGetValue() returned null on iteration: 369, error: 0, descriptor_key: 28, > initial descriptor_key: 0 Hm. I'm supposing that 0 isn't likely to be the correct TLS index, and what this is showing us is that we got here before anyone had done descriptor_key_init. I'll bet that the correct fix is static void set_descriptors(struct descriptor *value) { + pthread_once(&descriptor_once, descriptor_key_init); pthread_setspecific(descriptor_key, value); } and that you could probably reproduce this on non-Windows, too, if you tried hard (sticking a delay in get_descriptors might do it). regards, tom lane
I wrote: > I'll bet that the correct fix is > static void > set_descriptors(struct descriptor *value) > { > + pthread_once(&descriptor_once, descriptor_key_init); > pthread_setspecific(descriptor_key, value); > } No, looking closer, there are only two calls of set_descriptors() and each of them is preceded by a call to get_descriptors(). So it should not be possible to get here without descriptor_key having been initialized. I'm forced to the conclusion that there's something wrong with ECPG's emulation of pthread_once ... ... and now that I look at it, it seems just as obvious what is wrong there: void win32_pthread_once(volatile pthread_once_t *once, void (*fn) (void)) { if (!*once) { pthread_mutex_lock(&win32_pthread_once_lock); if (!*once) { *once = true; fn(); } pthread_mutex_unlock(&win32_pthread_once_lock); } } We should not set *once until AFTER we execute fn(). Otherwise, other threads passing through pthread_once() will mistakenly fall through, expecting the initialization to be done already. (So in this view, adding a sleep just before fn() would make the failure more reproducible.) regards, tom lane
Re: BUG #16685: The ecpg thread/descriptor test fails sometimes on Windows
От
Alexander Lakhin
Дата:
24.10.2020 19:54, Tom Lane wrote: > I'm forced to the conclusion that there's something wrong with > ECPG's emulation of pthread_once ... > > ... and now that I look at it, it seems just as obvious what > is wrong there: > > void > win32_pthread_once(volatile pthread_once_t *once, void (*fn) (void)) > { > if (!*once) > { > pthread_mutex_lock(&win32_pthread_once_lock); > if (!*once) > { > *once = true; > fn(); > } > pthread_mutex_unlock(&win32_pthread_once_lock); > } > } > > We should not set *once until AFTER we execute fn(). > Otherwise, other threads passing through pthread_once() > will mistakenly fall through, expecting the initialization > to be done already. > > (So in this view, adding a sleep just before fn() would > make the failure more reproducible.) Yes, adding "pg_usleep(1000L);" just before fn() leads to 15 of 100 tests failed (without the delay more than 100 iterations could pass successfully). And the reverse test construction: fn(); pg_usleep(1000L); *once = true; Makes all the tests (I ran 30x100 iterations) pass just fine. Thank you for looking into this! Best regards, Alexander
BTW, I searched the buildfarm logs of ECPG failures to see if I could find anything related to this, and noticed that there were a fair number of Windows-machine failures in the thread/prep test. They don't all look quite alike, but this one is representative (and the most recent): sysname|dory snapshot|2019-04-02 11:20:25 ================== pgsql.build/src/interfaces/ecpg/test/regression.diffs =================== diff -w -U3 c:/pgbuildfarm/pgbuildroot/HEAD/pgsql.build/src/interfaces/ecpg/test/expected/thread-prep.stderr c:/pgbuildfarm/pgbuildroot/HEAD/pgsql.build/src/interfaces/ecpg/test/results/thread-prep.stderr --- c:/pgbuildfarm/pgbuildroot/HEAD/pgsql.build/src/interfaces/ecpg/test/expected/thread-prep.stderr 2018-04-20 16:03:08-0400 +++ c:/pgbuildfarm/pgbuildroot/HEAD/pgsql.build/src/interfaces/ecpg/test/results/thread-prep.stderr 2019-04-02 07:43:27-0400 @@ -0,0 +1,84 @@ +SQL error: could not connect to database "ecpg1_regression" on line 48 +SQL error: the connection to the server was lost on line 53 +SQL error: the connection to the server was lost on line 52 +SQL error: invalid statement name "i" on line 53 +SQL error: the connection to the server was lost on line 52 +SQL error: invalid statement name "i" on line 53 +SQL error: the connection to the server was lost on line 52 +SQL error: invalid statement name "i" on line 53 +SQL error: the connection to the server was lost on line 52 +SQL error: invalid statement name "i" on line 53 ... quite a few repetitions of this ... +SQL error: the connection to the server was lost on line 52 +SQL error: invalid statement name "i" on line 53 +SQL error: invalid statement name "i" on line 55 +SQL error: connection "Connection: 1" does not exist on line 56 I suspect that this might arise from a pthread_once failure associated with the actual_connection_key variable. regards, tom lane