Обсуждение: BUG #16685: The ecpg thread/descriptor test fails sometimes on Windows
BUG #16685: The ecpg thread/descriptor test fails sometimes on Windows
От
PG Bug reporting form
Дата:
The following bug has been logged on the website: Bug reference: 16685 Logged by: Alexander Lakhin Email address: exclusion@gmail.com PostgreSQL version: 13.0 Operating system: Ubuntu 20.04 Description: When running `vcregress ecpgcheck`, sometimes I get: test thread/descriptor ... stderr FAILED 99 ms regression.diffs contains: --- .../src/interfaces/ecpg/test/expected/thread-descriptor.stderr 2019-12-04 16:05:46 +0300 +++ .../src/interfaces/ecpg/test/results/thread-descriptor.stderr 2020-10-20 10:00:34 +0300 @@ -0,0 +1 @@ +SQL error: descriptor "mydesc" not found on line 31 See also: https://www.postgresql.org/message-id/flat/230799.1603045446%40sss.pgh.pa.us In descriptor.pgc we have: 30: EXEC SQL ALLOCATE DESCRIPTOR mydesc; 31: EXEC SQL DEALLOCATE DESCRIPTOR mydesc; So the mydesc descriptor disappeared somehow just after allocation. `EXEC SQL DEALLOCATE DESCRIPTOR` and `EXEC SQL DEALLOCATE DESCRIPTOR` are implemented in ECPGallocate_desc and ECPGdeallocate_desc in ecpglib\descriptor.c, correspondingly, so I looked into the code. I found that the get_descriptors() function called in ECPGdeallocate_desc sometimes can return null. static struct descriptor * get_descriptors(void) { pthread_once(&descriptor_once, descriptor_key_init); return (struct descriptor *) pthread_getspecific(descriptor_key); } pthread_getspecific(key) implemented on Widnows as TlsGetValue(key); To make the bug reproduction easier, I replaced ecpg_schedule contents with 100 "test: thread/descriptor" lines and ran `vcregress ecpgcheck` in a loop with 100 iterations. And with such setup it takes just several minutes to get a failure. The following debugging code inserted into the ECPGallocate_desc: +++ b/src/interfaces/ecpg/ecpglib/descriptor.c @@ -829,6 +829,17 @@ ECPGallocate_desc(int line, const char *name) } strcpy(new->name, name); set_descriptors(new); + + long initialdk = descriptor_key; + for (int n = 0; n < 1000; n++) { + void *new1 = TlsGetValue(descriptor_key); + if (!new1) { + DWORD lasterr = GetLastError(); + fprintf(stdout, "TlsGetValue() returned null on iteration %d, error: %d, descriptor_key: %d, initial descriptor_key: %d.\n", + n, lasterr, descriptor_key, initialdk); + exit(2); + } + } return true; } shows on a failure: TlsGetValue() returned null on iteration 209, error: 0, descriptor_key: 28, initial descriptor_key: 0. or TlsGetValue() returned null on iteration: 369, error: 0, descriptor_key: 28, initial descriptor_key: 0 So the descriptor_key changed after set_descriptors(new), and following get_descriptors() would return null as seen on a test failure.
PG Bug reporting form <noreply@postgresql.org> writes:
> I found that the get_descriptors() function called in ECPGdeallocate_desc
> sometimes can return null.
> The following debugging code inserted into the ECPGallocate_desc:
> ...
> shows on a failure:
> TlsGetValue() returned null on iteration 209, error: 0, descriptor_key: 28,
> initial descriptor_key: 0.
> or
> TlsGetValue() returned null on iteration: 369, error: 0, descriptor_key: 28,
> initial descriptor_key: 0
Hm. I'm supposing that 0 isn't likely to be the correct TLS index,
and what this is showing us is that we got here before anyone had
done descriptor_key_init.
I'll bet that the correct fix is
static void
set_descriptors(struct descriptor *value)
{
+ pthread_once(&descriptor_once, descriptor_key_init);
pthread_setspecific(descriptor_key, value);
}
and that you could probably reproduce this on non-Windows, too,
if you tried hard (sticking a delay in get_descriptors might do it).
regards, tom lane
I wrote:
> I'll bet that the correct fix is
> static void
> set_descriptors(struct descriptor *value)
> {
> + pthread_once(&descriptor_once, descriptor_key_init);
> pthread_setspecific(descriptor_key, value);
> }
No, looking closer, there are only two calls of set_descriptors()
and each of them is preceded by a call to get_descriptors().
So it should not be possible to get here without descriptor_key
having been initialized.
I'm forced to the conclusion that there's something wrong with
ECPG's emulation of pthread_once ...
... and now that I look at it, it seems just as obvious what
is wrong there:
void
win32_pthread_once(volatile pthread_once_t *once, void (*fn) (void))
{
if (!*once)
{
pthread_mutex_lock(&win32_pthread_once_lock);
if (!*once)
{
*once = true;
fn();
}
pthread_mutex_unlock(&win32_pthread_once_lock);
}
}
We should not set *once until AFTER we execute fn().
Otherwise, other threads passing through pthread_once()
will mistakenly fall through, expecting the initialization
to be done already.
(So in this view, adding a sleep just before fn() would
make the failure more reproducible.)
regards, tom lane
Re: BUG #16685: The ecpg thread/descriptor test fails sometimes on Windows
От
Alexander Lakhin
Дата:
24.10.2020 19:54, Tom Lane wrote:
> I'm forced to the conclusion that there's something wrong with
> ECPG's emulation of pthread_once ...
>
> ... and now that I look at it, it seems just as obvious what
> is wrong there:
>
> void
> win32_pthread_once(volatile pthread_once_t *once, void (*fn) (void))
> {
> if (!*once)
> {
> pthread_mutex_lock(&win32_pthread_once_lock);
> if (!*once)
> {
> *once = true;
> fn();
> }
> pthread_mutex_unlock(&win32_pthread_once_lock);
> }
> }
>
> We should not set *once until AFTER we execute fn().
> Otherwise, other threads passing through pthread_once()
> will mistakenly fall through, expecting the initialization
> to be done already.
>
> (So in this view, adding a sleep just before fn() would
> make the failure more reproducible.)
Yes, adding "pg_usleep(1000L);" just before fn() leads to 15 of 100
tests failed (without the delay more than 100 iterations could pass
successfully).
And the reverse test construction:
fn();
pg_usleep(1000L);
*once = true;
Makes all the tests (I ran 30x100 iterations) pass just fine.
Thank you for looking into this!
Best regards,
Alexander
BTW, I searched the buildfarm logs of ECPG failures to see if I could
find anything related to this, and noticed that there were a fair
number of Windows-machine failures in the thread/prep test. They don't
all look quite alike, but this one is representative (and the most
recent):
sysname|dory
snapshot|2019-04-02 11:20:25
================== pgsql.build/src/interfaces/ecpg/test/regression.diffs ===================
diff -w -U3 c:/pgbuildfarm/pgbuildroot/HEAD/pgsql.build/src/interfaces/ecpg/test/expected/thread-prep.stderr
c:/pgbuildfarm/pgbuildroot/HEAD/pgsql.build/src/interfaces/ecpg/test/results/thread-prep.stderr
--- c:/pgbuildfarm/pgbuildroot/HEAD/pgsql.build/src/interfaces/ecpg/test/expected/thread-prep.stderr 2018-04-20
16:03:08-0400
+++ c:/pgbuildfarm/pgbuildroot/HEAD/pgsql.build/src/interfaces/ecpg/test/results/thread-prep.stderr 2019-04-02
07:43:27-0400
@@ -0,0 +1,84 @@
+SQL error: could not connect to database "ecpg1_regression" on line 48
+SQL error: the connection to the server was lost on line 53
+SQL error: the connection to the server was lost on line 52
+SQL error: invalid statement name "i" on line 53
+SQL error: the connection to the server was lost on line 52
+SQL error: invalid statement name "i" on line 53
+SQL error: the connection to the server was lost on line 52
+SQL error: invalid statement name "i" on line 53
+SQL error: the connection to the server was lost on line 52
+SQL error: invalid statement name "i" on line 53
... quite a few repetitions of this ...
+SQL error: the connection to the server was lost on line 52
+SQL error: invalid statement name "i" on line 53
+SQL error: invalid statement name "i" on line 55
+SQL error: connection "Connection: 1" does not exist on line 56
I suspect that this might arise from a pthread_once failure
associated with the actual_connection_key variable.
regards, tom lane