Обсуждение: BUG #16685: The ecpg thread/descriptor test fails sometimes on Windows

Поиск
Список
Период
Сортировка

BUG #16685: The ecpg thread/descriptor test fails sometimes on Windows

От
PG Bug reporting form
Дата:
The following bug has been logged on the website:

Bug reference:      16685
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 13.0
Operating system:   Ubuntu 20.04
Description:

When running `vcregress ecpgcheck`, sometimes I get:
test thread/descriptor            ... stderr FAILED       99 ms

regression.diffs contains:
--- .../src/interfaces/ecpg/test/expected/thread-descriptor.stderr
2019-12-04 16:05:46 +0300
+++ .../src/interfaces/ecpg/test/results/thread-descriptor.stderr
2020-10-20 10:00:34 +0300
@@ -0,0 +1 @@
+SQL error: descriptor "mydesc" not found on line 31

See also:
https://www.postgresql.org/message-id/flat/230799.1603045446%40sss.pgh.pa.us

In descriptor.pgc we have:
30:          EXEC SQL ALLOCATE DESCRIPTOR mydesc;
31:          EXEC SQL DEALLOCATE DESCRIPTOR mydesc;
So the mydesc descriptor disappeared somehow just after allocation.

`EXEC SQL DEALLOCATE DESCRIPTOR` and `EXEC SQL DEALLOCATE DESCRIPTOR` are
implemented in ECPGallocate_desc and ECPGdeallocate_desc in
ecpglib\descriptor.c, correspondingly, so I looked into the code.

I found that the get_descriptors() function called in ECPGdeallocate_desc
sometimes can return null.
static struct descriptor *
get_descriptors(void)
{
    pthread_once(&descriptor_once, descriptor_key_init);
    return (struct descriptor *) pthread_getspecific(descriptor_key);
}
pthread_getspecific(key) implemented on Widnows as TlsGetValue(key);

To make the bug reproduction easier, I replaced ecpg_schedule contents with
100 "test: thread/descriptor" lines and ran `vcregress ecpgcheck` in a loop
with 100 iterations. And with such setup it takes just several minutes to
get a failure.

The following debugging code inserted into the ECPGallocate_desc:
+++ b/src/interfaces/ecpg/ecpglib/descriptor.c
@@ -829,6 +829,17 @@ ECPGallocate_desc(int line, const char *name)
        }
        strcpy(new->name, name);
        set_descriptors(new);
+
+       long initialdk = descriptor_key;
+       for (int n = 0; n < 1000; n++) {
+               void *new1 = TlsGetValue(descriptor_key);
+               if (!new1) {
+                       DWORD lasterr = GetLastError();
+                       fprintf(stdout, "TlsGetValue() returned null on
iteration %d, error: %d, descriptor_key: %d, initial descriptor_key:
%d.\n",
+                                       n, lasterr, descriptor_key,
initialdk);
+                       exit(2);
+               }
+       }
        return true;
 }
shows on a failure:
TlsGetValue() returned null on iteration 209, error: 0, descriptor_key: 28,
initial descriptor_key: 0.
or
TlsGetValue() returned null on iteration: 369, error: 0, descriptor_key: 28,
initial descriptor_key: 0

So the descriptor_key changed after set_descriptors(new), and following
get_descriptors() would return null as seen on a test failure.


Re: BUG #16685: The ecpg thread/descriptor test fails sometimes on Windows

От
Tom Lane
Дата:
PG Bug reporting form <noreply@postgresql.org> writes:
> I found that the get_descriptors() function called in ECPGdeallocate_desc
> sometimes can return null.

> The following debugging code inserted into the ECPGallocate_desc:
> ...
> shows on a failure:
> TlsGetValue() returned null on iteration 209, error: 0, descriptor_key: 28,
> initial descriptor_key: 0.
> or
> TlsGetValue() returned null on iteration: 369, error: 0, descriptor_key: 28,
> initial descriptor_key: 0

Hm.  I'm supposing that 0 isn't likely to be the correct TLS index,
and what this is showing us is that we got here before anyone had
done descriptor_key_init.

I'll bet that the correct fix is

 static void
 set_descriptors(struct descriptor *value)
 {
+    pthread_once(&descriptor_once, descriptor_key_init);
    pthread_setspecific(descriptor_key, value);
 }

and that you could probably reproduce this on non-Windows, too,
if you tried hard (sticking a delay in get_descriptors might do it).

            regards, tom lane



Re: BUG #16685: The ecpg thread/descriptor test fails sometimes on Windows

От
Tom Lane
Дата:
I wrote:
> I'll bet that the correct fix is

>  static void
>  set_descriptors(struct descriptor *value)
>  {
> +    pthread_once(&descriptor_once, descriptor_key_init);
>     pthread_setspecific(descriptor_key, value);
>  }

No, looking closer, there are only two calls of set_descriptors()
and each of them is preceded by a call to get_descriptors().
So it should not be possible to get here without descriptor_key
having been initialized.

I'm forced to the conclusion that there's something wrong with
ECPG's emulation of pthread_once ...

... and now that I look at it, it seems just as obvious what
is wrong there:

void
win32_pthread_once(volatile pthread_once_t *once, void (*fn) (void))
{
    if (!*once)
    {
        pthread_mutex_lock(&win32_pthread_once_lock);
        if (!*once)
        {
            *once = true;
            fn();
        }
        pthread_mutex_unlock(&win32_pthread_once_lock);
    }
}

We should not set *once until AFTER we execute fn().
Otherwise, other threads passing through pthread_once()
will mistakenly fall through, expecting the initialization
to be done already.

(So in this view, adding a sleep just before fn() would
make the failure more reproducible.)

            regards, tom lane



Re: BUG #16685: The ecpg thread/descriptor test fails sometimes on Windows

От
Alexander Lakhin
Дата:
24.10.2020 19:54, Tom Lane wrote:
> I'm forced to the conclusion that there's something wrong with
> ECPG's emulation of pthread_once ...
>
> ... and now that I look at it, it seems just as obvious what
> is wrong there:
>
> void
> win32_pthread_once(volatile pthread_once_t *once, void (*fn) (void))
> {
>     if (!*once)
>     {
>         pthread_mutex_lock(&win32_pthread_once_lock);
>         if (!*once)
>         {
>             *once = true;
>             fn();
>         }
>         pthread_mutex_unlock(&win32_pthread_once_lock);
>     }
> }
>
> We should not set *once until AFTER we execute fn().
> Otherwise, other threads passing through pthread_once()
> will mistakenly fall through, expecting the initialization
> to be done already.
>
> (So in this view, adding a sleep just before fn() would
> make the failure more reproducible.)
Yes, adding "pg_usleep(1000L);" just before fn() leads to 15 of 100
tests failed (without the delay more than 100 iterations could pass
successfully).
And the reverse test construction:
            fn();
            pg_usleep(1000L);
            *once = true;
Makes all the tests (I ran 30x100 iterations) pass just fine.
Thank you for looking into this!

Best regards,
Alexander



Re: BUG #16685: The ecpg thread/descriptor test fails sometimes on Windows

От
Tom Lane
Дата:
BTW, I searched the buildfarm logs of ECPG failures to see if I could
find anything related to this, and noticed that there were a fair
number of Windows-machine failures in the thread/prep test.  They don't
all look quite alike, but this one is representative (and the most
recent):

sysname|dory
snapshot|2019-04-02 11:20:25
================== pgsql.build/src/interfaces/ecpg/test/regression.diffs ===================
diff -w -U3 c:/pgbuildfarm/pgbuildroot/HEAD/pgsql.build/src/interfaces/ecpg/test/expected/thread-prep.stderr
c:/pgbuildfarm/pgbuildroot/HEAD/pgsql.build/src/interfaces/ecpg/test/results/thread-prep.stderr
--- c:/pgbuildfarm/pgbuildroot/HEAD/pgsql.build/src/interfaces/ecpg/test/expected/thread-prep.stderr    2018-04-20
16:03:08-0400 
+++ c:/pgbuildfarm/pgbuildroot/HEAD/pgsql.build/src/interfaces/ecpg/test/results/thread-prep.stderr    2019-04-02
07:43:27-0400 
@@ -0,0 +1,84 @@
+SQL error: could not connect to database "ecpg1_regression" on line 48
+SQL error: the connection to the server was lost on line 53
+SQL error: the connection to the server was lost on line 52
+SQL error: invalid statement name "i" on line 53
+SQL error: the connection to the server was lost on line 52
+SQL error: invalid statement name "i" on line 53
+SQL error: the connection to the server was lost on line 52
+SQL error: invalid statement name "i" on line 53
+SQL error: the connection to the server was lost on line 52
+SQL error: invalid statement name "i" on line 53
... quite a few repetitions of this ...
+SQL error: the connection to the server was lost on line 52
+SQL error: invalid statement name "i" on line 53
+SQL error: invalid statement name "i" on line 55
+SQL error: connection "Connection: 1" does not exist on line 56


I suspect that this might arise from a pthread_once failure
associated with the actual_connection_key variable.

            regards, tom lane