Обсуждение: Query cancel seems to be broken in master since Oct 17

Поиск
Список
Период
Сортировка

Query cancel seems to be broken in master since Oct 17

От
Vladimir Sitnikov
Дата:
Hi,

In pgjdbc we have regular regression testing against "build from master" PostgreSQL, and recent master builds fail for "statement cancel" test.

The PostgreSQL as of Mon Oct 17 00:09:39 UTC 2016 was fine, then "statement cancel" started to fail.
The test executes "select pg_sleep(10)" and tries to cancel it. In recent master builds, cancel seems to be ignored, and the statement lasts for 10 seconds.

Exactly the same driver and test version works fine for 8.4, 9.1, 9.2, 9.3, 9.4, 9.5, and 9.6:

Any hints what could be the issue?
Was the breakage intentional?

Vladimir

Re: Query cancel seems to be broken in master since Oct 17

От
Magnus Hagander
Дата:


On Tue, Oct 18, 2016 at 1:00 AM, Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:
Hi,

In pgjdbc we have regular regression testing against "build from master" PostgreSQL, and recent master builds fail for "statement cancel" test.

The PostgreSQL as of Mon Oct 17 00:09:39 UTC 2016 was fine, then "statement cancel" started to fail.
The test executes "select pg_sleep(10)" and tries to cancel it. In recent master builds, cancel seems to be ignored, and the statement lasts for 10 seconds.

Exactly the same driver and test version works fine for 8.4, 9.1, 9.2, 9.3, 9.4, 9.5, and 9.6:

Any hints what could be the issue?
Was the breakage intentional?



That's certainly not intended to break things, but that was changed on Oct 17 and it relates to cancel keys.

What platform does the postgres server run on? Can can you check if query cancel works on libpq or if it's completely broken? 


--

Re: Query cancel seems to be broken in master since Oct 17

От
Vladimir Gordiychuk
Дата:
>What platform does the postgres server run on?

Ubuntu
OS name: "linux", version: "3.19.0-66-generic", arch: "amd64", family: "unix"

2016-10-18 11:05 GMT+03:00 Magnus Hagander <magnus@hagander.net>:


On Tue, Oct 18, 2016 at 1:00 AM, Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:
Hi,

In pgjdbc we have regular regression testing against "build from master" PostgreSQL, and recent master builds fail for "statement cancel" test.

The PostgreSQL as of Mon Oct 17 00:09:39 UTC 2016 was fine, then "statement cancel" started to fail.
The test executes "select pg_sleep(10)" and tries to cancel it. In recent master builds, cancel seems to be ignored, and the statement lasts for 10 seconds.

Exactly the same driver and test version works fine for 8.4, 9.1, 9.2, 9.3, 9.4, 9.5, and 9.6:

Any hints what could be the issue?
Was the breakage intentional?



That's certainly not intended to break things, but that was changed on Oct 17 and it relates to cancel keys.

What platform does the postgres server run on? Can can you check if query cancel works on libpq or if it's completely broken? 


--

Re: Query cancel seems to be broken in master since Oct 17

От
Tom Lane
Дата:
Magnus Hagander <magnus@hagander.net> writes:
> On Tue, Oct 18, 2016 at 1:00 AM, Vladimir Sitnikov <
> sitnikov.vladimir@gmail.com> wrote:
>> The test executes "select pg_sleep(10)" and tries to cancel it. In recent
>> master builds, cancel seems to be ignored, and the statement lasts for 10
>> seconds.

> My guess is it's related to this:
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=9e083fd4683294f41544e6d0d72f6e258ff3a77c
> That's certainly not intended to break things, but that was changed on Oct
> 17 and it relates to cancel keys.
> What platform does the postgres server run on? Can can you check if query
> cancel works on libpq or if it's completely broken?

I can confirm that query cancel is broken in HEAD on RHEL6.

regression=# select pg_sleep(10);
^CCancel request sent
... nothing happens for the balance of the 10 seconds ...
regression=#

There's a smoking gun in the postmaster log:

2016-10-18 09:10:34.547 EDT [18502] LOG:  wrong key in cancel request for process 18491
        regards, tom lane



Re: Query cancel seems to be broken in master since Oct 17

От
Heikki Linnakangas
Дата:
On 10/18/2016 04:13 PM, Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> On Tue, Oct 18, 2016 at 1:00 AM, Vladimir Sitnikov <
>> sitnikov.vladimir@gmail.com> wrote:
>>> The test executes "select pg_sleep(10)" and tries to cancel it. In recent
>>> master builds, cancel seems to be ignored, and the statement lasts for 10
>>> seconds.
>
>> My guess is it's related to this:
>> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=9e083fd4683294f41544e6d0d72f6e258ff3a77c
>> That's certainly not intended to break things, but that was changed on Oct
>> 17 and it relates to cancel keys.
>> What platform does the postgres server run on? Can can you check if query
>> cancel works on libpq or if it's completely broken?
>
> I can confirm that query cancel is broken in HEAD on RHEL6.
>
> regression=# select pg_sleep(10);
> ^CCancel request sent
> ... nothing happens for the balance of the 10 seconds ...
> regression=#
>
> There's a smoking gun in the postmaster log:
>
> 2016-10-18 09:10:34.547 EDT [18502] LOG:  wrong key in cancel request for process 18491

Ok, I've reverted that commit for now. It clearly needs more thought, 
because of this, and the pademelon failure discussed on the other thread.

- Heikki




Re: Query cancel seems to be broken in master since Oct 17

От
Tom Lane
Дата:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On 10/18/2016 04:13 PM, Tom Lane wrote:
>> There's a smoking gun in the postmaster log:
>> 2016-10-18 09:10:34.547 EDT [18502] LOG:  wrong key in cancel request for process 18491

> Ok, I've reverted that commit for now. It clearly needs more thought,
> because of this, and the pademelon failure discussed on the other thread.

I think that was an overreaction.  The problem is pretty obvious after
adding some instrumentation:

2016-10-18 09:57:47.508 EDT [21229] LOG:  wrong key (0x7B7E4D5E, expected 0xF0F804017B7E4D5E) in cancel request for
process21228 

To wit, the various cancel_key backend variables are declared as "long",
and the new code
if (!pg_strong_random(&MyCancelKey, sizeof(MyCancelKey)))

is therefore computing an 8-byte random value on 64-bit-long machines.
But only 4 bytes go to the client and come back.

The cleanest fix might be to change those various "long" variables
to uint32.  You'd have to think about how to handle the ntohl/htonl
calls that are used on them, though.
        regards, tom lane



Re: Query cancel seems to be broken in master since Oct 17

От
Tom Lane
Дата:
I wrote:
> The cleanest fix might be to change those various "long" variables
> to uint32.  You'd have to think about how to handle the ntohl/htonl
> calls that are used on them, though.

Or actually, no, you wouldn't have to think very hard.  I was supposing
that those calls were declared to traffic in "long"s, but they aren't
and never have been, at least not since SUSv2:

uint32_t htonl(uint32_t hostlong);
uint16_t htons(uint16_t hostshort);
uint32_t ntohl(uint32_t netlong);
uint16_t ntohs(uint16_t netshort);

So s/long/uint32/ would probably fix it just fine.
        regards, tom lane



Re: Query cancel seems to be broken in master since Oct 17

От
Michael Paquier
Дата:
On Tue, Oct 18, 2016 at 11:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> The cleanest fix might be to change those various "long" variables
>> to uint32.  You'd have to think about how to handle the ntohl/htonl
>> calls that are used on them, though.
>
> Or actually, no, you wouldn't have to think very hard.  I was supposing
> that those calls were declared to traffic in "long"s, but they aren't
> and never have been, at least not since SUSv2:
>
> uint32_t htonl(uint32_t hostlong);
> uint16_t htons(uint16_t hostshort);
> uint32_t ntohl(uint32_t netlong);
> uint16_t ntohs(uint16_t netshort);
>
> So s/long/uint32/ would probably fix it just fine.

The pg_strong_random patch has been reverted, what's done is done.
FWIW only doing that is needed to fix the problem as already mentioned
by Tom:
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index f232083..634578d 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -38,7 +38,7 @@ volatile uint32 CritSectionCount = 0;int            MyProcPid;pg_time_t  MyStartTime;struct Port
*MyProcPort;
-long       MyCancelKey;
+uint32     MyCancelKey;int            MyPMChildSlot;
/*
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 78545da..5e623a1 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -162,7 +162,7 @@ extern PGDLLIMPORT int MyProcPid;extern PGDLLIMPORT pg_time_t MyStartTime;extern PGDLLIMPORT struct
Port*MyProcPort;extern PGDLLIMPORT struct Latch *MyLatch;
 
-extern long MyCancelKey;
+extern uint32 MyCancelKey;extern int MyPMChildSlot;
extern char OutputFileName[];

I'll send an update for that, as well as a solution for the pademelon
problem on the SCRAM thread.
-- 
Michael



Re: Query cancel seems to be broken in master since Oct 17

От
Noah Misch
Дата:
On Tue, Oct 18, 2016 at 10:03:39AM -0400, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
> > On 10/18/2016 04:13 PM, Tom Lane wrote:
> >> There's a smoking gun in the postmaster log:
> >> 2016-10-18 09:10:34.547 EDT [18502] LOG:  wrong key in cancel request for process 18491
> 
> > Ok, I've reverted that commit for now. It clearly needs more thought, 
> > because of this, and the pademelon failure discussed on the other thread.
> 
> I think that was an overreaction.  The problem is pretty obvious after
> adding some instrumentation:

I endorse Heikki's revert.