Обсуждение: Query cancel seems to be broken in master since Oct 17
Hi,
In pgjdbc we have regular regression testing against "build from master" PostgreSQL, and recent master builds fail for "statement cancel" test.
The PostgreSQL as of Mon Oct 17 00:09:39 UTC 2016 was fine, then "statement cancel" started to fail.
The test executes "select pg_sleep(10)" and tries to cancel it. In recent master builds, cancel seems to be ignored, and the statement lasts for 10 seconds.
Exactly the same driver and test version works fine for 8.4, 9.1, 9.2, 9.3, 9.4, 9.5, and 9.6:
Here's a sample build report: https://travis-ci.org/pgjdbc/pgjdbc/builds/168444341
Any hints what could be the issue?
Was the breakage intentional?
Vladimir
On Tue, Oct 18, 2016 at 1:00 AM, Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:
Hi,In pgjdbc we have regular regression testing against "build from master" PostgreSQL, and recent master builds fail for "statement cancel" test.The PostgreSQL as of Mon Oct 17 00:09:39 UTC 2016 was fine, then "statement cancel" started to fail.The test executes "select pg_sleep(10)" and tries to cancel it. In recent master builds, cancel seems to be ignored, and the statement lasts for 10 seconds.Exactly the same driver and test version works fine for 8.4, 9.1, 9.2, 9.3, 9.4, 9.5, and 9.6:Any hints what could be the issue?Was the breakage intentional?
My guess is it's related to this: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=9e083fd4683294f41544e6d0d72f6e258ff3a77c
That's certainly not intended to break things, but that was changed on Oct 17 and it relates to cancel keys.
What platform does the postgres server run on? Can can you check if query cancel works on libpq or if it's completely broken?
>What platform does the postgres server run on?
Ubuntu
OS name: "linux", version: "3.19.0-66-generic", arch: "amd64", family: "unix"
2016-10-18 11:05 GMT+03:00 Magnus Hagander <magnus@hagander.net>:
On Tue, Oct 18, 2016 at 1:00 AM, Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:Hi,In pgjdbc we have regular regression testing against "build from master" PostgreSQL, and recent master builds fail for "statement cancel" test.The PostgreSQL as of Mon Oct 17 00:09:39 UTC 2016 was fine, then "statement cancel" started to fail.The test executes "select pg_sleep(10)" and tries to cancel it. In recent master builds, cancel seems to be ignored, and the statement lasts for 10 seconds.Exactly the same driver and test version works fine for 8.4, 9.1, 9.2, 9.3, 9.4, 9.5, and 9.6:Any hints what could be the issue?Was the breakage intentional?My guess is it's related to this: https://git.postgresql.org/gitweb/?p=postgresql.git;a= commitdiff;h= 9e083fd4683294f41544e6d0d72f6e 258ff3a77c That's certainly not intended to break things, but that was changed on Oct 17 and it relates to cancel keys.What platform does the postgres server run on? Can can you check if query cancel works on libpq or if it's completely broken?--
Magnus Hagander <magnus@hagander.net> writes: > On Tue, Oct 18, 2016 at 1:00 AM, Vladimir Sitnikov < > sitnikov.vladimir@gmail.com> wrote: >> The test executes "select pg_sleep(10)" and tries to cancel it. In recent >> master builds, cancel seems to be ignored, and the statement lasts for 10 >> seconds. > My guess is it's related to this: > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=9e083fd4683294f41544e6d0d72f6e258ff3a77c > That's certainly not intended to break things, but that was changed on Oct > 17 and it relates to cancel keys. > What platform does the postgres server run on? Can can you check if query > cancel works on libpq or if it's completely broken? I can confirm that query cancel is broken in HEAD on RHEL6. regression=# select pg_sleep(10); ^CCancel request sent ... nothing happens for the balance of the 10 seconds ... regression=# There's a smoking gun in the postmaster log: 2016-10-18 09:10:34.547 EDT [18502] LOG: wrong key in cancel request for process 18491 regards, tom lane
On 10/18/2016 04:13 PM, Tom Lane wrote: > Magnus Hagander <magnus@hagander.net> writes: >> On Tue, Oct 18, 2016 at 1:00 AM, Vladimir Sitnikov < >> sitnikov.vladimir@gmail.com> wrote: >>> The test executes "select pg_sleep(10)" and tries to cancel it. In recent >>> master builds, cancel seems to be ignored, and the statement lasts for 10 >>> seconds. > >> My guess is it's related to this: >> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=9e083fd4683294f41544e6d0d72f6e258ff3a77c >> That's certainly not intended to break things, but that was changed on Oct >> 17 and it relates to cancel keys. >> What platform does the postgres server run on? Can can you check if query >> cancel works on libpq or if it's completely broken? > > I can confirm that query cancel is broken in HEAD on RHEL6. > > regression=# select pg_sleep(10); > ^CCancel request sent > ... nothing happens for the balance of the 10 seconds ... > regression=# > > There's a smoking gun in the postmaster log: > > 2016-10-18 09:10:34.547 EDT [18502] LOG: wrong key in cancel request for process 18491 Ok, I've reverted that commit for now. It clearly needs more thought, because of this, and the pademelon failure discussed on the other thread. - Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes: > On 10/18/2016 04:13 PM, Tom Lane wrote: >> There's a smoking gun in the postmaster log: >> 2016-10-18 09:10:34.547 EDT [18502] LOG: wrong key in cancel request for process 18491 > Ok, I've reverted that commit for now. It clearly needs more thought, > because of this, and the pademelon failure discussed on the other thread. I think that was an overreaction. The problem is pretty obvious after adding some instrumentation: 2016-10-18 09:57:47.508 EDT [21229] LOG: wrong key (0x7B7E4D5E, expected 0xF0F804017B7E4D5E) in cancel request for process21228 To wit, the various cancel_key backend variables are declared as "long", and the new code if (!pg_strong_random(&MyCancelKey, sizeof(MyCancelKey))) is therefore computing an 8-byte random value on 64-bit-long machines. But only 4 bytes go to the client and come back. The cleanest fix might be to change those various "long" variables to uint32. You'd have to think about how to handle the ntohl/htonl calls that are used on them, though. regards, tom lane
I wrote: > The cleanest fix might be to change those various "long" variables > to uint32. You'd have to think about how to handle the ntohl/htonl > calls that are used on them, though. Or actually, no, you wouldn't have to think very hard. I was supposing that those calls were declared to traffic in "long"s, but they aren't and never have been, at least not since SUSv2: uint32_t htonl(uint32_t hostlong); uint16_t htons(uint16_t hostshort); uint32_t ntohl(uint32_t netlong); uint16_t ntohs(uint16_t netshort); So s/long/uint32/ would probably fix it just fine. regards, tom lane
On Tue, Oct 18, 2016 at 11:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I wrote: >> The cleanest fix might be to change those various "long" variables >> to uint32. You'd have to think about how to handle the ntohl/htonl >> calls that are used on them, though. > > Or actually, no, you wouldn't have to think very hard. I was supposing > that those calls were declared to traffic in "long"s, but they aren't > and never have been, at least not since SUSv2: > > uint32_t htonl(uint32_t hostlong); > uint16_t htons(uint16_t hostshort); > uint32_t ntohl(uint32_t netlong); > uint16_t ntohs(uint16_t netshort); > > So s/long/uint32/ would probably fix it just fine. The pg_strong_random patch has been reverted, what's done is done. FWIW only doing that is needed to fix the problem as already mentioned by Tom: diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c index f232083..634578d 100644 --- a/src/backend/utils/init/globals.c +++ b/src/backend/utils/init/globals.c @@ -38,7 +38,7 @@ volatile uint32 CritSectionCount = 0;int MyProcPid;pg_time_t MyStartTime;struct Port *MyProcPort; -long MyCancelKey; +uint32 MyCancelKey;int MyPMChildSlot; /* diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h index 78545da..5e623a1 100644 --- a/src/include/miscadmin.h +++ b/src/include/miscadmin.h @@ -162,7 +162,7 @@ extern PGDLLIMPORT int MyProcPid;extern PGDLLIMPORT pg_time_t MyStartTime;extern PGDLLIMPORT struct Port*MyProcPort;extern PGDLLIMPORT struct Latch *MyLatch; -extern long MyCancelKey; +extern uint32 MyCancelKey;extern int MyPMChildSlot; extern char OutputFileName[]; I'll send an update for that, as well as a solution for the pademelon problem on the SCRAM thread. -- Michael
On Tue, Oct 18, 2016 at 10:03:39AM -0400, Tom Lane wrote: > Heikki Linnakangas <hlinnaka@iki.fi> writes: > > On 10/18/2016 04:13 PM, Tom Lane wrote: > >> There's a smoking gun in the postmaster log: > >> 2016-10-18 09:10:34.547 EDT [18502] LOG: wrong key in cancel request for process 18491 > > > Ok, I've reverted that commit for now. It clearly needs more thought, > > because of this, and the pademelon failure discussed on the other thread. > > I think that was an overreaction. The problem is pretty obvious after > adding some instrumentation: I endorse Heikki's revert.