Обсуждение: ERROR: canceling query due to user request

Поиск
Список
Период
Сортировка

ERROR: canceling query due to user request

От
"Kevin Grittner"
Дата:
I started with this issue on the pgsql-jdbc list, but was assured that this could not be a bug in the JDBC driver -- it
iseither a bug in the configuration or runtime environment or it is a server-side bug.  This list seemed like the
logicalnext step. 

We have seen this behavior in 8.0.3 with the 311 JDBC driver.  Today I tested with the latest snapshot of the 8.1
developmentcode and the tip of the JDBC driver code from the root.  It still happens. 

The environment for the latest test is:
SuSE Linux 9.3 (i586)
2.6.11.4-21.8-smp #1 SMP Tue Jul 19 12:42:37 UTC 2005
Dual Xeon
Disk on 7 spindle RAID 5 via SAN
All access through JDBC connections.
A single thread accessing the database.
A connection pool in use -- the thread may grab a different connection each time.
No other process concurrently executing against the database.
Happens with or without autovacuum running.

Failures have also occured on Windows 2000 and Windows XP servers with local hard drives.  Xeons in all cases.

The software involved "optimistically" tries to do a series of inserts, updates, and/or deletes in a single database
transaction. If an exception occurs (for example an insert of a row which already exists) or an update affects zero
rows,the transaction is rolled back and the operations are attempted in "cautious" mode -- a commit after each insert
updateor delete of a single row. 

The client is running full out, with a steady supply of ready-to-go requests.  Client CPU seems to be the bottleneck,
ratherthan anything on the database server hardware.  We are only seeing this problem in "cautious" mode -- a database
transactionhas been rolled back and we're committing each statement as we go. 

If we run the same set of data multiple times, the error occurs on different requests each time, indicating it is not
datadependent.  If we turn on logging at the JDBC driver level, it never happens, indicating that it is timing
sensitive. We've never seen it while stepping through the debugger, but since it seems to happen randomly once every
fewthousand requests, that would be a unlikely anyway. 

The error is manifest by this message and stack trace:

org.postgresql.util.PSQLException: ERROR: canceling query due to user request
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:1499)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1284)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:175)
    at org.postgresql.jdbc2.AbstractJdbc2Connection.executeTransactionCommand(AbstractJdbc2Connection.java:617)
    at org.postgresql.jdbc2.AbstractJdbc2Connection.commit(AbstractJdbc2Connection.java:637)
    at (our application code)

There is no method in the JDBC interface to cancel a commit from the client side, and there is nothing in the client
codewhich is trying to do so.  There are no processes running on the server except services from the SuSE install and
theprocesses started by pg_ctl start. 

We tried setting the ulimit of everything to unlimited, where allowed.  We set the open files limit to 4096.  These
ulimitchanges had no affect on the problem. 

Does anyone have any ideas on a possible cause, or any diagnostic steps we should take?

Thanks,

-Kevin




Re: ERROR: canceling query due to user request

От
Tom Lane
Дата:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> org.postgresql.util.PSQLException: ERROR: canceling query due to user request

The only possible trigger of that message is a SIGINT sent to the backend.
Now the backend will SIGINT itself if a statement timeout expires, so one
possibility is that you have statement_timeout set and it's getting
exceeded.  Otherwise you need to be looking for external causes.

            regards, tom lane

Re: ERROR: canceling query due to user request

От
"Kevin Grittner"
Дата:
Thanks, Tom.

The statement_timeout setting was also raised on the JDBC list, and has been checked -- there is nothing setting
statement_timeout. The connection shows this value at zero.  None of our code contains anything PostgreSQL specific, so
thereis nothing in our framework or applications that would be aware of the "set statement_timeout" command to be able
toissue it.  I created the database instance, the user and the database, and have not set this as a default anywhere. 

Regarding the possibility that a ulimit setting is sending the signal, I am using "su" to set user id to root, issuing
theulimit statements to set everything unlimited or very high, then using "su" to set user id to postgres.  At that
pointthe settings from root still show.  Then I'm starting postgres using pg_ctl.  Is there any reason the ulimit
settingswould not carry through with this approach?  Is there a better way to do it? 

Other than that, what external causes might be at fault when I have both Windows machines and Linux machines which have
nothinginstalled but the operating system and PostgreSQL?  The only ulimit settings I couldn't set to "unlimited" were
pipesize and open files.  Is there any chance that the error/rollback path in the code is leaking open files on the
server? Is there anything I should run during the test to watch for potential resource exhaustion? 

-Kevin


>>> Tom Lane <tgl@sss.pgh.pa.us> 09/12/05 5:39 PM >>>
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> org.postgresql.util.PSQLException: ERROR: canceling query due to user request

The only possible trigger of that message is a SIGINT sent to the backend.
Now the backend will SIGINT itself if a statement timeout expires, so one
possibility is that you have statement_timeout set and it's getting
exceeded.  Otherwise you need to be looking for external causes.

            regards, tom lane


Re: ERROR: canceling query due to user request

От
"Kevin Grittner"
Дата:
One more thought -- I keep coming back to the fact that when we turn on logging in the JDBC driver on the client side,
theproblem does not occur.  The only possible reason I can see for this having any affect on the problem is the small
delayintroduced by the synchronous logging.  Since this is only showing up on commit of a database transaction which
followsclose on the heels of a rollback on the same connection, is there any chance that there is some very small
"settlingtime" needed for a rollback, and we're sometimes getting in ahead of this? 

-Kevin


>>> Tom Lane <tgl@sss.pgh.pa.us> 09/12/05 5:39 PM >>>
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> org.postgresql.util.PSQLException: ERROR: canceling query due to user request

The only possible trigger of that message is a SIGINT sent to the backend.
Now the backend will SIGINT itself if a statement timeout expires, so one
possibility is that you have statement_timeout set and it's getting
exceeded.  Otherwise you need to be looking for external causes.

            regards, tom lane


Re: ERROR: canceling query due to user request

От
Tom Lane
Дата:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> One more thought -- I keep coming back to the fact that when we turn
> on logging in the JDBC driver on the client side, the problem does not
> occur.  The only possible reason I can see for this having any affect
> on the problem is the small delay introduced by the synchronous
> logging.  Since this is only showing up on commit of a database
> transaction which follows close on the heels of a rollback on the same
> connection, is there any chance that there is some very small
> "settling time" needed for a rollback, and we're sometimes getting in
> ahead of this?

(1) No.

(2) Even if we posit the existence of such a serious bug as that, it
wouldn't explain how control gets to the SIGINT response code.

            regards, tom lane

Re: ERROR: canceling query due to user request

От
"Kevin Grittner"
Дата:
I'm having a really hard time coming up with theories about the cause
or things to check.

We ran the test again with logging to disk, and it didn't happen in an
hour of testing.  The logging boosted the average run time of the
series of database modificates we attempt as a single transaction
from 44 ms to 58 ms.  We logged to a dummy PrintWriter, which just
returned to driver code without doing anything, and the time went to
50 ms.  We got our first error after 9 minutes with that configuration.

The only thing running on the server is the postgres back end.  It would
be hard to imagine something outside of the postgres software itself
which would be able to send the signal only when a rollback occurred.
Can you think of anything which could be coming through the protocol
stream which would cause this signal during the commit after a rollback?

About the only other thing I can think to do is to try to come up with
a RAM-based PrintWriter to keep a rolling buffer of JDBC logging
which it would dump when we get the error.  Since a PrintWriter which
did absolutely nothing was right on the edge of blocking the problem,
I'm skeptical that adding even that much will allow the problem to show.

I welcome all suggestions on what to try or what to monitor.

-Kevin


>>> Tom Lane <tgl@sss.pgh.pa.us> 09/13/05 11:31 AM >>>
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> One more thought -- I keep coming back to the fact that when we turn
> on logging in the JDBC driver on the client side, the problem does not
> occur.  The only possible reason I can see for this having any affect
> on the problem is the small delay introduced by the synchronous
> logging.  Since this is only showing up on commit of a database
> transaction which follows close on the heels of a rollback on the same
> connection, is there any chance that there is some very small
> "settling time" needed for a rollback, and we're sometimes getting in
> ahead of this?

(1) No.

(2) Even if we posit the existence of such a serious bug as that, it
wouldn't explain how control gets to the SIGINT response code.

            regards, tom lane


Re: ERROR: canceling query due to user request

От
Tom Lane
Дата:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> I'm having a really hard time coming up with theories about the cause
> or things to check.

Have you tried strace'ing the backend process to see if you can see a
signal being delivered to it?

            regards, tom lane

Re: ERROR: canceling query due to user request

От
"Kevin Grittner"
Дата:
Thanks, Tom.  An excellent suggestion.  (This 50 year old dog has
today learned a new trick.)

There is good news and bad news.  The good news is that I found
the cause, and we can keep this from happening with a change on
our end.  The bad news is that I think it also points to a backend bug,
although not as serious as the potential (hypothetical) one I was
asking about earlier.

For the record (and the benefit of anyone with similar problems who
may search the archives), I got the connection pool set up, and found
the pids for the connection processes like this:

# ps aux|grep postgres
postgres 18307  0.0  0.0   3052  1100 pts/0    S    Sep09   0:00 su postgres
postgres 18308  0.0  0.0   3408  1828 pts/0    S+   Sep09   0:00 bash
postgres 15463  0.0  0.1  89508  3852 pts/0    S    10:09   0:05 /usr/local/pgsql/bin/postmaster -D /var/pgsql/data
postgres 15466  0.0  4.0  89652 83004 pts/0    S    10:09   0:00 postgres: writer process
postgres 15467  0.0  0.1   7052  2796 pts/0    S    10:09   0:02 postgres: stats buffer process
postgres 15468  0.0  0.0   6388  1996 pts/0    S    10:09   0:03 postgres: stats collector process
postgres   926  0.0  0.3  90484  7576 pts/0    S    14:34   0:00 postgres: dtr dtr 165.219.88.77(4436) idle
postgres   927  0.0  0.2  89956  4684 pts/0    S    14:34   0:00 postgres: dtr dtr 165.219.88.77(4437) idle
postgres   928  3.0  1.1  90404 23764 pts/0    S    14:34   0:07 postgres: dtr dtr 165.219.88.77(4438) idle
postgres   929  2.7  1.1  90468 23260 pts/0    S    14:34   0:07 postgres: dtr dtr 165.219.88.77(4439) idle
root       935  0.0  0.0   1796   648 pts/2    S+   14:38   0:00 grep postgres

I then established an strace session for each connection like this:

strace -tt -o strace.926 -p 926

Other flags may have been useful, but strace is new to me, so I took
the route I was sure I understood.  I ran until we got an error, which
involved running through 7,278 transactions.  I used Ctrl+C to stop
each strace and searched the results for SIGINT.  There were 1,799
of them.  They always came in a set of lines like this:

13:59:19.625498 recv(7, "P\0\0\0Y\0SELECT lcmtr.\"relationName"..., 8192, 0) = 125
13:59:19.625976 _llseek(32, 0, [0], SEEK_END) = 0
13:59:19.626057 _llseek(33, 0, [8192], SEEK_END) = 0
13:59:19.626159 _llseek(32, 0, [0], SEEK_END) = 0
13:59:19.626257 send(7, "1\0\0\0\0042\0\0\0\4T\0\0\0D\0\2relationName\0\0\0"..., 97, 0) = 97
13:59:19.626352 recv(7, 0x82b1000, 8192, 0) = ? ERESTARTSYS (To be restarted)
13:59:19.628477 --- SIGINT (Interrupt) @ 0 (0) ---
13:59:19.628559 sigreturn()             = ? (mask now [])

The SELECT statement was easy to find, and it became clear that a
programmer had code which was incorrectly canceling a JDBC
Statement after reaching the end of the (empty) ResultSet.  One
time out of 1,799 this was causing the error we were seeing on
the subsequent commit, which strikes me as a bug.

We've already changed the offending code to avoid the invocation
of the Statement.cancel method.  Not that it merits high priority, but
it might make sense for PostgreSQL to behave more consistently on
a commit when a statement within the database transaction has been
canceled.  There currently is a race condition where if the commit
comes fast enough after the Statement.cancel, it receives the error
which is the subject of this thread.

-Kevin


>>> Tom Lane <tgl@sss.pgh.pa.us> 09/13/05 1:18 PM >>>
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> I'm having a really hard time coming up with theories about the cause
> or things to check.

Have you tried strace'ing the backend process to see if you can see a
signal being delivered to it?

            regards, tom lane


Re: ERROR: canceling query due to user request

От
Tom Lane
Дата:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> The SELECT statement was easy to find, and it became clear that a
> programmer had code which was incorrectly canceling a JDBC
> Statement after reaching the end of the (empty) ResultSet.  One
> time out of 1,799 this was causing the error we were seeing on
> the subsequent commit, which strikes me as a bug.

> We've already changed the offending code to avoid the invocation
> of the Statement.cancel method.  Not that it merits high priority, but
> it might make sense for PostgreSQL to behave more consistently on
> a commit when a statement within the database transaction has been
> canceled.  There currently is a race condition where if the commit
> comes fast enough after the Statement.cancel, it receives the error
> which is the subject of this thread.

Yeah.  The Query Cancel protocol has a potential race condition in it:
the cancel request is not synchronous with respect to your existing
connection (it's actually sent over a different connection).  If you try
to send another command immediately, it's possible that the new command
gets to the backend before the cancel signal does, and then the cancel
zaps your new command instead of the one you intended.

This has been discussed before (try the pgsql-jdbc list archives).
I believe we concluded that an appropriate fix was to not consider the
cancel request "done" until the client sees the separate connection
dropped by the postmaster.  libpq's cancel functions wait for that to
happen, and I thought that JDBC had been fixed as well --- maybe you are
using an old driver version?

Of course, if the internal threading in your app is such that you might
issue another command before the Statement.cancel method finishes, then
it's your own bug.

            regards, tom lane

Re: [JDBC] ERROR: canceling query due to user request

От
Oliver Jowett
Дата:
Tom Lane wrote:

> This has been discussed before (try the pgsql-jdbc list archives).
> I believe we concluded that an appropriate fix was to not consider the
> cancel request "done" until the client sees the separate connection
> dropped by the postmaster.  libpq's cancel functions wait for that to
> happen, and I thought that JDBC had been fixed as well --- maybe you are
> using an old driver version?

I thought this got done too, but looking at the current JDBC driver code
 I don't see it..

-O