Обсуждение: Long running query - connection and keepalives enabled but query notcanceled
Long running query - connection and keepalives enabled but query notcanceled
От
Tomasz Ostrowski
Дата:
Hi. I have database users which have to run some ad-hoc queries on the database. These queries sometimes are pretty long running, especially when a user messes up a join condition for two 1Grows tables and therefore tries to count the number of quarks in the universe or something like this. After a query fails to return result for some time the user sometimes just closes his laptop lid, dropping his VPN connection and goes home to watch some cats on the Internet. The TCP keepalives functionality on the database server's operating system would figure out that the connection is not working anymore and close it. You'd assume that the database would get the info and cancel the query. Except - it does not. The database would get an error on the connection socket only after it tries to read or write to it next time. But it does not try to do this - it's busy counting those quarks until statement_timeout is reached. But default statement_timeout has to be quite large - the databases are large (up to several TB) and are sometimes exported with pg_dump, restored etc. - 24 hours is maybe a reasonable setting. Is there a way to configure the database somehow, so it would figure out that the connection is closed, faster than statement_timeout? I'd think maybe a periodic signal to the connection process asking for validating its connections somehow if it's running a query longer than tcp_keepalives_idle+tcp_keepalives_count*tcp_keepalives_interval? I think it would be possible to work-around with some external process checking /proc/[PID]/net/tcp[6] for closed connections and pg_cancel_backend() them. But it's managed PostgreSQL (AWS RDS), so it's not allowed. Is there any possibility to have something like this added to PostgreSQL? I believe it's not just us having this problem. -- Tomasz "Tometzky" Ostrowski
Re: Long running query - connection and keepalives enabled butquery not canceled
От
Laurenz Albe
Дата:
Tomasz Ostrowski wrote: > I have database users which have to run some ad-hoc queries on the > database. These queries sometimes are pretty long running, especially > when a user messes up a join condition for two 1Grows tables and > therefore tries to count the number of quarks in the universe or > something like this. > > After a query fails to return result for some time the user sometimes > just closes his laptop lid, dropping his VPN connection and goes home to > watch some cats on the Internet. > > The TCP keepalives functionality on the database server's operating > system would figure out that the connection is not working anymore and > close it. > > You'd assume that the database would get the info and cancel the query. > Except - it does not. > > The database would get an error on the connection socket only after it > tries to read or write to it next time. But it does not try to do this - > it's busy counting those quarks until statement_timeout is reached. By default "tcp_keepalives_idle" is quite large: 2 hours. Are you sure that the queries keep running for longer than that? Try to experiment with lower settings. It will cause marginally more network traffic, but dead connections will be detected more quickly. Yours, Laurenz Albe -- Cybertec | https://www.cybertec-postgresql.com
Re: Long running query - connection and keepalives enabled but querynot canceled
От
Tomasz Ostrowski
Дата:
On 6/4/19 6:24 PM, Laurenz Albe wrote: > Tomasz Ostrowski wrote: > >> The TCP keepalives functionality on the database server's operating >> system would figure out that the connection is not working anymore and >> close it. >> >> You'd assume that the database would get the info and cancel the query. >> Except - it does not. >> >> The database would get an error on the connection socket only after it >> tries to read or write to it next time. But it does not try to do this - >> it's busy counting those quarks until statement_timeout is reached. > > By default "tcp_keepalives_idle" is quite large: 2 hours. > > Are you sure that the queries keep running for longer than that? > > Try to experiment with lower settings. It will cause marginally more > network traffic, but dead connections will be detected more quickly. The keepalive settings we use is idle/interval/count=60/10/60. It means that the the dead connections should be cleared after 60s+10*60s=11m. I started to investigate when I found a query running for over 11 days (there was no query_timeout set on this server). But the problem is not that the dead connections aren't detected - they are, and the operating system clears them. They disappear from "netstat --tcp" output after expected time. The problem is that the database does not get the info and does not kill the query. I've reproduced this with running the SQL below on a local Postgresql 11 connected with TCP, configured with idle/interval/count=15/15/15: create or replace function pg_temp.fib(n int) returns int language plpgsql as $$ begin if n<=1 then return n; end if; return pg_temp.fib(n-1)+pg_temp.fib(n-2); end; $$; select pg_temp.fib(50); And then filtering out the connection with iptables. -- Tomasz "Tometzky" Ostrowski
Re: Long running query - connection and keepalives enabled but querynot canceled
От
Dmitry Vasiliev
Дата:
Tomasz Ostrowski wrote:
The database would get an error on the connection socket only after it tries to read or write to it next time. But it does not try to do this - it's busy counting those quarks until statement_timeout is reached.
It is unfortunate that PostgreSQL does not take into account that the client has disconnected and does not need to process query.
I know only one non-stable utility that can solve your problem: https://github.com/yandex/odyssey (unfortunately pgbouncer can't do that).--
Dmitry Vasiliev
Dmitry Vasiliev
Re: Long running query - connection and keepalives enabled but querynot canceled
От
Dmitry Vasiliev
Дата:
> Tomasz Ostrowski wrote: > >> The database would get an error on the connection socket only after it tries to read or write to it next time. But itdoes not try to do this - it's busy counting those quarks until statement_timeout is reached. >> > I found relevant patch from postgrespro on commitfest: https://commitfest.postgresql.org/21/1882/ -- Dmitry Vasiliev