Обсуждение: Long running query - connection and keepalives enabled but query notcanceled

Поиск
Список
Период
Сортировка

Long running query - connection and keepalives enabled but query notcanceled

От
Tomasz Ostrowski
Дата:
Hi.

I have database users which have to run some ad-hoc queries on the 
database. These queries sometimes are pretty long running, especially 
when a user messes up a join condition for two 1Grows tables and 
therefore tries to count the number of quarks in the universe or 
something like this.

After a query fails to return result for some time the user sometimes 
just closes his laptop lid, dropping his VPN connection and goes home to 
watch some cats on the Internet.

The TCP keepalives functionality on the database server's operating 
system would figure out that the connection is not working anymore and 
close it.

You'd assume that the database would get the info and cancel the query. 
Except - it does not.

The database would get an error on the connection socket only after it 
tries to read or write to it next time. But it does not try to do this - 
it's busy counting those quarks until statement_timeout is reached.

But default statement_timeout has to be quite large - the databases are 
large (up to several TB) and are sometimes exported with pg_dump, 
restored etc. - 24 hours is maybe a reasonable setting.

Is there a way to configure the database somehow, so it would figure out 
that the connection is closed, faster than statement_timeout?

I'd think maybe a periodic signal to the connection process asking for 
validating its connections somehow if it's running a query longer than 
tcp_keepalives_idle+tcp_keepalives_count*tcp_keepalives_interval?

I think it would be possible to work-around with some external process 
checking /proc/[PID]/net/tcp[6] for closed connections and 
pg_cancel_backend() them. But it's managed PostgreSQL (AWS RDS), so it's 
not allowed.

Is there any possibility to have something like this added to 
PostgreSQL? I believe it's not just us having this problem.

-- 
Tomasz "Tometzky" Ostrowski



Re: Long running query - connection and keepalives enabled butquery not canceled

От
Laurenz Albe
Дата:
Tomasz Ostrowski wrote:
> I have database users which have to run some ad-hoc queries on the 
> database. These queries sometimes are pretty long running, especially 
> when a user messes up a join condition for two 1Grows tables and 
> therefore tries to count the number of quarks in the universe or 
> something like this.
> 
> After a query fails to return result for some time the user sometimes 
> just closes his laptop lid, dropping his VPN connection and goes home to 
> watch some cats on the Internet.
> 
> The TCP keepalives functionality on the database server's operating 
> system would figure out that the connection is not working anymore and 
> close it.
> 
> You'd assume that the database would get the info and cancel the query. 
> Except - it does not.
> 
> The database would get an error on the connection socket only after it 
> tries to read or write to it next time. But it does not try to do this - 
> it's busy counting those quarks until statement_timeout is reached.

By default "tcp_keepalives_idle" is quite large: 2 hours.

Are you sure that the queries keep running for longer than that?

Try to experiment with lower settings.  It will cause marginally more
network traffic, but dead connections will be detected more quickly.

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com




Re: Long running query - connection and keepalives enabled but querynot canceled

От
Tomasz Ostrowski
Дата:
On 6/4/19 6:24 PM, Laurenz Albe wrote:

> Tomasz Ostrowski wrote:
> 
>> The TCP keepalives functionality on the database server's operating
>> system would figure out that the connection is not working anymore and
>> close it.
>>
>> You'd assume that the database would get the info and cancel the query.
>> Except - it does not.
>>
>> The database would get an error on the connection socket only after it
>> tries to read or write to it next time. But it does not try to do this -
>> it's busy counting those quarks until statement_timeout is reached.
> 
> By default "tcp_keepalives_idle" is quite large: 2 hours.
> 
> Are you sure that the queries keep running for longer than that?
> 
> Try to experiment with lower settings.  It will cause marginally more
> network traffic, but dead connections will be detected more quickly.

The keepalive settings we use is idle/interval/count=60/10/60. It means 
that the the dead connections should be cleared after 60s+10*60s=11m. I 
started to investigate when I found a query running for over 11 days 
(there was no query_timeout set on this server).

But the problem is not that the dead connections aren't detected - they 
are, and the operating system clears them. They disappear from "netstat 
--tcp" output after expected time.

The problem is that the database does not get the info and does not kill 
the query.

I've reproduced this with running the SQL below on a local Postgresql 11 
connected with TCP, configured with idle/interval/count=15/15/15:

   create or replace function pg_temp.fib(n int) returns int language 
plpgsql as $$
     begin if n<=1 then return n; end if; return 
pg_temp.fib(n-1)+pg_temp.fib(n-2); end;
   $$;
   select pg_temp.fib(50);

And then filtering out the connection with iptables.

-- 
Tomasz "Tometzky" Ostrowski



Re: Long running query - connection and keepalives enabled but querynot canceled

От
Dmitry Vasiliev
Дата:
Tomasz Ostrowski wrote:

The database would get an error on the connection socket only after it tries to read or write to it next time. But it does not try to do this - it's busy counting those quarks until statement_timeout is reached.


It is unfortunate that PostgreSQL does not take into account that the client has disconnected and does not need to process query.
I know only one non-stable utility that can solve your problem: https://github.com/yandex/odyssey (unfortunately pgbouncer can't do that).

-- 
Dmitry Vasiliev

Re: Long running query - connection and keepalives enabled but querynot canceled

От
Dmitry Vasiliev
Дата:
> Tomasz Ostrowski wrote:
>
>> The database would get an error on the connection socket only after it tries to read or write to it next time. But
itdoes not try to do this - it's busy counting those quarks until statement_timeout is reached. 
>>
>

I found relevant patch from postgrespro on commitfest: https://commitfest.postgresql.org/21/1882/

--
Dmitry Vasiliev