Обсуждение: [BUGS] BUG #14471: PostgreSQL 9.6 "NOT IN" in select causes crash with"ERROR: unknown error"

Поиск
Список
Период
Сортировка

[BUGS] BUG #14471: PostgreSQL 9.6 "NOT IN" in select causes crash with"ERROR: unknown error"

От
josef.machytka@gmail.com
Дата:
The following bug has been logged on the website:

Bug reference:      14471
Logged by:          Josef Machytka
Email address:      josef.machytka@gmail.com
PostgreSQL version: 9.6.1
Operating system:   Debian jessie
Description:

I checked stackoverflow and I found several questions complaining about
problems with "NOT IN" causing problems in selects.

Looks like PostgreSQL 9.6 now very probably has some bug connected with "NOT
IN" command. We have select which actually crashes after several hours (!)
of run with message "ERROR:  unknown error". Although according to explain
plan it should be done in ~40 minutes.

It is also possible that problem happens only when working with partitioned
tables. We have several sets of partitioned tables. Partitions are by months
- data are used for billing calculations.

Our select which crashes pg 9.6 is like this:

select <some columns> 
from table_1_partition t1
inner join table_2 t2 --also partitioned table
on ....
where (t1.col1, t2.col1, t1.col2) NOT IN
(select col1, col2, col3 
from table_3 --also partitioned table
group by col1, col2, col3)

When I remove NOT IN part everything works OK. Of course I replaced this
problematic part with NOT EXISTS construct. But "unknown error" crash with
"NOT IN" command is very unfortunate.


--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14471: PostgreSQL 9.6 "NOT IN" in select causes crash with "ERROR: unknown error"

От
Tom Lane
Дата:
josef.machytka@gmail.com writes:
> Looks like PostgreSQL 9.6 now very probably has some bug connected with "NOT
> IN" command. We have select which actually crashes after several hours (!)
> of run with message "ERROR:  unknown error". Although according to explain
> plan it should be done in ~40 minutes.

This is an interesting report, but without enough information to replicate
the problem, we're unlikely to be able to help you.  A fragment of a
query, with zero information about the underlying tables, is far from
enough.

The only occurrences of the string "unknown error" that I can find in the
source code are in dblink and postgres_fdw (both reflecting cases where
the remote server did not return an error message, itself a "shouldn't
happen" situation).  If you were using either, you didn't say so; but if
you were, maybe taking a look in the remote server's log would be useful.

            regards, tom lane


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14471: PostgreSQL 9.6 "NOT IN" in select causes crashwith "ERROR: unknown error"

От
Josef Machytka
Дата:
Yes, I am sorry, dblink is involved - I just did not see it as significant.
We start several processes in parallel to speed up whole billing calculation otherwise it would take 10+ hours to calculate everything in serial.
Ok, so at least "unknown error" is explained.
But problem with "NOT IN" remains.
When I replaced "NOT IN" with "NOT EXISTS" query ended after ~3 hours without any problems. Even over dblink.
Partitions have from 1M to 3M records each and now we use partitions for present year only.
So it is not small task but also not something really big (we work with much more data when we calculate statistics from web metrics).
And I run billing on 2 different GCE instances with pg 9.6.1 to have comparison. Behaviour was the same.
Unfortunately data are confident so it would be quite hard to give you some access to test it even with anonymous data because even structure of the query contains a lot of know-how.
Therefore I could sent only so crazy looking skeleton of the query. Sorry about it.


On 21 December 2016 at 16:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
josef.machytka@gmail.com writes:
> Looks like PostgreSQL 9.6 now very probably has some bug connected with "NOT
> IN" command. We have select which actually crashes after several hours (!)
> of run with message "ERROR:  unknown error". Although according to explain
> plan it should be done in ~40 minutes.

This is an interesting report, but without enough information to replicate
the problem, we're unlikely to be able to help you.  A fragment of a
query, with zero information about the underlying tables, is far from
enough.

The only occurrences of the string "unknown error" that I can find in the
source code are in dblink and postgres_fdw (both reflecting cases where
the remote server did not return an error message, itself a "shouldn't
happen" situation).  If you were using either, you didn't say so; but if
you were, maybe taking a look in the remote server's log would be useful.

                        regards, tom lane

Re: [BUGS] BUG #14471: PostgreSQL 9.6 "NOT IN" in select causes crash with "ERROR: unknown error"

От
Tom Lane
Дата:
Josef Machytka <josef.machytka@gmail.com> writes:
> Yes, I am sorry, dblink is involved - I just did not see it as significant.
> We start several processes in parallel to speed up whole billing
> calculation otherwise it would take 10+ hours to calculate everything in
> serial.
> Ok, so at least "unknown error" is explained.

Well, we have a theory about where it came from, but still not enough
information to improve the behavior.  Did you look to see what happened
on the remote server?

> But problem with "NOT IN" remains.
> When I replaced "NOT IN" with "NOT EXISTS" query ended after ~3 hours
> without any problems. Even over dblink.

You do know that NOT IN and NOT EXISTS behave quite differently with
respect to nulls?  I'm suspicious that the real problem here is that
your query is just wrong when written with NOT IN, and it specifies
some unreasonable amount of computation.  Possibly something is running
out of memory and not dealing with the case very well, leading to the
unhelpful error message.

FWIW, just about every bug report I've ever seen about NOT IN boiled
down to the complainant's subquery returning one or more nulls and
the complainant not understanding what will happen if it does.
Unfortunately, that's not a bug, it's the behavior required by the
SQL standard.

            regards, tom lane


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14471: PostgreSQL 9.6 "NOT IN" in select causes crashwith "ERROR: unknown error"

От
Josef Machytka
Дата:
OK, this sounds very interesting.
We already know about many different problems with NULL values because we use heavily different GROUP BYs and WINDOW functions so we replace NULL's everywhere with 'unknown' or similar.
But maybe there is some problem in data import. I will check data.
Thanks and I will let you know what I have found.

On 21 December 2016 at 17:11, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Josef Machytka <josef.machytka@gmail.com> writes:
> Yes, I am sorry, dblink is involved - I just did not see it as significant.
> We start several processes in parallel to speed up whole billing
> calculation otherwise it would take 10+ hours to calculate everything in
> serial.
> Ok, so at least "unknown error" is explained.

Well, we have a theory about where it came from, but still not enough
information to improve the behavior.  Did you look to see what happened
on the remote server?

> But problem with "NOT IN" remains.
> When I replaced "NOT IN" with "NOT EXISTS" query ended after ~3 hours
> without any problems. Even over dblink.

You do know that NOT IN and NOT EXISTS behave quite differently with
respect to nulls?  I'm suspicious that the real problem here is that
your query is just wrong when written with NOT IN, and it specifies
some unreasonable amount of computation.  Possibly something is running
out of memory and not dealing with the case very well, leading to the
unhelpful error message.

FWIW, just about every bug report I've ever seen about NOT IN boiled
down to the complainant's subquery returning one or more nulls and
the complainant not understanding what will happen if it does.
Unfortunately, that's not a bug, it's the behavior required by the
SQL standard.

                        regards, tom lane

Re: [BUGS] BUG #14471: PostgreSQL 9.6 "NOT IN" in select causes crashwith "ERROR: unknown error"

От
Josef Machytka
Дата:
Hi again.
So I checked all logs and it turned out one of databases involved vent into recovery mode because some of its connections was killed.
Probably out of memory killer did it. So it caused chain reaction. And billing query was canceled "due to administration command" and it was shown through dblink as "unknown error".
Which looked quite horrible...

Regarding NOT IN - you were right there are NULL values.

Best regards


On 21 December 2016 at 17:11, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Josef Machytka <josef.machytka@gmail.com> writes:
> Yes, I am sorry, dblink is involved - I just did not see it as significant.
> We start several processes in parallel to speed up whole billing
> calculation otherwise it would take 10+ hours to calculate everything in
> serial.
> Ok, so at least "unknown error" is explained.

Well, we have a theory about where it came from, but still not enough
information to improve the behavior.  Did you look to see what happened
on the remote server?

> But problem with "NOT IN" remains.
> When I replaced "NOT IN" with "NOT EXISTS" query ended after ~3 hours
> without any problems. Even over dblink.

You do know that NOT IN and NOT EXISTS behave quite differently with
respect to nulls?  I'm suspicious that the real problem here is that
your query is just wrong when written with NOT IN, and it specifies
some unreasonable amount of computation.  Possibly something is running
out of memory and not dealing with the case very well, leading to the
unhelpful error message.

FWIW, just about every bug report I've ever seen about NOT IN boiled
down to the complainant's subquery returning one or more nulls and
the complainant not understanding what will happen if it does.
Unfortunately, that's not a bug, it's the behavior required by the
SQL standard.

                        regards, tom lane

Re: [BUGS] BUG #14471: PostgreSQL 9.6 "NOT IN" in select causes crash with "ERROR: unknown error"

От
Tom Lane
Дата:
Josef Machytka <josef.machytka@gmail.com> writes:
> So I checked all logs and it turned out one of databases involved vent into
> recovery mode because some of its connections was killed.
> Probably out of memory killer did it. So it caused chain reaction. And
> billing query was canceled "due to administration command" and it was shown
> through dblink as "unknown error".

OK.  In that case I'm betting that what dblink saw was a network error on
its next attempt to send a command.  I've not dug into the code to verify
this positively, but I think the behavior will be improved by Joe Conway's
recent patch,
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=2f802d95b4904dbed3dfdca1b3a607cd085d2e20

            regards, tom lane


--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14471: PostgreSQL 9.6 "NOT IN" in select causes crashwith "ERROR: unknown error"

От
Joe Conway
Дата:
On 12/23/2016 01:08 AM, Josef Machytka wrote:
> Hi again.
> So I checked all logs and it turned out one of databases involved vent
> into recovery mode because some of its connections was killed.
> Probably out of memory killer did it. So it caused chain reaction. And
> billing query was canceled "due to administration command" and it was
> shown through dblink as "unknown error".
> Which looked quite horrible...

FYI, I committed a fix for this since the initial report.

Before:
----------
test=# -- kill the remote side manually
test=# select * from dblink('dbname=test','select pg_sleep(30)');
ERROR:  unknown error
CONTEXT:  Error occurred on dblink connection named "unnamed": could not
execute query.


After:
----------
test=# -- kill the remote side manually
test=# select * from dblink('dbname=test','select pg_sleep(30)');
ERROR:  server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

CONTEXT:  Error occurred on dblink connection named "unnamed": could not
execute query.

Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development