Обсуждение: Sudden database error with COUNT(*) making Query Planner crashes: variable not found in subplan target list

Поиск
Список
Период
Сортировка

Sudden database error with COUNT(*) making Query Planner crashes: variable not found in subplan target list

От
Jean Landercy - BEEODIVERSITY
Дата:

Dear PostgreSQL Hacker Community,

 

I am facing a tricky bug which makes the Query Planner crashes when using COUNT(*) function.

Without any upgrade suddenly a table of a database instance could not be queried this way:

 

SELECT COUNT(*) FROM items;

 

-- ERROR:  variable not found in subplan target list

-- SQL state: XX000

 

Message and behaviour seem related to the Query Planner:

 

EXPLAIN SELECT COUNT(*) FROM item;
-- ERROR:  variable not found in subplan target list
-- SQL state: XX000

 

Looks like a column name could not be found (see https://github.com/postgres/postgres/blob/ce4f46fdc814eb1b704d81640f6d8f03625d0f53/src/backend/optimizer/plan/setrefs.c#L2967-L2972) in some specific context that is somehow hard to reproduce.

 

Interesting facts:

 

SELECT COUNT(id) FROM items;  -- 213
 
SELECT COUNT(*) FROM items WHERE id > 0; -- 213

 

Work as expected.

 

I can see that other people are recently facing a similar problem (https://www.postgresql.org/message-id/flat/4c347490-d734-5fdd-d613-1327601b4e7e%40mit.edu).

If it is the same bug then it is not related to the PGroonga extension as I don’t use it all.

 

Anyway, the bug is difficult to reproduce on my application.

At the time of writing, I could just isolate it on a specific database but I could not draw a MCVE from it.

I am looking for help to make it reproducible and feed your knowledge database.

 

My first guess was to open a post of SO (see for details https://stackoverflow.com/questions/72498741/how-can-i-reproduce-a-database-context-to-debug-a-tricky-postgresql-error-vari), but digging deeper in the investigation it seems it will require people with strong insights on how PostgreSQL actually works under the hood.

Therefore, I chose this specific mailing list.

 

The bug is tricky to reproduce, I could not succeed to replicate elsewhere (dump/restore does not preserve it).

Anyway it makes my database unusable and looks like a potential bug for your product and applications relying on it.

 

Faulty setup is about:

 

SELECT version();
-- PostgreSQL 13.6 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
 
SELECT extname, extversion FROM pg_extension;
-- "plpgsql"    "1.0"
-- "postgis"    "3.1.1"

 

By now, the only workarounds I have found are:

 

  • Dump database and recreate a new instance (problem seems to vanish but there is no guarantee it is solved or it will not happened later on);
  • Add dummy filter on all queries (more a trick than a solution).

 

I am writing to this mailing list to raise you attention on it.

I’ll be happy to help you investigate it deeper.

 

Best regards,

 

Landercy Jean

 

On Mon, Jun 06, 2022 at 09:34:24AM +0000, Jean Landercy - BEEODIVERSITY wrote:
> Faulty setup is about:
> 
> SELECT version();
> 
> -- PostgreSQL 13.6 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit

Please check if the problem occurs in v13.7

https://www.postgresql.org/message-id/2197859.1644623850@sss.pgh.pa.us
https://www.postgresql.org/message-id/flat/2121219.1644607692%40sss.pgh.pa.us#190c43702a91dbd0509ba545dbbab58d



RE: Sudden database error with COUNT(*) making Query Planner crashes: variable not found in subplan target list

От
Jean Landercy - BEEODIVERSITY
Дата:
Dear Justin,

Thank you for your quick reply.
Unfortunately, the server having this issue is an Azure Flexible Server.
Upgrades are managed by Azure, I will have to wait until they release the version 13.7.

Is there a procedure to replicate the database and preserve the bug.
My attempts with pg_dump/psql failed (the bug vanishes).
If so then I can clone the faulty database and upgrade it on a newer version.

Best regards,

Landercy Jean




On Mon, Jun 06, 2022 at 04:50:55PM +0000, Jean Landercy - BEEODIVERSITY wrote:
> Dear Justin,
> 
> Thank you for your quick reply.
> Unfortunately, the server having this issue is an Azure Flexible Server.
> Upgrades are managed by Azure, I will have to wait until they release the version 13.7.

I don't know what to suggest other than to open a support ticket with the
vendor.

-- 
Justin



On Mon, 6 Jun 2022 at 21:34, Jean Landercy - BEEODIVERSITY
<jean.landercy@beeodiversity.com> wrote:
> SELECT COUNT(*) FROM items;
> -- ERROR:  variable not found in subplan target list
> -- SQL state: XX000

Can you share some more details about what "items" is. psql's "\d
items" output would be useful.  From what you've reported we can't
tell if this is a table or a view.

> The bug is tricky to reproduce, I could not succeed to replicate elsewhere (dump/restore does not preserve it).

Can you share the output of:

select attname,atttypid::regtype,attnum,atthasdef,atthasmissing,attgenerated,attisdropped
from pg_attribute where attrelid = 'items'::regclass order by attnum;

This will let us see if there's something strange going on with
dropped or has missing columns.  There may be some sequence of ALTER
TABLE ADD COLUMN ... DEFAULT / DROP COLUMN that is causing this. The
output of that might help us see if that could be a factor.

David



RE: Sudden database error with COUNT(*) making Query Planner crashes: variable not found in subplan target list

От
Jean Landercy - BEEODIVERSITY
Дата:
Dear David,

Thank you for taking time on this issue.

Here is the detail of the table (I have anonymized it on SO, this is its real name):

\d logistic_site
                                          Table « public.logistic_site »
   Colonne   |           Type           | Collationnement | NULL-able |                Par défaut

-------------+--------------------------+-----------------+-----------+-------------------------------------------
 id          | bigint                   |                 | not null  | nextval('logistic_site_id_seq'::regclass)
 key         | character varying(32)    |                 | not null  |
 name        | character varying(128)   |                 |           |
 created     | timestamp with time zone |                 | not null  |
 updated     | timestamp with time zone |                 | not null  |
 archived    | timestamp with time zone |                 |           |
 geom        | geometry(Polygon,4326)   |                 |           |
 location    | geometry(Point,4326)     |                 |           |
 notes       | text                     |                 |           |
 country_id  | bigint                   |                 |           |
 customer_id | bigint                   |                 |           |

Index :
    "logistic_site_pkey" PRIMARY KEY, btree (id)
    "logistic_site_country_id_9a696481" btree (country_id)
    "logistic_site_customer_id_a2c8a74a" btree (customer_id)
    "logistic_site_geom_105a08da_id" gist (geom)
    "logistic_site_key_2e791173_like" btree (key varchar_pattern_ops)
    "logistic_site_key_key" UNIQUE CONSTRAINT, btree (key)
    "logistic_site_location_54ae0166_id" gist (location)
Contraintes de clés étrangères :
    "logistic_site_country_id_9a696481_fk_logistic_country_id" FOREIGN KEY (country_id) REFERENCES logistic_country(id)
DEFERRABLEINITIALLY DEFERRED
 
    "logistic_site_customer_id_a2c8a74a_fk_logistic_customer_id" FOREIGN KEY (customer_id) REFERENCES
logistic_customer(id)DEFERRABLE INITIALLY DEFERRED
 
Référencé par :
    TABLE "logistic_hive" CONSTRAINT "logistic_hive_site_id_50c29dd8_fk_logistic_site_id" FOREIGN KEY (site_id)
REFERENCESlogistic_site(id) DEFERRABLE INITIALLY DEFERRED
 
    TABLE "logistic_packorder" CONSTRAINT "logistic_packorder_site_id_16e1a41a_fk_logistic_site_id" FOREIGN KEY
(site_id)REFERENCES logistic_site(id) DEFERRABLE INITIALLY DEFERRED
 
    TABLE "logistic_projectsite" CONSTRAINT "logistic_projectsite_site_id_522bf74b_fk_logistic_site_id" FOREIGN KEY
(site_id)REFERENCES logistic_site(id) DEFERRABLE INITIALLY DEFERRED
 
    TABLE "scientific_identification" CONSTRAINT "scientific_identification_site_id_d9e79149_fk_logistic_site_id"
FOREIGNKEY (site_id) REFERENCES logistic_site(id) DEFERRABLE INITIALLY DEFERRED
 
    TABLE "scientific_inventory" CONSTRAINT "scientific_inventory_site_id_72521353_fk_logistic_site_id" FOREIGN KEY
(site_id)REFERENCES logistic_site(id) DEFERRABLE INITIALLY DEFERRED
 
    TABLE "scientific_result" CONSTRAINT "scientific_result_site_id_af6c815d_fk_logistic_site_id" FOREIGN KEY (site_id)
REFERENCESlogistic_site(id) DEFERRABLE INITIALLY DEFERRED
 
    TABLE "scientific_selection" CONSTRAINT "scientific_selection_site_id_88d69cab_fk_logistic_site_id" FOREIGN KEY
(site_id)REFERENCES logistic_site(id) DEFERRABLE INITIALLY DEFERRED
 

And the output of the related query:

SELECT
    attname, atttypid::regtype, attnum,atthasdef, atthasmissing, attgenerated, attisdropped
FROM
    pg_attribute 
WHERE
    attrelid = 'logistic_site'::regclass
ORDER BY
    attnum;

   attname   |         atttypid         | attnum | atthasdef | atthasmissing | attgenerated | attisdropped
-------------+--------------------------+--------+-----------+---------------+--------------+--------------
 tableoid    | oid                      |     -6 | f         | f             |              | f
 cmax        | cid                      |     -5 | f         | f             |              | f
 xmax        | xid                      |     -4 | f         | f             |              | f
 cmin        | cid                      |     -3 | f         | f             |              | f
 xmin        | xid                      |     -2 | f         | f             |              | f
 ctid        | tid                      |     -1 | f         | f             |              | f
 id          | bigint                   |      1 | t         | f             |              | f
 key         | character varying        |      2 | f         | f             |              | f
 name        | character varying        |      3 | f         | f             |              | f
 created     | timestamp with time zone |      4 | f         | f             |              | f
 updated     | timestamp with time zone |      5 | f         | f             |              | f
 archived    | timestamp with time zone |      6 | f         | f             |              | f
 geom        | geometry                 |      7 | f         | f             |              | f
 location    | geometry                 |      8 | f         | f             |              | f
 notes       | text                     |      9 | f         | f             |              | f
 country_id  | bigint                   |     10 | f         | f             |              | f
 customer_id | bigint                   |     11 | f         | f             |              | f
(17 lignes)

Additional information:
When trying to read the SQL related query for this table in PgAdmin4 I also have the error message popping up and the I
getno SQL. So maybe the problem resides in a deeper function the Query Planner and SQL generator functions rely on. 
 

Don't hesitate to ask for more information.

Best regards,

Jean

On Tue, 7 Jun 2022 at 19:58, Jean Landercy - BEEODIVERSITY
<jean.landercy@beeodiversity.com> wrote:
> Here is the detail of the table (I have anonymized it on SO, this is its real name):

>     "logistic_site_location_54ae0166_id" gist (location)

I imagine this is due to the planner choosing an index-only scan on
the above index. A similar problem was reported in [1].

A fix was committed in [2], which appears in 13.7.

You could turn off enable_indexonlyscan until 13.7 is available to you.

David

[1] https://www.postgresql.org/message-id/CAHUie24ddN+pDNw7fkhNrjrwAX=fXXfGZZEHhRuofV_N_ftaSg@mail.gmail.com
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=0778b24ced8a873b432641001d046d1dde602466



David Rowley <dgrowleyml@gmail.com> writes:
> On Tue, 7 Jun 2022 at 19:58, Jean Landercy - BEEODIVERSITY
> <jean.landercy@beeodiversity.com> wrote:
>> Here is the detail of the table (I have anonymized it on SO, this is its real name):
>> "logistic_site_location_54ae0166_id" gist (location)
> I imagine this is due to the planner choosing an index-only scan on
> the above index. A similar problem was reported in [1].

The other gist index could also be the problem.  It seems odd though
that the planner would favor either index for this purpose over the btree
indexes on scalar columns, which you'd think would be a lot smaller.
I wonder if there is some quirk in gist cost estimation that makes it
improperly claim to be cheaper than btree scans.

            regards, tom lane



On Wed, 8 Jun 2022 at 07:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > On Tue, 7 Jun 2022 at 19:58, Jean Landercy - BEEODIVERSITY
> > <jean.landercy@beeodiversity.com> wrote:
> >> Here is the detail of the table (I have anonymized it on SO, this is its real name):
> >> "logistic_site_location_54ae0166_id" gist (location)
> > I imagine this is due to the planner choosing an index-only scan on
> > the above index. A similar problem was reported in [1].
>
> The other gist index could also be the problem.  It seems odd though
> that the planner would favor either index for this purpose over the btree
> indexes on scalar columns, which you'd think would be a lot smaller.
> I wonder if there is some quirk in gist cost estimation that makes it
> improperly claim to be cheaper than btree scans.

I installed PostGIS 3.1.1 and mocked this up with the attached.

Looking at the plans, I see:

# explain select count(*) from logistic_site;
                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Aggregate  (cost=20.18..20.19 rows=1 width=8)
   ->  Bitmap Heap Scan on logistic_site  (cost=5.92..19.32 rows=340 width=0)
         ->  Bitmap Index Scan on logistic_site_location_54ae0166_id
(cost=0.00..5.84 rows=340 width=0)
(3 rows)

# drop index logistic_site_location_54ae0166_id;
# explain select count(*) from logistic_site;
                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Aggregate  (cost=9.92..9.93 rows=1 width=8)
   ->  Bitmap Heap Scan on logistic_site  (cost=5.26..9.39 rows=213 width=0)
         ->  Bitmap Index Scan on logistic_site_geom_105a08da_id
(cost=0.00..5.20 rows=213 width=0)
(3 rows)

# drop index logistic_site_geom_105a08da_id;
# explain select count(*) from logistic_site;
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Aggregate  (cost=13.93..13.94 rows=1 width=8)
   ->  Bitmap Heap Scan on logistic_site  (cost=9.26..13.39 rows=213 width=0)
         ->  Bitmap Index Scan on logistic_site_key_2e791173_like
(cost=0.00..9.21 rows=213 width=0)
(3 rows)

So it does appear that the location index is being chosen, at least
with the data that I inserted. Those gist indexes are costing quite a
bit cheaper than the cheapest btree index.

David

Вложения
On Wed, 8 Jun 2022 at 08:31, David Rowley <dgrowleyml@gmail.com> wrote:
> So it does appear that the location index is being chosen, at least
> with the data that I inserted. Those gist indexes are costing quite a
> bit cheaper than the cheapest btree index.

This seems just to be because the gist indexes are smaller, which is
likely due to me having inserted NULL values into them.

postgres=# select pg_relation_size('logistic_site_key_key');
 pg_relation_size
------------------
            16384
(1 row)


postgres=# select pg_relation_size('logistic_site_location_54ae0166_id');
 pg_relation_size
------------------
             8192
(1 row)

David



David Rowley <dgrowleyml@gmail.com> writes:
> On Wed, 8 Jun 2022 at 07:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I wonder if there is some quirk in gist cost estimation that makes it
>> improperly claim to be cheaper than btree scans.

> I installed PostGIS 3.1.1 and mocked this up with the attached.

> Looking at the plans, I see:

> # explain select count(*) from logistic_site;
>                                                QUERY PLAN
> ---------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=20.18..20.19 rows=1 width=8)
>    ->  Bitmap Heap Scan on logistic_site  (cost=5.92..19.32 rows=340 width=0)
>          ->  Bitmap Index Scan on logistic_site_location_54ae0166_id
> (cost=0.00..5.84 rows=340 width=0)
> (3 rows)

> # drop index logistic_site_location_54ae0166_id;
> # explain select count(*) from logistic_site;
>                                              QUERY PLAN
> -----------------------------------------------------------------------------------------------------
>  Aggregate  (cost=9.92..9.93 rows=1 width=8)
>    ->  Bitmap Heap Scan on logistic_site  (cost=5.26..9.39 rows=213 width=0)
>          ->  Bitmap Index Scan on logistic_site_geom_105a08da_id
> (cost=0.00..5.20 rows=213 width=0)
> (3 rows)

That ... is pretty quirky already.  How did it prefer a scan with cost
19.32 over one with cost 9.39?  Seems like we've got a bug here somewhere.
The change in estimated rowcount is rather broken, too.

> So it does appear that the location index is being chosen, at least
> with the data that I inserted. Those gist indexes are costing quite a
> bit cheaper than the cheapest btree index.

It looks like the data you inserted for the geometry columns was uniformly
NULL, which perhaps would result in a very small gist index.  So maybe
for this test data the choice isn't so odd.  Seems unlikely that that'd
be true of the OP's production data, though.

            regards, tom lane



I wrote:
> That ... is pretty quirky already.  How did it prefer a scan with cost
> 19.32 over one with cost 9.39?  Seems like we've got a bug here somewhere.
> The change in estimated rowcount is rather broken, too.

Ah, false alarm.  I can reproduce your results if I stick an ANALYZE
between the first and second EXPLAIN.  So probably your change in
estimated rowcount and hence cost can be explained by an auto-analyze
coming along at just the right time.

Also, if I fill the geom and location columns with non-null data,
the planner stops preferring those indexes.

So now I'm guessing that the OP's data *was* mostly null, and the
planner preferred the gist indexes because they were smallest,
and then tripped over the nonreturnable-column bug.

            regards, tom lane