Обсуждение: Sudden database error with COUNT(*) making Query Planner crashes: variable not found in subplan target list
Sudden database error with COUNT(*) making Query Planner crashes: variable not found in subplan target list
Dear PostgreSQL Hacker Community,
I am facing a tricky bug which makes the Query Planner crashes when using COUNT(*) function.
Without any upgrade suddenly a table of a database instance could not be queried this way:
SELECT COUNT(*) FROM items;
-- ERROR: variable not found in subplan target list
-- SQL state: XX000
Message and behaviour seem related to the Query Planner:
EXPLAIN
SELECT
COUNT(
*)
FROM item;
-- ERROR: variable not found in subplan target list
-- SQL state: XX000
Looks like a column name could not be found (see https://github.com/postgres/postgres/blob/ce4f46fdc814eb1b704d81640f6d8f03625d0f53/src/backend/optimizer/plan/setrefs.c#L2967-L2972) in some specific context that is somehow hard to reproduce.
Interesting facts:
SELECT
COUNT(id)
FROM items;
-- 213
SELECT
COUNT(
*)
FROM items
WHERE id
>
0;
-- 213
Work as expected.
I can see that other people are recently facing a similar problem (https://www.postgresql.org/message-id/flat/4c347490-d734-5fdd-d613-1327601b4e7e%40mit.edu).
If it is the same bug then it is not related to the PGroonga extension as I don’t use it all.
Anyway, the bug is difficult to reproduce on my application.
At the time of writing, I could just isolate it on a specific database but I could not draw a MCVE from it.
I am looking for help to make it reproducible and feed your knowledge database.
My first guess was to open a post of SO (see for details https://stackoverflow.com/questions/72498741/how-can-i-reproduce-a-database-context-to-debug-a-tricky-postgresql-error-vari), but digging deeper in the investigation it seems it will require people with strong insights on how PostgreSQL actually works under the hood.
Therefore, I chose this specific mailing list.
The bug is tricky to reproduce, I could not succeed to replicate elsewhere (dump/restore does not preserve it).
Anyway it makes my database unusable and looks like a potential bug for your product and applications relying on it.
Faulty setup is about:
SELECT version();
-- PostgreSQL 13.6 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
SELECT extname, extversion
FROM pg_extension;
-- "plpgsql" "1.0"
-- "postgis" "3.1.1"
By now, the only workarounds I have found are:
- Dump database and recreate a new instance (problem seems to vanish but there is no guarantee it is solved or it will not happened later on);
- Add dummy filter on all queries (more a trick than a solution).
I am writing to this mailing list to raise you attention on it.
I’ll be happy to help you investigate it deeper.
Best regards,
Landercy Jean
On Mon, Jun 06, 2022 at 09:34:24AM +0000, Jean Landercy - BEEODIVERSITY wrote: > Faulty setup is about: > > SELECT version(); > > -- PostgreSQL 13.6 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit Please check if the problem occurs in v13.7 https://www.postgresql.org/message-id/2197859.1644623850@sss.pgh.pa.us https://www.postgresql.org/message-id/flat/2121219.1644607692%40sss.pgh.pa.us#190c43702a91dbd0509ba545dbbab58d
RE: Sudden database error with COUNT(*) making Query Planner crashes: variable not found in subplan target list
Dear Justin, Thank you for your quick reply. Unfortunately, the server having this issue is an Azure Flexible Server. Upgrades are managed by Azure, I will have to wait until they release the version 13.7. Is there a procedure to replicate the database and preserve the bug. My attempts with pg_dump/psql failed (the bug vanishes). If so then I can clone the faulty database and upgrade it on a newer version. Best regards, Landercy Jean
On Mon, Jun 06, 2022 at 04:50:55PM +0000, Jean Landercy - BEEODIVERSITY wrote: > Dear Justin, > > Thank you for your quick reply. > Unfortunately, the server having this issue is an Azure Flexible Server. > Upgrades are managed by Azure, I will have to wait until they release the version 13.7. I don't know what to suggest other than to open a support ticket with the vendor. -- Justin
On Mon, 6 Jun 2022 at 21:34, Jean Landercy - BEEODIVERSITY <jean.landercy@beeodiversity.com> wrote: > SELECT COUNT(*) FROM items; > -- ERROR: variable not found in subplan target list > -- SQL state: XX000 Can you share some more details about what "items" is. psql's "\d items" output would be useful. From what you've reported we can't tell if this is a table or a view. > The bug is tricky to reproduce, I could not succeed to replicate elsewhere (dump/restore does not preserve it). Can you share the output of: select attname,atttypid::regtype,attnum,atthasdef,atthasmissing,attgenerated,attisdropped from pg_attribute where attrelid = 'items'::regclass order by attnum; This will let us see if there's something strange going on with dropped or has missing columns. There may be some sequence of ALTER TABLE ADD COLUMN ... DEFAULT / DROP COLUMN that is causing this. The output of that might help us see if that could be a factor. David
RE: Sudden database error with COUNT(*) making Query Planner crashes: variable not found in subplan target list
Dear David, Thank you for taking time on this issue. Here is the detail of the table (I have anonymized it on SO, this is its real name): \d logistic_site Table « public.logistic_site » Colonne | Type | Collationnement | NULL-able | Par défaut -------------+--------------------------+-----------------+-----------+------------------------------------------- id | bigint | | not null | nextval('logistic_site_id_seq'::regclass) key | character varying(32) | | not null | name | character varying(128) | | | created | timestamp with time zone | | not null | updated | timestamp with time zone | | not null | archived | timestamp with time zone | | | geom | geometry(Polygon,4326) | | | location | geometry(Point,4326) | | | notes | text | | | country_id | bigint | | | customer_id | bigint | | | Index : "logistic_site_pkey" PRIMARY KEY, btree (id) "logistic_site_country_id_9a696481" btree (country_id) "logistic_site_customer_id_a2c8a74a" btree (customer_id) "logistic_site_geom_105a08da_id" gist (geom) "logistic_site_key_2e791173_like" btree (key varchar_pattern_ops) "logistic_site_key_key" UNIQUE CONSTRAINT, btree (key) "logistic_site_location_54ae0166_id" gist (location) Contraintes de clés étrangères : "logistic_site_country_id_9a696481_fk_logistic_country_id" FOREIGN KEY (country_id) REFERENCES logistic_country(id) DEFERRABLEINITIALLY DEFERRED "logistic_site_customer_id_a2c8a74a_fk_logistic_customer_id" FOREIGN KEY (customer_id) REFERENCES logistic_customer(id)DEFERRABLE INITIALLY DEFERRED Référencé par : TABLE "logistic_hive" CONSTRAINT "logistic_hive_site_id_50c29dd8_fk_logistic_site_id" FOREIGN KEY (site_id) REFERENCESlogistic_site(id) DEFERRABLE INITIALLY DEFERRED TABLE "logistic_packorder" CONSTRAINT "logistic_packorder_site_id_16e1a41a_fk_logistic_site_id" FOREIGN KEY (site_id)REFERENCES logistic_site(id) DEFERRABLE INITIALLY DEFERRED TABLE "logistic_projectsite" CONSTRAINT "logistic_projectsite_site_id_522bf74b_fk_logistic_site_id" FOREIGN KEY (site_id)REFERENCES logistic_site(id) DEFERRABLE INITIALLY DEFERRED TABLE "scientific_identification" CONSTRAINT "scientific_identification_site_id_d9e79149_fk_logistic_site_id" FOREIGNKEY (site_id) REFERENCES logistic_site(id) DEFERRABLE INITIALLY DEFERRED TABLE "scientific_inventory" CONSTRAINT "scientific_inventory_site_id_72521353_fk_logistic_site_id" FOREIGN KEY (site_id)REFERENCES logistic_site(id) DEFERRABLE INITIALLY DEFERRED TABLE "scientific_result" CONSTRAINT "scientific_result_site_id_af6c815d_fk_logistic_site_id" FOREIGN KEY (site_id) REFERENCESlogistic_site(id) DEFERRABLE INITIALLY DEFERRED TABLE "scientific_selection" CONSTRAINT "scientific_selection_site_id_88d69cab_fk_logistic_site_id" FOREIGN KEY (site_id)REFERENCES logistic_site(id) DEFERRABLE INITIALLY DEFERRED And the output of the related query: SELECT attname, atttypid::regtype, attnum,atthasdef, atthasmissing, attgenerated, attisdropped FROM pg_attribute WHERE attrelid = 'logistic_site'::regclass ORDER BY attnum; attname | atttypid | attnum | atthasdef | atthasmissing | attgenerated | attisdropped -------------+--------------------------+--------+-----------+---------------+--------------+-------------- tableoid | oid | -6 | f | f | | f cmax | cid | -5 | f | f | | f xmax | xid | -4 | f | f | | f cmin | cid | -3 | f | f | | f xmin | xid | -2 | f | f | | f ctid | tid | -1 | f | f | | f id | bigint | 1 | t | f | | f key | character varying | 2 | f | f | | f name | character varying | 3 | f | f | | f created | timestamp with time zone | 4 | f | f | | f updated | timestamp with time zone | 5 | f | f | | f archived | timestamp with time zone | 6 | f | f | | f geom | geometry | 7 | f | f | | f location | geometry | 8 | f | f | | f notes | text | 9 | f | f | | f country_id | bigint | 10 | f | f | | f customer_id | bigint | 11 | f | f | | f (17 lignes) Additional information: When trying to read the SQL related query for this table in PgAdmin4 I also have the error message popping up and the I getno SQL. So maybe the problem resides in a deeper function the Query Planner and SQL generator functions rely on. Don't hesitate to ask for more information. Best regards, Jean
On Tue, 7 Jun 2022 at 19:58, Jean Landercy - BEEODIVERSITY <jean.landercy@beeodiversity.com> wrote: > Here is the detail of the table (I have anonymized it on SO, this is its real name): > "logistic_site_location_54ae0166_id" gist (location) I imagine this is due to the planner choosing an index-only scan on the above index. A similar problem was reported in [1]. A fix was committed in [2], which appears in 13.7. You could turn off enable_indexonlyscan until 13.7 is available to you. David [1] https://www.postgresql.org/message-id/CAHUie24ddN+pDNw7fkhNrjrwAX=fXXfGZZEHhRuofV_N_ftaSg@mail.gmail.com [2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=0778b24ced8a873b432641001d046d1dde602466
David Rowley <dgrowleyml@gmail.com> writes: > On Tue, 7 Jun 2022 at 19:58, Jean Landercy - BEEODIVERSITY > <jean.landercy@beeodiversity.com> wrote: >> Here is the detail of the table (I have anonymized it on SO, this is its real name): >> "logistic_site_location_54ae0166_id" gist (location) > I imagine this is due to the planner choosing an index-only scan on > the above index. A similar problem was reported in [1]. The other gist index could also be the problem. It seems odd though that the planner would favor either index for this purpose over the btree indexes on scalar columns, which you'd think would be a lot smaller. I wonder if there is some quirk in gist cost estimation that makes it improperly claim to be cheaper than btree scans. regards, tom lane
On Wed, 8 Jun 2022 at 07:55, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > David Rowley <dgrowleyml@gmail.com> writes: > > On Tue, 7 Jun 2022 at 19:58, Jean Landercy - BEEODIVERSITY > > <jean.landercy@beeodiversity.com> wrote: > >> Here is the detail of the table (I have anonymized it on SO, this is its real name): > >> "logistic_site_location_54ae0166_id" gist (location) > > I imagine this is due to the planner choosing an index-only scan on > > the above index. A similar problem was reported in [1]. > > The other gist index could also be the problem. It seems odd though > that the planner would favor either index for this purpose over the btree > indexes on scalar columns, which you'd think would be a lot smaller. > I wonder if there is some quirk in gist cost estimation that makes it > improperly claim to be cheaper than btree scans. I installed PostGIS 3.1.1 and mocked this up with the attached. Looking at the plans, I see: # explain select count(*) from logistic_site; QUERY PLAN --------------------------------------------------------------------------------------------------------- Aggregate (cost=20.18..20.19 rows=1 width=8) -> Bitmap Heap Scan on logistic_site (cost=5.92..19.32 rows=340 width=0) -> Bitmap Index Scan on logistic_site_location_54ae0166_id (cost=0.00..5.84 rows=340 width=0) (3 rows) # drop index logistic_site_location_54ae0166_id; # explain select count(*) from logistic_site; QUERY PLAN ----------------------------------------------------------------------------------------------------- Aggregate (cost=9.92..9.93 rows=1 width=8) -> Bitmap Heap Scan on logistic_site (cost=5.26..9.39 rows=213 width=0) -> Bitmap Index Scan on logistic_site_geom_105a08da_id (cost=0.00..5.20 rows=213 width=0) (3 rows) # drop index logistic_site_geom_105a08da_id; # explain select count(*) from logistic_site; QUERY PLAN ------------------------------------------------------------------------------------------------------ Aggregate (cost=13.93..13.94 rows=1 width=8) -> Bitmap Heap Scan on logistic_site (cost=9.26..13.39 rows=213 width=0) -> Bitmap Index Scan on logistic_site_key_2e791173_like (cost=0.00..9.21 rows=213 width=0) (3 rows) So it does appear that the location index is being chosen, at least with the data that I inserted. Those gist indexes are costing quite a bit cheaper than the cheapest btree index. David
Вложения
On Wed, 8 Jun 2022 at 08:31, David Rowley <dgrowleyml@gmail.com> wrote: > So it does appear that the location index is being chosen, at least > with the data that I inserted. Those gist indexes are costing quite a > bit cheaper than the cheapest btree index. This seems just to be because the gist indexes are smaller, which is likely due to me having inserted NULL values into them. postgres=# select pg_relation_size('logistic_site_key_key'); pg_relation_size ------------------ 16384 (1 row) postgres=# select pg_relation_size('logistic_site_location_54ae0166_id'); pg_relation_size ------------------ 8192 (1 row) David
David Rowley <dgrowleyml@gmail.com> writes: > On Wed, 8 Jun 2022 at 07:55, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I wonder if there is some quirk in gist cost estimation that makes it >> improperly claim to be cheaper than btree scans. > I installed PostGIS 3.1.1 and mocked this up with the attached. > Looking at the plans, I see: > # explain select count(*) from logistic_site; > QUERY PLAN > --------------------------------------------------------------------------------------------------------- > Aggregate (cost=20.18..20.19 rows=1 width=8) > -> Bitmap Heap Scan on logistic_site (cost=5.92..19.32 rows=340 width=0) > -> Bitmap Index Scan on logistic_site_location_54ae0166_id > (cost=0.00..5.84 rows=340 width=0) > (3 rows) > # drop index logistic_site_location_54ae0166_id; > # explain select count(*) from logistic_site; > QUERY PLAN > ----------------------------------------------------------------------------------------------------- > Aggregate (cost=9.92..9.93 rows=1 width=8) > -> Bitmap Heap Scan on logistic_site (cost=5.26..9.39 rows=213 width=0) > -> Bitmap Index Scan on logistic_site_geom_105a08da_id > (cost=0.00..5.20 rows=213 width=0) > (3 rows) That ... is pretty quirky already. How did it prefer a scan with cost 19.32 over one with cost 9.39? Seems like we've got a bug here somewhere. The change in estimated rowcount is rather broken, too. > So it does appear that the location index is being chosen, at least > with the data that I inserted. Those gist indexes are costing quite a > bit cheaper than the cheapest btree index. It looks like the data you inserted for the geometry columns was uniformly NULL, which perhaps would result in a very small gist index. So maybe for this test data the choice isn't so odd. Seems unlikely that that'd be true of the OP's production data, though. regards, tom lane
I wrote: > That ... is pretty quirky already. How did it prefer a scan with cost > 19.32 over one with cost 9.39? Seems like we've got a bug here somewhere. > The change in estimated rowcount is rather broken, too. Ah, false alarm. I can reproduce your results if I stick an ANALYZE between the first and second EXPLAIN. So probably your change in estimated rowcount and hence cost can be explained by an auto-analyze coming along at just the right time. Also, if I fill the geom and location columns with non-null data, the planner stops preferring those indexes. So now I'm guessing that the OP's data *was* mostly null, and the planner preferred the gist indexes because they were smallest, and then tripped over the nonreturnable-column bug. regards, tom lane