Обсуждение: 'CLUSTER' in one database prevents running it in two others on the same database cluster (PG15.2)

Поиск
Список
Период
Сортировка

'CLUSTER' in one database prevents running it in two others on the same database cluster (PG15.2)

От
Marco Boeringa
Дата:
Hi all,

I am running into a weird issue I haven't encountered before.

This is PostgreSQL 15.2 (Ubuntu 15.2-1.pgdg22.04+1) running on Ubuntu 
22.04.2 LTS, with PostGIS (POSTGIS="3.3.2 4975da8" [EXTENSION] 
PGSQL="150" GEOS="3.10.2-CAPI-1.16.0" PROJ="8.2.1" LIBXML="2.9.13" 
LIBJSON="0.15" LIBPROTOBUF="1.3.3" WAGYU="0.5.0 (Internal)").

Now the PostgreSQL database cluster contains multiple databases. In one 
of these, an OpenStreetMap 'Planet' size database, a CLUSTER operation 
is running against a very large table of buildings (> 400M records). 
This runs fine, and I can see the actual progress in pgAdmin by 
inspecting the 'pg_stat_progress_cluster' view. So far everything seems OK.

However, as part of two secondary Python scripts running that do all 
kinds of batch operations against spatial tables (generalization and 
such) in two other databases on the same database cluster, at the very 
end of the processing, these initialize 'CLUSTER' operations as well. 
Both of these now fail with a:

"missing chunk number 0 for toast value X in pg_toast_Y"

type error.

After these errors, I see the CLUSTER command listed as well in the 
PostgreSQL main log, but the ERROR line mentioned above is listed just 
above the actual CLUSTER statement in the log. No other anomaly / error 
is visible there. The CLUSTER command just fails though on the two 
secondary databases. The time displayed in the log is exactly the same 
for these two log lines one after another, also the number between the 
square brackets listed immediately after the "CET" of the time are the 
same for the same database. The pg_toast_Y table names mentioned in the 
errors differ by the way between the two databases.

If I inspect the table that is mentioned in the CLUSTER command in the 
log, the very first one that is going to be CLUSTERed in the secondary 
databases, the table can be displayed fine in DBeaver, even spatially. 
Checking the geometries with PostGIS's 'ST_IsValid' also shows them to 
be fine, no errors in the geometries.

Everything else seems OK, database cluster is still up&running, all 
tables in all databases can be viewed and accessed in DBeaver and pgAdmin.

In all other rare cases where I have seen the "missing chunk number 0 
for toast value X in pg_toast_Y" type error, the generalization 
processing already failed somewhere half-way, never at the end when 
CLUSTER is initiated. The error therefore seems weird to occur at that 
point.

I have now been able to reproduce this twice, with two tiny database's 
filled with just Liechtenstein data, while the Planet database is still 
in the stage of CLUSTERing the huge building table.

Obviously, it will be interesting to see what happens if the CLUSTER 
operation on the Planet database is finished, but this will still take 
many hours, especially since other large tables need to be CLUSTERed in 
the same database, and I do not want to break off this process right now 
after already gotten this far.

I will report back once it is finished, but does anyone have a clue why 
this might happen?

The system has plenty of free space by the way (4x2TB NVMe), and the 
database runs on a professional workstation with ECC RAM, and used RAM 
also doesn't seem an issue.

Marco




Well, I can now confirm that if the CLUSTER operation on the 'Planet' size database is finished, the CLUSTER on the secondary databases runs fine, no "missing chunk number 0" error.

I wonder if there is a relation with the currently discussed issue here, as I see in the PostgreSQL documentation, that CLUSTER also requires an ACCESS EXCLUSIVE lock, that is also referred in this thread?:

https://www.postgresql.org/message-id/flat/17812-206eaec1d133a94a%40postgresql.org

Is my issue possibly the "concrete use case", that David J. is referring to?:

"I agree this does seem like a poor risk/reward on the fixing side,
especially absent a concrete live use case problem. I am curious what led
to this discovery."

Marco


-------- Doorgestuurd bericht --------
Onderwerp: 'CLUSTER' in one database prevents running it in two others on the same database cluster (PG15.2)
Datum: Sun, 5 Mar 2023 22:38:40 +0100
Van: Marco Boeringa <marco@boeringa.demon.nl>
Aan: pgsql-bugs@lists.postgresql.org


Hi all,

I am running into a weird issue I haven't encountered before.

This is PostgreSQL 15.2 (Ubuntu 15.2-1.pgdg22.04+1) running on Ubuntu 22.04.2 LTS, with PostGIS (POSTGIS="3.3.2 4975da8" [EXTENSION] PGSQL="150" GEOS="3.10.2-CAPI-1.16.0" PROJ="8.2.1" LIBXML="2.9.13" LIBJSON="0.15" LIBPROTOBUF="1.3.3" WAGYU="0.5.0 (Internal)").

Now the PostgreSQL database cluster contains multiple databases. In one of these, an OpenStreetMap 'Planet' size database, a CLUSTER operation is running against a very large table of buildings (> 400M records). This runs fine, and I can see the actual progress in pgAdmin by inspecting the 'pg_stat_progress_cluster' view. So far everything seems OK.

However, as part of two secondary Python scripts running that do all kinds of batch operations against spatial tables (generalization and such) in two other databases on the same database cluster, at the very end of the processing, these initialize 'CLUSTER' operations as well. Both of these now fail with a:

"missing chunk number 0 for toast value X in pg_toast_Y"

type error.

After these errors, I see the CLUSTER command listed as well in the PostgreSQL main log, but the ERROR line mentioned above is listed just above the actual CLUSTER statement in the log. No other anomaly / error is visible there. The CLUSTER command just fails though on the two secondary databases. The time displayed in the log is exactly the same for these two log lines one after another, also the number between the square brackets listed immediately after the "CET" of the time are the same for the same database. The pg_toast_Y table names mentioned in the errors differ by the way between the two databases.

If I inspect the table that is mentioned in the CLUSTER command in the log, the very first one that is going to be CLUSTERed in the secondary databases, the table can be displayed fine in DBeaver, even spatially. Checking the geometries with PostGIS's 'ST_IsValid' also shows them to be fine, no errors in the geometries.

Everything else seems OK, database cluster is still up&running, all tables in all databases can be viewed and accessed in DBeaver and pgAdmin.

In all other rare cases where I have seen the "missing chunk number 0 for toast value X in pg_toast_Y" type error, the generalization processing already failed somewhere half-way, never at the end when CLUSTER is initiated. The error therefore seems weird to occur at that point.

I have now been able to reproduce this twice, with two tiny database's filled with just Liechtenstein data, while the Planet database is still in the stage of CLUSTERing the huge building table.

Obviously, it will be interesting to see what happens if the CLUSTER operation on the Planet database is finished, but this will still take many hours, especially since other large tables need to be CLUSTERed in the same database, and I do not want to break off this process right now after already gotten this far.

I will report back once it is finished, but does anyone have a clue why this might happen?

The system has plenty of free space by the way (4x2TB NVMe), and the database runs on a professional workstation with ECC RAM, and used RAM also doesn't seem an issue.

Marco

On 2023-Mar-05, Marco Boeringa wrote:

> However, as part of two secondary Python scripts running that do all kinds
> of batch operations against spatial tables (generalization and such) in two
> other databases on the same database cluster, at the very end of the
> processing, these initialize 'CLUSTER' operations as well. Both of these now
> fail with a:
>     "missing chunk number 0 for toast value X in pg_toast_Y"
> type error.

Hmm.  Very strange ...

> After these errors, I see the CLUSTER command listed as well in the
> PostgreSQL main log, but the ERROR line mentioned above is listed just above
> the actual CLUSTER statement in the log. No other anomaly / error is visible
> there. The CLUSTER command just fails though on the two secondary databases.
> The time displayed in the log is exactly the same for these two log lines
> one after another, also the number between the square brackets listed
> immediately after the "CET" of the time are the same for the same database.

It would make things much more understandable if you just pasted the log
lines, instead of describing details of those log lines.  In you
description you may omit clues that may seem trivial/unimportant to you
but indicate useful things to us.

One thing that I'm curious about is what the various values of 'Y' are
in these lines:

> The pg_toast_Y table names mentioned in the errors differ by the way between
> the two databases.

It would be useful if you could look up to what heap-table each of these
TOAST tables correspond, with a query like
   SELECT oid::regclass FROM pg_class WHERE reltoastrelid = 'pg_toast.pg_toast_Y'::regclass;
(ran in the database that each error was thrown from.)

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Learn about compilers. Then everything looks like either a compiler or
a database, and now you have two problems but one of them is fun."
            https://twitter.com/thingskatedid/status/1456027786158776329