Discussion: Potential "AIO / io workers" inter-worker locking issue in PG18?
Hi,

I currently run PG18 + PostGIS 3.6.0 on an Ubuntu 24.04 guest running as a Windows 10 Hyper-V virtual machine.

The machine is a dedicated refurbished HP Z840 workstation with 2x 22 cores (E5-2699 v4), 512 GB RAM and a 10 TB NVMe RAID-0, with the Ubuntu guest having 400 GB RAM available.

On this machine, which is dedicated to a single custom-written geoprocessing workflow involving OpenStreetMap data, I have successfully processed data up to the global OpenStreetMap Facebook Daylight distribution, including a > 2.4 billion record polygon table for all Facebook Daylight buildings. So this has proven a very capable system.

However, after upgrading to PG18 and switching to the "io_method = worker" setting (tested with 3, 5, 16 and 22 workers), I am seeing what appears to be a major issue where io workers potentially get into some sort of locking conflict that takes hours to resolve.

The custom-written geoprocessing workflow uses Python multi-threading based on the 'concurrent.futures' framework, with either pyodbc or psycopg2 as the database connector, to implement a parallel processing solution that speeds up some of the computationally intensive tasks (which use UPDATEs). I generally run it with up to 44 threads to fully saturate the dual-CPU, 44-core system. The custom code creates a pool of jobs for the threads, and is designed to minimize inter-thread locking issues by taking PostgreSQL page locality into account (although the records to process are assigned by unique IDs in the tables, not by pages). Basically, the code is designed so that different threads never attempt to access the same database pages, as each thread gets its own unique pages assigned, thus avoiding inter-thread locking conflicts. This has worked really well in the past, with system usage maximized over all cores and processing significantly sped up. Jobs are implemented as database VIEWs that point to the records to process via their unique IDs. These views must of course be read by each thread, which is probably where the PG18 io workers kick in.

This has worked really well in previous versions of PostgreSQL (tested up to PG17). However, in PG18, during the multi-threaded processing, some of my submitted jobs, which in this case were run against a small OpenStreetMap Italy extract from Geofabrik, suddenly take > 1 hour to finish (up to 6 hours for this small extract), even though similar jobs from the same processing step finish in less than 10 seconds (and the affected jobs should as well). This seems to happen more or less at random. Many multi-threaded tasks before and after the affected processing steps finish normally.

When this happens, I observe the following:

- High processor activity: the jobs that should finish in seconds take hours, all the while showing high core usage.

- pgAdmin shows all sessions created by the Python threads as 'active', with *no* wait events attached.

- The pg_locks view does not show locking conflicts; all locks are granted. I did notice, however, that the relation / table locks were not "fastpath" locks but ordinary ones. All other locks taken, e.g. on indexes of the same table, were fastpath. I don't know if this has any relevance, as from what I have read about the difference, it shouldn't cause a gap this big, not seconds versus hours.

- Please note that the processing DOES eventually proceed, so it is not an infinite deadlock or something where I need to kill my Python code. It just takes hours to resolve.

- Switching to "io_method = sync" seems to resolve the issue, and I do not observe jobs of the same batch getting "stuck". This is the behavior I was used to seeing in <= PG17.

I am not too familiar with the internals of PostgreSQL and the new AIO framework and its "io workers", but it seems there may be some sort of locking issue between io workers that can occasionally happen in PG18 with "io_method = worker"? Is anyone else observing similar issues in heavily multi-threaded processing workflows?

Marco
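A minimal way to capture the "active but no wait event" observation together with the fastpath detail described above is a query along these lines (a sketch only; it assumes the standard pg_stat_activity and pg_locks system views and no particular filtering on database or application):

    -- Sketch: active backends, their wait state, and the locks they hold.
    -- Run while the jobs appear stuck; adjust filters as needed.
    SELECT a.pid,
           a.state,
           a.wait_event_type,
           a.wait_event,
           now() - a.query_start AS running_for,
           l.locktype,
           l.relation::regclass AS relation,
           l.mode,
           l.fastpath,
           l.granted
    FROM pg_stat_activity AS a
    JOIN pg_locks AS l ON l.pid = a.pid
    WHERE a.state = 'active'
    ORDER BY running_for DESC, a.pid, l.locktype;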
I am not a PostgreSQL contributor and have no clue about the actual technical details of the new AIO code, but reading your report the following questions came to my mind:

* Does the failure also happen with io_method = io_uring? If not, that would indicate it is really bound to io_method = worker, not to AIO in general.

* Does the failure also happen with io_method = worker when your Python code uses only 22 cores and PostgreSQL uses only 22 workers (so Python and PostgreSQL do not share CPU cores)? If not, it might indicate that the problem could be solved by adjusting the scheduling policy in favor of PostgreSQL, hinting to the scheduler that a CPU core should be given to PostgreSQL FIRST, as Python is most likely waiting on it to continue while PostgreSQL cannot continue because the scheduler gave all the cores to Python... (a classic starvation situation; it eventually resolves once enough CPU cores are free to finish the starving thread).

HTH

-Markus

Am 05.10.2025 um 10:55 schrieb Marco Boeringa:
> [...]
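For anyone who wants to experiment with the settings discussed here, the switch between AIO methods is a plain configuration change; a sketch (the values are examples, io_method only takes effect after a server restart, and 'io_uring' is only available if the server binaries were built with liburing):

    -- Sketch: trying the different PG18 AIO settings (example values).
    ALTER SYSTEM SET io_method = 'worker';   -- or 'io_uring' / 'sync'; restart required
    ALTER SYSTEM SET io_workers = 16;        -- only relevant for io_method = 'worker'
    -- After restarting the cluster (e.g. pg_ctl restart or systemctl), verify with:
    SHOW io_method;
    SHOW io_workers;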
Hi Thom,
I have now also witnessed this issue with "io_method = sync", so it may not be related to the number of workers set. I initially thought it did not occur with 'sync', as two runs completed successfully without delays; however, the last run did show the issue. Unfortunately, this is a very complex multi-stage geoprocessing workflow that cannot easily be cut down to a simple, single-SQL-statement reproducible case. And for this specific Italy extract it takes about 7 hours to complete if the run is successful and without the observed delays, so each test run costs considerable time whenever I adjust anything.

There is also a PostGIS upgrade in the mix (3.5.2 to 3.6.0) that may or may not be involved, as that PostGIS version is the minimum for PG18. I see that a 3.6.1 release is already planned and I will need to re-test once it is out. I definitely do use PostGIS functions at the stage where the processing gets heavily delayed.

As to the question about the pg_aios view, which I wasn't aware of: it appears to be empty at that point, but I will need to confirm that observation. With my last run, the moment I looked at the view, some of the very delayed multi-threaded jobs (> 6.5 hours instead of 10 seconds!) started slowly returning one by one, although some were still stuck for a while before all had returned, so the pg_aios view being empty is probably still representative of the stuck situation.

Also note that I adjust the storage parameters of the tables involved to force more aggressive vacuuming to avoid transaction ID wraparound (which shouldn't be an issue anyway with the small Italy test extract). This has all proven pretty reliable in the past and with previous PostgreSQL / PostGIS releases, up to the multi-billion record Facebook Daylight tables noted in the previous post. There is also no PostgreSQL partitioning involved in any of this; these are ordinary tables.
Marco
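The per-table storage parameters mentioned above are not spelled out in the thread; as a rough, hypothetical illustration of that kind of tuning (table name taken from the UPDATE statement shown later in the thread, parameter values invented for the example):

    -- Sketch: more aggressive per-table autovacuum settings (illustrative values only).
    ALTER TABLE osm.landcover_scrubs_small_scale_2_ply
        SET (autovacuum_vacuum_scale_factor = 0.01,       -- vacuum after ~1% dead tuples
             autovacuum_vacuum_cost_delay   = 0,          -- do not throttle vacuum I/O
             autovacuum_freeze_max_age      = 200000000); -- wraparound safety margin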
So, to confirm, you get the issue with as few as 3 io_workers?

Also, what is pg_aios telling you during this time?

Thom
Hi Thom,
I realized that my observation of the pg_aios view being empty was likely made with the "io_method = sync" option set, which I guess doesn't use or fill the pg_aios view? Can you confirm that the pg_aios view is unused with "io_method = sync"? This aspect is not documented in the PostgreSQL help. Anyway, I will need to re-test with 'worker' set.

I do see Tomas Vondra mentioning that even the 'sync' option in PG18 still goes "through the AIO infrastructure", but what that means exactly, also in relation to the pg_aios view, I don't know:
https://vondra.me/posts/tuning-aio-in-postgresql-18/
Marco
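For reference, checking the view is just a plain query; re-running it while a backend appears stuck shows whether any asynchronous I/O is actually in flight (a sketch; in psql the \watch meta-command can repeat it):

    -- Sketch: inspect in-flight asynchronous I/O handles (PG18).
    SELECT * FROM pg_aios;
    -- In psql, repeat every second while the stall is ongoing:
    -- \watch 1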
Hi Thom,
As an extension to what I already wrote: since the processing gets stuck during UPDATEs, I realized the pg_aios view is likely not involved, as the current AIO implementation in PG18 only covers read operations such as sequential and bitmap heap scans.

So not seeing anything listed in the pg_aios view might be normal? That said, I have attempted to view and refresh the view during other stages of the processing, with pgAdmin apparently showing read operations, but still no records displayed in pg_aios. Maybe I am hitting the "refresh" button at the wrong time though...

But maybe the whole new AIO mechanism isn't involved at all, and it is another issue in PG18. Just to summarize my observations once more:

- Multi-threaded processing implemented in Python using pyodbc and concurrent.futures apparently gets stuck waiting for PostgreSQL to return. The processing step involved should return in ***less than 10 seconds*** for the small Italy extract, but can take > 1 hour (up to > 6) when it randomly gets stuck (some runs are successful without delay, others not).

- pgAdmin shows all sessions associated with the threads as 'active', with no wait events or blocking PIDs, during the whole time the processing appears stuck in PostgreSQL.

- No other sessions such as VACUUM are visible in pgAdmin while the processing appears stuck, except the main 'postgres' user session.

- All locks shown in pg_locks are granted, and most if not all are fastpath, with only AccessShareLock and RowExclusiveLock on the table and its indexes, plus a couple of ExclusiveLock entries on virtualxid and transactionid.

- 'top' in Ubuntu shows multiple 'postgres' backend processes continuously at high core usage, one for each thread (each Python thread of course uses its own connection).

- The pg_aios view is empty, but the processing consists of UPDATEs, so that is probably no surprise.

- The processing *DOES* eventually continue after this anomaly, with no further consequences and the expected results at the end of the total processing flow, so it is not a true deadlock.

- I have noticed it gets stuck when processing OpenStreetMap scrub or grassland from the Geofabrik Italy extract. However, as written above, some processing runs are fine on the same data while others get stuck and delayed. Considering this, and the fact that the processing step getting stuck involves PostGIS functions, the issue may or may not involve PostGIS.

- In pgAdmin, the SQL statements generated by my geoprocessing workflow and being processed by PostgreSQL while the processing is stuck look like this:
UPDATE osm.landcover_scrubs_small_scale_2_ply AS t1
SET area_geo = t2.area_geo,
    perim_geo = t2.perim_geo,
    compact_geo = CASE WHEN t2.area_geo > 0 THEN ((power(t2.perim_geo,2) / t2.area_geo) / (4 * pi())) ELSE 0 END,
    npoints_geo = t2.npoints_geo,
    comp_npoints_geo = CASE WHEN t2.npoints_geo > 0 THEN (CASE WHEN t2.area_geo > 0 THEN ((power(t2.perim_geo,2) / t2.area_geo) / (4 * pi())) ELSE 0 END / t2.npoints_geo) ELSE 0 END,
    convex_ratio_geo = CASE WHEN ST_Area(ST_ConvexHull(way)::geography,true) > 0 THEN (t2.area_geo / ST_Area(ST_ConvexHull(way)::geography,true)) ELSE 1 END
FROM (SELECT objectid,
             ST_Area(way::geography,true) AS area_geo,
             ST_Perimeter(way::geography,true) AS perim_geo,
             ST_NPoints(way) AS npoints_geo
      FROM osm.landcover_scrubs_small_scale_2_ply) AS t2
WHERE (t2.objectid = t1.objectid)
  AND t1.objectid IN (SELECT t3.objectid FROM mini_test.osm.osm_tmp_28128_ch5 AS t3)
- All of this worked fine in PG <= 17.
Marco
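Since the same statement sometimes finishes in seconds and sometimes runs for hours, one way to compare a fast and a slow run is to capture the actual plan. The sketch below is a shortened, hypothetical variant of the UPDATE above (only part of the SET list is kept); note that EXPLAIN ANALYZE really executes the statement, so it is wrapped in a transaction that is rolled back:

    -- Sketch: capture the executed plan of one of the UPDATE jobs (shortened SET list).
    BEGIN;
    EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
    UPDATE osm.landcover_scrubs_small_scale_2_ply AS t1
    SET area_geo  = t2.area_geo,
        perim_geo = t2.perim_geo
    FROM (SELECT objectid,
                 ST_Area(way::geography, true)      AS area_geo,
                 ST_Perimeter(way::geography, true) AS perim_geo
          FROM osm.landcover_scrubs_small_scale_2_ply) AS t2
    WHERE t2.objectid = t1.objectid
      AND t1.objectid IN (SELECT t3.objectid FROM mini_test.osm.osm_tmp_28128_ch5 AS t3);
    ROLLBACK;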
Hi Markus,

On my Ubuntu virtual machine, io_uring cannot be started. Setting "io_method = io_uring" and trying to restart the cluster fails: it will not start, and I have attempted this multiple times. Only 'sync' and 'worker' allow restarting after modifying the PostgreSQL configuration file. As I understand it, the PostgreSQL binary needs to be compiled with the proper support; maybe my version on Ubuntu 24.04, which runs as a Windows Hyper-V virtual machine, doesn't have it. Although I did notice, when installing PG18 from Synaptic, that it installed an additional 'liburing' package, or something named like that, if I remember correctly...

As to your question about Python and a scheduling conflict: this is not the case. Python runs on the Windows host, not under Ubuntu inside the VM. I only have PostgreSQL installed on Ubuntu, as I use it with osm2pgsql there. I access the PostgreSQL instance via pyodbc or psycopg2 from the Windows host, so it is like a remote database server, just running on local hardware.

Marco
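A quick, hedged way to check whether a particular PG18 build offers io_uring at all is to look at the accepted values of the setting; if the server was built without liburing, 'io_uring' should not show up in the list:

    -- Sketch: which io_method values does this build accept?
    SELECT name, setting, enumvals
    FROM pg_settings
    WHERE name = 'io_method';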
Hi,

On 2025-10-05 22:22:32 +0200, Marco Boeringa wrote:
> I realized that my observation of the pg_aios view being empty was likely
> with the "io_method = sync" option set, which I guess doesn't use or fill
> the pg_aios view?

pg_aios is populated even with io_method = sync, albeit with at most one entry per backend. If there were no entries in pg_aios at the time of your hang, it doesn't seem likely - although not impossible - for AIO to be responsible.

Greetings,

Andres Freund
Hi,

On 2025-10-05 10:55:01 +0200, Marco Boeringa wrote:
> - High processor activity, even though the jobs that should finish in
>   seconds, take hours, all the while showing the high core usage.
>
> - PgAdmin shows all sessions created by the Python threads as 'active', with
>   *no* wait events attached.

I think we need CPU profiles of these tasks. If something is continually taking a lot more CPU than expected, that seems like an issue worth investigating.

Greetings,

Andres Freund
Hi Andres,

I should have phrased it better. The high processor and core activity is not unexpected: the code is designed to saturate the processor and maximize throughput through careful design of the Python threading. It is just that all the jobs sent to PostgreSQL via ODBC for this specific step of the processing, with the small Italy extract, should return in less than 10 seconds (which they do in those lucky runs where I do not observe the issue), but some of the jobs for this step don't. E.g. 30 jobs return within 10 seconds, then the remaining 14 unexpectedly get stuck for 2 hours before returning, all the while staying at the same high core usage they were started with.

So some of the PostgreSQL database sessions, which as I already explained show up in pgAdmin as 'active' with no wait events or blocking PIDs, simply take an excessive amount of time, but do ultimately return. The CPU time, as witnessed with 'top' in Ubuntu, is really spent in PostgreSQL and the database sessions, not Python, which runs on Windows and does not show high CPU usage in the Windows Task Manager.

This doesn't always happen; it is kind of random. One run with the Italy data will be OK, the next not.

Marco

Op 6-10-2025 om 16:34 schreef Andres Freund:
> [...]
Hi,

On 2025-10-06 18:09:25 +0200, Marco Boeringa wrote:
> [...]

We need a profile of those processes while they are using an excessive amount of time. If they don't have wait events they're using CPU time, and seeing a profile of where all that time is spent might provide enough information on where to look in more detail.

Greetings,

Andres Freund
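To find which backend PIDs to hand to a profiler, something along these lines can be used (a sketch; the 5-minute threshold is arbitrary):

    -- Sketch: PIDs of 'active' backends running far longer than expected.
    SELECT pid,
           now() - query_start AS running_for,
           wait_event_type,
           wait_event,
           left(query, 80) AS query_prefix
    FROM pg_stat_activity
    WHERE state = 'active'
      AND now() - query_start > interval '5 minutes'  -- arbitrary threshold
    ORDER BY running_for DESC;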
Hi Andres,

I am not really a Linux / Ubuntu expert. Can you give me a suggestion for how to create such a CPU profile for the specific PostgreSQL processes getting stuck?

Marco

Op 6-10-2025 om 18:13 schreef Andres Freund:
> [...]
Hi,

On 2025-10-06 18:17:11 +0200, Marco Boeringa wrote:
> [...]

https://wiki.postgresql.org/wiki/Profiling_with_perf is a good starting point.

Greetings,

Andres
Hi Andres,

Thanks for the suggestion, this seems a useful option.

However, when I attempt to run "perf top" in a terminal window, I get the following warning:

WARNING: perf not found for kernel 6.14.0-1012

I also see a suggestion to install the Azure linux-tools. However, if I type 'linux-tools' as a search keyword in Synaptic Package Manager, I see a whole bunch of 'linux-tools' packages, e.g. azure/aws/gcp/gke, which also include kernel version build numbers (at least that is what I assume they are).

What version do you suggest I install for an ordinary, locally running Ubuntu 24.04 VM? And do these packages indeed add the perf command?

Marco

Op 6-10-2025 om 18:21 schreef Andres Freund:
> [...]
Hi,

On 2025-10-06 19:01:37 +0200, Marco Boeringa wrote:
> [...]

There are meta-packages to install linux-tools for the right version, e.g. linux-tools-virtual. Unfortunately Ubuntu has multiple "kernel variants" (like -virtual) that you still have to choose between.

You can figure out which base kernel you have with "dpkg -l | grep linux" or such.

Greetings,

Andres
Hi Andres,

I now noticed that all the suggested 'linux-tools' packages that popped up in the warning message when I attempted to run "perf top", referencing Azure, are already displayed as installed in Synaptic Package Manager. I guess it makes sense that the packages, and likely the kernel of my machine, are for Azure, as it is a Windows Hyper-V virtual machine installed via the Microsoft-provided Ubuntu option.

However, if the packages are installed, why can't I run perf? Or do I need a 'linux-tools'-specific command for that instead of perf?

Marco

Op 6-10-2025 om 19:09 schreef Andres Freund:
> [...]
Hi Andres,

I have now found out that I do have a 'perf' living under one of the 'usr' folders, but unfortunately it is the 6.8 kernel version:

/usr/lib/linux-tools-6.8.0-85

None of the other suggested packages and their likely install folders seem to contain perf.

Since perf appears to need to match the kernel version exactly, which is understandable, I can't use this one, as my kernel was already upgraded to 6.14 by a more or less forced update in Software Updater. It is a pain that the linux-tools-common package, which I suspect is the source of the 6.8 'perf' version and folder and is tagged as that version in Synaptic, isn't updated at the same time to let you run 'perf' with the proper version. I guess I will need to wait for an update of it.

Marco

Op 6-10-2025 om 19:09 schreef Andres Freund:
> [...]
On 2025-10-06 22:41:31 +0200, Marco Boeringa wrote:
> [...]

I'm pretty sure that you can use any halfway-recent perf binary, they don't actually need to match exactly. I don't know why Ubuntu insists on a perfect match. I regularly run completely different versions.

Greetings,

Andres Freund
It didn't work: as soon as I attempted to run perf, it emitted the warning message about the kernel version mismatch, with suggestions of packages to install.

However, after further digging I realized that Ubuntu usually has multiple kernel versions installed. I have now enabled the GRUB boot menu, which should allow me to boot with the older 6.8 version of the kernel (which was available during configuration of GRUB), and hopefully run perf with that kernel version.

Marco

Op 6-10-2025 om 23:39 schreef Andres Freund:
> [...]
Hi Andres,

That worked, I successfully booted with kernel 6.8!

I can now run perf, but it emits a warning, see below. Do you have suggestions for how to set these perf 'paranoid' settings?

Marco

  Access to performance monitoring and observability operations is limited.
  Consider adjusting /proc/sys/kernel/perf_event_paranoid setting to open
  access to performance monitoring and observability operations for processes
  without CAP_PERFMON, CAP_SYS_PTRACE or CAP_SYS_ADMIN Linux capability.
  More information can be found at 'Perf events and tool security' document:
  https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html
  perf_event_paranoid setting is 4:
    -1: Allow use of (almost) all events by all users
        Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
  >= 0: Disallow raw and ftrace function tracepoint access
  >= 1: Disallow CPU event access
  >= 2: Disallow kernel profiling
  To make the adjusted perf_event_paranoid setting permanent preserve it
  in /etc/sysctl.conf (e.g. kernel.perf_event_paranoid = <setting>)

Op 7-10-2025 om 09:15 schreef Marco Boeringa:
> [...]
Hi Andres,

I realized I simply need to run perf with 'sudo' privileges; that seems to work. All the examples I saw omitted it, hence the confusion.

I will attempt to capture the stall with the Italy data once another OpenStreetMap rendering test that is currently running has finished.

Marco

Op 7-10-2025 om 09:34 schreef Marco Boeringa:
> [...]
Hi Andres,

I have now been able to successfully run 'perf' after a VM restart and booting with the Linux 6.8 kernel.

I have copied in two results: 'perf top' output, and a recording of one of the roughly 10 stuck processes made by supplying the specific PID of the postgres backend. Let me know if this is of any use. Of course, the recordings themselves are much bigger; I just copied the top part as displayed by the 'perf report' option. If you need the actual underlying files, let me know. Also, if you have specific suggestions for 'perf' command options you would like to see run, please post them in your response and I will see whether I can run them.

Andres: I posted to you separately that I suspected a recent Linux kernel update (6.8 --> 6.14), which coincided with the PG18 / PostGIS 3.6.0 upgrade, might be involved, because I initially had multiple successful runs of my tool without the temporary stalling / delayed return of postgres after booting into the older 6.8 kernel. This has now proven wrong: these results are from the 6.8 kernel runs, so the kernel is not involved, and I also see the issue with the older kernel.

*** sudo perf top: ***

Samples: 1M of event 'cpu-clock:ppp', 4000 Hz, Event count (approx.): 346785156250 lost: 0/0 drop: 0/0

Overhead  Shared Object   Symbol
 17,33%   postgres        [.] 0x0000000000584216
 12,21%   postgres        [.] 0x000000000058737a
 11,80%   [kernel]        [k] pv_native_safe_halt
 10,24%   postgres        [.] ReleaseAndReadBuffer
  4,92%   postgres        [.] heap_hot_search_buffer
  3,63%   postgres        [.] hash_search_with_hash_value
  2,36%   postgres        [.] 0x00000000005841f3
  1,85%   postgres        [.] 0x0000000000554edd
  1,61%   postgres        [.] 0x0000000000554e1f
  1,40%   postgres        [.] 0x00000000005567f9
  1,32%   postgres        [.] StartReadBuffer
  1,24%   postgres        [.] heap_page_prune_opt
  1,23%   postgres        [.] LWLockAcquire
  1,14%   postgres        [.] HeapTupleSatisfiesVisibility
  1,10%   postgres        [.] 0x00000000003977dd
  0,92%   postgres        [.] index_getnext_tid
  0,87%   postgres        [.] index_fetch_heap
  0,83%   postgres        [.] ExecScan
  0,82%   postgres        [.] ReleaseBuffer
  0,81%   postgres        [.] LWLockRelease
  0,71%   postgres        [.] XidInMVCCSnapshot
  0,66%   postgres        [.] _bt_checkkeys
  0,66%   postgres        [.] _bt_next
  0,55%   libxorgxrdp.so  [.] crc_process_data
  0,51%   postgres        [.] btgettuple
  0,41%   postgres        [.] 0x00000000005567c6
  0,40%   postgres        [.] 0x00000000003975d0
  0,35%   postgres        [.] ExecStoreBufferHeapTuple
  0,34%   postgres        [.] hash_bytes
  0,33%   postgres        [.] MemoryContextReset
  0,33%   postgres        [.] 0x00000000003975d3
  0,32%   postgres        [.] ExecEvalSysVar
  0,28%   postgres        [.] 0x00000000003975e3
  0,27%   postgres        [.] ResourceOwnerForget
  0,27%   postgres        [.] GlobalVisTestFor
  0,26%   postgres        [.] 0x00000000003975b6
  0,26%   postgres        [.] HeapTupleIsSurelyDead
  0,23%   postgres        [.] 0x00000000003975c3
  0,23%   postgres        [.] 0x00000000001ff80a
  0,22%   postgres        [.] PredicateLockTID
  0,21%   postgres        [.] slot_getsomeattrs_int
  0,20%   postgres        [.] 0x000000000058421d
  0,19%   postgres        [.] ReadBufferExtended
  0,16%   libc.so.6       [.] __memcmp_sse2
  0,15%   postgres        [.] 0x0000000000382e9e

*** sudo perf -p <PID of one stuck postgres backend> -g -d 10 ***
*** sudo perf report -g ***

Samples: 40K of event 'task-clock:ppp', Event count (approx.): 10018000000

Children   Self    Command   Shared Object  Symbol
+ 100,00%   0,00%  postgres  postgres       [.] _start
+ 100,00%   0,00%  postgres  libc.so.6      [.] __libc_start_main@@GLIBC_2.34
+ 100,00%   0,00%  postgres  libc.so.6      [.] __libc_start_call_main
+ 100,00%   0,00%  postgres  postgres       [.] main
+ 100,00%   0,00%  postgres  postgres       [.] PostmasterMain
+ 100,00%   0,00%  postgres  postgres       [.] 0x00005f35711b913e
+ 100,00%   0,00%  postgres  postgres       [.] postmaster_child_launch
+ 100,00%   0,00%  postgres  postgres       [.] 0x00005f3570fb9dbf
+ 100,00%   0,00%  postgres  postgres       [.] PostgresMain
+ 100,00%   0,00%  postgres  postgres       [.] 0x00005f3570fb6cb9
+ 100,00%   0,00%  postgres  postgres       [.] PortalRun
+ 100,00%   0,00%  postgres  postgres       [.] 0x00005f3570fbeca3
+ 100,00%   0,00%  postgres  postgres       [.] 0x00005f3570fbcf85
+ 100,00%   0,00%  postgres  postgres       [.] standard_ExecutorRun
+ 100,00%   0,00%  postgres  postgres       [.] 0x00005f3570de682b
+ 100,00%   0,00%  postgres  postgres       [.] 0x00005f3570de966d
+  97,51%   0,00%  postgres  postgres       [.] 0x00005f3570de959d
+  97,17%   0,95%  postgres  postgres       [.] ExecScan
+  82,75%   0,00%  postgres  postgres       [.] 0x00005f3570dda135
+  82,35%   1,04%  postgres  postgres       [.] index_fetch_heap
+  28,80%   0,00%  postgres  postgres       [.] 0x00005f3570bf3f87
+  23,67%   0,00%  postgres  postgres       [.] 0x00005f3570bf3ee8
+  22,19%   0,00%  postgres  postgres       [.] 0x00005f3570fa1216
+  22,18%  22,18%  postgres  postgres       [.] 0x0000000000584216
+  17,06%   0,89%  postgres  postgres       [.] LWLockRelease
+  16,18%   0,00%  postgres  postgres       [.] 0x00005f3570bf3f4f
+  15,59%   0,00%  postgres  postgres       [.] 0x00005f3570fa437a
+  15,57%  15,57%  postgres  postgres       [.] 0x000000000058737a
+  14,90%   0,22%  postgres  postgres       [.] ReadBufferExtended
+  14,18%  13,05%  postgres  postgres       [.] ReleaseAndReadBuffer
+  14,00%   1,82%  postgres  postgres       [.] StartReadBuffer
+   6,59%   0,00%  postgres  postgres       [.] 0x00005f3570bf3f18
+   5,97%   3,78%  postgres  postgres       [.] heap_hot_search_buffer
+   5,23%   0,00%  postgres  postgres       [.] 0x00005f3570da0613
+   5,20%   0,00%  postgres  postgres       [.] 0x00005f3570dda121
+   5,19%   0,22%  postgres  postgres       [.] slot_getsomeattrs_int
+   5,18%   0,68%  postgres  postgres       [.] index_getnext_tid
+   4,52%   0,65%  postgres  postgres       [.] btgettuple
+   3,92%   0,99%  postgres  postgres       [.] _bt_next
+   3,28%   3,28%  postgres  postgres       [.] hash_search_with_hash_value
+   2,96%   0,38%  postgres  postgres       [.] ExecStoreBufferHeapTuple
+   2,93%   0,00%  postgres  postgres       [.] 0x00005f3570bf3fe0
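As a non-authoritative aside on the profile above: time concentrated in buffer reads, heap_hot_search_buffer and heap_page_prune_opt would at least be consistent with index scans wading through many dead or HOT-updated row versions, so the per-table statistics are worth a look while a job is stuck (a sketch, using the table from the UPDATE shown earlier in the thread):

    -- Sketch: dead-tuple and vacuum statistics for the table being updated.
    SELECT relname,
           n_live_tup,
           n_dead_tup,
           n_tup_upd,
           n_tup_hot_upd,
           last_vacuum,
           last_autovacuum
    FROM pg_stat_user_tables
    WHERE schemaname = 'osm'
      AND relname = 'landcover_scrubs_small_scale_2_ply';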
Hi Andres, Probably less useful, but just to further illustrate the issue, the results op 'top' which shows multiple stuck postgress processes already busy for hours and as written before all displaying as 'active' with no wait events in pgAdmin: top - 21:42:44 up 12:20, 2 users, load average: 16,14, 16,14, 16,09 Tasks: 664 total, 17 running, 647 sleeping, 0 stopped, 0 zombie %Cpu(s): 37,0 us, 0,1 sy, 0,0 ni, 62,9 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st MiB Mem : 393429,0 total, 297340,0 free, 53521,1 used, 91791,0 buff/cache MiB Swap: 153600,0 total, 153600,0 free, 0,0 used. 339907,9 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 118256 postgres 20 0 164,3g 994408 985600 R 100,0 0,2 309:57.22 postgres 118264 postgres 20 0 164,3g 1,0g 994,8m R 100,0 0,3 309:57.46 postgres 118367 postgres 20 0 164,3g 911516 907456 R 100,0 0,2 309:55.78 postgres 118368 postgres 20 0 164,3g 915388 911328 R 100,0 0,2 309:55.82 postgres 118369 postgres 20 0 164,3g 914684 910272 R 100,0 0,2 309:55.20 postgres 118372 postgres 20 0 164,3g 924188 916256 R 100,0 0,2 309:55.20 postgres 118254 postgres 20 0 164,3g 1,0g 1,0g R 99,9 0,3 309:52.71 postgres 118255 postgres 20 0 164,3g 993,6m 981,8m R 99,9 0,3 309:57.73 postgres 118259 postgres 20 0 164,3g 978,2m 988768 R 99,9 0,2 309:57.47 postgres 118261 postgres 20 0 164,3g 1,0g 987,2m R 99,9 0,3 309:57.52 postgres 118262 postgres 20 0 164,3g 998560 990176 R 99,9 0,2 309:57.04 postgres 118265 postgres 20 0 164,3g 1,0g 1,0g R 99,9 0,3 309:55.86 postgres 118365 postgres 20 0 164,3g 925244 917312 R 99,9 0,2 309:56.02 postgres 118366 postgres 20 0 164,3g 920668 912384 R 99,9 0,2 309:56.02 postgres 118370 postgres 20 0 164,3g 923132 915200 R 99,9 0,2 309:55.22 postgres 118373 postgres 20 0 164,3g 924188 916256 R 99,9 0,2 309:54.96 postgres 2546 ubuntu_+ 20 0 560524 266288 106704 S 21,8 0,1 7:48.62 Xorg 2954 ubuntu_+ 20 0 9050316 455464 167812 S 7,3 0,1 4:33.90 gnome-s+ 1855 xrdp 20 0 133248 16344 8800 S 1,3 0,0 0:37.48 xrdp 4329 ubuntu_+ 20 0 851956 56648 43292 S 0,7 0,0 0:25.66 gnome-t+ 123595 ubuntu_+ 20 0 14644 5632 3520 R 0,4 0,0 0:00.31 top 123432 root 20 0 0 0 0 I 0,2 0,0 0:00.09 kworker+ 1385 polkitd 20 0 384600 9744 7392 S 0,1 0,0 0:01.92 polkitd 1982 gdm 20 0 8805756 237412 128832 S 0,1 0,1 1:20.05 gnome-s+ 3259 ubuntu_+ 20 0 388852 11472 6688 S 0,1 0,0 0:08.03 ibus-da+ 3368 ubuntu_+ 20 0 347872 30504 17556 S 0,1 0,0 0:03.70 ibus-ex+ 3638 ubuntu_+ 20 0 236772 6688 6336 S 0,1 0,0 0:02.33 ibus-en+ 30304 ubuntu_+ 20 0 6490464 271864 129800 S 0,1 0,1 0:23.91 nautilus 1 root 20 0 23892 13024 8800 S 0,0 0,0 0:04.11 systemd 2 root 20 0 0 0 0 S 0,0 0,0 0:00.04 kthreadd 3 root 20 0 0 0 0 S 0,0 0,0 0:00.00 pool_wo+ 4 root 0 -20 0 0 0 I 0,0 0,0 0:00.00 kworker+ 5 root 0 -20 0 0 0 I 0,0 0,0 0:00.00 kworker+ 6 root 0 -20 0 0 0 I 0,0 0,0 0:00.00 kworker+ 7 root 0 -20 0 0 0 I 0,0 0,0 0:00.00 kworker+ 9 root 0 -20 0 0 0 I 0,0 0,0 0:00.00 kworker+ 10 root 20 0 0 0 0 I 0,0 0,0 0:05.18 kworker+
Hi, On 2025-10-07 21:34:00 +0200, Marco Boeringa wrote: > I have now been able to successfully run 'perf' after a VM restart and > booting with the linux 6.8 kernel. Unfortunately the profile is lacking most symbols, so it's not particularly informative :(. You probably need to install the debug symbols package for postgres (and probably libc)... > Andres: I posted separately to you I had a suspicion a recent linux kernel > update (6.8 --> 6.14) might be involved that coincided with the PG18/PostGIS > 3.6.0 upgrade when I initially had multiple successful runs of my tool > without the temporary stalling / delay in returning of postgres after > booting to the older 6.8 kernel. This has now proven wrong: these results > are of the 6.8 kernel runs, so the kernel is not involved, and I also see > the issue with the older kernel. Good to know. Greetings, Andres Freund
Andres, Can you be a bit more specific as to what debug symbols package I need to install? There is a whole list of them for PG18 when I look with Synaptic Package Manager... Or do I simply need to install all PG18 listed debug symbols packages? It is a very long list though of packages, so I would prefer to limit it to the packages you think are really useful for your debug work. And do you have any comments as the command run? If I install the debug symbols, are the specific commands OK, or do you want other options? Marco Op 7-10-2025 om 21:56 schreef Andres Freund: > Hi, > > On 2025-10-07 21:34:00 +0200, Marco Boeringa wrote: >> I have now been able to successfully run 'perf' after a VM restart and >> booting with the linux 6.8 kernel. > Unfortunately the profile is lacking most symbols, so it's not particularly > informative :(. You probably need to install the debug symbols package for > postgres (and probably libc)... > > >> Andres: I posted separately to you I had a suspicion a recent linux kernel >> update (6.8 --> 6.14) might be involved that coincided with the PG18/PostGIS >> 3.6.0 upgrade when I initially had multiple successful runs of my tool >> without the temporary stalling / delay in returning of postgres after >> booting to the older 6.8 kernel. This has now proven wrong: these results >> are of the 6.8 kernel runs, so the kernel is not involved, and I also see >> the issue with the older kernel. > Good to know. > > Greetings, > > Andres Freund
Hi Andres, Searching a bit more, and reading the PostgreSQL wiki pages about debugging, I have now found the 'find-dbgsym-packages' command, part of the 'debian-goodies' package. Installing debian-goodies and running this command against a postgres process ID returned the following debug symbol packages. Does this list seem about right to you? Marco
lib32stdc++6-14-dbg
libkrb5-dbg
libstdc++6-14-dbg
libx32stdc++6-14-dbg
postgresql-18-dbgsym
Hi Andres, This looks much better, doesn't it? I hope this helps. Let me know if you need anything else. Marco *** sudo perf top: *** Samples: 2M of event 'cpu-clock:ppp', 4000 Hz, Event count (approx.): 593240647834 lost: 0/0 drop: 0/0 Overhead Shared Object Symbol 24,66% [kernel] [k] pv_native_safe_halt 9,35% postgres [.] LWLockAttemptLock 7,69% postgres [.] heap_hot_search_buffer 6,68% postgres [.] tts_heap_getsomeattrs.lto_priv.0 4,93% postgres [.] LWLockReleaseInternal 4,26% postgres [.] ExecInterpExpr 4,12% postgres [.] hash_search_with_hash_value 3,42% postgres [.] PinBuffer 2,54% postgres [.] heapam_index_fetch_tuple.lto_priv.0 2,46% postgres [.] ReleaseAndReadBuffer 1,72% postgres [.] LWLockAcquire 1,72% postgres [.] HeapTupleSatisfiesVisibility 1,48% postgres [.] index_fetch_heap 1,44% postgres [.] index_getnext_tid 1,40% postgres [.] heap_page_prune_opt 1,28% postgres [.] _bt_readpage 1,28% postgres [.] ExecScan 1,23% postgres [.] LWLockRelease 1,10% postgres [.] IndexNext 1,10% postgres [.] XidInMVCCSnapshot 1,07% postgres [.] _bt_checkkeys 0,98% postgres [.] _bt_next 0,90% postgres [.] UnpinBufferNoOwner.lto_priv.0 0,82% postgres [.] GetPrivateRefCountEntry.lto_priv.0 0,81% postgres [.] StartReadBuffer 0,80% postgres [.] btgettuple 0,69% postgres [.] _bt_check_compare.lto_priv.0 0,67% libxorgxrdp.so [.] crc_process_data 0,56% postgres [.] ExecStoreBufferHeapTuple 0,55% postgres [.] MemoryContextReset 0,50% postgres [.] ExecEvalSysVar 0,50% postgres [.] hash_bytes 0,48% postgres [.] tts_virtual_clear.lto_priv.0 0,48% postgres [.] ExecNestLoop 0,43% postgres [.] ResourceOwnerForget 0,41% postgres [.] GlobalVisTestFor 0,37% postgres [.] HeapTupleIsSurelyDead 0,33% postgres [.] slot_getsomeattrs_int 0,33% postgres [.] PredicateLockTID 0,29% postgres [.] ReadBufferExtended 0,27% postgres [.] _bt_saveitem 0,23% postgres [.] _bt_setuppostingitems 0,22% libc.so.6 [.] __memcmp_sse2 0,21% postgres [.] ExecIndexScan 0,20% postgres [.] ReleaseBuffer 0,19% postgres [.] ResourceOwnerEnlarge *** sudo perf -p <PID of one stuck postgres backend> -g -d 10 *** *** sudo perf report -g *** Samples: 40K of event 'task-clock:ppp', Event count (approx.): 10008250000 Children Self Command Shared Object Symbol + 100,00% 0,00% postgres postgres [.] _start + 100,00% 0,00% postgres libc.so.6 [.] __libc_start_main@@GLIBC_2.34 + 100,00% 0,00% postgres libc.so.6 [.] __libc_start_call_main + 100,00% 0,00% postgres postgres [.] main + 100,00% 0,00% postgres postgres [.] PostmasterMain + 100,00% 0,00% postgres postgres [.] ServerLoop.isra.0 + 100,00% 0,00% postgres postgres [.] postmaster_child_launch + 100,00% 0,00% postgres postgres [.] 0x00005f3570fb9dbf + 100,00% 0,00% postgres postgres [.] PostgresMain + 100,00% 0,00% postgres postgres [.] exec_simple_query + 100,00% 0,63% postgres postgres [.] ExecNestLoop + 100,00% 0,00% postgres postgres [.] PortalRun + 100,00% 0,00% postgres postgres [.] PortalRunMulti + 100,00% 0,00% postgres postgres [.] ProcessQuery + 100,00% 0,00% postgres postgres [.] standard_ExecutorRun + 100,00% 0,00% postgres postgres [.] ExecModifyTable + 94,63% 1,47% postgres postgres [.] ExecScan + 78,76% 1,49% postgres postgres [.] IndexNext + 66,89% 1,96% postgres postgres [.] index_fetch_heap + 64,35% 3,61% postgres postgres [.] heapam_index_fetch_tuple.lto_priv.0 + 21,92% 0,40% postgres postgres [.] ReadBufferExtended + 20,36% 1,19% postgres postgres [.] StartReadBuffer + 16,23% 5,79% postgres postgres [.] ExecInterpExpr + 15,00% 10,70% postgres postgres [.] 
heap_hot_search_buffer + 12,81% 12,81% postgres postgres [.] LWLockAttemptLock + 10,31% 2,36% postgres postgres [.] index_getnext_tid + 9,69% 0,48% postgres postgres [.] slot_getsomeattrs_int + 9,27% 9,27% postgres postgres [.] tts_heap_getsomeattrs.lto_priv.0 + 8,45% 1,85% postgres postgres [.] LWLockRelease + 7,91% 1,08% postgres postgres [.] btgettuple + 6,89% 1,12% postgres postgres [.] _bt_next + 6,77% 6,77% postgres postgres [.] LWLockReleaseInternal + 5,89% 5,89% postgres postgres [.] hash_search_with_hash_value + 5,78% 0,10% postgres postgres [.] _bt_readnextpage + 4,48% 4,14% postgres postgres [.] PinBuffer + 4,41% 1,90% postgres postgres [.] _bt_readpage + 4,09% 3,00% postgres postgres [.] ReleaseAndReadBuffer + 2,99% 2,38% postgres postgres [.] HeapTupleSatisfiesVisibility + 2,58% 0,82% postgres postgres [.] ExecStoreBufferHeapTuple + 2,43% 2,43% postgres postgres [.] LWLockAcquire + 2,28% 1,47% postgres postgres [.] _bt_checkkeys + 1,74% 1,74% postgres postgres [.] heap_page_prune_opt + 1,67% 1,10% postgres postgres [.] UnpinBufferNoOwner.lto_priv.0 + 1,47% 1,47% postgres postgres [.] XidInMVCCSnapshot + 1,18% 1,18% postgres postgres [.] GetPrivateRefCountEntry.lto_priv.0 + 1,09% 1,09% postgres postgres [.] MemoryContextReset Op 7-10-2025 om 23:38 schreef Marco Boeringa: > Hi Andres, > > Searching a bit more, and reading the PostgreSQL wiki pages about > debugging, I have now found the 'find-dbgsym-packages' command, part > of the 'debian-goodies' > > Installing and running this against a postgres process ID, returned > the following debug symbol packages. Does this list seem about right > for you? > > Marco > > lib32stdc++6-14-dbg > > libkrb5-dbg > > libstdc++6-14-dbg > > libx32stdc++6-14-dbg > > postgresql-18-dbgsym >
Hi, On 2025-10-08 00:13:34 +0200, Marco Boeringa wrote: > This looks much better, doesn't it? It indeed does! > I hope this helps. Let me know if you need anything else. > *** sudo perf -p <PID of one stuck postgres backend> -g -d 10 *** > *** sudo perf report -g *** Could you show perf report --no-children? That would show us which individual functions, rather than call-stacks, take the longest. > Samples: 40K of event 'task-clock:ppp', Event count (approx.): 10008250000 > Children Self Command Shared Object Symbol > + 100,00% 0,00% postgres postgres [.] _start > + 100,00% 0,00% postgres libc.so.6 [.] > __libc_start_main@@GLIBC_2.34 > + 100,00% 0,00% postgres libc.so.6 [.] > __libc_start_call_main > + 100,00% 0,00% postgres postgres [.] main > + 100,00% 0,00% postgres postgres [.] PostmasterMain > + 100,00% 0,00% postgres postgres [.] ServerLoop.isra.0 > + 100,00% 0,00% postgres postgres [.] > postmaster_child_launch > + 100,00% 0,00% postgres postgres [.] 0x00005f3570fb9dbf > + 100,00% 0,00% postgres postgres [.] PostgresMain > + 100,00% 0,00% postgres postgres [.] exec_simple_query > + 100,00% 0,63% postgres postgres [.] ExecNestLoop > + 100,00% 0,00% postgres postgres [.] PortalRun > + 100,00% 0,00% postgres postgres [.] PortalRunMulti > + 100,00% 0,00% postgres postgres [.] ProcessQuery > + 100,00% 0,00% postgres postgres [.] standard_ExecutorRun > + 100,00% 0,00% postgres postgres [.] ExecModifyTable > + 94,63% 1,47% postgres postgres [.] ExecScan > + 78,76% 1,49% postgres postgres [.] IndexNext > + 66,89% 1,96% postgres postgres [.] index_fetch_heap > + 64,35% 3,61% postgres postgres [.] > heapam_index_fetch_tuple.lto_priv.0 So somehow >60% of the CPU time is spent fetching tuples corresponding to index entries. That seems ... a lot. Is it possible that you have a lot of dead rows in the involved tables? I don't immediately see how this could be related to AIO. Greetings, Andres Freund
Hi, On 2025-10-07 19:03:40 -0400, Andres Freund wrote: > On 2025-10-08 00:13:34 +0200, Marco Boeringa wrote: > So somehow >60% of the CPU time is spent fetching tuples corresponding to > index entries. That seems ... a lot. Is it possible that you have a lot of > dead rows in the involved tables? > > I don't immediately see how this could be related to AIO. Can you share the query and explain for it that was running in the "stuck" backend? Greetings, Andres Freund
Hi Andres,
> Could you show perf report --no-children? That would show us which individual
> functions, rather than call-stacks, take the longest.
See the full output at the bottom of this mail!
> I don't immediately see how this could be related to AIO.
Yes, you could be right that this is not related to AIO at all, but to another issue introduced in PG18. The only reason I
initially thought of AIO was of course that it is one of the main new features of PG18, and I could imagine "workers"
getting into some sort of inter-worker locking issue, just like threads can.
 
For sure, I did not see this issue in <= PG17, so some change in PG18 is causing it. Additionally, there is a small
chance it might be related to PostGIS, as that was upgraded as well (3.5.2 --> 3.6.0) during the PG upgrade, as PG18
requires PostGIS 3.6.0 minimum. And the query does use PostGIS functions, but none that AFAIK rely on e.g. a spatial
index. Functions like ST_Area just process an individual geometry, not the spatial relationship between multiple
geometries.
As I wrote before, this is a multi-threaded Python application (actually developed in a GIS), that uses Python's
'concurrent.futures' threading framework to create jobs of records to process for each thread, significantly speeding up
the processing. The queries are in fact dynamically built by the code, and part of a much larger geoprocessing workflow,
so it is hard to run them separately and provide a query plan (although in this case I could by rewriting part of the
query below).
 
However, I want to stress that any query plan is unlikely to yield anything. In normal circumstances and in PG17 and
below, this code runs fine! And it is only 1 in maybe 4 runs in PG18 that goes berserk and makes a processing step that
should take < 10 seconds all of a sudden take > 2 hours.
 
UPDATE osm.landcover_scrubs_small_scale_2_ply AS t1
SET area_geo = t2.area_geo,
    perim_geo = t2.perim_geo,
    compact_geo = CASE WHEN t2.area_geo > 0
                       THEN ((power(t2.perim_geo,2) / t2.area_geo) / (4 * pi()))
                       ELSE 0 END,
    npoints_geo = t2.npoints_geo,
    comp_npoints_geo = CASE WHEN t2.npoints_geo > 0
                            THEN (CASE WHEN t2.area_geo > 0
                                       THEN ((power(t2.perim_geo,2) / t2.area_geo) / (4 * pi()))
                                       ELSE 0 END / t2.npoints_geo)
                            ELSE 0 END,
    convex_ratio_geo = CASE WHEN ST_Area(ST_ConvexHull(way)::geography,true) > 0
                            THEN (t2.area_geo / ST_Area(ST_ConvexHull(way)::geography,true))
                            ELSE 1 END
FROM (SELECT objectid,
             ST_Area(way::geography,true) AS area_geo,
             ST_Perimeter(way::geography,true) AS perim_geo,
             ST_NPoints(way) AS npoints_geo
      FROM osm.landcover_scrubs_small_scale_2_ply) AS t2
WHERE (t2.objectid = t1.objectid)
  AND t1.objectid IN (SELECT t3.objectid FROM mini_test.osm.osm_tmp_28128_ch5 AS t3)
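If a plan really is needed, I could probably capture it from inside the running workflow rather than rewriting the statement, e.g. by having my tool enable auto_explain on its sessions. A rough sketch (the threshold is just an illustrative value, this is not something I currently run):

LOAD 'auto_explain';
SET auto_explain.log_min_duration = '30s';  -- only log the statements that stall
SET auto_explain.log_analyze = on;          -- include actual row counts
SET auto_explain.log_buffers = on;          -- include buffer hit/read counts
SET auto_explain.log_timing = off;          -- keep per-node timing overhead low

That would write the plans of the dynamically built UPDATEs to the server log without changing the workflow itself.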
> So somehow >60% of the CPU time is spent fetching tuples corresponding to
> index entries. That seems ... a lot. Is it possible that you have a lot of
> dead rows in the involved tables?
Yes, that is perfectly possible. However, the particular table is only 
just over 100k records. It is true that my code is designed to process 
literally *every* record in a table. However, I specifically set 
adjusted table storage parameters with much more aggressive vacuum 
settings (essentially forcing a vacuum after something like 10k 
dead tuples, irrespective of the size of the table). This has worked 
really well, and I have successfully UPDATEd the > 1B record Facebook 
Daylight tables with the same code, without ever running into 
this particular issue, nor into transaction ID wraparound issues.
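For reference, the adjusted storage parameters are roughly along these lines (illustrative values, not a literal copy of my settings):

ALTER TABLE osm.landcover_scrubs_small_scale_2_ply
    SET (autovacuum_vacuum_scale_factor = 0.0,   -- ignore the table size
         autovacuum_vacuum_threshold = 10000);   -- vacuum after ~10k dead tuples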
One particular thing to note as well is that, due to the careful design of 
the jobs taking page locality into account, a deliberately low table 
fillfactor, and plenty of RAM, quite a few (but not all) of the steps in 
the geoprocessing workflow manage to run almost completely as 
PostgreSQL 'HOT' updates, even when updating all records in the table, 
and even for very large tables.
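That is something that can be verified from pg_stat_user_tables; roughly like this (just a sketch of what I look at):

SELECT relname, n_tup_upd, n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) AS hot_update_pct,
       n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE schemaname = 'osm'
  AND relname = 'landcover_scrubs_small_scale_2_ply';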
*** sudo perf -p <PID of one stuck postgres backend> -g -d 10 ***
*** sudo perf report --no-children ***
Samples: 40K of event 'task-clock:ppp', Event count (approx.): 10008250000
   Overhead  Command   Shared Object      Symbol
+   12,81%  postgres  postgres           [.] LWLockAttemptLock
+   10,70%  postgres  postgres           [.] heap_hot_search_buffer
+    9,27%  postgres  postgres           [.] tts_heap_getsomeattrs.lto_priv.0
+    6,77%  postgres  postgres           [.] LWLockReleaseInternal
+    5,89%  postgres  postgres           [.] hash_search_with_hash_value
+    5,79%  postgres  postgres           [.] ExecInterpExpr
+    4,14%  postgres  postgres           [.] PinBuffer
+    3,61%  postgres  postgres           [.] heapam_index_fetch_tuple.lto_priv.0
+    3,00%  postgres  postgres           [.] ReleaseAndReadBuffer
+    2,43%  postgres  postgres           [.] LWLockAcquire
+    2,38%  postgres  postgres           [.] HeapTupleSatisfiesVisibility
+    2,36%  postgres  postgres           [.] index_getnext_tid
+    1,96%  postgres  postgres           [.] index_fetch_heap
+    1,90%  postgres  postgres           [.] _bt_readpage
+    1,85%  postgres  postgres           [.] LWLockRelease
+    1,74%  postgres  postgres           [.] heap_page_prune_opt
+    1,49%  postgres  postgres           [.] IndexNext
+    1,47%  postgres  postgres           [.] _bt_checkkeys
+    1,47%  postgres  postgres           [.] ExecScan
+    1,47%  postgres  postgres           [.] XidInMVCCSnapshot
+    1,19%  postgres  postgres           [.] StartReadBuffer
+    1,18%  postgres  postgres           [.] GetPrivateRefCountEntry.lto_priv.0
+    1,12%  postgres  postgres           [.] _bt_next
+    1,10%  postgres  postgres           [.] UnpinBufferNoOwner.lto_priv.0
+    1,09%  postgres  postgres           [.] MemoryContextReset
+    1,08%  postgres  postgres           [.] btgettuple
+    0,91%  postgres  postgres           [.] _bt_check_compare.lto_priv.0
+    0,82%  postgres  postgres           [.] ExecEvalSysVar
+    0,82%  postgres  postgres           [.] ExecStoreBufferHeapTuple
+    0,71%  postgres  postgres           [.] hash_bytes
+    0,65%  postgres  postgres           [.] tts_virtual_clear.lto_priv.0
+    0,63%  postgres  postgres           [.] GlobalVisTestFor
+    0,63%  postgres  postgres           [.] ExecNestLoop
+    0,55%  postgres  postgres           [.] HeapTupleIsSurelyDead
+    0,52%  postgres  postgres           [.] ResourceOwnerForget
      0,48%  postgres  postgres           [.] slot_getsomeattrs_int
      0,42%  postgres  postgres           [.] PredicateLockTID
      0,40%  postgres  postgres           [.] ReadBufferExtended
      0,37%  postgres  postgres           [.] _bt_saveitem
      0,31%  postgres  postgres           [.] ExecIndexScan
      0,30%  postgres  libc.so.6          [.] __memcmp_sse2
      0,28%  postgres  postgres           [.] _bt_setuppostingitems
      0,26%  postgres  postgres           [.] ReleaseBuffer
      0,23%  postgres  postgres           [.] ResourceOwnerEnlarge
      0,23%  postgres  postgres           [.] HeapCheckForSerializableConflictOut
      0,22%  postgres  postgres           [.] IncrBufferRefCount
      0,18%  postgres  postgres           [.] pgstat_count_io_op
> Could you show perf report --no-children? That would show us which individual functions, rather than call-stacks, take the longest.
Andres, I now captured a few more 'perf' sessions.
Run 1 is the original run I already showed you. For comparison, runs 2-4 are from a different backend captured during a new geoprocessing run, but all referring to the same PID. Run 5 is also from that geoprocessing run, but for another backend PID, i.e. another Python thread launched from my code.
There is not much difference between any of these, especially for the ordinary "perf report"; the "perf report --no-children" runs show a little more variation. The "perf report --no-children" runs are grouped together at the bottom of this email.
Despite the minor differences, maybe seeing different captures of perf is still of some use.
Marco
*** sudo perf -p <PID of one stuck postgres backend> -g -d 10 ***
 *** sudo perf report -g ***
 RUN 1:
 Samples: 40K of event 'task-clock:ppp', Event count (approx.): 10008250000
   Children      Self  Command   Shared Object      Symbol
 +  100,00%     0,00%  postgres  postgres           [.] _start
 +  100,00%     0,00%  postgres  libc.so.6          [.] __libc_start_main@@GLIBC_2.34
 +  100,00%     0,00%  postgres  libc.so.6          [.] __libc_start_call_main
 +  100,00%     0,00%  postgres  postgres           [.] main
 +  100,00%     0,00%  postgres  postgres           [.] PostmasterMain
 +  100,00%     0,00%  postgres  postgres           [.] ServerLoop.isra.0
 +  100,00%     0,00%  postgres  postgres           [.] postmaster_child_launch
 +  100,00%     0,00%  postgres  postgres           [.] 0x00005f3570fb9dbf
 +  100,00%     0,00%  postgres  postgres           [.] PostgresMain
 +  100,00%     0,00%  postgres  postgres           [.] exec_simple_query
 +  100,00%     0,63%  postgres  postgres           [.] ExecNestLoop
 +  100,00%     0,00%  postgres  postgres           [.] PortalRun
 +  100,00%     0,00%  postgres  postgres           [.] PortalRunMulti
 +  100,00%     0,00%  postgres  postgres           [.] ProcessQuery
 +  100,00%     0,00%  postgres  postgres           [.] standard_ExecutorRun
 +  100,00%     0,00%  postgres  postgres           [.] ExecModifyTable
 +   94,63%     1,47%  postgres  postgres           [.] ExecScan
 +   78,76%     1,49%  postgres  postgres           [.] IndexNext
 +   66,89%     1,96%  postgres  postgres           [.] index_fetch_heap
 +   64,35%     3,61%  postgres  postgres           [.] heapam_index_fetch_tuple.lto_priv.0
 +   21,92%     0,40%  postgres  postgres           [.] ReadBufferExtended
 +   20,36%     1,19%  postgres  postgres           [.] StartReadBuffer
 +   16,23%     5,79%  postgres  postgres           [.] ExecInterpExpr
 +   15,00%    10,70%  postgres  postgres           [.] heap_hot_search_buffer
 +   12,81%    12,81%  postgres  postgres           [.] LWLockAttemptLock
 +   10,31%     2,36%  postgres  postgres           [.] index_getnext_tid
 +    9,69%     0,48%  postgres  postgres           [.] slot_getsomeattrs_int
 +    9,27%     9,27%  postgres  postgres           [.] tts_heap_getsomeattrs.lto_priv.0
 +    8,45%     1,85%  postgres  postgres           [.] LWLockRelease
 +    7,91%     1,08%  postgres  postgres           [.] btgettuple
 +    6,89%     1,12%  postgres  postgres           [.] _bt_next
 +    6,77%     6,77%  postgres  postgres           [.] LWLockReleaseInternal
 +    5,89%     5,89%  postgres  postgres           [.] hash_search_with_hash_value
 +    5,78%     0,10%  postgres  postgres           [.] _bt_readnextpage
 +    4,48%     4,14%  postgres  postgres           [.] PinBuffer
 +    4,41%     1,90%  postgres  postgres           [.] _bt_readpage
 +    4,09%     3,00%  postgres  postgres           [.] ReleaseAndReadBuffer
 +    2,99%     2,38%  postgres  postgres           [.] HeapTupleSatisfiesVisibility
 +    2,58%     0,82%  postgres  postgres           [.] ExecStoreBufferHeapTuple
 +    2,43%     2,43%  postgres  postgres           [.] LWLockAcquire
 +    2,28%     1,47%  postgres  postgres           [.] _bt_checkkeys
 +    1,74%     1,74%  postgres  postgres           [.] heap_page_prune_opt
 +    1,67%     1,10%  postgres  postgres           [.] UnpinBufferNoOwner.lto_priv.0
 +    1,47%     1,47%  postgres  postgres           [.] XidInMVCCSnapshot
 +    1,18%     1,18%  postgres  postgres           [.] GetPrivateRefCountEntry.lto_priv.0
 +    1,09%     1,09%  postgres  postgres           [.] MemoryContextReset
 RUN 2:
 Samples: 40K of event 'task-clock:ppp', Event count (approx.): 10020750000
   Children      Self  Command   Shared Object     Symbol
 +  100,00%     0,00%  postgres  postgres          [.] _start
 +  100,00%     0,00%  postgres  libc.so.6         [.] __libc_start_main@@GLIBC_2.34
 +  100,00%     0,00%  postgres  libc.so.6         [.] __libc_start_call_main
 +  100,00%     0,00%  postgres  postgres          [.] main
 +  100,00%     0,00%  postgres  postgres          [.] PostmasterMain
 +  100,00%     0,00%  postgres  postgres          [.] ServerLoop.isra.0
 +  100,00%     0,00%  postgres  postgres          [.] postmaster_child_launch
 +  100,00%     0,00%  postgres  postgres          [.] 0x00005f3570fb9dbf
 +  100,00%     0,00%  postgres  postgres          [.] PostgresMain
 +  100,00%     0,00%  postgres  postgres          [.] exec_simple_query
 +  100,00%     0,34%  postgres  postgres          [.] ExecNestLoop
 +  100,00%     0,00%  postgres  postgres          [.] PortalRun
 +  100,00%     0,00%  postgres  postgres          [.] PortalRunMulti
 +  100,00%     0,00%  postgres  postgres          [.] ProcessQuery
 +  100,00%     0,00%  postgres  postgres          [.] standard_ExecutorRun
 +  100,00%     0,00%  postgres  postgres          [.] ExecModifyTable
 +   97,34%     1,13%  postgres  postgres          [.] ExecScan
 +   88,55%     0,73%  postgres  postgres          [.] IndexNext
 +   82,41%     1,05%  postgres  postgres          [.] index_fetch_heap
 +   81,10%     1,78%  postgres  postgres          [.] heapam_index_fetch_tuple.lto_priv.0
 +   25,70%    25,70%  postgres  postgres          [.] LWLockAttemptLock
 +   16,67%     0,86%  postgres  postgres          [.] LWLockRelease
 +   15,91%    15,91%  postgres  postgres          [.] LWLockReleaseInternal
 +   15,27%     0,22%  postgres  postgres          [.] ReadBufferExtended
 +   14,41%     1,73%  postgres  postgres          [.] StartReadBuffer
 +   14,11%    13,01%  postgres  postgres          [.] ReleaseAndReadBuffer
 +    8,60%     3,08%  postgres  postgres          [.] ExecInterpExpr
 +    6,27%     3,90%  postgres  postgres          [.] heap_hot_search_buffer
 +    5,36%     1,11%  postgres  postgres          [.] index_getnext_tid
 +    5,20%     0,20%  postgres  postgres          [.] slot_getsomeattrs_int
 +    5,05%     5,05%  postgres  postgres          [.] tts_heap_getsomeattrs.lto_priv.0
 +    4,74%     4,59%  postgres  postgres          [.] PinBuffer
 +    4,21%     0,54%  postgres  postgres          [.] btgettuple
 +    3,75%     0,72%  postgres  postgres          [.] _bt_next
 +    3,27%     3,27%  postgres  postgres          [.] hash_search_with_hash_value
 +    3,04%     0,03%  postgres  postgres          [.] _bt_readnextpage
 +    3,00%     0,44%  postgres  postgres          [.] ExecStoreBufferHeapTuple
 +    2,43%     2,16%  postgres  postgres          [.] UnpinBufferNoOwner.lto_priv.0
 +    2,24%     1,01%  postgres  postgres          [.] _bt_readpage
 +    1,62%     1,25%  postgres  postgres          [.] HeapTupleSatisfiesVisibility
 +    1,51%     1,51%  postgres  postgres          [.] LWLockAcquire
 +    1,18%     1,18%  postgres  postgres          [.] heap_page_prune_opt
 +    1,07%     0,72%  postgres  postgres          [.] _bt_checkkeys
 +    0,85%     0,85%  postgres  postgres          [.] XidInMVCCSnapshot
 +    0,82%     0,74%  postgres  postgres          [.] ReleaseBuffer
 +    0,56%     0,56%  postgres  postgres          [.] GetPrivateRefCountEntry.lto_priv.0
      0,41%     0,41%  postgres  postgres          [.] _bt_check_compare.lto_priv.0
 RUN 3:
 Samples: 40K of event 'task-clock:ppp', Event count (approx.): 10010750000
   Children      Self  Command   Shared Object     Symbol
 +  100,00%     0,00%  postgres  postgres          [.] _start
 +  100,00%     0,00%  postgres  libc.so.6         [.] __libc_start_main@@GLIBC_2.34
 +  100,00%     0,00%  postgres  libc.so.6         [.] __libc_start_call_main
 +  100,00%     0,00%  postgres  postgres          [.] main
 +  100,00%     0,00%  postgres  postgres          [.] PostmasterMain
 +  100,00%     0,00%  postgres  postgres          [.] ServerLoop.isra.0
 +  100,00%     0,00%  postgres  postgres          [.] postmaster_child_launch
 +  100,00%     0,00%  postgres  postgres          [.] 0x00005f3570fb9dbf
 +  100,00%     0,00%  postgres  postgres          [.] PostgresMain
 +  100,00%     0,00%  postgres  postgres          [.] exec_simple_query
 +  100,00%     0,34%  postgres  postgres          [.] ExecNestLoop
 +  100,00%     0,00%  postgres  postgres          [.] PortalRun
 +  100,00%     0,00%  postgres  postgres          [.] PortalRunMulti
 +  100,00%     0,00%  postgres  postgres          [.] ProcessQuery
 +  100,00%     0,00%  postgres  postgres          [.] standard_ExecutorRun
 +  100,00%     0,00%  postgres  postgres          [.] ExecModifyTable
 +   97,43%     1,10%  postgres  postgres          [.] ExecScan
 +   88,96%     0,79%  postgres  postgres          [.] IndexNext
 +   82,68%     1,03%  postgres  postgres          [.] index_fetch_heap
 +   81,42%     1,89%  postgres  postgres          [.] heapam_index_fetch_tuple.lto_priv.0
 +   25,54%    25,54%  postgres  postgres          [.] LWLockAttemptLock
 +   17,38%     1,01%  postgres  postgres          [.] LWLockRelease
 +   16,44%    16,44%  postgres  postgres          [.] LWLockReleaseInternal
 +   15,39%     0,25%  postgres  postgres          [.] ReadBufferExtended
 +   14,49%     1,69%  postgres  postgres          [.] StartReadBuffer
 +   13,77%    12,72%  postgres  postgres          [.] ReleaseAndReadBuffer
 +    8,28%     2,95%  postgres  postgres          [.] ExecInterpExpr
 +    6,05%     3,75%  postgres  postgres          [.] heap_hot_search_buffer
 +    5,39%     1,09%  postgres  postgres          [.] index_getnext_tid
 +    5,05%     0,21%  postgres  postgres          [.] slot_getsomeattrs_int
 +    4,88%     4,88%  postgres  postgres          [.] tts_heap_getsomeattrs.lto_priv.0
 +    4,87%     4,69%  postgres  postgres          [.] PinBuffer
 +    4,29%     0,61%  postgres  postgres          [.] btgettuple
 +    3,74%     0,74%  postgres  postgres          [.] _bt_next
 +    3,47%     3,47%  postgres  postgres          [.] hash_search_with_hash_value
 +    3,00%     0,05%  postgres  postgres          [.] _bt_readnextpage
 +    2,99%     0,50%  postgres  postgres          [.] ExecStoreBufferHeapTuple
 +    2,27%     2,01%  postgres  postgres          [.] UnpinBufferNoOwner.lto_priv.0
 +    2,19%     0,99%  postgres  postgres          [.] _bt_readpage
 +    1,52%     1,14%  postgres  postgres          [.] HeapTupleSatisfiesVisibility
 +    1,50%     1,50%  postgres  postgres          [.] LWLockAcquire
 +    1,12%     1,12%  postgres  postgres          [.] heap_page_prune_opt
 +    1,06%     0,72%  postgres  postgres          [.] _bt_checkkeys
 +    0,87%     0,87%  postgres  postgres          [.] XidInMVCCSnapshot
 +    0,87%     0,76%  postgres  postgres          [.] ReleaseBuffer
 +    0,59%     0,59%  postgres  postgres          [.] GetPrivateRefCountEntry.lto_priv.0
      0,41%     0,41%  postgres  postgres          [.] _bt_check_compare.lto_priv.0
 RUN 4:
 Samples: 40K of event 'task-clock:ppp', Event count (approx.): 10010500000
   Children      Self  Command   Shared Object     Symbol
 +  100,00%     0,00%  postgres  postgres          [.] _start
 +  100,00%     0,00%  postgres  libc.so.6         [.] __libc_start_main@@GLIBC_2.34
 +  100,00%     0,00%  postgres  libc.so.6         [.] __libc_start_call_main
 +  100,00%     0,00%  postgres  postgres          [.] main
 +  100,00%     0,00%  postgres  postgres          [.] PostmasterMain
 +  100,00%     0,00%  postgres  postgres          [.] ServerLoop.isra.0
 +  100,00%     0,00%  postgres  postgres          [.] postmaster_child_launch
 +  100,00%     0,00%  postgres  postgres          [.] 0x00005f3570fb9dbf
 +  100,00%     0,00%  postgres  postgres          [.] PostgresMain
 +  100,00%     0,00%  postgres  postgres          [.] exec_simple_query
 +  100,00%     0,32%  postgres  postgres          [.] ExecNestLoop
 +  100,00%     0,00%  postgres  postgres          [.] PortalRun
 +  100,00%     0,00%  postgres  postgres          [.] PortalRunMulti
 +  100,00%     0,00%  postgres  postgres          [.] ProcessQuery
 +  100,00%     0,00%  postgres  postgres          [.] standard_ExecutorRun
 +  100,00%     0,00%  postgres  postgres          [.] ExecModifyTable
 +   97,30%     1,16%  postgres  postgres          [.] ExecScan
 +   88,81%     0,73%  postgres  postgres          [.] IndexNext
 +   82,49%     0,99%  postgres  postgres          [.] index_fetch_heap
 +   81,18%     2,01%  postgres  postgres          [.] heapam_index_fetch_tuple.lto_priv.0
 +   25,65%    25,65%  postgres  postgres          [.] LWLockAttemptLock
 +   16,65%     1,01%  postgres  postgres          [.] LWLockRelease
 +   15,79%     0,26%  postgres  postgres          [.] ReadBufferExtended
 +   15,76%    15,76%  postgres  postgres          [.] LWLockReleaseInternal
 +   15,00%     1,65%  postgres  postgres          [.] StartReadBuffer
 +   13,88%    12,75%  postgres  postgres          [.] ReleaseAndReadBuffer
 +    8,41%     3,03%  postgres  postgres          [.] ExecInterpExpr
 +    6,13%     3,82%  postgres  postgres          [.] heap_hot_search_buffer
 +    5,46%     1,04%  postgres  postgres          [.] index_getnext_tid
 +    5,02%     0,23%  postgres  postgres          [.] slot_getsomeattrs_int
 +    4,94%     4,78%  postgres  postgres          [.] PinBuffer
 +    4,83%     4,83%  postgres  postgres          [.] tts_heap_getsomeattrs.lto_priv.0
 +    4,40%     0,58%  postgres  postgres          [.] btgettuple
 +    3,87%     0,79%  postgres  postgres          [.] _bt_next
 +    3,76%     3,76%  postgres  postgres          [.] hash_search_with_hash_value
 +    3,09%     0,05%  postgres  postgres          [.] _bt_readnextpage
 +    2,88%     0,46%  postgres  postgres          [.] ExecStoreBufferHeapTuple
 +    2,44%     2,11%  postgres  postgres          [.] UnpinBufferNoOwner.lto_priv.0
 +    2,26%     1,01%  postgres  postgres          [.] _bt_readpage
 +    1,53%     1,16%  postgres  postgres          [.] HeapTupleSatisfiesVisibility
 +    1,45%     1,45%  postgres  postgres          [.] LWLockAcquire
 +    1,17%     1,17%  postgres  postgres          [.] heap_page_prune_opt
 +    1,09%     0,72%  postgres  postgres          [.] _bt_checkkeys
 +    0,82%     0,82%  postgres  postgres          [.] XidInMVCCSnapshot
 +    0,80%     0,71%  postgres  postgres          [.] ReleaseBuffer
 +    0,63%     0,63%  postgres  postgres          [.] GetPrivateRefCountEntry.lto_priv.0
      0,44%     0,44%  postgres  postgres          [.] _bt_check_compare.lto_priv.0
 RUN 5:
 Samples: 40K of event 'task-clock:ppp', Event count (approx.): 10003250000
   Children      Self  Command   Shared Object      Symbol
 +  100,00%     0,00%  postgres  postgres           [.] _start
 +  100,00%     0,00%  postgres  libc.so.6          [.] __libc_start_main@@GLIBC_2.34
 +  100,00%     0,00%  postgres  libc.so.6          [.] __libc_start_call_main
 +  100,00%     0,00%  postgres  postgres           [.] main
 +  100,00%     0,00%  postgres  postgres           [.] PostmasterMain
 +  100,00%     0,00%  postgres  postgres           [.] ServerLoop.isra.0
 +  100,00%     0,00%  postgres  postgres           [.] postmaster_child_launch
 +  100,00%     0,00%  postgres  postgres           [.] 0x00005f3570fb9dbf
 +  100,00%     0,00%  postgres  postgres           [.] PostgresMain
 +  100,00%     0,00%  postgres  postgres           [.] exec_simple_query
 +  100,00%     0,00%  postgres  postgres           [.] PortalRun
 +  100,00%     0,00%  postgres  postgres           [.] PortalRunMulti
 +  100,00%     0,00%  postgres  postgres           [.] ProcessQuery
 +  100,00%     0,00%  postgres  postgres           [.] standard_ExecutorRun
 +  100,00%     0,00%  postgres  postgres           [.] ExecModifyTable
 +  100,00%     0,35%  postgres  postgres           [.] ExecNestLoop
 +   97,20%     0,90%  postgres  postgres           [.] ExecScan
 +   89,12%     0,85%  postgres  postgres           [.] IndexNext
 +   82,86%     1,00%  postgres  postgres           [.] index_fetch_heap
 +   81,60%     1,92%  postgres  postgres           [.] heapam_index_fetch_tuple.lto_priv.0
 +   26,49%    26,48%  postgres  postgres           [.] LWLockAttemptLock
 +   16,53%     0,81%  postgres  postgres           [.] LWLockRelease
 +   15,81%    15,81%  postgres  postgres           [.] LWLockReleaseInternal
 +   15,57%     0,22%  postgres  postgres           [.] ReadBufferExtended
 +   14,68%     1,88%  postgres  postgres           [.] StartReadBuffer
 +   14,10%    12,88%  postgres  postgres           [.] ReleaseAndReadBuffer
 +    8,27%     2,97%  postgres  postgres           [.] ExecInterpExpr
 +    5,87%     3,67%  postgres  postgres           [.] heap_hot_search_buffer
 +    5,38%     0,92%  postgres  postgres           [.] index_getnext_tid
 +    5,01%     0,21%  postgres  postgres           [.] slot_getsomeattrs_int
 +    4,85%     4,84%  postgres  postgres           [.] tts_heap_getsomeattrs.lto_priv.0
 +    4,81%     4,64%  postgres  postgres           [.] PinBuffer
 +    4,41%     0,53%  postgres  postgres           [.] btgettuple
 +    3,96%     0,92%  postgres  postgres           [.] _bt_next
 +    3,47%     3,47%  postgres  postgres           [.] hash_search_with_hash_value
 +    3,03%     0,03%  postgres  postgres           [.] _bt_readnextpage
 +    2,77%     0,40%  postgres  postgres           [.] ExecStoreBufferHeapTuple
 +    2,41%     2,11%  postgres  postgres           [.] UnpinBufferNoOwner.lto_priv.0
 +    2,28%     0,97%  postgres  postgres           [.] _bt_readpage
 +    1,46%     1,09%  postgres  postgres           [.] HeapTupleSatisfiesVisibility
 +    1,41%     1,41%  postgres  postgres           [.] LWLockAcquire
 +    1,14%     0,71%  postgres  postgres           [.] _bt_checkkeys
 +    1,12%     1,12%  postgres  postgres           [.] heap_page_prune_opt
 +    0,83%     0,83%  postgres  postgres           [.] XidInMVCCSnapshot
 +    0,81%     0,69%  postgres  postgres           [.] ReleaseBuffer
 +    0,57%     0,57%  postgres  postgres           [.] GetPrivateRefCountEntry.lto_priv.0
      0,48%     0,48%  postgres  postgres           [.] _bt_check_compare.lto_priv.0
 *** sudo perf -p <PID of one stuck postgres backend> -g -d 10 ***
 *** sudo perf report --no-children ***
 RUN 1 - NO CHILDREN:
 Samples: 40K of event 'task-clock:ppp', Event count (approx.): 10008250000
   Overhead  Command   Shared Object      Symbol
 +   12,81%  postgres  postgres           [.] LWLockAttemptLock
 +   10,70%  postgres  postgres           [.] heap_hot_search_buffer
 +    9,27%  postgres  postgres           [.] tts_heap_getsomeattrs.lto_priv.0
 +    6,77%  postgres  postgres           [.] LWLockReleaseInternal
 +    5,89%  postgres  postgres           [.] hash_search_with_hash_value
 +    5,79%  postgres  postgres           [.] ExecInterpExpr
 +    4,14%  postgres  postgres           [.] PinBuffer
 +    3,61%  postgres  postgres           [.] heapam_index_fetch_tuple.lto_priv.0
 +    3,00%  postgres  postgres           [.] ReleaseAndReadBuffer
 +    2,43%  postgres  postgres           [.] LWLockAcquire
 +    2,38%  postgres  postgres           [.] HeapTupleSatisfiesVisibility
 +    2,36%  postgres  postgres           [.] index_getnext_tid
 +    1,96%  postgres  postgres           [.] index_fetch_heap
 +    1,90%  postgres  postgres           [.] _bt_readpage
 +    1,85%  postgres  postgres           [.] LWLockRelease
 +    1,74%  postgres  postgres           [.] heap_page_prune_opt
 +    1,49%  postgres  postgres           [.] IndexNext
 +    1,47%  postgres  postgres           [.] _bt_checkkeys
 +    1,47%  postgres  postgres           [.] ExecScan
 +    1,47%  postgres  postgres           [.] XidInMVCCSnapshot
 +    1,19%  postgres  postgres           [.] StartReadBuffer
 +    1,18%  postgres  postgres           [.] GetPrivateRefCountEntry.lto_priv.0
 +    1,12%  postgres  postgres           [.] _bt_next
 +    1,10%  postgres  postgres           [.] UnpinBufferNoOwner.lto_priv.0
 +    1,09%  postgres  postgres           [.] MemoryContextReset
 +    1,08%  postgres  postgres           [.] btgettuple
 +    0,91%  postgres  postgres           [.] _bt_check_compare.lto_priv.0
 +    0,82%  postgres  postgres           [.] ExecEvalSysVar
 +    0,82%  postgres  postgres           [.] ExecStoreBufferHeapTuple
 +    0,71%  postgres  postgres           [.] hash_bytes
 +    0,65%  postgres  postgres           [.] tts_virtual_clear.lto_priv.0
 +    0,63%  postgres  postgres           [.] GlobalVisTestFor
 +    0,63%  postgres  postgres           [.] ExecNestLoop
 +    0,55%  postgres  postgres           [.] HeapTupleIsSurelyDead
 +    0,52%  postgres  postgres           [.] ResourceOwnerForget
      0,48%  postgres  postgres           [.] slot_getsomeattrs_int
      0,42%  postgres  postgres           [.] PredicateLockTID
      0,40%  postgres  postgres           [.] ReadBufferExtended
      0,37%  postgres  postgres           [.] _bt_saveitem
      0,31%  postgres  postgres           [.] ExecIndexScan
      0,30%  postgres  libc.so.6          [.] __memcmp_sse2
      0,28%  postgres  postgres           [.] _bt_setuppostingitems
      0,26%  postgres  postgres           [.] ReleaseBuffer
      0,23%  postgres  postgres           [.] ResourceOwnerEnlarge
      0,23%  postgres  postgres           [.] HeapCheckForSerializableConflictOut
      0,22%  postgres  postgres           [.] IncrBufferRefCount
      0,18%  postgres  postgres           [.] pgstat_count_io_op
 RUN 2 - NO CHILDREN:
 Samples: 40K of event 'task-clock:ppp', Event count (approx.): 10020750000
   Overhead  Command   Shared Object  Symbol
 +   25,70%  postgres  postgres       [.] LWLockAttemptLock
 +   15,91%  postgres  postgres       [.] LWLockReleaseInternal
 +   13,01%  postgres  postgres       [.] ReleaseAndReadBuffer
 +    5,05%  postgres  postgres       [.] tts_heap_getsomeattrs.lto_priv.0
 +    4,59%  postgres  postgres       [.] PinBuffer
 +    3,90%  postgres  postgres       [.] heap_hot_search_buffer
 +    3,27%  postgres  postgres       [.] hash_search_with_hash_value
 +    3,08%  postgres  postgres       [.] ExecInterpExpr
 +    2,16%  postgres  postgres       [.] UnpinBufferNoOwner.lto_priv.0
 +    1,78%  postgres  postgres       [.] heapam_index_fetch_tuple.lto_priv.0
 +    1,73%  postgres  postgres       [.] StartReadBuffer
 +    1,51%  postgres  postgres       [.] LWLockAcquire
 +    1,25%  postgres  postgres       [.] HeapTupleSatisfiesVisibility
 +    1,18%  postgres  postgres       [.] heap_page_prune_opt
 +    1,13%  postgres  postgres       [.] ExecScan
 +    1,11%  postgres  postgres       [.] index_getnext_tid
 +    1,05%  postgres  postgres       [.] index_fetch_heap
 +    1,01%  postgres  postgres       [.] _bt_readpage
 +    0,86%  postgres  postgres       [.] LWLockRelease
 +    0,85%  postgres  postgres       [.] XidInMVCCSnapshot
 +    0,74%  postgres  postgres       [.] ReleaseBuffer
 +    0,73%  postgres  postgres       [.] IndexNext
 +    0,72%  postgres  postgres       [.] _bt_checkkeys
 +    0,72%  postgres  postgres       [.] _bt_next
 +    0,56%  postgres  postgres       [.] GetPrivateRefCountEntry.lto_priv.0
 +    0,54%  postgres  postgres       [.] btgettuple
      0,44%  postgres  postgres       [.] ExecStoreBufferHeapTuple
      0,41%  postgres  postgres       [.] _bt_check_compare.lto_priv.0
      0,41%  postgres  postgres       [.] hash_bytes
      0,39%  postgres  postgres       [.] MemoryContextReset
      0,34%  postgres  postgres       [.] ExecEvalSysVar
      0,34%  postgres  postgres       [.] ExecNestLoop
      0,32%  postgres  postgres       [.] tts_virtual_clear.lto_priv.0
      0,32%  postgres  postgres       [.] ResourceOwnerForget
      0,30%  postgres  postgres       [.] GlobalVisTestFor
      0,30%  postgres  postgres       [.] HeapTupleIsSurelyDead
      0,26%  postgres  postgres       [.] PredicateLockTID
      0,22%  postgres  postgres       [.] ReadBufferExtended
      0,20%  postgres  postgres       [.] _bt_setuppostingitems
      0,20%  postgres  postgres       [.] ExecIndexScan
      0,20%  postgres  postgres       [.] slot_getsomeattrs_int
      0,17%  postgres  libc.so.6      [.] __memcmp_sse2
      0,16%  postgres  postgres       [.] _bt_saveitem
      0,13%  postgres  postgres       [.] ResourceOwnerEnlarge
      0,12%  postgres  postgres       [.] HeapCheckForSerializableConflictOut
      0,10%  postgres  postgres       [.] pgstat_count_io_op
      0,08%  postgres  postgres       [.] IncrBufferRefCount
 RUN 3 - NO CHILDREN:
 Samples: 40K of event 'task-clock:ppp', Event count (approx.): 10010750000
   Overhead  Command   Shared Object  Symbol
 +   25,54%  postgres  postgres       [.] LWLockAttemptLock
 +   16,44%  postgres  postgres       [.] LWLockReleaseInternal
 +   12,72%  postgres  postgres       [.] ReleaseAndReadBuffer
 +    4,88%  postgres  postgres       [.] tts_heap_getsomeattrs.lto_priv.0
 +    4,69%  postgres  postgres       [.] PinBuffer
 +    3,75%  postgres  postgres       [.] heap_hot_search_buffer
 +    3,47%  postgres  postgres       [.] hash_search_with_hash_value
 +    2,95%  postgres  postgres       [.] ExecInterpExpr
 +    2,01%  postgres  postgres       [.] UnpinBufferNoOwner.lto_priv.0
 +    1,89%  postgres  postgres       [.] heapam_index_fetch_tuple.lto_priv.0
 +    1,69%  postgres  postgres       [.] StartReadBuffer
 +    1,50%  postgres  postgres       [.] LWLockAcquire
 +    1,14%  postgres  postgres       [.] HeapTupleSatisfiesVisibility
 +    1,12%  postgres  postgres       [.] heap_page_prune_opt
 +    1,10%  postgres  postgres       [.] ExecScan
 +    1,09%  postgres  postgres       [.] index_getnext_tid
 +    1,03%  postgres  postgres       [.] index_fetch_heap
 +    1,01%  postgres  postgres       [.] LWLockRelease
 +    0,99%  postgres  postgres       [.] _bt_readpage
 +    0,87%  postgres  postgres       [.] XidInMVCCSnapshot
 +    0,79%  postgres  postgres       [.] IndexNext
 +    0,76%  postgres  postgres       [.] ReleaseBuffer
 +    0,74%  postgres  postgres       [.] _bt_next
 +    0,72%  postgres  postgres       [.] _bt_checkkeys
 +    0,61%  postgres  postgres       [.] btgettuple
 +    0,59%  postgres  postgres       [.] GetPrivateRefCountEntry.lto_priv.0
      0,50%  postgres  postgres       [.] ExecStoreBufferHeapTuple
      0,41%  postgres  postgres       [.] _bt_check_compare.lto_priv.0
      0,41%  postgres  postgres       [.] hash_bytes
      0,38%  postgres  postgres       [.] MemoryContextReset
      0,34%  postgres  postgres       [.] ExecEvalSysVar
      0,34%  postgres  postgres       [.] HeapTupleIsSurelyDead
      0,34%  postgres  postgres       [.] ExecNestLoop
      0,32%  postgres  postgres       [.] ResourceOwnerForget
      0,31%  postgres  postgres       [.] GlobalVisTestFor
      0,31%  postgres  postgres       [.] tts_virtual_clear.lto_priv.0
      0,25%  postgres  postgres       [.] ReadBufferExtended
      0,22%  postgres  postgres       [.] PredicateLockTID
      0,21%  postgres  postgres       [.] slot_getsomeattrs_int
      0,17%  postgres  postgres       [.] ExecIndexScan
      0,17%  postgres  postgres       [.] _bt_saveitem
      0,16%  postgres  postgres       [.] _bt_setuppostingitems
      0,14%  postgres  postgres       [.] HeapCheckForSerializableConflictOut
      0,13%  postgres  libc.so.6      [.] __memcmp_sse2
      0,12%  postgres  postgres       [.] ResourceOwnerEnlarge
      0,11%  postgres  postgres       [.] IncrBufferRefCount
      0,10%  postgres  postgres       [.] pgstat_count_io_op
 RUN 4 - NO CHILDREN:
 Samples: 40K of event 'task-clock:ppp', Event count (approx.): 10010500000
   Overhead  Command   Shared Object  Symbol
 +   25,65%  postgres  postgres       [.] LWLockAttemptLock
 +   15,76%  postgres  postgres       [.] LWLockReleaseInternal
 +   12,75%  postgres  postgres       [.] ReleaseAndReadBuffer
 +    4,83%  postgres  postgres       [.] tts_heap_getsomeattrs.lto_priv.0
 +    4,78%  postgres  postgres       [.] PinBuffer
 +    3,82%  postgres  postgres       [.] heap_hot_search_buffer
 +    3,76%  postgres  postgres       [.] hash_search_with_hash_value
 +    3,03%  postgres  postgres       [.] ExecInterpExpr
 +    2,11%  postgres  postgres       [.] UnpinBufferNoOwner.lto_priv.0
 +    2,01%  postgres  postgres       [.] heapam_index_fetch_tuple.lto_priv.0
 +    1,65%  postgres  postgres       [.] StartReadBuffer
 +    1,45%  postgres  postgres       [.] LWLockAcquire
 +    1,17%  postgres  postgres       [.] heap_page_prune_opt
 +    1,16%  postgres  postgres       [.] ExecScan
 +    1,16%  postgres  postgres       [.] HeapTupleSatisfiesVisibility
 +    1,04%  postgres  postgres       [.] index_getnext_tid
 +    1,01%  postgres  postgres       [.] LWLockRelease
 +    1,01%  postgres  postgres       [.] _bt_readpage
 +    0,99%  postgres  postgres       [.] index_fetch_heap
 +    0,82%  postgres  postgres       [.] XidInMVCCSnapshot
 +    0,79%  postgres  postgres       [.] _bt_next
 +    0,73%  postgres  postgres       [.] IndexNext
 +    0,72%  postgres  postgres       [.] _bt_checkkeys
 +    0,71%  postgres  postgres       [.] ReleaseBuffer
 +    0,63%  postgres  postgres       [.] GetPrivateRefCountEntry.lto_priv.0
 +    0,58%  postgres  postgres       [.] btgettuple
      0,46%  postgres  postgres       [.] ExecStoreBufferHeapTuple
      0,44%  postgres  postgres       [.] _bt_check_compare.lto_priv.0
      0,37%  postgres  postgres       [.] ExecEvalSysVar
      0,36%  postgres  postgres       [.] MemoryContextReset
      0,34%  postgres  postgres       [.] tts_virtual_clear.lto_priv.0
      0,32%  postgres  postgres       [.] ExecNestLoop
      0,31%  postgres  postgres       [.] GlobalVisTestFor
      0,30%  postgres  postgres       [.] HeapTupleIsSurelyDead
      0,30%  postgres  postgres       [.] hash_bytes
      0,27%  postgres  postgres       [.] PredicateLockTID
      0,26%  postgres  postgres       [.] ResourceOwnerForget
      0,26%  postgres  postgres       [.] ReadBufferExtended
      0,23%  postgres  postgres       [.] slot_getsomeattrs_int
      0,17%  postgres  postgres       [.] _bt_setuppostingitems
      0,17%  postgres  libc.so.6      [.] __memcmp_sse2
      0,15%  postgres  postgres       [.] ExecIndexScan
      0,14%  postgres  postgres       [.] _bt_saveitem
      0,12%  postgres  postgres       [.] int8eq
      0,12%  postgres  postgres       [.] ReservePrivateRefCountEntry.lto_priv.0
      0,12%  postgres  postgres       [.] ResourceOwnerEnlarge
      0,11%  postgres  postgres       [.] HeapCheckForSerializableConflictOut
 RUN 5 - NO CHILDREN:
 Samples: 40K of event 'task-clock:ppp', Event count (approx.): 10003250000
   Overhead  Command   Shared Object      Symbol
 +   26,48%  postgres  postgres           [.] LWLockAttemptLock
 +   15,81%  postgres  postgres           [.] LWLockReleaseInternal
 +   12,88%  postgres  postgres           [.] ReleaseAndReadBuffer
 +    4,84%  postgres  postgres           [.] tts_heap_getsomeattrs.lto_priv.0
 +    4,64%  postgres  postgres           [.] PinBuffer
 +    3,67%  postgres  postgres           [.] heap_hot_search_buffer
 +    3,47%  postgres  postgres           [.] hash_search_with_hash_value
 +    2,97%  postgres  postgres           [.] ExecInterpExpr
 +    2,11%  postgres  postgres           [.] UnpinBufferNoOwner.lto_priv.0
 +    1,92%  postgres  postgres           [.] heapam_index_fetch_tuple.lto_priv.0
 +    1,88%  postgres  postgres           [.] StartReadBuffer
 +    1,41%  postgres  postgres           [.] LWLockAcquire
 +    1,12%  postgres  postgres           [.] heap_page_prune_opt
 +    1,09%  postgres  postgres           [.] HeapTupleSatisfiesVisibility
 +    1,00%  postgres  postgres           [.] index_fetch_heap
 +    0,97%  postgres  postgres           [.] _bt_readpage
 +    0,92%  postgres  postgres           [.] _bt_next
 +    0,92%  postgres  postgres           [.] index_getnext_tid
 +    0,90%  postgres  postgres           [.] ExecScan
 +    0,85%  postgres  postgres           [.] IndexNext
 +    0,83%  postgres  postgres           [.] XidInMVCCSnapshot
 +    0,81%  postgres  postgres           [.] LWLockRelease
 +    0,71%  postgres  postgres           [.] _bt_checkkeys
 +    0,69%  postgres  postgres           [.] ReleaseBuffer
 +    0,57%  postgres  postgres           [.] GetPrivateRefCountEntry.lto_priv.0
 +    0,53%  postgres  postgres           [.] btgettuple
      0,48%  postgres  postgres           [.] _bt_check_compare.lto_priv.0
      0,46%  postgres  postgres           [.] MemoryContextReset
      0,42%  postgres  postgres           [.] hash_bytes
      0,40%  postgres  postgres           [.] ExecStoreBufferHeapTuple
      0,36%  postgres  postgres           [.] ResourceOwnerForget
      0,35%  postgres  postgres           [.] ExecNestLoop
      0,33%  postgres  postgres           [.] ExecEvalSysVar
      0,32%  postgres  postgres           [.] HeapTupleIsSurelyDead
      0,27%  postgres  postgres           [.] GlobalVisTestFor
      0,26%  postgres  postgres           [.] PredicateLockTID
      0,22%  postgres  postgres           [.] ReadBufferExtended
      0,22%  postgres  postgres           [.] tts_virtual_clear.lto_priv.0
      0,21%  postgres  postgres           [.] slot_getsomeattrs_int
      0,19%  postgres  postgres           [.] _bt_saveitem
      0,19%  postgres  postgres           [.] _bt_setuppostingitems
      0,18%  postgres  postgres           [.] ResourceOwnerEnlarge
      0,17%  postgres  postgres           [.] ExecIndexScan
      0,14%  postgres  libc.so.6          [.] __memcmp_sse2
      0,12%  postgres  postgres           [.] pgstat_count_io_op
      0,10%  postgres  postgres           [.] int8eq
      0,10%  postgres  postgres           [.] HeapCheckForSerializableConflictOut
Hi Andres, Looking at the "perf report --no-children" list, I noticed the high "LWLock" related CPU activity. Although I have already mentioned part of it, here are a few more observations of pgAdmin and pg_locks during the stuck situation:
- pgAdmin shows 15 active sessions without wait events.
- pg_locks shows 228 locks taken, of which 218 are "fastpath".
- Only 10 locks are not fastpath, and they refer to the "transactionid". I probably misunderstand this, but I would have expected to see 15 transactionids, one for each database session? But maybe I have this wrong... There are no other transactionids. There *are* also 15 "virtualxid" locks visible, all "fastpath". All transactionid and virtualxid locks are of "ExclusiveLock" type. The rest of the 228 locks are of "AccessShareLock" and "RowExclusiveLock" type.
- My 'max_locks_per_transaction', as mentioned in this commit (https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=c4d5cb71d) related to PG18 and fastpath changes, is set to the default of 64.
- If I look in the pg_stat_io view and regularly refresh it, I hardly see any changes in the table, except a few records (at least one related to autovacuum) now and then. This is also more or less confirmed by the disk RAIDs for tables and indexes, which show zero activity in Windows Task Manager (0 KB/s writes with 0 ms "average response time"), which doesn't happen in normal operation of the multi-threaded code, where I do see almost continuous activity on the RAIDs.
- It is likely the entire table being processed is cached in RAM, as there is plenty of it, far more than the size of the small Italy extract table being processed.
Marco
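PS: for completeness, the observations above come from queries roughly like the following (just a sketch of what I look at, not the exact views pgAdmin uses):

-- active backends and their wait events (if any)
SELECT pid, state, wait_event_type, wait_event, left(query, 60) AS query
FROM pg_stat_activity
WHERE backend_type = 'client backend' AND state = 'active';

-- lock counts, split by fastpath or not
SELECT locktype, mode, fastpath, granted, count(*)
FROM pg_locks
GROUP BY locktype, mode, fastpath, granted
ORDER BY count(*) DESC;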
Hi, On 2025-10-08 09:49:04 +0200, Marco Boeringa wrote: > Yes, you could be right this is not related to AIO at all, but another issue > introduced in PG18. Right. > For sure, I did not see this issue in <= PG17, so some change in PG18 is > causing it. Additionally, there is a small chance it might be related to > PostGIS, as that was upgraded as well (3.5.2 --> 3.6.0) during the PG > upgrade, as PG18 requires PostGIS 3.6.0 minimum. And the query does use > PostGIS functions, but none that AFAIK rely on e.g. a spatial > index. Functions like ST_Area just process an individual geometry, not the > spatial relationship between multiple geometries. Can you test this with postgis 3.6 on 17? > As I wrote before, this is a multi-threaded Python application (actually > developed in a GIS), that uses Python's 'concurrent.futures' threading > framework to create jobs of records to process for each thread, > significantly speeding up the processing. The queries are in fact > dynamically build by the code, and part of a much larger geoprocessing > workflow, so it is hard to run them separately and provide a query plan > (although in this case I could by rewriting part of the query below). > > However, I want to stress that any query plan is unlikely to yield > anything. In normal circumstances and in PG17 and below, this code runs > fine! And it is only 1 in maybe 4 runs in PG18 that goes berserk and makes a > processing step that should takes < 10 seconds, all of a sudden take > 2 > hours. The fact that it runs without a problem in 17 means it's actually rather meaningful to look at the query plan. It could have changed. Separately, it might help us to narrow down what changes to look at that could potentially be causing problems. > > So somehow >60% of the CPU time is spent fetching tuples corresponding to > index entries. That seems ... a lot. Is it possible that you have a lot of > dead rows in the involved tables? > > Yes, that is perfectly possible. However, the particular table is only just > over 100k records. > It is true that my code is designed to process literally *every* record in a > table. However, I specifically set adjusted table storage parameters with > much more aggressive vacuum settings (essentially forcing always vacuum > after something like 10k dead tuples irrespective of the size of the > table). This has worked really well, and I have successfully UPDATEd all of > Facebook Daylight size > 1B records tables with the same code, without ever > running into this particular issue, nor transaction ID wraparound issues. Making vacuum more aggressive won't really help much if you have longrunning queries/sessions, since vacuum can't clean up row versions that are still visibile to some of the transactions. > *** sudo perf -p <PID of one stuck postgres backend> -g -d 10 *** > *** sudo perf report --no-children *** > > Samples: 40K of event 'task-clock:ppp', Event count (approx.): 10008250000 > Overhead Command Shared Object Symbol > + 12,81% postgres postgres [.] LWLockAttemptLock > + 10,70% postgres postgres [.] heap_hot_search_buffer > + 9,27% postgres postgres [.] > tts_heap_getsomeattrs.lto_priv.0 > + 6,77% postgres postgres [.] LWLockReleaseInternal > + 5,89% postgres postgres [.] hash_search_with_hash_value > + 5,79% postgres postgres [.] ExecInterpExpr > + 4,14% postgres postgres [.] PinBuffer Could you "unfold" the callstacks for the top entries? 
And/or attach a perf report --no-children > somefile (when redirecting to a file perf will include much more detail than when using it interactively) Greetings, Andres Freund
Hi, On 2025-10-08 14:13:28 +0200, Marco Boeringa wrote: > Looking at the "perf report --no-children" list, I noticed the high "LWLock" > related CPU activity. Yes, that is remarkably high. It likely indicates that there is significant contention on some of the pages. > Although I have already mentioned part of it, here are a few more > observations of pgAdmin and pg_locks during the stuck situation: > > - pgAdmin shows 15 active session without wait events > > - pg_locks shows 228 locks taken, of which 218 are "fastpath" > > - Only 10 locks are not fastpath, and they refer to the "transactionid". I > probably misunderstand this, but I would have expected to see 15 > transactionids, one for each database session? But maybe I have this > wrong... There are no other transactionids. There *are* also 15 "virtualxid" > locks visible, all "fastpath". All transactionid and virtualxid locks are > "ExclusiveLock" type. The rest of the 228 locks are of "AccessShareLock" and > "RowExclusiveLock" type. lwlocks are not visible in pg_locks, they are lower-level / more granular. > - If I look in the pg_stat_io view and regularly refresh it, I hardly see > any changes in the table, except a few records (at least one related to > autovacuum) now and then. This is also more or less confirmed by the disk > RAIDs for tables and indexes, that show zero activity in Windows Task > Manager (0 KB/s writes with 0 ms "average response time" (which doesn't > happen in normal operation of the multi-threaded code where I do see almost > continuous activity on the RAIDs). Note that pg_stat_io is only updated at the end of a transaction, so just looking at it continuously while there are longrunning statements isn't necessarily going to tell you that much. Greetings, Andres Freund
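Because lwlock contention does not surface in pg_locks, the live wait state of the backends is better sampled from pg_stat_activity; a minimal sketch using only standard columns (a backend actually waiting on a lightweight lock would show wait_event_type = 'LWLock'):

SELECT pid, state, wait_event_type, wait_event, now() - query_start AS running_for
FROM pg_stat_activity
WHERE backend_type = 'client backend' AND state = 'active';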
Hi Andres, Thanks for all the help and suggestions so far! > Can you test this with postgis 3.6 on 17? If I find time... but looking through the release notes of PostGIS 3.6.0, I doubt the issue is in PostGIS, as there do not appear to be any changes to the specific PostGIS functions I am using. That doesn't exclude it though. > The fact that it runs without a problem in 17 means it's actually rather meaningful to look at the query plan. It could have changed. Separately, it might help us to narrow down what changes to look at that could potentially be causing problems. I would fully understand if this was an "ordinary" issue case with a simple self-contained query and with things going wrong each time in the same way. However, as said, besides the major issue of running the query separately from my geoprocessing workflow which involves many more steps - which would mean that any test outside of it would *not* be very much representative of what is going on inside my tool and geoprocessing workflow - there is the fact that things going wrong is a random anomaly. I cannot stress this enough: about 3-4 in 5 runs are OK, then a random follow up run *with exactly the same input data* turns out bad with the stall. Even if there was an easy way to run the query, I think the chance is highly likely the postgres query planner would come up with a decent plan, as in normal circumstances, there is no issue. > Making vacuum more aggressive won't really help much if you have longrunning queries/sessions, since vacuum can't clean up row versions that are still visibile to some of the transactions. My code batches the updates in sets of 2000 records at a time and then COMMITs, so the transactions themselves are limited in time and size, which should allow vacuum to do its job. > Could you "unfold" the callstacks for the top entries? And/or attach a perf report --no-children > somefile (when redirecting to a file perf will include much more detail than when using it interactively) See below: # Total Lost Samples: 0 # # Samples: 40K of event 'task-clock:ppp' # Event count (approx.): 10003250000 # # Overhead Command Shared Object Symbol # ........ ........ ................. ........................................... # 26.48% postgres postgres [.] LWLockAttemptLock | ---LWLockAttemptLock | |--23.15%--heapam_index_fetch_tuple.lto_priv.0 | index_fetch_heap | IndexNext | ExecScan | ExecNestLoop | ExecNestLoop | ExecModifyTable | standard_ExecutorRun | ProcessQuery | PortalRunMulti | PortalRun | exec_simple_query | PostgresMain | 0x5f3570fb9dbf | postmaster_child_launch | ServerLoop.isra.0 | PostmasterMain | main | __libc_start_call_main | __libc_start_main@@GLIBC_2.34 | _start | --3.30%--StartReadBuffer ReadBufferExtended | --3.25%--heapam_index_fetch_tuple.lto_priv.0 index_fetch_heap IndexNext ExecScan ExecNestLoop ExecNestLoop ExecModifyTable standard_ExecutorRun ProcessQuery PortalRunMulti PortalRun exec_simple_query PostgresMain 0x5f3570fb9dbf postmaster_child_launch ServerLoop.isra.0 PostmasterMain main __libc_start_call_main __libc_start_main@@GLIBC_2.34 _start 15.81% postgres postgres [.] 
LWLockReleaseInternal | ---LWLockReleaseInternal | --15.71%--LWLockRelease | |--15.03%--heapam_index_fetch_tuple.lto_priv.0 | index_fetch_heap | IndexNext | ExecScan | ExecNestLoop | ExecNestLoop | ExecModifyTable | standard_ExecutorRun | ProcessQuery | PortalRunMulti | PortalRun | exec_simple_query | PostgresMain | 0x5f3570fb9dbf | postmaster_child_launch | ServerLoop.isra.0 | PostmasterMain | main | __libc_start_call_main | __libc_start_main@@GLIBC_2.34 | _start | --0.66%--StartReadBuffer ReadBufferExtended | --0.66%--heapam_index_fetch_tuple.lto_priv.0 index_fetch_heap IndexNext ExecScan ExecNestLoop ExecNestLoop ExecModifyTable standard_ExecutorRun ProcessQuery PortalRunMulti PortalRun exec_simple_query PostgresMain 0x5f3570fb9dbf postmaster_child_launch ServerLoop.isra.0 PostmasterMain main __libc_start_call_main __libc_start_main@@GLIBC_2.34 _start 12.88% postgres postgres [.] ReleaseAndReadBuffer | ---ReleaseAndReadBuffer | --12.78%--heapam_index_fetch_tuple.lto_priv.0 index_fetch_heap IndexNext ExecScan ExecNestLoop ExecNestLoop ExecModifyTable standard_ExecutorRun ProcessQuery PortalRunMulti PortalRun exec_simple_query PostgresMain 0x5f3570fb9dbf postmaster_child_launch ServerLoop.isra.0 PostmasterMain main __libc_start_call_main __libc_start_main@@GLIBC_2.34 _start 4.84% postgres postgres [.] tts_heap_getsomeattrs.lto_priv.0 | ---tts_heap_getsomeattrs.lto_priv.0 | --4.80%--slot_getsomeattrs_int ExecInterpExpr ExecScan ExecNestLoop ExecNestLoop ExecModifyTable standard_ExecutorRun ProcessQuery PortalRunMulti PortalRun exec_simple_query PostgresMain 0x5f3570fb9dbf postmaster_child_launch ServerLoop.isra.0 PostmasterMain main __libc_start_call_main __libc_start_main@@GLIBC_2.34 _start 4.64% postgres postgres [.] PinBuffer | ---PinBuffer | --4.64%--StartReadBuffer ReadBufferExtended | --4.57%--heapam_index_fetch_tuple.lto_priv.0 index_fetch_heap IndexNext ExecScan ExecNestLoop ExecNestLoop ExecModifyTable standard_ExecutorRun ProcessQuery PortalRunMulti PortalRun exec_simple_query PostgresMain 0x5f3570fb9dbf postmaster_child_launch ServerLoop.isra.0 PostmasterMain main __libc_start_call_main __libc_start_main@@GLIBC_2.34 _start 3.67% postgres postgres [.] heap_hot_search_buffer | ---heap_hot_search_buffer | --3.61%--heapam_index_fetch_tuple.lto_priv.0 index_fetch_heap IndexNext ExecScan ExecNestLoop ExecNestLoop ExecModifyTable standard_ExecutorRun ProcessQuery PortalRunMulti PortalRun exec_simple_query PostgresMain 0x5f3570fb9dbf postmaster_child_launch ServerLoop.isra.0 PostmasterMain main __libc_start_call_main __libc_start_main@@GLIBC_2.34 _start 3.47% postgres postgres [.] hash_search_with_hash_value | ---hash_search_with_hash_value | --3.45%--StartReadBuffer ReadBufferExtended | --3.35%--heapam_index_fetch_tuple.lto_priv.0 index_fetch_heap IndexNext ExecScan ExecNestLoop | --3.34%--ExecNestLoop ExecModifyTable standard_ExecutorRun ProcessQuery PortalRunMulti PortalRun exec_simple_query PostgresMain 0x5f3570fb9dbf postmaster_child_launch ServerLoop.isra.0 PostmasterMain main __libc_start_call_main __libc_start_main@@GLIBC_2.34 _start 2.97% postgres postgres [.] 
ExecInterpExpr | ---ExecInterpExpr | |--1.64%--ExecNestLoop | | | --1.60%--ExecNestLoop | ExecModifyTable | standard_ExecutorRun | ProcessQuery | PortalRunMulti | PortalRun | exec_simple_query | PostgresMain | 0x5f3570fb9dbf | postmaster_child_launch | ServerLoop.isra.0 | PostmasterMain | main | __libc_start_call_main | __libc_start_main@@GLIBC_2.34 | _start | --1.33%--ExecScan ExecNestLoop ExecNestLoop ExecModifyTable standard_ExecutorRun ProcessQuery PortalRunMulti PortalRun exec_simple_query PostgresMain 0x5f3570fb9dbf postmaster_child_launch ServerLoop.isra.0 PostmasterMain main __libc_start_call_main __libc_start_main@@GLIBC_2.34 _start 2.11% postgres postgres [.] UnpinBufferNoOwner.lto_priv.0 | ---UnpinBufferNoOwner.lto_priv.0 | |--1.24%--ExecStoreBufferHeapTuple | heapam_index_fetch_tuple.lto_priv.0 | index_fetch_heap | IndexNext | ExecScan | ExecNestLoop | ExecNestLoop | ExecModifyTable | standard_ExecutorRun | ProcessQuery | PortalRunMulti | PortalRun | exec_simple_query | PostgresMain | 0x5f3570fb9dbf | postmaster_child_launch | ServerLoop.isra.0 | PostmasterMain | main | __libc_start_call_main | __libc_start_main@@GLIBC_2.34 | _start | --0.81%--ReleaseAndReadBuffer heapam_index_fetch_tuple.lto_priv.0 index_fetch_heap IndexNext ExecScan ExecNestLoop ExecNestLoop ExecModifyTable standard_ExecutorRun ProcessQuery PortalRunMulti PortalRun exec_simple_query PostgresMain 0x5f3570fb9dbf postmaster_child_launch ServerLoop.isra.0 PostmasterMain main __libc_start_call_main __libc_start_main@@GLIBC_2.34 _start 1.92% postgres postgres [.] heapam_index_fetch_tuple.lto_priv.0 | ---heapam_index_fetch_tuple.lto_priv.0 | --1.67%--index_fetch_heap IndexNext ExecScan ExecNestLoop ExecNestLoop ExecModifyTable standard_ExecutorRun ProcessQuery PortalRunMulti PortalRun exec_simple_query PostgresMain 0x5f3570fb9dbf postmaster_child_launch ServerLoop.isra.0 PostmasterMain main __libc_start_call_main __libc_start_main@@GLIBC_2.34 _start 1.88% postgres postgres [.] StartReadBuffer | ---StartReadBuffer | --1.86%--ReadBufferExtended | --1.83%--heapam_index_fetch_tuple.lto_priv.0 index_fetch_heap IndexNext ExecScan ExecNestLoop ExecNestLoop ExecModifyTable standard_ExecutorRun ProcessQuery PortalRunMulti PortalRun exec_simple_query PostgresMain 0x5f3570fb9dbf postmaster_child_launch ServerLoop.isra.0 PostmasterMain main __libc_start_call_main __libc_start_main@@GLIBC_2.34 _start 1.41% postgres postgres [.] LWLockAcquire | ---LWLockAcquire | --1.15%--heapam_index_fetch_tuple.lto_priv.0 index_fetch_heap IndexNext ExecScan ExecNestLoop ExecNestLoop ExecModifyTable standard_ExecutorRun ProcessQuery PortalRunMulti PortalRun exec_simple_query PostgresMain 0x5f3570fb9dbf postmaster_child_launch ServerLoop.isra.0 PostmasterMain main __libc_start_call_main __libc_start_main@@GLIBC_2.34 _start 1.12% postgres postgres [.] heap_page_prune_opt | ---heap_page_prune_opt | --1.12%--heapam_index_fetch_tuple.lto_priv.0 index_fetch_heap IndexNext ExecScan ExecNestLoop ExecNestLoop ExecModifyTable standard_ExecutorRun ProcessQuery PortalRunMulti PortalRun exec_simple_query PostgresMain 0x5f3570fb9dbf postmaster_child_launch ServerLoop.isra.0 PostmasterMain main __libc_start_call_main __libc_start_main@@GLIBC_2.34 _start 1.09% postgres postgres [.] 
HeapTupleSatisfiesVisibility | ---HeapTupleSatisfiesVisibility | --1.02%--heap_hot_search_buffer heapam_index_fetch_tuple.lto_priv.0 index_fetch_heap IndexNext ExecScan ExecNestLoop ExecNestLoop ExecModifyTable standard_ExecutorRun ProcessQuery PortalRunMulti PortalRun exec_simple_query PostgresMain 0x5f3570fb9dbf postmaster_child_launch ServerLoop.isra.0 PostmasterMain main __libc_start_call_main __libc_start_main@@GLIBC_2.34 _start 1.00% postgres postgres [.] index_fetch_heap | ---index_fetch_heap | --0.83%--IndexNext ExecScan ExecNestLoop ExecNestLoop ExecModifyTable standard_ExecutorRun ProcessQuery PortalRunMulti PortalRun exec_simple_query PostgresMain 0x5f3570fb9dbf postmaster_child_launch ServerLoop.isra.0 PostmasterMain main __libc_start_call_main __libc_start_main@@GLIBC_2.34 _start 0.97% postgres postgres [.] _bt_readpage | ---_bt_readpage | --0.97%--_bt_readnextpage _bt_next btgettuple index_getnext_tid IndexNext ExecScan ExecNestLoop ExecNestLoop ExecModifyTable standard_ExecutorRun ProcessQuery PortalRunMulti PortalRun exec_simple_query PostgresMain 0x5f3570fb9dbf postmaster_child_launch ServerLoop.isra.0 PostmasterMain main __libc_start_call_main __libc_start_main@@GLIBC_2.34 _start 0.92% postgres postgres [.] _bt_next | ---_bt_next | --0.82%--btgettuple index_getnext_tid IndexNext ExecScan ExecNestLoop ExecNestLoop ExecModifyTable standard_ExecutorRun ProcessQuery PortalRunMulti PortalRun exec_simple_query PostgresMain 0x5f3570fb9dbf postmaster_child_launch ServerLoop.isra.0 PostmasterMain main __libc_start_call_main __libc_start_main@@GLIBC_2.34 _start 0.92% postgres postgres [.] index_getnext_tid | ---index_getnext_tid | --0.82%--IndexNext ExecScan ExecNestLoop ExecNestLoop ExecModifyTable standard_ExecutorRun ProcessQuery PortalRunMulti PortalRun exec_simple_query PostgresMain 0x5f3570fb9dbf postmaster_child_launch ServerLoop.isra.0 PostmasterMain main __libc_start_call_main __libc_start_main@@GLIBC_2.34 _start 0.90% postgres postgres [.] ExecScan | ---ExecScan ExecNestLoop | --0.83%--ExecNestLoop ExecModifyTable standard_ExecutorRun ProcessQuery PortalRunMulti PortalRun exec_simple_query PostgresMain 0x5f3570fb9dbf postmaster_child_launch ServerLoop.isra.0 PostmasterMain main __libc_start_call_main __libc_start_main@@GLIBC_2.34 _start 0.85% postgres postgres [.] IndexNext | ---IndexNext | --0.80%--ExecScan ExecNestLoop ExecNestLoop ExecModifyTable standard_ExecutorRun ProcessQuery PortalRunMulti PortalRun exec_simple_query PostgresMain 0x5f3570fb9dbf postmaster_child_launch ServerLoop.isra.0 PostmasterMain main __libc_start_call_main __libc_start_main@@GLIBC_2.34 _start 0.83% postgres postgres [.] XidInMVCCSnapshot | ---XidInMVCCSnapshot 0.81% postgres postgres [.] LWLockRelease | ---LWLockRelease | --0.54%--heapam_index_fetch_tuple.lto_priv.0 index_fetch_heap IndexNext ExecScan ExecNestLoop ExecNestLoop ExecModifyTable standard_ExecutorRun ProcessQuery PortalRunMulti PortalRun exec_simple_query PostgresMain 0x5f3570fb9dbf postmaster_child_launch ServerLoop.isra.0 PostmasterMain main __libc_start_call_main __libc_start_main@@GLIBC_2.34 _start 0.71% postgres postgres [.] 
_bt_checkkeys | ---_bt_checkkeys | --0.65%--_bt_readpage _bt_readnextpage _bt_next btgettuple index_getnext_tid IndexNext ExecScan ExecNestLoop ExecNestLoop ExecModifyTable standard_ExecutorRun ProcessQuery PortalRunMulti PortalRun exec_simple_query PostgresMain 0x5f3570fb9dbf postmaster_child_launch ServerLoop.isra.0 PostmasterMain main __libc_start_call_main __libc_start_main@@GLIBC_2.34 _start 0.69% postgres postgres [.] ReleaseBuffer | ---ReleaseBuffer | --0.68%--ExecStoreBufferHeapTuple heapam_index_fetch_tuple.lto_priv.0 index_fetch_heap IndexNext ExecScan ExecNestLoop ExecNestLoop ExecModifyTable standard_ExecutorRun ProcessQuery PortalRunMulti PortalRun exec_simple_query PostgresMain 0x5f3570fb9dbf postmaster_child_launch ServerLoop.isra.0 PostmasterMain main __libc_start_call_main __libc_start_main@@GLIBC_2.34 _start 0.57% postgres postgres [.] GetPrivateRefCountEntry.lto_priv.0 | ---GetPrivateRefCountEntry.lto_priv.0 0.53% postgres postgres [.] btgettuple | ---btgettuple 0.48% postgres postgres [.] _bt_check_compare.lto_priv.0 0.46% postgres postgres [.] MemoryContextReset 0.42% postgres postgres [.] hash_bytes 0.40% postgres postgres [.] ExecStoreBufferHeapTuple 0.36% postgres postgres [.] ResourceOwnerForget 0.35% postgres postgres [.] ExecNestLoop 0.33% postgres postgres [.] ExecEvalSysVar 0.32% postgres postgres [.] HeapTupleIsSurelyDead 0.27% postgres postgres [.] GlobalVisTestFor 0.26% postgres postgres [.] PredicateLockTID 0.22% postgres postgres [.] ReadBufferExtended 0.22% postgres postgres [.] tts_virtual_clear.lto_priv.0 0.21% postgres postgres [.] slot_getsomeattrs_int 0.19% postgres postgres [.] _bt_saveitem 0.19% postgres postgres [.] _bt_setuppostingitems 0.18% postgres postgres [.] ResourceOwnerEnlarge 0.17% postgres postgres [.] ExecIndexScan 0.14% postgres libc.so.6 [.] __memcmp_sse2 0.12% postgres postgres [.] pgstat_count_io_op 0.10% postgres postgres [.] int8eq 0.10% postgres postgres [.] HeapCheckForSerializableConflictOut 0.10% postgres postgres [.] ReservePrivateRefCountEntry.lto_priv.0 0.09% postgres postgres [.] IncrBufferRefCount 0.06% postgres postgres [.] _bt_checkpage 0.03% postgres postgres [.] BufferGetLSNAtomic 0.03% postgres postgres [.] tag_hash 0.03% postgres postgres [.] LockBufHdr 0.03% postgres postgres [.] _bt_readnextpage 0.02% postgres postgres [.] tts_buffer_heap_getsomeattrs.lto_priv.0 0.02% postgres [kernel.kallsyms] [k] handle_softirqs 0.01% postgres postgres [.] _bt_steppage 0.01% postgres [kernel.kallsyms] [k] task_mm_cid_work 0.00% postgres [kernel.kallsyms] [k] _raw_spin_unlock_irq 0.00% postgres [kernel.kallsyms] [k] irqentry_exit_to_user_mode 0.00% postgres postgres [.] TransactionIdIsCurrentTransactionId 0.00% postgres postgres [.] index_rescan 0.00% postgres [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore 0.00% postgres postgres [.] AllocSetAlloc 0.00% postgres postgres [.] PredicateLockPage 0.00% postgres postgres [.] _bt_set_startikey
Hi,
On 2025-10-08 19:46:42 +0200, Marco Boeringa wrote:
> > The fact that it runs without a problem in 17 means it's actually rather
> meaningful to look at the query plan. It could have changed. Separately, it
> might help us to narrow down what changes to look at that could potentially
> be causing problems.
> 
> I would fully understand if this was an "ordinary" issue case with a simple
> self-contained query and with things going wrong each time in the same way.
> However, as said, besides the major issue of running the query separately
> from my geoprocessing workflow which involves many more steps - which would
> mean that any test outside of it would *not* be very much representative of
> what is going on inside my tool and geoprocessing workflow - there is the
> fact that things going wrong is a random anomaly. I cannot stress this
> enough: about 3-4 in 5 runs are OK, then a random follow up run *with
> exactly the same input data* turns out bad with the stall. Even if there was
> an easy way to run the query, I think the chance is highly likely the
> postgres query planner would come up with a decent plan, as in normal
> circumstances, there is no issue.
Even just knowing whether the "normal query plan" is the same one as we see in
profiles of "stuck" backends is valuable. Even if the query plan is perfectly
normal, it *still* is very important to know in which order the joins are
evaluated etc. But there also might be changes in the query plan between 17
and 18 that trigger the issue...
Without more details about what is expected to be run and what is actually
happening, it's just about impossible for us to debug this without a
reproducer that we can run and debug ourselves.
> > Making vacuum more aggressive won't really help much if you have
> longrunning queries/sessions, since vacuum can't clean up row versions that
> are still visibile to some of the transactions.
> 
> My code batches the updates in sets of 2000 records at a time and then
> COMMITs, so the transactions themselves are limited in time and size, which
> should allow vacuum to do its job.
Are the "stuck" backends stuck within one 2000 record batch, or are they
"just" slower processing each batch?
>     26.48%  postgres  postgres           [.] LWLockAttemptLock
>             |
>             ---LWLockAttemptLock
>                |
>                |--23.15%--heapam_index_fetch_tuple.lto_priv.0
>                |          index_fetch_heap
>                |          IndexNext
>                |          ExecScan
>                |          ExecNestLoop
>                |          ExecNestLoop
>                |          ExecModifyTable
>                |          standard_ExecutorRun
>                |          ProcessQuery
So the query plan we have is a nested loop between at least three tables
(there are two joins, c.f. the two ExecNestLoop calls), where there presumably
are a lot of row [versions] on the inner side of the innermost join.
In [1] you showed a query. Reformatted, that looks like this:
UPDATE osm.landcover_scrubs_small_scale_2_ply AS t1
SET area_geo = t2.area_geo,
    perim_geo = t2.perim_geo,
    compact_geo = CASE WHEN t2.area_geo > 0 THEN ((power(t2.perim_geo,2) / t2.area_geo) / (4 * pi())) ELSE 0 END,
    npoints_geo = t2.npoints_geo,
    comp_npoints_geo = CASE WHEN t2.npoints_geo > 0 THEN (CASE WHEN t2.area_geo > 0 THEN ((power(t2.perim_geo,2) /
t2.area_geo)/ (4 * pi())) ELSE 0 END / t2.npoints_geo) ELSE 0 END,
 
    convex_ratio_geo = CASE WHEN ST_Area(ST_ConvexHull(way)::geography,true) > 0 THEN (t2.area_geo /
ST_Area(ST_ConvexHull(way)::geography,true))ELSE 1 END
 
FROM (
    SELECT
        objectid,
        ST_Area(way::geography,true) AS area_geo,
        ST_Perimeter(way::geography,true) AS perim_geo,
        ST_NPoints(way) AS npoints_geo
    FROM osm.landcover_scrubs_small_scale_2_ply) AS t2
WHERE (t2.objectid = t1.objectid)
   AND t1.objectid IN (SELECT t3.objectid FROM mini_test.osm.osm_tmp_28128_ch5 AS t3)
Which certainly fits with two nested loops, although I don't think I can infer
which order the joins are in.
Is osm.landcover_scrubs_small_scale_2_ply.object_id unique?
Can there be multiple rows for one object_id in
mini_test.osm.osm_tmp_28128_ch5?
Are there indexes on mini_test.osm.osm_tmp_28128_ch5.unique_id and
osm.landcover_scrubs_small_scale_2_ply?
Greetings,
Andres Freund
[1] https://www.postgresql.org/message-id/53b44572-0ceb-4149-b361-07da2da91032%40boeringa.demon.nl
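One quick way to answer the index questions above is the pg_indexes view; a minimal sketch for the main table named in the query (the same check applies to the chunk relation):

SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'osm'
  AND tablename = 'landcover_scrubs_small_scale_2_ply';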
			
Hi Andres,

> Even just knowing whether the "normal query plan" is the same one as we see in
> profiles of "stuck" backends is valuable. Even if the query plan is perfectly
> normal, it *still* is very important to know in which order the joins are
> evaluated etc. But there also might be changes in the query plan between 17
> and 18 that trigger the issue...
>
> Without more details about what is expected to be run and what is actually
> happening, it's just about impossible for us to debug this without a
> reproducer that we can run and debug ourselves.

I now encountered the auto_explain option in the PostgreSQL help. May sound stupid, but I hadn't been aware of this option. This might help in getting an explain during the actual execution of my tool, if I understand the option properly. This would be far more valuable - as being the "real" thing - than some contrived reproduction case. I will need to investigate this a bit more:

https://www.postgresql.org/docs/current/auto-explain.html

>>> Making vacuum more aggressive won't really help much if you have
>>> longrunning queries/sessions, since vacuum can't clean up row versions that
>>> are still visibile to some of the transactions.
>>
>> My code batches the updates in sets of 2000 records at a time and then
>> COMMITs, so the transactions themselves are limited in time and size, which
>> should allow vacuum to do its job.
>
> Are the "stuck" backends stuck within one 2000 record batch, or are they
> "just" slower processing each batch?

I can't tell. But to explain: each thread has its own set of jobs assigned, and each job is batched in sets of 2000 records until COMMIT. So if one job has 100k records to process, 50 commits should occur for that job by one Python thread. I take care to avoid processing records totally randomly, which could cause conflicts and locking issues between threads attempting to access the same locked database page, significantly slowing down the processing. Records are assigned by database page (and depending on some other parameters), which has worked really well so far.

Note that this is just a simplified version of the different processing modes I developed for different challenges and geoprocessing steps.

>>     26.48%  postgres  postgres           [.] LWLockAttemptLock
>>             |
>>             ---LWLockAttemptLock
>>                |
>>                |--23.15%--heapam_index_fetch_tuple.lto_priv.0
>>                |          index_fetch_heap
>>                |          IndexNext
>>                |          ExecScan
>>                |          ExecNestLoop
>>                |          ExecNestLoop
>>                |          ExecModifyTable
>>                |          standard_ExecutorRun
>>                |          ProcessQuery
>
> So the query plan we have is a nested loop between at least three tables
> (there are two joins, c.f. the two ExecNestLoop calls), where there presumably
> are a lot of row [versions] on the inner side of the innermost join.
>
> In [1] you showed a query. Reformatted, that looks like this:
>
> UPDATE osm.landcover_scrubs_small_scale_2_ply AS t1
> SET area_geo = t2.area_geo,
>     perim_geo = t2.perim_geo,
>     compact_geo = CASE WHEN t2.area_geo > 0 THEN ((power(t2.perim_geo,2) / t2.area_geo) / (4 * pi())) ELSE 0 END,
>     npoints_geo = t2.npoints_geo,
>     comp_npoints_geo = CASE WHEN t2.npoints_geo > 0 THEN (CASE WHEN t2.area_geo > 0 THEN ((power(t2.perim_geo,2) / t2.area_geo) / (4 * pi())) ELSE 0 END / t2.npoints_geo) ELSE 0 END,
>     convex_ratio_geo = CASE WHEN ST_Area(ST_ConvexHull(way)::geography,true) > 0 THEN (t2.area_geo / ST_Area(ST_ConvexHull(way)::geography,true)) ELSE 1 END
> FROM (
>     SELECT
>         objectid,
>         ST_Area(way::geography,true) AS area_geo,
>         ST_Perimeter(way::geography,true) AS perim_geo,
>         ST_NPoints(way) AS npoints_geo
>     FROM osm.landcover_scrubs_small_scale_2_ply) AS t2
> WHERE (t2.objectid = t1.objectid)
>    AND t1.objectid IN (SELECT t3.objectid FROM mini_test.osm.osm_tmp_28128_ch5 AS t3)
>
> Which certainly fits with two nested loops, although I don't think I can infer
> which order the joins are in.
>
> Is osm.landcover_scrubs_small_scale_2_ply.object_id unique?

Yes.

> Can there be multiple rows for one object_id in
> mini_test.osm.osm_tmp_28128_ch5?

No. This table contains the records to process, which are unique. It is the job. It is a one-to-one join.

> Are there indexes on mini_test.osm.osm_tmp_28128_ch5.unique_id and
> osm.landcover_scrubs_small_scale_2_ply?

Yes, the unique ids / objectid fields are indexed to allow an efficient join.
Actually, the "*_ch<number>" database object that represents the records to process for one job, references a database view. Each thread gets its own view. All views reference the same secondary table that has an index on the objectid.
On 8-10-2025 at 21:08, Andres Freund wrote:
> Even just knowing whether the "normal query plan" is the same one as we see in
> profiles of "stuck" backends is valuable. Even if the query plan is perfectly
> normal, it *still* is very important to know in which order the joins are
> evaluated etc. But there also might be changes in the query plan between 17
> and 18 that trigger the issue...
>
> Without more details about what is expected to be run and what is actually
> happening, it's just about impossible for us to debug this without a
> reproducer that we can run and debug ourselves.
>
>
Hi Andres,
I have tried to get the auto_explain stuff to run, but that has not 
succeeded yet.
However, I realized that due to the extremely long "stall", it should be 
possible to simply copy the visible SQL statement out of pgAdmin while the 
processing appears stuck, as the required (temporary) tables and views for 
the statement are still there at that point, and simply run EXPLAIN on it 
from pgAdmin.
This resulted in the plan I pasted below in JSON format.
Any insights you gain from this in combination with the other stuff I 
shared and the answers I gave to your last questions?
Marco
"[
   {
     ""Plan"": {
       ""Node Type"": ""ModifyTable"",
       ""Operation"": ""Update"",
       ""Parallel Aware"": false,
       ""Async Capable"": false,
       ""Relation Name"": ""landcover_grassy_small_scale_2_ply"",
       ""Schema"": ""osm"",
       ""Alias"": ""t1"",
       ""Startup Cost"": 1.14,
       ""Total Cost"": 9129.99,
       ""Plan Rows"": 0,
       ""Plan Width"": 0,
       ""Disabled"": false,
       ""Plans"": [
         {
           ""Node Type"": ""Nested Loop"",
           ""Parent Relationship"": ""Outer"",
           ""Parallel Aware"": false,
           ""Async Capable"": false,
           ""Join Type"": ""Inner"",
           ""Startup Cost"": 1.14,
           ""Total Cost"": 9129.99,
           ""Plan Rows"": 70,
           ""Plan Width"": 62,
           ""Disabled"": false,
           ""Output"": 
[""st_area((landcover_grassy_small_scale_2_ply.way)::geography, true)"", 
""st_perimeter((landcover_grassy_small_scale_2_ply.way)::geography, 
true)"", ""CASE WHEN 
(st_area((landcover_grassy_small_scale_2_ply.way)::geography, true) > 
'0'::double precision) THEN 
((power(st_perimeter((landcover_grassy_small_scale_2_ply.way)::geography, 
true), '2'::double precision) / 
st_area((landcover_grassy_small_scale_2_ply.way)::geography, true)) / 
'12.566370614359172'::double precision) ELSE '0'::double precision 
END"", ""st_npoints(landcover_grassy_small_scale_2_ply.way)"", ""CASE 
WHEN (st_npoints(landcover_grassy_small_scale_2_ply.way) > 0) THEN (CASE 
WHEN (st_area((landcover_grassy_small_scale_2_ply.way)::geography, true) 
 > '0'::double precision) THEN 
((power(st_perimeter((landcover_grassy_small_scale_2_ply.way)::geography, 
true), '2'::double precision) / 
st_area((landcover_grassy_small_scale_2_ply.way)::geography, true)) / 
'12.566370614359172'::double precision) ELSE '0'::double precision END / 
(st_npoints(landcover_grassy_small_scale_2_ply.way))::double precision) 
ELSE '0'::double precision END"", ""CASE WHEN 
(st_area((st_convexhull(t1.way))::geography, true) > '0'::double 
precision) THEN 
(st_area((landcover_grassy_small_scale_2_ply.way)::geography, true) / 
st_area((st_convexhull(t1.way))::geography, true)) ELSE '1'::double 
precision END"", ""t1.ctid"", 
""landcover_grassy_small_scale_2_ply.ctid"", 
""landcover_grassy_small_scale_2_ply_pg.ctid""],
           ""Inner Unique"": true,
           ""Plans"": [
             {
               ""Node Type"": ""Nested Loop"",
               ""Parent Relationship"": ""Outer"",
               ""Parallel Aware"": false,
               ""Async Capable"": false,
               ""Join Type"": ""Inner"",
               ""Startup Cost"": 0.72,
               ""Total Cost"": 173.50,
               ""Plan Rows"": 70,
               ""Plan Width"": 828,
               ""Disabled"": false,
               ""Output"": [""t1.way"", ""t1.ctid"", ""t1.objectid"", 
""landcover_grassy_small_scale_2_ply_pg.ctid"", 
""landcover_grassy_small_scale_2_ply_pg.objectid""],
               ""Inner Unique"": true,
               ""Plans"": [
                 {
                   ""Node Type"": ""Index Scan"",
                   ""Parent Relationship"": ""Outer"",
                   ""Parallel Aware"": false,
                   ""Async Capable"": false,
                   ""Scan Direction"": ""Forward"",
                   ""Index Name"": ""idx_osm_35"",
                   ""Relation Name"": 
""landcover_grassy_small_scale_2_ply_pg"",
                   ""Schema"": ""osm"",
                   ""Alias"": ""landcover_grassy_small_scale_2_ply_pg"",
                   ""Startup Cost"": 0.29,
                   ""Total Cost"": 3.70,
                   ""Plan Rows"": 70,
                   ""Plan Width"": 14,
                   ""Disabled"": false,
                   ""Output"": 
[""landcover_grassy_small_scale_2_ply_pg.ctid"", 
""landcover_grassy_small_scale_2_ply_pg.objectid""],
                   ""Index Cond"": 
""((landcover_grassy_small_scale_2_ply_pg.page_number >= 28873) AND 
(landcover_grassy_small_scale_2_ply_pg.page_number < 29373))""
                 },
                 {
                   ""Node Type"": ""Index Scan"",
                   ""Parent Relationship"": ""Inner"",
                   ""Parallel Aware"": false,
                   ""Async Capable"": false,
                   ""Scan Direction"": ""Forward"",
                   ""Index Name"": 
""landcover_grassy_small_scale_2_ply_pkey"",
                   ""Relation Name"": 
""landcover_grassy_small_scale_2_ply"",
                   ""Schema"": ""osm"",
                   ""Alias"": ""t1"",
                   ""Startup Cost"": 0.42,
                   ""Total Cost"": 2.43,
                   ""Plan Rows"": 1,
                   ""Plan Width"": 814,
                   ""Disabled"": false,
                   ""Output"": [""t1.way"", ""t1.ctid"", ""t1.objectid""],
                   ""Index Cond"": ""(t1.objectid = 
landcover_grassy_small_scale_2_ply_pg.objectid)""
                 }
               ]
             },
             {
               ""Node Type"": ""Index Scan"",
               ""Parent Relationship"": ""Inner"",
               ""Parallel Aware"": false,
               ""Async Capable"": false,
               ""Scan Direction"": ""Forward"",
               ""Index Name"": ""landcover_grassy_small_scale_2_ply_pkey"",
               ""Relation Name"": ""landcover_grassy_small_scale_2_ply"",
               ""Schema"": ""osm"",
               ""Alias"": ""landcover_grassy_small_scale_2_ply"",
               ""Startup Cost"": 0.42,
               ""Total Cost"": 0.64,
               ""Plan Rows"": 1,
               ""Plan Width"": 814,
               ""Disabled"": false,
               ""Output"": [""landcover_grassy_small_scale_2_ply.way"", 
""landcover_grassy_small_scale_2_ply.ctid"", 
""landcover_grassy_small_scale_2_ply.objectid""],
               ""Index Cond"": 
""(landcover_grassy_small_scale_2_ply.objectid = t1.objectid)""
             }
           ]
         }
       ]
     }
   }
]"
			
		Just found out about the other thread, which seems to describe an issue that may well be related to the problems I observe in my multi-threaded Python implementation using PostgreSQL 18: https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BHh7u3YpPLKQS%2BB7KJu0Bw99EjhuPuvBRVccDfgod95g%40mail.gmail.com#969d704180758e8dbd124604206175e0 As I have zero experience compiling PostgreSQL from scratch and PostgreSQL development in general, I will unfortunately avoid going down the rabbit hole of such an attempt to be able to test the patch, but will eagerly await PG18.1 which hopefully will have it included and might fix the issues I observed. Marco
Hi,

On October 10, 2025 5:26:38 PM EDT, Marco Boeringa <marco@boeringa.demon.nl> wrote:
> Just found out about the other thread, which seems to describe an issue that may well be related to the problems I observe in my multi-threaded Python implementation using PostgreSQL 18:
>
> https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BHh7u3YpPLKQS%2BB7KJu0Bw99EjhuPuvBRVccDfgod95g%40mail.gmail.com#969d704180758e8dbd124604206175e0
>
> As I have zero experience compiling PostgreSQL from scratch and PostgreSQL development in general, I will unfortunately avoid going down the rabbit hole of such an attempt to be able to test the patch, but will eagerly await PG18.1, which hopefully will have it included and might fix the issues I observed.

If that issue were causal you'd see something in pg_aios and the issue wouldn't be present with worker or sync io methods. Nor would you see high cpu utilization. This is a different issue, likely not aio related.

Greetings,

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
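For completeness, the pg_aios view Andres mentions can simply be queried while a backend appears stuck; a minimal sketch (it lists the asynchronous I/O handles currently in use, so an empty result during a stall is consistent with the issue not being AIO related):

SELECT * FROM pg_aios;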
Then as a follow-up on that: is there anything you still need from me, or is the information I gave so far, including the last requested answers and the perf data, enough to give you at least a decent start to figure out where the issue is?

On 10-10-2025 at 23:30, Andres Freund wrote:
> Hi,
>
> On October 10, 2025 5:26:38 PM EDT, Marco Boeringa <marco@boeringa.demon.nl> wrote:
>> Just found out about the other thread, which seems to describe an issue that may well be related to the problems I observe in my multi-threaded Python implementation using PostgreSQL 18:
>>
>> https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BHh7u3YpPLKQS%2BB7KJu0Bw99EjhuPuvBRVccDfgod95g%40mail.gmail.com#969d704180758e8dbd124604206175e0
>>
>> As I have zero experience compiling PostgreSQL from scratch and PostgreSQL development in general, I will unfortunately avoid going down the rabbit hole of such an attempt to be able to test the patch, but will eagerly await PG18.1, which hopefully will have it included and might fix the issues I observed.
> If that issue were causal you'd see something in pg_aios and the issue wouldn't be present with worker or sync io methods. Nor would you see high cpu utilization. This is a different issue, likely not aio related.
>
> Greetings,
>
> Andres
Hi Andres,

I have been doing a bit more investigation. As I explained before, the problematic multi-threaded geoprocessing step is not some stand-alone query that can easily be reduced to a small, easily portable, reproducible case with attached data. In fact, this geoprocessing step is part of a large custom-built Python geoprocessing workflow, with total code probably in the 25k lines range.

However, based on the apparent poor query plan in PG18 / PostGIS 3.6.0, I reviewed the exact code once more. I noticed that just before entering the multi-threaded code that emits the queries as seen below, I am actually adding the primary key field 'objectid' as "GENERATED BY DEFAULT AS IDENTITY" to the 'osm.landcover_scrubs_small_scale_2_ply' table. I also noticed I did not run ANALYZE against the same table after that. I have now added this to the code. Although it is still preliminary, first tests seem to indicate that this resolves the issue and prevents the stalls, or rather the apparently hugely inefficient query plan (remember: a < 10 sec process was turned into a multi-hour one). I still need to do more thorough testing to be sure though.

However, this raises a couple of questions:

- While ANALYZE is of course hugely important for proper statistics and query planning, I have wondered whether PostgreSQL shouldn't automatically have updated the statistics for the addition of the primary key with IDENTITY? It seems to me that, based on the definition of the primary key column with IDENTITY and the table size, the actual distribution of values is essentially already known even before any sampling by ANALYZE to update statistics?

- Am I right to assume that only the statistics on the objectid field play any role in this issue? As you can see, the WHERE clause does not involve any fields other than the two objectid fields of the main table and the chunk table specifying the job. All other values computed are just derived straight from the geometry column.

- Were there any hints in all the other data I supplied as to where PG18's query planning, without the updated statistics from the newly added ANALYZE step, is going wrong? And why this was never an issue in <= PG17?

I also did some preliminary tests with the old PG17.6 / PostGIS 3.6.0 cluster with the same Italy extract data. I still need to do more thorough testing, both with and without the extra ANALYZE step, to fully exclude that there isn't something related to the upgrade to PostGIS 3.6.0, but first indications are, as I already saw with PG17.6 / PostGIS 3.5.3, that there are no issues with the <= PG17 / PostGIS combination as regards this apparent planner issue.

Marco

On 8-10-2025 at 21:08, Andres Freund wrote:
> In [1] you showed a query.
Reformated that looks like this: > > UPDATE osm.landcover_scrubs_small_scale_2_ply AS t1 > SET area_geo = t2.area_geo, > perim_geo = t2.perim_geo, > compact_geo = CASE WHEN t2.area_geo > 0 THEN ((power(t2.perim_geo,2) / t2.area_geo) / (4 * pi())) ELSE 0 END, > npoints_geo = t2.npoints_geo, > comp_npoints_geo = CASE WHEN t2.npoints_geo > 0 THEN (CASE WHEN t2.area_geo > 0 THEN ((power(t2.perim_geo,2) / t2.area_geo)/ (4 * pi())) ELSE 0 END / t2.npoints_geo) ELSE 0 END, > convex_ratio_geo = CASE WHEN ST_Area(ST_ConvexHull(way)::geography,true) > 0 THEN (t2.area_geo / ST_Area(ST_ConvexHull(way)::geography,true))ELSE 1 END > FROM ( > SELECT > objectid, > ST_Area(way::geography,true) AS area_geo, > ST_Perimeter(way::geography,true) AS perim_geo, > ST_NPoints(way) AS npoints_geo > FROM osm.landcover_scrubs_small_scale_2_ply) AS t2 > WHERE (t2.objectid = t1.objectid) > AND t1.objectid IN (SELECT t3.objectid FROM mini_test.osm.osm_tmp_28128_ch5 AS t3) > > > Which certainly fits with two nested loops, although I don't think I can infer > which order it the joins are in.
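A minimal sketch of the fix described above, i.e. running ANALYZE right after the identity primary key is added so the planner has statistics for the new column (the exact DDL generated by the tool may differ):

ALTER TABLE osm.landcover_scrubs_small_scale_2_ply
    ADD COLUMN objectid bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY;

-- without this, the planner has no statistics for the new column yet
ANALYZE osm.landcover_scrubs_small_scale_2_ply;

-- optional: confirm statistics now exist for the new column
SELECT attname, n_distinct
FROM pg_stats
WHERE schemaname = 'osm'
  AND tablename = 'landcover_scrubs_small_scale_2_ply'
  AND attname = 'objectid';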
Hi Andres,
I have now been able to capture an actual bad plan using PostgreSQL's 
'auto_explain' option, so this is the bad plan that makes a process and job 
that should take just a few seconds cost more than 7 hours to execute. 
And this is of course without updating the table's statistics with 
ANALYZE after adding the primary key objectid, as that solved the issue 
in PG18. It is now clear where PostgreSQL spends all of its time when 
the bad plan is generated; see the loop count of the second index scan. 
I am not sure why, in this nested loop, two index scans on essentially 
the same key and table are executed. You can compare this bad plan with 
the one below it, which was generated with EXPLAIN in pgAdmin rather than 
from actual auto_explain output, and which shows a different query execution plan.
Note again:
- I did not see the bad plan in PG<=17, even without the now added 
ANALYZE step that solves the issue in PG18.
- The two tables involved ('landcover_grassy_small_scale_2_ply' and 
'landcover_grassy_small_scale_2_ply_pg') both have primary keys and 
unique objectid indexes.
- The join is one-to-one. 'landcover_grassy_small_scale_2_ply_pg' has 
unique values for objectid, so there is only one record per corresponding 
record in 'landcover_grassy_small_scale_2_ply'.
- The 'osm_tmp_28232_ch3' references a database view and represents the 
chunk / multi-threaded job that needs to be executed from my Python 
code, and references a selection of 
the 'landcover_grassy_small_scale_2_ply_pg' table's records, which is 
the filter in the last index scan in the bad plan below.
Marco
*** ACTUAL BAD PLAN AS CAPTURED BY auto_explain ***:
2025-10-17 23:32:17.375 CEST [242803] osm@mini_test LOG: duration: 
27133980.536 ms  plan:
     Query Text: UPDATE osm.landcover_grassy_small_scale_2_ply AS t1 SET 
area_geo = t2.area_geo, perim_geo = t2.perim_geo, compact_geo = CASE 
WHEN t2.area_geo > 0 THEN ((power(t2.perim_geo,2) / t2.area_geo) / (4 * 
pi())) ELSE 0 END, npoints_geo = t2.npoints_geo, comp_npoints_geo = CASE 
WHEN t2.npoints_geo > 0 THEN (CASE WHEN t2.area_geo > 0 THEN 
((power(t2.perim_geo,2) / t2.area_geo) / (4 * pi())) ELSE 0 END / 
t2.npoints_geo) ELSE 0 END, convex_ratio_geo = CASE WHEN 
ST_Area(ST_ConvexHull(way)::geography,true) > 0 THEN (t2.area_geo / 
ST_Area(ST_ConvexHull(way)::geography,true)) ELSE 1 END FROM (SELECT 
objectid,ST_Area(way::geography,true) AS 
area_geo,ST_Perimeter(way::geography,true) AS perim_geo,ST_NPoints(way) 
AS npoints_geo FROM osm.landcover_grassy_small_scale_2_ply)  AS t2 WHERE 
(t2.objectid = t1.objectid) AND t1.objectid IN (SELECT t3.objectid FROM 
mini_test.osm.osm_tmp_28232_ch3 AS t3)
     Update on osm.landcover_grassy_small_scale_2_ply t1 
(cost=1.17..134.56 rows=0 width=0) (actual 
time=27133980.532..27133980.534 rows=0.00 loops=1)
       Buffers: shared hit=7585627828
       ->  Nested Loop  (cost=1.17..134.56 rows=1 width=62) (actual 
time=225948.297..27133979.894 rows=8.00 loops=1)
             Output: 
st_area((landcover_grassy_small_scale_2_ply.way)::geography, true), 
st_perimeter((landcover_grassy_small_scale_2_ply.way)::geography, true), 
CASE WHEN (st_area((landcover_grassy_small_scale_2_ply.way)::geography, 
true) > '0'::double precision) THEN 
((power(st_perimeter((landcover_grassy_small_scale_2_ply.way)::geography, 
true), '2'::double precision) / 
st_area((landcover_grassy_small_scale_2_ply.way)::geography, true)) / 
'12.566370614359172'::double precision) ELSE '0'::double precision END, 
st_npoints(landcover_grassy_small_scale_2_ply.way), CASE WHEN 
(st_npoints(landcover_grassy_small_scale_2_ply.way) > 0) THEN (CASE WHEN 
(st_area((landcover_grassy_small_scale_2_ply.way)::geography, true) > 
'0'::double precision) THEN 
((power(st_perimeter((landcover_grassy_small_scale_2_ply.way)::geography, 
true), '2'::double precision) / 
st_area((landcover_grassy_small_scale_2_ply.way)::geography, true)) / 
'12.566370614359172'::double precision) ELSE '0'::double precision END / 
(st_npoints(landcover_grassy_small_scale_2_ply.way))::double precision) 
ELSE '0'::double precision END, CASE WHEN 
(st_area((st_convexhull(t1.way))::geography, true) > '0'::double 
precision) THEN 
(st_area((landcover_grassy_small_scale_2_ply.way)::geography, true) / 
st_area((st_convexhull(t1.way))::geography, true)) ELSE '1'::double 
precision END, t1.ctid, landcover_grassy_small_scale_2_ply.ctid, 
landcover_grassy_small_scale_2_ply_pg.ctid
             Inner Unique: true
             Join Filter: (landcover_grassy_small_scale_2_ply.objectid = 
landcover_grassy_small_scale_2_ply_pg.objectid)
             Buffers: shared hit=7585627705
             ->  Nested Loop  (cost=0.75..4.79 rows=1 width=620) (actual 
time=5.784..27131099.372 rows=222396.00 loops=1)
                   Output: t1.way, t1.ctid, t1.objectid, 
landcover_grassy_small_scale_2_ply.way, 
landcover_grassy_small_scale_2_ply.ctid, 
landcover_grassy_small_scale_2_ply.objectid
                   Inner Unique: true
                   Join Filter: (t1.objectid = 
landcover_grassy_small_scale_2_ply.objectid)
                   Rows Removed by Join Filter: 24729879210
                   Buffers: shared hit=7584737792
                   ->  Index Scan using 
landcover_grassy_small_scale_2_ply_pkey on 
osm.landcover_grassy_small_scale_2_ply t1  (cost=0.38..2.39 rows=1 
width=310) (actual time=5.176..462.613 rows=222396.00 loops=1)
                         Output: t1.way, t1.ctid, t1.objectid
                         Index Searches: 1
                         Buffers: shared hit=66411
                   ->  Index Scan using 
landcover_grassy_small_scale_2_ply_pkey on 
osm.landcover_grassy_small_scale_2_ply  (cost=0.38..2.39 rows=1 
width=310) (actual time=0.035..114.709 rows=111198.50 loops=222396)
                         Output: landcover_grassy_small_scale_2_ply.way, 
landcover_grassy_small_scale_2_ply.ctid, 
landcover_grassy_small_scale_2_ply.objectid
                         Index Searches: 222396
                         Buffers: shared hit=7584671381
             ->  Index Scan using idx_osm_41 on 
osm.landcover_grassy_small_scale_2_ply_pg  (cost=0.42..2.44 rows=1 
width=14) (actual time=0.009..0.009 rows=0.00 loops=222396)
                   Output: landcover_grassy_small_scale_2_ply_pg.ctid, 
landcover_grassy_small_scale_2_ply_pg.objectid
                   Index Cond: 
(landcover_grassy_small_scale_2_ply_pg.objectid = t1.objectid)
                   Filter: 
((landcover_grassy_small_scale_2_ply_pg.page_number >= 31276) AND 
(landcover_grassy_small_scale_2_ply_pg.page_number < 31766))
                   Rows Removed by Filter: 1
                   Index Searches: 222396
                   Buffers: shared hit=889584
*** LIKELY GOOD PLAN AS GENERATED BY CAPTURING THE SQL AND RUNNING 
'EXPLAIN' IN pgAdmin ***:
"[
   {
     ""Plan"": {
       ""Node Type"": ""ModifyTable"",
       ""Operation"": ""Update"",
       ""Parallel Aware"": false,
       ""Async Capable"": false,
       ""Relation Name"": ""landcover_grassy_small_scale_2_ply"",
       ""Schema"": ""osm"",
       ""Alias"": ""t1"",
       ""Startup Cost"": 1.14,
       ""Total Cost"": 9129.99,
       ""Plan Rows"": 0,
       ""Plan Width"": 0,
       ""Disabled"": false,
       ""Plans"": [
         {
           ""Node Type"": ""Nested Loop"",
           ""Parent Relationship"": ""Outer"",
           ""Parallel Aware"": false,
           ""Async Capable"": false,
           ""Join Type"": ""Inner"",
           ""Startup Cost"": 1.14,
           ""Total Cost"": 9129.99,
           ""Plan Rows"": 70,
           ""Plan Width"": 62,
           ""Disabled"": false,
           ""Output"": 
[""st_area((landcover_grassy_small_scale_2_ply.way)::geography, true)"",
""st_perimeter((landcover_grassy_small_scale_2_ply.way)::geography, 
true)"", ""CASE WHEN 
(st_area((landcover_grassy_small_scale_2_ply.way)::geography, true) > 
'0'::double precision) THEN 
((power(st_perimeter((landcover_grassy_small_scale_2_ply.way)::geography, 
true), '2'::double precision) / 
st_area((landcover_grassy_small_scale_2_ply.way)::geography, true)) / 
'12.566370614359172'::double precision) ELSE '0'::double precision 
END"", ""st_npoints(landcover_grassy_small_scale_2_ply.way)"", ""CASE 
WHEN (st_npoints(landcover_grassy_small_scale_2_ply.way) > 0) THEN (CASE 
WHEN (st_area((landcover_grassy_small_scale_2_ply.way)::geography, true) 
 > '0'::double precision) THEN 
((power(st_perimeter((landcover_grassy_small_scale_2_ply.way)::geography, 
true), '2'::double precision) / 
st_area((landcover_grassy_small_scale_2_ply.way)::geography, true)) / 
'12.566370614359172'::double precision) ELSE '0'::double precision END / 
(st_npoints(landcover_grassy_small_scale_2_ply.way))::double precision) 
ELSE '0'::double precision END"", ""CASE WHEN 
(st_area((st_convexhull(t1.way))::geography, true) > '0'::double 
precision) THEN 
(st_area((landcover_grassy_small_scale_2_ply.way)::geography, true) / 
st_area((st_convexhull(t1.way))::geography, true)) ELSE '1'::double 
precision END"", ""t1.ctid"", 
""landcover_grassy_small_scale_2_ply.ctid"", 
""landcover_grassy_small_scale_2_ply_pg.ctid""],
           ""Inner Unique"": true,
           ""Plans"": [
             {
               ""Node Type"": ""Nested Loop"",
               ""Parent Relationship"": ""Outer"",
               ""Parallel Aware"": false,
               ""Async Capable"": false,
               ""Join Type"": ""Inner"",
               ""Startup Cost"": 0.72,
               ""Total Cost"": 173.50,
               ""Plan Rows"": 70,
               ""Plan Width"": 828,
               ""Disabled"": false,
               ""Output"": [""t1.way"", ""t1.ctid"", ""t1.objectid"", 
""landcover_grassy_small_scale_2_ply_pg.ctid"", 
""landcover_grassy_small_scale_2_ply_pg.objectid""],
               ""Inner Unique"": true,
               ""Plans"": [
                 {
                   ""Node Type"": ""Index Scan"",
                   ""Parent Relationship"": ""Outer"",
                   ""Parallel Aware"": false,
                   ""Async Capable"": false,
                   ""Scan Direction"": ""Forward"",
                   ""Index Name"": ""idx_osm_35"",
                   ""Relation Name"": 
""landcover_grassy_small_scale_2_ply_pg"",
                   ""Schema"": ""osm"",
                   ""Alias"": ""landcover_grassy_small_scale_2_ply_pg"",
                   ""Startup Cost"": 0.29,
                   ""Total Cost"": 3.70,
                   ""Plan Rows"": 70,
                   ""Plan Width"": 14,
                   ""Disabled"": false,
                   ""Output"": 
[""landcover_grassy_small_scale_2_ply_pg.ctid"", 
""landcover_grassy_small_scale_2_ply_pg.objectid""],
                   ""Index Cond"": 
""((landcover_grassy_small_scale_2_ply_pg.page_number >= 28873) AND 
(landcover_grassy_small_scale_2_ply_pg.page_number < 29373))""
                 },
                 {
                   ""Node Type"": ""Index Scan"",
                   ""Parent Relationship"": ""Inner"",
                   ""Parallel Aware"": false,
                   ""Async Capable"": false,
                   ""Scan Direction"": ""Forward"",
                   ""Index Name"": 
""landcover_grassy_small_scale_2_ply_pkey"",
                   ""Relation Name"": 
""landcover_grassy_small_scale_2_ply"",
                   ""Schema"": ""osm"",
                   ""Alias"": ""t1"",
                   ""Startup Cost"": 0.42,
                   ""Total Cost"": 2.43,
                   ""Plan Rows"": 1,
                   ""Plan Width"": 814,
                   ""Disabled"": false,
                   ""Output"": [""t1.way"", ""t1.ctid"", ""t1.objectid""],
                   ""Index Cond"": ""(t1.objectid = 
landcover_grassy_small_scale_2_ply_pg.objectid)""
                 }
               ]
             },
             {
               ""Node Type"": ""Index Scan"",
               ""Parent Relationship"": ""Inner"",
               ""Parallel Aware"": false,
               ""Async Capable"": false,
               ""Scan Direction"": ""Forward"",
               ""Index Name"": ""landcover_grassy_small_scale_2_ply_pkey"",
               ""Relation Name"": ""landcover_grassy_small_scale_2_ply"",
               ""Schema"": ""osm"",
               ""Alias"": ""landcover_grassy_small_scale_2_ply"",
               ""Startup Cost"": 0.42,
               ""Total Cost"": 0.64,
               ""Plan Rows"": 1,
               ""Plan Width"": 814,
               ""Disabled"": false,
               ""Output"": [""landcover_grassy_small_scale_2_ply.way"", 
""landcover_grassy_small_scale_2_ply.ctid"", 
""landcover_grassy_small_scale_2_ply.objectid""],
               ""Index Cond"": 
""(landcover_grassy_small_scale_2_ply.objectid = t1.objectid)""
             }
           ]
         }
       ]
     }
   }
]"
On 12-10-2025 at 10:24, Marco Boeringa wrote:
> Hi Andres,
>
> I have been doing a bit more investigation. As I explained before, the 
> problematic multi-threaded geoprocessing step is not some stand-alone 
> query that can easily be reduced to a small, portable, reproducible 
> case with attached data. In fact, this geoprocessing step is part of a 
> large custom-built Python geoprocessing workflow, with the total code 
> probably in the 25k-lines range.
>
> However, based on the apparent poor query plan in PG18 / PostGIS 
> 3.6.0, I now reviewed the exact code once more. I noticed that just 
> before entering the multi-threaded code that emits the queries as seen 
> below, I am actually adding the primary key field 'objectid' as 
> "GENERATED BY DEFAULT AS IDENTITY" to the 
> 'osm.landcover_scrubs_small_scale_2_ply' table.
>
> Now I also noticed I did not run ANALYZE after that against the same 
> table. I have now added this to the code. Although it is still 
> preliminary, first tests seem to indicate that this resolves the 
> issue and prevents the stalls, or better said, the apparently hugely 
> inefficient query plan (remember: a < 10 sec process was turned into 
> a multi-hour one). I still need to do more thorough testing to be sure though.
>
> However, this raises a couple of question:
>
> - While ANALYZE is of course hugely important for proper statistics 
> and query planning, I have wondered whether PostgreSQL shouldn't 
> automatically have updated the statistics for the addition of the 
> primary key with IDENTITY? It seems to me that, based on the definition 
> of the primary key column as IDENTITY and the table size, the actual 
> distribution of values is essentially already known even before any 
> ANALYZE sampling to update the statistics.
>
> - Am I right to assume that only the statistics on the objectid field 
> play any role in this issue? As you can see, the WHERE clause does not 
> involve any other fields than the two objectid fields of the main 
> table and the chunk table specifying the job. All other values 
> computed are just derived straight from the geometry column.
>
> - Were there any hints in all the other data I supplied as to 
> where PG18's query planning, without the updated statistics from the 
> newly added ANALYZE step, is going wrong? And why this was never an 
> issue in <= PG17?
>
> I also did some preliminary tests with the old PG17.6 / PostGIS 3.6.0 
> cluster with the same Italy extract data. I still need to do more 
> thorough testing, both with and without the extra ANALYZE step, to 
> fully exclude something related to the upgrade to PostGIS 3.6.0, but 
> first indications are, as I already saw with PG17.6 / PostGIS 3.5.3, 
> that there is no issue with the <= PG17 / PostGIS combination as 
> regards this apparent planner issue.
>
> Marco
>
> On 8-10-2025 at 21:08, Andres Freund wrote:
>> In [1] you showed a query. Reformatted, that looks like this:
>>
>> UPDATE osm.landcover_scrubs_small_scale_2_ply AS t1
>> SET area_geo = t2.area_geo,
>>      perim_geo = t2.perim_geo,
>>      compact_geo = CASE WHEN t2.area_geo > 0 THEN 
>> ((power(t2.perim_geo,2) / t2.area_geo) / (4 * pi())) ELSE 0 END,
>>      npoints_geo = t2.npoints_geo,
>>      comp_npoints_geo = CASE WHEN t2.npoints_geo > 0 THEN (CASE WHEN 
>> t2.area_geo > 0 THEN ((power(t2.perim_geo,2) / t2.area_geo) / (4 * 
>> pi())) ELSE 0 END / t2.npoints_geo) ELSE 0 END,
>>      convex_ratio_geo = CASE WHEN 
>> ST_Area(ST_ConvexHull(way)::geography,true) > 0 THEN (t2.area_geo / 
>> ST_Area(ST_ConvexHull(way)::geography,true)) ELSE 1 END
>> FROM (
>>      SELECT
>>          objectid,
>>          ST_Area(way::geography,true) AS area_geo,
>>          ST_Perimeter(way::geography,true) AS perim_geo,
>>          ST_NPoints(way) AS npoints_geo
>>      FROM osm.landcover_scrubs_small_scale_2_ply) AS t2
>> WHERE (t2.objectid = t1.objectid)
>>     AND t1.objectid IN (SELECT t3.objectid FROM 
>> mini_test.osm.osm_tmp_28128_ch5 AS t3)
>>
>>
>> Which certainly fits with two nested loops, although I don't think I 
>> can infer which order the joins are in.
			
On Mon, 20 Oct 2025 at 09:37, Marco Boeringa <marco@boeringa.demon.nl> wrote:
> I am not sure why in this nested loop, two index scans on essentially
> the same key and table are executed. You can compare this bad plan with

The query contains a self-join, so that's why you're seeing the index 
scanned twice in the query plan. If that's not needed, then you should 
remove it from the query. If objectid is unique for this table then I 
don't see why you need to join the table again to access the very same 
row that you're updating. Just put those function calls in the UPDATE's 
SET clause.

(We do have self join elimination in v18, but I see that it's a bit 
overly strict in what it removes around looking for duplicate relations 
when one of them is the query's result relation. Likely that can be made 
better so it still looks for duplicate relations including the result 
relation, but just never considers removing that one, only the other 
duplicate(s).)

> *** ACTUAL BAD PLAN AS CAPTURED BY auto_explain ***:
>              ->  Index Scan using
> landcover_grassy_small_scale_2_ply_pkey on
> osm.landcover_grassy_small_scale_2_ply t1  (cost=0.38..2.39 rows=1
> width=310) (actual time=5.176..462.613 rows=222396.00 loops=1)
>                    Output: t1.way, t1.ctid, t1.objectid
>                    Index Searches: 1
>                    Buffers: shared hit=66411

This table must have been VACUUMed or ANALYZEd either when it was empty 
or when it contained 1 row. There's no predicate here, so that estimate, 
aside from clamping to 1, comes directly from pg_class.reltuples. A new 
table or truncated table would never estimate 1 row as the planner always 
plays it safe when there are no statistics generated yet and assumes 10 
pages worth of rows. I can't think of any specific reason why v18 behaves 
differently from v17 on this... Maybe you've gotten unlucky with an 
autovacuum timing thing and it's running at a slightly different time 
than in v17, perhaps because it completed the autovacuum of another table 
slightly quicker than v17 did. v18 can perform asynchronous reads for 
vacuum; maybe that could mean more vacuum_cost_page_hits and fewer 
vacuum_cost_page_misses when calculating vacuum_cost_limit.

Or, perhaps you're doing something like performing a manual VACUUM after 
the tables have had all of their rows deleted?

David
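For readers following along, a minimal sketch of what David's suggestion 
(dropping the self-join and computing directly in the SET clause) could 
look like for the UPDATE that Andres reformatted earlier in the thread. 
The derived-column expressions are copied from that query; this rewrite 
is untested against the actual workflow:

UPDATE osm.landcover_scrubs_small_scale_2_ply AS t1
SET area_geo    = ST_Area(t1.way::geography, true),
    perim_geo   = ST_Perimeter(t1.way::geography, true),
    npoints_geo = ST_NPoints(t1.way),
    compact_geo = CASE WHEN ST_Area(t1.way::geography, true) > 0
                       THEN (power(ST_Perimeter(t1.way::geography, true), 2)
                             / ST_Area(t1.way::geography, true)) / (4 * pi())
                       ELSE 0 END,
    comp_npoints_geo = CASE WHEN ST_NPoints(t1.way) > 0
                            THEN (CASE WHEN ST_Area(t1.way::geography, true) > 0
                                       THEN (power(ST_Perimeter(t1.way::geography, true), 2)
                                             / ST_Area(t1.way::geography, true)) / (4 * pi())
                                       ELSE 0 END) / ST_NPoints(t1.way)
                            ELSE 0 END,
    convex_ratio_geo = CASE WHEN ST_Area(ST_ConvexHull(t1.way)::geography, true) > 0
                            THEN ST_Area(t1.way::geography, true)
                                 / ST_Area(ST_ConvexHull(t1.way)::geography, true)
                            ELSE 1 END
WHERE t1.objectid IN (SELECT t3.objectid
                      FROM mini_test.osm.osm_tmp_28128_ch5 AS t3);

As David explains further down in the thread, this does not reduce the 
number of ST_Area / ST_Perimeter evaluations per row (the planner already 
flattens the original subquery); it only removes the redundant join.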
Hi David,

To be honest I am not a SQL wizard like some of you here on the list, but 
part of the reason I set up the query as it currently is, is that the 
PostGIS function calls like ST_Area and ST_Perimeter can be very 
expensive depending on the complexity and size of the geometry, and I 
thus want to avoid at all cost having to unnecessarily recalculate them 
multiple times in the same query. Maybe I am misunderstanding how 
PostgreSQL processes such queries, but I need the values multiple times 
to calculate some other parameters. So unless PostgreSQL is smart enough 
to cache the result and not execute ST_Area multiple times if it is used 
multiple times in the same query, I thought it wise to separate out the 
calculation and use the SELECT's results as input for the calculation of 
the other parameters. Maybe that isn't actually needed, but I think I 
remember seeing performance gains from this setup when I initially wrote 
it this way; I am not entirely sure at this point in time, it is a while 
ago. I know far too little about the internals of PostgreSQL and its 
exact query processing to say anything really sensible about this.

A couple of things that still strike me:

- As I wrote in a previous post, just before entering the Python 
multi-threaded processing that generates the jobs, I am adding a primary 
key column with unique objectids (GENERATED BY DEFAULT AS IDENTITY), that 
is subsequently used in the join. I realize this is actually an 
enhancement request, but I am wondering why, if PostgreSQL already needs 
to dig through the entire table to add a new column with unique objectid 
values, it doesn't automatically update the statistics of the newly added 
column and write out the proper number of records of the table, which 
must be known by the end of the addition of the column, to the 
'pg_class.reltuples' value you refer to. It seems to me it would remove 
the need for the ANALYZE here, and at least ensure that there are some 
useful statistics available on the vital primary key column and on the 
relation size.

- As I wrote in a previous post, the multi-threaded Python code creates 
multiple jobs (dozens). What I am seeing is that only part of the jobs is 
failing, and with some runs, none. E.g. I can run my tool 3 times without 
issues, then on the fourth run some of the jobs get the bad plan (I only 
see this behavior in PG18, I never saw it in <= PG17). In all of these 
cases, the input data is *exactly* the same. The planning behavior 
therefore appears non-deterministic. As I understood it, PostgreSQL may 
indeed have non-deterministic behavior if it switches to the genetic 
algorithm on complex queries with many joins, but I have the feeling that 
my query doesn't quite reach that level of complexity? Or am I wrong 
here, and do you consider it likely it went through the genetic 
algorithm? It would actually be desirable if EXPLAIN (and especially 
'auto_explain') output always showed whether the genetic algorithm was 
activated, so one could judge whether non-deterministic planner behavior 
is expected.

- Lastly, did you notice the likely "good" plan I posted below the "bad" 
one? I generated that one by simply copying the visible query to pgAdmin 
and hitting EXPLAIN, so it's not the real thing as from 'auto_explain', 
but it does show some marked differences between the plans. Do you have 
any comments to add as to the differences?
Marco
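As a side note on the genetic-optimizer question above: whether GEQO can 
even be involved is controlled by a couple of settings that can be 
checked directly. The threshold quoted below is the PostgreSQL default; 
the relation count is an estimate based on the query shown later in the 
thread:

SHOW geqo;            -- 'on' by default, but only used above the threshold
SHOW geqo_threshold;  -- 12 by default: GEQO kicks in at 12 or more FROM items
-- The UPDATE in question joins only a handful of relations (the target
-- table, the pulled-up subquery on the same table, and the chunk view),
-- so with default settings it is planned by the standard, deterministic
-- join search rather than by GEQO.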
Hi David,

I wrote in my last post somewhere: "In all of these cases, the input data 
is *exactly* the same." I think I need to correct this. Although the 
input base table 'landcover_grassy_small_scale_2_ply' is the same for 
each run, and the number of records it has as well, the actual contents 
of the database views that represent the multi-threading jobs (in the 
example below 'osm_tmp_28232_ch3'), and the number of records they refer 
to, *can* vary in a non-deterministic way, anywhere from 0 to about 5000 
records. Most jobs have several thousand, close to the limit of 5000, but 
some can have considerably fewer. Actually, it appears that most of the 
jobs getting the bad plan have the lower number of records (dozens to a 
few hundred, instead of a few thousand), at least that is what I saw with 
the last run, but I would need to do further testing to confirm.

That said, they are implemented as ordinary non-materialized database 
views, so they don't have their own statistics and such (the underlying 
table the views refer to is a secondary table called 
'landcover_grassy_small_scale_2_ply_pg', derived from 
'landcover_grassy_small_scale_2_ply', that was visible in the 
'auto_explain' output). So this shouldn't actually be that relevant?

Marco

See query below for reference:

UPDATE osm.landcover_grassy_small_scale_2_ply AS t1
SET area_geo = t2.area_geo,
    perim_geo = t2.perim_geo,
    compact_geo = CASE WHEN t2.area_geo > 0
                       THEN ((power(t2.perim_geo, 2) / t2.area_geo) / (4 * pi()))
                       ELSE 0 END,
    npoints_geo = t2.npoints_geo,
    comp_npoints_geo = CASE WHEN t2.npoints_geo > 0
                            THEN (CASE WHEN t2.area_geo > 0
                                       THEN ((power(t2.perim_geo, 2) / t2.area_geo) / (4 * pi()))
                                       ELSE 0 END / t2.npoints_geo)
                            ELSE 0 END,
    convex_ratio_geo = CASE WHEN ST_Area(ST_ConvexHull(way)::geography, TRUE) > 0
                            THEN (t2.area_geo / ST_Area(ST_ConvexHull(way)::geography, TRUE))
                            ELSE 1 END
FROM (
    SELECT
        objectid,
        ST_Area(way::geography, TRUE) AS area_geo,
        ST_Perimeter(way::geography, TRUE) AS perim_geo,
        ST_NPoints(way) AS npoints_geo
    FROM osm.landcover_grassy_small_scale_2_ply) AS t2
WHERE (t2.objectid = t1.objectid)
    AND t1.objectid IN (
        SELECT t3.objectid
        FROM mini_test.osm.osm_tmp_28232_ch3 AS t3)
Hi David,

Looking through the 'auto_explain' output of the bad query plan, I 
noticed the below included clause as generated by the planner. In the 
context of what I actually wrote above about the desire to not run 
expensive function calls like ST_Area multiple times, do I understand it 
correctly from the 'auto_explain' output excerpt that PostgreSQL, by 
removing the self join, actually *does* run the ST_Area and ST_Perimeter 
multiple times? Is this how I need to interpret this part of the 
'auto_explain' output? If there is no caching of the function result, 
this could be expensive as well.

Marco

*** 'auto_explain' output excerpt ***:

st_area((landcover_grassy_small_scale_2_ply.way)::geography, TRUE),
st_perimeter((landcover_grassy_small_scale_2_ply.way)::geography, TRUE),
CASE WHEN (st_area((landcover_grassy_small_scale_2_ply.way)::geography, TRUE) > '0'::double PRECISION)
    THEN ((power(st_perimeter((landcover_grassy_small_scale_2_ply.way)::geography, TRUE), '2'::double PRECISION)
           / st_area((landcover_grassy_small_scale_2_ply.way)::geography, TRUE))
          / '12.566370614359172'::double PRECISION)
    ELSE '0'::double PRECISION END,
st_npoints(landcover_grassy_small_scale_2_ply.way),
CASE WHEN (st_npoints(landcover_grassy_small_scale_2_ply.way) > 0)
    THEN (CASE WHEN (st_area((landcover_grassy_small_scale_2_ply.way)::geography, TRUE) > '0'::double PRECISION)
               THEN ((power(st_perimeter((landcover_grassy_small_scale_2_ply.way)::geography, TRUE), '2'::double PRECISION)
                      / st_area((landcover_grassy_small_scale_2_ply.way)::geography, TRUE))
                     / '12.566370614359172'::double PRECISION)
               ELSE '0'::double PRECISION END
          / (st_npoints(landcover_grassy_small_scale_2_ply.way))::double PRECISION)
    ELSE '0'::double PRECISION END,
CASE WHEN (st_area((st_convexhull(t1.way))::geography, TRUE) > '0'::double PRECISION)
    THEN (st_area((landcover_grassy_small_scale_2_ply.way)::geography, TRUE)
          / st_area((st_convexhull(t1.way))::geography, TRUE))
    ELSE '1'::double PRECISION END,
t1.ctid, landcover_grassy_small_scale_2_ply.ctid,
landcover_grassy_small_scale_2_ply_pg.ctid
On Tue, 21 Oct 2025 at 03:42, Marco Boeringa <marco@boeringa.demon.nl> wrote:
> Looking through the 'auto_explain' output of the bad query plan, I
> noticed the below included clause as generated by the planner. In the
> context of what I actually wrote above about the desire to not run
> expensive function calls like ST_Area multiple times, do I understand it
> correctly from the 'auto_explain' output excerpt that PostgreSQL, by
> removing the self join, actually *does* run the ST_Area and ST_Perimeter
> multiple times? Is this how I need to interpret this part of the
> 'auto_explain' output? If there is no caching of the function result,
> this could be expensive as well.

So you basically have something like:

UPDATE t t1 SET col1 = t2.a1, col2 = t2.a2
FROM (SELECT unique_col, f1(col3) as a1, f2(col4) as a2 FROM t) AS t2
WHERE t1.unique_col = t2.unique_col
AND <other filter clauses>

Assuming here that unique_col has a UNIQUE or PK constraint. The self 
join basically amounts to wasted effort. There is no function result 
caching anywhere. Looking at the EXPLAIN output, it seems those functions 
are executed once per row that's output from the join and just below the 
"Update" node and they're executed 8 times. That won't change if you get 
rid of the self join.

David
Hi David,

Thanks for the explanation. Back when I developed this, I am pretty sure 
I tried to find out the answer to this, but was left somewhat confused as 
to the behavior of subqueries in relation to this aspect of avoiding 
unnecessary work from costly functions. E.g. this very old StackOverflow 
post by Erwin Brandstetter, whose name I have seen come up a lot in that 
discussion forum and whose answers suggest a pretty thorough knowledge of 
PostgreSQL and databases in general, at least suggested a subquery could 
work:

https://stackoverflow.com/questions/20718499/does-postgresql-cache-function-calls#comment31095072_20718499

And in relation to that post and thread, and the suggestion of WITH / 
CTE clause, would that be a suitable substitute and avoid the recalling 
of the functions? I assume with the MATERIALIZED option, it should, that 
is what the MATERIALIZED option is for, isn't it?

Marco
On Tue, 21 Oct 2025 at 09:02, Marco Boeringa <marco@boeringa.demon.nl> wrote:
> https://stackoverflow.com/questions/20718499/does-postgresql-cache-function-calls#comment31095072_20718499
>
> And in relation to that post and thread, and the suggestion of WITH /
> CTE clause, would that be a suitable substitute and avoid the recalling
> of the functions? I assume with the MATERIALIZED option, it should, that
> is what the MATERIALIZED option is for, isn't it?

That article states "this function is invoked multiple times with the 
same parameter", so doesn't sound very applicable for your case since 
your function parameter changes with every row.

I don't see how WITH MATERIALIZED could help you here as that's not a 
parameterized cache. I suppose we could adjust the planner to consider 
something similar to Memoize for caching results for expensive stable 
functions. We'd have to put a lot of trust into n_distinct estimates and 
the function(s) COST setting, however.

I suspect you're trying to optimise for something that's not an actual 
problem.

David
Hi David,

PostGIS functions can be very expensive, especially in the context of the 
fact that Polygon and Line geometries can vary vastly in size in terms of 
the number of vertices that constitute them, which has a profound impact 
on some PostGIS function calls, merely due to the enormous complexity of 
some shapes.

But of course you're right that any change will need some thorough 
testing before assuming it will actually benefit the queries.

Marco
On Tue, 21 Oct 2025 at 10:57, Marco Boeringa <marco@boeringa.demon.nl> wrote:
> PostGIS functions can be very expensive, especially in the context of
> the fact that Polygon and Line geometries can vary vastly in size in
> terms of the number of vertices that constitute them, which has a
> profound impact on some PostGIS function calls, merely due to the
> enormous complexity of some shapes.

How expensive the function call is is irrelevant as there simply is no 
function result caching going on and there's nothing wired up in any 
released version of PostgreSQL which gives you this with the query you've 
written.

You still seem to be under the illusion that the self-join is giving you 
some sort of caching. If you remain content in not trusting me on that, 
by all means, create a plpgsql function with a RAISE NOTICE and try it 
out for yourself.

> But of course you're right that any change will need some thorough
> testing before assuming it will actually benefit the queries.

I don't recall talking about testing... (It may help if you quote things 
you're replying to. This conversation will be quite hard to follow with 
your top post replies.)

This whole conversation has drifted well off what the original report was 
about, so I think it's better if you need more help on this to use 
pgsql-performance@lists.postgresql.org

David
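A minimal sketch of the kind of throwaway test David suggests here. The 
function name is hypothetical, and the body just wraps ST_Area so that 
every evaluation becomes visible as a NOTICE (assumes PostGIS is 
installed):

-- Hypothetical wrapper used only to count evaluations.
CREATE FUNCTION noisy_st_area(g geometry) RETURNS double precision
LANGUAGE plpgsql STABLE AS
$$
BEGIN
    RAISE NOTICE 'noisy_st_area called';
    RETURN ST_Area(g::geography, true);
END;
$$;

-- Substitute noisy_st_area(way) for ST_Area(way::geography, true) in a copy
-- of the UPDATE (or a small SELECT) and count the NOTICE lines: one line per
-- evaluation, showing there is no result caching across expressions.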
Hi David,

I totally understood your remarks about no caching going on, and don't 
intend to keep the query as-is with self join if I decide to make and 
benchmark any changes. Any changes will certainly either involve a WITH 
/ CTE materialized, or first writing out some temporary table with the 
results of the expensive function calls to be able to re-use them from 
there. Whether any of that will result in any measurable change and 
benefit will require thorough testing.

Yes, you are right this has drifted. I don't need any more help and this 
issue can essentially be closed; I just wanted to report the fact that an 
application and workflow that has run for years without issues was 
heavily affected by the (planner) changes in PG18. I have submitted all 
data I can on this, and the culprit is now clear, also thanks to Andres's 
useful (debugging) suggestions. It is also clear what the solution in my 
case is: the extra ANALYZE run I added to my code.

So, as said, I don't need more help, although answers to some of the 
unanswered questions regarding specific observations I made (e.g. PG18 
versus PG17 and lower) would still be welcome. To repeat from previous 
posts:

- As I wrote in a previous post, just before entering the Python 
multi-threaded processing that generates the jobs, I am adding a primary 
key column with unique objectids (GENERATED BY DEFAULT AS IDENTITY), that 
is subsequently used in the join. I realize this is actually an 
enhancement request, but I am wondering why, if PostgreSQL already needs 
to dig through the entire table to add a new column with unique objectid 
values, it doesn't automatically update the statistics of the newly added 
column and write out the proper number of records of the table, which 
must be known by the end of the addition of the column, to the 
'pg_class.reltuples' value you refer to. It seems to me it would remove 
the need for the ANALYZE here, and at least ensure that there are some 
useful statistics available on the vital primary key column and on the 
relation size.

- As I wrote in a previous post, the multi-threaded Python code creates 
multiple jobs (dozens). What I am seeing is that only part of the jobs is 
failing, and with some runs, none. E.g. I can run my tool 3 times without 
issues, then on the fourth run some of the jobs get the bad plan (I only 
see this behavior in PG18, I never saw it in <= PG17). In all of these 
cases, the input data is *exactly* the same. The planning behavior 
therefore appears non-deterministic. As I understood it, PostgreSQL may 
indeed have non-deterministic behavior if it switches to the genetic 
algorithm on complex queries with many joins, but I have the feeling that 
my query doesn't quite reach that level of complexity? Or am I wrong 
here, and do you consider it likely it went through the genetic 
algorithm? It would actually be desirable if EXPLAIN (and especially 
'auto_explain') output always showed whether the genetic algorithm was 
activated, so one could judge whether non-deterministic planner behavior 
is expected.

But having no answers to these questions is not a major issue, it is just 
my curiosity.
Thanks for your answers and contributions so far,

Marco
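The fix Marco settles on above, running ANALYZE right after the IDENTITY 
primary key column is added so the planner has fresh reltuples and column 
statistics before the multi-threaded UPDATEs start, might look roughly 
like this. This is only a sketch: the table name is taken from the 
thread, while the bigint type and the PRIMARY KEY placement in the same 
statement are assumptions about the actual tool's DDL:

-- Add the job/join key just before the multi-threaded processing starts.
ALTER TABLE osm.landcover_grassy_small_scale_2_ply
    ADD COLUMN objectid bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY;

-- Refresh pg_class.reltuples and the column statistics so the planner no
-- longer works from stale (e.g. 1-row) estimates for this table.
ANALYZE osm.landcover_grassy_small_scale_2_ply;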
On Tue, 21 Oct 2025 at 11:46, Marco Boeringa <marco@boeringa.demon.nl> wrote:
> I totally understood your remarks about no caching going on, and don't
> intend to keep the query as-is with self join if I decide to make and
> benchmark any changes. Any changes will certainly either involve a WITH
> / CTE materialized, or first writing out some temporary table with the
> results of the expensive function calls to be able to re-use them from
> there.

To get what you want, you'd need to do something like add OFFSET 0 to the 
subquery. That would prevent the planner from pulling it up into the main 
query. However, if you do that, it'll mean running those PostGIS 
functions on every row in the landcover_grassy_small_scale_2_ply table. 
You could get around that by either moving or duplicating the 
"t1.objectid IN (SELECT t3.objectid FROM mini_test.osm.osm_tmp_28232_ch3 
AS t3)" and putting it in the WHERE clause of the subquery.

David
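For completeness, a sketch of the rewrite David describes here, applied 
to the query posted earlier in the thread. The OFFSET 0 fence and the 
duplicated chunk filter are the only changes, and this is untested 
against the actual workflow:

UPDATE osm.landcover_grassy_small_scale_2_ply AS t1
SET area_geo = t2.area_geo,
    perim_geo = t2.perim_geo,
    compact_geo = CASE WHEN t2.area_geo > 0
                       THEN ((power(t2.perim_geo, 2) / t2.area_geo) / (4 * pi()))
                       ELSE 0 END,
    npoints_geo = t2.npoints_geo,
    comp_npoints_geo = CASE WHEN t2.npoints_geo > 0
                            THEN (CASE WHEN t2.area_geo > 0
                                       THEN ((power(t2.perim_geo, 2) / t2.area_geo) / (4 * pi()))
                                       ELSE 0 END / t2.npoints_geo)
                            ELSE 0 END,
    convex_ratio_geo = CASE WHEN ST_Area(ST_ConvexHull(way)::geography, TRUE) > 0
                            THEN (t2.area_geo / ST_Area(ST_ConvexHull(way)::geography, TRUE))
                            ELSE 1 END
FROM (
    SELECT
        objectid,
        ST_Area(way::geography, TRUE) AS area_geo,
        ST_Perimeter(way::geography, TRUE) AS perim_geo,
        ST_NPoints(way) AS npoints_geo
    FROM osm.landcover_grassy_small_scale_2_ply
    WHERE objectid IN (SELECT t3.objectid
                       FROM mini_test.osm.osm_tmp_28232_ch3 AS t3)
    OFFSET 0  -- keeps the planner from pulling the subquery up into the main query
) AS t2
WHERE (t2.objectid = t1.objectid)
    AND t1.objectid IN (SELECT t3.objectid
                        FROM mini_test.osm.osm_tmp_28232_ch3 AS t3)

With the chunk filter duplicated inside the subquery, the expensive 
functions are evaluated only for the chunk's rows; without it, the 
OFFSET 0 fence alone would force them to run over the whole table, which 
is the caveat David points out.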