Re: Potential "AIO / io workers" inter-worker locking issue in PG18?

From: Thom Brown
Subject: Re: Potential "AIO / io workers" inter-worker locking issue in PG18?
Msg-id: CAA-aLv4=V9PgYe1B41LnN0gAN8Mux2Ci9kdf_35eQjO7-GbfBw@mail.gmail.com
In response to: Potential "AIO / io workers" inter-worker locking issue in PG18?  (Marco Boeringa <marco@boeringa.demon.nl>)
Responses: Re: Potential "AIO / io workers" inter-worker locking issue in PG18?
           Re: Potential "AIO / io workers" inter-worker locking issue in PG18?
List: pgsql-bugs
On Sun, 5 Oct 2025, 10:52 Marco Boeringa, <marco@boeringa.demon.nl> wrote:
Hi,

I currently run PG18 + PostGIS 3.6.0 on an Ubuntu 24.04 guest running as
a Windows 10 Hyper-V virtual machine.

The machine is a dedicated refurbished HP Z840 local workstation with
2x22 cores (E5-2699 v4) with 512 GB RAM and a 10 TB NVMe raid-0, with
the Ubuntu guest having 400 GB RAM available.

On this machine, which is dedicated to just one custom-written
geoprocessing workflow involving OpenStreetMap data, I have successfully
processed data up to the scale of the global OpenStreetMap Facebook
Daylight distribution, including a > 2.4 billion record polygon table
for all Facebook Daylight buildings. So this has proven a very capable
system.

However, after upgrading to PG18 and switching to the "io_method =
worker" setting (tested with 3, 5, 16 and 22 workers), I am seeing what
appears to be a major issue where io workers potentially get into some
sort of locking conflict that takes hours to resolve.

The custom-written geoprocessing workflow uses Python multi-threading
based on the Python 'concurrent.futures' framework, in combination with
either pyodbc or psycopg2 as the database connector, to implement a
powerful parallel processing solution that speeds up some of the
computationally intensive tasks (which use UPDATEs). I generally run it
with up to 44 threads to fully saturate the dual-CPU 44-core system. The custom code
creates a pool of jobs to process for the threads, with the code being
designed to minimize inter-thread locking issues by taking into account
PostgreSQL page locality (although the actual records to process are not
assigned by pages but by unique IDs in the tables). Basically, the code
is designed such that different threads never attempt to access the same
database pages, as each thread gets its own unique pages assigned, thus
avoiding inter-thread locking conflicts. This has worked really well in
the past, with system usage maximized over all cores and significantly
speeding up processing. Jobs are implemented as database VIEWs, that
point to the records to process via the unique ID of each. These views
must of course be read by each thread, which is probably where the PG18
io workers kick-in.
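A minimal sketch of that partitioning scheme (the function and variable names here are hypothetical illustrations, not the actual code; the real workflow uses pyodbc/psycopg2 connections and database VIEWs rather than in-memory lists):

```python
from concurrent.futures import ThreadPoolExecutor

def partition_ids(ids, n_jobs, ids_per_page):
    """Group IDs into contiguous runs of `ids_per_page`, so that each run
    approximates one heap page worth of rows, then deal whole runs out to
    jobs round-robin.  Jobs therefore never share a run (page)."""
    runs = [ids[i:i + ids_per_page] for i in range(0, len(ids), ids_per_page)]
    jobs = [[] for _ in range(n_jobs)]
    for i, run in enumerate(runs):
        jobs[i % n_jobs].extend(run)
    return jobs

def process_job(job_ids):
    # Placeholder for the per-thread work: in the real workflow each
    # thread opens its own database connection and runs UPDATEs against
    # a VIEW restricted to exactly these IDs.
    return len(job_ids)

if __name__ == "__main__":
    ids = list(range(1000))
    jobs = partition_ids(ids, n_jobs=4, ids_per_page=50)
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(process_job, jobs))
    print(results)  # → [250, 250, 250, 250]
```

Because the runs are disjoint, no two threads ever UPDATE rows on the same (approximate) page, which is what avoided inter-thread lock contention through PG17.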

This has worked really well in previous versions of PostgreSQL (tested
up to PG17). However, in PG18, during the multi-threaded processing, I
see some of my submitted jobs (in this case run against a small
OpenStreetMap Italy extract from Geofabrik) suddenly take > 1 hour to
finish (up to 6 hours for this small extract), even though similar jobs
from the same processing step finish in less than 10 seconds (and the
affected jobs should as well). This seems to happen somewhat randomly:
many multi-threaded tasks before and after the affected processing
steps finish normally.

When this happens, I observe the following things:

- High processor activity: even though the affected jobs should finish
in seconds, they take hours, all the while showing high core usage.

- PgAdmin shows all sessions created by the Python threads as 'active',
with *no* wait events attached.

- The pg_locks table does not show locking conflicts; all locks are
granted. I did notice, however, that the relation / table locks were not
"fastpath" locks, but ordinary ones. All other locks taken, e.g. on
indexes related to the same table, were fastpath. I don't know if this
has any relevance, though, as from what I read about the difference, it
shouldn't cause a difference this big, certainly not seconds versus hours.

- Please note that the processing DOES eventually proceed, so it is not
an infinite deadlock or something requiring me to kill my Python code.
It just takes hours to resolve.

- Switching to "io_method = sync" seems to resolve this issue, and I do
not observe some jobs of the same batch getting "stuck". This is the
behavior I was used to seeing in <=PG17.
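For reference, these are the settings involved (values shown are the ones from my tests; `io_method` and `io_workers` are the actual PG18 parameters, and `worker` is the PG18 default):

```
# postgresql.conf -- PG18 AIO settings tested
io_method  = worker    # PG18 default; 'sync' restores the pre-18 behavior
io_workers = 3         # also tested with 5, 16 and 22
```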

I am not too familiar with all the internals of PostgreSQL and the new
AIO framework and its "io workers". However, it seems there may be some
sort of locking issue between io workers that can occasionally happen in
PG18 with "io_method = worker"? Is anyone else observing similar
issues in heavily multi-threaded processing workflows?

So, to confirm, you get the issue with as few as 3 io_workers?

Also, what is pg_aios telling you during this time?
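Something along these lines, run while the jobs appear stuck, would be a useful snapshot (pg_aios is the new PG18 system view for in-flight asynchronous I/O handles; check the PG18 documentation for the exact column list):

```
-- Snapshot of in-flight asynchronous I/Os while jobs are stuck
SELECT pid, state, operation, target, target_desc
FROM pg_aios;
```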

Thom
