Potential "AIO / io workers" inter-worker locking issue in PG18?
From | Marco Boeringa |
---|---|
Subject | Potential "AIO / io workers" inter-worker locking issue in PG18? |
Date | |
Msg-id | 32179d3f-3c28-4d71-88c6-f6678feaea4a@boeringa.demon.nl |
Responses | Re: Potential "AIO / io workers" inter-worker locking issue in PG18? |
List | pgsql-bugs |
Hi,

I currently run PG18 + PostGIS 3.6.0 on an Ubuntu 24.04 guest running as a Windows 10 Hyper-V virtual machine. The machine is a dedicated refurbished HP Z840 workstation with 2x22 cores (E5-2699 v4), 512 GB RAM and a 10 TB NVMe RAID-0, with the Ubuntu guest having 400 GB RAM available. On this machine, which is dedicated to a single custom-written geoprocessing workflow involving OpenStreetMap data, I have successfully processed data up to the scale of the global OpenStreetMap Facebook Daylight distribution, including a > 2.4 billion record polygon table for all Facebook Daylight buildings. So this has proven a very capable system.

However, after upgrading to PG18 and switching to the "io_method = worker" setting (tested with 3, 5, 16 and 22 io workers), I am seeing what appears to be a major issue where io workers potentially get into some sort of locking conflict that takes hours to resolve.

The custom-written geoprocessing workflow uses Python multi-threading based on the 'concurrent.futures' framework, in combination with either pyodbc or psycopg2 as the database connector, to implement a parallel processing solution that speeds up some of the computationally intensive tasks (which use UPDATEs). I generally run it with up to 44 threads to fully saturate the dual-CPU, 44-core system. The code creates a pool of jobs for the threads to process and is designed to minimize inter-thread locking issues by taking PostgreSQL page locality into account (although the records to process are assigned not by pages but by unique IDs in the tables). Basically, different threads never attempt to access the same database pages, as each thread gets its own unique pages assigned, thus avoiding inter-thread locking conflicts. This has worked really well in the past, maximizing usage across all cores and significantly speeding up processing. Jobs are implemented as database VIEWs that point to the records to process via their unique IDs. These views must of course be read by each thread, which is probably where the PG18 io workers kick in. All of this has worked really well in previous versions of PostgreSQL (tested up to PG17).
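To make the pattern concrete, a rough sketch of the per-thread job pattern is shown below. This is a simplified illustration only, with hypothetical connection, table, view and column names; the real workflow is considerably more involved:

    # Simplified sketch of the per-thread job pattern (hypothetical names).
    # Each thread opens its own connection and updates only the rows listed
    # in 'its' job view, so no two threads touch the same table pages.
    import concurrent.futures
    import psycopg2

    DSN = "dbname=osm user=postgres"                      # hypothetical DSN
    JOB_VIEWS = [f"job_view_{i}" for i in range(44)]      # one view per job

    def run_job(view_name):
        conn = psycopg2.connect(DSN)
        try:
            # 'with conn' commits on success and rolls back on error.
            with conn, conn.cursor() as cur:
                cur.execute(
                    f"UPDATE osm_polygon AS t "            # hypothetical table
                    f"SET attr = expensive_calc(t.geom) "  # hypothetical column/function
                    f"FROM {view_name} AS v "
                    f"WHERE t.id = v.id"
                )
        finally:
            conn.close()
        return view_name

    with concurrent.futures.ThreadPoolExecutor(max_workers=44) as pool:
        for finished in pool.map(run_job, JOB_VIEWS):
            print("finished", finished)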
However, in PG18, during the multi-threaded processing, I see some of my submitted jobs, in this case run against a small OpenStreetMap Italy extract from Geofabrik, all of a sudden take > 1 hour to finish (up to 6 hours for this small extract), even though similar jobs from the same processing step finish in less than 10 seconds (and the affected jobs should as well). This seems to happen more or less at random. Many multi-threaded tasks before and after the affected processing steps finish normally.

When this happens, I observe the following:

- High processor activity: the jobs that should finish in seconds take hours, all the while showing high core usage.
- PgAdmin shows all sessions created by the Python threads as 'active', with *no* wait events attached.
- The pg_locks table does not show locking conflicts; all locks are granted. I did notice, however, that the relation/table locks were not "fastpath" locks but ordinary ones, whereas all other locks taken, e.g. on the indexes of the same table, were fastpath. I don't know whether this is relevant, as from what I have read about the difference it should not explain seconds versus hours (the kind of checks involved are sketched at the end of this message).
- Please note that the processing DOES eventually proceed, so it is not an infinite deadlock or something where I need to kill my Python code. It just takes hours to resolve.
- Switching to "io_method = sync" seems to resolve the issue, and I no longer observe jobs of the same batch getting "stuck". This is the behavior I was used to seeing in <= PG17.

I am not too familiar with all the internals of PostgreSQL and the new AIO framework and its "io workers". However, it seems there may be some sort of locking issue between io workers that can occasionally occur in PG18 with "io_method = worker"? Is anyone else observing similar issues in heavily multi-threaded processing workflows?
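A minimal sketch of the kind of checks mentioned above, assuming a psycopg2 connection and the hypothetical DSN from the earlier sketch, run from a separate session while the jobs appear stuck:

    # Diagnostic sketch: session state, wait events and lock fastpath status
    # while the parallel jobs appear stuck (hypothetical DSN).
    import psycopg2

    conn = psycopg2.connect("dbname=osm user=postgres")
    cur = conn.cursor()

    cur.execute("SHOW io_method")                 # 'worker' vs 'sync'
    print("io_method:", cur.fetchone()[0])

    # Active sessions and their wait events (no wait event = busy on CPU).
    cur.execute("""
        SELECT pid, state, wait_event_type, wait_event, left(query, 60)
        FROM pg_stat_activity
        WHERE state = 'active' AND pid <> pg_backend_pid()
    """)
    for row in cur.fetchall():
        print(row)

    # Relation-level locks: granted status and whether they are fastpath.
    cur.execute("""
        SELECT locktype, relation::regclass, mode, granted, fastpath
        FROM pg_locks
        WHERE relation IS NOT NULL
    """)
    for row in cur.fetchall():
        print(row)

    conn.close()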
Marco