Re: Potential "AIO / io workers" inter-worker locking issue in PG18?

From Markus KARG
Subject Re: Potential "AIO / io workers" inter-worker locking issue in PG18?
Date
Msg-id 0b2bdf07-59a1-4194-9f8d-cab885d28908@headcrashing.eu
In reply to Potential "AIO / io workers" inter-worker locking issue in PG18?  (Marco Boeringa <marco@boeringa.demon.nl>)
List pgsql-bugs
I am not a PostgreSQL contributor and have no clue about the actual 
technical details of the new AIO code, but reading your report, the 
following questions came to mind:

* Does the failure also happen with io_method = io_uring? If not, that 
is strong evidence that it really is bound to io_method = worker, not 
to AIO in general.
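
For reference (and assuming your PG18 build was compiled with liburing 
support), switching between the AIO modes is just a postgresql.conf 
change; io_method needs a restart:

    io_method = io_uring     # PG18 alternatives: worker (default), sync
    #io_workers = 3          # only used with io_method = worker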

* Does the failure also happen with io_method = worker when your Python 
code uses only 22 cores and PostgreSQL uses only 22 workers (so Python 
and PostgreSQL do not share CPU cores)? If not, it might indicate that 
the problem could be solved by raising the scheduling priority of 
PostgreSQL, i.e. giving the scheduler a hint that a CPU core should be 
handed to PostgreSQL FIRST: Python most likely is waiting on PostgreSQL 
to continue, but PostgreSQL cannot continue because the scheduler gave 
all the cores to Python... (a classic starvation scenario rather than a 
true deadlock, since it eventually resolves once enough CPU cores 
become free to let the starved process finish).
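
For the second experiment, one simple way to keep the two sides off 
each other's cores is to restrict the Python client itself; a rough 
sketch of the client side (Linux only; the 22/22 split and the core 
numbers are just an example, not a recommendation):

    import os
    from concurrent.futures import ThreadPoolExecutor

    # Pin this Python process (and all of its threads) to cores 22-43,
    # leaving cores 0-21 free for the PostgreSQL backends and io workers.
    os.sched_setaffinity(0, range(22, 44))

    # Optionally lower the client's priority as well, so that PostgreSQL
    # wins whenever both sides are runnable at the same time.
    os.nice(10)

    # Match the thread count to the cores the client now owns.
    pool = ThreadPoolExecutor(max_workers=22)

PostgreSQL itself is not pinned here; the point is only that the client 
no longer competes for the first 22 cores.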

HTH

-Markus


On 05.10.2025 at 10:55, Marco Boeringa wrote:
> Hi,
>
> I currently run PG18 + PostGIS 3.6.0 on an Ubuntu 24.04 guest VM 
> running under Windows 10 Hyper-V.
>
> The machine is a dedicated refurbished HP Z840 workstation with 2x22 
> cores (E5-2699 v4), 512 GB RAM and a 10 TB NVMe RAID-0; the Ubuntu 
> guest has 400 GB of that RAM available.
>
> On this machine, which is dedicated to just one custom-written 
> geoprocessing workflow involving OpenStreetMap data, I have 
> successfully processed data up to the global OpenStreetMap Facebook 
> Daylight distribution, including a Polygon table of more than 2.4 
> billion records for all Facebook Daylight buildings. So this has 
> proven to be a very capable system.
>
> However, after upgrading to PG18 and switching to the "io_method = 
> worker" setting (tested with 3, 5, 16 and 22 io workers), I am seeing 
> what appears to be a major issue where io workers potentially get 
> into some sort of locking conflict that takes hours to resolve.
>
> The custom-written geoprocessing workflow uses Python multi-threading 
> based on the 'concurrent.futures' framework, in combination with 
> either pyodbc or psycopg2 as the database connector, to implement a 
> powerful parallel processing solution that speeds up some of the 
> computationally intensive tasks (which use UPDATEs). I generally run 
> it with up to 44 threads to fully saturate the dual-CPU, 44-core 
> system. The custom code creates a pool of jobs for the threads to 
> process, and is designed to minimize inter-thread locking issues by 
> taking PostgreSQL page locality into account (although the actual 
> records to process are assigned not by pages but by unique IDs in the 
> tables). Basically, the code is designed such that different threads 
> never attempt to access the same database pages, as each thread gets 
> its own unique pages assigned, thus avoiding inter-thread locking 
> conflicts. This has worked really well in the past, maximizing system 
> usage over all cores and significantly speeding up processing. Jobs 
> are implemented as database VIEWs that point to the records to 
> process via the unique ID of each. These views must of course be read 
> by each thread, which is probably where the PG18 io workers kick in.
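> 
> In (heavily simplified) form, the pattern looks roughly like this; 
> the DSN, the table/view names and the SQL are of course just 
> placeholders for the real code:
> 
>     import psycopg2
>     from concurrent.futures import ThreadPoolExecutor
> 
>     DSN = "dbname=osm user=postgres"                   # placeholder
>     JOB_VIEWS = [f"job_view_{i}" for i in range(200)]  # one VIEW per job
> 
>     def run_job(view_name):
>         # Each thread gets its own connection and only touches the
>         # records (and therefore pages) exposed by its own VIEW.
>         conn = psycopg2.connect(DSN)
>         try:
>             with conn.cursor() as cur:
>                 cur.execute(
>                     f"UPDATE buildings b SET area = ST_Area(b.geom) "
>                     f"FROM {view_name} v WHERE b.unique_id = v.unique_id")
>             conn.commit()
>         finally:
>             conn.close()
>         return view_name
> 
>     with ThreadPoolExecutor(max_workers=44) as pool:
>         list(pool.map(run_job, JOB_VIEWS))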
>
> This has worked really well in previous versions of PostgreSQL 
> (tested up to PG17). However, in PG18, during the multi-threaded 
> processing, I see some of my submitted jobs, in this case run against 
> a small OpenStreetMap Italy extract from Geofabrik, all of a sudden 
> take more than 1 hour to finish (up to 6 hours for this small 
> extract), even though similar jobs from the same processing step 
> finish in less than 10 seconds (and the affected jobs should as 
> well). This seems to happen somewhat at random. Many multi-threaded 
> tasks before and after the affected processing steps finish normally.
>
> When this happens, I observe the following things:
>
> - High processor activity: the jobs that should finish in seconds 
> take hours, all the while showing high core usage.
>
> - PgAdmin shows all sessions created by the Python threads as 
> 'active', with *no* wait events attached.
>
> - The pg_locks table does not show locking conflicts; all locks are 
> granted (see the simple pg_locks / pg_stat_activity check sketched 
> below this list). I did notice, however, that the relation/table 
> locks were not "fastpath" locks but ordinary ones. All other locks 
> taken, e.g. on indexes of the same table, were fastpath. I don't know 
> if this has any relevance though, as from what I read about the 
> difference, it shouldn't cause such a big difference, certainly not 
> seconds versus hours.
>
> - Please note that the processing DOES eventually proceed, so it is 
> not an infinite deadlock or something where I need to kill my Python 
> code. It just takes hours to resolve.
>
> - Switching to "io_method = sync" seems to resolve this issue, and I 
> do not observe some jobs of the same batch getting "stuck". This is 
> the behavior I was used to seeing in <=PG17.
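> 
> For reference, the lock check mentioned above amounts to a simple 
> join of pg_locks and pg_stat_activity, roughly:
> 
>     SELECT a.pid, a.state, a.wait_event_type, a.wait_event,
>            l.locktype, l.mode, l.granted, l.fastpath,
>            l.relation::regclass AS relation
>     FROM pg_locks l
>     JOIN pg_stat_activity a ON a.pid = l.pid
>     WHERE a.datname = current_database()
>     ORDER BY a.pid, l.granted;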
>
> I am not too familiar with all the internals of PostgreSQL and the 
> new AIO framework and its "io workers". However, it seems there may 
> be some sort of locking issue between io workers that can 
> occasionally occur in PG18 with "io_method = worker"? Is anyone else 
> observing similar issues in highly multi-threaded processing 
> workflows?
>
> Marco
>
>
>


