Potential "AIO / io workers" inter-worker locking issue in PG18?
From | Marco Boeringa |
---|---|
Subject | Potential "AIO / io workers" inter-worker locking issue in PG18? |
Date | |
Msg-id | 32179d3f-3c28-4d71-88c6-f6678feaea4a@boeringa.demon.nl |
Responses | Re: Potential "AIO / io workers" inter-worker locking issue in PG18? |
List | pgsql-bugs |
Hi,

I currently run PG18 + PostGIS 3.6.0 on an Ubuntu 24.04 guest running as a Windows 10 Hyper-V virtual machine. The machine is a dedicated refurbished HP Z840 workstation with 2x22 cores (E5-2699 v4), 512 GB RAM and a 10 TB NVMe RAID-0, with the Ubuntu guest having 400 GB RAM available. On this machine, which is dedicated to a single custom-written geoprocessing workflow involving OpenStreetMap data, I have successfully processed data up to the scale of the global OpenStreetMap Facebook Daylight distribution, including a > 2.4 billion record polygon table for all Facebook Daylight buildings. So this has proven a very capable system.

However, after upgrading to PG18 and switching to the "io_method = worker" setting (tested with 3, 5, 16 and 22 io workers), I am seeing what appears to be a major issue where io workers potentially get into some sort of locking conflict that takes hours to resolve.

The custom-written geoprocessing workflow uses Python multi-threading based on the 'concurrent.futures' framework, in combination with either pyodbc or psycopg2 as the database connector, to implement a parallel processing solution that speeds up some of the computationally intensive tasks (which use UPDATEs). I generally run it with up to 44 threads to fully saturate the dual-CPU, 44-core system. The code creates a pool of jobs for the threads to process and is designed to minimize inter-thread locking issues by taking PostgreSQL page locality into account (although the records to process are assigned not by pages but by unique IDs in the tables). Basically, different threads never attempt to access the same database pages, as each thread gets its own unique pages assigned, thus avoiding inter-thread locking conflicts. This has worked really well in the past, maximizing usage across all cores and significantly speeding up processing. Jobs are implemented as database VIEWs that point to the records to process via their unique IDs. These views must of course be read by each thread, which is probably where the PG18 io workers kick in. All of this has worked really well in previous versions of PostgreSQL (tested up to PG17).
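To make the pattern concrete, a rough sketch of the per-thread job pattern is shown below. This is a simplified illustration only, with hypothetical connection, table, view and column names; the real workflow is considerably more involved:

    # Simplified sketch of the per-thread job pattern (hypothetical names).
    # Each thread opens its own connection and updates only the rows listed
    # in 'its' job view, so no two threads touch the same table pages.
    import concurrent.futures
    import psycopg2

    DSN = "dbname=osm user=postgres"                      # hypothetical DSN
    JOB_VIEWS = [f"job_view_{i}" for i in range(44)]      # one view per job

    def run_job(view_name):
        conn = psycopg2.connect(DSN)
        try:
            # 'with conn' commits on success and rolls back on error.
            with conn, conn.cursor() as cur:
                cur.execute(
                    f"UPDATE osm_polygon AS t "            # hypothetical table
                    f"SET attr = expensive_calc(t.geom) "  # hypothetical column/function
                    f"FROM {view_name} AS v "
                    f"WHERE t.id = v.id"
                )
        finally:
            conn.close()
        return view_name

    with concurrent.futures.ThreadPoolExecutor(max_workers=44) as pool:
        for finished in pool.map(run_job, JOB_VIEWS):
            print("finished", finished)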
However, in PG18, during the multi-threaded processing, I see some of my submitted jobs, in this case run against a small OpenStreetMap Italy extract from Geofabrik, all of a sudden take > 1 hour to finish (up to 6 hours for this small extract), even though similar jobs from the same processing step finish in less than 10 seconds (and the affected jobs should as well). This seems to happen more or less at random. Many multi-threaded tasks before and after the affected processing steps finish normally.

When this happens, I observe the following:

- High processor activity: the jobs that should finish in seconds take hours, all the while showing high core usage.
- PgAdmin shows all sessions created by the Python threads as 'active', with *no* wait events attached.
- The pg_locks table does not show locking conflicts; all locks are granted. I did notice, however, that the relation/table locks were not "fastpath" locks but ordinary ones, whereas all other locks taken, e.g. on the indexes of the same table, were fastpath. I don't know whether this is relevant, as from what I have read about the difference it should not explain seconds versus hours (the kind of checks involved are sketched at the end of this message).
- Please note that the processing DOES eventually proceed, so it is not an infinite deadlock or something where I need to kill my Python code. It just takes hours to resolve.
- Switching to "io_method = sync" seems to resolve the issue, and I no longer observe jobs of the same batch getting "stuck". This is the behavior I was used to seeing in <= PG17.

I am not too familiar with all the internals of PostgreSQL and the new AIO framework and its "io workers". However, it seems there may be some sort of locking issue between io workers that can occasionally occur in PG18 with "io_method = worker"? Is anyone else observing similar issues in heavily multi-threaded processing workflows?
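A minimal sketch of the kind of checks mentioned above, assuming a psycopg2 connection and the hypothetical DSN from the earlier sketch, run from a separate session while the jobs appear stuck:

    # Diagnostic sketch: session state, wait events and lock fastpath status
    # while the parallel jobs appear stuck (hypothetical DSN).
    import psycopg2

    conn = psycopg2.connect("dbname=osm user=postgres")
    cur = conn.cursor()

    cur.execute("SHOW io_method")                 # 'worker' vs 'sync'
    print("io_method:", cur.fetchone()[0])

    # Active sessions and their wait events (no wait event = busy on CPU).
    cur.execute("""
        SELECT pid, state, wait_event_type, wait_event, left(query, 60)
        FROM pg_stat_activity
        WHERE state = 'active' AND pid <> pg_backend_pid()
    """)
    for row in cur.fetchall():
        print(row)

    # Relation-level locks: granted status and whether they are fastpath.
    cur.execute("""
        SELECT locktype, relation::regclass, mode, granted, fastpath
        FROM pg_locks
        WHERE relation IS NOT NULL
    """)
    for row in cur.fetchall():
        print(row)

    conn.close()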
Marco