Document efficient self-joins / UPDATE LIMIT techniques.

From: Corey Huinker
Subject: Document efficient self-joins / UPDATE LIMIT techniques.
Date:
Msg-id: CADkLM=caNEQsUwPWnfi2jR4ix99E0EJM_3jtcE-YjnEQC7Rssw@mail.gmail.com
Replies: Re: Document efficient self-joins / UPDATE LIMIT techniques.  (Corey Huinker <corey.huinker@gmail.com>)
         Re: Document efficient self-joins / UPDATE LIMIT techniques.  (Laurenz Albe <laurenz.albe@cybertec.at>)
List: pgsql-hackers
This patch adds a few examples to demonstrate the following:

* The existence of the ctid column on every table
* The utility of ctids in self-joins (a brief sketch follows this list)
* A practical usage of SKIP LOCKED
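
For readers who haven't used it before, here is a minimal sketch of the system column and of a ctid self-join. The table my_table and its columns are hypothetical, borrowed from the examples further down:

-- Every heap table exposes a hidden ctid system column giving the physical
-- location (block number, offset) of the current row version.
SELECT ctid, id, expiration_date
FROM my_table
LIMIT 5;

-- A subquery's ctids can be joined back to the same table, avoiding a second
-- lookup through the primary key.
SELECT t.*
FROM my_table AS t
JOIN (SELECT ctid
      FROM my_table
      WHERE expiration_date < '2020-01-01'
      LIMIT 5) AS s
  ON t.ctid = s.ctid;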

The reasoning for this is a bit long, but if you're interested, keep reading.

In the past, there has been a desire to see a LIMIT clause of some sort on UPDATE and DELETE statements. The reason for this usually stems from having a large archive or backfill operation that, if done in a single transaction, would overwhelm normal operations: the transaction might fail outright, lock too many rows, flood the WAL and cause replica lag, or starve other processes of limited I/O.

The reasons for not adding a LIMIT clause are pretty straightforward: it isn't in the SQL standard, and UPDATE/DELETE are unordered operations, so updating an arbitrary 1000 rows isn't a great idea. The people wanting the LIMIT clause were undeterred by this, because they knew they intended to keep issuing updates until they ran out of rows to update.

Given these limitations, I would write something like this:

WITH doomed AS (
    SELECT t.id
    FROM my_table AS t
    WHERE t.expiration_date < :'some_archive_date'
    FOR UPDATE SKIP LOCKED
    LIMIT 1000
)
DELETE FROM my_table
WHERE id IN (SELECT id FROM doomed);

This wouldn't interfere with any other updates, so I felt good about running it when the system was not too busy. I'd then write a script to run that statement in a loop, with sleeps to give the replicas a chance to catch their breath. Then, when the row count finally dipped below 1000, I'd issue the final

DELETE FROM my_table WHERE expiration_date < :'some_archive_date';

And this was ok, because at that point I have good reason to believe that there are at most 1000 rows lingering out there, so waiting on locks for those was no big deal.
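
For completeness, here is a minimal sketch of that loop, written as a single DO block rather than an external script. It assumes PostgreSQL 11 or later (COMMIT inside a DO block only works there, and only when the block is run outside an explicit transaction), and it substitutes a literal cutoff date for the psql variable:

DO $$
DECLARE
    deleted bigint;
BEGIN
    LOOP
        WITH doomed AS (
            SELECT t.id
            FROM my_table AS t
            WHERE t.expiration_date < '2020-01-01'  -- hypothetical cutoff
            FOR UPDATE SKIP LOCKED
            LIMIT 1000
        )
        DELETE FROM my_table
        WHERE id IN (SELECT id FROM doomed);

        GET DIAGNOSTICS deleted = ROW_COUNT;
        EXIT WHEN deleted < 1000;

        COMMIT;               -- release the locks and let the replicas catch up
        PERFORM pg_sleep(5);  -- arbitrary pause between batches
    END LOOP;
END
$$;

After the loop exits, the plain DELETE above still needs to run once, since SKIP LOCKED may have left a few locked rows behind.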

But a query like this involves one scan along one index (or worse, a seq scan) followed by another scan, either index or seq. Either way, we're taking up a lot of cache with rows we don't even care about.

Then in v12, the query planner got hip to bitmap tidscans, allowing for this optimization:

WITH doomed AS (
    SELECT t.ctid AS tid
    FROM my_table AS t
    WHERE t.expiration_date < :'some_archive_date'
    FOR UPDATE SKIP LOCKED
    LIMIT 1000
)
DELETE FROM my_table
USING doomed
WHERE my_table.ctid = doomed.tid;

And this works pretty well, especially if you set up a partial index to meet the quals in the CTE. But we don't document this anywhere, and until UPDATE and DELETE get a LIMIT clause, we probably should document this workaround.
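
As for the partial index mentioned above, a minimal sketch, assuming the archival cutoff is a fixed date known up front (a partial index predicate must be a constant expression, so a moving cutoff such as now() won't work):

CREATE INDEX my_table_expired_idx
    ON my_table (expiration_date)
    WHERE expiration_date < '2020-01-01';

The index then covers exactly the rows the CTE fetches, provided the date passed in :'some_archive_date' never exceeds the constant in the predicate.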
Attachments
