Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

From: Jim Jarvie
Subject: Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED
Date:
Msg-id: ccaa5c05-196a-50c4-9d9d-705e3a6bccac@talentstack.to
In reply to: Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED  (Michael Lewis <mlewis@entrata.com>)
Responses: Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED  (Michael Lewis <mlewis@entrata.com>)
List: pgsql-performance

Updates added below; a mixture of good and bad news:

On 18-Aug.-2020 20:39, Michael Lewis wrote:
 
> How long does each process take in total? How strict does that FIFO really
> need to be when you are already doing SKIP LOCKED anyway?

The processes all bottleneck[ed] for several minutes, with the delay growing roughly exponentially in the number of processes above the threshold where the problem appeared.  Up to around 60 concurrent processes worked with minimal delay, but beyond that a few more added about a minute, 70 took about 5 minutes, 80 about 30 minutes, and beyond that it was hours (I left it up to 12 hours one time).

However, I removed the ORDER BY clause, which eliminated [only] the high CPU.  The processes all stopped in the same pattern, just without the high CPU use.  So the ordering was the factor in the CPU use, but it was not responsible for the - forgive the pun - lock-up.

I then added a random few seconds of delay to each process before it executes the SELECT, to prevent too many of them colliding on simultaneous selects.  That was enough to make the lock-up disappear and have individual selects complete in a few ms, regardless of how many other concurrent transactions are in progress (tested up to 192 concurrent).  But we are still not quite out of the woods - read on below.
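
For reference, the jitter can be done client-side or in the session itself just before the dequeue; a minimal sketch in SQL (the 5-second upper bound is an assumption, the exact value isn't given above):

    -- Hypothetical sketch: stagger the workers with up to 5 seconds of
    -- random delay so their dequeue SELECTs do not all collide at once.
    SELECT pg_sleep(random() * 5);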

> Can you expound on the partitioning? Are all consumers of the queue always
> hitting one active partition and anytime a row is processed, it always
> moves to one of many archived-type partitions?

The queue table is list-partitioned into 'incoming', 'processing', 'retry', 'ok' and 'failed' (a DDL sketch follows the list below):

Incoming: itself hash-partitioned (64 partitions); approx 10M rows are added per day, so it is sub-partitioned to support the incoming throughput. This works well.

Processing: simple partition; data is moved into this in blocks from incoming - as the row count drops below a threshold, another block is added.

Retry: simple partition; non-fatal errors go in here and the rows go back into the processing queue for a later retry.

Failed: simple partition; fatal errors go here.  Thankfully very few.

OK: hash-partitioned (currently 64 partitions), since everything that was in incoming should eventually end up here.
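
A sketch of the layout as I understand it (all table and column names are my own illustration, not from the thread; the id/payload columns and the block size are assumptions):

    CREATE TABLE queue (
        id      bigserial,
        status  text NOT NULL,   -- 'incoming' | 'processing' | 'retry' | 'ok' | 'failed'
        payload jsonb
    ) PARTITION BY LIST (status);

    -- incoming and ok are each hash-subpartitioned 64 ways for throughput
    CREATE TABLE queue_incoming PARTITION OF queue
        FOR VALUES IN ('incoming') PARTITION BY HASH (id);
    CREATE TABLE queue_incoming_00 PARTITION OF queue_incoming
        FOR VALUES WITH (MODULUS 64, REMAINDER 0);
    -- ... and similarly for remainders 1..63

    CREATE TABLE queue_processing PARTITION OF queue FOR VALUES IN ('processing');
    CREATE TABLE queue_retry      PARTITION OF queue FOR VALUES IN ('retry');
    CREATE TABLE queue_failed     PARTITION OF queue FOR VALUES IN ('failed');

    CREATE TABLE queue_ok PARTITION OF queue
        FOR VALUES IN ('ok') PARTITION BY HASH (id);
    CREATE TABLE queue_ok_00 PARTITION OF queue_ok
        FOR VALUES WITH (MODULUS 64, REMAINDER 0);
    -- ... and similarly for remainders 1..63

    -- Topping up processing is a cross-partition UPDATE: rows are deleted
    -- from the incoming partition and inserted into processing.
    UPDATE queue SET status = 'processing'
     WHERE id IN (SELECT id FROM queue
                   WHERE status = 'incoming' LIMIT 1000);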

There is one interesting thing about this.  Reasonably frequently, the SELECT ... FOR UPDATE SKIP LOCKED aborts with the error:

Tuple to be locked was already moved to another partition due to concurrent update.

This error persists even though the lock-up has been eliminated by the time delay, so there is a regular stream of transactions aborting because of it (I simply re-run the transaction to recover).
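
For context, the dequeue has roughly this shape (the real query isn't shown in this thread; the batch size, column names and the since-removed ORDER BY are illustrative):

    BEGIN;
    SELECT id, payload
      FROM queue
     WHERE status = 'processing'
     ORDER BY id                -- the clause removed above to cure the CPU use
     LIMIT 250                  -- batch size is a guess
       FOR UPDATE SKIP LOCKED;

    -- ... send each message to the external system, wait for its ACK ...

    UPDATE queue SET status = 'ok'
     WHERE id = ANY (:processed_ids);  -- client-side bind placeholder;
                                       -- this is the cross-partition row movement
    COMMIT;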

Now, if locking works as I understand it: if another process locked and migrated a row, that should still leave the lock in place on the tuple in the original partition and take a new one on the tuple inserted into the destination partition, until a commit is done.  The second process should have no visibility of the newly inserted row, and SKIP LOCKED should simply have skipped over the locked-but-deleted row on the original partition.
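
For what it's worth, here is the interleaving that, as far as I can tell, produces the error (names as in the sketch above).  While the mover still holds its lock, SKIP LOCKED does skip the row; the error seems to arise in the window where the mover has already committed, so the second session finds the tuple moved rather than locked and tries to chase its update chain, which cannot be followed across partitions:

    -- Session B: statement starts; its snapshot still sees row 42 in 'processing'
    BEGIN;
    SELECT id FROM queue
     WHERE status = 'processing'
       FOR UPDATE SKIP LOCKED;   -- scanning, has not reached row 42 yet

    -- Session A: meanwhile moves row 42 out and commits
    BEGIN;
    UPDATE queue SET status = 'ok' WHERE id = 42;  -- delete here, insert elsewhere
    COMMIT;

    -- Session B reaches row 42: it is no longer locked, only moved, so
    -- instead of skipping it the lock attempt fails with:
    -- ERROR:  tuple to be locked was already moved to another partition
    --         due to concurrent update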

What am I missing?  All of this feels like some locking/partitioning issue but I'm unsure exactly what.

> Is that done via FDW or otherwise within the same database transaction? Are
> you connecting some queue consumer application code to Postgres, select for
> update, doing work on some remote system that is slow, and then coming back
> and committing the DB work?
Alas, not FDW: an actual external system elsewhere in the world, which sends an ACK when it has processed the message.  I have no control or influence over this.
> By the way, top-posting is discouraged here and partial quotes with
> interspersed comments are common practice.
Noted!
