Re: [PERFORM] DELETE vs TRUNCATE explanation

Поиск
Список
Период
Сортировка
От Tom Lane
Тема Re: [PERFORM] DELETE vs TRUNCATE explanation
Дата
Msg-id 19949.1342376967@sss.pgh.pa.us
обсуждение исходный текст
Ответ на Re: [PERFORM] DELETE vs TRUNCATE explanation  (Jeff Janes <jeff.janes@gmail.com>)
Ответы Re: [PERFORM] DELETE vs TRUNCATE explanation  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: [PERFORM] DELETE vs TRUNCATE explanation  (Craig Ringer <ringerc@ringerc.id.au>)
Re: [PERFORM] DELETE vs TRUNCATE explanation  (Robert Haas <robertmhaas@gmail.com>)
Список pgsql-hackers
Jeff Janes <jeff.janes@gmail.com> writes:
> On Thu, Jul 12, 2012 at 9:55 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> The topic was poor performance when truncating lots of small tables
>> repeatedly on test environments with fsync=off.
>> 
>> On Thu, Jul 12, 2012 at 6:00 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>> I think the problem is in the Fsync Absorption queue.  Every truncate
>>> adds a FORGET_RELATION_FSYNC to the queue, and processing each one of
>>> those leads to sequential scanning the checkpointer's pending ops hash
>>> table, which is quite large.  It is almost entirely full of other
>>> requests which have already been canceled, but it still has to dig
>>> through them all.   So this is essentially an N^2 operation.

> The attached patch addresses this problem by deleting the entry when
> it is safe to do so, and flagging it as canceled otherwise.

I don't like this patch at all.  It seems ugly and not terribly safe,
and it won't help at all when the checkpointer is in the midst of an
mdsync scan, which is a nontrivial part of its cycle.

I think what we ought to do is bite the bullet and refactor the
representation of the pendingOps table.  What I'm thinking about
is reducing the hash key to just RelFileNodeBackend + ForkNumber,
so that there's one hashtable entry per fork, and then storing a
bitmap to indicate which segment numbers need to be sync'd.  At
one gigabyte to the bit, I think we could expect the bitmap would
not get terribly large.  We'd still have a "cancel" flag in each
hash entry, but it'd apply to the whole relation fork not each
segment.

If we did this then the FORGET_RELATION_FSYNC code path could use
a hashtable lookup instead of having to traverse the table
linearly; and that would get rid of the O(N^2) performance issue.
The performance of FORGET_DATABASE_FSYNC might still suck, but
DROP DATABASE is a pretty heavyweight operation anyhow.

I'm willing to have a go at coding this design if it sounds sane.
Comments?

> Also, I still wonder if it is worth memorizing fsyncs (under
> fsync=off) that may or may not ever take place.  Is there any
> guarantee that we can make by doing so, that couldn't be made
> otherwise?

Yeah, you have a point there.  It's not real clear that switching fsync
from off to on is an operation that we can make any guarantees about,
short of executing something like the code recently added to initdb
to force-sync the entire PGDATA tree.  Perhaps we should change fsync
to be PGC_POSTMASTER (ie frozen at postmaster start), and then we could
skip forwarding fsync requests when it's off?
        regards, tom lane


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: Getting rid of pre-assignment of index names in CREATE TABLE LIKE
Следующее
От: Gurjeet Singh
Дата:
Сообщение: Re: Getting rid of pre-assignment of index names in CREATE TABLE LIKE