Re: Relation extension scalability

Поиск
Список
Период
Сортировка
От Tom Lane
Тема Re: Relation extension scalability
Дата
Msg-id 27689.1427656904@sss.pgh.pa.us
обсуждение исходный текст
Ответ на Relation extension scalability  (Andres Freund <andres@2ndquadrant.com>)
Ответы Re: Relation extension scalability  (Andres Freund <andres@2ndquadrant.com>)
Список pgsql-hackers
Andres Freund <andres@2ndquadrant.com> writes:
> As a quick recap, relation extension basically works like:
> 1) We lock the relation for extension
> 2) ReadBuffer*(P_NEW) is being called, to extend the relation
> 3) smgrnblocks() is used to find the new target block
> 4) We search for a victim buffer (via BufferAlloc()) to put the new
>    block into
> 5) If dirty the victim buffer is cleaned
> 6) The relation is extended using smgrextend()
> 7) The page is initialized

> The problems come from 4) and 5) potentially each taking a fair
> while.

Right, so basically we want to get those steps out of the exclusive lock
scope.

> There's two things that seem to make sense to me:

> First, decouple relation extension from ReadBuffer*, i.e. remove P_NEW
> and introduce a bufmgr function specifically for extension.

I think that removing P_NEW is likely to require a fair amount of
refactoring of calling code, so I'm not thrilled with doing that.
On the other hand, perhaps all that code would have to be touched
anyway to modify the scope over which the extension lock is held.

> Secondly I think we could maybe remove the requirement of needing an
> extension lock alltogether. It's primarily required because we're
> worried that somebody else can come along, read the page, and initialize
> it before us. ISTM that could be resolved by *not* writing any data via
> smgrextend()/mdextend().

I'm afraid this would break stuff rather thoroughly, in particular
handling of out-of-disk-space cases.  And I really don't see how you get
consistent behavior at all for multiple concurrent callers if there's no
locking.

One idea that might help is to change smgrextend's API so that it doesn't
need a buffer to write from, but just has an API of "add a prezeroed block
on-disk and tell me the number of the block you added".  On the other
hand, that would then require reading in the block after allocating a
buffer to hold it (I don't think you can safely assume otherwise) so the
added read step might eat any savings.
        regards, tom lane



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Pavel Stehule
Дата:
Сообщение: Re: proposal: row_to_array function
Следующее
От: James Cloos
Дата:
Сообщение: Re: Rounding to even for numeric data type