Re: Relation extension scalability

Поиск
Список
Период
Сортировка
От Robert Haas
Тема Re: Relation extension scalability
Дата
Msg-id CA+Tgmoa8nddwpMA4Emn7sVoNrQ883mn8ZJBiqXa8dm3puKn=qQ@mail.gmail.com
обсуждение исходный текст
Ответ на Relation extension scalability  (Andres Freund <andres@2ndquadrant.com>)
Ответы Re: Relation extension scalability  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-hackers
On Sun, Mar 29, 2015 at 2:56 PM, Andres Freund <andres@2ndquadrant.com>
> As a quick recap, relation extension basically works like:
> 1) We lock the relation for extension
> 2) ReadBuffer*(P_NEW) is being called, to extend the relation
> 3) smgrnblocks() is used to find the new target block
> 4) We search for a victim buffer (via BufferAlloc()) to put the new
>    block into
> 5) If dirty the victim buffer is cleaned
> 6) The relation is extended using smgrextend()
> 7) The page is initialized
>
> The problems come from 4) and 5) potentially each taking a fair
> while. If the working set mostly fits into shared_buffers 4) can
> requiring iterating over all shared buffers several times to find a
> victim buffer. If the IO subsystem is buys and/or we've hit the kernel's
> dirty limits 5) can take a couple seconds.

Interesting.  I had always assumed the bottleneck was waiting for the
filesystem to extend the relation.

> Secondly I think we could maybe remove the requirement of needing an
> extension lock alltogether. It's primarily required because we're
> worried that somebody else can come along, read the page, and initialize
> it before us. ISTM that could be resolved by *not* writing any data via
> smgrextend()/mdextend(). If we instead only do the write once we've read
> in & locked the page exclusively there's no need for the extension
> lock. We probably still should write out the new page to the OS
> immediately once we've initialized it; to avoid creating sparse files.
>
> The other reason we need the extension lock is that code like
> lazy_scan_heap() and btvacuumscan() that tries to avoid initializing
> pages that are about to be initilized by the extending backend. I think
> we should just remove that code and deal with the problem by retrying in
> the extending backend; that's why I think moving extension to a
> different file might be helpful.

I thought the primary reason we did this is because we wanted to
write-and-fsync the block so that, if we're out of disk space, any
attendant failure will happen before we put data into the block.  Once
we've initialized the block, a subsequent failure to write or fsync it
will be hard to recover from; basically, we won't be able to
checkpoint any more.  If we discover the problem while the block is
still all-zeroes, the transaction that uncovers the problem errors
out, but the system as a whole is still OK.

Or at least, I think.  Maybe I'm misunderstanding.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Michael Paquier
Дата:
Сообщение: Re: Rounding to even for numeric data type
Следующее
От: Robert Haas
Дата:
Сообщение: Re: getting rid of "thread fork emulation" in pgbench?