Re: Relation extension scalability

From: Andres Freund
Subject: Re: Relation extension scalability
Date:
Msg-id: 20150330004709.GC4878@alap3.anarazel.de
In reply to: Relation extension scalability (Andres Freund <andres@2ndquadrant.com>)
List: pgsql-hackers
On 2015-03-29 20:02:06 -0400, Robert Haas wrote:
> On Sun, Mar 29, 2015 at 2:56 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > As a quick recap, relation extension basically works like:
> > 1) We lock the relation for extension
> > 2) ReadBuffer*(P_NEW) is being called, to extend the relation
> > 3) smgrnblocks() is used to find the new target block
> > 4) We search for a victim buffer (via BufferAlloc()) to put the new
> >    block into
> > 5) If dirty the victim buffer is cleaned
> > 6) The relation is extended using smgrextend()
> > 7) The page is initialized
> >
> > The problems come from 4) and 5) potentially each taking a fair
> > while. If the working set mostly fits into shared_buffers 4) can
> > require iterating over all shared buffers several times to find a
> > victim buffer. If the IO subsystem is busy and/or we've hit the kernel's
> > dirty limits 5) can take a couple of seconds.
> 
> Interesting.  I had always assumed the bottleneck was waiting for the
> filesystem to extend the relation.

That might be the case sometimes, but it's not what I've actually
observed so far. I think preallocation in most modern filesystems has
resolved this to some degree.
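
As a rough illustration of why step 4) of the recap can mean several passes over shared_buffers, here is a toy clock sweep in Python. It assumes the usual scheme in which each buffer carries a usage count that the sweep decrements until it finds one at zero. The function name, the plain-list representation, the cap of 5, and the omission of pins and locking are simplifications for illustration, not PostgreSQL's actual code.

```python
MAX_USAGE_COUNT = 5  # assumed cap on a buffer's usage count


def clock_sweep(usage_counts, hand=0):
    """Toy victim search: returns (victim_index, buffers_inspected).

    Decrements usage counts in place while sweeping. Pinned buffers and
    locking are deliberately left out of this model.
    """
    n = len(usage_counts)
    if n == 0:
        raise ValueError("need at least one buffer")
    inspected = 0
    while True:
        inspected += 1
        if usage_counts[hand] == 0:
            return hand, inspected
        usage_counts[hand] -= 1
        hand = (hand + 1) % n


# A hot working set: every buffer sits at the maximum usage count.
hot = [MAX_USAGE_COUNT] * 8
victim, inspected = clock_sweep(hot)
```

With eight buffers all at the cap, the sweep inspects 41 entries (five full passes plus one) before it finds a victim. In the sequence from the recap, that search runs after the extension lock has been taken.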

> > Secondly I think we could maybe remove the requirement of needing an
> > extension lock altogether. It's primarily required because we're
> > worried that somebody else can come along, read the page, and initialize
> > it before us. ISTM that could be resolved by *not* writing any data via
> > smgrextend()/mdextend(). If we instead only do the write once we've read
> > in & locked the page exclusively there's no need for the extension
> > lock. We probably still should write out the new page to the OS
> > immediately once we've initialized it; to avoid creating sparse files.
> >
> > The other reason we need the extension lock is code like
> > lazy_scan_heap() and btvacuumscan() that tries to avoid initializing
> > pages that are about to be initialized by the extending backend. I think
> > we should just remove that code and deal with the problem by retrying in
> > the extending backend; that's why I think moving extension to a
> > different file might be helpful.
> 
> I thought the primary reason we did this is because we wanted to
> write-and-fsync the block so that, if we're out of disk space, any
> attendant failure will happen before we put data into the block.

Well, we only write and register an fsync. AFAICS we don't actually
perform the fsync at that point. I don't think having to do the
fsync() necessarily precludes removing the extension lock.
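
To make the distinction between writing and fsyncing concrete, here is a small Python sketch using plain POSIX file calls (`os.pwrite` is Unix-only). It appends one zero-filled block, so an out-of-space condition has a chance to surface at write time, and it only records that the file needs an fsync later. `BLCKSZ` mirrors PostgreSQL's default block size. `pending_fsyncs`, `extend_by_one_block()` and `process_pending_fsyncs()` are hypothetical names for illustration and do not correspond to the real smgr/md code.

```python
import os

BLCKSZ = 8192  # PostgreSQL's default block size

pending_fsyncs = set()  # hypothetical stand-in for registered fsync requests


def extend_by_one_block(path):
    """Append one zero-filled block to the file; return its block number."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        blockno = os.fstat(fd).st_size // BLCKSZ
        written = os.pwrite(fd, bytes(BLCKSZ), blockno * BLCKSZ)
        if written != BLCKSZ:
            raise OSError("short write while extending")
    finally:
        os.close(fd)
    # Register the fsync only; nothing is forced to disk at this point.
    pending_fsyncs.add(path)
    return blockno


def process_pending_fsyncs():
    """Perform the deferred fsyncs, e.g. at a later checkpoint-like moment."""
    for path in sorted(pending_fsyncs):
        fd = os.open(path, os.O_RDWR)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)
    pending_fsyncs.clear()
```

In this model a failed write raises before anything is registered, while a failed fsync would only show up later, when `process_pending_fsyncs()` runs.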

> Once we've initialized the block, a subsequent failure to write or
> fsync it will be hard to recover from;

At the very least the buffer shouldn't become dirty before we've
successfully written it once, right? It seems quite doable to achieve that
without the lock though. We'll have to do the write without going
through the buffer manager, but that seems doable.
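
The ordering described here, write first and only then allow the buffer to become dirty, can be sketched as a toy model in Python. Everything in it is hypothetical: `ToyBuffer`, `install_new_page()`, `mark_dirty()` and `failing_write()` are illustrative names, and the `write_block` callable stands in for a direct write that bypasses the buffer manager.

```python
import errno


class ToyBuffer:
    """Hypothetical stand-in for a shared buffer holding one page."""

    def __init__(self):
        self.page = None
        self.dirty = False


def install_new_page(buf, page, write_block):
    """Write the page out first; only then let the buffer hold it.

    write_block is any callable that writes the page directly to storage
    and raises OSError on failure.
    """
    write_block(page)   # may raise; the buffer is still untouched here
    buf.page = page     # reached only after a successful write
    buf.dirty = False   # buffer contents match what is on storage


def mark_dirty(buf):
    """Later modifications may dirty the buffer, but only once it has a page."""
    if buf.page is None:
        raise RuntimeError("page was never successfully written")
    buf.dirty = True


def failing_write(page):
    """Simulates running out of disk space during the initial write."""
    raise OSError(errno.ENOSPC, "No space left on device")
```

If `install_new_page()` is called with `failing_write`, the exception propagates and the buffer stays empty and clean, so there is no dirty page that can never be flushed.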

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


