Re: Load distributed checkpoint

Поиск
Список
Период
Сортировка
От Greg Smith
Тема Re: Load distributed checkpoint
Дата
Msg-id Pine.GSO.4.64.0612212323350.13321@westnet.com
обсуждение исходный текст
Ответ на Re: Load distributed checkpoint  ("Inaam Rana" <inaamrana@gmail.com>)
Список pgsql-hackers
On Wed, 20 Dec 2006, Inaam Rana wrote:

> Talking of bgwriter_* parameters I think we are missing a crucial 
> internal counter i.e. number of dirty pages. How much work bgwriter has 
> to do at each wakeup call should be a function of total buffers and 
> currently dirty buffers.

This is actually a question I'd been meaning to throw out myself to this 
list.  How hard would it be to add an internal counter to the buffer 
management scheme that kept track of the current number of dirty pages? 
I've been looking at the bufmgr code lately trying to figure out how to 
insert one as part of building an auto-tuning bgwriter, but it's unclear 
to me how I'd lock such a resource properly and scalably.  I have a 
feeling I'd be inserting a single-process locking bottleneck into that 
code with any of the naive implementations I considered.

The main problem I've been seeing is also long waits stuck behind a slow 
fsync on Linux.  What I've been moving toward testing is an approach 
slightly different from the proposals here.  What if all the database page 
writes (background writer, buffer eviction, or checkpoint scan) were 
counted and periodic fsync requests send to the bgwriter based on that? 
For example, when I know I have a battery-backed caching controller that 
will buffer 64MB worth of data for me, if I forced a fsync after every 
6000 8K writes, no single fsync would get stuck waiting for the disk to 
write for longer than I'd like.

Give the admin a max_writes_before_sync parameter, make the default of 0 
work just like the current behavior, and off you go; a simple tunable that 
doesn't require a complicated scheme to implement or break anybody's 
existing setup.  Combined with a properly tuned background writer, that 
would solve the issues I've been running into.  It would even make the 
problem of Linux caching too many writes until checkpoint time go away (I 
know how to eliminate that by adjusting caching policy, but I have to be 
root to do it; a DBA should be able to work around that issue even if they 
don't have access to the kernel tunables.)

While I'm all for testing to prove me wrong, my gut feel is that going all 
the way to sync writes a la Oracle is a doomed approach, particularly on 
low-end hardware where they're super expensive.  Following The Oracle Way 
is a good roadmap for a lot of things, but I wouldn't put building a lean 
enough database to run on modest hardware on that list.  You can do sync 
writes with perfectly good performance on systems with a good 
battery-backed cache, but I think you'll get creamed in comparisons 
against MySQL on IDE disks if you start walking down that path; since 
right now a fair comparison with similar logging behavior is an even match 
there, that's a step backwards.

Also on the topic of sync writes to the database proper:  wouldn't using 
O_DIRECT for those potentially counter-productive?  I was under the 
impressions that one of the behaviors counted on by Postgres was that data 
evicted from its buffer cache, eventually intended for writing to disk, 
was still kept around for a bit in the OS buffer cache.  A subsequent read 
because the data was needed again might find the data already in the OS 
buffer, therefore avoiding an actual disk read; that substantially reduces 
the typical penalty for the database engine making a bad choice on what to 
evict.  I fear a move to direct writes would put more pressure on the LRU 
implementation to be very smart, and that's code that you really don't 
want to be more complicated.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: Interface for pg_autovacuum
Следующее
От: "Takayuki Tsunakawa"
Дата:
Сообщение: Re: Load distributed checkpoint