Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?

From: Bruce Momjian
Subject: Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?
Msg-id: 199803121336.IAA14863@candle.pha.pa.us
In reply to: Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?  (dg@illustra.com (David Gould))
Responses: Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?  (ocie@paracel.com)
Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?  (dg@illustra.com (David Gould))
List: pgsql-hackers
> I have had the pleasure of working on the guts of one of the major databases'
> raw partition storage managers over the last ten years (hint, not my
> current domain), and guess what? It implements a file system. And not a
> particularly good file system at that. Think about something like "FAT",
> but not quite that nice. It was also a major source of pain in that it
> was complex, heavily concurrent, and any errors showed up as massive data
> loss or corruption. Be careful what you wish for.

Interesting.

>
> Most of the supposed benefit comes from integrating the buffer cache
> management and the write-ahead log so that you can defer or avoid I/O (as
> long as the log records get to disk, there is no reason to ever write the
> data page unless you need the buffer for something else). You can also
> convert random I/O to semi-sequential I/O if most writes are done by a page
> cleaner or by a checkpoint, as this gives you lots of I/O to sort.
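
The point about sorting writes can be sketched roughly as follows. This is only an illustration, not code from Postgres or Illustra: the `checkpoint` function, the `dirty_pages` dictionary standing in for a buffer pool, and the 8 KB page size are all assumptions.

```python
import os

PAGE_SIZE = 8192  # assumed page size for this sketch


def checkpoint(fd, dirty_pages):
    """Write every dirty page in ascending block order, then fsync once.

    dirty_pages is a hypothetical buffer pool: block number -> page bytes.
    Returns the order in which the blocks were written.
    """
    order = sorted(dirty_pages)  # scattered block numbers become an ascending sweep
    for blockno in order:
        os.pwrite(fd, dirty_pages[blockno], blockno * PAGE_SIZE)
    os.fsync(fd)                 # one flush for the whole batch
    dirty_pages.clear()
    return order
```

The more pages a checkpoint or page cleaner has accumulated, the closer the sorted sweep comes to sequential I/O.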

Yes, I have heard that the standard file system read-ahead is often
useless for a database.  On a raw partition, you know the next block
that is going to be requested, so you can prefetch that block rather than
having the file system prefetch the next sequential one.
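
A minimal sketch of application-directed prefetch, for illustration only. It uses an ordinary file and the `posix_fadvise` hint rather than a raw partition, and the helper names and 8 KB block size are assumptions, not anything from the Postgres source.

```python
import os

BLOCK_SIZE = 8192  # assumed block size for this sketch


def prefetch_block(fd, blockno):
    """Hint that blockno will be read soon. Returns True if a hint was issued."""
    if not hasattr(os, "posix_fadvise"):
        return False  # platform has no posix_fadvise; fall back to plain reads
    try:
        os.posix_fadvise(fd, blockno * BLOCK_SIZE, BLOCK_SIZE,
                         os.POSIX_FADV_WILLNEED)
    except OSError:
        return False  # the hint is advisory, so a failure is not fatal
    return True


def read_blocks(fd, blocknos):
    """Read blocks in the caller's (possibly non-sequential) order,
    hinting the next block before reading the current one."""
    pages = []
    for i, blockno in enumerate(blocknos):
        if i + 1 < len(blocknos):
            prefetch_block(fd, blocknos[i + 1])
        pages.append(os.pread(fd, BLOCK_SIZE, blockno * BLOCK_SIZE))
    return pages
```

The caller, not the file system, decides which block comes next, so an index scan that jumps around the file can still have its next page on the way.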

It is also nice that you can control what gets written to disk and
fsync'ed and what doesn't.

> I don't know the current state of Postgres so I cannot comment on it, but at
> least with Illustra, the lack of a traditional write-ahead log style
> transaction system was a major performance hit as it forced an fsync at
> every commit. A good WAL system gets many commits per log I/O, but
> Illustra was stuck with many writes per transaction. If Postgres still does
> this (and the recent elimination of time travel suggests that it might not),
> it would be well worth fixing.
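
The "many commits per log I/O" idea is usually called group commit. The sketch below is a toy model under stated assumptions: the `GroupCommitLog` class, its record format, and the `fsync_count` counter are hypothetical and exist only to show several commits sharing one fsync.

```python
import os


class GroupCommitLog:
    """Toy log: commits are buffered and made durable together."""

    def __init__(self, fd):
        self.fd = fd
        self.pending = []     # transaction ids waiting for the next log write
        self.fsync_count = 0  # how many log flushes were actually issued

    def commit(self, xid):
        """Queue a commit record; it is not durable until flush() returns."""
        self.pending.append(xid)

    def flush(self):
        """Write all queued commit records with one write and one fsync.

        Returns the transaction ids that became durable.
        """
        if not self.pending:
            return []
        data = b"".join(b"COMMIT %d\n" % xid for xid in self.pending)
        os.write(self.fd, data)
        os.fsync(self.fd)
        self.fsync_count += 1
        durable, self.pending = self.pending, []
        return durable
```

With an fsync at every commit, three transactions cost three flushes; here they cost one, at the price of each commit waiting for the next flush.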

Our idea is to control when pg_log gets written to disk.  We keep active
pg_log pages in shared memory, and every 30-60 seconds, we make a memory
copy of the current pg_log active pages, do a system sync() (which
happens anyway at that interval), update the pg_log file with the saved
changes, and fsync() the pg_log pages to disk.  That way, after a crash,
the database shows a transaction as committed only if we are sure all of
its data has made it to disk.
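
The sequence above can be sketched roughly like this. It is a toy model, not the actual pg_log code: the `DeferredCommitLog` class, the one-status-byte-per-transaction layout, and the helper `committed_on_disk` are assumptions made for illustration.

```python
import os


class DeferredCommitLog:
    """Toy model of commit status kept in memory and flushed periodically."""

    def __init__(self, fd):
        self.fd = fd
        # Hypothetical layout: one status byte per transaction id.
        self.in_memory = bytearray()

    def mark_committed(self, xid):
        """Record the commit in memory only; the file is untouched."""
        if xid >= len(self.in_memory):
            self.in_memory.extend(b"\x00" * (xid + 1 - len(self.in_memory)))
        self.in_memory[xid] = 1

    def periodic_flush(self):
        """Run every 30-60 seconds in the scheme described above."""
        snapshot = bytes(self.in_memory)  # 1. copy the active status pages
        os.sync()                         # 2. ask the kernel to flush dirty data
        os.pwrite(self.fd, snapshot, 0)   # 3. update the log file from the copy
        os.fsync(self.fd)                 # 4. force the status bytes to disk
        return len(snapshot)


def committed_on_disk(fd, xid):
    """What a restart after a crash would see for this transaction."""
    return os.pread(fd, 1, xid) == b"\x01"
```

The snapshot is taken before the sync, so a transaction that commits while the sync is running is not marked on disk until the next cycle.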

I have a more detailed posting if you are interested.

> A last point: the "raw disk, implement our own filesystem" architecture used
> by some systems is much more compelling if the filesystems are slow and
> inflexible, and the filesystem caching is ineffective. These things were
> more true back in the early 80's when these systems were being designed.
> Things are not as bad now; in particular, ext2 has quite good performance.
>
> Sorry for the ramble...

No ramble at all.  It is not every day we get someone with real-world
experience in changing from a filesystem to a raw partition database
storage manager.

--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
