Re: Hardware/OS recommendations for large databases (

Поиск
Список
Период
Сортировка
Bruce Momjian wrote:
> Greg Stark wrote:
>
>> Alan Stange <stange@rentec.com> writes:
>>
>>
>>> The point your making doesn't match my experience with *any* storage or program
>>> I've ever used, including postgresql.   Your point suggests that the storage
>>> system is idle  and that postgresql is broken because it isn't able to use the
>>> resources available...even when the cpu is very idle.  How can that make sense?
>>>
>> Well I think what he's saying is that Postgres is issuing a read, then waiting
>> for the data to return. Then it does some processing, and goes back to issue
>> another read. The CPU is idle half the time because Postgres isn't capable of
>> doing any work while waiting for i/o, and the i/o system is idle half the time
>> while the CPU intensive part happens.
>>
>> (Consider as a pathological example a program that reads 8k then sleeps for
>> 10ms, and loops doing that 1,000 times. Now consider the same program
>> optimized to read 8M asynchronously and sleep for 10s. By the time it's
>> finished sleeping it has probably read in all 8M. Whereas the program that
>> read 8k in little chunks interleaved with small sleeps would probably take
>> twice as long and appear to be entirely i/o-bound with 50% iowait and 50%
>> idle.)
>>
>> It's a reasonable theory and it's not inconsistent with the results you sent.
>> But it's not exactly proven either. Nor is it clear how to improve matters.
>> Adding additional threads to handle the i/o adds an enormous amount of
>> complexity and creates lots of opportunity for other contention that could
>> easily eat all of the gains.
>>
>
> Perfect summary.  We have a background writer now.  Ideally we would
> have a background reader, that reads-ahead blocks into the buffer cache.
> The problem is that while there is a relatively long time between a
> buffer being dirtied and the time it must be on disk (checkpoint time),
> the read-ahead time is much shorter, requiring some kind of quick
> "create a thread" approach that could easily bog us down as outlined
> above.
>
> Right now the file system will do read-ahead for a heap scan (but not an
> index scan), but even then, there is time required to get that kernel
> block into the PostgreSQL shared buffers, backing up Luke's observation
> of heavy memcpy() usage.
>
> So what are our options?  mmap()?  I have no idea.  Seems larger page
> size does help.
For sequential scans, you do have a background reader.  It's the
kernel.  As long as you don't issue a seek() between read() calls, the
kernel will get the hint about sequential IO and begin to perform a read
ahead for you.  This is where the above analysis isn't quite right:
while postgresql is processing the returned data from the read() call,
the kernel has also issued reads as part of the read ahead, keeping the
device busy while the cpu is busy.  (I'm assuming these details for
Linux; Solaris/UFS does work this way).  Issue one seek on the file and
the read ahead algorithm will back off for a while.   This was my point
about some descriptions of how the system works not being sensible.

If your goal is sequential IO, then one must use larger block sizes.
No one would use 8KB IO for achieving high sequential IO rates.   Simply
put, read() is about the slowest way to get 8KB of data.     Switching
to 32KB blocks reduces all the system call overhead by a large margin.
Larger blocks would be better still, up to the stripe size of your
mirror.   (Of course, you're using a mirror and not raid5 if you care
about performance.)

I don't think the memcpy of data from the kernel to userspace is that
big of an issue right now.  dd and all the high end network interfaces
manage OK doing it, so I'd expect postgresql to do all right with it now
yet too.   Direct IO will avoid that memcpy, but then you also don't get
any caching of the files in memory.  I'd be more concerned about any
memcpy calls or general data management within postgresql.    Does
postgresql use the platform specific memcpy() in libc?  Some care might
be needed to ensure that the memory blocks within postgresql are all
properly aligned to make sure that one isn't ping-ponging cache lines
around (usually done by padding the buffer sizes by an extra 32 bytes or
L1 line size).   Whatever you do, all the usual high performance
computing tricks should be used prior to considering any rewriting of
major code sections.

Personally, I'd like to see some detailed profiling being done using
hardware counters for cpu cycles and cache misses, etc.   Given the poor
quality of work that has been discussed here in this thread, I don't
have much confidence in any other additional results at this time.
None of the analysis would be acceptable in any environment in which
I've worked.   Be sure to take a look at Sun's free Workshop tools as
they are excellent for this sort of profiling and one doesn't need to
recompile to use them.    If I get a little time in the next week or two
I might take a crack at this.

Cheers,

-- Alan


В списке pgsql-performance по дате отправления:

Предыдущее
От: Ralph Mason
Дата:
Сообщение: Binary Refcursor possible?
Следующее
От: Tom Lane
Дата:
Сообщение: Re: Binary Refcursor possible?