Re: Wierd context-switching issue on Xeon

From: Tom Lane
Subject: Re: Wierd context-switching issue on Xeon
Msg-id: 11437.1082324861@sss.pgh.pa.us
In reply to: Wierd context-switching issue on Xeon  (Josh Berkus <josh@agliodbs.com>)
Responses: Re: Wierd context-switching issue on Xeon  (Dave Cramer <pg@fastcrypt.com>)
           Re: Wierd context-switching issue on Xeon  (Greg Stark <gsstark@mit.edu>)
           Re: Wierd context-switching issue on Xeon  (Josh Berkus <josh@agliodbs.com>)
List: pgsql-performance

After some further digging I think I'm starting to understand what's up
here, and the really fundamental answer is that a multi-CPU Xeon MP box
sucks for running Postgres.

I did a bunch of oprofile measurements on a machine belonging to one of
Josh's clients, using a test case that involved heavy concurrent access
to a relatively small amount of data (little enough to fit into Postgres
shared buffers, so that no I/O or kernel calls were really needed once
the test got going).  I found that by nearly any measure --- elapsed
time, bus transactions, or machine-clear events --- the spinlock
acquisitions associated with grabbing and releasing the BufMgrLock took
an unreasonable fraction of the time.  I saw about 15% of elapsed time,
40% of bus transactions, and nearly 100% of pipeline-clear cycles going
into what is essentially two instructions out of the entire backend.
(Pipeline clears occur when the cache coherency logic detects a memory
write ordering problem.)

I am not completely clear on why this machine-level bottleneck manifests
as a lot of context switches at the OS level.  I think what is happening is
that because SpinLockAcquire is so slow, a process is much more likely
than you'd normally expect to arrive at SpinLockAcquire while another
process is also acquiring the spinlock.  This puts the two processes
into a "lockstep" condition where the second process is nearly certain
to observe the BufMgrLock as locked, and be forced to suspend itself,
even though the time the first process holds the BufMgrLock is not
really very long at all.
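
To make the cost being discussed concrete, here is a minimal sketch of a
test-and-set spinlock using C11 atomics.  This is not the actual Postgres
s_lock code; the names (demo_spinlock, demo_spin_acquire, and so on) and
the spin-then-yield policy are assumptions made purely for illustration.
The atomic exchange is the step that needs exclusive ownership of the
cache line, so under contention that line has to move from one processor
to the other on each acquisition.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical illustration only; not PostgreSQL's s_lock implementation. */
typedef struct {
    atomic_bool locked;
} demo_spinlock;

static void demo_spin_init(demo_spinlock *lock)
{
    atomic_init(&lock->locked, false);
}

/* One acquisition attempt; returns true if we now hold the lock. */
static bool demo_spin_try(demo_spinlock *lock)
{
    /* Plain read first: it can usually be served from a shared copy of
     * the cache line, so waiters do not fight over ownership. */
    if (atomic_load_explicit(&lock->locked, memory_order_relaxed))
        return false;

    /* The exchange needs the cache line in exclusive state.  This is the
     * expensive step when another CPU touched the lock recently. */
    return !atomic_exchange_explicit(&lock->locked, true,
                                     memory_order_acquire);
}

/* Spin up to max_spins times, then give up the CPU and retry.
 * Returns how many times we yielded, i.e. how often contention turned
 * into a scheduler round trip. */
static int demo_spin_acquire(demo_spinlock *lock, int max_spins)
{
    int yields = 0;

    assert(max_spins > 0);
    for (;;) {
        for (int i = 0; i < max_spins; i++) {
            if (demo_spin_try(lock))
                return yields;
        }
        sched_yield();          /* assumed policy: yield after spinning */
        yields++;
    }
}

static void demo_spin_release(demo_spinlock *lock)
{
    atomic_store_explicit(&lock->locked, false, memory_order_release);
}
```

In this sketch an uncontended demo_spin_acquire returns 0 without ever
yielding.  The slower the exchange becomes, the wider the window in which
a second process arrives and finds the lock taken.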

If you google for Xeon and "cache coherency" you'll find quite a bit of
suggestive information about why this might be more true on the Xeon
setup than others.  A couple of interesting hits:

http://www.theinquirer.net/?article=10797
says that Xeon MP uses a *slower* FSB than Xeon DP.  This would
translate directly to more time needed to transfer a dirty cache line
from one processor to the other, which is the basic operation that we're
talking about here.

http://www.aceshardware.com/Spades/read.php?article_id=30000187
says that Opterons use a different cache coherency protocol that is
fundamentally superior to the Xeon's, because dirty cache data can be
transferred directly between two processor caches without waiting for
main memory.

So in the short term I think we have to tell people that Xeon MP is not
the most desirable SMP platform to run Postgres on.  (Josh thinks that
the specific motherboard chipset being used in these machines might
share some of the blame too.  I don't have any evidence for or against
that idea, but it's certainly possible.)

In the long run, however, CPUs continue to get faster than main memory
and the price of cache contention will continue to rise.  So it seems
that we need to give up the assumption that SpinLockAcquire is a cheap
operation.  In the presence of heavy contention it won't be.

One thing we probably have got to do soon is break up the BufMgrLock
into multiple finer-grain locks so that there will be less contention.
However I am wary of doing this incautiously, because if we do it in a
way that makes for a significant rise in the number of locks that have
to be acquired to access a buffer, we might end up with a net loss.
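
As a rough illustration of what "finer-grain locks" could look like, the
sketch below hashes a page identity to one of several partition locks, so
that accesses to unrelated pages usually do not touch the same lock.
Everything here is hypothetical: the partition count, the hash function,
and the names (demo_buffer_table, demo_touch_page) are assumptions for
the example and do not describe any existing or planned bufmgr design.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_NUM_PARTITIONS 16   /* assumed value; must be a power of two */

/* Hypothetical partitioned table: one lock per partition instead of a
 * single lock covering every buffer. */
typedef struct {
    atomic_flag locks[DEMO_NUM_PARTITIONS];
    long        lookups[DEMO_NUM_PARTITIONS];  /* guarded by locks[i] */
} demo_buffer_table;

static void demo_table_init(demo_buffer_table *table)
{
    for (int i = 0; i < DEMO_NUM_PARTITIONS; i++) {
        atomic_flag_clear(&table->locks[i]);
        table->lookups[i] = 0;
    }
}

/* Map a (relation, block) pair to a partition.  The mixing constants are
 * arbitrary; any reasonably uniform hash would do. */
static unsigned demo_partition_for(uint32_t rel_id, uint32_t block_no)
{
    uint32_t h = (rel_id * 2654435761u) ^ (block_no * 40503u);

    h ^= h >> 16;
    return h & (DEMO_NUM_PARTITIONS - 1);
}

/* Access one page: exactly one partition lock is taken and released. */
static unsigned demo_touch_page(demo_buffer_table *table,
                                uint32_t rel_id, uint32_t block_no)
{
    unsigned p = demo_partition_for(rel_id, block_no);

    assert(p < DEMO_NUM_PARTITIONS);
    while (atomic_flag_test_and_set_explicit(&table->locks[p],
                                             memory_order_acquire))
        ;                        /* spin until the partition is free */
    table->lookups[p]++;
    atomic_flag_clear_explicit(&table->locks[p], memory_order_release);
    return p;
}
```

This sketch takes exactly one lock per access, which is the favorable
case.  The net-loss risk arises if a real design needed two or three
lock acquisitions per buffer access, since each one pays the same
cache-line transfer cost.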

I think Neil Conway was looking into how the bufmgr might be
restructured to reduce lock contention, but if he has come up with
anything, he hasn't said exactly what.  Neil?

            regards, tom lane
