Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

From: Heikki Linnakangas
Subject: Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)
Message-ID: 4F610516.9040102@enterprisedb.com
In reply to: Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
Responses: Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)
List: pgsql-hackers

On 12.03.2012 21:33, I wrote:
> The slowdown with > 6 clients seems to be spinlock contention. I ran
> "perf record" for a short duration during one of the ramdrive tests, and
> saw the spinlock acquisition in ReserveXLogInsertLocation() consuming
> about 80% of all CPU time.
>
> I then hacked the patch a little bit, removing the check in XLogInsert
> for fullPageWrites and forcePageWrites, as well as the check for "did a
> checkpoint just happen" (see
> http://community.enterprisedb.com/xloginsert-scale-tests/disable-fpwcheck.patch).
> My hunch was that accessing those fields causes cache line stealing,
> making the cache line containing the spinlock even more busy. That hunch
> seems to be correct; when I reran the tests with that patch, the
> performance with high # of clients became much better. See the results
> with "xloginsert-scale-13.patch". With that change, the single-client
> case is still about 10% slower than current code, but the performance
> with > 8 clients is almost as good as with current code. Between 2-6
> clients, the patch is a win.
>
> The hack that restored the > 6 clients performance to current level is
> not safe, of course, so I'll have to figure out a safe way to get that
> effect.

I managed to do that in a safe way, and also found a couple of other
small changes that made a big difference to performance. I found out
that the number of cache misses while holding the spinlock matters a lot,
which in hindsight isn't surprising. I aligned the XLogCtlInsert struct
on a 64-byte boundary, so that the new spinlock and the fields it
protects all fit on the same cache line (on boxes with cache line size
 >= 64 bytes, anyway). I also changed the logic of the insertion slots
slightly, so that when a slot is reserved, while holding the spinlock,
it doesn't need to be immediately updated. That avoids one cache miss,
as the cache line holding the slot doesn't need to be accessed while
holding the spinlock. And to reduce contention on cache lines when an
insertion is finished and the insertion slot is updated, I shuffled the
slots so that slots that are logically adjacent are spaced apart in memory.
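
To make the two layout ideas above concrete, here is a minimal sketch. It is
not code from the patch: the struct InsertCtl, its fields, the constants and
slot_physical_index() are all hypothetical names, and a 64-byte cache line is
assumed. The first part keeps a lock and the fields it protects on a single
cache line; the second maps logically adjacent slots to different cache lines
using a simple stride.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64  /* assumption: 64-byte cache lines */
#define NUM_SLOTS       16  /* hypothetical number of insertion slots */
#define SLOTS_PER_LINE   4  /* hypothetical: slots that fit in one cache line */

/*
 * Hypothetical stand-in for the shared insert-control struct. Aligning the
 * first member to the cache line size aligns the whole struct, so the lock
 * and the fields it protects sit on the same cache line.
 */
typedef struct InsertCtl
{
    alignas(CACHE_LINE_SIZE) volatile int lock; /* stand-in for a spinlock */
    uint64_t    cur_pos;    /* protected by lock: next insert position */
    uint32_t    next_slot;  /* protected by lock: next slot to hand out */
} InsertCtl;

/* Compile-time checks: the struct occupies exactly one aligned cache line. */
static_assert(sizeof(InsertCtl) == CACHE_LINE_SIZE,
              "InsertCtl must fit in one cache line");
static_assert(alignof(InsertCtl) == CACHE_LINE_SIZE,
              "InsertCtl must be cache-line aligned");

/*
 * Map a logical slot number to a physical array index so that logically
 * adjacent slots land on different cache lines. Consecutive logical slots
 * step from one cache line to the next; only after visiting every line
 * does the mapping move to the next position within a line.
 */
static unsigned
slot_physical_index(unsigned logical)
{
    unsigned    nlines = NUM_SLOTS / SLOTS_PER_LINE;

    return (logical % nlines) * SLOTS_PER_LINE + (logical / nlines);
}
```

With these constants, logical slots 0, 1, 2, 3 map to physical indexes 0, 4,
8, 12, so backends finishing consecutive insertions touch four different
cache lines. The real patch may well use a different mapping; the stride is
only one way to get the effect.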

When all those changes are put together, the patched version now beats
or matches the current code in the RAM drive tests, except that the
single-client case is still about 10% slower. I added the new test
results at http://community.enterprisedb.com/xloginsert-scale-tests/,
and a new version of the patch is attached.

If all of this sounds pessimistic, let me remind you that I've been testing
the cases where I'm seeing regressions, so that I can fix them, and not
trying to demonstrate how good this is in the best case. These tests
have been with very small WAL records, with only 16 bytes of payload.
Larger WAL records benefit more. I also ran one test with larger, 100
byte WAL records, and put the results up on that site.

> Also, even when the performance is as good as current code, it's
> not good to spend all the CPU time spinning on the spinlock. I didn't
> measure the CPU usage with current code, but I would expect it to be
> sleeping, not spinning, when not doing useful work.

This is still an issue.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

