Re: Relation extension scalability

From: Andres Freund
Subject: Re: Relation extension scalability
Date:
Msg-id: 20150719135841.GG25610@awork2.anarazel.de
In response to: Relation extension scalability  (Andres Freund <andres@2ndquadrant.com>)
Responses: Re: Relation extension scalability  (Andres Freund <andres@anarazel.de>)
           Re: Relation extension scalability  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
Hi,

Every now and then over the last few weeks I've spent a bit of time
making this more efficient.

I had some trouble reproducing the problems I'd seen in production on
physical hardware (I found EC2 too variable to benchmark this), but
luckily 2ndQuadrant today allowed me access to their four-socket
machine[1] from the AXLE project.  Thanks Simon and Tomas!

First, some mostly juicy numbers:

My benchmark was a parallel COPY into a single WAL-logged target
table:
CREATE TABLE data(data text);
The source data has been generated with
narrow:
COPY (select g.i::text FROM generate_series(1, 10000) g(i)) TO '/tmp/copybinary' WITH BINARY;
wide:
COPY (select repeat(random()::text, 10) FROM generate_series(1, 10000) g(i)) TO '/tmp/copybinarywide' WITH BINARY;

Between every test I ran a TRUNCATE data; CHECKPOINT;

For each number of clients I ran pgbench for 70 seconds. I'd previously
determined using -P 1 that the numbers are fairly stable. Longer runs
would have been nice, but then I'd not have finished in time.
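Concretely, the setup described above might look roughly like the
following sketch. This is not the script actually used: the pgbench
script file name and the exact flags are assumptions, and server-side
COPY FROM a file needs the data file readable by the server process.

```shell
# Hypothetical pgbench script: one server-side COPY per transaction.
cat > copy.sql <<'EOF'
COPY data FROM '/tmp/copybinary' WITH BINARY;
EOF

for clients in 1 2 4 8 16 32 48 64 96 128 196 256 512; do
    # Reset the target table between runs, as described above.
    psql -c 'TRUNCATE data; CHECKPOINT;'
    # 70 second run, per-second progress to check stability (-P 1).
    pgbench -n -f copy.sql -c "$clients" -j "$clients" -T 70 -P 1
done
```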

shared_buffers = 48GB, narrow table contents:
client     tps after:      tps before:
1          180.255577      210.125143
2          338.231058      391.875088
4          638.814300      405.243901
8          1126.852233     370.922271
16         1242.363623     498.487008
32         1229.648854     484.477042
48         1223.288397     468.127943
64         1198.007422     438.238119
96         1201.501278     370.556354
128        1198.554929     288.213032
196        1189.603398     193.841993
256        1144.082291     191.293781
512        643.323675      200.782105

shared_buffers = 1GB, narrow table contents:
client     tps after:      tps before:
1          191.137410      210.787214
2          351.293017      384.086634
4          649.800991      420.703149
8          1103.770749     355.947915
16         1287.192256     489.050768
32         1226.329585     464.936427
48         1187.266489     443.386440
64         1182.698974     402.251258
96         1208.315983     331.290851
128        1183.469635     269.250601
196        1202.847382     202.788617
256        1177.924515     190.876852
512        572.457773      192.413191

shared_buffers = 48GB, wide table contents:
client     tps after:      tps before:
1          59.685215       68.445331
2          102.034688      103.210277
4          179.434065      78.982315
8          222.613727      76.195353
16         232.162484      77.520265
32         231.979136      71.654421
48         231.981216      64.730114
64         230.955979      57.444215
96         228.016910      56.324725
128        227.693947      45.701038
196        227.410386      37.138537
256        224.626948      35.265530
512        105.356439      34.397636

shared_buffers = 1GB, wide table contents:
(ran out of patience)

Note that the peak performance with the patch is significantly better,
but there's currently a noticeable regression in single threaded
performance. That undoubtedly needs to be addressed.


So, to get to the actual meat: my goal was to essentially get rid of the
exclusive lock over relation extension altogether. I think I found a
way to do that which addresses the concerns raised in this thread.

The new algorithm basically is:
1) Acquire a victim buffer, clean it, and mark it as pinned
2) Get the current size of the relation, save it into blockno
3) Try to insert an entry into the buffer table for blockno
4) If the page is already in the buffer table, increment blockno by 1,
   goto 3)
5) Try to read the page. In most cases it'll not yet exist. But the page
   might concurrently have been written by another backend and removed
   from shared buffers already. If already existing, goto 1)
6) Zero out the page on disk.

I think this does handle the concurrency issues.

This patch very clearly is in the POC stage. But I do think the approach
is generally sound.  I'd like to see some comments before deciding
whether to carry on.


Greetings,

Andres Freund

PS: Yes, I know that precision in the benchmark isn't warranted, but I'm
too lazy to truncate them.

[1]
[10:28:11 PM] Tomas Vondra: 4x Intel Xeon E5-4620 Eight Core 2.2GHz
Processor generation: Sandy Bridge EP
each core handles 2 threads, so 16 threads total
256GB (16x16GB) ECC REG System Validated Memory (1333 MHz)
2x 250GB SATA 2.5" Enterprise Level HDs (RAID 1, ~250GB)
17x 600GB SATA 2.5" Solid State HDs (RAID 0, ~10TB)
LSI MegaRAID 9271-8iCC controller and Cache Vault Kit (1GB cache)
2x Nvidia Tesla K20 Active GPU Cards (GK110GL)

