Re: Relation extension scalability
From | Andres Freund
Subject | Re: Relation extension scalability
Date |
Msg-id | 20150719135841.GG25610@awork2.anarazel.de
In reply to | Relation extension scalability (Andres Freund <andres@2ndquadrant.com>)
Responses | Re: Relation extension scalability (Andres Freund <andres@anarazel.de>)
 | Re: Relation extension scalability (Tom Lane <tgl@sss.pgh.pa.us>)
List | pgsql-hackers
Hi,

I have, every now and then, spent a bit of time making this more efficient over the last few weeks. I had trouble reproducing the problems I'd seen in production on physical hardware (I found EC2 too variable to benchmark this), but luckily 2ndQuadrant today allowed me access to their four-socket machine[1] from the AXLE project. Thanks Simon and Tomas!

First, some mostly juicy numbers:

My benchmark was a parallel COPY into a single WAL-logged target table:

CREATE TABLE data(data text);

The source data has been generated with

narrow:
COPY (SELECT g.i::text FROM generate_series(1, 10000) g(i)) TO '/tmp/copybinary' WITH BINARY;

wide:
COPY (SELECT repeat(random()::text, 10) FROM generate_series(1, 10000) g(i)) TO '/tmp/copybinarywide' WITH BINARY;

Between every test I ran a

TRUNCATE data; CHECKPOINT;

For each number of clients I ran pgbench for 70 seconds. I'd previously determined using -P 1 that the numbers are fairly stable. Longer runs would have been nice, but then I'd not have finished in time.
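For reference, the setup described above can be reconstructed roughly as below. The script file name, paths, and the exact pgbench flags are my assumptions, not from the original mail; the pgbench invocations are echoed rather than executed so the sketch runs standalone.

```shell
# Hypothetical reconstruction of the benchmark driver (not the original script).
# Each pgbench client repeatedly COPYs the pre-generated file into the table.
cat > /tmp/copy.sql <<'EOF'
COPY data FROM '/tmp/copybinary' WITH BINARY;
EOF

for clients in 1 2 4 8 16 32 48 64 96 128 196 256 512; do
    # -n: skip vacuum; -f: custom script; -T 70: 70-second run;
    # -P 1: progress every second (used to check run-to-run stability)
    echo pgbench -n -f /tmp/copy.sql -c "$clients" -j "$clients" -T 70 -P 1
    # between runs: psql -c 'TRUNCATE data; CHECKPOINT;'
done
```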
shared_buffers = 48GB, narrow table contents:

client   tps after:    tps before:
1         180.255577    210.125143
2         338.231058    391.875088
4         638.814300    405.243901
8        1126.852233    370.922271
16       1242.363623    498.487008
32       1229.648854    484.477042
48       1223.288397    468.127943
64       1198.007422    438.238119
96       1201.501278    370.556354
128      1198.554929    288.213032
196      1189.603398    193.841993
256      1144.082291    191.293781
512       643.323675    200.782105

shared_buffers = 1GB, narrow table contents:

client   tps after:    tps before:
1         191.137410    210.787214
2         351.293017    384.086634
4         649.800991    420.703149
8        1103.770749    355.947915
16       1287.192256    489.050768
32       1226.329585    464.936427
48       1187.266489    443.386440
64       1182.698974    402.251258
96       1208.315983    331.290851
128      1183.469635    269.250601
196      1202.847382    202.788617
256      1177.924515    190.876852
512       572.457773    192.413191

shared_buffers = 48GB, wide table contents:

client   tps after:    tps before:
1          59.685215     68.445331
2         102.034688    103.210277
4         179.434065     78.982315
8         222.613727     76.195353
16        232.162484     77.520265
32        231.979136     71.654421
48        231.981216     64.730114
64        230.955979     57.444215
96        228.016910     56.324725
128       227.693947     45.701038
196       227.410386     37.138537
256       224.626948     35.265530
512       105.356439     34.397636

shared_buffers = 1GB, wide table contents: (ran out of patience)

Note that the peak performance with the patch is significantly better, but there's currently a noticeable regression in single-threaded performance. That undoubtedly needs to be addressed.

So, to get to the actual meat: my goal was essentially to get rid of the exclusive lock over relation extension altogether. I think I found a way to do that which addresses the concerns raised in this thread.

The new algorithm basically is:
1) Acquire a victim buffer, clean it, and mark it as pinned
2) Get the current size of the relation, save it into blockno
3) Try to insert an entry into the buffer table for blockno
4) If the page is already in the buffer table, increment blockno by 1, goto 3)
5) Try to read the page.
   In most cases it'll not yet exist. But the page might concurrently have been written by another backend and removed from shared buffers already. If it already exists, goto 1)
6) Zero out the page on disk.

I think this does handle the concurrency issues.

This patch is very clearly in the POC stage. But I do think the approach is generally sound. I'd like to see some comments before deciding whether to carry on.

Greetings,

Andres Freund

PS: Yes, I know that this much precision in the benchmark numbers isn't warranted, but I'm too lazy to truncate them.

[1] [10:28:11 PM] Tomas Vondra:
4x Intel Xeon E5-4620 Eight Core 2.2GHz processors (Sandy Bridge-EP generation; each core handles 2 threads, so 16 threads total)
256GB (16x16GB) ECC REG System Validated Memory (1333 MHz)
2x 250GB SATA 2.5" Enterprise Level HDs (RAID 1, ~250GB)
17x 600GB SATA 2.5" Solid State HDs (RAID 0, ~10TB)
LSI MegaRAID 9271-8iCC controller and CacheVault Kit (1GB cache)
2x Nvidia Tesla K20 Active GPU Cards (GK110GL)