Re: Parallel tuplesort (for parallel B-Tree index creation)
From | Robert Haas
Subject | Re: Parallel tuplesort (for parallel B-Tree index creation)
Date |
Msg-id | CA+TgmoYiGH72XKQH_ayTQn_n=sMYGuYHZ1-67MyzX0n-ZHQp6Q@mail.gmail.com
In reply to | Re: Parallel tuplesort (for parallel B-Tree index creation) (Peter Geoghegan <pg@heroku.com>)
List | pgsql-hackers
On Wed, Nov 9, 2016 at 10:18 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Maybe another way of putting this is that, while there's clearly a
>> benefit to having some kind of a cap, it's appropriate to pick a large
>> value, such as 500.  Having no cap at all risks creating many extra
>> tapes that just waste memory, and also risks an unduly
>> cache-inefficient final merge.  Reining that in makes sense.
>> However, we can't rein it in too far or we'll create slow polyphase
>> merges in cases that are reasonably likely to occur in real life.
>
> I completely agree with your analysis.

Cool.

BTW, my run with 10 tapes completed in 10696528.377 ms (02:58:16.528)
- i.e. almost 3 minutes slower than with no tape limit.  Building runs
took 4260.16 s, and the final merge pass began at 8239.12 s.  That's
certainly better than I expected, and it seems to show that even if
the number of tapes is grossly inadequate for the number of runs, you
can still make up most of the time that you lose to I/O with improved
cache efficiency -- at least under favorable circumstances.  Of
course, on many systems I/O bandwidth will be a scarce resource, so
that argument can be overdone -- and even if not, the 10-tape sort
takes FAR longer to deliver the first tuple.

I also tried this out with work_mem = 512MB.  Doubling work_mem
reduces the number of runs enough that we don't get a polyphase merge
in any case.
With no limit on tapes:

2016-11-10 11:24:45 UTC [54042] LOG:  switching to external sort with 1873 tapes: CPU: user: 11.34 s, system: 0.48 s, elapsed: 12.13 s
2016-11-10 12:36:22 UTC [54042] LOG:  finished writing run 308 to tape 307: CPU: user: 4096.63 s, system: 156.88 s, elapsed: 4309.66 s
2016-11-10 12:36:22 UTC [54042] LOG:  using 516563 KB of memory for read buffers among 308 input tapes
2016-11-10 12:36:30 UTC [54042] LOG:  performsort done (except 308-way final merge): CPU: user: 4097.75 s, system: 157.24 s, elapsed: 4317.76 s
2016-11-10 13:54:07 UTC [54042] LOG:  external sort ended, 6255577 disk blocks used: CPU: user: 8638.72 s, system: 177.42 s, elapsed: 8974.44 s

With max_sort_tapes = 501:

2016-11-10 14:23:50 UTC [54042] LOG:  switching to external sort with 502 tapes: CPU: user: 10.99 s, system: 0.54 s, elapsed: 11.57 s
2016-11-10 15:36:47 UTC [54042] LOG:  finished writing run 278 to tape 277: CPU: user: 4190.31 s, system: 155.33 s, elapsed: 4388.86 s
2016-11-10 15:36:47 UTC [54042] LOG:  using 517313 KB of memory for read buffers among 278 input tapes
2016-11-10 15:36:54 UTC [54042] LOG:  performsort done (except 278-way final merge): CPU: user: 4191.36 s, system: 155.68 s, elapsed: 4395.66 s
2016-11-10 16:53:39 UTC [54042] LOG:  external sort ended, 6255699 disk blocks used: CPU: user: 8673.07 s, system: 175.93 s, elapsed: 9000.80 s

0.3% slower with the tape limit, but that might be noise.  Even if
not, it seems pretty silly to create 1873 tapes when we only need
~300.
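The cap being discussed amounts to a simple clamp on the merge order. Here is a minimal illustrative sketch, not PostgreSQL's actual tuplesort code; the function name, the floor of 6, and the per-tape buffer figure are stand-in assumptions:

```python
def merge_order(avail_mem_kb, per_tape_kb=96, cap=500, floor=6):
    """Pick a number of merge tapes: as many as the available memory
    can afford buffers for, but never more than the cap and never
    fewer than a small floor (so very low work_mem still sorts)."""
    affordable = avail_mem_kb // per_tape_kb
    return max(floor, min(cap, affordable))

# At 512MB of sort memory the uncapped count would be in the
# thousands; the cap clamps it to 500.
print(merge_order(512 * 1024))   # -> 500
```

Without the `cap` term, `affordable` grows linearly with work_mem, which is how counts like 1873 tapes arise when only ~300 runs exist.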
At work_mem = 2GB:

2016-11-10 18:08:00 UTC [54042] LOG:  switching to external sort with 7490 tapes: CPU: user: 44.28 s, system: 1.99 s, elapsed: 46.33 s
2016-11-10 19:23:06 UTC [54042] LOG:  finished writing run 77 to tape 76: CPU: user: 4342.10 s, system: 156.21 s, elapsed: 4551.95 s
2016-11-10 19:23:06 UTC [54042] LOG:  using 2095202 KB of memory for read buffers among 77 input tapes
2016-11-10 19:23:12 UTC [54042] LOG:  performsort done (except 77-way final merge): CPU: user: 4343.36 s, system: 157.07 s, elapsed: 4558.79 s
2016-11-10 20:24:24 UTC [54042] LOG:  external sort ended, 6255946 disk blocks used: CPU: user: 7894.71 s, system: 176.36 s, elapsed: 8230.13 s

At work_mem = 2GB, max_sort_tapes = 501:

2016-11-10 21:28:23 UTC [54042] LOG:  switching to external sort with 502 tapes: CPU: user: 44.09 s, system: 1.94 s, elapsed: 46.07 s
2016-11-10 22:42:28 UTC [54042] LOG:  finished writing run 68 to tape 67: CPU: user: 4278.49 s, system: 154.39 s, elapsed: 4490.25 s
2016-11-10 22:42:28 UTC [54042] LOG:  using 2095427 KB of memory for read buffers among 68 input tapes
2016-11-10 22:42:34 UTC [54042] LOG:  performsort done (except 68-way final merge): CPU: user: 4279.60 s, system: 155.21 s, elapsed: 4496.83 s
2016-11-10 23:42:10 UTC [54042] LOG:  external sort ended, 6255983 disk blocks used: CPU: user: 7733.98 s, system: 173.85 s, elapsed: 8072.55 s

Roughly 2% faster.  Maybe still noise, but less likely.  7490 tapes
certainly seems over the top.
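For readers less familiar with external sorting, the "N-way final merge" in these log lines boils down to a heap-based merge of sorted runs: repeatedly emit the smallest head element among all input tapes. A minimal Python stand-in (not the actual C implementation, which merges tuples streamed from disk):

```python
import heapq

def final_merge(runs):
    """N-way merge of already-sorted runs, as in an on-the-fly final
    merge: a small heap tracks the current head of each input tape."""
    # Seed the heap with (value, run index, position) for each
    # non-empty run; the run index breaks ties deterministically.
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    while heap:
        val, i, j = heapq.heappop(heap)
        yield val
        if j + 1 < len(runs[i]):
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))

print(list(final_merge([[1, 4], [2, 3], [0, 5]])))  # -> [0, 1, 2, 3, 4, 5]
```

Each of the N runs needs its own read buffer during this phase, which is why fewer (but larger-buffered) input tapes can be more cache- and I/O-efficient than thousands of tiny ones.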
At work_mem = 8GB:

2016-11-14 19:17:28 UTC [54042] LOG:  switching to external sort with 29960 tapes: CPU: user: 183.80 s, system: 7.71 s, elapsed: 191.61 s
2016-11-14 20:32:02 UTC [54042] LOG:  finished writing run 20 to tape 19: CPU: user: 4431.44 s, system: 176.82 s, elapsed: 4665.16 s
2016-11-14 20:32:02 UTC [54042] LOG:  using 8388083 KB of memory for read buffers among 20 input tapes
2016-11-14 20:32:26 UTC [54042] LOG:  performsort done (except 20-way final merge): CPU: user: 4432.99 s, system: 181.29 s, elapsed: 4689.52 s
2016-11-14 21:30:56 UTC [54042] LOG:  external sort ended, 6256003 disk blocks used: CPU: user: 7835.83 s, system: 199.01 s, elapsed: 8199.29 s

At work_mem = 8GB, max_sort_tapes = 501:

2016-11-14 21:52:43 UTC [54042] LOG:  switching to external sort with 502 tapes: CPU: user: 181.08 s, system: 7.66 s, elapsed: 189.05 s
2016-11-14 23:06:06 UTC [54042] LOG:  finished writing run 17 to tape 16: CPU: user: 4381.56 s, system: 161.82 s, elapsed: 4591.63 s
2016-11-14 23:06:06 UTC [54042] LOG:  using 8388158 KB of memory for read buffers among 17 input tapes
2016-11-14 23:06:36 UTC [54042] LOG:  performsort done (except 17-way final merge): CPU: user: 4383.45 s, system: 165.32 s, elapsed: 4622.04 s
2016-11-14 23:54:00 UTC [54042] LOG:  external sort ended, 6256002 disk blocks used: CPU: user: 7124.49 s, system: 182.16 s, elapsed: 7466.18 s

Roughly 9% faster.  Building runs seems to be very slowly degrading as
we increase work_mem, but the final merge is speeding up somewhat more
quickly.  Intuitively that makes sense to me: if merging were faster
than quicksorting, we could just merge-sort all the time instead of
using quicksort for internal sorts.  Also, we've got 29960 tapes now,
better than three orders of magnitude more than what we actually need.
At this work_mem setting, 501 tapes is enough to efficiently sort at
least 4TB of data and quite possibly a good bit more.

So, committed 0001, with comment changes along the lines I proposed
before.
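The percentages quoted above can be rechecked directly from the logged elapsed times, along with the back-of-the-envelope capacity estimate (each initial run is at least roughly work_mem in size, so a single 501-way merge bounds the data a one-pass sort can cover):

```python
def pct_change(unlimited, capped):
    """Percent improvement of the capped run over the uncapped run,
    from the 'external sort ended' elapsed times in the logs."""
    return (unlimited - capped) / unlimited * 100

print(round(pct_change(8974.44, 9000.80), 1))  # work_mem=512MB -> -0.3 (slower)
print(round(pct_change(8230.13, 8072.55), 1))  # work_mem=2GB   ->  1.9
print(round(pct_change(8199.29, 7466.18), 1))  # work_mem=8GB   ->  8.9

# Capacity: 501 runs of ~8GB each merged in one pass.  Runs can be
# larger than work_mem, so this is a lower bound.
print(501 * 8, "GB")  # -> 4008 GB, i.e. about 4TB
```

The three data points match the "0.3% slower", "roughly 2% faster", and "roughly 9% faster" figures in the message, and the last line is the source of the "at least 4TB" claim.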
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company