Discussion: Why is percent_rank so much slower than rank?
Hi all,

I'm new to window functions. Recently I ran some simple queries but was surprised to find percent_rank is so much slower than rank. Could anybody tell me why?

The table schema:

test=# \d inventory1
          Table "public.inventory1"
        Column        |  Type   | Modifiers
----------------------+---------+-----------
 inv_date_sk          | integer | not null
 inv_item_sk          | integer | not null
 inv_warehouse_sk     | integer | not null
 inv_quantity_on_hand | integer |

test=# \dt+ inventory1
                      List of relations
 Schema |    Name    | Type  |  Owner   |  Size   | Description
--------+------------+-------+----------+---------+-------------
 public | inventory1 | table | workshop | 8880 kB |

The rank query result:

test=# explain analyze select inv_date_sk, inv_item_sk, rank() over(partition by inv_date_sk order by inv_item_sk) from inventory1;
                                                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=19563.99..23343.99 rows=189000 width=8) (actual time=631.947..1361.158 rows=189000 loops=1)
   ->  Sort  (cost=19563.99..20036.49 rows=189000 width=8) (actual time=631.924..771.990 rows=189000 loops=1)
         Sort Key: inv_date_sk, inv_item_sk
         Sort Method:  quicksort  Memory: 12218kB
         ->  Seq Scan on inventory1  (cost=0.00..3000.00 rows=189000 width=8) (actual time=0.055..198.948 rows=189000 loops=1)
 Total runtime: 1500.193 ms
(6 rows)

The percent_rank result:

test=# explain analyze select inv_date_sk, inv_item_sk, percent_rank() over(partition by inv_date_sk order by inv_item_sk) from inventory1;
                                                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=19563.99..23343.99 rows=189000 width=8) (actual time=766.432..32924.804 rows=189000 loops=1)
   ->  Sort  (cost=19563.99..20036.49 rows=189000 width=8) (actual time=756.320..905.407 rows=189000 loops=1)
         Sort Key: inv_date_sk, inv_item_sk
         Sort Method:  quicksort  Memory: 12218kB
         ->  Seq Scan on inventory1  (cost=0.00..3000.00 rows=189000 width=8) (actual time=0.102..224.607 rows=189000 loops=1)
 Total runtime: 33152.188 ms
(6 rows)

One special thing is that all the values of the partition key (inv_date_sk) are the same, that is, there is only one window partition. I find that percent_rank needs to buffer all the tuples to get the total number of rows. But why is it so expensive?

I use 8.4.4, and I only increased work_mem to 100MB, leaving the other parameters untouched.

Thanks,
Li Jie
Jie Li <jay23jack@gmail.com> writes:
> I'm new to window functions. Recently I ran some simple queries but was
> surprised to find percent_rank is so much slower than rank. Could anybody
> tell me why?

Huh, interesting. I can reproduce this with toy data, such as

	create table inventory1 (inv_date_sk int, inv_item_sk int);
	insert into inventory1 select 1, random() * 100000 from generate_series(1,189000);
	explain analyze select inv_date_sk, inv_item_sk, percent_rank() over(partition by inv_date_sk order by inv_item_sk) from inventory1;

The example is *not* particularly slow if you leave work_mem at default.
But if you bump up work_mem enough so that the WindowAgg's internal
tuplestore fits into memory, it slows down like crazy. A bit of quality
time with oprofile shows that all the time is going into this memmove()
in tuplestore_trim():

	/*
	 * Slide the array down and readjust pointers.  This may look pretty
	 * stupid, but we expect that there will usually not be very many
	 * tuple-pointers to move, so this isn't that expensive; and it keeps a
	 * lot of other logic simple.
	 *
	 * In fact, in the current usage for merge joins, it's demonstrable that
	 * there will always be exactly one non-removed tuple; so optimize that
	 * case.
	 */
	if (nremove + 1 == state->memtupcount)
		state->memtuples[0] = state->memtuples[nremove];
	else
		memmove(state->memtuples, state->memtuples + nremove,
				(state->memtupcount - nremove) * sizeof(void *));

We're throwing away one tuple at a time as we advance forward through
the tuplestore, and moving 100000+ tuple pointers each time. Ugh.

This code was all right when written, because (IIRC) the mergejoin case
was actually the only caller. But it's not all right for WindowAgg's
less-predictable usage patterns.

I thought for a bit about changing things around so that the first-used
tuple slot isn't necessarily state->memtuples[0], but just like the
comment says, that complicates a lot of other logic. And there isn't
any easy place to reclaim the wasted slots later.
What seems like the best bet is to put in a heuristic to make
tuplestore_trim simply not do anything until nremove reaches some
reasonably large amount, perhaps 10% of the number of stored tuples.
This wastes up to 10% of the allotted memory, but that seems tolerable.

We could complicate things a bit more by remembering that so-and-so many
slots are authorized to be removed, and forcing a trim operation to
discard them if we find ourselves in memory trouble. I'm not sure that
extra complication is worthwhile though.

Comments?

			regards, tom lane
I wrote:
> We're throwing away one tuple at a time as we advance forward through
> the tuplestore, and moving 100000+ tuple pointers each time.  Ugh.
> This code was all right when written, because (IIRC) the mergejoin
> case was actually the only caller.  But it's not all right for
> WindowAgg's less-predictable usage patterns.
>
> I thought for a bit about changing things around so that the first-used
> tuple slot isn't necessarily state->memtuples[0], but just like the
> comment says, that complicates a lot of other logic.  And there isn't
> any easy place to reclaim the wasted slots later.
>
> What seems like the best bet is to put in a heuristic to make
> tuplestore_trim simply not do anything until nremove reaches some
> reasonably large amount, perhaps 10% of the number of stored tuples.
> This wastes up to 10% of the allotted memory, but that seems tolerable.

On reflection I think just not doing anything isn't a very good idea.
The problem with that is that a mis-coded caller could try to fetch
tuples that it had already told the tuplestore could be trimmed away;
and this would work, most of the time, until you got unlucky and the
trim operation had actually deleted them. I think it's pretty important
for bug-catching purposes that the tuplestore enforce that those tuples
are not available anymore.

Hence the attached patch, which combines the two ideas by recycling
tuples immediately but not sliding the pointer array until a reasonable
amount of movement has occurred. This fixes the complained-of
performance problem AFAICT.

I'm not sure whether or not to back-patch this into 9.0 and 8.4. The
code in tuplestore.c hasn't changed at all since 8.4, so there's not
much risk of cross-version bugs, but if I did miss anything we could
be shipping a buggy version next week. Thoughts?
			regards, tom lane

diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index 9bbaba43771f495fdf24e9f2afd545b69a22ecbd..8c8139c897679892e0d4ad13e69ae8d814484206 100644
*** a/src/backend/utils/sort/tuplestore.c
--- b/src/backend/utils/sort/tuplestore.c
*************** struct Tuplestorestate
*** 145,152 ****
--- 145,159 ----
  	/*
  	 * This array holds pointers to tuples in memory if we are in state INMEM.
  	 * In states WRITEFILE and READFILE it's not used.
+ 	 *
+ 	 * When memtupdeleted > 0, the first memtupdeleted pointers are already
+ 	 * released due to a tuplestore_trim() operation, but we haven't expended
+ 	 * the effort to slide the remaining pointers down.  These unused pointers
+ 	 * are set to NULL to catch any invalid accesses.  Note that memtupcount
+ 	 * includes the deleted pointers.
  	 */
  	void	  **memtuples;		/* array of pointers to palloc'd tuples */
+ 	int			memtupdeleted;	/* the first N slots are currently unused */
  	int			memtupcount;	/* number of tuples currently present */
  	int			memtupsize;		/* allocated length of memtuples array */
*************** tuplestore_begin_common(int eflags, bool
*** 252,257 ****
--- 259,265 ----
  	state->context = CurrentMemoryContext;
  	state->resowner = CurrentResourceOwner;
+ 	state->memtupdeleted = 0;
  	state->memtupcount = 0;
  	state->memtupsize = 1024;	/* initial guess */
  	state->memtuples = (void **) palloc(state->memtupsize * sizeof(void *));
*************** tuplestore_clear(Tuplestorestate *state)
*** 401,407 ****
  	state->myfile = NULL;
  	if (state->memtuples)
  	{
! 		for (i = 0; i < state->memtupcount; i++)
  		{
  			FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
  			pfree(state->memtuples[i]);
--- 409,415 ----
  	state->myfile = NULL;
  	if (state->memtuples)
  	{
! 		for (i = state->memtupdeleted; i < state->memtupcount; i++)
  		{
  			FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
  			pfree(state->memtuples[i]);
*************** tuplestore_clear(Tuplestorestate *state)
*** 409,414 ****
--- 417,423 ----
  	}
  	state->status = TSS_INMEM;
  	state->truncated = false;
+ 	state->memtupdeleted = 0;
  	state->memtupcount = 0;
  	readptr = state->readptrs;
  	for (i = 0; i < state->readptrcount; readptr++, i++)
*************** tuplestore_end(Tuplestorestate *state)
*** 432,438 ****
  		BufFileClose(state->myfile);
  	if (state->memtuples)
  	{
! 		for (i = 0; i < state->memtupcount; i++)
  			pfree(state->memtuples[i]);
  		pfree(state->memtuples);
  	}
--- 441,447 ----
  		BufFileClose(state->myfile);
  	if (state->memtuples)
  	{
! 		for (i = state->memtupdeleted; i < state->memtupcount; i++)
  			pfree(state->memtuples[i]);
  		pfree(state->memtuples);
  	}
*************** tuplestore_gettuple(Tuplestorestate *sta
*** 774,787 ****
  			}
  			else
  			{
! 				if (readptr->current <= 0)
  				{
  					Assert(!state->truncated);
  					return NULL;
  				}
  				readptr->current--; /* last returned tuple */
  			}
! 			if (readptr->current <= 0)
  			{
  				Assert(!state->truncated);
  				return NULL;
--- 783,796 ----
  			}
  			else
  			{
! 				if (readptr->current <= state->memtupdeleted)
  				{
  					Assert(!state->truncated);
  					return NULL;
  				}
  				readptr->current--; /* last returned tuple */
  			}
! 			if (readptr->current <= state->memtupdeleted)
  			{
  				Assert(!state->truncated);
  				return NULL;
*************** dumptuples(Tuplestorestate *state)
*** 969,975 ****
  {
  	int			i;
  
! 	for (i = 0;; i++)
  	{
  		TSReadPointer *readptr = state->readptrs;
  		int			j;
--- 978,984 ----
  {
  	int			i;
  
! 	for (i = state->memtupdeleted;; i++)
  	{
  		TSReadPointer *readptr = state->readptrs;
  		int			j;
*************** dumptuples(Tuplestorestate *state)
*** 984,989 ****
--- 993,999 ----
  				break;
  		WRITETUP(state, state->memtuples[i]);
  	}
+ 	state->memtupdeleted = 0;
  	state->memtupcount = 0;
  }
*************** tuplestore_trim(Tuplestorestate *state)
*** 1153,1176 ****
  	nremove = oldest - 1;
  	if (nremove <= 0)
  		return;					/* nothing to do */
  	Assert(nremove <= state->memtupcount);
  
  	/* Release no-longer-needed tuples */
! 	for (i = 0; i < nremove; i++)
  	{
  		FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
  		pfree(state->memtuples[i]);
  	}
  
  	/*
! 	 * Slide the array down and readjust pointers.  This may look pretty
! 	 * stupid, but we expect that there will usually not be very many
! 	 * tuple-pointers to move, so this isn't that expensive; and it keeps a
! 	 * lot of other logic simple.
  	 *
! 	 * In fact, in the current usage for merge joins, it's demonstrable that
! 	 * there will always be exactly one non-removed tuple; so optimize that
! 	 * case.
  	 */
  	if (nremove + 1 == state->memtupcount)
  		state->memtuples[0] = state->memtuples[nremove];
--- 1163,1198 ----
  	nremove = oldest - 1;
  	if (nremove <= 0)
  		return;					/* nothing to do */
+ 
+ 	Assert(nremove >= state->memtupdeleted);
  	Assert(nremove <= state->memtupcount);
  
  	/* Release no-longer-needed tuples */
! 	for (i = state->memtupdeleted; i < nremove; i++)
  	{
  		FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
  		pfree(state->memtuples[i]);
+ 		state->memtuples[i] = NULL;
  	}
+ 	state->memtupdeleted = nremove;
+ 
+ 	/* mark tuplestore as truncated (used for Assert crosschecks only) */
+ 	state->truncated = true;
  
  	/*
! 	 * If nremove is less than 1/8th memtupcount, just stop here, leaving the
! 	 * "deleted" slots as NULL.  This prevents us from expending O(N^2) time
! 	 * repeatedly memmove-ing a large pointer array.  The worst case space
! 	 * wastage is pretty small, since it's just pointers and not whole tuples.
! 	 */
! 	if (nremove < state->memtupcount / 8)
! 		return;
! 
! 	/*
! 	 * Slide the array down and readjust pointers.
  	 *
! 	 * In mergejoin's current usage, it's demonstrable that there will always
! 	 * be exactly one non-removed tuple; so optimize that case.
  	 */
  	if (nremove + 1 == state->memtupcount)
  		state->memtuples[0] = state->memtuples[nremove];
*************** tuplestore_trim(Tuplestorestate *state)
*** 1178,1192 ****
  		memmove(state->memtuples, state->memtuples + nremove,
  				(state->memtupcount - nremove) * sizeof(void *));
  
  	state->memtupcount -= nremove;
  	for (i = 0; i < state->readptrcount; i++)
  	{
  		if (!state->readptrs[i].eof_reached)
  			state->readptrs[i].current -= nremove;
  	}
- 
- 	/* mark tuplestore as truncated (used for Assert crosschecks only) */
- 	state->truncated = true;
  }
  
  /*
--- 1200,1212 ----
  		memmove(state->memtuples, state->memtuples + nremove,
  				(state->memtupcount - nremove) * sizeof(void *));
  
+ 	state->memtupdeleted = 0;
  	state->memtupcount -= nremove;
  	for (i = 0; i < state->readptrcount; i++)
  	{
  		if (!state->readptrs[i].eof_reached)
  			state->readptrs[i].current -= nremove;
  	}
  }
  
  /*
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm not sure whether or not to back-patch this into 9.0 and 8.4.
> The code in tuplestore.c hasn't changed at all since 8.4, so
> there's not much risk of cross-version bugs, but if I did miss
> anything we could be shipping a buggy version next week.
> Thoughts?

Is there a performance regression involved, or is it a new feature
which hasn't performed well on this type of query until your patch?
If the latter, I'd be inclined to give it some time on HEAD and
release it in the *following* minor release unless you're *very*
confident it couldn't break anything.

It's an uphill battle to convince managers that it's safe to apply
minor upgrades with minimal testing. It doesn't take too many slips
for the boulder to roll all the way back to the bottom of that hill.

-Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'm not sure whether or not to back-patch this into 9.0 and 8.4.
>> The code in tuplestore.c hasn't changed at all since 8.4, so
>> there's not much risk of cross-version bugs, but if I did miss
>> anything we could be shipping a buggy version next week.
>> Thoughts?

> Is there a performance regression involved, or is it a new feature
> which hasn't performed well on this type of query until your patch?

Well, since window functions didn't exist before 8.4, it's difficult
to argue that there was a regression. It's certainly a performance bug
though: nobody would expect that giving a query *more* work_mem would
cause it to run many times slower.

> If the latter, I'd be inclined to give it some time on HEAD and
> release it in the *following* minor release unless you're *very*
> confident it couldn't break anything.

Well, I'm reasonably confident in the patch, and it does pass
regression tests. But I've been wrong before.

I'm not terribly thrilled with that suggestion though. Do you have
reason to think that anybody is likely to exercise window functions
in HEAD, beyond what the regression tests do, in the next couple of
months? Slipping the application of the patch to back branches by a
little bit doesn't make a lot of management sense to me. I think
either we trust it and put it into back branches, or we don't trust
it and put it only in HEAD, so it goes through a beta cycle before
hitting production.

			regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Do you have reason to think that anybody is likely to exercise
> window functions in HEAD, beyond what the regression tests do, in
> the next couple of months?

Not specifically, no. From the description (not having read the
patch) I was somewhat concerned that it might affect something
outside that narrow use case. If that's not possible, then I'm more
comfortable putting it in a release that soon after it hits the
repository.

It's a judgment call, and you're clearly in the best position to make
it. You asked for thoughts, so I shared my concerns. :-)

-Kevin
On Thu, Dec 09, 2010 at 05:18:57PM -0500, Tom Lane wrote:
> I wrote:
>> We're throwing away one tuple at a time as we advance forward through
>> the tuplestore, and moving 100000+ tuple pointers each time.  Ugh.
>> This code was all right when written, because (IIRC) the mergejoin
>> case was actually the only caller.  But it's not all right for
>> WindowAgg's less-predictable usage patterns.
>
>> I thought for a bit about changing things around so that the first-used
>> tuple slot isn't necessarily state->memtuples[0], but just like the
>> comment says, that complicates a lot of other logic.  And there isn't
>> any easy place to reclaim the wasted slots later.
>
>> What seems like the best bet is to put in a heuristic to make
>> tuplestore_trim simply not do anything until nremove reaches some
>> reasonably large amount, perhaps 10% of the number of stored tuples.
>> This wastes up to 10% of the allotted memory, but that seems tolerable.
>
> On reflection I think just not doing anything isn't a very good idea.
> The problem with that is that a mis-coded caller could try to fetch
> tuples that it had already told the tuplestore could be trimmed away;
> and this would work, most of the time, until you got unlucky and the
> trim operation had actually deleted them.  I think it's pretty important
> for bug-catching purposes that the tuplestore enforce that those tuples
> are not available anymore.
>
> Hence the attached patch, which combines the two ideas by recycling
> tuples immediately but not sliding the pointer array until a reasonable
> amount of movement has occurred.  This fixes the complained-of
> performance problem AFAICT.
>
> I'm not sure whether or not to back-patch this into 9.0 and 8.4.  The
> code in tuplestore.c hasn't changed at all since 8.4, so there's not
> much risk of cross-version bugs, but if I did miss anything we could
> be shipping a buggy version next week.  Thoughts?

+1 for back-patching.

Ken
2010/12/10 Tom Lane <tgl@sss.pgh.pa.us>:
> I wrote:
>> We're throwing away one tuple at a time as we advance forward through
>> the tuplestore, and moving 100000+ tuple pointers each time.  Ugh.
>> This code was all right when written, because (IIRC) the mergejoin
>> case was actually the only caller.  But it's not all right for
>> WindowAgg's less-predictable usage patterns.
>
>> I thought for a bit about changing things around so that the first-used
>> tuple slot isn't necessarily state->memtuples[0], but just like the
>> comment says, that complicates a lot of other logic.  And there isn't
>> any easy place to reclaim the wasted slots later.
>
>> What seems like the best bet is to put in a heuristic to make
>> tuplestore_trim simply not do anything until nremove reaches some
>> reasonably large amount, perhaps 10% of the number of stored tuples.
>> This wastes up to 10% of the allotted memory, but that seems tolerable.
>
> On reflection I think just not doing anything isn't a very good idea.
> The problem with that is that a mis-coded caller could try to fetch
> tuples that it had already told the tuplestore could be trimmed away;
> and this would work, most of the time, until you got unlucky and the
> trim operation had actually deleted them.  I think it's pretty important
> for bug-catching purposes that the tuplestore enforce that those tuples
> are not available anymore.

I see it's too late now that you've committed it, but it seems there
was another way to avoid it: not trimming from percent_rank()
individually. Once the whole partition fits in memory, you don't need
to trim it, since it never grows. The trimming logic is for something
like moving aggregates and (simple) rank(), which grow the tuplestore
content as they advance. percent_rank() doesn't seem to match the
optimization.

Regards,

-- 
Hitoshi Harada
Hitoshi Harada <umi.tanuki@gmail.com> writes:
> I see it's too late now that you've committed it,

Patches can always be reverted...

> but it seems there was another way to avoid it: not trimming from
> percent_rank() individually.  Once the whole partition fits in
> memory, you don't need to trim it, since it never grows.  The
> trimming logic is for something like moving aggregates and (simple)
> rank(), which grow the tuplestore content as they advance.
> percent_rank() doesn't seem to match the optimization.

I don't think this idea leads to a robust solution. When you have a
combination of different window functions being used in the same scan,
you can't expect any one of them to know the global situation. Having
percent_rank lie about its requirements in order to avoid bad behavior
in the tuplestore infrastructure is just going to create more problems
down the road. We need to have the individual functions tell the truth
and then do any optimization hacking in the WindowAgg code or
infrastructure.

			regards, tom lane
2010/12/11 Tom Lane <tgl@sss.pgh.pa.us>:
> Hitoshi Harada <umi.tanuki@gmail.com> writes:
>> I see it's too late now that you've committed it,
>
> Patches can always be reverted...
>
>> but it seems there was another way to avoid it: not trimming from
>> percent_rank() individually.  Once the whole partition fits in
>> memory, you don't need to trim it, since it never grows.  The
>> trimming logic is for something like moving aggregates and (simple)
>> rank(), which grow the tuplestore content as they advance.
>> percent_rank() doesn't seem to match the optimization.
>
> I don't think this idea leads to a robust solution.  When you have a
> combination of different window functions being used in the same scan,
> you can't expect any one of them to know the global situation.  Having
> percent_rank lie about its requirements in order to avoid bad behavior
> in the tuplestore infrastructure is just going to create more problems
> down the road.  We need to have the individual functions tell the truth
> and then do any optimization hacking in the WindowAgg code or
> infrastructure.

Hm? Once percent_rank() scans to the partition end, any other window
functions that scan row by row don't need to care about the memory
reduction, do they? Or more generally, once the partition has been
scanned to the end, we don't need to trim the tuplestore anymore. Am I
misunderstanding?

Regards,

-- 
Hitoshi Harada
Hitoshi Harada <umi.tanuki@gmail.com> writes:
> Hm?  Once percent_rank() scans to the partition end, any other window
> functions that scan row by row don't need to care about the memory
> reduction, do they?  Or more generally, once the partition has been
> scanned to the end, we don't need to trim the tuplestore anymore.  Am
> I misunderstanding?

Giving back the memory as we do the scan is still a good thing IMO;
there might be other uses for it. In any case I don't see where you're
going to put such a heuristic without breaking potentially interesting
uses elsewhere. The tuplestore doesn't know anything about partitions
being read to the end; and WindowAgg doesn't (or shouldn't) know about
whether the tuplestore is all in memory. Furthermore, the performance
problem would exist for any situation where the window functions had
read far beyond the frame start, whether that was all the way to
partition end or not. Consider a frame like ROWS BETWEEN 10000
PRECEDING AND 10000 FOLLOWING.

In the end this is a local problem inside tuplestore, and kluging its
callers to work around it is the wrong approach.

			regards, tom lane
Tom Lane wrote:
> argue that there was a regression.  It's certainly a performance bug
> though: nobody would expect that giving a query *more* work_mem would
> cause it to run many times slower.

I wouldn't be that surprised - otherwise it'd just be hard-coded to
something large. Especially since earlier in the thread:

> The example is *not* particularly slow if you leave work_mem at default.

which makes me think it's arguably not quite a bug.