Re: Abbreviated keys for Numeric
От | Peter Geoghegan |
---|---|
Тема | Re: Abbreviated keys for Numeric |
Дата | |
Msg-id | CAM3SWZRsvR758mWWdpNbohnN-qQhqSpSJAis8nhm2zmhDUgNpA@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: Abbreviated keys for Numeric (Andrew Gierth <andrew@tao11.riddles.org.uk>) |
Ответы |
Re: Abbreviated keys for Numeric
(Andrew Gierth <andrew@tao11.riddles.org.uk>)
|
Список | pgsql-hackers |
On Fri, Mar 20, 2015 at 11:58 PM, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote: > Comparisons between nulls and nulls, or between nulls and non-nulls, are > cheap; only comparisons between non-nulls and non-nulls can be > expensive. > > The purpose of abbreviation is to replace expensive comparisons by cheap > ones where possible, and therefore the cost model used for abbreviation > should ignore nulls entirely; all that matters is the number of non-null > values and the probability of saving time by abbreviating them. I don't think it's that simple. The actual point of abbreviation is to amortize the total cost of performing O(n log n) comparisons (the average case, which is usually assumed by infrastructure like sortsupport), by performing an encoding process n times. That encoding process may be more expensive than each of the comparisons. Sometimes significantly more expensive. Forget about abbreviation entirely for a minute - consider plain old sorting of strxfrm() blobs, which for example the glibc documentation recommends (with some caveats) as the preferred approach to sorting strings while respecting collation. Suppose that your applications happens to frequently encounter fully sorted arrays of strings when passed an array of strings to sort. If you were using our qsort() implementation, which, rather questionably, has a pre-check for fully sorted input (that throws away work when the pre-check fails), then you'll only have n comparisons. Even if an encoding is only as expensive as an actual comparison, the potential to lose with encoding clearly exists. Similarly, we don't abbreviate for top-n heap sorts, even though all of those abbreviated keys may be used for at least one comparison. Now, I hear what you're saying about weighing the costs and the benefits -- you point out that abbreviation of NULL values is not a cost, which is certainly true. But if the total number of NULL values is so high that they dominate, then abbreviation might well be the wrong thing for that reason alone. In general, string transformation is not recommended for sorting very small lists of strings. Where I think your argument is stronger is around the cost of actually aborting in particular (even though not much work is thrown away). Scanning through the memtuples array once more certainly isn't free. And you could take the view that it's always worth the risk, since it's at most a marginal cost saved if the first ~10k tuples are representative, but a large cost incurred when they are not. So on reflection, I am inclined to put the check for non-null values back in. I also concede that the cost of abbreviation is lower with your numeric encoding scheme than it is for that of the text opclass, which favors this approach. -- Peter Geoghegan
В списке pgsql-hackers по дате отправления:
Следующее
От: Kouhei KaigaiДата:
Сообщение: Re: Custom/Foreign-Join-APIs (Re: [v9.5] Custom Plan API)