Thread: Re: [PERFORM] Estimation problem with a LIKE clause containing a /
"Guillaume Smet" <guillaume.smet@gmail.com> writes: > On Nov 8, 2007 12:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I've applied a patch that might help you: >> http://archives.postgresql.org/pgsql-committers/2007-11/msg00104.php > AFAICS, it doesn't seem to fix the problem. I just compiled > REL8_1_STABLE branch and I still has the following behaviour: OK, I tried it in fr_FR locale and what I find is that regression=# select '123/' < '1230'::text; ?column? ---------- t (1 row) so make_greater_string() will still think that its first try at generating an upper-bound string is good enough. However regression=# select '123/1' < '1230'::text; ?column? ---------- f (1 row) so the data starting with '123/' is still outside the generated range, leading to a wrong estimate. I didn't see this behavior yesterday but I was experimenting with en_US which I guess has different rules. What I am tempted to do about this is have make_greater_string tack "zz" onto the supplied prefix, so that it would have to find a string that compares greater than "123/zz" before reporting success. This is getting pretty klugy though, so cc'ing to pgsql-hackers to see if anyone has a better idea. regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > What I am tempted to do about this is have make_greater_string tack "zz" > onto the supplied prefix, so that it would have to find a string that > compares greater than "123/zz" before reporting success. This is > getting pretty klugy though, so cc'ing to pgsql-hackers to see if anyone > has a better idea. Hm, instead of "zz" is there a convenient way to find out what actual character sorts last amongst all the single characters in the locale's encoding? Doesn't really strike at the core reason that this is so klugy though. Surely the "right" thing is to push the concept of open versus closed end-points through deeper into the estimation logic? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!
Gregory Stark <stark@enterprisedb.com> writes:
> Doesn't really strike at the core reason that this is so klugy though.
> Surely the "right" thing is to push the concept of open versus closed
> end-points through deeper into the estimation logic?

No, the right thing is to take the folk who defined "dictionary sort
order" out behind the barn and shoot 'em ;-). This has got nothing to do
with open/closed endpoints and everything to do with the bizarre sorting
rules used by some locales. In particular, the reason I want to append a
letter is that some locales discriminate against non-letter characters in
the first pass of sorting.

I did do some experimentation and found that among the ASCII characters
(ie, codes 32-126), nearly all the non-C locales on my Fedora machine
sort Z last and z next-to-last, or vice versa. Most of the remainder sort
digits last and z or Z as the last non-digit character. Since Z is not
that close to the end of the sort order in C locale, however, z seems the
best bet.

			regards, tom lane
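[Editor's note: Tom's survey is easy to reproduce with the standard `locale` module. A hedged sketch follows; the function name is ours, and the output for non-C locales depends entirely on which locales are installed on your system. Shown here with the C locale, where collation is plain byte order and '~' sorts last.]

```python
import locale
from functools import cmp_to_key

def last_ascii_chars(loc, n=5):
    """Sort printable ASCII (codes 32-126) using the given locale's
    collation and return the n characters that sort last."""
    locale.setlocale(locale.LC_COLLATE, loc)
    chars = [chr(c) for c in range(32, 127)]
    chars.sort(key=cmp_to_key(locale.strcoll))
    return "".join(chars[-n:])

# C locale collation is byte order, so the tail is the top byte values:
print(last_ascii_chars("C"))  # -> z{|}~
```

Running the same function with, say, "en_US.UTF-8" (if installed) typically shows z or Z at the very end, matching Tom's observation about letters being favored over punctuation and digits in the first comparison pass.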
I wrote:
> I did do some experimentation and found that among the ASCII characters
> (ie, codes 32-126), nearly all the non-C locales on my Fedora machine
> sort Z last and z next-to-last or vice versa. Most of the remainder
> sort digits last and z or Z as the last non-digit character. Since Z is
> not that close to the end of the sort order in C locale, however, z
> seems the best bet.

With still further experimentation, it seems that doesn't work very
well, because the locales that sort digits last also seem not to
discriminate against digits in their first pass. What did seem to work
was:

* Determine which of the strings "Z", "z", "y", "9" is seen as largest
  by strcoll().
* Append this string to the given input.
* Search (using the CVS-HEAD make_greater_string logic) for a string
  greater than that.

This rule works for all the locales I have installed ... but I don't
have any Far Eastern locales installed. Also, my test cases only cover
ASCII characters, and I believe many locales have some non-ASCII letters
that sort after 'Z'. I'm not sure how hard we need to try to cover those
corner cases, though. It is ultimately only an estimate...

			regards, tom lane
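[Editor's note: the three-step rule above can be sketched in Python. This is a simplified illustration restricted to single-byte ASCII input; the function names are ours, and the real make_greater_string() operates on bytes of arbitrary server encodings with additional truncation logic.]

```python
import locale
from functools import cmp_to_key

def largest_sentinel(loc):
    """Step 1: find which of "Z", "z", "y", "9" the locale's
    collation considers largest."""
    locale.setlocale(locale.LC_COLLATE, loc)
    return max("Zzy9", key=cmp_to_key(locale.strcoll))

def greater_string(prefix, loc="C"):
    """Steps 2-3 (simplified): append the sentinel to the prefix, then
    increment/truncate until a candidate collates greater than the
    padded prefix -- not just greater bytewise."""
    locale.setlocale(locale.LC_COLLATE, loc)
    padded = prefix + largest_sentinel(loc)
    work = list(prefix)
    while work:
        # try successively larger characters in the last position
        for code in range(ord(work[-1]) + 1, 0x7F):
            cand = "".join(work[:-1]) + chr(code)
            if locale.strcoll(cand, padded) > 0:
                return cand
        work.pop()  # exhausted this position; shorten and retry
    return None

print(greater_string("123/", "C"))  # -> 1230
```

The point of the sentinel is that a candidate upper bound must now beat "123/z" (or whatever sorts last in the locale), not merely "123/", before the planner trusts it as a range endpoint.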
On Nov 9, 2007 3:08 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This rule works for all the locales I have installed ... but I don't
> have any Far Eastern locales installed. Also, my test cases are only
> covering ASCII characters, and I believe many locales have some non-ASCII
> letters that sort after 'Z'. I'm not sure how hard we need to try to
> cover those corner cases, though. It is ultimately only an estimate...

My opinion is that it's acceptable to fix the problem for most cases in
most locales because, as you said, it's only an estimate. We haven't had
any reports of this problem for years, so it seems it's not a common
case, or at least it's not common for the bad estimate to lead to
noticeably bad plans.

As far as I understand what you plan to do, it doesn't seem to be
something that would prevent us from fixing the problem later if someone
comes up with an example that doesn't fit the scheme you're proposing
and has a real performance problem with it.

-- 
Guillaume
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > This rule works for all the locales I have installed ... but I don't > have any Far Eastern locales installed. Also, my test cases are only > covering ASCII characters, and I believe many locales have some non-ASCII > letters that sort after 'Z'. I'm not sure how hard we need to try to > cover those corner cases, though. It is ultimately only an estimate... If I understand correctly what we're talking about it's generating estimates for LIKE 'foo%' using the algorithm which makes sense for C locale which means generating the next range of values which start with 'foo%'. It seems to me the problematic situations is when the most-frequent-values come into play. Being off slightly in the histogram isn't going to generate very inaccurate estimates but including or not a most-frequent-value could throw off the estimate severely. Could we not use the bogus range to calculate the histogram estimate but apply the LIKE pattern directly to the most-frequent-values instead of applying the bogus range? Or would that be too much code re-organization for now? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
Gregory Stark <stark@enterprisedb.com> writes:
> Could we not use the bogus range to calculate the histogram estimate
> but apply the LIKE pattern directly to the most-frequent-values
> instead of applying the bogus range? Or would that be too much code
> re-organization for now?

We have already done that for quite some time. It won't help Guillaume's
case anyhow: he's got no MCVs, presumably because the field is unique.

			regards, tom lane
On Nov 9, 2007 5:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> he's got no MCVs, presumably because the field
> is unique.

It is. The ancestors field contains the current folder itself, so the id
of the folder (which is the primary key) is in it.

-- 
Guillaume
Tom,

Just to confirm that your last commit fixed the problem:

lbo=# explain analyze select * from cms_items where ancestors LIKE '1062/%';
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Seq Scan on cms_items  (cost=0.00..688.26 rows=*9097* width=103) (actual time=0.011..22.605 rows=11326 loops=1)
   Filter: ((ancestors)::text ~~ '1062/%'::text)
 Total runtime: 30.022 ms
(3 rows)

Thanks for your time.

-- 
Guillaume