Re: [v9.2] make_greater_string() does not return a string in some cases
From | Robert Haas
---|---
Subject | Re: [v9.2] make_greater_string() does not return a string in some cases
Date | |
Msg-id | CA+TgmoaDz_Ri7NcLebzbb_EzTxEGy_mwwsenLPrh9JaTVSL1kA@mail.gmail.com
In reply to | Re: [v9.2] make_greater_string() does not return a string in some cases (Tom Lane <tgl@sss.pgh.pa.us>)
Responses | Re: [v9.2] make_greater_string() does not return a string in some cases
List | pgsql-hackers
On Thu, Sep 22, 2011 at 11:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Well, the metric that we were indirectly using earlier was the
> number of characters in a given locale for which the algorithm
> fails to find a greater one (excluding whichever character is "last",
> I guess, or you could just recognize there's always at least one).

What about characters that sort differently in sequence than individually?

>> For example, if we were to try all possible
>> three character strings in some encoding and run make_greater_string()
>> on each one of them, we could then measure the failure percentage.  Or
>> if that's too many cases to crank through then we could limit it some
>> way -
>
> Even in UTF8 there's only a couple million assigned code points, so for
> test purposes anyway it doesn't seem like we couldn't crank through them
> all.  Also, in many cases you could probably figure it out by analysis
> instead of brute-force testing every case.
>
> A more reasonable objection might be that a whole lot of those code
> points are things nobody cares about, and so we need to weight the
> results somehow by the actual popularity of the character.  Not sure
> how to take that into account.

I guess whether that's a problem in practice will depend somewhat on the quality of the algorithms we're able to find.  If our best algorithms still have a 1% failure rate, then yeah, that's an issue, but in that case I'd suggest that our best algorithms suck and we need to think harder about alternate solutions.  If we're talking about failing on 5 characters out of a million, we can just eyeball them.  I'm not trying to reduce this testing to something that is entirely mechanical; I'm just saying that I'm not optimistic about my ability to judge which algorithms will work best in practice without some kind of automated aid.

> Another issue here is that we need to consider not just whether we find
> a greater character, but "how much greater" it is.
> This would apply to
> my suggestion of incrementing the top byte without considering
> lower-order bytes --- we'd be skipping quite a lot of code space for
> each increment, and it's conceivable that that would be quite hurtful in
> some cases.  Not sure how to account for that either.  An extreme
> example here is an "incrementer" that just immediately returns the last
> character in the sort order for any lesser input.

Right... well, this is why I'm not wild about doing this by incrementing in the first place.  But now that I think about it, what about using some slightly-less-stupid version of that approach as a fallback strategy?  For example, we could pick, oh, say, 20 characters out of the space of code points, about evenly distributed under whatever collations we think are likely to be in use.  In the incrementer, we try some kind of increment-the-bytes strategy for a while and if it doesn't pan out, we zip through the array and try substituting each of the fallback characters.  If more than one works, we test the survivors against each other until we're left with just one winner.  The bound might not be real tight, but as long as it's good enough to make the planner pick an index scan it might not matter very much.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company