Re: [v9.2] make_greater_string() does not return a string in some cases
From | Robert Haas
---|---
Subject | Re: [v9.2] make_greater_string() does not return a string in some cases
Date | |
Msg-id | CA+TgmoaDz_Ri7NcLebzbb_EzTxEGy_mwwsenLPrh9JaTVSL1kA@mail.gmail.com
In reply to | Re: [v9.2] make_greater_string() does not return a string in some cases (Tom Lane <tgl@sss.pgh.pa.us>)
Responses | Re: [v9.2] make_greater_string() does not return a string in some cases
List | pgsql-hackers
On Thu, Sep 22, 2011 at 11:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Well, the metric that we were indirectly using earlier was the
> number of characters in a given locale for which the algorithm
> fails to find a greater one (excluding whichever character is "last",
> I guess, or you could just recognize there's always at least one).

What about characters that sort differently in sequence than individually?

>> For example, if we were to try all possible
>> three character strings in some encoding and run make_greater_string()
>> on each one of them, we could then measure the failure percentage.  Or
>> if that's too many cases to crank through then we could limit it some
>> way -
>
> Even in UTF8 there's only a couple million assigned code points, so for
> test purposes anyway it doesn't seem like we couldn't crank through them
> all.  Also, in many cases you could probably figure it out by analysis
> instead of brute-force testing every case.
>
> A more reasonable objection might be that a whole lot of those code
> points are things nobody cares about, and so we need to weight the
> results somehow by the actual popularity of the character.  Not sure
> how to take that into account.

I guess whether that's a problem in practice will depend somewhat on the quality of the algorithms we're able to find.  If our best algorithms still have a 1% failure rate, then yeah, that's an issue, but in that case I'd suggest that our best algorithms suck and we need to think harder about alternate solutions.  If we're talking about failing on 5 characters out of a million, we can just eyeball them.  I'm not trying to reduce this testing to something that is entirely mechanical; I'm just saying that I'm not optimistic about my ability to judge which algorithms will work best in practice without some kind of automated aid.

> Another issue here is that we need to consider not just whether we find
> a greater character, but "how much greater" it is.
> This would apply to
> my suggestion of incrementing the top byte without considering
> lower-order bytes --- we'd be skipping quite a lot of code space for
> each increment, and it's conceivable that that would be quite hurtful in
> some cases.  Not sure how to account for that either.  An extreme
> example here is an "incrementer" that just immediately returns the last
> character in the sort order for any lesser input.

Right... well, this is why I'm not wild about doing this by incrementing in the first place.  But now that I think about it, what about using some slightly-less-stupid version of that approach as a fallback strategy?  For example, we could pick, oh, say, 20 characters out of the space of code points, about evenly distributed under whatever collations we think are likely to be in use.  In the incrementer, we try some kind of increment-the-bytes strategy for a while and if it doesn't pan out, we zip through the array and try substituting each of the fallback characters.  If more than one works, we test the survivors against each other until we're left with just one winner.  The bound might not be real tight, but as long as it's good enough to make the planner pick an index scan it might not matter very much.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company