Re: ICU integration

From Thomas Munro
Subject Re: ICU integration
Date
Msg-id CAEepm=30SQpEUjau=dScuNeVZaK2kJ6QQDCHF75u5W=Cz=3Scw@mail.gmail.com
In reply to Re: ICU integration  (Peter Geoghegan <pg@heroku.com>)
List pgsql-hackers
On Sat, Sep 24, 2016 at 10:13 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Fri, Sep 23, 2016 at 7:27 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> It looks like varstr_abbrev_convert calls strxfrm unconditionally
>> (assuming TRUST_STRXFRM is defined).  <captain-obvious>This needs to
>> use ucol_getSortKey instead when appropriate.</>  It looks like it's a
>> bit more helpful than strxfrm about telling you the output buffer size
>> it wants, and it doesn't need nul termination, which is nice.
>> Unfortunately it is like strxfrm in that the output buffer's contents
>> is unspecified if it ran out of space.
>
> One can use the ucol_nextSortKeyPart() interface to just get the first
> 4/8 bytes of an abbreviated key, reducing the overhead somewhat, so
> the output buffer size limitation is probably irrelevant. The ICU
> documentation says something about this being useful for Radix sort,
> but I suspect it's more often used to generate abbreviated keys.
> Abbreviated keys were not my original idea. They're really just a
> standard technique.

Nice! The other advantage of ucol_nextSortKeyPart is that you don't
have to convert the whole string to UChar (UTF16) first, as I think you
would need to with ucol_getSortKey, because the UCharIterator mechanism
can read directly from a UTF8 string.  I see in the documentation that
ucol_nextSortKeyPart and ucol_getSortKey don't have compatible output,
and this caveat may be related to whether sort key compression is used.
I don't understand what sort of compression is involved but out of
curiosity I asked ICU to spit out some sort keys from
ucol_nextSortKeyPart so I could see their size. As you say, we can ask
it to stop at 4 or 8 bytes which is very convenient for our purposes,
but here I asked for more to get the full output so I could see where
the primary weight part ends.  The primary weight took one byte for the
Latin letters I tried and two for the Japanese characters I tried
(except 一 which was just 0xaa).

ucol_nextSortKeyPart(en_US, "a", ...) -> 29 01 05 01 05
ucol_nextSortKeyPart(en_US, "ab", ...) -> 29 2b 01 06 01 06
ucol_nextSortKeyPart(en_US, "abc", ...) -> 29 2b 2d 01 07 01 07
ucol_nextSortKeyPart(en_US, "abcd", ...) -> 29 2b 2d 2f 01 08 01 08
ucol_nextSortKeyPart(en_US, "A", ...) -> 29 01 05 01 dc
ucol_nextSortKeyPart(en_US, "AB", ...) -> 29 2b 01 06 01 dc dc
ucol_nextSortKeyPart(en_US, "ABC", ...) -> 29 2b 2d 01 07 01 dc dc dc
ucol_nextSortKeyPart(en_US, "ABCD", ...) -> 29 2b 2d 2f 01 08 01 dc dc dc dc
ucol_nextSortKeyPart(ja_JP, "一", ...) -> aa 01 05 01 05
ucol_nextSortKeyPart(ja_JP, "一二", ...) -> aa d0 0f 01 06 01 06
ucol_nextSortKeyPart(ja_JP, "一二三", ...) -> aa d0 0f cb b8 01 07 01 07
ucol_nextSortKeyPart(ja_JP, "一二三四", ...) -> aa d0 0f cb b8 cb d5 01 08 01 08
ucol_nextSortKeyPart(ja_JP, "日", ...) -> d0 18 01 05 01 05
ucol_nextSortKeyPart(ja_JP, "日本", ...) -> d0 18 d1 d0 01 06 01 06
ucol_nextSortKeyPart(fr_FR, "cote", ...) -> 2d 45 4f 31 01 08 01 08
ucol_nextSortKeyPart(fr_FR, "côte", ...) -> 2d 45 4f 31 01 44 8e 06 01 09
ucol_nextSortKeyPart(fr_FR, "coté", ...) -> 2d 45 4f 31 01 42 88 01 09
ucol_nextSortKeyPart(fr_FR, "côté", ...) -> 2d 45 4f 31 01 44 8e 44 88 01 0a
ucol_nextSortKeyPart(fr_CA, "cote", ...) -> 2d 45 4f 31 01 08 01 08
ucol_nextSortKeyPart(fr_CA, "côte", ...) -> 2d 45 4f 31 01 44 8e 06 01 09
ucol_nextSortKeyPart(fr_CA, "coté", ...) -> 2d 45 4f 31 01 88 08 01 09
ucol_nextSortKeyPart(fr_CA, "côté", ...) -> 2d 45 4f 31 01 88 44 8e 06 01 0a

I wonder how it manages to deal with fr_CA's reversed secondary
weighting rule which requires you to consider diacritics in reverse
order -- apparently abandoned in France but still used in Canada --
using a fixed size space for state between calls.

-- 
Thomas Munro
http://www.enterprisedb.com
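
As a rough illustration of how the sort key bytes listed in this message could be used, here is a minimal sketch in plain C that does not call ICU at all. It assumes that a 0x01 byte ends each weight level (as the dumps in the message suggest) and that an abbreviated key is simply the first 8 bytes of the sort key packed big-endian into an integer. The function names primary_weight_len and abbrev_key are hypothetical and are not taken from PostgreSQL or ICU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the number of bytes in the primary-weight part of a sort key,
 * i.e. the bytes before the first 0x01 level separator.  Assumption: 0x01
 * never appears inside a weight, only between levels.
 */
static size_t
primary_weight_len(const unsigned char *key, size_t len)
{
    size_t i;

    assert(key != NULL);
    for (i = 0; i < len; i++)
    {
        if (key[i] == 0x01)
            break;
    }
    return i;
}

/*
 * Pack the first 8 bytes of a sort key into a uint64_t, most significant
 * byte first, padding short keys with zero bytes.  Comparing two such
 * integers as unsigned values gives the same order as memcmp() on the
 * zero-padded 8-byte prefixes.
 */
static uint64_t
abbrev_key(const unsigned char *key, size_t len)
{
    uint64_t result = 0;
    size_t   i;

    assert(key != NULL);
    for (i = 0; i < 8; i++)
    {
        result <<= 8;
        if (i < len)
            result |= key[i];
    }
    return result;
}
```

With the en_US keys from the message, abbrev_key orders "a" before "ab" and "a" before "A". When two abbreviated keys are equal, a caller would still need a full comparison to break the tie, since only a prefix of the key has been examined.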
