Re: ICU integration
From | Thomas Munro |
---|---|
Subject | Re: ICU integration |
Date | |
Msg-id | CAEepm=30SQpEUjau=dScuNeVZaK2kJ6QQDCHF75u5W=Cz=3Scw@mail.gmail.com |
In response to | Re: ICU integration (Peter Geoghegan <pg@heroku.com>) |
List | pgsql-hackers |
On Sat, Sep 24, 2016 at 10:13 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Fri, Sep 23, 2016 at 7:27 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> It looks like varstr_abbrev_convert calls strxfrm unconditionally
>> (assuming TRUST_STRXFRM is defined). <captain-obvious>This needs to
>> use ucol_getSortKey instead when appropriate.</> It looks like it's a
>> bit more helpful than strxfrm about telling you the output buffer size
>> it wants, and it doesn't need nul termination, which is nice.
>> Unfortunately it is like strxfrm in that the output buffer's contents
>> is unspecified if it ran out of space.
>
> One can use the ucol_nextSortKeyPart() interface to just get the first
> 4/8 bytes of an abbreviated key, reducing the overhead somewhat, so
> the output buffer size limitation is probably irrelevant. The ICU
> documentation says something about this being useful for Radix sort,
> but I suspect it's more often used to generate abbreviated keys.
> Abbreviated keys were not my original idea. They're really just a
> standard technique.

Nice! The other advantage of ucol_nextSortKeyPart is that you don't have to convert the whole string to UChar (UTF16) first, as I think you would need to with ucol_getSortKey, because the UCharIterator mechanism can read directly from a UTF8 string. I see in the documentation that ucol_nextSortKeyPart and ucol_getSortKey don't have compatible output, and this caveat may be related to whether sort key compression is used. I don't understand what sort of compression is involved but out of curiosity I asked ICU to spit out some sort keys from ucol_nextSortKeyPart so I could see their size. As you say, we can ask it to stop at 4 or 8 bytes which is very convenient for our purposes, but here I asked for more to get the full output so I could see where the primary weight part ends. The primary weight took one byte for the Latin letters I tried and two for the Japanese characters I tried (except 一 which was just 0xaa).

ucol_nextSortKeyPart(en_US, "a", ...) -> 29 01 05 01 05
ucol_nextSortKeyPart(en_US, "ab", ...) -> 29 2b 01 06 01 06
ucol_nextSortKeyPart(en_US, "abc", ...) -> 29 2b 2d 01 07 01 07
ucol_nextSortKeyPart(en_US, "abcd", ...) -> 29 2b 2d 2f 01 08 01 08
ucol_nextSortKeyPart(en_US, "A", ...) -> 29 01 05 01 dc
ucol_nextSortKeyPart(en_US, "AB", ...) -> 29 2b 01 06 01 dc dc
ucol_nextSortKeyPart(en_US, "ABC", ...) -> 29 2b 2d 01 07 01 dc dc dc
ucol_nextSortKeyPart(en_US, "ABCD", ...) -> 29 2b 2d 2f 01 08 01 dc dc dc dc
ucol_nextSortKeyPart(ja_JP, "一", ...) -> aa 01 05 01 05
ucol_nextSortKeyPart(ja_JP, "一二", ...) -> aa d0 0f 01 06 01 06
ucol_nextSortKeyPart(ja_JP, "一二三", ...) -> aa d0 0f cb b8 01 07 01 07
ucol_nextSortKeyPart(ja_JP, "一二三四", ...) -> aa d0 0f cb b8 cb d5 01 08 01 08
ucol_nextSortKeyPart(ja_JP, "日", ...) -> d0 18 01 05 01 05
ucol_nextSortKeyPart(ja_JP, "日本", ...) -> d0 18 d1 d0 01 06 01 06
ucol_nextSortKeyPart(fr_FR, "cote", ...) -> 2d 45 4f 31 01 08 01 08
ucol_nextSortKeyPart(fr_FR, "côte", ...) -> 2d 45 4f 31 01 44 8e 06 01 09
ucol_nextSortKeyPart(fr_FR, "coté", ...) -> 2d 45 4f 31 01 42 88 01 09
ucol_nextSortKeyPart(fr_FR, "côté", ...) -> 2d 45 4f 31 01 44 8e 44 88 01 0a
ucol_nextSortKeyPart(fr_CA, "cote", ...) -> 2d 45 4f 31 01 08 01 08
ucol_nextSortKeyPart(fr_CA, "côte", ...) -> 2d 45 4f 31 01 44 8e 06 01 09
ucol_nextSortKeyPart(fr_CA, "coté", ...) -> 2d 45 4f 31 01 88 08 01 09
ucol_nextSortKeyPart(fr_CA, "côté", ...) -> 2d 45 4f 31 01 88 44 8e 06 01 0a

I wonder how it manages to deal with fr_CA's reversed secondary weighting rule which requires you to consider diacritics in reverse order -- apparently abandoned in France but still used in Canada -- using a fixed size space for state between calls.

-- 
Thomas Munro
http://www.enterprisedb.com
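
To make the "first 4/8 bytes" idea above concrete, here is a minimal sketch in plain C of what one might do with sort key bytes once they are in hand. It does not call ICU and it is not PostgreSQL's actual code: `abbrev_key` and `primary_len` are hypothetical helper names, and the sketch assumes (as the dumps above suggest) that a 0x01 byte separates the levels of a sort key and that 0x00 never appears inside one, so zero padding is safe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: pack the first 8 bytes of a sort key into a
 * big-endian uint64_t, zero-padding short keys.  Comparing two such
 * integers as unsigned values then agrees with memcmp() on those
 * leading bytes (assuming no 0x00 byte occurs inside a key).
 */
static uint64_t
abbrev_key(const unsigned char *key, size_t len)
{
    uint64_t v = 0;

    assert(key != NULL);
    for (size_t i = 0; i < 8; i++)
        v = (v << 8) | (uint64_t) (i < len ? key[i] : 0);
    return v;
}

/*
 * Hypothetical helper: number of bytes before the first 0x01, which
 * this sketch assumes marks the end of the primary weight part.
 * Returns len if no separator is found.
 */
static size_t
primary_len(const unsigned char *key, size_t len)
{
    const unsigned char *sep;

    assert(key != NULL);
    sep = memchr(key, 0x01, len);
    return sep ? (size_t) (sep - key) : len;
}
```

Fed the en_US bytes listed above, the 8-byte abbreviated keys for "abcd" and "ABCD" differ only in the last byte (0x08 versus 0xdc), so they still order the two strings; truncated to 4 bytes they would tie on 29 2b 2d 2f and a full comparison would be needed as a tie-breaker.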