Re: Supporting SJIS as a database encoding

From Kyotaro HORIGUCHI
Subject Re: Supporting SJIS as a database encoding
Date
Msg-id 20160906.122904.256837704.horiguchi.kyotaro@lab.ntt.co.jp
In response to Re: Supporting SJIS as a database encoding  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: Supporting SJIS as a database encoding  ("Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com>)
List pgsql-hackers
Hello,

At Mon, 5 Sep 2016 19:38:33 +0300, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
<529db688-72fc-1ca2-f898-b0b99e30076f@iki.fi>
> On 09/05/2016 05:47 PM, Tom Lane wrote:
> > "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> writes:
> >> Before digging into the problem, could you share your impression on
> >> whether PostgreSQL can support SJIS?  Would it be hopeless?
> >
> > I think it's pretty much hopeless.
> 
> Agreed.

+1, even as a user of SJIS:)

> But one thing that would help a little, would be to optimize the UTF-8
> -> SJIS conversion. It uses a very generic routine, with a binary
> search over a large array of mappings. I bet you could do better than
> that, maybe using a hash table or a radix tree instead of the large
> binary-searched array.

I'm very impressed by the idea. The mean number of iterations for
a binary search over the current conversion table, which holds about
8000 characters, is about 13, and the table size is (probably)
under 100 kB.
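The iteration count above is roughly log2(8000). A minimal sketch that counts the loop iterations of a plain binary search over a made-up sorted table of 8000 keys follows. The table contents and the helper name `binsearch_steps` are assumptions for illustration only, not PostgreSQL's actual conversion routine or map.

```python
import math


def binsearch_steps(table, key):
    """Binary search over a sorted list.

    Returns (index, iterations); index is -1 when the key is absent.
    """
    lo, hi = 0, len(table) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if table[mid] == key:
            return mid, steps
        if table[mid] < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, steps


N = 8000
# Hypothetical sorted keys standing in for the real UTF-8 code values.
table = list(range(0, 2 * N, 2))

step_counts = [binsearch_steps(table, k)[1] for k in table]
worst = max(step_counts)           # deepest successful lookup
mean = sum(step_counts) / N        # average over all successful lookups

print("log2(N)  =", round(math.log2(N), 2))
print("worst    =", worst)
print("mean hit =", round(mean, 2))
```

Every lookup pays a comparison and a branch per iteration, which is the cost a table indexed directly by byte value avoids.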

A three-level array with 2-byte values would take about 1.6 to 2 MB
of memory.

A radix tree for UTF-8 -> some-encoding conversion requires about,
or at most, the following (using a 1-byte index to point to the next
level):

  1 * ((0x7f + 1)
       + (0xdf - 0xc2 + 1) * (0xbf - 0x80 + 1)
       + (0xef - 0xe0 + 1) * (0xbf - 0x80 + 1)^2)
  = 67584 bytes, about 67 kbytes.
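The estimate above can be checked term by term. The sketch below only redoes the arithmetic from the formula; the variable names are made up for illustration, and the byte ranges are the ones given in the formula (1-byte, 2-byte, and 3-byte UTF-8 sequences).

```python
# Size of one index entry in bytes, as assumed in the estimate.
INDEX_SIZE = 1

# Number of distinct continuation bytes (0x80..0xbf).
continuation = 0xBF - 0x80 + 1

# 1-byte sequences: one slot per value 0x00..0x7f.
one_byte = 0x7F + 1

# 2-byte sequences: lead bytes 0xc2..0xdf, one continuation byte.
two_byte = (0xDF - 0xC2 + 1) * continuation

# 3-byte sequences: lead bytes 0xe0..0xef, two continuation bytes.
three_byte = (0xEF - 0xE0 + 1) * continuation ** 2

total = INDEX_SIZE * (one_byte + two_byte + three_byte)

print(one_byte, two_byte, three_byte)  # 128 1920 65536
print(total)                           # 67584
```

The 3-byte range dominates the total, so the figure is an upper bound for a tree that allocates every possible slot; a tree that only allocates nodes for lead bytes actually in use would be smaller.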

SJIS characters are at most 2 bytes long, so about 8000 characters
take an extra 16 kbytes, plus some padding.

As a result, a radix tree seems promising: it needs little
additional memory and far fewer comparisons.  Big5 and other
encodings, including EUC-*, would also benefit from it.
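To make the idea concrete, here is a minimal sketch of a byte-indexed radix tree for UTF-8 -> SJIS lookup. It is an illustration under stated assumptions, not the proposed PostgreSQL implementation: the helper names (`build_radix`, `convert_char`, `utf8_to_sjis`) are hypothetical, each node is a full 256-entry list rather than a compact per-range array, and the tiny mapping is generated with Python's built-in `shift_jis` codec instead of the real mapping tables.

```python
def build_radix(mapping):
    """Build a tree of 256-entry nodes from {utf8_bytes: target_bytes}."""
    root = [None] * 256
    for src, dst in mapping.items():
        node = root
        for b in src[:-1]:
            if node[b] is None:
                node[b] = [None] * 256   # allocate the next level lazily
            node = node[b]
        node[src[-1]] = dst              # leaf holds the converted bytes
    return root


def convert_char(root, data, pos):
    """Convert one character starting at data[pos].

    Returns (converted_bytes, bytes_consumed). One array index per input
    byte; no comparisons against other table entries are needed.
    """
    node = root
    i = pos
    while i < len(data):
        entry = node[data[i]]
        i += 1
        if entry is None:
            raise ValueError("no mapping for byte sequence at offset %d" % pos)
        if isinstance(entry, bytes):
            return entry, i - pos
        node = entry
    raise ValueError("truncated sequence at offset %d" % pos)


def utf8_to_sjis(root, data):
    out = bytearray()
    pos = 0
    while pos < len(data):
        converted, used = convert_char(root, data, pos)
        out += converted
        pos += used
    return bytes(out)


# Tiny illustrative mapping; a real table would hold thousands of entries.
chars = "Aあ日"
mapping = {c.encode("utf-8"): c.encode("shift_jis") for c in chars}
root = build_radix(mapping)

result = utf8_to_sjis(root, "日あA".encode("utf-8"))
print(result.hex())
```

A lookup for a 3-byte UTF-8 character takes three index operations here, compared with a binary search over the whole table. Full 256-entry nodes waste space compared with nodes sized to the valid byte ranges, so the memory use of this sketch is higher than the estimate in the text.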

This requires implementing the radix tree code, redefining the
format of the mapping table to support a radix tree, and modifying
the mapping generator script.

If no one objects to this, I'll do that.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




