Re: Supporting SJIS as a database encoding

From Kyotaro HORIGUCHI
Subject Re: Supporting SJIS as a database encoding
Date
Msg-id 20160908.153546.187438961.horiguchi.kyotaro@lab.ntt.co.jp
In reply to Re: Supporting SJIS as a database encoding  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Replies Re: Supporting SJIS as a database encoding  ("Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com>)
Re: Supporting SJIS as a database encoding  (Heikki Linnakangas <hlinnaka@iki.fi>)
List pgsql-hackers
Hello,

At Wed, 07 Sep 2016 16:13:04 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20160907.161304.112519789.horiguchi.kyotaro@lab.ntt.co.jp>
> Implementing radix tree code, then redefining the format of mapping table
> to support radix tree, then modifying mapping generator script are needed.
>
> If no one opposes this, I'll do that.

So, I did that as a PoC. The radix tree takes a little less than
100k bytes (far smaller than expected :) and it is definitely
faster than binsearch.


The attached patch does the following things.

- Defines a struct for a static radix tree (utf_radix_tree). Currently it supports up to 3-byte encodings.

- Adds a map generator script UCS_to_SJIS_radix.pl, which generates utf8_to_sjis_radix.map from utf8_to_sjis.map.

- Adds a new conversion function utf8_to_sjis_radix.

- Modifies UtfToLocal so as to allow map to be NULL.

- Modifies utf8_to_sjis to use the new conversion function instead of ULmapSJIS.
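
To illustrate the idea behind the list above, here is a minimal sketch of a static radix-tree lookup for UTF-8 sequences of up to 3 bytes. It is not the layout used by utf_radix_tree in the patch: the names radix_root, radix_nodes and utf8_to_sjis_lookup, the 64-slot node shape and the NO_MAP convention are all assumptions made for this example, and the table holds only four hand-written mappings rather than generated data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_MAP 0    /* 0 means "no child" or "no mapping" in every table */

/* Hypothetical root table: indexed by the UTF-8 lead byte, holds a node index. */
static const uint8_t radix_root[256] = {
    [0xC2] = 1,     /* C2 xx    -> node 1 (leaf level) */
    [0xE3] = 2,     /* E3 xx xx -> node 2 (inner level) */
    [0xEF] = 4,     /* EF xx xx -> node 4 (inner level) */
};

/*
 * Hypothetical node table: each node has 64 slots, indexed by the low six
 * bits of a continuation byte (0x80..0xBF). A slot holds a node index on an
 * inner level and the SJIS code on the last level. Node 0 is unused.
 */
static const uint16_t radix_nodes[6][64] = {
    [1] = { [0x27] = 0x8198 },                      /* C2 A7    (U+00A7) */
    [2] = { [0x01] = 3 },                           /* E3 81    -> node 3 */
    [3] = { [0x02] = 0x82A0, [0x04] = 0x82A2 },     /* E3 81 82, E3 81 84 */
    [4] = { [0x3D] = 5 },                           /* EF BD    -> node 5 */
    [5] = { [0x31] = 0x00B1 },                      /* EF BD B1 (U+FF71) */
};

/* Returns the SJIS code for one UTF-8 character, or NO_MAP if there is none. */
static uint16_t
utf8_to_sjis_lookup(const unsigned char *utf, size_t len)
{
    size_t      expected;
    size_t      i;
    uint16_t    cur;

    assert(utf != NULL);

    if (len < 2)
        return NO_MAP;          /* ASCII is assumed to be handled elsewhere */

    /* The lead byte fixes the tree depth; 4-byte sequences are not covered. */
    if ((utf[0] & 0xE0) == 0xC0)
        expected = 2;
    else if ((utf[0] & 0xF0) == 0xE0)
        expected = 3;
    else
        return NO_MAP;
    if (len != expected)
        return NO_MAP;

    cur = radix_root[utf[0]];
    for (i = 1; i < len; i++)
    {
        if (cur == NO_MAP || (utf[i] & 0xC0) != 0x80)
            return NO_MAP;      /* dead end or malformed continuation byte */
        cur = radix_nodes[cur][utf[i] & 0x3F];
    }
    return cur;
}
```

Each lookup costs one array access per input byte, independent of the number of mapped characters, which is where the advantage over a binary search of the whole map comes from.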


The following remain to be done.

- utf8_to_sjis_radix could be more generic.

- SJIS->UTF8 is not implemented, but it would be easy to add since the radix tree mechanism works the same way (though the output character is currently assumed to be 2 bytes long).

- It doesn't support 4-byte codes, so this is not applicable to sjis_2004. Extending the radix tree to support 4-byte codes wouldn't be hard.


The following is the result of a simple test.

=# create table t (a text); alter table t alter column a storage plain;
=# insert into t values ('... 7130 characters containing (I believe) all characters in SJIS encoding');
=# insert into t values ('... 7130 characters containing (I believe) all characters in SJIS encoding');

# Doing that twice is just my mistake.

$ export PGCLIENTENCODING=SJIS

$ time psql postgres -c '
$ psql -c '\encoding' postgres
SJIS

<Using radix tree>
$ time psql postgres -c 'select t.a from t, generate_series(0, 9999)' > /dev/null

real    0m22.696s
user    0m16.991s
sys    0m0.182s

Using binsearch, the result for the same operation was:
real    0m35.296s
user    0m17.166s
sys    0m0.216s

Returning the result in UTF-8 bloats the result string by about 1.5
times, so comparing against it doesn't seem meaningful. For
reference, though, it takes real = 47.35s.
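
For comparison, the binsearch approach measured above amounts to a binary search over a sorted array of (UTF-8, SJIS) pairs. The sketch below is a simplified stand-in, not the code in the tree: utf_sjis_pair, demo_map, map_is_sorted and binsearch_lookup are names made up for this example, the UTF-8 bytes are packed big-endian into one integer as an assumption, and the table holds only four entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical pair type: UTF-8 bytes packed big-endian, and the SJIS code. */
typedef struct
{
    uint32_t    utf;
    uint32_t    code;
} utf_sjis_pair;

/* Tiny demo table; bsearch() requires it to be sorted by the utf field. */
static const utf_sjis_pair demo_map[] = {
    {0x00C2A7, 0x8198},     /* U+00A7 */
    {0xE38182, 0x82A0},     /* U+3042 */
    {0xE38184, 0x82A2},     /* U+3044 */
    {0xEFBDB1, 0x00B1},     /* U+FF71 */
};

#define DEMO_MAP_LEN (sizeof(demo_map) / sizeof(demo_map[0]))

/* Debug helper: checks the ordering that bsearch() relies on. */
static int
map_is_sorted(void)
{
    size_t      i;

    for (i = 1; i < DEMO_MAP_LEN; i++)
        if (demo_map[i - 1].utf >= demo_map[i].utf)
            return 0;
    return 1;
}

/* Comparator for bsearch(): key is a uint32_t, elem is a utf_sjis_pair. */
static int
compare_pair(const void *key, const void *elem)
{
    uint32_t    k = *(const uint32_t *) key;
    uint32_t    e = ((const utf_sjis_pair *) elem)->utf;

    return (k > e) - (k < e);
}

/* Returns the SJIS code for a packed UTF-8 value, or 0 if it is not mapped. */
static uint32_t
binsearch_lookup(uint32_t utf)
{
    const utf_sjis_pair *hit;

    assert(map_is_sorted());    /* debug-only; compiled out with NDEBUG */

    hit = bsearch(&utf, demo_map, DEMO_MAP_LEN,
                  sizeof(demo_map[0]), compare_pair);
    return hit ? hit->code : 0;
}
```

Here every lookup needs about log2(N) comparisons for a map of N entries, whereas a radix tree needs one table access per input byte.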

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
