Re: Supporting SJIS as a database encoding
From | Kyotaro HORIGUCHI |
---|---|
Subject | Re: Supporting SJIS as a database encoding |
Date | |
Msg-id | 20160908.153546.187438961.horiguchi.kyotaro@lab.ntt.co.jp |
In reply to | Re: Supporting SJIS as a database encoding (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>) |
Responses | Re: Supporting SJIS as a database encoding ("Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com>); Re: Supporting SJIS as a database encoding (Heikki Linnakangas <hlinnaka@iki.fi>) |
List | pgsql-hackers |
Hello,

At Wed, 07 Sep 2016 16:13:04 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160907.161304.112519789.horiguchi.kyotaro@lab.ntt.co.jp>
> > Implementing radix tree code, then redefining the format of mapping table
> > to support radix tree, then modifying mapping generator script are needed.
> >
> > If no one opposes this, I'll do that.

So, I did that as a PoC. The radix tree takes a little less than 100k bytes (far smaller than expected :) and it is definitely faster than binsearch.

The attached patch does the following things.

- Defines a struct for a static radix tree (utf_radix_tree). Currently it supports up to 3-byte encodings.

- Adds a map generator script UCS_to_SJIS_radix.pl, which generates utf8_to_sjis_radix.map from utf8_to_sjis.map.

- Adds a new conversion function utf8_to_sjis_radix.

- Modifies UtfToLocal so as to allow map to be NULL.

- Modifies utf8_to_sjis to use the new conversion function instead of ULmapSJIS.

The following remain to be done.

- utf8_to_sjis_radix could be more generic.

- SJIS->UTF8 is not implemented, but it would be easy to do since the radix tree mechanism is used in the same way. (The output character is currently assumed to be 2 bytes long, though.)

- It doesn't support 4-byte codes, so it is not applicable to sjis_2004. Extending the radix tree to support 4-byte codes wouldn't be hard.

The following is the result of a simple test.

=# create table t (a text); alter table t alter column a storage plain;
=# insert into t values ('... 7130 characters containing (I believe) all characters in SJIS encoding');
=# insert into t values ('... 7130 characters containing (I believe) all characters in SJIS encoding');

# Doing that twice is just my mistake.

$ export PGCLIENTENCODING=SJIS

$ time psql postgres -c '
$ psql -c '\encoding' postgres
SJIS

<Using radix tree>
$ time psql postgres -c 'select t.a from t, generate_series(0, 9999)' > /dev/null

real    0m22.696s
user    0m16.991s
sys     0m0.182s

Using binsearch, the result for the same operation was

real    0m35.296s
user    0m17.166s
sys     0m0.216s

Returning the result in UTF-8 makes the result string about 1.5 times larger, so comparing against it doesn't seem meaningful. For reference, though, it takes real = 47.35s.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
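
For readers who want to see the idea behind the comparison above, here is a minimal sketch of a byte-indexed radix tree lookup next to a binary search over a sorted map. It is not the code from the patch: the names demo_pair, demo_radix_tree, demo_build, radix_lookup and bsearch_lookup, the fixed three-level layout, and the table sizes are all assumptions made for illustration. The tree is built at run time here, whereas the patch generates a static table with a script. Only a few well-known UTF-8 to SJIS pairs are included.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_NODES 64    /* hypothetical limit, plenty for this demo */

typedef struct { uint32_t utf8; uint16_t sjis; } demo_pair;

/* A few UTF-8 -> SJIS pairs, sorted by UTF-8 value so bsearch() works. */
static const demo_pair demo_map[] = {
    {0x000041, 0x0041},     /* U+0041 LATIN CAPITAL LETTER A */
    {0x00C2A7, 0x8198},     /* U+00A7 SECTION SIGN */
    {0xE38182, 0x82A0},     /* U+3042 HIRAGANA LETTER A */
    {0xE382A2, 0x8341},     /* U+30A2 KATAKANA LETTER A */
    {0xE697A5, 0x93FA},     /* U+65E5 */
    {0xEFBDB1, 0x00B1},     /* U+FF71 HALFWIDTH KATAKANA LETTER A */
};
#define DEMO_MAP_LEN (sizeof(demo_map) / sizeof(demo_map[0]))

/* Node 0 is the root. A zero entry means "no mapping". */
typedef struct {
    uint16_t next[MAX_NODES][256];
    int      nnodes;
} demo_radix_tree;

static demo_radix_tree demo_tree;

/* Keys are left-padded to 3 bytes, so every lookup walks exactly 3 levels. */
static int radix_insert(demo_radix_tree *t, uint32_t utf8, uint16_t code)
{
    uint8_t b[3] = { (uint8_t)(utf8 >> 16), (uint8_t)(utf8 >> 8), (uint8_t)utf8 };
    int node = 0;

    for (int i = 0; i < 2; i++) {
        if (t->next[node][b[i]] == 0) {
            if (t->nnodes >= MAX_NODES)
                return -1;                      /* out of demo nodes */
            t->next[node][b[i]] = (uint16_t) t->nnodes++;
        }
        node = t->next[node][b[i]];
    }
    t->next[node][b[2]] = code;                 /* last level stores the code */
    return 0;
}

/* Three array indexings, no comparisons against other keys. */
static uint16_t radix_lookup(const demo_radix_tree *t, uint32_t utf8)
{
    uint16_t node = t->next[0][(utf8 >> 16) & 0xff];
    if (node == 0)
        return 0;
    node = t->next[node][(utf8 >> 8) & 0xff];
    if (node == 0)
        return 0;
    return t->next[node][utf8 & 0xff];
}

static int cmp_pair(const void *key, const void *elem)
{
    uint32_t k = *(const uint32_t *) key;
    uint32_t e = ((const demo_pair *) elem)->utf8;
    return (k > e) - (k < e);
}

/* Baseline: O(log n) comparisons over the sorted map. */
static uint16_t bsearch_lookup(uint32_t utf8)
{
    const demo_pair *p = bsearch(&utf8, demo_map, DEMO_MAP_LEN,
                                 sizeof(demo_pair), cmp_pair);
    return p ? p->sjis : 0;
}

/* Build the demo tree from demo_map and return it. */
static demo_radix_tree *demo_build(void)
{
    memset(&demo_tree, 0, sizeof(demo_tree));
    demo_tree.nnodes = 1;                       /* root already exists */
    for (size_t i = 0; i < DEMO_MAP_LEN; i++) {
        int rc = radix_insert(&demo_tree, demo_map[i].utf8, demo_map[i].sjis);
        assert(rc == 0);
        (void) rc;
    }
    return &demo_tree;
}
```

After calling demo_build(), radix_lookup(tree, 0xE38182) and bsearch_lookup(0xE38182) both return 0x82A0; the difference is that the tree lookup costs a fixed three table reads per character, while the binary search cost grows with the size of the map. A real table would also need to share or compress sparse nodes to stay small, which this sketch does not attempt.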