Re: [HACKERS] Radix tree for character conversion

Поиск
Список
Период
Сортировка
От Kyotaro HORIGUCHI
Тема Re: [HACKERS] Radix tree for character conversion
Дата
Msg-id 20170127.173357.221584433.horiguchi.kyotaro@lab.ntt.co.jp
обсуждение исходный текст
Ответ на Re: [HACKERS] Radix tree for character conversion  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Ответы Re: [HACKERS] Radix tree for character conversion  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Список pgsql-hackers
Hi, this is an intermediate report without a patch.

At Thu, 26 Jan 2017 21:42:12 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20170126.214212.111556326.horiguchi.kyotaro@lab.ntt.co.jp>
> > > 0003-Apply-pgperltidy-on-src-backend-utils-mb-Unicode.patch
> > >
> > >   Before adding radix tree stuff, applied pgperltidy and inserted
> > >   format-skipping pragma for the parts where perltidy seems to do
> > >   too much.
> > 
> > Which version of perltidy did you use? Looking at the archives, the
> > perl code is cleaned up with a specific version, v20090616. See
> > https://www.postgresql.org/message-id/20151204054322.GA2070309@tornado.leadboat.com
> > for example on the matter. As perltidy changes over time, this may be
> > a sensitive change if done this way.
> 
> Hmm. I will make a confirmation on that.. tomorrow.

My perltidy -v said "v20121207'. Anyway, I gave up to apply
perltidy by myself. So I'll just drop 0003 and new 0004 (name
changed to 0003) is made immediately on 0002.

> > > 0004-Use-radix-tree-for-character-conversion.patch
> > >
> > >   Radix tree body.
> > 
> > Well, here a lot of diffs could have been saved.
> > 
> > > The unattached fifth patch is generated by the following steps.
> > >
> > > [$(TOP)]$ ./configure
> > > [Unicode]$ make
> > > [Unicode]$ make distclean
> > > [Unicode]$ git add .
> > > [Unicode]$ commit
> > > === COMMITE MESSSAGE
> > > Replace map files with radix tree files.
> > >
> > > These encodings no longer uses the former map files and uses new radix
> > > tree files.
> > > ===
> > 
> > OK, I can see that working, with 200k of maps generated.. So going
> > through the important bits of this jungle..
> 
> Many thaks for the exploration.
> 
> > +/*
> > + * radix tree conversion function - this should be identical to the function in
> > + * ../conv.c with the same name
..
> > This is not nice. Having a duplication like that is a recipe to forget
> > about it as this patch introduces a dependency with conv.c and the
> > radix tree generation.

In the attatched patch, mb/char_conveter.c which contains one
inline function is created and it is includ'ed from mb/conv.c and
mb/Unicode/map_checker.c.

> > Having a .gitignore in Unicode/ would be nice, particularly to avoid
> > committing map_checker.

I missed this.  I added .gitignore to ignore map_checker stuff
and authority files and old-style map files.

> > A README documenting things may be welcome, or at least comments at
> > the top of map_checker.c. Why is map_checker essential? What does it
> > do? There is no way to understand that easily, except that it includes
> > a "radix tree conversion function", and that it performs sanity checks
> > on the radix trees to be sure that they are on a good shape. But as
> > this something that one would guess only after looking at your patch
> > and the code (at least I will sleep less stupid tonight after reading
> > this stuff).
> 
> Okay, I'll do that.

The patch has not been provided yet, I'm going to put the
following comment just before the main() in map_checker.c.

/** The old-style plain map files were error-resistant due to its* straight-forward way for generation from authority
files.In contrast the* radix tree maps are generated by a rather complex calculation and have a* complex,
hard-to-confirmformat.** This program runs sanity check of the radix tree maps by confirming all* characters in the
plainmap files to be converted to the same code by the* corresponding radix tree map.** All map files are included by
map_checker.hthat is generated by the script* make_mapchecker.pl as the variable mappairs.**/
 


I'll do the following things later.

> > --- a/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl
> > +++ b/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl
> >  # Drop these SJIS codes from the source for UTF8=>SJIS conversion
> >  #<<< do not let perltidy touch this
> > -my @reject_sjis =(
> > +my @reject_sjis = (
> >     0xed40..0xeefc, 0x8754..0x875d, 0x878a, 0x8782,
> > -   0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797,
> > +   0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797,
> >     0x879a..0x879c
> > -);
> > +   );
> > This is not generated, it would be nice to drop the noise from the patch.
> 
> Mmm. I'm not sure how this is generated but I'll care for that.
> 
> > Here is another one:
> > -       $i->{code} = $jis | (
> > -           $jis < 0x100
> > -           ? 0x8e00
> > -           : ($sjis >= 0xeffd ? 0x8f8080 : 0x8080));
> > -
> > +#<<< do not let perltidy touch this
> > +       $i->{code} = $jis | ($jis < 0x100 ? 0x8e00:
> > +                            ($sjis >= 0xeffd ? 0x8f8080 : 0x8080));
> > +#>>>
> 
> Ok. Will revert this.
> 
> >         if (l == 2)
> >         {
> > -           iutf = *utf++ << 8;
> > -           iutf |= *utf++;
> > +           b3 = *utf++;
> > +           b4 = *utf++;
> >         }
> > Ah, OK. This conversion is important so as it performs a minimum of
> > bitwise operations. Yes let's keep that. That's pretty cool to get a
> > faster operation.
> 
> It is Heikki's work:p
> 
> I'll address them and repost the next version sooner.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





В списке pgsql-hackers по дате отправления:

Предыдущее
От: Venkata B Nagothi
Дата:
Сообщение: Re: [HACKERS] patch proposal
Следующее
От: Andrew Dunstan
Дата:
Сообщение: Re: [HACKERS] Failure in commit_ts tap tests