Re: [HACKERS] Radix tree for character conversion

Поиск
Список
Период
Сортировка
От Kyotaro HORIGUCHI
Тема Re: [HACKERS] Radix tree for character conversion
Дата
Msg-id 20170126.214212.111556326.horiguchi.kyotaro@lab.ntt.co.jp
обсуждение исходный текст
Ответ на Re: [HACKERS] Radix tree for character conversion  (Michael Paquier <michael.paquier@gmail.com>)
Ответы Re: [HACKERS] Radix tree for character conversion  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Список pgsql-hackers
Thank you for looking this.

At Thu, 26 Jan 2017 16:28:16 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqREL1fsDBGv4zRvaXY+UKtS0wzkamJcnYhX0--OZvpUUQ@mail.gmail.com>
> On Tue, Jan 10, 2017 at 8:22 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > [...patch...]
> 
> Nobody has showed up yet to review this patch, so I am giving it a shot.
> 
> The patch file sizes are scary at first sight, but after having a look:
>  36 files changed, 1411 insertions(+), 54398 deletions(-)
> Yes that's a surprise, something like git diff --irreversible-delete
> would have helped as most of the diffs are just caused by 3 files
> being deleted in patch 0004, making 50k lines going to the abyss of
> deletion.

Thank you. Good to hear that. I'll try that at the next chance.

> > Hello, I found a bug in my portion while rebasing.
> 
> Right, that's 0001. Nice catch.
> 
> > The attached files are the following. This patchset is not
> > complete missing changes of map files. The change is tremendously
> > large but generatable.
> >
> > 0001-Add-missing-semicolon.patch
> >
> >   UCS_to_EUC_JP.pl has a line missing teminating semicolon. This
> >   doesn't harm but surely a syntax error. This patch fixes it.
> >   This might should be a separate patch.
> 
> This requires a back-patch. This makes me wonder how long this script
> has actually not run...
> 
> > 0002-Correct-reference-resolution-syntax.patch
> >
> >   convutils.pm has lines with different syntax of reference
> >   resolution. This unifies the syntax.
> 
> Yes that looks right to me.

Yes, I thoght that the three patches can be back-patched, a kind
of bug fix.

> I am the best perl guru on this list but
> looking around $$var{foo} is bad, ${$var}{foo} is better, and
> $var->{foo} is even better. This also generates no diffs when running
> make in src/backend/utils/mb/Unicode/. So no objections to that.

Thank you for the explanation. I think no '$$'s is left alone.

> > 0003-Apply-pgperltidy-on-src-backend-utils-mb-Unicode.patch
> >
> >   Before adding radix tree stuff, applied pgperltidy and inserted
> >   format-skipping pragma for the parts where perltidy seems to do
> >   too much.
> 
> Which version of perltidy did you use? Looking at the archives, the
> perl code is cleaned up with a specific version, v20090616. See
> https://www.postgresql.org/message-id/20151204054322.GA2070309@tornado.leadboat.com
> for example on the matter. As perltidy changes over time, this may be
> a sensitive change if done this way.

Hmm. I will make a confirmation on that.. tomorrow.

> > 0004-Use-radix-tree-for-character-conversion.patch
> >
> >   Radix tree body.
> 
> Well, here a lot of diffs could have been saved.
> 
> > The unattached fifth patch is generated by the following steps.
> >
> > [$(TOP)]$ ./configure
> > [Unicode]$ make
> > [Unicode]$ make distclean
> > [Unicode]$ git add .
> > [Unicode]$ commit
> > === COMMITE MESSSAGE
> > Replace map files with radix tree files.
> >
> > These encodings no longer uses the former map files and uses new radix
> > tree files.
> > ===
> 
> OK, I can see that working, with 200k of maps generated.. So going
> through the important bits of this jungle..

Many thaks for the exploration.

> +/*
> + * radix tree conversion function - this should be identical to the function in
> + * ../conv.c with the same name
> + */
> +static inline uint32
> +pg_mb_radix_conv(const pg_mb_radix_tree *rt,
> +                int l,
> +                unsigned char b1,
> +                unsigned char b2,
> +                unsigned char b3,
> +                unsigned char b4)
> This is not nice. Having a duplication like that is a recipe to forget
> about it as this patch introduces a dependency with conv.c and the
> radix tree generation.

Mmmmm. I agree to you, but conv.c contains unwanted reference to
elog or other sutff of the core. Separating the function in a
dedicate source file named such as "../pg_mb_radix_conv.c" will
work. If it is not so bad, I'll do that in the next version.

> Having a .gitignore in Unicode/ would be nice, particularly to avoid
> committing map_checker.
> 
> A README documenting things may be welcome, or at least comments at
> the top of map_checker.c. Why is map_checker essential? What does it
> do? There is no way to understand that easily, except that it includes
> a "radix tree conversion function", and that it performs sanity checks
> on the radix trees to be sure that they are on a good shape. But as
> this something that one would guess only after looking at your patch
> and the code (at least I will sleep less stupid tonight after reading
> this stuff).

Okay, I'll do that.

> --- a/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl
> +++ b/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl
>  # Drop these SJIS codes from the source for UTF8=>SJIS conversion
>  #<<< do not let perltidy touch this
> -my @reject_sjis =(
> +my @reject_sjis = (
>     0xed40..0xeefc, 0x8754..0x875d, 0x878a, 0x8782,
> -   0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797,
> +   0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797,
>     0x879a..0x879c
> -);
> +   );
> This is not generated, it would be nice to drop the noise from the patch.

Mmm. I'm not sure how this is generated but I'll care for that.

> Here is another one:
> -       $i->{code} = $jis | (
> -           $jis < 0x100
> -           ? 0x8e00
> -           : ($sjis >= 0xeffd ? 0x8f8080 : 0x8080));
> -
> +#<<< do not let perltidy touch this
> +       $i->{code} = $jis | ($jis < 0x100 ? 0x8e00:
> +                            ($sjis >= 0xeffd ? 0x8f8080 : 0x8080));
> +#>>>

Ok. Will revert this.

>         if (l == 2)
>         {
> -           iutf = *utf++ << 8;
> -           iutf |= *utf++;
> +           b3 = *utf++;
> +           b4 = *utf++;
>         }
> Ah, OK. This conversion is important so as it performs a minimum of
> bitwise operations. Yes let's keep that. That's pretty cool to get a
> faster operation.

It is Heikki's work:p

I'll address them and repost the next version sooner.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





В списке pgsql-hackers по дате отправления:

Предыдущее
От: Simon Riggs
Дата:
Сообщение: Re: [HACKERS] Replication slot xmin is not reset if HS feedback isturned off while standby is shut down
Следующее
От: Haribabu Kommi
Дата:
Сообщение: Re: [HACKERS] pg_hba_file_settings view patch