Re: WIP: index support for regexp search

Поиск

Список

Период

Сортировка

От	Alexander Korotkov
Тема	Re: WIP: index support for regexp search
Дата	23 января 2013 г. 10:36:59
Msg-id	CAPpHfdtB1mvdb8QTv4RZKWWuWEHHeNyntkXQGVWasJ7YqwwfRw@mail.gmail.com обсуждение исходный текст
Ответ на	Re: WIP: index support for regexp search (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы	Re: WIP: index support for regexp search Re: WIP: index support for regexp search
Список	pgsql-hackers

Дерево обсуждения

Hi!

Some quick answers to the part of notes/issues. I will provide rest of answers soon.

On Wed, Jan 23, 2013 at 6:08 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

The biggest problem is that I really don't care for the idea of
contrib/pg_trgm being this cozy with the innards of regex_t. Sooner
or later we are going to want to split the regex code off again as a
standalone library, and we *must* have a cleaner division of labor if
that is ever to happen. Not sure what a suitable API would look like
though.

The only option I see now is to provide a method like "export_cnfa" which would export corresponding CNFA in fixed format.

I think the assumption that all MB characters fit in 4 bytes is
unacceptable; someday we'll want to support wider Unicode characters
than we do now, and this code seems utterly unable to handle it. It's
especially bad that the code isn't even bothering to defend itself
against the possibility of wider characters.

In attached patch I introduce MAX_MULTIBYTE_CHARACTER_LENGTH macro and use it in type definition. Is this way ok?

Can't just modify pg_trgm--1.0.sql in place, must create a "1.1" version
and an upgrade script.

Fixed.

Comments and documentation still need a lot of copy-editing, also I
think a lot of the comment blocks will not survive pg_indent. It'd be
a good idea to run trgm_regexp.c through pg_indent as soon as you have
that fixed.

Fixed.

New file trgm_regexp.c lacks a copyright notice

Fixed.

Calling RE_compile_and_cache with DEFAULT_COLLATION_OID is not good
enough; need to pass through the actual collation for the regex
operator.

We have collation passed to gin_extract_query in ginscan.c. I noticed that gincost_pattern don't pass collation to gin_extract_query. Is it a bug? Anyway this is not collation we need. We need collation used in operator clause. In attached patch I introduce additional argument to gin_extract_query which represent collation of operator clause. Do you think it is reasonable change in GIN interface? If so, I will provide it as separate patch.

Not too happy with convertPgWchar: aside from hard-wired, unchecked
assumption about maximum length of pg_wchar2mb_with_len result, why is
it that this is doing a lowercase conversion? Surely the regex stuff
dealt with that already?

Trigrams are already lowercased. We can simplify our calculations by excluding uppercased characters from consideration.

------
With best regards,
Alexander Korotkov.