Re: Empty string in lexeme for tsvector

Поиск

Список

Период

Сортировка

От	Jean-Christophe Arnu
Тема	Re: Empty string in lexeme for tsvector
Дата	30 сентября 2021 г. 21:34:15
Msg-id	CAHZmTm3PVmAxAmJ9XNSDyb-mU68m0F3j8DXvq9RygqHA93Jf8A@mail.gmail.com обсуждение исходный текст
Ответ на	Re: Empty string in lexeme for tsvector (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы	Re: Empty string in lexeme for tsvector
Список	pgsql-hackers

Дерево обсуждения

Thank you Tom for your review.

Le mer. 29 sept. 2021 à 21:36, Tom Lane <tgl@sss.pgh.pa.us> a écrit :

Jean-Christophe Arnu <jcarnu@gmail.com> writes:
> [ empty_string_in_tsvector_v4.patch ]

I looked through this patch a bit. I don't agree with adding
these new error conditions to tsvector_setweight_by_filter and
tsvector_delete_arr. Those don't prevent bad lexemes from being
added to tsvectors, so AFAICS they can have no effect other than
breaking existing applications. In fact, tsvector_delete_arr is
one thing you could use to fix existing bad tsvectors, so making
it throw an error seems actually counterproductive.

Agreed.

The new patch included here does not change tsvector_setweight_by_filter()

anymore. Tests and docs are also upgraded.

This patch is not ready yet.

(By the same token, I think there's a good argument for
tsvector_delete_arr to just ignore nulls, not throw an error.
That's a somewhat orthogonal issue, though.)

Nulls are now ignored in tsvector_delete_arr().

What I'm wondering about more than that is whether array_to_tsvector
is the only place that can inject an empty lexeme ... don't we have
anything else that can add lexemes without going through the parser?

I crawled the docs [1] in order to check each tsvector output from functions and

operators :

* The only fonctions left that may lead to empty strings seem

both json_to_tsvector() and jsonb_to_tsvector(). Both functions use parsetext

(in ts_parse.c) that seems to behave correctly and don't create "empty string".

* concat operator "||' allows to compute a ts_vector containing "empty string" if

one of its operands contains itself an empty string tsvector. This seems perfectly

correct from the operator point of view... Should we change this behaviour to

filter out empty strings ?

I also wonder if we should not also consider changing COPY FROM behaviour

on empty string lexemes.

Current version is just crashing on empty string lexemes. Should

we allow them anyway as COPY FROM input (it seems weird not to be able

to re-import dumped data) or "skipping them" just like array_to_tsvector()

does in the patched version (that may lead to database content changes) or

finally should we not change COPY behaviour ?

I admit this is a tricky bunch of questions I'm too rookie to answer.

[1] https://www.postgresql.org/docs/14/functions-textsearch.html

Jean-Christophe Arnu

Вложения

empty_string_in_tsvector_v5.patch

В списке pgsql-hackers по дате отправления:

Предыдущее

От: Jaime Casanova
Дата: 30 сентября 2021 г., 21:31:16
Сообщение: Re: FETCH FIRST clause PERCENT option

Следующее

От: Andres Freund
Дата: 30 сентября 2021 г., 21:36:41
Сообщение: 002_types.pl fails on some timezones on windows

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Empty string in lexeme for tsvector

Вложения

Предыдущее

Следующее