Re: english parser in text search: support for multiple words in the same position

Поиск

Список

Период

Сортировка

От	Tom Lane
Тема	Re: english parser in text search: support for multiple words in the same position
Дата	3 октября 2010 г. 19:24:45
Msg-id	6568.1286144672@sss.pgh.pa.us обсуждение исходный текст
Ответ на	Re: english parser in text search: support for multiple words in the same position (Sushant Sinha <sushant354@gmail.com>)
Ответы	Re: english parser in text search: support for multiple words in the same position
Список	pgsql-hackers

Дерево обсуждения

[ sorry for not responding on this sooner, it's been hectic the last couple weeks ]

Sushant Sinha <sushant354@gmail.com> writes:
>> I looked at this patch a bit.  I'm fairly unhappy that it seems to be
>> inventing a brand new mechanism to do something the ts parser can
>> already do.  Why didn't you code the url-part mechanism using the
>> existing support for compound words? 

> I am not familiar with compound word implementation and so I am not sure
> how to split a url with compound word support. I looked into the
> documentation for compound words and that does not say much about how to
> identify components of a token.

IIRC, the way that that works is associated with pushing a sub-state
of the state machine in order to scan each compound-word part.  I don't
have the details in my head anymore, though I recall having traced
through it in the past.  Look at the state machine actions that are
associated with producing the compound word tokens and sub-tokens.

> The current patch is not inventing any new mechanism. It uses the
> special handler mechanism already present in the parser.

The fact that that mechanism is there doesn't mean that it's the right
one for this task.  I think that Teodor meant it for other things
altogether.  If it were the best way to solve this problem, why wouldn't
he have used it for compound words?

>> The changes made to parsetext()
>> seem particularly scary: it's not clear at all that that's not breaking
>> unrelated behaviors.  In fact, the changes in the regression test
>> results suggest strongly to me that it *is* breaking things.  Why are
>> there so many diffs in examples that include no URLs at all?

> I think some of the difference is coming from the fact that now pos
> starts with 0 and it used to be 1 earlier. That is easily fixable
> though. 

You cannot seriously believe that it's okay for a patch to just
arbitrarily change such an easily user-visible behavior.  This comes
back again to the point that this patch is not going to get in at all
unless it makes the absolute minimum amount of change in the established
behavior of the parser.  I think we can probably accept a patch that
produces new tokens (of newly-defined types) in addition to what it
already produced for URL-looking input, because any existing dictionary
configuration will just drop the newly-defined token types on the floor
leaving you with exactly the same indexing behavior you got in the
last three major releases.  Changes beyond that are going to need to
meet a very high bar of arguable necessity.

>> An issue that's nearly as bad is the 100% lack of documentation,

> I did not provide any explanation as I could not find any place in the
> code to provide the documentation (that was just a modification of state
> machine).

The code is not the place that I'm complaining about the lack of
documentation in.  A patch that changes user-visible behavior needs to
change the appropriate parts of the SGML documentation also.  In the
case at hand, I think an absolute minimum level of documentation would
involve changing table 12-1 here:
http://developer.postgresql.org/pgdocs/postgres/textsearch-parsers.html
and probably adding another example to the ones following that table.
But there may well be other parts of chapter 12 that need revision also.

Now I will grant that an early-draft patch needn't include final user
docs, but if you're omitting that then it's all the more important that
you provide clear information with the patch about what it's supposed to
do, so that reviewers can understand what the point is.  The two
sentences of description provided in the commitfest entry were nowhere
near enough for intelligent reviewing, IMO.
        regards, tom lane

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: english parser in text search: support for multiple words in the same position