Discussion: Prefix support for synonym dictionary
Hi there,
Attached is our patch against CVS HEAD, which adds prefix support to the
synonym dictionary.
Quick example:
> cat $SHAREDIR/tsearch_data/synonym_sample.syn
postgres pgsql
postgresql pgsql
postgre pgsql
gogle googl
indices index*
=# create text search dictionary syn( template=synonym,synonyms='synonym_sample');
=# select ts_lexize('syn','indices');
 ts_lexize
-----------
 {index}
(1 row)
=# create text search configuration tst ( copy=simple);
=# alter text search configuration tst alter mapping for asciiword with syn;
=# select to_tsquery('tst','indices');
 to_tsquery
------------
 'index':*
(1 row)
=# select 'indexes are very useful'::tsvector @@ to_tsquery('tst','indices');
 ?column?
----------
 t
(1 row)
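The last query matches because 'index':* is a prefix query, and the lexeme 'indexes' starts with 'index'. Below is a minimal C sketch of that matching rule only. The type and function names (SketchQueryItem, sketch_matches) are made up for illustration and are not taken from the patch or from the PostgreSQL sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for a query lexeme; not a PostgreSQL type. */
typedef struct
{
    const char *lexeme;     /* substituted word, e.g. "index" */
    bool        prefix;     /* true when the synonym entry ended in '*' */
} SketchQueryItem;

/* Returns true if the document lexeme satisfies the query item. */
static bool
sketch_matches(const SketchQueryItem *q, const char *doc_lexeme)
{
    size_t      qlen;

    assert(q != NULL && q->lexeme != NULL && doc_lexeme != NULL);
    qlen = strlen(q->lexeme);

    if (q->prefix)
        return strncmp(q->lexeme, doc_lexeme, qlen) == 0;  /* 'index':* */
    return strcmp(q->lexeme, doc_lexeme) == 0;             /* 'index'   */
}
```

With the prefix flag set, "index" matches both "index" and "indexes". Without it, only the exact lexeme matches.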
Regards, Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Hi,
The patch looks good.
Comments:
1. The docs should be clarified a little. For instance, it should have a
link back to the definition of a prefix search (12.3.2). I included my
doc suggestions as an attachment.
2. dsynonym_init() uses findwrd() in a slightly confusing (and perhaps
fragile) way. After calling findwrd(), the "end" pointer is pointing at
either the end of the string, or the *; depending on whether the string
ends in * and whether flags is NULL. I only mention this because I had
to take a more careful look to see what was happening. Perhaps add a
comment to make it more clear?
3. The patch looks for the special byte '*'. I think that's fine,
because we depend on the files being in UTF-8 encoding, where it's the
same byte. However, I thought it was worth mentioning in case we want to
support other encodings for text search files later.
Regards,
Jeff Davis
Attachments
On Sun, Aug 2, 2009 at 3:05 PM, Jeff Davis<pgsql@j-davis.com> wrote:
> The patch looks good.
>
> Comments:
>
> 1. The docs should be clarified a little. For instance, it should have a
> link back to the definition of a prefix search (12.3.2). I included my
> doc suggestions as an attachment.
>
> 2. dsynonym_init() uses findwrd() in a slightly confusing (and perhaps
> fragile) way. After calling findwrd(), the "end" pointer is pointing at
> either the end of the string, or the *; depending on whether the string
> ends in * and whether flags is NULL. I only mention this because I had
> to take a more careful look to see what was happening. Perhaps add a
> comment to make it more clear?
>
> 3. The patch looks for the special byte '*'. I think that's fine,
> because we depend on the files being in UTF-8 encoding, where it's the
> same byte. However, I thought it was worth mentioning in case we want to
> support other encodings for text search files later.

Oleg,

Are you planning to update this patch this week? If not I will set it
to "Returned with Feedback".

Thanks,

...Robert
On Wed, 2009-08-05 at 12:34 -0400, Robert Haas wrote:
> Oleg,
>
> Are you planning to update this patch this week? If not I will set it
> to "Returned with Feedback".

My only comments were related to docs and comments, and I supplied a
patch as a suggested fix for the docs. Also, the patch is very small.

I'd hate to hold it up over such a minor issue, and it seems like a
useful feature. If Oleg is unavailable, would you mind just having a
second review of the patch to see if they agree with my suggestions, and
then mark "ready for committer review"?

Regards,
Jeff Davis
> 1. The docs should be clarified a little. For instance, it should have a
> link back to the definition of a prefix search (12.3.2). I included my
> doc suggestions as an attachment.
Thank you, merged.
> 2. dsynonym_init() uses findwrd() in a slightly confusing (and perhaps
> fragile) way. After calling findwrd(), the "end" pointer is pointing at
> either the end of the string, or the *; depending on whether the string
> ends in * and whether flags is NULL. I only mention this because I had
> to take a more careful look to see what was happening. Perhaps add a
> comment to make it more clear?
Added comments:
/*
* Finds the next whitespace-delimited word within the 'in' string.
* Returns a pointer to the first character of the word, and a pointer
* to the next byte after the last character in the word (in *end).
* Character '*' at the end of word will not be treated as word
* character if flags is not null.
*/
static char *
findwrd(char *in, char **end, uint16 *flags)
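For readers following the review, here is a minimal standalone sketch of the behaviour that comment describes. It is not the code from the patch: it assumes single-byte whitespace and uses plain isspace() instead of the multibyte-aware helpers a backend function would need. The names findwrd_sketch and SKETCH_PREFIX are invented for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PREFIX 0x0001    /* hypothetical flag value, illustration only */

static char *
findwrd_sketch(char *in, char **end, uint16_t *flags)
{
    char       *start;

    assert(in != NULL && end != NULL);

    /* Skip leading whitespace. */
    while (*in && isspace((unsigned char) *in))
        in++;

    /* Nothing but whitespace: there is no word to return. */
    if (*in == '\0')
    {
        *end = NULL;
        return NULL;
    }

    start = in;

    /* Advance to the first whitespace byte or the terminator. */
    while (*in && !isspace((unsigned char) *in))
        in++;

    if (flags != NULL)
        *flags = 0;

    /*
     * A trailing '*' is a prefix marker only when the caller passed flags.
     * A lone "*" is left as an ordinary word in this sketch.
     */
    if (flags != NULL && in - start > 1 && in[-1] == '*')
    {
        *flags = SKETCH_PREFIX;
        in--;                   /* *end stops at the '*', excluding it */
    }

    *end = in;
    return start;
}
```

Called with flags == NULL (as for the first word on a line), a trailing '*' stays part of the word. Called with a flags pointer (as for the substitute word), *end points at the '*' and the prefix flag is set. These are the two cases the review comment was asking to have documented.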
> 3. The patch looks for the special byte '*'. I think that's fine,
> because we depend on the files being in UTF-8 encoding, where it's the
> same byte. However, I thought it was worth mentioning in case we want to
> support other encodings for text search files later.
tsearch_readline() converts the file's UTF8 encoding into the server
encoding. pgsql supports only server encodings that are supersets of ASCII,
so it's safe to test for the asterisk byte in any of them.
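A small illustration of why the byte test is safe, using UTF-8 as one example of an ASCII-superset encoding: every byte of a multibyte UTF-8 character is 0x80 or higher, so a byte equal to 0x2A can only be a real asterisk. The helper name below is hypothetical and does not appear in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Returns true when the last byte of the word is an ASCII '*'.
 * In UTF-8, lead and continuation bytes of multibyte characters are
 * all >= 0x80, so 0x2A is never part of another character.
 */
static bool
ends_with_asterisk_sketch(const char *word)
{
    size_t      len;

    assert(word != NULL);
    len = strlen(word);
    return len > 0 && (unsigned char) word[len - 1] == 0x2A;
}
```

The same check gives the right answer for a word made of non-ASCII characters followed by '*', without decoding the characters.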
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Attachments
2009/8/6 Teodor Sigaev <teodor@sigaev.ru>:
>> 1. The docs should be clarified a little. For instance, it should have a
>> link back to the definition of a prefix search (12.3.2). I included my
>> doc suggestions as an attachment.
>
> Thank you, merged
>
>> 2. dsynonym_init() uses findwrd() in a slightly confusing (and perhaps
>> fragile) way. After calling findwrd(), the "end" pointer is pointing at
>> either the end of the string, or the *; depending on whether the string
>> ends in * and whether flags is NULL. I only mention this because I had
>> to take a more careful look to see what was happening. Perhaps add a
>> comment to make it more clear?
>
> Add comments:
> /*
> * Finds the next whitespace-delimited word within the 'in' string.
> * Returns a pointer to the first character of the word, and a pointer
> * to the next byte after the last character in the word (in *end).
> * Character '*' at the end of word will not be treated as word
> * character if flags is not null.
> */
> static char *
> findwrd(char *in, char **end, uint16 *flags)
>
>> 3. The patch looks for the special byte '*'. I think that's fine,
>> because we depend on the files being in UTF-8 encoding, where it's the
>> same byte. However, I thought it was worth mentioning in case we want to
>> support other encodings for text search files later.
>
> tsearch_readline() converts file's UTF8 encoding into server encoding. pgsql
> supports only encoding which are a superset of ASCII. So it's safe to use
> asterisk with any encodings

Jeff,

Based on these comments, do you want to go ahead and mark this "Ready
for Committer"?

https://commitfest.postgresql.org/action/patch_view?id=133

...Robert
On Thu, 2009-08-06 at 12:19 -0400, Robert Haas wrote:
> Based on these comments, do you want to go ahead and mark this "Ready
> for Committer"?

Done, thanks Teodor.

However, on the commitfest page, the patches got updated in the wrong
places: "prefix support" and "filtering dictionary support" are pointing
at each others' patches.

Regards,
Jeff Davis
On Thu, Aug 6, 2009 at 12:53 PM, Jeff Davis<pgsql@j-davis.com> wrote:
> On Thu, 2009-08-06 at 12:19 -0400, Robert Haas wrote:
>> Based on these comments, do you want to go ahead and mark this "Ready
>> for Committer"?
>
> Done, thanks Teodor.
>
> However, on the commitfest page, the patches got updated in the wrong
> places: "prefix support" and "filtering dictionary support" are pointing
> at each others' patches.

Fixed.

...Robert