Re: Can pg_trgm handle non-alphanumeric characters?
От | MauMau |
---|---|
Тема | Re: Can pg_trgm handle non-alphanumeric characters? |
Дата | |
Msg-id | 96F7F5F69579425F9B42EAAA8B19CD14@maumau обсуждение исходный текст |
Ответ на | Re: Can pg_trgm handle non-alphanumeric characters? ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>) |
Список | pgsql-hackers |
From: "Kevin Grittner" <Kevin.Grittner@wicourts.gov> > "MauMau" <maumau307@gmail.com> wrote: >> For information, what kind of breakage would occur? > >> I imagined removing KEEPONLYALNUM would just accept >> non-alphanumeric characters and cause no harm to those who use >> only alphanumeric characters. > > This would break our current usages because of the handling of > trigrams at the "edges" of groups of qualifying characters. It > would make similarity (and distance) values less useful for our > current name searches using it. To simulate the effect, I used an > '8' in place of a comma instead of recompiling with the suggested > change. > > test=# select show_trgm('smith,john'); > show_trgm > ----------------------------------------------------------- > {" j"," s"," jo"," sm","hn ",ith,joh,mit,ohn,smi,"th "} > (1 row) > > test=# select show_trgm('smith8john'); > show_trgm > ----------------------------------------------------- > {" s"," sm",8jo,h8j,"hn ",ith,joh,mit,ohn,smi,th8} > (1 row) > > test=# select similarity('smith,john', 'jon smith'); > similarity > ------------ > 0.615385 > (1 row) > > test=# select similarity('smith8john', 'jon smith'); > similarity > ------------ > 0.3125 > (1 row) > > So making the proposed change unconditionally could indeed hurt > current users of the technique. On the other hand, if there was > fine-grained control of this, it might make trigrams useful for > searching statute cites (using all characters) as well as names > (using the current character set); so I wouldn't want it to just be > controlled by a global GUC. Thank you for your concise examples. I probably got it. From your examples, I thought KEEPONLYALNUM controls whether non-alphanumeric characters are included in trigrams, though I haven't read the code of pg_trgm. So, removing KEEPONLYALNUM definition produces trigrams unnecessary for users who handle only alphanumeric text. That would lead to undesirable query results. Then, I wonder what would be the ideal specification...to add alphanumeric/non-alphanumeric boolean switch to similarity() function, add non-alphanumeric version of operators (ex. %* and <->*) and non-alphanumeric version of operator classes (ex. gin_allchars_trgm_ops)? At least, I understood the fix is not appropriate for minor releases. Regards MauMau
В списке pgsql-hackers по дате отправления: