Обсуждение: [BUGS] pg_trgm word_similarity inconsistencies or bug

Поиск

Список

Период

Сортировка

[BUGS] pg_trgm word_similarity inconsistencies or bug

От

Cristiano Coelho

Дата:

28 октября 2017 г., 00:48:08

Hello all, this is related to postgres 9.6 (9.6.4) and a good description can be found here https://stackoverflow.com/questions/46966360/postgres-word-similarity-not-comparing-words

But in summary, word_similarity doesn’t seem to do exactly what the docs say, since it will match trigrams from multiple words rather tan doing a word by word comparison.

Below is a table with output and expected output, thanks to kiln from stackoverflow to provide it.

with data(t) as (

values

('message'),

('message s'),

('message sag'),

('message sag sag'),

('message sag sage')

select t, word_similarity('sage', t), my_word_similarity('sage', t)

from data;

        t         | word_similarity | my_word_similarity

------------------+-----------------+--------------------

 message          |             0.6 |                0.3

 message s        |             0.8 |                0.3

 message sag      |               1 |                0.5

 message sag sag  |               1 |                0.5

 message sag sage |               1 |                  1

Re: [BUGS] pg_trgm word_similarity inconsistencies or bug

От

Arthur Zakirov

Дата:

28 октября 2017 г., 14:22:29

On Fri, Oct 27, 2017 at 06:48:08PM +0000, Cristiano Coelho wrote:
> Hello all, this is related to postgres 9.6 (9.6.4) and a good description can be found here
https://stackoverflow.com/questions/46966360/postgres-word-similarity-not-comparing-words
> 
> But in summary, word_similarity doesn’t seem to do exactly what the docs say, since it will match trigrams from
multiplewords rather tan doing a word by word comparison.
 
> 
> Below is a table with output and expected output, thanks to kiln from stackoverflow to provide it.
> 

Interesting. An klin's answer from stackoverflow.com is right.

The initial example can be reduced to the next:

=# select word_similarity('sage', 'age sag');word_similarity 
-----------------              1

It computes maximum similarity using closest trigrams not considering order of
'sage' trigrams. It determines that all
trigrams from 'sage' match trigrams from 'age sag'.

Initial order of 'age sag' trigrams:
'  a', ' ag', 'age', 'ge ', '  s', ' sa', 'sag', 'ag '               ^                           ^               |from
                    |to
 
Sorted 'sage' trigrams (all of them occured within 'age sag' trigrams
continuously):
'  s', ' sa', 'age', 'ge ', 'sag'

Maybe the problem should be solved by considering 'sage' trigrams
initial order.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] pg_trgm word_similarity inconsistencies or bug

От

Alexander Korotkov

Дата:

31 октября 2017 г., 00:08:21

On Sat, Oct 28, 2017 at 11:22 AM, Arthur Zakirov <a.zakirov@postgrespro.ru> wrote:

On Fri, Oct 27, 2017 at 06:48:08PM +0000, Cristiano Coelho wrote:
> Hello all, this is related to postgres 9.6 (9.6.4) and a good description can be found here https://stackoverflow.com/questions/46966360/postgres-word-similarity-not-comparing-words
>
> But in summary, word_similarity doesn’t seem to do exactly what the docs say, since it will match trigrams from multiple words rather tan doing a word by word comparison.
>
> Below is a table with output and expected output, thanks to kiln from stackoverflow to provide it.
>

Interesting. An klin's answer from stackoverflow.com is right.

The initial example can be reduced to the next:

=# select word_similarity('sage', 'age sag');
word_similarity
-----------------
1

It computes maximum similarity using closest trigrams not considering order of
'sage' trigrams. It determines that all
trigrams from 'sage' match trigrams from 'age sag'.

Initial order of 'age sag' trigrams:
' a', ' ag', 'age', 'ge ', ' s', ' sa', 'sag', 'ag '
^ ^
|from |to
Sorted 'sage' trigrams (all of them occured within 'age sag' trigrams
continuously):
' s', ' sa', 'age', 'ge ', 'sag'

Maybe the problem should be solved by considering 'sage' trigrams
initial order.

We searching for continuous extent of second string trigrams (in original orders) which has best similarity with first string trigrams.

Possible solution could be forcing this extent boundaries to be at word boundaries. However, it would become less convenient to search for *part* of word. And we already have users adopt this feature.

So, I see following solution:

1) Define GUC variable which specifies whether word_similarity() should force extent boundaries to be at word boundaries,

2) Document both cases of word_similarity() behavior.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com

The Russian Postgres Company

Re: [BUGS] pg_trgm word_similarity inconsistencies or bug

От

Jan Przemysław Wójcik

Дата:

31 октября 2017 г., 19:02:27

2017-10-30 19:08 GMT+01:00 Alexander Korotkov <a.korotkov@postgrespro.ru>:
>
> On Sat, Oct 28, 2017 at 11:22 AM, Arthur Zakirov <a.zakirov@postgrespro.ru> wrote:
>>
>> On Fri, Oct 27, 2017 at 06:48:08PM +0000, Cristiano Coelho wrote:
>> > Hello all, this is related to postgres 9.6 (9.6.4) and a good description can be found here
https://stackoverflow.com/questions/46966360/postgres-word-similarity-not-comparing-words
>> >
>> > But in summary, word_similarity doesn’t seem to do exactly what the docs say, since it will match trigrams from
multiplewords rather tan doing a word by word comparison. 
>> >
>> > Below is a table with output and expected output, thanks to kiln from stackoverflow to provide it.
>> >
>>
>> Interesting. An klin's answer from stackoverflow.com is right.
>>
>> The initial example can be reduced to the next:
>>
>> =# select word_similarity('sage', 'age sag');
>>  word_similarity
>> -----------------
>>                1
>>
>> It computes maximum similarity using closest trigrams not considering order of
>> 'sage' trigrams. It determines that all
>> trigrams from 'sage' match trigrams from 'age sag'.
>>
>> Initial order of 'age sag' trigrams:
>> '  a', ' ag', 'age', 'ge ', '  s', ' sa', 'sag', 'ag '
>>                 ^                           ^
>>                 |from                       |to
>> Sorted 'sage' trigrams (all of them occured within 'age sag' trigrams
>> continuously):
>> '  s', ' sa', 'age', 'ge ', 'sag'
>>
>> Maybe the problem should be solved by considering 'sage' trigrams
>> initial order.
>
>
> We searching for continuous extent of second string trigrams (in original orders) which has best similarity with
firststring trigrams. 
> Possible solution could be forcing this extent boundaries to be at word boundaries.  However, it would become less
convenientto search for *part* of word.  And we already have users adopt this feature. 
> So, I see following  solution:
> 1) Define GUC variable which specifies whether word_similarity() should force extent boundaries to be at word
boundaries,
> 2) Document both cases of word_similarity() behavior.
>
> ------
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>

Look at the example:

with data(word, string) as (
values   ('sage', 'message'),   ('sage', 'message s'),   ('sage', 'message sa')
)

select   similarity(word, string),   word_similarity (word, string)
from data;
similarity | word_similarity
------------+-----------------       0.3 |             0.6  0.363636 |             0.8  0.454545 |               1
(3 rows)

When searching for a part of a word I would expect that the word
similarity is the same in all three rows. It's really strange that the
context of the second word (sa) makes the similarity equal to 1.

From a user's point of view it's also hard to understand why there is
such a big difference between similarity() and word_similarity(),
especially when comparing just two words (the first row).

I do not think the current function has any practical use.


------
Jan Przemysław Wójcik

2017-10-30 19:08 GMT+01:00 Alexander Korotkov <a.korotkov@postgrespro.ru>:
> On Sat, Oct 28, 2017 at 11:22 AM, Arthur Zakirov <a.zakirov@postgrespro.ru>
> wrote:
>>
>> On Fri, Oct 27, 2017 at 06:48:08PM +0000, Cristiano Coelho wrote:
>> > Hello all, this is related to postgres 9.6 (9.6.4) and a good
>> > description can be found here
>> > https://stackoverflow.com/questions/46966360/postgres-word-similarity-not-comparing-words
>> >
>> > But in summary, word_similarity doesn’t seem to do exactly what the docs
>> > say, since it will match trigrams from multiple words rather tan doing a
>> > word by word comparison.
>> >
>> > Below is a table with output and expected output, thanks to kiln from
>> > stackoverflow to provide it.
>> >
>>
>> Interesting. An klin's answer from stackoverflow.com is right.
>>
>> The initial example can be reduced to the next:
>>
>> =# select word_similarity('sage', 'age sag');
>>  word_similarity
>> -----------------
>>                1
>>
>> It computes maximum similarity using closest trigrams not considering
>> order of
>> 'sage' trigrams. It determines that all
>> trigrams from 'sage' match trigrams from 'age sag'.
>>
>> Initial order of 'age sag' trigrams:
>> '  a', ' ag', 'age', 'ge ', '  s', ' sa', 'sag', 'ag '
>>                 ^                           ^
>>                 |from                       |to
>> Sorted 'sage' trigrams (all of them occured within 'age sag' trigrams
>> continuously):
>> '  s', ' sa', 'age', 'ge ', 'sag'
>>
>> Maybe the problem should be solved by considering 'sage' trigrams
>> initial order.
>
>
> We searching for continuous extent of second string trigrams (in original
> orders) which has best similarity with first string trigrams.
> Possible solution could be forcing this extent boundaries to be at word
> boundaries.  However, it would become less convenient to search for *part*
> of word.  And we already have users adopt this feature.
> So, I see following  solution:
> 1) Define GUC variable which specifies whether word_similarity() should
> force extent boundaries to be at word boundaries,
> 2) Document both cases of word_similarity() behavior.
>
> ------
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>


--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] pg_trgm word_similarity inconsistencies or bug

От

Alexander Korotkov

Дата:

31 октября 2017 г., 21:25:03

On Tue, Oct 31, 2017 at 4:02 PM, Jan Przemysław Wójcik <jan.przemyslaw.wojcik@gmail.com> wrote:

2017-10-30 19:08 GMT+01:00 Alexander Korotkov <a.korotkov@postgrespro.ru>:
>
> On Sat, Oct 28, 2017 at 11:22 AM, Arthur Zakirov <a.zakirov@postgrespro.ru> wrote:
>>
>> On Fri, Oct 27, 2017 at 06:48:08PM +0000, Cristiano Coelho wrote:
>> > Hello all, this is related to postgres 9.6 (9.6.4) and a good description can be found here https://stackoverflow.com/questions/46966360/postgres-word-similarity-not-comparing-words
>> >
>> > But in summary, word_similarity doesn’t seem to do exactly what the docs say, since it will match trigrams from multiple words rather tan doing a word by word comparison.
>> >
>> > Below is a table with output and expected output, thanks to kiln from stackoverflow to provide it.
>> >
>>
>> Interesting. An klin's answer from stackoverflow.com is right.
>>
>> The initial example can be reduced to the next:
>>
>> =# select word_similarity('sage', 'age sag');
>> word_similarity
>> -----------------
>> 1
>>
>> It computes maximum similarity using closest trigrams not considering order of
>> 'sage' trigrams. It determines that all
>> trigrams from 'sage' match trigrams from 'age sag'.
>>
>> Initial order of 'age sag' trigrams:
>> ' a', ' ag', 'age', 'ge ', ' s', ' sa', 'sag', 'ag '
>> ^ ^
>> |from |to
>> Sorted 'sage' trigrams (all of them occured within 'age sag' trigrams
>> continuously):
>> ' s', ' sa', 'age', 'ge ', 'sag'
>>
>> Maybe the problem should be solved by considering 'sage' trigrams
>> initial order.
>
>
> We searching for continuous extent of second string trigrams (in original orders) which has best similarity with first string trigrams.
> Possible solution could be forcing this extent boundaries to be at word boundaries. However, it would become less convenient to search for *part* of word. And we already have users adopt this feature.
> So, I see following solution:
> 1) Define GUC variable which specifies whether word_similarity() should force extent boundaries to be at word boundaries,
> 2) Document both cases of word_similarity() behavior.

Look at the example:

with data(word, string) as (
values
('sage', 'message'),
('sage', 'message s'),
('sage', 'message sa')
)

select
similarity(word, string),
word_similarity (word, string)
from data;

similarity | word_similarity
------------+-----------------
0.3 | 0.6
0.363636 | 0.8
0.454545 | 1
(3 rows)

When searching for a part of a word I would expect that the word
similarity is the same in all three rows. It's really strange that the
context of the second word (sa) makes the similarity equal to 1.

From a user's point of view it's also hard to understand why there is
such a big difference between similarity() and word_similarity(),
especially when comparing just two words (the first row).

Probably word_similarity() is not a good name for this function. Initially it was called substring_similarity() which now seems like better name for that.

I do not think the current function has any practical use.

It's hard for me to agree or disagree with you. There is no technical problem to force word_similarity() to search for extent boundaries within word boundaries. However, we already have customers using this function (and they are likely satisfied with its currency behavior). It's important for me that our fix wouldn't affect them. I asked them to join this discussion. I hope that together we'll find a consensus.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

От

Alexander Korotkov

Дата:

06 ноября 2017 г., 17:34:24

Hi!

I'd like to forward a feedback from our customer who uses word_similarity() function.
François finds current behavior of word_similarity() to be useful. Thus, I think we should preserve it. But documentation correction is needed and option for alternative behavior would be useful too.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

---------- Forwarded message ----------
From: François CHAHUNEAU <Francois.CHAHUNEAU@numen.fr>
Date: Wed, Nov 1, 2017 at 1:04 AM
Subject: RE: [BUGS] pg_trgm word_similarity inconsistencies or bug
To: Alexander Korotkov <a.korotkov@postgrespro.ru>
Cc: Thierry BOUDIERE <Thierry.BOUDIERE@numen.fr>, "foli@numen.mg" <foli@numen.mg>

Hello Alexander,

We agree that the current pg_trgm documentation does not correctly reflect the de facto behavior of word_similarity(), and that something has to be changed. But to us, it is more a documentation problem than anything else.

What is computed is still « substring_similarity » as was initially specified between us, but it is influenced by a strong word boundary bias caused by the way trigrams are padded at word boundaries. This bias was noticed by early reviewers and you explained that this motivated the name switch to « word_similarity ». As you will remember, at the time we discovered this, we were suprised because we considerd this as a slight misnomer. Indeed, what is currently described in the 9.6 pg_trgm documentation is inaccurate (although seemingly consistent with this new name) and has to be amended.

Now, word_similarity() has been out for more than a year and, of course, it is preferable to avoid any breaking changes… In our case, we consider the name « unfortunate » and the explanation buggy, not the function itself.

As you may remember from the initial discussion, some other users stressed the importance to be able to matchsub strings. We tend to agree with what Jeff Janes wrote in this discussion :

The reason I like the option of not treating word boundaries as

special in this case is that often in scientific vocabulary, and in

catalog part numbers, people are pretty inconsistent about whether

they included spaces. "HEK 293", "HEK293", and "HEK-293" could be all

the same thing. So I like to strip out spaces and punctuation on both

sides of operator. Of course I can't do that if there are invisible

un-removable spaces on the substring side.

But, It doesn't sound like I am going to win that debate. Given that,

I don't think we need a different name for the function. I'm fine with

explaining the word-boundary subtlety in the documentation, and

keeping the function name itself simple.

Now, considering your proposal :

As far as we are concerned, we use <% and %> everyday for efficient fuzzy matching on large databases. Our typical usage scenario is matching noisy OCRized text strings against reference databases.

> 1) Define GUC variable which specifies whether word_similarity() should force extent boundaries to be at word boundaries,

Ok for us, iff default behavior remains the same as now, for backward compatibility reasons. We could take advantage, in some cases, of the new « word rounded » behavior controlled by the GUC variable, but this would not cover all scenarios currently in use.

> 2) Document both cases of word_similarity() behavior.

This is clearly needed anyway.

Best regards,

François CHAHUNEAU
Directeur des technologies

NUMEN DIGITAL| 24, rue Marc Seguin 75018 Paris France | www.numen.fr
Tel +33 1 40 37 95 03 | Mob +33 6 07 85 21 79 | Fax +33 1 40 37 94 94

Pensez vert, n’imprimez que nécessaire. Les informations contenues dans le présent e-mail sont exclusivement adressées au(x) destinataire(s) de ce message et peuvent contenir des informations confidentielles, protégées par un secret professionnel. L’utilisation de ces informations par d’autres personnes que le(s) destinataire(s) est strictement interdite. Si vous n’êtes pas destinataire de ce message, la publication, la reproduction, la diffusion et /ou la distribution de ces informations auprès de tiers n’est pas autorisée. Si vous avez reçu cet e-mail par erreur, veuillez nous en informer immédiatement, détruire l'email, ses copies et documents joints et le supprimer.

De : Alexander Korotkov [mailto:a.korotkov@postgrespro.ru]
Envoyé : mardi 31 octobre 2017 16:18
À : Thierry BOUDIERE <Thierry.BOUDIERE@numen.fr>; François CHAHUNEAU <Francois.CHAHUNEAU@numen.fr>
Objet : Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

Dear, Thierry and François!

PostgreSQL users found inconsistency between documentation and implementation of word_similarity().

Possible solution proposed by the reporter is to alter the implementation.

But it's important for me that your interests are not affected but potential further change of implementation of word_similarity().

Could you please share your opinion on changes proposed by Jan in the pgsql-bugs mailing list?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

От

Jan Przemysław Wójcik

Дата:

07 ноября 2017 г., 18:51:35

Hi,

my statement about the function usefulness was probably too categorical,
though I had in mind the current name of the function.

I'm afraid that creating a function that implements quite different
algorithms depending on a global parameter seems very hacky and would lead
to misunderstandings. I do understand the need of backward compatibility,
but I'd opt for the lesser evil. Perhaps a good idea would be to change the
name to 'substring_similarity()' and introduce the new function
'word_similarity()' later, for example in the next major version release.

------
Jan Przemysław Wójcik



--
Sent from: http://www.postgresql-archive.org/PostgreSQL-bugs-f2117394.html


--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

От

Alexander Korotkov

Дата:

07 ноября 2017 г., 22:24:27

Hi!

On Tue, Nov 7, 2017 at 3:51 PM, Jan Przemysław Wójcik <jan.przemyslaw.wojcik@gmail.com> wrote:

Hi,

my statement about the function usefulness was probably too categorical,
though I had in mind the current name of the function.

I'm afraid that creating a function that implements quite different
algorithms depending on a global parameter seems very hacky and would lead
to misunderstandings. I do understand the need of backward compatibility,
but I'd opt for the lesser evil. Perhaps a good idea would be to change the
name to 'substring_similarity()' and introduce the new function
'word_similarity()' later, for example in the next major version release.

Good point. I've no complaints about that. I'm going to propose corresponding patch to the next commitfest.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

От

Alexander Korotkov

Дата:

07 декабря 2017 г., 19:38:59

On Tue, Nov 7, 2017 at 7:24 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

On Tue, Nov 7, 2017 at 3:51 PM, Jan Przemysław Wójcik <jan.przemyslaw.wojcik@gmail.com> wrote:
my statement about the function usefulness was probably too categorical,
though I had in mind the current name of the function.

I'm afraid that creating a function that implements quite different
algorithms depending on a global parameter seems very hacky and would lead
to misunderstandings. I do understand the need of backward compatibility,
but I'd opt for the lesser evil. Perhaps a good idea would be to change the
name to 'substring_similarity()' and introduce the new function
'word_similarity()' later, for example in the next major version release.

Good point. I've no complaints about that. I'm going to propose corresponding patch to the next commitfest.

I've written a draft patch for fixing this inconsistency. Please, find it in attachment. This patch doesn't contain proper documentation and comments yet.

I've called existing behavior subset_similarity(). I didn't use name substring_similarity(), because it doesn't really looking for substring with appropriate padding, but rather searching for continuous subset of trigrams. For index search over subset similarity, %>>, <<%, <->>>, <<<-> operators are provided. I've added extra arrow sign to denote these operators look deeper into string.

Simultaneously, word_similarity() now forces extent bounds to be word bounds. Now word_similarity() behaves similar to my_word_similarity() proposed on stackoverlow.

# with data(t) as (

values

('message'),

('message s'),

('message sag'),

('message sag sag'),

('message sag sage')

)

select t, subset_similarity('sage', t), word_similarity('sage', t)

from data;

t | subset_similarity | word_similarity

------------------+-------------------+-----------------

message | 0.6 | 0.3

message s | 0.8 | 0.363636

message sag | 1 | 0.5

message sag sag | 1 | 0.5

message sag sage | 1 | 1

(5 rows)

The difference here is only in 'messsage s' row, because word_similarity() allows matching one word to two or more while my_word_similarity() doesn't allow that. In this case word_similarity() returns similarity between 'sage' and 'message s'.

# select similarity('sage', 'message s');

similarity

------------

0.363636

(1 row)

I think behavior of word_similarity() appears better here, because typo can break word into two.

I also wonder if word_similarity() and subset_similarity() should share same threshold value for indexed search. subset_similarity() typically returns higher values than word_similarity(). Thus, it's probably makes sense to split their threshold values.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com

The Russian Postgres Company

Вложения

pg-trgm-word-subset-similarity-1.patch

Re: [BUGS] pg_trgm word_similarity inconsistencies or bug

От

François CHAHUNEAU

Дата:

07 декабря 2017 г., 23:39:05

Hello Alexander,

This is fine with us. Yes, separate thresholds seem preferable.

Best Regards

Obtenez Outlook pour iOS

From: Alexander Korotkov <a.korotkov@postgrespro.ru>
Sent: Thursday, December 7, 2017 4:38:59 PM
To: Jan Przemysław Wójcik; Cristiano Coelho
Cc: pgsql-bugs@postgresql.org; François CHAHUNEAU; Artur Zakirov; pgsql-hackers
Subject: Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

François CHAHUNEAU
Directeur des technologies

NUMEN DIGITAL| 24, rue Marc Seguin 75018 Paris France | www.numen.fr
Tel +33 1 40 37 95 03 | Mob +33 6 07 85 21 79 | Fax +33 1 40 37 94 94

On Tue, Nov 7, 2017 at 7:24 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

On Tue, Nov 7, 2017 at 3:51 PM, Jan Przemysław Wójcik <jan.przemyslaw.wojcik@gmail.com> wrote:
my statement about the function usefulness was probably too categorical,
though I had in mind the current name of the function.

I'm afraid that creating a function that implements quite different
algorithms depending on a global parameter seems very hacky and would lead
to misunderstandings. I do understand the need of backward compatibility,
but I'd opt for the lesser evil. Perhaps a good idea would be to change the
name to 'substring_similarity()' and introduce the new function
'word_similarity()' later, for example in the next major version release.

Good point. I've no complaints about that. I'm going to propose corresponding patch to the next commitfest.

I've written a draft patch for fixing this inconsistency. Please, find it in attachment. This patch doesn't contain proper documentation and comments yet.

Simultaneously, word_similarity() now forces extent bounds to be word bounds. Now word_similarity() behaves similar to my_word_similarity() proposed on stackoverlow.

# with data(t) as (

values

('message'),

('message s'),

('message sag'),

('message sag sag'),

('message sag sage')

)

select t, subset_similarity('sage', t), word_similarity('sage', t)

from data;

t | subset_similarity | word_similarity

------------------+-------------------+-----------------

message | 0.6 | 0.3

message s | 0.8 | 0.363636

message sag | 1 | 0.5

message sag sag | 1 | 0.5

message sag sage | 1 | 1

(5 rows)

# select similarity('sage', 'message s');

similarity

------------

0.363636

(1 row)

I think behavior of word_similarity() appears better here, because typo can break word into two.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com

The Russian Postgres Company

Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

От

Robert Haas

Дата:

07 декабря 2017 г., 23:59:06

On Tue, Nov 7, 2017 at 7:51 AM, Jan Przemysław Wójcik
<jan.przemyslaw.wojcik@gmail.com> wrote:
> I'm afraid that creating a function that implements quite different
> algorithms depending on a global parameter seems very hacky and would lead
> to misunderstandings. I do understand the need of backward compatibility,
> but I'd opt for the lesser evil. Perhaps a good idea would be to change the
> name to 'substring_similarity()' and introduce the new function
> 'word_similarity()' later, for example in the next major version release.

That breaks things for everybody using word_similarity() currently.
If the previous discussion of this topic concluded that
word_similarity() was an OK name despite being a slight misnomer, I
don't think we should change our mind now.  Instead the new function
can be called something which makes the difference clear, e.g.
strict_word_similarity(), and the old function can remain as it is.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

От

Alexander Korotkov

Дата:

08 декабря 2017 г., 17:50:20

On Thu, Dec 7, 2017 at 8:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Nov 7, 2017 at 7:51 AM, Jan Przemysław Wójcik
<jan.przemyslaw.wojcik@gmail.com> wrote:
> I'm afraid that creating a function that implements quite different
> algorithms depending on a global parameter seems very hacky and would lead
> to misunderstandings. I do understand the need of backward compatibility,
> but I'd opt for the lesser evil. Perhaps a good idea would be to change the
> name to 'substring_similarity()' and introduce the new function
> 'word_similarity()' later, for example in the next major version release.

That breaks things for everybody using word_similarity() currently.
If the previous discussion of this topic concluded that
word_similarity() was an OK name despite being a slight misnomer, I
don't think we should change our mind now. Instead the new function
can be called something which makes the difference clear, e.g.
strict_word_similarity(), and the old function can remain as it is.

Thank you for pointing this. Yes, it would be better not to change existing names and behavior, but adjust documentation and add alternative behavior with another name.

Therefore, I'm going to provide patchset of two patches:

1) Improve word_similarity() documentation.

2) Add new function strict_word_similarity() (or whatever better name we invent).

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

От

Alexander Korotkov

Дата:

08 декабря 2017 г., 17:50:20

On Thu, Dec 7, 2017 at 8:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Nov 7, 2017 at 7:51 AM, Jan Przemysław Wójcik
<jan.przemyslaw.wojcik@gmail.com> wrote:
> I'm afraid that creating a function that implements quite different
> algorithms depending on a global parameter seems very hacky and would lead
> to misunderstandings. I do understand the need of backward compatibility,
> but I'd opt for the lesser evil. Perhaps a good idea would be to change the
> name to 'substring_similarity()' and introduce the new function
> 'word_similarity()' later, for example in the next major version release.

That breaks things for everybody using word_similarity() currently.
If the previous discussion of this topic concluded that
word_similarity() was an OK name despite being a slight misnomer, I
don't think we should change our mind now. Instead the new function
can be called something which makes the difference clear, e.g.
strict_word_similarity(), and the old function can remain as it is.

Thank you for pointing this. Yes, it would be better not to change existing names and behavior, but adjust documentation and add alternative behavior with another name.

Therefore, I'm going to provide patchset of two patches:

1) Improve word_similarity() documentation.

2) Add new function strict_word_similarity() (or whatever better name we invent).

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

От

Alexander Korotkov

Дата:

12 декабря 2017 г., 02:45:38

On Fri, Dec 8, 2017 at 2:50 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

On Thu, Dec 7, 2017 at 8:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Nov 7, 2017 at 7:51 AM, Jan Przemysław Wójcik
<jan.przemyslaw.wojcik@gmail.com> wrote:
> I'm afraid that creating a function that implements quite different
> algorithms depending on a global parameter seems very hacky and would lead
> to misunderstandings. I do understand the need of backward compatibility,
> but I'd opt for the lesser evil. Perhaps a good idea would be to change the
> name to 'substring_similarity()' and introduce the new function
> 'word_similarity()' later, for example in the next major version release.

That breaks things for everybody using word_similarity() currently.
If the previous discussion of this topic concluded that
word_similarity() was an OK name despite being a slight misnomer, I
don't think we should change our mind now. Instead the new function
can be called something which makes the difference clear, e.g.
strict_word_similarity(), and the old function can remain as it is.

+1
Thank you for pointing this. Yes, it would be better not to change existing names and behavior, but adjust documentation and add alternative behavior with another name.
Therefore, I'm going to provide patchset of two patches:
1) Improve word_similarity() documentation.
2) Add new function strict_word_similarity() (or whatever better name we invent).

Please, find patchset attached.

0001-pg-trgm-word-similarity-docs-improvement.patch – contains improvement to documentation of word_similarity() and related operators. I decided to give formal definition first (what exactly it internally does), and then example and some more human-understandable description. This patch also adjusts two comments where lower and upper bounds mess up.

0002-pg-trgm-strict_word-similarity.patch – implementation of strict_word_similarity() with comments, docs and tests.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com

The Russian Postgres Company

Вложения

Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

От

Alexander Korotkov

Дата:

12 декабря 2017 г., 02:45:38

On Fri, Dec 8, 2017 at 2:50 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

On Thu, Dec 7, 2017 at 8:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Nov 7, 2017 at 7:51 AM, Jan Przemysław Wójcik
<jan.przemyslaw.wojcik@gmail.com> wrote:
> I'm afraid that creating a function that implements quite different
> algorithms depending on a global parameter seems very hacky and would lead
> to misunderstandings. I do understand the need of backward compatibility,
> but I'd opt for the lesser evil. Perhaps a good idea would be to change the
> name to 'substring_similarity()' and introduce the new function
> 'word_similarity()' later, for example in the next major version release.

That breaks things for everybody using word_similarity() currently.
If the previous discussion of this topic concluded that
word_similarity() was an OK name despite being a slight misnomer, I
don't think we should change our mind now. Instead the new function
can be called something which makes the difference clear, e.g.
strict_word_similarity(), and the old function can remain as it is.

+1
Thank you for pointing this. Yes, it would be better not to change existing names and behavior, but adjust documentation and add alternative behavior with another name.
Therefore, I'm going to provide patchset of two patches:
1) Improve word_similarity() documentation.
2) Add new function strict_word_similarity() (or whatever better name we invent).

От

Alexander Korotkov

Дата:

13 декабря 2017 г., 17:13:01

On Tue, Dec 12, 2017 at 2:33 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:

0002-pg-trgm-strict_word-similarity.patch – implementation of strict_word_similarity() with comments, docs and tests.
After some looking in

1)
repeated piece of code:
+ if (strategy == SimilarityStrategyNumber)
+ nlimit = similarity_threshold;
+ else if (strategy == WordSimilarityStrategyNumber)
+ nlimit = word_similarity_threshold;
+ else /* strategy == StrictWordSimilarityStrategyNumber */
+ nlimit = strict_word_similarity_threshold;
Isn't it better to move that piece to separate function?

Good point. Moved to separate function.

2)
calc_word_similarity(char *str1, int slen1, char *str2, int slen2,
bool check_only, bool word_bounds)

Seems, two bools args are replaceble to bitwise-ORed flag. It will simplify adding new options in future.

Yep. I've introduced flags.

Also, I've adjusted tests to make them stable (found example where TOP-8 distances are unique).

Please, find revised patch in attachment.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Вложения

0002-pg-trgm-strict_word-similarity-2.patch

Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

От

Alexander Korotkov

Дата:

13 декабря 2017 г., 17:13:01

On Tue, Dec 12, 2017 at 2:33 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:

0002-pg-trgm-strict_word-similarity.patch – implementation of strict_word_similarity() with comments, docs and tests.
After some looking in

1)
repeated piece of code:
+ if (strategy == SimilarityStrategyNumber)
+ nlimit = similarity_threshold;
+ else if (strategy == WordSimilarityStrategyNumber)
+ nlimit = word_similarity_threshold;
+ else /* strategy == StrictWordSimilarityStrategyNumber */
+ nlimit = strict_word_similarity_threshold;
Isn't it better to move that piece to separate function?

Good point. Moved to separate function.

2)
calc_word_similarity(char *str1, int slen1, char *str2, int slen2,
bool check_only, bool word_bounds)

Seems, two bools args are replaceble to bitwise-ORed flag. It will simplify adding new options in future.

Yep. I've introduced flags.

Also, I've adjusted tests to make them stable (found example where TOP-8 distances are unique).

Please, find revised patch in attachment.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Вложения

0002-pg-trgm-strict_word-similarity-2.patch

Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

От

Alexander Korotkov

Дата:

05 января 2018 г., 03:25:52

On Wed, Dec 13, 2017 at 2:13 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

On Tue, Dec 12, 2017 at 2:33 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
0002-pg-trgm-strict_word-similarity.patch – implementation of strict_word_similarity() with comments, docs and tests.
After some looking in

1)
repeated piece of code:
+ if (strategy == SimilarityStrategyNumber)
+ nlimit = similarity_threshold;
+ else if (strategy == WordSimilarityStrategyNumber)
+ nlimit = word_similarity_threshold;
+ else /* strategy == StrictWordSimilarityStrategyNumber */
+ nlimit = strict_word_similarity_threshold;
Isn't it better to move that piece to separate function?

Good point. Moved to separate function.

2)
calc_word_similarity(char *str1, int slen1, char *str2, int slen2,
bool check_only, bool word_bounds)

Seems, two bools args are replaceble to bitwise-ORed flag. It will simplify adding new options in future.

Yep. I've introduced flags.

Also, I've adjusted tests to make them stable (found example where TOP-8 distances are unique).
Please, find revised patch in attachment.

I just found that patch apply is failed according to commitfest.cputube.org. I think it's because I sent only second patch from patchset in last message.

Anyway I resend both patches rebased to current master.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Вложения

Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

От

Alexander Korotkov

Дата:

05 января 2018 г., 03:25:52

On Wed, Dec 13, 2017 at 2:13 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

On Tue, Dec 12, 2017 at 2:33 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
0002-pg-trgm-strict_word-similarity.patch – implementation of strict_word_similarity() with comments, docs and tests.
After some looking in

1)
repeated piece of code:
+ if (strategy == SimilarityStrategyNumber)
+ nlimit = similarity_threshold;
+ else if (strategy == WordSimilarityStrategyNumber)
+ nlimit = word_similarity_threshold;
+ else /* strategy == StrictWordSimilarityStrategyNumber */
+ nlimit = strict_word_similarity_threshold;
Isn't it better to move that piece to separate function?

Good point. Moved to separate function.

2)
calc_word_similarity(char *str1, int slen1, char *str2, int slen2,
bool check_only, bool word_bounds)

Seems, two bools args are replaceble to bitwise-ORed flag. It will simplify adding new options in future.

Yep. I've introduced flags.

Also, I've adjusted tests to make them stable (found example where TOP-8 distances are unique).
Please, find revised patch in attachment.

Alexander Korotkov

Дата:

02 марта 2018 г., 03:26:32

On Thu, Mar 1, 2018 at 11:05 PM, David Steele <david@pgmasters.net> wrote:

On 1/4/18 4:25 PM, Alexander Korotkov wrote:
>
> I just found that patch apply is failed according to
> commitfest.cputube.org <http://commitfest.cputube.org>. I think it's
> because I sent only second patch from patchset in last message.
> Anyway I resend both patches rebased to current master.

I agree with Teodor (upthread, not quoted here) that the documentation
could use some editing.

I started to do it myself, but quickly realized I have no knowledge of
the content. I'm afraid I would destroy the meaning while updating the
grammar.

That's to problem. If you're willing to help you can edit the documentation

and let me review that it's correct. Also feel free to ask any questions and

more explanation from me. Ultimately, we need to have a documentation

that any average user can understand, not to mention you :)

Anyone understand the subject matter well enough to review the
documentation?

I expect it would be hard to find anybody matching this criteria. But it

would be nice to find one though.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

От

Alexander Korotkov

Дата:

02 марта 2018 г., 03:26:32

On Thu, Mar 1, 2018 at 11:05 PM, David Steele <david@pgmasters.net> wrote:

On 1/4/18 4:25 PM, Alexander Korotkov wrote:
>
> I just found that patch apply is failed according to
> commitfest.cputube.org <http://commitfest.cputube.org>. I think it's
> because I sent only second patch from patchset in last message.
> Anyway I resend both patches rebased to current master.

I agree with Teodor (upthread, not quoted here) that the documentation
could use some editing.

I started to do it myself, but quickly realized I have no knowledge of
the content. I'm afraid I would destroy the meaning while updating the
grammar.

That's to problem. If you're willing to help you can edit the documentation

and let me review that it's correct. Also feel free to ask any questions and

more explanation from me. Ultimately, we need to have a documentation

that any average user can understand, not to mention you :)

Anyone understand the subject matter well enough to review the
documentation?

I expect it would be hard to find anybody matching this criteria. But it

would be nice to find one though.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

От

David Steele

Дата:

02 марта 2018 г., 06:12:25

Hi Alexander,

On 3/1/18 4:26 PM, Alexander Korotkov wrote:
> On Thu, Mar 1, 2018 at 11:05 PM, David Steele <david@pgmasters.net 
> <mailto:david@pgmasters.net>> wrote:
> 
>     I agree with Teodor (upthread, not quoted here) that the documentation
>     could use some editing.
> 
>     I started to do it myself, but quickly realized I have no knowledge of
>     the content.  I'm afraid I would destroy the meaning while updating the
>     grammar.
> 
> That's to problem.  If you're willing to help you can edit the documentation
> and let me review that it's correct.  Also feel free to ask any 
> questions and
> more explanation from me.  Ultimately, we need to have a documentation
> that any average user can understand, not to mention you :)

OK, I'm the CFM so I have my plate full for the next few days but if 
nobody picks this up then I will give it a go.

>     Anyone understand the subject matter well enough to review the
>     documentation?
> 
> I expect it would be hard to find anybody matching this criteria.

You are probably right, but it never hurts to try.

-- 
-David
david@pgmasters.net

Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

От

David Steele

Дата:

02 марта 2018 г., 06:12:25

Hi Alexander,

On 3/1/18 4:26 PM, Alexander Korotkov wrote:
> On Thu, Mar 1, 2018 at 11:05 PM, David Steele <david@pgmasters.net 
> <mailto:david@pgmasters.net>> wrote:
> 
>     I agree with Teodor (upthread, not quoted here) that the documentation
>     could use some editing.
> 
>     I started to do it myself, but quickly realized I have no knowledge of
>     the content.  I'm afraid I would destroy the meaning while updating the
>     grammar.
> 
> That's to problem.  If you're willing to help you can edit the documentation
> and let me review that it's correct.  Also feel free to ask any 
> questions and
> more explanation from me.  Ultimately, we need to have a documentation
> that any average user can understand, not to mention you :)

OK, I'm the CFM so I have my plate full for the next few days but if 
nobody picks this up then I will give it a go.

>     Anyone understand the subject matter well enough to review the
>     documentation?
> 
> I expect it would be hard to find anybody matching this criteria.

You are probably right, but it never hurts to try.

-- 
-David
david@pgmasters.net

Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

От

Teodor Sigaev

Дата:

06 марта 2018 г., 18:04:16

> I agree with Teodor (upthread, not quoted here) that the documentation
> could use some editing.
> 
> I started to do it myself, but quickly realized I have no knowledge of
> the content.  I'm afraid I would destroy the meaning while updating the
> grammar.
> 
> Anyone understand the subject matter well enough to review the
> documentation?

Liudmila tried to improve docs in Alexander's patchset.

https://www.postgresql.org/message-id/f43b242d-000c-f4c8-cb8b-d37e9752cd93@postgrespro.ru

BTW, adding Liudmila's message to commitfest task 
(https://commitfest.postgresql.org/17/1403/) doesn't work
-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

От

Teodor Sigaev

Дата:

06 марта 2018 г., 18:04:16

> I agree with Teodor (upthread, not quoted here) that the documentation
> could use some editing.
> 
> I started to do it myself, but quickly realized I have no knowledge of
> the content.  I'm afraid I would destroy the meaning while updating the
> grammar.
> 
> Anyone understand the subject matter well enough to review the
> documentation?

Liudmila tried to improve docs in Alexander's patchset.

https://www.postgresql.org/message-id/f43b242d-000c-f4c8-cb8b-d37e9752cd93@postgrespro.ru

BTW, adding Liudmila's message to commitfest task 
(https://commitfest.postgresql.org/17/1403/) doesn't work
-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

От

David Steele

Дата:

06 марта 2018 г., 22:59:30

On 3/6/18 7:04 AM, Teodor Sigaev wrote:
>> I agree with Teodor (upthread, not quoted here) that the documentation
>> could use some editing.
>>
>> I started to do it myself, but quickly realized I have no knowledge of
>> the content.  I'm afraid I would destroy the meaning while updating the
>> grammar.
>>
>> Anyone understand the subject matter well enough to review the
>> documentation?
> 
> Liudmila tried to improve docs in Alexander's patchset.
> 
> https://www.postgresql.org/message-id/f43b242d-000c-f4c8-cb8b-d37e9752cd93@postgrespro.ru

This looks good to me with a few minor exceptions:

+   <function>word_similarity(text, text)</function> requires further
+   explanation. Consider the following example:

Maybe too verbose?  I think "<function>word_similarity(text,
text)</function> requires further explanation." can be removed entirely.

+   string.  However, this function does not add paddings to the

"add padding"

> BTW, adding Liudmila's message to commitfest task
> (https://commitfest.postgresql.org/17/1403/) doesn't work

Doesn't work for me either.

Alexander, can you post the final patches to the thread so they show up
in the CF app?

Thanks,
-- 
-David
david@pgmasters.net

Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

От

David Steele

Дата:

06 марта 2018 г., 22:59:30

On 3/6/18 7:04 AM, Teodor Sigaev wrote:
>> I agree with Teodor (upthread, not quoted here) that the documentation
>> could use some editing.
>>
>> I started to do it myself, but quickly realized I have no knowledge of
>> the content.  I'm afraid I would destroy the meaning while updating the
>> grammar.
>>
>> Anyone understand the subject matter well enough to review the
>> documentation?
> 
> Liudmila tried to improve docs in Alexander's patchset.
> 
> https://www.postgresql.org/message-id/f43b242d-000c-f4c8-cb8b-d37e9752cd93@postgrespro.ru

This looks good to me with a few minor exceptions:

+   <function>word_similarity(text, text)</function> requires further
+   explanation. Consider the following example:

Maybe too verbose?  I think "<function>word_similarity(text,
text)</function> requires further explanation." can be removed entirely.

+   string.  However, this function does not add paddings to the

"add padding"

> BTW, adding Liudmila's message to commitfest task
> (https://commitfest.postgresql.org/17/1403/) doesn't work

Doesn't work for me either.

Alexander, can you post the final patches to the thread so they show up
in the CF app?

Thanks,
-- 
-David
david@pgmasters.net

Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

От

Alexander Korotkov

Дата:

21 марта 2018 г., 18:25:11

On Tue, Mar 6, 2018 at 7:59 PM, David Steele <david@pgmasters.net> wrote:

On 3/6/18 7:04 AM, Teodor Sigaev wrote:
>> I agree with Teodor (upthread, not quoted here) that the documentation
>> could use some editing.
>>
>> I started to do it myself, but quickly realized I have no knowledge of
>> the content. I'm afraid I would destroy the meaning while updating the
>> grammar.
>>
>> Anyone understand the subject matter well enough to review the
>> documentation?
>
> Liudmila tried to improve docs in Alexander's patchset.
>
> https://www.postgresql.org/message-id/f43b242d-000c-f4c8-cb8b-d37e9752cd93@postgrespro.ru

This looks good to me with a few minor exceptions:

+ <function>word_similarity(text, text)</function> requires further
+ explanation. Consider the following example:

Maybe too verbose? I think "<function>word_similarity(text,
text)</function> requires further explanation." can be removed entirely.

+ string. However, this function does not add paddings to the

"add padding"

> BTW, adding Liudmila's message to commitfest task
> (https://commitfest.postgresql.org/17/1403/) doesn't work

Doesn't work for me either.

Alexander, can you post the final patches to the thread so they show up
in the CF app?

I'm sorry for not updating patches, I've missed this message in the thread.

BTW, Teodor have pushed fix to the documentation up to 9.6.

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=aea7c17e86e99a7ed4da489b3df2b5493b5e5e95

And new function strict_word_similarity() to PostgreSQL 11.

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=be8a7a6866276b228b4ffaa3003e1dc2dd1d140a

Could someone put this information to stackoverflow?

https://stackoverflow.com/questions/46966360/postgres-word-similarity-not-comparing-words

I don't have enough of reputation to comment.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

От

Alexander Korotkov

Дата:

21 марта 2018 г., 18:25:11

On Tue, Mar 6, 2018 at 7:59 PM, David Steele <david@pgmasters.net> wrote:

On 3/6/18 7:04 AM, Teodor Sigaev wrote:
>> I agree with Teodor (upthread, not quoted here) that the documentation
>> could use some editing.
>>
>> I started to do it myself, but quickly realized I have no knowledge of
>> the content. I'm afraid I would destroy the meaning while updating the
>> grammar.
>>
>> Anyone understand the subject matter well enough to review the
>> documentation?
>
> Liudmila tried to improve docs in Alexander's patchset.
>
> https://www.postgresql.org/message-id/f43b242d-000c-f4c8-cb8b-d37e9752cd93@postgrespro.ru

This looks good to me with a few minor exceptions:

+ <function>word_similarity(text, text)</function> requires further
+ explanation. Consider the following example:

Maybe too verbose? I think "<function>word_similarity(text,
text)</function> requires further explanation." can be removed entirely.

+ string. However, this function does not add paddings to the

"add padding"

> BTW, adding Liudmila's message to commitfest task
> (https://commitfest.postgresql.org/17/1403/) doesn't work

Doesn't work for me either.

Alexander, can you post the final patches to the thread so they show up
in the CF app?

I'm sorry for not updating patches, I've missed this message in the thread.

BTW, Teodor have pushed fix to the documentation up to 9.6.

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=aea7c17e86e99a7ed4da489b3df2b5493b5e5e95

And new function strict_word_similarity() to PostgreSQL 11.

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=be8a7a6866276b228b4ffaa3003e1dc2dd1d140a

Could someone put this information to stackoverflow?

https://stackoverflow.com/questions/46966360/postgres-word-similarity-not-comparing-words

I don't have enough of reputation to comment.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

От

Teodor Sigaev

Дата:

21 марта 2018 г., 18:35:56

Thank you, pushed

David Steele wrote:
> On 3/6/18 7:04 AM, Teodor Sigaev wrote:
>>> I agree with Teodor (upthread, not quoted here) that the documentation
>>> could use some editing.
>>>
>>> I started to do it myself, but quickly realized I have no knowledge of
>>> the content.  I'm afraid I would destroy the meaning while updating the
>>> grammar.
>>>
>>> Anyone understand the subject matter well enough to review the
>>> documentation?
>>
>> Liudmila tried to improve docs in Alexander's patchset.
>>
>> https://www.postgresql.org/message-id/f43b242d-000c-f4c8-cb8b-d37e9752cd93@postgrespro.ru
> 
> This looks good to me with a few minor exceptions:
> 
> +   <function>word_similarity(text, text)</function> requires further
> +   explanation. Consider the following example:
> 
> Maybe too verbose?  I think "<function>word_similarity(text,
> text)</function> requires further explanation." can be removed entirely.
> 
> +   string.  However, this function does not add paddings to the
> 
> "add padding"
> 
>> BTW, adding Liudmila's message to commitfest task
>> (https://commitfest.postgresql.org/17/1403/) doesn't work
> 
> Doesn't work for me either.
> 
> Alexander, can you post the final patches to the thread so they show up
> in the CF app?
> 
> Thanks,
> 

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

От

Teodor Sigaev

Дата:

21 марта 2018 г., 18:35:56

Thank you, pushed

David Steele wrote:
> On 3/6/18 7:04 AM, Teodor Sigaev wrote:
>>> I agree with Teodor (upthread, not quoted here) that the documentation
>>> could use some editing.
>>>
>>> I started to do it myself, but quickly realized I have no knowledge of
>>> the content.  I'm afraid I would destroy the meaning while updating the
>>> grammar.
>>>
>>> Anyone understand the subject matter well enough to review the
>>> documentation?
>>
>> Liudmila tried to improve docs in Alexander's patchset.
>>
>> https://www.postgresql.org/message-id/f43b242d-000c-f4c8-cb8b-d37e9752cd93@postgrespro.ru
> 
> This looks good to me with a few minor exceptions:
> 
> +   <function>word_similarity(text, text)</function> requires further
> +   explanation. Consider the following example:
> 
> Maybe too verbose?  I think "<function>word_similarity(text,
> text)</function> requires further explanation." can be removed entirely.
> 
> +   string.  However, this function does not add paddings to the
> 
> "add padding"
> 
>> BTW, adding Liudmila's message to commitfest task
>> (https://commitfest.postgresql.org/17/1403/) doesn't work
> 
> Doesn't work for me either.
> 
> Alexander, can you post the final patches to the thread so they show up
> in the CF app?
> 
> Thanks,
> 

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

От

Liudmila Mantrova

Дата:

16 апреля 2018 г., 22:48:47

Hi everyone,

When translating doc updates, Alexander Lakhin noticed that trigram 
examples were not quite accurate.
A small patch fixing this issue is attached.


On 03/21/2018 03:35 PM, Teodor Sigaev wrote:
> Thank you, pushed
>
> David Steele wrote:
>> On 3/6/18 7:04 AM, Teodor Sigaev wrote:
>>>> I agree with Teodor (upthread, not quoted here) that the documentation
>>>> could use some editing.
>>>>
>>>> I started to do it myself, but quickly realized I have no knowledge of
>>>> the content.  I'm afraid I would destroy the meaning while updating 
>>>> the
>>>> grammar.
>>>>
>>>> Anyone understand the subject matter well enough to review the
>>>> documentation?
>>>
>>> Liudmila tried to improve docs in Alexander's patchset.
>>>
>>> https://www.postgresql.org/message-id/f43b242d-000c-f4c8-cb8b-d37e9752cd93@postgrespro.ru 
>>>
>>
>> This looks good to me with a few minor exceptions:
>>
>> +   <function>word_similarity(text, text)</function> requires further
>> +   explanation. Consider the following example:
>>
>> Maybe too verbose?  I think "<function>word_similarity(text,
>> text)</function> requires further explanation." can be removed entirely.
>>
>> +   string.  However, this function does not add paddings to the
>>
>> "add padding"
>>
>>> BTW, adding Liudmila's message to commitfest task
>>> (https://commitfest.postgresql.org/17/1403/) doesn't work
>>
>> Doesn't work for me either.
>>
>> Alexander, can you post the final patches to the thread so they show up
>> in the CF app?
>>
>> Thanks,
>>
>

-- 
Liudmila Mantrova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Вложения

pg-trgm-docfix.patch

Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

От

Liudmila Mantrova

Дата:

16 апреля 2018 г., 22:48:47

Hi everyone,

When translating doc updates, Alexander Lakhin noticed that trigram 
examples were not quite accurate.
A small patch fixing this issue is attached.


On 03/21/2018 03:35 PM, Teodor Sigaev wrote:
> Thank you, pushed
>
> David Steele wrote:
>> On 3/6/18 7:04 AM, Teodor Sigaev wrote:
>>>> I agree with Teodor (upthread, not quoted here) that the documentation
>>>> could use some editing.
>>>>
>>>> I started to do it myself, but quickly realized I have no knowledge of
>>>> the content.  I'm afraid I would destroy the meaning while updating 
>>>> the
>>>> grammar.
>>>>
>>>> Anyone understand the subject matter well enough to review the
>>>> documentation?
>>>
>>> Liudmila tried to improve docs in Alexander's patchset.
>>>
>>> https://www.postgresql.org/message-id/f43b242d-000c-f4c8-cb8b-d37e9752cd93@postgrespro.ru 
>>>
>>
>> This looks good to me with a few minor exceptions:
>>
>> +   <function>word_similarity(text, text)</function> requires further
>> +   explanation. Consider the following example:
>>
>> Maybe too verbose?  I think "<function>word_similarity(text,
>> text)</function> requires further explanation." can be removed entirely.
>>
>> +   string.  However, this function does not add paddings to the
>>
>> "add padding"
>>
>>> BTW, adding Liudmila's message to commitfest task
>>> (https://commitfest.postgresql.org/17/1403/) doesn't work
>>
>> Doesn't work for me either.
>>
>> Alexander, can you post the final patches to the thread so they show up
>> in the CF app?
>>
>> Thanks,
>>
>

-- 
Liudmila Mantrova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Вложения

pg-trgm-docfix.patch

Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

От

Bruce Momjian

Дата:

27 апреля 2018 г., 00:57:23

On Mon, Apr 16, 2018 at 07:48:47PM +0300, Liudmila Mantrova wrote:
> Hi everyone,
> 
> When translating doc updates, Alexander Lakhin noticed that trigram examples
> were not quite accurate.
> A small patch fixing this issue is attached.

FYI, this has been applied by Teodor Sigaev:

   https://git.postgresql.org/pg/commitdiff/9975c128a1d1bd7e7366adf133b21540a2bc2450

---------------------------------------------------------------------------


> 
> 
> On 03/21/2018 03:35 PM, Teodor Sigaev wrote:
> >Thank you, pushed
> >
> >David Steele wrote:
> >>On 3/6/18 7:04 AM, Teodor Sigaev wrote:
> >>>>I agree with Teodor (upthread, not quoted here) that the documentation
> >>>>could use some editing.
> >>>>
> >>>>I started to do it myself, but quickly realized I have no knowledge of
> >>>>the content.  I'm afraid I would destroy the meaning while updating
> >>>>the
> >>>>grammar.
> >>>>
> >>>>Anyone understand the subject matter well enough to review the
> >>>>documentation?
> >>>
> >>>Liudmila tried to improve docs in Alexander's patchset.
> >>>
> >>>https://www.postgresql.org/message-id/f43b242d-000c-f4c8-cb8b-d37e9752cd93@postgrespro.ru
> >>>
> >>
> >>This looks good to me with a few minor exceptions:
> >>
> >>+   <function>word_similarity(text, text)</function> requires further
> >>+   explanation. Consider the following example:
> >>
> >>Maybe too verbose?  I think "<function>word_similarity(text,
> >>text)</function> requires further explanation." can be removed entirely.
> >>
> >>+   string.  However, this function does not add paddings to the
> >>
> >>"add padding"
> >>
> >>>BTW, adding Liudmila's message to commitfest task
> >>>(https://commitfest.postgresql.org/17/1403/) doesn't work
> >>
> >>Doesn't work for me either.
> >>
> >>Alexander, can you post the final patches to the thread so they show up
> >>in the CF app?
> >>
> >>Thanks,
> >>
> >
> 
> -- 
> Liudmila Mantrova
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
> 

> diff --git a/doc/src/sgml/pgtrgm.sgml b/doc/src/sgml/pgtrgm.sgml
> index 8f39529..be43cdf 100644
> --- a/doc/src/sgml/pgtrgm.sgml
> +++ b/doc/src/sgml/pgtrgm.sgml
> @@ -152,9 +152,9 @@
>  </programlisting>
>  
>     In the first string, the set of trigrams is
> -   <literal>{"  w"," wo","ord","wor","rd "}</literal>.
> +   <literal>{"  w"," wo","wor","ord","rd "}</literal>.
>     In the second string, the ordered set of trigrams is
> -   <literal>{"  t"," tw",two,"wo ","  w"," wo","wor","ord","rds", ds "}</literal>.
> +   <literal>{"  t"," tw","two","wo ","  w"," wo","wor","ord","rds","ds "}</literal>.
>     The most similar extent of an ordered set of trigrams in the second string
>     is <literal>{"  w"," wo","wor","ord"}</literal>, and the similarity is
>     <literal>0.8</literal>.
> @@ -172,7 +172,7 @@
>     At the same time, <function>strict_word_similarity(text, text)</function>
>     has to select an extent that matches word boundaries.  In the example above,
>     <function>strict_word_similarity(text, text)</function> would select the
> -   extent <literal>{"  w"," wo","wor","ord","rds", ds "}</literal>, which
> +   extent <literal>{"  w"," wo","wor","ord","rds","ds "}</literal>, which
>     corresponds to the whole word <literal>'words'</literal>.
>  
>  <programlisting>


-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +

Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

От

Bruce Momjian

Дата:

27 апреля 2018 г., 00:57:23

On Mon, Apr 16, 2018 at 07:48:47PM +0300, Liudmila Mantrova wrote:
> Hi everyone,
> 
> When translating doc updates, Alexander Lakhin noticed that trigram examples
> were not quite accurate.
> A small patch fixing this issue is attached.

FYI, this has been applied by Teodor Sigaev:

   https://git.postgresql.org/pg/commitdiff/9975c128a1d1bd7e7366adf133b21540a2bc2450

---------------------------------------------------------------------------


> 
> 
> On 03/21/2018 03:35 PM, Teodor Sigaev wrote:
> >Thank you, pushed
> >
> >David Steele wrote:
> >>On 3/6/18 7:04 AM, Teodor Sigaev wrote:
> >>>>I agree with Teodor (upthread, not quoted here) that the documentation
> >>>>could use some editing.
> >>>>
> >>>>I started to do it myself, but quickly realized I have no knowledge of
> >>>>the content.  I'm afraid I would destroy the meaning while updating
> >>>>the
> >>>>grammar.
> >>>>
> >>>>Anyone understand the subject matter well enough to review the
> >>>>documentation?
> >>>
> >>>Liudmila tried to improve docs in Alexander's patchset.
> >>>
> >>>https://www.postgresql.org/message-id/f43b242d-000c-f4c8-cb8b-d37e9752cd93@postgrespro.ru
> >>>
> >>
> >>This looks good to me with a few minor exceptions:
> >>
> >>+   <function>word_similarity(text, text)</function> requires further
> >>+   explanation. Consider the following example:
> >>
> >>Maybe too verbose?  I think "<function>word_similarity(text,
> >>text)</function> requires further explanation." can be removed entirely.
> >>
> >>+   string.  However, this function does not add paddings to the
> >>
> >>"add padding"
> >>
> >>>BTW, adding Liudmila's message to commitfest task
> >>>(https://commitfest.postgresql.org/17/1403/) doesn't work
> >>
> >>Doesn't work for me either.
> >>
> >>Alexander, can you post the final patches to the thread so they show up
> >>in the CF app?
> >>
> >>Thanks,
> >>
> >
> 
> -- 
> Liudmila Mantrova
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
> 

> diff --git a/doc/src/sgml/pgtrgm.sgml b/doc/src/sgml/pgtrgm.sgml
> index 8f39529..be43cdf 100644
> --- a/doc/src/sgml/pgtrgm.sgml
> +++ b/doc/src/sgml/pgtrgm.sgml
> @@ -152,9 +152,9 @@
>  </programlisting>
>  
>     In the first string, the set of trigrams is
> -   <literal>{"  w"," wo","ord","wor","rd "}</literal>.
> +   <literal>{"  w"," wo","wor","ord","rd "}</literal>.
>     In the second string, the ordered set of trigrams is
> -   <literal>{"  t"," tw",two,"wo ","  w"," wo","wor","ord","rds", ds "}</literal>.
> +   <literal>{"  t"," tw","two","wo ","  w"," wo","wor","ord","rds","ds "}</literal>.
>     The most similar extent of an ordered set of trigrams in the second string
>     is <literal>{"  w"," wo","wor","ord"}</literal>, and the similarity is
>     <literal>0.8</literal>.
> @@ -172,7 +172,7 @@
>     At the same time, <function>strict_word_similarity(text, text)</function>
>     has to select an extent that matches word boundaries.  In the example above,
>     <function>strict_word_similarity(text, text)</function> would select the
> -   extent <literal>{"  w"," wo","wor","ord","rds", ds "}</literal>, which
> +   extent <literal>{"  w"," wo","wor","ord","rds","ds "}</literal>, which
>     corresponds to the whole word <literal>'words'</literal>.
>  
>  <programlisting>


-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: [BUGS] pg_trgm word_similarity inconsistencies or bug

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения