Thread: [GENERAL] Full Text Search combined with Fuzzy

[GENERAL] Full Text Search combined with Fuzzy

From: Nicolas Paris
Date:
Hello,

AFAIK there is no built-in way to combine full text search and fuzzy matching
(https://www.postgresql.org/docs/current/static/fuzzystrmatch.html).
For example, phrase searching with typos in it.

First, I don't know whether PostgreSQL's competitors (Lucene-based engines...) are able
to do so.

Second, is such a feature on the roadmap?

Third, I wonder if it is a good idea to use the PostgreSQL synonym
feature for such a purpose (https://www.postgresql.org/docs/current/static/textsearch-dictionaries.html).
I mean, building up a synonym dictionary containing typos, e.g.:

postgres        pgsql
postgresql      pgsql
postgrez        pgsql
postgre         pgsql
gogle           googl
gooogle         googl

There are multiple ways to build such a dictionary. But my question is about
the implementation of dictionaries in PostgreSQL: is PostgreSQL
designed to take advantage of dictionaries with billions of entries?
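
For context, a dictionary of that kind would be wired into a text search
configuration roughly as below. This is only a sketch: the dictionary name
typo_syn, the configuration typo_fts, and the file typo_synonyms.syn (placed in
$SHAREDIR/tsearch_data/, one "misspelling canonical-form" pair per line as in
the list above) are made up for illustration.

-- create a synonym dictionary backed by typo_synonyms.syn
CREATE TEXT SEARCH DICTIONARY typo_syn (
    TEMPLATE = synonym,
    SYNONYMS = typo_synonyms
);

-- put it in front of the stemmer in a copy of the english configuration
CREATE TEXT SEARCH CONFIGURATION typo_fts (COPY = english);
ALTER TEXT SEARCH CONFIGURATION typo_fts
    ALTER MAPPING FOR asciiword WITH typo_syn, english_stem;

-- 'postgrez' should now produce the same lexeme as 'pgsql'
SELECT to_tsvector('typo_fts', 'postgrez');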

Thanks in advance for your answers,


Re: [GENERAL] Full Text Search combined with Fuzzy

From: Oleg Bartunov
Date:


On Sun, Feb 26, 2017 at 3:52 PM, Nicolas Paris <niparisco@gmail.com> wrote:
> Hello,
>
> AFAIK there is no built-in way to combine full text search and fuzzy matching
> (https://www.postgresql.org/docs/current/static/fuzzystrmatch.html).
> For example, phrase searching with typos in it.
>
> First, I don't know whether PostgreSQL's competitors (Lucene-based engines...) are able
> to do so.

Usually, https://www.postgresql.org/docs/current/static/pgtrgm.html is used for
this.

> Second, is such a feature on the roadmap?
>
> Third, I wonder if it is a good idea to use the PostgreSQL synonym
> feature for such a purpose (https://www.postgresql.org/docs/current/static/textsearch-dictionaries.html).
> I mean, building up a synonym dictionary containing typos, e.g.:
>
> postgres        pgsql
> postgresql      pgsql
> postgrez        pgsql
> postgre         pgsql
> gogle           googl
> gooogle         googl
>
> There are multiple ways to build such a dictionary. But my question is about
> the implementation of dictionaries in PostgreSQL: is PostgreSQL
> designed to take advantage of dictionaries with billions of entries?

A dictionary is just a program, so it's up to the developer to write an efficient program that can deal with a billion entries. The synonym dictionary specifically is not intended to work with a large number of entries. BTW, have a look at the contrib/dict_xsyn dictionary, which is more flexible than synonym.
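
For example, a minimal sketch using the sample rules file that the module
installs (a real setup would point RULES at its own file in
$SHAREDIR/tsearch_data/):

CREATE EXTENSION dict_xsyn;

-- each rules-file line lists a word followed by its synonyms: "word syn1 syn2 ..."
CREATE TEXT SEARCH DICTIONARY xsyn (
    TEMPLATE = xsyn_template,
    RULES = xsyn_sample,
    KEEPORIG = false
);

-- returns the synonyms defined for 'word' in the sample rules file
SELECT ts_lexize('xsyn', 'word');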

> Thanks in advance for your answers,



Re: [GENERAL] Full Text Search combined with Fuzzy

From: Nicolas Paris
Date:
On 27 Feb 2017 at 10:32, Oleg Bartunov wrote:
>
>
> On Sun, Feb 26, 2017 at 3:52 PM, Nicolas Paris <niparisco@gmail.com> wrote:
>
>     Hello,
>
>     AFAIK there is no built-in way to combine full text search and fuzzy
>     matching
>     (https://www.postgresql.org/docs/current/static/fuzzystrmatch.html).
>     For example, phrase searching with typos in it.
>
>     First, I don't know whether PostgreSQL's competitors (Lucene-based engines...) are able
>     to do so.
>
>
> Usually, https://www.postgresql.org/docs/current/static/pgtrgm.html is used for
> this.
>

Hi Oleg,

Thanks. I thought pg_trgm was not able to index my long texts because of
the limitation of 8191 bytes per index row for btree.

Then I found out it is possible to use pg_trgm with a GIN/GiST index.
My final use case is phrase mining in texts.

I want my application to return texts that approximately contain the user's
entry:

E.g.: the user searches for "hello word";
a text containing "blah blah blah hello world blah blah blah" would be
returned.

Test:
postgres=# CREATE table test_trgm (texts text);
CREATE TABLE
postgres=# CREATE INDEX ON test_trgm USING GIN(texts gin_trgm_ops);
CREATE INDEX
postgres=# SET enable_seqscan = OFF;
SET
postgres=# insert into test_trgm VALUES ('blah blah blah hello world blah blah blah');
INSERT 0 1
postgres=# insert into test_trgm VALUES ('blah blah blah hello word blah blah blah');
INSERT 0 1
postgres=# SELECT texts, similarity(texts, 'hello word') FROM test_trgm WHERE texts % 'hello word';
                   texts                   | similarity
-------------------------------------------+------------
 blah blah blah hello world blah blah blah |   0.473684
 blah blah blah hello word blah blah blah  |     0.6875
(2 rows)

postgres=# EXPLAIN SELECT texts, similarity(texts, 'hello word') FROM test_trgm WHERE texts % 'hello word';
                                    QUERY PLAN
-----------------------------------------------------------------------------------
 Bitmap Heap Scan on test_trgm  (cost=52.01..56.03 rows=1 width=32)
   Recheck Cond: (texts % 'hello word'::text)
   ->  Bitmap Index Scan on test_trgm_texts_idx  (cost=0.00..52.01 rows=1 width=0)
         Index Cond: (texts % 'hello word'::text)
(4 rows)

Conclusion: if I say 0.4 is my threshold, would this methodology meet
my requirements?
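
As an aside, the cutoff that the % operator applies is the pg_trgm similarity
threshold, so the 0.4 figure above could be set explicitly. A sketch, assuming
PostgreSQL 9.6 and that the pg_trgm extension is installed in the database:

-- pg_trgm provides the % operator and the gin_trgm_ops operator class
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- on 9.6 the threshold is a GUC (the older set_limit(0.4) call also works)
SET pg_trgm.similarity_threshold = 0.4;

-- rows whose similarity falls below 0.4 no longer satisfy the % predicate
SELECT texts, similarity(texts, 'hello word')
FROM test_trgm
WHERE texts % 'hello word';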

Thanks for the help!


> [...]


Re: [GENERAL] Full Text Search combined with Fuzzy

From: Artur Zakirov
Date:
On 03.03.2017 15:49, Nicolas Paris wrote:
> [...]
>
> Conclusion: if I say 0.4 is my threshold, would this methodology meet
> my requirements?
>
> Thanks for the help!
>

Hello,

If you use PostgreSQL 9.6, then word_similarity() can help you [1]. For
example:

postgres=# SELECT texts, word_similarity('hello word', texts) FROM
test_trgm WHERE 'hello word' <% texts;
                    texts                   | word_similarity
-------------------------------------------+-----------------
  blah blah blah hello world blah blah blah |        0.818182
  blah blah blah hello word blah blah blah  |               1
(2 rows)

1. https://www.postgresql.org/docs/9.6/static/pgtrgm.html
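
As a side note, the <% operator has its own cutoff, pg_trgm.word_similarity_threshold
(default 0.6), separate from the plain % threshold. A small sketch against the
same test_trgm table, assuming 9.6:

-- raising the threshold makes <% stricter
SET pg_trgm.word_similarity_threshold = 0.8;

-- the existing GIN index with gin_trgm_ops can also serve <% queries
SELECT texts, word_similarity('hello word', texts)
FROM test_trgm
WHERE 'hello word' <% texts;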

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [GENERAL] Full Text Search combined with Fuzzy

From: Nicolas Paris
Date:
On 3 March 2017 at 14:08, Artur Zakirov wrote:
> On 03.03.2017 15:49, Nicolas Paris wrote:
> > [...]
>
> Hello,
>
> If you use PostgreSQL 9.6, then word_similarity() can help you [1]. For
> example:
>
> postgres=# SELECT texts, word_similarity('hello word', texts) FROM test_trgm
> WHERE 'hello word' <% texts;
>                    texts                   | word_similarity
> -------------------------------------------+-----------------
>  blah blah blah hello world blah blah blah |        0.818182
>  blah blah blah hello word blah blah blah  |               1
> (2 rows)
>
> 1. https://www.postgresql.org/docs/9.6/static/pgtrgm.html
>

Nice! I do have version 9.6.

Could this kind of index handle more than 20M large texts? The
recheck condition looks resource-consuming.

Is the full text index + phrase search + synonym dictionary the only
other alternative for dealing with typo-tolerant phrase mining?

Is there any possibility of adding typo tolerance to the full text search
roadmap in the future?

Thanks,

> --
> Artur Zakirov
> Postgres Professional: http://www.postgrespro.com
> Russian Postgres Company


Re: [GENERAL] Full Text Search combined with Fuzzy

From: Artur Zakirov
Date:
On 03.03.2017 16:17, Nicolas Paris wrote:
> Nice! I do have version 9.6.
>
> Could this kind of index handle more than 20M large texts? The
> recheck condition looks resource-consuming.

You are right. I think pg_trgm will not be good for such large texts,
unfortunately.

>
> Is the full text index + phrase search + synonym dictionary the only
> other alternative for dealing with typo-tolerant phrase mining?

I suppose there are no other options now. Though, prefix search may
help you [1].
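
A minimal sketch of what prefix matching looks like in a tsquery (the :* marker),
in case it fits the use case:

-- 'post:*' matches any lexeme beginning with "post", e.g. postgres, postgresql
SELECT to_tsvector('simple', 'blah blah postgresql blah')
       @@ to_tsquery('simple', 'post:*');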

>
> Is there any possibility of adding typo tolerance to the full text search
> roadmap in the future?

As far as I know, there are no plans in the near future to add similarity-based
full text search.

1. https://www.postgresql.org/docs/current/static/textsearch-dictionaries.html

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company