Обсуждение: PostgreSQL GIN index not used when ts_query language is fetched from a column

Поиск
Список
Период
Сортировка

PostgreSQL GIN index not used when ts_query language is fetched from a column

От
Jaap Roes
Дата:
I've got a table that stores some multilingual content:

    CREATE TABLE search (
      content text NOT NULL,
      language regconfig NOT NULL,
      fulltext tsvector
    );
    CREATE INDEX search_fulltext ON search USING GIN(fulltext);

    INSERT INTO search (language, content) VALUES
      ('dutch', 'Als achter vliegen vliegen vliegen vliegen vliegen vliegen achterna'),
      ('dutch', 'Langs de koele kali liep een kale koeli met een kilo kali op zijn kale koeli-kop.'),
      ('dutch', 'Moeder sneed zeven scheve sneden brood'),
      ('english', 'I saw Susie sitting in a shoe shine shop. Where she sits she shines, and where she shines she
sits.'),
      ('english', 'How can a clam cram in a clean cream can?'),
      ('english', 'Can you can a can as a canner can can a can?');

    UPDATE search SET fulltext = to_tsvector(language, content);

To make sure I always search in the correct language I use these queries:

    SELECT FROM search WHERE fulltext @@ to_tsquery(language, 'shine');
    (1 row)

    SELECT FROM search WHERE fulltext @@ to_tsquery(language, 'vlieg');
    (1 row)

Because hardcoding the language doesn't give the correct results:

    SELECT FROM search WHERE fulltext @@ to_tsquery('dutch', 'shine');
    (0 rows)

    SELECT FROM search WHERE fulltext @@ to_tsquery('english', 'vlieg');
    (0 rows)

The problem however is that PostgreSQL doesn't use the GIN index when using the first set of queries and instead does a
sequentialscan: 

(Note: I've disabled scanning using SET enable_seqscan = OFF; for these examples because of the low amount of rows)

    EXPLAIN ANALYZE SELECT * FROM search WHERE fulltext @@ to_tsquery(language, 'shine’);

    Seq Scan on search  (cost=0.00..17.35 rows=2 width=136) (actual time=0.040..0.044 rows=1 loops=1)
        Filter: (fulltext @@ to_tsquery(language, 'shine'::text))
        Rows Removed by Filter: 5
    Planning time: 0.039 ms
    Execution time: 0.064 ms

While it does when hardcoding a language:

    EXPLAIN ANALYZE SELECT FROM search WHERE fulltext @@ to_tsquery('dutch', 'vlieg');
    Bitmap Heap Scan on search  (cost=12.63..23.66 rows=82 width=0) (actual time=0.044..0.044 rows=1 loops=1)
      Recheck Cond: (fulltext @@ '''vlieg'''::tsquery)
      Heap Blocks: exact=1
        ->  Bitmap Index Scan on search_fulltext  (cost=0.00..12.61 rows=82 width=0) (actual time=0.037..0.037 rows=1
loops=1)
          Index Cond: (fulltext @@ '''vlieg'''::tsquery)
    Planning time: 0.128 ms
    Execution time: 0.065 ms

So my question is: Is it at all possible to use a column in the ts_query to use the correct language config and still
havePostgres use the GIN index? 

I’ve tried this on both PostgreSQL 9.4 and 9.5.

Obviously the real table has a lot more rows, so here's the execution plan from the real table:

Using a column for language config:

    Seq Scan on search  (cost=0.00..8727.25 rows=188 width=0) (actual time=0.725..352.307 rows=1689 loops=1)
      Filter: (fulltext @@ to_tsquery(language_config, 'example'::text))
      Rows Removed by Filter: 35928
    Planning time: 0.053 ms
    Execution time: 352.915 ms

When hardcoding the language:

    Bitmap Heap Scan on search  (cost=28.65..4088.92 rows=1633 width=0) (actual time=0.514..10.475 rows=1684 loops=1)
      Recheck Cond: (fulltext @@ '''exampl'''::tsquery)
      Heap Blocks: exact=1522
        ->  Bitmap Index Scan on search_fulltext  (cost=0.00..28.24 rows=1633 width=0) (actual time=0.333..0.333
rows=1684loops=1)         
          Index Cond: (fulltext @@ '''exampl'''::tsquery)
    Planning time: 0.180 ms
    Execution time: 10.564 ms

Note: I previously asked this on stackexhange
(http://dba.stackexchange.com/questions/149765/postgresql-gin-index-not-used-when-ts-query-language-is-fetched-from-a-column)
butI’m assuming there are more knowledgeable people on this mailing list ;-) 

Re: PostgreSQL GIN index not used when ts_query language is fetched from a column

От
Jaap Roes
Дата:
> On 15 Sep 2016, at 14:46, Jaap Roes <jroes@leukeleu.nl> wrote:
>
> I've got a table that stores some multilingual content:
>
>    CREATE TABLE search (
>      content text NOT NULL,
>      language regconfig NOT NULL,
>      fulltext tsvector
>    );
>    CREATE INDEX search_fulltext ON search USING GIN(fulltext);
...
> To make sure I always search in the correct language I use these queries:
>
>    SELECT FROM search WHERE fulltext @@ to_tsquery(language, 'shine');
>    (1 row)
>
>    SELECT FROM search WHERE fulltext @@ to_tsquery(language, 'vlieg');
>    (1 row)
>
> Because hardcoding the language doesn't give the correct results:
>
>    SELECT FROM search WHERE fulltext @@ to_tsquery('dutch', 'shine');
>    (0 rows)
>
>    SELECT FROM search WHERE fulltext @@ to_tsquery('english', 'vlieg');
>    (0 rows)
>
> The problem however is that PostgreSQL doesn't use the GIN index when using the first set of queries and instead does
asequential scan: 
...
>    EXPLAIN ANALYZE SELECT * FROM search WHERE fulltext @@ to_tsquery(language, 'shine’);
>
>    Seq Scan on search  (cost=0.00..17.35 rows=2 width=136) (actual time=0.040..0.044 rows=1 loops=1)
>        Filter: (fulltext @@ to_tsquery(language, 'shine'::text))
>        Rows Removed by Filter: 5
>    Planning time: 0.039 ms
>    Execution time: 0.064 ms
>
> While it does when hardcoding a language:
>
>    EXPLAIN ANALYZE SELECT FROM search WHERE fulltext @@ to_tsquery('dutch', 'vlieg');
>    Bitmap Heap Scan on search  (cost=12.63..23.66 rows=82 width=0) (actual time=0.044..0.044 rows=1 loops=1)
>      Recheck Cond: (fulltext @@ '''vlieg'''::tsquery)
>      Heap Blocks: exact=1
>        ->  Bitmap Index Scan on search_fulltext  (cost=0.00..12.61 rows=82 width=0) (actual time=0.037..0.037 rows=1
loops=1)
>          Index Cond: (fulltext @@ '''vlieg'''::tsquery)
>    Planning time: 0.128 ms
>    Execution time: 0.065 ms
>
> So my question is: Is it at all possible to use a column in the ts_query to use the correct language config and still
havePostgres use the GIN index? 

So I came up with a solution. Pre-localise the query in a join and use that to filter the matches:

    SELECT * FROM search s
    INNER JOIN (
        SELECT 'dutch'::regconfig AS language, to_tsquery('dutch', 'shine') as q
        UNION SELECT 'english'::regconfig AS language, to_tsquery('english', 'shine') as q
        UNION SELECT 'simple'::regconfig AS language, to_tsquery('simple', 'shine') as q
    ) q ON (s.language=q.language)
    WHERE fulltext @@ q;

This seems to work, but the query plan looks a bit confusing, so I’m not super confident about the correctness:

    Nested Loop  (cost=205.44..1327.12 rows=188 width=1590) (actual time=3.350..7.010 rows=16 loops=1)
      ->  Unique  (cost=0.08..0.11 rows=3 width=0) (actual time=0.010..0.021 rows=3 loops=1)
            ->  Sort  (cost=0.08..0.09 rows=3 width=0) (actual time=0.008..0.011 rows=3 loops=1)
                  Sort Key: ('dutch'::regconfig), ('''vlieg'''::tsquery)
                  Sort Method: quicksort  Memory: 25kB
                  ->  Append  (cost=0.00..0.06 rows=3 width=0) (actual time=0.001..0.002 rows=3 loops=1)
                        ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
                        ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
                        ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.000..0.000 rows=1 loops=1)
      ->  Bitmap Heap Scan on search s  (cost=205.36..441.70 rows=63 width=1554) (actual time=2.310..2.318 rows=5
loops=3)
            Recheck Cond: ((fulltext @@ ('''vlieg'''::tsquery)) AND ((language)::oid = (('dutch'::regconfig))::oid))
            Heap Blocks: exact=16
            ->  BitmapAnd  (cost=205.36..205.36 rows=63 width=0) (actual time=2.303..2.303 rows=0 loops=3)
                  ->  Bitmap Index Scan on search_fulltext  (cost=0.00..17.41 rows=188 width=0) (actual
time=0.018..0.018rows=16 loops=3) 
                        Index Cond: (fulltext @@ ('''vlieg'''::tsquery))
                  ->  Bitmap Index Scan on search_language  (cost=0.00..187.67 rows=12539 width=0) (actual
time=2.277..2.277rows=12539 loops=3) 
                        Index Cond: ((language_config)::oid = (('dutch'::regconfig))::oid)
    Planning time: 0.228 ms
    Execution time: 7.058 ms

Is this the way to go? Or is there a better way, I’m eager to find out!



Re: PostgreSQL GIN index not used when ts_query language is fetched from a column

От
Jeff Janes
Дата:
On Thu, Sep 15, 2016 at 5:46 AM, Jaap Roes <jroes@leukeleu.nl> wrote:
I've got a table that stores some multilingual content:

    CREATE TABLE search (
      content text NOT NULL,
      language regconfig NOT NULL,
      fulltext tsvector
    );
    CREATE INDEX search_fulltext ON search USING GIN(fulltext);

    INSERT INTO search (language, content) VALUES
      ('dutch', 'Als achter vliegen vliegen vliegen vliegen vliegen vliegen achterna'),
      ('dutch', 'Langs de koele kali liep een kale koeli met een kilo kali op zijn kale koeli-kop.'),
      ('dutch', 'Moeder sneed zeven scheve sneden brood'),
      ('english', 'I saw Susie sitting in a shoe shine shop. Where she sits she shines, and where she shines she sits.'),
      ('english', 'How can a clam cram in a clean cream can?'),
      ('english', 'Can you can a can as a canner can can a can?');

    UPDATE search SET fulltext = to_tsvector(language, content);

To make sure I always search in the correct language I use these queries:

    SELECT FROM search WHERE fulltext @@ to_tsquery(language, 'shine');
    (1 row)

I don't think that what you want is possible to do in a clean way.  It can't know what the language for a row is until it finds the row, but it can't find the row using an index until it runs to_tsquery, and it can't run to_tsquery until it knows the language.

You would probably need to run the query once for each language, and filter out the results which it found under the 'wrong' language.

SELECT * FROM search WHERE fulltext @@ to_tsquery('dutch', 'shine') and language = 'dutch'::regconfig
union all
SELECT * FROM search WHERE fulltext @@ to_tsquery('english', 'shine') and language = 'english'::regconfig;

If you have more than two languages, this will become tedious, and perhaps non-performant.  I am not aware of a more general solution, though.

It might help to build partial indexes on each language.

Cheers,

Jeff