Re: [GENERAL] full text search on hstore or json with materialized view?

Поиск
Список
Период
Сортировка
От George Neuner
Тема Re: [GENERAL] full text search on hstore or json with materialized view?
Дата
Msg-id e76gfch5qnkvm7j7ov313f3q3nr0kjfpkv@4ax.com
обсуждение исходный текст
Ответ на [GENERAL] full text search on hstore or json with materialized view?  (Rj Ewing <ewing.rj@gmail.com>)
Ответы Re: [GENERAL] full text search on hstore or json with materialized view?
Список pgsql-general
On Wed, 19 Apr 2017 16:28:13 -0700, Rj Ewing <ewing.rj@gmail.com>
wrote:

>okay, messing around a bit more with the secondary k,v table it seems like
>this could be a good solution..
>
>I created a keys table to hold the 63 key values, then I dropped and
>recreated the secondary table, using a FK referencing the keys table. I'm
>not really sure why, but a basic full text query on 44 million row is
>taking aproxx. 20ms.

That pretty much confirms your statistics were bad ... using the FK
table or not wouldn't make any difference to the planner.

But if you are getting 20ms on 44M rows, then one or more of the
following must be true:
 - your text values must be very short
 - your FTS queries must be very simple
 - you aren't reading the results


For comparison:

I have an application that does FTS on a table of NAICS descriptions
indexed using tsvectors with an average length of 4.8 tokens per.  It
does a 3-part All/Any/None term search.

On my 24-core 2.4GHz server, a single threaded query with the whole
table and index in memory takes ~1 ms to search 20K rows using a
realistic tsquery:  e.g.,

  SELECT code,description
    FROM naics
    WHERE ts_index @@ to_tsquery('packaged & software & !(wholesale)')

[getting the data out of Postgresql takes longer than the search]


GIN indexes don't exactly scale linearly, and tsquery is, in general,
much more dependent on the lengths of the tsvectors than on the
complexity of the match, but with 44M rows of similarly distributed
data, a similarly realistic query would be expected to take well over
1 second.


My example is genuine but too small to bother parallelizing [mentioned
in a previous message].  Since you are *testing* with 1M records (that
create 44M k:v shards), I am assuming you will need to deal with much
more than that in deployment.  And if you think you need FTS, then you
must be expecting more than simple word matches [as below], else you
might do something simpler like

  SELECT ...
    WHERE val ILIKE <value>



>my table structure is:
>
> Table "public.samples_lg_txt"
> Column |   Type   | Modifiers
>--------+----------+-----------
> id     | integer  |
> key    | integer  |
> val    | text     |
> tsv    | tsvector |
>Indexes:
>    "idx_tsv_samples_lg_text" gin (tsv)
>Foreign-key constraints:
>    "samples_lg_txt_id_fkey" FOREIGN KEY (id) REFERENCES samples_lg(id)
>                                ON DELETE CASCADE
>    "samples_lg_txt_key_fkey" FOREIGN KEY (key) REFERENCES keys(id)
>
>
>how would I write an AND query that filtered on 2 separate keys from the
>samples_lg_txt table?
>
>something like:
>
>SELECT COUNT(*) FROM samples WHERE id IN ( SELECT DISTINCT(s.id) FROM
>samples_lg_txt s JOIN keys k ON s.key = k.id WHERE (*name = 'key1' AND tsv
>@@ to_tsquery('value1')) AND (name = 'key2' AND tsv @@
>to_tsquery('value2'))*;

You're overthinking it

  SELECT count(distinct s.id)
    FROM  samples_lg_txt AS s
    JOIN  keys AS k ON k.id = s.key
    WHERE (k.name = 'key1' AND s.tsv @@ to_query('value1')
       OR (k.name = 'key2' AND s.tsv @@ to_query('value2')


There's actually no need to join if you can use the key name instead
of an integer id.  You can FK on strings, so you can still maintain an
identity table of keys.  E.g.,

> id     | integer   |
> key    | vchar(32) | FK key(name) ...
> val    | text      |
> tsv    | tsvector  |


Then the query could be just

  SELECT count(distinct id)
    FROM  samples_lg_txt
    WHERE (key = 'key1' AND tsv @@ to_query('value1')
       OR (key = 'key2' AND tsv @@ to_query('value2')


Just a reminder [it's late here 8-)]: FK columns contain values - not
weird references to the foreign tables.  The constraint just enforces
that any value inserted/updated into the FK column matches an existing
value in the relevant foreign table.


George

В списке pgsql-general по дате отправления:

Предыдущее
От: Samuel Williams
Дата:
Сообщение: [GENERAL] cluster on brin indexes?
Следующее
От: George Neuner
Дата:
Сообщение: Re: [GENERAL] full text search on hstore or json with materialized view?