Re: [GENERAL] full text search on hstore or json with materialized view?

From: Rj Ewing
Subject: Re: [GENERAL] full text search on hstore or json with materialized view?
Date:
Msg-id: CAOSSsV3PQaGPLGnXnv6QL0t4TrSSX_7UOk1OXZbigWy9v6Wx5w@mail.gmail.com
In reply to: Re: [GENERAL] full text search on hstore or json with materialized view?  (George Neuner <gneuner2@comcast.net>)
List: pgsql-general


On Wed, Apr 19, 2017 at 9:55 PM, George Neuner <gneuner2@comcast.net> wrote:
On Wed, 19 Apr 2017 16:28:13 -0700, Rj Ewing <ewing.rj@gmail.com>
wrote:

>okay, messing around a bit more with the secondary k,v table it seems like
>this could be a good solution..
>
>I created a keys table to hold the 63 key values, then I dropped and
>recreated the secondary table, using a FK referencing the keys table. I'm
>not really sure why, but a basic full text query on 44 million rows is
>taking approx. 20 ms.

That pretty much confirms your statistics were bad ... using the FK
table or not wouldn't make any difference to the planner.

But if you are getting 20ms on 44M rows, then one or more of the
following must be true:
 - your text values must be very short
 - your FTS queries must be very simple
 - you aren't reading the results

The text is on average very short: 1-3 words per value. It was a count(*) query with only a single condition.
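For reference, the test query was roughly of this shape (a sketch only; the search term is a placeholder, not the actual value):

  SELECT count(*)
    FROM samples_lg_txt
    WHERE tsv @@ to_tsquery('some_word');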


For comparison:

I have an application that does FTS on a table of NAICS descriptions
indexed using tsvectors with an average length of 4.8 tokens each.  It
does a 3-part All/Any/None term search.

On my 24-core 2.4GHz server, a single threaded query with the whole
table and index in memory takes ~1 ms to search 20K rows using a
realistic tsquery:  e.g.,

  SELECT code,description
    FROM naics
    WHERE ts_index @@ to_tsquery('packaged & software & !(wholesale)')

[getting the data out of Postgresql takes longer than the search]
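To make the All/Any/None structure concrete: the three parts map onto &, |, and ! groups in the tsquery. A sketch with placeholder terms (not the actual application query):

  SELECT code,description
    FROM naics
    WHERE ts_index @@ to_tsquery(
            '(packaged & software) & (retail | custom) & !(wholesale | resale)')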


GIN indexes don't exactly scale linearly, and tsquery is, in general,
much more dependent on the lengths of the tsvectors than on the
complexity of the match, but with 44M rows of similarly distributed
data, a similarly realistic query would be expected to take well over
1 second.


My example is genuine but too small to bother parallelizing [mentioned
in a previous message].  Since you are *testing* with 1M records (that
create 44M k:v shards), I am assuming you will need to deal with much
more than that in deployment.  And if you think you need FTS, then you
must be expecting more than simple word matches [as below], else you
might do something simpler like

  SELECT ...
    WHERE val ILIKE <value>

The 1M records would most likely be the max; on average, the tables would have more like 100,000 records each.

I also realized that I inserted all k:v pairs into the secondary k:v table. In reality, I would only index strings, which would eliminate approx. 25% of the k:v pairs, bringing that number down closer to 30M.

From my understanding, ILIKE doesn't do any text normalization, which is something we would like to have.
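For example (assuming the default 'english' configuration), full text search stems word forms, while ILIKE only ignores case:

  -- ILIKE is case-insensitive pattern matching only; no stemming:
  SELECT 'packages' ILIKE 'package';                        -- false

  -- to_tsvector/to_tsquery normalize (stem) words, so forms match:
  SELECT to_tsvector('english', 'packages')
         @@ to_tsquery('english', 'package');               -- true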


>my table structure is:
>
> Table "public.samples_lg_txt"
> Column |   Type   | Modifiers
>--------+----------+-----------
> id     | integer  |
> key    | integer  |
> val    | text     |
> tsv    | tsvector |
>Indexes:
>    "idx_tsv_samples_lg_text" gin (tsv)
>Foreign-key constraints:
>    "samples_lg_txt_id_fkey" FOREIGN KEY (id) REFERENCES samples_lg(id)
>                                ON DELETE CASCADE
>    "samples_lg_txt_key_fkey" FOREIGN KEY (key) REFERENCES keys(id)
>
>
>how would I write an AND query that filtered on 2 separate keys from the
>samples_lg_txt table?
>
>something like:
>
>SELECT COUNT(*) FROM samples WHERE id IN ( SELECT DISTINCT(s.id) FROM
>samples_lg_txt s JOIN keys k ON s.key = k.id WHERE (name = 'key1' AND tsv
>@@ to_tsquery('value1')) AND (name = 'key2' AND tsv @@
>to_tsquery('value2'));

You're overthinking it.

  SELECT count(distinct s.id)
    FROM  samples_lg_txt AS s
    JOIN  keys AS k ON k.id = s.key
    WHERE (k.name = 'key1' AND s.tsv @@ to_tsquery('value1'))
       OR (k.name = 'key2' AND s.tsv @@ to_tsquery('value2'));

But that is an OR query; I'm trying to do an AND query. (One way to express the AND is sketched below.)
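A sketch of one way to get AND semantics with this layout (using the samples_lg_txt and keys tables above): match either condition, group per sample id, and require both keys to have matched.

  SELECT count(*)
    FROM (SELECT s.id
            FROM samples_lg_txt AS s
            JOIN keys AS k ON k.id = s.key
            WHERE (k.name = 'key1' AND s.tsv @@ to_tsquery('value1'))
               OR (k.name = 'key2' AND s.tsv @@ to_tsquery('value2'))
            GROUP BY s.id
            HAVING count(DISTINCT k.name) = 2   -- both keys matched
         ) AS matched;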
There's actually no need to join if you can use the key name instead
of an integer id.  You can FK on strings, so you can still maintain an
identity table of keys.  E.g.,

> id     | integer     |
> key    | varchar(32) | FK keys(name) ...
> val    | text        |
> tsv    | tsvector    |


Then the query could be just

  SELECT count(distinct id)
    FROM  samples_lg_txt
    WHERE (key = 'key1' AND tsv @@ to_tsquery('value1'))
       OR (key = 'key2' AND tsv @@ to_tsquery('value2'));


Just a reminder [it's late here 8-)]: FK columns contain values - not
weird references to the foreign tables.  The constraint just enforces
that any value inserted/updated into the FK column matches an existing
value in the relevant foreign table.
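For concreteness, a sketch of the string-keyed layout described above (table and column names taken from the thread; the constraint details are assumed, not anyone's actual schema):

  CREATE TABLE keys (
    name varchar(32) PRIMARY KEY
  );

  CREATE TABLE samples_lg_txt (
    id  integer REFERENCES samples_lg(id) ON DELETE CASCADE,
    key varchar(32) REFERENCES keys(name),
    val text,
    tsv tsvector
  );

  CREATE INDEX idx_tsv_samples_lg_text ON samples_lg_txt USING gin (tsv);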

This would make queries simpler. I guess a disadvantage to using a string for the key is that the database size would be larger, and thus less likely to fit the entire table in RAM. If there are only 63 keys across 44M rows, it seems that storing a smallint would take less space than storing the string.

I don't really have a need for the identity table of keys. Its only purpose was to shrink the database size.

Thanks again for the detailed responses!
