Re: [GENERAL] full text search on hstore or json with materialized view?

Поиск
Список
Период
Сортировка
От George Neuner
Тема Re: [GENERAL] full text search on hstore or json with materialized view?
Дата
Msg-id 9qtffchto9c7p5o6oanqrp62ks1jruegf5@4ax.com
обсуждение исходный текст
Ответ на [GENERAL] full text search on hstore or json with materialized view?  (Rj Ewing <ewing.rj@gmail.com>)
Ответы Re: [GENERAL] full text search on hstore or json with materialized view?  (Rj Ewing <ewing.rj@gmail.com>)
Список pgsql-general

Please don't top post.
https://en.wikipedia.org/wiki/Posting_style#Placement_of_replies
https://en.wikipedia.org/wiki/Posting_style#Choosing_the_proper_posting_style



>> On Tue, 18 Apr 2017 14:38:15 -0700, Rj Ewing <ewing.rj@gmail.com>
>> wrote:
>>
>> > :
>> >An idea that has come up is to use a materialized view or secondary table
>> >with triggers, where we would have 3 columns (id, key, value).
>> >
>> >I think this would allow us to store a tsvector and gin index. Giving us
>> >the ability to use fulltext search on k:v pairs, then join the original
>> >data on the id field to return the entire record.
>> > :


On Wed, 19 Apr 2017 11:57:26 -0700, Rj Ewing <ewing.rj@gmail.com>
wrote:

>I did some testing using a secondary table with the key, value column.
>However I don't think this will provide the performance that we need.
>Queries we taking 60+ seconds just for a count.

SELECT count(*) or filtered?

Either way, your statistics may be way off.  Did you vacuum analyze
the table after the inserts (and the index creation if it was done
separately)?


>With 1 million rows in the primary table, this resulted in 44 million rows
>in the secondary k,v table for full text searching. The same query is es
>takes ~50 ms on my local machine with 1/10th the ram allocated to es then
>was allocated to psql.
>
>I'm gonna test using trigrams indexes on approx 10 json fields, and see if
>that gives us what we are looking for.
>
>any thought on getting sub 1 sec queries on a table with 44 million rows?
>
>RJ

Based on your description of the data [at top], I rather doubt
trigrams will be an improvement over tsvector.  And they're more
cumbersome to use if you don't need better similarity matching than
what tsvector offers [which itself is somewhat adjustable via
dictionaries].

Without more detail re: your hardware, Postgresql version, what
indexes are/will be available, the types of queries you want to run,
etc., it's very hard to give really meaningful suggestions.


The kind of query you have alluded to is pretty easily parallelized:
it can be spread over multiple sessions with result aggregation done
on the client side.

Or, if you you have 9.6, you might try using backend parallelism:
https://www.postgresql.org/docs/9.6/static/parallel-query.html
[I've not used this, but some people have done it successfully.]


If you can restrict the FTS query to certain keys:

  SELECT id FROM mytable
    WHERE tsquery( ... ) @@ to_tsvector(v)
    AND k IN ( ... )
    GROUP BY id

  [note: according to David Rowley, GROUP BY may be parallelized
         whereas  DISTINCT currently cannot be.]

then given an index on 'k' it may be much faster than just the FTS
query alone.  Subject to key variability, it also may be improved by
table partitioning to reduce the search space.

If the FTS query is key restricted, you can parallelize either on the
client or on the server.  If the FTS query is not key restricted, you
pretty much are limited to server side (and 9.6 or later).


And I'm out of suggestions for now.

Parallel query is your best bet for maximum performance, but unless
you have enough RAM to hold the entire table and its indexes, and all
the query workspaces, then I doubt you will be able to get anywhere
near your optimistic execution target for FTS on 40+ million rows.


YMMV,
George

В списке pgsql-general по дате отправления:

Предыдущее
От: Ken Tanzer
Дата:
Сообщение: Re: [GENERAL] referential integrity between elements of an array andanother table?
Следующее
От: Samuel Williams
Дата:
Сообщение: Re: [GENERAL] Large data and slow queries