Обсуждение: [GENERAL] full text search on hstore or json with materialized view?

Поиск
Список
Период
Сортировка

[GENERAL] full text search on hstore or json with materialized view?

От
Rj Ewing
Дата:
I am evaluating postgres for as a datastore for our webapp. We are moving away from a triple store db due to performance issues.

Our data model consists of sets of user defined attributes. Approx 10% of the attributes tend to be 100% filled with 50% of the attributes having approx 25% filled. This is fairly sparse data, and it seems that jsonb or hstore will be best for us.

Unfortunately, from my understanding, postres doesn't support fulltext search across hstore or jsonb key:values or even the entire document. While this is not a deal breaker, this would be a great feature to have. We have been experimenting w/ elasticsearch a bit, and particularly enjoy this feature, however we don't really want to involve the complexity and overhead of adding elasticsearch in front of our datasource right now.

An idea that has come up is to use a materialized view or secondary table with triggers, where we would have 3 columns (id, key, value).

I think this would allow us to store a tsvector and gin index. Giving us the ability to use fulltext search on k:v pairs, then join the original data on the id field to return the entire record.

is anyone currently doing this? Is there a better alternative? Any performance issues that immediately jump out ( I realize the writes will take longer)?

the nature of our data is "relatively" static with bulk uploads (100 - 1000 records). So we can sacrifice some write performance.

RJ

Re: [GENERAL] full text search on hstore or json with materializedview?

От
Bruce Momjian
Дата:
On Tue, Apr 18, 2017 at 02:38:15PM -0700, Rj Ewing wrote:
> I am evaluating postgres for as a datastore for our webapp. We are moving away
> from a triple store db due to performance issues.
>
> Our data model consists of sets of user defined attributes. Approx 10% of the
> attributes tend to be 100% filled with 50% of the attributes having approx 25%
> filled. This is fairly sparse data, and it seems that jsonb or hstore will be
> best for us.
>
> Unfortunately, from my understanding, postres doesn't support fulltext search
> across hstore or jsonb key:values or even the entire document. While this is
> not a deal breaker, this would be a great feature to have. We have been
> experimenting w/ elasticsearch a bit, and particularly enjoy this feature,
> however we don't really want to involve the complexity and overhead of adding
> elasticsearch in front of our datasource right now.

Full text search of JSON and JSONB data is coming in Postgres 10, which
is to to be released in September of this year:

    https://www.depesz.com/2017/04/04/waiting-for-postgresql-10-full-text-search-support-for-json-and-jsonb/

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


Re: [GENERAL] full text search on hstore or json with materialized view?

От
Rj Ewing
Дата:
A step in the right direction for me, however it doesn't appear to support per field full text searching. 
It is exciting though!

On Tue, Apr 18, 2017 at 3:00 PM, Bruce Momjian <bruce@momjian.us> wrote:
On Tue, Apr 18, 2017 at 02:38:15PM -0700, Rj Ewing wrote:
> I am evaluating postgres for as a datastore for our webapp. We are moving away
> from a triple store db due to performance issues.
>
> Our data model consists of sets of user defined attributes. Approx 10% of the
> attributes tend to be 100% filled with 50% of the attributes having approx 25%
> filled. This is fairly sparse data, and it seems that jsonb or hstore will be
> best for us.
>
> Unfortunately, from my understanding, postres doesn't support fulltext search
> across hstore or jsonb key:values or even the entire document. While this is
> not a deal breaker, this would be a great feature to have. We have been
> experimenting w/ elasticsearch a bit, and particularly enjoy this feature,
> however we don't really want to involve the complexity and overhead of adding
> elasticsearch in front of our datasource right now.

Full text search of JSON and JSONB data is coming in Postgres 10, which
is to to be released in September of this year:

        https://www.depesz.com/2017/04/04/waiting-for-postgresql-10-full-text-search-support-for-json-and-jsonb/

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +

Re: [GENERAL] full text search on hstore or json with materialized view?

От
George Neuner
Дата:
On Tue, 18 Apr 2017 14:38:15 -0700, Rj Ewing <ewing.rj@gmail.com>
wrote:

>I am evaluating postgres for as a datastore for our webapp. We are moving
>away from a triple store db due to performance issues.
>
>Our data model consists of sets of user defined attributes. Approx 10% of
>the attributes tend to be 100% filled with 50% of the attributes having
>approx 25% filled. This is fairly sparse data, and it seems that jsonb or
>hstore will be best for us.
>
>Unfortunately, from my understanding, postres doesn't support fulltext
>search across hstore or jsonb key:values or even the entire document. While
>this is not a deal breaker, this would be a great feature to have. We have
>been experimenting w/ elasticsearch a bit, and particularly enjoy this
>feature, however we don't really want to involve the complexity and
>overhead of adding elasticsearch in front of our datasource right now.

hstore and JSON values all really are just formatted text with a
custom column type.  You can create tsvectors from the values if you
cast them to text.

Note that a tsvector can only work on a /flat/ key:value structure: it
won't understand nesting, and it and even with a flat store it won't
understand the difference between keys/tags and the associated values.

E.g., you will be able to see that a value contains both "foo" and
"bar", but to distinguish 'foo:bar' from 'bar:foo' or 'foo:q,bar:z'
you either must check the token positions (from the tsvector) or *try*
to extract the key(s) you are interested in and check the associated
value(s).

This might work ok if you search only for keys in a "document" ... but
trying to search values, I think would be far too complicated.

It might help if you stored a 2D array instead of a flat structure,
but even that would be fairly complicated to work with.



>An idea that has come up is to use a materialized view or secondary table
>with triggers, where we would have 3 columns (id, key, value).
>
>I think this would allow us to store a tsvector and gin index. Giving us
>the ability to use fulltext search on k:v pairs, then join the original
>data on the id field to return the entire record.

This is a much better idea because it separates the key from the
value, and unlike the full "document" case [above], you will know that
the FTS index is covering only the values.

If you need to preserve key order to reconstruct records, you will
need an additional column to maintain that ordering.


>is anyone currently doing this? Is there a better alternative? Any
>performance issues that immediately jump out ( I realize the writes will
>take longer)?
>
>the nature of our data is "relatively" static with bulk uploads (100 - 1000
>records). So we can sacrifice some write performance.
>
>RJ

Having to "reconstruct" records will make reads take longer as well,
but I think separating the keys and values is the best way to do it.


YMMV,
George

Re: [GENERAL] full text search on hstore or json with materialized view?

От
Rj Ewing
Дата:
I did some testing using a secondary table with the key, value column. However I don't think this will provide the performance that we need. Queries we taking 60+ seconds just for a count.

With 1 million rows in the primary table, this resulted in 44 million rows in the secondary k,v table for full text searching. The same query is es takes ~50 ms on my local machine with 1/10th the ram allocated to es then was allocated to psql.

I'm gonna test using trigrams indexes on approx 10 json fields, and see if that gives us what we are looking for.

any thought on getting sub 1 sec queries on a table with 44 million rows?

RJ

On Tue, Apr 18, 2017 at 10:35 PM, George Neuner <gneuner2@comcast.net> wrote:
On Tue, 18 Apr 2017 14:38:15 -0700, Rj Ewing <ewing.rj@gmail.com>
wrote:

>I am evaluating postgres for as a datastore for our webapp. We are moving
>away from a triple store db due to performance issues.
>
>Our data model consists of sets of user defined attributes. Approx 10% of
>the attributes tend to be 100% filled with 50% of the attributes having
>approx 25% filled. This is fairly sparse data, and it seems that jsonb or
>hstore will be best for us.
>
>Unfortunately, from my understanding, postres doesn't support fulltext
>search across hstore or jsonb key:values or even the entire document. While
>this is not a deal breaker, this would be a great feature to have. We have
>been experimenting w/ elasticsearch a bit, and particularly enjoy this
>feature, however we don't really want to involve the complexity and
>overhead of adding elasticsearch in front of our datasource right now.

hstore and JSON values all really are just formatted text with a
custom column type.  You can create tsvectors from the values if you
cast them to text.

Note that a tsvector can only work on a /flat/ key:value structure: it
won't understand nesting, and it and even with a flat store it won't
understand the difference between keys/tags and the associated values.

E.g., you will be able to see that a value contains both "foo" and
"bar", but to distinguish 'foo:bar' from 'bar:foo' or 'foo:q,bar:z'
you either must check the token positions (from the tsvector) or *try*
to extract the key(s) you are interested in and check the associated
value(s).

This might work ok if you search only for keys in a "document" ... but
trying to search values, I think would be far too complicated.

It might help if you stored a 2D array instead of a flat structure,
but even that would be fairly complicated to work with.



>An idea that has come up is to use a materialized view or secondary table
>with triggers, where we would have 3 columns (id, key, value).
>
>I think this would allow us to store a tsvector and gin index. Giving us
>the ability to use fulltext search on k:v pairs, then join the original
>data on the id field to return the entire record.

This is a much better idea because it separates the key from the
value, and unlike the full "document" case [above], you will know that
the FTS index is covering only the values.

If you need to preserve key order to reconstruct records, you will
need an additional column to maintain that ordering.


>is anyone currently doing this? Is there a better alternative? Any
>performance issues that immediately jump out ( I realize the writes will
>take longer)?
>
>the nature of our data is "relatively" static with bulk uploads (100 - 1000
>records). So we can sacrifice some write performance.
>
>RJ

Having to "reconstruct" records will make reads take longer as well,
but I think separating the keys and values is the best way to do it.


YMMV,
George



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] full text search on hstore or json with materialized view?

От
Rj Ewing
Дата:
okay, messing around a bit more with the secondary k,v table it seems like this could be a good solution..

I created a keys table to hold the 63 key values, then I dropped and recreated the secondary table, using a FK referencing the keys table. I'm not really sure why, but a basic full text query on 44 million row is taking aproxx. 20ms.

my table structure is:

 Table "public.samples_lg_txt"
 Column |   Type   | Modifiers
--------+----------+-----------
 id     | integer  |
 key    | integer  |
 val    | text     |
 tsv    | tsvector |
Indexes:
    "idx_tsv_samples_lg_text" gin (tsv)
Foreign-key constraints:
    "samples_lg_txt_id_fkey" FOREIGN KEY (id) REFERENCES samples_lg(id) ON DELETE CASCADE
    "samples_lg_txt_key_fkey" FOREIGN KEY (key) REFERENCES keys(id)


how would I write an AND query that filtered on 2 separate keys from the samples_lg_txt table?

something like:

SELECT COUNT(*) FROM samples WHERE id IN ( SELECT DISTINCT(s.id) FROM samples_lg_txt s JOIN keys k ON s.key = k.id WHERE (name = 'key1' AND tsv @@ to_tsquery('value1')) AND (name = 'key2' AND tsv @@ to_tsquery('value2'));

On Wed, Apr 19, 2017 at 11:57 AM, Rj Ewing <ewing.rj@gmail.com> wrote:
I did some testing using a secondary table with the key, value column. However I don't think this will provide the performance that we need. Queries we taking 60+ seconds just for a count.

With 1 million rows in the primary table, this resulted in 44 million rows in the secondary k,v table for full text searching. The same query is es takes ~50 ms on my local machine with 1/10th the ram allocated to es then was allocated to psql.

I'm gonna test using trigrams indexes on approx 10 json fields, and see if that gives us what we are looking for.

any thought on getting sub 1 sec queries on a table with 44 million rows?

RJ

On Tue, Apr 18, 2017 at 10:35 PM, George Neuner <gneuner2@comcast.net> wrote:
On Tue, 18 Apr 2017 14:38:15 -0700, Rj Ewing <ewing.rj@gmail.com>
wrote:

>I am evaluating postgres for as a datastore for our webapp. We are moving
>away from a triple store db due to performance issues.
>
>Our data model consists of sets of user defined attributes. Approx 10% of
>the attributes tend to be 100% filled with 50% of the attributes having
>approx 25% filled. This is fairly sparse data, and it seems that jsonb or
>hstore will be best for us.
>
>Unfortunately, from my understanding, postres doesn't support fulltext
>search across hstore or jsonb key:values or even the entire document. While
>this is not a deal breaker, this would be a great feature to have. We have
>been experimenting w/ elasticsearch a bit, and particularly enjoy this
>feature, however we don't really want to involve the complexity and
>overhead of adding elasticsearch in front of our datasource right now.

hstore and JSON values all really are just formatted text with a
custom column type.  You can create tsvectors from the values if you
cast them to text.

Note that a tsvector can only work on a /flat/ key:value structure: it
won't understand nesting, and it and even with a flat store it won't
understand the difference between keys/tags and the associated values.

E.g., you will be able to see that a value contains both "foo" and
"bar", but to distinguish 'foo:bar' from 'bar:foo' or 'foo:q,bar:z'
you either must check the token positions (from the tsvector) or *try*
to extract the key(s) you are interested in and check the associated
value(s).

This might work ok if you search only for keys in a "document" ... but
trying to search values, I think would be far too complicated.

It might help if you stored a 2D array instead of a flat structure,
but even that would be fairly complicated to work with.



>An idea that has come up is to use a materialized view or secondary table
>with triggers, where we would have 3 columns (id, key, value).
>
>I think this would allow us to store a tsvector and gin index. Giving us
>the ability to use fulltext search on k:v pairs, then join the original
>data on the id field to return the entire record.

This is a much better idea because it separates the key from the
value, and unlike the full "document" case [above], you will know that
the FTS index is covering only the values.

If you need to preserve key order to reconstruct records, you will
need an additional column to maintain that ordering.


>is anyone currently doing this? Is there a better alternative? Any
>performance issues that immediately jump out ( I realize the writes will
>take longer)?
>
>the nature of our data is "relatively" static with bulk uploads (100 - 1000
>records). So we can sacrifice some write performance.
>
>RJ

Having to "reconstruct" records will make reads take longer as well,
but I think separating the keys and values is the best way to do it.


YMMV,
George



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] full text search on hstore or json with materialized view?

От
George Neuner
Дата:

Please don't top post.
https://en.wikipedia.org/wiki/Posting_style#Placement_of_replies
https://en.wikipedia.org/wiki/Posting_style#Choosing_the_proper_posting_style



>> On Tue, 18 Apr 2017 14:38:15 -0700, Rj Ewing <ewing.rj@gmail.com>
>> wrote:
>>
>> > :
>> >An idea that has come up is to use a materialized view or secondary table
>> >with triggers, where we would have 3 columns (id, key, value).
>> >
>> >I think this would allow us to store a tsvector and gin index. Giving us
>> >the ability to use fulltext search on k:v pairs, then join the original
>> >data on the id field to return the entire record.
>> > :


On Wed, 19 Apr 2017 11:57:26 -0700, Rj Ewing <ewing.rj@gmail.com>
wrote:

>I did some testing using a secondary table with the key, value column.
>However I don't think this will provide the performance that we need.
>Queries we taking 60+ seconds just for a count.

SELECT count(*) or filtered?

Either way, your statistics may be way off.  Did you vacuum analyze
the table after the inserts (and the index creation if it was done
separately)?


>With 1 million rows in the primary table, this resulted in 44 million rows
>in the secondary k,v table for full text searching. The same query is es
>takes ~50 ms on my local machine with 1/10th the ram allocated to es then
>was allocated to psql.
>
>I'm gonna test using trigrams indexes on approx 10 json fields, and see if
>that gives us what we are looking for.
>
>any thought on getting sub 1 sec queries on a table with 44 million rows?
>
>RJ

Based on your description of the data [at top], I rather doubt
trigrams will be an improvement over tsvector.  And they're more
cumbersome to use if you don't need better similarity matching than
what tsvector offers [which itself is somewhat adjustable via
dictionaries].

Without more detail re: your hardware, Postgresql version, what
indexes are/will be available, the types of queries you want to run,
etc., it's very hard to give really meaningful suggestions.


The kind of query you have alluded to is pretty easily parallelized:
it can be spread over multiple sessions with result aggregation done
on the client side.

Or, if you you have 9.6, you might try using backend parallelism:
https://www.postgresql.org/docs/9.6/static/parallel-query.html
[I've not used this, but some people have done it successfully.]


If you can restrict the FTS query to certain keys:

  SELECT id FROM mytable
    WHERE tsquery( ... ) @@ to_tsvector(v)
    AND k IN ( ... )
    GROUP BY id

  [note: according to David Rowley, GROUP BY may be parallelized
         whereas  DISTINCT currently cannot be.]

then given an index on 'k' it may be much faster than just the FTS
query alone.  Subject to key variability, it also may be improved by
table partitioning to reduce the search space.

If the FTS query is key restricted, you can parallelize either on the
client or on the server.  If the FTS query is not key restricted, you
pretty much are limited to server side (and 9.6 or later).


And I'm out of suggestions for now.

Parallel query is your best bet for maximum performance, but unless
you have enough RAM to hold the entire table and its indexes, and all
the query workspaces, then I doubt you will be able to get anywhere
near your optimistic execution target for FTS on 40+ million rows.


YMMV,
George

Re: [GENERAL] full text search on hstore or json with materialized view?

От
Jeff Janes
Дата:
Please don't top-post, thanks.

On Tue, Apr 18, 2017 at 3:18 PM, Rj Ewing <ewing.rj@gmail.com> wrote: 
On Tue, Apr 18, 2017 at 3:00 PM, Bruce Momjian <bruce@momjian.us> wrote:

Full text search of JSON and JSONB data is coming in Postgres 10, which
is to to be released in September of this year:

        https://www.depesz.com/2017/04/04/waiting-for-postgresql-10-full-text-search-support-for-json-and-jsonb/


A step in the right direction for me, however it doesn't appear to support per field full text searching. 
It is exciting though!


Your best bet might be to ignore the per-field searching in the initial (indexed) pass of the query to get everything that has all the search terms, regardless of which field they occur in.  And the re-check whether each of the found values was found in the appropriate field in a later pass.

Something like

select * from sample where 
             to_tsvector(json_thing->>:key1) @@ :value1 
      and to_tsvector(json_thing->>:key2) @@ :value2 
      and to_tsvector('english',json_thing) @@ (:value1 || :value2)

From the initial email:

An idea that has come up is to use a materialized view or secondary table with triggers, where we would have 3 columns (id, key, value).

How would this be different from the "triple store" you are abandoning?

Cheers,

Jeff

Re: [GENERAL] full text search on hstore or json with materialized view?

От
George Neuner
Дата:
On Wed, 19 Apr 2017 16:28:13 -0700, Rj Ewing <ewing.rj@gmail.com>
wrote:

>okay, messing around a bit more with the secondary k,v table it seems like
>this could be a good solution..
>
>I created a keys table to hold the 63 key values, then I dropped and
>recreated the secondary table, using a FK referencing the keys table. I'm
>not really sure why, but a basic full text query on 44 million row is
>taking aproxx. 20ms.

That pretty much confirms your statistics were bad ... using the FK
table or not wouldn't make any difference to the planner.

But if you are getting 20ms on 44M rows, then one or more of the
following must be true:
 - your text values must be very short
 - your FTS queries must be very simple
 - you aren't reading the results


For comparison:

I have an application that does FTS on a table of NAICS descriptions
indexed using tsvectors with an average length of 4.8 tokens per.  It
does a 3-part All/Any/None term search.

On my 24-core 2.4GHz server, a single threaded query with the whole
table and index in memory takes ~1 ms to search 20K rows using a
realistic tsquery:  e.g.,

  SELECT code,description
    FROM naics
    WHERE ts_index @@ to_tsquery('packaged & software & !(wholesale)')

[getting the data out of Postgresql takes longer than the search]


GIN indexes don't exactly scale linearly, and tsquery is, in general,
much more dependent on the lengths of the tsvectors than on the
complexity of the match, but with 44M rows of similarly distributed
data, a similarly realistic query would be expected to take well over
1 second.


My example is genuine but too small to bother parallelizing [mentioned
in a previous message].  Since you are *testing* with 1M records (that
create 44M k:v shards), I am assuming you will need to deal with much
more than that in deployment.  And if you think you need FTS, then you
must be expecting more than simple word matches [as below], else you
might do something simpler like

  SELECT ...
    WHERE val ILIKE <value>



>my table structure is:
>
> Table "public.samples_lg_txt"
> Column |   Type   | Modifiers
>--------+----------+-----------
> id     | integer  |
> key    | integer  |
> val    | text     |
> tsv    | tsvector |
>Indexes:
>    "idx_tsv_samples_lg_text" gin (tsv)
>Foreign-key constraints:
>    "samples_lg_txt_id_fkey" FOREIGN KEY (id) REFERENCES samples_lg(id)
>                                ON DELETE CASCADE
>    "samples_lg_txt_key_fkey" FOREIGN KEY (key) REFERENCES keys(id)
>
>
>how would I write an AND query that filtered on 2 separate keys from the
>samples_lg_txt table?
>
>something like:
>
>SELECT COUNT(*) FROM samples WHERE id IN ( SELECT DISTINCT(s.id) FROM
>samples_lg_txt s JOIN keys k ON s.key = k.id WHERE (*name = 'key1' AND tsv
>@@ to_tsquery('value1')) AND (name = 'key2' AND tsv @@
>to_tsquery('value2'))*;

You're overthinking it

  SELECT count(distinct s.id)
    FROM  samples_lg_txt AS s
    JOIN  keys AS k ON k.id = s.key
    WHERE (k.name = 'key1' AND s.tsv @@ to_query('value1')
       OR (k.name = 'key2' AND s.tsv @@ to_query('value2')


There's actually no need to join if you can use the key name instead
of an integer id.  You can FK on strings, so you can still maintain an
identity table of keys.  E.g.,

> id     | integer   |
> key    | vchar(32) | FK key(name) ...
> val    | text      |
> tsv    | tsvector  |


Then the query could be just

  SELECT count(distinct id)
    FROM  samples_lg_txt
    WHERE (key = 'key1' AND tsv @@ to_query('value1')
       OR (key = 'key2' AND tsv @@ to_query('value2')


Just a reminder [it's late here 8-)]: FK columns contain values - not
weird references to the foreign tables.  The constraint just enforces
that any value inserted/updated into the FK column matches an existing
value in the relevant foreign table.


George

Re: [GENERAL] full text search on hstore or json with materialized view?

От
George Neuner
Дата:
On Thu, 20 Apr 2017 00:55:48 -0400, George Neuner
<gneuner2@comcast.net> wrote:

Doh!

>  SELECT count(distinct s.id)
>    FROM  samples_lg_txt AS s
>    JOIN  keys AS k ON k.id = s.key
>    WHERE (k.name = 'key1' AND s.tsv @@ to_query('value1')
>       OR (k.name = 'key2' AND s.tsv @@ to_query('value2')
>
>  :
>
>  SELECT count(distinct id)
>    FROM  samples_lg_txt
>    WHERE (key = 'key1' AND tsv @@ to_query('value1')
>       OR (key = 'key2' AND tsv @@ to_query('value2')


All the WHERE clauses need closing parentheses.

Time for bed,
George

Re: [GENERAL] full text search on hstore or json with materialized view?

От
Rj Ewing
Дата:


On Wed, Apr 19, 2017 at 6:44 PM, George Neuner <gneuner2@comcast.net> wrote:
On Wed, 19 Apr 2017 11:57:26 -0700, Rj Ewing <ewing.rj@gmail.com>
wrote:

>I did some testing using a secondary table with the key, value column.
>However I don't think this will provide the performance that we need.
>Queries we taking 60+ seconds just for a count.

SELECT count(*) or filtered?

Either way, your statistics may be way off.  Did you vacuum analyze
the table after the inserts (and the index creation if it was done
separately)?
 
​I think my statistics were off. I never ran vacuum analyze.

Without more detail re: your hardware, Postgresql version, what
indexes are/will be available, the types of queries you want to run,
etc., it's very hard to give really meaningful suggestions.

​postgresql 9.6, currently testing on 8gb ram, but have upto 64gb for production. 7-core 2.10GHz​.

mostly want to run ad-hoc queries, returning entire row, matching 1 - 3 k:v conditions.

The kind of query you have alluded to is pretty easily parallelized:
it can be spread over multiple sessions with result aggregation done
on the client side.

Or, if you you have 9.6, you might try using backend parallelism:
https://www.postgresql.org/docs/9.6/static/parallel-query.html
[I've not used this, but some people have done it successfully.]
 

If you can restrict the FTS query to certain keys:

  SELECT id FROM mytable
    WHERE tsquery( ... ) @@ to_tsvector(v)
    AND k IN ( ... )
    GROUP BY id

  [note: according to David Rowley, GROUP BY may be parallelized
         whereas  DISTINCT currently cannot be.]

then given an index on 'k' it may be much faster than just the FTS
query alone.  Subject to key variability, it also may be improved by
table partitioning to reduce the search space.

If the FTS query is key restricted, you can parallelize either on the
client or on the server.  If the FTS query is not key restricted, you
pretty much are limited to server side (and 9.6 or later).

​I'll look into parallelism if we can't get the performance we need.

​What do you mean if I can restrict the FTS query to certain keys? I'm not a sql expert, but it seems like the above query would match multiple keys to 1 tsquery value?

We need to be able to do AND conditions with separate k:v pairs. Our keys are know ahead of time, and would vary for a given table, but would be in the range of 30-60 keys per table.

Re: [GENERAL] full text search on hstore or json with materialized view?

От
Rj Ewing
Дата:


On Wed, Apr 19, 2017 at 8:09 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

Your best bet might be to ignore the per-field searching in the initial (indexed) pass of the query to get everything that has all the search terms, regardless of which field they occur in.  And the re-check whether each of the found values was found in the appropriate field in a later pass.

Something like

select * from sample where 
             to_tsvector(json_thing->>:key1) @@ :value1 
      and to_tsvector(json_thing->>:key2) @@ :value2 
      and to_tsvector('english',json_thing) @@ (:value1 || :value2)

​that worked pretty well when there was an AND condition with multiple k:v pairs as you have. However replacing it with an OR condition across k:v pairs it was pretty slow. I do like the simplicity though. Maybe indexing the 10ish most common columns ​would be a "good enough" solution.

From the initial email:

An idea that has come up is to use a materialized view or secondary table with triggers, where we would have 3 columns (id, key, value).

How would this be different from the "triple store" you are abandoning?

​it would be fairly similar. One advantage would be that we could simplify the backend to just a RDMS (which we use already), and not have to maintain a separate "triple store" instance

Re: [GENERAL] full text search on hstore or json with materialized view?

От
Rj Ewing
Дата:


On Wed, Apr 19, 2017 at 9:55 PM, George Neuner <gneuner2@comcast.net> wrote:
On Wed, 19 Apr 2017 16:28:13 -0700, Rj Ewing <ewing.rj@gmail.com>
wrote:

>okay, messing around a bit more with the secondary k,v table it seems like
>this could be a good solution..
>
>I created a keys table to hold the 63 key values, then I dropped and
>recreated the secondary table, using a FK referencing the keys table. I'm
>not really sure why, but a basic full text query on 44 million row is
>taking aproxx. 20ms.

That pretty much confirms your statistics were bad ... using the FK
table or not wouldn't make any difference to the planner.

But if you are getting 20ms on 44M rows, then one or more of the
following must be true:
 - your text values must be very short
 - your FTS queries must be very simple
 - you aren't reading the results

​text is on average very short. 1-3 words per value.​ It was a count(*) query with only a single condition.


For comparison:

I have an application that does FTS on a table of NAICS descriptions
indexed using tsvectors with an average length of 4.8 tokens per.  It
does a 3-part All/Any/None term search.

On my 24-core 2.4GHz server, a single threaded query with the whole
table and index in memory takes ~1 ms to search 20K rows using a
realistic tsquery:  e.g.,

  SELECT code,description
    FROM naics
    WHERE ts_index @@ to_tsquery('packaged & software & !(wholesale)')

[getting the data out of Postgresql takes longer than the search]


GIN indexes don't exactly scale linearly, and tsquery is, in general,
much more dependent on the lengths of the tsvectors than on the
complexity of the match, but with 44M rows of similarly distributed
data, a similarly realistic query would be expected to take well over
1 second.


My example is genuine but too small to bother parallelizing [mentioned
in a previous message].  Since you are *testing* with 1M records (that
create 44M k:v shards), I am assuming you will need to deal with much
more than that in deployment.  And if you think you need FTS, then you
must be expecting more than simple word matches [as below], else you
might do something simpler like

  SELECT ...
    WHERE val ILIKE <value>

​the 1M records would most likely be the max. On average the tables would have more like 100,000 records each.

I also realized that I inserted all k:v pairs into the secondary k:v table. In reality, I would only index strings, which would eliminate approx 25% of the k:v bringing that number down closer to 30M.

from my understanding, ILIKE doesn't do any text normalization, which is something we would like to have.​


>my table structure is:
>
> Table "public.samples_lg_txt"
> Column |   Type   | Modifiers
>--------+----------+-----------
> id     | integer  |
> key    | integer  |
> val    | text     |
> tsv    | tsvector |
>Indexes:
>    "idx_tsv_samples_lg_text" gin (tsv)
>Foreign-key constraints:
>    "samples_lg_txt_id_fkey" FOREIGN KEY (id) REFERENCES samples_lg(id)
>                                ON DELETE CASCADE
>    "samples_lg_txt_key_fkey" FOREIGN KEY (key) REFERENCES keys(id)
>
>
>how would I write an AND query that filtered on 2 separate keys from the
>samples_lg_txt table?
>
>something like:
>
>SELECT COUNT(*) FROM samples WHERE id IN ( SELECT DISTINCT(s.id) FROM
>samples_lg_txt s JOIN keys k ON s.key = k.id WHERE (*name = 'key1' AND tsv
>@@ to_tsquery('value1')) AND (name = 'key2' AND tsv @@
>to_tsquery('value2'))*;

You're overthinking it

  SELECT count(distinct s.id)
    FROM  samples_lg_txt AS s
    JOIN  keys AS k ON k.id = s.key
    WHERE (k.name = 'key1' AND s.tsv @@ to_query('value1')
       OR (k.name = 'key2' AND s.tsv @@ to_query('value2')

​but that is an OR query, I'm trying to do an AND query.
There's actually no need to join if you can use the key name instead
of an integer id.  You can FK on strings, so you can still maintain an
identity table of keys.  E.g.,

> id     | integer   |
> key    | vchar(32) | FK key(name) ...
> val    | text      |
> tsv    | tsvector  |


Then the query could be just

  SELECT count(distinct id)
    FROM  samples_lg_txt
    WHERE (key = 'key1' AND tsv @@ to_query('value1')
       OR (key = 'key2' AND tsv @@ to_query('value2')


Just a reminder [it's late here 8-)]: FK columns contain values - not
weird references to the foreign tables.  The constraint just enforces
that any value inserted/updated into the FK column matches an existing
value in the relevant foreign table.

​this would make queries simpler​. I guess a disadvantage to using a string for the key is that the db size would be larger, and thus not as likely to fit the entire table in ram. If there are only 63 keys across 44M rows, it seems that storing an smallint would take less space then storing the string.

I don't really have a need for the identity table of keys. It's only purpose was to shrink the database size.

Thanks again for the detailed responses!

Re: [GENERAL] full text search on hstore or json with materialized view?

От
George Neuner
Дата:
On Thu, 20 Apr 2017 07:56:18 -0700, Rj Ewing <ewing.rj@gmail.com>
wrote:

>On Wed, Apr 19, 2017 at 6:44 PM, George Neuner <gneuner2@comcast.net> wrote:
>>
>> If you can restrict the FTS query to certain keys:
>>
>>   SELECT id FROM mytable
>>     WHERE tsquery( ... ) @@ to_tsvector(v)
>>     AND k IN ( ... )
>>     GROUP BY id
>>
>>   [note: according to David Rowley, GROUP BY may be parallelized
>>          whereas  DISTINCT currently cannot be.]
>>
>> then given an index on 'k' it may be much faster than just the FTS
>> query alone.  Subject to key variability, it also may be improved by
>> table partitioning to reduce the search space.
>>
>> If the FTS query is key restricted, you can parallelize either on the
>> client or on the server.  If the FTS query is not key restricted, you
>> pretty much are limited to server side (and 9.6 or later).
>>
>> ?I'll look into parallelism if we can't get the performance we need.
>
>What do you mean if I can restrict the FTS query to certain keys? I'm not
>a sql expert, but it seems like the above query would match multiple keys
>to 1 tsquery value


You weren't specific as to the types of queries you wanted ... you
mentioned somewhere higher up in the discussion:

> ... a basic full text query on 44 million row is taking aproxx. 20ms.

That implied you wanted to FTS search every row.  Only later did you
give an example that tied FTS patterns to particular keys.  Until you
did that, there was no reason to assume the FTS search was targeted -
you might have wanted e.g., records where *any* k:v value matched the
FTS pattern.

[The take away here is: "try to be as specific as possible". 8-) ]


Obviously you can associate a FTS pattern with a particular key value
- just AND the conditions in the WHERE or HAVING clauses.

But be aware that, in general, the more conditions you place on a
query, the slower it runs.


George

Re: [GENERAL] full text search on hstore or json with materialized view?

От
George Neuner
Дата:
On Thu, 20 Apr 2017 08:50:31 -0700, Rj Ewing <ewing.rj@gmail.com>
wrote:

>On Wed, Apr 19, 2017 at 9:55 PM, George Neuner <gneuner2@comcast.net> wrote:
>
>> ... Since you are *testing* with 1M records (that
>> create 44M k:v shards), I am assuming you will need to deal with much
>> more than that in deployment.  And if you think you need FTS, then you
>> must be expecting more than simple word matches [as below], else you
>> might do something simpler like
>>
>>   SELECT ...
>>     WHERE val ILIKE <value>
>
>the 1M records would most likely be the max. On average the tables would
>have more like 100,000 records each.

Ok, so my assumption was way off ... you should be able to achieve the
timing you want with appropriate indexing.

>from my understanding, *ILIKE* doesn't do any text normalization, which is
>something we would like to have.

Right. If you want rooting/stemming or dictionary translation, then
you do need to use FTS.


>> >how would I write an AND query that filtered on 2 separate keys from the
>> >samples_lg_txt table?
>> >
>> >something like:
>> >
>> > SELECT COUNT(*) FROM samples WHERE id IN ( SELECT DISTINCT(s.id) FROM
>> > samples_lg_txt s JOIN keys k ON s.key = k.id WHERE (*name = 'key1' AND
>> > tsv @@ to_tsquery('value1')) AND (name = 'key2' AND tsv @@
>> > to_tsquery('value2'))*;
>>
>> You're overthinking it
>>
>>   SELECT count(distinct s.id)
>>     FROM  samples_lg_txt AS s
>>     JOIN  keys AS k ON k.id = s.key
>>     WHERE (k.name = 'key1' AND s.tsv @@ to_query('value1')
>>        OR (k.name = 'key2' AND s.tsv @@ to_query('value2')
>
>but that is an OR query, I'm trying to do an AND query.

Sorry, I missed the AND in your original query.  Still the nested
SELECT is unnecessary.

Postgresql's planner/optimizer is pretty smart, and probably would
collapse your code into mine (modulo the AND/OR goof), but I prefer
not to rely on the planner to be smart ... that gets you into trouble
when you have to switch between DBMS.


>> There's actually no need to join if you can use the key name instead
>> of an integer id.  You can FK on strings, so you can still maintain an
>> identity table of keys.  E.g.,
>>
>> > id     | integer   |
>> > key    | vchar(32) | FK key(name) ...
>> > val    | text      |
>> > tsv    | tsvector  |
>>
>>
>> Then the query could be just
>>
>>   SELECT count(distinct id)
>>     FROM  samples_lg_txt
>>     WHERE (key = 'key1' AND tsv @@ to_query('value1')
>>        OR (key = 'key2' AND tsv @@ to_query('value2')
>>
>
>?this would make queries simpler?.

Yes - it eliminates the joins, and the query runs on a single table.

>I guess a disadvantage to using a string
>for the key is that the db size would be larger, and thus not as likely to
>fit the entire table in ram. If there are only 63 keys across 44M rows, it
>seems that storing an smallint would take less space then storing the
>string.

Maybe.  Using the integer FK reduces the table size, but it requires a
join with the foreign table.  A join of two tables requires indexes
for the join columns on both tables [which may or may not already
exist], and produces [variously] a temporary hash or key relation
table that represents the rows of the "joined" table.  These temporary
structures can grow very large and may have to spill onto disk.

You can somewhat control that with the work_mem setting.  But remember
that the setting applies to every operation of every concurrent query
... so setting work_mem very high can backfire.


So saving one place can cost you in another.  TANSTAAFL.


>I don't really have a need for the identity table of keys. It's only
>purpose was to shrink the database size.
>
>Thanks again for the detailed responses!


George