Обсуждение: tsearch2, large data and indexes

Поиск
Список
Период
Сортировка

tsearch2, large data and indexes

От
Ivan Voras
Дата:
Hello,

If a table contains simple fields as well as large (hundreds of KiB)
text fields, will accessing only the simple fields cause the entire
record data, including the large fields, to be read and unpacked?
(e.g. SELECT int_field FROM table_with_large_text)

More details: after thinking about it some more, it might have
something to do with tsearch2 and indexes: the large data in this case
is a tsvector, indexed with GIN, and the query plan involves a
re-check condition.

The query is of the form:
SELECT simple_fields FROM table WHERE fts @@ to_tsquery('...').

Does the "re-check condition" mean that the original tsvector data is
always read from the table in addition to the index? That would be
very wasteful since data is practically duplicated in the table and in
the index. Any way around it?


Re: tsearch2, large data and indexes

От
Heikki Linnakangas
Дата:
On 04/20/2014 02:15 AM, Ivan Voras wrote:
> Hello,
>
> If a table contains simple fields as well as large (hundreds of KiB)
> text fields, will accessing only the simple fields cause the entire
> record data, including the large fields, to be read and unpacked?
> (e.g. SELECT int_field FROM table_with_large_text)

No.

> More details: after thinking about it some more, it might have
> something to do with tsearch2 and indexes: the large data in this case
> is a tsvector, indexed with GIN, and the query plan involves a
> re-check condition.
>
> The query is of the form:
> SELECT simple_fields FROM table WHERE fts @@ to_tsquery('...').
>
> Does the "re-check condition" mean that the original tsvector data is
> always read from the table in addition to the index?

Yes, if the re-check condition involves the fts column. I don't see why
you would have a re-check condition with a query like that, though. Are
there some other WHERE-conditions that you didn't show us?

The large fields are stored in the toast table. You can check if the
toast table is accessed with a query like this:

select * from pg_stat_all_tables where relid = (select reltoastrelid
from pg_class where relname='table');

Run that before and after your query, and see if the numbers change.

- Heikki


Re: tsearch2, large data and indexes

От
Ivan Voras
Дата:
On 22 April 2014 08:40, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 04/20/2014 02:15 AM, Ivan Voras wrote:
>> More details: after thinking about it some more, it might have
>> something to do with tsearch2 and indexes: the large data in this case
>> is a tsvector, indexed with GIN, and the query plan involves a
>> re-check condition.
>>
>> The query is of the form:
>> SELECT simple_fields FROM table WHERE fts @@ to_tsquery('...').
>>
>> Does the "re-check condition" mean that the original tsvector data is
>> always read from the table in addition to the index?
>
>
> Yes, if the re-check condition involves the fts column. I don't see why you
> would have a re-check condition with a query like that, though. Are there
> some other WHERE-conditions that you didn't show us?

Yes, I've read about tsearch2 and GIN indexes and there shouldn't be a
recheck condition - but there is.
This is the query:

SELECT documents.id, title, raw_data, q, ts_rank(fts_data, q, 4) AS
rank, html_filename
            FROM documents, to_tsquery('document') AS q
            WHERE fts_data @@ q
         ORDER BY rank DESC  LIMIT 25;

And here is the explain analyze: http://explain.depesz.com/s/4xm
It clearly shows a bitmap index scan operation is immediately followed
by a recheck operation AND that the recheck operation actually does
something, because it reduces the number of records from 61 to 58
(!!!).

This is the table structure:

nn=# \d documents
                             Table "public.documents"
    Column     |   Type   |                       Modifiers
---------------+----------+--------------------------------------------------------
 id            | integer  | not null default
nextval('documents_id_seq'::regclass)
 ctime         | integer  | not null default unix_ts(now())
 dtime         | integer  | not null
 title         | text     | not null
 html_filename | text     | not null
 raw_data      | text     |
 fts_data      | tsvector | not null
 tags          | text[]   |
 dtype         | integer  | not null default 0
 flags         | integer  | not null default 0
Indexes:
    "documents_pkey" PRIMARY KEY, btree (id)
    "documents_html_filename" UNIQUE, btree (html_filename)
    "documents_dtime" btree (dtime)
    "documents_fts_data" gin (fts_data)
    "documents_tags" gin (tags)


> The large fields are stored in the toast table. You can check if the toast
> table is accessed with a query like this:
>
> select * from pg_stat_all_tables where relid = (select reltoastrelid from
> pg_class where relname='table');
>
> Run that before and after your query, and see if the numbers change.

Before:


relid|schemaname|relname|seq_scan|seq_tup_read|idx_scan|idx_tup_fetch|n_tup_ins|n_tup_upd|n_tup_del|n_tup_hot_upd|n_live_tup|n_dead_tup|last_vacuum|last_autovacuum|last_analyze|last_autoanalyze|vacuum_count|autovacuum_count|analyze_count|autoanalyze_count
27290|pg_toast|pg_toast_27283|3|0|2481289|10631453|993194|0|266306|0|147931|2514||2014-04-18
00:49:11.066443+02|||0|11|0|0

After:


relid|schemaname|relname|seq_scan|seq_tup_read|idx_scan|idx_tup_fetch|n_tup_ins|n_tup_upd|n_tup_del|n_tup_hot_upd|n_live_tup|n_dead_tup|last_vacuum|last_autovacuum|last_analyze|last_autoanalyze|vacuum_count|autovacuum_count|analyze_count|autoanalyze_count
27290|pg_toast|pg_toast_27283|3|0|2481347|10632814|993194|0|266306|0|147931|2514||2014-04-18
00:49:11.066443+02|||0|11|0|0

idx_scan has changed from 2481289 to 2481347 (58)
idx_tup_fetch has changed from 10631453 to 10632814 (1361)

Number 58 corresponds to the number of rows found by the index, seen
in the EXPLAIN output, I don't know where 1361 comes from.

I'm also surprised by the amount of memory used for sorting (23 kB),
since the actually returned data from my query (all the tuples from
all the 58 rows) amount to around 2 kB - but this is not an actual
problem.


Re: tsearch2, large data and indexes

От
Jeff Janes
Дата:
On Tue, Apr 22, 2014 at 12:57 AM, Ivan Voras <ivoras@freebsd.org> wrote:
On 22 April 2014 08:40, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 04/20/2014 02:15 AM, Ivan Voras wrote:
>> More details: after thinking about it some more, it might have
>> something to do with tsearch2 and indexes: the large data in this case
>> is a tsvector, indexed with GIN, and the query plan involves a
>> re-check condition.

I think bitmap scans always insert a recheck, do to the possibility of bitmap overflow.

But that doesn't mean that it ever got triggered.  In 9.4., explain (analyze) will report on the overflows.


Yes, I've read about tsearch2 and GIN indexes and there shouldn't be a
recheck condition - but there is.
This is the query:

SELECT documents.id, title, raw_data, q, ts_rank(fts_data, q, 4) AS
rank, html_filename
            FROM documents, to_tsquery('document') AS q
            WHERE fts_data @@ q
         ORDER BY rank DESC  LIMIT 25;

And here is the explain analyze: http://explain.depesz.com/s/4xm
It clearly shows a bitmap index scan operation is immediately followed
by a recheck operation AND that the recheck operation actually does
something, because it reduces the number of records from 61 to 58
(!!!).

That could be ordinary visibility checking, not qual rechecking.

Cheers,

Jeff

Re: tsearch2, large data and indexes

От
Ivan Voras
Дата:
On 22 April 2014 17:58, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Tue, Apr 22, 2014 at 12:57 AM, Ivan Voras <ivoras@freebsd.org> wrote:
>>
>> On 22 April 2014 08:40, Heikki Linnakangas <hlinnakangas@vmware.com>
>> wrote:
>> > On 04/20/2014 02:15 AM, Ivan Voras wrote:
>> >> More details: after thinking about it some more, it might have
>> >> something to do with tsearch2 and indexes: the large data in this case
>> >> is a tsvector, indexed with GIN, and the query plan involves a
>> >> re-check condition.
>
>
> I think bitmap scans always insert a recheck, do to the possibility of
> bitmap overflow.
>
> But that doesn't mean that it ever got triggered.  In 9.4., explain
> (analyze) will report on the overflows.

Ok, I found out what is happening, quoting from the documentation:

"GIN indexes are not lossy for standard queries, but their performance
depends logarithmically on the number of unique words. (However, GIN
indexes store only the words (lexemes) oftsvector values, and not
their weight labels. Thus a table row recheck is needed when using a
query that involves weights.)"

My query doesn't have weights but the tsvector in the table has them -
I take it this is what is meant by "involves weights."

So... there's really no way for tsearch2 to produce results based on
the index alone, without recheck? This is... limiting.

>> Yes, I've read about tsearch2 and GIN indexes and there shouldn't be a
>> recheck condition - but there is.
>> This is the query:
>>
>> SELECT documents.id, title, raw_data, q, ts_rank(fts_data, q, 4) AS
>> rank, html_filename
>>             FROM documents, to_tsquery('document') AS q
>>             WHERE fts_data @@ q
>>          ORDER BY rank DESC  LIMIT 25;
>>
>> And here is the explain analyze: http://explain.depesz.com/s/4xm
>> It clearly shows a bitmap index scan operation is immediately followed
>> by a recheck operation AND that the recheck operation actually does
>> something, because it reduces the number of records from 61 to 58
>> (!!!).
>
>
> That could be ordinary visibility checking, not qual rechecking.

Visibility as in transaction-wise? It's not, this was the only client
connected to the dev server, and the only transaction(s) happening.


Re: tsearch2, large data and indexes

От
Matheus de Oliveira
Дата:

On Wed, Apr 23, 2014 at 8:08 AM, Ivan Voras <ivoras@freebsd.org> wrote:
>> And here is the explain analyze: http://explain.depesz.com/s/4xm
>> It clearly shows a bitmap index scan operation is immediately followed
>> by a recheck operation AND that the recheck operation actually does
>> something, because it reduces the number of records from 61 to 58
>> (!!!).
>
>
> That could be ordinary visibility checking, not qual rechecking.

Visibility as in transaction-wise? It's not, this was the only client
connected to the dev server, and the only transaction(s) happening.

I guess Jeff meant the visibility of tuples, in this case there may have 3 rows that are referenced by the index but are not visible to your current transaction (they may be visible by other transaction or simple hasn't been marked by VACUUM).

If you have no concurrent transactions, you can run VACUUM on your table, run the query again and see if the row counts matches.

Best regards,
--
Matheus de Oliveira
Analista de Banco de Dados
Dextra Sistemas - MPS.Br nível F!
www.dextra.com.br/postgres

Re: tsearch2, large data and indexes

От
Heikki Linnakangas
Дата:
On 04/22/2014 10:57 AM, Ivan Voras wrote:
> On 22 April 2014 08:40, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>> On 04/20/2014 02:15 AM, Ivan Voras wrote:
>>> More details: after thinking about it some more, it might have
>>> something to do with tsearch2 and indexes: the large data in this case
>>> is a tsvector, indexed with GIN, and the query plan involves a
>>> re-check condition.
>>>
>>> The query is of the form:
>>> SELECT simple_fields FROM table WHERE fts @@ to_tsquery('...').
>>>
>>> Does the "re-check condition" mean that the original tsvector data is
>>> always read from the table in addition to the index?
>>
>> Yes, if the re-check condition involves the fts column. I don't see why you
>> would have a re-check condition with a query like that, though. Are there
>> some other WHERE-conditions that you didn't show us?
>
> Yes, I've read about tsearch2 and GIN indexes and there shouldn't be a
> recheck condition - but there is.
> This is the query:
>
> SELECT documents.id, title, raw_data, q, ts_rank(fts_data, q, 4) AS
> rank, html_filename
>              FROM documents, to_tsquery('document') AS q
>              WHERE fts_data @@ q
>           ORDER BY rank DESC  LIMIT 25;

It's the ranking that's causing the detoasting. "ts_rank(fts_data, q,
4)" has to fetch the contents of the fts_data column.

Sorry, I was confused earlier: the "Recheck Cond:" line is always there
in the EXPLAIN output of bitmap index scans, even if the recheck
condition is never executed at runtime. It's because the executor has to
be prepared to run the recheck-condition, if the bitmap grows large
enough to become "lossy", so that it only stores the page numbers of
matching tuples, not the individual tuples

- Heikki


Re: tsearch2, large data and indexes

От
Sergey Konoplev
Дата:
On Wed, Apr 23, 2014 at 4:08 AM, Ivan Voras <ivoras@freebsd.org> wrote:
> Ok, I found out what is happening, quoting from the documentation:
>
> "GIN indexes are not lossy for standard queries, but their performance
> depends logarithmically on the number of unique words. (However, GIN
> indexes store only the words (lexemes) oftsvector values, and not
> their weight labels. Thus a table row recheck is needed when using a
> query that involves weights.)"
>
> My query doesn't have weights but the tsvector in the table has them -
> I take it this is what is meant by "involves weights."
>
> So... there's really no way for tsearch2 to produce results based on
> the index alone, without recheck? This is... limiting.

My guess is that you could use strip() function [1] to get rid of
weights in your table or, that would probably be better, in your index
only by using expressions in it and in the query, eg.

...USING gin (strip(fts_data))

and

... WHERE strip(fts_data) @@ q

[1] http://www.postgresql.org/docs/9.3/static/textsearch-features.html

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA

http://www.linkedin.com/in/grayhemp
+1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979
gray.ru@gmail.com


Re: tsearch2, large data and indexes

От
Heikki Linnakangas
Дата:
On 04/24/2014 01:56 AM, Sergey Konoplev wrote:
> On Wed, Apr 23, 2014 at 4:08 AM, Ivan Voras <ivoras@freebsd.org> wrote:
>> Ok, I found out what is happening, quoting from the documentation:
>>
>> "GIN indexes are not lossy for standard queries, but their performance
>> depends logarithmically on the number of unique words. (However, GIN
>> indexes store only the words (lexemes) oftsvector values, and not
>> their weight labels. Thus a table row recheck is needed when using a
>> query that involves weights.)"
>>
>> My query doesn't have weights but the tsvector in the table has them -
>> I take it this is what is meant by "involves weights."
>>
>> So... there's really no way for tsearch2 to produce results based on
>> the index alone, without recheck? This is... limiting.
>
> My guess is that you could use strip() function [1] to get rid of
> weights in your table or, that would probably be better, in your index
> only by using expressions in it and in the query, eg.

As the docs say, the GIN index does not store the weights. As such,
there is no need to strip them. A recheck would be necessary if your
query needs the weights, precisely because the weights are not included
in the index.

(In the OP's query, it's the ranking that was causing the detoasting.)

- Heikki


Re: tsearch2, large data and indexes

От
Ivan Voras
Дата:
On 24 April 2014 13:34, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

> As the docs say, the GIN index does not store the weights. As such, there is
> no need to strip them. A recheck would be necessary if your query needs the
> weights, precisely because the weights are not included in the index.
>
> (In the OP's query, it's the ranking that was causing the detoasting.)

Thanks!

My problem is that I actually need the ranking. My queries can return
a large number of documents (tens of thousands) but I usually need
only the first couple of pages of most relevant results (e.g. 50-100
records). With PostgreSQL and tsearch2, this means that the tens of
thousands of documents found via the index are then detoasted and
ranked.

Does anyone have experience with external search engines which also
have ranking but are more efficient? How about Solr?


Re: tsearch2, large data and indexes

От
Sergey Konoplev
Дата:
On Thu, Apr 24, 2014 at 4:34 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 04/24/2014 01:56 AM, Sergey Konoplev wrote:
>> My guess is that you could use strip() function [1] to get rid of
>> weights in your table or, that would probably be better, in your index
>> only by using expressions in it and in the query, eg.
>
> As the docs say, the GIN index does not store the weights. As such, there is
> no need to strip them. A recheck would be necessary if your query needs the
> weights, precisely because the weights are not included in the index.
>
> (In the OP's query, it's the ranking that was causing the detoasting.)

strip() is needed in the index because without it the index expression
wont match one that is in the WHERE block, and the index wont be used.
This way we could probably get rid of the "involves weights" thing,
that causes to "recheck condition", if I interpret the docs correct.

ts_rank(), for its turn, is supposed to be used in the higher node of
the plan, so there is no way for it to affect the query somehow.

But, again, it is just my guess, and it requires testing.

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA

http://www.linkedin.com/in/grayhemp
+1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979
gray.ru@gmail.com


Re: tsearch2, large data and indexes

От
Sergey Konoplev
Дата:
On Thu, Apr 24, 2014 at 5:34 AM, Ivan Voras <ivoras@freebsd.org> wrote:
> On 24 April 2014 13:34, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>
>> As the docs say, the GIN index does not store the weights. As such, there is
>> no need to strip them. A recheck would be necessary if your query needs the
>> weights, precisely because the weights are not included in the index.
>>
>> (In the OP's query, it's the ranking that was causing the detoasting.)
>
> Thanks!
>
> My problem is that I actually need the ranking. My queries can return
> a large number of documents (tens of thousands) but I usually need
> only the first couple of pages of most relevant results (e.g. 50-100
> records). With PostgreSQL and tsearch2, this means that the tens of
> thousands of documents found via the index are then detoasted and
> ranked.

Heikki, what about the "GIN improvements part 3: ordering in index"
patch, was it committed?

http://www.postgresql.org/message-id/flat/CAPpHfduWvqv5b0XZ1DZuqAW29erDCULZp2wotfJzDBs7BHpKXw@mail.gmail.com

Ivan, there is a hope that we could get a more effective FTS solution
that any others I have heard about with this patch.

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA

http://www.linkedin.com/in/grayhemp
+1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979
gray.ru@gmail.com


Re: tsearch2, large data and indexes

От
Heikki Linnakangas
Дата:
On 04/24/2014 10:57 PM, Sergey Konoplev wrote:
> On Thu, Apr 24, 2014 at 5:34 AM, Ivan Voras <ivoras@freebsd.org> wrote:
>> On 24 April 2014 13:34, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>>
>>> As the docs say, the GIN index does not store the weights. As such, there is
>>> no need to strip them. A recheck would be necessary if your query needs the
>>> weights, precisely because the weights are not included in the index.
>>>
>>> (In the OP's query, it's the ranking that was causing the detoasting.)
>>
>> Thanks!
>>
>> My problem is that I actually need the ranking. My queries can return
>> a large number of documents (tens of thousands) but I usually need
>> only the first couple of pages of most relevant results (e.g. 50-100
>> records). With PostgreSQL and tsearch2, this means that the tens of
>> thousands of documents found via the index are then detoasted and
>> ranked.
>
> Heikki, what about the "GIN improvements part 3: ordering in index"
> patch, was it committed?
>
> http://www.postgresql.org/message-id/flat/CAPpHfduWvqv5b0XZ1DZuqAW29erDCULZp2wotfJzDBs7BHpKXw@mail.gmail.com

Nope, wasn't committed.

- Heikki