Discussion: seq vs index scan in join query


seq vs index scan in join query

From
Emanuel Alvarez
Date:
hi all,

we're in the process of optimizing some queries and we've noted a case
where the planner prefers a sequential scan instead of using an index,
while the index scan is actually much faster. to give you some
context: we have two main tables, keywords and results. keywords has
approximately 700,000 rows, while results holds approximately one row
per keyword per day (roughly 70m at the moment; not all keywords are
active on any given day). results is currently partitioned by
(creation) time. it's also worth noting that we use SSDs in our
servers, and have random_page_cost set to 1.


the problematic query looks like this:

SELECT keywords.strategy_id, results.position, results.created_at
FROM results
JOIN keywords ON results.keyword_id = keywords.id
WHERE results.account_id = 1
  AND results.created_at >= '2017-10-25 00:00:00.000000'
  AND results.created_at <= '2017-11-10 23:59:59.999999';


as you can see in the query plan [1] a sequential scan is preferred.
as we understand it, this happens because the number of rows returned
from results is too large. if we reduce this number by either
selecting a smaller created_at range, or another account_id with fewer
keywords, the planner falls back to an index scan, confirming that the
number of rows returned from results has a direct influence in this
choice.

on the other hand, if we disable sequential scans (SET enable_seqscan
= 0), we see that not only does the query run faster, but the cost
seems to be lower, as seen in the query plan [2].

in this example the gain isn't much: ~0.5s. but when we add a second
join table with additional keyword data, the planner still prefers a
sequential scan on a table that has more than 6m rows. the query looks
like this:

SELECT keywords.strategy_id, results.position, results.created_at,
       keyword_data.volume
FROM results
JOIN keywords ON results.keyword_id = keywords.id
JOIN keyword_data ON keywords.keyword_data_id = keyword_data.id
WHERE results.account_id = 1
  AND results.created_at >= '2017-10-25 00:00:00.000000'
  AND results.created_at <= '2017-11-19 23:59:59.999999';


in this case the query takes up to 8s; the query plan can be found in
[3]. obviously the dataset has to be large for a sequential scan to be
preferred on a 6m-row table. similarly, reducing the created_at range
or using an account_id with fewer keywords makes the planner prefer an
index scan, accelerating the query considerably.

currently we're exploring the option of fetching the keyword data
within a subquery and feeding that into the main query, which works as
expected, but also complicates the design a bit.

we'd like to know:
1. why does the planner prefer a sequential scan in these cases?
2. is there a way we can make the planner choose a better strategy
using indexes?

thank you for your time.

[1] seq scan plan:
https://gist.github.com/emnlvrz/5e53235c82260be011d84cf264e597e7
[2] indexed plan:
https://gist.github.com/emnlvrz/8aa85edbdedcdb90d8d4f38863abc134
[3] seq scan additional join plan:
https://gist.github.com/emnlvrz/b3f13518f863f829c65f91a514f407d9


Re: seq vs index scan in join query

From
legrand legrand
Date:
Hi,

Could you give us the partition (range values) and index definitions for
the result table?

Regards
PAscal





Re: seq vs index scan in join query

From
Marti Raudsepp
Date:
Hi

On Wed, Nov 29, 2017 at 8:55 AM, Emanuel Alvarez <ema@abductedcow.com.ar> wrote:
> on the other hand, if we disable sequential scans (SET enable_seqscan
> = 0), we see that not only does the query run faster, but the cost
> seems to be lower, as seen in the query plan [2].

True, the cost of the scan itself is lower, but together with
hashjoin/nestloop, the total cost of plan [2] is higher.

This is a wild guess but...

-> Index Scan using keywords_pkey on keywords  Buffers: shared hit=284808 read=4093
vs
-> Seq Scan on keywords  Buffers: shared read=36075

Looks like the index scan's advantage in this example is a much higher
cache hit ratio (despite touching so many more pages) and PostgreSQL
is underestimating it.

Have you tuned the effective_cache_size setting? A good starting point
is half the total RAM in your machine. It would be interesting to see
how high you need to set it for the planner to switch to the index
scan plan.
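A sketch of that experiment; the 8GB figure is an assumed placeholder
for roughly half the machine's RAM, not a verified tuning:

```sql
-- effective_cache_size does not allocate any memory; it only tells
-- the planner how much of the data it may assume is cached by the
-- OS and shared buffers combined, which makes index scans look
-- cheaper as the value grows.
SET effective_cache_size = '8GB';  -- placeholder: ~half of total RAM

-- Re-plan the problematic query and check whether the planner
-- now chooses the index scan.
EXPLAIN
SELECT keywords.strategy_id, results.position, results.created_at
FROM results
JOIN keywords ON results.keyword_id = keywords.id
WHERE results.account_id = 1
  AND results.created_at >= '2017-10-25 00:00:00.000000'
  AND results.created_at <= '2017-11-10 23:59:59.999999';
```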

Regards,
Marti Raudsepp


Re: seq vs index scan in join query

From
Laurenz Albe
Date:
Emanuel Alvarez wrote:
> the problematic query looks like this:
> 
> SELECT keywords.strategy_id, results.position, results.created_at FROM results
>   JOIN  keywords ON results.keyword_id = keywords.id
>   WHERE results.account_id = 1
>      AND results.created_at >= '2017-10-25 00:00:00.000000'
>      AND results.created_at <= '2017-11-10 23:59:59.999999';
> 
> 
> as you can see in the query plan [1] a sequential scan is preferred.
> as we understand it, this happens because the number of rows returned
> from results is too large. if we reduce this number by either
> selecting a smaller created_at range, or another account_id with fewer
> keywords, the planner falls back to an index scan, confirming that the
> number of rows returned from results has a direct influence in this
> choice.
> 
> on the other hand, if we disable sequential scans (SET enable_seqscan
> = 0), we see that not only does the query run faster, but the cost
> seems to be lower, as seen in the query plan [2].

The optimizer is right here.

Even though your second execution without sequential scans ran faster,
it is worse.

That is because the execution with the sequential scan touched
26492  + 80492 = 106984 blocks, while the second execution touched
311301 + 48510 = 359811 blocks, more than three times as many.

The second execution was just lucky because most of these blocks were
already cached, and it had to read only half as many blocks from disk.

If you repeat the execution a couple of times, you should see that
the execution using the sequential scans becomes faster.


You can boost performance even more by increasing work_mem
so that the hash can be created in memory.
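A sketch of that suggestion; the 256MB value is an assumption that
should be sized against the hash table EXPLAIN ANALYZE actually reports:

```sql
-- Raise work_mem for this session so the hash join's table fits in
-- memory instead of spilling to temporary files. In EXPLAIN ANALYZE
-- output, "Batches: 1" on the Hash node indicates an in-memory hash.
SET work_mem = '256MB';  -- assumed value; tune to your hash table size

EXPLAIN (ANALYZE, BUFFERS)
SELECT keywords.strategy_id, results.position, results.created_at
FROM results
JOIN keywords ON results.keyword_id = keywords.id
WHERE results.account_id = 1
  AND results.created_at >= '2017-10-25 00:00:00.000000'
  AND results.created_at <= '2017-11-10 23:59:59.999999';
```

Note that work_mem applies per sort or hash operation, not per query,
so a session-level SET is safer than raising it globally.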

Yours,
Laurenz Albe







Re: seq vs index scan in join query

From
Andres Freund
Date:
On 2017-11-29 18:17:18 +0100, Laurenz Albe wrote:
> That is because the execution with the sequential scan touched
> 26492  + 80492 = 106984 blocks, while the second execution touched
> 311301 + 48510 = 359811 blocks, more than three times as many.

That's not necessarily so. What those count are buffer *accesses*,
*not* the number of distinct blocks accessed. You'll very commonly have
more buffer accesses in indexscans but still fewer total reads, because a
lot of those accesses will be re-reads of pages previously read in the
same scan. Just imagine a scan of an index with a leaf page pointing to
100 tuples of the same value - that'd result in at least 100 buffer
accesses, but it'd be highly likely that they'll be in cache.
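For context, the buffer figures being debated come from EXPLAIN's
BUFFERS option; a minimal sketch of how to obtain them:

```sql
-- Per-node "Buffers:" lines appear in the output:
--   shared hit  = buffer accesses satisfied from shared_buffers
--   read        = buffer accesses that went to the OS page cache/disk
-- Both count accesses, not distinct blocks: the same page touched
-- 100 times during an index scan contributes 100 hits.
EXPLAIN (ANALYZE, BUFFERS)
SELECT keywords.strategy_id, results.position, results.created_at
FROM results
JOIN keywords ON results.keyword_id = keywords.id
WHERE results.account_id = 1;
```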

Greetings,

Andres Freund


Re: seq vs index scan in join query

From
Laurenz Albe
Date:
Andres Freund wrote:
> On 2017-11-29 18:17:18 +0100, Laurenz Albe wrote:
> > That is because the execution with the sequential scan touched
> > 26492  + 80492 = 106984 blocks, while the second execution touched
> > 311301 + 48510 = 359811 blocks, more than three times as many.
> 
> That's not necessarily so. What those count are buffer *accesses*,
> *not* the number of distinct blocks accessed. You'll very commonly have
> more buffer accesses in indexscans but still fewer total reads, because a
> lot of those accesses will be re-reads of pages previously read in the
> same scan. Just imagine a scan of an index with a leaf page pointing to
> 100 tuples of the same value - that'd result in at least 100 buffer
> accesses, but it'd be highly likely that they'll be in cache.

Thanks for explaining that.

Yours,
Laurenz Albe


Re: seq vs index scan in join query

From
Emanuel Alvarez
Date:
Thank you all for your responses!

On Wed, Nov 29, 2017 at 7:31 AM, legrand legrand
<legrand_legrand@hotmail.com> wrote:
> Hi,
>
> Could you give us the partition (range values) and index definitions for
> the result table?

We partition by month; partitions usually start on the 20th of each
month (the date we first partitioned the table). For the tables in
question, the constraints look like this:
"constraint_37" CHECK (created_at >= '2017-10-21 00:00:00'::timestamp
without time zone AND created_at < '2017-11-20 00:00:00'::timestamp
without time zone)
"constraint_110" CHECK (created_at >= '2017-11-20 00:00:00'::timestamp
without time zone AND created_at < '2017-12-20 00:00:00'::timestamp
without time zone)

Indexes are:
    "results_account_id_created_at_idx" btree (account_id, created_at DESC)
    "results_id_idx" btree (id)
    "results_keyword_id_created_at_idx" btree (keyword_id, created_at DESC)

On Wed, Nov 29, 2017 at 11:57 AM, Marti Raudsepp <marti@juffo.org> wrote:
> -> Index Scan using keywords_pkey on keywords
>    Buffers: shared hit=284808 read=4093
> vs
> -> Seq Scan on keywords
>    Buffers: shared read=36075
>
> Looks like the index scan's advantage in this example is a much higher
> cache hit ratio (despite touching so many more pages) and PostgreSQL
> is underestimating it.

Interesting, we were completely missing this.

> Have you tuned the effective_cache_size setting? A good starting point
> is half the total RAM in your machine. It would be interesting to see
> how high you need to set it for the planner to switch to the index
> scan plan.

We did this, and although it has an effect, it's rapidly overshadowed
by the size of the result set. For example, setting it to 10GB is OK
for one month's worth of data, but it falls back to a seq scan for two
months of data. Although we might be able to live with that for now.

On Wed, Nov 29, 2017 at 2:17 PM, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> The optimizer is right here.

Never doubted it :)

> Even though your second execution without sequential scans ran faster,
> it is worse.
>
> That is because the execution with the sequential scan touched
> 26492  + 80492 = 106984 blocks, while the second execution touched
> 311301 + 48510 = 359811 blocks, more than three times as many.
>
> The second execution was just lucky because most of these blocks were
> already cached, and it had to read only half as many blocks from disk.
>
> If you repeat the execution a couple of times, you should see that
> the execution using the sequential scans becomes faster.

In a live environment, execution times for sequential scans are always
slower, though they do vary over time. The reason for this is that the
partition for the last month's results is accessed frequently, and as
such is kept in the cache, while the other two tables, keywords and
keyword_data, are accessed sparsely, and mostly through their indexes.
This way, the indexes have a much bigger chance of being cached in
memory than other parts of the tables. In other words: the second
execution is always lucky.

Is this interpretation correct? Is there any option to deal with this
issue, besides adding more RAM?

We could, theoretically, partition keyword_data as it's also
time-based, but it's not big enough to justify partitioning. It's also
not small enough to be doing sequential scans on it all the time.


> You can boost performance even more by increasing work_mem
> so that the hash can be created in memory.

This is interesting, and has a positive effect on our queries. We are
currently testing a combination of work_mem and effective_cache_size
settings, though I'm afraid refactoring the query will be inevitable.

Thank you!


Re: seq vs index scan in join query

From
Jeff Janes
Date:
On Tue, Nov 28, 2017 at 10:55 PM, Emanuel Alvarez wrote:
> hi all,
>
> we're in the process of optimizing some queries and we've noted a case
> where the planner prefers a sequential scan instead of using an index,
> while the index scan is actually much faster. to give you some
> context: we have two main tables, keywords and results. keywords has
> approximately 700,000 rows, while results holds approximately one row
> per keyword per day (roughly 70m at the moment; not all keywords are
> active on any given day). results is currently partitioned by
> (creation) time. it's also worth noting that we use SSDs in our
> servers, and have random_page_cost set to 1.
>
> the problematic query looks like this:
>
> SELECT keywords.strategy_id, results.position, results.created_at
> FROM results
> JOIN keywords ON results.keyword_id = keywords.id
> WHERE results.account_id = 1
>   AND results.created_at >= '2017-10-25 00:00:00.000000'
>   AND results.created_at <= '2017-11-10 23:59:59.999999';
>
> as you can see in the query plan [1] a sequential scan is preferred.

I would say the preference is not for the seq scan, but rather for the
hash join. If the seq scan couldn't be fed into a hash join, it would
not look very favorable.

I think hash joins are a bit optimistic on how much cpu time they think
they use building the hash table. You can probably get better plans for
this type of query by increasing cpu_tuple_cost to 0.02 or 0.03. That
works because the hash join over the seq scan has to scan 700,000 tuples
to build the hash table, which is then probed only 70,000 times, while
the nested loop index scan just probes the 70,000 rows it needs directly
and ignores the other 90%.

...

> on the other hand, if we disable sequential scans (SET enable_seqscan
> = 0), we see that not only does the query run faster, but the cost
> seems to be lower, as seen in the query plan [2].

The costs for plan 2 don't look lower to me. 196754.90 > 120421.32

> in this example the gain isn't much: ~0.5s. but when we add a
> second join table with additional keyword data, the planner still
> prefers a sequential scan on a table that has more than 6m rows.
> the query looks like this:
>
> SELECT keywords.strategy_id, results.position, results.created_at,
>        keyword_data.volume
> FROM results
> JOIN keywords ON results.keyword_id = keywords.id
> JOIN keyword_data ON keywords.keyword_data_id = keyword_data.id
> WHERE results.account_id = 1
>   AND results.created_at >= '2017-10-25 00:00:00.000000'
>   AND results.created_at <= '2017-11-19 23:59:59.999999';
>
> in this case the query takes up to 8s; the query plan can be found in
> [3]. obviously the dataset has to be large for a sequential scan to be
> preferred on a 6m-row table. similarly, reducing the created_at range
> or using an account_id with fewer keywords makes the planner prefer an
> index scan, accelerating the query considerably.

If that is the query you are really concerned about, we would have to
see the faster plan for that query. (Or better yet, keep the created_at
range the same, and set enable_hashjoin to off to get it to switch
plans.)

This looks like a very skewed query. keyword_data has 10 rows for every
row in keywords, yet adding a join to keyword_data doesn't increase the
number of rows returned by the query at all. That is weird, isn't it?

For what it's worth, in my hands on your simpler query it likes to sort
the 70,000 qualifying rows from the "results" table, then do a merge
join against the index on keywords. And it truly is the faster option.
I have to set enable_mergejoin=off before I can get either of your
plans. Once I do, the nested loop does seem to be faster than the hash
join, but not by the twofold that you see, and they jump around quite a
bit from run to run.

Cheers,

Jeff
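Jeff's cpu_tuple_cost suggestion can be tried per session before
changing anything globally; the values below are his suggested starting
points, not verified tunings:

```sql
-- The default cpu_tuple_cost is 0.01. Raising it makes plans that
-- process many tuples (e.g. building a 700,000-row hash table) look
-- more expensive relative to plans that probe only the rows needed.
SET cpu_tuple_cost = 0.02;  -- try 0.02 or 0.03 as suggested

-- Re-plan the original query to see whether the nested loop with
-- index scans is now preferred.
EXPLAIN
SELECT keywords.strategy_id, results.position, results.created_at
FROM results
JOIN keywords ON results.keyword_id = keywords.id
WHERE results.account_id = 1
  AND results.created_at >= '2017-10-25 00:00:00.000000'
  AND results.created_at <= '2017-11-10 23:59:59.999999';

RESET cpu_tuple_cost;
```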