seq vs index scan in join query

Поиск

Список

Период

Сортировка

От	Emanuel Alvarez
Тема	seq vs index scan in join query
Дата	29 ноября 2017 г. 12:55:50
Msg-id	CAPfMJ90rmEdCXQ0ox9umUMs6=oVfomv9nc9V-bcsdLtZAJgszg@mail.gmail.com обсуждение исходный текст
Ответы	Re: seq vs index scan in join query Re: seq vs index scan in join query Re: seq vs index scan in join query Re: seq vs index scan in join query
Список	pgsql-general

Дерево обсуждения

hi all,

we're in the process of optimizing some queries and we've noted a case
where the planner prefers a sequential scan instead of using an index,
while the index scan is actually much faster. to give you some
context: we have two main tables, keywords and results. keywords has
approximately 700.000 rows; while results holds approximately one row
per keyword per day (roughly 70m at the moment, not all keywords are
active at any given day). results is currently partitioned by
(creation) time. it's also worth noting that we use SSDs in our
servers, and have random_page_cost set to 1.

the problematic query looks like this:

SELECT keywords.strategy_id, results.position, results.created_at FROM results JOIN keywords ON results.keyword_id =
keywords.idWHERE results.account_id = 1 AND results.created_at >= '2017-10-25 00:00:00.000000' AND
results.created_at<= '2017-11-10 23:59:59.999999';

as you can see in the query plan [1] a sequential scan is preferred.
as we understand it, this happens because the number of rows returned
from results is too large. if we reduce this number by either
selecting a smaller created_at range, or another account_id with fewer
keywords, the planner falls back to an index scan, confirming that the
number of rows returned from results has a direct influence in this
choice.

on the other hand, if we disable sequential scans (SET enable_seqscan
= 0), we see than not only the query runs faster but the cost seems to
be lower, as seen in the query plan [2].

in this example the gain it's not much: ~0.5s. but when we add a
second join table with additional keyword data the planner still
prefers a sequential scan on a table that has +6m rows. query looks
like this:

SELECT keywords.strategy_id, results.position, results.created_at,
keyword_data.volume FROM results JOIN keywords ON results.keyword_id = keywords.id JOIN keyword_data ON
keywords.keyword_data_id= keyword_data.id WHERE results.account_id = 1 AND results.created_at >= '2017-10-25
00:00:00.000000' AND results.created_at <= '2017-11-19 23:59:59.999999';

in this case query takes up to 8s, query plan can be found in [3].
obviously dataset has to be large to prefer a sequential on a 6m rows
table. similarly, reducing the created_at range or using an account_id
with fewer keywords makes the planner prefer index scan, accelerating
the query considerably.

currently we're exploring the option of fetching keywords data within
a subquery and feed that into the main query, which works as expected,
but also complicates the design a bit.

we'd like to know:1. why does the planner prefers a sequential scan in these cases?2. is there a way we can make the
plannerchoose a better strategy

using indexes?

thank you for your time.

[1] seq scan plan:
https://gist.github.com/emnlvrz/5e53235c82260be011d84cf264e597e7
[2] indexed plan:
https://gist.github.com/emnlvrz/8aa85edbdedcdb90d8d4f38863abc134
[3] seq scan additional join plan:
https://gist.github.com/emnlvrz/b3f13518f863f829c65f91a514f407d9

В списке pgsql-general по дате отправления:

Предыдущее

От: Tom Lane
Дата: 29 ноября 2017 г., 09:59:51
Сообщение: Re: large numbers of inserts out of memory strategy

Следующее

От: legrand legrand
Дата: 29 ноября 2017 г., 16:31:24
Сообщение: Re: seq vs index scan in join query

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

seq vs index scan in join query

Предыдущее

Следующее