Re: Unexpected expensive index scan

From: Tom Lane
Subject: Re: Unexpected expensive index scan
Date:
Msg-id: 15575.1475067892@sss.pgh.pa.us
In reply to: Re: Unexpected expensive index scan  (Jake Nielsen <jake.k.nielsen@gmail.com>)
Responses: Re: Unexpected expensive index scan  (Jake Nielsen <jake.k.nielsen@gmail.com>)
List: pgsql-performance
[ Please don't re-quote the entire damn thread in each followup. Have
some respect for your readers' time, and assume that they have already
seen the previous traffic, or could go look it up if they haven't.
The point of quoting at all is just to quickly remind people where we
are in the discussion. ]

Jake Nielsen <jake.k.nielsen@gmail.com> writes:
> So... it seems that setting the userId to one that has less rows in the
> table results in the index actually being used...
> EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM SyncerEvent WHERE userId =
> '57d35db7353b0d627c0e592f' AND ID > 12468 ORDER BY ID LIMIT 4000;

It looks from the numbers floating around in this thread that the
userId used in your original query actually matches about 50% of
the table.  That would make it unsurprising that the planner doesn't
want to use an index.  A rule of thumb is that a seqscan is going
to be cheaper than an indexscan if your query retrieves, or even
just has to fetch, more than a few percent of the table.
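
If you want to see that fraction directly, a quick count along these
lines would show it (table and column spellings taken from the quoted
query; this does a full scan, which is fine for a one-off sanity check):

    SELECT (count(*) FILTER (WHERE userId = '57d35db7353b0d627c0e592f'))::float8
           / count(*) AS match_fraction
    FROM SyncerEvent;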

Now, given the existence of an index on (userID, ID) --- in that
order --- I would expect the planner to want to use that index
for a query shaped exactly as you show above.  Basically, it knows
that that just requires starting at the ('57d35db7353b0d627c0e592f',
12468) position in the index and scanning forward for 4000 index
entries; no extraneous table rows will be fetched at all.  If you
increased the LIMIT enough, it'd go over to a seqscan-and-sort to
avoid doing so much random access to the table, but I'd think the
crossover point for that is well above 4000 out of 3.3M rows.
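
For the record, the index I mean is shaped like this (the name here is
made up; the column order is what matters):

    CREATE INDEX syncerevent_userid_id_idx
        ON SyncerEvent (userId, ID);

4000 rows out of 3.3M is only about 0.12% of the table, far below the
few-percent rule of thumb above, so the pure index scan should win
comfortably for the query exactly as shown.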

However, as soon as you add any other unindexable conditions,
the situation changes because rows that fail the additional
conditions represent useless fetches.  Now, instead of fetching
4000 rows using the index, it's fetching 4000 times some multiplier.
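
To make that concrete, imagine the original query was shaped like this
("someTimestamp" is a stand-in for whatever extra column you were
actually filtering on):

    SELECT * FROM SyncerEvent
    WHERE userId = '57d35db7353b0d627c0e592f'
      AND ID > 12468
      AND someTimestamp > '2016-09-01'   -- extra, unindexable condition
    ORDER BY ID
    LIMIT 4000;

If only one row in four passes that last condition, the indexscan has
to fetch roughly 16000 heap rows to deliver 4000 survivors, and the
planner costs it accordingly.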

It's hard to tell for sure given the available info, but I think
that the extra inequalities in your original query reject a pretty
sizable proportion of rows, resulting in the indexscan approach
needing to fetch a great deal more than 4000 rows, making it look
to be more expensive than a seqscan.

I'm not sure why it's preferring the pkey index to the one on
(userID, ID), but possibly that has something to do with that
index being better correlated to the physical table order, resulting
in a prediction of less random I/O when using that index.
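
You can check the planner's view of that with pg_stats (identifiers
are stored lower-case unless you quoted them at creation time):

    SELECT attname, correlation
    FROM pg_stats
    WHERE tablename = 'syncerevent'
      AND attname IN ('id', 'userid');

A correlation near 1.0 for "id" (typical when rows are inserted in
ID order) would explain the preference for the pkey index.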

So the bottom line is that given your data statistics, there may
well be no really good plan for your original query.  It just
requires fetching a lot of rows, and indexes can't help very much.

If you say "well yeah, but it seems to perform fine when I force
it to use that index anyway", the answer may be that you need to
adjust random_page_cost.  The default value is OK for tables that
are mostly sitting on spinning rust, but if your database is
RAM-resident or SSD-resident you probably want a value closer to 1.
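
For example, you can experiment in one session before changing anything
globally (1.1 is a common starting point for SSDs, not a magic number):

    SET random_page_cost = 1.1;

    -- and if that fixes the plan, make it stick:
    ALTER SYSTEM SET random_page_cost = 1.1;
    SELECT pg_reload_conf();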

            regards, tom lane

