Discussion: [PERFORM] Querying with multicolumn index


[PERFORM] Querying with multicolumn index

From
Eric Jiang
Date:
Hi all,
I have a query that I *think* should use a multicolumn index, but
sometimes doesn't, resulting in slow queries.

We have a DB that records GPS coordinates for vehicles:

                               Table "public.updates"
   Column   |           Type           |                      Modifiers
------------+--------------------------+------------------------------------------------------
 id         | integer                  | not null default
nextval('updates_id_seq'::regclass)
 driver_id  | integer                  | not null
 latitude   | double precision         | not null
 longitude  | double precision         | not null
 time       | timestamp with time zone | not null default now()
 vehicle_id | integer                  |
Indexes:
    "updates_pkey" PRIMARY KEY, btree (id)
    "ix_updates_time" btree ("time")
    "updates_driver_id_time_idx" btree (driver_id, "time")
    "updates_vehicle_id_time_idx" btree (vehicle_id, "time")

Table has about 15M records across 100 distinct driver_id.

I want to get the last record for a specific driver:

SELECT * FROM updates WHERE driver_id=123 ORDER BY "time" DESC LIMIT 1;

For some values of driver_id, it does what I expect and uses
updates_driver_id_time_idx to fetch the records in 2 ms or less. For
other values of driver_id, it does an index scan backwards on
ix_updates_time, taking upwards of 2 minutes.

Good plan:

 Limit  (cost=0.11..1.38 rows=1 width=56) (actual time=2.710..2.710
rows=1 loops=1)
   ->  Index Scan Backward using updates_driver_id_time_idx on updates
 (cost=0.11..139278.28 rows=110051 width=56) (actual time=2.709..2.709
rows=1 loops=1)
         Index Cond: (driver_id = 17127)
 Total runtime: 2.732 ms
(4 rows)

Bad plan:

 Limit  (cost=0.09..0.69 rows=1 width=56) (actual
time=216769.111..216769.112 rows=1 loops=1)
   ->  Index Scan Backward using ix_updates_time on updates
(cost=0.09..272339.04 rows=448679 width=56) (actual
time=216769.110..216769.110 rows=1 loops=1)
         Filter: (driver_id = 30132)
         Rows Removed by Filter: 5132087
 Total runtime: 216769.174 ms


From cursory testing, the difference seems to be based on how many
total rows there are for a particular driver. The above query uses
updates_driver_id_time_idx for drivers with less than about 300K rows,
but uses ix_updates_time for drivers with more than about 300K rows.

Anything we can do to make it do the "right" thing? We are also
considering denormalizing the data and keeping a "cache" of the same
data in another table.

pgsql version: 9.3.14 and 9.5.3, already tried vacuum analyze.

Thanks,
Eric


--
Eric Jiang, DoubleMap
eric@doublemap.com | www.doublemap.com
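
A note for readers of the archive, since the question is still open at this point: one commonly suggested workaround, not tried in this thread, is to fence off the driver_id filter so the planner cannot reach for the time-ordered index. On 9.3/9.5 a CTE is an optimization fence, so a sketch would be:

WITH driver_rows AS (
    -- the fence forces this scan to be planned on its own, i.e. on the
    -- driver_id indexes; all of the driver's rows are then read and
    -- top-1 sorted: robust, but not as fast as a pure index probe
    SELECT * FROM updates WHERE driver_id = 123
)
SELECT * FROM driver_rows ORDER BY "time" DESC LIMIT 1;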


Re: [PERFORM] Querying with multicolumn index

From
Andreas Joseph Krogh
Date:
On Friday, 09 December 2016 at 18:00:16, Eric Jiang <eric@doublemap.com> wrote:
Hi all,
I have a query that I *think* should use a multicolumn index, but
sometimes doesn't, resulting in slow queries.

We have a DB that records GPS coordinates for vehicles:

                               Table "public.updates"
   Column   |           Type           |                      Modifiers
------------+--------------------------+------------------------------------------------------
 id         | integer                  | not null default
nextval('updates_id_seq'::regclass)
 driver_id  | integer                  | not null
 latitude   | double precision         | not null
 longitude  | double precision         | not null
 time       | timestamp with time zone | not null default now()
 vehicle_id | integer                  |
Indexes:
    "updates_pkey" PRIMARY KEY, btree (id)
    "ix_updates_time" btree ("time")
    "updates_driver_id_time_idx" btree (driver_id, "time")
    "updates_vehicle_id_time_idx" btree (vehicle_id, "time")

Table has about 15M records across 100 distinct driver_id.

I want to get the last record for a specific driver:

SELECT * FROM updates WHERE driver_id=123 ORDER BY "time" DESC LIMIT 1;

For some values of driver_id, it does what I expect and uses
updates_driver_id_time_idx to fetch the records in 2 ms or less. For
other values of driver_id, it does an index scan backwards on
ix_updates_time, taking upwards of 2 minutes.

Good plan:

 Limit  (cost=0.11..1.38 rows=1 width=56) (actual time=2.710..2.710
rows=1 loops=1)
   ->  Index Scan Backward using updates_driver_id_time_idx on updates
 (cost=0.11..139278.28 rows=110051 width=56) (actual time=2.709..2.709
rows=1 loops=1)
         Index Cond: (driver_id = 17127)
 Total runtime: 2.732 ms
(4 rows)

Bad plan:

 Limit  (cost=0.09..0.69 rows=1 width=56) (actual
time=216769.111..216769.112 rows=1 loops=1)
   ->  Index Scan Backward using ix_updates_time on updates
(cost=0.09..272339.04 rows=448679 width=56) (actual
time=216769.110..216769.110 rows=1 loops=1)
         Filter: (driver_id = 30132)
         Rows Removed by Filter: 5132087
 Total runtime: 216769.174 ms


From cursory testing, the difference seems to be based on how many
total rows there are for a particular driver. The above query uses
updates_driver_id_time_idx for drivers with less than about 300K rows,
but uses ix_updates_time for drivers with more than about 300K rows.

Anything we can do to make it do the "right" thing? We are also
considering denormalizing the data and keeping a "cache" of the same
data in another table.

pgsql version: 9.3.14 and 9.5.3, already tried vacuum analyze.

Thanks,
Eric
 
You should have this index:
 
create index updates_driver_time_idx ON updates(driver_id, "time" DESC);
 
--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
 
Attachments

Re: [PERFORM] Querying with multicolumn index

From
Eric Jiang
Date:
On Fri, Dec 9, 2016 at 9:56 AM, Andreas Joseph Krogh <andreas@visena.com> wrote:
You should be having this index:
 
create index updates_driver_time_idx ON updates(driver_id, "time" DESC);

I'm not sure I understand the intent of this fix - are you saying that btree indexes only work in a certain direction?

I created this index and the query plans did not change.
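
For the archives, on the direction question: a plain btree can be scanned forward or backward, so for this query the DESC variant adds nothing, which is consistent with the plans not changing. DESC in an index definition only matters for mixed-direction orderings (e.g. ORDER BY a ASC, b DESC). A sketch, with hypothetical index names:

-- both can return rows in "time" DESC order for a fixed driver_id:
CREATE INDEX updates_drv_t      ON updates (driver_id, "time");      -- read backward
CREATE INDEX updates_drv_t_desc ON updates (driver_id, "time" DESC); -- read forward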

--
Eric Jiang, DoubleMap
eric@doublemap.com | www.doublemap.com

Re: [PERFORM] Querying with multicolumn index

From
Tom Lane
Date:
Eric Jiang <eric@doublemap.com> writes:
> I have a query that I *think* should use a multicolumn index, but
> sometimes doesn't, resulting in slow queries.

I tried to duplicate this behavior, without success.  Are you running
with nondefault planner parameters?

            regards, tom lane
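
For anyone wanting to answer that question on their own system, one way is to ask pg_settings which parameters deviate from their built-in defaults; a sketch, with the usual planner suspects listed explicitly:

SELECT name, setting, source
FROM pg_settings
WHERE source <> 'default'
  AND name IN ('random_page_cost', 'seq_page_cost', 'cpu_tuple_cost',
               'cpu_index_tuple_cost', 'cpu_operator_cost',
               'effective_cache_size');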


Re: [PERFORM] Querying with multicolumn index

From
Daniel Blanch Bataller
Date:
Hi

As a quick fix: have you considered dropping the ix_updates_time index?

I’ve been able to reproduce the issue, but with bigger numbers than yours. When I dropped ix_updates_time it went much, much faster. It uses the updates_driver_id_time_idx index instead.

For some reason the planner does not make a good estimate in this case. Can you show us EXPLAIN (ANALYZE, BUFFERS) before and after dropping the ix_updates_time index? Can you also show us the output of SHOW shared_buffers;?

I suspect the issue has to do with a low shared_buffers configuration and cache misses, and maybe some suboptimal cost configuration. I’ll try to find out; if anyone can enlighten us it will be very welcome.


P.S. Meanwhile, if you still need the 'time' index, you can create an index using 'time' and 'customer_id', in that order.



Cheers,

Daniel Blanch.


> On Dec 9, 2016, at 18:00, Eric Jiang <eric@doublemap.com> wrote:
>
> Hi all,
> I have a query that I *think* should use a multicolumn index, but
> sometimes doesn't, resulting in slow queries.
>
> We have a DB that records GPS coordinates for vehicles:
>
>                               Table "public.updates"
>   Column   |           Type           |                      Modifiers
> ------------+--------------------------+------------------------------------------------------
> id         | integer                  | not null default
> nextval('updates_id_seq'::regclass)
> driver_id  | integer                  | not null
> latitude   | double precision         | not null
> longitude  | double precision         | not null
> time       | timestamp with time zone | not null default now()
> vehicle_id | integer                  |
> Indexes:
>    "updates_pkey" PRIMARY KEY, btree (id)
>    "ix_updates_time" btree ("time")
>    "updates_driver_id_time_idx" btree (driver_id, "time")
>    "updates_vehicle_id_time_idx" btree (vehicle_id, "time")
>
> Table has about 15M records across 100 distinct driver_id.
>
> I want to get the last record for a specific driver:
>
> SELECT * FROM updates WHERE driver_id=123 ORDER BY "time" DESC LIMIT 1;
>
> For some values of driver_id, it does what I expect and uses
> updates_driver_id_time_idx to fetch the records in 2 ms or less. For
> other values of driver_id, it does an index scan backwards on
> ix_updates_time, taking upwards of 2 minutes.
>
> Good plan:
>
> Limit  (cost=0.11..1.38 rows=1 width=56) (actual time=2.710..2.710
> rows=1 loops=1)
>   ->  Index Scan Backward using updates_driver_id_time_idx on updates
> (cost=0.11..139278.28 rows=110051 width=56) (actual time=2.709..2.709
> rows=1 loops=1)
>         Index Cond: (driver_id = 17127)
> Total runtime: 2.732 ms
> (4 rows)
>
> Bad plan:
>
> Limit  (cost=0.09..0.69 rows=1 width=56) (actual
> time=216769.111..216769.112 rows=1 loops=1)
>   ->  Index Scan Backward using ix_updates_time on updates
> (cost=0.09..272339.04 rows=448679 width=56) (actual
> time=216769.110..216769.110 rows=1 loops=1)
>         Filter: (driver_id = 30132)
>         Rows Removed by Filter: 5132087
> Total runtime: 216769.174 ms
>
>
> From cursory testing, the difference seems to be based on how many
> total rows there are for a particular driver. The above query uses
> updates_driver_id_time_idx for drivers with less than about 300K rows,
> but uses ix_updates_time for drivers with more than about 300K rows.
>
> Anything we can do to make it do the "right" thing? We are also
> considering denormalizing the data and keeping a "cache" of the same
> data in another table.
>
> pgsql version: 9.3.14 and 9.5.3, already tried vacuum analyze.
>
> Thanks,
> Eric
>
>
> --
> Eric Jiang, DoubleMap
> eric@doublemap.com | www.doublemap.com
>
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance



Re: [PERFORM] Querying with multicolumn index

From
Eric Jiang
Date:
We aren't using any special planner settings - all enable_* options are "on".

On Sat, Dec 10, 2016 at 12:06 AM, Daniel Blanch Bataller
<daniel.blanch.bataller@gmail.com> wrote:
> As a quick fix: Have you considered dropping ix_updates_time index?

We do occasionally want to use ix_updates_time, although not very often.

> I’ve been able to reproduce the issue, but with bigger numbers than yours. When I dropped ix_updates_time it went much, much faster. It uses the updates_driver_id_time_idx index instead.
>
> For some reason the planner does not make a good estimate in this case. Can you show us EXPLAIN (ANALYZE, BUFFERS) before and after dropping the ix_updates_time index? Can you also show us the output of SHOW shared_buffers;?

Here's EXPLAIN (ANALYZE, BUFFERS) with the above bad query on a cold cache:

 Limit  (cost=0.09..0.70 rows=1 width=56) (actual
time=244846.915..244846.915 rows=1 loops=1)
   Buffers: shared hit=3999254 read=57831
   I/O Timings: read=242139.661
   ->  Index Scan Backward using ix_updates_time on updates
(cost=0.09..271696.74 rows=442550 width=56) (actual
time=244846.913..244846.913 rows=1 loops=1)
         Filter: (driver_id = 30132)
         Rows Removed by Filter: 5316811
         Buffers: shared hit=3999254 read=57831
         I/O Timings: read=242139.661
 Total runtime: 244846.946 ms

and after dropping ix_updates_time:

 Limit  (cost=0.11..0.98 rows=1 width=56) (actual time=2.270..2.271
rows=1 loops=1)
   Buffers: shared hit=1 read=4
   I/O Timings: read=2.230
   ->  Index Scan Backward using updates_driver_id_time_idx on updates
 (cost=0.11..382307.69 rows=442550 width=56) (actual time=2.270..2.270
rows=1 loops=1)
         Index Cond: (driver_id = 30132)
         Buffers: shared hit=1 read=4
         I/O Timings: read=2.230
 Total runtime: 2.305 ms

and `SHOW shared_buffers;`

 shared_buffers
----------------
 244MB

> I suspect the issue has to do with a low shared_buffers configuration and cache misses, and maybe some suboptimal cost configuration. I’ll try to find out; if anyone can enlighten us it will be very welcome.
>
>
> P.S. Meanwhile, if you still need the 'time' index, you can create an index using 'time' and 'customer_id', in that order.

Did you mean an index on (time, driver_id)? I did:

CREATE INDEX CONCURRENTLY ix_updates_time_driver_id ON updates
("time", driver_id);

but it seems the planner will still use it for driver_id values having
more than ~300K rows:

 Limit  (cost=0.11..0.79 rows=1 width=56) (actual
time=115.051..115.052 rows=1 loops=1)
   Buffers: shared hit=20376
   ->  Index Scan Backward using ix_updates_time_driver_id on updates
(cost=0.11..302189.90 rows=443924 width=56) (actual
time=115.048..115.048 rows=1 loops=1)
         Index Cond: (driver_id = 30132)
         Buffers: shared hit=20376
 Total runtime: 115.091 ms

It does seem faster than with an index on just "time", but
still not optimal.
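
A gloss on why this index helps at all: with (time, driver_id) the driver_id = 30132 qualifier appears as an Index Cond rather than a Filter, so it is checked inside the index without fetching heap rows, but the scan still walks the time-ordered entries until it reaches one for this driver, hence the 20,376 buffers here versus 5 with the driver-leading index.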

Really appreciate everyone's help!

--
Eric Jiang, DoubleMap
eric@doublemap.com | www.doublemap.com


Re: [PERFORM] Querying with multicolumn index

From
Tomas Vondra
Date:
Hi,

On 12/10/2016 12:51 AM, Tom Lane wrote:
> Eric Jiang <eric@doublemap.com> writes:
>> I have a query that I *think* should use a multicolumn index, but
>> sometimes doesn't, resulting in slow queries.
>
> I tried to duplicate this behavior, without success.  Are you running
> with nondefault planner parameters?
>

My guess is this is a case of LIMIT assuming the matching rows are
uniformly distributed in the input data. The planner likely concludes that
for a driver with a lot of data we'll find the first row using
ix_updates_time very quickly, and that it will be cheaper than inspecting
the larger multi-column index. But imagine a driver with lots of data, all
of it a long time ago. That breaks the LIMIT assumption fairly quickly.
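
To put rough numbers on that (mine, not from the mail): with an estimated 448,679 matching rows out of ~15M, a uniform spread implies the backward time-ordered scan hits its first match after roughly 15,000,000 / 448,679 ≈ 33 rows, which is why the bad plan's Limit node is costed at only 0.09..0.69. The 5,132,087 rows actually removed by the filter show how badly that assumption failed for this driver.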

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [PERFORM] Querying with multicolumn index

From
Tom Lane
Date:
Eric Jiang <eric@doublemap.com> writes:
> We aren't using any special planner settings - all enable_* options are "on".

No, I'm asking about the cost settings (random_page_cost etc).  The cost
estimates you're showing seem impossible with the default settings.

            regards, tom lane


Re: [PERFORM] Querying with multicolumn index

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On 12/10/2016 12:51 AM, Tom Lane wrote:
>> I tried to duplicate this behavior, without success.  Are you running
>> with nondefault planner parameters?

> My guess is this is a case of LIMIT assuming the matching rows are
> uniformly distributed in the input data. The planner likely concludes that
> for a driver with a lot of data we'll find the first row using
> ix_updates_time very quickly, and that it will be cheaper than inspecting
> the larger multi-column index. But imagine a driver with lots of data, all
> of it a long time ago. That breaks the LIMIT assumption fairly quickly.

The fact that it's slow enough to be a problem is doubtless related to
that effect.  But AFAICS, the planner should never prefer that index
for this query, because even with a uniform-density assumption, the
index that really matches the query ought to look better.

            regards, tom lane


Re: [PERFORM] Querying with multicolumn index

From
Eric Jiang
Date:
On Sat, Dec 10, 2016 at 4:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> We aren't using any special planner settings - all enable_* options are "on".
>
> No, I'm asking about the cost settings (random_page_cost etc).  The cost
> estimates you're showing seem impossible with the default settings.

Tom, really appreciate your pointers. This problem was occurring on
Heroku Postgres databases, and they seem to have set different cost
constants. I tried using SET LOCAL to set them back to the default
settings before running EXPLAIN.

My testing here shows that resetting all of random_page_cost,
cpu_tuple_cost, cpu_index_tuple_cost, and cpu_operator_cost does not
change the plan (but does change the cost estimates), while setting
effective_cache_size alone will change the plan.

Specifically, changing only effective_cache_size from '900000kB' to
'4GB' caused the planner to prefer the optimal index
updates_driver_id_time_idx.

Is increasing the DB's RAM the correct fix for this problem? It seems
to me that no matter how much cache is available, looking at the
(driver_id, time) index is always the optimal choice for this query.
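
Worth noting (a gloss, not from the thread): effective_cache_size is only the planner's estimate of how much data the OS cache plus shared_buffers can hold; it allocates nothing, so raising it to reflect the machine's actual cache does not require more RAM. A per-session experiment might look like this, with illustrative values:

BEGIN;
SET LOCAL effective_cache_size = '4GB';  -- planner hint only, no memory is allocated
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM updates WHERE driver_id = 123 ORDER BY "time" DESC LIMIT 1;
ROLLBACK;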

Thanks,
Eric

--
Eric Jiang, DoubleMap
eric@doublemap.com | www.doublemap.com


Re: [PERFORM] Querying with multicolumn index

From
Daniel Blanch Bataller
Date:
Hi all,

Tomas is absolutely right. The distribution I synthetically made had 6M records under one key, but all of them very old, behind 9M newer records; as you can see, it had to skip 9M records before finding a suitable record using the time index.

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM updates WHERE driver_id = 100 ORDER BY "time" DESC LIMIT 1;
                                                                         QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.44..0.65 rows=1 width=36) (actual time=3827.807..3827.807 rows=1 loops=1)
   Buffers: shared hit=24592 read=99594 written=659
   ->  Index Scan Backward using updates_time_idx on updates  (cost=0.44..1284780.53 rows=6064800 width=36) (actual time=3827.805..3827.805 rows=1 loops=1)
         Filter: (driver_id = 100)
         Rows Removed by Filter: 9000000
         Buffers: shared hit=24592 read=99594 written=659
 Planning time: 0.159 ms
 Execution time: 3827.846 ms
(8 rows)


Here you have my tests where I was able to reproduce the problem using default settings on 9.6, 9.5 and 9.3. 9.6 and 9.5 chose the wrong index, while 9.3 didn’t. (Update: 9.5 didn’t fail the last time.)

<test_bad_index_choice.sql> <bad_idx_choice.9.6.out> <bad_idx_choice.9.5.out> <bad_idx_choice.9.3.out>

However, when I tried to add more than one value with this strange distribution (~30% of the rows under a single value), the bad index choice problem didn’t happen again in any of the versions.

I hope this helps. Regards,

Daniel Blanch.


> On Dec 10, 2016, at 21:34, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> Hi,
>
> On 12/10/2016 12:51 AM, Tom Lane wrote:
>> Eric Jiang <eric@doublemap.com> writes:
>>> I have a query that I *think* should use a multicolumn index, but
>>> sometimes doesn't, resulting in slow queries.
>>
>> I tried to duplicate this behavior, without success.  Are you running
>> with nondefault planner parameters?
>>
>
> My guess is this is a case of LIMIT assuming the matching rows are uniformly distributed in the input data. The planner likely concludes that for a driver with a lot of data we'll find the first row using ix_updates_time very quickly, and that it will be cheaper than inspecting the larger multi-column index. But imagine a driver with lots of data, all of it a long time ago. That breaks the LIMIT assumption fairly quickly.
>
> regards
>
> --
> Tomas Vondra                  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance


Attachments

Re: [PERFORM] Querying with multicolumn index

From
Daniel Blanch Bataller
Date:
Hi all,

If anyone is still interested in the issue, I think I have a very plausible explanation for the bad index choice PostgreSQL makes on Eric’s data: a bloated updates_driver_id_time_idx index.

Though it’s possible to fool the PostgreSQL planner, as I’ve shown in previous tests, this happens with a very specific data distribution: ~100 evenly distributed keys over 15M records plus ~6M records under one single key. If you play a bit with the figures it doesn’t happen anymore.

Eric’s data distribution wasn’t that extreme; as far as I know he had ~100K vs ~500K row distributions … Well, I’ve been able to reproduce the problem with a data distribution close to Eric’s. I made it by creating a ‘bloated’ index: if the optimal index is too big, Postgres tries another, suboptimal index, in this case the ‘time’ index.
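
A quick way to spot such a size difference, using standard catalog functions (an addition here, complementing the relpages query used below):

SELECT relname, pg_size_pretty(pg_relation_size(oid)) AS size
FROM pg_class
WHERE relname LIKE 'updates%';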

See this excerpt of my tests results:

(..)
-- populate table with 99 homogeneously distributed values
INSERT INTO updates SELECT q, q % 99, q, q, to_timestamp(q), q % 99 FROM generate_series(1, 15000000) q;
INSERT 0 15000000
Time: 65686,547 ms
-- populate table with 1 value having 500K rows, similar to the distribution you posted.
INSERT INTO updates SELECT q + 15000000, 100, q, q, to_timestamp(q), -- timestamps run from 1 up to 500K
100 FROM generate_series(1, 500000) q;
INSERT 0 500000
Time: 2463,073 ms
-- add constraints and indexes

(…)

-- create a 'bloated' (driver_id, time) index.
CREATE INDEX ON updates (driver_id, "time") WITH (fillfactor = 10);
CREATE INDEX
Time: 41234,091 ms
-- check index sizes; updates_driver_id_time_idx is huge.
SELECT relname, relpages FROM pg_class WHERE relname LIKE 'updates%';
           relname           | relpages 
-----------------------------+----------
 updates                     |   129167
 updates_driver_id_time_idx  |   576919  
 updates_id_seq              |        1
 updates_pkey                |    42502
 updates_time_idx            |    42502
 updates_vehicle_id_time_idx |    59684
(6 rows)

Time: 16,810 ms
-- check behavior with bloated index
ANALYZE updates;
ANALYZE
Time: 254,917 ms

(..)

Time: 4,635 ms
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM updates WHERE driver_id = 100 ORDER BY "time" DESC LIMIT 1;
                                                                         QUERY PLAN                                                                         
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..1.91 rows=1 width=36) (actual time=21486.015..21486.015 rows=1 loops=1)
   Buffers: shared hit=39618 read=160454 written=592
   ->  Index Scan Backward using updates_time_idx on updates  (cost=0.43..691283.45 rows=469134 width=36) (actual time=21486.014..21486.014 rows=1 loops=1)
         Filter: (driver_id = 100)
         Rows Removed by Filter: 14500000
         Buffers: shared hit=39618 read=160454 written=592
 Planning time: 0.171 ms
 Execution time: 21486.068 ms
(8 rows)

Time: 21486,905 ms
-- rebuild index with default fillfactor
ALTER INDEX updates_driver_id_time_idx SET (fillfactor = 90);
ALTER INDEX
Time: 0,682 ms
REINDEX INDEX updates_driver_id_time_idx;
REINDEX
Time: 23559,530 ms
-- recheck index sizes; updates_driver_id_time_idx should now look pretty similar to the others.
SELECT relname, relpages FROM pg_class WHERE relname LIKE 'updates%';
           relname           | relpages 
-----------------------------+----------
 updates                     |   129167
 updates_driver_id_time_idx  |    59684
 updates_id_seq              |        1
 updates_pkey                |    42502
 updates_time_idx            |    42502
 updates_vehicle_id_time_idx |    59684
(6 rows)

Time: 0,452 ms
-- check behavior with regular sized index
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM updates WHERE driver_id = 100 ORDER BY "time" DESC LIMIT 1;
                                                                          QUERY PLAN                                                                          
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.56..1.69 rows=1 width=36) (actual time=0.032..0.033 rows=1 loops=1)
   Buffers: shared hit=2 read=3
   ->  Index Scan Backward using updates_driver_id_time_idx on updates  (cost=0.56..529197.34 rows=469133 width=36) (actual time=0.032..0.032 rows=1 loops=1)
         Index Cond: (driver_id = 100)
         Buffers: shared hit=2 read=3
 Planning time: 0.074 ms
 Execution time: 0.046 ms
(7 rows)

Time: 0,312 ms


@Eric

How to solve the problem:

First of all, check whether this is the case: compare index sizes. If you have this problem, updates_driver_id_time_idx should be significantly bigger than the others.

SELECT relname, relpages FROM pg_class WHERE relname LIKE 'updates%';

Check the index configuration to see if you have a different fillfactor setting:

\d+ updates_driver_id_time_idx

If you have set up a different fillfactor, set it back to normal, that is, 90%. I don’t see why you should have a low fillfactor; your data doesn’t seem to get frequent updates. On the contrary, it seems you only write and read data.

ALTER INDEX updates_driver_id_time_idx SET (fillfactor = 90);

If your index fillfactor is normal, there is a chance it got bloated, but this is rare. 

Reindex your data.

REINDEX INDEX updates_driver_id_time_idx;

Run tests again.


Regards,

Daniel Blanch.

P.S. Here you have my full tests and output; you might find them useful. Don’t forget to show us index sizes and index configuration, please.





Attachments