Обсуждение: index choosing problem

Поиск
Список
Период
Сортировка

index choosing problem

От
Rural Hunter
Дата:
I have a table with serveral million records. they are divided into
about one hundred catagory(column cid). I created index includes the cid
as the first column. I had a problem with some cids they only have few
records comparing with other cids. Some of them only have serveral
thousand rows. Some queries are not using index on the cids. I got the
explain for the queries.
Note:
article_others_cid_time_style_idx is the index contains cid as the first
column
article_others_pkey is the primary key on an auto incremented column aid.

# select count(*) from article_others;
   count
---------
  6888459
(1 row)

# select count(*) from article_others where cid=74;
  count
-------
   4199
(1 row)

1. # explain select count(*) from article_others where cid=74;
                                                       QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
  Aggregate  (cost=32941.95..32941.96 rows=1 width=0)
    ->  Index Scan using article_others_cid_time_style_idx on
article_others  (cost=0.00..32909.34 rows=13047 width=0)
          Index Cond: (cid = 74)
(3 rows)

2. # explain select aid from article_others where cid=74 limit 10;
                                                       QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
  Limit  (cost=0.00..25.22 rows=10 width=8)
    ->  Index Scan using article_others_cid_time_style_idx on
article_others  (cost=0.00..32909.34 rows=13047 width=8)
          Index Cond: (cid = 74)
(3 rows)

3. # explain select aid from article_others where cid=74 order by aid
desc limit 10;
                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
  Limit  (cost=0.00..1034.00 rows=10 width=8)
    ->  Index Scan Backward using article_others_pkey on article_others
(cost=0.00..1349056.65 rows=13047 width=8)
          Filter: (cid = 74)
(3 rows)

4. # explain select aid from article_others where cid=74 order by aid
desc limit 1;
                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
  Limit  (cost=0.00..103.40 rows=1 width=8)
    ->  Index Scan Backward using article_others_pkey on article_others
(cost=0.00..1349060.65 rows=13047 width=8)
          Filter: (cid = 74)
(3 rows)

5. # explain select max(aid) from article_others where cid=74;
                                                         QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------
  Result  (cost=104.70..104.71 rows=1 width=0)
    InitPlan 1 (returns $0)
      ->  Limit  (cost=0.00..104.70 rows=1 width=8)
            ->  Index Scan Backward using article_others_pkey on
article_others  (cost=0.00..1365988.55 rows=13047 width=8)
                  Index Cond: (aid IS NOT NULL)
                  Filter: (cid = 74)
(6 rows)

Now the query 3-5 using article_others_pkey are quite slow. The rows for
cid 74 are very old and seldom get updated. I think pg needs to scan
quite a lot on article_others_pkey before it gets the rows for cid 74.
The same query for other cids with new and majority of rows runs very
fast. for example:
# explain select max(aid) from article_others where cid=258;
                                                          QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------
  Result  (cost=1.54..1.55 rows=1 width=0)
    InitPlan 1 (returns $0)
      ->  Limit  (cost=0.00..1.54 rows=1 width=8)
            ->  Index Scan Backward using article_others_pkey on
article_others  (cost=0.00..1366260.55 rows=889520 width=8)
                  Index Cond: (aid IS NOT NULL)
                  Filter: (cid = 258)

So I think if pg chooses to use index article_others_cid_time_style_idx
the performance would be much better. or any other solution I can take
to improve the query performance for those cids like 74?
Another question, why the plan shows rows=13047 for cid=74 while
actually it only has 4199 rows? There is almost no data changes for cid
74 and I just vacuum/analyzed the table this morning.

Re: index choosing problem

От
Ants Aasma
Дата:
2012/3/15 Rural Hunter <ruralhunter@gmail.com>:
> Now the query 3-5 using article_others_pkey are quite slow. The rows for cid
> 74 are very old and seldom get updated. I think pg needs to scan quite a lot
> on article_others_pkey before it gets the rows for cid 74. The same query
> for other cids with new and majority of rows runs very fast.

This is because the PostgreSQL cost model doesn't know about the
correlation between aid and cid. In absence of information it assumes
that it will find a row with cid=74 about every 68 rows
(889520/13047).

One option to fix this case is to use OFFSET 0 as an optimization barrier:
SELECT max(aid) FROM
    (SELECT aid FROM article_others WHERE cid=74 OFFSET 0) AS x;

That has the unfortunate effect of performing badly for cid's that are
extremely popular. That may or may not be acceptable in your case.

To fix this properly the query optimizer needs to know the relationship between
aid and cid and needs to know how to apply that to estimating the cost
of index scans. A prerequisite for implementing this is to have
multi-column statistics. To do the estimation, the current linear cost
model needs to be changed to something that can express a non-linear
relationship between tuples returned and cost, e.g. a piece-wise
linear model. The stats collection part is actually feasible, in fact
I'm currently working on a patch for that. As for the estimation
improvement, I have an idea how it might work, but I'm not really sure
yet if the performance hit for query planning would be acceptable.

> Another question, why the plan shows rows=13047 for cid=74 while actually it
> only has 4199 rows? There is almost no data changes for cid 74 and I just
> vacuum/analyzed the table this morning.

Might just be an artifact of random sampling. Try raising your stats
target and re-analyzing to confirm.

All the best,
Ants Aasma