Re: Querying distinct values from a large table
От | imad |
---|---|
Тема | Re: Querying distinct values from a large table |
Дата | |
Msg-id | 1f30b80c0701300339t1fee83b4x138a7d139874e9e7@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: Querying distinct values from a large table (Richard Huxton <dev@archonet.com>) |
Список | pgsql-performance |
To be sure about the performance of index scan, try forcing the planner to use it instead of seq scan. A way might be to force the planner to use index scan on your table by using a dummy where clause. Try using a condition in your where clause which holds true for all rows. --Imad www.EnterpriseDB.com On 1/30/07, Richard Huxton <dev@archonet.com> wrote: > Igor Lobanov wrote: > > > > > > Richard Huxton wrote: > >>> I have rather large table with about 5 millions of rows and a dozen > >>> of columns. Let's suppose that columns are named 'a', 'b', 'c' etc. I > >>> need to query distinct pairs of ('a';'b') from this table. > > > > >> What version of PostgreSQL is it? > > > > 8.1.4 > > Current release is 8.1.6 - probably worth upgrading when you've got > time. It should be a simple matter of replacing the binaries but do > check the release notes. > > >> How many distinct values are you getting back from your 5 million > >> rows? If there are too many, an index isn't going to help. > > > > No more than 10,000. > > OK. Should be possible to do something then. > > >> Can you share the EXPLAIN ANALYSE output? You might want to try > >> increasing work_mem for this one query to speed any sorting. > > > > Real table and colum names are obfuscated because of NDA, sorry. > > Fair enough. > > > explain analyze select distinct a, b from tbl > > > > EXPLAIN ANALYZE output is: > > > > Unique (cost=500327.32..525646.88 rows=1848 width=6) (actual > > time=52719.868..56126.356 rows=5390 loops=1) > > -> Sort (cost=500327.32..508767.17 rows=3375941 width=6) (actual > > time=52719.865..54919.989 rows=3378864 loops=1) > > Sort Key: a, b > > -> Seq Scan on tbl (cost=0.00..101216.41 rows=3375941 > > width=6) (actual time=16.643..20652.610 rows=3378864 loops=1) > > Total runtime: 57307.394 ms > > Hmm - am I right in thinking (a,b) are the only two columns on this > table? That means you'll have a lot of rows per page and an index scan > could end up fetching lots of pages to check the rows are visible. Still > - I'd expect it to be better than a seq scan. > > The first thing to try is to put the index back and run "SET > enable_seqscan=off" before the explain analyse. That should force it to > use the index. Then we'll see what costs it's expecting. > > >> How often is the table updated? Clustering might buy you some > >> improvements (but not a huge amount I suspect). > > > > It is updated once per 3-5 seconds. > > OK - forget clustering then. > > > And one more thing. I don't know if it helps, but column 'a' can have > > value from a limited set: 0, 1 or 2. Column 'b' is also an integer > > (foreign key, actually). > > Hmm - might be worth trying distinct on (b,a) with an index that way > around - that would give you greater selectivity at the top-level of the > btree. Can you repeat the EXPLAIN ANALYSE with that too please. > > -- > Richard Huxton > Archonet Ltd > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Have you searched our list archives? > > http://archives.postgresql.org >
В списке pgsql-performance по дате отправления: