Re: index-only scans
От | Oleg Bartunov |
---|---|
Тема | Re: index-only scans |
Дата | |
Msg-id | Pine.LNX.4.64.1108121705560.110@sn.sai.msu.ru обсуждение исходный текст |
Ответ на | index-only scans (Robert Haas <robertmhaas@gmail.com>) |
Список | pgsql-hackers |
Robert, I imagine we store positional information in gin index and return tuples in relevant order - instant full-text search ! Great work, guys ! Oleg On Thu, 11 Aug 2011, Robert Haas wrote: > Please find attached a patch implementing a basic version of > index-only scans. This patch is the work of my colleague Ibrar Ahmed > and myself, and also incorporates some code from previous patches > posted by Heikki Linnakanagas. > > I'm able to demonstrate a significant performance benefit from this > patch, even in a situation where it doesn't save any disk I/O, due to > reduced thrashing of shared_buffers. Configuration settings: > > max_connections = 100 > shared_buffers = 400MB > maintenance_work_mem = 256MB > synchronous_commit = off > checkpoint_segments = 100 > checkpoint_timeout = 30min > checkpoint_completion_target = 0.9 > checkpoint_warning = 90s > seq_page_cost = 1.0 > random_page_cost = 1.0 > effective_cache_size = 3GB > > Test setup: > > pgbench -i -s 50 > create table sample_data as select (random()*5000000)::int as aid, > repeat('a', 1000) as filler from generate_series(1,100000); > > Test queries: > > select sum(aid) from sample_data a1 where exists (select * from > pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567); > select sum(aid) from sample_data a1 where exists (select * from > pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567); > > On my laptop, the first query executes in about 555 ms, while the > second one takes about 1125 ms. Inspection via pg_buffercache reveals > that the second one thrashes shared_buffers something fierce, while > the first one does not. You can actually get the time for the first > query down to about 450 ms if you can persuade PostgreSQL to cache the > entire sample_data table - which is difficult, due the > BufferAccessStrategy stuff - and as soon as you run the second query, > it blows the table out of cache, so practically speaking you're not > going to get that faster time very often. I expect that you could get > an even larger benefit from this type of query if you could avoid > actual disk I/O, rather than just buffer cache thrashing, but I > haven't come up with a suitable test cases for that yet (volunteers?). > > There are several things about this patch that probably need further > thought and work, or at least some discussion. > > 1. The way that nodeIndexscan.c builds up the faux heap tuple is > perhaps susceptible to improvement. I thought about building a > virtual tuple, but then what do I do with an OID column, if I have > one? Or maybe this should be done some other way altogether. > > 2. Suppose we scan one tuple on a not-all-visible page followed by 99 > tuples on all-visible pages. The code as written will hold the pin on > the first heap page for the entire scan. As soon as we hit the end of > the scan or another tuple where we have to actually visit the page, > the old pin will be released, but until then we hold onto it. This > isn't totally stupid: the next tuple that's on a not-all-visible page > could easily be on the same not-all-visible page we already have > pinned. And in 99% cases holding the pin for slightly longer is > probably completely harmless. On the flip side, it could increase the > chances of interfering with VACUUM. Then again, a non-index-only scan > would still interfere with the same VACUUM, so maybe we don't care. > > 3. The code in create_index_path() builds up a bitmapset of heap > attributes that get used for any purpose anywhere in the query, and > hangs it on the RelOptInfo so it doesn't need to be rebuilt for every > index under consideration. However, if it were somehow possible to > have the rel involved without using any attributes at all, we'd > rebuild the cache over and over, since it would never become non-NULL. > I don't think that can happen right now, but future planner changes > might make it possible. > > 4. There are a couple of cases that use index-only scans even though > the EXPLAIN output sort of makes it look like they shouldn't. For > example, in the above queries, an index-only scan is chosen even > though the query does "SELECT *" from the table being scanned. Even > though the EXPLAIN (VERBOSE) output makes it look otherwise, it seems > that the target list of an EXISTS query is in fact discarded, e.g.: > > create or replace function error() returns int as $$begin select 1/0; > end$$ language plpgsql; > select * from pgbench_accounts a where exists (select error() from > pgbench_branches b where b.bid = a.aid); -- doesn't error out! > > Along similar lines, COUNT(*) does not preclude an index-only scan, > because the * is apparently just window dressing. You'll still get > just a seq-scan unless you have an indexable qual in the query > somewhere, because... > > 5. We haven't made any planner changes at all, not even for cost > estimation. It is not clear to me what the right way to do cost > estimation here is. It seems that it would be useful to know what > percentage of the table is all-visible at planning time, but even if > we did, there could (and probably often will) be correlation between > the pages the query is interested in and which visibility map bits are > set. So I'm concerned about being overly optimistic about the cost > savings. Also, there's the problem of figuring out how to keep the > percent-of-table-that-has-the-vm-bits-set statistic up to date. Maybe > that's a job for ANALYZE but I haven't thought about it much yet. > > 6. I'm sure there are probably some statements in the documentation > that need updating, but I haven't tracked 'em down yet. > > Comments, testing, review appreciated... > > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
В списке pgsql-hackers по дате отправления: