Discussion: More tablescanning fun
On this table

 project_id | integer | not null
 id         | integer | not null
 date       | date    | not null
 team_id    | integer | not null
 work_units | bigint  | not null
Indexes: email_contrib_pkey primary key btree (project_id, id, date)

with this breakdown of data

 project_id |  count
------------+----------
          5 | 56427141
          8 |  1058843
         24 |   361595
         25 |  4092575
        205 | 58512516

any kind of operation on an entire project wants to tablescan, even
though it's going to take way longer.

explain analyze select sum(work_units) from email_contrib where project_id=8;

 Index scan  126, 56, 55 seconds
 Seq. scan   1517, 850, 897 seconds

It seems like the metrics used for the cost of index scanning v. table
scanning on large tables need to be revisited. It might be such a huge
difference in this case because the table is essentially clustered on the
primary key. I can test this by doing an aggregate for, say, a specific
team_id, which would be pretty well spread across the entire table, but
that'll have to wait a bit.

Anyone have any thoughts on this? Also, is there a TODO to implement real
clustered indexes? Doing stuff by project_id on this table in Sybase was
very efficient, because there was a real clustered index on the PK. By
clustered index, I mean an index where the leaf nodes of the B-tree are
the actual table rows. This means the only overhead in going through the
index is scanning the branches, which in this case would be pretty
light-weight. Is this something I should be using some PGSQL-specific
feature for, like inheritance?

I've been really happy so far with PGSQL (coming from Sybase and DB2),
but it seems there are still some pretty big performance issues that want
to be addressed (or I should say, performance issues that hurt really big
when you hit them :) ).
-- 
Jim C. Nasby (aka Decibel!)                    jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
"Jim C. Nasby" <jim@nasby.net> writes:
> It seems like the metrics used for the cost of index scanning v. table
> scanning on large tables need to be revisited. It might be such a huge
> difference in this case because the table is essentially clustered on
> the primary key.

Probably.  What does the correlation figure in pg_stats show as?

There's been some previous debate about the equation used to correct for
correlation, which is certainly bogus (I picked it more or less out of
the air ;-)).  But so far no one has proposed a replacement equation with
any better foundation ... take a look in
src/backend/optimizer/path/costsize.c if you want to get involved.

> Also, is there a TODO to implement real clustered indexes?

No.  It's not apparent to me how you could do that without abandoning
MVCC, which we're not likely to do.

			regards, tom lane
On Thu, Apr 24, 2003 at 07:58:30PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jim@nasby.net> writes:
> > It seems like the metrics used for the cost of index scanning v. table
> > scanning on large tables need to be revisited. It might be such a huge
> > difference in this case because the table is essentially clustered on
> > the primary key.
>
> Probably.  What does the correlation figure in pg_stats show as?

stats=# select attname, correlation from pg_stats
            where tablename='email_contrib';
  attname   | correlation
------------+-------------
 project_id |           1
 id         |    0.449204
 date       |    0.271775
 team_id    |    0.165588
 work_units |   0.0697928

> There's been some previous debate about the equation used to correct
> for correlation, which is certainly bogus (I picked it more or less
> out of the air ;-)).  But so far no one has proposed a replacement
> equation with any better foundation ... take a look in
> src/backend/optimizer/path/costsize.c if you want to get involved.

Are you referring to the PF formula?

> > Also, is there a TODO to implement real clustered indexes?
>
> No.  It's not apparent to me how you could do that without abandoning
> MVCC, which we're not likely to do.

Hmm... does MVCC mandate that inserts go at the end? My understanding is
that each tuple records its insert/last-modified time; if that's the
case, why would a true clustered index break MVCC? I guess an update that
moves the tuple would be tricky, but I'm guessing there's some kind of
magic that could happen there... worst case would be adding an 'expired'
timestamp.

On the other hand, it might be possible to get the advantages of a
clustered index without doing a *true* clustered index. The real point is
to be able to use indexes; I've heard things like 'if you need to access
more than 10% of a table then using an index would be disastrous', and
that's not good... that number should really be over 50% for most
reasonable ratios of indexed fields to fields in the table (of course
field size plays a factor).
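[For readers wondering what that pg_stats correlation figure measures:
roughly, it is the statistical correlation between a column's values and
the physical order of the rows in the heap; 1 means the heap is perfectly
sorted on the column, near 0 means the values are scattered randomly.
A minimal Python sketch of the idea (not the actual ANALYZE code, which
works on a sample and handles non-numeric types via ordering):]

```python
def column_correlation(values):
    """Pearson correlation between a column's values and the physical
    positions (0, 1, 2, ...) at which they occur in the heap."""
    n = len(values)
    positions = range(n)
    mean_v = sum(values) / n
    mean_p = (n - 1) / 2
    cov = sum((v - mean_v) * (p - mean_p) for v, p in zip(values, positions))
    var_v = sum((v - mean_v) ** 2 for v in values)
    var_p = sum((p - mean_p) ** 2 for p in positions)
    return cov / (var_v * var_p) ** 0.5
```

[A heap stored in ascending order, like the project_id column above,
yields exactly 1; a descending one yields -1.]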
"Jim C. Nasby" <jim@nasby.net> writes:
> On Thu, Apr 24, 2003 at 07:58:30PM -0400, Tom Lane wrote:
>> There's been some previous debate about the equation used to correct
>> for correlation, which is certainly bogus (I picked it more or less
>> out of the air ;-)).  But so far no one has proposed a replacement
>> equation with any better foundation ... take a look in
>> src/backend/optimizer/path/costsize.c if you want to get involved.

> Are you referring to the PF formula?

The PF formula is good as far as I know, but it assumes an uncorrelated
table order.  The debate is how to correct it for nonzero correlation.
Specifically, this bit:

 * When the index ordering is exactly correlated with the table ordering
 * (just after a CLUSTER, for example), the number of pages fetched should
 * be just sT.  What's more, these will be sequential fetches, not the
 * random fetches that occur in the uncorrelated case.  So, depending on
 * the extent of correlation, we should estimate the actual I/O cost
 * somewhere between s * T * 1.0 and PF * random_cost.  We currently
 * interpolate linearly between these two endpoints based on the
 * correlation squared (XXX is that appropriate?).

I believe the endpoints s*T and PF*random_cost are right, but the curve
between them is anyone's guess.  It's also quite possible that the
correlation stat that we currently compute is inadequate to model what's
going on.

>> No.  It's not apparent to me how you could do that without abandoning
>> MVCC, which we're not likely to do.

> Hmm... does MVCC mandate inserts go at the end?

Anywhere that there's free space.  The point is that you can't promise
updates will fit on the same page as the original tuple.  So whatever
desirable physical ordering you may have started with will surely degrade
over time.

> On the other hand, it might be possible to get the advantages of a
> clustered index without doing a *true* clustered index. The real point
> is to be able to use indexes; I've heard things like 'if you need to
> access more than 10% of a table then using an index would be
> disastrous', and that's not good... that number should really be over
> 50% for most reasonable ratios of indexed fields to fields in the
> table (of course field size plays a factor).

If you have to read 50% of a table, you certainly should be doing a
linear scan.  There will be hardly any pages you can skip (unless the
table is improbably well clustered), and the extra I/O needed to read the
index will buy you nothing.

			regards, tom lane
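[The interpolation described in that costsize.c comment fits in a few
lines of Python; this is a paraphrase for illustration, not the actual
planner code, and the default random_page_cost of 4.0 is assumed:]

```python
def interpolated_io_cost(s, T, PF, correlation, random_page_cost=4.0):
    """s: selectivity, T: pages in the table, PF: the Mackert-Lohman
    page-fetch estimate for the uncorrelated case."""
    min_io = s * T * 1.0            # fully correlated: sT sequential reads
    max_io = PF * random_page_cost  # uncorrelated: PF random reads
    csquared = correlation ** 2
    # slide linearly from the random endpoint toward the sequential one
    # as correlation^2 approaches 1 -- the "XXX is that appropriate?" part
    return max_io + csquared * (min_io - max_io)
```

[At correlation 1 this collapses to s*T; at 0 it is PF*random_page_cost;
everything in between is the guessed-at curve under debate.]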
On Fri, Apr 25, 2003 at 01:23:10AM -0400, Tom Lane wrote:
> I believe the endpoints s*T and PF*random_cost are right, but the curve
> between them is anyone's guess.  It's also quite possible that the
> correlation stat that we currently compute is inadequate to model
> what's going on.

In this case, the interpolation can't be at fault, because correlation is
1 (unless the interpolation is backwards, but that doesn't appear to be
the case). One possibility is that IndexSelectivity isn't taking
most_common_(vals|freqs) into account.

Looking at this from an ideal case, most (or all) of this query should be
retrieved by simply incrementing through both the index and the tuples at
the same time. We should end up pulling 0.7% of the index and raw pages
combined. ANALYZE thinks that using the index will be about 50% more
expensive, though (3258557 v. 2274866).

A thought that comes to mind here: it would be incredible if pgsql could
take metrics of how long things actually take on a live system and
incorporate them... basically learning as it goes. A first step in this
case would be to keep tabs on how close real page-read counts come to
what the optimizer predicted, and store that for later analysis. This
would make it easier for you to verify your linear-interpolation
assumption, for example (it'd also make it easier to validate the PF
formula).

> >> No.  It's not apparent to me how you could do that without
> >> abandoning MVCC, which we're not likely to do.
>
> > Hmm... does MVCC mandate inserts go at the end?
>
> Anywhere that there's free space.  The point is that you can't promise
> updates will fit on the same page as the original tuple.  So whatever
> desirable physical ordering you may have started with will surely
> degrade over time.

Yes, updates are the tricky part of clustered indexes, and MVCC might
make them harder. What Sybase 11 (which only supports page locking) does
is check whether the update moves the tuple off its current page. If it
doesn't, it just shuffles the page around as needed and goes on with
business. If the tuple needs to move, it grabs (and locks) the page it
needs to move to, inserts the tuple on that page (possibly incurring a
page split), and deletes it from the old page. My guess is that with
MVCC, you can't simply delete the old tuple... you'd have to leave some
kind of 'bread crumb' behind for older transactions to see (though I
guess this must already be happening somehow). The payoff is well worth
it in this case, though... we end up with one table (which simplifies
the code) that essentially acts as if it were multiple (5 in this case)
tables, so performance should still be very good.

> > On the other hand, it might be possible to get the advantages of a
> > clustered index without doing a *true* clustered index. The real
> > point is to be able to use indexes; I've heard things like 'if you
> > need to access more than 10% of a table then using an index would be
> > disastrous', and that's not good... that number should really be
> > over 50% for most reasonable ratios of indexed fields to fields in
> > the table (of course field size plays a factor).
>
> If you have to read 50% of a table, you certainly should be doing a
> linear scan.  There will be hardly any pages you can skip (unless the
> table is improbably well clustered), and the extra I/O needed to read
> the index will buy you nothing.

Yes, and it's that 'improbably well clustered' case that I have here. :)
But even if you're only 25% clustered, I think you'll still see a huge
gain on a very large table, especially if the index tuples are
substantially smaller than the raw tuples (which they normally should
be).
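[Jim's "learning as it goes" idea could start as simply as logging the
planner's predicted page count next to the count actually read.
Everything below is invented for illustration; no such hook exists in
PostgreSQL:]

```python
# Hypothetical feedback log pairing each scan's predicted page count
# with the number of pages actually read.
estimate_log = []

def record_scan(predicted_pages, actual_pages):
    estimate_log.append((predicted_pages, actual_pages))

def mean_actual_to_predicted_ratio():
    # A result above 1 means the planner systematically underestimates
    # page reads for this workload; below 1, it overestimates.
    return sum(a / p for p, a in estimate_log) / len(estimate_log)
```

[Aggregating such ratios per table, or per correlation bucket, is what
would let the interpolation curve be validated against live data.]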
Jim C. Nasby wrote on Fri, 25.04.2003 at 07:59:
> > > Also, is there a TODO to implement real clustered indexes?
> >
> > No.  It's not apparent to me how you could do that without
> > abandoning MVCC, which we're not likely to do.
>
> Hmm... does MVCC mandate inserts go at the end?

I have been pondering whether keeping pages half-empty (or even 70%
empty) could solve both the clustering problems and long-running updates
over much data.

If we could place the copy on the same page as the original, most big
updates would be possible in one sweep of the disk heads, and the
clustering order would also be easier to keep if pages were kept
intentionally half-empty.

So "VACUUM FULL 65% EMPTY;" could make sense?

-------------
Hannu
On Fri, Apr 25, 2003 at 07:28:06PM +0300, Hannu Krosing wrote:
> I have been pondering whether keeping pages half-empty (or even 70%
> empty) could solve both the clustering problems and long-running
> updates over much data.
>
> If we could place the copy on the same page as the original, most big
> updates would be possible in one sweep of the disk heads, and the
> clustering order would also be easier to keep if pages were kept
> intentionally half-empty.
>
> So "VACUUM FULL 65% EMPTY;" could make sense?

That's actually a recommended practice, at least for Sybase when you're
using a clustered index, depending on what you're using it for. If you
cluster a table in such a way that inserts happen across the entire
table, you'll actually end up with a fill ratio (amount of data v. empty
space on a page) of 75% over time, because of page splits. When Sybase
goes to insert, if it can't find room on the page it needs to insert into
(keep in mind this is a clustered table, so a given row *must* go into a
given position), it will split that single page into two pages, each of
which will then have a fill ratio of 50%. Of course each will eventually
approach 100% again, so the average fill ratio across all pages of the
table ends up around 75%.

I'm not familiar enough with pgsql's guts to know how big an impact
updates across pages are, or if they even happen often at all. If you're
not maintaining a clustered table, couldn't all updates just occur
in-place? Or are you thinking of the case where you have a lot of
variable-length stuff?
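[The split-to-50%, refill-toward-100% behaviour is easy to see in a toy
simulation; this illustrates the arithmetic only and is not Sybase's
actual page-management code:]

```python
import random

def average_fill(n_inserts, capacity=100, seed=1):
    """Insert random keys in clustered order, splitting any over-full
    page 50/50, and report the resulting average fill ratio."""
    random.seed(seed)
    pages = [[]]                      # each page: sorted list of keys
    for _ in range(n_inserts):
        key = random.random()
        # clustered table: the key's page is dictated by its sort position
        i = 0
        while i + 1 < len(pages) and key >= pages[i + 1][0]:
            i += 1
        pages[i].append(key)
        pages[i].sort()
        if len(pages[i]) > capacity:  # no room left: split the page 50/50
            half = len(pages[i]) // 2
            pages.insert(i + 1, pages[i][half:])
            pages[i] = pages[i][:half]
    return sum(len(p) for p in pages) / (len(pages) * capacity)
```

[Running this with a few tens of thousands of inserts settles in the
region Jim describes: each split creates two half-full pages that drift
back toward full, so the steady-state average sits well between 50% and
100%.]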
Hannu Krosing <hannu@tm.ee> writes:
> I have been pondering if keeping pages half-empty (or even 70% empty)
> could solve both clustering problems and longish updates for much data.

You could achieve that pretty easily if you simply don't ever VACUUM FULL
;-)  UPDATE has always (AFAIR) attempted to place the new version on the
same page as the old, moving it elsewhere only if it doesn't fit.  So
that part of the logic is already there.

> So "VACUUM FULL 65% EMPTY;" could make sense?

Not so much that, as a parameter to CLUSTER telling it to fill pages only
x% full.

			regards, tom lane
Jim C. Nasby wrote on Fri, 25.04.2003 at 19:56:
> I'm not familiar enough with pgsql's guts to know how big an impact
> updates across pages are, or if they even happen often at all. If
> you're not maintaining a clustered table, couldn't all updates just
> occur in-place?

In postgres _no_ updates happen in-place. The MVCC concurrency works by
always inserting a new tuple on update.

-----------
Hannu
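[Hannu's point can be sketched in miniature. This is a toy model, not
PostgreSQL's heap code; real tuples carry xmin/xmax transaction IDs with
considerably more visibility logic than shown here:]

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class HeapTuple:
    data: dict
    xmin: int                   # transaction that inserted this version
    xmax: Optional[int] = None  # transaction that updated/deleted it

def mvcc_update(heap, idx, new_data, xid):
    """UPDATE never overwrites: mark the old version dead and append a
    new version wherever there is room."""
    heap[idx].xmax = xid
    heap.append(HeapTuple(new_data, xmin=xid))

def visible(t, snapshot_xid):
    """Crude visibility check: version was created by our snapshot's time
    and not yet deleted as of it."""
    return t.xmin <= snapshot_xid and (t.xmax is None or t.xmax > snapshot_xid)
```

[The key consequence for clustering: the new version lands "wherever
there is room", so any physical ordering decays unless the old page has
free space, which is exactly why Hannu suggests keeping pages
deliberately part-empty.]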
On Fri, 25 Apr 2003 09:38:01 -0500, "Jim C. Nasby" <jim@nasby.net> wrote:
> In this case, the interpolation can't be at fault, because correlation
> is 1 (unless the interpolation is backwards, but that doesn't appear to
> be the case).

But your index has 3 columns, which causes the index correlation to be
assumed as 1/3.  So the interpolation uses 1/9 (correlation squared) and
you get a cost estimation that almost equals the upper bound.

If you want to play around with other interpolation methods, you might
want to get this patch:
http://www.pivot.at/pg/16-correlation-732.diff

A short description of the GUC parameters introduced by this patch can
be found here:
http://archives.postgresql.org/pgsql-performance/2002-11/msg00256.php

As a short-term workaround for an unmodified Postgres installation, you
can create an index on email_contrib(project_id).

Servus
 Manfred
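[Plugging Manfred's point into the interpolation quoted from costsize.c
makes the effect concrete; the page counts below are made up for
illustration:]

```python
def io_estimate(min_io, max_io, correlation):
    # the same linear interpolation on correlation^2 described earlier:
    # min_io = s*T sequential reads, max_io = PF * random_page_cost
    return max_io + correlation ** 2 * (min_io - max_io)

# hypothetical scan: 7000 sequential pages vs. 28000 units of random I/O
near_random = io_estimate(7000, 28000, 1 / 3)      # correlation assumed 1/3
truly_clustered = io_estimate(7000, 28000, 1.0)    # what the heap deserves
```

[With the assumed correlation of 1/3 the estimate lands at roughly
25667, barely below the random upper bound of 28000, even though the
heap is perfectly ordered on project_id and the true cost is near 7000.]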
On Wed, Apr 30, 2003 at 04:14:46PM +0200, Manfred Koizar wrote:
> On Fri, 25 Apr 2003 09:38:01 -0500, "Jim C. Nasby" <jim@nasby.net>
> wrote:
> > In this case, the interpolation can't be at fault, because
> > correlation is 1 (unless the interpolation is backwards, but that
> > doesn't appear to be the case).
>
> But your index has 3 columns, which causes the index correlation to be
> assumed as 1/3.  So the interpolation uses 1/9 (correlation squared)
> and you get a cost estimation that almost equals the upper bound.

Hmm... interesting... maybe it would also be a good idea to expand
ANALYZE so that it measures actual index correlation? I.e., in this case
it would notice that the index on (project_id, id, date) is highly
correlated across all 3 columns. Supporting something close to a real
clustered index would work as well, since the optimizer would treat that
case differently (essentially as an index scan combined with a seq. scan
within each page).