Обсуждение: The number of bytes is stored in index_size of pgstatindex() ?
Hi all,
It seems that description of index_size that is returned by pgstatindex() is wrong.
The document says "Total number of pages in index".
But it looks like the number of bytes is stored in index_size.
  The document page:
    http://www.postgresql.org/docs/9.5/interactive/pgstattuple.html
Regards,
--
----
 OTSUKA Kenji
    otsuka.kenji@nttcom.co.jp
			
		=?UTF-8?B?5aSn5aGa5oay5Y+4?= <otsuka.kenji@nttcom.co.jp> writes:
> It seems that description of index_size that is returned by pgstatindex() is wrong.
> The document says "Total number of pages in index".
> But it looks like the number of bytes is stored in index_size.
Hmm, you're right, because what the code does is
        values[j++] = psprintf(INT64_FORMAT,
                               (indexStat.root_pages +
                                indexStat.leaf_pages +
                                indexStat.internal_pages +
                                indexStat.deleted_pages +
                                indexStat.empty_pages) * BLCKSZ);
so the result is clearly measured in bytes not pages.
I was going to just change the docs to say "Total index size in bytes",
but then I noticed that this number is not that either: it does not
count the metapage and therefore is BLCKSZ less than the true file
size.  As an example:
regression=# select * from pgstatindex('tenk1_unique1');
 version | tree_level | index_size | root_block_no | internal_pages | leaf_pages
 | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation
---------+------------+------------+---------------+----------------+-----------
-+-------------+---------------+------------------+--------------------
       2 |          1 |     237568 |             3 |              0 |         28
 |           0 |             0 |            87.91 |                  0
(1 row)
regression=# select relfilenode from pg_class where relname = 'tenk1_unique1';
 relfilenode
-------------
       27449
(1 row)
$ ls -l 27449
-rw-------. 1 postgres postgres 245760 Feb 17 18:58 27449
I think this is a bug and we ought to fix the code to include the
metapage in the reported index_size.  Thoughts?
            regards, tom lane
			
		Re: [HACKERS] The number of bytes is stored in index_size of pgstatindex() ?
От
 
		    	Peter Geoghegan
		    Дата:
		        On Thu, Feb 18, 2016 at 9:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I think this is a bug and we ought to fix the code to include the > metapage in the reported index_size. Thoughts? I tend to agree, but I think you should note that specifically in the documentation. I'm in favor of tools like pgstattuple and pageinspect going into this kind of detail in their documentation generally. -- Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes:
> On Thu, Feb 18, 2016 at 9:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think this is a bug and we ought to fix the code to include the
>> metapage in the reported index_size.  Thoughts?
> I tend to agree, but I think you should note that specifically in the
> documentation. I'm in favor of tools like pgstattuple and pageinspect
> going into this kind of detail in their documentation generally.
Yeah ... the numbers already appear not to add up because the root page
is counted in index_size but not any other column, so there's already
something worthy of explanation there.  Maybe something like "The reported
index_size will normally correspond to two more pages than are accounted
for by internal_pages + leaf_pages + empty_pages + deleted_pages, because
it also includes the index's metapage and root page".
            regards, tom lane
			
		Re: [HACKERS] The number of bytes is stored in index_size of pgstatindex() ?
От
 
		    	Peter Geoghegan
		    Дата:
		        On Thu, Feb 18, 2016 at 10:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah ... the numbers already appear not to add up because the root page
> is counted in index_size but not any other column, so there's already
> something worthy of explanation there.  Maybe something like "The reported
> index_size will normally correspond to two more pages than are accounted
> for by internal_pages + leaf_pages + empty_pages + deleted_pages, because
> it also includes the index's metapage and root page".
That's odd. Having taken a quick look at pgstatindex_impl(), I dislike
that it counts the root separately from internal pages. That's not how
they're actually presented and understood in the code.
It's also possible to have more than one root page, since we do a scan
in physical order, which the code considers. There could be a fast
root and a true root. Looks like one is counted as deleted, though:
regression=# select index_size/2^13 as total_pages, root_block_no,
internal_pages, leaf_pages, deleted_pages from
pgstatindex('btree_tall_idx');
 total_pages │ root_block_no │ internal_pages │ leaf_pages │ deleted_pages
─────────────┼───────────────┼────────────────┼────────────┼───────────────
         206 │            81 │              3 │        160 │            42
(1 row)
BTW, I am actively working on the amcheck B-Tree checker tool, and
hope to post it soon.
--
Peter Geoghegan
			
		Peter Geoghegan <pg@heroku.com> writes:
> That's odd. Having taken a quick look at pgstatindex_impl(), I dislike
> that it counts the root separately from internal pages. That's not how
> they're actually presented and understood in the code.
Yeah, that seems a bit strange to me as well.  Should we change it to
count the root as an internal page, or is that going too far?
Note that it's already the case that in a one-page index (root is also
a leaf), the root will be included in the leaf_pages count.  So it
sure seems inconsistent that it's not included in the internal_pages
count when it's not a leaf.
> It's also possible to have more than one root page, since we do a scan
> in physical order, which the code considers. There could be a fast
> root and a true root. Looks like one is counted as deleted, though:
Well, actually, since we don't have write lock on the index it'd be
possible to see zero or multiple roots because the root's location
changes.  That's already mentioned in the documentation, if somewhat
obliquely.
            regards, tom lane
			
		Re: [HACKERS] The number of bytes is stored in index_size of pgstatindex() ?
От
 
		    	Peter Geoghegan
		    Дата:
		        On Thu, Feb 18, 2016 at 10:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah, that seems a bit strange to me as well.  Should we change it to
> count the root as an internal page, or is that going too far?
I think we should change it. It seems like a bug to me. I've had the
same point come up ("leaf-ness/internal-ness and root-ness are
orthogonal") a couple of times with Heikki over the years. I just
haven't used pgstattuple very much for some reason, and so didn't
catch it before now.
> Note that it's already the case that in a one-page index (root is also
> a leaf), the root will be included in the leaf_pages count.  So it
> sure seems inconsistent that it's not included in the internal_pages
> count when it's not a leaf.
That's what I was thinking.
> Well, actually, since we don't have write lock on the index it'd be
> possible to see zero or multiple roots because the root's location
> changes.  That's already mentioned in the documentation, if somewhat
> obliquely.
Ah, yes. Another consequence of going in physical order.
--
Peter Geoghegan
			
		Peter Geoghegan <pg@heroku.com> writes:
> On Thu, Feb 18, 2016 at 10:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yeah, that seems a bit strange to me as well.  Should we change it to
>> count the root as an internal page, or is that going too far?
> I think we should change it. It seems like a bug to me.
Me too.  Is it enough bug-like to be something to back-patch, or should
we just change it in HEAD?
            regards, tom lane
			
		Re: [HACKERS] The number of bytes is stored in index_size of pgstatindex() ?
От
 
		    	Peter Geoghegan
		    Дата:
		        I vote back patch. Subtle differences between the branches should be avoided.
--
 Peter Geoghegan
			
		
I wrote:
> Peter Geoghegan <pg@heroku.com> writes:
>> I think we should change it. It seems like a bug to me.
> Me too.  Is it enough bug-like to be something to back-patch, or should
> we just change it in HEAD?
Actually, there's a significantly worse bug here: I just realized that the
page type tests are done in the wrong order.  A deleted page that was
formerly a leaf will be reported as though it was a live leaf page,
because both the BTP_LEAF and BTP_DELETED flags are set for such a page.
It looks like this was done correctly to begin with, and I broke it in
d287818eb514d431b1a68e1f3940cd958f82aa34.  Not sure what I was thinking :-(
Anyway, I think that puts the final nail in the coffin of the idea that
the current code's behavior is sane enough to preserve.  I think we should
fix all these things and back-patch 'em all.
            regards, tom lane
			
		Re: [HACKERS] The number of bytes is stored in index_size of pgstatindex() ?
От
 
		    	Peter Geoghegan
		    Дата:
		        On Thu, Feb 18, 2016 at 11:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It looks like this was done correctly to begin with, and I broke it in
> d287818eb514d431b1a68e1f3940cd958f82aa34.  Not sure what I was thinking :-(
I think that you might not have simply changed the order in a totally
misguided way back in 2008, as you seem to imply. Consider what this
block does following your commit just now:
...
else if (P_IGNORE(opaque))
    indexStat.empty_pages++;    /* this is the "half dead" state */
else if (P_ISLEAF(opaque))
{
    int         max_avail;
    max_avail = BLCKSZ - (BLCKSZ - ((PageHeader) page)->pd_special +
SizeOfPageHeaderData);
    indexStat.max_avail += max_avail;
    indexStat.free_space += PageGetFreeSpace(page);
    indexStat.leaf_pages++;
    /*
     * If the next leaf is on an earlier block, it means a
     * fragmentation.
     */
    if (opaque->btpo_next != P_NONE && opaque->btpo_next < blkno)
        indexStat.fragments++;
}
...
I think that the P_ISLEAF() instrumentation of free space and
fragments might still need to happen for deleted and/or half dead
pages. Having looked at the 2008 commit d287818eb514d431 myself, ISTM
that your intent might well have been to have that happen -- why else
would any reasonable person have changed the order at all?
The avg_leaf_density and leaf_fragmentation fields might now be argued
to misrepresent the true picture. This is not clearly a departure from
their documented behavior, if only because the descriptions are almost
the same as the names of the fields themselves. If you think that the
instrumentation of free space is the most useful possible behavior as
of today, which it might well be, then you might have clarified that
this behavior was your intent in today's commit, for example by
updating the descriptions of the fields avg_leaf_density and
leaf_fragmentation in the docs.
Just a thought.
--
Peter Geoghegan
			
		Peter Geoghegan <pg@heroku.com> writes:
> I think that the P_ISLEAF() instrumentation of free space and
> fragments might still need to happen for deleted and/or half dead
> pages.
Don't see why; the documentation and field names clearly imply that those
numbers are accumulated only over leaf pages.  I certainly wouldn't expect
the fragmentation measure to include dead pages, for example, since they
would not get traversed by scans.  (Whether the "rightlink points to a
higher page number" rule for fragmentation is actually very useful is a
separate question; but as long as that's the measure, only pages that
are part of the leaf scan sequence should count.)
> Having looked at the 2008 commit d287818eb514d431 myself, ISTM
> that your intent might well have been to have that happen -- why else
> would any reasonable person have changed the order at all?
My best guess is that I was thinking that the tests were independent,
and rearranged them so that the most common case would be tested first.
I'm quite sure I didn't intend to change the statistical behavior, else
I would have updated docs and/or mentioned it in the commit message.
            regards, tom lane
			
		Re: [HACKERS] The number of bytes is stored in index_size of pgstatindex() ?
От
 
		    	Peter Geoghegan
		    Дата:
		        On Thu, Feb 18, 2016 at 3:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Don't see why; the documentation and field names clearly imply that those > numbers are accumulated only over leaf pages. Then I guess what might be lacking is a delineation of what a leaf page is for the purposes of pgstatindex(). > I certainly wouldn't expect > the fragmentation measure to include dead pages, for example, since they > would not get traversed by scans. (Whether the "rightlink points to a > higher page number" rule for fragmentation is actually very useful is a > separate question; but as long as that's the measure, only pages that > are part of the leaf scan sequence should count.) Why would dead pages not get traversed by scans? It's clear that they would be traversed to the extent that there will be buffer locking and inspection of the B-Tree special area flags to determine that the page should be ignored. >> Having looked at the 2008 commit d287818eb514d431 myself, ISTM >> that your intent might well have been to have that happen -- why else >> would any reasonable person have changed the order at all? > > My best guess is that I was thinking that the tests were independent Dunno. I'd have thought that the basic fact that the ordering had some significance was fairly obvious. One of the tests is "if (P_IGNORE(...))", and I can't imagine that not provoking thought. I'm particular about things like this because I care about making sure these tools are useful for teaching novice hackers about Postgres internals. I've made my point and will leave it at that, though. -- Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes:
> Why would dead pages not get traversed by scans?
Because they've been removed from the right-link/left-link chains.
            regards, tom lane
			
		Re: [HACKERS] The number of bytes is stored in index_size of pgstatindex() ?
От
 
		    	Peter Geoghegan
		    Дата:
		        On Thu, Feb 18, 2016 at 4:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Because they've been removed from the right-link/left-link chains. That isn't the same thing as being inaccessible by scans, clearly (just what you call the "leaf scan sequence"). Besides, half-dead pages still have right-link/left-link chains, and there are usage patterns where half-dead pages might accumulate. -- Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes:
> On Thu, Feb 18, 2016 at 4:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Because they've been removed from the right-link/left-link chains.
> That isn't the same thing as being inaccessible by scans, clearly
> (just what you call the "leaf scan sequence").
Only a physical-order scan, ie vacuum, would visit a dead page
(ignoring transient corner cases like a page getting deleted while an
indexscan is in flight to it).  So I think treating it as part of the
fragmentation measure is completely wrong: the point of that measure,
AFAICS, is to model how close an index-order traversal is to linear.
Half-dead pages are also normally very transient --- the only way they
persist is if there's a crash partway through a page deletion.  So I think
it's appropriate to assume that future indexscans won't visit those,
either.
> there are usage patterns where half-dead pages might accumulate.
Other than a usage pattern of "randomly SIGKILL backends every few
seconds", I don't see how that would happen.
            regards, tom lane
			
		Re: [HACKERS] The number of bytes is stored in index_size of pgstatindex() ?
От
 
		    	Peter Geoghegan
		    Дата:
		        On Thu, Feb 18, 2016 at 4:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Only a physical-order scan, ie vacuum, would visit a dead page > (ignoring transient corner cases like a page getting deleted while an > indexscan is in flight to it). So I think treating it as part of the > fragmentation measure is completely wrong: the point of that measure, > AFAICS, is to model how close an index-order traversal is to linear. > Half-dead pages are also normally very transient --- the only way they > persist is if there's a crash partway through a page deletion. So I think > it's appropriate to assume that future indexscans won't visit those, > either. Okay. >> there are usage patterns where half-dead pages might accumulate. > > Other than a usage pattern of "randomly SIGKILL backends every few > seconds", I don't see how that would happen. I meant where pages could accumulate without being recycled. -- Peter Geoghegan
Re: [HACKERS] The number of bytes is stored in index_size of pgstatindex() ?
От
 
		    	Alvaro Herrera
		    Дата:
		        Peter Geoghegan wrote: > On Thu, Feb 18, 2016 at 4:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> there are usage patterns where half-dead pages might accumulate. > > > > Other than a usage pattern of "randomly SIGKILL backends every few > > seconds", I don't see how that would happen. > > I meant where pages could accumulate without being recycled. But those pages are supposed to be used as the index grows. So unless they are forgotten by the FSM, they shouldn't accumulate. (Except where the table doesn't grow but only shrinks, so there's no need for new index pages, but I don't think that's an interesting case.) -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] The number of bytes is stored in index_size of pgstatindex() ?
От
 
		    	Peter Geoghegan
		    Дата:
		        On Fri, Feb 19, 2016 at 12:05 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > But those pages are supposed to be used as the index grows. So unless > they are forgotten by the FSM, they shouldn't accumulate. (Except where > the table doesn't grow but only shrinks, so there's no need for new > index pages, but I don't think that's an interesting case.) Sure. I'm talking about a narrow issue around how things are represented in pgstatindex() only. -- Peter Geoghegan