Discussion: mdnblocks() sabotages error checking in _mdfd_getseg()

mdnblocks() sabotages error checking in _mdfd_getseg()

From:
Robert Haas
Date:
This is per a report by an EnterpriseDB customer and a bunch of
off-list analysis by Kevin Grittner and Rahila Syed.

Suppose you have a large relation with OID 123456.  There are segment
files 123456, 123456.1, and 123456.2.  Due to some kind of operating
system malfeasance, 123456.1 disappears; you are officially in
trouble.

Now, a funny thing happens.  The next time you call mdnblocks() on
this relation, which will probably happen pretty quickly since every
sequential scan does one, it will see that 123456 is a complete
segment and it will create an empty 123456.1.  It and any future
mdnblocks() calls will report that the length of the relation is equal
to the length of one full segment, because they don't check for the
next segment unless the current segment is completely full.

Now, if subsequent to this an index scan happens to sweep through and
try to fetch a block in 123456.2, it will work!  This happens because
_mdfd_getseg() doesn't care about the length of the segments; it only
cares whether or not they exist.  If 123456.1 were actually missing,
then we'd never test whether 123456.2 exists and we'd get an error.
But because mdnblocks() created 123456.1, _mdfd_getseg() is now quite
happy to fetch blocks in 123456.2; it considers the empty 123456.1
file to be a sufficient reason to look for 123456.2, and seeing that
file, and finding the block it wants therein, it happily returns that
block, blithely ignoring the fact that it passed over a completely empty
segment before returning a block from .2.  This is maybe not the
smartest thing ever.

The comment in mdnblocks.c says this:

                         * Because we pass O_CREAT, we will create the next segment (with
                         * zero length) immediately, if the last segment is of length
                         * RELSEG_SIZE.  While perhaps not strictly necessary, this keeps
                         * the logic simple.

I don't really see how this "keeps the logic simple".  What it does is
allow sequential scans and index scans to have two completely
different notions of how many accessible blocks there are in the
relation.  Granted, this kind of thing really shouldn't happen, but
sometimes bad things do happen.  Therefore, I propose the attached
patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments

Re: mdnblocks() sabotages error checking in _mdfd_getseg()

From:
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> The comment in mdnblocks.c says this:

>                          * Because we pass O_CREAT, we will create the next segment (with
>                          * zero length) immediately, if the last segment is of length
>                          * RELSEG_SIZE.  While perhaps not strictly necessary, this keeps
>                          * the logic simple.

> I don't really see how this "keeps the logic simple".

My (vague) recollection is that this is defending some assumptions
elsewhere in md.c.  You sure you haven't broken anything with this?
Relation extension across a segment boundary, perhaps?
        regards, tom lane



Re: mdnblocks() sabotages error checking in _mdfd_getseg()

From:
Robert Haas
Date:
On Wed, Dec 9, 2015 at 4:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> The comment in mdnblocks.c says this:
>
>>                          * Because we pass O_CREAT, we will create the next segment (with
>>                          * zero length) immediately, if the last segment is of length
>>                          * RELSEG_SIZE.  While perhaps not strictly necessary, this keeps
>>                          * the logic simple.
>
>> I don't really see how this "keeps the logic simple".
>
> My (vague) recollection is that this is defending some assumptions
> elsewhere in md.c.  You sure you haven't broken anything with this?
> Relation extension across a segment boundary, perhaps?

It doesn't look like that particular thing gets broken here.  The key
to extending across a segment boundary is that _mdfd_openseg() needs
to get O_CREAT as its fourth argument, but _mdfd_getseg() thankfully
does not rely on mdnblocks() to have already done that: it passes
O_CREAT itself.  I did also test it by creating a 13GB
pgbench_accounts table, and at least in that simple case it works
fine.  It's possible that there is some other thing I've missed, but
in general I think that if there is some dependency on mdnblocks()
creating new segments, that's a bad idea and we should try to get rid
of it.  A call that supposedly reports the number of blocks in a
relation really has no business changing on-disk state.

(More broadly, as Kevin was pointing out to me yesterday, md.c looks
like it could do with a face lift.  Keeping a linked list of 1GB
segments and chasing down the list to find the length of the file may
have been fine when relations over 1GB were rare, but that's now
routine.  Some relations may be over 1TB, and using a linked list of
1000 entries to keep track of all of those segments does not seem like
an especially good choice.  In fact, having no way to get the relation
length other than scanning 1000 files doesn't seem like an especially
good choice even if we used a better data structure.  Putting a header
page in the heap would make getting the length of a relation O(1)
instead of O(segments), and for a bonus, we'd be able to reliably
detect it if a relation file disappeared out from under us.  That's a
difficult project and definitely not my top priority, but this code is
old and crufty all the same.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: mdnblocks() sabotages error checking in _mdfd_getseg()

From:
Andres Freund
Date:
On 2015-12-10 11:10:10 -0500, Robert Haas wrote:
> (More broadly, as Kevin was pointing out to me yesterday, md.c looks
> like it could do with a face lift.  Keeping a linked list of 1GB
> segments and chasing down the list to find the length of the file may
> have been fine when relations over 1GB were rare, but that's now
> routine.  Some relations may be over 1TB, and using a linked list of
> 1000 entries to keep track of all of those segments does not seem like
> an especially good choice.

Yes, that sucks. I've posted a patch for that at
http://archives.postgresql.org/message-id/20141031223210.GN13584%40awork2.anarazel.de
- I don't have access to either the workload or a good testing
machine anymore though, so I've kinda lost interest for a while.

I'll try to push the patch with the renaming suggested downthread by Tom
soonish.

> In fact, having no way to get the relation length other than scanning
> 1000 files doesn't seem like an especially good choice even if we used
> a better data structure.  Putting a header page in the heap would make
> getting the length of a relation O(1) instead of O(segments), and for
> a bonus, we'd be able to reliably detect it if a relation file
> disappeared out from under us.  That's a difficult project and
> definitely not my top priority, but this code is old and crufty all
> the same.)

The md layer doesn't really know whether it's dealing with an index, or
with a heap, or ... So handling this via a metapage doesn't seem
particularly straightforward.

Andres



Re: mdnblocks() sabotages error checking in _mdfd_getseg()

From:
Robert Haas
Date:
On Thu, Dec 10, 2015 at 11:36 AM, Andres Freund <andres@anarazel.de> wrote:
>> In fact, having no way to get the relation length other than scanning
>> 1000 files doesn't seem like an especially good choice even if we used
>> a better data structure.  Putting a header page in the heap would make
>> getting the length of a relation O(1) instead of O(segments), and for
>> a bonus, we'd be able to reliably detect it if a relation file
>> disappeared out from under us.  That's a difficult project and
>> definitely not my top priority, but this code is old and crufty all
>> the same.)
>
> The md layer doesn't really know whether it's dealing with an index, or
> with a heap, or ... So handling this via a metapage doesn't seem
> particularly straightforward.

It's not straightforward, but I don't think that's the reason.  What
we could do is look at the call sites that use
RelationGetNumberOfBlocks() and change some of them to get the
information some other way instead.  I believe get_relation_info() and
initscan() are the primary culprits, accounting for some enormous
percentage of the system calls we do on a read-only pgbench workload.
Those functions certainly know enough to consult a metapage if we had
such a thing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: mdnblocks() sabotages error checking in _mdfd_getseg()

From:
Greg Stark
Date:
On Thu, Dec 10, 2015 at 4:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> It's not straightforward, but I don't think that's the reason.  What
> we could do is look at the call sites that use
> RelationGetNumberOfBlocks() and change some of them to get the
> information some other way instead.  I believe get_relation_info() and
> initscan() are the primary culprits, accounting for some enormous
> percentage of the system calls we do on a read-only pgbench workload.
> Those functions certainly know enough to consult a metapage if we had
> such a thing.

Would this not run into a chicken and egg problem with recovery?
Unless you're going to fsync the meta page whenever the file is
extended you'll have to xlog any updates to it and treat the values in
memory as authoritative. But when replaying xlog you'll see obsolete
inconsistent versions on disk and won't have the correct values in
memory either.

It seems to me that if you want to fix the linked lists of files
that's orthogonal to whether the file lengths on disk are
authoritative. You can always keep the lengths or at least the number
of files cached and updated in shared memory in a more efficient
storage format.

-- 
greg



Re: mdnblocks() sabotages error checking in _mdfd_getseg()

From:
Andres Freund
Date:
On 2015-12-10 17:55:37 +0000, Greg Stark wrote:
> It seems to me that if you want to fix the linked lists of files
> that's orthogonal to whether the file lengths on disk are
> authoritative. You can always keep the lengths or at least the number
> of files cached and updated in shared memory in a more efficient
> storage format.

FWIW, I've a very preliminary patch doing exactly that, keeping the file
size in shared memory. I'm not sure it'd end up nice if relation
extension would have to grab an exclusive lock on the metapage.



Re: mdnblocks() sabotages error checking in _mdfd_getseg()

From:
Robert Haas
Date:
On Thu, Dec 10, 2015 at 12:55 PM, Greg Stark <stark@mit.edu> wrote:
> On Thu, Dec 10, 2015 at 4:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> It's not straightforward, but I don't think that's the reason.  What
>> we could do is look at the call sites that use
>> RelationGetNumberOfBlocks() and change some of them to get the
>> information some other way instead.  I believe get_relation_info() and
>> initscan() are the primary culprits, accounting for some enormous
>> percentage of the system calls we do on a read-only pgbench workload.
>> Those functions certainly know enough to consult a metapage if we had
>> such a thing.
>
> Would this not run into a chicken and egg problem with recovery?

No.

> Unless you're going to fsync the meta page whenever the file is
> extended you'll have to xlog any updates to it and treat the values in
> memory as authoritative. But when replaying xlog you'll see obsolete
> inconsistent versions on disk and won't have the correct values in
> memory either.

All true, but irrelevant.  The places I just mentioned wouldn't get
called during recovery.  They're only invoked from queries.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: mdnblocks() sabotages error checking in _mdfd_getseg()

From:
Simon Riggs
Date:
On 10 December 2015 at 16:47, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Dec 10, 2015 at 11:36 AM, Andres Freund <andres@anarazel.de> wrote:
>> In fact, having no way to get the relation length other than scanning
>> 1000 files doesn't seem like an especially good choice even if we used
>> a better data structure.  Putting a header page in the heap would make
>> getting the length of a relation O(1) instead of O(segments), and for
>> a bonus, we'd be able to reliably detect it if a relation file
>> disappeared out from under us.  That's a difficult project and
>> definitely not my top priority, but this code is old and crufty all
>> the same.)
>
> The md layer doesn't really know whether it's dealing with an index, or
> with a heap, or ... So handling this via a metapage doesn't seem
> particularly straightforward.

It's not straightforward, but I don't think that's the reason.  What
we could do is look at the call sites that use
RelationGetNumberOfBlocks() and change some of them to get the
information some other way instead.  I believe get_relation_info() and
initscan() are the primary culprits, accounting for some enormous
percentage of the system calls we do on a read-only pgbench workload.
Those functions certainly know enough to consult a metapage if we had
such a thing.

It looks pretty straightforward to me... 

The number of relations with >1 file is likely to be fairly small, so we can just have an in-memory array to record that. 8 bytes per relation >1 GB isn't going to take much shmem, but we can extend using dynshmem as needed. We can seq scan the array at relcache build time and invalidate relcache when we extend. WAL log any extension to a new segment and write the table to disk at checkpoint.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: mdnblocks() sabotages error checking in _mdfd_getseg()

From:
Robert Haas
Date:
On Thu, Dec 10, 2015 at 1:22 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 10 December 2015 at 16:47, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Thu, Dec 10, 2015 at 11:36 AM, Andres Freund <andres@anarazel.de>
>> wrote:
>> >> In fact, having no way to get the relation length other than scanning
>> >> 1000 files doesn't seem like an especially good choice even if we used
>> >> a better data structure.  Putting a header page in the heap would make
>> >> getting the length of a relation O(1) instead of O(segments), and for
>> >> a bonus, we'd be able to reliably detect it if a relation file
>> >> disappeared out from under us.  That's a difficult project and
>> >> definitely not my top priority, but this code is old and crufty all
>> >> the same.)
>> >
>> > The md layer doesn't really know whether it's dealing with an index, or
>> > with a heap, or ... So handling this via a metapage doesn't seem
>> > particularly straightforward.
>>
>> It's not straightforward, but I don't think that's the reason.  What
>> we could do is look at the call sites that use
>> RelationGetNumberOfBlocks() and change some of them to get the
>> information some other way instead.  I believe get_relation_info() and
>> initscan() are the primary culprits, accounting for some enormous
>> percentage of the system calls we do on a read-only pgbench workload.
>> Those functions certainly know enough to consult a metapage if we had
>> such a thing.
>
> It looks pretty straightforward to me...
>
> The number of relations with >1 file is likely to be fairly small, so we can
> just have an in-memory array to record that. 8 bytes per relation >1 GB
> isn't going to take much shmem, but we can extend using dynshmem as needed.
> We can seq scan the array at relcache build time and invalidate relcache
> when we extend. WAL log any extension to a new segment and write the table
> to disk at checkpoint.

Invalidating the relcache when we extend would be extremely expensive,
but we could probably come up with some variant of this that would
work.  I'm not very excited about this design, though; I think
actually putting a metapage on each relation would be better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: mdnblocks() sabotages error checking in _mdfd_getseg()

From:
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Dec 10, 2015 at 1:22 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> We can seq scan the array at relcache build time and invalidate relcache
>> when we extend. WAL log any extension to a new segment and write the table
>> to disk at checkpoint.

> Invalidating the relcache when we extend would be extremely expensive,

... and I think it would be too late anyway, if backends are relying on
the relcache to tell the truth.  You can't require an exclusive lock on
a rel just to extend it, which means there cannot be a guarantee that
what a backend has in its relcache will be up to date with current reality.

I really don't like Robert's proposal of a metapage though.  We've got too
darn many forks per relation already.

It strikes me that this discussion is perhaps conflating two different
issues.  Robert seems to be concerned about how we'd detect (not recover
from, just detect) filesystem misfeasance in the form of complete loss
of a non-last segment file.  The other issue is a desire to reduce the
cost of mdnblocks() calls.  It may be worth thinking about those two
things together, but we shouldn't lose sight of these being separate
goals, assuming that anybody besides Robert thinks that the segment
file loss issue is worth worrying about.
        regards, tom lane



Re: mdnblocks() sabotages error checking in _mdfd_getseg()

From:
Robert Haas
Date:
On Thu, Dec 10, 2015 at 1:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Dec 10, 2015 at 1:22 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> We can seq scan the array at relcache build time and invalidate relcache
>>> when we extend. WAL log any extension to a new segment and write the table
>>> to disk at checkpoint.
>
>> Invalidating the relcache when we extend would be extremely expensive,
>
> ... and I think it would be too late anyway, if backends are relying on
> the relcache to tell the truth.  You can't require an exclusive lock on
> a rel just to extend it, which means there cannot be a guarantee that
> what a backend has in its relcache will be up to date with current reality.

True.

> I really don't like Robert's proposal of a metapage though.  We've got too
> darn many forks per relation already.

Oh, I wasn't thinking of adding a fork, just repurposing block 0 of
the main fork, as we do for some index types.

> It strikes me that this discussion is perhaps conflating two different
> issues.  Robert seems to be concerned about how we'd detect (not recover
> from, just detect) filesystem misfeasance in the form of complete loss
> of a non-last segment file.  The other issue is a desire to reduce the
> cost of mdnblocks() calls.  It may be worth thinking about those two
> things together, but we shouldn't lose sight of these being separate
> goals, assuming that anybody besides Robert thinks that the segment
> file loss issue is worth worrying about.

Don't get me wrong, I'm not willing to expend *any* extra cycles to
notice a problem here.  But all things being equal, code that notices
broken stuff is better than code that doesn't.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: mdnblocks() sabotages error checking in _mdfd_getseg()

From:
Greg Stark
Date:
On Thu, Dec 10, 2015 at 7:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I really don't like Robert's proposal of a metapage though.  We've got too
>> darn many forks per relation already.
>
> Oh, I wasn't thinking of adding a fork, just repurposing block 0 of
> the main fork, as we do for some index types.

Not to mention that it should probably be done for all forks. In
practice none of the other forks are going to go over a gigabyte these
days but having different extension logic depending on the fork would
be a pain and only limit future uses of forks.

It doesn't seem like an easy change to make work with binary upgrades
though. I suppose you could just support the metapage being absent in
perpetuity but then it's hard to test and would make the logic in md
even more complex than now rather than simpler.


Incidentally, for comparison Oracle has a header page on every data
file with various information. Data files are added like we add
segments to tables but they are actually pieces of a tablespace rather
than a table. The most significant piece of metadata that they store
in this header block is actually the commit sequence number --
effectively equivalent to our LSN -- that the file is consistent with.
That lets you see how far back you need to start recovery and you can
narrow down which data files actually need recovery.




-- 
greg



Re: mdnblocks() sabotages error checking in _mdfd_getseg()

From:
Andres Freund
Date:
On 2015-12-09 16:50:06 -0500, Robert Haas wrote:
> Now, if subsequent to this an index scan happens to sweep through and
> try to fetch a block in 123456.2, it will work!  This happens because
> _mdfd_getseg() doesn't care about the length of the segments; it only
> cares whether or not they exist.

Unless in recovery in the startup process, or when EXTENSION_CREATE is
passed to it. Depending on whether it or mdnblocks were called first,
and depending on which segment is missing.  In that case it'd *possibly*
pad the last block in a segment with zeroes. Yes, only the last block:

        /*
         * We have to maintain the invariant that segments before the last
         * active segment are of size RELSEG_SIZE; therefore, pad them out
         * with zeroes if needed.  (This only matters if caller is
         * extending the relation discontiguously, but that can happen in
         * hash indexes.)
         */
        if (behavior == EXTENSION_CREATE || InRecovery)
        {
            if (_mdnblocks(reln, forknum, v) < RELSEG_SIZE)
            {
                char       *zerobuf = palloc0(BLCKSZ);

                mdextend(reln, forknum,
                         nextsegno * ((BlockNumber) RELSEG_SIZE) - 1,
                         zerobuf, skipFsync);
                pfree(zerobuf);
            }
            v->mdfd_chain = _mdfd_openseg(reln, forknum, nextsegno, O_CREAT);
        }


> If 123456.1 were actually missing,
> then we'd never test whether 123456.2 exists and we'd get an error.
> But because mdnblocks() created 123456.1, _mdfd_getseg() is now quite
> happy to fetch blocks in 123456.2; it considers the empty 123456.1
> file to be a sufficient reason to look for 123456.2, and seeing that
> file, and finding the block it wants therein, it happily returns that
> block, blithely ignoring the fact that it passed over a completely empty
> segment before returning a block from .2.  This is maybe not the
> smartest thing ever.

I think it might be worthwhile to add a check against this. It'd not be
very hard to ensure the segment is indeed large enough iff further
segments exist. Possibly not an ERROR, but a WARNING might be useful. To
avoid overhead and absurd log spamming we should make that check only if
the segment isn't already open.


> The comment in mdnblocks.c says this:
> 
>                          * Because we pass O_CREAT, we will create the next segment (with
>                          * zero length) immediately, if the last segment is of length
>                          * RELSEG_SIZE.  While perhaps not strictly necessary, this keeps
>                          * the logic simple.
> 
> I don't really see how this "keeps the logic simple".  What it does is
> allow sequential scans and index scans to have two completely
> different notions of how many accessible blocks there are in the
> relation.  Granted, this kind of thing really shouldn't happen, but
> sometimes bad things do happen.  Therefore, I propose the attached
> patch.

I'm in favor of changing this behaviour - it's indeed weird and somewhat
error prone.

Do you want to change this in master, or backpatch a change? To me the
code in md.c is subtle and complicated enough that I'm rather inclined
to only change this in master.


This code is really hard to test effectively :(.


Greetings,

Andres Freund



Re: mdnblocks() sabotages error checking in _mdfd_getseg()

From:
Robert Haas
Date:
On Tue, Dec 15, 2015 at 9:51 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-12-09 16:50:06 -0500, Robert Haas wrote:
>> Now, if subsequent to this an index scan happens to sweep through and
>> try to fetch a block in 123456.2, it will work!  This happens because
>> _mdfd_getseg() doesn't care about the length of the segments; it only
>> cares whether or not they exist.
>
> Unless in recovery in the startup process, or when EXTENSION_CREATE is
> passed to it. Depending on whether it or mdnblocks were called first,
> and depending on which segment is missing.  In that case it'd *possibly*
> pad the last block in a segment with zeroes. Yes, only the last block:

Yes, that's clearly inadequate.  I think the fact that it only pads
the last block, possibly creating a sparse file, should also be fixed,
but as a separate patch.

>> If 123456.1 were actually missing,
>> then we'd never test whether 123456.2 exists and we'd get an error.
>> But because mdnblocks() created 123456.1, _mdfd_getseg() is now quite
>> happy to fetch blocks in 123456.2; it considers the empty 123456.1
>> file to be a sufficient reason to look for 123456.2, and seeing that
>> file, and finding the block it wants therein, it happily returns that
>> block, blithely ignoring the fact that it passed over a completely empty
>> segment before returning a block from .2.  This is maybe not the
>> smartest thing ever.
>
> I think it might be worthwhile to add a check against this. It'd not be
> very hard to ensure the segment is indeed large enough iff further
> segments exist. Possibly not an ERROR, but a WARNING might be useful. To
> avoid overhead and absurd log spamming we should make that check only if
> the segment isn't already open.

We could:

1. Before extending a relation to RELSEG_SIZE, verify that the next
segment of the relation doesn't already exist on disk.  If we find out
that it does, then throw an error and refuse the extension.

2. Teach _mdfd_getseg() that a segment of a size less than RELSEG_SIZE
is, by definition, the last segment, and don't even check whether
further files exist on disk.  mdnblocks() already behaves this way, so
it would just be making things consistent.

If we did these things, then if you got a hole in your relation, the
entire system would agree that the data after the hole doesn't exist
any more, and if the segment prior to the hole got re-extended, it
would be prevented from "bridging" the hole.  We wouldn't be incurring
any additional overhead, either, because I bet #2 would save more
cycles than #1 costs.

However, I don't think this is exactly what you are proposing.  I'm
skeptical of the idea that _mdfd_getseg() should probe ahead to see
whether we're dealing with a malformed relation where the intermediate
segments still exist but have zero length.  If we fix mdnblocks() as
I've proposed, then that scenario will become much less probable.  Of
course, the operating system could still decide to truncate a segment
other than the last for some reason or other, but it could also decide
to remove the file completely, or to remove several files, or lots of
other things.

>> The comment in mdnblocks.c says this:
>>
>>                          * Because we pass O_CREAT, we will create the next segment (with
>>                          * zero length) immediately, if the last segment is of length
>>                          * RELSEG_SIZE.  While perhaps not strictly necessary, this keeps
>>                          * the logic simple.
>>
>> I don't really see how this "keeps the logic simple".  What it does is
>> allow sequential scans and index scans to have two completely
>> different notions of how many accessible blocks there are in the
>> relation.  Granted, this kind of thing really shouldn't happen, but
>> sometimes bad things do happen.  Therefore, I propose the attached
>> patch.
>
> I'm in favor of changing this behaviour - it's indeed weird and somewhat
> error prone.
>
> Do you want to change this in master, or backpatch a change? To me the
> code in md.c is subtle and complicated enough that I'm rather inclined
> to only change this in master.

Yeah, that's my inclination as well.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: mdnblocks() sabotages error checking in _mdfd_getseg()

From:
Andres Freund
Date:
On 2015-12-15 10:53:58 -0500, Robert Haas wrote:
> On Tue, Dec 15, 2015 at 9:51 AM, Andres Freund <andres@anarazel.de> wrote:
> > Unless in recovery in the startup process, or when EXTENSION_CREATE is
> > passed to it. Depending on whether it or mdnblocks were called first,
> > and depending on which segment is missing.  In that case it'd *possibly*
> > pad the last block in a segment with zeroes. Yes, only the last block:
> 
> Yes, that's clearly inadequate.  I think the fact that it only pads
> the last block, possibly creating a sparse file, should also be fixed,
> but as a separate patch.

I was wondering about that - but after a bit of thinking I'm disinclined
to go that way. Consider the scenario where we write to the last block
in a humongous relation, drop that relation, and then crash, before a
checkpoint. Right now we'll write the last block of each segment during
recovery. After such a change we'd end up rewriting the whole file...

I think we should primarily update that comment.


> 1. Before extending a relation to RELSEG_SIZE, verify that the next
> segment of the relation doesn't already exist on disk.  If we find out
> that it does, then throw an error and refuse the extension.

That bit wouldn't work directly that way IIRC - the segment is allowed
to exist as an 'inactive' segment. We only truncate segments to 0 in
mdtruncate()...  But a size 0 check should do the trick.


> 2. Teach _mdfd_getseg() that a segment of a size less than RELSEG_SIZE
> is, by definition, the last segment, and don't even check whether
> further files exist on disk.  mdnblocks() already behaves this way, so
> it would just be making things consistent.

> However, I don't think this is exactly what you are proposing.  I'm
> skeptical of the idea that _mdfd_getseg() should probe ahead to see
> whether we're dealing with a malformed relation where the intermediate
> segments still exist but have zero length.

That's not exactly what I was thinking of. I was thinking of doing a
_mdnblocks(reln, forknum, v) == RELSEG_SIZE check in _mdfd_getseg()'s
main loop, whenever nextsegno < targetseg. That'll make that check
rather cheap.  Which sounds pretty much like your 2).

Andres



Re: mdnblocks() sabotages error checking in _mdfd_getseg()

From:
Robert Haas
Date:
On Tue, Dec 15, 2015 at 11:08 AM, Andres Freund <andres@anarazel.de> wrote:
>> However, I don't think this is exactly what you are proposing.  I'm
>> skeptical of the idea that _mdfd_getseg() should probe ahead to see
>> whether we're dealing with a malformed relation where the intermediate
>> segments still exist but have zero length.
>
> That's not exactly what I was thinking of. I was thinking of doing a
> _mdnblocks(reln, forknum, v) == RELSEG_SIZE check in _mdfd_getseg()'s
> main loop, whenever nextsegno < targetseg. That'll make that check
> rather cheap.  Which sounds pretty much like your 2).

Hmm, yes it does.  But now that I think about it, we're not otherwise
doing _mdnblocks() in that loop.  So that would add a system call per
loop iteration.  That doesn't seem like such a swell idea.

If you're OK with it, I think I'll commit the original patch.  That
seems like a good thing to do regardless of what we decide about the
rest of this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: mdnblocks() sabotages error checking in _mdfd_getseg()

From:
Andres Freund
Date:
On 2015-12-15 11:28:10 -0500, Robert Haas wrote:
> Hmm, yes it does.  But now that I think about it, we're not otherwise
> doing _mdnblocks() in that loop.  So that would add a system call per
> loop iteration.  That doesn't seem like such a swell idea.

We'd do that only when initially opening the relation (that far), and
after we've already executed a much more expensive open() on the same
file. So I'm not particularly bothered by that.


> If you're OK with it, I think I'll commit the original patch.  That
> seems like a good thing to do regardless of what we decide about the
> rest of this.

Ok.

I thought for a moment that that patch would need a bit more editing
because of:

        /*
         * This is the last segment we want to keep. Truncate the file to
         * the right length, and clear chain link that points to any
         * remaining segments (which we shall zap). NOTE: if nblocks is
         * exactly a multiple K of RELSEG_SIZE, we will truncate the K+1st
         * segment to 0 length but keep it. This adheres to the invariant
         * given in the header comments.
         */

in mdtruncate(). But afaics this doesn't change:

 *  On disk, a relation must consist of consecutively numbered segment
 *  files in the pattern
 *      -- Zero or more full segments of exactly RELSEG_SIZE blocks each
 *      -- Exactly one partial segment of size 0 <= size < RELSEG_SIZE blocks
 *      -- Optionally, any number of inactive segments of size 0 blocks.

at all, so I think we're good.

Andres