Обсуждение: Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

Поиск

Список

Период

Сортировка

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Robert Haas

Дата:

04 апреля 2016 г., 19:45:36

On Sat, Mar 26, 2016 at 8:39 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-03-25 12:02:05 -0400, Robert Haas wrote:
>> Gosh, that's surprising.  I wonder if that just revealed an underlying
>> issue rather than creating it.
>
> I think that's the case; it's just somewhat unlikely to hit in other
> cases.
>
> If SMgrRelation->md_fd[n] is an empty relation, and mdread() or another
> routine is asking for a block in the second segment - which will error
> out. But even if the first segment is zero bytes, _mdfd_getseg() will
> dutifully try to open the second segment. Which will succeed in the case
> of a truncated relation, because we leave the truncated segment in
> place.
>
> ISTM that _mdfd_getseg better check the length of the last segment
> before opening the next one?

Well, I agree that it's pretty strange that _mdfd_getseg() makes no
such check, but I still don't think I understand what's going on here.
Backends shouldn't be requesting nonexistent blocks from a relation -
higher-level safeguards, like holding AccessExclusiveLock before
trying to complete a DROP or TRUNCATE - are supposed to prevent that.
If this patch is causing us to hold onto smgr references to a relation
on which we no longer hold locks, I think that's irretrievably broken
and should be reverted.  I really doubt this will be the only thing
that goes wrong if you do that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Noah Misch

Дата:

06 апреля 2016 г., 03:57:12

On Mon, Apr 04, 2016 at 12:45:25PM -0400, Robert Haas wrote:
> Backends shouldn't be requesting nonexistent blocks from a relation -
> higher-level safeguards, like holding AccessExclusiveLock before
> trying to complete a DROP or TRUNCATE - are supposed to prevent that.
> If this patch is causing us to hold onto smgr references to a relation
> on which we no longer hold locks, I think that's irretrievably broken
> and should be reverted.  I really doubt this will be the only thing
> that goes wrong if you do that.

+1

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Andres Freund

Дата:

11 апреля 2016 г., 19:51:19

On 2016-04-04 12:45:25 -0400, Robert Haas wrote:
> Well, I agree that it's pretty strange that _mdfd_getseg() makes no
> such check, but I still don't think I understand what's going on here.
> Backends shouldn't be requesting nonexistent blocks from a relation -
> higher-level safeguards, like holding AccessExclusiveLock before
> trying to complete a DROP or TRUNCATE - are supposed to prevent that.

I don't think that's really true: We write blocks back without holding
any sort of relation level locks; and thus do _mdfd_getseg() type
accesses as well.

And we're not really "requesting nonexistant blocks" - the segments are
just opened to get the associated file descriptor, and they're opened
with EXTENSION_RETURN_NULL. It turns out to be important
performance-wise to reuse fd's when triggering kernel writeback.

> If this patch is causing us to hold onto smgr references to a relation
> on which we no longer hold locks, I think that's irretrievably broken
> and should be reverted.  I really doubt this will be the only thing
> that goes wrong if you do that.

As we already have done that for writes for a long time, I'm a bit
surprised about that statement.

Greetings,

Andres Freund

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Robert Haas

Дата:

11 апреля 2016 г., 20:05:00

On Mon, Apr 11, 2016 at 12:50 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-04-04 12:45:25 -0400, Robert Haas wrote:
>> Well, I agree that it's pretty strange that _mdfd_getseg() makes no
>> such check, but I still don't think I understand what's going on here.
>> Backends shouldn't be requesting nonexistent blocks from a relation -
>> higher-level safeguards, like holding AccessExclusiveLock before
>> trying to complete a DROP or TRUNCATE - are supposed to prevent that.
>
> I don't think that's really true: We write blocks back without holding
> any sort of relation level locks; and thus do _mdfd_getseg() type
> accesses as well.

You're right, but I think that's more because I didn't say it
correctly than because you haven't done something novel.  DROP and
relation truncation know about shared buffers, and they go clear
blocks that that might be affected from it as part of the truncate
operation, which means that no other backend will see them after they
are gone.  The lock makes sure that no other references can be added
while we're busy removing any that are already there.  So I think that
there is currently an invariant that any block we are attempting to
access should actually still exist.  It sounds like these references
are sticking around in backend-private memory, which means they are
neither protected by locks nor able to be cleared out on drop or
truncate.  I think that's a new thing, and a bit scary.

> And we're not really "requesting nonexistant blocks" - the segments are
> just opened to get the associated file descriptor, and they're opened
> with EXTENSION_RETURN_NULL. It turns out to be important
> performance-wise to reuse fd's when triggering kernel writeback.

The possibly-saving grace here, I suppose, is that the references
we're worried about are just being used to issue hints to the
operating system.  So I guess if we sent a hint on a wrong block or
skip sending a hint altogether because of some failure, no harm done,
as long as we don't error out.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Andres Freund

Дата:

11 апреля 2016 г., 20:17:53

On 2016-04-11 13:04:48 -0400, Robert Haas wrote:
> You're right, but I think that's more because I didn't say it
> correctly than because you haven't done something novel.

Could be.

> DROP and
> relation truncation know about shared buffers, and they go clear
> blocks that that might be affected from it as part of the truncate
> operation, which means that no other backend will see them after they
> are gone.  The lock makes sure that no other references can be added
> while we're busy removing any that are already there.  So I think that
> there is currently an invariant that any block we are attempting to
> access should actually still exist.

Note that we're not actually accessing any blocks, we're just opening a
segment to get the associated file descriptor.

> It sounds like these references are sticking around in backend-private
> memory, which means they are neither protected by locks nor able to be
> cleared out on drop or truncate.  I think that's a new thing, and a
> bit scary.

True. But how would you batch flush requests in a sorted manner
otherwise, without re-opening file descriptors otherwise? And that's
prety essential for performance.

I can think of a number of relatively easy ways to address this:
1) Just zap (or issue?) all pending flush requests when getting an  smgrinval/smgrclosenode
2) Do 1), but filter for the closed relnode
3) Actually handle the case of the last open segment not being  RELSEG_SIZE properly in _mdfd_getseg() - mdnblocks()
doesso.

I'm kind of inclined to do both 3) and 1).

> The possibly-saving grace here, I suppose, is that the references
> we're worried about are just being used to issue hints to the
> operating system.

Indeed.

> So I guess if we sent a hint on a wrong block or
> skip sending a hint altogether because of some failure, no harm done,
> as long as we don't error out.

Which the writeback code is careful not to do; afaics it's just the
"already open segment" issue making problems here.

- Andres

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Fabien COELHO

Дата:

12 апреля 2016 г., 11:52:09

Hello Andres,

> I can think of a number of relatively easy ways to address this:
> 1) Just zap (or issue?) all pending flush requests when getting an
>   smgrinval/smgrclosenode
> 2) Do 1), but filter for the closed relnode
> 3) Actually handle the case of the last open segment not being
>   RELSEG_SIZE properly in _mdfd_getseg() - mdnblocks() does so.
>
> I'm kind of inclined to do both 3) and 1).

Alas I'm a little bit outside my area of expertise. Just for suggestion 
purpose, possibly totally wrong (please accept my apology if this is the 
case), the ideas I had while thinking about these issues, some may be 
quite close to your suggestions above:
 - keep track of file/segment closing/reopenings (say with a counter), so   that if a flush is about a file/segment
whichhas been closed or   reopened, it is just skipped. I'm not sure this is enough, because one   process may do the
filecleaning and another may want to flush, although   I guess there is some kind of locking/shared data structures to
manage  such interactions between pg processes.
 
 - because of "empty the bucket when filled behavior" of the current   implementation, a buffer to be flused may be
keptfor a very   long time in the bucket. I think that flushing advices become stale   after a while so should not be
issued(the buffer may have been   written again, ...), or the bucket should be flushed after a while   even of not
full.

Also, a detail in "pg_flush_data", there are a serie of three if/endif, 
but it is not obvious to me whether they are mutually exclusive while 
looking at the source, so I was wondering whether the code could try 
several flushings approaches one after the other... This is probably not 
the case, but I would be more at ease with a if/elif/elif/endif structure 
there so that is clearer.

-- 
Fabien.

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Robert Haas

Дата:

12 апреля 2016 г., 16:01:03

On Mon, Apr 11, 2016 at 1:17 PM, Andres Freund <andres@anarazel.de> wrote:
> I can think of a number of relatively easy ways to address this:
> 1) Just zap (or issue?) all pending flush requests when getting an
>    smgrinval/smgrclosenode
> 2) Do 1), but filter for the closed relnode
> 3) Actually handle the case of the last open segment not being
>    RELSEG_SIZE properly in _mdfd_getseg() - mdnblocks() does so.
>
> I'm kind of inclined to do both 3) and 1).

#3 seems like it's probably about 15 years overdue, so let's do that anyway.

I don't quite understand why #1 fixes the problem - not because I
doubt you, but just because I haven't studied it much.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Abhijit Menon-Sen

Дата:

14 апреля 2016 г., 08:39:36

At 2016-04-12 09:00:57 -0400, robertmhaas@gmail.com wrote:
>
> On Mon, Apr 11, 2016 at 1:17 PM, Andres Freund <andres@anarazel.de> wrote:
> >
> > 3) Actually handle the case of the last open segment not being
> >    RELSEG_SIZE properly in _mdfd_getseg() - mdnblocks() does so.
> 
> #3 seems like it's probably about 15 years overdue, so let's do that
> anyway.

I don't have anything useful to contribute, I'm just trying to
understand this fix.

_mdfd_getseg() looks like this:
targetseg = blkno / ((BlockNumber) RELSEG_SIZE);for (nextsegno = 1; nextsegno <= targetseg; nextsegno++){
Assert(nextsegno== v->mdfd_segno + 1);
 
    if (v->mdfd_chain == NULL)    {        /*         * Normally we will create new segments only if authorized by the
      * caller (i.e., we are doing mdextend()). […]         */        if (behavior == EXTENSION_CREATE || InRecovery)
    {            if (_mdnblocks(reln, forknum, v) < RELSEG_SIZE)                                   /* mdextend */
    v->mdfd_chain = _mdfd_openseg(reln, forknum, +nextsegno, O_CREAT);        }        else        {            /* We
won'tcreate segment if not existent */            v->mdfd_chain = _mdfd_openseg(reln, forknum, nextsegno, 0);        }
     if (v->mdfd_chain == NULL)                           /* return NULL or ERROR */    }    v = v->mdfd_chain;}
 

Do I understand correctly that the case of the "last open segment"
(i.e., the one for which mdfd_chain == NULL) not being RELSEG_SIZE
(i.e., _mdnblocks(reln, forknum, v) < RELSEG_SIZE) should not call
_mfdf_openseg on nextsegno if behaviour is not EXTENSION_CREATE or
InRecovery?

And that "We won't create segment if not existent" should happen, but
doesn't only because the next segment file wasn't removed earlier, so
we have to add an extra check for that case?

In other words, is something like the following what's needed here, or
is there more to it?

-- Abhijit

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 578276d..58ea5a0 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1783,6 +1783,8 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,               if
(v->mdfd_chain== NULL)               {
 
+                       bool thisSegNeedsExtension = _mdnblocks(reln, forknum, v) < RELSEG_SIZE;
+                       /*                        * Normally we will create new segments only if authorized by the
                 * caller (i.e., we are doing mdextend()).  But when doing WAL
 
@@ -1799,7 +1801,7 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,                        */
                    if (behavior == EXTENSION_CREATE || InRecovery)                       {
 
-                               if (_mdnblocks(reln, forknum, v) < RELSEG_SIZE)
+                               if (thisSegNeedsExtension)                               {
        char       *zerobuf = palloc0(BLCKSZ);
 
@@ -1810,7 +1812,7 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
  }                               v->mdfd_chain = _mdfd_openseg(reln, forknum, +nextsegno, O_CREAT);
  }
 
-                       else
+                       else if (!thisSegNeedsExtension)                       {                               /* We
won'tcreate segment if not existent */                               v->mdfd_chain = _mdfd_openseg(reln, forknum,
nextsegno,0);

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Robert Haas

Дата:

15 апреля 2016 г., 20:42:39

On Thu, Apr 14, 2016 at 1:39 AM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote:
> At 2016-04-12 09:00:57 -0400, robertmhaas@gmail.com wrote:
>>
>> On Mon, Apr 11, 2016 at 1:17 PM, Andres Freund <andres@anarazel.de> wrote:
>> >
>> > 3) Actually handle the case of the last open segment not being
>> >    RELSEG_SIZE properly in _mdfd_getseg() - mdnblocks() does so.
>>
>> #3 seems like it's probably about 15 years overdue, so let's do that
>> anyway.
>
> Do I understand correctly that the case of the "last open segment"
> (i.e., the one for which mdfd_chain == NULL) not being RELSEG_SIZE
> (i.e., _mdnblocks(reln, forknum, v) < RELSEG_SIZE) should not call
> _mfdf_openseg on nextsegno if behaviour is not EXTENSION_CREATE or
> InRecovery?
>
> And that "We won't create segment if not existent" should happen, but
> doesn't only because the next segment file wasn't removed earlier, so
> we have to add an extra check for that case?
>
> In other words, is something like the following what's needed here, or
> is there more to it?

Something like that is what I was thinking about, but I notice it has
the disadvantage of adding lseeks to cater to a shouldn't-happen
condition.  Not sure if we should care.

My attempts to test things were also singularly unrewarding.  Even
after messing with the filesystem in various ways, I couldn't figure
out exactly how this makes a difference.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Noah Misch

Дата:

17 апреля 2016 г., 01:23:09

On Tue, Apr 12, 2016 at 09:00:57AM -0400, Robert Haas wrote:
> On Mon, Apr 11, 2016 at 1:17 PM, Andres Freund <andres@anarazel.de> wrote:
> > I can think of a number of relatively easy ways to address this:
> > 1) Just zap (or issue?) all pending flush requests when getting an
> >    smgrinval/smgrclosenode
> > 2) Do 1), but filter for the closed relnode
> > 3) Actually handle the case of the last open segment not being
> >    RELSEG_SIZE properly in _mdfd_getseg() - mdnblocks() does so.
> >
> > I'm kind of inclined to do both 3) and 1).

Andres, you own the underlying open item, right?  Do you plan to implement
this fix you have outlined?  If so, when?

> #3 seems like it's probably about 15 years overdue, so let's do that anyway.
> 
> I don't quite understand why #1 fixes the problem - not because I
> doubt you, but just because I haven't studied it much.

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Andres Freund

Дата:

17 апреля 2016 г., 01:28:28

On 2016-04-16 18:23:01 -0400, Noah Misch wrote:
> On Tue, Apr 12, 2016 at 09:00:57AM -0400, Robert Haas wrote:
> > On Mon, Apr 11, 2016 at 1:17 PM, Andres Freund <andres@anarazel.de> wrote:
> > > I can think of a number of relatively easy ways to address this:
> > > 1) Just zap (or issue?) all pending flush requests when getting an
> > >    smgrinval/smgrclosenode
> > > 2) Do 1), but filter for the closed relnode
> > > 3) Actually handle the case of the last open segment not being
> > >    RELSEG_SIZE properly in _mdfd_getseg() - mdnblocks() does so.
> > >
> > > I'm kind of inclined to do both 3) and 1).
> 
> Andres, you own the underlying open item, right?

Right.

> Do you plan to implement this fix you have outlined?  If so, when?

Yes, I plan to next week.

- Andres

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Noah Misch

Дата:

17 апреля 2016 г., 01:53:26

On Sat, Apr 16, 2016 at 03:28:23PM -0700, Andres Freund wrote:
> On 2016-04-16 18:23:01 -0400, Noah Misch wrote:
> > On Tue, Apr 12, 2016 at 09:00:57AM -0400, Robert Haas wrote:
> > > On Mon, Apr 11, 2016 at 1:17 PM, Andres Freund <andres@anarazel.de> wrote:
> > > > I can think of a number of relatively easy ways to address this:
> > > > 1) Just zap (or issue?) all pending flush requests when getting an
> > > >    smgrinval/smgrclosenode
> > > > 2) Do 1), but filter for the closed relnode
> > > > 3) Actually handle the case of the last open segment not being
> > > >    RELSEG_SIZE properly in _mdfd_getseg() - mdnblocks() does so.
> > > >
> > > > I'm kind of inclined to do both 3) and 1).
> > 
> > Andres, you own the underlying open item, right?
> 
> Right.
> 
> > Do you plan to implement this fix you have outlined?  If so, when?
> 
> Yes, I plan to next week.

Completion by the end of next week will work.  Thanks.  At that time
(2016-04-24T00:00 UTC), this would otherwise have spent twenty-three days as
an open item.  Would you inform this thread if you discover reasons to think
your plan will no longer succeed?

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Andres Freund

Дата:

21 апреля 2016 г., 20:42:44

On 2016-04-15 13:42:34 -0400, Robert Haas wrote:
> On Thu, Apr 14, 2016 at 1:39 AM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote:
> > At 2016-04-12 09:00:57 -0400, robertmhaas@gmail.com wrote:
> >>
> >> On Mon, Apr 11, 2016 at 1:17 PM, Andres Freund <andres@anarazel.de> wrote:
> >> >
> >> > 3) Actually handle the case of the last open segment not being
> >> >    RELSEG_SIZE properly in _mdfd_getseg() - mdnblocks() does so.
> >>
> >> #3 seems like it's probably about 15 years overdue, so let's do that
> >> anyway.
> >
> > Do I understand correctly that the case of the "last open segment"
> > (i.e., the one for which mdfd_chain == NULL) not being RELSEG_SIZE
> > (i.e., _mdnblocks(reln, forknum, v) < RELSEG_SIZE) should not call
> > _mfdf_openseg on nextsegno if behaviour is not EXTENSION_CREATE or
> > InRecovery?
> >
> > And that "We won't create segment if not existent" should happen, but
> > doesn't only because the next segment file wasn't removed earlier, so
> > we have to add an extra check for that case?
> >
> > In other words, is something like the following what's needed here, or
> > is there more to it?

Yes, I think that should solve the bug reported here. Did you check?

What makes me rather worried about solely using this as a solution right
now is the InRecovery check, which triggers segment extensions/creations
even with EXTENSION_RETURN_NULL.  Afaics it's more or less a dormant bug
that _mdfd_getseg() ignores RETURN_NULL in the InRecovery case, because
mdopen() (and other places) do *not* behave that way!

> Something like that is what I was thinking about, but I notice it has
> the disadvantage of adding lseeks to cater to a shouldn't-happen
> condition.  Not sure if we should care.

I'm not particularly concerned about that, my gues sit'd barely impact
the number of lseeks, because most callers will have called mdnblocks()
before anyway.

> My attempts to test things were also singularly unrewarding.  Even
> after messing with the filesystem in various ways, I couldn't figure
> out exactly how this makes a difference.

What do you mean by this? You couldn't reproduce the bug? Or not how to
trigger such a test?

Andres

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Robert Haas

Дата:

21 апреля 2016 г., 21:11:44

On Thu, Apr 21, 2016 at 9:45 AM, Andres Freund <andres@anarazel.de> wrote:
>> My attempts to test things were also singularly unrewarding.  Even
>> after messing with the filesystem in various ways, I couldn't figure
>> out exactly how this makes a difference.
>
> What do you mean by this? You couldn't reproduce the bug? Or not how to
> trigger such a test?

Let's see.  I think what i did is made _mdfd_getseg() error out if it
got to a segment that was not the right size and preceded the one it
was trying to reach.  But I could not hit that error.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Andres Freund

Дата:

22 апреля 2016 г., 20:07:43

On 2016-04-14 11:09:29 +0530, Abhijit Menon-Sen wrote:
> At 2016-04-12 09:00:57 -0400, robertmhaas@gmail.com wrote:
> >
> > On Mon, Apr 11, 2016 at 1:17 PM, Andres Freund <andres@anarazel.de> wrote:
> > >
> > > 3) Actually handle the case of the last open segment not being
> > >    RELSEG_SIZE properly in _mdfd_getseg() - mdnblocks() does so.
> >
> > #3 seems like it's probably about 15 years overdue, so let's do that
> > anyway.
>
> I don't have anything useful to contribute, I'm just trying to
> understand this fix.
>
> _mdfd_getseg() looks like this:
>
>     targetseg = blkno / ((BlockNumber) RELSEG_SIZE);
>     for (nextsegno = 1; nextsegno <= targetseg; nextsegno++)
>     {
>         Assert(nextsegno == v->mdfd_segno + 1);
>
>         if (v->mdfd_chain == NULL)
>         {
>             /*
>              * Normally we will create new segments only if authorized by the
>              * caller (i.e., we are doing mdextend()). […]
>              */
>             if (behavior == EXTENSION_CREATE || InRecovery)
>             {
>                 if (_mdnblocks(reln, forknum, v) < RELSEG_SIZE)
>                                     /* mdextend */
>                 v->mdfd_chain = _mdfd_openseg(reln, forknum, +nextsegno, O_CREAT);
>             }
>             else
>             {
>                 /* We won't create segment if not existent */
>                 v->mdfd_chain = _mdfd_openseg(reln, forknum, nextsegno, 0);
>             }
>             if (v->mdfd_chain == NULL)
>                             /* return NULL or ERROR */
>         }
>         v = v->mdfd_chain;
>     }
>
> Do I understand correctly that the case of the "last open segment"
> (i.e., the one for which mdfd_chain == NULL) not being RELSEG_SIZE
> (i.e., _mdnblocks(reln, forknum, v) < RELSEG_SIZE) should not call
> _mfdf_openseg on nextsegno if behaviour is not EXTENSION_CREATE or
> InRecovery?
>
> And that "We won't create segment if not existent" should happen, but
> doesn't only because the next segment file wasn't removed earlier, so
> we have to add an extra check for that case?
>
> In other words, is something like the following what's needed here, or
> is there more to it?

It's a bit more complicated than that, but that's the principle. Thanks
for the patch, and sorry for my late response. See my attached version
for a more fleshed out version of this.

Looking at the code around this I found:
* _mdfd_getseg(), in contrast to mdnblocks() doesn't check segment size,
  neither whether to continue with a too small, nor to error out with a
  too big segment
* For, imo pretty bad, reasons in mdsync() we currently rely on
  EXTENSION_RETURN_NULL to extend the file in recovery, and to set errno
  to ENOENT. Brrr.
* we currently FATAL if a segment is too big - I've copied that
  behaviour, but why not just ERROR out?

The attached patch basically adds the segment size checks to
_mdfd_getseg(), and doesn't perform extension, even in recovery, if
EXTENSION_REALLY_RETURN_NULL is passed.

This fixes the issue for me, both in the originally reported variant,
and in some more rudely setup version (i.e. rm'ing files).

I'm seing some weird behaviour (even with writeback flushing disabled)
with smgr invals and recovery/standbys, but I don't want to dive into it
before http://www.postgresql.org/message-id/CAD21AoDpZ6Xjg=gFrGPnSn4oTRRcwK1EBrWCq9OqOHuAcMMC=w@mail.gmail.com
is addressed, it's too likely to be related.

Greetings,

Andres Freund

Вложения

0001-Don-t-open-formally-non-existant-segments-in-_mdfd_g.patch

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Robert Haas

Дата:

23 апреля 2016 г., 00:26:20

On Fri, Apr 22, 2016 at 1:07 PM, Andres Freund <andres@anarazel.de> wrote:
> The attached patch basically adds the segment size checks to
> _mdfd_getseg(), and doesn't perform extension, even in recovery, if
> EXTENSION_REALLY_RETURN_NULL is passed.
>
> This fixes the issue for me, both in the originally reported variant,
> and in some more rudely setup version (i.e. rm'ing files).

I think this looks broadly reasonable.  Some specific comments:

+                /*
+                 * Normally we will create new segments only if authorized by
+                 * the caller (i.e., we are doing mdextend()).  But when doing
+                 * WAL recovery, create segments anyway; this allows cases
+                 * such as replaying WAL data that has a write into a
+                 * high-numbered segment of a relation that was later deleted.
+                 * We want to go ahead and create the segments so we can
+                 * finish out the replay.

Add something like: "However, if the caller has specified
EXTENSION_REALLY_RETURN_NULL, then extension is not desired even in
recovery; we won't reach this point in that case."

+                    errno = ENOENT;        /* some callers check errno, yuck */

I think a considerably more detailed comment would be warranted.

+                else
+                {
+                    ereport(ERROR,
+                            (errcode_for_file_access(),
+                             errmsg("could not open file \"%s\"
(target block %u): previous segment is only %u blocks",
+                                    _mdfd_segpath(reln, forknum, nextsegno),
+                                    blkno, nblocks)));
+                }

The else and its ensuing level of indentation are unnecessary.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Robert Haas

Дата:

25 апреля 2016 г., 15:55:59

On Fri, Apr 22, 2016 at 5:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Apr 22, 2016 at 1:07 PM, Andres Freund <andres@anarazel.de> wrote:
>> The attached patch basically adds the segment size checks to
>> _mdfd_getseg(), and doesn't perform extension, even in recovery, if
>> EXTENSION_REALLY_RETURN_NULL is passed.
>>
>> This fixes the issue for me, both in the originally reported variant,
>> and in some more rudely setup version (i.e. rm'ing files).
>
> I think this looks broadly reasonable.  Some specific comments:
>
> +                /*
> +                 * Normally we will create new segments only if authorized by
> +                 * the caller (i.e., we are doing mdextend()).  But when doing
> +                 * WAL recovery, create segments anyway; this allows cases
> +                 * such as replaying WAL data that has a write into a
> +                 * high-numbered segment of a relation that was later deleted.
> +                 * We want to go ahead and create the segments so we can
> +                 * finish out the replay.
>
> Add something like: "However, if the caller has specified
> EXTENSION_REALLY_RETURN_NULL, then extension is not desired even in
> recovery; we won't reach this point in that case."
>
> +                    errno = ENOENT;        /* some callers check errno, yuck */
>
> I think a considerably more detailed comment would be warranted.
>
> +                else
> +                {
> +                    ereport(ERROR,
> +                            (errcode_for_file_access(),
> +                             errmsg("could not open file \"%s\"
> (target block %u): previous segment is only %u blocks",
> +                                    _mdfd_segpath(reln, forknum, nextsegno),
> +                                    blkno, nblocks)));
> +                }
>
> The else and its ensuing level of indentation are unnecessary.

Andres, this issue has now been open for more than a month, which is
frankly kind of ridiculous given the schedule we're trying to hit for
beta.  Do you think it's possible to commit something RSN without
compromising the quality of PostgreSQL 9.6?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Thom Brown

Дата:

25 апреля 2016 г., 16:13:17

On 22 April 2016 at 18:07, Andres Freund <andres@anarazel.de> wrote:
> On 2016-04-14 11:09:29 +0530, Abhijit Menon-Sen wrote:
>> At 2016-04-12 09:00:57 -0400, robertmhaas@gmail.com wrote:
>> >
>> > On Mon, Apr 11, 2016 at 1:17 PM, Andres Freund <andres@anarazel.de> wrote:
>> > >
>> > > 3) Actually handle the case of the last open segment not being
>> > >    RELSEG_SIZE properly in _mdfd_getseg() - mdnblocks() does so.
>> >
>> > #3 seems like it's probably about 15 years overdue, so let's do that
>> > anyway.
>>
>> I don't have anything useful to contribute, I'm just trying to
>> understand this fix.
>>
>> _mdfd_getseg() looks like this:
>>
>>       targetseg = blkno / ((BlockNumber) RELSEG_SIZE);
>>       for (nextsegno = 1; nextsegno <= targetseg; nextsegno++)
>>       {
>>               Assert(nextsegno == v->mdfd_segno + 1);
>>
>>               if (v->mdfd_chain == NULL)
>>               {
>>                       /*
>>                        * Normally we will create new segments only if authorized by the
>>                        * caller (i.e., we are doing mdextend()). […]
>>                        */
>>                       if (behavior == EXTENSION_CREATE || InRecovery)
>>                       {
>>                               if (_mdnblocks(reln, forknum, v) < RELSEG_SIZE)
>>                                     /* mdextend */
>>                               v->mdfd_chain = _mdfd_openseg(reln, forknum, +nextsegno, O_CREAT);
>>                       }
>>                       else
>>                       {
>>                               /* We won't create segment if not existent */
>>                               v->mdfd_chain = _mdfd_openseg(reln, forknum, nextsegno, 0);
>>                       }
>>                       if (v->mdfd_chain == NULL)
>>                             /* return NULL or ERROR */
>>               }
>>               v = v->mdfd_chain;
>>       }
>>
>> Do I understand correctly that the case of the "last open segment"
>> (i.e., the one for which mdfd_chain == NULL) not being RELSEG_SIZE
>> (i.e., _mdnblocks(reln, forknum, v) < RELSEG_SIZE) should not call
>> _mfdf_openseg on nextsegno if behaviour is not EXTENSION_CREATE or
>> InRecovery?
>>
>> And that "We won't create segment if not existent" should happen, but
>> doesn't only because the next segment file wasn't removed earlier, so
>> we have to add an extra check for that case?
>>
>> In other words, is something like the following what's needed here, or
>> is there more to it?
>
> It's a bit more complicated than that, but that's the principle. Thanks
> for the patch, and sorry for my late response. See my attached version
> for a more fleshed out version of this.
>
> Looking at the code around this I found:
> * _mdfd_getseg(), in contrast to mdnblocks() doesn't check segment size,
>   neither whether to continue with a too small, nor to error out with a
>   too big segment
> * For, imo pretty bad, reasons in mdsync() we currently rely on
>   EXTENSION_RETURN_NULL to extend the file in recovery, and to set errno
>   to ENOENT. Brrr.
> * we currently FATAL if a segment is too big - I've copied that
>   behaviour, but why not just ERROR out?
>
> The attached patch basically adds the segment size checks to
> _mdfd_getseg(), and doesn't perform extension, even in recovery, if
> EXTENSION_REALLY_RETURN_NULL is passed.
>
> This fixes the issue for me, both in the originally reported variant,
> and in some more rudely setup version (i.e. rm'ing files).

Fixes it for me too.

Thom

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Andres Freund

Дата:

25 апреля 2016 г., 18:56:34

On 2016-04-25 08:55:54 -0400, Robert Haas wrote:
> Andres, this issue has now been open for more than a month, which is
> frankly kind of ridiculous given the schedule we're trying to hit for
> beta.  Do you think it's possible to commit something RSN without
> compromising the quality of PostgreSQL 9.6?

Frankly I'm getting a bit annoyed here too. I posted a patch Friday,
directly after getting back from pgconf.us. Saturday I posted a patch
for an issue which I didn't cause, but which caused issues when testing
the patch in this thread. Sunday I just worked on some small patch - one
you did want to see resolved too.  It's now 8.30 Monday morning.  What's
the point of your message?

Andres

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Robert Haas

Дата:

25 апреля 2016 г., 19:38:46

On Mon, Apr 25, 2016 at 11:56 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-04-25 08:55:54 -0400, Robert Haas wrote:
>> Andres, this issue has now been open for more than a month, which is
>> frankly kind of ridiculous given the schedule we're trying to hit for
>> beta.  Do you think it's possible to commit something RSN without
>> compromising the quality of PostgreSQL 9.6?
>
> Frankly I'm getting a bit annoyed here too. I posted a patch Friday,
> directly after getting back from pgconf.us. Saturday I posted a patch
> for an issue which I didn't cause, but which caused issues when testing
> the patch in this thread. Sunday I just worked on some small patch - one
> you did want to see resolved too.  It's now 8.30 Monday morning.  What's
> the point of your message?

I think that the point of my message is exactly what I said in my
message.  This isn't really about the last couple of days.  The issue
was reported on March 20th.  On March 31st, Noah asked you for a plan
to get it fixed by April 7th.  You never replied.  On April 16th, the
issue not having been fixed, he followed up.  You said that you would
fix it next week.  That week is now over, and we're into the following
week.  We have a patch, and that's good, and I have reviewed it and
Thom has tested it, and that's good, too.  But it is not clear whether
you feel confident to commit it or when you might be planning to do
that, so I asked.  Given that this is the open item of longest tenure
and that we're hoping to ship beta soon, why is that out of line?

Fundamentally, the choices for each open item are (1) get it fixed
before beta, (2) revert the commit that caused it, (3) decide it's OK
to ship beta with that issue, or (4) slip beta.  We initially had a
theory that the commit that caused this issue merely revealed an
underlying problem that had existed before, but I no longer really
think that's the case.  That commit introduced a new way to write to
blocks that might have in the meantime been removed, and it failed to
make that safe.  That's not to say that md.c doesn't do some wonky
things, but the pre-existing code in the upper layers coped with the
wonkiness and your new code doesn't, so in effect it's a regression.
And in fact I think it's a regression that can be expected to be a
significant operational problem for people if not fixed, because the
circumstances in which it can happen are not very obscure.  You just
need to hold some pending flush requests in your backend-local queue
while some other process truncates the relation, and boom.  So I am
disinclined to option #3.  I also do not think that we should slip
beta for an issue that was reported this far in advance of the planned
beta date, so I am disinclined to option #4.  That leaves #1 and #2.
I assume you will be pretty darned unhappy if we end up at #2, so I am
trying to figure out if we can achieve #1.  OK?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Andres Freund

Дата:

25 апреля 2016 г., 21:05:31

Hi,

On 2016-04-25 12:38:36 -0400, Robert Haas wrote:
> I think that the point of my message is exactly what I said in my
> message.  This isn't really about the last couple of days.  The issue
> was reported on March 20th.  On March 31st, Noah asked you for a plan
> to get it fixed by April 7th.  You never replied.  On April 16th, the
> issue not having been fixed, he followed up.  You said that you would
> fix it next week.  That week is now over, and we're into the following
> week.

Well, I posted a patch. I'd have applied it too (after addressing your
comments obviously), except that there's some interdependencies with the
nsmg > 0 thread (some of my tests fail spuriously without that
fixed). Early last week I waited for a patch on that thread, but when
that didn't materialize by Friday I switched to work on that [1].  With
both fixes applied I can't reproduce any problems anymore.

About the delay: Sure, it'd be nicer if I'd addressed this
immediately. But priority-wise it's an issue that's damned hard to hit,
and where the worst known consequence is having to reconnect; that's not
earth shattering. So spending time to work on the, imo more severe
performance regressions, seemed to be more important; maybe I was wrong
in priorizing things that way.

> We have a patch, and that's good, and I have reviewed it and
> Thom has tested it, and that's good, too.  But it is not clear whether
> you feel confident to commit it or when you might be planning to do
> that, so I asked.  Given that this is the open item of longest tenure
> and that we're hoping to ship beta soon, why is that out of line?

Well, if you'd asked it that way, then I'd not responded the way I have.

> We initially had a theory that the commit that caused this issue
> merely revealed an underlying problem that had existed before, but I
> no longer really think that's the case.

I do think there's some lingering problems (avoiding a FATAL by choosing
an index scan instead of a seqscan, the problem afaics can transiently
occur in HS, any corrupted index can trigger it ...) - but I agree it's
not really been a big problem so far.

> That commit introduced a new way to write to blocks that might have in
> the meantime been removed, and it failed to make that safe.

There's no writing of any blocks involved - the problem is just about
opening segments which might or might not exist.

> And in fact I think it's a regression that can be
> expected to be a significant operational problem for people if not
> fixed, because the circumstances in which it can happen are not very
> obscure.  You just need to hold some pending flush requests in your
> backend-local queue while some other process truncates the relation,
> and boom.

I found it quite hard to come up with scenarios to reproduce it. Entire
segments have to be dropped, the writes to the to-be-dropped segment
have to result in fully dead rows, only few further writes outside the
dropped can happen, invalidations may not fix the problem up.  But
obviously it should be fixed.

> I assume you will be pretty darned unhappy if we end up at #2, so I am
> trying to figure out if we can achieve #1.  OK?

Ok.

[1] http://archives.postgresql.org/message-id/20160424025117.2cmf6ku4ldfcoo44%40alap3.anarazel.de

Andres

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Fabien COELHO

Дата:

25 апреля 2016 г., 21:53:40

Hello Andres,

> Fixes it for me too.

Yep, for me too.

I'm not at ease with the part of the API the code is dealing with, so I do 
not feel like a competent reviewer. I agree with the "more comments" 
suggested by Robert.

I have just a small naming point:
  /* ereport if segment not present, create in recovery */  EXTENSION_FAIL,  /* return NULL if not present, create in
recovery*/  EXTENSION_RETURN_NULL,  /* return NULL if not present */  EXTENSION_REALLY_RETURN_NULL,  /* create new
segmentsas needed */  EXTENSION_CREATE
 

The comments seem pretty clear, but the naming of these options are more 
behavioral than functional somehow (or the reverse?), especially the 
RETURN_NULL and REALLY_RETURN_NULL names seemed pretty contrived to me.

There is one context: whether it is in recovery. There are 3 possible 
behaviors: whether to error or ignore or create if segment does not 
exist.

In recovery it is always create if asked for it must be made available, 
maybe it does not exists because of the crash...

If I understand correctly, with flushes kept a long time before being 
processed, there is a new state which is "ignore whatever", we do not want 
to create a segment for flushing non existing data.

So I would suggest it would be better to name the options closer to the 
comments above, something like:
            normal / in recovery:  EXTENSION_ERROR_OR_IR_CREATE  EXTENSION_IGNORE_OR_IR_CREATE  EXTENSION_IGNORE
EXTENSION_CREATE

Now, maybe that is too late to try to find a better name for these 
options now:-)

-- 
Fabien.

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Robert Haas

Дата:

25 апреля 2016 г., 23:29:41

On Mon, Apr 25, 2016 at 2:05 PM, Andres Freund <andres@anarazel.de> wrote:
> Well, I posted a patch. I'd have applied it too (after addressing your
> comments obviously), except that there's some interdependencies with the
> nsmg > 0 thread (some of my tests fail spuriously without that
> fixed). Early last week I waited for a patch on that thread, but when
> that didn't materialize by Friday I switched to work on that [1].  With
> both fixes applied I can't reproduce any problems anymore.

OK.  What are the interdependencies?  You've said that a few times but
I am not clear on what the nature of those interdependencies is.

> About the delay: Sure, it'd be nicer if I'd addressed this
> immediately. But priority-wise it's an issue that's damned hard to hit,
> and where the worst known consequence is having to reconnect; that's not
> earth shattering. So spending time to work on the, imo more severe
> performance regressions, seemed to be more important; maybe I was wrong
> in priorizing things that way.

Do you mean performance regression related to this patch, or to other
patches like the atomic buffer pin/unpin stuff?

>> We have a patch, and that's good, and I have reviewed it and
>> Thom has tested it, and that's good, too.  But it is not clear whether
>> you feel confident to commit it or when you might be planning to do
>> that, so I asked.  Given that this is the open item of longest tenure
>> and that we're hoping to ship beta soon, why is that out of line?
>
> Well, if you'd asked it that way, then I'd not responded the way I have.

Fair enough, but keep in mind that a few more mildly-phrased emails
from Noah didn't actually get us to the point where this is fixed, or
even to a clear explanation of when it might be fixed or why it wasn't
getting fixed sooner.

>> That commit introduced a new way to write to blocks that might have in
>> the meantime been removed, and it failed to make that safe.
>
> There's no writing of any blocks involved - the problem is just about
> opening segments which might or might not exist.

As I understand it, that commit introduced a backend-private queue;
when it fills, we issue flush requests for particular blocks.  By the
time the flush requests are issued, the blocks might have been
truncated away.  So it's not writing blocks in the sense of write(),
but persuading the OS to write those blocks to stable storage.

>> And in fact I think it's a regression that can be
>> expected to be a significant operational problem for people if not
>> fixed, because the circumstances in which it can happen are not very
>> obscure.  You just need to hold some pending flush requests in your
>> backend-local queue while some other process truncates the relation,
>> and boom.
>
> I found it quite hard to come up with scenarios to reproduce it. Entire
> segments have to be dropped, the writes to the to-be-dropped segment
> have to result in fully dead rows, only few further writes outside the
> dropped can happen, invalidations may not fix the problem up.  But
> obviously it should be fixed.

It's probably a bit tricky to produce a workload where the
truncated-away block is in some backend's private queue, because the
queues are very small and it's not exactly clear what sort of workload
will produce regular relation truncations, and you need both of those
things to hit this. However, I think it's pretty optimistic to think
that there won't be people in that situation, partly because we had a
customer find a bug in an EDB proprietary feature  a very similar
whose profile is very similar to this feature less than 2 years ago.

>> I assume you will be pretty darned unhappy if we end up at #2, so I am
>> trying to figure out if we can achieve #1.  OK?
>
> Ok.

I'm still not clear on the answer to this.  I think the answer is that
you think that this patch is OK to commit with some editing, but the
situation with the nmsgs > 0 patch is not so clear yet because it
hasn't had any review yet.  And they depend on each other somehow.  Am
I close?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Andres Freund

Дата:

25 апреля 2016 г., 23:57:25

On 2016-04-25 16:29:36 -0400, Robert Haas wrote:
> On Mon, Apr 25, 2016 at 2:05 PM, Andres Freund <andres@anarazel.de> wrote:
> > Well, I posted a patch. I'd have applied it too (after addressing your
> > comments obviously), except that there's some interdependencies with the
> > nsmg > 0 thread (some of my tests fail spuriously without that
> > fixed). Early last week I waited for a patch on that thread, but when
> > that didn't materialize by Friday I switched to work on that [1].  With
> > both fixes applied I can't reproduce any problems anymore.
> 
> OK.  What are the interdependencies?  You've said that a few times but
> I am not clear on what the nature of those interdependencies is.

I added checks to smgr/md.c that verify that the smgr state is
consistent with the on-file state. Without the two issues in [1] fixed,
these tests fail in a standby, while running regression tests.  Adding
those tests made me notice a bug in an unreleased version of the patch,
so it seems they're worthwhile to run.


> > About the delay: Sure, it'd be nicer if I'd addressed this
> > immediately. But priority-wise it's an issue that's damned hard to hit,
> > and where the worst known consequence is having to reconnect; that's not
> > earth shattering. So spending time to work on the, imo more severe
> > performance regressions, seemed to be more important; maybe I was wrong
> > in priorizing things that way.
> 
> Do you mean performance regression related to this patch, or to other
> patches like the atomic buffer pin/unpin stuff?

Other patches, I can't see this having measurable impact.


> >> We have a patch, and that's good, and I have reviewed it and
> >> Thom has tested it, and that's good, too.  But it is not clear whether
> >> you feel confident to commit it or when you might be planning to do
> >> that, so I asked.  Given that this is the open item of longest tenure
> >> and that we're hoping to ship beta soon, why is that out of line?
> >
> > Well, if you'd asked it that way, then I'd not responded the way I have.
> 
> Fair enough, but keep in mind that a few more mildly-phrased emails
> from Noah didn't actually get us to the point where this is fixed, or
> even to a clear explanation of when it might be fixed or why it wasn't
> getting fixed sooner.

Hrmpf. I'd initially missed the first email in the onslought, and the
second email I responded to and posted a patch in the timeframe I'd
promised. I didn't forsee, after we'd initially dismissed a correlation,
that there'd be a connection to the other patch.


> >> That commit introduced a new way to write to blocks that might have in
> >> the meantime been removed, and it failed to make that safe.
> >
> > There's no writing of any blocks involved - the problem is just about
> > opening segments which might or might not exist.
> 
> As I understand it, that commit introduced a backend-private queue;
> when it fills, we issue flush requests for particular blocks.  By the
> time the flush requests are issued, the blocks might have been
> truncated away.  So it's not writing blocks in the sense of write(),
> but persuading the OS to write those blocks to stable storage.

Right.


> >> And in fact I think it's a regression that can be
> >> expected to be a significant operational problem for people if not
> >> fixed, because the circumstances in which it can happen are not very
> >> obscure.  You just need to hold some pending flush requests in your
> >> backend-local queue while some other process truncates the relation,
> >> and boom.
> >
> > I found it quite hard to come up with scenarios to reproduce it. Entire
> > segments have to be dropped, the writes to the to-be-dropped segment
> > have to result in fully dead rows, only few further writes outside the
> > dropped can happen, invalidations may not fix the problem up.  But
> > obviously it should be fixed.
> 
> It's probably a bit tricky to produce a workload where the
> truncated-away block is in some backend's private queue, because the
> queues are very small and it's not exactly clear what sort of workload
> will produce regular relation truncations, and you need both of those
> things to hit this. However, I think it's pretty optimistic to think
> that there won't be people in that situation, partly because we had a
> customer find a bug in an EDB proprietary feature  a very similar
> whose profile is very similar to this feature less than 2 years ago.

I'm obviously not arguing that we shouldn't fix this.


> >> I assume you will be pretty darned unhappy if we end up at #2, so I am
> >> trying to figure out if we can achieve #1.  OK?
> >
> > Ok.
> 
> I'm still not clear on the answer to this.  I think the answer is that
> you think that this patch is OK to commit with some editing, but the
> situation with the nmsgs > 0 patch is not so clear yet because it
> hasn't had any review yet.  And they depend on each other somehow.  Am
> I close?

You are. I'd prefer pushing this after fixes for the two invalidation
issues, because it makes testing easier. But I guess the timeframe there
is unclear enough, that it makes sense to test this patch on top of the
the preliminary fixes, and then push without those.

- Andres

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Robert Haas

Дата:

27 апреля 2016 г., 00:25:24

On Mon, Apr 25, 2016 at 4:57 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-04-25 16:29:36 -0400, Robert Haas wrote:
>> On Mon, Apr 25, 2016 at 2:05 PM, Andres Freund <andres@anarazel.de> wrote:
>> > Well, I posted a patch. I'd have applied it too (after addressing your
>> > comments obviously), except that there's some interdependencies with the
>> > nsmg > 0 thread (some of my tests fail spuriously without that
>> > fixed). Early last week I waited for a patch on that thread, but when
>> > that didn't materialize by Friday I switched to work on that [1].  With
>> > both fixes applied I can't reproduce any problems anymore.
>>
>> OK.  What are the interdependencies?  You've said that a few times but
>> I am not clear on what the nature of those interdependencies is.
>
> I added checks to smgr/md.c that verify that the smgr state is
> consistent with the on-file state. Without the two issues in [1] fixed,
> these tests fail in a standby, while running regression tests.  Adding
> those tests made me notice a bug in an unreleased version of the patch,
> so it seems they're worthwhile to run.

Footnote [1] is used, but not defined.

>> >> I assume you will be pretty darned unhappy if we end up at #2, so I am
>> >> trying to figure out if we can achieve #1.  OK?
>> >
>> > Ok.
>>
>> I'm still not clear on the answer to this.  I think the answer is that
>> you think that this patch is OK to commit with some editing, but the
>> situation with the nmsgs > 0 patch is not so clear yet because it
>> hasn't had any review yet.  And they depend on each other somehow.  Am
>> I close?
>
> You are. I'd prefer pushing this after fixes for the two invalidation
> issues, because it makes testing easier. But I guess the timeframe there
> is unclear enough, that it makes sense to test this patch on top of the
> the preliminary fixes, and then push without those.

I think it makes sense to go ahead and push this fix rather soon.  I'd
much rather not have this bug, even if verifying that I don't have it
is a little harder than it might be under other circumstances.  The
fix could be buggy too, and if that fix doesn't go in until right
before we ship beta1, we're less likely to be able to find and correct
any deficiencies before that goes out the door.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Andres Freund

Дата:

27 апреля 2016 г., 00:32:15

On 2016-04-26 17:25:18 -0400, Robert Haas wrote:
> On Mon, Apr 25, 2016 at 4:57 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2016-04-25 16:29:36 -0400, Robert Haas wrote:
> >> On Mon, Apr 25, 2016 at 2:05 PM, Andres Freund <andres@anarazel.de> wrote:
> >> > Well, I posted a patch. I'd have applied it too (after addressing your
> >> > comments obviously), except that there's some interdependencies with the
> >> > nsmg > 0 thread (some of my tests fail spuriously without that
> >> > fixed). Early last week I waited for a patch on that thread, but when
> >> > that didn't materialize by Friday I switched to work on that [1].  With
> >> > both fixes applied I can't reproduce any problems anymore.
> >>
> >> OK.  What are the interdependencies?  You've said that a few times but
> >> I am not clear on what the nature of those interdependencies is.
> >
> > I added checks to smgr/md.c that verify that the smgr state is
> > consistent with the on-file state. Without the two issues in [1] fixed,
> > these tests fail in a standby, while running regression tests.  Adding
> > those tests made me notice a bug in an unreleased version of the patch,
> > so it seems they're worthwhile to run.
>
> Footnote [1] is used, but not defined.

Oops, it's the thread you replied too..

> I think it makes sense to go ahead and push this fix rather soon.

Will do.

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Andres Freund

Дата:

27 апреля 2016 г., 06:36:01

Hi,

On 2016-04-25 20:53:26 +0200, Fabien COELHO wrote:
> I have just a small naming point:
> 
>   /* ereport if segment not present, create in recovery */
>   EXTENSION_FAIL,
>   /* return NULL if not present, create in recovery */
>   EXTENSION_RETURN_NULL,
>   /* return NULL if not present */
>   EXTENSION_REALLY_RETURN_NULL,
>   /* create new segments as needed */
>   EXTENSION_CREATE
> 
> The comments seem pretty clear, but the naming of these options are more
> behavioral than functional somehow (or the reverse?), especially the
> RETURN_NULL and REALLY_RETURN_NULL names seemed pretty contrived to me.

I tend to agree. But "fixing" that would mean changing quite some
additional pieces of code, more than I want to do in a bugfix. I also
think it's not *that* confusing...

Thanks for looking.

Andres

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Thom Brown

Дата:

28 апреля 2016 г., 19:42:08

On 26 April 2016 at 22:32, Andres Freund <andres@anarazel.de> wrote:
> On 2016-04-26 17:25:18 -0400, Robert Haas wrote:
>> On Mon, Apr 25, 2016 at 4:57 PM, Andres Freund <andres@anarazel.de> wrote:
>> > On 2016-04-25 16:29:36 -0400, Robert Haas wrote:
>> >> On Mon, Apr 25, 2016 at 2:05 PM, Andres Freund <andres@anarazel.de> wrote:
>> >> > Well, I posted a patch. I'd have applied it too (after addressing your
>> >> > comments obviously), except that there's some interdependencies with the
>> >> > nsmg > 0 thread (some of my tests fail spuriously without that
>> >> > fixed). Early last week I waited for a patch on that thread, but when
>> >> > that didn't materialize by Friday I switched to work on that [1]. With
>> >> > both fixes applied I can't reproduce any problems anymore.
>> >>
>> >> OK. What are the interdependencies? You've said that a few times but
>> >> I am not clear on what the nature of those interdependencies is.
>> >
>> > I added checks to smgr/md.c that verify that the smgr state is
>> > consistent with the on-file state. Without the two issues in [1] fixed,
>> > these tests fail in a standby, while running regression tests. Adding
>> > those tests made me notice a bug in an unreleased version of the patch,
>> > so it seems they're worthwhile to run.
>>
>> Footnote [1] is used, but not defined.
>
> Oops, it's the thread you replied too..
>
>> I think it makes sense to go ahead and push this fix rather soon.
>
> Will do.

I've noticed another breakage, which I can reproduce consistently.

createdb pgbench
pgbench -i -s 100 --unlogged-tables pgbench
psql -f pgbench_partitions.sql pgbench
vacuumdb -z pgbench
createdb test

Which produces:

createdb: database creation failed: ERROR: checkpoint request failed
HINT: Consult recent messages in the server log for details.

The log shows:

2016-04-28 17:36:00 BST [18605]: [1-1] user=thom,db=postgres,client=[local] DEBUG: postgres child[18605]: starting with (
2016-04-28 17:36:00 BST [18605]: [2-1] user=thom,db=postgres,client=[local] DEBUG: postgres
2016-04-28 17:36:00 BST [18605]: [3-1] user=thom,db=postgres,client=[local] DEBUG: )
2016-04-28 17:36:00 BST [18605]: [4-1] user=thom,db=postgres,client=[local] DEBUG: InitPostgres
2016-04-28 17:36:00 BST [18605]: [5-1] user=thom,db=postgres,client=[local] DEBUG: StartTransaction
2016-04-28 17:36:00 BST [18605]: [6-1] user=thom,db=postgres,client=[local] DEBUG: name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0, nest
lvl: 1, children:
2016-04-28 17:36:00 BST [18605]: [7-1] user=thom,db=postgres,client=[local] DEBUG: CommitTransaction
2016-04-28 17:36:00 BST [18605]: [8-1] user=thom,db=postgres,client=[local] DEBUG: name: unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0, nest
lvl: 1, children:
2016-04-28 17:36:00 BST [18605]: [9-1] user=thom,db=postgres,client=[local] DEBUG: StartTransactionCommand
2016-04-28 17:36:00 BST [18605]: [10-1] user=thom,db=postgres,client=[local] STATEMENT: CREATE DATABASE test;
2016-04-28 17:36:00 BST [18605]: [11-1] user=thom,db=postgres,client=[local] DEBUG: StartTransaction
2016-04-28 17:36:00 BST [18605]: [12-1] user=thom,db=postgres,client=[local] STATEMENT: CREATE DATABASE test;
2016-04-28 17:36:00 BST [18605]: [13-1] user=thom,db=postgres,client=[local] DEBUG: name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0, nes
tlvl: 1, children:
2016-04-28 17:36:00 BST [18605]: [14-1] user=thom,db=postgres,client=[local] STATEMENT: CREATE DATABASE test;
2016-04-28 17:36:00 BST [18605]: [15-1] user=thom,db=postgres,client=[local] DEBUG: ProcessUtility
2016-04-28 17:36:00 BST [18605]: [16-1] user=thom,db=postgres,client=[local] STATEMENT: CREATE DATABASE test;
2016-04-28 17:36:00 BST [18108]: [46-1] user=,db=,client= DEBUG: performing replication slot checkpoint
2016-04-28 17:36:00 BST [18105]: [158-1] user=,db=,client= DEBUG: server process (PID 18582) exited with exit code 0
2016-04-28 17:36:08 BST [18108]: [47-1] user=,db=,client= DEBUG: could not fsync file "base/24581/24594.1" but retrying: No such file or directory
2016-04-28 17:36:08 BST [18108]: [48-1] user=,db=,client= ERROR: could not fsync file "base/24581/24594.1": No such file or directory
2016-04-28 17:36:08 BST [18605]: [17-1] user=thom,db=postgres,client=[local] ERROR: checkpoint request failed
2016-04-28 17:36:08 BST [18605]: [18-1] user=thom,db=postgres,client=[local] HINT: Consult recent messages in the server log for details.
2016-04-28 17:36:08 BST [18605]: [19-1] user=thom,db=postgres,client=[local] STATEMENT: CREATE DATABASE test;
2016-04-28 17:36:08 BST [18605]: [20-1] user=thom,db=postgres,client=[local] DEBUG: shmem_exit(0): 1 before_shmem_exit callbacks to make
2016-04-28 17:36:08 BST [18605]: [21-1] user=thom,db=postgres,client=[local] DEBUG: shmem_exit(0): 6 on_shmem_exit callbacks to make
2016-04-28 17:36:08 BST [18605]: [22-1] user=thom,db=postgres,client=[local] DEBUG: proc_exit(0): 3 callbacks to make
2016-04-28 17:36:08 BST [18605]: [23-1] user=thom,db=postgres,client=[local] DEBUG: exit(0)
2016-04-28 17:36:08 BST [18605]: [24-1] user=thom,db=postgres,client=[local] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2016-04-28 17:36:08 BST [18605]: [25-1] user=thom,db=postgres,client=[local] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2016-04-28 17:36:08 BST [18605]: [26-1] user=thom,db=postgres,client=[local] DEBUG: proc_exit(-1): 0 callbacks to make
2016-04-28 17:36:08 BST [18105]: [159-1] user=,db=,client= DEBUG: server process (PID 18605) exited with exit code 0
2016-04-28 17:36:09 BST [18108]: [49-1] user=,db=,client= DEBUG: checkpointer updated shared memory configuration values

Relfilenode 24594 corresponds to the empty pgbench_accounts parent table.

Thom

Вложения

pgbench_partitions.sql

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Andres Freund

Дата:

28 апреля 2016 г., 19:49:54

On 2016-04-28 17:41:29 +0100, Thom Brown wrote:
> I've noticed another breakage, which I can reproduce consistently.
> 
> createdb pgbench
> pgbench -i -s 100 --unlogged-tables pgbench
> psql -f pgbench_partitions.sql pgbench
> vacuumdb -z pgbench
> createdb test
> 
> Which produces:
> 
> createdb: database creation failed: ERROR:  checkpoint request failed
> HINT:  Consult recent messages in the server log for details.

Interesting. Any chance you could verify if this actually related to
either the fix or the commit that was to blame for the previous issue?

Because

> user=thom,db=postgres,client=[local] STATEMENT:  CREATE DATABASE test;
> 2016-04-28 17:36:00 BST [18108]: [46-1] user=,db=,client= DEBUG:
>  performing replication slot checkpoint
> 2016-04-28 17:36:00 BST [18105]: [158-1] user=,db=,client= DEBUG:  server
> process (PID 18582) exited with exit code 0
> 2016-04-28 17:36:08 BST [18108]: [47-1] user=,db=,client= DEBUG:  could not
> fsync file "base/24581/24594.1" but retrying: No such file or directory
> 2016-04-28 17:36:08 BST [18108]: [48-1] user=,db=,client= ERROR:  could not
> fsync file "base/24581/24594.1": No such file or directory
> 2016-04-28 17:36:08 BST [18605]: [17-1]
> user=thom,db=postgres,client=[local] ERROR:  checkpoint request failed
> 2016-04-28 17:36:08 BST [18605]: [18-1]
> user=thom,db=postgres,client=[local] HINT:  Consult recent messages in the
> server log for details.
> 2016-04-28 17:36:08 BST [18605]: [19-1]

Doesn't really sound like it's necessarily related to the previous problem.

- Andres

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Fabien COELHO

Дата:

28 апреля 2016 г., 22:37:13

> createdb: database creation failed: ERROR:  checkpoint request failed
> HINT:  Consult recent messages in the server log for details.

Attached is a simpler/faster version, based on the previous script, I just 
added a CHECKPOINT after the EXPLAIN ANALYSE.

It fails on head with:
 HINT:  Consider increasing the configuration parameter "max_wal_size". ERROR:  canceling autovacuum task CONTEXT:
automaticvacuum of table "fabien.public.pgbench_accounts" ERROR:  canceling autovacuum task CONTEXT:  automatic analyze
oftable "fabien.public.pgbench_accounts_1" ERROR:  canceling autovacuum task CONTEXT:  automatic analyze of table
"fabien.public.pgbench_accounts_2"ERROR:  could not fsync file "base/16384/16397.1": No such file or directory ERROR:
checkpointrequest failed HINT:  Consult recent messages in the server log for details. STATEMENT:  CHECKPOINT;
 

I tried it on pg prior to the flushing patches and it did not fail, so it 
seems to be related somehow. Probably the fix is similar, just some 
conditions to check when dealing with a file which has been removed by a 
large DELETE.

-- 
Fabien.

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Fabien COELHO

Дата:

29 апреля 2016 г., 10:22:37

An interesting complement about the previous failing test:

Although I disabled the "flushing" feature...
  sh> grep flush_after xxx/postgresql.conf  bgwriter_flush_after = 0                # 0 disables,  backend_flush_after
=0         # 0 disables,  wal_writer_flush_after = 0              # 0 disables  checkpoint_flush_after = 0
#0 disables,
 

the test still fails with "head":
  ...  ERROR:  could not fsync file "base/16384/16397.1": No such file or directory  ERROR:  checkpoint request failed
HINT: Consult recent messages in the server log for details.  STATEMENT:  CHECKPOINT;
 

So this seem to suggest that there is an underlying issue independent from 
the flushing file.

-- 
Fabien.

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Andres Freund

Дата:

30 апреля 2016 г., 02:58:45

On 2016-04-28 17:41:29 +0100, Thom Brown wrote:
> I've noticed another breakage, which I can reproduce consistently.

> 2016-04-28 17:36:08 BST [18108]: [47-1] user=,db=,client= DEBUG:  could not
> fsync file "base/24581/24594.1" but retrying: No such file or directory
> 2016-04-28 17:36:08 BST [18108]: [48-1] user=,db=,client= ERROR:  could not
> fsync file "base/24581/24594.1": No such file or directory
> 2016-04-28 17:36:08 BST [18605]: [17-1]
> user=thom,db=postgres,client=[local] ERROR:  checkpoint request failed
> 2016-04-28 17:36:08 BST [18605]: [18-1]
> user=thom,db=postgres,client=[local] HINT:  Consult recent messages in the
> server log for details.

Yuck. md.c is so crummy :(

Basically the reason for the problem is that mdsync() needs to access
"formally non-existant segments" (as in ones where previous segments are
< RELSEG_SIZE), because we queue (and the might be preexistant) fsync
requests via register_dirty_segment() in mdtruncate().

I'm a bit of a loss of how to reconcile that view with the original
issue in this thread.  The best I can come up with this moment is doing
a _mdfd_openseg() in mdsync() to open the truncated segment if
_mdfd_getseg() returned NULL. We don't want to normally use that in
either function because it'll imply a separate open() etc, which is
pretty expensive - but doing in the fallback case would be kind of ok.

Andres

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Robert Haas

Дата:

02 мая 2016 г., 19:29:51

On Fri, Apr 29, 2016 at 7:58 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-04-28 17:41:29 +0100, Thom Brown wrote:
>> I've noticed another breakage, which I can reproduce consistently.
>
>> 2016-04-28 17:36:08 BST [18108]: [47-1] user=,db=,client= DEBUG:  could not
>> fsync file "base/24581/24594.1" but retrying: No such file or directory
>> 2016-04-28 17:36:08 BST [18108]: [48-1] user=,db=,client= ERROR:  could not
>> fsync file "base/24581/24594.1": No such file or directory
>> 2016-04-28 17:36:08 BST [18605]: [17-1]
>> user=thom,db=postgres,client=[local] ERROR:  checkpoint request failed
>> 2016-04-28 17:36:08 BST [18605]: [18-1]
>> user=thom,db=postgres,client=[local] HINT:  Consult recent messages in the
>> server log for details.
>
> Yuck. md.c is so crummy :(
>
> Basically the reason for the problem is that mdsync() needs to access
> "formally non-existant segments" (as in ones where previous segments are
> < RELSEG_SIZE), because we queue (and the might be preexistant) fsync
> requests via register_dirty_segment() in mdtruncate().

Shouldn't we just throw those flush requests away?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Andres Freund

Дата:

02 мая 2016 г., 19:41:19

On 2016-05-02 12:29:45 -0400, Robert Haas wrote:
> On Fri, Apr 29, 2016 at 7:58 PM, Andres Freund <andres@anarazel.de> wrote:
> > Basically the reason for the problem is that mdsync() needs to access
> > "formally non-existant segments" (as in ones where previous segments are
> > < RELSEG_SIZE), because we queue (and the might be preexistant) fsync
> > requests via register_dirty_segment() in mdtruncate().
> 
> Shouldn't we just throw those flush requests away?

Well, we explicity make them for truncations (register_dirty_segment()
calls in mdtruncate()).  There's no comment as to why - I suspect the
idea is that you want to make sure the truncation sticks in case of
crash?

FWIW, falling back to _mdfd_openseg() fixes the issue.

Andres

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Robert Haas

Дата:

02 мая 2016 г., 19:45:00

On Mon, May 2, 2016 at 12:41 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-05-02 12:29:45 -0400, Robert Haas wrote:
>> On Fri, Apr 29, 2016 at 7:58 PM, Andres Freund <andres@anarazel.de> wrote:
>> > Basically the reason for the problem is that mdsync() needs to access
>> > "formally non-existant segments" (as in ones where previous segments are
>> > < RELSEG_SIZE), because we queue (and the might be preexistant) fsync
>> > requests via register_dirty_segment() in mdtruncate().
>>
>> Shouldn't we just throw those flush requests away?
>
> Well, we explicity make them for truncations (register_dirty_segment()
> calls in mdtruncate()).  There's no comment as to why - I suspect the
> idea is that you want to make sure the truncation sticks in case of
> crash?

I dunno, I don't understand this well enough yet.

> FWIW, falling back to _mdfd_openseg() fixes the issue.

Can you post a patch?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Andres Freund

Дата:

02 мая 2016 г., 19:54:43

On 2016-05-02 12:44:53 -0400, Robert Haas wrote:
> On Mon, May 2, 2016 at 12:41 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2016-05-02 12:29:45 -0400, Robert Haas wrote:
> >> On Fri, Apr 29, 2016 at 7:58 PM, Andres Freund <andres@anarazel.de> wrote:
> >> > Basically the reason for the problem is that mdsync() needs to access
> >> > "formally non-existant segments" (as in ones where previous segments are
> >> > < RELSEG_SIZE), because we queue (and the might be preexistant) fsync
> >> > requests via register_dirty_segment() in mdtruncate().
> >>
> >> Shouldn't we just throw those flush requests away?
> >
> > Well, we explicity make them for truncations (register_dirty_segment()
> > calls in mdtruncate()).  There's no comment as to why - I suspect the
> > idea is that you want to make sure the truncation sticks in case of
> > crash?
>
> I dunno, I don't understand this well enough yet.
>
> > FWIW, falling back to _mdfd_openseg() fixes the issue.
>
> Can you post a patch?

Sure, attached.


I'm not sure this is the best way to go about this.  I can see valid
arguments for *always* using _mdfd_openseg() in mdsync(); and I'm
wondering whether we shouldn't make EXTENSION_* into a bitmask
(extend,extend_recovery,return_null,open_deleted).

Andres

Вложения

wip_mdsync.patch

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Fabien COELHO

Дата:

02 мая 2016 г., 23:00:28

Hello Andres,

> Sure, attached.

For what it is worth, it works for me on head.

I'm wondering that if _mdfd_openseg may return NULL, then ISTM that 
"opened_directly" should logically be false, because it was not opened?

If so, maybe consider something like:
    opened_directy = (seg != NULL);

Now it does not change the result because the later code seems garded 
against a NULL seg, but it does not look right to have a boolean to say 
that a segment was opened if it was not indeed the case.

Given the comments, I understand that the truncation implementation is a 
shortcut with a sting, as a lot of functions must then take into account 
that something unusual may have happen somewhere and deal with it...

Also, I do not understand why this issue is raised by the flushing patch, 
it seems rather independent.

> I'm not sure this is the best way to go about this.  I can see valid
> arguments for *always* using _mdfd_openseg() in mdsync(); and I'm
> wondering whether we shouldn't make EXTENSION_* into a bitmask
> (extend,extend_recovery,return_null,open_deleted).

I thought about that when I looked at the previous fix, but it seemed that 
not all combinations made sense.

-- 
Fabien.

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Andres Freund

Дата:

02 мая 2016 г., 23:36:41

On 2016-05-02 22:00:14 +0200, Fabien COELHO wrote:
> I'm wondering that if _mdfd_openseg may return NULL, then ISTM that
> "opened_directly" should logically be false, because it was not opened?
>
> If so, maybe consider something like:
>
>     opened_directy = (seg != NULL);

Hm, don't care either way. Seems just as valid to understand it as the
attempt to directly open the segment.


> Also, I do not understand why this issue is raised by the flushing patch, it
> seems rather independent.

It's not the flushing itself, it's 72a98a639574d2e25ed94652848555900c81a799


> >I'm not sure this is the best way to go about this.  I can see valid
> >arguments for *always* using _mdfd_openseg() in mdsync(); and I'm
> >wondering whether we shouldn't make EXTENSION_* into a bitmask
> >(extend,extend_recovery,return_null,open_deleted).
>
> I thought about that when I looked at the previous fix, but it seemed that
> not all combinations made sense.

Sure, but that's nothing unusual.  Here's an attempt at doing so - not
fully polished, just as a discussion point. I think it looks better.
Fabien, Robert, what do you think?

Greetings,

Andres Freund

Вложения

wip_mdsync_2.patch

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Fabien COELHO

Дата:

03 мая 2016 г., 01:05:52

Hello Andres,

>>> I'm not sure this is the best way to go about this.  I can see valid
>>> arguments for *always* using _mdfd_openseg() in mdsync(); and I'm
>>> wondering whether we shouldn't make EXTENSION_* into a bitmask
>>> (extend,extend_recovery,return_null,open_deleted).
>>
>> I thought about that when I looked at the previous fix, but it seemed that
>> not all combinations made sense.
>
> Sure, but that's nothing unusual.  Here's an attempt at doing so - not
> fully polished, just as a discussion point. I think it looks better.
> Fabien, Robert, what do you think?

My 0,02€.

Not tested, just a few comments on the patch from someone which does not 
understand this API deep down... Nevertheless:

I agree that it is looks better than "EXTENSION_REALLY_RETURNS_NULL", that 
I did not like much.

There are 3 possible behaviors on extension, but coding them as bits does 
not make their exclusivity clear. Now mixing numbers & bits does not seem 
advisable either.

Maybe consider checking for the exclusivity explicitely?
  EXTENSION_BEHAVIORS = (EXTENSION_RETURN_NULL | ..._FAIL | ..._CREATE);

And then the Assert can check for the exclusivity:
  int behavior = option & EXTENSION_BEHAVIORS;  Assert( (behavior == EXTENSION_RETURN_NULL) ||          (behavior ==
..._FAIL)||          (behavior == ..._CREATE));
 

I'm unsure about switching enum to #define, could be an enum still with 
explicit values set, something like:
  enum {    EXTENSION_RETURN_NULL = (1 << 0),    ...  } extension_behavior;

I'm fuzzy about the _OPEN_DELETED part because it is an oxymoron. Is it 
RECREATE really?

-- 
Fabien.

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Andres Freund

Дата:

03 мая 2016 г., 01:14:30

Hi,

On 2016-05-03 00:05:35 +0200, Fabien COELHO wrote:
> Maybe consider checking for the exclusivity explicitely?

I thought about it, and decided it's not worth it.  Requiring one of
those to be specified seems stringent enough.

> I'm unsure about switching enum to #define, could be an enum still with
> explicit values set, something like:
> 
>   enum {
>     EXTENSION_RETURN_NULL = (1 << 0),
>     ...
>   } extension_behavior;

An enum doesn't have a benefit for a bitmask imo - you can't "legally"
use it as a type for functions accepting the bitmask.

> I'm fuzzy about the _OPEN_DELETED part because it is an oxymoron. Is it
> RECREATE really?

No. The relevant explanation is at the top of the file:*    On disk, a relation must consist of consecutively numbered
segment*   files in the pattern*        -- Zero or more full segments of exactly RELSEG_SIZE blocks each*        --
Exactlyone partial segment of size 0 <= size < RELSEG_SIZE blocks*        -- Optionally, any number of inactive
segmentsof size 0 blocks.*    The full and partial segments are collectively the "active" segments.*    Inactive
segmentsare those that once contained data but are currently*    not needed because of an mdtruncate() operation.  The
reasonfor leaving*    them present at size zero, rather than unlinking them, is that other*    backends and/or the
checkpointermight be holding open file references to*    such segments.  If the relation expands again after
mdtruncate(),such*    that a deactivated segment becomes active again, it is important that*    such file references
stillbe valid --- else data might get written*    out to an unlinked old copy of a segment file that will eventually*
disappear.

- Andres

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Fabien COELHO

Дата:

03 мая 2016 г., 08:25:04

Hello,

>> I'm unsure about switching enum to #define, could be an enum still with
>> explicit values set, something like:
>
> An enum doesn't have a benefit for a bitmask imo - you can't "legally"
> use it as a type for functions accepting the bitmask.

I do not understand. I suggested to use enum to enumerate the bitmask 
constants, ISTM that it does not preclude to use it as a bitmask as you 
do, it is just a replacement of the #define? The type constraint on the 
enum does not disallow bitmasking values, I checked with both gcc & clang.

>> I'm fuzzy about the _OPEN_DELETED part because it is an oxymoron. Is it
>> RECREATE really?
>
> No. The relevant explanation is at the top of the file:

[...]

> *        -- Optionally, any number of inactive segments of size 0 blocks.
> *    Inactive segments are those that once contained data but are currently
> *    not needed because of an mdtruncate() operation.  The reason for leaving
> *    them present at size zero, rather than unlinking them, is that other
> *    backends and/or the checkpointer might be holding open file references to
> *    such segments.  If the relation expands again after mdtruncate(), such
> *    that a deactivated segment becomes active again, it is important that
> *    such file references still be valid --- else data might get written
> *    out to an unlinked old copy of a segment file that will eventually
> *    disappear.

Ok.

Then should it be _OPEN_INACTIVE[TED] or _OPEN_TRUNCATED rather than 
_OPEN_DELETED, which is contradictory?

-- 
Fabien.

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Andres Freund

Дата:

04 мая 2016 г., 02:48:17

Hi,

On 2016-05-03 07:24:46 +0200, Fabien COELHO wrote:
> >>I'm unsure about switching enum to #define, could be an enum still with
> >>explicit values set, something like:
> >
> >An enum doesn't have a benefit for a bitmask imo - you can't "legally"
> >use it as a type for functions accepting the bitmask.
> 
> I do not understand. I suggested to use enum to enumerate the bitmask
> constants, ISTM that it does not preclude to use it as a bitmask as you do,
> it is just a replacement of the #define? The type constraint on the enum
> does not disallow bitmasking values, I checked with both gcc & clang.

There's not really a point in using an enum if you use neither the type
(which you shouldn't because if you or the bitmask value you have types
outside the range of the enum), nor the autogenerated numbers. Anyway
seems fairly unimportant.


> >>I'm fuzzy about the _OPEN_DELETED part because it is an oxymoron. Is it
> >>RECREATE really?
> >
> >No. The relevant explanation is at the top of the file:
> 
> [...]
> 
> >*        -- Optionally, any number of inactive segments of size 0 blocks.
> >*    Inactive segments are those that once contained data but are currently
> >*    not needed because of an mdtruncate() operation.  The reason for leaving
> >*    them present at size zero, rather than unlinking them, is that other
> >*    backends and/or the checkpointer might be holding open file references to
> >*    such segments.  If the relation expands again after mdtruncate(), such
> >*    that a deactivated segment becomes active again, it is important that
> >*    such file references still be valid --- else data might get written
> >*    out to an unlinked old copy of a segment file that will eventually
> >*    disappear.
> 
> Ok.
> 
> Then should it be _OPEN_INACTIVE[TED] or _OPEN_TRUNCATED rather than
> _OPEN_DELETED, which is contradictory?

Well, TRUNCATED doesn't entirely work, there's I think some cases where
this currently also applies to deleted relations. I kind of like the
unintuitive sounding name, because it's really a dangerous options (any
further mdnblocks will be wrong).


Andres

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Fabien COELHO

Дата:

04 мая 2016 г., 08:46:59

Hello Andres,

>>> An enum doesn't have a benefit for a bitmask imo - you can't "legally"
>>> use it as a type for functions accepting the bitmask.
>>
>> I do not understand. I suggested to use enum to enumerate the bitmask
>> constants, ISTM that it does not preclude to use it as a bitmask as you do,
>> it is just a replacement of the #define? The type constraint on the enum
>> does not disallow bitmasking values, I checked with both gcc & clang.
>
> There's not really a point in using an enum if you use neither the type 
> (which you shouldn't because if you or the bitmask value you have types 
> outside the range of the enum), nor the autogenerated numbers.

I do not think that there is such a constraint in C, you can use the enum 
bitfield type, so you should. You can say in the type name that it is a 
bitmask to make it clearer:
  typedef enum {    EXTENSION_XXX = ...;  } extension_behavior_bitfield;

> Anyway seems fairly unimportant.

Sure, it is just cosmetic. Now the type is already an enum and you can 
keep it that way, ISTM that it is cleaner to avoid defines, so I think you 
should.

>> Then should it be _OPEN_INACTIVE[TED] or _OPEN_TRUNCATED rather than
>> _OPEN_DELETED, which is contradictory?
>
> Well, TRUNCATED doesn't entirely work, there's I think some cases where
> this currently also applies to deleted relations. I kind of like the
> unintuitive sounding name, because it's really a dangerous options (any
> further mdnblocks will be wrong).

Hmmm.

My issue is with the semantics of the name which implies how it can be 
understand by someone reading the code. The interface deals with files, so 
implicitely DELETED refers to files, and AFAICS no file was deleted. Maybe 
rows in the relation where deleted somehow, or entries in an index, but 
that has no sense at the API level. So I think you should not use DELETED.

I can see why a vaguely oxymoronic name is amusing, but I do not think 
that it conveys any idea of "dangerous" as you suggest.

Now the important think is probably not that the code is the cleanest 
possible one, but that it fixes the issue, so you should probably commit 
something.

-- 
Fabien.

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Andres Freund

Дата:

04 мая 2016 г., 11:59:17

On 2016-04-28 17:41:29 +0100, Thom Brown wrote:
> I've noticed another breakage, which I can reproduce consistently.

Thanks for the testing!  I pushed a fix for this. This wasn't actually
an issue in the original patch, but a too strict test added in my fix.

Greetings,

Andres Freund

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Andres Freund

Дата:

04 мая 2016 г., 11:59:54

On 2016-05-04 07:46:42 +0200, Fabien COELHO wrote:
> >Well, TRUNCATED doesn't entirely work, there's I think some cases where
> >this currently also applies to deleted relations. I kind of like the
> >unintuitive sounding name, because it's really a dangerous options (any
> >further mdnblocks will be wrong).
> 
> Hmmm.

I went with the boring DONT_CHECK_SIZE ;)

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Thom Brown

Дата:

04 мая 2016 г., 13:43:14

On 4 May 2016 at 09:59, Andres Freund <andres@anarazel.de> wrote:

On 2016-04-28 17:41:29 +0100, Thom Brown wrote:
> I've noticed another breakage, which I can reproduce consistently.

Thanks for the testing! I pushed a fix for this. This wasn't actually
an issue in the original patch, but a too strict test added in my fix.

Re-testing shows that this appears to have solved the issue.

Thanks

Thom

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Tom Lane

Дата:

04 мая 2016 г., 17:55:55

Fabien COELHO <coelho@cri.ensmp.fr> writes:
>> There's not really a point in using an enum if you use neither the type 
>> (which you shouldn't because if you or the bitmask value you have types 
>> outside the range of the enum), nor the autogenerated numbers.

> I do not think that there is such a constraint in C, you can use the enum 
> bitfield type, so you should.

I think you are failing to understand Andres' point.  If you're going
to OR together some bits, the result is no longer a member of the enum
type, and the advantages that an enum would have immediately turn into
disadvantages.
        regards, tom lane

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Fabien COELHO

Дата:

04 мая 2016 г., 19:29:27

Hello Tom,

>>> There's not really a point in using an enum if you use neither the type
>>> (which you shouldn't because if you or the bitmask value you have types
>>> outside the range of the enum), nor the autogenerated numbers.
>
>> I do not think that there is such a constraint in C, you can use the enum
>> bitfield type, so you should.
>
> I think you are failing to understand Andres' point.  If you're going
> to OR together some bits, the result is no longer a member of the enum
> type, and the advantages that an enum would have immediately turn into
> disadvantages.

I understood the point and I do not see real disadvantages. The C standard 
really says that an enum is an int, and compilers just do that. I see it 
as a matter of interpretation whether enum members are strictly allowed 
values or just allowed bits, but maybe the standard says otherwise.

Anyway, the good news is that the bug is now fixed.

-- 
Fabien.

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Tom Lane

Дата:

04 мая 2016 г., 20:31:38

Fabien COELHO <coelho@cri.ensmp.fr> writes:
> I understood the point and I do not see real disadvantages. The C standard
> really says that an enum is an int, and compilers just do that.

No, it doesn't say that, and compilers don't just do that. A compiler
is specifically allowed to store an enum in char or short if the enum's
declared values would all fit into that width. (Admittedly, if you're
choosing the values as powers of 2, an OR of them would still fit; but
if you think "oh, an enum is just an int", you will get burned.)

More to the point, once you allow OR'd values then none of the practical
benefits of an enum still hold good. The typical things that I rely on
in an enum that you don't get from a collection of #define's are:

* compiler warnings if you forget some members of the enum in a switch

* debugger ability to print variables symbolically

Those benefits both go up in smoke as soon as you allow OR'd values.
At that point you might as well use the #defines rather than playing
language lawyer about whether what you're doing meets the letter of
the spec. I note that C99 specifically mentions this as something
a compiler might warn about:
-- A value is given to an object of an enumeration type other than by assignment of an
enumeration constant that is a member of that type, or an enumeration variable that has the
sametype, or the value of a function that returns the same enumeration type
(6.7.2.2).

which certainly looks like they don't consider
"enumvar = ENUMVAL1 | ENUMVAL2" to be strictly kosher.
regards, tom lane

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Stephen Frost

Дата:

04 мая 2016 г., 20:44:42

Tom,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> * debugger ability to print variables symbolically

I might be misunderstanding what you're getting at here, but if you want
to be able to use #define'd values using their name, you can get that by
compiling with -g3.  With -g3 and gdb, you can do things like:

(gdb) p tblinfo[i].dobj.dump & ~(DUMP_COMPONENT_COMMENT |
DUMP_COMPONENT_SECLABEL | DUMP_COMPONENT_USERMAP | DUMP_COMPONENT_ACL)

where all the DUMP_COMPONENT items are #define's.

Thanks!

Stephen

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Tom Lane

Дата:

04 мая 2016 г., 21:58:27

Stephen Frost <sfrost@snowman.net> writes:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> * debugger ability to print variables symbolically

> I might be misunderstanding what you're getting at here, but if you want
> to be able to use #define'd values using their name, you can get that by
> compiling with -g3.  With -g3 and gdb, you can do things like:

> (gdb) p tblinfo[i].dobj.dump & ~(DUMP_COMPONENT_COMMENT |
> DUMP_COMPONENT_SECLABEL | DUMP_COMPONENT_USERMAP | DUMP_COMPONENT_ACL)

> where all the DUMP_COMPONENT items are #define's.

Yes, but if you just do "p tblinfo[i].dobj.dump", you're only going to get
a number, right?  The value-added for an enum type is that the debugger
knows which symbol to substitute for a numeric value when printing.  But
that stops working if what's in the variable is some OR of the values the
debugger knows about.
        regards, tom lane

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Stephen Frost

Дата:

04 мая 2016 г., 22:02:45

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> >> * debugger ability to print variables symbolically
>
> > I might be misunderstanding what you're getting at here, but if you want
> > to be able to use #define'd values using their name, you can get that by
> > compiling with -g3.  With -g3 and gdb, you can do things like:
>
> > (gdb) p tblinfo[i].dobj.dump & ~(DUMP_COMPONENT_COMMENT |
> > DUMP_COMPONENT_SECLABEL | DUMP_COMPONENT_USERMAP | DUMP_COMPONENT_ACL)
>
> > where all the DUMP_COMPONENT items are #define's.
>
> Yes, but if you just do "p tblinfo[i].dobj.dump", you're only going to get
> a number, right?  The value-added for an enum type is that the debugger
> knows which symbol to substitute for a numeric value when printing.  But
> that stops working if what's in the variable is some OR of the values the
> debugger knows about.

Ah, yeah, that's true.  -g3, unsurprisingly, doesn't help with
displaying a random int value using some set of #define's.

Thanks!

Stephen

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Fabien COELHO

Дата:

05 мая 2016 г., 10:33:00

Hello Tom,

>> I understood the point and I do not see real disadvantages. The C standard
>> really says that an enum is an int, and compilers just do that.
>
> No, it doesn't say that, and compilers don't just do that.
> A compiler is specifically allowed to store an enum in char or short if 
> the enum's declared values would all fit into that width.

Hmm. That is a bit of a subtle one:

Spec says that enum *constants* are ints "The identifiers in an enumerator 
list are declared as constants that have type int and may appear wherever 
such are permitted."  But indeed, as you point out, per spec the storage 
could be made smaller.

However, I'm yet to meet a compiler which does not do ints:
  typedef enum { false, true } boolean;  assert(sizeof(boolean) == sizeof(int)); // ok with gcc & clang

I can guess why: such a compiler would not be able to work with libraries 
compiled with gcc or clang, which would be an pretty annoying feature. Now 
it does not mean that such a compiler does not exist in some realm (8/16 
bits architectures maybe? but then ints would be 16 bits on these...).

> (Admittedly, if you're choosing the values as powers of 2, an OR of them 
> would still fit;

Yep.

> but if you think "oh, an enum is just an int", you will get burned.)

In theory, yes. In practice, the compiler writer would have get burned 
before me:-).

> * compiler warnings if you forget some members of the enum in a switch

Sure. Using a switch on a bitfield is an uncommon pattern, though.

> * debugger ability to print variables symbolically

Yep.

Names are lost by defines, which is my preliminary concern to try to avoid 
them, even at the price of beting against the spec letter:-)

> At that point you might as well use the #defines rather than playing
> language lawyer about whether what you're doing meets the letter of
> the spec.

Hmmm... the compilers are the real judges, the spec is just the law:-)

> I note that C99 specifically mentions this as something a compiler might 
> warn about: [...]

Indeed. Neither gcc nor clang emit such warnings... but they might some 
day, which would be a blow for my suggestion!

Another option would have been explicit bitfields, something like:
  typedef struct {    int create : 1,        return_null : 1,        fail : 1,        create_in_recovery : 1,        //
...       ;  } extension_bitfield_t;
 
  void foo(extension_bitfield_t behavior)  {    if (behavior.create) printf("create...\n");    if
(behavior.return_null)printf("null...\n");    if (behavior.fail) printf("fail...\n");    if
(behavior.create_in_recovery)printf("recovery...\n");  }
 
  void bla(void)  {    foo((extension_bitfield_t) { .fail = 1, .create_in_recovery = 1 });  }

With gdb it is quite readable:
  // (gdb) p behavior  // $1 = {create = 0, return_null = 0, fail = -1, create_in_recovery = -1}

Anyway, thanks for the discussion!

-- 
Fabien.

Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

От

Piotr Stefaniak

Дата:

06 мая 2016 г., 08:38:20

On 2016-05-05 09:32, Fabien COELHO wrote:
>> I note that C99 specifically mentions this as something a compiler
>> might warn about: [...]
>
> Indeed. Neither gcc nor clang emit such warnings... but they might some
> day, which would be a blow for my suggestion!

For what it's worth, newer versions of clang can emit useful warnings 
for cases like this:

int main(void) {enum test { FOUR = 4 };enum incompatible { INCOMPATIBLE_FOUR = 4 };enum test variable;variable =
INCOMPATIBLE_FOUR;variable= 5;variable = 4;variable = 3;return 0;

}

enum.c:5:13: warning: implicit conversion from enumeration type 'enum 
incompatible' to different enumeration type 'enum test' [-Wenum-conversion]        variable = INCOMPATIBLE_FOUR;
        ~ ^~~~~~~~~~~~~~~~~

enum.c:6:13: warning: integer constant not in range of enumerated type 
'enum test' [-Wassign-enum]        variable = 5;                   ^
enum.c:8:13: warning: integer constant not in range of enumerated type 
'enum test' [-Wassign-enum]        variable = 3;                   ^
3 warnings generated.

So with -Wenum-conversion -Wassign-enum you could treat enum types as 
distinct and incompatible with each other.

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Re: [BUGS] Breakage with VACUUM ANALYSE + partitions

Вложения

Вложения

Вложения

Вложения