Обсуждение: Updated posix fadvise patch v19

Поиск

Список

Период

Сортировка

Updated posix fadvise patch v19

От

Gregory Stark

Дата:

30 октября 2008 г., 17:03:02

Here is an updated posix_fadvise patch against CVS HEAD. It's my first patch
generated using git, yay. I remembered to strip out configure (WHY is that
STILL in CVS!?) and convert it to context diff, but if there's anything else
I've missed just shout.

Changes from before:

1) Based on discussions changed the parameter to effective_io_concurrency and
   modified the documentation discussion to include comments about SSDs and
   experimentation being necessary.

2) standardized all references to "prefetch" from the mixture of same and
   "preread" and "advise". This means the main entry point is now
   PrefetchBuffer().

3) Added run-time check for buggy glibc versions with bad posix_fadvise()
   since it was pointed out that someone could run on a different version of
   glibc than it was compiled on (such as with binary installers). I kept the
   autoconf test though arguably it's redundant now and could be removed.

Also, the previous patch included support for Zoltan's situation using
POSIX_FADV_SEQUENTIAL for bulk sequential scans. I experimented with this and
was not able to find any situation it improved. I've left it in so others can
test and show it's helpful. This bit of code is a bit less polished -- it's
missing documentation for the guc variable and there are some #defines which
are probably in the wrong .h file (and probably should be an enum).

This code is entirely separate from the prefetching code. They happen to both
use posix_fadvise but I don't think they belong in the same abstraction for
Postgres. PrefetchBuffer does just one thing -- prefetches a buffer.
Sequential i/o needs to be set and tracked per-file.

I have not changed the query costing side of things. I'm a bit leery about
making those changes as I fear it will lead me to propose a major reworking of
these parameters and I don't want to go there...



--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's 24x7 Postgres support!

Вложения

posix-fadvise-v19.patch.gz

Re: Updated posix fadvise patch v19

От

"Robert Haas"

Дата:

13 ноября 2008 г., 23:34:57

> - StrategyFileStrategy doesn't handle the recently added BAS_BULKWRITE
> strategy.  I'm not sure whether it needs to, but it seems to me that
> this a trap for the unwary: we should probably add a comment where the
> BAS_* constants are defined warning that any changes here may/will
> also necessitate changes there.  I think a detailed comment on the
> function itself explaining why it does what it does and how to decide
> what to do for a new type of BufferAccessStrategy would be a good
> idea.

In fact, now that I look at this a little further, I see that in
general you've not added comments at the beginnings of functions - for
example, the other functions in the files that contain smgrprefetch,
mdprefetch, PrefetchBuffer seem to have a description of the purpose
of those functions; the ones you've added do not.

Good luck, I'd like to see this one get in - the performance results
you've reported sound very impressive.

...Robert

Re: Updated posix fadvise patch v19

От

"Robert Haas"

Дата:

13 ноября 2008 г., 23:40:33

I was pretty leery about reviewing this one due to the feeling that I
might well be in over my head, but they talked me into it, so here
goes nothin'.  Apologies in advance for any deficiencies in this
review.

- Overall, this looks pretty clean.  The style appears to be
consistent with the surrounding code, the patch applies with only
minor fuzz, there is not much cruft in the diff (I found a few very
minor unnecessary hunks, see department of nitpicking below) and the
whole thing compiles and passes regression tests.  Also, while I'm not
totally qualified to judge the success of your efforts, it appears
that you've attempted to make the changes in a way that preserves the
existing abstractions, which is good.

- However, having said that, it looks as if there is still a bit of
experimentation going on in terms of what you actually want the patch
to do.  There are a couple of things that say FIXME or XXX, and at
least one diff hunk that adds code surrounded by #if 0 (in FileRead).
Maybe committing FIXME is OK, but certainly #if 0 isn't, so there's at
least a bit of work needed here to put this in its final form, though
I think perhaps not that much.  As you mentioned when posting this
version of the patch, it does two unrelated things only one of which
you are confident is beneficial.  I suggest splitting this into two
patches and trying to get the prefetch part committed first.  I think
that would make for easier review, too.

- I dislike cluttering up EXPLAIN ANALYZE with even more crap.  On the
flip side, I am loathe to take away any sort of instrumentation that
might be helpful.  I think we need to revamp the syntax of EXPLAIN
[ANALYZE] to support some kind of option list, so that users can
request the information they care about and not be overwhelmed by what
they don't care about.  (ISTR that a similar proposal was made with
regard to VACUUM some time ago... perhaps the same ideas could be
applied to both.)  That would need to be done as a separate patch, so
maybe we shouldn't worry about it here.

- It's not clear to me whether there is a reason why we are only
adding instrumentation for bitmap heap scan; would not other scan
types benefit from something similar?  If we're not going to add
instrumentation for all the cases that can benefit from it, then we
should just rip it out.

- StrategyFileStrategy doesn't handle the recently added BAS_BULKWRITE
strategy.  I'm not sure whether it needs to, but it seems to me that
this a trap for the unwary: we should probably add a comment where the
BAS_* constants are defined warning that any changes here may/will
also necessitate changes there.  I think a detailed comment on the
function itself explaining why it does what it does and how to decide
what to do for a new type of BufferAccessStrategy would be a good
idea.

- I am mildly concerned that we are overusing the word Strategy.  The
purpose of StrategyFileStrategy is, of course, to take a
BufferAccessStrategy (which is an object) and return a FILE_STRATEGY_*
constant.  But someone might not find that entirely evident from the
name.  If we're going to have multiple things floating around that are
different from each other but all called strategies, ISTM we should
name functions etc. to make clear which one we're talking about.  Or
else pick a different word.  Maybe for now it's sufficient to rename
this function to AccessStrategyGetFileStrategy, or something.

- The WARNING at the top of PrefetchBuffer() seems strange.  Is that
really only a WARNING?  We just continue anyway?

Department of nitpicking:
- The very first comment change in src/backend/commands/explain.c is
apparently extraneous.
- guc.c misspells the work "concurrent" as "concurrenct".
- The first diff hunk in each of fd.h and smgr.h is an unnecessary
whitespace change.
- nodeBitmapHeapscan.c adds an ELOG at DEBUG2 - do we really want
this, or is it leftover debugging code?

...Robert

Re: Updated posix fadvise patch v19

От

Gregory Stark

Дата:

14 ноября 2008 г., 18:33:47

"Robert Haas" <robertmhaas@gmail.com> writes:

> I was pretty leery about reviewing this one due to the feeling that I
> might well be in over my head, but they talked me into it, so here
> goes nothin'.  Apologies in advance for any deficiencies in this
> review.
>
> - Overall, this looks pretty clean.  
>
> - However, having said that, it looks as if there is still a bit of
> experimentation going on in terms of what you actually want the patch
> to do.  There are a couple of things that say FIXME or XXX, and at
> least one diff hunk that adds code surrounded by #if 0 (in FileRead).
> Maybe committing FIXME is OK, but certainly #if 0 isn't, so there's at
> least a bit of work needed here to put this in its final form, though
> I think perhaps not that much.  As you mentioned when posting this
> version of the patch, it does two unrelated things only one of which
> you are confident is beneficial.  I suggest splitting this into two
> patches and trying to get the prefetch part committed first.  I think
> that would make for easier review, too.

When I posted the plan was that people would run experiments with other
systems. In particular I was hoping Zoltan who original reported the
sequential file strategy stuff being helpful might be able to see if this
works for him.

As this hasn't happened and I haven't been able to demonstrate it being useful
myself I guess it makes more sense to separate the two now and set the
sequential scan stuff aside until someone can demonstrate it being useful.

> - I dislike cluttering up EXPLAIN ANALYZE with even more crap.  On the
> flip side, I am loathe to take away any sort of instrumentation that
> might be helpful.  

I put it in for debugging since it was hard to know if it was actually doing
anything otherwise. I figured whoever was reviewing the patch would run into
the same problem. It would be trivial to strip out though.

> - It's not clear to me whether there is a reason why we are only
> adding instrumentation for bitmap heap scan; would not other scan
> types benefit from something similar?  If we're not going to add
> instrumentation for all the cases that can benefit from it, then we
> should just rip it out.

Well there is a difference because bitmap heap scans do that slow-start thing.
They ramp up the prefetching as you fetch more tuples from the bitmap up to
the user configurable GUC so the point was to instrument what level they get
to. Regular index scans prefetch whatever they find in the index leaf pages
which will vary from page to page and isn't really something the user can
influence anyways.

But as I said, it was mostly so I could tell what the slow start algorithm was
doing and make sure it was doing anything at all in testing.

> - The WARNING at the top of PrefetchBuffer() seems strange.  Is that
> really only a WARNING?  We just continue anyway?

Er, uh, yeah that's bogus. The RelationIsValid is pointless as
RelationOpenSmgr will just crash if that's not true.

But I'm not nearly as convinced that BlockNumberIsValid() isn't worth
asserting for. I don't see anything in the buffer tag lookup code which will
fail in that case. What confuses me is that there's no corresponding check in
the ReadBuffer code.

Tom? Heikki? Should we have an assert in ReadBuffer asserting
BlockNumberIsValid -- especially for the zero-page case where we're not
actually going to do I/O? Or am I missing where that will seg fault?

> Department of nitpicking:

Will clean these up, they all look valid. I thought I did clean things up
already -- I guess you should be happy you're looking at it *after* that
cleanup :/

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production
Tuning

Re: Updated posix fadvise patch v19

От

"Robert Haas"

Дата:

14 ноября 2008 г., 18:40:40

> As this hasn't happened and I haven't been able to demonstrate it being useful
> myself I guess it makes more sense to separate the two now and set the
> sequential scan stuff aside until someone can demonstrate it being useful.

Sounds good.  How soon do you think you can post updated patches?

> But as I said, it was mostly so I could tell what the slow start algorithm was
> doing and make sure it was doing anything at all in testing.

In that case, I think you should just rip it all out for now.

>> Department of nitpicking:
>
> Will clean these up, they all look valid. I thought I did clean things up
> already -- I guess you should be happy you're looking at it *after* that
> cleanup :/

Maybe it's you who should be glad... :-)

...Robert

Re: Updated posix fadvise patch v19

От

Greg Smith

Дата:

15 ноября 2008 г., 03:46:10

On Fri, 14 Nov 2008, Gregory Stark wrote:

> In particular I was hoping Zoltan who original reported the sequential 
> file strategy stuff being helpful might be able to see if this works for 
> him.

The original message there suggested it was particularly valuable when 
working with a somewhat broken kernel that was handling multiple 
simultaenous reads badly.  I think the idea here is that if you've got an 
OS that handles that situation well, then the fadvise calls don't help 
much.  But having them in there does improve the odds that the kernel will 
do the right thing with the sequential reads.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Updated posix fadvise patch v19

От

Gregory Stark

Дата:

18 ноября 2008 г., 07:50:09

"Robert Haas" <robertmhaas@gmail.com> writes:

> - However, having said that, it looks as if there is still a bit of
> experimentation going on in terms of what you actually want the patch
> to do.  There are a couple of things that say FIXME or XXX, and at
> least one diff hunk that adds code surrounded by #if 0 (in FileRead).

Well, yes, we have lots of XXX and FIXMEs throughout the source code. We need
to make progress even if we don't know for certain that we have the perfect
crystalline code structure in all its glory.

I have the same reaction and fear when I see reviewers' questions about
whether a patch is "finished". If we wait until someone is "finished" working
on a feature we'll never get the feature -- there's always more that can be
done. We want to commit things when they are ready for use, not wait until
there's nothing more to do.

The XXX is something that probably needs to be fixed but it's just a question
of what header file to put a declaration in. I couldn't find a good choice but
perhaps someone else has an idea?

For the FIXMEs I don't have any problem leaving them in place. They're
warnings to future coders working in the same area of what they may have to do
to make the code more general. In particular both FIXMEs are related to memory
management of the iterator structures. I think just allocating them in the
bitmap memory context is fine for existing callers. I would rather leave them
there because I would like a reviewer to double check that we don't have a
memory leak there.

The #if 0 is part of the FileStrategy code to use POSIX_FADV_SEQUENTIAL which
I'm removing now so it's irrelevant.

> As you mentioned when posting this version of the patch, it does two
> unrelated things only one of which you are confident is beneficial. I
> suggest splitting this into two patches and trying to get the prefetch part
> committed first. I think that would make for easier review, too.

Maintaining two interdependent patches is a bit of a pain. This is one reason
it's annoying when a patch goes uncommitted for a long time. It's hard to work
on further improvements which are dependent on it.

Nobody seems to care about the sequential scan case any more so I doubt I'll
bother maintaining it after I remove it from this patch.

> - I dislike cluttering up EXPLAIN ANALYZE with even more crap.  On the
> flip side, I am loathe to take away any sort of instrumentation that
> might be helpful.  

My feeling is that it was useful but only marginally and mostly only for
testing that things were working properly. The main benefit is giving people a
real feeling for how much data a given effective_io_concurrency actually
translates to. 

I'm removing this now. If anyone else thinks it was nice to have scream now.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production
Tuning

Re: Updated posix fadvise patch v19

От

"Robert Haas"

Дата:

18 ноября 2008 г., 10:33:39

> For the FIXMEs I don't have any problem leaving them in place. They're
> warnings to future coders working in the same area of what they may have to do
> to make the code more general.

That's fine with me.  I think it's fine to document possibilities for
future development, but sometimes it's hard to tell between which such
things the patch author intends to be committed and which he or she
intends to fix before commit.  Maybe FIXME is just a noise word,
though: there are other places in the code where we document things
that might be related or need to be changed without using that word.
I don't care very much, though.

Looking forward to v20.

...Robert

Re: Updated posix fadvise patch v19

От

Peter Eisentraut

Дата:

18 ноября 2008 г., 11:12:01

Gregory Stark wrote:
> The XXX is something that probably needs to be fixed but it's just a question
> of what header file to put a declaration in. I couldn't find a good choice but
> perhaps someone else has an idea?
> 
> For the FIXMEs I don't have any problem leaving them in place. They're
> warnings to future coders working in the same area of what they may have to do
> to make the code more general. In particular both FIXMEs are related to memory
> management of the iterator structures. I think just allocating them in the
> bitmap memory context is fine for existing callers. I would rather leave them
> there because I would like a reviewer to double check that we don't have a
> memory leak there.

There are probably no rigid rules on this, but my interpretation of 
these tags is usually this:

XXX -- not sure if this is the best way to do this, needs ideas
TODO -- specific ideas for improvement
FIXME -- broken, must be fixed to be usable

So committed code should probably not contain any FIXMEs, but possibly 
some of the others.

I usually label stubs in work-in-progress code with // FIXME and then 
check if I removed them all before proposing a patch for inclusion.

But those are just my ideas ...

Re: Updated posix fadvise patch v19

От

Gregory Stark

Дата:

18 ноября 2008 г., 11:20:57

"Robert Haas" <robertmhaas@gmail.com> writes:

> Looking forward to v20.

Here you go!

I addressed all the nitpicks and added comments.

I also stripped out the sequential i/o posix_fadvises. I'm kind of sad to see
them go since it did seem like a nice way to give more info to the OS even if
no OSes today make good use of it. But one thing at a time and this is clearly
a lot more important. Today we're effectively not using raid arrays properly
at all, we're using them like they're single drives.

One thing which is bothering me is that the guc assign hook is throwing an
error if you set effective_io_concurrency when your system's posix_fadvise is
deemed inadequate (either unavailable or from an old version of glibc). I'm
starting to think it shouldn't throw an error, just not set the internal
variable and possible output a warning. We do have some GUC variables which
throw errors if you use them and support isn't compiled in, but I'm not sure
it's such a hot idea even for those.




--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!

Вложения

posix-fadvise-v20.patch.gz

Re: Updated posix fadvise patch v19

От

Gregory Stark

Дата:

18 ноября 2008 г., 12:30:36

Peter Eisentraut <peter_e@gmx.net> writes:

> There are probably no rigid rules on this, but my interpretation of these tags
> is usually this:
>
> XXX -- not sure if this is the best way to do this, needs ideas
> TODO -- specific ideas for improvement
> FIXME -- broken, must be fixed to be usable

I don't have strong feelings on this. I appear to use them more or less
interchangeably. 

Or perhaps what I did is consistent with your rules. Except that the "FIXME"
isn't unusable it's just that I put the prototype in clearly the wrong file.
So it definitely has to be corrected but I don't know where to move it.

The XXX is for something I think is correct now but might need to be fixed if
new callers need tighter memory management. And which could use a close look
by a reviewer to be sure I'm right about the memory management being ok for
now.

FWIW we don't seem to have any such strict rules about them:

$ find . -name \*.[ch] -print0 | xargs -0 grep FIXME | wc -l
22
$ find . -name \*.[ch] -print0 | xargs -0 grep XXX | wc -l
485
$ find . -name \*.[ch] -print0 | xargs -0 grep TODO | wc -l
33

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!

Re: Updated posix fadvise patch v19

От

"Robert Haas"

Дата:

18 ноября 2008 г., 13:07:38

> One thing which is bothering me is that the guc assign hook is throwing an
> error if you set effective_io_concurrency when your system's posix_fadvise is
> deemed inadequate (either unavailable or from an old version of glibc). I'm
> starting to think it shouldn't throw an error, just not set the internal
> variable and possible output a warning. We do have some GUC variables which
> throw errors if you use them and support isn't compiled in, but I'm not sure
> it's such a hot idea even for those.

I can't see why this would be a good idea.  Warnings are easy to
overlook, and then you have completely different behavior without
knowing it.  If I wanted a database that silently did something
completely different from what I asked it to do, I'd use... well,
let's just say products like that are available.

...Robert

Re: Updated posix fadvise patch v19

От

"Robert Haas"

Дата:

18 ноября 2008 г., 23:55:49

> I addressed all the nitpicks and added comments.

Woot, yeah for comments.

There's a trivial conflict with CVS HEAD due to unrelated changes to
AC_CHECK_FUNCS(...kitchen sink...) but that led me to notice something
else: can't all this stuff about posix_fallocate be ripped out of
configure.in at this point?

In config.sgml, the documentation is good, but suffers from a slightly
informal style.  There are a lot of places where commas seem
appropriate but are not present.  Suggested changes by paragraph:

1. Replace last sentence: "Raising this value will cause PostgreSQL to initiate

...Robert

Re: Updated posix fadvise patch v19

От

"Robert Haas"

Дата:

19 ноября 2008 г., 00:08:34

Obviously that went too soon.

> In config.sgml, the documentation is good, but suffers from a slightly
> informal style.  There are a lot of places where commas seem
> appropriate but are not present.  Suggested changes by paragraph:
>
> 1. Replace last sentence: "Raising this value will cause PostgreSQL to initiate

...more I/O operations in parallel."

2. "Typically, this parameter should be set to the number of separate
drives comprising a RAID 0 stripe or RAID 1 mirror or any combination.For RAID 5, it should be set to the number of
drivesexcluding the
 
parity drive.   However, the optimal value varies based on the
effectiveness of the RAID controller or software implementation, and
may require some experimentation."

3. "For more exotic systems, such as memory-based storage or a RAID
array which is limited by bus bandwidth, the correct value might be
the number of paths available instead; again, experimentation is
recommended."

4. Insert comma after "However" in third sentence.  Replace last
sentence with: "On these operating systems, setting
effective_io_concurrency to a value greater than 1 will incur some CPU
overhead without improving perforamance."

...Robert

Re: Updated posix fadvise patch v19

От

Grzegorz Jaskiewicz

Дата:

20 ноября 2008 г., 07:09:03

this doesn't apply cleanly to head anymore, can you please post v21 ?
I would love to test it here :)

Re: Updated posix fadvise patch v19

От

Grzegorz Jaskiewicz

Дата:

20 ноября 2008 г., 15:56:43

On 2008-11-20, at 11:08, Grzegorz Jaskiewicz wrote:

> this doesn't apply cleanly to head anymore, can you please post v21 ?
> I would love to test it here :)
bollocks, it's already in cvs head - isn't it ? ... :D

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Updated posix fadvise patch v19

Вложения

Вложения