Thread: Re: posix advises ...
Hi Zoltán.

I was reading through your posix_fadvise patch, and I wanted to ask about
this change in particular:

> --- a/src/backend/executor/nodeIndexscan.c
> +++ b/src/backend/executor/nodeIndexscan.c
> @@ -290,7 +290,7 @@ ExecIndexEvalArrayKeys(ExprContext *econtext,
>  	/* We want to keep the arrays in per-tuple memory */
>  	oldContext = MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory);
>
> -	for (j = 0; j < numArrayKeys; j++)
> +	for (j = numArrayKeys-1; j >= 0; j--)
>  	{
>  		ScanKey		scan_key = arrayKeys[j].scan_key;
>  		ExprState  *array_expr = arrayKeys[j].array_expr;

Why is this loop reversed? (I could have missed some discussion about this,
I just wanted to make sure it was intentional.)

I can confirm that the patch applies to HEAD, that the configure test
correctly #defines USE_POSIX_FADVISE, that it compiles cleanly, and that it
looks basically sensible to me. The nodeBitmapHeapscan.c and iterator
changes need a second look by someone who understands the code better than
I do.

The documentation patch could use a little tweaking:

> + <productname>PostgreSQL</productname> can give hints to POSIX compatible
> + operating systems using posix_fadvise(2) to pre-read pages that will
> + be needed in the near future. Reading the pages into the OS kernel's
> + disk buffer will be done asynchronically while <productname>PostgreSQL</productname>
> + works on pages which are already in memory. This may speed up bitmap scans
> + considerably. This setting only applies to bitmap scans.
> + Hinting for sequential scans is also used but no GUC is needed in this case.

I would suggest something like this instead:

    <productname>PostgreSQL</productname> can use posix_fadvise(2) to tell
    the operating system about pages that it knows will be needed in the
    near future. The performance of bitmap scans is considerably improved
    when the kernel pre-reads such pages into its disk buffers. This
    variable controls how many pages are marked for pre-reading at a time
    during a bitmap scan.

But I'm not convinced that this GUC is well-advised; at least, it needs
some advice about how to determine a sensible size for the parameter (and
maybe a better name). But is it really necessary at all?

Thanks.

-- ams
"Abhijit Menon-Sen" <ams@oryx.com> writes: > Hi Zoltán. > > I was reading through your posix_fadvise patch, Actually Zoltan's patch was based on an earlier patch from me. The sections you're highlighting here are from my original patch. > and I wanted to ask about this change in particular: > >> --- a/src/backend/executor/nodeIndexscan.c >> +++ b/src/backend/executor/nodeIndexscan.c >> @@ -290,7 +290,7 @@ ExecIndexEvalArrayKeys(ExprContext *econtext, >> /* We want to keep the arrays in per-tuple memory */ >> oldContext = MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory); >> >> - for (j = 0; j < numArrayKeys; j++) >> + for (j = numArrayKeys-1; j >= 0; j--) > Why is this loop reversed? (I could have missed some discussion about > this, I just wanted to make sure it was intentional.) There was some discussion about this change and in fact if you look at CVS HEAD you'll find it already applied. It probably doesn't make a big difference at least in most cases but it seems more sensible to increment the least significant key elements first since that maximizes the chances of fetching index keys from the same index pages or nearby pages. Incrementing the most significant index keys would maximize the distance we're jumpin around in the index tree. > But I'm not convinced that this GUC is well-advised; at least, it needs > some advice about how to determine a sensible size for the parameter > (and maybe a better name). But is it really necessary at all? I'm not sure which version of my patch Zoltan's was based on. The later versions of mine had a GUC named effective_spindle_count which I think is nicely abstracted away from the implementation details. We do need something like that because we have no way to determine how wide a raid stripe Postgres is sitting on and the optimal read-ahead depends on that. The alternative would be to measure the bandwith with various amounts of read-ahead. 
But I fear that would be quite hard on a production system with the amount of noise from other i/o. It's hard enough on an idle machine with synthetic benchmarks. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support!
At 2008-07-12 00:52:42 +0100, stark@enterprisedb.com wrote:
>
> There was some discussion about this change and in fact if you
> look at CVS HEAD you'll find it already applied.

Not as far as I can see.

> Incrementing the most significant index keys would maximize the
> distance we're jumping around in the index tree.

I see. Thanks.

> The later versions of mine had a GUC named effective_spindle_count
> which I think is nicely abstracted away from the implementation
> details.

Yes, that does sound much better. (The patch I read had a
preread_pages_bitmapscan variable instead.)

-- ams
Abhijit Menon-Sen <ams@oryx.com> writes:
> At 2008-07-12 00:52:42 +0100, stark@enterprisedb.com wrote:
>> There was some discussion about this change and in fact if you
>> look at CVS HEAD you'll find it already applied.

> Not as far as I can see.

The place where it matters is in ExecIndexAdvanceArrayKeys(), not in the
initial setup.

			regards, tom lane
On Sat, 12 Jul 2008, Abhijit Menon-Sen wrote:

> At 2008-07-12 00:52:42 +0100, stark@enterprisedb.com wrote:
>>
>> The later versions of mine had a GUC named effective_spindle_count
>> which I think is nicely abstracted away from the implementation
>> details.
>
> Yes, that does sound much better. (The patch I read had a
> preread_pages_bitmapscan variable instead.)

This patch does need a bit of general care in a couple of areas. The
reviewing game plan I'm working through goes like this:

1) Update the original fadvise test program Greg Stark wrote to be a bit
easier to use for testing general compatibility of this approach. I want
to collect some data from at least two Linux and Solaris systems with
different disk setups.

2) Check out effective_spindle_count and see if it looks like a reasonable
way to tune this feature. If so, it will probably need to be merged into
Zoltan's version of the patch. That patch set may need some other cleanup
as well--I'm not sure that the closed-XLOG patch that got pushed in here
as well is really helpful, for example.

3) Generate a sequential scan test program aimed to hobble the Linux
kernel in the way Zoltan described as motivation for his work. I'm working
with Jeff Davis this week to try to repurpose some of his synchronized
scan test programs to handle this while we're both in the same place for a
bit.

4) Generate a bitmap scan test program to check the original patch.

5) If the performance results look useful and consistent, then move toward
cleaning up broader compatibility issues like the segfault concerns Zoltan
mentioned.

Going to take a while to work through all that, but performance patches
with platform-specific benefit are always painful like this.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith wrote:

> This patch does need a bit of general care in a couple of areas. The
> reviewing game plan I'm working through goes like this:

Did this review effort go anywhere?

--
Alvaro Herrera                          http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
On Sun, 31 Aug 2008, Alvaro Herrera wrote:

> Greg Smith wrote:
>
>> This patch does need a bit of general care in a couple of areas. The
>> reviewing game plan I'm working through goes like this:
>
> Did this review effort go anywhere?

Haven't made much progress--all my spare time for work like this lately
has been stuck fighting broken hardware and OS issues on my test systems.
I have a lot more time set aside for PostgreSQL hacking this month, and
I'll look at that and the checkpoint block sorting patch I've also been
tinkering with soon, and kick back an initial review of each.

But if anybody else is keen on working on either patch, let me know. Be
glad to assist instead of lead. It takes fairly serious hardware to work
on either of these usefully, though.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD