Обсуждение: Index Page Split logging

Поиск

Список

Период

Сортировка

Index Page Split logging

От

Simon Riggs

Дата:

01 января 2008 г., 12:06:04

When we split an index page we perform a multi-block operation that is
both fairly expensive and complex to reconstruct should we crash partway
through.

If we could log *only* the insert that caused the split, rather than the
split itself, we would avoid that situation entirely. This would then
mean that the recovery code would resolve the split by performing a full
logical split rather than replaying pieces of the original physical
split. Doing that would remove a ton of complexity, as well as reducing
log volumes.

We would need to ensure that the right-hand page of the split reached
disk before the left-hand page. If a crash occurs when only the right
hand page has reached disk then there would be no link (on disk) to it
and so it would be ignored. We would need an orphaned page detection
mechanism to allow the page to be VACUUMed sometime in the future. There
would also be some sort of ordering required in the buffer manager, so
that pages which must be written last are kept pinned until the first
page is written. That sounds like it is fairly straightforward and it
would allow a generic mechanism that worked for all index splits, rather
than requiring individual code for each rmgr.

ISTM that would require Direct I/O to perform physical writes in a
specific order, rather than just issue the writes and fsync. Which
probably kills it for now, even assuming you followed me on every point
up till now...

So I'm mentioning this really to get the idea out there and see if
anybody has any bright ideas, rather than as a well-formed proposal for
immediate implementation.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com

Re: Index Page Split logging

От

Tom Lane

Дата:

01 января 2008 г., 18:02:12

Simon Riggs <simon@2ndquadrant.com> writes:
> If we could log *only* the insert that caused the split, rather than the
> split itself, we would avoid that situation entirely.

How are you going to avoid the need to run user-defined functions
(specifically, the btree comparison functions) during replay?
        regards, tom lane

Implementing Sorting Refinements

От

Manolo _

Дата:

01 января 2008 г., 19:09:55

Hi to all.

This mail is aimed at asking some suggestion to face PostgreSQL (PG) development to implement a refinement to PG source
code.I'll briefly summarize the idea of the "2-Way Replacement Selection" (2WRS) refinement, just in case. If you
alreadyremember what 2WRS is, you can please jump directly to the IMPLEMENTATION ISSUES part of this mail.

2WRS.
Refinement of the Replacement Selection (RS) technique currently used by PG in src/backend/utils/sort/tuplesort.c .
The 2WRS uses two heaps instead of just one in order to create the current (logical) run. Here there are some
fundamentalpoints of the 2WRS technique:
- 'qsort' the initial unsorted 'memtuples' array
- divide the 'memtuples' array into two parts and each of those will be managed as a heap
- one of the heaps will arrange its elements in ascending order, while the other heap in descending order
- each heap will spill its current root in its corresponding run (i.e.: we have a run per each of those two heaps), so
weare actually creating 2 physical current runs
- those 2 current physical runs could "theoretically" be merged into the same logical run, actually we can make
'mergesort'think they do belong to the same physical run. That reduces the number of comparisons 'mergesort' has to do
ateach merge step (that means less seek delay time on mass storage). We can also think the average length of logical
runsproduced by 2WRS will probably be greater than the average length produced by RS (again less seek delay time: the
longereach run the less number of final runs produced, that means the less number of comparisons at each 'mergesort'
step).

IMPLEMENTATION ISSUES.Where to place those heaps?
1) I think that both heaps could be arranged on the same 'memtuples' array. That allows easily subsequent resizing
thoseheaps according to their effective use or according to some heuristic, without reallocating memory.How to arrange
thoseheaps?
2a) It's convenient to arrange those heaps "root to root". That is arranging those heaps with their roots toward the
centerof 'memtuples' (in a way we can say they lay "face to face", or "root to root" as said before) while their leaves
laytowards the extreme indexes of the 'memtuples' array (that is the last leaf of one heap will lay at index 0, the
lastleaf of the other heap laying at index memtupsize-1. This arrangement prevents overlapping elements between those
physicalruns associated to the same current logical run.
PRO: once we qsort memtuples and divide it into 2 parts we already get those two heaps, no need to build them.
CONTRA: ???

2b) As in 2a) but arranging heaps "leaf to leaf", that is their roots will lay at the extreme indexes of 'memtuples'
whiletheir leaves towards the center of the 'memtuples' array.
Or even start building heaps as soon as we get initial elements, instead of qsort the whole 'memtuples' array.
Any PRO/CONTRA compared to 2a)???
Current run numbers
I think I should duplicate the 'int currentRun' variable in the Tuplesortstate struct. I'll replace it with a 'int
currentRunUP'and 'int currentRunDOWN' variables in order to distinguish those two physical runs associated to those 2
heaps.In this case I will give a run number (max{currentRunUP,currentRunDOWN} + 1) to elements not belonging to the
currentlogical run. I suppose no need to touch 'long availMem' nor 'long allowedMem' variables nor any others.
Heap functions
I will duplicate all the heap management functions in order to adapt them to the kind of heap they should be applied to
(forexample, the tuplesort_heap_siftup function should be replaced with tuplesort_heap_siftupUP and
tuplesort_heap_siftupDOWNfunctions).
Merge Plan
This technique would use a sort of "merge plan" to instruct mergesort on how to use those physical runs. Actually
mergesortshould consider at first "odd runs" before "pair runs". That is, for example, mergesort should start merging
runswith run number 1,3,5,7,... and when run number X terminates start considering run number X+1. Obviously that
doesn'tneed any merge plan, but I remember someone else (as Simon Riggs) was interested in sorting improvements so it
couldbe a good thing to know if I should consider any conventions or paramethers in order to possibly create that merge
plan.

DEVELOPMENT CONTEXT
I preferred to use the "last stable release" at the moment, that is 8.2.

Any comment/suggestion/advice ?

Thanks for your attention and for your time.
Regards, Manolo.
_________________________________________________________________
Express yourself instantly with MSN Messenger! Download today it's FREE!
http://messenger.msn.click-url.com/go/onm00200471ave/direct/01/

Re: Index Page Split logging

От

Simon Riggs

Дата:

01 января 2008 г., 19:55:37

On Tue, 2008-01-01 at 14:02 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > If we could log *only* the insert that caused the split, rather than the
> > split itself, we would avoid that situation entirely.
> 
> How are you going to avoid the need to run user-defined functions
> (specifically, the btree comparison functions) during replay?

Seems like a good objection. Just exercising my lateral thought muscles.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com

Re: Index Page Split logging

От

Martijn van Oosterhout

Дата:

01 января 2008 г., 20:21:00

On Tue, Jan 01, 2008 at 08:55:58PM +0000, Simon Riggs wrote:
> On Tue, 2008-01-01 at 14:02 -0500, Tom Lane wrote:
> > Simon Riggs <simon@2ndquadrant.com> writes:
> > > If we could log *only* the insert that caused the split, rather than the
> > > split itself, we would avoid that situation entirely.
> >
> > How are you going to avoid the need to run user-defined functions
> > (specifically, the btree comparison functions) during replay?
>
> Seems like a good objection. Just exercising my lateral thought muscles.

It seems to me you should be able to manage an intermediate version
where (in for example GiST) you store the output of picksplit. i.e. you
describe your split as: items A,B,C went to the left and the rest went
to the right page. This would allow you to reconstruct the split
completely without actually having to write all the data.

The only thing I'm not sure about is whether it's easy to get the
relevent inforation to make this efficient.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Those who make peaceful revolution impossible will make violent revolution inevitable.
>  -- John F Kennedy

Re: Index Page Split logging

От

Simon Riggs

Дата:

01 января 2008 г., 20:30:31

On Tue, 2008-01-01 at 22:20 +0100, Martijn van Oosterhout wrote:
> On Tue, Jan 01, 2008 at 08:55:58PM +0000, Simon Riggs wrote:
> > On Tue, 2008-01-01 at 14:02 -0500, Tom Lane wrote:
> > > Simon Riggs <simon@2ndquadrant.com> writes:
> > > > If we could log *only* the insert that caused the split, rather than the
> > > > split itself, we would avoid that situation entirely.
> > > 
> > > How are you going to avoid the need to run user-defined functions
> > > (specifically, the btree comparison functions) during replay?
> > 
> > Seems like a good objection. Just exercising my lateral thought muscles.
> 
> It seems to me you should be able to manage an intermediate version
> where (in for example GiST) you store the output of picksplit. i.e. you
> describe your split as: items A,B,C went to the left and the rest went
> to the right page. This would allow you to reconstruct the split
> completely without actually having to write all the data.

Yes, but you may have to handle recursive splits in the parent, so the
problem is not fully solved by that manoeuvre.

Perhaps we could do this only in cases where the parent is not split?

Hmm, starting to sound too far fetched for me now.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com

Re: Index Page Split logging

От

"Gokulakannan Somasundaram"

Дата:

02 января 2008 г., 08:24:06

On Jan 2, 2008 2:25 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Tue, 2008-01-01 at 14:02 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > If we could log *only* the insert that caused the split, rather than the
> > split itself, we would avoid that situation entirely.
>
> How are you going to avoid the need to run user-defined functions
> (specifically, the btree comparison functions) during replay?

Seems like a good objection. Just exercising my lateral thought muscles.

Can this be explained in more detail???

Thanks,
Gokul.

Re: Index Page Split logging

От

Martijn van Oosterhout

Дата:

02 января 2008 г., 09:05:33

On Wed, Jan 02, 2008 at 02:49:35PM +0530, Gokulakannan Somasundaram wrote:
> On Jan 2, 2008 2:25 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > On Tue, 2008-01-01 at 14:02 -0500, Tom Lane wrote:
> > > How are you going to avoid the need to run user-defined functions
> > > (specifically, the btree comparison functions) during replay?
> >
> > Seems like a good objection. Just exercising my lateral thought muscles.
>
> Can this be explained in more detail???

If the goal is to only store the insert, then we need to determine
during recovery which page the record needs to be added to. To do this
you need to go through the index, which can only be done by calling
user-defined functions.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Those who make peaceful revolution impossible will make violent revolution inevitable.
>  -- John F Kennedy

Re: Index Page Split logging

От

"Gokulakannan Somasundaram"

Дата:

02 января 2008 г., 09:34:57

On Jan 2, 2008 3:35 PM, Martijn van Oosterhout <kleptog@svana.org> wrote:

If the goal is to only store the insert, then we need to determine
during recovery which page the record needs to be added to. To do this
you need to go through the index, which can only be done by calling
user-defined functions.

Correct me if i am wrong.
User-defined functions act as comparison functions in GIST. But when do they act as comparison functions in b-tree? I am not able to understand exactly that part.

Thanks for the explanation.

--
Thanks,
Gokul.

Re: Index Page Split logging

От

Martijn van Oosterhout

Дата:

02 января 2008 г., 09:40:12

On Wed, Jan 02, 2008 at 04:04:48PM +0530, Gokulakannan Somasundaram wrote:
> On Jan 2, 2008 3:35 PM, Martijn van Oosterhout <kleptog@svana.org> wrote:
> > If the goal is to only store the insert, then we need to determine
> > during recovery which page the record needs to be added to. To do this
> > you need to go through the index, which can only be done by calling
> > user-defined functions.
>
> Correct me if i am wrong.
> User-defined functions act as comparison functions in GIST. But when do they
> act as comparison functions in b-tree? I am not able to understand exactly
> that part.

All indexes are done by user-defined functions, even b-trees. People can
make their own b-tree indexes by defining an operator class. Note that
"user-defined" is this case means anything called via the fmgr
interface.

For reference, hash indexes are defined by user-defined functions as
well... It's all about the intereaction between Index AMs and operator
classes.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Those who make peaceful revolution impossible will make violent revolution inevitable.
>  -- John F Kennedy

Re: Index Page Split logging

От

"Gokulakannan Somasundaram"

Дата:

02 января 2008 г., 10:16:18

All indexes are done by user-defined functions, even b-trees. People can
make their own b-tree indexes by defining an operator class. Note that
"user-defined" is this case means anything called via the fmgr
interface.

Again, i think i have one more wrong understanding. My understanding is,
We are discussing about user-defined functions because, they might be actually be mutable functions, but the user might have classified as immutable. This might cause some problems while replaying the log. In the case of hash-indexes, if the hash-function is mutable, then the user has a corrupted index.
Is there some other problem also, because of user-defined functions, that will stall recovery in the proposed idea?

In our standard b-tree(with no user-defined operator classes), there shouldn't be any problem with replaying right?

Again shouldn't we say b-tree with user-defined op-classes as gist-btree? If not what is the difference between both?

Again thanks for the explanation.

--
Thanks,
Gokul.

Re: Index Page Split logging

От

Martijn van Oosterhout

Дата:

02 января 2008 г., 11:54:34

On Wed, Jan 02, 2008 at 04:46:11PM +0530, Gokulakannan Somasundaram wrote:
> > All indexes are done by user-defined functions, even b-trees. People can
> > make their own b-tree indexes by defining an operator class. Note that
> > "user-defined" is this case means anything called via the fmgr
> > interface.
>
> Again, i think i have one more wrong understanding. My understanding is,
> We are discussing about user-defined functions because, they might be
> actually be mutable functions, but the user might have classified as
> immutable.

This is where it gets a bit beyond by depth, so someone may have to
correct me. The point is that during the recovery the system is not yet
fully running and not everything works yet. For example, what happens
if the system crashed halfway through updating a page in pg_proc and
that page needs to be recovered from WAL. Yet to insert into the index
you need to be able to read pg_am, pg_amproc, pg_proc at least,
probably more.

The point being, you can't rely on anything except WAL during recovery.

> In our standard b-tree(with no user-defined operator classes), there
> shouldn't be any problem with replaying right?

Even inserting into our "standard b-tree" involves looking up the
operator class and associated functions and using the fmgr interface.
There is no difference in the system between the builtin operator
classes and "user defined" ones.

> Again shouldn't we say b-tree with user-defined op-classes as gist-btree? If
> not what is the difference between both?

gist-btree is a nice example but IIRC it takes more diskspace and can't
handle uniqueness.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Those who make peaceful revolution impossible will make violent revolution inevitable.
>  -- John F Kennedy

Re: Index Page Split logging

От

Simon Riggs

Дата:

02 января 2008 г., 12:16:50

On Wed, 2008-01-02 at 13:54 +0100, Martijn van Oosterhout wrote:
> On Wed, Jan 02, 2008 at 04:46:11PM +0530, Gokulakannan Somasundaram wrote:
> > > All indexes are done by user-defined functions, even b-trees. People can
> > > make their own b-tree indexes by defining an operator class. Note that
> > > "user-defined" is this case means anything called via the fmgr
> > > interface.
> > 
> > Again, i think i have one more wrong understanding. My understanding is,
> > We are discussing about user-defined functions because, they might be
> > actually be mutable functions, but the user might have classified as
> > immutable. 
> 
> This is where it gets a bit beyond by depth, so someone may have to
> correct me. The point is that during the recovery the system is not yet
> fully running and not everything works yet. For example, what happens
> if the system crashed halfway through updating a page in pg_proc and
> that page needs to be recovered from WAL. Yet to insert into the index
> you need to be able to read pg_am, pg_amproc, pg_proc at least,
> probably more.

That's right; shame I forgot this before I started the thread...

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com

Re: Index Page Split logging

От

"Gokulakannan Somasundaram"

Дата:

02 января 2008 г., 12:48:54

<br /><br /><div class="gmail_quote">On Jan 2, 2008 6:24 PM, Martijn van Oosterhout <<a
href="mailto:kleptog@svana.org">kleptog@svana.org</a>>wrote:<br /><blockquote class="gmail_quote"
style="border-left:1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div class="Ih2E3d">On
Wed,Jan 02, 2008 at 04:46:11PM +0530, Gokulakannan Somasundaram wrote:<br />> > All indexes are done by
user-definedfunctions, even b-trees. People can<br />> > make their own b-tree indexes by defining an operator
class.Note that <br />> > "user-defined" is this case means anything called via the fmgr<br />> >
interface.<br/>><br />> Again, i think i have one more wrong understanding. My understanding is,<br />> We are
discussingabout user-defined functions because, they might be <br />> actually be mutable functions, but the user
mighthave classified as<br />> immutable.<br /><br /></div>This is where it gets a bit beyond by depth, so someone
mayhave to<br />correct me. The point is that during the recovery the system is not yet <br />fully running and not
everythingworks yet. For example, what happens<br />if the system crashed halfway through updating a page in pg_proc
and<br/>that page needs to be recovered from WAL. Yet to insert into the index<br /> you need to be able to read pg_am,
pg_amproc,pg_proc at least,<br />probably more.<br /><br />The point being, you can't rely on anything except WAL
duringrecovery.<br /><div class="Ih2E3d"><br /></div></blockquote></div>Thanks a lot for the nice explanation.  That
wassomething new for me....:(( <br /><br clear="all" /><br />-- <br />Thanks,<br />Gokul.<br /><br />

Re: Index Page Split logging

От

Martijn van Oosterhout

Дата:

02 января 2008 г., 13:02:15

On Wed, Jan 02, 2008 at 01:17:03PM +0000, Simon Riggs wrote:
> That's right; shame I forgot this before I started the thread...

Actually, I think your idea has merit, it's just not as easy as
originally thought. All splits, including multiple-level splits, can be
described as a sequence of "split page X with items {S} going to the
left" followed by an "insert item into page X". The trick is that for
multi-level splits you have to split from the top down. Then each split
becomes a simple loggable operation.

But, I think in the live system we split from the bottom up (the split
and insert is a single operation), so I don't think you can actually
combine the current split algorithm with the logging operation I
suggest above.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Those who make peaceful revolution impossible will make violent revolution inevitable.
>  -- John F Kennedy

Re: Implementing Sorting Refinements

От

Decibel!

Дата:

07 января 2008 г., 22:34:14

You'll get better response if you don't hijack threads. :) Also,  
there's nothing in here that describes what the benefits of this  
change are.

On Jan 1, 2008, at 2:09 PM, Manolo _ wrote:

>
> Hi to all.
>
> This mail is aimed at asking some suggestion to face PostgreSQL  
> (PG) development to implement a refinement to PG source code. I'll  
> briefly summarize the idea of the "2-Way Replacement  
> Selection" (2WRS) refinement, just in case. If you already remember  
> what 2WRS is, you can please jump directly to the IMPLEMENTATION  
> ISSUES part of this mail.
>
>
> 2WRS.
> Refinement of the Replacement Selection (RS) technique currently  
> used by PG in src/backend/utils/sort/tuplesort.c .
> The 2WRS uses two heaps instead of just one in order to create the  
> current (logical) run. Here there are some fundamental points of  
> the 2WRS technique:
> - 'qsort' the initial unsorted 'memtuples' array
> - divide the 'memtuples' array into two parts and each of those  
> will be managed as a heap
> - one of the heaps will arrange its elements in ascending order,  
> while the other heap in descending order
> - each heap will spill its current root in its corresponding run  
> (i.e.: we have a run per each of those two heaps), so we are  
> actually creating 2 physical current runs
> - those 2 current physical runs could "theoretically" be merged  
> into the same logical run, actually we can make 'mergesort' think  
> they do belong to the same physical run. That reduces the number of  
> comparisons 'mergesort' has to do at each merge step (that means  
> less seek delay time on mass storage). We can also think the  
> average length of logical runs produced by 2WRS will probably be  
> greater than the average length produced by RS (again less seek  
> delay time: the longer each run the less number of final runs  
> produced, that means the less number of comparisons at each  
> 'mergesort' step).
>
>
> IMPLEMENTATION ISSUES.
>     Where to place those heaps?
> 1) I think that both heaps could be arranged on the same  
> 'memtuples' array. That allows easily subsequent resizing those  
> heaps according to their effective use or according to some  
> heuristic, without reallocating memory.
>     
>     How to arrange those heaps?
> 2a) It's convenient to arrange those heaps "root to root". That is  
> arranging those heaps with their roots toward the center of  
> 'memtuples' (in a way we can say they lay "face to face", or "root  
> to root" as said before) while their leaves lay towards the extreme  
> indexes of the 'memtuples' array (that is the last leaf of one heap  
> will lay at index 0, the last leaf of the other heap laying at  
> index memtupsize-1. This arrangement prevents overlapping elements  
> between those physical runs associated to the same current logical  
> run.
> PRO: once we qsort memtuples and divide it into 2 parts we already  
> get those two heaps, no need to build them.
> CONTRA: ???
>
> 2b) As in 2a) but arranging heaps "leaf to leaf", that is their  
> roots will lay at the extreme indexes of 'memtuples' while their  
> leaves towards the center of the 'memtuples' array.
> Or even start building heaps as soon as we get initial elements,  
> instead of qsort the whole 'memtuples' array.
> Any PRO/CONTRA compared to 2a)???
>
>     Current run numbers
> I think I should duplicate the 'int currentRun' variable in the  
> Tuplesortstate struct. I'll replace it with a 'int currentRunUP'  
> and 'int currentRunDOWN' variables in order to distinguish those  
> two physical runs associated to those 2 heaps. In this case I will  
> give a run number (max{currentRunUP,currentRunDOWN} + 1) to  
> elements not belonging to the current logical run. I suppose no  
> need to touch 'long availMem' nor 'long allowedMem' variables nor  
> any others.
>
>     Heap functions
> I will duplicate all the heap management functions in order to  
> adapt them to the kind of heap they should be applied to (for  
> example, the tuplesort_heap_siftup function should be replaced with  
> tuplesort_heap_siftupUP and tuplesort_heap_siftupDOWN functions).
>
>     Merge Plan
> This technique would use a sort of "merge plan"    to instruct  
> mergesort on how to use those physical runs. Actually mergesort  
> should consider at first "odd runs" before "pair runs". That is,  
> for example, mergesort should start merging runs with run number  
> 1,3,5,7,... and when run number X terminates start considering run  
> number X+1. Obviously that doesn't need any merge plan, but I  
> remember someone else (as Simon Riggs) was interested in sorting  
> improvements so it could be a good thing to know if I should  
> consider any conventions or paramethers in order to possibly create  
> that merge plan.
>
>
> DEVELOPMENT CONTEXT
> I preferred to use the "last stable release" at the moment, that is  
> 8.2.
>
>
> Any comment/suggestion/advice ?
>
> Thanks for your attention and for your time.
> Regards, Manolo.
> _________________________________________________________________
> Express yourself instantly with MSN Messenger! Download today it's  
> FREE!
> http://messenger.msn.click-url.com/go/onm00200471ave/direct/01/
> ---------------------------(end of  
> broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>                http://archives.postgresql.org
>

-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Re: Implementing Sorting Refinements

От

Дата:

07 января 2008 г., 23:05:35

Well, sorry for hijacking... ummm how did I do that?

Anyway I'll thank you for giving a "sign of life" when I was almost loosing 
my hopes to get any kind of answer from "-hackers".

I suppose the lack of answers was due to the way I wrote my mail. At that 
moment I supposed that at least someone reminded the 2WRS technique and 
possible benefits described into previous posts.
I think I was wrong, so I'll write it once again hoping meanwhile to get 
some suggestions on: HOWTO write a mail to which "-hackers" will give an 
answer :) hehehehe

Thanks for your attention.
Manolo.

--------------------------------------------------
From: "Decibel!" <decibel@decibel.org>
Sent: Tuesday, January 08, 2008 12:34 AM
To: "Manolo _" <mac_man2005@hotmail.it>
Cc: <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Implementing Sorting Refinements

> You'll get better response if you don't hijack threads. :) Also,  there's 
> nothing in here that describes what the benefits of this  change are.
>
> On Jan 1, 2008, at 2:09 PM, Manolo _ wrote:
>
>>
>> Hi to all.
>>
>> This mail is aimed at asking some suggestion to face PostgreSQL  (PG) 
>> development to implement a refinement to PG source code. I'll  briefly 
>> summarize the idea of the "2-Way Replacement  Selection" (2WRS) 
>> refinement, just in case. If you already remember  what 2WRS is, you can 
>> please jump directly to the IMPLEMENTATION  ISSUES part of this mail.
>>
>>
>> 2WRS.
>> Refinement of the Replacement Selection (RS) technique currently  used by 
>> PG in src/backend/utils/sort/tuplesort.c .
>> The 2WRS uses two heaps instead of just one in order to create the 
>> current (logical) run. Here there are some fundamental points of  the 
>> 2WRS technique:
>> - 'qsort' the initial unsorted 'memtuples' array
>> - divide the 'memtuples' array into two parts and each of those  will be 
>> managed as a heap
>> - one of the heaps will arrange its elements in ascending order,  while 
>> the other heap in descending order
>> - each heap will spill its current root in its corresponding run  (i.e.: 
>> we have a run per each of those two heaps), so we are  actually creating 
>> 2 physical current runs
>> - those 2 current physical runs could "theoretically" be merged  into the 
>> same logical run, actually we can make 'mergesort' think  they do belong 
>> to the same physical run. That reduces the number of  comparisons 
>> 'mergesort' has to do at each merge step (that means  less seek delay 
>> time on mass storage). We can also think the  average length of logical 
>> runs produced by 2WRS will probably be  greater than the average length 
>> produced by RS (again less seek  delay time: the longer each run the less 
>> number of final runs  produced, that means the less number of comparisons 
>> at each  'mergesort' step).
>>
>>
>> IMPLEMENTATION ISSUES.
>> Where to place those heaps?
>> 1) I think that both heaps could be arranged on the same  'memtuples' 
>> array. That allows easily subsequent resizing those  heaps according to 
>> their effective use or according to some  heuristic, without reallocating 
>> memory.
>>
>> How to arrange those heaps?
>> 2a) It's convenient to arrange those heaps "root to root". That is 
>> arranging those heaps with their roots toward the center of  'memtuples' 
>> (in a way we can say they lay "face to face", or "root  to root" as said 
>> before) while their leaves lay towards the extreme  indexes of the 
>> 'memtuples' array (that is the last leaf of one heap  will lay at index 
>> 0, the last leaf of the other heap laying at  index memtupsize-1. This 
>> arrangement prevents overlapping elements  between those physical runs 
>> associated to the same current logical  run.
>> PRO: once we qsort memtuples and divide it into 2 parts we already  get 
>> those two heaps, no need to build them.
>> CONTRA: ???
>>
>> 2b) As in 2a) but arranging heaps "leaf to leaf", that is their  roots 
>> will lay at the extreme indexes of 'memtuples' while their  leaves 
>> towards the center of the 'memtuples' array.
>> Or even start building heaps as soon as we get initial elements,  instead 
>> of qsort the whole 'memtuples' array.
>> Any PRO/CONTRA compared to 2a)???
>>
>> Current run numbers
>> I think I should duplicate the 'int currentRun' variable in the 
>> Tuplesortstate struct. I'll replace it with a 'int currentRunUP'  and 
>> 'int currentRunDOWN' variables in order to distinguish those  two 
>> physical runs associated to those 2 heaps. In this case I will  give a 
>> run number (max{currentRunUP,currentRunDOWN} + 1) to  elements not 
>> belonging to the current logical run. I suppose no  need to touch 'long 
>> availMem' nor 'long allowedMem' variables nor  any others.
>>
>> Heap functions
>> I will duplicate all the heap management functions in order to  adapt 
>> them to the kind of heap they should be applied to (for  example, the 
>> tuplesort_heap_siftup function should be replaced with 
>> tuplesort_heap_siftupUP and tuplesort_heap_siftupDOWN functions).
>>
>> Merge Plan
>> This technique would use a sort of "merge plan" to instruct  mergesort on 
>> how to use those physical runs. Actually mergesort  should consider at 
>> first "odd runs" before "pair runs". That is,  for example, mergesort 
>> should start merging runs with run number  1,3,5,7,... and when run 
>> number X terminates start considering run  number X+1. Obviously that 
>> doesn't need any merge plan, but I  remember someone else (as Simon 
>> Riggs) was interested in sorting  improvements so it could be a good 
>> thing to know if I should  consider any conventions or paramethers in 
>> order to possibly create  that merge plan.
>>
>>
>> DEVELOPMENT CONTEXT
>> I preferred to use the "last stable release" at the moment, that is  8.2.
>>
>>
>> Any comment/suggestion/advice ?
>>
>> Thanks for your attention and for your time.
>> Regards, Manolo.
>> _________________________________________________________________
>> Express yourself instantly with MSN Messenger! Download today it's  FREE!
>> http://messenger.msn.click-url.com/go/onm00200471ave/direct/01/
>> ---------------------------(end of  broadcast)---------------------------
>> TIP 4: Have you searched our list archives?
>>
>>                http://archives.postgresql.org
>>
>
> -- 
> Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
> Give your computer some brain candy! www.distributed.net Team #1828
>
>
>

Re: Implementing Sorting Refinements

От

"Guillaume Smet"

Дата:

07 января 2008 г., 23:48:38

On Jan 8, 2008 1:04 AM,  <mac_man2005@hotmail.it> wrote:
> Well, sorry for hijacking... ummm how did I do that?
>
> Anyway I'll thank you for giving a "sign of life" when I was almost loosing
> my hopes to get any kind of answer from "-hackers".

Don't forget that we're just a few days/weeks of 8.3 release so the
attention is a bit focused on it at the moment (and I'm not speaking
of the security releases of today).

Don't feel disappointed about the lack of attention you're suffering
at the moment, just post your proposal again after 8.3 release,
explain what you plan to do and why and perhaps you'll have the time
to write a mock-up and get some numbers to prove your point before
that. That could help too.

Keep up the good work.

Regards,

--
Guillaume

Re: Implementing Sorting Refinements

От

Tomasz Ostrowski

Дата:

08 января 2008 г., 06:06:50

On Tue, 08 Jan 2008, mac_man2005@hotmail.it wrote:

> Well, sorry for hijacking... ummm how did I do that?

You replied to a post instead of creating a new, unrelated e-mail. It
is different.

Just try to use threaded mode of your e-mail client and you'll get
the idea.

Regards
Tometzky
-- 
...although Eating Honey was a very good thing to do, there was a
moment just before you began to eat it which was better than when you
were...                                                     Winnie the Pooh

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Index Page Split logging