Discussion: polyphase merge?


polyphase merge?

From:
Don Marvick
Date:
Dear All,

I apologize if this has been discussed before. I have tried to search the mailing list and could not find an exact answer.

Currently, Postgres uses Knuth's Algorithm 5.4.2D to perform external merge sort. IMHO the algorithm is designed for tapes.

Why isn't the simple textbook merge pattern described in Knuth page 365 (5.4.9) used for a disk-based system? The same merge pattern is also described in Ramakrishnan's textbook, and it is optimal if seek time is not counted, which is of course not the real-world case.

Best regards,

Don

Re: polyphase merge?

From:
Don Marvick
Date:
Dear All,

Since nobody replied, I will give it a try. I am going to implement the merge pattern described in Knuth page 365 (5.4.9); essentially it is as follows:
- create initial runs using replacement selection (basically as in the current implementation)
- add enough dummy runs of size 0 so that the number of sorted runs minus one is divisible by k-1 (k is the merge fan-in)
- repeatedly merge the k smallest runs until only one run is left
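If it helps to make the proposal concrete, the pattern can be modeled in a few lines. This is an illustrative sketch only, not Postgres code; `optimal_merge_cost` is a made-up name, and run sizes stand in for run lengths in blocks:

```python
import heapq

def optimal_merge_cost(run_sizes, k):
    # Pad with dummy runs of size 0 so that (n - 1) is divisible by
    # (k - 1); then every merge step has exactly k inputs and the
    # process ends with exactly one run.
    heap = list(run_sizes)
    while (len(heap) - 1) % (k - 1) != 0:
        heap.append(0)
    heapq.heapify(heap)
    cost = 0
    while len(heap) > 1:
        # Always merge the k smallest remaining runs -- the Huffman-like
        # greedy choice that minimizes total merge volume.
        merged = sum(heapq.heappop(heap) for _ in range(k))
        cost += merged
        heapq.heappush(heap, merged)
    return cost
```

For example, with runs of sizes 1, 1, 1, 1 and fan-in 3, one dummy run is added, the three smallest (0, 1, 1) are merged first, and the total volume moved is 6.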

I am new to PostgreSQL, hence any advice would be appreciated.

Can anybody give me some advice on how it can be done?
1. How should a run be represented? Should I reuse the tape mechanism (e.g. one tape per run), or should I use a temp buffer?

2. How should memory be allocated? I assume I divide the memory equally among the k runs, so each run gets M/k memory and each read of a run would be M/k bytes.

3. Prefetching. We can precompute the read sequence of run blocks for the entire merge process; based on this, we know which blocks to prefetch at any point in time. Is it also possible to perform read/write I/O while the merge is being performed, so that I/O and computation overlap?

4. Any other issues that need consideration?
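On point 3, the prefetch order can be derived with Knuth's "forecasting" idea (5.4.6): during a k-way merge, the input buffer that empties first is the one whose buffered data ends with the smallest key. A hypothetical sketch (the function name and its argument shape are made up for illustration):

```python
def next_run_to_prefetch(last_buffered_keys):
    # last_buffered_keys[i] is the largest key currently buffered for
    # run i.  The run whose buffer ends with the smallest key will run
    # dry first, so its next block should be fetched now.
    return min(range(len(last_buffered_keys)),
               key=lambda i: last_buffered_keys[i])
```

With buffers ending at keys 10, 3, and 7, run 1 is fetched next.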

Best regards,

Don

On Thu, Jan 29, 2009 at 4:11 PM, Don Marvick <donmarvick@gmail.com> wrote:
Dear All,

I apologize if this has been discussed before. I have tried to search the mailing list and could not find an exact answer.

Currently, Postgres uses Knuth's Algorithm 5.4.2D to perform external merge sort. IMHO the algorithm is designed for tapes.

Why isn't the simple textbook merge pattern described in Knuth page 365 (5.4.9) used for a disk-based system? The same merge pattern is also described in Ramakrishnan's textbook, and it is optimal if seek time is not counted, which is of course not the real-world case.

Best regards,

Don


Re: polyphase merge?

From:
Simon Riggs
Date:
On Wed, 2009-02-04 at 14:42 +0800, Don Marvick wrote:

> 4. Any other issues that need consideration?

Most attempts to improve sorting further have fallen to nothing because
of the lack of repeatable test results. In particular, coming up with
test cases *after* writing the code is a good way to get lost in
discussions and have your work passed over.

The best starting point is to collect a number of test cases, both
typical and extreme cases. Do some research to show that "typical" cases
really are that and be prepared for some expressions of reasonable
doubt. Publish that data so test results can be objectively verified and
then measure those cases on existing code, with various settings of
tunable parameters.

Then it will be a simple matter to prove your changes are effective in
target cases without damaging other cases. We would also want the
changes to work automatically without additional tunable parameters.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: polyphase merge?

From:
Greg Stark
Date:
Is this basically the same as our current algorithm but without
multiplexing the tapes onto single files? I have been wondering
whether we multiplex the tapes any better than filesystems can lay out
separate files actually.

Or is there something else that's different here?

-- 
greg


Re: polyphase merge?

From:
Simon Riggs
Date:
On Wed, 2009-02-04 at 10:55 +0000, Greg Stark wrote:
> Is this basically the same as our current algorithm but without
> multiplexing the tapes onto single files? I have been wondering
> whether we multiplex the tapes any better than filesystems can lay out
> separate files actually.

I don't think you'll be able to do that more efficiently than we already
do. Read the top of tuplesort.c

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: polyphase merge?

From:
Tom Lane
Date:
Greg Stark <stark@enterprisedb.com> writes:
> Is this basically the same as our current algorithm but without
> multiplexing the tapes onto single files? I have been wondering
> whether we multiplex the tapes any better than filesystems can lay out
> separate files actually.

The reason for the multiplexing is so that space can get re-used
quickly.  If each tape were represented as a separate file, there would
be no way to release blocks as they're read; you could only give back
the whole file after reaching end of tape.  Which would at least double
the amount of disk space needed to sort X amount of data.  (It's
actually even worse, more like 4X, though the multiplier might depend on
the number of "tapes" --- I don't recall the details anymore.)

The penalty we pay is that in the later merge passes, the blocks
representing a single tape aren't very well ordered.
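The space-reuse behavior Tom describes can be modeled with a toy allocator. This is illustrative only; the names are not the actual logtape.c API. The point is that a block released by a reader is immediately reusable by a writer in the shared file, which per-tape files cannot do:

```python
class LogTapeSpace:
    """Toy model of shared-file space reuse among 'tapes'."""

    def __init__(self):
        self.next_block = 0   # current physical file size, in blocks
        self.freelist = []    # blocks released by readers

    def write_block(self):
        if self.freelist:
            return self.freelist.pop()  # reuse a freed block
        blk = self.next_block
        self.next_block += 1            # otherwise extend the file
        return blk

    def read_block(self, blk):
        self.freelist.append(blk)       # released as soon as it is read
```

Writing four blocks, reading them all back, and writing four more never grows the file past four blocks; with one file per tape, the same sequence would peak at eight.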

It might be interesting to think about some compromise that wastes a
little more space in order to get better sequentiality of disk access.
It'd be easy to do if we were willing to accept a 2X space penalty,
but I'm not sure if that would fly or not.  It definitely *wasn't*
acceptable to the community a few years ago when the current code was
written.  Disks have gotten bigger since then, but so have the problems
people want to solve.
        regards, tom lane


Re: polyphase merge?

From:
Tom Lane
Date:
Don Marvick <donmarvick@gmail.com> writes:
> Since nobody replied, I will give it a try. I am going to implement the
> merge pattern described in Knuth page 365 (5.4.9); essentially it is as
> follows:
> - create initial runs using replacement selection (basically as in
> the current implementation)
> - add enough dummy runs of size 0 so that the number of sorted runs minus
> one is divisible by k-1 (k is the merge fan-in)
> - repeatedly merge the k smallest runs until only one run is left

How is this better than, or even different from, what is there now?
        regards, tom lane


Re: polyphase merge?

From:
Gregory Stark
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:

> On Wed, 2009-02-04 at 10:55 +0000, Greg Stark wrote:
>> Is this basically the same as our current algorithm but without
>> multiplexing the tapes onto single files? I have been wondering
>> whether we multiplex the tapes any better than filesystems can lay out
>> separate files actually.
>
> I don't think you'll be able to do that more efficiently than we already
> do. Read the top of tuplesort.c

Huh?

The question I posed was whether we do it any better than filesystems do, not
whether we could do a better job. If filesystems can do as good a job then we
might as well create separate files for each tape and let the filesystem
decide where to allocate space. That would mean we could massively simplify
logtape.c and just create a separate file for every tape.

Possible reasons that filesystems could do better than us are that they have
access to more information about actual storage layout on disk, that they have
more information about hardware characteristics, and that it would give the
filesystem cache a better idea of the access pattern which logtape.c might be
confusing the picture of currently.

On the other hand possible reasons why filesystems would suck at this include
creating and deleting new files being a slow or locking operation on many
filesystems, and dealing with directories of large numbers of files being
poorly optimized on others.

-- 
 Gregory Stark
 EnterpriseDB          http://www.enterprisedb.com
 Ask me about EnterpriseDB's RemoteDBA services!


Re: polyphase merge?

From:
Don Marvick
Date:
Good point. I will make a note of this issue. Thanks for the advice :).

Regards,

Don

On Wed, Feb 4, 2009 at 6:09 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Wed, 2009-02-04 at 14:42 +0800, Don Marvick wrote:

> 4. Any other issues that need consideration?

Most attempts to improve sorting further have fallen to nothing because
of the lack of repeatable test results. In particular, coming up with
test cases *after* writing the code is a good way to get lost in
discussions and have your work passed over.

The best starting point is to collect a number of test cases, both
typical and extreme cases. Do some research to show that "typical" cases
really are that and be prepared for some expressions of reasonable
doubt. Publish that data so test results can be objectively verified and
then measure those cases on existing code, with various settings of
tunable parameters.

Then it will be a simple matter to prove your changes are effective in
target cases without damaging other cases. We would also want the
changes to work automatically without additional tunable parameters.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: polyphase merge?

From:
Don Marvick
Date:
This is what I understand from the existing implementation:
- fix the merge fan-in F; the number of tapes is F+1
- for each sorted run generated, a formula determines which tape it belongs to
- the sorted run is appended to the end of that tape
- once all sorted runs have been distributed, there is always one empty tape, called the output tape
- add dummy runs to each tape if necessary
- merge the input tapes, writing the result to the output tape, until one of the input tapes is empty
- repeat this process until one sorted run remains
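The distribution step above is driven by the "perfect" Fibonacci-like distributions of Knuth 5.4.2. A sketch of computing the target distribution and the dummy runs needed to reach it (illustrative Python, not the tuplesort.c code; `polyphase_targets` is a made-up name):

```python
def polyphase_targets(num_runs, num_tapes):
    # Find the smallest "perfect" polyphase distribution (generalized
    # Fibonacci numbers) that holds at least num_runs runs on the
    # num_tapes - 1 input tapes; the shortfall becomes dummy runs.
    p = num_tapes - 1                    # input tapes; one tape is output
    dist = [1] + [0] * (p - 1)           # level-0 perfect distribution
    while sum(dist) < num_runs:
        a1 = dist[0]                     # next level: a_k' = a_1 + a_{k+1}
        dist = [a1 + dist[k + 1] for k in range(p - 1)] + [a1]
    return dist, sum(dist) - num_runs
```

With 3 tapes these are the Fibonacci numbers: 5 runs fit a (3, 2) distribution exactly, while 4 runs need one dummy to reach it.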


I think the main difference is that at each step, the current implementation does not merge the k smallest runs; it merges from the beginning of each tape. For a single pass this makes no difference, but from two passes onward there are differences. In some complex queries, where memory is consumed by other operators, more passes are needed and the differences become larger. This is the only improvement that can be achieved. Let me know if this does not make sense.

Regarding disk layout, I am open to suggestion whether we should reuse the tape module or not.



On Wed, Feb 4, 2009 at 11:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Don Marvick <donmarvick@gmail.com> writes:
> Since nobody replied, I will give it a try. I am going to implement the
> merge pattern described in Knuth page 365 (5.4.9); essentially it is as
> follows:
> - create initial runs using replacement selection (basically as in
> the current implementation)
> - add enough dummy runs of size 0 so that the number of sorted runs minus
> one is divisible by k-1 (k is the merge fan-in)
> - repeatedly merge the k smallest runs until only one run is left

How is this better than, or even different from, what is there now?

                       regards, tom lane

Re: polyphase merge?

From:
Don Marvick
Date:
Hmm, it's probably worth trying this. Has anybody tried it before? If not, I am interested in giving it a try.

Regards,

Don

On Wed, Feb 4, 2009 at 11:25 PM, Gregory Stark <stark@enterprisedb.com> wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:

> On Wed, 2009-02-04 at 10:55 +0000, Greg Stark wrote:
>> Is this basically the same as our current algorithm but without
>> multiplexing the tapes onto single files? I have been wondering
>> whether we multiplex the tapes any better than filesystems can lay out
>> separate files actually.
>
> I don't think you'll be able to do that more efficiently than we already
> do. Read the top of tuplesort.c

Huh?

The question I posed was whether we do it any better than filesystems do, not
whether we could do a better job. If filesystems can do as good a job then we
might as well create separate files for each tape and let the filesystem
decide where to allocate space. That would mean we could massively simplify
logtape.c and just create a separate file for every tape.

Possible reasons that filesystems could do better than us are that they have
access to more information about actual storage layout on disk, that they have
more information about hardware characteristics, and that it would give the
filesystem cache a better idea of the access pattern which logtape.c might be
confusing the picture of currently.

On the other hand possible reasons why filesystems would suck at this include
creating and deleting new files being a slow or locking operation on many
filesystems, and dealing with directories of large numbers of files being
poorly optimized on others.

--
 Gregory Stark
 EnterpriseDB          http://www.enterprisedb.com
 Ask me about EnterpriseDB's RemoteDBA services!