Discussion: polyphase merge?
Dear All,

I apologize if this has been discussed before. I have tried to search the mailing list and could not find an exact answer.

Currently, Postgres uses Knuth's Algorithm 5.4.2D to perform external merge sort. IMHO the algorithm is designed for tapes.

Why isn't the simple textbook merge pattern described in Knuth Page 365 (5.4.9) used for disk-based systems? The same merge pattern is also described in Ramakrishnan's textbook, and it is optimal if seek time is not counted, which of course is not the real-world case.

Best regards,

Don
Dear All,
Since nobody replied, I will give it a try. I am going to implement the merge pattern described in Knuth Page 365 (5.4.9); essentially it is as follows:
- create initial runs using replacement selection (basically as in the current implementation)
- add enough dummy runs of size 0 so that the number of sorted runs minus one is divisible by k-1 (where k is the merge fan-in)
- repeatedly merge the k smallest runs until only 1 run is left
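For illustration, the cost of this merge pattern can be sketched with a toy model (my own throwaway Python, not proposed patch code; `merge_cost` is an invented name): pad with dummy runs, then repeatedly pop the k smallest runs from a min-heap, merge them, and push the result back.

```python
import heapq

def merge_cost(run_sizes, k):
    """Total bytes read+written over all merge steps of a k-way merge
    that always merges the k smallest runs (Knuth 5.4.9 pattern)."""
    runs = list(run_sizes)
    # Pad with zero-size dummy runs so (n - 1) is divisible by (k - 1);
    # then every k-way merge reduces the run count by exactly k - 1,
    # leaving a single run at the end.
    while (len(runs) - 1) % (k - 1) != 0:
        runs.append(0)
    heapq.heapify(runs)
    total = 0
    while len(runs) > 1:
        # Merge the k smallest runs; the merged run costs its size in I/O.
        merged = sum(heapq.heappop(runs) for _ in range(k))
        total += merged
        heapq.heappush(runs, merged)
    return total
```

This is the same greedy idea as Huffman coding: merging small runs first keeps the total volume of data re-read across passes as low as possible (ignoring seek costs, as noted above).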
I am new to PostgreSQL, so any advice on how this can be done would be appreciated:
1. How should a run be represented? Should I reuse the tape mechanism, e.g. one tape per run, or should I use a temp buffer?
2. How should memory be allocated? I assume I divide the memory equally among the k runs, hence each run gets M/k memory, and each read of a run would be of size M/k bytes.
3. Prefetching. We can precompute the read sequence of run blocks for the entire merge process. Based on this, we know which block of which run to prefetch at any point in time.
4. Is it possible to perform read/write I/O while the merge is being performed, so that we overlap I/O and computation?
5. Any other issues that need consideration?
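On the prefetching question, one classical answer is Knuth's "forecasting" idea, sketched here as a hypothetical helper (my own naming, not existing PostgreSQL code): during a k-way merge, the in-memory block that will be exhausted first is the one whose last (highest) key is smallest, so that run's next block is the one to fetch ahead of time.

```python
def next_run_to_prefetch(last_keys):
    """Given the last key of each run's currently loaded block,
    return the index of the run whose next block should be
    prefetched (the one whose current block runs out first)."""
    return min(range(len(last_keys)), key=lambda i: last_keys[i])
```

Applying this repeatedly as blocks are consumed yields the precomputable read sequence mentioned in question 3.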
Best regards,
Don
On Thu, Jan 29, 2009 at 4:11 PM, Don Marvick <donmarvick@gmail.com> wrote:
Dear All,
I apologize if this has been discussed before. I have tried to search to the mailing list and could not find an exact answer.
Currently, Postgres uses Knuth's Algorithm 5.4.2D to perform external merge sort. IMHO the algorithm is designed for tapes.
Why isn't the simple textbook merge pattern described in Knuth Page 365 (5.4.9) used for disk-based systems? The same merge pattern is also described in Ramakrishnan's textbook, and it is optimal if seek time is not counted, which of course is not the real-world case.
Best regards,
Don
On Wed, 2009-02-04 at 14:42 +0800, Don Marvick wrote:
> 4. any other issue needs consideration?

Most attempts to improve sorting further have fallen to nothing because of the lack of repeatable test results. In particular, coming up with test cases *after* writing the code is a good way to get lost in discussions and have your work passed over.

The best starting point is to collect a number of test cases, both typical and extreme cases. Do some research to show that "typical" cases really are that and be prepared for some expressions of reasonable doubt. Publish that data so test results can be objectively verified and then measure those cases on existing code, with various settings of tunable parameters.

Then it will be a simple matter to prove your changes are effective in target cases without damaging other cases. We would also want the changes to work automatically without additional tunable parameters.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Is this basically the same as our current algorithm but without multiplexing the tapes onto single files? I have been wondering whether we actually multiplex the tapes any better than filesystems can lay out separate files. Or is there something else that's different here?

--
greg
On Wed, 2009-02-04 at 10:55 +0000, Greg Stark wrote:
> Is this basically the same as our current algorithm but without
> multiplexing the tapes onto single files? I have been wondering
> whether we multiplex the tapes any better than filesystems can lay out
> separate files actually.

I don't think you'll be able to do that more efficiently than we already do. Read the top of tuplesort.c

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Greg Stark <stark@enterprisedb.com> writes:
> Is this basically the same as our current algorithm but without
> multiplexing the tapes onto single files? I have been wondering
> whether we multiplex the tapes any better than filesystems can lay out
> separate files actually.
The reason for the multiplexing is so that space can get re-used
quickly. If each tape were represented as a separate file, there would
be no way to release blocks as they're read; you could only give back
the whole file after reaching end of tape. Which would at least double
the amount of disk space needed to sort X amount of data. (It's
actually even worse, more like 4X, though the multiplier might depend on
the number of "tapes" --- I don't recall the details anymore.)
The penalty we pay is that in the later merge passes, the blocks
representing a single tape aren't very well ordered.
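[To make the space-recycling point concrete, here is a highly simplified model, not the actual logtape.c code; the class and method names are invented for illustration. Blocks freed by reading one tape go onto a free list and are immediately handed out for writes to another tape within the same shared file, keeping the file near the size of the data.]

```python
class SharedTapeFile:
    """Toy model of multiplexed tapes sharing one file: freed blocks
    are recycled before the file is extended."""

    def __init__(self):
        self.next_block = 0    # high-water mark of the shared file
        self.free_blocks = []  # blocks released as tapes are read

    def alloc(self):
        # Prefer a recycled block; only grow the file when none is free.
        if self.free_blocks:
            return self.free_blocks.pop()
        block = self.next_block
        self.next_block += 1
        return block

    def release(self, block):
        # A reader has consumed this block; make it reusable at once.
        self.free_blocks.append(block)
```

With one file per tape there is no `release` until the whole file is dropped, which is exactly why the disk footprint multiplies.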
It might be interesting to think about some compromise that wastes a
little more space in order to get better sequentiality of disk access.
It'd be easy to do if we were willing to accept a 2X space penalty,
but I'm not sure if that would fly or not. It definitely *wasn't*
acceptable to the community a few years ago when the current code was
written. Disks have gotten bigger since then, but so have the problems
people want to solve.
regards, tom lane
Don Marvick <donmarvick@gmail.com> writes:
> Since nobody replied, I would give it a try. I am going to implement the
> merge pattern described in Knuth Page 365 (5.4.9), essentially it is as
> follow:
> - create initial runs using replacement selection (basically this is as in
> the current implementation)
> - add enough dummy runs of size 0 so that the number of sorted run minus one
> can be divided by k-1 (k is merge fan in)
> - repetitively merge k smallest runs until only 1 run left
How is this better than, or even different from, what is there now?
regards, tom lane
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Wed, 2009-02-04 at 10:55 +0000, Greg Stark wrote:
>> Is this basically the same as our current algorithm but without
>> multiplexing the tapes onto single files? I have been wondering
>> whether we multiplex the tapes any better than filesystems can lay out
>> separate files actually.
>
> I don't think you'll be able to do that more efficiently than we already
> do. Read the top of tuplesort.c

Huh? The question I posed was whether we do it any better than filesystems do, not whether we could do a better job. If filesystems can do as good a job then we might as well create separate files for each tape and let the filesystem decide where to allocate space. That would mean we could massively simplify logtape.c and just create a separate file for every tape.

Possible reasons that filesystems could do better than us are that they have access to more information about actual storage layout on disk, that they have more information about hardware characteristics, and that it would give the filesystem cache a better idea of the access pattern, which logtape.c might currently be obscuring.

On the other hand, possible reasons why filesystems would suck at this include creating and deleting new files being a slow or locking operation on many filesystems, and dealing with directories of large numbers of files being poorly optimized on others.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!
Good point. I will note this issue. Thanks for the advice :).
Regards,
Don
On Wed, Feb 4, 2009 at 6:09 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Wed, 2009-02-04 at 14:42 +0800, Don Marvick wrote:
> 4. any other issue needs consideration?

Most attempts to improve sorting further have fallen to nothing because
of the lack of repeatable test results. In particular, coming up with
test cases *after* writing the code is a good way to get lost in
discussions and have your work passed over.
The best starting point is to collect a number of test cases, both
typical and extreme cases. Do some research to show that "typical" cases
really are that and be prepared for some expressions of reasonable
doubt. Publish that data so test results can be objectively verified and
then measure those cases on existing code, with various settings of
tunable parameters.
Then it will be a simple matter to prove your changes are effective in
target cases without damaging other cases. We would also want the
changes to work automatically without additional tunable parameters.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
This is what I understand from existing implementation:
- fix the merge fan-in F; the number of tapes is F+1
- for each sorted run generated, a formula determines which tape it belongs to
- the sorted run is appended to the end of that tape
- once all sorted runs have been added to the tapes, there is always one empty tape, called the output tape
- add dummy runs to each tape if necessary
- merge the input tapes, writing the result to the output tape, until one of the input tapes is empty
- repeat this process until 1 sorted run remains
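For what it's worth, the "formula" above can be illustrated with a small sketch of the perfect (generalized-Fibonacci) distributions from Knuth 5.4.2; this is illustrative Python of my own, not PostgreSQL code:

```python
def polyphase_distribution(n_runs, n_tapes):
    """Return the smallest 'perfect' polyphase distribution holding at
    least n_runs, plus the number of dummy runs needed to pad up to it."""
    t = n_tapes - 1              # input tapes; one tape stays empty for output
    counts = [1] + [0] * (t - 1)  # level-1 distribution: a single run
    while sum(counts) < n_runs:
        # Next level (Knuth 5.4.2): largest count is added to each of the
        # others, and itself becomes the new smallest target.
        m = counts[0]
        counts = [m + c for c in counts[1:]] + [m]
    dummies = sum(counts) - n_runs
    return counts, dummies
```

For 4 tapes the levels go (1,0,0), (1,1,1), (2,2,1), (4,3,2), ...: so 7 real runs are padded with 2 dummies to reach the perfect total of 9.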
I think the main difference is that at each step, the current implementation does not merge the smallest k runs; it merges from the beginning of each tape. For 1 pass this does not make any difference, but from 2 passes onwards there are some differences. In some complex queries, where memory is eaten by other operators, more passes are needed and the differences become larger. This is the only improvement that can be achieved. Let me know if this does not make sense.
Regarding disk layout, I am open to suggestions on whether we should reuse the tape module or not.
On Wed, Feb 4, 2009 at 11:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Don Marvick <donmarvick@gmail.com> writes:
> Since nobody replied, I would give it a try. I am going to implement the
> merge pattern described in Knuth Page 365 (5.4.9), essentially it is as
> follow:
> - create initial runs using replacement selection (basically this is as in
> the current implementation)
> - add enough dummy runs of size 0 so that the number of sorted run minus one
> can be divided by k-1 (k is merge fan in)
> - repetitively merge k smallest runs until only 1 run left

How is this better than, or even different from, what is there now?
regards, tom lane
Hmm, probably it's worth trying this. Has anybody tried it before? If not, I am interested in giving it a try.
Regards,
Don
On Wed, Feb 4, 2009 at 11:25 PM, Gregory Stark <stark@enterprisedb.com> wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Wed, 2009-02-04 at 10:55 +0000, Greg Stark wrote:
>> Is this basically the same as our current algorithm but without
>> multiplexing the tapes onto single files? I have been wondering
>> whether we multiplex the tapes any better than filesystems can lay out
>> separate files actually.
>
> I don't think you'll be able to do that more efficiently than we already
> do. Read the top of tuplesort.c

Huh? The question I posed was whether we do it any better than filesystems do, not
whether we could do a better job. If filesystems can do as good a job then we
might as well create separate files for each tape and let the filesystem
decide where to allocate space. That would mean we could massively simplify
logtape.c and just create a separate file for every tape.
Possible reasons that filesystems could do better than us are that they have
access to more information about actual storage layout on disk, that they have
more information about hardware characteristics, and that it would give the
filesystem cache a better idea of the access pattern which logtape.c might be
confusing the picture of currently.
On the other hand possible reasons why filesystems would suck at this include
creating and deleting new files being a slow or locking operation on many
filesystems, and dealing with directories of large numbers of files being
poorly optimized on others.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!