Discussion: [HACKERS] \dt and disk access
Can someone suggest why running the \dt command in psql causes so many
disk accesses?

I am seeing some queries causing a lot of disk access that doesn't make
sense to me.

I am running my server with fsync off.

- --
Bruce Momjian
maillist@candle.pha.pa.us

------------------------------
>
> Can someone suggest why running the \dt command in psql causes so many
> disk accesses?
>
> I am seeing some queries causing a lot of disk access that doesn't make
> sense to me.
>
> I am running my server with fsync off.
I now realize the unusual disk activity is caused by any query that uses
an ORDER BY. A query that returns three rows takes MUCH longer with the
ORDER BY than without it, 0.20 seconds vs. 0.61 seconds, three times
longer.
The query is:
select * from matter
AND
select * from matter order by matter
The table has three short text columns.
Would anyone like to comment on why ORDER BY is doing this? It
certainly slows down the query. I am running BSD/OS.
This is of course using PostgreSQL 6.1.
- --
Bruce Momjian
maillist@candle.pha.pa.us
------------------------------
Well....Is the sorting algorithm using temp files to do the sorting? If
not, it would be interesting to see the difference in Purify output when
running an identical query with sorting and without...
I'll probably run it on Saturday that way just to (hopefully) eliminate
leaks and such from possible causes...
=+=------------------------/\---------------------------------=+=
Igor Natanzon |**| E-mail: igor@sba.miami.edu
=+=------------------------\/---------------------------------=+=
On Sat, 21 Jun 1997, Bruce Momjian wrote:
> >
> > Can someone suggest why running the \dt command in psql causes so many
> > disk accesses?
> >
> > I am seeing some queries causing a lot of disk access that doesn't make
> > sense to me.
> >
> > I am running my server with fsync off.
>
> I now realize the unusual disk activity is caused by any query that uses
> an ORDER BY. A query that returns three rows takes MUCH longer with the
> ORDER BY than without it, 0.20 seconds vs. 0.61 seconds, three times
> longer.
>
> The query is:
>
> select * from matter
>
> AND
>
> select * from matter order by matter
>
> The table has three short text columns.
>
>
> Would anyone like to comment on why ORDER BY is doing this? It
> certainly slows down the query. I am running BSD/OS.
>
> This is of course using PostgreSQL 6.1.
>
> --
> Bruce Momjian
> maillist@candle.pha.pa.us
>
------------------------------
Under BSD/OS 2.0 and up (don't know about 1.x), it should be possible to
start up the postmaster thus:

	ktrace -i postmaster >& Log

This will trace all calls to the kernel (and generate a good deal of
output, as well as slow things down a lot).  Run your psql command, then
kill -15 the postmaster.  In the same directory where you started the
postmaster, you can run `kdump | less`.  Down toward the bottom of the
file, you should be able to track down your postgres process and what it
was doing.

If you find it doing a lot of activity with /tmp, you might try mounting
/tmp as a memory file system.

I realize this isn't really an answer to the question "why is there disk
activity", but it should at least show "what postgres is doing".

> Well....Is the sorting algorithm using temp files to do the sorting? If
> not, it would be interesting to see the difference in Purify output when
> running an identical query with sorting and without...
> I'll probably run it on Saturday that way just to (hopefully) eliminate
> leaks and such from possible causes...
>
> =+=------------------------/\---------------------------------=+=
>  Igor Natanzon   |**|   E-mail: igor@sba.miami.edu
> =+=------------------------\/---------------------------------=+=

-Erik

Erik Bennett
bennett@corp.oneworld.com

------------------------------
Well, in case you are wondering.....I populated a table with about 500
tuples.
Running select * on that table yielded no errors from purify..
Running select * .. order by (a non-indexed field) gave the following:
total memory usage: 319,803 bytes.
Of this amount of memory, 57,344 bytes were leaks.
I then populated another table with 2600+ tuples.
These are the results with order by:
Total memory used: 786,763
Of this memory, 524,288 bytes are memory leaks!!!  Wasting half a meg of
RAM on 2600 or so tuples is bad....And what if you have 20,000 tuples?
Or 100,000?
I will try to investigate this matter now....
=+=------------------------/\---------------------------------=+=
Igor Natanzon |**| E-mail: igor@sba.miami.edu
=+=------------------------\/---------------------------------=+=
On Sat, 21 Jun 1997, Bruce Momjian wrote:
> >
> > Can someone suggest why running the \dt command in psql causes so many
> > disk accesses?
> >
> > I am seeing some queries causing a lot of disk access that doesn't make
> > sense to me.
> >
> > I am running my server with fsync off.
>
> I now realize the unusual disk activity is caused by any query that uses
> an ORDER BY. A query that returns three rows takes MUCH longer with the
> ORDER BY than without it, 0.20 seconds vs. 0.61 seconds, three times
> longer.
>
> The query is:
>
> select * from matter
>
> AND
>
> select * from matter order by matter
>
> The table has three short text columns.
>
>
> Would anyone like to comment on why ORDER BY is doing this? It
> certainly slows down the query. I am running BSD/OS.
>
> This is of course using PostgreSQL 6.1.
>
> --
> Bruce Momjian
> maillist@candle.pha.pa.us
>
------------------------------
> >
> > Can someone suggest why running the \dt command in psql causes so many
> > disk accesses?
> >
> > I am seeing some queries causing a lot of disk access that doesn't make
> > sense to me.
> >
> > I am running my server with fsync off.
>
> I now realize the unusual disk activity is caused by any query that uses
> an ORDER BY. A query that returns three rows takes MUCH longer with the
> ORDER BY than without it, 0.20 seconds vs. 0.61 seconds, three times
> longer.
>
> The query is:
>
> select * from matter
>
> AND
>
> select * from matter order by matter
>
> The table has three short text columns.
>
>
> Would anyone like to comment on why ORDER BY is doing this? It
> certainly slows down the query. I am running BSD/OS.
>
> This is of course using PostgreSQL 6.1.
OK, I think I have some more information on this.
To test:
create table ordertest (x integer);
insert into ordertest values (3);
Then try:
select * from ordertest;
AND
select * from ordertest order by x;
I test it by:
echo "select * from ordertest;" | time psql test
OK, now for some times. On my machine (PP200 running BSD/OS), the
startup/shutdown time for the database is 0.15 seconds, so I will
subtract that time to show the incremental elapsed wallclock time for
each command:
Select without order by: 0.01
Select with order by: 0.30
Create Table: 0.11
Create Index: 0.54
Insert into Table: 0.01
These are all times on a table with one integer column, and one row.
As you can see, the ORDER BY has slowed the SELECT by a factor of 30.
You can actually create three tables in the same time, or insert 30
rows.
This looks like a pretty serious performance problem.
Without the ORDER BY, performance is great. With the ORDER BY,
performance is poor, and I can see some mSQL comparisons where we will
look very bad.  Let's face it, most multi-row SELECT statements use
ORDER BY, and it is a waste to spend 30x as much time sorting the data
as it took to get it out of the database.
FASTBUILD is not used in the ORDER BY operations.
How is ORDER BY performed? The output of the SELECT statement is placed
in a temp table, and the temp table is then sorted. A comment from
ExecSort.c says:
/* ----------------------------------------------------------------
 *		ExecSort
 *
 *		old comments
 *		Retrieves tuples from the outer subtree and inserts them into a
 *		temporary relation.  The temporary relation is then sorted and
 *		the sorted relation is stored in the relation whose ID is
 *		indicated in the 'tempid' field of this node.
 *		Assumes that heap access method is used.
 * ----------------------------------------------------------------
 */

ExecSort calls psort() to do the work.  psort() uses a Knuth algorithm.
I removed the psort() call from ExecSort, and the execution time was:
Select with order by using psort(): 0.30
Select with order by NOT using psort(): 0.07
Now of course, not using psort() causes the data not to be sorted, but
there must be some inefficiency if sorting one row takes 0.23 seconds on
a PP200 with 64MB of RAM.  I suspect the other 0.06 seconds is caused by
the creation of the temp table and moving the rows around.
Any comments?
- --
Bruce Momjian
maillist@candle.pha.pa.us
------------------------------
> ExecSort calls psort() to do the work.  psort() uses a Knuth algorithm.
>
> I removed the psort() call from ExecSort, and the execution time was:
>
>	Select with order by using psort():	0.30
>	Select with order by NOT using psort():	0.07
>
> Now of course, not using psort() causes the data not to be sorted, but
> there must be some inefficiency if sorting one row takes 0.23 seconds on
> a PP200 with 64MB of RAM.  I suspect the other 0.06 seconds is caused by
> the creation of the temp table and moving the rows around.

It looks like psort()'s use of disk files to do the sorting is causing
the slowdown, with the multiple create's and unlink's.  I am going to
modify psort() to use memory buffers instead of disk files for sorting
small SELECT results.  psort() is only called by ExecSort, which is used
by ORDER BY's.

Permission to put this in 6.1p1?

- --
Bruce Momjian
maillist@candle.pha.pa.us

------------------------------
Bruce Momjian wrote:
> It looks like psort()'s use of disk files to do the sorting is causing
> the slowdown, with the multiple create's and unlink's. I am going to
> modify psort() to use memory buffers instead of disk files for sorting
> small SELECT results. psort() is only called by ExecSort, which is used
> by ORDER BY's.
>
> Permission to put this in 6.1p1?
IMHO sounds like a *bad* idea, since it might take a while to debug.
Better to leave this as a feature for now (Postgres has had this
forever?) and come up with something great for v6.2; I'd be interested
in contributing to this if it would be helpful.
I don't know which Knuth algorithm the code uses, but some sort
algorithms perform very poorly on data which happens to be already
sorted. QuickSort, which comes on many Unix boxes (the qsort() call) is
quickest on completely unordered/disordered data and its slowest
performance is on already perfectly sorted data (I'm doing this from
memory, but the facts may be mostly correct :).
- Tom
------------------------------
Bruce Momjian wrote:
> OK, I think I have some more information on this.
> OK, now for some times. On my machine (PP200 running BSD/OS), the
> startup/shutdown time for the database is 0.15 seconds, so I will
> subtract that time to show the incremental elapsed wallclock time for
> each command:
>
> Select without order by: 0.01
>
> Select with order by: 0.30
>
> Create Table: 0.11
> Create Index: 0.54
>
> Insert into Table: 0.01
>
> These are all times on a table with one integer column, and one row.
>
> As you can see, the ORDER BY has slowed the SELECT by a factor of 30.
> You can actually create three tables in the same time, or insert 30
> rows.
>
> This looks like a pretty serious performance problem.
By having a small table, you are really just measuring the overhead in
setting up the sort. Try something with a _lot_ of data, and try to have
the data inserted randomly if possible. Another thing to check is the
fundamental time it takes to actually do sorting (for example, put the
same data into a little test program which calls qsort(), or find the
Knuth algorithm and call that directly). Sorts aren't (usually) cheap!
- Tom
------------------------------
End of hackers-digest V1 #396
*****************************
> Bruce Momjian wrote:
> > It looks like psort()'s use of disk files to do the sorting is causing
> > the slowdown, with the multiple create's and unlink's.  I am going to
> > modify psort() to use memory buffers instead of disk files for sorting
> > small SELECT results.  psort() is only called by ExecSort, which is used
> > by ORDER BY's.
> >
> > Permission to put this in 6.1p1?
>
> IMHO sounds like a *bad* idea, since it might take a while to debug.
> Better to leave this as a feature for now (Postgres has had this
> forever?) and come up with something great for v6.2; I'd be interested
> in contributing to this if it would be helpful.

I think I agree.  Let's leave it out.  I initially thought it was a
FASTBUILD btree issue, but I found it was unrelated.

> I don't know which Knuth algorithm the code uses, but some sort
> algorithms perform very poorly on data which happens to be already
> sorted.  QuickSort, which comes on many Unix boxes (the qsort() call) is
> quickest on completely unordered/disordered data and its slowest
> performance is on already perfectly sorted data (I'm doing this from
> memory, but the facts may be mostly correct :).

It is from Knuth III, page 280.  There are references to Knuth pages and
diagrams all over the code.  The sort is meant for cases when you don't
have enough memory to do the sort, and the thrashing of doing all those
random accesses can be slow.  Now why they would not implement a fast
version for cases where the rows WOULD fit in memory, I don't know.

I will wait for 6.1p1, and then apply the fix.  I plan to just use
qsort() on an array of tuple pointers if the size of the ORDER BY result
is under 256k bytes, which is only 1/2 of the default shared buffer
size.  I will run tests to see if there is major speed improvement at
that size.  If not, I will knock it down to 128k or 64k, just to be safe
that it will not swamp the shared pool on a busy system.  I may even key
the trigger value on the number of shared buffers allocated.

This has got to be faster than what it does now.  The psort() code is
designed for sorts that would clearly fill the shared buffer pool and
disk buffer pool.  It takes a multi-megabyte ORDER BY to do that on my
machine.  And I am using a Barracuda Ultra drive with tagged queueing
doing 7 Mbytes/second transfers, so I am not testing this on a slow
drive when I report 0.23 seconds to sort one row.

- --
Bruce Momjian
maillist@candle.pha.pa.us

------------------------------
On Sat, 21 Jun 1997, Bruce Momjian wrote:
> It looks like psort()'s use of disk files to do the sorting is causing
> the slowdown, with the multiple create's and unlink's.  I am going to
> modify psort() to use memory buffers instead of disk files for sorting
> small SELECT results.  psort() is only called by ExecSort, which is
> used by ORDER BY's.
>
> Permission to put this in 6.1p1?

Can you make the size of the result set above which disk files will be
used configurable?  That way people with loads of RAM can use huge
buffers, and people with little RAM can keep that RAM free for other
processes.

Maarten

_____________________________________________________________________________
| Maarten Boekhold, Faculty of Electrical Engineering TU Delft, NL          |
| M.Boekhold@et.tudelft.nl  boekhold@gopher.library.tudelft.nl              |
- -----------------------------------------------------------------------------

------------------------------
> Can you make the size of the result set above which disk files will be
> used configurable?  That way people with loads of RAM can use huge
> buffers, and people with little RAM can keep that RAM free for other
> processes.

Good idea.  I think the final code will not use the buffer cache, but
directly palloc the memory it needs and copy the rows to be sorted into
there.  I also hope to eliminate the copy to a temp table.  Currently,
the code copies to a temp table, loads the rows into "tape" files, then
writes to a new table, which seems like a waste.

Paul's FASTBUILD btree code uses ftruncate() after he has loaded the
rows from the table, before he writes them back.  I will contact him
directly for comments.

- --
Bruce Momjian
maillist@candle.pha.pa.us

------------------------------
On Sat, 21 Jun 1997, Bruce Momjian wrote:
> > ExecSort calls psort() to do the work. psort() uses a Knuth algorithm.
> >
> > I removed the psort() call from ExecSort, and the execution time was:
> >
> > Select with order by using psort(): 0.30
> >
> > Select with order by NOT using psort(): 0.07
> >
> > Now of course, not using psort() causes the data not to be sorted, but
> > there must be some inefficienty if sorting one row takes 0.23 seconds on
> > a PP200 with 64MB of RAM. I suspect the other 0.06 seconds is caused by
> > the creation of the temp table and moving the rows around.
>
> It looks like psort()'s use of disk files to do the sorting is causing
> the slowdown, with the multiple create's and unlink's. I am going to
> modify psort() to use memory buffers instead of disk files for sorting
> small SELECT results. psort() is only called by ExecSort, which is used
> by ORDER BY's.
>
> Permission to put this in 6.1p1?
By all means...
	My next question is what is considered a 'small SELECT result'?  If
I have a 32Meg machine, that 'small' would be different, I would assume,
than a 128Meg machine.
	With that in mind, how about some sort of 'switch', similar to
-B, that allows you to stipulate the total RAM of a machine (or virtual
memory?), or something similar to that?
Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
------------------------------
On Sat, 21 Jun 1997, Bruce Momjian wrote:
> I will wait for 6.1p1, and then apply the fix. I plan to just use
> qsort() on an array of tuple pointers if the size of the ORDER BY result
> is under 256k bytes, which is only 1/2 of the default shared buffer
> size. I will run tests to see if there is major speed improvement at
> that size. If not, I will knock it down to 128k or 64k, just to be safe
> that it will not swamp the shared pool on a busy system. I may even key
> the trigger value on the number of shared buffers allocated. This has
> got to be faster than what it does now.
	Why use shared memory for this?  Why not use mmap() for it?  Come
to think of it, mmap()'ing it would have better scalability, no?  If you
already know the result size (ie. 256k), you could have the code try
to mmap() that amount of memory to do the sort in.  If the mmap() fails,
revert to using a file...now the result size of the SELECT doesn't have a
compiled-in "limit" to the size of memory used for the sort...it's
restricted to the amount of memory on the machine (ie. if I double the
RAM, I should be able to have double the results sorted in memory
instead of on disk).
	Add a flag on top of that, like -B, that states the *max* result
size to do an in-memory sort with...or, rather, the *max* to try to do
one with.
Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
------------------------------
> > Permission to put this in 6.1p1?
>
> By all means...

Thomas says no for 6.1p1.  I have no opinion on this.  However, I don't
want to rush out a fix until I have heard from Paul M. Aoki and Vadim,
and gotten their opinion on the most efficient way to fix this.

> 	My next question is what is considered a 'small SELECT result'?  If
> I have a 32Meg machine, that 'small' would be different, I would assume,
> than a 128Meg machine.

I think the sort is used by ORDER BY and joins where there is no
existing index.

> 	With that in mind, how about some sort of 'switch', similar to
> -B, that allows you to stipulate total RAM of a machine (or virtual
> memory?), or something similar to that?

I am not sure how to do this.  There is sysctl() and getconf(), but
neither of them is standard on all platforms.

- --
Bruce Momjian
maillist@candle.pha.pa.us

------------------------------
> 	Why use shared memory for this?  Why not use mmap() for it?  Come
> to think of it, mmap()'ing it would have better scalability, no?  If you
> already know the result size (ie. 256k), you could have the code try
> to mmap() that amount of memory to do the sort in.  If the mmap() fails,
> revert to using a file...now the result size of the SELECT doesn't have a
> compiled-in "limit" to the size of memory used for the sort...it's
> restricted to the amount of memory on the machine (ie. if I double the
> RAM, I should be able to have double the results sorted in memory
> instead of on disk).
I don't think we need mmap() because we never need to put it on disk.
Just palloc() the memory.
>
> Add a flag over top of that, like -B, that states the *max* result
> size to do an in-memory sort with...or, rather, the *max* to try to do one
> with.
>
> Marc G. Fournier
> Systems Administrator @ hub.org
> primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
>
>
- --
Bruce Momjian
maillist@candle.pha.pa.us
------------------------------
Igor wrote:
>
> Well, in case you are wondering.....I populated a table with about 500
> tuples.
> Running select * on that table yielded no errors from purify..
> Running select * .. order by (a non-indexed field) gave the following:
> total memory usage: 319,803 bytes.
> Of this amount of memory, 57,344 bytes were leaks.
^^^^^^
57344/8192 = 7
>
> I then populated another table with 2600+ tuples.
> These are the results with order by:
> Total memory used: 786,763
> Of this memory, 524,288 bytes are memory leaks!!! Waste of half a meg of
^^^^^^^
524288/8192 = 64
I believe that it's just local buffer allocation, so it's not a
real memory leak.
> ram on 2600 or so tuples is bad....And what if you have 20,000 tuples?
> Or 100,000?
64 is the max number of local buffers...
Vadim
------------------------------
> > It looks like psort()'s use of disk files to do the sorting is causing
> > the slowdown, with the multiple create's and unlink's.  I am going to
> > modify psort() to use memory buffers instead of disk files for sorting
> > small SELECT results.  psort() is only called by ExecSort, which is
> > used by ORDER BY's.
> >
> > Permission to put this in 6.1p1?
>
> Can you make the size of the result set above which disk files will be
> used configurable?  That way people with loads of RAM can use huge
> buffers, and people with little RAM can keep that RAM free for other
> processes.

If I need a configuration value, I will either determine the amount of
RAM portably, or base the value on the number of shared buffers
requested with -B.

- --
Bruce Momjian
maillist@candle.pha.pa.us

------------------------------
On Sun, 22 Jun 1997, Bruce Momjian wrote:
> > 	My next question is what is considered a 'small SELECT result'?  If
> > I have a 32Meg machine, that 'small' would be different, I would assume,
> > than a 128Meg machine.
>
> I think the sort is used by ORDER BY and joins where there is no
> existing index.
Ummm...okay.
Now, what is considered a "small SELECT result"? :)
> > With that in mind, how about some sort of 'switch', similar to
> > -B, that allows you to stipulate total RAM of a machine (or virtual memory?),
> > or something similar to that?
>
> I am not sure how to do this. There is sysctl() and getconf(), but
> neither of them is standard on all platforms.
Huh? You seem to have totally missed the question on that one
there Bruce :( Where does sysctl() and getconf() fit into having a run
time switch? *confused look* *grin*
Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
------------------------------
On Sun, 22 Jun 1997, Bruce Momjian wrote:
> > 	Why use shared memory for this?  Why not use mmap() for it?  Come
> > to think of it, mmap()'ing it would have better scalability, no?  If you
> > already know the result size (ie. 256k), you could have the code try
> > to mmap() that amount of memory to do the sort in.  If the mmap() fails,
> > revert to using a file...now the result size of the SELECT doesn't have a
> > compiled-in "limit" to the size of memory used for the sort...it's
> > restricted to the amount of memory on the machine (ie. if I double the
> > RAM, I should be able to have double the results sorted in memory
> > instead of on disk).
>
> I don't think we need mmap() because we never need to put it on disk.
> Just palloc() the memory.
	okay, I'm not totally familiar with palloc()...I did a project a
while back where we mmap()'d a region of memory so that we didn't need
to write to disk.  Essentially, I used ftruncate() to create the size of
the buffer I wanted to mmap(), and never flushed the memory...
	Not sure what palloc() does though...should look into it, eh? :(
Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
------------------------------
> On Sun, 22 Jun 1997, Bruce Momjian wrote:
> > I think the sort is used by ORDER BY and joins where there is no
> > existing index.
>
> 	Ummm...okay.
>
> 	Now, what is considered a "small SELECT result"? :)

I don't know yet.  Any sort that will not fit in physical memory should
use the default psort() code, because it prevents the thrashing of
moving data in and out of virtual memory (swap) while the sort is being
performed.  It is hard to determine what would fit in PHYSICAL, not
virtual memory, and in fact this number changes depending on how busy
the machine is.

> > I am not sure how to do this.  There is sysctl() and getconf(), but
> > neither of them is standard on all platforms.
>
> 	Huh?  You seem to have totally missed the question on that one
> there Bruce :(  Where does sysctl() and getconf() fit into having a run
> time switch?  *confused look*  *grin*

Yes, I guess we will just have to make this a run-time switch, though
keying in on -B may be a good idea; that way they have only one thing to
tune.

- --
Bruce Momjian
maillist@candle.pha.pa.us

------------------------------
> 	okay, I'm not totally familiar with palloc()...I did a project a
> while back where we mmap()'d a region of memory so that we didn't need
> to write to disk.  Essentially, I used ftruncate() to create the size
> of the buffer I wanted to mmap(), and never flushed the memory...
>
> 	Not sure what palloc() does though...should look into it, eh? :(

palloc() basically malloc's from a pool that gets auto-free'd at the end
of statements or transactions.

We could use mmap(), but why bother if we can just malloc() all the
memory we need?  That is what has me confused.  Anyone know an advantage
to using mmap() when you really don't want the image on disk?

I would suspect that mmap() would use the file system for backing store
for the memory rather than swap.  But the thrashing of large sorts is
going to be as big a problem with this as with malloc'ed memory backed
by swap, in fact slower, because swap backing store is faster than file
system backing store, so we might as well use psort() for those large
cases.

psort() does its sorts in pieces with merging, and doesn't load large
amounts of data into memory at once, so it doesn't thrash.  It uses disk
files to store intermediate sorts, hence the large startup time.

- --
Bruce Momjian
maillist@candle.pha.pa.us

------------------------------
Bruce Momjian wrote:
> > Can you make the size of the result set above which disk files will be
> > used configurable?  That way people with loads of RAM can use huge
> > buffers, and people with little RAM can keep that RAM free for other
> > processes.
>
> If I need a configuration value, I will either determine the amount of
> RAM portably, or base the value on the number of shared buffers
> requested with -B.

Why not use an additional flag, as suggested by Marc?  Using -B is not
good: the size of the shared memory segment may be limited by the OS
(under FreeBSD 2.1.0 it's limited to 4M) or by the system administrator,
and so the backend would use 2-4 M of RAM for a sort on a box with 192 M
of RAM?

This flag may be useful for joinpath.c:EnoughMemoryForHashjoin() too...

Also note that the following

> - make psort read directly from the executor node below it
>   (instead of an input relation)

means it will be impossible to know the size of the result before sort
startup.  So you have to use palloc up to the in-memory limit and switch
to 'tape' files dynamically.

Also

> - makes the Sort node read directly from the last set of psort runs
>   (instead of an output relation)

requires changes to ExecSortMarkPos()/ExecSortRestrPos(), which use
heap_markpos()/heap_restrpos() (because the last set of psort runs is
not a normal heap relation).

But both changes to nodeSort.c are what we really need.

Vadim

------------------------------
> > Bruce Momjian wrote: > > > > > Can you make the size of the result set above which diskfiles will be used > > > configurable? That way ppl with loads of RAM can use huge buffers, and ppl > > > with little RAM can keep that RAM free for other processes. > > > > If I need a configuration value, I will either determine the amount of > > RAM portably, or base the value on the number of shared buffers > > requested with -B. > > Why don't use additional flag as it suggested by Mark ? > Using -B is not good: the size of shared memory segment may be > limited by OS (under FreeBSD 2.1.0 it's limited to 4M) or system > administrator and so backend will use 2-4 M of RAM for sort on > box with 192 M RAM ? OK, I will use a new flag. > > This flag may be useful for joinpath.c:EnoughMemoryForHashjoin() too... > > Also note that following > > - make psort read directly from the executor node below it > > (instead of an input relation) > > it will be impossible to know the size of result before sort startup. > So you have to use palloc up to in-memory limit and switch to > 'tape' files dynamically. I like this idea. I was struggling on how I was going to determine the size of the result anyway. I have checked the Mariposa source changes Paul mentioned. They do indeed change the behavior or psort(). It still uses tape files, so I will need to increase the memory used for each sort, and only create the tape files if the initial sort does not fit within the allocated memory. > > Also > > - makes the Sort node read directly from the last set of psort runs > > (instead of an output relation) > > require changes to ExecSortMarkPos()/ExecSortRestrPos() which > use heap_markpos()/heap_restrpos() (because of last set of > psort is not normal heap relation). > > But both changes of nodeSort.c are what we really need. With the new psort(), you can do multiple sorts at the same time. 
The new psort() comment says:

 *      The old psort.c's routines formed a temporary relation from the merged
 *      sort files. This version keeps the files around instead of generating the
 *      relation from them, and provides interface functions to the file so that
 *      you can grab tuples, mark a position in the file, restore a position in the
 *      file. You must now explicitly call an interface function to end the sort,
 *      psort_end, when you are done.
 *      Now most of the global variables are stuck in the Sort nodes, and
 *      accessed from there (they are passed to all the psort routines) so that
 *      each sort running has its own separate state. This is facilitated by having
 *      the Sort nodes passed in to all the interface functions.
 *      The one global variable that all the sorts still share is SortMemory.
 *      You should now be allowed to run two or more psorts concurrently,
 *      so long as the memory they eat up is not greater than SORTMEM, the initial
 *      value of SortMemory.                    -Rex 2.15.1995
 *
 *      Use the tape-splitting method (Knuth, Vol. III, pp. 281-86) in the future.

I am uploading mariposa-alpha-1.tar.gz to the PostgreSQL ftp incoming
directory because I think I am going to need help on this one.  The
official Mariposa ftp site is very, very slow and unreliable.  This
release is dated June 1996, and is the newest available.

- --
Bruce Momjian
maillist@candle.pha.pa.us
------------------------------
On Sun, 22 Jun 1997, Bruce Momjian wrote:
> > okay, I'm not totally familiary with palloc()...I did a project while
> > back where we mmap()'d a region of memory so that we didn't need to write
> > to disk. Essentially, I used ftruncate() to create the size of the buffer I
> > wanted to mmap(), and never flushed the memory...
> >
> > Not sure what palloc() does though...should look into it, eh? :(
>
> palloc() basically malloc's from a pool that gets auto-freed at the end
> of statements or transactions.
>
> We could use mmap(), but why bother if we can just malloc() all the
> memory we need?  That is what has me confused.  Anyone know an advantage
> to using mmap() when you really don't want the image on disk?
Okay...this is something I have never truly succeeded in doing, and I
do agree that it would be better than mmap()...mmap() is good when
multiple processes want/need to access the same memory, and, from what
I've seen, it sort of acts as an intermediary between malloc() and
shmem()...
Now I understand what you are thinking :)
Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
------------------------------
On Sun, 22 Jun 1997, Thomas G. Lockhart wrote:

> I don't know which Knuth algorithm the code uses, but some sort
> algorithms perform very poorly on data which happens to be already
> sorted. QuickSort, which comes on many Unix boxes (the qsort() call) is
> quickest on completely unordered/disordered data and its slowest
> performance is on already perfectly sorted data (I'm doing this from
> memory, but the facts may be mostly correct :).

From what I remember, the quick sort algorithm works just as fast on
sorted data as on unsorted data.  In other words, the algorithm does
not take advantage of the fact that the data is sorted.

On the other hand, if you go to something like bubble sort, it will
work faster than quick sort if the data is sorted and you are adding
one more item, but on unsorted data it takes an enormously longer
length of time to sort a large list.

David Friend                         ! cq995@freenet.carleton.ca
Atlantis Scientific Inc.             ! david.friend@atlsci.com
20 Colonnade Rd, Suite 110           ! 613-727-1087 (voice)
Ottawa, Ontario, CANADA K2E 7M6      ! 800-265-3894 (voice)
ERGOvista Scientific Image Analysis  ! 613-727-5853 (fax)
------------------------------
> From what I remember, the quick sort algorithm works just as fast on
> sorted data as on unsorted data.  In other words, the algorithm does not
> take advantage of the fact that the data is sorted.
>
> On the other hand, if you go to something like bubble sort, it will
> work faster than quick sort if the data is sorted and you are adding one
> more item, but on unsorted data it takes an enormously longer length of
> time to sort a large list.

One thing I am going to look into is replacing the btree-build in the
current psort with a qsort of HeapTuple pointers.

- --
Bruce Momjian
maillist@candle.pha.pa.us
------------------------------
End of hackers-digest V1 #398
*****************************
Bruce Momjian wrote:
>
> > >From what I remember, the quick sort algorithm works just as fast on
> > sorted data as on unsorted data. In other words, the algorithm does not
> > take advantage of the fact that the data is sorted.
> >
> > On the other hand, if you go to something like bubble sort, it will
> > work faster than quick sort if the data is sorted and you are adding one
> > more item, but on unsorted data it takes an enormously longer length of
> > time to sort a large list.
> >
>
> One thing I am going to look into is replacing the btree-build in the
^^^^^^^^^^^
> current psort with a qsort of HeapTuple pointers.
I don't fully understand what you mean...
btree-build uses qsort to sort items on a page and has its own
priority queue methods (instead of lselect.c - the leftist-tree
selection algorithm).
Do you want to use qsort for in-memory sorts?
BTW, I may suggest an additional trick:
if the size of the sort-keys is (much?) smaller than the size of the tuple
(select _int4_field_, _big_text_field_ from ... order by _int4_field_)
and you are going to switch to tape methods (i.e. you've palloc'ed
all sort-memory), then you may try to create a temp-file to store
the (unsorted) tuples there and to deal with the sort-keys only.  After
you've got the sort-keys (or better say - records which keep sort-keys
and pointers - temp-file offsets - to the original tuples) sorted, you
may use these pointers to get the requested tuples from the temp-file.
Even for select int4_field from ... order by int4_field
the current psort pallocs 48 (header) + 4 (field) + 4 (alignment) = 56 bytes
for each tuple - why not work with 4 (field) + 4 (offset) = 8 bytes
of sort-records (tm :-))?
Vadim
------------------------------
> I don't fully understand what do you mean...
> btree-build uses qsort to sort items on a page and has own
> priority queue methods (instead of lselect.c - leftist tree
> selection algorithm).
> Do you want use qsort for in-memory sorts ?

When I said btree-build, I meant the leftist tree sort, so I am
considering replacing the leftist-tree selection algorithm with a
qsort of an array of HeapTuple pointers.  Any comments on this?  The
only performance problem I see is re-allocating the array as the
number of entries grows, but if I double the size each time I exceed
the current size, I don't believe it will be much of a problem.  The
leftist tree looked very funny to me, with recursion and stuff.  I saw
that the new FASTBUILD sort used qsort, so I thought it might be a
good idea.

> BTW, I may suggest additional trick:
> if size of sort-keys is (much?) smaller then size of tuple
> (select _int4_field_, _big_text_field_ from ... order by _int4_field_)
> and you are going to switch to tape-methods (i.e. you've pallocted
> all sort-memory) then you may try to create temp-file to store
> (unsorted) tuples there and to deal with sort-keys only. After
> you'll got sort-keys (or better say - records which keep sort-keys
> and pointers - temp-file offsets - to original tuples) sorted, you
> may use these pointers to get requested tuple from temp-file.

I am a little lost here.  The Mariposa sort code allows ExecSort to
read directly from the final sorted tape file, preventing the copy to
a temp table.  So, if I don't store the full tuple in the tape file, I
can't pass a REAL tuple back to ExecSort.

I assume you are suggesting putting all the tuples in one tape file,
then doing the sort with a tape file containing ONLY the indexed
fields, then using those indexed fields to retrieve the data from the
tape.

The problem I see is that the final read of sorted rows from the full
tape file will be seeking all over the file, causing poor performance.
Now, not having to move around extra data during the sort is going to
be good for performance, but I am not sure what the result will be.

I think you are correct that it will be a win to only sort the needed
columns, but considering that any sort that uses tape files will be
big, and hence will not fit in the file system cache, and considering
the extra code to do this, I am uncertain about its usefulness.

> Even for select int4_field from ... order by int4_field
> current psort palloc 48(header) + 4 (field) + 4 (aligning) = 56 bytes
> for each tuple - why don't work with 4(field) + 4 (offset) = 8 bytes
> of sort-records (tm-:)) ?

Again, can I do this if I need to pass back the actual row?

Also, another confusing issue is that when I start the sort, I don't
know if it is going to fit into memory or not, so I must store the
entire row in memory because I might be returning a pointer and not
using tape files.  Can I then switch in the middle of the sort to tape
files and start doing my comparisons differently?  Seems hard to me.

I am just guessing on these issues.  Any comments?

- --
Bruce Momjian
maillist@candle.pha.pa.us
------------------------------
Bruce Momjian wrote:
>
> > I don't fully understand what do you mean...
> > btree-build uses qsort to sort items on a page and has own
> > priority queue methods (instead of lselect.c - leftist tree
> > selection algorithm).
> > Do you want use qsort for in-memory sorts ?
>
> When I said btree-build, I meant the leftist tree sort, so I am
> considering replacing the leftist-tree selection algorithm with a qsort
> of an array of HeapTuple pointers. Any comments on this? The only
> performance problem I see is re-allocing the array as the number of
> entries grows, but if I double the size each time I exceed the current
> size, I don't believe it will be much of a problem. The leftist tree
> looked very funny to me, with recursion and stuff. I saw the new
> FASTBUILD sort used qsort, so I thought it may be a good idea.
Ok.  What do you propose to use for merging the tapes - lselect.c or
the btree queue methods?
>
> >
> > BTW, I may suggest additional trick:
> > if size of sort-keys is (much?) smaller then size of tuple
> > (select _int4_field_, _big_text_field_ from ... order by _int4_field_)
> > and you are going to switch to tape-methods (i.e. you've pallocted
> > all sort-memory) then you may try to create temp-file to store
> > (unsorted) tuples there and to deal with sort-keys only. After
> > you'll got sort-keys (or better say - records which keep sort-keys
> > and pointers - temp-file offsets - to original tuples) sorted, you
> > may use these pointers to get requested tuple from temp-file.
>
...
>
> The problem I see is that the final read of sorted rows from the full
> tape file will be seeking all over the file, causing poor performance.
I see it.
BTW, using indices for sorts has the same problem!
As a workaround you may try to reduce seek randomness by appending an
offset to the sort-keys (at the end) - but this will help only in the
case of duplicates, of course.
On the other hand, when you are merging tape-files, don't you do
random seeks by switching from one tape to another?
I'm not sure, but it seems that more than one pass over all the
tape-files may be required...
> Now, not having to move around extra data during the sort is going to be
> good for performance, but I am not sure what the result will be.
(Note that I'm not either.)
>
> I think you are correct that it will be a win to only sort the needed
> columns, but considering that any sort that uses tape files will be big,
> and hence will not fit in the file system cache, and considering the
> extra code to do this, I am uncertain about its usefulness.
Ok - select int4 ... order by int4 using 4M for in-memory sort:
1. entire-tuple method
in-memory sort will be used for 4 000 000 / 56 =~ 71428 tuples
in result.
2. sort-records method
in-memory sort for 4 000 000 / 12 = 333333 tuples in result
(12 = 8 + 4: 4 bytes to hold NULLs bitmask)
+ writing 333333*56 = 18 666 648 bytes into temp_file
(btw, we may exclude the 8 bytes of int4 + alignment from tuples
in the temp-file - we have the attribute values in the sort-records -
so it may be reduced to 15 999 984: not so big in this case,
but it may make sense for bigger sort-keys)
+ 333333 seeks over temp-file when returning results
3. entire-tuple method for 333333 tuples result
writing of 6 tape-files with 18 666 648 bytes total size
+ ??? seeks in merging
+ writing 18 666 648 bytes of final tape-file
is that all? or may there be more than one pass (read/write)
over the tape-files?
Vadim
------------------------------
At 9:48 PM 6/23/97, David Friend wrote:
>On Sun, 22 Jun 1997, Thomas G. Lockhart wrote:
>
>> I don't know which Knuth algorithm the code uses, but some sort
>> algorithms perform very poorly on data which happens to be already
>> sorted. QuickSort, which comes on many Unix boxes (the qsort() call) is
>> quickest on completely unordered/disordered data and its slowest
>> performance is on already perfectly sorted data (I'm doing this from
>> memory, but the facts may be mostly correct :).
>
>From what I remember, the quick sort algorithm works just as fast on
>sorted data as on unsorted data. In other words, the algorithm does not
>take advantage of the fact that the data is sorted.

What Thomas says is correct for the standard textbook description of
quicksort.  It is not hard to reverse the order of the compare and fix it
so its worst case is reverse-sorted instead of presorted data.  I would
expect that most vendor's implementations have this fix, but no guarantees.

I have skipped the first part of this discussion.  Quicksort is a good
in-memory algorithm for uniprocessors.  It may not be applicable to
database sorts of very large data sets.

Signature failed Preliminary Design Review.
Feasibility of a new signature is currently being evaluated.
h.b.hotz@jpl.nasa.gov, or hbhotz@oxy.edu
------------------------------
> What Thomas says is correct for the standard textbook description of
> quicksort.  It is not hard to reverse the order of the compare and fix it
> so its worst case is reverse-sorted instead of presorted data.  I would
> expect that most vendor's implementations have this fix, but no guarantees.
>
> I have skipped the first part of this discussion.  Quicksort is a good
> in-memory algorithm for uniprocessors.  It may not be applicable to
> database sorts of very large data sets.

The quicksort is to be used only for in-memory sorting.  Larger sorts
are going to use the Knuth, volume 3, page 280 tape sorting algorithm.

If anyone wants a digest of this discussion, I can easily send it to
them.

- --
Bruce Momjian
maillist@candle.pha.pa.us
------------------------------