Discussion: [HACKERS] \dt and disk access


[HACKERS] \dt and disk access

From
Bruce Momjian
Date:
Can someone suggest why running the \dt command in psql causes so many
disk accesses?

I am seeing some queries causing a lot of disk access that doesn't make
sense to me.

I am running my server with fsync off.

--
Bruce Momjian
maillist@candle.pha.pa.us

------------------------------

Re: [HACKERS] \dt and disk access

From
Bruce Momjian
Date:
>
> Can someone suggest why running the \dt command in psql causes so many
> disk accesses?
>
> I am seeing some queries causing a lot of disk access that doesn't make
> sense to me.
>
> I am running my server with fsync off.

I now realize the unusual disk activity is caused by any query that uses
an ORDER BY.  A query that returns three rows takes MUCH longer with the
ORDER BY than without it, 0.20 seconds vs. 0.61 seconds, three times
longer.

The query is:

    select * from matter

AND

    select * from matter order by matter

The table has three short text columns.


Would anyone like to comment on why ORDER BY is doing this?  It
certainly slows down the query.  I am running BSD/OS.

This is of course using PostgreSQL 6.1.

--
Bruce Momjian
maillist@candle.pha.pa.us

------------------------------

Re: [HACKERS] \dt and disk access

From
Igor
Date:
Well....Is the sorting algorithm using temp files to do the sorting? If
not, it would be interesting to see the difference in Purify output when
running an identical query with sorting and without...
I'll probably run it on Saturday that way just to (hopefully) eliminate
leaks and such from possible causes...

=+=------------------------/\---------------------------------=+=
       Igor Natanzon      |**|   E-mail: igor@sba.miami.edu
=+=------------------------\/---------------------------------=+=

On Sat, 21 Jun 1997, Bruce Momjian wrote:

> >
> > Can someone suggest why running the \dt command in psql causes so many
> > disk accesses?
> >
> > I am seeing some queries causing a lot of disk access that doesn't make
> > sense to me.
> >
> > I am running my server with fsync off.
>
> I now realize the unusual disk activity is caused by any query that uses
> an ORDER BY.  A query that returns three rows takes MUCH longer with the
> ORDER BY than without it, 0.20 seconds vs. 0.61 seconds, three times
> longer.
>
> The query is:
>
>     select * from matter
>
> AND
>
>     select * from matter order by matter
>
> The table has three short text columns.
>
>
> Would anyone like to comment on why ORDER BY is doing this?  It
> certainly slows down the query.  I am running BSD/OS.
>
> This is of course using PostgreSQL 6.1.
>
> --
> Bruce Momjian
> maillist@candle.pha.pa.us
>

------------------------------

Re: [HACKERS] \dt and disk access

From
(Erik Bennett)
Date:
Under BSD/OS 2.0 and up (don't know about 1.x), it should be possible
to start up the postmaster thus:
ktrace -i postmaster >& Log

This will trace all calls to the kernel (and generate a good deal of
output, as well as slow things down a lot).

Run your psql command.
Kill -15 the postmaster.

In the same directory where you started the postmaster, you can run
`kdump | less`.  Down toward the bottom of the file, you should be
able to track down your postgres process and what it was doing.

If you find it doing a lot of activity with /tmp, you might try mounting
/tmp as a memory file system.

I realize this isn't really an answer to the question "why is there
disk activity", but it should at least show "what postgres is doing".

> Well....Is the sorting algorithm using temp files to do the sorting? If
> not, it would be interesting to see the difference in Purify output when
> running an identical query with sorting and without...
> I'll probably run it on Saturday that way just to (hopefully) eliminate
> leaks and such from possible causes...
>
> =+=------------------------/\---------------------------------=+=
>        Igor Natanzon      |**|   E-mail: igor@sba.miami.edu
> =+=------------------------\/---------------------------------=+=
>
> On Sat, 21 Jun 1997, Bruce Momjian wrote:
>
> > >
> > > Can someone suggest why running the \dt command in psql causes so many
> > > disk accesses?
> > >
> > > I am seeing some queries causing a lot of disk access that doesn't make
> > > sense to me.
> > >
> > > I am running my server with fsync off.
> >
> > I now realize the unusual disk activity is caused by any query that uses
> > an ORDER BY.  A query that returns three rows takes MUCH longer with the
> > ORDER BY than without it, 0.20 seconds vs. 0.61 seconds, three times
> > longer.
> >
> > The query is:
> >
> >     select * from matter
> >
> > AND
> >
> >     select * from matter order by matter
> >
> > The table has three short text columns.
> >
> >
> > Would anyone like to comment on why ORDER BY is doing this?  It
> > certainly slows down the query.  I am running BSD/OS.
> >
> > This is of course using PostgreSQL 6.1.
> >
> > --
> > Bruce Momjian
> > maillist@candle.pha.pa.us
> >
>
>


 -Erik

Erik Bennett                    bennett@corp.oneworld.com

------------------------------

Re: [HACKERS] \dt and disk access

From
Igor
Date:
Well, in case you are wondering.....I populated a table with about 500
tuples.
Running select * on that table yielded no errors from purify..
Running select * .. order by (a non-indexed field) gave the following:
  total memory usage: 319,803 bytes.
Of this amount of memory, 57,344 bytes were leaks.

I then populated another table with 2600+ tuples.
These are the results with order by:
  Total memory used: 786,763
Of this memory, 524,288 bytes are memory leaks!!! Wasting half a meg of
RAM on 2600 or so tuples is bad....And what if you have 20,000 tuples? Or
100,000?

I will try to investigate this matter now....

=+=------------------------/\---------------------------------=+=
       Igor Natanzon      |**|   E-mail: igor@sba.miami.edu
=+=------------------------\/---------------------------------=+=

On Sat, 21 Jun 1997, Bruce Momjian wrote:

> >
> > Can someone suggest why running the \dt command in psql causes so many
> > disk accesses?
> >
> > I am seeing some queries causing a lot of disk access that doesn't make
> > sense to me.
> >
> > I am running my server with fsync off.
>
> I now realize the unusual disk activity is caused by any query that uses
> an ORDER BY.  A query that returns three rows takes MUCH longer with the
> ORDER BY than without it, 0.20 seconds vs. 0.61 seconds, three times
> longer.
>
> The query is:
>
>     select * from matter
>
> AND
>
>     select * from matter order by matter
>
> The table has three short text columns.
>
>
> Would anyone like to comment on why ORDER BY is doing this?  It
> certainly slows down the query.  I am running BSD/OS.
>
> This is of course using PostgreSQL 6.1.
>
> --
> Bruce Momjian
> maillist@candle.pha.pa.us
>

------------------------------

Re: [HACKERS] \dt and disk access

From
Bruce Momjian
Date:
> >
> > Can someone suggest why running the \dt command in psql causes so many
> > disk accesses?
> >
> > I am seeing some queries causing a lot of disk access that doesn't make
> > sense to me.
> >
> > I am running my server with fsync off.
>
> I now realize the unusual disk activity is caused by any query that uses
> an ORDER BY.  A query that returns three rows takes MUCH longer with the
> ORDER BY than without it, 0.20 seconds vs. 0.61 seconds, three times
> longer.
>
> The query is:
>
>     select * from matter
>
> AND
>
>     select * from matter order by matter
>
> The table has three short text columns.
>
>
> Would anyone like to comment on why ORDER BY is doing this?  It
> certainly slows down the query.  I am running BSD/OS.
>
> This is of course using PostgreSQL 6.1.

OK, I think I have some more information on this.

To test:

    create table ordertest (x integer);
    insert into ordertest values (3);

Then try:

    select * from ordertest;

AND

    select * from ordertest order by x;

I test it by:

    echo "select * from ordertest;" | time psql test

OK, now for some times.  On my machine (PP200 running BSD/OS), the
startup/shutdown time for the database is 0.15 seconds, so I will
subtract that time to show the incremental elapsed wallclock time for
each command:

    Select without order by:    0.01

    Select with order by:        0.30

    Create Table:            0.11
    Create Index:            0.54

    Insert into Table:        0.01

These are all times on a table with one integer column, and one row.

As you can see, the ORDER BY has slowed the SELECT by a factor of 30.
You can actually create three tables in the same time, or insert 30
rows.

This looks like a pretty serious performance problem.

Without the ORDER BY, performance is great.  With the ORDER BY,
performance is poor, and I can see some mSQL comparisons where we will
look very bad.  Let's face it, most multi-row SELECT statements use
ORDER BY, and it is a waste to use 30x the time to sort the data as it
took to get the data out of the database.

FASTBUILD is not used in the ORDER BY operations.

How is ORDER BY performed?  The output of the SELECT statement is placed
in a temp table, and the temp table is then sorted.  A comment from
ExecSort.c says:

/* ----------------------------------------------------------------
 *      ExecSort
 *
 * old comments
 *      Retrieves tuples fron the outer subtree and insert them into a
 *      temporary relation. The temporary relation is then sorted and
 *      the sorted relation is stored in the relation whose ID is indicated
 *      in the 'tempid' field of this node.
 *      Assumes that heap access method is used.

ExecSort calls psort() to do the work. psort() uses a Knuth algorithm.
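The tape-oriented Knuth sorts that psort() draws on build initial sorted runs with replacement selection before merging them. As a rough illustration only (plain ints stand in for tuples, the runs are collected in memory rather than written to tape files, and all names here are made up, not taken from the actual psort() source):

```c
#include <assert.h>
#include <stdlib.h>

#define WORKSPACE 4   /* pretend only 4 records fit in memory at once */

/*
 * Sketch of replacement selection (Knuth vol. III): produce sorted runs
 * from `in`, appending them to `out` and recording each run's length in
 * `run_len`.  `out` needs room for n ints, `run_len` for n entries.
 * Returns the number of runs produced.
 */
int replacement_selection(const int *in, int n, int *out, int *run_len)
{
    int heap[WORKSPACE];     /* records eligible for the current run */
    int pending[WORKSPACE];  /* records that must wait for the next run */
    int hn = 0, pn = 0, i = 0, o = 0, runs = 0, run_start = 0;

    while (i < n && hn < WORKSPACE)      /* prime the workspace */
        heap[hn++] = in[i++];

    while (hn > 0) {
        /* remove the minimum (linear scan; a real heap is O(log n)) */
        int m = 0;
        for (int j = 1; j < hn; j++)
            if (heap[j] < heap[m])
                m = j;
        int v = heap[m];
        heap[m] = heap[--hn];
        out[o++] = v;

        if (i < n) {
            if (in[i] >= v)
                heap[hn++] = in[i++];    /* still fits in the current run */
            else
                pending[pn++] = in[i++]; /* smaller than last output: defer */
        }
        if (hn == 0) {                   /* current run is finished */
            run_len[runs++] = o - run_start;
            run_start = o;
            for (int j = 0; j < pn; j++) /* deferred records seed next run */
                heap[j] = pending[j];
            hn = pn;
            pn = 0;
        }
    }
    return runs;
}
```

On random input this yields runs roughly twice the workspace size, which is why the technique pays off when the data does not fit in memory, but it buys nothing for a one-row table.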

I removed the psort() call from ExecSort, and the execution time was:

    Select with order by using psort():        0.30

    Select with order by NOT using psort():        0.07

Now of course, not using psort() causes the data not to be sorted, but
there must be some inefficiency if sorting one row takes 0.23 seconds on
a PP200 with 64MB of RAM.  I suspect the other 0.06 seconds is caused by
the creation of the temp table and moving the rows around.

Any comments?

--
Bruce Momjian
maillist@candle.pha.pa.us

------------------------------

Re: [HACKERS] \dt and disk access

From
Bruce Momjian
Date:
> ExecSort calls psort() to do the work. psort() uses a Knuth algorithm.
>
> I removed the psort() call from ExecSort, and the execution time was:
>
>     Select with order by using psort():        0.30
>
>     Select with order by NOT using psort():        0.07
>
> Now of course, not using psort() causes the data not to be sorted, but
> there must be some inefficiency if sorting one row takes 0.23 seconds on
> a PP200 with 64MB of RAM.  I suspect the other 0.06 seconds is caused by
> the creation of the temp table and moving the rows around.

It looks like psort()'s use of disk files to do the sorting is causing
the slowdown, with the multiple create's and unlink's.  I am going to
modify psort() to use memory buffers instead of disk files for sorting
small SELECT results.  psort() is only called by ExecSort, which is used
by ORDER BY's.

Permission to put this in 6.1p1?

--
Bruce Momjian
maillist@candle.pha.pa.us

------------------------------

Re: [HACKERS] \dt and disk access

From
"Thomas G. Lockhart"
Date:
Bruce Momjian wrote:
> It looks like psort()'s use of disk files to do the sorting is causing
> the slowdown, with the multiple create's and unlink's.  I am going to
> modify psort() to use memory buffers instead of disk files for sorting
> small SELECT results.  psort() is only called by ExecSort, which is used
> by ORDER BY's.
>
> Permission to put this in 6.1p1?

IMHO sounds like a *bad* idea, since it might take a while to debug.
Better to leave this as a feature for now (Postgres has had this
forever?) and come up with something great for v6.2; I'd be interested
in contributing to this if it would be helpful.

I don't know which Knuth algorithm the code uses, but some sort
algorithms perform very poorly on data which happens to be already
sorted. QuickSort, which comes with many Unix boxes (the qsort() call), is
quickest on completely unordered/disordered data, and its slowest
performance is on already perfectly sorted data (I'm doing this from
memory, but the facts may be mostly correct :).

            - Tom

------------------------------

Re: [HACKERS] \dt and disk access

From
"Thomas G. Lockhart"
Date:
Bruce Momjian wrote:
> OK, I think I have some more information on this.
> OK, now for some times.  On my machine (PP200 running BSD/OS), the
> startup/shutdown time for the database is 0.15 seconds, so I will
> subtract that time to show the incremental elapsed wallclock time for
> each command:
>
>         Select without order by:        0.01
>
>         Select with order by:           0.30
>
>         Create Table:                   0.11
>         Create Index:                   0.54
>
>         Insert into Table:              0.01
>
> These are all times on a table with one integer column, and one row.
>
> As you can see, the ORDER BY has slowed the SELECT by a factor of 30.
> You can actually create three tables in the same time, or insert 30
> rows.
>
> This looks like a pretty serious performance problem.

By having a small table, you are really just measuring the overhead in
setting up the sort. Try something with a _lot_ of data, and try to have
the data inserted randomly if possible. Another thing to check is the
fundamental time it takes to actually do sorting (for example, put the
same data into a little test program which calls qsort(), or find the
Knuth algorithm and call that directly). Sorts aren't (usually) cheap!

            - Tom

------------------------------

End of hackers-digest V1 #396
*****************************

Re: [HACKERS] \dt and disk access

From
Bruce Momjian
Date:
>
> Bruce Momjian wrote:
> > It looks like psort()'s use of disk files to do the sorting is causing
> > the slowdown, with the multiple create's and unlink's.  I am going to
> > modify psort() to use memory buffers instead of disk files for sorting
> > small SELECT results.  psort() is only called by ExecSort, which is used
> > by ORDER BY's.
> >
> > Permission to put this in 6.1p1?
>
> IMHO sounds like a *bad* idea, since it might take a while to debug.
> Better to leave this as a feature for now (Postgres has had this
> forever?) and come up with something great for v6.2; I'd be interested
> in contributing to this if it would be helpful.

I think I agree.  Let's leave it out.  I initially thought it was a
FASTBUILD btree issue, but I found it was unrelated.

>
> I don't know which Knuth algorithm the code uses, but some sort
> algorithms perform very poorly on data which happens to be already
> sorted. QuickSort, which comes on many Unix boxes (the qsort() call) is
> quickest on completely unordered/disordered data and its slowest
> performance is on already perfectly sorted data (I'm doing this from
> memory, but the facts may be mostly correct :).

It is from Knuth III, page 280.  There are references to Knuth pages and
diagrams all over the code.

The sort is meant for cases when you don't have enough memory to do the
sort, and the thrashing of doing all those random accesses can be slow.

Now why they would not implement a fast version for cases where the rows
WOULD fit in memory, I don't know.

I will wait for 6.1p1, and then apply the fix.  I plan to just use
qsort() on an array of tuple pointers if the size of the ORDER BY result
is under 256k bytes, which is only 1/2 of the default shared buffer
size.  I will run tests to see if there is major speed improvement at
that size.  If not, I will knock it down to 128k or 64k, just to be safe
that it will not swamp the shared pool on a busy system.  I may even key
the trigger value on the number of shared buffers allocated.  This has
got to be faster than what it does now.
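The plan above could look roughly like this minimal sketch, under the assumption of a simplified stand-in for heap tuples; MiniTuple, IN_MEMORY_LIMIT, and in_memory_sort are made-up names for illustration, not the actual backend code:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical simplified tuple; the backend's HeapTuple is far richer. */
typedef struct { int key; } MiniTuple;

/* Threshold below which sorting stays in memory (the 256k from above). */
#define IN_MEMORY_LIMIT (256 * 1024)

static int tuple_cmp(const void *a, const void *b)
{
    const MiniTuple *ta = *(const MiniTuple *const *)a;
    const MiniTuple *tb = *(const MiniTuple *const *)b;
    return (ta->key > tb->key) - (ta->key < tb->key);
}

/*
 * Sort an array of tuple pointers in place with qsort(), but only when
 * the whole result fits under the threshold; return -1 so the caller can
 * fall back to the disk-based psort() otherwise.
 */
int in_memory_sort(MiniTuple **tuples, size_t ntuples, size_t total_bytes)
{
    if (total_bytes > IN_MEMORY_LIMIT)
        return -1;   /* too big: use the tape-file sort instead */
    qsort(tuples, ntuples, sizeof(MiniTuple *), tuple_cmp);
    return 0;
}
```

Sorting pointers rather than the tuples themselves keeps the swap step cheap regardless of tuple width, which is presumably why an array of tuple pointers is proposed.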

The psort() code is designed for sorts that would clearly fill the
shared buffer pool and disk buffer pool.  It takes a multi-megabyte
ORDER BY to do that on my machine.

And I am using a Barracuda Ultra drive with tagged queueing doing 7
Mbytes/second transfers, so I am not testing this on a slow drive when I
report 0.23 seconds to sort one row.

--
Bruce Momjian
maillist@candle.pha.pa.us

------------------------------

Re: [HACKERS] \dt and disk access

From
Maarten Boekhold
Date:
On Sat, 21 Jun 1997, Bruce Momjian wrote:

> > ExecSort calls psort() to do the work. psort() uses a Knuth algorithm.
> >
> > I removed the psort() call from ExecSort, and the execution time was:
> >
> >     Select with order by using psort():        0.30
> >
> >     Select with order by NOT using psort():        0.07
> >
> > Now of course, not using psort() causes the data not to be sorted, but
> > there must be some inefficiency if sorting one row takes 0.23 seconds on
> > a PP200 with 64MB of RAM.  I suspect the other 0.06 seconds is caused by
> > the creation of the temp table and moving the rows around.
>
> It looks like psort()'s use of disk files to do the sorting is causing
> the slowdown, with the multiple create's and unlink's.  I am going to
> modify psort() to use memory buffers instead of disk files for sorting
> small SELECT results.  psort() is only called by ExecSort, which is used
> by ORDER BY's.
>
> Permission to put this in 6.1p1?

Can you make the size of the result set above which diskfiles will be used
configurable? That way ppl with loads of RAM can use huge buffers, and ppl
with little RAM can keep that RAM free for other processes.

Maarten

_____________________________________________________________________________
|     Maarten Boekhold, Faculty of Electrical Engineering TU Delft,   NL    |
|        M.Boekhold@et.tudelft.nl   boekhold@gopher.library.tudelft.nl      |
-----------------------------------------------------------------------------

------------------------------

Re: [HACKERS] \dt and disk access

From
Bruce Momjian
Date:
>
> On Sat, 21 Jun 1997, Bruce Momjian wrote:
>
> > > ExecSort calls psort() to do the work. psort() uses a Knuth algorithm.
> > >
> > > I removed the psort() call from ExecSort, and the execution time was:
> > >
> > >     Select with order by using psort():        0.30
> > >
> > >     Select with order by NOT using psort():        0.07
> > >
> > > Now of course, not using psort() causes the data not to be sorted, but
> > > there must be some inefficiency if sorting one row takes 0.23 seconds on
> > > a PP200 with 64MB of RAM.  I suspect the other 0.06 seconds is caused by
> > > the creation of the temp table and moving the rows around.
> >
> > It looks like psort()'s use of disk files to do the sorting is causing
> > the slowdown, with the multiple create's and unlink's.  I am going to
> > modify psort() to use memory buffers instead of disk files for sorting
> > small SELECT results.  psort() is only called by ExecSort, which is used
> > by ORDER BY's.
> >
> > Permission to put this in 6.1p1?
>
> Can you make the size of the result set above which diskfiles will be used
> configurable? That way ppl with loads of RAM can use huge buffers, and ppl
> with little RAM can keep that RAM free for other processes.

Good idea.  I think the final code will not use the buffer cache, but
directly palloc the memory it needs and copy the rows to be sorted into
there.

I also hope to eliminate the copy to a temp table.  Currently, the code
copies to a temp table, loads the rows into "tape" files, then writes to
a new table, which seems like a waste.  Paul's FASTBUILD btree code uses
ftruncate() after he has loaded the rows from the table, before he
writes them back.

I will contact him directly for comments.

--
Bruce Momjian
maillist@candle.pha.pa.us

------------------------------

Re: [HACKERS] \dt and disk access

From
The Hermit Hacker
Date:
On Sat, 21 Jun 1997, Bruce Momjian wrote:

> > ExecSort calls psort() to do the work. psort() uses a Knuth algorithm.
> >
> > I removed the psort() call from ExecSort, and the execution time was:
> >
> >     Select with order by using psort():        0.30
> >
> >     Select with order by NOT using psort():        0.07
> >
> > Now of course, not using psort() causes the data not to be sorted, but
> > there must be some inefficiency if sorting one row takes 0.23 seconds on
> > a PP200 with 64MB of RAM.  I suspect the other 0.06 seconds is caused by
> > the creation of the temp table and moving the rows around.
>
> It looks like psort()'s use of disk files to do the sorting is causing
> the slowdown, with the multiple create's and unlink's.  I am going to
> modify psort() to use memory buffers instead of disk files for sorting
> small SELECT results.  psort() is only called by ExecSort, which is used
> by ORDER BY's.
>
> Permission to put this in 6.1p1?

    By all means...

    My next question is what is considered a 'small SELECT result'?  If
I have a 32Meg machine, that 'small' would be different, I would assume,
than a 128Meg machine.

    With that in mind, how about some sort of 'switch', similar to
-B, that allows you to stipulate total RAM of a machine (or virtual memory?),
or something similar to that?

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org

------------------------------

Re: [HACKERS] \dt and disk access

From
The Hermit Hacker
Date:
On Sat, 21 Jun 1997, Bruce Momjian wrote:

> I will wait for 6.1p1, and then apply the fix.  I plan to just use
> qsort() on an array of tuple pointers if the size of the ORDER BY result
> is under 256k bytes, which is only 1/2 of the default shared buffer
> size.  I will run tests to see if there is major speed improvement at
> that size.  If not, I will knock it down to 128k or 64k, just to be safe
> that it will not swamp the shared pool on a busy system.  I may even key
> the trigger value on the number of shared buffers allocated.  This has
> got to be faster than what it does now.

    Why use shared memory for this?  Why not use mmap() for it?  Come
to think of it, mmap()'ng it would have better scalability, no?  If you
already know the result size (ie. 256k), you could have the code try
to mmap() that amount of memory to do the sort in.  If the mmap() fails,
revert to using a file...now the result size of the SELECT doesn't have a
compiled-in "limit" to the size of memory used for the sort...it's restricted
to the amount of memory on the machine (ie. if I double the RAM, I should be
able to have double the results sorted in memory instead of on disk)

    Add a flag over top of that, like -B, that states the *max* result
size to do an in-memory sort with...or, rather, the *max* to try to do one
with.

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org

------------------------------

Re: [HACKERS] \dt and disk access

From
Bruce Momjian
Date:

------------------------------

Re: [HACKERS] \dt and disk access

From
Bruce Momjian
Date:
> > Permission to put this in 6.1p1?
>
>     By all means...

Thomas says no for 6.1p1.  I have no opinion on this.  However, I don't
want to rush out a fix until I have heard from Paul M. Aoki and Vadim,
and gotten their opinion for the most efficient way to fix this.

>
>     My next question is what is considered a 'small SELECT result'?  If
> I have a 32Meg machine, that 'small' would be different, I would assume,
> than a 128Meg machine.

I think the sort is used by ORDER BY and joins where there is not an
existing index.

>     With that in mind, how about some sort of 'switch', similar to
> -B, that allows you to stipulate total RAM of a machine (or virtual memory?),
> or something similar to that?

I am not sure how to do this.  There is sysctl() and getconf(), but
neither of them is standard on all platforms.

--
Bruce Momjian
maillist@candle.pha.pa.us

------------------------------

Re: [HACKERS] \dt and disk access

From
Bruce Momjian
Date:
>     Why use shared memory for this?  Why not use mmap() for it?  Come
> to think of it, mmap()'ng it would have better scalability, no?  If you
> already know the result size (ie. 256k), you could have the code try
> to mmap() that amount of memory to do the sort in.  If the mmap() fails,
> revert to using a file...now the result size of the SELECT doesn't have a
> compiled-in "limit" to the size of memory used for the sort...it's restricted
> to the amount of memory on the machine (ie. if I double the RAM, I should be
> able to have double the results sorted in memory instead of on disk)

I don't think we need mmap() because we never need to put it on disk.
Just palloc() the memory.


>
>     Add a flag over top of that, like -B, that states the *max* result
> size to do an in-memory sort with...or, rather, the *max* to try to do one
> with.
>
> Marc G. Fournier
> Systems Administrator @ hub.org
> primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org
>
>


--
Bruce Momjian
maillist@candle.pha.pa.us

------------------------------

Re: [HACKERS] \dt and disk access

From
"Vadim B. Mikheev"
Date:
Igor wrote:
>
> Well, in case you are wondering.....I populated a table with about 500
> tuples.
> Running select * on that table yielded no errors from purify..
> Running select * .. order by (a non-indexed field) gave the following:
>   total memory usage: 319,803 bytes.
> Of this amount of memory, 57,344 bytes were leaks.
                            ^^^^^^
57344/8192 = 7

>
> I then populated another table with 2600+ tuples.
> These are the results with order by:
>   Total memory used: 786,763
> Of this memory, 524,288 bytes are memory leaks!!! Waste of half a meg of
                  ^^^^^^^
524288/8192 = 64

I believe that it's just local buffers allocation, so it's not
real memory leak.

> ram on 2600 or so tuples is bad....And what if you have 20,000 tuples? Or
> a 100,000 ?

64 is max number of local buffers...

Vadim

------------------------------

Re: [HACKERS] \dt and disk access

From
Bruce Momjian
Date:
> > It looks like psort()'s use of disk files to do the sorting is causing
> > the slowdown, with the multiple create's and unlink's.  I am going to
> > modify psort() to use memory buffers instead of disk files for sorting
> > small SELECT results.  psort() is only called by ExecSort, which is used
> > by ORDER BY's.
> >
> > Permission to put this in 6.1p1?
>
> Can you make the size of the result set above which diskfiles will be used
> configurable? That way ppl with loads of RAM can use huge buffers, and ppl
> with little RAM can keep that RAM free for other processes.

If I need a configuration value, I will either determine the amount of
RAM portably, or base the value on the number of shared buffers
requested with -B.

--
Bruce Momjian
maillist@candle.pha.pa.us

------------------------------

Re: [HACKERS] \dt and disk access

From
The Hermit Hacker
Date:
On Sun, 22 Jun 1997, Bruce Momjian wrote:

> >     My next question is what is considered a 'small SELECT result'?  If
> > I have a 32Meg machine, that 'small' would be different, I would assume,
> > than a 128Meg machine.
>
> I think the sort is used by ORDER BY and joins where there is not an
> existing index.

    Ummm...okay.

    Now, what is considered a "small SELECT result"? :)

> >     With that in mind, how about some sort of 'switch', similar to
> > -B, that allows you to stipulate total RAM of a machine (or virtual memory?),
> > or something similar to that?
>
> I am not sure how to do this.  There is sysctl() and getconf(), but
> neither of them is standard on all platforms.

    Huh?  You seem to have totally missed the question on that one
there Bruce :(  Where does sysctl() and getconf() fit into having a run
time switch? *confused look* *grin*

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org

------------------------------

Re: [HACKERS] \dt and disk access

From
The Hermit Hacker
Date:
On Sun, 22 Jun 1997, Bruce Momjian wrote:

> >     Why use shared memory for this?  Why not use mmap() for it?  Come
> > to think of it, mmap()'ng it would have better scalability, no?  If you
> > already know the result size (ie. 256k), you could have the code try
> > to mmap() that amount of memory to do the sort in.  If the mmap() fails,
> > revert to using a file...now the result size of the SELECT doesn't have a
> > compiled-in "limit" to the size of memory used for the sort...it's restricted
> > to the amount of memory on the machine (ie. if I double the RAM, I should be
> > able to have double the results sorted in memory instead of on disk)
>
> I don't think we need mmap() because we never need to put it on disk.
> Just palloc() the memory.

    okay, I'm not totally familiar with palloc()...I did a project a while
back where we mmap()'d a region of memory so that we didn't need to write
to disk.  Essentially, I used ftruncate() to create the size of the buffer I
wanted to mmap(), and never flushed the memory...

    Not sure what palloc() does though...should look into it, eh? :(

Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org

------------------------------

Re: [HACKERS] \dt and disk access

From
Bruce Momjian
Date:
>
> On Sun, 22 Jun 1997, Bruce Momjian wrote:
>
> > >     My next question is what is considered a 'small SELECT result'?  If
> > > I have a 32Meg machine, that 'small' would be different, I would assume,
> > > than a 128Meg machine.
> >
> > I think the sort is used by ORDER BY and joins where there is not an
> > existing index.
>
>     Ummm...okay.
>
>     Now, what is considered a "small SELECT result"? :)

I don't know yet.  Any sort that will not fit in physical memory should
use the default psort() code, because it prevents the thrashing of
moving data in and out of virtual memory (SWAP) while the sort is being
performed.  It is hard to determine what would fit in PHYSICAL, not
virtual memory, and in fact this number changes depending on how busy
the machine is.


>
> > >     With that in mind, how about some sort of 'switch', similar to
> > > -B, that allows you to stipulate total RAM of a machine (or virtual memory?),
> > > or something similar to that?
> >
> > I am not sure how to do this.  There is sysctl() and getconf(), but
> > neither of them is standard on all platforms.
>
>     Huh?  You seem to have totally missed the question on that one
> there Bruce :(  Where does sysctl() and getconf() fit into having a run
> time switch? *confused look* *grin*

Yes, I guess we will just have to make this a run-time switch, though
keying in on -B may be a good idea, that way they have only one thing to
tune.


--
Bruce Momjian
maillist@candle.pha.pa.us


------------------------------

Re: [HACKERS] \dt and disk access

From
Bruce Momjian
Date:
>     okay, I'm not totally familiar with palloc()...I did a project a while
> back where we mmap()'d a region of memory so that we didn't need to write
> to disk.  Essentially, I used ftruncate() to create the size of the buffer I
> wanted to mmap(), and never flushed the memory...
>
>     Not sure what palloc() does though...should look into it, eh? :(

palloc() basically malloc's from a pool that gets auto-freed at the end
of statements or transactions.
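A toy sketch of those palloc() semantics: a linked list of malloc'd blocks, released in one shot. Pool, pool_alloc, and pool_free_all are made-up names; the real PostgreSQL memory-context code is considerably more elaborate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct PoolBlock {
    struct PoolBlock *next;
    max_align_t pad;            /* keeps the payload suitably aligned */
} PoolBlock;

typedef struct {
    PoolBlock *blocks;          /* linked list of everything handed out */
} Pool;

/* Hand out memory; the pool remembers it so it can be freed later. */
void *pool_alloc(Pool *p, size_t size)
{
    PoolBlock *b = malloc(sizeof(PoolBlock) + size);
    if (b == NULL)
        return NULL;
    b->next = p->blocks;
    p->blocks = b;
    return (char *)b + sizeof(PoolBlock);
}

/* The "auto-free at end of statement/transaction" step. */
void pool_free_all(Pool *p)
{
    while (p->blocks != NULL) {
        PoolBlock *next = p->blocks->next;
        free(p->blocks);
        p->blocks = next;
    }
}
```

The point of the scheme is that a query can allocate freely without tracking individual pointers; nothing leaks past the end of the statement because the pool is torn down wholesale.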

We could use mmap(), but why bother if we can just malloc() all the
memory we need?  That is what has me confused.  Anyone know an advantage
to using mmap() when you really don't want the image on disk?

I would suspect that mmap() would use the file system as backing store
for the memory rather than swap.  But the thrashing of large sorts is
going to be as big a problem with this as with malloc'ed memory backed
by swap (in fact slower, because swap backing store is faster than
file-system backing store), so we might as well use psort() for those
large cases.  psort() does its sorts in pieces with merging, and doesn't
load large amounts of data into memory at once, so it doesn't thrash.  It
uses disk files to store intermediate sorts, hence the large startup time.
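
The run-and-merge strategy psort uses can be sketched in a few lines of Python (a simplification: in-memory lists stand in for the intermediate tape files):

```python
import heapq

def external_sort(stream, run_size):
    """Sort the input in fixed-size pieces (runs), then merge the runs.
    The real code writes each run to a 'tape' file instead of a list."""
    runs, buf = [], []
    for item in stream:
        buf.append(item)
        if len(buf) >= run_size:       # a run's worth of memory is full
            runs.append(sorted(buf))
            buf = []
    if buf:
        runs.append(sorted(buf))
    # the merge holds only one element per run in memory at a time
    return list(heapq.merge(*runs))

assert external_sort([5, 1, 4, 2, 3], 2) == [1, 2, 3, 4, 5]
```

This is why memory use stays bounded regardless of input size, at the cost of the startup work of writing the runs out.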


- --
Bruce Momjian
maillist@candle.pha.pa.us

------------------------------

Re: [HACKERS] \dt and disk access

От
"Vadim B. Mikheev"
Дата:
Bruce Momjian wrote:
>
> > Can you make the size of the result set above which diskfiles will be used
> > configurable? That way ppl with loads of RAM can use huge buffers, and ppl
> > with little RAM can keep that RAM free for other processes.
>
> If I need a configuration value, I will either determine the amount of
> RAM portably, or base the value on the number of shared buffers
> requested with -B.

Why not use an additional flag, as suggested by Mark?
Using -B is not good: the size of the shared memory segment may be
limited by the OS (under FreeBSD 2.1.0 it's limited to 4M) or by the
system administrator, and so the backend would use only 2-4 MB of RAM
for a sort on a box with 192 MB of RAM.

This flag may be useful for joinpath.c:EnoughMemoryForHashjoin() too...

Also note that, following
>       - make psort read directly from the executor node below it
>       (instead of an input relation)

it will be impossible to know the size of the result before sort
startup.  So you have to palloc up to the in-memory limit and switch to
'tape' files dynamically.

Also
>       - makes the Sort node read directly from the last set of psort runs
>       (instead of an output relation)

requires changes to ExecSortMarkPos()/ExecSortRestrPos(), which
use heap_markpos()/heap_restrpos() (because the last set of
psort runs is not a normal heap relation).

But both changes of nodeSort.c are what we really need.

Vadim

------------------------------

Re: [HACKERS] \dt and disk access

От
Bruce Momjian
Дата:
>
> Bruce Momjian wrote:
> >
> > > Can you make the size of the result set above which diskfiles will be used
> > > configurable? That way ppl with loads of RAM can use huge buffers, and ppl
> > > with little RAM can keep that RAM free for other processes.
> >
> > If I need a configuration value, I will either determine the amount of
> > RAM portably, or base the value on the number of shared buffers
> > requested with -B.
>
> Why not use an additional flag, as suggested by Mark?
> Using -B is not good: the size of the shared memory segment may be
> limited by the OS (under FreeBSD 2.1.0 it's limited to 4M) or by the
> system administrator, and so the backend would use only 2-4 MB of RAM
> for a sort on a box with 192 MB of RAM.

OK, I will use a new flag.

>
> This flag may be useful for joinpath.c:EnoughMemoryForHashjoin() too...
>
> Also note that following
> >       - make psort read directly from the executor node below it
> >       (instead of an input relation)
>
> it will be impossible to know the size of the result before sort
> startup.  So you have to palloc up to the in-memory limit and switch to
> 'tape' files dynamically.

I like this idea.  I was struggling with how I was going to determine the
size of the result anyway.

I have checked the Mariposa source changes Paul mentioned.  They do
indeed change the behavior of psort().  It still uses tape files, so I
will need to increase the memory used for each sort, and only create the
tape files if the initial sort does not fit within the allocated memory.

>
> Also
> >       - makes the Sort node read directly from the last set of psort runs
> >       (instead of an output relation)
>
> requires changes to ExecSortMarkPos()/ExecSortRestrPos(), which
> use heap_markpos()/heap_restrpos() (because the last set of
> psort runs is not a normal heap relation).
>
> But both changes of nodeSort.c are what we really need.

With the new psort(), you can do multiple sorts at the same time.  The
new psort() comment says:

 *      The old psort.c's routines formed a temporary relation from the merged
 * sort files. This version keeps the files around instead of generating the
 * relation from them, and provides interface functions to the file so that
 * you can grab tuples, mark a position in the file, restore a position in the
 * file. You must now explicitly call an interface function to end the sort,
 * psort_end, when you are done.
 *      Now most of the global variables are stuck in the Sort nodes, and
 * accessed from there (they are passed to all the psort routines) so that
 * each sort running has its own separate state. This is facilitated by having
 * the Sort nodes passed in to all the interface functions.
 *      The one global variable that all the sorts still share is SortMemory.
 *      You should now be allowed to run two or more psorts concurrently,
 * so long as the memory they eat up is not greater than SORTMEM, the initial
 * value of SortMemory.                                         -Rex 2.15.1995
 *
 *    Use the tape-splitting method (Knuth, Vol. III, pp281-86) in the future.

I am uploading mariposa-alpha-1.tar.gz to the PostgreSQL ftp incoming
directory because I think I am going to need help on this one.  The
official Mariposa ftp site is very, very slow and unreliable.  This
release is dated June 1996 and is the newest available.

- --
Bruce Momjian
maillist@candle.pha.pa.us

------------------------------

Re: [HACKERS] \dt and disk access

От
The Hermit Hacker
Дата:
On Sun, 22 Jun 1997, Bruce Momjian wrote:

> >     okay, I'm not totally familiar with palloc()...I did a project a while
> > back where we mmap()'d a region of memory so that we didn't need to write
> > to disk.  Essentially, I used ftruncate() to set the size of the buffer I
> > wanted to mmap(), and never flushed the memory...
> >
> >     Not sure what palloc() does though...should look into it, eh? :(
>
> palloc() basically malloc's from a pool that gets auto-freed at the end
> of statements or transactions.
>
> We could use mmap(), but why bother if we can just malloc() all the
> memory we need?  That is what has me confused.  Anyone know an advantage
> to using mmap() when you really don't want the image on disk?

    Okay...this is something I have never truly succeeded in doing,
and I do agree that it would be better than mmap()...mmap() is good when
multiple processes want/need to access the same memory and, from what I've
seen, sort of acts as an intermediary between malloc() and shmem()...

    Now I understand what you are thinking :)


Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org

------------------------------

Re: [HACKERS] \dt and disk access

От
David Friend
Дата:
On Sun, 22 Jun 1997, Thomas G. Lockhart wrote:

> I don't know which Knuth algorithm the code uses, but some sort
> algorithms perform very poorly on data which happens to be already
> sorted. QuickSort, which comes on many Unix boxes (the qsort() call) is
> quickest on completely unordered/disordered data and its slowest
> performance is on already perfectly sorted data (I'm doing this from
> memory, but the facts may be mostly correct :).

From what I remember, the quick sort algorithm works just as fast on
sorted data as on unsorted data.  In other words, the algorithm does not
take advantage of the fact that the data is sorted.

On the other hand, if you go to something like bubble sort, it will
work faster than quick sort if the data is already sorted and you are
adding one more item, but on unsorted data it takes enormously longer
to sort a large list.
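
A quick Python illustration of that point (a sketch, not either algorithm in full): adding one item to already-sorted data needs only a single insertion, while a full re-sort redoes all the comparison work from scratch.

```python
import bisect

def add_one(sorted_list, item):
    # Insert one new item into already-sorted data: find the spot and
    # splice it in, rather than re-sorting the whole list.
    i = bisect.bisect_left(sorted_list, item)
    return sorted_list[:i] + [item] + sorted_list[i:]

assert add_one([1, 2, 4, 5], 3) == [1, 2, 3, 4, 5]
```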

David Friend                ! cq995@freenet.carleton.ca
Atlantis Scientific Inc.        ! david.friend@atlsci.com
20 Colonnade Rd, Suite 110        ! 613-727-1087 (voice)
Ottawa, Ontario, CANADA  K2E 7M6    ! 800-265-3894 (voice)
ERGOvista Scientific Image Analysis    ! 613-727-5853 (fax)

------------------------------

Re: [HACKERS] \dt and disk access

От
Bruce Momjian
Дата:
> From what I remember, the quick sort algorithm works just as fast on
> sorted data as on unsorted data.  In other words, the algorithm does not
> take advantage of the fact that the data is sorted.
>
> On the other hand, if you go to something like bubble sort, it will
> work faster than quick sort if the data is already sorted and you are
> adding one more item, but on unsorted data it takes enormously longer
> to sort a large list.
>

One thing I am going to look into is replacing the btree-build in the
current psort with a qsort of HeapTuple pointers.

- --
Bruce Momjian
maillist@candle.pha.pa.us

------------------------------

End of hackers-digest V1 #398
*****************************

Re: [HACKERS] \dt and disk access

От
"Vadim B. Mikheev"
Дата:
Bruce Momjian wrote:
>
> > From what I remember, the quick sort algorithm works just as fast on
> > sorted data as on unsorted data.  In other words, the algorithm does not
> > take advantage of the fact that the data is sorted.
> >
> > On the other hand, if you go to something like bubble sort, it will
> > work faster than quick sort if the data is already sorted and you are
> > adding one more item, but on unsorted data it takes enormously longer
> > to sort a large list.
> >
>
> One thing I am going to look into is replacing the btree-build in the
                                                     ^^^^^^^^^^^
> current psort with a qsort of HeapTuple pointers.

I don't fully understand what you mean...
btree-build uses qsort to sort the items on a page and has its own
priority queue methods (instead of lselect.c's leftist-tree
selection algorithm).
Do you want to use qsort for in-memory sorts?

BTW, I may suggest an additional trick:
if the size of the sort-keys is (much?) smaller than the size of the tuple
(select _int4_field_, _big_text_field_ from ... order by _int4_field_)
and you are going to switch to tape methods (i.e. you've palloc'ed
all the sort memory), then you may try to create a temp-file to store
the (unsorted) tuples and deal with the sort-keys only.  Once you've
got the sort-keys sorted (or better say, records which hold the sort-keys
and pointers, i.e. temp-file offsets, to the original tuples), you
may use these pointers to get each requested tuple from the temp-file.

Even for select int4_field from ... order by int4_field,
the current psort pallocs 48 (header) + 4 (field) + 4 (alignment) = 56 bytes
for each tuple - why not work with 4 (field) + 4 (offset) = 8 bytes
of sort-records (tm :-)) ?
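
Vadim's trick can be sketched in Python (names hypothetical; a list stands in for the temp-file of unsorted tuples):

```python
def sort_via_key_records(tuples, key_col):
    # Instead of moving whole tuples through the sort, build small
    # (key, offset) records, sort those, then fetch each tuple from the
    # stored copy by its offset; the cost is that the final fetches
    # seek around the stored copy.
    stored = list(tuples)                         # stands in for the temp-file
    records = [(t[key_col], off) for off, t in enumerate(stored)]
    records.sort()                                # only the small records move
    return [stored[off] for _, off in records]

rows = [(3, 'big text c'), (1, 'big text a'), (2, 'big text b')]
assert sort_via_key_records(rows, 0) == sorted(rows)
```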

Vadim

------------------------------

Re: [HACKERS] \dt and disk access

От
Bruce Momjian
Дата:
> I don't fully understand what you mean...
> btree-build uses qsort to sort the items on a page and has its own
> priority queue methods (instead of lselect.c's leftist-tree
> selection algorithm).
> Do you want to use qsort for in-memory sorts?

When I said btree-build, I meant the leftist tree sort, so I am
considering replacing the leftist-tree selection algorithm with a qsort
of an array of HeapTuple pointers.  Any comments on this?  The only
performance problem I see is re-allocing the array as the number of
entries grows, but if I double the size each time I exceed the current
size, I don't believe it will be much of a problem.  The leftist tree
looked very funny to me, with recursion and stuff.  I saw the new
FASTBUILD sort used qsort, so I thought it may be a good idea.
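
The doubling scheme described above can be sketched as follows (a Python sketch of the growth pattern, not the actual backend code); the finished array is what would then be handed to qsort:

```python
def build_with_doubling(items):
    # Grow a pointer array by doubling its capacity each time it fills
    # (like a repalloc at 2x), so total copying stays linear overall.
    cap, used, arr = 1, 0, [None]
    reallocs = 0
    for it in items:
        if used == cap:
            cap *= 2
            arr = arr + [None] * (cap - used)   # copy into the bigger array
            reallocs += 1
        arr[used] = it
        used += 1
    return arr[:used], reallocs

ptrs, n = build_with_doubling(list(range(100)))
assert ptrs == list(range(100))
assert n == 7    # capacities: 1, 2, 4, 8, 16, 32, 64, 128
```

Even for 100 entries, only 7 reallocations happen, which is why the re-allocing cost should not be much of a problem.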

>
> BTW, I may suggest an additional trick:
> if the size of the sort-keys is (much?) smaller than the size of the tuple
> (select _int4_field_, _big_text_field_ from ... order by _int4_field_)
> and you are going to switch to tape methods (i.e. you've palloc'ed
> all the sort memory), then you may try to create a temp-file to store
> the (unsorted) tuples and deal with the sort-keys only.  Once you've
> got the sort-keys sorted (or better say, records which hold the sort-keys
> and pointers, i.e. temp-file offsets, to the original tuples), you
> may use these pointers to get each requested tuple from the temp-file.

I am a little lost here.  The Mariposa sort code allows ExecSort to read
directly from the final sorted tape file, preventing the copy to a temp
table.  So, if I don't store the full tuple in the tape file, I can't
pass a REAL tuple back to ExecSort.

I assume you are suggesting putting all the tuples in one tape file,
then doing the tape-file sort ONLY with the indexed fields, then
using those indexed fields to retrieve the data from the tape.

The problem I see is that the final read of sorted rows from the full
tape file will be seeking all over the file, causing poor performance.
Now, not having to move around extra data during the sort is going to be
good for performance, but I am not sure what the net result will be.

I think you are correct that it will be a win to only sort the needed
columns, but considering that any sort that uses tape files will be big,
and hence will not fit in the file system cache, and considering the
extra code to do this, I am uncertain about its usefulness.

>
> Even for select int4_field from ... order by int4_field,
> the current psort pallocs 48 (header) + 4 (field) + 4 (alignment) = 56 bytes
> for each tuple - why not work with 4 (field) + 4 (offset) = 8 bytes
> of sort-records (tm :-)) ?

Again, can I do this if I need to pass back the actual row?

Also, another confusing issue is that when I start the sort, I don't
know whether it is going to fit into memory or not, so I must store the
entire row in memory because I might be returning a pointer and not
using tape files.  Can I then switch to tape files in the middle of the
sort and start doing my comparisons differently?  Seems hard to me.

I am just guessing on these issues.  Any comments?

- --
Bruce Momjian
maillist@candle.pha.pa.us

------------------------------

Re: [HACKERS] \dt and disk access

От
"Vadim B. Mikheev"
Дата:
Bruce Momjian wrote:
>
> > I don't fully understand what you mean...
> > btree-build uses qsort to sort the items on a page and has its own
> > priority queue methods (instead of lselect.c's leftist-tree
> > selection algorithm).
> > Do you want to use qsort for in-memory sorts?
>
> When I said btree-build, I meant the leftist tree sort, so I am
> considering replacing the leftist-tree selection algorithm with a qsort
> of an array of HeapTuple pointers.  Any comments on this?  The only
> performance problem I see is re-allocing the array as the number of
> entries grows, but if I double the size each time I exceed the current
> size, I don't believe it will be much of a problem.  The leftist tree
> looked very funny to me, with recursion and stuff.  I saw the new
> FASTBUILD sort used qsort, so I thought it may be a good idea.

OK.  What do you propose to use for merging the tapes - lselect.c or
the btree queue methods?

>
> >
> > BTW, I may suggest an additional trick:
> > if the size of the sort-keys is (much?) smaller than the size of the tuple
> > (select _int4_field_, _big_text_field_ from ... order by _int4_field_)
> > and you are going to switch to tape methods (i.e. you've palloc'ed
> > all the sort memory), then you may try to create a temp-file to store
> > the (unsorted) tuples and deal with the sort-keys only.  Once you've
> > got the sort-keys sorted (or better say, records which hold the sort-keys
> > and pointers, i.e. temp-file offsets, to the original tuples), you
> > may use these pointers to get each requested tuple from the temp-file.
>
...
>
> The problem I see is that the final read of sorted rows from the full
> tape file will be seeking all over the file, causing poor performance.

I see it.
BTW, using indices for sorts has the same problem!
As a workaround you may try to reduce seek randomness by appending the
offset to the sort-keys (at the end) - but this can only help in the
case of duplicates, of course.

On the other hand, when you are merging tape-files, don't you do
random seeks by switching from one tape to another?
I'm not sure, but it seems that more than one pass over all the
tape-files may be required...

> Now, not having to move around extra data during the sort is going to be
> good for performance, but I am not sure what the result will be.

(Note that I'm not sure either.)

>
> I think you are correct that it will be a win to only sort the needed
> columns, but considering that any sort that uses tape files will be big,
> and hence will not fit in the file system cache, and considering the
> extra code to do this, I am uncertain about its usefulness.

OK - select int4 ... order by int4, using 4M for the in-memory sort:

1. entire-tuple method
   the in-memory sort can be used for 4 000 000 / 56 =~ 71428 tuples
   in the result.
2. sort-records method
   in-memory sort for 4 000 000 / 12 = 333333 tuples in the result
   (12 = 8 + 4: 4 bytes to hold the NULLs bitmask)
   + writing 333333*56 = 18 666 648 bytes into the temp-file
     (btw, we may exclude the 8 bytes of int4 + alignment from the tuples
      in the temp-file - we have the attribute values in the sort-records -
      so it may be reduced to 15 999 984: not so big in this case,
      but it may make sense for bigger sort-keys)
   + 333333 seeks over the temp-file when returning results
3. entire-tuple method for a 333333-tuple result
   writing of 6 tape-files with 18 666 648 bytes total size
   + ??? seeks in merging
   + writing 18 666 648 bytes of the final tape-file
   Is that all?  Or may more than one pass (read/write) over the
   tape-files be required?

Vadim

------------------------------

Re: [HACKERS] \dt and disk access

От
hotz@jpl.nasa.gov (Henry B. Hotz)
Дата:
At 9:48 PM 6/23/97, David Friend wrote:
>On Sun, 22 Jun 1997, Thomas G. Lockhart wrote:
>
>> I don't know which Knuth algorithm the code uses, but some sort
>> algorithms perform very poorly on data which happens to be already
>> sorted. QuickSort, which comes on many Unix boxes (the qsort() call) is
>> quickest on completely unordered/disordered data and its slowest
>> performance is on already perfectly sorted data (I'm doing this from
>> memory, but the facts may be mostly correct :).
>
>From what I remember, the quick sort algorithm works just as fast on
>sorted data as on unsorted data.  In other words, the algorithm does not
>take advantage of the fact that the data is sorted.

What Thomas says is correct for the standard textbook description of
quicksort.  It is not hard to reverse the order of the compare and fix it
so that its worst case is reverse-sorted instead of presorted data.  I would
expect that most vendors' implementations have this fix, but no guarantees.

I have skipped the first part of this discussion.  Quicksort is a good
in-memory algorithm for uniprocessors.  It may not be applicable to
database sorts of very large data sets.
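
The pivot point is easy to demonstrate with a throwaway Python quicksort (a sketch, not any vendor's qsort): the shape of the recursion is decided entirely by the pivot rule, and a median-of-three rule keeps presorted input from being the bad case.

```python
def quicksort(xs, pivot):
    # Recursion depth (and thus the worst case) depends only on how
    # well 'pivot' splits the data.
    if len(xs) <= 1:
        return list(xs)
    p = pivot(xs)
    return (quicksort([x for x in xs if x < p], pivot)
            + [x for x in xs if x == p]
            + quicksort([x for x in xs if x > p], pivot))

first = lambda xs: xs[0]                                   # textbook pivot
median3 = lambda xs: sorted((xs[0], xs[len(xs) // 2], xs[-1]))[1]

data = list(range(200))
assert quicksort(data, first) == data     # correct, but maximally lopsided splits
assert quicksort(data, median3) == data   # presorted input now splits evenly
```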

Signature failed Preliminary Design Review.
Feasibility of a new signature is currently being evaluated.
h.b.hotz@jpl.nasa.gov, or hbhotz@oxy.edu

------------------------------

Re: [HACKERS] \dt and disk access

От
Bruce Momjian
Дата:
>
> At 9:48 PM 6/23/97, David Friend wrote:
> >On Sun, 22 Jun 1997, Thomas G. Lockhart wrote:
> >
> >> I don't know which Knuth algorithm the code uses, but some sort
> >> algorithms perform very poorly on data which happens to be already
> >> sorted. QuickSort, which comes on many Unix boxes (the qsort() call) is
> >> quickest on completely unordered/disordered data and its slowest
> >> performance is on already perfectly sorted data (I'm doing this from
> >> memory, but the facts may be mostly correct :).
> >
> >From what I remember, the quick sort algorithm works just as fast on
> >sorted data as on unsorted data.  In other words, the algorithm does not
> >take advantage of the fact that the data is sorted.
>
> What Thomas says is correct for the standard textbook description of
> quicksort.  It is not hard to reverse the order of the compare and fix it
> so that its worst case is reverse-sorted instead of presorted data.  I would
> expect that most vendors' implementations have this fix, but no guarantees.
>
> I have skipped the first part of this discussion.  Quicksort is a good
> in-memory algorithm for uniprocessors.  It may not be applicable to
> database sorts of very large data sets.

The quicksort is to be used only for in-memory sorting.  Larger sorts
are going to use the Knuth, volume 3, page 280 tape-sorting algorithm.

If anyone wants a digest of this discussion, I can easily send it to
them.

- --
Bruce Momjian
maillist@candle.pha.pa.us

------------------------------