Обсуждение: Script to compute random page cost
Because we have seen many complains about sequential vs index scans, I wrote a script which computes the value for your OS/hardware combination. Under BSD/OS on one SCSI disk, I get a random_page_cost around 60. Our current postgresql.conf default is 4. What do other people get for this value? Keep in mind if we increase this value, we will get a more sequential scans vs. index scans. One flaw in this test is that it randomly reads blocks from different files rather than randomly reading from the same file. Do people have a suggestion on how to correct this? Does it matter? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073 #!/bin/bash trap "rm -f /tmp/$$" 0 1 2 3 15 BLCKSZ=8192 if [ "$RANDOM" = "$RANDOM" ] then echo "Your shell does not support \$RANDOM. Try using bash." 1>&2 exit 1 fi # XXX We assume 0 <= random <= 32767 echo "Collecting sizing information ..." TEMPLATE1=`du -s "$PGDATA/base/1" | awk '{print $1}'` FULL=`du -s "$PGDATA/base" | awk '{print $1}'` if [ "$FULL" -lt `expr "$TEMPLATE1" \* 4` ] then echo "Your installation should have at least four times the data stored in template1 to yield meaningful results"1>&2 exit 1 fi find "$PGDATA/base" -type f -exec ls -ld {} \; | awk '$5 % '"$BLCKSZ"' == 0 {print $5 / '"$BLCKSZ"', $9}' | grep -v '^0 ' > /tmp/$$ TOTAL=`awk 'BEGIN {sum=0} {sum += $1} END {print sum}' /tmp/$$` echo "Running random access timing test ..." START=`date '+%s'` PAGES=1000 while [ "$PAGES" -ne 0 ] do BIGRAND=`expr "$RANDOM" \* 32768 + "$RANDOM"` OFFSET=`awk 'BEGIN{printf "%d\n", ('"$BIGRAND"' / 2^30) * '"$TOTAL"'}'` RESULT=`awk ' BEGIN {offset = 0} offset + $1 > '"$OFFSET"' \ {print $2, '"$OFFSET"' - offset ; exit} {offset += $1}' /tmp/$$` FILE=`echo "$RESULT" | awk '{print $1}'` OFFSET=`echo "$RESULT" | awk '{print $2}'` dd bs="$BLCKSZ" seek="$OFFSET" count=1 if="$FILE" of="/dev/null" >/dev/null 2>&1 PAGES=`expr "$PAGES" - 1` done STOP=`date '+%s'` RANDTIME=`expr "$STOP" - "$START"` echo "Running sequential access timing test ..." START=`date '+%s'` # We run the random test 10 times more because it is quicker and # we need it to run for a while to get accurate results. PAGES=10000 while [ "$PAGES" -ne 0 ] do BIGRAND=`expr "$RANDOM" \* 32768 + "$RANDOM"` OFFSET=`awk 'BEGIN{printf "%d\n", ('"$BIGRAND"' / 2^30) * '"$TOTAL"'}'` RESULT=`awk ' BEGIN {offset = 0} offset + $1 > '"$OFFSET"' \ {print $2, $1; exit} {offset += $1}' /tmp/$$` FILE=`echo "$RESULT" | awk '{print $1}'` FILEPAGES=`echo "$RESULT" | awk '{print $2}'` if [ "$FILEPAGES" -gt "$PAGES" ] then FILEPAGES="$PAGES" fi dd bs="$BLCKSZ" count="$FILEPAGES" if="$FILE" of="/dev/null" >/dev/null 2>&1 PAGES=`expr "$PAGES" - "$FILEPAGES"` done STOP=`date '+%s'` SEQTIME=`expr "$STOP" - "$START"` echo awk 'BEGIN {printf "random_page_cost = %f\n", ('"$RANDTIME"' / '"$SEQTIME"') * 10}'
OK, turns out that the loop for sequential scan ran fewer times and was skewing the numbers. I have a new version at: ftp://candle.pha.pa.us/pub/postgresql/randcost I get _much_ lower numbers now for random_page_cost. --------------------------------------------------------------------------- Bruce Momjian wrote: > Because we have seen many complains about sequential vs index scans, I > wrote a script which computes the value for your OS/hardware > combination. > > Under BSD/OS on one SCSI disk, I get a random_page_cost around 60. Our > current postgresql.conf default is 4. > > What do other people get for this value? > > Keep in mind if we increase this value, we will get a more sequential > scans vs. index scans. > > One flaw in this test is that it randomly reads blocks from different > files rather than randomly reading from the same file. Do people have a > suggestion on how to correct this? Does it matter? > > -- > Bruce Momjian | http://candle.pha.pa.us > pgman@candle.pha.pa.us | (610) 359-1001 > + If your life is a hard drive, | 13 Roberts Road > + Christ can be your backup. | Newtown Square, Pennsylvania 19073 > #!/bin/bash > > trap "rm -f /tmp/$$" 0 1 2 3 15 > > BLCKSZ=8192 > > if [ "$RANDOM" = "$RANDOM" ] > then echo "Your shell does not support \$RANDOM. Try using bash." 1>&2 > exit 1 > fi > > # XXX We assume 0 <= random <= 32767 > > echo "Collecting sizing information ..." > > TEMPLATE1=`du -s "$PGDATA/base/1" | awk '{print $1}'` > FULL=`du -s "$PGDATA/base" | awk '{print $1}'` > if [ "$FULL" -lt `expr "$TEMPLATE1" \* 4` ] > then echo "Your installation should have at least four times the data stored in template1 to yield meaningful results"1>&2 > exit 1 > fi > > find "$PGDATA/base" -type f -exec ls -ld {} \; | > awk '$5 % '"$BLCKSZ"' == 0 {print $5 / '"$BLCKSZ"', $9}' | > grep -v '^0 ' > /tmp/$$ > > TOTAL=`awk 'BEGIN {sum=0} > {sum += $1} > END {print sum}' /tmp/$$` > > echo "Running random access timing test ..." > > START=`date '+%s'` > PAGES=1000 > > while [ "$PAGES" -ne 0 ] > do > BIGRAND=`expr "$RANDOM" \* 32768 + "$RANDOM"` > > OFFSET=`awk 'BEGIN{printf "%d\n", ('"$BIGRAND"' / 2^30) * '"$TOTAL"'}'` > > RESULT=`awk ' BEGIN {offset = 0} > offset + $1 > '"$OFFSET"' \ > {print $2, '"$OFFSET"' - offset ; exit} > {offset += $1}' /tmp/$$` > FILE=`echo "$RESULT" | awk '{print $1}'` > OFFSET=`echo "$RESULT" | awk '{print $2}'` > > dd bs="$BLCKSZ" seek="$OFFSET" count=1 if="$FILE" of="/dev/null" >/dev/null 2>&1 > PAGES=`expr "$PAGES" - 1` > done > > STOP=`date '+%s'` > RANDTIME=`expr "$STOP" - "$START"` > > echo "Running sequential access timing test ..." > > START=`date '+%s'` > # We run the random test 10 times more because it is quicker and > # we need it to run for a while to get accurate results. > PAGES=10000 > > while [ "$PAGES" -ne 0 ] > do > BIGRAND=`expr "$RANDOM" \* 32768 + "$RANDOM"` > > OFFSET=`awk 'BEGIN{printf "%d\n", ('"$BIGRAND"' / 2^30) * '"$TOTAL"'}'` > > RESULT=`awk ' BEGIN {offset = 0} > offset + $1 > '"$OFFSET"' \ > {print $2, $1; exit} > {offset += $1}' /tmp/$$` > FILE=`echo "$RESULT" | awk '{print $1}'` > FILEPAGES=`echo "$RESULT" | awk '{print $2}'` > > if [ "$FILEPAGES" -gt "$PAGES" ] > then FILEPAGES="$PAGES" > fi > > dd bs="$BLCKSZ" count="$FILEPAGES" if="$FILE" of="/dev/null" >/dev/null 2>&1 > PAGES=`expr "$PAGES" - "$FILEPAGES"` > done > > STOP=`date '+%s'` > SEQTIME=`expr "$STOP" - "$START"` > > echo > awk 'BEGIN {printf "random_page_cost = %f\n", ('"$RANDTIME"' / '"$SEQTIME"') * 10}' > > ---------------------------(end of broadcast)--------------------------- > TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
I got: random_page_cost = 0.807018 For FreeBSD 4.4/i386 With 512MB RAM & SCSI HDD Chris > -----Original Message----- > From: pgsql-hackers-owner@postgresql.org > [mailto:pgsql-hackers-owner@postgresql.org]On Behalf Of Bruce Momjian > Sent: Monday, 9 September 2002 2:14 PM > To: PostgreSQL-development > Subject: Re: [HACKERS] Script to compute random page cost > > > > OK, turns out that the loop for sequential scan ran fewer times and was > skewing the numbers. I have a new version at: > > ftp://candle.pha.pa.us/pub/postgresql/randcost > > I get _much_ lower numbers now for random_page_cost. > > ------------------------------------------------------------------ > --------- > > Bruce Momjian wrote: > > Because we have seen many complains about sequential vs index scans, I > > wrote a script which computes the value for your OS/hardware > > combination. > > > > Under BSD/OS on one SCSI disk, I get a random_page_cost around 60. Our > > current postgresql.conf default is 4. > > > > What do other people get for this value? > > > > Keep in mind if we increase this value, we will get a more sequential > > scans vs. index scans. > > > > One flaw in this test is that it randomly reads blocks from different > > files rather than randomly reading from the same file. Do people have a > > suggestion on how to correct this? Does it matter? > > > > -- > > Bruce Momjian | http://candle.pha.pa.us > > pgman@candle.pha.pa.us | (610) 359-1001 > > + If your life is a hard drive, | 13 Roberts Road > > + Christ can be your backup. | Newtown Square, > Pennsylvania 19073 > > > #!/bin/bash > > > > trap "rm -f /tmp/$$" 0 1 2 3 15 > > > > BLCKSZ=8192 > > > > if [ "$RANDOM" = "$RANDOM" ] > > then echo "Your shell does not support \$RANDOM. Try > using bash." 1>&2 > > exit 1 > > fi > > > > # XXX We assume 0 <= random <= 32767 > > > > echo "Collecting sizing information ..." > > > > TEMPLATE1=`du -s "$PGDATA/base/1" | awk '{print $1}'` > > FULL=`du -s "$PGDATA/base" | awk '{print $1}'` > > if [ "$FULL" -lt `expr "$TEMPLATE1" \* 4` ] > > then echo "Your installation should have at least four > times the data stored in template1 to yield meaningful results" 1>&2 > > exit 1 > > fi > > > > find "$PGDATA/base" -type f -exec ls -ld {} \; | > > awk '$5 % '"$BLCKSZ"' == 0 {print $5 / '"$BLCKSZ"', $9}' | > > grep -v '^0 ' > /tmp/$$ > > > > TOTAL=`awk 'BEGIN {sum=0} > > {sum += $1} > > END {print sum}' /tmp/$$` > > > > echo "Running random access timing test ..." > > > > START=`date '+%s'` > > PAGES=1000 > > > > while [ "$PAGES" -ne 0 ] > > do > > BIGRAND=`expr "$RANDOM" \* 32768 + "$RANDOM"` > > > > OFFSET=`awk 'BEGIN{printf "%d\n", ('"$BIGRAND"' / 2^30) * > '"$TOTAL"'}'` > > > > RESULT=`awk ' BEGIN {offset = 0} > > offset + $1 > '"$OFFSET"' \ > > {print $2, '"$OFFSET"' - offset ; exit} > > {offset += $1}' /tmp/$$` > > FILE=`echo "$RESULT" | awk '{print $1}'` > > OFFSET=`echo "$RESULT" | awk '{print $2}'` > > > > dd bs="$BLCKSZ" seek="$OFFSET" count=1 if="$FILE" > of="/dev/null" >/dev/null 2>&1 > > PAGES=`expr "$PAGES" - 1` > > done > > > > STOP=`date '+%s'` > > RANDTIME=`expr "$STOP" - "$START"` > > > > echo "Running sequential access timing test ..." > > > > START=`date '+%s'` > > # We run the random test 10 times more because it is quicker and > > # we need it to run for a while to get accurate results. > > PAGES=10000 > > > > while [ "$PAGES" -ne 0 ] > > do > > BIGRAND=`expr "$RANDOM" \* 32768 + "$RANDOM"` > > > > OFFSET=`awk 'BEGIN{printf "%d\n", ('"$BIGRAND"' / 2^30) * > '"$TOTAL"'}'` > > > > RESULT=`awk ' BEGIN {offset = 0} > > offset + $1 > '"$OFFSET"' \ > > {print $2, $1; exit} > > {offset += $1}' /tmp/$$` > > FILE=`echo "$RESULT" | awk '{print $1}'` > > FILEPAGES=`echo "$RESULT" | awk '{print $2}'` > > > > if [ "$FILEPAGES" -gt "$PAGES" ] > > then FILEPAGES="$PAGES" > > fi > > > > dd bs="$BLCKSZ" count="$FILEPAGES" if="$FILE" > of="/dev/null" >/dev/null 2>&1 > > PAGES=`expr "$PAGES" - "$FILEPAGES"` > > done > > > > STOP=`date '+%s'` > > SEQTIME=`expr "$STOP" - "$START"` > > > > echo > > awk 'BEGIN {printf "random_page_cost = %f\n", ('"$RANDTIME"' / > '"$SEQTIME"') * 10}' > > > > > ---------------------------(end of broadcast)--------------------------- > > TIP 3: if posting/reading through Usenet, please send an appropriate > > subscribe-nomail command to majordomo@postgresql.org so that your > > message can get through to the mailing list cleanly > > -- > Bruce Momjian | http://candle.pha.pa.us > pgman@candle.pha.pa.us | (610) 359-1001 > + If your life is a hard drive, | 13 Roberts Road > + Christ can be your backup. | Newtown Square, > Pennsylvania 19073 > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster >
Dell Inspiron 8100 laptop, 1.2GHz Pentium, 512Mb RAM, Windows XP Pro CYGWIN_NT-5.1 PC9 1.3.10(0.51/3/2) 2002-02-25 11:14 i686 unknown random_page_cost = 0.924119 Regards, Dave. > -----Original Message----- > From: Bruce Momjian [mailto:pgman@candle.pha.pa.us] > Sent: 09 September 2002 07:14 > To: PostgreSQL-development > Subject: Re: [HACKERS] Script to compute random page cost > > > > OK, turns out that the loop for sequential scan ran fewer > times and was skewing the numbers. I have a new version at: > > ftp://candle.pha.pa.us/pub/postgresql/randcost > > I get _much_ lower numbers now for random_page_cost.
> OK, turns out that the loop for sequential scan ran fewer times and was > skewing the numbers. I have a new version at: > > ftp://candle.pha.pa.us/pub/postgresql/randcost > > I get _much_ lower numbers now for random_page_cost. I got: random_page_cost = 1.047619 Linux kernel 2.4.18 Pentium III 750MHz Memory 256MB IDE HDD (A notebook/SONY VAIO PCG-Z505CR/K) -- Tatsuo Ishii
On Mon, 9 Sep 2002, Bruce Momjian wrote: > What do other people get for this value? With your new script, with a 1.5 GHz Athlon, 512 MB RAM, and a nice fast IBM 7200 RPM IDE disk, I get random_page_cost = 0.933333. > One flaw in this test is that it randomly reads blocks from different > files rather than randomly reading from the same file. Do people have a > suggestion on how to correct this? Does it matter? From my quick glance, it also does a lot of work work to read each block, including forking off serveral other programs. This would tend to push up the cost of a random read. You might want to look at modifying the randread program (http://randread.sourceforge.net) to do what you want.... cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're alllight. --XTC
> What do other people get for this value? > > Keep in mind if we increase this value, we will get a more sequential > scans vs. index scans. With the new script I get 0.929825 on 2 IBM DTLA 5400RPM (80GB) with a 3Ware 6400 Controller (RAID-1) Best regards,Mario Weilguni -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
On Mon, 2002-09-09 at 07:13, Bruce Momjian wrote: > > OK, turns out that the loop for sequential scan ran fewer times and was > skewing the numbers. I have a new version at: > > ftp://candle.pha.pa.us/pub/postgresql/randcost > > I get _much_ lower numbers now for random_page_cost. > > --------------------------------------------------------------------------- Five successive runs: random_page_cost = 0.947368 random_page_cost = 0.894737 random_page_cost = 0.947368 random_page_cost = 0.894737 random_page_cost = 0.894737 linux 2.4.18 SMP dual Athlon MP 1900+ 512Mb RAM SCSI -- Oliver Elphick Oliver.Elphick@lfix.co.uk Isle of Wight, UK http://www.lfix.co.uk/oliver GPG: 1024D/3E1D0C1C: CA12 09E0 E8D5 8870 5839 932A 614D 4C34 3E1D 0C1C ======================================== "Submit yourselves therefore to God. Resist the devil, and he will flee fromyou." James 4:7
On Mon, 2002-09-09 at 02:13, Bruce Momjian wrote: > > OK, turns out that the loop for sequential scan ran fewer times and was > skewing the numbers. I have a new version at: > > ftp://candle.pha.pa.us/pub/postgresql/randcost > > I get _much_ lower numbers now for random_page_cost. The current script pulls way more data for Sequential scan than random scan now. Random is pulling a single page (count=1 for dd) with every loop. Sequential does the same number of loops, but pulls count > 1 in each. In effect, sequential is random with more data load -- which explains all of the 0.9's. Rod Taylor
Bruce- With the change in the script that I mentioned to you off-list (which I believe just pointed it at our "real world" data), I got the following results with 6 successive runs on each of our two development platforms: (We're running PGSQL 7.2.1 on Debian Linux 2.4) System 1: 1.2 GHz Athlon Processor, 512MB RAM, Database on IDE hard drive random_page_cost = 0.857143 random_page_cost = 0.809524 random_page_cost = 0.809524 random_page_cost = 0.809524 random_page_cost = 0.857143 random_page_cost = 0.884615 System 2: Dual 1.2Ghz Athlon MP Processors, SMP enabled, 1 GB RAM, Database on Ultra SCSI RAID 5 with Hardware controller. random_page_cost = 0.894737 random_page_cost = 0.842105 random_page_cost = 0.894737 random_page_cost = 0.894737 random_page_cost = 0.842105 random_page_cost = 0.894737 I was surprised that the SCSI RAID drive is generally slower than IDE, but the values are in line with the results that others have been getting. -Nick > -----Original Message----- > From: pgsql-hackers-owner@postgresql.org > [mailto:pgsql-hackers-owner@postgresql.org]On Behalf Of Bruce Momjian > Sent: Monday, September 09, 2002 1:14 AM > To: PostgreSQL-development > Subject: Re: [HACKERS] Script to compute random page cost > > > > OK, turns out that the loop for sequential scan ran fewer times and was > skewing the numbers. I have a new version at: > > ftp://candle.pha.pa.us/pub/postgresql/randcost > > I get _much_ lower numbers now for random_page_cost.
I'm getting an infinite wait on that file, could someone post it to the list please? On Mon, 9 Sep 2002, Bruce Momjian wrote: > > OK, turns out that the loop for sequential scan ran fewer times and was > skewing the numbers. I have a new version at: > > ftp://candle.pha.pa.us/pub/postgresql/randcost > > I get _much_ lower numbers now for random_page_cost.
Hi again- I bounced these numbers off of Ray Ontko here at our shop, and he pointed out that random page cost is measured in multiples of a sequential page fetch. It seems almost impossible that a random fetch would be less expensive than a sequential fetch, yet we all seem to be getting results < 1. I can't see anything obviously wrong with the script, but something very odd is going. -Nick > -----Original Message----- > From: pgsql-hackers-owner@postgresql.org > [mailto:pgsql-hackers-owner@postgresql.org]On Behalf Of Nick Fankhauser > Sent: Monday, September 09, 2002 11:25 AM > To: Bruce Momjian; PostgreSQL-development > Cc: Ray Ontko > Subject: Re: [HACKERS] Script to compute random page cost > > > Bruce- > > With the change in the script that I mentioned to you off-list (which I > believe just pointed it at our "real world" data), I got the following > results with 6 successive runs on each of our two development platforms: > > (We're running PGSQL 7.2.1 on Debian Linux 2.4) > > System 1: > 1.2 GHz Athlon Processor, 512MB RAM, Database on IDE hard drive > random_page_cost = 0.857143 > random_page_cost = 0.809524 > random_page_cost = 0.809524 > random_page_cost = 0.809524 > random_page_cost = 0.857143 > random_page_cost = 0.884615 > > System 2: > Dual 1.2Ghz Athlon MP Processors, SMP enabled, 1 GB RAM, Database on Ultra > SCSI RAID 5 with Hardware controller. > random_page_cost = 0.894737 > random_page_cost = 0.842105 > random_page_cost = 0.894737 > random_page_cost = 0.894737 > random_page_cost = 0.842105 > random_page_cost = 0.894737 > > > I was surprised that the SCSI RAID drive is generally slower than IDE, but > the values are in line with the results that others have been getting. > > -Nick > > > -----Original Message----- > > From: pgsql-hackers-owner@postgresql.org > > [mailto:pgsql-hackers-owner@postgresql.org]On Behalf Of Bruce Momjian > > Sent: Monday, September 09, 2002 1:14 AM > > To: PostgreSQL-development > > Subject: Re: [HACKERS] Script to compute random page cost > > > > > > > > OK, turns out that the loop for sequential scan ran fewer times and was > > skewing the numbers. I have a new version at: > > > > ftp://candle.pha.pa.us/pub/postgresql/randcost > > > > I get _much_ lower numbers now for random_page_cost. > > > ---------------------------(end of broadcast)--------------------------- > TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly >
"Nick Fankhauser" <nickf@ontko.com> writes:
> I bounced these numbers off of Ray Ontko here at our shop, and he pointed
> out that random page cost is measured in multiples of a sequential page
> fetch. It seems almost impossible that a random fetch would be less
> expensive than a sequential fetch, yet we all seem to be getting results <
> 1. I can't see anything obviously wrong with the script, but something very
> odd is going.
The big problem with the script is that it involves an invocation of
"dd" - hence, at least one process fork --- for every page read
operation.  The seqscan part of the test is even worse, as it adds a
test(1) call and a shell if/then/else to the overhead.  My guess is that
we are measuring script overhead here, and not the desired I/O quantities
at all --- the script overhead is completely swamping the latter.  The
apparent stability of the results across a number of different platforms
bolsters that thought.
Someone else opined that the script was also not comparing equal
numbers of pages read for the random and sequential cases.  I haven't
tried to decipher the logic enough to see if that allegation is true,
but it's not obviously false.
Finally, I wouldn't believe the results for a moment if they were taken
against databases that are not several times the size of physical RAM
on the test machine, with a total I/O volume also much more than
physical RAM.  We are trying to measure the behavior when kernel
caching is not helpful; if the database fits in RAM then you are just
naturally going to get random_page_cost close to 1, because the kernel
will avoid doing any I/O at all.
        regards, tom lane
			
		Nick Fankhauser wrote: > Hi again- > > I bounced these numbers off of Ray Ontko here at our shop, and he pointed > out that random page cost is measured in multiples of a sequential page > fetch. It seems almost impossible that a random fetch would be less > expensive than a sequential fetch, yet we all seem to be getting results < > 1. I can't see anything obviously wrong with the script, but something very > odd is going. OK, new version at: ftp://candle.pha.pa.us/pub/postgresql/randcost What I have done is to take all of the computation stuff out of the timed loop so only the 'dd' is done in the loop. I am getting a 1.0 for random pages cost with this new code, but I don't have much data in the database so it is very possible I have it all cached. Would others please test it? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
On Mon, 9 Sep 2002, Tom Lane wrote: > Finally, I wouldn't believe the results for a moment if they were taken > against databases that are not several times the size of physical RAM > on the test machine, with a total I/O volume also much more than > physical RAM. We are trying to measure the behavior when kernel > caching is not helpful; if the database fits in RAM then you are just > naturally going to get random_page_cost close to 1, because the kernel > will avoid doing any I/O at all. Um...yeah; another reason to use randread against a raw disk device. (A little hard to use on linux systems, I bet, but works fine on BSD systems.) cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're alllight. --XTC
Curt Sampson <cjs@cynic.net> writes:
> On Mon, 9 Sep 2002, Tom Lane wrote:
>> ...  We are trying to measure the behavior when kernel
>> caching is not helpful; if the database fits in RAM then you are just
>> naturally going to get random_page_cost close to 1, because the kernel
>> will avoid doing any I/O at all.
> Um...yeah; another reason to use randread against a raw disk device.
> (A little hard to use on linux systems, I bet, but works fine on
> BSD systems.)
Umm... not really; surely randread wouldn't know anything about
read-ahead logic?
The reason this is a difficult topic is that we are trying to measure
certain kernel behaviors --- namely readahead for sequential reads ---
and not others --- namely caching, because we have other parameters
of the cost models that purport to deal with that.
Mebbe this is an impossible task and we need to restructure the cost
models from the ground up.  But I'm not convinced of that.  The fact
that a one-page shell script can't measure the desired quantity doesn't
mean we can't measure it with more effort.
        regards, tom lane
			
		On Mon, 9 Sep 2002, Tom Lane wrote: > Curt Sampson <cjs@cynic.net> writes: > > On Mon, 9 Sep 2002, Tom Lane wrote: > >> ... We are trying to measure the behavior when kernel > >> caching is not helpful; if the database fits in RAM then you are just > >> naturally going to get random_page_cost close to 1, because the kernel > >> will avoid doing any I/O at all. > > > Um...yeah; another reason to use randread against a raw disk device. > > (A little hard to use on linux systems, I bet, but works fine on > > BSD systems.) > > Umm... not really; surely randread wouldn't know anything about > read-ahead logic? Randread doesn't know anything about read-ahead logic, but I don't see how that matters one way or the other. The chances of it reading blocks sequentially are pretty much infinitesimal if you're reading across a reasonably large area of disk (I recommend at least 4GB), so readahead will never be triggered. > The reason this is a difficult topic is that we are trying to measure > certain kernel behaviors --- namely readahead for sequential reads --- > and not others --- namely caching, because we have other parameters > of the cost models that purport to deal with that. Well, for the sequential reads, the readahead should be trigerred even when reading from a raw device. So just use dd to measure that. If you want to slightly more accurately model postgres' behaviour, you probably want to pick a random area of the disk, read a gigabyte, switch areas, read another gigabyte, and so on. This will model the "split into 1GB" files thing. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're alllight. --XTC
OK, I have a better version at: ftp://candle.pha.pa.us/pub/postgresql/randcost I have added a null loop which does a dd on a single file without reading any data, and by netting that loop out of the total computation and increasing the number of tests, I have gotten the following results for three runs:random test: 36sequential test: 33null timing test: 27random_page_cost = 1.500000 random test: 38sequential test: 32null timing test: 27random_page_cost = 2.200000 random test: 40sequential test: 31null timing test: 27random_page_cost = 3.250000 Interesting that random time is increasing, while the others were stable. I think this may have to do with other system activity at the time of the test. I will run it some more tomorrow but clearly we are seeing reasonable numbers now. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
I got somewhat different: $ ./randcost /usr/local/pgsql/data Collecting sizing information ... Running random access timing test ... Running sequential access timing test ... Running null loop timing test ... random test: 13 sequential test: 15 null timing test: 11 random_page_cost = 0.500000 Chris > -----Original Message----- > From: pgsql-hackers-owner@postgresql.org > [mailto:pgsql-hackers-owner@postgresql.org]On Behalf Of Bruce Momjian > Sent: Tuesday, 10 September 2002 2:02 PM > To: Curt Sampson > Cc: Tom Lane; nickf@ontko.com; PostgreSQL-development; Ray Ontko > Subject: Re: [HACKERS] Script to compute random page cost > > > OK, I have a better version at: > > ftp://candle.pha.pa.us/pub/postgresql/randcost > > I have added a null loop which does a dd on a single file without > reading any data, and by netting that loop out of the total computation > and increasing the number of tests, I have gotten the following results > for three runs: > > random test: 36 > sequential test: 33 > null timing test: 27 > > random_page_cost = 1.500000 > > > random test: 38 > sequential test: 32 > null timing test: 27 > > random_page_cost = 2.200000 > > > random test: 40 > sequential test: 31 > null timing test: 27 > > random_page_cost = 3.250000 > > Interesting that random time is increasing, while the others were > stable. I think this may have to do with other system activity at the > time of the test. I will run it some more tomorrow but clearly we are > seeing reasonable numbers now. > > -- > Bruce Momjian | http://candle.pha.pa.us > pgman@candle.pha.pa.us | (610) 359-1001 > + If your life is a hard drive, | 13 Roberts Road > + Christ can be your backup. | Newtown Square, > Pennsylvania 19073 > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster >
On Tue, 10 Sep 2002, Bruce Momjian wrote: > Interesting that random time is increasing, while the others were > stable. I think this may have to do with other system activity at the > time of the test. Actually, the random versus sequential time may also be different depending on how many processes are competing for disk access, as well. If the OS isn't maintaining readahead for whatever reason, sequential access could, in theory, degrade to being the same speed as random access. It might be interesting to test this, too. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're alllight. --XTC
>OK, I have a better version at: The script is now broken, I get: Collecting sizing information ... Running random access timing test ... Running sequential access timing test ... Running null loop timing test ... random test: 14 sequential test: 16 null timing test: 14 random_page_cost = 0.000000
I was attempting to measure random page cost a while ago - I used three programs in this archive : http://techdocs.postgresql.org/markir/download/benchtool/ It writes a single big file and seems to give more realistic measurements ( like 6 for a Solaris scsi system and 10 for a Linux ide one...) Have a look and see if you can cannibalize it for your program Cheers Mark
On Mon, 2002-09-09 at 07:13, Bruce Momjian wrote: > > OK, turns out that the loop for sequential scan ran fewer times and was > skewing the numbers. I have a new version at: > > ftp://candle.pha.pa.us/pub/postgresql/randcost Latest version: olly@linda$ random test: 14 sequential test: 11 null timing test: 9 random_page_cost = 2.500000 olly@linda$ for a in 1 2 3 4 5 > do > ~/randcost > done Collecting sizing information ... random test: 11 sequential test: 11 null timing test: 9 random_page_cost = 1.000000 random test: 11 sequential test: 10 null timing test: 9 random_page_cost = 2.000000 random test: 11 sequential test: 11 null timing test: 9 random_page_cost = 1.000000 random test: 11 sequential test: 10 null timing test: 9 random_page_cost = 2.000000 random test: 10 sequential test: 10 null timing test: 10 Sequential time equals null time. Increase TESTCYCLES and rerun. Available memory (512M) exceeds the total database size, so sequential and random are almost the same for the second and subsequent runs. Since, in production, I would hope to have all active tables permanently in RAM, would there be a case for my using a page cost of 1 on the assumption that no disk reads would be needed? -- Oliver Elphick Oliver.Elphick@lfix.co.uk Isle of Wight, UK http://www.lfix.co.uk/oliver GPG: 1024D/3E1D0C1C: CA12 09E0 E8D5 8870 5839 932A 614D 4C34 3E1D 0C1C ======================================== "Draw near to God and he will draw near to you. Cleanse your hands, yousinners; and purify your hearts, you double minded." James 4:8
Curt Sampson <cjs@cynic.net> writes:
> Well, for the sequential reads, the readahead should be trigerred
> even when reading from a raw device.
That strikes me as an unportable assumption.
Even if true, we can't provide a test mechanism that requires root
access to run it --- raw-device testing is out of the question just on
that basis, never mind that it is not measuring what we want to measure.
Perhaps it's time to remind people that what we want to measure
is the performance seen by a C program issuing write() and read()
commands, transferring 8K at a time, on a regular Unix filesystem.
A shell script invoking dd is by definition going to see a very
different performance ratio, even if what dd does under the hood
is 8K read() and write() (another not-well-supported assumption,
IMHO).  If you try to "improve" the results by using a raw device,
you're merely moving even further away from the scenario of interest.
        regards, tom lane
			
		Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I will run it some more tomorrow but clearly we are
> seeing reasonable numbers now.
... which still have no provable relationship to the ratio we need to
measure.  See my previous comments to Curt; I don't think you can
possibly get trustworthy results out of a shell script + dd approach,
because we do not implement Postgres using dd.
If you implemented a C testbed and then proved by experiment that the
shell script got comparable numbers, then I'd believe its results.
Without that confirmation, these are just meaningless numbers.
        regards, tom lane
			
		OK, what you are seeing here is that for your platform the TESTCYCLES size isn't enough; the numbers are too close to measure the difference. I am going to increase the TESTCYCLES from 5k to 10k. That should provide better numbers. --------------------------------------------------------------------------- Oliver Elphick wrote: > On Mon, 2002-09-09 at 07:13, Bruce Momjian wrote: > > > > OK, turns out that the loop for sequential scan ran fewer times and was > > skewing the numbers. I have a new version at: > > > > ftp://candle.pha.pa.us/pub/postgresql/randcost > > Latest version: > > olly@linda$ > random test: 14 > sequential test: 11 > null timing test: 9 > random_page_cost = 2.500000 > > olly@linda$ for a in 1 2 3 4 5 > > do > > ~/randcost > > done > Collecting sizing information ... > random test: 11 > sequential test: 11 > null timing test: 9 > random_page_cost = 1.000000 > > random test: 11 > sequential test: 10 > null timing test: 9 > random_page_cost = 2.000000 > > random test: 11 > sequential test: 11 > null timing test: 9 > random_page_cost = 1.000000 > > random test: 11 > sequential test: 10 > null timing test: 9 > random_page_cost = 2.000000 > > random test: 10 > sequential test: 10 > null timing test: 10 > Sequential time equals null time. Increase TESTCYCLES and rerun. > > > Available memory (512M) exceeds the total database size, so sequential > and random are almost the same for the second and subsequent runs. > > Since, in production, I would hope to have all active tables permanently > in RAM, would there be a case for my using a page cost of 1 on the > assumption that no disk reads would be needed? > > -- > Oliver Elphick Oliver.Elphick@lfix.co.uk > Isle of Wight, UK > http://www.lfix.co.uk/oliver > GPG: 1024D/3E1D0C1C: CA12 09E0 E8D5 8870 5839 932A 614D 4C34 3E1D 0C1C > ======================================== > "Draw near to God and he will draw near to you. > Cleanse your hands, you sinners; and purify your > hearts, you double minded." James 4:8 > > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Oliver Elphick wrote: > Available memory (512M) exceeds the total database size, so sequential > and random are almost the same for the second and subsequent runs. > > Since, in production, I would hope to have all active tables permanently > in RAM, would there be a case for my using a page cost of 1 on the > assumption that no disk reads would be needed? Yes, in your case random_page_cost would be 1 once the data gets into RAM. In fact, that is the reason I used only /data/base for testing so places where data can load into ram will see lower random pages costs. I could just create a random file and test on that but it isn't the same. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
On Tue, 10 Sep 2002, Tom Lane wrote: > Curt Sampson <cjs@cynic.net> writes: > > Well, for the sequential reads, the readahead should be trigerred > > even when reading from a raw device. > > That strikes me as an unportable assumption. Not only unportable: but false. :-) NetBSD, at least, does read-ahead only through the buffer cache. Thinking about it, you *can't* do read-ahead on a raw device, because you're not buffering. Doh! > Perhaps it's time to remind people that what we want to measure > is the performance seen by a C program issuing write() and read() > commands, transferring 8K at a time, on a regular Unix filesystem. Right. Which is what randread does, if you give it a file rather than a raw device. I'm actually just now working on some modifications for it that will let you work against a bunch of files, rather than just one, so it will very accurately emulate a postgres random read of blocks from a table. There are two other tricky things related to the behaviour, however: 1. The buffer cache. You really need to be working against your entire database, not just a few gigabytes of its data, or sample data. 2. Multiple users. You really want a mix of simultaneous accesses going on, with as many processes as you normally have users querying the database. These can probably both be taken care of with shell scripts, though. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're alllight. --XTC
Tom Lane wrote: > Perhaps it's time to remind people that what we want to measure > is the performance seen by a C program issuing write() and read()>commands, transferring 8K at a time, on a regular Unixfilesystem Yes...and at the risk of being accused of marketing ;-) , that is exactly what the 3 programs in my archive do (see previous post for url) : - one called 'write' creates a suitably sized data file (8k at a time - configurable), using the write() call - another called 'read' does sequential reads (8k at a time - configurable), using the read() call - finally one called 'seek' does random reads (8k chunks - configurable), using the lseek() and read() calls I tried to use code as similar as possible to how Postgres does its io....so the results *should* be meaningful ! Large file support in enabled too (as you need to use a file several times bigger than your RAM - and everyone seems to have >1G of it these days...) I think the code is reasonably readable too.... Its been *tested* on Linux, Freebsd, Solaris, MacosX. The only downer is that they don't automatically compute random_page_cost for you..(I was more interested in the raw sequential read, write and random read rates at the time). However it would be a fairly simple modification to combine the all 3 programs into one executable that outputs random_page_cost... regards Mark
On Wed, 11 Sep 2002, Mark Kirkwood wrote: > Yes...and at the risk of being accused of marketing ;-) , that is > exactly what the 3 programs in my archive do (see previous post for url) : Hm, it appears we've both been working on something similar. However, I've just released version 0.2 of randread, which has the following features: Written in C, uses read(2) and write(2), pretty much like postgres. Reads or writes random blocks from a specified list of files, treated as a contiguous range of blocks, again like postgres.This allows you to do random reads from the actual postgres data files for a table, if you like. You can specify the block size to use, and the number of reads to do. Allows you to specify how many blocks you want to read before you start reading again at a new random location. (Thedefault is 1.) This allows you to model various sequential and random read mixes. If you want to do writes, I suggest you create your own set of files to write, rather than destroying postgresql data. This can easily a be done with something like this Bourne shell script: for i in 1 2 3 4; dodd if=/dev/zero of=file.$i bs=1m count=1024 done However, it doesn't calculate the random vs. sequential ratio for you; you've got to do that for yourself. E.g.,: $ ./randread -l 512 -c 256 /u/cjs/z? 256 reads of 512 x 8.00 KB blocks (4096.00 KB) totalling 131072 blocks (1024.00 MB) from 524288 blocks (4092.00 MB) in 4files. 256 reads in 36.101119 sec. (141019 usec/read, 7 reads/sec, 29045.53 KB/sec) $ ./randread -c 4096 /u/cjs/z? 4096 reads of 1 x 8.00 KB blocks (8.00 KB) totalling 4096 blocks (32.00 MB) from 524288 blocks (4095.99 MB) in 4 files. 4096 reads in 34.274582 sec. (8367 usec/read, 120 reads/sec, 956.04 KB/sec) In this case, across 4 GB in 4 files on my 512 MB, 1.5 GHz Athlon with an IBM 7200 RPM IDE drive, I read about 30 times faster doing a full sequential read of the files than I do reading 32 MB randomly from it. But because of the size of this, there's basically no buffer cache involved. If I do this on a single 512 MB file: $ ./randread -c 4096 /u/cjs/z1:0-65536 4096 reads of 1 x 8.00 KB blocks (8.00 KB) totalling 4096 blocks (32.00 MB) from 65536 blocks (511.99 MB) in 1 files. 4096 reads in 28.064573 sec. (6851 usec/read, 146 reads/sec, 1167.59 KB/sec) $ ./randread -l 65535 -c 1 /u/cjs/z1:0-65536 1 reads of 65535 x 8.00 KB blocks (524280.00 KB) totalling 65535 blocks (511.99 MB) from 65536 blocks (0.01 MB) in 1 files. 1 reads in 17.107867 sec. (17107867 usec/read, 0 reads/sec, 30645.55 KB/sec) $ ./randread -c 4096 /u/cjs/z1:0-65536 4096 reads of 1 x 8.00 KB blocks (8.00 KB) totalling 4096 blocks (32.00 MB) from 65536 blocks (511.99 MB) in 1 files. 4096 reads in 19.413738 sec. (4739 usec/read, 215 reads/sec, 1687.88 KB/sec) Well, there you see some of the buffer cache effect from starting with about half the file in memory. If you want to see serious buffer cache action, just use the first 128 MB of my first test file: $ ./randread -c 4096 /u/cjs/z1:0-16536 4096 reads of 1 x 8.00 KB blocks (8.00 KB) totalling 4096 blocks (32.00 MB) from 16536 blocks (129.18 MB) in 1 files. 4096 reads in 20.220791 sec. (4936 usec/read, 204 reads/sec, 1620.51 KB/sec) $ ./randread -l 16535 -c 1 /u/cjs/z1:0-16536 1 reads of 16535 x 8.00 KB blocks (132280.00 KB) totalling 16535 blocks (129.18 MB) from 16536 blocks (0.01 MB) in 1 files. 1 reads in 3.469231 sec. (3469231 usec/read, 0 reads/sec, 38129.49 KB/sec) $ ./randread -l 16535 -c 64 /u/cjs/z1:0-16536 64 reads of 16535 x 8.00 KB blocks (132280.00 KB) totalling 1058240 blocks (8267.50 MB) from 16536 blocks (0.01 MB) in 1files. 64 reads in 23.643026 sec. (369422 usec/read, 2 reads/sec, 358072.59 KB/sec) For those last three, we're basically limited completely by the CPU, as there's not much disk I/O going on at all. The many-block one is going to be slower because it's got to generate a lot more random numbers and do a lot more lseek operations. Anyway, looking at the real difference between truly sequential and truly random reads on a large amount of data file (30:1 or so), it looks to me that people getting much less than that are getting good work out of their buffer cache. You've got to wonder if there's some way to auto-tune for this sort of thing.... Anyway, feel free to download and play. If you want to work on the program, I'm happy to give developer access on sourceforge. http://sourceforge.net/project/showfiles.php?group_id=55994 cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're alllight. --XTC
AMD Athlon 500 512MB Ram IBM 120GB IDE Tested with: BLCKSZ=8192 TESTCYCLES=500000 Result: Collecting sizing information ... Running random access timing test ... Running sequential access timing test ... Running null loop timing test ... random test: 2541 sequential test: 2455 null timing test: 2389 random_page_cost = 2.303030 Hans -- *Cybertec Geschwinde u Schoenig* Ludo-Hartmannplatz 1/14, A-1160 Vienna, Austria Tel: +43/1/913 68 09; +43/664/233 90 75 www.postgresql.at <http://www.postgresql.at>, cluster.postgresql.at <http://cluster.postgresql.at>, www.cybertec.at <http://www.cybertec.at>, kernel.cybertec.at <http://kernel.cybertec.at>
Curt Sampson wrote: > On Wed, 11 Sep 2002, Mark Kirkwood wrote: > > > > Hm, it appears we've both been working on something similar. However, > I've just released version 0.2 of randread, which has the following > features: > funny how often that happens...( I think its often worth the effort to write your own benchmarking / measurement tool in order to gain an good understanding of what you intend to measure) >Anyway, feel free to download and play. If you want to work on the>program, I'm happy to give developer access on sourceforge.>> http://sourceforge.net/project/showfiles.php?group_id=55994 I'll take a look. best wishes Mark
Hi! On Tue, 10 Sep 2002 14:01:11 +0000 (UTC) tgl@sss.pgh.pa.us (Tom Lane) wrote: [...] > Perhaps it's time to remind people that what we want to measure > is the performance seen by a C program issuing write() and read() > commands, transferring 8K at a time, on a regular Unix filesystem. [...] I've written something like that. It is not C but might be useful. Any comments are welcome. http://www.a-nugget.org/downloads/randread.py Bye Guido
Hi all, As an end result of all this, do we now have a decent utility by which end user admin's can run it against the same disk/array that their PostgreSQL installation is on, and get a reasonably accurate number for random page cost? ie: $ ./get_calc_cost Try using random_page_cost = foo $ :-) Regards and best wishes, Justin Clift -- "My grandfather once told me that there are two kinds of people: those who work and those who take the credit. He told me to try to be in the first group; there was less competition there." - Indira Gandhi
Justin Clift wrote: > Hi all, > > As an end result of all this, do we now have a decent utility by which > end user admin's can run it against the same disk/array that their > PostgreSQL installation is on, and get a reasonably accurate number for > random page cost? > > ie: > > $ ./get_calc_cost > Try using random_page_cost = foo > > $ > > :-) Right now we only have my script: ftp://candle.pha.pa.us/pub/postgresql/randcost It uses dd so it forks for every loop and shows a value for my machine around 2.5. I need to code the loop in C to get more accurate numbers. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073