Обсуждение: [PATCH] Using pread instead of lseek (with analysis)

Поиск
Список
Период
Сортировка

[PATCH] Using pread instead of lseek (with analysis)

От
Martijn van Oosterhout
Дата:
Hi,

Below is a patch that tries to determine (for linux) if the pread is
emulated or not (using an autoconf test), and if so use it instead of
lseek. It doesn't remove all uses of lseek. For example, PostgreSQL
uses lseek to find the end of the file when extending. So in a bulk
load, there is one seek per block written anyway. Also, the xlog seems
to be completely seperate code.

Short summary: Actual benefit, (on Linux anyway) not distinguishable
from the noise.

Longer summary: It reduces CPU usage slightly (few percent user,
overall almost nothing), but the benefit is usually swamped by the
actual cost of the read from disk (system cache miss). I would actually
make the change because it simplifies the code a bit.

Comparisons on other systems may be worthwhile.

What follows is the detailed analysis of the change. My conclusion is
that on Linux at least where the syscall overhead is so low, it's not
worth the change for performance. It is cleaner code-wise IMHO.

----
Initially, the table is 1.5 million rows of random integer data. Worst
case scenario: ordered output run with seqscan disabled. Table is 66MB,
Index is 56MB and includes all three data columns. The queries took 98
+/- 1 seconds. Over the entire query it consistantly loaded about 55MB
of data from disk (about 50% system cache miss rate).

Query: select count(*) from (select * from bigtable order by a,b,c)

Exhibit A below is the vmstat output with 10 second averages. The
interesting columns are cpu us (user) and cpu sy (system). Note, system
is defined as stuff not user and not idle. Waiting for the I/O to
complete is counted as system, even though it's not doing all that
much. It shows a very, very slight reduction in user time, though it is
measurable and repeatable (both run a number of times with same
results).

What this tells me is that lseek time is swamped by actual read time,
so not a major benefit in most cases.

Exhibit B is the same but with only 1 million rows (Table 48MB, Index
29MB), this fitting in cache memory. The original index had some slack
from deleting records (I did a vacuum full though). The queries now
take 60 +/- 1 seconds. As you can see even without accessing the disk,
it still isn't making a big difference.

Exhibit C is the output of strace -c on the second test. As you can see
there is a difference, but marginal.

Exhibit D is the output of a seq scan, in case anyone figured I might
be buggering the queries somehow. As you can see, it shows that the
index scan is extremely expensive. Only 5900 reads for the seq scan
(enough to read the whole table) instead of 890000. The shared buffers
is the default 1000. Upping this would close the gap somewhat but that
is irrelevent for this test (lseek vs pread).

Exhibit E is the strace -c output, for completeness. lseek doesn't
appear in the top 10.

If you read this far, thanks for the attention! If you're running
another system, it would be nice to test. Remember to update the
autoconf code for your system!

Have a nice day,

Misc details:

command line:
  time echo "select count(*) from(select * from bigtable order by a,b,c) x" | \
    ./postgres -f s -D /var/lib/postgres/data test
System:
  Linux kleptog 2.6.8-1-686 #1 Thu Nov 25 04:34:30 UTC 2004 i686 GNU/Linux
  128MB RAM
  Intel(R) Celeron(TM) CPU 1000MHz

Exhibit (A): Index scan, dataset > memory

Using ordinary llseek/read:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  0      4   1372    940 107564    0    0   534    18 1070   145 18 56  0 27
 1  0      4   1648    992 107260    0    0   548     3 1071   147 18 63  0 19
 1  0      4   1552    984 107348    0    0   542     4 1069   144 20 69  0 11
 1  0      4   1432    980 107496    0    0   550     6 1071   146 19 69  0 12
 1  0      4   1648    664 107600    0    0   560     3 1071   148 19 67  0 15
 1  0      4   1576    632 107728    0    0   597     3 1076   157 19 68  0 13
 2  0      4   1768    640 107528    0    0   618    14 1080   163 19 67  0 14

Using pread():

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  0      4   1680    600 107616    0    0   534    18 1070   145 17 56  0 28
 1  0      4   1656    644 107644    0    0   543     5 1070   147 16 64  0 19
 1  0      4   1344    624 107976    0    0   540     2 1068   142 19 70  0 11
 1  0      4   1656    632 107652    0    0   555     6 1071   147 17 70  0 13
 1  0      4   1608    540 107796    0    0   564    13 1074   149 18 68  0 15
 1  0      4   1392    504 108068    0    0   598     4 1076   158 17 69  0 14
 0  1      4   1368    460 108120    0    0   626     3 1079   165 17 69  0 14

Exhibit (B): Index scan, dataset < memory

Using ordinary llseek/read:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  0      4   7500   3084  98000    0    0     0    25 1002     8 22 78  0  0
 1  0      4   7484   3100  98000    0    0     0     3 1001     5 22 78  0  0
 1  0      4   7468   3116  98000    0    0     0     3 1001     5 22 78  0  0
 1  0      4   7444   3140  98000    0    0     0    10 1003     6 22 78  0  0

Using pread():

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  0      4   7116   3452  98000    0    0     0    18 1002     8 21 79  0  0
 1  0      4   7100   3468  98000    0    0     0     8 1001     5 21 79  0  0
 1  0      4   7080   3488  98000    0    0     0     4 1001     5 21 79  0  0
 1  0      4   7064   3504  98000    0    0     0    10 1003     5 22 78  0  0

Exhibit (C): strace -c output of backend

Using ordinary llseek/read:

 % time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 85.27   51.906570          58    891526           read
 14.71    8.956573          10    887770           _llseek
  0.00    0.001983          17       114        29 open

Using pread():

 % time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.98   52.977031          59    891479           pread
  0.00    0.002164         114        19           write

Exhibit (D): sequential scan: dataset < memory

Using ordinary llseek/read:

$ time echo "select count(*) from(select * from bigtable) x" | ./postgres.read -D /var/lib/postgres/data test

PostgreSQL stand-alone backend 8.1beta2
backend>         1: count       (typeid = 20, len = 8, typmod = -1, byval = f)
        ----
         1: count = "1080838"   (typeid = 20, len = 8, typmod = -1, byval = f)
        ----
backend>
real    0m1.264s
user    0m0.921s
sys     0m0.328s

Using pread():

$ time echo "select count(*) from(select * from bigtable) x" | ./postgres -D /var/lib/postgres/data test

PostgreSQL stand-alone backend 8.1beta2
backend>         1: count       (typeid = 20, len = 8, typmod = -1, byval = f)
        ----
         1: count = "1080838"   (typeid = 20, len = 8, typmod = -1, byval = f)
        ----
backend>
real    0m1.213s
user    0m0.827s
sys     0m0.371s

Exhibit (E): strace -c of sequential scan case

Using ordinary llseek/read:

 % time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 96.78    0.318179          53      5950           read
  0.68    0.002249         118        19           write
  0.57    0.001879          17       109        29 open
<snip>
  0.06    0.000197           5        40           _llseek

Using pread():

 % time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 95.92    0.316512          54      5903           pread
  0.70    0.002326         122        19           write
  0.62    0.002032         406         5           fsync


--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Вложения

Re: [PATCH] Using pread instead of lseek (with analysis)

От
"Qingqing Zhou"
Дата:
"Martijn van Oosterhout" <kleptog@svana.org> wrote
>
> Below is a patch that tries to determine (for linux) if the pread is
> emulated or not (using an autoconf test), and if so use it instead of
> lseek.

Just FYI, this was proposed 2 years ago:

http://archives.postgresql.org/pgsql-patches/2003-02/msg00197.php

The basic problem I read from the above thread is that it does not give
enough performance proofs on *various* platforms.

Regards,
Qingqing



Re: [PATCH] Using pread instead of lseek (with analysis)

От
Tom Lane
Дата:
Martijn van Oosterhout <kleptog@svana.org> writes:
> What follows is the detailed analysis of the change. My conclusion is
> that on Linux at least where the syscall overhead is so low, it's not
> worth the change for performance. It is cleaner code-wise IMHO.

How is it cleaner code-wise to debug and maintain two #ifdef'd code
paths instead of one?  (And when I say "debug", I mean "I don't believe
FileSeek still works".  One reason that the patch seems unmaintainable
to me as-is is that it creates a subtle, critical, and undocumented
difference in the semantic meaning of the seekPos variable: in one case
it's tracking the kernel's own seek position, and in the other it isn't.)

> What this tells me is that lseek time is swamped by actual read time,
> so not a major benefit in most cases.

It would be reasonable to check results in fully-cached cases, which
would be the best real-world scenario for this to show any improvement
in.

            regards, tom lane

Re: [PATCH] Using pread instead of lseek (with analysis)

От
Martijn van Oosterhout
Дата:
On Sat, Oct 08, 2005 at 05:27:11PM -0400, Tom Lane wrote:
> How is it cleaner code-wise to debug and maintain two #ifdef'd code
> paths instead of one?  (And when I say "debug", I mean "I don't believe
> FileSeek still works".  One reason that the patch seems unmaintainable
> to me as-is is that it creates a subtle, critical, and undocumented
> difference in the semantic meaning of the seekPos variable: in one case
> it's tracking the kernel's own seek position, and in the other it isn't.)

I mean in the situation where we would only use pread. As for the
difference, no code outside of fd.c can assume anying about the kernel
seek position because it can't see it (the file might be closed!).
Unless something bypasses the abstraction?

> It would be reasonable to check results in fully-cached cases, which
> would be the best real-world scenario for this to show any improvement
> in.

If you look, I did that and even then it simply didn't make a
difference. lseek is 10 microseconds, you'd need to do hundreds of
thousands of them to make a difference. And any gain would disappear in
just the rotational latency of a hard disk read.

I satisfied my curiosity and showed it just isn't worth it at this
point.
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Вложения

Re: [PATCH] Using pread instead of lseek (with analysis)

От
Martijn van Oosterhout
Дата:
On Sun, Oct 09, 2005 at 12:19:57AM +0200, Martijn van Oosterhout wrote:
> > It would be reasonable to check results in fully-cached cases, which
> > would be the best real-world scenario for this to show any improvement
> > in.
>
> If you look, I did that and even then it simply didn't make a
> difference. lseek is 10 microseconds, you'd need to do hundreds of
> thousands of them to make a difference. And any gain would disappear in
> just the rotational latency of a hard disk read.

Just to clarify, I reexamined this point and there is a small
difference. The fully cached case went from 72 seconds to 70 over
890,000 (p)read syscalls, indicating that in that case lseek/read is
slower than pread by about 2 microseconds per call. The overall benefit
is about 2.5%.

This doesn't change the fact that any disk access at all renders the
benefit almost nil. On Linux anyway, other OSes, who knows?

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Вложения