Discussion: RAID arrays and performance

RAID arrays and performance

From:
Matthew
Date:
I have a question about how Postgres makes use of RAID arrays for
performance, because we are considering buying a 12-disc array for
performance reasons. I'm interested in how the performance scales with the
number of discs in the array.

Now, I know that for an OLTP workload (in other words, lots of small
parallel random accesses), the performance will scale almost with the
number of discs. However, I'm more interested in the performance of
individual queries, particularly those where Postgres has to do an index
scan, which will result in a single query performing lots of random
accesses to the disc system. Theoretically, this *can* scale with the
number of discs too - my question is does it?

Does Postgres issue requests to each random access in turn, waiting for
each one to complete before issuing the next request (in which case the
performance will not exceed that of a single disc), or does it use some
clever asynchronous access method to send a queue of random access
requests to the OS that can be distributed among the available discs?

Any knowledgable answers or benchmark proof would be appreciated,

Matthew

--
"To err is human; to really louse things up requires root
 privileges."                 -- Alexander Pope, slightly paraphrased

Re: RAID arrays and performance

From:
Gregory Stark
Date:
"Matthew" <matthew@flymine.org> writes:

> Does Postgres issue requests to each random access in turn, waiting for
> each one to complete before issuing the next request (in which case the
> performance will not exceed that of a single disc), or does it use some
> clever asynchronous access method to send a queue of random access
> requests to the OS that can be distributed among the available discs?

Sorry, it does the former, at least currently.

That said, this doesn't really come up nearly as often as you might think.
Normally queries fit mostly in either the large batch query domain or the
small quick oltp query domain. For the former Postgres tries quite hard to do
sequential i/o which the OS will do readahead for and you'll get good
performance. For the latter you're normally running many simultaneous such
queries and the raid array helps quite a bit.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's On-Demand Production Tuning

Re: RAID arrays and performance

From:
Matthew
Date:
On Tue, 4 Dec 2007, Gregory Stark wrote:
> "Matthew" <matthew@flymine.org> writes:
>
> > Does Postgres issue requests to each random access in turn, waiting for
> > each one to complete before issuing the next request (in which case the
> > performance will not exceed that of a single disc), or does it use some
> > clever asynchronous access method to send a queue of random access
> > requests to the OS that can be distributed among the available discs?
>
> Sorry, it does the former, at least currently.
>
> That said, this doesn't really come up nearly as often as you might think.

Shame. It comes up a *lot* in my project. A while ago we converted a task
that processes a queue of objects to processing groups of a thousand
objects, which sped up the process considerably. So we run an awful lot of
queries with IN lists with a thousand values. They hit the indexes, then
fetch the rows by random access. A full table sequential scan would take
much longer. It'd be awfully nice to have those queries go twelve times
faster.

> Normally queries fit mostly in either the large batch query domain or the
> small quick oltp query domain. For the former Postgres tries quite hard to do
> sequential i/o which the OS will do readahead for and you'll get good
> performance. For the latter you're normally running many simultaneous such
> queries and the raid array helps quite a bit.

Having twelve discs will certainly improve the sequential IO throughput!

However, if this was implemented (and I have *no* idea whatsoever how hard
it would be), then large index scans would scale with the number of discs
in the system, which would be quite a win, I would imagine. Large index
scans can't be that rare!

Matthew

--
Software suppliers are trying to make their software packages more
'user-friendly'.... Their best approach, so far, has been to take all
the old brochures, and stamp the words, 'user-friendly' on the cover.
-- Bill Gates

Re: RAID arrays and performance

From:
Mark Mielke
Date:
Matthew wrote:
> On Tue, 4 Dec 2007, Gregory Stark wrote:
>
>> "Matthew" <matthew@flymine.org> writes
>>> Does Postgres issue requests to each random access in turn, waiting for
>>> each one to complete before issuing the next request (in which case the
>>> performance will not exceed that of a single disc), or does it use some
>>> clever asynchronous access method to send a queue of random access
>>> requests to the OS that can be distributed among the available discs?
>>>
>> Sorry, it does the former, at least currently.
>> That said, this doesn't really come up nearly as often as you might think.
>>
> Shame. It comes up a *lot* in my project. A while ago we converted a task
> that processes a queue of objects to processing groups of a thousand
> objects, which sped up the process considerably. So we run an awful lot of
> queries with IN lists with a thousand values. They hit the indexes, then
> fetch the rows by random access. A full table sequential scan would take
> much longer. It'd be awfully nice to have those queries go twelve times
> faster.
>
The bitmap scan method does ordered reads of the table, which can
partially take advantage of sequential reads. Not sure whether bitmap
scan is optimal for your situation or whether your situation would allow
this to be taken advantage of.

>> Normally queries fit mostly in either the large batch query domain or the
>> small quick oltp query domain. For the former Postgres tries quite hard to do
>> sequential i/o which the OS will do readahead for and you'll get good
>> performance. For the latter you're normally running many simultaneous such
>> queries and the raid array helps quite a bit.
>>
> Having twelve discs will certainly improve the sequential IO throughput!
>
> However, if this was implemented (and I have *no* idea whatsoever how hard
> it would be), then large index scans would scale with the number of discs
> in the system, which would be quite a win, I would imagine. Large index
> scans can't be that rare!
>
Do you know that there is a problem, or are you speculating about one? I
think your case would be far more compelling if you could show a
problem. :-)

I would think that at a minimum, having 12 disks with RAID 0 or RAID 1+0
would allow your insane queries to run concurrently with up to 12 other
queries. Unless your insane query is the only query in use on the
system, I think you may be speculating about a nearly non-existent
problem. Just a suggestion...

I recall talk of more intelligent table scanning algorithms, and the use
of asynchronous I/O to benefit from RAID arrays, but the numbers
prepared to convince people that the change would have effect have been
less than impressive.

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>

Re: RAID arrays and performance

From:
Matthew
Date:
On Tue, 4 Dec 2007, Mark Mielke wrote:
> The bitmap scan method does ordered reads of the table, which can
> partially take advantage of sequential reads. Not sure whether bitmap
> scan is optimal for your situation or whether your situation would allow
> this to be taken advantage of.

Bitmap scan may help where several randomly-accessed pages are next to
each other. However, I would expect the operating system's readahead and
buffer cache to take care of that one.

> Do you know that there is a problem, or are you speculating about one? I
> think your case would be far more compelling if you could show a
> problem. :-)
>
> I would think that at a minimum, having 12 disks with RAID 0 or RAID 1+0
> would allow your insane queries to run concurrent with up to 12 other
> queries.

Yeah, I don't really care about concurrency. It's pretty obvious that
running x concurrent queries on an x-disc RAID system will allow the
utilisation of all the discs at once, therefore allowing the performance
to scale by x. What I'm talking about is a single query running on an
x-disc RAID array.

> Unless your insane query is the only query in use on the system...

That's exactly the case.

> I recall talk of more intelligent table scanning algorithms, and the use
> of asynchronous I/O to benefit from RAID arrays, but the numbers
> prepared to convince people that the change would have effect have been
> less than impressive.

I think a twelve-times speed increase is impressive. Granted, given
greatly-concurrent access, the benefits go away, but I think it'd be worth
it for when there are few queries running on the system.

I don't think you would have to create a more intelligent table scanning
algorithm. What you would need to do is take the results of the index,
convert that to a list of page fetches, then pass that list to the OS as
an asynchronous "please fetch all these into the buffer cache" request,
then do the normal algorithm as is currently done. The requests would then
come out of the cache instead of from the disc. Of course, this is from a
simple Java programmer who doesn't know the OS interfaces for this sort of
thing.
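
A minimal sketch of what such a "please fetch all these into the buffer
cache" request might look like at the OS level, assuming
posix_fadvise(POSIX_FADV_WILLNEED) as the hinting call (the interface that
comes up later in this thread); the file name and block numbers are
placeholders, not PostgreSQL internals:

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

#define BLCKSZ 8192

/* Hint a list of 8 kB pages to the kernel before reading them, so the
 * reads can be queued and spread across the discs in the array. */
static void prefetch_blocks(int fd, const long *blocks, int nblocks)
{
    for (int i = 0; i < nblocks; i++)
        posix_fadvise(fd, (off_t) blocks[i] * BLCKSZ, BLCKSZ,
                      POSIX_FADV_WILLNEED);
}

int main(void)
{
    long blocks[] = {14, 973, 2, 5120, 88};     /* e.g. from an index scan */
    int  fd = open("relation_file", O_RDONLY);  /* hypothetical heap file */
    char page[BLCKSZ];

    if (fd < 0) { perror("open"); return 1; }
    prefetch_blocks(fd, blocks, 5);

    /* Then do the normal reads; most should already be in (or on their
     * way into) the OS buffer cache. */
    for (int i = 0; i < 5; i++)
        pread(fd, page, BLCKSZ, (off_t) blocks[i] * BLCKSZ);

    close(fd);
    return 0;
}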

Matthew

--
Here we go - the Fairy Godmother redundancy proof.
                                        -- Computer Science Lecturer

Re: RAID arrays and performance

From:
Mark Mielke
Date:
Matthew wrote:
> On Tue, 4 Dec 2007, Mark Mielke wrote:
>
>> The bitmap scan method does ordered reads of the table, which can
>> partially take advantage of sequential reads. Not sure whether bitmap
>> scan is optimal for your situation or whether your situation would allow
>> this to be taken advantage of.
>>
> Bitmap scan may help where several randomly-accessed pages are next to
> each other. However, I would expect the operating system's readahead and
> buffer cache to take care of that one.
>
The disk head has less theoretical distance to travel if always moving
in a single direction instead of randomly seeking back and forth.
>> Do you know that there is a problem, or are you speculating about one? I
>> think your case would be far more compelling if you could show a
>> problem. :-)
>>
>> I would think that at a minimum, having 12 disks with RAID 0 or RAID 1+0
>> would allow your insane queries to run concurrent with up to 12 other
>> queries.
>>
> Yeah, I don't really care about concurrency. It's pretty obvious that
> running x concurrent queries on an x-disc RAID system will allow the
> utilisation of all the discs at once, therefore allowing the performance
> to scale by x. What I'm talking about is a single query running on an
> x-disc RAID array.
>
The time to seek to a particular sector does not reduce 12X with 12
disks. It is still approximately the same, only it can handle 12X the
concurrency. This makes RAID best for concurrent loads. In your
scenario, you are trying to make a single query take advantage of this
concurrency by taking advantage of the system cache. I don't think full
12X concurrency of a single query is possible for most loads, probably
including yours. See later for details.
>> I recall talk of more intelligent table scanning algorithms, and the use
>> of asynchronous I/O to benefit from RAID arrays, but the numbers
>> prepared to convince people that the change would have effect have been
>> less than impressive.
>>
> I think a twelve-times speed increase is impressive. Granted, given
> greatly-concurrent access, the benefits go away, but I think it'd be worth
> it for when there are few queries running on the system.
>
> I don't think you would have to create a more intelligent table scanning
> algorithm. What you would need to do is take the results of the index,
> convert that to a list of page fetches, then pass that list to the OS as
> an asynchronous "please fetch all these into the buffer cache" request,
> then do the normal algorithm as is currently done. The requests would then
> come out of the cache instead of from the disc. Of course, this is from a
> simple Java programmer who doesn't know the OS interfaces for this sort of
> thing.
>
That's about how the talk went. :-)

The problem is that a 12X speed for 12 disks seems unlikely except under
very specific loads (such as a sequential scan of a single table). Each
of the indexes may need to be scanned or searched in turn, then each of
the tables would need to be scanned or searched in turn, depending on
the query plan. There is no guarantee that the index rows or the table
rows are equally spread across the 12 disks. CPU processing becomes
involved, and that is currently limited to a single processor thread. I
suspect no database would achieve a 12X speedup for 12 disks unless a
simple sequential scan of a single table was required, in which case the
reads could be fully parallelized with RAID 0 using standard sequential
reads, and this is available today using built-in OS or disk read-ahead.

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>

Re: RAID arrays and performance

From:
Gregory Stark
Date:
"Mark Mielke" <mark@mark.mielke.cc> writes:

> Matthew wrote:
>
>> I don't think you would have to create a more intelligent table scanning
>> algorithm. What you would need to do is take the results of the index,
>> convert that to a list of page fetches, then pass that list to the OS as
>> an asynchronous "please fetch all these into the buffer cache" request,
>> then do the normal algorithm as is currently done. The requests would then
>> come out of the cache instead of from the disc. Of course, this is from a
>> simple Java programmer who doesn't know the OS interfaces for this sort of
>> thing.
>
> That's about how the talk went. :-)
>
> The problem is that a 12X speed for 12 disks seems unlikely except under very
> specific loads (such as a sequential scan of a single table). Each of the
> indexes may need to be scanned or searched in turn, then each of the tables
> would need to be scanned or searched in turn, depending on the query plan.
> There is no guarantee that the index rows or the table rows are equally spread
> across the 12 disks. CPU processing becomes involved with is currently limited
> to a single processor thread. I suspect no database would achieve a 12X speedup
> for 12 disks unless a simple sequential scan of a single table was required, in
> which case the reads could be fully parallelized with RAID 0 using standard
> sequential reads, and this is available today using built-in OS or disk
> read-ahead.

I'm sure you would get something between 1x and 12x though...

I'm rerunning my synthetic readahead tests now. That doesn't show the effect
of the other cpu and i/o work being done in the meantime but surely if they're
being evicted from cache too soon that just means your machine is starved for
cache and you should add more RAM?

Also, it's true, you need to preread more than 12 blocks to handle a 12-disk
raid. My offhand combinatorics analysis seems to indicate you would expect to
need n(n-1)/2 blocks on average before you've hit all the blocks. There's
little penalty to prereading unless you use up too many kernel resources or
you do unnecessary i/o which you never use, so I would expect doing n^2 capped
at some reasonable number like 1,000 pages (enough to handle a 32-disk raid)
would be reasonable.
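
As a rough illustration of that combinatorics argument, here is a small
simulation, assuming blocks land on the n disks uniformly at random (real
striping only approximates this); the numbers are illustrative only:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 12, trials = 100000;
    long total = 0;

    srand(12345);
    for (int t = 0; t < trials; t++) {
        int busy[12] = {0}, hit = 0, requests = 0;

        /* Issue random block requests until every disk has at least one. */
        while (hit < n) {
            int disk = rand() % n;
            requests++;
            if (!busy[disk]) { busy[disk] = 1; hit++; }
        }
        total += requests;
    }
    /* Expected value is n * (1 + 1/2 + ... + 1/n), roughly 37 for n = 12. */
    printf("average requests to keep all %d disks busy: %.1f\n",
           n, (double) total / trials);
    return 0;
}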

The real trick is avoiding doing prefetches that are never needed. The user
may never actually read all the tuples being requested. I think that means we
shouldn't prefetch until the second tuple is read and then gradually increase
the prefetch distance as you read more and more of the results.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!

Re: RAID arrays and performance

From:
Matthew
Date:
On Tue, 4 Dec 2007, Mark Mielke wrote:
> The disk head has less theoretical distance to travel if always moving
> in a single direction instead of randomly seeking back and forth.

True... and false. The head can move pretty quickly, and it also has
rotational latency and settling time to deal with. This means that there
are cases where it is actually faster to move across the disc and read a
block, then move back and read a block than to read them in order.

So, if you hand requests one by one to the disc, it will almost always be
faster to order them. On the other hand, if you hand a huge long list of
requests to a decent SCSI or SATA-NCQ disc in one go, it will reorder the
reads itself, and it will do it much better than you.

> The time to seek to a particular sector does not reduce 12X with 12
> disks. It is still approximately the same, only it can handle 12X the
> concurrency. This makes RAID best for concurrent loads. In your
> scenario, you are trying to make a single query take advantage of this
> concurrency by taking advantage of the system cache.

Kind of. The system cache is just a method to make it simpler to explain -
I don't know the operating system interfaces, but it's likely that the
actual call is something more like "transfer these blocks to these memory
locations and tell me when they're all finished."  I'm trying to make a
single query concurrent by using the knowledge of a *list* of accesses to
be made, and getting the operating system to do all of them concurrently.

> The problem is that a 12X speed for 12 disks seems unlikely except under
> very specific loads (such as a sequential scan of a single table).

I'll grant you that 12X may not actually be reached, but it'll be
somewhere between 1X and 12X. I'm optimistic.

> Each of the indexes may need to be scanned or searched in turn, then
> each of the tables would need to be scanned or searched in turn,
> depending on the query plan.

Yes, the indexes would also have to be accessed concurrently, and that
will undoubtedly be harder to code than accessing the tables concurrently.

> There is no guarantee that the index rows or the table rows are equally
> spread across the 12 disks.

Indeed. However, you will get an advantage if they are spread out at all.
Statistically, the larger the number of pages that need to be retrieved in
the set, the more equally spread between the discs they will be. It's the
times when there are a large number of pages to retrieve that this will be
most useful.

> CPU processing becomes involved with is currently limited to a single
> processor thread.

On the contrary, this is a problem at the moment for sequential table
scans, but would not be a problem for random accesses. If you have twelve
discs all throwing 80MB/s at the CPU, it's understandable that the CPU
won't keep up. However, when you're making random accesses, with say a
15,000 rpm disc, and retrieving a single 8k page on every access, each
disc will be producing a maximum of 2MB per second, which can be handled
quite easily by modern CPUs. Index scans are limited by the disc, not the
CPU.

Matthew

--
A. Top Posters
> Q. What's the most annoying thing in the world?

Re: RAID arrays and performance

From:
Matthew
Date:
On Tue, 4 Dec 2007, Gregory Stark wrote:
> Also, it's true, you need to preread more than 12 blocks to handle a 12-disk
> raid. My offhand combinatorics analysis seems to indicate you would expect to
> need to n(n-1)/2 blocks on average before you've hit all the blocks. There's
> little penalty to prereading unless you use up too much kernel resources or
> you do unnecessary i/o which you never use, so I would expect doing n^2 capped
> at some reasonable number like 1,000 pages (enough to handle a 32-disk raid)
> would be reasonable.

It's better than that, actually. Let's assume a RAID 0 striped set of
twelve discs. If you spread out twelve requests to twelve discs, then the
expected number of requests to each disc is one. The probability that any
single disc receives more than say three requests is rather small. As you
increase the number of requests, the longest reasonably-expected queue
length for a particular disc gets closer to the number of requests divided
by the number of discs, as the requests get spread more and more evenly
among the discs.

The larger the set of requests, the closer the performance will scale to
the number of discs.

Matthew

--
All of this sounds mildly turgid and messy and confusing... but what the
heck. That's what programming's all about, really
                                        -- Computer Science Lecturer

Re: RAID arrays and performance

From:
Gregory Stark
Date:
Fwiw, what made you bring up this topic now? You're the second person in about
two days to bring up precisely this issue and it was an issue I had been
planning to bring up on -hackers as it was.

"Matthew" <matthew@flymine.org> writes:

> Kind of. The system cache is just a method to make it simpler to explain -
> I don't know the operating system interfaces, but it's likely that the
> actual call is something more like "transfer these blocks to these memory
> locations and tell me when they're all finished."  I'm trying to make a
> single query concurrent by using the knowledge of a *list* of accesses to
> be made, and getting the operating system to do all of them concurrently.

There are two interfaces of which I'm aware. posix_fadvise() which just
tells the kernel you'll be needing the blocks soon. Linux at least initiates
I/O on them into cache. libaio which lets you specify the location to read the
blocks and how to notify you when they're ready.

On Linux posix_fadvise works great but libaio doesn't seem to gain any speed
bonus, at least on 2.6.22 with glibc 2.6.1. I was under the impression there
was a kernel implementation somewhere but apparently it's not helping.

On Solaris apparently it doesn't have posix_fadvise but libaio works great. We
could use libaio as a kind of backdoor fadvise where we just initiate i/o on
the block but throw away the results assuming they'll stay in cache for the
real read or we could add an asynchronous interface to the buffer manager. The
latter is attractive but would be a much more invasive patch. I'm inclined to
start with the former.
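
A rough sketch of that "backdoor fadvise" idea using the libaio interface
might look like the following; the file name and block list are
placeholders, error handling is minimal, and, as noted above, whether the
kernel actually overlaps these reads depends on the platform:

#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

#define BLCKSZ 8192
#define NREQS  4

int main(void)
{
    long            blocks[NREQS] = {10, 4000, 77, 123456};
    io_context_t    ctx = 0;
    struct iocb     cbs[NREQS], *cbp[NREQS];
    struct io_event events[NREQS];
    char           *buf = malloc(NREQS * BLCKSZ);
    int             fd = open("relation_file", O_RDONLY);

    if (fd < 0 || buf == NULL || io_setup(NREQS, &ctx) < 0) {
        perror("setup");
        return 1;
    }

    /* Read each block into a scratch buffer we never look at; the point is
     * only to get the kernel to start the I/O now, hoping the data stays
     * in cache for the real read later. */
    for (int i = 0; i < NREQS; i++) {
        io_prep_pread(&cbs[i], fd, buf + i * BLCKSZ, BLCKSZ,
                      (long long) blocks[i] * BLCKSZ);
        cbp[i] = &cbs[i];
    }
    if (io_submit(ctx, NREQS, cbp) < 0) { perror("io_submit"); return 1; }

    /* Reap the completions (or let them finish in the background) and then
     * do the real buffered reads elsewhere. */
    io_getevents(ctx, NREQS, NREQS, events, NULL);
    io_destroy(ctx);
    return 0;
}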

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's PostGIS support!

Re: RAID arrays and performance

From:
Mark Mielke
Date:
Matthew wrote:
> On Tue, 4 Dec 2007, Gregory Stark wrote:
>> Also, it's true, you need to preread more than 12 blocks to handle a 12-disk
>> raid. My offhand combinatorics analysis seems to indicate you would expect to
>> need n(n-1)/2 blocks on average before you've hit all the blocks. There's
>> little penalty to prereading unless you use up too many kernel resources or
>> you do unnecessary i/o which you never use, so I would expect doing n^2 capped
>> at some reasonable number like 1,000 pages (enough to handle a 32-disk raid)
>> would be reasonable.
> It's better than that, actually. Let's assume a RAID 0 striped set of
> twelve discs. If you spread out twelve requests to twelve discs, then the
> expected number of requests to each disc is one. The probability that any
> single disc receives more than say three requests is rather small. As you
> increase the number of requests, the longest reasonably-expected queue
> length for a particular disc gets closer to the number of requests divided
> by the number of discs, as the requests get spread more and more evenly
> among the discs.
>
> The larger the set of requests, the closer the performance will scale to
> the number of discs.

This assumes that you can know which pages to fetch ahead of time - which you do not except for sequential read of a single table.

I think it would be possible that your particular case could be up to 6X faster, but that most other people will see little or no speed up. If it does read the wrong pages, it is wasting its time.

I am not trying to discourage you - only trying to ensure that you have reasonable expectations. 12X is far too optimistic.

Please show one of your query plans and how you as a person would design which pages to request reads for.

Cheers,
mark

-- 
Mark Mielke <mark@mielke.cc>

Re: RAID arrays and performance

From:
James Mansion
Date:
matthew@flymine.org wrote:
> So, if you hand requests one by one to the disc, it will almost always be
> faster to order them. On the other hand, if you hand a huge long list of
> requests to a decent SCSI or SATA-NCQ disc in one go, it will reorder the
> reads itself, and it will do it much better than you.
>
>
Sure - but this doesn't suggest threading so much as pushing all reads
into AIO as soon as they can be identified - and hoping that your os has
a decent AIO subsystem, which is sadly a tall order for many *nix systems.

I do think some thought should be given to threading but I would expect
the effect to be more noticeable for updates where you update tables that
have multiple indices.  In the case of your scan then you need threading
on CPU (rather than concurrent IO through AIO) if the disks can feed you
data faster than you can scan it.  Which might be the case for some scans
using user functions, but I wouldn't have thought it would be the case
for a simple index scan.

At some point, hopefully, the engines will be:
 a) thread safe (if not thread hot) so it can exist with threaded user
    functions and embedded languages
 b) able to incorporate C++ add-in functionality

There may not be a pressing reason to support either of these, but having
a capability to experiment would surely be helpful and allow incremental
enhancement - so baby steps could be made to (for example) move stats and
logging to a background thread, move push of results to clients out of
the query evaluator thread, and so on.  Parallel threading queries is a
whole different ball game which needs thought in the optimiser.

James


Re: RAID arrays and performance

From:
James Mansion
Date:
Mark Mielke wrote:
> This assumes that you can know which pages to fetch ahead of time -
> which you do not except for sequential read of a single table.
>
Why doesn't it help to issue IO ahead-of-time requests when you are
scanning an index?  You can read-ahead in index pages, and submit
requests for data pages as soon as it is clear you'll want them.  Doing
so can allow the disks and OS to relax the order in which you receive
them, which may allow you to process them while IO continues, and it may
also optimise away some seeking and settle time.  Maybe.
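
A hand-wavy sketch of that loop, with posix_fadvise standing in for the
ahead-of-time requests and a fake in-memory "index" supplying the heap
block numbers (none of this is PostgreSQL code), might look like:

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

#define BLCKSZ 8192
#define WINDOW 32

static long index_blocks[] = {17, 90210, 3, 554, 7021, 42, -1}; /* fake index */
static int  index_pos = 0;

static long index_next_block(void) { return index_blocks[index_pos++]; }

static void process_heap_block(int fd, long blk)
{
    char page[BLCKSZ];                /* the "real" read; hopefully cached */
    pread(fd, page, BLCKSZ, (off_t) blk * BLCKSZ);
}

/* Walk the index, hint heap pages a fixed distance ahead, and consume
 * pages behind that window so I/O and CPU work overlap. */
static void scan_with_prefetch(int heap_fd)
{
    long window[WINDOW], blk;
    long head = 0, tail = 0;

    while ((blk = index_next_block()) >= 0) {
        posix_fadvise(heap_fd, (off_t) blk * BLCKSZ, BLCKSZ,
                      POSIX_FADV_WILLNEED);
        window[head++ % WINDOW] = blk;
        if (head - tail == WINDOW)    /* window full: consume oldest hint */
            process_heap_block(heap_fd, window[tail++ % WINDOW]);
    }
    while (tail < head)               /* drain whatever is still hinted */
        process_heap_block(heap_fd, window[tail++ % WINDOW]);
}

int main(void)
{
    int fd = open("relation_file", O_RDONLY);   /* hypothetical heap file */
    if (fd < 0) { perror("open"); return 1; }
    scan_with_prefetch(fd);
    close(fd);
    return 0;
}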


Re: RAID arrays and performance

From:
Matthew
Date:
On Tue, 4 Dec 2007, Mark Mielke wrote:
> > The larger the set of requests, the closer the performance will scale to
> > the number of discs
>
> This assumes that you can know which pages to fetch ahead of time -
> which you do not except for sequential read of a single table.

There are circumstances where it may be hard to locate all the pages ahead
of time - that's probably when you're doing a nested loop join. However,
if you're looking up in an index and get a thousand row hits in the index,
then there you go. Page locations to load.

> Please show one of your query plans and how you as a person would design
> which pages to request reads for.

How about the query that "cluster <skrald@amossen.dk>" was trying to get
to run faster a few days ago? Tom Lane wrote about it:

| Wouldn't help, because the accesses to "questions" are not the problem.
| The query's spending nearly all its time in the scan of "posts", and
| I'm wondering why --- doesn't seem like it should take 6400msec to fetch
| 646 rows, unless perhaps the data is just horribly misordered relative
| to the index.

Which is exactly what's going on. The disc is having to seek 646 times
fetching a single row each time, and that takes 6400ms. He obviously has a
standard 5,400 or 7,200 rpm drive with a seek time around 10ms.

Or on a similar vein, fill a table with completely random values, say ten
million rows with a column containing integer values ranging from zero to
ten thousand. Create an index on that column, analyse it. Then pick a
number between zero and ten thousand, and

"SELECT * FROM table WHERE that_column = the_number_you_picked"

Matthew

--
Experience is what allows you to recognise a mistake the second time you
make it.

Re: RAID arrays and performance

From:
Mark Mielke
Date:
James Mansion wrote:
> Mark Mielke wrote:
>> This assumes that you can know which pages to fetch ahead of time -
>> which you do not except for sequential read of a single table.
> Why doesn't it help to issue IO ahead-of-time requests when you are
> scanning an index?  You can read-ahead
> in index pages, and submit requests for data pages as soon as it is
> clear you'll want them.  Doing so can allow
> the disks and OS to relax the order in which you receive them, which
> may allow you to process them while IO
> continues, and it may also optimise away some seeking and settle
> time.  Maybe.
Sorry to be unclear. To achieve a massive speedup (12X for 12 disks with
RAID 0) requires that you know what reads to perform in advance. The
moment you do not, you only have a starting point, your operations begin
to serialize again. For example, you must scan the first index, to be
able to know what table rows to read. At a minimum, this breaks your
query into: 1) Preload all the index pages you will need, 2) Scan the
index pages you needed, 3) Preload all the table pages you will need, 4)
Scan the table pages you needed. But do you really need the whole index?
What if you only need parts of the index, and this plan now reads the
whole index using async I/O "just in case" it is useful? Index is a
B-Tree for a reason. In Matthew's case where he has an IN clause with
thousands of possibles (I think?), perhaps a complete index scan is
always the best case - but that's only one use case, and in my opinion,
an obscure one. As soon as additional table joins become involved, the
chance that whole index scans are required would probably normally
reduce, which turns the index scan into a regular B-Tree scan, which is
difficult to perform async I/O for, as you don't necessarily know which
pages to read next.

It seems like a valuable goal - but throwing imaginary numbers around
does not appeal to me. I am more interested in Gregory's simulations. I
would like to understand his simulation better, and see his results.
Speculation about amazing potential is barely worth the words used to
express it. The real work is in design and implementation. :-)

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>

Re: RAID arrays and performance

From:
Mark Mielke
Date:
Matthew wrote:
> On Tue, 4 Dec 2007, Mark Mielke wrote:
>>> The larger the set of requests, the closer the performance will scale to
>>> the number of discs.
>> This assumes that you can know which pages to fetch ahead of time -
>> which you do not except for sequential read of a single table.
> There are circumstances where it may be hard to locate all the pages ahead
> of time - that's probably when you're doing a nested loop join. However,
> if you're looking up in an index and get a thousand row hits in the index,
> then there you go. Page locations to load.
Sure.

>> Please show one of your query plans and how you as a person would design
>> which pages to request reads for.
> How about the query that "cluster <skrald@amossen.dk>" was trying to get
> to run faster a few days ago? Tom Lane wrote about it:
>
> | Wouldn't help, because the accesses to "questions" are not the problem.
> | The query's spending nearly all its time in the scan of "posts", and
> | I'm wondering why --- doesn't seem like it should take 6400msec to fetch
> | 646 rows, unless perhaps the data is just horribly misordered relative
> | to the index.
>
> Which is exactly what's going on. The disc is having to seek 646 times
> fetching a single row each time, and that takes 6400ms. He obviously has a
> standard 5,400 or 7,200 rpm drive with a seek time around 10ms.
Your proposal would not necessarily improve his case unless he also purchased additional disks, at which point his execution time may be different. More speculation. :-)

It seems reasonable - but still a guess.

> Or on a similar vein, fill a table with completely random values, say ten
> million rows with a column containing integer values ranging from zero to
> ten thousand. Create an index on that column, analyse it. Then pick a
> number between zero and ten thousand, and
>
> "SELECT * FROM table WHERE that_column = the_number_you_picked"
This isn't a real use case. Optimizing for the worst case scenario is not always valuable.

Cheers,
mark

-- 
Mark Mielke <mark@mielke.cc>

Re: RAID arrays and performance

From:
Matthew
Date:
On Tue, 4 Dec 2007, Gregory Stark wrote:
> Fwiw, what made you bring up this topic now? You're the second person in about
> two days to bring up precisely this issue and it was an issue I had been
> planning to bring up on -hackers as it was.

I only just joined the performance mailing list to talk about R-trees. I
would probably have brought it up earlier if I had been here earlier.
However, we're thinking of buying this large machine, and that reminded
me.

I have been biting at the bit for my bosses to allow me time to write an
indexing system for transient data - a lookup table backed by disc,
looking up from an integer to get an object, native in Java. Our system
often needs to fetch a list of a thousand different objects by a key like
that, and Postgres just doesn't do that exact thing fast. I was going to
implement it with full asynchronous IO, to do that particular job very
fast, so I have done a reasonable amount of research into the topic. In
Java, that is. It would add a little bit more performance for our system.
That wouldn't cover us - we still need to do complex queries with the same
problem, and that'll have to stay in Postgres.

Matthew

--
The early bird gets the worm. If you want something else for breakfast, get
up later.

Re: RAID arrays and performance

From:
Mark Mielke
Date:
Matthew wrote:
> On Tue, 4 Dec 2007, Gregory Stark wrote:
>> Fwiw, what made you bring up this topic now? You're the second person in about
>> two days to bring up precisely this issue and it was an issue I had been
>> planning to bring up on -hackers as it was.
> I only just joined the performance mailing list to talk about R-trees. I
> would probably have brought it up earlier if I had been here earlier.
> However, we're thinking of buying this large machine, and that reminded
> me.
>
> I have been biting at the bit for my bosses to allow me time to write an
> indexing system for transient data - a lookup table backed by disc,
> looking up from an integer to get an object, native in Java. Our system
> often needs to fetch a list of a thousand different objects by a key like
> that, and Postgres just doesn't do that exact thing fast. I was going to
> implement it with full asynchronous IO, to do that particular job very
> fast, so I have done a reasonable amount of research into the topic. In
> Java, that is. It would add a little bit more performance for our system.
> That wouldn't cover us - we still need to do complex queries with the same
> problem, and that'll have to stay in Postgres.
So much excitement and zeal - refreshing to see. And yet, no numbers! :-)

You describe a new asynchronous I/O system to map integers to Java objects above. Why would you write this? Have you tried BerkeleyDB or BerkeleyDB JE? BerkeleyDB with BDB as a backend or their full Java backend gives you a Java persistence API that will allow you to map Java objects (including integers) to other Java objects. They use generated Java run time instructions instead of reflection to store and lock your Java objects. If it came to a bet, I would bet that their research and tuning over several years, and many people, would beat your initial implementation, asynchronous I/O or not.

Asynchronous I/O is no more a magic bullet than threading. It requires a lot of work to get it right, and if one gets it wrong, it can be slower than the regular I/O or single threaded scenarios. Both look sexy on paper, neither may be the solution to your problem. Or they may be. We wouldn't know without numbers.

Cheers,
mark

-- 
Mark Mielke <mark@mielke.cc>

Re: RAID arrays and performance

From:
Matthew
Date:
On Tue, 4 Dec 2007, Mark Mielke wrote:
> So much excitement and zeal - refreshing to see. And yet, no numbers! :-)

What sort of numbers did you want to see?

> You describe a new asynchronous I/O system to map integers to Java
> objects above. Why would you write this? Have you tried BerkeleyDB or
> BerkeleyDB JE? BerkeleyDB with BDB as a backend or their full Java
> backend gives you a Java persistence API that will allow you to map Java
> objects (including integers) to other Java objects.

Looked at all those. Didn't like their performance characteristics, or
their interfaces. It's simply the fact that they're not designed for the
workload that we want to put on them.

> If it came to a bet, I would bet that their research and
> tuning over several years, and many people, would beat your initial
> implementation, asynchronous I/O or not.

Quite possibly. However, there's the possibility that it wouldn't. And I
can improve it - initial implementations often suck.

> Asynchronous I/O is no more a magic bullet than threading. It requires a
> lot of work to get it right, and if one gets it wrong, it can be slower
> than the regular I/O or single threaded scenarios. Both look sexy on
> paper, neither may be the solution to your problem. Or they may be. We
> wouldn't know without numbers.

Okay, numbers. About eight years ago I wrote the shell of a filesystem
implementation, concentrating on performance and integrity. It absolutely
whooped ext2 for both read and write speed - especially metadata write
speed.  Anything up to 60 times as fast. I wrote a load of metadata to
ext2, which took 599.52 seconds, and on my system took 10.54 seconds.
Listing it back (presumably from cache) took 1.92 seconds on ext2 and 0.22
seconds on my system. No silly caching tricks that sacrifice integrity.

It's a pity I got nowhere near finishing that system - just enough to
prove my point and get a degree, but looking back on it there are massive
ways it's rubbish and should be improved. It was an initial
implementation.  I didn't have reiserfs, jfs, or xfs available at that
time, but it would have been really interesting to compare. This is the
system I would have based my indexing thing on.

Matthew

--
Anyone who goes to a psychiatrist ought to have his head examined.

Re: RAID arrays and performance

From:
Gregory Stark
Date:
"Matthew" <matthew@flymine.org> writes:

> On Tue, 4 Dec 2007, Mark Mielke wrote:
>> So much excitement and zeal - refreshing to see. And yet, no numbers! :-)
>
> What sort of numbers did you want to see?

FWIW I posted some numbers from a synthetic case to pgsql-hackers

http://archives.postgresql.org/pgsql-hackers/2007-12/msg00088.php

Very promising


--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's On-Demand Production Tuning

Re: RAID arrays and performance

From:
James Mansion
Date:
Mark Mielke wrote:
> At a minimum, this breaks your query into: 1) Preload all the index
> pages you will need
Isn't this fairly predictable - the planner has chosen the index so it
will be operating on a bounded subset.
> , 2) Scan the index pages you needed
Yes, and AIO helps when you can scan them in arbitrary order, as they
are returned.
> , 3) Preload all the table page you will need
No - just request that they load.  You can do work as soon as any page
is returned.
> , 4) Scan the table pages you needed.
In the order that is most naturally returned by the disks.

> But do you really need the whole index?
I don't think I suggested that.
> What if you only need parts of the index, and this plan now reads the
> whole index using async I/O "just in case" it is useful?
Where did you get that from?
> index scan into a regular B-Tree scan, which is difficult to perform
> async I/O for, as you don't necessarily know which pages to read next.
>
Most B-trees are not so deep.  It would generally be a win to retain
interior nodes of indices in shared memory, even if leaf pages are not
present.  In such a case, it is quite quick to establish which leaf pages
must be demand loaded.

I'm not suggesting that Postgres indices are structured in a way that
would support this sort of thing now.

James


Re: RAID arrays and performance

From:
Mark Mielke
Date:
James Mansion wrote:
> Mark Mielke wrote:
>> At a minimum, this breaks your query into: 1) Preload all the index
>> pages you will need
> Isn't this fairly predictable - the planner has chosen the index so it
> will be operating
> on a bounded subset.
What is the bounded subset? It is bounded by the value. What value? You
need to access the first page before you know what the second page is.
PostgreSQL or the kernel should already have the hottest pages in
memory, so the value of doing async I/O is very likely the cooler pages
that are unique to the query. We don't know what the cooler pages are
until we follow the tree down.

>> , 2) Scan the index pages you needed
> Yes, and AIO helps when you can scan them in arbitrary order, as they
> are returned.

I don't think you are talking about searching a B-Tree, as the order is
important when searching, and search performance would be reduced if one
reads and scans more pages than necessary to map from the value to the
row. I presume you are talking about scanning the entire index. Where
"needed" means "all". Again, this only benefits a subset of the queries.

>> , 3) Preload all the table page you will need
> No - just request that they load.  You can do work as soon as any page
> is returned.

The difference between preload and handling async I/O in terms of
performance is debatable. Greg reports that async I/O on Linux sucks,
but posix_fadvise*() has substantial benefits. posix_fadvise*() is
preload not async I/O (he also reported that async I/O on Solaris seems
to work well). Being able to do work as the first page is available is a
micro-optimization as far as I am concerned at this point (one that may
not yet work on Linux), as the real benefit comes from utilizing all 12
disks in Matthew's case, not from guaranteeing that data is processed as
soon as possible.

>> , 4) Scan the table pages you needed.
> In the order that is most naturally returned by the disks.

Micro-optimization.

>> But do you really need the whole index?
> I don't think I suggested that.
>> What if you only need parts of the index, and this plan now reads the
>> whole index using async I/O "just in case" it is useful?
> Where did you get that from?

I get it from your presumption that you can know which pages of the
index to load in advance. The only way you can know which pages must be
loaded, is to know that you need to query them all. Unless you have some
way of speculating with some degree of accuracy which colder pages in
the index you will need, without looking at the index?

>> index scan into a regular B-Tree scan, which is difficult to perform
>> async I/O for, as you don't necessarily know which pages to read next.
> Most B-trees are not so deep.  It would generally be a win to retain
> interior nodes of indices in
> shared memory, even if leaf pages are not present.  In such a case, it
> is quite quick to establish
> which leaf pages must be demand loaded.

This is bogus. The less deep the B-Tree is, the less there should be any
requirement for async I/O. Hot index pages will be in cache.

> I'm not suggesting that Postgres indices are structured in a way that
> would support this sort
> of thing now.

In your hand waving, you have assumed that PostgreSQL B-Tree index might
need to be replaced? :-)

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>

Re: RAID arrays and performance

From:
James Mansion
Date:
Mark Mielke wrote:
> PostgreSQL or the kernel should already have the hottest pages in
> memory, so the value of doing async I/O is very likely the cooler
> pages that are unique to the query. We don't know what the cooler
> pages are until we follow three tree down.
>
I'm assuming that at the time we start to search the index, we have some
idea of value or values that we are looking for.  Or, as you say, we are
applying a function to 'all of it'.

Think of a 'between' query.  The subset of the index that can be a match
can be bounded by the leaf pages that contain the end point(s).  Similarly
if we have a merge with a sorted intermediate set from a prior step then
we also have bounds on the values.

I'm not convinced that your assertion that the index leaf pages must
necessarily be processed in-order is true - it depends what sort of
operation is under way.  I am assuming that we try hard to keep interior
index nodes and data in memory and that having identified the subset of
these that we want, we can immediately infer the set of leaves that are
potentially of interest.
> The difference between preload and handling async I/O in terms of
> performance is debatable. Greg reports that async I/O on Linux sucks,
> but posix_fadvise*() has substantial benefits. posix_fadvise*() is
> preload not async I/O (he also reported that async I/O on Solaris
> seems to work well). Being able to do work as the first page is
> available is a micro-optimization as far as I am concerned at this
> point (one that may not yet work on Linux), as the real benefit comes
> from utilizing all 12 disks in Matthew's case, not from guaranteeing
> that data is processed as soon as possible.
>
I see it as part of the same problem.  You can partition the data across
all the disks and run queries in parallel against the partitions, or you
can lay out the data in the RAID array in which case the optimiser has
very little idea how the data will map to physical data layout - so its
best bet is to let the systems that DO know decide the access strategy.
And those systems can only do that if you give them a lot of requests
that CAN be reordered, so they can choose a good ordering.

> Micro-optimization.
>
Well, you like to assert this - but why?  If a concern is the latency
(and my experience suggests that latency is the biggest issue in practice,
not throughput per se) then overlapping processing while waiting for
'distant' data is important - and we don't have any information about the
physical layout of the data that allows us to assert that forward access
pre-read of data from a file is the right strategy for accessing it as
fast as possible - we have to allow the OS (and to an increasing extent
the disks) to manage the elevator IO to best effect.  It's clear that the
speed of streaming read and write of modern disks is really high compared
to that of random access, so anything we can do to help the disks run in
that mode is pretty worthwhile even if the physical streaming doesn't
match any obvious logical ordering of the OS files or logical data pages
within them.  If you have a function to apply to a set of data elements
and the elements are independent, then requiring that the function is
applied in an order rather than conceptually in parallel is going to put
a lot of constraint on how the hardware can optimise it.

Clearly a hint to preload is better than nothing.  But it seems to me
that the worst case is that we wait for the slowest page to load and then
start processing hoping that the rest of the data stays in the buffer
cache and is 'instant', while AIO and evaluate-when-ready means that
process is still bound by the slowest data to arrive, but at that point
there is little processing still to do, and the already-processed buffers
can be reused earlier.  In the case where there is significant pressure
on the buffer cache, this can be significant.

Of course, a couple of decades bullying Sybase systems on Sun Enterprise
boxes may have left me somewhat jaundiced - but Sybase can at least
parallelise things.  Sometimes.  When it does, it's quite a big win.

> In your hand waving, you have assumed that PostgreSQL B-Tree index
> might need to be replaced? :-)
>
Sure, if the intent is to make the system thread-hot or AIO-hot, then
the change is potentially very invasive.  The strategy to evaluate
queries based on parallel execution and async IO is not necessarily very
like a strategy where you delegate to the OS buffer cache.

I'm not too bothered for the purpose of this discussion whether the way
that postgres currently navigates indexes is amenable to this.  This is
bikeshed land, right?

I think it is foolish to disregard strategies that will allow overlapping
IO and processing - and we want to keep disks reading and writing rather
than seeking.  To me that suggests AIO and disk-native queuing are quite
a big deal.  And parallel evaluation will be too as the number of cores
goes up and there is an expectation that this should reduce latency of
individual query, not just allow throughput with lots of concurrent
demand.


Re: RAID arrays and performance

From:
Mark Mielke
Date:
James Mansion wrote:
> Mark Mielke wrote:
>> PostgreSQL or the kernel should already have the hottest pages in
>> memory, so the value of doing async I/O is very likely the cooler
>> pages that are unique to the query. We don't know what the cooler
>> pages are until we follow three tree down.
>>
> I'm assuming that at the time we start to search the index, we have
> some idea of value or values that
> we are looking for.  Or, as you say, we are applying a function to
> 'all of it'.
>
> Think of a 'between' query.  The subset of the index that can be a
> match can be bounded by the leaf
> pages that contain the end point(s).  Similarly if we have a merge
> with a sorted intermediate set from
> a prior step then we also have bounds on the values.

How do you find the bounding points for these pages? Does this not
require descending through the tree in a more traditional way?

> I'm not convinced that your assertion that the index leaf pages must
> necessarily be processed in-order
> is true - it depends what sort of operation is under way.  I am
> assuming that we try hard to keep
> interior index nodes and data in meory and that having identified the
> subset of these that we want, we
> can immediately infer the set of leaves that are potentially of interest.

It is because you are missing my point. In order to find a reduced set
of pages to load, one must descend through the B-Tree. Identifying the
subset requires sequential loading of pages. There is some theoretical
potential for async I/O, but I doubt your own assertion that this is
significant in any meaningful way. I ask you again - how do you know
which lower level pages to load before you load the higher level pages?
The only time the B-Tree can be processed out of order in this regard is
if you are doing a bitmap index scan or some similar operation that will
scan the entire tree, and you do not care what order the pages arrive
in. If you are looking for a specific key, one parent page leads to one
child page, and the operation is sequential.

>> The difference between preload and handling async I/O in terms of
>> performance is debatable. Greg reports that async I/O on Linux sucks,
>> but posix_fadvise*() has substantial benefits. posix_fadvise*() is
>> preload not async I/O (he also reported that async I/O on Solaris
>> seems to work well). Being able to do work as the first page is
>> available is a micro-optimization as far as I am concerned at this
>> point (one that may not yet work on Linux), as the real benefit comes
>> from utilizing all 12 disks in Matthew's case, not from guaranteeing
>> that data is processed as soon as possible.
> I see it as part of the same problem.  You can partition the data
> across all the disks and run queries in parallel
> against the partitions, or you can lay out the data in the RAID array
> in which case the optimiser has very little idea
> how the data will map to physical data layout - so its best bet is to
> let the systems that DO know decide the
> access strategy.  And those systems can only do that if you give them
> a lot of requests that CAN be reordered,
> so they can choose a good ordering.

You can see it however you like - what remains, is that the 12X speed up
is going to come from using 12 disks, and loading the 12 disks in
parallel. More theoretical improvements with regard to the ability for a
particular hard disk to schedule reads and return results out of order,
have not, in my reading, been shown to reliably improve performance to a
significant degree. Unfortunately, Linux doesn't support my SATA NCQ
yet, so I haven't been able to experiment myself. Gregory provided
numbers showing a 2X - 3X performance when using three disks. This has
the potential for significant improvement with real hardware, modest
cost, and perhaps trivial changes to PostgreSQL. What you describe is a
re-design of the I/O strategy that will be designed for asynchronous
I/O, with some sort of state machine that will be able to deal
efficiently with either index pages or table pages out of order. Do you
have numbers to show that such a significant change would result in a
reasonable return on the investment?

>> Micro-optimization.
> Well, you like to assert this - but why?
I'll quote from:

    http://en.wikipedia.org/wiki/Native_Command_Queuing

Most reading I have done shows NCQ to have limited gains, with most
benchmarks being suspect. Also, latency is only for the first page. One
presumes that asynch I/O will be mostly valuable in the case where many
pages can be scheduled for reading at the same time. In the case that
many pages are scheduled for reading, the first page will be eventually
served, and the overall bandwidth is still the same. In your theoretical
model, you are presuming the CPU is a bottleneck, either for processing,
or scheduling the next batch of reads. I think you are hand waving, and
given that Linux doesn't yet support asynch I/O well, Gregory's model
will serve more PostgreSQL users than yours with less complexity.

> Clearly a hint to preload is better than nothing.  But it seems to me
> that the worst case is that we wait for
> the slowest page to load and then start processing hoping that the
> rest of the data stays in the buffer cache
> and is 'instant', while AIO and evaluate-when-ready means that process
> is still bound by the slowest
> data to arrive, but at that point there is little processing still to
> do, and the already-processed buffers can be
> reused earlier.  In the case where there is significant presure on the
> buffer cache, this can be significant.

It seems to me that you have yet to prove that there will be substantial
gains in overall performance for preload. Leaping on to presuming that
PostgreSQL should switch to a fully asynch I/O model seems a greater
stretch. By the sounds of it, Gregory could have the first implemented
very soon. When will you have yours? :-)

>> In your hand waving, you have assumed that PostgreSQL B-Tree index
>> might need to be replaced? :-)
> Sure, if the intent is to make the system thread-hot or AIO-hot, then
> the change is potentially very
> invasive.  The strategy to evaluate queries based on parallel
> execution and async IO is not necessarily
> very like a strategy where you delegate to the OS buffer cache.
>
> I'm not too bothered for the urpose of this discussion whether the way
> that postgres currently
> navigates indexes is amenable to this.  This is bikeshed land, right?

I am only interested by juicy projects that have a hope of success. This
subject does interest me - I am hoping my devil's advocate participation
encourages people to seek a practical implementation that will benefit me.

> I think it is foolish to disregard strategies that will allow
> overlapping IO and processing - and we want to
> keep disks reading and writing rather than seeking.  To me that
> suggests AIO and disk-native queuing
> are quite a big deal.  And parallel evaluation will be too as the
> number of cores goes up and there is
> an expectation that this should reduce latency of individual query,
> not just allow throughput with lots
> of concurrent demand.

I am more amenable to multi-threaded index processing for the same query
than async I/O to take advantage of NCQ. Guess we each come from a
different background. :-)

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>

Re: RAID arrays and performance

From:
Greg Smith
Date:
On Tue, 4 Dec 2007, Mark Mielke wrote:

>> This is bikeshed land, right?
> I am only interested by juicy projects that have a hope of success. This
> subject does interest me - I am hoping my devil's advocate participation
> encourages people to seek a practical implementation that will benefit me.

Nah, you're both in bikeshed land.  Ultimately this feature is going to
get built out in a way that prioritizes as much portability as is possible
while minimizing the impact on existing code.  The stated PostgreSQL bias
is that async-I/O is not worth the trouble until proven otherwise:
http://www.postgresql.org/docs/faqs.FAQ_DEV.html#item1.14

That's why Greg Stark is busy with low-level tests of I/O options while
you're arguing much higher level details.  Until you've got some
benchmarks in this specific area to go by, you can talk theory all day and
not impact what implementation will actually get built one bit.

As an example of an area you've already brought up where theory and
benchmarks clash violently, the actual situation with NCQ on SATA drives
has been heavily blurred because of how shoddy some manufacturers' NCQ
implementations are.  Take a look at the interesting thread on this topic
at http://lkml.org/lkml/2007/4/3/159 to see what I'm talking about.  Even
if you have an NCQ drive, and a version of Linux that supports it, and
you've set everything up right, you can still end up with unexpected
performance regressions under some circumstances.  It's still the wild
west with that technology.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: RAID arrays and performance

От
Bruce Momjian
Дата:
Mark Mielke wrote:
> Asynchronous I/O is no more a magic bullet than threading. It requires a
> lot of work to get it right, and if one gets it wrong, it can be slower
> than the regular I/O or single threaded scenarios. Both look sexy on
> paper, neither may be the solution to your problem. Or they may be. We
> wouldn't know without numbers.

Agreed.  We currently don't use multiple CPUs or multiple disks
efficiently for single-query loads.  There is certainly more we could do
in these areas, and it is on the TODO list.

The good news is that most work loads are multi-user and we use
resources more evenly in those cases.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: RAID arrays and performance

От
Gregory Stark
Дата:
"Matthew" <matthew@flymine.org> writes:

> On Tue, 4 Dec 2007, Gregory Stark wrote:
>> FWIW I posted some numbers from a synthetic case to pgsql-hackers
>>
>> http://archives.postgresql.org/pgsql-hackers/2007-12/msg00088.php
>...
> This was with 8192 random requests of size 8192 bytes from an 80GB test file.
> Unsorted requests ranged from 1.8 MB/s with no prefetching to 28MB/s with lots
> of prefetching. Sorted requests went from 2.4MB/s to 38MB/s. That's almost
> exactly 16x improvement for both, and this is top of the line hardware.

Neat. The curves look very similar to mine. I also like that with your
hardware the benefit maxes out at pretty much exactly where I had
mathematically predicted they would ((stripe size)^2 / 2).

> So, this is FYI, and also an added encouragement to implement fadvise
> prefetching in some form or another. How's that going by the way?

I have a patch which implements it for the low-hanging fruit of bitmap index
scans. It does it using an extra trip through the buffer manager, which is the
least invasive approach but not necessarily the best.

Heikki was pushing me to look at changing the buffer manager to support doing
asynchronous buffer reads. In that scheme you would issue a ReadBufferAsync,
which would give you back a pinned buffer; the scan would have to hold onto it
and make a new buffer manager call to complete the i/o when it actually
needed the buffer. The ReadBufferAsync call would put the buffer in some form
of i/o-in-progress state.

On systems with only posix_fadvise ReadBufferAsync would issue posix_fadvise
and ReadBufferFinish would issue a regular read. On systems with an effective
libaio the ReadBufferAsync could issue the aio operation and ReadBufferFinish
could then do an aio_return.
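
As a rough sketch of that two-step pattern (an illustration only, not
PostgreSQL source; the file name and block numbers are made up, and buffer
pinning and error handling are omitted):

/* Sketch of the two-step pattern described above (not actual PostgreSQL
 * code): the "async" step only hints the kernel, the "finish" step
 * performs the real read. */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLCKSZ 8192

/* "Async" step: on a posix_fadvise-only system, just hint the kernel. */
static void read_buffer_async(int fd, off_t blocknum)
{
    posix_fadvise(fd, blocknum * BLCKSZ, BLCKSZ, POSIX_FADV_WILLNEED);
}

/* "Finish" step: complete the i/o with an ordinary (blocking) read. */
static ssize_t read_buffer_finish(int fd, off_t blocknum, char *page)
{
    return pread(fd, page, BLCKSZ, blocknum * BLCKSZ);
}

int main(void)
{
    int fd = open("testfile", O_RDONLY);          /* placeholder file name */
    off_t blocks[] = {17, 4242, 98765, 31415};    /* made-up block numbers */
    char page[BLCKSZ];

    /* Issue all the hints first so the kernel and the RAID array can
     * service the seeks concurrently... */
    for (int i = 0; i < 4; i++)
        read_buffer_async(fd, blocks[i]);

    /* ...then consume the pages one by one as the scan needs them. */
    for (int i = 0; i < 4; i++)
        if (read_buffer_finish(fd, blocks[i], page) < 0)
            perror("pread");

    close(fd);
    return 0;
}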

The pros of such an approach would be less locking and cpu overhead in the
buffer manager since the second trip would have the buffer handle handy and
just have to issue the read.

The con of such an approach would be the extra shared buffers occupied by
buffers which aren't really needed yet. Also the additional complexity in the
buffer manager with the new i/o in progress state. (I'm not sure but I don't
think we get to reuse the existing i/o in progress state.)

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!

Re: RAID arrays and performance

От
Matthew
Дата:
On Tue, 29 Jan 2008, Gregory Stark wrote:
>> This was with 8192 random requests of size 8192 bytes from an 80GB test file.
>> Unsorted requests ranged from 1.8 MB/s with no prefetching to 28MB/s with lots
>> of prefetching. Sorted requests went from 2.4MB/s to 38MB/s. That's almost
>> exactly 16x improvement for both, and this is top of the line hardware.
>
> Neat. The curves look very similar to mine. I also like that with your
> hardware the benefit maxes out at pretty much exactly where I had
> mathematically predicted they would ((stripe size)^2 / 2).

Why would that be the case? Does that mean that we can select a stripe
size of 100GB and get massive performance improvements? Doesn't seem
logical to me. To me, it maxes out at 16x speed because there are 16
discs.

Amusingly, there appears to be a spam filter preventing my message (with
its image) getting through to the performance mailing list.

Matthew

Re: RAID arrays and performance

От
Gregory Stark
Дата:
"Matthew" <matthew@flymine.org> writes:

> On Tue, 29 Jan 2008, Gregory Stark wrote:
>>> This was with 8192 random requests of size 8192 bytes from an 80GB test file.
>>> Unsorted requests ranged from 1.8 MB/s with no prefetching to 28MB/s with lots
>>> of prefetching. Sorted requests went from 2.4MB/s to 38MB/s. That's almost
>>> exactly 16x improvement for both, and this is top of the line hardware.
>>
>> Neat. The curves look very similar to mine. I also like that with your
>> hardware the benefit maxes out at pretty much exactly where I had
>> mathematically predicted they would ((stripe size)^2 / 2).
>
> Why would that be the case? Does that mean that we can select a stripe size of
> 100GB and get massive performance improvements? Doesn't seem logical to me. To
> me, it maxes out at 16x speed because there are 16 discs.

Sorry, I meant "number of drives in the array" not number of bytes. So with 16
drives you would need approximately 128 random pending i/o operations to
expect all drives to be busy at all times.

I got this from a back-of-the-envelope calculation which now that I'm trying
to reproduce it seems to be wrong. Previously I thought it was n(n+1)/2 or
about n^2/2. So at 16 I would have expected about 128 pending i/o requests
before all the drives could be expected to be busy.

Now that I'm working it out more carefully I'm getting that the expected
number of pending i/o requests before all drives are busy is
 n + n/2 + n/3 + ... + n/n

which is actually n * H(n) which is approximated closely by n * log(n).

That would predict that 16 drives would actually max out at 44.4 pending i/o
requests. It would predict that my three-drive array would max out well below
that at 7.7 pending i/o requests. Empirically neither result seems to match
reality. Other factors must be dominating.
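
For reference, here is a quick check (my own addition, not part of the
original exchange) that evaluates the exact series next to the n*ln(n)
approximation used above; the exact sum for 16 drives comes out around 54
rather than 44:

/* Quick check (not from the thread): evaluate n + n/2 + ... + n/n exactly
 * and compare with the n * ln(n) approximation. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    int sizes[] = {3, 12, 16};             /* drive counts mentioned in the thread */
    for (int i = 0; i < 3; i++) {
        int n = sizes[i];
        double exact = 0.0;
        for (int k = 1; k <= n; k++)
            exact += (double) n / k;       /* the n/k term of the series */
        printf("n=%2d  n*H(n)=%5.1f  n*ln(n)=%5.1f\n", n, exact, n * log(n));
    }
    return 0;
}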

> Amusingly, there appears to be a spam filter preventing my message (with its
> image) getting through to the performance mailing list.

This has been plaguing us for a while. When we figure out whose misconfigured
system is doing it, I expect they'll be banned from the internet for life!

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!

Re: RAID arrays and performance

От
"Kevin Grittner"
Дата:
>>> On Tue, Jan 29, 2008 at  9:52 AM, in message
<877ihsvdcb.fsf@oxford.xeocode.com>, Gregory Stark <stark@enterprisedb.com>
wrote:

> I got this from a back-of-the-envelope calculation which now that I'm trying
> to reproduce it seems to be wrong. Previously I thought it was n(n+1)/2 or
> about n^2/2. So at 16 I would have expected about 128 pending i/o requests
> before all the drives could be expected to be busy.

That seems right to me, based on the probabilities of any new
request hitting an already-busy drive.

> Now that I'm working it out more carefully I'm getting that the expected
> number of pending i/o requests before all drives are busy is
>  n + n/2 + n/3 + ... + n/n

What's the basis for that?

-Kevin




Re: RAID arrays and performance

От
Gregory Stark
Дата:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

> >>> On Tue, Jan 29, 2008 at  9:52 AM, in message
> <877ihsvdcb.fsf@oxford.xeocode.com>, Gregory Stark <stark@enterprisedb.com>
> wrote:
>
> > I got this from a back-of-the-envelope calculation which now that I'm trying
> > to reproduce it seems to be wrong. Previously I thought it was n(n+1)/2 or
> > about n^2/2. So at 16 I would have expected about 128 pending i/o requests
> > before all the drives could be expected to be busy.
>
> That seems right to me, based on the probabilities of any new
> request hitting an already-busy drive.
>
> > Now that I'm working it out more carefully I'm getting that the expected
> > number of pending i/o requests before all drives are busy is
> >  n + n/2 + n/3 + ... + n/n
>
> What's the basis for that?

Well consider when you've reached n-1 drives; the expected number of requests
before you hit the 1 idle drive remaining out of n would be n requests. When
you're at n-2 the expected number of requests before you hit either of the two
idle drives would be n/2. And so on. The last term of n/n would be the first
i/o when all the drives are idle and you obviously only need one i/o to hit an
idle drive.
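
As a quick sanity check of that argument (my own addition, not part of the
original exchange), one can simulate it: throw requests at n drives uniformly
at random and count how many arrive before every drive has one queued.

/* Sanity-check simulation (not from the thread): count random requests
 * until every one of n drives has at least one queued, averaged over
 * many trials.  The mean should approach n * (1 + 1/2 + ... + 1/n). */
#include <stdio.h>
#include <stdlib.h>

#define NDRIVES 16
#define TRIALS  100000

int main(void)
{
    long long total = 0;
    srand(12345);                              /* fixed seed for repeatability */

    for (int t = 0; t < TRIALS; t++) {
        int busy[NDRIVES] = {0};
        int busy_count = 0, requests = 0;
        while (busy_count < NDRIVES) {
            int d = rand() % NDRIVES;          /* each request hits a random drive */
            requests++;
            if (!busy[d]) { busy[d] = 1; busy_count++; }
        }
        total += requests;
    }
    printf("average requests before all %d drives are busy: %.1f\n",
           NDRIVES, (double) total / TRIALS);
    return 0;
}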

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: RAID arrays and performance

От
"Kevin Grittner"
Дата:
>>> On Tue, Jan 29, 2008 at 10:45 AM, in message
<873asginrz.fsf@stark.xeocode.com>, Gregory Stark <gsstark@mit.edu> wrote:

> Well consider when you've reached n-1 drives; the expected number of requests
> before you hit the 1 idle drive remaining out of n would be n requests. When
> you're at n-2 the expected number of requests before you hit either of the
> two
> idle drives would be n/2. And so on. The last term of n/n would be the first
> i/o when all the drives are idle and you obviously only need one i/o to hit
> an
> idle drive.

You're right.  Perhaps the reason more requests continue to improve
performance is that a smart controller will move across the tracks
and satisfy the pending requests in the most efficient order?

-Kevin




Re: RAID arrays and performance

От
Matthew Wakeling
Дата:
On Tue, 29 Jan 2008, Gregory Stark wrote:
>> So, this is FYI, and also an added encouragement to implement fadvise
>> prefetching in some form or another. How's that going by the way?
>
> I have a patch which implements it for the low-hanging fruit of bitmap index
> scans. It does it using an extra trip through the buffer manager, which is the
> least invasive approach but not necessarily the best.

Gregory - what's the status of that patch at the moment? Will it be making
it into a new version of Postgres, or are we waiting for it to be
implemented fully?

It's just that our system is doing a lot of bitmap index scans at the
moment, and it'd help to be able to spread them across the 16 discs in
the RAID array. It's the bottleneck in our system at the moment.

Matthew

--
The email of the species is more deadly than the mail.

Re: RAID arrays and performance

От
Greg Smith
Дата:
On Thu, 18 Sep 2008, Matthew Wakeling wrote:

> On Tue, 29 Jan 2008, Gregory Stark wrote:

>> I have a patch which implements it for the low-hanging fruit of bitmap
>> index scans. It does it using an extra trip through the buffer manager,
>> which is the least invasive approach but not necessarily the best.
>
> Gregory - what's the status of that patch at the moment? Will it be making it
> into a new version of Postgres, or are we waiting for it to be implemented
> fully?

It and a related fadvise patch have been floating around the project queue
for a while now.  I just sorted through all the various patches and
messages related to this area and updated the list at
http://wiki.postgresql.org/wiki/CommitFestInProgress#Pending_patches
recently. I'm trying to kick a broader, reviewed version of this concept back
out right now.

> It's just that our system is doing a lot of bitmap index scans at the moment,
> and it'd help to be able to spread them across the 16 discs in the RAID
> array. It's the bottleneck in our system at the moment.

If you have some specific bitmap index scan test case suggestions you can
pass along (either publicly or in private to me, I can probably help
anonymize them), that's one of the things that has been holding this up.
Alternately, if you'd like to join in on testing this all out more help
would certainly be welcome.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: RAID arrays and performance

От
"Scott Marlowe"
Дата:
On Thu, Sep 18, 2008 at 1:30 PM, Greg Smith <gsmith@gregsmith.com> wrote:
> If you have some specific bitmap index scan test case suggestions you can
> pass along (either publicly or in private to me, I can probably help
> anonymize them), that's one of the things that has been holding this up.
> Alternately, if you'd like to join in on testing this all out more help
> would certainly be welcome.

I posted in pgsql-perform about a query using bitmap heap scans that's
really slow compared to just using nested loops.  Don't know if that is
relevant or not.

Re: RAID arrays and performance

От
Matthew Wakeling
Дата:
On Thu, 18 Sep 2008, Greg Smith wrote:
>> It's just that our system is doing a lot of bitmap index scans at the
>> moment, and it'd help to be able to spread them across the 16 discs in the
>> RAID array. It's the bottleneck in our system at the moment.
>
> If you have some specific bitmap index scan test case suggestions you can
> pass along (either publicly or in private to me, I can probably help
> anonymize them), that's one of the things that has been holding this up.

Okay, here's a description of what we're doing. We are importing data from
a large file (so we don't have a choice on the order that the data comes
in). For each entry in the source, we have to look up the corresponding
row in the database, and issue the correct "UPDATE" command according to
what's in the database and what's in the data source. Long story - it
isn't feasible to change the overall process.

In order to improve the performance, I made the system look ahead in the
source, in groups of a thousand entries, so instead of running:

SELECT * FROM table WHERE field = 'something';

a thousand times, we now run:

SELECT * FROM table WHERE field IN ('something', 'something else'...);

with a thousand things in the IN. Very simple query. It does run faster
than the individual queries, but it still takes quite a while. Here is an
example query:

SELECT a1_.id AS a1_id, a1_.primaryIdentifier AS a2_ FROM Gene AS a1_
WHERE a1_.primaryIdentifier IN ('SPAC11D3.15', 'SPAC11D3.16c',
'SPAC11D3.17', 'SPAC11D3.18c', 'SPAC15F9.01c', 'SPAC15F9.02', 'SPAC16.01',
'SPAC18G6.01c', 'SPAC18G6.02c', 'SPAC18G6.04c', 'SPAC18G6.05c',
'SPAC18G6.06', 'SPAC18G6.07c', 'SPAC18G6.09c', 'SPAC18G6.10',
'SPAC18G6.11c', 'SPAC18G6.12c', 'SPAC18G6.13', 'SPAC18G6.14c',
'SPAC18G6.15', 'SPAC1B9.01c', 'SPAC1D4.02c', 'SPAC1D4.03c', 'SPAC1D4.04',
'SPAC1D4.05c', 'SPAC1D4.07c', 'SPAC1D4.08', 'SPAC1D4.09c', 'SPAC1D4.10',
'SPAC1D4.11c', 'SPAC1F3.11', 'SPAC23A1.10', 'SPAC23E2.01', 'SPAC23E2.02',
'SPAC23E2.03c', 'SPAC26A3.02', 'SPAC26A3.03c', 'SPAC26A3.05',
'SPAC26A3.06', 'SPAC26A3.07c', 'SPAC26A3.08', 'SPAC26A3.09c',
'SPAC26A3.10', 'SPAC26A3.11', 'SPAC26A3.14c', 'SPAC26A3.15c',
'SPAC26A3.16', 'SPAC27F1.01c', 'SPAC27F1.03c', 'SPAC27F1.04c',
'SPAC27F1.05c', 'SPAC27F1.06c', 'SPAC3H8.02', 'SPAC3H8.03', 'SPAC3H8.04',
'SPAC3H8.05c', 'SPAC3H8.06', 'SPAC3H8.07c', 'SPAC3H8.08c', 'SPAC3H8.09c',
'SPAC3H8.10', 'SPAC3H8.11', 'SPAC8E11.11', 'SPBC106.15', 'SPBC17G9.10',
'SPBC24E9.15c', 'WBGene00000969', 'WBGene00003035', 'WBGene00004095',
'WBGene00016011', 'WBGene00018672', 'WBGene00018674', 'WBGene00018675',
'WBGene00018676', 'WBGene00018959', 'WBGene00018960', 'WBGene00018961',
'WBGene00023407') ORDER BY a1_.id LIMIT 2000;

And the corresponding EXPLAIN ANALYSE:

  Limit  (cost=331.28..331.47 rows=77 width=17) (actual time=121.973..122.501 rows=78 loops=1)
    ->  Sort  (cost=331.28..331.47 rows=77 width=17) (actual time=121.968..122.152 rows=78 loops=1)
          Sort Key: id
          Sort Method:  quicksort  Memory: 29kB
          ->  Bitmap Heap Scan on gene a1_  (cost=174.24..328.87 rows=77 width=17) (actual time=114.311..121.705 rows=78 loops=1)
                Recheck Cond: (primaryidentifier = ANY ('{SPAC11D3.15...
                ->  Bitmap Index Scan on gene__key_primaryidentifier  (cost=0.00..174.22 rows=77 width=0) (actual time=44.434..44.434 rows=150 loops=1)
                      Index Cond: (primaryidentifier = ANY ('{SPAC11D3.15,SPAC11D3.16c...
  Total runtime: 122.733 ms
(9 rows)

Although it's probably in the cache, as it took 1073 ms the first time.
The table has half a million rows, but tables all over the database are
being accessed, so the cache is shared between several hundred million
rows.

Postgres executes this query in two stages. First it does a trawl of the
index (on field), and builds a bitmap. Then it fetches the pages
according to the bitmap. I can see the second stage being quite easy to
adapt for fadvise, but the first stage would be a little more tricky. Both
stages are equally important, as they take a comparable amount of time.
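
For the second stage, the rough shape of an fadvise adaptation might look like
this (a sketch of the idea only, not PostgreSQL code; PREFETCH_DIST is a
made-up tunable, and it reuses the same posix_fadvise hint shown earlier in
the thread):

/* Sketch of prefetching the heap-fetch stage of a bitmap scan (an
 * illustration only): keep a window of fadvise hints running ahead of
 * the block currently being read. */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ        8192
#define PREFETCH_DIST 32          /* hypothetical tunable: hints kept in flight */

void scan_heap_blocks(int fd, const long *blocks, int nblocks, char *page)
{
    int prefetched = 0;

    for (int i = 0; i < nblocks; i++) {
        /* Keep up to PREFETCH_DIST hints outstanding ahead of the read
         * position, so a 16-spindle array has enough queued work to keep
         * all the discs seeking in parallel. */
        while (prefetched < nblocks && prefetched < i + PREFETCH_DIST) {
            posix_fadvise(fd, blocks[prefetched] * (off_t) BLCKSZ,
                          BLCKSZ, POSIX_FADV_WILLNEED);
            prefetched++;
        }

        /* Fetch the current heap page and process its tuples (recheck etc.). */
        pread(fd, page, BLCKSZ, blocks[i] * (off_t) BLCKSZ);
    }
}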

We are running this database on a 16-spindle RAID array, so the benefits
to our process of fully utilising them would be quite large. I'm
considering if I can parallelise things a little though.

> Alternately, if you'd like to join in on testing this all out more help would
> certainly be welcome.

How would you like me to help?

Matthew

--
What goes up must come down. Ask any system administrator.

Re: RAID arrays and performance

От
Tom Lane
Дата:
Matthew Wakeling <matthew@flymine.org> writes:
> In order to improve the performance, I made the system look ahead in the
> source, in groups of a thousand entries, so instead of running:
> SELECT * FROM table WHERE field = 'something';
> a thousand times, we now run:
> SELECT * FROM table WHERE field IN ('something', 'something else'...);
> with a thousand things in the IN. Very simple query. It does run faster
> than the individual queries, but it still takes quite a while. Here is an
> example query:

Your example shows the IN-list as being sorted, but I wonder whether you
actually are sorting the items in practice?  If not, you might try that
to improve locality of access to the index.

Also, parsing/planning time could be part of your problem here with 1000
things to look at.  Can you adjust your client code to use a prepared
query?  I'd try
    SELECT * FROM table WHERE field = ANY($1::text[])
(or whatever the field datatype actually is) and then push the list
over as a single parameter value using array syntax.  You might find
that it scales to much larger IN-lists that way.
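
In case it helps, a sketch of what that could look like from a libpq client
(the connection string, statement name, and key values are placeholders):

/* Sketch of the prepared-query-with-array-parameter approach (placeholders
 * throughout): the query text never changes, only the array value does. */
#include <stdio.h>
#include <libpq-fe.h>

int main(void)
{
    PGconn *conn = PQconnectdb("dbname=flymine");      /* placeholder connection string */
    if (PQstatus(conn) != CONNECTION_OK) {
        fprintf(stderr, "%s", PQerrorMessage(conn));
        return 1;
    }

    /* Prepare once; the parse/plan cost is paid a single time. */
    PQclear(PQprepare(conn, "fetch_genes",
        "SELECT id, primaryIdentifier FROM Gene"
        " WHERE primaryIdentifier = ANY($1::text[]) ORDER BY id",
        1, NULL));

    /* Each batch of up to a thousand keys goes over as one array literal. */
    const char *params[1] = { "{SPAC11D3.15,SPAC11D3.16c,WBGene00023407}" };
    PGresult *res = PQexecPrepared(conn, "fetch_genes", 1, params,
                                   NULL, NULL, 0);     /* text format in and out */
    if (PQresultStatus(res) == PGRES_TUPLES_OK)
        printf("%d rows\n", PQntuples(res));
    PQclear(res);
    PQfinish(conn);
    return 0;
}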

            regards, tom lane

Re: RAID arrays and performance

От
Matthew Wakeling
Дата:
On Fri, 19 Sep 2008, Tom Lane wrote:
> Your example shows the IN-list as being sorted, but I wonder whether you
> actually are sorting the items in practice?  If not, you might try that
> to improve locality of access to the index.

Well, like I said, we generally don't have the luxury of dictating the
order of entries in the data source. However, the IN list itself is sorted
- more to do with making the logs readable and the tests reproducible than
for performance.

However, I have been looking at changing the order of the input data. This
particular data source is a 29GB xml file, and I wrote a quick program
which sorts that by one key in 40 minutes, which will hopefully allow
later data sources (which are easier to sort) to take advantage of spatial
locality in the table. However, that key is not the same one as the one
used in the query above, which is why I say we can't really dictate the order
of the entries. There's another complication which I won't go into.

> Also, parsing/planning time could be part of your problem here with 1000
> things to look at.  Can you adjust your client code to use a prepared
> query?  I'd try
>     SELECT * FROM table WHERE field = ANY($1::text[])
> (or whatever the field datatype actually is) and then push the list
> over as a single parameter value using array syntax.  You might find
> that it scales to much larger IN-lists that way.

Yes, that is a useful suggestion. However, I am fairly clear that the
system is disk seek-bound at the moment, so it probably wouldn't make a
massive improvement. It would also unfortunately require changing a lot of
our code. Worth doing at some point.

Matthew

--
"Interwoven alignment preambles are not allowed."
If you have been so devious as to get this message, you will understand
it, and you deserve no sympathy.  -- Knuth, in the TeXbook

Re: RAID arrays and performance

От
Mark Mielke
Дата:
Matthew Wakeling <matthew@flymine.org> writes:
> In order to improve the performance, I made the system look ahead in the
> source, in groups of a thousand entries, so instead of running:
> SELECT * FROM table WHERE field = 'something';
> a thousand times, we now run:
> SELECT * FROM table WHERE field IN ('something', 'something else'...);
> with a thousand things in the IN. Very simple query. It does run faster
> than the individual queries, but it still takes quite a while. Here is an
> example query:

Have you considered temporary tables? Use COPY to put everything you want to
query into a temporary table, then SELECT to join against it, pulling all of
the results and doing the additional processing (UPDATE) as you pull?
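
Something along these lines, for example (a sketch only; the connection
string is a placeholder and the table and column names are taken from the
earlier messages):

/* Sketch of the temp-table approach (illustrative only): COPY the batch of
 * keys into a temporary table, then join against it once instead of sending
 * a thousand-element IN-list. */
#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

int main(void)
{
    PGconn *conn = PQconnectdb("dbname=flymine");      /* placeholder connection string */
    if (PQstatus(conn) != CONNECTION_OK) {
        fprintf(stderr, "%s", PQerrorMessage(conn));
        return 1;
    }

    PQclear(PQexec(conn, "CREATE TEMP TABLE lookup_keys (key text)"));

    /* Stream the next batch of keys from the data source via COPY. */
    PQclear(PQexec(conn, "COPY lookup_keys FROM STDIN"));
    const char *batch = "SPAC11D3.15\nSPAC11D3.16c\nWBGene00023407\n";
    PQputCopyData(conn, batch, (int) strlen(batch));
    PQputCopyEnd(conn, NULL);
    PQclear(PQgetResult(conn));                        /* collect the COPY's result */

    /* One join replaces the big IN-list; the UPDATEs happen as rows are pulled. */
    PGresult *res = PQexec(conn,
        "SELECT g.id, g.primaryIdentifier"
        "  FROM Gene g JOIN lookup_keys k ON g.primaryIdentifier = k.key"
        " ORDER BY g.id");
    if (PQresultStatus(res) == PGRES_TUPLES_OK)
        printf("%d rows\n", PQntuples(res));
    PQclear(res);
    PQfinish(conn);
    return 0;
}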

Cheers,
mark
-- 
Mark Mielke <mark@mielke.cc>