Thread: Question about disk IO and index use and seeking advice


Question about disk IO and index use and seeking advice

From:
"Nikolas Everett"
Date:
I have a question about index use and IO and am seeking advice.

We are running Postgres 8.2.  We have two big tables.  Our ~600,000,000 row table is changed very infrequently and is, for historical reasons, on a 12 disk software RAID-6 attached to an LSI Logic / Symbios Logic SAS1068 PCI-X Fusion-MPT SAS controller.  Our ~50,000,000 row staging table is on a 12 disk hardware RAID-10 using a Dell PowerEdge Expandable RAID Controller 5.  All of the rows in the staging table are changed at least once and then deleted and recreated in the bigger table.  All of the staging table's indexes are on the RAID-10.  The Postgres data directory itself is on the RAID-6.  I think all the disks are SATA 10Ks.  The setup is kind of a beast.

So my disk IO and index question.  When I issue a query on the big table like this:
SELECT    column, count(*)
FROM    bigtable
GROUP BY column
ORDER BY count DESC
When I run dstat to see my disk IO I see the software raid-6 consistently holding over 70M/sec.  This is fine with me, but I generally don't like to do queries that table scan 600,000,000 rows.  So I do:
SELECT    column, count(*)
FROM    bigtable
WHERE date > '4-24-08'
GROUP BY column
ORDER BY count DESC
When I run dstat I see only around 2M/sec and it is not consistent at all.

So my question is: why do I see such low IO load for the index scan version?  If I could tweak some setting to make more aggressive use of IO, would it actually make the query faster?  The field I'm scanning has a 0.960858 correlation, and although I haven't vacuumed since importing any of the data that I'm scanning, the correlation should remain very high.  When I do a similar set of queries on the hardware RAID I see similar behavior, except the numbers are both more than doubled.

Here is the explain output for the queries:
SELECT    column, count(*)
FROM    bigtable
GROUP BY column
ORDER BY count DESC
"Sort  (cost=74404440.58..74404444.53 rows=1581 width=10)"
"  Sort Key: count(*)"
"  ->  HashAggregate  (cost=74404336.81..74404356.58 rows=1581 width=10)"
"        ->  Seq Scan on bigtable (cost=0.00..71422407.21 rows=596385921 width=10)"
---------------
SELECT    column, count(*)
FROM    bigtable
WHERE date > '4-24-08'
GROUP BY column
ORDER BY count DESC
"Sort  (cost=16948.80..16948.81 rows=1 width=10)"
"  Sort Key: count(*)"
"  ->  HashAggregate  (cost=16948.78..16948.79 rows=1 width=10)"
"        ->  Index Scan using date_idx on bigtable (cost=0.00..16652.77 rows=59201 width=10)"
"              Index Cond: (date > '2008-04-21 00:00:00'::timestamp without time zone)"

So now the asking for advice part.  I have a few questions:
What is the fastest way to copy data from the smaller table to the larger table?

We plan to rearrange the setup when we move to Postgres 8.3.  We'll probably move all the storage over to a SAN and slice the larger table into monthly or weekly tables.  Can someone point me to a good page on partitioning?  My gut tells me it should be better, but I'd like to learn more about why.
Does anyone have experience migrating large databases to a SAN?  I understand that it'll give me better failover capabilities so long as the SAN itself doesn't go out, but are we going to be sacrificing performance for this?  And that's not even mentioning the cost....

Thanks so much for reading through all this,

--Nik

Re: Question about disk IO and index use and seeking advice

From:
Matthew Wakeling
Date:
On Thu, 24 Apr 2008, Nikolas Everett wrote:
> The setup is kind of a beast.

No kidding.

> When I run dstat I see only around 2M/sec and it is not consistent at all.

Well, it is having to seek over the disc a little. Firstly, your table may
not be wonderfully ordered for index scans, but goodness knows how long a
CLUSTER operation might take with that much data. Secondly, when doing an
index scan, Postgres unfortunately can only use the performance equivalent
of a single disc, because it accesses the pages one by one in a
single-threaded manner. A large RAID array will give you a performance
boost if you are doing lots of index scans in parallel, but not if you are
only doing one. Greg Stark has a patch in the pipeline to improve this
though.

> When I do a similar set of queries on the hardware raid I see similar
> performance except the numbers are both more than doubled.

Hardware RAID is often better than software RAID. 'Nuff said.

> Here is the explain output for the queries:

EXPLAIN ANALYSE is even better.

> Sort  (cost=16948.80..16948.81 rows=1 width=10)"
>   Sort Key: count(*)"
>   ->  HashAggregate  (cost=16948.78..16948.79 rows=1 width=10)"
>         ->  Index Scan using date_idx on bigtable (cost=0.00..16652.77 rows=59201 width=10)"
>               Index Cond: (date > '2008-04-21 00:00:00'::timestamp without time zone)"

That doesn't look like it should take too long. How long does it take?
(EXPLAIN ANALYSE, in other words). It's a good plan, anyway.

> So now the asking for advice part.  I have two questions:
> What is the fastest way to copy data from the smaller table to the larger
> table?

INSERT INTO bigtable (field1, field2) SELECT whatever FROM staging_table
        ORDER BY staging_table.date

That will do it all in Postgres. The ORDER BY clause may slow down the
insert, but it will certainly speed up your subsequent index scans.

If the bigtable isn't getting any DELETE or UPDATE traffic, you don't need
to vacuum it. However, make sure you do vacuum the staging table,
preferably directly after moving all that data to the bigtable and
deleting it from the staging table.
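Put together, that cycle might look something like this (`field1`/`field2` are the placeholder columns from the example above; adapt to the real schema):

```sql
BEGIN;
INSERT INTO bigtable (field1, field2)
    SELECT field1, field2
    FROM staging_table
    ORDER BY staging_table.date;   -- keep bigtable correlated on date
DELETE FROM staging_table;
COMMIT;

-- Reclaim the dead rows the DELETE left behind and refresh statistics.
VACUUM ANALYZE staging_table;
```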

> Can someone point me to a good page on partitioning? My
> gut tells me it should be better, but I'd like to learn more about why.

You could possibly not bother with a staging table, and replace the mass
copy with making a new partition. Not sure of the details myself though.
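For the record, the usual 8.2 recipe is table inheritance plus CHECK constraints; a rough, untested sketch (the partition name and date bounds are purely illustrative):

```sql
-- One child table per month, inheriting from the (ideally empty) parent.
CREATE TABLE bigtable_2008_04 (
    CHECK (date >= '2008-04-01' AND date < '2008-05-01')
) INHERITS (bigtable);

CREATE INDEX bigtable_2008_04_date_idx ON bigtable_2008_04 (date);

-- Load a month of data straight into its partition, sorted to keep it
-- correlated on date.
INSERT INTO bigtable_2008_04
    SELECT * FROM staging_table
    WHERE date >= '2008-04-01' AND date < '2008-05-01'
    ORDER BY date;
```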

Matthew

--
Me... a skeptic?  I trust you have proof?

Re: Question about disk IO and index use and seeking advice

From:
PFC
Date:
> Our ~600,000,000
> row table is changed very infrequently and is on a 12 disk software
> raid-6
> for historical reasons using an  LSI Logic / Symbios Logic SAS1068 PCI-X
> Fusion-MPT SAS  Our ~50,000,000 row staging table is on a 12 disk
> hardware
> raid-10 using a Dell PowerEdge Expandable RAID controller 5.

> So my disk IO and index question.  When I issue a query on the big table
> like this:
> SELECT    column, count(*)
> FROM    bigtable
> GROUP BY column
> ORDER BY count DESC
> When I run dstat to see my disk IO I see the software raid-6 consistently
> holding over 70M/sec.  This is fine with me, but I generally don't like
> to
> do queries that table scan 600,000,000 rows.  So I do:

    Note that RAID 5 or 6 is fine when reading; it's the small random writes
that kill it.
    Is the table being inserted to while you run this query, which will
generate small random writes for the index updates ?
    Or is the table only inserted to during the nightly cron job ?

    70 MB/s seems to me quite close to what a single SATA disk could do these
days.
    My software RAID 5 saturates the PCI bus in the machine and pushes more
than 120 MB/s.
    You have PCI-X and 12 disks so you should get huuuuge disk throughput,
really mindboggling figures, not 70 MB/s.
    Since this seems a high-budget system perhaps a good fast hardware RAID ?
    Or perhaps this test was performed under heavy load and it is actually a
good result.


> All of the
> rows in the staging table are changed at least once and then deleted and
> recreated in the bigger table.  All of the staging table's indexes are on
> the raid-10.  The postgres data directory itself is on the raid-6.  I
> think
> all the disks are SATA 10Ks. The setup is kind of a beast.
>
> SELECT    column, count(*)
> FROM    bigtable
> WHERE date > '4-24-08'
> GROUP BY column
> ORDER BY count DESC
> When I run dstat I see only around 2M/sec and it is not consistent at
> all.
>
> So my question is, why do I see such low IO load on the index scan
> version?

    First, it is probably choosing a bitmap index scan, which means it needs
to grab lots of pages from the index. If your index is fragmented, just
scanning the index could take a long time.
    Then, it is probably taking lots of random bites in the table data.
    If this is an archive table, the dates should be increasing sequentially.
If this is not the case you will get random IO which is rather bad on huge
data sets.

    So.

    If you need the rows to be grouped on-disk by date (or perhaps another
field if you more frequently run other types of query, like grouping by
category, or perhaps something else, you decide) :

    The painful thing will be to reorder the table, either
    - use CLUSTER
    - or recreate a table and INSERT INTO it ORDER BY the field you chose.
This is going to take a while; set work_mem (formerly sort_mem) to a large
value. Then create the indexes.
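A sketch of the rebuild-and-sort route (names are illustrative; check the syntax against the 8.2 docs before trying it on real data):

```sql
-- Give this session plenty of sort memory; the value is just an example.
SET work_mem = '512MB';

-- Rebuild the table in date order, then index and analyze it.
CREATE TABLE bigtable_sorted AS
    SELECT * FROM bigtable ORDER BY date;
CREATE INDEX bigtable_sorted_date_idx ON bigtable_sorted (date);
ANALYZE bigtable_sorted;

-- Swap the tables in one short transaction once you're satisfied.
BEGIN;
ALTER TABLE bigtable RENAME TO bigtable_old;
ALTER TABLE bigtable_sorted RENAME TO bigtable;
COMMIT;
```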

    Then every time you insert data in the archive, be sure to insert it in
big batches, ORDER BY the field you chose. That way new inserts will be
still in the order you want.

    While you're at it you might think about partitioning the monster on a
useful criterion (this depends on your querying).

> If I could tweak some setting to make more aggressive use of IO, would it
> actually make the query faster?  The field I'm scanning has a .960858
> correlation, but I haven't vacuumed since importing any of the data that

    Have you ANALYZEd, at least ?
    Because if you didn't, and an index scan (not bitmap) comes up on this
kind of query and it does a million index hits, you have a problem.

> I'm
> scanning, though the correlation should remain very high.  When I do a
> similar set of queries on the hardware raid I see similar performance
> except  the numbers are both more than doubled.
>
> Here is the explain output for the queries:
> SELECT    column, count(*)
> FROM    bigtable
> GROUP BY column
> ORDER BY count DESC
> "Sort  (cost=74404440.58..74404444.53 rows=1581 width=10)"
> "  Sort Key: count(*)"
> "  ->  HashAggregate  (cost=74404336.81..74404356.58 rows=1581 width=10)"
> "        ->  Seq Scan on bigtable (cost=0.00..71422407.21 rows=596385921
> width=10)"

    Plan is OK (nothing else to do really)

> ---------------
> SELECT    column, count(*)
> FROM    bigtable
> WHERE date > '4-24-08'
> GROUP BY column
> ORDER BY count DESC
> "Sort  (cost=16948.80..16948.81 rows=1 width=10)"
> "  Sort Key: count(*)"
> "  ->  HashAggregate  (cost=16948.78..16948.79 rows=1 width=10)"
> "        ->  Index Scan using date_idx on bigtable (cost=0.00..16652.77
> rows=59201 width=10)"
> "              Index Cond: (date > '2008-04-21 00:00:00'::timestamp
> without
> time zone)"

    Argh.
    So you got an index scan after all.
    Is the 59201 rows estimate right ? If it is 10 times that you really have
a problem.
    Is it ANALYZEd ?

> So now the asking for advice part.  I have two questions:
> What is the fastest way to copy data from the smaller table to the larger
> table?

    INSERT INTO SELECT FROM (add ORDER BY to taste)

> We plan to rearrange the setup when we move to Postgres 8.3.  We'll
> probably
> move all the storage over to a SAN and slice the larger table into
> monthly
> or weekly tables.  Can someone point me to a good page on partitioning?
> My
> gut tells me it should be better, but I'd like to learn more about why.

    Because in your case, records having the dates you want will be in 1
partition (or 2), so you get a kind of automatic CLUSTER. For instance if
you do your query on last week's data, it will seq scan last week's
partition (which will be a much more manageable size) and not even look at
the others.
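That pruning relies on constraint exclusion, which is off by default in 8.2; a sketch, using the thread's placeholder names (`column` stands in for a real column name):

```sql
-- Let the planner use the partitions' CHECK constraints to skip them.
SET constraint_exclusion = on;

-- With monthly partitions, only the partition(s) covering the date range
-- should appear in the plan.
EXPLAIN
SELECT column, count(*)
FROM bigtable
WHERE date > '2008-04-21'
GROUP BY column
ORDER BY count DESC;
```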

Matthew said :
> You could possibly not bother with a staging table, and replace the mass
> copy with making a new partition. Not sure of the details myself though.

    Yes you could do that.
    When a partition ceases to be actively updated, though, you should
CLUSTER it so it is really tight and fast.
    CLUSTER on a partition which has a week's worth of data will obviously be
much faster than CLUSTERing your monster archive.

Re: Question about disk IO and index use and seeking advice

From:
"Nikolas Everett"
Date:


On Thu, Apr 24, 2008 at 12:56 PM, PFC <lists@peufeu.com> wrote:
Both Matthew and PFC, thanks for the response.

It turns out that the DB really loves to do index scans when I check new data because I haven't had a chance to analyze it yet.  It should be doing a bitmap index scan and a bitmap heap scan.  I think.  Doing a quick "set enable_indexscan = false" and doing a different date range really helped things.  Here is my understanding of the situation:

An index scan looks through the index and pulls in each page as it sees it.
A bitmap index scan looks through the index and makes a sorted list of all the pages it needs and then the bitmap heap scan reads all the pages.
If your data is scattered then you may as well do the index scan, but if your data is sequential-ish then you should do the bitmap index scan.

Is that right?  Where can I learn more?  I've read http://www.postgresql.org/docs/8.2/interactive/using-explain.html but it didn't really dive deeply enough.  I'd like a list of all the options the query planner has and what they mean.
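For reference, the quick experiment mentioned above looks like this in a psql session; `enable_indexscan` is a per-session planner hint, not a permanent setting (and `column` is the thread's placeholder name):

```sql
SET enable_indexscan = false;   -- discourage plain index scans for this session

EXPLAIN
SELECT column, count(*)
FROM bigtable
WHERE date > '4-24-08'
GROUP BY column
ORDER BY count DESC;

RESET enable_indexscan;         -- back to the default
```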


About clustering:  I know that CLUSTER takes an exclusive lock on the table.  At present, users can query the table at any time, so I'm not allowed to take an exclusive lock for more than a few seconds.  Could I achieve the same thing by creating a second copy of the table and then swapping the first copy out for the second?  I think something like that would fit in my time frames.


About partitioning:  I can definitely see how having the data in more manageable chunks would allow me to do things like clustering.  It will definitely make vacuuming easier.

About IO speeds:  The db is always under some kind of load.  I actually get scared if the load average isn't at least 2.  Could I try to run something like bonnie++ to get some real load numbers?  I'm sure that would cripple the system while it is running, but if it only takes a few seconds that would be ok.

There were updates running while I was running the test.  The WAL log is on the hardware raid 10.  Moving it from the software raid 5 almost doubled our insert performance.

Thanks again,

--Nik

Re: Question about disk IO and index use and seeking advice

From:
PFC
Date:
> An index scan looks through the index and pulls in each page as it sees
> it.
> A bitmap index scan looks through the index and makes a sorted list of
> all
> the pages it needs and then the bitmap heap scan reads all the pages.
> If your data is scattered then you may as well do the index scan, but if
> your data is sequential-ish then you should do the bitmap index scan.
>
> Is that right?  Where can I learn more?  I've read

    That's about it, yes.
    If your bitmap has large holes, it will seek, but if it has little holes,
readahead will work. Hence, fast, and good.
    On an index scan, readahead doesn't help since the hits are pretty random. If
you have N rows in the index with the same date, in which order will they
get scanned ? There is no way to know that, and no way to be sure this
order corresponds to physical order on disk.

> About clustering:  I know that CLUSTER takes an exclusive lock on the
> table.  At present, users can query the table at any time, so I'm not
> allowed to take an exclusive lock for more than a few seconds.

    Then, CLUSTER is out.

> Could I
> achieve the same thing by creating a second copy of the table and then
> swapping the first copy out for the second?  I think something like that
> would fit in my time frames

    If the archive table is read-only, then yes, you can do this.
> About partitioning:  I can definitely see how having the data in more
> manageable chunks would allow me to do things like clustering.  It will
> definitely make vacuuming easier.
>
> About IO speeds:  The db is always under some kind of load.  I actually
> get
> scared if the load average isn't at least 2.  Could I try to run
> something
> like bonnie++ to get some real load numbers?  I'm sure that would cripple
> the system while it is running, but if it only takes a few seconds that
> would be ok.
>
> There were updates running while I was running the test.  The WAL log is
> on
> the hardware raid 10.  Moving it from the software raid 5 almost doubled
> our
> insert performance.

    Normal; fsync on a RAID 5/6 is bad, bad.
    Do you have a battery-backed write cache ?

> Thanks again,
>
> --Nik