Discussion: very very slow inserts into very large table

very very slow inserts into very large table

From: Jon Nelson
Date:
I have a single *table* that is some 560GB in size, 6 columns, average
row width 63.
There are approximately 6.1 billion rows.
It has 4 indices, 131, 313, 131 and 190 GB in size, respectively. All
are btree indices.
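
For reference, figures like these can be gathered with PostgreSQL's
size functions; 'big_table' below is just a placeholder for the real
relation name:

    -- Heap size plus total index size for the table.
    SELECT pg_size_pretty(pg_relation_size('big_table')) AS heap_size,
           pg_size_pretty(pg_indexes_size('big_table'))  AS total_index_size;

    -- Size of each individual index.
    SELECT indexrelname,
           pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
    FROM pg_stat_user_indexes
    WHERE relname = 'big_table';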

I tried inserting new data into the table, and it's taking a *very* long time.
I pre-built the data to be inserted into a temporary table with the
exact same structure and column ordering, etc, and the temporary table
is about 8.5GB in size with about 93 million rows.
The temporary table was built in about 95 seconds.
The insert has been going for 47 hours and 21 minutes, give or take.
I'm not doing any correlation or filtering, etc --  straight up
insert, literally "insert into big_table select * from
the_temp_table;".

vmstat output doesn't seem that useful, with disk wait being 10-15%
and I/O speeds highly variable, from 5-20MB/s reads coupled with
0-16MB/s writes, generally on the lower end of these ranges.
strace of the inserting process shows that it's basically hammering
the disk with random reads and infrequent writes, all of it from
postgresql. It's not verifying, rebuilding, etc. While this process is
active, streaming write I/O is terrible - 36MB/s. With it "paused"
(via strace) I get 72MB/s (reads are 350MB/s).

The OS is Scientific Linux 6.2, and the version of postgresql is 9.1.4
- x86_64. There is nothing else of note happening on the box. The box
is a quad-CPU (dual-core each) Xeon E5430 @ 2.66GHz with 32GB of RAM
and a 3ware 9690 RAID controller with a 4TB RAID10 for the storage.

What might be going on here?


--
Jon

Re: very very slow inserts into very large table

From: Mark Thornton
Date:
On 16/07/12 14:37, Jon Nelson wrote:
> I have a single *table* that is some 560GB in size, 6 columns, average
> row width 63.
> There are approximately 6.1 billion rows.
> It has 4 indices, 131, 313, 131 and 190 GB in size, respectively. All
> are btree indices.
>
> I tried inserting new data into the table, and it's taking a *very* long time.
> I pre-built the data to be inserted into a temporary table with the
> exact same structure and column ordering, etc, and the temporary table
> is about 8.5GB in size with about 93 million rows.
> The temporary table was built in about 95 seconds.
> The insert has been going for 47 hours and 21 minutes, give or take.
> I'm not doing any correlation or filtering, etc --  straight up
> insert, literally "insert into big_table select * from
> the_temp_table;".
>
> vmstat output doesn't seem that useful, with disk wait being 10-15%
> and I/O speeds highly variable, from 5-20MB/s reads coupled with
> 0-16MB/s writes, generally on the lower end of these ranges.
> strace of the inserting process shows that it's basically hammering
> the disk with random reads and infrequent writes, all of it from
> postgresql. It's not verifying, rebuilding, etc. While this process is
> active, streaming write I/O is terrible - 36MB/s. With it "paused"
> (via strace) I get 72MB/s (reads are 350MB/s).
>
> The OS is Scientific Linux 6.2, and the version of postgresql is 9.1.4
> - x86_64. There is nothing else of note happening on the box. The box
> is a quad-CPU (dual-core each) Xeon E5430 @ 2.66GHz with 32GB of RAM
> and a 3ware 9690 RAID controller with a 4TB RAID10 for the storage.
>
> What might be going on here?
>
>
Every insert updates four indexes, so at least 3 of those will be in
random order. The indexes don't fit in memory, so all those updates will
involve reading most of the relevant b-tree pages from disk (or at least
the leaf level). A total of 10ms of random read from disk (per inserted
row) wouldn't surprise me ... which adds up to more than 10 days for
your 93 million rows.
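
Spelled out, taking the 10ms-per-row figure as the assumption:

    -- 93 million rows at ~10ms of random read each, expressed in days.
    SELECT 93000000 * 0.010 / 86400.0 AS days;   -- roughly 10.8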

Mark Thornton

Re: very very slow inserts into very large table

From: Samuel Gendler
Date:
On Mon, Jul 16, 2012 at 7:06 AM, Mark Thornton <mthornton@optrak.com> wrote:

> Every insert updates four indexes, so at least 3 of those will be in random order. The indexes don't fit in memory, so all those updates will involve reading most of the relevant b-tree pages from disk (or at least the leaf level). A total of 10ms of random read from disk (per inserted row) wouldn't surprise me ... which adds up to more than 10 days for your 93 million rows.

Which is the long way of saying that you will likely benefit from partitioning that table into a number of smaller tables, especially if queries on that table tend to access only a subset of the data that can be defined to always fit into a smaller number of partitions than the total.  At the very least, inserts will be faster because individual indexes will be smaller.  And as long as all queries can be constrained to fit within a subset of partitions, you'll also see improved performance on selects.
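
On 9.1 that means inheritance-based partitioning. A minimal sketch with
hypothetical table and column names (a real setup would also add a
trigger or rule on the parent to route inserts):

    -- Hypothetical child partition keyed on a date column.
    CREATE TABLE big_table_2012_07 (
        CHECK (created_at >= DATE '2012-07-01'
           AND created_at <  DATE '2012-08-01')
    ) INHERITS (big_table);

    CREATE INDEX big_table_2012_07_created_at_idx
        ON big_table_2012_07 (created_at);

    -- Bulk loads can target the child directly:
    INSERT INTO big_table_2012_07 SELECT * FROM the_temp_table;

    -- With constraint_exclusion = partition (the default), queries that
    -- filter on created_at only scan the matching children.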

--sam


Re: very very slow inserts into very large table

From: Jon Nelson
Date:
On Mon, Jul 16, 2012 at 12:35 PM, Samuel Gendler
<sgendler@ideasculptor.com> wrote:
> On Mon, Jul 16, 2012 at 7:06 AM, Mark Thornton <mthornton@optrak.com> wrote:
>>>
>>>
>> Every insert updates four indexes, so at least 3 of those will be in
>> random order. The indexes don't fit in memory, so all those updates will
>> involve reading most of the relevant b-tree pages from disk (or at least the
>> leaf level). A total of 10ms of random read from disk (per inserted row)
>> wouldn't surprise me ... which adds up to more than 10 days for your 93
>> million rows.
>
>
> Which is the long way of saying that you will likely benefit from
> partitioning that table into a number of smaller tables, especially if
> queries on that table tend to access only a subset of the data that can be
> defined to always fit into a smaller number of partitions than the total.
> At the very least, inserts will be faster because individual indexes will be
> smaller.  And as long as all queries can be constrained to fit within a subset
> of partitions, you'll also see improved performance on selects.

Acknowledged. My data is actually partitioned into individual tables,
but this was an experiment to see what the performance was like. I was
expecting that effectively appending all of the individual tables into
a great big table would result in less redundant information being
stored in indices and, therefore, a bit more speed and efficiency.
However, I have to admit I was very surprised at the performance
reduction.

What is the greater lesson to take away, here? If you are working with
data that is larger (substantially larger) than available memory, is
the architecture and design of postgresql such that the only real
approach is some type of data partitioning? It is not my intent to
insult or even disparage my favorite software, but it took less time
to *build* the indices for 550GB of data than it would have to insert
1/20th as much. That doesn't seem right.

--
Jon

Re: very very slow inserts into very large table

From: Jeff Janes
Date:
On Mon, Jul 16, 2012 at 10:35 AM, Samuel Gendler
<sgendler@ideasculptor.com> wrote:
> On Mon, Jul 16, 2012 at 7:06 AM, Mark Thornton <mthornton@optrak.com> wrote:
>>>
>>>
>> Every insert updates four indexes, so at least 3 of those will be in
>> random order. The indexes don't fit in memory, so all those updates will
>> involve reading most of the relevant b-tree pages from disk (or at least the
>> leaf level). A total of 10ms of random read from disk (per inserted row)
>> wouldn't surprise me ... which adds up to more than 10 days for your 93
>> million rows.
>
>
> Which is the long way of saying that you will likely benefit from
> partitioning that table into a number of smaller tables, especially if
> queries on that table tend to access only a subset of the data that can be
> defined to always fit into a smaller number of partitions than the total.
> At the very least, inserts will be faster because individual indexes will be
> smaller.

If the select locality and the insert locality are not the same, and
the table is partitioned according to the select locality, then the
total index size that needs to be accessed during the inserts will be
slightly larger, not smaller, under the partitioning, and the inserts
will not perform well.

On the other hand, if the select locality and the insert locality are
the same, it should be possible to change the index definitions in a
way that gets all the gains of your described partitioning, without
actually doing the partitioning.
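
As a hypothetical illustration: if both the inserts and the common
queries cluster on, say, a date column, leading each index with that
column keeps new entries confined to a small, recently-touched region
of each btree instead of scattering them across it:

    -- Hypothetical: 'log_date' is the column both inserts and selects cluster on.
    CREATE INDEX big_table_date_account_idx ON big_table (log_date, account_id);
    -- rather than an index led by a non-clustered column alone:
    -- CREATE INDEX big_table_account_idx ON big_table (account_id);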

> But as long as all queries can be constrained to fit within a subset
> of partitions, you'll also see improved performance on selects.

When you can't constrain the queries to fit within a subset of the
partitions is where I see a possible win from partitioning that can't
be obtained other ways.  By using partitioning, you can greatly
increase the insert performance by imposing a small cost on each
query.  The total cost is at least as great, but you have re-arranged
how the cost is amortized into a more acceptable shape.

Cheers,

Jeff

Re: very very slow inserts into very large table

From: Mark Thornton
Date:
On 16/07/12 18:56, Jon Nelson wrote:
> It is not my intent to
> insult or even disparage my favorite software, but it took less time
> to *build* the indices for 550GB of data than it would have to insert
> 1/20th as much. That doesn't seem right.
My explanation would apply to many databases, not just Postgres.

To speed up the insert there are a number of possible approaches:

1. Partition the data and then sort the temporary table into groups based on the partitioning. Best of all if all the new data goes into a single partition.

2. Drop the indexes before insert and rebuild afterwards (a sketch of this follows the list).

3. Reduce the number of indexes. If you only have one index, you can sort the data to be inserted in the natural order of the index. If you must have more than one index you could still sort the new data in the order of one of them to obtain a modest improvement in locality.

4. The most efficient way for the database itself to do the updates would be to first insert all the data in the table, and then update each index in turn having first sorted the inserted keys in the appropriate order for that index.
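
A sketch of approach 2, with placeholder index names and column lists
(practical only if the table can do without its indexes for the
duration of the load and rebuild):

    DROP INDEX big_table_col1_idx, big_table_col2_idx,
               big_table_col3_idx, big_table_col4_idx;

    INSERT INTO big_table SELECT * FROM the_temp_table;

    CREATE INDEX big_table_col1_idx ON big_table (col1);
    CREATE INDEX big_table_col2_idx ON big_table (col2);
    -- ...and so on for the remaining indexes.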

Mark


Re: very very slow inserts into very large table

From: Claudio Freire
Date:
On Mon, Jul 16, 2012 at 3:59 PM, Mark Thornton <mthornton@optrak.com> wrote:
> 4. The most efficient way for the database itself to do the updates would be
> to first insert all the data in the table, and then update each index in
> turn having first sorted the inserted keys in the appropriate order for that
> index.

Actually, it should create a temporary btree index and merge[0] them.
It's only worth doing if there are really a lot of rows.

[0] http://www.ccs.neu.edu/home/bradrui/index_files/parareorg.pdf

Re: very very slow inserts into very large table

From: Mark Thornton
Date:
On 16/07/12 20:08, Claudio Freire wrote:
> On Mon, Jul 16, 2012 at 3:59 PM, Mark Thornton <mthornton@optrak.com> wrote:
>> 4. The most efficient way for the database itself to do the updates would be
>> to first insert all the data in the table, and then update each index in
>> turn having first sorted the inserted keys in the appropriate order for that
>> index.
> Actually, it should create a temporary btree index and merge[0] them.
> It's only worth doing if there are really a lot of rows.
>
> [0] http://www.ccs.neu.edu/home/bradrui/index_files/parareorg.pdf
I think 93 million would qualify as a lot of rows. However, does any
available database (commercial or open source) use this optimisation?

Mark



Re: very very slow inserts into very large table

From: Claudio Freire
Date:
On Mon, Jul 16, 2012 at 4:16 PM, Mark Thornton <mthornton@optrak.com> wrote:
>> Actually, it should create a temporary btree index and merge[0] them.
>> It's only worth doing if there are really a lot of rows.
>>
>> [0] http://www.ccs.neu.edu/home/bradrui/index_files/parareorg.pdf
>
> I think 93 million would qualify as a lot of rows. However, does any
> available database (commercial or open source) use this optimisation?

Databases, I honestly don't know. But I do know most document
retrieval engines use a similar technique with inverted indexes.

Re: very very slow inserts into very large table

From: Craig Ringer
Date:
On 07/17/2012 01:56 AM, Jon Nelson wrote:
> What is the greater lesson to take away, here? If you are working with
> data that is larger (substantially larger) than available memory, is
> the architecture and design of postgresql such that the only real
> approach is some type of data partitioning? It is not my intent to
> insult or even disparage my favorite software, but it took less time
> to *build* the indices for 550GB of data than it would have to insert
> 1/20th as much. That doesn't seem right.

To perform reasonably well, Pg would need to be able to defer index
updates when bulk-loading data in a single statement (or even
transaction), then apply them when the statement finished or transaction
committed. Doing this at a transaction level would mean you'd need a way
to mark indexes as 'lazily updated' and have Pg avoid using them once
they'd been dirtied within a transaction. No such support currently
exists, and it'd be non-trivial to implement, especially since people
loading huge amounts of data often want to do it with multiple
concurrent sessions. You'd need some kind of 'DISABLE INDEX' and 'ENABLE
INDEX' commands plus a transactional backing table of pending index updates.

Not simple.


Right now, Pg is trying to keep the index consistent the whole time.
That involves moving a heck of a lot of data around - repeatedly.

Setting a lower FILLFACTOR on your indexes can give Pg some breathing
room here, but only a limited amount, and at the cost of reduced scan
efficiency.
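
Something like the following, with a placeholder index name; a changed
fillfactor only takes full effect on existing pages after a rebuild:

    ALTER INDEX big_table_col1_idx SET (fillfactor = 70);
    -- Rebuild so the new fillfactor applies throughout the index.
    REINDEX INDEX big_table_col1_idx;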

--
Craig Ringer

Re: very very slow inserts into very large table

From: Satoshi Nagayasu
Date:
2012/07/16 22:37, Jon Nelson wrote:
> I have a single *table* that is some 560GB in size, 6 columns, average
> row width 63.
> There are approximately 6.1 billion rows.
> It has 4 indices, 131, 313, 131 and 190 GB in size, respectively. All
> are btree indices.
>
> I tried inserting new data into the table, and it's taking a *very* long time.
> I pre-built the data to be inserted into a temporary table with the
> exact same structure and column ordering, etc, and the temporary table
> is about 8.5GB in size with about 93 million rows.
> The temporary table was built in about 95 seconds.
> The insert has been going for 47 hours and 21 minutes, give or take.
> I'm not doing any correlation or filtering, etc --  straight up
> insert, literally "insert into big_table select * from
> the_temp_table;".
>
> vmstat output doesn't seem that useful, with disk wait being 10-15%
> and I/O speeds highly variable, from 5-20MB/s reads coupled with
> 0-16MB/s writes, generally on the lower end of these ranges.
> strace of the inserting process shows that it's basically hammering
> the disk with random reads and infrequent writes, all of it from
> postgresql. It's not verifying, rebuilding, etc. While this process is
> active, streaming write I/O is terrible - 36MB/s. With it "paused"
> (via strace) I get 72MB/s (reads are 350MB/s).

I think the most likely cause lies somewhere around WAL and its
buffers.

But that's just my guess, and you need to determine the cause of the
situation precisely. Disk I/O operations need to be broken down into
their PostgreSQL context, such as block reads, WAL writes or bgwriter
activity.

If you want to know what's actually going on inside PostgreSQL,
pgstatview may help you with that.

http://pgsnaga.blogspot.jp/2012/06/pgstatview-visualize-your-postgresql-in.html
http://www2.uptimeforce.com/pgstatview/

pgstatview provides an easy way not only to visualize your performance
statistics under a workload, but also to share them with PostgreSQL
experts.

Here is an example of the report:
http://www2.uptimeforce.com/pgstatview/a9ee29aa84668cca2d8cdfd2556d370c/

I believe you can gain some insight by visualizing and comparing your
statistics between the temp table case and the regular table case.

Regards,

>
> The OS is Scientific Linux 6.2, and the version of postgresql is 9.1.4
> - x86_64. There is nothing else of note happening on the box. The box
> is a quad-CPU (dual-core each) Xeon E5430 @ 2.66GHz with 32GB of RAM
> and a 3ware 9690 RAID controller with a 4TB RAID10 for the storage.
>
> What might be going on here?
>
>

--
Satoshi Nagayasu <snaga@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp


Re: very very slow inserts into very large table

From: Ants Aasma
Date:
On Tue, Jul 17, 2012 at 6:30 AM, Craig Ringer <ringerc@ringerc.id.au> wrote:
> On 07/17/2012 01:56 AM, Jon Nelson wrote:
> To perform reasonably well, Pg would need to be able to defer index updates
> when bulk-loading data in a single statement (or even transaction), then
> apply them when the statement finished or transaction committed. Doing this
> at a transaction level would mean you'd need a way to mark indexes as
> 'lazily updated' and have Pg avoid using them once they'd been dirtied
> within a transaction. No such support currently exists, and it'd be
> non-trivial to implement, especially since people loading huge amounts of
> data often want to do it with multiple concurrent sessions. You'd need some
> kind of 'DISABLE INDEX' and 'ENABLE INDEX' commands plus a transactional
> backing table of pending index updates.

It seems to me that if the insertion is done as a single statement it
wouldn't be a problem to collect up all btree insertions and apply
them before completing the statement. I'm not sure how much that would
help though. If the new rows have uniform distribution you end up
reading in the whole index anyway. Because indexes are not stored in
logical order you don't get to benefit from sequential I/O.

The lazy merging approach (the paper that Claudio linked) on the other
hand seems promising but a lot trickier to implement.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

Re: very very slow inserts into very large table

From: Jeff Janes
Date:
On Tue, Jul 17, 2012 at 8:59 AM, Ants Aasma <ants@cybertec.at> wrote:
> On Tue, Jul 17, 2012 at 6:30 AM, Craig Ringer <ringerc@ringerc.id.au> wrote:
>> On 07/17/2012 01:56 AM, Jon Nelson wrote:
>> To perform reasonably well, Pg would need to be able to defer index updates
>> when bulk-loading data in a single statement (or even transaction), then
>> apply them when the statement finished or transaction committed. Doing this
>> at a transaction level would mean you'd need a way to mark indexes as
>> 'lazily updated' and have Pg avoid using them once they'd been dirtied
>> within a transaction. No such support currently exists, and it'd be
>> non-trivial to implement, especially since people loading huge amounts of
>> data often want to do it with multiple concurrent sessions. You'd need some
>> kind of 'DISABLE INDEX' and 'ENABLE INDEX' commands plus a transactional
>> backing table of pending index updates.
>
> It seems to me that if the insertion is done as a single statement it
> wouldn't be a problem to collect up all btree insertions and apply
> them before completing the statement. I'm not sure how much that would
> help though. If the new rows have uniform distribution you end up
> reading in the whole index anyway. Because indexes are not stored in
> logical order you don't get to benefit from sequential I/O.

In this case, he is loading new data that is 5% of the current data
size.  A leaf page probably has much more than 20 entries, so by
sorting them you could turn many scattered accesses to the same page
into one access (or many accesses that immediately follow each other,
and so are satisfied by the cache).

Also, while indexes are not formally kept in logical order, they do
tend to be biased in that direction in most cases.  I've found that
even if you are only inserting one row for every 4 or 5 leaf pages,
you still get a substantial improvement by doing so in sorted order.
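
A sketch of that, assuming hypothetical columns col1 and col2 that lead
one of the four indexes:

    -- Sort the incoming rows in the key order of one index to improve
    -- the locality of that index's updates.
    INSERT INTO big_table
    SELECT * FROM the_temp_table
    ORDER BY col1, col2;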

Cheers,

Jeff

Re: very very slow inserts into very large table

От
Claudio Freire
Дата:
On Tue, Jul 17, 2012 at 1:24 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> Also, while indexes are not formally kept in logical order, but they
> do tend to be biased in that direction in most cases.  I've found that
> even if you are only inserting one row for every 4 or 5 leaf pages,
> you still get substantial improvement by doing so in sorted order.

Yep, I do the same. Whenever I have to perform massive updates, I sort them.

Although "massive" for me is nowhere near what "massive" for the OP is.