Discussion: Inserts in 'big' table slowing down the database

Inserts in 'big' table slowing down the database

From: Stefan Keller
Hi,

I'm having performance issues with a simple table containing 'Nodes'
(points) from OpenStreetMap:

  CREATE TABLE nodes (
      id bigint PRIMARY KEY,
      user_name text NOT NULL,
      tstamp timestamp without time zone NOT NULL,
      geom GEOMETRY(POINT, 4326)
  );
  CREATE INDEX idx_nodes_geom ON nodes USING gist (geom);

The number of rows grows steadily and will soon reach one billion
(1'000'000'000), hence the bigint id.
Now, hourly inserts (plus updates and deletes) are steadily slowing the
database (PostgreSQL 9.1) down.
Before I look at non-durable settings [1], I'd like to know what choices
I have for tuning while keeping the database available: cluster the index?
partition the table? use tablespaces? reduce the physical block size?

Stefan

[1] http://www.postgresql.org/docs/9.1/static/non-durability.html


Re: Inserts in 'big' table slowing down the database

From: Ivan Voras
On 03/09/2012 13:03, Stefan Keller wrote:
> Hi,
>
> I'm having performance issues with a simple table containing 'Nodes'
> (points) from OpenStreetMap:
>
>   CREATE TABLE nodes (
>       id bigint PRIMARY KEY,
>       user_name text NOT NULL,
>       tstamp timestamp without time zone NOT NULL,
>       geom GEOMETRY(POINT, 4326)
>   );
>   CREATE INDEX idx_nodes_geom ON nodes USING gist (geom);
>
> The number of rows grows steadily and will soon reach one billion
> (1'000'000'000), hence the bigint id.
> Now, hourly inserts (plus updates and deletes) are steadily slowing the
> database (PostgreSQL 9.1) down.
> Before I look at non-durable settings [1], I'd like to know what choices
> I have for tuning while keeping the database available: cluster the index?
> partition the table? use tablespaces? reduce the physical block size?

You need to describe in detail what "slowing down" means in your case.
Do the disk drives somehow do more operations per transaction? Does the
database use more CPU cycles? Is there swapping? What is the expected
(previous) performance?

At a guess, it is very unlikely that using non-durable settings will
help you here.



Re: Inserts in 'big' table slowing down the database

From: Stefan Keller
Sorry for the delay. I had to sort out the problem (among other things).

It's mainly about swapping.

The table nodes contains about 2^31 entries and occupies about 80 GB of
disk space, plus indexes.
If one were to store the geom values in a big array (with id as the
array index), it would take only about 16 GB, which means the ids are
dense (with few deletes).
Updates come in every hour as bulk insert statements, with ids in sorted
order.
Now PG becomes slower and slower!
CLUSTER could help, but that operation needs a table lock, and if it
takes longer than an hour it delays the next update.
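
Concretely, each hourly batch presumably looks something like the sketch
below (the incoming_batch staging table is hypothetical; 9.1 has no
native upsert, so updates arrive as delete-then-insert):

  BEGIN;
  -- Remove the rows that are about to be replaced (the "updates").
  DELETE FROM nodes WHERE id IN (SELECT id FROM incoming_batch);
  -- Insert new and replacement rows in id order, as described above.
  INSERT INTO nodes SELECT * FROM incoming_batch ORDER BY id;
  COMMIT;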

Any ideas? Partitioning?

Yours, S.

2012/9/3 Ivan Voras <ivoras@freebsd.org>:
> On 03/09/2012 13:03, Stefan Keller wrote:
>> Hi,
>>
>> I'm having performance issues with a simple table containing 'Nodes'
>> (points) from OpenStreetMap:
>>
>>   CREATE TABLE nodes (
>>       id bigint PRIMARY KEY,
>>       user_name text NOT NULL,
>>       tstamp timestamp without time zone NOT NULL,
>>       geom GEOMETRY(POINT, 4326)
>>   );
>>   CREATE INDEX idx_nodes_geom ON nodes USING gist (geom);
>>
>> The number of rows grows steadily and will soon reach one billion
>> (1'000'000'000), hence the bigint id.
>> Now, hourly inserts (plus updates and deletes) are steadily slowing the
>> database (PostgreSQL 9.1) down.
>> Before I look at non-durable settings [1], I'd like to know what choices
>> I have for tuning while keeping the database available: cluster the index?
>> partition the table? use tablespaces? reduce the physical block size?
>
> You need to describe in detail what "slowing down" means in your case.
> Do the disk drives somehow do more operations per transaction? Does the
> database use more CPU cycles? Is there swapping? What is the expected
> (previous) performance?
>
> At a guess, it is very unlikely that using non-durable settings will
> help you here.
>


Re: Inserts in 'big' table slowing down the database

From: Greg Williamson
Stefan --


----- Original Message -----
> From: Stefan Keller <sfkeller@gmail.com>
> To: Ivan Voras <ivoras@freebsd.org>
> Cc: pgsql-performance@postgresql.org
> Sent: Monday, October 1, 2012 5:15 PM
> Subject: Re: [PERFORM] Inserts in 'big' table slowing down the database
>
> Sorry for the delay. I had to sort out the problem (among other things).
>
> It's mainly about swapping.
>
> The table nodes contains about 2^31 entries and occupies about 80 GB of
> disk space, plus indexes.
> If one were to store the geom values in a big array (with id as the
> array index), it would take only about 16 GB, which means the ids are
> dense (with few deletes).
> Updates come in every hour as bulk insert statements, with ids in sorted
> order.
> Now PG becomes slower and slower!
> CLUSTER could help, but that operation needs a table lock, and if it
> takes longer than an hour it delays the next update.
>
> Any ideas? Partitioning?


pg_reorg, if you have the space, might be useful for doing a CLUSTER-like action:
 <http://reorg.projects.postgresql.org/>
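
A command-line sketch of what that could look like (the database name
"osm" is made up, and the --table/--order-by options should be verified
against the installed pg_reorg version):

  # rebuild nodes online in id order; needs roughly 2x the table's disk space
  pg_reorg --table=nodes --order-by=id osm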

Haven't followed the thread so I hope this isn't redundant.

Partitioning might work if you can create partitions covering more than 1 hour of data -- too many partitions don't help.

Greg Williamson



Re: Inserts in 'big' table slowing down the database

From: Shaun Thomas
On 10/01/2012 07:15 PM, Stefan Keller wrote:

> Any ideas? Partitioning?

Yes. Make sure you have a good column to partition on. Tables this large
are just bad performers in general, and heaven forbid you ever have to
perform maintenance on them. We had a table that size, and simply
creating an index could take upwards of two hours.

If you can't archive any of the table contents, partitioning may be your
only solution. If you have an EDB 9.1, you'll also have fewer problems
with the legacy issues people had with planning queries using partitions.

Don't go crazy, though. I try to keep it under a couple dozen partitions
per table, or under 100M records per partition.
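
On 9.1, partitioning means inheritance plus CHECK constraints and an
insert trigger. A minimal sketch, assuming range partitions on id with
illustrative 100M-row boundaries (all names are made up):

  -- Child tables carve the id range into fixed-width slices.
  CREATE TABLE nodes_p0 (CHECK (id >= 0 AND id < 100000000)) INHERITS (nodes);
  CREATE TABLE nodes_p1 (CHECK (id >= 100000000 AND id < 200000000)) INHERITS (nodes);

  -- Indexes are not inherited; each child needs its own.
  CREATE INDEX idx_nodes_p0_geom ON nodes_p0 USING gist (geom);
  CREATE INDEX idx_nodes_p1_geom ON nodes_p1 USING gist (geom);

  -- Route inserts on the parent to the matching child.
  CREATE OR REPLACE FUNCTION nodes_insert_router() RETURNS trigger AS $$
  BEGIN
      IF NEW.id < 100000000 THEN
          INSERT INTO nodes_p0 VALUES (NEW.*);
      ELSE
          INSERT INTO nodes_p1 VALUES (NEW.*);
      END IF;
      RETURN NULL;  -- the row has already been stored in a child
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER nodes_insert_trigger
      BEFORE INSERT ON nodes
      FOR EACH ROW EXECUTE PROCEDURE nodes_insert_router();

With constraint_exclusion = partition (the default), queries filtering
on id can then skip partitions whose CHECK ranges cannot match.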

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com



Re: Inserts in 'big' table slowing down the database

From: Jeff Janes
On Mon, Oct 1, 2012 at 5:15 PM, Stefan Keller <sfkeller@gmail.com> wrote:
> Sorry for the delay. I had to sort out the problem (among other things).
>
> It's mainly about swapping.

Do you mean ordinary file IO, or swapping of an actual process's virtual
memory? The latter shouldn't happen much unless something is
misconfigured.

>
> The table nodes contains about 2^31 entries and occupies about 80 GB of
> disk space, plus indexes.

How big is each index?

If you reset the stats just before the bulk load, what do select *
from pg_statio_user_tables and select * from pg_statio_user_indexes
show after the bulk load?  What does vmstat show during the load?
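
Those checks might look like the following (hit counters rising while
the *_read counters stay low would mean the cache is working; the
reverse points at real disk traffic):

  -- Reset the counters just before the hourly load...
  SELECT pg_stat_reset();

  -- ...then compare buffer hits to disk reads after it finishes.
  SELECT relname, heap_blks_read, heap_blks_hit
    FROM pg_statio_user_tables WHERE relname = 'nodes';
  SELECT indexrelname, idx_blks_read, idx_blks_hit
    FROM pg_statio_user_indexes WHERE relname = 'nodes';

  -- Per-index sizes, for the "how big is each index" question above.
  SELECT indexrelname, pg_size_pretty(pg_relation_size(indexrelid))
    FROM pg_stat_user_indexes WHERE relname = 'nodes';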

> If one were to store the geom values in a big array (with id as the
> array index), it would take only about 16 GB, which means the ids are
> dense (with few deletes).
> Updates come in every hour as bulk insert statements, with ids in sorted
> order.

Is the problem that these operations themselves are too slow, or that
they slow down other operations when they are active?  If the main
problem is that it slows down other operations, what are they?

If the problem is the length of the bulk operations themselves, what
happens if you split them up into chunks and run them in parallel?

Do you have a test/dev/QA server?  How long does a bulk insert take
under the four conditions of both indexes (PK and geometry), neither
index, just one, or just the other?

> Now PG becomes slower and slower!
> CLUSTER could help, but that operation needs a table lock, and if it
> takes longer than an hour it delays the next update.

I don't see why a CLUSTER would help.  Your table is probably already
well clustered on the serial column, and clustering it on the geometry
instead probably wouldn't accomplish much.  One thing that might help
would be to stuff the data to be inserted into a scratch table, index
that on the geometry, cluster the scratch table, and then do the insert
into the main table from the scratch table.  That might result in the
geom values being inserted in a more cache-friendly order.
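
A sketch of that scratch-table idea, assuming the hourly rows can be
staged first (all names here are illustrative):

  -- Stage the batch with the same row shape as nodes.
  CREATE TEMP TABLE nodes_incoming (LIKE nodes);
  -- ...load the hourly batch into nodes_incoming (COPY, INSERT, etc.)...

  -- Index the small table on geometry and rewrite it in index order.
  CREATE INDEX idx_incoming_geom ON nodes_incoming USING gist (geom);
  CLUSTER nodes_incoming USING idx_incoming_geom;

  -- Rows now stream into the big table in a spatially coherent order.
  INSERT INTO nodes SELECT * FROM nodes_incoming;
  DROP TABLE nodes_incoming;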

> Any ideas? Partitioning?

Do most selects against this table specify user_name as well as a
geometry query?  If so, that might be a good partitioning key.
Otherwise, I don't see what you could partition on in a way that makes
things better.

Cheers,

Jeff