Thread: Inserts in 'big' table slowing down the database
Hi,

I'm having performance issues with a simple table containing 'Nodes'
(points) from OpenStreetMap:

CREATE TABLE nodes (
    id bigint PRIMARY KEY,
    user_name text NOT NULL,
    tstamp timestamp without time zone NOT NULL,
    geom GEOMETRY(POINT, 4326)
);
CREATE INDEX idx_nodes_geom ON nodes USING gist (geom);

The number of rows grows steadily and will soon reach one billion
(1'000'000'000), hence the bigint id.
The hourly inserts (plus updates and deletes) are now steadily slowing
the database (PostgreSQL 9.1) down.
Before I look at non-durable settings [1], I'd like to know what options
I have for tuning while keeping the database in production:
cluster the index? partition the table? use tablespaces? reduce the
physical block size?

Stefan

[1] http://www.postgresql.org/docs/9.1/static/non-durability.html
On 03/09/2012 13:03, Stefan Keller wrote:
> Hi,
>
> I'm having performance issues with a simple table containing 'Nodes'
> (points) from OpenStreetMap:
>
> CREATE TABLE nodes (
>     id bigint PRIMARY KEY,
>     user_name text NOT NULL,
>     tstamp timestamp without time zone NOT NULL,
>     geom GEOMETRY(POINT, 4326)
> );
> CREATE INDEX idx_nodes_geom ON nodes USING gist (geom);
>
> The number of rows grows steadily and soon reaches one billion
> (1'000'000'000), therefore the bigint id.
> Now, hourly inserts (update and deletes) are slowing down the database
> (PostgreSQL 9.1) constantly.
> Before I'm looking at non-durable settings [1] I'd like to know what
> choices I have to tune it while keeping the database productive:
> cluster index? partition table? use tablespaces? reduce physical block size?

You need to describe in detail what "slowing down" means in your case.
Do the disk drives somehow do more operations per transaction? Does the
database use more CPU cycles? Is there swapping? What was the expected
(previous) performance?

At a guess, it is very unlikely that using non-durable settings will
help you here.
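Some of these questions can be answered from PostgreSQL's own statistics
views rather than from the OS. A minimal sketch (view and column names are
from the stock 9.1 statistics catalogs; reset before a bulk load and query
afterwards to see how much I/O missed the buffer cache):

```sql
-- Reset the statistics counters so the next bulk load starts from zero.
SELECT pg_stat_reset();

-- After the load: how much table and index I/O hit shared buffers?
SELECT relname, heap_blks_read, heap_blks_hit, idx_blks_read, idx_blks_hit
  FROM pg_statio_user_tables
 WHERE relname = 'nodes';

-- Per-index physical reads vs. buffer hits.
SELECT indexrelname, idx_blks_read, idx_blks_hit
  FROM pg_statio_user_indexes
 WHERE relname = 'nodes';
```

A high ratio of `*_blks_read` to `*_blks_hit` during the hourly load would
point at the working set no longer fitting in cache.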
Sorry for the delay. I had to sort out the problem (among other things).

It's mainly about swapping.

The table nodes contains about 2^31 entries and occupies about 80GB of
disk space, plus indexes.
If one stored the geom values in a big array (with id as the array
index) it would only take about 16GB, which means the ids are dense
(with few deletes).
Updates then come in every hour as bulk INSERT statements whose entries
have ids in sorted order.
Now PG becomes slower and slower!
CLUSTER could help, but obviously that operation needs a table lock,
and if it takes longer than an hour it delays the next update.

Any ideas? Partitioning?

Yours, S.

2012/9/3 Ivan Voras <ivoras@freebsd.org>:
> On 03/09/2012 13:03, Stefan Keller wrote:
>> Hi,
>>
>> I'm having performance issues with a simple table containing 'Nodes'
>> (points) from OpenStreetMap:
>>
>> CREATE TABLE nodes (
>>     id bigint PRIMARY KEY,
>>     user_name text NOT NULL,
>>     tstamp timestamp without time zone NOT NULL,
>>     geom GEOMETRY(POINT, 4326)
>> );
>> CREATE INDEX idx_nodes_geom ON nodes USING gist (geom);
>>
>> The number of rows grows steadily and soon reaches one billion
>> (1'000'000'000), therefore the bigint id.
>> Now, hourly inserts (update and deletes) are slowing down the database
>> (PostgreSQL 9.1) constantly.
>> Before I'm looking at non-durable settings [1] I'd like to know what
>> choices I have to tune it while keeping the database productive:
>> cluster index? partition table? use tablespaces? reduce physical block size?
>
> You need to describe in detail what does "slowing down" mean in your
> case. Do the disk drives somehow do more operations per transaction?
> Does the database use more CPU cycles? Is there swapping? What is the
> expected (previous) performance?
>
> At a guess, it is very unlikely that using non-durable settings will
> help you here.
Stefan --

----- Original Message -----
> From: Stefan Keller <sfkeller@gmail.com>
> To: Ivan Voras <ivoras@freebsd.org>
> Cc: pgsql-performance@postgresql.org
> Sent: Monday, October 1, 2012 5:15 PM
> Subject: Re: [PERFORM] Inserts in 'big' table slowing down the database
>
> Sorry for the delay. I had to sort out the problem (among other things).
>
> It's mainly about swapping.
>
> The table nodes contains about 2^31 entries and occupies about 80GB on
> disk space plus index.
> If one would store the geom values in a big array (where id is the
> array index) it would only make up about 16GB, which means that the
> ids are dense (with few deletes).
> Then updates come in every hour as bulk insert statements with entries
> having ids in sorted manner.
> Now PG becomes slower and slower!
> CLUSTER could help - but obviously this operation needs a table lock.
> And if this operation takes longer than an hour, it delays the next
> update.
>
> Any ideas? Partitioning?

If you have the disk space, pg_reorg might be useful for doing a
CLUSTER-like action while keeping the table available:
<http://reorg.projects.postgresql.org/>

Haven't followed the thread closely, so I hope this isn't redundant.

Partitioning might work if you can make the partitions cover more than
one hour of data -- too many partitions don't help.

Greg Williamson
On 10/01/2012 07:15 PM, Stefan Keller wrote:
> Any ideas? Partitioning?

Yes. Make sure you have a good column to partition on. Tables this large
are just bad performers in general, and heaven forbid you ever have to
perform maintenance on them. We had a table that size, and simply
creating an index could take upwards of two hours.

If you can't archive any of the table contents, partitioning may be your
only solution. If you have an EDB 9.1, you'll also have fewer of the
legacy problems people had with planning queries against partitions.

Don't go crazy, though. I try to keep it under a couple dozen partitions
per table, or under 100M records per partition.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com

______________________________________________

See http://www.peak6.com/email_disclaimer/ for terms and conditions
related to this email
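For reference, on 9.1 "partitioning" means table inheritance plus CHECK
constraints and a routing trigger (there is no declarative partitioning
yet). A hypothetical sketch keyed on id ranges -- the table, index, and
function names are invented for illustration, and each child needs its
own indexes because inheritance does not copy them:

```sql
-- Parent table with the same columns as nodes; children inherit it.
CREATE TABLE nodes_part (LIKE nodes);

-- One child per 100M ids; the CHECK constraints let the planner
-- (with constraint_exclusion = partition) skip irrelevant children.
CREATE TABLE nodes_p0 (CHECK (id >= 0         AND id < 100000000)) INHERITS (nodes_part);
CREATE TABLE nodes_p1 (CHECK (id >= 100000000 AND id < 200000000)) INHERITS (nodes_part);
-- ... more children as the id range grows; each gets its own indexes:
CREATE INDEX nodes_p0_geom ON nodes_p0 USING gist (geom);
CREATE INDEX nodes_p1_geom ON nodes_p1 USING gist (geom);

-- Route inserts against the parent to the right child.
CREATE OR REPLACE FUNCTION nodes_insert_trigger() RETURNS trigger AS $$
BEGIN
    IF NEW.id < 100000000 THEN
        INSERT INTO nodes_p0 VALUES (NEW.*);
    ELSIF NEW.id < 200000000 THEN
        INSERT INTO nodes_p1 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'id % out of range', NEW.id;
    END IF;
    RETURN NULL;  -- row already stored in a child, skip the parent
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER nodes_part_insert BEFORE INSERT ON nodes_part
    FOR EACH ROW EXECUTE PROCEDURE nodes_insert_trigger();
```

The payoff is operational: indexes, VACUUM, and CLUSTER then run per
child instead of against one 80GB heap.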
On Mon, Oct 1, 2012 at 5:15 PM, Stefan Keller <sfkeller@gmail.com> wrote:
> Sorry for the delay. I had to sort out the problem (among other things).
>
> It's mainly about swapping.

Do you mean ordinary file IO? Or swapping of an actual process's virtual
memory? The latter shouldn't happen much unless you have something
mis-configured.

> The table nodes contains about 2^31 entries and occupies about 80GB on
> disk space plus index.

How big is each index? If you reset the stats just before the bulk load,
what do "select * from pg_statio_user_tables" and "select * from
pg_statio_user_indexes" show after the bulk load? What does vmstat show
during the load?

> If one would store the geom values in a big array (where id is the
> array index) it would only make up about 16GB, which means that the
> ids are dense (with few deletes).
> Then updates come in every hour as bulk insert statements with entries
> having ids in sorted manner.

Is the problem that these operations themselves are too slow, or that
they slow down other operations while they are active? If the main
problem is that they slow down other operations, which operations are
those? If the problem is the length of the bulk operations themselves,
what happens if you split them up into chunks and run them in parallel?

Do you have a test/dev/QA server? How long does a bulk insert take under
the four conditions: both indexes (PK and geometry), neither index, just
one, or just the other?

> Now PG becomes slower and slower!
> CLUSTER could help - but obviously this operation needs a table lock.
> And if this operation takes longer than an hour, it delays the next
> update.

I don't see why a CLUSTER would help. Your table is probably already
well clustered on the serial column, and clustering it on the geometry
instead probably wouldn't accomplish much.
One thing that might help would be to stuff the data to be inserted into
a scratch table, index that scratch table on the geometry, cluster it,
and then do the insert into the main table from the scratch table. That
might result in the geom values being inserted in a more cache-friendly
order.

> Any ideas? Partitioning?

Do most selects against this table specify user_name as well as a
geometry condition? If so, that might be a good partitioning key.
Otherwise, I don't see what you could partition on in a way that makes
things better.

Cheers,

Jeff
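The scratch-table idea above might look roughly like this (the staging
table name and file path are invented for illustration; the final INSERT
relies on the staging table having just been CLUSTERed, so a plain
sequential scan already returns rows in geometry-index order):

```sql
BEGIN;

-- Stage the hourly batch in a scratch table with the same columns.
CREATE TEMP TABLE nodes_incoming (LIKE nodes);
COPY nodes_incoming FROM '/path/to/hourly_batch.csv' WITH (FORMAT csv);

-- Order the batch by spatial locality before touching the big table.
CREATE INDEX nodes_incoming_geom ON nodes_incoming USING gist (geom);
CLUSTER nodes_incoming USING nodes_incoming_geom;

-- The sequential scan now yields rows in roughly the same order the
-- main table's gist index will visit them, so index pages are touched
-- in a cache-friendly sequence.
INSERT INTO nodes SELECT * FROM nodes_incoming;

DROP TABLE nodes_incoming;
COMMIT;
```

The extra sort work happens on a small table (one hour of data) instead
of the 80GB one, which is the whole point of the trick.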