Discussion: Running update in chunks?


Running update in chunks?

From
Tim Uckun
Date:
I have seen a lot of slow-update questions asked both here and on
Stack Overflow, but they usually involve large tables. In my case the
dataset is fairly small.

I have an app in which I import data and then merge the records with
an existing table. Currently I do most of the heavy lifting in code
and it works well enough, but as the imports get bigger I thought I
would rewrite the code to speed it up using Postgres. Basically I get
the data, which I consider to be dirty, and put it into a table that
uses hstore to store it. I then run a series of update queries to
locate the "real" records in the various tables. The import data
looks like this https://gist.github.com/4584366 and has about 98K
records in it. The lookup table is very small, with only a couple of
hundred records in it.

This is the query I am running

update cars.imports i
    set make_id = md.make_id
    from cars.models md where i.model_id = md.id;


Here is the EXPLAIN ANALYZE output:

"Update on imports i  (cost=2.46..49720.34 rows=138858 width=526)
(actual time=51968.553..51968.553 rows=0 loops=1)"
"  ->  Hash Join  (cost=2.46..49720.34 rows=138858 width=526) (actual
time=0.044..408.170 rows=98834 loops=1)"
"        Hash Cond: (i.model_id = md.id)"
"        ->  Seq Scan on imports i  (cost=0.00..47808.58 rows=138858
width=516) (actual time=0.010..323.616 rows=98834 loops=1)"
"        ->  Hash  (cost=1.65..1.65 rows=65 width=14) (actual
time=0.026..0.026 rows=65 loops=1)"
"              Buckets: 1024  Batches: 1  Memory Usage: 4kB"
"              ->  Seq Scan on models md  (cost=0.00..1.65 rows=65
width=14) (actual time=0.002..0.012 rows=65 loops=1)"
"Total runtime: 51968.602 ms"


This query takes fifty seconds on a MacBook Air with an i7 processor,
eight gigs of RAM and an SSD. I am using Postgres 9.2 installed with
Homebrew, using the standard conf file.

So it seems to me that this query is running as fast as it can, but
it's still much slower than doing things in code one record at a
time (using some memoization).

Anyway... Presuming I can't really do anything to speed up this query,
does it make sense to try doing this in chunks, and if so, what is the
best technique for doing that?
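
For reference, by "chunks" I mean something along these lines (a rough
sketch only: batching on the serial id column, with each statement run
as its own transaction from the client; the 10000-row batch size is
arbitrary):

update cars.imports i
    set make_id = md.make_id
    from cars.models md
    where i.model_id = md.id
    and i.id >= 1 and i.id < 10001;

update cars.imports i
    set make_id = md.make_id
    from cars.models md
    where i.model_id = md.id
    and i.id >= 10001 and i.id < 20001;

-- ... and so on, stepping through the id range until max(id) is reached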


Re: Running update in chunks?

From
Richard Huxton
Date:
On 21/01/13 08:04, Tim Uckun wrote:
> This is the query I am running
>
> update cars.imports i
>      set make_id = md.make_id
>      from cars.models md where i.model_id = md.id;
>
>
> Here is the analyse
Looks like it's the actual update that's taking all the time.
> This query takes fifty seconds on a macbook air with i7 processor and
> eight gigs of RAM and SSD hard drive.  I am using postgres 9.2
> installed with homebrew using the standard conf file.
Can you try a couple of things just to check timings. Probably worth
EXPLAIN ANALYSE.

SELECT count(*) FROM cars.imports i JOIN cars.models md ON i.model_id =
md.id;

CREATE TEMP TABLE tt AS SELECT i.* FROM cars.imports i JOIN cars.models
md ON i.model_id = md.id;

Now the first one should take half a second judging by your previous
explain. If the second one takes 50 seconds too then that's just the
limit of your SSD's write speed. If it's much faster, then something
else is happening.


--
   Richard Huxton
   Archonet Ltd


Re: Running update in chunks?

From
Tim Uckun
Date:
> Can you try a couple of things just to check timings. Probably worth EXPLAIN
> ANALYSE.
>
> SELECT count(*) FROM cars.imports i JOIN cars.models md ON i.model_id =
> md.id;


Takes about 300 ms

>
> CREATE TEMP TABLE tt AS SELECT i.* FROM cars.imports i JOIN cars.models md
> ON i.model_id = md.id;

Takes about 300 ms

>
> Now the first one should take half a second judging by your previous
> explain. If the second one takes 50 seconds too then that's just the limit
> of your SSD's write. If it's much faster then something else is happening.


Re: Running update in chunks?

From
Richard Huxton
Date:
On 21/01/13 10:30, Tim Uckun wrote:
>> Can you try a couple of things just to check timings. Probably worth EXPLAIN
>> ANALYSE.
>>
>> SELECT count(*) FROM cars.imports i JOIN cars.models md ON i.model_id =
>> md.id;
>
> Takes about 300 ms
>
>> CREATE TEMP TABLE tt AS SELECT i.* FROM cars.imports i JOIN cars.models md
>> ON i.model_id = md.id;
> Takes about 300 ms
OK - so writing all the data takes well under one second but updating
the same amount takes 50 seconds.

The only differences I can think of are WAL logging (transaction log)
and index updates (the temp table has no indexes).

1. Try CREATE TABLE rather than CREATE TEMP TABLE - if that's still
quick then it's not the time taken to write WAL.
2. Run the update query against your new tt table and see how long that
takes.
3. Add indexes and repeat (in particular I'd be suspicious of the gin
index on "data")

My guess is that it's the time taken to update the "data" index - gin
indexes can be slow to rebuild (although 50 seconds seems *very* slow).
If so there are a few options:
1. Split the table and put whatever this "data" is into an import_data
table - assuming it doesn't change often.
2. Try a fill-factor of 50% or less - keeping the updates on the same
data page as the original might help
3. Drop the gin index before doing your bulk update and rebuild it at
the end. This is a common approach with bulk-loading / updates.
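
For option 3, a minimal sketch (assuming the gin index is named
index_imports_on_data, as in the schema posted elsewhere in this thread)
would be:

DROP INDEX cars.index_imports_on_data;

update cars.imports i
    set make_id = md.make_id
    from cars.models md where i.model_id = md.id;

CREATE INDEX index_imports_on_data ON cars.imports USING gin (data);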

Oh - I'm assuming you're only updating those rows whose make_id has
changed - that seemed to be the suggestion in your first message. If not,
simply adding "AND make_id <> md.make_id" should help. Also (and you may
well have considered this) - for a normalised setup you'd just have the
model-id in "imports" and look up the make-id through the "models" table.

--
   Richard Huxton
   Archonet Ltd


Re: Running update in chunks?

From
"Kevin Grittner"
Date:
Richard Huxton wrote:

> The only differences I can think of are WAL logging (transaction
> log) and index updates (the temp table has no indexes).

What about foreign keys? Are there any tables which reference the
updated column in a foreign key declaration? Do they have indexes
on that column?
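
A quick way to check is a catalog query along these lines (just a
sketch; substitute the real table name):

SELECT conname,
       conrelid::regclass  AS referencing_table,
       confrelid::regclass AS referenced_table
FROM pg_constraint
WHERE contype = 'f'
  AND (conrelid = 'cars.imports'::regclass
       OR confrelid = 'cars.imports'::regclass);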

-Kevin


Re: Running update in chunks?

From
Tim Uckun
Date:
Just to close this up and give some guidance to future googlers...

There are no foreign keys. The table is empty when I start. I copy the
data into it from a text file.

Removing indexes didn't help much (made a very slight difference).

Running a query CREATE TEMP TABLE tt AS SELECT ... using a massive
join takes about 8 seconds. I presume that's the baseline for the disk
and RAM given my current Postgres configuration. Note that this is
not a satisfactory option for me because I can't do what I want in one
step (the update I specified is one of many).

Running a very simple update "UPDATE imports set make_id = null"
takes over 50 seconds, so that's the minimum amount of time any update
is going to take.

Running a complex update where I join all the tables together and
update all the fields takes about 106 seconds.

Just running a complex select with the joins takes no time at all.

I tried chunking the updates using chunks of 100 records and 1000
records (where ID between X and Y repeatedly) and it was even slower.

Conclusion.  Updates on postgres are slow (given the default
postgresql.conf).  I presume this is due to MVCC or the WAL or
something and there are probably some things I can do to tweak the
conf file to make them go faster but out of the box running an update
on a table with lots of rows is going to cost you a lot. Removing the
indexes doesn't help that much.

Suggestion for the PG team.  Deliver a more realistic postgres.conf by
default. The default one seems to be aimed at ten year old PCs with
very little RAM and disk space. At least deliver additional conf files
for small, medium, large, huge setups.


Re: Running update in chunks?

From
Richard Huxton
Date:
On 21/01/13 20:09, Tim Uckun wrote:
> Just to close this up and give some guidance to future googlers...
Careful, future googlers.
> Conclusion.  Updates on postgres are slow
Nope.

> (given the default
> postgresql.conf).  I presume this is due to MVCC or the WAL or
> something and there are probably some things I can do to tweak the
> conf file to make them go faster but out of the box running an update
> on a table with lots of rows is going to cost you a lot.
Unlikely. Do you really think that a PostgreSQL installation typically
runs 100 times slower on updates than inserts and every other user has
just said "oh, that's ok then"? Or is it more likely that something
peculiar is broken on your setup?

>   Removing the indexes doesn't help that much.
>
> Suggestion for the PG team.  Deliver a more realistic postgres.conf by
> default. The default one seems to be aimed at ten year old PCs with
> very little RAM and disk space. At least deliver additional conf files
> for small, medium, large, huge setups.
--
   Richard Huxton


Re: Running update in chunks?

From
patrick keshishian
Date:
On Mon, Jan 21, 2013 at 1:23 PM, Richard Huxton <dev@archonet.com> wrote:
> On 21/01/13 20:09, Tim Uckun wrote:
>>
>> Just to close this up and give some guidance to future googlers...
>
> Careful, future googlers.
>
>> Conclusion.  Updates on postgres are slow
>
> Nope.
>
>
>> (given the default
>> postgresql.conf).  I presume this is due to MVCC or the WAL or
>> something and there are probably some things I can do to tweak the
>> conf file to make them go faster but out of the box running an update
>> on a table with lots of rows is going to cost you a lot.
>
> Unlikely. Do you really think that a PostgreSQL installation typically runs
> 100 times slower on updates than inserts and every other user has just said
> "oh, that's ok then"? Or is it more likely that something peculiar is broken
> on your setup.
>
>
>>   Removing the indexes doesn't help that much.
>>
>> Suggestion for the PG team.  Deliver a more realistic postgres.conf by
>> default. The default one seems to be aimed at ten year old PCs with
>> very little RAM and disk space. At least deliver additional conf files
>> for small, medium, large, huge setups.

I'd be curious to see results of the same "update" on a standard HDD
vs the SSD, and maybe on more typical database deployment hardware
vs a MacBook Air.

--patrick


Re: Running update in chunks?

From
Tim Uckun
Date:
>
> Nope.

If you have any suggestions I am all ears. For the purposes of this
discussion we can narrow down the problem to this update statement.

Update imports set make_id = null.

There are 98K records in the table. There is no index on the make_id
field. Standard untouched postgresql.conf from the brew install of
postgres.


> Unlikely. Do you really think that a PostgreSQL installation typically runs
> 100 times slower on updates than inserts and every other user has just said
> "oh, that's ok then"? Or is it more likely that something peculiar is broken
> on your setup.

I really don't know. That's why I am here asking. I don't think
anything in particular is broken with my system. As mentioned above, the
setup is really simple: standard Postgres install, the default conf
file, update one field on one table. It takes fifty-plus seconds.

I concede that if I were to go into postgresql.conf and make some
changes it would probably run faster (maybe much faster), but I wanted
to exhaust other factors before I went messing with the default
install.


Re: Running update in chunks?

From
Tim Uckun
Date:
> I'd be curious to see results of the same "update" on a standard HDD
> vs the SSD, and maybe on a more typical database deployment hardware
> vs a macbook air.
>


I haven't tried it on any other machine yet.  CREATE TEMP TABLE tt as
SELECT ...  takes eight seconds so presumably the disk is not the
choke point.


Re: Running update in chunks?

From
"Kevin Grittner"
Date:
Richard Huxton wrote:
> On 21/01/13 20:09, Tim Uckun wrote:

>> Just to close this up and give some guidance to future
>> googlers...

> Careful, future googlers.

+1

>> Conclusion. Updates on postgres are slow

> Nope.

Agreed.

>> (given the default postgresql.conf). I presume this is due to
>> MVCC or the WAL or something and there are probably some things
>> I can do to tweak the conf file to make them go faster but out
>> of the box running an update on a table with lots of rows is
>> going to cost you a lot.

> Unlikely. Do you really think that a PostgreSQL installation
> typically runs 100 times slower on updates than inserts and every
> other user has just said "oh, that's ok then"? Or is it more
> likely that something peculiar is broken on your setup.

As someone who has managed hundreds of databases ranging up to over
3TB and has never seen this without some very specific cause, I agree
that there is something wonky on Tim's setup which he hasn't told
us about. Then again, I'm not sure we've pushed hard enough for the
relevant details.

Tim, if you're still interested in resolving this, please post the
results from running the SQL code on this page:

http://wiki.postgresql.org/wiki/Server_Configuration

It might be worthwhile to read through this page:

http://wiki.postgresql.org/wiki/Slow_Query_Questions

... and try some of the ideas there. Base disk performance numbers
would help put the results in perspective.

The cause could be anything from table bloat due to inadequate
vacuuming to hardware problems.

-Kevin


Re: Running update in chunks?

From
"Kevin Grittner"
Date:
Tim Uckun wrote:

> If you have any suggestions I am all ears. For the purposes of this
> discussion we can narrow down the problem this update statement.
>
> Update imports set make_id = null.

Well, that simplifies things.

First off, what does it say for rows affected? (Hint, if you really
are using a default configuration and it doesn't say 0 rows
affected, please show us the actual query used.)

Second, try connecting to the database as a superuser and running:

VACUUM ANALYZE imports;
-- (show us the results)
VACUUM FULL imports;
VACUUM FREEZE ANALYZE;  -- (don't specify a table)

Then try your query and see whether performance is any different.

-Kevin


Re: Running update in chunks?

From
patrick keshishian
Date:
On Mon, Jan 21, 2013 at 1:48 PM, Tim Uckun <timuckun@gmail.com> wrote:
>> I'd be curious to see results of the same "update" on a standard HDD
>> vs the SSD, and maybe on a more typical database deployment hardware
>> vs a macbook air.
>>
>
>
> I haven't tried it on any other machine yet.  CREATE TEMP TABLE tt as
> SELECT ...  takes eight seconds so presumably the disk is not the
> choke point.

You are making an assumption that a fresh write is the same as a
re-write. Try the test.

--patrick


Re: Running update in chunks?

From
"Kevin Grittner"
Date:
Kevin Grittner wrote:

> First off, what does it say for rows affected? (Hint, if you really
> are using a default configuration and it doesn't say 0 rows
> affected, please show us the actual query used.)

Never mind that bit -- I got myself confused. Sorry for the noise.

-Kevin


Re: Running update in chunks?

From
Tim Uckun
Date:
> First off, what does it say for rows affected? (Hint, if you really
> are using a default configuration and it doesn't say 0 rows
> affected, please show us the actual query used.)

update imports set make_id = null

Query returned successfully: 98834 rows affected, 49673 ms execution time.


vacuum analyze imports

Query returned successfully with no result in 4138 ms.

VACUUM FULL imports;

Query returned successfully with no result in 38106 ms.

VACUUM FREEZE ANALYZE;

Query returned successfully with no result in 184635 ms

update imports set make_id = 0

Query returned successfully: 98834 rows affected, 45860 ms execution time.


So all the vacuuming saved about four seconds of execution time.

here is the postgresql.conf completely untouched from the default
install https://gist.github.com/4590590


Re: Running update in chunks?

From
Tim Uckun
Date:
Oh I forgot



 SELECT version();

"PostgreSQL 9.2.2 on x86_64-apple-darwin12.2.1, compiled by Apple
clang version 4.1 (tags/Apple/clang-421.11.65) (based on LLVM 3.1svn),
64-bit"


SELECT name, current_setting(name), source
  FROM pg_settings
  WHERE source NOT IN ('default', 'override');



"application_name";"pgAdmin III - Query Tool";"client"
"bytea_output";"escape";"session"
"client_encoding";"UNICODE";"session"
"client_min_messages";"notice";"session"
"DateStyle";"ISO, DMY";"session"
"default_text_search_config";"pg_catalog.english";"configuration file"
"lc_messages";"en_NZ.UTF-8";"configuration file"
"lc_monetary";"en_NZ.UTF-8";"configuration file"
"lc_numeric";"en_NZ.UTF-8";"configuration file"
"lc_time";"en_NZ.UTF-8";"configuration file"
"log_timezone";"NZ";"configuration file"
"max_connections";"20";"configuration file"
"max_stack_depth";"2MB";"environment variable"
"search_path";"chrysler, public";"session"
"shared_buffers";"1600kB";"configuration file"
"TimeZone";"NZ";"configuration file"


Re: Running update in chunks?

From
Steve Crawford
Date:
On 01/21/2013 03:45 PM, Tim Uckun wrote:
> Oh I forgot
> ...
> "shared_buffers";"1600kB";"configuration file"

You *reduced* shared buffers to 1.6MB? IIRC the typical default is 32MB
and the most common adjustment is to *increase* shared buffers. Most of
my servers are set to 2GB.

Try bumping that up to a reasonable value
(http://www.postgresql.org/docs/9.2/static/runtime-config-resource.html)
and share the results. Don't forget to restart PG after changing that
setting.
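
For example (the 2GB figure simply mirrors the servers mentioned above;
pick whatever suits an 8GB machine), the change in postgresql.conf is a
single line, followed by a restart:

# shared_buffers only takes effect after a full server restart
shared_buffers = 2GB

Afterwards, SHOW shared_buffers; confirms the new value.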

Cheers,
Steve


Re: Running update in chunks?

From
Steve Crawford
Date:
On 01/21/2013 04:15 PM, Steve Crawford wrote:
> On 01/21/2013 03:45 PM, Tim Uckun wrote:
>> Oh I forgot
>> ...
Me, too. I forgot to ask for the table definition. If there are
variable-length fields like "text" or "varchar", what is the typical
size of the data?

Also, what is the physical size of the table (\dt+ yourtable)?

Perhaps even the output of
select * from pg_stat_user_tables where relname='yourtable';
might be useful.

Cheers,
Steve


Re: Running update in chunks?

From
Tim Uckun
Date:
I already posted the schema earlier. It's a handful of integer fields
with one hstore field.


On Tue, Jan 22, 2013 at 1:23 PM, Steve Crawford
<scrawford@pinpointresearch.com> wrote:
> select * from pg_stat_user_tables where relname='yourtable';



Messy output


"relid","schemaname","relname","seq_scan","seq_tup_read","idx_scan","idx_tup_fetch","n_tup_ins","n_tup_upd","n_tup_del","n_tup_hot_upd","n_live_tup","n_dead_tup","last_vacuum","last_autovacuum","last_analyze","last_autoanalyze","vacuum_count","autovacuum_count","analyze_count","autoanalyze_count"
26710,"chrysler","imports",274,25280539,4,2749,98926,5757462,92,327542,98834,0,"2013-01-22
12:28:29.01505+13","2013-01-22 12:32:29.249588+13","2013-01-22
12:28:29.173772+13","2013-01-22 12:32:44.123493+13",3,30,3,24


Re: Running update in chunks?

From
"Kevin Grittner"
Date:
Kevin Grittner wrote:

> update imports set make_id = 0
>
> Query returned successfully: 98834 rows affected, 45860 ms execution time.

For difficult problems, there is nothing like a self-contained test
case that someone else can run to see the issue. Here's a starting
point:

create extension if not exists hstore;
create schema cars;
drop table if exists cars.imports;
CREATE TABLE cars.imports
(
id serial NOT NULL,
target_id integer,
batch_id integer,
make_id integer,
model_id integer,
date timestamp without time zone,
division_id integer,
dealer_id integer,
data hstore,
created_at timestamp without time zone NOT NULL,
updated_at timestamp without time zone NOT NULL,
CONSTRAINT imports_pkey PRIMARY KEY (id)
);
CREATE INDEX index_imports_on_data ON cars.imports USING gin (data);
CREATE INDEX index_imports_on_dealer_id ON cars.imports USING btree (dealer_id);
CREATE INDEX index_imports_on_division_id ON cars.imports USING btree (division_id);
CREATE INDEX index_imports_on_make_id ON cars.imports USING btree (make_id);
CREATE INDEX index_imports_on_model_id ON cars.imports USING btree (model_id);
insert into cars.imports (created_at, updated_at) select now(), now() from (select generate_series(1, 100000)) x;
vacuum freeze analyze;
\timing on
update cars.imports set make_id = 0;
vacuum analyze;
update cars.imports set make_id = 0;

... and here's what I get when I run it on my desktop computer with
ordinary disk drives and a completely default configuration:

test=# create extension if not exists hstore;
CREATE EXTENSION
Time: 48.032 ms
test=# create schema cars;
CREATE SCHEMA
Time: 8.150 ms
test=# drop table if exists cars.imports;
NOTICE:  table "imports" does not exist, skipping
DROP TABLE
Time: 0.205 ms
test=# CREATE TABLE cars.imports
test-# (
test(# id serial NOT NULL,
test(# target_id integer,
test(# batch_id integer,
test(# make_id integer,
test(# model_id integer,
test(# date timestamp without time zone,
test(# division_id integer,
test(# dealer_id integer,
test(# data hstore,
test(# created_at timestamp without time zone NOT NULL,
test(# updated_at timestamp without time zone NOT NULL,
test(# CONSTRAINT imports_pkey PRIMARY KEY (id)
test(# );
NOTICE:  CREATE TABLE will create implicit sequence "imports_id_seq" for serial column "imports.id"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "imports_pkey" for table "imports"
CREATE TABLE
Time: 152.677 ms
test=# CREATE INDEX index_imports_on_data ON cars.imports USING gin (data);
CREATE INDEX
Time: 6.391 ms
test=# CREATE INDEX index_imports_on_dealer_id ON cars.imports USING btree (dealer_id);
CREATE INDEX
Time: 64.668 ms
test=# CREATE INDEX index_imports_on_division_id ON cars.imports USING btree (division_id);
CREATE INDEX
Time: 65.573 ms
test=# CREATE INDEX index_imports_on_make_id ON cars.imports USING btree (make_id);
CREATE INDEX
Time: 64.959 ms
test=# CREATE INDEX index_imports_on_model_id ON cars.imports USING btree (model_id);
CREATE INDEX
Time: 64.906 ms
test=# insert into cars.imports (created_at, updated_at) select now(), now() from (select generate_series(1, 100000))
x;
INSERT 0 100000
Time: 2516.559 ms
test=# vacuum freeze analyze;
VACUUM
Time: 3357.778 ms
test=# \timing on
Timing is on.
test=# update cars.imports set make_id = 0;
UPDATE 100000
Time: 2937.241 ms
test=# vacuum analyze;
VACUUM
Time: 2097.426 ms
test=# update cars.imports set make_id = 0;
UPDATE 100000
Time: 3935.939 ms

Ubuntu 12.10
i7-3770 CPU @ 3.40GHz with 16GB RAM
Linux Kevin-Desktop 3.5.0-22-generic #34-Ubuntu SMP Tue Jan 8 21:47:00 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
A pair of WDC WD10EALX-009BA0 (15.01H51) 7200 RPM drives in RAID 1.

 PostgreSQL 9.2.2 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2, 64-bit

test=# SELECT name, current_setting(name), source
test-#   FROM pg_settings
test-#   WHERE source NOT IN ('default', 'override');
            name            |  current_setting   |        source        
----------------------------+--------------------+----------------------
 application_name           | psql               | client
 client_encoding            | UTF8               | client
 DateStyle                  | ISO, MDY           | configuration file
 default_text_search_config | pg_catalog.english | configuration file
 lc_messages                | en_US.UTF-8        | configuration file
 lc_monetary                | en_US.UTF-8        | configuration file
 lc_numeric                 | en_US.UTF-8        | configuration file
 lc_time                    | en_US.UTF-8        | configuration file
 log_timezone               | US/Central         | configuration file
 max_connections            | 100                | configuration file
 max_stack_depth            | 2MB                | environment variable
 shared_buffers             | 32MB               | configuration file
 TimeZone                   | US/Central         | configuration file
(13 rows)

I did absolutely no tuning from the default configuration.

So, what timings do you get if you run the identical script? Is
there something you can do to the above script (maybe in terms of
populating data) which will cause the performance you see?

-Kevin


Re: Running update in chunks?

From
Jeff Janes
Date:
On Monday, January 21, 2013, Tim Uckun wrote:
>> First off, what does it say for rows affected? (Hint, if you really
>> are using a default configuration and it doesn't say 0 rows
>> affected, please show us the actual query used.)
>
> update imports set make_id = null
>
> Query returned successfully: 98834 rows affected, 49673 ms execution time.
>
>
> vacuum analyze imports
>
> Query returned successfully with no result in 4138 ms.
>
> VACUUM FULL imports;

What if you do:
alter table cars.imports set (fillfactor=50);
Before the vacuum full, and then try the update again?
 
Cheers,

Jeff

Re: Running update in chunks?

From
Jeff Janes
Date:
On Monday, January 21, 2013, Tim Uckun wrote:
> I already posted the schema earlier. It's a handful of integer fields
> with one hstore field.


One hstore field can easily be equivalent to 50 text fields with an index on each one.

I'm pretty sure that that is your bottleneck.

What does \di+ show?
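
If psql isn't handy, roughly the same information (the indexes on the
table and their sizes) can be pulled with a catalog query; a sketch:

SELECT indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid::regclass)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'imports';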


Cheers,

Jeff

Re: Running update in chunks?

From
Steve Crawford
Date:
On 01/21/2013 05:02 PM, Tim Uckun wrote:
> I already posted the schema earlier. It's a handful of integer fields
> with one hstore field.

Oh well. I can't find it but maybe it got lost in shipping or eaten by a
spam filter.

>
>
> On Tue, Jan 22, 2013 at 1:23 PM, Steve Crawford
> <scrawford@pinpointresearch.com> wrote:
>> select * from pg_stat_user_tables where relname='yourtable';
>
>
> Messy output
Don't know if you are in psql but if so, expanded display works for
this. I.e.:
steve@[local] => \x
Expanded display is on.
steve@[local] => select * from pg_stat_user_tables where relname='footest';
-[ RECORD 1 ]-----+------------------------------
relid             | 781691
schemaname        | public
relname           | footest
seq_scan          | 3
seq_tup_read      | 609
idx_scan          |
idx_tup_fetch     |
n_tup_ins         | 609
n_tup_upd         | 0
n_tup_del         | 0
n_tup_hot_upd     | 0
n_live_tup        | 301
n_dead_tup        | 0
last_vacuum       |
last_autovacuum   |
last_analyze      |
last_autoanalyze  | 2012-12-19 08:42:23.347368-08
vacuum_count      | 0
autovacuum_count  | 0
analyze_count     | 0
autoanalyze_count | 2


>
>
"relid","schemaname","relname","seq_scan","seq_tup_read","idx_scan","idx_tup_fetch","n_tup_ins","n_tup_upd","n_tup_del","n_tup_hot_upd","n_live_tup","n_dead_tup","last_vacuum","last_autovacuum","last_analyze","last_autoanalyze","vacuum_count","autovacuum_count","analyze_count","autoanalyze_count"
> 26710,"chrysler","imports",274,25280539,4,2749,98926,5757462,92,327542,98834,0,"2013-01-22
> 12:28:29.01505+13","2013-01-22 12:32:29.249588+13","2013-01-22
> 12:28:29.173772+13","2013-01-22 12:32:44.123493+13",3,30,3,24
>
>
So at least autovacuum is running (and some manual vacuum and analyze as
well).

Cheers,
Steve



Re: Running update in chunks?

From
Steve Crawford
Date:
On 01/21/2013 06:21 PM, Kevin Grittner wrote:
> Kevin Grittner wrote:
> ...
>   shared_buffers             | 32MB               | configuration file
> ...
>
> I did absolutely no tuning from the default configuration.
>
But Tim has his shared_buffers set to 1600kB
("shared_buffers";"1600kB";"configuration file") or roughly 1/20 of the
typical default value, which is a very low starting value anyway, on a
machine populated with 8GB RAM.

I'd like to see how it runs with a more reasonable shared_buffers
setting. At a very minimum the 32MB default.

Cheers,
Steve



Re: Running update in chunks?

From
Alvaro Herrera
Date:
Steve Crawford escribió:
> On 01/21/2013 05:02 PM, Tim Uckun wrote:
> >I already posted the schema earlier. It's a handful of integer fields
> >with one hstore field.
>
> Oh well. I can't find it but maybe it got lost in shipping or eaten
> by a spam filter.

This is what we have the archives for:

http://www.postgresql.org/message-id/flat/50FEDF66.7050100@pinpointresearch.com#50FEDF66.7050100@pinpointresearch.com

The original message is at the top of the page (obviously).

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: Running update in chunks?

From
Steve Crawford
Date:
On 01/22/2013 10:59 AM, Alvaro Herrera wrote:
> Steve Crawford escribió:
>> On 01/21/2013 05:02 PM, Tim Uckun wrote:
>>> I already posted the schema earlier. It's a handful of integer fields
>>> with one hstore field.
>> Oh well. I can't find it but maybe it got lost in shipping or eaten
>> by a spam filter.
> This is what we have the archives are for:
>
> http://www.postgresql.org/message-id/flat/50FEDF66.7050100@pinpointresearch.com#50FEDF66.7050100@pinpointresearch.com
>
> The original message is at the top of the page (obviously).
>
Didn't notice that the information was over on the github site (which,
of course, also makes it impossible to search for in my email and
unavailable to the mail archives for those wishing to view it in the
future).

Cheers,
Steve



Re: Running update in chunks?

From
"Kevin Grittner"
Date:
Jeff Janes wrote:

> one hstore field can easily be equivalent to 50 text fields with
> an index on each one.
>
> I'm pretty sure that that is your bottleneck.

I agree that seems like the most likely cause. Each update to the
row holding the hstore column requires adding new index entries for
all the hstore elements, and autovacuum will need to clean up the
old ones in the background. The best solution would be to either
normalize the data instead of using hstore, or move the hstore to a
separate table which is referenced by some sort of ID from the
frequently-updated table.
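
A minimal sketch of that second option (the table and column names here
are illustrative, not from the original schema):

-- side table holding the hstore payload, keyed by the import row's id
CREATE TABLE cars.import_data
(
    import_id integer PRIMARY KEY REFERENCES cars.imports (id),
    data hstore
);
CREATE INDEX index_import_data_on_data ON cars.import_data USING gin (data);

-- cars.imports then keeps only the small, frequently-updated columns,
-- and its own "data" column and gin index go away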

-Kevin


Re: Running update in chunks?

From
Tim Uckun
Date:
Sorry I haven't been responsive for a little while.

I ran your script, but creating a new schema instead of using my existing
schema. My timings were similar to yours (more or less) except for the
vacuums, which took roughly 147891 ms.


On Tue, Jan 22, 2013 at 3:21 PM, Kevin Grittner <kgrittn@mail.com> wrote:
> Kevin Grittner wrote:
>
>> update imports set make_id = 0
>>
>> Query returned successfully: 98834 rows affected, 45860 ms execution time.
>
> For difficult problems, there is nothing like a self-contained test
> case, that someone else can run to see the issue. Here's a starting
> point:
>
> create extension if not exists hstore;
> create schema cars;
> drop table if exists cars.imports;
> CREATE TABLE cars.imports
> (
> id serial NOT NULL,
> target_id integer,
> batch_id integer,
> make_id integer,
> model_id integer,
> date timestamp without time zone,
> division_id integer,
> dealer_id integer,
> data hstore,
> created_at timestamp without time zone NOT NULL,
> updated_at timestamp without time zone NOT NULL,
> CONSTRAINT imports_pkey PRIMARY KEY (id)
> );
> CREATE INDEX index_imports_on_data ON cars.imports USING gin (data);
> CREATE INDEX index_imports_on_dealer_id ON cars.imports USING btree (dealer_id);
> CREATE INDEX index_imports_on_division_id ON cars.imports USING btree (division_id);
> CREATE INDEX index_imports_on_make_id ON cars.imports USING btree (make_id);
> CREATE INDEX index_imports_on_model_id ON cars.imports USING btree (model_id);
> insert into cars.imports (created_at, updated_at) select now(), now() from (select generate_series(1, 100000)) x;
> vacuum freeze analyze;
> \timing on
> update cars.imports set make_id = 0;
> vacuum analyze;
> update cars.imports set make_id = 0;
>
> ... and here's what I get when I run it on my desktop computer with
> ordinary disk drives and a completely default configuration:
>
> test=# create extension if not exists hstore;
> CREATE EXTENSION
> Time: 48.032 ms
> test=# create schema cars;
> CREATE SCHEMA
> Time: 8.150 ms
> test=# drop table if exists cars.imports;
> NOTICE:  table "imports" does not exist, skipping
> DROP TABLE
> Time: 0.205 ms
> test=# CREATE TABLE cars.imports
> test-# (
> test(# id serial NOT NULL,
> test(# target_id integer,
> test(# batch_id integer,
> test(# make_id integer,
> test(# model_id integer,
> test(# date timestamp without time zone,
> test(# division_id integer,
> test(# dealer_id integer,
> test(# data hstore,
> test(# created_at timestamp without time zone NOT NULL,
> test(# updated_at timestamp without time zone NOT NULL,
> test(# CONSTRAINT imports_pkey PRIMARY KEY (id)
> test(# );
> NOTICE:  CREATE TABLE will create implicit sequence "imports_id_seq" for serial column "imports.id"
> NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "imports_pkey" for table "imports"
> CREATE TABLE
> Time: 152.677 ms
> test=# CREATE INDEX index_imports_on_data ON cars.imports USING gin (data);
> CREATE INDEX
> Time: 6.391 ms
> test=# CREATE INDEX index_imports_on_dealer_id ON cars.imports USING btree (dealer_id);
> CREATE INDEX
> Time: 64.668 ms
> test=# CREATE INDEX index_imports_on_division_id ON cars.imports USING btree (division_id);
> CREATE INDEX
> Time: 65.573 ms
> test=# CREATE INDEX index_imports_on_make_id ON cars.imports USING btree (make_id);
> CREATE INDEX
> Time: 64.959 ms
> test=# CREATE INDEX index_imports_on_model_id ON cars.imports USING btree (model_id);
> CREATE INDEX
> Time: 64.906 ms
> test=# insert into cars.imports (created_at, updated_at) select now(), now() from (select generate_series(1, 100000))
x;
> INSERT 0 100000
> Time: 2516.559 ms
> test=# vacuum freeze analyze;
> VACUUM
> Time: 3357.778 ms
> test=# \timing on
> Timing is on.
> test=# update cars.imports set make_id = 0;
> UPDATE 100000
> Time: 2937.241 ms
> test=# vacuum analyze;
> VACUUM
> Time: 2097.426 ms
> test=# update cars.imports set make_id = 0;
> UPDATE 100000
> Time: 3935.939 ms
>
> Ubuntu 12.10
> i7-3770 CPU @ 3.40GHz with 16GB RAM
> Linux Kevin-Desktop 3.5.0-22-generic #34-Ubuntu SMP Tue Jan 8 21:47:00 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
> A pair of WDC WD10EALX-009BA0 (15.01H51) 7200 RPM drives in RAID 1.
>
>  PostgreSQL 9.2.2 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2, 64-bit
>
> test=# SELECT name, current_setting(name), source
> test-#   FROM pg_settings
> test-#   WHERE source NOT IN ('default', 'override');
>             name            |  current_setting   |        source
> ----------------------------+--------------------+----------------------
>  application_name           | psql               | client
>  client_encoding            | UTF8               | client
>  DateStyle                  | ISO, MDY           | configuration file
>  default_text_search_config | pg_catalog.english | configuration file
>  lc_messages                | en_US.UTF-8        | configuration file
>  lc_monetary                | en_US.UTF-8        | configuration file
>  lc_numeric                 | en_US.UTF-8        | configuration file
>  lc_time                    | en_US.UTF-8        | configuration file
>  log_timezone               | US/Central         | configuration file
>  max_connections            | 100                | configuration file
>  max_stack_depth            | 2MB                | environment variable
>  shared_buffers             | 32MB               | configuration file
>  TimeZone                   | US/Central         | configuration file
> (13 rows)
>
> I did absolutely no tuning from the default configuration.
>
> So, what timings do you get if you run the identical script? Is
> there something you can do to the above script (maybe in terms of
> populating data) which will cause the performance you see?
>
> -Kevin




Re: Running update in chunks?

From
Tim Uckun
Date:
>
> What if you do:
> alter table cars.imports set (fillfactor=50);
> Before the vacuum full, and then try the update again?


This makes a dramatic difference when combined with a vacuum.

UPDATE 98834
Time: 3408.210 ms

Ten times faster!




Re: Running update in chunks?

From
Tim Uckun
Date:
relid             | 26710
schemaname        | cars
relname           | imports
seq_scan          | 280
seq_tup_read      | 25873543
idx_scan          | 4
idx_tup_fetch     | 2749
n_tup_ins         | 98926
n_tup_upd         | 6350466
n_tup_del         | 92
n_tup_hot_upd     | 625286
n_live_tup        | 98834
n_dead_tup        | 0
last_vacuum       | 2013-01-25 21:55:36.078614+13
last_autovacuum   | 2013-01-25 21:58:40.850546+13
last_analyze      | 2013-01-25 21:55:36.305967+13
last_autoanalyze  | 2013-01-25 21:51:54.307639+13
vacuum_count      | 6
autovacuum_count  | 32
analyze_count     | 6
autoanalyze_count | 25


On Wed, Jan 23, 2013 at 7:50 AM, Steve Crawford
<scrawford@pinpointresearch.com> wrote:
> On 01/21/2013 05:02 PM, Tim Uckun wrote:
>>
>> I already posted the schema earlier. It's a handful of integer fields
>> with one hstore field.
>
>
> Oh well. I can't find it but maybe it got lost in shipping or eaten by a
> spam filter.
>
>
>>
>>
>> On Tue, Jan 22, 2013 at 1:23 PM, Steve Crawford
>> <scrawford@pinpointresearch.com> wrote:
>>>
>>> select * from pg_stat_user_tables where relname='yourtable';
>>
>>
>>
>> Messy output
>
> Don't know if you are in psql but if so, expanded display works for this.
> I.e.:
> steve@[local] => \x
> Expanded display is on.
> steve@[local] => select * from pg_stat_user_tables where relname='footest';
> -[ RECORD 1 ]-----+------------------------------
> relid             | 781691
> schemaname        | public
> relname           | footest
> seq_scan          | 3
> seq_tup_read      | 609
> idx_scan          |
> idx_tup_fetch     |
> n_tup_ins         | 609
> n_tup_upd         | 0
> n_tup_del         | 0
> n_tup_hot_upd     | 0
> n_live_tup        | 301
> n_dead_tup        | 0
> last_vacuum       |
> last_autovacuum   |
> last_analyze      |
> last_autoanalyze  | 2012-12-19 08:42:23.347368-08
> vacuum_count      | 0
> autovacuum_count  | 0
> analyze_count     | 0
> autoanalyze_count | 2
>
>
>
>>
>>
>>
"relid","schemaname","relname","seq_scan","seq_tup_read","idx_scan","idx_tup_fetch","n_tup_ins","n_tup_upd","n_tup_del","n_tup_hot_upd","n_live_tup","n_dead_tup","last_vacuum","last_autovacuum","last_analyze","last_autoanalyze","vacuum_count","autovacuum_count","analyze_count","autoanalyze_count"
>>
>> 26710,"chrysler","imports",274,25280539,4,2749,98926,5757462,92,327542,98834,0,"2013-01-22
>> 12:28:29.01505+13","2013-01-22 12:32:29.249588+13","2013-01-22
>> 12:28:29.173772+13","2013-01-22 12:32:44.123493+13",3,30,3,24
>>
>>
> So at least autovacuum is running (and some manual vacuum and analyze as
> well).
>
> Cheers,
> Steve
>




Re: Running update in chunks?

From
Tim Uckun
Date:
> I agree that seems like the most likely cause. Each update to the
> row holding the hstore column requires adding new index entries for
> all the hstore elements, and autovacuum will need to clean up the
> old ones in the background. The best solution would be to either
> normalize the data instead of using hstore, or move the hstore to a
> separate table which is referenced by some sort of ID from the
> frequently-updated table.


That's very interesting. I can certainly split up the table, no big
deal there.  So would the index be redone even if I am not updating
the hstore field itself?




Re: Running update in chunks?

From
Richard Huxton
Date:
On 25/01/13 08:57, Tim Uckun wrote:
>> What if you do:
>> alter table cars.imports set (fillfactor=50);
>> Before the vacuum full, and then try the update again?
>
> This makes a dramatic difference when combined with a vacuum.
>
> UPDATE 98834
> Time: 3408.210 ms
>
> Ten times faster!
That suggests (to me, at least) that it is related to index updating.
Again, your GIN index seems the primary candidate.

A fillfactor of 50% means row updates probably stay on the same
disk-block as their previous version. This implies fewer index updates.

Try running iostat (I think that's available on a Mac) with/without the
fillfactor and with/without the GIN index while you do the updates. It's
possible your SSD is just behaving oddly under stress.

--
   Richard Huxton
   Archonet Ltd




Re: Running update in chunks?

From
Tim Uckun
Date:
>
> That suggests (to me, at least) that it is related to index updating. Again,
> your GIN index seems primary candidate.
>
> Try running iostat (I think that's available on a Mac) with/without the
> fillfactor and with/without the GIN index while you do the updates. It's
> possible your SSD is just behaving oddly under stress.
>


I dropped the index and the numbers shot up tenfold or more.  I don't
know why postgres feels the need to update the GIN index on the hstore
field when I am only updating an integer field but it looks like I
need to split the hstore into a different table.




Re: Running update in chunks?

From
Richard Huxton
Date:
On 25/01/13 11:38, Tim Uckun wrote:
>> That suggests (to me, at least) that it is related to index updating. Again,
>> your GIN index seems primary candidate.
>>
>> Try running iostat (I think that's available on a Mac) with/without the
>> fillfactor and with/without the GIN index while you do the updates. It's
>> possible your SSD is just behaving oddly under stress.
>>
>
> I dropped the index and the numbers shot up tenfold or more.  I don't
> know why postgres feels the need to update the GIN index on the hstore
> field when I am only updating an integer field but it looks like I
> need to split the hstore into a different table.
If the row moves to a different block, then it has no choice. The old
index entry will point to an invalid block. There are some optimisations
(HOT - http://pgsql.tapoueh.org/site/html/misc/hot.html) but that relies
on (iirc) the update staying on the same block and also not updating any
indexed fields (and you were, I think).

A GIN index is very expensive to update compared to btree too.
--
   Richard Huxton
   Archonet Ltd




Re: Running update in chunks?

From
Albe Laurenz
Date:
Tim Uckun wrote:
> I dropped the index and the numbers shot up tenfold or more.  I don't
> know why postgres feels the need to update the GIN index on the hstore
> field when I am only updating an integer field but it looks like I
> need to split the hstore into a different table.

Every UPDATE that is not HOT will create a row version with
a new "row id".  That means that all indexes referencing that
row will have to get updated.

That is consistent with better performance with low fillfactor
(which makes HOT more likely).

Yours,
Laurenz Albe


Re: Running update in chunks?

From
Jeff Janes
Date:
On Fri, Jan 25, 2013 at 3:38 AM, Tim Uckun <timuckun@gmail.com> wrote:
>>
>> That suggests (to me, at least) that it is related to index updating. Again,
>> your GIN index seems primary candidate.
>>
>> Try running iostat (I think that's available on a Mac) with/without the
>> fillfactor and with/without the GIN index while you do the updates. It's
>> possible your SSD is just behaving oddly under stress.
>>
>
>
> I dropped the index and the numbers shot up tenfold or more.  I don't
> know why postgres feels the need to update the GIN index on the hstore
> field when I am only updating an integer field


When the row gets updated, it might move to some place else.  An index
maps data values to row locations.  So if the location changes, all
indexes need to be updated, even if the data value for that index did
not change.  (Well, I shouldn't say they *need* to change.  The
database could have been designed, with considerable difficulty and
consequences, to leave behind permanent redirect pointers to the new
location.  But it wasn't)

There is a mechanism called HOT update (Heap-Only Tuple) which can
prevent this under certain conditions.

1) Either none of the fields being updated are indexed, or any that
are both updated and indexed are updated to the value they already
have.

2) There is room for a new copy of the tuple on the same page as the old one.

Lowering the fillfactor helps with requirement 2, especially since
your tuples are probably wide (because of the hstore column) and so
not many fit on a page.

Note that if you update a field to have the same value as it already
does, it still makes a new copy of the entire tuple anyway.  (It
detects that the :old = :new for HOT-eligibility purposes if the field
is indexed, but not for suppression of copying purposes.  And if the
tuple needs to be copied but there is no room on that page, then it
isn't eligible for HOT after all).

So you should add a where clause to the UPDATE to filter out things
that are unchanged.
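
Applied to the original statement, that might look like the following
sketch (IS DISTINCT FROM rather than <> so NULLs are handled as well):

update cars.imports i
    set make_id = md.make_id
    from cars.models md
    where i.model_id = md.id
    and i.make_id is distinct from md.make_id;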


> but it looks like I
> need to split the hstore into a different table.

That would be one solution, but I think a better one would be to not
store "make_id" in "imports" in the first place, but instead to always
fetch it by joining "imports" to "models" at query time.
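
In other words, something like this sketch of the normalized approach
(along the lines Richard suggested earlier):

-- no make_id column maintained on imports; resolve it when needed
SELECT i.*, md.make_id
FROM cars.imports i
JOIN cars.models md ON md.id = i.model_id;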

Cheers,

Jeff


Re: Running update in chunks?

From
Tim Uckun
Date:
>
> That would be one solution, but I think a better one would be to not
> store "make_id" in "imports" in the first place, but instead to always
> fetch it by joining "imports" to "models" at query time.
>

My problem here is that the incoming data is quite messy, so the join
conditions become weird (lots of ORs and such). A multi-pass approach
seems to work better.


Re: Running update in chunks?

From
Jasen Betts
Date:
On 2013-01-25, Tim Uckun <timuckun@gmail.com> wrote:
>> I agree that seems like the most likely cause. Each update to the
>> row holding the hstore column requires adding new index entries for
>> all the hstore elements, and autovacuum will need to clean up the
>> old ones in the background. The best solution would be to either
>> normalize the data instead of using hstore, or move the hstore to a
>> separate table which is referenced by some sort of ID from the
>> frequently-updated table.
>
>
> That's very interesting. I can certainly split up the table, no big
> deal there.  So would the index be redone even if I am not updating
> the hstore field itself?

Absolutely! See MVCC.

 http://www.postgresql.org/docs/current/static/mvcc-intro.html


--
⚂⚃ 100% natural