Discussion: update query taking too long


update query taking too long

From
Chris
Date:
Hi all,

I'm trying to do an update of a reasonably large table and it's taking
way too long so I'm trying to work out why and if I need to tweak any
settings to speed it up.

The table is around 3.5 million records.

The query is

update table set domainname=substring(emailaddress from position('@' in
emailaddress));

I've left it running for over 20 minutes and it hasn't finished so I'm
doing something terribly bad but I have no idea what ;)

Maybe there's another way to write the query but I'm not sure how to
make it better.

Most settings are default, I have bumped up shared_buffers a bit to
65536 - I think that works out to 512Meg ? The machine has 2G ram.
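(For reference, shared_buffers is counted in 8 kB buffer pages on a default build, so that guess checks out:)

```sql
-- shared_buffers is in 8 kB pages (default build), so:
SELECT 65536 * 8 / 1024 AS shared_buffers_mb;   -- 512
```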

Running version 8.1.9.

Any pointers about settings etc are most welcome.

--
Postgresql & php tutorials
http://www.designmagick.com/

Re: update query taking too long

From
"A. Kretschmer"
Date:
On Thu, 28.06.2007 at 15:03:32 +1000, Chris wrote the following:
> Hi all,
>
> I'm trying to do an update of a reasonably large table and it's taking
> way too long so I'm trying to work out why and if I need to tweak any
> settings to speed it up.
>
> The table is around 3.5 million records.
>
> The query is
>
> update table set domainname=substring(emailaddress from position('@' in
> emailaddress));

I think this is a bad idea. First, you have two columns with nearly
identical data (emailaddress already includes the domain, plus an extra
domain field).

And after the UPDATE you will have every row twice, because of MVCC: the
live tuple and a dead tuple.


> Any pointers about settings etc are most welcome.

I think you would be better off using a VIEW.

CREATE VIEW my_view_on_table as SELECT mailaddres, substring(...) as
domain, ...

or use the substring(...) in your regular queries instead of the extra
column.
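A complete sketch of such a view (the table and column names here are assumptions based on the original query; the +1 drops the '@' from the stored domain):

```sql
-- Sketch only: assumes a base table named emails with columns id and emailaddress
CREATE VIEW emails_with_domain AS
SELECT id,
       emailaddress,
       substring(emailaddress FROM position('@' IN emailaddress) + 1) AS domainname
FROM emails;
```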


Andreas
--
Andreas Kretschmer
Contact:  Heynitz: 035242/47150,   D1: 0160/7141639 (more: -> header)
GnuPG-ID:   0x3FFF606C, privat 0x7F4584DA   http://wwwkeys.de.pgp.net

Re: update query taking too long

From
Tom Lane
Date:
Chris <dmagick@gmail.com> writes:
> I'm trying to do an update of a reasonably large table and it's taking
> way too long so I'm trying to work out why and if I need to tweak any
> settings to speed it up.

Any foreign keys leading to or from that table?

3.5 million row updates are not exactly gonna be instantaneous anyway,
but only FK checks or really slow user-written triggers would make it
take upwards of an hour ...

            regards, tom lane

Re: update query taking too long

From
Chris
Date:
Tom Lane wrote:
> Chris <dmagick@gmail.com> writes:
>> I'm trying to do an update of a reasonably large table and it's taking
>> way too long so I'm trying to work out why and if I need to tweak any
>> settings to speed it up.
>
> Any foreign keys leading to or from that table?

Nope :(

> 3.5 million row updates are not exactly gonna be instantaneous anyway,
> but only FK checks or really slow user-written triggers would make it
> take upwards of an hour ...

No triggers or functions.

Table is pretty basic.

I have a few indexes (one on the primary key, one on emailaddress etc)
but the 'domainname' column is a new one not referenced by any of the
indexes.

FWIW (while the other update is still going in another window):

select SUBSTRING(emailaddress FROM POSITION('@' IN emailaddress)) from
table;
Time: 28140.399 ms

Is there a better way to write the update? I thought about something
like this (but couldn't get it working - guess I don't have the right
syntax):

update t1 set domainname=(select id, SUBSTRING(emailaddress FROM
POSITION('@' IN emailaddress)) from table t2) AS t2 where t1.id=t2.id
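For the record, the usual PostgreSQL form of that join-style update uses UPDATE ... FROM; a sketch, with the table name t1 and the column names taken from the fragment above (for this particular task the plain single-table UPDATE is simpler):

```sql
-- Sketch: UPDATE ... FROM avoids the invalid two-column subquery form
UPDATE t1
SET domainname = t2.dom
FROM (SELECT id,
             substring(emailaddress FROM position('@' IN emailaddress)) AS dom
      FROM t1) AS t2
WHERE t1.id = t2.id;
```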

--
Postgresql & php tutorials
http://www.designmagick.com/

Re: update query taking too long

From
Chris
Date:
A. Kretschmer wrote:
> On Thu, 28.06.2007 at 15:03:32 +1000, Chris wrote the following:
>> Hi all,
>>
>> I'm trying to do an update of a reasonably large table and it's taking
>> way too long so I'm trying to work out why and if I need to tweak any
>> settings to speed it up.
>>
>> The table is around 3.5 million records.
>>
>> The query is
>>
>> update table set domainname=substring(emailaddress from position('@' in
>> emailaddress));
>
> I think this is a bad idea. First, you have two columns with nearly
> identical data (emailaddress already includes the domain, plus an
> extra domain field).

Yeah, I know. I might have to go back to the drawing board on this one.
The app has to work in mysql & postgres so I'm a bit limited in some of
my approaches.

--
Postgresql & php tutorials
http://www.designmagick.com/

Re: update query taking too long

From
"A. Kretschmer"
Date:
On Thu, 28.06.2007 at 16:16:50 +1000, Chris wrote the following:
> Is there a better way to write the update? I thought about something
> like this (but couldn't get it working - guess I don't have the right
> syntax):
>
> update t1 set domainname=(select id, SUBSTRING(emailaddress FROM
> POSITION('@' IN emailaddress)) from table t2) AS t2 where t1.id=t2.id

test=# select * from foo;
 id |    mail     | domain
----+-------------+--------
  1 | foo@foo.tld |
  2 | bar@bar.tld |
(2 rows)

test=*# update foo set domain=SUBSTRING(mail FROM (POSITION('@' IN
mail)+1));
UPDATE 2
test=*# select * from foo;
 id |    mail     | domain
----+-------------+---------
  1 | foo@foo.tld | foo.tld
  2 | bar@bar.tld | bar.tld
(2 rows)


(without the @ in the domain...)


Andreas
--
Andreas Kretschmer
Contact:  Heynitz: 035242/47150,   D1: 0160/7141639 (more: -> header)
GnuPG-ID:   0x3FFF606C, privat 0x7F4584DA   http://wwwkeys.de.pgp.net

Re: update query taking too long

From
Chris
Date:
A. Kretschmer wrote:
> On Thu, 28.06.2007 at 16:16:50 +1000, Chris wrote the following:
>> Is there a better way to write the update? I thought about something
>> like this (but couldn't get it working - guess I don't have the right
>> syntax):
>>
>> update t1 set domainname=(select id, SUBSTRING(emailaddress FROM
>> POSITION('@' IN emailaddress)) from table t2) AS t2 where t1.id=t2.id
>
> test=# select * from foo;
>  id |    mail     | domain
> ----+-------------+--------
>   1 | foo@foo.tld |
>   2 | bar@bar.tld |
> (2 rows)
>
> test=*# update foo set domain=SUBSTRING(mail FROM (POSITION('@' IN
> mail)+1));

That's what my original query is (apart from the +1 at the end) ;)

I was just trying to approach it differently with the other attempt.

--
Postgresql & php tutorials
http://www.designmagick.com/

Re: update query taking too long

From
Richard Huxton
Date:
Chris wrote:
> Tom Lane wrote:
>> Any foreign keys leading to or from that table?
>
> Nope :(
>
>> 3.5 million row updates are not exactly gonna be instantaneous anyway,
>> but only FK checks or really slow user-written triggers would make it
>> take upwards of an hour ...
>
> No triggers, functions.

Of course you really want a trigger on this, since presumably domainname
should always be kept in sync with emailaddress. But that's not the
immediate issue.

> Table is pretty basic.
>
> I have a few indexes (one on the primary key, one on emailaddress etc)
> but the 'domainname' column is a new one not referenced by any of the
> indexes.
>
> FWIW (while the other update is still going in another window):

What's saturated? Is the system I/O limited or CPU limited? You *should*
be limited by the write speed of your disk with something simple like this.

What happens if you do the following?
CREATE TABLE email_upd_test (id SERIAL, email text, domainname text,
PRIMARY KEY (id));

INSERT INTO email_upd_test (email) SELECT n::text || '@' || n::text FROM
(SELECT generate_series(1,1000000) AS n) AS numbers;
ANALYSE email_upd_test;

\timing
UPDATE email_upd_test SET domainname=substring(email from position('@'
in email));
UPDATE 1000000
Time: 35056.125 ms

That 35 seconds is on a simple single-disk IDE disk. No particular
tuning done on that box.

--
   Richard Huxton
   Archonet Ltd

Re: update query taking too long

От
Chris
Дата:
Richard Huxton wrote:
> Chris wrote:
>> Tom Lane wrote:
>>> Any foreign keys leading to or from that table?
>>
>> Nope :(
>>
>>> 3.5 million row updates are not exactly gonna be instantaneous anyway,
>>> but only FK checks or really slow user-written triggers would make it
>>> take upwards of an hour ...
>>
>> No triggers, functions.
>
> Of course you really want a trigger on this, since presumably domainname
> should always be kept in sync with emailaddress. But that's not the
> immediate issue.
>
>> Table is pretty basic.
>>
>> I have a few indexes (one on the primary key, one on emailaddress etc)
>> but the 'domainname' column is a new one not referenced by any of the
>> indexes.
>>
>> FWIW (while the other update is still going in another window):
>
> What's saturated? Is the system I/O limited or CPU limited? You *should*
> be limited by the write speed of your disk with something simple like this.
>
> What happens if you do the following?

db=# CREATE TABLE email_upd_test (id SERIAL, email text, domainname
text, PRIMARY KEY (id));
NOTICE:  CREATE TABLE will create implicit sequence
"email_upd_test_id_seq" for serial column "email_upd_test.id"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
"email_upd_test_pkey" for table "email_upd_test"
CREATE TABLE
Time: 276.500 ms
db=# INSERT INTO email_upd_test (email) SELECT n::text || '@' || n::text
FROM (SELECT generate_series(1,1000000) AS n) AS numbers;
INSERT 0 1000000
Time: 14104.663 ms
db=# ANALYSE email_upd_test;
ANALYZE
Time: 121.775 ms
db=# UPDATE email_upd_test SET domainname=substring(email from
position('@' in email));
UPDATE 1000000
Time: 43796.030 ms


I think I'm I/O bound from my very limited understanding of vmstat.

--
Postgresql & php tutorials
http://www.designmagick.com/

Re: update query taking too long

From
Richard Huxton
Date:
Chris wrote:
> db=# UPDATE email_upd_test SET domainname=substring(email from
> position('@' in email));
> UPDATE 1000000
> Time: 43796.030 ms
>
> I think I'm I/O bound from my very limited understanding of vmstat.

Well, 43 seconds to update 1 million rows suggests your real query
should be complete in a few minutes, even if your real table has more
columns.

Could you check again and just make sure you don't have a foreign key
referencing this table? I suspect a large table without an index on the
referencing column.

If you can't see anything, cancel the long-running query, run VACUUM
FULL VERBOSE on the table, ANALYSE VERBOSE and then try it again.
There's something very odd here.
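One way to double-check for foreign keys referencing the table is to query the system catalogs directly; a sketch ('mytable' is a placeholder for the real table name):

```sql
-- Sketch: lists foreign-key constraints whose referenced table is 'mytable'
SELECT conname,
       conrelid::regclass AS referencing_table
FROM pg_constraint
WHERE contype = 'f'
  AND confrelid = 'mytable'::regclass;
```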

--
   Richard Huxton
   Archonet Ltd

Re: update query taking too long

From
Chris
Date:
Richard Huxton wrote:
> Chris wrote:
>> db=# UPDATE email_upd_test SET domainname=substring(email from
>> position('@' in email));
>> UPDATE 1000000
>> Time: 43796.030 ms
>>
>> I think I'm I/O bound from my very limited understanding of vmstat.
>
> Well, 43 seconds to update 1 million rows suggests your real query
> should be complete in a few minutes, even if your real table has more
> columns.

Yep.

I think I have solved it though - the server was checkpointing so often
that not much else was going on.

I didn't have logging set up before but it's up and running now and I
was getting

LOG:  checkpoints are occurring too frequently (26 seconds apart)
HINT:  Consider increasing the configuration parameter
"checkpoint_segments".

So I increased that from 10 to 30 and it finished:

UPDATE 3500101
Time: 146513.349 ms
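(The change amounts to one line in postgresql.conf, followed by a server restart; values as described above:)

```
# postgresql.conf - the change described above
checkpoint_segments = 30    # was 10; each WAL segment is 16 MB
```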

Thanks for all the help :)

--
Postgresql & php tutorials
http://www.designmagick.com/

Re: update query taking too long

From
Jean-David Beyer
Date:
Chris wrote (in part):

> I didn't have logging set up before but it's up and running now and I
> was getting
>
> LOG:  checkpoints are occurring too frequently (26 seconds apart)
> HINT:  Consider increasing the configuration parameter
> "checkpoint_segments".
>
> So I increased that from 10 to 30 and it finished:
>
> UPDATE 3500101
> Time: 146513.349 ms
>
I have not used PostgreSQL since I tried it once in about 1998 (when I found
it unsatisfactory, but much has changed since then), but I am going to try
it again. What would be a good checkpointing interval? I would guess 26
seconds is too often. What considerations go into picking a checkpointing
interval?

I note, from the book "PostgreSQL" second edition by Douglas and Douglas,
the following parameters are available:

WAL_BUFFERS         The default is 8.
CHECKPOINT_SEGMENTS The default is 3. This would have been too low for the
                    O.P. Would it make sense to start with a higher value
                    or is this a good value and just not appropriate for
                    the O.P.? Should CHECKPOINT_SEGMENTS be raised until
                    the checkpointing is about half CHECKPOINT_TIMEOUT,
                    e.g., 150 seconds while the dbms is running typical
                    work?
CHECKPOINT_TIMEOUT  The default is 300 seconds.
CHECKPOINT_WARNING  The default is  30 seconds.

My machine has 8 GBytes RAM and it worked perfectly well (very very little
paging) when it had 4 GBytes RAM. I doubled it because it was cheap at the
time and I was afraid it would become unavailable later. It is usually
between 2/3 and 3/4 used by the cache. When I run IBM DB2 on it, the choke
point is the IO time spent writing the logfiles.


--
  .~.  Jean-David Beyer          Registered Linux User 85642.
  /V\  PGP-Key: 9A2FC99A         Registered Machine   241939.
 /( )\ Shrewsbury, New Jersey    http://counter.li.org
 ^^-^^ 07:20:01 up 7 days, 14:55, 3 users, load average: 4.26, 4.15, 4.07

Re: update query taking too long

From
Heikki Linnakangas
Date:
Jean-David Beyer wrote:
> I have not used PostgreSQL since I tried it once in about 1998 (when I found
> it unsatisfactory, but much has changed since then), but I am going to try
> it again. What would be a good checkpointing interval? I would guess 26
> seconds is too often. What considerations go into picking a checkpointing
> interval?

Welcome back.

Longer is better when the system is running. But on recovery, longer
checkpoint interval means a longer wait until the database is up again.
Longer checkpoint interval also means that more WAL needs to be kept
around, but that's not usually a concern on normal server hardware with
plenty of disk space.

> WAL_BUFFERS         The default is 8.

Increasing this can increase the performance of bulk load operations but
it doesn't make much difference otherwise.

> CHECKPOINT_SEGMENTS The default is 3. This would have been too low for the
>                     O.P. Would it make sense to start with a higher value
>                     or is this a good value and just not appropriate for
>                     the O.P.? Should CHECKPOINT_SEGMENTS be raised until
>                     the checkpointing is about half CHECKPOINT_TIMEOUT,
>                     e.g., 150 seconds while the dbms is running typical
>                     work?
> CHECKPOINT_TIMEOUT  The default is 300 seconds.

You have to decide if you want to use checkpoint_timeout or
checkpoint_segments as the primary means of controlling your checkpoint
interval. checkpoint_timeout is easier to understand and tune, so I
would suggest using that. Depending on how long recovery times you can
live with, set it to something like 15 minutes - 60 minutes. Then set
checkpoint_segments to a high value; its purpose in this scheme is
basically just to protect you from running out of disk space on the
filesystem the WAL is located in.
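As a sketch of that scheme (example values, not recommendations; peak WAL disk use is roughly (2 × checkpoint_segments + 1) × 16 MB):

```
# postgresql.conf - timeout-driven checkpoints (example values)
checkpoint_timeout  = 30min   # primary knob: trade recovery time for smoother I/O
checkpoint_segments = 64      # safety cap only; worst-case WAL ~ (2*64+1) * 16 MB ~ 2 GB
```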

Note that unlike on DB2, the size of your transactions isn't limited by
the amount of transaction log you keep around; this is all about
performance.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: update query taking too long

From
Richard Huxton
Date:
Jean-David Beyer wrote:
> Chris wrote (in part):
>
>> I didn't have logging set up before but it's up and running now and I
>> was getting
>>
>> LOG:  checkpoints are occurring too frequently (26 seconds apart)
>> HINT:  Consider increasing the configuration parameter
>> "checkpoint_segments".
>>
>> So I increased that from 10 to 30 and it finished:
>>
>> UPDATE 3500101
>> Time: 146513.349 ms
>>
> I have not used PostgreSQL since I tried it once in about 1998 (when I found
> it unsatisfactory, but much has changed since then), but I am going to try
> it again. What would be a good checkpointing interval? I would guess 26
> seconds is too often. What considerations go into picking a checkpointing
> interval?

Basically, it depends on the amount of updates you have and whether you
want to minimise total writes or keep the load even. Lots of
checkpointing means you'll do more writing, but in smaller chunks. The
only way to find out the right value for you is to test on a realistic
setup, I'm afraid.

>
> I note, from the book "PostgreSQL" second edition by Douglas and Douglas,
> the following parameters are available:
>
> WAL_BUFFERS         The default is 8.
> CHECKPOINT_SEGMENTS The default is 3. This would have been too low for the
>                     O.P. Would it make sense to start with a higher value
>                     or is this a good value and just not appropriate for
>                     the O.P.? Should CHECKPOINT_SEGMENTS be raised until
>                     the checkpointing is about half CHECKPOINT_TIMEOUT,
>                     e.g., 150 seconds while the dbms is running typical
>                     work?
> CHECKPOINT_TIMEOUT  The default is 300 seconds.
> CHECKPOINT_WARNING  The default is  30 seconds.

If your updates are large (rather than having lots of small ones) then
increasing wal_buffers might be useful.

If you have a lot of updates, you'll want to increase
checkpoint_segments at least. You'll see mention in the logs when PG
thinks checkpoints are too close together (checkpoint_timeout/warning).

Of course, a lot of people will have PostgreSQL installed on a PC or
laptop along with the rest of the Linux distro. They'll not want to
allocate too many resources.

> My machine has 8 GBytes RAM and it worked perfectly well (very very little
> paging) when it had 4 GBytes RAM. I doubled it because it was cheap at the
> time and I was afraid it would become unavailable later. It is usually
> between 2/3 and 3/4 used by the cache. When I run IBM DB2 on it, the choke
> point is the IO time spent writing the logfiles.

If DB2 was I/O saturated with its transaction log, I'd be surprised if
PG isn't too.

--
   Richard Huxton
   Archonet Ltd