Discussion: UPDATE many records


UPDATE many records

From
Israel Brewster
Date:
Thanks to a change in historical data, I have a need to update a large number of records (around 50 million). The update itself is straightforward, as I can just issue an "UPDATE table_name SET changed_field=new_value();" (yes, new_value is the result of a stored procedure, if that makes a difference) command via psql, and it should work. However, due to the large number of records this command will obviously take a while, and if anything goes wrong during the update (one bad value in row 45 million, lost connection, etc.), all the work that has been done already will be lost due to the transactional nature of such commands (unless I am missing something).

Given that each row update is completely independent of any other row, I have the following questions:

1) Is there any way to set the command such that each row change is committed as it is calculated?
2) Is there some way to run this command in parallel in order to better utilize multiple processor cores, other than manually breaking the data into chunks and running a separate psql/update process for each chunk? Honestly, manual parallelizing wouldn’t be too bad (there are a number of logical segregations I can apply), I’m just wondering if there is a more automatic option.
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

Re: UPDATE many records

From
Justin
Date:
There are several ways to do this.

If you have PostgreSQL 11 or higher, we now have CREATE PROCEDURE, which allows committing transactions inside the procedure; one drawback is that it cannot parallelize from inside the procedure.
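For example, a minimal sketch of such a procedure (untested; it assumes an integer id column, and table_name / changed_field / new_value() are just the names from your example):

CREATE OR REPLACE PROCEDURE batch_update(batch_size int DEFAULT 10000)
LANGUAGE plpgsql
AS $$
DECLARE
    max_id   int;
    start_id int := 0;
BEGIN
    SELECT max(id) INTO max_id FROM table_name;
    WHILE start_id <= max_id LOOP
        UPDATE table_name
           SET changed_field = new_value()
         WHERE id >= start_id AND id < start_id + batch_size;
        COMMIT;   -- work committed so far survives even if a later batch fails
        start_id := start_id + batch_size;
    END LOOP;
END;
$$;

CALL batch_update();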

If it's an older version, then a Python script (or other scripting language) that iterates over the data in batches of, say, 10 to 20K rows will do what you want:

for i in (list of batch start IDs):
    BEGIN;
    UPDATE table_name SET changed_field = new_value() WHERE int4range(i, i + 10000) @> id;
    COMMIT;

To create parallel processes, a simple Python script (or other scripting language) can be used to open many connections working the data in parallel, but given the simple update it will NOT help performance: this will be hard disk I/O bound, not CPU bound, which is where parallelization helps.



On Mon, Jan 6, 2020 at 1:36 PM Israel Brewster <ijbrewster@alaska.edu> wrote:
Thanks to a change in historical data, I have a need to update a large number of records (around 50 million). The update itself is straight forward, as I can just issue an "UPDATE table_name SET changed_field=new_value();" (yes, new_value is the result of a stored procedure, if that makes a difference) command via psql, and it should work. However, due to the large number of records this command will obviously take a while, and if anything goes wrong during the update (one bad value in row 45 million, lost connection, etc), all the work that has been done already will be lost due to the transactional nature of such commands (unless I am missing something).

Given that each row update is completely independent of any other row, I have the following questions:

1) Is there any way to set the command such that each row change is committed as it is calculated?
2) Is there some way to run this command in parallel in order to better utilize multiple processor cores, other than manually breaking the data into chunks and running a separate psql/update process for each chunk? Honestly, manual parallelizing wouldn’t be too bad (there are a number of logical segregations I can apply), I’m just wondering if there is a more automatic option.
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

Re: UPDATE many records

From
Adrian Klaver
Date:
On 1/6/20 10:36 AM, Israel Brewster wrote:
> Thanks to a change in historical data, I have a need to update a large 
> number of records (around 50 million). The update itself is straight 
> forward, as I can just issue an "UPDATE table_name SET 
> changed_field=new_value();" (yes, new_value is the result of a stored 
> procedure, if that makes a difference) command via psql, and it should 
> work. However, due to the large number of records this command will 
> obviously take a while, and if anything goes wrong during the update 
> (one bad value in row 45 million, lost connection, etc), all the work 
> that has been done already will be lost due to the transactional nature 
> of such commands (unless I am missing something).
> 
> Given that each row update is completely independent of any other row, I 
> have the following questions:
> 
> 1) Is there any way to set the command such that each row change is 
> committed as it is calculated?

Pretty sure:
     UPDATE table_name SET changed_field=new_value();
is seen as a single statement and is all or none.

If you want to go row by row, you will need to run the statement on a 
row-by-row basis, or maybe in batches.

> 2) Is there some way to run this command in parallel in order to better 
> utilize multiple processor cores, other than manually breaking the data 
> into chunks and running a separate psql/update process for each chunk? 
> Honestly, manual parallelizing wouldn’t be too bad (there are a number 
> of logical segregations I can apply), I’m just wondering if there is a 
> more automatic option.

This is a good time to ask: what Postgres version?

I am still working out the recent parallel query system additions. Not 
sure if they apply to UPDATE or not.

> ---
> Israel Brewster
> Software Engineer
> Alaska Volcano Observatory
> Geophysical Institute - UAF
> 2156 Koyukuk Drive
> Fairbanks AK 99775-7320
> Work: 907-474-5172
> cell:  907-328-9145
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: UPDATE many records

From
Christopher Browne
Date:
On Mon, 6 Jan 2020 at 13:36, Israel Brewster <ijbrewster@alaska.edu> wrote:
Thanks to a change in historical data, I have a need to update a large number of records (around 50 million). The update itself is straight forward, as I can just issue an "UPDATE table_name SET changed_field=new_value();" (yes, new_value is the result of a stored procedure, if that makes a difference) command via psql, and it should work. However, due to the large number of records this command will obviously take a while, and if anything goes wrong during the update (one bad value in row 45 million, lost connection, etc), all the work that has been done already will be lost due to the transactional nature of such commands (unless I am missing something).

Given that each row update is completely independent of any other row, I have the following questions:

1) Is there any way to set the command such that each row change is committed as it is calculated?
2) Is there some way to run this command in parallel in order to better utilize multiple processor cores, other than manually breaking the data into chunks and running a separate psql/update process for each chunk? Honestly, manual parallelizing wouldn’t be too bad (there are a number of logical segregations I can apply), I’m just wondering if there is a more automatic option.

Yeah, I'd be inclined to do this in batches.

If, for instance, the table has a nice primary key, then I'd capture the primary keys into a side table, and grab tuples from the side table to process in more bite-sized batches, say, of a few thousand tuples per batch.

create table just_keys as select pk_column from big_historical_table;
alter table just_keys add column processed boolean;
create index jkpk on just_keys(pk_column) where (processed is null);
then loop repeatedly along the lines...

create temp table iteration as select pk_column from just_keys where processed is null limit 1000;
[do update on big_historical_table where pk_column in (select pk_column from iteration)]
update iteration set processed='true' where pk_column in (select pk_column from iteration);
drop table iteration;

Parallelization is absolutely an interesting idea; if you want to use 8 processes, then use a cycling sequence on the side table to spread tuples across the 8 processes, so that they can grab their own tuples and not block one another.

In that case, more like...
create temp sequence seq_procs start with 1 maxval 8 cycle;
create temp table just_keys as select pk_column, false::boolean as processed, nextval('seq_procs') as batch_id from big_historical_table;

The individual iterations then look for values in just_keys corresponding to their assigned batch number.
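For example, a rough sketch of what one worker's batch might look like (assuming the batch_id column from the example above; worker 3 shown), followed by flagging those keys as processed, as in the loop earlier:

UPDATE big_historical_table
   SET changed_field = new_value()
 WHERE pk_column IN (SELECT pk_column
                       FROM just_keys
                      WHERE batch_id = 3
                        AND NOT processed
                      LIMIT 1000);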

--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"

Re: UPDATE many records

From
Israel Brewster
Date:
Good information. I did forget to mention that I am using PostgreSQL 11.5. I also was not aware of the distinction between PROCEDURE and FUNCTION, so I guess I used the wrong terminology there when stating that new_value is the result of a stored procedure. It’s actually a function.

So would your suggestion then be to create a procedure that loops through the records, calculating and committing each one (or, as in your older Postgres example, batches of 10k to 20k)?

Good point on the HD I/O bound vs processor bound, but wouldn’t that depend on how complicated the actual update is? Still, there is a good chance you are correct in that statement, so that aspect is probably not worth spending too much time on.
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

On Jan 6, 2020, at 10:05 AM, Justin <zzzzz.graf@gmail.com> wrote:

There are several ways to do this.

If you have PostgreSQL 11 or higher, we now have CREATE PROCEDURE, which allows committing transactions inside the procedure; one drawback is that it cannot parallelize from inside the procedure.

If it's an older version, then a Python script (or other scripting language) that iterates over the data in batches of, say, 10 to 20K rows will do what you want:

for i in (list of batch start IDs):
    BEGIN;
    UPDATE table_name SET changed_field = new_value() WHERE int4range(i, i + 10000) @> id;
    COMMIT;

To create parallel processes, a simple Python script (or other scripting language) can be used to open many connections working the data in parallel, but given the simple update it will NOT help performance: this will be hard disk I/O bound, not CPU bound, which is where parallelization helps.



On Mon, Jan 6, 2020 at 1:36 PM Israel Brewster <ijbrewster@alaska.edu> wrote:
Thanks to a change in historical data, I have a need to update a large number of records (around 50 million). The update itself is straight forward, as I can just issue an "UPDATE table_name SET changed_field=new_value();" (yes, new_value is the result of a stored procedure, if that makes a difference) command via psql, and it should work. However, due to the large number of records this command will obviously take a while, and if anything goes wrong during the update (one bad value in row 45 million, lost connection, etc), all the work that has been done already will be lost due to the transactional nature of such commands (unless I am missing something).

Given that each row update is completely independent of any other row, I have the following questions:

1) Is there any way to set the command such that each row change is committed as it is calculated?
2) Is there some way to run this command in parallel in order to better utilize multiple processor cores, other than manually breaking the data into chunks and running a separate psql/update process for each chunk? Honestly, manual parallelizing wouldn’t be too bad (there are a number of logical segregations I can apply), I’m just wondering if there is a more automatic option.
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145


Re: UPDATE many records

From
Israel Brewster
Date:
On Jan 6, 2020, at 10:08 AM, Christopher Browne <cbbrowne@gmail.com> wrote:

On Mon, 6 Jan 2020 at 13:36, Israel Brewster <ijbrewster@alaska.edu> wrote:
Thanks to a change in historical data, I have a need to update a large number of records (around 50 million). The update itself is straight forward, as I can just issue an "UPDATE table_name SET changed_field=new_value();" (yes, new_value is the result of a stored procedure, if that makes a difference) command via psql, and it should work. However, due to the large number of records this command will obviously take a while, and if anything goes wrong during the update (one bad value in row 45 million, lost connection, etc), all the work that has been done already will be lost due to the transactional nature of such commands (unless I am missing something).

Given that each row update is completely independent of any other row, I have the following questions:

1) Is there any way to set the command such that each row change is committed as it is calculated?
2) Is there some way to run this command in parallel in order to better utilize multiple processor cores, other than manually breaking the data into chunks and running a separate psql/update process for each chunk? Honestly, manual parallelizing wouldn’t be too bad (there are a number of logical segregations I can apply), I’m just wondering if there is a more automatic option.

Yeah, I'd be inclined to do this in batches.

If, for instance, the table has a nice primary key, then I'd capture the primary keys into a side table, and grab tuples from the side table to process in more bite-sized batches, say, of a few thousand tuples per batch.

create table just_keys as select pk_column from big_historical_table;
alter table just_keys add column processed boolean;
create index jkpk on just_keys(pk_column) where (processed is null);
then loop repeatedly along the lines...

create temp table iteration as select pk_column from just_keys where processed is null limit 1000;
[do update on big_historical_table where pk_column in (select pk_column from iteration)]
update iteration set processed='true' where pk_column in (select pk_column from iteration);
drop table iteration;

Parallelization is absolutely an interesting idea; if you want to use 8 processes, then use a cycling sequence on the side table to spread tuples across the 8 processes, so that they can grab their own tuples and not block one another.

In that case, more like...
create temp sequence seq_procs start with 1 maxval 8 cycle;
create temp table just_keys as select pk_column, false::boolean as processed, nextval('seq_procs') as batch_id from big_historical_table;

The individual iterations then look for values in just_keys corresponding to their assigned batch number.

Sounds like a reasonable approach. As Justin pointed out, it is actually likely that the process will be IO bound rather than CPU bound, so my parallel idea may not have much merit after all, but the batching procedure makes sense. I assume you meant update just_keys in your sample rather than update iteration on that line just before drop table iteration. Thanks for the info!

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145


--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"

Re: UPDATE many records

From
Justin
Date:
As you have access to PROCEDURE, you can create a loop and then issue an UPDATE followed by a COMMIT.
So something like this should work in plpgsql (shown here as a DO block; the same body works inside a procedure):

do $$
declare
  icount    int := 0;
  new_count int := 0;
begin
  select max(id) into icount from mytable;   -- upper bound of the id range to walk

  loop
    update mytable
       set myvalue = newvalue()
     where id between new_count and new_count + 9999;
    commit;                                  -- each 10K batch is committed on its own
    new_count := new_count + 10000;
    exit when new_count > icount;
  end loop;
end;
$$;


I am going to put a caveat on this: if the newvalue() function is complex and takes a lot of CPU cycles to do its thing, then parallelism would help. Unless, that is, the function looks at the table being updated; that can really complicate things, as the parallel workers would be looking at stale records, which could be bad...






On Mon, Jan 6, 2020 at 3:07 PM Israel Brewster <ijbrewster@alaska.edu> wrote:
Good information. I did forget to mention that I am using PostgreSQL 11.5. I also was not aware of the distinction between PROCEDURE and FUNCTION, so I guess I used the wrong terminology there when stating that new_value is the result of a stored procedure. It’s actually a function.

So would your suggestion then be to create a procedure that loops through the records, calculating and committing each one (or, as in your older Postgres example, batches of 10k to 20k)?

Good point on the HD I/O bound vs processor bound, but wouldn’t that depend on how complicated the actual update is? Still, there is a good chance you are correct in that statement, so that aspect is probably not worth spending too much time on.
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

On Jan 6, 2020, at 10:05 AM, Justin <zzzzz.graf@gmail.com> wrote:

There are several ways to do this.

If you have PostgreSQL 11 or higher, we now have CREATE PROCEDURE, which allows committing transactions inside the procedure; one drawback is that it cannot parallelize from inside the procedure.

If it's an older version, then a Python script (or other scripting language) that iterates over the data in batches of, say, 10 to 20K rows will do what you want:

for i in (list of batch start IDs):
    BEGIN;
    UPDATE table_name SET changed_field = new_value() WHERE int4range(i, i + 10000) @> id;
    COMMIT;

To create parallel processes, a simple Python script (or other scripting language) can be used to open many connections working the data in parallel, but given the simple update it will NOT help performance: this will be hard disk I/O bound, not CPU bound, which is where parallelization helps.



On Mon, Jan 6, 2020 at 1:36 PM Israel Brewster <ijbrewster@alaska.edu> wrote:
Thanks to a change in historical data, I have a need to update a large number of records (around 50 million). The update itself is straight forward, as I can just issue an "UPDATE table_name SET changed_field=new_value();" (yes, new_value is the result of a stored procedure, if that makes a difference) command via psql, and it should work. However, due to the large number of records this command will obviously take a while, and if anything goes wrong during the update (one bad value in row 45 million, lost connection, etc), all the work that has been done already will be lost due to the transactional nature of such commands (unless I am missing something).

Given that each row update is completely independent of any other row, I have the following questions:

1) Is there any way to set the command such that each row change is committed as it is calculated?
2) Is there some way to run this command in parallel in order to better utilize multiple processor cores, other than manually breaking the data into chunks and running a separate psql/update process for each chunk? Honestly, manual parallelizing wouldn’t be too bad (there are a number of logical segregations I can apply), I’m just wondering if there is a more automatic option.
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145


Re: UPDATE many records

From
Alban Hertroys
Date:
> On 6 Jan 2020, at 21:15, Israel Brewster <ijbrewster@alaska.edu> wrote:
>
>> On Jan 6, 2020, at 10:08 AM, Christopher Browne <cbbrowne@gmail.com> wrote:
>>
>> On Mon, 6 Jan 2020 at 13:36, Israel Brewster <ijbrewster@alaska.edu> wrote:
>> Thanks to a change in historical data, I have a need to update a large number of records (around 50 million). The update itself is straightforward, as I can just issue an "UPDATE table_name SET changed_field=new_value();" (yes, new_value is the result of a stored procedure, if that makes a difference) command via psql, and it should work. However, due to the large number of records this command will obviously take a while, and if anything goes wrong during the update (one bad value in row 45 million, lost connection, etc.), all the work that has been done already will be lost due to the transactional nature of such commands (unless I am missing something).
>>
>> Given that each row update is completely independent of any other row, I have the following questions:
>>
>> 1) Is there any way to set the command such that each row change is committed as it is calculated?
>> 2) Is there some way to run this command in parallel in order to better utilize multiple processor cores, other than manually breaking the data into chunks and running a separate psql/update process for each chunk? Honestly, manual parallelizing wouldn’t be too bad (there are a number of logical segregations I can apply), I’m just wondering if there is a more automatic option.
>>
>> Yeah, I'd be inclined to do this in batches.

I think you’re overcomplicating the matter.

I’d just do it as a single update in one transaction. It’s only 50M rows. It may take half an hour or so on decent hardware, depending on how resource-intensive your function is.

If that fails[1], only then would I start looking into batching things. But then you still need to figure out why it fails and what to do about that; if it fails it will probably fail fast, and if not, then you’re looking at a one-off situation that won’t require more than a few workarounds - after which you can just run the update again.

Ad 1). No harm has been done, it’s a single transaction that rolled back.

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.




Re: UPDATE many records

From
Christopher Browne
Date:


On Mon, Jan 6, 2020, 3:15 PM Israel Brewster <ijbrewster@alaska.edu> wrote:
On Jan 6, 2020, at 10:08 AM, Christopher Browne <cbbrowne@gmail.com> wrote:

On Mon, 6 Jan 2020 at 13:36, Israel Brewster <ijbrewster@alaska.edu> wrote:
Thanks to a change in historical data, I have a need to update a large number of records (around 50 million). The update itself is straight forward, as I can just issue an "UPDATE table_name SET changed_field=new_value();" (yes, new_value is the result of a stored procedure, if that makes a difference) command via psql, and it should work. However, due to the large number of records this command will obviously take a while, and if anything goes wrong during the update (one bad value in row 45 million, lost connection, etc), all the work that has been done already will be lost due to the transactional nature of such commands (unless I am missing something).

Given that each row update is completely independent of any other row, I have the following questions:

1) Is there any way to set the command such that each row change is committed as it is calculated?
2) Is there some way to run this command in parallel in order to better utilize multiple processor cores, other than manually breaking the data into chunks and running a separate psql/update process for each chunk? Honestly, manual parallelizing wouldn’t be too bad (there are a number of logical segregations I can apply), I’m just wondering if there is a more automatic option.

Yeah, I'd be inclined to do this in batches.

If, for instance, the table has a nice primary key, then I'd capture the primary keys into a side table, and grab tuples from the side table to process in more bite-sized batches, say, of a few thousand tuples per batch.

create table just_keys as select pk_column from big_historical_table;
alter table just_keys add column processed boolean;
create index jkpk on just_keys(pk_column) where (processed is null);
then loop repeatedly along the lines...

create temp table iteration as select pk_column from just_keys where processed is null limit 1000;
[do update on big_historical_table where pk_column in (select pk_column from iteration)]
update iteration set processed='true' where pk_column in (select pk_column from iteration);
drop table iteration;

Parallelization is absolutely an interesting idea; if you want to use 8 processes, then use a cycling sequence on the side table to spread tuples across the 8 processes, so that they can grab their own tuples and not block one another.

In that case, more like...
create temp sequence seq_procs start with 1 maxval 8 cycle;
create temp table just_keys as select pk_column, false::boolean as processed, nextval('seq_procs') as batch_id from big_historical_table;

The individual iterations then look for values in just_keys corresponding to their assigned batch number.

Sounds like a reasonable approach. As Justin pointed out, it is actually likely that the process will be IO bound rather than CPU bound, so my parallel idea may not have much merit after all, but the batching procedure makes sense. I assume you meant update just_keys in your sample rather than update iteration on that line just before drop table iteration. Thanks for the info

As for parallelism, if you have a really powerful disk subsystem (lots of disks in a disk array), it may help. Or not, as commented.

I didn't test my wee bit of code, so yep, I meant to update just_keys :-).

You won't find something terribly much more automatic.

Oh, yah, there's a possible further complication; does the application need to get stopped to do this update?  Is the newest version of the app still generating data that needs the rewriting?  Sure hope not...

Re: UPDATE many records

From
Rob Sargent
Date:


On Jan 6, 2020, at 1:29 PM, Alban Hertroys <haramrae@gmail.com> wrote:

I think you’re overcomplicating the matter.

I’d just do it as a single update in one transaction. It’s only 50M rows. It may take half an hour or so on decent hardware, depending on how resource-intensive your function is.

I must emphasize: this estimate is HIGHLY dependent on hardware and the complexity of the table (number of indices, etc.). (I suspect there’s a correlation between table size (business value) and number of indices.)

If that fails[1], only then would I start looking into batching things. But then you still need to figure out why it fails and what to do about that; if it fails it will probably fail fast, and if not, then you’re looking at a one-off situation that won’t require more than a few workarounds - after which you can just run the update again.

Ad 1). No harm has been done, it’s a single transaction that rolled back.

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.

Re: UPDATE many records

From
Israel Brewster
Date:
On Jan 6, 2020, at 11:38 AM, Christopher Browne <cbbrowne@gmail.com> wrote:



On Mon, Jan 6, 2020, 3:15 PM Israel Brewster <ijbrewster@alaska.edu> wrote:
On Jan 6, 2020, at 10:08 AM, Christopher Browne <cbbrowne@gmail.com> wrote:

On Mon, 6 Jan 2020 at 13:36, Israel Brewster <ijbrewster@alaska.edu> wrote:
Thanks to a change in historical data, I have a need to update a large number of records (around 50 million). The update itself is straight forward, as I can just issue an "UPDATE table_name SET changed_field=new_value();" (yes, new_value is the result of a stored procedure, if that makes a difference) command via psql, and it should work. However, due to the large number of records this command will obviously take a while, and if anything goes wrong during the update (one bad value in row 45 million, lost connection, etc), all the work that has been done already will be lost due to the transactional nature of such commands (unless I am missing something).

Given that each row update is completely independent of any other row, I have the following questions:

1) Is there any way to set the command such that each row change is committed as it is calculated?
2) Is there some way to run this command in parallel in order to better utilize multiple processor cores, other than manually breaking the data into chunks and running a separate psql/update process for each chunk? Honestly, manual parallelizing wouldn’t be too bad (there are a number of logical segregations I can apply), I’m just wondering if there is a more automatic option.

Yeah, I'd be inclined to do this in batches.

If, for instance, the table has a nice primary key, then I'd capture the primary keys into a side table, and grab tuples from the side table to process in more bite-sized batches, say, of a few thousand tuples per batch.

create table just_keys as select pk_column from big_historical_table;
alter table just_keys add column processed boolean;
create index jkpk on just_keys(pk_column) where (processed is null);
then loop repeatedly along the lines...

create temp table iteration as select pk_column from just_keys where processed is null limit 1000;
[do update on big_historical_table where pk_column in (select pk_column from iteration)]
update iteration set processed='true' where pk_column in (select pk_column from iteration);
drop table iteration;

Parallelization is absolutely an interesting idea; if you want to use 8 processes, then use a cycling sequence on the side table to spread tuples across the 8 processes, so that they can grab their own tuples and not block one another.

In that case, more like...
create temp sequence seq_procs start with 1 maxval 8 cycle;
create temp table just_keys as select pk_column, false::boolean as processed, nextval('seq_procs') as batch_id from big_historical_table;

The individual iterations then look for values in just_keys corresponding to their assigned batch number.

Sounds like a reasonable approach. As Justin pointed out, it is actually likely that the process will be IO bound rather than CPU bound, so my parallel idea may not have much merit after all, but the batching procedure makes sense. I assume you meant update just_keys in your sample rather than update iteration on that line just before drop table iteration. Thanks for the info

As for parallelism, if you have really powerful disk, lots of disks on disk array, it may help.  Or not, as commented.

I didn't test my wee bit of code, so yep, I meant to update just_keys :-).

You won't find something terribly much more automatic.

Oh, yah, there's a possible further complication; does the application need to get stopped to do this update?  Is the newest version of the app still generating data that needs the rewriting?  Sure hope not…

Yeah, a valid concern, but I should be ok on that front. Once I fix the calculation function, any new records will have the correct value. Plus, the actual update calculation is idempotent, so if a handful of new records end up getting re-calculated, that’s not an issue. Granted, the data will look weird while the re-calculation is in process (part new, part old), but we can live with that :-)

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

Re: UPDATE many records

From
Israel Brewster
Date:


On Jan 6, 2020, at 11:40 AM, Rob Sargent <robjsargent@gmail.com> wrote:



On Jan 6, 2020, at 1:29 PM, Alban Hertroys <haramrae@gmail.com> wrote:

I think you’re overcomplicating the matter.

I’d just do it as a single update in one transaction. It’s only 50M rows. It may take half an hour or so on decent hardware, depending on how resource-intensive your function is.

I must emphasize: This estimate is HIGHLY dependent on hardware and the complexity of the table (number of indices, etc).  (I suspect there’s a correlation between table size (business value) and number of indices)

I’m thinking it might be worth it to do a “quick” test on 1,000 or so records (or whatever number can run in a minute or so), watching the processor utilization as it runs. That should give me a better feel for where the bottlenecks may be, and how long the entire update process would take. I’m assuming, of course, that the total time would scale more or less linearly with the number of records.

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145


If that fails[1], only then would I start looking into batching things. But then you still need to figure out why it fails and what to do about that; if it fails it will probably fail fast, and if not, then you’re looking at a one-off situation that won’t require more than a few workarounds - after which you can just run the update again.

Ad 1). No harm has been done, it’s a single transaction that rolled back.

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.


Re: UPDATE many records

From
Michael Lewis
Date:
I’m thinking it might be worth it to do a “quick” test on 1,000 or so records (or whatever number can run in a minute or so), watching the processor utilization as it runs. That should give me a better feel for where the bottlenecks may be, and how long the entire update process would take. I’m assuming, of course, that the total time would scale more or less linearly with the number of records.

I think that depends on how you identify and limit the update to those 1000 records. If it is using a primary key with specific keys in an array, it is probably close to a linear increase, because the where clause isn't impactful to the overall execution time. If you write a sub-query that is slow, then you would need to exclude that from the time. You can always run EXPLAIN ANALYZE on the update and roll back rather than commit.
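For instance, a minimal sketch of that kind of dry run (the WHERE clause is just a placeholder for however you pick the test batch):

BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
    UPDATE table_name
       SET changed_field = new_value()
     WHERE id BETWEEN 1 AND 10000;
ROLLBACK;   -- timings are reported, but none of the changes are kept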

Re: UPDATE many records

From
Israel Brewster
Date:

On Jan 6, 2020, at 11:54 AM, Michael Lewis <mlewis@entrata.com> wrote:

I’m thinking it might be worth it to do a “quick” test on 1,000 or so records (or whatever number can run in a minute or so), watching the processor utilization as it runs. That should give me a better feel for where the bottlenecks may be, and how long the entire update process would take. I’m assuming, of course, that the total time would scale more or less linearly with the number of records.

I think that depends on how your identify and limit the update to those 1000 records. If it is using a primary key with specific keys in an array, probably close to linear increase because the where clause isn't impactful to the overall execution time. If you write a sub-query that is slow, then you would need to exclude that from the time. You can always run explain analyze on the update and rollback rather than commit.
So a test run on 9,299 records took about 7 seconds to complete (EXPLAIN ANALYZE output at https://explain.depesz.com/s/lIYn if it matters), during which time I did see a postmaster process consuming 100% CPU. Upping the test to 20,819 records took about 16.5 seconds, so that looks relatively linear to me. Also, CPU bound. So by my calculations, doing all 50M records would take around 10 hours.

One potentially significant note: most of the execution time is spent in a trigger. This trigger is actually what’s doing the REAL update that I need to happen. If it would make a difference, I could easily pull the trigger code out to a separate function that I just call directly (with triggers temporarily disabled). My thinking is that calling a function is calling a function, and the fact that it is currently called via a trigger rather than direct is of little consequence, but I’m willing to be corrected on that :-)

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

RE: UPDATE many records

From
Mark Zellers
Date:

Just out of curiosity, what kind of trigger are you using, a row level trigger or a statement level trigger?  If you are using a row level trigger, see if you can achieve your requirements using a statement level trigger instead.  I’m relatively new to Postgres, so there could be some limit that I’m not aware of, but my understanding is that you have access to the old and new values of the updated rows in the after statement trigger.  It would likely be much more performant to do your operation once after the statement is done rather than firing a trigger on every changed row.
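For what it's worth, a minimal sketch of that idea using transition tables (available since PostgreSQL 10; every table, column, and calculation below is a placeholder, since I don't know what your real trigger does):

CREATE FUNCTION recompute_after_update()
RETURNS trigger
LANGUAGE plpgsql
AS $$
BEGIN
    -- one set-based pass over every row changed by the statement
    UPDATE summary_table s
       SET derived_value = n.changed_field * 2   -- placeholder calculation
      FROM new_rows n
     WHERE s.source_id = n.id;
    RETURN NULL;   -- return value is ignored for AFTER statement-level triggers
END;
$$;

CREATE TRIGGER recompute_stmt
    AFTER UPDATE ON table_name
    REFERENCING NEW TABLE AS new_rows
    FOR EACH STATEMENT
    EXECUTE FUNCTION recompute_after_update();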

 

Regards,

 

Mark Z.

 

 

From: Israel Brewster <ijbrewster@alaska.edu>
Sent: Monday, January 6, 2020 1:24 PM
To: Michael Lewis <mlewis@entrata.com>
Cc: Rob Sargent <robjsargent@gmail.com>; Alban Hertroys <haramrae@gmail.com>; Christopher Browne <cbbrowne@gmail.com>; pgsql-generallists.postgresql.org <pgsql-general@lists.postgresql.org>
Subject: Re: UPDATE many records

 

 

On Jan 6, 2020, at 11:54 AM, Michael Lewis <mlewis@entrata.com> wrote:

 

I’m thinking it might be worth it to do a “quick” test on 1,000 or so records (or whatever number can run in a minute or so), watching the processor utilization as it runs. That should give me a better feel for where the bottlenecks may be, and how long the entire update process would take. I’m assuming, of course, that the total time would scale more or less linearly with the number of records.

 

I think that depends on how your identify and limit the update to those 1000 records. If it is using a primary key with specific keys in an array, probably close to linear increase because the where clause isn't impactful to the overall execution time. If you write a sub-query that is slow, then you would need to exclude that from the time. You can always run explain analyze on the update and rollback rather than commit.

So a test run on 9,299 records took about 7 seconds to complete (EXPLAIN ANALYZE output at https://explain.depesz.com/s/lIYn if it matters), during which time I did see a postmaster process consuming 100% CPU. Upping the test to 20,819 records took about 16.5 seconds, so that looks relatively linear to me. Also, CPU bound. So by my calculations, doing all 50M records would take around 10 hours.

 

One potentially significant note: most of the execution time is spent in a trigger. This trigger is actually what’s doing the REAL update that I need to happen. If it would make a difference, I could easily pull the trigger code out to a separate function that I just call directly (with triggers temporarily disabled). My thinking is that calling a function is calling a function, and the fact that it is currently called via a trigger rather than direct is of little consequence, but I’m willing to be corrected on that :-)

 

---

Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320

Work: 907-474-5172
cell:  907-328-9145

 

Re: UPDATE many records

From
Michael Lewis
Date:
On Mon, Jan 6, 2020 at 2:34 PM Mark Zellers <markz@adaptiveinsights.com> wrote:

Just out of curiosity, what kind of trigger are you using, a row level trigger or a statement level trigger?  If you are using a row level trigger, see if you can achieve your requirements using a statement level trigger instead.  I’m relatively new to Postgres, so there could be some limit that I’m not aware of, but my understanding is that you have access to the old and new values of the updated rows in the after statement trigger.  It would likely be much more performant to do your operation once after the statement is done rather than firing a trigger on every changed row.


My experience/understanding is that statement level triggers can be a big performance boost, but only for changing *other* tables and not the table that the trigger is on since it is *AFTER* only and can't modify NEW record directly.

Re: UPDATE many records

From
Justin
Date:
What was the HD wait time? What tool is being used to monitor the server resources?

It appears, based on this information, that there is a lot more going on than a simple UPDATE command.

Moving code out of the trigger is probably not going to improve performance, unless there is a lot of code that does not need to be processed for this update, or code touching other tables.

Study the trigger, identify what has to run, pull that code out, then disable the trigger. Move the necessary code to a new function for updating.

On Mon, Jan 6, 2020 at 4:24 PM Israel Brewster <ijbrewster@alaska.edu> wrote:

On Jan 6, 2020, at 11:54 AM, Michael Lewis <mlewis@entrata.com> wrote:

I’m thinking it might be worth it to do a “quick” test on 1,000 or so records (or whatever number can run in a minute or so), watching the processor utilization as it runs. That should give me a better feel for where the bottlenecks may be, and how long the entire update process would take. I’m assuming, of course, that the total time would scale more or less linearly with the number of records.

I think that depends on how your identify and limit the update to those 1000 records. If it is using a primary key with specific keys in an array, probably close to linear increase because the where clause isn't impactful to the overall execution time. If you write a sub-query that is slow, then you would need to exclude that from the time. You can always run explain analyze on the update and rollback rather than commit.
So a test run on 9,299 records took about 7 seconds to complete (EXPLAIN ANALYZE output at https://explain.depesz.com/s/lIYn if it matters), during which time I did see a postmaster process consuming 100% CPU. Upping the test to 20,819 records took about 16.5 seconds, so that looks relatively linear to me. Also, CPU bound. So by my calculations, doing all 50M records would take around 10 hours.

One potentially significant note: most of the execution time is spent in a trigger. This trigger is actually what’s doing the REAL update that I need to happen. If it would make a difference, I could easily pull the trigger code out to a separate function that I just call directly (with triggers temporarily disabled). My thinking is that calling a function is calling a function, and the fact that it is currently called via a trigger rather than direct is of little consequence, but I’m willing to be corrected on that :-)

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

Re: UPDATE many records

From
Israel Brewster
Date:

On Jan 6, 2020, at 12:49 PM, Justin <zzzzz.graf@gmail.com> wrote:

What was the HD wait time ?  What tool is being use to monitor the server resources??

No idea on the HD wait time - how would I best monitor that? That said, this machine does have NVMe drives, so the speed should be fairly high/wait time fairly low. It’s also running as a VM, which could affect things, but is a bit of a moot point as far as this update goes. As far as monitoring server resources, I was just using top.

It appears based on this information there is allot more going on than a simple Update command

Depending on your definition of “simple update” of course, very true. As I stated in the original message, the actual update value is the result of a function. The psql command is a simple update, but the function does a bit of stuff (primarily trigonometry). According to the EXPLAIN ANALYZE, about .7 ms of stuff per record, which of course is most of the runtime. It is entirely possible that the function could be optimized to run more quickly.


Moving code out of the trigger  probably not  going to improve performance, unless there is allot of code  that does not need to be processed for this update or code touching other tables

One SELECT query on another table to get some values I need to use for the calculation. No code that is not needed for the update. Given the nature of this bulk update, I *could* make a separate function that simply takes those values as parameters, since the same value will be applied to a lot of records. I’d just have to be careful about how I applied the update, so rows get processed with the correct values. I’m not convinced it would be worth it though - might shave a few hours off the total execution time (assuming that SELECT is expensive - EXPLAIN ANALYZE shows an index scan, on a table with only 12,761 rows, which seems to be about as simple as it gets), but I doubt it would be enough for me to feel comfortable simply running the update as one monolithic unit.

Study the trigger identify what has to run, pull that code out, then disable the trigger.    Move the necessary code to a new function for Updating.. 

Sure. But I feel we are getting a bit off track. Optimizing the runtime of the update is great, but this is a one-off (hopefully) event. I want to accomplish it as quickly as possible, of course, but at the same time it doesn’t make sense to spend a lot of time optimizing every component of the query. The main purpose of the question was honestly for my sanity, to reduce the likelihood of having it run for several hours only to error out due to bad data or whatever and have to start over from the top. Running in parallel simply seemed to be a no-brainer option to make it go quicker, assuming CPU bound updating. Optimizations that are going to take work are probably not worth it. We can wait for the data to be updated.

Thanks again!

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145



On Mon, Jan 6, 2020 at 4:24 PM Israel Brewster <ijbrewster@alaska.edu> wrote:

On Jan 6, 2020, at 11:54 AM, Michael Lewis <mlewis@entrata.com> wrote:

I’m thinking it might be worth it to do a “quick” test on 1,000 or so records (or whatever number can run in a minute or so), watching the processor utilization as it runs. That should give me a better feel for where the bottlenecks may be, and how long the entire update process would take. I’m assuming, of course, that the total time would scale more or less linearly with the number of records.

I think that depends on how your identify and limit the update to those 1000 records. If it is using a primary key with specific keys in an array, probably close to linear increase because the where clause isn't impactful to the overall execution time. If you write a sub-query that is slow, then you would need to exclude that from the time. You can always run explain analyze on the update and rollback rather than commit.
So a test run on 9,299 records took about 7 seconds to complete (EXPLAIN ANALYZE output at https://explain.depesz.com/s/lIYn if it matters), during which time I did see a postmaster process consuming 100% CPU. Upping the test to 20,819 records took about 16.5 seconds, so that looks relatively linear to me. Also, CPU bound. So by my calculations, doing all 50M records would take around 10 hours.

One potentially significant note: most of the execution time is spent in a trigger. This trigger is actually what’s doing the REAL update that I need to happen. If it would make a difference, I could easily pull the trigger code out to a separate function that I just call directly (with triggers temporarily disabled). My thinking is that calling a function is calling a function, and the fact that it is currently called via a trigger rather than direct is of little consequence, but I’m willing to be corrected on that :-)

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145


Re: UPDATE many records

From
Christopher Browne
Date:


On Mon, 6 Jan 2020 at 17:38, Israel Brewster <ijbrewster@alaska.edu> wrote:

Sure. But I feel we are getting a bit off track. Optimizing the runtime of the update is great, but this is a one-off (hopefully) event. I want to accomplish it as quickly as possible, of course, but at the same time it doesn’t make sense to spend a lot of time optimizing every component of the query. The main purpose of the question was honestly for my sanity, to reduce the likelihood of having it run for several hours only to error out due to bad data or whatever and have to start over from the top. Running in parallel simply seemed to be a no-brainer option to make it go quicker, assuming CPU bound updating. Optimizations that are going to take work are probably not worth it. We can wait for the data to be updated.

It sounds like you're in a decent place on this, and that you have done a pretty apropos amount of exploration of the matter.

I was pleased to hear that you have the idempotency of the updates well in hand, and that the application can cope with the degree of out-of-sync that things will temporarily be.

The estimate of 10h to update the data doesn't surprise me; that's long enough that it sure seems tempting to do the work in pieces so that you don't have your whole set of application data locked for 10h.

I'd be inclined to call this "enough attention" for a one-off event.

I'll poke at the trigger aspect a wee bit; if the trigger function does a one-tuple-at-a-time handling of things, so that it fires 50M times, you might get a substantial speedup by replacing that with an equivalent set operation that processes a few thousand tuples at a time.  That said, if you're happy with the process running 10h, it's not worth unpeeling the extra testing needed to ensure identical end states.
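For illustration only, the shape of that would be roughly the following (the trigger name is a placeholder, and the SET expression stands in for a verified set-based equivalent of whatever the trigger computes per row):

ALTER TABLE big_historical_table DISABLE TRIGGER my_row_trigger;

-- one batch of a few thousand keys; repeat until just_keys is exhausted
UPDATE big_historical_table
   SET changed_field = new_value()   -- stand-in for the trigger's per-row work, done as a set
 WHERE pk_column IN (SELECT pk_column
                       FROM just_keys
                      WHERE processed IS NULL
                      LIMIT 5000);

ALTER TABLE big_historical_table ENABLE TRIGGER my_row_trigger;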
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"

Re: UPDATE many records

From
Justin
Date:
A system monitor I like is glances; you have to have Python installed to run it, and it can run in a command console or present a web interface. It's very nice for getting a quick, detailed view of what is going on.

Just monitoring the system CPU utilization does not give enough information to state whether the system is CPU bound, disk I/O bound, or network I/O bound.

On simple selects, assuming the data is in shared buffers, it's a trivial event; if the select has to go to disk to get the data, then it's not trivial, especially if it's a big table. One thing that can happen is that the update and the selects in the trigger push data in and out of shared buffers, causing the disk to thrash, which EXPLAIN does not report.

EXPLAIN does not peer into functions, triggers, etc. For "select myfunction(), column_list from mytable" or "update mytable set column = myfunction()", it will not report what happened inside, just the total time it took to run. Looking at the EXPLAIN output, the UPDATE itself only took 0.6 seconds while the trigger took 6.5:

rotate_tilt_data: time=6560.401 calls=9299

I'm a little confused: is this trigger function going to be around after this update? If yes, it would be worth looking into.

If it's a one-time run, or once every 12 months, who cares? Start the update on Friday night and go in on Saturday to check it.

On Mon, Jan 6, 2020 at 5:38 PM Israel Brewster <ijbrewster@alaska.edu> wrote:

On Jan 6, 2020, at 12:49 PM, Justin <zzzzz.graf@gmail.com> wrote:

What was the HD wait time ?  What tool is being use to monitor the server resources??

No idea on the HD wait time - how would I best monitor that? That said, this machine does have NVMe drives, so the speed should be fairly high/wait time fairly low. It’s also running as a VM, which could affect things, but is a bit of a moot point as far as this update goes. As far as monitoring server resources, I was just using top.

It appears based on this information there is allot more going on than a simple Update command

Depending on your definition of “simple update” of course, very true. As I stated in the original message, the actual update value is the result of a function. The psql command is a simple update, but the function does a bit of stuff (primarily trigonometry). According to the EXPLAIN ANALYZE, about .7 ms of stuff per record, which of course is most of the runtime. It is entirely possible that the function could be optimized to run more quickly.


Moving code out of the trigger  probably not  going to improve performance, unless there is allot of code  that does not need to be processed for this update or code touching other tables

One SELECT query on another table to get some values I need to use for the calculation. No code that is not needed for the update. Given the nature of this bulk update, I *could* make a separate function that simply takes those values as parameters, since the same value will be applied to a lot of records. I’d just have to be careful about how I applied the update, so rows get processed with the correct values. I’m not convinced it would be worth it though - might shave a few hours off the total execution time (assuming that SELECT is expensive - EXPLAIN ANLYZE shows an index scan, on a table with only 12,761 rows, which seems to be about as simple as it gets), but I doubt it would be enough for me to feel comfortable simply running the update as one monolithic unit.

Study the trigger identify what has to run, pull that code out, then disable the trigger.    Move the necessary code to a new function for Updating.. 

Sure. But I feel we are getting a bit off track. Optimizing the runtime of the update is great, but this is a one-off (hopefully) event. I want to accomplish it as quickly as possible, of course, but at the same time it doesn’t make sense to spend a lot of time optimizing every component of the query. The main purpose of the question was honestly for my sanity, to reduce the likelihood of having it run for several hours only to error out due to bad data or whatever and have to start over from the top. Running in parallel simply seemed to be a no-brainer option to make it go quicker, assuming CPU bound updating. Optimizations that are going to take work are probably not worth it. We can wait for the data to be updated.

Thanks again!

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145



On Mon, Jan 6, 2020 at 4:24 PM Israel Brewster <ijbrewster@alaska.edu> wrote:

On Jan 6, 2020, at 11:54 AM, Michael Lewis <mlewis@entrata.com> wrote:

I’m thinking it might be worth it to do a “quick” test on 1,000 or so records (or whatever number can run in a minute or so), watching the processor utilization as it runs. That should give me a better feel for where the bottlenecks may be, and how long the entire update process would take. I’m assuming, of course, that the total time would scale more or less linearly with the number of records.

I think that depends on how your identify and limit the update to those 1000 records. If it is using a primary key with specific keys in an array, probably close to linear increase because the where clause isn't impactful to the overall execution time. If you write a sub-query that is slow, then you would need to exclude that from the time. You can always run explain analyze on the update and rollback rather than commit.
So a test run on 9,299 records took about 7 seconds to complete (EXPLAIN ANALYZE output at https://explain.depesz.com/s/lIYn if it matters), during which time I did see a postmaster process consuming 100% CPU. Upping the test to 20,819 records took about 16.5 seconds, so that looks relatively linear to me. Also, CPU bound. So by my calculations, doing all 50M records would take around 10 hours.

One potentially significant note: most of the execution time is spent in a trigger. This trigger is actually what’s doing the REAL update that I need to happen. If it would make a difference, I could easily pull the trigger code out to a separate function that I just call directly (with triggers temporarily disabled). My thinking is that calling a function is calling a function, and the fact that it is currently called via a trigger rather than direct is of little consequence, but I’m willing to be corrected on that :-)

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145


RE: UPDATE many records

From
Mark Zellers
Date:

You don’t tell us if other users will be concurrently changing any of the records involved.  If you could guarantee that the table won’t be changed, you might be better off doing a CREATE TABLE table_new as SELECT … FROM table_old, dropping table_old, and finally renaming table_new.   Given the way Postgres handles updates, I would think that might perform significantly better.  Even if you did the work in batches (create a new table, insert/select from the old table, drop, rename), that could well be better.  Especially if you re-create the indexes after all the data is moved.
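As a sketch of what that might look like (the column names are placeholders, and this assumes nothing else is writing to the table while it runs):

BEGIN;
CREATE TABLE table_new AS
    SELECT id,
           some_other_column,
           new_value() AS changed_field   -- recomputed value, written exactly once per row
      FROM table_old;
-- recreate indexes, constraints, and grants on table_new here
DROP TABLE table_old;
ALTER TABLE table_new RENAME TO table_old;
COMMIT;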

 

 

 

From: Israel Brewster <ijbrewster@alaska.edu>
Sent: Monday, January 6, 2020 10:36 AM
To: pgsql-general@lists.postgresql.org
Subject: UPDATE many records

 

Thanks to a change in historical data, I have a need to update a large number of records (around 50 million). The update itself is straight forward, as I can just issue an "UPDATE table_name SET changed_field=new_value();" (yes, new_value is the result of a stored procedure, if that makes a difference) command via psql, and it should work. However, due to the large number of records this command will obviously take a while, and if anything goes wrong during the update (one bad value in row 45 million, lost connection, etc), all the work that has been done already will be lost due to the transactional nature of such commands (unless I am missing something).

 

Given that each row update is completely independent of any other row, I have the following questions:

 

1) Is there any way to set the command such that each row change is committed as it is calculated?

2) Is there some way to run this command in parallel in order to better utilize multiple processor cores, other than manually breaking the data into chunks and running a separate psql/update process for each chunk? Honestly, manual parallelizing wouldn’t be too bad (there are a number of logical segregations I can apply), I’m just wondering if there is a more automatic option.

---

Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320

Work: 907-474-5172
cell:  907-328-9145

 

Re: UPDATE many records

From
Israel Brewster
Date:
Nice approach! Using that method, I can do the SELECT to generate the new table in only 6 minutes! I’m sure it helps that, through a slightly creative use of joins, I can re-write my function to be part of the SELECT itself, using only built-in trig functions rather than calling a stored function that I wrote for each row. And at only 6 minutes, I don’t need to worry about things like committing in batches or parallelizing. 
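
Purely as a hypothetical illustration of that kind of rewrite (the real join and trig expression are specific to this data set):

SELECT d.id,
       d.angle_raw,
       sin(radians(d.angle_raw)) * s.scale_factor AS changed_field   -- built-in functions replace the per-row stored procedure
FROM table_old d
JOIN station_params s ON s.station_id = d.station_id;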

I haven’t tried selecting into a new table yet; presumably writing the data back could take some time, but the concept seems like it should work well. Thanks!
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

On Jan 7, 2020, at 10:09 AM, Mark Zellers <markz@adaptiveinsights.com> wrote:

You don’t tell us if other users will be concurrently changing any of the records involved.  If you could guarantee that the table won’t be changed, you might be better off doing a CREATE TABLE table_new as SELECT … FROM table_old, dropping table_old, and finally renaming table_new.   Given the way Postgres handles updates, I would think that might perform significantly better.  Even if you did the work in batches (create a new table, insert/select from the old table, drop, rename), that could well be better.  Especially if you re-create the indexes after all the data is moved.
 
 
 

Re: UPDATE many records

From
Israel Brewster
Date:
One potential issue I just thought of with this approach: disk space. Will I be doubling the amount of space used while both tables exist? If so, that would prevent this from working - I don’t have that much space available at the moment.
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

On Jan 7, 2020, at 10:09 AM, Mark Zellers <markz@adaptiveinsights.com> wrote:

You don’t tell us if other users will be concurrently changing any of the records involved.  If you could guarantee that the table won’t be changed, you might be better off doing a CREATE TABLE table_new as SELECT … FROM table_old, dropping table_old, and finally renaming table_new.   Given the way Postgres handles updates, I would think that might perform significantly better.  Even if you did the work in batches (create a new table, insert/select from the old table, drop, rename), that could well be better.  Especially if you re-create the indexes after all the data is moved.
 
 
 

Re: UPDATE many records

From
Alan Hodgson
Date:
On Tue, 2020-01-07 at 11:47 -0900, Israel Brewster wrote:
> One potential issue I just thought of with this approach: disk space.
> Will I be doubling the amount of space used while both tables exist?
> If so, that would prevent this from working - I don’t have that much
> space available at the moment.

The original update you planned would do that, too.

You probably need to just do the update in batches and vacuum the table
between batches.
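
A minimal sketch of that pattern, assuming an integer primary key and run from psql with autocommit on (names and batch size are placeholders):

UPDATE table_name
SET changed_field = new_value()
WHERE id >= 1 AND id < 100001;   -- one slice, committed on its own
VACUUM table_name;               -- reclaim the dead row versions before the next slice
-- repeat with the next id range (100001 to 200000, and so on) until all rows are covered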




Re: UPDATE many records

From
Israel Brewster
Date:
> On Jan 7, 2020, at 11:56 AM, Alan Hodgson <ahodgson@lists.simkin.ca> wrote:
>
> On Tue, 2020-01-07 at 11:47 -0900, Israel Brewster wrote:
>> One potential issue I just thought of with this approach: disk space.
>> Will I be doubling the amount of space used while both tables exist?
>> If so, that would prevent this from working - I don’t have that much
>> space available at the moment.
>
> The original update you planned would do that, too.
>
> You probably need to just do the update in batches and vacuum the table
> between batches.
>
>

Really? Why? With the update I am only changing data - I’m not adding any additional data, so the total size should
stay the same, right? I’m obviously missing something… :-) 

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

>




Re: UPDATE many records

From
Adrian Klaver
Date:
On 1/7/20 12:47 PM, Israel Brewster wrote:
> One potential issue I just thought of with this approach: disk space. 
> Will I be doubling the amount of space used while both tables exist? If 
> so, that would prevent this from working - I don’t have that much space 
> available at the moment.

It will definitely increase the disk space by at least the data in the 
new table. How much relative to the old table is going to depend on how 
aggressive the AUTOVACUUM/VACUUM is.

A suggestion for an alternative approach:

1) Create a table:

create table change_table(id int, changed_fld some_type)

where id is the PK from the existing table.

2) Run your conversion function against the existing table, modified so 
that it puts the new field value in change_table keyed to the id/PK. 
Probably do this in batches.

3) Once all the values have been updated, do an UPDATE set changed_field 
= changed_fld from change_table where existing_table.pk = change_table.id;
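
A rough sketch of those three steps (column names are placeholders, the changed_fld type is a stand-in for whatever type changed_field actually has, and batching is assumed to be by an integer id):

CREATE TABLE change_table (id int PRIMARY KEY, changed_fld double precision);  -- use the real type here

-- step 2, repeated per id range
INSERT INTO change_table (id, changed_fld)
SELECT id, new_value()
FROM existing_table
WHERE id >= 1 AND id < 100001;

-- step 3, once every batch has been loaded
UPDATE existing_table e
SET changed_field = c.changed_fld
FROM change_table c
WHERE e.id = c.id;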

> ---
> Israel Brewster
> Software Engineer
> Alaska Volcano Observatory
> Geophysical Institute - UAF
> 2156 Koyukuk Drive
> Fairbanks AK 99775-7320
> Work: 907-474-5172
> cell:  907-328-9145


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: UPDATE many records

From
Adrian Klaver
Date:
On 1/7/20 12:58 PM, Israel Brewster wrote:
>> On Jan 7, 2020, at 11:56 AM, Alan Hodgson <ahodgson@lists.simkin.ca> wrote:
>>
>> On Tue, 2020-01-07 at 11:47 -0900, Israel Brewster wrote:
>>> One potential issue I just thought of with this approach: disk space.
>>> Will I be doubling the amount of space used while both tables exist?
>>> If so, that would prevent this from working - I don’t have that much
>>> space available at the moment.
>>
>> The original update you planned would do that, too.
>>
>> You probably need to just do the update in batches and vacuum the table
>> between batches.
>>
>>
> 
> Really? Why? With the update I am only changing data - I’m not adding any additional data, so the total size should
stay the same, right? I’m obviously missing something… :-)
 

https://www.postgresql.org/docs/12/sql-vacuum.html

"VACUUM reclaims storage occupied by dead tuples. In normal PostgreSQL 
operation, tuples that are deleted or obsoleted by an update are not 
physically removed from their table; they remain present until a VACUUM 
is done. Therefore it's necessary to do VACUUM periodically, especially 
on frequently-updated tables."

> 
> ---
> Israel Brewster
> Software Engineer
> Alaska Volcano Observatory
> Geophysical Institute - UAF
> 2156 Koyukuk Drive
> Fairbanks AK 99775-7320
> Work: 907-474-5172
> cell:  907-328-9145
> 
>>
> 
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: UPDATE many records

From
Israel Brewster
Date:
> On Jan 7, 2020, at 12:01 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
>
> On 1/7/20 12:47 PM, Israel Brewster wrote:
>> One potential issue I just thought of with this approach: disk space. Will I be doubling the amount of space used
while both tables exist? If so, that would prevent this from working - I don’t have that much space available at the
moment.
>
> It will definitely increase the disk space by at least the data in the new table. How much relative to the old table
is going to depend on how aggressive the AUTOVACUUM/VACUUM is. 
>
> A suggestion for an alternative approach:
>
> 1) Create a table:
>
> create table change_table(id int, changed_fld some_type)
>
> where id is the PK from the existing table.
>
> 2) Run your conversion function against existing table with change to have it put new field value in change_table
keyed to id/PK. Probably do this in batches. 
>
> 3) Once all the values have been updated, do an UPDATE set changed_field = changed_fld from change_table where
existing_table.pk = change_table.id; 

Makes sense. Use the fast SELECT to create/populate the other table, then the update can just be setting a value, not
having to call any functions. From what you are saying about updates though, I may still need to batch the UPDATE
section, with occasional VACUUMs to maintain disk space. Unless I am not understanding the concept of “tuples that are
obsoleted by an update”, which is possible. 

>
>> ---
>> Israel Brewster
>> Software Engineer
>> Alaska Volcano Observatory
>> Geophysical Institute - UAF
>> 2156 Koyukuk Drive
>> Fairbanks AK 99775-7320
>> Work: 907-474-5172
>> cell:  907-328-9145
>
>
> --
> Adrian Klaver
> adrian.klaver@aklaver.com




Re: UPDATE many records

From
Alan Hodgson
Date:
On Tue, 2020-01-07 at 11:58 -0900, Israel Brewster wrote:
> > 
> Really? Why? With the update I am only changing data - I’m not adding
> any additional data, so the total size should stay the same, right?
> I’m obviously missing something… :-)
> 

PostgreSQL keeps the old row until it gets vacuumed, as it needs to be
visible to other transactions. Not only that, but every index record
gets updated to point to the location of the new data row too (excluding
HOT), and those old index blocks also need to get vacuumed. And none of
those rows can get removed until your update finishes.

I know this isn't universally true with HOT and fillfactor etc. but with
an update this big I think it's safe to say most of the space will get
doubled.

Plus you'll get a ton of write-ahead logs.
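
For what it's worth, the effect is visible in the statistics views while the update runs (table name is a placeholder):

SELECT relname,
       n_live_tup,
       n_dead_tup,                                        -- old row versions waiting for VACUUM
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_stat_user_tables
WHERE relname = 'table_name';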




Re: UPDATE many records

From
Israel Brewster
Date:
> On Jan 7, 2020, at 12:15 PM, Alan Hodgson <ahodgson@lists.simkin.ca> wrote:
>
> On Tue, 2020-01-07 at 11:58 -0900, Israel Brewster wrote:
>>>
>> Really? Why? With the update I am only changing data - I’m not adding
>> any additional data, so the total size should stay the same, right?
>> I’m obviously missing something… :-)
>>
>
> PostgreSQL keeps the old row until it gets vacuumed, as it needs to be
> visible to other transactions. Not only that, but every index record
> gets updated to point to the location of the new data row too (excluding
> HOT), and those old index blocks also need to get vacuumed. And none of
> those rows can get removed until your update finishes.
>
> I know this isn't universally true with HOT and fillfactor etc. but with
> an update this big I think it's safe to say most of the space will get
> doubled.
>
> Plus you'll get a ton of write-ahead logs.

Gotcha. Batches with VACUUM it is! Thanks for the info.

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

>
>
>




Re: UPDATE many records

From
Adrian Klaver
Date:
On 1/7/20 1:10 PM, Israel Brewster wrote:
> 
>> On Jan 7, 2020, at 12:01 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
>>
>> On 1/7/20 12:47 PM, Israel Brewster wrote:
>>> One potential issue I just thought of with this approach: disk space. Will I be doubling the amount of space used
while both tables exist? If so, that would prevent this from working - I don’t have that much space available at the
moment.
>>
>> It will definitely increase the disk space by at least the data in the new table. How much relative to the old table
is going to depend on how aggressive the AUTOVACUUM/VACUUM is.
 
>>
>> A suggestion for an alternative approach:
>>
>> 1) Create a table:
>>
>> create table change_table(id int, changed_fld some_type)
>>
>> where id is the PK from the existing table.
>>
>> 2) Run your conversion function against existing table with change to have it put new field value in change_table
keyed to id/PK. Probably do this in batches.
 
>>
>> 3) Once all the values have been updated, do an UPDATE set changed_field = changed_fld from change_table where
existing_table.pk = change_table.id;
 
> 
> Makes sense. Use the fast SELECT to create/populate the other table, then the update can just be setting a value, not
having to call any functions. From what you are saying about updates though, I may still need to batch the UPDATE
section, with occasional VACUUMs to maintain disk space. Unless I am not understanding the concept of “tuples that are
obsoleted by an update”, which is possible.
 

You are not. For a more thorough explanation see:

https://www.postgresql.org/docs/12/routine-vacuuming.html#VACUUM-BASICS

How much space do you have to work with?

To get an idea of the disk space currently used by the table, see:

https://www.postgresql.org/docs/12/functions-admin.html#FUNCTIONS-ADMIN-DBOBJECT
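
For example (substituting the real table name):

SELECT pg_size_pretty(pg_relation_size('existing_table'))       AS heap_only,
       pg_size_pretty(pg_indexes_size('existing_table'))        AS indexes_only,
       pg_size_pretty(pg_total_relation_size('existing_table')) AS heap_indexes_and_toast;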

> 
>>
>>> ---
>>> Israel Brewster
>>> Software Engineer
>>> Alaska Volcano Observatory
>>> Geophysical Institute - UAF
>>> 2156 Koyukuk Drive
>>> Fairbanks AK 99775-7320
>>> Work: 907-474-5172
>>> cell:  907-328-9145
>>
>>
>> -- 
>> Adrian Klaver
>> adrian.klaver@aklaver.com
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: UPDATE many records

From
Israel Brewster
Date:
On Jan 7, 2020, at 12:21 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:

On 1/7/20 1:10 PM, Israel Brewster wrote:
On Jan 7, 2020, at 12:01 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:

On 1/7/20 12:47 PM, Israel Brewster wrote:
One potential issue I just thought of with this approach: disk space. Will I be doubling the amount of space used while both tables exist? If so, that would prevent this from working - I don’t have that much space available at the moment.

It will definitely increase the disk space by at least the data in the new table. How much relative to the old table is going to depend on how aggressive the AUTOVACUUM/VACUUM is.

A suggestion for an alternative approach:

1) Create a table:

create table change_table(id int, changed_fld some_type)

where id is the PK from the existing table.

2) Run your conversion function against existing table with change to have it put new field value in change_table keyed to id/PK. Probably do this in batches.

3) Once all the values have been updated, do an UPDATE set changed_field = changed_fld from change_table where existing_table.pk = change_table.id;
Makes sense. Use the fast SELECT to create/populate the other table, then the update can just be setting a value, not having to call any functions. From what you are saying about updates though, I may still need to batch the UPDATE section, with occasional VACUUMs to maintain disk space. Unless I am not understanding the concept of “tuples that are obsoleted by an update”, which is possible.

You are not. For a more thorough explanation see:

https://www.postgresql.org/docs/12/routine-vacuuming.html#VACUUM-BASICS

How much space do you have to work with?

To get an idea of the disk space currently used by table see;

https://www.postgresql.org/docs/12/functions-admin.html#FUNCTIONS-ADMIN-DBOBJECT

Oh, ok, I guess I was being overly paranoid on this front. Those functions would indicate that the table is only 7.5 GB, with another 8.7GB of indexes, for a total of around 16GB. So not a problem after all - I have around 100GB available.

Of course, that now leaves me with the mystery of where my other 500GB of disk space is going, since it is apparently NOT going to my DB as I had assumed, but solving that can wait.

Thanks again for all the good information and suggestions!
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145



Re: UPDATE many records

From
Adrian Klaver
Date:
On 1/7/20 1:43 PM, Israel Brewster wrote:
>> On Jan 7, 2020, at 12:21 PM, Adrian Klaver <adrian.klaver@aklaver.com 
>> <mailto:adrian.klaver@aklaver.com>> wrote:
>>
>> On 1/7/20 1:10 PM, Israel Brewster wrote:
>>>> On Jan 7, 2020, at 12:01 PM, Adrian Klaver 
>>>> <adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>> wrote:
>>>>
>>>> On 1/7/20 12:47 PM, Israel Brewster wrote:
>>>>> One potential issue I just thought of with this approach: disk 
>>>>> space. Will I be doubling the amount of space used while both 
>>>>> tables exist? If so, that would prevent this from working - I don’t 
>>>>> have that much space available at the moment.
>>>>
>>>> It will definitely increase the disk space by at least the data in 
>>>> the new table. How much relative to the old table is going to depend 
>>>> on how aggressive the AUTOVACUUM/VACUUM is.
>>>>
>>>> A suggestion for an alternative approach:
>>>>
>>>> 1) Create a table:
>>>>
>>>> create table change_table(id int, changed_fld some_type)
>>>>
>>>> where id is the PK from the existing table.
>>>>
>>>> 2) Run your conversion function against existing table with change 
>>>> to have it put new field value in change_table keyed to id/PK. 
>>>> Probably do this in batches.
>>>>
>>>> 3) Once all the values have been updated, do an UPDATE set 
>>>> changed_field = changed_fld from change_table where 
>>>> existing_table.pk = change_table.id <http://change_table.id>;
>>> Makes sense. Use the fast SELECT to create/populate the other table, 
>>> then the update can just be setting a value, not having to call any 
>>> functions. From what you are saying about updates though, I may still 
>>> need to batch the UPDATE section, with occasional VACUUMs to maintain 
>>> disk space. Unless I am not understanding the concept of “tuples that 
>>> are obsoleted by an update”, which is possible.
>>
>> You are not. For a more thorough explanation see:
>>
>> https://www.postgresql.org/docs/12/routine-vacuuming.html#VACUUM-BASICS
>>
>> How much space do you have to work with?
>>
>> To get an idea of the disk space currently used by table see;
>>
>> https://www.postgresql.org/docs/12/functions-admin.html#FUNCTIONS-ADMIN-DBOBJECT
> 
> Oh, ok, I guess I was being overly paranoid on this front. Those 
> functions would indicate that the table is only 7.5 GB, with another 
> 8.7GB of indexes, for a total of around 16GB. So not a problem after all 
> - I have around 100GB available.
> 
> Of course, that now leaves me with the mystery of where my other 500GB 
> of disk space is going, since it is apparently NOT going to my DB as I 
> had assumed, but solving that can wait.

Assuming you are on some form of Linux:

sudo du -h -d 1 /

Then you can drill down into the output of the above.

> 
> Thanks again for all the good information and suggestions!


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: UPDATE many records

From
Israel Brewster
Date:
>
> On Jan 7, 2020, at 12:57 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
>
> On 1/7/20 1:43 PM, Israel Brewster wrote:
>>> On Jan 7, 2020, at 12:21 PM, Adrian Klaver <adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>> wrote:
>>>
>>> On 1/7/20 1:10 PM, Israel Brewster wrote:
>>>>> On Jan 7, 2020, at 12:01 PM, Adrian Klaver <adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>> wrote:
>>>>>
>>>>> On 1/7/20 12:47 PM, Israel Brewster wrote:
>>>>>> One potential issue I just thought of with this approach: disk space. Will I be doubling the amount of space
used while both tables exist? If so, that would prevent this from working - I don’t have that much space available at
the moment. 
>>>>>
>>>>> It will definitely increase the disk space by at least the data in the new table. How much relative to the old
table is going to depend on how aggressive the AUTOVACUUM/VACUUM is. 
>>>>>
>>>>> A suggestion for an alternative approach:
>>>>>
>>>>> 1) Create a table:
>>>>>
>>>>> create table change_table(id int, changed_fld some_type)
>>>>>
>>>>> where id is the PK from the existing table.
>>>>>
>>>>> 2) Run your conversion function against existing table with change to have it put new field value in change_table
keyed to id/PK. Probably do this in batches. 
>>>>>
>>>>> 3) Once all the values have been updated, do an UPDATE set changed_field = changed_fld from change_table where
existing_table.pk = change_table.id; 
>>>> Makes sense. Use the fast SELECT to create/populate the other table, then the update can just be setting a value,
not having to call any functions. From what you are saying about updates though, I may still need to batch the UPDATE
section, with occasional VACUUMs to maintain disk space. Unless I am not understanding the concept of “tuples that are
obsoleted by an update”, which is possible. 
>>>
>>> You are not. For a more thorough explanation see:
>>>
>>> https://www.postgresql.org/docs/12/routine-vacuuming.html#VACUUM-BASICS
>>>
>>> How much space do you have to work with?
>>>
>>> To get an idea of the disk space currently used by table see;
>>>
>>> https://www.postgresql.org/docs/12/functions-admin.html#FUNCTIONS-ADMIN-DBOBJECT
>> Oh, ok, I guess I was being overly paranoid on this front. Those functions would indicate that the table is only 7.5
GB, with another 8.7GB of indexes, for a total of around 16GB. So not a problem after all - I have around 100GB
available.
>> Of course, that now leaves me with the mystery of where my other 500GB of disk space is going, since it is
apparently NOT going to my DB as I had assumed, but solving that can wait. 
>
> Assuming you are on some form of Linux:
>
> sudo du -h -d 1 /
>
> Then you can drill down into the output of above.

Yep. Done it many times to discover a runaway log file or the like. Just mildly amusing that solving one problem leads
to another I need to take care of as well… But at least the select into a new table should work nicely. Thanks! 

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145




Re: UPDATE many records

From
Israel Brewster
Date:
On Jan 8, 2020, at 7:52 AM, stan <stanb@panix.com> wrote:


I'd love to see you report on how this went. 

So after determining that I did, in fact, have enough disk space to duplicate the data, I moved forward with the CREATE TABLE … AS SELECT … method. Running the CREATE TABLE command took around 12 minutes for my almost 64 million rows. I then created indexes, set constraints, set defaults, etc. on the new table until it exactly matched the old one (other than the changed data, of course). This probably took another 5-10 minutes. Two quick ALTER TABLE…RENAME TO… commands later, and the new data was live. The only issue I ran into was initially forgetting to grant the application user permissions on the new table, but of course that was easily remedied.
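
Condensed into a sketch (all names are placeholders, and the SELECT expression stands in for the real inlined computation), that sequence amounts to something like:

CREATE TABLE table_new AS
    SELECT id, other_columns, sin(radians(angle_col)) AS changed_field   -- stand-in for the real inlined expression
    FROM table_old;
-- re-create indexes, constraints, and defaults on table_new to match table_old
CREATE INDEX table_new_id_idx ON table_new (id);
GRANT SELECT, INSERT, UPDATE, DELETE ON table_new TO app_user;   -- the grant that was initially forgotten
ALTER TABLE table_old RENAME TO table_old_backup;
ALTER TABLE table_new RENAME TO table_old;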

In the end, it really couldn’t have gone much smoother or quicker. Thanks all for the assistance and advice!
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145
