Discussion: Update table performance
I have a table with 4,889,820 records in it. The table also has 47 fields. I'm having problems with update performance. Just as a test, I issued the following update:

update valley set test='this is a test'

This took 905641 ms. Isn't that kind of slow? There aren't any indexes, triggers, constraints or anything on this table. The version of Postgres is "PostgreSQL 8.2.4 on i686-pc-mingw32, compiled by GCC gcc.exe (GCC) 3.4.2 (mingw-special)". The operating environment is Windows 2003 Standard Edition w/service pack 2. It is 2.20 GHz with 1.0 GB of RAM. Here is the result from EXPLAIN:

"Seq Scan on valley  (cost=0.00..1034083.57 rows=4897257 width=601)"

Here are the settings in the postgresql.conf. Any ideas, or is this the expected speed?

#---------------------------------------------------------------------------
# CONNECTIONS AND AUTHENTICATION
#---------------------------------------------------------------------------

# - Connection Settings -

listen_addresses = '*'		# what IP address(es) to listen on; comma-separated list of addresses; defaults to 'localhost', '*' = all (change requires restart)
port = 5432			# (change requires restart)
max_connections = 20		# (change requires restart)
# Note: increasing max_connections costs ~400 bytes of shared memory per
# connection slot, plus lock space (see max_locks_per_transaction). You
# might also need to raise shared_buffers to support more connections.
#superuser_reserved_connections = 3	# (change requires restart)
#unix_socket_directory = ''		# (change requires restart)
#unix_socket_group = ''			# (change requires restart)
#unix_socket_permissions = 0777		# octal (change requires restart)
#bonjour_name = ''			# defaults to the computer name (change requires restart)

# - Security & Authentication -

#authentication_timeout = 1min		# 1s-600s
#ssl = off				# (change requires restart)
#password_encryption = on
#db_user_namespace = off

# Kerberos
#krb_server_keyfile = ''		# (change requires restart)
#krb_srvname = 'postgres'		# (change requires restart)
#krb_server_hostname = ''		# empty string matches any keytab entry (change requires restart)
#krb_caseins_users = off		# (change requires restart)

# - TCP Keepalives -
# see 'man 7 tcp' for details

#tcp_keepalives_idle = 0		# TCP_KEEPIDLE, in seconds; 0 selects the system default
#tcp_keepalives_interval = 0		# TCP_KEEPINTVL, in seconds; 0 selects the system default
#tcp_keepalives_count = 0		# TCP_KEEPCNT; 0 selects the system default

#---------------------------------------------------------------------------
# RESOURCE USAGE (except WAL)
#---------------------------------------------------------------------------

# - Memory -

shared_buffers = 512MB			# min 128kB or max_connections*16kB (change requires restart)
temp_buffers = 8MB			# min 800kB
max_prepared_transactions = 5		# can be 0 or more (change requires restart)
# Note: increasing max_prepared_transactions costs ~600 bytes of shared memory
# per transaction slot, plus lock space (see max_locks_per_transaction).
work_mem = 8MB				# min 64kB
maintenance_work_mem = 16MB		# min 1MB
#max_stack_depth = 4MB			# min 100kB

# - Free Space Map -

max_fsm_pages = 700000			# min max_fsm_relations*16, 6 bytes each (change requires restart)
max_fsm_relations = 1000		# min 100, ~70 bytes each (change requires restart)

# - Kernel Resource Usage -

max_files_per_process = 1000		# min 25 (change requires restart)
#shared_preload_libraries = ''		# (change requires restart)

# - Cost-Based Vacuum Delay -

#vacuum_cost_delay = 0			# 0-1000 milliseconds
#vacuum_cost_page_hit = 1		# 0-10000 credits
#vacuum_cost_page_miss = 10		# 0-10000 credits
#vacuum_cost_page_dirty = 20		# 0-10000 credits
#vacuum_cost_limit = 200		# 0-10000 credits

# - Background writer -

#bgwriter_delay = 200ms			# 10-10000ms between rounds
#bgwriter_lru_percent = 1.0		# 0-100% of LRU buffers scanned/round
#bgwriter_lru_maxpages = 5		# 0-1000 buffers max written/round
#bgwriter_all_percent = 0.333		# 0-100% of all buffers scanned/round
#bgwriter_all_maxpages = 5		# 0-1000 buffers max written/round

#---------------------------------------------------------------------------
# WRITE AHEAD LOG
#---------------------------------------------------------------------------

# - Settings -

fsync = off				# turns forced synchronization on or off
#wal_sync_method = fsync		# the default is the first option supported by the operating system: open_datasync, fdatasync, fsync, fsync_writethrough, open_sync
full_page_writes = off			# recover from partial page writes
#wal_buffers = 64kB			# min 32kB (change requires restart)
#commit_delay = 0			# range 0-100000, in microseconds
#commit_siblings = 5			# range 1-1000

# - Checkpoints -

checkpoint_segments = 60		# in logfile segments, min 1, 16MB each
checkpoint_timeout = 5min		# range 30s-1h
checkpoint_warning = 0			# 0 is off

# - Archiving -

archive_command = ''			# command to use to archive a logfile segment
archive_timeout = 0			# force a logfile segment switch after this many seconds; 0 is off
#---------------------------------------------------------------------------
# QUERY TUNING
#---------------------------------------------------------------------------

# - Planner Method Configuration -

enable_bitmapscan = on
enable_hashagg = on
enable_hashjoin = on
enable_indexscan = on
enable_mergejoin = on
enable_nestloop = on
enable_seqscan = on
enable_sort = on
enable_tidscan = on

# - Planner Cost Constants -

#seq_page_cost = 1.0			# measured on an arbitrary scale
#random_page_cost = 4.0			# same scale as above
#cpu_tuple_cost = 0.01			# same scale as above
#cpu_index_tuple_cost = 0.005		# same scale as above
#cpu_operator_cost = 0.0025		# same scale as above
effective_cache_size = 32MB

# - Genetic Query Optimizer -

#geqo = on
#geqo_threshold = 12
#geqo_effort = 5			# range 1-10
#geqo_pool_size = 0			# selects default based on effort
#geqo_generations = 0			# selects default based on effort
#geqo_selection_bias = 2.0		# range 1.5-2.0

# - Other Planner Options -

#default_statistics_target = 10		# range 1-1000
#constraint_exclusion = off
#from_collapse_limit = 8
#join_collapse_limit = 8		# 1 disables collapsing of explicit JOINs

#---------------------------------------------------------------------------
# ERROR REPORTING AND LOGGING
#---------------------------------------------------------------------------

# - Where to Log -

log_destination = 'stderr'		# Valid values are combinations of stderr, syslog and eventlog, depending on platform.

# This is used when logging to stderr:
redirect_stderr = on			# Enable capturing of stderr into log files (change requires restart)

# These are only used if redirect_stderr is on:
#log_directory = 'pg_log'		# Directory where log files are written; can be absolute or relative to PGDATA
#log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log' # Log file name pattern; can include strftime() escapes
#log_truncate_on_rotation = off		# If on, any existing log file of the same name as the new log file will be truncated rather than appended to. But such truncation only occurs on time-driven rotation, not on restarts or size-driven rotation. Default is off, meaning append to existing files in all cases.
#log_rotation_age = 1d			# Automatic rotation of logfiles will happen after that time. 0 to disable.
#log_rotation_size = 10MB		# Automatic rotation of logfiles will happen after that much log output. 0 to disable.

# These are relevant when logging to syslog:
#syslog_facility = 'LOCAL0'
#syslog_ident = 'postgres'

# - When to Log -

#client_min_messages = notice		# Values, in order of decreasing detail: debug5, debug4, debug3, debug2, debug1, log, notice, warning, error
#log_min_messages = notice		# Values, in order of decreasing detail: debug5, debug4, debug3, debug2, debug1, info, notice, warning, error, log, fatal, panic
#log_error_verbosity = default		# terse, default, or verbose messages
#log_min_error_statement = error	# Values in order of increasing severity: debug5, debug4, debug3, debug2, debug1, info, notice, warning, error, fatal, panic (effectively off)
#log_min_duration_statement = -1	# -1 is disabled, 0 logs all statements and their durations.
#silent_mode = off			# DO NOT USE without syslog or redirect_stderr (change requires restart)

# - What to Log -

#debug_print_parse = off
#debug_print_rewritten = off
#debug_print_plan = off
#debug_pretty_print = off
#log_connections = off
#log_disconnections = off
#log_duration = off
log_line_prefix = '%t '			# Special values: %u = user name, %d = database name, %r = remote host and port, %h = remote host, %p = PID, %t = timestamp (no milliseconds), %m = timestamp with milliseconds, %i = command tag, %c = session id, %l = session line number, %s = session start timestamp, %x = transaction id, %q = stop here in non-session processes, %% = '%'; e.g. '<%u%%%d> '
#log_statement = 'none'			# none, ddl, mod, all
#log_hostname = off

#---------------------------------------------------------------------------
# RUNTIME STATISTICS
#---------------------------------------------------------------------------

# - Query/Index Statistics Collector -

#stats_command_string = on
#update_process_title = on
stats_start_collector = on		# needed for block or row stats (change requires restart)
#stats_block_level = off
stats_row_level = on
#stats_reset_on_server_start = off	# (change requires restart)

# - Statistics Monitoring -

#log_parser_stats = off
#log_planner_stats = off
#log_executor_stats = off
#log_statement_stats = off

#---------------------------------------------------------------------------
# AUTOVACUUM PARAMETERS
#---------------------------------------------------------------------------

autovacuum = on				# enable autovacuum subprocess? 'on' requires stats_start_collector and stats_row_level to also be on
#autovacuum_naptime = 1min		# time between autovacuum runs
#autovacuum_vacuum_threshold = 500	# min # of tuple updates before vacuum
#autovacuum_analyze_threshold = 250	# min # of tuple updates before analyze
#autovacuum_vacuum_scale_factor = 0.2	# fraction of rel size before vacuum
#autovacuum_analyze_scale_factor = 0.1	# fraction of rel size before analyze
#autovacuum_freeze_max_age = 200000000	# maximum XID age before forced vacuum (change requires restart)
#autovacuum_vacuum_cost_delay = -1	# default vacuum cost delay for autovacuum, -1 means use vacuum_cost_delay
#autovacuum_vacuum_cost_limit = -1	# default vacuum cost limit for autovacuum, -1 means use vacuum_cost_limit

#---------------------------------------------------------------------------
# CLIENT CONNECTION DEFAULTS
#---------------------------------------------------------------------------

# - Statement Behavior -

#search_path = '"$user",public'		# schema names
#default_tablespace = ''		# a tablespace name, '' uses the default
#check_function_bodies = on
#default_transaction_isolation = 'read committed'
#default_transaction_read_only = off
#statement_timeout = 0			# 0 is disabled
#vacuum_freeze_min_age = 100000000

# - Locale and Formatting -

datestyle = 'iso, mdy'
#timezone = unknown			# actually, defaults to TZ environment setting
#timezone_abbreviations = 'Default'	# select the set of available timezone abbreviations. Currently, there are: Default, Australia, India. However you can also create your own file in share/timezonesets/.
#extra_float_digits = 0			# min -15, max 2
#client_encoding = sql_ascii		# actually, defaults to database encoding

# These settings are initialized by initdb -- they might be changed
lc_messages = 'C'			# locale for system error message strings
lc_monetary = 'C'			# locale for monetary formatting
lc_numeric = 'C'			# locale for number formatting
lc_time = 'C'				# locale for time formatting

# - Other Defaults -

#explain_pretty_print = on
#dynamic_library_path = '$libdir'
#local_preload_libraries = ''

#---------------------------------------------------------------------------
# LOCK MANAGEMENT
#---------------------------------------------------------------------------

#deadlock_timeout = 1s
max_locks_per_transaction = 384		# min 10 (change requires restart)
# Note: each lock table slot uses ~270 bytes of shared memory, and there are
# max_locks_per_transaction * (max_connections + max_prepared_transactions)
# lock table slots.
#---------------------------------------------------------------------------
# VERSION/PLATFORM COMPATIBILITY
#---------------------------------------------------------------------------

# - Previous Postgres Versions -

#add_missing_from = off
#array_nulls = on
#backslash_quote = safe_encoding	# on, off, or safe_encoding
#default_with_oids = off
#escape_string_warning = on
#standard_conforming_strings = off
#regex_flavor = advanced		# advanced, extended, or basic
#sql_inheritance = on

# - Other Platforms & Clients -

#transform_null_equals = off

#---------------------------------------------------------------------------
# CUSTOMIZED OPTIONS
#---------------------------------------------------------------------------

#custom_variable_classes = ''		# list of custom variable class names
Mark Makarowsky wrote:
> I have a table with 4,889,820 records in it. The table also has 47 fields.
> I'm having problems with update performance. Just as a test, I issued the
> following update:
>
> update valley set test='this is a test'
>
> This took 905641 ms. Isn't that kind of slow?

The limiting factor here will be how fast you can write to your disk. Let's see: 5 million rows in ~900 seconds, that's about 5500 rows/second.

Now, you don't say how large your rows are, but assuming each row is, say, 1kB, that'd be 5.5MB/sec, which is not brilliant. The simplest way to find the total activity is to check how much disk space PG is using before and after the update.

What you'll need to do is monitor disk activity, in particular how many writes occur and how much time the processor spends waiting on writes to complete.

If your standard usage pattern is to update a single field in all rows of a large table, then performance isn't going to be sub-second, I'm afraid.

--
Richard Huxton
Archonet Ltd
On 8/7/07, Mark Makarowsky <bedrockconstruction@yahoo.com> wrote:
> I have a table with 4,889,820 records in it. The table also has 47 fields.
> I'm having problems with update performance. Just as a test, I issued the
> following update:
>
> update valley set test='this is a test'
>
> This took 905641 ms. Isn't that kind of slow? There aren't any indexes,
> triggers, constraints or anything on this table. The version of Postgres is
> "PostgreSQL 8.2.4 on i686-pc-mingw32, compiled by GCC gcc.exe (GCC) 3.4.2
> (mingw-special)". The operating environment is Windows 2003 Standard
> Edition w/service pack 2. It is 2.20 GHz with 1.0 GB of RAM. Here is the
> result from EXPLAIN:
>
> "Seq Scan on valley  (cost=0.00..1034083.57 rows=4897257 width=601)"

Have you done this a few times? You could easily have a very large and bloated table if you run this update several times in a row, and that would explain the slow performance. If you're going to do a lot of updates without WHERE clauses on large tables, you'll need to run a vacuum right afterwards to clean things up.

I see that you included a lot about your machine, but you didn't include any specs on your disk subsystem. When it comes to update speed, the disk subsystem is probably the most important part.

Note also that Windows is still not the preferred platform for PostgreSQL from a performance perspective (actually, the only database for which that's true is MS-SQL, really).

Have you run any benchmarks on your disk subsystem to see how fast it is?
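The advice above can be sketched in SQL. The table and column names are the ones from the original post; whether VACUUM alone is enough here (versus VACUUM FULL to shrink the file) depends on how bloated the table already is, so treat this as an illustrative sequence rather than a prescription:

```sql
-- A full-table update writes a complete new version of every row,
-- leaving 4.9M dead row versions behind.
UPDATE valley SET test = 'this is a test';

-- Mark the dead versions reusable and refresh planner statistics.
-- Without this, repeating the update keeps growing the table.
VACUUM ANALYZE valley;

-- To inspect how much dead space the table is carrying (8.2 syntax):
VACUUM VERBOSE valley;
```

Running the VACUUM between repeated bulk updates is what keeps the table from doubling in size each time, since the next update can recycle the space the previous one freed.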
On Tuesday 07 August 2007 05:58, Mark Makarowsky <bedrockconstruction@yahoo.com> wrote:
> I have a table with 4,889,820 records in it. The table also has 47 fields.
> I'm having problems with update performance. Just as a test, I issued the
> following update:
>
> update valley set test='this is a test'
>
> This took 905641 ms. Isn't that kind of slow?

PostgreSQL has to write a full new version of every row that gets updated. Updates are, therefore, relatively slow. I'm guessing you're doing this on a single SATA drive, too, which probably doesn't help.

--
"If a nation values anything more than freedom, it will lose its freedom; and the irony of it is that if it is comfort or money that it values more, it will lose that too." -- Somerset Maugham, Author
On Tue, Aug 07, 2007 at 02:33:19PM +0100, Richard Huxton wrote:
> Mark Makarowsky wrote:
> > I have a table with 4,889,820 records in it. The table also has 47
> > fields. I'm having problems with update performance. Just as a test,
> > I issued the following update:
> >
> > update valley set test='this is a test'
> >
> > This took 905641 ms. Isn't that kind of slow?
>
> The limiting factor here will be how fast you can write to your disk.

Well, very possibly how fast you can read, too. Using your assumption of 1k per row, 5M rows means 5G of data, which might well not fit in memory. And if the entire table's been updated just once before, even with vacuuming you're now at 10G of data.

--
Decibel!, aka Jim Nasby                        decibel@decibel.org
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)
bedrockconstruction@yahoo.com (Mark Makarowsky) writes:
> I have a table with 4,889,820 records in it. The table also has 47 fields.
> I'm having problems with update performance. Just as a test, I issued the
> following update:
>
> update valley set test='this is a test'
>
> This took 905641 ms. Isn't that kind of slow? There aren't any indexes,
> triggers, constraints or anything on this table. The version of Postgres is
> "PostgreSQL 8.2.4 on i686-pc-mingw32, compiled by GCC gcc.exe (GCC) 3.4.2
> (mingw-special)". The operating environment is Windows 2003 Standard
> Edition w/service pack 2. It is 2.20 GHz with 1.0 GB of RAM. Here is the
> result from EXPLAIN:
>
> "Seq Scan on valley  (cost=0.00..1034083.57 rows=4897257 width=601)"
>
> Here are the settings in the postgresql.conf. Any ideas, or is this the
> expected speed?

Hmm.

- You asked to update 4,889,820 records.
- It's a table consisting of 8.5GB of data (based on the cost info).

For this to take 15 minutes doesn't seem particularly outrageous.

--
output = ("cbbrowne" "@" "acm.org")
http://cbbrowne.com/info/oses.html
Rules of the Evil Overlord #65. "If I must have computer systems with publically available terminals, the maps they display of my complex will have a room clearly marked as the Main Control Room. That room will be the Execution Chamber. The actual main control room will be marked as Sewage Overflow Containment." <http://www.eviloverlord.com/>
On 8/7/07, Decibel! <decibel@decibel.org> wrote:
> On Tue, Aug 07, 2007 at 02:33:19PM +0100, Richard Huxton wrote:
> > Mark Makarowsky wrote:
> > > I have a table with 4,889,820 records in it. The table also has 47
> > > fields. I'm having problems with update performance. Just as a test,
> > > I issued the following update:
> > >
> > > update valley set test='this is a test'
> > >
> > > This took 905641 ms. Isn't that kind of slow?
> >
> > The limiting factor here will be how fast you can write to your disk.
>
> Well, very possibly how fast you can read, too. Using your assumption of
> 1k per row, 5M rows means 5G of data, which might well not fit in memory.
> And if the entire table's been updated just once before, even with
> vacuuming you're now at 10G of data.

Where one might have to update just one column of a wide table often, it's often a good idea to move that column into its own dependent table.

Or just don't update one column of every row in the table...
On Tue, Aug 07, 2007 at 02:36:18PM -0500, Scott Marlowe wrote:
> On 8/7/07, Decibel! <decibel@decibel.org> wrote:
> > On Tue, Aug 07, 2007 at 02:33:19PM +0100, Richard Huxton wrote:
> > > Mark Makarowsky wrote:
> > > > I have a table with 4,889,820 records in it. The table also has 47
> > > > fields. I'm having problems with update performance. Just as a
> > > > test, I issued the following update:
> > > >
> > > > update valley set test='this is a test'
> > > >
> > > > This took 905641 ms. Isn't that kind of slow?
> > >
> > > The limiting factor here will be how fast you can write to your disk.
> >
> > Well, very possibly how fast you can read, too. Using your assumption of
> > 1k per row, 5M rows means 5G of data, which might well not fit in memory.
> > And if the entire table's been updated just once before, even with
> > vacuuming you're now at 10G of data.
>
> Where one might have to update just one column of a wide table often,
> it's often a good idea to move that column into its own dependent table.

Yeah, I've used "vertical partitioning" very successfully in the past, though I've never done it for just a single field. I'll typically leave the few most common fields in the "main" table and pull everything else into a second table.

> Or just don't update one column of every row in table...

Yeah, that too. :) Though sometimes you can't avoid it.

I should mention that if you can handle splitting the update into multiple transactions, that will help a lot since it means you won't be doubling the size of the table.

--
Decibel!, aka Jim Nasby                        decibel@decibel.org
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)
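One way to split the update into multiple transactions, as suggested above, is to walk the table in key ranges and vacuum between batches. This sketch assumes a unique integer id column on valley, which is hypothetical (the original post didn't show the table definition):

```sql
-- Batch 1: its own transaction, so its dead row versions become
-- reclaimable as soon as it commits.
BEGIN;
UPDATE valley SET test = 'this is a test'
WHERE id BETWEEN 1 AND 1000000;
COMMIT;
VACUUM valley;  -- let the next batch reuse the space just freed

-- Batch 2: same pattern, next key range.
BEGIN;
UPDATE valley SET test = 'this is a test'
WHERE id BETWEEN 1000001 AND 2000000;
COMMIT;
VACUUM valley;

-- ...repeat until the whole key range is covered.
```

The point is that each batch after the first can recycle pages freed by the previous batch, so the table grows by roughly one batch's worth of rows instead of doubling.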
Can you provide more detail on what you mean by your two suggestions below:

"Yeah, I've used 'vertical partitioning' very successfully in the past, though I've never done it for just a single field. I'll typically leave the few most common fields in the 'main' table and pull everything else into a second table."

"I should mention that if you can handle splitting the update into multiple transactions, that will help a lot since it means you won't be doubling the size of the table."

I guess I was just surprised by the speed it takes to update the field in Postgres, since on an almost identical table in FoxPro (400,000 records fewer), it updates the table with the exact same UPDATE statement in about 4 minutes.

--- Decibel! <decibel@decibel.org> wrote:
> On Tue, Aug 07, 2007 at 02:36:18PM -0500, Scott Marlowe wrote:
> > On 8/7/07, Decibel! <decibel@decibel.org> wrote:
> > > On Tue, Aug 07, 2007 at 02:33:19PM +0100, Richard Huxton wrote:
> > > > Mark Makarowsky wrote:
> > > > > I have a table with 4,889,820 records in it. The table also has 47
> > > > > fields. I'm having problems with update performance. Just as a
> > > > > test, I issued the following update:
> > > > >
> > > > > update valley set test='this is a test'
> > > > >
> > > > > This took 905641 ms. Isn't that kind of slow?
> > > >
> > > > The limiting factor here will be how fast you can write to your disk.
> > >
> > > Well, very possibly how fast you can read, too. Using your assumption
> > > of 1k per row, 5M rows means 5G of data, which might well not fit in
> > > memory. And if the entire table's been updated just once before, even
> > > with vacuuming you're now at 10G of data.
> >
> > Where one might have to update just one column of a wide table often,
> > it's often a good idea to move that column into its own dependent table.
>
> Yeah, I've used "vertical partitioning" very successfully in the past,
> though I've never done it for just a single field. I'll typically leave
> the few most common fields in the "main" table and pull everything else
> into a second table.
>
> > Or just don't update one column of every row in table...
>
> Yeah, that too. :) Though sometimes you can't avoid it.
>
> I should mention that if you can handle splitting the update into
> multiple transactions, that will help a lot since it means you won't be
> doubling the size of the table.
> --
> Decibel!, aka Jim Nasby                        decibel@decibel.org
> EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)
On Aug 7, 2007, at 6:13 PM, Mark Makarowsky wrote:
> Can you provide more detail on what you mean by your two suggestions
> below:
>
> Yeah, I've used "vertical partitioning" very successfully in the past,
> though I've never done it for just a single field. I'll typically leave
> the few most common fields in the "main" table and pull everything else
> into a second table.

Vertical partitioning is where you split up your table on disk by columns, i.e. on the vertical lines. He quoted it because Postgres doesn't actually support it transparently, but you can always fake it by splitting up your table. For example, given the following table, wherein column bar gets updated a lot but the others don't:

create table foo (
    id int not null,
    bar int,
    baz int,
    primary key (id)
);

You could split it up like so:

create table foo_a (
    id int,
    baz int,
    primary key (id)
);

create table foo_b (
    foo_id int,
    bar int,
    constraint foo_a_id foreign key (foo_id) references foo_a (id)
);

The reason you'd ever want to do this is that when Postgres goes to update a row, what it actually does is insert a new row with the new value(s) that you changed and mark the old one as deleted. So, if you have a wide table and frequently update only certain columns, you'll take a performance hit as you're having to re-write a lot of static values.

> I should mention that if you can handle splitting the update into
> multiple transactions, that will help a lot since it means you won't be
> doubling the size of the table.

As I mentioned above, when you do an update you're actually inserting a new row and deleting the old one. That deleted row is still considered part of the table (for reasons of concurrency; read up on the concurrency chapter in the manual for the details) and, once it is no longer visible to any live transactions, can be re-used by future inserts. So, if you update one column on every row of a one million row table all at once, you have to allocate and write out one million new rows. But, if you do the update a quarter million at a time, the last three updates would be able to re-use many of the rows deleted in earlier updates.

Erik Jones
Software Developer | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com
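One practical detail worth adding to Erik's foo_a/foo_b sketch: applications can still see the original one-table shape through a view over the two halves. The view name and join are illustrative, following his example tables:

```sql
-- Recombine the split tables into one logical relation for readers.
CREATE VIEW foo AS
SELECT a.id, b.bar, a.baz
FROM foo_a a
LEFT JOIN foo_b b ON b.foo_id = a.id;

-- Writers update the narrow, frequently-changing half directly,
-- so each update rewrites only a small foo_b row, not the wide row.
UPDATE foo_b SET bar = bar + 1 WHERE foo_id = 42;
```

The trade-off is a join on every read of the combined shape, which is why the split only pays off when the hot column is updated far more often than the wide row is read in full.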
Erik Jones wrote:
> Decibel! wrote:
> > I should mention that if you can handle splitting the update into
> > multiple transactions, that will help a lot since it means you won't
> > be doubling the size of the table.
>
> As I mentioned above, when you do an update you're actually inserting a
> new row and deleting the old one. That deleted row is still considered
> part of the table (for reasons of concurrency, read up on the
> concurrency chapter in the manual for the details) and once it is no
> longer visible by any live transactions can be re-used by future
> inserts. So, if you update one column on every row of a one million row
> table all at once, you have to allocate and write out one million new
> rows. But, if you do the update a quarter million at a time, the last
> three updates would be able to re-use many of the rows deleted in
> earlier updates.

Only if you vacuum between the updates.

--
Heikki Linnakangas
EnterpriseDB  http://www.enterprisedb.com
On Aug 8, 2007, at 3:00 AM, Heikki Linnakangas wrote:
> Erik Jones wrote:
> > Decibel! wrote:
> > > I should mention that if you can handle splitting the update into
> > > multiple transactions, that will help a lot since it means you
> > > won't be doubling the size of the table.
> >
> > As I mentioned above, when you do an update you're actually inserting
> > a new row and deleting the old one. That deleted row is still
> > considered part of the table (for reasons of concurrency, read up on
> > the concurrency chapter in the manual for the details) and once it is
> > no longer visible by any live transactions can be re-used by future
> > inserts. So, if you update one column on every row of a one million
> > row table all at once, you have to allocate and write out one million
> > new rows. But, if you do the update a quarter million at a time, the
> > last three updates would be able to re-use many of the rows deleted
> > in earlier updates.
>
> Only if you vacuum between the updates.

This is true. In fact, the chapter on Routine Database Maintenance tasks that discusses vacuuming explains all of this.

Erik Jones
Software Developer | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com
On Tue, Aug 07, 2007 at 08:46:20PM -0500, Erik Jones wrote:
> Vertical partitioning is where you split up your table on disk by
> columns, i.e. on the vertical lines. He quoted it because Postgres
> doesn't actually support it transparently but you can always fake it
> by splitting up your table. For example, given the following table
> wherein column bar gets updated a lot but the others don't:
>
> create table foo (
>     id int not null,
>     bar int,
>     baz int,
>     primary key (id)
> );
>
> You could split it up like so:
>
> create table foo_a (
>     id int,
>     baz int,
>     primary key (id)
> );
>
> create table foo_b (
>     foo_id int,
>     bar int,
>     foreign key (foo_id) references foo_a (id)
> );

FWIW, the cases where I've actually used this have been on much wider tables, where a number of the attributes are infrequently accessed. An example would be if you keep snail-mail address info for users; you probably don't use those fields very often, so they would be good candidates for going into a second table.

When does it actually make sense to use this? When you do a *lot* with a small number of fields in the table. In this example, perhaps you very frequently need to look up either user_name or user_id, probably via joins. Having a table with just name, id, perhaps password and a few other fields might add up to 50 bytes per row (with overhead), while address information by itself could easily be 50 bytes. So by pushing that out to another table, you cut the size of the main table in half. That means more efficient use of cache, faster seqscans, etc.

The case Erik is describing is more unique to PostgreSQL and how it handles MVCC. In some cases, splitting a frequently updated row out to a separate table might not gain as much once we get HOT, but it's still a good tool to consider.
Depending on what you're doing, another useful technique is to not update the field as often, by logging updates to be performed into a separate table and periodically processing that information into the main table.

--
Decibel!, aka Jim Nasby                        decibel@decibel.org
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)
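That logging technique could be sketched like this. All names here are illustrative (the thread never showed the real table definition), and it assumes valley has a unique id column and at most one pending change per row; with multiple pending changes per row you'd need to pick the latest one first:

```sql
-- Cheap, append-only log of pending changes to the wide table.
CREATE TABLE valley_pending (
    valley_id  int  NOT NULL,
    new_test   text NOT NULL
);

-- During normal operation: insert into the narrow log instead of
-- updating (and rewriting) the 47-column row.
INSERT INTO valley_pending (valley_id, new_test)
VALUES (42, 'this is a test');

-- Periodically: fold the log into the main table, then clear it.
BEGIN;
UPDATE valley v
SET test = p.new_test
FROM valley_pending p
WHERE v.id = p.valley_id;
DELETE FROM valley_pending;
COMMIT;
VACUUM valley;  -- reclaim the row versions the batch update replaced
```

The win is that the expensive rewrite of wide rows happens in one controlled batch (when it can be vacuumed right after) instead of being scattered across the day.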
On 8/8/07, Mark Makarowsky <bedrockconstruction@yahoo.com> wrote:
> Can you provide more detail on what you mean by your two suggestions
> below:
>
> Yeah, I've used "vertical partitioning" very successfully in the past,
> though I've never done it for just a single field. I'll typically leave
> the few most common fields in the "main" table and pull everything else
> into a second table.
>
> I should mention that if you can handle splitting the update into
> multiple transactions, that will help a lot since it means you won't be
> doubling the size of the table.
>
> I guess I was just surprised by the speed it takes to update the field
> in Postgres, since on an almost identical table in FoxPro (400,000
> records fewer), it updates the table with the exact same UPDATE
> statement in about 4 minutes.

FoxPro is a single-process DBF-based system with some SQL access. When you update the records, it updates them in place, since all the records are fixed size and padded. Be careful with this comparison: while certain operations like the above may feel faster, the locking in FoxPro is extremely crude compared to PostgreSQL's, and there are many other things about DBF systems in general which are pretty lousy from a performance perspective.

That said, UPDATE is the slowest operation for PostgreSQL relative to other databases that are not MVCC. This is balanced by extremely efficient locking and good performance under multi-user loads. PostgreSQL likes to be used a certain way; you will find that when used properly it is extremely fast.

Keep an eye out for the HOT feature, which will hopefully make it into 8.3 and will greatly reduce the penalty for (small) updates in many cases.

merlin
On Thu, Aug 09, 2007 at 06:04:09PM +0530, Merlin Moncure wrote:
>keep an eye for the HOT feature which will hopefully make 8.3 that
>will highly reduce the penalty for (small) updates in many cases.

Is there an overview somewhere about how this feature works and what it is expected to do? There have been a lot of references to it over time, and it's possible to understand it if you follow list traffic over time, but starting cold it's hard to know what it is. The name was poorly chosen as far as Google is concerned. :)

Mike Stone
On 8/9/07, Michael Stone <mstone+postgres@mathom.us> wrote:
> On Thu, Aug 09, 2007 at 06:04:09PM +0530, Merlin Moncure wrote:
> >keep an eye for the HOT feature which will hopefully make 8.3 that
> >will highly reduce the penalty for (small) updates in many cases.
>
> Is there an overview somewhere about how this feature works and what it
> is expected to do? There have been a lot of references to it over time,
> and it's possible to understand it if you follow list traffic over time,
> but starting cold it's hard to know what it is. The name was poorly
> chosen as far as google is concerned. :)

This is what I found when I went looking for info earlier:

http://archives.postgresql.org/pgsql-patches/2007-07/msg00142.php
http://archives.postgresql.org/pgsql-patches/2007-07/msg00360.php
On Thu, Aug 09, 2007 at 06:04:09PM +0530, Merlin Moncure wrote:
> That said, 'update' is the slowest operation for postgresql relative
> to other databases that are not MVCC.

Actually, it depends on how you do MVCC. In Oracle, DELETE is actually the most expensive operation, because they have to not only remove the row from the heap but also copy it to the undo log, and they need to do something with the indexes as well. Whereas we just update 4 bytes in the heap and that's it. An UPDATE in Oracle, OTOH, just needs to store whatever fields have changed in the undo log; if you haven't messed with indexed fields, it doesn't have to touch those either.

--
Decibel!, aka Jim Nasby        decibel@decibel.org
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)
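The "an UPDATE writes a whole new row version" behaviour discussed here can be observed directly through PostgreSQL's hidden `ctid` system column, which gives a row's physical location (the table and exact ctid values below are illustrative):

```sql
-- A row's ctid (physical address) changes after an UPDATE, because
-- PostgreSQL's MVCC writes a new tuple and leaves the old one dead
-- until VACUUM reclaims it.
CREATE TABLE mvcc_demo (x integer);
INSERT INTO mvcc_demo VALUES (1);
SELECT ctid, x FROM mvcc_demo;   -- e.g. (0,1)
UPDATE mvcc_demo SET x = 2;
SELECT ctid, x FROM mvcc_demo;   -- e.g. (0,2): a new tuple was written
```

This is why the whole-table update in the original post effectively rewrites every one of the 4.9 million wide rows, rather than patching a single field in place the way FoxPro does.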
Mark,

You are not alone in the fact that when you post your system specifications, CPU and memory are always listed while the disk I/O subsystem invariably is not. This is a very disk-intensive operation, and I suspect that your disk system is maxed out. If you want it faster, you will need more I/O capacity.

Regards,
Ken

On Tue, Aug 07, 2007 at 05:58:35AM -0700, Mark Makarowsky wrote:
> I have a table with 4,889,820 records in it. The
> table also has 47 fields. I'm having problems with
> update performance. Just as a test, I issued the
> following update:
>
> update valley set test='this is a test'
>
> This took 905641 ms. Isn't that kind of slow? There
> aren't any indexes, triggers, constraints or anything
> on this table. The version of Postgres is "PostgreSQL
> 8.2.4 on i686-pc-mingw32, compiled by GCC gcc.exe
> (GCC) 3.4.2 (mingw-special)". The operating
> environment is Windows 2003 Standard Edition w/service
> pack 2. It is 2.20 Ghz with 1.0 GB of RAM. Here is
> the results from Explain:
>
> "Seq Scan on valley (cost=0.00..1034083.57
> rows=4897257 width=601)"
>
> Here are the settings in the postgresql.conf. Any
Hi,

update valley set test='this is a test'

Such a query updates ALL of the records in the table. 5 million records * 47 fields - that can be several gigabytes of data. The system has to read and rewrite those gigabytes to change every record. This is a huge task. Try vacuuming and see if it helps; it can help a lot if you perform such "whole table updates" often.

Best regards,
Piotr Kolaczkowski
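For the table from the original post, that vacuuming could be as simple as the following (VACUUM FULL is optional and takes an exclusive lock, so it is only worth it if the table has already become badly bloated):

```sql
-- Reclaim the dead row versions left behind by the whole-table update,
-- with a report of what was removed:
VACUUM VERBOSE valley;

-- If the table has already roughly doubled in size, compact it
-- (exclusive lock for the duration) and refresh planner statistics:
VACUUM FULL valley;
ANALYZE valley;
```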