Discussion: Update table performance
I have a table with 4,889,820 records in it. The table also has 47 fields. I'm having problems with update performance. Just as a test, I issued the following update:

update valley set test='this is a test'

This took 905641 ms. Isn't that kind of slow? There aren't any indexes, triggers, constraints or anything on this table. The version of Postgres is "PostgreSQL 8.2.4 on i686-pc-mingw32, compiled by GCC gcc.exe (GCC) 3.4.2 (mingw-special)". The operating environment is Windows 2003 Standard Edition w/service pack 2. It is 2.20 GHz with 1.0 GB of RAM. Here is the result from EXPLAIN:

"Seq Scan on valley  (cost=0.00..1034083.57 rows=4897257 width=601)"

Here are the settings in the postgresql.conf. Any ideas, or is this the expected speed?

#---------------------------------------------------------------------------
# CONNECTIONS AND AUTHENTICATION
#---------------------------------------------------------------------------

# - Connection Settings -

listen_addresses = '*'		# what IP address(es) to listen on; comma-separated list of addresses; defaults to 'localhost', '*' = all (change requires restart)
port = 5432			# (change requires restart)
max_connections = 20		# (change requires restart)
# Note: increasing max_connections costs ~400 bytes of shared memory per
# connection slot, plus lock space (see max_locks_per_transaction). You
# might also need to raise shared_buffers to support more connections.
#superuser_reserved_connections = 3	# (change requires restart)
#unix_socket_directory = ''		# (change requires restart)
#unix_socket_group = ''			# (change requires restart)
#unix_socket_permissions = 0777		# octal (change requires restart)
#bonjour_name = ''			# defaults to the computer name (change requires restart)

# - Security & Authentication -

#authentication_timeout = 1min		# 1s-600s
#ssl = off				# (change requires restart)
#password_encryption = on
#db_user_namespace = off

# Kerberos
#krb_server_keyfile = ''		# (change requires restart)
#krb_srvname = 'postgres'		# (change requires restart)
#krb_server_hostname = ''		# empty string matches any keytab entry (change requires restart)
#krb_caseins_users = off		# (change requires restart)

# - TCP Keepalives -
# see 'man 7 tcp' for details

#tcp_keepalives_idle = 0		# TCP_KEEPIDLE, in seconds; 0 selects the system default
#tcp_keepalives_interval = 0		# TCP_KEEPINTVL, in seconds; 0 selects the system default
#tcp_keepalives_count = 0		# TCP_KEEPCNT; 0 selects the system default

#---------------------------------------------------------------------------
# RESOURCE USAGE (except WAL)
#---------------------------------------------------------------------------

# - Memory -

shared_buffers = 512MB			# min 128kB or max_connections*16kB (change requires restart)
temp_buffers = 8MB			# min 800kB
max_prepared_transactions = 5		# can be 0 or more (change requires restart)
# Note: increasing max_prepared_transactions costs ~600 bytes of shared memory
# per transaction slot, plus lock space (see max_locks_per_transaction).
work_mem = 8MB				# min 64kB
maintenance_work_mem = 16MB		# min 1MB
#max_stack_depth = 4MB			# min 100kB

# - Free Space Map -

max_fsm_pages = 700000			# min max_fsm_relations*16, 6 bytes each (change requires restart)
max_fsm_relations = 1000		# min 100, ~70 bytes each (change requires restart)

# - Kernel Resource Usage -

max_files_per_process = 1000		# min 25 (change requires restart)
#shared_preload_libraries = ''		# (change requires restart)

# - Cost-Based Vacuum Delay -

#vacuum_cost_delay = 0			# 0-1000 milliseconds
#vacuum_cost_page_hit = 1		# 0-10000 credits
#vacuum_cost_page_miss = 10		# 0-10000 credits
#vacuum_cost_page_dirty = 20		# 0-10000 credits
#vacuum_cost_limit = 200		# 0-10000 credits

# - Background writer -

#bgwriter_delay = 200ms			# 10-10000ms between rounds
#bgwriter_lru_percent = 1.0		# 0-100% of LRU buffers scanned/round
#bgwriter_lru_maxpages = 5		# 0-1000 buffers max written/round
#bgwriter_all_percent = 0.333		# 0-100% of all buffers scanned/round
#bgwriter_all_maxpages = 5		# 0-1000 buffers max written/round

#---------------------------------------------------------------------------
# WRITE AHEAD LOG
#---------------------------------------------------------------------------

# - Settings -

fsync = off				# turns forced synchronization on or off
#wal_sync_method = fsync		# the default is the first option supported by the operating system: open_datasync, fdatasync, fsync, fsync_writethrough, open_sync
full_page_writes = off			# recover from partial page writes
#wal_buffers = 64kB			# min 32kB (change requires restart)
#commit_delay = 0			# range 0-100000, in microseconds
#commit_siblings = 5			# range 1-1000

# - Checkpoints -

checkpoint_segments = 60		# in logfile segments, min 1, 16MB each
checkpoint_timeout = 5min		# range 30s-1h
checkpoint_warning = 0			# 0 is off

# - Archiving -

archive_command = ''			# command to use to archive a logfile segment
archive_timeout = 0			# force a logfile segment switch after this many seconds; 0 is off
#---------------------------------------------------------------------------
# QUERY TUNING
#---------------------------------------------------------------------------

# - Planner Method Configuration -

enable_bitmapscan = on
enable_hashagg = on
enable_hashjoin = on
enable_indexscan = on
enable_mergejoin = on
enable_nestloop = on
enable_seqscan = on
enable_sort = on
enable_tidscan = on

# - Planner Cost Constants -

#seq_page_cost = 1.0			# measured on an arbitrary scale
#random_page_cost = 4.0			# same scale as above
#cpu_tuple_cost = 0.01			# same scale as above
#cpu_index_tuple_cost = 0.005		# same scale as above
#cpu_operator_cost = 0.0025		# same scale as above
effective_cache_size = 32MB

# - Genetic Query Optimizer -

#geqo = on
#geqo_threshold = 12
#geqo_effort = 5			# range 1-10
#geqo_pool_size = 0			# selects default based on effort
#geqo_generations = 0			# selects default based on effort
#geqo_selection_bias = 2.0		# range 1.5-2.0

# - Other Planner Options -

#default_statistics_target = 10		# range 1-1000
#constraint_exclusion = off
#from_collapse_limit = 8
#join_collapse_limit = 8		# 1 disables collapsing of explicit JOINs

#---------------------------------------------------------------------------
# ERROR REPORTING AND LOGGING
#---------------------------------------------------------------------------

# - Where to Log -

log_destination = 'stderr'		# Valid values are combinations of stderr, syslog and eventlog, depending on platform.

# This is used when logging to stderr:
redirect_stderr = on			# Enable capturing of stderr into log files (change requires restart)

# These are only used if redirect_stderr is on:
#log_directory = 'pg_log'		# Directory where log files are written; can be absolute or relative to PGDATA
#log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log' # Log file name pattern; can include strftime() escapes
#log_truncate_on_rotation = off		# If on, any existing log file of the same name as the new log file will be truncated rather than appended to. But such truncation only occurs on time-driven rotation, not on restarts or size-driven rotation. Default is off, meaning append to existing files in all cases.
#log_rotation_age = 1d			# Automatic rotation of logfiles will happen after that time. 0 to disable.
#log_rotation_size = 10MB		# Automatic rotation of logfiles will happen after that much log output. 0 to disable.

# These are relevant when logging to syslog:
#syslog_facility = 'LOCAL0'
#syslog_ident = 'postgres'

# - When to Log -

#client_min_messages = notice		# Values, in order of decreasing detail: debug5, debug4, debug3, debug2, debug1, log, notice, warning, error
#log_min_messages = notice		# Values, in order of decreasing detail: debug5, debug4, debug3, debug2, debug1, info, notice, warning, error, log, fatal, panic
#log_error_verbosity = default		# terse, default, or verbose messages
#log_min_error_statement = error	# Values in order of increasing severity: debug5, debug4, debug3, debug2, debug1, info, notice, warning, error, fatal, panic (effectively off)
#log_min_duration_statement = -1	# -1 is disabled, 0 logs all statements and their durations.
#silent_mode = off			# DO NOT USE without syslog or redirect_stderr (change requires restart)

# - What to Log -

#debug_print_parse = off
#debug_print_rewritten = off
#debug_print_plan = off
#debug_pretty_print = off
#log_connections = off
#log_disconnections = off
#log_duration = off
log_line_prefix = '%t '			# Special values: %u = user name, %d = database name, %r = remote host and port, %h = remote host, %p = PID, %t = timestamp (no milliseconds), %m = timestamp with milliseconds, %i = command tag, %c = session id, %l = session line number, %s = session start timestamp, %x = transaction id, %q = stop here in non-session processes, %% = '%'; e.g. '<%u%%%d> '
#log_statement = 'none'			# none, ddl, mod, all
#log_hostname = off

#---------------------------------------------------------------------------
# RUNTIME STATISTICS
#---------------------------------------------------------------------------

# - Query/Index Statistics Collector -

#stats_command_string = on
#update_process_title = on
stats_start_collector = on		# needed for block or row stats (change requires restart)
#stats_block_level = off
stats_row_level = on
#stats_reset_on_server_start = off	# (change requires restart)

# - Statistics Monitoring -

#log_parser_stats = off
#log_planner_stats = off
#log_executor_stats = off
#log_statement_stats = off

#---------------------------------------------------------------------------
# AUTOVACUUM PARAMETERS
#---------------------------------------------------------------------------

autovacuum = on				# enable autovacuum subprocess? 'on' requires stats_start_collector and stats_row_level to also be on
#autovacuum_naptime = 1min		# time between autovacuum runs
#autovacuum_vacuum_threshold = 500	# min # of tuple updates before vacuum
#autovacuum_analyze_threshold = 250	# min # of tuple updates before analyze
#autovacuum_vacuum_scale_factor = 0.2	# fraction of rel size before vacuum
#autovacuum_analyze_scale_factor = 0.1	# fraction of rel size before analyze
#autovacuum_freeze_max_age = 200000000	# maximum XID age before forced vacuum (change requires restart)
#autovacuum_vacuum_cost_delay = -1	# default vacuum cost delay for autovacuum, -1 means use vacuum_cost_delay
#autovacuum_vacuum_cost_limit = -1	# default vacuum cost limit for autovacuum, -1 means use vacuum_cost_limit

#---------------------------------------------------------------------------
# CLIENT CONNECTION DEFAULTS
#---------------------------------------------------------------------------

# - Statement Behavior -

#search_path = '"$user",public'		# schema names
#default_tablespace = ''		# a tablespace name, '' uses the default
#check_function_bodies = on
#default_transaction_isolation = 'read committed'
#default_transaction_read_only = off
#statement_timeout = 0			# 0 is disabled
#vacuum_freeze_min_age = 100000000

# - Locale and Formatting -

datestyle = 'iso, mdy'
#timezone = unknown			# actually, defaults to TZ environment setting
#timezone_abbreviations = 'Default'	# select the set of available timezone abbreviations. Currently, there are: Default, Australia, India. However you can also create your own file in share/timezonesets/.
#extra_float_digits = 0			# min -15, max 2
#client_encoding = sql_ascii		# actually, defaults to database encoding

# These settings are initialized by initdb -- they might be changed
lc_messages = 'C'			# locale for system error message strings
lc_monetary = 'C'			# locale for monetary formatting
lc_numeric = 'C'			# locale for number formatting
lc_time = 'C'				# locale for time formatting

# - Other Defaults -

#explain_pretty_print = on
#dynamic_library_path = '$libdir'
#local_preload_libraries = ''

#---------------------------------------------------------------------------
# LOCK MANAGEMENT
#---------------------------------------------------------------------------

#deadlock_timeout = 1s
max_locks_per_transaction = 384		# min 10 (change requires restart)
# Note: each lock table slot uses ~270 bytes of shared memory, and there are
# max_locks_per_transaction * (max_connections + max_prepared_transactions)
# lock table slots.
#---------------------------------------------------------------------------
# VERSION/PLATFORM COMPATIBILITY
#---------------------------------------------------------------------------

# - Previous Postgres Versions -

#add_missing_from = off
#array_nulls = on
#backslash_quote = safe_encoding	# on, off, or safe_encoding
#default_with_oids = off
#escape_string_warning = on
#standard_conforming_strings = off
#regex_flavor = advanced		# advanced, extended, or basic
#sql_inheritance = on

# - Other Platforms & Clients -

#transform_null_equals = off

#---------------------------------------------------------------------------
# CUSTOMIZED OPTIONS
#---------------------------------------------------------------------------

#custom_variable_classes = ''		# list of custom variable class names
Mark Makarowsky wrote:
> I have a table with 4,889,820 records in it. The table also has 47 fields.
> I'm having problems with update performance. Just as a test, I issued the
> following update:
>
> update valley set test='this is a test'
>
> This took 905641 ms. Isn't that kind of slow?

The limiting factor here will be how fast you can write to your disk. Let's see: 5 million rows in ~900 seconds, that's about 5500 rows/second.

Now, you don't say how large your rows are, but assuming each row is, say, 1kB, that'd be 5.5MB/sec, which is not brilliant. The simplest way to find the total activity is to check how much disk space PG is using before and after the update.

What you'll need to do is monitor disk activity, in particular how many writes occur and how much time the processor spends waiting on writes to complete.

If your standard usage pattern is to update a single field in all rows of a large table, then performance isn't going to be sub-second, I'm afraid.

--
Richard Huxton
Archonet Ltd
On 8/7/07, Mark Makarowsky <bedrockconstruction@yahoo.com> wrote:
> I have a table with 4,889,820 records in it. The table also has 47 fields.
> I'm having problems with update performance. Just as a test, I issued the
> following update:
>
> update valley set test='this is a test'
>
> This took 905641 ms. Isn't that kind of slow? There aren't any indexes,
> triggers, constraints or anything on this table. The version of Postgres is
> "PostgreSQL 8.2.4 on i686-pc-mingw32, compiled by GCC gcc.exe (GCC) 3.4.2
> (mingw-special)". The operating environment is Windows 2003 Standard
> Edition w/service pack 2. It is 2.20 GHz with 1.0 GB of RAM. Here is the
> result from EXPLAIN:
>
> "Seq Scan on valley  (cost=0.00..1034083.57 rows=4897257 width=601)"

Have you done this a few times? You could easily have a very large and bloated table if you run this update several times in a row, and that would explain the slow performance. If you're going to do a lot of updates without WHERE clauses on large tables, you'll need to run a vacuum right afterwards to clean things up.

I see that you included a lot about your machine, but you didn't include any specs on your disk subsystem. When it comes to update speed, the disk subsystem is probably the most important part.

Note also that Windows is still not the preferred platform for PostgreSQL from a performance perspective (actually, the only database for which that's true is MS-SQL, really).

Have you run any benchmarks on your disk subsystem to see how fast it is?
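The advice above can be sketched in SQL. The table and column names are the ones from the original post; whether VACUUM alone is enough here (versus VACUUM FULL to shrink the file) depends on how bloated the table already is, so treat this as an illustrative sequence rather than a prescription:

```sql
-- A full-table update writes a complete new version of every row,
-- leaving 4.9M dead row versions behind.
UPDATE valley SET test = 'this is a test';

-- Mark the dead versions reusable and refresh planner statistics.
-- Without this, repeating the update keeps growing the table.
VACUUM ANALYZE valley;

-- To inspect how much dead space the table is carrying (8.2 syntax):
VACUUM VERBOSE valley;
```

Running the VACUUM between repeated bulk updates is what keeps the table from doubling in size each time, since the next update can recycle the space the previous one freed.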
On Tuesday 07 August 2007 05:58, Mark Makarowsky <bedrockconstruction@yahoo.com> wrote:
> I have a table with 4,889,820 records in it. The table also has 47 fields.
> I'm having problems with update performance. Just as a test, I issued the
> following update:
>
> update valley set test='this is a test'
>
> This took 905641 ms. Isn't that kind of slow?

PostgreSQL has to write a full new version of every row that gets updated. Updates are, therefore, relatively slow. I'm guessing you're doing this on a single SATA drive, too, which probably doesn't help.

--
"If a nation values anything more than freedom, it will lose its freedom; and the irony of it is that if it is comfort or money that it values more, it will lose that too." -- Somerset Maugham, Author
On Tue, Aug 07, 2007 at 02:33:19PM +0100, Richard Huxton wrote:
> Mark Makarowsky wrote:
> > I have a table with 4,889,820 records in it. The table also has 47
> > fields. I'm having problems with update performance. Just as a test,
> > I issued the following update:
> >
> > update valley set test='this is a test'
> >
> > This took 905641 ms. Isn't that kind of slow?
>
> The limiting factor here will be how fast you can write to your disk.

Well, very possibly how fast you can read, too. Using your assumption of 1k per row, 5M rows means 5G of data, which might well not fit in memory. And if the entire table's been updated just once before, even with vacuuming you're now at 10G of data.

--
Decibel!, aka Jim Nasby                        decibel@decibel.org
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)
bedrockconstruction@yahoo.com (Mark Makarowsky) writes:
> I have a table with 4,889,820 records in it. The table also has 47 fields.
> I'm having problems with update performance. Just as a test, I issued the
> following update:
>
> update valley set test='this is a test'
>
> This took 905641 ms. Isn't that kind of slow? There aren't any indexes,
> triggers, constraints or anything on this table. The version of Postgres is
> "PostgreSQL 8.2.4 on i686-pc-mingw32, compiled by GCC gcc.exe (GCC) 3.4.2
> (mingw-special)". The operating environment is Windows 2003 Standard
> Edition w/service pack 2. It is 2.20 GHz with 1.0 GB of RAM. Here is the
> result from EXPLAIN:
>
> "Seq Scan on valley  (cost=0.00..1034083.57 rows=4897257 width=601)"
>
> Here are the settings in the postgresql.conf. Any ideas, or is this the
> expected speed?

Hmm.

- You asked to update 4,889,820 records.
- It's a table consisting of 8.5GB of data (based on the cost info).

For this to take 15 minutes doesn't seem particularly outrageous.

--
output = ("cbbrowne" "@" "acm.org")
http://cbbrowne.com/info/oses.html
Rules of the Evil Overlord #65. "If I must have computer systems with publically available terminals, the maps they display of my complex will have a room clearly marked as the Main Control Room. That room will be the Execution Chamber. The actual main control room will be marked as Sewage Overflow Containment." <http://www.eviloverlord.com/>
On 8/7/07, Decibel! <decibel@decibel.org> wrote:
> On Tue, Aug 07, 2007 at 02:33:19PM +0100, Richard Huxton wrote:
> > Mark Makarowsky wrote:
> > > I have a table with 4,889,820 records in it. The table also has 47
> > > fields. I'm having problems with update performance. Just as a test,
> > > I issued the following update:
> > >
> > > update valley set test='this is a test'
> > >
> > > This took 905641 ms. Isn't that kind of slow?
> >
> > The limiting factor here will be how fast you can write to your disk.
>
> Well, very possibly how fast you can read, too. Using your assumption of
> 1k per row, 5M rows means 5G of data, which might well not fit in memory.
> And if the entire table's been updated just once before, even with
> vacuuming you're now at 10G of data.

Where one might have to update just one column of a wide table often, it's often a good idea to move that column into its own dependent table.

Or just don't update one column of every row in the table...
On Tue, Aug 07, 2007 at 02:36:18PM -0500, Scott Marlowe wrote:
> On 8/7/07, Decibel! <decibel@decibel.org> wrote:
> > On Tue, Aug 07, 2007 at 02:33:19PM +0100, Richard Huxton wrote:
> > > Mark Makarowsky wrote:
> > > > I have a table with 4,889,820 records in it. The table also has 47
> > > > fields. I'm having problems with update performance. Just as a
> > > > test, I issued the following update:
> > > >
> > > > update valley set test='this is a test'
> > > >
> > > > This took 905641 ms. Isn't that kind of slow?
> > >
> > > The limiting factor here will be how fast you can write to your disk.
> >
> > Well, very possibly how fast you can read, too. Using your assumption of
> > 1k per row, 5M rows means 5G of data, which might well not fit in memory.
> > And if the entire table's been updated just once before, even with
> > vacuuming you're now at 10G of data.
>
> Where one might have to update just one column of a wide table often,
> it's often a good idea to move that column into its own dependent table.

Yeah, I've used "vertical partitioning" very successfully in the past, though I've never done it for just a single field. I'll typically leave the few most common fields in the "main" table and pull everything else into a second table.

> Or just don't update one column of every row in table...

Yeah, that too. :) Though sometimes you can't avoid it.

I should mention that if you can handle splitting the update into multiple transactions, that will help a lot since it means you won't be doubling the size of the table.

--
Decibel!, aka Jim Nasby                        decibel@decibel.org
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)
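One way to split the update into multiple transactions, as suggested above, is to walk the table in key ranges and vacuum between batches. This sketch assumes a unique integer id column on valley, which is hypothetical (the original post didn't show the table definition):

```sql
-- Batch 1: its own transaction, so its dead row versions become
-- reclaimable as soon as it commits.
BEGIN;
UPDATE valley SET test = 'this is a test'
WHERE id BETWEEN 1 AND 1000000;
COMMIT;
VACUUM valley;  -- let the next batch reuse the space just freed

-- Batch 2: same pattern, next key range.
BEGIN;
UPDATE valley SET test = 'this is a test'
WHERE id BETWEEN 1000001 AND 2000000;
COMMIT;
VACUUM valley;

-- ...repeat until the whole key range is covered.
```

The point is that each batch after the first can recycle pages freed by the previous batch, so the table grows by roughly one batch's worth of rows instead of doubling.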
Can you provide more detail on what you mean by your two suggestions below:

"Yeah, I've used 'vertical partitioning' very successfully in the past, though I've never done it for just a single field. I'll typically leave the few most common fields in the 'main' table and pull everything else into a second table."

"I should mention that if you can handle splitting the update into multiple transactions, that will help a lot since it means you won't be doubling the size of the table."

I guess I was just surprised by the speed it takes to update the field in Postgres, since on an almost identical table in FoxPro (400,000 records fewer), it updates the table with the exact same UPDATE statement in about 4 minutes.

--- Decibel! <decibel@decibel.org> wrote:
> On Tue, Aug 07, 2007 at 02:36:18PM -0500, Scott Marlowe wrote:
> > On 8/7/07, Decibel! <decibel@decibel.org> wrote:
> > > On Tue, Aug 07, 2007 at 02:33:19PM +0100, Richard Huxton wrote:
> > > > Mark Makarowsky wrote:
> > > > > I have a table with 4,889,820 records in it. The table also has 47
> > > > > fields. I'm having problems with update performance. Just as a
> > > > > test, I issued the following update:
> > > > >
> > > > > update valley set test='this is a test'
> > > > >
> > > > > This took 905641 ms. Isn't that kind of slow?
> > > >
> > > > The limiting factor here will be how fast you can write to your disk.
> > >
> > > Well, very possibly how fast you can read, too. Using your assumption
> > > of 1k per row, 5M rows means 5G of data, which might well not fit in
> > > memory. And if the entire table's been updated just once before, even
> > > with vacuuming you're now at 10G of data.
> >
> > Where one might have to update just one column of a wide table often,
> > it's often a good idea to move that column into its own dependent table.
>
> Yeah, I've used "vertical partitioning" very successfully in the past,
> though I've never done it for just a single field. I'll typically leave
> the few most common fields in the "main" table and pull everything else
> into a second table.
>
> > Or just don't update one column of every row in table...
>
> Yeah, that too. :) Though sometimes you can't avoid it.
>
> I should mention that if you can handle splitting the update into
> multiple transactions, that will help a lot since it means you won't be
> doubling the size of the table.
> --
> Decibel!, aka Jim Nasby                        decibel@decibel.org
> EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)
On Aug 7, 2007, at 6:13 PM, Mark Makarowsky wrote:
> Can you provide more detail on what you mean by your two suggestions
> below:
>
> Yeah, I've used "vertical partitioning" very successfully in the past,
> though I've never done it for just a single field. I'll typically leave
> the few most common fields in the "main" table and pull everything else
> into a second table.

Vertical partitioning is where you split up your table on disk by columns, i.e. on the vertical lines. He quoted it because Postgres doesn't actually support it transparently, but you can always fake it by splitting up your table. For example, given the following table, wherein column bar gets updated a lot but the others don't:

create table foo (
    id int not null,
    bar int,
    baz int,
    primary key (id)
);

You could split it up like so:

create table foo_a (
    id int,
    baz int,
    primary key (id)
);

create table foo_b (
    foo_id int,
    bar int,
    constraint foo_a_id foreign key (foo_id) references foo_a (id)
);

The reason you'd ever want to do this is that when Postgres goes to update a row, what it actually does is insert a new row with the new value(s) that you changed and mark the old one as deleted. So, if you have a wide table and frequently update only certain columns, you'll take a performance hit as you're having to re-write a lot of static values.

> I should mention that if you can handle splitting the update into
> multiple transactions, that will help a lot since it means you won't be
> doubling the size of the table.

As I mentioned above, when you do an update you're actually inserting a new row and deleting the old one. That deleted row is still considered part of the table (for reasons of concurrency; read up on the concurrency chapter in the manual for the details) and, once it is no longer visible to any live transactions, can be re-used by future inserts. So, if you update one column on every row of a one million row table all at once, you have to allocate and write out one million new rows. But, if you do the update a quarter million at a time, the last three updates would be able to re-use many of the rows deleted in earlier updates.

Erik Jones
Software Developer | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com
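One practical detail worth adding to Erik's foo_a/foo_b sketch: applications can still see the original one-table shape through a view over the two halves. The view name and join are illustrative, following his example tables:

```sql
-- Recombine the split tables into one logical relation for readers.
CREATE VIEW foo AS
SELECT a.id, b.bar, a.baz
FROM foo_a a
LEFT JOIN foo_b b ON b.foo_id = a.id;

-- Writers update the narrow, frequently-changing half directly,
-- so each update rewrites only a small foo_b row, not the wide row.
UPDATE foo_b SET bar = bar + 1 WHERE foo_id = 42;
```

The trade-off is a join on every read of the combined shape, which is why the split only pays off when the hot column is updated far more often than the wide row is read in full.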
Erik Jones wrote:
> Decibel! wrote:
> > I should mention that if you can handle splitting the update into
> > multiple transactions, that will help a lot since it means you won't
> > be doubling the size of the table.
>
> As I mentioned above, when you do an update you're actually inserting a
> new row and deleting the old one. That deleted row is still considered
> part of the table (for reasons of concurrency, read up on the
> concurrency chapter in the manual for the details) and once it is no
> longer visible by any live transactions can be re-used by future
> inserts. So, if you update one column on every row of a one million row
> table all at once, you have to allocate and write out one million new
> rows. But, if you do the update a quarter million at a time, the last
> three updates would be able to re-use many of the rows deleted in
> earlier updates.

Only if you vacuum between the updates.

--
Heikki Linnakangas
EnterpriseDB  http://www.enterprisedb.com
On Aug 8, 2007, at 3:00 AM, Heikki Linnakangas wrote:
> Erik Jones wrote:
> > Decibel! wrote:
> > > I should mention that if you can handle splitting the update into
> > > multiple transactions, that will help a lot since it means you
> > > won't be doubling the size of the table.
> >
> > As I mentioned above, when you do an update you're actually inserting
> > a new row and deleting the old one. That deleted row is still
> > considered part of the table (for reasons of concurrency, read up on
> > the concurrency chapter in the manual for the details) and once it is
> > no longer visible by any live transactions can be re-used by future
> > inserts. So, if you update one column on every row of a one million
> > row table all at once, you have to allocate and write out one million
> > new rows. But, if you do the update a quarter million at a time, the
> > last three updates would be able to re-use many of the rows deleted
> > in earlier updates.
>
> Only if you vacuum between the updates.

This is true. In fact, the chapter on Routine Database Maintenance tasks that discusses vacuuming explains all of this.

Erik Jones
Software Developer | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com
On Tue, Aug 07, 2007 at 08:46:20PM -0500, Erik Jones wrote:
> Vertical partitioning is where you split up your table on disk by
> columns, i.e. on the vertical lines. He quoted it because Postgres
> doesn't actually support it transparently but you can always fake it
> by splitting up your table. For example, given the following table
> wherein column bar gets updated a lot but the others don't:
>
> create table foo (
>     id int not null,
>     bar int,
>     baz int,
>     primary key (id)
> );
>
> You could split it up like so:
>
> create table foo_a (
>     id int,
>     baz int,
>     primary key (id)
> );
>
> create table foo_b (
>     foo_id int,
>     bar int,
>     foreign key (foo_id) references foo_a (id)
> );

FWIW, the cases where I've actually used this have been on much wider tables, where a number of the attributes are infrequently accessed. An example would be if you keep snail-mail address info for users; you probably don't use those fields very often, so they would be good candidates for going into a second table.

When does it actually make sense to use this? When you do a *lot* with a small number of fields in the table. In this example, perhaps you very frequently need to look up either user_name or user_id, probably via joins. Having a table with just name, id, perhaps password and a few other fields might add up to 50 bytes per row (with overhead), while address information by itself could easily be 50 bytes. So by pushing that out to another table, you cut the size of the main table in half. That means more efficient use of cache, faster seqscans, etc.

The case Erik is describing is more unique to PostgreSQL and how it handles MVCC. In some cases, splitting a frequently updated row out to a separate table might not gain as much once we get HOT, but it's still a good tool to consider.
Depending on what you're doing, another useful technique is to not update the field as often, by logging updates to be performed into a separate table and periodically processing that information into the main table.

--
Decibel!, aka Jim Nasby                        decibel@decibel.org
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)
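That logging technique could be sketched like this. All names here are illustrative (the thread never showed the real table definition), and it assumes valley has a unique id column and at most one pending change per row; with multiple pending changes per row you'd need to pick the latest one first:

```sql
-- Cheap, append-only log of pending changes to the wide table.
CREATE TABLE valley_pending (
    valley_id  int  NOT NULL,
    new_test   text NOT NULL
);

-- During normal operation: insert into the narrow log instead of
-- updating (and rewriting) the 47-column row.
INSERT INTO valley_pending (valley_id, new_test)
VALUES (42, 'this is a test');

-- Periodically: fold the log into the main table, then clear it.
BEGIN;
UPDATE valley v
SET test = p.new_test
FROM valley_pending p
WHERE v.id = p.valley_id;
DELETE FROM valley_pending;
COMMIT;
VACUUM valley;  -- reclaim the row versions the batch update replaced
```

The win is that the expensive rewrite of wide rows happens in one controlled batch (when it can be vacuumed right after) instead of being scattered across the day.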
On 8/8/07, Mark Makarowsky <bedrockconstruction@yahoo.com> wrote:
> Can you provide more detail on what you mean by your two suggestions
> below:
>
> Yeah, I've used "vertical partitioning" very successfully in the past,
> though I've never done it for just a single field. I'll typically leave
> the few most common fields in the "main" table and pull everything else
> into a second table.
>
> I should mention that if you can handle splitting the update into
> multiple transactions, that will help a lot since it means you won't be
> doubling the size of the table.
>
> I guess I was just surprised by the speed it takes to update the field
> in Postgres, since on an almost identical table in FoxPro (400,000
> records fewer), it updates the table with the exact same UPDATE
> statement in about 4 minutes.

FoxPro is a single-process DBF-based system with some SQL access. When you update the records, it updates them in place, since all the records are fixed size and padded. Be careful with this comparison: while certain operations like the above may feel faster, the locking in FoxPro is extremely crude compared to PostgreSQL's, and there are many other things about DBF systems in general which are pretty lousy from a performance perspective.

That said, UPDATE is the slowest operation for PostgreSQL relative to other databases that are not MVCC. This is balanced by extremely efficient locking and good performance under multi-user loads. PostgreSQL likes to be used a certain way; you will find that when used properly it is extremely fast.

Keep an eye out for the HOT feature, which will hopefully make it into 8.3 and will greatly reduce the penalty for (small) updates in many cases.

merlin
On Thu, Aug 09, 2007 at 06:04:09PM +0530, Merlin Moncure wrote:
>keep an eye for the HOT feature which will hopefully make 8.3 that
>will highly reduce the penalty for (small) updates in many cases.

Is there an overview somewhere about how this feature works and what it is expected to do? There have been a lot of references to it over time, and it's possible to understand it if you follow list traffic over time, but starting cold it's hard to know what it is. The name was poorly chosen as far as Google is concerned. :)

Mike Stone
On 8/9/07, Michael Stone <mstone+postgres@mathom.us> wrote:
> On Thu, Aug 09, 2007 at 06:04:09PM +0530, Merlin Moncure wrote:
> >keep an eye for the HOT feature which will hopefully make 8.3 that
> >will highly reduce the penalty for (small) updates in many cases.
>
> Is there an overview somewhere about how this feature works and what it
> is expected to do? There have been a lot of references to it over time,
> and it's possible to understand it if you follow list traffic over time,
> but starting cold it's hard to know what it is. The name was poorly
> chosen as far as google is concerned. :)

This is what I found when I went looking for info earlier:

http://archives.postgresql.org/pgsql-patches/2007-07/msg00142.php
http://archives.postgresql.org/pgsql-patches/2007-07/msg00360.php
On Thu, Aug 09, 2007 at 06:04:09PM +0530, Merlin Moncure wrote:
> That said, 'update' is the slowest operation for postgresql relative
> to other databases that are not MVCC.

Actually, it depends on how you do MVCC. In Oracle, DELETE is actually the most expensive operation, because they have to not only remove the row from the heap but also copy it to the undo log, and they need to do something with the indexes as well. Whereas we just update 4 bytes in the heap and that's it. An UPDATE in Oracle, OTOH, just needs to store whatever fields have changed in the undo log; if you haven't messed with indexed fields, it doesn't have to touch those either.

--
Decibel!, aka Jim Nasby        decibel@decibel.org
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)
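The "an UPDATE writes a whole new row version" behaviour discussed here can be observed directly through PostgreSQL's hidden `ctid` system column, which gives a row's physical location (the table and exact ctid values below are illustrative):

```sql
-- A row's ctid (physical address) changes after an UPDATE, because
-- PostgreSQL's MVCC writes a new tuple and leaves the old one dead
-- until VACUUM reclaims it.
CREATE TABLE mvcc_demo (x integer);
INSERT INTO mvcc_demo VALUES (1);
SELECT ctid, x FROM mvcc_demo;   -- e.g. (0,1)
UPDATE mvcc_demo SET x = 2;
SELECT ctid, x FROM mvcc_demo;   -- e.g. (0,2): a new tuple was written
```

This is why the whole-table update in the original post effectively rewrites every one of the 4.9 million wide rows, rather than patching a single field in place the way FoxPro does.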
Mark,

You are not alone in the fact that when you post your system specifications, CPU and memory are always listed while the disk I/O subsystem invariably is not. This is a very disk-intensive operation, and I suspect that your disk system is maxed out. If you want it faster, you will need more I/O capacity.

Regards,
Ken

On Tue, Aug 07, 2007 at 05:58:35AM -0700, Mark Makarowsky wrote:
> I have a table with 4,889,820 records in it. The
> table also has 47 fields. I'm having problems with
> update performance. Just as a test, I issued the
> following update:
>
> update valley set test='this is a test'
>
> This took 905641 ms. Isn't that kind of slow? There
> aren't any indexes, triggers, constraints or anything
> on this table. The version of Postgres is "PostgreSQL
> 8.2.4 on i686-pc-mingw32, compiled by GCC gcc.exe
> (GCC) 3.4.2 (mingw-special)". The operating
> environment is Windows 2003 Standard Edition w/service
> pack 2. It is 2.20 Ghz with 1.0 GB of RAM. Here is
> the results from Explain:
>
> "Seq Scan on valley (cost=0.00..1034083.57
> rows=4897257 width=601)"
>
> Here are the settings in the postgresql.conf. Any
Hi,

update valley set test='this is a test'

Such a query updates ALL of the records in the table. 5 million records * 47 fields - that can be several gigabytes of data. The system has to read and rewrite those gigabytes to change every record. This is a huge task. Try vacuuming and see if it helps; it can help a lot if you perform such "whole table updates" often.

Best regards,
Piotr Kolaczkowski
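For the table from the original post, that vacuuming could be as simple as the following (VACUUM FULL is optional and takes an exclusive lock, so it is only worth it if the table has already become badly bloated):

```sql
-- Reclaim the dead row versions left behind by the whole-table update,
-- with a report of what was removed:
VACUUM VERBOSE valley;

-- If the table has already roughly doubled in size, compact it
-- (exclusive lock for the duration) and refresh planner statistics:
VACUUM FULL valley;
ANALYZE valley;
```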