Обсуждение: Opteron/FreeBSD/PostgreSQL performance poor

Поиск
Список
Период
Сортировка

Opteron/FreeBSD/PostgreSQL performance poor

От
andy rost
Дата:
We're in the process of porting from Informix 9.4 to PostgreSQL 8.1.3.
Our PostgreSQL server is an AMD Opteron Dual Core 275 with two 2.2 Ghz
64-bit processors. There are two internal drives and an external
enclosure containing 14 drives (configured as 7 pairs of mirrored drives
- four pairs for table spaces, one pair for dbcluster, two pairs for
point in time recovery). The operating system is FreeBSD 6.0-RELEASE #10

The output from ulimit -a is:

ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) 33554432
file size               (blocks, -f) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 11095
pipe size            (512 bytes, -p) 1
stack size              (kbytes, -s) 524288
cpu time               (seconds, -t) unlimited
max user processes              (-u) 5547
virtual memory          (kbytes, -v) unlimited

Shared memory kernel parameters are set to:

shmmax       1073741000
shmmin       1
shmall       262144
shmseg       128
shmmni       192
semmni       256
semmns       512
semmsl       256
semmap       256
semvmx       32767
shm_use_phys 1

The postgresql.conf file contains:

#---------------------------------------------------------------------------
# FILE LOCATIONS
#---------------------------------------------------------------------------

# The default values of these variables are driven from the -D command line
# switch or PGDATA environment variable, represented here as ConfigDir.

#data_directory = 'ConfigDir'           # use data in another directory
#hba_file = 'ConfigDir/pg_hba.conf'     # host-based authentication file
#ident_file = 'ConfigDir/pg_ident.conf' # IDENT configuration file

# If external_pid_file is not explicitly set, no extra pid file is written.

#external_pid_file = '(none)'           # write an extra pid file


#---------------------------------------------------------------------------
# CONNECTIONS AND AUTHENTICATION
#---------------------------------------------------------------------------

# - Connection Settings -

listen_addresses = '*'                  # what IP address(es) to listen on;
                                         # comma-separated list of
addresses;
                                         # defaults to 'localhost', '*'
= all
port = 5432
max_connections = 102
# note: increasing max_connections costs ~400 bytes of shared memory per
# connection slot, plus lock space (see max_locks_per_transaction).  You
# might also need to raise shared_buffers to support more connections.
superuser_reserved_connections = 2
unix_socket_directory = ''
unix_socket_group = ''
unix_socket_permissions = 0777          # octal
bonjour_name = ''                       # defaults to the computer name

# - Security & Authentication -

authentication_timeout = 60             # 1-600, in seconds
ssl = off
password_encryption = on
db_user_namespace = off

# Kerberos

krb_server_keyfile = ''
krb_srvname = 'postgres'
krb_server_hostname = ''                # empty string matches any
keytab entry
krb_caseins_users = off

# - TCP Keepalives -
# see 'man 7 tcp' for details

tcp_keepalives_idle = 0                 # TCP_KEEPIDLE, in seconds;
                                         # 0 selects the system default
tcp_keepalives_interval = 0             # TCP_KEEPINTVL, in seconds;
                                         # 0 selects the system default
tcp_keepalives_count = 0                # TCP_KEEPCNT;
                                         # 0 selects the system default


#---------------------------------------------------------------------------
# RESOURCE USAGE (except WAL)
#---------------------------------------------------------------------------

# - Memory -

shared_buffers = 125000                 # min 16 or max_connections*2,
8KB each
temp_buffers = 1000                     # min 100, 8KB each
max_prepared_transactions = 0           # can be 0 or more
# note: increasing max_prepared_transactions costs ~600 bytes of shared
memory
# per transaction slot, plus lock space (see max_locks_per_transaction).
work_mem = 10000                        # min 64, size in KB
maintenance_work_mem = 50000            # min 1024, size in KB
max_stack_depth = 500000                # in 100, size in KB
                                         # ulimit -a or ulimit -s

# - Free Space Map -

max_fsm_pages = 600000                  # min max_fsm_relations*16, 6
bytes each
max_fsm_relations = 1000                # min 100, ~70 bytes each

# - Kernel Resource Usage -

max_files_per_process = 1000            # min 25
preload_libraries = ''

# - Cost-Based Vacuum Delay -

vacuum_cost_delay = 0                   # 0-1000 milliseconds
vacuum_cost_page_hit = 1                # 0-10000 credits
vacuum_cost_page_miss = 10              # 0-10000 credits
vacuum_cost_page_dirty = 20             # 0-10000 credits
vacuum_cost_limit = 200                 # 0-10000 credits

# - Background writer -

bgwriter_delay = 200                    # 10-10000 milliseconds between
rounds
bgwriter_lru_percent = 1.0              # 0-100% of LRU buffers
scanned/round
bgwriter_lru_maxpages = 1000            # 0-1000 buffers max written/round
bgwriter_all_percent = 0.333            # 0-100% of all buffers
scanned/round
bgwriter_all_maxpages = 1000            # 0-1000 buffers max written/round


#---------------------------------------------------------------------------
# WRITE AHEAD LOG
#---------------------------------------------------------------------------

# - Settings -

fsync = on                              # turns forced synchronization
on or off
wal_sync_method = fsync                 # the default is the first option
                                         # supported by the operating
system:
                                         #   open_datasync
                                         #   fdatasync
                                         #   fsync
                                         #   fsync_writethrough
                                         #   open_sync
full_page_writes = on                   # recover from partial page writes
wal_buffers = 64                        # min 4, 8KB each
commit_delay = 0                        # range 0-100000, in microseconds
commit_siblings = 5                     # range 1-1000

# - Checkpoints -

checkpoint_segments = 120               # in logfile segments, min 1,
16MB each
checkpoint_timeout = 900                # range 30-3600, in seconds
checkpoint_warning = 900                # in seconds, 0 is off

# - Archiving -

archive_command = 'archive_wal -email -txtmsg "%p" "%f"'        #
command to use
                                         # to archive a logfile segment


#---------------------------------------------------------------------------
# QUERY TUNING
#---------------------------------------------------------------------------

# - Planner Method Configuration -

enable_bitmapscan = on
enable_hashagg = on
enable_hashjoin = on
enable_indexscan = on
enable_mergejoin = on
enable_nestloop = on
enable_seqscan = on
enable_sort = on
enable_tidscan = on

# - Planner Cost Constants -

effective_cache_size = 27462            # typically 8KB each
                                         # On TRU64 do /sbin/sysconfig
-q advfs
                                         # to get the size o
AdvfsCacheMaxPercent
                                         # (default is 7% of RAM). On
FreeBSD set
                                         # to sysctl -n vfs.hibufspace /
8192
random_page_cost = 2                    # units are one sequential page
fetch
                                         # cost
cpu_tuple_cost = 0.01                   # (same)
cpu_index_tuple_cost = 0.001            # (same)
cpu_operator_cost = 0.0025              # (same)

# - Genetic Query Optimizer -

geqo = on
geqo_threshold = 12
geqo_effort = 5                         # range 1-10
geqo_pool_size = 0                      # selects default based on effort
geqo_generations = 0                    # selects default based on effort
geqo_selection_bias = 2.0               # range 1.5-2.0

# - Other Planner Options -

default_statistics_target = 100         # range 1-1000
constraint_exclusion = off
from_collapse_limit = 8
join_collapse_limit = 8                 # 1 disables collapsing of explicit
                                         # JOINs


#---------------------------------------------------------------------------
# ERROR REPORTING AND LOGGING
#---------------------------------------------------------------------------

# - Where to Log -

log_destination = 'stderr'              # Valid values are combinations of
                                         # stderr, syslog and eventlog,
                                         # depending on platform.

# This is used when logging to stderr:
redirect_stderr = on                    # Enable capturing of stderr
into log
                                         # files

# These are only used if redirect_stderr is on:
log_directory = 'pg_log'                # Directory where log files are
written
                                         # Can be absolute or relative
to PGDATA
log_filename = 'postgresql_log.%a'      # Log file name pattern.
                                         # Can include strftime() escapes
log_truncate_on_rotation = on           # If on, any existing log file
of the
                                         # same name as the new log file
will be
                                         # truncated rather than
appended to. But
                                         # such truncation only occurs on
                                         # time-driven rotation, not on
restarts
                                         # or size-driven rotation.
Default is
                                         # off, meaning append to
existing files
                                         # in all cases.
log_rotation_age = 1440                 # Automatic rotation of logfiles
will
                                         # happen after so many minutes.
  0 to
                                         # disable.
log_rotation_size = 10240               # Automatic rotation of logfiles
will
                                         # happen after so many
kilobytes of log
                                         # output.  0 to disable.

# These are relevant when logging to syslog:
syslog_facility = 'LOCAL0'
syslog_ident = 'postgres'


# - When to Log -

client_min_messages = log               # Values, in order of decreasing
detail:
                                         #   debug5
                                         #   debug4
                                         #   debug3
                                         #   debug2
                                         #   debug1
                                         #   log
                                         #   notice
                                         #   warning
                                         #   error

log_min_messages = notice               # Values, in order of decreasing
detail:
                                         #   debug5
                                         #   debug4
                                         #   debug3
                                         #   debug2
                                         #   debug1
                                         #   info
                                         #   notice
                                         #   warning
                                         #   error
                                         #   log
                                         #   fatal
                                         #   panic

log_error_verbosity = default           # terse, default, or verbose
messages

log_error_verbosity = default           # terse, default, or verbose
messages

#log_min_error_statement = notice       # Values in order of increasing
severity:
                                         #   debug5
                                         #   debug4
                                         #   debug3
                                         #   debug2
                                         #   debug1
                                         #   info
                                         #   notice
                                         #   warning
                                         #   error
                                         #   panic(off)

log_min_duration_statement = 1000       # -1 is disabled, 0 logs all
statements
                                         # and their durations, in
milliseconds.

silent_mode = off                       # DO NOT USE without syslog or
                                         # redirect_stderr

# - What to Log -

debug_print_parse = off
debug_print_rewritten = off
debug_print_plan = off
debug_pretty_print = off
log_connections = off
log_disconnections = off
log_duration = off
log_line_prefix = '%t'                  # Special values:
                                         #   %u = user name
                                         #   %d = database name
                                         #   %r = remote host and port
                                         #   %h = remote host
                                         #   %p = PID
                                         #   %t = timestamp (no
milliseconds)
                                         #   %m = timestamp with
milliseconds
                                         #   %i = command tag
                                         #   %c = session id
                                         #   %l = session line number
                                         #   %s = session start timestamp
                                         #   %x = transaction id
                                         #   %q = stop here in non-session
                                         #        processes
                                         #   %% = '%'
                                         # e.g. '<%u%%%d> '
log_statement = 'none'                  # none, mod, ddl, all
log_hostname = off


#---------------------------------------------------------------------------
# RUNTIME STATISTICS
#---------------------------------------------------------------------------

# - Statistics Monitoring -

log_parser_stats = off
log_planner_stats = off
log_executor_stats = off
log_statement_stats = off

# - Query/Index Statistics Collector -

stats_start_collector = on
stats_command_string = on
stats_block_level = on
stats_row_level = on
stats_reset_on_server_start = off


#---------------------------------------------------------------------------
# AUTOVACUUM PARAMETERS
#---------------------------------------------------------------------------

autovacuum = off                        # enable autovacuum subprocess?
autovacuum_naptime = 60                 # time between autovacuum runs,
in secs
autovacuum_vacuum_threshold = 1000      # min # of tuple updates before
                                         # vacuum
autovacuum_analyze_threshold = 500      # min # of tuple updates before
                                         # analyze
autovacuum_vacuum_scale_factor = 0.4    # fraction of rel size before
                                         # vacuum
autovacuum_analyze_scale_factor = 0.2   # fraction of rel size before
                                         # analyze
autovacuum_vacuum_cost_delay = -1       # default vacuum cost delay for
                                         # autovac, -1 means use
                                         # vacuum_cost_delay
autovacuum_vacuum_cost_limit = -1       # default vacuum cost limit for
                                         # autovac, -1 means use
                                         # vacuum_cost_limit


#---------------------------------------------------------------------------
# CLIENT CONNECTION DEFAULTS
#---------------------------------------------------------------------------

# - Statement Behavior -

search_path = '$user'                   # schema names
default_tablespace = ''                 # a tablespace name, '' uses
                                         # the default
check_function_bodies = on
default_transaction_isolation = 'read committed'
default_transaction_read_only = off
statement_timeout = 7200000             # 0 is disabled, in milliseconds

# - Locale and Formatting -

datestyle = 'iso, mdy'
timezone = unknown                      # actually, defaults to TZ
                                         # environment setting
australian_timezones = off
extra_float_digits = 0                  # min -15, max 2
client_encoding = sql_ascii             # actually, defaults to database
                                         # encoding

# These settings are initialized by initdb -- they might be changed
lc_messages = 'C'                       # locale for system error message
                                         # strings
lc_monetary = 'C'                       # locale for monetary formatting
lc_numeric = 'C'                        # locale for number formatting
lc_time = 'C'                           # locale for time formatting

# - Other Defaults -

explain_pretty_print = on
dynamic_library_path = '$libdir'


#---------------------------------------------------------------------------
# LOCK MANAGEMENT
#---------------------------------------------------------------------------

deadlock_timeout = 7200000              # in milliseconds
max_locks_per_transaction = 500         # min 10
# note: each lock table slot uses ~220 bytes of shared memory, and there are
# max_locks_per_transaction * (max_connections + max_prepared_transactions)
# lock table slots.


#---------------------------------------------------------------------------
# VERSION/PLATFORM COMPATIBILITY
#---------------------------------------------------------------------------

# - Previous Postgres Versions -

add_missing_from = off
regex_flavor = advanced                 # advanced, extended, or basic
sql_inheritance = on
default_with_oids = off
escape_string_warning = off

# - Other Platforms & Clients -

transform_null_equals = off


#---------------------------------------------------------------------------
# CUSTOMIZED OPTIONS
#---------------------------------------------------------------------------

#custom_variable_classes = ''           # list of custom variable class
names

NOTE - raising shared_buffers had no effect
NOTE - changing stats_command_string had no effect

The pg_hba.conf file contains:

# TYPE    DATABASE         USER         CIDR-ADDRESS          METHOD

# "local" is for Unix domain socket connections only
local     all              postgre                            trust
# IPv4 local connections:
host      all              postgre       127.0.0.1/32         trust
hostnossl operations       opps,opps_adm 10.1.10.0/8          trust
hostnossl development,work all           10.1.10.0/8          trust
# IPv6 local connections:
host      all              postgre       ::1/128              trust

The pg_ident.conf file, other than comments, is empty.

We're running an OLTP database with a small number of connections (<50)
performing mostly reads and inserts on modest sized tables (largest is <
2,000,000 records).

The symptoms are:

a) All 4 CPUs are nearly always 0% idle;
b) The system load level is nearly always in excess of 20;
c) the output from vmstat -w 10 looks like:
  procs      memory      page                    disks     faults      cpu
  r b w     avm    fre  flt  re  pi  po  fr  sr aa0 aa1   in   sy  cs us
sy id
21 0 3 1242976 327936 2766   0   0   0 2264   0   2   2 17397 140332
104846 18 82  1
21 0 3 1242932 313312 2761   0   0   0 1223   0   1   1 15989 128666
107019 13 86  1
19 3 0 1245908 275356 3762   0   0   0 1962   0   3   3 16397 131584
105792 14 85  1
21 0 2 1243968 262616 2006   0   0   0 2036   0   1   1 15260 122801
107406 14 85  1
  4 19 0 1240996 247004 1589   0   0   0 984   0   1   0 15403 121323
108331 12 87  2
17 1 2 1230744 252888 2440   0   0   0 1807   0   1   0 17977 142618
105600 15 84  2
NOTE - small user demands and high system demands
d) Running top indicates a significant number or sblock states and
occasional smwai states;
e) ps auxww | grep postgres doesn't show anything abnormal;
f) ESQL applications are very slow.

We VACUUM ANALYZE user databases every four hours. We VACUUM template1
every 4 hours. We make a copy of the current WAL every minute. We create
a PIT recovery archive daily daily. None of these, individually seem to
place much strain on the server.

Hopefully I've supplied enough information to start diagnosing the
problem. Any ideas, thoughts, suggestions are greatly appreciated ...


Andy

--
--------------------------------------------------------------------------------
Andrew Rost
National Operational Hydrologic Remote Sensing Center (NOHRSC)
National Weather Service, NOAA
1735 Lake Dr. West, Chanhassen, MN 55317-8582
Voice: (952)361-6610 x 234
Fax: (952)361-6634
andy.rost@noaa.gov
http://www.nohrsc.noaa.gov
--------------------------------------------------------------------------------



Re: Opteron/FreeBSD/PostgreSQL performance poor

От
Stephen Frost
Дата:
* andy rost (andy.rost@noaa.gov) wrote:
> We're in the process of porting from Informix 9.4 to PostgreSQL 8.1.3.
> Our PostgreSQL server is an AMD Opteron Dual Core 275 with two 2.2 Ghz
> 64-bit processors. There are two internal drives and an external
> enclosure containing 14 drives (configured as 7 pairs of mirrored drives
> - four pairs for table spaces, one pair for dbcluster, two pairs for
> point in time recovery). The operating system is FreeBSD 6.0-RELEASE #10

Not sure it matters, but is the mirroring done with a hardware
controller or in software?

> shared_buffers = 125000                 # min 16 or max_connections*2,
> 8KB each
> temp_buffers = 1000                     # min 100, 8KB each
> max_prepared_transactions = 0           # can be 0 or more
> # note: increasing max_prepared_transactions costs ~600 bytes of shared
> memory
> # per transaction slot, plus lock space (see max_locks_per_transaction).
> work_mem = 10000                        # min 64, size in KB
> maintenance_work_mem = 50000            # min 1024, size in KB
> max_stack_depth = 500000                # in 100, size in KB
>                                         # ulimit -a or ulimit -s

These seem kind of.. backwards...  Just an example of one system I've
got shows:

shared_buffers = 10000
work_mem = 32768
maintenance_work_mem = 65535

Defaults for the rest.  This is more of a data-warehouse than an OLTP,
so I'm sure these aren't perfect for you, but you might try playing with
them some.

> # - Free Space Map -
> max_fsm_pages = 600000                  # min max_fsm_relations*16, 6
> bytes each

This seems somewhat hgih from the default of 20,000, but for a very
frequently changing database it may make sense.

> archive_command = 'archive_wal -email -txtmsg "%p" "%f"'        #
> command to use

Are WALs being archived very frequently?  Any idea if this takes much
time?  I wouldn't really think it'd be an issue, but might be useful to
know.

> effective_cache_size = 27462            # typically 8KB each

This seems like it might be a little low...  How much memory do you have
in the system?  Then again, with your shared_mem set so high, perhaps
it's not that bad, but it might make sense to swap those two settings,
or at least that'd be a more common PG setup.

> random_page_cost = 2                    # units are one sequential page

That's quite a bit lower than the default of 4...  May make sense for
you but it's certainly something to look at.

> We're running an OLTP database with a small number of connections (<50)
> performing mostly reads and inserts on modest sized tables (largest is <
> 2,000,000 records).
>
> The symptoms are:
>
> a) All 4 CPUs are nearly always 0% idle;
> b) The system load level is nearly always in excess of 20;

At a guess I'd say that the system is doing lots of sequential scans
rather than using indexes, and that's why the processes are ending up in
a disk-wait state, which makes the load go up.  Have you looked at the
plans which are being generated for the most common queries to see what
they're doing?

I'd also wonder if the shared_mem setting isn't set *too* high and
causing problems with the IPC or something...  Not something I've heard
of (generally, going up with shared_mem doesn't degrade performance,
just doesn't improve it) but might be possible.

> We VACUUM ANALYZE user databases every four hours. We VACUUM template1
> every 4 hours. We make a copy of the current WAL every minute. We create
> a PIT recovery archive daily daily. None of these, individually seem to
> place much strain on the server.

This doesn't sound too bad at all.  How long do the vacuum's run for?
If it's 3 hours, then that might start to be an issue with disk I/O
contention...

> Hopefully I've supplied enough information to start diagnosing the
> problem. Any ideas, thoughts, suggestions are greatly appreciated ...

Just my 2c, hopefully you'll get some better answers too. :)

    Thanks,

        Stephen

Вложения

Re: Opteron/FreeBSD/PostgreSQL performance poor

От
Vivek Khera
Дата:
On Jul 5, 2006, at 10:43 AM, andy rost wrote:

> We're in the process of porting from Informix 9.4 to PostgreSQL
> 8.1.3. Our PostgreSQL server is an AMD Opteron Dual Core 275 with
> two 2.2 Ghz 64-bit processors. There are two internal drives and an
> external enclosure containing 14 drives (configured as 7 pairs of
> mirrored drives - four pairs for table spaces, one pair for
> dbcluster, two pairs for point in time recovery). The operating
> system is FreeBSD 6.0-RELEASE #10

What RAID card are you hooked up to?  My best machines have LSI
MegaRAID 320-2X cards in them, with the pairs of each mirrored set on
opposing channels.  My second best machine (into which the above card
will not fit) uses an Adaptec 2230SLP card similarly configured.

The database is mirrored from one box to the other, and a vacuum
analyze takes ~10 hours.

You seem to have done the right things in your postgresql.conf file.

Things I'd like to know before offering advice: how big is your
database?  Ie, what is the "df" output for your various partitions
holding Pg data?  also, how much RAM is in this box?  And finally,
can you show the output of "iostsat -w3" for a few rows during times
when you're having poor performance.



Re: Opteron/FreeBSD/PostgreSQL performance poor

От
andy rost
Дата:
Hi Stephen,

Thanks for your input. My follow ups are interleaved below ...

Stephen Frost wrote:
> * andy rost (andy.rost@noaa.gov) wrote:
>
>>We're in the process of porting from Informix 9.4 to PostgreSQL 8.1.3.
>>Our PostgreSQL server is an AMD Opteron Dual Core 275 with two 2.2 Ghz
>>64-bit processors. There are two internal drives and an external
>>enclosure containing 14 drives (configured as 7 pairs of mirrored drives
>>- four pairs for table spaces, one pair for dbcluster, two pairs for
>>point in time recovery). The operating system is FreeBSD 6.0-RELEASE #10
>
>
> Not sure it matters, but is the mirroring done with a hardware
> controller or in software?
>

I'll have to check on this when our system administrator returns
tomorrow. I performed a quick test while the server was under load by
moving a couple of Gigs of data while running iostat.I was getting disk
I/O rates of about 125 KB per transaction, 250 transactions per second,
and 35 Mg per second on all drives.

>
>>shared_buffers = 125000                 # min 16 or max_connections*2,
>>8KB each
>>temp_buffers = 1000                     # min 100, 8KB each
>>max_prepared_transactions = 0           # can be 0 or more
>># note: increasing max_prepared_transactions costs ~600 bytes of shared
>>memory
>># per transaction slot, plus lock space (see max_locks_per_transaction).
>>work_mem = 10000                        # min 64, size in KB
>>maintenance_work_mem = 50000            # min 1024, size in KB
>>max_stack_depth = 500000                # in 100, size in KB
>>                                        # ulimit -a or ulimit -s
>
>
> These seem kind of.. backwards...  Just an example of one system I've
> got shows:
>
> shared_buffers = 10000
> work_mem = 32768
> maintenance_work_mem = 65535
>
> Defaults for the rest.  This is more of a data-warehouse than an OLTP,
> so I'm sure these aren't perfect for you, but you might try playing with
> them some.

Originally shared_buffers was set to 32768. I set it to its current
value out of desperations (newby response).

>
>
>># - Free Space Map -
>>max_fsm_pages = 600000                  # min max_fsm_relations*16, 6
>>bytes each
>
>
> This seems somewhat hgih from the default of 20,000, but for a very
> frequently changing database it may make sense.
>

This value is based on the output from VACUUM ANALYZE

>
>>archive_command = 'archive_wal -email -txtmsg "%p" "%f"'        #
>>command to use
>
>
> Are WALs being archived very frequently?  Any idea if this takes much
> time?  I wouldn't really think it'd be an issue, but might be useful to
> know.
>

Yes, about 100 times per hour. No, I don't think it takes much time

>
>>effective_cache_size = 27462            # typically 8KB each
>
>
> This seems like it might be a little low...  How much memory do you have
> in the system?  Then again, with your shared_mem set so high, perhaps
> it's not that bad, but it might make sense to swap those two settings,
> or at least that'd be a more common PG setup.

Oops, forgot to mention that we have 6 Gigs of memory. This value was
set based on sysctl -n vfs.hibufspace / 8192

>
>
>>random_page_cost = 2                    # units are one sequential page
>
>
> That's quite a bit lower than the default of 4...  May make sense for
> you but it's certainly something to look at.
>

This value set per web page entitiled "Annotated POSTGRESQL.CONF Guide
for PostgreSQL"

>
>>We're running an OLTP database with a small number of connections (<50)
>>performing mostly reads and inserts on modest sized tables (largest is <
>>2,000,000 records).
>>
>>The symptoms are:
>>
>>a) All 4 CPUs are nearly always 0% idle;
>>b) The system load level is nearly always in excess of 20;
>
>
> At a guess I'd say that the system is doing lots of sequential scans
> rather than using indexes, and that's why the processes are ending up in
> a disk-wait state, which makes the load go up.  Have you looked at the
> plans which are being generated for the most common queries to see what
> they're doing?

We thought of that too. However, executing:
select * from pg_stat_user_tables
suggests that we are using indexes where needed. We confirmed this by
checking and running manually queries reported by
select * from pg_stat_activity
while the server is suffering

>
> I'd also wonder if the shared_mem setting isn't set *too* high and
> causing problems with the IPC or something...  Not something I've heard
> of (generally, going up with shared_mem doesn't degrade performance,
> just doesn't improve it) but might be possible.
>

Possible I suppose but we had the same trouble while the server was
configured with 32768 buffers

>
>>We VACUUM ANALYZE user databases every four hours. We VACUUM template1
>>every 4 hours. We make a copy of the current WAL every minute. We create
>>a PIT recovery archive daily daily. None of these, individually seem to
>>place much strain on the server.
>
>
> This doesn't sound too bad at all.  How long do the vacuum's run for?
> If it's 3 hours, then that might start to be an issue with disk I/O
> contention...
>

VACUUM ANALYZE lasts about an hour and fifteen minutes

>
>>Hopefully I've supplied enough information to start diagnosing the
>>problem. Any ideas, thoughts, suggestions are greatly appreciated ...
>
>
> Just my 2c, hopefully you'll get some better answers too. :)
>

Again, many thanks. Is this the proper mail list for this problem or
should I also be addressing the administation mail list as well?

>     Thanks,
>
>         Stephen

--
--------------------------------------------------------------------------------
Andrew Rost
National Operational Hydrologic Remote Sensing Center (NOHRSC)
National Weather Service, NOAA
1735 Lake Dr. West, Chanhassen, MN 55317-8582
Voice: (952)361-6610 x 234
Fax: (952)361-6634
andy.rost@noaa.gov
http://www.nohrsc.noaa.gov
--------------------------------------------------------------------------------



Re: Opteron/FreeBSD/PostgreSQL performance poor

От
Mark Kirkwood
Дата:
andy rost wrote:
>

>>> effective_cache_size = 27462            # typically 8KB each
>>
>>
>> This seems like it might be a little low...  How much memory do you have
>> in the system?  Then again, with your shared_mem set so high, perhaps
>> it's not that bad, but it might make sense to swap those two settings,
>> or at least that'd be a more common PG setup.
>
> Oops, forgot to mention that we have 6 Gigs of memory. This value was
> set based on sysctl -n vfs.hibufspace / 8192
>

That vfs.hibufspace sysctl is a little deceptive IMHO - e.g on my
FreeBSD 6.1 system with 2G of ram it says 117276672 (i.e. about 112M),
but I have a 1G file cached entirely in ram at the moment... In FreeBSD
file pages are actually kept in the 'Inactive' section of memory, the
'Buffer' section is used as a 'window' to read 'em. For instance on my
system I see:

Mem: 4192K Active, 1303M Inact, 205M Wired, 12K Cache, 112M Buf, 491M Free

So my 1G file is cached in the 1303M of 'Inactive', but I have 112M of
buffer window for accessing this (and other) cached files. Now, I may
not have explained this that well, and it is quite confusing... but
hopefully you get the idea!

Now on the basis of the figures provided:
- max_connections=102 , each with work_mem=10000 (approx 1G in total)
- shared buffers=125000 (1G total)

it looks like you are only using about 2G of  your 6G, so there is a lot
left for caching file pages (lets say 3-4G or so).

I would think you can happily set effective_cache_size=393216 (i.e.
3G/8192). This will have the side effect of encouraging more index scans
(probably what you want I think).

Best wishes

Mark

Re: Opteron/FreeBSD/PostgreSQL performance poor

От
"Merlin Moncure"
Дата:
On 7/5/06, andy rost <andy.rost@noaa.gov> wrote:

> fsync = on                              # turns forced synchronization

have you tried turning this off and measuring performance?

> stats_command_string = on

I would turn this off unless you absoltely require it.  It is
expensive for what it does.

> a) All 4 CPUs are nearly always 0% idle;
> b) The system load level is nearly always in excess of 20;

I am guessing your system is spending all it's time syncing.  If so,
it's solvable (again, just run fsync=off for a bit and compare).

> c) the output from vmstat -w 10 looks like:
>   procs      memory      page                    disks     faults      cpu
>   r b w     avm    fre  flt  re  pi  po  fr  sr aa0 aa1   in   sy  cs us
> sy id
> 21 0 3 1242976 327936 2766   0   0   0 2264   0   2   2 17397 140332
> 104846 18 82  1

is that 100k context switches over 10 seconds or one second? that
might be something to check out.  pg 8.1 is regarded as the solution
to any cs problem, though.

> NOTE - small user demands and high system demands
> d) Running top indicates a significant number or sblock states and
> occasional smwai states;
> e) ps auxww | grep postgres doesn't show anything abnormal;
> f) ESQL applications are very slow.
>
> We VACUUM ANALYZE user databases every four hours. We VACUUM template1
> every 4 hours. We make a copy of the current WAL every minute. We create
> a PIT recovery archive daily daily. None of these, individually seem to
> place much strain on the server.

your server should be able to handle this easily.

> Hopefully I've supplied enough information to start diagnosing the
> problem. Any ideas, thoughts, suggestions are greatly appreciated ...
>

can you please approximate roughly how many transactions per second
your server is handling while you are getting the 20 load condition
(and, if possible, broken down into read and write transactions)?

merlin

Re: Opteron/FreeBSD/PostgreSQL performance poor

От
andy rost
Дата:
Mark,

Thanks for the insight. I increased the value of effective_cache_size to
3 Gigs and will monitor the performance over the weekend. Prior to this
change we discovered that we are filling up WALs to the tune of 2400 per
day. Moving the pg_xlog subdirectory to its own drive seemed to boost
the performance significantly. We're taking this one step at a time. On
Monday we plan to drop the number of shared memory buffers down to 50000
from its current value of 125000 (per the large number of
recommendations that this value should be held fairly low and
suggestions that vales in excess of 50000 may hamper performance).

Thanks again ...

Andy

Mark Kirkwood wrote:
> andy rost wrote:
>
>>
>
>>>> effective_cache_size = 27462            # typically 8KB each
>>>
>>>
>>>
>>> This seems like it might be a little low...  How much memory do you have
>>> in the system?  Then again, with your shared_mem set so high, perhaps
>>> it's not that bad, but it might make sense to swap those two settings,
>>> or at least that'd be a more common PG setup.
>>
>>
>> Oops, forgot to mention that we have 6 Gigs of memory. This value was
>> set based on sysctl -n vfs.hibufspace / 8192
>>
>
> That vfs.hibufspace sysctl is a little deceptive IMHO - e.g on my
> FreeBSD 6.1 system with 2G of ram it says 117276672 (i.e. about 112M),
> but I have a 1G file cached entirely in ram at the moment... In FreeBSD
> file pages are actually kept in the 'Inactive' section of memory, the
> 'Buffer' section is used as a 'window' to read 'em. For instance on my
> system I see:
>
> Mem: 4192K Active, 1303M Inact, 205M Wired, 12K Cache, 112M Buf, 491M Free
>
> So my 1G file is cached in the 1303M of 'Inactive', but I have 112M of
> buffer window for accessing this (and other) cached files. Now, I may
> not have explained this that well, and it is quite confusing... but
> hopefully you get the idea!
>
> Now on the basis of the figures provided:
> - max_connections=102 , each with work_mem=10000 (approx 1G in total)
> - shared buffers=125000 (1G total)
>
> it looks like you are only using about 2G of  your 6G, so there is a lot
> left for caching file pages (lets say 3-4G or so).
>
> I would think you can happily set effective_cache_size=393216 (i.e.
> 3G/8192). This will have the side effect of encouraging more index scans
> (probably what you want I think).
>
> Best wishes
>
> Mark

--
--------------------------------------------------------------------------------
Andrew Rost
National Operational Hydrologic Remote Sensing Center (NOHRSC)
National Weather Service, NOAA
1735 Lake Dr. West, Chanhassen, MN 55317-8582
Voice: (952)361-6610 x 234
Fax: (952)361-6634
andy.rost@noaa.gov
http://www.nohrsc.noaa.gov
--------------------------------------------------------------------------------



Re: Opteron/FreeBSD/PostgreSQL performance poor

От
andy rost
Дата:
Hi Merlin,

Thanks for the input. Please see below ...

Merlin Moncure wrote:
> On 7/5/06, andy rost <andy.rost@noaa.gov> wrote:
>
>> fsync = on                              # turns forced synchronization
>
>
> have you tried turning this off and measuring performance?

No, not yet. We're trying a couple of outher avenues before manipulating
this parameter.

>
>> stats_command_string = on
>
>
> I would turn this off unless you absoltely require it.  It is
> expensive for what it does.

We've turned this off

>
>> a) All 4 CPUs are nearly always 0% idle;
>> b) The system load level is nearly always in excess of 20;
>
>
> I am guessing your system is spending all it's time syncing.  If so,
> it's solvable (again, just run fsync=off for a bit and compare).
>

We've reduced the load significantly primarily by moving pg_xlog to its
own drive and by increasing the effective cache size. While we still see
  high load levels, they don't last very long. We're trying improve
performance from several angles but are taking it one step at a time.
Eventually we'll experiment with fsynch

>> c) the output from vmstat -w 10 looks like:
>>   procs      memory      page                    disks     faults
>> cpu
>>   r b w     avm    fre  flt  re  pi  po  fr  sr aa0 aa1   in   sy  cs us
>> sy id
>> 21 0 3 1242976 327936 2766   0   0   0 2264   0   2   2 17397 140332
>> 104846 18 82  1
>
>
> is that 100k context switches over 10 seconds or one second? that
> might be something to check out.  pg 8.1 is regarded as the solution
> to any cs problem, though.

According to man top, that's 100K per second. I'm interested in your
recommendation but am not sure what "pg 8.1" references

>
>> NOTE - small user demands and high system demands
>> d) Running top indicates a significant number or sblock states and
>> occasional smwai states;
>> e) ps auxww | grep postgres doesn't show anything abnormal;
>> f) ESQL applications are very slow.
>>
>> We VACUUM ANALYZE user databases every four hours. We VACUUM template1
>> every 4 hours. We make a copy of the current WAL every minute. We create
>> a PIT recovery archive daily daily. None of these, individually seem to
>> place much strain on the server.
>
>
> your server should be able to handle this easily.
>
>> Hopefully I've supplied enough information to start diagnosing the
>> problem. Any ideas, thoughts, suggestions are greatly appreciated ...
>>
>
> can you please approximate roughly how many transactions per second
> your server is handling while you are getting the 20 load condition
> (and, if possible, broken down into read and write transactions)?

Do you have any suggestions on how I might obtain these metrics?

>
> merlin

Thanks again Merlin ...

Andy

--
--------------------------------------------------------------------------------
Andrew Rost
National Operational Hydrologic Remote Sensing Center (NOHRSC)
National Weather Service, NOAA
1735 Lake Dr. West, Chanhassen, MN 55317-8582
Voice: (952)361-6610 x 234
Fax: (952)361-6634
andy.rost@noaa.gov
http://www.nohrsc.noaa.gov
--------------------------------------------------------------------------------



Re: Opteron/FreeBSD/PostgreSQL performance poor

От
Jeff Frost
Дата:
On Fri, 7 Jul 2006, andy rost wrote:

>> is that 100k context switches over 10 seconds or one second? that
>> might be something to check out.  pg 8.1 is regarded as the solution
>> to any cs problem, though.
>
> According to man top, that's 100K per second. I'm interested in your
> recommendation but am not sure what "pg 8.1" references

pg 8.1 means PostgreSQL 8.1.x (preferably 8.1.4) which is said to resolve many
context switch issues.

>>
>> can you please approximate roughly how many transactions per second
>> your server is handling while you are getting the 20 load condition
>> (and, if possible, broken down into read and write transactions)?
>
> Do you have any suggestions on how I might obtain these metrics?
>

Every five minutes do:

select sum(xact_commit) + sum(xact_rollback) as transactions from pg_stat_database;

and then divide the difference by 300 and that's your transactions per second:

select sum(xact_commit) + sum(xact_rollback) as transactions from
pg_stat_database;

  transactions
--------------
     231894522
(1 row)


<wait 300 seconds>

select sum(xact_commit) + sum(xact_rollback) as transactions from
pg_stat_database;

  transactions
--------------
     231907346
(1 row)

(231907346-231894522)/300 = 42.74666666666666666666 TPS


--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954

Re: Opteron/FreeBSD/PostgreSQL performance poor

От
"Merlin Moncure"
Дата:
On 7/7/06, andy rost <andy.rost@noaa.gov> wrote:
> Hi Merlin,
> Thanks for the input. Please see below ...

no problem!

[aside: jeff, great advice on tps determination]

> >> fsync = on                              # turns forced synchronization
> > have you tried turning this off and measuring performance?
>
> No, not yet. We're trying a couple of outher avenues before manipulating
> this parameter.

ok.  just keep in mind that keeping fsync on keeps an aboslute upper
limit on your tps that is largely driven by hardware.  with a raid
controller write caching controller caching writes the penalty is
extremely low but without write caching you might get ~150 tps on a
relatively high end raid system.  moreover, your disks are totally
utilized providing those 150 tps.  (transactions with writing, that
is)

symptoms of sync bottleneck are little/no cpu utilization, sustained
high iowait, and extremely poor performance and an unresponsive
server.  the best solution to sync issues is to have a
hardware/software strategy designed to deal with it.

if you are having periodic performance issues, you might be having
checkpoint storms.  Thse are controllable by tuning the wal and
especially the bgwriter.  These are easy to spot: you can do manual
checkpoints in psql and closely monitor the load.

> We've reduced the load significantly primarily by moving pg_xlog to its
> own drive and by increasing the effective cache size. While we still see
>   high load levels, they don't last very long. We're trying improve
> performance from several angles but are taking it one step at a time.
> Eventually we'll experiment with fsynch

good!

merlin