Strange serialization problem

От: Mischa Sandberg
Тема: Strange serialization problem
Дата: ,
Msg-id: 1113430614.425d9a560e5b5@webmail.telus.net
(см: обсуждение, исходный текст)
Ответ на: Re: [NOVICE] Many connections lingering  (Richard Huxton)
Список: pgsql-performance

Скрыть дерево обсуждения

Many connections lingering  (Slavisa Garic, )
 Re: [NOVICE] Many connections lingering  (Tom Lane, )
  Re: [NOVICE] Many connections lingering  (Greg Stark, )
   Re: [NOVICE] Many connections lingering  (Tom Lane, )
   Re: [NOVICE] Many connections lingering  (Slavisa Garic, )
    Re: [NOVICE] Many connections lingering  (Mark Lewis, )
     Re: [NOVICE] Many connections lingering  (Slavisa Garic, )
    Re: [NOVICE] Many connections lingering  (John DeSoi, )
    Re: [NOVICE] Many connections lingering  (Richard Huxton, )
     Strange serialization problem  (Mischa Sandberg, )
     Re: [NOVICE] Many connections lingering  (Slavisa Garic, )

I have a performance problem; I'd like any suggestions on where to continue
investigation.

A set of insert-only processes seems to serialize itself. :-(

The processes appear to be blocked on disk IO, and probably the table drive,
rather than the pg_xlog drive.

Each process is inserting a block of 10K rows into a table.
I'm guessing they are "serialized" because one process by itself takes 15-20
secs; running ten processes in parallel averages 100-150 secs (each), with
elapsed (wall) time  of 150-200 secs.

Polling pg_locks shows each process has (been granted) only the locks you would
expect. I RARELY see an Exclusive lock on an index, and then only on one index
at a time.

A sample from pg_locks:

TABLE/INDEX                  GRANTED PID  MODE
m_reason                           t 7340 AccessShare
message                            t 7340 AccessShare
message                            t 7340 RowExclusive
pk_message                         t 7340 AccessShare
tmp_message                        t 7340 AccessShare
("m_reason" is a one-row lookup table; see INSERT cmd below).

--------------------------
The query plan is quite reasonable (see below).

On a side note, this is the first app I've had to deal with that is sweet to
pg_xlog, but hammers the drive bearing the base table (3x the traffic).

"log_executor_stats" for a sample insert look reasonable (except the "elapsed"!)

! system usage stats:
! 308.591728 elapsed 3.480000 user 1.270000 system sec
! [4.000000 user 1.390000 sys total]
! 0/0 [0/0] filesystem blocks in/out
! 18212/15 [19002/418] page faults/reclaims, 0 [0] swaps
! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
! 0/0 [0/0] voluntary/involuntary context switches
! buffer usage stats:
! Shared blocks:       9675 read,       8781 written, buffer hit rate = 97.66%
! Local  blocks:        504 read,         64 written, buffer hit rate = 0.00%
! Direct blocks:          0 read,          0 written

Summarized "ps" output for the above backend process, sampled every 5 secs,
shows it is 94% in the 'D' state, 3% in the 'S' state.

================
== BACKGROUND ==
================

**SOFTWARE
- PG 7.4.6, RedHat 8.

----------------------------------
**HARDWARE
Xeon 2x2 2.4GHz 2GB RAM
4 x 73GB SCSI; pg_xlog and base on separate drives.

----------------------------------
**APPLICATION

Six machines post batches of 10K messages to the PG db server.
Machine #nn generates its ID keys as "nn00000000001"::bigint etc.

Each process runs:
- "COPY tmp_message FROM STDIN" loads its own one-use TEMP table.
- " INSERT INTO message
    SELECT tmp.* FROM tmp_message AS tmp
    JOIN m_reason ON m_reason.name = tmp.reason
    LEFT JOIN message USING (ID) WHERE message.ID is null
      (check required because crash recovery logic requires idempotent insert)
  "DROP TABLE tmp_message"  --- call me paranoid, this is 7.4

The COPY step time is almost constant when #processes varies from 1 to 10.

----------------------------------
**POSTGRES
pg_autovacuum is running with default parameters.

Non-default GUC values:
checkpoint_segments            = 512
default_statistics_target      = 200
effective_cache_size           = 500000
log_min_duration_statement     = 1000
max_fsm_pages                  = 1000000
max_fsm_relations              = 1000
random_page_cost               = 1
shared_buffers                 = 10000
sort_mem                       = 16384
stats_block_level              = true
stats_command_string           = true
stats_row_level                = true
vacuum_mem                     = 65536
wal_buffers                    = 2000

Wal_buffers and checkpoint_segments look outrageous,
but were tuned for another process, that posts batches of 10000 6KB rows
in a single insert.
----------------------------------
TABLE/INDEX STATISTICS

----------------------------------
MACHINE STATISTICS

ps gives the backend process as >98% in (D) state, with <1% CPU.

A "top" snapshot:
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total    2.0%    0.0%    0.8%   0.0%     0.0%   96.9%    0.0%
           cpu00    2.5%    0.0%    1.9%   0.0%     0.0%   95.4%    0.0%
           cpu01    1.7%    0.0%    0.1%   0.0%     0.3%   97.6%    0.0%
           cpu02    0.5%    0.0%    0.7%   0.0%     0.0%   98.6%    0.0%
           cpu03    3.1%    0.0%    0.5%   0.0%     0.0%   96.2%    0.0%
Mem:  2061552k av, 2041752k used,   19800k free,       0k shrd,   21020k buff

iostat reports that the $PGDATA/base drive is being worked but not overworked.
The pg_xlog drive is underworked:

       KBPS   TPS   KBPS   TPS   KBPS   TPS   KBPS   TPS
12:30      1     2    763    16     31     8   3336   269
12:40      5     3   1151    22      5     5   2705   320
                      ^pg_xlog^                  ^base^

The base drive has run as much as 10MBPS, 5K TPS.
----------------------------------
EXPLAIN ANALYZE output:
The plan is eminently reasonable. But there's no visible relationship
between the top level "actual time" and the "total runtime":

Nested Loop Left Join
                (cost=0.00..31109.64 rows=9980 width=351)
                (actual time=0.289..2357.346 rows=9980 loops=1)
  Filter: ("inner".id IS NULL)
  ->  Nested Loop
                (cost=0.00..735.56 rows=9980 width=351)
                (actual time=0.092..1917.677 rows=9980 loops=1)
        Join Filter: (("outer".name)::text = ("inner".reason)::text)
        ->  Seq Scan on m_reason r
                (cost=0.00..1.01 rows=1 width=12)
                (actual time=0.008..0.050 rows=1 loops=1)
        ->  Seq Scan on tmp_message t
                (cost=0.00..609.80 rows=9980 width=355)
                (actual time=0.067..1756.617 rows=9980 loops=1)
  ->  Index Scan using pk_message on message
                (cost=0.00..3.02 rows=1 width=8)
                (actual time=0.014..0.014 rows=0 loops=9980)
        Index Cond: ("outer".id = message.id)
Total runtime: 737401.687 ms

--
"Dreams come true, not free." -- S.Sondheim, ITW



В списке pgsql-performance по дате сообщения:

От: Slavisa Garic
Дата:
Сообщение: Re: [NOVICE] Many connections lingering
От: Kevin Brown
Дата:
Сообщение: Re: How to improve db performance with $7K?