Thread: Postgres 7.4.6 hang in async_notify

Postgres 7.4.6 hang in async_notify

From:
pgsql-bugs@counterstorm.com
Date:
We have seen this twice.  The first time, it was on a 32 bit Fedora
Core 3 system during a `postgres stop`.  The stop hung because one of
the processes was in async_notify and refused to get out.  We were
unable to do further debugging because the db was in shutdown.  We
were forced to kill -9 the process.

The second time we saw this, today, was on a 64 bit Opteron Fedora
Core 3 system during operation.  One of our postgres processes was
hung in async_notify causing all further processing by that process to
be stopped.  All other processes were fine, including ones which used
notify.  After a variety of debugging and shutdown attempts (described
in further detail below), we were forced to kill -9 the process.

Unfortunately, we do not know how to reproduce this (though we are
willing to try if anyone has any ideas).  We are obviously looking for
a patch, a recovery method more subtle than kill -9, or at least
something to do the next time we see this problem to help further
debugging.  A description of what happened, a dump of pg_locks, some
`ps` examples, and some final postgres shutdown messages are included
below.  Any attention you could spare would be most appreciated.



The way the notify gets called is a bit convoluted.  Our C program
performs an insert in table T.  Some triggers run doing all sorts of
stuff in the background involving changes to other tables.  Under some
circumstances, a row is inserted by the trigger into table W (a
trigger off of that is executed) and the parent trigger performs a
notify.  This, while expensive, normally all works.  However, at a
customer site we saw the postgres process hang out in async_notify and
refuse to allow more inserts.
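
For illustration, the shape of that cascade is roughly the following
sketch; the table names t and w and the helper trigger functions are
hypothetical stand-ins, not our actual schema (the real top-level
trigger function is included further down in this thread):

----------------------------------------------------------------------
-- hypothetical sketch: an insert into t fires t_trig, whose function
-- inserts into w (firing w's own trigger) and then issues a NOTIFY
CREATE TABLE t (id serial PRIMARY KEY, payload text);
CREATE TABLE w (id serial PRIMARY KEY, t_id integer);

CREATE OR REPLACE FUNCTION w_side_effects() RETURNS trigger AS '
BEGIN
  -- secondary bookkeeping driven by the insert into w would go here
  RETURN new;
END; ' LANGUAGE 'plpgsql';

CREATE TRIGGER w_trig AFTER INSERT ON w
  FOR EACH ROW EXECUTE PROCEDURE w_side_effects();

CREATE OR REPLACE FUNCTION t_parent_trigger() RETURNS trigger AS '
BEGIN
  INSERT INTO w (t_id) VALUES (new.id);  -- fires w_trig
  NOTIFY alarm;                          -- the notify that lands in async_notify
  RETURN new;
END; ' LANGUAGE 'plpgsql';

CREATE TRIGGER t_trig AFTER INSERT ON t
  FOR EACH ROW EXECUTE PROCEDURE t_parent_trigger();
----------------------------------------------------------------------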

As far as we know, there was a listener (a perl process) for that
notify (we failed to look at pg_listener) but we were able to use
psql to listen on that value and notify on it as well.  We doubt the
notify rate was high enough to cause hundreds or even dozens of
notifies to be outstanding, but it is certainly possible that a
handful were.
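
Next time, a quick check from psql would presumably look something like
this (pg_listener is the 7.4 system catalog of registered listeners;
"alarm" is the condition name used by the trigger shown later in this
thread):

----------------------------------------------------------------------
-- who is listening, and are any notifications pending delivery?
SELECT relname, listenerpid, notification FROM pg_listener;

LISTEN alarm;   -- register this psql backend as a listener
NOTIFY alarm;   -- verify that delivery still works from this side
----------------------------------------------------------------------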

After struggling with this for a bit (which included sending the hung
process a SIGUSR2 because of some emails on this list which seemed to
suggest a relationship between the signal and notifies), we attempted
to stop the system.  We killed off all postgres initiators (including
the process doing the insert which caused it to hang in async_notify)
but the hung backend remained.  We initiated a database shutdown
(whapped the postmaster with a SIGTERM) but the postgres process remained in
async_notify.  We tried directly sending the hung process a SIGTERM
but that did not help.  Finally, we were forced to send the process
the uncatchable signal SIGKILL, which obviously worked.


psql=# select * from pg_locks;
 relation | database | transaction |  pid  |      mode       | granted
----------+----------+-------------+-------+-----------------+---------
          |          |      565845 | 11603 | ExclusiveLock   | t
    16759 |    17212 |             |   469 | AccessShareLock | t
          |          |     2153119 |   469 | ExclusiveLock   | t

different psql (pid 1013)

psql=# select * from pg_locks;
 relation | database | transaction |  pid  |      mode       | granted
----------+----------+-------------+-------+-----------------+---------
          |          |      565845 | 11603 | ExclusiveLock   | t
          |          |     2154368 |  1013 | ExclusiveLock   | t
    16759 |    17212 |             |  1013 | AccessShareLock | t
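
A more informative dump for next time might be something along these
lines (a sketch only; it joins on pg_stat_activity.procpid, and
current_query is only populated when stats_command_string is enabled):

----------------------------------------------------------------------
-- correlate locks with backends and relation names
SELECT l.pid, a.usename, c.relname, l.mode, l.granted, a.current_query
  FROM pg_locks l
  LEFT JOIN pg_stat_activity a ON a.procpid = l.pid
  LEFT JOIN pg_class c ON c.oid = l.relation;
----------------------------------------------------------------------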


----------------------------------------------------------------------
26535 ?        S      0:08 postmaster -i -D postgres/data
26536 ?        S      0:47 postgres: stats buffer process
26537 ?        S      0:27 postgres: stats collector process
26551 ?        Ss     0:00 initlog -c pg_autovacuum -U user1
26599 ?        S      0:19 pg_autovacuum -U user1
29684 ?        S     16:31 postgres: sysd sysd 10.139.140.58 idle
30881 ?        S      0:00 postgres: sysd sysd 10.139.140.58 idle
10549 ?        S      0:11 postgres: sysd sysd 10.139.140.58 idle
10626 ?        S      0:03 postgres: sysd sysd 10.139.140.58 idle
10668 ?        S      1:53 postgres: sysd sysd 10.139.140.58 idle
11178 ?        S      0:00 postgres: sysd sysd 10.139.140.58 idle
11258 ?        S      0:00 postgres: sysd sysd 10.139.140.58 idle
11603 ?        S      1:14 postgres: sysd sysd 10.139.140.58 async_notify
11718 ?        S      0:00 postgres: sysd sysd 10.139.140.58 idle
32506 pts/3    T      0:00 psql
32553 ?        S      0:00 postgres: sysd sysd 10.139.140.58 idle
32649 pts/3    T      0:00 psql -U user1
32696 ?        S      0:00 postgres: user1 sysd 10.139.140.58 idle
  374 ?        S      0:00 postgres: sysd sysd 10.139.140.58 idle
  389 pts/3    S+     0:00 grep postgres
----------------------------------------------------------------------

After killing off *all* postgres initiators, including the parent of
11603.

----------------------------------------------------------------------
26535 ?        S      0:08 postmaster -i -D postgres/data
26536 ?        S      0:47 postgres: stats buffer process
26537 ?        S      0:28 postgres: stats collector process
11603 ?        S      1:14 postgres: sysd sysd 10.139.140.58 async_notify
----------------------------------------------------------------------

Apr 25 11:01:02 localhost postgres[26535]: [5-1] LOG:  server process (PID 11603) was terminated by signal 9
Apr 25 11:01:02 localhost postgres[26535]: [6-1] LOG:  terminating any other active server processes
Apr 25 11:01:02 localhost postgres[26535]: [7-1] LOG:  all server processes terminated; reinitializing
Apr 25 11:01:02 localhost postgres[1947]: [8-1] LOG:  database system was interrupted at 2005-04-25 10:56:05 EDT
Apr 25 11:01:02 localhost postgres[1947]: [9-1] LOG:  checkpoint record is at 0/153152B8
Apr 25 11:01:02 localhost postgres[1947]: [10-1] LOG:  redo record is at 0/153152B8; undo record is at 0/0; shutdown FALSE
Apr 25 11:01:02 localhost postgres[1947]: [11-1] LOG:  next transaction ID: 2156499; next OID: 66294
Apr 25 11:01:02 localhost postgres[1947]: [12-1] LOG:  database system was not properly shut down; automatic recovery in progress
Apr 25 11:01:02 localhost postgres[1947]: [13-1] LOG:  record with zero length at 0/15315300
Apr 25 11:01:02 localhost postgres[1947]: [14-1] LOG:  redo is not required
Apr 25 11:01:04 localhost postgres[1947]: [15-1] LOG:  database system is ready
Apr 25 11:01:04 localhost postgres[1949]: [8-1] LOG:  shutting down
Apr 25 11:01:06 localhost postgres[1949]: [9-1] LOG:  database system is shut down

Re: Postgres 7.4.6 hang in async_notify

From:
Alvaro Herrera
Date:
On Mon, Apr 25, 2005 at 03:42:35PM -0400, pgsql-bugs@counterstorm.com wrote:

> The way the notify gets called is a bit convoluted.  Our C program
> performs an insert in table T.  Some triggers run doing all sorts of
> stuff in the background involving changes to other tables.  Under some
> circumstances, a row is inserted by the trigger into table W (a
> trigger off of that is executed) and the parent trigger performs a
> notify.  This, while expensive, normally all works.  However, at a
> customer site we saw the postgres process hang out in async_notify and
> refuse to allow more inserts.

What do you mean by "in the background"?  Do you fork() the backend
process?

IIRC there have been bugs in listen/notify by which a backend could hang
forever.  Some were fixed, but others may remain.  Can you post more
details of the involved functions and triggers?

--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
One man's impedance mismatch is another man's layer of abstraction.
(Lincoln Yeoh)

Re: Postgres 7.4.6 hang in async_notify

From:
pgsql-bugs@counterstorm.com
Date:
In message <20050425205541.GC29834@dcc.uchile.cl>, Alvaro Herrera writes:

    On Mon, Apr 25, 2005 at 03:42:35PM -0400, pgsql-bugs@counterstorm.com wrote:

    > The way the notify gets called is a bit convoluted.  Our C program
    > performs an insert in table T.  Some triggers run doing all sorts of
    > stuff in the background involving changes to other tables.  Under some
    > circumstances, a row is inserted by the trigger into table W (a
    > trigger off of that is executed) and the parent trigger performs a
    > notify.  This, while expensive, normally all works.  However, at a
    > customer site we saw the postgres process hang out in async_notify and
    > refuse to allow more inserts.

    What do you mean by "in the background"?  Do you fork() the backend
    process?

Sorry, I was being imprecise.  It just does a lot of dynamic queries,
inserts, updates, lookups, etc.

    IIRC there have been bugs in listen/notify by which a backend could hang
    forever.  Some were fixed, but others may remain.  Can you post more
    details of the involved functions and triggers?

Here is the top layer.  If you need the second level down of functions,
let me know.  Yes, this function does imply you could get multiple
notifies generated in a single trigger.

Thanks,

----------------------------------------------------------------------
psql=> \d tagset_1
                                                      Table "tagset_1"
                  Column  |              Type              |                       Modifiers
--------------------------+--------------------------------+-------------------------------------------------------
 internal_dataset_key     | text                           | not null
 internal_record_id       | integer                        | not null default nextval('tagset_1_seq'::text)
 internal_eventin_ulink   | text                           |
 internal_revid           | text                           |
 event_timestamp          | timestamp(6) without time zone |
 code_all_start           | timestamp(6) without time zone |
 code_all_end             | timestamp(6) without time zone |
 code_language_list       | smallint[]                     |
 code_src_id_list         | integer[]                      |
 code_dst_id_list         | integer[]                      |
 code_dst_transport_list  | integer[]                      |

Indexes:
    "tagset_1_pkey" primary key, btree (internal_dataset_key, internal_record_id)
Triggers:
    zzz_tagset_1_table_trigger AFTER INSERT ON tagset_1 FOR EACH ROW EXECUTE PROCEDURE munch_2k4_entry()

create or replace function munch_2k4_entry()
returns trigger as '
declare
  srcid integer;
  lang smallint;
  dsttranss int[];
  dstids int[];
  dsttrans int;
  ev_start timestamp;
  ev_last timestamp;
  thres int;
  hitids int[];
  hits int;
  id bigint;
  pred_id int;
  scen text;
  entry_tag text;
  ev_hit boolean := false;
  alarm_hit boolean := false;
begin
  lang := new.code_language_list[1];
  srcid := new.code_src_net_list[1];
  dstids := new.code_dst_net_list;
  dsttranss := new.code_dst_transport_list;
  if ((array_size(dsttranss) is null) or (array_size(dsttranss)<1)) then
     dsttrans := null;
  else
     dsttrans := dsttranss[1];
  end if;
  entry_tag := get_entry_tag(new.internal_record_id, new.internal_dataset_key);
  ev_start := new.code_all_start;
  ev_last := new.code_all_end;


  hitids := get_first_hitids(dstids);
  hits := array_size(hitids);
  if is_ndt_id(srcid) then
     if hits>0 then
       pred_id := 1;
       scen := ''WMI-NDT'';
       ev_hit := massage_evidence(srcid, dsttrans, lang, pred_id, hits, entry_tag, ev_start, ev_last);
       if ev_hit then
         alarm_hit := maybe_raise_alarm(srcid, dsttrans, lang, scen);
         if alarm_hit then
           notify alarm;
         end if;
       end if;
     end if;
     return new;
  end if;

  if hits > 0 then
    pred_id := 2;
    scen := ''WMI-EEDT'';
    ev_hit := massage_evidence(srcid, dsttrans, lang, pred_id, hits, entry_tag, ev_start, ev_last);
    if ev_hit then
      alarm_hit := maybe_raise_alarm(srcid, dsttrans, lang, scen);
      if alarm_hit then
         notify alarm;
      end if;
    end if;
  end if;

  hitids := get_inside_hitids(dstids);
  hits := array_size(hitids);
  if hits>0 then
    pred_id := 3;
    scen := ''WMI-EIDT'';
    ev_hit := massage_evidence(srcid, dsttrans, lang, pred_id, hits, entry_tag, ev_start, ev_last);
    if ev_hit then
      alarm_hit := maybe_raise_alarm(srcid, dsttrans, lang, scen);
      if alarm_hit then
         notify alarm;
      end if;
    end if;
  end if;

  hitids := dstids;
  hits := array_size(dstids);
  if hits>0 then
    pred_id := 4;
    scen := ''WMI-EADT'';
    ev_hit := massage_evidence(srcid, dsttrans, lang, pred_id, hits, entry_tag, ev_start, ev_last);
    if ev_hit then
      alarm_hit := maybe_raise_alarm(srcid, dsttrans, lang, scen);
      if alarm_hit then
         notify alarm;
      end if;
    end if;
  end if;

  return new;
end; '
language 'plpgsql';

Re: Postgres 7.4.6 hang in async_notify

От
Tom Lane
Дата:
pgsql-bugs@counterstorm.com writes:
> We have seen this twice.  The first time, it was on a 32 bit Fedora
> Core 3 system during a `postgres stop`.  The stop hung because one of
> the processes was in async_notify and refused to get out.  We were
> unable to do further debugging because the db was in shutdown.  We
> were forced to kill -9 the process.

Next time try attaching to the stuck backend with gdb so you can get
a stack trace.  It's difficult to speculate what could be wrong without
a better-localized idea of where it's hanging.  It'd be a good idea to
rebuild with --enable-debug so that the stack trace will be most useful
when you get it.

            regards, tom lane