Обсуждение: What else could I've done? COPY to unlogged tbl "hung"/locked the table

Поиск
Список
Период
Сортировка

What else could I've done? COPY to unlogged tbl "hung"/locked the table

От
amacvar
Дата:
Hi,

I'm relatively new to Pg.

Question:
What could have happened to cause the COPY to unlogged table to hang?
And what would the recreate of the unlogged table have "released" that a
reboot of the cluster did not.
Any other place that I could've looked or done something else other than
recreating the unlogged table  and/or rebooting the cluster.

Timeline:
Day before, the cluster host had to be rebooted since it was unresponsive
due to perhaps "a massive loop in vi re substitution" (as dev mentioned it)

Yesterday, a copy to an unlogged table thru java was hung. Usually it is
instantaneous, but it was hung for 2hrs before they contacted me.

pg_locks showed me the following:
  pid  |   locktype    | dbase  |                 table         |
virtualtransaction | virtualxid |   mode           | granted
 26554 | relation      | dbname | pg_class                      | 8/13
|            | AccessShareLock  | t
 26554 | relation      | dbname | pg_class_oid_index            | 8/13
|            | AccessShareLock  | t
 26554 | relation      | dbname | pg_class_relname_nsp_index    | 8/13
|            | AccessShareLock  | t
 26554 | relation      |        | pg_database                   | 8/13
|            | AccessShareLock  | t
 26554 | relation      |        | pg_database_datname_index     | 8/13
|            | AccessShareLock  | t
 26554 | relation      |        | pg_database_oid_index         | 8/13
|            | AccessShareLock  | t
 26554 | relation      | dbname | pg_locks                      | 8/13
|            | AccessShareLock  | t
 26554 | virtualxid    |        |                               | 8/13
| 8/13       | ExclusiveLock    | t
 27288 | relation      | dbname | unlogged_tbl_idx              | 11/33
|            | RowExclusiveLock | t
 27288 | relation      | dbname | unlogged_tbl                  | 11/33
|            | RowExclusiveLock | t
 27288 | relation      | dbname | unlogged_tbl_key              | 11/33
|            | RowExclusiveLock | t
 27288 | relation      | dbname | unlogged_tbl_pkey             | 11/33
|            | RowExclusiveLock | t
 27288 | transactionid |        |                               | 11/33
|            | ExclusiveLock    | t
 27288 | virtualxid    |        |                               | 11/33
| 11/33      | ExclusiveLock    | t

When the developer killed the shell script from his end, it didn't release
the locks.
neither pg_cancel nor pg_terminate of the copy job worked.
so finally a kill -9 of the COPY os process caused the cluster to reboot.

2012-08-29 14:29:58 EDT [8119]: [2-1] LOG:  server process (PID 30468) was
terminated by signal 9: Killed
2012-08-29 14:29:58 EDT [8119]: [3-1] LOG:  terminating any other active
server processes
2012-08-29 14:29:58 EDT [25798]: [3-1] WARNING:  terminating connection
because of crash of another server process
2012-08-29 14:29:58 EDT [25798]: [4-1] DETAIL:  The postmaster has commanded
this server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2012-08-29 14:29:58 EDT [25798]: [5-1] HINT:  In a moment you should be able
to reconnect to the database and repeat your command.

The load script was kicked off and it hung again.
A fast stop of the server failed, so an immediate stop had to be executed.

2012-08-29 14:45:44 EDT [8119]: [9-1] LOG:  received immediate shutdown
request
2012-08-29 14:45:44 EDT [27288]: [1-1] WARNING:  terminating connection
because of crash of another server process
2012-08-29 14:45:44 EDT [27288]: [2-1] DETAIL:  The postmaster has commanded
this server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2012-08-29 14:45:44 EDT [27288]: [3-1] HINT:  In a moment you should be able
to reconnect to the database and repeat your command.
2012-08-29 14:45:44 EDT [27288]: [4-1] CONTEXT:  COPY unlogged_tbl, line 1:
"50005001|<abacus><address>5291 math
drive</address><address2/><alt_names/><city>atlantis..."

The developer tried to execute using only the java class and not thru a
shell script.
The copy process was kicked off and it hung again.

I created another table using

db=# create table tmp_tbl as select * from unlogged_tbl;

and COPY to this tmp_tbl from the same file was instantaneous.
However i later noticed that this creates a logged table, NOT an unlogged
table (as i had expected :) )

I dropped and recreated the unlogged tables and that seems to have worked.
No more hanging of the load script. And so far so good.




--
View this message in context:
http://postgresql.1045698.n5.nabble.com/What-else-could-I-ve-done-COPY-to-unlogged-tbl-hung-locked-the-table-tp5721983.html
Sent from the PostgreSQL - admin mailing list archive at Nabble.com.


Re: What else could I've done? COPY to unlogged tbl "hung"/locked the table

От
amacvar
Дата:
In addition here is a backtrace of the process while it hung again today:
(thanks to our SA Eric)

    #0  0x000000396c8d4627 in semop () from /lib64/libc.so.6
    #1  0x00000000005dc313 in PGSemaphoreLock ()
    #2  0x0000000000616586 in LWLockAcquire ()
    #3  0x00000000004a718c in ginHeapTupleFastInsert ()
    #4  0x000000000049c808 in gininsert ()
    #5  0x00000000006dd968 in FunctionCall6Coll ()
    #6  0x0000000000474b07 in index_insert ()
    #7  0x00000000005664c8 in ExecInsertIndexTuples ()
    #8  0x0000000000511fa2 in DoCopy ()
    #9  0x0000000000626f71 in standard_ProcessUtility ()
    #10 0x0000000000624299 in PostgresMain ()
    #11 0x0000000000625309 in CreateQueryDesc ()
    #12 0x0000000000625d35 in PortalRun ()
    #13 0x0000000000623262 in PostgresMain ()
    #14 0x00000000005e7ae4 in ClosePostmasterPorts ()
    #15 0x00000000005e8861 in PostmasterMain ()
    #16 0x000000000058c5be in main ()



--
View this message in context:
http://postgresql.1045698.n5.nabble.com/What-else-could-I-ve-done-COPY-to-unlogged-tbl-hung-locked-the-table-tp5721983p5722162.html
Sent from the PostgreSQL - admin mailing list archive at Nabble.com.


Re: Re: What else could I've done? COPY to unlogged tbl "hung"/locked the table

От
Tom Lane
Дата:
amacvar <kahitarich-postgresForum@yahoo.com> writes:
> In addition here is a backtrace of the process while it hung again today:
> (thanks to our SA Eric)

>     #0  0x000000396c8d4627 in semop () from /lib64/libc.so.6
>     #1  0x00000000005dc313 in PGSemaphoreLock ()
>     #2  0x0000000000616586 in LWLockAcquire ()
>     #3  0x00000000004a718c in ginHeapTupleFastInsert ()
>     #4  0x000000000049c808 in gininsert ()

Interesting.  The most plausible explanation I can see for this stack
trace is that ginHeapTupleFastInsert is blocked trying to get buffer
lock on one of the GIN index pages it locks directly --- either the
metapage or the tail page of the fast-insert list.  That would suggest
a deadlock condition, that is somebody else trying to get the same two
locks in the other order.  (One of the ways that the LWLock mechanism
is lightweight is that it has no deadlock detection, so if this is
what's happening the deadlock would persist until you kill the
processes.)

However, if that's the case then there must be another process also
blocked.  Next time it happens, could you look around at *all* the
postgres processes and collect stack traces from any that don't seem to
be making progress?  It doesn't appear to me that ginHeapTupleFastInsert
itself is in the wrong here, so we need to find out what the conflicting
process is doing.

Also, as a temporary workaround, you might be able to dodge the problem
by disabling the FASTUPDATE storage parameter for your GIN indexes.
That won't help find the bug though, so if you could refrain from doing
that until we've identified the cause, it'd be appreciated.

            regards, tom lane


Re: What else could I've done? COPY to unlogged tbl "hung"/locked the table

От
amacvar
Дата:
Thank you Tom for the suggestion and the workaround.
I think we will be able to hold off on the temporary workaround.

Would the 2nd processes in the deadlock show up in pg_locks? (It didn't)
An insert from another table that i created to test the datafile did show up
as blocked.
I got confirmation from developers that no other process was running at the
time.

Also, even after a reboot, the COPY to the unlogged table hung, until it was
dropped and recreated.
After the latest occurrence, we took some downtime and I have rebuilt all
unlogged tables, no recurrences yet.



--
View this message in context:
http://postgresql.1045698.n5.nabble.com/What-else-could-I-ve-done-COPY-to-unlogged-tbl-hung-locked-the-table-tp5721983p5722754.html
Sent from the PostgreSQL - admin mailing list archive at Nabble.com.