Обсуждение: PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

Поиск
Список
Период
Сортировка

PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

От
Michael Milligan
Дата:
Uh oh.  This is a first (for me).  I got this error on a transaction, it
did not crash the server but did abort the transaction (of course).

ERROR:  lock AccessShareLock on object 16385/16467/0 is already held

What was I doing?  A large transaction where I was pushing about 20
million rows into two tables using prepared transactions.  I have been
doing this for some time (over a year) on 8.2.6 without any problems.

I may be able to reproduce, but the transaction takes a long time to
complete... about a day.  I'm running it again now.

There was no information in the log.  What else can I do to get debug
information for you, provided it's reproducable...

Some db details:

db=> select version();
                                             version
-------------------------------------------------------------------------------------------------
 PostgreSQL 8.3.3 on amd64-portbld-freebsd6.3, compiled by GCC cc (GCC)
3.4.6 [FreeBSD] 20060305
(1 row)

This instance was also compiled with the following changes
(pg_config_manual.h):

diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index 354a3d3..20c40c4 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -23,7 +23,7 @@
  *
  * Changing BLCKSZ requires an initdb.
  */
-#define BLCKSZ 8192
+#define BLCKSZ 32768

 /*
  * RELSEG_SIZE is the maximum number of blocks allowed in one disk
@@ -69,7 +69,7 @@
  *
  * Changing this requires an initdb.
  */
-#define NAMEDATALEN 64
+#define NAMEDATALEN 128

 /*
  * Maximum number of arguments to a function.

This goes along with a UFS block size of 32768.

Regards,
Mike

--
Michael Milligan                                   -> milli@acmeps.com

Re: PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

От
Heikki Linnakangas
Дата:
Michael Milligan wrote:
> Uh oh.  This is a first (for me).  I got this error on a transaction, it
> did not crash the server but did abort the transaction (of course).
>
> ERROR:  lock AccessShareLock on object 16385/16467/0 is already held
>
> What was I doing?  A large transaction where I was pushing about 20
> million rows into two tables using prepared transactions.  I have been
> doing this for some time (over a year) on 8.2.6 without any problems.

Could you provide more details on the transaction? At what point did it
fail, during the insertss, or at commit perhaps? Are you really using
prepared transactions, aka two-phase commit, or did you mean prepared
statements?

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

От
Tom Lane
Дата:
Michael Milligan <milli@acmeps.com> writes:
> Uh oh.  This is a first (for me).  I got this error on a transaction, it
> did not crash the server but did abort the transaction (of course).

> ERROR:  lock AccessShareLock on object 16385/16467/0 is already held

Huh, that shouldn't happen.  What object is that?  The 16385 should be
a database OID, and the 16467 is most likely a table's OID within that
database.

> What was I doing?  A large transaction where I was pushing about 20
> million rows into two tables using prepared transactions.  I have been
> doing this for some time (over a year) on 8.2.6 without any problems.

> I may be able to reproduce, but the transaction takes a long time to
> complete... about a day.  I'm running it again now.

I wonder if it's possible that you overflowed nLocks in the local lock
table.  There are a lot of other things that likely would break first,
though, so that doesn't seem very probable.

Is this transaction longer than any comparable one you ever did before?

            regards, tom lane

Re: PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

От
Michael Milligan
Дата:
Heikki Linnakangas wrote:
> Michael Milligan wrote:
>> Uh oh.  This is a first (for me).  I got this error on a transaction, it
>> did not crash the server but did abort the transaction (of course).
>>
>> ERROR:  lock AccessShareLock on object 16385/16467/0 is already held
>>
>> What was I doing?  A large transaction where I was pushing about 20
>> million rows into two tables using prepared transactions.  I have been
>> doing this for some time (over a year) on 8.2.6 without any problems.
>
> Could you provide more details on the transaction? At what point did it
> fail, during the insertss, or at commit perhaps? Are you really using
> prepared transactions, aka two-phase commit, or did you mean prepared
> statements?

Yes, mistyped there, prepared _statements_.  It failed during an
EXECUTE, so during a prepared INSERT statement.

Regards,
Mike

--
Michael Milligan                                   -> milli@acmeps.com

Re: PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

От
Michael Milligan
Дата:
Tom Lane wrote:
> Michael Milligan <milli@acmeps.com> writes:
>> Uh oh.  This is a first (for me).  I got this error on a transaction, it
>> did not crash the server but did abort the transaction (of course).
>
>> ERROR:  lock AccessShareLock on object 16385/16467/0 is already held
>
> Huh, that shouldn't happen.  What object is that?  The 16385 should be
> a database OID, and the 16467 is most likely a table's OID within that
> database.
>
>> What was I doing?  A large transaction where I was pushing about 20
>> million rows into two tables using prepared transactions.  I have been
>> doing this for some time (over a year) on 8.2.6 without any problems.
>
>> I may be able to reproduce, but the transaction takes a long time to
>> complete... about a day.  I'm running it again now.
>
> I wonder if it's possible that you overflowed nLocks in the local lock
> table.  There are a lot of other things that likely would break first,
> though, so that doesn't seem very probable.
>
> Is this transaction longer than any comparable one you ever did before?

For 8.3.3, yes.  This is only the second such large transaction I've
done on 8.3.3, the first one I did was around 6.7 million rows and it
worked.

And a correction, the transaction that caused this error was inserting
13.5 million rows and failed near the end.

Regards,
Mike

--
Michael Milligan                                   -> milli@acmeps.com

Re: PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

От
Tom Lane
Дата:
Michael Milligan <milli@acmeps.com> writes:
> Tom Lane wrote:
>> Huh, that shouldn't happen.  What object is that?  The 16385 should be
>> a database OID, and the 16467 is most likely a table's OID within that
>> database.

Please answer the above question.

> And a correction, the transaction that caused this error was inserting
> 13.5 million rows and failed near the end.

Once you've determined which table the error message is talking about,
please show us what the transaction does with that table.

            regards, tom lane

Re: PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

От
Michael Milligan
Дата:
Tom Lane wrote:
> Michael Milligan <milli@acmeps.com> writes:
>> Tom Lane wrote:
>>> Huh, that shouldn't happen.  What object is that?  The 16385 should be
>>> a database OID, and the 16467 is most likely a table's OID within that
>>> database.
>
> Please answer the above question.

16385 is the database (db) and 16467 is the table (email) the prepared
inserts are executed against.  That table (email) is an inherited table
definition for partitioned tables that are named by date ranges.  E.g.,

CREATE TABLE email_2008_week27 ( PRIMARY KEY (id), CHECK ( ts >=
'2008-06-30'::timestamp AND ts < '2008-07-07'::timestamp ) ) INHERITS
(email)

>
>> And a correction, the transaction that caused this error was inserting
>> 13.5 million rows and failed near the end.
>
> Once you've determined which table the error message is talking about,
> please show us what the transaction does with that table.

You mean like:

BEGIN;
PREPARE msg (...) INSERT INTO email VALUES (...);
EXECUTE msg (...)
EXECUTE msg (...)
EXECUTE msg (...)
EXECUTE msg (...)
... repeated millions of times
COMMIT;

Regards,
Mike

--
Michael Milligan                                   -> milli@acmeps.com

Re: PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

От
Tom Lane
Дата:
Michael Milligan <milli@acmeps.com> writes:
> Tom Lane wrote:
>> Once you've determined which table the error message is talking about,
>> please show us what the transaction does with that table.

> You mean like:

> BEGIN;
> PREPARE msg (...) INSERT INTO email VALUES (...);
> EXECUTE msg (...)
> EXECUTE msg (...)
> EXECUTE msg (...)
> EXECUTE msg (...)
> ... repeated millions of times
> COMMIT;

What I'm wondering about is the sequence of operations that are executed
per row.  Could it be long enough that the email table is being touched
by more than 2 billion separate SQL operations within the transaction?

            regards, tom lane

Re: PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

От
Michael Milligan
Дата:
Tom Lane wrote:
> Michael Milligan <milli@acmeps.com> writes:
>> Tom Lane wrote:
>>> Once you've determined which table the error message is talking about,
>>> please show us what the transaction does with that table.
>
>> You mean like:
>
>> BEGIN;
>> PREPARE msg (...) INSERT INTO email VALUES (...);
>> EXECUTE msg (...)
>> EXECUTE msg (...)
>> EXECUTE msg (...)
>> EXECUTE msg (...)
>> ... repeated millions of times
>> COMMIT;
>
> What I'm wondering about is the sequence of operations that are executed
> per row.  Could it be long enough that the email table is being touched
> by more than 2 billion separate SQL operations within the transaction?

I can't imagine that, but then again, I'm not a PostgreSQL internals
expert either.  The partitioned email tables have indexes on them, but
in this part of the transaction, all I'm doing is EXECUTEs where the
prepared statement is INSERTing a single row.  Nothing else fancy with
the email table is going on.  There are other things going on with
values being stored to TEMP tables within the same transaction, would
those count toward the 2 billion?

I ran the transaction again and it failed again, so it is reproducable.
 I'm going to try it with a much smaller dataset and see what happens.

Regards,
Mike

--
Michael Milligan                                   -> milli@acmeps.com

Re: PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

От
Michael Milligan
Дата:
Tom Lane wrote:
>
> What I'm wondering about is the sequence of operations that are executed
> per row.  Could it be long enough that the email table is being touched
> by more than 2 billion separate SQL operations within the transaction?

FWIW, I've used the exact same code against PG 8.2.6 and have half a
dozen similar transactions that inserted more than 13.5 million rows,
with the largest transaction at a little over 25 million rows inserted
into the email table.

Regards,
Mike

--
Michael Milligan                                   -> milli@acmeps.com

Re: PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

От
Tom Lane
Дата:
Michael Milligan <milli@acmeps.com> writes:
> FWIW, I've used the exact same code against PG 8.2.6 and have half a
> dozen similar transactions that inserted more than 13.5 million rows,
> with the largest transaction at a little over 25 million rows inserted
> into the email table.

Hmph.  That seems to eliminate the overflow theory, because 8.2 has
essentially the same lock-counting code as 8.3.  Unless 8.3 is taking
out the lock a heckuva lot more than 8.2 did, but I can't think of a
reason for that to happen.

Now that we know you can reproduce it, we should think about how to get
some information out.  Are you in a position to build a locally modified
Postgres?  I could send you a patch to make that particular error report
dump out more information about the lock state, but a patch won't do you
any good if you aren't able to build from source.

            regards, tom lane

Re: PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

От
Michael Milligan
Дата:
Tom Lane wrote:
> Michael Milligan <milli@acmeps.com> writes:
>> FWIW, I've used the exact same code against PG 8.2.6 and have half a
>> dozen similar transactions that inserted more than 13.5 million rows,
>> with the largest transaction at a little over 25 million rows inserted
>> into the email table.
>
> Hmph.  That seems to eliminate the overflow theory, because 8.2 has
> essentially the same lock-counting code as 8.3.  Unless 8.3 is taking
> out the lock a heckuva lot more than 8.2 did, but I can't think of a
> reason for that to happen.
>
> Now that we know you can reproduce it, we should think about how to get
> some information out.  Are you in a position to build a locally modified
> Postgres?  I could send you a patch to make that particular error report
> dump out more information about the lock state, but a patch won't do you
> any good if you aren't able to build from source.

I can rebuild with a patch, sure.  I've got a spare machine I can mirror
the db over to (same hardware) to debug this.

Regards,
Mike

--
Michael Milligan                                   -> milli@acmeps.com

Re: PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

От
Tom Lane
Дата:
[ reincluding the mailing list ]

Michael Milligan <milli@acmeps.com> writes:
> Okay, it reproduces and surprise surprise nLocks does overflow...

>  ERROR:  lock AccessShareLock on object 16385/16467/0 is already held
> lock(0x87408a028) id(16385,16467,0,0,0,1) grantMask(a) waitMask(0)
> req(2,0,1,0,0,0,0,0)=3 grant(1,0,1,0,0,0,0,0)=2 wait(0)
> proclock(0x8743a7508) lock(0x87408a028) method(1) proc(0x8743aada8) hold(a)
> locallock(0xb29c78) nLocks(-2147483648) nOwners(2) mOwners(8)

Hah.  Okay, that shows that we'd never have reproduced it with a small
test case.

> So I'm guessing this is not a bug?  Just that for some reason 8.3.3 is
> grabbing more locks than 8.2.6 does and I have to adjust my batch
> processes to break things up into chunks of less than 10 million per
> transaction...  :-/

Well, I'm wondering exactly why 8.3 is taking more locks than 8.2 does,
because AFAICS it shouldn't.  Could you extract a complete example of
what your code does per tuple?  I would like to run a small test case
and just see where the LockAcquire calls come from in each version.
(Or, if you prefer, you can get out gdb and do the legwork yourself...)
It's true that breaking up the transaction is likely to be the best
short-term answer for you, but I think there might be a bug here.

In any case, now that we know that nLocks overflow is actually possible
within real-world transaction lengths, it'd behoove us to do something
about that in 8.4 or beyond.

            regards, tom lane

Re: PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

От
Bruce Momjian
Дата:
Tom Lane wrote:
> [ reincluding the mailing list ]
>
> Michael Milligan <milli@acmeps.com> writes:
> > Okay, it reproduces and surprise surprise nLocks does overflow...
>
> >  ERROR:  lock AccessShareLock on object 16385/16467/0 is already held
> > lock(0x87408a028) id(16385,16467,0,0,0,1) grantMask(a) waitMask(0)
> > req(2,0,1,0,0,0,0,0)=3 grant(1,0,1,0,0,0,0,0)=2 wait(0)
> > proclock(0x8743a7508) lock(0x87408a028) method(1) proc(0x8743aada8) hold(a)
> > locallock(0xb29c78) nLocks(-2147483648) nOwners(2) mOwners(8)
>
> Hah.  Okay, that shows that we'd never have reproduced it with a small
> test case.
>
> > So I'm guessing this is not a bug?  Just that for some reason 8.3.3 is
> > grabbing more locks than 8.2.6 does and I have to adjust my batch
> > processes to break things up into chunks of less than 10 million per
> > transaction...  :-/
>
> Well, I'm wondering exactly why 8.3 is taking more locks than 8.2 does,
> because AFAICS it shouldn't.  Could you extract a complete example of
> what your code does per tuple?  I would like to run a small test case
> and just see where the LockAcquire calls come from in each version.
> (Or, if you prefer, you can get out gdb and do the legwork yourself...)
> It's true that breaking up the transaction is likely to be the best
> short-term answer for you, but I think there might be a bug here.
>
> In any case, now that we know that nLocks overflow is actually possible
> within real-world transaction lengths, it'd behoove us to do something
> about that in 8.4 or beyond.

Is this a TODO?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

От
Tom Lane
Дата:
Bruce Momjian <bruce@momjian.us> writes:
> Tom Lane wrote:
>> In any case, now that we know that nLocks overflow is actually possible
>> within real-world transaction lengths, it'd behoove us to do something
>> about that in 8.4 or beyond.

> Is this a TODO?

Yes, although I'm still waiting for more info from Michael.

            regards, tom lane

Re: PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

От
Alvaro Herrera
Дата:
Tom Lane wrote:
> [ reincluding the mailing list ]
>
> Michael Milligan <milli@acmeps.com> writes:
> > Okay, it reproduces and surprise surprise nLocks does overflow...
>
> >  ERROR:  lock AccessShareLock on object 16385/16467/0 is already held
> > lock(0x87408a028) id(16385,16467,0,0,0,1) grantMask(a) waitMask(0)
> > req(2,0,1,0,0,0,0,0)=3 grant(1,0,1,0,0,0,0,0)=2 wait(0)
> > proclock(0x8743a7508) lock(0x87408a028) method(1) proc(0x8743aada8) hold(a)
> > locallock(0xb29c78) nLocks(-2147483648) nOwners(2) mOwners(8)
>
> Hah.  Okay, that shows that we'd never have reproduced it with a small
> test case.

This hasn't been fixed yet, has it?  Do you have any ideas on how to
actually fix the problem?  I wonder if it looks like enlarging nLocks,
or you're intending to attempt to reduce the number of locks taken.

A customer of ours started hitting this bug too, last week.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> Michael Milligan <milli@acmeps.com> writes:
>>> Okay, it reproduces and surprise surprise nLocks does overflow...
>>
>> Hah.  Okay, that shows that we'd never have reproduced it with a small
>> test case.

> This hasn't been fixed yet, has it?

Well, it has been addressed anyway...

2008-09-15 21:56  tgl

    * src/include/storage/: lock.h (REL8_1_STABLE), lock.h
    (REL8_3_STABLE), lock.h (REL8_0_STABLE), lock.h (REL8_2_STABLE),
    lock.h: Widen the nLocks counts in local lock tables from int to
    int64.    This forestalls potential overflow when the same table (or
    other object, but usually tables) is accessed by very many
    successive queries within a single transaction.  Per report from
    Michael Milligan.

    Back-patch to 8.0, which is as far back as the patch conveniently
    applies.  There have been no reports of overflow in pre-8.3
    releases, but clearly the risk existed all along.  (Michael's
    report suggests that 8.3 may consume lock counts faster than prior
    releases, but with no test case to look at it's hard to be sure
    about that.  Widening the counts seems a good future-proofing
    measure in any event.)

> A customer of ours started hitting this bug too, last week.

Tut tut, customer not on latest point release?

            regards, tom lane

Re: PG 8.3.3 - ERROR: lock AccessShareLock on object 16385/16467/0 is already held

От
Alvaro Herrera
Дата:
Tom Lane wrote:

> 2008-09-15 21:56  tgl
>
>     * src/include/storage/: lock.h (REL8_1_STABLE), lock.h
>     (REL8_3_STABLE), lock.h (REL8_0_STABLE), lock.h (REL8_2_STABLE),
>     lock.h: Widen the nLocks counts in local lock tables from int to
>     int64.

Doh, I didn't see it because I only checked the change log for lock.c,
not lock.h :-(

> > A customer of ours started hitting this bug too, last week.
>
> Tut tut, customer not on latest point release?

Wouldn't be a first for them (we try hard, they don't necessarily heed
advice).

Thanks!

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.