Обсуждение: Shared row locking
Hi,
I've been thinking on how to do shared row locking.  There are some very
preliminar ideas on this issue.  Please comment; particularly if any
part of it sounds unworkable or too incomplete.
There are several problems to be solved here: the grammar, the internal
SelectStmt representation, how to store and share the info between
backends, how to clean up at transaction end, and how to clean up at
backend crash.
The Grammar
===========
The SQL spec does not say anything on this respect (that I can find).
It only talks of "FOR UPDATE" and "FOR READ ONLY".  However, because the
FK code uses SPI to do the locking, we definitely have to expose the
funcionality through SQL.  So I think we need a new clause, which I
propose to be "FOR SHARE".
The Parser and SelectStmt
=========================
The parser uses for_update_clause and opt_for_update_clause
nonterminals.  I assume it's best to change them to (new)
locking_clause, which can in turn be for_update_clause or (new)
for_share_clause.
SelectStmt currently has a forUpdate field (a List to the to-be-locked
tables, or an empty list meaning all of them).  We could simply add
another list, say forShare, or use a common list and a flag saying that
it's one or the other.  I prefer adding a new list.  (Same with the
Query node.)
How to Store the Info
=====================
This is the really interesting part.  I have two ideas, one mine (btree)
and other Bruce's (bitmap).
Using a B-tree
--------------
When a backend wants to lock a tuple, it set a bit in its infomask.
Then it inserts to a btree in a special tablespace, using
RelationId-BlockNumber-OffsetNumber as key, and BackendId-TransactionId
as value; actually, an array with a single element containing those two
values.
When a backend wants to lock a tuple that is already locked, it goes to
the btree and inserts itself into the array.  To do this, it inserts a
new index item (the enlarged array) and delete the previous one.  No
other backend may want to insert simultaneously (thus causing an ugly
race condition), because we hold an exclusive lock on the tuple's heap
page's buffer.
At transaction end, nothing special happens (tuples are not unlocked
explicitly).
When someone wants to know if the tuple is locked (to mark it FOR
UPDATE, or to delete it), it checks the infomask.  If it says it's
locked, it goes to check the btree.  If the array contains only
BackendId-TransactionId pairs that are no longer running, then the 
tuple is not locked and can be deleted/marked (and the btree can be
cleaned up).  Else, it will have to wait, using XactLockTableWait, for
the first transaction in the array that is still running.  We can be
sure that no one will try to share-lock the tuple while we check the
btree because we hold an exclusive lock on the tuple's heap page's
buffer.
Note that to check whether a transaction is running we need to lock
SInvalLock.  To minimize the time we hold it, we save the BackendId so
we don't have to scan the whole shmInvalBuffer->procState array, only
the item that we need to look at.  Another possibility would be to use
stock TransactionIdIsInProgress and save the extra 4 bytes of storage.
At server restart, the btree is created empty (or just deleted).  There
is one btree per database.
Using a Bitmap
--------------
First we create a counter called shared lock row counter.  Then we
create a file like pg_clog, and each counter slot has a bit for every
backend.  When we want to shared lock a row we increment the counter and
put that counter value on the row, and set our backend bit in the new
file.  We also store that counter value in our backend local memory.  On
commit we go through that local memory list and reset all our bits for
those counters.  When a row has all zeros, it can be recycled like we do
with pg_clog.
Problems and random comments
============================
There is possibility of starvation, if somebody wants to lock
exclusively a tuple and shared lockers are coming all the time.  Not
sure how to solve this.
The wakeup mechanism is not discussed ... is there a special need for
something beyond what we can do with XactLockTable{Insert,Wait} ?
Thanks to tablespaces, it's very easy to create special Relations that
can be dealt with by standard buffer and storage manager, etc.
The btree idea:
- does not need crash recovery.  Maybe we could use a stripped down version of nbtree.  This could cause a
maintanibilitynightmare.
 
- can't hold more than 300 or so simultaneous lockers (because of value length, limited to 1/3 of a page).  I doubt
thisis a real problem.
 
- could have problems (excessive storage requirements) in the long run because of empty or almost-empty pages.
The bitmap idea:
- seems to have lower overhead
- can use the same lazy cleanup mechanism exposed for the btree idea (in which case we don't need the list in local
memory).
- What can happen in presence of large max_connections settings?  Is this a real problem?
-- 
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
Oh, oh, las chicas galacianas, lo harán por las perlas,
¡Y las de Arrakis por el agua! Pero si buscas damas
Que se consuman como llamas, ¡Prueba una hija de Caladan! (Gurney Halleck)
			
		Alvaro Herrera wrote: > The btree idea: > - does not need crash recovery. Maybe we could use a stripped down > version of nbtree. This could cause a maintanibility nightmare. Are you saying the btree is an index with no heap? If so, what about the xid's? Are they just in the btree? How does the btree get cleaned up over time? > The bitmap idea: > - seems to have lower overhead > > - can use the same lazy cleanup mechanism exposed for the btree idea (in > which case we don't need the list in local memory). You mean all empty/zero rows can be removed? Can we guarantee that on commit we can clean up the bitmap? If not the idea doesn't work. > - What can happen in presence of large max_connections settings? Is > this a real problem? I thought about that. 50 backends is 7 bytes, 1000 backends is 128 bytes. For a large number of backends you could just allow X concurrent locks and use space X*4 bytes. I think the basic issue is that the btree can be of variable length while the bitmap has to be of a fixed length. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
> The SQL spec does not say anything on this respect (that I can find). > It only talks of "FOR UPDATE" and "FOR READ ONLY". However, because the > FK code uses SPI to do the locking, we definitely have to expose the > funcionality through SQL. So I think we need a new clause, which I > propose to be "FOR SHARE". MySQL uses LOCK IN SHARE MODE: http://dev.mysql.com/doc/mysql/en/InnoDB_locking_reads.html
Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> Using a B-tree
> At transaction end, nothing special happens (tuples are not unlocked
> explicitly).
I don't think that works, because there is no guarantee that an entry
will get cleaned out before the XID counter wraps around.  Worst case,
you might think that a tuple is locked when the XID is left over from
the previous cycle.  (Possibly this could be avoided by cleaning out old
XIDs in this table whenever we truncate pg_clog, but that seems a tad
messy.)  I'm also a bit concerned about how we avoid table bloat if
there's no proactive cleanup at transaction end.
I think I like the pg_clog-modeled structure a bit better.  However it
could be objected that that puts a hard limit of 4G share-locked tuples
at any one time.
In the clog-modeled idea, it wasn't real clear how you decide whether to
assign a new counter value to a previously locked row, or reuse its
previous counter.  You must *not* assign a new value when the existing
entry still has bits set, but you probably do want to be aggressive
about assigning new values when you can; else it gets tough to be sure
that the log can be truncated in a reasonable time.
ISTM that your description is conflating several orthogonal issues:
how do we identify entries in this data structure (by CTID, or a shared
counter that increments each time a new lock is acquired); how do we
index the data structure (btree or linear array); and what is stored in
each entry (array of XIDs, or bitmap indexed by BackendId).  Not all of
the eight combinations work, but we do have more alternatives than the
two offered, even without coming up with any new ideas ;-)
> Note that to check whether a transaction is running we need to lock
> SInvalLock.  To minimize the time we hold it, we save the BackendId so
> we don't have to scan the whole shmInvalBuffer->procState array, only
> the item that we need to look at.  Another possibility would be to use
> stock TransactionIdIsInProgress and save the extra 4 bytes of storage.
I'm a bit worried about deadlocks and race conditions associated with
the conflict between locking a page of this data structure and locking
SInvalLock.
> At server restart, the btree is created empty (or just deleted).  There
> is one btree per database.
One per cluster you meant, right?  (Else we can't do locking of rows in
shared tables.)
        regards, tom lane
			
		Bruce Momjian <pgman@candle.pha.pa.us> writes:
> You mean all empty/zero rows can be removed?  Can we guarantee that on
> commit we can clean up the bitmap?  If not the idea doesn't work.
For whatever data structure we use, we may reset the structure to empty
during backend-crash recovery.  So your objection boils down to "what if
a backend exits normally but forgets to clean up its locks?"  Assuming
that doesn't happen isn't any worse than assuming a backend will clean
up its shared memory state on non-crash exit, so I don't think it's a
serious concern.
That brings another thought: really what this is all about is working
around the fact that the standard lock manager can only cope with a
finite number of coexisting locks, because it's working in a fixed-size
shared memory arena.  Maybe we should instead think about ways to allow
the existing lock table to spill to disk when it gets too big.  That
would eliminate max_locks_per_transaction as a source of hard failures,
which would be a nice benefit.
        regards, tom lane
			
		Tom Lane wrote: > Alvaro Herrera <alvherre@dcc.uchile.cl> writes: > > Using a B-tree > > > At transaction end, nothing special happens (tuples are not unlocked > > explicitly). > > I don't think that works, because there is no guarantee that an entry > will get cleaned out before the XID counter wraps around. Worst case, > you might think that a tuple is locked when the XID is left over from > the previous cycle. (Possibly this could be avoided by cleaning out old > XIDs in this table whenever we truncate pg_clog, but that seems a tad > messy.) I'm also a bit concerned about how we avoid table bloat if > there's no proactive cleanup at transaction end. > > I think I like the pg_clog-modeled structure a bit better. However it > could be objected that that puts a hard limit of 4G share-locked tuples > at any one time. > > In the clog-modeled idea, it wasn't real clear how you decide whether to > assign a new counter value to a previously locked row, or reuse its > previous counter. You must *not* assign a new value when the existing > entry still has bits set, but you probably do want to be aggressive > about assigning new values when you can; else it gets tough to be sure > that the log can be truncated in a reasonable time. I assume you check and if all the bits are zero, you don't reuse it and get a new counter. In fact you shouldn't reuse it in case the log is being truncated while you are looking. :-) > ISTM that your description is conflating several orthogonal issues: > how do we identify entries in this data structure (by CTID, or a shared > counter that increments each time a new lock is acquired); how do we > index the data structure (btree or linear array); and what is stored in > each entry (array of XIDs, or bitmap indexed by BackendId). Not all of > the eight combinations work, but we do have more alternatives than the > two offered, even without coming up with any new ideas ;-) True. The only advantage to a bitmap vs. just a counter of locked backends is that you can clean out your own backend bits from the table without having to record them in your memory. However, because recording your own counters in local memory doesn't require fixed shared memory we might be better just recording the shared lock indexes in your local backend memory and just use an int4 counter in the pg_clog-like file that we can decrement on backend commit. However I am unclear that we can guarantee an exiting backend will do that. Certainly it is cleared on server start. > > Note that to check whether a transaction is running we need to lock > > SInvalLock. To minimize the time we hold it, we save the BackendId so > > we don't have to scan the whole shmInvalBuffer->procState array, only > > the item that we need to look at. Another possibility would be to use > > stock TransactionIdIsInProgress and save the extra 4 bytes of storage. > > I'm a bit worried about deadlocks and race conditions associated with > the conflict between locking a page of this data structure and locking > SInvalLock. > > > At server restart, the btree is created empty (or just deleted). There > > is one btree per database. > > One per cluster you meant, right? (Else we can't do locking of rows in > shared tables.) He meant one per database, I think. I suppose we would need another one for global tables or disallow shared locking of them. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
BTom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > You mean all empty/zero rows can be removed? Can we guarantee that on > > commit we can clean up the bitmap? If not the idea doesn't work. > > For whatever data structure we use, we may reset the structure to empty > during backend-crash recovery. So your objection boils down to "what if > a backend exits normally but forgets to clean up its locks?" Assuming > that doesn't happen isn't any worse than assuming a backend will clean > up its shared memory state on non-crash exit, so I don't think it's a > serious concern. > > That brings another thought: really what this is all about is working > around the fact that the standard lock manager can only cope with a > finite number of coexisting locks, because it's working in a fixed-size > shared memory arena. Maybe we should instead think about ways to allow > the existing lock table to spill to disk when it gets too big. That > would eliminate max_locks_per_transaction as a source of hard failures, > which would be a nice benefit. Agreed. Once concern I have about allowing the lock table to spill to disk is that a large number of FOR UPDATE locks could push out lock entries used by other backends, causing very poor performance. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
On Sun, 2004-12-19 at 04:04, Bruce Momjian wrote: > BTom Lane wrote: > > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > > You mean all empty/zero rows can be removed? Can we guarantee that on > > > commit we can clean up the bitmap? If not the idea doesn't work. > > > > For whatever data structure we use, we may reset the structure to empty > > during backend-crash recovery. So your objection boils down to "what if > > a backend exits normally but forgets to clean up its locks?" Assuming > > that doesn't happen isn't any worse than assuming a backend will clean > > up its shared memory state on non-crash exit, so I don't think it's a > > serious concern. > > > > That brings another thought: really what this is all about is working > > around the fact that the standard lock manager can only cope with a > > finite number of coexisting locks, because it's working in a fixed-size > > shared memory arena. Maybe we should instead think about ways to allow > > the existing lock table to spill to disk when it gets too big. That > > would eliminate max_locks_per_transaction as a source of hard failures, > > which would be a nice benefit. > > Agreed. Once concern I have about allowing the lock table to spill to > disk is that a large number of FOR UPDATE locks could push out lock > entries used by other backends, causing very poor performance. In similar circumstances, DB2 uses these techniques: - when locktable X % full, then escalate locks to full table locks: both locktable memory and threshold% are instance parameters - use a lock mode called Cursor Stability that locks only those rows currently being examined by a cursor, those maintaining the lock usage of a cursor at a constant level as the cursor moves. The lock mode of Repeatable Read *does* lock all rows read (these are not actually mutually exclusive) The first one is a real pain, but the idea might be of use somewhere. The second is a usable, practical alternative that should be considered and might avoid the need to write the spill-to-disk code and then discover it performs very badly. -- Best Regards, Simon Riggs
On Sun, Dec 19, 2004 at 09:52:01AM +0000, Simon Riggs wrote: Simon, > In similar circumstances, DB2 uses these techniques: > > - when locktable X % full, then escalate locks to full table locks: both > locktable memory and threshold% are instance parameters This is not useful at all, because the objective of this exercise is to downgrade locks, from exclusive row locking (SELECT ... FOR UPDATE) to shared row locking. Keep in mind that this is for foreign key locking, which is one area where deadlocks are frequently encountered because we use too strong a lock. > - use a lock mode called Cursor Stability that locks only those rows > currently being examined by a cursor, those maintaining the lock usage > of a cursor at a constant level as the cursor moves. The lock mode of > Repeatable Read *does* lock all rows read We can't release the locks until the end of the transaction. > The second is a usable, practical alternative that should be considered > and might avoid the need to write the spill-to-disk code and then > discover it performs very badly. I don't think any of them serves the objective I'm trying to accomplish :-( Thanks for your ideas anyway. And keep having them! -- Alvaro Herrera (<alvherre[@]dcc.uchile.cl>) "Uno puede defenderse de los ataques; contra los elogios se esta indefenso"
Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> On Sun, Dec 19, 2004 at 09:52:01AM +0000, Simon Riggs wrote:
>> - use a lock mode called Cursor Stability that locks only those rows
>> currently being examined by a cursor, those maintaining the lock usage
>> of a cursor at a constant level as the cursor moves. The lock mode of
>> Repeatable Read *does* lock all rows read
> We can't release the locks until the end of the transaction.
Well, what Simon is essentially proposing is that we work around a
potential performance issue by exposing a less-clean semantic model
to users.  I'd prefer to adopt such a thing as a last resort, not a
first one.
        regards, tom lane
			
		On Sun, 19 Dec 2004, Alvaro Herrera wrote: > On Sun, Dec 19, 2004 at 09:52:01AM +0000, Simon Riggs wrote: > > Simon, > >> In similar circumstances, DB2 uses these techniques: >> >> - when locktable X % full, then escalate locks to full table locks: both >> locktable memory and threshold% are instance parameters > > This is not useful at all, because the objective of this exercise is to > downgrade locks, from exclusive row locking (SELECT ... FOR UPDATE) to > shared row locking. Keep in mind that this is for foreign key locking, > which is one area where deadlocks are frequently encountered because we > use too strong a lock. Actually it might help in some scenarios. Remember, we're not talking about upgrading shared locks to exclusive locks. We're only talking about locking more rows than necessary (all rows). I believe DB2 does the escalation also for perfomance. Getting a full table lock is cheaper than individually locking every row. - Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On Sun, 19 Dec 2004, Alvaro Herrera wrote:
>> This is not useful at all, because the objective of this exercise is to
>> downgrade locks, from exclusive row locking (SELECT ... FOR UPDATE) to
>> shared row locking.
> Actually it might help in some scenarios. Remember, we're not talking 
> about upgrading shared locks to exclusive locks. We're only talking about 
> locking more rows than necessary (all rows).
Nonetheless, it would mean that locks would be taken depending on
implementation-dependent, not-visible-to-the-user considerations.
Shared locks can still cause deadlocks, and so you would have an
unreliable application, which would only be unreliable under load.
As I said in connection with the other proposal, weird user-visible
semantics should be the last resort not the first.
        regards, tom lane
			
		On Sun, 19 Dec 2004, Tom Lane wrote: > Heikki Linnakangas <hlinnaka@iki.fi> writes: >> On Sun, 19 Dec 2004, Alvaro Herrera wrote: >>> This is not useful at all, because the objective of this exercise is to >>> downgrade locks, from exclusive row locking (SELECT ... FOR UPDATE) to >>> shared row locking. > >> Actually it might help in some scenarios. Remember, we're not talking >> about upgrading shared locks to exclusive locks. We're only talking about >> locking more rows than necessary (all rows). > > Nonetheless, it would mean that locks would be taken depending on > implementation-dependent, not-visible-to-the-user considerations. > Shared locks can still cause deadlocks, and so you would have an > unreliable application, which would only be unreliable under load. > > As I said in connection with the other proposal, weird user-visible > semantics should be the last resort not the first. I agree that lock escalation is not a good solution, we run into problems with DB2 lock escalation at work all the time. - Heikki
On Sun, 2004-12-19 at 18:31, Alvaro Herrera wrote: > Thanks for your ideas anyway. And keep having them! No problem. Just giving some info on what works and doesn't work in other implementations. -- Best Regards, Simon Riggs
Tom Lane wrote: > Heikki Linnakangas <hlinnaka@iki.fi> writes: > > On Sun, 19 Dec 2004, Alvaro Herrera wrote: > >> This is not useful at all, because the objective of this exercise is to > >> downgrade locks, from exclusive row locking (SELECT ... FOR UPDATE) to > >> shared row locking. > > > Actually it might help in some scenarios. Remember, we're not talking > > about upgrading shared locks to exclusive locks. We're only talking about > > locking more rows than necessary (all rows). > > Nonetheless, it would mean that locks would be taken depending on > implementation-dependent, not-visible-to-the-user considerations. > Shared locks can still cause deadlocks, and so you would have an > unreliable application, which would only be unreliable under load. > > As I said in connection with the other proposal, weird user-visible > semantics should be the last resort not the first. > One idea is to stamp the same shared lock counter on multiple rows and just prevent another application from also sharing that lock if too many rows are locked. It would wait for the lock to be released. This prevents the problem of having to store thousands of shared lock counters. There would have to be some flag so say which shared counters are only for a single backend. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
On Sun, Dec 19, 2004 at 11:35:02PM +0200, Heikki Linnakangas wrote: > On Sun, 19 Dec 2004, Tom Lane wrote: > > >Heikki Linnakangas <hlinnaka@iki.fi> writes: > >>On Sun, 19 Dec 2004, Alvaro Herrera wrote: > >>>This is not useful at all, because the objective of this exercise is to > >>>downgrade locks, from exclusive row locking (SELECT ... FOR UPDATE) to > >>>shared row locking. > > > >>Actually it might help in some scenarios. Remember, we're not talking > >>about upgrading shared locks to exclusive locks. We're only talking about > >>locking more rows than necessary (all rows). > > > >Nonetheless, it would mean that locks would be taken depending on > >implementation-dependent, not-visible-to-the-user considerations. > >Shared locks can still cause deadlocks, and so you would have an > >unreliable application, which would only be unreliable under load. > > > >As I said in connection with the other proposal, weird user-visible > >semantics should be the last resort not the first. > > I agree that lock escalation is not a good solution, we run into problems > with DB2 lock escalation at work all the time. Does anyone know how Oracle deals with this? They use MVCC like PostgreSQL, so they'd be a better source for inspiration. -- Jim C. Nasby, Database Consultant decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
On Sat, 18 Dec 2004, Bruce Momjian wrote: > BTom Lane wrote: > > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > > You mean all empty/zero rows can be removed? Can we guarantee that on > > > commit we can clean up the bitmap? If not the idea doesn't work. > > > > For whatever data structure we use, we may reset the structure to empty > > during backend-crash recovery. So your objection boils down to "what if > > a backend exits normally but forgets to clean up its locks?" Assuming > > that doesn't happen isn't any worse than assuming a backend will clean > > up its shared memory state on non-crash exit, so I don't think it's a > > serious concern. > > > > That brings another thought: really what this is all about is working > > around the fact that the standard lock manager can only cope with a > > finite number of coexisting locks, because it's working in a fixed-size > > shared memory arena. Maybe we should instead think about ways to allow > > the existing lock table to spill to disk when it gets too big. That > > would eliminate max_locks_per_transaction as a source of hard failures, > > which would be a nice benefit. > > Agreed. Once concern I have about allowing the lock table to spill to > disk is that a large number of FOR UPDATE locks could push out lock > entries used by other backends, causing very poor performance. I think if we allow the lock manager to spill to disk (and I think we do need to allow it) then we should also be able to control the amount of shared memory allocated. There's little point spilling the lock table to disk if we have huge amounts of memory. Gavin
On Mon, 2004-12-20 at 06:34, Jim C. Nasby wrote: > On Sun, Dec 19, 2004 at 11:35:02PM +0200, Heikki Linnakangas wrote: > > On Sun, 19 Dec 2004, Tom Lane wrote: > > > > >Heikki Linnakangas <hlinnaka@iki.fi> writes: > > >>On Sun, 19 Dec 2004, Alvaro Herrera wrote: > > >>>This is not useful at all, because the objective of this exercise is to > > >>>downgrade locks, from exclusive row locking (SELECT ... FOR UPDATE) to > > >>>shared row locking. > > > > > >>Actually it might help in some scenarios. Remember, we're not talking > > >>about upgrading shared locks to exclusive locks. We're only talking about > > >>locking more rows than necessary (all rows). > > > > > >Nonetheless, it would mean that locks would be taken depending on > > >implementation-dependent, not-visible-to-the-user considerations. > > >Shared locks can still cause deadlocks, and so you would have an > > >unreliable application, which would only be unreliable under load. > > > > > >As I said in connection with the other proposal, weird user-visible > > >semantics should be the last resort not the first. > > > > I agree that lock escalation is not a good solution, we run into problems > > with DB2 lock escalation at work all the time. > > Does anyone know how Oracle deals with this? They use MVCC like > PostgreSQL, so they'd be a better source for inspiration. Oracle only uses MVCC in its widest sense - versioning info is stored in UNDO tablespaces (rollback segments). That implementation is covered by aggressive patent attorneys. Oracle implements locking at row level within each data block. The block header expands dynamically to accommodate a list of transactions that can access, with minimum and maximum sizes settable by the DBA. This works reasonably well. Each SELECT FOR UPDATE is actually a block-write, whether or not the rows are modified, which has some additional code to recover from this without crashing/redo. Later transactions end up cleaning up the lock header info (which later became a problem in Parallel Server). https://cwisdb.cc.kuleuven.ac.be/ora10doc/server.101/b10743/consist.htm -- Best Regards, Simon Riggs
On Mon, Dec 20, 2004 at 06:23:24PM +1100, Gavin Sherry wrote: > On Sat, 18 Dec 2004, Bruce Momjian wrote: > > Agreed. Once concern I have about allowing the lock table to spill to > > disk is that a large number of FOR UPDATE locks could push out lock > > entries used by other backends, causing very poor performance. > > I think if we allow the lock manager to spill to disk (and I think we do > need to allow it) then we should also be able to control the amount of > shared memory allocated. There's little point spilling the lock table to > disk if we have huge amounts of memory. This is a interesting idea. Gavin also mentioned to me we should also control the amount of memory the shared inval queue uses. Causing all backends to refill the cache is (I assume) a big performance hit. Maybe we should expose this via new knobs in postgresql.conf, to ease implementation, or maybe not, to rid users of configuring it. As a start, we could have WARNINGs when the lock table is spilled and when a SInvalReset occurs. So the user can know whether he should increase memory use for those settings. -- Alvaro Herrera (<alvherre[@]dcc.uchile.cl>) "Thou shalt not follow the NULL pointer, for chaos and madness await thee at its end." (2nd Commandment for C programmers)
Gavin Sherry <swm@linuxworld.com.au> writes:
> I think if we allow the lock manager to spill to disk (and I think we do
> need to allow it) then we should also be able to control the amount of
> shared memory allocated.
You mean like max_locks_per_transaction?
        regards, tom lane
			
		Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> Gavin also mentioned to me we should also control the amount of memory
> the shared inval queue uses.
Perhaps, but I've really seen no evidence that there's a need to worry
about that.  Without demonstrated problems I'd sooner keep that code a
bit simpler and faster.
        regards, tom lane
			
		Tom lane wrote: > Gavin Sherry <swm@linuxworld.com.au> writes: > > I think if we allow the lock manager to spill to disk (and I think we do > > need to allow it) then we should also be able to control the amount of > > shared memory allocated. > > You mean like max_locks_per_transaction? IMO, max_locks_per_transaction could use a better name a little more documentation. I've mentioned this a couple of times before, but there is at least one type of lock that does not expire when the transaction ends (user locks). I may be over my head here, but I think lock spillover is dangerous. In the extreme situations where this would happen, it would be a real performance buster. Personally, I would rather see locks escalate when the table gets full, or at least allow this as a configuration parameter. Merlin
"Merlin Moncure" <merlin.moncure@rcsonline.com> writes:
> I may be over my head here, but I think lock spillover is dangerous.  In
> the extreme situations where this would happen, it would be a real
> performance buster.  Personally, I would rather see locks escalate when
> the table gets full, or at least allow this as a configuration
> parameter. 
To me, "performance buster" is better than "random, unrepeatable
deadlock failures".  In any case, if we find we *can't* implement this
in a non-performance-busting way, then it would be time enough to look
at alternatives that force the user to manage the problem for us.
        regards, tom lane
			
		On Mon, Dec 20, 2004 at 11:47:41AM -0500, Tom Lane wrote: > To me, "performance buster" is better than "random, unrepeatable > deadlock failures". In any case, if we find we *can't* implement this > in a non-performance-busting way, then it would be time enough to look > at alternatives that force the user to manage the problem for us. I am confused by this discussion. To solve the problem I want to solve, we have three orthogonal possibilities: 1. implement shared row locking using the ideas outlined in the mail starting this thread (pg_clog-like seems to be the winner, details TBD). 2. implement shared lock table spill-to-disk mechanism. 3. implement lock escalation. Some people seems to think 3 is better than 2. What do they think of 1? Some facts: - DB2 implements 3 and some people have problems with deadlocks. - 2 could have a performance impact, and we don't even know how to start. For example, what would be an algorithm to decidewhat locks to send to disk? - I am interested in implementing 1, maybe 2. Don't know about 3. -- Alvaro Herrera (<alvherre[@]dcc.uchile.cl>) One man's impedance mismatch is another man's layer of abstraction. (Lincoln Yeoh)
Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> To solve the problem I want to solve, we have three orthogonal
> possibilities:
> 1. implement shared row locking using the ideas outlined in the mail
> starting this thread (pg_clog-like seems to be the winner, details TBD).
> 2. implement shared lock table spill-to-disk mechanism.
> 3. implement lock escalation.
Check.
> - 2 could have a performance impact, and we don't even know how to
>   start.  For example, what would be an algorithm to decide what locks
>   to send to disk?
LRU, perhaps?  That's all open for investigation still.
#1 could have a pretty serious performance impact, too.  For small
numbers of FOR UPDATE locks (too few to force spill to disk) I would
expect #2 to substantially beat #1.  #1 essentially imposes the worst
case performance at all times, whereas #2 degrades (at a currently
unknown rate) when there are lots and lots of FOR UPDATE locks.
Most of the applications I've seen don't take out that many FOR UPDATE
locks at once, so I'm unclear on the rationale for choosing a fixed-but-
poor performance curve over one that is fast for few locks and degrades
for many locks.  Especially when the value of "many" is
user-configurable.
Furthermore, we have also seen issues with too many locks on ordinary
objects, which #2 would solve simultaneously.
So I feel that #2 is clearly the approach to try first.  If we find that
we can't do spill-to-disk without serious performance degradation, then
I'd be inclined to try #1 next.  I really don't care for the
user-visible semantics changes implied by #3 ...
        regards, tom lane
			
		> In general, I agree with Tom: I haven't seen many programs that use > extended SELECT FOR UPDATE logic. However, the ones I have seen have > been batch style programs written using a whole-table cursor - these > latter ones have been designed for the cursor stability approach. I think if we add shared locks we should by default behave like cursor stability isolation level, that only holds one shared lock for the current cursor row. The semantics are well defined in SQL. If you want repeatable read you need to change isolation level. I know FK checks will need to keep the locks, but I would special case that. Andreas
On Mon, Dec 20, 2004 at 03:09:24PM -0300, Alvaro Herrera wrote: > To solve the problem I want to solve, we have three orthogonal > possibilities: > > 1. implement shared row locking using the ideas outlined in the mail > starting this thread (pg_clog-like seems to be the winner, details TBD). > > 2. implement shared lock table spill-to-disk mechanism. > > 3. implement lock escalation. > > > Some people seems to think 3 is better than 2. What do they think of 1? > > > Some facts: > > - DB2 implements 3 and some people have problems with deadlocks. FWIW, I have seen DB2 deadlock on a single row update statement in a database with no one else connected. It was an issue with how they were implementing repeatable read concurrency. What this indicates to me is that they've got some 'issues' with their locking mechanisms, and that #3 shouldn't be thrown out just because DB2 doesn't do it right. AFAIK Sybase and SQL-server also use lock escalation and I've not heard of issues with it. -- Jim C. Nasby, Database Consultant decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828 Windows: "Where do you want to go today?" Linux: "Where do you want to go tomorrow?" FreeBSD: "Are you guys coming, or what?"
On Thu, 16 Dec 2004 21:54:14 -0300, Alvaro Herrera <alvherre@dcc.uchile.cl> wrote: > Else, it will have to wait, using XactLockTableWait, for >the first transaction in the array that is still running. We can be >sure that no one will try to share-lock the tuple while we check the >btree because we hold an exclusive lock on the tuple's heap page's >buffer. Do you really want to XactLockTableWait while holding an exclusive lock on a heap page? Or are you going to release the lock before entering the wait? >Thanks to tablespaces, it's very easy to create special Relations that >can be dealt with by standard buffer and storage manager, etc. I don't get how tablespaces would help. >The btree idea: >- does not need crash recovery. Maybe we could use a stripped down > version of nbtree. This could cause a maintanibility nightmare. It could be a nightmare if you simply duplicate and then modify the code. A better way would be to refactor nbtree to be able to handle btrees with different properties: . shared/private . logged/non-logged . flexible value data type. >- can't hold more than 300 or so simultaneous lockers (because of value > length, limited to 1/3 of a page). I doubt this is a real problem. Or you store each lock as a separate index tuple. If this is expected to be a problem because of many repeating key vaules, nbtree should be enhanced to support duplicate key compression, which might be a good thing per se. ServusManfred