Discussion: maintaining a reference to a fetched row

maintaining a reference to a fetched row

From: Brian Karlak
Hello All --

I have a simple queuing application written on top of postgres which
I'm trying to squeeze some more performance out of.

The setup is relatively simple: there is a central queue table in
postgres.  Worker daemons do a bounded, ordered, limited SELECT to
grab a row, which they lock by setting a value in the queue.status
column.  When the task is complete, results are written back to the
row.  The system is designed to allow multiple concurrent daemons to
access a queue.  At any one time, we expect 1-5M active items on the
queue.

Now this design is never going to win any performance awards against a
true queuing system like ActiveMQ, RabbitMQ, or ZeroMQ, but it's tolerably
fast for our applications.  Fetch/mark times are about 1ms,
independent of the number of items on the queue.  This is acceptable
considering that our tasks take ~50ms to run.

However, the writing of results back to the row takes ~5ms, which is
slower than I'd like.  It seems that this is because I need to do
an index scan on the queue table to find the row I just fetched.

My question is this: is there some way that I can keep a cursor /
pointer / reference / whatever to the row I fetched originally, so
that I don't have to search for it again when I'm ready to write
results?

Thanks in advance for any pointers you can provide.

Brian

Re: maintaining a reference to a fetched row

From: Craig Ringer
Brian Karlak wrote:

> The setup is relatively simple: there is a central queue table in
> postgres.  Worker daemons do a bounded, ordered, limited SELECT to grab
> a row, which they lock by setting a value in the queue.status column.

You can probably do an UPDATE ... RETURNING to turn that into one
operation - but that won't work with a cursor :-(
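For example, something along these lines (an untested sketch; I'm assuming an integer id primary key and a NULL status for unclaimed rows):

```sql
-- Claim the oldest unclaimed row and fetch it in a single statement.
-- The subselect's FOR UPDATE serializes racing workers; a worker that
-- loses the race sees status already set, matches zero rows, and should
-- simply retry.
UPDATE queue SET status = 1
 WHERE status IS NULL
   AND id = (SELECT id FROM queue
              WHERE status IS NULL
              ORDER BY id
              LIMIT 1
              FOR UPDATE)
RETURNING *;
```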

> My question is this: is there some way that I can keep a cursor /
> pointer / reference / whatever to the row I fetched originally, so that
> I don't have to search for it again when I'm ready to write results?

You could use a cursor, but it won't work if you're locking rows by
testing a 'status' flag, because that requires the worker to commit the
transaction (so others can see the status flag) before starting work. A
cursor only exists within a transaction.

BEGIN;
DECLARE curs CURSOR FOR SELECT * FROM queue ORDER BY queue_id LIMIT 1;
FETCH NEXT FROM curs;
--
-- Set the status - but nobody else can see the change yet because we
-- haven't committed! We'll have a Pg row lock on the record due to the
-- UPDATE, preventing other UPDATEs but not other SELECTs.
--
-- We can't start work until the transaction commits, but committing
-- will close the cursor.
--
UPDATE queue SET status = 1 WHERE CURRENT OF curs;


I don't have a good answer for you there. Perhaps using Pg's locking to
do your queueing, rather than updating a status flag, might let you use
a cursor? Have a look at the list archives - there's been a fair bit of
discussion of queuing mechanisms.
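For instance, with advisory locks (a sketch only; it assumes the queue's integer id can double as the lock key, and beware that pg_try_advisory_lock() in a WHERE clause may take locks on rows the query examines but does not return):

```sql
-- Claim an item by taking a session-level advisory lock keyed on its id.
-- The lock survives commit, so the worker can commit its claim and keep
-- working, and it vanishes automatically if the backend dies.
SELECT id FROM queue
 WHERE status IS NULL AND pg_try_advisory_lock(id)
 ORDER BY id
 LIMIT 1;

-- ... do the work, write results back, then release the claim:
SELECT pg_advisory_unlock(42);  -- 42 = the id claimed above (hypothetical)
```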

--
Craig Ringer

Re: maintaining a reference to a fetched row

From: Brian Karlak
On Nov 3, 2009, at 4:03 PM, Craig Ringer wrote:

> I don't have a good answer for you there. Perhaps using Pg's locking to
> do your queueing, rather than updating a status flag, might let you use
> a cursor? Have a look at the list archives - there's been a fair bit of
> discussion of queuing mechanisms.

This is an interesting idea.  I'll see what I can find in the
archives.  It will likely take a bit of refactoring, but such is
life ...

Thanks!
Brian

Re: maintaining a reference to a fetched row

From: Tom Lane
Brian Karlak <zenkat@metaweb.com> writes:
> My question is this: is there some way that I can keep a cursor /
> pointer / reference / whatever to the row I fetched originally, so
> that I don't have to search for it again when I'm ready to write
> results?

If you don't expect any updates to the row meanwhile, ctid might serve.
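For example, a sketch of that flow (hypothetical values; a ctid is a physical row address, so any intervening UPDATE to the row, or a VACUUM FULL, invalidates it):

```sql
-- Capture the row's new physical address while marking it in-progress.
UPDATE queue SET status = 'proc' WHERE id = 42 RETURNING ctid;
-- suppose this returns (1234,5)

-- Later, write results back via a direct tid scan instead of an index scan.
-- The id check guards against the slot having been reused by another row.
UPDATE queue SET status = 'done'
 WHERE ctid = '(1234,5)' AND id = 42;
```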

            regards, tom lane

Re: maintaining a reference to a fetched row

From: Jeff Janes
On Tue, Nov 3, 2009 at 12:30 PM, Brian Karlak <zenkat@metaweb.com> wrote:
> Hello All --
>
> I have a simple queuing application written on top of postgres which I'm
> trying to squeeze some more performance out of.
>
> The setup is relatively simple: there is a central queue table in postgres.
>  Worker daemons do a bounded, ordered, limited SELECT to grab a row, which
> they lock by setting a value in the queue.status column.

So you do a select, and then an update?

> When the task is
> complete, results are written back to the row.  The system is designed to
> allow multiple concurrent daemons to access a queue.  At any one time, we
> expect 1-5M active items on the queue.
>
> Now this design is never going to win any performance awards against a true
> queuing system like Active/Rabbit/Zero MQ, but it's tolerably fast for our
> applications.  Fetch/mark times are about 1ms, independent of the number of
> items on the queue.  This is acceptable considering that our tasks take
> ~50ms to run.
>
> However, the writing of results back to the row takes ~5ms, which is slower
> than I'd like.

It seems you have a select, an update, and another update.  Where in
this process do you commit?  Are you using fsync=off or
synchronous_commit=off?
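(For reference, synchronous_commit — unlike fsync — can be changed per session or per transaction; on a crash it can lose the last few commits, but never corrupts data:)

```sql
-- This session only: commits return before the WAL flush hits disk.
SET synchronous_commit = off;
```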

> It seems that this is because I need to do an index scan
> on the queue table to find the row I just fetched.

Why would the index scan take 1 ms two of the times it is done but 5ms
the third time?  Isn't it the same index scan each time?  Or does the
change in queue.status change the plan?

Cheers,

Jeff

Re: maintaining a reference to a fetched row

From: Brian Karlak
On Nov 3, 2009, at 9:31 PM, Tom Lane wrote:

> Brian Karlak <zenkat@metaweb.com> writes:
>> My question is this: is there some way that I can keep a cursor /
>> pointer / reference / whatever to the row I fetched originally, so
>> that I don't have to search for it again when I'm ready to write
>> results?
>
> If you don't expect any updates to the row meanwhile, ctid might serve.

Ahhh ... that's the magic I'm looking for.  Thanks!

Brian

Re: maintaining a reference to a fetched row

From: Brian Karlak

On Nov 4, 2009, at 8:47 AM, Jeff Janes wrote:

>> Worker daemons do a bounded, ordered, limited SELECT to grab a row, which
>> they lock by setting a value in the queue.status column.
>
> So you do a select, and then an update?

I do a SELECT ... FOR UPDATE in a stored proc:

FOR queue_item IN
  SELECT * FROM queue
   WHERE status IS NULL AND id >= low_bound_id
   ORDER BY id LIMIT batch_size
     FOR UPDATE
LOOP
  UPDATE queue_proc SET status = 'proc' WHERE id = queue_item.id;
END LOOP;

The daemons keep track of their last position in the queue with low_bound_id.  Also, as you've probably noticed, I fetch a batch of (100) items at a time.  In practice, it's pretty fast: the job I'm running now shows an average fetch time of 30ms per 100 actions, which ain't bad.

>> However, the writing of results back to the row takes ~5ms, which is slower
>> than I'd like.
>
> It seems you have a select, an update, and another update.  Where in
> this process do you commit?  Are you using fsync=off or
> synchronous_commit=off?

First commit occurs after the stored proc to select/update a batch of items is complete.  Second commit occurs on the writing of results back for each particular action.  Two commits are required because the time it takes to complete the intervening action can vary wildly: anywhere between 20ms and 45min.

>> It seems that this is because I need to do an index scan
>> on the queue table to find the row I just fetched.
>
> Why would the index scan take 1 ms two of the times it is done but 5ms
> the third time?  Isn't it the same index scan each time?  Or does the
> change in queue.status change the plan?

The final update is a different query -- just a plain old update by ID:

UPDATE queue_proc SET status = 'proc' WHERE id = %s;

This update by ID takes ~2.5ms, which means it's where the framework is spending most of its overhead.

Brian 

Re: maintaining a reference to a fetched row

From: Tom Lane
Brian Karlak <zenkat@metaweb.com> writes:
> On Nov 4, 2009, at 8:47 AM, Jeff Janes wrote:
>> Why would the index scan take 1 ms two of the times it is done but 5ms
>> the third time?  Isn't it the same index scan each time?  Or does the
>> change in queue.status change the plan?

> The final update is a different query -- just a plain old update by ID:
> UPDATE queue_proc set status = 'proc' where id = %s ;
> This update by ID takes ~2.5ms, which means it's where the framework
> is spending most of its overhead.

Well, if SELECT FROM queue_proc where id = %s takes 1ms and the update
takes 2.5ms, then you've got 1.5ms going into updating the row, which
means it's not going to get a whole lot faster by switching to some
other WHERE condition.  Maybe you should look at cutting back on indexes
and/or triggers attached to this table.

            regards, tom lane

Re: maintaining a reference to a fetched row

From: Jeff Janes
On Wed, Nov 4, 2009 at 9:41 AM, Brian Karlak <zenkat@metaweb.com> wrote:
>
> I do a select for update in a stored proc:
>
> FOR queue_item IN
>
>   SELECT *  FROM queue
>    WHERE status IS NULL AND id >= low_bound_id
>    ORDER BY id LIMIT batch_size
>      FOR UPDATE
>
> LOOP
>
>   UPDATE queue_proc set status = 'proc' where id = queue_item.id ;
>
> The daemons keep track of their last position in the queue with
> low_bound_id.  Also, as you probably notice, I also fetch a batch of (100)
> items at a time.  In practice, it's pretty fast.  The job I'm running now is
> showing an average fetch time of 30ms per 100 actions, which ain't bad.
>
> However, the writing of results back to the row takes ~5ms, which is slower
> than I'd like.

5 ms per each of the 100 actions?  With one commit per action?

> > It seems you have a select, an update, and another update.  Where in
> > this process do you commit?  Are you using fsync=off or
> > synchronous_commit=off?
>
> First commit occurs after the stored proc to select/update a batch of items
> is complete.

So one commit per 100 items?

> Second commit occurs on the writing of results back for each
> particular action.

So one commit per 1 item?
If so, this completely explains the difference in speed, I think.

> Two commits are required because the time it takes to
> complete the intervening action can vary wildly: anywhere between 20ms and
> 45min.

Is there any way of knowing/approximating ahead of time how long it will take?

The 45 min monsters must be exceedingly rare, or else the average
could not be ~50ms.

>> Why would the index scan take 1 ms two of the times it is done but 5ms
>> the third time?  Isn't it the same index scan each time?  Or does the
>> change in queue.status change the plan?
>
> The final update is a different query -- just a plain old update by ID:
>
> UPDATE queue_proc set status = 'proc' where id = %s ;

That looks very much like the other UPDATE you showed.  The difference
it seems is that you commit after every one, rather than after every
100.  Right?

> This update by ID takes ~2.5ms, which means it's where the framework is
> spending most of its overhead.

You said the computation task can take anywhere from 20ms to 45min, so
it seems that this update overhead is at most 1/8 of the irreducible
time.  That doesn't seem like it is enough to worry about, to me.

Jeff

Re: maintaining a reference to a fetched row

From: Dimitri Fontaine
Brian Karlak <zenkat@metaweb.com> writes:
> I have a simple queuing application written on top of postgres which I'm
> trying to squeeze some more performance out of.

Have you tried to write a custom PGQ consumer yet?
  http://wiki.postgresql.org/wiki/PGQ_Tutorial

Regards,
--
dim