Thread: Search then Delete Performance

Search then Delete Performance

From:
Michael Hull
Date:
Hi Everyone,
I am fairly new to practical databases, but I am trying out the C
interface to Postgres and am wondering how to improve performance. I
am a researcher, and I am trying to perform a large parameter sweep.
Since this will involve a couple of thousand simulations, I have a
process that manages which simulations have been done, and which still
need to be done, so I can run it easily on a cluster.

So, I have a fairly simple schema of 4 tables.

-- na, ca,ks,kf,lk,iinj are the parameters for my simulation.
CREATE TABLE alljobs (
    id SERIAL,
    ca int,
    na int,
    lk int,
    ks int,
    kf int,
    iinj int,
    PRIMARY KEY(id)
);


CREATE TABLE assignedjobs (
    jobid int,
    nodeid varchar(100),
    assignedtime timestamp,
    PRIMARY KEY(jobid)
);


CREATE TABLE completedjobs (
    jobid int,
    PRIMARY KEY(jobid)
);

CREATE TABLE unassignedjobs(
    jobid int,
    PRIMARY KEY(jobid)
);



alljobs is initially populated, and contains all the simulations that
will ever be run.
unassignedjobs contains the ids in alljobs that haven't been run yet.
assignedjobs contains the ids in alljobs that have been dispatched to
some cpu on the cluster and are currently simulating.
completedjobs contains all the completed jobs.

So fairly simply, I have a daemon running on a machine, which accesses
this DB. Clients connect and request the details for say 1000
simulations, at which point the daemon takes 1000 entries from the
unassigned table and moves them to the assigned table. Then once the
client is finished with those jobs, it signals this to the daemon,
which then moves those jobs from 'assigned' to 'complete'.

So this is fairly simple to implement, but my problem is that it is very slow.


In particular, I have a 'select' waiting for network connections,
which then calls this function:

#include <libpq-fe.h>
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

typedef vector<long> VectorLong;

VectorLong assignJobs(PGconn* pDB, int number, string nodename)
{
    char buffer[1000];
    sprintf(buffer, "SELECT jobid FROM unassignedjobs LIMIT %d", number);
    PGresult* pRes = PQexec(pDB, buffer);

    printf("assigning jobs");

    //PGresult* pRes = PQexec(pDB, "SELECT * from alljobs LIMIT 100");
    //PGresult* pRes = PQexec(pDB, "SELECT * from alljobs");

    int nRes = PQntuples(pRes);
    printf("Results found: %d", nRes);

    VectorLong jobs;
    for (int i = 0; i < nRes; i++)
    {
        long id = atol(PQgetvalue(pRes, i, 0));
        cout << id << " ";
        jobs.push_back(id);

        // one DELETE plus one INSERT per job: two statements per row
        sprintf(buffer, "DELETE FROM unassignedjobs WHERE jobid = %ld", id);
        PQclear(PQexec(pDB, buffer));

        // note: the nodeid value must be quoted as a SQL string literal
        sprintf(buffer, "INSERT INTO assignedjobs (jobid, nodeid) VALUES (%ld, '%s')",
                id, nodename.c_str());
        PQclear(PQexec(pDB, buffer));
    }

    PQclear(pRes);
    return jobs;
}

but it is painfully slow. I was wondering if there is a way to improve
this? I feel there should be since I already have a 'pointer' to the
rows I want to delete.


Any help would be greatly appreciated.

Many thanks


Mike Hull

Re: Search then Delete Performance

From:
Arjen Nienhuis
Date:
Hi,

It's probably slow because you run many queries where a few would work:

DELETE FROM unassignedjobs WHERE jobid IN (6, 8, 2, 99, 66)

But I wouldn't know how to build a query like that in C. A script in
Python or even bash that did it would be faster than your C
implementation.

What you can do in C is this:

sprintf(
    buffer,
    "INSERT INTO assignedjobs (jobid, nodeid) "
    "SELECT jobid, '%s' FROM unassignedjobs LIMIT %d",
    nodename.c_str(), number
);

Some smart SQL with some 'RETURNING' clauses could run the whole
assignJobs function in a single query and it would be a lot faster.
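
For example, a rough sketch of how the select + delete side could collapse
into one round trip (untested; RETURNING needs PostgreSQL 8.2 or later):

    DELETE FROM unassignedjobs
    WHERE jobid IN (SELECT jobid FROM unassignedjobs LIMIT 1000)
    RETURNING jobid;

The deleted ids come back to the client as an ordinary result set, which can
then go into assignedjobs with a single INSERT.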

Kind regards,
Arjen Nienhuis

On Wed, Sep 15, 2010 at 2:55 AM, Michael Hull <mikehulluk@googlemail.com> wrote:
> [snip]

Re: Search then Delete Performance

From:
Tom Lane
Date:
Michael Hull <mikehulluk@googlemail.com> writes:
> I am fairly new to practical databases, but I am trying out the C
> interface to Postgres and am wondering how to improve performance. I
> am a researcher, and I am trying to perform a large parameter sweep.
> Since this will involve a couple of thousand simulations, I have a
> process that manages which simulations have been done, and which still
> need to be done, so I can run it easily on a cluster.

There's basically no way for the performance of that loop to not suck.
You're incurring two round trips to the server per jobid change,
plus query parse/plan times; so even though the underlying table
manipulations are relatively simple, there's just too much overhead.

One way to fix this problem is to push the procedural logic into a
stored procedure that runs on the server, so that your app sends
one query, gets back one rowset of its newly assigned jobs, and
all the database update side-effects are done by the procedure.
That's only a small conceptual leap from where you are, but does require
getting up to speed on plpgsql or one of the other procedural languages.
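
A minimal sketch of that approach (the function name and body here are
illustrative, not from the thread):

    CREATE OR REPLACE FUNCTION assign_jobs(n integer, node varchar)
    RETURNS SETOF integer AS $$
    DECLARE
        j integer;
    BEGIN
        FOR j IN SELECT jobid FROM unassignedjobs LIMIT n LOOP
            DELETE FROM unassignedjobs WHERE jobid = j;
            INSERT INTO assignedjobs (jobid, nodeid, assignedtime)
                VALUES (j, node, now());
            RETURN NEXT j;
        END LOOP;
    END;
    $$ LANGUAGE plpgsql;

    -- the client then issues one query per batch:
    -- SELECT * FROM assign_jobs(1000, 'node01');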

The more radical way to fix it is to write the whole thing as one
SQL command.  This requires thinking about your problem as an operation
on a set of rows, rather than an iteration, so it can be a pretty big
conceptual jump for database novices.  I'm not sure it's actually
possible to do it given the specific table organization you've adopted
--- if you need to both delete rows in unassignedjobs and insert rows
in assignedjobs, there's no way to do it in one SQL operation.  But if
you're not yet wedded to that representation, you should consider having
just one table and implementing the state change as an update to a
status column instead of moving the data to a different table.  Then you
could probably implement the operation in a single UPDATE ... RETURNING
command.
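
Assuming that single-table layout, a sketch of the idea (column names are
illustrative; UPDATE ... RETURNING needs PostgreSQL 8.2 or later):

    ALTER TABLE alljobs ADD COLUMN status text NOT NULL DEFAULT 'unassigned';
    ALTER TABLE alljobs ADD COLUMN nodeid varchar(100);

    UPDATE alljobs
    SET status = 'assigned', nodeid = 'node01'
    WHERE id IN (SELECT id FROM alljobs WHERE status = 'unassigned' LIMIT 1000)
    RETURNING id;

As written this is the naive single-client version; as the next paragraph
notes, concurrent clients need more care.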

Also, this looks suspiciously like a priority queue, which means you're
basically reinventing the wheel.  People have sweated the details on
this type of thing before, and come up with very clever solutions that
perform well even with multiple clients concurrently trying to obtain
job assignments.  Simple solutions tend to slow way down or even
deadlock under those conditions :-(.  Check the Postgres list archives.

            regards, tom lane

Re: Search then Delete Performance

From:
John R Pierce
Date:
On 09/14/10 5:55 PM, Michael Hull wrote:
> So fairly simply, I have a daemon running on a machine, which accesses
> this DB. Clients connect and request the details for say 1000
> simulations, at which point the daemon takes 1000 entries from the
> unassigned table and moves them to the assigned table. Then once the
> client is finished with those jobs, it signals this to the daemon,
> which then moves those jobs from 'assigned' to 'complete'.
>
> So this is fairly simple to implement, but my problem is that it is very slow.

Instead of moving data from one table to another, it might be better to
just have a table of simulations, then another table which just contains
the PK of each simulation, and a flag that says it's assigned or
unassigned (and maybe the client it's assigned to, and anything else
that's related to this assignment)... so instead of moving your big
table rows, which involves deleting them from one table and inserting
them into another, you just update the row of this small table. If you
create this small table with a fillfactor like 75%, the updates will
likely be handled easily by HOT.
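
Something along these lines, perhaps (table and column names are made up
for illustration):

    CREATE TABLE jobassignment (
        jobid    int PRIMARY KEY REFERENCES alljobs(id),
        assigned boolean NOT NULL DEFAULT false,
        nodeid   varchar(100)
    ) WITH (fillfactor = 75);

    -- assigning is now an in-place update of a narrow row, not a delete+insert:
    UPDATE jobassignment SET assigned = true, nodeid = 'node01' WHERE jobid = 42;

Since neither updated column is indexed and the page has spare room, such
updates qualify for HOT and avoid index maintenance.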



Re: Search then Delete Performance

From:
Dann Corbit
Date:
> -----Original Message-----
> From: John R Pierce
> Sent: Tuesday, September 14, 2010 8:41 PM
> Subject: Re: [GENERAL] Search then Delete Performance
>
> [snip]

Or just a status integer in the main table along the lines of:
1 = unassigned
2 = assigned
3 = running
4 = completed
Etc.

And then update the status as appropriate and check the status as needed.

If you wait until a batch is done, you can also update all of its jobs like this:

    UPDATE jobs SET status = 4 WHERE status = 3

As you like, with a single statement.
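
For concreteness, a possible sketch (the column name and status codes here
are illustrative, not from the original schema):

    ALTER TABLE alljobs ADD COLUMN status int NOT NULL DEFAULT 1;
    -- 1 = unassigned, 2 = assigned, 3 = running, 4 = completed

    -- check progress as needed:
    SELECT status, count(*) FROM alljobs GROUP BY status;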

There are lots of job schedulers on SourceForge:

http://sourceforge.net/search/?words=scheduler+workflow&type_of_search=soft&sort=latest_file_date&sortdir=desc&limit=100


Re: Search then Delete Performance

From:
Michael Hull
Date:
On Wed, Sep 15, 2010 at 5:15 AM, Dann Corbit <DCorbit@connx.com> wrote:
> [snip]

Thanks everyone for the input - I will investigate.


Mike