Discussion: Search then Delete Performance
Hi Everyone,

I am fairly new to practical databases, but I am trying out the C interface to Postgres and am wondering how to improve performance. I am a researcher, and I am trying to perform a large parameter sweep. Since this will involve a couple of thousand simulations, I have a process that manages which simulations have been done and which still need to be done, so I can run it easily on a cluster.

So, I have a fairly simple schema of 4 tables:

-- na, ca, ks, kf, lk, iinj are the parameters for my simulation.
CREATE TABLE alljobs (
    id SERIAL,
    ca int,
    na int,
    lk int,
    ks int,
    kf int,
    iinj int,
    PRIMARY KEY(id)
);

CREATE TABLE assignedjobs (
    jobid int,
    nodeid varchar(100),
    assignedtime timestamp,
    PRIMARY KEY(jobid)
);

CREATE TABLE completedjobs (
    jobid int,
    PRIMARY KEY(jobid)
);

CREATE TABLE unassignedjobs (
    jobid int,
    PRIMARY KEY(jobid)
);

alljobs is initially populated, and contains all the simulations that will ever be run.
unassignedjobs contains the ids in alljobs that haven't been run yet.
assignedjobs contains the ids in alljobs that have been dispatched to some CPU on the cluster and are currently simulating.
completedjobs contains all the completed jobs.

So, fairly simply, I have a daemon running on a machine which accesses this DB. Clients connect and request the details for, say, 1000 simulations, at which point the daemon takes 1000 entries from the unassigned table and moves them to the assigned table. Then, once the client is finished with those jobs, it signals this to the daemon, which then moves those jobs from 'assigned' to 'complete'.

So this is fairly simple to implement, but my problem is that it is very slow.
In particular, I have a 'select' waiting for network connections, which then calls this function:

typedef vector<long> VectorLong;

VectorLong assignJobs(PGconn* pDB, int number, string nodename)
{
    char buffer[1000];
    sprintf(buffer, "SELECT jobid FROM unassignedjobs LIMIT %d", number);
    PGresult* pRes = PQexec(pDB, buffer);

    printf("assigning jobs");

    int nRes = PQntuples(pRes);
    printf("Results found: %d", nRes);

    VectorLong jobs;
    for(int i = 0; i < nRes; i++)
    {
        long id = atol( PQgetvalue(pRes, i, 0) );
        cout << id << " ";
        jobs.push_back(id);

        // One round trip per row:
        sprintf(buffer, "DELETE FROM unassignedjobs WHERE jobid = %ld", id);
        PQexec(pDB, buffer);

        // Note: the string value must be quoted in the SQL.
        sprintf(buffer, "INSERT INTO assignedjobs (jobid,nodeid) VALUES (%ld, '%s')",
                id, nodename.c_str() );
        PQexec(pDB, buffer);
    }
    PQclear(pRes);

    return jobs;
}

but it is painfully slow. I was wondering if there is a way to improve this? I feel there should be, since I already have a 'pointer' to the rows I want to delete.

Any help would be greatly appreciated.

Many thanks,

Mike Hull
Hi,

It's probably slow because you run many queries where a few would work:

DELETE FROM unassignedjobs WHERE jobid IN (6, 8, 2, 99, 66)

But I wouldn't know how to build a query like that in C. A script in Python, or even bash, that did it would be faster than your C implementation.

What you can do in C is this:

sprintf( buffer,
    "INSERT INTO assignedjobs (jobid,nodeid)\n"
    "SELECT jobid, '%s' FROM unassignedjobs LIMIT %d\n",
    nodename.c_str(), number );

Some smart SQL with some 'RETURNING' clauses could run the whole assignJobs function in a single query, and it would be a lot faster.

Kind regards,
Arjen Nienhuis

On Wed, Sep 15, 2010 at 2:55 AM, Michael Hull <mikehulluk@googlemail.com> wrote:
> Hi Everyone,
> I am fairly new to practical databases, but I am trying out the c
> interface to postgres and am wondering how to improve performance.
> [...]
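The IN-list query Arjen describes can in fact be built in C++ with a stringstream. A minimal sketch (the function name is illustrative, not from the thread; actually running it via PQexec, and error handling, are left to the caller):

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Build one DELETE covering every claimed jobid, so a single round
// trip replaces the per-row DELETE loop in the original assignJobs.
std::string buildDeleteQuery(const std::vector<long>& jobids)
{
    std::ostringstream q;
    q << "DELETE FROM unassignedjobs WHERE jobid IN (";
    for (std::size_t i = 0; i < jobids.size(); ++i) {
        if (i > 0) q << ", ";
        q << jobids[i];
    }
    q << ")";
    return q.str();
}
```

Since the ids are server-generated integers there is no quoting issue here, but any string values would need PQescapeStringConn or a parameterized query.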
Michael Hull <mikehulluk@googlemail.com> writes:
> I am fairly new to practical databases, but I am trying out the c
> interface to postgres and am wondering how to improve performance.
> [...]

There's basically no way for the performance of that loop to not suck. You're incurring two round trips to the server per jobid change, plus query parse/plan times; so even though the underlying table manipulations are relatively simple, there's just too much overhead.

One way to fix this problem is to push the procedural logic into a stored procedure that runs on the server, so that your app sends one query, gets back one rowset of its newly assigned jobs, and all the database update side-effects are done by the procedure. That's only a small conceptual leap from where you are, but does require getting up to speed on plpgsql or one of the other procedural languages.

The more radical way to fix it is to write the whole thing as one SQL command. This requires thinking about your problem as an operation on a set of rows, rather than an iteration, so it can be a pretty big conceptual jump for database novices. I'm not sure it's actually possible to do it given the specific table organization you've adopted --- if you need to both delete rows in unassignedjobs and insert rows in assignedjobs, there's no way to do it in one SQL operation. But if you're not yet wedded to that representation, you should consider having just one table and implementing the state change as an update to a status column instead of moving the data to a different table. Then you could probably implement the operation in a single UPDATE ... RETURNING command.
Also, this looks suspiciously like a priority queue, which means you're basically reinventing the wheel. People have sweated the details on this type of thing before, and come up with very clever solutions that perform well even with multiple clients concurrently trying to obtain job assignments. Simple solutions tend to slow way down or even deadlock under those conditions :-(. Check the Postgres list archives.

			regards, tom lane
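The single-table UPDATE ... RETURNING approach Tom mentions can be sketched as a query builder. A hedged example, assuming a hypothetical jobs table with status and nodeid columns (not the poster's actual four-table schema):

```cpp
#include <sstream>
#include <string>

// Sketch of the one-table design: a single UPDATE ... RETURNING claims
// a batch of unassigned jobs and reports their ids in one round trip.
// Table and column names are hypothetical. The quoting of nodename is
// naive here; a real client should use PQescapeStringConn or
// PQexecParams with a parameterized query.
std::string buildClaimQuery(const std::string& nodename, int number)
{
    std::ostringstream q;
    q << "UPDATE jobs SET status = 'assigned', nodeid = '" << nodename
      << "' WHERE id IN (SELECT id FROM jobs WHERE status = 'unassigned'"
      << " LIMIT " << number << ") RETURNING id";
    return q.str();
}
```

The returned rowset gives the daemon exactly the jobids it just claimed, replacing the select-then-delete-then-insert loop with one statement.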
On 09/14/10 5:55 PM, Michael Hull wrote:
> So fairly simply, I have a daemon running on a machine, which accesses
> this DB. Clients connect and request the details for say 1000
> simulations, at which point the daemon takes 1000 entries from the
> unassigned table and moves them to the assigned table.
> [...]

Instead of moving data from one table to another, it might be better to just have a table of simulations, then another table which just contains the PK of each simulation and a flag that says it's assigned or unassigned (and maybe the client it's assigned to, and anything else that's related to this assignment). So instead of moving your big table rows, which involves deleting them from one table and inserting them into another, you just update the row of this small table. If you create this small table with a fillfactor like 75%, the updates will likely be handled easily by HOT.
> -----Original Message-----
> From: pgsql-general-owner@postgresql.org On Behalf Of John R Pierce
> Sent: Tuesday, September 14, 2010 8:41 PM
> Subject: Re: [GENERAL] Search then Delete Performance
>
> instead of moving data from one table to another, it might be better to
> just have a table of simulations, then another table which just contains
> the PK of each simulation, and a flag that says its assigned or
> unassigned...
> [...]

Or just a status integer in the main table, along the lines of:

1 = unassigned
2 = assigned
3 = running
4 = completed

Etc. And then update the status as appropriate and check the status as needed.

If you wait until a batch is done, you would also be able to update like this:

UPDATE jobs SET status = 4 WHERE status = 3

As you like, with a single statement.

There are lots of job schedulers on SourceForge:
http://sourceforge.net/search/?words=scheduler+workflow&type_of_search=soft&sort=latest_file_date&sortdir=desc&limit=100
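Dann's status codes map naturally onto an enum on the client side, which keeps the magic numbers out of the query strings. A small sketch (the enum and function names are hypothetical; the numeric values are his):

```cpp
#include <sstream>
#include <string>

// Client-side mirror of the suggested status integers.
enum JobStatus { UNASSIGNED = 1, ASSIGNED = 2, RUNNING = 3, COMPLETED = 4 };

// Build the batch-completion statement: mark every running job
// completed in one UPDATE.
std::string buildBatchCompleteQuery()
{
    std::ostringstream q;
    q << "UPDATE jobs SET status = " << COMPLETED
      << " WHERE status = " << RUNNING;
    return q.str();
}
```

The enum values stream as plain integers, so the generated SQL matches the statement quoted above.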
On Wed, Sep 15, 2010 at 5:15 AM, Dann Corbit <DCorbit@connx.com> wrote:
> Or just a status integer in the main table along the lines of:
> 1 = unassigned
> 2 = assigned
> 3 = running
> 4 = completed
> [...]

Thanks everyone for the input - I will investigate.

Mike