Re: Moving data from one set of tables to another?
From | Sean Davis |
---|---|
Subject | Re: Moving data from one set of tables to another? |
Date | |
Msg-id | 264855a00809190953r1480bebch641b5c96eb7cff6b@mail.gmail.com |
In reply to | Re: Moving data from one set of tables to another? (Howard Eglowstein <howard@yankeescientific.com>) |
List | pgsql-novice |
On Fri, Sep 19, 2008 at 12:04 PM, Howard Eglowstein <howard@yankeescientific.com> wrote:

> Yes, I have been deleting them as I go. I thought about running one pass to move the data over and a second one to then delete the records. The data in two of the tables is only loosely linked to the data in the first by the value in one column, and I was concerned about how to know where to restart the process if it stopped and I had to restart it later. Deleting the three rows after the database reported successfully writing the three new ones seemed like a good idea at the time. I don't want to stop the process now, but I'll look at having the program keep track of its progress and then go back and delete the old data when it's done.
>
> And yes, I do have a complete backup of the data from before I started any of this. I can easily go back to where I was and try again or tweak the process as needed. The database tuning is a problem I think we have from before this procedure and I'll have to look at it again after this data is moved around.

You might want to do all of this--inserts and deletes--within a transaction. Then, if ANY step fails, the entire process can be rolled back.

> Carol Walter wrote:
>>
>> Database tuning can really be an issue. I have a development copy and a production copy of most of my databases. They are on two different machines. The databases used to be tuned the same way; however, one of the machines has more processing power and one has more memory. When we retuned the databases to take advantage of the machines' strong points, it decreased the time it took to run some queries by 400%.
>>
>> Carol
>>
>> P.S. If I understand your process, and you're deleting the records as you go, that would make me really nervous. As soon as you start, you no longer have an intact table that has all the data in it. While modern database engines do a lot to protect your data, there is always some quirk that can happen.
>> If you have enough space, you might consider running the delete after the tables are created.
>>
>> On Sep 19, 2008, at 11:07 AM, Howard Eglowstein wrote:
>>
>>> There are a lot of issues at work here. The speed of the machine, the rest of the machine's workload, the database configuration, etc. This machine is about 3 years old and not as fast as a test machine I have at my desk. It's also running three web services and accepting new data into the current year's tables at the rate of one set of rows every few seconds. The database when I started didn't have any indices applied. I indexed a few columns, which seemed to help tremendously (a factor of 10 at least), and perhaps a few more might help.
>>>
>>> Considering that searching the tables now with the data split into 3 rows takes a minute or more to search the whole database, I suspect that there are still organizational issues that could be addressed to speed up all PG operations. I'm far more concerned with robustness and I'm not too keen on trying too many experiments until I get the data broken up and backed up again.
>>>
>>> I doubt this machine could perform 7 SQL operations on 1.5 million rows in each of 3 tables in a few seconds or minutes on a good day, with the wind, rolling downhill. I'd like to be proven wrong, though...
>>>
>>> Howard
>>>
>>> Sean Davis wrote:
>>>>
>>>> On Fri, Sep 19, 2008 at 10:48 AM, Howard Eglowstein <howard@yankeescientific.com> wrote:
>>>>
>>>>> Absolutely true, and if the data weren't stored on the same machine which is running the client, I would have worked harder to combine statements. In this case though, the server and the data are on the same machine, and the client application doing the SELECT, INSERT and DELETEs is also on the same machine.
>>>>>
>>>>> I'd like to see how to have done this with combined statements if I ever have to do it again in a different setup, but it is working well now. It's moved about 1/2 million records so far since last night.
>>>>
>>>> So the 150ms was per row? Not to belabor the point, but I have done this with tables with tens of millions of rows in the space of seconds to minutes (for the entire move, not per row), depending on the exact details of the table(s). No overnight involved. The network is one issue (which you have avoided by being local), but the encoding and decoding overhead to go to a client is another one that is entirely avoided. When you have some free time, do benchmark, as I think the difference could be substantial.
>>>>
>>>> Sean
>>>>
>>>>> Sean Davis wrote:
>>>>>>
>>>>>> On Fri, Sep 19, 2008 at 7:42 AM, Howard Eglowstein <howard@yankeescientific.com> wrote:
>>>>>>
>>>>>>> So you'd agree then that I'll need 7 SQL statements but that I could stack the INSERT and the first SELECT if I wanted to? Cool. That's what I ended up with in C code and it's working pretty well. I did some indexing on the database and got the whole transaction down to about 150ms for the sequence. I guess that's as good as it's going to get.
>>>>>>
>>>>>> Keep in mind that the INSERT [...] SELECT [...] is done server-side, so the data never goes over the wire to the client. This is very different from doing the select, accumulating the data, and then doing the insert, and is likely to be much faster. 150ms is already pretty fast, but the principle of doing as much on the server as possible is an important one when looking for efficiency, especially when data sizes are large.
>>>>>>
>>>>>> Glad to hear that it is working.
>>>>>>
>>>>>> Sean
>>>>>>
>>>>>>> Sean Davis wrote:
>>>>>>>>
>>>>>>>> On Thu, Sep 18, 2008 at 7:28 PM, Howard Eglowstein <howard@yankeescientific.com> wrote:
>>>>>>>>
>>>>>>>>> What confuses me is that I need to do the one select with all three tables and then do three inserts, no? The result is that the 150 fields I get back from the select have to be split into 3 groups of 50 fields each and then written into three tables.
>>>>>>>>
>>>>>>>> You do the insert part of the command three times, once for each new table, so three separate SQL statements. The select remains basically the same for all three, with only the column selection changing (data_a.* when inserting into new_a, data_b.* when inserting into new_b, etc.). Just leave the ids the same as in the first set of tables. There isn't a need to change them in nearly every case. If you need to add a new ID column, you can do that as a serial column in the new tables, but I would stick to the original IDs, if possible.
>>>>>>>>
>>>>>>>> Sean
>>>>>>>>
>>>>>>>>> What you're suggesting is that there is some statement which could do the select and the three inserts at once?
>>>>>>>>>
>>>>>>>>> Howard
>>>>>>>>>
>>>>>>>>> Sean Davis wrote:
>>>>>>>>>>
>>>>>>>>>> You might want to look at INSERT INTO ... SELECT ...
>>>>>>>>>>
>>>>>>>>>> You should be able to do this with 1 query per new table (+ the deletes, obviously). For a few thousand records, I would expect that the entire process might take a few seconds.
>>>>>>>>>>
>>>>>>>>>> Sean
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 18, 2008 at 6:39 PM, Howard Eglowstein <howard@yankeescientific.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Somewhat empty, yes. The single set of 'data_' tables contains 3 years' worth of data. I want to move 2 years' worth out into the 'new_' tables. When I'm done, there will still be 1 year's worth of data left in the original table.
>>>>>>>>>>>
>>>>>>>>>>> Howard
>>>>>>>>>>>
>>>>>>>>>>> Carol Walter wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> What do you want for your end product? Are the old tables empty after you put the data into the new tables?
>>>>>>>>>>>>
>>>>>>>>>>>> Carol
>>>>>>>>>>>>
>>>>>>>>>>>> On Sep 18, 2008, at 3:02 PM, Howard Eglowstein wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I have three tables called 'data_a', 'data_b' and 'data_c' which each have 50 columns. One of the columns in each is 'id' and is used to keep track of which data in data_b and data_c corresponds to a row in data_a.
>>>>>>>>>>>>> If I want to get all of the data in all 150 fields for this month (for example), I can get it with:
>>>>>>>>>>>>>
>>>>>>>>>>>>> select * from data_a, data_b, data_c where data_a.id = data_b.id AND data_a.id = data_c.id AND timestamp >= '2008-09-01 00:00:00' and timestamp <= '2008-09-30 23:59:59'
>>>>>>>>>>>>>
>>>>>>>>>>>>> What I need to do is execute this search, which might return several thousand rows, and write the same structure into 'new_a', 'new_b' and 'new_c'. What I'm doing now in a C program is executing the search above. Then I execute:
>>>>>>>>>>>>>
>>>>>>>>>>>>> INSERT INTO new_a (timestamp, field1, field2 ...[imagine 50 of them]) VALUES ('2008-09-01 00:00:00', 'ABC', 'DEF', ...);
>>>>>>>>>>>>>
>>>>>>>>>>>>> Get the ID that was assigned to this row, since 'id' is a serial field and the number is assigned sequentially. Say it comes back as '1'.
>>>>>>>>>>>>>
>>>>>>>>>>>>> INSERT INTO new_b (id, field1, field2 ...[imagine 50 of them]) VALUES ('1', 'ABC', 'DEF', ...);
>>>>>>>>>>>>> INSERT INTO new_c (id, field1, field2 ...[imagine 50 of them]) VALUES ('1', 'ABC', 'DEF', ...);
>>>>>>>>>>>>>
>>>>>>>>>>>>> That moves a copy of the three rows of data from the three tables into the three separate new tables.
>>>>>>>>>>>>>
>>>>>>>>>>>>> From the original group of tables, the id for these rows was, let's say, '1234'.
>>>>>>>>>>>>> Then I execute:
>>>>>>>>>>>>>
>>>>>>>>>>>>> DELETE FROM data_a where id='1234';
>>>>>>>>>>>>> DELETE FROM data_b where id='1234';
>>>>>>>>>>>>> DELETE FROM data_c where id='1234';
>>>>>>>>>>>>>
>>>>>>>>>>>>> That deletes the old data.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This works fine and gives me exactly what I wanted, but is there a better way? This is 7 SQL calls and it takes about 3 seconds per moved record on our Linux box.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Any thoughts or suggestions would be appreciated.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Howard
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Sent via pgsql-novice mailing list (pgsql-novice@postgresql.org)
>>>>>>>>>>>>> To make changes to your subscription:
>>>>>>>>>>>>> http://www.postgresql.org/mailpref/pgsql-novice
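The approach the thread converges on (one INSERT ... SELECT per new table, keeping the original ids, with the inserts and the matching deletes wrapped in a single transaction) can be sketched as follows. This is only an illustration, not the poster's actual program: SQLite stands in for PostgreSQL so the example is self-contained, and the tables are cut down to a couple of columns instead of the original 50.

```python
# Sketch of the server-side move discussed above: INSERT ... SELECT into the
# new_ tables, then DELETE from the data_ tables, all in one transaction so
# a failure at any step rolls everything back. SQLite is used here purely so
# the example runs standalone; table and column names are reduced stand-ins.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Reduced versions of the three linked tables and their new_ counterparts.
for t in ("data_a", "new_a"):
    cur.execute(f"CREATE TABLE {t} (id INTEGER PRIMARY KEY, timestamp TEXT, field1 TEXT)")
for t in ("data_b", "data_c", "new_b", "new_c"):
    cur.execute(f"CREATE TABLE {t} (id INTEGER, field1 TEXT)")

# Three "years" of sample data, one linked row set per year.
rows = [(1, "2006-06-01 00:00:00", "x"),
        (2, "2007-06-01 00:00:00", "y"),
        (3, "2008-09-15 12:00:00", "z")]
cur.executemany("INSERT INTO data_a VALUES (?, ?, ?)", rows)
cur.executemany("INSERT INTO data_b VALUES (?, ?)", [(r[0], r[2]) for r in rows])
cur.executemany("INSERT INTO data_c VALUES (?, ?)", [(r[0], r[2]) for r in rows])
conn.commit()

# Move everything before 2008. The sqlite3 module opens an implicit
# transaction at the first statement below; nothing is visible to other
# connections until commit(), and rollback() undoes the whole move.
cutoff = ("2008-01-01 00:00:00",)
try:
    # One INSERT ... SELECT per table, keeping the original ids so the
    # link between the three tables survives the move.
    cur.execute("INSERT INTO new_a SELECT * FROM data_a WHERE timestamp < ?", cutoff)
    cur.execute("INSERT INTO new_b SELECT b.* FROM data_b b "
                "JOIN data_a a ON a.id = b.id WHERE a.timestamp < ?", cutoff)
    cur.execute("INSERT INTO new_c SELECT c.* FROM data_c c "
                "JOIN data_a a ON a.id = c.id WHERE a.timestamp < ?", cutoff)
    # Matching deletes, driven by the same timestamp condition.
    cur.execute("DELETE FROM data_b WHERE id IN "
                "(SELECT id FROM data_a WHERE timestamp < ?)", cutoff)
    cur.execute("DELETE FROM data_c WHERE id IN "
                "(SELECT id FROM data_a WHERE timestamp < ?)", cutoff)
    cur.execute("DELETE FROM data_a WHERE timestamp < ?", cutoff)
    conn.commit()
except sqlite3.Error:
    conn.rollback()
    raise

moved = cur.execute("SELECT COUNT(*) FROM new_a").fetchone()[0]
kept = cur.execute("SELECT COUNT(*) FROM data_a").fetchone()[0]
print(moved, kept)  # prints: 2 1
```

On PostgreSQL the same shape works with BEGIN/COMMIT issued by the client: a failure anywhere before COMMIT leaves both the old and the new tables exactly as they were, which addresses the delete-as-you-go worry raised in the thread.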