Discussion: COPY performance
I'm trying to COPY in a table of 1 million rows. The table is created by:

create table chat_post_new (
    session_id   INT NOT NULL references chat_session (session_id),
    poster_name  VARCHAR(32) NOT NULL,
    time         TIMESTAMP NOT NULL,
    post_number  INT NOT NULL,
    FTS          txtidx
);

The old definition had the integer columns as int2s and the FTS column wasn't there. Because I wanted to reorder the rows, adding them in a more natural order (by time), I created the data file read in by the COPY command using:

SELECT session_id || '\t' || poster_name || '\t' || time || '\t' || post_number
FROM chat_post
ORDER BY time

After removing the first and last couple of lines, so that only the data is in the file, renaming the original table, and creating the new version, I tried running:

COPY chat_post FROM 'file-path'

The data file is 40MB and has somewhere over 1,000,000 rows of data. During the COPY I can sit and watch the memory footprint of the backend process growing. It eventually gets to something like 80MB, with 10-20MB less for its resident set size. I've not seen this COPY complete because it can't get sufficient resources from the system after this point.

So I split the file into 300,000-row chunks and ran four COPY commands. Initially, when a COPY kicks off, there is a fair amount of disk activity, which is to be expected, but then the memory footprint starts growing (from a fresh connection size), disk activity drops off to unnoticeable unless I watch and listen, and the CPU usage increases. The memory usage stabilises around 30MB with 24MB resident before the CPU utilisation finishes creeping up. Eventually, not long after starting, the backend process is clocking up 99% of the CPU.

So why does the memory usage increase so? Why is the process only I/O bound initially? And why is the process so CPU intensive?
I'm thinking that the data file is read into memory entirely before any processing of the data is done, but I can't work out why the file isn't just mmap()ed (STDIN can be special-cased, surely), nor why the CPU cost of converting the ASCII to internal representations is so high in a process that I would have thought to be very nearly I/O bound.

I could take a look at the code, not something I've done yet I must admit, and see if I can contribute by doing the mmap business if it's seen as a good idea.

--
Nigel J. Andrews
Director

---
Logictree Systems Limited
Computer Consultants
On Sat, 13 Apr 2002, Nigel J. Andrews wrote:

> I'm trying to COPY in a table of 1 million rows. The table is created by:
>
> [snipped]
>
> mmap()ed, STDIN can be special cased surely, and how about all the CPU power
> required for the ascii to internal representations being so high in a process
> that I would have thought to be very nearly IO bound?
>
> I could take a look at the code, not something I've done yet I must admit, and
> see if I can contribute by doing the mmap business if it's seen as a good idea.

I forgot to mention that even with 99% of the CPU it still takes a pretty long time to load 300,000 rows; it's still working on the second batch and I think I kicked the first off 30-60 minutes ago.

Also:

=> select version();
                               version
----------------------------------------------------------------------
 PostgreSQL 7.2.1 on i386-unknown-freebsd3.3, compiled by GCC 2.7.2.3

--
Nigel J. Andrews
Director

---
Logictree Systems Limited
Computer Consultants
Nigel J. Andrews wrote:
>
> I'm trying to COPY in a table of 1 million rows. The table is created by:
>
> create table chat_post_new (
>     session_id INT NOT NULL references chat_session (session_id),
>     poster_name VARCHAR(32) NOT NULL,
>     time TIMESTAMP NOT NULL,
>     post_number INT NOT NULL,
>     FTS txtidx
> );
>
> The old definition had the integer columns as int2s and the FTS column wasn't
> there. Because I wanted to reorder the rows, add them in a more natural order
> (by time), I created the data file read in by the copy command using
>
> SELECT session_id || '\t' || poster_name || '\t' || time || '\t' || post_number
> FROM chat_post
> ORDER BY time
>
> After removing the first and last couple of lines, so that only the data is in
> the file, renaming the original table and creating the new version I tried
> running:
>
> COPY chat_post FROM 'file-path'

I'm not sure if this is your problem, but I believe you need the same number of columns in your file as in your table. So if you want FTS to be null, append \N to the output, e.g.

SELECT session_id || '\t' || poster_name || '\t' || time || '\t' ||
       post_number || '\t\N'
FROM chat_post
ORDER BY time

Also, you should probably leave the foreign key reference off the table and add it after you load the data.

Hope this helps,

Joe
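[Editor's note: the load-first, constrain-later workflow Joe describes could look something like the following sketch. The constraint name and file path are illustrative, not from the thread.]

```sql
-- Create the table without the REFERENCES clause, so COPY queues no
-- RI trigger events, then add the foreign key in a single pass afterwards.
CREATE TABLE chat_post (
    session_id   INT NOT NULL,
    poster_name  VARCHAR(32) NOT NULL,
    time         TIMESTAMP NOT NULL,
    post_number  INT NOT NULL,
    FTS          txtidx
);

COPY chat_post FROM '/path/to/datafile';

-- One bulk validation instead of one trigger firing per inserted row.
ALTER TABLE chat_post
    ADD CONSTRAINT chat_post_session_fk
    FOREIGN KEY (session_id) REFERENCES chat_session (session_id);
```

The same idea applies to indexes: building them after the load is generally cheaper than maintaining them row by row during it.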
On Sat, 13 Apr 2002, Joe Conway wrote:

> Nigel J. Andrews wrote:
> >
> > I'm trying to COPY in a table of 1 million rows. The table is created by:
> >
> > create table chat_post_new (
> >     session_id INT NOT NULL references chat_session (session_id),
> >     poster_name VARCHAR(32) NOT NULL,
> >     time TIMESTAMP NOT NULL,
> >     post_number INT NOT NULL,
> >     FTS txtidx
> > );
> >
> > The old definition had the integer columns as int2s and the FTS column wasn't
> > there. Because I wanted to reorder the rows, add them in a more natural order
> > (by time), I created the data file read in by the copy command using
> >
> > SELECT session_id || '\t' || poster_name || '\t' || time || '\t' || post_number
> > FROM chat_post
> > ORDER BY time
> >
> > After removing the first and last couple of lines, so that only the data is in
> > the file, renaming the original table and creating the new version I tried
> > running:
> >
> > COPY chat_post FROM 'file-path'
>
> I'm not sure if this is your problem, but I believe you need the same
> number of columns in your file as your table. So if you want FTS to be
> null, append \N to the output, e.g.
>
> SELECT session_id || '\t' || poster_name || '\t' || time || '\t' ||
>        post_number || '\t\N'
> FROM chat_post
> ORDER BY time

Ah, thanks for that tip. I don't know if it'll make a difference, I'll try later, but I had wondered what that \N was doing at the end of a copy out to file I did before.

> Also, you should probably leave the foreign key reference off the table
> and add it after you load the data.

I should have mentioned that I'm doing the copy into the table in a transaction block with all constraints deferred. That should mean it's only at the commit stage that the foreign key will be checked, right?

> Hope this helps,

Well, the \N is useful to know, thanks.
--
Nigel J. Andrews
Director

---
Logictree Systems Limited
Computer Consultants
On Sat, 13 Apr 2002, Nigel J. Andrews wrote:

> > Also, you should probably leave the foreign key reference off the table
> > and add it after you load the data.
>
> I should have mentioned that I'm doing the copy in to the table in a
> transaction block with all constraints deferred. That should mean it's only at
> the commit stage that foreign key will be checked right?

With the definition shown, I believe your constraint is not deferrable, so setting the constraint mode to deferred won't help. In any case it'd still need to be saving the information on the triggers to run.
Nigel J. Andrews wrote:
>
> I should have mentioned that I'm doing the copy in to the table in a
> transaction block with all constraints deferred. That should mean
> it's only at the commit stage that foreign key will be checked
> right?

I'm not really sure; maybe someone else can chime in. But I can say that I imported about 38 million rows of similar size (actually, a bit larger) using COPY in under an hour with 7.2RC3. I had no indexes or references defined during the COPY. I never saw memory use grow beyond about 10MB IIRC -- possibly in part because I forgot to increase shared buffers beyond the default before I started.

FWIW, hardware was dual P3 833 (I think), with 2GB RAM, hardware-mirrored SCSI hard drives, Red Hat 7.2.

Joe
Stephan Szabo <sszabo@megazone23.bigpanda.com> writes:
> On Sat, 13 Apr 2002, Nigel J. Andrews wrote:
>> I should have mentioned that I'm doing the copy in to the table in a
>> transaction block with all constraints deferred. That should mean it's only at
>> the commit stage that foreign key will be checked right?

> With the definition shown, I believe your constraint is not deferrable so
> setting the constraint mode to deferred won't help. In any case it'd still
> need to be saving the information on the triggers to run.

In any case the RI trigger firings will be postponed till end of query. I suspect that the memory growth is due to the list of pending trigger firings. The advice to add the REFERENCES constraint after you've loaded the table seems good to me.

Another possibility is that there's some memory leak associated with the txtidx data type; I dunno how thoroughly that type has been tested...

			regards, tom lane
On Sat, 13 Apr 2002, Tom Lane wrote:

> Stephan Szabo <sszabo@megazone23.bigpanda.com> writes:
> > On Sat, 13 Apr 2002, Nigel J. Andrews wrote:
> >> I should have mentioned that I'm doing the copy in to the table in a
> >> transaction block with all constraints deferred. That should mean it's only at
> >> the commit stage that foreign key will be checked right?
>
> > With the definition shown, I believe your constraint is not deferrable so
> > setting the constraint mode to deferred won't help. In any case it'd still
> > need to be saving the information on the triggers to run.
>
> In any case the RI trigger firings will be postponed till end of query.
> I suspect that the memory growth is due to the list of pending trigger
> firings. The advice to add the REFERENCES constraint after you've
> loaded the table seems good to me.
>
> Another possibility is that there's some memory leak associated with the
> txtidx data type; I dunno how thoroughly that type has been tested...

I believe I have seen large memory footprints at other times, and I haven't used the txtidx type before. However, I will also do a test loading into a table with just the standard types. If it turns out to be associated with the new column I'll sort out who to email, and probably also report on here just so people can get 'closure'.

I'll do the test of loading the table without the foreign key set also, of course.

Thanks for the quick replies folks, don't you people ever go home?

--
Nigel J. Andrews
Director

---
Logictree Systems Limited
Computer Consultants
On Sat, 13 Apr 2002, Nigel J. Andrews wrote:

> On Sat, 13 Apr 2002, Tom Lane wrote:
>
> > In any case the RI trigger firings will be postponed till end of query.
> > I suspect that the memory growth is due to the list of pending trigger
> > firings. The advice to add the REFERENCES constraint after you've
> > loaded the table seems good to me.
> >
> > Another possibility is that there's some memory leak associated with the
> > txtidx data type; I dunno how thoroughly that type has been tested...
>
> I believe I have seen large memory footprints at other times and I haven't used
> the txtidx type before. However, I will also do a test loading into a table
> with just the standard types. If it turns out to be associated with the new
> column I'll sort out who to email and probably also report on here just so
> people can get 'closure'.
>
> I'll do the test of loading the table without the foreign key set also, of
> course.

Right, I'm not even going to bother doing the \N test to stick a null in the final column of the table. Leaving the foreign key constraint out of the table definition, while keeping the txtidx typed column, does indeed allow the entire data set to be loaded in a few minutes, with the memory footprint staying short of 6MB, with and without the load being done within a transaction.

So, the slowness and the large memory usage were due to the foreign key.
I note that the SQL reference manual for 7.2.1 says about SET CONSTRAINTS:

    'Currently, only foreign key constraints are affected by this setting.
    Check and unique constraints are always effectively initially immediate
    not deferrable.'

I see from the CREATE TABLE page that although foreign keys are the only constraints that currently accept the deferrable/not deferrable setting, it is the NOT DEFERRABLE setting that applies by default. This was my problem: I didn't read that page, only the SET CONSTRAINTS page, and jumped to the assumption that foreign keys defaulted to deferrable. A case of read TFM.

Thanks for the help though folks,

--
Nigel J. Andrews
Director

---
Logictree Systems Limited
Computer Consultants
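[Editor's note: for completeness, a REFERENCES clause that SET CONSTRAINTS can actually defer must be declared DEFERRABLE explicitly, since NOT DEFERRABLE is the default. A sketch (column list abbreviated to the one that matters here):]

```sql
-- Declare the FK deferrable so SET CONSTRAINTS can postpone its checks.
CREATE TABLE chat_post (
    session_id   INT NOT NULL
        REFERENCES chat_session (session_id)
        DEFERRABLE INITIALLY IMMEDIATE,
    poster_name  VARCHAR(32) NOT NULL,
    time         TIMESTAMP NOT NULL,
    post_number  INT NOT NULL,
    FTS          txtidx
);

BEGIN;
SET CONSTRAINTS ALL DEFERRED;
COPY chat_post FROM '/path/to/datafile';
COMMIT;  -- the FK checks fire here, at commit, not per statement
```

Note that, per Tom Lane's point above, deferring only moves the trigger firings to commit; the queue of pending RI events still grows with every row, so dropping the constraint for the load and re-adding it afterwards remains the better approach for bulk imports.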
On Sat, 13 Apr 2002, Nigel J. Andrews wrote:

> > I'm trying to COPY in a table of 1 million rows. The table is created by:
>
> I forgot to mention that even with 99% of the CPU it still takes a pretty long
> time to load 300,000 rows, it's still working on the second batch and I think I
> kicked the first off 30-60 minutes ago.
>
> Also:
>
> => select version();
>                                version
> ----------------------------------------------------------------------
>  PostgreSQL 7.2.1 on i386-unknown-freebsd3.3, compiled by GCC 2.7.2.3

Coming late to the party. :-)

How about telling us about your complete setup? SYS V memory settings, type of disks (SCSI, rpm, etc.)?

I merge 4.5M records in less than 20 minutes. The file loaded is about 160MB. The machine is a Pentium 4 with 2GB RAM and 2 SCSI disks (10K rpm) for the database, with the base directory on one disk and the logs on another.