Discussion: Restoring large tables with COPY
Maybe I am missing something obvious, but I am unable to load
larger tables (~300k rows) with the COPY commands that pg_dump
produces by default.  Yes, dumping as INSERTs works, but it is slow.

"Can't" as in "it does not work with the default setup I have
running on my devel machine" - 128M mem, 128M swap, basically
default postgresql.conf:

1) Too few WAL files.
   - well, increase wal_files (e.g. to 32),
2) Machine runs out of swap; PostgreSQL seems to keep the whole TX
   in memory.
   - So I must put in 1G of swap?  But what if I have 1G of rows?

Or, in short: during pg_restore the resource requirements are an
order of magnitude higher than during pg_dump, which is
non-obvious and may be a bad surprise when you are in real trouble.

This is annoying, especially as dumping as COPY should be
preferred since it is faster and smaller.  Of course the
dump-as-INSERTs approach has a positive side too - e.g. ALTER TABLE
DROP COLUMN with sed...

The patch below implements a '-m NUM' switch for pg_dump, which
splits each COPY command into chunks of at most NUM rows.
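For example, with a hypothetical '-m 3' the data section of a dump
would come out roughly like the excerpt below (table name and rows are
invented purely for illustration; columns are tab-separated as in real
dump output).  The restore then processes several small COPY statements
instead of one huge one:

  -- sketch of a chunked dump: after every 3 rows a new COPY statement starts
  COPY "fruits" FROM stdin;
  1	apple
  2	banana
  3	cherry
  \.
  COPY "fruits" FROM stdin;
  4	date
  5	elderberry
  \.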
Comments? What am I missing?
--
marko
Index: doc/src/sgml/ref/pg_dump.sgml
===================================================================
RCS file: /opt/cvs/pgsql/pgsql/doc/src/sgml/ref/pg_dump.sgml,v
retrieving revision 1.41
diff -u -c -r1.41 pg_dump.sgml
*** doc/src/sgml/ref/pg_dump.sgml 8 Dec 2001 03:24:37 -0000 1.41
--- doc/src/sgml/ref/pg_dump.sgml 11 Dec 2001 03:58:30 -0000
***************
*** 35,40 ****
--- 35,41 ----
    <arg>-f <replaceable>file</replaceable></arg>
    <arg>-F <replaceable>format</replaceable></arg>
    <arg>-i</arg>
+   <arg>-m <replaceable>num_rows</replaceable></arg>
    <group> <arg>-n</arg> <arg>-N</arg> </group>
    <arg>-o</arg>
    <arg>-O</arg>
***************
*** 301,306 ****
--- 302,321 ----
        if you need to override the version check (and if
        <command>pg_dump</command> then fails, don't
        say you weren't warned).
+      </para>
+     </listitem>
+    </varlistentry>
+ 
+    <varlistentry>
+     <term>-m <replaceable class="parameter">num_rows</replaceable></term>
+     <term>--maxrows=<replaceable class="parameter">num_rows</replaceable></term>
+     <listitem>
+      <para>
+       Set maximum number of rows to put into one COPY statement.
+       This starts new COPY command after every
+       <replaceable class="parameter">num_rows</replaceable>.
+       This is useful on large tables to avoid restoring whole table in
+       one transaction which may consume lot of resources.
       </para>
      </listitem>
     </varlistentry>
Index: src/bin/pg_dump/pg_dump.c
===================================================================
RCS file: /opt/cvs/pgsql/pgsql/src/bin/pg_dump/pg_dump.c,v
retrieving revision 1.236
diff -u -c -r1.236 pg_dump.c
*** src/bin/pg_dump/pg_dump.c 28 Oct 2001 06:25:58 -0000 1.236
--- src/bin/pg_dump/pg_dump.c 11 Dec 2001 04:48:42 -0000
***************
*** 116,121 ****
--- 116,123 ----
  bool		dataOnly;
  bool		aclsSkip;
  
+ int			g_max_copy_rows = 0;
+ 
  char		g_opaque_type[10];		/* name for the opaque type */
  
  /* placeholders for the delimiters for comments */
***************
*** 151,156 ****
--- 153,159 ----
  		"  -h, --host=HOSTNAME      database server host name\n"
  		"  -i, --ignore-version     proceed even when server version mismatches\n"
  		"                           pg_dump version\n"
+ 		"  m, --maxrows=NUM        max rows in one COPY command\n"
  		"  -n, --no-quotes          suppress most quotes around identifiers\n"
  		"  -N, --quotes             enable most quotes around identifiers\n"
  		"  -o, --oids               include oids in dump\n"
***************
*** 187,192 ****
--- 190,196 ----
  		"                           pg_dump version\n"
  		"  -n                       suppress most quotes around identifiers\n"
  		"  -N                       enable most quotes around identifiers\n"
+ 		"  m NUM                   max rows in one COPY command\n"
  		"  -o                       include oids in dump\n"
  		"  -O                       do not output \\connect commands in plain\n"
  		"                           text format\n"
***************
*** 244,249 ****
--- 248,255 ----
  	int			ret;
  	bool		copydone;
  	char		copybuf[COPYBUFSIZ];
+ 	int			cur_row;
+ 	int			linestart;
  
  	if (g_verbose)
  		write_msg(NULL, "dumping out the contents of table %s\n", classname);
***************
*** 297,302 ****
--- 303,310 ----
  		else
  		{
  			copydone = false;
+ 			linestart = 1;
+ 			cur_row = 0;
  
  			while (!copydone)
  			{
***************
*** 310,316 ****
--- 318,338 ----
  				}
  				else
  				{
+ 					/*
+ 					 * Avoid too large transactions by breaking them up.
+ 					 */
+ 					if (g_max_copy_rows > 0 && linestart
+ 						&& cur_row >= g_max_copy_rows)
+ 					{
+ 						cur_row = 0;
+ 						archputs("\\.\n", fout);
+ 						archprintf(fout, "COPY %s %sFROM stdin;\n",
+ 								   fmtId(classname, force_quotes),
+ 								   (oids && hasoids) ? "WITH OIDS " : "");
+ 					}
+ 
  					archputs(copybuf, fout);
+ 
  					switch (ret)
  					{
  						case EOF:
***************
*** 318,325 ****
--- 340,350 ----
  							/* FALLTHROUGH */
  						case 0:
  							archputc('\n', fout);
+ 							cur_row++;
+ 							linestart = 1;
  							break;
  						case 1:
+ 							linestart = 0;
  							break;
  					}
  				}
***************
*** 696,701 ****
--- 721,727 ----
  		{"compress", required_argument, NULL, 'Z'},
  		{"help", no_argument, NULL, '?'},
  		{"version", no_argument, NULL, 'V'},
+ 		{"maxrows", required_argument, NULL, 'm'},
  
  		/*
  		 * the following options don't have an equivalent short option
***************
*** 748,756 ****
  	}
  
  #ifdef HAVE_GETOPT_LONG
! 	while ((c = getopt_long(argc, argv, "abcCdDf:F:h:inNoOp:RsS:t:uU:vWxX:zZ:V?", long_options, &optindex)) != -1)
  #else
! 	while ((c = getopt(argc, argv, "abcCdDf:F:h:inNoOp:RsS:t:uU:vWxX:zZ:V?-")) != -1)
  #endif
  	{
--- 774,782 ----
  	}
  
  #ifdef HAVE_GETOPT_LONG
! 	while ((c = getopt_long(argc, argv, "abcCdDf:F:h:im:nNoOp:RsS:t:uU:vWxX:zZ:V?", long_options, &optindex)) != -1)
  #else
! 	while ((c = getopt(argc, argv, "abcCdDf:F:h:im:nNoOp:RsS:t:uU:vWxX:zZ:V?-")) != -1)
  #endif
  	{
***************
*** 798,803 ****
--- 824,833 ----
  			case 'i':			/* ignore database version mismatch */
  				ignore_version = true;
+ 				break;
+ 
+ 			case 'm':
+ 				g_max_copy_rows = atoi(optarg);
  				break;
  
  			case 'n':			/* Do not force double-quotes on
Marko Kreen <marko@l-t.ee> writes:
> Maybe I am missing something obvious, but I am unable to load
> larger tables (~300k rows) with COPY command that pg_dump by
> default produces.
I'd like to find out what the problem is, rather than work around it
with such an ugly hack.
> 1) Too few WAL files.
> - well, increase the wal_files (eg to 32),
What PG version are you running? 7.1.3 or later should not have a
problem with WAL file growth.
> 2) Machine runs out of swap, PostgreSQL seems to keep whole TX
> in memory.
That should not happen either. Could we see the full schema of the
table you are having trouble with?
> Or shortly: during pg_restore the resource requirements are
> order of magnitude higher than during pg_dump,
We found some client-side memory leaks in pg_restore recently; is that
what you're talking about?
regards, tom lane
On Tue, Dec 11, 2001 at 10:55:30AM -0500, Tom Lane wrote:
> Marko Kreen <marko@l-t.ee> writes:
> > Maybe I am missing something obvious, but I am unable to load
> > larger tables (~300k rows) with COPY command that pg_dump by
> > default produces.
>
> I'd like to find out what the problem is, rather than work around it
> with such an ugly hack.
>
> > 1) Too few WAL files.
> >    - well, increase the wal_files (eg to 32),
>
> What PG version are you running?  7.1.3 or later should not have a
> problem with WAL file growth.

7.1.3

> > 2) Machine runs out of swap, PostgreSQL seems to keep whole TX
> >    in memory.
>
> That should not happen either.  Could we see the full schema of the
> table you are having trouble with?

Well, there are several such tables; I will reproduce it, then send
the schema.  I guess it's the first one, but maybe not.  postgres gets
killed by the Linux OOM handler, so I can't tell from the messages
which one it was.  (Hmm, I should probably run it as psql -q -a > log.)

> > Or shortly: during pg_restore the resource requirements are
> > order of magnitude higher than during pg_dump,
>
> We found some client-side memory leaks in pg_restore recently; is that
> what you're talking about?

No, it's the postgres process that is memory-hungry; it happens with
"psql < db.dump" too.  If I run a dump that was produced with
"pg_dump -m 5000", then it cycles between 20M and 10M, which is much
better.  (The 10M depends on shared_buffers, I guess.)

--
marko
----- Original Message -----
From: Marko Kreen <marko@l-t.ee>
Sent: Tuesday, December 11, 2001 10:10 AM

If this thing ever gets through, shouldn't this

> /* placeholders for the delimiters for comments */
> ***************
> *** 151,156 ****
> --- 153,159 ----
>   "  -h, --host=HOSTNAME      database server host name\n"
>   "  -i, --ignore-version     proceed even when server version mismatches\n"
>   "                           pg_dump version\n"
> + "  m, --maxrows=NUM        max rows in one COPY command\n"

say '-m'

> + "  m NUM                   max rows in one COPY command\n"

and this one too?
On Tue, Dec 11, 2001 at 10:55:30AM -0500, Tom Lane wrote:
> Marko Kreen <marko@l-t.ee> writes:
> > Maybe I am missing something obvious, but I am unable to load
> > larger tables (~300k rows) with COPY command that pg_dump by
> > default produces.
>
> I'd like to find out what the problem is, rather than work around it
> with such an ugly hack.

Hmm, the problem was more 'interesting' than I thought.  Basically:

1) pg_dump in 7.1.3 dumps constraints and primary keys with the table
   definitions in this case, so they are run during COPY.
2) I have some tricky CHECK constraints.

Look at the attached Python script; it reproduces the problem.

Sorry, cannot test on 7.2 at the moment.

--
marko
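(The script itself is attached rather than inlined; the pattern that
triggers the problem is presumably along the lines of the sketch below.
The names and the check body here are invented - the real check is more
involved - but the point is that a SQL-language function is called from
a CHECK constraint, so it runs once for every row loaded by COPY.)

  -- sketch only, 7.1-style syntax: SQL function used in a CHECK constraint
  CREATE FUNCTION check_code(text) RETURNS boolean AS '
      SELECT length($1) BETWEEN 4 AND 10;
  ' LANGUAGE 'sql';

  CREATE TABLE codes (
      code    text CHECK (check_code(code)),
      descr   text
  );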
On Tue, Dec 11, 2001 at 12:29:07PM -0500, Serguei Mokhov wrote:
> If this thing ever gets through, shouldn't this
>
> > /* placeholders for the delimiters for comments */
> > ***************
> > *** 151,156 ****
> > --- 153,159 ----
> >   "  -h, --host=HOSTNAME      database server host name\n"
> >   "  -i, --ignore-version     proceed even when server version mismatches\n"
> >   "                           pg_dump version\n"
> > + "  m, --maxrows=NUM        max rows in one COPY command\n"
>
> say '-m'
>
> > + "  m NUM                   max rows in one COPY command\n"
>
> and this one too?

One is for systems that have 'getopt_long', the second is for
short-getopt-only ones.  The '-h, --host=HOSTNAME' means that
'-h HOSTNAME' and '--host=HOSTNAME' are the same.

--
marko
Marko Kreen <marko@l-t.ee> writes:
> Look at the attached Python script, it reproduces the problem.
Hmm. You'd probably have much better luck if you rewrote the check_code
function in plpgsql: that should eliminate the memory-leak problem, and
also speed things up because plpgsql knows about caching query plans
across function calls. IIRC, sql functions don't do that.
The memory leakage is definitely a bug, but not one that is going to get
fixed for 7.2. It'll take some nontrivial work on the SQL function
executor...
regards, tom lane
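For reference, a plpgsql version of such a check function would look
roughly like the sketch below (same placeholder body as in the earlier
sketch, not the real check; plpgsql has to be installed in the database
first, e.g. with createlang):

  -- sketch only: the same check as a plpgsql function; plpgsql caches the
  -- plan for the expression across calls, which 7.1 sql functions do not
  CREATE FUNCTION check_code(text) RETURNS boolean AS '
  BEGIN
      RETURN length($1) BETWEEN 4 AND 10;
  END;
  ' LANGUAGE 'plpgsql';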
----- Original Message -----
From: Marko Kreen <marko@l-t.ee>
Sent: Tuesday, December 11, 2001 12:38 PM

> On Tue, Dec 11, 2001 at 12:29:07PM -0500, Serguei Mokhov wrote:
> > If this thing ever gets through, shouldn't this
> >
> > > /* placeholders for the delimiters for comments */
> > > ***************
> > > *** 151,156 ****
> > > --- 153,159 ----
> > >   "  -h, --host=HOSTNAME      database server host name\n"
> > >   "  -i, --ignore-version     proceed even when server version mismatches\n"
> > >   "                           pg_dump version\n"
> > > + "  m, --maxrows=NUM        max rows in one COPY command\n"
> >
> > say '-m'
> >
> > > + "  m NUM                   max rows in one COPY command\n"
> >
> > and this one too?
>
> One is for systems that have 'getopt_long', the second is for
> short-getopt-only ones.  The '-h, --host=HOSTNAME' means
> that '-h HOSTNAME' and '--host=HOSTNAME' are the same.

I know, I know.  I was just trying to point out a typo :)
You forgot to add '-' before 'm' in the messages.
> > > > + "  m, --maxrows=NUM        max rows in one COPY command\n"
> > >
> > > say '-m'
>
> You forgot to add '-' before 'm' in the messages.

Ah.  On my screen it looks a lot like a '-', but od shows 0xAD...
Well, that's VIM's digraph feature in action ;)

--
marko
On Tue, Dec 11, 2001 at 01:06:13PM -0500, Tom Lane wrote:
> Marko Kreen <marko@l-t.ee> writes:
> > Look at the attached Python script, it reproduces the problem.
>
> Hmm.  You'd probably have much better luck if you rewrote the check_code
> function in plpgsql: that should eliminate the memory-leak problem, and
> also speed things up because plpgsql knows about caching query plans
> across function calls.  IIRC, sql functions don't do that.

And I thought that 'sql' was the more lightweight approach...

Thanks, now it seems to work.

--
marko