Discussion: Question, re: Running enormous batch file from the command line

Question, re: Running enormous batch file from the command line

From:
"Milarsky, Jeremy F."
Date:

I am attempting to move an unusually large legacy database from old DBF files into a Postgres database. I’ve been able to clean up the DBF tables, export them to CSV and convert that CSV into a SQL batch file.

 

My batch file contains 10 million INSERT queries to populate the corresponding table in Postgres.

 

When I run it, I see only three errors, all of them having to do with my admitted failure to escape single quotes in one of my fields when I parsed the CSV.

 

[my_command_line_prompt]$ psql mydatabase -f mybatchfile.sql > log.txt

Password:

psql:mybatchfile.sql:2279459: ERROR:  syntax error at or near "P" at character 68

psql:mybatchfile.sql:2279459: LINE 1: ...VALUES ('XXXXXXXXX','SMITH','','JOHN',''','P','09','3...

psql:mybatchfile.sql:2279459:                                                              ^

psql:mybatchfile.sql:3117024: ERROR:  syntax error at or near "','" at character 64

psql:mybatchfile.sql:3117024: LINE 1: ...mf1 VALUES ('XXXXXXXXX','SMITH','','JOHN',''',' ','10','28...

psql:mybatchfile.sql:3117024:                                                              ^

psql:mybatchfile.sql:6775763: ERROR:  syntax error at or near "P" at character 67

psql:mybatchfile.sql:6775763: LINE 1: ... VALUES ('XXXXXXXXXX','SMITH','','JOHN',''','P','01','0...

psql:mybatchfile.sql:6775763:

 

However, when I afterwards go into the psql prompt and run a SELECT COUNT(*) … on the table, I see there are only about 4.5 million records in it. Consistent with that, the log.txt file I'm redirecting output to in the command above has about 4.5 million lines, each saying "INSERT 0 1".

 

So is Postgres skipping over some 5+ million lines in my batch file? If so, why is it spitting out only 3 error messages? How else can I see what's going wrong with those 5 million INSERT commands? Are there any other factors that could cause this sort of data loss?

 

I have, for the record, attempted to get all 10 million rows in with a single COPY command. That approach has not been successful, apparently due to client encoding issues (despite executing several different "set client_encoding" commands in each attempt).

 

If this sort of question has been asked before, I apologize – but I did search the mailing list archive before posting.

 

Thanks in advance.

 

======

Jeremy Milarsky

Re: Question, re: Running enormous batch file from the command line

From:
Martijn van Oosterhout
Date:
On Sun, Nov 02, 2008 at 02:13:34PM -0600, Milarsky, Jeremy F. wrote:
> I am attempting to move an unusually large legacy database from old DBF
> files into a Postgres database. I've been able to clean up the DBF
> tables, export them to CSV and convert that CSV into a SQL batch file.

<snip>

At a wild guess, it probably thinks that everything between lines 2279459
and 3117024 is a single statement: you have an uneven number of quotes,
and a quoted string can contain newlines, so everything up to the next
stray quote gets swallowed into one long string literal.

Fix the quoting and I imagine this problem will also go away.
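
Concretely, "fix the quoting" just means doubling any single quote that
appears inside a field when you generate the INSERT statements. An
untested sketch (the CSV file name is made up; mf1 is the table name
from your error output):

import csv

def sql_literal(value):
    # Standard SQL escaping: a single quote inside a string literal is
    # written as two single quotes.
    return "'" + value.replace("'", "''") + "'"

with open("mydata.csv", newline="") as src, open("mybatchfile.sql", "w") as out:
    for row in csv.reader(src):
        out.write("INSERT INTO mf1 VALUES (%s);\n"
                  % ", ".join(sql_literal(v) for v in row))

# A line with an odd number of single quotes is what swallows the
# following statements into one giant literal, so this finds the
# offenders in an existing batch file.
with open("mybatchfile.sql") as f:
    for lineno, line in enumerate(f, 1):
        if line.count("'") % 2:
            print("unbalanced quotes on line", lineno)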

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


Re: Question, re: Running enormous batch file from the command line

From:
Craig Ringer
Date:
Milarsky, Jeremy F. wrote:

> I have, for the record, attempted to get all 10 million rows in with a
> single COPY command. That approach has not been successful, apparently
> due to client encoding issues (despite executing several different "set
> client_encoding" commands in each attempt).

This is probably something you should investigate, rather than work
around, as it may indicate that you're loading incorrectly encoded data
into your DB.

What is your database encoding? What encoding does the CSV dump from the
DBF files use? What is your system's default text encoding? What
encoding were you telling psql to use?


Personally, to fix this I'd start by using the wonderful `iconv' tool to
convert the CSV data from its original encoding (probably one of the
WIN- codepages, but you need to find out) into UTF-8. If you encounter
any encoding errors in this process, you need to look at what's going on
there and determine if your source data is in a different encoding to
what you thought it was (and use the right one). If it turns out that
different parts of the data are in different encodings, that's something
you will need to clean up.
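
If iconv isn't available, the same whole-file conversion can be done
from Python; this is only a sketch, and the cp1252 source encoding and
the file names are guesses rather than anything established in this
thread:

# Roughly equivalent to: iconv -f WINDOWS-1252 -t UTF-8 mydata.csv > mydata.utf8.csv
# errors="strict" makes it fail loudly on any byte that doesn't fit the
# assumed source encoding, which is exactly the check you want here.
with open("mydata.csv", encoding="cp1252", errors="strict") as src, \
     open("mydata.utf8.csv", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(line)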

You have two options for cleaning up such mis-encoded data. One way is
to do the cleanup on the CSV data - say, using Python and the `csv'
module to load it, test it, and make whatever repairs you deem necessary
on a record-by-record basis before writing out a fixed CSV file.

The other way is to load the data into PostgreSQL as `bytea' or using
the SQL_ASCII encoding, then do the cleanup in PostgreSQL. Personally I
suspect a little Python cleanup script is easiest. The "dumb" way would
be to try str.decode(...) and if it throws an exception catch it and try
the next encoding in your list of suspects. Of course, you can do much
smarter things than that with a little bit more work and some
examination of the faulty data.
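
A minimal sketch of that "dumb" approach (str.decode is the Python 2
spelling; on Python 3 you decode the raw bytes instead. The file names
and the list of suspect encodings below are assumptions):

import csv

SUSPECTS = ["utf-8", "cp1252", "latin-1"]   # latin-1 accepts any byte, so keep it last

def to_text(raw):
    # Try each suspect encoding in turn; the first one that decodes
    # cleanly wins.
    for enc in SUSPECTS:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("undecodable line: %r" % raw)

with open("mydata.csv", "rb") as src, \
     open("mydata.clean.csv", "w", encoding="utf-8", newline="") as dst:
    writer = csv.writer(dst)
    # Decode line by line, then let the csv module deal with the quoting.
    for row in csv.reader(to_text(line) for line in src):
        writer.writerow(row)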

Once you have CSV input data of a known and correct encoding (say,
UTF-8) you can just "set client_encoding" as appropriate and \copy from
psql.
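
If you would rather drive that last step from Python than from psql,
the same load can be done with psycopg2 (my assumption here; any client
that supports COPY FROM STDIN will do):

# psycopg2 and the cleaned CSV file name are assumptions for
# illustration; "mydatabase" and mf1 come from the original post.
import psycopg2

conn = psycopg2.connect(dbname="mydatabase")
conn.set_client_encoding("UTF8")   # matches the encoding of the cleaned CSV

with conn, conn.cursor() as cur, open("mydata.clean.csv", encoding="utf-8") as f:
    cur.copy_expert("COPY mf1 FROM STDIN WITH CSV", f)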

--
Craig Ringer