Обсуждение: How To: LARGE html text or csv file COPY FROM?

Поиск
Список
Период
Сортировка

How To: LARGE html text or csv file COPY FROM?

От
Lou Picciano
Дата:
Hello Friends,

We're trying something for the first time: A COPY into a database, from a TEXT (or CSV) file containing one really, really, big field of html.

The field happens to be content of complete webpages, which we then need to later analyze, slice, dice, etc. - so it's verbatim html, with all the carriage returns, spaces, linefeeds(?) and double quotes included!

Problem is: With the very first record, the COPY commands hiccups with: missing data from column error.
in CSV mode, it's 'extra data after last expected column'  (yes, using different input files for test).

Both errors above make sense to me; COPY is running into either a cr or a tab character in each case.

Q: Is there way to handle this directly, as a PG import? 

Meanwhile, we're off into using grep/gawk to remove all carriage returns in the field?

TIA for any help, inspiration, recipes (or time in the stocks).     Lou

Re: How To: LARGE html text or csv file COPY FROM?

От
Greg Smith
Дата:
Lou Picciano wrote:p { margin: 0; }
Both errors above make sense to me; COPY is running into either a cr or a tab character in each case.

Even though they might seem very similar at first, text mode and CSV mode work very differently here.  To quote TFM at http://www.postgresql.org/docs/current/static/sql-copy.html , which covers all this if you dig into it enough:

"CSV mode will both recognize and produce CSV files with quoted values containing embedded carriage returns and line feeds. Thus the files are not strictly one line per table row like text-mode files."

So as long as the field is quoted, you can have a CR or tab in the middle.

In text mode:  "the following characters must be preceded by a backslash if they appear as part of a column value: backslash itself, newline, carriage return, and the current delimiter character."  Text mode will therefore require escaping everything on that list.

So if this is close to being importable in CSV mode, you might just need to wrap the HTML field with a pair of quotes, and escape any quote characters that are inside of it.  Text mode is going to require a whole escaping run of all the reserved characters that are input.

-- 
Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
Author, "PostgreSQL 9.0 High Performance"    Pre-ordering at:
https://www.packtpub.com/postgresql-9-0-high-performance/book