Обсуждение: COPY TO STDOUT WITH (FORMAT CSV, HEADER), and embedded newlines

Поиск
Список
Период
Сортировка

COPY TO STDOUT WITH (FORMAT CSV, HEADER), and embedded newlines

От
Dominique Devienne
Дата:
Hi.

It's my first time using COPY TO. And first time using built-in CSV support.
Performs well. BUT...

The code below (real code, but using a custom libpq wrapper lib) is run on
a few tables, with unit tests that verify the number of lines of the
output file.
And for a few of those tables, there's a mismatch, the output from PostgreSQL
"has too many lines". I've tracked these to text values in the DB with embedded
newlines. These values are 'normal'. I'm not use to CSV, but I suppose
such newlines
must be encoded, perhaps as \n, since AFAIK CSV needs to be 1 line per row, no?

So how's one supposed to configure the CSV output for the DB with
embedded newlines?

Thanks, --DD

    auto rset = my_exec(*conn_, "COPY MY_TAB TO STDOUT WITH (FORMAT
CSV, HEADER)");
    if (rset.status() != PGRES_COPY_OUT) {
        raise("CSV Export via SQL COPY error: ", rset.error_msg());
    }

    std::ofstream os(file_name);
    bool done = false;
    while (!done) {
        auto buf = pq::CopyOutBuffer::get(*conn_);

        switch (buf.status()) {
        case pq::CopyOutStatus::eWait:  assert(false); continue;
        case pq::CopyOutStatus::eDone:  done = true; continue;
        case pq::CopyOutStatus::eError: raise("PQgetCopyData: {}",
conn_->error_msg());
        case pq::CopyOutStatus::eData:  break; // process it below
        }

        // Each buffer seems to a single line of output,
        // with Unix-newline at the end, on all platforms.
        os.write(buf.data(), buf.size());
    }

    os.close();



Re: COPY TO STDOUT WITH (FORMAT CSV, HEADER), and embedded newlines

От
"David G. Johnston"
Дата:
On Friday, March 11, 2022, Dominique Devienne <ddevienne@gmail.com> wrote:
I've tracked these to text values in the DB with embedded
newlines. These values are 'normal'. I'm not use to CSV, but I suppose
such newlines
must be encoded, perhaps as \n, since AFAIK CSV needs to be 1 line per row, no?

Haven’t tested but the docs indicate, and I do recall from memory, that PostgreSQL does indeed use \r and \n in content so what you are describing would be a bug if proven true.  Can you produce something that only uses psql that shows this problem?  If not, whatever “auto rset = my_exec(…)” is doing is probably at fault.

David J.

Re: COPY TO STDOUT WITH (FORMAT CSV, HEADER), and embedded newlines

От
"Daniel Verite"
Дата:
    Dominique Devienne wrote:

> These values are 'normal'. I'm not use to CSV, but I suppose
> such newlines
> must be encoded, perhaps as \n, since AFAIK CSV needs to be 1 line per row,
> no?

No, but such fields must be enclosed by double quotes, as documented
in RFC 4180 https://datatracker.ietf.org/doc/html/rfc4180

Consider this output:

psql> COPY (values (1, E'ab\ncd'), (2,'efgh')) TO STDOUT CSV;
1,"ab
cd"
2,efgh

That's 2 records on 3 lines.
If you feed this to a parser and it chokes on it, it means that it's
not a valid CSV parser.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite



Re: COPY TO STDOUT WITH (FORMAT CSV, HEADER), and embedded newlines

От
Dominique Devienne
Дата:
On Fri, Mar 11, 2022 at 7:38 PM Daniel Verite <daniel@manitou-mail.org> wrote:
>>         Dominique Devienne wrote:
> > These values are 'normal'. I'm not used to CSV, but I suppose such newlines
> > must be encoded, perhaps as \n, since AFAIK CSV needs to be 1 line per row, no?
>
> No, but such fields must be enclosed by double quotes, as documented
> in RFC 4180 https://datatracker.ietf.org/doc/html/rfc4180

Hi Daniel. OK, good to know, thanks.

> Consider this output:
> psql> COPY (values (1, E'ab\ncd'), (2,'efgh')) TO STDOUT CSV;
> 1,"ab
> cd"
> 2,efgh

yes, this is consistent with what I am seeing in the debugger.
The value is indeed double-quoted, and definitely contains a _raw_ '\n' newline,
and not an encoded "backslash then n", as David was writing.

> That's 2 records on 3 lines.
> If you feed this to a parser and it chokes on it, it means that it's not a valid CSV parser.

I never pretended that parser to be a CSV parser :). It's a naive "wc
-l" equivalent written by someone else.
And it's comparing PostgreSQL CSV output to some old legacy "CSV"
output that's home grown, which
does encode newlines as '\\' and 'n', and which most likely is not RFC
compliant (I'm not suprised! :))).

In my case, backward-compat is more important than "compliancy" (if
that's a word!),
so I can easily do that "\n" encoding myself, as a post-processing on
the buffer I get back.

Thank you for the help. --DD

PS: And David, no, it's not my wrapper that's at fault here :). It's a
thin wrapper,
  that's just easier, terser, and safer (RAII) to use compared to naked libpq.



Re: COPY TO STDOUT WITH (FORMAT CSV, HEADER), and embedded newlines

От
"Daniel Verite"
Дата:
    Dominique Devienne wrote:


> so I can easily do that "\n" encoding myself, as a post-processing on
> the buffer I get back.

Alternatively, it might be easier to use the default TEXT format of
COPY rather than CSV, as the TEXT format already produces \n for
line feeds, along with    half a dozen other special backslashes sequences.
See https://www.postgresql.org/docs/current/sql-copy.html


Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite



Re: COPY TO STDOUT WITH (FORMAT CSV, HEADER), and embedded newlines

От
Dominique Devienne
Дата:
On Fri, Mar 11, 2022 at 8:32 PM Daniel Verite <daniel@manitou-mail.org> wrote:
>         Dominique Devienne wrote:
> > so I can easily do that "\n" encoding myself, as a post-processing on
> > the buffer I get back.
>
> Alternatively, it might be easier to use the default TEXT format of
> COPY rather than CSV, as the TEXT format already produces \n for
> line feeds, along with  half a dozen other special backslashes sequences.
> See https://www.postgresql.org/docs/current/sql-copy.html

I actually just submitted CSV+PostProcessing :)

But sure, if TEXT does the kind of pseudo-CSV I need, I'd change it to use it.
I'll look into it next week. Thanks again Daniel. --DD



Re: COPY TO STDOUT WITH (FORMAT CSV, HEADER), and embedded newlines

От
Francisco Olarte
Дата:
Hi Daniel:

On Fri, 11 Mar 2022 at 19:38, Daniel Verite <daniel@manitou-mail.org> wrote:
> > These values are 'normal'. I'm not use to CSV, but I suppose
> > such newlines
> > must be encoded, perhaps as \n, since AFAIK CSV needs to be 1 line per row,
> > no?
> No, but such fields must be enclosed by double quotes, as documented
> in RFC 4180 https://datatracker.ietf.org/doc/html/rfc4180

CSV is really poiosonous. And in the multiplan days, which was nearly
RFC4180, it was tolerable, but this days where everybody uses excel to
spit "localized csv" it is hell ( in spain it uses ; as delimiter
because it localizes numbers with , as decimal separator, you may have
similar problems ).

Anyway, I was going to point RFC4180 is a bit misleading. In 2.1 it states:
>>>
   1.  Each record is located on a separate line, delimited by a line
       break (CRLF).  For example:

       aaa,bbb,ccc CRLF
       zzz,yyy,xxx CRLF
<<<

Which may lead you to believe you can read by lines, but several lines
after that in 2.6 it says

>>>
   6.  Fields containing line breaks (CRLF), double quotes, and commas
       should be enclosed in double-quotes.  For example:

       "aaa","b CRLF
       bb","ccc" CRLF
       zzz,yyy,xxx
<<<

Which somehow contradicts 2.1.

In C/C++ it's easily parsed with a simple state machine reading char
by char, wich is one of the strong points of those languages, but
reading lines as strings usually leads to complex logic.

Francisco Olarte.



Re: COPY TO STDOUT WITH (FORMAT CSV, HEADER), and embedded newlines

От
Francisco Olarte
Дата:
Dominique:

On Fri, 11 Mar 2022 at 21:13, Dominique Devienne <ddevienne@gmail.com> wrote:
> But sure, if TEXT does the kind of pseudo-CSV I need, I'd change it to use it.

Text, the original format for copy, is much easier to manage than CSV.
It can easily be managed as you can split the whole input on newlines
to get records, split each record on tabs to get fields, then unescape
each field.

In C++ you can easily read it a char at a time and build along the way
or, if you have a whole line, unescape it in place and build a
vector<char*> pointing two the buffer. If you are testing, the split
on newline/split on tab approach gives you a list of escaped strings
easily compared to escaped patterns.

I've never had problems with it in decades, and in fact I use a
extension of it ( with a \E code similar to the \N trick for nulls
for 0-element lines, which are not useful in db dumps, but I and I
need to use, as "\n" decodes to {""} but I need to express {}, which I
emit as "\\E\n". It is and old problem, "join by tabs join by
newlines" makes things "prettier" but can lead to no final new line
and no way to express empty sets, "terminate with tabs terminate with
newlines" leads to uglier/harder to read lines but can express them).

Francisco Olarte.