Обсуждение: COPY TO STDOUT WITH (FORMAT CSV, HEADER), and embedded newlines
Hi. It's my first time using COPY TO. And first time using built-in CSV support. Performs well. BUT... The code below (real code, but using a custom libpq wrapper lib) is run on a few tables, with unit tests that verify the number of lines of the output file. And for a few of those tables, there's a mismatch, the output from PostgreSQL "has too many lines". I've tracked these to text values in the DB with embedded newlines. These values are 'normal'. I'm not use to CSV, but I suppose such newlines must be encoded, perhaps as \n, since AFAIK CSV needs to be 1 line per row, no? So how's one supposed to configure the CSV output for the DB with embedded newlines? Thanks, --DD auto rset = my_exec(*conn_, "COPY MY_TAB TO STDOUT WITH (FORMAT CSV, HEADER)"); if (rset.status() != PGRES_COPY_OUT) { raise("CSV Export via SQL COPY error: ", rset.error_msg()); } std::ofstream os(file_name); bool done = false; while (!done) { auto buf = pq::CopyOutBuffer::get(*conn_); switch (buf.status()) { case pq::CopyOutStatus::eWait: assert(false); continue; case pq::CopyOutStatus::eDone: done = true; continue; case pq::CopyOutStatus::eError: raise("PQgetCopyData: {}", conn_->error_msg()); case pq::CopyOutStatus::eData: break; // process it below } // Each buffer seems to a single line of output, // with Unix-newline at the end, on all platforms. os.write(buf.data(), buf.size()); } os.close();
On Friday, March 11, 2022, Dominique Devienne <ddevienne@gmail.com> wrote:
I've tracked these to text values in the DB with embedded
newlines. These values are 'normal'. I'm not use to CSV, but I suppose
such newlines
must be encoded, perhaps as \n, since AFAIK CSV needs to be 1 line per row, no?
Haven’t tested but the docs indicate, and I do recall from memory, that PostgreSQL does indeed use \r and \n in content so what you are describing would be a bug if proven true. Can you produce something that only uses psql that shows this problem? If not, whatever “auto rset = my_exec(…)” is doing is probably at fault.
David J.
Dominique Devienne wrote: > These values are 'normal'. I'm not use to CSV, but I suppose > such newlines > must be encoded, perhaps as \n, since AFAIK CSV needs to be 1 line per row, > no? No, but such fields must be enclosed by double quotes, as documented in RFC 4180 https://datatracker.ietf.org/doc/html/rfc4180 Consider this output: psql> COPY (values (1, E'ab\ncd'), (2,'efgh')) TO STDOUT CSV; 1,"ab cd" 2,efgh That's 2 records on 3 lines. If you feed this to a parser and it chokes on it, it means that it's not a valid CSV parser. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
On Fri, Mar 11, 2022 at 7:38 PM Daniel Verite <daniel@manitou-mail.org> wrote: >> Dominique Devienne wrote: > > These values are 'normal'. I'm not used to CSV, but I suppose such newlines > > must be encoded, perhaps as \n, since AFAIK CSV needs to be 1 line per row, no? > > No, but such fields must be enclosed by double quotes, as documented > in RFC 4180 https://datatracker.ietf.org/doc/html/rfc4180 Hi Daniel. OK, good to know, thanks. > Consider this output: > psql> COPY (values (1, E'ab\ncd'), (2,'efgh')) TO STDOUT CSV; > 1,"ab > cd" > 2,efgh yes, this is consistent with what I am seeing in the debugger. The value is indeed double-quoted, and definitely contains a _raw_ '\n' newline, and not an encoded "backslash then n", as David was writing. > That's 2 records on 3 lines. > If you feed this to a parser and it chokes on it, it means that it's not a valid CSV parser. I never pretended that parser to be a CSV parser :). It's a naive "wc -l" equivalent written by someone else. And it's comparing PostgreSQL CSV output to some old legacy "CSV" output that's home grown, which does encode newlines as '\\' and 'n', and which most likely is not RFC compliant (I'm not suprised! :))). In my case, backward-compat is more important than "compliancy" (if that's a word!), so I can easily do that "\n" encoding myself, as a post-processing on the buffer I get back. Thank you for the help. --DD PS: And David, no, it's not my wrapper that's at fault here :). It's a thin wrapper, that's just easier, terser, and safer (RAII) to use compared to naked libpq.
Dominique Devienne wrote: > so I can easily do that "\n" encoding myself, as a post-processing on > the buffer I get back. Alternatively, it might be easier to use the default TEXT format of COPY rather than CSV, as the TEXT format already produces \n for line feeds, along with half a dozen other special backslashes sequences. See https://www.postgresql.org/docs/current/sql-copy.html Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
On Fri, Mar 11, 2022 at 8:32 PM Daniel Verite <daniel@manitou-mail.org> wrote: > Dominique Devienne wrote: > > so I can easily do that "\n" encoding myself, as a post-processing on > > the buffer I get back. > > Alternatively, it might be easier to use the default TEXT format of > COPY rather than CSV, as the TEXT format already produces \n for > line feeds, along with half a dozen other special backslashes sequences. > See https://www.postgresql.org/docs/current/sql-copy.html I actually just submitted CSV+PostProcessing :) But sure, if TEXT does the kind of pseudo-CSV I need, I'd change it to use it. I'll look into it next week. Thanks again Daniel. --DD
Hi Daniel: On Fri, 11 Mar 2022 at 19:38, Daniel Verite <daniel@manitou-mail.org> wrote: > > These values are 'normal'. I'm not use to CSV, but I suppose > > such newlines > > must be encoded, perhaps as \n, since AFAIK CSV needs to be 1 line per row, > > no? > No, but such fields must be enclosed by double quotes, as documented > in RFC 4180 https://datatracker.ietf.org/doc/html/rfc4180 CSV is really poiosonous. And in the multiplan days, which was nearly RFC4180, it was tolerable, but this days where everybody uses excel to spit "localized csv" it is hell ( in spain it uses ; as delimiter because it localizes numbers with , as decimal separator, you may have similar problems ). Anyway, I was going to point RFC4180 is a bit misleading. In 2.1 it states: >>> 1. Each record is located on a separate line, delimited by a line break (CRLF). For example: aaa,bbb,ccc CRLF zzz,yyy,xxx CRLF <<< Which may lead you to believe you can read by lines, but several lines after that in 2.6 it says >>> 6. Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes. For example: "aaa","b CRLF bb","ccc" CRLF zzz,yyy,xxx <<< Which somehow contradicts 2.1. In C/C++ it's easily parsed with a simple state machine reading char by char, wich is one of the strong points of those languages, but reading lines as strings usually leads to complex logic. Francisco Olarte.
Dominique: On Fri, 11 Mar 2022 at 21:13, Dominique Devienne <ddevienne@gmail.com> wrote: > But sure, if TEXT does the kind of pseudo-CSV I need, I'd change it to use it. Text, the original format for copy, is much easier to manage than CSV. It can easily be managed as you can split the whole input on newlines to get records, split each record on tabs to get fields, then unescape each field. In C++ you can easily read it a char at a time and build along the way or, if you have a whole line, unescape it in place and build a vector<char*> pointing two the buffer. If you are testing, the split on newline/split on tab approach gives you a list of escaped strings easily compared to escaped patterns. I've never had problems with it in decades, and in fact I use a extension of it ( with a \E code similar to the \N trick for nulls for 0-element lines, which are not useful in db dumps, but I and I need to use, as "\n" decodes to {""} but I need to express {}, which I emit as "\\E\n". It is and old problem, "join by tabs join by newlines" makes things "prettier" but can lead to no final new line and no way to express empty sets, "terminate with tabs terminate with newlines" leads to uglier/harder to read lines but can express them). Francisco Olarte.