Re: Fixing backslash dot for COPY FROM...CSV

Поиск
Список
Период
Сортировка
От Robert Haas
Тема Re: Fixing backslash dot for COPY FROM...CSV
Дата
Msg-id CA+TgmoZ+p85RBzUYAOFOiev1wDQ+sc7LZ1G6CEF416pMfU2QNg@mail.gmail.com
обсуждение исходный текст
Ответ на Fixing backslash dot for COPY FROM...CSV  ("Daniel Verite" <daniel@manitou-mail.org>)
Ответы Re: Fixing backslash dot for COPY FROM...CSV
Re: Fixing backslash dot for COPY FROM...CSV
Список pgsql-hackers
On Mon, Dec 18, 2023 at 3:36 PM Daniel Verite <daniel@manitou-mail.org> wrote:
> PFA a patch that attempts to fix the bug that \. on a line
> by itself is handled incorrectly by COPY FROM ... CSV.
> This issue has been discussed several times previously,
> for instance in [1] and [2], and mentioned in the
> doc for \copy in commit 42d3125.

Those links unfortunately seem not to be entirely specific to this
issue. Other, related things seem to be discussed there, and it's not
obvious that everyone agrees on what to do, or really that anyone
agrees on what to do. The best link that I found for this exact issue
is https://www.postgresql.org/message-id/E1TdNVQ-0001ju-GO@wrigleys.postgresql.org
but the thread isn't very conclusive and is so old that any
conclusions reached then might no longer be considered valid today.

And I guess the reason I mention is that, supposing your patch were
technically perfect in every way, would everyone agree that it ought
to be committed? If Alice is a user with a CSV file that might contain
\. on a line by itself within a quoted CSV field, then Alice is
currently sad because she can't necessarily load all of her CSV files.
The patch would fix that, and make her happy. On the other hand, if
Bob is a user with a CSV-ish file that definitely doesn't contain \.
on a line by itself within a quoted CSV field but might have been
truncated in the middle of a quoted field, maybe Bob will be sad if
this patch gets committed, because he will no longer be able to append
\n\.\n to whatever junk he's got in the file and let the server sort
out whether to throw an error.

I have to admit that it seems more likely that there are people in the
world with Alice's problem rather than people in the world with Bob's
problem. We'd probably make more people happy with the change than
sad. But it is a definitional change, I think, and that's a little
scary, and maybe somebody will think that's a reason why we should
change nothing here. Part of my hesitancy, I suppose, is that I don't
understand why we even have this strange convention of making \.
terminate the input in the first place -- I mean, why wouldn't that be
done in some kind of out-of-band way, rather than including a special
marker in the data?

I can't help feeling a bit nervous about this first documentation hunk:

--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -761,11 +761,7 @@ COPY <replaceable class="parameter">count</replaceable>
     format, <literal>\.</literal>, the end-of-data marker, could also appear
     as a data value.  To avoid any misinterpretation, a <literal>\.</literal>
     data value appearing as a lone entry on a line is automatically
-    quoted on output, and on input, if quoted, is not interpreted as the
-    end-of-data marker.  If you are loading a file created by another
-    application that has a single unquoted column and might have a
-    value of <literal>\.</literal>, you might need to quote that value in the
-    input file.
+    quoted on output.
    </para>

    <note>

It doesn't feel right to me to just replace all of this text with
nothing. That leaves us documenting only the behavior on output,
whereas the previous text documents both the output behavior (we quote
it) and the input behavior (it has to be quoted to avoid being taken
as the EOF marker).

         /*
-         * In CSV mode, we only recognize \. alone on a line.  This is because
-         * \. is a valid CSV data value.
+         * In CSV mode, backslash is a normal character.
          */
-        if (c == '\\' && (!cstate->opts.csv_mode || first_char_in_line))
+        if (c == '\\' && !cstate->opts.csv_mode)

I don't think that the update comment accurately describes the
behavior. If I understand correctly what you're trying to fix, I'd
expect the new behavior to be that we only recognize \. alone on a
line and even then only if we're not inside a quoting string, but
that's not what the revised comment says. Instead, it claims that
backslash is just a normal character, but if that were true, the whole
if-statement wouldn't exist at all, since its purpose is to provide
special-handling for backslashes -- and indeed the patch does not
change that, since the if-statement is still looking for a backslash
and doing something special if one is found.

Hmm. Looking at the rest of the patch, it seems like you're removing
the logic that prevents us from interpreting

\. lksdghksdhgjskdghjs

as an end-of-file while in CSV mode. But I would have thought based on
what problem you're trying to fix that you would have wanted to keep
that logic and further restrict it so that it only applies when not
within a quoted string.

Maybe I'm misunderstanding what bug you're trying to fix?

--
Robert Haas
EDB: http://www.enterprisedb.com



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Melanie Plageman
Дата:
Сообщение: Re: Emit fewer vacuum records by reaping removable tuples during pruning
Следующее
От: Jeff Davis
Дата:
Сообщение: Re: [17] CREATE SUBSCRIPTION ... SERVER