Re: Should CSV parsing be stricter about mid-field quotes?

Поиск
Список
Период
Сортировка
От Andrew Dunstan
Тема Re: Should CSV parsing be stricter about mid-field quotes?
Дата
Msg-id e06241ae-3b49-b3fd-1760-4ecf55cb6429@dunslane.net
обсуждение исходный текст
Ответ на Re: Should CSV parsing be stricter about mid-field quotes?  (Greg Stark <stark@mit.edu>)
Ответы Re: Should CSV parsing be stricter about mid-field quotes?  ("Joel Jacobson" <joel@compiler.org>)
Список pgsql-hackers


On 2023-05-13 Sa 23:11, Greg Stark wrote:
On Sat, 13 May 2023 at 09:46, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
I could see an argument for a STRICT mode which would disallow partially
quoted fields, although I'd like some evidence that we're dealing with a
real problem here. Is there really a CSV producer that produces output
like that you showed in your example? And if so has anyone objected to
them about the insanity of that?
I think you'd want not just "some evidence" but "compelling evidence".
Any such option is going to add cycles into the low-level input parser
for COPY, which we know is a hot spot and we've expended plenty of
sweat on.  Adding a speed penalty that will be paid by the 99.99%
of users who don't have an issue here is going to be a hard sell.
Well I'm not sure that follows. Joel specifically claimed that an
implementation that didn't accept inputs like this would actually be
simpler and that might mean it would actually be faster.

And I don't think you have to look very hard for inputs like this --
plenty of people generate CSV files from simple templates or script
outputs that don't understand escaping quotation marks at all. Outputs
like that will be fine as long as there's no doublequotes in the
inputs but then one day someone will enter a doublequote in a form
somewhere and blammo.


The procedure described is plain wrong, and I don't have too much sympathy for people who implement it. Parsing CSV files might be a mild PITN, but constructing them is pretty darn simple. Something like this perl fragment should do it:

do {
  s/"/""/g;
  $_ = qq{"$_"} if /[",\r\n]/;
} foreach @fields;
print join(',',@fields),"\n";


And if people do follow the method you describe then their input with unescaped quotes will be rejected 999 times out of 1000. It's only cases where the field happens to have an even number of embedded quotes, like Joel's somewhat contrived example, that the input will be accepted.


So I guess the real question is whether accepting inputs with
unescapted quotes and interpreting them the way we do is really the
best interpretation. Is the user best served by a) assuming they
intended to quote part of the field and not quote part of it b) assume
they failed to escape the quotation mark or c) assume something's gone
wrong and the input is entirely untrustworthy.


As I said earlier, I'm quite reluctant to break things that might have been working happily for people for many years, in order to accommodate people who can't do the minimum required to produce correct CSVs. I have no idea how many are relying on it, but I would be slightly surprised if the number were zero.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Greg Stark
Дата:
Сообщение: Re: Should CSV parsing be stricter about mid-field quotes?
Следующее
От: Peter Geoghegan
Дата:
Сообщение: Re: Overhauling "Routine Vacuuming" docs, particularly its handling of freezing