Re: Patch: [BUGS] BUG #12320: json parsing with embedded double quotes

Поиск
Список
Период
Сортировка
От Aaron Botsis
Тема Re: Patch: [BUGS] BUG #12320: json parsing with embedded double quotes
Дата
Msg-id 5BE71065-CCF8-4262-8762-D23CB1118BD0@bt-r.com
обсуждение исходный текст
Ответ на Re: Patch: [BUGS] BUG #12320: json parsing with embedded double quotes  (Andrew Dunstan <andrew@dunslane.net>)
Список pgsql-hackers

On Jan 9, 2015, at 11:37 AM, Andrew Dunstan <andrew@dunslane.net> wrote:


On 01/08/2015 08:42 PM, Aaron Botsis wrote:

On Jan 8, 2015, at 3:44 PM, Andrew Dunstan <andrew@dunslane.net> wrote:


On 01/08/2015 03:05 PM, Aaron Botsis wrote:

It's also unnecessary. CSV format, while not designed for this, is nevertheless sufficiently flexible to allow successful import of json data meeting certain criteria (essentially no newlines), like this:

 copy the_table(jsonfield)
 from '/path/to/jsondata'
 csv quote e'\x01' delimiter e'\x02’;
While perhaps unnecessary, given the size and simplicity of the patch, IMO it’s a no brainer to merge (it actually makes the code smaller by 3 lines). It also enables non-json use cases anytime one might want to preserve embedded escapes, or use different ones entirely. Do you see other reasons not to commit it?

Well, for one thing it's seriously incomplete. You need to be able to change the delimiter as well. Otherwise, any embedded tab in the json will cause you major grief.

Currently the delimiter and the escape MUST be a single byte non-nul character, and there is a check for this in csv mode. Your patch would allow any arbitrary string (including one of zero length) for the escape in text mode, and would then silently ignore all but the first byte. That's not the way we like to do things.

And, frankly, I would need to spend quite a lot more time thinking about other implications than I have given it so far. This is an area where I tend to be VERY cautious about making changes. This is a fairly fragile ecosystem.

So I’m going to do a bit more testing with another patch tomorrow with delimiters removed. If you can think of any specific cases you think will break it let me know and I’ll make sure to add regression tests for them as well.

Well, I still need convincing that this is the best solution to the problem. As I said, I need to spend more time thinking about it.

I offered an alternative (RAW mode w/record terminator) that you ignored. So in lieu of a productive discussion about “the best solution to the problem”, I went ahead and continued to complete the patch since I believe it’s a useful FE that it could be helpful for this and other use cases. There have been multiple times (not even w/json) I wished COPY would stop being so smart. 

FWIW, (if anyone’s interested in it) I also hacked up some python that’ll read a json file, and outputs a binary file suitable for use with COPY BINARY that gets around all this stuff. Obviously this only works for json (not jsonb) columns (though you could SELECT INTO a jsonb column). Happy to pass it along.

Aaron

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Jim Nasby
Дата:
Сообщение: Re: Updating copyright notices to 2015 for PGDG
Следующее
От: Andres Freund
Дата:
Сообщение: Re: Escaping from blocked send() reprised.