Re: Patch: [BUGS] BUG #12320: json parsing with embedded double quotes

From: Aaron Botsis
Subject: Re: Patch: [BUGS] BUG #12320: json parsing with embedded double quotes
Date:
Msg-id: D58AB253-3AE9-48DF-BD6F-B8F9BA979DDB@bt-r.com
In reply to: Re: Patch: [BUGS] BUG #12320: json parsing with embedded double quotes  (Andrew Dunstan <andrew@dunslane.net>)
List: pgsql-hackers

On Jan 7, 2015, at 2:45 PM, Andrew Dunstan <andrew@dunslane.net> wrote:


On 01/07/2015 08:25 AM, Aaron Botsis wrote:
Hi folks, I was having a problem importing json data with COPY. Lots of things export data nicely as one json blob per line. This is excellent for directly importing into a JSON/JSONB column for analysis.

...Except when there’s an embedded double quote. Or anything that’s escaped. COPY handles this, but by the time the escaped char hits the JSON parser, it's not escaped anymore. This breaks the JSON parsing. This means I need to manipulate the input data to double-escape it. See bug #12320 for an example. Yuck.
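To make the failure mode concrete, here is a minimal Python sketch (not PostgreSQL's actual code) of why TEXT-format COPY breaks embedded JSON escapes: the `copy_text_unescape` helper below is a hypothetical, simplified stand-in for COPY's backslash processing (real COPY also handles `\t`, `\n`, `\\`, and more).

```python
import json

# A JSON value whose string contains an escaped double quote.
record = '{"msg": "she said \\"hi\\""}'

# Sanity check: the raw line is valid JSON on its own.
assert json.loads(record) == {"msg": 'she said "hi"'}

def copy_text_unescape(line: str) -> str:
    """Hypothetical simplification of COPY TEXT backslash handling:
    drop each backslash and keep the character that follows it."""
    out = []
    i = 0
    while i < len(line):
        if line[i] == "\\" and i + 1 < len(line):
            out.append(line[i + 1])
            i += 2
        else:
            out.append(line[i])
            i += 1
    return "".join(out)

after_copy = copy_text_unescape(record)
# The \" became a bare ", so the value reaching the JSON parser
# is now malformed and parsing fails.
try:
    json.loads(after_copy)
    parsed_ok = True
except json.JSONDecodeError:
    parsed_ok = False
```

Running this, `parsed_ok` ends up false, which is exactly the bug-#12320 symptom: the data was valid JSON per line, but COPY's unescaping destroyed it before the JSON parser ever saw it.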

I propose this small patch that simply allows specifying COPY … ESCAPE without requiring the CSV parser. It will make it much easier to directly use json formatted export data for folks going forward. This seemed like the simplest route.

This isn't a bug. Neither CSV format nor TEXT format is particularly suitable for json. I'm quite certain I could compose legal json that will break your proposal (for example, with an embedded newline in the white space).

Sorry - though I originally reported it as a bug, I didn’t mean to imply it ultimately was one. :) The patch is a feature enhancement.

It's also unnecessary. CSV format, while not designed for this, is nevertheless sufficiently flexible to allow successful import of json data meeting certain criteria (essentially no newlines), like this:

  copy the_table(jsonfield)
  from '/path/to/jsondata'
  csv quote e'\x01' delimiter e'\x02';
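The trick works because `\x01` and `\x02` cannot legally appear in JSON text (control characters must be escaped inside strings), so the CSV layer's quoting and delimiting never fire and each line passes through verbatim. A quick sketch of the same idea using Python's csv module (an illustration, not PostgreSQL's CSV parser):

```python
import csv
import io
import json

# One JSON document per line, with embedded escaped quotes and backslashes.
data = '{"a": "he said \\"hi\\"", "path": "C:\\\\tmp"}\n'

# Mirror the COPY options: quote e'\x01' delimiter e'\x02'.
# Neither byte can occur in valid JSON text, so the whole line
# arrives as a single untouched field.
reader = csv.reader(io.StringIO(data), quotechar="\x01", delimiter="\x02")
rows = [row[0] for row in reader]

# The embedded \" and \\ survive, so the JSON parses cleanly.
doc = json.loads(rows[0])
```

The same reasoning is why the COPY invocation above succeeds where TEXT format fails: nothing in the data ever matches the quote or delimiter characters.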

While perhaps unnecessary, given the size and simplicity of the patch, IMO it’s a no-brainer to merge (it actually makes the code smaller by 3 lines). It also enables non-json use cases any time one might want to preserve embedded escapes, or use different ones entirely. Do you see other reasons not to commit it?

You aren't the first person to encounter this problem. See <http://adpgtech.blogspot.com/2014/09/importing-json-data.html>

Maybe we need to add something like this to the docs, or to the wiki.

In your post you acknowledged a text mode copy with null escape character would have solved your problem, and to me, that was the intuitive/obvious choice as well. *shrug*

Note too my comment in that blog post:

  Now this solution is a bit of a hack. I wonder if there's a case for
  a COPY mode that simply treats each line as a single datum. I also
  wonder if we need some more specialized tools for importing JSON,
  possibly one or more Foreign Data Wrappers. Such things could
  handle, say, embedded newline punctuation.

Agree. Though https://github.com/citusdata/json_fdw already exists (I haven’t used it). How do folks feel about a COPY mode that reads a single datum until it finds a single-character record terminator alone on a line? Something like:

=# COPY test(data) from stdin terminator '}';
End with a backslash and a period on a line by itself.
>>{
>>  "a": 123, "c": "string",  "b": [1, 2, 3
>> ]
>>}
>>{
>> [1,2,3]
>>}
>>\.
COPY 2
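The proposed semantics could be sketched roughly as follows in Python (hypothetical, since no such COPY mode exists: `read_records` is an illustrative name, and I only feed it the first of the two example records, since the second isn't valid JSON and would only exercise the record splitting, not the parsing):

```python
import io
import json

def read_records(stream, terminator="}"):
    """Hypothetical sketch of the proposed COPY mode: accumulate
    lines until the terminator character appears alone on a line,
    then emit the accumulated text (terminator line included) as
    one datum."""
    buf = []
    for line in stream:
        buf.append(line)
        if line.rstrip("\n") == terminator:
            yield "".join(buf)
            buf = []

# The first multi-line record from the COPY example above.
stdin = io.StringIO(
    '{\n'
    '  "a": 123, "c": "string",  "b": [1, 2, 3\n'
    ']\n'
    '}\n'
)
records = list(read_records(stdin))
docs = [json.loads(r) for r in records]
```

Under these assumed semantics, the embedded newlines are no longer a problem because record boundaries are detected structurally rather than per line.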

You could also get fancy and support multi-character record terminators, which would allow the same thing in a slightly different way:

COPY test(data) from stdin terminator '\n}';

I don’t know how crazy you could get without a lot of rework. It might have to be used in conjunction with a more constrained mode like "COPY…RAW".

Aaron

In the pgsql-hackers list, by date sent:

Previous
From: Andrew Dunstan
Date:
Message: Re: Patch: [BUGS] BUG #12320: json parsing with embedded double quotes
Next
From: Aaron Botsis
Date:
Message: Re: Patch: [BUGS] BUG #12320: json parsing with embedded double quotes