Re: patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

From: Terry Laurenzo
Subject: Re: patch: Add JSON datatype to PostgreSQL (GSoC, WIP)
Date:
Msg-id: AANLkTi=uaD_M5vo491ohO6Gcmgkqi+02vVRYzF-aM+_0@mail.gmail.com
In reply to: Re: patch: Add JSON datatype to PostgreSQL (GSoC, WIP)  (Itagaki Takahiro <itagaki.takahiro@gmail.com>)
List: pgsql-hackers

>> I like as simple design as we can accept. ISTM format, I/O interface,
>> simple get/set, mapping tuple from/to object, and indexing are the
>> minimum requirements.

> +1 to small start, but simple get/set are already debatable...
> For example, text/json conversion:
>  A. SELECT '<json>'::json;
>  B. SELECT '<text>'::text::json;
>
> In the git repo, A calls parse_json_to_bson_as_vardata(), so the input
> should be in json format. OTOH, B calls pgjson_json_from_text(), so the
> input can be any text. Having the two behave differently is surprising.
> I think we have no choice but to define the text-to-json cast as parsing.
> The same goes for json-to-text -- the type-output function vs. extracting
> the text value from the json.
>
> I think casting text to/from json should behave the same way as type
> input/output. The xml type works in the same manner. And if so, we might
> not have any casts to/from json at all, for consistency, even though
> casts for non-text types pose no problem.

I just reworked some of this last night, so I'm not sure which version you are referring to (the new version has a pgplugin/jsoncast.c source file).  I was basically circling around the same thing you were, trying to find something that felt natural and not confusing.  I agree that we have little choice but to keep the in/out functions as parse/serialize, and that introducing casts that behave differently creates confusion.  When I was playing with it, I was getting confused, and I wrote it. :)
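
To make the behavior I'm agreeing with concrete, here is roughly what I'd expect end to end (the error wording is illustrative, not what the patch emits today):
   SELECT '{"a": 1}'::json;          -- input function parses
   SELECT '{"a": 1}'::text::json;    -- the cast parses too, same result
   SELECT 'not json'::text::json;    -- should raise a parse error, not
                                     -- silently wrap the text as a string
   SELECT '{"a": 1}'::json::text;    -- serializes, same as type output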

An alternative to pg casts for extracting postgres values could be to introduce analogs of the JavaScript constructors, which are the JavaScript way to cast.  For example: String(json), Number(json), Date(json).  This would feel natural to a JavaScript programmer and would be explicit and non-surprising:
   A. SELECT String('{"a": 1, "b": 2}'::json -> 'a')  (Returns postgres text)
   B. SELECT Number('1'::json)   (Returns postgres decimal)
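
If we went that route, the declarations could be as simple as the following (the C symbol names here are hypothetical, just to sketch the shape):
   CREATE FUNCTION String(json) RETURNS text
      AS 'MODULE_PATHNAME', 'pgjson_text_from_json'    -- hypothetical symbol
      LANGUAGE C IMMUTABLE STRICT;
   CREATE FUNCTION Number(json) RETURNS numeric
      AS 'MODULE_PATHNAME', 'pgjson_numeric_from_json' -- hypothetical symbol
      LANGUAGE C IMMUTABLE STRICT;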

I think that the most common use case for this type of thing in the DB will be to extract a JSON scalar as a postgres scalar.

The inverse, while probably less useful, is currently represented by the json_from_* functions.  We could collapse all of these down to one overloaded function, say ToJson(...):
   A. SELECT ToJson(1)   (Would return a json type with an int32 "1" value)
   B. SELECT ToJson('Some String')   (Would return a json type with a string value)
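
Sketching those as declarations, one overload per source type (names other than pgjson_json_from_text are hypothetical):
   CREATE FUNCTION ToJson(int4) RETURNS json
      AS 'MODULE_PATHNAME', 'pgjson_json_from_int4' LANGUAGE C IMMUTABLE STRICT;
   CREATE FUNCTION ToJson(text) RETURNS json
      AS 'MODULE_PATHNAME', 'pgjson_json_from_text' LANGUAGE C IMMUTABLE STRICT;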

There might be some syntactic magic we could do by adding an intermediate jsonscalar type, but based on trying real cases with this stuff, you always end up having to be explicit about your conversions anyway.  Having implicit type coercion from this polymorphic type tends to make things confusing, imo.
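
For instance, if a hypothetical jsonscalar had implicit casts to both text and numeric, an expression like this becomes a coin toss for the parser:
   SELECT ('{"a": 1}'::json -> 'a') = '1';
   -- both text = text and numeric = numeric are candidates for the untyped
   -- literal on the right; depending on the cast setup this either fails
   -- with "operator is not unique" or silently picks one interpretation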
 

> I'll list the issues we face before we start on json types, even in the
> simplest cases:
> ----
> 1. where to implement the json core: external library vs. inside postgres
> 2. internal format: text vs. binary (*)
> 3. encoding: always UTF-8 vs. database encoding (*)
> 4. meaning of casts text to/from json: parse/stringify vs. get/set
> 5. parser implementation: flex/bison vs. hand-coded
> ----
> (*) Note that we will need to compare two json values in the future, so we
> might need to normalize the internal format even in a text representation.
>
> The most interesting parts of json types, including indexing and jsonpath,
> would be built on top of the json core. We need conclusions on those issues.


My opinions or ramblings on the above:
   1. There's a fair bit of code involved for something that many are going to gloss over.  I can think of pros/cons for external/internal/contrib, and I'm not sure which I would choose.
   2. I'm definitely in the binary camp, but part of the reason for building it out was to try it with some real-world cases to get a feel for the end-to-end performance implications (see the comparison example after this list).  We make heavy use of MongoDB at the office, and I was thinking it might make sense to strip some of those cases down and see how they would be implemented in this context.  I'll write up more thoughts tonight on how I think text/binary should perform for various cases.
   3. I think if we go with binary, we should always store UTF-8 in the binary structure.  Otherwise, we leave too much of the guts of the binary to the whim of the database encoding.  As currently implemented, all strings generated by the in/out functions should be escaped so that they are pure ASCII (not quite working yet, but there in theory; see the escaping example after this list).  JSON is by definition Unicode (UTF-8 by default), and in this case I think that trumps the database encoding.
   4. My thoughts on the casts are above.
   5. There seems to be a lot of runtime and code-size overhead inherent in flex/bison parsers, especially considering that they will most frequently be invoked on very small streams.  Writing a good hand-coded parser for comparison is just a matter of choosing which bottle of wine to open before spending the hours coding it, so I would probably defer that decision until later.
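
On the comparison point, the normalization note (*) in the list above is easy to see with a concrete case.  These differ byte-for-byte as text even though they are arguably the same value:
   SELECT '{"a":1, "b":2}'::json = '{ "b": 2, "a": 1 }'::json;
   -- whitespace and key order differ; only a normalized internal form
   -- (binary, or canonicalized text) lets equality, and any index built
   -- on it, behave consistently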
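
And to illustrate the escaping in point 3 (the output shown is what I'm aiming for, not necessarily what the current code produces):
   SELECT '"héllo"'::json;
   -- stored internally as UTF-8; the output function can emit pure ASCII
   -- by escaping the non-ASCII character: "h\u00e9llo"  (é is U+00E9)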

Terry
