Re: patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

From: Terry Laurenzo
Subject: Re: patch: Add JSON datatype to PostgreSQL (GSoC, WIP)
Date:
Msg-id: AANLkTinxBrLLx_bjNB0pT4kk3Qnt-Zwb_aU8ibz7m8Ea@mail.gmail.com
In reply to: patch: Add JSON datatype to PostgreSQL (GSoC, WIP)  (Joseph Adams <joeyadams3.14159@gmail.com>)
Responses: Re: patch: Add JSON datatype to PostgreSQL (GSoC, WIP)  (Hitoshi Harada <umi.tanuki@gmail.com>)
Re: patch: Add JSON datatype to PostgreSQL (GSoC, WIP)  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Hi all -
I independently started some work on a capability similar to the json datatype that Joey Adams contributed back in August.  Before starting, I did a quick search but for some reason didn't turn up this existing thread.

What I've been working on is out on github for now: http://github.com/tlaurenzo/pgjson

When I started, I was actually aiming for something else and got caught up going down this rabbit hole.  I took a different design approach, making the internal form an extended BSON stream and implementing event-driven parsing and serializing to the different formats.  There was some discussion in the original thread around storing plain text vs a custom format.  I have to admit I've been back and forth a couple of times on this and have come to like a BSON-like format for the data at rest.
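
To make the event-driven part concrete, the interface is roughly shaped like this (a sketch only; these names are illustrative, not the actual pgjson API):

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch of the event interface; names are illustrative.  A parser
     * (text or binary) pushes events into one of these, and a serializer
     * (text or binary) implements it. */
    typedef struct JsonEventHandler
    {
        void (*begin_object)(void *ctx);
        void (*end_object)(void *ctx);
        void (*begin_array)(void *ctx);
        void (*end_array)(void *ctx);
        void (*field_name)(void *ctx, const char *name, size_t len);
        void (*string_value)(void *ctx, const char *val, size_t len);
        void (*int32_value)(void *ctx, int32_t val);
        void (*bool_value)(void *ctx, int val);
        void (*null_value)(void *ctx);
        void   *ctx;
    } JsonEventHandler;

Wiring a text parser's events to a binary serializer gives text-to-binary conversion, and vice versa, with no intermediate tree.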

Pros:
   - It is directly iterable without parsing and/or constructing an AST
   - It is its own representation.  If iterating and you want to tear off a value to be returned or used elsewhere, it's a simple buffer copy plus some bit twiddling (see the sketch after the Cons list below).
   - It is conceivable that clients already know how to deal with BSON, allowing them to work with the internal form directly (a la MongoDB)
   - It stores a wider range of primitive types than JSON-text.  The most important are Date and binary.

Cons:
   - The format appears to have been "grown".  Some of the critical decisions were made late in the game (i.e., why would your integral type codes be last?)
   - Natively, the format falls victim to the artificial document vs element distinction, which I never understood.  I've worked around this with an escape mechanism for representing root values, but it's not great.
   - The processor is not resilient in the face of unknown element types
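
To illustrate the first two pros (and the last con), here is roughly what sizing and tearing off a value looks like.  The type codes are real BSON codes; the helpers are made-up sketch code, and it assumes a little-endian host since BSON integers are little-endian on the wire:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* An element is: type byte, NUL-terminated name, value.  Iteration is
     * just: read the type, skip the name, size the value from the type,
     * advance.  Only a few type codes shown (0x02 = string, 0x08 = boolean,
     * 0x10 = int32); an unknown code leaves us stuck, which is the
     * resilience problem above. */
    static size_t
    value_size(uint8_t type, const uint8_t *value)
    {
        switch (type)
        {
            case 0x08:
                return 1;
            case 0x10:
                return 4;
            case 0x02:
            {
                int32_t len;    /* int32 length prefix, then the bytes */
                memcpy(&len, value, sizeof(len));
                return sizeof(len) + (size_t) len;
            }
            default:
                return 0;       /* unknown type: cannot skip it */
        }
    }

    /* Tearing off a value for use elsewhere is a length computation plus
     * one buffer copy; no AST is ever built. */
    static uint8_t *
    tear_off(const uint8_t *value, size_t size)
    {
        uint8_t *copy = malloc(size);

        if (copy != NULL)
            memcpy(copy, value, size);
        return copy;
    }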

I'm leaning towards thinking that the determination comes down to the following:
   - If you just want a "checkbox" item that the database has a json datatype and some support functions, storing as text may make sense.  It can be much simpler; however, it becomes increasingly hard to do anything real without adding a parse-to-AST, manipulate, dump-to-text cycle to every function (see the sketch after this list).
   - If you want a json datatype that is highly integrated and manipulable, you want a binary data structure, and in the absence of any other contender in this area, BSON is ok (not great, but ok).
   - The addition of a JavaScript procedural language probably does not bring its own format for data at rest.  None of the engines I know of (I haven't looked at what Guile is doing) have a static representation for internal data structures.  They are heap objects with liberal use of internal and external pointers.  Most do have a mechanism, however, for injecting foreign objects into the runtime without resorting to making a dumb copy.  As such, the integration approach would probably be to determine the best format for JSON data at rest and provide adapters to the chosen JavaScript runtime to manipulate this at-rest format directly (potentially using a copy-on-write approach).  If the at-rest format is text, then you would need a parse-to-AST step for each JavaScript function invocation.
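
To show what that per-function cycle in the first point looks like, here is a sketch of a text-at-rest support function.  JsonAst, json_parse, json_ast_get, and json_serialize are made-up names standing in for whatever the real pieces would be:

    #include "postgres.h"
    #include "fmgr.h"
    #include "utils/builtins.h"

    /* Hypothetical AST type and helpers -- not a real API. */
    typedef struct JsonAst JsonAst;
    extern JsonAst *json_parse(const char *str, size_t len);
    extern JsonAst *json_ast_get(JsonAst *ast, const char *key);
    extern char *json_serialize(JsonAst *ast);

    PG_FUNCTION_INFO_V1(json_get_field);

    Datum
    json_get_field(PG_FUNCTION_ARGS)
    {
        text       *js  = PG_GETARG_TEXT_PP(0);
        text       *key = PG_GETARG_TEXT_PP(1);

        /* parse to AST ... */
        JsonAst    *ast = json_parse(VARDATA_ANY(js), VARSIZE_ANY_EXHDR(js));
        /* ... manipulate ... */
        JsonAst    *val = json_ast_get(ast, text_to_cstring(key));

        /* ... dump back to text.  Every function pays all three steps. */
        PG_RETURN_TEXT_P(cstring_to_text(json_serialize(val)));
    }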

Here are a few notes on my current implementation:
   - Excessive use of lex/yacc: This was quick and easy but the grammars are simple enough that I'd probably hand-code a parser for any final solution.
   - When the choice between following the json.org spec to the letter and implementing lenient parsing for valid JavaScript constructs arose, I chose lenient.
   - Too much buffer copying: When I started, I was just doodling with writing C code to manipulate JSON/BSON and not working with postgres in particular.  As such, it all uses straight malloc/free, and too many copies are made to get things in and out of VARDATA structures.  This would all be eliminated in any real version (see the sketch after this list).
   - UTF-8 is supported but not working completely.  The support functions that Joey wrote do a better job at this.
   - My json path evaluation is crippled.  Given the integration with the PG type system, I thought I just wanted a simple property traversal mechanism, punting higher-level manipulation to native PG functions.  Seeing real JSONPath work, though, I'm not so sure.  I like the simplicity of what I've done, but the features of the full spec are nice too.
   - This is first-pass prototype code with the goal of seeing it all working together.
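
On the buffer-copying note above, the fix is mostly mechanical: serialize straight into palloc'd varlena buffers instead of malloc'ing and copying.  Something like this sketch:

    #include "postgres.h"

    /* Wrap a finished BSON-like buffer in a varlena so it can be returned
     * as a datum.  A real version would serialize directly into a
     * palloc'd, resizable buffer and skip even this copy. */
    static struct varlena *
    json_buffer_to_varlena(const char *buf, Size len)
    {
        struct varlena *result = (struct varlena *) palloc(VARHDRSZ + len);

        SET_VARSIZE(result, VARHDRSZ + len);
        memcpy(VARDATA(result), buf, len);
        return result;
    }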

While I had an end in mind, I did a lot of this for the fun of it and to just scratch an itch, so I'm not really advocating for anything at this point.  I'm curious as to what others think the state of JSON and Postgres should be.  I've worked with JavaScript engines a good deal and would be happy to help get us there, either using some of the work/approaches here or going in a different direction.

Terry
