Обсуждение: patch: Add JSON datatype to PostgreSQL (GSoC, WIP)
This is a work-in-progress patch of my GSoC project: Add JSON datatype
to PostgreSQL.  It provides the following:
 * JSON datatype: A TEXT-like datatype for holding JSON-formatted
text.  Although the JSON RFC decrees that a JSON text be an "object or
array" (meaning '"hello"' is considered invalid JSON text), this
datatype lets you store any JSON "value" (meaning '"hello"'::JSON is
allowed).
 * Validation: Content is validated when a JSON datum is constructed,
but JSON validation can also be done programmatically with the
json_validate() function.
 * Conversion to/from JSON for basic types.  Conversion functions are
needed because casting will not unwrap JSON-encoded values.  For
instance, json('"string"')::text is '"string"', while
from_json('"string"') is 'string'.  Also, to_json can convert
PostgreSQL arrays to JSON arrays, providing a nice option for dealing
with arrays client-side.  from_json currently can't handle JSON
arrays/objects yet (how they should act is rather unclear to me,
except when array dimensions and element type are consistent).
 * Retrieving/setting values in a JSON node (via selectors very
similar to, but not 100% like, JSONPath as described at
http://goessner.net/articles/JsonPath/ ).
 * Miscellaneous functions json_condense and json_type.
This is a patch against CVS HEAD.  This module compiles, installs, and
passes all 8 tests successfully on my Ubuntu 9.10 system.  It is
covered pretty decently with regression tests.  It also has SGML
documentation (the generated HTML is attached for convenience).
Although I am aware of many problems in this patch, I'd like to put it
out sooner rather than later so it can get plenty of peer review.
Problems I'm aware of include:
 * Probably won't work properly when the encoding (client or server?)
is not UTF-8.  When encoding (e.g. with json_condense), it should (but
doesn't) use \uXXXX escapes for characters the target encoding doesn't
support.
 * json.c is rather autarkic.  It has its own string buffer system
(rather than using StringInfo) and UTF-8 validator (rather than using
pg_verify_mbstr_len(?) ).
 * Some functions/structures are named suggestively, as if they belong
to (and would be nice to have in) PostgreSQL's utility libraries.
They are:
   - TypeInfo, initTypeInfo, and getTypeInfo: A less cumbersome
wrapper around get_type_io_data.
   - FN_EXTRA and FN_EXTRA_SZ: Macros to make working with
fcinfo->flinfo->fn_extra easier.
   - enumLabelToOid: Look up the Oid of an enum label; needed to
return an enum that isn't built-in.
   - utf8_substring: Extract a range of UTF-8 characters out of a UTF-8 string.
 * Capitalization and function arrangement are rather inconsistent.
Braces are K&R-style.
 * json_cleanup and company aren't even used.
 * The sql/json.sql test case should be broken into more files.
P.S. The patch is gzipped because it expands to 2.6 megabytes.
Joey Adams
			
		Вложения
On Fri, Jul 23, 2010 at 2:18 AM, Joseph Adams
<joeyadams3.14159@gmail.com> wrote:
> This is a work-in-progress patch of my GSoC project: Add JSON datatype
> to PostgreSQL.  It provides the following:
>
>  * JSON datatype: A TEXT-like datatype for holding JSON-formatted
> text.  Although the JSON RFC decrees that a JSON text be an "object or
> array" (meaning '"hello"' is considered invalid JSON text), this
> datatype lets you store any JSON "value" (meaning '"hello"'::JSON is
> allowed).
>  * Validation: Content is validated when a JSON datum is constructed,
> but JSON validation can also be done programmatically with the
> json_validate() function.
>  * Conversion to/from JSON for basic types.  Conversion functions are
> needed because casting will not unwrap JSON-encoded values.  For
> instance, json('"string"')::text is '"string"', while
> from_json('"string"') is 'string'.  Also, to_json can convert
> PostgreSQL arrays to JSON arrays, providing a nice option for dealing
> with arrays client-side.  from_json currently can't handle JSON
> arrays/objects yet (how they should act is rather unclear to me,
> except when array dimensions and element type are consistent).
>  * Retrieving/setting values in a JSON node (via selectors very
> similar to, but not 100% like, JSONPath as described at
> http://goessner.net/articles/JsonPath/ ).
>  * Miscellaneous functions json_condense and json_type.
>
> This is a patch against CVS HEAD.  This module compiles, installs, and
> passes all 8 tests successfully on my Ubuntu 9.10 system.  It is
> covered pretty decently with regression tests.  It also has SGML
> documentation (the generated HTML is attached for convenience).
>
> Although I am aware of many problems in this patch, I'd like to put it
> out sooner rather than later so it can get plenty of peer review.
> Problems I'm aware of include:
>  * Probably won't work properly when the encoding (client or server?)
> is not UTF-8.  When encoding (e.g. with json_condense), it should (but
> doesn't) use \uXXXX escapes for characters the target encoding doesn't
> support.
>  * json.c is rather autarkic.  It has its own string buffer system
> (rather than using StringInfo) and UTF-8 validator (rather than using
> pg_verify_mbstr_len(?) ).
>  * Some functions/structures are named suggestively, as if they belong
> to (and would be nice to have in) PostgreSQL's utility libraries.
> They are:
>   - TypeInfo, initTypeInfo, and getTypeInfo: A less cumbersome
> wrapper around get_type_io_data.
>   - FN_EXTRA and FN_EXTRA_SZ: Macros to make working with
> fcinfo->flinfo->fn_extra easier.
>   - enumLabelToOid: Look up the Oid of an enum label; needed to
> return an enum that isn't built-in.
>   - utf8_substring: Extract a range of UTF-8 characters out of a UTF-8 string.
>  * Capitalization and function arrangement are rather inconsistent.
> Braces are K&R-style.
>  * json_cleanup and company aren't even used.
>  * The sql/json.sql test case should be broken into more files.
>
> P.S. The patch is gzipped because it expands to 2.6 megabytes.
Some technical points about the submission:
- If you want your coded to be included in PostgreSQL, you need to put
the same license and attribution on it that we use elsewhere.
Generally that looks sorta like this:
/*-------------------------------------------------------------------------** tablecmds.c*        Commands for creating
andaltering table structures and settings** Portions Copyright (c) 1996-2010, PostgreSQL Global Development Group*
PortionsCopyright (c) 1994, Regents of the University of California*** IDENTIFICATION*
$PostgreSQL$**-------------------------------------------------------------------------*/
You should have this header on each file, both .c and .h.
- The reason why this patch is 2.6MB is because it has 2.4MB of tests.I think you should probably pick out the most
useful10% or so, and 
drop the rest.
- I was under the impression that we wanted EXPLAIN (FORMAT JSON) to
return type json, but that's obviously not going to be possible if all
of this is contrib.  We could (a) move it all into core, (b) move the
type itself and its input and output functions into core and leave the
rest in contrib [or something along those lines], or (c) give up using
it as the return type for EXPLAIN (FORMAT JSON).
- You need to comply with the project coding standards.  Thus:
static void
foo()
{   exit(1);
}
Not:
static void foo()
{   exit(1);
}
You should have at least one blank line between every pair of
functions.  You should use uncuddled curly braces.  Your commenting
style doesn't match the project standard.  We prefer explicit NULL
tests over if (!foo).  Basically, you should run pgindent on your
code, and also read this:
http://www.postgresql.org/docs/8.4/static/source.html
I don't think this is going to fly either:
/* Declare and initialize a String with the given name. */
#define String(name) String name = NewString()
#define NewString() {{NULL, 0, 0}}
That's just too much magic.  We want people to be able to read this
code and easily tell what's going on...  spelling a few things out
long hand is OK, good, even.
- You have boatloads of functions in here with no comments whose
functions is entirely non-self-evident.  You may or may not need to
rename some of them, but you definitely need to write some comments.
- elog() must be used except for "can't happen" situations.  Compare
the way numeric_in() throws an error message versus json_in().
- You have a number of very short convenience functions which don't
seem warranted.  For example, build_array_string() is a 2-line
function that is called once.  And all the string_append,
string_append_range, string_append_char stuff is duplicating
functionality we already have in appendStringInfoStr,
appendBinaryStringInfo, appendStringInfoChar.
In short, my first impression of this patch is that there's a lot of
good stuff in here, but you need to stop adding features NOW and put A
LOT of work into getting this into a form that the community can
accept.  Otherwise, this is likely going to die on the vine, which
would be a shame because a lot of it seems pretty cool.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
			
		On Fri, Jul 23, 2010 at 2:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> - elog() must be used except for "can't happen" situations.  Compare
> the way numeric_in() throws an error message versus json_in().
Er... that should have said "elog() must NOT be used except for can't
happen situations".
Also, I was just looking at json_delete().  While the existing coding
there is kind of fun, I think this can be written more
straightforwardly by doing something like this (not tested):
   while (1)   {       while (is_internal(node) && node->v.children.head)           node = node->v.children.head;
if(node->next)           next = node->next;       else if (node->parent)           next = node->parent;       else
    break;       free_node(node);       node = next;   } 
That gets rid of all of the gotos and one of the local variables and,
at least IMO, is easier to understand...  though it would be even
better still if you also added a comment saying something like "We do
a depth-first, left-to-right traversal of the tree, freeing nodes as
we go.  We need not bother clearing any of the pointers, because the
traversal order is such that we're never in danger of revisiting a
node we've already freed."
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
			
		Update: I'm in the middle of cleaning up the JSON code ( http://git.postgresql.org/gitweb?p=json-datatype.git;a=summary if you want to see the very latest ), so I haven't addressed all of the major problems with it yet. On Fri, Jul 23, 2010 at 2:34 PM, Robert Haas <robertmhaas@gmail.com> wrote: > - I was under the impression that we wanted EXPLAIN (FORMAT JSON) to > return type json, but that's obviously not going to be possible if all > of this is contrib. We could (a) move it all into core, (b) move the > type itself and its input and output functions into core and leave the > rest in contrib [or something along those lines], or (c) give up using > it as the return type for EXPLAIN (FORMAT JSON). I've been developing it as a contrib module because: * I'd imagine it's easier than developing it as a built-in datatype right away (e.g. editing a .sql.in file versus editing pg_type.h ). * As a module, it has PGXS support, so people can try it out right away rather than having to recompile PostgreSQL. I, for one, think it would be great if the JSON datatype were all in core :-) However, if and how much JSON code should go into core is up for discussion. Thoughts, anyone? A particularly useful aspect of the JSON support is the ability to convert PostgreSQL arrays to JSON arrays (using to_json ), as there currently isn't any streamlined way to parse arrays in the PostgreSQL format client-side (that I know of). Joey Adams
On Sat, Jul 24, 2010 at 06:57:18PM -0400, Joseph Adams wrote: > A particularly useful aspect of the JSON support is the ability to > convert PostgreSQL arrays to JSON arrays (using to_json ), as there > currently isn't any streamlined way to parse arrays in the PostgreSQL > format client-side (that I know of). I really would like to address the latter issue some day. I don't know how many broken implementations I have seen in my, not that long, time using pg, but it sure is 10+. I also knowingly have written dumbed down versions. Andres
> I, for one, think it would be great if the JSON datatype were all in > core :-) However, if and how much JSON code should go into core >is up for discussion. Thoughts, anyone? > in my opinion: As soon as possible. Spinning PostgreSQL as the Ajax-enabled-database has many great uses. Harald -- GHUM Harald Massa persuadere et programmare Harald Armin Massa Spielberger Straße 49 70435 Stuttgart 0173/9409607 no fx, no carrier pigeon - Using PostgreSQL is mostly about sleeping well at night.
Updated JSON datatype patch.  It cleans up the major problems that
have been discussed, and it's very close to being commit-worthy (I
think).  The major issues as I see them are:
 * Contains several utility functions that may be useful in general.
They are all in util.c / util.h
 * It's still a contrib module
 * No json_agg or json_object functions for constructing arrays / objects
The utility functions and their potential to collide with themselves
in the future is the main problem with this patch.  Of course, this
problem could be sidestepped simply by namespacifying them (prepending
json_ to all the function names).  I would like some thoughts and
opinions about the design and usefulness of the utility code.
An overview, along with my thoughts, of the utility functions:
FN_EXTRA, FN_EXTRA_ALLOC, FN_MCXT macros
 * Useful-ometer: ()--------------------o
 * Rationale: Using fcinfo->flinfo->fn_extra takes a lot of
boilerplate.  These macros help cut down the boilerplate, and the
comment explains what fn_extra is all about.
TypeInfo structure and getTypeInfo function
 * Useful-ometer: ()---------------------------o
 * Rationale: The get_type_io_data "six-fer" function is very
cumbersome to use, since one has to declare all the output variables.
The getTypeInfo puts the results in a structure.  It also performs the
fmgr_info_cxt step, which is a step done after every usage of
get_type_io_data in the PostgreSQL code.
getEnumLabelOids
 * Useful-ometer: ()-----------------------------------o
 * Rationale: There is currently no streamlined way to return a custom
enum value from a PostgreSQL function written in C.  This function
performs a batch lookup of enum OIDs, which can then be cached with
fn_extra.  This should be reasonably efficient, and it's quite elegant
to use (see json_op.c for an example).
UTF-8 functions:
    utf8_substring
    utf8_decode_char
        (there's a patch in the works for a utf8_to_unicode function
         which does the same thing as this function)
    utf8_validate (variant of pg_verify_mbstr(PG_UTF8, str, length, true)
that allows '\0' characters)
    server_to_utf8
    utf8_to_server
    text_to_utf8_cstring
    utf8_cstring_to_text
    utf8_cstring_to_text_with_len
 * Useful-ometer: ()-------o
 * Rationale: The JSON code primarily operates in UTF-8 rather than
the server encoding because it needs to deal with Unicode escapes, and
there isn't an efficient way to encode/decode Unicode codepoints
to/from the server encoding.  These functions make it easy to perform
encoding conversions needed for the JSON datatype.  However, they're
not very useful when operating solely in the server encoding, hence
the low usefulometric reading.
As for the JSON datatype support itself, nobody has come out against
making JSON a core datatype rather than a contrib module, so I will
proceed with making it one.  I guess this would involve adding entries
to pg_type.h and pg_proc.h .  Where would I put the rest of the code?
I guess json_io.c and json_op.c (the PG_FUNCTION_ARGS functions) would
become json.c in src/backend/utils/adt .  Where would json.c and
jsonpath.c (JSON encoding/decoding functions and JSONPath
implementation) go?
Are there any other issues with the JSON code I didn't spot?
Thanks,
Joey Adams
			
		Вложения
Excerpts from Joseph Adams's message of mar ago 10 04:03:43 -0400 2010: > An overview, along with my thoughts, of the utility functions: > > FN_EXTRA, FN_EXTRA_ALLOC, FN_MCXT macros > * Useful-ometer: ()--------------------o > > TypeInfo structure and getTypeInfo function > * Useful-ometer: ()---------------------------o > > getEnumLabelOids > * Useful-ometer: ()-----------------------------------o I think this kind of thing could be stripped from the patch and submitted separately; they would presumably see a quick review and commit if they are small and useful (particularly if you can show a decrease of code verbosity by switching other uses in the existing code). The advantage is you don't have to keep arguing for their usefulness in the JSON patch; and if they turn out to be rejected, they won't cause the JSON patch to be rejected as a whole. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Excerpts from Joseph Adams's message of mar ago 10 04:03:43 -0400 2010:
>> An overview, along with my thoughts, of the utility functions:
>> 
>> FN_EXTRA, FN_EXTRA_ALLOC, FN_MCXT macros
>> * Useful-ometer: ()--------------------o
>> 
>> TypeInfo structure and getTypeInfo function
>> * Useful-ometer: ()---------------------------o
>> 
>> getEnumLabelOids
>> * Useful-ometer: ()-----------------------------------o
> I think this kind of thing could be stripped from the patch and
> submitted separately;
+1.  It's easier all around if a patch does just one thing.  Code
refactoring and feature addition, in particular, are easier to review
separately.
        regards, tom lane
			
		On Sat, 2010-07-24 at 18:57 -0400, Joseph Adams wrote: > I've been developing it as a contrib module because: > * I'd imagine it's easier than developing it as a built-in datatype > right away (e.g. editing a .sql.in file versus editing pg_type.h ). > * As a module, it has PGXS support, so people can try it out right > away rather than having to recompile PostgreSQL. > > I, for one, think it would be great if the JSON datatype were all in > core :-) However, if and how much JSON code should go into core is up > for discussion. Thoughts, anyone? As a GSoC piece of work, doing it as a contrib module gives an immediately useful deliverable. Good plan. Once that is available, we can then get some feedback on it and include it as an in-core datatype later in the 9.1 cycle. So lets do both: contrib and in-core. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Development, 24x7 Support, Training and Services
On Sat, Jul 24, 2010 at 06:57:18PM -0400, Joseph Adams wrote: > Update: I'm in the middle of cleaning up the JSON code ( > http://git.postgresql.org/gitweb?p=json-datatype.git;a=summary if you > want to see the very latest ), so I haven't addressed all of the major > problems with it yet. > > On Fri, Jul 23, 2010 at 2:34 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > - I was under the impression that we wanted EXPLAIN (FORMAT JSON) to > > return type json, but that's obviously not going to be possible if all > > of this is contrib. We could (a) move it all into core, (b) move the > > type itself and its input and output functions into core and leave the > > rest in contrib [or something along those lines], or (c) give up using > > it as the return type for EXPLAIN (FORMAT JSON). > > I've been developing it as a contrib module because: > * I'd imagine it's easier than developing it as a built-in datatype > right away (e.g. editing a .sql.in file versus editing pg_type.h ). > * As a module, it has PGXS support, so people can try it out right > away rather than having to recompile PostgreSQL. > > I, for one, think it would be great if the JSON datatype were all in > core :-) However, if and how much JSON code should go into core is up > for discussion. Thoughts, anyone? +1 for putting it in core in 9.1 :) Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On Wed, 2010-08-11 at 15:27 -0700, David Fetter wrote: > > > I've been developing it as a contrib module because: > > * I'd imagine it's easier than developing it as a built-in datatype > > right away (e.g. editing a .sql.in file versus editing pg_type.h ). > > * As a module, it has PGXS support, so people can try it out right > > away rather than having to recompile PostgreSQL. > > > > I, for one, think it would be great if the JSON datatype were all in > > core :-) However, if and how much JSON code should go into core is up > > for discussion. Thoughts, anyone? > > +1 for putting it in core in 9.1 :) I would be curious to the benefit of putting it in core. I have no problem with the type but in core? Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579 Consulting, Training, Support, Custom Development, Engineering http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
On Wed, Aug 11, 2010 at 03:40:36PM -0700, Joshua D. Drake wrote: > On Wed, 2010-08-11 at 15:27 -0700, David Fetter wrote: > > > > > I've been developing it as a contrib module because: > > > * I'd imagine it's easier than developing it as a built-in > > > datatype right away (e.g. editing a .sql.in file versus editing > > > pg_type.h ). > > > * As a module, it has PGXS support, so people can try it out > > > right away rather than having to recompile PostgreSQL. > > > > > > I, for one, think it would be great if the JSON datatype were > > > all in core :-) However, if and how much JSON code should go > > > into core is up for discussion. Thoughts, anyone? > > > > +1 for putting it in core in 9.1 :) > > I would be curious to the benefit of putting it in core. I have no > problem with the type but in core? If it's not in core, the vast majority of users will not have it installed, and nothing, in core or otherwise, will be able to count on it. As this is really pretty green-field stuff, it's super unlikely to break extant code. :) Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On 08/11/2010 07:33 PM, David Fetter wrote: >> I would be curious to the benefit of putting it in core. I have no >> problem with the type but in core? > If it's not in core, the vast majority of users will not have it > installed, and nothing, in core or otherwise, will be able to count on > it. > > You could say that about almost any feature. PostgreSQL is designed to be modular, and we can hardly credibly use that as an argument against ourselves. A convincing argument would be that there is another feature we want in core that needs or at least could benefit from it. cheers andrew
Andrew Dunstan wrote: > > > On 08/11/2010 07:33 PM, David Fetter wrote: > >> I would be curious to the benefit of putting it in core. I have no > >> problem with the type but in core? > > If it's not in core, the vast majority of users will not have it > > installed, and nothing, in core or otherwise, will be able to count on > > it. > > > > > > > You could say that about almost any feature. PostgreSQL is designed to > be modular, and we can hardly credibly use that as an argument against > ourselves. > > A convincing argument would be that there is another feature we want in > core that needs or at least could benefit from it. I would say that JSON is no longer a niche data format, which would suggest its inclusion in core. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Wed, Aug 11, 2010 at 07:39:37PM -0400, Andrew Dunstan wrote: > On 08/11/2010 07:33 PM, David Fetter wrote: > >>I would be curious to the benefit of putting it in core. I have no > >>problem with the type but in core? > >If it's not in core, the vast majority of users will not have it > >installed, and nothing, in core or otherwise, will be able to count on > >it. > > You could say that about almost any feature. PostgreSQL is designed > to be modular, and we can hardly credibly use that as an argument > against ourselves. > > A convincing argument would be that there is another feature we want > in core that needs or at least could benefit from it. EXPLAIN (FORMAT JSON) would benefit right away, as its overall code would be much more likely to be (and stay) correct. Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On Wed, 2010-08-11 at 15:27 -0700, David Fetter wrote: > > > I've been developing it as a contrib module because: > > * I'd imagine it's easier than developing it as a built-in datatype > > right away (e.g. editing a .sql.in file versus editing pg_type.h ). > > * As a module, it has PGXS support, so people can try it out right > > away rather than having to recompile PostgreSQL. > > > > I, for one, think it would be great if the JSON datatype were all in > > core :-) However, if and how much JSON code should go into core is up > > for discussion. Thoughts, anyone? > > +1 for putting it in core in 9.1 :) I would be curious to the benefit of putting it in core. I have no problem with the type but in core? Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579 Consulting, Training, Support, Custom Development, Engineering http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
Updated patch: the JSON code has all been moved into core, so this patch is now for a built-in data type. However, I factored the general-purpose utility functions out into a different patch to be reviewed separately, so this JSON data type patch won't work without the utility patch at http://archives.postgresql.org/pgsql-hackers/2010-08/msg00949.php . I still need to improve the documentation, as it still looks like the documentation for a module instead of a built-in data type. Also, the code is not taking advantage of the json_type_t enum being built-in yet; it still uses getEnumLabelOids (one of the utility functions) to retrieve the OIDs dynamically. Joey Adams
Вложения
Hi, I start to review JSON patch.
On Fri, Aug 13, 2010 at 7:33 PM, Joseph Adams
<joeyadams3.14159@gmail.com> wrote:
> Updated patch:  the JSON code has all been moved into core, so this
> patch is now for a built-in data type.
I think the patch can be split into two pieces: 1. Basic I/O support for JSON type (in/out/validate) 2. JSONPath
supportand functions for partial node management 
It is better to submit only 1 at first. Of course we should consider
about JSONPath before deciding the internal representation of JSON,
but separated patches can be easily reviewed.
I have several questions about the spec and implementation.
Sorry if you have already discussed about some of them, but I cannot
understand why the current code is the best design from the patch...
* Should we accept a scalar value as a valid JSON?
According to RFC, the root element of JSON text must be an object
or array. But to_json() and from_json() accept scalar values.
* JSON to a scalar value by from_json()
How about to have json_to_array(JSON) instead of from_json()?
JSON value is always an array or object, it's nonsense to convert
it to a scalar value directly; to an array seems to match better.
In addition, an array can be indexed with GIN; index-able JSON
type is very attractive.
* struct JSON seems to be too complex for me.
Can we use List (pg_list.h) instead of linked-list? 'key' and 'key_length'
fields should be held in the parent's List. i.e, JSON_ARRAY has List of
JSON, and JSON_OBJECT has List of {string, JSON} pairs.
We could also discard 'parent' field. It might be needed by JSONPath,
but we can have parent information in variables on C-stack because we
search JSON trees from root to children, no?
I think we don't need 'orig' field because the original input text is
not so important in normal use cases. Instead, we could have formatter
function something like json_pretty(json) RETURNS text.
* On-disk format of JSON values
(There might be some discussions before... What is the conclusion?)
The current code stores the original input text, but we can use
some kinds of pre-parsed format to store JSON, like hstore.
It can be different from BSON.
* Completeness of JSONPath APIs
json_get() can be replaced with json_path(), no?
Also, we can replace existing nodes with json_set(), but we cannot
append new nodes. What do you think modification of JSON value?
If the design is too difficult, it'd be better only to have search
APIs at this moment. Modification APIs will be added in the future.
--
Itagaki Takahiro
			
		On Wed, Aug 25, 2010 at 1:34 AM, Itagaki Takahiro <itagaki.takahiro@gmail.com> wrote: > * Should we accept a scalar value as a valid JSON? > According to RFC, the root element of JSON text must be an object > or array. But to_json() and from_json() accept scalar values. This seems a bit like the XML document/content distinction, which I've never really understood. If [[1], false] is a legal JSON value, then it seems odd that [1] should be legal but false not. > * JSON to a scalar value by from_json() > How about to have json_to_array(JSON) instead of from_json()? > JSON value is always an array or object, it's nonsense to convert > it to a scalar value directly; to an array seems to match better. > In addition, an array can be indexed with GIN; index-able JSON > type is very attractive. Yeah, I don't like the excessive use of polymorphism either. > * On-disk format of JSON values > (There might be some discussions before... What is the conclusion?) > The current code stores the original input text, but we can use > some kinds of pre-parsed format to store JSON, like hstore. > It can be different from BSON. I see no value to choosing a different on-disk format. It might speed up indexing, but I/O will be slower. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
2010/8/25 Robert Haas <robertmhaas@gmail.com>: > On Wed, Aug 25, 2010 at 1:34 AM, Itagaki Takahiro > <itagaki.takahiro@gmail.com> wrote: >> * Should we accept a scalar value as a valid JSON? >> According to RFC, the root element of JSON text must be an object >> or array. But to_json() and from_json() accept scalar values. > > This seems a bit like the XML document/content distinction, which I've > never really understood. If [[1], false] is a legal JSON value, then > it seems odd that [1] should be legal but false not. I want false to be parsed without error, just for convinience. JSON specification seems a bit too strict. For example, it doesn't have date value as its parts, which results in people implement their own parsing rule for Date(long). And AFAIK the strictness of JSON parsing is partly because the main platform was browser engines that can eval() string that causes security issue. Without execution engine, we can allow more formats than RFC. >> * On-disk format of JSON values >> (There might be some discussions before... What is the conclusion?) >> The current code stores the original input text, but we can use >> some kinds of pre-parsed format to store JSON, like hstore. >> It can be different from BSON. > > I see no value to choosing a different on-disk format. It might speed > up indexing, but I/O will be slower. It depends on use cases, but in my mind plain text will do for us. If we have JavaScript engine in PostgreSQL like pl/v8 and it handles on-disk format as-is, then we should choose the kind of format, but in either text or binary format way it is hopeless to have such compelling environment in the short future. Regards, -- Hitoshi Harada
Hitoshi Harada <umi.tanuki@gmail.com> writes: > It depends on use cases, but in my mind plain text will do for us. If > we have JavaScript engine in PostgreSQL like pl/v8 and it handles > on-disk format as-is, then we should choose the kind of format, but in > either text or binary format way it is hopeless to have such > compelling environment in the short future. Well, for javascript support, there's another nice thing happening: - plscheme is built on GNU Guile - next version of GNU Guile supports javascript too http://plscheme.projects.postgresql.org/ http://wingolog.org/archives/2009/02/22/ecmascript-for-guile So my current guess at which javascript engine we'd get first would be plscheme. Now I don't know what implication that would have on the binary storage format of javascript or json documents. Regards, -- dim
Hi, Anyone working on JSON datatype? If no, I'm going to submit simplified version of JSON datatype patch. On Wed, Aug 25, 2010 at 2:34 PM, Itagaki Takahiro <itagaki.takahiro@gmail.com> wrote: > On Fri, Aug 13, 2010 at 7:33 PM, Joseph Adams > <joeyadams3.14159@gmail.com> wrote: >> Updated patch: the JSON code has all been moved into core, so this >> patch is now for a built-in data type. > > I think the patch can be split into two pieces: > 1. Basic I/O support for JSON type (in/out/validate) > 2. JSONPath support and functions for partial node management > > It is better to submit only 1 at first. Of course we should consider > about JSONPath before deciding the internal representation of JSON, > but separated patches can be easily reviewed. -- Itagaki Takahiro
On 09/13/2010 09:30 PM, Itagaki Takahiro wrote: > Hi, > > Anyone working on JSON datatype? > If no, I'm going to submit simplified version of JSON datatype patch. > What's the state of the GSOC project? cheers andrew
On Tue, Sep 14, 2010 at 04:06, Andrew Dunstan <andrew@dunslane.net> wrote: > > > On 09/13/2010 09:30 PM, Itagaki Takahiro wrote: >> >> Hi, >> >> Anyone working on JSON datatype? >> If no, I'm going to submit simplified version of JSON datatype patch. >> > > What's the state of the GSOC project? Well, GSoC itself is over. But Joey said he'd continue to work on the patch after that, to get it ready for commit. I've been unable to confirm that with him for this commitfest though, so I don't know the exact status. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
On Fri, Aug 13, 2010 at 7:33 PM, Joseph Adams <joeyadams3.14159@gmail.com> wrote: > Updated patch: the JSON code has all been moved into core, so this > patch is now for a built-in data type. I have a question about the design of the JSON type. Why do we need to store the value in UTF8 encoding? It's true the RFC of JSON says the the encoding SHALL be encoded in Unicode, but I don't understand why we should reject other encodings. As I said before, I'd like to propose only 3 features in the commitfest: * TYPE json data type * text to json: FUNCTION json_parse(text)* json to text: FUNCTION json_stringify(json, whitelist, space) JSONPath will be re-implemented on the basic functionalities in the subsequent commitfest. Do you have a plan to split your patch? Or, can I continue to develop my patch? If so, JSONPath needs to be adjusted to the new infrastructure. I think json_parse() and json_stringify() is well-known APIs for JSON: https://developer.mozilla.org/En/Using_JSON_in_Firefox So, it'd be worth buying the names and signatures for our APIs. (I'll rename json_pretty in my previous patch to json_stringify.) -- Itagaki Takahiro
On Fri, Sep 17, 2010 at 8:32 AM, Itagaki Takahiro <itagaki.takahiro@gmail.com> wrote: > On Fri, Aug 13, 2010 at 7:33 PM, Joseph Adams > <joeyadams3.14159@gmail.com> wrote: >> Updated patch: the JSON code has all been moved into core, so this >> patch is now for a built-in data type. > > I have a question about the design of the JSON type. Why do we need to > store the value in UTF8 encoding? It's true the RFC of JSON says the > the encoding SHALL be encoded in Unicode, but I don't understand why > we should reject other encodings. Actually, the code in my original patch should work with any server encoding in PostgreSQL. However, internally, it operates in UTF-8 and converts to/from the server encoding when necessary. I did it this way because the JSON code needs to handle Unicode escapes like "\u266B", but there is no simple and efficient way (that I know of) to convert single characters to/from the server encoding. I noticed that in your new patch, you sidestepped the encoding issue by simply storing strings in their encoded form (right?). This is nice and simple, but in the future, JSON tree conversions and updates will still need to deal with the encoding issue somehow. > As I said before, I'd like to propose only 3 features in the commitfest: > * TYPE json data type > * text to json: FUNCTION json_parse(text) > * json to text: FUNCTION json_stringify(json, whitelist, space) Although casting from JSON to TEXT does "stringify" it in my original patch, I think json_stringify would be much more useful. In addition to the formatting options, if the internal format of the JSON type changes and no longer preserves original formatting, then the behavior of the following would change: $$ "unnecessary\u0020escape" $$ :: JSON :: TEXT json_stringify would be more predictable because it would re-encode the whitespace (but not the \u0020, unless we went out of our way to make it do that). Also, json_parse is "unnecessary" if you allow casting from TEXT to JSON (which my patch does), but I think having json_parse would be more intuitive for the same reason you do. Long story short: I like it :-) If you're keeping track, features from my patch not in the new code yet are:* Programmatically validating JSON ( json_validate() )* Getting the type of a JSONvalue ( json_type() )* Converting scalar values to/from JSON* Converting arrays to JSON* JSONPath > JSONPath will be re-implemented on the basic functionalities in the > subsequent commitfest. Do you have a plan to split your patch? > Or, can I continue to develop my patch? If so, JSONPath needs > to be adjusted to the new infrastructure. I think your patch is on a better footing than mine, so maybe I should start contributing to your code rather than the other way around. Before the next commitfest, I could merge the testcases from my patch in and identify parsing discrepancies (if any). Afterward, I could help merge the other features into the new JSON infrastructure. I can't compile your initial patch against the latest checkout because json_parser.h and json_scanner.h are missing. Is there a more recent patch, or could you update the patch so it compiles? I'd like to start tinkering with the new code. Thanks! Joey Adams
On Sat, Sep 18, 2010 at 6:45 AM, Joseph Adams <joeyadams3.14159@gmail.com> wrote: >> Why do we need to store the value in UTF8 encoding? > > because the JSON code needs to handle Unicode escapes like > "\u266B", but there is no simple and efficient way (that I know of) to > convert single characters to/from the server encoding. Ah, we don't need UTF8 encoding only to store JSON data, but we should care about Unicode escape when we support comparison and extracting values from JSON, right? I see the worth encoding to UTF8. One of my proposal is we don't have to keep the original input text. We store JSON data in effective internal formats. If users want to get human-readable output, they can use stringify() with indentation option. > I think your patch is on a better footing than mine, so maybe I should > start contributing to your code rather than the other way around. > Before the next commitfest, I could merge the testcases from my patch > in and identify parsing discrepancies (if any). Afterward, I could > help merge the other features into the new JSON infrastructure. Thanks! I'll contribute my codes developed for another project (PL/JavaScript), and let's merge our codes to the core. > I can't compile your initial patch against the latest checkout because > json_parser.h and json_scanner.h are missing. Hmm, those files should be generated from .y and .l files. I'll check it. -- Itagaki Takahiro
Itagaki Takahiro wrote: > On Sat, Sep 18, 2010 at 6:45 AM, Joseph Adams > <joeyadams3.14159@gmail.com> wrote: > >> Why do we need to store the value in UTF8 encoding? > > > > because the JSON code needs to handle Unicode escapes like > > "\u266B", but there is no simple and efficient way (that I know of) to > > convert single characters to/from the server encoding. > > Ah, we don't need UTF8 encoding only to store JSON data, but we should > care about Unicode escape when we support comparison and extracting > values from JSON, right? I see the worth encoding to UTF8. > > One of my proposal is we don't have to keep the original input text. > We store JSON data in effective internal formats. If users want to get > human-readable output, they can use stringify() with indentation option. > > > I think your patch is on a better footing than mine, so maybe I should > > start contributing to your code rather than the other way around. > > Before the next commitfest, I could merge the testcases from my patch > > in and identify parsing discrepancies (if any). ?Afterward, I could > > help merge the other features into the new JSON infrastructure. > > Thanks! I'll contribute my codes developed for another project > (PL/JavaScript), and let's merge our codes to the core. > > > I can't compile your initial patch against the latest checkout because > > json_parser.h and json_scanner.h are missing. > > Hmm, those files should be generated from .y and .l files. I'll check it. I am please the two efforts can be joined. I like the idea of PL/JavaScript too. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Fri, Sep 17, 2010 at 10:28 PM, Itagaki Takahiro <itagaki.takahiro@gmail.com> wrote: > One of my proposal is we don't have to keep the original input text. > We store JSON data in effective internal formats. If users want to get > human-readable output, they can use stringify() with indentation option. There's a trade-off here: this will make some things faster, but other things slower. Probably some discussion of the pros and cons is in order. (Also, it's important not to break EXPLAIN (FORMAT JSON), which thinks that the internal format of JSON is text.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Sat, Sep 18, 2010 at 11:46 AM, Robert Haas <robertmhaas@gmail.com> wrote: > <itagaki.takahiro@gmail.com> wrote: >> One of my proposal is we don't have to keep the original input text. >> We store JSON data in effective internal formats. If users want to get >> human-readable output, they can use stringify() with indentation option. > > There's a trade-off here: this will make some things faster, but other > things slower. Probably some discussion of the pros and cons is in > order. I didn't intended to introduce non-text internal formats. The original patch spent some codes to keep all of whitespaces as-is in the input. But I'd say we can simplify it. Except whitespaces, normalization of strings and numbers might be problem when we support JSON comparison operators -- comparison of Unicode escaped characters in strings or 0 vs. 0.0 in numbers. -- Itagaki Takahiro
On Fri, Sep 17, 2010 at 11:12 PM, Itagaki Takahiro <itagaki.takahiro@gmail.com> wrote: > On Sat, Sep 18, 2010 at 11:46 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> <itagaki.takahiro@gmail.com> wrote: >>> One of my proposal is we don't have to keep the original input text. >>> We store JSON data in effective internal formats. If users want to get >>> human-readable output, they can use stringify() with indentation option. >> >> There's a trade-off here: this will make some things faster, but other >> things slower. Probably some discussion of the pros and cons is in >> order. > > I didn't intended to introduce non-text internal formats. The original > patch spent some codes to keep all of whitespaces as-is in the input. > But I'd say we can simplify it. > > Except whitespaces, normalization of strings and numbers might be > problem when we support JSON comparison operators -- comparison of > Unicode escaped characters in strings or 0 vs. 0.0 in numbers. Hmm, yeah. I'd be tempted to try to keep the user's original whitespace as far as possible, but disregard it as far as equality comparison goes. However, I'm not quite sure what the right thing to do about 0 vs 0.0 is. Does the JSON spec say anything about that? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Sat, Sep 18, 2010 at 4:03 PM, Robert Haas <robertmhaas@gmail.com> wrote: > Hmm, yeah. I'd be tempted to try to keep the user's original > whitespace as far as possible, but disregard it as far as equality > comparison goes. However, I'm not quite sure what the right thing to > do about 0 vs 0.0 is. Does the JSON spec say anything about that? I didn't find anything in the JSON spec about comparison, but in JavaScript, 0 == 0.0 and 0 === 0.0 are both true. Also, JavaScript considers two arrays or objects equal if and only if they are references to the same object, meaning [1,2,3] == [1,2,3] is false, but if you let var a = [1,2,3]; var b = a; , then a == b and a === b are both true. Hence, JavaScript can help us when deciding how to compare scalars, but as for arrays and objects, "we're on our own" (actually, JSON arrays could be compared lexically like PostgreSQL arrays already are; I don't think anyone would disagree with that). I cast my vote for 0 == 0.0 being true. As for whitespace preservation, I don't think we should go out of our way to keep it intact. Sure, preserving formatting for input and output makes some sense because we'd have to go out of our way to normalize it, but preserving whitespace in JSONPath tree selections (json_get) and updates (json_set) is a lot of work (that I shouldn't have done), and it doesn't really help anybody. Consider json_get on a node under 5 levels of indentation. Another thing to think about is the possibility of using a non-text format in the future (such as a binary format or even a format that is internally indexed). A binary format would likely be faster to compare (and hence faster to index). If the JSON data type preserves whitespace today, it might break assumptions of future code when it stops preserving whitespace. This should at least be documented.
On Sat, Sep 18, 2010 at 11:48 PM, Joseph Adams <joeyadams3.14159@gmail.com> wrote: > As for whitespace preservation, I don't think we should go out of our > way to keep it intact. Sure, preserving formatting for input and > output makes some sense because we'd have to go out of our way to > normalize it, but preserving whitespace in JSONPath tree selections > (json_get) and updates (json_set) is a lot of work (that I shouldn't > have done), and it doesn't really help anybody. Consider json_get on > a node under 5 levels of indentation. That seems reasonable to me. I don't mind messing up the whitespace when someone pulls a value out using a jsonpath, but I don't think we should mess with it when they just ask us to store a value. Users will tend to format their JSON in a way that they find readable, and while you'd need artificial intelligence to preserve their preferred formatting in all cases, changing it to what we think is best when there's no intrinsic necessity doesn't seem helpful. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On 08/12/2010 06:27 AM, David Fetter wrote: > +1 for putting it in core in 9.1 :) There are times I really wish I could get object graphs, or at least hierarchically nested object trees, of objects matching various criteria. JSON might be a reasonable representation, and one that's increasingly well supported by many different clients. Having it core would be really handy for that sort of use, especially as managing contrib modules is rather far from ideal in Pg as things stand. Using the usual generic business app example, it'd be good to be able to retrieve "customer 1234, as well as all dependent records in Invoices and Addresses" with one query, one round trip ... and no horrid ORM-like use of left outer joins plus post-processing. I've been able to do that with nested XML representations, but it's a remarkably verbose, bulky way to move chunks of data around. This patch already looks like it has lots of promise for that sort of use. It'd need aggregates, but that's already come up. A composite-or-row-type to json converter seems to be much of the point of this patch, and that's the only other part that's really required. So I'm excited, and I suspect I won't be the only one. I'm grabbing it to start playing with it now. I just wanted to chime in with interest + enthusiasm for JSON as an increasingly useful representation. -- Craig Ringer
			
				Hi all -
		
	I independently started some work on a similar capability as was contributed back in August by Joey Adams for a json datatype.  Before starting, I did a quick search but for some reason didn't turn this existing thread up.
What I've been working on is out on github for now: http://github.com/tlaurenzo/pgjson
When I started, I was actually aiming for something else, and got caught up going down this rabbit hole.  I took a different design approach, making the internal form be an extended BSON stream and implementing event-driven parsing and serializing to the different formats.  There was some discussion in the original thread around storing plain text vs a custom format.  I have to admit I've been back and forth a couple of times on this and have come to like a BSON-like format for the data at rest.  
Pros:
   - It is directly iterable without parsing and/or constructing an AST
   - It is its own representation.  If iterating and you want to tear-off a value to be returned or used elsewhere, its a simple buffer copy plus some bit twiddling.
   - It is conceivable that clients already know how to deal with BSON, allowing them to work with the internal form directly (ala MongoDB)
   - It stores a wider range of primitive types than JSON-text.  The most important are Date and binary.
Cons:
   - The format appears to have been "grown".  Some of the critical decisions were made late in the game (ie. why would your integral type codes be last)
   - Natively, the format falls victim to the artificial document vs element distinction, which I never understood.  I've worked around this with an escape mechanism for representing root values, but its not great.
   - The processor is not resilient in the face of unknown element types
I'm leaning towards thinking that the determination comes down to the following:
   - If you just want a "checkbox" item that the database has a json datatype and some support functions, storing as text may make sense.  It can be much simpler; however, it becomes increasingly hard to do anything real without adding a parse to AST, manipulate, dump to text cycle to every function.
   - If you want a json datatype that is highly integrated and manipulable, you want a binary datastructure and in the absence of any other contender in this area, BSON is ok (not great, but ok).
   - The addition of a JavaScript procedural language probably does not bring its own format for data at rest.  All of the engines I know of (I haven't looked at what Guile is doing) do not have a static representation for internal data structures.  They are heap objects with liberal use of internal and external pointers.  Most do have a mechanism, however, for injecting foreign objects into the runtime without resorting to making a dumb copy.  As such, the integration approach would probably be to determine the best format for JSON data at rest and provide adapters to the chosen JavaScript runtime to manipulate this at-rest format directly (potentially using a copy on write approach).  If the at-rest format is Text, then you would need to do a parse-to-AST step for each JavaScript function invocation.
Here's a few notes on my current implementation:
   - Excessive use of lex/yacc: This was quick and easy but the grammars are simple enough that I'd probably hand-code a parser for any final solution.
   - When the choice between following the json.org spec to the letter and implementing lenient parsing for valid JavaScript constructs arose, I chose lenient.
   - Too much buffer copying: When I started, I was just doodling with writing C code to manipulate JSON/BSON and not working with postgres in particular.  As such, it all uses straight malloc/free and too many copies are made to get things in and out of VARDATA structures.  This would all be eliminated in any real version.
   - UTF-8 is supported but not fully working completely.  The support functions that Joey wrote do a better job at this.
   - My json path evaluation is crippled.  Given the integration with the PG type system, I thought I just wanted a simple property traversal mechanism, punting higher level manipulation to native PG functions.  Seeing real JSONPath work, though, I'm not so sure.  I like the simplicity of what I've done but the features of the full bit are nice too.
   - This is first-pass prototype code with the goal of seeing it all working together.
While I had an end in mind, I did a lot of this for the fun of it and to just scratch an itch, so I'm not really advocating for anything at this point.  I'm curious as to what others think the state of JSON and Postgres should be.  I've worked with JavaScript engines a good deal and would be happy to help get us there, either using some of the work/approaches here or going in a different direction.
Terry
			
		2010/10/17 Terry Laurenzo <tj@laurenzo.org>: > Hi all - > I independently started some work on a similar capability as was contributed > back in August by Joey Adams for a json datatype. Before starting, I did a > quick search but for some reason didn't turn this existing thread up. > What I've been working on is out on github for > now: http://github.com/tlaurenzo/pgjson > When I started, I was actually aiming for something else, and got caught up > going down this rabbit hole. I took a different design approach, making the > internal form be an extended BSON stream and implementing event-driven > parsing and serializing to the different formats. There was some discussion > in the original thread around storing plain text vs a custom format. I have > to admit I've been back and forth a couple of times on this and have come to > like a BSON-like format for the data at rest. Reading your proposal, I'm now +1 for BSON-like style. Especially JS engine's capabilities to map external data to the language representation are good news. I agree the mapping is engine's task, not data format task. I'm not sure if your BSON-like format is more efficient in terms of space and time than plain text, though. I like as simple design as we can accept. ISTM format, I/O interface, simple get/set, mapping tuple from/to object, and indexing are minimum requirement. Something like JSONPath, aggregates, hstore conversion and whatsoever sound too much. Regards, -- Hitoshi Harada
On Sun, Oct 17, 2010 at 5:18 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote: > Reading your proposal, I'm now +1 for BSON-like style. Especially JS > engine's capabilities to map external data to the language > representation are good news. Hmm, we could store postgres' data types as-is with their type oids. I'm not sure whether it is efficient or worth doing, though. > I like as simple design as we can accept. ISTM format, I/O interface, > simple get/set, mapping tuple from/to object, and indexing are minimum > requirement. +1 to small start, but simple get/set are already debatable... For example, text/json conversion: A. SELECT '<json>'::json; B. SELECT '<text>'::text::json; In the git repo, A calls parse_json_to_bson_as_vardata(), so the input should be a json format. OTOH, B calls pgjson_json_from_text(), so the input can be any text. Those behaviors are surprising. I think we have no other choice but to define text-to-json cast as parsing. The same can be said for json-to-text -- type-output function vs. extracting text value from json. I think casting text to/from json should behave in the same way as type input/output. The xml type works in the same manner. And if so, we might not have any casts to/from json for consistency, even though there are no problems in casts for non-text types. I'll list issues before we start json types even in the simplest cases: ---- 1. where to implement json core: external library vs. inner postgres 2. internal format: text vs. binary (*) 3. encoding: always UTF-8 vs. database encoding (*) 4. meaning of casts text to/from json: parse/stringify vs. get/set 5. parser implementation: flex/bison vs. hand-coded. ---- (*) Note that we would have comparison two json values in the future. So, we might need to normalize the internal format even in text representation. The most interesting parts of json types, including indexing and jsonpath, would be made on the json core. We need conclusions about those issues. -- Itagaki Takahiro
> I like as simple design as we can accept. ISTM format, I/O interface,+1 to small start, but simple get/set are already debatable...
> simple get/set, mapping tuple from/to object, and indexing are minimum
> requirement.
For example, text/json conversion:
A. SELECT '<json>'::json;
B. SELECT '<text>'::text::json;
In the git repo, A calls parse_json_to_bson_as_vardata(), so the input
should be a json format. OTOH, B calls pgjson_json_from_text(), so the
input can be any text. Those behaviors are surprising. I think we have
no other choice but to define text-to-json cast as parsing. The same
can be said for json-to-text -- type-output function vs. extracting
text value from json.
I think casting text to/from json should behave in the same way as type
input/output. The xml type works in the same manner. And if so, we might
not have any casts to/from json for consistency, even though there are
no problems in casts for non-text types.
I just reworked some of this last night, so I'm not sure which version you are referring to (new version has a pgplugin/jsoncast.c source file). I was basically circling around the same thing as you trying to find something that felt natural and not confusing. I agree that we don't have much of a choice to keep in/out functions as parse/serialize and that then introducing casts that do differently creates confusion. When I was playing with it, I was getting confused, and I wrote it. :)
An alternative to pg casting to extract postgres values could be to introduce analogs to JavaScript constructors, which is the JavaScript way to cast. For example: String(json), Number(json), Date(json). This would feel natural to a JavaScript programmer and would be explicit and non-surprising:
A. SELECT String('{a: 1, b:2}'::json -> 'a') (Returns postgres text)
B. SELECT Number('1'::json) (Returns postgres decimal)
I think that the most common use case for this type of thing in the DB will be to extract a JSON scalar as a postgres scalar.
The inverse, while probably less useful, is currently represented by the json_from_* functions. We could collapse all of these down to one overloaded function, say ToJson(...):
A. SELECT ToJson(1) (Would return a json type with an int32 "1" value)
B. SELECT ToJson('Some String') (Would return a json type with a string value)
There might be some syntactic magic we could do by adding an intermediate jsonscalar type, but based on trying real cases with this stuff, you always end up having to be explicit about your conversions anyway. Having implicit type coercion from this polymorphic type tends to make things confusing, imo.
I'll list issues before we start json types even in the simplest cases:
----
1. where to implement json core: external library vs. inner postgres
2. internal format: text vs. binary (*)
3. encoding: always UTF-8 vs. database encoding (*)
4. meaning of casts text to/from json: parse/stringify vs. get/set
5. parser implementation: flex/bison vs. hand-coded.
----
(*) Note that we would have comparison two json values in the future. So,
we might need to normalize the internal format even in text representation.
The most interesting parts of json types, including indexing and jsonpath,
would be made on the json core. We need conclusions about those issues.
My opinions or ramblings on the above:
1. There's a fair bit of code involved for something that many are going to gloss over. I can think of pros/cons for external/internal/contrib and I'm not sure which I would choose.
2. I'm definitely in the binary camp, but part of the reason for building it out was to try it with some real world cases to get a feel for performance implications end to end. We make heavy use of MongoDB at the office and I was thinking it might make sense to strip some of those cases down and see how they would be implemented in this context. I'll write up more thoughts on how I think text/binary should perform for various cases tonight.
3. I think if we go with binary, we should always store UTF-8 in the binary structure. Otherwise, we just have too much of the guts of the binary left to the whim of the database encoding. As currently implemented, all strings generated by the in/out functions should be escaped so that they are pure ascii (not quite working, but there in theory). JSON is by definition UTF-8, and in this case, I think it trumps database encoding.
4. My thoughts on the casts are above.
5. There seems to be a lot of runtime and code size overhead inherent in the flex/bison parsers, especially considering that they will most frequently be invoked for very small streams. Writing a good hand-coded parser for comparison is just a matter of which bottle of wine to choose prior to spending the hours coding it, and I would probably defer the decision until later.
Terry
On Sat, Oct 16, 2010 at 12:59 PM, Terry Laurenzo <tj@laurenzo.org> wrote: > - It is directly iterable without parsing and/or constructing an AST > - It is its own representation. If iterating and you want to tear-off a > value to be returned or used elsewhere, its a simple buffer copy plus some > bit twiddling. > - It is conceivable that clients already know how to deal with BSON, > allowing them to work with the internal form directly (ala MongoDB) > - It stores a wider range of primitive types than JSON-text. The most > important are Date and binary. When last I looked at that, it appeared to me that what BSON could represent was a subset of what JSON could represent - in particular, that it had things like a 32-bit limit on integers, or something along those lines. Sounds like it may be neither a superset nor a subset, in which case I think it's a poor choice for an internal representation of JSON. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 10/19/2010 10:44 AM, Robert Haas wrote: > On Sat, Oct 16, 2010 at 12:59 PM, Terry Laurenzo<tj@laurenzo.org> wrote: >> - It is directly iterable without parsing and/or constructing an AST >> - It is its own representation. If iterating and you want to tear-off a >> value to be returned or used elsewhere, its a simple buffer copy plus some >> bit twiddling. >> - It is conceivable that clients already know how to deal with BSON, >> allowing them to work with the internal form directly (ala MongoDB) >> - It stores a wider range of primitive types than JSON-text. The most >> important are Date and binary. > When last I looked at that, it appeared to me that what BSON could > represent was a subset of what JSON could represent - in particular, > that it had things like a 32-bit limit on integers, or something along > those lines. Sounds like it may be neither a superset nor a subset, > in which case I think it's a poor choice for an internal > representation of JSON. Yeah, if it can't handle arbitrary precision numbers as has previously been stated it's dead in the water for our purposes, I think. cheers andrew
			
				Agreed.  BSON was born out of implementations that either lacked arbitrary precision numbers or had a strong affinity to an int/floating point way of thinking about numbers.  I believe that if BSON had an arbitrary precision number type, it would be a proper superset of JSON.
As an aside, the max range of an int in BSON 64bits. Back to my original comment that BSON was "grown" instead of designed, it looks like both the 32bit and 64bit integers were added late in the game and that the original designers perhaps were just going to store all numbers as double.
Perhaps we should enumerate the attributes of what would make a good binary encoding?
Terry
			
		
		
	As an aside, the max range of an int in BSON 64bits. Back to my original comment that BSON was "grown" instead of designed, it looks like both the 32bit and 64bit integers were added late in the game and that the original designers perhaps were just going to store all numbers as double.
Perhaps we should enumerate the attributes of what would make a good binary encoding?
Terry
On Tue, Oct 19, 2010 at 8:57 AM, Andrew Dunstan <andrew@dunslane.net> wrote:
Yeah, if it can't handle arbitrary precision numbers as has previously been stated it's dead in the water for our purposes, I think.
On 10/19/2010 10:44 AM, Robert Haas wrote:On Sat, Oct 16, 2010 at 12:59 PM, Terry Laurenzo<tj@laurenzo.org> wrote:- It is directly iterable without parsing and/or constructing an ASTWhen last I looked at that, it appeared to me that what BSON could
- It is its own representation. If iterating and you want to tear-off a
value to be returned or used elsewhere, its a simple buffer copy plus some
bit twiddling.
- It is conceivable that clients already know how to deal with BSON,
allowing them to work with the internal form directly (ala MongoDB)
- It stores a wider range of primitive types than JSON-text. The most
important are Date and binary.
represent was a subset of what JSON could represent - in particular,
that it had things like a 32-bit limit on integers, or something along
those lines. Sounds like it may be neither a superset nor a subset,
in which case I think it's a poor choice for an internal
representation of JSON.
cheers
andrew
On Tue, Oct 19, 2010 at 11:22 AM, Terry Laurenzo <tj@laurenzo.org> wrote: > Agreed. BSON was born out of implementations that either lacked arbitrary > precision numbers or had a strong affinity to an int/floating point way of > thinking about numbers. I believe that if BSON had an arbitrary precision > number type, it would be a proper superset of JSON. > > As an aside, the max range of an int in BSON 64bits. Back to my original > comment that BSON was "grown" instead of designed, it looks like both the > 32bit and 64bit integers were added late in the game and that the original > designers perhaps were just going to store all numbers as double. > > Perhaps we should enumerate the attributes of what would make a good binary > encoding? I think we should take a few steps back and ask why we think that binary encoding is the way to go. We store XML as text, for example, and I can't remember any complaints about that on -bugs or -performance, so why do we think JSON will be different? Binary encoding is a trade-off. A well-designed binary encoding should make it quicker to extract a small chunk of a large JSON object and return it; however, it will also make it slower to return the whole object (because you're adding serialization overhead). I haven't seen any analysis of which of those use cases is more important and why. I am also wondering how this proposed binary encoding scheme will interact with TOAST. If the datum is compressed on disk, you'll have to decompress it anyway to do anything with it; at that point, is there still going to be a noticeable speed-up from using the binary encoding? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Oct 19, 2010, at 12:17 PM, Robert Haas wrote: > I think we should take a few steps back and ask why we think that > binary encoding is the way to go. We store XML as text, for example, > and I can't remember any complaints about that on -bugs or > -performance, so why do we think JSON will be different? Binary > encoding is a trade-off. A well-designed binary encoding should make > it quicker to extract a small chunk of a large JSON object and return > it; however, it will also make it slower to return the whole object > (because you're adding serialization overhead). I haven't seen any > analysis of which of those use cases is more important and why. Maybe someone has numbers on that for the XML type? Best, David
On Tue, Oct 19, 2010 at 11:22 AM, Terry Laurenzo <tj@laurenzo.org> wrote: > Perhaps we should enumerate the attributes of what would make a good binary > encoding? Not sure if we're discussing the internal storage format or the binary send/recv format, but in my humble opinion, some attributes of a good internal format are: 1. Lightweight - it'd be really nice for the JSON datatype to be available in core (even if extra features like JSONPath aren't). 2. Efficiency - Retrieval and storage of JSON datums should be efficient. The internal format should probably closely resemble the binary send/recv format so there's a good reason to use it. A good attribute of the binary send/recv format would be compatibility. For instance, if MongoDB (which I know very little about) has binary send/receive, perhaps the JSON data type's binary send/receive should use it. Efficient retrieval and update of values in a large JSON tree would be cool, but would be rather complex, and IMHO, overkill. JSON's main advantage is that it's sort of a least common denominator of the type systems of many popular languages, making it easy to transfer information between them. Having hierarchical key/value store support would be pretty cool, but I don't think it's what PostgreSQL's JSON data type should do. On Tue, Oct 19, 2010 at 3:17 PM, Robert Haas <robertmhaas@gmail.com> wrote: > I think we should take a few steps back and ask why we think that > binary encoding is the way to go. We store XML as text, for example, > and I can't remember any complaints about that on -bugs or > -performance, so why do we think JSON will be different? Binary > encoding is a trade-off. A well-designed binary encoding should make > it quicker to extract a small chunk of a large JSON object and return > it; however, it will also make it slower to return the whole object > (because you're adding serialization overhead). I haven't seen any > analysis of which of those use cases is more important and why. Speculation: the overhead involved with retrieving/sending and receiving/storing JSON (not to mention TOAST compression/decompression) will be far greater than that of serializing/unserializing.
On Tue, Oct 19, 2010 at 3:40 PM, Joseph Adams <joeyadams3.14159@gmail.com> wrote: > On Tue, Oct 19, 2010 at 3:17 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> I think we should take a few steps back and ask why we think that >> binary encoding is the way to go. We store XML as text, for example, >> and I can't remember any complaints about that on -bugs or >> -performance, so why do we think JSON will be different? Binary >> encoding is a trade-off. A well-designed binary encoding should make >> it quicker to extract a small chunk of a large JSON object and return >> it; however, it will also make it slower to return the whole object >> (because you're adding serialization overhead). I haven't seen any >> analysis of which of those use cases is more important and why. > > Speculation: the overhead involved with retrieving/sending and > receiving/storing JSON (not to mention TOAST > compression/decompression) will be far greater than that of > serializing/unserializing. I speculate that your speculation is incorrect. AIUI, we, unlike $COMPETITOR, tend to be CPU-bound rather than IO-bound on COPY. But perhaps less speculation and more benchmarking is in order. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Oct 19, 2010 at 2:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:
After spending a week in the morass of this, I have to say that I am less certain than I was on any front regarding the text/binary distinction. I'll take some time and benchmark different cases. My hypothesis is that a well implemented binary structure and conversions will add minimal overhead in the IO + Validate case which would be the typical in/out flow. It could be substantially faster for binary send/receive because the validation step could be eliminated/reduced. Further storing as binary reduces the overhead of random access to the data by database functions.
I'm envisioning staging this up as follows:
1. Create a "jsontext". jsontext uses text as its internal representation. in/out functions are essentially a straight copy or a copy + validate.
2. Create a "jsonbinary" type. This uses an optimized binary format for internal rep and send/receive. in/out is a parse/transcode operation to standard JSON text.
3. Internal data access functions and JSON Path require a jsonbinary.
4. There are implicit casts to/from jsontext and jsonbinary.
I've got a grammar in mind for the binary structure that I'll share later when I've got some more time. It's inspired by $COMPETITOR's format but a little more sane, using type tags that implicitly define the size of the operands, simplifying parsing.
I'll then define the various use cases and benchmark using the different types. Some examples include such as IO No Validate, IO+Validate, Store and Index, Internal Processing, Internal Composition, etc.
The answer may be to have both a jsontext and jsonbinary type as each will be optimized for a different case.
Make sense? It may be a week before I get through this.
Terry
			
		On Tue, Oct 19, 2010 at 3:40 PM, Joseph AdamsI speculate that your speculation is incorrect. AIUI, we, unlike
<joeyadams3.14159@gmail.com> wrote:
> On Tue, Oct 19, 2010 at 3:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I think we should take a few steps back and ask why we think that
>> binary encoding is the way to go. We store XML as text, for example,
>> and I can't remember any complaints about that on -bugs or
>> -performance, so why do we think JSON will be different? Binary
>> encoding is a trade-off. A well-designed binary encoding should make
>> it quicker to extract a small chunk of a large JSON object and return
>> it; however, it will also make it slower to return the whole object
>> (because you're adding serialization overhead). I haven't seen any
>> analysis of which of those use cases is more important and why.
>
> Speculation: the overhead involved with retrieving/sending and
> receiving/storing JSON (not to mention TOAST
> compression/decompression) will be far greater than that of
> serializing/unserializing.
$COMPETITOR, tend to be CPU-bound rather than IO-bound on COPY. But
perhaps less speculation and more benchmarking is in order.
After spending a week in the morass of this, I have to say that I am less certain than I was on any front regarding the text/binary distinction. I'll take some time and benchmark different cases. My hypothesis is that a well implemented binary structure and conversions will add minimal overhead in the IO + Validate case which would be the typical in/out flow. It could be substantially faster for binary send/receive because the validation step could be eliminated/reduced. Further storing as binary reduces the overhead of random access to the data by database functions.
I'm envisioning staging this up as follows:
1. Create a "jsontext". jsontext uses text as its internal representation. in/out functions are essentially a straight copy or a copy + validate.
2. Create a "jsonbinary" type. This uses an optimized binary format for internal rep and send/receive. in/out is a parse/transcode operation to standard JSON text.
3. Internal data access functions and JSON Path require a jsonbinary.
4. There are implicit casts to/from jsontext and jsonbinary.
I've got a grammar in mind for the binary structure that I'll share later when I've got some more time. It's inspired by $COMPETITOR's format but a little more sane, using type tags that implicitly define the size of the operands, simplifying parsing.
I'll then define the various use cases and benchmark using the different types. Some examples include such as IO No Validate, IO+Validate, Store and Index, Internal Processing, Internal Composition, etc.
The answer may be to have both a jsontext and jsonbinary type as each will be optimized for a different case.
Make sense? It may be a week before I get through this.
Terry
On Tue, Oct 19, 2010 at 12:17 PM, Robert Haas <robertmhaas@gmail.com> wrote: > I think we should take a few steps back and ask why we think that > binary encoding is the way to go. We store XML as text, for example, > and I can't remember any complaints about that on -bugs or > -performance, so why do we think JSON will be different? Binary > encoding is a trade-off. A well-designed binary encoding should make > it quicker to extract a small chunk of a large JSON object and return > it; however, it will also make it slower to return the whole object > (because you're adding serialization overhead). I haven't seen any > analysis of which of those use cases is more important and why. > The elephant in the room is if the binary encoded form is smaller then it occupies less ram and disk bandwidth to copy it around. If your database is large that alone is the dominant factor. Doubling the size of all the objects in your database means halving the portion of the database that fits in RAM and doubling the amount of I/O required to complete any given operation. If your database fits entirely in RAM either way then it still means less RAM bandwidth used which is often the limiting factor but depending on how much cpu effort it takes to serialize and deserialize the balance could shift either way. -- greg
> > After spending a week in the morass of this, I have to say that I am less > certain than I was on any front regarding the text/binary distinction. I'll > take some time and benchmark different cases. My hypothesis is that a well > implemented binary structure and conversions will add minimal overhead in > the IO + Validate case which would be the typical in/out flow. It could be > substantially faster for binary send/receive because the validation step > could be eliminated/reduced. Further storing as binary reduces the overhead > of random access to the data by database functions. > > I'm envisioning staging this up as follows: > 1. Create a "jsontext". jsontext uses text as its internal > representation. in/out functions are essentially a straight copy or a copy > + validate. > 2. Create a "jsonbinary" type. This uses an optimized binary format for > internal rep and send/receive. in/out is a parse/transcode operation to > standard JSON text. > 3. Internal data access functions and JSON Path require a jsonbinary. > 4. There are implicit casts to/from jsontext and jsonbinary. some years ago I solved similar problems with xml type. I think, so you have to calculate with two factors: a) all varlena types are compressed - you cannot to get some interesting fragment or you cannot tu update some interesting fragment, every time pg working with complete document b) access to some fragment of JSON or XML document are not really important, because fast access to data are solved via indexes. c) only a few API allows binary communication between server/client. Almost all interfases use only text based API. I see some possible interesting direction for binary protocol when some one uses a javascript driver, when some one use pg in some javascript server side environment, but it isn't a often used now. Regards Pavel > > I've got a grammar in mind for the binary structure that I'll share later > when I've got some more time. It's inspired by $COMPETITOR's format but a > little more sane, using type tags that implicitly define the size of the > operands, simplifying parsing. > > I'll then define the various use cases and benchmark using the different > types. Some examples include such as IO No Validate, IO+Validate, Store and > Index, Internal Processing, Internal Composition, etc. > > The answer may be to have both a jsontext and jsonbinary type as each will > be optimized for a different case. > > Make sense? It may be a week before I get through this. > Terry > >
2010/10/19 Greg Stark <gsstark@mit.edu>: > On Tue, Oct 19, 2010 at 12:17 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> I think we should take a few steps back and ask why we think that >> binary encoding is the way to go. We store XML as text, for example, >> and I can't remember any complaints about that on -bugs or >> -performance, so why do we think JSON will be different? Binary >> encoding is a trade-off. A well-designed binary encoding should make >> it quicker to extract a small chunk of a large JSON object and return >> it; however, it will also make it slower to return the whole object >> (because you're adding serialization overhead). I haven't seen any >> analysis of which of those use cases is more important and why. >> > > The elephant in the room is if the binary encoded form is smaller then > it occupies less ram and disk bandwidth to copy it around. If your > database is large that alone is the dominant factor. Doubling the size > of all the objects in your database means halving the portion of the > database that fits in RAM and doubling the amount of I/O required to > complete any given operation. If your database fits entirely in RAM > either way then it still means less RAM bandwidth used which is often > the limiting factor but depending on how much cpu effort it takes to > serialize and deserialize the balance could shift either way. I am not sure, if this argument is important for json. This protocol has not big overhead and json documents are pretty small. More - from 9.0 TOAST uses a relative aggresive compression. I would to like a some standardised format for json inside pg too, but without using a some external library I don't see a advantages to use a other format then text. Regards Pavel > > > > > -- > greg > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers >
Terry Laurenzo <tj@laurenzo.org> writes:
> After spending a week in the morass of this, I have to say that I am less
> certain than I was on any front regarding the text/binary distinction.  I'll
> take some time and benchmark different cases.  My hypothesis is that a well
> implemented binary structure and conversions will add minimal overhead in
> the IO + Validate case which would be the typical in/out flow.  It could be
> substantially faster for binary send/receive because the validation step
> could be eliminated/reduced.
I think that arguments proceeding from speed of binary send/receive
aren't worth the electrons they're written on, because there is nothing
anywhere that says what the binary format ought to be.  In the case of
XML we're just using the text representation as the binary format too,
and nobody's complained about that.  If we were to choose to stick with
straight text internally for a JSON type, we'd do the same thing, and
again nobody would complain.
So, if you want to make a case for using some binary internal format or
other, make it without that consideration.
> I'm envisioning staging this up as follows:
>    1. Create a "jsontext".  jsontext uses text as its internal
> representation.  in/out functions are essentially a straight copy or a copy
> + validate.
>    2. Create a "jsonbinary" type.  This uses an optimized binary format for
> internal rep and send/receive.  in/out is a parse/transcode operation to
> standard JSON text.
Ugh.  Please don't.  JSON should be JSON, and nothing else.  Do you see
any other datatypes in Postgres that expose such internal considerations?
        regards, tom lane
			
		Greg Stark <gsstark@mit.edu> writes:
> The elephant in the room is if the binary encoded form is smaller then
> it occupies less ram and disk bandwidth to copy it around.
It seems equally likely that a binary-encoded form could be larger
than the text form (that's often true for our other datatypes).
Again, this is an argument that would require experimental evidence
to back it up.
        regards, tom lane
			
		<div class="gmail_quote">On Tue, Oct 19, 2010 at 4:51 PM, Tom Lane <span dir="ltr"><<a href="mailto:tgl@sss.pgh.pa.us">tgl@sss.pgh.pa.us</a>></span>wrote:<br /><blockquote class="gmail_quote" style="border-left:1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div class="im">Terry Laurenzo<<a href="mailto:tj@laurenzo.org">tj@laurenzo.org</a>> writes:<br /> > After spending a week in the morassof this, I have to say that I am less<br /> > certain than I was on any front regarding the text/binary distinction. I'll<br /> > take some time and benchmark different cases. My hypothesis is that a well<br /> > implementedbinary structure and conversions will add minimal overhead in<br /> > the IO + Validate case which would bethe typical in/out flow. It could be<br /> > substantially faster for binary send/receive because the validation step<br/> > could be eliminated/reduced.<br /><br /></div>I think that arguments proceeding from speed of binary send/receive<br/> aren't worth the electrons they're written on, because there is nothing<br /> anywhere that says what thebinary format ought to be. In the case of<br /> XML we're just using the text representation as the binary format too,<br/> and nobody's complained about that. If we were to choose to stick with<br /> straight text internally for a JSONtype, we'd do the same thing, and<br /> again nobody would complain.<br /><br /> So, if you want to make a case for usingsome binary internal format or<br /> other, make it without that consideration.<br /><div class="im"><br /> > I'menvisioning staging this up as follows:<br /> > 1. Create a "jsontext". jsontext uses text as its internal<br />> representation. in/out functions are essentially a straight copy or a copy<br /> > + validate.<br /> > 2.Create a "jsonbinary" type. This uses an optimized binary format for<br /> > internal rep and send/receive. in/outis a parse/transcode operation to<br /> > standard JSON text.<br /><br /></div>Ugh. Please don't. JSON shouldbe JSON, and nothing else. Do you see<br /> any other datatypes in Postgres that expose such internal considerations?<br/><br /> regards, tom lane<br /></blockquote></div><br />I don't think anyone herewas really presenting arguments as yet. We're several layers deep on speculation and everyone is saying that experimentationis needed.<br /><br />I've got my own reasons for going down this path for a solution I have in mind. I hadthought that some part of that might have been applicable to pg core, but if not, that's no problem. For my own edification,I'm going to proceed down this path and see where it leads. I'll let the list know what I find out.<br /><br/>I can understand the sentiment that JSON should be JSON and nothing else from a traditional database server's pointof view, but there is nothing sacrosanct about it in the broader context.<br /><br />Terry<br /><br />
Excerpts from David E. Wheeler's message of mar oct 19 16:36:20 -0300 2010: > On Oct 19, 2010, at 12:17 PM, Robert Haas wrote: > > > I think we should take a few steps back and ask why we think that > > binary encoding is the way to go. We store XML as text, for example, > > and I can't remember any complaints about that on -bugs or > > -performance, so why do we think JSON will be different? Binary > > encoding is a trade-off. A well-designed binary encoding should make > > it quicker to extract a small chunk of a large JSON object and return > > it; however, it will also make it slower to return the whole object > > (because you're adding serialization overhead). I haven't seen any > > analysis of which of those use cases is more important and why. > > Maybe someone has numbers on that for the XML type? Like these? http://exificient.sourceforge.net/?id=performance -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Tue, Oct 19, 2010 at 6:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <gsstark@mit.edu> writes:
>> The elephant in the room is if the binary encoded form is smaller then
>> it occupies less ram and disk bandwidth to copy it around.
>
> It seems equally likely that a binary-encoded form could be larger
> than the text form (that's often true for our other datatypes).
> Again, this is an argument that would require experimental evidence
> to back it up.
That's exactly what I was thinking when I read Greg's email.  I
designed something vaguely (very vaguely) like this many years ago and
the binary format that I worked so hard to create was enormous
compared to the text format, mostly because I had a lot of small
integers in the data I was serializing, and as it turns out,
representing {0,1,2} in less than 7 bytes is not very easy.  It can
certainly be done if you set out to optimize for precisely those kinds
of cases, but I ended up with something awful like:
<4 byte type = list> <4 byte list length = 3> <4 byte type = integer>
<4 byte integer = 0> <4 byte type = integer> <4 byte integer = 1> <4
byte type = integer> <4 byte integer = 2>
= 32 bytes.  Even if you were a little smarter than I was and used 2
byte integers (with some escape hatch allowing larger numbers to be
represented) it's still more than twice the size of the text
representation.  Even if you use 1 byte integers it's still bigger.
To get it down to being smaller, you've got to do something like make
the high nibble of each byte a type field and the low nibble the first
4 payload bits.  You can certainly do all of this but you could also
just store it as text and let the TOAST compression algorithm worry
about making it smaller.
-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
			
		I hear ya.  It might be a premature optimization but I still think there may be benefit for the case of large scale
extractionand in- database transformation of large JSON datastructures.  We have terabytes of this stuff and I'd like
somethingbetween the hip nosql options and a fully structured SQL datastore. 
Terry
Sent from my iPhone
On Oct 19, 2010, at 6:36 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Oct 19, 2010 at 6:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Greg Stark <gsstark@mit.edu> writes:
>>> The elephant in the room is if the binary encoded form is smaller then
>>> it occupies less ram and disk bandwidth to copy it around.
>>
>> It seems equally likely that a binary-encoded form could be larger
>> than the text form (that's often true for our other datatypes).
>> Again, this is an argument that would require experimental evidence
>> to back it up.
>
> That's exactly what I was thinking when I read Greg's email.  I
> designed something vaguely (very vaguely) like this many years ago and
> the binary format that I worked so hard to create was enormous
> compared to the text format, mostly because I had a lot of small
> integers in the data I was serializing, and as it turns out,
> representing {0,1,2} in less than 7 bytes is not very easy.  It can
> certainly be done if you set out to optimize for precisely those kinds
> of cases, but I ended up with something awful like:
>
> <4 byte type = list> <4 byte list length = 3> <4 byte type = integer>
> <4 byte integer = 0> <4 byte type = integer> <4 byte integer = 1> <4
> byte type = integer> <4 byte integer = 2>
>
> = 32 bytes.  Even if you were a little smarter than I was and used 2
> byte integers (with some escape hatch allowing larger numbers to be
> represented) it's still more than twice the size of the text
> representation.  Even if you use 1 byte integers it's still bigger.
> To get it down to being smaller, you've got to do something like make
> the high nibble of each byte a type field and the low nibble the first
> 4 payload bits.  You can certainly do all of this but you could also
> just store it as text and let the TOAST compression algorithm worry
> about making it smaller.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
			
		On Tue, Oct 19, 2010 at 9:15 PM, Terri Laurenzo <tj@laurenzo.org> wrote: > I hear ya. It might be a premature optimization but I still think there may be benefit for the case of large scale extractionand in- database transformation of large JSON datastructures. We have terabytes of this stuff and I'd like somethingbetween the hip nosql options and a fully structured SQL datastore. Well, keep hacking on it! I'm interested to hear what you find out. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Oct 20, 2010 at 6:39 AM, Terry Laurenzo <tj@laurenzo.org> wrote: > The answer may be to have both a jsontext and jsonbinary type as each will > be optimized for a different case. I want to choose one format for JSON rather than having two types. It should be more efficient than other format in many cases, and not so bad in other cases. I think the discussion was started with "BSON could represent was a subset of what JSON could represent". So, any binary format could be acceptable that have enough representational power compared with text format. For example, a sequence of <byte-length> <text> could reduce CPU cycles for reparsing and hold all of the input as-is except ignorable white-spaces. It is not a BSON, but is a binary format. Or, if we want to store numbers in binary form, I think the format will be numeric type in postgres. It has high precision, and we don't need any higher precision than it to compare two numbers eventually. Even if we use BSON format, we need to extend it to store all of numeric values, that precision is 10^1000. -- Itagaki Takahiro
Good points. In addition, any binary format needs to support object property traversal without having to do a deep scanof all descendants. BSON handles this with explicit lengths for document types (objects and arrays) so that entire partsof the tree can be skipped during sibling traversal.<br /><br />It would also be nice to make sure that we store fullyparsed strings. There are lots of escape options that simply do not need to be preserved (c escapes, unicode, octal,hex sequences) and hinder the ability to do direct comparisons. BSON also makes a small extra effort to ensure thatobject property names are encoded in a way that is easily comparable, as this will be the most frequently compared items.<br/><br />I'm still going to write up a proposed grammar that takes these items into account - just ran out of timetonight.<br /><br />Terry<br /><br /><div class="gmail_quote">On Wed, Oct 20, 2010 at 12:46 AM, Itagaki Takahiro <spandir="ltr"><<a href="mailto:itagaki.takahiro@gmail.com">itagaki.takahiro@gmail.com</a>></span> wrote:<br /><blockquoteclass="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left:1ex;"><div class="im">On Wed, Oct 20, 2010 at 6:39 AM, Terry Laurenzo <<a href="mailto:tj@laurenzo.org">tj@laurenzo.org</a>>wrote:<br /> > The answer may be to have both a jsontext and jsonbinarytype as each will<br /> > be optimized for a different case.<br /><br /></div>I want to choose one format forJSON rather than having two types.<br /> It should be more efficient than other format in many cases,<br /> and not sobad in other cases.<br /><br /> I think the discussion was started with<br /> "BSON could represent was a subset of whatJSON could represent".<br /> So, any binary format could be acceptable that have enough<br /> representational powercompared with text format.<br /><br /> For example, a sequence of <byte-length> <text> could reduce<br />CPU cycles for reparsing and hold all of the input as-is except<br /> ignorable white-spaces. It is not a BSON, but isa binary format.<br /><br /> Or, if we want to store numbers in binary form, I think the<br /> format will be numeric typein postgres. It has high precision,<br /> and we don't need any higher precision than it to compare two<br /> numberseventually. Even if we use BSON format, we need to extend<br /> it to store all of numeric values, that precisionis 10^1000.<br /><br /> --<br /><font color="#888888">Itagaki Takahiro<br /></font></blockquote></div><br />
* Terry Laurenzo: > Agreed. BSON was born out of implementations that either lacked > arbitrary precision numbers or had a strong affinity to an > int/floating point way of thinking about numbers. I believe that if > BSON had an arbitrary precision number type, it would be a proper > superset of JSON. But JSON has only double-precision numbers!?
On 10/20/2010 01:15 PM, Florian Weimer wrote: > * Terry Laurenzo: > >> Agreed. BSON was born out of implementations that either lacked >> arbitrary precision numbers or had a strong affinity to an >> int/floating point way of thinking about numbers. I believe that if >> BSON had an arbitrary precision number type, it would be a proper >> superset of JSON. > But JSON has only double-precision numbers!? AFAICT the JSON spec says nothing at all about the precision of numbers. It just provides a syntax for them. We should not confuse what can be allowed in JSON with what can be handled by some consumers of JSON such as ECMAScript. However, since we would quite reasonably require that any JSON implementation be able to handle arbitrary precision numbers, that apparently rules out BSON as a storage engine for it, since BSON can not handle such things. cheers andrew
I'm still going to write up a proposed grammar that takes these items into account - just ran out of time tonight.
The binary format I was thinking of is here:
 This was just a quick brain dump and I haven't done a lot of diligence on verifying it, but I think it should be more compact than most JSON text payloads and quick to iterate over/update in sibling traversal order vs depth-first traversal which is what we would get out of JSON text.
Thoughts?
Terry
On Sat, Oct 23, 2010 at 8:33 PM, Terry Laurenzo <tj@laurenzo.org> wrote: >> >> I'm still going to write up a proposed grammar that takes these items into >> account - just ran out of time tonight. >> > > The binary format I was thinking of is here: > > http://github.com/tlaurenzo/pgjson/blob/master/pgjson/shared/include/json/jsonbinary.h > This was just a quick brain dump and I haven't done a lot of diligence on > verifying it, but I think it should be more compact than most JSON text > payloads and quick to iterate over/update in sibling traversal order vs > depth-first traversal which is what we would get out of JSON text. > Thoughts? > Terry It doesn't do particularly well on my previous example of [1,2,3]. It comes out slightly shorter on ["a","b","c"] and better if the strings need any escaping. I don't think the float4 and float8 formats are very useful; how could you be sure that the output was going to look the same as the input? Or alternatively that dumping a particular object to text and reloading it will produce the same internal representation? I think it would be simpler to represent integers using a string of digits; that way you can be assured of going from text -> binary -> text without change. Perhaps it would be enough to define the high two bits as follows: 00 = array, 01 = object, 10 = string, 11 = number/true/false/null. The next 2 bits specify how the length is stored. 00 = remaining 4 bits store a length of up to 15 bytes, 01 = remaining 4 bits + 1 additional byte store a 12-bit length of up to 4K, 10 = remaining 4 bits + 2 additional bytes store a 20-bit length of up to 1MB, 11 = 4 additional bytes store a full 32-bit length word. Then, the array, object, and string representations can work as you've specified them. Anything else can be represented by itself, or perhaps we should say that numbers represent themselves and true/false/null are represented by a 1-byte sequence, t/f/n (or perhaps we could define 111100{00,01,10} to mean those values, since there's no obvious reason for the low bits to be non-zero if a 4-bit length word ostensibly follows). So [1,2,3] = 06 C1 '1' C1 '2' C1 '3' and ["a","b","c"] = 06 81 'a' 81 'b' 81 'c' (I am still worried about the serialization/deserialization overhead but that's a different issue.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
It doesn't do particularly well on my previous example of [1,2,3]. It
comes out slightly shorter on ["a","b","c"] and better if the strings
need any escaping. I don't think the float4 and float8 formats are
very useful; how could you be sure that the output was going to look
the same as the input? Or alternatively that dumping a particular
object to text and reloading it will produce the same internal
representation? I think it would be simpler to represent integers
using a string of digits; that way you can be assured of going from
text -> binary -> text without change.
Perhaps it would be enough to define the high two bits as follows: 00
= array, 01 = object, 10 = string, 11 = number/true/false/null. The
next 2 bits specify how the length is stored. 00 = remaining 4 bits
store a length of up to 15 bytes, 01 = remaining 4 bits + 1 additional
byte store a 12-bit length of up to 4K, 10 = remaining 4 bits + 2
additional bytes store a 20-bit length of up to 1MB, 11 = 4 additional
bytes store a full 32-bit length word. Then, the array, object, and
string representations can work as you've specified them. Anything
else can be represented by itself, or perhaps we should say that
numbers represent themselves and true/false/null are represented by a
1-byte sequence, t/f/n (or perhaps we could define 111100{00,01,10} to
mean those values, since there's no obvious reason for the low bits to
be non-zero if a 4-bit length word ostensibly follows).
So [1,2,3] = 06 C1 '1' C1 '2' C1 '3' and ["a","b","c"] = 06 81 'a' 81 'b' 81 'c'
(I am still worried about the serialization/deserialization overhead
but that's a different issue.)
Thanks.  I'll play around with the bit and numeric encodings you've recommended.  Arrays are certainly the toughest to economize on as a text encoding has minimum of 2 + n - 1 overhead bytes.  Text encoding for objects has some more wiggle room with 2 + (n-1) + (n*3) extra bytes.  I was admittedly thinking about more complicated objects.
I'm still worried about transcoding overhead as well.  If comparing to a simple blind storage of JSON text with no validation or normalization, there is obviously no way to beat a straight copy.  However, its not outside the realm of reason to think that it may be possible to match or beat the clock if comparing against text to text normalization, or perhaps adding slight overhead to a validator.  The advantage to the binary structure is the ability to scan it hierarchically in sibling order and provide mutation operations with simple memcpy's as opposed to parse_to_ast -> modify_ast -> serialize_ast.
Terry
On Sun, Oct 24, 2010 at 2:21 AM, Terry Laurenzo <tj@laurenzo.org> wrote:
>> It doesn't do particularly well on my previous example of [1,2,3].  It
>> comes out slightly shorter on ["a","b","c"] and better if the strings
>> need any escaping.  I don't think the float4 and float8 formats are
>> very useful; how could you be sure that the output was going to look
>> the same as the input?  Or alternatively that dumping a particular
>> object to text and reloading it will produce the same internal
>> representation?  I think it would be simpler to represent integers
>> using a string of digits; that way you can be assured of going from
>> text -> binary -> text without change.
>>
>> Perhaps it would be enough to define the high two bits as follows: 00
>> = array, 01 = object, 10 = string, 11 = number/true/false/null.  The
>> next 2 bits specify how the length is stored.  00 = remaining 4 bits
>> store a length of up to 15 bytes, 01 = remaining 4 bits + 1 additional
>> byte store a 12-bit length of up to 4K, 10 = remaining 4 bits + 2
>> additional bytes store a 20-bit length of up to 1MB, 11 = 4 additional
>> bytes store a full 32-bit length word.  Then, the array, object, and
>> string representations can work as you've specified them.  Anything
>> else can be represented by itself, or perhaps we should say that
>> numbers represent themselves and true/false/null are represented by a
>> 1-byte sequence, t/f/n (or perhaps we could define 111100{00,01,10} to
>> mean those values, since there's no obvious reason for the low bits to
>> be non-zero if a 4-bit length word ostensibly follows).
>>
>> So [1,2,3] = 06 C1 '1' C1 '2' C1 '3' and ["a","b","c"] = 06 81 'a' 81 'b'
>> 81 'c'
>>
>> (I am still worried about the serialization/deserialization overhead
>> but that's a different issue.)
>>
>
> Thanks.  I'll play around with the bit and numeric encodings you've
> recommended.  Arrays are certainly the toughest to economize on as a text
> encoding has minimum of 2 + n - 1 overhead bytes.  Text encoding for objects
> has some more wiggle room with 2 + (n-1) + (n*3) extra bytes.  I was
> admittedly thinking about more complicated objects.
> I'm still worried about transcoding overhead as well.  If comparing to a
> simple blind storage of JSON text with no validation or normalization, there
> is obviously no way to beat a straight copy.  However, its not outside the
> realm of reason to think that it may be possible to match or beat the clock
> if comparing against text to text normalization, or perhaps adding slight
> overhead to a validator.  The advantage to the binary structure is the
> ability to scan it hierarchically in sibling order and provide mutation
> operations with simple memcpy's as opposed to parse_to_ast -> modify_ast ->
> serialize_ast.
Yeah, my concern is not whether the overhead will be zero; it's
whether it will be small, yet allow large gains on other operations.
Like, how much slower will it be to pull out a moderately complex 1MB
JSON blob (not just a big string) out of a single-row, single-column
table?  If it's 5% slower, that's probably OK, since this is a
reasonable approximation of a worst-case scenario.  If it's 50%
slower, that sounds painful.  It would also be worth testing with a
much smaller size, such as a 1K object with lots of internal
structure.  In both cases, all data cached in shared_buffers, etc.
Then on the flip side how do we do on val[37]["whatever"]?  You'd like
to hope that this will be significantly faster than the text encoding
on both large and small objects.  If it's not, there's probably not
much point.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
			
		Yeah, my concern is not whether the overhead will be zero; it's
whether it will be small, yet allow large gains on other operations.
Like, how much slower will it be to pull out a moderately complex 1MB
JSON blob (not just a big string) out of a single-row, single-column
table? If it's 5% slower, that's probably OK, since this is a
reasonable approximation of a worst-case scenario. If it's 50%
slower, that sounds painful. It would also be worth testing with a
much smaller size, such as a 1K object with lots of internal
structure. In both cases, all data cached in shared_buffers, etc.
Then on the flip side how do we do on val[37]["whatever"]? You'd like
to hope that this will be significantly faster than the text encoding
on both large and small objects. If it's not, there's probably not
much point.
We're on the same page.  I'm implementing the basic cases now and then will come up with some benchmarks.
 Terry 
Are there any activities in JSON data types for the next commitfest? I'd like to help you if you are working on the topic, or I can make up previous works and discussions by myself. On Mon, Oct 25, 2010 at 1:36 AM, Terry Laurenzo <tj@laurenzo.org> wrote: >> Yeah, my concern is not whether the overhead will be zero; it's >> whether it will be small, yet allow large gains on other operations. >> Like, how much slower will it be to pull out a moderately complex 1MB >> JSON blob (not just a big string) out of a single-row, single-column >> table? If it's 5% slower, that's probably OK, since this is a >> reasonable approximation of a worst-case scenario. If it's 50% >> slower, that sounds painful. It would also be worth testing with a >> much smaller size, such as a 1K object with lots of internal >> structure. In both cases, all data cached in shared_buffers, etc. >> >> Then on the flip side how do we do on val[37]["whatever"]? You'd like >> to hope that this will be significantly faster than the text encoding >> on both large and small objects. If it's not, there's probably not >> much point. > > We're on the same page. I'm implementing the basic cases now and then will > come up with some benchmarks. -- Itagaki Takahiro
On Mon, Nov 8, 2010 at 9:52 PM, Itagaki Takahiro <itagaki.takahiro@gmail.com> wrote: > Are there any activities in JSON data types for the next commitfest? I'm leaning toward the view that we shouldn't commit a JSON implementation to core (or contrib) for 9.1. We have at least three competing proposals on the table. I thought of picking it up and hacking on it myself, but then we'd have four competing proposals on the table. Even if we could come to some consensus on which of those proposals is technically superior, the rate at which new ideas are being proposed suggests to me that it would be premature to anoint any single implementation as our canonical one. I'd like to see some of these patches finished and put up on pgfoundry or github, and then consider moving one of them into core when we have a clear and stable consensus that one of them is technically awesome and the best thing we're going to get. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
			
				Robert,
			
		
		
	I think I agree.  At a minimum, I would like to see the "chosen" of the competing priorities live on as an outside module for use by previous versions.  Even having proposed one, and soon to be two of the competing implementations, it makes me nervous to commit to one at this juncture.
I'm wrapping some items up this week but expect to have some time over the next two weeks to complete my implementation.
Here's a quick status on where I'm at:
   - Binary format has been implemented as specified here: https://github.com/tlaurenzo/pgjson/blob/master/pgjson/jsonlib/BINARY-README.txt
   - Hand coded a JSON-text lexer/parser and JSON-binary parser and transcoders
   - Ran out of time to do exhaustive tests, but typical payloads yield significant space savings
   - Based on an admittedly small number of test cases, I identified that the process of encoding a string literal is the most expensive operation in the general case, accounting for 1/2 to 2/3 of the time spent in a transcoding operation.  This is fairly obvious but good to know.  I did not spend any time looking into this further.
   - Drastically simplified the code in preparation to build a stand-alone module
As soon as I get a bit of time I was going to do the following:
   - Create a simple PGXS based build, stripping out the rest of the bits I was doodling on
   - Re-implement the PG module based on the new jsonlib binary format and parser
   - Add JSONPath and some encoding bits back in from the original patch
   - Do some holistic profiling between the JSON-as-Text approach and the JSON-as-Binary approach
This is still a bit of a fishing expedition, imo and I would have a hard time getting this ready for commit on Monday.  If getting something in right now is critical, Joey's original patch is the most complete at this point.
Terry
On Tue, Nov 9, 2010 at 3:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Nov 8, 2010 at 9:52 PM, Itagaki Takahiro<itagaki.takahiro@gmail.com> wrote:> Are there any activities in JSON data types for the next commitfest?I'm leaning toward the view that we shouldn't commit a JSON
implementation to core (or contrib) for 9.1. We have at least three
competing proposals on the table. I thought of picking it up and
hacking on it myself, but then we'd have four competing proposals on
the table. Even if we could come to some consensus on which of those
proposals is technically superior, the rate at which new ideas are
being proposed suggests to me that it would be premature to anoint any
single implementation as our canonical one. I'd like to see some of
these patches finished and put up on pgfoundry or github, and then
consider moving one of them into core when we have a clear and stable
consensus that one of them is technically awesome and the best thing
we're going to get.--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
			
				I've got a new stripped down version of the binary json plugin on github: https://github.com/tlaurenzo/pgjson
			
		
		
	With all due warning of contrived benchmarks, I wrote some tests to see where things stand.   The test script is here: 
The results from my laptop (First gen Macbook Pro 32bit with Snow Leopard and Postgresql 9.0) are here and copied into this email at the end:
And for some commentary...
I copied the 5 sample documents from json.org's example section for these tests.  These are loaded into a table with a varchar column 1000 times each (so the test table has 5000 rows in it).  In all situations, the binary encoding was smaller than the normalized text form (between 9 and 23% smaller).  I think there are cases where the binary form will be larger than the corresponding text form, but I don't think they would be very common.
For the timings, various operations are performed on the 5000 row test table for 100 iterations.  In all situations, the query returns the Length of the transformed document instead of the document itself so as to factor out variable client IO between the test runs.  The Null Parse is essentially just a select from the table and therefore represents the baseline.  The times varied a little bit between runs but did not change materially.
What we see from this is that parsing JSON text and generating a binary representation is cheap, representing approximately 10% of the base case time.  Conversely, anything that involves generating JSON text is expensive, accounting for 30-40% of the base case time.  Some incidental profiling shows that while the entire operation is expensive, the process of generating string literals dominates this time.  There is likely room for optimization in this method, but it should be noted that most of these documents are lightly escaped (if escaped at all) which represents the happy path through the string literal output function.
While I have not profiled any advanced manipulation of the binary structure within the server, it stands to reason that manipulating the binary structure should be significantly faster than an approach that requires (perhaps multiple) transcoding between text representations in order to complete a sequence of operations.
Assuming that the JSON datatype (at a minimum) normalizes text for storage, then the text storage option accounts for about the most expensive path but with none of the benefits of an internal binary form (smaller size, ability to cheaply perform non-trivial manipulation within the database server).
Of course, just having a JSON datatype that blindly stores text will beat everything, but I'm getting closer to thinking that the binary option is worth the tradeoff.
Comments?
Terry
Running benchmark with 100 iterations per step
Loading data...
Data loaded.
=== DOCUMENT SIZE STATISTICS ===
    Document Name     | Original Size | Binary Size | Normalized Size | Percentage Savings 
----------------------+---------------+-------------+-----------------+--------------------
 jsonorg_sample1.json |           582 |         311 |             360 |   13.6111111111111
 jsonorg_sample2.json |           241 |         146 |             183 |   20.2185792349727
 jsonorg_sample3.json |           601 |         326 |             389 |   16.1953727506427
 jsonorg_sample4.json |          3467 |        2466 |            2710 |   9.00369003690037
 jsonorg_sample5.json |           872 |         469 |             613 |   23.4910277324633
(5 rows)
=== TEST PARSE AND SERIALIZATION ===
Null Parse:
       12.12 real         2.51 user         0.13 sys
Parse to Binary:
       13.38 real         2.51 user         0.14 sys
Serialize from Binary:
       16.65 real         2.51 user         0.13 sys
Normalize Text:
       18.99 real         2.51 user         0.13 sys
Roundtrip (parse to binary and serialize):
       18.58 real         2.51 user         0.14 sys
On Sun, Nov 21, 2010 at 12:31 AM, Terry Laurenzo <tj@laurenzo.org> wrote: > What we see from this is that parsing JSON text and generating a binary > representation is cheap, representing approximately 10% of the base case > time. Conversely, anything that involves generating JSON text is expensive, > accounting for 30-40% of the base case time. Some incidental profiling > shows that while the entire operation is expensive, the process of > generating string literals dominates this time. There is likely room for > optimization in this method, but it should be noted that most of these > documents are lightly escaped (if escaped at all) which represents the happy > path through the string literal output function. Ouch! That's kind of painful. But certainly for some use cases it will work out to a huge speedup, if you're doing subscripting or similar. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Nov 20, 2010, at 9:31 PM, Terry Laurenzo wrote: > Assuming that the JSON datatype (at a minimum) normalizes text for storage, then the text storage option accounts for aboutthe most expensive path but with none of the benefits of an internal binary form (smaller size, ability to cheaply performnon-trivial manipulation within the database server). > > Of course, just having a JSON datatype that blindly stores text will beat everything, but I'm getting closer to thinkingthat the binary option is worth the tradeoff. > > Comments? benchmarks++ Nice to have some data points for this discussion. Best, David, still hoping for the JSON data type in 9.1…
On 11/21/2010 12:31 AM, Terry Laurenzo wrote:
I copied the 5 sample documents from json.org's example section for these tests. These are loaded into a table with a varchar column 1000 times each (so the test table has 5000 rows in it). In all situations, the binary encoding was smaller than the normalized text form (between 9 and 23% smaller). I think there are cases where the binary form will be larger than the corresponding text form, but I don't think they would be very common.
Is that a pre-toast or post-toast comparison?
Even if it's post-toast, that doesn't seem like enough of a saving to convince me that simply storing as text, just as we do for XML, isn't a sensible way to go, especially when the cost of reproducing the text for delivery to clients (including, say, pg_dump) is likely to be quite high.
cheers
andrew