Discussion: Proposal: Add JSON support
I introduced myself in the thread "Proposal: access control jails (and introduction as aspiring GSoC student)", and we discussed jails and session-local variables. But, as Robert Haas suggested, implementing variable support in the backend would probably be way too ambitious a project for a newbie like me. I decided instead to pursue the task of adding JSON support to PostgreSQL, hence the new thread.

I plan to reference datatype-xml.html and functions-xml.html in some design decisions, but there are some things that apply to XML that don't apply to JSON and vice versa. For instance, jsoncomment wouldn't make sense because (standard) JSON doesn't have comments.

For access, we might have something like json_get('foo[1].bar') and json_set('foo[1].bar', 'hello'). jsonforest and jsonagg would be beautiful. For mapping, jsonforest/jsonagg could be used to build a JSON string from a result set (SELECT jsonagg(jsonforest(col1, col2, ...)) FROM tbl), but I'm not sure of the best way to go the other way around (generate a result set from JSON). CSS-style selectors would be cool, but "selecting" is what SQL is all about, and I'm not sure having a json_select("dom-element[key=value]") function is a good, orthogonal approach.

I'm wondering whether the internal representation of JSON should be plain JSON text, or some binary code that's easier to traverse and whatnot. For the sake of code size, just keeping it in text is probably best.

Now my thoughts and opinions on the JSON parsing/unparsing itself:

It should be built-in, rather than relying on an external library (like XML does). Priorities of the JSON implementation, in descending order, are:

* Small
* Correct
* Fast

Moreover, JSON operations shall not crash due to stack overflows.

I'm thinking Bison/Flex is overkill for parsing JSON (I haven't seen any JSON implementations out there that use it anyway). I would probably end up writing the JSON parser/serializer manually; it should not take more than a week.

As far as character encodings, I'd rather keep that out of the JSON parsing/serializing code itself and assume UTF-8. Wherever I'm wrong, I'll just throw encode/decode/validate operations at it.

Thoughts? Thanks.
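To make the "Bison/Flex is overkill" point concrete: a hand-written JSON parser needs only one character of lookahead to decide what kind of value comes next. A minimal sketch of that dispatch step, with hypothetical names (illustrative only, not code from the proposal):

    typedef enum { JSON_NULL, JSON_BOOL, JSON_NUMBER,
                   JSON_STRING, JSON_ARRAY, JSON_OBJECT } json_type;

    /* Skip insignificant whitespace between tokens. */
    static const char *skip_ws(const char *s)
    {
        while (*s == ' ' || *s == '\t' || *s == '\r' || *s == '\n')
            s++;
        return s;
    }

    /* One character of lookahead identifies the next value's type,
     * which is why a generated lexer/parser buys little for JSON. */
    static json_type peek_value_type(const char *s)
    {
        s = skip_ws(s);
        switch (*s)
        {
            case '{': return JSON_OBJECT;
            case '[': return JSON_ARRAY;
            case '"': return JSON_STRING;
            case 't': case 'f': return JSON_BOOL;
            case 'n': return JSON_NULL;
            default:  return JSON_NUMBER;   /* a digit or '-' */
        }
    }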
On Sun, Mar 28, 2010 at 4:48 PM, Joseph Adams <joeyadams3.14159@gmail.com> wrote: > I'm wondering whether the internal representation of JSON should be > plain JSON text, or some binary code that's easier to traverse and > whatnot. For the sake of code size, just keeping it in text is > probably best. +1 for text. > Now my thoughts and opinions on the JSON parsing/unparsing itself: > > It should be built-in, rather than relying on an external library > (like XML does). Why? I'm not saying you aren't right, but you need to make an argument rather than an assertion. This is a community, so no one is entitled to decide anything unilaterally, and people want to be convinced - including me. > As far as character encodings, I'd rather keep that out of the JSON > parsing/serializing code itself and assume UTF-8. Wherever I'm wrong, > I'll just throw encode/decode/validate operations at it. I think you need to assume that the encoding will be the server encoding, not UTF-8. Although others on this list are better qualified to speak to that than I am. ...Robert
Robert Haas wrote: > On Sun, Mar 28, 2010 at 4:48 PM, Joseph Adams > <joeyadams3.14159@gmail.com> wrote: > >> I'm wondering whether the internal representation of JSON should be >> plain JSON text, or some binary code that's easier to traverse and >> whatnot. For the sake of code size, just keeping it in text is >> probably best. >> > > +1 for text. > Agreed. > >> Now my thoughts and opinions on the JSON parsing/unparsing itself: >> >> It should be built-in, rather than relying on an external library >> (like XML does). >> > > Why? I'm not saying you aren't right, but you need to make an > argument rather than an assertion. This is a community, so no one is > entitled to decide anything unilaterally, and people want to be > convinced - including me. > Yeah, why? We should not be in the business of reinventing the wheel (and then maintaining the reinvented wheel), unless the code in question is *really* small. > >> As far as character encodings, I'd rather keep that out of the JSON >> parsing/serializing code itself and assume UTF-8. Wherever I'm wrong, >> I'll just throw encode/decode/validate operations at it. >> > > I think you need to assume that the encoding will be the server > encoding, not UTF-8. Although others on this list are better > qualified to speak to that than I am. > > > The trouble is that JSON is defined to be specifically Unicode, and in practice for us that means UTF8 on the server side. It could get a bit hairy, and it's definitely not something I think you can wave away with a simple "I'll just throw some encoding/decoding function calls at it." cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Robert Haas wrote: >> I think you need to assume that the encoding will be the server >> encoding, not UTF-8. Although others on this list are better >> qualified to speak to that than I am. > The trouble is that JSON is defined to be specifically Unicode, and in > practice for us that means UTF8 on the server side. It could get a bit > hairy, and it's definitely not something I think you can wave away with > a simple "I'll just throw some encoding/decoding function calls at it." It's just text, no? Are there any operations where this actually makes a difference? Like Robert, I'm *very* wary of trying to introduce any text storage into the backend that is in an encoding different from server_encoding. Even the best-case scenarios for that will involve multiple new places for encoding conversion failures to happen. regards, tom lane
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> Robert Haas wrote: >> >>> I think you need to assume that the encoding will be the server >>> encoding, not UTF-8. Although others on this list are better >>> qualified to speak to that than I am. >>> > > >> The trouble is that JSON is defined to be specifically Unicode, and in >> practice for us that means UTF8 on the server side. It could get a bit >> hairy, and it's definitely not something I think you can wave away with >> a simple "I'll just throw some encoding/decoding function calls at it." >> > > It's just text, no? Are there any operations where this actually makes > a difference? > If we're going to provide operations on it, that might involve some. I don't know. > Like Robert, I'm *very* wary of trying to introduce any text storage > into the backend that is in an encoding different from server_encoding. > Even the best-case scenarios for that will involve multiple new places for > encoding conversion failures to happen. > > I agree entirely. All I'm suggesting is that there could be many wrinkles here. Here's another thought. Given that JSON is actually specified to consist of a string of Unicode characters, what will we deliver to the client where the client encoding is, say Latin1? Will it actually be a legal JSON byte stream? cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Here's another thought. Given that JSON is actually specified to consist > of a string of Unicode characters, what will we deliver to the client > where the client encoding is, say Latin1? Will it actually be a legal > JSON byte stream? No, it won't. We will *not* be sending anything but latin1 in such a situation, and I really couldn't care less what the JSON spec says about it. Delivering wrongly-encoded data to a client is a good recipe for all sorts of problems, since the client-side code is very unlikely to be expecting that. A datatype doesn't get to make up its own mind whether to obey those rules. Likewise, data on input had better match client_encoding, because it's otherwise going to fail the encoding checks long before a json datatype could have any say in the matter. While I've not read the spec, I wonder exactly what "consist of a string of Unicode characters" should actually be taken to mean. Perhaps it only means that all the characters must be members of the Unicode set, not that the string can never be represented in any other encoding. There's more than one Unicode encoding anyway... regards, tom lane
On Sun, Mar 28, 2010 at 7:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> Here's another thought. Given that JSON is actually specified to consist >> of a string of Unicode characters, what will we deliver to the client >> where the client encoding is, say Latin1? Will it actually be a legal >> JSON byte stream? > > No, it won't. We will *not* be sending anything but latin1 in such a > situation, and I really couldn't care less what the JSON spec says about > it. Delivering wrongly-encoded data to a client is a good recipe for > all sorts of problems, since the client-side code is very unlikely to be > expecting that. A datatype doesn't get to make up its own mind whether > to obey those rules. Likewise, data on input had better match > client_encoding, because it's otherwise going to fail the encoding > checks long before a json datatype could have any say in the matter. > > While I've not read the spec, I wonder exactly what "consist of a string > of Unicode characters" should actually be taken to mean. Perhaps it > only means that all the characters must be members of the Unicode set, > not that the string can never be represented in any other encoding. > There's more than one Unicode encoding anyway... See sections 2.5 and 3 of: http://www.ietf.org/rfc/rfc4627.txt?number=4627 ...Robert
On Sun, Mar 28, 2010 at 7:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> Here's another thought. Given that JSON is actually specified to consist >> of a string of Unicode characters, what will we deliver to the client >> where the client encoding is, say Latin1? Will it actually be a legal >> JSON byte stream? > > No, it won't. We will *not* be sending anything but latin1 in such a > situation, and I really couldn't care less what the JSON spec says about > it. Delivering wrongly-encoded data to a client is a good recipe for > all sorts of problems, since the client-side code is very unlikely to be > expecting that. A datatype doesn't get to make up its own mind whether > to obey those rules. Likewise, data on input had better match > client_encoding, because it's otherwise going to fail the encoding > checks long before a json datatype could have any say in the matter. > > While I've not read the spec, I wonder exactly what "consist of a string > of Unicode characters" should actually be taken to mean. Perhaps it > only means that all the characters must be members of the Unicode set, > not that the string can never be represented in any other encoding. > There's more than one Unicode encoding anyway... In practice, every parser/serializer I've used (including the one I helped write) allows (and, often, forces) any non-ASCII character to be encoded as \u followed by a string of four hex digits. Whether it would be easy inside the backend, when generating JSON from user data stored in tables that are not in a UTF-8 encoded cluster, to convert to UTF-8, that's something else entirely. If it /is/ easy and safe, then it's just a matter of scanning for multi-byte sequences and replacing those with their \uXXXX equivalents. I have some simple and fast code I could share, if it's needed, though I suspect it's not. :) UPDATE: Thanks, Robert, for pointing to the RFC. -- Mike Rylander| VP, Research and Design| Equinox Software, Inc. / The Evergreen Experts| phone: 1-877-OPEN-ILS (673-6457)|email: miker@esilibrary.com| web: http://www.esilibrary.com
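The sort of scan-and-replace code being offered here is indeed short. A minimal sketch of escaping non-ASCII characters in already-validated UTF-8 as \uXXXX (illustrative only, not the code offered above; code points beyond U+FFFF, which need surrogate pairs, are omitted for brevity):

    #include <stdio.h>

    /* Escape non-ASCII characters in validated UTF-8 text as \uXXXX.
     * 4-byte UTF-8 sequences (code points above U+FFFF) would also
     * need UTF-16 surrogate pairs and are omitted here. */
    static void escape_non_ascii(FILE *out, const unsigned char *s)
    {
        while (*s != '\0')
        {
            if (*s < 0x80)                  /* plain ASCII: copy as-is */
                fputc(*s++, out);
            else if ((*s & 0xE0) == 0xC0)   /* 2-byte sequence */
            {
                unsigned cp = ((s[0] & 0x1Fu) << 6) | (s[1] & 0x3Fu);
                fprintf(out, "\\u%04x", cp);
                s += 2;
            }
            else                            /* 3-byte sequence */
            {
                unsigned cp = ((s[0] & 0x0Fu) << 12)
                            | ((s[1] & 0x3Fu) << 6)
                            | (s[2] & 0x3Fu);
                fprintf(out, "\\u%04x", cp);
                s += 3;
            }
        }
    }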
On Sun, Mar 28, 2010 at 8:23 PM, Mike Rylander <mrylander@gmail.com> wrote: > In practice, every parser/serializer I've used (including the one I > helped write) allows (and, often, forces) any non-ASCII character to > be encoded as \u followed by a string of four hex digits. Is it correct to say that the only feasible place where non-ASCII characters can be used is within string constants? If so, it might be reasonable to disallow characters with the high-bit set unless the server encoding is one of the flavors of Unicode of which the spec approves. I'm tempted to think that when the server encoding is Unicode we really ought to allow Unicode characters natively, because turning a long string of two-byte wide chars into a long string of six-byte wide chars sounds pretty evil from a performance point of view. ...Robert
Robert Haas wrote: > On Sun, Mar 28, 2010 at 8:23 PM, Mike Rylander <mrylander@gmail.com> wrote: > >> In practice, every parser/serializer I've used (including the one I >> helped write) allows (and, often, forces) any non-ASCII character to >> be encoded as \u followed by a string of four hex digits. >> > > Is it correct to say that the only feasible place where non-ASCII > characters can be used is within string constants? If so, it might be > reasonable to disallow characters with the high-bit set unless the > server encoding is one of the flavors of Unicode of which the spec > approves. I'm tempted to think that when the server encoding is > Unicode we really ought to allow Unicode characters natively, because > turning a long string of two-byte wide chars into a long string of > six-byte wide chars sounds pretty evil from a performance point of > view. > > > We support exactly one unicode encoding on the server side: utf8. And the maximum possible size of a validly encoded unicode char in utf8 is 4 (and that's pretty rare, IIRC). cheers andrew
Andrew Dunstan wrote: > > > Robert Haas wrote: >> On Sun, Mar 28, 2010 at 8:23 PM, Mike Rylander <mrylander@gmail.com> >> wrote: >> >>> In practice, every parser/serializer I've used (including the one I >>> helped write) allows (and, often, forces) any non-ASCII character to >>> be encoded as \u followed by a string of four hex digits. >>> >> >> Is it correct to say that the only feasible place where non-ASCII >> characters can be used is within string constants? If so, it might be >> reasonable to disallow characters with the high-bit set unless the >> server encoding is one of the flavors of Unicode of which the spec >> approves. I'm tempted to think that when the server encoding is >> Unicode we really ought to allow Unicode characters natively, because >> turning a long string of two-byte wide chars into a long string of >> six-byte wide chars sounds pretty evil from a performance point of >> view. >> >> >> > > We support exactly one unicode encoding on the server side: utf8. > > And the maximum possible size of a validly encoded unicode char in > utf8 is 4 (and that's pretty rare, IIRC). > > Sorry. Disregard this. I see what you mean. Yeah, I think *requiring* non-ASCII characters to be escaped would be evil. cheers andrew
On Sun, Mar 28, 2010 at 8:33 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Sun, Mar 28, 2010 at 8:23 PM, Mike Rylander <mrylander@gmail.com> wrote: >> In practice, every parser/serializer I've used (including the one I >> helped write) allows (and, often, forces) any non-ASCII character to >> be encoded as \u followed by a string of four hex digits. > > Is it correct to say that the only feasible place where non-ASCII > characters can be used is within string constants? Yes. That includes object property strings -- they are quoted string literals. > If so, it might be > reasonable to disallow characters with the high-bit set unless the > server encoding is one of the flavors of Unicode of which the spec > approves. I'm tempted to think that when the server encoding is > Unicode we really ought to allow Unicode characters natively, because > turning a long string of two-byte wide chars into a long string of > six-byte wide chars sounds pretty evil from a performance point of > view. > +1 As an aside, \u-encoded (escaped) characters and native multi-byte sequences (of any RFC-allowable Unicode encoding) are exactly equivalent in JSON -- it's a storage and transmission format, and doesn't prescribe the application-internal representation of the data. If it's faster (which it almost certainly is) to not mangle the data when it's all staying server side, that seems like a useful optimization. For output to the client, however, it would be useful to provide a \u-escaping function, which (AIUI) should always be safe regardless of client encoding. -- Mike Rylander| VP, Research and Design| Equinox Software, Inc. / The Evergreen Experts| phone: 1-877-OPEN-ILS (673-6457)|email: miker@esilibrary.com| web: http://www.esilibrary.com
On Sun, Mar 28, 2010 at 5:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Mar 28, 2010 at 4:48 PM, Joseph Adams
> <joeyadams3.14159@gmail.com> wrote:
>> Now my thoughts and opinions on the JSON parsing/unparsing itself:
>>
>> It should be built-in, rather than relying on an external library
>> (like XML does).
>
> Why? I'm not saying you aren't right, but you need to make an
> argument rather than an assertion. This is a community, so no one is
> entitled to decide anything unilaterally, and people want to be
> convinced - including me.

I apologize; I was just starting the conversation with some of my ideas to receive feedback. I didn't want people to have to wade through too many "I think"s. I'll be sure to use <opinion> tags in the future :-)

My reasoning for "It should be built-in" is:
* It would be nice to have a built-in serialization format that's available by default.
* It might be a little faster because it doesn't have to link to an external library.
* The code to interface between JSON logic and PostgreSQL will probably be much larger than the actual JSON encoding/decoding itself.
* The externally-maintained and packaged libjson implementations I saw brought in lots of dependencies (e.g. glib).
* "Everyone else" (e.g. PHP) uses a statically-linked JSON implementation.

Is the code in question "*really*" small? Well, not really, but it's not enormous either. By the way, I found a bug in PHP's JSON_parser (json_decode("true "); /* with a space */ returns null instead of true). I'll have to get around to reporting that.

Now, assuming JSON support is built-in to PostgreSQL and is enabled by default, it is my opinion that encoding issues should not be dealt with in the JSON code itself, but that the JSON code should assume UTF-8. I think conversions should be done to/from UTF-8 before passing text through the JSON code, because this would likely be the smallest way to implement it (not necessarily the fastest, though).

Mike Rylander pointed out something wonderful, which is that JSON code can be stored in plain old ASCII using \u... escapes. If a target encoding supports all of Unicode, the JSON serializer could be told not to generate \u escapes. Otherwise, the \u escapes would be necessary.

Thus, here's an example of how (in my opinion) character sets and such should be handled in the JSON code. Suppose the client's encoding is UTF-16, and the server's encoding is Latin-1. When JSON is stored to the database:
1. The client is responsible and sends a valid UTF-16 JSON string.
2. PostgreSQL checks to make sure it is valid UTF-16, then converts it to UTF-8.
3. The JSON code parses it (to ensure it's valid).
4. The JSON code unparses it (to get a representation without needless whitespace). It is given a flag indicating it should only output ASCII text.
5. The ASCII is stored in the server, since it is valid Latin-1.

When JSON is retrieved from the database:
1. ASCII is retrieved from the server.
2. If the user needs to extract one or more fields, the JSON is parsed, and the fields are extracted.
3. Otherwise, the JSON text is converted to UTF-16 and sent to the client.

Note that I am being biased toward optimizing code size rather than speed.

Here's a question about semantics: should converting JSON to text guarantee that Unicode will be \u escaped, or should it render actual Unicode whenever possible (when the client uses a Unicode-complete charset)?
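Worth noting: inside the backend, the server-encoding-to-UTF-8 legs of a pipeline like this would presumably reuse PostgreSQL's existing conversion machinery rather than new code. A minimal sketch, assuming backend context (illustrative, not part of the proposal):

    #include "postgres.h"
    #include "mb/pg_wchar.h"

    /* Convert a NUL-terminated server-encoded string to UTF-8 before
     * JSON parsing; pg_do_encoding_conversion returns its input
     * unchanged when the database encoding is already UTF-8. */
    static char *
    json_to_utf8(const char *s)
    {
        return (char *) pg_do_encoding_conversion((unsigned char *) s,
                                                  strlen(s),
                                                  GetDatabaseEncoding(),
                                                  PG_UTF8);
    }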
As for reinventing the wheel, I'm in the process of writing yet another JSON implementation simply because I didn't find the other ones I looked at palatable. I am aiming for simple code, not fast code. I am using malloc for structures and realloc for strings/arrays rather than resorting to clever buffering tricks. Of course, I'll switch it over to palloc/repalloc before migrating it to PostgreSQL.
2010/3/29 Andrew Dunstan <andrew@dunslane.net>: > Robert Haas wrote: >> On Sun, Mar 28, 2010 at 4:48 PM, Joseph Adams >> <joeyadams3.14159@gmail.com> wrote: >>> I'm wondering whether the internal representation of JSON should be >>> plain JSON text, or some binary code that's easier to traverse and >>> whatnot. For the sake of code size, just keeping it in text is >>> probably best. >> >> +1 for text. > > Agreed. There's another choice, called BSON. http://www.mongodb.org/display/DOCS/BSON I've not researched it deeply yet, but it seems a reasonable format to store in databases, as it was invented for MongoDB. >>> Now my thoughts and opinions on the JSON parsing/unparsing itself: >>> >>> It should be built-in, rather than relying on an external library >>> (like XML does). >> >> Why? I'm not saying you aren't right, but you need to make an >> argument rather than an assertion. This is a community, so no one is >> entitled to decide anything unilaterally, and people want to be >> convinced - including me. > > Yeah, why? We should not be in the business of reinventing the wheel (and > then maintaining the reinvented wheel), unless the code in question is > *really* small. The many JSON implementations in many languages show that parsing JSON is not so difficult to code, and that the needs vary; hence, I wonder if we could have our very own. Don't take this wrongly - I'm not against the text format, nor against using an external library. Regards, -- Hitoshi Harada
On Sun, Mar 28, 2010 at 11:24 PM, Joseph Adams <joeyadams3.14159@gmail.com> wrote: > I apologize; I was just starting the conversation with some of my > ideas to receive feedback. I didn't want people to have to wade > through too many "I think"s . I'll be sure to use <opinion> tags in > the future :-) FWIW, I don't care at all whether you say "I think" or "I know"; the point is that you have to provide backup for any position you choose to take. > My reasoning for "It should be built-in" is: > * It would be nice to have a built-in serialization format that's > available by default. > * It might be a little faster because it doesn't have to link to an > external library. I don't think either of these reasons is valid. > * The code to interface between JSON logic and PostgreSQL will > probably be much larger than the actual JSON encoding/decoding itself. If true, this is a good argument. > * The externally-maintained and packaged libjson implementations I > saw brought in lots of dependencies (e.g. glib). As is this. > * "Everyone else" (e.g. PHP) uses a statically-linked JSON implementation. But this isn't. > Is the code in question "*really*" small? Well, not really, but it's > not enormous either. By the way, I found a bug in PHP's JSON_parser > (json_decode("true "); /* with a space */ returns null instead of > true). I'll have to get around to reporting that. > > Now, assuming JSON support is built-in to PostgreSQL and is enabled by > default, it is my opinion that encoding issues should not be dealt > with in the JSON code itself, but that the JSON code itself should > assume UTF-8. I think conversions should be done to/from UTF-8 before > passing it through the JSON code because this would likely be the > smallest way to implement it (not necessarily the fastest, though). > > Mike Rylander pointed out something wonderful, and that is that JSON > code can be stored in plain old ASCII using \u... . If a target > encoding supports all of Unicode, the JSON serializer could be told > not to generate \u escapes. Otherwise, the \u escapes would be > necessary. > > Thus, here's an example of how (in my opinion) character sets and such > should be handled in the JSON code: > > Suppose the client's encoding is UTF-16, and the server's encoding is > Latin-1. When JSON is stored to the database: > 1. The client is responsible and sends a valid UTF-16 JSON string. > 2. PostgreSQL checks to make sure it is valid UTF-16, then converts > it to UTF-8. > 3. The JSON code parses it (to ensure it's valid). > 4. The JSON code unparses it (to get a representation without > needless whitespace). It is given a flag indicating it should only > output ASCII text. > 5. The ASCII is stored in the server, since it is valid Latin-1. > > When JSON is retrieved from the database: > 1. ASCII is retrieved from the server > 2. If user needs to extract one or more fields, the JSON is parsed, > and the fields are extracted. > 3. Otherwise, the JSON text is converted to UTF-16 and sent to the client. > > Note that I am being biased toward optimizing code size rather than speed. Can you comment on my proposal elsewhere on this thread and compare your proposal to mine? In what ways are they different, and which is better, and why? > Here's a question about semantics: should converting JSON to text > guarantee that Unicode will be \u escaped, or should it render actual > Unicode whenever possible (when the client uses a Unicode-complete > charset) ? 
I feel pretty strongly that the data should be stored in the database in the format in which it will be returned to the user - any conversion which is necessary should happen on the way in. I am not 100% sure to what extent we should attempt to canonicalize the input and to what extent we should simply store it in whichever way the user chooses to provide it. > As for reinventing the wheel, I'm in the process of writing yet > another JSON implementation simply because I didn't find the other > ones I looked at palatable. I am aiming for simple code, not fast > code. I am using malloc for structures and realloc for strings/arrays > rather than resorting to clever buffering tricks. Of course, I'll > switch it over to palloc/repalloc before migrating it to PostgreSQL. I'm not sure that optimizing for simplicity over speed is a good idea. I think we can reject implementations as unpalatable because they are slow or feature-poor or have licensing issues or are not actively maintained, but rejecting them because they use complex code in order to be fast doesn't seem like the right trade-off to me. ...Robert
Hi, Joseph Adams <joeyadams3.14159@gmail.com> writes: > As for reinventing the wheel, I'm in the process of writing yet > another JSON implementation simply because I didn't find the other > ones I looked at palatable. Even this one (ANSI C, MIT licensed)? cJSON -- An ultra-lightweight, portable, single-file, simple-as-can-be ANSI-C compliant JSON parser, under MIT license. http://sourceforge.net/projects/cjson/ http://cjson.svn.sourceforge.net/viewvc/cjson/README?revision=7&view=markup http://cjson.svn.sourceforge.net/viewvc/cjson/cJSON.c?revision=33&view=markup And from cJSON.h we read that it could be somewhat easy to integrate into PostgreSQL's memory management:

    typedef struct cJSON_Hooks {
        void *(*malloc_fn)(size_t sz);
        void (*free_fn)(void *ptr);
    } cJSON_Hooks;

Just adding some data points, hoping that's not adding only confusion. Regards, -- dim
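If cJSON were adopted, wiring those hooks to the backend allocator might look roughly like this (a sketch only; the wrappers just adapt palloc/pfree to the hook signatures):

    #include "postgres.h"
    #include "cJSON.h"

    /* Route cJSON's allocations through the current memory context,
     * so a failed allocation aborts the transaction cleanly instead
     * of returning NULL into library code that never checks for it. */
    static void *cjson_malloc(size_t sz)
    {
        return palloc(sz);
    }

    static void cjson_free(void *ptr)
    {
        pfree(ptr);
    }

    static void init_cjson_hooks(void)
    {
        cJSON_Hooks hooks = { cjson_malloc, cjson_free };

        cJSON_InitHooks(&hooks);
    }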
On Sun, 2010-03-28 at 23:24 -0400, Joseph Adams wrote: > Thus, here's an example of how (in my opinion) character sets and such > should be handled in the JSON code: > > Suppose the client's encoding is UTF-16, and the server's encoding is > Latin-1. When JSON is stored to the database: > 1. The client is responsible and sends a valid UTF-16 JSON string. > 2. PostgreSQL checks to make sure it is valid UTF-16, then converts > it to UTF-8. > 3. The JSON code parses it (to ensure it's valid). > 4. The JSON code unparses it (to get a representation without > needless whitespace). It is given a flag indicating it should only > output ASCII text. > 5. The ASCII is stored in the server, since it is valid Latin-1. > > When JSON is retrieved from the database: > 1. ASCII is retrieved from the server > 2. If user needs to extract one or more fields, the JSON is parsed, > and the fields are extracted. > 3. Otherwise, the JSON text is converted to UTF-16 and sent to the client. The problem I see here is that a data type output function is normally not aware of the client encoding. The alternatives that I see are that you always escape everything you see to plain ASCII, so it's valid in every server encoding, but that would result in pretty sad behavior for users of languages that don't use a lot of ASCII characters, or that you decree a nonstandard JSON variant that momentarily uses whatever encoding you decide.
Robert Haas wrote: > I feel pretty strongly that the data should be stored in the database > in the format in which it will be returned to the user - any > conversion which is necessary should happen on the way in. I am not > 100% sure to what extent we should attempt to canonicalize the input > and to what extent we should simply store it in whichever way the user > chooses to provide it. > > ISTM that implies that, with a possible exception when the server encoding is utf8, you would have to \u escape the data on the way in fairly pessimistically. I'd be inclined to say we should store and validate it exactly as the client gives it to us (converted to the server encoding, as it would be, of course). In practice that would mean that for non-utf8 databases the client would need to \u escape it. I suspect most uses of this would be in utf8-encoded databases anyway. I also think we should provide a function to do the escaping, so users could do something like: insert into foo (myjson) values (json_escape('some json text here')); I also thought about a switch to turn on \u escaping on output - that might be useful for pg_dump for instance. cheers andrew
Robert Haas <robertmhaas@gmail.com> writes: > On Sun, Mar 28, 2010 at 11:24 PM, Joseph Adams > <joeyadams3.14159@gmail.com> wrote: >> My reasoning for "It should be built-in" is: >> * It would be nice to have a built-in serialization format that's >> available by default. >> * It might be a little faster because it doesn't have to link to an >> external library. > I don't think either of these reasons is valid. FWIW, our track record with relying on external libraries has been less than great --- "upstream will maintain it" sounds good but has fallen over with respect to both the regex engine and the snowball stemmers, to take two examples. And libxml2 has been nothing but a source of pain. If this is going to end up being one fairly small C file implementing a spec that is not a moving target, I'd vote against depending on an external library instead, no matter how spiffy and license-compatible the external library might be. regards, tom lane
On Mon, Mar 29, 2010 at 12:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Sun, Mar 28, 2010 at 11:24 PM, Joseph Adams >> <joeyadams3.14159@gmail.com> wrote: >>> My reasoning for "It should be built-in" is: >>> * It would be nice to have a built-in serialization format that's >>> available by default. >>> * It might be a little faster because it doesn't have to link to an >>> external library. > >> I don't think either of these reasons is valid. > > FWIW, our track record with relying on external libraries has been less > than great --- "upstream will maintain it" sounds good but has fallen > over with respect to both the regex engine and the snowball stemmers, > to take two examples. And libxml2 has been nothing but a source of pain. > > If this is going to end up being one fairly small C file implementing > a spec that is not a moving target, I'd vote against depending on an > external library instead, no matter how spiffy and license-compatible > the external library might be. Fair enough. Note that I did go on to say which reasons I did think were potentially valid. ;-) ...Robert
On 3/28/10 8:52 PM, Hitoshi Harada wrote: > There's another choice, called BSON. > > http://www.mongodb.org/display/DOCS/BSON > > I've not researched it yet deeply, it seems reasonable to be stored in > databases as it is invented for MongoDB. I wouldn't take that for granted. The MongoDB project involves a lot of "re-inventing the wheel" and I'd scrutinize any of their innovations pretty thoroughly. Besides, I thought the point of a JSON type was to be compatible with the *majority* of JSON users? -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Mar 29, 2010, at 9:02 AM, Tom Lane wrote: > If this is going to end up being one fairly small C file implementing > a spec that is not a moving target, I'd vote against depending on an > external library instead, no matter how spiffy and license-compatible > the external library might be. Perhaps you could fork one, in that case. Best, David
On Mon, Mar 29, 2010 at 2:23 PM, David E. Wheeler <david@kineticode.com> wrote: > On Mar 29, 2010, at 9:02 AM, Tom Lane wrote: > >> If this is going to end up being one fairly small C file implementing >> a spec that is not a moving target, I'd vote against depending on an >> external library instead, no matter how spiffy and license-compatible >> the external library might be. > > Perhaps you could fork one, in that case. > > Best, > > David > > I'm considering using and adapting cJSON instead of continuing with my redundant implementation. I could run `indent -kr -i4` on it (will that match PostgreSQL's coding style?), add support for UTF-16 surrogate pairs (pair of \u... escapes for each character above U+FFFF as required by the JSON spec), and add a switch to turn on/off pure ASCII output. P.S.: Sorry for the repeat, David. I forgot to CC the mailing list.
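The surrogate-pair arithmetic mentioned here is small and fixed by RFC 4627 section 2.5; a minimal sketch (the function name is illustrative):

    #include <stdio.h>

    /* Emit a code point above U+FFFF as a UTF-16 surrogate pair,
     * per RFC 4627 section 2.5: U+1D11E becomes \ud834\udd1e. */
    static void
    emit_surrogate_pair(FILE *out, unsigned long cp)
    {
        cp -= 0x10000;
        fprintf(out, "\\u%04lx\\u%04lx",
                0xD800 + (cp >> 10),        /* high surrogate */
                0xDC00 + (cp & 0x3FF));     /* low surrogate */
    }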
Tom Lane <tgl@sss.pgh.pa.us> writes: > If this is going to end up being one fairly small C file implementing > a spec that is not a moving target, I'd vote against depending on an > external library instead, no matter how spiffy and license-compatible > the external library might be. My understanding is that it's possible to include (fork) a MIT or BSD source code into our source tree, right? (Some other licenses certainly apply too). Regards, -- dim
Dimitri Fontaine <dfontaine@hi-media.com> writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: >> If this is going to end up being one fairly small C file implementing >> a spec that is not a moving target, I'd vote against depending on an >> external library instead, no matter how spiffy and license-compatible >> the external library might be. > My understanding is that it's possible to include (fork) a MIT or BSD > source code into our source tree, right? (Some other licenses certainly > apply too). MIT or 2-clause BSD would be ok for such a thing, other licenses probably not. regards, tom lane
joeyadams3.14159@gmail.com (Joseph Adams) writes: > I introduced myself in the thread "Proposal: access control jails (and > introduction as aspiring GSoC student)", and we discussed jails and > session-local variables. But, as Robert Haas suggested, implementing > variable support in the backend would probably be way too ambitious a > project for a newbie like me. I decided instead to pursue the task of > adding JSON support to PostgreSQL, hence the new thread. Interesting... I had a discussion about just this sort of thing with a local LUG associate; he was interested in this from a "doing CouchDB-ish things using PostgreSQL" perspective. There were a couple perspectives there, which may be somewhat orthogonal to what you're trying to do. I'll mention them as they may suggest useful operations. 1. Buddy Myles pointed out a NYTimes project which does something pretty analogous... http://code.nytimes.com/projects/dbslayer This is a proxy that allows clients to submit requests via HTTP, returning responses in JSON form. Note that the HTTP request has the SQL query embedded into it. 2. CouchDB's interface is much the same, where clients submit HTTP requests and receive JSON responses back, but with the difference that the query is a stylized sorta-JSON form. I'd think that you could get quite a long ways on this, at least doing something like dbslayer without *necessarily* needing to do terribly much work inside the DB engine. Mapping a tuple, or a list of tuples, into a forest of JSON documents should be pretty straightforward; whether or not it's really desirable to operate a JSON-flavoured query inside PostgreSQL may be the difference between *this year's* GSOC and *next year's* :-). -- "...the Jedi learned early on what language the universe was programmed in. Then they took advantage of an accident of language to obscure this fact from the unwashed. They all affected an inverted lisp. so, a Jedi to be, you the Forth must use."
> I'd think that you could get quite a long ways on this, at least doing > something like dbslayer without *necessarily* needing to do terribly > much work inside the DB engine. There's actually an HTTP framework tool for Postgres which already does something of the sort. It was introduced at pgCon 2 years ago ... will look for. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Tue, Mar 30, 2010 at 8:58 PM, Josh Berkus <josh@agliodbs.com> wrote: >> I'd think that you could get quite a long ways on this, at least doing >> something like dbslayer without *necessarily* needing to do terribly >> much work inside the DB engine. > > There's actually an HTTP framework tool for Postgres which already does > something of the sort. It was introduced at pgCon 2 years ago ... will > look for. While it might be interesting to have/find/write a tool that puts an HTTP/JSON layer around the DB connection, it's pretty much entirely unrelated to the proposed project of creating a json type with PostgreSQL analogous to the xml type we already have, which is what the OP is proposing to do. Personally, I suspect that a JSON type is both a more interesting project to work on and a more useful result for this community. ...Robert
Robert Haas wrote: > While it might be interesting to have/find/write a tool that puts an > HTTP/JSON layer around the DB connection, it's pretty much entirely > unrelated to the proposed project of creating a json type with > PostgreSQL analagous to the xml type we already have, which is what > the OP is proposing to do. > > Personally, I suspect that a JSON type is both a more interesting > project to work on and a more useful result for this community. > > > I agree. cheers andrew
robertmhaas@gmail.com (Robert Haas) writes: > On Tue, Mar 30, 2010 at 8:58 PM, Josh Berkus <josh@agliodbs.com> wrote: >>> I'd think that you could get quite a long ways on this, at least doing >>> something like dbslayer without *necessarily* needing to do terribly >>> much work inside the DB engine. >> >> There's actually an HTTP framework tool for Postgres which already does >> something of the sort. It was introduced at pgCon 2 years ago ... will >> look for. > > While it might be interesting to have/find/write a tool that puts an > HTTP/JSON layer around the DB connection, it's pretty much entirely > unrelated to the proposed project of creating a json type with > PostgreSQL analogous to the xml type we already have, which is what > the OP is proposing to do. > > Personally, I suspect that a JSON type is both a more interesting > project to work on and a more useful result for this community. No disagreement here; I'd expect that a JSON type would significantly ease building such a framework. Indeed, that could be a demonstration of success... "We then implemented an HTTP/JSON proxy in 27 lines of Python code..." :-) -- "Unless you used NetInfo. _Then_ changing network settings could often require torching of the existing system, salting of the ground it had rested on, and termination of anyone who used it." -- JFW <jwiede@biff.com> on comp.sys.next.advocacy
I ended up reinventing the wheel and writing another JSON library: http://constellationmedia.com/~funsite/static/json-0.0.1.tar.bz2 This is a first release, and it doesn't really have a name besides "json". It's very similar to cJSON, except it is (sans unknown bugs) more reliable, more correct, and cleaner (unless you hate gotos ;-) ). It has a simple test suite. It is not prone to stack overflows, as it doesn't recurse. It is strict, requires input to be UTF-8 (it validates it first) and only outputs UTF-8. Other than treating numbers liberally, my implementation only accepts valid JSON code (it doesn't try to correct anything, even Unicode problems). It is under the MIT license.
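To make the no-recursion claim concrete, the usual technique is to keep nesting state on a heap-allocated stack that grows on demand, so pathologically deep input exhausts memory gracefully rather than smashing the C stack. A stripped-down sketch of the idea (illustrative; not code from the library itself):

    #include <stdbool.h>
    #include <stdlib.h>

    /* Check that brackets/braces nest properly without recursing:
     * opening tokens go on a growable heap stack instead of the C
     * call stack. String bodies are skipped so that brackets inside
     * them are ignored. */
    static bool balanced_nesting(const char *s)
    {
        size_t cap = 16, depth = 0;
        char *stack = malloc(cap);
        bool ok = stack != NULL;

        while (ok && *s != '\0')
        {
            char c = *s++;

            if (c == '"')
            {
                while (*s != '\0' && *s != '"')
                    s += (*s == '\\' && s[1] != '\0') ? 2 : 1;
                if (*s == '"')
                    s++;
            }
            else if (c == '{' || c == '[')
            {
                if (depth == cap)
                {
                    char *tmp = realloc(stack, cap * 2);
                    if (tmp == NULL)
                    {
                        ok = false;
                        break;
                    }
                    stack = tmp;
                    cap *= 2;
                }
                stack[depth++] = c;
            }
            else if (c == '}' || c == ']')
                ok = depth > 0 && stack[--depth] == (c == '}' ? '{' : '[');
        }

        ok = ok && depth == 0;
        free(stack);
        return ok;
    }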
On 1.4.2010 5:39, Joseph Adams wrote: > I ended up reinventing the wheel and writing another JSON library: > > http://constellationmedia.com/~funsite/static/json-0.0.1.tar.bz2 > > This is a first release, and it doesn't really have a name besides > "json". It's very similar to cJSON, except it is (sans unknown bugs) > more reliable, more correct, and cleaner (unless you hate gotos ;-) ). > It has a simple test suite. It is not prone to stack overflows, as > it doesn't recurse. It is strict, requires input to be UTF-8 (it > validates it first) and only outputs UTF-8. Other than treating > numbers liberally, my implementation only accepts valid JSON code (it > doesn't try to correct anything, even Unicode problems). It is under > the MIT license. > I did some testing on my own, it passed everything I have thrown at it so far. I also did tests using MSVC for both 32bit and 64bit targets and it worked fine too (except for missing stdbool.h in msvc, which is no big deal). The coding style compared to cJSON (or other libs I've seen) seems closer to the style of PostgreSQL; it would however still require a pgindent run and maybe some minor adjustments. -- Regards Petr Jelinek (PJMODOS)
On Apr 1, 2010, at 9:34 PM, Petr Jelinek wrote: >> I ended up reinventing the wheel and writing another JSON library: >> >> http://constellationmedia.com/~funsite/static/json-0.0.1.tar.bz2 >> >> This is a first release, and it doesn't really have a name besides >> "json". It's very similar to cJSON, except it is (sans unknown bugs) >> more reliable, more correct, and cleaner (unless you hate gotos ;-) ). >> It has a simple test suite. It is not prone to stack overflows, as >> it doesn't recurse. It is strict, requires input to be UTF-8 (it >> validates it first) and only outputs UTF-8. Other than treating >> numbers liberally, my implementation only accepts valid JSON code (it >> doesn't try to correct anything, even Unicode problems). It is under >> the MIT license. >> > > I did some testing on my own, it passed everything I have thrown at it so far. > I also did tests using MSVC for both 32bit and 64bit targets and it worked fine too (except for missing stdbool.h in msvc, which is no big deal). > > The coding style compared to cJSON (or other libs I've seen) seems closer to the style of PostgreSQL; it would however still require a pgindent run and maybe some minor adjustments. Someone approve this project for the GSoC quick, before Joseph finishes it! Best, David
I've been wondering whether the JSON datatype should be strict or liberal.

For one, there's strict JSON (following the exact specification). Then there are more liberal JSON variants: some JSON parsers support comments, some support invalid number formats (e.g. '3.' or '+5'), etc.

The consensus seems to be that JSON content should be stored verbatim (it should store the exact string the client sent to it), as is done with XML. However, this notion is somewhat incompatible with "Be conservative in what you do; be liberal in what you accept from others", because we can't accept loose JSON and then spit out strict JSON without messing with the content.

Here's my idea: the datatype should only allow strict JSON, but there should be a function that accepts a liberal format, cleans it up to make it strict JSON, and converts it to JSON. I think making strict JSON the default makes the most sense because:
* Inputs to the database will most likely be coming from programs, not humans.
* Output is expected to be valid JSON and work anywhere JSON should work.
* Strict JSON is what more people would expect, I'd think.
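To illustrate one piece of what such a clean-up function might do, here is a minimal in-place comment stripper; it leaves string literals untouched and is purely illustrative (not from any posted code):

    /* Strip // and C-style comments from "liberal" JSON in place.
     * The output is never longer than the input, so out <= s holds
     * throughout and the rewrite is safe. */
    static void strip_json_comments(char *s)
    {
        char *out = s;

        while (*s != '\0')
        {
            if (*s == '"')
            {
                /* copy a string literal verbatim, honoring \" escapes */
                *out++ = *s++;
                while (*s != '\0' && *s != '"')
                {
                    if (*s == '\\' && s[1] != '\0')
                        *out++ = *s++;  /* copy the backslash... */
                    *out++ = *s++;      /* ...and the escaped char */
                }
                if (*s == '"')
                    *out++ = *s++;
            }
            else if (s[0] == '/' && s[1] == '/')
            {
                while (*s != '\0' && *s != '\n')
                    s++;                /* drop to end of line */
            }
            else if (s[0] == '/' && s[1] == '*')
            {
                s += 2;
                while (*s != '\0' && !(s[0] == '*' && s[1] == '/'))
                    s++;
                if (*s != '\0')
                    s += 2;             /* skip the closing star-slash */
            }
            else
                *out++ = *s++;
        }
        *out = '\0';
    }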
On Sat, Apr 3, 2010 at 8:59 AM, Joseph Adams <joeyadams3.14159@gmail.com> wrote: > I've been wondering whether the JSON datatype should be strict or liberal. > > For one, there's strict JSON (following the exact specification). > Then there are more liberal JSON variants. Some JSON parsers > support comments, some support invalid number formats (e.g. '3.' or > '+5'), etc. > > The consensus seems to be that JSON content should be stored verbatim > (it should store the exact string the client sent to it), as is done > with XML. However, this notion is somewhat incompatible with "Be > conservative in what you do; be liberal in what you accept from > others" because we can't accept loose JSON, then spit out strict > JSON without messing with the content. > > Here's my idea: the datatype should only allow strict JSON, but there > should be a function that accepts a liberal format, cleans it up to > make it strict JSON, and converts it to JSON. I think making strict > JSON the default makes the most sense because: > * Inputs to the database will most likely be coming from programs, not humans. > * Output is expected to be valid JSON and work anywhere JSON should work. > * Strict JSON is what more people would expect, I'd think. +1 -- Mike Rylander| VP, Research and Design| Equinox Software, Inc. / The Evergreen Experts| phone: 1-877-OPEN-ILS (673-6457)|email: miker@esilibrary.com| web: http://www.esilibrary.com
Mike Rylander wrote: >> >> Here's my idea: the datatype should only allow strict JSON, but there >> should be a function that accepts a liberal format, cleans it up to >> make it strict JSON, and converts it to JSON. I think making strict >> JSON the default makes the most sense because: >> * Inputs to the database will most likely be coming from programs, not humans. >> * Output is expected to be valid JSON and work anywhere JSON should work. >> * Strict JSON is what more people would expect, I'd think. >> > > +1 > > Yeah. That's the only thing that makes sense to me. We don't allow badly formed XML, for example, although we do allow document fragments (as required by the standard, IIRC). But we could sensibly have some function like 'cleanup_json(almost_json text) returns json'. cheers andrew
Another JSON strictness issue: the JSON standard ( http://www.ietf.org/rfc/rfc4627.txt ) states that JSON text can only be an array or object. However, my implementation currently accepts any valid value. Thus, '3', '"hello"', 'true', 'false', and 'null' are all accepted by my implementation, but are not strictly JSON text. The question is: should the JSON datatype accept atomic values (those that aren't arrays or objects) as valid JSON?

I tried a few other JSON implementations to see where they stand regarding atomic types as input:

* JSON_checker (C) does not accept them.
* JSON.parse() (JavaScript) accepts them.
* json_decode() (PHP) accepts them. However, support is currently buggy (e.g. '1' is accepted, but '1 ' is not).
* cJSON (C) accepts them.
* JSON.pm (Perl) accepts them if you specify the allow_nonref option. Otherwise, it accepts 'true' and 'false', but not 'null', a number, or a string by itself.

In my opinion, we should accept an atomic value as valid JSON content. I suppose we could get away with calling it a "content" fragment, as is done with XML without a doctype.

Accepting atomic values as valid JSON would be more orthogonal, as it would be possible to have a function like this:

json_values(object_or_array JSON) RETURNS SETOF JSON
-- extracts values from an object or members from an array, returning them as JSON fragments.

Also, should we go even further and accept key:value pairs by themselves?

'"key":"value"'::JSON

I don't think we should, because doing so would be rather zany. It would mean JSON content could be invalid in value context, as in:

// JavaScript
var content = "key" : "value";

I improved my JSON library. It now only accepts strict, UTF-8 encoded JSON values (that is, objects, arrays, strings, numbers, true, false, and null). It also has a json_decode_liberal() function that accepts a string, cleans it up, and passes it through the stricter json_decode(). json_decode_liberal() filters out comments, allows single-quoted strings, and accepts a lax number format compared to strict JSON. I may add Unicode repair to it later on, but implementing that well really depends on what type of Unicode errors appear in real life, I think.

http://constellationmedia.com/~funsite/static/json-0.0.2.tar.bz2

My json.c is now 1161 lines long, so I can't quite call it "small" anymore.
On Mon, Apr 5, 2010 at 11:50 PM, Joseph Adams <joeyadams3.14159@gmail.com> wrote: > In my opinion, we should accept an atomic value as valid JSON content. That seems right to me. > Also, should we go even further and accept key:value pairs by themselves? : > > '"key":"value"'::JSON Definitely not. ...Robert
On 6.4.2010 5:50, Joseph Adams wrote: > Another JSON strictness issue: the JSON standard ( > http://www.ietf.org/rfc/rfc4627.txt ) states that JSON text can only > be an array or object. However, my implementation currently accepts > any valid value. Thus, '3', '"hello"', 'true', 'false', and 'null' > are all accepted by my implementation, but are not strictly JSON text. > The question is: should the JSON datatype accept atomic values (those > that aren't arrays or objects) as valid JSON? > Not really sure about this myself, but keep in mind that NULL has special meaning in SQL. > Also, should we go even further and accept key:value pairs by themselves? : > > '"key":"value"'::JSON > > No, especially considering that '{"key":"value"}' is a valid JSON value. > I improved my JSON library. It now only accepts strict, UTF-8 encoded > JSON values (that is, objects, arrays, strings, numbers, true, false, > and null). > Just a note, but PostgreSQL has some UTF-8 validation code, you might want to look at it maybe, at least once you start the actual integration into core, so that you are not reinventing too many wheels. I can see how your own code is a good thing for a general library, which this can (and I am sure will) be used as, but for the datatype itself, it might be a better idea to use what's already there, unless it's somehow incompatible of course. -- Regards Petr Jelinek (PJMODOS)
On 6.4.2010 7:57, Joseph Adams wrote: > On Tue, Apr 6, 2010 at 1:00 AM, Petr Jelinek <pjmodos@pjmodos.net> wrote: > >> Not really sure about this myself, but keep in mind that NULL has special >> meaning in SQL. >> > To me, the most logical approach is to do the obvious thing: make > JSON's 'null' be SQL's NULL. For instance, SELECTing on a table with > NULLs in it and converting the result set to JSON would yield a > structure with 'null's in it. 'null'::JSON would yield NULL. I'm not > sure what startling results would come of this approach, but I'm > guessing this would be most intuitive and useful. > +1 >> Just a note, but PostgreSQL has some UTF-8 validation code, you might want >> to look at it maybe, at least once you start the actual integration into >> core, so that you are not reinventing too many wheels. I can see how your >> own code is good thing for general library which this can (and I am sure >> will be) used as, but for the datatype itself, it might be better idea to >> use what's already there, unless it's somehow incompatible of course. >> > Indeed. My plan is to first get a strong standalone JSON library > written and tested so it can be used as a general-purpose library. As > the JSON code is merged into PostgreSQL, it can be adapted. Part of > this adaptation would most likely be removing the UTF-8 validation > function I wrote and using PostgreSQL's Unicode support code instead. > > There are probably other bits that could be PostgreSQLified as well. > I wonder if I should consider leveraging PostgreSQL's regex support or > if it would be a bad fit/waste of time/slower/not worth it. > Regex? What for? You certainly don't need it for parsing; you have a good parser IMHO, and regex would probably be all of the above. -- Regards Petr Jelinek (PJMODOS)
On Mon, Apr 5, 2010 at 11:50 PM, Joseph Adams <joeyadams3.14159@gmail.com> wrote: > Another JSON strictness issue: the JSON standard ( > http://www.ietf.org/rfc/rfc4627.txt ) states that JSON text can only > be an array or object. However, my implementation currently accepts > any valid value. Thus, '3', '"hello"', 'true', 'false', and 'null' > are all accepted by my implementation, but are not strictly JSON text. > The question is: should the JSON datatype accept atomic values (those > that aren't arrays or objects) as valid JSON? > > I tried a few other JSON implementations to see where they stand > regarding atomic types as input: > > JSON_checker (C) does not accept them. > JSON.parse() (JavaScript) accepts them. > json_decode() (PHP) accepts them. However, support is currently buggy > (e.g. '1' is accepted, but '1 ' is not). > cJSON (C) accepts them. > JSON.pm (Perl) accepts them if you specify the allow_nonref option. > Otherwise, it accepts 'true' and 'false', but not 'null', a number, or > a string by itself. > > In my opinion, we should accept an atomic value as valid JSON content. > I suppose we could get away with calling it a "content" fragment as > is done with XML without a doctype. > > Accepting atomic values as valid JSON would be more orthagonal, as it > would be possible to have a function like this: > > json_values(object_or_array JSON) RETURNS SETOF JSON > -- extracts values from an object or members from an array, returning > them as JSON fragments. > For these reasons, and the fact that my project uses atomic values ;), I think yes, we should support them. IIUC, the reason for requiring an array or object is that the O part of JSON means "some sort of a collection of atomic values". But, in ECMAScript (JavaScript), instances of strings, numbers, bools and null are, indeed, objects. IOW, I think JSON is using a faulty definition of "object" in the spec. It's the one part of the spec that doesn't make sense to me at all. > Also, should we go even further and accept key:value pairs by themselves? : > > '"key":"value"'::JSON > This, though, is probably a step too far. It violates the JS part of JSON ... > I don't think we should because doing so would be rather zany. It > would mean JSON content could be invalid in value context, as in: > > // JavaScript > var content = "key" : "value"; > Right. Thanks, Joseph. I think this will be a great addition! -- Mike Rylander| VP, Research and Design| Equinox Software, Inc. / The Evergreen Experts| phone: 1-877-OPEN-ILS (673-6457)|email: miker@esilibrary.com| web: http://www.esilibrary.com
Joseph Adams wrote: > http://constellationmedia.com/~funsite/static/json-0.0.2.tar.bz2 > > My json.c is now 1161 lines long, so I can't quite call it "small" anymore. Just noticed you don't check the return value of malloc and friends. How do you intend to handle that? There are various places that would simply dump core with the 0.0.2 code. Within Postgres it's easy -- a failed palloc aborts the transaction and doesn't continue running your code. But in a standalone library that's probably not acceptable. If we were to import this there are some lines that could be ripped out, like 60 lines in the string buffer stuff and 130 lines for Unicode. That brings your code just under 1000 lines. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
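For what it's worth, one common pattern for a standalone library is to funnel every allocation through a single checked wrapper that records failure in a sticky flag the parser tests while unwinding; a hypothetical sketch (names are illustrative, not from the library):

    #include <stdlib.h>

    /* Sticky out-of-memory flag: the parser checks it at safe points
     * and unwinds cleanly, freeing what it built so far, instead of
     * dereferencing a NULL from an unchecked malloc. Inside Postgres
     * this layer disappears entirely, since palloc never returns NULL
     * and reports failure through the error machinery. */
    static int json_oom = 0;

    static void *xmalloc(size_t sz)
    {
        void *p = malloc(sz);

        if (p == NULL)
            json_oom = 1;
        return p;
    }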
Petr Jelinek <pjmodos@pjmodos.net> writes: > Dne 6.4.2010 7:57, Joseph Adams napsal(a): >> To me, the most logical approach is to do the obvious thing: make >> JSON's 'null' be SQL's NULL. For instance, SELECTing on a table with >> NULLs in it and converting the result set to JSON would yield a >> structure with 'null's in it. 'null'::JSON would yield NULL. I'm not >> sure what startling results would come of this approach, but I'm >> guessing this would be most intuitive and useful. > +1 I think it's a pretty bad idea for 'null'::JSON to yield NULL. AFAIR there is no other standard datatype for which the input converter can yield NULL from a non-null input string, and I'm not even sure that the InputFunctionCall protocol allows it. (In fact a quick look indicates that it doesn't...) To me, what this throws into question is not so much whether JSON null should equate to SQL NULL (it should), but whether it's sane to accept atomic values. If I understood the beginning of this discussion, that's not strictly legal. I think it would be better for strict input mode to reject this, and permissive mode to convert it to a non-atomic value. Thus jsonify('null') wouldn't yield NULL but a structure containing a null. regards, tom lane
On Tue, Apr 6, 2010 at 11:05 AM, Alvaro Herrera <alvherre@commandprompt.com> wrote: > Joseph Adams wrote: > >> http://constellationmedia.com/~funsite/static/json-0.0.2.tar.bz2 >> >> My json.c is now 1161 lines long, so I can't quite call it "small" anymore. > > Just noticed you don't check the return value of malloc and friends. > How do you intend to handle that? There are various places that would > simply dump core with the 0.0.2 code. Within Postgres it's easy -- a > failed palloc aborts the transaction and doesn't continue running your > code. But in a standalone library that's probably not acceptable. > > If we were to import this there are some lines that could be ripped out, > like 60 lines in the string buffer stuff and 130 lines for Unicode. > That brings your code just under 1000 lines. Let me be the first to suggest putting this code under the PostgreSQL license. ...Robert
On Tue, Apr 6, 2010 at 12:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Petr Jelinek <pjmodos@pjmodos.net> writes: >> On 6.4.2010 7:57, Joseph Adams wrote: >>> To me, the most logical approach is to do the obvious thing: make >>> JSON's 'null' be SQL's NULL. For instance, SELECTing on a table with >>> NULLs in it and converting the result set to JSON would yield a >>> structure with 'null's in it. 'null'::JSON would yield NULL. I'm not >>> sure what startling results would come of this approach, but I'm >>> guessing this would be most intuitive and useful. > >> +1 > > I think it's a pretty bad idea for 'null'::JSON to yield NULL. AFAIR > there is no other standard datatype for which the input converter can > yield NULL from a non-null input string, and I'm not even sure that the > InputFunctionCall protocol allows it. (In fact a quick look indicates > that it doesn't...) Oh. I missed this aspect of the proposal. I agree - that's a bad idea. > To me, what this throws into question is not so much whether JSON null > should equate to SQL NULL (it should), but whether it's sane to accept > atomic values. With this, I disagree. I see no reason to suppose that a JSON NULL and an SQL NULL are the same thing. > If I understood the beginning of this discussion, that's > not strictly legal. I think it would be better for strict input mode > to reject this, and permissive mode to convert it to a non-atomic value. > Thus jsonify('null') wouldn't yield NULL but a structure containing a > null. There's no obvious "structure" to convert this into. ...Robert
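To illustrate why there is no obvious structure, consider the choice a permissive mode would face (a sketch using the hypothetical jsonify() from Tom's message; neither wrapping is suggested by the input itself):

    -- permissive mode would have to invent a container, arbitrarily:
    SELECT jsonify('null');  -- '[null]'?  '{"value": null}'?
    -- either choice loses the fact that the input was a bare atomic value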
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Apr 6, 2010 at 12:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> To me, what this throws into question is not so much whether JSON null >> should equate to SQL NULL (it should), but whether it's sane to accept >> atomic values. > With this, I disagree. I see no reason to suppose that a JSON NULL > and an SQL NULL are the same thing. Oh. If they're not the same, then the problem is easily dodged, but then what *is* a JSON null? regards, tom lane
On Tue, Apr 6, 2010 at 1:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Tue, Apr 6, 2010 at 12:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> To me, what this throws into question is not so much whether JSON null >>> should equate to SQL NULL (it should), but whether it's sane to accept >>> atomic values. > >> With this, I disagree. I see no reason to suppose that a JSON NULL >> and an SQL NULL are the same thing. > > Oh. If they're not the same, then the problem is easily dodged, but > then what *is* a JSON null? I assume we're going to treat JSON much like XML: basically text, but with some validation (and perhaps canonicalization) under the hood. So a JSON null will be "null", just as a JSON boolean true value will be "true". It would be pretty weird if storing "true" or "false" or "4" or "[3,1,4,1,5,9]" into a json column and then reading it back returned the input string, but at the same time storing "null" into the column returned a SQL NULL. ...Robert
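A sketch of the consistent round-trip behavior described above, assuming a json column (hypothetical, since the type is still only proposed):

    CREATE TABLE t (j JSON);
    INSERT INTO t VALUES ('true'), ('4'), ('[3,1,4,1,5,9]'), ('null');

    SELECT j FROM t;
    -- each row would come back as the JSON text that was stored,
    -- including the literal 'null', rather than an SQL NULL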
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Apr 6, 2010 at 1:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Oh. If they're not the same, then the problem is easily dodged, but >> then what *is* a JSON null? > I assume we're going to treat JSON much like XML: basically text, but > with some validation (and perhaps canonicalization) under the hood. > So a JSON null will be "null", just as a JSON boolean true value will be > "true". It would be pretty weird if storing "true" or "false" or "4" > or "[3,1,4,1,5,9]" into a json column and then reading it back > returned the input string, but at the same time storing "null" into > the column returned a SQL NULL. Hmm. So the idea is that all JSON atomic values are considered to be text strings, even when they look like something else (like bools or numbers)? That would simplify matters I guess, but I'm not sure about the usability. In particular I'd want to have something that dequotes the value so that I can get foo not "foo" when converting to SQL text. (I'm assuming that quotes would be there normally, so as not to lose the distinction between 3 and "3" in the JSON representation.) regards, tom lane
On Tue, Apr 6, 2010 at 2:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Tue, Apr 6, 2010 at 1:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Oh. If they're not the same, then the problem is easily dodged, but >>> then what *is* a JSON null? > >> I assume we're going to treat JSON much like XML: basically text, but >> with some validation (and perhaps canonicalization) under the hood. >> So a JSON null will be "null", just as a JSON boolean true value will be >> "true". It would be pretty weird if storing "true" or "false" or "4" >> or "[3,1,4,1,5,9]" into a json column and then reading it back >> returned the input string, but at the same time storing "null" into >> the column returned a SQL NULL. > > Hmm. So the idea is that all JSON atomic values are considered to be > text strings, even when they look like something else (like bools or > numbers)? That would simplify matters I guess, but I'm not sure about > the usability. I'm not sure what the other option is. If you do SELECT col FROM table, I'm not aware that you can return differently-typed values for different rows... > In particular I'd want to have something that dequotes > the value so that I can get foo not "foo" when converting to SQL text. > (I'm assuming that quotes would be there normally, so as not to lose > the distinction between 3 and "3" in the JSON representation.) Yes, that seems like a useful support function. ...Robert
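A sketch of how the 3-versus-"3" distinction and the dequoting function might interact; json_dequote is a purely illustrative name, since nothing like it exists yet:

    SELECT '3'::JSON;                    -- the JSON number 3
    SELECT '"3"'::JSON;                  -- the JSON string "3", quotes kept
    SELECT json_dequote('"foo"'::JSON);  -- would yield the SQL text 'foo', not '"foo"'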
Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > >> With this, I disagree. I see no reason to suppose that a JSON NULL >> and an SQL NULL are the same thing. >> > > Oh. If they're not the same, then the problem is easily dodged, but > then what *is* a JSON null? > Probably the same as the JavaScript null. regards, Yeb Havinga
On Tue, Apr 6, 2010 at 12:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Petr Jelinek <pjmodos@pjmodos.net> writes: >> On 6.4.2010 7:57, Joseph Adams wrote: >>> To me, the most logical approach is to do the obvious thing: make >>> JSON's 'null' be SQL's NULL. For instance, SELECTing on a table with >>> NULLs in it and converting the result set to JSON would yield a >>> structure with 'null's in it. 'null'::JSON would yield NULL. I'm not >>> sure what startling results would come of this approach, but I'm >>> guessing this would be most intuitive and useful. > >> +1 > > I think it's a pretty bad idea for 'null'::JSON to yield NULL. AFAIR > there is no other standard datatype for which the input converter can > yield NULL from a non-null input string, and I'm not even sure that the > InputFunctionCall protocol allows it. (In fact a quick look indicates > that it doesn't...) > > To me, what this throws into question is not so much whether JSON null > should equate to SQL NULL (it should), but whether it's sane to accept > atomic values. If I understood the beginning of this discussion, that's > not strictly legal. I think it would be better for strict input mode > to reject this, and permissive mode to convert it to a non-atomic value. > Thus jsonify('null') wouldn't yield NULL but a structure containing a > null. > > regards, tom lane > Actually, I kind of made a zany mistake here. If 'null'::JSON yielded NULL, that would mean some type of automatic conversion was going on. Likewise, '3.14159'::JSON shouldn't magically turn into a FLOAT. I think the JSON datatype should behave more like TEXT. 'null'::JSON would yield a JSON fragment containing 'null'. 'null'::JSON::TEXT would yield the literal text 'null'. However, '3.14159'::JSON::FLOAT should probably not be allowed as a precaution, as '"hello"'::JSON::TEXT would yield '"hello"', not 'hello'. In other words, casting to the target type directly isn't the same as parsing JSON and extracting a value. Perhaps there could be conversion functions. E.g.: json_to_string('"hello"') yields 'hello' json_to_number('3.14159') yields '3.14159' as text (it is up to the user to cast it to the number type s/he wants) json_to_bool('true') yields TRUE json_to_null('null') yields NULL, json_to_null('nonsense') fails string_to_json('hello') yields '"hello"' as JSON number_to_json(3.14159) yields '3.14159' as JSON bool_to_json(TRUE) yields 'true' as JSON null_to_json(NULL) yields 'null' as JSON (kinda useless) I wonder if these could all be reduced to two generic functions, like json_to_value and value_to_json.
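Collected as assumed signatures (the names come from the list above; the exact argument and result types are guesses, not a settled API):

    -- json_to_string(JSON)    RETURNS text
    -- json_to_number(JSON)    RETURNS text      (caller casts onward)
    -- json_to_bool(JSON)      RETURNS boolean
    -- string_to_json(text)    RETURNS JSON
    -- number_to_json(numeric) RETURNS JSON
    -- bool_to_json(boolean)   RETURNS JSON

    SELECT json_to_string('"hello"'::JSON);  -- 'hello'
    SELECT string_to_json('hello');          -- '"hello"'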
Joseph Adams <joeyadams3.14159@gmail.com> writes: > Perhaps there could be conversion functions. E.g.: Yeah, that's what I was thinking about. > json_to_string('"hello"') yields 'hello' > json_to_number('3.14159') yields '3.14159' as text > (it is up to the user to cast it to the number type s/he wants) > json_to_bool('true') yields TRUE > json_to_null('null') yields NULL, json_to_null('nonsense') fails > string_to_json('hello') yields '"hello"' as JSON > number_to_json(3.14159) yields '3.14159' as JSON > bool_to_json(TRUE) yields 'true' as JSON > null_to_json(NULL) yields 'null' as JSON (kinda useless) The null cases seem a bit useless. What might be helpful is to translate JSON 'null' to and from SQL NULL in each of the other conversions, in addition to their primary capability. I'd go with using NUMERIC as the source/result type for the numeric conversions. Forcing people to insert explicit coercions from text isn't going to be particularly convenient to use. > I wonder if these could all be reduced to two generic functions, like > json_to_value and value_to_json. value_to_json(any) might work, but the other way could not; and it seems better to keep some symmetry between the directions. regards, tom lane
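Folding Tom's two adjustments into the same sketch (the signatures remain assumptions, not a settled API):

    -- NUMERIC as the source/result type for the numeric direction:
    --   json_to_number(JSON)    RETURNS numeric
    --   number_to_json(numeric) RETURNS JSON

    -- each conversion would also map JSON 'null' to and from SQL NULL:
    SELECT json_to_number('null'::JSON);  -- NULL
    SELECT number_to_json(NULL);          -- 'null'::JSON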