Re: patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

Поиск
Список
Период
Сортировка
От Terry Laurenzo
Тема Re: patch: Add JSON datatype to PostgreSQL (GSoC, WIP)
Дата
Msg-id AANLkTinCTALsqJTDkQprBHzYztD5k+KaFnOq4pYXf9W_@mail.gmail.com
обсуждение исходный текст
Ответ на Re: patch: Add JSON datatype to PostgreSQL (GSoC, WIP)  (Robert Haas <robertmhaas@gmail.com>)
Ответы Re: patch: Add JSON datatype to PostgreSQL (GSoC, WIP)  (Robert Haas <robertmhaas@gmail.com>)
Список pgsql-hackers
It doesn't do particularly well on my previous example of [1,2,3].  It
comes out slightly shorter on ["a","b","c"] and better if the strings
need any escaping.  I don't think the float4 and float8 formats are
very useful; how could you be sure that the output was going to look
the same as the input?  Or alternatively that dumping a particular
object to text and reloading it will produce the same internal
representation?  I think it would be simpler to represent integers
using a string of digits; that way you can be assured of going from
text -> binary -> text without change.

Perhaps it would be enough to define the high two bits as follows: 00
= array, 01 = object, 10 = string, 11 = number/true/false/null.  The
next 2 bits specify how the length is stored.  00 = remaining 4 bits
store a length of up to 15 bytes, 01 = remaining 4 bits + 1 additional
byte store a 12-bit length of up to 4K, 10 = remaining 4 bits + 2
additional bytes store a 20-bit length of up to 1MB, 11 = 4 additional
bytes store a full 32-bit length word.  Then, the array, object, and
string representations can work as you've specified them.  Anything
else can be represented by itself, or perhaps we should say that
numbers represent themselves and true/false/null are represented by a
1-byte sequence, t/f/n (or perhaps we could define 111100{00,01,10} to
mean those values, since there's no obvious reason for the low bits to
be non-zero if a 4-bit length word ostensibly follows).

So [1,2,3] = 06 C1 '1' C1 '2' C1 '3' and ["a","b","c"] = 06 81 'a' 81 'b' 81 'c'

(I am still worried about the serialization/deserialization overhead
but that's a different issue.)


Thanks.  I'll play around with the bit and numeric encodings you've recommended.  Arrays are certainly the toughest to economize on as a text encoding has minimum of 2 + n - 1 overhead bytes.  Text encoding for objects has some more wiggle room with 2 + (n-1) + (n*3) extra bytes.  I was admittedly thinking about more complicated objects.

I'm still worried about transcoding overhead as well.  If comparing to a simple blind storage of JSON text with no validation or normalization, there is obviously no way to beat a straight copy.  However, its not outside the realm of reason to think that it may be possible to match or beat the clock if comparing against text to text normalization, or perhaps adding slight overhead to a validator.  The advantage to the binary structure is the ability to scan it hierarchically in sibling order and provide mutation operations with simple memcpy's as opposed to parse_to_ast -> modify_ast -> serialize_ast.

Terry

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Robert Haas
Дата:
Сообщение: Re: WIP: extensible enums
Следующее
От: Greg Stark
Дата:
Сообщение: Re: ask for review of MERGE