Re: patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

Поиск
Список
Период
Сортировка
От Terry Laurenzo
Тема Re: patch: Add JSON datatype to PostgreSQL (GSoC, WIP)
Дата
Msg-id AANLkTikL3dfjHsw1vpgHW7Cq=e6EUCDfoQXGYzUhpF+c@mail.gmail.com
обсуждение исходный текст
Ответ на Re: patch: Add JSON datatype to PostgreSQL (GSoC, WIP)  (Terry Laurenzo <tj@laurenzo.org>)
Ответы Re: patch: Add JSON datatype to PostgreSQL (GSoC, WIP)  (Robert Haas <robertmhaas@gmail.com>)
Re: patch: Add JSON datatype to PostgreSQL (GSoC, WIP)  ("David E. Wheeler" <david@kineticode.com>)
Re: patch: Add JSON datatype to PostgreSQL (GSoC, WIP)  (Andrew Dunstan <andrew@dunslane.net>)
Список pgsql-hackers
I've got a new stripped down version of the binary json plugin on github: https://github.com/tlaurenzo/pgjson

With all due warning of contrived benchmarks, I wrote some tests to see where things stand.   The test script is here: 
The results from my laptop (First gen Macbook Pro 32bit with Snow Leopard and Postgresql 9.0) are here and copied into this email at the end:

And for some commentary...
I copied the 5 sample documents from json.org's example section for these tests.  These are loaded into a table with a varchar column 1000 times each (so the test table has 5000 rows in it).  In all situations, the binary encoding was smaller than the normalized text form (between 9 and 23% smaller).  I think there are cases where the binary form will be larger than the corresponding text form, but I don't think they would be very common.

For the timings, various operations are performed on the 5000 row test table for 100 iterations.  In all situations, the query returns the Length of the transformed document instead of the document itself so as to factor out variable client IO between the test runs.  The Null Parse is essentially just a select from the table and therefore represents the baseline.  The times varied a little bit between runs but did not change materially.

What we see from this is that parsing JSON text and generating a binary representation is cheap, representing approximately 10% of the base case time.  Conversely, anything that involves generating JSON text is expensive, accounting for 30-40% of the base case time.  Some incidental profiling shows that while the entire operation is expensive, the process of generating string literals dominates this time.  There is likely room for optimization in this method, but it should be noted that most of these documents are lightly escaped (if escaped at all) which represents the happy path through the string literal output function.

While I have not profiled any advanced manipulation of the binary structure within the server, it stands to reason that manipulating the binary structure should be significantly faster than an approach that requires (perhaps multiple) transcoding between text representations in order to complete a sequence of operations.

Assuming that the JSON datatype (at a minimum) normalizes text for storage, then the text storage option accounts for about the most expensive path but with none of the benefits of an internal binary form (smaller size, ability to cheaply perform non-trivial manipulation within the database server).

Of course, just having a JSON datatype that blindly stores text will beat everything, but I'm getting closer to thinking that the binary option is worth the tradeoff.

Comments?
Terry

Running benchmark with 100 iterations per step
Loading data...
Data loaded.
=== DOCUMENT SIZE STATISTICS ===
    Document Name     | Original Size | Binary Size | Normalized Size | Percentage Savings 
----------------------+---------------+-------------+-----------------+--------------------
 jsonorg_sample1.json |           582 |         311 |             360 |   13.6111111111111
 jsonorg_sample2.json |           241 |         146 |             183 |   20.2185792349727
 jsonorg_sample3.json |           601 |         326 |             389 |   16.1953727506427
 jsonorg_sample4.json |          3467 |        2466 |            2710 |   9.00369003690037
 jsonorg_sample5.json |           872 |         469 |             613 |   23.4910277324633
(5 rows)

=== TEST PARSE AND SERIALIZATION ===
Null Parse:
       12.12 real         2.51 user         0.13 sys
Parse to Binary:
       13.38 real         2.51 user         0.14 sys
Serialize from Binary:
       16.65 real         2.51 user         0.13 sys
Normalize Text:
       18.99 real         2.51 user         0.13 sys
Roundtrip (parse to binary and serialize):
       18.58 real         2.51 user         0.14 sys

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Vaibhav Kaushal
Дата:
Сообщение: Re: Fwd: What do these terms mean in the SOURCE CODE?
Следующее
От: Pavel Stehule
Дата:
Сообщение: Re: Fwd: patch: format function - fixed oid