Обсуждение: WIP Patch: Add a function that returns binary JSONB as a bytea

Поиск
Список
Период
Сортировка

WIP Patch: Add a function that returns binary JSONB as a bytea

От
Kevin Van
Дата:
This patch adds a new function that allows callers to receive binary jsonb. 
This change was proposed in the discussion here:
and the primary motivation is to reduce database load by skipping jsonb to string conversion (with the expectation that the application can parse binary jsonb).

This patch is still a WIP and I am looking for any guidance on the approach. If it is valid, I'd also appreciate any guidance on what kind of test coverage would be appropriate for a change like this.

Note: I prepended the version number "1" to the byte array -- this is similar to what jsonb_recv does in the same file. 

I have been testing on a x86-64 processor running MacOSX and do not know what the behavior is on other platforms. 

This patch is off of master and compiles successfully. The following is some example output using psql:

database=# SELECT jsonb_raw_bytes('{"b":12345}'::jsonb);

                     jsonb_raw_bytes                      

----------------------------------------------------------

 \x0101000020010000800d0000106200000028000000018001002909

(1 row)


database=# SELECT jsonb_raw_bytes('{"a":{"b":12345}}'::jsonb->'a');

                     jsonb_raw_bytes                      

----------------------------------------------------------

 \x0101000020010000800d0000106200000028000000018001002909

(1 row)



Some preliminary testing on my own machine shows me that this change has a significant impact on performance.


I used psql to select a jsonb column from all the rows in a table (about 1 million rows) where the json data was roughly 400-500 bytes per record.


database=# \timing

Timing is on.

database=# \o /tmp/raw_bytes_out.txt;

database=# SELECT jsonb_raw_bytes(data) FROM datatable;

Time: 2582.545 ms (00:02.583)

database=# \o /tmp/json_out.txt;

database=# SELECT data FROM datatable;

Time: 5653.235 ms (00:05.653)


Of note is that the size of raw_bytes_out.txt in the example is roughly twice that of json_out.txt so the timing difference is not due to less data being transferred.


Вложения

Re: WIP Patch: Add a function that returns binary JSONB as a bytea

От
Tom Lane
Дата:
Kevin Van <kevinvan@shift.com> writes:
> This patch adds a new function that allows callers to receive binary jsonb.
> This change was proposed in the discussion here:
> https://www.postgresql.org/message-id/CAOsiKEL7%2BKkV0C_hAJWxqwTg%2BPYVfiGPQ0yjFww7ECtqwBjb%2BQ%40mail.gmail.com
> and the primary motivation is to reduce database load by skipping jsonb to
> string conversion (with the expectation that the application can parse
> binary jsonb).

I dunno, I do not think it's a great idea to expose jsonb's internal
format to the world.  We intentionally did not do that when the type
was first defined --- that's why its binary I/O format isn't already
like this --- and I don't see that the tradeoffs have changed since then.
The format is complex, and specific to Postgres' needs, and we might wish
to change it in future.

            regards, tom lane


Re: WIP Patch: Add a function that returns binary JSONB as a bytea

От
Stephen Frost
Дата:
Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Kevin Van <kevinvan@shift.com> writes:
> > This patch adds a new function that allows callers to receive binary jsonb.
> > This change was proposed in the discussion here:
> > https://www.postgresql.org/message-id/CAOsiKEL7%2BKkV0C_hAJWxqwTg%2BPYVfiGPQ0yjFww7ECtqwBjb%2BQ%40mail.gmail.com
> > and the primary motivation is to reduce database load by skipping jsonb to
> > string conversion (with the expectation that the application can parse
> > binary jsonb).
>
> I dunno, I do not think it's a great idea to expose jsonb's internal
> format to the world.  We intentionally did not do that when the type
> was first defined --- that's why its binary I/O format isn't already
> like this --- and I don't see that the tradeoffs have changed since then.
> The format is complex, and specific to Postgres' needs, and we might wish
> to change it in future.

I disagree- it's awfully expensive to go back and forth between string
and a proper representation.  If we did decide to change the on-disk
format, we'd likely be able to translate that format into the
wire-protocol format and that'd still be much better than going to
strings.

As I recall, we did end up making changes to jsonb right before we
released it and so, at the time, it was definitely a good thing that we
hadn't exposed that format, but we're quite a few years in now (by the
time 12 comes out, every supported release will have jsonb) and the
format hasn't changed further.

If we really did want to go to a new format in the future, we'd probably
end up wanting to do so in a way which allowed us to avoid a full
dump/restore of the database too, so we'd need code to be able to at
least convert from the old format into the new format and it'd hopefully
not be too bad to have code to go the other way too.  Even if we didn't
though, I expect library authors would deal with the update across a
major version change- it's not like we don't do other things that
require them to update (SCRAM being my current favorite) and they've
been pretty good about getting things updated.

As always, there'll be the fallback option of going back to text format
too.

Thanks!

Stephen

Вложения

Re: WIP Patch: Add a function that returns binary JSONB as a bytea

От
Tom Lane
Дата:
Stephen Frost <sfrost@snowman.net> writes:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> I dunno, I do not think it's a great idea to expose jsonb's internal
>> format to the world.  We intentionally did not do that when the type
>> was first defined --- that's why its binary I/O format isn't already
>> like this --- and I don't see that the tradeoffs have changed since then.

> I disagree- it's awfully expensive to go back and forth between string
> and a proper representation.

Has anyone put any effort into making jsonb_out() faster?  I think that
that would be way more productive.  Nobody is going to want to write
code to convert jsonb's internal form into whatever their application
uses; particularly not dealing with numeric fields.

In any case, the approach proposed in this patch seems like the worst
of all possible worlds: it's inconsistent and we get zero benefit from
having thrown away our information-hiding.  If we're going to expose the
internal format, let's just change the definition of the type's binary
I/O format, thereby getting a win for purposes like COPY BINARY as well.
We'd need to ensure that jsonb_recv could tell whether it was seeing the
old or new format, but at worst that'd require prepending a header of
some sort.  (In practice, I suspect we'd end up with a wire-format
definition that isn't exactly the bits-on-disk, but something easily
convertible to/from that and more easily verifiable by jsonb_recv.
Numeric subfields, for instance, ought to match the numeric wire
format, which IIRC isn't exactly the bits-on-disk either.)

            regards, tom lane


Re: WIP Patch: Add a function that returns binary JSONB as a bytea

От
Andrew Dunstan
Дата:

On 10/31/2018 10:21 AM, Tom Lane wrote:
> Stephen Frost <sfrost@snowman.net> writes:
>> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>>> I dunno, I do not think it's a great idea to expose jsonb's internal
>>> format to the world.  We intentionally did not do that when the type
>>> was first defined --- that's why its binary I/O format isn't already
>>> like this --- and I don't see that the tradeoffs have changed since then.
>> I disagree- it's awfully expensive to go back and forth between string
>> and a proper representation.
> Has anyone put any effort into making jsonb_out() faster?  I think that
> that would be way more productive.  Nobody is going to want to write
> code to convert jsonb's internal form into whatever their application
> uses; particularly not dealing with numeric fields.
>
> In any case, the approach proposed in this patch seems like the worst
> of all possible worlds: it's inconsistent and we get zero benefit from
> having thrown away our information-hiding.  If we're going to expose the
> internal format, let's just change the definition of the type's binary
> I/O format, thereby getting a win for purposes like COPY BINARY as well.
> We'd need to ensure that jsonb_recv could tell whether it was seeing the
> old or new format, but at worst that'd require prepending a header of
> some sort.  (In practice, I suspect we'd end up with a wire-format
> definition that isn't exactly the bits-on-disk, but something easily
> convertible to/from that and more easily verifiable by jsonb_recv.
> Numeric subfields, for instance, ought to match the numeric wire
> format, which IIRC isn't exactly the bits-on-disk either.)
>
>     


jsonb_send() sends a version number byte, currently 1. So if we invent a 
new version we would send 2 and teach jsonb_recv to be able to handle 
both. This was kinda anticipated.

I agree that just sending a blob of the internal format isn't a great idea.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP Patch: Add a function that returns binary JSONB as a bytea

От
Stephen Frost
Дата:
Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> >> I dunno, I do not think it's a great idea to expose jsonb's internal
> >> format to the world.  We intentionally did not do that when the type
> >> was first defined --- that's why its binary I/O format isn't already
> >> like this --- and I don't see that the tradeoffs have changed since then.
>
> > I disagree- it's awfully expensive to go back and forth between string
> > and a proper representation.
>
> Has anyone put any effort into making jsonb_out() faster?  I think that
> that would be way more productive.  Nobody is going to want to write
> code to convert jsonb's internal form into whatever their application
> uses; particularly not dealing with numeric fields.

I'm all for making jsonb_out() faster, but even a faster jsonb_out()
isn't going to be faster than shoveling the jsonb across.

The concern over the application question seems like a complete red
herring to me- people will bake into the various libraries the ability
to convert from our jsonb format to the language's preferred json data
structure, and even if that doesn't happen, we clearly have someone here
who is saying they'd write code to convert from our jsonb format to
whatever their application or language uses, and we've heard that
multiple times on this list.

> In any case, the approach proposed in this patch seems like the worst
> of all possible worlds: it's inconsistent and we get zero benefit from
> having thrown away our information-hiding.  If we're going to expose the
> internal format, let's just change the definition of the type's binary
> I/O format, thereby getting a win for purposes like COPY BINARY as well.

I hadn't looked at the patch, so I'm not surprised it has issues.  I
recall there was some prior discussion about what a good approach was to
implementing this and that changing the binary i/o format was the way to
go, but certainly a review of the threads should be done by whomever
wants to implement this or review the patch.

> We'd need to ensure that jsonb_recv could tell whether it was seeing the
> old or new format, but at worst that'd require prepending a header of
> some sort.  (In practice, I suspect we'd end up with a wire-format
> definition that isn't exactly the bits-on-disk, but something easily
> convertible to/from that and more easily verifiable by jsonb_recv.
> Numeric subfields, for instance, ought to match the numeric wire
> format, which IIRC isn't exactly the bits-on-disk either.)

Agreed, that'd certainly be a good idea.

Thanks!

Stephen

Вложения

Re: WIP Patch: Add a function that returns binary JSONB as a bytea

От
Andres Freund
Дата:
Hi,

On 2018-10-31 11:13:13 -0400, Andrew Dunstan wrote:
> I agree that just sending a blob of the internal format isn't a great idea.

It's entirely unacceptable afaict. Besides the whole "exposing
internals" issue, it's also at least not endianess safe, depends on the
local alignment requirements (which differ both between platforms and
32/64 bit), numeric's internal encoding and probably more.

Greetings,

Andres Freund


Re: WIP Patch: Add a function that returns binary JSONB as a bytea

От
Christian Ohler
Дата:
On Wed, Oct 31, 2018 at 7:22 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
If we're going to expose the
internal format, let's just change the definition of the type's binary
I/O format, thereby getting a win for purposes like COPY BINARY as well.
We'd need to ensure that jsonb_recv could tell whether it was seeing the
old or new format, but at worst that'd require prepending a header of
some sort.  (In practice, I suspect we'd end up with a wire-format
definition that isn't exactly the bits-on-disk, but something easily
convertible to/from that and more easily verifiable by jsonb_recv.
Numeric subfields, for instance, ought to match the numeric wire
format, which IIRC isn't exactly the bits-on-disk either.)

How would this work from the driver's and application's perspective?  What does the driver do when receiving JSONB data?

Applications currently receive JSON strings when reading JSONB data, and the driver presumably has to stay compatible with that.  Does that mean the driver transparently converts JSONB to JSON before handing it to the application?  That scales better than doing it in Postgres itself, but is still the kind of inefficiency we're trying to avoid.

We want to convert JSONB directly to language-native objects.  This shouldn't be the responsibility of the Postgres driver, since the conversion is complex and can be done in different ways, such as instantiating objects from a class hierarchy vs instantiating generic containers, or eager vs lazy conversion.  Applications that are sensitive to JSONB performance likely want control over these aspects.  Postgres drivers aren't coupled to specific JSON parsers; they shouldn't be coupled to specific JSONB parsers either.

So, AFAICS, when the application requests JSONB data, the driver has to hand it the raw JSONB bytes.  But that's incompatible with what currently happens.  To preserve compatibility, does the application have to opt in by setting some out-of-band per-query per-result-column flags to tell the driver how it wants the JSONB data returned?  That's workable in principle but bloats every driver's API with some rarely-used performance feature.  Seems much simpler to put this into the query.

The idea behind the proposal is to improve efficiency by avoiding conversions, and the most straightforward way to do that is for every layer to pass through the raw bytes.  With an explicit conversion to BYTEA in the query, this is automatic without any changes to drivers, since every layer already knows to leave BYTEAs untouched.

I don't have an argument against _also_ adding a binary format version 2 for JSONB once we define a portable JSONB format; but I am not sure it alone solves the problem we have.

Re: WIP Patch: Add a function that returns binary JSONB as a bytea

От
Tom Lane
Дата:
Christian Ohler <ohler@shift.com> writes:
> On Wed, Oct 31, 2018 at 7:22 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> If we're going to expose the
>> internal format, let's just change the definition of the type's binary
>> I/O format, thereby getting a win for purposes like COPY BINARY as well.

> How would this work from the driver's and application's perspective?  What
> does the driver do when receiving JSONB data?

Well, upthread it was posited that applications that read binary JSONB
data would be willing to track changes in that format (or else have no
need to, because they don't do anything with it except feed it back to the
server).  If that isn't the case, then this entire thread is a waste of
time.  I certainly don't buy that exposing the internal format via some
other mechanism than binary I/O would be a sufficient excuse for not
worrying about cross-version compatibility.

> The idea behind the proposal is to improve efficiency by avoiding
> conversions, and the most straightforward way to do that is for every layer
> to pass through the raw bytes.

This argument is, frankly, as bogus as it could possibly be.  In the first
place, you're essentially saying that ignoring version compatibility
considerations entirely is the way to avoid future version compatibility
problems.  I don't buy it.  In the second place, you already admitted
that format conversion *is* necessary; what PG finds best internally is
unlikely to be exactly what some other piece of software will want.
So we'd be better off agreeing on some common interchange format.

I'm still bemused by the proposition that that common interchange format
shouldn't be, um, JSON.  We've already seen BSON, BJSON, etc die
well-deserved deaths.  Why would jsonb internal format, which was never
for one second intended to be seen anywhere outside PG, be a better
interchange-format design than those were?

(In short, I remain unconvinced that we'd not be better off spending
our effort on making json_out faster...)

            regards, tom lane


Re: WIP Patch: Add a function that returns binary JSONB as a bytea

От
Stephen Frost
Дата:
Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Christian Ohler <ohler@shift.com> writes:
> > On Wed, Oct 31, 2018 at 7:22 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> If we're going to expose the
> >> internal format, let's just change the definition of the type's binary
> >> I/O format, thereby getting a win for purposes like COPY BINARY as well.
>
> > How would this work from the driver's and application's perspective?  What
> > does the driver do when receiving JSONB data?
>
> Well, upthread it was posited that applications that read binary JSONB
> data would be willing to track changes in that format (or else have no
> need to, because they don't do anything with it except feed it back to the
> server).  If that isn't the case, then this entire thread is a waste of
> time.  I certainly don't buy that exposing the internal format via some
> other mechanism than binary I/O would be a sufficient excuse for not
> worrying about cross-version compatibility.

Yes, I agree that the applications/libraries would need to be expecting
and able to handle changes to the format, though, of course, we'd only
change the format across major versions- we wouldn't do it in a point
release.  There might be some argument for supporting multiple versions
also, though we don't really have support for anything like that today,
unfortunately.

> > The idea behind the proposal is to improve efficiency by avoiding
> > conversions, and the most straightforward way to do that is for every layer
> > to pass through the raw bytes.
>
> This argument is, frankly, as bogus as it could possibly be.  In the first
> place, you're essentially saying that ignoring version compatibility
> considerations entirely is the way to avoid future version compatibility
> problems.  I don't buy it.  In the second place, you already admitted
> that format conversion *is* necessary; what PG finds best internally is
> unlikely to be exactly what some other piece of software will want.
> So we'd be better off agreeing on some common interchange format.

There will definitely need to be *some* kind of conversion happening,
the point of doing this would be to give us similar benefits as we have
when passing binary data with the PG protocol today- sure, the data as
the application gets it back probably isn't exactly in the format the
application would like, but at least it's just moving bytes around
typically (network byte order to host byte order, putting things into
some application-side structure, etc).  Today, with jsonb, we have the
data stored as binary, but we have to go through the
binary->text->binary conversion for every binary value and that's far
from free.

> I'm still bemused by the proposition that that common interchange format
> shouldn't be, um, JSON.  We've already seen BSON, BJSON, etc die
> well-deserved deaths.  Why would jsonb internal format, which was never
> for one second intended to be seen anywhere outside PG, be a better
> interchange-format design than those were?

I'd suggest that it's because our jsonb format doesn't have the
limitations that the others had.  That said, I also agree that we
wouldn't stream out the *exact* format that's on disk, but the point, at
least to me, is to avoid bouncing back and forth between binary and text
representation of things like integers, floats, timestamps, etc, where
we already support binary-format results being sent out of PG to an
application.

Thanks!

Stephen

Вложения

Re: WIP Patch: Add a function that returns binary JSONB as a bytea

От
Daniel Gustafsson
Дата:
> On 2 Nov 2018, at 04:21, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> (In short, I remain unconvinced that we'd not be better off spending
> our effort on making json_out faster...)

Shooting wildly from the hip, isn't this a case where we can potentially
utilize the JIT infrastructure to speed it up?

cheers ./daniel


Re: WIP Patch: Add a function that returns binary JSONB as a bytea

От
Tomas Vondra
Дата:
On 11/02/2018 01:42 PM, Daniel Gustafsson wrote:
>> On 2 Nov 2018, at 04:21, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 
>> (In short, I remain unconvinced that we'd not be better off spending
>> our effort on making json_out faster...)
> 
> Shooting wildly from the hip, isn't this a case where we can
> potentially utilize the JIT infrastructure to speed it up?
> 

I don't see how could it help here. The crucial idea of JIT is that if
you postpone the compilation, you may provide additional information
that may allow eliminating parts of the code.

For example, we don't know anything about table structure in general, so
the tuple deforming code had to assume any attribute may be NULL etc.
making the code rather complex with many branches, loops etc. But at
runtime we have the tuple descriptor, so with JIT we may eliminate some
of the branches.

But we don't have anything like that for jsonb - every value may be an
arbitrary json document, etc. And each value (even if stored in the same
column) may be entirely different.

Maybe it's just lack of imagination on my side, but I don't see how
could this benefit from JIT :-(

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP Patch: Add a function that returns binary JSONB as a bytea

От
Merlin Moncure
Дата:
On Wed, Oct 31, 2018 at 10:23 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2018-10-31 11:13:13 -0400, Andrew Dunstan wrote:
> > I agree that just sending a blob of the internal format isn't a great idea.
>
> It's entirely unacceptable afaict. Besides the whole "exposing
> internals" issue, it's also at least not endianess safe, depends on the
> local alignment requirements (which differ both between platforms and
> 32/64 bit), numeric's internal encoding and probably more.

Binary format consuming applications already have to deal with these
kinds of issues. We already expose internal structures in the other
functions -- not sure why jsonb is held to a different standard.  For
other data types where format changes were made, the standard of
'caveat version' was in place to protect the user.  For jsonb we
decided to implement a version flag within the type itself, which I
thought mistake at the time -- better to have a version header in the
COPY BINARY if needed.

People using binary format are basically interested one one thing,
performance.  It's really the fastest way to get data in an out of the
database.  For my part, I'd like to see some observed demonstrable
benefit in terms of performance to justify making the change.

merlin


Re: WIP Patch: Add a function that returns binary JSONB as a bytea

От
Tom Lane
Дата:
Merlin Moncure <mmoncure@gmail.com> writes:
> On Wed, Oct 31, 2018 at 10:23 AM Andres Freund <andres@anarazel.de> wrote:
>> It's entirely unacceptable afaict. Besides the whole "exposing
>> internals" issue, it's also at least not endianess safe, depends on the
>> local alignment requirements (which differ both between platforms and
>> 32/64 bit), numeric's internal encoding and probably more.

> Binary format consuming applications already have to deal with these
> kinds of issues. We already expose internal structures in the other
> functions -- not sure why jsonb is held to a different standard.

I don't think it's being held to a different standard at all.  Even for
data as simple as integers/floats, we convert to uniform endianness on the
wire.  Moreover, we do not expose the exact bits for anything more complex
than machine floats.  Numeric, for instance, gets disassembled into fields
rather than showing the exact header format (let alone several different
header formats, as actually found on disk).

Andres' point about alignment is a pretty good one as well, if it applies
here --- I don't recall just what internal alignment requirements jsonb
has.  We have not historically expected clients to have to deal with that.

I agree that in any particular use-case one could argue for a different
set of trade-offs in choosing the exposed binary format.  But this is the
set that the project has chosen, and one would need to make a FAR stronger
case than has been made here to let jsonb be an exception.

            regards, tom lane


Re: WIP Patch: Add a function that returns binary JSONB as a bytea

От
Merlin Moncure
Дата:
On Fri, Nov 2, 2018 at 10:53 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
> > On Wed, Oct 31, 2018 at 10:23 AM Andres Freund <andres@anarazel.de> wrote:
> >> It's entirely unacceptable afaict. Besides the whole "exposing
> >> internals" issue, it's also at least not endianess safe, depends on the
> >> local alignment requirements (which differ both between platforms and
> >> 32/64 bit), numeric's internal encoding and probably more.
>
> > Binary format consuming applications already have to deal with these
> > kinds of issues. We already expose internal structures in the other
> > functions -- not sure why jsonb is held to a different standard.
>
> I don't think it's being held to a different standard at all.  Even for
> data as simple as integers/floats, we convert to uniform endianness on the
> wire.  Moreover, we do not expose the exact bits for anything more complex
> than machine floats.  Numeric, for instance, gets disassembled into fields
> rather than showing the exact header format (let alone several different
> header formats, as actually found on disk).
>
> Andres' point about alignment is a pretty good one as well, if it applies
> here --- I don't recall just what internal alignment requirements jsonb
> has.  We have not historically expected clients to have to deal with that.

I see your (and Andres') point; the binary wire format ought to lay on
top of the basic contracts established by other types.  It can be
binary; just not a straight memcpy out of the server.  The array and
composite type serializers should give some inspiration there on
serialization.   I'll still stand other point I made though; I'd
really want to see some benchmarks demonstrating benefit over
competing approaches that work over the current formats.  That should
frame the argument as to whether this is a good idea.

merlin


Re: WIP Patch: Add a function that returns binary JSONB as a bytea

От
Stephen Frost
Дата:
Greetings,

* Merlin Moncure (mmoncure@gmail.com) wrote:
> On Fri, Nov 2, 2018 at 10:53 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Andres' point about alignment is a pretty good one as well, if it applies
> > here --- I don't recall just what internal alignment requirements jsonb
> > has.  We have not historically expected clients to have to deal with that.
>
> I see your (and Andres') point; the binary wire format ought to lay on
> top of the basic contracts established by other types.  It can be
> binary; just not a straight memcpy out of the server.  The array and
> composite type serializers should give some inspiration there on
> serialization.

Right- I agree w/ Tom and Andres on this part also.

> I'll still stand other point I made though; I'd
> really want to see some benchmarks demonstrating benefit over
> competing approaches that work over the current formats.  That should
> frame the argument as to whether this is a good idea.

What are the 'competing approaches' you're alluding to here?  Sending
text-format json across as we do today?  Is there something else you're
thinking would be appropriate in this kind of a performance bake-off...?
I'm having a hard time seeing what else would actually have the
flexibility that JSON does without being clearly worse (xml?).

Thanks!

Stephen

Вложения

Re: WIP Patch: Add a function that returns binary JSONB as a bytea

От
Merlin Moncure
Дата:
On Fri, Nov 2, 2018 at 11:15 AM Stephen Frost <sfrost@snowman.net> wrote:
>
> Greetings,
>
> * Merlin Moncure (mmoncure@gmail.com) wrote:
> > On Fri, Nov 2, 2018 at 10:53 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > Andres' point about alignment is a pretty good one as well, if it applies
> > > here --- I don't recall just what internal alignment requirements jsonb
> > > has.  We have not historically expected clients to have to deal with that.
> >
> > I see your (and Andres') point; the binary wire format ought to lay on
> > top of the basic contracts established by other types.  It can be
> > binary; just not a straight memcpy out of the server.  The array and
> > composite type serializers should give some inspiration there on
> > serialization.
>
> Right- I agree w/ Tom and Andres on this part also.
>
> > I'll still stand other point I made though; I'd
> > really want to see some benchmarks demonstrating benefit over
> > competing approaches that work over the current formats.  That should
> > frame the argument as to whether this is a good idea.
>
> What are the 'competing approaches' you're alluding to here?  Sending
> text-format json across as we do today?

Yep -- exactly.  For example, write a C client program that recursed
the structure and dumped it to stdout or assigned to dummy variables
(being mindful of compiler optimizations).  I'd be contrasting this to
a C parsed json that did essentially the same thing, and rigging a
high scale test on the back of that.  The assumption here is that the
ultimate consumer is not, say, a browser, but some client app that can
actually exploit the performance advantages (else, why bother?).

In my experience with arrays and composites, you can see significant
performance reduction and throughput increase in certain classes of
queries.  However, some of the types that were the worst offenders
(like timestamps) have been subsequently optimized and/or are
irrelevant to json since they'd be passed as test anyways.

merlin


Re: WIP Patch: Add a function that returns binary JSONB as a bytea

От
Stephen Frost
Дата:
Greetings,

* Merlin Moncure (mmoncure@gmail.com) wrote:
> On Fri, Nov 2, 2018 at 11:15 AM Stephen Frost <sfrost@snowman.net> wrote:
> > * Merlin Moncure (mmoncure@gmail.com) wrote:
> > > I'll still stand other point I made though; I'd
> > > really want to see some benchmarks demonstrating benefit over
> > > competing approaches that work over the current formats.  That should
> > > frame the argument as to whether this is a good idea.
> >
> > What are the 'competing approaches' you're alluding to here?  Sending
> > text-format json across as we do today?
>
> Yep -- exactly.  For example, write a C client program that recursed
> the structure and dumped it to stdout or assigned to dummy variables
> (being mindful of compiler optimizations).  I'd be contrasting this to
> a C parsed json that did essentially the same thing, and rigging a
> high scale test on the back of that.  The assumption here is that the
> ultimate consumer is not, say, a browser, but some client app that can
> actually exploit the performance advantages (else, why bother?).

If transferring the data in binary doesn't show a performance
improvement then I could agree that it wouldn't be sensible to do- but I
also find that very unlikely to be the case.

As for what language it's written in- I don't think that matters much.
I'd very much expect it to be more performant to use binary if you're
working in C, of course, but there's no point comparing C-parsed json
into some C structure vs. psycopg2 injesting binary data and building a
Python json object- what matters is if it'd be faster for psycopg2 to
pull in binary-json data and put it into a Python json object, or if
it'd be faster to parse the text-json data and put the result into the
same Python object.  In my view, there's something clearly quite wrong
if the text-json data format is faster at that.

> In my experience with arrays and composites, you can see significant
> performance reduction and throughput increase in certain classes of
> queries.  However, some of the types that were the worst offenders
> (like timestamps) have been subsequently optimized and/or are
> irrelevant to json since they'd be passed as test anyways.

I've had very good success transferring timestamps as binary, so I'm not
quite sure what you're getting at here.

Thanks!

Stephen

Вложения

Re: WIP Patch: Add a function that returns binary JSONB as a bytea

От
Merlin Moncure
Дата:
On Fri, Nov 2, 2018 at 2:34 PM Stephen Frost <sfrost@snowman.net> wrote:
> * Merlin Moncure (mmoncure@gmail.com) wrote:
> As for what language it's written in- I don't think that matters much.
> I'd very much expect it to be more performant to use binary if you're
> working in C, of course, but there's no point comparing C-parsed json
> into some C structure vs. psycopg2 injesting binary data and building a
> Python json object- what matters is if it'd be faster for psycopg2 to
> pull in binary-json data and put it into a Python json object, or if
> it'd be faster to parse the text-json data and put the result into the
> same Python object.  In my view, there's something clearly quite wrong
> if the text-json data format is faster at that.

Yeah, I figure the language would be C or another language with
drivers smart enough to speak the binary protocol.

> > In my experience with arrays and composites, you can see significant
> > performance reduction and throughput increase in certain classes of
> > queries.  However, some of the types that were the worst offenders
> > (like timestamps) have been subsequently optimized and/or are
> > irrelevant to json since they'd be passed as test anyways.
>
> I've had very good success transferring timestamps as binary, so I'm not
> quite sure what you're getting at here.

Er, yes, timestamps are much faster in binary -- that is what I'd
observed and was the point I was trying to make.  They are slightly
less faster with recent optimizations but still faster.  In the old
days I saw as much as +50% throughput binary vs text in certain
contrived situations; that may no longer be true today.

merlin


Re: WIP Patch: Add a function that returns binary JSONB as a bytea

От
Andrew Dunstan
Дата:

On 11/02/2018 11:34 AM, Merlin Moncure wrote:
>
> Binary format consuming applications already have to deal with these
> kinds of issues. We already expose internal structures in the other
> functions -- not sure why jsonb is held to a different standard.  For
> other data types where format changes were made, the standard of
> 'caveat version' was in place to protect the user.  For jsonb we
> decided to implement a version flag within the type itself, which I
> thought mistake at the time -- better to have a version header in the
> COPY BINARY if needed.
>


jsonb_send does output a version header, as I pointed out upthread.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP Patch: Add a function that returns binary JSONB as a bytea

От
Andres Freund
Дата:
On 2018-11-02 10:34:07 -0500, Merlin Moncure wrote:
> On Wed, Oct 31, 2018 at 10:23 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2018-10-31 11:13:13 -0400, Andrew Dunstan wrote:
> > > I agree that just sending a blob of the internal format isn't a great idea.
> >
> > It's entirely unacceptable afaict. Besides the whole "exposing
> > internals" issue, it's also at least not endianess safe, depends on the
> > local alignment requirements (which differ both between platforms and
> > 32/64 bit), numeric's internal encoding and probably more.
> 
> Binary format consuming applications already have to deal with these
> kinds of issues.

Uh, which?  We have some that are affected by configure flags, but
otherwise we should be endianess & alignment independent.

And *even if* that were the case, that's not a good reason to continue a
bad practice.


> For jsonb we decided to implement a version flag within the type
> itself, which I thought mistake at the time -- better to have a
> version header in the COPY BINARY if needed.

What's the problem with the version? How does a COPY BINARY flag
address normal clients?

Greetings,

Andres Freund


Re: WIP Patch: Add a function that returns binary JSONB as a bytea

От
Andres Freund
Дата:
Hi,


On 2018-11-02 11:52:59 -0400, Tom Lane wrote:
> Andres' point about alignment is a pretty good one as well, if it applies
> here --- I don't recall just what internal alignment requirements jsonb
> has.  We have not historically expected clients to have to deal with that.

Certainly looks like it takes it into account:

static void
fillJsonbValue(JsonbContainer *container, int index,
               char *base_addr, uint32 offset,
               JsonbValue *result)
...
    else if (JBE_ISNUMERIC(entry))
    {
        result->type = jbvNumeric;
        result->val.numeric = (Numeric) (base_addr + INTALIGN(offset));
    }
...
    else
    {
        Assert(JBE_ISCONTAINER(entry));
        result->type = jbvBinary;
        /* Remove alignment padding from data pointer and length */
        result->val.binary.data = (JsonbContainer *) (base_addr + INTALIGN(offset));
        result->val.binary.len = getJsonbLength(container, index) -
            (INTALIGN(offset) - offset);
    }
...

Greetings,

Andres Freund


Re: WIP Patch: Add a function that returns binary JSONB as a bytea

От
Andres Freund
Дата:
Hi,

On 2018-11-02 17:02:24 -0400, Andrew Dunstan wrote:
> On 11/02/2018 11:34 AM, Merlin Moncure wrote:
> > 
> > Binary format consuming applications already have to deal with these
> > kinds of issues. We already expose internal structures in the other
> > functions -- not sure why jsonb is held to a different standard.  For
> > other data types where format changes were made, the standard of
> > 'caveat version' was in place to protect the user.  For jsonb we
> > decided to implement a version flag within the type itself, which I
> > thought mistake at the time -- better to have a version header in the
> > COPY BINARY if needed.
> > 
> 
> 
> jsonb_send does output a version header, as I pointed out upthread.

That's Merlin's point I think. For reasons I don't quite understand he
doesn't like that. Yes, a global solution would have been prettier than
per-datum version flag, but that obvioulsy wasn't realistic to introduce
around the feature freeze of the version that introduced jsonb.

Greetings,

Andres Freund


Re: WIP Patch: Add a function that returns binary JSONB as a bytea

От
Andrew Dunstan
Дата:

On 11/02/2018 05:20 PM, Andres Freund wrote:
> Hi,
>
> On 2018-11-02 17:02:24 -0400, Andrew Dunstan wrote:
>> On 11/02/2018 11:34 AM, Merlin Moncure wrote:
>>> Binary format consuming applications already have to deal with these
>>> kinds of issues. We already expose internal structures in the other
>>> functions -- not sure why jsonb is held to a different standard.  For
>>> other data types where format changes were made, the standard of
>>> 'caveat version' was in place to protect the user.  For jsonb we
>>> decided to implement a version flag within the type itself, which I
>>> thought mistake at the time -- better to have a version header in the
>>> COPY BINARY if needed.
>>>
>>
>> jsonb_send does output a version header, as I pointed out upthread.
> That's Merlin's point I think. For reasons I don't quite understand he
> doesn't like that. Yes, a global solution would have been prettier than
> per-datum version flag, but that obvioulsy wasn't realistic to introduce
> around the feature freeze of the version that introduced jsonb.
>
>
>


Oh, hmm. It would have been a big change of little value, ISTM. One byte 
of overhead per jsonb datum for a version indicator doesn't seem a huge 
price to pay.

cheers

andtrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP Patch: Add a function that returns binary JSONB as a bytea

От
David Fetter
Дата:
On Wed, Oct 31, 2018 at 11:18:46AM -0400, Stephen Frost wrote:
> Greetings,
> 
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> > Stephen Frost <sfrost@snowman.net> writes:
> > > * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> > >> I dunno, I do not think it's a great idea to expose jsonb's internal
> > >> format to the world.  We intentionally did not do that when the type
> > >> was first defined --- that's why its binary I/O format isn't already
> > >> like this --- and I don't see that the tradeoffs have changed since then.
> > 
> > > I disagree- it's awfully expensive to go back and forth between string
> > > and a proper representation.
> > 
> > Has anyone put any effort into making jsonb_out() faster?  I think that
> > that would be way more productive.  Nobody is going to want to write
> > code to convert jsonb's internal form into whatever their application
> > uses; particularly not dealing with numeric fields.
> 
> I'm all for making jsonb_out() faster, but even a faster jsonb_out()
> isn't going to be faster than shoveling the jsonb across.

Would it be completely batty to try store JSONB on disk in wire format
and optimize accesses, indexing, etc. around that?

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: WIP Patch: Add a function that returns binary JSONB as a bytea

От
Andres Freund
Дата:
On 2018-11-02 23:24:35 +0100, David Fetter wrote:
> On Wed, Oct 31, 2018 at 11:18:46AM -0400, Stephen Frost wrote:
> > Greetings,
> > 
> > * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> > > Stephen Frost <sfrost@snowman.net> writes:
> > > > * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> > > >> I dunno, I do not think it's a great idea to expose jsonb's internal
> > > >> format to the world.  We intentionally did not do that when the type
> > > >> was first defined --- that's why its binary I/O format isn't already
> > > >> like this --- and I don't see that the tradeoffs have changed since then.
> > > 
> > > > I disagree- it's awfully expensive to go back and forth between string
> > > > and a proper representation.
> > > 
> > > Has anyone put any effort into making jsonb_out() faster?  I think that
> > > that would be way more productive.  Nobody is going to want to write
> > > code to convert jsonb's internal form into whatever their application
> > > uses; particularly not dealing with numeric fields.
> > 
> > I'm all for making jsonb_out() faster, but even a faster jsonb_out()
> > isn't going to be faster than shoveling the jsonb across.
> 
> Would it be completely batty to try store JSONB on disk in wire format
> and optimize accesses, indexing, etc. around that?

It doesn't seem to make any sense whatsoever.  For one, the current wire
format is text. If that's your argument, uh, yes, we have jsonb over
json for a reason.  If you're talking about a hypothetical future wire
format, that seems unlikely as well - for one there's obviously on-disk
compat issues, for another, the on-disk format is already kind of
optimized to store what it stores. And on-wire and on-disk concerns are
different.

Greetings,

Andres Freund


Re: WIP Patch: Add a function that returns binary JSONB as a bytea

От
Merlin Moncure
Дата:
On Fri, Nov 2, 2018 at 5:15 PM Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:
> On 11/02/2018 05:20 PM, Andres Freund wrote:
> > Hi,
> >
> > On 2018-11-02 17:02:24 -0400, Andrew Dunstan wrote:
> >> On 11/02/2018 11:34 AM, Merlin Moncure wrote:
> >>> Binary format consuming applications already have to deal with these
> >>> kinds of issues. We already expose internal structures in the other
> >>> functions -- not sure why jsonb is held to a different standard.  For
> >>> other data types where format changes were made, the standard of
> >>> 'caveat version' was in place to protect the user.  For jsonb we
> >>> decided to implement a version flag within the type itself, which I
> >>> thought mistake at the time -- better to have a version header in the
> >>> COPY BINARY if needed.
> >>>
> >>
> >> jsonb_send does output a version header, as I pointed out upthread.
> > That's Merlin's point I think. For reasons I don't quite understand he
> > doesn't like that. Yes, a global solution would have been prettier than
> > per-datum version flag, but that obvioulsy wasn't realistic to introduce
> > around the feature freeze of the version that introduced jsonb.
>
>
> Oh, hmm. It would have been a big change of little value, ISTM. One byte
> of overhead per jsonb datum for a version indicator doesn't seem a huge
> price to pay.

Yeah -- it really isn't.

My concern was more that it seemed odd to protect one type with a
version flag where others aren't; format agreement strikes me as more
of a protocol negotiation thing than an aspect of each individual data
point.  It also makes for slightly odd (IMO) client side coding. The
contract is of concern, not the overhead.

It works well enough though...we discussed this a bit when the header
was introduced and the discussion ended exactly the same way :-).

merlin


Re: WIP Patch: Add a function that returns binary JSONB as a bytea

От
Jelte Fennema
Дата:
(reviving an old thread)

On Thu, 23 Jun 2022 at 13:29, Merlin Moncure <mmoncure@gmail.com> wrote:
> I'll still stand other point I made though; I'd
> really want to see some benchmarks demonstrating benefit over
> competing approaches that work over the current formats.  That should
> frame the argument as to whether this is a good idea.

I tried to use COPY BINARY to copy data recently from one Postgres
server to another and it was much slower than I expected. The backend
process on the receiving side was using close to 100% of a CPU core.
So the COPY command was clearly CPU bound in this case. After doing a
profile it became clear that 50% of the CPU time was spent on parsing
JSON. This seems extremely excessive to me. I'm pretty sure any
semi-decent binary format would be able to outperform this.

FYI: The table being copied contained large JSONB blobs in one of the
columns. These blobs were around 15kB for each row.



Re: WIP Patch: Add a function that returns binary JSONB as a bytea

От
Jelte Fennema
Дата:
(attached is the flamegraph of the profile, in case others are interested)

On Thu, 23 Jun 2022 at 13:33, Jelte Fennema <me@jeltef.nl> wrote:
>
> (reviving an old thread)
>
> On Thu, 23 Jun 2022 at 13:29, Merlin Moncure <mmoncure@gmail.com> wrote:
> > I'll still stand other point I made though; I'd
> > really want to see some benchmarks demonstrating benefit over
> > competing approaches that work over the current formats.  That should
> > frame the argument as to whether this is a good idea.
>
> I tried to use COPY BINARY to copy data recently from one Postgres
> server to another and it was much slower than I expected. The backend
> process on the receiving side was using close to 100% of a CPU core.
> So the COPY command was clearly CPU bound in this case. After doing a
> profile it became clear that 50% of the CPU time was spent on parsing
> JSON. This seems extremely excessive to me. I'm pretty sure any
> semi-decent binary format would be able to outperform this.
>
> FYI: The table being copied contained large JSONB blobs in one of the
> columns. These blobs were around 15kB for each row.

Вложения

Re: WIP Patch: Add a function that returns binary JSONB as a bytea

От
Andres Freund
Дата:
Hi,

On 2022-06-23 13:33:24 +0200, Jelte Fennema wrote:
> (reviving an old thread)
> 
> On Thu, 23 Jun 2022 at 13:29, Merlin Moncure <mmoncure@gmail.com> wrote:
> > I'll still stand other point I made though; I'd
> > really want to see some benchmarks demonstrating benefit over
> > competing approaches that work over the current formats.  That should
> > frame the argument as to whether this is a good idea.
> 
> I tried to use COPY BINARY to copy data recently from one Postgres
> server to another and it was much slower than I expected. The backend
> process on the receiving side was using close to 100% of a CPU core.
> So the COPY command was clearly CPU bound in this case. After doing a
> profile it became clear that 50% of the CPU time was spent on parsing
> JSON. This seems extremely excessive to me.

It looks like there's quite a bit of low hanging fruits to optimize...


> I'm pretty sure any semi-decent binary format would be able to outperform
> this.

Sure. It's a decent amount of work to define one though... It's clearly not
acceptable to just dump out the internal representation, as already discussed
in this thread.

Greetings,

Andres Freund



Re: WIP Patch: Add a function that returns binary JSONB as a bytea

От
John Naylor
Дата:
On Thu, Jun 23, 2022 at 9:06 PM Andres Freund <andres@anarazel.de> wrote:
> It looks like there's quite a bit of low hanging fruits to optimize...

Yeah, if escapes and control characters are rare, adding an SSE2 fast
path would give a boost to json_lex_string: check 16 bytes at a time
for those chars (plus the ending double-quote). We can also shave a
few percent by having pg_utf8_verifystr use SSE2 for the ascii path. I
can look into this.

-- 
John Naylor
EDB: http://www.enterprisedb.com



Re: WIP Patch: Add a function that returns binary JSONB as a bytea

От
Jelte Fennema
Дата:
> It's a decent amount of work to define one though... It's clearly not
> acceptable to just dump out the internal representation, as already discussed
> in this thread.

I totally agree that it should be a well-defined format that doesn't
leak stuff like endianness and alignment of the underlying database.

With a bit of googling I found the UBJSON specification:
https://ubjson.org/#data_format
It seems like it would be possible to transform between JSONB and
UBJSON efficiently. As an example: For my recent use case the main
thing that was slow was parsing JSON strings, because of the escape
characters. That's not needed with UBJSON, because strings are simply
UTF-8 encoded binary data, that are prefixed with their length. So all
that would be needed is checking if the binary data is valid UTF-8.

Also there seem to be implementations in many languages for this spec:
https://ubjson.org/libraries/ So, that should make it easy for
Postgres client libraries to support this binary format.

> I'm still bemused by the proposition that that common interchange format
> shouldn't be, um, JSON. We've already seen BSON, BJSON, etc die
> well-deserved deaths.

@Tom Lane: UBJSON calls explicitly lists these specific failed
attempts at a binary encoding for JSON as the reason for why it was
created, aiming to fix the issues those specs have:
https://github.com/ubjson/universal-binary-json#why

Jelte



Re: WIP Patch: Add a function that returns binary JSONB as a bytea

От
Andres Freund
Дата:
Hi,

On 2022-06-24 14:33:00 +0700, John Naylor wrote:
> On Thu, Jun 23, 2022 at 9:06 PM Andres Freund <andres@anarazel.de> wrote:
> > It looks like there's quite a bit of low hanging fruits to optimize...
> 
> Yeah, if escapes and control characters are rare, adding an SSE2 fast
> path would give a boost to json_lex_string: check 16 bytes at a time
> for those chars (plus the ending double-quote).

The biggest thing I see is building the string in bigger chunks. Doing a
separate appendStringInfoChar() for each input character is obviously bad for
performance. I'd bet a good chunk of the time attributed to json_lex_string()
in Jelte's flamegraph is actually setting up the external function call to
appendStringInfoChar(). Which then proceeds to do a check whether the buffer
needs to be enlarged and maintains the trailing null byte on every call.

Greetings,

Andres Freund



Re: WIP Patch: Add a function that returns binary JSONB as a bytea

От
John Naylor
Дата:
I wrote:
> We can also shave a
> few percent by having pg_utf8_verifystr use SSE2 for the ascii path. I
> can look into this.

Here's a patch for that. If the input is mostly ascii, I'd expect that
part of the flame graph to shrink by 40-50% and give a small boost
overall.

-- 
John Naylor
EDB: http://www.enterprisedb.com

Вложения