Обсуждение: Call for GIST/GIN/SP-GIST opclass documentation

Поиск

Список

Период

Сортировка

Call for GIST/GIN/SP-GIST opclass documentation

От

Tom Lane

Дата:

08 апреля 2014 г., 20:41:12

I just created sections in the SGML manual chapters about GIST, GIN, and
SP-GIST to hold documentation about the standard opclasses provided for
them:

http://www.postgresql.org/docs/devel/static/gist-builtin-opclasses.html
http://www.postgresql.org/docs/devel/static/gin-builtin-opclasses.html
http://www.postgresql.org/docs/devel/static/spgist-builtin-opclasses.html

We never had any well-defined place to discuss such opclasses before,
but I think it's past time we did.  I envision these sections as places to
document, for example, the difference between the two jsonb gin opclasses.
I put this text in about that:

    Of the two operator classes for type jsonb, jsonb_ops is the
    default. jsonb_hash_ops supports fewer operators but will work with
    larger indexed values than jsonb_ops can support.

Is that accurate?  Do we need to say more?

For the moment what's there is mostly just tables of the core opclasses
and the operators they support.  If anyone can think of specific additions
that should be there, please send in patches.

(BTW, I didn't worry about btree and hash because they don't have such
a wide variety of opclass behaviors.)

            regards, tom lane

Re: Call for GIST/GIN/SP-GIST opclass documentation

От

Peter Geoghegan

Дата:

08 апреля 2014 г., 21:03:51

On Tue, Apr 8, 2014 at 1:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I just created sections in the SGML manual chapters about GIST, GIN, and
> SP-GIST to hold documentation about the standard opclasses provided for
> them:

I think that that's a good idea. I too was bothered by this omission.

>     Of the two operator classes for type jsonb, jsonb_ops is the
>     default. jsonb_hash_ops supports fewer operators but will work with
>     larger indexed values than jsonb_ops can support.
>
> Is that accurate?  Do we need to say more?

Well, I'm not sure that it's worth noting there, but as you probably
already know jsonb_hash_ops will perform a lot better than the default
GIN opclass, and will have much smaller indexes. FWIW I think that the
size limitation is overblown, and performance is in fact the
compelling reason to prefer jsonb_hash_ops, although it's probably
incongruous to explain the issues that way in this section of the
docs. It probably suffices that that is covered in the "JSON Types"
section.

--
Peter Geoghegan

Re: Call for GIST/GIN/SP-GIST opclass documentation

От

Tom Lane

Дата:

08 апреля 2014 г., 21:34:07

Peter Geoghegan <pg@heroku.com> writes:
> On Tue, Apr 8, 2014 at 1:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Of the two operator classes for type jsonb, jsonb_ops is the
>> default. jsonb_hash_ops supports fewer operators but will work with
>> larger indexed values than jsonb_ops can support.
>>
>> Is that accurate?  Do we need to say more?

> Well, I'm not sure that it's worth noting there, but as you probably
> already know jsonb_hash_ops will perform a lot better than the default
> GIN opclass, and will have much smaller indexes. FWIW I think that the
> size limitation is overblown, and performance is in fact the
> compelling reason to prefer jsonb_hash_ops, although it's probably
> incongruous to explain the issues that way in this section of the
> docs. It probably suffices that that is covered in the "JSON Types"
> section.

Well, the subtext is whether we should move that discussion to this
new section.  I think there is some comparable discussion in the
full-text-indexing chapter, too.

(BTW, wasn't there some discussion of changing our minds about which
one is the default?  We already have one bug report complaining about
jsonb_ops' size restriction, so that seems to be evidence in favor
of changing ...)

            regards, tom lane

Re: Call for GIST/GIN/SP-GIST opclass documentation

От

Peter Geoghegan

Дата:

08 апреля 2014 г., 21:42:04

On Tue, Apr 8, 2014 at 2:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> (BTW, wasn't there some discussion of changing our minds about which
> one is the default?  We already have one bug report complaining about
> jsonb_ops' size restriction, so that seems to be evidence in favor
> of changing ...)

Yes, there was. I very nearly came down on the side of making
jsonb_hash_ops the default, but given that it doesn't make all
operators indexable, I ultimately decided against supporting that
course of action. I thought that that would be an odd limitation for
the default GIN opclass to have. It was a very close call in my mind,
and if you favor changing the default now, in light of the few
complaints we've heard, I think that's a reasonable decision. That
said, as I noted in the main -bugs thread, the case presented is
fairly atypical.

--
Peter Geoghegan

default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Tom Lane

Дата:

08 апреля 2014 г., 21:46:29

Peter Geoghegan <pg@heroku.com> writes:
> On Tue, Apr 8, 2014 at 2:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> (BTW, wasn't there some discussion of changing our minds about which
>> one is the default?  We already have one bug report complaining about
>> jsonb_ops' size restriction, so that seems to be evidence in favor
>> of changing ...)

> Yes, there was. I very nearly came down on the side of making
> jsonb_hash_ops the default, but given that it doesn't make all
> operators indexable, I ultimately decided against supporting that
> course of action. I thought that that would be an odd limitation for
> the default GIN opclass to have. It was a very close call in my mind,
> and if you favor changing the default now, in light of the few
> complaints we've heard, I think that's a reasonable decision. That
> said, as I noted in the main -bugs thread, the case presented is
> fairly atypical.

Well, let me see if I understand the situation correctly:

* jsonb_ops supports more operators

* jsonb_hash_ops produces smaller, better-performing indexes

* jsonb_ops falls over on inputs with wide field values, but
jsonb_hash_ops does not

If that's an accurate summary then I would say that we've got
the default backwards.  I would much rather tell people "you
can have more operators supported, but here are the tradeoffs"
than have a default that fails under evidently-not-so-improbable
cases.
        regards, tom lane

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Peter Geoghegan

Дата:

08 апреля 2014 г., 21:57:49

On Tue, Apr 8, 2014 at 2:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Well, let me see if I understand the situation correctly:
>
> * jsonb_ops supports more operators
>
> * jsonb_hash_ops produces smaller, better-performing indexes
>
> * jsonb_ops falls over on inputs with wide field values, but
> jsonb_hash_ops does not

There might be some compelling cases for indexing existence rather
than containment, since the recheck flag isn't set there, but in
general this summary seems sound. I would say that broadly, existence
is a less useful operator than containment, and so jsonb_hash_ops is
broadly more compelling. I didn't propose changing the default due to
concerns about the POLA, but I'm happy to be told that those concerns
were out of proportion to the practical benefits of a different
default.

-- 
Peter Geoghegan

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Andrew Dunstan

Дата:

08 апреля 2014 г., 22:18:46

On 04/08/2014 05:57 PM, Peter Geoghegan wrote:
> On Tue, Apr 8, 2014 at 2:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Well, let me see if I understand the situation correctly:
>>
>> * jsonb_ops supports more operators
>>
>> * jsonb_hash_ops produces smaller, better-performing indexes
>>
>> * jsonb_ops falls over on inputs with wide field values, but
>> jsonb_hash_ops does not
> There might be some compelling cases for indexing existence rather
> than containment, since the recheck flag isn't set there, but in
> general this summary seems sound. I would say that broadly, existence
> is a less useful operator than containment, and so jsonb_hash_ops is
> broadly more compelling. I didn't propose changing the default due to
> concerns about the POLA, but I'm happy to be told that those concerns
> were out of proportion to the practical benefits of a different
> default.
>


I tend to agree with Tom that POLA will be more violated by the default 
ops class not being able to index some values.

cheers

andrew

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Tom Lane

Дата:

09 апреля 2014 г., 02:48:38

Andrew Dunstan <andrew@dunslane.net> writes:
> On 04/08/2014 05:57 PM, Peter Geoghegan wrote:
>> ... I didn't propose changing the default due to
>> concerns about the POLA, but I'm happy to be told that those concerns
>> were out of proportion to the practical benefits of a different
>> default.

> I tend to agree with Tom that POLA will be more violated by the default 
> ops class not being able to index some values.

We should wait a bit longer to see if anyone objects, but assuming that
this represents the consensus opinion ...

ISTM that the name "jsonb_ops" should have pride of place as the default
jsonb opclass.  Therefore, if we make this change, jsonb_hash_ops needs to
be renamed to jsonb_ops, and we need a new name for what is now jsonb_ops.
I haven't paid attention to the technical details of the differences so
I have no idea what to suggest for the new name.  Thoughts?
        regards, tom lane

Re: [DOCS] Call for GIST/GIN/SP-GIST opclass documentation

От

Robert Haas

Дата:

09 апреля 2014 г., 04:23:53

On Tue, Apr 8, 2014 at 4:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I just created sections in the SGML manual chapters about GIST, GIN, and
> SP-GIST to hold documentation about the standard opclasses provided for
> them:
>
> http://www.postgresql.org/docs/devel/static/gist-builtin-opclasses.html
> http://www.postgresql.org/docs/devel/static/gin-builtin-opclasses.html
> http://www.postgresql.org/docs/devel/static/spgist-builtin-opclasses.html
>
> We never had any well-defined place to discuss such opclasses before,
> but I think it's past time we did.

+1.  Great idea.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Heikki Linnakangas

Дата:

09 апреля 2014 г., 06:37:32

On 04/09/2014 01:18 AM, Andrew Dunstan wrote:
>
> On 04/08/2014 05:57 PM, Peter Geoghegan wrote:
>> On Tue, Apr 8, 2014 at 2:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Well, let me see if I understand the situation correctly:
>>>
>>> * jsonb_ops supports more operators
>>>
>>> * jsonb_hash_ops produces smaller, better-performing indexes
>>>
>>> * jsonb_ops falls over on inputs with wide field values, but
>>> jsonb_hash_ops does not
>> There might be some compelling cases for indexing existence rather
>> than containment, since the recheck flag isn't set there, but in
>> general this summary seems sound. I would say that broadly, existence
>> is a less useful operator than containment, and so jsonb_hash_ops is
>> broadly more compelling. I didn't propose changing the default due to
>> concerns about the POLA, but I'm happy to be told that those concerns
>> were out of proportion to the practical benefits of a different
>> default.
>
> I tend to agree with Tom that POLA will be more violated by the default
> ops class not being able to index some values.

Yeah.

<rant>

Both of the operator classes are actually much less flexible than I'd 
like. Firstly, they index everything. In many cases, that's not what you 
want, so you end up with much larger indexes than necessary. Secondly, 
jsonb_ops indexes all values separately from the keys. That makes the 
index pretty much useless for a query on, say, WHERE json @> 
'{"needs_processing":true}', if all the rows also contain a key-value 
pair "active":true. Thirdly, inequality operators are not supported; you 
can't search for rows with (the json-syntax equivalent of) "price < 
12.3". Fourthly, sometimes you would want to include the "path" to an 
entry in the key, sometimes not.

If I understood correctly the way jsonb_hash_ops works, the limitation 
compared to jsonb_ops is that it cannot be used for foo ? 'bar' type 
queries. And the reason for that limitation is that it hashes the whole 
path to the key; the naked values are not indexes separately. But why 
not? jsonb_ops does - why is that decision related to whether you hash 
or not? Or it could index both. Sure, it would be wasteful when you 
don't need to support foo ? 'bar', but the point is that it should be up 
to the DBA to decide, based on his needs.

As the code stands, you don't have a choice on any of those things. The 
decisions have been made by us, PostgreSQL developers. The only choice 
you have is between jsonb_ops and jsonb_hash_ops, with a strange 
combination of tradeoffs in both. Sure, they're still useful, if not 
optimal, for a wide-range of applications. For more complicated cases, 
you will have to resort to expression indexes. It bugs me greatly that 
the underlying indexam could do all those things, we're just not 
exposing the capability.

ISTM we need a way to parameterize opclasses, so that when you create 
the index, you specify the above things.

</rant>

The ship has cleatly sailed to add parameterized opclasses to 9.4, but 
let's keep it in mind when we decide on the defaults.

In the absence of parameterizable opclasses, it would be much more 
flexible to have opclasses that index, keys, values, key-value pairs and 
paths separately, instead of the current json_ops and json_hash_ops 
opclasses which index all of those in the same index. That way, if you 
only e.g. ever query on the existence of a key, you'd only need to index 
the keys.

I don't understand how we ended up with the current dichotomy of 
json_ops and json_hash_ops...

- Heikki

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Peter Geoghegan

Дата:

09 апреля 2014 г., 07:40:52

On Tue, Apr 8, 2014 at 11:37 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> As the code stands, you don't have a choice on any of those things. The
> decisions have been made by us, PostgreSQL developers. The only choice you
> have is between jsonb_ops and jsonb_hash_ops, with a strange combination of
> tradeoffs in both. Sure, they're still useful, if not optimal, for a
> wide-range of applications. For more complicated cases, you will have to
> resort to expression indexes. It bugs me greatly that the underlying indexam
> could do all those things, we're just not exposing the capability.

Why would you ever not have to use expression indexes? Idiomatic usage
of jsonb involves expression indexes because it's desirable to index
only a expression. People will want to do things like only index the
nested "tags" array far more frequently then they'll only want to
index keys (that is, Object pair keys) in the entire document. I don't
get why you'd say that they'd "resort" to expression indexes, like
they're a kludge. Have you ever tried out one of the new document
databases? I suggest you do. Expression indexes on jsonb map pretty
closely onto how you're frequently expected to index data in those
systems. That's something that they make heavy use of. Why would you
ever not really have to consider ahead of time what is important
enough to be indexed, and what is not?

> ISTM we need a way to parameterize opclasses, so that when you create the
> index, you specify the above things.

That would be nice.

> In the absence of parameterizable opclasses, it would be much more flexible
> to have opclasses that index, keys, values, key-value pairs and paths
> separately, instead of the current json_ops and json_hash_ops opclasses
> which index all of those in the same index. That way, if you only e.g. ever
> query on the existence of a key, you'd only need to index the keys.

I think only ever needing to index the keys is not a common use-case.
It pretty much makes exactly as much sense to do so as it would with
hstore, and yet hstore doesn't support that after all these years.

> I don't understand how we ended up with the current dichotomy of json_ops
> and json_hash_ops...

It makes sense if you consider jsonb_ops best suited to simpler
hstore-style indexing, while jsonb_hash_ops is best suited to testing
containment of JSON documents, potentially with lots of nesting. These
documents are typically homogeneous in structure. Idiomatic usage of
systems like MongoDB involves "collections" of fairly homogeneous
documents. If there is a lot of variability in their structure within
a collection, the collection more or less becomes impossible to
usefully query. They aim to be flexible, but still implicitly require
you to insert data with a half-way sensible/consistent structure. This
makes separately indexing the keys less than compelling as a default,
because there is so much duplication of keys in practice.

-- 
Peter Geoghegan

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Heikki Linnakangas

Дата:

09 апреля 2014 г., 08:21:43

On 04/09/2014 10:40 AM, Peter Geoghegan wrote:
> On Tue, Apr 8, 2014 at 11:37 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> As the code stands, you don't have a choice on any of those things. The
>> decisions have been made by us, PostgreSQL developers. The only choice you
>> have is between jsonb_ops and jsonb_hash_ops, with a strange combination of
>> tradeoffs in both. Sure, they're still useful, if not optimal, for a
>> wide-range of applications. For more complicated cases, you will have to
>> resort to expression indexes. It bugs me greatly that the underlying indexam
>> could do all those things, we're just not exposing the capability.
>
> Why would you ever not have to use expression indexes? Idiomatic usage
> of jsonb involves expression indexes because it's desirable to index
> only a expression. People will want to do things like only index the
> nested "tags" array far more frequently then they'll only want to
> index keys (that is, Object pair keys) in the entire document. I don't
> get why you'd say that they'd "resort" to expression indexes, like
> they're a kludge.

Expression indexes are definitely nice, but you have to be careful to 
formulate the query in exactly the same way to match the index.

> Have you ever tried out one of the new document
> databases? I suggest you do. Expression indexes on jsonb map pretty
> closely onto how you're frequently expected to index data in those
> systems. That's something that they make heavy use of. Why would you
> ever not really have to consider ahead of time what is important
> enough to be indexed, and what is not?

I didn't say that. On the contrary, I think the shotgun approach 
jsonb_ops and jsonb_hash_ops take is too broad. It should be possible to 
specify what to index in a more detailed fashion.

- Heikki

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Peter Geoghegan

Дата:

09 апреля 2014 г., 08:40:47

On Wed, Apr 9, 2014 at 1:21 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> I didn't say that. On the contrary, I think the shotgun approach jsonb_ops
> and jsonb_hash_ops take is too broad. It should be possible to specify what
> to index in a more detailed fashion.

It is - use an expression index. That's by far the most important way
to specify what to index in a more detailed fashion. There are others,
but that's the major one. Beyond that, yes, it's necessary to
carefully write your query predicate a certain way. However, a similar
situation exists in MongoDB, where there is a distinction between
"Indexes on embedded fields" (which must be accessed using special
"dot notation") and "indexes on subdocuments" (which cannot be
accessed using "dot notation"). It's late here, but I'm pretty sure
that's a feature and not a limitation.

-- 
Peter Geoghegan

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Alexander Korotkov

Дата:

09 апреля 2014 г., 14:58:22

<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Wed, Apr 9, 2014 at 10:37 AM, Heikki Linnakangas
<spandir="ltr"><<a href="mailto:hlinnakangas@vmware.com" target="_blank">hlinnakangas@vmware.com</a>></span>
wrote:<br/><blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div
class="">Theship has cleatly sailed to add parameterized opclasses to 9.4, but let's keep it in mind when we decide on
thedefaults.<br /></div><br /> In the absence of parameterizable opclasses, it would be much more flexible to have
opclassesthat index, keys, values, key-value pairs and paths separately, instead of the current json_ops and
json_hash_opsopclasses which index all of those in the same index. That way, if you only e.g. ever query on the
existenceof a key, you'd only need to index the keys.<br /><br /> I don't understand how we ended up with the current
dichotomyof json_ops and json_hash_ops...</blockquote><div class="gmail_quote"><br /></div><div class="gmail_quote">+1
forparameterizable opclasses</div><br />------<br /> With best regards,<br />Alexander Korotkov. </div></div></div>

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Alexander Korotkov

Дата:

09 апреля 2014 г., 15:13:00

On Wed, Apr 9, 2014 at 12:40 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Wed, Apr 9, 2014 at 1:21 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> I didn't say that. On the contrary, I think the shotgun approach jsonb_ops
> and jsonb_hash_ops take is too broad. It should be possible to specify what
> to index in a more detailed fashion.

It is - use an expression index. That's by far the most important way
to specify what to index in a more detailed fashion. There are others,
but that's the major one. Beyond that, yes, it's necessary to
carefully write your query predicate a certain way. However, a similar
situation exists in MongoDB, where there is a distinction between
"Indexes on embedded fields" (which must be accessed using special
"dot notation") and "indexes on subdocuments" (which cannot be
accessed using "dot notation"). It's late here, but I'm pretty sure
that's a feature and not a limitation.

I believe that serious limitation we now have is that we actually specify kind of index to be used in the SQL query.
For example you need to find objects with active = true. You can write:

js @> {"active": true}

then GIN index on js can be used. Also you can write:

js->'active' = true

then btree expression index on (js->'active') can be used. For sure, one can do

js @> {"active": true} AND js->'active' = true

This query can use any of indexes, but it is:

1) Cluge

2) Excess recheck

3) If both indexes present, excess "bitmap and".

Having to choose index in SQL-query we make our SQL more imperative and less declarative. Similar things can happen without json/hstore (user have to rewrite SQL in order to use expression index), but now it could become very common. My opinion is that we have to do something in planner to make it understand at least this two kinds of queries to be equivalent.

------
With best regards,
Alexander Korotkov.

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Robert Haas

Дата:

09 апреля 2014 г., 15:16:20

On Wed, Apr 9, 2014 at 2:37 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Both of the operator classes are actually much less flexible than I'd like.
> Firstly, they index everything. In many cases, that's not what you want, so
> you end up with much larger indexes than necessary. Secondly, jsonb_ops
> indexes all values separately from the keys. That makes the index pretty
> much useless for a query on, say, WHERE json @> '{"needs_processing":true}',
> if all the rows also contain a key-value pair "active":true. Thirdly,
> inequality operators are not supported; you can't search for rows with (the
> json-syntax equivalent of) "price < 12.3". Fourthly, sometimes you would
> want to include the "path" to an entry in the key, sometimes not.

Maybe we should make *neither* of these the default opclass, and give
*neither* the name json_ops.

> ISTM we need a way to parameterize opclasses, so that when you create the
> index, you specify the above things.

Yeah, that would be great.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Tom Lane

Дата:

09 апреля 2014 г., 15:24:23

Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Apr 9, 2014 at 2:37 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> Both of the operator classes are actually much less flexible than I'd like.

> Maybe we should make *neither* of these the default opclass, and give
> *neither* the name json_ops.

There's definitely something to be said for that.  Default opclasses are
sensible when there's basically only one behavior that's interesting for
most people.  We can already see that that's not going to be the case
for jsonb indexes, at least not with the currently available alternatives.

Not having a default would force users to make decisions explicitly.
Is that what we want?

One other point here is that non-default opclasses can't be used in
UNIQUE/PRIMARY KEY/EXCLUDE constraints, because there's no place to
specify an opclass name in those syntaxes.  UNIQUE/PRIMARY KEY don't
matter here since these aren't btree opclasses, but is there a
use-case for EXCLUDE with any of the supported jsonb operators?
        regards, tom lane

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Alvaro Herrera

Дата:

09 апреля 2014 г., 15:56:54

Tom Lane wrote:

> > On Wed, Apr 9, 2014 at 2:37 AM, Heikki Linnakangas
> > <hlinnakangas@vmware.com> wrote:
> >> Both of the operator classes are actually much less flexible than I'd like.
> 
> > Maybe we should make *neither* of these the default opclass, and give
> > *neither* the name json_ops.

+1.  I was thinking the same thing after reading Heikki's rant.

> One other point here is that non-default opclasses can't be used in
> UNIQUE/PRIMARY KEY/EXCLUDE constraints, because there's no place to
> specify an opclass name in those syntaxes.  UNIQUE/PRIMARY KEY don't
> matter here since these aren't btree opclasses, but is there a
> use-case for EXCLUDE with any of the supported jsonb operators?

That sounds like an oversight that could better be fixed in EXCLUDE, no?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: [DOCS] Call for GIST/GIN/SP-GIST opclass documentation

От

Alvaro Herrera

Дата:

09 апреля 2014 г., 16:03:47

Robert Haas wrote:
> On Tue, Apr 8, 2014 at 4:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I just created sections in the SGML manual chapters about GIST, GIN, and
> > SP-GIST to hold documentation about the standard opclasses provided for
> > them:
> >
> > http://www.postgresql.org/docs/devel/static/gist-builtin-opclasses.html
> > http://www.postgresql.org/docs/devel/static/gin-builtin-opclasses.html
> > http://www.postgresql.org/docs/devel/static/spgist-builtin-opclasses.html
> >
> > We never had any well-defined place to discuss such opclasses before,
> > but I think it's past time we did.
>
> +1.  Great idea.

Agreed.

I find that in my browser it's a bit difficult to make out the different
operators in the long list; ISTM a single space between operators isn't
enough.  I wonder how can we do better; sticking commas (or any other
single character, really) between them is not likely to improve matters;
and using one column per operator is not going to work very well.  Maybe
if it were possible to split a single cell in several sub-cells; or
perhaps there's a way to specify two whitespace units, or a long space,
or something like that.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Tom Lane

Дата:

09 апреля 2014 г., 16:04:49

Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> One other point here is that non-default opclasses can't be used in
>> UNIQUE/PRIMARY KEY/EXCLUDE constraints, because there's no place to
>> specify an opclass name in those syntaxes.  UNIQUE/PRIMARY KEY don't
>> matter here since these aren't btree opclasses, but is there a
>> use-case for EXCLUDE with any of the supported jsonb operators?

> That sounds like an oversight that could better be fixed in EXCLUDE, no?

Well, there hasn't been a use-case up to now.  I'm not sure there's
one yet.
        regards, tom lane

Re: [DOCS] Call for GIST/GIN/SP-GIST opclass documentation

От

Tom Lane

Дата:

09 апреля 2014 г., 16:19:00

Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Robert Haas wrote:
>> On Tue, Apr 8, 2014 at 4:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> http://www.postgresql.org/docs/devel/static/gist-builtin-opclasses.html
>>> http://www.postgresql.org/docs/devel/static/gin-builtin-opclasses.html
>>> http://www.postgresql.org/docs/devel/static/spgist-builtin-opclasses.html

> I find that in my browser it's a bit difficult to make out the different
> operators in the long list; ISTM a single space between operators isn't
> enough.

Yeah, I noticed that too, but wasn't sure what to do about it.  One line
per operator isn't better, but I don't know how to get a bit more
horizontal space into the lists.

            regards, tom lane

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Peter Geoghegan

Дата:

09 апреля 2014 г., 16:51:50

On Wed, Apr 9, 2014 at 8:24 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Maybe we should make *neither* of these the default opclass, and give
>> *neither* the name json_ops.
>
> There's definitely something to be said for that.  Default opclasses are
> sensible when there's basically only one behavior that's interesting for
> most people.  We can already see that that's not going to be the case
> for jsonb indexes, at least not with the currently available alternatives.

I've heard worse ideas.


-- 
Peter Geoghegan

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Greg Stark

Дата:

09 апреля 2014 г., 18:23:39

On Wed, Apr 9, 2014 at 11:24 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Maybe we should make *neither* of these the default opclass, and give
>> *neither* the name json_ops.
>
> There's definitely something to be said for that.  Default opclasses are
> sensible when there's basically only one behavior that's interesting for
> most people.  We can already see that that's not going to be the case
> for jsonb indexes, at least not with the currently available alternatives.
>
> Not having a default would force users to make decisions explicitly.
> Is that what we want?

I don't like the idea of having no default opclass. I think there's a
huge usability gain in being able to "just" create an index on a
column and have it do something reasonable for most use cases.

I can get behind the idea of having separate index opclasses for paths
and path-value pairs but I suspect the default should just be to index
both in the same index. If we can have one default index opclass that
supports containment and existence and then other opclasses that are
smaller but only support a subset of the operators that would seem
like the best compromise.

I'm a bit confused by Heikki's list though. I would expect path and
path-value pair to be the only useful ones. I'm not clear what an
index on keys or key-value would be -- it would index just the
top-level keys and values without recursing?

-- 
greg

Re: Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Bruce Momjian

Дата:

22 апреля 2014 г., 22:32:38

On Wed, Apr  9, 2014 at 02:22:54PM -0400, Greg Stark wrote:
> On Wed, Apr 9, 2014 at 11:24 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Maybe we should make *neither* of these the default opclass, and give
> >> *neither* the name json_ops.
> >
> > There's definitely something to be said for that.  Default opclasses are
> > sensible when there's basically only one behavior that's interesting for
> > most people.  We can already see that that's not going to be the case
> > for jsonb indexes, at least not with the currently available alternatives.
> >
> > Not having a default would force users to make decisions explicitly.
> > Is that what we want?
> 
> I don't like the idea of having no default opclass. I think there's a
> huge usability gain in being able to "just" create an index on a
> column and have it do something reasonable for most use cases.
> 
> I can get behind the idea of having separate index opclasses for paths
> and path-value pairs but I suspect the default should just be to index
> both in the same index. If we can have one default index opclass that
> supports containment and existence and then other opclasses that are
> smaller but only support a subset of the operators that would seem
> like the best compromise.
> 
> I'm a bit confused by Heikki's list though. I would expect path and
> path-value pair to be the only useful ones. I'm not clear what an
> index on keys or key-value would be -- it would index just the
> top-level keys and values without recursing?

Where are we on the default JSONB opclass change?

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +

Re: Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Peter Geoghegan

Дата:

22 апреля 2014 г., 22:41:00

On Tue, Apr 22, 2014 at 3:32 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Where are we on the default JSONB opclass change?

FWIW, I still don't have any strong opinion here. I defer to others on
this question.


-- 
Peter Geoghegan

Re: Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Tom Lane

Дата:

23 апреля 2014 г., 00:50:38

Bruce Momjian <bruce@momjian.us> writes:
> Where are we on the default JSONB opclass change?

Not sure.  I'm for changing it, I think, but it wasn't at all clear
that we had consensus on that.  We did not have a proposed new name
for the opclass either ...
        regards, tom lane

Re: Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Petr Jelinek

Дата:

23 апреля 2014 г., 08:20:52

On 23/04/14 00:40, Peter Geoghegan wrote:
> On Tue, Apr 22, 2014 at 3:32 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> Where are we on the default JSONB opclass change?
>
> FWIW, I still don't have any strong opinion here. I defer to others on
> this question.
>

I vote for changing it, even though neither option is ideal I think that 
given the nature of datatype the current default will break inserts for 
common usage pattern and that's much worse than not being able to use 
the index for some operators.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services

Re: Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Bruce Momjian

Дата:

23 апреля 2014 г., 10:56:47

On Wed, Apr 23, 2014 at 10:20:42AM +0200, Petr Jelinek wrote:
> On 23/04/14 00:40, Peter Geoghegan wrote:
> >On Tue, Apr 22, 2014 at 3:32 PM, Bruce Momjian <bruce@momjian.us> wrote:
> >>Where are we on the default JSONB opclass change?
> >
> >FWIW, I still don't have any strong opinion here. I defer to others on
> >this question.
> >
> 
> I vote for changing it, even though neither option is ideal I think
> that given the nature of datatype the current default will break
> inserts for common usage pattern and that's much worse than not
> being able to use the index for some operators.

I agree.  We should choose the most general option as the default.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +

Re: Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Michael Paquier

Дата:

23 апреля 2014 г., 12:18:10

On Wed, Apr 23, 2014 at 7:56 PM, Bruce Momjian <bruce@momjian.us> wrote:

On Wed, Apr 23, 2014 at 10:20:42AM +0200, Petr Jelinek wrote:
> I vote for changing it, even though neither option is ideal I think
> that given the nature of datatype the current default will break
> inserts for common usage pattern and that's much worse than not
> being able to use the index for some operators.

I agree. We should choose the most general option as the default.

+1. Less operators are supported by the now-named jsonb_hash_ops but at least users won't be surprised by failures caused by too long index records.

--
Michael

Re: Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Tom Lane

Дата:

23 апреля 2014 г., 13:51:07

Bruce Momjian <bruce@momjian.us> writes:
> On Wed, Apr 23, 2014 at 10:20:42AM +0200, Petr Jelinek wrote:
>> On 23/04/14 00:40, Peter Geoghegan wrote:
>>> On Tue, Apr 22, 2014 at 3:32 PM, Bruce Momjian <bruce@momjian.us> wrote:
>>>> Where are we on the default JSONB opclass change?

>>> FWIW, I still don't have any strong opinion here. I defer to others on
>>> this question.

>> I vote for changing it, even though neither option is ideal I think
>> that given the nature of datatype the current default will break
>> inserts for common usage pattern and that's much worse than not
>> being able to use the index for some operators.

> I agree.  We should choose the most general option as the default.

That seems to be the consensus, but now we need a name for the
soon-to-be-not-default opclass.  What's a good short adjective for it?
        regards, tom lane

Re: Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Alvaro Herrera

Дата:

23 апреля 2014 г., 14:03:08

Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > On Wed, Apr 23, 2014 at 10:20:42AM +0200, Petr Jelinek wrote:
> >> On 23/04/14 00:40, Peter Geoghegan wrote:
> >>> On Tue, Apr 22, 2014 at 3:32 PM, Bruce Momjian <bruce@momjian.us> wrote:
> >>>> Where are we on the default JSONB opclass change?
> 
> >>> FWIW, I still don't have any strong opinion here. I defer to others on
> >>> this question.
> 
> >> I vote for changing it, even though neither option is ideal I think
> >> that given the nature of datatype the current default will break
> >> inserts for common usage pattern and that's much worse than not
> >> being able to use the index for some operators.
> 
> > I agree.  We should choose the most general option as the default.
> 
> That seems to be the consensus, but now we need a name for the
> soon-to-be-not-default opclass.  What's a good short adjective for it?

"comprehensive"?  Not particularly short ...

According to Merriam Webster:
Synonymsall-embracing, all-in [chiefly British], all-inclusive,broad-gauge (or broad-gauged), compendious,
complete,encyclopedic,cover-all, cyclopedic, embracive, exhaustive,full, global, inclusive, in-depth, omnibus,
panoramic,thorough,universal
 

Related Wordsbroad, catholic, encyclical, general, inclusionary, overall;cosmic (also cosmical), extensive, far,
far-reaching,grand,large, panoptic, sweeping, vast, wide, wide-ranging; blanket,indiscriminate, unrestricted
 

jsonb_omnibus_ops ?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Tom Lane

Дата:

23 апреля 2014 г., 14:18:07

Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> That seems to be the consensus, but now we need a name for the
>> soon-to-be-not-default opclass.  What's a good short adjective for it?

> "comprehensive"?  Not particularly short ...

> According to Merriam Webster:
> Synonyms
>     all-embracing, all-in [chiefly British], all-inclusive,
>     broad-gauge (or broad-gauged), compendious, complete,
>     encyclopedic, cover-all, cyclopedic, embracive, exhaustive,
>     full, global, inclusive, in-depth, omnibus, panoramic, thorough,
>     universal

> Related Words
>     broad, catholic, encyclical, general, inclusionary, overall;
>     cosmic (also cosmical), extensive, far, far-reaching, grand,
>     large, panoptic, sweeping, vast, wide, wide-ranging; blanket,
>     indiscriminate, unrestricted

> jsonb_omnibus_ops ?

hm ... jsonb_full_ops seems nicely short, but on the other hand it just
begs the question "full what?".  I'm a bit worried about future-proof-ness
too; what if somebody later comes up with a new opclass that indexes more
operators?  We'd end up calling it jsonb_fuller_ops, ick.

I was kind of hoping for a technical adjective, like "hash" is for the
soon-to-be-default opclass.  What is it about this opclass that
distinguishes it from other indexing approaches that someone might try?
        regards, tom lane

Re: Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Bruce Momjian

Дата:

06 мая 2014 г., 20:10:57

On Tue, Apr 22, 2014 at 08:50:20PM -0400, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Where are we on the default JSONB opclass change?
> 
> Not sure.  I'm for changing it, I think, but it wasn't at all clear
> that we had consensus on that.  We did not have a proposed new name
> for the opclass either ...

Where are we on this question?

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +

Re: Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Tom Lane

Дата:

06 мая 2014 г., 20:19:02

Bruce Momjian <bruce@momjian.us> writes:
> On Tue, Apr 22, 2014 at 08:50:20PM -0400, Tom Lane wrote:
>> Bruce Momjian <bruce@momjian.us> writes:
>>> Where are we on the default JSONB opclass change?

>> Not sure.  I'm for changing it, I think, but it wasn't at all clear
>> that we had consensus on that.  We did not have a proposed new name
>> for the opclass either ...

> Where are we on this question?

Stuck on the naming question.  I'd be willing to do the patch legwork
if we had a consensus (or even a proposal) for what to rename the
current jsonb_ops to.
        regards, tom lane

Re: Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Bruce Momjian

Дата:

06 мая 2014 г., 21:20:27

On Tue, May  6, 2014 at 04:18:50PM -0400, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > On Tue, Apr 22, 2014 at 08:50:20PM -0400, Tom Lane wrote:
> >> Bruce Momjian <bruce@momjian.us> writes:
> >>> Where are we on the default JSONB opclass change?
> 
> >> Not sure.  I'm for changing it, I think, but it wasn't at all clear
> >> that we had consensus on that.  We did not have a proposed new name
> >> for the opclass either ...
> 
> > Where are we on this question?
> 
> Stuck on the naming question.  I'd be willing to do the patch legwork
> if we had a consensus (or even a proposal) for what to rename the
> current jsonb_ops to.

Well, then, we only have a few days to come up with a name.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

"David E. Wheeler"

Дата:

06 мая 2014 г., 21:31:40

On May 6, 2014, at 2:20 PM, Bruce Momjian <bruce@momjian.us> wrote:

>> Stuck on the naming question.  I'd be willing to do the patch legwork
>> if we had a consensus (or even a proposal) for what to rename the
>> current jsonb_ops to.
> 
> Well, then, we only have a few days to come up with a name.

What are the options?

D

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Tom Lane

Дата:

06 мая 2014 г., 22:21:06

"David E. Wheeler" <david@justatheory.com> writes:
> On May 6, 2014, at 2:20 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> Well, then, we only have a few days to come up with a name.

> What are the options?

We have no proposals as yet.

I've been looking at the source code to try to understand the difference
between the two opclasses (and BTW I concur with the opinions expressed
recently about the poor state of the internal documentation for jsonb).
If I've got it straight:

jsonb_ops indexes keys and values separately, so for instance "{xyz: 2}"
would give rise to GIN entries that are effectively the strings "Kxyz"
and "V2".  If you're looking for tuples containing "{xyz: 2}" then you
would be looking for the AND of those independent index entries, which
fortunately GIN is pretty good at computing.  But you could also look
for just keys or just values.

jsonb_hash_ops creates an index entry only for values, but what it
stores is a hash of both the value and the key it's stored under.
So in this example you'd get a hash combining "xyz" and "2".  This
means the only type of query you can perform is like "find JSON tuples
containing {xyz: 2}".

Because jsonb_ops stores the *whole* value, you can do lossless index
searches (no recheck needed on the heap tuple), but you also run the
risk of long strings failing to fit into an index entry.  Since jsonb_ops
reduces everything to a hash, there's no possibility of index failure,
but all queries are lossy and require recheck.

TBH, at this point I'm sort of agreeing with the thought expressed
upthread that maybe neither of these should be the default as-is.
They seem like rather arbitrary combinations of choices.  In particular
I wonder why there's not an option to store keys and values separately,
but as hashes not as the original strings, so that indexability of
everything could be guaranteed.  Or a variant of that might be to hash
only strings that are too large to fit in an index entry, and force
recheck only when searching for a string that needed hashing.

I wonder whether the most effective use of time at this point
wouldn't be to fix jsonb_ops to do that, rather than arguing about
what to rename it to.  If it didn't have the failure-for-long-strings
problem I doubt anybody would be unhappy about making it the default.
        regards, tom lane

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Peter Geoghegan

Дата:

06 мая 2014 г., 22:29:51

On Tue, May 6, 2014 at 3:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wonder whether the most effective use of time at this point
> wouldn't be to fix jsonb_ops to do that, rather than arguing about
> what to rename it to.  If it didn't have the failure-for-long-strings
> problem I doubt anybody would be unhappy about making it the default.

I would expect the selectivity of keys on their own to be very low
with idiomatic usage of jsonb. Typically, every row in a table will
have almost the same keys. The current default opclass makes more
sense for when that isn't the case.

-- 
Peter Geoghegan

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Tom Lane

Дата:

06 мая 2014 г., 22:39:48

Peter Geoghegan <pg@heroku.com> writes:
> On Tue, May 6, 2014 at 3:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I wonder whether the most effective use of time at this point
>> wouldn't be to fix jsonb_ops to do that, rather than arguing about
>> what to rename it to.  If it didn't have the failure-for-long-strings
>> problem I doubt anybody would be unhappy about making it the default.

> I would expect the selectivity of keys on their own to be very low
> with idiomatic usage of jsonb. Typically, every row in a table will
> have almost the same keys. The current default opclass makes more
> sense for when that isn't the case.

Meh.  I would not think that that represents effective use of JSON:
if the rows are all the same, why aren't you exposing that structure
as regular SQL columns?  IMHO, the value of JSON fields within a SQL
table is to deal with data that is not so well structured.

In any case, it was certainly the complaint that insertions might
fail altogether that made me (and I assume others) want to not have
jsonb_ops as the default opclass.  Is there a good reason not to
fix that limitation while we still can?
        regards, tom lane

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

"David E. Wheeler"

Дата:

06 мая 2014 г., 22:54:35

On May 6, 2014, at 3:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Meh.  I would not think that that represents effective use of JSON:
> if the rows are all the same, why aren't you exposing that structure
> as regular SQL columns?  IMHO, the value of JSON fields within a SQL
> table is to deal with data that is not so well structured.

The use of JSON will not be ideal -- not in this sense. For example, at $work, we’re using it in place of an EAV model.
Hencemost rows have the same keys (or a subset of known keys). Or think of your favorite JSON API: every call to
http://api.pgxn.org/user/$username.jsonis going to have a very similar structure. 

> In any case, it was certainly the complaint that insertions might
> fail altogether that made me (and I assume others) want to not have
> jsonb_ops as the default opclass.  Is there a good reason not to
> fix that limitation while we still can?

Fixing++

David

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Peter Geoghegan

Дата:

06 мая 2014 г., 23:01:21

On Tue, May 6, 2014 at 3:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Meh.  I would not think that that represents effective use of JSON:
> if the rows are all the same, why aren't you exposing that structure
> as regular SQL columns?  IMHO, the value of JSON fields within a SQL
> table is to deal with data that is not so well structured.

I used to think that. That actually isn't what people want from a JSON
type, though. People want a flexible data model, but they still
realize that if within a table/"collection" everything is totally
heterogeneous, it becomes impossible to effectively query. They don't
want to run migrations. Or, maybe they are consuming JSON from a
third-party API, and have no control over the schema, even though it
is really is a schema (already represented as JSON, making jsonb a
compelling representation) -- that's a very common use case. It's much
more compelling to store semi-structured data as JSON. Totally
unstructured data just isn't that interesting.

Don't take my word for it, though. See
http://docs.mongodb.org/manual/data-modeling, for example. There is an
implicit assumption throughout that most documents within a MongoDB
collection have the same keys. The choice to not separately index keys
in the GIN hash opclass is far from arbitrary, even if you don't agree
with it.

> In any case, it was certainly the complaint that insertions might
> fail altogether that made me (and I assume others) want to not have
> jsonb_ops as the default opclass.  Is there a good reason not to
> fix that limitation while we still can?

I have no objection to either changing the default, or having no default.

-- 
Peter Geoghegan

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Bruce Momjian

Дата:

08 мая 2014 г., 13:47:11

On Tue, May  6, 2014 at 06:20:53PM -0400, Tom Lane wrote:
> "David E. Wheeler" <david@justatheory.com> writes:
> > On May 6, 2014, at 2:20 PM, Bruce Momjian <bruce@momjian.us> wrote:
> >> Well, then, we only have a few days to come up with a name.
> 
> > What are the options?
> 
> We have no proposals as yet.
> 
> I've been looking at the source code to try to understand the difference
> between the two opclasses (and BTW I concur with the opinions expressed
> recently about the poor state of the internal documentation for jsonb).
> If I've got it straight:
> 
> jsonb_ops indexes keys and values separately, so for instance "{xyz: 2}"
> would give rise to GIN entries that are effectively the strings "Kxyz"
> and "V2".  If you're looking for tuples containing "{xyz: 2}" then you
> would be looking for the AND of those independent index entries, which
> fortunately GIN is pretty good at computing.  But you could also look
> for just keys or just values.
> 
> jsonb_hash_ops creates an index entry only for values, but what it
> stores is a hash of both the value and the key it's stored under.
> So in this example you'd get a hash combining "xyz" and "2".  This
> means the only type of query you can perform is like "find JSON tuples
> containing {xyz: 2}".

Good summary, thanks.  This is the information I was hoping we had in
our docs.  How does hstore deal with these issues?

> Because jsonb_ops stores the *whole* value, you can do lossless index
> searches (no recheck needed on the heap tuple), but you also run the
> risk of long strings failing to fit into an index entry.  Since jsonb_ops
> reduces everything to a hash, there's no possibility of index failure,
> but all queries are lossy and require recheck.
> 
> TBH, at this point I'm sort of agreeing with the thought expressed
> upthread that maybe neither of these should be the default as-is.
> They seem like rather arbitrary combinations of choices.  In particular
> I wonder why there's not an option to store keys and values separately,
> but as hashes not as the original strings, so that indexability of
> everything could be guaranteed.  Or a variant of that might be to hash
> only strings that are too large to fit in an index entry, and force
> recheck only when searching for a string that needed hashing.
> 
> I wonder whether the most effective use of time at this point
> wouldn't be to fix jsonb_ops to do that, rather than arguing about
> what to rename it to.  If it didn't have the failure-for-long-strings
> problem I doubt anybody would be unhappy about making it the default.

Can we hash just the values, not the keys, in jsonb_ops, and hash the
combo in jsonb_hash_ops.  That would give us key-only lookups without a
recheck.

How do we index long strings now?  Is it he combination of GIN and long
strings that is the problem?

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Tom Lane

Дата:

08 мая 2014 г., 14:17:08

Bruce Momjian <bruce@momjian.us> writes:
> On Tue, May  6, 2014 at 06:20:53PM -0400, Tom Lane wrote:
>> I wonder why there's not an option to store keys and values separately,
>> but as hashes not as the original strings, so that indexability of
>> everything could be guaranteed.  Or a variant of that might be to hash
>> only strings that are too large to fit in an index entry, and force
>> recheck only when searching for a string that needed hashing.
>> 
>> I wonder whether the most effective use of time at this point
>> wouldn't be to fix jsonb_ops to do that, rather than arguing about
>> what to rename it to.  If it didn't have the failure-for-long-strings
>> problem I doubt anybody would be unhappy about making it the default.

> Can we hash just the values, not the keys, in jsonb_ops, and hash the
> combo in jsonb_hash_ops.  That would give us key-only lookups without a
> recheck.

No, because there's nothing in JSON limiting the length of keys, any more
than values.

I think the idea of hashing only keys/values that are "too long" is a
reasonable compromise.  I've not finished coding it (because I keep
getting distracted by other problems in the code :-() but it does not
look to be very difficult.  I'm envisioning the cutoff as being something
like 128 bytes; in practice that would mean that few if any keys get
hashed, I think.
        regards, tom lane

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Bruce Momjian

Дата:

08 мая 2014 г., 14:34:13

On Thu, May  8, 2014 at 10:16:54AM -0400, Tom Lane wrote:
> > Can we hash just the values, not the keys, in jsonb_ops, and hash the
> > combo in jsonb_hash_ops.  That would give us key-only lookups without a
> > recheck.
> 
> No, because there's nothing in JSON limiting the length of keys, any more
> than values.
> 
> I think the idea of hashing only keys/values that are "too long" is a
> reasonable compromise.  I've not finished coding it (because I keep
> getting distracted by other problems in the code :-() but it does not
> look to be very difficult.  I'm envisioning the cutoff as being something
> like 128 bytes; in practice that would mean that few if any keys get
> hashed, I think.

Yes, that would be nice.  Ideally we would not be doing this so close to
beta, but it is what it is, and if we need to break binary compatibility
after beta1, at least we have pg_upgrade.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Andres Freund

Дата:

08 мая 2014 г., 14:37:17

On 2014-05-08 10:34:04 -0400, Bruce Momjian wrote:
> On Thu, May  8, 2014 at 10:16:54AM -0400, Tom Lane wrote:
> > > Can we hash just the values, not the keys, in jsonb_ops, and hash the
> > > combo in jsonb_hash_ops.  That would give us key-only lookups without a
> > > recheck.
> > 
> > No, because there's nothing in JSON limiting the length of keys, any more
> > than values.
> > 
> > I think the idea of hashing only keys/values that are "too long" is a
> > reasonable compromise.  I've not finished coding it (because I keep
> > getting distracted by other problems in the code :-() but it does not
> > look to be very difficult.  I'm envisioning the cutoff as being something
> > like 128 bytes; in practice that would mean that few if any keys get
> > hashed, I think.
> 
> Yes, that would be nice.  Ideally we would not be doing this so close to
> beta, but it is what it is, and if we need to break binary compatibility
> after beta1, at least we have pg_upgrade.

If we break binary compatibility pg_upgrade won't be able to help. Since
the data files wont be, err, binary compatibile.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Bruce Momjian

Дата:

08 мая 2014 г., 14:40:22

On Thu, May  8, 2014 at 04:37:05PM +0200, Andres Freund wrote:
> On 2014-05-08 10:34:04 -0400, Bruce Momjian wrote:
> > On Thu, May  8, 2014 at 10:16:54AM -0400, Tom Lane wrote:
> > > > Can we hash just the values, not the keys, in jsonb_ops, and hash the
> > > > combo in jsonb_hash_ops.  That would give us key-only lookups without a
> > > > recheck.
> > > 
> > > No, because there's nothing in JSON limiting the length of keys, any more
> > > than values.
> > > 
> > > I think the idea of hashing only keys/values that are "too long" is a
> > > reasonable compromise.  I've not finished coding it (because I keep
> > > getting distracted by other problems in the code :-() but it does not
> > > look to be very difficult.  I'm envisioning the cutoff as being something
> > > like 128 bytes; in practice that would mean that few if any keys get
> > > hashed, I think.
> > 
> > Yes, that would be nice.  Ideally we would not be doing this so close to
> > beta, but it is what it is, and if we need to break binary compatibility
> > after beta1, at least we have pg_upgrade.
> 
> If we break binary compatibility pg_upgrade won't be able to help. Since
> the data files wont be, err, binary compatibile.

Oops, yeah.  pg_upgrade only helps with system table changes. We would
have to require users to dump/reload any changed tables or recreate any
indexs.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Tom Lane

Дата:

08 мая 2014 г., 22:39:31

I wrote:
> I think the idea of hashing only keys/values that are "too long" is a
> reasonable compromise.  I've not finished coding it (because I keep
> getting distracted by other problems in the code :-() but it does not
> look to be very difficult.  I'm envisioning the cutoff as being something
> like 128 bytes; in practice that would mean that few if any keys get
> hashed, I think.

Attached is a draft patch for this.  In addition to the hash logic per se,
I made these changes:

* Replaced the K/V prefix bytes with a code that distinguishes the types
of JSON values.  While this is not of any huge significance for the
current index search operators, it's basically free to store the info,
so I think we should do it for possible future use.

* Fixed the problem with "exists" returning rows it shouldn't.  I
concluded that the best fix is just to force recheck for exists, which
allows considerable simplification in the consistent functions.

* Tried to improve the comments in jsonb_gin.c.

Barring objections I'll commit this tomorrow, and also try to improve the
user-facing documentation about the jsonb opclasses.

            regards, tom lane

diff --git a/src/backend/utils/adt/jsonb_gin.c b/src/backend/utils/adt/jsonb_gin.c
index 592036a..2c4ade2 100644
*** a/src/backend/utils/adt/jsonb_gin.c
--- b/src/backend/utils/adt/jsonb_gin.c
***************
*** 14,19 ****
--- 14,20 ----
  #include "postgres.h"

  #include "access/gin.h"
+ #include "access/hash.h"
  #include "access/skey.h"
  #include "catalog/pg_collation.h"
  #include "catalog/pg_type.h"
*************** typedef struct PathHashStack
*** 26,39 ****
      struct PathHashStack *parent;
  } PathHashStack;

! static text *make_text_key(const char *str, int len, char flag);
! static text *make_scalar_key(const JsonbValue *scalarVal, char flag);

  /*
   *
   * jsonb_ops GIN opclass support functions
   *
   */
  Datum
  gin_compare_jsonb(PG_FUNCTION_ARGS)
  {
--- 27,41 ----
      struct PathHashStack *parent;
  } PathHashStack;

! static Datum make_text_key(char flag, const char *str, int len);
! static Datum make_scalar_key(const JsonbValue *scalarVal, bool is_key);

  /*
   *
   * jsonb_ops GIN opclass support functions
   *
   */
+
  Datum
  gin_compare_jsonb(PG_FUNCTION_ARGS)
  {
*************** gin_extract_jsonb(PG_FUNCTION_ARGS)
*** 65,144 ****
  {
      Jsonb       *jb = (Jsonb *) PG_GETARG_JSONB(0);
      int32       *nentries = (int32 *) PG_GETARG_POINTER(1);
-     Datum       *entries = NULL;
      int            total = 2 * JB_ROOT_COUNT(jb);
-     int            i = 0,
-                 r;
      JsonbIterator *it;
      JsonbValue    v;

      if (total == 0)
      {
          *nentries = 0;
          PG_RETURN_POINTER(NULL);
      }

      entries = (Datum *) palloc(sizeof(Datum) * total);

      it = JsonbIteratorInit(&jb->root);

      while ((r = JsonbIteratorNext(&it, &v, false)) != WJB_DONE)
      {
          if (i >= total)
          {
              total *= 2;
              entries = (Datum *) repalloc(entries, sizeof(Datum) * total);
          }

-         /*
-          * Serialize keys and elements equivalently,  but only when elements
-          * are Jsonb strings.  Otherwise, serialize elements as values.  Array
-          * elements are indexed as keys, for the benefit of
-          * JsonbExistsStrategyNumber.  Our definition of existence does not
-          * allow for checking the existence of a non-jbvString element (just
-          * like the definition of the underlying operator), because the
-          * operator takes a text rhs argument (which is taken as a proxy for
-          * an equivalent Jsonb string).
-          *
-          * The way existence is represented does not preclude an alternative
-          * existence operator, that takes as its rhs value an arbitrarily
-          * internally-typed Jsonb.  The only reason that isn't the case here
-          * is that the existence operator is only really intended to determine
-          * if an object has a certain key (object pair keys are of course
-          * invariably strings), which is extended to jsonb arrays.  You could
-          * think of the default Jsonb definition of existence as being
-          * equivalent to a definition where all types of scalar array elements
-          * are keys that we can check the existence of, while just forbidding
-          * non-string notation.  This inflexibility prevents the user from
-          * having to qualify that the rhs string is a raw scalar string (that
-          * is, naturally no internal string quoting in required for the text
-          * argument), and allows us to not set the reset flag for
-          * JsonbExistsStrategyNumber, since we know that keys are strings for
-          * both objects and arrays, and don't have to further account for type
-          * mismatch.  Not having to set the reset flag makes it less than
-          * tempting to tighten up the definition of existence to preclude
-          * array elements entirely, which would arguably be a simpler
-          * alternative. In any case the infrastructure used to implement the
-          * existence operator could trivially support this hypothetical,
-          * slightly distinct definition of existence.
-          */
          switch (r)
          {
              case WJB_KEY:
!                 /* Serialize key separately, for existence strategies */
!                 entries[i++] = PointerGetDatum(make_scalar_key(&v, JKEYELEM));
                  break;
              case WJB_ELEM:
!                 if (v.type == jbvString)
!                     entries[i++] = PointerGetDatum(make_scalar_key(&v, JKEYELEM));
!                 else
!                     entries[i++] = PointerGetDatum(make_scalar_key(&v, JVAL));
                  break;
              case WJB_VALUE:
!                 entries[i++] = PointerGetDatum(make_scalar_key(&v, JVAL));
                  break;
              default:
!                 continue;
          }
      }

--- 67,115 ----
  {
      Jsonb       *jb = (Jsonb *) PG_GETARG_JSONB(0);
      int32       *nentries = (int32 *) PG_GETARG_POINTER(1);
      int            total = 2 * JB_ROOT_COUNT(jb);
      JsonbIterator *it;
      JsonbValue    v;
+     int            i = 0,
+                 r;
+     Datum       *entries;

+     /* If the root level is empty, we certainly have no keys */
      if (total == 0)
      {
          *nentries = 0;
          PG_RETURN_POINTER(NULL);
      }

+     /* Otherwise, use 2 * root count as initial estimate of result size */
      entries = (Datum *) palloc(sizeof(Datum) * total);

      it = JsonbIteratorInit(&jb->root);

      while ((r = JsonbIteratorNext(&it, &v, false)) != WJB_DONE)
      {
+         /* Since we recurse into the object, we might need more space */
          if (i >= total)
          {
              total *= 2;
              entries = (Datum *) repalloc(entries, sizeof(Datum) * total);
          }

          switch (r)
          {
              case WJB_KEY:
!                 entries[i++] = make_scalar_key(&v, true);
                  break;
              case WJB_ELEM:
!                 /* Pretend string array elements are keys, see jsonb.h */
!                 entries[i++] = make_scalar_key(&v, (v.type == jbvString));
                  break;
              case WJB_VALUE:
!                 entries[i++] = make_scalar_key(&v, false);
                  break;
              default:
!                 /* we can ignore structural items */
!                 break;
          }
      }

*************** gin_extract_jsonb_query(PG_FUNCTION_ARGS
*** 168,192 ****
      }
      else if (strategy == JsonbExistsStrategyNumber)
      {
          text       *query = PG_GETARG_TEXT_PP(0);
-         text       *item;

          *nentries = 1;
          entries = (Datum *) palloc(sizeof(Datum));
!         item = make_text_key(VARDATA_ANY(query), VARSIZE_ANY_EXHDR(query),
!                              JKEYELEM);
!         entries[0] = PointerGetDatum(item);
      }
      else if (strategy == JsonbExistsAnyStrategyNumber ||
               strategy == JsonbExistsAllStrategyNumber)
      {
          ArrayType  *query = PG_GETARG_ARRAYTYPE_P(0);
          Datum       *key_datums;
          bool       *key_nulls;
          int            key_count;
          int            i,
                      j;
-         text       *item;

          deconstruct_array(query,
                            TEXTOID, -1, false, 'i',
--- 139,163 ----
      }
      else if (strategy == JsonbExistsStrategyNumber)
      {
+         /* Query is a text string, which we treat as a key */
          text       *query = PG_GETARG_TEXT_PP(0);

          *nentries = 1;
          entries = (Datum *) palloc(sizeof(Datum));
!         entries[0] = make_text_key(JGINFLAG_KEY,
!                                    VARDATA_ANY(query),
!                                    VARSIZE_ANY_EXHDR(query));
      }
      else if (strategy == JsonbExistsAnyStrategyNumber ||
               strategy == JsonbExistsAllStrategyNumber)
      {
+         /* Query is a text array; each element is treated as a key */
          ArrayType  *query = PG_GETARG_ARRAYTYPE_P(0);
          Datum       *key_datums;
          bool       *key_nulls;
          int            key_count;
          int            i,
                      j;

          deconstruct_array(query,
                            TEXTOID, -1, false, 'i',
*************** gin_extract_jsonb_query(PG_FUNCTION_ARGS
*** 194,208 ****

          entries = (Datum *) palloc(sizeof(Datum) * key_count);

!         for (i = 0, j = 0; i < key_count; ++i)
          {
              /* Nulls in the array are ignored */
              if (key_nulls[i])
                  continue;
!             item = make_text_key(VARDATA(key_datums[i]),
!                                  VARSIZE(key_datums[i]) - VARHDRSZ,
!                                  JKEYELEM);
!             entries[j++] = PointerGetDatum(item);
          }

          *nentries = j;
--- 165,178 ----

          entries = (Datum *) palloc(sizeof(Datum) * key_count);

!         for (i = 0, j = 0; i < key_count; i++)
          {
              /* Nulls in the array are ignored */
              if (key_nulls[i])
                  continue;
!             entries[j++] = make_text_key(JGINFLAG_KEY,
!                                          VARDATA_ANY(key_datums[i]),
!                                          VARSIZE_ANY_EXHDR(key_datums[i]));
          }

          *nentries = j;
*************** gin_consistent_jsonb(PG_FUNCTION_ARGS)
*** 236,248 ****
      if (strategy == JsonbContainsStrategyNumber)
      {
          /*
!          * Index doesn't have information about correspondence of Jsonb keys
!          * and values (as distinct from GIN keys, which a key/value pair is
!          * stored as), so invariably we recheck.  Besides, there are some
!          * special rules around the containment of raw scalar arrays and
!          * regular arrays that are not represented here.  However, if all of
!          * the keys are not present, that's sufficient reason to return false
!          * and finish immediately.
           */
          *recheck = true;
          for (i = 0; i < nkeys; i++)
--- 206,217 ----
      if (strategy == JsonbContainsStrategyNumber)
      {
          /*
!          * We must always recheck, since we can't tell from the index whether
!          * the positions of the matched items match the structure of the query
!          * object.  (Even if we could, we'd also have to worry about hashed
!          * keys and the index's failure to distinguish keys from string array
!          * elements.)  However, the tuple certainly doesn't match unless it
!          * contains all the query keys.
           */
          *recheck = true;
          for (i = 0; i < nkeys; i++)
*************** gin_consistent_jsonb(PG_FUNCTION_ARGS)
*** 256,275 ****
      }
      else if (strategy == JsonbExistsStrategyNumber)
      {
!         /* Existence of key guaranteed in default search mode */
!         *recheck = false;
          res = true;
      }
      else if (strategy == JsonbExistsAnyStrategyNumber)
      {
!         /* Existence of key guaranteed in default search mode */
!         *recheck = false;
          res = true;
      }
      else if (strategy == JsonbExistsAllStrategyNumber)
      {
!         /* Testing for the presence of all keys gives an exact result */
!         *recheck = false;
          for (i = 0; i < nkeys; i++)
          {
              if (!check[i])
--- 225,251 ----
      }
      else if (strategy == JsonbExistsStrategyNumber)
      {
!         /*
!          * Although the key is certainly present in the index, we must recheck
!          * because (1) the key might be hashed, and (2) the index match might
!          * be for a key that's not at top level of the JSON object.  For (1),
!          * we could look at the query key to see if it's hashed and not
!          * recheck if not, but the index lacks enough info to tell about (2).
!          */
!         *recheck = true;
          res = true;
      }
      else if (strategy == JsonbExistsAnyStrategyNumber)
      {
!         /* As for plain exists, we must recheck */
!         *recheck = true;
          res = true;
      }
      else if (strategy == JsonbExistsAllStrategyNumber)
      {
!         /* As for plain exists, we must recheck */
!         *recheck = true;
!         /* ... but unless all the keys are present, we can say "false" */
          for (i = 0; i < nkeys; i++)
          {
              if (!check[i])
*************** gin_triconsistent_jsonb(PG_FUNCTION_ARGS
*** 295,313 ****
      int32        nkeys = PG_GETARG_INT32(3);

      /* Pointer       *extra_data = (Pointer *) PG_GETARG_POINTER(4); */
!     GinTernaryValue res = GIN_TRUE;
!
      int32        i;

!     if (strategy == JsonbContainsStrategyNumber)
      {
!         bool        has_maybe = false;
!
!         /*
!          * All extracted keys must be present.  Combination of GIN_MAYBE and
!          * GIN_TRUE gives GIN_MAYBE result because then all keys may be
!          * present.
!          */
          for (i = 0; i < nkeys; i++)
          {
              if (check[i] == GIN_FALSE)
--- 271,288 ----
      int32        nkeys = PG_GETARG_INT32(3);

      /* Pointer       *extra_data = (Pointer *) PG_GETARG_POINTER(4); */
!     GinTernaryValue res = GIN_MAYBE;
      int32        i;

!     /*
!      * Note that we never return GIN_TRUE, only GIN_MAYBE or GIN_FALSE; this
!      * corresponds to always forcing recheck in the regular consistent
!      * function, for the reasons listed there.
!      */
!     if (strategy == JsonbContainsStrategyNumber ||
!         strategy == JsonbExistsAllStrategyNumber)
      {
!         /* All extracted keys must be present */
          for (i = 0; i < nkeys; i++)
          {
              if (check[i] == GIN_FALSE)
*************** gin_triconsistent_jsonb(PG_FUNCTION_ARGS
*** 315,369 ****
                  res = GIN_FALSE;
                  break;
              }
-             if (check[i] == GIN_MAYBE)
-             {
-                 res = GIN_MAYBE;
-                 has_maybe = true;
-             }
          }
-
-         /*
-          * Index doesn't have information about correspondence of Jsonb keys
-          * and values (as distinct from GIN keys, which a key/value pair is
-          * stored as), so invariably we recheck.  This is also reflected in
-          * how GIN_MAYBE is given in response to there being no GIN_MAYBE
-          * input.
-          */
-         if (!has_maybe && res == GIN_TRUE)
-             res = GIN_MAYBE;
      }
      else if (strategy == JsonbExistsStrategyNumber ||
               strategy == JsonbExistsAnyStrategyNumber)
      {
!         /* Existence of key guaranteed in default search mode */
          res = GIN_FALSE;
          for (i = 0; i < nkeys; i++)
          {
!             if (check[i] == GIN_TRUE)
!             {
!                 res = GIN_TRUE;
!                 break;
!             }
!             if (check[i] == GIN_MAYBE)
              {
                  res = GIN_MAYBE;
-             }
-         }
-     }
-     else if (strategy == JsonbExistsAllStrategyNumber)
-     {
-         /* Testing for the presence of all keys gives an exact result */
-         for (i = 0; i < nkeys; i++)
-         {
-             if (check[i] == GIN_FALSE)
-             {
-                 res = GIN_FALSE;
                  break;
              }
-             if (check[i] == GIN_MAYBE)
-             {
-                 res = GIN_MAYBE;
-             }
          }
      }
      else
--- 290,310 ----
                  res = GIN_FALSE;
                  break;
              }
          }
      }
      else if (strategy == JsonbExistsStrategyNumber ||
               strategy == JsonbExistsAnyStrategyNumber)
      {
!         /* At least one extracted key must be present */
          res = GIN_FALSE;
          for (i = 0; i < nkeys; i++)
          {
!             if (check[i] == GIN_TRUE ||
!                 check[i] == GIN_MAYBE)
              {
                  res = GIN_MAYBE;
                  break;
              }
          }
      }
      else
*************** gin_triconsistent_jsonb(PG_FUNCTION_ARGS
*** 376,382 ****
--- 317,330 ----
   *
   * jsonb_hash_ops GIN opclass support functions
   *
+  * In a jsonb_hash_ops index, the keys are uint32 hashes, one per value; but
+  * the key(s) leading to each value are also included in its hash computation.
+  * This means we can only support containment queries, but the index can
+  * distinguish, for example, {"foo": 42} from {"bar": 42} since different
+  * hashes will be generated.
+  *
   */
+
  Datum
  gin_consistent_jsonb_hash(PG_FUNCTION_ARGS)
  {
*************** gin_consistent_jsonb_hash(PG_FUNCTION_AR
*** 395,407 ****
          elog(ERROR, "unrecognized strategy number: %d", strategy);

      /*
!      * jsonb_hash_ops index doesn't have information about correspondence of
!      * Jsonb keys and values (as distinct from GIN keys, which a key/value
!      * pair is stored as), so invariably we recheck.  Besides, there are some
       * special rules around the containment of raw scalar arrays and regular
!      * arrays that are not represented here.  However, if all of the keys are
!      * not present, that's sufficient reason to return false and finish
!      * immediately.
       */
      *recheck = true;
      for (i = 0; i < nkeys; i++)
--- 343,355 ----
          elog(ERROR, "unrecognized strategy number: %d", strategy);

      /*
!      * jsonb_hash_ops is necessarily lossy, not only because of hash
!      * collisions but also because it doesn't preserve complete information
!      * about the structure of the JSON object.  Besides, there are some
       * special rules around the containment of raw scalar arrays and regular
!      * arrays that are not handled here.  So we must always recheck a match.
!      * However, if not all of the keys are present, the tuple certainly
!      * doesn't match.
       */
      *recheck = true;
      for (i = 0; i < nkeys; i++)
*************** gin_triconsistent_jsonb_hash(PG_FUNCTION
*** 426,442 ****
      int32        nkeys = PG_GETARG_INT32(3);

      /* Pointer       *extra_data = (Pointer *) PG_GETARG_POINTER(4); */
!     GinTernaryValue res = GIN_TRUE;
      int32        i;
-     bool        has_maybe = false;

      if (strategy != JsonbContainsStrategyNumber)
          elog(ERROR, "unrecognized strategy number: %d", strategy);

      /*
!      * All extracted keys must be present.  A combination of GIN_MAYBE and
!      * GIN_TRUE induces a GIN_MAYBE result, because then all keys may be
!      * present.
       */
      for (i = 0; i < nkeys; i++)
      {
--- 374,389 ----
      int32        nkeys = PG_GETARG_INT32(3);

      /* Pointer       *extra_data = (Pointer *) PG_GETARG_POINTER(4); */
!     GinTernaryValue res = GIN_MAYBE;
      int32        i;

      if (strategy != JsonbContainsStrategyNumber)
          elog(ERROR, "unrecognized strategy number: %d", strategy);

      /*
!      * Note that we never return GIN_TRUE, only GIN_MAYBE or GIN_FALSE; this
!      * corresponds to always forcing recheck in the regular consistent
!      * function, for the reasons listed there.
       */
      for (i = 0; i < nkeys; i++)
      {
*************** gin_triconsistent_jsonb_hash(PG_FUNCTION
*** 445,467 ****
              res = GIN_FALSE;
              break;
          }
-         if (check[i] == GIN_MAYBE)
-         {
-             res = GIN_MAYBE;
-             has_maybe = true;
-         }
      }

-     /*
-      * jsonb_hash_ops index doesn't have information about correspondence of
-      * Jsonb keys and values (as distinct from GIN keys, which for this
-      * opclass are a hash of a pair, or a hash of just an element), so
-      * invariably we recheck.  This is also reflected in how GIN_MAYBE is
-      * given in response to there being no GIN_MAYBE input.
-      */
-     if (!has_maybe && res == GIN_TRUE)
-         res = GIN_MAYBE;
-
      PG_RETURN_GIN_TERNARY_VALUE(res);
  }

--- 392,399 ----
*************** gin_extract_jsonb_hash(PG_FUNCTION_ARGS)
*** 477,502 ****
      PathHashStack *stack;
      int            i = 0,
                  r;
!     Datum       *entries = NULL;

      if (total == 0)
      {
          *nentries = 0;
          PG_RETURN_POINTER(NULL);
      }

      entries = (Datum *) palloc(sizeof(Datum) * total);

!     it = JsonbIteratorInit(&jb->root);
!
      tail.parent = NULL;
      tail.hash = 0;
      stack = &tail;

      while ((r = JsonbIteratorNext(&it, &v, false)) != WJB_DONE)
      {
!         PathHashStack *tmp;

          if (i >= total)
          {
              total *= 2;
--- 409,438 ----
      PathHashStack *stack;
      int            i = 0,
                  r;
!     Datum       *entries;

+     /* If the root level is empty, we certainly have no keys */
      if (total == 0)
      {
          *nentries = 0;
          PG_RETURN_POINTER(NULL);
      }

+     /* Otherwise, use 2 * root count as initial estimate of result size */
      entries = (Datum *) palloc(sizeof(Datum) * total);

!     /* We keep a stack of hashes corresponding to parent key levels */
      tail.parent = NULL;
      tail.hash = 0;
      stack = &tail;

+     it = JsonbIteratorInit(&jb->root);
+
      while ((r = JsonbIteratorNext(&it, &v, false)) != WJB_DONE)
      {
!         PathHashStack *parent;

+         /* Since we recurse into the object, we might need more space */
          if (i >= total)
          {
              total *= 2;
*************** gin_extract_jsonb_hash(PG_FUNCTION_ARGS)
*** 507,521 ****
          {
              case WJB_BEGIN_ARRAY:
              case WJB_BEGIN_OBJECT:
!                 tmp = stack;
                  stack = (PathHashStack *) palloc(sizeof(PathHashStack));

!                 /*
!                  * Nesting an array within another array will not alter
!                  * innermost scalar element hash values, but that seems
!                  * inconsequential
!                  */
!                 if (tmp->parent)
                  {
                      /*
                       * We pass forward hashes from previous container nesting
--- 443,453 ----
          {
              case WJB_BEGIN_ARRAY:
              case WJB_BEGIN_OBJECT:
!                 /* Push a stack level for this object */
!                 parent = stack;
                  stack = (PathHashStack *) palloc(sizeof(PathHashStack));

!                 if (parent->parent)
                  {
                      /*
                       * We pass forward hashes from previous container nesting
*************** gin_extract_jsonb_hash(PG_FUNCTION_ARGS)
*** 524,561 ****
                       * outermost key.  It's also somewhat useful to have
                       * nested objects innermost values have hashes that are a
                       * function of not just their own key, but outer keys too.
                       */
!                     stack->hash = tmp->hash;
                  }
                  else
                  {
                      /*
!                      * At least nested level, initialize with stable container
!                      * type proxy value
                       */
                      stack->hash = (r == WJB_BEGIN_ARRAY) ? JB_FARRAY : JB_FOBJECT;
                  }
!                 stack->parent = tmp;
                  break;
              case WJB_KEY:
!                 /* Initialize hash from parent */
                  stack->hash = stack->parent->hash;
                  JsonbHashScalarValue(&v, &stack->hash);
                  break;
              case WJB_ELEM:
!                 /* Elements have parent hash mixed in separately */
                  stack->hash = stack->parent->hash;
              case WJB_VALUE:
!                 /* Element/value case */
                  JsonbHashScalarValue(&v, &stack->hash);
                  entries[i++] = UInt32GetDatum(stack->hash);
                  break;
              case WJB_END_ARRAY:
              case WJB_END_OBJECT:
                  /* Pop the stack */
!                 tmp = stack->parent;
                  pfree(stack);
!                 stack = tmp;
                  break;
              default:
                  elog(ERROR, "invalid JsonbIteratorNext rc: %d", r);
--- 456,504 ----
                       * outermost key.  It's also somewhat useful to have
                       * nested objects innermost values have hashes that are a
                       * function of not just their own key, but outer keys too.
+                      *
+                      * Nesting an array within another array will not alter
+                      * innermost scalar element hash values, but that seems
+                      * inconsequential.
                       */
!                     stack->hash = parent->hash;
                  }
                  else
                  {
                      /*
!                      * At the outermost level, initialize hash with container
!                      * type proxy value.  Note that this makes JB_FARRAY and
!                      * JB_FOBJECT part of the on-disk representation, but they
!                      * are that in the base jsonb object storage already.
                       */
                      stack->hash = (r == WJB_BEGIN_ARRAY) ? JB_FARRAY : JB_FOBJECT;
                  }
!                 stack->parent = parent;
                  break;
              case WJB_KEY:
!                 /* initialize hash from parent */
                  stack->hash = stack->parent->hash;
+                 /* and mix in this key */
                  JsonbHashScalarValue(&v, &stack->hash);
+                 /* hash is now ready to incorporate the value */
                  break;
              case WJB_ELEM:
!                 /* array elements use parent hash mixed with element's hash */
                  stack->hash = stack->parent->hash;
+                 /* FALL THRU */
              case WJB_VALUE:
!                 /* mix the element or value's hash into the prepared hash */
                  JsonbHashScalarValue(&v, &stack->hash);
+                 /* and emit an index entry */
                  entries[i++] = UInt32GetDatum(stack->hash);
+                 /* Note: we assume we'll see KEY before another VALUE */
                  break;
              case WJB_END_ARRAY:
              case WJB_END_OBJECT:
                  /* Pop the stack */
!                 parent = stack->parent;
                  pfree(stack);
!                 stack = parent;
                  break;
              default:
                  elog(ERROR, "invalid JsonbIteratorNext rc: %d", r);
*************** gin_extract_jsonb_query_hash(PG_FUNCTION
*** 592,605 ****
  }

  /*
!  * Build a text value from a cstring and flag suitable for storage as a key
!  * value
   */
! static text *
! make_text_key(const char *str, int len, char flag)
  {
      text       *item;

      item = (text *) palloc(VARHDRSZ + len + 1);
      SET_VARSIZE(item, VARHDRSZ + len + 1);

--- 535,563 ----
  }

  /*
!  * Construct a GIN key from a flag byte and a textual representation
!  * (which need not be null-terminated).  This function is responsible
!  * for hashing overlength text representations; it will add the
!  * JGINFLAG_HASHED bit to the flag value if it does that.
   */
! static Datum
! make_text_key(char flag, const char *str, int len)
  {
      text       *item;
+     char        hashbuf[10];
+
+     if (len > JGIN_MAXLENGTH)
+     {
+         uint32        hashval;

+         hashval = DatumGetUInt32(hash_any((const unsigned char *) str, len));
+         snprintf(hashbuf, sizeof(hashbuf), "%08x", hashval);
+         str = hashbuf;
+         len = 8;
+         flag |= JGINFLAG_HASHED;
+     }
+
+     /* Now build the text Datum */
      item = (text *) palloc(VARHDRSZ + len + 1);
      SET_VARSIZE(item, VARHDRSZ + len + 1);

*************** make_text_key(const char *str, int len,
*** 607,637 ****

      memcpy(VARDATA(item) + 1, str, len);

!     return item;
  }

  /*
!  * Create a textual representation of a jsonbValue for GIN storage.
   */
! static text *
! make_scalar_key(const JsonbValue *scalarVal, char flag)
  {
!     text       *item;
      char       *cstr;

      switch (scalarVal->type)
      {
          case jbvNull:
!             item = make_text_key("n", 1, flag);
              break;
          case jbvBool:
!             item = make_text_key(scalarVal->val.boolean ? "t" : "f", 1, flag);
              break;
          case jbvNumeric:

              /*
!              * A normalized textual representation, free of trailing zeroes is
!              * is required.
               *
               * It isn't ideal that numerics are stored in a relatively bulky
               * textual format.  However, it's a notationally convenient way of
--- 565,603 ----

      memcpy(VARDATA(item) + 1, str, len);

!     return PointerGetDatum(item);
  }

  /*
!  * Create a textual representation of a JsonbValue that will serve as a GIN
!  * key in a jsonb_ops index.  is_key is true if the JsonbValue is a key,
!  * or if it is a string array element (since we pretend those are keys,
!  * see jsonb.h).
   */
! static Datum
! make_scalar_key(const JsonbValue *scalarVal, bool is_key)
  {
!     Datum        item;
      char       *cstr;

      switch (scalarVal->type)
      {
          case jbvNull:
!             Assert(!is_key);
!             item = make_text_key(JGINFLAG_NULL, "", 0);
              break;
          case jbvBool:
!             Assert(!is_key);
!             item = make_text_key(JGINFLAG_BOOL,
!                                  scalarVal->val.boolean ? "t" : "f", 1);
              break;
          case jbvNumeric:
+             Assert(!is_key);

              /*
!              * A normalized textual representation, free of trailing zeroes,
!              * is required so that numerically equal values will produce equal
!              * strings.
               *
               * It isn't ideal that numerics are stored in a relatively bulky
               * textual format.  However, it's a notationally convenient way of
*************** make_scalar_key(const JsonbValue *scalar
*** 639,653 ****
               * strings takes precedence.
               */
              cstr = numeric_normalize(scalarVal->val.numeric);
!             item = make_text_key(cstr, strlen(cstr), flag);
              pfree(cstr);
              break;
          case jbvString:
!             item = make_text_key(scalarVal->val.string.val, scalarVal->val.string.len,
!                                  flag);
              break;
          default:
!             elog(ERROR, "invalid jsonb scalar type");
      }

      return item;
--- 605,622 ----
               * strings takes precedence.
               */
              cstr = numeric_normalize(scalarVal->val.numeric);
!             item = make_text_key(JGINFLAG_NUM, cstr, strlen(cstr));
              pfree(cstr);
              break;
          case jbvString:
!             item = make_text_key(is_key ? JGINFLAG_KEY : JGINFLAG_STR,
!                                  scalarVal->val.string.val,
!                                  scalarVal->val.string.len);
              break;
          default:
!             elog(ERROR, "unrecognized jsonb scalar type: %d", scalarVal->type);
!             item = 0;            /* keep compiler quiet */
!             break;
      }

      return item;
diff --git a/src/include/utils/jsonb.h b/src/include/utils/jsonb.h
index fc746c8..1a6409a 100644
*** a/src/include/utils/jsonb.h
--- b/src/include/utils/jsonb.h
*************** typedef enum
*** 29,53 ****
      WJB_END_OBJECT
  } JsonbIteratorToken;

! /*
!  * When using a GIN index for jsonb, we choose to index both keys and values.
!  * The storage format is text, with K, or V prepended to the string to indicate
!  * key/element or value/element.
!  *
!  * Jsonb Keys and string array elements are treated equivalently when
!  * serialized to text index storage.  One day we may wish to create an opclass
!  * that only indexes values, but for now keys and values are stored in GIN
!  * indexes in a way that doesn't really consider their relationship to each
!  * other.
!  */
! #define JKEYELEM    'K'
! #define JVAL        'V'
!
  #define JsonbContainsStrategyNumber        7
  #define JsonbExistsStrategyNumber        9
  #define JsonbExistsAnyStrategyNumber    10
  #define JsonbExistsAllStrategyNumber    11

  /* Convenience macros */
  #define DatumGetJsonb(d)    ((Jsonb *) PG_DETOAST_DATUM(d))
  #define JsonbGetDatum(p)    PointerGetDatum(p)
--- 29,69 ----
      WJB_END_OBJECT
  } JsonbIteratorToken;

! /* Strategy numbers for GIN index opclasses */
  #define JsonbContainsStrategyNumber        7
  #define JsonbExistsStrategyNumber        9
  #define JsonbExistsAnyStrategyNumber    10
  #define JsonbExistsAllStrategyNumber    11

+ /*
+  * In the standard jsonb_ops GIN opclass for jsonb, we choose to index both
+  * keys and values.  The storage format is text.  The first byte of the text
+  * string distinguishes whether this is a key (always a string), null value,
+  * boolean value, numeric value, or string value.  However, array elements
+  * that are strings are marked as though they were keys; this imprecision
+  * supports the definition of the "exists" operator, which treats array
+  * elements like keys.  The remainder of the text string is empty for a null
+  * value, "t" or "f" for a boolean value, a normalized print representation of
+  * a numeric value, or the text of a string value.  However, if the length of
+  * this text representation would exceed JGIN_MAXLENGTH bytes, we instead hash
+  * the text representation and store an 8-hex-digit representation of the
+  * uint32 hash value, marking the prefix byte with an additional bit to
+  * distinguish that this has happened.  Hashing long strings saves space and
+  * ensures that we won't overrun the maximum entry length for a GIN index.
+  * (But JGIN_MAXLENGTH is quite a bit shorter than GIN's limit.  It's chosen
+  * to ensure that the on-disk text datum will have a short varlena header.)
+  * Note that when any hashed item appears in a query, we must recheck index
+  * matches against the heap tuple; currently, this costs nothing because we
+  * must always recheck for other reasons.
+  */
+ #define JGINFLAG_KEY    0x01    /* key (or string array element) */
+ #define JGINFLAG_NULL    0x02    /* null value */
+ #define JGINFLAG_BOOL    0x03    /* boolean value */
+ #define JGINFLAG_NUM    0x04    /* numeric value */
+ #define JGINFLAG_STR    0x05    /* string value (if not an array element) */
+ #define JGINFLAG_HASHED 0x10    /* OR'd into flag if value was hashed */
+ #define JGIN_MAXLENGTH    125        /* max length of text part before hashing */
+
  /* Convenience macros */
  #define DatumGetJsonb(d)    ((Jsonb *) PG_DETOAST_DATUM(d))
  #define JsonbGetDatum(p)    PointerGetDatum(p)

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Bruce Momjian

Дата:

09 мая 2014 г., 03:34:14

On Thu, May  8, 2014 at 06:39:11PM -0400, Tom Lane wrote:
> I wrote:
> > I think the idea of hashing only keys/values that are "too long" is a
> > reasonable compromise.  I've not finished coding it (because I keep
> > getting distracted by other problems in the code :-() but it does not
> > look to be very difficult.  I'm envisioning the cutoff as being something
> > like 128 bytes; in practice that would mean that few if any keys get
> > hashed, I think.
> 
> Attached is a draft patch for this.  In addition to the hash logic per se,
> I made these changes:
> 
> * Replaced the K/V prefix bytes with a code that distinguishes the types
> of JSON values.  While this is not of any huge significance for the
> current index search operators, it's basically free to store the info,
> so I think we should do it for possible future use.
> 
> * Fixed the problem with "exists" returning rows it shouldn't.  I
> concluded that the best fix is just to force recheck for exists, which
> allows considerable simplification in the consistent functions.
> 
> * Tried to improve the comments in jsonb_gin.c.
> 
> Barring objections I'll commit this tomorrow, and also try to improve the
> user-facing documentation about the jsonb opclasses.

Looks good.  I was thinking the jsonb_ops name could remain unchanged
and the jsonb_hash_ops could be called jsonb_combo_ops as it combines
the key and value into a single index entry.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Gavin Flower

Дата:

09 мая 2014 г., 04:10:21

On 09/05/14 15:34, Bruce Momjian wrote:
> On Thu, May  8, 2014 at 06:39:11PM -0400, Tom Lane wrote:
>> I wrote:
>>> I think the idea of hashing only keys/values that are "too long" is a
>>> reasonable compromise.  I've not finished coding it (because I keep
>>> getting distracted by other problems in the code :-() but it does not
>>> look to be very difficult.  I'm envisioning the cutoff as being something
>>> like 128 bytes; in practice that would mean that few if any keys get
>>> hashed, I think.
>> Attached is a draft patch for this.  In addition to the hash logic per se,
>> I made these changes:
>>
>> * Replaced the K/V prefix bytes with a code that distinguishes the types
>> of JSON values.  While this is not of any huge significance for the
>> current index search operators, it's basically free to store the info,
>> so I think we should do it for possible future use.
>>
>> * Fixed the problem with "exists" returning rows it shouldn't.  I
>> concluded that the best fix is just to force recheck for exists, which
>> allows considerable simplification in the consistent functions.
>>
>> * Tried to improve the comments in jsonb_gin.c.
>>
>> Barring objections I'll commit this tomorrow, and also try to improve the
>> user-facing documentation about the jsonb opclasses.
> Looks good.  I was thinking the jsonb_ops name could remain unchanged
> and the jsonb_hash_ops could be called jsonb_combo_ops as it combines
> the key and value into a single index entry.
>
If you have 'jsonb_combo_ops' - then surely 'jsonb_op' should be called 
'jsonb_xxx_ops', where the 'xxx' distinguishes that from 
'jsonb_combo_ops'?  I guess, if any appropriate wording of 'xxx' was too 
cumbersome, then it would be worse.


Cheers,
Gavin

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Peter Geoghegan

Дата:

09 мая 2014 г., 05:20:18

On Thu, May 8, 2014 at 3:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Barring objections I'll commit this tomorrow

Looks good to me.

-- 
Peter Geoghegan

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Tom Lane

Дата:

09 мая 2014 г., 11:04:38

Gavin Flower <GavinFlower@archidevsys.co.nz> writes:
> On 09/05/14 15:34, Bruce Momjian wrote:
>> Looks good.  I was thinking the jsonb_ops name could remain unchanged
>> and the jsonb_hash_ops could be called jsonb_combo_ops as it combines
>> the key and value into a single index entry.

> If you have 'jsonb_combo_ops' - then surely 'jsonb_op' should be called 
> 'jsonb_xxx_ops', where the 'xxx' distinguishes that from 
> 'jsonb_combo_ops'?  I guess, if any appropriate wording of 'xxx' was too 
> cumbersome, then it would be worse.

Yeah, I'm disinclined to change the opclass names now.  It's not apparent
to me that "combo" is a better choice than "hash" for the second opclass.
        regards, tom lane

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Bruce Momjian

Дата:

09 мая 2014 г., 13:53:50

On Fri, May  9, 2014 at 07:04:17AM -0400, Tom Lane wrote:
> Gavin Flower <GavinFlower@archidevsys.co.nz> writes:
> > On 09/05/14 15:34, Bruce Momjian wrote:
> >> Looks good.  I was thinking the jsonb_ops name could remain unchanged
> >> and the jsonb_hash_ops could be called jsonb_combo_ops as it combines
> >> the key and value into a single index entry.
> 
> > If you have 'jsonb_combo_ops' - then surely 'jsonb_op' should be called 
> > 'jsonb_xxx_ops', where the 'xxx' distinguishes that from 
> > 'jsonb_combo_ops'?  I guess, if any appropriate wording of 'xxx' was too 
> > cumbersome, then it would be worse.
> 
> Yeah, I'm disinclined to change the opclass names now.  It's not apparent
> to me that "combo" is a better choice than "hash" for the second opclass.

Well, if we are optionally hashing json_ops for long strings, what does
jsonb_hash_ops do uniquely with hashing?  Does it always hash, while
json_ops optionally hashes?  Is that the distinguishing characteristic? 
It seemed the _content_ of the indexed value was more important, rather
than the storage method.

Should jsonb_hash_ops do only optional hashing too for long strings?

Also, with json_ops, when you index '{"exterior" : "white", "interior":
"blue"}', if you query for {"exterior" : "blue"}, does it match and have
to be rechecked in the heap because the index doesn't know which values
go with which keys?

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Bruce Momjian

Дата:

09 мая 2014 г., 14:27:01

On Fri, May  9, 2014 at 09:53:36AM -0400, Bruce Momjian wrote:
> On Fri, May  9, 2014 at 07:04:17AM -0400, Tom Lane wrote:
> > Gavin Flower <GavinFlower@archidevsys.co.nz> writes:
> > > On 09/05/14 15:34, Bruce Momjian wrote:
> > >> Looks good.  I was thinking the jsonb_ops name could remain unchanged
> > >> and the jsonb_hash_ops could be called jsonb_combo_ops as it combines
> > >> the key and value into a single index entry.
> > 
> > > If you have 'jsonb_combo_ops' - then surely 'jsonb_op' should be called 
> > > 'jsonb_xxx_ops', where the 'xxx' distinguishes that from 
> > > 'jsonb_combo_ops'?  I guess, if any appropriate wording of 'xxx' was too 
> > > cumbersome, then it would be worse.
> > 
> > Yeah, I'm disinclined to change the opclass names now.  It's not apparent
> > to me that "combo" is a better choice than "hash" for the second opclass.
> 
> Well, if we are optionally hashing json_ops for long strings, what does
> jsonb_hash_ops do uniquely with hashing?  Does it always hash, while
> json_ops optionally hashes?  Is that the distinguishing characteristic? 
> It seemed the _content_ of the indexed value was more important, rather
> than the storage method.

Also, are people going to think that jsonb_hash_ops creates a hash
index, which is not crash safe, even though it is a GIN index?  Do we
have this "hash" confusion anywhere else?

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Greg Stark

Дата:

09 мая 2014 г., 14:27:20

On Fri, May 9, 2014 at 2:53 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Well, if we are optionally hashing json_ops for long strings, what does
> jsonb_hash_ops do uniquely with hashing?  Does it always hash, while
> json_ops optionally hashes?  Is that the distinguishing characteristic?
> It seemed the _content_ of the indexed value was more important, rather
> than the storage method.
>
> Should jsonb_hash_ops do only optional hashing too for long strings?


Well the question seems to me to be that if we're always doing recheck
then what advantage is there to not hashing everything?

-- 
greg

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Andres Freund

Дата:

09 мая 2014 г., 14:32:26

On 2014-05-09 10:26:48 -0400, Bruce Momjian wrote:
> On Fri, May  9, 2014 at 09:53:36AM -0400, Bruce Momjian wrote:
> > On Fri, May  9, 2014 at 07:04:17AM -0400, Tom Lane wrote:
> > > Gavin Flower <GavinFlower@archidevsys.co.nz> writes:
> > > > On 09/05/14 15:34, Bruce Momjian wrote:
> > > >> Looks good.  I was thinking the jsonb_ops name could remain unchanged
> > > >> and the jsonb_hash_ops could be called jsonb_combo_ops as it combines
> > > >> the key and value into a single index entry.
> > > 
> > > > If you have 'jsonb_combo_ops' - then surely 'jsonb_op' should be called 
> > > > 'jsonb_xxx_ops', where the 'xxx' distinguishes that from 
> > > > 'jsonb_combo_ops'?  I guess, if any appropriate wording of 'xxx' was too 
> > > > cumbersome, then it would be worse.
> > > 
> > > Yeah, I'm disinclined to change the opclass names now.  It's not apparent
> > > to me that "combo" is a better choice than "hash" for the second opclass.
> > 
> > Well, if we are optionally hashing json_ops for long strings, what does
> > jsonb_hash_ops do uniquely with hashing?  Does it always hash, while
> > json_ops optionally hashes?  Is that the distinguishing characteristic? 
> > It seemed the _content_ of the indexed value was more important, rather
> > than the storage method.
> 
> Also, are people going to think that jsonb_hash_ops creates a hash
> index, which is not crash safe, even though it is a GIN index?  Do we
> have this "hash" confusion anywhere else?

The operator class has to be specified after the USING GIN in CREATE
INDEX so I think that rest is neglegible.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Tom Lane

Дата:

09 мая 2014 г., 20:44:53

Greg Stark <stark@mit.edu> writes:
> Well the question seems to me to be that if we're always doing recheck
> then what advantage is there to not hashing everything?

Right now, there's not much.  But it seems likely to me that there will be
more JSON operators in future, and some of them might be able to make use
of the additional specificity of unhashed entries.  For example, it's only
a very arbitrary definitional choice for the exists operator (ie, not
looking into sub-objects) that makes jsonb_ops lossy for it.  We might
eventually build a recursive-exists-check operator for which the index
could be lossless, at least up to the string length where we start to
hash.
        regards, tom lane

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Heikki Linnakangas

Дата:

10 мая 2014 г., 20:42:48

On 05/09/2014 11:44 PM, Tom Lane wrote:
> Greg Stark <stark@mit.edu> writes:
>> Well the question seems to me to be that if we're always doing recheck
>> then what advantage is there to not hashing everything?
>
> Right now, there's not much.  But it seems likely to me that there will be
> more JSON operators in future, and some of them might be able to make use
> of the additional specificity of unhashed entries.  For example, it's only
> a very arbitrary definitional choice for the exists operator (ie, not
> looking into sub-objects) that makes jsonb_ops lossy for it.  We might
> eventually build a recursive-exists-check operator for which the index
> could be lossless, at least up to the string length where we start to
> hash.

Back to the naming:

The main difference between the two opclasses from a user's standpoint 
is not whether they hash or not. The big difference is that one indexes 
complete paths from the root, and the other indexes just the "leaf" 
level. For example, if you have an object like '{"foo": {"bar": 123 } 
}', one will index "foo", "foo->bar", and "foo->bar->123" while the 
other will index "foo", "bar" and "123".

Whether the opclasses use hashing to shorten the key is an orthogonal 
property, and IMHO not as important. To reflect that, I suggest that we 
name the opclasses:

json_path_ops
json_value_ops

or something along those lines.

- Heikki

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Andrew Dunstan

Дата:

10 мая 2014 г., 21:01:10

On 05/10/2014 04:42 PM, Heikki Linnakangas wrote:
>
>
> The main difference between the two opclasses from a user's standpoint 
> is not whether they hash or not. The big difference is that one 
> indexes complete paths from the root, and the other indexes just the 
> "leaf" level. For example, if you have an object like '{"foo": {"bar": 
> 123 } }', one will index "foo", "foo->bar", and "foo->bar->123" while 
> the other will index "foo", "bar" and "123".
>
> Whether the opclasses use hashing to shorten the key is an orthogonal 
> property, and IMHO not as important. To reflect that, I suggest that 
> we name the opclasses:
>
> json_path_ops
> json_value_ops
>
> or something along those lines.
>
>


That looks like the first suggestion I've actually liked and that users 
will be able to understand.

cheers

andrew

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Oleg Bartunov

Дата:

10 мая 2014 г., 21:03:48

+1
but  bit confused with json instead of jsonb

On Sun, May 11, 2014 at 1:00 AM, Andrew Dunstan <andrew@dunslane.net> wrote:
>
> On 05/10/2014 04:42 PM, Heikki Linnakangas wrote:
>>
>>
>>
>> The main difference between the two opclasses from a user's standpoint is
>> not whether they hash or not. The big difference is that one indexes
>> complete paths from the root, and the other indexes just the "leaf" level.
>> For example, if you have an object like '{"foo": {"bar": 123 } }', one will
>> index "foo", "foo->bar", and "foo->bar->123" while the other will index
>> "foo", "bar" and "123".
>>
>> Whether the opclasses use hashing to shorten the key is an orthogonal
>> property, and IMHO not as important. To reflect that, I suggest that we name
>> the opclasses:
>>
>> json_path_ops
>> json_value_ops
>>
>> or something along those lines.
>>
>>
>
>
> That looks like the first suggestion I've actually liked and that users will
> be able to understand.
>
> cheers
>
> andrew
>
>
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Peter Geoghegan

Дата:

10 мая 2014 г., 21:52:56

On Sat, May 10, 2014 at 1:42 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> For example, if you have an object like '{"foo": {"bar": 123 } }', one will
> index "foo", "foo->bar", and "foo->bar->123" while the other will index
> "foo", "bar" and "123".

That isn't quite right, if we're talking about the user's perspective,
and concerns about what is made indexable. You cannot actually query
"foo->bar" there, because containment semantics don't support it.
Basically, you'd have to write that like this:

select * from docs where jdoc @> '{"foo":{"bar":*}'::jsonb;

I've added a wildcard to the rhs jsonb here, which of course won't
work, the proximate cause being that that simply isn't valid jsonb.
It's also something inherently impossible to support with the current
jsonb_hash_op's indexing strategy. That only hashes elements and
values, mixing in keys from all outer nesting levels (so it's possible
for there to be 0 *nentries). In the strict physical sense, it only
indexes "123" (with all keys mixed in to the hash) from your example,
because that's the only element or value that appears.

Your description of only indexing the "leaf level" is not ideal,
because it makes me think of B-Trees. Unlike B-Trees, jsonb can and
frequently will have jsonb_hash_ops-indexable primitive
values/elements at all nesting levels (that is, both "inner"
containers and "leaf" containers).

> Whether the opclasses use hashing to shorten the key is an orthogonal
> property, and IMHO not as important. To reflect that, I suggest that we name
> the opclasses:
>
> json_path_ops
> json_value_ops
>
> or something along those lines.

I would like to once again emphasize the user-level distinction
between the two: one (the text storage opclass) is better for
hstore-style indexing, where jsonb data is heterogeneous in structure,
most probably a hodge-podge of key/value pairs. The other (the hash
storage opclass) is better for document-database style use cases,
where keys alone have low selectivity, and there is an
unenforced/implicit schema that is mostly adhered to by client
applications. That's why I don't think it's much of a problem that the
example query above won't work (you could do something with
expressions, or expression indexes to make something work, but it
won't work with the text-storage opclass because there is no
appropriate operator). In general, for those document-database style
use cases, indexing keys alone isn't useful. Wherever that turns out
to be untrue, ad-hoc expression indexes of the text storage opclass
are probably the best solution.

Anyway, I agree with your general assessment; hashing is nothing more
than an implementation detail. I'm not sure which of your proposed
names is intended for which opclass, though. Do you mean
"jsonb_path_ops" and "jsonb_key_ops"? That makes perfect sense to me,
because indexing keys alone is the main user-level distinction.

-- 
Peter Geoghegan

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Tom Lane

Дата:

10 мая 2014 г., 21:56:01

Andrew Dunstan <andrew@dunslane.net> writes:
> On 05/10/2014 04:42 PM, Heikki Linnakangas wrote:
>> Whether the opclasses use hashing to shorten the key is an orthogonal 
>> property, and IMHO not as important. To reflect that, I suggest that 
>> we name the opclasses:
>> 
>> json_path_ops
>> json_value_ops
>> 
>> or something along those lines.

> That looks like the first suggestion I've actually liked and that users 
> will be able to understand.

I'd prefer to stick with just "jsonb_ops" for the default opclass.
Also, "jsonb_value_ops" is not le mot juste anyway, because that opclass
doesn't just index values it also indexes keys (if I've got the JSON
terminology straight).  However, I could accept renaming jsonb_hash_ops
to jsonb_path_ops, as long as we do it PDQ.
        regards, tom lane

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Peter Geoghegan

Дата:

11 мая 2014 г., 00:25:59

On Sat, May 10, 2014 at 2:52 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I've added a wildcard to the rhs jsonb here, which of course won't
> work, the proximate cause being that that simply isn't valid jsonb.
> It's also something inherently impossible to support with the current
> jsonb_hash_op's indexing strategy. That only hashes elements and
> values, mixing in keys from all outer nesting levels (so it's possible
> for there to be 0 *nentries).

It occurs to me that this could be a particular problem for
jsonb_hash_ops. Consider this jsonb:

'{"a":{}}'::jsonb

In one sense, the outermost level's "a" key does have a value: an
empty object. So we may test containment in an indexable fashion like
this:

select * from foo where j @> '{"a":{}}'::jsonb

But in another sense, the sense that is relevant to jsonb_hash_ops, it
does not. There would be *no* GIN keys passed back from
gin_extract_jsonb_hash() if it were tasked with extracting keys from
this rhs jsonb.

Now, I'm not all that worried about this, because this is surely an
odd-ball use case, particularly for jsonb_hash_ops where no keys are
separately indexed (separately from *primitive* elements/values).
However, it is worth noting in the documentation in my view. I attach
a doc patch that adds this.

--
Peter Geoghegan

Вложения

jsonb_structure_containment.patch

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Tom Lane

Дата:

11 мая 2014 г., 00:54:59

Peter Geoghegan <pg@heroku.com> writes:
> Now, I'm not all that worried about this, because this is surely an
> odd-ball use case, particularly for jsonb_hash_ops where no keys are
> separately indexed (separately from *primitive* elements/values).
> However, it is worth noting in the documentation in my view. I attach
> a doc patch that adds this.

Agreed, we'd better mention that somewhere.

I'm not sure whether we have consensus to rename jsonb_hash_ops to
jsonb_path_ops, but since time is so short I went ahead and made a draft
patch to do so (attached).  Probably the most interesting part of this is
the new text in json.sgml explaining the difference between the two
opclasses.  I also added a paragraph about the empty-query hazard that
Peter mentions.  Do people think this discussion is correct and useful?

            regards, tom lane

diff --git a/doc/src/sgml/gin.sgml b/doc/src/sgml/gin.sgml
index 0b3d6ee..1cbc73c 100644
*** a/doc/src/sgml/gin.sgml
--- b/doc/src/sgml/gin.sgml
***************
*** 395,401 ****
        </entry>
       </row>
       <row>
!       <entry><literal>jsonb_hash_ops</></entry>
        <entry><type>jsonb</></entry>
        <entry>
         <literal>@></>
--- 395,401 ----
        </entry>
       </row>
       <row>
!       <entry><literal>jsonb_path_ops</></entry>
        <entry><type>jsonb</></entry>
        <entry>
         <literal>@></>
***************
*** 415,421 ****

   <para>
    Of the two operator classes for type <type>jsonb</>, <literal>jsonb_ops</>
!   is the default.  <literal>jsonb_hash_ops</> supports fewer operators but
    offers better performance for those operators.
    See <xref linkend="json-indexing"> for details.
   </para>
--- 415,421 ----

   <para>
    Of the two operator classes for type <type>jsonb</>, <literal>jsonb_ops</>
!   is the default.  <literal>jsonb_path_ops</> supports fewer operators but
    offers better performance for those operators.
    See <xref linkend="json-indexing"> for details.
   </para>
diff --git a/doc/src/sgml/json.sgml b/doc/src/sgml/json.sgml
index 518fe63..52ad882 100644
*** a/doc/src/sgml/json.sgml
--- b/doc/src/sgml/json.sgml
***************
*** 156,162 ****
     </table>

   <sect2 id="json-keys-elements">
!   <title><type>jsonb</> Input and Output Syntax</title>
    <para>
     The input/output syntax for the JSON data types is as specified in
     <acronym>RFC</> 7159.
--- 156,162 ----
     </table>

   <sect2 id="json-keys-elements">
!   <title>JSON Input and Output Syntax</title>
    <para>
     The input/output syntax for the JSON data types is as specified in
     <acronym>RFC</> 7159.
*************** SELECT '"foo"'::jsonb ? 'foo';
*** 366,376 ****
    <programlisting>
  CREATE INDEX idxgin ON api USING gin (jdoc);
    </programlisting>
!     The non-default GIN operator class <literal>jsonb_hash_ops</>
      supports indexing the <literal>@></> operator only.
      An example of creating an index with this operator class is:
    <programlisting>
! CREATE INDEX idxginh ON api USING gin (jdoc jsonb_hash_ops);
    </programlisting>
    </para>

--- 366,376 ----
    <programlisting>
  CREATE INDEX idxgin ON api USING gin (jdoc);
    </programlisting>
!     The non-default GIN operator class <literal>jsonb_path_ops</>
      supports indexing the <literal>@></> operator only.
      An example of creating an index with this operator class is:
    <programlisting>
! CREATE INDEX idxginh ON api USING gin (jdoc jsonb_path_ops);
    </programlisting>
    </para>

*************** SELECT jdoc->'guid', jdoc->'name'
*** 444,453 ****
    </para>

    <para>
!     Although the <literal>jsonb_hash_ops</literal> operator class supports
      only queries with the <literal>@></> operator, it has notable
      performance advantages over the default operator
!     class <literal>jsonb_ops</literal>.  A <literal>jsonb_hash_ops</literal>
      index is usually much smaller than a <literal>jsonb_ops</literal>
      index over the same data, and the specificity of searches is better,
      particularly when queries contain keys that appear frequently in the
--- 444,453 ----
    </para>

    <para>
!     Although the <literal>jsonb_path_ops</literal> operator class supports
      only queries with the <literal>@></> operator, it has notable
      performance advantages over the default operator
!     class <literal>jsonb_ops</literal>.  A <literal>jsonb_path_ops</literal>
      index is usually much smaller than a <literal>jsonb_ops</literal>
      index over the same data, and the specificity of searches is better,
      particularly when queries contain keys that appear frequently in the
*************** SELECT jdoc->'guid', jdoc->'name'
*** 456,461 ****
--- 456,493 ----
    </para>

    <para>
+     The technical difference between a <literal>jsonb_ops</literal>
+     and a <literal>jsonb_path_ops</literal> GIN index is that the former
+     creates independent index items for each key and value in the data,
+     while the latter creates index items only for each value in the data.
+     But in <literal>jsonb_path_ops</literal>, each index item is a hash
+     of both the value and the key(s) leading to it; for example to index
+     <literal>{"foo": {"bar": "baz"}}</literal>, a single index item would
+     be created incorporating all three of <literal>foo</>, <literal>bar</>,
+     and <literal>baz</> into the hash value.  Thus a containment query
+     looking for this structure would result in an extremely specific index
+     search; but there is no way at all to find out whether <literal>foo</>
+     appears as a key.  On the other hand, a <literal>jsonb_ops</literal>
+     index would create three index items representing <literal>foo</>,
+     <literal>bar</>, and <literal>baz</> separately; then to do the
+     containment query, it would look for rows containing all three of
+     these keys.  While GIN indexes can perform such an AND search fairly
+     efficiently, it will still be less specific and slower than the
+     equivalent <literal>jsonb_path_ops</literal> search, especially if
+     there are a very large number of rows containing any single one of the
+     three keys.
+   </para>
+
+   <para>
+     A disadvantage of the <literal>jsonb_path_ops</literal> approach is
+     that it produces no index entries for JSON structures not containing
+     any values, such as <literal>{"a": {}}</literal>.  If a search for
+     documents containing such a structure is requested, it will require a
+     full-index scan, which is quite slow.  <literal>jsonb_path_ops</> is
+     therefore ill-suited for applications that perform such searches.
+   </para>
+
+   <para>
      <type>jsonb</> also supports <literal>btree</> and <literal>hash</>
      indexes.  These are usually useful only if it's important to check
      equality of complete JSON documents.
diff --git a/src/backend/utils/adt/jsonb_gin.c b/src/backend/utils/adt/jsonb_gin.c
index 57a0b2c..069ee03 100644
*** a/src/backend/utils/adt/jsonb_gin.c
--- b/src/backend/utils/adt/jsonb_gin.c
*************** gin_triconsistent_jsonb(PG_FUNCTION_ARGS
*** 315,323 ****

  /*
   *
!  * jsonb_hash_ops GIN opclass support functions
   *
!  * In a jsonb_hash_ops index, the GIN keys are uint32 hashes, one per JSON
   * value; but the JSON key(s) leading to each value are also included in its
   * hash computation.  This means we can only support containment queries,
   * but the index can distinguish, for example, {"foo": 42} from {"bar": 42}
--- 315,323 ----

  /*
   *
!  * jsonb_path_ops GIN opclass support functions
   *
!  * In a jsonb_path_ops index, the GIN keys are uint32 hashes, one per JSON
   * value; but the JSON key(s) leading to each value are also included in its
   * hash computation.  This means we can only support containment queries,
   * but the index can distinguish, for example, {"foo": 42} from {"bar": 42}
*************** gin_triconsistent_jsonb(PG_FUNCTION_ARGS
*** 326,332 ****
   */

  Datum
! gin_extract_jsonb_hash(PG_FUNCTION_ARGS)
  {
      Jsonb       *jb = PG_GETARG_JSONB(0);
      int32       *nentries = (int32 *) PG_GETARG_POINTER(1);
--- 326,332 ----
   */

  Datum
! gin_extract_jsonb_path(PG_FUNCTION_ARGS)
  {
      Jsonb       *jb = PG_GETARG_JSONB(0);
      int32       *nentries = (int32 *) PG_GETARG_POINTER(1);
*************** gin_extract_jsonb_hash(PG_FUNCTION_ARGS)
*** 349,355 ****
      /* Otherwise, use 2 * root count as initial estimate of result size */
      entries = (Datum *) palloc(sizeof(Datum) * total);

!     /* We keep a stack of hashes corresponding to parent key levels */
      tail.parent = NULL;
      tail.hash = 0;
      stack = &tail;
--- 349,355 ----
      /* Otherwise, use 2 * root count as initial estimate of result size */
      entries = (Datum *) palloc(sizeof(Datum) * total);

!     /* We keep a stack of partial hashes corresponding to parent key levels */
      tail.parent = NULL;
      tail.hash = 0;
      stack = &tail;
*************** gin_extract_jsonb_hash(PG_FUNCTION_ARGS)
*** 439,445 ****
  }

  Datum
! gin_extract_jsonb_query_hash(PG_FUNCTION_ARGS)
  {
      int32       *nentries = (int32 *) PG_GETARG_POINTER(1);
      StrategyNumber strategy = PG_GETARG_UINT16(2);
--- 439,445 ----
  }

  Datum
! gin_extract_jsonb_query_path(PG_FUNCTION_ARGS)
  {
      int32       *nentries = (int32 *) PG_GETARG_POINTER(1);
      StrategyNumber strategy = PG_GETARG_UINT16(2);
*************** gin_extract_jsonb_query_hash(PG_FUNCTION
*** 449,457 ****
      if (strategy != JsonbContainsStrategyNumber)
          elog(ERROR, "unrecognized strategy number: %d", strategy);

!     /* Query is a jsonb, so just apply gin_extract_jsonb_hash ... */
      entries = (Datum *)
!         DatumGetPointer(DirectFunctionCall2(gin_extract_jsonb_hash,
                                              PG_GETARG_DATUM(0),
                                              PointerGetDatum(nentries)));

--- 449,457 ----
      if (strategy != JsonbContainsStrategyNumber)
          elog(ERROR, "unrecognized strategy number: %d", strategy);

!     /* Query is a jsonb, so just apply gin_extract_jsonb_path ... */
      entries = (Datum *)
!         DatumGetPointer(DirectFunctionCall2(gin_extract_jsonb_path,
                                              PG_GETARG_DATUM(0),
                                              PointerGetDatum(nentries)));

*************** gin_extract_jsonb_query_hash(PG_FUNCTION
*** 463,469 ****
  }

  Datum
! gin_consistent_jsonb_hash(PG_FUNCTION_ARGS)
  {
      bool       *check = (bool *) PG_GETARG_POINTER(0);
      StrategyNumber strategy = PG_GETARG_UINT16(1);
--- 463,469 ----
  }

  Datum
! gin_consistent_jsonb_path(PG_FUNCTION_ARGS)
  {
      bool       *check = (bool *) PG_GETARG_POINTER(0);
      StrategyNumber strategy = PG_GETARG_UINT16(1);
*************** gin_consistent_jsonb_hash(PG_FUNCTION_AR
*** 480,492 ****
          elog(ERROR, "unrecognized strategy number: %d", strategy);

      /*
!      * jsonb_hash_ops is necessarily lossy, not only because of hash
       * collisions but also because it doesn't preserve complete information
       * about the structure of the JSON object.  Besides, there are some
!      * special rules around the containment of raw scalar arrays and regular
!      * arrays that are not handled here.  So we must always recheck a match.
!      * However, if not all of the keys are present, the tuple certainly
!      * doesn't match.
       */
      *recheck = true;
      for (i = 0; i < nkeys; i++)
--- 480,491 ----
          elog(ERROR, "unrecognized strategy number: %d", strategy);

      /*
!      * jsonb_path_ops is necessarily lossy, not only because of hash
       * collisions but also because it doesn't preserve complete information
       * about the structure of the JSON object.  Besides, there are some
!      * special rules around the containment of raw scalars in arrays that are
!      * not handled here.  So we must always recheck a match.  However, if not
!      * all of the keys are present, the tuple certainly doesn't match.
       */
      *recheck = true;
      for (i = 0; i < nkeys; i++)
*************** gin_consistent_jsonb_hash(PG_FUNCTION_AR
*** 502,508 ****
  }

  Datum
! gin_triconsistent_jsonb_hash(PG_FUNCTION_ARGS)
  {
      GinTernaryValue *check = (GinTernaryValue *) PG_GETARG_POINTER(0);
      StrategyNumber strategy = PG_GETARG_UINT16(1);
--- 501,507 ----
  }

  Datum
! gin_triconsistent_jsonb_path(PG_FUNCTION_ARGS)
  {
      GinTernaryValue *check = (GinTernaryValue *) PG_GETARG_POINTER(0);
      StrategyNumber strategy = PG_GETARG_UINT16(1);
diff --git a/src/include/catalog/pg_amop.h b/src/include/catalog/pg_amop.h
index 8efd3be..264059f 100644
*** a/src/include/catalog/pg_amop.h
--- b/src/include/catalog/pg_amop.h
*************** DATA(insert (    4033   3802 3802 4 s    3245
*** 787,798 ****
  DATA(insert (    4033   3802 3802 5 s    3243 403 0 ));

  /*
!  * hash jsonb ops
   */
  DATA(insert (    4034   3802 3802 1 s 3240 405 0 ));

  /*
!  * GIN jsonb ops
   */
  DATA(insert (    4036   3802 3802 7 s 3246 2742 0 ));
  DATA(insert (    4036   3802 25 9 s 3247 2742 0 ));
--- 787,798 ----
  DATA(insert (    4033   3802 3802 5 s    3243 403 0 ));

  /*
!  * hash jsonb_ops
   */
  DATA(insert (    4034   3802 3802 1 s 3240 405 0 ));

  /*
!  * GIN jsonb_ops
   */
  DATA(insert (    4036   3802 3802 7 s 3246 2742 0 ));
  DATA(insert (    4036   3802 25 9 s 3247 2742 0 ));
*************** DATA(insert (    4036   3802 1009 10 s 3248
*** 800,806 ****
  DATA(insert (    4036   3802 1009 11 s 3249 2742 0 ));

  /*
!  * GIN jsonb hash ops
   */
  DATA(insert (    4037   3802 3802 7 s 3246 2742 0 ));

--- 800,806 ----
  DATA(insert (    4036   3802 1009 11 s 3249 2742 0 ));

  /*
!  * GIN jsonb_path_ops
   */
  DATA(insert (    4037   3802 3802 7 s 3246 2742 0 ));

diff --git a/src/include/catalog/pg_opclass.h b/src/include/catalog/pg_opclass.h
index ecf7063..3698886 100644
*** a/src/include/catalog/pg_opclass.h
--- b/src/include/catalog/pg_opclass.h
*************** DATA(insert (    4000    text_ops            PGNSP PGUI
*** 232,237 ****
  DATA(insert (    403        jsonb_ops            PGNSP PGUID 4033  3802 t 0 ));
  DATA(insert (    405        jsonb_ops            PGNSP PGUID 4034  3802 t 0 ));
  DATA(insert (    2742    jsonb_ops            PGNSP PGUID 4036  3802 t 25 ));
! DATA(insert (    2742    jsonb_hash_ops        PGNSP PGUID 4037  3802 f 23 ));

  #endif   /* PG_OPCLASS_H */
--- 232,237 ----
  DATA(insert (    403        jsonb_ops            PGNSP PGUID 4033  3802 t 0 ));
  DATA(insert (    405        jsonb_ops            PGNSP PGUID 4034  3802 t 0 ));
  DATA(insert (    2742    jsonb_ops            PGNSP PGUID 4036  3802 t 25 ));
! DATA(insert (    2742    jsonb_path_ops        PGNSP PGUID 4037  3802 f 23 ));

  #endif   /* PG_OPCLASS_H */
diff --git a/src/include/catalog/pg_opfamily.h b/src/include/catalog/pg_opfamily.h
index 9e8f4ac..c83ac8c 100644
*** a/src/include/catalog/pg_opfamily.h
--- b/src/include/catalog/pg_opfamily.h
*************** DATA(insert OID = 3474 (    4000    range_ops
*** 148,158 ****
  DATA(insert OID = 4015 (    4000    quad_point_ops    PGNSP PGUID ));
  DATA(insert OID = 4016 (    4000    kd_point_ops    PGNSP PGUID ));
  DATA(insert OID = 4017 (    4000    text_ops        PGNSP PGUID ));
  DATA(insert OID = 4033 (    403        jsonb_ops        PGNSP PGUID ));
  DATA(insert OID = 4034 (    405        jsonb_ops        PGNSP PGUID ));
  DATA(insert OID = 4035 (    783        jsonb_ops        PGNSP PGUID ));
  DATA(insert OID = 4036 (    2742    jsonb_ops        PGNSP PGUID ));
! DATA(insert OID = 4037 (    2742    jsonb_hash_ops    PGNSP PGUID ));
! #define TEXT_SPGIST_FAM_OID 4017

  #endif   /* PG_OPFAMILY_H */
--- 148,158 ----
  DATA(insert OID = 4015 (    4000    quad_point_ops    PGNSP PGUID ));
  DATA(insert OID = 4016 (    4000    kd_point_ops    PGNSP PGUID ));
  DATA(insert OID = 4017 (    4000    text_ops        PGNSP PGUID ));
+ #define TEXT_SPGIST_FAM_OID 4017
  DATA(insert OID = 4033 (    403        jsonb_ops        PGNSP PGUID ));
  DATA(insert OID = 4034 (    405        jsonb_ops        PGNSP PGUID ));
  DATA(insert OID = 4035 (    783        jsonb_ops        PGNSP PGUID ));
  DATA(insert OID = 4036 (    2742    jsonb_ops        PGNSP PGUID ));
! DATA(insert OID = 4037 (    2742    jsonb_path_ops    PGNSP PGUID ));

  #endif   /* PG_OPFAMILY_H */
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index e601ccd..72170af 100644
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
*************** DATA(insert OID = 3484 (  gin_consistent
*** 4645,4657 ****
  DESCR("GIN support");
  DATA(insert OID = 3488 (  gin_triconsistent_jsonb    PGNSP PGUID 12 1 0 0 0 f f f f t f i 7 0 16 "2281 21 2277 23
22812281 2281" _null_ _null_ _null_ _null_ gin_triconsistent_jsonb _null_ _null_ _null_ )); 
  DESCR("GIN support");
! DATA(insert OID = 3485 (  gin_extract_jsonb_hash  PGNSP PGUID 12 1 0 0 0 f f f f t f i 3 0 2281 "2281 2281 2281"
_null__null_ _null_ _null_ gin_extract_jsonb_hash _null_ _null_ _null_ )); 
  DESCR("GIN support");
! DATA(insert OID = 3486 (  gin_extract_jsonb_query_hash    PGNSP PGUID 12 1 0 0 0 f f f f t f i 7 0 2281 "2277 2281 21
22812281 2281 2281" _null_ _null_ _null_ _null_ gin_extract_jsonb_query_hash _null_ _null_ _null_ )); 
  DESCR("GIN support");
! DATA(insert OID = 3487 (  gin_consistent_jsonb_hash  PGNSP PGUID 12 1 0 0 0 f f f f t f i 8 0 16 "2281 21 2277 23
22812281 2281 2281" _null_ _null_ _null_ _null_ gin_consistent_jsonb_hash _null_ _null_ _null_ )); 
  DESCR("GIN support");
! DATA(insert OID = 3489 (  gin_triconsistent_jsonb_hash    PGNSP PGUID 12 1 0 0 0 f f f f t f i 7 0 16 "2281 21 2277
232281 2281 2281" _null_ _null_ _null_ _null_ gin_triconsistent_jsonb_hash _null_ _null_ _null_ )); 
  DESCR("GIN support");

  /* txid */
--- 4645,4657 ----
  DESCR("GIN support");
  DATA(insert OID = 3488 (  gin_triconsistent_jsonb    PGNSP PGUID 12 1 0 0 0 f f f f t f i 7 0 16 "2281 21 2277 23
22812281 2281" _null_ _null_ _null_ _null_ gin_triconsistent_jsonb _null_ _null_ _null_ )); 
  DESCR("GIN support");
! DATA(insert OID = 3485 (  gin_extract_jsonb_path  PGNSP PGUID 12 1 0 0 0 f f f f t f i 3 0 2281 "2281 2281 2281"
_null__null_ _null_ _null_ gin_extract_jsonb_path _null_ _null_ _null_ )); 
  DESCR("GIN support");
! DATA(insert OID = 3486 (  gin_extract_jsonb_query_path    PGNSP PGUID 12 1 0 0 0 f f f f t f i 7 0 2281 "2277 2281 21
22812281 2281 2281" _null_ _null_ _null_ _null_ gin_extract_jsonb_query_path _null_ _null_ _null_ )); 
  DESCR("GIN support");
! DATA(insert OID = 3487 (  gin_consistent_jsonb_path  PGNSP PGUID 12 1 0 0 0 f f f f t f i 8 0 16 "2281 21 2277 23
22812281 2281 2281" _null_ _null_ _null_ _null_ gin_consistent_jsonb_path _null_ _null_ _null_ )); 
  DESCR("GIN support");
! DATA(insert OID = 3489 (  gin_triconsistent_jsonb_path    PGNSP PGUID 12 1 0 0 0 f f f f t f i 7 0 16 "2281 21 2277
232281 2281 2281" _null_ _null_ _null_ _null_ gin_triconsistent_jsonb_path _null_ _null_ _null_ )); 
  DESCR("GIN support");

  /* txid */
diff --git a/src/include/utils/jsonb.h b/src/include/utils/jsonb.h
index bb8c380..add7628 100644
*** a/src/include/utils/jsonb.h
--- b/src/include/utils/jsonb.h
*************** extern Datum jsonb_eq(PG_FUNCTION_ARGS);
*** 332,349 ****
  extern Datum jsonb_cmp(PG_FUNCTION_ARGS);
  extern Datum jsonb_hash(PG_FUNCTION_ARGS);

! /* GIN support functions */
  extern Datum gin_compare_jsonb(PG_FUNCTION_ARGS);
  extern Datum gin_extract_jsonb(PG_FUNCTION_ARGS);
  extern Datum gin_extract_jsonb_query(PG_FUNCTION_ARGS);
  extern Datum gin_consistent_jsonb(PG_FUNCTION_ARGS);
  extern Datum gin_triconsistent_jsonb(PG_FUNCTION_ARGS);

! /* GIN hash opclass functions */
! extern Datum gin_extract_jsonb_hash(PG_FUNCTION_ARGS);
! extern Datum gin_extract_jsonb_query_hash(PG_FUNCTION_ARGS);
! extern Datum gin_consistent_jsonb_hash(PG_FUNCTION_ARGS);
! extern Datum gin_triconsistent_jsonb_hash(PG_FUNCTION_ARGS);

  /* Support functions */
  extern int    compareJsonbContainers(JsonbContainer *a, JsonbContainer *b);
--- 332,349 ----
  extern Datum jsonb_cmp(PG_FUNCTION_ARGS);
  extern Datum jsonb_hash(PG_FUNCTION_ARGS);

! /* GIN support functions for jsonb_ops */
  extern Datum gin_compare_jsonb(PG_FUNCTION_ARGS);
  extern Datum gin_extract_jsonb(PG_FUNCTION_ARGS);
  extern Datum gin_extract_jsonb_query(PG_FUNCTION_ARGS);
  extern Datum gin_consistent_jsonb(PG_FUNCTION_ARGS);
  extern Datum gin_triconsistent_jsonb(PG_FUNCTION_ARGS);

! /* GIN support functions for jsonb_path_ops */
! extern Datum gin_extract_jsonb_path(PG_FUNCTION_ARGS);
! extern Datum gin_extract_jsonb_query_path(PG_FUNCTION_ARGS);
! extern Datum gin_consistent_jsonb_path(PG_FUNCTION_ARGS);
! extern Datum gin_triconsistent_jsonb_path(PG_FUNCTION_ARGS);

  /* Support functions */
  extern int    compareJsonbContainers(JsonbContainer *a, JsonbContainer *b);
diff --git a/src/test/regress/expected/jsonb.out b/src/test/regress/expected/jsonb.out
index c5a7d64..ae7c506 100644
*** a/src/test/regress/expected/jsonb.out
--- b/src/test/regress/expected/jsonb.out
*************** SELECT count(*) FROM testjsonb WHERE j =
*** 1685,1693 ****
       1
  (1 row)

! --gin hash
  DROP INDEX jidx;
! CREATE INDEX jidx ON testjsonb USING gin (j jsonb_hash_ops);
  SET enable_seqscan = off;
  SELECT count(*) FROM testjsonb WHERE j @> '{"wait":null}';
   count
--- 1685,1693 ----
       1
  (1 row)

! --gin path opclass
  DROP INDEX jidx;
! CREATE INDEX jidx ON testjsonb USING gin (j jsonb_path_ops);
  SET enable_seqscan = off;
  SELECT count(*) FROM testjsonb WHERE j @> '{"wait":null}';
   count
diff --git a/src/test/regress/expected/jsonb_1.out b/src/test/regress/expected/jsonb_1.out
index 0e3ebd1..38a95b4 100644
*** a/src/test/regress/expected/jsonb_1.out
--- b/src/test/regress/expected/jsonb_1.out
*************** SELECT count(*) FROM testjsonb WHERE j =
*** 1685,1693 ****
       1
  (1 row)

! --gin hash
  DROP INDEX jidx;
! CREATE INDEX jidx ON testjsonb USING gin (j jsonb_hash_ops);
  SET enable_seqscan = off;
  SELECT count(*) FROM testjsonb WHERE j @> '{"wait":null}';
   count
--- 1685,1693 ----
       1
  (1 row)

! --gin path opclass
  DROP INDEX jidx;
! CREATE INDEX jidx ON testjsonb USING gin (j jsonb_path_ops);
  SET enable_seqscan = off;
  SELECT count(*) FROM testjsonb WHERE j @> '{"wait":null}';
   count
diff --git a/src/test/regress/sql/jsonb.sql b/src/test/regress/sql/jsonb.sql
index 3e90489..7527925 100644
*** a/src/test/regress/sql/jsonb.sql
--- b/src/test/regress/sql/jsonb.sql
*************** SET enable_seqscan = off;
*** 391,399 ****
  SELECT count(*) FROM testjsonb WHERE j > '{"p":1}';
  SELECT count(*) FROM testjsonb WHERE j = '{"pos":98, "line":371, "node":"CBA", "indexed":true}';

! --gin hash
  DROP INDEX jidx;
! CREATE INDEX jidx ON testjsonb USING gin (j jsonb_hash_ops);
  SET enable_seqscan = off;

  SELECT count(*) FROM testjsonb WHERE j @> '{"wait":null}';
--- 391,399 ----
  SELECT count(*) FROM testjsonb WHERE j > '{"p":1}';
  SELECT count(*) FROM testjsonb WHERE j = '{"pos":98, "line":371, "node":"CBA", "indexed":true}';

! --gin path opclass
  DROP INDEX jidx;
! CREATE INDEX jidx ON testjsonb USING gin (j jsonb_path_ops);
  SET enable_seqscan = off;

  SELECT count(*) FROM testjsonb WHERE j @> '{"wait":null}';

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Peter Geoghegan

Дата:

11 мая 2014 г., 01:31:46

On Sat, May 10, 2014 at 5:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm not sure whether we have consensus to rename jsonb_hash_ops to
> jsonb_path_ops, but since time is so short I went ahead and made a draft
> patch to do so (attached).  Probably the most interesting part of this is
> the new text in json.sgml explaining the difference between the two
> opclasses.  I also added a paragraph about the empty-query hazard that
> Peter mentions.  Do people think this discussion is correct and useful?

I for one am fine with the name change you propose.

> +     especially if
> +     there are a very large number of rows containing any single one of the
> +     three keys

I suggest that you phrase this as "three index items".

> +     A disadvantage of the <literal>jsonb_path_ops</literal> approach is
> +     that it produces no index entries for JSON structures not containing
> +     any values, such as <literal>{"a": {}}</literal>.  If a search for

I suggest "any values or elements".

Even though I previously called hashing an implementation detail, we
are bound to have to mention it in passing when discussing the
limitations of jsonb_hash_ops/jsonb_path_ops. I think that you should
proceed with committing the entire patch, including the doc changes
that discuss implementation details around the two GIN opclasses.

-- 
Peter Geoghegan

Re: default opclass for jsonb (was Re: Call for GIST/GIN/SP-GIST opclass documentation)

От

Tom Lane

Дата:

11 мая 2014 г., 02:09:57

Peter Geoghegan <pg@heroku.com> writes:
> On Sat, May 10, 2014 at 5:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> +     especially if
>> +     there are a very large number of rows containing any single one of the
>> +     three keys

> I suggest that you phrase this as "three index items".

Good idea --- "key" is overloaded in this discussion.  I'd meant to use
"item" uniformly for the index entries, but missed some spots.

>> +     A disadvantage of the <literal>jsonb_path_ops</literal> approach is
>> +     that it produces no index entries for JSON structures not containing
>> +     any values, such as <literal>{"a": {}}</literal>.  If a search for

> I suggest "any values or elements".

Meh --- the previous para is also using "value" to include array elements,
and I don't see anything in RFC 7159 suggesting that that's not preferred
terminology.  But I added a footnote to clarify:
   The technical difference between a <literal>jsonb_ops</literal>   and a <literal>jsonb_path_ops</literal> GIN index
isthat the former   creates independent index items for each key and value in the data,   while the latter creates
indexitems only for each value in the   data.<footnote><para>For this purpose, the term <quote>value</>   includes
arrayelements, though JSON terminology sometimes considers   array elements distinct from values within
objects.</para></footnote>

> Even though I previously called hashing an implementation detail, we
> are bound to have to mention it in passing when discussing the
> limitations of jsonb_hash_ops/jsonb_path_ops. I think that you should
> proceed with committing the entire patch, including the doc changes
> that discuss implementation details around the two GIN opclasses.

I'll hold off committing till the morning to see if there are objections.
        regards, tom lane

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Call for GIST/GIN/SP-GIST opclass documentation

Вложения