Thread: working around JSONB's lack of stats?


working around JSONB's lack of stats?

From: Josh Berkus
Date:
Folks,

Currently, JSONB fields don't have statistics, and the planner assumes
a flat 1% selectivity for them.  This can result in poor query plans,
and I'm wondering if anyone has a suggested workaround for this, short
of hacking a new selectivity function.  For example, take the common
case of using JSONB to hold a list of "tags" for tagging documents:

 Table "public.doc_tags_json"
 Column |  Type   | Modifiers
--------+---------+-----------
 doc_id | integer |
 tags   | jsonb   |
Indexes:
    "doc_tags_json_doc_id_idx" UNIQUE, btree (doc_id)
    "doc_tags_json_tags_idx" gin (tags)

This query:

select doc_id
from doc_tags_json
where tags @> '[ "math", "physics" ]'
order by doc_id desc limit 25;

Uses this plan:


                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=0.43..709.79 rows=25 width=4) (actual time=24.529..340.499 rows=25 loops=1)
   ->  Index Scan Backward using doc_tags_json_doc_id_idx on doc_tags_json  (cost=0.43..283740.95 rows=10000 width=4) (actual time=24.528..340.483 rows=25 loops=1)
         Filter: (tags @> '["math", "physics"]'::jsonb)
         Rows Removed by Filter: 1011878
 Planning time: 0.090 ms
 Execution time: 340.528 ms

It does this because it expects @> '["math", "physics"]' to match 10,000
rows, which means that it expects to only scan 25,000 entries in the
doc_id index to return the top 25.  However, the matching condition is
much rarer than it thinks, so it's actually far faster to use the index
on the JSONB column:

drop index doc_tags_json_doc_id_idx;

                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=10517.08..10517.14 rows=25 width=4) (actual time=7.594..7.602 rows=25 loops=1)
   ->  Sort  (cost=10517.08..10542.08 rows=10000 width=4) (actual time=7.593..7.596 rows=25 loops=1)
         Sort Key: doc_id
         Sort Method: top-N heapsort  Memory: 26kB
         ->  Bitmap Heap Scan on doc_tags_json  (cost=92.90..10234.89 rows=10000 width=4) (actual time=6.733..7.475 rows=257 loops=1)
               Recheck Cond: (tags @> '["math", "physics"]'::jsonb)
               Heap Blocks: exact=256
               ->  Bitmap Index Scan on doc_tags_json_tags_idx  (cost=0.00..90.40 rows=10000 width=0) (actual time=6.695..6.695 rows=257 loops=1)
                     Index Cond: (tags @> '["math", "physics"]'::jsonb)
 Planning time: 0.093 ms
 Execution time: 7.632 ms

On a normal column, I'd raise n_distinct to reflect the higher
selectivity of the search terms.  However, since @> uses contsel,
n_distinct is ignored.  Anyone know a clever workaround I don't
currently see?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: working around JSONB's lack of stats?

From: Tomas Vondra
Date:
On 27.1.2015 08:06, Josh Berkus wrote:
> Folks,
>
...
>
> On a normal column, I'd raise n_distinct to reflect the higher
> selectivity of the search terms.  However, since @> uses contsel,
> n_distinct is ignored.  Anyone know a clever workaround I don't
> currently see?

I don't see any reasonable workaround :-(

ISTM we'll have to invent a way to collect useful stats about contents
of JSON/JSONB documents. JSONB is cool, but at the moment we're mostly
relying on defaults that may be reasonable, but still misfire in many
cases. Do we have any ideas of how that might work?

We're already collecting stats about contents of arrays, and maybe we
could do something similar for JSONB? The nested nature of JSON makes
that rather incompatible with the flat MCV/histogram stats, though.

regards
--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: working around JSONB's lack of stats?

From: Josh Berkus
Date:
On 01/28/2015 11:48 AM, Tomas Vondra wrote:
> On 27.1.2015 08:06, Josh Berkus wrote:
>> Folks,
>>
> ...
>>
>> On a normal column, I'd raise n_distinct to reflect the higher
>> selectivity of the search terms.  However, since @> uses contsel,
>> n_distinct is ignored.  Anyone know a clever workaround I don't
>> currently see?
>
> I don't see any reasonable workaround :-(
>
> ISTM we'll have to invent a way to collect useful stats about contents
> of JSON/JSONB documents. JSONB is cool, but at the moment we're mostly
> relying on defaults that may be reasonable, but still misfire in many
> cases. Do we have any ideas of how that might work?
>
> We're already collecting stats about contents of arrays, and maybe we
> could do something similar for JSONB? The nested nature of JSON makes
> that rather incompatible with the flat MCV/histogram stats, though.

Well, I was thinking about this.

We already have most_common_elem (MCE) for arrays and tsearch.  What if
we put JSONB's most common top-level keys (or array elements, depending)
in the MCE array?  Then we could still apply a simple rule for any path
criteria below the top-level keys, say assuming that any sub-key
criteria would match 10% of the time.  While it wouldn't be perfect, it
would be better than what we have now.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: working around JSONB's lack of stats?

From: Peter Geoghegan
Date:
On Wed, Jan 28, 2015 at 3:03 PM, Josh Berkus <josh@agliodbs.com> wrote:
> We already have most_common_elem (MCE) for arrays and tsearch.  What if
> we put JSONB's most common top-level keys (or array elements, depending)
> in the MCE array?  Then we could still apply a simple rule for any path
> criteria below the top-level keys, say assuming that any sub-key
> criteria would match 10% of the time.  While it wouldn't be perfect, it
> would be better than what we have now.

Well, the "top-level keys" would still be gathered for expression
indexes. So yeah, maybe it would work alright for arrays of "tags",
and things like that. I tend to think that that's a common enough
use-case.

--
Regards,
Peter Geoghegan


Re: working around JSONB's lack of stats?

From: Josh Berkus
Date:
On 01/28/2015 03:34 PM, Peter Geoghegan wrote:
> On Wed, Jan 28, 2015 at 3:03 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> We already have most_common_elem (MCE) for arrays and tsearch.  What if
>> we put JSONB's most common top-level keys (or array elements, depending)
>> in the MCE array?  Then we could still apply a simple rule for any path
>> criteria below the top-level keys, say assuming that any sub-key
>> criteria would match 10% of the time.  While it wouldn't be perfect, it
>> would be better than what we have now.
>
> Well, the "top-level keys" would still be gathered for expression
> indexes. So yeah, maybe it would work alright for arrays of "tags",
> and things like that. I tend to think that that's a common enough
> use-case.

Yah, and even for cases where people have nested structures, currently
we require @> to start at the top.  So we can at least compare top-level
keys to see if the key returned is in the MCEs or not, and take action
accordingly.

We could start with a constant for anything below the key, where we
assume that all values show up 10% of the time.

thus:

jsonb_col @> '[ "key1" ]'
or jsonb_col ? 'key1'
    if in MCE, assign % from MCE
    otherwise assign 1% of non-MCE %

jsonb_col @> '{ "key1": "value1" }'
    if in MCE, assign MCE% * 0.1
    otherwise assign 0.01 of non-MCE %

Does that make sense?
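[A sketch of the rule above, in Python purely for illustration; the MCE
frequencies, the non-MCE fraction, and the 10%/1% constants are all the
placeholder guesses from this proposal, not measured values:]

```python
def jsonb_selectivity(key, mce, non_mce_pct, with_value=False):
    """Sketch of the proposed estimation rule.

    mce          -- maps most-common top-level keys (or array elements)
                    to their observed frequencies
    non_mce_pct  -- combined frequency of everything not in the MCE list
    with_value   -- True for a '{ "key1": "value1" }' containment test
    """
    if key in mce:
        # Key is a most-common element: use its frequency directly,
        # discounted by the flat 10% guess when a value is specified.
        return mce[key] * (0.1 if with_value else 1.0)
    # Key not in the MCE list: assign 1% of the non-MCE frequency.
    return non_mce_pct * 0.01

# Hypothetical stats: 'key1' appears in 50% of rows, non-MCE keys in 20%.
print(jsonb_selectivity("key1", {"key1": 0.5}, 0.2))              # 0.5
print(jsonb_selectivity("key1", {"key1": 0.5}, 0.2, True))        # 0.05
print(jsonb_selectivity("other", {"key1": 0.5}, 0.2))             # 0.002
```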

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: working around JSONB's lack of stats?

From: Peter Geoghegan
Date:
On Wed, Jan 28, 2015 at 3:42 PM, Josh Berkus <josh@agliodbs.com> wrote:
> jsonb_col @> '[ "key1" ]'
> or jsonb_col ? 'key1'
>         if in MCE, assign % from MCE
>         otherwise assign 1% of non-MCE %
>
> jsonb_col @> '{ "key1": "value1" }'
>         if in MCE, assign MCE% * 0.1
>         otherwise assign 0.01 of non-MCE %
>
> Does that make sense?

I suspect it makes a lot less sense. The way people seem to want to
use jsonb is as a document store with a bit of flexibility. Individual
JSON documents tend to be fairly homogeneous in structure within a
table, just like with systems like MongoDB. Strings within arrays are
keys for our purposes, and these are often used for tags and so on.
But strings that are the key of an object/pair are much less useful to
index, in my estimation.

--
Regards,
Peter Geoghegan


Re: working around JSONB's lack of stats?

From: Tomas Vondra
Date:
On 29.1.2015 00:03, Josh Berkus wrote:
> On 01/28/2015 11:48 AM, Tomas Vondra wrote:
>> On 27.1.2015 08:06, Josh Berkus wrote:
>>> Folks,
>>>
>> ...
>>>
>>> On a normal column, I'd raise n_distinct to reflect the higher
>>> selectivity of the search terms.  However, since @> uses contsel,
>>> n_distinct is ignored.  Anyone know a clever workaround I don't
>>> currently see?
>>
>> I don't see any reasonable workaround :-(
>>
>> ISTM we'll have to invent a way to collect useful stats about contents
>> of JSON/JSONB documents. JSONB is cool, but at the moment we're mostly
>> relying on defaults that may be reasonable, but still misfire in many
>> cases. Do we have any ideas of how that might work?
>>
>> We're already collecting stats about contents of arrays, and maybe we
>> could do something similar for JSONB? The nested nature of JSON makes
>> that rather incompatible with the flat MCV/histogram stats, though.
>
> Well, I was thinking about this.
>
> We already have most_common_elem (MCE) for arrays and tsearch. What
> if we put JSONB's most common top-level keys (or array elements,
> depending) in the MCE array? Then we could still apply a simple rule
> for any path criteria below the top-level keys, say assuming that any
> sub-key criteria would match 10% of the time. While it wouldn't be
> perfect, it would be better than what we have now.

So how would that work with your 'tags' example? ISTM most of your
documents have 'tags' as a top-level key, so that would end up in the
MCV list. But there's no info about the elements of the 'tags' array
(thus the 10% default, which may work in this particular case, but it's
hardly a general solution, and I doubt it's somehow superior to the
defaults we're using right now).

I think a 'proper' solution to JSONB stats needs to somehow reflect the
nested structure. What I was thinking about is tracking MCV for
"complete paths", i.e. for a document:

  {
    "keyA" : {
      "keyB" : "x",
      "keyC" : "z"
    },
    "keyD" : [1, 2, 3, 4]
  }

We'd extract three paths:

   "keyA.keyB"
   "keyA.keyC"
   "keyD"

and aggregate that over all the documents to select the MCV paths.
And then, for each of those MCV paths track the most common values.
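[A minimal sketch of that path extraction, in Python purely for
illustration; the real thing would live in the typanalyze machinery,
and the dotted-path notation is just the convention used above:]

```python
def extract_paths(doc, prefix=""):
    """Collect 'complete paths' from a JSON document: nested object
    keys are joined with '.', and anything that isn't an object
    (a scalar or an array) terminates the path."""
    paths = []
    for key, value in doc.items():
        path = prefix + key
        if isinstance(value, dict):
            paths.extend(extract_paths(value, path + "."))
        else:
            paths.append(path)
    return paths

doc = {"keyA": {"keyB": "x", "keyC": "z"}, "keyD": [1, 2, 3, 4]}
print(extract_paths(doc))  # ['keyA.keyB', 'keyA.keyC', 'keyD']
```

Aggregating these lists over the sampled rows would then give the MCV
paths to track.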

ISTM this would allow better estimations, but it has issues too:

Firstly, it does not match the MCV structure, because it requires
storing (a) MCV paths and (b) MCV values for those paths. Moreover, (b)
probably stores different data types (some values are strings, some
integers, etc.). Arrays might be handled just like regular arrays, i.e.
tracking stats of elements, but it's still mixed data types.

Secondly, I think it's based on the assumption of independence (i.e.
that the occurrence of one path does not depend on the occurrence of a
different path in the same document). The same goes for values vs.
paths. Which may or may not be true - it's essentially the same as the
assumption of independence for predicates on multiple columns. While I
do have ideas on how to approach this in the multi-column case,
handling this for JSONB is going to be much more complex, I think.

But the first question (what stats to collect and how to store them) is
the most important at this point, I guess.

regards

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: working around JSONB's lack of stats?

From: Josh Berkus
Date:
On 01/28/2015 03:50 PM, Peter Geoghegan wrote:
> On Wed, Jan 28, 2015 at 3:42 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> jsonb_col @> '[ "key1" ]'
>> or jsonb_col ? 'key1'
>>         if in MCE, assign % from MCE
>>         otherwise assign 1% of non-MCE %
>>
>> jsonb_col @> '{ "key1": "value1" }'
>>         if in MCE, assign MCE% * 0.1
>>         otherwise assign 0.01 of non-MCE %
>>
>> Does that make sense?
>
> I suspect it makes a lot less sense. The way people seem to want to
> use jsonb is as a document store with a bit of flexibility. Individual
> JSON documents tend to be fairly homogeneous in structure within a
> table, just like with systems like MongoDB. Strings within arrays are
> keys for our purposes, and these are often used for tags and so on.
> But strings that are the key of an object/pair are much less useful to
> index, in my estimation.

Yeah, I see your point; except for arrays, people are usually searching
for a key:value pair, and the existence of the key is not in doubt.

That would make the "element" the key:value pair, no?  But
realistically, we would only want to do that for simple keys and values.

Although: if you "flatten" a nested JSON structure into just keys with
scalar values (and array items as their own thing), then you could have
a series of expanded key:value pairs to put into MCE.

For example:

{ house : { city : San Francisco,
     sqft: 1200,
     color: blue,
     occupants: [ mom, dad, child1 ]
     },
  occupation: programmer
}

... would get flattened out into the following pairs:

city: san francisco
sqft: 1200
color: blue
occupants: [ mom ]
occupants: [ dad ]
occupants: [ child1 ]
occupation: programmer

This would probably work because there aren't a lot of data structures
where people would have the same key:value pair in different locations
in the JSON, and care about it stats-wise.  Alternately, if the same
key:value pair appears multiple times in the same sample row, we could
cut the MC% by that multiple.

No?
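[A sketch of that flattening, with a hypothetical helper in Python;
note that per the pairs listed above, the enclosing "house" key
disappears and only the innermost key is kept:]

```python
def flatten(doc):
    """Flatten nested JSON into (innermost-key, scalar) pairs; each
    array element becomes its own pair under the array's key."""
    pairs = []
    for key, value in doc.items():
        if isinstance(value, dict):
            pairs.extend(flatten(value))               # drop the enclosing key
        elif isinstance(value, list):
            pairs.extend((key, item) for item in value)  # one pair per element
        else:
            pairs.append((key, value))
    return pairs

doc = {"house": {"city": "San Francisco", "sqft": 1200, "color": "blue",
                 "occupants": ["mom", "dad", "child1"]},
       "occupation": "programmer"}
print(flatten(doc))
# [('city', 'San Francisco'), ('sqft', 1200), ('color', 'blue'),
#  ('occupants', 'mom'), ('occupants', 'dad'), ('occupants', 'child1'),
#  ('occupation', 'programmer')]
```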

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: working around JSONB's lack of stats?

From: Jim Nasby
Date:
On 1/30/15 2:26 PM, Josh Berkus wrote:
> On 01/28/2015 03:50 PM, Peter Geoghegan wrote:
>> On Wed, Jan 28, 2015 at 3:42 PM, Josh Berkus <josh@agliodbs.com> wrote:
>>> jsonb_col @> '[ "key1" ]'
>>> or jsonb_col ? 'key1'
>>>          if in MCE, assign % from MCE
>>>          otherwise assign 1% of non-MCE %
>>>
>>> jsonb_col @> '{ "key1": "value1" }'
>>>          if in MCE, assign MCE% * 0.1
>>>          otherwise assign 0.01 of non-MCE %
>>>
>>> Does that make sense?
>>
>> I suspect it makes a lot less sense. The way people seem to want to
>> use jsonb is as a document store with a bit of flexibility. Individual
>> JSON documents tend to be fairly homogeneous in structure within a
>> table, just like with systems like MongoDB. Strings within arrays are
>> keys for our purposes, and these are often used for tags and so on.
>> But strings that are the key of an object/pair are much less useful to
>> index, in my estimation.
>
> Yeah, I see your point; except for arrays, people are usually searching
> for a key:value pair, and the existence of the key is not in doubt.
>
> That would make the "element" the key:value pair, no?  But
> realistically, we would only want to do that for simple keys and values.
>
> Although: if you "flatten" a nested JSON structure into just keys with
> scalar values (and array items as their own thing), then you could have
> a series of expanded key:value pairs to put into MCE.
>
> For example:
>
> { house : { city : San Francisco,
>       sqft: 1200,
>       color: blue,
>       occupants: [ mom, dad, child1 ]
>       },
>    occupation: programmer
> }
>
> ... would get flattened out into the following pairs:
>
> city: san francisco
> sqft: 1200
> color: blue
> occupants: [ mom ]
> occupants: [ dad ]
> occupants: [ child1 ]
> occupation: programmer
>
> This would probably work because there aren't a lot of data structures
> where people would have the same key:value pair in different locations
> in the JSON, and care about it stats-wise.  Alternately, if the same
> key-value pair appears multiple times in the same sample row, we could
> cut the MC% by that multiple.

Even if there were multiple occurrences, this would probably still be an
improvement.

Another idea... at one time in the past when discussing statistics on
multiple columns, one idea was to build statistics on indexes. If we
built that, we could also do the same thing for at least JSONB (not sure
about JSON). Obviously doesn't help for stuff you haven't indexed, but
presumably if you care about performance and have any significant size
of data you've also indexed parts of the JSON, yes?
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


Re: working around JSONB's lack of stats?

From: Josh Berkus
Date:
On 01/30/2015 05:34 PM, Jim Nasby wrote:
> On 1/30/15 2:26 PM, Josh Berkus wrote:
>> This would probably work because there aren't a lot of data structures
>> where people would have the same key:value pair in different locations
>> in the JSON, and care about it stats-wise.  Alternately, if the same
>> key-value pair appears multiple times in the same sample row, we could
>> cut the MC% by that multiple.
>
> Even if there were multiple occurrences, this would probably still be an
> improvement.
>
> Another idea... at one time in the past when discussing statistics on
> multiple columns, one idea was to build statistics on indexes. If we
> built that, we could also do the same thing for at least JSONB (not sure
> about JSON). Obviously doesn't help for stuff you haven't indexed, but
> presumably if you care about performance and have any significant size
> of data you've also indexed parts of the JSON, yes?

I'm not clear on what you're suggesting here.  I'm discussing how the
stats for a JSONB field would be stored and accessed; I don't understand
what that has to do with indexing.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: working around JSONB's lack of stats?

From: Merlin Moncure
Date:
On Tue, Jan 27, 2015 at 1:06 AM, Josh Berkus <josh@agliodbs.com> wrote:
> Folks,
>
> Currently, JSONB fields don't have statistics, and the planner assumes
> a flat 1% selectivity for them.  This can result in poor query plans,
> and I'm wondering if anyone has a suggested workaround for this, short
> of hacking a new selectivity function.  For example, take the common
> case of using JSONB to hold a list of "tags" for tagging documents:

hm, why stop at jsonb?  What's needed is a way to override the
planner's row estimate in a general way.
merlin


Re: working around JSONB's lack of stats?

From: Jim Nasby
Date:
On 2/1/15 3:08 PM, Josh Berkus wrote:
> On 01/30/2015 05:34 PM, Jim Nasby wrote:
>> On 1/30/15 2:26 PM, Josh Berkus wrote:
>>> This would probably work because there aren't a lot of data structures
>>> where people would have the same key:value pair in different locations
>>> in the JSON, and care about it stats-wise.  Alternately, if the same
>>> key-value pair appears multiple times in the same sample row, we could
>>> cut the MC% by that multiple.
>>
>> Even if there were multiple occurrences, this would probably still be an
>> improvement.
>>
>> Another idea... at one time in the past when discussing statistics on
>> multiple columns, one idea was to build statistics on indexes. If we
>> built that, we could also do the same thing for at least JSONB (not sure
>> about JSON). Obviously doesn't help for stuff you haven't indexed, but
>> presumably if you care about performance and have any significant size
>> of data you've also indexed parts of the JSON, yes?
>
> I'm not clear on what you're suggesting here.  I'm discussing how the
> stats for a JSONB field would be stored and accessed; I don't understand
> what that has to do with indexing.

The JSON problem is similar to the problem of doing multi-column
statistics: there's no way to keep statistics on all possible
combinations, because the number of combinations can be extremely
large.

Something that's been proposed is only trying to keep multi-column stats
on column combinations that we have an index on (which in a way is
really just keeping stats on the index itself).

If we built that, we could use the same technique for JSON by simply
defining the indexes you needed for whatever you were searching on.

Obviously that's not as ideal as simply keeping full statistics on
everything in a JSON document, but it might be a heck of a lot easier
to accomplish.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


Re: working around JSONB's lack of stats?

From: Josh Berkus
Date:
On 02/02/2015 05:48 PM, Jim Nasby wrote:
> On 2/1/15 3:08 PM, Josh Berkus wrote:
>> I'm not clear on what you're suggesting here.  I'm discussing how the
>> stats for a JSONB field would be stored and accessed; I don't understand
>> what that has to do with indexing.
>
> The JSON problem is similar to the problem of doing multi-column
> statistics: there's no way to simply try to keep statistics on all
> possible combinations, because the number of combinations can be
> extremely large.
> large.
>
> Something that's been proposed is only trying to keep multi-column stats
> on column combinations that we have an index on (which in a way is
> really just keeping stats on the index itself).

The difficulty with column correlation (as with value correlation for
JSONB) is the combination of *values*, not the combination of *columns*.
Via the brute-force method, imagine you have one column with cardinality
100, and another with cardinality 100,000.  This would require you to
keep 10 million different correlation coefficients in order to estimate
correctly.  Even correlating MCVs would add up to quite a bit of stats
in short order; these days people frequently set statistics to 1000.
The same goes for JSON keys.
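[The arithmetic behind that, spelled out with the example numbers;
the stats-target figure is likewise from the sentence above:]

```python
# Brute-force correlation: every pair of distinct values needs an entry.
card_a, card_b = 100, 100_000
print(card_a * card_b)   # 10000000 -- 10 million coefficients

# Even restricted to MCVs, a statistics target of 1000 on both sides
# still implies up to a million MCV pairs to correlate.
print(1000 * 1000)       # 1000000
```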

> If we built that, we could use the same technique for JSON by simply
> defining the indexes you needed for whatever you were searching on.
>
> Obviously that's not as ideal as simply keeping full statistics on
> everything in a JSON document, but it might be a heck of a lot easier to
> accomplish.

Walk before trying to run, let alone fly, please.  Right now we don't
have selectivity estimation for a *single* key; let's do that before we
start talking about better estimation for combinations of keys.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com