Discussion: working around JSONB's lack of stats?
Folks,

Currently, JSONB fields don't have statistics, and queries against them assume a flat 1% selectivity. This can result in poor query plans, and I'm wondering if anyone has a suggested workaround for this short of hacking a new selectivity function. For example, take the common case of using JSONB to hold a list of "tags" for tagging documents:

     Table "public.doc_tags_json"
 Column |  Type   | Modifiers
--------+---------+-----------
 doc_id | integer |
 tags   | jsonb   |
Indexes:
    "doc_tags_json_doc_id_idx" UNIQUE, btree (doc_id)
    "doc_tags_json_tags_idx" gin (tags)

This query:

    select doc_id
    from doc_tags_json
    where tags @> '[ "math", "physics" ]'
    order by doc_id desc limit 25;

Uses this plan:

                                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..709.79 rows=25 width=4) (actual time=24.529..340.499 rows=25 loops=1)
   ->  Index Scan Backward using doc_tags_json_doc_id_idx on doc_tags_json  (cost=0.43..283740.95 rows=10000 width=4) (actual time=24.528..340.483 rows=25 loops=1)
         Filter: (tags @> '["math", "physics"]'::jsonb)
         Rows Removed by Filter: 1011878
 Planning time: 0.090 ms
 Execution time: 340.528 ms

It does this because it expects @> '["math", "physics"]' to match 10,000 rows, which means that it expects to only scan 25,000 entries in the doc_id index to return the top 25.
However, the matching condition is much rarer than it thinks, so it's actually far faster to use the index on the JSONB column:

    drop index doc_tags_json_doc_id_idx;

                                                                    QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Limit  (cost=10517.08..10517.14 rows=25 width=4) (actual time=7.594..7.602 rows=25 loops=1)
   ->  Sort  (cost=10517.08..10542.08 rows=10000 width=4) (actual time=7.593..7.596 rows=25 loops=1)
         Sort Key: doc_id
         Sort Method: top-N heapsort  Memory: 26kB
         ->  Bitmap Heap Scan on doc_tags_json  (cost=92.90..10234.89 rows=10000 width=4) (actual time=6.733..7.475 rows=257 loops=1)
               Recheck Cond: (tags @> '["math", "physics"]'::jsonb)
               Heap Blocks: exact=256
               ->  Bitmap Index Scan on doc_tags_json_tags_idx  (cost=0.00..90.40 rows=10000 width=0) (actual time=6.695..6.695 rows=257 loops=1)
                     Index Cond: (tags @> '["math", "physics"]'::jsonb)
 Planning time: 0.093 ms
 Execution time: 7.632 ms

On a normal column, I'd raise n_distinct to reflect the higher selectivity of the search terms. However, since @> uses contsel, n_distinct is ignored. Anyone know a clever workaround I don't currently see?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 27.1.2015 08:06, Josh Berkus wrote:
> Folks,
> ...
> On a normal column, I'd raise n_distinct to reflect the higher
> selectivity of the search terms. However, since @> uses contsel,
> n_distinct is ignored. Anyone know a clever workaround I don't
> currently see?

I don't see any reasonable workaround :-(

ISTM we'll have to invent a way to collect useful stats about contents of JSON/JSONB documents. JSONB is cool, but at the moment we're mostly relying on defaults that may be reasonable, but still misfire in many cases. Do we have any ideas of how that might work?

We're already collecting stats about contents of arrays, and maybe we could do something similar for JSONB? The nested nature of JSON makes that rather incompatible with the flat MCV/histogram stats, though.

regards

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 01/28/2015 11:48 AM, Tomas Vondra wrote:
> On 27.1.2015 08:06, Josh Berkus wrote:
>> Folks,
>> ...
>> On a normal column, I'd raise n_distinct to reflect the higher
>> selectivity of the search terms. However, since @> uses contsel,
>> n_distinct is ignored. Anyone know a clever workaround I don't
>> currently see?
>
> I don't see any reasonable workaround :-(
>
> ISTM we'll have to invent a way to collect useful stats about contents
> of JSON/JSONB documents. JSONB is cool, but at the moment we're mostly
> relying on defaults that may be reasonable, but still misfire in many
> cases. Do we have any ideas of how that might work?
>
> We're already collecting stats about contents of arrays, and maybe we
> could do something similar for JSONB? The nested nature of JSON makes
> that rather incompatible with the flat MCV/histogram stats, though.

Well, I was thinking about this.

We already have most_common_elem (MCE) for arrays and tsearch. What if we put JSONB's most common top-level keys (or array elements, depending) in the MCE array? Then we could still apply a simple rule for any path criteria below the top-level keys, say assuming that any sub-key criteria would match 10% of the time. While it wouldn't be perfect, it would be better than what we have now.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Wed, Jan 28, 2015 at 3:03 PM, Josh Berkus <josh@agliodbs.com> wrote:
> We already have most_common_elem (MCE) for arrays and tsearch. What if
> we put JSONB's most common top-level keys (or array elements, depending)
> in the MCE array? Then we could still apply a simple rule for any path
> criteria below the top-level keys, say assuming that any sub-key
> criteria would match 10% of the time. While it wouldn't be perfect, it
> would be better than what we have now.

Well, the "top-level keys" would still be gathered for expression indexes. So yeah, maybe it would work alright for arrays of "tags", and things like that. I tend to think that that's a common enough use-case.

--
Regards,
Peter Geoghegan
On 01/28/2015 03:34 PM, Peter Geoghegan wrote:
> On Wed, Jan 28, 2015 at 3:03 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> We already have most_common_elem (MCE) for arrays and tsearch. What if
>> we put JSONB's most common top-level keys (or array elements, depending)
>> in the MCE array? Then we could still apply a simple rule for any path
>> criteria below the top-level keys, say assuming that any sub-key
>> criteria would match 10% of the time. While it wouldn't be perfect, it
>> would be better than what we have now.
>
> Well, the "top-level keys" would still be gathered for expression
> indexes. So yeah, maybe it would work alright for arrays of "tags",
> and things like that. I tend to think that that's a common enough
> use-case.

Yah, and even for cases where people have nested structures, currently we require @> to start at the top. So we can at least compare top-level keys to see if the key returned is in the MCEs or not, and take action accordingly. We could start with a constant for anything below the key, where we assume that all values show up 10% of the time. Thus:

jsonb_col @> '[ "key1" ]'
  or jsonb_col ? 'key1'
    if in MCE, assign % from MCE
    otherwise assign 1% of non-MCE %

jsonb_col @> '{ "key1": "value1" }'
    if in MCE, assign MCE% * 0.1
    otherwise assign 0.01 of non-MCE %

Does that make sense?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
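[The rule set above can be sketched concretely. This is an illustration only: the `mce` dictionary, its frequencies, and `non_mce_freq` are invented numbers standing in for what ANALYZE's most_common_elems stats might contain; a real implementation would be a C selectivity function inside PostgreSQL, not Python.]

```python
# Hypothetical MCE stats: element -> fraction of sampled rows containing it.
mce = {"math": 0.004, "python": 0.09}
non_mce_freq = 0.30  # combined frequency of all elements outside the MCE list

def key_selectivity(key):
    """Estimate for: jsonb_col @> '["key"]'  or  jsonb_col ? 'key'."""
    if key in mce:
        return mce[key]              # "if in MCE, assign % from MCE"
    return 0.01 * non_mce_freq       # "otherwise assign 1% of non-MCE %"

def key_value_selectivity(key):
    """Estimate for: jsonb_col @> '{"key": "value"}'."""
    if key in mce:
        return mce[key] * 0.1        # "if in MCE, assign MCE% * 0.1"
    return 0.01 * non_mce_freq       # "otherwise assign 0.01 of non-MCE %"
```

With these invented stats, a search for the MCE tag "math" would be estimated at 0.4% of rows rather than the flat default, which is the whole point of the proposal.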
On Wed, Jan 28, 2015 at 3:42 PM, Josh Berkus <josh@agliodbs.com> wrote:
> jsonb_col @> '[ "key1" ]'
>   or jsonb_col ? 'key1'
>     if in MCE, assign % from MCE
>     otherwise assign 1% of non-MCE %
>
> jsonb_col @> '{ "key1": "value1" }'
>     if in MCE, assign MCE% * 0.1
>     otherwise assign 0.01 of non-MCE %
>
> Does that make sense?

I suspect it makes a lot less sense. The way people seem to want to use jsonb is as a document store with a bit of flexibility. Individual JSON documents tend to be fairly homogeneous in structure within a table, just like with systems like MongoDB. Strings within arrays are keys for our purposes, and these are often used for tags and so on. But strings that are the key of an object/pair are much less useful to index, in my estimation.

--
Regards,
Peter Geoghegan
On 29.1.2015 00:03, Josh Berkus wrote:
> On 01/28/2015 11:48 AM, Tomas Vondra wrote:
>> On 27.1.2015 08:06, Josh Berkus wrote:
>>> Folks,
>>> ...
>>> On a normal column, I'd raise n_distinct to reflect the higher
>>> selectivity of the search terms. However, since @> uses contsel,
>>> n_distinct is ignored. Anyone know a clever workaround I don't
>>> currently see?
>>
>> I don't see any reasonable workaround :-(
>>
>> ISTM we'll have to invent a way to collect useful stats about contents
>> of JSON/JSONB documents. JSONB is cool, but at the moment we're mostly
>> relying on defaults that may be reasonable, but still misfire in many
>> cases. Do we have any ideas of how that might work?
>>
>> We're already collecting stats about contents of arrays, and maybe we
>> could do something similar for JSONB? The nested nature of JSON makes
>> that rather incompatible with the flat MCV/histogram stats, though.
>
> Well, I was thinking about this.
>
> We already have most_common_elem (MCE) for arrays and tsearch. What
> if we put JSONB's most common top-level keys (or array elements,
> depending) in the MCE array? Then we could still apply a simple rule
> for any path criteria below the top-level keys, say assuming that any
> sub-key criteria would match 10% of the time. While it wouldn't be
> perfect, it would be better than what we have now.

So how would that work with your 'tags' example? ISTM most of your documents have 'tags' as a top-level key, so that would end up in the MCV list. But there's no info about the elements of the 'tags' array (thus the 10% default, which may work in this particular case, but it's hardly a general solution and I doubt it's somehow superior to the defaults we're using right now).

I think a 'proper' solution to JSONB stats needs to somehow reflect the nested structure. What I was thinking about is tracking MCV for "complete paths", i.e. for a document:

{
  "keyA" : {
    "keyB" : "x",
    "keyC" : "z"
  },
  "keyD" : [1, 2, 3, 4]
}

we'd extract three paths

  "keyA.keyB"
  "keyA.keyC"
  "keyD"

and aggregate that over all the documents to select the MCV paths. And then, for each of those MCV paths, track the most common values.

ISTM this would allow better estimations, but it has issues too:

Firstly, it does not match the MCV structure, because it requires storing (a) MCV paths and (b) MCV values for those paths. Moreover, (b) probably stores different data types (some values are strings, some integers, etc.). Arrays might be handled just like regular arrays, i.e. tracking stats of elements, but it's still mixed data types.

Secondly, I think it's based on the assumption of independence (i.e. that the occurrence of one path does not depend on the occurrence of a different path in the same document). The same goes for values vs. paths. Which may or may not be true - it's essentially the same as the assumption of independence for predicates on multiple columns. While I do have ideas on how to approach this in the multi-column case, handling this for JSONB is going to be much more complex, I think.

But the first question (what stats to collect and how to store them) is the most important at this point, I guess.

regards

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
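[The "complete paths" extraction described above can be sketched as follows. This is purely an illustration of the proposed ANALYZE-time logic, not anything that exists in PostgreSQL; arrays are treated as leaves whose element stats would be tracked separately, like regular array columns.]

```python
import json
from collections import Counter

def complete_paths(value, prefix=""):
    """Yield the dotted 'complete paths' of a JSON document: one path per
    leaf-bearing key. Nested objects are descended into; arrays and scalars
    terminate a path."""
    if isinstance(value, dict):
        for key, sub in value.items():
            path = f"{prefix}.{key}" if prefix else key
            if isinstance(sub, dict):
                yield from complete_paths(sub, path)
            else:
                yield path

doc = json.loads('{"keyA": {"keyB": "x", "keyC": "z"}, "keyD": [1, 2, 3, 4]}')
assert sorted(complete_paths(doc)) == ["keyA.keyB", "keyA.keyC", "keyD"]

# Aggregated over a sample of documents, a counter of paths is the raw
# material for choosing the MCV paths:
sample = [doc, {"keyA": {"keyB": "y"}}]
path_counts = Counter(p for d in sample for p in complete_paths(d))
```

The second half of the proposal, tracking most common values per MCV path, would then key a separate per-path value counter off `path_counts`, which is where the mixed-data-type problem described above shows up.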
On 01/28/2015 03:50 PM, Peter Geoghegan wrote:
> On Wed, Jan 28, 2015 at 3:42 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> jsonb_col @> '[ "key1" ]'
>>   or jsonb_col ? 'key1'
>>     if in MCE, assign % from MCE
>>     otherwise assign 1% of non-MCE %
>>
>> jsonb_col @> '{ "key1": "value1" }'
>>     if in MCE, assign MCE% * 0.1
>>     otherwise assign 0.01 of non-MCE %
>>
>> Does that make sense?
>
> I suspect it makes a lot less sense. The way people seem to want to
> use jsonb is as a document store with a bit of flexibility. Individual
> JSON documents tend to be fairly homogeneous in structure within a
> table, just like with systems like MongoDB. Strings within arrays are
> keys for our purposes, and these are often used for tags and so on.
> But strings that are the key of an object/pair are much less useful to
> index, in my estimation.

Yeah, I see your point; except for arrays, people are usually searching for a key:value pair, and the existence of the key is not in doubt.

That would make the "element" the key:value pair, no? But realistically, we would only want to do that for simple keys and values.

Although: if you "flatten" a nested JSON structure into just keys with scalar values (and array items as their own thing), then you could have a series of expanded key:value pairs to put into MCE.

For example:

{ house : { city : San Francisco,
            sqft : 1200,
            color : blue,
            occupants : [ mom, dad, child1 ]
          },
  occupation : programmer
}

... would get flattened out into the following pairs:

  city: san francisco
  sqft: 1200
  color: blue
  occupants: [ mom ]
  occupants: [ dad ]
  occupants: [ child1 ]
  occupation: programmer

This would probably work because there aren't a lot of data structures where people would have the same key:value pair in different locations in the JSON, and care about it stats-wise. Alternately, if the same key-value pair appears multiple times in the same sample row, we could cut the MC% by that multiple. No?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
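[The flattening described above can be sketched directly. This is an illustration of the proposed transformation only, with made-up data mirroring the house example; intermediate object keys like "house" are dropped, and each array element becomes its own key:value pair.]

```python
def flatten_pairs(value, out=None):
    """Flatten a nested JSON document into (key, value) pairs suitable as
    MCE entries: scalars pair with their immediate key, array items become
    one single-element pair each, and intermediate object keys vanish."""
    if out is None:
        out = []
    if isinstance(value, dict):
        for key, sub in value.items():
            if isinstance(sub, dict):
                flatten_pairs(sub, out)        # descend; "house" itself is dropped
            elif isinstance(sub, list):
                for item in sub:
                    out.append((key, [item]))  # occupants: [ mom ], etc.
            else:
                out.append((key, sub))
    return out

doc = {"house": {"city": "San Francisco", "sqft": 1200, "color": "blue",
                 "occupants": ["mom", "dad", "child1"]},
       "occupation": "programmer"}
pairs = flatten_pairs(doc)
```

Counting these pairs across the ANALYZE sample (and, per the alternative above, dividing a pair's frequency by its multiplicity within a single row) would yield the MCE frequencies.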
On 1/30/15 2:26 PM, Josh Berkus wrote:
> On 01/28/2015 03:50 PM, Peter Geoghegan wrote:
>> On Wed, Jan 28, 2015 at 3:42 PM, Josh Berkus <josh@agliodbs.com> wrote:
>>> jsonb_col @> '[ "key1" ]'
>>>   or jsonb_col ? 'key1'
>>>     if in MCE, assign % from MCE
>>>     otherwise assign 1% of non-MCE %
>>>
>>> jsonb_col @> '{ "key1": "value1" }'
>>>     if in MCE, assign MCE% * 0.1
>>>     otherwise assign 0.01 of non-MCE %
>>>
>>> Does that make sense?
>>
>> I suspect it makes a lot less sense. The way people seem to want to
>> use jsonb is as a document store with a bit of flexibility. Individual
>> JSON documents tend to be fairly homogeneous in structure within a
>> table, just like with systems like MongoDB. Strings within arrays are
>> keys for our purposes, and these are often used for tags and so on.
>> But strings that are the key of an object/pair are much less useful to
>> index, in my estimation.
>
> Yeah, I see your point; except for arrays, people are usually searching
> for a key:value pair, and the existence of the key is not in doubt.
>
> That would make the "element" the key:value pair, no? But
> realistically, we would only want to do that for simple keys and values.
>
> Although: if you "flatten" a nested JSON structure into just keys with
> scalar values (and array items as their own thing), then you could have
> a series of expanded key:value pairs to put into MCE.
>
> For example:
>
> { house : { city : San Francisco,
>             sqft : 1200,
>             color : blue,
>             occupants : [ mom, dad, child1 ]
>           },
>   occupation : programmer
> }
>
> ... would get flattened out into the following pairs:
>
>   city: san francisco
>   sqft: 1200
>   color: blue
>   occupants: [ mom ]
>   occupants: [ dad ]
>   occupants: [ child1 ]
>   occupation: programmer
>
> This would probably work because there aren't a lot of data structures
> where people would have the same key:value pair in different locations
> in the JSON, and care about it stats-wise. Alternately, if the same
> key-value pair appears multiple times in the same sample row, we could
> cut the MC% by that multiple.

Even if there were multiple occurrences, this would probably still be an improvement.

Another idea... at one time in the past when discussing statistics on multiple columns, one idea was to build statistics on indexes. If we built that, we could also do the same thing for at least JSONB (not sure about JSON). Obviously doesn't help for stuff you haven't indexed, but presumably if you care about performance and have any significant size of data, you've also indexed parts of the JSON, yes?

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On 01/30/2015 05:34 PM, Jim Nasby wrote:
> On 1/30/15 2:26 PM, Josh Berkus wrote:
>> This would probably work because there aren't a lot of data structures
>> where people would have the same key:value pair in different locations
>> in the JSON, and care about it stats-wise. Alternately, if the same
>> key-value pair appears multiple times in the same sample row, we could
>> cut the MC% by that multiple.
>
> Even if there were multiple occurrences, this would probably still be an
> improvement.
>
> Another idea... at one time in the past when discussing statistics on
> multiple columns, one idea was to build statistics on indexes. If we
> built that, we could also do the same thing for at least JSONB (not sure
> about JSON). Obviously doesn't help for stuff you haven't indexed, but
> presumably if you care about performance and have any significant size
> of data you've also indexed parts of the JSON, yes?

I'm not clear on what you're suggesting here. I'm discussing how the stats for a JSONB field would be stored and accessed; I don't understand what that has to do with indexing.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Tue, Jan 27, 2015 at 1:06 AM, Josh Berkus <josh@agliodbs.com> wrote:
> Folks,
>
> Currently, JSONB fields don't have statistics, and estimate a flat 1%
> selectivity. This can result in poor query plans, and I'm wondering if
> anyone has a suggested workaround for this short of hacking a new
> selectivity function. For example, take the common case of using JSONB
> to hold a list of "tags" for tagging documents:

hm, why stop at jsonb? What's needed is a way to override the planner's row estimate in a general way.

merlin
On 2/1/15 3:08 PM, Josh Berkus wrote:
> On 01/30/2015 05:34 PM, Jim Nasby wrote:
>> On 1/30/15 2:26 PM, Josh Berkus wrote:
>>> This would probably work because there aren't a lot of data structures
>>> where people would have the same key:value pair in different locations
>>> in the JSON, and care about it stats-wise. Alternately, if the same
>>> key-value pair appears multiple times in the same sample row, we could
>>> cut the MC% by that multiple.
>>
>> Even if there were multiple occurrences, this would probably still be an
>> improvement.
>>
>> Another idea... at one time in the past when discussing statistics on
>> multiple columns, one idea was to build statistics on indexes. If we
>> built that, we could also do the same thing for at least JSONB (not sure
>> about JSON). Obviously doesn't help for stuff you haven't indexed, but
>> presumably if you care about performance and have any significant size
>> of data you've also indexed parts of the JSON, yes?
>
> I'm not clear on what you're suggesting here. I'm discussing how the
> stats for a JSONB field would be stored and accessed; I don't understand
> what that has to do with indexing.

The JSON problem is similar to the problem of doing multi-column statistics: there's no way to simply try to keep statistics on all possible combinations, because that's something that can be extremely large.

Something that's been proposed is only trying to keep multi-column stats on column combinations that we have an index on (which in a way is really just keeping stats on the index itself). If we built that, we could use the same technique for JSON by simply defining the indexes you needed for whatever you were searching on.

Obviously that's not as ideal as simply keeping full statistics on everything in a JSON document, but it might be a heck of a lot easier to accomplish.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On 02/02/2015 05:48 PM, Jim Nasby wrote:
> On 2/1/15 3:08 PM, Josh Berkus wrote:
>> I'm not clear on what you're suggesting here. I'm discussing how the
>> stats for a JSONB field would be stored and accessed; I don't understand
>> what that has to do with indexing.
>
> The JSON problem is similar to the problem of doing multi-column
> statistics: there's no way to simply try to keep statistics on all
> possible combinations, because that's something that can be extremely
> large.
>
> Something that's been proposed is only trying to keep multi-column stats
> on column combinations that we have an index on (which in a way is
> really just keeping stats on the index itself).

The difficulty with column correlation (as with value correlation for JSONB) is the combination of *values*, not the combination of *columns*. Via the brute-force method, imagine you have one column with cardinality 100, and another with cardinality 100,000. This would require you to keep 10 million different correlation coefficients in order to be able to estimate correctly. Even correlating MCVs would add up to quite a bit of stats in short order; these days people frequently set statistics to 1000. The same goes for JSON keys.

> If we built that, we could use the same technique for JSON by simply
> defining the indexes you needed for whatever you were searching on.
>
> Obviously that's not as ideal as simply keeping full statistics on
> everything in a JSON document, but it might be a heck of a lot easier to
> accomplish.

Walk before trying to run, let alone fly, please. Right now we don't have selectivity estimation for a *single* key; let's do that before we start talking about better estimation for combinations of keys.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
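[The storage-blowup arithmetic above checks out directly; the numbers below simply restate the example, plus the MCV-only variant with a statistics target of 1000 on both columns.]

```python
# Brute-force value correlation: one coefficient per pair of values.
card_a, card_b = 100, 100_000
coefficients = card_a * card_b   # 10 million, as stated above

# Even restricting to MCVs, a statistics target of 1000 on both columns
# gives a full MCV-by-MCV grid of a million entries:
mcv_grid = 1000 * 1000
```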