Обсуждение: Unable to make use of "deep" JSONB index
Hi,
I'm having trouble using an index on the "deep" innards of a JSONB field in that the matching query it is trying to accelerate never uses it. I did get some advice on a simplified version of the problem at [1], but the actual problem remains the same in that "EXPLAIN ANALYZE" never refers to the index. Here are the details including a test case below...
1. The JSONB can be several MB in size. This works fine for all but one access pattern.
2. The JSON in the problem use case looks like this:
{
"...stuff...": ...
"employee": {
"999": {"id": 999, "integer attribute": 0, "boolean-may-be-missing": true, "state": {
"nested-list": [[], [], ...]
}
}
}
3. As per the discussion at [2], using a SELECT with a WHERE on the 3 attributes of interest ("integer attribute", the "boolean-may-be-missing" and "nested-list") incurs a significant overheard which suggests that the JSONB storage is being accessed 3 times. In order to optimise for this case, I constructed a query using the jsonpath support which seems to successfully avoid the triple-fetch by keeping the logic inside the jsonpath query like this:
WHERE (snapshot @? '$.employee."999" ? (@.pay_graph <> 0 || @.last_run_of_employment
== true || @.state.employment[last][2] == 0)')
== true || @.state.employment[last][2] == 0)')
4. Then I created an index "matching" this query.
5. According to EXPLAIN ANALYSE, the index is never used.
=== version and platform ===
Version: PostgreSQL 14.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-12), 64-bit
Platform: Ubuntu 22.04 (Jammy), using OS-supplied build
=== test case ===
CREATE TABLE payrun (
id serial primary key,
snapshot JSONB
);
id serial primary key,
snapshot JSONB
);
INSERT INTO payrun(snapshot)
VALUES
('{"employee": {"999": {"id": 999, "state": {"employment": [["1920-01-01", null, 3]]}, "pay_graph": 0, "last_run_of_employment": false}, "111": {"id": 111, "state": {"employment": [["1920-01-01", null, 5]]}, "pay_graph": 3, "last_run_of_employment": true}}}'),
('{"employee": {"999": {"id": 999, "state": {"employment": [["1970-01-01", null, 3]]}, "pay_graph": 6, "last_run_of_employment": true}, "222": {"id": 222, "state": {"employment": [["1920-01-01", null, 5]]}, "pay_graph": 5, "last_run_of_employment": true}}}'),
('{"employee": {"998": {"id": 998, "state": {"employment": [["1980-01-01", null, 3]]}, "pay_graph": 7, "last_run_of_employment": false}, "333": {"id": 333, "state": {"employment": [["1920-01-01", null, 5]]}, "pay_graph": 3, "last_run_of_employment": true}}}')
;
SELECT id,snapshot #>'{employee,999,state,employment}' FROM "payrun" WHERE (snapshot @? '$.employee."999" ? (@.pay_graph <> 0 || @.last_run_of_employment == true || @.state.employment[last][2] == 0)');
--
-- Create index designed to match the query.
--
create index idx1 on payrun using gin ((snapshot->'$.employee.* ? (@.pay_graph <> 0 || @.last_run_of_employment == true || @.state.employment[last][2] == 0)'));
set enable_seqscan = OFF;
--
set enable_seqscan = OFF;
--
-- EXPLAIN ANALYZE ...query above...
--
explain analyse SELECT id,snapshot #>'{employee,999,state,employment}' FROM "payrun" WHERE (snapshot @? '$.employee."999" ? (@.pay_graph <> 0 || @.last_run_of_employment
== true || @.state.employment[last][2] == 0)');
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on payrun (cost=10000000000.00..10000000001.04 rows=1 width=36) (actual time=0.040..0.042 rows=1 loops=1)
Filter: (snapshot @? '$."employee"."999"?((@."pay_graph" != 0 || @."last_run_of_employment" == true) || @."state"."employment"[last][2] == 0)'::jsonpath)
Rows Removed by Filter: 2
Planning Time: 0.883 ms
Execution Time: 0.078 ms
(5 rows)
== true || @.state.employment[last][2] == 0)');
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on payrun (cost=10000000000.00..10000000001.04 rows=1 width=36) (actual time=0.040..0.042 rows=1 loops=1)
Filter: (snapshot @? '$."employee"."999"?((@."pay_graph" != 0 || @."last_run_of_employment" == true) || @."state"."employment"[last][2] == 0)'::jsonpath)
Rows Removed by Filter: 2
Planning Time: 0.883 ms
Execution Time: 0.078 ms
(5 rows)
=== end test case ===
The expected result is that with enable_seqscan = OFF, the index should be used, but instead a sequential scan is reported as above. The same happens without enable_seqscan = OFF on
a moderately large test set of over 2000 rows (with extended run times circa 10+ seconds on my hardware).
I have tried the same with similar results on PG13.
Have I constructed the index incorrectly, or is there some other way to convince the query to use it?
Shaheed Haque <shaheedhaque@gmail.com> writes: > -- Create index designed to match the query. > -- > create index idx1 on payrun using gin ((snapshot->'$.employee.* ? > (@.pay_graph <> 0 || @.last_run_of_employment == true || > @.state.employment[last][2] == 0)')); But that doesn't match the query; it's not even the same topmost operator: > explain analyse SELECT id,snapshot #>'{employee,999,state,employment}' FROM > "payrun" WHERE (snapshot @? '$.employee."999" ? (@.pay_graph <> 0 || > @.last_run_of_employment > == true || @.state.employment[last][2] == 0)'); In general you seem to have much too high an opinion of what PG's index machinery can cope with. The general pattern is that it can use a query WHERE clause with an index if the clause is of the form "indexed-column indexable-operator constant". There's a small number of special cases where it can transform things that don't initially look like that into the right form, but AFAIR we don't have any such special cases for any json-related operators. The one saving grace is that "indexed-column" can be an expression appearing in an index, so in some cases you can finesse things that way. But you won't find any deep knowledge of jsonpath expressions in there. Having said that, @? is reported as an indexable operator in v14: regression=# \dAo gin jsonb* List of operators of operator families AM | Operator family | Operator | Strategy | Purpose -----+-----------------+--------------------+----------+--------- gin | jsonb_ops | @>(jsonb,jsonb) | 7 | search gin | jsonb_ops | @?(jsonb,jsonpath) | 15 | search gin | jsonb_ops | @@(jsonb,jsonpath) | 16 | search gin | jsonb_ops | ?(jsonb,text) | 9 | search gin | jsonb_ops | ?|(jsonb,text[]) | 10 | search gin | jsonb_ops | ?&(jsonb,text[]) | 11 | search gin | jsonb_path_ops | @>(jsonb,jsonb) | 7 | search gin | jsonb_path_ops | @?(jsonb,jsonpath) | 15 | search gin | jsonb_path_ops | @@(jsonb,jsonpath) | 16 | search (9 rows) so it seems like you ought to get some benefit for this query from just a plain GIN index on "snapshot". regards, tom lane
On Thu, 2 Jun 2022 at 15:31, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Shaheed Haque <shaheedhaque@gmail.com> writes: > > -- Create index designed to match the query. > > -- > > create index idx1 on payrun using gin ((snapshot->'$.employee.* ? > > (@.pay_graph <> 0 || @.last_run_of_employment == true || > > @.state.employment[last][2] == 0)')); > > But that doesn't match the query; it's not even the same topmost > operator: > > > explain analyse SELECT id,snapshot #>'{employee,999,state,employment}' FROM > > "payrun" WHERE (snapshot @? '$.employee."999" ? (@.pay_graph <> 0 || > > @.last_run_of_employment > > == true || @.state.employment[last][2] == 0)'); I assume you are referring to the difference between "snapshot @?" and "snapshot ->"? If so, apologies: too much cutting and pasting from too many experiments. I did in fact also try the "using btree ((snapshot @?" form but it gave the same results. > In general you seem to have much too high an opinion of what PG's > index machinery can cope with. The general pattern is that it can > use a query WHERE clause with an index if the clause is of the form > "indexed-column indexable-operator constant". There's a small number > of special cases where it can transform things that don't initially > look like that into the right form, but AFAIR we don't have any > such special cases for any json-related operators. LOL. I'm pretty much a noob here, so that's very possible. > The one saving grace is that "indexed-column" can be an expression > appearing in an index, so in some cases you can finesse things > that way. But you won't find any deep knowledge of jsonpath > expressions in there. I was basing my efforts on this statement in the docs https://www.postgresql.org/docs/14/datatype-json.html#JSON-INDEXING: GIN index extracts statements of following form out of jsonpath: accessors_chain = const. Accessors chain may consist of .key, [*], and [index] accessors. jsonb_ops additionally supports .* and .** accessors. Did I mis-implement, misunderstand or read too much into this? > Having said that, @? is reported as an indexable operator in v14: > > regression=# \dAo gin jsonb* > List of operators of operator families > AM | Operator family | Operator | Strategy | Purpose > -----+-----------------+--------------------+----------+--------- > gin | jsonb_ops | @>(jsonb,jsonb) | 7 | search > gin | jsonb_ops | @?(jsonb,jsonpath) | 15 | search > gin | jsonb_ops | @@(jsonb,jsonpath) | 16 | search > gin | jsonb_ops | ?(jsonb,text) | 9 | search > gin | jsonb_ops | ?|(jsonb,text[]) | 10 | search > gin | jsonb_ops | ?&(jsonb,text[]) | 11 | search > gin | jsonb_path_ops | @>(jsonb,jsonb) | 7 | search > gin | jsonb_path_ops | @?(jsonb,jsonpath) | 15 | search > gin | jsonb_path_ops | @@(jsonb,jsonpath) | 16 | search > (9 rows) > > so it seems like you ought to get some benefit for this query > from just a plain GIN index on "snapshot". Interesting. I'm pretty sure I started there a few days ago without any luck but I'll give it another spin (having learnt quite a bit since then). > > regards, tom lane
OK, I have corrected and simplified the test case (including switching to a btree index). The WHERE clause and the inex now look like this: ...WHERE ((snapshot -> 'employee' -> '999' ->> 'pay_graph')::integer != 0); ...USING btree (((snapshot -> 'employee' -> '$.*' ->> 'pay_graph')::integer != 0)); But the index is still not being used (test case below). I have confirmed that the equality operator is listed for "search" (I assume inequality is the same as equality, but I tried both): # \dAo btree jsonb* List of operators of operator families AM | Operator family | Operator | Strategy | Purpose -------+-----------------+-----------------+----------+--------- btree | jsonb_ops | <(jsonb,jsonb) | 1 | search btree | jsonb_ops | <=(jsonb,jsonb) | 2 | search btree | jsonb_ops | =(jsonb,jsonb) | 3 | search btree | jsonb_ops | >=(jsonb,jsonb) | 4 | search btree | jsonb_ops | >(jsonb,jsonb) | 5 | search (5 rows) If this is not a bug, then how should the query or the index be changed to make this work? === begin test case === CREATE TABLE payrun ( id serial primary key, snapshot JSONB ); INSERT INTO payrun(snapshot) VALUES ('{"employee": {"999": {"id": 999, "state": {"employment": [["1920-01-01", null, 3]]}, "pay_graph": 0, "last_run_of_employment": false}, "111": {"id": 111, "state": {"employment": [["1920-01-01", null, 5]]}, "pay_graph": 3, "last_run_of_employment": true}}}'), ('{"employee": {"999": {"id": 999, "state": {"employment": [["1970-01-01", null, 3]]}, "pay_graph": 6, "last_run_of_employment": true}, "222": {"id": 222, "state": {"employment": [["1920-01-01", null, 5]]}, "pay_graph": 5, "last_run_of_employment": true}}}'), ('{"employee": {"998": {"id": 998, "state": {"employment": [["1980-01-01", null, 3]]}, "pay_graph": 7, "last_run_of_employment": false}, "333": {"id": 333, "state": {"employment": [["1920-01-01", null, 5]]}, "pay_graph": 3, "last_run_of_employment": true}}}') ; SELECT id,snapshot #>'{employee,999,state,employment}' FROM "payrun" WHERE ((snapshot -> 'employee' -> '999' ->> 'pay_graph')::integer != 0); -- -- Create index designed to match the query. -- CREATE INDEX idx1 ON payrun USING btree (((snapshot -> 'employee' -> '$.*' ->> 'pay_graph')::integer != 0)); set enable_seqscan = OFF; -- -- EXPLAIN ANALYZE ...query above... -- explain analyze SELECT id,snapshot #>'{employee,999,state,employment}' FROM "payrun" WHERE ((snapshot -> 'employee' -> '999' ->> 'pay_graph')::integer != 0); QUERY PLAN ----------------------------------------------------------------------------------------------------------------------- Seq Scan on payrun (cost=10000000000.00..10000000001.08 rows=2 width=36) (actual time=70.051..70.052 rows=1 loops=1) Filter: (((((snapshot -> 'employee'::text) -> '999'::text) ->> 'pay_graph'::text))::integer <> 0) Rows Removed by Filter: 2 Planning Time: 0.147 ms JIT: Functions: 4 Options: Inlining true, Optimization true, Expressions true, Deforming true Timing: Generation 0.354 ms, Inlining 8.305 ms, Optimization 49.237 ms, Emission 12.499 ms, Total 70.395 ms Execution Time: 70.428 ms (9 rows) === end test case === Thanks, Shaheed On Thu, 2 Jun 2022 at 16:51, Shaheed Haque <shaheedhaque@gmail.com> wrote: > > On Thu, 2 Jun 2022 at 15:31, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > Shaheed Haque <shaheedhaque@gmail.com> writes: > > > -- Create index designed to match the query. > > > -- > > > create index idx1 on payrun using gin ((snapshot->'$.employee.* ? > > > (@.pay_graph <> 0 || @.last_run_of_employment == true || > > > @.state.employment[last][2] == 0)')); > > > > But that doesn't match the query; it's not even the same topmost > > operator: > > > > > explain analyse SELECT id,snapshot #>'{employee,999,state,employment}' FROM > > > "payrun" WHERE (snapshot @? '$.employee."999" ? (@.pay_graph <> 0 || > > > @.last_run_of_employment > > > == true || @.state.employment[last][2] == 0)'); > > I assume you are referring to the difference between "snapshot @?" and > "snapshot ->"? If so, apologies: too much cutting and pasting from too > many experiments. I did in fact also try the "using btree ((snapshot > @?" form but it gave the same results. > > > In general you seem to have much too high an opinion of what PG's > > index machinery can cope with. The general pattern is that it can > > use a query WHERE clause with an index if the clause is of the form > > "indexed-column indexable-operator constant". There's a small number > > of special cases where it can transform things that don't initially > > look like that into the right form, but AFAIR we don't have any > > such special cases for any json-related operators. > > LOL. I'm pretty much a noob here, so that's very possible. > > > The one saving grace is that "indexed-column" can be an expression > > appearing in an index, so in some cases you can finesse things > > that way. But you won't find any deep knowledge of jsonpath > > expressions in there. > > I was basing my efforts on this statement in the docs > https://www.postgresql.org/docs/14/datatype-json.html#JSON-INDEXING: > > GIN index extracts statements of following form out of jsonpath: > accessors_chain = const. Accessors chain may consist of .key, [*], and > [index] accessors. jsonb_ops additionally supports .* and .** > accessors. > > Did I mis-implement, misunderstand or read too much into this? > > > Having said that, @? is reported as an indexable operator in v14: > > > > regression=# \dAo gin jsonb* > > List of operators of operator families > > AM | Operator family | Operator | Strategy | Purpose > > -----+-----------------+--------------------+----------+--------- > > gin | jsonb_ops | @>(jsonb,jsonb) | 7 | search > > gin | jsonb_ops | @?(jsonb,jsonpath) | 15 | search > > gin | jsonb_ops | @@(jsonb,jsonpath) | 16 | search > > gin | jsonb_ops | ?(jsonb,text) | 9 | search > > gin | jsonb_ops | ?|(jsonb,text[]) | 10 | search > > gin | jsonb_ops | ?&(jsonb,text[]) | 11 | search > > gin | jsonb_path_ops | @>(jsonb,jsonb) | 7 | search > > gin | jsonb_path_ops | @?(jsonb,jsonpath) | 15 | search > > gin | jsonb_path_ops | @@(jsonb,jsonpath) | 16 | search > > (9 rows) > > > > so it seems like you ought to get some benefit for this query > > from just a plain GIN index on "snapshot". > > Interesting. I'm pretty sure I started there a few days ago without > any luck but I'll give it another spin (having learnt quite a bit > since then). > > > > > regards, tom lane
Op 12-06-2022 om 11:34 schreef Shaheed Haque: > OK, I have corrected and simplified the test case (including switching > to a btree index). The WHERE clause and the inex now look like this: > > ...WHERE ((snapshot -> 'employee' -> '999' ->> > 'pay_graph')::integer != 0); > ...USING btree (((snapshot -> 'employee' -> '$.*' ->> > 'pay_graph')::integer != 0)); > > But the index is still not being used (test case below). I have > confirmed that the equality operator is listed for "search" (I assume > inequality is the same as equality, but I tried both): > > # \dAo btree jsonb* > List of operators of operator families > AM | Operator family | Operator | Strategy | Purpose > -------+-----------------+-----------------+----------+--------- > btree | jsonb_ops | <(jsonb,jsonb) | 1 | search > btree | jsonb_ops | <=(jsonb,jsonb) | 2 | search > btree | jsonb_ops | =(jsonb,jsonb) | 3 | search > btree | jsonb_ops | >=(jsonb,jsonb) | 4 | search > btree | jsonb_ops | >(jsonb,jsonb) | 5 | search > (5 rows) > > If this is not a bug, then how should the query or the index be > changed to make this work? > > > === begin test case === > > CREATE TABLE payrun ( > id serial primary key, > snapshot JSONB > ); > > INSERT INTO payrun(snapshot) > VALUES > ('{"employee": {"999": {"id": 999, "state": {"employment": > [["1920-01-01", null, 3]]}, "pay_graph": 0, "last_run_of_employment": > false}, "111": {"id": 111, "state": {"employment": [["1920-01-01", > null, 5]]}, "pay_graph": 3, "last_run_of_employment": true}}}'), > ('{"employee": {"999": {"id": 999, "state": {"employment": > [["1970-01-01", null, 3]]}, "pay_graph": 6, "last_run_of_employment": > true}, "222": {"id": 222, "state": {"employment": [["1920-01-01", > null, 5]]}, "pay_graph": 5, "last_run_of_employment": true}}}'), > ('{"employee": {"998": {"id": 998, "state": {"employment": > [["1980-01-01", null, 3]]}, "pay_graph": 7, "last_run_of_employment": > false}, "333": {"id": 333, "state": {"employment": [["1920-01-01", > null, 5]]}, "pay_graph": 3, "last_run_of_employment": true}}}') > ; > > SELECT id,snapshot #>'{employee,999,state,employment}' FROM "payrun" > WHERE ((snapshot -> 'employee' -> '999' ->> 'pay_graph')::integer != > 0); > > -- > -- Create index designed to match the query. > -- > CREATE INDEX idx1 ON payrun USING btree (((snapshot -> 'employee' -> > '$.*' ->> 'pay_graph')::integer != 0)); > > set enable_seqscan = OFF; > How is this? I took the triple condition from your earlier email. I did not use your index. I added one index using gin jsonb_path_ops. create index payrun_jspathop_idx ON payrun using gin (snapshot jsonb_path_ops); set enable_seqscan = OFF; \timing on select id, snapshot #>'{employee,999,state,employment}' from payrun where snapshot @? '$."employee"."999" ? ( @.pay_graph <> 0 || @.last_run_of_employment == true || @.state.employment[last][2] == 0 )'; id | ?column? ----+--------------------------- 2 | [["1970-01-01", null, 3]] (1 row) Time: 0.897 ms explain analyze select id, snapshot #>'{employee,999,state,employment}' from payrun where snapshot @? '$."employee"."999" ? ( @.pay_graph <> 0 || @.last_run_of_employment == true || @.state.employment[last][2] == 0 )'; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on payrun (cost=136.00..140.02 rows=1 width=36) (actual time=0.018..0.019 rows=1 loops=1) Recheck Cond: (snapshot @? '$."employee"."999"?((@."pay_graph" != 0 || @."last_run_of_employment" == true) || @."state"."employment"[last][2] == 0)'::jsonpath) Rows Removed by Index Recheck: 2 Heap Blocks: exact=1 -> Bitmap Index Scan on payrun_jspathop_idx (cost=0.00..136.00 rows=1 width=0) (actual time=0.007..0.007 rows=3 loops=1) Index Cond: (snapshot @? '$."employee"."999"?((@."pay_graph" != 0 || @."last_run_of_employment" == true) || @."state"."employment"[last][2] == 0)'::jsonpath) Planning Time: 0.034 ms Execution Time: 0.033 ms (8 rows) Time: 0.284 ms hope that helps. Erik Rijkers
Thanks Erik. Is the point that the index has to be on the JSON field as a whole (i.e. "snapshot") rather than deep inside it (e.g. "snapshot.something.further[down]")? On Sun, 12 Jun 2022 at 11:53, Erik Rijkers <er@xs4all.nl> wrote: > > Op 12-06-2022 om 11:34 schreef Shaheed Haque: > > OK, I have corrected and simplified the test case (including switching > > to a btree index). The WHERE clause and the inex now look like this: > > > > ...WHERE ((snapshot -> 'employee' -> '999' ->> > > 'pay_graph')::integer != 0); > > ...USING btree (((snapshot -> 'employee' -> '$.*' ->> > > 'pay_graph')::integer != 0)); > > > > But the index is still not being used (test case below). I have > > confirmed that the equality operator is listed for "search" (I assume > > inequality is the same as equality, but I tried both): > > > > # \dAo btree jsonb* > > List of operators of operator families > > AM | Operator family | Operator | Strategy | Purpose > > -------+-----------------+-----------------+----------+--------- > > btree | jsonb_ops | <(jsonb,jsonb) | 1 | search > > btree | jsonb_ops | <=(jsonb,jsonb) | 2 | search > > btree | jsonb_ops | =(jsonb,jsonb) | 3 | search > > btree | jsonb_ops | >=(jsonb,jsonb) | 4 | search > > btree | jsonb_ops | >(jsonb,jsonb) | 5 | search > > (5 rows) > > > > If this is not a bug, then how should the query or the index be > > changed to make this work? > > > > > > === begin test case === > > > > CREATE TABLE payrun ( > > id serial primary key, > > snapshot JSONB > > ); > > > > INSERT INTO payrun(snapshot) > > VALUES > > ('{"employee": {"999": {"id": 999, "state": {"employment": > > [["1920-01-01", null, 3]]}, "pay_graph": 0, "last_run_of_employment": > > false}, "111": {"id": 111, "state": {"employment": [["1920-01-01", > > null, 5]]}, "pay_graph": 3, "last_run_of_employment": true}}}'), > > ('{"employee": {"999": {"id": 999, "state": {"employment": > > [["1970-01-01", null, 3]]}, "pay_graph": 6, "last_run_of_employment": > > true}, "222": {"id": 222, "state": {"employment": [["1920-01-01", > > null, 5]]}, "pay_graph": 5, "last_run_of_employment": true}}}'), > > ('{"employee": {"998": {"id": 998, "state": {"employment": > > [["1980-01-01", null, 3]]}, "pay_graph": 7, "last_run_of_employment": > > false}, "333": {"id": 333, "state": {"employment": [["1920-01-01", > > null, 5]]}, "pay_graph": 3, "last_run_of_employment": true}}}') > > ; > > > > SELECT id,snapshot #>'{employee,999,state,employment}' FROM "payrun" > > WHERE ((snapshot -> 'employee' -> '999' ->> 'pay_graph')::integer != > > 0); > > > > -- > > -- Create index designed to match the query. > > -- > > CREATE INDEX idx1 ON payrun USING btree (((snapshot -> 'employee' -> > > '$.*' ->> 'pay_graph')::integer != 0)); > > > > set enable_seqscan = OFF; > > > > How is this? > > I took the triple condition from your earlier email. > I did not use your index. > I added one index using gin jsonb_path_ops. > > create index payrun_jspathop_idx ON payrun using gin (snapshot > jsonb_path_ops); > set enable_seqscan = OFF; > \timing on > select id, snapshot #>'{employee,999,state,employment}' from payrun > where snapshot @? '$."employee"."999" ? > ( @.pay_graph <> 0 > || @.last_run_of_employment == true > || @.state.employment[last][2] == 0 > )'; > id | ?column? > ----+--------------------------- > 2 | [["1970-01-01", null, 3]] > (1 row) > > Time: 0.897 ms > explain analyze select id, snapshot #>'{employee,999,state,employment}' > from payrun > where snapshot @? '$."employee"."999" ? > ( @.pay_graph <> 0 > || @.last_run_of_employment == true > || @.state.employment[last][2] == 0 > )'; > > QUERY PLAN > > ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- > Bitmap Heap Scan on payrun (cost=136.00..140.02 rows=1 width=36) > (actual time=0.018..0.019 rows=1 loops=1) > Recheck Cond: (snapshot @? '$."employee"."999"?((@."pay_graph" != 0 > || @."last_run_of_employment" == true) || > @."state"."employment"[last][2] == 0)'::jsonpath) > Rows Removed by Index Recheck: 2 > Heap Blocks: exact=1 > -> Bitmap Index Scan on payrun_jspathop_idx (cost=0.00..136.00 > rows=1 width=0) (actual time=0.007..0.007 rows=3 loops=1) > Index Cond: (snapshot @? '$."employee"."999"?((@."pay_graph" > != 0 || @."last_run_of_employment" == true) || > @."state"."employment"[last][2] == 0)'::jsonpath) > Planning Time: 0.034 ms > Execution Time: 0.033 ms > (8 rows) > > Time: 0.284 ms > > > hope that helps. > > Erik Rijkers
(Resend, wrong version was sent before) Thanks Erik. Is the point that the index has to be on the JSON field as a whole (i.e. "snapshot") rather than deep inside it (e.g. "snapshot.something.further[down]")? In my case, the snapshot is several MB in size (perhaps 10MB or even 20MB), dominated by the snapshot.employee (cardinality 10k, each sized as a dict 1-2kB as text). My expectation/guess is that an index of "snapshot" will itself be of a size of similar order. However the design as-is works very well except for this one case where to speed it up, in principle, the index need contain no more than one boolean per employee. So that's what I'd like to achieve, if possible. I've seen no hint in the documentation that creating the index on "snapshot.something.further[down]" should not work, and PG certainly allows it to be created. Also, Tom has suggested that I should not look to some magical ability to infer the use of the index from a differently structured query, and I've taken that on board with the updated query + index. AFAIK, there are 3 possibilities: - I've not done things right, in which case I'd love to know my mistake. - It is not supposed to work, in which case it would be good to have that stated, and maybe have PG not allow useless indices to be created. - It is a bug. All input much appreciated, Thanks, Shaheed On Sun, 12 Jun 2022 at 11:53, Erik Rijkers <er@xs4all.nl> wrote: > > Op 12-06-2022 om 11:34 schreef Shaheed Haque: > > OK, I have corrected and simplified the test case (including switching > > to a btree index). The WHERE clause and the inex now look like this: > > > > ...WHERE ((snapshot -> 'employee' -> '999' ->> > > 'pay_graph')::integer != 0); > > ...USING btree (((snapshot -> 'employee' -> '$.*' ->> > > 'pay_graph')::integer != 0)); > > > > But the index is still not being used (test case below). I have > > confirmed that the equality operator is listed for "search" (I assume > > inequality is the same as equality, but I tried both): > > > > # \dAo btree jsonb* > > List of operators of operator families > > AM | Operator family | Operator | Strategy | Purpose > > -------+-----------------+-----------------+----------+--------- > > btree | jsonb_ops | <(jsonb,jsonb) | 1 | search > > btree | jsonb_ops | <=(jsonb,jsonb) | 2 | search > > btree | jsonb_ops | =(jsonb,jsonb) | 3 | search > > btree | jsonb_ops | >=(jsonb,jsonb) | 4 | search > > btree | jsonb_ops | >(jsonb,jsonb) | 5 | search > > (5 rows) > > > > If this is not a bug, then how should the query or the index be > > changed to make this work? > > > > > > === begin test case === > > > > CREATE TABLE payrun ( > > id serial primary key, > > snapshot JSONB > > ); > > > > INSERT INTO payrun(snapshot) > > VALUES > > ('{"employee": {"999": {"id": 999, "state": {"employment": > > [["1920-01-01", null, 3]]}, "pay_graph": 0, "last_run_of_employment": > > false}, "111": {"id": 111, "state": {"employment": [["1920-01-01", > > null, 5]]}, "pay_graph": 3, "last_run_of_employment": true}}}'), > > ('{"employee": {"999": {"id": 999, "state": {"employment": > > [["1970-01-01", null, 3]]}, "pay_graph": 6, "last_run_of_employment": > > true}, "222": {"id": 222, "state": {"employment": [["1920-01-01", > > null, 5]]}, "pay_graph": 5, "last_run_of_employment": true}}}'), > > ('{"employee": {"998": {"id": 998, "state": {"employment": > > [["1980-01-01", null, 3]]}, "pay_graph": 7, "last_run_of_employment": > > false}, "333": {"id": 333, "state": {"employment": [["1920-01-01", > > null, 5]]}, "pay_graph": 3, "last_run_of_employment": true}}}') > > ; > > > > SELECT id,snapshot #>'{employee,999,state,employment}' FROM "payrun" > > WHERE ((snapshot -> 'employee' -> '999' ->> 'pay_graph')::integer != > > 0); > > > > -- > > -- Create index designed to match the query. > > -- > > CREATE INDEX idx1 ON payrun USING btree (((snapshot -> 'employee' -> > > '$.*' ->> 'pay_graph')::integer != 0)); > > > > set enable_seqscan = OFF; > > > > How is this? > > I took the triple condition from your earlier email. > I did not use your index. > I added one index using gin jsonb_path_ops. > > create index payrun_jspathop_idx ON payrun using gin (snapshot > jsonb_path_ops); > set enable_seqscan = OFF; > \timing on > select id, snapshot #>'{employee,999,state,employment}' from payrun > where snapshot @? '$."employee"."999" ? > ( @.pay_graph <> 0 > || @.last_run_of_employment == true > || @.state.employment[last][2] == 0 > )'; > id | ?column? > ----+--------------------------- > 2 | [["1970-01-01", null, 3]] > (1 row) > > Time: 0.897 ms > explain analyze select id, snapshot #>'{employee,999,state,employment}' > from payrun > where snapshot @? '$."employee"."999" ? > ( @.pay_graph <> 0 > || @.last_run_of_employment == true > || @.state.employment[last][2] == 0 > )'; > > QUERY PLAN > > ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- > Bitmap Heap Scan on payrun (cost=136.00..140.02 rows=1 width=36) > (actual time=0.018..0.019 rows=1 loops=1) > Recheck Cond: (snapshot @? '$."employee"."999"?((@."pay_graph" != 0 > || @."last_run_of_employment" == true) || > @."state"."employment"[last][2] == 0)'::jsonpath) > Rows Removed by Index Recheck: 2 > Heap Blocks: exact=1 > -> Bitmap Index Scan on payrun_jspathop_idx (cost=0.00..136.00 > rows=1 width=0) (actual time=0.007..0.007 rows=3 loops=1) > Index Cond: (snapshot @? '$."employee"."999"?((@."pay_graph" > != 0 || @."last_run_of_employment" == true) || > @."state"."employment"[last][2] == 0)'::jsonpath) > Planning Time: 0.034 ms > Execution Time: 0.033 ms > (8 rows) > > Time: 0.284 ms > > > hope that helps. > > Erik Rijkers
On Sun, Jun 12, 2022 at 5:34 AM Shaheed Haque <shaheedhaque@gmail.com> wrote:
OK, I have corrected and simplified the test case (including switching
to a btree index). The WHERE clause and the inex now look like this:
...WHERE ((snapshot -> 'employee' -> '999' ->>
'pay_graph')::integer != 0);
...USING btree (((snapshot -> 'employee' -> '$.*' ->>
'pay_graph')::integer != 0));
But, this is not a correction. You are still trying to use -> as if it were @?, and that is still not going to work.
You are indexing the part of snapshot which has the employee number of '$.*', which is a weird employee number for anyone to have. You might want to represent a wildcard but that is not what -> does.
Cheer,
Jeff
Jeff Janes <jeff.janes@gmail.com> writes: > On Sun, Jun 12, 2022 at 5:34 AM Shaheed Haque <shaheedhaque@gmail.com> > wrote: >> OK, I have corrected and simplified the test case (including switching >> to a btree index). The WHERE clause and the inex now look like this: >> >> ...WHERE ((snapshot -> 'employee' -> '999' ->> >> 'pay_graph')::integer != 0); >> ...USING btree (((snapshot -> 'employee' -> '$.*' ->> >> 'pay_graph')::integer != 0)); > But, this is not a correction. You are still trying to use -> as if it > were @?, and that is still not going to work. In hopes of clarifying some more: all that index does is to record the boolean result of (snapshot -> 'employee' -> '$.*' ->> 'pay_graph')::integer != 0 at each row. We could use it for a query that contains *exactly* that condition as a WHERE clause. We cannot use it for a query that contains some other condition, even if that other condition looks related to you. > You are indexing the part of snapshot which has the employee number of > '$.*', which is a weird employee number for anyone to have. You might want > to represent a wildcard but that is not what -> does. Yeah, there's also the problem that the semantics of this particular expression aren't really useful. But even if they were, PG's index machinery is not smart enough to pick apart the contents of an index expression. If the index expression *exactly* matches some sub-expression of a WHERE clause, and what's above that sub-expression is an operator that's indexable according to the index opclass, then we have a chance of using it. This example is not that. regards, tom lane
On Sun, 12 Jun 2022 at 22:55, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Jeff Janes <jeff.janes@gmail.com> writes: > > On Sun, Jun 12, 2022 at 5:34 AM Shaheed Haque <shaheedhaque@gmail.com> > > wrote: > >> OK, I have corrected and simplified the test case (including switching > >> to a btree index). The WHERE clause and the inex now look like this: > >> > >> ...WHERE ((snapshot -> 'employee' -> '999' ->> > >> 'pay_graph')::integer != 0); > >> ...USING btree (((snapshot -> 'employee' -> '$.*' ->> > >> 'pay_graph')::integer != 0)); > > > But, this is not a correction. You are still trying to use -> as if it > > were @?, and that is still not going to work. > > In hopes of clarifying some more: all that index does is to record > the boolean result of > (snapshot -> 'employee' -> '$.*' ->> 'pay_graph')::integer != 0 > at each row. We could use it for a query that contains *exactly* > that condition as a WHERE clause. We cannot use it for a query that > contains some other condition, even if that other condition looks > related to you. OK, I see that I got myself all confused and the @? form is needed. So: ...USING btree ((snapshot -> 'employee' @? '$.* ? (@.pay_graph != 0)')); ...WHERE ((snapshot -> 'employee' @? '$."999" ? (@.pay_graph != 0)')) > > You are indexing the part of snapshot which has the employee number of > > '$.*', which is a weird employee number for anyone to have. You might want > > to represent a wildcard but that is not what -> does. > > Yeah, there's also the problem that the semantics of this particular > expression aren't really useful. Does the switch back to @? address this point? If not, please clarify. > But even if they were, PG's index > machinery is not smart enough to pick apart the contents of an index > expression. If the index expression *exactly* matches some sub-expression To my inexpert eye, given the presence of the wildcard, the above look like an exact match. What have I missed? > of a WHERE clause, and what's above that sub-expression is an operator > that's indexable according to the index opclass, then we have a chance > of using it. This example is not that. Because I switched to btree, and btree cannot search on "@?"? So, what should the index+query look like using gin? (I am trying to address a niche case, and can easily arrange for them to match if I knew what was needed). Thanks, Shaheed > > regards, tom lane
On Sun, Jun 12, 2022 at 4:52 PM Shaheed Haque <shaheedhaque@gmail.com> wrote:
...USING btree ((snapshot -> 'employee' @? '$.* ? (@.pay_graph != 0)'));
...WHERE ((snapshot -> 'employee' @? '$."999" ? (@.pay_graph != 0)'))
To my inexpert eye, given the presence of the wildcard, the above look
like an exact match. What have I missed?
The system doesn't see a wildcard. Exact match means identical characters. The fact that you had to write "given the presence of the wildcard" is why this is not an exact match.
David J.
On Sun, Jun 12, 2022 at 3:31 PM Shaheed Haque <shaheedhaque@gmail.com> wrote:
(Resend, wrong version was sent before)
Thanks Erik. Is the point that the index has to be on the JSON field
as a whole (i.e. "snapshot") rather than deep inside it (e.g.
"snapshot.something.further[down]")?
In my case, the snapshot is several MB in size (perhaps 10MB or even
20MB), dominated by the snapshot.employee (cardinality 10k, each sized
as a dict 1-2kB as text). My expectation/guess is that an index of
"snapshot" will itself be of a size of similar order. However the
design as-is works very well except for this one case where to speed
it up, in principle, the index need contain no more than one boolean
per employee. So that's what I'd like to achieve, if possible.
It sounds like what you really want here is to extract just the list of the ids which meet one of your three further criteria, and index that list.
You can do that with jsonpath, but you have to apply it with a function, not one of the boolean-returning operators.
This almost works to do that:
create index on payrun using gin (jsonb_path_query_array(snapshot, '$.employee.* ? (@.pay_graph <> 0 || @.last_run_of_employment == true || @.state.employment[last][2] == 0).id'));
Then query it like:
select * from payrun where jsonb_path_query_array(snapshot, '$.employee.* ? (@.pay_graph <> 0 || @.last_run_of_employment == true || @.state.employment[last][2] == 0).id') ? '999';
The problem is that ? only tests for top-level text values, while in your example the value for the 'id' keys are ints, not text. So they are not findable with the ? operator. If I edit your example data to wrap the ids' values in double quotes, turning them into json strings rather than json ints, then this does work for me.
Maybe there is a way to modify the jsonpath so that it converts the ints to text for you. But if there is such a way, I don't know what it is.
If I were doing this for my own system, I would probably create an immutable user-defined-function which took a jsonb and returned an int[] of the filtered id values. Then you would have to query it with @> rather than ?.
- I've not done things right, in which case I'd love to know my mistake.
- It is not supposed to work, in which case it would be good to have
that stated, and maybe have PG not allow useless indices to be
created.
- It is a bug.
I guess it is number one mixed with number two. The index you created is useless for your intended purpose, but is not useless for every conceivable purpose. It is not realistic to expect PostgreSQL to reject things just because it is not obvious (to a computer) what you are getting up to.
Cheers,
Jeff
Op 12-06-2022 om 21:31 schreef Shaheed Haque: > Thanks Erik. Is the point that the index has to be on the JSON field > as a whole (i.e. "snapshot") rather than deep inside it (e.g. > "snapshot.something.further[down]")? > > In my case, the snapshot is several MB in size (perhaps 10MB or even > 20MB), dominated by the snapshot.employee (cardinality 10k, each sized > as a dict 1-2kB as text). My expectation/guess is that an index of > "snapshot" will itself be of a size of similar order. However the > design as-is works very well except for this one case where to speed > it up, in principle, the index need contain no more than one boolean > per employee. So that's what I'd like to achieve, if possible. > > I've seen no hint in the documentation that creating the index on > "snapshot.something.further[down]" should not work, and PG certainly > allows it to be created. Also, Tom has suggested that I should not > look to some magical ability to infer the use of the index from a > differently structured query, and I've taken that on board with the > updated query + index. If you insist on a btree/integer on pay_graph, the below stuff seems to work, no? Perhaps you've missed the difference between x != 0 on the one hand, and x > 0 on the other, Here are both queries to show that difference. CREATE INDEX payrun_btree_paygr_idx ON payrun using btree (((snapshot->'employee'->'999'->>'pay_graph')::integer)); set enable_seqscan = 0; SELECT id, snapshot #>'{employee,999,state,employment}' FROM payrun WHERE (snapshot->'employee'->'999'->>'pay_graph')::integer > 0 ; id | ?column? ----+--------------------------- 2 | [["1970-01-01", null, 3]] (1 row) Time: 1.384 ms explain analyze SELECT id, snapshot #>'{employee,999,state,employment}' FROM payrun WHERE (snapshot->'employee'->'999'->>'pay_graph')::integer > 0 ; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------- Index Scan using payrun_btree_paygr_idx on payrun (cost=0.14..8.98 rows=42 width=36) (actual time=0.018..0.020 rows=1 loops=1) Index Cond: (((((snapshot -> 'employee'::text) -> '999'::text) ->> 'pay_graph'::text))::integer > 0) Planning Time: 0.047 ms Execution Time: 0.048 ms (4 rows) Time: 0.386 ms explain analyze SELECT id, snapshot #>'{employee,999,state,employment}' FROM payrun WHERE (snapshot->'employee'->'999'->>'pay_graph')::integer != 0 ; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------- Seq Scan on payrun (cost=10000000000.00..10000000004.49 rows=126 width=36) (actual time=112.182..112.186 rows=1 loops=1) Filter: (((((snapshot -> 'employee'::text) -> '999'::text) ->> 'pay_graph'::text))::integer <> 0) Rows Removed by Filter: 3 Planning Time: 0.050 ms JIT: Functions: 4 Options: Inlining true, Optimization true, Expressions true, Deforming true Timing: Generation 0.636 ms, Inlining 41.761 ms, Optimization 52.033 ms, Emission 18.228 ms, Total 112.658 ms Execution Time: 153.486 ms (9 rows) Time: 153.835 ms So: 0.386 ms for x > 0 uses payrun_btree_paygr_idx vs: 153.835 ms for x != 0 uses seq scan It's hard to see for me where you encounter problems; including the fast searches in my earier mail, it looks to me like you have many good options. Erik Rijkers
Jeff, David, Erik, Tom, I now see that I misunderstood several aspects of the problem. With regard to Jeff's last note, I note (and like!!!!): - the use of the final ".id" to collapse the employee "dict" into an array of simple values. - the use of "?" to test elements in the array as "keys". As Jeff noted, the "?" requires a text value. Luckily I have a ".username" in my real data that should work...so I think I have a solution without recourse to a custom function! I have appended an updated test case showing the indexing working. Finally, I do wonder if a "jsonb_path_query_array_text" function might be worth considering as an enhancement for when there is not a usable text value available? Thanks again for all your kind help. Shaheed == test case showing how to index "deep" into a JSONB field == DROP TABLE payrun; DROP INDEX idx1; CREATE TABLE payrun ( id serial primary key, snapshot JSONB ); INSERT INTO payrun(snapshot) VALUES ('{"employee": {"999": {"id": "user999@foo.com", "state": {"employment": [["1920-01-01", null, 3]]}, "pay_graph": 0, "last_run_of_employment": false}, "111": {"id": "user111@foo.com", "state": {"employment": [["1920-01-01", null, 5]]}, "pay_graph": 3, "last_run_of_employment": true}}}'), ('{"employee": {"999": {"id": "user999@foo.com", "state": {"employment": [["1970-01-01", null, 3]]}, "pay_graph": 6, "last_run_of_employment": true}, "222": {"id": "user222@foo.com", "state": {"employment": [["1920-01-01", null, 5]]}, "pay_graph": 5, "last_run_of_employment": true}}}'), ('{"employee": {"998": {"id": "user998@foo.com", "state": {"employment": [["1980-01-01", null, 3]]}, "pay_graph": 7, "last_run_of_employment": false}, "333": {"id": "user333@foo.com", "state": {"employment": [["1920-01-01", null, 5]]}, "pay_graph": 3, "last_run_of_employment": true}}}') ; CREATE INDEX idx1 ON payrun USING gin (jsonb_path_query_array(snapshot, '$.employee.* ? (@.pay_graph <> 0 || @.last_run_of_employment == true || @.state.employment[last][2] == 0).id')); SELECT id,snapshot #>'{employee,999,state,employment}' FROM "payrun" WHERE jsonb_path_query_array(snapshot, '$.employee.* ? (@.pay_graph <> 0 || @.last_run_of_employment == true || @.state.employment[last][2] == 0).id') ? 'user999@foo.com'; -- -- Test -- set enable_seqscan = OFF; EXPLAIN ANALYSE SELECT id,snapshot #>'{employee,999,state,employment}' FROM "payrun" WHERE jsonb_path_query_array(snapshot, '$.employee.* ? (@.pay_graph <> 0 || @.last_run_of_employment == true || @.state.employment[last][2] == 0).id') ? 'user999@foo.com'; ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ------------- Bitmap Heap Scan on payrun (cost=8.00..12.02 rows=1 width=36) (actual time=0.022..0.023 rows=1 loops=1) Recheck Cond: (jsonb_path_query_array(snapshot, '$."employee".*?((@."pay_graph" != 0 || @."last_run_of_employment" == true) || @."state"."employment"[last][2] == 0)."id"'::jsonpath, '{}'::jsonb, false) ? 'user999@foo.com '::text) Heap Blocks: exact=1 -> Bitmap Index Scan on idx1 (cost=0.00..8.00 rows=1 width=0) (actual time=0.007..0.007 rows=1 loops=1) Index Cond: (jsonb_path_query_array(snapshot, '$."employee".*?((@."pay_graph" != 0 || @."last_run_of_employment" == true) || @."state"."employment"[last][2] == 0)."id"'::jsonpath, '{}'::jsonb, false) ? 'user999@foo .com'::text) Planning Time: 0.062 ms Execution Time: 0.041 ms (7 rows) On Mon, 13 Jun 2022 at 03:23, Jeff Janes <jeff.janes@gmail.com> wrote: > > On Sun, Jun 12, 2022 at 3:31 PM Shaheed Haque <shaheedhaque@gmail.com> wrote: >> >> (Resend, wrong version was sent before) >> >> Thanks Erik. Is the point that the index has to be on the JSON field >> as a whole (i.e. "snapshot") rather than deep inside it (e.g. >> "snapshot.something.further[down]")? >> >> In my case, the snapshot is several MB in size (perhaps 10MB or even >> 20MB), dominated by the snapshot.employee (cardinality 10k, each sized >> as a dict 1-2kB as text). My expectation/guess is that an index of >> "snapshot" will itself be of a size of similar order. However the >> design as-is works very well except for this one case where to speed >> it up, in principle, the index need contain no more than one boolean >> per employee. So that's what I'd like to achieve, if possible. > > > It sounds like what you really want here is to extract just the list of the ids which meet one of your three further criteria,and index that list. > > You can do that with jsonpath, but you have to apply it with a function, not one of the boolean-returning operators. > > This almost works to do that: > > create index on payrun using gin (jsonb_path_query_array(snapshot, '$.employee.* ? (@.pay_graph <> 0 || @.last_run_of_employment== true || @.state.employment[last][2] == 0).id')); > > Then query it like: > > select * from payrun where jsonb_path_query_array(snapshot, '$.employee.* ? (@.pay_graph <> 0 || @.last_run_of_employment== true || @.state.employment[last][2] == 0).id') ? '999'; > > The problem is that ? only tests for top-level text values, while in your example the value for the 'id' keys are ints,not text. So they are not findable with the ? operator. If I edit your example data to wrap the ids' values in doublequotes, turning them into json strings rather than json ints, then this does work for me. > > Maybe there is a way to modify the jsonpath so that it converts the ints to text for you. But if there is such a way,I don't know what it is. > > If I were doing this for my own system, I would probably create an immutable user-defined-function which took a jsonb andreturned an int[] of the filtered id values. Then you would have to query it with @> rather than ?. > >> >> - I've not done things right, in which case I'd love to know my mistake. >> - It is not supposed to work, in which case it would be good to have >> that stated, and maybe have PG not allow useless indices to be >> created. >> - It is a bug. > > > I guess it is number one mixed with number two. The index you created is useless for your intended purpose, but is notuseless for every conceivable purpose. It is not realistic to expect PostgreSQL to reject things just because it is notobvious (to a computer) what you are getting up to. > > Cheers, > > Jeff