Thread: [GENERAL] jsonb case insensitive search
Hi
The goal would be to be able to search case-insensitively by any key:value pair, combined with some other columns as outlined below; the initial case would be Company:CompuTestSystems.
A sample would be
SELECT *
FROM cfg_files_data
WHERE cfg_files_data.show_id = 32
AND cfg_files_data.file_id = 123
AND lower((file_data_record ->> 'Company')) = lower('CompuTestSystems');
- Table definition
Table "csischema.cfg_files_data"
Column | Type | Modifiers
--------------------+--------------------------+-----------------------------------------------------------------------
file_data_id | bigint | not null default nextval('cfg_files_data_file_data_id_seq'::regclass)
file_id | bigint | not null
show_id | bigint | not null
file_data_record | jsonb | not null
additional_info | jsonb |
file_data_add_by | character varying(100) | not null
file_data_add_date | timestamp with time zone | not null default now()
Indexes:
"cfg_files_data_pkey" PRIMARY KEY, btree (file_data_id, show_id)
- show_id, file_id data distribution
select show_id, file_id, count(*) from cfg_files_data group by show_id, file_id order by 1;
show_id | file_id | count
---------+---------+-------
4 | 9 | 3
4 | 68 | 22
4 | 2 | 6
6 | 3 | 13
6 | 5 | 215
13 | 13 | 13
13 | 8 | 22
21 | 11 | 13
21 | 10 | 22
26 | 12 | 13
30 | 16 | 6
32 | 123 | 53731
35 | 125 | 5
40 | 122 | 53731
46 | 69 | 4
46 | 64 | 4
46 | 67 | 4
46 | 70 | 4
For the example mentioned
SELECT *
FROM cfg_files_data
WHERE cfg_files_data.show_id = 32
AND cfg_files_data.file_id = 123
AND lower((file_data_record ->> 'Company')) = lower('CompuTestSystems');
create index cfg_files_data_record_idx on cfg_files_data (show_id, file_id, lower(file_data_record::text));
SELECT *
FROM cfg_files_data
WHERE cfg_files_data.show_id = 35
AND cfg_files_data.file_id = 125
AND lower((file_data_record ->> 'Company')) = lower('CompuTestSystems');
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using cfg_files_data_record_idx on cfg_files_data (cost=0.54..8.57 rows=1 width=359) (actual time=0.287..0.305 rows=5 loops=1)
Index Cond: ((show_id = 35) AND (file_id = 125))
Filter: (lower((file_data_record ->> 'Company'::text)) = 'computestsystems'::text)
Planning time: 0.271 ms
Execution time: 0.370 ms
(5 rows)
SELECT *
FROM cfg_files_data
WHERE cfg_files_data.show_id = 32
AND cfg_files_data.file_id = 123
AND lower((file_data_record ->> 'Company')) = lower('CompuTestSystems');
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Seq Scan on cfg_files_data (cost=0.00..7276.20 rows=134 width=359) (actual time=194.817..194.848 rows=2 loops=1)
Filter: ((show_id = 32) AND (file_id = 123) AND (lower((file_data_record ->> 'Company'::text)) = 'computestsystems'::text))
Rows Removed by Filter: 107829
Planning time: 1.006 ms
Execution time: 194.905 ms
Shall I understand that, prior to applying the lower((file_data_record ->> 'Company'::text)) = 'computestsystems'::text filter, it goes by show_id and file_id, and for (32, 123), since that pair is way above 5% of the table, it ends up in a seq scan? Although the end result is just 2 rows.
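One way to sanity-check the selectivity estimates behind that choice is to look at the statistics the planner keeps for these columns (a sketch, not from the thread; pg_stats is a standard system view):

```sql
-- Inspect the per-column statistics the planner uses for show_id/file_id:
SELECT attname, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'cfg_files_data'
  AND attname IN ('show_id', 'file_id');
```

If (32, 123) shows up among the most common values with a high frequency, the planner will estimate many matching rows and prefer a seq scan regardless of how selective the jsonb filter turns out to be.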
SET enable_seqscan = OFF;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on cfg_files_data (cost=2485.18..8544.40 rows=134 width=359) (actual time=173.314..173.317 rows=2 loops=1)
Recheck Cond: (show_id = 32)
Filter: ((file_id = 123) AND (lower((file_data_record ->> 'Company'::text)) = 'computestsystems'::text))
Rows Removed by Filter: 53729
Heap Blocks: exact=2437
-> Bitmap Index Scan on cfg_files_data_pkey (cost=0.00..2485.15 rows=53743 width=0) (actual time=12.195..12.195 rows=53731 loops=1)
Index Cond: (show_id = 32)
Planning time: 0.232 ms
Execution time: 173.392 ms
(9 rows)
And that got me to look at
SELECT *
FROM cfg_files_data
WHERE
lower((file_data_record ->> 'Company')) = lower('CompuTestSystems');
drop index cfg_files_data_record_idx ;
create index cfg_files_data_record_idx on cfg_files_data (lower(file_data_record::text));
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Seq Scan on cfg_files_data (cost=0.00..6737.04 rows=539 width=359) (actual time=194.382..194.468 rows=8 loops=1)
Filter: (lower((file_data_record ->> 'Company'::text)) = 'computestsystems'::text)
Rows Removed by Filter: 107823
Planning time: 0.421 ms
Execution time: 194.539 ms
Well, the end result is 8 rows out of a total of 100k+.
Not sure why the index is ignored.
But is it possible to go for a broader search, aka being able to search by any key:value pair, efficiently and case-insensitively? What am I missing in this picture?
Thanks for help
Armand
> For the example mentioned
> SELECT *
> FROM cfg_files_data
> WHERE cfg_files_data.show_id = 32
> AND cfg_files_data.file_id = 123
> AND lower((file_data_record ->> 'Company')) = lower('CompuTestSystems');
>
> create index cfg_files_data_record_idx on cfg_files_data (show_id, file_id, lower(file_data_record::text));
>
> Not sure why the index is ignored
Because "lower((file_data_record ->> 'Company'))" is not the same as "lower(file_data_record::text)"
> But is it possible to go for a broader search, aka being able to search by any key:value pair, efficiently and case-insensitively? What am I missing in this picture?
Use a trigger to maintain an all lower case copy of the json file_data_record and use the copy for predicates while using the original for select-list outputs.
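A minimal sketch of that trigger approach, assuming the documents are simple enough that lower-casing the whole text representation is safe (the column, function, and trigger names below are illustrative, not from the thread):

```sql
-- Assumed: a shadow column holding a fully lower-cased copy of the document.
ALTER TABLE cfg_files_data ADD COLUMN file_data_record_lc jsonb;

CREATE OR REPLACE FUNCTION cfg_files_data_lc_sync() RETURNS trigger AS $$
BEGIN
  -- Down-convert all keys and values by round-tripping through text.
  NEW.file_data_record_lc := lower(NEW.file_data_record::text)::jsonb;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER cfg_files_data_lc_trg
  BEFORE INSERT OR UPDATE ON cfg_files_data
  FOR EACH ROW EXECUTE PROCEDURE cfg_files_data_lc_sync();

-- Queries then filter on the copy but return the original:
-- SELECT file_data_record FROM cfg_files_data
-- WHERE file_data_record_lc ->> 'company' = 'computestsystems';
```

Note that keys must be queried in lower case too, and keys differing only by case would collapse into one entry in the copy.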
David J.
On May 31, 2017, at 2:32 PM, David G. Johnston <david.g.johnston@gmail.com> wrote:

>> For the example mentioned
>> SELECT *
>> FROM cfg_files_data
>> WHERE cfg_files_data.show_id = 32
>> AND cfg_files_data.file_id = 123
>> AND lower((file_data_record ->> 'Company')) = lower('CompuTestSystems');
>>
>> create index cfg_files_data_record_idx on cfg_files_data (show_id, file_id, lower(file_data_record::text));
>>
>> Not sure why the index is ignored
>
> Because "lower((file_data_record ->> 'Company'))" is not the same as "lower(file_data_record::text)"
I see, missed it
Because this works:
create index fooidx on cfg_files_data (show_id, file_id, lower(file_data_record ->> 'Company'));
The idea is that I would like to avoid having an index for each key I might possibly want to search on, but rather, say, index the whole record and then search by key to get the key's value, thus having one index serving multiple purposes, so to speak.
I looked at JSQuery, but it does not seem that I can have a composite index like fooidx?
> But is it possible to go for a broader search, aka being able to search by any key:value pair, efficiently and case-insensitively? What am I missing in this picture?
>
> Use a trigger to maintain an all lower case copy of the json file_data_record and use the copy for predicates while using the original for select-list outputs.
>
> David J.
Thanks
Armand
On May 31, armand pirvu modulated:

> The idea is that I would like to avoid having an index for each key
> possibly wanted to search, but rather say have the whole record then
> search by the key and get the key value, thus having one index serving
> multiple purposes so to speak

First, benchmarking would be important to figure out if any proposed indexing actually speeds up the kinds of queries you want to perform. With the recently added parallel query features, a simpler indexing scheme with some brute-force search might be adequate?

But, you could use a search idiom like this:

    (lower(json_column::text)::json) -> lower('key') = 'value'::json

This will down-convert the case on all values and keys. The left-hand parenthetic expression could be precomputed in an expression index to avoid repeated case conversion. But, typical searches will still have to scan the whole index to perform the projection and match the final value tests on the right-hand side.

If you want to do things like substring matching on field values, you might stick with text and using regexp matches:

    (lower(json_column::text)) ~ 'valuepattern'

or more structural searches:

    (lower(json_column::text)) ~ '"key": "[^"]*substring[^"]*"'

Here, the left-hand expression could be trigram indexed to help with sparse, substring matching without a full index scan. We've had good luck using trigram indexing with regexp matching, though I've honestly never used it for the purpose sketched above...

Karl
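A sketch of the trigram idea Karl describes, assuming the pg_trgm contrib extension is available (the index name is illustrative):

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Trigram GIN index over the down-cased document text.
CREATE INDEX cfg_files_data_trgm_idx
  ON cfg_files_data
  USING gin (lower(file_data_record::text) gin_trgm_ops);

-- Regexp searches over the same expression can then use the index
-- for the structural idiom above, e.g.:
-- SELECT * FROM cfg_files_data
-- WHERE lower(file_data_record::text) ~ '"company": "[^"]*computestsystems[^"]*"';
```

The expression in the query must match the indexed expression exactly, so searches go through `lower(file_data_record::text)`, not through `->>`.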
Thank you Karl and David

Ideally, as far as I can tell, the index would need to be show_id, file_id, lower(…)

The question is if this is possible?

Thanks
Armand

> On Jun 1, 2017, at 12:24 PM, Karl Czajkowski <karlcz@isi.edu> wrote:
>
>> First, benchmarking would be important to figure out if any proposed
>> indexing actually speeds up the kinds of queries you want to perform. [...]
I apologize beforehand for replying again to my own reply; I know it is frowned upon. My inline comments below.

> On Jun 1, 2017, at 12:24 PM, Karl Czajkowski <karlcz@isi.edu> wrote:
>
>> First, benchmarking would be important to figure out if any proposed
>> indexing actually speeds up the kinds of queries you want to perform.
>> With the recently added parallel query features, a simpler indexing
>> scheme with some brute-force search might be adequate?

Not sure what you mean by benchmarking, but I think comparative times, aka 2 seconds vs a couple of milliseconds, is quite a difference. A table scan, while okay in certain cases, will become a problem where there is heavy usage on the same part/area.

>> If you want to do things like substring matching on field values, you
>> might stick with text and using regexp matches:
>>
>> (lower(json_column::text)) ~ 'valuepattern'

In this case a regular index will be ignored, even though IMO it should scan the index and get the needed information. The criteria I am after gets back 9 rows max out of 100k+ records, so I'd say the restriction is darn good. Wouldn't that be a case for the optimizer to pick the path of least resistance, aka the best restriction? Granted, it uses a lower function, and the search is on the text column, which is third in the index, so it is not really starting from the left. But the index starts with show_id, file_id, and those are always part of the key. I can see that once show_id, file_id is NOT a good restriction anymore, the last column will make the difference. Either way, will that not translate into an index scan? Or, for the index to be considered in this case, does even the last-column search have to follow left-to-right, aka no in-between search?

>> or more structural searches:
>>
>> (lower(json_column::text)) ~ '"key": "[^"]*substring[^"]*"'
>>
>> Here, the left-hand expression could be trigram indexed to help with
>> sparse, substring matching without a full index scan.

Seems to me trigram could be the answer, since I got some decent results once I applied it; more to dig.

Overall, could it be that the optimizer blatantly ignores an index scan that is cheaper than a table scan, or does the jsonb implementation still have a long way to come, or is the way I am using it not the one it was designed for?

thanks
Armand
On Jun 01, armand pirvu modulated:

> Overall, could it be that the optimizer blatantly ignores an index scan that is cheaper than a table scan, or does the jsonb implementation still have a long way to come, or is the way I am using it not the one it was designed for?

If I remember correctly, isn't a compound index always just using btree? In general, I have found better luck using several smaller btree indices than one large compound one. Unless your entire query can be answered from an index-only lookup, the extra columns just bloat the btree index. So, you might as well use a simpler compound index for the regular scalar row keys, and this index will be much smaller without the baggage of the jsonb values at its leaves. The planner can use the jsonb from the actual candidate rows if it is going to have to visit them anyway for other WHERE or SELECT clauses.

If the sparseness of your query is due to the content within the jsonb values rather than the other scalar row keys, I think you'd need some kind of GIN index over the contents of the jsonb documents to find the small subset of candidate rows by these sparse criteria. Trigram is just one example of a GIN indexing scheme.

If your jsonb documents are "flat", i.e. just a bag of key value pairs and not arbitrary nested jsonb structures, you might also explode them into arrays of keys or values as separate indexed expressions? Then, you could GIN index the arrays and quickly find the subset of rows with certain unusual keys or unusual values, but would still have to follow up with a more exact check for the combination of key and value.

Karl
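A sketch of that exploded-arrays idea, assuming flat documents (the helper function and index names below are mine, not from the thread):

```sql
-- Illustrative helper: collect the lower-cased values of a flat jsonb object
-- into a text[] that can be GIN-indexed. Must be IMMUTABLE to be indexable.
CREATE OR REPLACE FUNCTION jsonb_lc_values(doc jsonb) RETURNS text[] AS $$
  SELECT array_agg(lower(value))
  FROM jsonb_each_text(doc);
$$ LANGUAGE sql IMMUTABLE;

CREATE INDEX cfg_files_data_vals_idx
  ON cfg_files_data USING gin (jsonb_lc_values(file_data_record));

-- Array containment (@>) can use the GIN index to find candidate rows,
-- followed by the exact recheck of the key/value combination:
-- SELECT * FROM cfg_files_data
-- WHERE jsonb_lc_values(file_data_record) @> ARRAY['computestsystems']
--   AND lower(file_data_record ->> 'Company') = lower('CompuTestSystems');
```

An analogous `jsonb_lc_keys` function over `jsonb_object_keys` would cover searches for unusual keys.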
On Jun 1, 2017, at 13:44 , Karl Czajkowski <karlcz@isi.edu> wrote:
If I remember correctly, isn't a compound index always just using btree? In general, I have found better luck using several smaller
btree indices than one large compound one. Unless your entire query
can be answered from an index-only lookup, the extra columns just
bloat the btree index.
The issue is *much* more subtle than this. For repetitive queries, a compound index can be a great speedup.
Best discussion I’ve seen around index design is on the website use-the-index-luke.com.