Discussion: [GENERAL] jsonb case insensitive search


[GENERAL] jsonb case insensitive search

From: armand pirvu
Date:


Hi

The goal would be to be able to search case insensitive by any key:value combined with some other columns like outlined below, but initially would be Company:CompuTestSystems

A sample would be 

SELECT * 
FROM cfg_files_data 
WHERE cfg_files_data.show_id = 32 
AND cfg_files_data.file_id = 123
AND lower((file_data_record ->> 'Company')) = lower('CompuTestSystems');


- Table definition
                                           Table "csischema.cfg_files_data"
       Column       |           Type           |                               Modifiers                               
--------------------+--------------------------+-----------------------------------------------------------------------
 file_data_id       | bigint                   | not null default nextval('cfg_files_data_file_data_id_seq'::regclass)
 file_id            | bigint                   | not null
 show_id            | bigint                   | not null
 file_data_record   | jsonb                    | not null
 additional_info    | jsonb                    | 
 file_data_add_by   | character varying(100)   | not null
 file_data_add_date | timestamp with time zone | not null default now()
Indexes:
    "cfg_files_data_pkey" PRIMARY KEY, btree (file_data_id, show_id)


- show_id, file_id data distribution

select show_id, file_id, count(*) from cfg_files_data group by show_id, file_id order by 1;
 show_id | file_id | count 
---------+---------+-------
       4 |       9 |     3
       4 |      68 |    22
       4 |       2 |     6
       6 |       3 |    13
       6 |       5 |   215
      13 |      13 |    13
      13 |       8 |    22
      21 |      11 |    13
      21 |      10 |    22
      26 |      12 |    13
      30 |      16 |     6
      32 |     123 | 53731
      35 |     125 |     5
      40 |     122 | 53731
      46 |      69 |     4
      46 |      64 |     4
      46 |      67 |     4
      46 |      70 |     4

For the example mentioned

SELECT * 
FROM cfg_files_data 
WHERE cfg_files_data.show_id = 32 
AND cfg_files_data.file_id = 123
AND lower((file_data_record ->> 'Company')) = lower('CompuTestSystems');


create index cfg_files_data_record_idx on cfg_files_data (show_id, file_id, lower(file_data_record::text));

SELECT * 
FROM cfg_files_data 
WHERE cfg_files_data.show_id = 35 
AND cfg_files_data.file_id = 125
AND lower((file_data_record ->> 'Company')) = lower('CompuTestSystems');


                                                                 QUERY PLAN                                                                 
--------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using cfg_files_data_record_idx on cfg_files_data  (cost=0.54..8.57 rows=1 width=359) (actual time=0.287..0.305 rows=5 loops=1)
   Index Cond: ((show_id = 35) AND (file_id = 125))
   Filter: (lower((file_data_record ->> 'Company'::text)) = 'computestsystems'::text)
 Planning time: 0.271 ms
 Execution time: 0.370 ms
(5 rows)


SELECT * 
FROM cfg_files_data 
WHERE cfg_files_data.show_id = 32 
AND cfg_files_data.file_id = 123
AND lower((file_data_record ->> 'Company')) = lower('CompuTestSystems');

                                                          QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on cfg_files_data  (cost=0.00..7276.20 rows=134 width=359) (actual time=194.817..194.848 rows=2 loops=1)
   Filter: ((show_id = 32) AND (file_id = 123) AND (lower((file_data_record ->> 'Company'::text)) = 'computestsystems'::text))
   Rows Removed by Filter: 107829
 Planning time: 1.006 ms
 Execution time: 194.905 ms

Shall I understand that, prior to applying the lower((file_data_record ->> 'Company'::text)) = 'computestsystems'::text filter, it goes by show_id and file_id, and for (32, 123), since that is way above 5% of the table, it falls back to a seq scan, although the end result is just 2 rows?


SET enable_seqscan = OFF;
                                                               QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on cfg_files_data  (cost=2485.18..8544.40 rows=134 width=359) (actual time=173.314..173.317 rows=2 loops=1)
   Recheck Cond: (show_id = 32)
   Filter: ((file_id = 123) AND (lower((file_data_record ->> 'Company'::text)) = 'computestsystems'::text))
   Rows Removed by Filter: 53729
   Heap Blocks: exact=2437
   ->  Bitmap Index Scan on cfg_files_data_pkey  (cost=0.00..2485.15 rows=53743 width=0) (actual time=12.195..12.195 rows=53731 loops=1)
         Index Cond: (show_id = 32)
 Planning time: 0.232 ms
 Execution time: 173.392 ms
(9 rows)

And that got me to look at 

SELECT * 
FROM cfg_files_data 
WHERE 
lower((file_data_record ->> 'Company')) = lower('CompuTestSystems');


drop index cfg_files_data_record_idx ;
create index cfg_files_data_record_idx on cfg_files_data (lower(file_data_record::text));
                                                    QUERY PLAN                                                     
-------------------------------------------------------------------------------------------------------------------
 Seq Scan on cfg_files_data  (cost=0.00..6737.04 rows=539 width=359) (actual time=194.382..194.468 rows=8 loops=1)
   Filter: (lower((file_data_record ->> 'Company'::text)) = 'computestsystems'::text)
   Rows Removed by Filter: 107823
 Planning time: 0.421 ms
 Execution time: 194.539 ms


Well, the end result is 8 rows from a total of 100k+.

Not sure why the index is ignored.

But is it possible to go for a broader search, i.e. being able to search by any key:value pair, efficiently and case-insensitively? What am I missing in this picture?



Thanks for help

Armand



Re: [GENERAL] jsonb case insensitive search

From: "David G. Johnston"
Date:
On Wed, May 31, 2017 at 12:18 PM, armand pirvu <armand.pirvu@gmail.com> wrote:

For the example mentioned

SELECT * 
FROM cfg_files_data 
WHERE cfg_files_data.show_id = 32 
AND cfg_files_data.file_id = 123
AND lower((file_data_record ->> 'Company')) = lower('CompuTestSystems');


create index cfg_files_data_record_idx on cfg_files_data (show_id, file_id, lower(file_data_record::text));


Not sure why the index is ignored

Because "lower((file_data_record ->> 'Company'))" is not the same as "lower(file_data_record::text)"


But is it possible to go for a broader search, i.e. being able to search by any key:value pair, efficiently and case-insensitively? What am I missing in this picture?


Use a trigger to maintain an all lower case copy of the json file_data_record and use the copy for predicates while using the original ​for select-list outputs.
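A hypothetical sketch of this approach (the shadow column, function, and trigger names here are made up for illustration):

```sql
-- Keep a lowercased shadow copy of the jsonb document; lower() on the
-- text form folds both keys and values, so queries use lowercase keys.
ALTER TABLE cfg_files_data ADD COLUMN file_data_record_lc jsonb;

CREATE FUNCTION cfg_files_data_lc_trg() RETURNS trigger AS $$
BEGIN
    NEW.file_data_record_lc := lower(NEW.file_data_record::text)::jsonb;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER cfg_files_data_lc
    BEFORE INSERT OR UPDATE ON cfg_files_data
    FOR EACH ROW EXECUTE PROCEDURE cfg_files_data_lc_trg();

-- Filter on the lowercased copy, return the original document:
SELECT file_data_record
FROM cfg_files_data
WHERE file_data_record_lc ->> 'company' = 'computestsystems';
```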

David J.

Re: [GENERAL] jsonb case insensitive search

From: armand pirvu
Date:

On May 31, 2017, at 2:32 PM, David G. Johnston <david.g.johnston@gmail.com> wrote:

On Wed, May 31, 2017 at 12:18 PM, armand pirvu <armand.pirvu@gmail.com> wrote:

For the example mentioned

SELECT * 
FROM cfg_files_data 
WHERE cfg_files_data.show_id = 32 
AND cfg_files_data.file_id = 123
AND lower((file_data_record ->> 'Company')) = lower('CompuTestSystems');


create index cfg_files_data_record_idx on cfg_files_data (show_id, file_id, lower(file_data_record::text));


Not sure why the index is ignored

Because "lower((file_data_record ->> 'Company'))" is not the same as "lower(file_data_record::text)"


I see, I missed it.

Because this works:
create index fooidx on cfg_files_data (show_id, file_id, lower(file_data_record ->> 'Company'));

The idea is that I would like to avoid having an index for each key I might possibly want to search. Rather, I'd index the whole record, then search by key and get the key's value, thus having one index serving multiple purposes, so to speak.
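A sketch of the one-index-serves-any-key idea using the built-in jsonb containment operator (note this is case-sensitive as written, so it would need to be combined with something like the lowercased-copy suggestion for case-insensitive matching):

```sql
-- A single GIN index supports @> containment probes on any key/value
-- pair, at the cost of being case-sensitive on both keys and values.
CREATE INDEX cfg_files_data_gin_idx
    ON cfg_files_data USING gin (file_data_record jsonb_path_ops);

SELECT *
FROM cfg_files_data
WHERE show_id = 32
  AND file_id = 123
  AND file_data_record @> '{"Company": "CompuTestSystems"}';
```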


I looked at JSQuery, but it does not seem that I can have a composite index like fooidx.




But is it possible to go for a broader search, i.e. being able to search by any key:value pair, efficiently and case-insensitively? What am I missing in this picture?


Use a trigger to maintain an all lower case copy of the json file_data_record and use the copy for predicates while using the original ​for select-list outputs.

David J.


Uhh, that would be, let's just say, less than optimal. But maybe JSON itself is not intended to be used this way? Or is the functionality just not there yet?


Thanks
Armand



Re: [GENERAL] jsonb case insensitive search

From: Karl Czajkowski
Date:
On May 31, armand pirvu modulated:

> The idea is that I would like to avoid having an index for each key
> possibly wanted to search, but rather say have the whole record then
> search by the key and get the key value, thus having one index serving
> multiple purposes so to speak
>

First, benchmarking would be important to figure out if any proposed
indexing actually speeds up the kinds of queries you want to perform.
With the recently added parallel query features, a simpler indexing
scheme with some brute-force search might be adequate?

But, you could use a search idiom like this:

     (lower(json_column::text)::json) -> lower('key') = 'value'::json

This will down-convert the case on all values and keys.  The left-hand
parenthetic expression could be precomputed in an expression index to
avoid repeated case conversion. But, typical searches will still have
to scan the whole index to perform the projection and match the final
value tests on the right-hand side.
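One wrinkle worth noting editorially: the json type has no equality operator, so in practice both sides of the comparison need to be jsonb. A sketch of the precomputed expression index under that assumption (index name is illustrative):

```sql
-- Precompute the lowercased document once in an expression index;
-- jsonb (rather than json) is used because json lacks equality.
CREATE INDEX cfg_files_data_lc_idx
    ON cfg_files_data ((lower(file_data_record::text)::jsonb));

SELECT *
FROM cfg_files_data
WHERE lower(file_data_record::text)::jsonb -> 'company'
      = '"computestsystems"'::jsonb;
```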

If you want to do things like substring matching on field values, you
might stick with text and using regexp matches:

     (lower(json_column::text)) ~ 'valuepattern'

or more structural searches:

     (lower(json_column::text)) ~ '"key": "[^"]*substring[^"]*"'

Here, the left-hand expression could be trigram indexed to help with
sparse, substring matching without a full index scan.  We've had good
luck using trigram indexing with regexp matching, though I've honestly
never used it for the purpose sketched above...
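A sketch of the trigram setup (assumes the pg_trgm extension is available; index name and pattern are illustrative):

```sql
-- GIN trigram index over the lowercased text form; it can accelerate
-- LIKE and regexp (~) matches without a full table scan.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX cfg_files_data_trgm_idx
    ON cfg_files_data USING gin (lower(file_data_record::text) gin_trgm_ops);

SELECT *
FROM cfg_files_data
WHERE lower(file_data_record::text) ~ '"company": "[^"]*test[^"]*"';
```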

Karl


Re: [GENERAL] jsonb case insensitive search

From: armand pirvu
Date:
Thank you Karl and David

Ideally as far as I can tell the index would need to be show_id, file_id, lower(…)


The question is whether this is possible.


Thanks
Armand


> On Jun 1, 2017, at 12:24 PM, Karl Czajkowski <karlcz@isi.edu> wrote:
>
> On May 31, armand pirvu modulated:
>
>> The idea is that I would like to avoid having an index for each key
>> possibly wanted to search, but rather say have the whole record then
>> search by the key and get the key value, thus having one index serving
>> multiple purposes so to speak
>>
>
> First, benchmarking would be important to figure out if any proposed
> indexing actually speeds up the kinds of queries you want to perform.
> With the recently added parallel query features, a simpler indexing
> scheme with some brute-force search might be adequate?
>
> But, you could use a search idiom like this:
>
>     (lower(json_column::text)::json) -> lower('key') = 'value'::json
>
> This will down-convert the case on all values and keys.  The left-hand
> parenthetic expression could be precomputed in an expression index to
> avoid repeated case conversion. But, typical searches will still have
> to scan the whole index to perform the projection and match the final
> value tests on the right-hand side.
>
> If you want to do things like substring matching on field values, you
> might stick with text and using regexp matches:
>
>     (lower(json_column::text)) ~ 'valuepattern'
>
> or more structural searches:
>
>     (lower(json_column::text)) ~ '"key": "[^"]*substring[^"]*"'
>
> Here, the left-hand expression could be trigram indexed to help with
> sparse, substring matching without a full index scan.  We've had good
> luck using trigram indexing with regexp matching, though I've honestly
> never used it for the purpose sketched above...
>
> Karl



Re: [GENERAL] jsonb case insensitive search

From: armand pirvu
Date:
I apologize beforehand for replying again to my own reply; I know it is frowned upon. My comments are inline.

> On Jun 1, 2017, at 2:05 PM, armand pirvu <armand.pirvu@gmail.com> wrote:
>
> Thank you Karl and David
>
> Ideally as far as I can tell the index would need to be show_id, file_id, lower(…)
>
>
> The question is if this is  possible ?
>
>
> Thanks
> Armand
>
>
>> On Jun 1, 2017, at 12:24 PM, Karl Czajkowski <karlcz@isi.edu> wrote:
>>
>> On May 31, armand pirvu modulated:
>>
>>> The idea is that I would like to avoid having an index for each key
>>> possibly wanted to search, but rather say have the whole record then
>>> search by the key and get the key value, thus having one index serving
>>> multiple purposes so to speak
>>>
>>
>> First, benchmarking would be important to figure out if any proposed
>> indexing actually speeds up the kinds of queries you want to perform.
>> With the recently added parallel query features, a simpler indexing
>> scheme with some brute-force search might be adequate?
>>

Not sure what you mean by benchmarking, but I think comparative times, i.e. 2 seconds vs. a couple of milliseconds, are quite a difference. A table scan is okay in certain cases, but when there is heavy usage on the same part/area it will become a problem.


>> But, you could use a search idiom like this:
>>
>>    (lower(json_column::text)::json) -> lower('key') = 'value'::json
>>
>> This will down-convert the case on all values and keys.  The left-hand
>> parenthetic expression could be precomputed in an expression index to
>> avoid repeated case conversion. But, typical searches will still have
>> to scan the whole index to perform the projection and match the final
>> value tests on the right-hand side.
>>
>> If you want to do things like substring matching on field values, you
>> might stick with text and using regexp matches:
>>
>>    (lower(json_column::text)) ~ 'valuepattern'

In this case a regular index will be ignored, even though IMO it should scan the index and get the needed information. The criteria I am after return 9 rows max out of 100k+ records, so I'd say the restriction is darn good. Wouldn't that be the case for the optimizer to pick the path of least resistance, i.e. the best restriction? Granted, it uses a lower() function, and the search on the text column, which is third in the index, does not really start from the left. But the index starts with show_id, file_id, and those are always part of the key. I can see, though, that once show_id, file_id is no longer a good restriction, the last column will make the difference. Either way, shouldn't that translate into an index scan? Or, for the index to be considered, does even the last-column search have to match left to right, i.e. no in-between search?


>>
>> or more structural searches:
>>
>>    (lower(json_column::text)) ~ '"key": "[^"]*substring[^"]*"'
>>
>> Here, the left-hand expression could be trigram indexed to help with
>> sparse, substring matching without a full index scan.  We've had good
>> luck using trigram indexing with regexp matching, though I've honestly
>> never used it for the purpose sketched above...
>>
>> Karl
>


Seems to me trigram could be the answer, since I got some decent results once I applied it; more to dig into.

Overall, could it be that the optimizer blatantly ignores an index scan which is cheaper than a table scan? Or does the jsonb implementation still have a long way to go, or is the way I use it not what it was designed for?


thanks
Armand






Re: [GENERAL] jsonb case insensitive search

From: Karl Czajkowski
Date:
On Jun 01, armand pirvu modulated:

> Overall could it be that the optimizer blatantly ignores an index scan which is cheaper than a table scan, or the jsonb implementation still has a long way to go, or the way it is used in my case is not the one it was designed for?
>

If I remember correctly, isn't a compound index always just using
btree?  In general, I have found better luck using several smaller
btree indices than one large compound one.  Unless your entire query
can be answered from an index-only lookup, the extra columns just
bloat the btree index.

So, you might as well use a simpler compound index for the regular
scalar row keys, and this index will be much smaller without the
baggage of the jsonb values at its leaves.  The planner can use the
jsonb from the actual candidate rows if it is going to have to visit
them anyway for other WHERE or SELECT clauses.

If the sparseness of your query is due to the content within the jsonb
values rather than the other scalar row keys, I think you'd need some
kind of GIN index over the contents of the jsonb documents to find the
small subset of candidate rows by these sparse criteria.  Trigram is
just one example of a GIN indexing scheme.

If your jsonb documents are "flat", i.e. just a bag of key value pairs
and not arbitrary nested jsonb structures, you might also explode them
into arrays of keys or values as separate indexed expressions?  Then,
you could GIN index the arrays and quickly find the subset of rows with
certain unusual keys or unusual values, but would still have to follow
up with a more exact check for the combination of key and value.
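A hypothetical sketch of that last idea for flat documents (the helper function name is made up; it must be declared IMMUTABLE to be usable in an index expression):

```sql
-- Collect the lowercased top-level keys into a text[] so a GIN array
-- index can find rows containing a given key quickly.
CREATE FUNCTION jsonb_lower_keys(doc jsonb) RETURNS text[] AS $$
    SELECT array_agg(lower(k)) FROM jsonb_object_keys(doc) AS t(k);
$$ LANGUAGE sql IMMUTABLE;

CREATE INDEX cfg_files_data_keys_idx
    ON cfg_files_data USING gin (jsonb_lower_keys(file_data_record));

-- Narrow by key via the GIN index, then recheck the exact value:
SELECT *
FROM cfg_files_data
WHERE jsonb_lower_keys(file_data_record) @> ARRAY['company']
  AND lower(file_data_record ->> 'Company') = lower('CompuTestSystems');
```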


Karl



Re: [GENERAL] jsonb case insensitive search

From: Guyren Howe
Date:
On Jun 1, 2017, at 13:44, Karl Czajkowski <karlcz@isi.edu> wrote:
If I remember correctly, isn't a compound index always just using
btree?  In general, I have found better luck using several smaller
btree indices than one large compound one.  Unless your entire query
can be answered from an index-only lookup, the extra columns just
bloat the btree index.

The issue is *much* more subtle than this. For repetitive queries, a compound index can be a great speedup.

The best discussion I've seen of index design is on the website use-the-index-luke.com.