Обсуждение: Inlining of functions (doing LIKE on an array)
Hello,
I am trying to implement an efficient “like” over a text[]. I see a lot of people have tried before me and I learnt a lot through the forums. The results of my search is that a query like the following is optimal:
select count(*)
from claims
where (select count(*)
from unnest("ICD9_DGNS_CD") x_
where x_ like '427%'
) > 0
So I figured I’d create a Function to encapsulate the concept:
CREATE OR REPLACE FUNCTION ArrayLike(text[], text)
RETURNS bigint
AS 'select count(*) from unnest($1) a where a like $2'
LANGUAGE SQL STRICT IMMUTABLE LEAKPROOF
This works functionally, but performs like crap: full table scan, and cannot make use of any index it seems. Basically, it feels like PG can’t inline that function.
I have been trying all evening to find a way to rewrite it to trick the compiler/planner into inlining. I tried the operator approach for example, but performance is again not good.
create function rlike(text,text)
returns bool as 'select $2 like $1' language sql strict immutable;
create operator ``` (procedure = rlike, leftarg = text,
rightarg = text, commutator = ```);
CREATE OR REPLACE FUNCTION MyLike(text[], text)
RETURNS boolean
AS 'select $2 ``` ANY($1)'
LANGUAGE SQL STRICT IMMUTABLE LEAKPROOF
And by not good, I mean that on my table of 2M+ rows, the “native” query takes 3s, while the function version takes 9s and the operator version takes (via the function, or through the operator directly), takes 15s.
Any ideas or pointers?
Thank you,
Laurent Hasson
> From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of ldh@laurent-hasson.com > Sent: Freitag, 11. November 2016 07:54 > To: pgsql-performance@postgresql.org > Subject: [PERFORM] Inlining of functions (doing LIKE on an array) > > Hello, > > I am trying to implement an efficient "like" over a text[]. I see a lot of people have tried before me and I learnt a lotthrough the forums. The results of my search is that a query like the following is optimal: > > select count(*) > from claims > where (select count(*) > from unnest("ICD9_DGNS_CD") x_ > where x_ like '427%' > ) > 0 > Hi, are you using GIN indexes? http://stackoverflow.com/questions/4058731/can-postgresql-index-array-columns moreover your query can still be optimized: => select count(*) from claims where exists (select * from unnest("ICD9_DGNS_CD") x_ where x_ like '427%' ) regards, Marc Mamin > So I figured I'd create a Function to encapsulate the concept: > > CREATE OR REPLACE FUNCTION ArrayLike(text[], text) > RETURNS bigint > AS 'select count(*) from unnest($1) a where a like $2' > LANGUAGE SQL STRICT IMMUTABLE LEAKPROOF > > This works functionally, but performs like crap: full table scan, and cannot make use of any index it seems. Basically,it feels like PG can't inline that function. > > I have been trying all evening to find a way to rewrite it to trick the compiler/planner into inlining. I tried the operatorapproach for example, but performance is again not good. > > create function rlike(text,text) > returns bool as 'select $2 like $1' language sql strict immutable; > create operator ``` (procedure = rlike, leftarg = text, > rightarg = text, commutator = ```); > CREATE OR REPLACE FUNCTION MyLike(text[], text) > RETURNS boolean > AS 'select $2 ``` ANY($1)' > LANGUAGE SQL STRICT IMMUTABLE LEAKPROOF > > And by not good, I mean that on my table of 2M+ rows, the "native" query takes 3s, while the function version takes 9sand the operator version takes (via the function, or through the operator directly), takes 15s. > > Any ideas or pointers? > > > Thank you, > Laurent Hasson
I tried "exists", but won't work in the Function, i.e., CREATE OR REPLACE FUNCTION ArrayLike(text[], text) RETURNS bigint AS 'exists (select * from unnest($1) a where a like $2)' LANGUAGE SQL STRICT IMMUTABLE LEAKPROOF It's as expected though. As for the GIN indices, I tried and it didn't make a difference, which I guess is expected as wellbecause of the Like operator. I don't expect regular indices to work on regular columns for Like operations, especially'%xxx' ones, so I didn't expect GIN indices to work either for Array columns with Like. Am I wrong? Finally, I think the issue is actually not what I originally thought (i.e., index usage, as per above). But the inliningstill is the culprit. Here is the plan for select count(*) from claims where (select count(*) from unnest("SECONDARY_ICD9_DGNS_CD") x_ where x_ like '427%' ) > 0 "Aggregate (cost=2633016.66..2633016.67 rows=1 width=0) (actual time=3761.888..3761.889 rows=1 loops=1)" " -> Seq Scan on claims (cost=0.00..2631359.33 rows=662931 width=0) (actual time=0.097..3757.314 rows=85632 loops=1)" " Filter: ((SubPlan 1) > 0)" " Rows Removed by Filter: 1851321" " SubPlan 1" " -> Aggregate (cost=1.25..1.26 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1936953)" " -> Function Scan on unnest a (cost=0.00..1.25 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=1936953)" " Filter: (a ~~ '427%'::text)" " Rows Removed by Filter: 2" "Planning time: 0.461 ms" "Execution time: 3762.272 ms" And when using the function: "Aggregate (cost=614390.75..614390.76 rows=1 width=0) (actual time=8169.416..8169.417 rows=1 loops=1)" " -> Seq Scan on claims (cost=0.00..612733.43 rows=662931 width=0) (actual time=0.163..8162.679 rows=85632 loops=1)" " Filter: (tilda."like"("SECONDARY_ICD9_DGNS_CD", '427%'::text) > 0)" " Rows Removed by Filter: 1851321" "Planning time: 0.166 ms" "Execution time: 8169.676 ms" There is something fundamental here it seems, but I am not so good at reading plans to understand the differences here. Thank you, Laurent Hasson -----Original Message----- From: Marc Mamin [mailto:M.Mamin@intershop.de] Sent: Friday, November 11, 2016 07:44 To: ldh@laurent-hasson.com; pgsql-performance@postgresql.org Subject: RE: Inlining of functions (doing LIKE on an array) > From: pgsql-performance-owner@postgresql.org > [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of > ldh@laurent-hasson.com > Sent: Freitag, 11. November 2016 07:54 > To: pgsql-performance@postgresql.org > Subject: [PERFORM] Inlining of functions (doing LIKE on an array) > > Hello, > > I am trying to implement an efficient "like" over a text[]. I see a lot of people have tried before me and I learnt a lotthrough the forums. The results of my search is that a query like the following is optimal: > > select count(*) > from claims > where (select count(*) > from unnest("ICD9_DGNS_CD") x_ > where x_ like '427%' > ) > 0 > Hi, are you using GIN indexes? http://stackoverflow.com/questions/4058731/can-postgresql-index-array-columns moreover your query can still be optimized: => select count(*) from claims where exists (select * from unnest("ICD9_DGNS_CD") x_ where x_ like '427%' ) regards, Marc Mamin > So I figured I'd create a Function to encapsulate the concept: > > CREATE OR REPLACE FUNCTION ArrayLike(text[], text) RETURNS bigint AS > 'select count(*) from unnest($1) a where a like $2' > LANGUAGE SQL STRICT IMMUTABLE LEAKPROOF > > This works functionally, but performs like crap: full table scan, and cannot make use of any index it seems. Basically,it feels like PG can't inline that function. > > I have been trying all evening to find a way to rewrite it to trick the compiler/planner into inlining. I tried the operatorapproach for example, but performance is again not good. > > create function rlike(text,text) > returns bool as 'select $2 like $1' language sql strict immutable; > create operator ``` (procedure = rlike, leftarg = text, > rightarg = text, commutator = ```); CREATE OR > REPLACE FUNCTION MyLike(text[], text) RETURNS boolean AS 'select $2 > ``` ANY($1)' > LANGUAGE SQL STRICT IMMUTABLE LEAKPROOF > > And by not good, I mean that on my table of 2M+ rows, the "native" query takes 3s, while the function version takes 9sand the operator version takes (via the function, or through the operator directly), takes 15s. > > Any ideas or pointers? > > > Thank you, > Laurent Hasson
"ldh@laurent-hasson.com" <ldh@laurent-hasson.com> writes: > I tried "exists", but won't work in the Function, i.e., > CREATE OR REPLACE FUNCTION ArrayLike(text[], text) RETURNS bigint > AS 'exists (select * from unnest($1) a where a like $2)' > LANGUAGE SQL STRICT IMMUTABLE LEAKPROOF Syntax and semantics problems. This would work: regression=# CREATE OR REPLACE FUNCTION ArrayLike(text[], text) RETURNS bool regression-# as 'select exists (select * from unnest($1) a where a like $2)' regression-# LANGUAGE SQL STRICT IMMUTABLE; CREATE FUNCTION regression=# create table tt (f1 text[]); CREATE TABLE regression=# explain select * from tt where ArrayLike(f1, 'foo'); QUERY PLAN ------------------------------------------------------- Seq Scan on tt (cost=0.00..363.60 rows=453 width=32) Filter: arraylike(f1, 'foo'::text) (2 rows) But we don't inline SQL functions containing sub-selects, so you're still stuck with the rather high overhead of a SQL function. A plpgsql function might be a bit faster: CREATE OR REPLACE FUNCTION ArrayLike(text[], text) RETURNS bool as 'begin return exists (select * from unnest($1) a where a like $2); end' LANGUAGE plpgSQL STRICT IMMUTABLE; BTW, I'd be pretty suspicious of marking this function leakproof, because the underlying LIKE operator isn't leakproof according to pg_proc. > It's as expected though. As for the GIN indices, I tried and it didn't make a difference, which I guess is expected aswell because of the Like operator. I don't expect regular indices to work on regular columns for Like operations, especially'%xxx' ones, so I didn't expect GIN indices to work either for Array columns with Like. Am I wrong? Plain GIN index, probably not. A pg_trgm index could help with LIKE searches, but I don't think we have a variant of that for array columns. Have you considered renormalizing the data so that you don't have arrays? regards, tom lane
Hello,
I am trying to implement an efficient “like” over a text[]. I see a lot of people have tried before me and I learnt a lot through the forums.
Thanks for the pointer on the "select exists" syntax Tom. Much appreciated. I couldn't figure it out! And as for normalizing,yes, thought about it, but the one-to-many relationship would make other scenarios we have more complex and slower.So I am juggling with trade-offs. So, here are my findings. I did 10 runs for each of the 4 options I have arrived at. The runs were pretty consistent, withina few 10th's of a second off each other, so little variability. Not 100% scientific, but good enough for my test. Ipicked here the last run I had with the plans for illustration. Take-aways: ----------------------- - The "select exists" (#3) approach is roughly 40% faster than "select count(*) > 0" (#1). - The SQL Function version (#3) Vs the plpgSQL function version (#2) of the same query performs better (~30%) - The inlined version (#4) is twice as fast (roughly) as the SQL version (#3). I wish there were a way to force inlining, or some other mechanism as the performance difference is large here. I'll be usingthe inlining approach when possible, but the SQL Function approach is simpler and will likely be more suitable for somedevelopers. Details: ----------------- 1- select count(*) > 0 as SQL =================================== CREATE OR REPLACE FUNCTION MyLike2(text[], text) RETURNS boolean AS 'select count(*) > 0 from unnest($1) a where a like $2' LANGUAGE SQL STRICT IMMUTABLE EXPLAIN ANALYZE select count(*) from cms.claims where MyLike2("code", '427%') --"Aggregate (cost=609418.77..609418.78 rows=1 width=0) (actual time=8464.372..8464.372 rows=1 loops=1)" --" -> Seq Scan on claims (cost=0.00..607761.44 rows=662931 width=0) (actual time=0.077..8457.963 rows=85632 loops=1)" --" Filter: MyLike2("code", '427%'::text)" --" Rows Removed by Filter: 1851321" --"Planning time: 0.131 ms" --"Execution time: 8464.407 ms" 2- select exists as plpgSQL =================================== CREATE OR REPLACE FUNCTION MyLike3(text[], text) RETURNS boolean AS 'begin return exists (select * from unnest($1) a where a like $2); end' LANGUAGE plpgSQL STRICT IMMUTABLE EXPLAIN ANALYZE select count(*) from cms.claims where MyLike3("code", '427%') --"Aggregate (cost=609418.77..609418.78 rows=1 width=0) (actual time=7708.945..7708.945 rows=1 loops=1)" --" -> Seq Scan on claims (cost=0.00..607761.44 rows=662931 width=0) (actual time=0.040..7700.528 rows=85632 loops=1)" --" Filter: MyLike3("code", '427%'::text)" --" Rows Removed by Filter: 1851321" --"Planning time: 0.076 ms" --"Execution time: 7708.975 ms" 3- select exists as SQL =================================== CREATE OR REPLACE FUNCTION MyLike(text[], text) RETURNS boolean AS 'select exists (select * from unnest($1) a where a like $2)' LANGUAGE SQL STRICT IMMUTABLE EXPLAIN ANALYZE select count(*) from cms.claims where MyLike("code", '427%') --"Aggregate (cost=609418.77..609418.78 rows=1 width=0) (actual time=5524.690..5524.690 rows=1 loops=1)" --" -> Seq Scan on claims (cost=0.00..607761.44 rows=662931 width=0) (actual time=0.064..5515.886 rows=85632 loops=1)" --" Filter: tilda."like"("code", '427%'::text)" --" Rows Removed by Filter: 1851321" --"Planning time: 0.097 ms" --"Execution time: 5524.718 ms" 4- select exists inlined =================================== EXPLAIN ANALYZE select count(*) from cms.claims where exists (select * from unnest("SECONDARY_ICD9_DGNS_CD") a where a like '427%') --"Aggregate (cost=2604013.42..2604013.43 rows=1 width=0) (actual time=2842.259..2842.259 rows=1 loops=1)" --" -> Seq Scan on claims (cost=0.00..2601527.42 rows=994397 width=0) (actual time=0.017..2837.122 rows=85632 loops=1)" --" Filter: (SubPlan 1)" --" Rows Removed by Filter: 1851321" --" SubPlan 1" --" -> Function Scan on unnest a (cost=0.00..1.25 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=1936953)" --" Filter: (a ~~ '427%'::text)" --" Rows Removed by Filter: 2" --"Planning time: 0.155 ms" --"Execution time: 2842.311 ms" Thank you, Laurent Hasson -----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Sent: Friday, November 11, 2016 11:46 To: ldh@laurent-hasson.com Cc: Marc Mamin <M.Mamin@intershop.de>; pgsql-performance@postgresql.org Subject: Re: [PERFORM] Inlining of functions (doing LIKE on an array) "ldh@laurent-hasson.com" <ldh@laurent-hasson.com> writes: > I tried "exists", but won't work in the Function, i.e., CREATE OR > REPLACE FUNCTION ArrayLike(text[], text) RETURNS bigint > AS 'exists (select * from unnest($1) a where a like $2)' > LANGUAGE SQL STRICT IMMUTABLE LEAKPROOF Syntax and semantics problems. This would work: regression=# CREATE OR REPLACE FUNCTION ArrayLike(text[], text) RETURNS bool regression-# as 'select exists (select * fromunnest($1) a where a like $2)' regression-# LANGUAGE SQL STRICT IMMUTABLE; CREATE FUNCTION regression=# create table tt (f1 text[]); CREATE TABLE regression=#explain select * from tt where ArrayLike(f1, 'foo'); QUERY PLAN ------------------------------------------------------- Seq Scan on tt (cost=0.00..363.60 rows=453 width=32) Filter: arraylike(f1, 'foo'::text) (2 rows) But we don't inline SQL functions containing sub-selects, so you're still stuck with the rather high overhead of a SQL function. A plpgsql function might be a bit faster: CREATE OR REPLACE FUNCTION ArrayLike(text[], text) RETURNS bool as 'begin return exists (select * from unnest($1) a wherea like $2); end' LANGUAGE plpgSQL STRICT IMMUTABLE; BTW, I'd be pretty suspicious of marking this function leakproof, because the underlying LIKE operator isn't leakproof accordingto pg_proc. > It's as expected though. As for the GIN indices, I tried and it didn't make a difference, which I guess is expected aswell because of the Like operator. I don't expect regular indices to work on regular columns for Like operations, especially'%xxx' ones, so I didn't expect GIN indices to work either for Array columns with Like. Am I wrong? Plain GIN index, probably not. A pg_trgm index could help with LIKE searches, but I don't think we have a variant of thatfor array columns. Have you considered renormalizing the data so that you don't have arrays? regards, tom lane
"ldh@laurent-hasson.com" <ldh@laurent-hasson.com> writes: > I wish there were a way to force inlining, or some other mechanism as the performance difference is large here. I'll beusing the inlining approach when possible, but the SQL Function approach is simpler and will likely be more suitable forsome developers. I'm not sure that there's any fundamental reason why we don't inline SQL functions containing sub-selects. It may just be not having wanted to put any effort into the case way-back-when. Inlining happens too late to allow a resulting WHERE EXISTS to get mutated into a semijoin, but in this example that couldn't happen anyway, so it's not much of an objection. regards, tom lane
Yep, agreed. A simple lexical macro-like approach to test "if it works" could be a simple approach to see if inlining a pieceof sql would not break the main query? Laurent Hasson Sent from my BlackBerry Passport Original Message From: Tom Lane Sent: Saturday, November 12, 2016 14:59 To: ldh@laurent-hasson.com Cc: Marc Mamin; pgsql-performance@postgresql.org Subject: Re: [PERFORM] Inlining of functions (doing LIKE on an array) "ldh@laurent-hasson.com" <ldh@laurent-hasson.com> writes: > I wish there were a way to force inlining, or some other mechanism as the performance difference is large here. I'll beusing the inlining approach when possible, but the SQL Function approach is simpler and will likely be more suitable forsome developers. I'm not sure that there's any fundamental reason why we don't inline SQL functions containing sub-selects. It may just be not having wanted to put any effort into the case way-back-when. Inlining happens too late to allow a resulting WHERE EXISTS to get mutated into a semijoin, but in this example that couldn't happen anyway, so it's not much of an objection. regards, tom lane