Обсуждение: Optimizing query?
Hi, I am trying to match items from 2 tables based on a common string. One is a big table which has one column with entries like XY123, ABC44, etc The table has an index on that column. The second table is, typically, much smaller select .... from tab1, tab2 where tab1.code = tab2.code; This works fine and fast. Now, as a variant, I have some entries like XY423A, XY423B, GF55A, GF55D in the big table and want them to match XY423, GF55 in the second table Variants I have tried select .... from tab1, tab2 where tab1.code ~ (tab2.code||'($|[A-Z])'); select .... from tab1, tab2 where tab1.code ~ ('^'||tab2.code||'($|[A-Z])'); both take an enormous time. In the better case that I can subset (e.g. all candidates in table 2 share initial "AX") I get back to manageable times by adding and tab1.code ~ '^AX' into the recipe. Actual runtime with about a million entries in tab1 and 800 entries in tab2 is about 40 seconds. Regards Wolfgang Hamann
Le mercredi 30 janvier 2013 à 11:08 +0000, wolfgang@noten5.maas-noten.de a écrit : > Hi, > > I am trying to match items from 2 tables based on a common string. > One is a big table which has one column with entries like XY123, ABC44, etc > The table has an index on that column. > The second table is, typically, much smaller > > select .... from tab1, tab2 where tab1.code = tab2.code; > > This works fine and fast. > Now, as a variant, I have some entries like XY423A, XY423B, GF55A, GF55D in the > big table and want them to match XY423, GF55 in the second table > > Variants I have tried > > select .... from tab1, tab2 where tab1.code ~ (tab2.code||'($|[A-Z])'); > select .... from tab1, tab2 where tab1.code ~ ('^'||tab2.code||'($|[A-Z])'); > Have you tried the substring function? select .... from tab1, tab2 where substring(tab1.code from 1 for 5) = tab2.code > both take an enormous time. In the better case that I can subset (e.g. all candidates in table 2 > share initial "AX") I get back to manageable times by adding > and tab1.code ~ '^AX' > into the recipe. Actual runtime with about a million entries in tab1 and 800 entries in tab2 > is about 40 seconds. > > Regards > Wolfgang Hamann > > > > > > -- Salutations, Vincent Veyron http://marica.fr/site/demonstration Logiciel de gestion des contentieux juridiques et des sinistres d'assurance
2013/1/31 Vincent Veyron <vv.lists@wanadoo.fr>: > Le mercredi 30 janvier 2013 à 11:08 +0000, wolfgang@noten5.maas-noten.de > a écrit : >> Hi, >> >> I am trying to match items from 2 tables based on a common string. >> One is a big table which has one column with entries like XY123, ABC44, etc >> The table has an index on that column. >> The second table is, typically, much smaller >> >> select .... from tab1, tab2 where tab1.code = tab2.code; >> >> This works fine and fast. >> Now, as a variant, I have some entries like XY423A, XY423B, GF55A, GF55D in the >> big table and want them to match XY423, GF55 in the second table >> >> Variants I have tried >> >> select .... from tab1, tab2 where tab1.code ~ (tab2.code||'($|[A-Z])'); >> select .... from tab1, tab2 where tab1.code ~ ('^'||tab2.code||'($|[A-Z])'); >> > > Have you tried the substring function? > > select .... from tab1, tab2 where substring(tab1.code from 1 for 5) = > tab2.code > > >> both take an enormous time. In the better case that I can subset (e.g. all candidates in table 2 >> share initial "AX") I get back to manageable times by adding >> and tab1.code ~ '^AX' >> into the recipe. Actual runtime with about a million entries in tab1 and 800 entries in tab2 >> is about 40 seconds. any join where result is related to some function result can be very slow, because estimation will be out and any repeated function evaluation is just expensive. You can try use a functional index. create index on tab2 ((substring(tab1.code from 1 for 5)) Regards Pavel Stehule >> >> Regards >> Wolfgang Hamann >> >> >> >> >> >> > > -- > Salutations, Vincent Veyron > http://marica.fr/site/demonstration > Logiciel de gestion des contentieux juridiques et des sinistres d'assurance > > > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general
Le jeudi 31 janvier 2013 à 09:49 +0100, Pavel Stehule a écrit : > any join where result is related to some function result can be very > slow, because estimation will be out and any repeated function > evaluation is just expensive. > Hi Pavel, Thank you for the correction. Since we're at it, I have a question regarding functions in a query : Suppose I have a query of the form SELECT my_function(column_1), column_2 FROM my_table GROUP BY my_function(column_1) ORDER BY my_function(column_1); where my_function is a user defined function. How many times is the function computed? -- Salutations, Vincent Veyron http://marica.fr/site/demonstration Logiciel de gestion des contentieux juridiques et des sinistres d'assurance
2013/1/31 Vincent Veyron <vv.lists@wanadoo.fr>: > Le jeudi 31 janvier 2013 à 09:49 +0100, Pavel Stehule a écrit : > >> any join where result is related to some function result can be very >> slow, because estimation will be out and any repeated function >> evaluation is just expensive. >> > > Hi Pavel, > > Thank you for the correction. Since we're at it, I have a question > regarding functions in a query : > > Suppose I have a query of the form > > SELECT my_function(column_1), column_2 > FROM my_table > GROUP BY my_function(column_1) > ORDER BY my_function(column_1); > > where my_function is a user defined function. > > How many times is the function computed? if function is stable or immutable, then once per row Pavel > > > > > -- > Salutations, Vincent Veyron > http://marica.fr/site/demonstration > Logiciel de gestion des contentieux juridiques et des sinistres d'assurance >
Le jeudi 31 janvier 2013 à 11:06 +0100, Pavel Stehule a écrit : > 2013/1/31 Vincent Veyron <vv.lists@wanadoo.fr>: > > > > Suppose I have a query of the form > > > > SELECT my_function(column_1), column_2 > > FROM my_table > > GROUP BY my_function(column_1) > > ORDER BY my_function(column_1); > > > > where my_function is a user defined function. > > > > How many times is the function computed? > > if function is stable or immutable, then once per row > In this post (watch for line-wrap) : http://www.postgresql.org/message-id/CAFj8pRAdYL1-hCxH +QSZQKHt9YnoaOiGkfX4cNc9mzUTimcs1w@mail.gmail.com you wrote that it is usually better not to mark SQL functions (as opposed to plpgsql functions). So should I mark SQL functions stable/immutable if I use them in a query like the one above, or is it unnecessary? -- Salutations, Vincent Veyron http://marica.fr/site/demonstration Logiciel de gestion des contentieux juridiques et des sinistres d'assurance
2013/1/31 Vincent Veyron <vv.lists@wanadoo.fr>: > Le jeudi 31 janvier 2013 à 11:06 +0100, Pavel Stehule a écrit : >> 2013/1/31 Vincent Veyron <vv.lists@wanadoo.fr>: >> > >> > Suppose I have a query of the form >> > >> > SELECT my_function(column_1), column_2 >> > FROM my_table >> > GROUP BY my_function(column_1) >> > ORDER BY my_function(column_1); >> > >> > where my_function is a user defined function. >> > >> > How many times is the function computed? >> >> if function is stable or immutable, then once per row >> > > In this post (watch for line-wrap) : > > http://www.postgresql.org/message-id/CAFj8pRAdYL1-hCxH > +QSZQKHt9YnoaOiGkfX4cNc9mzUTimcs1w@mail.gmail.com > > you wrote that it is usually better not to mark SQL functions (as > opposed to plpgsql functions). > > So should I mark SQL functions stable/immutable if I use them in a query > like the one above, or is it unnecessary? > It should not be marked Regards Pavel > > -- > Salutations, Vincent Veyron > http://marica.fr/site/demonstration > Logiciel de gestion des contentieux juridiques et des sinistres d'assurance >
Pavel Stehlule wrote: >> >> Hi, >> >> >> >> I am trying to match items from 2 tables based on a common string. >> >> One is a big table which has one column with entries like XY123, ABC44, = >> etc >> >> The table has an index on that column. >> >> The second table is, typically, much smaller >> >> >> >> select .... from tab1, tab2 where tab1.code =3D tab2.code; >> >> >> >> This works fine and fast. >> >> Now, as a variant, I have some entries like XY423A, XY423B, GF55A, GF55D= >> in the >> >> big table and want them to match XY423, GF55 in the second table >> >> >> >> Variants I have tried >> >> >> >> select .... from tab1, tab2 where tab1.code ~ (tab2.code||'($|[A-Z])'); >> >> select .... from tab1, tab2 where tab1.code ~ ('^'||tab2.code||'($|[A-Z= >> ])'); >> >> >> > >> > Have you tried the substring function? >> > >> > select .... from tab1, tab2 where substring(tab1.code from 1 for 5) =3D >> > tab2.code >> > Hi Pavel, it was just by chance that a fixed size substring would match the data at hand. It is more common to have a digit/letter (or vice versa) boundary or a hyphen there >> > >> >> both take an enormous time. In the better case that I can subset (e.g. a= >> ll candidates in table 2 >> >> share initial "AX") I get back to manageable times by adding >> >> and tab1.code ~ '^AX' >> >> into the recipe. Actual runtime with about a million entries in tab1 and= >> 800 entries in tab2 >> >> is about 40 seconds. >> >> any join where result is related to some function result can be very >> slow, because estimation will be out and any repeated function >> evaluation is just expensive. >> I see the problem since obviously every the ~ operator with a non-constant pattern is constantly recompiling the pattern. I wonder whether it would be possible to invent a prefix-match operator that approaches the performance of string equality. I noted in the past (not sure whether anything has changed in regex matching) that a constant leading part of regex would improve performance, i.e. use an index scan to select possible candidates. >> You can try use a functional index. >> >> create index on tab2 ((substring(tab1.code from 1 for 5)) >> What kind of trick is that - mixing two tables into a functional index? What would the exact syntax be for that? Regards Wolfgang Hamann
Hello 2013/1/31 <hamann.w@t-online.de>: > > Pavel Stehlule wrote: > >>> >> Hi, >>> >> >>> >> I am trying to match items from 2 tables based on a common string. >>> >> One is a big table which has one column with entries like XY123, ABC44, = >>> etc >>> >> The table has an index on that column. >>> >> The second table is, typically, much smaller >>> >> >>> >> select .... from tab1, tab2 where tab1.code =3D tab2.code; >>> >> >>> >> This works fine and fast. >>> >> Now, as a variant, I have some entries like XY423A, XY423B, GF55A, GF55D= >>> in the >>> >> big table and want them to match XY423, GF55 in the second table >>> >> >>> >> Variants I have tried >>> >> >>> >> select .... from tab1, tab2 where tab1.code ~ (tab2.code||'($|[A-Z])'); >>> >> select .... from tab1, tab2 where tab1.code ~ ('^'||tab2.code||'($|[A-Z= >>> ])'); >>> >> >>> > >>> > Have you tried the substring function? >>> > >>> > select .... from tab1, tab2 where substring(tab1.code from 1 for 5) =3D >>> > tab2.code >>> > > > Hi Pavel, it was just by chance that a fixed size substring would match the > data at hand. It is more common to have a digit/letter (or vice versa) boundary > or a hyphen there > >>> > >>> >> both take an enormous time. In the better case that I can subset (e.g. a= >>> ll candidates in table 2 >>> >> share initial "AX") I get back to manageable times by adding >>> >> and tab1.code ~ '^AX' >>> >> into the recipe. Actual runtime with about a million entries in tab1 and= >>> 800 entries in tab2 >>> >> is about 40 seconds. >>> >>> any join where result is related to some function result can be very >>> slow, because estimation will be out and any repeated function >>> evaluation is just expensive. >>> > I see the problem since obviously every the ~ operator with a non-constant > pattern is constantly recompiling the pattern. > > I wonder whether it would be possible to invent a prefix-match operator that approaches > the performance of string equality. I noted in the past (not sure whether anything > has changed in regex matching) that a constant leading part of regex would improve > performance, i.e. use an index scan to select possible candidates. > >>> You can try use a functional index. >>> >>> create index on tab2 ((substring(tab1.code from 1 for 5)) >>> > > What kind of trick is that - mixing two tables into a functional index? it is not possible - you can do some auxiliary table and creating indexes over this table but maybe https://github.com/dimitri/prefix can help Regards Pavel > What would the exact syntax be for that? > > Regards > Wolfgang Hamann > > > > > > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general
Pavel Stehule wrote: >> >> but maybe https://github.com/dimitri/prefix can help >> Hi Pavel, thanks - this works perfect. However, it does not seem to play well with the optimizer, so I ended up with select all candidates into a temp table using prefix operator apply all other conditions by joining that temp table to original ones Regards Wolfgang
2013/2/2 <hamann.w@t-online.de>: > Pavel Stehule wrote: > >>> >>> but maybe https://github.com/dimitri/prefix can help >>> > > Hi Pavel, > > thanks - this works perfect. However, it does not seem to play well > with the optimizer, so I ended up with > > select all candidates into a temp table using prefix operator > apply all other conditions by joining that temp table to original ones > you can send proposals to enhancing to Dimitry - He will be happy :) Regards Pavel > Regards > Wolfgang > > > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general
2013/2/2 Pavel Stehule <pavel.stehule@gmail.com>: > 2013/2/2 <hamann.w@t-online.de>: >> Pavel Stehule wrote: >> >>>> >>>> but maybe https://github.com/dimitri/prefix can help >>>> >> >> Hi Pavel, >> >> thanks - this works perfect. However, it does not seem to play well >> with the optimizer, so I ended up with >> >> select all candidates into a temp table using prefix operator >> apply all other conditions by joining that temp table to original ones >> seriously - it is typical solution - and it is great so PostgreSQL help to you :) Regards Pavel > > you can send proposals to enhancing to Dimitry - He will be happy :) > > Regards > > Pavel > >> Regards >> Wolfgang >> >> >> >> -- >> Sent via pgsql-general mailing list (pgsql-general@postgresql.org) >> To make changes to your subscription: >> http://www.postgresql.org/mailpref/pgsql-general
On 2013-01-31, hamann.w@t-online.de <hamann.w@t-online.de> wrote: > > Pavel Stehlule wrote: > >>> >> Hi, >>> >> >>> >> I am trying to match items from 2 tables based on a common string. >>> >> One is a big table which has one column with entries like XY123, ABC44, = >>> etc >>> >> The table has an index on that column. >>> >> The second table is, typically, much smaller >>> >> >>> >> select .... from tab1, tab2 where tab1.code =3D tab2.code; >>> >> >>> >> This works fine and fast. >>> >> Now, as a variant, I have some entries like XY423A, XY423B, GF55A, GF55D= >>> in the >>> >> big table and want them to match XY423, GF55 in the second table >>> >> >>> >> Variants I have tried >>> >> >>> >> select .... from tab1, tab2 where tab1.code ~ (tab2.code||'($|[A-Z])'); >>> >> select .... from tab1, tab2 where tab1.code ~ ('^'||tab2.code||'($|[A-Z= >>> ])'); >>> >> >>> > >>> > Have you tried the substring function? >>> > >>> > select .... from tab1, tab2 where substring(tab1.code from 1 for 5) =3D >>> > tab2.code >>> > > > Hi Pavel, it was just by chance that a fixed size substring would match the > data at hand. It is more common to have a digit/letter (or vice versa) boundary > or a hyphen there > >>> > >>> >> both take an enormous time. In the better case that I can subset (e.g. a= >>> ll candidates in table 2 >>> >> share initial "AX") I get back to manageable times by adding >>> >> and tab1.code ~ '^AX' >>> >> into the recipe. Actual runtime with about a million entries in tab1 and= >>> 800 entries in tab2 >>> >> is about 40 seconds. >>> >>> any join where result is related to some function result can be very >>> slow, because estimation will be out and any repeated function >>> evaluation is just expensive. >>> > I see the problem since obviously every the ~ operator with a non-constant > pattern is constantly recompiling the pattern. > > I wonder whether it would be possible to invent a prefix-match operator that approaches > the performance of string equality. I noted in the past (not sure whether anything > has changed in regex matching) that a constant leading part of regex would improve > performance, i.e. use an index scan to select possible candidates. > you could write a set returning function that opens cursors on both tables using "ORDER BY code" and merges the results -- ââ 100% natural