Обсуждение: bad query plans for ~ "^string" (and like "string%") (8.3.6)
It seems that ANALYZE does not really sample text column values as much as it could. We have some very bad query plans resulting from this: ... -> Bitmap Index Scan on m_pkey (cost=0.00..28.61 rows=102 width=0) (actual time=171.824..171.824 rows=683923 loops=1) Index Cond: ((e >= 'ean'::text) AND (e < 'eao'::text) This gets even worse for longer strings, where we know that many matching rows exist: # explain analyze select substring(e,5) from m where id=257421 and e ~ '^ean='; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------- Index Scan using m_pkey on m (cost=0.00..12.50 rows=1 width=60) (actual time=1623.795..1703.958 rows=18 loops=1) Index Cond: ((e >= 'ean='::text) AND (e < 'ean>'::text)) Filter: ((e ~ '^ean='::text) AND (id = 257421)) Total runtime: 1703.991 ms (4 rows) Here it would be much better to use the existing index on "id" (btree) first because the current index condition selects 683k rows whereas the result contains 18 rows. Using the index on id would yield 97 rows to filter. Is it possible to work around this problem somehow, other than adding partial indexes for the ~ / LIKE condition (when it's constant) or a 2-dimensional index? (what exactly does ANALYZE look at for text columns? in our case, about 7% of the rows match the index condition, so it seems that left-anchored regexp/like matches are not evaluated using the gathered most-common-value list at all) Regards, Marinos
Marinos Yannikos wrote: > (what exactly does ANALYZE look at for text columns? in our case, about > 7% of the rows match the index condition, so it seems that left-anchored > regexp/like matches are not evaluated using the gathered > most-common-value list at all) oops, I think I gave myself the answer there. Of course the most-common-value list will not help if all the values that match the "bad" index condition exist only once, but have a common prefix... Perhaps Postgres could sample the first few characters separately for such queries, but it's probably not worth it. Regards, Marinos
On Wed, Apr 8, 2009 at 9:42 AM, Marinos Yannikos <mjy@geizhals.at> wrote: > It seems that ANALYZE does not really sample text column values as much as > it could. We have some very bad query plans resulting from this: > > ... > -> Bitmap Index Scan on m_pkey (cost=0.00..28.61 rows=102 width=0) > (actual time=171.824..171.824 rows=683923 loops=1) > Index Cond: ((e >= 'ean'::text) AND (e < 'eao'::text) > > This gets even worse for longer strings, where we know that many matching > rows exist: > > # explain analyze select substring(e,5) from m where id=257421 and e ~ > '^ean='; > QUERY PLAN > --------------------------------------------------------------------------------------------------------------------------- > Index Scan using m_pkey on m (cost=0.00..12.50 rows=1 width=60) (actual > time=1623.795..1703.958 rows=18 loops=1) > Index Cond: ((e >= 'ean='::text) AND (e < 'ean>'::text)) > Filter: ((e ~ '^ean='::text) AND (id = 257421)) > Total runtime: 1703.991 ms > (4 rows) > > Here it would be much better to use the existing index on "id" (btree) first > because the current index condition selects 683k rows whereas the result > contains 18 rows. Using the index on id would yield 97 rows to filter. > > Is it possible to work around this problem somehow, other than adding > partial indexes for the ~ / LIKE condition (when it's constant) or a > 2-dimensional index? > > (what exactly does ANALYZE look at for text columns? in our case, about 7% > of the rows match the index condition, so it seems that left-anchored > regexp/like matches are not evaluated using the gathered most-common-value > list at all) What are you using for default_statistics_target? You can see the gathered data in pg_statistic. ...Robert
Marinos Yannikos <mjy@geizhals.at> writes: > Marinos Yannikos wrote: >> (what exactly does ANALYZE look at for text columns? in our case, about >> 7% of the rows match the index condition, so it seems that left-anchored >> regexp/like matches are not evaluated using the gathered >> most-common-value list at all) > oops, I think I gave myself the answer there. Of course the > most-common-value list will not help if all the values that match the > "bad" index condition exist only once, but have a common prefix... The costing is really done off the range condition ((e >= 'ean'::text) AND (e < 'eao'::text) in your example). I wouldn't think it would have such a hard time getting a good rowcount estimate for that. Maybe you need to bump up the statistics target for that column? regards, tom lane