Обсуждение: avoiding seq scans when two columns are very correlated
Hello, A table has two columns id and EffectiveId. First is primary key. EffectiveId is almost always equal to id (95%) unless records are merged. Many queries have id = EffectiveId condition. Both columns are very distinct and Pg reasonably decides that condition has very low selectivity and picks sequence scan. Simple perl script that demonstrates estimation error: https://gist.github.com/1356744 Estimation is ~200 times off (5 vs 950), for real situation it's very similar. Understandably difference depends on correlation coefficient. In application such wrong estimation result in seq scan of this table winning leading position in execution plans over other tables and index scans. What can I do to avoid this problem? Tested with PostgreSQL 9.0.3 on x86_64-apple-darwin10.6.0, compiled by GCC i686-apple-darwin10-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5664), 64-bit -- Best regards, Ruslan.
Ruslan Zakirov <ruz@bestpractical.com> writes: > A table has two columns id and EffectiveId. First is primary key. > EffectiveId is almost always equal to id (95%) unless records are > merged. Many queries have id = EffectiveId condition. Both columns are > very distinct and Pg reasonably decides that condition has very low > selectivity and picks sequence scan. I think the only way is to rethink your data representation. PG doesn't have cross-column statistics at all, and even if it did, you'd be asking for an estimate of conditions in the "long tail" of the distribution. That's unlikely to be very accurate. Consider adding a "merged" boolean, or defining effectiveid differently. For instance you could set it to null in unmerged records; then you could get the equivalent of the current meaning with COALESCE(effectiveid, id). In either case, PG would then have statistics that bear directly on the question of how many merged vs unmerged records there are. regards, tom lane
On Fri, Nov 11, 2011 at 7:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Ruslan Zakirov <ruz@bestpractical.com> writes: >> A table has two columns id and EffectiveId. First is primary key. >> EffectiveId is almost always equal to id (95%) unless records are >> merged. Many queries have id = EffectiveId condition. Both columns are >> very distinct and Pg reasonably decides that condition has very low >> selectivity and picks sequence scan. > > I think the only way is to rethink your data representation. PG doesn't > have cross-column statistics at all, and even if it did, you'd be asking > for an estimate of conditions in the "long tail" of the distribution. > That's unlikely to be very accurate. Rethinking schema is an option that requires more considerations as we do it this way for years and run product on mysql, Pg and Oracle. Issue affects Oracle, but it can be worked around by dropping indexes or may be by building correlation statistics in 11g (didn't try it yet). Wonder if "CROSS COLUMN STATISTICS" patch that floats around would help with such case? > Consider adding a "merged" boolean, or defining effectiveid differently. > For instance you could set it to null in unmerged records; then you > could get the equivalent of the current meaning with > COALESCE(effectiveid, id). In either case, PG would then have > statistics that bear directly on the question of how many merged vs > unmerged records there are. NULL in EffectiveId is the way to go, however when we actually need those records (not so often situation) query becomes frightening: SELECT main.* FROM Tickets main JOIN Tickets te ON te.EffectiveId = main.id OR (te.id = main.id AND te.EffectiveId IS NULL) JOIN OtherTable ot ON ot.Ticket = te.id Past experience reminds that joins with ORs poorly handled by many optimizers. In the current situation join condition is very straightforward and effective. > regards, tom lane -- Best regards, Ruslan.
On Fri, Nov 11, 2011 at 10:01 PM, Ruslan Zakirov <ruz@bestpractical.com> wrote: > Hello, > > A table has two columns id and EffectiveId. First is primary key. > EffectiveId is almost always equal to id (95%) unless records are > merged. Many queries have id = EffectiveId condition. Both columns are > very distinct and Pg reasonably decides that condition has very low > selectivity and picks sequence scan. > > Simple perl script that demonstrates estimation error: > https://gist.github.com/1356744 > > Estimation is ~200 times off (5 vs 950), for real situation it's very > similar. Understandably difference depends on correlation coefficient. > > In application such wrong estimation result in seq scan of this table > winning leading position in execution plans over other tables and > index scans. > > What can I do to avoid this problem? Does a partial index help? CREATE UNIQUE INDEX foo_idx ON mytab(id) WHERE id = EffectiveId -- Stuart Bishop <stuart@stuartbishop.net> http://www.stuartbishop.net/
On Tue, Nov 15, 2011 at 11:08 AM, Stuart Bishop <stuart@stuartbishop.net> wrote: > On Fri, Nov 11, 2011 at 10:01 PM, Ruslan Zakirov <ruz@bestpractical.com> wrote: >> Hello, [snip] >> In application such wrong estimation result in seq scan of this table >> winning leading position in execution plans over other tables and >> index scans. >> >> What can I do to avoid this problem? > > Does a partial index help? CREATE UNIQUE INDEX foo_idx ON mytab(id) > WHERE id = EffectiveId It doesn't help. Probably reason is the same as for partitions. > > > -- > Stuart Bishop <stuart@stuartbishop.net> > http://www.stuartbishop.net/ > -- Best regards, Ruslan.