On Sat, Dec 02, 2017 at 05:27:51PM -0800, Jeff Janes wrote:
> I think the non-extended stats code also has trouble with booleans.
> pg_stats gives me a correlation of 0.8 or higher for the flag column.
It's not due to the boolean though; you see the same thing if you do:
CREATE INDEX aaa_f ON aaa((flag::text));
ANALYZE aaa;
correlation | 0.81193
or:
ALTER TABLE aaa ADD flag2 int; UPDATE aaa SET flag2= flag::int
correlation | 0.81193
I think it's caused by having so few (2) values to correlate.
most_common_vals | {f,t}
most_common_freqs | {0.9014,0.0986}
correlation | 0.822792
It thinks there's somewhat-high correlation since it gets a list of x and y
values (integer positions by logical and physical sort order) and 90% of the x
list (logical value) are the same value ('t'), and the CTIDs are in order on
the new index, so 90% of the values are 100% correlated.
It improves (by which I mean here that it spits out a lower number) if it's not
a 90/10 split:
CREATE TABLE aaa5 AS SELECT (id%100)::int num, (id%10>5)::bool flag FROM generate_series(1, 10000000) id;
CREATE INDEX ON aaa5 (flag);
tablename | aaa5
attname | flag
correlation | 0.522184
Justin