Strange avg value size for index on expression in pg_stats
От | Jehan-Guillaume de Rorthais |
---|---|
Тема | Strange avg value size for index on expression in pg_stats |
Дата | |
Msg-id | 20141110115254.2997aaf8@erg обсуждение исходный текст |
Ответы |
Re: Strange avg value size for index on expression in
pg_stats
Re: Strange avg value size for index on expression in pg_stats |
Список | pgsql-general |
Hello, I'm not sure if I should post here or on pgsql-hackers. While investigating about a wrong result with my btree bloat estimation query, I found a strange stat deviation between the average size of a value in its table and its average size in one index on the "lower" expression. Take the following scenario: postgres@test=# create table test as test-# select md5(t::text) AS t test-# from generate_series(1, 1000000) t; SELECT 1000000 postgres@test=# create index ON test (lower(t)); CREATE INDEX postgres@test=# analyze test; ANALYZE postgres@test=# select tablename, attname, avg_width test-# from pg_stats test-# where schemaname = 'public'; tablename | attname | avg_width ----------------+---------+----------- test | t | 33 test_lower_idx | lower | 36 Md5 values are always 32 bytes long. In the table, pg_stats reports 33 because of the text header. In the index, the reported value is 36! Looking at the page layout documentation and in the index using hexdump, I can not find any answer about this 3 bytes. PFA the "hexdump -C" output from the index. For each row, we clearly see a 8 bytes row header followed by a ONE byte value header (43 'C'), the 32 bytes of the md5 and 7 bytes of padding (00). A wild guess would be that ANALYZE is considering a text field from an expression has always a 4 bytes header whatever its actual size (text field header size is one if the value is < 127 bytes long, 4 in other situations). This tiny difference is the source of a very bad estimation with the Btree bloat estimation query when values are around an alignement boundary. As instance, here is the use case that lead me to this: tablename | attname | avg_width --------------------+---------+----------- customer | email | 23 customer_lower_idx | lower | 26 We have an index on email, and another one on lower(index). The first one is aligned on 24, the second one on 32. Leading to bloat estimation of 17% for the first one and -1% for the second one (the estimated index size is bigger than the real one). Any information about this from a hacker? Do anyone have an explanation about this? Is it something that's worth posting on pgsql-hackers? Regards,
Вложения
В списке pgsql-general по дате отправления: