Estimating geometric distributions

From: Stephen Denne
Subject: Estimating geometric distributions
Date:
Msg-id: F0238EBA67824444BC1CB4700960CB4804D2F0EE@dmpeints002.isotach.com
In reply to: Re: Maximum statistics target  ("Stephen Denne" <Stephen.Denne@datamail.co.nz>)
Responses: Re: Estimating geometric distributions  ("Stephen Denne" <Stephen.Denne@datamail.co.nz>)
List: pgsql-hackers
I wrote:
> I have a field whose distribution of frequencies of values is
> roughly geometric, rather than flat.
> Total rows = 36 million
> relpages=504864
> Distinct field values in use = 169
> 10 values account for 50% of the rows.
> 41 values account for 90% of the rows.
>
> After setting statistics target to 1000 for that field, and
> analyzing the table, the statistics row for that field had 75
> most frequent values and a histogram with 77 entries in it.
> Estimating 152 values in total.


"public";"mytable";"myfield";0;4;152;"{202,179,8,181,173,207,6,118,107,205,182,4,54,247,168,77,169,53,120,159,149,174,167,156,148,150,56,108,66,119,5,99,96,175,97,208,1,130,10,102,228,101,121,50,11,152,32,12,78,221,55,244,241,252,203,116,103,184,154,153,238,65,49,220,83,98,111,85,139,242,240,260,7,109,114}";"{0.0836433,0.0781667,0.0738367,0.0598533,0.04629,0.04447,0.0359833,0.0314267,0.0278333,0.0268,0.0251433,0.0244867,0.02438,0.0223433,0.0207567,0.0189667,0.0168833,0.01582,0.0150267,0.0141767,0.0130467,0.0128933,0.0125767,0.0123567,0.0116567,0.0114967,0.01048,0.01037,0.00994667,0.00987667,0.00977667,0.00965333,0.00916333,0.00828667,0.00732667,0.00712,0.00629,0.00624,0.00576667,0.00558667,0.00477667,0.00475333,0.00410333,0.00405667,0.00371667,0.00334667,0.00334,0.00312667,0.00312667,0.00302,0.00300333,0.00295,0.00287333,0.00271,0.00267,0.00240667,0.00224,0.00221333,0.00215333,0.0021,0.00205667,0.00202667,0.00197333,0.00197333,0.00168667,0.00166,0.00159333,0.00159,0.00154667,0.00150333,0.00149,0.00133333,0.00132,0.00112667,0.00104}";"{2,9,9,9,67,76,84,84,86,87,87,88,95,100,100,100,104,105,105,110,112,112,128,137,137,138,143,144,144,144,151,155,155,155,157,157,158,171,171,183,185,185,185,185,187,194,199,199,200,200,201,204,204,209,209,214,214,214,214,215,217,225,225,225,229,239,239,249,250,250,253,253,255,257,261,262,266}";0.449246

> My problem is frequent over-estimation of rows when restricting by
> this field with values not known at plan time.

Examples:
select * from mytable where myfield = ?;
select * from mytable where myfield in (subquery);

My arithmetic mean of the frequencies is 214200.
My geometric mean is 13444.

However, ANALYZE didn't find all my values, and thinks that there are only 152 of them, so it uses a mean of 238046 (roughly 36182976 rows / 152 values).
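For reference, both means can be computed directly from the table (a full scan, so only practical as a one-off check; table and column names as above):

select avg(cnt) as arithmetic_mean,
       exp(avg(ln(cnt))) as geometric_mean
from (select myfield, count(*) as cnt
      from mytable
      group by myfield) as per_value;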


When the subquery is estimated to return three myfield values, the query estimates 714138 rows, and chooses a sequential scan over mytable (myfield is indexed).

explain select * from mytable where myfield in (values (1),(2),(3));

Hash IN Join  (cost=0.08..1009521.37 rows=714138 width=86)
  Hash Cond: (mytable.myfield = "*VALUES*".column1)
  ->  Seq Scan on mytable  (cost=0.00..866693.76 rows=36182976 width=86)
  ->  Hash  (cost=0.04..0.04 rows=3 width=4)
        ->  Values Scan on "*VALUES*"  (cost=0.00..0.04 rows=3 width=4)

I think this query is much more likely to return around 40000 rows (three values times the geometric mean of 13444), and a Bitmap Index Scan should be used.

explain select * from mytable where myfield in (values (1),(2));

Nested Loop  (cost=4445.11..931383.93 rows=476092 width=86)
  ->  Unique  (cost=0.04..0.04 rows=2 width=4)
        ->  Sort  (cost=0.04..0.04 rows=2 width=4)
              Sort Key: "*VALUES*".column1
              ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=4)
  ->  Bitmap Heap Scan on mytable  (cost=4445.08..462716.37 rows=238046 width=86)
        Recheck Cond: (mytable.myfield = "*VALUES*".column1)
        ->  Bitmap Index Scan on myindex  (cost=0.00..4385.56 rows=238046 width=0)
              Index Cond: (mytable.myfield = "*VALUES*".column1)

At the expected number of loops through the Bitmap Heap Scan (2 here, 3 above), loops * 462716.37 already exceeds 1009521.37, but that per-loop cost estimate is far too high in the general case. Adjusted for my expectation of 13444 rows per loop, it should be closer to 26000 per loop. At that rate, you should need to expect close to 40 myfield values being returned by the subquery before choosing a sequential scan.
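For concreteness, my arithmetic, scaling the planner's per-loop cost by the expected row ratio:

    adjusted per-loop cost:  462716.37 * (13444 / 238046) ≈ 26000
    break-even loop count:   1009521.37 / 26000 ≈ 39

so only somewhere around 40 values does the sequential scan genuinely win.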

Is there any facility already in PostgreSQL to help me here?

Hopefully an index type that I don't know about yet? (Geometric distributions are similar to those found in word count
distributions).

If not... is there any merit in this idea:

During the analyze process, the geometric mean of the sampled rows would be calculated and, if determined to be significantly different from the arithmetic mean, stored in a new stats column. When estimating the number of rows that will be returned by queries of the form shown above, if there is a geometric mean stored, use it instead of the arithmetic mean.
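In the meantime, a rough version of that number can be derived from the statistics ANALYZE already gathers, by taking the geometric mean of the sampled most-common-value frequencies (a sketch only; it ignores the histogram tail, and the implicit unnest join over the array requires a reasonably recent PostgreSQL):

select exp(avg(ln(f))) * 36182976 as est_rows_per_value
from pg_stats, unnest(most_common_freqs) as f
where tablename = 'mytable' and attname = 'myfield';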

Regards,
Stephen Denne.




