Re: strange slow query - lost lot of time somewhere

Поиск
Список
Период
Сортировка
От David G. Johnston
Тема Re: strange slow query - lost lot of time somewhere
Дата
Msg-id CAKFQuwZEmcNk3YQo2Xj4EDUOdY6qakad31rOD1Vc4q1_s68-Ew@mail.gmail.com
обсуждение исходный текст
Ответ на Re: strange slow query - lost lot of time somewhere  (David Rowley <dgrowleyml@gmail.com>)
Ответы Re: strange slow query - lost lot of time somewhere  (David Rowley <dgrowleyml@gmail.com>)
Список pgsql-hackers
On Mon, May 2, 2022 at 7:30 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Tue, 3 May 2022 at 13:43, David G. Johnston
<david.g.johnston@gmail.com> wrote:
> hit_ratio = (est_entries / ndistinct) - (ndistinct / calls) || clamp to 0.0
> I don't understand the adjustment factor ndistinct/calls

I've attached a spreadsheet showing you the impact of subtracting
(ndistinct / calls).  What this is correcting for is the fact that the
first scan from each unique value is a cache miss.  The more calls we
have, the more hits we'll get.  If there was only 1 call per distinct
value then there'd never be any hits. Without subtracting (ndistinct /
calls) and assuming there's space in the cache for each ndistinct
value, we'd assume 100% cache hit ratio if calls == ndistinct.  What
we should assume in that case is a 0% hit ratio as the first scan for
each distinct parameter must always be a miss as we've never had a
chance to cache any tuples for it yet.


Thank you.  I understand the theory and agree with it - but the math doesn't seem to be working out.

Plugging in:
n = 2,000
e = 500
c = 10,000

proper = 5%
incorrect = 25%

But of the 10,000 calls we will receive, the first 2,000 will be misses while 2,000 of the remaining 8,000 will be hits, due to sharing 2,000 distinct groups among the available inventory of 500 (25% of 8,000 is 2,000).  2,000 hits in 10,000 calls yields 20%.

I believe the correct formula to be:

((calls - ndistinct) / calls) * (est_entries / ndistinct) = hit_ratio
.80 * .25 = .20

First we recognize that our hit ratio can be at most c-n/c since n misses are guaranteed.  We take that ratio and scale it by our cache efficiency since of the remaining hits that fraction will turn into misses due to the relevant cache not being in memory.

David J.

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: mylodon's failures in the back branches
Следующее
От: Amit Kapila
Дата:
Сообщение: Re: bogus: logical replication rows/cols combinations