Re: incorrect information in documentation

Поиск
Список
Период
Сортировка
От David G. Johnston
Тема Re: incorrect information in documentation
Дата
Msg-id CAKFQuwax7V5R_rw=EOWmy=TBON6v3sveBx_WvwsENskCL5CLQQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: incorrect information in documentation  ("David G. Johnston" <david.g.johnston@gmail.com>)
Список pgsql-docs
On Mon, Aug 9, 2021 at 8:40 PM David G. Johnston <david.g.johnston@gmail.com> wrote:
On Mon, Aug 9, 2021 at 11:05 AM Bruce Momjian <bruce@momjian.us> wrote:

>         selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/
>         num_distinct1,
>         1/num_distinct2)
>                     = (1 - 0) * (1 - 0) / max(10000, 10000)
>                     = 0.0001

Nice, can you provide a patch please?


Change the line:


Concretely, as attached and inline.

David J.


commit 73fa486a855d75d74a1a695bb350bfbfe27c7751
Author: David G. Johnston <david.g.johnston@gmail.com>
Date:   Tue Apr 12 21:23:53 2022 +0000

    doc: make unique non-null join selectivity example match the prose
   
    The description of the computation for the unique, non-null,
    join selectivity describes a division by the maximum of two values,
    while the example shows a multiplication by their reciprocal.  While
    equivalent the max phrasing is easier to understand; which seems
    more important here than precisely adhering to the formula use
    in the code (for which either variant is still an approximation).
   
    While both num_distinct and num_rows are equal for a unique column
    both the concept and formula use row count (10,000) and the
    field num_distinct has already been set to mean the specific value
    present in the pg_stats table (i.e, -1), so use num_rows here.

diff --git a/doc/src/sgml/planstats.sgml b/doc/src/sgml/planstats.sgml
index 78053d7c49..f72bc4b274 100644
--- a/doc/src/sgml/planstats.sgml
+++ b/doc/src/sgml/planstats.sgml
@@ -391,18 +391,20 @@ tablename  | null_frac | n_distinct | most_common_vals
 </programlisting>
 
    In this case there is no <acronym>MCV</acronym> information for
-   <structfield>unique2</structfield> because all the values appear to be
-   unique, so we use an algorithm that relies only on the number of
-   distinct values for both relations together with their null fractions:
+   <structname>unique2</structname> and all the values appear to be
+   unique (n_distinct = -1), so we use an algorithm that relies on the row
+   count estimates for both relations (num_rows, not shown, but "tenk")
+   together with the column null fractions (zero for both):
 
 <programlisting>
-selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1, 1/num_distinct2)
+selectivity = (1 - null_frac1) * (1 - null_frac2) / max(num_rows1, num_rows2)
             = (1 - 0) * (1 - 0) / max(10000, 10000)
             = 0.0001
 </programlisting>
 
    This is, subtract the null fraction from one for each of the relations,
-   and divide by the maximum of the numbers of distinct values.
+   and divide by the row count of the larger relation (this value does get
+   scaled in the non-unique case).
    The number of rows
    that the join is likely to emit is calculated as the cardinality of the
    Cartesian product of the two inputs, multiplied by the
Вложения

В списке pgsql-docs по дате отправления:

Предыдущее
От: Peter Geoghegan
Дата:
Сообщение: Re: "GIN and GiST Index Types" page is about usage in full text search, but looks general purpose
Следующее
От: Tom Lane
Дата:
Сообщение: Re: "GIN and GiST Index Types" page is about usage in full text search, but looks general purpose