Re: How the planner uses statistics
От | Bruce Momjian |
---|---|
Тема | Re: How the planner uses statistics |
Дата | |
Msg-id | 200502270049.j1R0no821755@candle.pha.pa.us обсуждение исходный текст |
Ответ на | Re: How the planner uses statistics (Mark Kirkwood <markir@coretech.co.nz>) |
Список | pgsql-docs |
Patch applied. Thanks. Your documentation changes can be viewed in five minutes using links on the developer's page, http://www.postgresql.org/developer/testing. --------------------------------------------------------------------------- Mark Kirkwood wrote: > Bruce Momjian wrote: > > Mark Kirkwood wrote: > > > >>At Tom's suggestion, I am going to amend the page to fit into the > >>'internals' chapter as opposed to 'performance tips' one. I might do > >>this first, and send you the resulting page. > > > > That sounds good, that this become part of the developer docs. > > > > Here is the amended version. I have placed it in its own chapter located > immediately after 'bki Backend Interface', however there is nothing > special about that location... feel free to move it around :-) > > Mark > > > > > > diff -Naur sgml.orig/filelist.sgml sgml/filelist.sgml > --- sgml.orig/filelist.sgml Mon Feb 14 15:02:16 2005 > +++ sgml/filelist.sgml Tue Feb 15 09:52:33 2005 > @@ -77,6 +77,7 @@ > <!entity catalogs SYSTEM "catalogs.sgml"> > <!entity geqo SYSTEM "geqo.sgml"> > <!entity gist SYSTEM "gist.sgml"> > +<!entity howplanstats SYSTEM "howplanstats.sgml"> > <!entity indexam SYSTEM "indexam.sgml"> > <!entity nls SYSTEM "nls.sgml"> > <!entity plhandler SYSTEM "plhandler.sgml"> > diff -Naur sgml.orig/howplanstats.sgml sgml/howplanstats.sgml > --- sgml.orig/howplanstats.sgml Thu Jan 1 12:00:00 1970 > +++ sgml/howplanstats.sgml Tue Feb 15 17:18:30 2005 > @@ -0,0 +1,370 @@ > +<!-- > +$PostgreSQL$ > +--> > + > +<chapter id="how-planner-stats"> > + <title>How the Planner Uses Statistics</title> > + > + <para> > + This chapter builds on the material covered in <xref linkend="using-explain"> > + and <xref linkend="planner-stats">, and shows how the planner uses the > + system statistics to estimate the number of rows each stage in a query might > + return. This is a significant part of the planning / optimizing process, > + providing much of the raw material for cost calculation. > + </para> > + > + <para> > + The intent of this chapter is not to document the code — > + better done in the code itself, but to present an overview of how it works. > + This will perhaps ease the learning curve for someone who subsequently > + wishes to read the code. As a consequence, the approach chosen is to analyze > + a series of incrementally more complex examples. > + </para> > + > + <para> > + The outputs and algorithms shown below are taken from version 8.0. > + The behaviour of earlier (or later) versions may vary. > + </para> > + > + <sect1 id="row-estimation-examples"> > + <title>Row Estimation Examples</title> > + > + <indexterm zone="row-estimation-examples"> > + <primary>row estimation</primary> > + <secondary>planner</secondary> > + </indexterm> > + > + <para> > + Using examples drawn from the regression test database, let's start with a > + very simple query: > +<programlisting> > +EXPLAIN SELECT * FROM tenk1; > + > + QUERY PLAN > +------------------------------------------------------------- > + Seq Scan on tenk1 (cost=0.00..445.00 rows=10000 width=244) > +</programlisting> > + > + How the planner determines the cardinality of <classname>tenk1</classname> > + is covered in <xref linkend="using-explain">, but is repeated here for > + completeness. The number of rows is looked up from > + <classname>pg_class</classname>: > + > +<programlisting> > +SELECT reltuples, relpages FROM pg_class WHERE relname = 'tenk1'; > + > + relpages | reltuples > +----------+----------- > + 345 | 10000 > +</programlisting> > + The planner will check the <structfield>relpages<structfield> estimate > + (this is a cheap operation) and if incorrect may scale > + <structfield>reltuples<structfield> to obtain a row estimate. In this case it > + does not, thus: > + > +<programlisting> > +rows = 10000 > +</programlisting> > + > + </para> > + > + <para> > + let's move on to an example with a range condition in its > + <literal>WHERE</literal> clause: > + > +<programlisting> > +EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 1000; > + > + QUERY PLAN > +------------------------------------------------------------ > + Seq Scan on tenk1 (cost=0.00..470.00 rows=1031 width=244) > + Filter: (unique1 < 1000) > +</programlisting> > + > + The planner examines the <literal>WHERE</literal> clause condition: > + > +<programlisting> > +unique1 < 1000 > +</programlisting> > + > + and looks up the restriction function for the operator > + <literal><</literal> in <classname>pg_operator</classname>. > + This is held in the column <structfield>oprrest</structfield>, > + and the result in this case is <function>scalarltsel</function>. > + The <function>scalarltsel</function> function retrieves the histogram for > + <structfield>unique1</structfield> from <classname>pg_statistics</classname> > + - we can follow this by using the simpler <classname>pg_stats</classname> > + view: > + > +<programlisting> > +SELECT histogram_bounds FROM pg_stats > +WHERE tablename='tenk1' AND attname='unique1'; > + > + histogram_bounds > +------------------------------------------------------ > + {1,970,1943,2958,3971,5069,6028,7007,7919,8982,9995} > +</programlisting> > + > + Next the fraction of the histogram occupied by <quote>< 1000</quote> > + is worked out. This is the selectivity. The histogram divides the range > + into equal frequency buckets, so all we have to do is locate the bucket > + that our value is in and count <emphasis>part</emphasis> of it and > + <emphasis>all</emphasis> of the ones before. The value 1000 is clearly in > + the second (970 - 1943) bucket, so by assuming a linear distribution of > + values inside each bucket we can calculate the selectivity as: > + > +<programlisting> > +selectivity = (1 + (1000 - bckt[2].min)/(bckt[2].max - bckt[2].min))/num_bckts > + = (1 + (1000 - 970)/(1943 - 970))/10 > + = 0.1031 > +</programlisting> > + > + that is, one whole bucket plus a linear fraction of the second, divided by > + the number of buckets. The estimated number of rows can now be calculated as > + the product of the selectivity and the cardinality of > + <classname>tenk1</classname>: > + > +<programlisting> > +rows = rel_cardinality * selectivity > + = 10000 * 0.1031 > + = 1031 > +</programlisting> > + > + </para> > + > + <para> > + Next let's consider an example with equality condition in its > + <literal>WHERE</literal> clause: > + > +<programlisting> > +EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'ATAAAA'; > + > + QUERY PLAN > +---------------------------------------------------------- > + Seq Scan on tenk1 (cost=0.00..470.00 rows=31 width=244) > + Filter: (stringu1 = 'ATAAAA'::name) > +</programlisting> > + > + Again the planner examines the <literal>WHERE</literal> clause condition: > + > +<programlisting> > +stringu1 = 'ATAAAA' > +</programlisting> > + > + and looks up the restriction function for <literal>=</literal>, which is > + <function>eqsel</function>. This case is a bit different, as the most > + common values — <acronym>MCV</acronym>s, are used to determine the > + selectivity. Let's have a look at these, with some extra columns that will > + be useful later: > + > +<programlisting> > +SELECT null_frac, n_distinct, most_common_vals, most_common_freqs FROM pg_stats > +WHERE tablename='tenk1' AND attname='stringu1'; > + > +null_frac | 0 > +n_distinct | 672 > +most_common_vals | {FDAAAA,NHAAAA,ATAAAA,BGAAAA,EBAAAA,MOAAAA,NDAAAA,OWAAAA,BHAAAA,BJAAAA} > +most_common_freqs | {0.00333333,0.00333333,0.003,0.003,0.003,0.003,0.003,0.003,0.00266667,0.00266667} > +</programlisting> > + > + The selectivity is merely the most common frequency (<acronym>MCF</acronym>) > + corresponding to the third <acronym>MCV</acronym> — 'ATAAAA': > + > +<programlisting> > +selectivity = mcf[3] > + = 0.003 > +</programlisting> > + > + The estimated number of rows is just the product of this with the > + cardinality of <classname>tenk1</classname> as before: > + > +<programlisting> > +rows = 10000 * 0.003 > + = 30 > +</programlisting> > + > + The number displayed by <command>EXPLAIN</command> is one more than this, > + due to some post estimation checks. > + </para> > + > + <para> > + Now consider the same query, but with a constant that is not in the > + <acronym>MCV</acronym> list: > + > +<programlisting> > +EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'xxx'; > + > + QUERY PLAN > +---------------------------------------------------------- > + Seq Scan on tenk1 (cost=0.00..470.00 rows=15 width=244) > + Filter: (stringu1 = 'xxx'::name) > +</programlisting> > + > + This is quite a different problem, how to estimate the selectivity when the > + value is <emphasis>not</emphasis> in the <acronym>MCV</acronym> list. > + The approach is to use the fact that the value is not in the list, > + combined with the knowledge of the frequencies for all of the > + <acronym>MCV</acronym>s: > + > +<programlisting> > +selectivity = (1 - sum(mvf))/(num_distinct - num_mcv) > + = (1 - (0.00333333 + 0.00333333 + 0.003 + 0.003 + 0.003 > + + 0.003 + 0.003 + 0.003 + 0.00266667 + 0.00266667))/(672 - 10) > + = 0.001465 > +</programlisting> > + > + That is, add up all the frequencies for the <acronym>MCV</acronym>s and > + subtract them from one — because it is <emphasis>not</emphasis> one > + of these, and divide by the <emphasis>remaining</emphasis> distinct values. > + Notice that there are no null values so we don't have to worry about those. > + The estimated number of rows is calculated as usual: > + > +<programlisting> > +rows = 10000 * 0.001465 > + = 15 > +</programlisting> > + > + </para> > + > + <para> > + Let's increase the complexity to consider a case with more than one > + condition in the <literal>WHERE</literal> clause: > + > +<programlisting> > +EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 1000 AND stringu1 = 'xxx'; > + > + QUERY PLAN > +----------------------------------------------------------- > + Seq Scan on tenk1 (cost=0.00..495.00 rows=2 width=244) > + Filter: ((unique1 < 1000) AND (stringu1 = 'xxx'::name)) > +</programlisting> > + > + An assumption of independence is made and the selectivities of the > + individual restrictions are multiplied together: > + > +<programlisting> > +selectivity = selectivity(unique1 < 1000) * selectivity(stringu1 = 'xxx') > + = 0.1031 * 0.001465 > + = 0.00015104 > +</programlisting> > + > + The row estimates are calculated as before: > + > +<programlisting> > +rows = 10000 * 0.00015104 > + = 2 > +</programlisting> > + </para> > + > + <para> > + Finally we will examine a query that includes a <literal>JOIN</literal> > + together with a <literal>WHERE</literal> clause: > + > +<programlisting> > +EXPLAIN SELECT * FROM tenk1 t1, tenk2 t2 > +WHERE t1.unique1 < 50 AND t1.unique2 = t2.unique2; > + > + QUERY PLAN > +----------------------------------------------------------------------------------------- > + Nested Loop (cost=0.00..346.90 rows=51 width=488) > + -> Index Scan using tenk1_unique1 on tenk1 t1 (cost=0.00..192.57 rows=51 width=244) > + Index Cond: (unique1 < 50) > + -> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.00..3.01 rows=1 width=244) > + Index Cond: ("outer".unique2 = t2.unique2) > +</programlisting> > + > + The restriction on <classname>tenk1</classname> > + <quote>unique1 < 50</quote> is evaluated before the nested-loop join. > + This is handled analogously to the previous range example. The restriction > + operator for <literal><</literal> is <function>scalarlteqsel</function> > + as before, but this time the value 50 is in the first bucket of the > + <structfield>unique1</structfield> histogram: > + > +<programlisting> > +selectivity = (0 + (50 - bckt[1].min)/(bckt[1].max - bckt[1].min))/num_bckts > + = (0 + (50 - 1)/(970 - 1))/10 > + = 0.005057 > + > +rows = 10000 * 0.005057 > + = 51 > +</programlisting> > + > + The restriction for the join is: > + > +<programlisting> > +t2.unique2 = t1.unique2 > +</programlisting> > + > + This is due to the join method being nested-loop, with > + <classname>tenk1</classname> being in the outer loop. The operator is just > + our familiar <literal>=<literal>, however the restriction function is > + obtained from the <structfield>oprjoin</structfield> column of > + <classname>pg_operator</classname> - and is <function>eqjoinsel</function>. > + Additionally we use the statistical information for both > + <classname>tenk2</classname> and <classname>tenk1</classname>: > + > +<programlisting> > +SELECT tablename, null_frac,n_distinct, most_common_vals FROM pg_stats > +WHERE tablename IN ('tenk1', 'tenk2') AND attname='unique2'; > + > +tablename | null_frac | n_distinct | most_common_vals > +-----------+-----------+------------+------------------ > + tenk1 | 0 | -1 | > + tenk2 | 0 | -1 | > +</programlisting> > + > + In this case there is no <acronym>MCV</acronym> information for > + <structfield>unique2</structfield> because all the values appear to be > + unique, so we can use an algorithm that relies only on the number of > + distinct values for both relations together with their null fractions: > + > +<programlisting> > +selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1, 1/num_distinct2) > + = (1 - 0) * (1 - 0) * min(1/10000, 1/1000) > + = 0.0001 > +</programlisting> > + > + This is, subtract the null fraction from one for each of the relations, > + and divide by the maximum of the two distinct values. The number of rows > + that the join is likely to emit is calculated as the cardinality of > + cartesian product of the two nodes in the nested-loop, multiplied by the > + selectivity: > + > +<programlisting> > +rows = (outer_cardinality * inner_cardinality) * selectivity > + = (51 * 10000) * 0.0001 > + = 51 > +</programlisting> > + </para> > + > + <para> > + For those interested in further details, estimation of the number of rows in > + a relation is covered in > + <filename>src/backend/optimizer/util/plancat.c</filename>. The calculation > + logic for clause selectivities is in > + <filename>src/backend/optimizer/path/clausesel.c</filename>. The actual > + implementations of the operator and join restriction functions can be found > + in <filename>src/backend/utils/adt/selfuncs.c</filename>. > + </para> > + > + </sect1> > + > + > +</chapter> > + > +<!-- Keep this comment at the end of the file > +Local variables: > +mode:sgml > +sgml-omittag:nil > +sgml-shorttag:t > +sgml-minimize-attributes:nil > +sgml-always-quote-attributes:t > +sgml-indent-step:1 > +sgml-indent-data:t > +sgml-parent-document:nil > +sgml-default-dtd-file:"./reference.ced" > +sgml-exposed-tags:nil > +sgml-local-catalogs:("/usr/lib/sgml/catalog") > +sgml-local-ecat-files:nil > +End: > +--> > diff -Naur sgml.orig/postgres.sgml sgml/postgres.sgml > --- sgml.orig/postgres.sgml Mon Feb 14 14:53:24 2005 > +++ sgml/postgres.sgml Tue Feb 15 09:51:31 2005 > @@ -239,6 +239,7 @@ > &gist; > &storage; > &bki; > + &howplanstats; > > </part> > > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
В списке pgsql-docs по дате отправления:
Предыдущее
От: Bruce MomjianДата:
Сообщение: Update to initdb locale/encoding documentation description