Обсуждение: Re: [HACKERS] Index not used on simple select

Поиск
Список
Период
Сортировка

Re: [HACKERS] Index not used on simple select

От
Tom Lane
Дата:
(Note to hackers: Ole sent me a 1000-row test case off list.)

> oletest=> explain select * from av_parts where partnumber = '123456';
> NOTICE:  QUERY PLAN:
> 
> Index Scan using av_parts_partnumber_index on av_parts  (cost=2.04 rows=1
> width=124)
> 
> EXPLAIN
> oletest=> explain select * from av_parts where nsn = '123456';
> NOTICE:  QUERY PLAN:
> 
> Seq Scan on av_parts  (cost=48.00 rows=995 width=124)

OK, I confirm seeing this behavior.  I don't have time to dig into
the code right now, but will do so when I get a chance.

It looks like the highly skewed distribution of nsn values (what you
sent me had 997 '' entries, only 3 non-empty strings) is confusing the
selectivity estimation code somehow, such that the system thinks that
the query is going to match most of the rows.  Notice it is estimating
995 returned rows for the nsn select!  Under these circumstances it will
prefer a sequential scan, since the more-expensive-per-tuple index scan
doesn't look like it will be able to avoid reading most of the table.
That logic is OK, it's the 0.995 selectivity estimate that's wrong...

Exactly why the selectivity estimate is so ludicrous remains to
be seen, but I know that there are some bogosities in that code
(search the pghackers archives for "selectivity" for more info).
I am hoping to do some extensive revisions of the selectivity code
for 6.6 or 6.7.  This particular problem might be easily fixable,
or it might have to wait for the rewrite.

Thanks for the test case!
        regards, tom lane


Re: [HACKERS] Index not used on simple select

От
Ole Gjerde
Дата:
On Fri, 23 Jul 1999, Tom Lane wrote:
> It looks like the highly skewed distribution of nsn values (what you
> sent me had 997 '' entries, only 3 non-empty strings) is confusing the

As a note, it doesn't seem to matter wether the field has '' or NULL.
Even after I do a update to set all all rows with '' to NULL, it still
does the same thing.

Also, my full set of data is not quite so skewed.  The nsn field has about
450,000 non-empty rows in it.

Thanks,
Ole Gjerde




UPDATE memory exhaustion

От
Ole Gjerde
Дата:
Hey,
I was having problems with UPDATE, so I looked through the archives.  Back
around the 20th of May, there was a thread about update using all memory
(thread: strange behavior of UPDATE).

It now looks that I am having that same problem on pg 6.5.1.
Basically I tried running a simple query:update av_parts set nsn = 'xxxxx' where nsn = '';

And postgres started chugging along.  After a while(not sure how long) it
was using all memory on the computer.

The box has 82MB of memory and 128 MB of swap.
The query is trying to update 3.5 million rows.

I would try to gdb to the process and see where it's spending its time,
unfortunately that box is pretty much dead until I reboot it.  I'll try to
do it again later with a ulimit so I can actually log into the box :)

Thanks,
Ole Gjerde



Selectivity of "=" (Re: [HACKERS] Index not used on simple select)

От
Tom Lane
Дата:
I wrote
> (Note to hackers: Ole sent me a 1000-row test case off list.)
>> oletest=> explain select * from av_parts where partnumber = '123456';
>> NOTICE:  QUERY PLAN:
>> 
>> Index Scan using av_parts_partnumber_index on av_parts  (cost=2.04 rows=1
>> width=124)
>> 
>> EXPLAIN
>> oletest=> explain select * from av_parts where nsn = '123456';
>> NOTICE:  QUERY PLAN:
>> 
>> Seq Scan on av_parts  (cost=48.00 rows=995 width=124)

> It looks like the highly skewed distribution of nsn values (what you
> sent me had 997 '' entries, only 3 non-empty strings) is confusing the
> selectivity estimation code somehow, such that the system thinks that
> the query is going to match most of the rows.  Notice it is estimating
> 995 returned rows for the nsn select!  Under these circumstances it will
> prefer a sequential scan, since the more-expensive-per-tuple index scan
> doesn't look like it will be able to avoid reading most of the table.
> That logic is OK, it's the 0.995 selectivity estimate that's wrong...

It turns out that the selectivity estimate for an "=" comparison is just
the attdisbursion statistic calculated by VACUUM ANALYZE, which can be
roughly defined as the frequency of the most common value in the column.
(I took statistics too long ago to recall the exact definition.)
Anyway, given that the test data Ole sent me contains nearly all ''
entries, I'd say that the 0.995 value is about right for disbursion.

Indeed, if one were to do a "select * from av_parts where nsn = ''",
then sequential scan would be the most efficient way to do that.
The system has no clue that that's not really something you'd do much.

By using this estimate, the system is effectively assuming that the
frequency of occurrence of values in the table corresponds to the
probability that you will be searching for each particular value.
So, the selectivity that a search for the most common value would
have is a reasonable estimate for the selectivity of a search for any
value.  That's a bogus assumption in this case --- but it's hard to
justify making any other assumption in general.

My inclination is to hack up eqsel() to never return a selectivity
estimate larger than, say, 0.5, even when the measured disbursion
is more.  I am not sure that this is a good idea, however.  Comments?
        regards, tom lane


Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple select)

От
Bruce Momjian
Дата:
> > It looks like the highly skewed distribution of nsn values (what you
> > sent me had 997 '' entries, only 3 non-empty strings) is confusing the
> > selectivity estimation code somehow, such that the system thinks that
> > the query is going to match most of the rows.  Notice it is estimating
> > 995 returned rows for the nsn select!  Under these circumstances it will
> > prefer a sequential scan, since the more-expensive-per-tuple index scan
> > doesn't look like it will be able to avoid reading most of the table.
> > That logic is OK, it's the 0.995 selectivity estimate that's wrong...
> 
> It turns out that the selectivity estimate for an "=" comparison is just
> the attdisbursion statistic calculated by VACUUM ANALYZE, which can be
> roughly defined as the frequency of the most common value in the column.
> (I took statistics too long ago to recall the exact definition.)
> Anyway, given that the test data Ole sent me contains nearly all ''
> entries, I'd say that the 0.995 value is about right for disbursion.

Yes, you are correct, though it does look at potentially one or two
other unique values, depending on the distribution.  It basically
perfectly computes disbursion for unique columns, and columns that
contain only two unique values, and it figures in NULL.  In other cases,
the disbursion is imperfect, but pretty decent.

> My inclination is to hack up eqsel() to never return a selectivity
> estimate larger than, say, 0.5, even when the measured disbursion
> is more.  I am not sure that this is a good idea, however.  Comments?

I would discourage this.  I can imagine many cases there >0.5
selectivites would be valid, i.e. state = "PA".

--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple select)

От
Hannu Krosing
Дата:
Tom Lane wrote:
> It turns out that the selectivity estimate for an "=" comparison is
just
> the attdisbursion statistic calculated by VACUUM ANALYZE, which can be
> roughly defined as the frequency of the most common value in the column.
> (I took statistics too long ago to recall the exact definition.)
> Anyway, given that the test data Ole sent me contains nearly all ''
> entries, I'd say that the 0.995 value is about right for disbursion.
> 
> Indeed, if one were to do a "select * from av_parts where nsn = ''",
> then sequential scan would be the most efficient way to do that.
> The system has no clue that that's not really something you'd do much.

Does the system currently index NULLs as well ?

I suspect supporting partial indexes (initially just non-NULLs) would 
let us have much better and also use indexes intelligently for
mostly-NULL 
columns.

Perhaps a line like 

* Add partial index support

would fit in TODO

-----------------
Hannu


Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple select)

От
Bruce Momjian
Дата:
> Tom Lane wrote:
> > 
>  It turns out that the selectivity estimate for an "=" comparison is
> just
> > the attdisbursion statistic calculated by VACUUM ANALYZE, which can be
> > roughly defined as the frequency of the most common value in the column.
> > (I took statistics too long ago to recall the exact definition.)
> > Anyway, given that the test data Ole sent me contains nearly all ''
> > entries, I'd say that the 0.995 value is about right for disbursion.
> > 
> > Indeed, if one were to do a "select * from av_parts where nsn = ''",
> > then sequential scan would be the most efficient way to do that.
> > The system has no clue that that's not really something you'd do much.
> 
> Does the system currently index NULLs as well ?
> 
> I suspect supporting partial indexes (initially just non-NULLs) would 
> let us have much better and also use indexes intelligently for
> mostly-NULL 
> columns.
> 
> Perhaps a line like 
> 
> * Add partial index support
> 
> would fit in TODO
> 
> -----------------
> Hannu
> 
> 


Yes, I think we index nulls.  What are partial indexes?

--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple select)

От
Thomas Lockhart
Дата:
> Yes, I think we index nulls.  What are partial indexes?
 http://www.PostgreSQL.ORG/docs/postgres/partial-index.htm

Postgres had support for this at one time, and probably still does.
                   - Thomas

-- 
Thomas Lockhart                lockhart@alumni.caltech.edu
South Pasadena, California


Re: Selectivity of "=" (Re: [HACKERS] Index not used on simple select)

От
Bruce Momjian
Дата:
> > Yes, I think we index nulls.  What are partial indexes?
> 
>   http://www.PostgreSQL.ORG/docs/postgres/partial-index.htm
> 
> Postgres had support for this at one time, and probably still does.
> 

Wow, that's really nice writing.  I think Tom Lane and I ripped that
stuff out of the optimizer.  (Only kidding.)

Not sure if any of it works.

--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026