Thread: Bug #624: case sensitivity in pg_dumpall


Bug #624: case sensitivity in pg_dumpall

From:
pgsql-bugs@postgresql.org
Date:
Volker Klemann (volkerk@gfz-potsdam.de) reports a bug with a severity of 2.
(The lower the number, the more severe it is.)

Short Description
case sensitivity in pg_dumpall

Long Description
Dear people from postgres,
I don't know if I am the first to hit this bug:
While updating from 7.1.2 to 7.2 I used pg_dumpall to dump
my data, as recommended by the INSTALL manual.
When re-installing the database using
psql -d template1 -f dumpfile
I got the error
    psql:/home/volkerk/postgres/backups/backup22.03.02:46:     
    \connect: FATAL 1:  Database "rsl" does not exist in the system catalog.
The corresponding lines in the script produced by pg_dumpall are:
45    CREATE DATABASE "RSL" WITH TEMPLATE = template0;
46    \connect RSL volkerk
So, while pg_dumpall preserves upper and lower case in line 45 by quoting
the name, it misses the quotes in line 46 and tries to connect to rsl
instead of RSL.
It took me some time to find, because at first I was searching for
inconsistencies in the new installation.
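
A hand-edit of the dump that quotes the name in the \connect line the same
way would presumably work around it until a fixed pg_dumpall is available,
i.e. something like:

    CREATE DATABASE "RSL" WITH TEMPLATE = template0;
    \connect "RSL" volkerk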

Best wishes,
Volker Klemann

  


Sample Code


No file was uploaded with this report

Re: Bug #624: case sensitivity in pg_dumpall

From:
Tom Lane
Date:
pgsql-bugs@postgresql.org writes:
> case sensitivity in pg_dumpall

This is fixed in 7.2.1.

            regards, tom lane

Inconsistent use of index.

From:
Ron Mayer
Date:
I am porting a pretty large (tens of millions of records) data warehouse
from Oracle to PostgreSQL.

  Once some of my tables started getting pretty large, PostgreSQL
suddenly stopped using indexes when I used expressions like "col = value",
decreasing performance by 20X.  This meant that my daily reports started
taking two days instead of two hours to run!


  Interestingly, when I rewrite the queries using >= and <= to produce
identical results, the index works fine.  Example queries include:

   select count(*) from fact where dat='2002-03-01';
   select count(*) from fact where dat<='2002-03-01' and dat>='2002-03-01';

The distribution of values in "dat" is roughly even from
'2002-01-01' through '2002-03-25'.

Attached below are

 A: Information about the table, including "\d" and "vacuum verbose analyze"
 B: Output of "explain analyze" from the above queries (showing the 20X
    slowdown)
 C: Version and configuration information.

Any suggestions on what I should look at next would be appreciated.

   Thanks much,
   Ron

PS: As a rather perverse workaround, I rewrote all my queries to have
   "col<=val and col>=val" everywhere I used to have "col=val",
   and everything is running fine again... but that's just weird.





============================================================================
== A: Information about the table
============================================================================
logs2=# \d fact
                Table "fact"
 Column |          Type          | Modifiers
--------+------------------------+-----------
 dat    | date                   |
 tim    | time without time zone |
 ip_id  | integer                |
 bid_id | integer                |
 req_id | integer                |
 ref_id | integer                |
Indexes: i_fact__bid_id,
         i_fact__dat,
         i_fact__ref_id,
         i_fact__req_id,
         i_fact__tim

logs2=# select count(*) from fact;
  count
----------
 18410778
(1 row)

logs2=# vacuum verbose analyze fact;
NOTICE:  --Relation fact--
NOTICE:  Pages 144967: Changed 0, Empty 0; Tup 18410778: Vac 0, Keep 0, UnUsed 0.
        Total CPU 11.56s/2.97u sec elapsed 71.91 sec.
NOTICE:  Analyzing fact
VACUUM


============================================================================
== B: Explain Analyze for the two queries.
==     Note that the <=, >= one was over 20X faster.
============================================================================

logs2=# explain analyze select count(*) from fact where dat='2002-03-01';
NOTICE:  QUERY PLAN:

Aggregate  (cost=375631.14..375631.14 rows=1 width=0) (actual
time=76689.42..76689.42 rows=1 loops=1)
  ->  Seq Scan on fact  (cost=0.00..375101.72 rows=211765 width=0) (actual
time=20330.96..76391.94 rows=180295 loops=1)
Total runtime: 76707.92 msec

EXPLAIN
logs2=# explain analyze select count(*) from fact where dat<='2002-03-01' and
                dat >='2002-03-01';
NOTICE:  QUERY PLAN:

Aggregate  (cost=5.98..5.98 rows=1 width=0) (actual time=2921.39..2921.39
rows=1 loops=1)
  ->  Index Scan using i_fact__dat on fact  (cost=0.00..5.98 rows=1 width=0)
(actual time=73.55..2583.53 rows=180295 loops=1)
Total runtime: 2921.55 msec

EXPLAIN
logs2=#





============================================================================
== C: Version and configuration information.
============================================================================

[17]localhost:~/apps/pgsql% psql -V
psql (PostgreSQL) 7.2
contains support for: readline, history
Portions Copyright (c) 1996-2001, PostgreSQL Global Development Group
Portions Copyright (c) 1996, Regents of the University of California
Read the file COPYRIGHT or use the command \copyright to see the
usage and distribution terms.

[17]localhost:/scratch/pgsql/data% diff -wiu postgresql.conf postgresql.conf.bak
--- postgresql.conf     Sat Mar 23 15:39:34 2002
+++ postgresql.conf.bak Tue Mar  5 19:33:54 2002
@@ -50,7 +50,7 @@
 #shared_buffers  = 10000        # 2*max_connections, min 16
 ##  goes to about 84 meg with 4000.
 #shared_buffers  = 4000        # 2*max_connections, min 16
-shared_buffers   = 10000        # 2*max_connections, min 16
+shared_buffers   = 8000        # 2*max_connections, min 16

 #max_fsm_relations = 100    # min 10, fsm is free space map
 #max_fsm_pages = 10000      # min 1000, fsm is free space map




--
   Ronald Mayer
   Director of Web Business
   InterVideo, Inc.

Re: Inconsistent use of index.

From:
Tom Lane
Date:
Ron Mayer <ron@intervideo.com> writes:
>   Once some of my tables started getting pretty large, PostgreSQL
> suddenly stopped using indexes when I used expressions like "col = value",
> decreasing performance by 20X.

Hmm.  The EXPLAIN shows that the planner is not doing too badly at
estimating the number of rows involved:

> logs2=# explain analyze select count(*) from fact where dat='2002-03-01';
> NOTICE:  QUERY PLAN:

> Aggregate  (cost=375631.14..375631.14 rows=1 width=0) (actual
> time=76689.42..76689.42 rows=1 loops=1)
>   ->  Seq Scan on fact  (cost=0.00..375101.72 rows=211765 width=0) (actual
> time=20330.96..76391.94 rows=180295 loops=1)
> Total runtime: 76707.92 msec

212K estimate for 180K real is not bad at all.  So the problem is in the
cost models, not the initial row count estimation.

If you force an indexscan via "set enable_seqscan to off", what does
EXPLAIN ANALYZE report?

Also, what do you get from
    select * from pg_stats where tablename = 'fact';
I'm particularly interested in the correlation estimate for the dat
column.  (Would you happen to have an idea whether the data has been
inserted more-or-less in dat order?)
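
Concretely, the whole check would look about like this (using the fact
table and dat column from your report, and narrowing pg_stats to just the
column of interest; remember to turn seqscans back on afterwards):

    set enable_seqscan to off;
    explain analyze select count(*) from fact where dat = '2002-03-01';
    set enable_seqscan to on;

    select attname, n_distinct, correlation
      from pg_stats
     where tablename = 'fact' and attname = 'dat';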

            regards, tom lane

Re: Inconsistent use of index.

From:
Ron Mayer
Date:
On Tue, 26 Mar 2002, Tom Lane wrote:
>
> Ron Mayer <ron@intervideo.com> writes:
> > [...] pretty large, PostgreSQL suddenly stopped using indexes [...]
> [...]
>
> 212K estimate for 180K real is not bad at all.  So the problem is in the
> cost models not the initial row count estimation.
>
> If you force an indexscan via "set enable_seqscan to off", what does
> EXPLAIN ANALYZE report?

It then uses the index:

===================================================================
== logs2=# set enable_seqscan to off;
== SET VARIABLE
== logs2=# explain analyze select count(*) from fact where dat='2002-03-01';
==
== NOTICE:  QUERY PLAN:
==
== Aggregate  (cost=840488.03..840488.03 rows=1 width=0) (actual
== time=2753.82..2753.82 rows=1 loops=1)
==   ->  Index Scan using i_fact__dat on fact  (cost=0.00..839957.59 rows=212174
== width=0) (actual time=101.25..2434.00 rows=180295 loops=1)
== Total runtime: 2754.24 msec
===================================================================


> Also, what do you get from
>     select * from pg_stats where tablename = 'fact';
> I'm particularly interested in the correlation estimate for the dat
> column.  (Would you happen to have an idea whether the data has been
> inserted more-or-less in dat order?)

I've attached that output as an attachment.

I believe much of February was loaded first, then we back-filled January,
and daily I've been adding March's results.  I don't believe the index usage
stopped when we did the January fill... something happened a few days ago after
a pretty routine daily load.




Oh... one more interesting thing...

There are a couple of big exceptions to the even distribution of the data.

Almost every day has between 190000 and 270000 records, except '2002-03-08',
which has 404293 records, and '2002-03-25', which has only 6 records.

For that particular day, the "<= ... >=" trick doesn't work either.

===================================================================
==logs2=# explain select count(*) from fact where dat<='2002-03-08' and
==dat>='2002-03-08';
==NOTICE:  QUERY PLAN:
==
==Aggregate  (cost=422125.92..422125.92 rows=1 width=0)
==  ->  Seq Scan on fact  (cost=0.00..421128.67 rows=398900 width=0)
==
==EXPLAIN
==logs2=#
==logs2=# explain select count(*) from fact where dat<='2002-03-07' and
==dat>='2002-03-07';
==NOTICE:  QUERY PLAN:
==
==Aggregate  (cost=6.00..6.00 rows=1 width=0)
==  ->  Index Scan using i_fact__dat on fact  (cost=0.00..5.99 rows=1 width=0)
==
==EXPLAIN
===================================================================

I also believe that may have been the day when the index stopped
working for "=" for all dates.

  Ron

Re: Inconsistant use of index.

From:
"Michael G. Martin"
Date:
I had an issue where my index was not always used on a very large table.
The issue came down to the data distribution and not pulling in enough of
a random sample to get an accurate estimate (I think the default maximum
was around 3000 sample rows: 300 * the default statistics target of 10 --
see analyze.c).  I fixed the issue by following Tom's advice and increasing
the statistics target on my table so that it pulls in 300000 rows (300 *
1000).  I had to play with the value, re-analyze, and check the stats in
the pg_stats table until most_common_freqs on some values were all fairly
close.  The explain plan still shows a cost and row count way above what
is physically in the table, but at least my indexes are being used.

alter table table_name alter column_name set statistics 1000;
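
For example, applied to Ron's fact table (assuming dat is the column the
planner is mis-estimating), the sequence would be roughly:

    alter table fact alter dat set statistics 1000;  -- raise the per-column sample target
    analyze fact;                                    -- re-gather statistics with the new target
    select n_distinct, most_common_freqs, correlation
      from pg_stats
     where tablename = 'fact' and attname = 'dat';   -- check that the stats look sane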

--Michael


Tom Lane wrote:

>Ron Mayer <ron@intervideo.com> writes:
>
>>  Once some of my tables started getting pretty large, PostgreSQL
>>suddenly stopped using indexes when I used expressions like "col = value",
>>decreasing performance by 20X.
>>
>
>Hmm.  The EXPLAIN shows that the planner is not doing too badly at
>estimating the number of rows involved:
>
>>logs2=# explain analyze select count(*) from fact where dat='2002-03-01';
>>NOTICE:  QUERY PLAN:
>>
>
>>Aggregate  (cost=375631.14..375631.14 rows=1 width=0) (actual
>>time=76689.42..76689.42 rows=1 loops=1)
>>  ->  Seq Scan on fact  (cost=0.00..375101.72 rows=211765 width=0) (actual
>>time=20330.96..76391.94 rows=180295 loops=1)
>>Total runtime: 76707.92 msec
>>
>
>212K estimate for 180K real is not bad at all.  So the problem is in the
>cost models not the initial row count estimation.
>
>If you force an indexscan via "set enable_seqscan to off", what does
>EXPLAIN ANALYZE report?
>
>Also, what do you get from
>    select * from pg_stats where tablename = 'fact';
>I'm particularly interested in the correlation estimate for the dat
>column.  (Would you happen to have an idea whether the data has been
>inserted more-or-less in dat order?)
>
>            regards, tom lane
>

Re: Inconsistent use of index.

From:
Tom Lane
Date:
Ron Mayer <ron@intervideo.com> writes:
>> I'm particularly interested in the correlation estimate for the dat
>> column.  (Would you happen to have an idea whether the data has been
>> inserted more-or-less in dat order?)

> I believe much of February was loaded first, then we back-filled January,
> and daily I've been adding March's results.  I don't believe the index usage
> stopped when we did the January fill... something happened a few days ago after
> a pretty routine daily load.

The correlation estimate for dat is pretty low (0.086088), which I think
reflects the fact that on a macro level your data is not very orderly
(Feb/Jan/Mar).  However, if it's been loaded on a daily basis then the
records for any single day will be together --- which is why the indexed
probe for a single day is so fast.  I don't see any way that we can
expect the system to model this effect with only one ordering-correlation
number :-( ... so a proper fix will have to wait for some future release
when we can think about having more extensive stats about ordering.

In the meantime, it would be interesting to see if re-establishing the
big-picture order correlation would persuade the planner to do the right
thing.  Could you do something like this:

    CREATE TABLE foo AS SELECT * FROM fact ORDER BY dat;
    TRUNCATE TABLE fact;
    INSERT INTO fact SELECT * FROM foo;
    DROP TABLE foo;
    VACUUM ANALYZE fact;

(this should leave you in a state where pg_stats shows correlation 1.0
for fact.dat) and then see what you get from EXPLAIN?

            regards, tom lane

Re: Inconsistent use of index.

From:
Ron Mayer
Date:
First off, thanks to everyone on the list who suggested useful workarounds -
with those in place my application is working wonderfully again.

Anyway, here's some more information about the "=" vs. "<= and >=" question
I had earlier today...


On Tue, 26 Mar 2002, Tom Lane wrote:
>
> Ron Mayer <ron@intervideo.com> writes:
> >> I'm particularly interested in the correlation estimate for the dat
> >> column.  [...]
> >
> > [...]
>
> The correlation estimate for dat is pretty low (0.086088), which I think
> reflects the fact that on a macro level your data is not very orderly
> [...]
>
> In the meantime, it would be interesting to see if re-establishing the
> big-picture order correlation would persuade the planner to do the right
> thing. [...]
> (this should leave you in a state where pg_stats shows correlation 1.0
> for fact.dat) and then see what you get from EXPLAIN?

Correlation is 1.0, but the optimizer still does not want to use
the index.

I tried two different extreme attempts.... one with the optimal
ordering suggested above, and one with an exceptionally poor ordering
(sorted by time of the day ... so that every day probably appears in
every possible block).  As expected, pg_stats shows the good ordering
has a correlation of "1.0", and the poor ordering has a correlation
of "-0.00133352".


============================================================
= logs2=# CREATE TABLE fact_by_dat AS SELECT * FROM fact ORDER BY dat;
= SELECT
= logs2=# CREATE TABLE fact_by_tim AS SELECT * FROM fact ORDER BY tim;
= SELECT
= logs2=# CREATE INDEX fact_by_dat__dat ON fact_by_dat(dat);
= CREATE
= logs2=# CREATE INDEX fact_by_tim__dat ON fact_by_tim(dat);
= CREATE
= logs2=# vacuum analyze fact_by_dat;
= VACUUM
= logs2=# vacuum analyze fact_by_tim;
= VACUUM
= logs2=#  explain analyze select count(*) from fact_by_dat where dat='2002-03-01';
= NOTICE:  QUERY PLAN:
= Aggregate  (cost=380347.31..380347.31 rows=1 width=0) (actual time=77785.14..77785.14 rows=1 loops=1)
=   ->  Seq Scan on fact  (cost=0.00..379816.25 rows=212423 width=0) (actual time=20486.16..77420.05 rows=180295 loops=1)
= Total runtime: 77785.28 msec
= EXPLAIN
= logs2=#  explain analyze select count(*) from fact_by_tim where dat='2002-03-01';
= NOTICE:  QUERY PLAN:
= Aggregate  (cost=380341.09..380341.09 rows=1 width=0) (actual time=79308.22..79308.22 rows=1 loops=1)
=   ->  Seq Scan on fact_by_tim  (cost=0.00..379816.25 rows=209934 width=0) (actual time=24.35..78929.68 rows=180295 loops=1)
= Total runtime: 79308.35 msec
= EXPLAIN
= logs2=# select correlation from pg_stats where tablename='fact_by_dat' and attname='dat';
=  correlation
= -------------
=            1
= (1 row)
=
= logs2=# select correlation from pg_stats where tablename='fact_by_tim' and attname='dat';
=  correlation
= -------------
=  -0.00133352
= (1 row)
=
============================================================

In neither case did it use the index.  However, as shown below, in the case
where the table was ordered by date the index would have helped a huge amount,
while in the case where it was ordered by time using the index would have
hurt a huge amount.

============================================================
= logs2=# set enable_seqscan to off;
= SET VARIABLE
= logs2=# explain analyze select count(*) from fact_by_dat where dat='2002-03-01';
= NOTICE:  QUERY PLAN:
=
= Aggregate  (cost=5502.57..5502.57 rows=1 width=0) (actual time=2671.20..2671.20 rows=1 loops=1)
=   ->  Index Scan using fact_by_dat__dat on fact_by_dat  (cost=0.00..4974.99 rows=211036 width=0) (actual time=90.24..2339.64 rows=180295 loops=1)
= Total runtime: 2693.87 msec
=
= EXPLAIN
= logs2=#  explain analyze select count(*) from fact_by_tim where
= dat='2002-03-01';
= NOTICE:  QUERY PLAN:
=
= Aggregate  (cost=837849.27..837849.27 rows=1 width=0) (actual time=410705.02..410705.02 rows=1 loops=1)
=   ->  Index Scan using fact_by_tim__dat on fact_by_tim  (cost=0.00..837324.43 rows=209934 width=0) (actual time=56.14..410271.50 rows=180295 loops=1)
= Total runtime: 410705.17 msec
=
= EXPLAIN
= logs2=#
=====================================================================


So with the ideally ordered table the index would have helped by a
factor of 30 (2.7 seconds vs. 77 seconds)... but with the bad
ordering it hurt by a factor of 5 (411 seconds vs. 79 seconds).

Very interesting...

Just for my own education, could you bear with me for a few
questions from a relative novice...

   *) Should the optimizer choose a plan that uses the index
      if the correlation is high enough?

   *) Instead of the overall correlation across the whole table,
      would a better metric be the average correlation for data
      within each page?   Then it could recognize that while I
      had a low overall correlation because I loaded Feb, then
      Jan, then Mar ... within each block the correlation was good.

   *) If the optimizer sees one alternative that may take
      anywhere from 3 to 400 seconds, and another that will take
      a very consistent 80 seconds, is the first better because
      it could result in a 30x speedup vs. only a 5x slowdown, or
      is the second better, because the first can be at most 77
      seconds faster but as much as 320 seconds slower?

      Or do I simply not understand optimizers at all. :-)

   *) Any reason why "a=b" and "a<=b and a>=b" would act
      differently?  I'm not complaining, because it seems
      like a convenient way to give the optimizer hint that
      helps in my case :-), but it seems surprising.

Anyway, as I said before, thanks for all the workarounds that
got me up and running again!  I'd be glad to keep looking and
can provide more information about these queries if anyone
thinks more experimentation would be helpful.

  Thanks,
  Ron

Re: Inconsistent use of index.

From:
Ron Mayer
Date:
On Tue, 26 Mar 2002, Tom Lane wrote:
> Ron Mayer <ron@intervideo.com> writes:
> >> I'm particularly interested in the correlation estimate for the dat
> >> column.  (Would you happen to have an idea whether the data has been
> >> inserted more-or-less in dat order?)
>
> > I believe much of February was loaded first, then we back-filled January,
> > and daily I've been adding March's results.  I don't believe the index usage
> > stopped when we did the January fill... something happened a few days ago after
> > a pretty routine daily load.
>
> The correlation estimate for dat is pretty low (0.086088), which I think
> reflects the fact that on a macro level your data is not very orderly
> (Feb/Jan/Mar).  However, if it's been loaded on a daily basis then the
> records for any single day will be together --- which is why the indexed
> probe for a single day is so fast.  I don't see any way that we can
> expect the system to model this effect with only one ordering-correlation
> number :-( ... so a proper fix will have to wait for some future release
> when we can think about having more extensive stats about ordering.
>
> In the meantime, it would be interesting to see if re-establishing the
> big-picture order correlation would persuade the planner to do the right
> thing.  Could you do something like this:
>
>     CREATE TABLE foo AS SELECT * FROM fact ORDER BY dat;
>     TRUNCATE TABLE fact;
>     INSERT INTO fact SELECT * FROM foo;
>     DROP TABLE foo;
>     VACUUM ANALYZE fact;
>
> (this should leave you in a state where pg_stats shows correlation 1.0
> for fact.dat) and then see what you get from EXPLAIN?
>
>             regards, tom lane



I did quite a bit more playing with this, and no matter what the
correlation was (1, -0.001), it never seemed to have any effect
at all on the execution plan.

Should it?  With a high correlation the index scan is a much better choice.

   Ron

---
--- create the table with a correlation of "1".
---
logs2=# CREATE TABLE fact_by_dat AS SELECT * FROM fact ORDER BY dat;
SELECT
logs2=# CREATE INDEX fact_by_dat__dat ON fact_by_dat(dat);
CREATE
logs2=# vacuum analyze fact_by_dat;
VACUUM
logs2=# select correlation from pg_stats where tablename='fact_by_dat' and attname='dat';
 correlation
-------------
           1
(1 row)

---
--- Still does the "Seq Scan"
---
logs2=#  explain analyze select count(*) from fact_by_dat where dat='2002-03-01';
NOTICE:  QUERY PLAN:
Aggregate  (cost=380347.31..380347.31 rows=1 width=0) (actual time=77785.14..77785.14 rows=1 loops=1)
  ->  Seq Scan on fact  (cost=0.00..379816.25 rows=212423 width=0) (actual time=20486.16..77420.05 rows=180295 loops=1)
Total runtime: 77785.28 msec
EXPLAIN


---
--- Disable Seq Scan...  30 times faster.
---
logs2=# set enable_seqscan to off;
SET VARIABLE
logs2=# explain analyze select count(*) from fact_by_dat where dat='2002-03-01';
NOTICE:  QUERY PLAN:
Aggregate  (cost=5502.57..5502.57 rows=1 width=0) (actual time=2671.20..2671.20 rows=1 loops=1)
  ->  Index Scan using fact_by_dat__dat on fact_by_dat  (cost=0.00..4974.99 rows=211036 width=0)
(actual time=90.24..2339.64 rows=180295 loops=1)
Total runtime: 2693.87 msec

Re: Inconsistent use of index.

From:
Tom Lane
Date:
Ron Mayer <ron@intervideo.com> writes:
> I did quite a bit more playing with this, and no matter what the
> correlation was (1, -0.001), it never seemed to have any effect
> at all on the execution plan.

> Should it?  With a high correlation the index scan is a much better choice.

I'm confused.  Your examples show the planner correctly estimating the
indexscan as much cheaper than the seqscan.

> logs2=#  explain analyze select count(*) from fact_by_dat where dat='2002-03-01';
> NOTICE:  QUERY PLAN:
> Aggregate  (cost=380347.31..380347.31 rows=1 width=0) (actual time=77785.14..77785.14 rows=1 loops=1)
>   ->  Seq Scan on fact  (cost=0.00..379816.25 rows=212423 width=0) (actual time=20486.16..77420.05 rows=180295 loops=1)
> Total runtime: 77785.28 msec

Cut-and-paste mistake here somewhere, perhaps?  The plan refers to fact,
not fact_by_dat.

            regards, tom lane

Re: Inconsistent use of index.

From:
Ron Mayer
Date:
On Wed, 3 Apr 2002, Tom Lane wrote:
>
> I'm confused.  Your examples show the planner correctly estimating the
> indexscan as much cheaper than the seqscan.
>...
> Cut-and-paste mistake here somewhere, perhaps?  The plan refers to fact
> not fact_by_dat.

My apologies...  It was indeed doing the right thing on the table that was
ordered by date.

   Sorry for the wasted bandwidth.
   Ron