Understanding histograms

Поиск
Список
Период
Сортировка
От Len Shapiro
Тема Understanding histograms
Дата
Msg-id 4817FC00.50609@cs.pdx.edu
обсуждение исходный текст
Ответы Re: Understanding histograms
Список pgsql-performance
I hope I am posting to the right list.
I am running Postgresql 8.1.9 and don't understand the behavior of
histograms for data items not in the MVC list.  I teach databases and
want to use Postgres as an example.  I will appreciate any help that
anyone can provide.

Here is the data I am using. I am interested only in the "rank" attribute.

CREATE TABLE Sailors (
   sid Integer NOT NULL,
   sname varchar(20),
   rank integer,
   age real,
   PRIMARY KEY (sid));

I insert 30 sailor rows:

INSERT INTO Sailors VALUES (3, 'Andrew', 10,  30.0);
INSERT INTO Sailors VALUES (17, 'Bart',  5,  30.2);
INSERT INTO Sailors VALUES (29, 'Beth',    3,  30.4);
INSERT INTO Sailors VALUES (28, 'Bryant',    3,  30.6);
INSERT INTO Sailors VALUES (4, 'Cynthia',  9,  30.8);
INSERT INTO Sailors VALUES (16, 'David',  9,  30.9);
INSERT INTO Sailors VALUES (27, 'Fei',    3,  31.0);
INSERT INTO Sailors VALUES (12, 'James',    3,  32.0);
INSERT INTO Sailors VALUES (30, 'Janice',    3,  33.0);
INSERT INTO Sailors VALUES (2, 'Jim', 8,  34.5);
INSERT INTO Sailors VALUES (15, 'Jingke', 10,  35.0);
INSERT INTO Sailors VALUES (26, 'Jonathan',9,  36.0);
INSERT INTO Sailors VALUES (24, 'Kal',    3,  36.6);
INSERT INTO Sailors VALUES (14, 'Karen', 8,  37.8);
INSERT INTO Sailors VALUES (8, 'Karla',7,  39.0);
INSERT INTO Sailors VALUES (25, 'Kristen', 10, 39.5);
INSERT INTO Sailors VALUES (19, 'Len',   8,  40.0);
INSERT INTO Sailors VALUES (7, 'Lois',   8,  41.0);
INSERT INTO Sailors VALUES (13, 'Mark', 7,  43.0);
INSERT INTO Sailors VALUES (18, 'Melanie', 1,  44.0);
INSERT INTO Sailors VALUES (5, 'Niru',  5,  46.0);
INSERT INTO Sailors VALUES (23, 'Pavel',    3,  48.0);
INSERT INTO Sailors VALUES (1, 'Sergio', 7,  50.0);
INSERT INTO Sailors VALUES (6, 'Suhui', 1,  51.0);
INSERT INTO Sailors VALUES (22, 'Suresh',9,  52.0);
INSERT INTO Sailors VALUES (20, 'Tim',7,  54.0);
INSERT INTO Sailors VALUES (21, 'Tom', 10,  56.0);
INSERT INTO Sailors VALUES (11, 'Warren',    3,  58.0);
INSERT INTO Sailors VALUES (10, 'WuChang',9,  59.0);
INSERT INTO Sailors VALUES (9, 'WuChi', 10,  60.0);

after analyzing, I access the pg_stats table with

SELECT  n_distinct, most_common_vals,
        most_common_freqs,   histogram_bounds
FROM pg_stats WHERE tablename = 'sailors' AND attname = 'rank';

and I get:

n_distinct      most_common_vals        most_common_freqs
histogram_bounds
-0.233333
        {3,9,10,7,8}    {0.266667,0.166667,0.166667,0.133333,0.133333}
{1,5}

I have two questions.  I'd appreciate any info you can provide,
including pointers to the source code.

1. Why does Postgres come up with a negative n_distinct?  It
apparently thinks that the number of rank values will increase as the
number of sailors increases.  What/where is the algorithm that decides
that?

2. The most_common_vals and their frequencies make sense.  They say
that the values {3,9,10,7,8} occur a total of 26 times, so other
values occur a total of 4 times.  The other, less common, values are 1
and 5, each occuring twice, so the histogram {1,5} is appropriate.
If I run the query
EXPLAIN SELECT * from sailors where rank = const;
for any const not in the MVC list, I get the plan

Seq Scan on sailors  (cost=0.00..1.38 rows=2 width=21)
  Filter: (rank = const)

The "rows=2" estimate makes sense when const = 1 or 5, but it makes no
sense to me for other values of const not in the MVC list.
For example, if I run the query
EXPLAIN SELECT * from sailors where rank = -1000;
Postgres still gives an estimate of "row=2".
Can someone please explain?

Thanks,

Len Shapiro
Portland State University

В списке pgsql-performance по дате отправления:

Предыдущее
От: "Merlin Moncure"
Дата:
Сообщение: Re: SSD database benchmarks
Следующее
От: Tom Lane
Дата:
Сообщение: Re: Understanding histograms