Re: Postgresqlism & Vacuum?

From: Jurgen Defurne
Subject: Re: Postgresqlism & Vacuum?
Date:
Msg-id: 38FB6718.BA8185DB@glo.be
In response to: Re: Postgresqlism & Vacuum?  (Stephen J Lombardo <lombardo@mac.com>)
List: pgsql-general
Stephen J Lombardo wrote:
Because small is a relative term. You will notice that Bruce does not
say "where a table is less than 100 tuples" or something like that. And
because in the end you would probably waste significantly more time than a
few micro-seconds. Consider a table where you have some round number of
tuples, say 100,000.  Suppose you had b-tree indexes on two attributes,
employee_id (primary key) and last_name. Now if you were to run a query to
look up an employee by the primary key you would surely want to use the
index. Assume that it would take 3 disk accesses to search the index, and
one to fetch the data page from the heap. So you have a total of 4 disk
accesses to search on primary key and retrieve one row. Now suppose you were
You have to consider that the B-tree index itself clusters its keys. With a key like employee_id you can fit as many as 150 keys in an index block (blocksize 2k, administrative data accounted for). Once the first record is located, the index is walked sequentially, so a single index block read covers 150 keys. On that assumption your index consists of about 666 blocks, and to retrieve your 50,000 records you need to read only about 333 of them. On average, that is only one disk access per 150 records.
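To make that arithmetic explicit (keeping the same assumptions: 2k index blocks, roughly 150 keys per block), a plain SELECT is enough to check the numbers:

SELECT 100000 / 150 AS index_blocks,   -- whole index: about 666 blocks
       50000 / 150  AS blocks_walked;  -- leaf blocks read for 50,000 keys: about 333

So on the index side the per-record cost is closer to 1/150 of a disk access than to 3.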
 
going to run a query that would return a significant number of rows, let's
say half the table (50,000). Now if the optimizer chose to use the index on
that query it would take 4 disk accesses to locate each and every row (3 to
search the index, 1 to grab the data page). So if the query ran using the index it
would use 200,000 (50,000 * 4) disk accesses (worst case scenario of course;
using CLUSTER could improve the efficiency). Let's assume that the average
size of a tuple is 500k. So PostgreSQL would pack about 16 tuples into a
I think you mean 500 bytes here, right?
 
single page. Therefore doing a sequential search on the table would require
100,000/16, or 6250 disk accesses. Depending on the speed of your drive this
could make a big difference. Suppose the large query was run only 10 times a
day; that would waste around 2 million disk accesses. Now if you were using
a join, performance would suffer even more.
Since your data is spread over 6250 pages and you want to retrieve 50,000 records from them, there are two cases:
a) Complete randomness
b) Complete sequentialness
In the case of complete randomness, you might need a separate page read for every record fetched (in which case you do end up with 50,000 accesses), but that worst case is unrealistic, because every file/DBMS system since COBOL implements some buffering scheme, so the chance of finding your data in a block that is already in memory grows.
In the case of complete sequentialness, you only need to read 3125 blocks. Most systems since COBOL also implement some read-ahead, so disk access drops again.
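Put next to each other (using the same figures as above: 4 accesses per row through the index, 16 tuples per page), the comparison looks roughly like this:

SELECT 50000 * 4   AS index_scan_worst_case,  -- one full lookup per row: 200,000 accesses
       100000 / 16 AS seq_scan_pages,         -- scanning the whole table: 6250 pages
       50000 / 16  AS pages_if_sequential;    -- the 50,000 rows stored contiguously: 3125 pages

Buffering and read-ahead pull the first number down considerably, as noted above.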
 
    The job of the optimizer is to make educated decisions about how to run
a query. Stats will help it out significantly, but it is expensive to
maintain statistics on a running database and it would decrease overall
performance. Instead the answer is to collect statistics periodically. There
is reasoning behind this too. Consider a table where you have 1,000,000
tuples. One of the attributes is called state. Currently there are only 5
states in the database. A query is run like this:

SELECT state FROM table_name WHERE state='NY';

The optimizer will see if it has any statistics on this table. If not it
will make a guess at how many rows are returned. So the optimizer guesses
that 1% of the table, or 10,000 rows, will be returned. Then it will use
that number to assess how to run the query. Now if it had statistics on the
table the optimizer would know that there were only 5 different values in
the state column of the table. So the optimizer would assume that 20% of
the table would be returned from the query. It is likely that the optimizer
will choose a very different plan when it thinks that 200,000 rows will be
returned.

Two cases here:
a) state is indexed
b) state is not indexed
If state is indexed, then the optimizer doesn't need to do much work: just find the first index block, then walk the tree until state <> 'NY', regardless of the number of records to be retrieved.
If it is not indexed, then the optimizer should not even attempt to use an index, but just start searching sequentially.
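Either way, EXPLAIN shows which plan the optimizer actually settles on; with the hypothetical table from the example, something like this would let you compare the two cases:

CREATE INDEX table_name_state_idx ON table_name (state);  -- index name is just an example

EXPLAIN SELECT state FROM table_name WHERE state = 'NY';

Without the index the only possible plan is a sequential scan; with it, the choice depends on how selective the optimizer believes 'NY' to be.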
 
    You can be confident that the fine PostgreSQL developers have done a
good job with the optimizer. There are reasons that things are done the way
they are, but they might not be immediately apparent.

Cheers,
Stephen

The main thing the optimizer does is decide which indices, if any, a query should best be run across. The statistics from VACUUM ANALYZE can help, but I think that a good insight into when and where to use indices helps performance more than adding indices ad infinitum, running an analyze, and then hoping that the system will run fast.
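For what it's worth, once the indices are in place it only takes a moment to gather the statistics and re-check the plan (again on the hypothetical table from the example above):

VACUUM ANALYZE table_name;

EXPLAIN SELECT state FROM table_name WHERE state = 'NY';

If the estimate of the number of matching rows changes, the plan may change with it.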

Jurgen
defurnj@glo.be
