Thread: database file size bloat


database file size bloat

From: Matthew Arnison
hi folks,

we are using a fairly simple database, with stuff being retrieved and
updated and inserted via php over the web. we have code in there to check
that queries sent to psql are always less than 8000 bytes.

three times now this week (on two different servers) the raw database on
disk has ballooned in size, from about 10 megs to 50 megs in two cases,
and from about 10 megs to 250 megs in another case.

a VACUUM VERBOSE ANALYZE; cleans it back down to the proper size, but
meanwhile all the queries take half a minute, instead of less than a
second. and our load average skyrockets.

here is a dump from VACUUM:

NOTICE:  --Relation webcast--
NOTICE:  Pages 5568: Changed 0, Reapped 5553, Empty 0, New 0; Tup 93: Vac
35065, Keep/VTL 0/0, Crash 0, UnUsed 29, MinLen 766, MaxLen
4782; Re-using: Free/Avail. Space 45303704/45303704; EndEmpty/Avail. Pages
0/5553. Elapsed 1/0 sec.
NOTICE:  Rel webcast: Pages: 5568 --> 15; Tuple(s) moved: 93. Elapsed
415/0 sec.
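
(For scale: the 5568 pages reported above are, at the default 8 kB block size,
roughly 45 MB, which matches the Free/Avail. Space figure; after the vacuum the
relation is back to 15 pages. A rough way to watch the on-disk size directly,
assuming the 6.5 layout of one file per table under $PGDATA/base/<dbname> and
placeholder names:)

    # spot-check on-disk size; "mydb" and the PGDATA layout are assumptions
    du -sk $PGDATA/base/mydb            # whole database, in kilobytes
    ls -l $PGDATA/base/mydb/webcast     # the bloating relation from the output above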

we are running postgresql-6.5.2-1, redhat 6.0, pii-350, 384 megs RAM.

i thought of upgrading, but couldn't see anything in the 6.5.3 changelogs
that would help. checked through the mail archives, couldn't find anything
on a quick search, couldn't see anything in the FAQ.

i am wondering if we can prevent this bloat from happening. any help
appreciated! this server could get very busy over the weekend, and we want
it to stay up under the load. :)

cheers,
matthew.


Re: database file size bloat

From: Ed Loehr
Matthew Arnison wrote:
>
> three times now this week (on two different servers) the raw database on
> disk has ballooned in size, from about 10 megs to 50 megs in two cases,
> and from about 10 megs to 250 megs in another case.
>
> a VACUUM VERBOSE ANALYZE; cleans it back down to the proper size, but
> meanwhile all the queries take half a minute, instead of less than a
> second. and our load average skyrockets.

Hi Matthew,

I have no explanation for the bloat, but it is a well-known "postgresqlism"
that you should consider running vacuum analyze at least nightly, possibly
more frequently.  [I run it hourly.]
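
A minimal sketch of how that might be scheduled from cron (the database name,
time and log path are only placeholders, adjust to your setup):

    # crontab entry for the postgres user: VACUUM ANALYZE every night at 03:15
    15 3 * * *   psql -q -c "VACUUM ANALYZE;" mydb >> /tmp/pgvacuum.log 2>&1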

Also, there are about 300 reasons to consider upgrading to 7.0, most having
to do with bug fixes and performance improvements.  Unfortunately, there
may be a few incompatibilities (particularly in some pl/pgsql
incantations), so don't assume a seamless upgrade.

Regards,
Ed Loehr

Postgresqlism & Vacuum?

From: Thomas
Ed Loehr wrote:

> ... it is a well-known "postgresqlism"
> that you should consider running vacuum analyze at least nightly, possibly
> more frequently.  [I run it hourly.]

I think there must be something wrong with the optimiser that it's
"postgresqlism" that you must vacuum analyze frequently.  Just as an example,
for Clipper (dBase compiler), it's Clipperism that you must re-index if you
cannot locate some records just because the indexing module screws up.

For large 24x7 installations, it's impossible to vacuum nightly because when
postgresql is vacuuming the table is locked up, to the end-user the database
has already hung.

There has been effort to speed up the vacuuming process, but this isn't the
cure.  I believe the fault lies on the optimizer.

For eg, in Bruce Momjian's FAQ 4.9:

   PostgreSQL does not automatically maintain statistics. One has to make
   an explicit vacuum call to update the statistics. After statistics are
   updated, the optimizer knows how many rows in the table, and can
   better decide if it should use indices. Note that the optimizer does
   not use indices in cases when the table is small because a sequential
   scan would be faster.

Why save on micro-seconds to use sequential scan when the table is small and
later 'forgets' that the table is now big because you didn't vacuum analyze?
Why can't the optimizer just use indexes when they are there and not
'optimize' for special cases when the table is small to save micro-seconds?


Thomas
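
A quick way to see the behaviour being debated here is to ask the planner
itself with EXPLAIN, before and after gathering statistics; a minimal sketch,
with purely illustrative table, column and database names:

    # does the planner pick the index?  ("webcast", "id" and "mydb" are made-up names)
    psql -c "EXPLAIN SELECT * FROM webcast WHERE id = 42;" mydb    # stale stats: often a Seq Scan
    psql -c "VACUUM ANALYZE webcast;" mydb                         # refresh row counts and statistics
    psql -c "EXPLAIN SELECT * FROM webcast WHERE id = 42;" mydb    # may now switch to an Index Scan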



Re: Postgresqlism & Vacuum?

From: Andrew Snow
On Fri, 14 Apr 2000, Thomas wrote:

> For large 24x7 installations, it's impossible to vacuum nightly because when
> postgresql is vacuuming the table is locked up, to the end-user the database
> has already hung.

That's right. I complained about this in a discussion with a Postgresql
developer, who assured me they were working towards a fix.  I really don't
care whether the vacuuming is fixed so that it does not lock the table
completely, or that vacuuming becomes say, a once-a-month or less frequent
operation. For some reason everyone who is used to working with PostgreSQL
accepts the fact that you have to vacuum nightly - to outsiders it seems
like a major flaw with the system.


> There has been effort to speed up the vacuuming process, but this isn't the
> cure.  I believe the fault lies on the optimizer.

Sure, the vacuum process speed is fine for small tables, but what about the
big ones where the table gets locked for 5 minutes?  What a joke!

> Why save on micro-seconds to use sequential scan when the table is small and
> later 'forgets' that the table is now big because you didn't vacuum analyze?
> Why can't the optimizer just use indexes when they are there and not
> 'optimize' for special cases when the table is small to save micro-seconds?

Well it's more than microseconds I presume, as opening indexes involves
opening files, which takes milliseconds rather than microseconds.




Andrew.




Re: database file size bloat

From: Matthew Arnison
the bloat is a big problem. i just checked it again, and the db has
ballooned to 20 megs again, with i think 2650 unused pages. this is after
vacuuming it last night. i guess we need to set up the vacuum script to run
every hour. i am worried about this locking out users during the
vacuuming, although i guess if it happens more often, it should take less
time.

meanwhile, as for upgrading, i think i'll try 6.5.3 first.

version 7 is still in beta. is it at least as stable as 6.5.3? is it
at least as fast as 6.5.3?

this is a live site all right.

thanks for your advice,
matthew.

On Thu, 13 Apr 2000, Ed Loehr wrote:

> [snip]


Re: database file size bloat

From: Brian Neal
Maybe you might want to try out MySQL?  A little while ago, I compared both
MySQL and PostgreSQL to see how they stacked up (for my purposes, anyway).  I
came to the conclusion that while MySQL is a very fast read-only database, it
doesn't support transactions, row-level locks, stored-procedures, sub-selects,
etc.  PostgreSQL has a lot more basic database support, but it is harder to
install and maintain (in my opinion), has worse documentation, and a number of
interesting quirks...for example, the fixed-size row limitation that can only be
changed by a recompilation, or the VACUUM problem described here.  Other issues
I had included the way the backend seemed to work...it is certainly very
demanding when it comes to shared memory, and I had concerns about the process
pool (whether or not pg-sql could handle enough connections) instead of threads,
which most other databases seem to use.

MySQL is an easier installation, requires less maintenance, doesn't have
row-size limitations, and is fully threaded.  PostgreSQL supports a great deal
of basic SQL functionality that MySQL doesn't.  MySQL is good for read-only
databases because it seems to be rather ineffective when it comes to concurrent
writes to the same table (it either locks the whole table or nothing at all)
and has no commit/rollback.  PostgreSQL seems to offer what MySQL lacks, but in
reality it also lacks a lot of what MySQL has.

In my case, I am still looking, but maybe there is a more immediate solution out
there for you. ;)

-Brian

>Date: Fri, 14 Apr 2000 16:29:23 +1000 (EST)
>From: Matthew Arnison <matthewa@physics.usyd.edu.au>
>Subject: Re: [GENERAL] database file size bloat
>
> [snip]


Re: Postgresqlism & Vacuum?

From: Lincoln Yeoh
At 01:13 PM 14-04-2000 +0800, Thomas wrote:
>There has been effort to speed up the vacuuming process, but this isn't the
>cure.  I believe the fault lies on the optimizer.
>
>For eg, in Bruce Momjian's FAQ 4.9:
>
>   PostgreSQL does not automatically maintain statistics. One has to make
>   an explicit vacuum call to update the statistics. After statistics are
>   updated, the optimizer knows how many rows in the table, and can
>   better decide if it should use indices. Note that the optimizer does
>   not use indices in cases when the table is small because a sequential
>   scan would be faster.

Is it too difficult/expensive for Postgresql to keep track of how many
committed rows there are in each table? Then count(*) of the whole table
could be faster too.
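
(The row count the optimizer works from is in fact already kept in the system
catalogs, refreshed by each vacuum; a quick way to inspect it, with an assumed
table and database name:)

    # reltuples/relpages are the estimates the planner uses, refreshed by each VACUUM
    psql -c "SELECT relname, relpages, reltuples FROM pg_class WHERE relname = 'webcast';" mydb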

Since it's just for optimization it could perhaps keep a rough track of how
many rows would be selected for the past X indexed searches of a table, so
as to better decide which index to use first. Right now it seems like the
optimizer can't learn a thing till the database takes a nap and dreams
about statistics. I prefer the database to be able to learn a few things
before having to take a nap. And then maybe it will only need to take a nap
once every few weeks/months.

Also it's better for the optimizer to be good at figuring which index to
use, than figure whether to use indexes at all. Because in most cases the
people creating indexes on tables _should_ know whether to use indexes at
all. So if there's an index use it. So what if it's a bit slower when
things are small. I put in indexes to make sure that things are still ok
when things get big!

How many people care about the "slow down" when things are small? It's
still fast! If things are going to stay small, then the database admin
should just drop the index.

Often predictable degradation is more useful than academically optimum.

Cheerio,

Link.


Re: Postgresqlism & Vacuum?

From: Andy Lewis
I'd also like to hear from anyone on the original poster's topic of the
"24-hour shop".

I too am in that same boat. I have a DB with 7-8 million records on a Dual
550 with 512Meg Ram and 1gig swap and it takes vacuum 10 - 15 minutes each
evening to run.

Users think the site is hosed and management isn't exactly happy about it
either.

There is one DB on the machine that has two tables, one table has 2
columns and the other has about 25 columns.

I'd think somehow there could be a way to vacuum without having to lock
up the entire DB.

Andy

On Fri, 14 Apr 2000, Lincoln Yeoh wrote:

> [snip]


Re: Postgresqlism & Vacuum?

From: Bruce Momjian
> Ed Loehr wrote:
>
> > ... it is a well-known "postgresqlism"
> > that you should consider running vacuum analyze at least nightly, possibly
> > more frequently.  [I run it hourly.]
>
> I think there must be something wrong with the optimiser that it's
> "postgresqlism" that you must vacuum analyze frequently.  Just as an example,
> for Clipper (dBase compiler), it's Clipperism that you must re-index if you
> cannot locate some records just because the indexing module screws up.

Vacuum collects stats on table size on every run.  Vacuum analyze every
night is a waste unless the tables are really changing dramatically
every day.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: Postgresqlism & Vacuum?

From: Peter Eisentraut
On Fri, 14 Apr 2000, Thomas wrote:

> I think there must be something wrong with the optimiser that it's
> "postgresqlism" that you must vacuum analyze frequently.

One thing that is not widely known is that vacuum actually has two
orthogonal tasks: garbage collection and statistics collection (only when
you ANALYZE). The fact that this is combined in one command is a
historical artifact, and there are some voices that want to separate the
commands.

The way I see it, if you have enough disk space you never have to run
vacuum to garbage collect. Of course, never vacuuming can lead to obvious
problems once the heap files get so large that it takes more time to physically
access them. The alternative is to garbage collect on each transaction commit,
but that
bears its own set of performance implications.

The analyze part would probably not need an exclusive lock on the table
but the vacuum certainly does.


--
Peter Eisentraut                  Sernanders väg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden


Re: Postgresqlism & Vacuum?

From: Frank Bax
At 06:13 AM 4/14/00 -0500, you wrote:
>I'd think somehow there could be a way to vacuum without having to lock
>up the entire DB.


From http://www.postgresql.org/docs/user/sql-vacuum.htm
>
>VACUUM serves two purposes in Postgres as both a means to reclaim
>storage and also a means to collect information for the optimizer.

I'm guessing here, but it would seem to me that once the 'reclaim' portion
was written, it probably seemed like as good a place as any to put the
stat-collecting code?  As long as the entire database was being scanned
anyway, why not collect statistics.

Perhaps it's time for the two functions to be separated - controlled by an
option?
Perhaps VACUUM STATONLY could collect stats, not lock table and not reclaim
space.

Actually, I'm thinking any seq-scan could collect the stats on the way
through?

Frank


Re: Postgresqlism & Vacuum?

From: Ed Loehr
Bruce Momjian wrote:
>
> > Ed Loehr wrote:
> >
> > > ... it is a well-known "postgresqlism"
> > > that you should consider running vacuum analyze at least nightly, possibly
> > > more frequently.  [I run it hourly.]
> >
> > I think there must be something wrong with the optimiser that it's
> > "postgresqlism" that you must vacuum analyze frequently.  Just as an example,
> > for Clipper (dBase compiler), it's Clipperism that you must re-index if you
> > cannot locate some records just because the indexing module screws up.
>
> Vacuum collects stats on table size on every run.  Vacuum analyze every
> night is a waste unless the tables are really changing dramatically
> every day.

Agreed.  My tables are changing dramatically every day under normal usage.
Ideally, vacuuming would be auto-triggered after so many
inserts/updates/deletes.
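
Lacking such a trigger in the server, the effect can be approximated from cron
by watching the on-disk size; a rough sketch (the 6.5 file layout, names and
the threshold are all assumptions to adapt):

    #!/bin/sh
    # vacuum the "webcast" table in "mydb" only once its heap file has grown past ~16 MB
    SIZE_KB=`du -sk $PGDATA/base/mydb/webcast | awk '{print $1}'`
    if [ "$SIZE_KB" -gt 16384 ]; then
        psql -q -c "VACUUM ANALYZE webcast;" mydb
    fi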

[I neglected to mention that I originally started running vacuum hourly
because it was the only way to prevent a number of bugs in 6.5.*.]

Maybe the docs need to be updated?

    "We recommend that active production databases be
    cleaned nightly, in order to keep statistics relatively
    current."

    - http://www.postgresql.org/docs/postgres/sql-vacuum.htm

Regards,
Ed Loehr

Re: database file size bloat

From: Ed Loehr
Matthew Arnison wrote:
>
> the bloat is a big problem. i just checked it again, and the db has
> ballooned to 20 megs again, with i think 2650 unused pages. this is after
> vacuuming it last night. i guess we need to set up the vacuum script to run
> every hour. i am worried about this locking out users during the
> vacuuming, although i guess if it happens more often, it should take less
> time.

I should add that my vacuum runs don't take very long (< 10 seconds).  I
would have to consider other alternatives if it took much longer...

> meanwhile, as for upgrading, i think i'll try 6.5.3 first.
>
> version 7 is still in beta. is it at least as stable as 6.5.3? is it
> at least as fast as 6.5.3?

Beta3 is more stable and much faster, IMO.  Haven't tried beta5.

Re: Postgresqlism & Vacuum?

From: Stephen J Lombardo
> I think there must be something wrong with the optimiser that it's
> "postgresqlism" that you must vacuum analyze frequently.  Just as an example,
> for Clipper (dBase compiler), it's Clipperism that you must re-index if you
> cannot locate some records just because the indexing module screws up.
>
> For large 24x7 installations, it's impossible to vacuum nightly because when
> postgresql is vacuuming the table is locked up, to the end-user the database
> has already hung.
>
> There has been effort to speed up the vacuuming process, but this isn't the
> cure.  I believe the fault lies on the optimizer.
>
> For eg, in Bruce Momjian's FAQ 4.9:
>
> PostgreSQL does not automatically maintain statistics. One has to make
> an explicit vacuum call to update the statistics. After statistics are
> updated, the optimizer knows how many rows in the table, and can
> better decide if it should use indices. Note that the optimizer does
> not use indices in cases when the table is small because a sequential
> scan would be faster.
>
> Why save on micro-seconds to use sequential scan when the table is small and
> later 'forgets' that the table is now big because you didn't vacuum analyze?
> Why can't the optimizer just use indexes when they are there and not
> 'optimize' for special cases when the table is small to save micro-seconds?

    Because small is a relative term. You will notice that Bruce does not
say "where a table is less than 100 tuples" or something like that. And
because in the end you would probably waste significantly more time than a
few micro-seconds. Consider a table where you have some round number of
tuples, say 100,000.  Suppose you had b-tree indexes on two attributes,
employee_id (primary key) and last_name. Now if you were to run a query to
look up an employee by the primary key you would surely want to use the
index. Assume that it would take 3 disk accesses to search the index, and
one to fetch the data page from the heap. So you have a total of 4 disk
accesses to search on primary key and retrieve one row. Now suppose you were
going to run a query that would return a significant number of rows, let's
say half the table (50,000). Now if the optimizer chose to use the index on
that query it would take 4 disk accesses to locate each and every row (3 to
search index, 1 to grab data page). So if the query ran using the index it
would use 200,000 (50,000 * 4) disk accesses (worst case scenario, of course;
using CLUSTER could improve the efficiency). Let's assume that the average
size of a tuple is 500k. So PostgreSQL would pack about 16 tuples into a
single page. Therefore doing a sequential search on the table would require
100,000/16, or 6250 disk accesses. Depending on the speed of your drive this
could make a big difference. Suppose the large query was run only 10 times a
day, that would waste around 2 million disk accesses. Now if you were using
a join performance would suffer even more.
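
(A back-of-envelope recap of those figures, taking the 4 page reads per indexed
row fetch and, as corrected later in the thread, tuples of roughly 500 bytes,
about 16 per 8 kB page:)

    # index path vs. one sequential pass, using the numbers above
    echo $(( 50000 * 4 ))     # 200000 page reads if every row is fetched through the index
    echo $(( 100000 / 16 ))   # 6250 page reads to scan the whole heap once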
    The job of the optimizer is to make educated decisions about how to run
a query. Stats will help it out significantly, but it is expensive to
maintain statistics on a running database and it would decrease overall
performance. Instead the answer is to collect statistics periodically. There
is reasoning behind this too. Consider a table where you have 1,000,000
tuples. One of the attributes is called state. Currently there are only 5
states in the database. A query is run like this:

SELECT state FROM table_name WHERE state='NY';

The optimizer will see if it has any statistics on this table. If not it
will make a guess at how many rows are returned. So the optimizer guesses
that 1% of the table, or 10,000 rows, will be returned. Then it will use
that number to assess how to run the query. Now if it had statistics on the
table the optimizer would know that there were only 5 different values in
the states column of the table. So the optimizer would assume that 20% of
the table would be returned from the query. It is likely that the optimizer
will choose a very different plan when it thinks that 200,000 rows will be
returned.
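
(To make the example concrete, the distribution that the statistics are meant
to capture can be checked directly; the table and column are from the example
above, the database name is a placeholder:)

    # how skewed is the column the planner has to guess about?
    psql -c "SELECT state, count(*) FROM table_name GROUP BY state;" mydb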
    You can be confident that the fine PostgreSQL developers have done a
good job with the optimizer. There are reasons that things are done the way
they are, but they might not be immediately apparent.

Cheers,
Stephen


Re: Postgresqlism & Vacuum?

From: Bruce Momjian
> > Vacuum collects stats on table size on every run.  Vacuum analyze every
> > night is a waste unless the tables are really changing dramatically
> > every day.
>
> Agreed.  My tables are changing dramatically every day under normal usage.
> Ideally, vacuuming would be auto-triggered after so many
> inserts/updates/deletes.

We have thought of that too.  Some vacuum option that would do stats
only if X % of the table had changed.


--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: Postgresqlism & Vacuum?

From: Bruce Momjian
> >From http://www.postgresql.org/docs/user/sql-vacuum.htm
> >
> >VACUUM serves two purposes in Postgres as both a means to reclaim
> >storage and also a means to collect information for the optimizer.
>
> I'm guessing here, but it would seem to me that once the 'reclaim' portion
> was written, it probably seemed like as good a place as any to put the
> stat-collecting code?  As long as the entire database was being scanned
> anyway, why not collect statistics.

Yes, that was the idea.  While doing one, why not do the other.

>
> Perhaps it's time for the two functions to be separated - controlled by an
> option?
> Perhaps VACUUM STATONLY could collect stats, not lock table and not reclaim
> space.

Makes sense.

> Actually, I'm thinking any seq-scan could collect the stats on the way
> through?

We have thought about that, at least to count the number of rows.


--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: Postgresqlism & Vacuum?

From: Bruce Momjian
> Maybe the docs need to be updated?
>
>     "We recommend that active production databases be
>     cleaned nightly, in order to keep statistics relatively
>     current."
>
>     - http://www.postgresql.org/docs/postgres/sql-vacuum.htm

Thanks.  Updated now.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: Postgresqlism & Vacuum?

From: The Hermit Hacker
On Fri, 14 Apr 2000, Andrew Snow wrote:

>
> On Fri, 14 Apr 2000, Thomas wrote:
>
> > For large 24x7 installations, it's impossible to vacuum nightly because when
> > postgresql is vacuuming the table is locked up, to the end-user the database
> > has already hung.
>
> That's right. I complained about this in a discussion with a Postgresql
> developer, who assured me they were working towards a fix.  I really don't
> care whether the vacuuming is fixed so that it does not lock the table
> completely, or that vacuuming becomes say, a once-a-month or less frequent
> operation. For some reason everyone who is used to working with PostgreSQL
> accepts the fact that you have to vacuum nightly - to outsiders it seems
> like a major flaw with the system.

Okay, this *used* to be a problem way way back, but I definitely don't
vacuum my databases nightly ... most times I don't do it until something
odd comes up that I figure that I may as well vacuum first to see if it
makes a difference ...

vacuum'ng once a week, unless you have one helluva insert/update/delete
intensive table ...

> Sure, the vacuum process speed is fine for small tables, but what about the
> big ones where the table gets locked for 5 minutes?  What a joke!

v7.0beta5, with a table that has *over* 5 million tuples:

pgsql% time psql -c "vacuum" postgresql
VACUUM
0.000u 0.022s 2:46.67 0.0%      0+0k 0+0io 0pf+0w



Re: database file size bloat

From: The Hermit Hacker
On Fri, 14 Apr 2000, Matthew Arnison wrote:

> the bloat is a big problem. i just checked it again, and the db has
> ballooned to 20 megs again, with i think 2650 unused pages. this is after
> vacuuming it last night. i guess we need to set up the vacuum script to run
> every hour. i am worried about this locking out users during the
> vacuuming, although i guess if it happens more often, it should take less
> time.
>
> meanwhile, as for upgrading, i think i'll try 6.5.3 first.
>
> version 7 is still in beta. is it at least as stable as 6.5.3? is it
> at least as fast as 6.5.3?

it is generally felt on the developer lists that each release is
more stable, and faster, than the previous one.  I just put v7.0beta5 in
place as the backend to the PostgreSQL Search function off of
http://www.postgresql.org and with >5 million tuples loaded already in the
dictionary table, queries are still taking <1 min to complete ...







Marc G. Fournier                   ICQ#7615664               IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org


Re: database file size bloat

From: The Hermit Hacker
On Fri, 14 Apr 2000, Brian Neal wrote:

>
> Maybe you might want to try out MySQL?  A little while ago, I compared
> both MySQL and PostgreSQL to see how they stacked up (for my purposes,
> anyway).  I came to the conclusion that while MySQL is a very fast
> read-only database, it doesn't support transactions, row-level locks,
> stored-procedures, sub-selects, etc.  PostgreSQL has a lot more basic
> database support, but it is harder to install and maintain (in my
> opinion), has worse documentation, and a number of interesting
> quirks...for example, the fixed-size row limitation that can only be
> changed by a recompilation, or the VACUUM problem described here.
> Other issues I had included the way the backend seemed to work...it is
> certainly very demanding when it comes to shared memory, and I had
> concerns about the process pool (whether or not pg-sql could handle
> enough connections) instead of threads, which most other databases
> seem to use.

Few comments ... the 'row size limit' should be gone in v7.1, as Jan is
working on implementing TOAST for that ...

As for the process pool ... there are two camps here ... if you use
threads, when one thread crashes the server, the server is done.  With
forking, if one backend dies, the server is still running ... in a 24x7
shop, I'd rather have the solution that doesn't die because JoeBlow down the
hall blew out something with his query :)



Re: Postgresqlism & Vacuum?

From: Thomas Reinke

Stephen J Lombardo wrote:
>
> >
> > Why save on micro-seconds to use sequential scan when the table is small and
> > later 'forgets' that the table is now big because you didn't vacuum analyze?
> > Why can't the optimizer just use indexes when they are there and not
> > 'optimize' for special cases when the table is small to save micro-seconds?
>

[snip]

>     You can be confident that the fine PostgreSQL developers have done a
> good job with the optimizer. There are reasons that things are done the way
> they are, but they might not be immediatly apparent.
>

That's all fine and good. But the bottom line is that the vacuum
requirement, and the amount of time it takes, is a major drawback to the
viability of PostGres in a commercial environment.  I have _never_ seen
anybody who likes it. I have seen lots of workarounds; lots of gripes; lots
of code developed to get around its problems (at least in our shop).

Without suggesting _how_ to fix it, I think it is safe to
say everyone would love to see it fixed. Personally,
if someone were to give me an option of having a slower
db engine (e.g. half the speed), but one that never had
to be vacuumed short of a corruption of data, I would
take the slower system. Simply because I can tolerate
a transaction taking 1 second as opposed to 0.5 seconds.
I cannot tolerate taking the system off-line for an hour at
a time. Nor am I thrilled with the work we've had to put
in to get around vacuum's problems.

So, my idea for the development team, assuming that there
is no easy elegant solution to keeping performance optimal
AND getting rid of the vacuum requirements: Provide a config
option that allows an admin to select a fast database with
regularly scheduled vacuums, or a slower database with no
vacuum requirements at all. (I'd take the latter.)

Cheers, Thomas

--
------------------------------------------------------------
Thomas Reinke                            Tel: (905) 331-2260
Director of Technology                   Fax: (905) 331-2504
E-Soft Inc.                         http://www.e-softinc.com
Publishers of SecuritySpace     http://www.securityspace.com

Re: Postgresqlism & Vacuum?

From: Charles Tassell
This may be a dumb question, since I have no idea how the core of Postgres
is actually implemented, but couldn't the VACUUM command be done within a
transaction, or something similar, so that it doesn't lock the table while
running?  Of course it would be slightly out of date if write actions
(insert/update/delete/drop) were done during the process, but not much more
than it would be if these actions were taken a few seconds after the vacuum
was finished.

I don't see having to vacuum analyze the database every once in  a while as
a big problem, but the fact that it completely locks the table from even
read access while it's running is a bit scary.

BTW: Does dropping the indexes on a table then recreating them achieve the
same effect as doing a VACUUM ANALYZE, minus the garbage collection?  Or
does create index just define the index, and the next vacuum analyze
actually populate it?

At 12:58 PM 4/14/00, Stephen J Lombardo wrote:

> [snip]


Re: Postgresqlism & Vacuum?

From: The Hermit Hacker
On Fri, 14 Apr 2000, Thomas Reinke wrote:

>
>
> Stephen J Lombardo wrote:
> >
> > >
> > > Why save on micro-seconds to use sequential scan when the table is small and
> > > later 'forgets' that the table is now big because you didn't vacuum analyze?
> > > Why can't the optimizer just use indexes when they are there and not
> > > 'optimize' for special cases when the table is small to save micro-seconds?
> >
>
> [snip]
>
> >     You can be confident that the fine PostgreSQL developers have done a
> > good job with the optimizer. There are reasons that things are done the way
> > they are, but they might not be immediatly apparent.
> >
>
> That's all fine and good. But the bottom line is that the vacuum
> requirement, and the amount of time it takes, is a major drawback to the
> viability of PostGres in a commercial environment.  I have _never_ seen
> anybody who likes it. I have seen lots of workarounds; lots of gripes; lots
> of code developed to get around its problems (at least in our shop).
>
> Without suggesting _how_ to fix it, I think it is safe to
> say everyone would love to see it fixed. Personally,
> if someone were to give me an option of having a slower
> db engine (e.g. half the speed), but one that never had
> to be vacuumed short of a corruption of data, I would
> take the slower system. Simply because I can tolerate
> a transaction taking 1 second as opposed to 0.5 seconds.
> I cannot tolerate taking the system off-line for an hour at
> a time. Nor am I thrilled with the work we've had to put
> in to get around vacuum's problems.

Okay, I'm confused here .. if you don't vacuum, you will have that slower
db engine ... so just don't vacuum?

> So, my idea for the development team, assuming that there
> is no easy elegant solution to keeping performance optimal
> AND getting rid of the vacuum requirements: Provide a config
> option that allows an admin to select a fast database with
> regularly scheduled vacuums, or a slower database with no
> vacuum requirements at all. (I'd take the latter.)

vacuum has to be run manually, it is not automatic ... so that option already
exists ...



Re: database file size bloat

From: Brian Neal
>Few comments ... the 'row size limit' should be gone in v7.1, as Jan is
>working on implementing TOAST for that ...

Cool.  How does this solution stand up in terms of performance?  Will searches
through TEXT/BLOB fields be acceptably fast?

>As for the process pool ... there are two camps here ... if you use
>threads, when one thread crashes the server, the server is done.  With
>forking, if one backend dies, the server is still running ... in a 24x7
>shop, I'd rather have the solution that doesn't die because JoeBlow down the
>hall blew out something with his query :)

I have to disagree with you here, but I suppose it depends upon your
application.  In the places I've worked and for the stuff I'm doing now, you
would never run queries that would crash the server on a production machine...so
the Joe Blow example would not happen (generally).  Beyond that, I think the
process pool has some terrible downsides...mostly resource usage and performance
in terms of forking, etc.  The amount of shared memory required to handle a
large number of back-end processes, for example, is pretty intense, at least
from my experience.  I had to increase the system's shm hard limit to get it to
work with the number of backends I needed (slightly less, actually, I would have
preferred more)....to do this I had to reboot.  Because of the nature of our
production server, I cannot reboot it, and so I cannot ever run pgsql if this
doesn't go away (I'm not prodding you to do anything about it on my account,
though).

Apache is another program using process pools, and it can use a startling amount
of resources during load spikes...it's sometimes frightening to see how quickly
memory can evaporate when it begins forking large numbers of processes to deal
with spikes.  Apache 2.0 is moving to a threaded solution, as far as I know.
Also, most other major webservers and databases use threaded implementations as
well.  I don't think the stability/reliability of a single multi-threaded
process is a major issue as long as the program is well written and the site
running the installation is properly maintained.  Also, you can use a watchdog
to restart it in the event that it does die.

-Brian


Re: Postgresqlism & Vacuum?

From: "Steve Wolfe"
> Okay, this *used* to be a problem way way back, but I definitely don't
> vacuum my databases nightly ... most times I don't do it until something
> odd comes up that I figure that I may as well vacuum first to see if it
> makes a difference ...

  I used to take that approach, and I got tired of running into odd,
inexplicable problems that only a VACUUM ANALYZE would fix, so I ended up
using a cron job to do it nightly.

steve


RE: Postgresqlism & Vacuum?

From: "Andrew Snow"
> Okay, this *used* to be a problem way way back, but I definitely don't
> vacuum my databases nightly ... most times I don't do it until something
> odd comes up that I figure that I may as well vacuum first to see if it
> makes a difference ...
> vacuum'ng once a week, unless you have one helluva insert/update/delete
> intensive table ...

Well this is where the problem is.  The whole Vacuum thing is not bad for
most people as it allows doing things like reclaiming space and gathering
some statistics, during off-peak times, and also to run it less frequently
on databases that don't change often.  The problem is for those who have
24/7 continuous use of the database, combined with a high frequency of
insert/update/delete, so that vacuuming regularly becomes a necessity.


> v7.0beta5, with a table that has *over* 5 million tuples:
> pgsql% time psql -c "vacuum" postgresql
> VACUUM
> 0.000u 0.022s 2:46.67 0.0%      0+0k 0+0io 0pf+0w

That certainly is good.  I look forward to installing a non-beta version 7
in place of the current latest 6 and testing!  Don't forget it depends not
only on how big the table is in tuples, but on how many bytes, how many indexes,
etc.  Also the "analyze" option takes a while longer, as you know.


- Andrew


RE: Postgresqlism & Vacuum?

From: The Hermit Hacker
On Sat, 15 Apr 2000, Andrew Snow wrote:

> > v7.0beta5, with a table that has *over* 5 million tuples:
> > pgsql% time psql -c "vacuum" postgresql
> > VACUUM
> > 0.000u 0.022s 2:46.67 0.0%      0+0k 0+0io 0pf+0w
>
> That certainly is good.  I look forward to installing a non-beta version 7
> in place of the current latest 6 and testing!  Don't forget it depends not
> only on how big the table is in tuples, but on how many bytes, how many indexes,
> etc.  Also the "analyze" option takes a while longer, as you know.

That is true ... that is why I generally do a vacuum first then analyze
second ... it at least *feels* faster :)

Marc G. Fournier                   ICQ#7615664               IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org


Re: Postgresqlism & Vacuum?

From: Haroldo Stenger
Andrew Snow wrote:
>
> > Okay, this *used* to be a problem way way back, but I definitely don't
> > vacuum my databases nightly ... most times I don't do it until something
> > odd comes up that I figure that I may as well vacuum first to see if it
> > makes a difference ...
> > vacuum'ng once a week, unless you have one helluva insert/update/delete
> > intensive table ...
>
> Well this is where the problem is.  The whole Vacuum thing is not bad for
> most people as it allows doing things like reclaiming space and gathering
> some statistics, during off-peak times, and also to run it less frequently
> on databases that don't change often.  The problem is for those who have
> 24/7 continuous use of the database, combined with a high frequency of
> insert/update/delete, so that vacuuming regularly becomes a necessity.
>

How about:

VacuumAnalyzeDualProcedure
1. Use 2 mirror databases D1 and D2; every time there is an
insert/update/delete, apply it twice, once on each database.
2. Take D1 offline at a moment when no transactions are open.
3. Log transactions from now on, which apply on D2 (just
insert/update/delete).
4. Run VACUUM ANALYZE on D1.
5. When finished, apply the log to D1.
6. Take D2 offline at a moment when no transactions are open.
7. Bring D1 online.
8. Log transactions from now on, which apply on D1 (just
insert/update/delete).
9. Run VACUUM ANALYZE on D2.
10. When finished, apply the log to D2.
11. Bring D2 online.

Cheers,
Haroldo.

Re: database file size bloat

From: Marten Feldtmann
>
> As for the process pool ... there are two camps here ... if you use
> threads, when one thread crashes the server, the server is done.  With
> forking, if one backend dies, the server is still running ... in a 24x7
> shop, I'd rather have the solution that doesn't die because JoeBlow down the
> hall blew out something with his query :)
>

 Because of the shared memory being used, I think that this statement is not
quite true for PostgreSQL? If one backend process dies, other backends
may also be killed by the postmaster??


 Marten







Re: Postgresqlism & Vacuum?

From: Lincoln Yeoh
At 01:23 PM 14-04-2000 -0400, Bruce Momjian wrote:
>
>> Perhaps it's time for the two functions to be separated - controlled by an
>> option?
>> Perhaps VACUUM STATONLY could collect stats, not lock table and not reclaim
>> space.
>
>Makes sense.

Actually it may be more logical to have:

VACUUM            reclaim only
VACUUM ANALYZE    reclaim, then analyze (for backward compatibility)
ANALYZE           analyze only

But I still think that some sort of stats could be used dynamically. Even
if the stats aren't optimal (e.g. don't know which MVCC copy to use), using
them should be better than nothing in most cases. In the cases which don't
work then just do a manual analyze - people are already used to doing it
regularly as it's a postgresqlism ;). There are probably only extremely
few/unusual cases where such automatic stats would make things slower.

Well, OK I could be totally wrong - I'm far from a database expert and
don't have a wide experience of databases.

Cheerio,

Link.



Re: Postgresqlism & Vacuum?

From: Jurgen Defurne
Stephen J Lombardo wrote:
> Because small is a relative term. You will notice that Bruce does not
> say "where a table is less than 100 tuples" or something like that. [snip]
> Assume that it would take 3 disk accesses to search the index, and
> one to fetch the data page from the heap. So you have a total of 4 disk
> accesses to search on primary key and retrieve one row.

You have to consider that the B-tree index itself has the property of
clustering its keys. This means first of all that with a key like
employee_id, you can have as many as 150 keys in an index block (blocksize
2k, administrative data accounted for). After the first record is
retrieved, your index is walked sequentially, so you have 1 index access
per 150 keys. On account of this, your index consists of 666 blocks, and
to retrieve your 50000 records you need to read only 333 pages. This means
that on average you need only one disk access per 150 records.

> Now if the optimizer chose to use the index on
> that query it would take 4 disk accesses to locate each and every row (3 to
> search index, 1 to grab data page). [snip] Let's assume that the average
> size of a tuple is 500k.

I think you mean 500 bytes here, right?

> So PostgreSQL would pack about 16 tuples into a
> single page. Therefore doing a sequential search on the table would require
> 100,000/16, or 6250 disk accesses.

Since your data is also spread over 6250 pages, and you want to retrieve
50000 of them, there are two cases:

a) complete randomness
b) complete sequentialness

In the case of complete randomness, you might need to read every page from
disk (in which case you do have 50000 accesses), but this is nonsense,
because every file/DBMS system since COBOL implements some buffering
scheme, so the chance of finding your data in a block that is already in
memory grows. In the case of complete sequentialness, you only need to
read 3125 blocks. Most systems since COBOL also implement some read-ahead,
so disk access drops again.

> Now if it had statistics on the
> table the optimizer would know that there were only 5 different values in
> the states column of the table. So the optimizer would assume that 20% of
> the table would be returned from the query. [snip]

Two cases here:

a) state is indexed
b) state is not indexed

If state is indexed, then the optimizer doesn't need to do much work: just
find the first index block, then walk the tree until state <> 'NY',
regardless of the number of records to be retrieved. If it is not indexed,
then it should not even attempt to use the index, but just start searching
sequentially.

The main thing the optimizer does is deciding across which indices a query
should best be run. The statistics of VACUUM ANALYZE can help, but I think
that a good insight into the when and whereabouts of indices helps
performance more than adding indices ad infinitum, doing an analyze, and
then hoping that the system will run fast.

Jurgen
defurnj@glo.be