Обсуждение: How to best use 32 15k.7 300GB drives?

От:
Robert Schnabel
Дата:

HI,

I use PostgreSQL basically as a data warehouse to store all the genetic
data that our lab generates.  The only person that accesses the database
is myself and therefore I've had it housed on my workstation in my
office up till now.  However, it's getting time to move it to bigger
hardware.  I currently have a server that is basically only storing
backup images of all our other workstations so I'm going to move my
database onto it.  The server looks like this: Windows Server Enterprise
2008 R2 64-bit, AMD 2350 quad-core x2, 32GB RAM.  For my purposes the
CPUs and RAM are fine.  I currently have an Adaptec 52445+BBU controller
that has the OS (4 drive RAID5), FTP (4 drive RAID5) and two backup
arrays (8 drive each RAID0).  The backup arrays are in a 16 drive
external enclosure through an expander so I actually have 16 ports free
on the 52445 card.  I plan to remove 3 of the drives from my backup
arrays to make room for 3 - 73GB 15k.5 drives (re-purposed from my
workstation).  Two 16 drive enclosures with SAS2 expanders just arrived
as well as 36 Seagate 15k.7 300GB drives (ST3300657SS).  I also intend
on getting an Adaptec 6445 controller with the flash module when it
becomes available in about a month or two.  I already have several
Adaptec cards so I'd prefer to stick with them.

Here's the way I was planning using the new hardware:
xlog & wal: 3 - 73G 15k.5 RAID1+hot spare in enclosure A on 52445
controller
data: 22 - 300G 15k.7 RAID10 enclosure B&C on 6445 controller
indexes: 8 - 300G 15k.7 RAID10 enclosure C on 6445 controller
2 - 300G 15k.7 as hot spares enclosure C
4 spare 15k.7 for on the shelf

With this configuration I figure I'll have ~3TB for my main data tables
and 1TB for indexes.  Right now my database is 500GB total.  The 3:1
split reflects my current table structure and what I foresee coming down
the road in terms of new data.

So my questions are 1) am I'm crazy for doing this, 2) would you change
anything and 3) is it acceptable to put the xlog & wal (and perhaps tmp
filespace) on a different controller than everything else?  Please keep
in mind I'm a geneticist who happens to know a little bit about
bioinformatics and not the reverse. :-)

Thanks!
Bob

--

*************************************************
Robert Schnabel
Research Assistant Professor
University of Missouri-Columbia
Animal Sciences Unit, Rm.162
920 East Campus Drive
Columbia, MO 65211-5300
Phone: 573-884-4106
Fax:   573-882-6827
http://animalgenomics.missouri.edu

"...Socialist governments traditionally do make
a financial mess.  They always run out of other
people's money."

Margaret Thatcher, 5 February 1976
*************************************************


От:
david@lang.hm
Дата:

On Thu, 27 Jan 2011, Robert Schnabel wrote:

> HI,
>
> I use PostgreSQL basically as a data warehouse to store all the genetic data
> that our lab generates.  The only person that accesses the database is myself
> and therefore I've had it housed on my workstation in my office up till now.
> However, it's getting time to move it to bigger hardware.  I currently have a
> server that is basically only storing backup images of all our other
> workstations so I'm going to move my database onto it.  The server looks like
> this: Windows Server Enterprise 2008 R2 64-bit, AMD 2350 quad-core x2, 32GB
> RAM.  For my purposes the CPUs and RAM are fine.  I currently have an Adaptec
> 52445+BBU controller that has the OS (4 drive RAID5), FTP (4 drive RAID5) and
> two backup arrays (8 drive each RAID0).  The backup arrays are in a 16 drive
> external enclosure through an expander so I actually have 16 ports free on
> the 52445 card.  I plan to remove 3 of the drives from my backup arrays to
> make room for 3 - 73GB 15k.5 drives (re-purposed from my workstation).  Two
> 16 drive enclosures with SAS2 expanders just arrived as well as 36 Seagate
> 15k.7 300GB drives (ST3300657SS).  I also intend on getting an Adaptec 6445
> controller with the flash module when it becomes available in about a month
> or two.  I already have several Adaptec cards so I'd prefer to stick with
> them.
>
> Here's the way I was planning using the new hardware:
> xlog & wal: 3 - 73G 15k.5 RAID1+hot spare in enclosure A on 52445 controller
> data: 22 - 300G 15k.7 RAID10 enclosure B&C on 6445 controller
> indexes: 8 - 300G 15k.7 RAID10 enclosure C on 6445 controller
> 2 - 300G 15k.7 as hot spares enclosure C
> 4 spare 15k.7 for on the shelf
>
> With this configuration I figure I'll have ~3TB for my main data tables and
> 1TB for indexes.  Right now my database is 500GB total.  The 3:1 split
> reflects my current table structure and what I foresee coming down the road
> in terms of new data.
>
> So my questions are 1) am I'm crazy for doing this, 2) would you change
> anything and 3) is it acceptable to put the xlog & wal (and perhaps tmp
> filespace) on a different controller than everything else?  Please keep in
> mind I'm a geneticist who happens to know a little bit about bioinformatics
> and not the reverse. :-)

a number of questions spring to mind

how much of the time are you expecting to spend inserting data into this
system vs querying data from the system?

is data arriving continuously, or is it a matter of receiving a bunch of
data, inserting it, then querying it?

which do you need to optimize for, insert speed or query speed?

do you expect your queries to be searching for a subset of the data
scattered randomly throughlut the input data, or do you expect it to be
'grab this (relativly) sequential chunk of input data and manipulate it to
generate a report' type of thing

what is your connectvity to the raid enclosures? (does
putting 22 drives on one cable mean that you will be limited due to the
bandwidth of this cable rather than the performance of the drives)

can you do other forms of raid on these drives or only raid 10?

how critical is the data in this database? if it were to die would it just
be a matter of recreating it and reloading the data? or would you loose
irreplaceable data?

David Lang

От:
Alan Hodgson
Дата:

On January 27, 2011, Robert Schnabel <> wrote:

> So my questions are 1) am I'm crazy for doing this, 2) would you change

> anything and 3) is it acceptable to put the xlog & wal (and perhaps tmp

> filespace) on a different controller than everything else? Please keep

> in mind I'm a geneticist who happens to know a little bit about

> bioinformatics and not the reverse. :-)

>

Putting the WAL on a second controller does help, if you're write-heavy.

I tried separating indexes and data once on one server and didn't really notice that it helped much. Managing the space was problematic. I would suggest putting those together on a single RAID-10 of all the 300GB drives (minus a spare). It will probably outperform separate arrays most of the time, and be much easier to manage.

--

A hybrid Escalade is missing the point much in the same way that having a diet soda with your extra large pepperoni pizza is missing the point.

От:
Robert Schnabel
Дата:

On 1/27/2011 5:19 PM,  wrote:
> On Thu, 27 Jan 2011, Robert Schnabel wrote:
>
>> HI,
>>
>> I use PostgreSQL basically as a data warehouse to store all the genetic data
>> that our lab generates.  The only person that accesses the database is myself
>> and therefore I've had it housed on my workstation in my office up till now.
>> However, it's getting time to move it to bigger hardware.  I currently have a
>> server that is basically only storing backup images of all our other
>> workstations so I'm going to move my database onto it.  The server looks like
>> this: Windows Server Enterprise 2008 R2 64-bit, AMD 2350 quad-core x2, 32GB
>> RAM.  For my purposes the CPUs and RAM are fine.  I currently have an Adaptec
>> 52445+BBU controller that has the OS (4 drive RAID5), FTP (4 drive RAID5) and
>> two backup arrays (8 drive each RAID0).  The backup arrays are in a 16 drive
>> external enclosure through an expander so I actually have 16 ports free on
>> the 52445 card.  I plan to remove 3 of the drives from my backup arrays to
>> make room for 3 - 73GB 15k.5 drives (re-purposed from my workstation).  Two
>> 16 drive enclosures with SAS2 expanders just arrived as well as 36 Seagate
>> 15k.7 300GB drives (ST3300657SS).  I also intend on getting an Adaptec 6445
>> controller with the flash module when it becomes available in about a month
>> or two.  I already have several Adaptec cards so I'd prefer to stick with
>> them.
>>
>> Here's the way I was planning using the new hardware:
>> xlog&  wal: 3 - 73G 15k.5 RAID1+hot spare in enclosure A on 52445 controller
>> data: 22 - 300G 15k.7 RAID10 enclosure B&C on 6445 controller
>> indexes: 8 - 300G 15k.7 RAID10 enclosure C on 6445 controller
>> 2 - 300G 15k.7 as hot spares enclosure C
>> 4 spare 15k.7 for on the shelf
>>
>> With this configuration I figure I'll have ~3TB for my main data tables and
>> 1TB for indexes.  Right now my database is 500GB total.  The 3:1 split
>> reflects my current table structure and what I foresee coming down the road
>> in terms of new data.
>>
>> So my questions are 1) am I'm crazy for doing this, 2) would you change
>> anything and 3) is it acceptable to put the xlog&  wal (and perhaps tmp
>> filespace) on a different controller than everything else?  Please keep in
>> mind I'm a geneticist who happens to know a little bit about bioinformatics
>> and not the reverse. :-)
> a number of questions spring to mind
>
> how much of the time are you expecting to spend inserting data into this
> system vs querying data from the system?
>
> is data arriving continuously, or is it a matter of receiving a bunch of
> data, inserting it, then querying it?
>
> which do you need to optimize for, insert speed or query speed?
>
Bulk loads of GB of data via COPY from csv files once every couple
weeks.  I basically only have a couple different table "types" based on
the data going into them.  Each type is set up as inherited tables so
there is a new child table for each "sample" that is added.  Once the
bulk data is inserted into the tables I generally do some updates on
columns to set values which characterize the data.  These columns then
get indexed.  Basically once the initial manipulation is done the table
is then static and what I'm looking for is query speed.

> do you expect your queries to be searching for a subset of the data
> scattered randomly throughlut the input data, or do you expect it to be
> 'grab this (relativly) sequential chunk of input data and manipulate it to
> generate a report' type of thing
Generally it is grab a big sequential chunk of data and either dump it
to a csv or insert into another table.  I use external scripts to format
data.  My two big table structures look like this:

CREATE TABLE genotypes
(
   snp_number integer NOT NULL,
   sample_id integer NOT NULL,
   genotype smallint NOT NULL
)

There are ~58k unique snp_number.  Other tables will have upwards of
600-700k snp_number.  The child tables have a constraint based on
sample_id such as:
CONSTRAINT check100 CHECK (sample_id > 100000000 AND sample_id < 101000000)

The data is sorted by snp_number, sample_id.  So if I want the data for
a given sample_id it would be a block of ~58k rows.  The size of the
table depends on how many sample_id's there are.  My largest has ~30k
sample_id by 58k snp_number per sample.  The other big table (with
children) is "mutations" and is set up similarly so that I can access
individual tables (samples) based on constraints.  Each of these
children have between 5-60M records.

> what is your connectvity to the raid enclosures? (does
> putting 22 drives on one cable mean that you will be limited due to the
> bandwidth of this cable rather than the performance of the drives)
>
> can you do other forms of raid on these drives or only raid 10?
This is all direct attach storage via SAS2 so I'm guessing it's probably
limited to the single port link between the controller and the
expander.  Again, geneticist here not computer scientist. ;-)  The
enclosures have Areca ARC-8026-16 expanders.  I can basically do
whatever RAID level I want.

> how critical is the data in this database? if it were to die would it just
> be a matter of recreating it and reloading the data? or would you loose
> irreplaceable data?
>
> David Lang
All of the data could be reloaded.  Basically, once I get the data into
the database and I'm done manipulating it I create a backup copy/dump
which then gets stored at a couple different locations.  Like I said, I
really only do big loads/updates periodically so if it tanked all I'd be
out is whatever I did since the last backup/dump and some time.

My goal is to 1) have a fairly robust system so that I don't have to
spend my time rebuilding things and 2) be able to query the data
quickly.  Most of what I do are ad hoc queries.  I have an idea... "how
many X have Y in this set of Z samples" and write the query to get the
answer.  I can wait a couple minutes to get an answer but waiting an
hour is becoming tiresome.

Bob





От:
david@lang.hm
Дата:

sorry for not replying properly to your response, I managed to delete the
mail.

as I understand your data access pattern it's the following:

for the main table space:

bulk loads every couple of weeks. if the data is lost you can just reload
it.

searches tend to be extracting large sequential chunks of data, either to
external files or into different tables spaces.

for this table space, you are basically only inserting every couple of
weeks, and it sounds as if you do not really care how long it takes to
load the data.


first the disclaimer, I'm not a postgres expert, but I do have good
experiance with large amounts of data on linux systems (and especially
running into the limitations when doing it on the cheap ;-)


with this data pattern your WAL is meaningless (as it's only relavent for
isertes), and you may as well use raid6 as raid10 (both allow you to
utalize all drives for reads, but raid6 gives you 2 drives worth of
reducnancy while the wrong two drives on raid10 could kill the entire
array). You may even want to disable fsync on imports. It will save you a
lot of time, and if the system crashes during the load you can just
reinitialize and reload the data.

however, since you are going to be large sequential data transfers, you
want to be utalizing multiple SAS links, preferrably as evenly as
possible, so rather than putting all your data drives on one port, you may
want to spread them between ports so that your aggragate bandwidth to the
drives is higher (with this many high speed drives, this is a significant
limitation)


the usual reason for keeping the index drives separate is to avoid having
writes interact with index reads. Since you are not going to be doing both
at the same time, I don't know if it helps to separate your indexes.


now, if you pull the data from this main table into a smaller table for
analysis, you may want to do more interesting things with the drives that
you use for this smaller table as you are going to be loading data into
them more frequently.

David Lang


On Thu, 27 Jan 2011,  wrote:

> Date: Thu, 27 Jan 2011 15:19:32 -0800 (PST)
> From: 
> To: Robert Schnabel <>
> Cc: pgsql-performance <>
> Subject: Re: [PERFORM] How to best use 32 15k.7 300GB drives?
>
> On Thu, 27 Jan 2011, Robert Schnabel wrote:
>
>> HI,
>>
>> I use PostgreSQL basically as a data warehouse to store all the genetic
>> data that our lab generates.  The only person that accesses the database is
>> myself and therefore I've had it housed on my workstation in my office up
>> till now. However, it's getting time to move it to bigger hardware.  I
>> currently have a server that is basically only storing backup images of all
>> our other workstations so I'm going to move my database onto it.  The
>> server looks like this: Windows Server Enterprise 2008 R2 64-bit, AMD 2350
>> quad-core x2, 32GB RAM.  For my purposes the CPUs and RAM are fine.  I
>> currently have an Adaptec 52445+BBU controller that has the OS (4 drive
>> RAID5), FTP (4 drive RAID5) and two backup arrays (8 drive each RAID0).
>> The backup arrays are in a 16 drive external enclosure through an expander
>> so I actually have 16 ports free on the 52445 card.  I plan to remove 3 of
>> the drives from my backup arrays to make room for 3 - 73GB 15k.5 drives
>> (re-purposed from my workstation).  Two 16 drive enclosures with SAS2
>> expanders just arrived as well as 36 Seagate 15k.7 300GB drives
>> (ST3300657SS).  I also intend on getting an Adaptec 6445 controller with
>> the flash module when it becomes available in about a month or two.  I
>> already have several Adaptec cards so I'd prefer to stick with them.
>>
>> Here's the way I was planning using the new hardware:
>> xlog & wal: 3 - 73G 15k.5 RAID1+hot spare in enclosure A on 52445
>> controller
>> data: 22 - 300G 15k.7 RAID10 enclosure B&C on 6445 controller
>> indexes: 8 - 300G 15k.7 RAID10 enclosure C on 6445 controller
>> 2 - 300G 15k.7 as hot spares enclosure C
>> 4 spare 15k.7 for on the shelf
>>
>> With this configuration I figure I'll have ~3TB for my main data tables and
>> 1TB for indexes.  Right now my database is 500GB total.  The 3:1 split
>> reflects my current table structure and what I foresee coming down the road
>> in terms of new data.
>>
>> So my questions are 1) am I'm crazy for doing this, 2) would you change
>> anything and 3) is it acceptable to put the xlog & wal (and perhaps tmp
>> filespace) on a different controller than everything else?  Please keep in
>> mind I'm a geneticist who happens to know a little bit about bioinformatics
>> and not the reverse. :-)
>
> a number of questions spring to mind
>
> how much of the time are you expecting to spend inserting data into this
> system vs querying data from the system?
>
> is data arriving continuously, or is it a matter of receiving a bunch of
> data, inserting it, then querying it?
>
> which do you need to optimize for, insert speed or query speed?
>
> do you expect your queries to be searching for a subset of the data scattered
> randomly throughlut the input data, or do you expect it to be 'grab this
> (relativly) sequential chunk of input data and manipulate it to generate a
> report' type of thing
>
> what is your connectvity to the raid enclosures? (does putting 22 drives on
> one cable mean that you will be limited due to the bandwidth of this cable
> rather than the performance of the drives)
>
> can you do other forms of raid on these drives or only raid 10?
>
> how critical is the data in this database? if it were to die would it just be
> a matter of recreating it and reloading the data? or would you loose
> irreplaceable data?
>
> David Lang
>

От:
Stephen Frost
Дата:

Robert,

* Robert Schnabel () wrote:
> Once the bulk data is inserted into the tables I generally
> do some updates on columns to set values which characterize the
> data.

Please tell me you're not running actual full-table UPDATE statements...
You would be *much* better off either:
a) munging the data on the way in (if possible/reasonable)
b) loading the data into temp tables first, and then using INSERT
   statements to move the data into the 'final' tables WITH the new
   columns/info you want
c) considering if you can normalize the data into multiple tables and/or
   to cut down the columns to only what you need as you go through the
   above, too

A full-table UPDATE means you're basically making the table twice as big
as it needs to be, and it'll never get smaller..

> These columns then get indexed.  Basically once the initial
> manipulation is done the table is then static and what I'm looking
> for is query speed.

Sadly, this is the same type of DW needs that I've got (though with
telecomm data and phone calls, not genetic stuffs ;), and PG ends up
being limited by the fact that it can only use one core/thread to go
through the data with.

You might consider investing some time trying to figure out how to
parallelize your queries.  My approach to this has been to partition the
data (probably something you're doing already) into multiple tables and
then have shell/perl scripts which will run a given query against all of
the tables, dumping the results of that aggregation/analysis into other
tables, and then having a final 'merge' query.

> The data is sorted by snp_number, sample_id.  So if I want the data
> for a given sample_id it would be a block of ~58k rows.  The size of
> the table depends on how many sample_id's there are.  My largest has
> ~30k sample_id by 58k snp_number per sample.  The other big table
> (with children) is "mutations" and is set up similarly so that I can
> access individual tables (samples) based on constraints.  Each of
> these children have between 5-60M records.

Understand that indexes are only going to be used/useful, typically, if
the amount of records being returned is small relative to the size of
the table (eg: 5%).

> This is all direct attach storage via SAS2 so I'm guessing it's
> probably limited to the single port link between the controller and
> the expander.  Again, geneticist here not computer scientist. ;-)

That link certainly isn't going to help things..  You might consider how
or if you can improve that.

> All of the data could be reloaded.  Basically, once I get the data
> into the database and I'm done manipulating it I create a backup
> copy/dump which then gets stored at a couple different locations.

You might consider turning fsync off while you're doing these massive
data loads..  and make sure that you issue your 'CREATE TABLE' and your
'COPY' statements in the same transaction, and again, I suggest loading
into temporary (CREATE TEMPORARY TABLE) tables first, then doing the
CREATE TABLE/INSERT statement for the 'real' table.  Make sure that you
create *both* your constraints *and* your indexes *after* the table is
populated.

If you turn fsync off, make sure you turn it back on. :)

> My goal is to 1) have a fairly robust system so that I don't have to
> spend my time rebuilding things and 2) be able to query the data
> quickly.  Most of what I do are ad hoc queries.  I have an idea...
> "how many X have Y in this set of Z samples" and write the query to
> get the answer.  I can wait a couple minutes to get an answer but
> waiting an hour is becoming tiresome.

Have you done any analysis to see what the bottleneck actually is?  When
you run top, is your PG process constantly in 'D' state, or is it in 'R'
state, or what?  Might help figure some of that out.  Note that
parallelizing the query will help regardless of if it's disk bound or
CPU bound, when you're running on the kind of hardware you're talking
about (lots of spindles, multiple CPUs, etc).

    Thanks,

        Stephen

От:
Grant Johnson
Дата:

> Putting the WAL on a second controller does help, if you're write-heavy.
>
> I tried separating indexes and data once on one server and didn't
> really notice that it helped much. Managing the space was problematic.
> I would suggest putting those together on a single RAID-10 of all the
> 300GB drives (minus a spare). It will probably outperform separate
> arrays most of the time, and be much easier to manage.
>
> --
>
>
I like to use RAID 1, and let LVM do the striping.   That way I can add
more drives later too.

От:
Robert Schnabel
Дата:

<br /> On 1/28/2011 7:14 AM, Stephen Frost wrote: <blockquote cite="mid:"
type="cite"><prewrap="">Robert,
 

* Robert Schnabel (<a class="moz-txt-link-abbreviated" href="mailto:"></a>)
wrote:
</pre><blockquote type="cite"><pre wrap="">Once the bulk data is inserted into the tables I generally
do some updates on columns to set values which characterize the
data.  
</pre></blockquote><pre wrap="">
Please tell me you're not running actual full-table UPDATE statements...
You would be *much* better off either:
a) munging the data on the way in (if possible/reasonable)
b) loading the data into temp tables first, and then using INSERT  statements to move the data into the 'final' tables
WITHthe new  columns/info you want
 
c) considering if you can normalize the data into multiple tables and/or  to cut down the columns to only what you need
asyou go through the  above, too
 

A full-table UPDATE means you're basically making the table twice as big
as it needs to be, and it'll never get smaller..
</pre></blockquote> Depends on what you mean by that.  The tables that I'm concerned with look something like bigint
x2,char var x13, int x24, real x8, smallint x4 by about 65M rows, each.  I only do the updates on one table at a time. 
Thereal columns are actually null in the input csv file.  I run an update which basically uses some of the integer
columnsand calculates frequencies which go into the real columns.  Ditto with some of the other columns.  I don't do
thisbefore I upload the data because 1) it's easier this way and 2) I can't because some of the updates involve joins
toother tables to grab info that I can't do outside the database.  So yes, once the upload is done I run queries that
updateevery row for certain columns, not every column.  After I'm done with a table I run a VACUUM ANALYZE.  I'm really
notworried about what my table looks like on disk.  I actually take other steps also to avoid what you're talking
about.<br/><br /><blockquote cite="mid:" type="cite"><pre
wrap=""></pre><blockquotetype="cite"><pre wrap="">These columns then get indexed.  Basically once the initial
 
manipulation is done the table is then static and what I'm looking
for is query speed.
</pre></blockquote><pre wrap="">
Sadly, this is the same type of DW needs that I've got (though with
telecomm data and phone calls, not genetic stuffs ;), and PG ends up
being limited by the fact that it can only use one core/thread to go
through the data with.

You might consider investing some time trying to figure out how to
parallelize your queries.  My approach to this has been to partition the
data (probably something you're doing already) into multiple tables and
then have shell/perl scripts which will run a given query against all of
the tables, dumping the results of that aggregation/analysis into other
tables, and then having a final 'merge' query.
</pre></blockquote> Thanks for the advise but parallelizing/automating doesn't really do anything for me.  The data is
alreadypartitioned.  Think of it this way, you just got 65M new records with about 30 data points per record on an
individualsample.  You put it in a new table of it's own and now you want to characterize those 65M data points.  The
firstupdate flags about 60M of the rows as uninteresting so you move them to their own *uninteresting* table and
basicallynever really touch them again (but you cant get rid of them).  Now you're working with 5M that you're going to
characterizeinto about 20 categories based on what is in those 30 columns of data.  Do all the querying/updating then
indexand you're done.  Too long to describe but I cannot automate this.  I only update one partition at a time and only
aboutevery couple weeks or so.<br /><br /><br /><blockquote cite="mid:"
type="cite"><prewrap=""></pre><blockquote type="cite"><pre wrap="">The data is sorted by snp_number, sample_id.  So if
Iwant the data
 
for a given sample_id it would be a block of ~58k rows.  The size of
the table depends on how many sample_id's there are.  My largest has
~30k sample_id by 58k snp_number per sample.  The other big table
(with children) is "mutations" and is set up similarly so that I can
access individual tables (samples) based on constraints.  Each of
these children have between 5-60M records.
</pre></blockquote><pre wrap="">
Understand that indexes are only going to be used/useful, typically, if
the amount of records being returned is small relative to the size of
the table (eg: 5%).
</pre></blockquote> Yep, I understand that.  Even though they occupy a lot of space, I keep them around because there
aretimes when I need them.<br /><br /><br /><blockquote cite="mid:"
type="cite"><prewrap="">
 
</pre><blockquote type="cite"><pre wrap="">This is all direct attach storage via SAS2 so I'm guessing it's
probably limited to the single port link between the controller and
the expander.  Again, geneticist here not computer scientist. ;-)
</pre></blockquote><pre wrap="">
That link certainly isn't going to help things..  You might consider how
or if you can improve that.
</pre></blockquote> Suggestions???  It was previously suggested to split the drives on each array across the two
controllerports rather than have all the data drives on one port which makes sense.  Maybe I'm getting my terminology
wronghere but I'm talking about a single SFF-8088 link to each 16 drive enclosure.  What about two controllers, one for
eachenclosure?  Don't know if I have enough empty slots though.<br /><br /><blockquote
cite="mid:"type="cite"><pre wrap=""></pre><blockquote type="cite"><pre
wrap="">Allof the data could be reloaded.  Basically, once I get the data
 
into the database and I'm done manipulating it I create a backup
copy/dump which then gets stored at a couple different locations.
</pre></blockquote><pre wrap="">
You might consider turning fsync off while you're doing these massive
data loads..  and make sure that you issue your 'CREATE TABLE' and your
'COPY' statements in the same transaction, and again, I suggest loading
into temporary (CREATE TEMPORARY TABLE) tables first, then doing the
CREATE TABLE/INSERT statement for the 'real' table.  Make sure that you
create *both* your constraints *and* your indexes *after* the table is
populated.

If you turn fsync off, make sure you turn it back on. :)

</pre></blockquote> I haven't messed with fsync but maybe I'll try.  In general, I create my indexes and constraints
afterI'm done doing all the updating I need to do.  I made the mistake <b>*once*</b> of copying millions of rows into a
tablethat already had indexes.<br /><br /><blockquote cite="mid:"
type="cite"><prewrap=""></pre><blockquote type="cite"><pre wrap="">My goal is to 1) have a fairly robust system so that
Idon't have to
 
spend my time rebuilding things and 2) be able to query the data
quickly.  Most of what I do are ad hoc queries.  I have an idea...
"how many X have Y in this set of Z samples" and write the query to
get the answer.  I can wait a couple minutes to get an answer but
waiting an hour is becoming tiresome.
</pre></blockquote><pre wrap="">
Have you done any analysis to see what the bottleneck actually is?  When
you run top, is your PG process constantly in 'D' state, or is it in 'R'
state, or what?  Might help figure some of that out.  Note that
parallelizing the query will help regardless of if it's disk bound or
CPU bound, when you're running on the kind of hardware you're talking
about (lots of spindles, multiple CPUs, etc).
Thanks,
    Stephen
</pre></blockquote> It got lost from the original post but my database (9.0.0) is currently on my Windows XP 64-bit
workstationin my office on a 16 drive Seagate 15k.5 RAID5, no comments needed, I know, I'm moving it :-).  I'm moving
itto my server which is Windows Ent Server 2008 R2 64-bit 8 AMD cores & 32G ram and these new drives/controller. So
notop or lvm although I do keep an eye on things with Process Explorer.  Also, I don't have any single query that is a
problem. I have my canned queries which I run manually to update/manipulate/move data around every couple weeks when I
geta new chunk of data.  Other than that my queries are all ad hoc.  I'm just trying to get opinions on the best way to
setup these drives/controllers/enclosures for basically large sequential reads that quite often use indexes.<br /><br
/>So far I'd summarize the consensus as:<br /> 1) putting WAL on a separate array is worthless since I do very little
writes. What about if I put my temp tablespace on the same array with WAL & xlog?  I've noticed a lot of the ad hoc
queriesI run create tmp files, sometimes tens of GB.  I appreciate the fact that managing multiple tablespaces is not
aseasy as managing one but if it helps...<br /><br /> 2) Indexes on a separate array may not be all that useful since
I'mnot doing simultaneous reads/writes.<br /><br /> 3) Since I can very easily recreate the database in case of
crash/corruptionRAID10 may not be the best option.  However, if I do go with RAID10 split the drives between the two
enclosures(this assumes data & index arrays).  I've thought about RAID0 but quite frankly I really don't like
havingto rebuild things.  At some point my time becomes valuable.  RAID6 was suggested but rebuilding a 9TB RAID6 seems
scaryslow to me.<br /><br /> I appreciate the comments thus far.<br /> Bob<br /><br /><br /><br /><br /><br /><br /> 
От:
Scott Marlowe
Дата:

On Fri, Jan 28, 2011 at 9:39 AM, Robert Schnabel <> wrote:
> I can't do outside the database.  So yes, once the upload is done I run
> queries that update every row for certain columns, not every column.  After
> I'm done with a table I run a VACUUM ANALYZE.  I'm really not worried about
> what my table looks like on disk.  I actually take other steps also to avoid
> what you're talking about.

It will still get bloated.  If you update one column in one row in pg,
you now have two copies of that row in the database.  If you date 1
column in 1M rows, you now have 2M rows in the database (1M "dead"
rows, 1M "live" rows).  vacuum analyze will not get rid of them, but
will free them up to be used in future updates / inserts.  Vacuum full
or cluster will free up the space, but will lock the table while it
does so.

There's nothing wrong with whole table updates as part of an import
process, you just have to know to "clean up" after you're done, and
regular vacuum can't fix this issue, only vacuum full or reindex or
cluster.

От:
Robert Schnabel
Дата:

On 1/28/2011 11:00 AM, Scott Marlowe wrote:
> On Fri, Jan 28, 2011 at 9:39 AM, Robert Schnabel<>  wrote:
>> I can't do outside the database.  So yes, once the upload is done I run
>> queries that update every row for certain columns, not every column.  After
>> I'm done with a table I run a VACUUM ANALYZE.  I'm really not worried about
>> what my table looks like on disk.  I actually take other steps also to avoid
>> what you're talking about.
> It will still get bloated.  If you update one column in one row in pg,
> you now have two copies of that row in the database.  If you date 1
> column in 1M rows, you now have 2M rows in the database (1M "dead"
> rows, 1M "live" rows).  vacuum analyze will not get rid of them, but
> will free them up to be used in future updates / inserts.  Vacuum full
> or cluster will free up the space, but will lock the table while it
> does so.
>
> There's nothing wrong with whole table updates as part of an import
> process, you just have to know to "clean up" after you're done, and
> regular vacuum can't fix this issue, only vacuum full or reindex or
> cluster.

Those are exactly what I was referring to with my "other steps".  I just
don't always do them as soon as I'm done updating because sometimes I
want to query the table right away to find out something.  Yep, I found
out the hard way that regular VACUUM didn't help.



От:
Stephen Frost
Дата:

Robert,

* Robert Schnabel () wrote:
>    Depends on what you mean by that.  The tables that I'm concerned with look
>    something like bigint x2, char var x13, int x24, real x8, smallint x4 by
>    about 65M rows, each.  I only do the updates on one table at a time.  The
>    real columns are actually null in the input csv file.  I run an update
>    which basically uses some of the integer columns and calculates
>    frequencies which go into the real columns. 

Erm, I'm pretty sure you're still increasing the size of the resulting
tables by quite a bit by doing this process- which will slow down later
queries.

> Ditto with some of the other
>    columns.  I don't do this before I upload the data because 1) it's easier
>    this way and 2) I can't because some of the updates involve joins to other
>    tables to grab info that I can't do outside the database. 

That's fine- just first load the data into temporary tables and then do
INSERT INTO new_table SELECT <your query>;

instead.

> So yes, once
>    the upload is done I run queries that update every row for certain
>    columns, not every column.  After I'm done with a table I run a VACUUM
>    ANALYZE.  I'm really not worried about what my table looks like on disk. 

I thought you wanted it fast..?  If not, I'm not sure why you're
bothering to post to this list.  What it looks like on disk certainly
impacts how fast it is...

>    I actually take other steps also to avoid what you're talking about.

If you really don't feel like changing your process, you could just run
'CLUSTER' on the table, on whatever index you use most frequently, and
PG will rewrite the entire table for you, dropping all the dead rows,
etc.  You should then run VACUUM FREEZE on it.

>  These columns then get indexed.  Basically once the initial
>  manipulation is done the table is then static and what I'm looking
>  for is query speed.

Yes, I gathered that, making the table smaller on disk will improve
query speed.

>    Thanks for the advise but parallelizing/automating doesn't really do
>    anything for me.  The data is already partitioned.  Think of it this way,
>    you just got 65M new records with about 30 data points per record on an
>    individual sample.  You put it in a new table of it's own and now you want
>    to characterize those 65M data points.  The first update flags about 60M
>    of the rows as uninteresting so you move them to their own *uninteresting*
>    table and basically never really touch them again (but you cant get rid of
>    them).  Now you're working with 5M that you're going to characterize into
>    about 20 categories based on what is in those 30 columns of data.  Do all
>    the querying/updating then index and you're done.  Too long to describe
>    but I cannot automate this.  I only update one partition at a time and
>    only about every couple weeks or so.

I was referring to parallelizing queries *after* the data is all loaded,
etc.  I wasn't talking about the queries that you use during the load.

I presume that after the load you run some queries.  You can probably
parallelize those queries (most DW queries can be, be ime...).

>  That link certainly isn't going to help things..  You might consider how
>  or if you can improve that.
>
>    Suggestions???  It was previously suggested to split the drives on each
>    array across the two controller ports rather than have all the data drives
>    on one port which makes sense.  Maybe I'm getting my terminology wrong
>    here but I'm talking about a single SFF-8088 link to each 16 drive
>    enclosure.  What about two controllers, one for each enclosure?  Don't
>    know if I have enough empty slots though.

I don't know that you'd need a second controller (though it probably
wouldn't hurt if you could).  If there's only one way to attach the
enclosure, then so be it.  The issue is if the enclosures end up
multi-plexing the individual drives into fewer channels than there are
actual drives, hence creating a bottle-neck.  You would need different
enclosures to deal with that, if that's the case.

>    I haven't messed with fsync but maybe I'll try.  In general, I create my
>    indexes and constraints after I'm done doing all the updating I need to
>    do.  I made the mistake *once* of copying millions of rows into a table
>    that already had indexes.

Yeah, I bet that took a while.  As I said above, if you don't want to
change your process (which, tbh, I think would be faster if you were
doing INSERTs into a new table than full-table UPDATEs...), then you
should do a CLUSTER after you've created whatever is the most popular
INDEX, and then create your other indexes after that.

>    It got lost from the original post but my database (9.0.0) is currently on
>    my Windows XP 64-bit workstation in my office on a 16 drive Seagate 15k.5
>    RAID5, no comments needed, I know, I'm moving it :-).  I'm moving it to my
>    server which is Windows Ent Server 2008 R2 64-bit 8 AMD cores & 32G ram
>    and these new drives/controller.

Ughh...  No chance to get a Unix-based system (Linux, BSD, whatever) on
there instead?  I really don't think Windows Server is going to help
your situation one bit.. :(

>    1) putting WAL on a separate array is worthless since I do very little
>    writes.  What about if I put my temp tablespace on the same array with WAL
>    & xlog?  I've noticed a lot of the ad hoc queries I run create tmp files,
>    sometimes tens of GB.  I appreciate the fact that managing multiple
>    tablespaces is not as easy as managing one but if it helps...

That's not a bad idea but I'm not sure it'd make as much difference as
you think it would..  What would be better would be to *avoid*, at all
cost, letting it spill out to on-disk for queries.  The way to do that
is to make sure your work_mem is as high as PG will actually use (1GB),
and then to *parallelize* those queries using multiple PG connections,
so that each one will be able to use up that much memory.

For example, say you need to summarize the values for each of your
strands (or whatever) across 5 different "loads".  Your query might
look like:

select load,strand,sum(value) from parent_table group by load,strand;

Ideally, PG will use a hash table, key'd on load+strand, to store the
resulting summations in.  If it doesn't think the hash table will fit in
work_mem, it's going to SORT ALL OF YOUR DATA ON DISK first instead, and
then WALK THROUGH IT, sum'ing each section, then spitting out the result
to the client, and moving on.  This is *not* a fast process.  If doing
the same query on an individual child will use a hash table, then it'd
be hugely faster to query each load first, storing the results into
temporary tables.  What would be even *faster* would be the run all 5 of
those queries against the child tables in parallel (given that you have
over 5 CPUs and enough memory that you don't start swapping).

If it's still too big on the per-child basis, you might be able to use
conditionals to do the first 100 strands, then the next hundred, etc.

>    I appreciate the comments thus far.

Let's hope you'll always appreciate them. :)

    Thanks,

        Stephen

От:
Stephen Frost
Дата:

* Scott Marlowe () wrote:
> There's nothing wrong with whole table updates as part of an import
> process, you just have to know to "clean up" after you're done, and
> regular vacuum can't fix this issue, only vacuum full or reindex or
> cluster.

Just to share my experiences- I've found that creating a new table and
inserting into it is actually faster than doing full-table updates, if
that's an option for you.

    Thanks,

        Stephen

От:
Scott Carey
Дата:



On 1/27/11 4:11 PM, "Alan Hodgson" <> wrote:

On January 27, 2011, Robert Schnabel <> wrote:

> So my questions are 1) am I'm crazy for doing this, 2) would you change

> anything and 3) is it acceptable to put the xlog & wal (and perhaps tmp

> filespace) on a different controller than everything else? Please keep

> in mind I'm a geneticist who happens to know a little bit about

> bioinformatics and not the reverse. :-)

>

Putting the WAL on a second controller does help, if you're write-heavy.

I tried separating indexes and data once on one server and didn't really notice that it helped much. Managing the space was problematic. I would suggest putting those together on a single RAID-10 of all the 300GB drives (minus a spare). It will probably outperform separate arrays most of the time, and be much easier to manage.


If you go this route, I suggest two equally sized RAID 10's on different controllers fir index + data, with software raid-0 on top of that.  RAID 10 will max out a controller after 6 to 10 drives, usually.  Using the OS RAID 0 to aggregate the throughput of two controllers works great.

WAL only has to be a little bit faster than your network in most cases.  I've never seen it be a bottleneck on large bulk loads if it is on its own controller with 120MB/sec write throughput.  I suppose a bulk load from COPY might stress it a bit more, but CPU ends up the bottleneck in postgres once you have I/O hardware this capable.


--

A hybrid Escalade is missing the point much in the same way that having a diet soda with your extra large pepperoni pizza is missing the point.

От:
Scott Carey
Дата:


On 1/28/11 9:00 AM, "Scott Marlowe" <> wrote:

>On Fri, Jan 28, 2011 at 9:39 AM, Robert Schnabel <>
>wrote:
>> I can't do outside the database.  So yes, once the upload is done I run
>> queries that update every row for certain columns, not every column.
>>After
>> I'm done with a table I run a VACUUM ANALYZE.  I'm really not worried
>>about
>> what my table looks like on disk.  I actually take other steps also to
>>avoid
>> what you're talking about.
>
>It will still get bloated.  If you update one column in one row in pg,
>you now have two copies of that row in the database.  If you date 1
>column in 1M rows, you now have 2M rows in the database (1M "dead"
>rows, 1M "live" rows).  vacuum analyze will not get rid of them, but
>will free them up to be used in future updates / inserts.  Vacuum full
>or cluster will free up the space, but will lock the table while it
>does so.
>
>There's nothing wrong with whole table updates as part of an import
>process, you just have to know to "clean up" after you're done, and
>regular vacuum can't fix this issue, only vacuum full or reindex or
>cluster.


Also note that HOT will come into play if you have FILLFACTOR set
appropriately, so you won't get two copies of the row.  This is true if
the column being updated is small enough and not indexed.  It wastes some
space, but a lot less than the factor of two.

>
>--
>Sent via pgsql-performance mailing list ()
>To make changes to your subscription:
>http://www.postgresql.org/mailpref/pgsql-performance


От:
Scott Carey
Дата:


On 1/28/11 9:28 AM, "Stephen Frost" <> wrote:

>* Scott Marlowe () wrote:
>> There's nothing wrong with whole table updates as part of an import
>> process, you just have to know to "clean up" after you're done, and
>> regular vacuum can't fix this issue, only vacuum full or reindex or
>> cluster.
>
>Just to share my experiences- I've found that creating a new table and
>inserting into it is actually faster than doing full-table updates, if
>that's an option for you.

I wonder if postgres could automatically optimize that, if it thought that
it was going to update more than X% of a table, and HOT was not going to
help, then just create a new table file for XID's = or higher than the one
making the change, and leave the old one for old XIDs, then regular VACUUM
could toss out the old one if no more transactions could see it.


>
>    Thanks,
>
>        Stephen


От:
Scott Marlowe
Дата:

On Fri, Jan 28, 2011 at 10:44 AM, Scott Carey <> wrote:
> If you go this route, I suggest two equally sized RAID 10's on different
> controllers fir index + data, with software raid-0 on top of that.  RAID 10
> will max out a controller after 6 to 10 drives, usually.  Using the OS RAID
> 0 to aggregate the throughput of two controllers works great.

I often go one step further and just create a bunch of RAID-1 pairs
and use OS level RAID-0 on top of that.  On the LSI8888 cards that was
by far the fastest setup I tested.

От:
Robert Schnabel
Дата:

On 1/28/2011 11:14 AM, Stephen Frost wrote:

>>     It got lost from the original post but my database (9.0.0) is currently on
>>     my Windows XP 64-bit workstation in my office on a 16 drive Seagate 15k.5
>>     RAID5, no comments needed, I know, I'm moving it :-).  I'm moving it to my
>>     server which is Windows Ent Server 2008 R2 64-bit 8 AMD cores&  32G ram
>>     and these new drives/controller.
> Ughh...  No chance to get a Unix-based system (Linux, BSD, whatever) on
> there instead?  I really don't think Windows Server is going to help
> your situation one bit.. :(
>
Almost zero chance.  I basically admin the server myself so I can do
whatever I want but all permissions are controlled through campus active
directory and our departmental IT person doesn't do *nix.  So let's just
assume I'm stuck with Windows. The main purpose of the server at the
moment is to house our backup images.  I have two 9 TB arrays which I
use robocopy to mirror images once a day between our other server and my
workstation.  There's really not much of anything else ever eating up
CPUs on the server which is why I'm moving my database onto it.

>>     I appreciate the comments thus far.
> Let's hope you'll always appreciate them. :)
>
>     Thanks,
>
>         Stephen
Umm, that didn't quite read the way I meant it to when I wrote it.  All
comments are appreciated.  :-)

Seriously though, there have been points made that have made me rethink
how I go about processing data which I'm sure will help.  I'm in a
fairly fortunate position in that I can put these new drives on the
server and play around with different configurations while I maintain my
current setup on my workstation.  I guess I just need to experiment and
see what works.

Thanks again,
Bob



От:
Robert Schnabel
Дата:

<br /> On 1/28/2011 11:44 AM, Scott Carey wrote: <blockquote cite="mid:C9683F12.1E3D5%"
type="cite"><div><br/></div><div><br /></div><span id="OLK_SRC_BODY_SECTION"><div><div>On 1/27/11 4:11 PM, "Alan
Hodgson"<<a href="mailto:" moz-do-not-send="true"></a>>
wrote:</div></div><div><br/></div><blockquote id="MAC_OUTLOOK_ATTRIBUTION_BLOCKQUOTE" style="padding: 0pt 0pt 0pt 5px;
margin:0pt 0pt 0pt 5px;"><div><style type="text/css">
 
p, li { white-space: pre-wrap; }
</style><div style="font-family: 'Ubuntu'; font-size: 11pt;             font-weight: 400; font-style: normal;"><p
style="margin:0px; text-indent: 0px;">On January 27, 2011, Robert Schnabel <<a href="mailto:"
moz-do-not-send="true"></a>>wrote:<p style="margin: 0px; text-indent: 0px;">> So my
questionsare 1) am I'm crazy for doing this, 2) would you change<p style="margin: 0px; text-indent: 0px;">> anything
and3) is it acceptable to put the xlog & wal (and perhaps tmp<p style="margin: 0px; text-indent: 0px;">>
filespace)on a different controller than everything else? Please keep<p style="margin: 0px; text-indent: 0px;">> in
mindI'm a geneticist who happens to know a little bit about<p style="margin: 0px; text-indent: 0px;">>
bioinformaticsand not the reverse. :-)<p style="margin: 0px; text-indent: 0px;">> <p style="margin: 0px;
text-indent:0px;">Putting the WAL on a second controller does help, if you're write-heavy.<p style="margin: 0px;
text-indent:0px;">I tried separating indexes and data once on one server and didn't really notice that it helped much.
Managingthe space was problematic. I would suggest putting those together on a single RAID-10 of all the 300GB drives
(minusa spare). It will probably outperform separate arrays most of the time, and be much easier to
manage.</div></div></blockquote></span><div><br/></div><div>If you go this route, I suggest two equally sized RAID 10's
ondifferent controllers fir index + data, with software raid-0 on top of that.  RAID 10 will max out a controller after
6to 10 drives, usually.  Using the OS RAID 0 to aggregate the throughput of two controllers works great.</div><div><br
/></div><div>WALonly has to be a little bit faster than your network in most cases.  I've never seen it be a bottleneck
onlarge bulk loads if it is on its own controller with 120MB/sec write throughput.  I suppose a bulk load from COPY
mightstress it a bit more, but CPU ends up the bottleneck in postgres once you have I/O hardware this
capable.</div><div><br/></div></blockquote> Do you mean 14 drives in one box as RAID10's on one controller, then 14
drivesin the other box on a second controller, then software RAID0 each of the two RAID10's together essentially as a
single4 TB array?  Would you still recommend doing this with Windows?<br /> Bob<br /><br /><br /><br /> 
От:
Віталій Тимчишин
Дата:



2011/1/28 Scott Carey <>


On 1/28/11 9:28 AM, "Stephen Frost" <> wrote:

>* Scott Marlowe () wrote:
>> There's nothing wrong with whole table updates as part of an import
>> process, you just have to know to "clean up" after you're done, and
>> regular vacuum can't fix this issue, only vacuum full or reindex or
>> cluster.
>
>Just to share my experiences- I've found that creating a new table and
>inserting into it is actually faster than doing full-table updates, if
>that's an option for you.

I wonder if postgres could automatically optimize that, if it thought that
it was going to update more than X% of a table, and HOT was not going to
help, then just create a new table file for XID's = or higher than the one
making the change, and leave the old one for old XIDs, then regular VACUUM
could toss out the old one if no more transactions could see it.


I was thinking if a table file could be deleted if it has no single live row. And if this could be done by vacuum. In this case vacuum on table that was fully updated recently could be almost as good as cluster - any scan would skip such non-existing files really fast. Also almost no disk space would be wasted. 

--
Best regards,
 Vitalii Tymchyshyn
От:
Robert Haas
Дата:

2011/1/30 Віталій Тимчишин <>:
> I was thinking if a table file could be deleted if it has no single live
> row. And if this could be done by vacuum. In this case vacuum on table that
> was fully updated recently could be almost as good as cluster - any scan
> would skip such non-existing files really fast. Also almost no disk space
> would be wasted.

VACUUM actually already does something along these lines.  If there
are 1 or any larger number of entirely-free pages at the end of a
table, VACUUM will truncate them away.  In the degenerate case where
ALL pages are entirely-free, this results in zeroing out the file.

The problem with this is that it rarely does much.  Consider a table
with 1,000,000 pages, 50% of which contain live rows.  On average, how
many pages will this algorithm truncate away?  Answer: if the pages
containing live rows are randomly distributed, approximately one.
(Proof: There is a 50% chance that the last page will contain live
rows.  If so, we can't truncate anything.  If not, we can truncate one
page, and maybe more.  Now the chances of the next page being free are
499,999 in 999,999, or roughly one-half.  So we have an almost-25%
chance of being able to truncate at least two pages.  And so on.   So
you get roughly 1/2 + 1/4 + 1/8 + 1/16 + 1/32 + ... = 1 page.)

Your idea of having a set of heaps rather than a single heap is an
interesting one, but it's pretty much catering to the very specific
case of a full-table update.  I think the code changes needed would be
far too invasive to seriously contemplate doing it just for that one
case - although it is an important case that I would like to see us
improve.  Tom Lane previously objected to the idea of on-line table
compaction on the grounds that people's apps might break if CTIDs
changed under them, but I think a brawl between all the people who
want on-line table compaction and all the people who want to avoid
unexpected CTID changes would be pretty short.  A bigger problem - or
at least another problem - is that moving tuples this way is
cumbersome and expensive.  You basically have to move some tuples
(inserting new index entries for them), vacuum away the old index
entries (requiring a full scan of every index), and then repeat as
many times as necessary to shrink the table.  This is not exactly a
smooth maintenance procedure, or one that can be done without
significant disruption, but AFAIK nobody's come up with a better idea
yet.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

От:
Dave Crooke
Дата:

There is a process in Oracle which essentially allows you to do the equivalent of a CLUSTER in Postgres, but without locking the table, and so updates can continue throughout the process. It requires a bit of manual setup and fiddling (at least in Oracle 10g) .... this would probably scratch a lot of people's itches in this area. Of course, it's not trivial at all to implement :-(

The Oracle equivalent of "too many dead rows" is "too many chained rows" and that's where I've seen it used.

Cheers
Dave

2011/2/3 Robert Haas <>
2011/1/30 Віталій Тимчишин <>:
> I was thinking if a table file could be deleted if it has no single live
> row. And if this could be done by vacuum. In this case vacuum on table that
> was fully updated recently could be almost as good as cluster - any scan
> would skip such non-existing files really fast. Also almost no disk space
> would be wasted.

VACUUM actually already does something along these lines.  If there
are 1 or any larger number of entirely-free pages at the end of a
table, VACUUM will truncate them away.  In the degenerate case where
ALL pages are entirely-free, this results in zeroing out the file.

The problem with this is that it rarely does much.  Consider a table
with 1,000,000 pages, 50% of which contain live rows.  On average, how
many pages will this algorithm truncate away?  Answer: if the pages
containing live rows are randomly distributed, approximately one.
(Proof: There is a 50% chance that the last page will contain live
rows.  If so, we can't truncate anything.  If not, we can truncate one
page, and maybe more.  Now the chances of the next page being free are
499,999 in 999,999, or roughly one-half.  So we have an almost-25%
chance of being able to truncate at least two pages.  And so on.   So
you get roughly 1/2 + 1/4 + 1/8 + 1/16 + 1/32 + ... = 1 page.)

Your idea of having a set of heaps rather than a single heap is an
interesting one, but it's pretty much catering to the very specific
case of a full-table update.  I think the code changes needed would be
far too invasive to seriously contemplate doing it just for that one
case - although it is an important case that I would like to see us
improve.  Tom Lane previously objected to the idea of on-line table
compaction on the grounds that people's apps might break if CTIDs
changed under them, but I think a brawl between all the people who
want on-line table compaction and all the people who want to avoid
unexpected CTID changes would be pretty short.  A bigger problem - or
at least another problem - is that moving tuples this way is
cumbersome and expensive.  You basically have to move some tuples
(inserting new index entries for them), vacuum away the old index
entries (requiring a full scan of every index), and then repeat as
many times as necessary to shrink the table.  This is not exactly a
smooth maintenance procedure, or one that can be done without
significant disruption, but AFAIK nobody's come up with a better idea
yet.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-performance mailing list ()
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

От:
Vitalii Tymchyshyn
Дата:

03.02.11 20:42, Robert Haas написав(ла):
> 2011/1/30 Віталій Тимчишин<>:
>> I was thinking if a table file could be deleted if it has no single live
>> row. And if this could be done by vacuum. In this case vacuum on table that
>> was fully updated recently could be almost as good as cluster - any scan
>> would skip such non-existing files really fast. Also almost no disk space
>> would be wasted.
> VACUUM actually already does something along these lines.  If there
> are 1 or any larger number of entirely-free pages at the end of a
> table, VACUUM will truncate them away.  In the degenerate case where
> ALL pages are entirely-free, this results in zeroing out the file.
>
> The problem with this is that it rarely does much.  Consider a table
> with 1,000,000 pages, 50% of which contain live rows.  On average, how
> many pages will this algorithm truncate away?  Answer: if the pages
> containing live rows are randomly distributed, approximately one.
Yes, but take into account operations on a (by different reasons)
clustered tables, like removing archived data (yes I know, this is best
done with partitioning, but one must still go to a point when he will
decide to use partitioning :) ).
> Your idea of having a set of heaps rather than a single heap is an
> interesting one, but it's pretty much catering to the very specific
> case of a full-table update.  I think the code changes needed would be
> far too invasive to seriously contemplate doing it just for that one
> case - although it is an important case that I would like to see us
> improve.
Why do you expect such a invasive code changes? I know little about
postgresql code layering, but what I propose (with changing delete to
truncate) is:
1) Leave tuple addressing as it is now
2) Allow truncated files, treating non-existing part as if it contained
not used tuples
3) Make vacuum truncate file if it has not used tuples at the end.

The only (relatively) tricky thing I can see is synchronizing truncation
with parallel ongoing scan.

Best regards, Vitalii Tymchyshyn



От:
Robert Haas
Дата:

On Fri, Feb 4, 2011 at 4:19 AM, Vitalii Tymchyshyn <> wrote:
> Why do you expect such a invasive code changes? I know little about
> postgresql code layering, but what I propose (with changing delete to
> truncate) is:
> 1) Leave tuple addressing as it is now

i.e. a block number and a slot position within the block?

Seems like you'd need <file,block,slot>.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

От:
Віталій Тимчишин
Дата:



2011/2/4 Robert Haas <>
On Fri, Feb 4, 2011 at 4:19 AM, Vitalii Tymchyshyn <> wrote:
> Why do you expect such a invasive code changes? I know little about
> postgresql code layering, but what I propose (with changing delete to
> truncate) is:
> 1) Leave tuple addressing as it is now

i.e. a block number and a slot position within the block?

Seems like you'd need <file,block,slot>.

No, that's what I mean. Leave as it is. You will have file logical length (fixed for all but the last one, 1GB currently) and file actual legth that can be less (if file trucated). In the latter case you still have this "empty" blocks that don't exists at all. Actually the simplest implementation could be to tell to file system "drop this part of file and pretend it's all zeros", but I don't think many FSs (OSes?) supports this.
So, each  file still have it's fixed N blocks. And filenumber is still blocknumber / N.


--
Best regards,
 Vitalii Tymchyshyn