Обсуждение: Sustained inserts per sec ... ?

От:
"Marc G. Fournier"
Дата:

Just curious, but does anyone have an idea of what we are capable of?  I
realize that size of record would affect things, as well as hardware, but
if anyone has some ideas on max, with 'record size', that would be
appreciated ...

Thanks ...

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email:            Yahoo!: yscrappy              ICQ: 7615664

От:
Mike Nolan
Дата:

> Just curious, but does anyone have an idea of what we are capable of?  I
> realize that size of record would affect things, as well as hardware, but
> if anyone has some ideas on max, with 'record size', that would be
> appreciated ...

Well, I just did an insert of 27,500 records with 9 fields, averaging
around 118 bytes per record, each insert statement coming from a separate
SQL statement fed to psql, and it took a bit over 4 minutes, or about
106 inserts per second.

That seems consistent with what I get when I do a restore of a dump
file that has insert statement instead of COPY.

The hardware is a Dell dual Xeon system, the disks are mirrored SATA
drives with write buffering turned off.
--
Mike Nolan



От:
Christopher Petrilli
Дата:

On Apr 1, 2005 1:06 PM, Marc G. Fournier <> wrote:
>
> Just curious, but does anyone have an idea of what we are capable of?  I
> realize that size of record would affect things, as well as hardware, but
> if anyone has some ideas on max, with 'record size', that would be
> appreciated ...

On a AMD64/3000, 1Gb RAM, 2 SATA drives (1 for log, 1 for data), and
inserting using batches of 500-1000 rows, and also using the COPY
syntax, I have seen an interesting thing.  There are 5 indexes
involved here, BTW.  This is Linux 2.6 running on an XFS file system
(ext3 was even worse for me).

I can start at about 4,000 rows/second, but at about 1M rows, it
plummets, and by 4M it's taking 6-15 seconds to insert 1000 rows.
That's only about 15 rows/second, which is quite pathetic.   The
problem seems to be related to my indexes, since I have to keep them
online (the system in continually querying, as well).

This was an application originally written for MySQL/MYISAM, and it's
looking like PostgreSQL can't hold up for it, simply because it's "too
much database" if that makes sense.  The same box, running the MySQL
implementation (which uses no transactions) runs around 800-1000
rows/second systained.

Just a point of reference.  I'm trying to collect some data so that I
can provide some charts of the degredation, hoping to find the point
where it dies and thereby find the point where it needs attention.

Chris
--
| Christopher Petrilli
| 

От:
Tom Lane
Дата:

Christopher Petrilli <> writes:
> I can start at about 4,000 rows/second, but at about 1M rows, it
> plummets, and by 4M it's taking 6-15 seconds to insert 1000 rows.
> That's only about 15 rows/second, which is quite pathetic.   The
> problem seems to be related to my indexes, since I have to keep them
> online (the system in continually querying, as well).

I doubt it has anything to do with your indexes.  I'm wondering about
foreign key checks, myself.  What FKs exist on these tables?  Is the
"start" condition zero rows in every table?  Which PG version exactly?

            regards, tom lane

От:
Christopher Petrilli
Дата:

On Apr 1, 2005 3:42 PM, Tom Lane <> wrote:
> Christopher Petrilli <> writes:
> > I can start at about 4,000 rows/second, but at about 1M rows, it
> > plummets, and by 4M it's taking 6-15 seconds to insert 1000 rows.
> > That's only about 15 rows/second, which is quite pathetic.   The
> > problem seems to be related to my indexes, since I have to keep them
> > online (the system in continually querying, as well).
>
> I doubt it has anything to do with your indexes.  I'm wondering about
> foreign key checks, myself.  What FKs exist on these tables?  Is the
> "start" condition zero rows in every table?  Which PG version exactly?

Sure, I'm going to post something on my web site when I get some
numbers that will make this more valuable.  To answer your question:

1. No foreign keys (long story, but they just don't exist for this one table)
2. Start condition is zero. I'm using multiple inherited tables to
deal with the data "partitioning" since eventual size is billions of
rows. Each "partition" currently has 10M rows in it as a goal.
3. Version 8.0.2, however I started this experiment with 8.0.0.
4. fsync=off

What seems to happen is it slams into a "wall" of some sort, the
system goes into disk write frenzy (wait=90% CPU), and eventually
recovers and starts running for a while at a more normal speed.  What
I need though, is to not have that wall happen.  It is easier for me
to accept a constant degredation of 5%, rather than a 99% degredation
for short periods, as it can cause cascade problems in the system.

My goal is to gather some numbers, and post code + schema + analysis.

Chris
--
| Christopher Petrilli
| 

От:
"Joshua D. Drake"
Дата:

> What seems to happen is it slams into a "wall" of some sort, the
> system goes into disk write frenzy (wait=90% CPU), and eventually
> recovers and starts running for a while at a more normal speed.  What
> I need though, is to not have that wall happen.  It is easier for me
> to accept a constant degredation of 5%, rather than a 99% degredation
> for short periods, as it can cause cascade problems in the system.

Could this possibly be a checkpoint happening?

Also how many checkpoint segments do you have?


>
> My goal is to gather some numbers, and post code + schema + analysis.
>
> Chris
--
Command Prompt, Inc., Your PostgreSQL solutions company. 503-667-4564
Custom programming, 24x7 support, managed services, and hosting
Open Source Authors: plPHP, pgManage, Co-Authors: plPerlNG
Reliable replication, Mammoth Replicator - http://www.commandprompt.com/


От:
Tom Lane
Дата:

Christopher Petrilli <> writes:
> What seems to happen is it slams into a "wall" of some sort, the
> system goes into disk write frenzy (wait=90% CPU), and eventually
> recovers and starts running for a while at a more normal speed.

Checkpoint maybe?  If so, tweaking the bgwriter parameters might
improve matters.

            regards, tom lane

От:
Christopher Petrilli
Дата:

On Apr 1, 2005 3:53 PM, Joshua D. Drake <> wrote:
>
> > What seems to happen is it slams into a "wall" of some sort, the
> > system goes into disk write frenzy (wait=90% CPU), and eventually
> > recovers and starts running for a while at a more normal speed.  What
> > I need though, is to not have that wall happen.  It is easier for me
> > to accept a constant degredation of 5%, rather than a 99% degredation
> > for short periods, as it can cause cascade problems in the system.
>
> Could this possibly be a checkpoint happening?
>
> Also how many checkpoint segments do you have?

Changes to the postgresql.conf file from "default":

    maintenance_work_mem = 131072
    fsync = false
    checkpoint_segments = 32

I set the checkpoint segments up until it no longer complained about
them rolling over.  That was the best "advice" I could find online.
The maintenance_work_mem I upped to deal with indexes being updated
constantly.  And finally, since I'm willing to risk some loss, I
turned fsync off, since the system is on a UPS (or will be in
production) and carefully monitored.

I did actually wonder about the checkpoint_segments being an issue,
since it seems to me the more of them you have, the more you'd have to
deal with when checkpointing, and so you might actually want to turn
that number down to create a "smoother" behavior.

Unfortunately, the alot advice for 'loading data' doesn't apply when
you have a constant stream of load, rather than just sporadic.  Any
advice is more than appreciated.

Chris
--
| Christopher Petrilli
| 

От:
Tom Lane
Дата:

Mike Nolan <> writes:
> Well, I just did an insert of 27,500 records with 9 fields, averaging
> around 118 bytes per record, each insert statement coming from a separate
> SQL statement fed to psql, and it took a bit over 4 minutes, or about
> 106 inserts per second.

Is that with a separate transaction for each insert command?  I can get
significantly higher rates on my devel machine if the inserts are
bundled into transactions of reasonable length.

With fsync on, you can't expect to get more than about one commit per
disk rotation (with a single inserting process), so with say a 7200RPM
drive (120 revs/sec) the above is a pretty good fraction of the
theoretical limit.

            regards, tom lane

От:
Alex Turner
Дата:

1250/sec with record size average is 26 bytes
800/sec with record size average is 48 bytes.
250/sec with record size average is 618 bytes.

Data from pg_stats and our own job monitoring

System has four partitions, two raid 1s, a four disk RAID 10 and a six
disk RAID 10.
pg_xlog is on four disk RAID 10, database is on RAID 10.

Data is very spread out because database turnover time is very high,
so our performance is about double this with a fresh DB. (the data
half life is probably measurable in days or weeks).

Alex Turner
netEconomist

On Apr 1, 2005 1:06 PM, Marc G. Fournier <> wrote:
>
> Just curious, but does anyone have an idea of what we are capable of?  I
> realize that size of record would affect things, as well as hardware, but
> if anyone has some ideas on max, with 'record size', that would be
> appreciated ...
>
> Thanks ...
>
> ----
> Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
> Email:            Yahoo!: yscrappy              ICQ: 7615664
>
> ---------------------------(end of broadcast)---------------------------
> TIP 8: explain analyze is your friend
>

От:
Alex Turner
Дата:

Oh - this is with a seperate transaction per command.

fsync is on.

Alex Turner
netEconomist

On Apr 1, 2005 4:17 PM, Alex Turner <> wrote:
> 1250/sec with record size average is 26 bytes
> 800/sec with record size average is 48 bytes.
> 250/sec with record size average is 618 bytes.
>
> Data from pg_stats and our own job monitoring
>
> System has four partitions, two raid 1s, a four disk RAID 10 and a six
> disk RAID 10.
> pg_xlog is on four disk RAID 10, database is on RAID 10.
>
> Data is very spread out because database turnover time is very high,
> so our performance is about double this with a fresh DB. (the data
> half life is probably measurable in days or weeks).
>
> Alex Turner
> netEconomist
>
> On Apr 1, 2005 1:06 PM, Marc G. Fournier <> wrote:
> >
> > Just curious, but does anyone have an idea of what we are capable of?  I
> > realize that size of record would affect things, as well as hardware, but
> > if anyone has some ideas on max, with 'record size', that would be
> > appreciated ...
> >
> > Thanks ...
> >
> > ----
> > Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
> > Email:            Yahoo!: yscrappy              ICQ: 7615664
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 8: explain analyze is your friend
> >
>

От:
Tom Lane
Дата:

Alex Turner <> writes:
> On Apr 1, 2005 4:17 PM, Alex Turner <> wrote:
>> 1250/sec with record size average is 26 bytes
>> 800/sec with record size average is 48 bytes.
>> 250/sec with record size average is 618 bytes.

> Oh - this is with a seperate transaction per command.
> fsync is on.

[ raised eyebrow... ]  What kind of disk hardware is that exactly, and
does it have write cache enabled?  It's hard to believe those numbers
if not.

Write caching is fine if it's done in a battery-backed cache, which you
can get in the higher-end hardware RAID controllers.  Otherwise you're
going to have problems whenever the power goes away unexpectedly.

            regards, tom lane

От:
Mike Nolan
Дата:

> > Well, I just did an insert of 27,500 records with 9 fields, averaging
> > around 118 bytes per record, each insert statement coming from a separate
> > SQL statement fed to psql, and it took a bit over 4 minutes, or about
> > 106 inserts per second.
>
> Is that with a separate transaction for each insert command?  I can get
> significantly higher rates on my devel machine if the inserts are
> bundled into transactions of reasonable length.

That's with autocommit on.  If I do it as a single transaction block,
it takes about 6.5 seconds, which is about 4200 transactions/second.
--
Mike Nolan

От:
Christopher Petrilli
Дата:

On Apr 1, 2005 3:59 PM, Christopher Petrilli <> wrote:
> On Apr 1, 2005 3:53 PM, Joshua D. Drake <> wrote:
> >
> > > What seems to happen is it slams into a "wall" of some sort, the
> > > system goes into disk write frenzy (wait=90% CPU), and eventually
> > > recovers and starts running for a while at a more normal speed.  What
> > > I need though, is to not have that wall happen.  It is easier for me
> > > to accept a constant degredation of 5%, rather than a 99% degredation
> > > for short periods, as it can cause cascade problems in the system.
> >
> > Could this possibly be a checkpoint happening?
> >
> > Also how many checkpoint segments do you have?
>
> Changes to the postgresql.conf file from "default":
>
>     maintenance_work_mem = 131072
>     fsync = false
>     checkpoint_segments = 32

I've now had a chance to run a couple more tests, and here's two
graphs of the time required to insert (via COPY from a file) 500
records at a time:

http://www.amber.org/~petrilli/diagrams/pgsql_copy500.png
http://www.amber.org/~petrilli/diagrams/pgsql_copy500_bgwriter.png

The first is with the above changes, the second contains two
additional modificiations to the configuration:

    bgwriter_percent = 25
    bgwriter_maxpages = 1000

To my, likely faulty, intuition, it would seem that there is a backup
happening in the moving of data from the WAL to the final resting
place, and that by increasing these I could pull that forward.  As you
can see from the charts, that doesn't seem to have any major impact.
The point, in the rough middle, is where the program begins inserting
into a new table (inherited). The X axis is the "total" number of rows
inserted. The table has:

    * 21 columns (nothing too strange)
    * No OIDS
    * 5 indexes, including the primary key on a string

They are created by creating a main table, then doing:

    CREATE TABLE foo001 INHERITS (foos);

And then recreating all the indexes.

Thoughts? Any advice would be more than appreciated.

Chris
--
| Christopher Petrilli
| 

От:
Alex Turner
Дата:

Yup, Battery backed, cache enabled.  6 drive RAID 10, and 4 drive RAID
10, and 2xRAID 1.

It's a 3ware 9500S-8MI - not bad for $450 plus BBU.

Alex Turner
netEconomist

On Apr 1, 2005 6:03 PM, Tom Lane <> wrote:
> Alex Turner <> writes:
> > On Apr 1, 2005 4:17 PM, Alex Turner <> wrote:
> >> 1250/sec with record size average is 26 bytes
> >> 800/sec with record size average is 48 bytes.
> >> 250/sec with record size average is 618 bytes.
>
> > Oh - this is with a seperate transaction per command.
> > fsync is on.
>
> [ raised eyebrow... ]  What kind of disk hardware is that exactly, and
> does it have write cache enabled?  It's hard to believe those numbers
> if not.
>
> Write caching is fine if it's done in a battery-backed cache, which you
> can get in the higher-end hardware RAID controllers.  Otherwise you're
> going to have problems whenever the power goes away unexpectedly.
>
>                         regards, tom lane
>

От:
Tom Lane
Дата:

Christopher Petrilli <> writes:
> The table has:
>     * 21 columns (nothing too strange)
>     * No OIDS
>     * 5 indexes, including the primary key on a string

Could we see the *exact* SQL definitions of the table and indexes?
Also some sample data would be interesting.  I'm wondering for example
about the incidence of duplicate index keys.

            regards, tom lane

От:
Christopher Petrilli
Дата:

On Apr 4, 2005 11:52 AM, Tom Lane <> wrote:
> Christopher Petrilli <> writes:
> > The table has:
> >     * 21 columns (nothing too strange)
> >     * No OIDS
> >     * 5 indexes, including the primary key on a string
>
> Could we see the *exact* SQL definitions of the table and indexes?
> Also some sample data would be interesting.  I'm wondering for example
> about the incidence of duplicate index keys.

Of course, this is a bit cleansed, since it's an internal project, but
only the column names are changed:

CREATE TABLE foos (
    foo_id                VARCHAR(32),
    s_ts               INTEGER NOT NULL,
    c_ts                  INTEGER NOT NULL,
    bar_id               INTEGER NOT NULL,
    proto               INTEGER NOT NULL,
    src_ip              INT8 NOT NULL,
    dst_ip              INT8 NOT NULL,
    src_port            INTEGER,
    dst_port            INTEGER,
    nated               INTEGER NOT NULL,
    src_nat_ip          INT8,
    dst_nat_ip          INT8,
    src_nat_port        INTEGER,
    dst_nat_port        INTEGER,
    foo_class             INTEGER NOT NULL,
    foo_type              INTEGER NOT NULL,
    src_bar              INTEGER NOT NULL,
    dst_bar              INTEGER NOT NULL,
    user_name           VARCHAR(255),
    info                TEXT
) WITHOUT OIDS;
ALTER TABLE foos ADD CONSTRAINT foos_foo_id_pk UNIQUE (foo_id);
CREATE INDEX foos_c_ts_idx ON foos(conduit_ts);
CREATE INDEX foos_src_ip_idx ON foos(src_ip);
CREATE INDEX foos_dst_ip_idx ON foos(dst_ip);
CREATE INDEX foos_foo_class_idx ON foos(foo_class);
CREATE INDEX foos_foo_type_idx ON foos(foo_type);


CREATE TABLE foos001 ( ) INHERITS (foos) WITHOUT OIDS;
ALTER TABLE foos001 ADD CONSTRAINT foos001_foo_id_pk UNIQUE (foo_id);
CREATE INDEX foos001_c_ts_idx ON foos001(conduit_ts);
CREATE INDEX foos001_src_ip_idx ON foos001(src_ip);
CREATE INDEX foos001_dst_ip_idx ON foos001(dst_ip);
CREATE INDEX foos001_foo_class_idx ON foos001(foo_class);
CREATE INDEX foos001_foo_type_idx ON foos001(foo_type);

That continues on, but you get the idea...

So, as you asked about data content, specifically regarding indices,
here's what the "simulator" creates:

foo_id - 32 character UID (generated by the UUID function in mxTools,
which looks like '00beef19420053c64f3f01aeb0b4a2a5', and varies in the
upper components more than the lower.

*_ts - UNIX epoch timestamps, sequential.  There's a long story behind
not using DATETIME format, but if that's the big issue, it can be
dealt with.

*_ip - Randomly selected 32-bit integers from a pre-generated list
containing about 500 different numbers ranging from 3232235500 to
3232236031. This is unfortunately, not too atypical from the "real
world".

*_class - Randomly selected 1-100 (again, not atypical, although
normal distribution would be less random)

*_type - Randomly selected 1-10000 (not atypical, and more random than
in real world)

Hopefully this helps?

Chris
--
| Christopher Petrilli
| 

От:
Tom Lane
Дата:

Christopher Petrilli <> writes:
> On Apr 4, 2005 11:52 AM, Tom Lane <> wrote:
>> Could we see the *exact* SQL definitions of the table and indexes?

> Of course, this is a bit cleansed, since it's an internal project, but
> only the column names are changed:

Thanks.  No smoking gun in sight there.  But out of curiosity, can you
do a test run with *no* indexes on the table, just to see if it behaves
any differently?  Basically I was wondering if index overhead might be
part of the problem.

Also, the X-axis on your graphs seems to be total number of rows
inserted ... can you relate that to elapsed real time for us?

            regards, tom lane

От:
Christopher Petrilli
Дата:

On Apr 4, 2005 12:23 PM, Tom Lane <> wrote:
> Christopher Petrilli <> writes:
> > On Apr 4, 2005 11:52 AM, Tom Lane <> wrote:
> >> Could we see the *exact* SQL definitions of the table and indexes?
>
> > Of course, this is a bit cleansed, since it's an internal project, but
> > only the column names are changed:
>
> Thanks.  No smoking gun in sight there.  But out of curiosity, can you
> do a test run with *no* indexes on the table, just to see if it behaves
> any differently?  Basically I was wondering if index overhead might be
> part of the problem.

Running now, but it'll take a while since I have a 3/4 second pause
after each COPY to better reflect "real world" ... the application
does 1 COPY per second, or whenever it hits 1000 entries. This seemed
to be a sane way to deal with it, and not burden the system with
needless index balancing, etc.

> Also, the X-axis on your graphs seems to be total number of rows
> inserted ... can you relate that to elapsed real time for us?

Sure, like I said, there's a 3/4 second sleep between each COPY,
regardless of how long it took (which well, isn't quite right, but
close enough for this test).  I've created a PNG with the X axies
reflecting "elapsed time":

http://www.amber.org/~petrilli/diagrams/pgsql_copyperf_timeline.png

In addition, I've put up the raw data I used:

http://www.amber.org/~petrilli/diagrams/results_timeline.txt

The columns are rowcount, elapsed time, instance time.
Hopefully this might help some?  This machine has nothing else running
on it other than the normal stripped down background processes (like
sshd).

--
| Christopher Petrilli
| 

От:
Simon Riggs
Дата:

On Mon, 2005-04-04 at 09:48 -0400, Christopher Petrilli wrote:
> The point, in the rough middle, is where the program begins inserting
> into a new table (inherited). The X axis is the "total" number of rows
> inserted.

and you also mention the same data plotted with elapsed time:
http://www.amber.org/~petrilli/diagrams/pgsql_copyperf_timeline.png

Your graphs look identical to others I've seen, so I think we're
touching on something wider than your specific situation. The big
difference is that things seem to go back to high performance when you
switch to a new inherited table.

I'm very interested in the graphs of elapsed time for COPY 500 rows
against rows inserted. The simplistic inference from those graphs are
that if you only inserted 5 million rows into each table, rather than 10
million rows then everything would be much quicker. I hope this doesn't
work, but could you try that to see if it works? I'd like to rule out a
function of "number of rows" as an issue, or focus in on it depending
upon the results.

Q: Please can you confirm that the discontinuity on the graph at around
5000 elapsed seconds matches EXACTLY with the switch from one table to
another? That is an important point.

Q: How many data files are there for these relations? Wouldn't be two,
by any chance, when we have 10 million rows in them?

Q: What is the average row length?
About 150-160 bytes?

Thanks,

Best Regards, Simon Riggs


От:
Christopher Petrilli
Дата:

On Apr 4, 2005 3:46 PM, Simon Riggs <> wrote:
> On Mon, 2005-04-04 at 09:48 -0400, Christopher Petrilli wrote:
> > The point, in the rough middle, is where the program begins inserting
> > into a new table (inherited). The X axis is the "total" number of rows
> > inserted.
>
> and you also mention the same data plotted with elapsed time:
> http://www.amber.org/~petrilli/diagrams/pgsql_copyperf_timeline.png
>
> Your graphs look identical to others I've seen, so I think we're
> touching on something wider than your specific situation. The big
> difference is that things seem to go back to high performance when you
> switch to a new inherited table.

This is correct.

> I'm very interested in the graphs of elapsed time for COPY 500 rows
> against rows inserted. The simplistic inference from those graphs are
> that if you only inserted 5 million rows into each table, rather than 10
> million rows then everything would be much quicker. I hope this doesn't
> work, but could you try that to see if it works? I'd like to rule out a
> function of "number of rows" as an issue, or focus in on it depending
> upon the results.
>
> Q: Please can you confirm that the discontinuity on the graph at around
> 5000 elapsed seconds matches EXACTLY with the switch from one table to
> another? That is an important point.

Well, the change over happens at 51593.395205 seconds :-)  Here's two
lines from the results with row count and time added:

10000000    51584.9818912    8.41331386566
10000500    51593.395205    0.416964054108

Note that 10M is when it swaps.  I see no reason to interpret it
differently, so it seems to be totally based around switching tables
(and thereby indices).

> Q: How many data files are there for these relations? Wouldn't be two,
> by any chance, when we have 10 million rows in them?

I allow PostgreSQL to manage all the data files itself, so here's the
default tablespace:

total 48
drwx------  2 pgsql pgsql 4096 Jan 26 20:59 1
drwx------  2 pgsql pgsql 4096 Dec 17 19:15 17229
drwx------  2 pgsql pgsql 4096 Feb 16 17:55 26385357
drwx------  2 pgsql pgsql 4096 Mar 24 23:56 26425059
drwx------  2 pgsql pgsql 8192 Mar 28 11:31 26459063
drwx------  2 pgsql pgsql 8192 Mar 31 23:54 26475755
drwx------  2 pgsql pgsql 4096 Apr  4 15:07 26488263
[root@bigbird base]# du
16624   ./26425059
5028    ./26385357
5660    ./26459063
4636    ./17229
6796    ./26475755
4780    ./1
1862428 ./26488263
1905952 .

> Q: What is the average row length?
> About 150-160 bytes?

Raw data is around 150bytes, after insertion, I'd need to do some
other calculations.

Chris
--
| Christopher Petrilli
| 

От:
Simon Riggs
Дата:

On Mon, 2005-04-04 at 15:56 -0400, Christopher Petrilli wrote:
> On Apr 4, 2005 3:46 PM, Simon Riggs <> wrote:
> > On Mon, 2005-04-04 at 09:48 -0400, Christopher Petrilli wrote:
> > > The point, in the rough middle, is where the program begins inserting
> > > into a new table (inherited). The X axis is the "total" number of rows
> > > inserted.
> >
> > and you also mention the same data plotted with elapsed time:
> > http://www.amber.org/~petrilli/diagrams/pgsql_copyperf_timeline.png
> >
> > Your graphs look identical to others I've seen, so I think we're
> > touching on something wider than your specific situation. The big
> > difference is that things seem to go back to high performance when you
> > switch to a new inherited table.
>
> This is correct.
>
> > I'm very interested in the graphs of elapsed time for COPY 500 rows
> > against rows inserted. The simplistic inference from those graphs are
> > that if you only inserted 5 million rows into each table, rather than 10
> > million rows then everything would be much quicker. I hope this doesn't
> > work, but could you try that to see if it works? I'd like to rule out a
> > function of "number of rows" as an issue, or focus in on it depending
> > upon the results.

Any chance of running a multiple load of 4 million rows per table,
leaving the test running for at least 3 tables worth (12+ M rows)?

> >
> > Q: Please can you confirm that the discontinuity on the graph at around
> > 5000 elapsed seconds matches EXACTLY with the switch from one table to
> > another? That is an important point.
>
> Well, the change over happens at 51593.395205 seconds :-)  Here's two
> lines from the results with row count and time added:
>
> 10000000    51584.9818912    8.41331386566
> 10000500    51593.395205    0.416964054108
>
> Note that 10M is when it swaps.  I see no reason to interpret it
> differently, so it seems to be totally based around switching tables
> (and thereby indices).

OK, but do you have some other external knowledge that it is definitely
happening at that time? Your argument above seems slightly circular to
me.

This is really important because we need to know whether it ties in with
that event, or some other.

Have you run this for more than 2 files, say 3 or more?

You COMMIT after each 500 rows?

> > Q: How many data files are there for these relations? Wouldn't be two,
> > by any chance, when we have 10 million rows in them?
>
> I allow PostgreSQL to manage all the data files itself, so here's the
> default tablespace:
>
> total 48
> drwx------  2 pgsql pgsql 4096 Jan 26 20:59 1
> drwx------  2 pgsql pgsql 4096 Dec 17 19:15 17229
> drwx------  2 pgsql pgsql 4096 Feb 16 17:55 26385357
> drwx------  2 pgsql pgsql 4096 Mar 24 23:56 26425059
> drwx------  2 pgsql pgsql 8192 Mar 28 11:31 26459063
> drwx------  2 pgsql pgsql 8192 Mar 31 23:54 26475755
> drwx------  2 pgsql pgsql 4096 Apr  4 15:07 26488263
> [root@bigbird base]# du
> 16624   ./26425059
> 5028    ./26385357
> 5660    ./26459063
> 4636    ./17229
> 6796    ./26475755
> 4780    ./1
> 1862428 ./26488263
> 1905952 .

OK. Please...
cd $PGDATA/base/26488263
ls -l

I'm looking for the number of files associated with each inherited table
(heap).

> > Q: What is the average row length?
> > About 150-160 bytes?
>
> Raw data is around 150bytes, after insertion, I'd need to do some
> other calculations.

By my calculations, you should have just 2 data files per 10M rows for
the main table. The performance degradation seems to coincide with the
point where we move to inserting into the second of the two files.

I'm not looking for explanations yet, just examining coincidences and
trying to predict the behaviour based upon conjectures.

Best Regards, Simon Riggs


От:
Christopher Petrilli
Дата:

On Apr 4, 2005 4:11 PM, Simon Riggs <> wrote:
> > > I'm very interested in the graphs of elapsed time for COPY 500 rows
> > > against rows inserted. The simplistic inference from those graphs are
> > > that if you only inserted 5 million rows into each table, rather than 10
> > > million rows then everything would be much quicker. I hope this doesn't
> > > work, but could you try that to see if it works? I'd like to rule out a
> > > function of "number of rows" as an issue, or focus in on it depending
> > > upon the results.
>
> Any chance of running a multiple load of 4 million rows per table,
> leaving the test running for at least 3 tables worth (12+ M rows)?

As soon as I get done running a test without indexes :-)

> > > Q: Please can you confirm that the discontinuity on the graph at around
> > > 5000 elapsed seconds matches EXACTLY with the switch from one table to
> > > another? That is an important point.
> >
> > Well, the change over happens at 51593.395205 seconds :-)  Here's two
> > lines from the results with row count and time added:
> >
> > 10000000      51584.9818912   8.41331386566
> > 10000500      51593.395205    0.416964054108
> >
> > Note that 10M is when it swaps.  I see no reason to interpret it
> > differently, so it seems to be totally based around switching tables
> > (and thereby indices).
>
> OK, but do you have some other external knowledge that it is definitely
> happening at that time? Your argument above seems slightly circular to
> me.

My program *SPECIFICALLY* counts to 10M then switches the COPY statement.

> This is really important because we need to know whether it ties in with
> that event, or some other.

Unless basic integer math is failing, it's definately happening at 10M rows.

> Have you run this for more than 2 files, say 3 or more?

You mean, 3 or more tables? I'm not sure which type of files you are
reffering to here.

> You COMMIT after each 500 rows?

This is done using COPY syntax, not INSERT syntax. So I suppose "yes"
I do.  The file that is being used for COPY is kept on a ramdisk.

> OK. Please...
> cd $PGDATA/base/26488263
> ls -l

[root@bigbird base]# cd 26488263/
[root@bigbird 26488263]# ls -l
total 2003740
-rw-------  1 pgsql pgsql      49152 Apr  4 12:26 1247
-rw-------  1 pgsql pgsql     245760 Apr  4 12:27 1249
-rw-------  1 pgsql pgsql     573440 Apr  4 12:24 1255
-rw-------  1 pgsql pgsql      57344 Apr  4 14:44 1259
-rw-------  1 pgsql pgsql          0 Apr  4 12:24 16384
-rw-------  1 pgsql pgsql       8192 Apr  4 12:26 16386
-rw-------  1 pgsql pgsql       8192 Apr  4 12:26 16388
-rw-------  1 pgsql pgsql      24576 Apr  4 12:29 16390
-rw-------  1 pgsql pgsql     106496 Apr  4 12:24 16392
-rw-------  1 pgsql pgsql      16384 Apr  4 12:24 16394
-rw-------  1 pgsql pgsql       8192 Apr  4 12:24 16396
-rw-------  1 pgsql pgsql      16384 Apr  4 12:24 16398
-rw-------  1 pgsql pgsql       8192 Apr  4 12:24 16400
-rw-------  1 pgsql pgsql       8192 Apr  4 12:24 16402
-rw-------  1 pgsql pgsql          0 Apr  4 12:24 16404
-rw-------  1 pgsql pgsql       8192 Apr  4 12:24 16406
-rw-------  1 pgsql pgsql     212992 Apr  4 14:44 16408
-rw-------  1 pgsql pgsql      49152 Apr  4 12:24 16410
-rw-------  1 pgsql pgsql       8192 Apr  4 12:24 16412
-rw-------  1 pgsql pgsql          0 Apr  4 12:24 16414
-rw-------  1 pgsql pgsql     114688 Apr  4 12:24 16416
-rw-------  1 pgsql pgsql      16384 Apr  4 12:24 16418
-rw-------  1 pgsql pgsql       8192 Apr  4 12:24 16672
-rw-------  1 pgsql pgsql      16384 Apr  4 12:24 16674
-rw-------  1 pgsql pgsql     237568 Apr  4 12:26 16676
-rw-------  1 pgsql pgsql      16384 Apr  4 12:24 16678
-rw-------  1 pgsql pgsql      16384 Apr  4 12:24 16679
-rw-------  1 pgsql pgsql      16384 Apr  4 12:24 16680
-rw-------  1 pgsql pgsql      16384 Apr  4 12:24 16681
-rw-------  1 pgsql pgsql      16384 Apr  4 12:24 16682
-rw-------  1 pgsql pgsql      16384 Apr  4 12:24 16683
-rw-------  1 pgsql pgsql       8192 Apr  4 12:24 16684
-rw-------  1 pgsql pgsql       8192 Apr  4 12:24 16685
-rw-------  1 pgsql pgsql     245760 Apr  4 12:26 16686
-rw-------  1 pgsql pgsql      73728 Apr  4 12:26 16687
-rw-------  1 pgsql pgsql      16384 Apr  4 12:24 16688
-rw-------  1 pgsql pgsql      16384 Apr  4 12:24 16689
-rw-------  1 pgsql pgsql      16384 Apr  4 12:26 16690
-rw-------  1 pgsql pgsql      65536 Apr  4 12:26 16691
-rw-------  1 pgsql pgsql      16384 Apr  4 12:26 16692
-rw-------  1 pgsql pgsql      16384 Apr  4 12:26 16693
-rw-------  1 pgsql pgsql      16384 Apr  4 12:26 16694
-rw-------  1 pgsql pgsql      16384 Apr  4 12:26 16695
-rw-------  1 pgsql pgsql      16384 Apr  4 12:24 16696
-rw-------  1 pgsql pgsql      32768 Apr  4 12:24 16697
-rw-------  1 pgsql pgsql      16384 Apr  4 12:24 16698
-rw-------  1 pgsql pgsql     163840 Apr  4 12:26 16701
-rw-------  1 pgsql pgsql     196608 Apr  4 12:26 16702
-rw-------  1 pgsql pgsql      73728 Apr  4 12:24 16703
-rw-------  1 pgsql pgsql      16384 Apr  4 12:26 16706
-rw-------  1 pgsql pgsql      16384 Apr  4 12:26 16707
-rw-------  1 pgsql pgsql      16384 Apr  4 12:26 16708
-rw-------  1 pgsql pgsql      16384 Apr  4 12:24 16709
-rw-------  1 pgsql pgsql      16384 Apr  4 12:24 16710
-rw-------  1 pgsql pgsql       8192 Apr  4 12:24 16711
-rw-------  1 pgsql pgsql      16384 Apr  4 12:24 16712
-rw-------  1 pgsql pgsql      16384 Apr  4 12:24 16713
-rw-------  1 pgsql pgsql      16384 Apr  4 12:24 16714
-rw-------  1 pgsql pgsql      16384 Apr  4 12:24 16715
-rw-------  1 pgsql pgsql      32768 Apr  4 12:24 16716
-rw-------  1 pgsql pgsql     106496 Apr  4 12:24 16717
-rw-------  1 pgsql pgsql     106496 Apr  4 12:24 16718
-rw-------  1 pgsql pgsql    1212416 Apr  4 12:24 16719
-rw-------  1 pgsql pgsql      16384 Apr  4 12:24 16720
-rw-------  1 pgsql pgsql      16384 Apr  4 12:24 16721
-rw-------  1 pgsql pgsql      40960 Apr  4 14:44 16724
-rw-------  1 pgsql pgsql      16384 Apr  4 12:24 16727
-rw-------  1 pgsql pgsql      16384 Apr  4 12:24 16728
-rw-------  1 pgsql pgsql      16384 Apr  4 12:24 16729
-rw-------  1 pgsql pgsql      16384 Apr  4 12:24 16730
-rw-------  1 pgsql pgsql      16384 Apr  4 12:26 16731
-rw-------  1 pgsql pgsql      49152 Apr  4 12:26 16732
-rw-------  1 pgsql pgsql          0 Apr  4 12:24 16735
-rw-------  1 pgsql pgsql       8192 Apr  4 12:24 16737
-rw-------  1 pgsql pgsql          0 Apr  4 12:24 16738
-rw-------  1 pgsql pgsql       8192 Apr  4 12:24 16740
-rw-------  1 pgsql pgsql          0 Apr  4 12:24 16744
-rw-------  1 pgsql pgsql       8192 Apr  4 12:24 16746
-rw-------  1 pgsql pgsql          0 Apr  4 12:24 16750
-rw-------  1 pgsql pgsql       8192 Apr  4 12:24 16752
-rw-------  1 pgsql pgsql     122880 Apr  4 12:24 16753
-rw-------  1 pgsql pgsql      16384 Apr  4 12:24 16755
-rw-------  1 pgsql pgsql          0 Apr  4 12:24 16759
-rw-------  1 pgsql pgsql       8192 Apr  4 12:24 16761
-rw-------  1 pgsql pgsql      40960 Apr  4 12:24 17158
-rw-------  1 pgsql pgsql          0 Apr  4 12:24 17160
-rw-------  1 pgsql pgsql       8192 Apr  4 12:24 17162
-rw-------  1 pgsql pgsql       8192 Apr  4 12:24 17163
-rw-------  1 pgsql pgsql          0 Apr  4 12:24 17165
-rw-------  1 pgsql pgsql       8192 Apr  4 12:24 17167
-rw-------  1 pgsql pgsql       8192 Apr  4 12:24 17168
-rw-------  1 pgsql pgsql          0 Apr  4 12:24 17170
-rw-------  1 pgsql pgsql       8192 Apr  4 12:24 17172
-rw-------  1 pgsql pgsql       8192 Apr  4 12:24 17173
-rw-------  1 pgsql pgsql          0 Apr  4 12:24 17175
-rw-------  1 pgsql pgsql       8192 Apr  4 12:24 17177
-rw-------  1 pgsql pgsql       8192 Apr  4 12:24 17178
-rw-------  1 pgsql pgsql          0 Apr  4 12:24 17180
-rw-------  1 pgsql pgsql       8192 Apr  4 12:24 17182
-rw-------  1 pgsql pgsql          0 Apr  4 12:24 17183
-rw-------  1 pgsql pgsql          0 Apr  4 12:24 17185
-rw-------  1 pgsql pgsql       8192 Apr  4 12:24 17187
-rw-------  1 pgsql pgsql          0 Apr  4 12:26 26488264
-rw-------  1 pgsql pgsql          0 Apr  4 12:26 26488266
-rw-------  1 pgsql pgsql       8192 Apr  4 12:26 26488268
-rw-------  1 pgsql pgsql       8192 Apr  4 12:26 26488269
-rw-------  1 pgsql pgsql 1073741824 Apr  4 15:07 26488271
-rw-------  1 pgsql pgsql  407527424 Apr  4 16:17 26488271.1
-rw-------  1 pgsql pgsql          0 Apr  4 12:26 26488273
-rw-------  1 pgsql pgsql       8192 Apr  4 12:26 26488275
-rw-------  1 pgsql pgsql  565067776 Apr  4 16:17 26488276
-rw-------  1 pgsql pgsql          0 Apr  4 12:26 26488278
-rw-------  1 pgsql pgsql          0 Apr  4 12:26 26488280
-rw-------  1 pgsql pgsql       8192 Apr  4 12:26 26488282
-rw-------  1 pgsql pgsql       8192 Apr  4 12:26 26488283
-rw-------  1 pgsql pgsql          0 Apr  4 12:26 26488285
-rw-------  1 pgsql pgsql          0 Apr  4 12:26 26488287
-rw-------  1 pgsql pgsql       8192 Apr  4 12:26 26488289
-rw-------  1 pgsql pgsql       8192 Apr  4 12:26 26488290
-rw-------  1 pgsql pgsql          0 Apr  4 12:26 26488292
-rw-------  1 pgsql pgsql          0 Apr  4 12:26 26488294
-rw-------  1 pgsql pgsql       8192 Apr  4 12:26 26488296
-rw-------  1 pgsql pgsql       8192 Apr  4 12:26 26488297
-rw-------  1 pgsql pgsql          0 Apr  4 12:26 26488299
-rw-------  1 pgsql pgsql          0 Apr  4 12:26 26488301
-rw-------  1 pgsql pgsql       8192 Apr  4 12:26 26488303
-rw-------  1 pgsql pgsql       8192 Apr  4 12:26 26488304
-rw-------  1 pgsql pgsql          0 Apr  4 12:26 26488306
-rw-------  1 pgsql pgsql          0 Apr  4 12:26 26488308
-rw-------  1 pgsql pgsql       8192 Apr  4 12:26 26488310
-rw-------  1 pgsql pgsql       8192 Apr  4 12:26 26488311
-rw-------  1 pgsql pgsql          0 Apr  4 12:26 26488313
-rw-------  1 pgsql pgsql          0 Apr  4 12:26 26488315
-rw-------  1 pgsql pgsql       8192 Apr  4 12:26 26488317
-rw-------  1 pgsql pgsql       8192 Apr  4 12:26 26488318
-rw-------  1 pgsql pgsql          0 Apr  4 12:26 26488320
-rw-------  1 pgsql pgsql          0 Apr  4 12:26 26488322
-rw-------  1 pgsql pgsql       8192 Apr  4 12:26 26488324
-rw-------  1 pgsql pgsql       8192 Apr  4 12:26 26488325
-rw-------  1 pgsql pgsql          0 Apr  4 12:26 26488327
-rw-------  1 pgsql pgsql          0 Apr  4 12:26 26488329
-rw-------  1 pgsql pgsql       8192 Apr  4 12:26 26488331
-rw-------  1 pgsql pgsql       8192 Apr  4 12:26 26488332
-rw-------  1 pgsql pgsql          0 Apr  4 12:26 26488334
-rw-------  1 pgsql pgsql          0 Apr  4 12:26 26488336
-rw-------  1 pgsql pgsql       8192 Apr  4 12:26 26488338
-rw-------  1 pgsql pgsql       8192 Apr  4 12:26 26488339
-rw-------  1 pgsql pgsql      60045 Apr  4 12:24 pg_internal.init
-rw-------  1 pgsql pgsql          4 Apr  4 12:24 PG_VERSION

Hopefully this helps.

Chris
--
| Christopher Petrilli
| 

От:
PFC
Дата:


> This is done using COPY syntax, not INSERT syntax. So I suppose "yes"
> I do.  The file that is being used for COPY is kept on a ramdisk.

    COPY or psql \copy ?
    If you wanna be sure you commit after each COPY, launch a psql in a shell
and check if the inserted rows are visible (watching SELECT count(*) grow
will do)

От:
Christopher Petrilli
Дата:

On Apr 4, 2005 4:53 PM, PFC <> wrote:
> > This is done using COPY syntax, not INSERT syntax. So I suppose "yes"
> > I do.  The file that is being used for COPY is kept on a ramdisk.
>
>         COPY or psql \copy ?
>         If you wanna be sure you commit after each COPY, launch a psql in a shell
> and check if the inserted rows are visible (watching SELECT count(*) grow
> will do)

The script is Python, using pyexpect (a'la expect) and does this, exactly:

psql = pexpect.spawn('/usr/local/pgsql/bin/psql -d bench2 ')
[ ...]
start = time.time()
psql.expect_exact('bench2=#')
psql.sendline("COPY events%03i FROM '/mnt/tmpfs/loadfile';" % (tablenum+1))
results.write('%s\n' % (time.time() - start))
results.flush()

There's other code, but it's all related to building the loadfile.
Note that I'm specifically including the time it takes to get the
prompt back in the timing (but it does slip 1 loop, which isn't
relevent).

Chris
--
| Christopher Petrilli
| 

От:
Simon Riggs
Дата:

On Mon, 2005-04-04 at 16:18 -0400, Christopher Petrilli wrote:
> On Apr 4, 2005 4:11 PM, Simon Riggs <> wrote:
> > > > I'm very interested in the graphs of elapsed time for COPY 500 rows
> > > > against rows inserted. The simplistic inference from those graphs are
> > > > that if you only inserted 5 million rows into each table, rather than 10
> > > > million rows then everything would be much quicker. I hope this doesn't
> > > > work, but could you try that to see if it works? I'd like to rule out a
> > > > function of "number of rows" as an issue, or focus in on it depending
> > > > upon the results.
> >
> > Any chance of running a multiple load of 4 million rows per table,
> > leaving the test running for at least 3 tables worth (12+ M rows)?
>
> As soon as I get done running a test without indexes :-)
>
> > > > Q: Please can you confirm that the discontinuity on the graph at around
> > > > 5000 elapsed seconds matches EXACTLY with the switch from one table to
> > > > another? That is an important point.
> > >
> > > Well, the change over happens at 51593.395205 seconds :-)  Here's two
> > > lines from the results with row count and time added:
> > >
> > > 10000000      51584.9818912   8.41331386566
> > > 10000500      51593.395205    0.416964054108
> > >
> My program *SPECIFICALLY* counts to 10M then switches the COPY statement.

> > OK. Please...
> > cd $PGDATA/base/26488263
> > ls -l
>
> [root@bigbird base]# cd 26488263/
> [root@bigbird 26488263]# ls -l
> total 2003740

> -rw-------  1 pgsql pgsql 1073741824 Apr  4 15:07 26488271
> -rw-------  1 pgsql pgsql  407527424 Apr  4 16:17 26488271.1

Can you do:
select relname from pg_class where relfilenode = 26488271
and confirm that the name is the table you've been loading...

Couldn't see all your indexes... are they still there?

Thanks,

Best Regards, Simon Riggs



От:
Christopher Petrilli
Дата:

On Apr 4, 2005 4:58 PM, Simon Riggs <> wrote:
> Can you do:
> select relname from pg_class where relfilenode = 26488271
> and confirm that the name is the table you've been loading...

It is.

> Couldn't see all your indexes... are they still there?

Nope, I'm running a second run without the auxilary indices.  I only
have the primary key index.  So far, a quick scan with the eye says
that it's behaving "better", but beginning to have issues again.  I'll
post results as soon as they are done.

Chris
--
| Christopher Petrilli
| 

От:
Simon Riggs
Дата:

On Mon, 2005-04-04 at 17:03 -0400, Christopher Petrilli wrote:
> On Apr 4, 2005 4:58 PM, Simon Riggs <> wrote:
> > Can you do:
> > select relname from pg_class where relfilenode = 26488271
> > and confirm that the name is the table you've been loading...
>
> It is.
>
> > Couldn't see all your indexes... are they still there?
>
> Nope, I'm running a second run without the auxilary indices.  I only
> have the primary key index.  So far, a quick scan with the eye says
> that it's behaving "better", but beginning to have issues again.  I'll
> post results as soon as they are done.

Hmmm....

Before I start to tunnel-vision on a particular coincidence...

How much memory have you got on the system?
How much of that have you allocated to various tasks?
What else is happening on your system?
Tell us more about disk set-up and other hardware related things.
Disk cache...disk speed...seek times....etc

Best Regards, Simon Riggs






От:
Mike Rylander
Дата:

If I'm getting the point of this thread correctly, have a huge amount
of data in one table degrades INSERT/COPY performance even with just a
PKEY index.  If that's about the size of it, read on.  If not, ignore
me because I missed something.

On Apr 4, 2005 10:44 PM, Simon Riggs <> wrote:
> Before I start to tunnel-vision on a particular coincidence...
>

Don't worry too much about tunnel vision.  I see the same thing every
day with multi-million row tables.  The bigger the table gets (with
only a pkey index) the slower the inserts go.  If I start over
(truncate, drop/create table), or if I point the initial load at a new
table, everything gets speedy.  I've always figured it was a function
of table size and learned to live with it...

> How much memory have you got on the system?

On mine, 16G

> How much of that have you allocated to various tasks?

shared buffers: 15000

> What else is happening on your system?

Nothing on mine.

> Tell us more about disk set-up and other hardware related things.

6-disk RAID10 on a Compaq SmartArray 6404 with 256M BB cache, WAL on
2-disk mirror on built in SmartArray5 controller.

--
Mike Rylander

GPLS -- PINES Development
Database Developer
http://open-ils.org

От:
Christopher Petrilli
Дата:

On Apr 4, 2005 6:44 PM, Simon Riggs <> wrote:
> Before I start to tunnel-vision on a particular coincidence...
>
> How much memory have you got on the system?

Now, 2Gb, but most of it is free in this situation.  Earlier, I posted
some of the settings related to work mem.

> How much of that have you allocated to various tasks?

Do you mean inside PostgreSQL?

> What else is happening on your system?

sshd, that's it :-)

> Tell us more about disk set-up and other hardware related things.
> Disk cache...disk speed...seek times....etc

Sure, here's the system configuration:

* AMD64/3000
* 2GB RAM (was 1GB, has made no difference)
* 1 x 120GB SATA drive (w/WAL), 7200RPM Seagate
* 1 x 160GB SATA drive (main), 7200RPM Seagate

Chris
--
| Christopher Petrilli
| 

От:
Tom Lane
Дата:

Christopher Petrilli <> writes:
> On Apr 4, 2005 12:23 PM, Tom Lane <> wrote:
>> do a test run with *no* indexes on the table, just to see if it behaves
>> any differently?  Basically I was wondering if index overhead might be
>> part of the problem.

> http://www.amber.org/~petrilli/diagrams/pgsql_copy500_pkonly.png

> I appologize, I forgot to kill the PK, but as you can see, the curve
> flattened out a lot.  It still begins to increase in what seems like
> the same place.  You can find the results themselves at:

Yeah, this confirms the thought that the indexes are the source of
the issue.  (Which is what I'd expect, because a bare INSERT ought to be
an approximately constant-time operation.  But it's good to verify.)

Now some amount of slowdown is to be expected as the indexes get larger,
since it ought to take roughly O(log N) time to insert a new entry in an
index of size N.  The weird thing about your curves is the very sudden
jump in the insert times.

What I think might be happening is that the "working set" of pages
touched during index inserts is gradually growing, and at some point it
exceeds shared_buffers, and at that point performance goes in the toilet
because we are suddenly doing lots of reads to pull in index pages that
fell out of the shared buffer area.

It would be interesting to watch the output of iostat or vmstat during
this test run.  If I'm correct about this, the I/O load should be
basically all writes during the initial part of the test, and then
suddenly develop a significant and increasing fraction of reads at the
point where the slowdown occurs.

The indicated fix of course is to increase shared_buffers.

            regards, tom lane

От:
Christopher Petrilli
Дата:

On Apr 4, 2005 10:36 PM, Tom Lane <> wrote:
> Christopher Petrilli <> writes:
> > On Apr 4, 2005 12:23 PM, Tom Lane <> wrote:
> >> do a test run with *no* indexes on the table, just to see if it behaves
> >> any differently?  Basically I was wondering if index overhead might be
> >> part of the problem.
>
> > http://www.amber.org/~petrilli/diagrams/pgsql_copy500_pkonly.png
>
> > I appologize, I forgot to kill the PK, but as you can see, the curve
> > flattened out a lot.  It still begins to increase in what seems like
> > the same place.  You can find the results themselves at:
>
> Yeah, this confirms the thought that the indexes are the source of
> the issue.  (Which is what I'd expect, because a bare INSERT ought to be
> an approximately constant-time operation.  But it's good to verify.)

This seemsed to be my original idea, but I wanted to eliminate
everything else as much as possible. I was also concerned that I might
be hitting a bad case in the trees.  I had to change some UID
generation code to better hash, so...

> Now some amount of slowdown is to be expected as the indexes get larger,
> since it ought to take roughly O(log N) time to insert a new entry in an
> index of size N.  The weird thing about your curves is the very sudden
> jump in the insert times.

Right, I expected O(log N) behavior myself, and it seems to behave
that way, if you look at the first section (although there's some
interesting patterns that are visible if you exclude data outside the
90th percentile in the first section, that seems to coincide with some
write activity.

> It would be interesting to watch the output of iostat or vmstat during
> this test run.  If I'm correct about this, the I/O load should be
> basically all writes during the initial part of the test, and then
> suddenly develop a significant and increasing fraction of reads at the
> point where the slowdown occurs.

Well, I can track this on a run, if it would be useful, but I think
you're right as it matches what I saw from looking at iostat at those
points.

> The indicated fix of course is to increase shared_buffers.

Any idea where it should be set?

Chris
--
| Christopher Petrilli
| 

От:
Tom Lane
Дата:

Christopher Petrilli <> writes:
> On Apr 4, 2005 10:36 PM, Tom Lane <> wrote:
>> The indicated fix of course is to increase shared_buffers.

> Any idea where it should be set?

Not really.  An upper bound would be the total size of the finished
indexes for one 10M-row table, but one would suppose that that's
overkill.  The leaf pages shouldn't have to stay in RAM to have
reasonable behavior --- the killer case is when upper-level tree
pages drop out.  Or that's what I'd expect anyway.

You could probably drop the inter-insert sleep for testing purposes,
if you want to experiment with several shared_buffers values quickly.

            regards, tom lane

От:
Greg Stark
Дата:

Tom Lane <> writes:

> What I think might be happening is that the "working set" of pages
> touched during index inserts is gradually growing, and at some point it
> exceeds shared_buffers, and at that point performance goes in the toilet
> because we are suddenly doing lots of reads to pull in index pages that
> fell out of the shared buffer area.

All this is happening within a single transaction too, right? So there hasn't
been an fsync the entire time. It's entirely up to the kernel when to decide
to start writing data.

It's possible it's just buffering all the writes in memory until the amount of
free buffers drops below some threshold then it suddenly starts writing out
buffers.

> It would be interesting to watch the output of iostat or vmstat during
> this test run.  If I'm correct about this, the I/O load should be
> basically all writes during the initial part of the test, and then
> suddenly develop a significant and increasing fraction of reads at the
> point where the slowdown occurs.

I think he's right, if you see a reasonable write volume before the
performance drop followed by a sudden increase in read volume (and decrease of
write volume proportionate to the drop in performance) then it's just shared
buffers becoming a bottleneck.

If there's hardly any write volume before, then a sudden increase in write
volume despite a drop in performance then I might be right. In which case you
might want to look into tools to tune your kernel vm system.


--
greg

От:
Christopher Petrilli
Дата:

On 04 Apr 2005 23:45:47 -0400, Greg Stark <> wrote:
>
> Tom Lane <> writes:
>
> > What I think might be happening is that the "working set" of pages
> > touched during index inserts is gradually growing, and at some point it
> > exceeds shared_buffers, and at that point performance goes in the toilet
> > because we are suddenly doing lots of reads to pull in index pages that
> > fell out of the shared buffer area.
>
> All this is happening within a single transaction too, right? So there hasn't
> been an fsync the entire time. It's entirely up to the kernel when to decide
> to start writing data.

This was my concern, and in fact moving from ext3 -> XFS has helped
substantially in this regard. This is all happening inside COPY
statements, so there's effectively a commit every 500 rows. I could
enlarge this, but I didn't notice a huge increase in performance when
doing tests on smaller bits.

Also, you are correct, I am running without fsync, although I could
change that if you thought it would "smooth" the performance.  The
issue is less absolute performance than something more deterministic.
Going from 0.05 seconds for a 500 row COPY to 26 seconds really messes
with the system.

One thing that was mentioned early on, and I hope people remember, is
that I am running autovacuum in the background, but the timing of it
seems to have little to do with the system's problems, at least the
debug output doesn't conincide with performance loss.

> It's possible it's just buffering all the writes in memory until the amount of
> free buffers drops below some threshold then it suddenly starts writing out
> buffers.

That was happening with ext3, actually, or at least to the best of my knowledge.

> > It would be interesting to watch the output of iostat or vmstat during
> > this test run.  If I'm correct about this, the I/O load should be
> > basically all writes during the initial part of the test, and then
> > suddenly develop a significant and increasing fraction of reads at the
> > point where the slowdown occurs.
>
> I think he's right, if you see a reasonable write volume before the
> performance drop followed by a sudden increase in read volume (and decrease of
> write volume proportionate to the drop in performance) then it's just shared
> buffers becoming a bottleneck.

I've set shared_buffers to 16000 (from the original 1000) and am
running now, without the pauses. We'll see what it looks like, but so
far it seems to be running faster. How much and how it degrades will
be an interesting view.

> If there's hardly any write volume before, then a sudden increase in write
> volume despite a drop in performance then I might be right. In which case you
> might want to look into tools to tune your kernel vm system.

Here's a quick snapshot of iostat:

Linux 2.6.9-1.667 (bigbird.amber.org)   04/04/2005

avg-cpu:  %user   %nice    %sys %iowait   %idle
           1.05    0.01    0.63   13.15   85.17

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
hda               0.00         0.00         0.00       3616          0
sda              23.15        68.09       748.89  246884021 2715312654
sdb              19.08        37.65       773.03  136515457 2802814036

The first 3 columns have been identical (or nearly so) the whole time,
which tells me the system is pegged in its performance on IO.  This is
not surprising.

Chris
--
| Christopher Petrilli
| 

От:
Tom Lane
Дата:

Greg Stark <> writes:
> All this is happening within a single transaction too, right? So there hasn't
> been an fsync the entire time. It's entirely up to the kernel when to decide
> to start writing data.

No ... there's a commit every 500 records.  However, I think Chris said
he was running with fsync off; so you're right that the kernel is at
liberty to write stuff to disk when it feels like.  It could be that
those outlier points are transactions that occurred in the middle of
periodic syncer-driven mass writes.  Maybe fsync off is
counterproductive for this situation?

            regards, tom lane

От:
Christopher Petrilli
Дата:

On Apr 4, 2005 11:57 PM, Tom Lane <> wrote:
> Greg Stark <> writes:
> > All this is happening within a single transaction too, right? So there hasn't
> > been an fsync the entire time. It's entirely up to the kernel when to decide
> > to start writing data.
>
> No ... there's a commit every 500 records.  However, I think Chris said
> he was running with fsync off; so you're right that the kernel is at
> liberty to write stuff to disk when it feels like.  It could be that
> those outlier points are transactions that occurred in the middle of
> periodic syncer-driven mass writes.  Maybe fsync off is
> counterproductive for this situation?

Looking at preliminary results from running with shared_buffers at
16000, it seems this may be correct.  Performance was flatter for a
BIT longer, but slammed right into the wall and started hitting the
3-30 second range per COPY.  I've restarted the run, with fsync turned
on (fdatasync), and we'll see.

My fear is that it's some bizarre situation interacting with both
issues, and one that might not be solvable.  Does anyone else have
much experience with this sort of sustained COPY?

Chris
--
| Christopher Petrilli
| 

От:
"Jim C. Nasby"
Дата:

On Tue, Apr 05, 2005 at 12:16:27AM -0400, Christopher Petrilli wrote:
> My fear is that it's some bizarre situation interacting with both
> issues, and one that might not be solvable.  Does anyone else have
> much experience with this sort of sustained COPY?

You might ask the guy who just posted to -admin about a database that's
doing 340M inserts a day in 300M transactions...
--
Jim C. Nasby, Database Consultant               
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

От:
Christopher Petrilli
Дата:

On Apr 5, 2005 12:16 AM, Christopher Petrilli <> wrote:
> Looking at preliminary results from running with shared_buffers at
> 16000, it seems this may be correct.  Performance was flatter for a
> BIT longer, but slammed right into the wall and started hitting the
> 3-30 second range per COPY.  I've restarted the run, with fsync turned
> on (fdatasync), and we'll see.
>
> My fear is that it's some bizarre situation interacting with both
> issues, and one that might not be solvable.  Does anyone else have
> much experience with this sort of sustained COPY?

Well, here's the results:

http://www.amber.org/~petrilli/diagrams/pgsql_copy500_comparison.png

The red is the run with shared_buffers turned up, but fsync off.
The blue is the run with shared_buffers turned up, but fsync on.

Note that it hits the wall sooner. Unfortunately, my brain is fried,
and not sure what that means!

Chris
--
| Christopher Petrilli
| 

От:
Simon Riggs
Дата:

On Mon, 2005-04-04 at 22:36 -0400, Tom Lane wrote:
> Christopher Petrilli <> writes:
> > On Apr 4, 2005 12:23 PM, Tom Lane <> wrote:
> >> do a test run with *no* indexes on the table, just to see if it behaves
> >> any differently?  Basically I was wondering if index overhead might be
> >> part of the problem.
>
> > http://www.amber.org/~petrilli/diagrams/pgsql_copy500_pkonly.png
>
> > I appologize, I forgot to kill the PK, but as you can see, the curve
> > flattened out a lot.  It still begins to increase in what seems like
> > the same place.  You can find the results themselves at:
>
> Yeah, this confirms the thought that the indexes are the source of
> the issue.  (Which is what I'd expect, because a bare INSERT ought to be
> an approximately constant-time operation.  But it's good to verify.)

Yup, indexes are the best explanation so far - block extension needs
some work, but I doubted that it was the source of this effect.

> Now some amount of slowdown is to be expected as the indexes get larger,
> since it ought to take roughly O(log N) time to insert a new entry in an
> index of size N.  The weird thing about your curves is the very sudden
> jump in the insert times.

Well, ISTM that the curve is far from unique. Mark's OSDL tests show
them too. What was wierd, for me, was that it "resets" when you move to
a new table. The index theory does accurately explain that.

Perhaps the jump is not so sudden? Do I see a first small step up at
about 4.5M rows, then another much bigger one at 7.5M (which looks like
the only one at first glance)?

> What I think might be happening is that the "working set" of pages
> touched during index inserts is gradually growing, and at some point it
> exceeds shared_buffers, and at that point performance goes in the toilet
> because we are suddenly doing lots of reads to pull in index pages that
> fell out of the shared buffer area.

So this does seem to be the best explanation and it seems a good one.

It's also an excellent advert for table and index partitioning, and some
damning evidence against global indexes on partitioned tables (though
they may still be better than the alternative...)

> The indicated fix of course is to increase shared_buffers.

Splitting your tables at 4M, not 10M would work even better.

..

Anyway, where most of this started was with Christopher's comments:

On Fri, 2005-04-01 at 14:38 -0500, Christopher Petrilli wrote:
> This was an application originally written for MySQL/MYISAM, and it's
> looking like PostgreSQL can't hold up for it, simply because it's "too
> much database" if that makes sense.  The same box, running the MySQL
> implementation (which uses no transactions) runs around 800-1000
> rows/second systained.

B-trees aren't unique to PostgreSQL; the explanation developed here
would work equally well for any database system that used tree-based
indexes. Do we still think that MySQL can do this when PostgreSQL
cannot? How?

Do we have performance test results showing the same application load
without the degradation? We don't need to look at the source code to
measure MySQL performance...

Best Regards, Simon Riggs



От:
Christopher Petrilli
Дата:

On Apr 5, 2005 3:48 PM, Simon Riggs <> wrote:
> > Now some amount of slowdown is to be expected as the indexes get larger,
> > since it ought to take roughly O(log N) time to insert a new entry in an
> > index of size N.  The weird thing about your curves is the very sudden
> > jump in the insert times.
>
> Well, ISTM that the curve is far from unique. Mark's OSDL tests show
> them too. What was wierd, for me, was that it "resets" when you move to
> a new table. The index theory does accurately explain that.
>
> Perhaps the jump is not so sudden? Do I see a first small step up at
> about 4.5M rows, then another much bigger one at 7.5M (which looks like
> the only one at first glance)?
>
> > What I think might be happening is that the "working set" of pages
> > touched during index inserts is gradually growing, and at some point it
> > exceeds shared_buffers, and at that point performance goes in the toilet
> > because we are suddenly doing lots of reads to pull in index pages that
> > fell out of the shared buffer area.
>
> So this does seem to be the best explanation and it seems a good one.
>
> It's also an excellent advert for table and index partitioning, and some
> damning evidence against global indexes on partitioned tables (though
> they may still be better than the alternative...)
>
> > The indicated fix of course is to increase shared_buffers.
>
> Splitting your tables at 4M, not 10M would work even better.

Unfortunately, given we are talking about billions of rows
potentially, I'm concerned about that many tables when it comes to
query time. I assume this will kick in the genetic optimizer?


> Anyway, where most of this started was with Christopher's comments:
>
> On Fri, 2005-04-01 at 14:38 -0500, Christopher Petrilli wrote:
> > This was an application originally written for MySQL/MYISAM, and it's
> > looking like PostgreSQL can't hold up for it, simply because it's "too
> > much database" if that makes sense.  The same box, running the MySQL
> > implementation (which uses no transactions) runs around 800-1000
> > rows/second systained.
>
> B-trees aren't unique to PostgreSQL; the explanation developed here
> would work equally well for any database system that used tree-based
> indexes. Do we still think that MySQL can do this when PostgreSQL
> cannot? How?

There are customers in production using MySQL with 10M rows/table, and
I have no evidence of this behavior. I do not have the test jig for
MySQL, but I can create one, which is what I will do. Note that they
are using MyISAM files, so there is no ACID behavior. Also, I have
seen troubling corruption issues that I've never been able to
concretely identify.

Above all, I've been impressed that PostgreSQL, even when it hits this
wall, never corrupts anything.

 > Do we have performance test results showing the same application load
> without the degradation? We don't need to look at the source code to
> measure MySQL performance...

I will see what I can do in the next few days to create a similar
little test for MySQL.

Chris
--
| Christopher Petrilli
| 

От:
Christopher Petrilli
Дата:

On Apr 5, 2005 3:48 PM, Simon Riggs <> wrote:
> B-trees aren't unique to PostgreSQL; the explanation developed here
> would work equally well for any database system that used tree-based
> indexes. Do we still think that MySQL can do this when PostgreSQL
> cannot? How?
>
> Do we have performance test results showing the same application load
> without the degradation? We don't need to look at the source code to
> measure MySQL performance...

http://www.amber.org/~petrilli/diagrams/comparison_mysql_pgsql.png

That chart shows MySQL (using INSERT against MyISAM tables) and
PostgreSQL (using COPY) running with the exact same code otherwise.
Note that MySQL does hit a bit of a wall, but nothing as drastic as
PostgreSQL and actually maintains something "more flat".  The red and
blue dashed lines are the 95th percentile point.

My suspicion is that what we're seeing is WAL issues, not particularly
index issues.  The indices just fill up the WAL faster because there's
more data.  This is a wag basically, but it would seem to explain the
difference. In both cases, the indices were identical. Five on each.

One interesting thing... PostgreSQL starts out a good bit faster, but
looses in the end.

Chris
--
| Christopher Petrilli
| 

От:
Simon Riggs
Дата:

On Tue, 2005-04-05 at 18:55 -0400, Christopher Petrilli wrote:
> On Apr 5, 2005 3:48 PM, Simon Riggs <> wrote:
> > B-trees aren't unique to PostgreSQL; the explanation developed here
> > would work equally well for any database system that used tree-based
> > indexes. Do we still think that MySQL can do this when PostgreSQL
> > cannot? How?
> >
> > Do we have performance test results showing the same application load
> > without the degradation? We don't need to look at the source code to
> > measure MySQL performance...
>
> http://www.amber.org/~petrilli/diagrams/comparison_mysql_pgsql.png
>
> That chart shows MySQL (using INSERT against MyISAM tables) and
> PostgreSQL (using COPY) running with the exact same code otherwise.
> Note that MySQL does hit a bit of a wall, but nothing as drastic as
> PostgreSQL and actually maintains something "more flat".  The red and
> blue dashed lines are the 95th percentile point.

Interesting comparison. Any chance of separating the graphs as well, I'm
interested in the detail on both graphs.

Could you estimate the apparent periodicity on the PostgreSQL graphs?

> My suspicion is that what we're seeing is WAL issues, not particularly
> index issues.  The indices just fill up the WAL faster because there's
> more data.  This is a wag basically, but it would seem to explain the
> difference. In both cases, the indices were identical. Five on each.

Let's test the shared_buffers theory.

Would you mind loading only 5M rows per table, but load the same amount
of data overall? That should keep us within the comparable zone overall.

Best Regards, Simon Riggs


От:
Simon Riggs
Дата:

On Tue, 2005-04-05 at 16:05 -0400, Christopher Petrilli wrote:
> On Apr 5, 2005 3:48 PM, Simon Riggs <> wrote:
> > > The indicated fix of course is to increase shared_buffers.
> >
> > Splitting your tables at 4M, not 10M would work even better.
>
> Unfortunately, given we are talking about billions of rows
> potentially, I'm concerned about that many tables when it comes to
> query time. I assume this will kick in the genetic optimizer?

No, it won't start using the genetic optimizer.

You could just buy more RAM and keep table size the same.

Best Regards, Simon Riggs