Thread: improvement suggestions for performance design


improvement suggestions for performance design

From: tfinneid@ifi.uio.no
Date:
Hi

I have the following scenario for a database that I need to design, and
would like some hints on what to improve or do differently to achieve the
desired performance goal, disregarding hardware and postgres tuning.

The premise is an attribute database that stores about 100 different
attribute types as attribute values. Every X seconds, Y number of new
attribute values are stored in the database. X is constant and currently
between 6 and 20 seconds, depending on the setup. In the future X could
become as low as 3 seconds. Y can, within the next 5-10 years, become as
high as 200 000.

That means that, for example, every 6 seconds 100 000 attributes need to
be written to the database.

At the same time, somewhere between 5 and 20 users need to read parts of
those newly written attributes, maybe 30 000 attributes in total.

This continues for the duration of the field operation, which could be
18 hrs a day for 6 weeks. So the total db size is up towards 200 GB.

Now here is how I suggest doing this:

1- the tables

table attribute_values:
    id         int
    attr_type     int  ( references attribute_types(id) )
    posX        int
    posY        int
    data_type    int
    value        varchar(50)

table attribute_types:
    id        int
    name        varchar(200);



2- function

   a function that receives arrays of data and inserts each attribute;
   perhaps one array per attribute field (type, posX, posY, data_type,
   value), so five arrays as in-parameters to the function

3- java client

   the client receives the data from a CORBA request, splits it
   into, say, 4 equally sized blocks and runs 4 threads that each insert
   one block (this seems to be more efficient than just using one
   thread).
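
Purely as an illustration of items 2 and 3, here is a minimal Java sketch of a client
that splits one batch of attribute data into four blocks and inserts each block in its
own thread through a hypothetical PL/pgSQL function
insert_attributes(int[], int[], int[], int[], varchar[]). The function name, connection
URL and credentials are made up, Connection.createArrayOf needs a JDBC4-capable driver
(newer than the 8.1/8.2-era driver discussed here), and error handling via Futures is
omitted.

import java.sql.*;
import java.util.Arrays;
import java.util.concurrent.*;

// Hypothetical sketch only: split a batch into 4 blocks, one inserting thread each.
public class ThreadedInserter {

    private static final int THREADS = 4;
    private static final String URL = "jdbc:postgresql://dbhost/attrdb"; // placeholder

    public static void insert(final int[] type, final int[] posX, final int[] posY,
                              final int[] dataType, final String[] value) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        int chunk = (type.length + THREADS - 1) / THREADS;
        for (int t = 0; t < THREADS; t++) {
            final int from = t * chunk;
            final int to = Math.min(from + chunk, type.length);
            if (from >= to) break;
            pool.submit(new Callable<Void>() {
                public Void call() throws Exception {
                    // one connection and one transaction per thread
                    Connection con = DriverManager.getConnection(URL, "user", "password");
                    try {
                        PreparedStatement ps = con.prepareStatement(
                            "select insert_attributes(?, ?, ?, ?, ?)"); // hypothetical function
                        ps.setArray(1, con.createArrayOf("int4", box(type, from, to)));
                        ps.setArray(2, con.createArrayOf("int4", box(posX, from, to)));
                        ps.setArray(3, con.createArrayOf("int4", box(posY, from, to)));
                        ps.setArray(4, con.createArrayOf("int4", box(dataType, from, to)));
                        ps.setArray(5, con.createArrayOf("varchar",
                                Arrays.copyOfRange(value, from, to)));
                        ps.execute();
                    } finally {
                        con.close();
                    }
                    return null;
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS); // worker errors are not checked here
    }

    // box an int[] slice into an Integer[] for createArrayOf
    private static Integer[] box(int[] src, int from, int to) {
        Integer[] out = new Integer[to - from];
        for (int i = 0; i < out.length; i++) {
            out[i] = src[from + i];
        }
        return out;
    }
}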

Now I am wondering if this is the most efficient way of doing it?

- I know that I could group the attributes so that each type of attribute
gets its own table with all attributes in one row. But I am not sure if
that is any more efficient than one attribute per row, since I pass
everything to the function as an array.
With the above design, a change in attribute types only requires changing
the data in a table instead of having to modify the client, the function
and the tables.

- I am also wondering if writing the client and function in C would create
a more efficient solution.

any comments?

PS: I am currently running postgres 8.1, but could probably use 8.2 if it
is needed for functionality or performance reasons. It will run on a SPARC
machine with Solaris 10 and perhaps 4-6 processors, as many GB of RAM as
necessary and SCSI disks (perhaps in RAID 0).

regards

thomas




Re: improvement suggestions for performance design

From: Heikki Linnakangas
Date:
I would strongly suggest that you use a proper relational schema,
instead of storing everything in two tables. I don't know your
application, but a schema like that is called an Entity-Attribute-Value
(though your entity seems to be just posx and posy) and it should raise
a big red flag in the mind of any database designer. In particular,
constructing queries against an EAV schema is a major pain in the ass.
This has been discussed before on the postgresql lists as well; you might
want to search and read the previous discussions.

Ignoring the EAV issue for a moment, it's hard to give advice without
knowing what kind of queries are going to be executed. Are the lookups
always going to be by id? By posx/posy perhaps? By attribute?

tfinneid@ifi.uio.no wrote:
> Hi
>
> I have the following scenario for a database that I need to design, and
> would like some hints on what to improve or do differently to achieve the
> desired performance goal, disregarding hardware and postgres tuning.
>
> The premise is an attribute database that stores about 100 different
> attribute types as attribute values. Every X seconds, Y number of new
> attribute values are stored in the database. X is constant and currently
> between 6 and 20 seconds, depending on the setup. In the future X could
> become as low as 3 seconds. Y can, within the next 5-10 years, become as
> high as 200 000.
>
> That means that for example, every 6 seconds 100 000 attributes needs to
> be written to the database.
>
> At the same time, somewhere between 5-20 users needs to read parts of
> those newly written attributes, maybe in total 30 000 attributes.
>
> This continues for the duration of the field operation, which could be
> 18hrs a day for 6 weeks. So the total db size is up towards 200 gigs.
>
> Now here is how I suggest doing this:
>
> 1- the tables
>
> table attribute_values:
>     id         int
>     attr_type     int  ( references attribute_types(id) )
>     posX        int
>     posY        int
>     data_type    int
>     value        varchar(50)
>
> table attribute_types:
>     id        int
>     name        varchar(200);
>
>
>
> 2- function
>
>    a function that receives an array of data and inserts each attribute.
>    perhaps one array per attribute data (type, posX, posY, data_type,
>    value) so five arrays as in parameters ot the function
>
> 3- java client
>
>    the client receives the data from a corba request, and splits it
>    into, say 4 equally sized blocks and executes 4 threads that insert
>    each block (this seems to be more efficient than just using one
>    thread.)
>
> Now I am wondering if this is the most efficient way of doing it?
>
> - I know that I could group the attributes so that each type of attribute
> gets its own table with all attributes in one row. But I am not sure if
> that is any more efficient than ont attribute per row since I pass
> everything to the function as an array.
> With the above design a change in attribute types only requires changing
> the data in a table instead of having to modify the client, the function
> and the tables.
>
> - I am also wondering if writing the client and function in C would create
> a more efficient solution.
>
> any comments?
>
> ps, I am currently running postgres 8.1, but could probably use 8.2 if it
> is needed for functionality or performance reasons. It will run on a sparc
> machine with solaris 10 and perhaps 4-6 processors, as many GB of RAM as
> necessary and SCSI disks ( perhaps in raid 0 ).


--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: improvement suggestions for performance design

From: tfinneid@ifi.uio.no
Date:
> I would strongly suggest that you use a proper relational schema,
> instead of storing everything in two tables. I don't know your
> application, but a schema like that is called an Entity-Attribute-Value
> (though your entity seems to be just posx and posy) and it should raise
> a big red flag in the mind of any database designer. In particular,
> constructing queries against an EAV schema is a major pain in the ass.
> This has been discussed before on postgresql lists as well, you might
> want to search and read the previous discussions.

I get your point, but the thing is the attributes have no particular
relation to each other, other than belonging to the same attribute groups.
There are no specific rules that state that certain attributes are always
used together, such as with an address record. It depends on what
attributes the operator wants to study. This is why I don't find any
reason to group the attributes into separate tables and columns.

I am still looking into the design of the tables, but I need to get a
proper test harness running before I can start ruling things out. And
part of that is, for example, finding efficient ways of transferring the
insert data from the client to the db, instead of just single-command
inserts. This is where bulk transfer by arrays would probably be preferable.

> Ignoring the EAV issue for a moment, it's hard to give advice without
> knowing what kind of queries are going to executed. Are the lookups
> always going to be by id? By posx/posy perhaps? By attribute?

The query will be by attribute type and posX/posY. So for position x,y, give
me the following attributes...

thomas




Re: improvement suggestions for performance design

From: "Y Sidhu"
Date:


On 7/5/07, tfinneid@ifi.uio.no <tfinneid@ifi.uio.no> wrote:
> I would strongly suggest that you use a proper relational schema,
> instead of storing everything in two tables. I don't know your
> application, but a schema like that is called an Entity-Attribute-Value
> (though your entity seems to be just posx and posy) and it should raise
> a big red flag in the mind of any database designer. In particular,
> constructing queries against an EAV schema is a major pain in the ass.
> This has been discussed before on postgresql lists as well, you might
> want to search and read the previous discussions.

I get your point, but the thing is the attributes have no particular
relation to each other, other than belonging to same attribute groups.
There are no specific rules that states that certain attributes are always
used together, such as with an address record. It depends on what
attributes the operator wants to study. This is why I don't find any
reason to group the attributes into separate tables and columns.

I am still looking into the design of the tables, but I need to get at
proper test harness running before I can start ruling things out. And a
part of that, is for example, efficient ways of transferring the insert
data from the client to the db, instead of just single command inserts.
This is where bulk transfer by arrays probably would be preferable.

> Ignoring the EAV issue for a moment, it's hard to give advice without
> knowing what kind of queries are going to executed. Are the lookups
> always going to be by id? By posx/posy perhaps? By attribute?

the query will be by attribute type and posx/y. So for position x,y, give
me the following attributes...

thomas





I don't know much about this EAV stuff, except to say that my company is in a situation with a lot of adds and bulk deletes, and I wish the tables had been designed with partitioning in mind. That is, if you know how much data (order of magnitude) each table will hold or will pass through (adds and deletes), you may want to design the table with partitioning in mind. I have not done any partitioning myself, so I cannot give you details, but I can tell you that mass deletes are a breeze because you just "drop" that part of the table (I think it is a sub-table), and that alleviates table bloat and excessive vacuuming.

Good luck.

--
Yudhvir Singh Sidhu
408 375 3134 cell

Re: improvement suggestions for performance design

From: tfinneid@ifi.uio.no
Date:
> On 7/5/07, tfinneid@ifi.uio.no <tfinneid@ifi.uio.no> wrote:
>
> I don't know much about this EAV stuff. Except to say that my company is
> in
> a situation with a lot of adds and bulk deletes and I wish the tables were
> designed with partitioning in mind. That is if you know how much, order of
> magnitude, data each table will hold or will pass through (add and
> delete),
> you may want to design the table with partitioning in mind. I have not
> done
> any partitioning so I cannot give you details but can tell you that mass
> deletes are a breeze because you just "drop" that part of the table. I
> think
> it is a sub table. And that alleviates table bloat and excessive
> vacuuming.

By partitioning, do you mean some sort of internal db table partitioning
scheme or just me dividing the data into different tables?

There won't be many deletes, but there might of course be some.
Additionally, because of the performance requirements, there won't be time
to run vacuum between the inserts, except in non-operational periods, which
will only be a couple of hours during the day. So vacuum will have to be
scheduled at those times, instead of at the normal intervals.



Re: improvement suggestions for performance design

From: "Y Sidhu"
Date:


On 7/5/07, tfinneid@ifi.uio.no <tfinneid@ifi.uio.no> wrote:
> On 7/5/07, tfinneid@ifi.uio.no < tfinneid@ifi.uio.no> wrote:
>
> I don't know much about this EAV stuff. Except to say that my company is
> in
> a situation with a lot of adds and bulk deletes and I wish the tables were
> designed with partitioning in mind. That is if you know how much, order of
> magnitude, data each table will hold or will pass through (add and
> delete),
> you may want to design the table with partitioning in mind. I have not
> done
> any partitioning so I cannot give you details but can tell you that mass
> deletes are a breeze because you just "drop" that part of the table. I
> think
> it is a sub table. And that alleviates table bloat and excessive
> vacuuming.

By partitioning, do you mean some sort of internal db table partitioning
scheme or just me dividing the data into different tables?

There want be many deletes, but there might of course be some.
Additionally, because of the
performance requirements, there wont be time to run vacuum in between the
insert, except for in non-operational periods. which will only be a couple
of hours during the day. So vacuum will have to be scheduled at those
times, instead of the normal intervals.




Internal db table partitioning. Check out: http://www.postgresql.org/docs/8.2/static/ddl-partitioning.html

--
Yudhvir Singh Sidhu
408 375 3134 cell

Re: improvement suggestions for performance design

From: Heikki Linnakangas
Date:
tfinneid@ifi.uio.no wrote:
>> I would strongly suggest that you use a proper relational schema,
>> instead of storing everything in two tables. I don't know your
>> application, but a schema like that is called an Entity-Attribute-Value
>> (though your entity seems to be just posx and posy) and it should raise
>> a big red flag in the mind of any database designer. In particular,
>> constructing queries against an EAV schema is a major pain in the ass.
>> This has been discussed before on postgresql lists as well, you might
>> want to search and read the previous discussions.
>
> I get your point, but the thing is the attributes have no particular
> relation to each other, other than belonging to same attribute groups.
> There are no specific rules that states that certain attributes are always
> used together, such as with an address record. It depends on what
> attributes the operator wants to study. This is why I don't find any
> reason to group the attributes into separate tables and columns.

ISTM that a properly normalized schema would look something like this:

create table position (
   posX int not null,
   posY int not null,
   primary key (posX, posY)
);

create table colour (
   posX int not null,
   posY int not null,
   colour varchar(50) not null,
   primary key (posX, posY),
   foreign key (posX, posY) references position (posX, posY)
);

create table population (
   posX int not null,
   posY int not null,
   population int not null,
   primary key (posX, posY),
   foreign key (posX, posY) references position (posX, posY)
);

where colour and population are examples of attributes you want to
store. If you have 100 different attributes, you'll have 100 tables like
that. That may sound like a lot, but it's not.

This allows you to use proper data types for the attributes, as well as
constraints and all the other goodies a good relational data model gives you.

It also allows you to build proper indexes on the attributes. For
example, if you store populations as text, you're going to have a hard
time building an index that allows you to query for positions with a
population between 100-2000 efficiently.

These are all imaginary examples, but what I'm trying to point out here
is that a proper relational schema allows you to manage and query your
data much more easily and with more flexibility, and allows for future
extensions.

A normalized schema will also take less space, which means less I/O and
more performance, because there's no need to store metadata like the
data_type, attr_type on every row. For performance reasons, you might
actually want to not store the position-table at all in the above schema.

An alternative design would be to have a single table, with one column
per attribute:

create table position (
   posX int not null,
   posY int not null,
   colour varchar(50),
   population int,
   ...
   primary key (posX, posY)
)

This is more space-efficient, especially if you have a lot of attributes
on same coordinates. You can easily add and drop columns as needed,
using ALTER TABLE.

> I am still looking into the design of the tables, but I need to get at
> proper test harness running before I can start ruling things out. And a
> part of that, is for example, efficient ways of transferring the insert
> data from the client to the db, instead of just single command inserts.
> This is where bulk transfer by arrays probably would be preferable.

Before you start fiddling with functions, I'd suggest that you try
batching the inserts with the JDBC PreparedStatement batch facility.
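
To make that suggestion concrete, here is a rough sketch of JDBC batching against the
attribute_values table from the original post. The batch size is arbitrary, and it
assumes the id column is filled by a sequence default (or dropped); neither detail comes
from the thread.

import java.sql.*;

// Sketch: batched inserts into attribute_values with PreparedStatement.addBatch().
public class BatchInsert {

    public static void insert(Connection con, int[] attrType, int[] posX, int[] posY,
                              int[] dataType, String[] value) throws SQLException {
        final int BATCH_SIZE = 10000;           // flush every N rows (arbitrary)
        con.setAutoCommit(false);               // one transaction for the whole batch
        PreparedStatement ps = con.prepareStatement(
            "insert into attribute_values (attr_type, posx, posy, data_type, value) " +
            "values (?, ?, ?, ?, ?)");
        try {
            for (int i = 0; i < attrType.length; i++) {
                ps.setInt(1, attrType[i]);
                ps.setInt(2, posX[i]);
                ps.setInt(3, posY[i]);
                ps.setInt(4, dataType[i]);
                ps.setString(5, value[i]);
                ps.addBatch();
                if ((i + 1) % BATCH_SIZE == 0) {
                    ps.executeBatch();          // send the accumulated rows to the server
                }
            }
            ps.executeBatch();                  // flush the remainder
            con.commit();
        } catch (SQLException e) {
            con.rollback();
            throw e;
        } finally {
            ps.close();
        }
    }
}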

Splitting the inserts into multiple threads in your application sounds
messy. The inserts would have to be in separate transactions, for
example. Presumably your CORBA ORB will spawn multiple threads for you
when there are a lot of requests coming in, so the overall throughput should
be the same with a single thread per request.

BTW, I concur with Y Sidhu that with data volumes as high as you have,
partitioning is a good idea. It's a lot easier to manage twenty 10 GB table
partitions than one 200 GB table. For example, VACUUM, CLUSTER, CREATE
INDEX can be done partition per partition, instead of as a single huge
operation that runs for hours. Though if you choose to have just one
table per attribute type, each table might be conveniently small by
nature, so that no partitioning is required.
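
As a sketch only (the names and the per-day scheme are invented, not from the thread),
the client could create one child table per operational day in the inheritance style of
the 8.2 ddl-partitioning docs, insert directly into the current child, and let reads go
through the parent:

import java.sql.*;
import java.text.SimpleDateFormat;
import java.util.Date;

// Sketch: inheritance-based partitioning (8.1/8.2 style), one child table per day.
public class DailyPartition {

    /** Returns the name of today's partition, creating it first if necessary. */
    public static String ensureTodaysPartition(Connection con) throws SQLException {
        String child = "attribute_values_"
                + new SimpleDateFormat("yyyyMMdd").format(new Date());
        Statement st = con.createStatement();
        try {
            // 8.1/8.2 have no CREATE TABLE IF NOT EXISTS, so check the catalog first
            ResultSet rs = st.executeQuery(
                    "select 1 from pg_class where relname = '" + child + "'");
            if (!rs.next()) {
                st.executeUpdate("create table " + child
                        + " () inherits (attribute_values)");
                // optionally add a CHECK constraint on a date/batch column here so that
                // constraint exclusion can skip irrelevant partitions on reads
            }
        } finally {
            st.close();
        }
        return child; // the client inserts into this table; reads can go via the parent
    }
}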

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: improvement suggestions for performance design

From: Thomas Finneid
Date:
Heikki Linnakangas wrote:
> ISTM that a properly normalized schema would look something like this:
>
> create table position (
>   posX int not null,
>   posY int not null,
>   primary key (posX, posY)
> );
>
> create table colour (
>   posX int not null,
>   posY int not null,
>   colour varchar(50) not null,
>   primary key (posX, posY),
>   foreign key (posX, posY) references position (posX, posY)
> );
>
> create table population (
>   posX int not null,
>   posY int not null,
>   population int notn u,
>   primary key (posX, posY),
>   foreign key (posX, posY) references position (posX, posY)
> );

I agree that this is a way it could be done.

> where colour and population are examples of attributes you want to
> store. If you have 100 different attributes, you'll have 100 tables like
> that. That may sound like a lot, but it's not.

In any case, there is no point in having one table per attribute, as
some attributes are logically grouped and can therefore be grouped
together in one table. Since there are 5-10 groups of attributes, 5-10
tables would be enough.

>
> This allows you to use proper data types for the attributes, as well as
> constraints and all the other goodies a good relational data model gives
> you
>
> It also allows you to build proper indexes on the attributes. For
> example, if you store populations as text, you're going to have a hard
> time building an index that allows you to query for positions with a
> population between 100-2000 efficiently.

Performing queries on the attribute value is of no interest, so that
does not matter.

> These are all imaginary examples, but what I'm trying to point out here
> is that a proper relational schema allows you to manage and query your
> data much more easily and with more flexibility, allows for future
> extensions.

They have been treating their data this way for the last 20 years, and
there is nothing on the horizon that tells either them or me that it
will be any different for the next 10 years. So I am not sure I need to plan
for that.

> A normalized schema will also take less space, which means less I/O and
> more performance,

That is what I am trying to find out, if it is true for this scenario as
well.

> because there's no need to store metadata like the
> data_type, attr_type on every row.

data_type and attr_type are not decorative metadata; they are actively
used as query parameters for each attribute. If they were not there I
would not be able to perform the queries I need to do.

> For performance reasons, you might
> actually want to not store the position-table at all in the above schema.
>
> An alternative design would be to have a single table, with one column
> per attribute:
>
> create table position (
>   posX int not null,
>   posY int not null,
>   colour varchar(50),
>   population int,
>   ...
>   primary key (posX, posY)
> )
>
> This is more space-efficient, especially if you have a lot of attributes
> on same coordinates. You can easily add and drop columns as needed,
> using ALTER TABLE.
>
>> I am still looking into the design of the tables, but I need to get at
>> proper test harness running before I can start ruling things out. And a
>> part of that, is for example, efficient ways of transferring the insert
>> data from the client to the db, instead of just single command inserts.
>> This is where bulk transfer by arrays probably would be preferable.
>
> Before you start fiddling with functions, I'd suggest that you try
> batching the inserts with the JDBC PreparedStatement batch facility.

I have done that; now I need to have something to compare it against,
preferably a function written in plpgsql and one in C.
So, any other suggestions on how to efficiently bulk-transfer the data to
the db for insertion?

> Splitting the inserts into multiple threads in your application sounds
> messy.

Well, it has been tested and shown to make postgres perform much
better, i.e. 100 000 inserts split between 4 threads performed much
faster than with a single thread alone.

> BTW, I concur with Y Sidhu that with data volumes as high as you have,
> partitioning is a good idea.

Yes, I will be looking into it.

regards

thomas


Re: improvement suggestions for performance design

From: "Kalle Hallivuori"
Date:
Hi Thomas & all,

2007/7/6, Thomas Finneid <tfinneid@ifi.uio.no>:
>
> Heikki Linnakangas wrote:
> > ISTM that a properly normalized schema would look something like this:

[example of tables per attr referencing main table containing only primary key]

> I agree that this is a way it could be done.

Indeed. Another way is postgresql-specific:

create table position (
  posX int not null,
  posY int not null
);

create table population ( population int not null ) inherits (position);
-- optionally:
create unique index population_idx on population (posX, posY, population);

This leads to each attr table inheriting posX, posY from position; you
never insert anything to position itself, but you can use it to list
all positions that have any attributes in any of the inherited tables
(in that sense it's a view over all its children).

> In any case, there is no point in having one table per attribute, as
> some attributes are logically grouped and can therefore be grouped
> toghether in the table. Since there are 5-10 groups of attributes, 5-10
> tables would be enough.

This sounds very sensible. This way you would send only 1 table (or
procedure, or prepared statement) name instead of as many attr_types
as you have attributes in a group.

So instead of calling 'your_procedure(type, posX, posY, data_type,
value)' separately for each of the 5 values, you would call
'attrgroup_and_datatype_specific_procedure(posX, posY, value1, value2,
value3, value4, value5)'. Inside the procedure the inserts change from
'insert into attribute_values values (type, posX, posY, data_type,
value)' to 'insert into attrgroup_and_datatype_specific_table values
(posX, posY, value1, value2, value3, value4, value5)' -- so you save
four inserts, and for each value inserted you use 2/5 of an extra field
instead of 4. You are allowed to use shorter names for the tables and
procedures ;)
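
To illustrate that transformation with a concrete, entirely invented example: suppose
one attribute group holds five float attributes in a table
attrgroup1(posx, posy, val1 .. val5); the client-side insert then shrinks to a single
wide-row statement per position.

import java.sql.*;

// Sketch: one wide row per (posX, posY) for a whole attribute group instead of
// five EAV rows. Table and column names are invented for illustration.
public class GroupInsert {

    public static void insertGroup1(Connection con, int posX, int posY,
                                    float v1, float v2, float v3, float v4, float v5)
            throws SQLException {
        PreparedStatement ps = con.prepareStatement(
            "insert into attrgroup1 (posx, posy, val1, val2, val3, val4, val5) " +
            "values (?, ?, ?, ?, ?, ?, ?)");
        try {
            ps.setInt(1, posX);
            ps.setInt(2, posY);
            ps.setFloat(3, v1);
            ps.setFloat(4, v2);
            ps.setFloat(5, v3);
            ps.setFloat(6, v4);
            ps.setFloat(7, v5);
            ps.executeUpdate();   // attr_type and data_type are implied by the table
        } finally {
            ps.close();
        }
    }
}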

It should be trivial to hide this separation in the client; you could even
create new tables for new kinds of attribute-datatype combinations
automatically on the fly.

> They have been treating their data this way for the last 20 years, and
> there is nothing on the horizon that tells neither them nor me that it
> will be any different the next 10 years. So I am not sure I need to plan
> for that.

Is it possible that normalization has been skipped originally because
of lack of resources or knowledge of the nature of data to be
imported, or lack of dynamism on the part of the original tools (such
as creation of type specific tables on the fly), that would now be
available, or at least worth a second look?

> > A normalized schema will also take less space, which means less I/O and
> > more performance,
>
> That is what I am trying to find out, if it is true for this scenario as
> well.

Well, you're saving four extra ints per value, when you only need
two per 5-10 values.

If you happen to save numerical data as the value in the text field
for some data_types, you are losing a lot more.

> > because there's no need to store metadata like the
> > data_type, attr_type on every row.
>
> data_type and attr_type are not decorative meta_data, they are actively
> used as query parameters for each attribute, if they where not there I
> would not be able to perform the queries I need to do.

You can still express them as table or column names rather than extra
data per row.

> > Before you start fiddling with functions, I'd suggest that you try
> > batching the inserts with the JDBC PreparedStatement batch facility.
>
> I have done that, now I need to have something to compare it against,
> preferably a function written in plpgsql and one in c.
> So any other suggestions on how to efficiently bulk transfer the data to
> the db for insertion?

COPY is many times faster than INSERT:
http://www.postgresql.org/docs/8.1/interactive/sql-copy.html

If you can't just push the data straight into the final table with
COPY, push it into a temporary table that you go through with the
database procedure.
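
A sketch of that temporary-table route, with an invented staging table and a placeholder
file path: the tab-separated file must already be on the database host and, as noted
further down in the thread, a server-side COPY FROM a file needs superuser rights.

import java.sql.*;

// Sketch: bulk-load a staging table with server-side COPY, then move the rows
// into attribute_values with one INSERT ... SELECT.
public class CopyViaStaging {

    public static void load(Connection con, String serverSidePath) throws SQLException {
        Statement st = con.createStatement();
        try {
            con.setAutoCommit(false);
            // temp table lives only for this transaction
            st.executeUpdate("create temp table staging "
                    + "(attr_type int, posx int, posy int, data_type int, value varchar(50)) "
                    + "on commit drop");
            // tab-separated file prepared by the client, readable by the server process
            st.execute("copy staging from '" + serverSidePath + "'");
            st.executeUpdate("insert into attribute_values "
                    + "(attr_type, posx, posy, data_type, value) "
                    + "select attr_type, posx, posy, data_type, value from staging");
            con.commit();
        } catch (SQLException e) {
            con.rollback();
            throw e;
        } finally {
            st.close();
        }
    }
}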

Shameless plug: If you use Java and miss COPY functionality in the
driver, it's available at

http://kato.iki.fi/sw/db/postgresql/jdbc/copy/

I was able to practically nullify time spent inserting with that.

> Well, it has been tested and showed to make postgres perform much
> better, ie. 100 000 inserts separated between 4 threads performed much
> faster than with a single thread alone.

Sounds interesting. The results should still end up written into the
same table, so are you sure this didn't end up using the same time at
the server end - would that even matter to you?

We ended up having best results with sequential batches of around 10
000 rows each.

> > BTW, I concur with Y Sidhu that with data volumes as high as you have,
> > partitioning is a good idea.
>
> Yes, I will be looking into to it.

Depending on the distribution of your data, saving each attribute group
(or datatype, or both) to its own table will take you some way in the
same direction.

If you have no indexes and do no deletes (like it seems to be in your
case), size of table might not matter much.

It might make sense in your case, though, to name tables with times,
like attribute_values_YYYYMMDD, and create a new table for each chosen
period, be it month, day or even per batch. (It takes a few seconds to
create a table though.)

To keep viewing the data as your customer is used to, you can hide the
separation of data into partitions by inheriting each partial table
from an identical ancestor table, that then serves as a view over all
its children -- but that's already explained in the docs. Separation
into tables by attribute groups you have to hide with a view, or
procedures, preferably server-side.

Cheers,

--
Kalle Hallivuori +358-41-5053073 http://korpiq.iki.fi/

Re: improvement suggestions for performance design

From: Thomas Finneid
Date:
Kalle Hallivuori wrote:

 > COPY is plentitudes faster than INSERT:
 > http://www.postgresql.org/docs/8.1/interactive/sql-copy.html
 >
 > If you can't just push the data straight into the final table with
 > COPY, push it into a temporary table that you go through with the
 > database procedure.
 >
 > Shameless plug: If you use Java and miss COPY functionality in the
 > driver, it's available at
 >
 > http://kato.iki.fi/sw/db/postgresql/jdbc/copy/
 >
 > I was able to practically nullify time spent inserting with that.

Interesting, I will definitely have a look at it.
What is the maturity level of the code at this point? And what is
potentially missing to bring it up to production quality? (Stability is
of the utmost importance in my usage scenario.)

 >
 >> Well, it has been tested and showed to make postgres perform much
 >> better, ie. 100 000 inserts separated between 4 threads performed much
 >> faster than with a single thread alone.
 >
 > Sounds interesting. The results should still end up written into the
 > same table, so are you sure this didn't end up using the same time at
 > server end - would that even matter to you?

Yes, it would matter, because a number of clients are waiting to read the
data before the next batch of data is inserted. (In essence, every 6
seconds 40 000 attributes must be written, and after that write 8-16
clients read most of that data based on query criteria. And this is just
today; in the future, 5-10 years, it might be as high as 200 000-300 000
attributes per 3 seconds.)

thomas

Re: improvement suggestions for performance design

From: "Kalle Hallivuori"
Date:
Hi.

2007/7/8, Thomas Finneid <tfinneid@student.matnat.uio.no>:
>
> Kalle Hallivuori wrote:
>
>  > COPY is plentitudes faster than INSERT:
>  > http://www.postgresql.org/docs/8.1/interactive/sql-copy.html
>  >
>  > If you can't just push the data straight into the final table with
>  > COPY, push it into a temporary table that you go through with the
>  > database procedure.
>  >
>  > Shameless plug: If you use Java and miss COPY functionality in the
>  > driver, it's available at
>  >
>  > http://kato.iki.fi/sw/db/postgresql/jdbc/copy/
>  >
>  > I was able to practically nullify time spent inserting with that.
>
> Interresting, I will definately have a look at it.
> What is the maturity level of the code at this point? and what is
> potentially missing to bring it up to production quality? (stability is
> of the utmost importance in my usage scenario.)

It's my third implementation, based on earlier work by Kris Jurka, a
maintainer of the JDBC driver. (It is really quite short, so it's easy
to keep it clear.) I consider it mature enough to have adopted it
as part of an upcoming large application, but I'd really like to hear
others' opinions. I should add documentation one of these days, maybe
even rewrite the javadoc.

You can use COPY as is on the server side without the patch, but then
you need to get the data as CSV or TSV files onto the database server
machine and use db superuser privileges to import it. My patch just
adds the ability to feed data from the client with normal user privileges.

--
Kalle Hallivuori +358-41-5053073 http://korpiq.iki.fi/

Re: improvement suggestions for performance design

From: Thomas Finneid
Date:
Kalle Hallivuori wrote:
> Hi.
>
> 2007/7/8, Thomas Finneid <tfinneid@student.matnat.uio.no>:
>>
>> Kalle Hallivuori wrote:
>>
>>  > COPY is plentitudes faster than INSERT:
>>  > http://www.postgresql.org/docs/8.1/interactive/sql-copy.html
>>  >
>>  > If you can't just push the data straight into the final table with
>>  > COPY, push it into a temporary table that you go through with the
>>  > database procedure.
>>  >
>>  > Shameless plug: If you use Java and miss COPY functionality in the
>>  > driver, it's available at
>>  >
>>  > http://kato.iki.fi/sw/db/postgresql/jdbc/copy/
>>  >
>>  > I was able to practically nullify time spent inserting with that.
>>
>> Interresting, I will definately have a look at it.
>> What is the maturity level of the code at this point? and what is
>> potentially missing to bring it up to production quality? (stability is
>> of the utmost importance in my usage scenario.)
>
> It's my third implementation, based on earlier work by Kris Jurka, a
> maintainer of the JDBC driver. (It is really quite short so it's easy
> to keep it clear.) I consider it mature enough to have accommodated it
> as part of an upcoming large application, but I'd really like to hear
> others' opinions. Documentation I should add one of these days, maybe
> even rewrite the javadoc.

Hi, I have tested your COPY patch (actually I tested
postgresql-jdbc-8.2-505-copy-20070716.jdbc3.jar) and it is really fast,
actually just as fast as server-side COPY (both tests were performed on the
local machine).

This means I am interested in using it in my project, but I have some
concerns that need to be addressed (and I am prepared to help in any
way I can). These are my concerns:

- While testing I got some errors, which need to be fixed (detailed below)
- The patch must be of production-grade quality
- I would like the patch to be part of the official pg JDBC driver.


The error I got most often is:

This command performs a single run, with a single thread, and generates 10000 rows
of data:

tofi@duplo:~/svn/pores$ java -server -Xms20m -Xmx256m -cp
/usr/java/jdk1.5.0_06/jre/lib/rt.jar:.:src/:test/:conf/:lib/postgresql-jdbc-8.2-505-copy-20070716.jdbc3.jar
wg.daemon.Main -m SINGLE_WRITE -t 1 -r 1 -c 10000 -p CPBulk
Initialising connection...
Performing insert...
Build bulk data time: 0s 211ms
toString() bulk data time: 0s 4ms
time: 0s 205ms
org.postgresql.util.PSQLException: Unexpected command completion while
copying: COPY
         at
org.postgresql.core.v3.QueryExecutorImpl.executeCopy(QueryExecutorImpl.java:706)
         at org.postgresql.copy.CopyManager.copyIntoDB(CopyManager.java:50)
         at org.postgresql.copy.CopyManager.copyIntoDB(CopyManager.java:37)
         at wg.storage.pores1.CPBulk.addAttributes(CPBulk.java:72)
         at wg.daemon.Daemon.run(Daemon.java:57)
tofi@duplo:~/svn/pores$ ls -al lib/


regards

thomas


Re: improvement suggestions for performance design

From: "Kalle Hallivuori"
Date:
Hi all!

2007/7/18, Thomas Finneid <tfinneid@student.matnat.uio.no>:
> Hi I have tested your COPY patch (actually I tested
> postgresql-jdbc-8.2-505-copy-20070716.jdbc3.jar) and it is really fast,
> actually just as fast as serverside COPY (boths tests was performed on
> local machine).

Happy to hear there's interest in this solution.

> This means I am interrested in using it in my project, but I have some
> concerns that needs to be adressed, (and I am prepared to help in any
> way I can). The following are the concerns I have
>
> - While testing I got some errors, which needs to be fixed (detailed below)
> - The patch must be of production grade quality
> - I would like the patch to be part of the official pg JDBC driver.

Definitely agreed, those are my requirements as well. We can discuss
bug fixing among ourselves; new versions I'll announce on the pgsql-jdbc
list.

--
Kalle Hallivuori +358-41-5053073 http://korpiq.iki.fi/