Discussion: Help request to improve function performance

Help request to improve function performance

From: sarikan
Date:
Dear members of the list,
I have a function which returns a custom type that has only two fields,
each of which is a varchar array.
The reason I have written this function is that I have a table
with basically the following structure (column names simplified):

name_col1  name_col2 sessionId
value1       value3       id1
value2       value2       id1
value4       value4       id1
value7       value4       id2
value2       value2       id2
value4       value4       id2
value1       value5       id3

So multiple rows are bound together by sessionIds, and I need to get back
all rows with a query, grouped by sessionId. However, a GROUP BY statement
does not solve my problem, since I get back a lot of rows, which I then have
to group into objects again in my application. What I need is a way to return
all rows having the same sessionId as a single row. Of course this is not
possible with this table, so I've created a custom type which has array-typed
columns. The following function gets all rows that belong to a
patient, and for each session id inserts the rows with that session id into
the array fields of the custom type.
The problem is, it is very slow! Getting back all the rows with a plain SELECT
takes 360 ms, while getting back the results of this function takes 50
seconds! Is there any way I can make the following function faster, or are
there other methods you can recommend for doing what I'm trying to do? I am
trying to avoid hundreds of calls to the db, or grouping query results in my
middleware application. Here is the function; your help will be much
appreciated.

Best Regards
Seref


CREATE or REPLACE FUNCTION getNodeContainers( context_Id varchar) RETURNS
setof NodesContainer AS
$$
DECLARE
archetype_data_row app.archetype_data%ROWTYPE;
archetype_data_row_main app.archetype_data%ROWTYPE;
nodescontainervar NodesContainer%ROWTYPE;
session_Id varchar;
indexVar integer := 0;
BEGIN
    CREATE TEMP TABLE all_rows_of_patient AS select * from
app.archetype_data
    WHERE app.archetype_data.context_id = context_Id;
    FOR session_Id IN
      SELECT distinct(all_rows_of_patient.session_id) from
all_rows_of_patient
    LOOP -- do the following for each session_ID
    indexVar := 0;
    FOR archetype_data_row IN --select rows that belong to this session ID
        SELECT * from all_rows_of_patient
    WHERE all_rows_of_patient.session_id = session_Id and
all_rows_of_patient.context_id = context_Id
    LOOP
        nodescontainervar.name[indexVar] := archetype_data_row.name;
        nodescontainervar.context_Id[indexVar] := archetype_data_row.context_Id;
        indexVar := indexVar + 1;
    END LOOP;
    return NEXT nodescontainervar;
    END LOOP;
    drop table all_rows_of_patient;
    return;
END;
$$ LANGUAGE 'plpgsql';

--
View this message in context:
http://www.nabble.com/Help-request-to-improve-function-performance-tp23175540p23175540.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: Help request to improve function performance

From: Filip Rembiałkowski
Date:


2009/4/22 sarikan <serefarikan@kurumsalteknoloji.com>

Dear members of the list,
I have a function which returns a custom type that has only two fields,
each of which is a varchar array.
The reason I have written this function is that I have a table
with basically the following structure (column names simplified):

name_col1  name_col2 sessionId
value1       value3       id1
value2       value2       id1
value4       value4       id1
value7       value4       id2
value2       value2       id2
value4       value4       id2
value1       value5       id3

Why not post your REAL schema? It would make life easier, both for you and for people trying to help.

 

So multiple rows are bound together by sessionIds, and I need to get back
all rows with a query, grouped by sessionId. However, a GROUP BY statement
does not solve my problem, since I get back a lot of rows, which I then have
to group into objects again in my application. What I need is a way to return
all rows having the same sessionId as a single row. Of course this is not
possible with this table, so I've created a custom type which has array-typed
columns. The following function gets all rows that belong to a
patient, and for each session id inserts the rows with that session id into
the array fields of the custom type.
The problem is, it is very slow! Getting back all the rows with a plain SELECT
takes 360 ms, while getting back the results of this function takes 50
seconds! Is there any way I can make the following function faster, or are
there other methods you can recommend for doing what I'm trying to do? I am
trying to avoid hundreds of calls to the db, or grouping query results in my
middleware application. Here is the function; your help will be much
appreciated.

Best Regards
Seref


(below code edited to be more readable; logic unchanged)
 

CREATE or REPLACE FUNCTION getNodeContainers( context_Id varchar) RETURNS
setof NodesContainer AS
$$
DECLARE
archetype_data_row app.archetype_data;
archetype_data_row_main app.archetype_data;
nodescontainervar NodesContainer;
session_Id varchar;
indexVar integer := 0;
BEGIN
   CREATE TEMP TABLE all_rows_of_patient AS select * from app.archetype_data
   WHERE context_id = context_Id;
   FOR session_Id IN
     SELECT distinct session_id from all_rows_of_patient
   LOOP -- do the following for each session_ID
       indexVar := 0;
       FOR archetype_data_row IN --select rows that belong to this session ID
               SELECT * from all_rows_of_patient
       WHERE session_id = session_Id and context_id = context_Id
       LOOP
               nodescontainervar.name[indexVar] := archetype_data_row.name;
               nodescontainervar.context_Id[indexVar] := archetype_data_row.context_Id;
               indexVar := indexVar + 1;
       END LOOP;
       return NEXT nodescontainervar;
   END LOOP;
   drop table all_rows_of_patient;
   return;
END;
$$ LANGUAGE 'plpgsql';


Please read the above code - that's what Postgres actually executes. Because your variable names collide with the column names during name resolution, conditions like
 WHERE context_id = context_Id;
 WHERE session_id = session_Id and context_id = context_Id
are effectively no-op conditions (each side ends up referring to the same thing), not what you really want.

I hope it's clear now why this function has such a long execution time :)


Some other remarks:

1) You use temp tables inside a function, which is rather bad (search the archives for an explanation). Try to avoid it.
2) The manual indexVar bookkeeping is not needed - there are array operators and functions for that.
3) If you get rid of the temp table, this function can be marked STABLE, which will prevent a penalty in some strange situations.
4) Mark your function STRICT, which will save some CPU cycles when someone calls it on null input.

(A sketch applying these points follows below.)
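As a minimal sketch only - assuming NodesContainer has two varchar[] fields named name and context_id (as the original function implies), and renaming the parameter to p_context_id so it no longer collides with the column - the function could look roughly like this:

CREATE OR REPLACE FUNCTION getNodeContainers(p_context_id varchar)
RETURNS SETOF NodesContainer AS
$$
DECLARE
    r           app.archetype_data%ROWTYPE;
    result_row  NodesContainer;
    cur_session varchar;
BEGIN
    FOR cur_session IN
        SELECT DISTINCT session_id
        FROM app.archetype_data
        WHERE context_id = p_context_id
    LOOP
        -- start a fresh container for this session
        result_row.name       := ARRAY[]::varchar[];
        result_row.context_id := ARRAY[]::varchar[];
        FOR r IN
            SELECT *
            FROM app.archetype_data
            WHERE context_id = p_context_id
              AND session_id = cur_session
        LOOP
            -- array_append replaces the manual indexVar bookkeeping
            result_row.name       := array_append(result_row.name, r.name);
            result_row.context_id := array_append(result_row.context_id, r.context_id);
        END LOOP;
        RETURN NEXT result_row;
    END LOOP;
    RETURN;
END;
$$ LANGUAGE plpgsql STABLE STRICT;

This still runs one query per session, though; the single-pass approach described further down the thread avoids that as well.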

HTH.
 

--
Filip Rembiałkowski
JID,mailto:filip.rembialkowski@gmail.com
http://filip.rembialkowski.net/

Re: Help request to improve function performance

From: Seref Arikan
Date:
Hi Filip,
First of all: thanks a lot for your kind response. Here is the create script for my schema:

CREATE TABLE "app"."archetype_data" (
  "id" BIGINT NOT NULL,
  "context_id" VARCHAR(1000),
  "archetype_name" VARCHAR(1000),
  "archetype_path" VARCHAR(1000),
  "name" VARCHAR(1000),
  "value_string" VARCHAR(1000),
  "value_int" BIGINT,
  "value_double" DOUBLE PRECISION,
  "session_id" VARCHAR(1000),
  "instance_index" INTEGER,
  CONSTRAINT "archetype_data_pkey" PRIMARY KEY("id")
) WITHOUT OIDS;

Now, regarding your feedback, here are the points, hoping that you can give me feedback, and bring me up to speed in the topic, for I've been in the upper layers of the software world for so long :) Please forgive me for further questions:

Are you telling me that, due to the name resolution process, my use of the same names for variables and columns (context_id and session_id) is causing problems? I'll change the variable names to names which are obviously different from the column names.

I used the temp table to speed up the subsequent selects, since the actual table has more than 9 million rows. After creating the temp table, I am selecting from 50K rows. Am I wrong about the performance gain here? What would you suggest instead?

Temp tables forced me to use EXECUTE after hitting a known problem, also mentioned in the FAQ - is this what you're talking about?

I will be investing serious time in PostgreSQL from now on, and I hope you can give me a couple of useful hints to ease my way forward :) Looking at the schema, can you think of a better way to send this result set to a Java-based app?

Many thanks again

Kind regards
Seref



Re: Help request to improve function performance

From: Grzegorz Jaśkiewicz
Date:
you keep everything in varchars, and yet you request improvements in
performance.
you are a funny guy, ...

Re: Help request to improve function performance

From: Seref Arikan
Date:
Hi there,
I have a set of dynamically composed objects represented in Java, with string values for various attributes, which have variable length. In case you have suggestions for a better type for this case, it would be my pleasure to hear about them.

2009/4/22 Grzegorz Jaśkiewicz <gryzman@gmail.com>
you keep everything in varchars, and yet you request improvements in
performance.
you are a funny guy, ...

Re: Help request to improve function performance

From: Scott Marlowe
Date:
2009/4/22 Seref Arikan <serefarikan@kurumsalteknoloji.com>:
> Hi Filip,
> First of all: thanks a lot for your kind response. Here is the create script
> for my schema:
>
> CREATE TABLE "app"."archetype_data" (
>   "id" BIGINT NOT NULL,
>   "context_id" VARCHAR(1000),
>   "archetype_name" VARCHAR(1000),
>   "archetype_path" VARCHAR(1000),
>   "name" VARCHAR(1000),
>   "value_string" VARCHAR(1000),
>   "value_int" BIGINT,
>   "value_double" DOUBLE PRECISION,
>   "session_id" VARCHAR(1000),
>   "instance_index" INTEGER,
>   CONSTRAINT "archetype_data_pkey" PRIMARY KEY("id")
> ) WITHOUT OIDS;

If I'm not mistaken, you're doing Entity Attribute Value model type
storage.  I.e. a database in a database.  Makes for easy coding, and
danged near impossible to troubleshoot your data.

It's a religious issue but I come down on the side that good data
modelling is hard for a reason, because it pays you back so much in
the end.

Re: Help request to improve function performance

From: John R Pierce
Date:
Seref Arikan wrote:
> I have a set of dynamically composed objects represented in Java, with
> string values for various attributes, which have variable length. In
> case you have suggestions for a better type for this case, it would be
> my pleasure to hear about them.

Cut out about 3 layers of abstraction and get down to what you REALLY
need to get done.   Dynamically composed piles of text-valued attributes
will NEVER be efficient, no matter what you do.





Re: Help request to improve function performance

From: Seref Arikan
Date:
Hi Scott,
I agree, and I am using the entity attribute value model because I simply have to. This table is used to persist data that is held in user-defined information models - kind of a domain-specific language. The users continuously create these hierarchical structures, so neither the number of them nor their structure is stable. On top of that, these structures can have quite deep hierarchies, with collections, references to other structures, etc. This forces almost everyone working in the domain to end up with this db model. In case you are curious and have a lot of time on your hands, you can visit www.openehr.org to see what I'm talking about. The specifications section has all the documents one could need.

I have worked with very capable DBAs before, and even though it has been quite some time since I've done real DB work, I would like to invest in postgresql as much as I can, to make the total framework faster. Therefore, all suggestions are welcome.

All the best
Seref


Re: Help request to improve function performance

From: Filip Rembiałkowski
Date:

On 22 April 2009 at 23:47, Seref Arikan <serefarikan@kurumsalteknoloji.com> wrote:
Hi Filip,
First of all: thanks a lot for your kind response. Here is the create script for my schema:

CREATE TABLE "app"."archetype_data" (
  "id" BIGINT NOT NULL,
  "context_id" VARCHAR(1000),
  "archetype_name" VARCHAR(1000),
  "archetype_path" VARCHAR(1000),
  "name" VARCHAR(1000),
  "value_string" VARCHAR(1000),
  "value_int" BIGINT,
  "value_double" DOUBLE PRECISION,
  "session_id" VARCHAR(1000),
  "instance_index" INTEGER,
  CONSTRAINT "archetype_data_pkey" PRIMARY KEY("id")
) WITHOUT OIDS;

Now, regarding your feedback, here are the points, hoping that you can give me feedback, and bring me up to speed in the topic, for I've been in the upper layers of the software world for so long :)
 

Yes that's easy to observe; but Java and Postgres can go together, I assure you :)

 
Please forgive me for further questions:

come on, nothing to forgive, if I did not like answering questions I would not read this at all.
 


Are you telling me that, due to the name resolution process, my use of the same names for variables and columns (context_id and session_id) is causing problems?

That's what I'm saying, am I not?


 
I'll change the variable names to names which are obviously different from the column names.


I used the temp table to speed up the subsequent selects, since the actual table has more than 9 million rows. After creating the temp table, I am selecting from 50K rows. Am I wrong about the performance gain here? What would you suggest instead?

Select once. ONCE. Not into a temp table, just via an implicit cursor [like here: http://www.postgresql.org/docs/current/static/plpgsql-control-structures.html#PLPGSQL-RECORDS-ITERATING ]. You can use ORDER BY session_id, so in ONE pass you will be able to do all the needed computations (you will have a block of records belonging to one session_id, then another block for the next session_id, and so on).

Regarding performance: if you create an index on (context_id, session_id), this query will be fast.
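As a rough sketch of that single-pass idea - again assuming NodesContainer has varchar[] fields name and context_id, and with made-up names for the parameter and the index - it could look something like this:

CREATE INDEX archetype_data_ctx_session_idx
    ON app.archetype_data (context_id, session_id);

CREATE OR REPLACE FUNCTION getNodeContainers(p_context_id varchar)
RETURNS SETOF NodesContainer AS
$$
DECLARE
    r            app.archetype_data%ROWTYPE;
    result_row   NodesContainer;
    prev_session varchar;
    have_rows    boolean := false;
BEGIN
    FOR r IN
        SELECT *
        FROM app.archetype_data
        WHERE context_id = p_context_id
        ORDER BY session_id      -- rows for one session arrive as a block
    LOOP
        IF have_rows AND r.session_id IS DISTINCT FROM prev_session THEN
            -- session changed: emit the finished container, start a new one
            RETURN NEXT result_row;
            result_row.name       := ARRAY[]::varchar[];
            result_row.context_id := ARRAY[]::varchar[];
        END IF;
        result_row.name       := array_append(result_row.name, r.name);
        result_row.context_id := array_append(result_row.context_id, r.context_id);
        prev_session := r.session_id;
        have_rows    := true;
    END LOOP;
    IF have_rows THEN
        RETURN NEXT result_row;  -- emit the last session's container
    END IF;
    RETURN;
END;
$$ LANGUAGE plpgsql STABLE STRICT;

On PostgreSQL 8.4 and later the same grouping can be done in plain SQL with array_agg (SELECT session_id, array_agg(name), array_agg(context_id) FROM app.archetype_data WHERE context_id = ... GROUP BY session_id), with no PL/pgSQL loop at all.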





Temp tables forced me to use EXECUTE after hitting a known problem, also mentioned in the FAQ - is this what you're talking about?

Yes, that's it, but as I said before - you really can get rid of temporary tables here.

 


I will be investing serious time in PostgreSQL from now on, and I hope you can give me a couple of useful hints to ease my way forward :) Looking at the schema, can you think of a better way to send this result set to a Java-based app?

A better way to query, or a better schema?

If you mean a better way to query - I would just select all rows and combine them in the upper layer. ARRAYs are not so cute for me.

If you mean a better schema... that's the hardest question, as the other guys suggested. You will have to decide what to use. I know that some ORMs for Java (Hibernate, namely) produce quite decent database schemas - NOT an E-A-V model. Maybe that will suit you.


good luck!


--
Filip Rembiałkowski
JID,mailto:filip.rembialkowski@gmail.com
http://filip.rembialkowski.net/

Re: Help request to improve function performance

From: Karsten Hilbert
Date:
On Thu, Apr 23, 2009 at 12:02:13AM +0100, Seref Arikan wrote:

> I have a set of dynamically composed objects represented in Java, with
> string values for various attributes, which have variable length. In case
> you have suggestions for a better type for this case, it would be my
> pleasure to hear about them.

Seref, he's suggesting you use TEXT instead of
VARCHAR(something). In PG it's actually usually *less*
overhead to use the unbounded text datatype (no length check
required).

Length checks mandated by business logic can be added by
more dynamic means -- check constraints, triggers, etc which
allow for less invasive change if needed.

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346

Re: Help request to improve function performance

From: Karsten Hilbert
Date:
On Wed, Apr 22, 2009 at 06:21:41PM -0600, Scott Marlowe wrote:

> > CREATE TABLE "app"."archetype_data" (
> >   "id" BIGINT NOT NULL,
> >   "context_id" VARCHAR(1000),
> >   "archetype_name" VARCHAR(1000),
> >   "archetype_path" VARCHAR(1000),
> >   "name" VARCHAR(1000),
> >   "value_string" VARCHAR(1000),
> >   "value_int" BIGINT,
> >   "value_double" DOUBLE PRECISION,
> >   "session_id" VARCHAR(1000),
> >   "instance_index" INTEGER,
> >   CONSTRAINT "archetype_data_pkey" PRIMARY KEY("id")
> > ) WITHOUT OIDS;
>
> If I'm not mistaken, you're doing Entity Attribute Value model type
> storage.  I.e. a database in a database.  Makes for easy coding, and
> danged near impossible to troubleshoot your data.

The upper-layer model he's trying to store data for has
extremely well-modelled constraints, complete with a
verification engine and all. So while your concern is valid,
it's probably less of an issue here.

> It's a religious issue but I come down on the side that good data
> modelling is hard for a reason, because it pays you back so much in
> the end.
The hard part has already been done for him in a very
thoughtful way: They've got a model, a metamodel and yet
another model for data instances of the model ;-)

The design group of the above methodology pretty much
suggested not putting too much additional modelling into the
data store (not that I agree too much)

@Seref: Thomas Beale said so ;-)

I dare say the extension power of PostgreSQL would lend
itself extremely well to actual implementation of the
OpenEHR model right in the database (mapping OpenEHR types
to complex types including real-time verification, building
AQL right into PostgreSQL in the form of stored procedures,
etc) but it would take some serious effort.

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346

Re: Help request to improve function performance

From: Karsten Hilbert
Date:
On Thu, Apr 23, 2009 at 09:44:53AM +0100, Seref Arikan wrote:

> I have worked with very capable DBAs before, and even though it has been
> quite some time since I've done real DB work, I would like to invest in
> postgresql as much as I can

Seref, if you can muster the man power to build archetypes
right into PostgreSQL that would make it the killer database
for OpenEHR:

- functions for reading and validating ADL creating complex
  datatypes thereof including on-store validation of instances
- functions to store and produce serialized versions of archetype instances
  (similar to XML handling)
- AQL right inside the database (select aql('')) returning
  serialized instances of archetypes

This list is probably incomplete and partially wrong.

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346

varchar vs. text + constraint/triggers was: Help request to improve function performance

From: Ivan Sergio Borgonovo
Date:
On Thu, 23 Apr 2009 11:00:59 +0200
Karsten Hilbert <Karsten.Hilbert@gmx.net> wrote:

> On Thu, Apr 23, 2009 at 12:02:13AM +0100, Seref Arikan wrote:
>
> > I have a set of dynamically composed objects represented in
> > Java, with string values for various attributes, which have
> > variable length. In case you have suggestions for a better type
> > for this case, it would be my pleasure to hear about them.
>
> Seref, he's suggesting you use TEXT instead of
> VARCHAR(something). In PG it's actually usually *less*
> overhead to use the unbounded text datatype (no length check
> required).
>
> Length checks mandated by business logic can be added by
> more dynamic means -- check constraints, triggers, etc which
> allow for less invasive change if needed.

Could you point us to some example of a constraint/trigger (etc...)
that is going to provide the same checking of varchar and explain
(if the code/example... doesn't make it clear) why it should be
faster or less invasive?

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it


Re: varchar vs. text + constraint/triggers was: Help request to improve function performance

From: Karsten Hilbert
Date:
On Thu, Apr 23, 2009 at 11:33:34AM +0200, Ivan Sergio Borgonovo wrote:

> Karsten Hilbert <Karsten.Hilbert@gmx.net> wrote:
>
> > On Thu, Apr 23, 2009 at 12:02:13AM +0100, Seref Arikan wrote:
> >
> > > I have a set of dynamically composed objects represented in
> > > Java, with string values for various attributes, which have
> > > variable length. In case you have suggestions for a better type
> > > for this case, it would be my pleasure to hear about them.
> >
> > Seref, he's suggesting you use TEXT instead of
> > VARCHAR(something). In PG it's actually usually *less*
> > overhead to use the unbounded text datatype (no length check
> > required).
> >
> > Length checks mandated by business logic can be added by
> > more dynamic means -- check constraints, triggers, etc which
> > allow for less invasive change if needed.
>
> Could you point us to some example of a constraint/trigger (etc...)
> that is going to provide the same checking of varchar and explain
> (if the code/example... doesn't make it clear) why it should be
> faster or less invasive?

check constraint based:

    create table foo (
        txt text
            check (char_length(txt) < 1001)
    );

trigger based:

    create function trf_check_length_1000() ... returns trigger ... $$...$$;

    create table foo (
        txt text
    );

    create trigger check_txt_length before INSERT or UPDATE ... execute trf_check_length_1000();
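For completeness, the elided trigger function could be fleshed out along these lines (the function name, table foo and the 1000-character limit follow the sketch above; the body itself is just an illustration):

    create function trf_check_length_1000() returns trigger as $$
    begin
        if char_length(NEW.txt) > 1000 then
            raise exception 'txt longer than 1000 characters';
        end if;
        return NEW;
    end;
    $$ language plpgsql;

    create table foo (
        txt text
    );

    create trigger check_txt_length
        before insert or update on foo
        for each row
        execute procedure trf_check_length_1000();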

faster:

- TEXT is (judging by previous comments on this list)
  marginally faster than VARCHAR(1000) because a) it runs
  the same code but b) doesn't have to check for the 1000
  length

- other options (VARCHAR, constraint, trigger) incur
  additional overhead and are thus slower

less invasive:

Well, poor wording on my part, perhaps. What I meant is that
changing a check constraint or trigger appears to be a less
costly operation on a table than changing the datatype of a
column (although I seem to remember there being some
optimizations in place for the case of changing the *length*
of a varchar).


I may be wrong in the above and if so it better be brought
to our collective attention for the benefit of readers.

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346

Re: varchar vs. text + constraint/triggers was: Help request to improve function performance

From: Ivan Sergio Borgonovo
Date:
On Thu, 23 Apr 2009 12:00:30 +0200
Karsten Hilbert <Karsten.Hilbert@gmx.net> wrote:

> On Thu, Apr 23, 2009 at 11:33:34AM +0200, Ivan Sergio Borgonovo
> wrote:
>
> > Karsten Hilbert <Karsten.Hilbert@gmx.net> wrote:
> >
> > > On Thu, Apr 23, 2009 at 12:02:13AM +0100, Seref Arikan wrote:
> > >
> > > > I have a set of dynamically composed objects represented in
> > > > Java, with string values for various attributes, which have
> > > > variable length. In case you have suggestions for a better
> > > > type for this case, it would be my pleasure to hear about
> > > > them.
> > >
> > > Seref, he's suggesting you use TEXT instead of
> > > VARCHAR(something). In PG it's actually usually *less*
> > > overhead to use the unbounded text datatype (no length check
> > > required).
> > >
> > > Length checks mandated by business logic can be added by
> > > more dynamic means -- check constraints, triggers, etc which
> > > allow for less invasive change if needed.
> >
> > Could you point us to some example of a constraint/trigger
> > (etc...) that is going to provide the same checking of varchar
> > and explain (if the code/example... doesn't make it clear) why
> > it should be faster or less invasive?
>
> check constraint based:
>
>     create table foo (
>         txt text
>             check (char_length(txt) < 1001)
>     );
>
> trigger based:
>
>     create function trf_check_length_1000() ... returns
> trigger ... $$...$$;
>
>     create table foo (
>         txt text
>     );
>
>     create trigger check_txt_length before INSERT or
> UPDATE ... execute trf_check_length_1000();
>
> faster:
>
> - TEXT is (judging by previous comments on this list)
>   marginally faster than VARCHAR(1000) because a) it runs
>   the same code but b) doesn't have to check for the 1000
>   length
>
> - other options (VARCHAR, constraint, trigger) incur
>   additional overhead and are thus slower
>
> less invasive:
>
> Well, poor wording on my part, perhaps. What I meant is that
> changing a check constraint or trigger appears to be a less
> costly operation on a table than changing the datatype of a
> column (although I seem to remember there being some
> optimizations in place for the case of changing the *length*
> of a varchar).

I'll try to rephrase to check if I understood and for reference.

varchar is slower than text since it has to do some "data type
check".

text is faster but if you add a check... it gets slower (slower than
varchar?, faster?).

constraint and trigger should have the advantage that in case of
refactoring you're not going to touch the table definition - that
*may* end up being faster.

But... if in postgresql implementation varchar is just text with a
check... how can a change in type be faster?
If it were a char(N), maybe there would be some kind of optimization,
since the length of the data is known in advance... so
shrinking/enlarging a char(N) may have a different cost than
shrinking a varchar(N), which in pg *should* have the same
implementation as text.

On the other hand... you're preferring text just because they have
the same implementation (modulo the check) in Postgresql... but it may
not be so in other DBs.
So *maybe* other DBs do some optimization on varchar vs. text.

Somehow I like the idea of considering a varchar a text with a
check, but I think I prefer the "more traditional" approach, since
it is somehow the "most expected".

Nothing can handle strings of infinite length, and long before
reaching infinity I'll get into trouble.
People read differently what you'd like to say writing varchar(N).
Most people read:
1) we expect a length around N
Fewer people read:
2) There is something not working if we get something larger than N
But it may also mean:
3) if we get something larger than N something is going to explode
I think the same "ambiguity" is carried by check().
Anyway for a sufficiently large N 2) and 3) can be valid.

Supposing the cost of losing an insert for an unexpectedly large
value of N is high, I'd be tempted to set N to at least protect me
from 3), but I bet people may interpret it as 1).

In my experience, anyway, varchar is a good early warning for trouble,
and the risk of being misunderstood or getting caught by an
implementation-dependent gotcha when writing varchar(N) where N means 3)
largely encompasses the risk of losing an insert you didn't have to lose.

Maybe I've spotted a potential advantage of check over varchar.
If you use some kind of convention to name checks, you can
remove/re-apply them more easily than by spotting varchar()s.
The name of the constraint may contain metadata to help you.
The name of the constraint may also suggest to your colleagues
why it's there.
But this only works if your implementation performs similarly on text
and varchar().
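A small illustration of that naming-convention idea (table, column and constraint names here are invented):

    alter table foo
        add constraint foo_txt_maxlen_1000
        check (char_length(txt) <= 1000);

    -- constraints named by convention are easy to find, drop and re-create
    select conname, conrelid::regclass as table_name
    from pg_constraint
    where conname like '%maxlen%';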

--
Ivan Sergio Borgonovo
http://www.webthatworks.it


Re: varchar vs. text + constraint/triggers was: Help request to improve function performance

From: Grzegorz Jaśkiewicz
Date:
Essentially you are trying to store a database in a database, and
that's slow, for one.
Second, storing things as varchar is space- and index-inefficient -
and that's another reason things get slower.
Third, you need to complicate your logic to retrieve data, and that adds up.

text is less of a hassle for the db, true - but that's just the tip of the iceberg.

Just learn to create a proper database schema, and make it so it meets
your criteria - otherwise, whatever else you choose - especially
anything 'automatic' or 'intelligent' - is never going to be as good as a
proper schema.

Re: varchar vs. text + constraint/triggers was: Help request to improve function performance

From: Karsten Hilbert
Date:
On Thu, Apr 23, 2009 at 01:21:05PM +0200, Ivan Sergio Borgonovo wrote:

> I'll try to rephrase to check if I understood and for reference.
>
> varchar is slower than text since it has to do some "data type
> check".

Yes but no. It is said to be slower because it has to do a
data length check, not a data *type* check. Oh, did you mean
"check inherent to the data type" ?

> text is faster
... than varchar(something)

> but if you add a check... it gets slower
... than itself without the check, yes

> (slower than varchar?, faster?).
subject to testing

> constraint and trigger should have the advantage that in case of
> refactoring you're not going to touch the table definition
as far as the data types are constrained, the "core" table
definition so to speak

> that *may* end up being faster.
Yes, but I wasn't concerned about which constraint method is
faster, only about which datatype (a distinction which, in the
case of TEXT vs VARCHAR, is somewhat arbitrary).

> But... if in postgresql implementation varchar is just text with a
> check... how can a change in type be faster?

A change of column datatype rewrites the table (again, I am
not 100% sure whether this applies to VARCHAR(x) <->
VARCHAR(y) and VARCHAR(x) <-> TEXT, since they are
essentially the same type with or without a check).
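To make the comparison concrete (reusing the invented foo/txt example from earlier in the thread; the exact behaviour depends on the PostgreSQL version):

    -- changing the declared type: may force a full table rewrite
    alter table foo alter column txt type varchar(2000);

    -- changing a check constraint instead: existing rows are re-validated,
    -- but the table itself is not rewritten
    alter table foo drop constraint foo_txt_maxlen_1000;
    alter table foo
        add constraint foo_txt_maxlen_2000
        check (char_length(txt) <= 2000);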

> On the other hand... you're preferring text just because they have
> the same implementation (modulo the check) in Postgresql... but it may
> not be so in other DBs.
> So *maybe* other DBs do some optimization on varchar vs. text.
True enough but I wasn't talking about those. Seref asked
about implementing archetypes on PostgreSQL.

> Nothing can handle strings of infinite length, and long before
> reaching infinity I'll get into trouble.
> People read differently what you'd like to say writing varchar(N).

> Most people read:
> 1) we expect a length around N
> Fewer people read:
> 2) There is something not working if we get something larger than N

VARCHAR(N)

VAR  - variable something
CHAR - characters

-> so, likely, variable *number* of characters because
   it better store variable characters ;-)

(N) - some boundary condition

so, either:

- exactly N (but, then, why *VAR*char ?)
- at least N (huh ?, but, well)
- at most N

> But it may also mean:
> 3) if we get something larger than N something is going to explode
> I think the same "ambiguity" is carried by check().
> Anyway for a sufficiently large N 2) and 3) can be valid.

No doubt.

> Supposing the cost of losing an insert for an unexpectedly large
> value of N is high, I'd be tempted to set N to at least protect me
> from 3), but I bet people may interpret it as 1).
If you want PostgreSQL to help protect you from the risk of
out-of-memory error, then, yes, it can help a tiny bit to
use VARCHAR(N) where N = "reasonable" (due to earlier
warning) instead of letting PG go to the limits with TEXT.
Agreed.

> In my experience, anyway, varchar is a good early warning for trouble,
> and the risk of being misunderstood or getting caught by an
> implementation-dependent gotcha when writing varchar(N) where N means 3)
> largely encompasses the risk of losing an insert you didn't have to lose.
I see. That's surely a valid point of view.

> Maybe I've spotted a potential advantage of check over varchar.
> If you use some kind of convention to name checks, you can
> remove/re-apply them more easily than by spotting varchar()s.
> The name of the constraint may contain metadata to help you.
> The name of the constraint may also suggest to your colleagues
> why it's there.
> But this only works if your implementation performs similarly on text
> and varchar().
No, the self-documentation advantage is there regardless of
performance. But the choice is a tradeoff either way, that's
for sure.

I think we may have gotten to a point where we won't help
the OP much :-)

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346


Re: Help request to improve function performance

From: Seref Arikan
Date:
Hi Filip,
Thanks a lot for your kind help. Selecting only once did the trick. Dropping to 2 seconds for the select instead of 50 IS an improvement indeed :)
Indexes on the columns already existed, and just out of curiosity I've tested char columns instead of varchars, with no significant positive change. Eliminating the loop made all the difference.

Next thing to check out is the partitioning options.

Kind regards.
