Discussion: Help request to improve function performance
Dear members of the list,

I have a function which returns a custom type that has only two fields, each of them being varchar arrays. The reason I have written this function is that I have a table with basically the following structure (with simplified column names):

name_col1  name_col2  sessionId
value1     value3     id1
value2     value2     id1
value4     value4     id1
value7     value4     id2
value2     value2     id2
value4     value4     id2
value1     value5     id3

So multiple rows are bound together by sessionIds, and I need to get back all rows with a query, grouped by sessionId. However, a GROUP BY SQL statement does not solve my problem, since I get back a lot of rows, which I then have to group into objects again in my application. What I need is a way to return all rows having the same sessionId as a single row. Of course this is not possible with this table, so I've created a custom type which has array-typed columns. The following function gets all rows that belong to a patient, and for each session id it inserts the rows with that session id into the array fields of the custom type.

The problem is, it is very slow! Getting back all the rows with a select takes 360 ms, while getting back the results of this function takes 50 seconds! Is there any way I can make the following function faster, or any other method you can recommend to do what I'm trying to do? I am trying to avoid hundreds of calls to the db, or grouping query results in my middleware application. Here comes the function, and your help will be much appreciated.

Best Regards
Seref

CREATE or REPLACE FUNCTION getNodeContainers( context_Id varchar) RETURNS setof NodesContainer AS
$$
DECLARE
    archetype_data_row app.archetype_data%ROWTYPE;
    archetype_data_row_main app.archetype_data%ROWTYPE;
    nodescontainervar NodesContainer%ROWTYPE;
    session_Id varchar;
    indexVar integer := 0;
BEGIN
    CREATE TEMP TABLE all_rows_of_patient AS
        select * from app.archetype_data
        WHERE app.archetype_data.context_id = context_Id;
    FOR session_Id IN
        SELECT distinct(all_rows_of_patient.session_id) from all_rows_of_patient
    LOOP -- do the following for each session_ID
        indexVar := 0;
        FOR archetype_data_row IN --select rows that belong to this session ID
            SELECT * from all_rows_of_patient
            WHERE all_rows_of_patient.session_id = session_Id
              and all_rows_of_patient.context_id = context_Id
        LOOP
            nodescontainervar.name[indexVar] := archetype_data_row.name;
            nodescontainervar.context_Id[indexVar] := archetype_data_row.context_Id;
            indexVar := indexVar + 1;
        END LOOP;
        return NEXT nodescontainervar;
    END LOOP;
    drop table all_rows_of_patient;
    return;
END;
$$ LANGUAGE 'plpgsql';

-- View this message in context: http://www.nabble.com/Help-request-to-improve-function-performance-tp23175540p23175540.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.
2009/4/22 sarikan <serefarikan@kurumsalteknoloji.com>
Dear members of the list,
I have a function which returns a custom type, that has only two fields,
each of them being varchar arrays.
The reason that I have written this function is that I have a table
basically with the following structure (with simplified column names)
name_col1 name_col2 sessionId
value1 value3 id1
value2 value2 id1
value4 value4 id1
value7 value4 id2
value2 value2 id2
value4 value4 id2
value1 value5 id3
Why not post your REAL schema? It would make life easier, both for you and for people trying to help.
So multiple rows are bound together by sessionIds, and I need to get back
all rows with a query, grouped by sessionID. However, group by sql statement
does not solve my problem, since I get back a lot of rows, which I have to
group into objects again in my application. What I need is a way to return
all rows having the same sessionId as a single row. Of course this is not
possible with this table, so I've created a custom type, which has array
type columns. The following function gets all rows that belong to a
patient, and for each session id, it inserts rows with that session id into
array fields of the custom type.
The problem is, it is very slow! Getting back all the rows with a select
takes 360 ms, while getting back the results of this function takes 50
seconds! Is there any way I can make the following function faster, or any
other methods you can recommend to do what I'm trying to do? I am trying to
avoid hundreds of calls to db, or grouping query results in my middleware
application. Here comes the function, and your help will be much
appreciated.
Best Regards
Seref
(below code edited to be more readable; logic unchanged)
CREATE or REPLACE FUNCTION getNodeContainers( context_Id varchar) RETURNS setof NodesContainer AS
$$
DECLARE
    archetype_data_row app.archetype_data;
    archetype_data_row_main app.archetype_data;
    nodescontainervar NodesContainer;
    session_Id varchar;
    indexVar integer := 0;
BEGIN
    CREATE TEMP TABLE all_rows_of_patient AS
        select * from app.archetype_data
        WHERE context_id = context_Id;
    FOR session_Id IN
        SELECT distinct session_id from all_rows_of_patient
    LOOP -- do the following for each session_ID
        indexVar := 0;
        FOR archetype_data_row IN --select rows that belong to this session ID
            SELECT * from all_rows_of_patient
            WHERE session_id = session_Id and context_id = context_Id
        LOOP
            nodescontainervar.name[indexVar] := archetype_data_row.name;
            nodescontainervar.context_Id[indexVar] := archetype_data_row.context_Id;
            indexVar := indexVar + 1;
        END LOOP;
        return NEXT nodescontainervar;
    END LOOP;
    drop table all_rows_of_patient;
    return;
END;
$$ LANGUAGE 'plpgsql';
Please read the code above; that is what Postgres actually executes. Column names have precedence over variable names in name resolution.
conditions like
WHERE context_id = context_Id;
WHERE session_id = session_Id and context_id = context_Id
are obviously no-op conditions, not what you really want.
I hope it's now clear why this function has a long execution time :)
some other remarks:
1) you use temp tables inside a function, which is rather bad (search the archives for an explanation). Try to avoid it.
2) the indexVar bookkeeping is not needed; there are array operators and functions for this.
3) if you get rid of the temp table, this function can be marked STABLE, which will prevent a penalty in some strange situations.
4) mark your function STRICT, which will save some CPU cycles when someone calls it on NULL input.
HTH.
--
Filip Rembiałkowski
JID,mailto:filip.rembialkowski@gmail.com
http://filip.rembialkowski.net/
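Filip's remarks about arrays, STABLE and STRICT can be combined into a set-based rewrite, roughly like this. This is a sketch, not code from the thread: it assumes PostgreSQL 8.4's array_agg, the two-field NodesContainer type described by the original poster, and renames the parameter to dodge the name collision.

```sql
-- Sketch only: one grouped query instead of nested loops and a temp table.
-- Assumes array_agg (PostgreSQL 8.4+); on 8.3 an ARRAY(SELECT ...) subquery
-- per field would be needed. Element order within the arrays is unspecified.
CREATE OR REPLACE FUNCTION getNodeContainers(p_context_id varchar)
RETURNS SETOF NodesContainer AS
$$
    SELECT array_agg(a.name),
           array_agg(a.context_id)
    FROM app.archetype_data a
    WHERE a.context_id = p_context_id
    GROUP BY a.session_id;
$$ LANGUAGE sql STABLE STRICT;
```

Because the parameter name no longer collides with a column, the WHERE clause actually filters, and the STABLE and STRICT markers apply as Filip suggests.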
Hi Filip,
First of all: thanks a lot for your kind response. Here is the create script for my schema:
CREATE TABLE "app"."archetype_data" (
"id" BIGINT NOT NULL,
"context_id" VARCHAR(1000),
"archetype_name" VARCHAR(1000),
"archetype_path" VARCHAR(1000),
"name" VARCHAR(1000),
"value_string" VARCHAR(1000),
"value_int" BIGINT,
"value_double" DOUBLE PRECISION,
"session_id" VARCHAR(1000),
"instance_index" INTEGER,
CONSTRAINT "archetype_data_pkey" PRIMARY KEY("id")
) WITHOUT OIDS;
Now, regarding your feedback, here are my points, hoping that you can give me feedback and bring me up to speed in the topic, for I've been in the upper layers of the software world for so long :) Please forgive me for further questions:
Are you telling me that my use of the same names for variables and columns (context_id and session_id) is causing problems due to the name resolution process? I'll change the variable names into names which are obviously different from the column names.
I used the temp table to speed up the following selects, since the actual table has more than 9 million rows. After creating the temp table, I am selecting from 50K rows. Am I wrong about the performance gain here? What would you suggest instead?
Temp tables forced me to use EXECUTE, after hitting a known problem also described in the FAQ; is this what you're talking about?
I will be investing serious time into postgresql from now on, and I hope you can give me a couple of useful hints to ease my way forward :) Looking at the schema, can you think of a better way to send this result set to a java based app?
Many thanks again
Kind regards
Seref
Hi there,
I have a set of dynamically composed objects represented in Java, with string values for various attributes, which have variable length. In case you have suggestions for a better type for this case, it would be my pleasure to hear about them.
2009/4/22 Grzegorz Jaśkiewicz <gryzman@gmail.com>
you keep everything in varchars, and yet you request improvements in
performance.
you are a funny guy, ...
2009/4/22 Seref Arikan <serefarikan@kurumsalteknoloji.com>:
> Hi Filip,
> First of all: thanks a lot for your kind response. Here is the create script
> for my schema: [...]

If I'm not mistaken, you're doing Entity Attribute Value model type storage. I.e. a database in a database. Makes for easy coding, and danged near impossible to troubleshoot your data.

It's a religious issue but I come down on the side that good data modelling is hard for a reason, because it pays you back so much in the end.
Seref Arikan wrote:
> I have a set of dynamically composed objects represented in Java, with
> string values for various attributes, which have variable length. In
> case you have suggestions for a better type for this case, it would be
> my pleasure to hear about them.

cut out about 3 layers of abstraction and get down to what you REALLY need to get done. dynamically composed piles of text valued attributes will NEVER be efficient, no matter what you do.
Hi Scott,
I agree, and I am doing the entity attribute value model because I simply have to. This table is used to persist data that is held in user-defined information models, kind of a domain specific language. The users continuously create these hierarchical structures, so neither the amount of them nor their structure is stable. On top of that, these structures can have quite deep hierarchies, with collections, references to other structures etc. This forces almost everyone working in the domain to end up with this db model. In case you are curious and have a lot of time on your hands, you can visit www.openehr.org to see what I'm talking about. The specifications part has all the documents one could need.
I have worked with very capable DBAs before, and even though it has been quite some time since I've done real DB work, I would like to invest in postgresql as much as I can, to make the total framework faster. Therefore, all suggestions are welcome.
All the best
Seref
On 22 April 2009 at 23:47, Seref Arikan <serefarikan@kurumsalteknoloji.com> wrote:
Hi Filip,
First of all: thanks a lot for your kind response. Here is the create script for my schema:
CREATE TABLE "app"."archetype_data" (
"id" BIGINT NOT NULL,
"context_id" VARCHAR(1000),
"archetype_name" VARCHAR(1000),
"archetype_path" VARCHAR(1000),
"name" VARCHAR(1000),
"value_string" VARCHAR(1000),
"value_int" BIGINT,
"value_double" DOUBLE PRECISION,
"session_id" VARCHAR(1000),
"instance_index" INTEGER,
CONSTRAINT "archetype_data_pkey" PRIMARY KEY("id")
) WITHOUT OIDS;
Now, regarding your feedback, here are the points, hoping that you can give me feedback, and bring me up to speed in the topic, for I've been in the upper layers of the software world for so long :)
Yes that's easy to observe; but Java and Postgres can go together, I assure you :)
Please forgive me for further questions:
come on, nothing to forgive, if I did not like answering questions I would not read this at all.
Are you telling me that due to name resolution process, my use of variable and column names for context_id and session_id are causing problems?
That's what I'm telling you, isn't it?
I'll change variable names into names which would be obviously different from column names.
I used the temp table to speed up the following selects, since the actual table has more than 9 million rows. after creating the temp table, I am selecting from 50K rows. Am I wrong about the performance gain here? What would you suggest instead?
Select once. ONCE. Not into a temp table, just into an implicit cursor [like here: http://www.postgresql.org/docs/current/static/plpgsql-control-structures.html#PLPGSQL-RECORDS-ITERATING ]. You can use ORDER BY session_id, so in ONE pass you will be able to do all the needed computations (you will have a block of records belonging to one session_id, then another block for the next session_id, and so on).
Regarding performance: if you create an index on (context_id, session_id), this query will be fast.
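A hedged sketch of what Filip describes, combining the index and the single ordered pass (untested; the index and variable names are made up for illustration, and it assumes session_id is never NULL):

```sql
-- Index supporting the context_id filter and the ordered scan (name is illustrative):
CREATE INDEX archetype_data_ctx_session_idx
    ON app.archetype_data (context_id, session_id);

-- Single ordered pass, flushing one NodesContainer row per session_id.
-- Assumes the NodesContainer type from earlier in the thread and that
-- session_id is never NULL.
CREATE OR REPLACE FUNCTION getNodeContainers(p_context_id varchar)
RETURNS SETOF NodesContainer AS
$$
DECLARE
    r app.archetype_data%ROWTYPE;
    result NodesContainer;
    prev_session varchar := NULL;
BEGIN
    FOR r IN
        SELECT * FROM app.archetype_data
        WHERE context_id = p_context_id
        ORDER BY session_id
    LOOP
        IF r.session_id IS DISTINCT FROM prev_session THEN
            -- session changed: emit the accumulated row and start a new one
            IF prev_session IS NOT NULL THEN
                RETURN NEXT result;
            END IF;
            result.name := ARRAY[]::varchar[];
            result.context_id := ARRAY[]::varchar[];
            prev_session := r.session_id;
        END IF;
        result.name := array_append(result.name, r.name);
        result.context_id := array_append(result.context_id, r.context_id);
    END LOOP;
    IF prev_session IS NOT NULL THEN
        RETURN NEXT result;  -- flush the last session's row
    END IF;
    RETURN;
END;
$$ LANGUAGE plpgsql STABLE STRICT;
```

With the parameter renamed, the WHERE clause really filters, so no temp table is needed and the function can carry the STABLE and STRICT markers.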
Temp tables forced me to use execute, after hitting a known problem, also expressed in the faq, is this what you're talking about?
Yes, that's it, but as I said before, you really can get rid of temporary tables here.
I will be investing serious time into postgresql from now on, and I hope you can give me couple of useful hints, to ease my way forward :) Looking at the schema, can you think of a better way to send this result set to a java based app?
A better way to query, or a better schema?
If you mean a better way to query: I would just select all rows and combine them in the upper layer. ARRAYs are not so cute to me.
If you mean a better schema... That's the hardest question, as the other guys suggested. You will have to decide what to use. I know that some ORMs for Java (namely Hibernate) produce quite decent database schemas. NOT the E-A-V model. Maybe this will suit you.
good luck!
--
Filip Rembiałkowski
JID,mailto:filip.rembialkowski@gmail.com
http://filip.rembialkowski.net/
On Thu, Apr 23, 2009 at 12:02:13AM +0100, Seref Arikan wrote:
> I have a set of dynamically composed objects represented in Java, with
> string values for various attributes, which have variable length. In case
> you have suggestions for a better type for this case, it would be my
> pleasure to hear about them.

Seref, he's suggesting you use TEXT instead of VARCHAR(something). In PG it's actually usually *less* overhead to use the unbounded text datatype (no length check required).

Length checks mandated by business logic can be added by more dynamic means -- check constraints, triggers, etc which allow for less invasive change if needed.

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
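The switch Karsten suggests can be done in place; for example (illustrative only; on older PostgreSQL versions this may rewrite the whole 9-million-row table, so it is worth trying on a copy first):

```sql
-- Sketch: drop the length limit on one column by moving it from
-- varchar(1000) to text; repeat per column as needed.
ALTER TABLE app.archetype_data
    ALTER COLUMN name TYPE text;
```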
On Wed, Apr 22, 2009 at 06:21:41PM -0600, Scott Marlowe wrote:
> If I'm not mistaken, you're doing Entity Attribute Value model type
> storage. I.e. a database in a database. Makes for easy coding, and
> danged near impossible to troubleshoot your data.

The upper layer model he's trying to store data of has extremely well modelled constraints complete with verification engine and all. So while your concern is valid it's probably less so.

> It's a religious issue but I come down on the side that good data
> modelling is hard for a reason, because it pays you back so much in
> the end.

The hard part has already been done for him in a very thoughtful way: they've got a model, a metamodel and yet another model for data instances of the model ;-) The design group of the above methodology pretty much suggested not putting too much additional modelling into the data store (not that I agree too much). @Seref: Thomas Beale said so ;-)

I dare say the extension power of PostgreSQL would lend itself extremely well to an actual implementation of the OpenEHR model right in the database (mapping OpenEHR types to complex types including real-time verification, building AQL right into PostgreSQL in the form of stored procedures, etc) but it would take some serious effort.

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
On Thu, Apr 23, 2009 at 09:44:53AM +0100, Seref Arikan wrote:
> I have worked with very capable DBAs before, and even though it has been
> quite some time since I've done real DB work, I would like to invest in
> postgresql as much as I can

Seref, if you can muster the man power to build archetypes right into PostgreSQL, that would make it the killer database for OpenEHR:

- functions for reading and validating ADL, creating complex datatypes thereof, including on-store validation of instances
- functions to store and produce serialized versions of archetype instances (similar to XML handling)
- AQL right inside the database (select aql('')) returning serialized instances of archetypes

This list is probably incomplete and partially wrong.

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
varchar vs. text + constraint/triggers was: Help request to improve function performance
From:
Ivan Sergio Borgonovo
Date:
On Thu, 23 Apr 2009 11:00:59 +0200, Karsten Hilbert <Karsten.Hilbert@gmx.net> wrote:
> Seref, he's suggesting you use TEXT instead of
> VARCHAR(something). In PG it's actually usually *less*
> overhead to use the unbounded text datatype (no length check
> required).
>
> Length checks mandated by business logic can be added by
> more dynamic means -- check constraints, triggers, etc which
> allow for less invasive change if needed.

Could you point us to some example of a constraint/trigger (etc...) that is going to provide the same checking as varchar, and explain (if the code/example... doesn't make it clear) why it should be faster or less invasive?

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
Re: varchar vs. text + constraint/triggers was: Help request to improve function performance
From:
Karsten Hilbert
Date:
On Thu, Apr 23, 2009 at 11:33:34AM +0200, Ivan Sergio Borgonovo wrote:
> Could you point us to some example of a constraint/trigger (etc...)
> that is going to provide the same checking of varchar and explain
> (if the code/example... doesn't make it clear) why it should be
> faster or less invasive?

check constraint based:

    create table foo (
        txt text
            check (char_length(txt) < 1001)
    );

trigger based:

    create function trf_check_length_1000() ... returns trigger ... $$...$$;

    create table foo (
        txt text
    );

    create trigger check_txt_length before INSERT or
        UPDATE ... execute trf_check_length_1000();

faster:

- TEXT is (judging by previous comments on this list) marginally faster than VARCHAR(1000) because a) it runs the same code but b) doesn't have to check for the 1000 length
- other options (VARCHAR, constraint, trigger) incur additional overhead and are thus slower

less invasive:

Well, poor wording on my part, perhaps. What I meant is that changing a check constraint or trigger appears to be a less costly operation on a table than changing the datatype of a column (although I seem to remember there being some optimizations in place for the case of changing the *length* of a varchar).

I may be wrong in the above and if so it better be brought to our collective attention for the benefit of readers.

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
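For readers who want to try it, a filled-in version of Karsten's trigger outline might look like this (an illustration with assumed names, not code from the thread):

```sql
-- Illustrative, complete version of the trigger-based length check.
CREATE FUNCTION trf_check_length_1000() RETURNS trigger AS
$$
BEGIN
    IF char_length(NEW.txt) > 1000 THEN
        RAISE EXCEPTION 'txt exceeds 1000 characters';
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TABLE foo (
    txt text
);

CREATE TRIGGER check_txt_length
    BEFORE INSERT OR UPDATE ON foo
    FOR EACH ROW
    EXECUTE PROCEDURE trf_check_length_1000();
```

For a plain length limit the check constraint is the lighter of the two; the trigger form only pays off when the rule is more dynamic than a constraint can express.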
Re: varchar vs. text + constraint/triggers was: Help request to improve function performance
From:
Ivan Sergio Borgonovo
Date:
On Thu, 23 Apr 2009 12:00:30 +0200 Karsten Hilbert <Karsten.Hilbert@gmx.net> wrote: > On Thu, Apr 23, 2009 at 11:33:34AM +0200, Ivan Sergio Borgonovo > wrote: > > > Karsten Hilbert <Karsten.Hilbert@gmx.net> wrote: > > > > > On Thu, Apr 23, 2009 at 12:02:13AM +0100, Seref Arikan wrote: > > > > > > > I have a set of dynamically composed objects represented in > > > > Java, with string values for various attributes, which have > > > > variable length. In case you have suggestions for a better > > > > type for this case, it would be my pleasure to hear about > > > > them. > > > > > > Seref, he's suggesting you use TEXT instead of > > > VARCHAR(something). In PG it's actually usually *less* > > > overhead to use the unbounded text datatype (no length check > > > required). > > > > > > Length checks mandated by business logic can be added by > > > more dynamic means -- check constraints, triggers, etc which > > > allow for less invasive change if needed. > > > > Could you point us to some example of a constraint/trigger > > (etc...) that is going to provide the same checking of varchar > > and explain (if the code/example... doesn't make it clear) why > > it should be faster or less invasive? > > check constraint based: > > create table foo ( > txt text > check (char_length(txt) < 1001) > ); > > trigger based: > > create function trf_check_length_1000() ... returns > trigger ... $$...$$; > > create table foo ( > txt text > ); > > create trigger check_txt_length before INSERT or > UPDATE ... execute trf_check_length_1000(); > > faster: > > - TEXT is (judging by previous comments on this list) > marginally faster than VARCHAR(1000) because a) it runs > the same code but b) doesn't have to check for the 1000 > length > > - other options (VARCHAR, constraint, trigger) incur > additional overhead and are thus slower > > less invasive: > > Well, poor wording on my part, perhaps. 
> What I meant is that changing a check constraint or trigger appears
> to be a less costly operation on a table than changing the datatype
> of a column (although I seem to remember there being some
> optimizations in place for the case of changing the *length* of a
> varchar).

I'll try to rephrase, to check whether I understood and for reference.

varchar is slower than text since it has to do some "data type check". text is faster, but if you add a check it gets slower (slower than varchar? faster?).

A constraint or trigger should have the advantage that, in case of refactoring, you're not going to touch the table definition, which *may* end up being faster.

But if, in the PostgreSQL implementation, varchar is just text with a check, how can a change of type be faster? If it were char(N), maybe there would be some kind of optimization, since the length of the data is known in advance; so shrinking/enlarging a char(N) may have a different cost than shrinking a varchar(N), which in PG *should* have the same implementation as text.

On the other hand, you're preferring text just because the two have the same implementation (modulo the check) in PostgreSQL, but it may not be so in other databases. So *maybe* other databases do some optimization on varchar vs. text.

Somehow I like the idea of considering a varchar a text with a check, but I think I prefer the "more traditional" approach, since it is somehow the "most expected". Nothing can handle strings of infinite length, and long before reaching infinity I'll get in trouble.

People read differently what you'd like to say by writing varchar(N). Most people read:
1) we expect a length around N
Fewer people read:
2) there is something not working if we get something larger than N
But it may also mean:
3) if we get something larger than N, something is going to explode

I think the same "ambiguity" is carried by check(). Anyway, for a sufficiently large N, 2) and 3) can be valid.
Supposing the cost of losing an insert for an unpredictably large value of N is high, I'd be tempted to set N at least to protect me from 3), but I bet people may interpret it as 1). In my experience, anyway, varchar is a good early warning for trouble, and the risk of being misunderstood or caught by an implementation-dependent gotcha when writing varchar(N) where N means 3) largely outweighs the risk of losing an insert you didn't have to lose.

Maybe I've spotted a potential advantage of check over varchar. If you use some kind of naming convention for checks, you could remove/re-apply them more easily than by spotting varchars(). The name of the constraint may contain metadata to help you. The name of the constraint may also suggest to your colleagues why it's there. But this works only if your implementation performs similarly on text vs. varchar().

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
Re: varchar vs. text + constraint/triggers was: Help request to improve function performance
From:
Grzegorz Jaśkiewicz
Date:
Essentially you are trying to store a database in a database, and that's slow, for one. Second, storing things as varchar is space- and index-ineffective, and that's another reason things get slower. Third, you need to complicate your logic to retrieve the data, and that adds up. text is less of a hassle for the db, true, but that's just the tip of the iceberg. Just learn to create a proper database schema and make it meet your criteria; otherwise, whatever else you choose, especially anything 'automatic' or 'intelligent', is never going to be as good as a proper schema.
Re: varchar vs. text + constraint/triggers was: Help request to improve function performance
From:
Karsten Hilbert
Date:
On Thu, Apr 23, 2009 at 01:21:05PM +0200, Ivan Sergio Borgonovo wrote:

> I'll try to rephrase to check if I understood and for reference.
>
> varchar is slower than text since it has to do some "data type
> check".

Yes but no. It is said to be slower because it has to do a data *length* check, not a data *type* check. Or did you mean "check inherent to the data type"?

> text is faster

... than varchar(something)

> but if you add a check... it gets slower

... than itself without the check, yes

> (slower than varchar?, faster?).

subject to testing

> constraint and trigger should have the advantage that in case of
> refactoring you're not going to touch the table definition

as far as the data types are concerned, the "core" table definition, so to speak

> that *may* end in being faster.

Yes, but I wasn't concerned about which method of constraining is faster, only about which datatype (a distinction which, in the case of TEXT vs VARCHAR, is somewhat arbitrary).

> But... if in postgresql implementation varchar is just text with a
> check... how can a change in type be faster?

A change of column datatype rewrites the table (again, I am not 100% sure whether this applies for VARCHAR(x) <-> VARCHAR(y) and VARCHAR(x) <-> TEXT, since they are essentially the same type with or without a check).

> On the other end... you're preferring text just because they have
> the same implementation (modulo check) in Postgresql... but it may
> not be so in other DB.
> So *maybe* other DB do some optimization on varchar vs. text.

True enough, but I wasn't talking about those. Seref asked about implementing archetypes on PostgreSQL.

> Nothing can handle strings of infinite length, and much before
> reaching infinite I'll get in trouble.
> People read differently what you'd like to say writing varchar(N).
> Most people read:
> 1) we expect a length around N
> Fewer people read:
> 2) There is something not working if we get something larger than N

VARCHAR(N)

VAR - variable something
CHAR - characters

-> so, likely, variable *number* of characters, because it had better store variable characters ;-)

(N) - some boundary condition

so, either:
- exactly N (but, then, why *VAR*char ?)
- at least N (huh?, but, well)
- at most N

> But it may also mean:
> 3) if we get something larger than N something is going to explode
> I think the same "ambiguity" is carried by check().
> Anyway for a sufficiently large N 2) and 3) can be valid.

No doubt.

> Supposing the cost of losing an insert for an unpredicted large
> value of N is high I'd be tempted to set N to at least protect me
> from 3) but I bet people may interpret it as 1).

If you want PostgreSQL to help protect you from the risk of an out-of-memory error then, yes, it can help a tiny bit to use VARCHAR(N) where N = "reasonable" (due to the earlier warning) instead of letting PG go to the limits with TEXT. Agreed.

> In my experience anyway varchar is a good early warning for troubles
> and the risk of being misunderstood/getting caught by an
> implementation-dependent gotcha writing varchar(N) where N means 3)
> largely outweighs the risk of losing an insert you didn't have to
> lose.

I see. That's surely a valid point of view.

> Maybe I've spotted a potential advantage of check over varchar.
> If you use some kind of convention to name checks you could
> remove/re-apply them more easily than spotting varchars().
> The name of the constraint may contain metadata to help you.
> The name of the constraint may also suggest why it's there to your
> colleagues.
> But this works just if your implementation performs similarly on
> text vs. varchar().

No, the self-documentation advantage is there regardless of performance. But the choice is a tradeoff either way, that's for sure.
I think we may have gotten to a point where we won't help the OP much :-)

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
On Wed, Apr 22, 2009 at 06:21:41PM -0600, Scott Marlowe wrote:

> > CREATE TABLE "app"."archetype_data" (
> >     "id" BIGINT NOT NULL,
> >     "context_id" VARCHAR(1000),
> >     "archetype_name" VARCHAR(1000),
> >     "archetype_path" VARCHAR(1000),
> >     "name" VARCHAR(1000),
> >     "value_string" VARCHAR(1000),
> >     "value_int" BIGINT,
> >     "value_double" DOUBLE PRECISION,
> >     "session_id" VARCHAR(1000),
> >     "instance_index" INTEGER,
> >     CONSTRAINT "archetype_data_pkey" PRIMARY KEY("id")
> > ) WITHOUT OIDS;
>
> If I'm not mistaken, you're doing Entity Attribute Value model type
> storage. I.e. a database in a database. Makes for easy coding, and
> danged near impossible to troubleshoot your data.

The upper-layer model he's trying to store data for has extremely well modelled constraints, complete with a verification engine and all. So while your concern is valid, it's probably less so here.

> It's a religious issue but I come down on the side that good data
> modelling is hard for a reason, because it pays you back so much in
> the end.

The hard part has already been done for him in a very thoughtful way: they've got a model, a metamodel, and yet another model for data instances of the model ;-) The design group behind the above methodology pretty much suggested not putting too much additional modelling into the data store (not that I agree too much). @Seref: Thomas Beale said so ;-)

I dare say the extension power of PostgreSQL would lend itself extremely well to an actual implementation of the OpenEHR model right in the database (mapping OpenEHR types to complex types including real-time verification, building AQL right into PostgreSQL in the form of stored procedures, etc.), but it would take some serious effort.

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
Hi Filip,
Thanks a lot for your kind help. Selecting only once did the trick. Dropping from 50 seconds to 2 for the select IS an improvement indeed :)
Indexes on columns already existed, and just out of curiosity I've tested char columns instead of varchars, with no significant positive changes. Eliminating the loop made all the difference.
Next thing to check out is the partitioning options.
Kind regards.
2009/4/23 Filip Rembiałkowski <plk.zuber@gmail.com>
On 22 April 2009 at 23:47, Seref Arikan <serefarikan@kurumsalteknoloji.com> wrote:
Hi Filip,
First of all: thanks a lot for your kind response. Here is the create script for my schema:
CREATE TABLE "app"."archetype_data" (
"id" BIGINT NOT NULL,
"context_id" VARCHAR(1000),
"archetype_name" VARCHAR(1000),
"archetype_path" VARCHAR(1000),
"name" VARCHAR(1000),
"value_string" VARCHAR(1000),
"value_int" BIGINT,
"value_double" DOUBLE PRECISION,
"session_id" VARCHAR(1000),
"instance_index" INTEGER,
CONSTRAINT "archetype_data_pkey" PRIMARY KEY("id")
) WITHOUT OIDS;
Now, regarding your feedback, here are the points, hoping that you can give me feedback, and bring me up to speed in the topic, for I've been in the upper layers of the software world for so long :)
Yes that's easy to observe; but Java and Postgres can go together, I assure you :)
Please forgive me for further questions:
come on, nothing to forgive, if I did not like answering questions I would not read this at all.
Are you telling me that due to name resolution process, my use of variable and column names for context_id and session_id are causing problems?
that's what I'm saying, isn't it?
I'll change variable names into names which would be obviously different from column names.
I used the temp table to speed up the following selects, since the actual table has more than 9 million rows. after creating the temp table, I am selecting from 50K rows. Am I wrong about the performance gain here? What would you suggest instead?
Select once. ONCE. Not into a temp table, just into an implicit cursor [like here: http://www.postgresql.org/docs/current/static/plpgsql-control-structures.html#PLPGSQL-RECORDS-ITERATING ]. You can use ORDER BY session_id, so in ONE pass you will be able to do all the needed computations (you will have a block of records belonging to one session_id, then another block for the next session_id, and so on).
Regarding performance: if you create index on (context_id, session_id) this query will be fast.
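Putting Filip's two suggestions together, the OP's function could be rewritten roughly as below. This is an untested sketch: it keeps the OP's NodesContainer type and table names, assumes plpgsql's default 1-based array indexing, and the index name is illustrative:

```sql
-- Hypothetical index suggested above:
CREATE INDEX archetype_data_ctx_session_idx
    ON app.archetype_data (context_id, session_id);

CREATE OR REPLACE FUNCTION getNodeContainers(p_context_id varchar)
RETURNS setof NodesContainer AS $$
DECLARE
    r app.archetype_data%ROWTYPE;
    container NodesContainer%ROWTYPE;
    current_session varchar := NULL;
    i integer := 0;
BEGIN
    -- One ordered scan: rows for the same session_id arrive together,
    -- so no temp table and no nested loop are needed.
    FOR r IN
        SELECT * FROM app.archetype_data
        WHERE context_id = p_context_id
        ORDER BY session_id
    LOOP
        IF current_session IS DISTINCT FROM r.session_id THEN
            -- New session block: emit the previous container, start fresh.
            IF current_session IS NOT NULL THEN
                RETURN NEXT container;
            END IF;
            container.name := '{}';
            container.context_id := '{}';
            current_session := r.session_id;
            i := 0;
        END IF;
        i := i + 1;
        container.name[i] := r.name;
        container.context_id[i] := r.context_id;
    END LOOP;
    -- Emit the last container, if any rows were seen at all.
    IF current_session IS NOT NULL THEN
        RETURN NEXT container;
    END IF;
    RETURN;
END;
$$ LANGUAGE plpgsql;
```

Renaming the parameter to p_context_id also sidesteps the variable-vs-column name clash Filip pointed out earlier.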
Temp tables forced me to use execute, after hitting a known problem, also expressed in the faq, is this what you're talking about?
yes, that's it, but as I said before, you really can get rid of the temporary tables here.
I will be investing serious time into postgresql from now on, and I hope you can give me couple of useful hints, to ease my way forward :) Looking at the schema, can you think of a better way to send this result set to a java based app?
Better way to query or better schema?
If you mean a better way to query: I would just select all rows and combine them in the upper layer. ARRAYs are not so cute for me.
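For completeness, the array route can also be taken in plain SQL: the then-upcoming PostgreSQL 8.4 ships a built-in array_agg aggregate (on earlier versions a custom aggregate can be defined) that collapses each session into one row without any plpgsql loop. A sketch against the OP's table, with a placeholder context value:

```sql
-- One row per session, columns gathered into arrays.
SELECT session_id,
       array_agg(name)       AS names,
       array_agg(context_id) AS context_ids
FROM app.archetype_data
WHERE context_id = 'some-context-id'
GROUP BY session_id;
```

This gives essentially what the NodesContainer type was built for, at the cost of the array handling Filip dislikes on the client side.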
If you mean a better schema... That's the hardest question, as the other guys suggested. You will have to decide what to use. I know that some ORMs for Java (namely Hibernate) produce quite decent database schemas, NOT the E-A-V model. Maybe that will suit you.
good luck!