Обсуждение: design help for performance

Поиск
Список
Период
Сортировка

design help for performance

От
Culley Harrelson
Дата:
I am bumping into some performance issues and am seeking help.

I have two tables A and B in a one (A) to many (B) relationship.  There are 1.4 million records in table A and 44 million records in table B.  In my web application any request for a record from table A is also going to need a count of associated records in table B.  Several years ago I added table_b_rowcount to table A in order to minimize queries on table B.  And now, as the application has grown, I am starting to having locking problems on table A.  Any change to table B requires the that table_b_rowcount be updated on table A...  The application has outgrown this solution.

So... is there a common solution to this problem?

culley

Re: design help for performance

От
Misa Simic
Дата:
Hi Culley,

Have you tried to create fk together with index on fk column on table B?

What are results? Would be good if you could send the query and explain
analyze...

Sent from my Windows Phone
------------------------------
From: Culley Harrelson
Sent: 21 December 2011 00:57
To: pgsql-general@postgresql.org
Subject: [GENERAL] design help for performance

I am bumping into some performance issues and am seeking help.

I have two tables A and B in a one (A) to many (B) relationship.  There are
1.4 million records in table A and 44 million records in table B.  In my
web application any request for a record from table A is also going to need
a count of associated records in table B.  Several years ago I added
table_b_rowcount to table A in order to minimize queries on table B.  And
now, as the application has grown, I am starting to having locking problems
on table A.  Any change to table B requires the that table_b_rowcount be
updated on table A...  The application has outgrown this solution.

So... is there a common solution to this problem?

culley

Re: design help for performance

От
"David Johnston"
Дата:

Continued top-posting to remain consistent….

 

It isn’t that the application has outgrown the solution but rather the solution was never correct in the first place.  You attempted pre-mature optimization and are getting burned because of it.  The reference solution is simply:

 

SELECT a.*, COUNT(*) AS b_count

FROM a

JOIN b USING (a_id)

GROUP BY a.* {expanded * as needed)

 

Make sure table b has an index on the a.id column.

 

This is reference because you never want to introduce computed fields that keep track of other tables WITHOUT some kind of proof that the maintenance nightmare/overhead you are incurring is more than offset by the savings during usage.

 

Any further optimization requires two things:

Knowledge of the usage patterns of the affected data

Testing to prove that the alternative solutions out-perform the reference solution

 

Since you already have an existing query you should implement the reference solution above and then test and see whether it performs better or worse than you current solution.  If it indeed performs better than move to it; and if it is still not good enough then you need to provide more information about what kinds of queries are hitting A and B as well as Insert/Delete patterns on Table B.

 

David J.

 

From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Misa Simic
Sent: Tuesday, December 20, 2011 7:13 PM
To: Culley Harrelson; pgsql-general@postgresql.org
Subject: Re: [GENERAL] design help for performance

 

Hi Culley,

Have you tried to create fk together with index on fk column on table B?

What are results? Would be good if you could send the query and explain analyze...

Sent from my Windows Phone


From: Culley Harrelson
Sent: 21 December 2011 00:57
To: pgsql-general@postgresql.org
Subject: [GENERAL] design help for performance

I am bumping into some performance issues and am seeking help.

I have two tables A and B in a one (A) to many (B) relationship.  There are 1.4 million records in table A and 44 million records in table B.  In my web application any request for a record from table A is also going to need a count of associated records in table B.  Several years ago I added table_b_rowcount to table A in order to minimize queries on table B.  And now, as the application has grown, I am starting to having locking problems on table A.  Any change to table B requires the that table_b_rowcount be updated on table A...  The application has outgrown this solution.

So... is there a common solution to this problem?

culley

Re: design help for performance

От
Culley Harrelson
Дата:
Thanks David.  That was my original solution and it began to bog down the website so I resorted to demoralization 3 years ago....  This is an extremely high volume website.


On Tue, Dec 20, 2011 at 4:27 PM, David Johnston <polobo@yahoo.com> wrote:

Continued top-posting to remain consistent….

 

It isn’t that the application has outgrown the solution but rather the solution was never correct in the first place.  You attempted pre-mature optimization and are getting burned because of it.  The reference solution is simply:

 

SELECT a.*, COUNT(*) AS b_count

FROM a

JOIN b USING (a_id)

GROUP BY a.* {expanded * as needed)

 

Make sure table b has an index on the a.id column.

 

This is reference because you never want to introduce computed fields that keep track of other tables WITHOUT some kind of proof that the maintenance nightmare/overhead you are incurring is more than offset by the savings during usage.

 

Any further optimization requires two things:

Knowledge of the usage patterns of the affected data

Testing to prove that the alternative solutions out-perform the reference solution

 

Since you already have an existing query you should implement the reference solution above and then test and see whether it performs better or worse than you current solution.  If it indeed performs better than move to it; and if it is still not good enough then you need to provide more information about what kinds of queries are hitting A and B as well as Insert/Delete patterns on Table B.

 

David J.

 

From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Misa Simic
Sent: Tuesday, December 20, 2011 7:13 PM
To: Culley Harrelson; pgsql-general@postgresql.org
Subject: Re: [GENERAL] design help for performance

 

Hi Culley,

Have you tried to create fk together with index on fk column on table B?

What are results? Would be good if you could send the query and explain analyze...

Sent from my Windows Phone


From: Culley Harrelson
Sent: 21 December 2011 00:57
To: pgsql-general@postgresql.org
Subject: [GENERAL] design help for performance

I am bumping into some performance issues and am seeking help.

I have two tables A and B in a one (A) to many (B) relationship.  There are 1.4 million records in table A and 44 million records in table B.  In my web application any request for a record from table A is also going to need a count of associated records in table B.  Several years ago I added table_b_rowcount to table A in order to minimize queries on table B.  And now, as the application has grown, I am starting to having locking problems on table A.  Any change to table B requires the that table_b_rowcount be updated on table A...  The application has outgrown this solution.

So... is there a common solution to this problem?

culley


Re: design help for performance

От
Alban Hertroys
Дата:
On 21 Dec 2011, at 24:56, Culley Harrelson wrote:

> Several years ago I added table_b_rowcount to table A in order to minimize queries on table B.  And now, as the
applicationhas grown, I am starting to having locking problems on table A.  Any change to table B requires the that
table_b_rowcountbe updated on table A...  The application has outgrown this solution. 


When you update rowcount_b in table A, that locks the row in A of course, but there's more going on. Because a new
versionof that row gets created, the references from B to A also need updating to that new version (creating new
versionsof rows in B as well). I think that causes a little bit more locking than originally anticipated - it may even
bethe cause of your locking problem. 

Instead, if you'd create a new table C that only holds the rowcount_b and a reference to A (in a 1:1 relationship),
mostof those problems go away. It does add an extra foreign key reference to table A though, which means it will weigh
downupdates and deletes there some more. 

CREATE TABLE C (
  table_a_id int PRIMARY KEY
         REFERENCES table_a (id) ON UPDATE CASCADE ON DELETE CASCADE,
  table_b_rowcount int NOT NULL DEFAULT 0
);

Yes, those cascades are on purpose - the data in C is useless without the accompanying record in A. Also, the PK makes
sureit stays a 1:1 relationship. 

Alban Hertroys

--
Screwing up is an excellent way to attach something to the ceiling.


Re: design help for performance

От
Misa Simic
Дата:
Hi Culley,

Have you tried to create fk together with index on fk column on table B?

What are results? Would be good if you could send the query and explain analyze...

Sent from my Windows Phone

From: Culley Harrelson
Sent: 21 December 2011 00:57
To: pgsql-general@postgresql.org
Subject: [GENERAL] design help for performance

I am bumping into some performance issues and am seeking help.

I have two tables A and B in a one (A) to many (B) relationship.  There are 1.4 million records in table A and 44 million records in table B.  In my web application any request for a record from table A is also going to need a count of associated records in table B.  Several years ago I added table_b_rowcount to table A in order to minimize queries on table B.  And now, as the application has grown, I am starting to having locking problems on table A.  Any change to table B requires the that table_b_rowcount be updated on table A...  The application has outgrown this solution.

So... is there a common solution to this problem?

culley

Re: design help for performance

От
"Marc Mamin"
Дата:
> -----Original Message-----
> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
> owner@postgresql.org] On Behalf Of Alban Hertroys
> Sent: Mittwoch, 21. Dezember 2011 08:53
> To: Culley Harrelson
> Cc: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] design help for performance
>
> On 21 Dec 2011, at 24:56, Culley Harrelson wrote:
>
> > Several years ago I added table_b_rowcount to table A in order to
> minimize queries on table B.  And now, as the application has grown, I
> am starting to having locking problems on table A.  Any change to
table
> B requires the that table_b_rowcount be updated on table A...  The
> application has outgrown this solution.
>
>
> When you update rowcount_b in table A, that locks the row in A of
> course, but there's more going on. Because a new version of that row
> gets created, the references from B to A also need updating to that
new
> version (creating new versions of rows in B as well). I think that
> causes a little bit more locking than originally anticipated - it may
> even be the cause of your locking problem.
>
> Instead, if you'd create a new table C that only holds the rowcount_b
> and a reference to A (in a 1:1 relationship), most of those problems
go
> away. It does add an extra foreign key reference to table A though,
> which means it will weigh down updates and deletes there some more.
>
> CREATE TABLE C (
>   table_a_id int PRIMARY KEY
>          REFERENCES table_a (id) ON UPDATE CASCADE ON DELETE
> CASCADE,
>   table_b_rowcount int NOT NULL DEFAULT 0
> );
>
> Yes, those cascades are on purpose - the data in C is useless without
> the accompanying record in A. Also, the PK makes sure it stays a 1:1
> relationship.
>
> Alban Hertroys

Hello,

it may help to combine Alban solution with yours but at the cost of a
higher complexity:

In table C use instead a column table_b_delta_rowcount (+1 /-1 ,
smallint) and only use INSERTs to maintain it, no UPDATEs (hence with a
non unique index on id).

Then regularily flush table C content to table A, in order to only have
recent changes in C.
Your query should  then query both tables:

SELECT A. table_b_rowcount + coalesce(sum(C.table_b_delta_rowcount))
FROM A LEFT OUTER JOIN B on (A.id=B.id)
WHERE A.id = xxx

Marc Mamin

Re: design help for performance

От
Culley Harrelson
Дата:
Thank you so much everyone!  Introducing table C was indeed my next step but I was unsure if I was going to be just moving the locking problems from A to C.  Locking on C is preferable to locking on A but it doesn't really solve the problem.  It sounds like I should expect less locking on C because it doesn't relate to B.  Thanks again, I am going to give it a try. 

I am not going to take it to the delta solution for now.



On Wed, Dec 21, 2011 at 1:46 AM, Marc Mamin <M.Mamin@intershop.de> wrote:

> -----Original Message-----
> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
> owner@postgresql.org] On Behalf Of Alban Hertroys
> Sent: Mittwoch, 21. Dezember 2011 08:53
> To: Culley Harrelson
> Cc: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] design help for performance
>
> On 21 Dec 2011, at 24:56, Culley Harrelson wrote:
>
> > Several years ago I added table_b_rowcount to table A in order to
> minimize queries on table B.  And now, as the application has grown, I
> am starting to having locking problems on table A.  Any change to
table
> B requires the that table_b_rowcount be updated on table A...  The
> application has outgrown this solution.
>
>
> When you update rowcount_b in table A, that locks the row in A of
> course, but there's more going on. Because a new version of that row
> gets created, the references from B to A also need updating to that
new
> version (creating new versions of rows in B as well). I think that
> causes a little bit more locking than originally anticipated - it may
> even be the cause of your locking problem.
>
> Instead, if you'd create a new table C that only holds the rowcount_b
> and a reference to A (in a 1:1 relationship), most of those problems
go
> away. It does add an extra foreign key reference to table A though,
> which means it will weigh down updates and deletes there some more.
>
> CREATE TABLE C (
>   table_a_id int PRIMARY KEY
>                REFERENCES table_a (id) ON UPDATE CASCADE ON DELETE
> CASCADE,
>   table_b_rowcount int NOT NULL DEFAULT 0
> );
>
> Yes, those cascades are on purpose - the data in C is useless without
> the accompanying record in A. Also, the PK makes sure it stays a 1:1
> relationship.
>
> Alban Hertroys

Hello,

it may help to combine Alban solution with yours but at the cost of a
higher complexity:

In table C use instead a column table_b_delta_rowcount (+1 /-1 ,
smallint) and only use INSERTs to maintain it, no UPDATEs (hence with a
non unique index on id).

Then regularily flush table C content to table A, in order to only have
recent changes in C.
Your query should  then query both tables:

SELECT A. table_b_rowcount + coalesce(sum(C.table_b_delta_rowcount))
FROM A LEFT OUTER JOIN B on (A.id=B.id)
WHERE A.id = xxx

Marc Mamin

Re: design help for performance

От
Misa Simic
Дата:
Hm...

I think result on the end will be the same... I am not sure realation
produce any locks on parent table...

What produces locks is UPDATE, so is it on table A or C should make no
difference...

If simple join and count fk is so slow - other option would be materialized
view... So it would need to include table C as materialized view but on the
way to to don't make expensive calculations in real time during insert in B
(and locking)

There is a article about materialized views on postgres wiki...

Sent from my Windows Phone
------------------------------
From: Culley Harrelson
Sent: 21 December 2011 22:07
To: Marc Mamin
Cc: Alban Hertroys; pgsql-general@postgresql.org
Subject: Re: [GENERAL] design help for performance

Thank you so much everyone!  Introducing table C was indeed my next step
but I was unsure if I was going to be just moving the locking problems from
A to C.  Locking on C is preferable to locking on A but it doesn't really
solve the problem.  It sounds like I should expect less locking on C
because it doesn't relate to B.  Thanks again, I am going to give it a
try.

I am not going to take it to the delta solution for now.



On Wed, Dec 21, 2011 at 1:46 AM, Marc Mamin <M.Mamin@intershop.de> wrote:

>
> > -----Original Message-----
> > From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
> > owner@postgresql.org] On Behalf Of Alban Hertroys
> > Sent: Mittwoch, 21. Dezember 2011 08:53
> > To: Culley Harrelson
> > Cc: pgsql-general@postgresql.org
> > Subject: Re: [GENERAL] design help for performance
> >
> > On 21 Dec 2011, at 24:56, Culley Harrelson wrote:
> >
> > > Several years ago I added table_b_rowcount to table A in order to
> > minimize queries on table B.  And now, as the application has grown, I
> > am starting to having locking problems on table A.  Any change to
> table
> > B requires the that table_b_rowcount be updated on table A...  The
> > application has outgrown this solution.
> >
> >
> > When you update rowcount_b in table A, that locks the row in A of
> > course, but there's more going on. Because a new version of that row
> > gets created, the references from B to A also need updating to that
> new
> > version (creating new versions of rows in B as well). I think that
> > causes a little bit more locking than originally anticipated - it may
> > even be the cause of your locking problem.
> >
> > Instead, if you'd create a new table C that only holds the rowcount_b
> > and a reference to A (in a 1:1 relationship), most of those problems
> go
> > away. It does add an extra foreign key reference to table A though,
> > which means it will weigh down updates and deletes there some more.
> >
> > CREATE TABLE C (
> >   table_a_id int PRIMARY KEY
> >                REFERENCES table_a (id) ON UPDATE CASCADE ON DELETE
> > CASCADE,
> >   table_b_rowcount int NOT NULL DEFAULT 0
> > );
> >
> > Yes, those cascades are on purpose - the data in C is useless without
> > the accompanying record in A. Also, the PK makes sure it stays a 1:1
> > relationship.
> >
> > Alban Hertroys
>
> Hello,
>
> it may help to combine Alban solution with yours but at the cost of a
> higher complexity:
>
> In table C use instead a column table_b_delta_rowcount (+1 /-1 ,
> smallint) and only use INSERTs to maintain it, no UPDATEs (hence with a
> non unique index on id).
>
> Then regularily flush table C content to table A, in order to only have
> recent changes in C.
> Your query should  then query both tables:
>
> SELECT A. table_b_rowcount + coalesce(sum(C.table_b_delta_rowcount))
> FROM A LEFT OUTER JOIN B on (A.id=B.id)
> WHERE A.id = xxx
>
> Marc Mamin
>

Re: design help for performance

От
Misa Simic
Дата:
Hm...

I think result on the end will be the same... I am not sure realation produce any locks on parent table...

What produces locks is UPDATE, so is it on table A or C should make no difference...

If simple join and count fk is so slow - other option would be materialized view... So it would need to include table C as materialized view but on the way to to don't make expensive calculations in real time during insert in B (and locking)

There is a article about materialized views on postgres wiki...

Sent from my Windows Phone

From: Culley Harrelson
Sent: 21 December 2011 22:07
To: Marc Mamin
Cc: Alban Hertroys; pgsql-general@postgresql.org
Subject: Re: [GENERAL] design help for performance

Thank you so much everyone!  Introducing table C was indeed my next step but I was unsure if I was going to be just moving the locking problems from A to C.  Locking on C is preferable to locking on A but it doesn't really solve the problem.  It sounds like I should expect less locking on C because it doesn't relate to B.  Thanks again, I am going to give it a try. 

I am not going to take it to the delta solution for now.



On Wed, Dec 21, 2011 at 1:46 AM, Marc Mamin <M.Mamin@intershop.de> wrote:

> -----Original Message-----
> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
> owner@postgresql.org] On Behalf Of Alban Hertroys
> Sent: Mittwoch, 21. Dezember 2011 08:53
> To: Culley Harrelson
> Cc: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] design help for performance
>
> On 21 Dec 2011, at 24:56, Culley Harrelson wrote:
>
> > Several years ago I added table_b_rowcount to table A in order to
> minimize queries on table B.  And now, as the application has grown, I
> am starting to having locking problems on table A.  Any change to
table
> B requires the that table_b_rowcount be updated on table A...  The
> application has outgrown this solution.
>
>
> When you update rowcount_b in table A, that locks the row in A of
> course, but there's more going on. Because a new version of that row
> gets created, the references from B to A also need updating to that
new
> version (creating new versions of rows in B as well). I think that
> causes a little bit more locking than originally anticipated - it may
> even be the cause of your locking problem.
>
> Instead, if you'd create a new table C that only holds the rowcount_b
> and a reference to A (in a 1:1 relationship), most of those problems
go
> away. It does add an extra foreign key reference to table A though,
> which means it will weigh down updates and deletes there some more.
>
> CREATE TABLE C (
>   table_a_id int PRIMARY KEY
>                REFERENCES table_a (id) ON UPDATE CASCADE ON DELETE
> CASCADE,
>   table_b_rowcount int NOT NULL DEFAULT 0
> );
>
> Yes, those cascades are on purpose - the data in C is useless without
> the accompanying record in A. Also, the PK makes sure it stays a 1:1
> relationship.
>
> Alban Hertroys

Hello,

it may help to combine Alban solution with yours but at the cost of a
higher complexity:

In table C use instead a column table_b_delta_rowcount (+1 /-1 ,
smallint) and only use INSERTs to maintain it, no UPDATEs (hence with a
non unique index on id).

Then regularily flush table C content to table A, in order to only have
recent changes in C.
Your query should  then query both tables:

SELECT A. table_b_rowcount + coalesce(sum(C.table_b_delta_rowcount))
FROM A LEFT OUTER JOIN B on (A.id=B.id)
WHERE A.id = xxx

Marc Mamin