Обсуждение: Partitioning performance: cache stringToNode() of pg_constraint.ccbin
A colleague, Korry Douglas, observed a table partitioning scenario where
deserializing pg_constraint.ccbin is a hot spot.  The following test case, a
simplification of a typical partitioning setup, spends 28% of its time in
stringToNode() and callees thereof:
\timing on
\set n 600000
BEGIN;
CREATE TABLE bench_check_constr_parent (c int);
CREATE TABLE bench_check_constr_child (
    CHECK (c > 0 AND c <= 100000000)
) INHERITS (bench_check_constr_parent);
CREATE FUNCTION trig() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
    INSERT INTO bench_check_constr_child VALUES (NEW.*);
    RETURN NULL;
END
$$;
CREATE TRIGGER redir BEFORE INSERT ON bench_check_constr_parent
    FOR EACH ROW EXECUTE PROCEDURE trig();
-- Main benchmark
INSERT INTO bench_check_constr_parent SELECT * FROM generate_series(1, :n);
TRUNCATE bench_check_constr_parent;
INSERT INTO bench_check_constr_parent SELECT * FROM generate_series(1, :n);
TRUNCATE bench_check_constr_parent;
INSERT INTO bench_check_constr_parent SELECT * FROM generate_series(1, :n);
TRUNCATE bench_check_constr_parent;
-- Compare direct insert performance @ 10x volume
INSERT INTO bench_check_constr_child SELECT * FROM generate_series(1, :n * 10);
TRUNCATE bench_check_constr_parent;
INSERT INTO bench_check_constr_child SELECT * FROM generate_series(1, :n * 10);
TRUNCATE bench_check_constr_parent;
INSERT INTO bench_check_constr_child SELECT * FROM generate_series(1, :n * 10);
TRUNCATE bench_check_constr_parent;
ROLLBACK;
The executor caches each CHECK constraint in ResultRelInfo as a planned
expression.  That cache is highly effectively for long-running statements, but
the trivial INSERTs effectively work without a cache.  Korry devised this
patch to cache the stringToNode() form of the constraint in the relcache.  It
improves the benchmark's partitioned scenario by 33%:
-- Timings (seconds) --
master, INSERT parent:       14.2, 14.4, 14.4
patched, INSERT parent:      9.6,  9.7,  9.7
master, INSERT*10 child:     9.9,  9.9,  10.2
patched, INSERT*10 child:    10.0, 10.2, 10.2
There's still not much to like about that tenfold overhead from use of the
partition routing trigger, but this patch makes a nice cut into that overhead
without doing anything aggressive.  The profile no longer shows low-hanging
fruit; running an entire SQL statement per row piles on the runtime from a
wide range of sources.  For anyone curious, I've attached output from "perf
report -s parent -g graph,1,caller" with the patch applied; I suggest browsing
under "less -S".
Some call sites need to modify the node tree, so the patch has them do
copyObject().  I ran a microbenchmark of copyObject() on the cached node tree
vs. redoing stringToNode(), and copyObject() still won by a factor of four.
Thanks,
nm
--
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com
			
		Вложения
On Tuesday, June 04, 2013 12:37 AM Noah Misch wrote: > A colleague, Korry Douglas, observed a table partitioning scenario > where deserializing pg_constraint.ccbin is a hot spot. The following > test case, a simplification of a typical partitioning setup, spends 28% > of its time in > stringToNode() and callees thereof: > > > > > The executor caches each CHECK constraint in ResultRelInfo as a planned > expression. That cache is highly effectively for long-running > statements, but the trivial INSERTs effectively work without a cache. > Korry devised this patch to cache the stringToNode() form of the > constraint in the relcache. It improves the benchmark's partitioned > scenario by 33%: > > -- Timings (seconds) -- > master, INSERT parent: 14.2, 14.4, 14.4 > patched, INSERT parent: 9.6, 9.7, 9.7 > > master, INSERT*10 child: 9.9, 9.9, 10.2 > patched, INSERT*10 child: 10.0, 10.2, 10.2 > > There's still not much to like about that tenfold overhead from use of > the partition routing trigger, but this patch makes a nice cut into > that overhead without doing anything aggressive. This patch can give good performance gain in the scenario described by you. Infact I had taken the readings with patch, it shows similar gain. -- Timings (seconds) -- master, INSERT parent: 14.9, 15.4, 15.4 patched, INSERT parent: 9.9, 9.6, 9.5 master, INSERT*10 child: 13.8, 14.5, 15.6 patched, INSERT*10 child: 13.0, 14.3, 14.6 This patch would increase the relcache size, as for each constraint on table it would increase 4 bytes irrespective of whether that can give performance benefit or not. Why in function CheckConstraintFetch(), the node is not formed from string? > > Some call sites need to modify the node tree, so the patch has them do > copyObject(). I ran a microbenchmark of copyObject() on the cached > node tree vs. redoing stringToNode(), and copyObject() still won by a > factor of four. I have not tried any performance run to measure if extra copyObject() has added any benefit. What kind of benchmark you use to validate it? With Regards, Amit Kapila.
On Thu, Jun 06, 2013 at 07:02:27PM +0530, Amit Kapila wrote: > On Tuesday, June 04, 2013 12:37 AM Noah Misch wrote: > This patch can give good performance gain in the scenario described by you. > Infact I had taken the readings with patch, it shows similar gain. Thanks for testing. > This patch would increase the relcache size, as for each constraint on table > it would increase 4 bytes irrespective of whether that can give performance > benefit or not. Yep, sizeof(Node *) bytes. > Why in function CheckConstraintFetch(), the node is not formed from string? That's to avoid the cost of stringToNode() if the field is not needed during the life of the cache entry. > > Some call sites need to modify the node tree, so the patch has them do > > copyObject(). I ran a microbenchmark of copyObject() on the cached > > node tree vs. redoing stringToNode(), and copyObject() still won by a > > factor of four. > > I have not tried any performance run to measure if extra copyObject() has > added any benefit. Callers must not modify the cache entry; those that would otherwise do so must use copyObject() first. I benchmarked to ensure they wouldn't be better off ignoring the cached node tree and calling stringToNode() themselves. > What kind of benchmark you use to validate it? It boiled down to comparing runtime of loops like these: while (n-- > 0) copyObject(RelationGetConstraint(r, 0)); while (n-- > 0) stringToNode(r->rd_att->constr->check[0].ccbin); -- Noah Misch EnterpriseDB http://www.enterprisedb.com
On Friday, June 07, 2013 2:10 AM Noah Misch wrote: > On Thu, Jun 06, 2013 at 07:02:27PM +0530, Amit Kapila wrote: > > On Tuesday, June 04, 2013 12:37 AM Noah Misch wrote: > > > This patch can give good performance gain in the scenario described > by you. > > Infact I had taken the readings with patch, it shows similar gain. > > Thanks for testing. > > > This patch would increase the relcache size, as for each constraint > on table > > it would increase 4 bytes irrespective of whether that can give > performance > > benefit or not. > > Yep, sizeof(Node *) bytes. So the memory increase number's would like: Example for 64-bit m/c In database, there are 100 tables, each having 2 constraints and 30 live connections Size increase = no. of tables * size of (Node*) * number of constraints * no. of live sessions = 100 * 8 * 2 * 30 = 46.8K So if such a memory increase seems reasonable, then I think it would be really beneficial for the load of data in inherited tables. If no one has objections on this part, then I think this is really useful patch for PostgreSQL. With Regards, Amit Kapila.
On Fri, Jun 07, 2013 at 11:58:14AM +0530, Amit Kapila wrote: > So the memory increase number's would like: > > Example for 64-bit m/c > In database, there are 100 tables, each having 2 constraints and 30 live > connections > > Size increase = no. of tables * size of (Node*) * number of constraints * > no. of live sessions > = 100 * 8 * 2 * 30 > = 46.8K That's the increase when nothing consults the constraints. Upon first use of a particular constraint, one cache can easily grow by a few hundred bytes. > So if such a memory increase seems reasonable, then I think it would be > really beneficial for the load of data in inherited tables. I agree that the memory usage is the main downside. We already do similar caching for index expressions and for rules, incidentally. On that note, I see that the patch does not make RelationDestroyRelation() correctly free the newly-cached data. Will fix. -- Noah Misch EnterpriseDB http://www.enterprisedb.com
On Mon, Jun 3, 2013 at 03:07:27PM -0400, Noah Misch wrote: > A colleague, Korry Douglas, observed a table partitioning scenario where > deserializing pg_constraint.ccbin is a hot spot. The following test case, a > simplification of a typical partitioning setup, spends 28% of its time in > stringToNode() and callees thereof: Noah, what is the status on this? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On Sat, Jan 11, 2014 at 02:10:01PM -0500, Bruce Momjian wrote: > On Mon, Jun 3, 2013 at 03:07:27PM -0400, Noah Misch wrote: > > A colleague, Korry Douglas, observed a table partitioning scenario where > > deserializing pg_constraint.ccbin is a hot spot. The following test case, a > > simplification of a typical partitioning setup, spends 28% of its time in > > stringToNode() and callees thereof: > > Noah, what is the status on this? Further study revealed a defect in the patch's memory management, and I have not gotten around to correcting that. -- Noah Misch EnterpriseDB http://www.enterprisedb.com
On Sun, Jan 12, 2014 at 12:53:40PM -0500, Noah Misch wrote: > On Sat, Jan 11, 2014 at 02:10:01PM -0500, Bruce Momjian wrote: > > On Mon, Jun 3, 2013 at 03:07:27PM -0400, Noah Misch wrote: > > > A colleague, Korry Douglas, observed a table partitioning scenario where > > > deserializing pg_constraint.ccbin is a hot spot. The following test case, a > > > simplification of a typical partitioning setup, spends 28% of its time in > > > stringToNode() and callees thereof: > > > > Noah, what is the status on this? > > Further study revealed a defect in the patch's memory management, and I have > not gotten around to correcting that. I talked to Noah and he can't continue on this item. Can someone else work on it? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On Wed, Aug 6, 2014 at 9:35 PM, Bruce Momjian <bruce@momjian.us> wrote: > On Sun, Jan 12, 2014 at 12:53:40PM -0500, Noah Misch wrote: >> On Sat, Jan 11, 2014 at 02:10:01PM -0500, Bruce Momjian wrote: >> > On Mon, Jun 3, 2013 at 03:07:27PM -0400, Noah Misch wrote: >> > > A colleague, Korry Douglas, observed a table partitioning scenario where >> > > deserializing pg_constraint.ccbin is a hot spot. The following test case, a >> > > simplification of a typical partitioning setup, spends 28% of its time in >> > > stringToNode() and callees thereof: >> > >> > Noah, what is the status on this? >> >> Further study revealed a defect in the patch's memory management, and I have >> not gotten around to correcting that. > > I talked to Noah and he can't continue on this item. Can someone else > work on it? Well, it would be helpful if he could describe the defect he found, so that the next person doesn't have to guess. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Aug 07, 2014 at 09:39:57AM -0400, Robert Haas wrote: > On Wed, Aug 6, 2014 at 9:35 PM, Bruce Momjian <bruce@momjian.us> wrote: > > On Sun, Jan 12, 2014 at 12:53:40PM -0500, Noah Misch wrote: > >> Further study revealed a defect in the patch's memory management, and I have > >> not gotten around to correcting that. > > > > I talked to Noah and he can't continue on this item. Can someone else > > work on it? > > Well, it would be helpful if he could describe the defect he found, so > that the next person doesn't have to guess. Quite right. Primarily, a leak: FreeTupleDesc() did a mere pfree() on a node tree, leaking all but the root node. Perhaps the best fix there is to give each TupleDesc a memory context and then simplify FreeTupleDesc() to a context deletion. That will tend to waste more memory, but I haven't thought of something clearly better. The patch passes a relcache-owned node tree to eval_const_expressions() via ExecPrepareExpr(). If that stands, I should add a comment to the effect that eval_const_expressions() must not modify in place the original tree or reference it from the result tree. The comment at expression_planner() implies that's already true. I know of no exceptions, but I have not audited. How reasonable is reliance on continued conformance to that rule? An alternative is to have the relcache always return a node tree copy, like it does in RelationGetIndexExpressions(). That sacrifices about 1/4 of the performance gain. -- Noah Misch EnterpriseDB http://www.enterprisedb.com