Re: hyrax vs. RelationBuildPartitionDesc

From: Tom Lane
Subject: Re: hyrax vs. RelationBuildPartitionDesc
Date:
Msg-id: 27380.1555270166@sss.pgh.pa.us
In response to: Re: hyrax vs. RelationBuildPartitionDesc  (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: hyrax vs. RelationBuildPartitionDesc  (Robert Haas <robertmhaas@gmail.com>)
           Re: hyrax vs. RelationBuildPartitionDesc  (Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>)
List: pgsql-hackers
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Mar 15, 2019 at 3:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I agree that copying data isn't great.  What I don't agree is that this
>> solution is better.

> I am not very convinced that it makes sense to lump something like
> RelationGetIndexAttrBitmap in with something like
> RelationGetPartitionDesc.  RelationGetIndexAttrBitmap is returning a
> single Bitmapset, whereas the data structure RelationGetPartitionDesc
> is vastly larger and more complex.  You can't say "well, if it's best
> to copy 32 bytes of data out of the relcache every time we need it, it
> must also be right to copy 10k or 100k of data out of the relcache
> every time we need it."

I did not say that.  What I did say is that they're both correct
solutions.  Copying large data structures is clearly a potential
performance problem, but that doesn't mean we can take correctness
shortcuts.

> If we want an at-least-somewhat unified solution here, I think we need
> to bite the bullet and make the planner hold a reference to the
> relcache throughout planning.  (The executor already keeps it open, I
> believe.) Otherwise, how is the relcache supposed to know when it can
> throw stuff away and when it can't?

The real problem with that is that we *still* won't know whether it's
okay to throw stuff away or not.  The issue with these subsidiary
data structures is exactly that we're trying to reduce the lock strength
required for changing one of them, and as soon as you make that lock
strength less than AEL, you have the problem that that value may need
to change in relcache entries despite them being pinned.  The code I'm
complaining about is trying to devise a shortcut solution to exactly
that situation ... and it's not a good shortcut.

> The only alternative seems to be to have each subsystem hold its own
> reference count, as I did with the PartitionDirectory stuff, which is
> not something we'd want to scale out.

Well, we clearly don't want to devise a separate solution for every
subsystem, but it doesn't seem out of the question to me that we could
build a "reference counted cache sub-structure" mechanism and apply it
to each of these relcache fields.  Maybe it could be unified with the
existing code in the typcache that does a similar thing.  Sure, this'd
be a fair amount of work, but we've done it before.  Syscache entries
didn't use to have reference counts, for example.
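In rough terms, such a reference-counted sub-structure might look
something like this (purely a hypothetical sketch; none of these names
exist in the tree, and it assumes the usual backend headers):

typedef struct RefcountedCacheData
{
    int           refcount;     /* number of live references */
    bool          is_current;   /* is this still the relcache's copy? */
    MemoryContext cxt;          /* context holding the cached structure */
} RefcountedCacheData;

static void
ReleaseCachedStructure(RefcountedCacheData *rc)
{
    Assert(rc->refcount > 0);
    rc->refcount--;

    /*
     * Free the data only once nobody holds a reference and the relcache
     * has already replaced it with a newer version; otherwise leave it
     * alone for the last holder (or the next invalidation) to clean up.
     */
    if (rc->refcount == 0 && !rc->is_current)
        MemoryContextDelete(rc->cxt);
}

Callers such as the PartitionDirectory would bump the refcount when they
latch onto the structure and call the release function when done, much as
the typcache already does for the similar structures it hands out.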

BTW, the problem I have with the PartitionDirectory stuff is exactly
that it *isn't* a reference-counted solution.  If it were, we'd not
have this problem of not knowing when we could free rd_pdcxt.

>> I especially don't care for the much-less-than-half-baked kluge of
>> chaining the old rd_pdcxt onto the new one and hoping that it will go away
>> at a suitable time.

> It will go away at a suitable time, but maybe not at the soonest
> suitable time.  It wouldn't be hard to improve that, though.  The
> easiest thing to do, I think, would be to have an rd_oldpdcxt and
> stuff the old context there.  If there already is one there, parent
> the new one under it.  When RelationDecrementReferenceCount reduces
> the reference count to zero, destroy anything found in rd_oldpdcxt.

Meh.  While it seems likely that that would mask most practical problems,
it still seems like covering up a wart with a dirty bandage.  In
particular, that fix doesn't fix anything unless relcache reference counts
go to zero pretty quickly --- which I'll just note is directly contrary
to your enthusiasm for holding relcache pins longer.

I think that what we ought to do for v12 is have PartitionDirectory
copy the data, and then in v13 work on creating real reference-count
infrastructure that would allow eliminating the copy steps with full
safety.  The $64 question is whether that really would cause unacceptable
performance problems.  To look into that, I made the attached WIP patches.
(These are functionally complete, but I didn't bother for instance with
removing the hunk that 898e5e329 added to relcache.c, and the comments
need work, etc.)  The first one just changes the PartitionDirectory
code to do that, and then the second one micro-optimizes
partition_bounds_copy() to make it somewhat less expensive, mostly by
collapsing lots of small palloc's into one big one.

What I get for test cases like [1] is

single-partition SELECT, hash partitioning:

N       tps, HEAD       tps, patch
2       11426.243754    11448.615193
8       11254.833267    11374.278861
32      11288.329114    11371.942425
128     11222.329256    11185.845258
512     11001.177137    10572.917288
1024    10612.456470    9834.172965
4096    8819.110195     7021.864625
8192    7372.611355     5276.130161

single-partition SELECT, range partitioning:

N       tps, HEAD       tps, patch
2       11037.855338    11153.595860
8       11085.218022    11019.132341
32      10994.348207    10935.719951
128     10884.417324    10532.685237
512     10635.583411    9578.108915
1024    10407.286414    8689.585136
4096    8361.463829     5139.084405
8192    7075.880701     3442.542768

Now certainly these numbers suggest that avoiding the copy could be worth
our trouble, but these results are still several orders of magnitude
better than where we were two weeks ago [2].  Plus, this is an extreme
case that's not really representative of real-world usage, since the test
tables have neither indexes nor any data.  In practical situations the
baseline would be lower and the dropoff less bad.  So I don't feel bad
about shipping v12 with these sorts of numbers and hoping for more
improvement later.

            regards, tom lane

[1] https://www.postgresql.org/message-id/3529.1554051867%40sss.pgh.pa.us

[2] https://www.postgresql.org/message-id/0F97FA9ABBDBE54F91744A9B37151A512BAC60%40g01jpexmbkw24

diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c
index 4d6595b..96129e7 100644
--- a/src/backend/partitioning/partdesc.c
+++ b/src/backend/partitioning/partdesc.c
@@ -43,7 +43,6 @@ typedef struct PartitionDirectoryData
 typedef struct PartitionDirectoryEntry
 {
     Oid            reloid;
-    Relation    rel;
     PartitionDesc pd;
 } PartitionDirectoryEntry;

@@ -235,6 +234,34 @@ RelationBuildPartitionDesc(Relation rel)
 }

 /*
+ * copyPartitionDesc
+ *
+ * Should be somewhere else perhaps?
+ */
+static PartitionDesc
+copyPartitionDesc(PartitionDesc src, PartitionKey key)
+{
+    PartitionDesc partdesc;
+    int            nparts = src->nparts;
+
+    partdesc = (PartitionDescData *) palloc0(sizeof(PartitionDescData));
+    partdesc->nparts = nparts;
+    /* If there are no partitions, the rest of the partdesc can stay zero */
+    if (nparts > 0)
+    {
+        partdesc->oids = (Oid *) palloc(nparts * sizeof(Oid));
+        memcpy(partdesc->oids, src->oids, nparts * sizeof(Oid));
+
+        partdesc->is_leaf = (bool *) palloc(nparts * sizeof(bool));
+        memcpy(partdesc->is_leaf, src->is_leaf, nparts * sizeof(bool));
+
+        partdesc->boundinfo = partition_bounds_copy(src->boundinfo, key);
+    }
+
+    return partdesc;
+}
+
+/*
  * CreatePartitionDirectory
  *        Create a new partition directory object.
  */
@@ -265,7 +292,7 @@ CreatePartitionDirectory(MemoryContext mcxt)
  *
  * The purpose of this function is to ensure that we get the same
  * PartitionDesc for each relation every time we look it up.  In the
- * face of current DDL, different PartitionDescs may be constructed with
+ * face of concurrent DDL, different PartitionDescs may be constructed with
  * different views of the catalog state, but any single particular OID
  * will always get the same PartitionDesc for as long as the same
  * PartitionDirectory is used.
@@ -281,13 +308,16 @@ PartitionDirectoryLookup(PartitionDirectory pdir, Relation rel)
     if (!found)
     {
         /*
-         * We must keep a reference count on the relation so that the
-         * PartitionDesc to which we are pointing can't get destroyed.
+         * Currently, neither RelationGetPartitionDesc nor
+         * RelationGetPartitionKey can leak any memory; but if they start
+         * doing so, do those calls before switching into pdir_mcxt.
          */
-        RelationIncrementReferenceCount(rel);
-        pde->rel = rel;
-        pde->pd = RelationGetPartitionDesc(rel);
+        MemoryContext oldcontext = MemoryContextSwitchTo(pdir->pdir_mcxt);
+
+        pde->pd = copyPartitionDesc(RelationGetPartitionDesc(rel),
+                                    RelationGetPartitionKey(rel));
         Assert(pde->pd != NULL);
+        MemoryContextSwitchTo(oldcontext);
     }
     return pde->pd;
 }
@@ -296,17 +326,11 @@ PartitionDirectoryLookup(PartitionDirectory pdir, Relation rel)
  * DestroyPartitionDirectory
  *        Destroy a partition directory.
  *
- * Release the reference counts we're holding.
+ * This is a no-op at present; caller is supposed to release the memory context.
  */
 void
 DestroyPartitionDirectory(PartitionDirectory pdir)
 {
-    HASH_SEQ_STATUS    status;
-    PartitionDirectoryEntry *pde;
-
-    hash_seq_init(&status, pdir->pdir_hash);
-    while ((pde = hash_seq_search(&status)) != NULL)
-        RelationDecrementReferenceCount(pde->rel);
 }

 /*
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index c8770cc..b7ddbe5 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -782,10 +782,13 @@ partition_bounds_copy(PartitionBoundInfo src,
                       PartitionKey key)
 {
     PartitionBoundInfo dest;
-    int            i;
+    bool        hash_part = (key->strategy == PARTITION_STRATEGY_HASH);
+    Datum *alldatums;
     int            ndatums;
     int            partnatts;
+    int            natts;
     int            num_indexes;
+    int            i;

     dest = (PartitionBoundInfo) palloc(sizeof(PartitionBoundInfoData));

@@ -793,65 +796,69 @@ partition_bounds_copy(PartitionBoundInfo src,
     ndatums = dest->ndatums = src->ndatums;
     partnatts = key->partnatts;

-    num_indexes = get_partition_bound_num_indexes(src);
-
     /* List partitioned tables have only a single partition key. */
     Assert(key->strategy != PARTITION_STRATEGY_LIST || partnatts == 1);

-    dest->datums = (Datum **) palloc(sizeof(Datum *) * ndatums);
-
     if (src->kind != NULL)
     {
+        PartitionRangeDatumKind *allkinds;
+
         dest->kind = (PartitionRangeDatumKind **) palloc(ndatums *
                                                          sizeof(PartitionRangeDatumKind *));
+        allkinds = (PartitionRangeDatumKind *) palloc(ndatums * partnatts *
+                                                      sizeof(PartitionRangeDatumKind));
+
         for (i = 0; i < ndatums; i++)
         {
-            dest->kind[i] = (PartitionRangeDatumKind *) palloc(partnatts *
-                                                               sizeof(PartitionRangeDatumKind));
+            dest->kind[i] = allkinds;
+            allkinds += partnatts;

             memcpy(dest->kind[i], src->kind[i],
-                   sizeof(PartitionRangeDatumKind) * key->partnatts);
+                   sizeof(PartitionRangeDatumKind) * partnatts);
         }
     }
     else
         dest->kind = NULL;

+    dest->datums = (Datum **) palloc(sizeof(Datum *) * ndatums);
+
+    /*
+     * For hash partitioning, datums arrays will have two elements - modulus
+     * and remainder.
+     */
+    natts = hash_part ? 2 : partnatts;
+
+    alldatums = (Datum *) palloc(sizeof(Datum) * natts * ndatums);
+
     for (i = 0; i < ndatums; i++)
     {
-        int            j;
+        dest->datums[i] = alldatums;
+        alldatums += natts;

         /*
-         * For a corresponding to hash partition, datums array will have two
-         * elements - modulus and remainder.
+         * For hash partitioning, we know that both datums are present and
+         * pass-by-value (since they're int4s), so just memcpy the datum
+         * array.  Otherwise we have to apply datumCopy.
          */
-        bool        hash_part = (key->strategy == PARTITION_STRATEGY_HASH);
-        int            natts = hash_part ? 2 : partnatts;
-
-        dest->datums[i] = (Datum *) palloc(sizeof(Datum) * natts);
-
-        for (j = 0; j < natts; j++)
+        if (hash_part)
         {
-            bool        byval;
-            int            typlen;
-
-            if (hash_part)
-            {
-                typlen = sizeof(int32); /* Always int4 */
-                byval = true;    /* int4 is pass-by-value */
-            }
-            else
+            memcpy(dest->datums[i], src->datums[i], 2 * sizeof(Datum));
+        }
+        else
+        {
+            for (int j = 0; j < natts; j++)
             {
-                byval = key->parttypbyval[j];
-                typlen = key->parttyplen[j];
+                if (dest->kind == NULL ||
+                    dest->kind[i][j] == PARTITION_RANGE_DATUM_VALUE)
+                    dest->datums[i][j] = datumCopy(src->datums[i][j],
+                                                   key->parttypbyval[j],
+                                                   key->parttyplen[j]);
             }
-
-            if (dest->kind == NULL ||
-                dest->kind[i][j] == PARTITION_RANGE_DATUM_VALUE)
-                dest->datums[i][j] = datumCopy(src->datums[i][j],
-                                               byval, typlen);
         }
     }

+    num_indexes = get_partition_bound_num_indexes(src);
+
     dest->indexes = (int *) palloc(sizeof(int) * num_indexes);
     memcpy(dest->indexes, src->indexes, sizeof(int) * num_indexes);

