Discussion: Optimize WindowAgg's use of tuplestores


Optimize WindowAgg's use of tuplestores

From
David Rowley
Date:
As part of making tuplestores faster [1], I noticed that in WindowAgg, when we end one partition we call tuplestore_end(), and then in begin_partition() we do tuplestore_begin_heap() again for the next partition and go on to set up the tuplestore read pointers according to what's required for the given frameOptions of the WindowAgg.  This might make sense if the frameOptions could change between partitions, but they can't, so I don't see any reason why we can't just do tuplestore_clear() at the end of a partition.  That resets the read pointer positions back to the start, ready for the next partition.
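
To make that concrete, here's a condensed sketch of the two approaches
(the tuplestore function names are real, but the surrounding code is
heavily simplified and is not the actual patch):

/* master: destroy and rebuild the tuplestore for every partition */
tuplestore_end(winstate->buffer);   /* frees all of its memory */
/* ... later, in begin_partition() for the next partition ... */
winstate->buffer = tuplestore_begin_heap(false, false, work_mem);
/* ... and re-create the frameOptions-dependent read pointers, e.g. */
winstate->framehead_ptr =
    tuplestore_alloc_read_pointer(winstate->buffer, 0);

/* patched: keep the tuplestore and its read pointers alive across
 * partitions; clearing empties it and rewinds every read pointer */
tuplestore_clear(winstate->buffer);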

I wrote the attached patch and checked how it affects performance. It helps quite a bit when there are lots of partitions.

CREATE TABLE a (a INT NOT NULL);
INSERT INTO a SELECT x FROM generate_series(1,1000000)x;
VACUUM FREEZE ANALYZE a;

bench.sql:
SELECT a,count(*) OVER (PARTITION BY a) FROM a OFFSET 1000000;

master:
$ pgbench -n -f bench.sql -T 60 -M prepared postgres | grep latency
latency average = 293.488 ms
latency average = 295.509 ms
latency average = 297.772 ms

patched:
$ pgbench -n -f bench.sql -T 60 -M prepared postgres | grep latency
latency average = 203.234 ms
latency average = 204.538 ms
latency average = 203.877 ms

About 45% faster.

Here's the top of perf top of each:

master:
     8.61%  libc.so.6          [.] _int_malloc
     6.71%  libc.so.6          [.] _int_free
     6.42%  postgres           [.] heap_form_minimal_tuple
     6.40%  postgres           [.] tuplestore_alloc_read_pointer
     5.87%  libc.so.6          [.] unlink_chunk.constprop.0
     3.86%  libc.so.6          [.] __memmove_avx512_unaligned_erms
     3.80%  postgres           [.] AllocSetFree
     3.51%  postgres           [.] spool_tuples
     3.45%  postgres           [.] ExecWindowAgg
     2.09%  postgres           [.] tuplesort_puttuple_common
     1.92%  postgres           [.] comparetup_datum
     1.88%  postgres           [.] AllocSetAlloc
     1.85%  postgres           [.] tuplesort_heap_replace_top
     1.84%  postgres           [.] ExecStoreBufferHeapTuple
     1.69%  postgres           [.] window_gettupleslot

patched:
     8.95%  postgres           [.] ExecWindowAgg
     8.10%  postgres           [.] heap_form_minimal_tuple
     5.16%  postgres           [.] spool_tuples
     4.03%  libc.so.6          [.] __memmove_avx512_unaligned_erms
     4.02%  postgres           [.] begin_partition
     3.11%  postgres           [.] tuplesort_puttuple_common
     2.97%  postgres           [.] AllocSetAlloc
     2.96%  postgres           [.] comparetup_datum
     2.83%  postgres           [.] tuplesort_heap_replace_top
     2.79%  postgres           [.] window_gettupleslot
     2.22%  postgres           [.] AllocSetFree
     2.11%  postgres           [.] MemoryContextReset
     1.99%  postgres           [.] LogicalTapeWrite
     1.98%  postgres           [.] advance_windowaggregate

Lots less malloc/free work going on.

I'm also tempted to do a cleanup of the state machine inside nodeWindowAgg.c, as I had to add another bool flag to WindowAggState to replace the previous test for whether we'd just finished a partition, which checked if (winstate->buffer == NULL). I couldn't do that anymore since we no longer nullify that field. I think such a cleanup would be a separate patch. Annoyingly, the extra bool is the 9th bool flag, which widens that struct by 8 bytes and leaves a 7-byte hole.
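
(In case anyone wonders why a single extra bool costs 8 bytes, here's a
toy illustration of the alignment effect; this is hypothetical, not the
real WindowAggState layout.)

typedef struct Toy
{
    void   *ptr;        /* offset 0, 8 bytes */
    bool    flags[9];   /* offsets 8-16: the 9th bool spills past the
                         * 8-byte boundary into a new alignment unit */
    /* 7 bytes of padding here */
    void   *next;       /* needs 8-byte alignment, so offset 24 */
} Toy;                  /* sizeof = 32; with only 8 bools it would be 24 */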

David

Attachments

Re: Optimize WindowAgg's use of tuplestores

From
Ashutosh Bapat
Date:
On Sun, Jul 7, 2024 at 4:27 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> As part of making tuplestores faster [1], I noticed that in WindowAgg,
> when we end one partition we call tuplestore_end(), and then in
> begin_partition() we do tuplestore_begin_heap() again for the next
> partition and go on to set up the tuplestore read pointers according to
> what's required for the given frameOptions of the WindowAgg.  This might
> make sense if the frameOptions could change between partitions, but they
> can't, so I don't see any reason why we can't just do tuplestore_clear()
> at the end of a partition.  That resets the read pointer positions back
> to the start, ready for the next partition.
>
> I wrote the attached patch and checked how it affects performance. It
> helps quite a bit when there are lots of partitions.
>
> CREATE TABLE a (a INT NOT NULL);
> INSERT INTO a SELECT x FROM generate_series(1,1000000)x;
> VACUUM FREEZE ANALYZE a;
>
> bench.sql:
> SELECT a,count(*) OVER (PARTITION BY a) FROM a OFFSET 1000000;
>
> master:
> $ pgbench -n -f bench.sql -T 60 -M prepared postgres | grep latency
> latency average = 293.488 ms
> latency average = 295.509 ms
> latency average = 297.772 ms
>
> patched:
> $ pgbench -n -f bench.sql -T 60 -M prepared postgres | grep latency
> latency average = 203.234 ms
> latency average = 204.538 ms
> latency average = 203.877 ms
>
> About 45% faster.
>

I repeated your measurements, but varying the number of partitions and
repeating pgbench 5 times instead of 3. The idea is to see the impact
of the change at lower numbers of partitions.

10 partitions query: SELECT a,count(*) OVER (PARTITION BY a % 10) FROM a OFFSET 1000000;
100 partitions query: SELECT a,count(*) OVER (PARTITION BY a % 100) FROM a OFFSET 1000000;
1000 partitions query: SELECT a,count(*) OVER (PARTITION BY a % 1000) FROM a OFFSET 1000000;
original query with 1M partitions: SELECT a,count(*) OVER (PARTITION BY a) FROM a OFFSET 1000000;
Notice that the offset is still the same to avoid any impact it may
have on the query execution.

Here are the results
master:
no. of partitions, average latencies
10, 362.166 ms, 369.313 ms, 375.203 ms, 368.798 ms, 372.483 ms
100, 372.885 ms, 381.463 ms, 385.372 ms, 382.915 ms, 383.630 ms
1000, 390.834 ms, 395.653 ms, 400.339 ms, 407.777 ms, 389.906 ms
1000000, 552.848 ms, 553.943 ms, 547.806 ms, 541.871 ms, 546.741 ms

patched
10, 356.980 ms, 371.223 ms, 375.550 ms, 378.011 ms, 381.119 ms
100, 392.307 ms, 385.087 ms, 380.383 ms, 390.999 ms, 388.422 ms
1000, 405.136 ms, 397.576 ms, 399.021 ms, 399.572 ms, 406.604 ms
1000000, 394.711 ms, 403.741 ms, 399.008 ms, 392.932 ms, 393.335 ms

Observations
1. The numbers corresponding to 10 and 100 partitions are higher when
patched. That might be just noise; I don't see any reason why it would
have a negative impact when there are a small number of partitions. The
lower partition cases also have a higher number of rows per partition,
so is the difference between MemoryContextDelete() and
MemoryContextReset() making any difference here? It may be worth
verifying those cases carefully. Otherwise, up to 1000 partitions, it
doesn't show any differences.
2. For 1M partitions it does make a difference, about 35% in my case.
Moreover, the change seems to make the execution times (more or less)
independent of the number of partitions.

Combining this observation with the first one, it might be worth
looking at the execution times when there are many rows per partition
combined with a higher number of partitions.

--
Best Wishes,
Ashutosh Bapat



Re: Optimize WindowAgg's use of tuplestores

From
David Rowley
Date:
On Wed, 10 Jul 2024 at 02:42, Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> Observations
> 1. The numbers corresponding to 10 and 100 partitions are higher when
> patched. That might be just noise. I don't see any reason why it would
> impact negatively when there are a small number of partitions. The
> lower partition cases also have a higher number of rows per partition,
> so is the difference between MemoryContextDelete() vs
> MemoryContextReset() making any difference here. May be worth
> verifying those cases carefully. Otherwise upto 1000 partitions, it
> doesn't show any differences.

I think this might just be noise as a result of rearranging code. In
terms of C code, I don't see any reason for it to be slower.  If you
look at GenerationDelete() (which is what gets called from
MemoryContextDelete()), it just calls GenerationReset(). So resetting
is always going to be less work than deleting the context, especially
given we don't need to create the context again when we reset it.
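
The relevant code in src/backend/utils/mmgr/generation.c is roughly the
following (paraphrased rather than quoted verbatim):

void
GenerationDelete(MemoryContext context)
{
    /* Reset to release all the GenerationBlocks */
    GenerationReset(context);
    /* ... and then additionally free the context header itself */
    free(context);
}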

I wrote the attached script to see if I can also see the slowdown and
I do see the patched code come out slightly slower (within noise
levels) in lower partition counts.

To get my compiler to produce code in a more optimal order for the
common case, I added unlikely() to the "if (winstate->all_first)"
condition.  This is only true on the first evaluation after a rescan,
so putting that code at the end of the function makes more sense.  The
attached v2 patch has it this way.  You can see the numbers look
slightly better in the attached graph.
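
For reference, the shape of that change is simply (a sketch only; the
real change is in the attached v2 patch):

if (unlikely(winstate->all_first))
{
    /* once-per-rescan setup goes here; marking the branch as unlikely
     * lets the compiler move this block out of the function's hot path */
}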

Thanks for having a look at this.

David

Attachments

Re: Optimize WindowAgg's use of tuplestores

From
Ranier Vilela
Date:
On Thu, Jul 11, 2024 at 09:09, David Rowley <dgrowleyml@gmail.com> wrote:
On Wed, 10 Jul 2024 at 02:42, Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> Observations
> 1. The numbers corresponding to 10 and 100 partitions are higher when
> patched. That might be just noise; I don't see any reason why it would
> have a negative impact when there are a small number of partitions. The
> lower partition cases also have a higher number of rows per partition,
> so is the difference between MemoryContextDelete() and
> MemoryContextReset() making any difference here? It may be worth
> verifying those cases carefully. Otherwise, up to 1000 partitions, it
> doesn't show any differences.

I think this might just be noise as a result of rearranging code. In
terms of C code, I don't see any reason for it to be slower.  If you
look at GenerationDelete() (which is what gets called from
MemoryContextDelete()), it just calls GenerationReset(). So resetting
is always going to be less work than deleting the context, especially
given we don't need to create the context again when we reset it.

I wrote the attached script to see if I can also see the slowdown and
I do see the patched code come out slightly slower (within noise
levels) in lower partition counts.

To get my compiler to produce code in a more optimal order for the
common case, I added unlikely() to the "if (winstate->all_first)"
condition.  This is only true on the first evaluation after a rescan,
so putting that code at the end of the function makes more sense.  The
attached v2 patch has it this way.
Not that it's going to make any difference, 
but since you're going to touch this code, why not?

Function *begin_partition*:
1. You can reduce the scope of the variable *outerPlan*;
it is not needed anywhere else.
+    /*
+     * If this is the very first partition, we need to fetch the first input
+     * row to store in first_part_slot.
+     */
+    if (TupIsNull(winstate->first_part_slot))
+    {
+        PlanState  *outerPlan = outerPlanState(winstate);
+        TupleTableSlot *outerslot = ExecProcNode(outerPlan);

2. There is only one reference to the variable *numfuncs*,
so you can reduce its scope.

+    /* reset mark and seek positions for each real window function */
+    for (int i = 0; i < winstate->numfuncs; i++)

Function *prepare_tuplestore*:
1. There is only one reference to the variable *numfuncs*,
so you can reduce its scope.

     /* create mark and read pointers for each real window function */
-    for (i = 0; i < numfuncs; i++)
+    for (int i = 0; i < winstate->numfuncs; i++)

2. You can safely initialize the default value of *winstate->grouptail_ptr*
in the *else* branch.

     if ((frameOptions & (FRAMEOPTION_EXCLUDE_GROUP |
                          FRAMEOPTION_EXCLUDE_TIES)) &&
         node->ordNumCols != 0)
         winstate->grouptail_ptr =
             tuplestore_alloc_read_pointer(winstate->buffer, 0);
 }
+else
+    winstate->grouptail_ptr = -1;

best regards,
Ranier Vilela

Re: Optimize WindowAgg's use of tuplestores

From
Tatsuo Ishii
Date:
Hi David,

Thank you for the patch.

> I think this might just be noise as a result of rearranging code. In
> terms of C code, I don't see any reason for it to be slower.  If you
> look at GenerationDelete() (which is what gets called from
> MemoryContextDelete()), it just calls GenerationReset(). So resetting
> is always going to be less work than deleting the context, especially
> given we don't need to create the context again when we reset it.
> 
> I wrote the attached script to see if I can also see the slowdown and
> I do see the patched code come out slightly slower (within noise
> levels) in lower partition counts.
> 
> To get my compiler to produce code in a more optimal order for the
> common case, I added unlikely() to the "if (winstate->all_first)"
> condition.  This is only true on the first evaluation after a rescan,
> so putting that code at the end of the function makes more sense.  The
> attached v2 patch has it this way.  You can see the numbers look
> slightly better in the attached graph.
> 
> Thanks for having a look at this.
> 
> David

@@ -2684,6 +2726,14 @@ ExecEndWindowAgg(WindowAggState *node)
     PlanState  *outerPlan;
     int            i;
 
+    if (node->buffer != NULL)
+    {
+        tuplestore_end(node->buffer);
+
+        /* nullify so that release_partition skips the tuplestore_clear() */
+        node->buffer = NULL;
+    }
+

Is it possible that node->buffer == NULL in ExecEndWindowAgg()?  If
not, probably making it an Assert() or just removing the if() should
be fine.

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese: http://www.sraoss.co.jp



Re: Optimize WindowAgg's use of tuplestores

From
Tatsuo Ishii
Date:
> @@ -2684,6 +2726,14 @@ ExecEndWindowAgg(WindowAggState *node)
>      PlanState  *outerPlan;
>      int            i;
>  
> +    if (node->buffer != NULL)
> +    {
> +        tuplestore_end(node->buffer);
> +
> +        /* nullify so that release_partition skips the tuplestore_clear() */
> +        node->buffer = NULL;
> +    }
> +
> 
> Is it possible that node->buffer == NULL in ExecEndWindowAgg()?  If
> not, probably making it an Assert() or just removing the if() should
> be fine.

Of course it is possible, for example when there are no rows in a
partition.  Sorry for the noise.

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese: http://www.sraoss.co.jp



Re: Optimize WindowAgg's use of tuplestores

From
Ashutosh Bapat
Date:
On Thu, Jul 11, 2024 at 5:39 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Wed, 10 Jul 2024 at 02:42, Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> > Observations
> > 1. The numbers corresponding to 10 and 100 partitions are higher when
> > patched. That might be just noise; I don't see any reason why it would
> > have a negative impact when there are a small number of partitions. The
> > lower partition cases also have a higher number of rows per partition,
> > so is the difference between MemoryContextDelete() and
> > MemoryContextReset() making any difference here? It may be worth
> > verifying those cases carefully. Otherwise, up to 1000 partitions, it
> > doesn't show any differences.
>
> I think this might just be noise as a result of rearranging code. In
> terms of C code, I don't see any reason for it to be slower.  If you
> look at GenerationDelete() (which is what gets called from
> MemoryContextDelete()), it just calls GenerationReset(). So resetting
> is always going to be less work than deleting the context, especially
> given we don't need to create the context again when we reset it.
>
> I wrote the attached script to see if I can also see the slowdown and
> I do see the patched code come out slightly slower (within noise
> levels) in lower partition counts.
>
> To get my compiler to produce code in a more optimal order for the
> common case, I added unlikely() to the "if (winstate->all_first)"
> condition.  This is only true on the first evaluation after a rescan,
> so putting that code at the end of the function makes more sense.  The
> attached v2 patch has it this way.  You can see the numbers look
> slightly better in the attached graph.

The change to all_first seems unrelated to the tuplestore
optimization, but it brings the results in line with master for
lower numbers of partitions.

Thanks for the script. I have similar results on my laptop.
From master
Testing with 1000000 partitions
latency average = 505.738 ms
latency average = 509.407 ms
latency average = 522.461 ms
Testing with 100000 partitions
latency average = 329.026 ms
latency average = 327.504 ms
latency average = 342.556 ms
Testing with 10000 partitions
latency average = 299.496 ms
latency average = 298.266 ms
latency average = 306.773 ms
Testing with 1000 partitions
latency average = 299.006 ms
latency average = 302.188 ms
latency average = 301.701 ms
Testing with 100 partitions
latency average = 305.411 ms
latency average = 286.935 ms
latency average = 302.432 ms
Testing with 10 partitions
latency average = 288.091 ms
latency average = 294.506 ms
latency average = 305.082 ms
Testing with 1 partitions
latency average = 301.121 ms
latency average = 319.615 ms
latency average = 301.141 ms

Patched
Testing with 1000000 partitions
latency average = 351.683 ms
latency average = 352.516 ms
latency average = 352.086 ms
Testing with 100000 partitions
latency average = 300.626 ms
latency average = 303.584 ms
latency average = 306.959 ms
Testing with 10000 partitions
latency average = 289.560 ms
latency average = 302.248 ms
latency average = 297.423 ms
Testing with 1000 partitions
latency average = 308.600 ms
latency average = 299.215 ms
latency average = 289.681 ms
Testing with 100 partitions
latency average = 301.216 ms
latency average = 286.240 ms
latency average = 291.232 ms
Testing with 10 partitions
latency average = 305.260 ms
latency average = 296.707 ms
latency average = 300.266 ms
Testing with 1 partitions
latency average = 316.199 ms
latency average = 314.043 ms
latency average = 309.425 ms

Now that you are also seeing the slowdown with your earlier patch, I
am wondering whether adding unlikely() by itself is a good
optimization. There might be some other reason behind the perceived
slowdown. How do the numbers look when you just add unlikely() without
any other changes?

--
Best Wishes,
Ashutosh Bapat



Re: Optimize WindowAgg's use of tuplestores

From
Ashutosh Bapat
Date:
On Fri, Jul 12, 2024 at 11:59 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
>
> Now that you are also seeing the slowdown with your earlier patch, I
> am wondering whether adding unlikely() by itself is a good
> optimization. There might be some other reason behind the perceived
> slowdown. How do the numbers look when you just add unlikely() without
> any other changes?

Out of curiosity, I measured the performance with master, with just the
"unlikely" change, and with the full patch. Below are the results.

master:
Testing with 1000000 partitions
latency average = 503.321 ms
latency average = 510.365 ms
latency average = 512.117 ms
Testing with 100000 partitions
latency average = 371.764 ms
latency average = 361.202 ms
latency average = 360.529 ms
Testing with 10000 partitions
latency average = 327.495 ms
latency average = 327.334 ms
latency average = 325.925 ms
Testing with 1000 partitions
latency average = 319.290 ms
latency average = 318.709 ms
latency average = 318.013 ms
Testing with 100 partitions
latency average = 317.756 ms
latency average = 318.933 ms
latency average = 316.529 ms
Testing with 10 partitions
latency average = 316.392 ms
latency average = 315.297 ms
latency average = 316.007 ms
Testing with 1 partitions
latency average = 330.978 ms
latency average = 330.529 ms
latency average = 333.538 ms

with just unlikely change
Testing with 1000000 partitions
latency average = 504.786 ms
latency average = 507.557 ms
latency average = 508.522 ms
Testing with 100000 partitions
latency average = 316.345 ms
latency average = 315.496 ms
latency average = 326.503 ms
Testing with 10000 partitions
latency average = 296.878 ms
latency average = 293.927 ms
latency average = 294.654 ms
Testing with 1000 partitions
latency average = 292.680 ms
latency average = 283.245 ms
latency average = 280.857 ms
Testing with 100 partitions
latency average = 292.569 ms
latency average = 296.330 ms
latency average = 295.389 ms
Testing with 10 partitions
latency average = 285.909 ms
latency average = 287.499 ms
latency average = 293.322 ms
Testing with 1 partitions
latency average = 305.080 ms
latency average = 309.100 ms
latency average = 307.794 ms

There's a noticeable change across all partition counts with just the
"unlikely" change. The improvement is smaller with a larger number of
partitions but quite visible with smaller numbers of partitions.

full patch
Testing with 1000000 partitions
latency average = 356.026 ms
latency average = 375.280 ms
latency average = 374.575 ms
Testing with 100000 partitions
latency average = 318.173 ms
latency average = 307.598 ms
latency average = 315.868 ms
Testing with 10000 partitions
latency average = 295.541 ms
latency average = 313.317 ms
latency average = 299.936 ms
Testing with 1000 partitions
latency average = 295.082 ms
latency average = 305.204 ms
latency average = 294.702 ms
Testing with 100 partitions
latency average = 302.552 ms
latency average = 307.596 ms
latency average = 304.202 ms
Testing with 10 partitions
latency average = 295.050 ms
latency average = 291.127 ms
latency average = 299.704 ms
Testing with 1 partitions
latency average = 308.781 ms
latency average = 304.071 ms
latency average = 319.560 ms

There is a significant improvement with a large number of partitions, as
seen previously. But for a smaller number of partitions the performance
worsens, which needs some investigation.
--
Best Wishes,
Ashutosh Bapat