Re: what to revert

From: Tomas Vondra
Subject: Re: what to revert
Date:
Msg-id: c7ef10b9-276e-71bd-a76b-b5782d8d7033@2ndquadrant.com
In reply to: Re: what to revert  (Kevin Grittner <kgrittn@gmail.com>)
Responses: Re: what to revert  (Alvaro Herrera <alvherre@2ndquadrant.com>)
           Re: what to revert  (Kevin Grittner <kgrittn@gmail.com>)
List: pgsql-hackers
Hi,

On 05/10/2016 10:29 AM, Kevin Grittner wrote:
> On Mon, May 9, 2016 at 9:01 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>
>> Over the past few days I've been running benchmarks on a fairly
> large NUMA box (4 sockets, 32 cores / 64 with HT, 256GB of RAM)
>> to see the impact of the 'snapshot too old' - both when disabled
>> and enabled with various values in the old_snapshot_threshold
>> GUC.
>
> Thanks!
>
>> The benchmark is a simple read-only pgbench with prepared
>> statements, i.e. doing something like this:
>>
>>    pgbench -S -M prepared -j N -c N
>
> Do you have any plans to benchmark cases where the patch can have a
> benefit?  (Clearly, nobody would be interested in using the feature
> with a read-only load; so while that makes a good "worst case"
> scenario and is very valuable for testing the "off" versus
> "reverted" comparison, it's not an intended use or one that's
> likely to happen in production.)

Yes, I'd like to repeat the tests with other workloads - I'm thinking 
about regular pgbench and perhaps something that'd qualify as 'mostly 
read-only' (though I don't have a clear idea of how that should work).

Feel free to propose other tests.
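One option for the 'mostly read-only' case - a sketch only, assuming the per-script weight syntax (-f file@weight) and the random() function that went into 9.6-era pgbench - would be a 9:1 mix of the select and update parts of the TPC-B-ish transaction:

```shell
# Write the two single-statement custom scripts (hypothetical 90/10 mix).
cat > read.sql <<'EOF'
\set aid random(1, 100000 * :scale)
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
EOF

cat > write.sql <<'EOF'
\set aid random(1, 100000 * :scale)
UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = :aid;
EOF

# The run itself would then look like this (just echoed here, not executed):
echo pgbench -M prepared -j 16 -c 16 -T 300 -f read.sql@9 -f write.sql@1
```

With an older pgbench the same mix would need \setrandom and a different way of weighting the scripts, so treat the exact syntax above as an assumption.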

>
>> master-10-new - 91fd1df4 + old_snapshot_threshold=10
>> master-10-new-2 - 91fd1df4 + old_snapshot_threshold=10 (rerun)
>
> So, these runs were with identical software on the same data? Any
> differences are just noise?

Yes, same config. The differences are either noise or something 
unexpected (like the sudden drops in tps at high client counts).
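A crude way to quantify those drops is (max - min) as a percentage of max across the per-run tps values; for example, for the master-10-new-2 row quoted further down in this message:

```shell
# Per-run tps for the master-10-new-2 row quoted below (five runs).
tps="235516 331976 133316 155563 133396"

# (max - min) as a percentage of max, a crude variability measure.
spread=$(echo "$tps" | awk '{ max = min = $1
  for (i = 2; i <= NF; i++) { if ($i > max) max = $i; if ($i < min) min = $i }
  printf "%.0f", 100 * (max - min) / max }')
echo "max-min spread: ${spread}% of max"
```

That prints a spread of 60% of max for this row, versus the ~5% I see for the well-behaved rows.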

>> * The results are a bit noisy, but I think in general this shows
>> that for certain cases there's a clearly measurable difference
>> (up to 5%) between the "disabled" and "reverted" cases. This is
>> particularly visible on the smallest data set.
>
> In some cases, the differences are in favor of disabled over
> reverted.

Well, that's a good question. I think the results for higher client 
counts (>=64) are fairly noisy, so in those cases it may easily be just 
due to noise. For the lower client counts the results seem much less 
noisy, though.

>
>> * What's fairly strange is that on the largest dataset (scale
>> 10000), the "disabled" case is actually consistently faster than
>> "reverted" - that seems a bit suspicious, I think. It's possible
>> that I did the revert wrong, though - the revert.patch is
>> included in the tgz. This is why I also tested 689f9a05, but
>> that's also slower than "disabled".
>
> Since there is not a consistent win of disabled or reverted over
> the other, and what difference there is is often far less than the
> difference between the two runs with identical software, is there
> any reasonable interpretation of this except that the difference is
> "in the noise"?

Are we both looking at the results for scale 10000? I think there's a 
pretty clear difference between "disabled" and "reverted" (or 689f9a05, 
for that matter). The gap is also much larger compared to the two 
"identical" runs (ignoring the runs with 128 clients).

>
>> * The performance impact with the feature enabled seems rather
>> significant, especially once you exceed the number of physical
>> cores (32 in this case). Then the drop is pretty clear - often
>> ~50% or more.
>>
>> * 7e3da1c4 claims to bring the performance within 5% of the
>> disabled case, but that seems not to be the case.
>
> The commit comment says "At least in the tested case this brings
> performance within 5% of when the feature is off, compared to
> several times slower without this patch."  The tested case was a
> read-write load, so your read-only tests do nothing to determine
> whether this was the case in general for this type of load.
> Partly, the patch decreases chasing through HOT chains and
> increases the number of HOT updates, so there are compensating
> benefits of performing early vacuum in a read-write load.

OK. Sadly the commit message does not mention what the tested case was, 
so I wasn't really sure ...

>
>> What it does do, however, is bring the 'non-immediate' cases close
>> to the immediate ones (before, the performance drop came much
>> sooner in these cases - at 16 clients).
>
> Right.  This is, of course, just the first optimization, that we
> were able to get in "under the wire" before beta, but the other
> optimizations under consideration would only tend to bring the
> "enabled" cases closer together in performance, not make an enabled
> case perform the same as when the feature was off -- especially for
> a read-only workload.

OK

>
>> * It's also seems to me the feature greatly amplifies the
>> variability of the results, somehow. It's not uncommon to see
>> results like this:
>>
>>  master-10-new-2    235516     331976    133316    155563    133396
>>
>> where after the first runs (already fairly variable) the
>> performance tanks to ~50%. This happens particularly with higher
>> client counts, otherwise the max-min is within ~5% of the max.
>> There are a few cases where this happens without the feature
>> (i.e. old master, reverted or disabled), but it's usually much
>> smaller than with it enabled (immediate, 10 or 60). See the
>> 'summary' sheet in the ODS spreadsheet.
>>
>> I don't know what's the problem here - at first I thought that
>> maybe something else was running on the machine, or that
>> anti-wraparound autovacuum kicked in, but that seems not to be
>> the case. There's nothing like that in the postgres log (also
>> included in the .tgz).
>
> I'm inclined to suspect NUMA effects.  It would be interesting to
> try with the NUMA patch and cpuset I submitted a while back or with
> fixes in place for the Linux scheduler bugs which were reported
> last month.  Which kernel version was this?

I can try that, sure. Can you point me to the latest versions of the 
patches, rebased to current master if needed?

The kernel is 3.19.0-031900-generic
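For reference, a quick way to check that kernel against whichever release the scheduler fixes landed in - the 4.1 below is just a placeholder I picked for illustration, not the actual fix point:

```shell
kver="3.19.0-031900-generic"   # kernel reported above
base="${kver%%-*}"             # strip the build suffix -> 3.19.0

# Hypothetical fix point; substitute the real one once known.
fix="4.1"

# sort -V -C exits 0 iff the versions are already in ascending order.
if printf '%s\n' "$base" "$fix" | sort -V -C; then
  msg="$base predates $fix"
else
  msg="$base is at or past $fix"
fi
echo "$msg"
```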

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


