Обсуждение: [HACKERS] Slow I/O can break throttling of base backup

Поиск
Список
Период
Сортировка

[HACKERS] Slow I/O can break throttling of base backup

От
Magnus Hagander
Дата:
Running pg_basebackup with a throttling of say 10M runs it into the risk of the I/O on the server actually being slower than pg_basebackup (I have preproduced similar issues on fake-slow disks with lower rate limits).

What happens in this case in basebackup.c is that the value for "sleep" comes out negative. This means we don't sleep, which is fine.

However, that also means we don't set throttle_last.

That means that the next time we come around to throttle(), the value for "elapsed" is even bigger, which results in an even bigger negative number, and we're "stuck".

AFAICT this means that a temporary I/O spike that makes reading of the disk too slow can leave us in a situation where we never recover, and thus never end up sleeping ever again, effectively turning off rate limiting.

I wonder if the 
else if (sleep > 0)
at the bottom of throttle()
should just be a simple else and always run, resetting last_throttle?


--

Re: [HACKERS] Slow I/O can break throttling of base backup

От
Antonin Houska
Дата:
It seems to be my bug. I'll check tomorrow.

Magnus Hagander <magnus@hagander.net> wrote:

> Running pg_basebackup with a throttling of say 10M runs it into the risk of
> the I/O on the server actually being slower than pg_basebackup (I have
> preproduced similar issues on fake-slow disks with lower rate limits).
>
> What happens in this case in basebackup.c is that the value for "sleep"
> comes out negative. This means we don't sleep, which is fine.
>
> However, that also means we don't set throttle_last.
>
> That means that the next time we come around to throttle(), the value for
> "elapsed" is even bigger, which results in an even bigger negative number,
> and we're "stuck".
>
> AFAICT this means that a temporary I/O spike that makes reading of the disk
> too slow can leave us in a situation where we never recover, and thus never
> end up sleeping ever again, effectively turning off rate limiting.
>
> I wonder if the else if (sleep > 0) at the bottom of throttle() should just
> be a simple else and always run, resetting last_throttle?
>
> --
> Magnus Hagander
> Me: http://www.hagander.net/
> Work: http://www.redpill-linpro.com/
>

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at



Re: [HACKERS] Slow I/O can break throttling of base backup

От
Antonin Houska
Дата:
Antonin Houska <ah@cybertec.at> wrote:

> It seems to be my bug. I'll check tomorrow.

I could reproduce the problem by adding sufficient sleep time to the
loop.

> Magnus Hagander <magnus@hagander.net> wrote:
>> I wonder if the else if (sleep > 0) at the bottom of throttle() should just
>> be a simple else and always run, resetting last_throttle?

I agree. In fact, I could simplify the code even more.

Since (elapsed + sleep) almost equals to GetCurrentIntegerTimestamp(), we can
use the following statement unconditionally (I think I tried too hard to avoid
calling GetCurrentIntegerTimestamp too often in the original patch):

throttled_last = GetCurrentIntegerTimestamp();

Thus we can also get rid of the "else" branch that clears both "sleep" and
"wait_result".

(The patch contains minor comment refinement that I found useful when seeing
the code after years.)

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
new file mode 100644
index ffc7e58..40b3c11
*** a/src/backend/replication/basebackup.c
--- b/src/backend/replication/basebackup.c
*************** throttle(size_t increment)
*** 1370,1395 ****
          if (wait_result & WL_LATCH_SET)
              CHECK_FOR_INTERRUPTS();
      }
-     else
-     {
-         /*
-          * The actual transfer rate is below the limit.  A negative value
-          * would distort the adjustment of throttled_last.
-          */
-         wait_result = 0;
-         sleep = 0;
-     }
  
      /*
!      * Only a whole multiple of throttling_sample was processed. The rest will
!      * be done during the next call of this function.
       */
      throttling_counter %= throttling_sample;
  
!     /* Once the (possible) sleep has ended, new period starts. */
!     if (wait_result & WL_TIMEOUT)
!         throttled_last += elapsed + sleep;
!     else if (sleep > 0)
!         /* Sleep was necessary but might have been interrupted. */
!         throttled_last = GetCurrentIntegerTimestamp();
  }
--- 1370,1385 ----
          if (wait_result & WL_LATCH_SET)
              CHECK_FOR_INTERRUPTS();
      }
  
      /*
!      * As we work with integers, only whole multiple of throttling_sample was
!      * processed. The rest will be done during the next call of this function.
       */
      throttling_counter %= throttling_sample;
  
!     /*
!      * Time interval for the remaining amount and possible next increments
!      * starts now.
!      */
!     throttled_last = GetCurrentIntegerTimestamp();
  }

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Slow I/O can break throttling of base backup

От
Magnus Hagander
Дата:


On Fri, Dec 16, 2016 at 11:24 AM, Antonin Houska <ah@cybertec.at> wrote:
Antonin Houska <ah@cybertec.at> wrote:

> It seems to be my bug. I'll check tomorrow.

I could reproduce the problem by adding sufficient sleep time to the
loop.

> Magnus Hagander <magnus@hagander.net> wrote:
>> I wonder if the else if (sleep > 0) at the bottom of throttle() should just
>> be a simple else and always run, resetting last_throttle?

I agree. In fact, I could simplify the code even more.

Since (elapsed + sleep) almost equals to GetCurrentIntegerTimestamp(), we can
use the following statement unconditionally (I think I tried too hard to avoid
calling GetCurrentIntegerTimestamp too often in the original patch):

throttled_last = GetCurrentIntegerTimestamp();

Thus we can also get rid of the "else" branch that clears both "sleep" and
"wait_result".

(The patch contains minor comment refinement that I found useful when seeing
the code after years.)


Thanks! Applied and backpatched. 

--