Discussion: restoration after crash slowness, any way to improve?


restoration after crash slowness, any way to improve?

From: "Joshua D. Drake"
Date:
-hackers,

So this is more of a spitballing thread than anything. As I understand 
it, if we have a long-running transaction or a large number of WAL 
segments and we crash, we then have to replay those logs on restart up 
to the last known good transaction. No problem.

I recently ran a very long transaction. I was building up a large number 
of rows into a two-column table to test index performance. I ended up 
having to kill the connection, and thus the transaction, after I realized 
I had an extra zero in my generate_series(). (Side note: amazing the 
difference a single zero can make ;)). When coming back up, I watched 
the machine and I was averaging anywhere from 60MB/s to 97MB/s on writes. 
That in itself isn't that bad for a single thread and a single SSD, 
doing what it is doing.
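The statement was roughly of this shape (the table definition and row 
counts here are placeholders, not the real ones):

  -- hypothetical reconstruction; names and counts are made up
  CREATE TABLE idx_test (id bigint, val text);

  -- intended: 100 million rows; one extra zero makes it 1 billion,
  -- i.e. roughly 10x the rows and 10x the WAL generated
  INSERT INTO idx_test
  SELECT i, md5(i::text)
  FROM generate_series(1, 1000000000) AS s(i);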

However, since I know this machine can get well over 400MB/s when using 
multiple connections, I can't help but wonder whether there is anything 
we can do to make restoration more efficient without sacrificing the 
purpose of what it is doing.

Can we have multiple readers pull transaction logs into shared_buffers 
(on recovery only), sort the good transactions and then hand them to the 
walwriter or bgwriter to apply to the pages?

Anyway, like I said, spitballing and I thought I would start the thread.

Sincerely,

JD

-- 
Command Prompt, Inc.    http://the.postgres.company/    +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.



Re: restoration after crash slowness, any way to improve?

From: Jeff Janes
Date:
On Wed, Aug 31, 2016 at 6:26 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> -hackers,
>
> So this is more of a spitballing thread than anything. As I understand it, if we have a long-running transaction or a large number of WAL segments and we crash, we then have to replay those logs on restart up to the last known good transaction. No problem.

It only has to replay from the start of the last successful checkpoint.  It doesn't matter whether there was a long-running transaction or not.  If a long transaction spans many checkpoints, replay still only has to go back to the start of the last successful checkpoint.  Maybe you just had checkpoint_segments or max_wal_size set way too high, assuming checkpoint_timeout would always kick in first and be the limiting factor.  But then your long-running transaction invalidated that assumption?
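If you want to sanity-check that, something like this shows how the checkpoints were actually being spaced (the values in the comments are just the 9.6 defaults, not your settings):

  -- how much WAL may accumulate before a checkpoint is forced,
  -- and the timer-based checkpoint interval
  SHOW max_wal_size;          -- 1GB by default
  SHOW checkpoint_timeout;    -- 5min by default

  -- were checkpoints being triggered by the timer or by WAL volume?
  SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter;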
 
> I recently ran a very long transaction. I was building up a large number of rows into a two-column table to test index performance. I ended up having to kill the connection, and thus the transaction, after I realized I had an extra zero in my generate_series(). (Side note: amazing the difference a single zero can make ;)). When coming back up, I watched the machine and I was averaging anywhere from 60MB/s to 97MB/s on writes.

Was it IO limited?

Killing a session/connection/transaction should not take down the entire server, so there should be no recovery taking place in the first place.  Are you sure you are seeing recovery, and not just the vacuuming of the aborted tuples?
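A rough way to tell those apart once the server is accepting connections again (just a sketch; pg_stat_progress_vacuum assumes you are on 9.6):

  -- during genuine crash recovery, connections are refused outright,
  -- so if these queries run at all it isn't WAL replay you are watching
  SELECT pid, query, state
  FROM pg_stat_activity
  WHERE query LIKE 'autovacuum:%';

  -- progress of any vacuum chewing through the aborted tuples
  SELECT pid, relid::regclass, phase, heap_blks_scanned, heap_blks_total
  FROM pg_stat_progress_vacuum;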
 


> However, since I know this machine can get well over 400MB/s when using multiple connections, I can't help but wonder whether there is anything we can do to make restoration more efficient without sacrificing the purpose of what it is doing.
>
> Can we have multiple readers pull transaction logs into shared_buffers (on recovery only), sort the good transactions and then hand them to the walwriter or bgwriter to apply to the pages?

I don't see how that could work.  Whether a page is consistent or not is orthogonal to whether the transactions on that page have committed or aborted.

There are two possibilities that I've considered, though, for long-running PITR, which could also apply to crash recovery, and which I think have been discussed here before.  One is to have a leading recovery process which would identify pages which will be recovered from an FPI (full-page image), and send word back to the lagging process not to bother applying incremental WAL to those pages.  The other would be for a leading process to asynchronously read into memory (either FS cache or shared_buffers) pages which it sees the lagging process will need to write to.

In the first case, you would want the leading process to be leading by a lot, so that it has the broadest scope to detect FPIs.  Basically you would want it to read all the way to the end of the replay, provided it had enough memory to store the list of FPI pages.  For the second one, you would not want it to run so far ahead that the pages it read in would get pushed out again before the lagging process got to them.  Controlling how far ahead that would be seems like it would be hard.
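How much the first approach could skip would of course depend on how many full-page images are in the WAL stream to begin with, which hinges on the usual settings (stock values noted in the comments; just a sketch):

  SHOW full_page_writes;    -- on by default; a full page image is logged for
                            -- the first change to a page after each checkpoint
  SHOW wal_compression;     -- off by default; compresses FPIs in WAL when on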

Cheers,

Jeff