Recently I've been dismissing a lot of suggested changes to checkpoint fsync timing without suggesting an alternative. I have a simple one in mind that captures the biggest problem I see: that the number of backend and checkpoint writes to a file are not connected at all.
We know that a 1GB relation segment can take a really long time to write out. That could include up to 128 changed 8K pages, and we allow all of them to get dirty before any are forced to disk with fsync.
It was surely already discussed but why isn't postresql writing sequentially its cache in a temporary file? With storage random speed at least five to ten time slower it could help a lot.