> FWIW, I've experimented with LTO and PGO a bunch, both with gcc and clang. I did hit a crash in gcc, but that did turn out to be a compiler bug, and actually reduced to something not even needing LTO.
Good to hear that it works. I just need to figure out what is going wrong on my end then.
> I saw quite substantial speedups with PGO, but I only tested very specific workloads. IIRC it was >15% gain in concurrent readonly pgbench.
I successfully applied PGO on its own and obtained similar gains with TPC-C and TPC-H workloads.
> I dimly recall failing to get some benefit out of bolt for some reason that I unfortunately don't even vaguely recall.
With BOLT I got gains similar to, and slightly higher than, those from PGO, but not for all TPC-H queries. In fact, for some queries I observed small (2-4%) regressions with BOLT.
--
João Paulo L. de Carvalho
Ph.D Computer Science | IC-UNICAMP | Campinas, SP - Brazil
Postdoctoral Research Fellow | University of Alberta | Edmonton, AB - Canada