With that expensive sort spilling to disk and the aggregation running on top of it, a significant increase in work_mem looks like the change that will make the critical difference. The alternative would be fetching the data already sorted via an index, but that doesn't seem likely here.
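A quick way to test this without touching postgresql.conf is to raise work_mem for a single session and re-run the plan. This is only a sketch: the 256MB figure is an arbitrary illustrative value, not a recommendation, and the query itself is left as a placeholder for your own.

```sql
-- Session-only change; other connections keep the server default.
-- '256MB' is an assumed test value, pick one based on the spill size EXPLAIN reports.
SET work_mem = '256MB';

-- Re-run your original query under EXPLAIN (ANALYZE, BUFFERS) here and
-- compare the Sort node: "Sort Method: external merge  Disk: ..." means it
-- still spills, "Sort Method: quicksort  Memory: ..." means it fit in memory.

-- Restore the default for the rest of the session.
RESET work_mem;
```

Keep in mind that work_mem applies per sort or hash operation, not per connection, so a value that is fine for one session can add up quickly if set globally.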
I would suggest increasing default_statistics_target, but your row estimates are already good for the most part. Hopefully someone else will chime in with more.