On Mon, Feb 4, 2019 at 6:52 PM Jakub Glapa <jakub.glapa@gmail.com> wrote:
> I see the error showing up every night on 2 different servers. But it's a bit of a heisenbug because if I go there
> now it won't be reproducible.
Huh. Ok, well that's a lot more frequent than I thought. Is it always
the same query? Any chance you can get the plan? Is there more going
on on the server, like perhaps concurrent parallel queries?
> It was suggested by Justin Pryzby that I recompile pg src with his patch that would cause a coredump.
Small correction to Justin's suggestion: don't abort() after
elog(ERROR, ...), it'll never be reached.
> But I don't feel comfortable doing this especially if I would have to run this with prod data.
> My question is. Can I do anything like increasing logging level or enable some additional options?
> It's a production server but I'm willing to sacrifice a bit of its performance if that would help.
If you're able to run a throwaway copy of your production database on
another system that you don't have to worry about crashing, you could
just replace ERROR with PANIC and run a high-speed loop of the query
that crashed in production, or something. This might at least tell us
whether it's reaching that condition via something dereferencing a
dsa_pointer or something manipulating the segment lists while
allocating/freeing.
In my own 100% unsuccessful attempts to reproduce this I was mostly
running the same query (based on my guess at what ingredients are
needed), but perhaps it requires a particular allocation pattern that
will require more randomness to reach... hmm.
--
Thomas Munro
http://www.enterprisedb.com