Re: snapshot too old issues, first around wraparound and then more.

Поиск
Список
Период
Сортировка
От Kevin Grittner
Тема Re: snapshot too old issues, first around wraparound and then more.
Дата
Msg-id CACjxUsMe9kXpNN7UEzSHPRN0xAKXxfaLLihuVDqQXwOmDqOvdA@mail.gmail.com
обсуждение исходный текст
Ответ на Re: snapshot too old issues, first around wraparound and then more.  (Andres Freund <andres@anarazel.de>)
Ответы Re: snapshot too old issues, first around wraparound and then more.
Список pgsql-hackers
On Wed, Apr 1, 2020 at 10:09 AM Andres Freund <andres@anarazel.de> wrote:

First off, many thanks to Andres for investigating this, and apologies for the bugs.  Also thanks to Michael for making sure I saw the thread.  I must also apologize that for not being able to track the community lists consistently due to health issues that are exacerbated by stress, and the fact that these lists often push past my current limits.  I'll try to help in this as best I can.

Do we actually have any evidence of this feature ever beeing used? I
didn't find much evidence for that in the archives (except Thomas
finding a problem).

This was added because a very large company trying to convert from Oracle had a test that started to show some slowdown on PostgreSQL after 8 hours, serious slowdown by 24 hours, and crashed hard before it could get to 48 hours -- due to lingering WITH HOLD cursors left by ODBC code.  They had millions of lines of code that would need to be rewritten without this feature.  With this feature (set to 20 minutes, if I recall correctly), their unmodified code ran successfully for at least three months solid without failure or corruption.  Last I heard, they were converting a large number of instances from Oracle to PostgreSQL, and those would all fail hard within days of running with this feature removed or disabled.

Also, VMware is using PostgreSQL as an embedded part of many products, and this feature was enabled to deal with similar failures due to ODBC cursors; so the number of instances running 24/7 under high load which have shown a clear benefit from enabling this feature has a lot of zeros.

Perhaps the lack of evidence for usage in the archives indicates a low frequency of real-world failures due to the feature, rather than lack of use?  I'm not doubting that Andres found real issues that should be fixed, but perhaps not very many people who are using the feature have more than two billion transactions within the time threshold, and perhaps the other problems are not as big as the problems solved by use of the feature -- at least in some cases.

To save readers who have not yet done the math some effort, at the 20 minute threshold used by the initial user, they would need to have a sustained rate of consumption of transaction IDs of over 66 million per second to experience wraparound problems, and at the longest threshold I have seen it would need to exceed an average of 461,893 TPS for three days solid to hit wraparound.  Those aren't impossible rates to hit, but in practice it might not be a frequent occurrence yet on modern hardware with some real-world applications.  Hopefully we can find a way to fix this before those rates become common.

I am reviewing the issue and patches now, and hope I can make some useful contribution to the discussion.

--
Kevin Grittner
VMware vCenter Server
https://www.vmware.com/

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Robert Haas
Дата:
Сообщение: Re: snapshot too old issues, first around wraparound and then more.
Следующее
От: Andres Freund
Дата:
Сообщение: Re: error context for vacuum to include block number