Re: snapshot too old issues, first around wraparound and then more.
От | Andres Freund |
---|---|
Тема | Re: snapshot too old issues, first around wraparound and then more. |
Дата | |
Msg-id | 20200403001713.3ukjhhqwscsgf2rl@alap3.anarazel.de обсуждение исходный текст |
Ответ на | Re: snapshot too old issues, first around wraparound and then more. (Robert Haas <robertmhaas@gmail.com>) |
Ответы |
Re: snapshot too old issues, first around wraparound and then more.
|
Список | pgsql-hackers |
Hi, On 2020-04-01 12:02:18 -0400, Robert Haas wrote: > I have no objection to the idea that *if* the feature is hopelessly > broken, it should be removed. I don't think we have a real choice here at this point, at least for the back branches. Just about nothing around old_snapshot_threshold works correctly: * There are basically no tests (see [1] I jsut sent, and also old_snapshot_threshold bypassing a lot of the relevant code). * We don't detect errors after hot pruning (to allow that is a major point of the feature) when access is via any sort of index scans. Wrong query results. * The time->xid mapping is is entirely broken. We don't prevent bloat for many multiples of old_snapshot_threshold (if above 1min). It's possible, but harder, to have this cause wrong query results. * In read-mostly workloads it can trigger errors in sessions that are much younger than old_snapshot_threshold, if the transactionid is not advancing. I've not tried to reproduce, but I suspect this can also cause wrong query results. Because a later snapshot can have the same xmin as older transactions, it sure looks like we can end up with data from an older xmin getting removed, but the newer snapshot's whenTaken will prevent TestForOldSnapshot_impl() from raising an error. * I am fairly sure that it can cause crashes (or even data corruption), because it assumes that DML never needs to check for old snapshots (with no meaningful justificiation). Leading to heap_update/delete to assume the page header is a tuple. * There's obviously also the wraparound issue that made me start this thread initially. Since this is a feature that can result in wrong query results (and quite possibly crashes / data corruption), I don't think we can just leave this unfixed. But given the amount of code / infrastructure changes required to get this into a working feature, I don't see how we can unleash those changes onto the stable branches. There's quite a few issues in here that require not just local bugfixes, but some design changes too. And it's pretty clear that the feature didn't go through enough review before getting committed. I see quite some merit in removing the code in master, and having a potential reimplementation go through a normal feature integration process. I don't really know what to do here. Causing problems by neutering a feature in the back branch *sucks*. While not quite as bad, removing a feature without a replacement in a major release is pretty harsh too. But I don't really see any other realistic path forward. FWIW, I've now worked around the interdependency of s_t_o my snapshot scalability patch (only took like 10 days). I have manually confirmed it works with 0/1 minute thresholds. I can make the tests pass unmodified if I just add SetOldSnapshotThresholdTimestamp() calls when not necessary (which obviously makes no sense). Lead to some decent improvements around pruning that are independent of s_t_o (with more possibilities "opened"). But I still think we need to do something here. Greetings, Andres Freund [1] https://postgr.es/m/20200403001235.e6jfdll3gh2ygbuc%40alap3.anarazel.de
В списке pgsql-hackers по дате отправления:
Предыдущее
От: Andres FreundДата:
Сообщение: Re: snapshot too old issues, first around wraparound and then more.