Hi all, This is a follow-up of the following thread where I have touched the topic of corrupted 2PC files being completely ignored by recovery: https://www.postgresql.org/message-id/20180709012955.GD1467%40paquier.xyz I have posted a patch on this thread, but after more reviews I have noticed that it does not close all the holes. What happens is that when a two-phase state file cannot be read by recovery when the file is on disk, which is what ReadTwoPhaseFile does, then its data is simply skipped. It is perfectly possible that an on-disk 2PC is missing at crash recovery if it has been already processed, so if trying to open the file and seeing an ENOENT, then the file can be properly skipped. However, if you look at this API, it skips *all* errors instead of the ones that matter. For example, if a file needs to be processed and is cannot be opened because of permission issues, then it is simply skipped. If its size, its CRC or its magic number is incorrect, the same thing happens. Skipping unconditionally files this way can result in data loss. I think that we really need to harden things, by making ReadTwoPhaseFile() fail hard is it finds something unexpected, which is in this case anything except trying to open a file which fails on ENOENT, and that this stuff should be back-patched. Thoughts? -- Michael
On Mon, Jul 09, 2018 at 02:03:09PM +0900, Michael Paquier wrote: > I think that we really need to harden things, by making > ReadTwoPhaseFile() fail hard is it finds something unexpected, which is > in this case anything except trying to open a file which fails on > ENOENT, and that this stuff should be back-patched. Rebased as attached because of the conflicts from 811b6e3. -- Michael
On Wed, Jul 18, 2018 at 05:18:18PM +0900, Michael Paquier wrote:
> On Mon, Jul 09, 2018 at 02:03:09PM +0900, Michael Paquier wrote:
>> I think that we really need to harden things, by making
>> ReadTwoPhaseFile() fail hard is it finds something unexpected, which is
>> in this case anything except trying to open a file which fails on
>> ENOENT, and that this stuff should be back-patched.
>
> Rebased as attached because of the conflicts from 811b6e3.
So... I have been doing a self-review of this patch after letting it
aside for a couple of weeks, and I did not spot any fundamental issue
wit hit. I have spotted two minor issues:
- {
- CloseTransientFile(fd);
- return NULL;
- }
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("incorrect size of two-phase state file \"%s\": %zu bytes",
+ path, stat.st_size)));
This needs to use errmsg_plural.
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted two-phase state file for \"%u\"",
This should use "for transaction %u".
As this is a data corruption issue, are there any objections if I patch
and back-patch? I also would like to get this stuff in first as I have
other refactoring work which would shave some more code.
--
Michael
On Fri, Aug 17, 2018 at 07:56:00AM +0900, Michael Paquier wrote: > As this is a data corruption issue, are there any objections if I patch > and back-patch? I also would like to get this stuff in first as I have > other refactoring work which would shave some more code. I looked at this patch again after letting it aside for a couple of weeks and pushed it. This is a potential, very rare data loss bug, and nobody complained about it so I have not done any back-patch. Please let me know if you would like to see something happen in stable branches as well. -- Michael
Сайт использует файлы cookie для корректной работы и повышения удобства. Нажимая кнопку «Принять» или продолжая пользоваться сайтом, вы соглашаетесь на их использование в соответствии с Политикой в отношении обработки cookie ООО «ППГ», в том числе на передачу данных из файлов cookie сторонним статистическим и рекламным службам. Вы можете управлять настройками cookie через параметры вашего браузера