Re: buildfarm animals and 'snapshot too old'

From Tomas Vondra
Subject Re: buildfarm animals and 'snapshot too old'
Date
Msg-id 537A797D.1020307@fuzzy.cz
In reply to Re: buildfarm animals and 'snapshot too old'  (Andrew Dunstan <andrew@dunslane.net>)
Responses Re: buildfarm animals and 'snapshot too old'
List pgsql-hackers
On 19.5.2014 23:04, Andrew Dunstan wrote:
>
> On 05/19/2014 03:40 PM, Tomas Vondra wrote:
>> On 17.5.2014 22:35, Tomas Vondra wrote:
>>> On 17.5.2014 19:58, Andrew Dunstan wrote:
>>>> On 05/15/2014 07:47 PM, Tomas Vondra wrote:
>>>>> On 15.5.2014 22:07, Andrew Dunstan wrote:
>>>>>> Yes, I've seen that. Frankly, a test that takes something like 500
>>>>>> hours is a bit crazy.
>>>>> Maybe. It certainly is not a test people will use during development.
>>>>> But if it can detect some hard-to-find errors in the code, that might
>>>>> possibly lead to serious problems, then +1 from me to run them at least
>>>>> on one animal. 500 hours is ~3 weeks, which is not that bad IMHO.
>>>>>
>>>>> Also, once you know where it fails the developer can run just that
>>>>> single test (which might take minutes/hours, but not days).
>>>>
>>>>
>>>> I have made a change that omits the snapshot sanity check for
>>>> CLOBBER_CACHE_RECURSIVELY cases, but keeps it for all others. See
>>>> <https://github.com/PGBuildFarm/server-code/commit/abd946918279b7683056a4fc3156415ef31a4675>
>>>>
>>> OK, thanks. Seems reasonable.
>> Seems we're still running into this on the CLOBBER_CACHE_ALWAYS animals.
>> The problem is that the git mirror is refreshed only at the very
>> beginning, and while a single branch does not exceed the limit, all the
>> branches do.
>>
>> Could this be solved by keeping a local mirror, without a mirror in the
>> build root? I mean, something like this:
>>
>>      git_keep_mirror => 0,
>>      scmrepo => '/path/to/local/mirror'
>>
>> And of course a cron script updating the mirror every hour or so.
>>
>
> No, the git mirror should be refreshed at the start of each branch
> build. It's not done at all by run_branches.pl. So your premise is
> false. This should only happen if the actual run takes more than 24 hours.

OK. I think I understand what's wrong. Here is a summary of the log from
'leech' (the full log is attached):

========================================================================
Sat May 17 23:43:24 2014: buildfarm run for leech:REL8_4_STABLE starting
[23:43:24] checking out source ...
...
[04:32:47] OK
Sun May 18 04:32:51 2014: buildfarm run for leech:REL9_0_STABLE starting
[04:32:51] checking out source ...
...
[08:58:57] OK
Sun May 18 08:59:01 2014: buildfarm run for leech:REL9_1_STABLE starting
[08:59:01] checking out source ...
...
[14:09:08] OK
Sun May 18 14:09:12 2014: buildfarm run for leech:REL9_2_STABLE starting
[14:09:12] checking out source ...
...
[00:13:59] OK
Mon May 19 00:14:04 2014: buildfarm run for leech:REL9_3_STABLE starting
[00:14:04] checking out source ...
...
[14:26:29] OK
Query for: stage=OK&animal=leech&ts=1400451244
Target:
http://www.pgbuildfarm.org/cgi-bin/pgstatus.pl/28116b975a4186275d83f7e0f5c3fc92b1a75e85
Status Line: 493 snapshot too old: Fri May 16 07:05:50 2014 GMT
Content:
snapshot to old: Fri May 16 07:05:50 2014 GMT

Web txn failed with status: 1
Mon May 19 14:26:31 2014: buildfarm run for leech:HEAD starting
[14:26:31] checking out source ...
...
[20:07:23] checking test-decoding
Buildfarm member leech failed on HEAD stage test-decoding-check
========================================================================

So the REL9_3_STABLE run starts at 00:14 and completes at 14:26 (i.e. ~14h
of runtime), but fails with 'snapshot too old'.
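
To make the arithmetic explicit, here is a small sketch that derives the
per-branch runtimes from the "starting" lines in the summary above. It
approximates each runtime as the gap to the next branch's start, so it
includes the few seconds spent uploading the result. The names STARTS and
branch_runtimes are made up for this illustration; only the timestamps come
from the log.

```python
from datetime import datetime, timedelta, timezone

# "starting" timestamps copied from the log summary (leech's local time).
STARTS = [
    ("REL8_4_STABLE", "2014-05-17 23:43:24"),
    ("REL9_0_STABLE", "2014-05-18 04:32:51"),
    ("REL9_1_STABLE", "2014-05-18 08:59:01"),
    ("REL9_2_STABLE", "2014-05-18 14:09:12"),
    ("REL9_3_STABLE", "2014-05-19 00:14:04"),
    ("HEAD",          "2014-05-19 14:26:31"),
]
FMT = "%Y-%m-%d %H:%M:%S"


def branch_runtimes(entries):
    """Approximate each branch's runtime as the gap to the next branch's start."""
    times = [(name, datetime.strptime(ts, FMT)) for name, ts in entries]
    return {
        name: nxt - start
        for (name, start), (_, nxt) in zip(times, times[1:])
    }


runtimes = branch_runtimes(STARTS)
total = sum(runtimes.values(), timedelta())

for name, delta in runtimes.items():
    print(f"{name:<14} {delta}")
print(f"{'all branches':<14} {total}")

# The ts value from the "Query for:" line, decoded as UTC.
ts_utc = datetime.fromtimestamp(1400451244, tz=timezone.utc)
print(ts_utc.isoformat())
```

No single branch takes more than 24 hours, but the five stable branches
together do. The decoded ts matches the 00:14:04 start of the REL9_3_STABLE
run if leech's clock is at UTC+2, which is an assumption on my part.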

These are the commits for each branch:

REL8_4_STABLE => 1dd0b3eeccaffd33b9c970a91c53fe42692ce8c2 (May 15, 2014)
REL9_0_STABLE => 0fc94340753f19da8acca5fc53039adbf2fa3632 (May 16, 2014)
REL9_1_STABLE => 39b3739c05688b5cd5d5da8c52fa5476304eff11 (May 16, 2014)
REL9_2_STABLE => 0d4c75f4de68012bb6f3bc52ebb58234334259d2 (May 16, 2014)
REL9_3_STABLE => d6a9767404cfee7f037a58e445b601af5837e4a5 (May 16, 2014)
HEAD          => f097d70b7227c1f9aa2ed0af1d6deb05367ef657 (May 19, 2014)

IMHO the problem is that d6a97674 was the latest revision on the
REL9_3_STABLE branch when the run started (00:14), but 777d07d7 was
committed at 06:06, while the run was still in progress. So the check at
the end failed, because the tested revision was no longer the newest one
and was suddenly ~2 days over the limit.

This seems wrong to me, because even a very fast run that overlaps the
commit (e.g. starting at 06:00 and finishing at 06:10) would fail in
exactly the same way.
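
One way to picture this hypothesis is the toy model below. It is NOT the
buildfarm server's code: the function accepts(), the 24-hour LIMIT (taken
from the "more than 24 hours" remark quoted above) and the rule itself are
assumptions made only to illustrate the suspected behaviour. I also assume
the timestamp in the status line is the commit time of d6a97674, and I
ignore timezone offsets because they do not change the outcome.

```python
from datetime import datetime, timedelta

LIMIT = timedelta(hours=24)  # assumed limit, not taken from the server code


def accepts(tested_commit, branch_tip, report_time, limit=LIMIT):
    """Hypothetical rule: a result for the current branch tip is always fresh;
    a result for a superseded commit is accepted only if that commit is recent."""
    if tested_commit >= branch_tip:
        return True
    return report_time - tested_commit <= limit


# Commit times; timezone offsets are ignored in this illustration.
d6a97674 = datetime(2014, 5, 16, 7, 5, 50)   # tested revision (assumed commit time)
c777d07d7 = datetime(2014, 5, 19, 6, 6, 0)   # committed while the run was in progress

# The actual ~14h run, reporting at 14:26:29 after the new commit landed.
slow_run = accepts(d6a97674, c777d07d7, datetime(2014, 5, 19, 14, 26, 29))

# A hypothetical 10-minute run (06:00 to 06:10) that overlaps the same commit.
fast_run = accepts(d6a97674, c777d07d7, datetime(2014, 5, 19, 6, 10, 0))

# The same ~14h run if nothing had been committed in the meantime.
no_commit = accepts(d6a97674, d6a97674, datetime(2014, 5, 19, 14, 26, 29))

print(slow_run, fast_run, no_commit)
```

Under this assumed rule the outcome depends on whether a commit lands during
the run, not on how long the run takes, which is the behaviour I find
suspicious.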

This is more likely on the old stable branches, because commits there are
infrequent (on HEAD, commits are usually less than a few hours apart, so a
new one won't make the previous one too old). A long runtime also makes it
more likely, because it increases the probability that something gets
committed to the branch during the run. And it makes the failure more
"expensive", because it effectively throws all the CPU time to /dev/null.

regards
Tomas
