Thread: Building with musl in CI and the build farm

Building with musl in CI and the build farm

From: Wolfgang Walther
Date:
The need to do $subject came up in [1]. Moving this to a separate 
discussion on -hackers, because there are more issues to solve than just 
the LD_LIBRARY_PATH problem.

Andres Freund:
> FWIW, except for one small issue, building postgres against musl works on
> debian and the tests pass if I install first.
> 
> 
> The small problem mentioned above is that on debian linux/fs.h isn't available
> when building with musl, which in turn causes src/bin/pg_upgrade/file.c to
> fail to compile.  I assume that's not the case on "fully musl" distro?

Correct, I have not seen this before on Alpine.

Here is my progress setting up a buildfarm animal to run on Alpine Linux 
and the issues I found, so far:

The animal runs in a docker container via GitHub Actions in [2]. Right 
now it's still running with --test, until I get the credentials to 
activate it.

I tried to enable everything (except systemd, because Alpine doesn't 
have it) and run all tests. The LDAP tests are failing right now, but 
that is likely something that I need to fix in the Dockerfile - it's 
failing to start the slapd, IIRC. There are other issues, though - all 
of them have open pull requests in that repo [3].

I also had to skip the recovery check. Andrew mentioned that he had to 
do that, too, when he was still running his animal on Alpine. Not sure 
what this is about, yet.

Building --with-icu fails two tests. One of them (001_initdb) is fixed 
by having the "locale" command in your PATH, which is not the case on 
Alpine by default. I assume this will not break on your debian/musl 
build, Andres - but it will also probably not return any sane values, 
because it will run glibc's locale command.
I haven't looked into that in detail, yet, but I think the other test 
(icu/010_database) fails because it expects that setlocale(LC_COLLATE, 
<illegal_value>) throws an error. I think it doesn't do that on musl, 
because LC_COLLATE is not implemented.
Those failing tests are not "just failing", but probably mean that we 
need to do something about how we deal with locale/setlocale on musl.
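
Whether a "locale" command is reachable at all can be checked before running the tests. This is only a minimal sketch; the helper name have_command is made up for illustration and is not part of the buildfarm client or the test suite. It says nothing about whether the command found belongs to musl or glibc.

```shell
# Hypothetical helper: succeeds if the named command can be found in PATH.
have_command() {
    command -v "$1" >/dev/null 2>&1
}

# Per the message above, 001_initdb passes once a "locale" command is in
# PATH, which is not the case on Alpine by default.
if have_command locale; then
    echo "locale found at: $(command -v locale)"
else
    echo "locale not found in PATH"
fi
```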

The last failure is about building --with-nls. This fails with something 
like:

ld: src/port/strerror.c:72:(.text+0x2d8): undefined reference to `libintl_gettext'

Of course, gettext-dev is installed, libintl.so is available in /usr/lib 
and it also contains the symbol. So not sure what's happening here.

Andres, did you build --with-icu and/or --with-nls on debian/musl? Did 
you run the recovery tests?

Best,

Wolfgang

[1]: https://postgr.es/m/fddd1cd6-dc16-40a2-9eb5-d7fef2101488%40technowledgy.de
[2]: https://github.com/technowledgy/postgresql-buildfarm-alpine/actions/workflows/run.yaml
[3]: https://github.com/technowledgy/postgresql-buildfarm-alpine/pulls



Re: Building with musl in CI and the build farm

From: walther@technowledgy.de
Date:
Here's an update on the progress to run musl (Alpine Linux) in the 
buildfarm.

Wolfgang Walther:
> The animal runs in a docker container via GitHub Actions in [2]. Right 
> now it's still running with --test, until I get the credentials to 
> activate it.

The animals have been activated and are reporting now. Thanks, Andrew!


> I tried to enable everything (except systemd, because Alpine doesn't 
> have it) and run all tests. The LDAP tests are failing right now, but 
> that is likely something that I need to fix in the Dockerfile - it's 
> failing to start the slapd, IIRC. There are other issues, though - all 
> of them have open pull requests in that repo [3].

The LDAP tests are enabled now; it was just a missing package.


> I also had to skip the recovery check. Andrew mentioned that he had to 
> do that, too, when he was still running his animal on Alpine. Not sure 
> what this is about, yet.

This was about a missing init process in the docker image. Without an 
init process reaping zombie processes, the recovery tests end up with 
some supposed-to-be-terminated backends still running and can't start 
them up again. Fixed by adding a minimal init process with "tini".
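
One way to see whether zombies are piling up inside a container is to count processes in state Z. The following is a rough sketch that assumes a Linux /proc layout; the function name count_zombies is hypothetical and this is not how the buildfarm setup diagnoses the problem.

```shell
# Hypothetical helper: count processes in state Z (zombie) by reading the
# "State:" line of each <proc_root>/<pid>/status file. The proc root is a
# parameter (default /proc) so the function can be tried on a fake tree.
count_zombies() {
    proc_root="${1:-/proc}"
    count=0
    for status_file in "$proc_root"/[0-9]*/status; do
        # Processes may exit while we iterate; skip unreadable entries.
        [ -r "$status_file" ] || continue
        if grep -q '^State:[[:space:]]*Z' "$status_file" 2>/dev/null; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

echo "zombie processes: $(count_zombies)"
```

A count that keeps growing during a test run would suggest that nothing in the container is reaping orphaned children.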


> Building --with-icu fails two tests. One of them (001_initdb) is fixed 
> by having the "locale" command in your PATH, which is not the case on 
> Alpine by default. I assume this will not break on your debian/musl 
> build, Andres - but it will also probably not return any sane values, 
> because it will run glibc's locale command.
> I haven't looked into that in detail, yet, but I think the other test 
> (icu/010_database) fails because it expects that setlocale(LC_COLLATE, 
> <illegal_value>) throws an error. I think it doesn't do that on musl, 
> because LC_COLLATE is not implemented.
> Those failing tests are not "just failing", but probably mean that we 
> need to do something about how we deal with locale/setlocale on musl.

I still need to look into this in depth.


> The last failure is about building --with-nls. This fails with something 
> like:
> 
> ld: src/port/strerror.c:72:(.text+0x2d8): undefined reference to `libintl_gettext'
> 
> Of course, gettext-dev is installed, libintl.so is available in /usr/lib 
> and it also contains the symbol. So not sure what's happening here.

This is an Alpine Linux packaging issue. Theoretically, it could be made 
to work by introducing some configure/meson flag like "--with-gettext" 
or so, to prefer gettext's libintl over the libc-builtin. However, NixOS 
/ nixpkgs with its pkgsMusl overlay manages to solve this issue just 
fine, builds with --enable-nls and gettext work. Thus, I conclude this 
is best solved upstream in Alpine Linux.

TLDR: The only real issue which is still open from PostgreSQL's side is 
around locales and ICU - certainly the pain point in musl. Will look 
into it further.

Best,

Wolfgang



Re: Building with musl in CI and the build farm

From: walther@technowledgy.de
Date:
About building one of the CI tasks with musl:

Andres Freund:
> I'd rather adapt one of the existing tasks, to avoid increasing CI costs unduly.

I looked into this and I think the only task that could be changed is 
the SanityCheck. This is because this builds without any additional 
features enabled. I guess that makes sense, because otherwise those 
dependencies would first have to be built with musl-gcc as well.


> FWIW, except for one small issue, building postgres against musl works on
> debian and the tests pass if I install first.

After the fix for LD_LIBRARY_PATH this now works as expected without 
installing first. I confirmed it works on debian with CC=musl-gcc.


> The small problem mentioned above is that on debian linux/fs.h isn't available
> when building with musl, which in turn causes src/bin/pg_upgrade/file.c to
> fail to compile.

According to [1], this can be worked around by symlinking a few directories:

ln -s /usr/include/linux /usr/include/x86_64-linux-musl/
ln -s /usr/include/asm-generic /usr/include/x86_64-linux-musl/
ln -s /usr/include/x86_64-linux-gnu/asm /usr/include/x86_64-linux-musl/
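
The same three links can be wrapped in a function that is safe to run more than once. This is a sketch only: the function name link_kernel_headers and its parameters are made up for illustration, the defaults are the Debian x86_64 paths from the commands above, and "ln -n" is a GNU/BSD/BusyBox option rather than POSIX.

```shell
# Hypothetical helper: make the kernel headers visible under the musl
# include directory. All three paths are parameters; the defaults are the
# Debian x86_64 locations from the commands above.
link_kernel_headers() {
    include_root="${1:-/usr/include}"
    gnu_triplet_dir="${2:-$include_root/x86_64-linux-gnu}"
    musl_dir="${3:-$include_root/x86_64-linux-musl}"

    [ -d "$musl_dir" ] || { echo "missing $musl_dir" >&2; return 1; }

    # ln -sfn replaces an existing link instead of nesting a new one in it,
    # so the function can be run repeatedly.
    ln -sfn "$include_root/linux" "$musl_dir/linux" &&
    ln -sfn "$include_root/asm-generic" "$musl_dir/asm-generic" &&
    ln -sfn "$gnu_triplet_dir/asm" "$musl_dir/asm"
}
```

Called without arguments (as root) it should have the same effect as the three commands above; passing a scratch directory as the first argument allows a dry run.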

Please find a patch to use musl-gcc in SanityCheck attached. Logs from 
the CI run are in [2]. It has this in the configure phase:

[13:19:52.712] Using 'CC' from environment with value: 'ccache musl-gcc'
[13:19:52.712] C compiler for the host machine: ccache musl-gcc (gcc 10.2.1 "cc (Debian 10.2.1-6) 10.2.1 20210110")
[13:19:52.712] C linker for the host machine: musl-gcc ld.bfd 2.35.2
[13:19:52.712] Using 'CC' from environment with value: 'ccache musl-gcc'

So meson picks up musl-gcc properly. I also double checked that without 
the links above, the build does indeed fail with the linux/fs.h error.
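
A quick way to probe for a header without running a full build is to ask the compiler's preprocessor directly. This is a sketch under assumptions: the function name header_available is hypothetical, and it relies on the compiler accepting "-E -x c -" (as gcc and clang do).

```shell
# Hypothetical helper: succeeds if the given compiler's preprocessor can
# find the given header. Usage: header_available <compiler> <header>
header_available() {
    compiler="$1"
    header="$2"
    # $compiler is deliberately unquoted so that a value such as
    # "ccache musl-gcc" is split into command and argument.
    printf '#include <%s>\n' "$header" |
        $compiler -E -x c - >/dev/null 2>&1
}
```

For example, `header_available musl-gcc linux/fs.h` would be expected to fail on debian before the links are created and to succeed afterwards.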

I assume the installation of musl-tools should be done in the 
pg-vm-images repo instead of the additional script here?

Best,

Wolfgang

[1]: https://debian-bugs-dist.debian.narkive.com/VlFkLigg/bug-789789-musl-fails-to-compile-stuff-that-depends-on-kernel-headers
[2]: https://cirrus-ci.com/task/5741892590108672

Re: Building with musl in CI and the build farm

From: Thomas Munro
Date:
On Wed, Mar 27, 2024 at 11:27 AM Wolfgang Walther
<walther@technowledgy.de> wrote:
> The animal runs in a docker container via GitHub Actions in [2].

Great idea :-)



Re: Building with musl in CI and the build farm

From: Peter Eisentraut
Date:
On 31.03.24 15:34, walther@technowledgy.de wrote:
>> I'd rather adapt one of the existing tasks, to avoid increasing CI 
>> costs unduly.
> 
> I looked into this and I think the only task that could be changed is 
> the SanityCheck.

I think SanityCheck should run a simple, "average" environment, like the 
current Debian one.  Otherwise, niche problems with musl or multi-arch 
or whatever will throw off the entire build pipeline.



Re: Building with musl in CI and the build farm

From: Wolfgang Walther
Date:
Peter Eisentraut:
> On 31.03.24 15:34, walther@technowledgy.de wrote:
>>> I'd rather adapt one of the existing tasks, to avoid increasing CI 
>>> costs unduly.
>>
>> I looked into this and I think the only task that could be changed is 
>> the SanityCheck.
> 
> I think SanityCheck should run a simple, "average" environment, like the 
> current Debian one.  Otherwise, niche problems with musl or multi-arch 
> or whatever will throw off the entire build pipeline.

All the errors/problems I have seen so far, while setting up the 
buildfarm animal on Alpine Linux, have been way beyond what SanityCheck 
does. Problems only appeared in the test suites, of which SanityCheck 
only runs *very* basic ones. I don't have much experience with the 
"cross" setup that "musl on debian" essentially is, though.

All those things are certainly out of scope for CI - they are tested in 
the build farm instead.

I do agree: SanityCheck doesn't feel like the right place to put this. 
But on the other side.. if it really fails to *build* with musl, then it 
shouldn't make a difference whether you will be notified about that 
immediately or later in the CI pipeline. It certainly needs the fewest 
additional resources to put it there.

I'm not sure what Andres meant by "adapting one of the existing 
tasks". It could fit as another step into the "Linux - Debian Bullseye - 
Autoconf" task, too. A bit similar to how the meson task builds for 32 
and 64 bit. This would still not be an entirely new task like I proposed 
initially (to run in Docker).

Best,

Wolfgang



Re: Building with musl in CI and the build farm

From: Tom Lane
Date:
Wolfgang Walther <walther@technowledgy.de> writes:
> Peter Eisentraut:
>> I think SanityCheck should run a simple, "average" environment, like the 
>> current Debian one.  Otherwise, niche problems with musl or multi-arch 
>> or whatever will throw off the entire build pipeline.

> I do agree: SanityCheck doesn't feel like the right place to put this. 
> But on the other side.. if it really fails to *build* with musl, then it 
> shouldn't make a difference whether you will be notified about that 
> immediately or later in the CI pipeline. It certainly needs the fewest 
> additional resources to put it there.

That is not the concern here.  What I think Peter is worried about,
and certainly what I'm worried about, is that a breakage in
SanityCheck comprehensively breaks all CI testing for all Postgres
developers.  One buildfarm member that's failing does not halt
progress altogether, so it's not even in the same ballpark of
being as critical.  So I agree with Peter that SanityCheck had
better use a very common, vanilla environment.

To be blunt, I do not think we need to test musl in the CI pipeline.
I see it as one of the niche platforms that the buildfarm exists
to test.

            regards, tom lane



Re: Building with musl in CI and the build farm

From: Wolfgang Walther
Date:
Tom Lane:
> That is not the concern here.  What I think Peter is worried about,
> and certainly what I'm worried about, is that a breakage in
> SanityCheck comprehensively breaks all CI testing for all Postgres
> developers.

You'd have to commit a failing patch first to break CI for all other 
developers. If you're only going to commit patches that pass those CI 
tasks, then this is not going to happen. Then it only becomes a question 
of how much feedback *you* get from a single CI run of your own patch.

> To be blunt, I do not think we need to test musl in the CI pipeline.
> I see it as one of the niche platforms that the buildfarm exists
> to test.

I don't really have an opinion on this. I'm fine with having musl in the 
buildfarm only. I don't expect the core build itself to fail with musl 
anyway, this has been working fine for years. Andres asked for it to be 
added to CI, so maybe he sees more value on top of just "building with 
musl"?

Best,

Wolfgang



Re: Building with musl in CI and the build farm

From: Tom Lane
Date:
Wolfgang Walther <walther@technowledgy.de> writes:
>> That is not the concern here.  What I think Peter is worried about,
>> and certainly what I'm worried about, is that a breakage in
>> SanityCheck comprehensively breaks all CI testing for all Postgres
>> developers.

> You'd have to commit a failing patch first to break CI for all other 
> developers.

No, what I'm more worried about is some change in the environment
causing the build to start failing.  When that happens, it'd better
be an environment that many of us are familiar with and can test/fix.

            regards, tom lane



Re: Building with musl in CI and the build farm

From: Wolfgang Walther
Date:
Tom Lane:
>> You'd have to commit a failing patch first to break CI for all other
>> developers.
> 
> No, what I'm more worried about is some change in the environment
> causing the build to start failing.  When that happens, it'd better
> be an environment that many of us are familiar with and can test/fix.

As I understand it, the images for the VMs in which those CI tasks run 
are not just dynamically updated, but are actually tested before they 
are used in CI. So the environment doesn't just change suddenly.

See e.g. [1] for a pull request to the repo containing those images to 
update the linux debian image from bullseye to bookworm. This is exactly 
the image we're talking about. Before this image is used in postgres CI, 
it's tested and confirmed that it actually works there. If one of the 
jobs was using musl - that would be tested as well. So this job would 
not just suddenly start failing for everybody.

I do see the "familiarity" argument for the SanityCheck task, but for a 
different reason: Even though it's unlikely for this job to fail for 
musl specific reasons - if you're not familiar with musl and can't 
easily test it locally, you might not be able to tell immediately 
whether it's musl specific or not. If musl were run in one of the later 
jobs, it would be quite different: You see all tests failing - alright, not musl 
specific. You see only the musl test failing - yeah, musl problem. This 
should give developers much more confidence looking at the results.

Best,

Wolfgang

[1]: https://github.com/anarazel/pg-vm-images/pull/91