remap the .text segment into huge pages at run time
From | John Naylor
---|---
Subject | remap the .text segment into huge pages at run time
Date |
Msg-id | CAFBsxsHx9z45MfsAjELFiPv_kcgCcH_P5jNa=WaeGxO7HU3mag@mail.gmail.com
Replies | Re: remap the .text segment into huge pages at run time
List | pgsql-hackers
It's been known for a while that Postgres spends a lot of time translating instruction addresses, and using huge pages in the text segment yields a substantial performance boost in OLTP workloads [1][2]. The difficulty is, this normally requires a lot of painstaking work (unless your OS does superpage promotion, like FreeBSD).
I found an MIT-licensed library "iodlr" from Intel [3] that allows one to remap the .text segment to huge pages at program start. Attached is a hackish, Meson-only, "works on my machine" patchset to experiment with this idea.
0001 adapts the library to our error logging and GUC system. The overview:
- read ELF info to get the start/end addresses of the .text segment
- calculate addresses therein aligned at huge page boundaries
- mmap a temporary region and memcpy the aligned portion of the .text segment
- mmap aligned start address to a second region with huge pages and MAP_FIXED
- memcpy over from the temp region and revoke the PROT_WRITE bit
The reason this doesn't "saw off the branch you're standing on" is that the remapping is done in a function that's forced to live in a different segment, and doesn't call any non-libc functions living elsewhere:
static void
__attribute__((__section__("lpstub")))
__attribute__((__noinline__))
MoveRegionToLargePages(const mem_range * r, int mmap_flags)
Debug messages show
2022-11-02 12:02:31.064 +07 [26955] DEBUG: .text start: 0x487540
2022-11-02 12:02:31.064 +07 [26955] DEBUG: .text end: 0x96cf12
2022-11-02 12:02:31.064 +07 [26955] DEBUG: aligned .text start: 0x600000
2022-11-02 12:02:31.064 +07 [26955] DEBUG: aligned .text end: 0x800000
2022-11-02 12:02:31.066 +07 [26955] DEBUG: binary mapped to huge pages
2022-11-02 12:02:31.066 +07 [26955] DEBUG: un-mmapping temporary code region
Here, out of 5MB of Postgres text, only one huge page can be used, but that still saves 512 TLB entries and might bring a small improvement. The un-remapped region below 0x600000 contains the ~600kB of "cold" code, since the linker puts the cold section first, at least with recent versions of ld and lld.
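To spell out where the aligned addresses in the debug output come from: the usable region is found by rounding the .text start up, and the .text end down, to the 2MB huge page size. The helper names below are mine, but the arithmetic reproduces the log values above (0x487540 rounds up to 0x600000, 0x96cf12 rounds down to 0x800000, leaving exactly one 2MB page, which would otherwise take 2MB / 4kB = 512 base-page TLB entries).

```c
#include <stdint.h>

#define HUGE_PAGE_SIZE	((uintptr_t) 2 * 1024 * 1024)

/* round an address up to the next 2MB boundary */
static uintptr_t
align_up(uintptr_t addr)
{
	return (addr + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1);
}

/* round an address down to the previous 2MB boundary */
static uintptr_t
align_down(uintptr_t addr)
{
	return addr & ~(HUGE_PAGE_SIZE - 1);
}
```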
0002 is my attempt to force the linker's hand and get the entire text segment mapped to huge pages. It's quite a finicky hack, and easily broken (see below). That said, it still builds easily within our normal build process, and maybe there is a better way to get the effect.
It does two things:
- Pass the linker -Wl,-zcommon-page-size=2097152 -Wl,-zmax-page-size=2097152, which aligns .init to a 2MB boundary. That's done for predictability, but it means the next 2MB boundary is very nearly 2MB away.
- Add a "cold" __asm__ filler function that just takes up space, enough to push the end of the .text segment over the next aligned boundary, or to ~8MB in size.
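As a rough illustration of the filler idea (the patch uses an __asm__ function; the section name, symbol name, and size below are made up for this sketch), one way to make the linker pad the text segment out is to drop a large named blob into a .text.* section, which default linker scripts fold into .text:

```c
/* Hypothetical filler: 64kB of int3 padding placed in the text segment.
 * The real filler would be sized to push .text past the next 2MB boundary. */
__attribute__((__section__(".text.pg_filler"), __used__, __aligned__(16)))
const unsigned char pg_hugepage_filler[64 * 1024] = {0xcc};
```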
In a non-assert build:
0001:
$ bloaty inst-perf/bin/postgres
FILE SIZE VM SIZE
-------------- --------------
53.7% 4.90Mi 58.7% 4.90Mi .text
...
100.0% 9.12Mi 100.0% 8.35Mi TOTAL
$ readelf -S --wide inst-perf/bin/postgres
[Nr] Name Type Address Off Size ES Flg Lk Inf Al
...
[12] .init PROGBITS 0000000000486000 086000 00001b 00 AX 0 0 4
[13] .plt PROGBITS 0000000000486020 086020 001520 10 AX 0 0 16
[14] .text PROGBITS 0000000000487540 087540 4e59d2 00 AX 0 0 16
...
0002:
$ bloaty inst-perf/bin/postgres
FILE SIZE VM SIZE
-------------- --------------
46.9% 8.00Mi 69.9% 8.00Mi .text
...
100.0% 17.1Mi 100.0% 11.4Mi TOTAL
$ readelf -S --wide inst-perf/bin/postgres
[Nr] Name Type Address Off Size ES Flg Lk Inf Al
...
[12] .init PROGBITS 0000000000600000 200000 00001b 00 AX 0 0 4
[13] .plt PROGBITS 0000000000600020 200020 001520 10 AX 0 0 16
[14] .text PROGBITS 0000000000601540 201540 7ff512 00 AX 0 0 16
...
Debug messages with 0002 show 6MB mapped:
2022-11-02 12:35:28.482 +07 [28530] DEBUG: .text start: 0x601540
2022-11-02 12:35:28.482 +07 [28530] DEBUG: .text end: 0xe00a52
2022-11-02 12:35:28.482 +07 [28530] DEBUG: aligned .text start: 0x800000
2022-11-02 12:35:28.482 +07 [28530] DEBUG: aligned .text end: 0xe00000
2022-11-02 12:35:28.486 +07 [28530] DEBUG: binary mapped to huge pages
2022-11-02 12:35:28.486 +07 [28530] DEBUG: un-mmapping temporary code region
Since the front is all-cold, and there is very little at the end, practically all hot pages are now remapped. The biggest problem with the hackish filler function (in addition to maintainability) is that, if explicit huge pages are turned off in the kernel, attempting mmap() with MAP_HUGETLB causes complete startup failure when the .text segment is larger than 8MB. I haven't looked into what's happening there yet, but I didn't want to get too far into the weeds before getting feedback on whether the entire approach in this thread is sound enough to justify further work.
[1] https://www.cs.rochester.edu/u/sandhya/papers/ispass19.pdf
(paper: "On the Impact of Instruction Address Translation Overhead")
[2] https://twitter.com/AndresFreundTec/status/1214305610172289024
[3] https://github.com/intel/iodlr