Discussion: [PATCH] Add support for choosing huge page size


[PATCH] Add support for choosing huge page size

From: Odin Ugedal
Date:
This adds support for using non-default huge page sizes for shared
memory. This is achieved via the new "huge_page_size" config entry.
The config value defaults to 0, meaning it will use the system default.
---

This would be very helpful when running in Kubernetes, since nodes may
support multiple huge page sizes and have pre-allocated huge page memory
for each size. This lets the user select a huge page size without having
to change the default huge page size on the node. It will also be
useful when benchmarking with different huge page sizes, since it
doesn't require a full system reboot.

Since the default value of the new config is 0 (resulting in using the
default huge page size) this should be backwards compatible with old
configs.

Feel free to comment on the phrasing (both in docs and code) and on the
overall change.

 doc/src/sgml/config.sgml                      | 25 ++++++
 doc/src/sgml/runtime.sgml                     | 41 +++++----
 src/backend/port/sysv_shmem.c                 | 88 ++++++++++++-------
 src/backend/utils/misc/guc.c                  | 11 +++
 src/backend/utils/misc/postgresql.conf.sample |  2 +
 src/include/storage/pg_shmem.h                |  1 +
 6 files changed, 120 insertions(+), 48 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index aca8f73a50..6177b819ce 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1582,6 +1582,31 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-huge-page-size" xreflabel="huge_page_size">
+      <term><varname>huge_page_size</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>huge_page_size</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Controls what size of huge pages is used in conjunction with
+        <xref linkend="guc-huge-pages"/>.
+        The default is zero (<literal>0</literal>).
+        When set to <literal>0</literal>, the default huge page size on the system will
+        be used.
+       </para>
+       <para>
+        Most modern linux systems support <literal>2MB</literal> and <literal>1GB</literal>
+        huge pages, and some architectures supports other sizes as well. For more information
+        on how to check for support and usage, see <xref linkend="linux-huge-pages"/>.
+       </para>
+       <para>
+        Controling huge page size is not supported on Windows.  
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-temp-buffers" xreflabel="temp_buffers">
       <term><varname>temp_buffers</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml
index 88210c4a5d..cbdbcb4fdf 100644
--- a/doc/src/sgml/runtime.sgml
+++ b/doc/src/sgml/runtime.sgml
@@ -1391,41 +1391,50 @@ export PG_OOM_ADJUST_VALUE=0
     using large values of <xref linkend="guc-shared-buffers"/>.  To use this
     feature in <productname>PostgreSQL</productname> you need a kernel
     with <varname>CONFIG_HUGETLBFS=y</varname> and
-    <varname>CONFIG_HUGETLB_PAGE=y</varname>. You will also have to adjust
-    the kernel setting <varname>vm.nr_hugepages</varname>. To estimate the
-    number of huge pages needed, start <productname>PostgreSQL</productname>
-    without huge pages enabled and check the
-    postmaster's anonymous shared memory segment size, as well as the system's
-    huge page size, using the <filename>/proc</filename> file system.  This might
-    look like:
+    <varname>CONFIG_HUGETLB_PAGE=y</varname>. You will also have to pre-allocate
+    huge pages with the the desired huge page size. To estimate the number of
+    huge pages needed, start <productname>PostgreSQL</productname> without huge
+    pages enabled and check the postmaster's anonymous shared memory segment size,
+    as well as the system's supported huge page sizes, using the
+    <filename>/sys</filename> file system.  This might look like:
 <programlisting>
 $ <userinput>head -1 $PGDATA/postmaster.pid</userinput>
 4170
 $ <userinput>pmap 4170 | awk '/rw-s/ && /zero/ {print $2}'</userinput>
 6490428K
+$ <userinput>ls /sys/kernel/mm/hugepages</userinput>
+hugepages-1048576kB  hugepages-2048kB
+</programlisting>
+
+     You can now choose between the supported sizes, 2MiB and 1GiB in this case.
+     By default <productname>PostgreSQL</productname> will use the default huge
+     page size on the system, but that can be configured via
+     <xref linkend="guc-huge-page-size"/>.
+     The default huge page size can be found with:
+<programlisting>
 $ <userinput>grep ^Hugepagesize /proc/meminfo</userinput>
 Hugepagesize:       2048 kB
 </programlisting>
+
+     For <literal>2MiB</literal>,
      <literal>6490428</literal> / <literal>2048</literal> gives approximately
      <literal>3169.154</literal>, so in this example we need at
      least <literal>3170</literal> huge pages, which we can set with:
 <programlisting>
-$ <userinput>sysctl -w vm.nr_hugepages=3170</userinput>
+$ <userinput>echo 3170 | tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages</userinput>
 </programlisting>
     A larger setting would be appropriate if other programs on the machine
-    also need huge pages.  Don't forget to add this setting
-    to <filename>/etc/sysctl.conf</filename> so that it will be reapplied
-    after reboots.
+    also need huge pages. It is also possible to pre allocate huge pages on boot
+    by adding the kernel parameters <literal>hugepagesz=2M hugepages=3170</literal>.
    </para>
 
    <para>
     Sometimes the kernel is not able to allocate the desired number of huge
-    pages immediately, so it might be necessary to repeat the command or to
-    reboot.  (Immediately after a reboot, most of the machine's memory
-    should be available to convert into huge pages.)  To verify the huge
-    page allocation situation, use:
+    pages immediately due to external fragmentation, so it might be necessary to
+    repeat the command or to reboot. To verify the huge page allocation situation
+    for a given size, use:
 <programlisting>
-$ <userinput>grep Huge /proc/meminfo</userinput>
+$ <userinput>cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages</userinput>
 </programlisting>
    </para>
 
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 198a6985bf..56419417dc 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -32,6 +32,7 @@
 #endif
 
 #include "miscadmin.h"
+#include "port/pg_bitutils.h"
 #include "portability/mem.h"
 #include "storage/dsm.h"
 #include "storage/fd.h"
@@ -466,53 +467,76 @@ PGSharedMemoryAttach(IpcMemoryId shmId,
  *
  * Returns the (real or assumed) page size into *hugepagesize,
  * and the hugepage-related mmap flags to use into *mmap_flags.
- *
- * Currently *mmap_flags is always just MAP_HUGETLB.  Someday, on systems
- * that support it, we might OR in additional bits to specify a particular
- * non-default huge page size.
  */
+
+
 static void
 GetHugePageSize(Size *hugepagesize, int *mmap_flags)
 {
-    /*
-     * If we fail to find out the system's default huge page size, assume it
-     * is 2MB.  This will work fine when the actual size is less.  If it's
-     * more, we might get mmap() or munmap() failures due to unaligned
-     * requests; but at this writing, there are no reports of any non-Linux
-     * systems being picky about that.
-     */
-    *hugepagesize = 2 * 1024 * 1024;
-    *mmap_flags = MAP_HUGETLB;
+    if (huge_page_size != 0)
+    {
+        /* If huge page size is provided in in config we use that size */
+        *hugepagesize = (Size) huge_page_size * 1024;
+    }
+    else
+    {
+        /*
+         * If we fail to find out the system's default huge page size, or no
+         * huge page size is provided in config, assume it is 2MB. This will
+         * work fine when the actual size is less.  If it's more, we might get
+         * mmap() or munmap() failures due to unaligned requests; but at this
+         * writing, there are no reports of any non-Linux systems being picky
+         * about that.
+         */
+        *hugepagesize = 2 * 1024 * 1024;
 
-    /*
-     * System-dependent code to find out the default huge page size.
-     *
-     * On Linux, read /proc/meminfo looking for a line like "Hugepagesize:
-     * nnnn kB".  Ignore any failures, falling back to the preset default.
-     */
+        /*
+         * System-dependent code to find out the default huge page size.
+         *
+         * On Linux, read /proc/meminfo looking for a line like "Hugepagesize:
+         * nnnn kB".  Ignore any failures, falling back to the preset default.
+         */
 #ifdef __linux__
-    {
-        FILE       *fp = AllocateFile("/proc/meminfo", "r");
-        char        buf[128];
-        unsigned int sz;
-        char        ch;
 
-        if (fp)
         {
-            while (fgets(buf, sizeof(buf), fp))
+            FILE       *fp = AllocateFile("/proc/meminfo", "r");
+            char        buf[128];
+            unsigned int sz;
+            char        ch;
+
+            if (fp)
             {
-                if (sscanf(buf, "Hugepagesize: %u %c", &sz, &ch) == 2)
+                while (fgets(buf, sizeof(buf), fp))
                 {
-                    if (ch == 'k')
+                    if (sscanf(buf, "Hugepagesize: %u %c", &sz, &ch) == 2)
                     {
-                        *hugepagesize = sz * (Size) 1024;
-                        break;
+                        if (ch == 'k')
+                        {
+                            *hugepagesize = sz * (Size) 1024;
+                            break;
+                        }
+                        /* We could accept other units besides kB, if needed */
                     }
-                    /* We could accept other units besides kB, if needed */
                 }
+                FreeFile(fp);
             }
-            FreeFile(fp);
         }
+#endif                            /* __linux__ */
+    }
+
+    *mmap_flags = MAP_HUGETLB;
+
+    /*
+     * System-dependent code to configure mmap_flags.
+     *
+     * On Linux, configure flags to include page size, since default huge page
+     * size will be used in case no size is provided.
+     */
+#ifdef __linux__
+    {
+        int            shift = pg_ceil_log2_64(*hugepagesize);
+
+        *mmap_flags |= (shift & MAP_HUGE_MASK) << MAP_HUGE_SHIFT;
     }
 #endif                            /* __linux__ */
 }
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 2f3e0a70e0..b482c660cf 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -585,6 +585,7 @@ int            ssl_renegotiation_limit;
  * need to be duplicated in all the different implementations of pg_shmem.c.
  */
 int            huge_pages;
+int            huge_page_size;
 
 /*
  * These variables are all dummies that don't do anything, except in some
@@ -2269,6 +2270,16 @@ static struct config_int ConfigureNamesInt[] =
         1024, 16, INT_MAX / 2,
         NULL, NULL, NULL
     },
+    {
+        {"huge_page_size", PGC_POSTMASTER, RESOURCES_MEM,
+            gettext_noop("The size of huge page that should be used."),
+            NULL,
+            GUC_UNIT_KB
+        },
+        &huge_page_size,
+        0, 0, INT_MAX,
+        NULL, NULL, NULL
+    },
 
     {
         {"temp_buffers", PGC_USERSET, RESOURCES_MEM,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ac02bd0c00..750d3f6245 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -122,6 +122,8 @@
                     # (change requires restart)
 #huge_pages = try            # on, off, or try
                     # (change requires restart)
+#huge_page_size = 0            # use defualt huge page size when set to zero
+                    # (change requires restart)
 #temp_buffers = 8MB            # min 800kB
 #max_prepared_transactions = 0        # zero disables the feature
                     # (change requires restart)
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 0de26b3427..9992932a00 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -44,6 +44,7 @@ typedef struct PGShmemHeader    /* standard header for all Postgres shmem */
 /* GUC variables */
 extern int    shared_memory_type;
 extern int    huge_pages;
+extern int    huge_page_size;
 
 /* Possible values for huge_pages */
 typedef enum
-- 
2.27.0




Re: [PATCH] Add support for choosing huge page size

From: Thomas Munro
Date:
On Tue, Jun 9, 2020 at 4:13 AM Odin Ugedal <odin@ugedal.com> wrote:
> This adds support for using non-default huge page sizes for shared
> memory. This is achieved via the new "huge_page_size" config entry.
> The config value defaults to 0, meaning it will use the system default.
> ---
>
> This would be very helpful when running in Kubernetes, since nodes may
> support multiple huge page sizes and have pre-allocated huge page memory
> for each size. This lets the user select a huge page size without having
> to change the default huge page size on the node. It will also be
> useful when benchmarking with different huge page sizes, since it
> doesn't require a full system reboot.

+1

> Since the default value of the new config is 0 (resulting in using the
> default huge page size) this should be backwards compatible with old
> configs.

+1

> Feel free to comment on the phrasing (both in docs and code) and on the
> overall change.

This change seems good to me, because it will make testing easier and
certain mixed page size configurations possible.  I haven't tried your
patch yet; I'll take it for a spin when I'm benchmarking some other
relevant stuff soon.

Minor comments on wording:

> +       <para>
> +        Most modern linux systems support <literal>2MB</literal> and <literal>1GB</literal>
> +        huge pages, and some architectures supports other sizes as well. For more information
> +        on how to check for support and usage, see <xref linkend="linux-huge-pages"/>.

Linux with a capital L.  Hmm, I don't especially like saying "Most
modern Linux systems" as code for Intel.  I wonder if we should
instead say something like: "Some commonly available page sizes on
modern 64 bit server architectures include: <literal>2MB<literal> and
<literal>1GB</literal> (Intel and AMD), <literal>16MB</literal> and
<literal>16GB</literal> (IBM POWER), and ... (ARM)."

> +       </para>
> +       <para>
> +        Controling huge page size is not supported on Windows.

Controlling

Just by the way, some googling is telling me that very recent versions
of Windows *could* support this (search keywords:
"NtAllocateVirtualMemoryEx 1GB"), so that could be a project for
someone who understands Windows to look into later.



Re: [PATCH] Add support for choosing huge page size

From: Odin Ugedal
Date:
Hi,

Thank you so much for the feedback David and Thomas!

Attached is v2 of the patch, updated with the comments from Thomas (again,
thanks). I also changed the mmap flags to only set the size if the
selected huge page size is not the default one (on Linux). Support for
this was added in Linux 3.8, so it is not available on older kernels.
Should we add that to the docs, or what do you think? The definitions
of MAP_HUGE_MASK and MAP_HUGE_SHIFT were also added in Linux 3.8, but
since they are part of libc/musl and are "used" at compile time, that
shouldn't be a problem, should it?

If a huge page size that is not supported on the system is chosen via
huge_page_size (and huge_pages = on), it will result in "FATAL:  could
not map anonymous shared memory: Invalid argument". This is the same
as what happens today when huge pages aren't supported at all, so I guess
it is ok for now (and we can consider verifying that the size is
supported at a later stage).

Also, thanks for the information about Windows. I have been searching
for info on huge pages on Windows and "superpages" on BSD, without much
luck. I only have experience on Linux, so I think we can do as you said
and let someone else look at it. :)

Odin

Attachments

Re: [PATCH] Add support for choosing huge page size

From: Thomas Munro
Date:
On Wed, Jun 10, 2020 at 2:24 AM Odin Ugedal <odin@ugedal.com> wrote:
> Attached is v2 of the patch, updated with the comments from Thomas (again,
> thanks). I also changed the mmap flags to only set the size if the
> selected huge page size is not the default one (on Linux). Support for
> this was added in Linux 3.8, so it is not available on older kernels.
> Should we add that to the docs, or what do you think? The definitions
> of MAP_HUGE_MASK and MAP_HUGE_SHIFT were also added in Linux 3.8, but
> since they are part of libc/musl and are "used" at compile time, that
> shouldn't be a problem, should it?

Oh, so maybe we need a configure test for them?  And if you don't have
it, a runtime error if you try to set the page size to something other
than 0 (like we do for effective_io_concurrency if you don't have a
posix_fadvise() function).

> If a huge page size that is not supported on the system is chosen via
> huge_page_size (and huge_pages = on), it will result in "FATAL:  could
> not map anonymous shared memory: Invalid argument". This is the same
> as what happens today when huge pages aren't supported at all, so I guess
> it is ok for now (and we can consider verifying that the size is
> supported at a later stage).

If you set it to an unsupported size, that seems reasonable to me.  If
you set it to an unsupported size and have huge_pages=try, do we fall
back to using no huge pages?

> Also, thanks for the information about Windows. I have been searching
> for info on huge pages on Windows and "superpages" on BSD, without much
> luck. I only have experience on Linux, so I think we can do as you said
> and let someone else look at it. :)

For what it's worth, here's what I know about this on other operating systems:

1.  AIX can do huge pages, but only if you use System V shared memory
(not for mmap() anonymous shared).  In
https://commitfest.postgresql.org/25/1960/ we got as far as adding
support for shared_memory_type=sysv, but to go further we'll need
someone willing to hack on the patch on an AIX system, preferably with
root access so they can grant the postgres user wired memory
privileges (or whatever they call that over there).  But at a glance,
they don't have a way to ask for a specific page size, just "large".

2.  FreeBSD doesn't currently have a way to ask for super pages
explicitly at all; it does something like Linux Transparent Huge
Pages, except that it's transparent.  It does seem to do a pretty good
job of putting PostgreSQL text/code, shared memory and heap memory
into super pages automatically on my systems.  One small detail is
that there is a flag MAP_ALIGNED_SUPER that might help get better
alignment; it'd be bad if the lower pages of our shared memory
happened to be the location of lock arrays, proc array, buffer mapping
or other largish and very hot stuff and also happened to be on 4kb
pages due to misalignment stuff, but I wonder if the flag is really
needed to avoid that on current FreeBSD or not.  I should probably go
and check some time!  (I have no clue for other BSDs.)

3.  Last time I checked, Solaris and illumos seemed to have the same
philosophy as FreeBSD and not give you explicit control; my info could
be out of date, and I have no clue beyond that.

4.  What I said above about Windows; the explicit page size thing
seems to be bleeding edge and barely documented.

5.  macOS does have flags to ask for super pages with various sizes,
but apparently such mappings are not inherited by child processes.  So
that's useless for us.

As for the relevance of all this to your patch, I think we just need a
check callback for the GUC, that says "ERROR: huge_page_size must be
set to 0 on this platform".
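
Something along these lines, roughly (the hook name and message are
illustrative and untested; the hook would be registered as the
check_hook of the huge_page_size entry in guc.c):

#include "utils/guc.h"

static bool
check_huge_page_size(int *newval, void **extra, GucSource source)
{
#if !(defined(MAP_HUGE_MASK) && defined(MAP_HUGE_SHIFT))
    /* Without the MAP_HUGE_* mmap flags, only the default size works. */
    if (*newval != 0)
    {
        GUC_check_errdetail("huge_page_size must be 0 on this platform.");
        return false;
    }
#endif
    return true;
}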



Re: [PATCH] Add support for choosing huge page size

From: Thomas Munro
Date:
On Wed, Jun 10, 2020 at 5:11 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> 3.  Last time I checked, Solaris and illumos seemed to have the same
> philosophy as FreeBSD and not give you explicit control; my info could
> be out of date, and I have no clue beyond that.

Ah, I was wrong about that one: memcntl(2) looks highly relevant, but
I'm not planning to look into that myself.



Re: [PATCH] Add support for choosing huge page size

From: Odin Ugedal
Date:
Thanks again Thomas,

> Oh, so maybe we need a configure test for them?  And if you don't have
> it, a runtime error if you try to set the page size to something other
> than 0 (like we do for effective_io_concurrency if you don't have a
> posix_fadvise() function).

Ahh, yes, that sounds reasonable. I did some fiddling with the configure
script to add a check, and think I got it right (but I'm not 100% sure).
Attached a new v3 patch.

> If you set it to an unsupported size, that seems reasonable to me.  If
> you set it to an unsupported size and have huge_pages=try, do we fall
> back to using no huge pages?

Yes, the "fallback" with huge_pages=try is the same for both
huge_page_size=0 and huge_page_size=nMB, and is the same as without
this patch.
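
In other words, the behaviour is roughly this (a simplified sketch of
the fallback, not the literal sysv_shmem.c code):

#include <stdbool.h>
#include <sys/mman.h>

/*
 * Try the huge page mapping first; if it fails and huge pages are only
 * "try" (not "on"), retry with the original size and regular pages.
 */
static void *
map_anon_segment(size_t size, size_t rounded_size, int huge_flags,
                 bool huge_required)
{
    void       *ptr = mmap(NULL, rounded_size, PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_ANONYMOUS | huge_flags, -1, 0);

    if (ptr == MAP_FAILED && !huge_required)
        ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return ptr;
}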

> For what it's worth, here's what I know about this on other operating systems:

Thanks for all the background info!

> 1.  AIX can do huge pages, but only if you use System V shared memory
> (not for mmap() anonymous shared).  In
> https://commitfest.postgresql.org/25/1960/ we got as far as adding
> support for shared_memory_type=sysv, but to go further we'll need
> someone willing to hack on the patch on an AIX system, preferably with
> root access so they can grant the postgres user wired memory
> privileges (or whatever they call that over there).  But at a glance,
> they don't have a way to ask for a specific page size, just "large".

Interesting. I might get access to some AIX systems at university this fall,
so maybe I will get some time to dive into the patch.


Odin

Attachments

Re: [PATCH] Add support for choosing huge page size

From: Thomas Munro
Date:
Hi Odin,

Documentation syntax error "<literal>2MB<literal>" shows up as:

config.sgml:1605: parser error : Opening and ending tag mismatch:
literal line 1602 and para
       </para>
              ^

Please install the documentation tools
https://www.postgresql.org/docs/devel/docguide-toolsets.html, rerun
configure and "make docs" to see these kinds of errors.

The build is currently failing on Windows:

undefined symbol: HAVE_DECL_MAP_HUGE_MASK at src/include/pg_config.h
line 143 at src/tools/msvc/Mkvcbuild.pm line 851.

I think that's telling us that you need to add this stuff into
src/tools/msvc/Solution.pm, so that we can say it doesn't have it.  I
don't have Windows but whenever you post a new version we'll see if
Windows likes it here:

http://cfbot.cputube.org/odin-ugedal.html

When using huge_pages=on, huge_page_size=1GB, but default
shared_buffers, I noticed that the error message reports the wrong
(unrounded) size in this message:

2020-06-18 02:06:30.407 UTC [73552] HINT:  This error usually means
that PostgreSQL's request for a shared memory segment exceeded
available memory, swap space, or huge pages. To reduce the request
size (currently 149069824 bytes), reduce PostgreSQL's shared memory
usage, perhaps by reducing shared_buffers or max_connections.

The request size was actually:

mmap(NULL, 1073741824, PROT_READ|PROT_WRITE,
MAP_SHARED|MAP_ANONYMOUS|MAP_HUGETLB|30<<MAP_HUGE_SHIFT, -1, 0) = -1
ENOMEM (Cannot allocate memory)

1GB pages are so big that it becomes a little tricky to set shared
buffers large enough without wasting RAM.  What I mean is, if I want
to use shared_buffers=16GB, I need to have at least 17 huge pages
available, but the 17th page is nearly entirely wasted!  Imagine that
on POWER 16GB pages.  That makes me wonder if we should actually
redefine these GUCs differently so that you state the total, or at
least use the rounded memory for buffers...  I think we could consider
that to be a separate problem with a separate patch though.
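
(For reference, the rounding in question is just the following —
illustrative arithmetic only, mirroring what the server already does
when requesting huge pages:)

#include <stddef.h>

/*
 * Round the shared memory request up to a whole number of huge pages.
 * With 1GB pages, a request just over 16GB therefore costs 17 pages,
 * and most of the 17th page is wasted.
 */
static size_t
round_up_to_huge_pages(size_t allocsize, size_t hugepagesize)
{
    if (hugepagesize != 0 && allocsize % hugepagesize != 0)
        allocsize += hugepagesize - (allocsize % hugepagesize);
    return allocsize;
}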

Just for fun, I compared 4KB, 2MB and 1GB pages for a hash join of a
3.5GB table against itself.  Hash joins are the perfect way to
exercise the TLB because they're very likely to miss.  I also applied
my patch[1] to allow parallel queries to use shared memory from the
main shared memory area, so that they benefit from the configured page
size, using pages that are allocated once at start up.  (Without that,
you'd have to mess around with /dev/shm mount options, and then hope
that pages were available at query time, and it'd also be slower for
other stupid implementation reasons).

# echo never > /sys/kernel/mm/transparent_hugepage/enabled
# echo 8500 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
# echo 17 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

shared_buffers=8GB
dynamic_shared_memory_main_size=8GB

create table t as select generate_series(1, 100000000)::int i;
alter table t set (parallel_workers = 7);
create extension pg_prewarm;
select pg_prewarm('t');
set max_parallel_workers_per_gather=7;
set work_mem='1GB';

select count(*) from t t1 join t t2 using (i);

4KB pages: 12.42 seconds
2MB pages:  9.12 seconds
1GB pages:  9.07 seconds

Unfortunately I can't access the TLB miss counters on this system due
to virtualisation restrictions, and the systems where I can don't have
1GB pages.  According to cpuid(1) this system has a fairly typical
setup:

   cache and TLB information (2):
      0x63: data TLB: 2M/4M pages, 4-way, 32 entries
            data TLB: 1G pages, 4-way, 4 entries
      0x03: data TLB: 4K pages, 4-way, 64 entries

This operation is touching about 8GB of data (scanning 3.5GB of table,
building a 4.5GB hash table), so 4 x 1GB is not enough to do this without
TLB misses.

Let's try that again, except this time with shared_buffers=4GB,
dynamic_shared_memory_main_size=4GB, and only half as many tuples in
t, so it ought to fit:

4KB pages:  6.37 seconds
2MB pages:  4.96 seconds
1GB pages:  5.07 seconds

Well that's disappointing.  I wondered if this was something to do
with NUMA effects on this two node box, so I tried running that again
with postgres under numactl --cpunodebind 0 --membind 0 and I got:

4KB pages:  5.43 seconds
2MB pages:  4.05 seconds
1GB pages:  4.00 seconds

From this I can't really conclude that it's terribly useful to use
larger page sizes, but it's certainly useful to have the ability to do
further testing using the proposed GUC.

[1]
https://www.postgresql.org/message-id/flat/CA%2BhUKGLAE2QBv-WgGp%2BD9P_J-%3Dyne3zof9nfMaqq1h3EGHFXYQ%40mail.gmail.com



Re: [PATCH] Add support for choosing huge page size

From: Odin Ugedal
Date:
> Documentation syntax error "<literal>2MB<literal>" shows up as:

Oops, sorry, should be fixed now.

> The build is currently failing on Windows:

Ahh, thanks. Looks like the Windows stuff isn't autogenerated, so
maybe this new patch works..

> When using huge_pages=on, huge_page_size=1GB, but default
> shared_buffers, I noticed that the error message reports the wrong
> (unrounded) size in this message:

Ahh, yes, that is correct. Switched to printing the _real_ allocsize now!


> 1GB pages are so big that it becomes a little tricky to set shared
> buffers large enough without wasting RAM.  What I mean is, if I want
> to use shared_buffers=16GB, I need to have at least 17 huge pages
> available, but the 17th page is nearly entirely wasted!  Imagine that
> on POWER 16GB pages.  That makes me wonder if we should actually
> redefine these GUCs differently so that you state the total, or at
> least use the rounded memory for buffers...  I think we could consider
> that to be a separate problem with a separate patch though.

Yes, that is a good point! But as you say, I guess that fits better in
another patch.

> Just for fun, I compared 4KB, 2MB and 1GB pages for a hash join of a
> 3.5GB table against itself. [...]

Thanks for the results! I will look into your patch when I get time, but
it certainly looks cool! I have a 4-node NUMA machine with ~100GiB of
memory and a single-node NUMA machine, so I'll run some benchmarks on
them when I get the chance!

> I wondered if this was something to do
> with NUMA effects on this two node box, so I tried running that again
> with postgres under numactl --cpunodebind 0 --membind 0 and I got: [...]

Yes, making this "properly" NUMA aware to avoid/limit cross-NUMA
memory access is kinda tricky. When reserving huge pages they are
distributed more or less evenly between the nodes, and they can be
found by using `grep -R ""
/sys/devices/system/node/node*/hugepages/hugepages-*/nr_hugepages`
(the files can also be written to), so there _may_ be a chance that the
huge pages you got were on another node than 0 (because there were not
enough on node 0), but that is just guessing.

Attachments

Re: [PATCH] Add support for choosing huge page size

From: Andres Freund
Date:
Hi,

On 2020-06-18 16:00:49 +1200, Thomas Munro wrote:
> Unfortunately I can't access the TLB miss counters on this system due
> to virtualisation restrictions, and the systems where I can don't have
> 1GB pages.  According to cpuid(1) this system has a fairly typical
> setup:
> 
>    cache and TLB information (2):
>       0x63: data TLB: 2M/4M pages, 4-way, 32 entries
>             data TLB: 1G pages, 4-way, 4 entries
>       0x03: data TLB: 4K pages, 4-way, 64 entries

Hm. Doesn't that system have a second level of TLB (STLB) with more 1GB
entries? I think there's some errata around what Intel exposes via cpuid
around this :(

Guessing that this is a Skylake server chip?
https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)#Memory_Hierarchy

> [...] Additionally there is a unified L2 TLB (STLB)
> [...]  STLB
> [...] 1 GiB page translations:
> [...]     16 entries; 4-way set associative


> This operation is touching about 8GB of data (scanning 3.5GB of table,
> building a 4.5GB hash table), so 4 x 1GB is not enough to do this without
> TLB misses.

I assume this uses 7 workers?


> Let's try that again, except this time with shared_buffers=4GB,
> dynamic_shared_memory_main_size=4GB, and only half as many tuples in
> t, so it ought to fit:
> 
> 4KB pages:  6.37 seconds
> 2MB pages:  4.96 seconds
> 1GB pages:  5.07 seconds
> 
> Well that's disappointing.

Hm, I don't actually know the answer to this: if this actually uses
multiple workers, won't the fact that each has an independent page table
(despite having overlapping contents) lead to fewer 1GB entries actually
being available?  Obviously it depends on how processes are scheduled
(IIRC hyperthreading shares dTLBs).

It might be worth looking at whether there are CPU migrations, or testing
with a single worker.


> I wondered if this was something to do
> with NUMA effects on this two node box, so I tried running that again
> with postgres under numactl --cpunodebind 0 --membind 0 and I got:


> 4KB pages:  5.43 seconds
> 2MB pages:  4.05 seconds
> 1GB pages:  4.00 seconds
> 
> From this I can't really conclude that it's terribly useful to use
> larger page sizes, but it's certainly useful to have the ability to do
> further testing using the proposed GUC.

Due to the low number of 1GB entries they're quite likely to be
problematic imo, especially when there are more concurrent misses than
there are page table entries.

I'm somewhat doubtful that it's useful to use 1GB entries for all of our
shared memory when that's bigger than the maximum covered size. I
suspect that it'd be better to use 1GB entries for some and smaller entries
for the rest of the memory.

Greetings,

Andres Freund



Re: [PATCH] Add support for choosing huge page size

From: Thomas Munro
Date:
On Mon, Jun 22, 2020 at 7:51 AM Odin Ugedal <odin@ugedal.com> wrote:
> Ahh, thanks. Looks like the Windows stuff isn't autogenerated, so
> maybe this new patch works..

On second thoughts, it seemed like overkill to use configure just to
detect whether macros are defined, so I dropped that and used plain
old #if defined().  I also did some minor proof-reading and editing on
the documentation and comments; I put back the bit about sysctl and
sysctl.conf because I think that is still pretty useful to highlight
for people who just want to use the default size, along with the /sys
method.

Pushed.  Thanks for the patch!  It's always nice to see notes like
this being removed:

- * Currently *mmap_flags is always just MAP_HUGETLB.  Someday, on systems
- * that support it, we might OR in additional bits to specify a particular
- * non-default huge page size.

In passing, I think GetHugePageSize() is a bit odd; it claims to have
a Linux-specific part and a generic part, and yet the whole thing is
wrapped in #ifdef MAP_HUGETLB which is Linux-specific as far as I
know.  But that's not this patch's fault.

We might want to consider removing the note about CONFIG_HUGETLB_PAGE
from the manual; I'm not sure if kernels built without that stuff are
still roaming in the wild, or if it's another anachronism due for
removal like commit c8be915a.  I didn't do that today, though.