Discussion: postgresql v11.1 Segmentation fault: signal 11: by running SELECT... JIT Issue?


postgresql v11.1 Segmentation fault: signal 11: by running SELECT... JIT Issue?

From: pabloa98
Date:
Hello

I just migrated our databases from PostgreSQL version 9.6 to version 11.1. We got a segmentation fault while running this query:

SELECT f_2110 as x FROM baseline_denull
ORDER BY eid ASC
limit 500
OFFSET 131000;

It works in version 11.1 if offset + limit is less than approximately 131000 (the exact threshold is some number around that).

It also works if I disable JIT (it was enabled). So this works:

set jit = 0;
SELECT f_2110 as x FROM baseline_denull
ORDER BY eid ASC
limit 500
OFFSET 131000;

It always works in version 9.6.


The workaround seems to be disabling JIT. Is this a configuration problem or a bug?
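
In case it is useful, this is how I toggle and verify JIT per session (same query as above; with jit = on the EXPLAIN ANALYZE itself crashes the backend here, and with jit = off the plan output simply has no "JIT:" section):

SHOW jit;                        -- 'on' in our configuration
SET jit = off;                   -- session-only workaround
EXPLAIN ANALYZE
SELECT f_2110 AS x FROM baseline_denull
ORDER BY eid ASC
LIMIT 500
OFFSET 131000;                   -- completes, and the plan shows no JIT section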

We are using a compiled version of Postgres because we have tables (like this one) with thousands of columns.

This server was compiled as follows:

On Ubuntu 16.04:

sudo apt update
sudo apt install --yes libcrypto++-utils libssl-dev libcrypto++-dev libsystemd-dev libpthread-stubs0-dev libpthread-workqueue-dev
sudo apt install --yes docbook-xml docbook-xsl fop libxml2-utils xsltproc
sudo apt install --yes gcc zlib1g-dev libreadline6-dev make
sudo apt install --yes llvm-6.0 clang-6.0
sudo apt install --yes build-essential
sudo apt install --yes opensp
sudo locale-gen en_US.UTF-8

Download the source code:

mkdir -p ~/soft
cd ~/soft
wget https://ftp.postgresql.org/pub/source/v11.1/postgresql-11.1.tar.gz
tar xvzf postgresql-11.1.tar.gz
cd postgresql-11.1/

./configure --prefix=$HOME/soft/postgresql/postgresql-11 --with-extra-version=ps.2.0 --with-llvm --with-openssl --with-systemd --with-blocksize=32 --with-wal-blocksize=32 --with-system-tzdata=/usr/share/zoneinfo


make world
make check   # 11 tests fail. I assume this is because the planner behaves differently due to the changed block size.

make install-world


$HOME/soft/postgresql/postgresql-11/bin/initdb -D $HOME/soft/postgresql/postgresql-11/data/

Changes in ./data/postgresql.conf:

    listen_addresses = '*'
    max_connections = 300
    work_mem = 32MB
    maintenance_work_mem = 256MB
    shared_buffers = 1024MB
    log_timezone = 'US/Pacific'
    log_destination = 'csvlog'
    logging_collector = on
    log_filename = 'postgresql-%Y-%m-%d.log'
    log_rotation_size = 0
    log_min_duration_statement = 1000
    debug_print_parse = off
    debug_print_rewritten = off
    debug_print_plan = off
    log_temp_files = 100000000

    jit = on  # As a workaround I turned this off, but I want it on.



The database is created as:

CREATE DATABASE xxx
    WITH
    OWNER = user
    ENCODING = 'UTF8'
    LC_COLLATE = 'en_US.UTF-8'
    LC_CTYPE = 'en_US.UTF-8'
    TABLESPACE = pg_default
    CONNECTION LIMIT = -1;

The table baseline_denull has 1765 columns, mainly numeric, like:

CREATE TABLE public.baseline_denull
(
    eid integer,
    f_19 integer,
    f_21 integer,
    f_23 integer,
    f_31 integer,
    f_34 integer,
    f_35 integer,
    f_42 text COLLATE pg_catalog."default",
    f_43 text COLLATE pg_catalog."default",
    f_45 text COLLATE pg_catalog."default",
    f_46 integer,
    f_47 integer,
    f_48 double precision,
    f_49 double precision,
    f_50 double precision,
    f_51 double precision,
    f_52 integer,
    f_53 date,
    f_54 integer,
    f_68 integer,
    f_74 integer,
    f_77 double precision,
    f_78 double precision,
    f_84 integer[],
    f_87 integer[],
    f_92 integer[],
    f_93 integer[],
    f_94 integer[],
    f_95 integer[],
    f_96 integer[],
    f_102 integer[],
    f_120 integer,
    f_129 integer,

etc

and one index:

CREATE INDEX baseline_denull_eid_idx
    ON public.baseline_denull USING btree
    (eid)
    TABLESPACE pg_default;


I have a core dump saved. It says:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: user xxx 172.17.0.64(36654) SELECT                      '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f3c0c08c290 in ?? ()
(gdb) bt
#0  0x00007f3c0c08c290 in ?? ()
#1  0x0000000000000000 in ?? ()
(gdb) quit


How could I enable JIT again without getting a segmentation fault?

Regards,

Pablo

Re: postgresql v11.1 Segmentation fault: signal 11: by running SELECT... JIT Issue?

From: Tom Lane
Date:
pabloa98 <pabloa98@gmail.com> writes:
> I just migrated our databases from PostgreSQL version 9.6 to version 11.1.
> We got a segmentation fault while running this query:

> SELECT f_2110 as x FROM baseline_denull
> ORDER BY eid ASC
> limit 500
> OFFSET 131000;

> the table baseline_denull has 1765 columns, mainly numbers, like:

Hm, that sounds like it matches this recent bug fix:

Author: Andres Freund <andres@anarazel.de>
Branch: master [b23852766] 2018-11-27 10:07:03 -0800
Branch: REL_11_STABLE [aee085bc0] 2018-11-27 10:07:43 -0800

    Fix jit compilation bug on wide tables.
    
    The function generated to perform JIT compiled tuple deforming failed
    when HeapTupleHeader's t_hoff was bigger than a signed int8. I'd
    failed to realize that LLVM's getelementptr would treat an int8 index
    argument as signed, rather than unsigned.  That means that a hoff
    larger than 127 would result in a negative offset being applied.  Fix
    that by widening the index to 32bit.
    
    Add a testcase with a wide table. Don't drop it, as it seems useful to
    verify other tools deal properly with wide tables.
    
    Thanks to Justin Pryzby for both reporting a bug and then reducing it
    to a reproducible testcase!
    
    Reported-By: Justin Pryzby
    Author: Andres Freund
    Discussion: https://postgr.es/m/20181115223959.GB10913@telsasoft.com
    Backpatch: 11, just as jit compilation was


This would result in failures on wide rows that contain some null
entries.  If your table is mostly-not-null, that would fit the
observation that it only crashes on a few rows.
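
For illustration, a self-contained reproduction along the lines of the committed testcase could look like this (the table name and the 1000-column count are made up for the sketch; any table wide enough that t_hoff exceeds 127, holding a row with NULLs, should do once JIT is forced):

DO $$
DECLARE
    cols text;
BEGIN
    SELECT string_agg(format('c%s int', i), ', ') INTO cols
    FROM generate_series(1, 1000) AS i;
    EXECUTE 'CREATE TABLE wide_repro (' || cols || ')';
END $$;

INSERT INTO wide_repro (c1) VALUES (1);  -- mostly-NULL row: null bitmap of 125 bytes,
                                         -- so t_hoff = MAXALIGN(23 + 125) = 152 > 127
SET jit = on;
SET jit_above_cost = 0;                  -- force JIT even for a trivially cheap plan
SELECT c1000 FROM wide_repro;            -- segfaults on an unpatched 11.1 with JIT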

Can you try REL_11_STABLE branch tip and see if it works for you?

            regards, tom lane


Re: postgresql v11.1 Segmentation fault: signal 11: by running SELECT... JIT Issue?

From: Andrew Gierth
Date:
>>>>> "pabloa98" == pabloa98  <pabloa98@gmail.com> writes:

 pabloa98> the table baseline_denull has 1765 columns,

Uhh...

#define MaxHeapAttributeNumber    1600    /* 8 * 200 */

Did you modify that?

(The back of my envelope says that on 64bit, the largest usable t_hoff
would be 248, of which 23 is fixed overhead leaving 225 as the max null
bitmap size, giving a hard limit of 1800 for MaxTupleAttributeNumber and
1799 for MaxHeapAttributeNumber. And the concerns expressed in the
comments above those #defines would obviously apply.)
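
A quick sanity check of that arithmetic, using only the numbers above:

SELECT 255 - 255 % 8   AS largest_usable_t_hoff,   -- 248: uint8 max, MAXALIGN'd down
       248 - 23        AS max_null_bitmap_bytes,   -- 225: minus fixed header overhead
       (248 - 23) * 8  AS max_bitmap_attrs;        -- 1800 attributes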

-- 
Andrew (irc:RhodiumToad)


Re: postgresql v11.1 Segmentation fault: signal 11: by running SELECT... JIT Issue?

From: pabloa98
Date:
I did not modify it.

I guess I should make it bigger than 1765. Is 2400 or 3200 fine?

My apologies if my questions look silly. I do not know about the internal format of the database.

Pablo


Re: postgresql v11.1 Segmentation fault: signal 11: by running SELECT... JIT Issue?

From: pabloa98
Date:
I found this article:
https://manual.limesurvey.org/Instructions_for_increasing_the_maximum_number_of_columns_in_PostgreSQL_on_Linux

It seems I should modify: uint8 t_hoff;
and replace it with something like: uint32 t_hoff; or uint64 t_hoff;

And perhaps I should modify this too?

The fix is easy enough, just adding a
v_hoff = LLVMBuildZExt(b, v_hoff, LLVMInt32Type(), "");
fixes the issue for me.

If that is the case, I am not sure what kind of modification we should do.


I feel I need to explain why we create these huge tables. Basically, we want to process big matrices for machine learning.
Using tables with classic columns lets us write very clear code. If we had to start using arrays as columns, things would become complicated and unintuitive (besides, some columns already store vectors as arrays...).

We could use JSONB (we do, but for json documents). The problem is that storing large amounts of data in jsonb columns creates performance issues (compared with normal tables).

Since almost everybody is applying ML to their products, perhaps there are other companies interested in a version of Postgres that can deal with tables with thousands of columns?
I did not find any ready-to-use Postgres package like that, though.

Pablo





Re: postgresql v11.1 Segmentation fault: signal 11: by running SELECT... JIT Issue?

From: Andrew Gierth
Date:
>>>>> "pabloa98" == pabloa98  <pabloa98@gmail.com> writes:

 pabloa98> I did not modify it.

Then how did you create a table with more than 1600 columns? If I try
and create a table with 1765 columns, I get:

ERROR:  tables can have at most 1600 columns
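
One way to try it without typing out 1765 column definitions (the table name here is a throwaway):

DO $$
BEGIN
    EXECUTE (SELECT format('CREATE TABLE too_wide (%s)',
                           string_agg(format('c%s int', i), ', '))
             FROM generate_series(1, 1765) AS i);
END $$;
-- ERROR:  tables can have at most 1600 columns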

-- 
Andrew (irc:RhodiumToad)


Re: postgresql v11.1 Segmentation fault: signal 11: by running SELECT... JIT Issue?

From: Andrew Gierth
Date:
>>>>> "pabloa98" == pabloa98  <pabloa98@gmail.com> writes:

 pabloa98> I found this article:

 pabloa98> https://manual.limesurvey.org/Instructions_for_increasing_the_maximum_number_of_columns_in_PostgreSQL_on_Linux

Those instructions contain obvious errors.

 pabloa98> It seems I should modify: uint8 t_hoff;
 pabloa98> and replace it with something like: uint32 t_hoff; or uint64 t_hoff;

At the very least, that ought to be uint16 t_hoff; there is never any
possibility of hoff being larger than 32k, since that's the largest
allowed page size. However, if you modify that, it's then up to you to
ensure that all the code that assumes it's a uint8 is found and fixed.
I have no idea what else would break.

-- 
Andrew (irc:RhodiumToad)


Re: postgresql v11.1 Segmentation fault: signal 11: by running SELECT... JIT Issue?

From: pabloa98
Date:
I appreciate your advice. I will check the number of columns in that table. 




Re: postgresql v11.1 Segmentation fault: signal 11: by running SELECT... JIT Issue?

From: pabloa98
Date:
I checked the table. It has 1265 columns. Sorry about the typo.
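
For reference, one way to count them via information_schema:

SELECT count(*)
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name = 'baseline_denull';
-- returns 1265 here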

Pablo


Re: postgresql v11.1 Segmentation fault: signal 11: by running SELECT... JIT Issue?

From: Justin Pryzby
Date:
On Mon, Nov 26, 2018 at 07:00:35PM -0800, Andres Freund wrote:
> The fix is easy enough, just adding a
>     v_hoff = LLVMBuildZExt(b, v_hoff, LLVMInt32Type(), "");
> fixes the issue for me.

On Tue, Jan 29, 2019 at 12:38:38AM -0800, pabloa98 wrote:
> And perhaps should I modify this too?
> If that is the case, I am not sure what kind of modification we should do.

Andres committed the fix in November, and it's included in Postgres 11.2, which
is scheduled to be released Thursday.  So we'll both be able to re-enable JIT
on our wide tables again.
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b23852766
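
Once on 11.2, switching it back on cluster-wide is just something like this (a sketch; it can equally go back into postgresql.conf or be set per session):

ALTER SYSTEM SET jit = on;
SELECT pg_reload_conf();
SHOW jit;   -- should report 'on'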

Justin


Re: postgresql v11.1 Segmentation fault: signal 11: by running SELECT... JIT Issue?

From: pabloa98
Date:
I tried. It works!
Thanks for the information.
P
