Обсуждение: Segfault during queries

Поиск
Список
Период
Сортировка

Segfault during queries

От
Tyler Brock
Дата:
Hi all,

I keep getting segfaults when running queries against postgresql replicas. We have an ETL job that hits these nodes with many small queries (no more than 8 concurrently) and the box has 16 cores and plenty of I/O and ram.

However, during this process postgresql segfaults, usually on COPY commands that have this shape:

COPY
(
       SELECT row_to_json(t)
       FROM   (
                     SELECT lead_tag."objectId" AS "lead_id",
                            lead_tag."tag"      AS "tag_id"
                     FROM   (
                                   SELECT "objectId",
                                          jsonb_array_elements_text("tags") AS tag
                                   FROM   "Lead"
                                   WHERE  true
                                   AND    "Lead"."organization" = 'I6JDWAaZx5'
                                   AND    tags IS NOT NULL
                                   AND    "Lead"."updatedAt" >= '2015-01-01'
                                   AND    "Lead"."updatedAt" < '2021-10-30T11:05:40.389773+00:00' ) AS lead_tag) t) TO stdout

Is there anything I can do to prevent this or anything i can look at to try and diagnose what is happening here? I’m running Postgres 12.7 and cannot tell if this is a symptom of a bug or just misconfiguration.

What I see in the error logs is:

2021-10-29 11:08:10 UTC:172.23.17.171(54332):postgres@postgres:[22200]:ERROR: cache lookup failed for type 0 2021-10-29 11:08:10 UTC:172.23.17.171(54332):postgres@postgres:[22200]:STATEMENT: COPY (SELECT ROW_TO_JSON(t) FROM (… the query above...) TO STDOUT 2021-10-29 11:08:10 UTC::@:[19406]:LOG: server process (PID 22200) was terminated by signal 11: Segmentation fault

Any help would be greatly appreciated, thanks!

-Tyler

Re: Segfault during queries

От
Scott Ribe
Дата:
Of course I cannot say if this is your case, but I've been using PG since 2004 and every single instance* of a segfault
hasturned out to be caused by some 3rd-party extension or other. And yes, the extension was never related in any way to
thequery being run, it just corrupted memory such that PG would die later. So I would audit extensions, and see if any
havebeen recently added, any could be removed temporarily, and so on. 

* It's only 2 or 3 cases in all that time, but still, it was never core PG.

--
Scott Ribe
scott_ribe@elevated-dev.com
https://www.linkedin.com/in/scottribe/



> On Oct 29, 2021, at 8:05 AM, Tyler Brock <tyler.brock@gmail.com> wrote:
>
> Hi all,
>
> I keep getting segfaults when running queries against postgresql replicas. We have an ETL job that hits these nodes
withmany small queries (no more than 8 concurrently) and the box has 16 cores and plenty of I/O and ram. 
>
> However, during this process postgresql segfaults, usually on COPY commands that have this shape:
>
> COPY
> (
>        SELECT row_to_json(t)
>        FROM   (
>                      SELECT lead_tag."objectId" AS "lead_id",
>                             lead_tag."tag"      AS "tag_id"
>                      FROM   (
>                                    SELECT "objectId",
>                                           jsonb_array_elements_text("tags") AS tag
>                                    FROM   "Lead"
>                                    WHERE  true
>                                    AND    "Lead"."organization" = 'I6JDWAaZx5'
>                                    AND    tags IS NOT NULL
>                                    AND    "Lead"."updatedAt" >= '2015-01-01'
>                                    AND    "Lead"."updatedAt" < '2021-10-30T11:05:40.389773+00:00' ) AS lead_tag) t)
TOstdout 
>
> Is there anything I can do to prevent this or anything i can look at to try and diagnose what is happening here? I’m
runningPostgres 12.7 and cannot tell if this is a symptom of a bug or just misconfiguration. 
>
> What I see in the error logs is:
>
> 2021-10-29 11:08:10 UTC:172.23.17.171(54332):postgres@postgres:[22200]:ERROR: cache lookup failed for type 0
> 2021-10-29 11:08:10 UTC:172.23.17.171(54332):postgres@postgres:[22200]:STATEMENT: COPY (SELECT ROW_TO_JSON(t) FROM (…
thequery above...) TO STDOUT 
> 2021-10-29 11:08:10 UTC::@:[19406]:LOG: server process (PID 22200) was terminated by signal 11: Segmentation fault
>
>
> Any help would be greatly appreciated, thanks!
>
> -Tyler




Re: Segfault during queries

От
Tom Lane
Дата:
Tyler Brock <tyler.brock@gmail.com> writes:
> Is there anything I can do to prevent this or anything i can look at to try
> and diagnose what is happening here? I’m running Postgres 12.7 and cannot
> tell if this is a symptom of a bug or just misconfiguration.

> What I see in the error logs is:

> 2021-10-29 11:08:10 UTC:172.23.17.171(54332):postgres@postgres:[22200]:ERROR:
> cache lookup failed for type 0 2021-10-29 11:08:10
> UTC:172.23.17.171(54332):postgres@postgres:[22200]:STATEMENT: COPY (SELECT
> ROW_TO_JSON(t) FROM (… the query above...) TO STDOUT 2021-10-29 11:08:10
> UTC::@:[19406]:LOG: server process (PID 22200) was terminated by signal 11:
> Segmentation fault

That's most certainly a bug.  Can you get a stack trace from the crash?

https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

            regards, tom lane



Re: Segfault during queries

От
Tyler Brock
Дата:
Interesting! Thank you, I am seeing that we have an outdated version of plv8 that does not match what we should be running for pg12.x, let me try dropping that extension and recreating it with a newer version.

-Tyler


On Oct 29, 2021 at 10:27:45 AM, Scott Ribe <scott_ribe@elevated-dev.com> wrote:
Of course I cannot say if this is your case, but I've been using PG since 2004 and every single instance* of a segfault has turned out to be caused by some 3rd-party extension or other. And yes, the extension was never related in any way to the query being run, it just corrupted memory such that PG would die later. So I would audit extensions, and see if any have been recently added, any could be removed temporarily, and so on.

* It's only 2 or 3 cases in all that time, but still, it was never core PG.

--
Scott Ribe
scott_ribe@elevated-dev.com
https://www.linkedin.com/in/scottribe/



On Oct 29, 2021, at 8:05 AM, Tyler Brock <tyler.brock@gmail.com> wrote:

Hi all,

I keep getting segfaults when running queries against postgresql replicas. We have an ETL job that hits these nodes with many small queries (no more than 8 concurrently) and the box has 16 cores and plenty of I/O and ram.

However, during this process postgresql segfaults, usually on COPY commands that have this shape:

COPY
(
       SELECT row_to_json(t)
       FROM   (
                     SELECT lead_tag."objectId" AS "lead_id",
                            lead_tag."tag"      AS "tag_id"
                     FROM   (
                                   SELECT "objectId",
                                          jsonb_array_elements_text("tags") AS tag
                                   FROM   "Lead"
                                   WHERE  true
                                   AND    "Lead"."organization" = 'I6JDWAaZx5'
                                   AND    tags IS NOT NULL
                                   AND    "Lead"."updatedAt" >= '2015-01-01'
                                   AND    "Lead"."updatedAt" < '2021-10-30T11:05:40.389773+00:00' ) AS lead_tag) t) TO stdout

Is there anything I can do to prevent this or anything i can look at to try and diagnose what is happening here? I’m running Postgres 12.7 and cannot tell if this is a symptom of a bug or just misconfiguration.

What I see in the error logs is:

2021-10-29 11:08:10 UTC:172.23.17.171(54332):postgres@postgres:[22200]:ERROR: cache lookup failed for type 0
2021-10-29 11:08:10 UTC:172.23.17.171(54332):postgres@postgres:[22200]:STATEMENT: COPY (SELECT ROW_TO_JSON(t) FROM (… the query above...) TO STDOUT
2021-10-29 11:08:10 UTC::@:[19406]:LOG: server process (PID 22200) was terminated by signal 11: Segmentation fault


Any help would be greatly appreciated, thanks!

-Tyler

Re: Segfault during queries

От
Tyler Brock
Дата:
It doesn’t seem like upgrading plv8 did the trick unfortunately. I’m going to try and get those stack traces but it may be difficult/impossible because we are running postgres via AWS RDS and don’t have control of the underlying VM.

-Tyler


On Oct 29, 2021 at 11:08:57 AM, Tyler Brock <tyler.brock@gmail.com> wrote:
Interesting! Thank you, I am seeing that we have an outdated version of plv8 that does not match what we should be running for pg12.x, let me try dropping that extension and recreating it with a newer version.

-Tyler


On Oct 29, 2021 at 10:27:45 AM, Scott Ribe <scott_ribe@elevated-dev.com> wrote:
Of course I cannot say if this is your case, but I've been using PG since 2004 and every single instance* of a segfault has turned out to be caused by some 3rd-party extension or other. And yes, the extension was never related in any way to the query being run, it just corrupted memory such that PG would die later. So I would audit extensions, and see if any have been recently added, any could be removed temporarily, and so on.

* It's only 2 or 3 cases in all that time, but still, it was never core PG.

--
Scott Ribe
scott_ribe@elevated-dev.com
https://www.linkedin.com/in/scottribe/



On Oct 29, 2021, at 8:05 AM, Tyler Brock <tyler.brock@gmail.com> wrote:

Hi all,

I keep getting segfaults when running queries against postgresql replicas. We have an ETL job that hits these nodes with many small queries (no more than 8 concurrently) and the box has 16 cores and plenty of I/O and ram.

However, during this process postgresql segfaults, usually on COPY commands that have this shape:

COPY
(
       SELECT row_to_json(t)
       FROM   (
                     SELECT lead_tag."objectId" AS "lead_id",
                            lead_tag."tag"      AS "tag_id"
                     FROM   (
                                   SELECT "objectId",
                                          jsonb_array_elements_text("tags") AS tag
                                   FROM   "Lead"
                                   WHERE  true
                                   AND    "Lead"."organization" = 'I6JDWAaZx5'
                                   AND    tags IS NOT NULL
                                   AND    "Lead"."updatedAt" >= '2015-01-01'
                                   AND    "Lead"."updatedAt" < '2021-10-30T11:05:40.389773+00:00' ) AS lead_tag) t) TO stdout

Is there anything I can do to prevent this or anything i can look at to try and diagnose what is happening here? I’m running Postgres 12.7 and cannot tell if this is a symptom of a bug or just misconfiguration.

What I see in the error logs is:

2021-10-29 11:08:10 UTC:172.23.17.171(54332):postgres@postgres:[22200]:ERROR: cache lookup failed for type 0
2021-10-29 11:08:10 UTC:172.23.17.171(54332):postgres@postgres:[22200]:STATEMENT: COPY (SELECT ROW_TO_JSON(t) FROM (… the query above...) TO STDOUT
2021-10-29 11:08:10 UTC::@:[19406]:LOG: server process (PID 22200) was terminated by signal 11: Segmentation fault


Any help would be greatly appreciated, thanks!

-Tyler

Re: Segfault during queries

От
Tyler Brock
Дата:
Hey all, so i solved it but I’m not sure how…

Some combination of upgrading plv8, upgrading from postgres 12.7 to 12.8 and upgrading psycopg minor version did the trick and we are no longer seeing segfaults when we run these queries.

AWS was not helpful in assisting with core dumps + stack traces (as you could imagine).

Their suggestion was to set verbosity to “verbose”… would that have yielded any useful info for you all or would debugging this have been impossible on RDS?

I greatly appreciate all the help.

-Tyler


On Oct 29, 2021 at 1:36:12 PM, Tyler Brock <tyler.brock@gmail.com> wrote:
It doesn’t seem like upgrading plv8 did the trick unfortunately. I’m going to try and get those stack traces but it may be difficult/impossible because we are running postgres via AWS RDS and don’t have control of the underlying VM.

-Tyler


On Oct 29, 2021 at 11:08:57 AM, Tyler Brock <tyler.brock@gmail.com> wrote:
Interesting! Thank you, I am seeing that we have an outdated version of plv8 that does not match what we should be running for pg12.x, let me try dropping that extension and recreating it with a newer version.

-Tyler


On Oct 29, 2021 at 10:27:45 AM, Scott Ribe <scott_ribe@elevated-dev.com> wrote:
Of course I cannot say if this is your case, but I've been using PG since 2004 and every single instance* of a segfault has turned out to be caused by some 3rd-party extension or other. And yes, the extension was never related in any way to the query being run, it just corrupted memory such that PG would die later. So I would audit extensions, and see if any have been recently added, any could be removed temporarily, and so on.

* It's only 2 or 3 cases in all that time, but still, it was never core PG.

--
Scott Ribe
scott_ribe@elevated-dev.com
https://www.linkedin.com/in/scottribe/



On Oct 29, 2021, at 8:05 AM, Tyler Brock <tyler.brock@gmail.com> wrote:

Hi all,

I keep getting segfaults when running queries against postgresql replicas. We have an ETL job that hits these nodes with many small queries (no more than 8 concurrently) and the box has 16 cores and plenty of I/O and ram.

However, during this process postgresql segfaults, usually on COPY commands that have this shape:

COPY
(
       SELECT row_to_json(t)
       FROM   (
                     SELECT lead_tag."objectId" AS "lead_id",
                            lead_tag."tag"      AS "tag_id"
                     FROM   (
                                   SELECT "objectId",
                                          jsonb_array_elements_text("tags") AS tag
                                   FROM   "Lead"
                                   WHERE  true
                                   AND    "Lead"."organization" = 'I6JDWAaZx5'
                                   AND    tags IS NOT NULL
                                   AND    "Lead"."updatedAt" >= '2015-01-01'
                                   AND    "Lead"."updatedAt" < '2021-10-30T11:05:40.389773+00:00' ) AS lead_tag) t) TO stdout

Is there anything I can do to prevent this or anything i can look at to try and diagnose what is happening here? I’m running Postgres 12.7 and cannot tell if this is a symptom of a bug or just misconfiguration.

What I see in the error logs is:

2021-10-29 11:08:10 UTC:172.23.17.171(54332):postgres@postgres:[22200]:ERROR: cache lookup failed for type 0
2021-10-29 11:08:10 UTC:172.23.17.171(54332):postgres@postgres:[22200]:STATEMENT: COPY (SELECT ROW_TO_JSON(t) FROM (… the query above...) TO STDOUT
2021-10-29 11:08:10 UTC::@:[19406]:LOG: server process (PID 22200) was terminated by signal 11: Segmentation fault


Any help would be greatly appreciated, thanks!

-Tyler