Thread: postgres 9.5 DB corruption

postgres 9.5 DB corruption

From: Thomas Tignor
Date:
Hello postgres community,

Writing again to see if there are insights on this issue. We have had infrequent but recurring corruption since upgrading from postgres 9.1 to postgres 9.5. We are presently on 9.5.16. Our DB-facing app continually performs a mixture of DML, primarily inserts and updates on two specific tables, with no single op being suspect. In the past, corruption events have produced encoding errors on COPY operations (invalid byte sequence for encoding "UTF8"). More recently, they have caused segmentation faults. We were able to take a cold backup after a recent event. SELECTing the corrupted data on our cold backup yields the following stack. Any info on a solution or how to proceed towards a solution would be much appreciated.

Thanks in advance.


(gdb) where
#0  pglz_decompress (source=source@entry=0xa617904 "0", slen=8139, dest=dest@entry=0x4268e028 "", rawsize=808452096) at pg_lzcompress.c:745
#1  0x080f3079 in toast_decompress_datum (attr=0xa6178fc) at tuptoaster.c:2210
#2  0x080f3716 in heap_tuple_untoast_attr (attr=0xa6178fc) at tuptoaster.c:183
#3  0x08440955 in pg_detoast_datum_packed (datum=<optimized out>) at fmgr.c:2270
#4  0x084145bf in text_to_cstring (t=0x7592fd2a) at varlena.c:176
#5  0x0843e874 in FunctionCall1Coll (flinfo=flinfo@entry=0xa614738, collation=collation@entry=0, arg1=arg1@entry=1972567338) at fmgr.c:1297
#6  0x0843fef8 in OutputFunctionCall (flinfo=0xa614738, val=1972567338) at fmgr.c:1950
#7  0x080bf84b in printtup (slot=0xa613bf4, self=0xa60d714) at printtup.c:359
#8  0x08220f9a in ExecutePlan (dest=0xa60d714, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, planstate=0xa613974, estate=0xa6138ec) at execMain.c:1574
#9  standard_ExecutorRun (queryDesc=0xa6134e4, direction=ForwardScanDirection, count=0) at execMain.c:337
#10 0x08332c1b in PortalRunSelect (portal=portal@entry=0xa6114dc, forward=forward@entry=1 '\001', count=0, count@entry=2147483647, dest=dest@entry=0xa60d714) at pquery.c:942
#11 0x08333fa7 in PortalRun (portal=portal@entry=0xa6114dc, count=count@entry=2147483647, isTopLevel=isTopLevel@entry=1 '\001', dest=dest@entry=0xa60d714, altdest=altdest@entry=0xa60d714, completionTag=completionTag@entry=0xffd5d71c "")
    at pquery.c:786
#12 0x08330ba8 in exec_simple_query (query_string=0xa5f1754 "select * from ams.alert_attribute_bak;") at postgres.c:1096
#13 PostgresMain (argc=1, argv=0xa53dbbc, dbname=0xa53daec "ams", username=0xa53dadc "akamai") at postgres.c:4049
#14 0x080b53af in BackendRun (port=0xa584b78) at postmaster.c:4312
#15 BackendStartup (port=0xa584b78) at postmaster.c:3986
#16 ServerLoop () at postmaster.c:1705
#17 0x082d0dd7 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0xa53d2a8) at postmaster.c:1313
#18 0x080b68eb in main (argc=3, argv=0xa53d2a8) at main.c:228
(gdb)
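
One way to narrow down which tuples trigger the failure is a scan that forces each stored value to be decompressed and traps per-row errors. The block below is only a sketch: it assumes ams.alert_attribute_bak has a text/varchar column named value (mirroring ams.alert_attribute, whose schema appears later in this thread), and it only traps soft errors such as the "invalid byte sequence" case; a tuple that segfaults the backend will still bring the session down.

DO $$
DECLARE
    r   record;
    len integer;
BEGIN
    FOR r IN SELECT ctid FROM ams.alert_attribute_bak LOOP
        BEGIN
            -- length() forces the stored value to be detoasted and decompressed
            SELECT length(value) INTO len
              FROM ams.alert_attribute_bak
             WHERE ctid = r.ctid;
        EXCEPTION WHEN OTHERS THEN
            RAISE NOTICE 'damaged tuple at ctid %: %', r.ctid, SQLERRM;
        END;
    END LOOP;
END $$;

Rows reported this way can then be inspected or removed by ctid.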


Tom    :-)

Re: postgres 9.5 DB corruption

From: Adrian Klaver
Date:
On 7/24/19 7:38 AM, Thomas Tignor wrote:
> Hello postgres community,
> 
> Writing again to see if there are insights on this issue. We have had 
> infrequent but recurring corruption since upgrading from postgres 9.1 to 
> postgres 9.5. We are presently on 9.5.16. Our DB-facing app continually 
> performs a mixture of DML, primarily inserts and updates on two specific 
> tables, with no single op being suspect. In the past, corruption events 
> have produced encoding errors on COPY operations (invalid byte sequence 
> for encoding "UTF8"). More recently, they have caused segmentation 
> faults. We were able to take a cold backup after a recent event. 
> SELECTing the corrupted data on our cold backup yields the following 
> stack. Any info on a solution or how to proceed towards a solution would 
> be much appreciated.

More information would be useful:

1) Schema of the tables.

2) Source of the data.


> 
> Thanks in advance.
> 
> 
> (gdb) where
> #0  pglz_decompress (source=source@entry=0xa617904 "0", slen=8139, dest=dest@entry=0x4268e028 "", rawsize=808452096) at pg_lzcompress.c:745
> #1  0x080f3079 in toast_decompress_datum (attr=0xa6178fc) at tuptoaster.c:2210
> #2  0x080f3716 in heap_tuple_untoast_attr (attr=0xa6178fc) at tuptoaster.c:183
> #3  0x08440955 in pg_detoast_datum_packed (datum=<optimized out>) at fmgr.c:2270
> #4  0x084145bf in text_to_cstring (t=0x7592fd2a) at varlena.c:176
> #5  0x0843e874 in FunctionCall1Coll (flinfo=flinfo@entry=0xa614738, collation=collation@entry=0, arg1=arg1@entry=1972567338) at fmgr.c:1297
> #6  0x0843fef8 in OutputFunctionCall (flinfo=0xa614738, val=1972567338) at fmgr.c:1950
> #7  0x080bf84b in printtup (slot=0xa613bf4, self=0xa60d714) at printtup.c:359
> #8  0x08220f9a in ExecutePlan (dest=0xa60d714, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, planstate=0xa613974, estate=0xa6138ec) at execMain.c:1574
> #9  standard_ExecutorRun (queryDesc=0xa6134e4, direction=ForwardScanDirection, count=0) at execMain.c:337
> #10 0x08332c1b in PortalRunSelect (portal=portal@entry=0xa6114dc, forward=forward@entry=1 '\001', count=0, count@entry=2147483647, dest=dest@entry=0xa60d714) at pquery.c:942
> #11 0x08333fa7 in PortalRun (portal=portal@entry=0xa6114dc, count=count@entry=2147483647, isTopLevel=isTopLevel@entry=1 '\001', dest=dest@entry=0xa60d714, altdest=altdest@entry=0xa60d714, completionTag=completionTag@entry=0xffd5d71c "")
>      at pquery.c:786
> #12 0x08330ba8 in exec_simple_query (query_string=0xa5f1754 "select * from ams.alert_attribute_bak;") at postgres.c:1096
> #13 PostgresMain (argc=1, argv=0xa53dbbc, dbname=0xa53daec "ams", username=0xa53dadc "akamai") at postgres.c:4049
> #14 0x080b53af in BackendRun (port=0xa584b78) at postmaster.c:4312
> #15 BackendStartup (port=0xa584b78) at postmaster.c:3986
> #16 ServerLoop () at postmaster.c:1705
> #17 0x082d0dd7 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0xa53d2a8) at postmaster.c:1313
> #18 0x080b68eb in main (argc=3, argv=0xa53d2a8) at main.c:228
> (gdb)
> 
> 
> 
> Tom    :-)


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: postgres 9.5 DB corruption

From: Adrian Klaver
Date:
On 7/24/19 7:38 AM, Thomas Tignor wrote:
> Hello postgres community,
> 
> Writing again to see if there are insights on this issue. We have had 
> infrequent but recurring corruption since upgrading from postgres 9.1 to 
> postgres 9.5. We are presently on 9.5.16. Our DB-facing app continually 
> performs a mixture of DML, primarily inserts and updates on two specific 
> tables, with no single op being suspect. In the past, corruption events 
> have produced encoding errors on COPY operations (invalid byte sequence 
> for encoding "UTF8"). More recently, they have caused segmentation 
> faults. We were able to take a cold backup after a recent event. 
> SELECTing the corrupted data on our cold backup yields the following 
> stack. Any info on a solution or how to proceed towards a solution would 
> be much appreciated.
> 
> Thanks in advance.
> 

In my previous post, when I referred to table schema I meant that to 
include associated schema objects like triggers, constraints, etc. Basically 
what is returned by \d in psql.
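
For example, from the shell (database and table names taken from the trace in the first post; adjust as needed):

    psql -d ams -c '\d ams.alert_attribute_bak'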

> Tom    :-)


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Re: postgres 9.5 DB corruption

From: Thomas Tignor
Date:
Hi Adrian,
Thanks for responding. Below is the schema data for the tables where we always see corruption. You'll notice they have triggers for a postgres extension called Slony-I which provides replication service. It's not clear if/how that's a factor, though.

ams=# \d ams.alert_instance
                    Table "ams.alert_instance"
       Column        |              Type              | Modifiers 
---------------------+--------------------------------+-----------
 alert_instance_id   | integer                        | not null
 alert_definition_id | integer                        | not null
 alert_instance_key  | character varying(500)         | not null
 start_active_date   | timestamp(0) without time zone | not null
 stop_active_date    | timestamp(0) without time zone | 
 active              | smallint                       | not null
 acknowledged        | smallint                       | not null
 ack_clear_time      | timestamp(0) without time zone | 
 user_set_clear_time | smallint                       | 
 category_id         | integer                        | not null
 condition_start     | timestamp(0) without time zone | not null
 unack_reason        | character varying(1)           | 
 viewer_visible      | smallint                       | not null
Indexes:
    "pk_alert_instance" PRIMARY KEY, btree (alert_instance_id), tablespace "tbls5"
    "idx_alert_inst_1" btree (alert_instance_key, alert_definition_id, alert_instance_id, active, acknowledged, ack_clear_time), tablespace "tbls5"
    "idx_alert_inst_cat_id" btree (category_id), tablespace "tbls5"
    "idx_alert_inst_def_id" btree (alert_definition_id), tablespace "tbls5"
Check constraints:
    "ck_alert_inst_acked" CHECK (acknowledged = 0 OR acknowledged = 1)
    "ck_alert_inst_active" CHECK (active = 0 OR active = 1)
    "ck_alert_inst_set_cl_tm" CHECK (user_set_clear_time = 0 OR user_set_clear_time = 1)
    "ck_alert_inst_viewer_vis" CHECK (viewer_visible = 0 OR viewer_visible = 1)
Foreign-key constraints:
    "fk_alert_inst_cat_id" FOREIGN KEY (category_id) REFERENCES ams.category(category_id)
    "fk_alert_inst_def_id" FOREIGN KEY (alert_definition_id) REFERENCES ams.alert_definition(alert_definition_id)
    "fk_alert_inst_unack_reason" FOREIGN KEY (unack_reason) REFERENCES ams.unack_reason(unack_reason)
Referenced by:
    TABLE "ams.alert_attribute" CONSTRAINT "fk_alert_attr_instance_id" FOREIGN KEY (alert_instance_id) REFERENCES ams.alert_instance(alert_instance_id) ON DELETE CASCADE
Triggers:
    _ams_cluster_logtrigger AFTER INSERT OR DELETE OR UPDATE ON ams.alert_instance FOR EACH ROW EXECUTE PROCEDURE _ams_cluster.logtrigger('_ams_cluster', '1', 'k')
    _ams_cluster_truncatetrigger BEFORE TRUNCATE ON ams.alert_instance FOR EACH STATEMENT EXECUTE PROCEDURE _ams_cluster.log_truncate('1')
Disabled user triggers:
    _ams_cluster_denyaccess BEFORE INSERT OR DELETE OR UPDATE ON ams.alert_instance FOR EACH ROW EXECUTE PROCEDURE _ams_cluster.denyaccess('_ams_cluster')
    _ams_cluster_truncatedeny BEFORE TRUNCATE ON ams.alert_instance FOR EACH STATEMENT EXECUTE PROCEDURE _ams_cluster.deny_truncate()

ams=# 
ams=# \d ams.alert_attribute
               Table "ams.alert_attribute"
      Column       |          Type           | Modifiers 
-------------------+-------------------------+-----------
 alert_instance_id | integer                 | not null
 name              | character varying(200)  | not null
 data_type         | smallint                | not null
 value             | character varying(2000) | 
Indexes:
    "pk_alert_attributes" PRIMARY KEY, btree (alert_instance_id, name), tablespace "tbls5"
    "idx_alert_attr_name" btree (name)
Foreign-key constraints:
    "fk_alert_attr_instance_id" FOREIGN KEY (alert_instance_id) REFERENCES ams.alert_instance(alert_instance_id) ON DELETE CASCADE
Triggers:
    _ams_cluster_logtrigger AFTER INSERT OR DELETE OR UPDATE ON ams.alert_attribute FOR EACH ROW EXECUTE PROCEDURE _ams_cluster.logtrigger('_ams_cluster', '2', 'kk')
    _ams_cluster_truncatetrigger BEFORE TRUNCATE ON ams.alert_attribute FOR EACH STATEMENT EXECUTE PROCEDURE _ams_cluster.log_truncate('2')
Disabled user triggers:
    _ams_cluster_denyaccess BEFORE INSERT OR DELETE OR UPDATE ON ams.alert_attribute FOR EACH ROW EXECUTE PROCEDURE _ams_cluster.denyaccess('_ams_cluster')
    _ams_cluster_truncatedeny BEFORE TRUNCATE ON ams.alert_attribute FOR EACH STATEMENT EXECUTE PROCEDURE _ams_cluster.deny_truncate()

ams=# 



Tom    :-)


On Wednesday, July 24, 2019, 11:15:04 AM EDT, Adrian Klaver <adrian.klaver@aklaver.com> wrote:


On 7/24/19 7:38 AM, Thomas Tignor wrote:
> Hello postgres community,
>
> Writing again to see if there are insights on this issue. We have had
> infrequent but recurring corruption since upgrading from postgres 9.1 to
> postgres 9.5. We are presently on 9.5.16. Our DB-facing app continually
> performs a mixture of DML, primarily inserts and updates on two specific
> tables, with no single op being suspect. In the past, corruption events
> have produced encoding errors on COPY operations (invalid byte sequence
> for encoding "UTF8"). More recently, they have caused segmentation
> faults. We were able to take a cold backup after a recent event.
> SELECTing the corrupted data on our cold backup yields the following
> stack. Any info on a solution or how to proceed towards a solution would
> be much appreciated.
>
> Thanks in advance.
>

In my previous post, when I referred to table schema I meant that to
include associated schema objects like triggers, constraints, etc. Basically
what is returned by \d in psql.


> Tom    :-)



--
Adrian Klaver
adrian.klaver@aklaver.com



Re: postgres 9.5 DB corruption

From: Adrian Klaver
Date:
On 7/25/19 10:24 AM, Thomas Tignor wrote:
> Hi Adrian,
> Thanks for responding. Below is the schema data for the tables where we 
> always see corruption. You'll notice they have triggers for a postgres 
> extension called Slony-I which provides replication service. It's not 
> clear if/how that's a factor, though.

What specific version of Slony?

Did you upgrade Slony when you moved from 9.1 to 9.5?
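
One quick way to check, as a sketch (the _ams_cluster schema name is taken from the trigger definitions below; slonyversion() is the version function Slony-I creates in the cluster schema):

    ams=# SELECT _ams_cluster.slonyversion();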

The trace you showed in your first post was for:

    ams.alert_attribute_bak

I do not see that below.

Are the errors on any specific field?

The errors are occurring on the primary, correct?

Where is the data coming from?

> 
> ams=# \d ams.alert_instance
> 
>                     Table "ams.alert_instance"
>        Column        |              Type              | Modifiers 
> ---------------------+--------------------------------+-----------
>  alert_instance_id   | integer                        | not null
>  alert_definition_id | integer                        | not null
>  alert_instance_key  | character varying(500)         | not null
>  start_active_date   | timestamp(0) without time zone | not null
>  stop_active_date    | timestamp(0) without time zone | 
>  active              | smallint                       | not null
>  acknowledged        | smallint                       | not null
>  ack_clear_time      | timestamp(0) without time zone | 
>  user_set_clear_time | smallint                       | 
>  category_id         | integer                        | not null
>  condition_start     | timestamp(0) without time zone | not null
>  unack_reason        | character varying(1)           | 
>  viewer_visible      | smallint                       | not null
> Indexes:
>     "pk_alert_instance" PRIMARY KEY, btree (alert_instance_id), tablespace "tbls5"
>     "idx_alert_inst_1" btree (alert_instance_key, alert_definition_id, alert_instance_id, active, acknowledged, ack_clear_time), tablespace "tbls5"
>     "idx_alert_inst_cat_id" btree (category_id), tablespace "tbls5"
>     "idx_alert_inst_def_id" btree (alert_definition_id), tablespace "tbls5"
> Check constraints:
>     "ck_alert_inst_acked" CHECK (acknowledged = 0 OR acknowledged = 1)
>     "ck_alert_inst_active" CHECK (active = 0 OR active = 1)
>     "ck_alert_inst_set_cl_tm" CHECK (user_set_clear_time = 0 OR user_set_clear_time = 1)
>     "ck_alert_inst_viewer_vis" CHECK (viewer_visible = 0 OR viewer_visible = 1)
> Foreign-key constraints:
>     "fk_alert_inst_cat_id" FOREIGN KEY (category_id) REFERENCES ams.category(category_id)
>     "fk_alert_inst_def_id" FOREIGN KEY (alert_definition_id) REFERENCES ams.alert_definition(alert_definition_id)
>     "fk_alert_inst_unack_reason" FOREIGN KEY (unack_reason) REFERENCES ams.unack_reason(unack_reason)
> Referenced by:
>     TABLE "ams.alert_attribute" CONSTRAINT "fk_alert_attr_instance_id" FOREIGN KEY (alert_instance_id) REFERENCES ams.alert_instance(alert_instance_id) ON DELETE CASCADE
> Triggers:
>     _ams_cluster_logtrigger AFTER INSERT OR DELETE OR UPDATE ON ams.alert_instance FOR EACH ROW EXECUTE PROCEDURE _ams_cluster.logtrigger('_ams_cluster', '1', 'k')
>     _ams_cluster_truncatetrigger BEFORE TRUNCATE ON ams.alert_instance FOR EACH STATEMENT EXECUTE PROCEDURE _ams_cluster.log_truncate('1')
> Disabled user triggers:
>     _ams_cluster_denyaccess BEFORE INSERT OR DELETE OR UPDATE ON ams.alert_instance FOR EACH ROW EXECUTE PROCEDURE _ams_cluster.denyaccess('_ams_cluster')
>     _ams_cluster_truncatedeny BEFORE TRUNCATE ON ams.alert_instance FOR EACH STATEMENT EXECUTE PROCEDURE _ams_cluster.deny_truncate()
> 
> ams=#
> 
> ams=# \d ams.alert_attribute
> 
>                Table "ams.alert_attribute"
>       Column       |          Type           | Modifiers 
> -------------------+-------------------------+-----------
>  alert_instance_id | integer                 | not null
>  name              | character varying(200)  | not null
>  data_type         | smallint                | not null
>  value             | character varying(2000) | 
> Indexes:
>     "pk_alert_attributes" PRIMARY KEY, btree (alert_instance_id, name), tablespace "tbls5"
>     "idx_alert_attr_name" btree (name)
> Foreign-key constraints:
>     "fk_alert_attr_instance_id" FOREIGN KEY (alert_instance_id) REFERENCES ams.alert_instance(alert_instance_id) ON DELETE CASCADE
> Triggers:
>     _ams_cluster_logtrigger AFTER INSERT OR DELETE OR UPDATE ON ams.alert_attribute FOR EACH ROW EXECUTE PROCEDURE _ams_cluster.logtrigger('_ams_cluster', '2', 'kk')
>     _ams_cluster_truncatetrigger BEFORE TRUNCATE ON ams.alert_attribute FOR EACH STATEMENT EXECUTE PROCEDURE _ams_cluster.log_truncate('2')
> Disabled user triggers:
>     _ams_cluster_denyaccess BEFORE INSERT OR DELETE OR UPDATE ON ams.alert_attribute FOR EACH ROW EXECUTE PROCEDURE _ams_cluster.denyaccess('_ams_cluster')
>     _ams_cluster_truncatedeny BEFORE TRUNCATE ON ams.alert_attribute FOR EACH STATEMENT EXECUTE PROCEDURE _ams_cluster.deny_truncate()
> 
> ams=#
> 
> 
> 
> Tom    :-)
> 
> 
> On Wednesday, July 24, 2019, 11:15:04 AM EDT, Adrian Klaver 
> <adrian.klaver@aklaver.com> wrote:
> 
> 
> On 7/24/19 7:38 AM, Thomas Tignor wrote:
>  > Hello postgres community,
>  >
>  > Writing again to see if there are insights on this issue. We have had
>  > infrequent but recurring corruption since upgrading from postgres 9.1 to
>  > postgres 9.5. We are presently on 9.5.16. Our DB-facing app continually
>  > performs a mixture of DML, primarily inserts and updates on two specific
>  > tables, with no single op being suspect. In the past, corruption events
>  > have produced encoding errors on COPY operations (invalid byte sequence
>  > for encoding "UTF8"). More recently, they have caused segmentation
>  > faults. We were able to take a cold backup after a recent event.
>  > SELECTing the corrupted data on our cold backup yields the following
>  > stack. Any info on a solution or how to proceed towards a solution would
>  > be much appreciated.
>  >
>  > Thanks in advance.
>  >
> 
> In my previous post, when I referred to table schema I meant that to
> include associated schema objects like triggers, constraints, etc. Basically
> what is returned by \d in psql.
> 
> 
>  > Tom    :-)
> 
> 
> 
> -- 
> Adrian Klaver
> adrian.klaver@aklaver.com
> 
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com