Re: free(): invalid pointer, when dumping data to a new server

From: Lars Aksel Opsahl
Subject: Re: free(): invalid pointer, when dumping data to a new server
Date:
Msg-id: AM7P189MB1028756DDD017F9828C8FDE99D1D0@AM7P189MB1028.EURP189.PROD.OUTLOOK.COM
In reply to: free(): invalid pointer, when dumping data to a new server (Lars Aksel Opsahl <Lars.Opsahl@nibio.no>)
Responses: Re: free(): invalid pointer, when dumping data to a new server (Lars Aksel Opsahl <Lars.Opsahl@nibio.no>)
List: pgsql-admin



From: Lars Aksel Opsahl <Lars.Opsahl@nibio.no>
Sent: Thursday, October 22, 2020 9:32 AM
To: pgsql-admin@lists.postgresql.org <pgsql-admin@lists.postgresql.org>
Subject: free(): invalid pointer, when dumping data to a new server
 
>
>Hi
>
>
>I have a problem with a new server that crashes when restoring data to it. It's around 1.5 TB of raw data with both spatial and raster data.
>
>
>I use Postgres 12.4 and POSTGIS="3.0.2 2fb2a18" [EXTENSION] PGSQL="120" GEOS="3.8.1-CAPI-1.13.3" SFCGAL="1.3.7" PROJ="7.1.1" GDAL="GDAL 3.1.3, released 2020/09/01" LIBXML="2.9.7" LIBJSON="0.13.1" LIBPROTOBUF="1.3.0" WAGYU="0.4.3 (Internal)" TOPOLOGY RASTER
>
>(I have also tested with PostGIS 3.1 and the latest from master.)
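>
>For reference, the version strings above come from the standard built-ins; a minimal sketch to reproduce them (database name "sl" as used in the restore commands below):
>
>psql -d sl -c 'SELECT version();'                # PostgreSQL server version
>psql -d sl -c 'SELECT postgis_full_version();'   # the PostGIS/GEOS/PROJ/GDAL line above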
>
>
>For the OS I use CentOS Linux release 8.2.2004 (Core) with kernel 4.18.0-193.19.1.el8_2.x86_64, on a server with 64 “Intel(R) Xeon(R) Gold 6246R CPU @ 3.40GHz” cores.
>
>
>First I restored 3 schemas with no errors in the logs:
>
>388G data/sl/schema_grl_met_wisline_v2
>49G data/sl/schema_met_vaer_dyrkbar
>118G data/sl/schema_met_vaer_wisline
>
>
>using these commands:
>
>pg_restore -j 50 --format=d -d sl data/sl/schema_grl_met_wisline_v2/ 2> log/sl/restor_grl_met_wisline_v2.log
>pg_restore -j 50 --format=d -d sl data/sl/schema_met_vaer_wisline/ 2> log/sl/restor_met_vaer_wisline.log
>pg_restore -j 50 --format=d -d sl data/sl/schema_met_vaer_dyrkbar/ 2> log/sl/restor_met_vaer_dyrkbar.log
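>
>A rough post-restore sanity check is to compare the on-disk size of a restored schema against the dump sizes above; a sketch using the standard size functions (schema name taken from the logs further down):
>
># total size (tables + indexes + TOAST) of one restored schema
>psql -d sl -c "SELECT pg_size_pretty(sum(pg_total_relation_size(oid))) FROM pg_class WHERE relnamespace = 'grl_met_wisline_v2'::regnamespace AND relkind = 'r';"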
>
>
>Then I started restoring the rest of the data, below.
>
>744G data/sl/schema_sl_rest_ofd
>
>
>Here I split the restore job into 3 parts, as below.
>
>
>Command 1 below went OK, with no errors in the client log file or on the server side.
>
>pg_restore --section=pre-data -1 --format=d -d sl data/sl/schema_sl_rest_ofd/ 2> log/sl/restore_sl_rest_ofdata_pre_data.log
>
>
>Command 2 below went OK according to the logs on the client side, but not according to the logs on the server side, and the last table, which has more than one billion rows, ended up empty.
>
>pg_restore --section=data -j 50 --format=d -d sl data/sl/schema_sl_rest_ofd/ 2> log/sl/restore_sl_rest_ofdata_data.log
>
>
>I do not understand why there are no errors in the client log file, because on the server side I see this:
>
>< [local] postgres, sl, 2020-10-22 00:52:47.823 CEST >LOG: duration: 3417759.209 ms statement: COPY grl_sk.dtm10_norge_retiled (rid, rast, filename) FROM stdin;
>< [local] postgres, sl, 2020-10-22 00:56:05.771 CEST >LOG: duration: 3615706.705 ms statement: COPY sl_sos.lsk_arome_v1 (flateid, point_uid_ref, epoch_hours, tg_l03, tg_l04, tswi_l03, tswi_l04, tg_l05, tswi_l05, pr, t2m, sfcwindl, snw, twpr) FROM stdin;
>free(): invalid pointer
>free(): invalid pointer
>free(): invalid pointer
>free(): invalid pointer
>free(): invalid pointer
>free(): invalid pointer
>free(): invalid pointer
>free(): invalid pointer
>free(): invalid pointer
>free(): invalid pointer
>free(): invalid pointer
>free(): invalid pointer
>free(): invalid pointer
>free(): invalid pointer
>free(): invalid pointer
>free(): invalid pointer
>free(): invalid pointer
>free(): invalid pointer
>free(): invalid pointer
>free(): invalid pointer
>free(): invalid pointer
>free(): invalid pointer
>free(): invalid pointer
>free(): invalid pointer
>free(): invalid pointer
>free(): invalid pointer
>free(): invalid pointer
>< , , 2020-10-22 00:56:05.830 CEST >LOG: server process (PID 516912) was terminated by signal 6: Aborted
>< , , 2020-10-22 00:56:05.830 CEST >LOG: terminating any other active server processes
>< , , 2020-10-22 00:56:05.830 CEST >WARNING: terminating connection because of crash of another server process
>< , , 2020-10-22 00:56:05.830 CEST >DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
>< , , 2020-10-22 00:56:05.830 CEST >HINT: In a moment you should be able to reconnect to the database and repeat your command.
>< , , 2020-10-22 00:56:05.830 CEST >WARNING: terminating connection because of crash of another server process
>< , , 2020-10-22 00:56:05.830 CEST >DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
>< , , 2020-10-22 00:56:05.830 CEST >HINT: In a moment you should be able to reconnect to the database and repeat your command.
>< , , 2020-10-22 00:56:05.830 CEST >WARNING: terminating connection because of crash of another server process
>< , , 2020-10-22 00:56:05.830 CEST >DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
>< , , 2020-10-22 00:56:05.830 CEST >HINT: In a moment you should be able to reconnect to the database and repeat your command.
>< , , 2020-10-22 00:56:05.830 CEST >WARNING: terminating connection because of crash of another server process
>< , , 2020-10-22 00:56:05.830 CEST >DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
>< , , 2020-10-22 00:56:05.830 CEST >HINT: In a moment you should be able to reconnect to the database and repeat your command.
>< , , 2020-10-22 00:56:10.892 CEST >LOG: all server processes terminated; reinitializing
>< , , 2020-10-22 00:56:11.324 CEST >LOG: database system was interrupted; last known up at 2020-10-22 00:52:33 CEST
>< , , 2020-10-22 00:56:11.480 CEST >LOG: database system was not properly shut down; automatic recovery in progress
>< , , 2020-10-22 00:56:11.521 CEST >LOG: redo starts at B9C/EA83BBF0
>< , , 2020-10-22 00:56:35.074 CEST >LOG: redo done at BA0/FDFFEF48
>< , , 2020-10-22 00:56:51.250 CEST >LOG: database system is ready to accept connections
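>
>Since the backend died on signal 6 (glibc aborting on the bad free()), a stack trace from the core dump would show where the free() happens; a minimal sketch, assuming systemd-coredump is enabled on this CentOS 8 machine:
>
>coredumpctl list postgres    # find the crashed backend (PID 516912 in the log above)
>coredumpctl gdb 516912       # open the matching core in gdb; then "bt" prints the backtrace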
>
>
>I don't care about the last table, we don't need that data, so I ran the last command below.
>
>pg_restore --section=post-data -j 1 --format=d -d sl data/sl/schema_sl_rest_ofd/ 2> log/sl/restore_sl_rest_ofdata_post_data.log
>
>
>Then I saw this in the logs, so I had to run the above command multiple times.
>
>
>First time:
>
>< [local] postgres, sl, 2020-10-22 03:46:27.349 CEST >LOG: duration: 1746.370 ms statement: CREATE INDEX o_8_dtm10_norge_retiled_st_convexhull_idx ON grl_sk.o_8_dtm10_norge_retiled USING gist (public.st_convexhull(rast));
>free(): invalid pointer
>free(): invalid pointer
>< , , 2020-10-22 03:46:30.283 CEST >LOG: background worker "parallel worker" (PID 658814) was terminated by signal 6: Aborted
>< , , 2020-10-22 03:46:30.283 CEST >LOG: terminating any other active server processes
>< [local] postgres, sl, 2020-10-22 03:46:30.283 CEST >WARNING: terminating connection because of crash of another server process
>< [local] postgres, sl, 2020-10-22 03:46:30.283 CEST >DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
>< [local] postgres, sl, 2020-10-22 03:46:30.283 CEST >HINT: In a moment you should be able to reconnect to the database and repeat your command.
>< , , 2020-10-22 03:46:30.283 CEST >WARNING: terminating connection because of crash of another server process
>< , , 2020-10-22 03:46:30.283 CEST >DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
>< , , 2020-10-22 03:46:30.283 CEST >HINT: In a moment you should be able to reconnect to the database and repeat your command.
>< , , 2020-10-22 03:46:30.599 CEST >LOG: all server processes terminated; reinitializing
>< , , 2020-10-22 03:46:31.019 CEST >LOG: database system was interrupted; last known up at 2020-10-22 03:41:23 CEST
>< , , 2020-10-22 03:46:31.178 CEST >LOG: database system was not properly shut down; automatic recovery in progress
>< , , 2020-10-22 03:46:31.224 CEST >LOG: redo starts at BAA/B07178F8
>< , , 2020-10-22 03:46:37.209 CEST >LOG: redo done at BAC/7F3DFD70
>
>
>Second time:
>
>< [local] postgres, sl, 2020-10-22 08:13:08.882 CEST >LOG: duration: 417316.937 ms statement: CREATE INDEX geoidx_org_forestmap____ostl3_predictions ON org_forestmap.ostl3_predictions USING gist (geo);
>< [local] munin, template1, 2020-10-22 08:13:36.797 CEST >FATAL: role "munin" does not exist
>< [local] munin, template1, 2020-10-22 08:13:36.835 CEST >FATAL: role "munin" does not exist
>< [local] postgres, sl_help, 2020-10-22 08:13:39.106 CEST >FATAL: database "sl_help" does not exist
>< [local] postgres, sl_help, 2020-10-22 08:13:39.183 CEST >FATAL: database "sl_help" does not exist
>< [local] munin, template1, 2020-10-22 08:13:44.607 CEST >FATAL: role "munin" does not exist
>< [local] munin, template1, 2020-10-22 08:13:44.684 CEST >FATAL: role "munin" does not exist
>< [local] munin, template1, 2020-10-22 08:18:36.225 CEST >FATAL: role "munin" does not exist
>< [local] munin, template1, 2020-10-22 08:18:36.266 CEST >FATAL: role "munin" does not exist
>< [local] postgres, sl_help, 2020-10-22 08:18:38.611 CEST >FATAL: database "sl_help" does not exist
>< [local] postgres, sl_help, 2020-10-22 08:18:38.691 CEST >FATAL: database "sl_help" does not exist
>< [local] munin, template1, 2020-10-22 08:18:44.234 CEST >FATAL: role "munin" does not exist
>< [local] munin, template1, 2020-10-22 08:18:44.314 CEST >FATAL: role "munin" does not exist
>free(): invalid pointer
>free(): invalid pointer
>< , , 2020-10-22 08:19:58.932 CEST >LOG: background worker "parallel worker" (PID 835249) was terminated by signal 6: Aborted
>< , , 2020-10-22 08:19:58.932 CEST >LOG: terminating any other active server processes
>< [local] postgres, sl, 2020-10-22 08:19:58.932 CEST >WARNING: terminating connection because of crash of another server process
>< [local] postgres, sl, 2020-10-22 08:19:58.932 CEST >DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
>< [local] postgres, sl, 2020-10-22 08:19:58.932 CEST >HINT: In a moment you should be able to reconnect to the database and repeat your command.
>< , , 2020-10-22 08:19:58.932 CEST >WARNING: terminating connection because of crash of another server process
>< , , 2020-10-22 08:19:58.932 CEST >DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
>< , , 2020-10-22 08:19:58.932 CEST >HINT: In a moment you should be able to reconnect to the database and repeat your command.
>< 192.168.229.48 postgres, sl, 2020-10-22 08:19:58.932 CEST >WARNING: terminating connection because of crash of another server process
>< 192.168.229.48 postgres, sl, 2020-10-22 08:19:58.932 CEST >DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
>< 192.168.229.48 postgres, sl, 2020-10-22 08:19:58.932 CEST >HINT: In a moment you should be able to reconnect to the database and repeat your command.
>< 192.168.229.48 postgres, sl, 2020-10-22 08:19:58.970 CEST >FATAL: the database system is in recovery mode
>< , , 2020-10-22 08:19:59.041 CEST >LOG: all server processes terminated; reinitializing
>< , , 2020-10-22 08:19:59.476 CEST >LOG: database system was interrupted; last known up at 2020-10-22 08:11:17 CEST
>< , , 2020-10-22 08:20:01.901 CEST >LOG: database system was not properly shut down; automatic recovery in progress
>< , , 2020-10-22 08:20:02.005 CEST >LOG: redo starts at BB1/AA4309A0
>< , , 2020-10-22 08:20:03.517 CEST >LOG: invalid record length at BB1/FBF33FB0: wanted 24, got 0
>< , , 2020-10-22 08:20:03.517 CEST >LOG: redo done at BB1/FBF33F78
>< , , 2020-10-22 08:20:05.573 CEST >LOG: database system is ready to accept connections
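>
>Both post-data crashes above killed a "parallel worker" during CREATE INDEX, so one variable worth isolating (a sketch, not a known fix) is the parallel index build; pg_restore connects via libpq, which honors PGOPTIONS:
>
># retry post-data with parallel maintenance workers disabled for these sessions
>PGOPTIONS='-c max_parallel_maintenance_workers=0' pg_restore --section=post-data -j 1 --format=d -d sl data/sl/schema_sl_rest_ofd/ 2> log/sl/restore_sl_rest_ofdata_post_data.log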
>
>
>Any idea why this happens, and what can I do to fix it?
>
>
>Do you believe Postgres 13 will work better?
>
>Thanks
>
>Lars

I did a test with Postgres 13. Right after running the command "pg_restore --section=pre-data -1 --format=d -d sl data/sl/schema_sl_rest_ofd", in which the rest of the tables and schemas were created with no errors, and before running "pg_restore --section=data -j 50 --format=d -d sl data/sl/schema_sl_rest_ofd", I found errors in the server log.

(The table grl_met_wisline_v2.nora_bc25_observation_200912 had restored with no errors two hours earlier:
"LOG:  duration: 589215.382 ms  statement: COPY grl_met_wisline_v2.nora_bc25_observation_200912 (point_uid_ref, epoch_hours, tg_l03, tg_l04, tswi_l03, tswi_l04, tg_l05, tswi_l05, pr, t2m, sfcwindl, snw, twpr) FROM stdin")

In the server log I found this:

<  , , 2020-10-22 14:23:44.167 CEST >LOG:  server process (PID 1085979) was terminated by signal 6: Aborted
<  , , 2020-10-22 14:23:44.167 CEST >LOG:  terminating any other active server processes
<  , , 2020-10-22 14:23:44.167 CEST >WARNING:  terminating connection because of crash of another server process
<  , , 2020-10-22 14:23:44.167 CEST >DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
<  , , 2020-10-22 14:23:44.167 CEST >HINT:  In a moment you should be able to reconnect to the database and repeat your command.
<  , , 2020-10-22 14:23:44.167 CEST >CONTEXT:  while scanning block 393158 of relation "grl_met_wisline_v2.nora_bc25_observation_201001"
<  , , 2020-10-22 14:23:44.167 CEST >WARNING:  terminating connection because of crash of another server process
<  , , 2020-10-22 14:23:44.167 CEST >DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
<  , , 2020-10-22 14:23:44.167 CEST >HINT:  In a moment you should be able to reconnect to the database and repeat your command.
<  , , 2020-10-22 14:23:44.167 CEST >CONTEXT:  while scanning block 59409 of relation "grl_met_wisline_v2.nora_bc25_observation_200912"
<  , , 2020-10-22 14:23:44.167 CEST >WARNING:  terminating connection because of crash of another server process
<  , , 2020-10-22 14:23:44.167 CEST >DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
<  , , 2020-10-22 14:23:44.167 CEST >HINT:  In a moment you should be able to reconnect to the database and repeat your command.
<  , , 2020-10-22 14:23:44.167 CEST >CONTEXT:  while scanning block 51913 of relation "grl_met_wisline_v2.nora_bc25_observation_200910"
<  , , 2020-10-22 14:23:44.167 CEST >WARNING:  terminating connection because of crash of another server process
<  , , 2020-10-22 14:23:44.167 CEST >DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
<  , , 2020-10-22 14:23:44.167 CEST >HINT:  In a moment you should be able to reconnect to the database and repeat your command.
<  , , 2020-10-22 14:23:44.185 CEST >LOG:  all server processes terminated; reinitializing
<  , , 2020-10-22 14:23:44.616 CEST >LOG:  database system was interrupted; last known up at 2020-10-22 14:23:21 CEST
<  , , 2020-10-22 14:23:44.782 CEST >LOG:  database system was not properly shut down; automatic recovery in progress
<  , , 2020-10-22 14:23:44.823 CEST >LOG:  redo starts at 146/A2485468
<  , , 2020-10-22 14:23:46.035 CEST >LOG:  redo done at 146/A83A7F98
<  , , 2020-10-22 14:23:47.043 CEST >LOG:  database system is ready to accept connections
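
The CONTEXT lines name the exact blocks being scanned when the workers died, so those pages can at least be read back directly; a sketch using the contrib pageinspect extension (relation and block number taken from the log above; this only reads the page, it repairs nothing):

# inspect the page header of one suspect block
psql -d sl -c "CREATE EXTENSION IF NOT EXISTS pageinspect;"
psql -d sl -c "SELECT * FROM page_header(get_raw_page('grl_met_wisline_v2.nora_bc25_observation_201001', 393158));"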

Lars


