Обсуждение: logical replication snapshots

Поиск
Список
Период
Сортировка

logical replication snapshots

От
Dimitri Maziuk
Дата:
Hi everyone,

I'm playing with logical replication in postgres-10 and I got it to hang
on startup with

> 2018-07-23 13:15:06.577 CDT [18624] FATAL:  the database system is starting up
> 2018-07-23 13:15:06.577 CDT [18624] DEBUG:  shmem_exit(1): 0 before_shmem_exit callbacks to make
> 2018-07-23 13:15:06.577 CDT [18624] DEBUG:  shmem_exit(1): 0 on_shmem_exit callbacks to make
> 2018-07-23 13:15:06.577 CDT [18624] DEBUG:  proc_exit(1): 1 callbacks to make
> 2018-07-23 13:15:06.577 CDT [18624] DEBUG:  exit(1)
> 2018-07-23 13:15:06.577 CDT [18624] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
> 2018-07-23 13:15:06.577 CDT [18624] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
> 2018-07-23 13:15:06.577 CDT [18624] DEBUG:  proc_exit(-1): 0 callbacks to make
... forever ...

After poking around:

# find /var/lib/pgsql/10/data/pg_logical/snapshots -printf . | wc -c
12727975

It looks like ZFS simply can't handle this many files: `ls` etc. hang
forever.

So my question is, does it have to be that many? And if not, is there a
knob to keep the number down?

TIA
--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu


Вложения

Re: logical replication snapshots

От
Andres Freund
Дата:
Hi,

On 2018-07-23 15:23:04 -0500, Dimitri Maziuk wrote:
> I'm playing with logical replication in postgres-10 and I got it to hang
> on startup with
> 
> > 2018-07-23 13:15:06.577 CDT [18624] FATAL:  the database system is starting up
> > 2018-07-23 13:15:06.577 CDT [18624] DEBUG:  shmem_exit(1): 0 before_shmem_exit callbacks to make
> > 2018-07-23 13:15:06.577 CDT [18624] DEBUG:  shmem_exit(1): 0 on_shmem_exit callbacks to make
> > 2018-07-23 13:15:06.577 CDT [18624] DEBUG:  proc_exit(1): 1 callbacks to make
> > 2018-07-23 13:15:06.577 CDT [18624] DEBUG:  exit(1)
> > 2018-07-23 13:15:06.577 CDT [18624] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
> > 2018-07-23 13:15:06.577 CDT [18624] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
> > 2018-07-23 13:15:06.577 CDT [18624] DEBUG:  proc_exit(-1): 0 callbacks to make
> ... forever ...

What precedes that "loop"?


Could you describe what exactly you did into that situation?


> After poking around:
> 
> # find /var/lib/pgsql/10/data/pg_logical/snapshots -printf . | wc -c
> 12727975

Ugh, something absolutely clearly isn't right here.


> It looks like ZFS simply can't handle this many files: `ls` etc. hang
> forever.

ls probably works in general, it's just the sorting that makes it
hang. Should work with -f or such.


Greetings,

Andres Freund


Re: logical replication snapshots

От
Dimitri Maziuk
Дата:
On 7/23/2018 6:51 PM, Andres Freund wrote:

> What precedes that "loop"?

systemctl start postgresql-10

> Could you describe what exactly you did into that situation?

Created a database, few publications, as per TFM, and a few 
subscriptions on another host. Went on vacation for 3 weeks. The problem 
host is centos 7.current w/ postgres rpms from PGDG repo. Postgres is on 
port 5434 because there is a stock one on 5432.

> ls probably works in general, it's just the sorting that makes it
> hang. Should work with -f or such.

I don't think this means what you think this means.

There are *13 million* files in 
/var/lib/pgsql/10/data/pg_logical/snapshots and postgres hangs forever 
on start with
FATAL:  the database system is starting up
FATAL:  the database system is starting up
FATAL:  the database system is starting up
...
I think the two are related. ICBW and all that. What do I do no not get 
it into that state again?

Thx


Re: logical replication snapshots

От
Alvaro Herrera
Дата:
On 2018-Jul-24, Dimitri Maziuk wrote:

> On 7/23/2018 6:51 PM, Andres Freund wrote:

> > Could you describe what exactly you did into that situation?
> 
> Created a database, few publications, as per TFM, and a few subscriptions on
> another host. Went on vacation for 3 weeks. The problem host is centos
> 7.current w/ postgres rpms from PGDG repo. Postgres is on port 5434 because
> there is a stock one on 5432.

Are you serious about us trying to diagnose a bug with this description?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: logical replication snapshots

От
Dimitri Maziuk
Дата:
On 07/24/2018 12:21 PM, Alvaro Herrera wrote:
>
> Are you serious about us trying to diagnose a bug with this description?

What do you want to know, exactly?

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu


Вложения

Re: logical replication snapshots

От
Andres Freund
Дата:
On 2018-07-24 12:22:24 -0500, Dimitri Maziuk wrote:
> On 07/24/2018 12:21 PM, Alvaro Herrera wrote:
> > 
> > Are you serious about us trying to diagnose a bug with this description?
> 
> What do you want to know, exactly?

A recipe that we can follow and reproduce the issue.

Greetings,

Andres Freund


Re: logical replication snapshots

От
Dimitri Maziuk
Дата:
On 07/24/2018 01:43 PM, Andres Freund wrote:
> On 2018-07-24 12:22:24 -0500, Dimitri Maziuk wrote:
>> On 07/24/2018 12:21 PM, Alvaro Herrera wrote:
>>>
>>> Are you serious about us trying to diagnose a bug with this description?
>>
>> What do you want to know, exactly?
>
> A recipe that we can follow and reproduce the issue.

The nutshell version is as I said: I pretty much followed the manual to
create a test publication/subscription setup and let it run for a few
weeks. I had to increase a few limits but otherwise everything's at the
default settings as shipped in the rpms.

I can send you postgres config files from both nodes and the entire
database dump -- it's all public domain. I can `zfs send` you the
snapshot of the entire /var/lib/pgsql/10 as is, too.

If you are asking for something other than those two extremes, please
let me know what it is and I'll be happy to try and provide it.

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu


Вложения

Re: logical replication snapshots

От
Andres Freund
Дата:
On 2018-07-24 14:02:26 -0500, Dimitri Maziuk wrote:
> On 07/24/2018 01:43 PM, Andres Freund wrote:
> > On 2018-07-24 12:22:24 -0500, Dimitri Maziuk wrote:
> >> On 07/24/2018 12:21 PM, Alvaro Herrera wrote:
> >>>
> >>> Are you serious about us trying to diagnose a bug with this description?
> >>
> >> What do you want to know, exactly?
> > 
> > A recipe that we can follow and reproduce the issue.
> 
> The nutshell version is as I said: I pretty much followed the manual to
> create a test publication/subscription setup and let it run for a few
> weeks. I had to increase a few limits but otherwise everything's at the
> default settings as shipped in the rpms.

Are you really expecting us to be able to reproduce the problem based on
the above description? Our test suites do setup plain replications
setups, and the problem doesn't occur there.


> If you are asking for something other than those two extremes, please
> let me know what it is and I'll be happy to try and provide it.

A command by command reproducer would be good.


> I can send you postgres config files from both nodes and the entire
> database dump -- it's all public domain. I can `zfs send` you the
> snapshot of the entire /var/lib/pgsql/10 as is, too.

Without the genesis, that's probably not as helpful, unfortunately.


Greetings,

Andres Freund


Re: logical replication snapshots

От
Dimitri Maziuk
Дата:
On 7/25/2018 10:28 AM, Andres Freund wrote:
> 
> Are you really expecting us to be able to reproduce the problem based on
> the above description? Our test suites do setup plain replications
> setups, and the problem doesn't occur there.

I don't, by definition, have a reproducible case: it only happened once 
so far.

If nobody knows what limits the number of files created in 
$PGDATA/pg_logical/snapshots then we'll all have to wait until this 
happens again.

(To somebody else as I'm obviously not turning logical replication back 
on until I know it won't kill my server again.)

Dima


Re: logical replication snapshots

От
Adrian Klaver
Дата:
On 07/25/2018 08:40 AM, Dimitri Maziuk wrote:
> On 7/25/2018 10:28 AM, Andres Freund wrote:
>>
>> Are you really expecting us to be able to reproduce the problem based on
>> the above description? Our test suites do setup plain replications
>> setups, and the problem doesn't occur there.
> 
> I don't, by definition, have a reproducible case: it only happened once 
> so far.

Where you using pg_export_snapshot() by any chance?:

https://www.postgresql.org/docs/10/static/functions-admin.html#FUNCTIONS-SNAPSHOT-SYNCHRONIZATION


Where there any relevant error messages in the log before the database hung?


> 
> If nobody knows what limits the number of files created in 
> $PGDATA/pg_logical/snapshots then we'll all have to wait until this 
> happens again.
> 
> (To somebody else as I'm obviously not turning logical replication back 
> on until I know it won't kill my server again.)

Given that it took 3 weeks to manifest itself before, I would say give 
it a try and monitor $PGDATA/pg_logical/snapshots. That would help 
provide information for getting at the source of the problem. You can 
always disable the replication if it looks like it running away.

> 
> Dima
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com


Re: logical replication snapshots

От
Dimitri Maziuk
Дата:
On 07/25/2018 11:16 AM, Adrian Klaver wrote:

> Where you using pg_export_snapshot() by any chance?:

If it's not in chapter 31 of TFM then I have not touched it. The only
notable thing I changed from 31.9 quickstart is that replica identity is
full on all tables (the data's a mess).

> Where there any relevant error messages in the log before the database
> hung?

Do you seriously expect me to know what "relevant" is? ;) -- It's 7
days' worth of

> 2018-07-23 09:43:12.851 CDT [7975] STATEMENT:  COPY macromolecules."Entity_common_name" TO STDOUT
> 2018-07-23 09:43:12.925 CDT [7982] LOG:  logical decoding found consistent point at 19/E6942408
> 2018-07-23 09:43:12.925 CDT [7982] DETAIL:  There are no running transactions.
> 2018-07-23 09:43:12.935 CDT [7982] ERROR:  permission denied for schema macromolecules
> 2018-07-23 09:43:12.935 CDT [7982] STATEMENT:  COPY macromolecules."Assembly" TO STDOUT
> 2018-07-23 09:43:13.034 CDT [7987] LOG:  logical decoding found consistent point at 19/E6942440
> 2018-07-23 09:43:13.034 CDT [7987] DETAIL:  There are no running transactions.
> 2018-07-23 09:43:13.044 CDT [7987] ERROR:  permission denied for schema macromolecules
> 2018-07-23 09:43:13.044 CDT [7987] STATEMENT:  COPY macromolecules."Spectral_transition" TO STDOUT
> 2018-07-23 09:43:13.135 CDT [7990] LOG:  logical decoding found consistent point at 19/E6942478
> 2018-07-23 09:43:13.135 CDT [7990] DETAIL:  There are no running transactions.
> 2018-07-23 09:43:13.145 CDT [7990] ERROR:  permission denied for schema macromolecules
> 2018-07-23 09:43:13.145 CDT [7990] STATEMENT:  COPY macromolecules."Assembly_db_link" TO STDOUT

that ends with

> 2018-07-23 09:45:40.280 CDT [6032] LOG:  database system was interrupted; last known up at 2018-07-06 17:04:28 CDT
> 2018-07-23 09:45:40.281 CDT [6033] FATAL:  the database system is starting up
> 2018-07-23 09:45:40.282 CDT [6034] FATAL:  the database system is starting up
... ad infinitum ...

Would "permission denied" be relevant?

(Not sure where that is coming from: I did initial sync as postgres then
altered it to replication user. Can't track it down now since the
database is dead.)

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu


Вложения

Re: logical replication snapshots

От
Adrian Klaver
Дата:
On 07/25/2018 11:10 AM, Dimitri Maziuk wrote:
> On 07/25/2018 11:16 AM, Adrian Klaver wrote:
> 
>> Where you using pg_export_snapshot() by any chance?:
> 
> If it's not in chapter 31 of TFM then I have not touched it. The only
> notable thing I changed from 31.9 quickstart is that replica identity is
> full on all tables (the data's a mess).

So no.

> 
>> Where there any relevant error messages in the log before the database
>> hung?
> 
> Do you seriously expect me to know what "relevant" is? ;) -- It's 7
> days' worth of

Generally speaking relevancy increases the closer you get to the fall 
over event. So we are most interested in entries close to the point 
where the database stopped/failed.

> 
>> 2018-07-23 09:43:12.851 CDT [7975] STATEMENT:  COPY macromolecules."Entity_common_name" TO STDOUT
>> 2018-07-23 09:43:12.925 CDT [7982] LOG:  logical decoding found consistent point at 19/E6942408
>> 2018-07-23 09:43:12.925 CDT [7982] DETAIL:  There are no running transactions.
>> 2018-07-23 09:43:12.935 CDT [7982] ERROR:  permission denied for schema macromolecules
>> 2018-07-23 09:43:12.935 CDT [7982] STATEMENT:  COPY macromolecules."Assembly" TO STDOUT
>> 2018-07-23 09:43:13.034 CDT [7987] LOG:  logical decoding found consistent point at 19/E6942440
>> 2018-07-23 09:43:13.034 CDT [7987] DETAIL:  There are no running transactions.
>> 2018-07-23 09:43:13.044 CDT [7987] ERROR:  permission denied for schema macromolecules
>> 2018-07-23 09:43:13.044 CDT [7987] STATEMENT:  COPY macromolecules."Spectral_transition" TO STDOUT
>> 2018-07-23 09:43:13.135 CDT [7990] LOG:  logical decoding found consistent point at 19/E6942478
>> 2018-07-23 09:43:13.135 CDT [7990] DETAIL:  There are no running transactions.
>> 2018-07-23 09:43:13.145 CDT [7990] ERROR:  permission denied for schema macromolecules
>> 2018-07-23 09:43:13.145 CDT [7990] STATEMENT:  COPY macromolecules."Assembly_db_link" TO STDOUT
> 
> that ends with
> 
>> 2018-07-23 09:45:40.280 CDT [6032] LOG:  database system was interrupted; last known up at 2018-07-06 17:04:28 CDT
>> 2018-07-23 09:45:40.281 CDT [6033] FATAL:  the database system is starting up
>> 2018-07-23 09:45:40.282 CDT [6034] FATAL:  the database system is starting up
> ... ad infinitum ...
> 
> Would "permission denied" be relevant?

Logical decoding is something I am still learning. The "permission 
denied" would to me be relevant only to the extent that it seems to be 
provoking:

"LOG:  logical decoding found consistent point at 19/E6942440"
DETAIL:  There are no running transactions."

Others with more experience in this area would need to fill whether that 
might account for the 13 million files in the ~snapshot/ directory.

> 
> (Not sure where that is coming from: I did initial sync as postgres then
> altered it to replication user. Can't track it down now since the
> database is dead.)
> 

In your previous post you said:

"(To somebody else as I'm obviously not turning logical replication back 
on until I know it won't kill my server again.) "

So what are you working off now?



-- 
Adrian Klaver
adrian.klaver@aklaver.com


Re: logical replication snapshots

От
Dimitri Maziuk
Дата:
On 07/25/2018 02:31 PM, Adrian Klaver wrote:

> In your previous post you said:
>
> "(To somebody else as I'm obviously not turning logical replication back
> on until I know it won't kill my server again.) "
>
> So what are you working off now?

I have that database rebuilt with no publications defined, and I have
the original $PGDATA on a ZFS volume.

I have not looked at the subscriber host yet, I've more urgent fires to
douse first.
--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu


Вложения

Re: logical replication snapshots

От
Adrian Klaver
Дата:
On 07/25/2018 11:10 AM, Dimitri Maziuk wrote:
> On 07/25/2018 11:16 AM, Adrian Klaver wrote:
> 
>> Where you using pg_export_snapshot() by any chance?:
> 
> If it's not in chapter 31 of TFM then I have not touched it. The only
> notable thing I changed from 31.9 quickstart is that replica identity is
> full on all tables (the data's a mess).
> 
>> Where there any relevant error messages in the log before the database
>> hung?
> 
> Do you seriously expect me to know what "relevant" is? ;) -- It's 7
> days' worth of
> 
>> 2018-07-23 09:43:12.851 CDT [7975] STATEMENT:  COPY macromolecules."Entity_common_name" TO STDOUT
>> 2018-07-23 09:43:12.925 CDT [7982] LOG:  logical decoding found consistent point at 19/E6942408
>> 2018-07-23 09:43:12.925 CDT [7982] DETAIL:  There are no running transactions.
>> 2018-07-23 09:43:12.935 CDT [7982] ERROR:  permission denied for schema macromolecules
>> 2018-07-23 09:43:12.935 CDT [7982] STATEMENT:  COPY macromolecules."Assembly" TO STDOUT
>> 2018-07-23 09:43:13.034 CDT [7987] LOG:  logical decoding found consistent point at 19/E6942440
>> 2018-07-23 09:43:13.034 CDT [7987] DETAIL:  There are no running transactions.
>> 2018-07-23 09:43:13.044 CDT [7987] ERROR:  permission denied for schema macromolecules
>> 2018-07-23 09:43:13.044 CDT [7987] STATEMENT:  COPY macromolecules."Spectral_transition" TO STDOUT
>> 2018-07-23 09:43:13.135 CDT [7990] LOG:  logical decoding found consistent point at 19/E6942478
>> 2018-07-23 09:43:13.135 CDT [7990] DETAIL:  There are no running transactions.
>> 2018-07-23 09:43:13.145 CDT [7990] ERROR:  permission denied for schema macromolecules
>> 2018-07-23 09:43:13.145 CDT [7990] STATEMENT:  COPY macromolecules."Assembly_db_link" TO STDOUT

A little slow, but I realized that the above looks like the initial sync 
for a table. Do these errors stretch back all the way back to your 
original sync?



> 
> that ends with
> 
>> 2018-07-23 09:45:40.280 CDT [6032] LOG:  database system was interrupted; last known up at 2018-07-06 17:04:28 CDT
>> 2018-07-23 09:45:40.281 CDT [6033] FATAL:  the database system is starting up
>> 2018-07-23 09:45:40.282 CDT [6034] FATAL:  the database system is starting up
> ... ad infinitum ...
> 
> Would "permission denied" be relevant?
> 
> (Not sure where that is coming from: I did initial sync as postgres then
> altered it to replication user. Can't track it down now since the
> database is dead.)
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com


Re: logical replication snapshots

От
Andres Freund
Дата:
On 2018-07-25 12:31:01 -0700, Adrian Klaver wrote:
> On 07/25/2018 11:10 AM, Dimitri Maziuk wrote:
> > On 07/25/2018 11:16 AM, Adrian Klaver wrote:
> > 
> > > Where you using pg_export_snapshot() by any chance?:
> > 
> > If it's not in chapter 31 of TFM then I have not touched it. The only
> > notable thing I changed from 31.9 quickstart is that replica identity is
> > full on all tables (the data's a mess).
> 
> So no.
> 
> > 
> > > Where there any relevant error messages in the log before the database
> > > hung?
> > 
> > Do you seriously expect me to know what "relevant" is? ;) -- It's 7
> > days' worth of
> 
> Generally speaking relevancy increases the closer you get to the fall over
> event. So we are most interested in entries close to the point where the
> database stopped/failed.
> 
> > 
> > > 2018-07-23 09:43:12.851 CDT [7975] STATEMENT:  COPY macromolecules."Entity_common_name" TO STDOUT
> > > 2018-07-23 09:43:12.925 CDT [7982] LOG:  logical decoding found consistent point at 19/E6942408
> > > 2018-07-23 09:43:12.925 CDT [7982] DETAIL:  There are no running transactions.
> > > 2018-07-23 09:43:12.935 CDT [7982] ERROR:  permission denied for schema macromolecules
> > > 2018-07-23 09:43:12.935 CDT [7982] STATEMENT:  COPY macromolecules."Assembly" TO STDOUT
> > > 2018-07-23 09:43:13.034 CDT [7987] LOG:  logical decoding found consistent point at 19/E6942440
> > > 2018-07-23 09:43:13.034 CDT [7987] DETAIL:  There are no running transactions.
> > > 2018-07-23 09:43:13.044 CDT [7987] ERROR:  permission denied for schema macromolecules
> > > 2018-07-23 09:43:13.044 CDT [7987] STATEMENT:  COPY macromolecules."Spectral_transition" TO STDOUT
> > > 2018-07-23 09:43:13.135 CDT [7990] LOG:  logical decoding found consistent point at 19/E6942478
> > > 2018-07-23 09:43:13.135 CDT [7990] DETAIL:  There are no running transactions.
> > > 2018-07-23 09:43:13.145 CDT [7990] ERROR:  permission denied for schema macromolecules
> > > 2018-07-23 09:43:13.145 CDT [7990] STATEMENT:  COPY macromolecules."Assembly_db_link" TO STDOUT
> > 
> > that ends with
> > 
> > > 2018-07-23 09:45:40.280 CDT [6032] LOG:  database system was interrupted; last known up at 2018-07-06 17:04:28
CDT
> > > 2018-07-23 09:45:40.281 CDT [6033] FATAL:  the database system is starting up
> > > 2018-07-23 09:45:40.282 CDT [6034] FATAL:  the database system is starting up
> > ... ad infinitum ...
> > 
> > Would "permission denied" be relevant?
> 
> Logical decoding is something I am still learning. The "permission denied"
> would to me be relevant only to the extent that it seems to be provoking:

Yes, it looks related. Looks like logical rep was trying to perform the
intiial sync of those tables, and couldn't due to permission errors.


> "LOG:  logical decoding found consistent point at 19/E6942440"
> DETAIL:  There are no running transactions."
> 
> Others with more experience in this area would need to fill whether that
> might account for the 13 million files in the ~snapshot/ directory.

That indicates there's some error handling issues to be resolved. Petr?

Greetings,

Andres Freund


Re: logical replication snapshots

От
Dimitri Maziuk
Дата:
On 07/25/2018 07:57 PM, Andres Freund wrote:
> On 2018-07-25 12:31:01 -0700, Adrian Klaver wrote:

>>> Would "permission denied" be relevant?
>>
>> Logical decoding is something I am still learning. The "permission denied"
>> would to me be relevant only to the extent that it seems to be provoking:
>
> Yes, it looks related. Looks like logical rep was trying to perform the
> intiial sync of those tables, and couldn't due to permission errors.

I'm not sure what happened, I remember the initial sync of that
particular schema failing on one table only, but looking at it now, all
tables are empty on the subscriber.

>> "LOG:  logical decoding found consistent point at 19/E6942440"
>> DETAIL:  There are no running transactions."
>>
>> Others with more experience in this area would need to fill whether that
>> might account for the 13 million files in the ~snapshot/ directory.
>
> That indicates there's some error handling issues to be resolved. Petr?

It logs "COPY TO STDOUT", does that actually cache to disk (in ~snapshot/)?

Would it be safe to delete every file in ~snapshot/ older than some
X-number of minutes? (And what a good X might be?)

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu


Вложения

Re: logical replication snapshots

От
Adrian Klaver
Дата:
On 07/26/2018 10:54 AM, Dimitri Maziuk wrote:
> On 07/25/2018 07:57 PM, Andres Freund wrote:
>> On 2018-07-25 12:31:01 -0700, Adrian Klaver wrote:
> 
>>>> Would "permission denied" be relevant?
>>>
>>> Logical decoding is something I am still learning. The "permission denied"
>>> would to me be relevant only to the extent that it seems to be provoking:
>>
>> Yes, it looks related. Looks like logical rep was trying to perform the
>> intiial sync of those tables, and couldn't due to permission errors.
> 
> I'm not sure what happened, I remember the initial sync of that
> particular schema failing on one table only, but looking at it now, all
> tables are empty on the subscriber.

To me that indicates all the syncs failed.

> 
>>> "LOG:  logical decoding found consistent point at 19/E6942440"
>>> DETAIL:  There are no running transactions."
>>>
>>> Others with more experience in this area would need to fill whether that
>>> might account for the 13 million files in the ~snapshot/ directory.
>>
>> That indicates there's some error handling issues to be resolved. Petr?
> 
> It logs "COPY TO STDOUT", does that actually cache to disk (in ~snapshot/)?

That is getting deeper into the code then I have been, so I don't know. 
FYI the code is below and the part that triggered the error is around 
line 762 (copy_table(Relation rel)).


https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/replication/logical/tablesync.c;h=acc6498567d07c3e93d0adf5039a10c693e38f64;hb=HEAD

> 
> Would it be safe to delete every file in ~snapshot/ older than some
> X-number of minutes? (And what a good X might be?)
> 

Again I don't know the answer to this. Are you trying this on a test 
setup or production one?

-- 
Adrian Klaver
adrian.klaver@aklaver.com


Re: logical replication snapshots

От
Andres Freund
Дата:
On 2018-07-26 12:54:19 -0500, Dimitri Maziuk wrote:
> >> "LOG:  logical decoding found consistent point at 19/E6942440"
> >> DETAIL:  There are no running transactions."
> >>
> >> Others with more experience in this area would need to fill whether that
> >> might account for the 13 million files in the ~snapshot/ directory.
> > 
> > That indicates there's some error handling issues to be resolved. Petr?
> 
> It logs "COPY TO STDOUT", does that actually cache to disk (in ~snapshot/)?

No, it doesn't. The snapshots are internal to logical replication.


> Would it be safe to delete every file in ~snapshot/ older than some
> X-number of minutes? (And what a good X might be?)

You shouldn't manually delete them. But an explicit 'checkpoint;' (and
all automatically triggered ones), should do so for all that are
older. Is there a chance that you have some old leftover replication
slots around? Because that'd prevent cleanup?

What does SELECT * FROM pg_replication_slots ; return?

Greetings,

Andres Freund


Re: logical replication snapshots

От
Dimitri Maziuk
Дата:
On 07/26/2018 02:54 PM, Adrian Klaver wrote:
> On 07/26/2018 10:54 AM, Dimitri Maziuk wrote:

>> I'm not sure what happened, I remember the initial sync of that
>> particular schema failing on one table only, but looking at it now, all
>> tables are empty on the subscriber.
>
> To me that indicates all the syncs failed.

Yeah, well... the error message said one table failed and I went off to
find out why (a co-worker added a column behind everyone's back) and
never checked 'count(*)' on the other tables.

... deleting files in ~snapshots
> Again I don't know the answer to this. Are you trying this on a test
> setup or production one?

I could fire up another instance on a different port off the now-broken
$PGDATA easily enough and test. However if whatever uses those files
needs to start from the very first file and "replay" them in sequence,
that won't work.

The files are named like 19_E6942440.snap which presumably corresponds
to "LOG: logical decoding found consistent point at 19/E6942440" and
they seem to get progressively larger. That suggest that maybe just one
(the newest one) could be good enough...

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu


Вложения

Re: logical replication snapshots

От
Adrian Klaver
Дата:
On 07/26/2018 02:06 PM, Dimitri Maziuk wrote:
> On 07/26/2018 02:54 PM, Adrian Klaver wrote:
>> On 07/26/2018 10:54 AM, Dimitri Maziuk wrote:
> 
>>> I'm not sure what happened, I remember the initial sync of that
>>> particular schema failing on one table only, but looking at it now, all
>>> tables are empty on the subscriber.
>>
>> To me that indicates all the syncs failed.
> 
> Yeah, well... the error message said one table failed and I went off to
> find out why (a co-worker added a column behind everyone's back) and
> never checked 'count(*)' on the other tables.

The thing that has me somewhat confused is:

ERROR: permission denied for schema macromolecules

I would thought the replication user could access that.

Is there something special about that schema?

Did the user role you set up for replication have the REPLICATION 
attribute or was it a superuser?

> 
> ... deleting files in ~snapshots
>> Again I don't know the answer to this. Are you trying this on a test
>> setup or production one?
> 
> I could fire up another instance on a different port off the now-broken
> $PGDATA easily enough and test. However if whatever uses those files
> needs to start from the very first file and "replay" them in sequence,
> that won't work.

If I remember back correctly this is because the server chokes on the 13 
million files in the ~/pg_logical/snapshot directly.

I wonder what happens if you change(on test setup)?:

wal_level = logical

to

wal_level = replica



> 
> The files are named like 19_E6942440.snap which presumably corresponds
> to "LOG: logical decoding found consistent point at 19/E6942440" and
> they seem to get progressively larger. That suggest that maybe just one
> (the newest one) could be good enough...
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com


Re: logical replication snapshots

От
Dimitri Maziuk
Дата:
On 07/26/2018 04:39 PM, Adrian Klaver wrote:

> The thing that has me somewhat confused is:
>
> ERROR: permission denied for schema macromolecules
>
> I would thought the replication user could access that.

The more I look at the errors, the less sense it all makes.

Let me ask a different question: if I drop and re-create a published
table on the publisher without doing anything to the publication and
subscription, what happens?

Do I need to re-create the publication when I re-create the table?

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu


Вложения

Re: logical replication snapshots

От
Adrian Klaver
Дата:
On 07/26/2018 03:01 PM, Dimitri Maziuk wrote:
> On 07/26/2018 04:39 PM, Adrian Klaver wrote:
> 
>> The thing that has me somewhat confused is:
>>
>> ERROR: permission denied for schema macromolecules
>>
>> I would thought the replication user could access that.
> 
> The more I look at the errors, the less sense it all makes.

Sounds like my problems:)

> 
> Let me ask a different question: if I drop and re-create a published
> table on the publisher without doing anything to the publication and
> subscription, what happens?


Take a look at:

https://www.postgresql.org/docs/10/static/sql-alterpublication.html

> 
> Do I need to re-create the publication when I re-create the table?
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com


Re: logical replication snapshots

От
Dimitri Maziuk
Дата:
On 07/26/2018 05:34 PM, Adrian Klaver wrote:
> On 07/26/2018 03:01 PM, Dimitri Maziuk wrote:

>> Let me ask a different question: if I drop and re-create a published
>> table on the publisher without doing anything to the publication and
>> subscription, what happens?
>
> Take a look at:
>
> https://www.postgresql.org/docs/10/static/sql-alterpublication.html

Thanks but what I ask (and what I suspect happened here while I wasn't
looking) is

- create schema foo with tables bar and baz and some rows in 'em
- create publication foopub with foo.bar and foo.baz
- create subscription to foopub on the other node, initial sync and all
- drop schema foo on the publisher
- create schema foo with tables bar and baz and some rows in 'em

The publication foopub is at this point fubar I take it? And needs to be
re-created on the publisher and reconnected on the subscriber? Complete
with with inital resync?

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu


Вложения

Re: logical replication snapshots

От
Adrian Klaver
Дата:
On 07/26/2018 04:48 PM, Dimitri Maziuk wrote:
> On 07/26/2018 05:34 PM, Adrian Klaver wrote:
>> On 07/26/2018 03:01 PM, Dimitri Maziuk wrote:
> 
>>> Let me ask a different question: if I drop and re-create a published
>>> table on the publisher without doing anything to the publication and
>>> subscription, what happens?
>>
>> Take a look at:
>>
>> https://www.postgresql.org/docs/10/static/sql-alterpublication.html
> 
> Thanks but what I ask (and what I suspect happened here while I wasn't
> looking) is
> 
> - create schema foo with tables bar and baz and some rows in 'em
> - create publication foopub with foo.bar and foo.baz
> - create subscription to foopub on the other node, initial sync and all
> - drop schema foo on the publisher
> - create schema foo with tables bar and baz and some rows in 'em
> 
> The publication foopub is at this point fubar I take it? And needs to be
> re-created on the publisher and reconnected on the subscriber? Complete
> with with inital resync?

Not sure. Personally I would try:

1) ALTER PUBLICATION DROP TABLE foo|bar;

2) ALTER PUBLICATION ADD TABLE foo|bar;

3) ALTER SUBSCRIPTION sub_name REFRESH PUBLICATION

If you get to 3) it will re-sync the data unless you tell it otherwise.

The above is probably dependent on the size of the publication. If you 
did a publication for ALL it would make more sense to do the above then 
if you did a publication for just foo or bar.



-- 
Adrian Klaver
adrian.klaver@aklaver.com


Re: logical replication snapshots

От
Dimitri Maziuk
Дата:
On 07/26/2018 07:11 PM, Adrian Klaver wrote:
> On 07/26/2018 04:48 PM, Dimitri Maziuk wrote:
...
>> The publication foopub is at this point fubar I take it? And needs to be
>> re-created on the publisher and reconnected on the subscriber? Complete
>> with with inital resync?
>
> Not sure. Personally I would try:
>
> 1) ALTER PUBLICATION DROP TABLE foo|bar;
>
> 2) ALTER PUBLICATION ADD TABLE foo|bar;
>
> 3) ALTER SUBSCRIPTION sub_name REFRESH PUBLICATION
>
> If you get to 3) it will re-sync the data unless you tell it otherwise.
>
> The above is probably dependent on the size of the publication. If you
> did a publication for ALL it would make more sense to do the above then
> if you did a publication for just foo or bar.

... but if I did the publication for ALL, I could just use streaming
replication and then drop table/add table would replicate automagically ...

It looks like we probably have to re-think a few of our workflows and
procedures, and until/unless that happens, logical replication won't do
what we want. Which means figuring out this 13-million-files problem
becomes a very low priority for me. Unfortunately: it'd be nice to track
it down and squash it...

Thanks everyone,
--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu


Вложения

Re: logical replication snapshots

От
Adrian Klaver
Дата:
On 07/27/2018 03:04 PM, Dimitri Maziuk wrote:
> On 07/26/2018 07:11 PM, Adrian Klaver wrote:
>> On 07/26/2018 04:48 PM, Dimitri Maziuk wrote:
> ...
>>> The publication foopub is at this point fubar I take it? And needs to be
>>> re-created on the publisher and reconnected on the subscriber? Complete
>>> with with inital resync?
>>
>> Not sure. Personally I would try:
>>
>> 1) ALTER PUBLICATION DROP TABLE foo|bar;
>>
>> 2) ALTER PUBLICATION ADD TABLE foo|bar;
>>
>> 3) ALTER SUBSCRIPTION sub_name REFRESH PUBLICATION
>>
>> If you get to 3) it will re-sync the data unless you tell it otherwise.
>>
>> The above is probably dependent on the size of the publication. If you
>> did a publication for ALL it would make more sense to do the above then
>> if you did a publication for just foo or bar.
> 
> ... but if I did the publication for ALL, I could just use streaming
> replication and then drop table/add table would replicate automagically ...

Well I was just showing the extremes from a single table publication to 
ALL tables. You can also do subsets of ALL. Remember that binary 
replication(streaming) involves the whole Postgres cluster, e.g. all the 
databases in the cluster no choice in the matter. Also it does not allow 
you to shape what is replicated. In other words what form of DML you 
want replicated e.g. UPDATE, INSERT, DELETE. Last but not least logical 
replication works across major versions and different OS'es, which 
binary replication does not.

> 
> It looks like we probably have to re-think a few of our workflows and
> procedures, and until/unless that happens, logical replication won't do
> what we want. Which means figuring out this 13-million-files problem
> becomes a very low priority for me. Unfortunately: it'd be nice to track
> it down and squash it...


> 
> Thanks everyone,
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com


Re: logical replication snapshots

От
Dimitri Maziuk
Дата:
On 07/27/2018 05:20 PM, Adrian Klaver wrote:
> On 07/27/2018 03:04 PM, Dimitri Maziuk wrote:

>> ... but if I did the publication for ALL, I could just use streaming
>> replication ...
>
> Well I was just showing the extremes from a single table publication to
> ALL tables. You can also do subsets of ALL. Remember that binary
> replication(streaming) involves the whole Postgres cluster, e.g. all the
> databases in the cluster no choice in the matter. Also it does not allow
> you to shape what is replicated.
Yes, and that's why I was looking into logical replication. Now that I
looked, it seems the limitations are too many for our use case now, and
that spinning up a separate "cluster" for each "replicated dataset" is
simply less work ATM.

Thanks again,
--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu


Вложения