Re: [HACKERS] Causal reads take II

From	Dmitry Dolgov
Subject	Re: [HACKERS] Causal reads take II
Date	1 October 2017, 02:05:30
Msg-id	CA+q6zcUD=QdD0W2EVcE-S0DU2w25oLzY9mkDt0-ZOtXx8jkgbA@mail.gmail.com
In reply to	Re: [HACKERS] Causal reads take II (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses	Re: [HACKERS] Causal reads take II
List	pgsql-hackers

> On 31 July 2017 at 07:49, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

>> On Sun, Jul 30, 2017 at 7:07 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:

>> I looked through the code of `synchronous-replay-v1.patch` a bit and ran a few

>> tests. I didn't manage to break anything, except one mysterious error that I've

>> got only once on one of my replicas, but I couldn't reproduce it yet.

>> Interesting thing is that this error did not affect another replica or primary.

>> Just in case here is the log for this error (maybe you can see something

>> obvious, that I've not noticed):

>> LOG: could not remove directory "pg_tblspc/47733/PG_10_201707211/47732":

>> Directory not empty

>> ...

> Hmm. The first error ("could not remove directory") could perhaps be

> explained by temporary files from concurrent backends.

> ...

> Perhaps in your testing you accidentally copied a pgdata directory over the

> top of it while it was running? In any case I'm struggling to see how

> anything in this patch would affect anything at the REDO level.

Hmm...no, I don't think so. Basically what I was doing is just running

`installcheck` against a primary instance (I assume there is nothing wrong with

this approach, am I right?). This particular error was caused by `tablespace`

test, which failed in this case:

```

INSERT INTO testschema.foo VALUES(1);

ERROR: could not open file "pg_tblspc/16388/PG_11_201709191/16386/16390": No such file or directory

```

I tried a few more times, and got it in two out of four attempts on a fresh

installation (when all instances were on the same machine). But anyway I'll try

to investigate, maybe it has something to do with my environment.

> > * Also I noticed that some time-related values are hardcoded (e.g. 50%/25%

> > time shift when we're dealing with leases). Does it make sense to move

> > them out and make them configurable?

> These numbers are interrelated, and I think they're best fixed in that

> ratio. You could make it more adjustable, but I think it's better to

> keep it simple with just a single knob.

Ok, but what do you think about converting them to constants to make them more

self-explanatory? Like:

```

+ * Since this timestamp is being sent to the standby where it will be

+ * compared against a time generated by the standby's system clock, we

+ * must consider clock skew. We use 25% of the lease time as max

+ * clock skew, and we subtract that from the time we send with the

+ * following reasoning:

+ */

+int max_clock_skew = synchronous_replay_lease_time * MAX_CLOCK_SKEW_PORTION;

```
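To illustrate the idea a bit further, here is a minimal standalone sketch (not taken from the patch). It assumes the timestamp being sent is a lease expiry time in milliseconds, and it expresses the 25% portion as an integer percentage to keep the arithmetic in integers. The names `MAX_CLOCK_SKEW_PERCENT`, `max_clock_skew_ms` and `lease_expiry_to_send_ms` are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical constant: the share of the lease time that is reserved
 * for clock skew between primary and standby, as an integer percentage.
 */
#define MAX_CLOCK_SKEW_PERCENT 25

/* Maximum tolerated clock skew for a given lease time, in milliseconds. */
static int64_t
max_clock_skew_ms(int64_t lease_time_ms)
{
    assert(lease_time_ms >= 0);
    return lease_time_ms * MAX_CLOCK_SKEW_PERCENT / 100;
}

/*
 * Expiry timestamp to send to the standby (assumption: the primary sends
 * "now + lease time" minus the maximum clock skew, so that a standby whose
 * clock runs behind by up to that amount still drops the lease in time).
 */
static int64_t
lease_expiry_to_send_ms(int64_t now_ms, int64_t lease_time_ms)
{
    return now_ms + lease_time_ms - max_clock_skew_ms(lease_time_ms);
}
```

With a lease time of 5000 ms, for example, `max_clock_skew_ms(5000)` gives 1250 ms, so the ratio stays fixed while the number itself has a name.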

I also have another question. I tested this patch a little more, and saw some

strange behaviour when running pgbench (here is the full output [1]):

```

# primary

$ ./bin/pgbench -s 100 -i test

NOTICE: table "pgbench_history" does not exist, skipping

NOTICE: table "pgbench_tellers" does not exist, skipping

NOTICE: table "pgbench_accounts" does not exist, skipping

NOTICE: table "pgbench_branches" does not exist, skipping

creating tables...

100000 of 10000000 tuples (1%) done (elapsed 0.11 s, remaining 10.50 s)

200000 of 10000000 tuples (2%) done (elapsed 1.06 s, remaining 52.00 s)

300000 of 10000000 tuples (3%) done (elapsed 1.88 s, remaining 60.87 s)

2017-09-30 15:47:26.884 CEST [6035] LOG: revoking synchronous replay lease for standby "walreceiver"...

2017-09-30 15:47:26.900 CEST [6035] LOG: standby "walreceiver" is no longer available for synchronous replay

2017-09-30 15:47:26.903 CEST [6197] LOG: revoking synchronous replay lease for standby "walreceiver"...

400000 of 10000000 tuples (4%) done (elapsed 2.44 s, remaining 58.62 s)

2017-09-30 15:47:27.979 CEST [6197] LOG: standby "walreceiver" is no longer available for synchronous replay

# replica

2017-09-30 15:47:51.802 CEST [6034] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly

This probably means the server terminated abnormally

before or while processing the request.

2017-09-30 15:47:55.154 CEST [6030] LOG: invalid magic number 0000 in log segment 000000010000000000000020, offset 10092544

2017-09-30 15:47:55.257 CEST [10508] LOG: started streaming WAL from primary at 0/20000000 on timeline 1

2017-09-30 15:48:09.622 CEST [10508] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly

This probably means the server terminated abnormally

before or while processing the request.

```

Is this a known issue, or is it unrelated to the patch itself?

[1]: https://gist.github.com/erthalion/cdc9357f7437171192348239eb4db764
