Re: [HACKERS] Causal reads take II

Поиск
Список
Период
Сортировка
От Dmitry Dolgov
Тема Re: [HACKERS] Causal reads take II
Дата
Msg-id CA+q6zcUD=QdD0W2EVcE-S0DU2w25oLzY9mkDt0-ZOtXx8jkgbA@mail.gmail.com
обсуждение исходный текст
Ответ на Re: [HACKERS] Causal reads take II  (Thomas Munro <thomas.munro@enterprisedb.com>)
Ответы Re: [HACKERS] Causal reads take II  (Thomas Munro <thomas.munro@enterprisedb.com>)
Список pgsql-hackers
> On 31 July 2017 at 07:49, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> On Sun, Jul 30, 2017 at 7:07 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>>
>> I looked through the code of `synchronous-replay-v1.patch` a bit and ran a few
>> tests. I didn't manage to break anything, except one mysterious error that I've
>> got only once on one of my replicas, but I couldn't reproduce it yet.
>> Interesting thing is that this error did not affect another replica or primary.
>> Just in case here is the log for this error (maybe you can see something
>> obvious, that I've not noticed):
>>
>> LOG:  could not remove directory "pg_tblspc/47733/PG_10_201707211/47732":
>> Directory not empty
>> ...
>
> Hmm.  The first error ("could not remove directory") could perhaps be
> explained by temporary files from concurrent backends.
> ...
> Perhaps in your testing you accidentally copied a pgdata directory over the
> top of it while it was running?  In any case I'm struggling to see how
> anything in this patch would affect anything at the REDO level.

Hmm...no, I don't think so. Basically what I was doing is just running
`installcheck` against a primary instance (I assume there is nothing wrong with
this approach, am I right?). This particular error was caused by `tablespace`
test which was failed in this case:

```
INSERT INTO testschema.foo VALUES(1);
ERROR:  could not open file "pg_tblspc/16388/PG_11_201709191/16386/16390": No such file or directory
```

I tried few more times, and I've got it two times from four attempts on a fresh
installation (when all instances were on the same machine). But anyway I'll try
to investigate, maybe it has something to do with my environment.

> > * Also I noticed that some time-related values are hardcoded (e.g. 50%/25%
> >   time shift when we're dealing with leases). Does it make sense to move
> >   them out and make them configurable?
>
> These numbers are interrelated, and I think they're best fixed in that
> ratio.  You could make it more adjustable, but I think it's better to
> keep it simple with just a single knob.

Ok, but what do you think about converting them to constants to make them more
self explanatory? Like:

```
/*
+ * Since this timestamp is being sent to the standby where it will be
+ * compared against a time generated by the standby's system clock, we
+ * must consider clock skew.  We use 25% of the lease time as max
+ * clock skew, and we subtract that from the time we send with the
+ * following reasoning:
+ */
+int max_clock_skew = synchronous_replay_lease_time * MAX_CLOCK_SKEW_PORTION;
```

Also I have another question. I tried to test this patch little bit more, and
I've got some strange behaviour after pgbench (here is the full output [1]):

```
# primary

$ ./bin/pgbench -s 100 -i test

NOTICE:  table "pgbench_history" does not exist, skipping
NOTICE:  table "pgbench_tellers" does not exist, skipping
NOTICE:  table "pgbench_accounts" does not exist, skipping
NOTICE:  table "pgbench_branches" does not exist, skipping
creating tables...
100000 of 10000000 tuples (1%) done (elapsed 0.11 s, remaining 10.50 s)
200000 of 10000000 tuples (2%) done (elapsed 1.06 s, remaining 52.00 s)
300000 of 10000000 tuples (3%) done (elapsed 1.88 s, remaining 60.87 s)
2017-09-30 15:47:26.884 CEST [6035] LOG:  revoking synchronous replay lease for standby "walreceiver"...
2017-09-30 15:47:26.900 CEST [6035] LOG:  standby "walreceiver" is no longer available for synchronous replay
2017-09-30 15:47:26.903 CEST [6197] LOG:  revoking synchronous replay lease for standby "walreceiver"...
400000 of 10000000 tuples (4%) done (elapsed 2.44 s, remaining 58.62 s)
2017-09-30 15:47:27.979 CEST [6197] LOG:  standby "walreceiver" is no longer available for synchronous replay
```

```
# replica

2017-09-30 15:47:51.802 CEST [6034] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2017-09-30 15:47:55.154 CEST [6030] LOG:  invalid magic number 0000 in log segment 000000010000000000000020, offset 10092544
2017-09-30 15:47:55.257 CEST [10508] LOG:  started streaming WAL from primary at 0/20000000 on timeline 1
2017-09-30 15:48:09.622 CEST [10508] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
```

Is it something well known or unrelated to the patch itself?

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Alexander Korotkov
Дата:
Сообщение: Re: [HACKERS] 64-bit queryId?
Следующее
От: Alexander Korotkov
Дата:
Сообщение: Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WAL files