Re: Potential G2-item cycles under serializable isolation

From: Kyle Kingsbury
Subject: Re: Potential G2-item cycles under serializable isolation
Date:
Msg-id: CAMotZ_wLtqrS_t09Q1b3ofjwV-7gr9728mV9aaN5murde7ykog@mail.gmail.com
In response to: Re: Potential G2-item cycles under serializable isolation  (Peter Geoghegan <pg@bowt.ie>)
Responses: Re: Potential G2-item cycles under serializable isolation  (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-bugs
Oh, this is interesting! I can say that I'm running on a 24-way Xeon with 128 GB of RAM, so running out of system memory doesn't immediately seem like the bottleneck--I'd suspect my setup runs slower by dint of disks (an older SSD), filesystem settings, or maybe Postgres tuning (this is with the stock Debian config files).

"No process wrote x" is very surprising: it means the database essentially fabricated a value out of thin air. Either something is very broken in Postgres, or (more likely, I think) there's some state left over from a prior run--I messaged you on Hangouts about this, but you might have to clear the tables by hand between runs. That's assuming I forgot to push up my latest commit, which clears tables in append.clj's setup! function. If that code is there, something deeper is going wrong.
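For reference, clearing leftover state by hand between runs might look something like the following. This is only a sketch: the database and table names here are placeholders, not what the append workload necessarily creates--check what setup! in append.clj actually builds and substitute accordingly.

```shell
# Wipe leftover test state before the next run. "jepsen" and "txn0"
# are hypothetical names -- use whatever the append test created.
psql -U postgres -d jepsen -c 'TRUNCATE TABLE txn0;'
# Or, more drastically, drop the table so setup! recreates it fresh:
psql -U postgres -d jepsen -c 'DROP TABLE IF EXISTS txn0;'
```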

The memory consumption of jepsen during the analysis... There's probably stuff I can optimize there, but it's never been an issue before--most distributed dbs are only pushing ~100 txns/sec, not 10k, so our histories never get this big. I know this is gonna sound weird, but slowing down postgres might actually help with reproducing this bug. Another possible path is to run more (--test-count 100) tests with shorter time limits (--time-limit 20). Or maybe injecting (Thread/sleep) statements into the transactions themselves, like in apply-mop!. Not sure!
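Concretely, the many-short-runs approach might look like this (the --test-count and --time-limit values are the ones suggested above; any other flags shown are just examples of what you might already be passing):

```shell
# Run 100 short tests of 20s each instead of one long one;
# repeat until the anomaly reproduces.
lein run test --test-count 100 --time-limit 20
```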

If you're having trouble sorting through results, lein run serve in the stolon/ directory will give you a little web server for browsing the store/ directory. Might come in handy!

--Kyle

On Wed, Jun 3, 2020, 19:11 Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Jun 3, 2020 at 2:35 PM Kyle Kingsbury <aphyr@jepsen.io> wrote:
> It looks like you're seeing a much higher txn success rate than I am--possibly due to your tuning? Might be worth adjusting --rate and/or --concurrency upwards

I can see what I assume is the same problem (a failure/table flip and
a huge graph) with "--concurrency 150 -r 10000", and with autovacuum
disabled on the Postgres side (this is the same relatively tuned
Postgres configuration that I used when Jepsen passed for me). The
tests are slow to run, though, so isolating the cause takes a long
time.

BTW, the tests are kind of flappy. The Linux OOM killer just killed
Java after 20 minutes or so, for example. I assume that this is to be
expected with the settings cranked up like this -- the analysis will
take longer and use more memory, too. Any tips on limiting that? Is
there any reason to think that running the same test twice will affect
the outcome of the second test?

I also see this sometimes, even though I thought I fixed it earlier --
it seems to happen at random:

Caused by: java.lang.AssertionError: Assert failed: No transaction wrote 8363 2
t2

The fact that Kyle saw such a high number of failed transactions,
which are difficult to reproduce here, seems to suggest that the issue
is related to running out of shared memory for predicate locks and/or
bloat (which tends to have the side effect of increasing the need for
predicate locks). I continue to suspect that this is related to an
edge case with predicate locks. It could be related to running out of
predicate locks -- maybe an issue with the lock escalation? That would
tend to increase the number of failures by quite a lot.
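One way to check whether predicate-lock pressure is in play is to watch
SIRead locks and the relevant settings while a test runs. A sketch
(the GUC names are real Postgres settings; the database name is a
placeholder):

```shell
# Count SIRead (predicate) locks by granularity. Escalation from
# tuple to page/relation locks shows up as the coarser locktypes
# growing while tuple-level entries shrink.
psql -d jepsen -c "SELECT locktype, count(*) FROM pg_locks
                   WHERE mode = 'SIReadLock' GROUP BY locktype;"

# The knobs that bound predicate-lock memory and escalation:
psql -d jepsen -c "SHOW max_pred_locks_per_transaction;"
psql -d jepsen -c "SHOW max_pred_locks_per_relation;"
psql -d jepsen -c "SHOW max_pred_locks_per_page;"
```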

--
Peter Geoghegan
