Synchronized scans versus relcache reinitialization

From: Tom Lane
Subject: Synchronized scans versus relcache reinitialization
Date:
Msg-id: 27386.1338059658@sss.pgh.pa.us
Responses: Re: Synchronized scans versus relcache reinitialization  (Noah Misch <noah@leadboat.com>)
           Re: Synchronized scans versus relcache reinitialization  (Jeff Davis <pgsql@j-davis.com>)
List: pgsql-hackers

I've been poking at Jeff Frost's and Greg Mullane's recent reports of
high load due to many processes getting "stuck" in relcache init file
rebuild operations.  I can reproduce a similar behavior here by creating
a database containing a whole lot of many-column views, thereby bloating
pg_attribute to the gigabyte range, then manually removing the
pg_internal.init file (simulating what would happen after a relcache
inval on any system catalog), and then throwing a bunch of new
connections at the database simultaneously.  Each new connection tries
to rebuild the init file, and they basically saturate the machine.
I don't believe that this case quite matches what happened to either
Jeff or Greg, but nonetheless it's quite reproducible and it needs
to be fixed.  I can identify three sub-issues:

1. If pg_attribute is larger than 1/4th of shared_buffers, the
synchronized scan logic kicks in when we do seqscans to fill the tuple
descriptors for core system catalogs.  For this particular use case
that's not merely not helpful, it's positively disastrous.  The reason
is that the desired rows are almost always in the first couple dozen
blocks of pg_attribute, and the reading code in RelationBuildTupleDesc
knows this and is coded to stop once it's collected the expected number
of pg_attribute rows for the particular catalog.  So even with a very
large pg_attribute, not much work should be expended here.  But the
syncscan logic causes some of the heapscans to start from points later
than block zero, causing them to miss the rows they need, so that the
scan has to run to the end and wrap around before it finds all the rows
it needs.  In my test case on HEAD, this happens just once out of the
eleven heapscans that occur in this phase, if a single backend is doing
this in isolation.  That increases the startup time from a few
milliseconds to about eight-tenths of a second, due to having to scan
all of pg_attribute.  (In my test case, pg_attribute is fully cached in
RAM, but most of it is in kernel buffers not PG buffers.)
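
(For anyone following along: the trigger condition boils down to roughly
the following.  This is a simplified sketch of the decision initscan()
makes, with illustrative names, not the actual code.)

    #include <stdbool.h>

    /*
     * Simplified sketch of the syncscan trigger: a seqscan consults the
     * shared start-position hint only when the relation is bigger than a
     * quarter of shared_buffers and both the caller and the
     * synchronize_seqscans GUC allow it; otherwise it starts at block
     * zero, which is what RelationBuildTupleDesc's early exit depends on.
     */
    bool
    scan_should_synchronize(unsigned rel_nblocks,      /* relation size in blocks */
                            unsigned shared_buffers,   /* NBuffers */
                            bool caller_allows_sync,
                            bool synchronize_seqscans) /* the GUC */
    {
        if (rel_nblocks <= shared_buffers / 4)
            return false;
        return caller_allows_sync && synchronize_seqscans;
    }

In the scenario above pg_attribute is past that size threshold, so its
heapscans are eligible to start wherever the shared hint happens to point.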

Bad as that is, it gets rapidly worse if there are multiple incoming new
connections.  All of them get swept up in the full-table syncscan
started by the first arrival, so that now all rather than only some of
their heapscans start from a point later than block zero, meaning that
all eleven rather than just one of their heapscans are unduly expensive.

It seems clear to me that we should just disable syncscans for the
relcache reload heapscans.  Allowing them there has lots of downside (it
breaks the early-exit optimization in RelationBuildTupleDesc) and
basically no upside.  I'm inclined to just modify systable_beginscan to
prevent use
of syncscan whenever indexOK is false.  If we wanted to change its API
we could make this happen only for RelationBuildTupleDesc's calls, but
I don't see any upside for allowing syncscans for other forced-heapscan
callers either.
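
Concretely, the sort of one-liner I have in mind for the !indexOK branch
in genam.c looks about like this (untested sketch; heap_beginscan_strat's
last two arguments are allow_strat and allow_sync):

    	else
    	{
    		/*
    		 * Forced heapscan of a catalog: keep the bulk-read strategy but
    		 * forbid syncscan (allow_sync = false), so the scan always
    		 * starts at block zero and RelationBuildTupleDesc's early exit
    		 * still works.  Previously this was plain heap_beginscan().
    		 */
    		sysscan->scan = heap_beginscan_strat(heapRelation, snapshot,
    											 nkeys, key,
    											 true, false);
    		sysscan->iscan = NULL;
    	}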

2. The larger problem here is that when we have N incoming connections
we let all N of them try to rebuild the init file independently.  This
doesn't make things faster for any one of them, and once N gets large
enough it makes things slower for all of them.  We would be better off
letting the first arrival do the rebuild work while the others just
sleep waiting for it.  I believe that this fix would probably have
ameliorated Jeff and Greg's cases, even though those do not seem to
have triggered the syncscan logic.
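
The control flow I'm imagining is about like this.  This is a sketch only:
RelCacheInitRebuildLock and RebuildCoreRelcacheEntries are invented names,
the load/write functions are paraphrased from relcache.c with their
argument lists elided, and I'm ignoring the shared-vs-per-database init
file distinction.

    	if (!load_relcache_init_file())
    	{
    		/*
    		 * Init file is missing or stale.  Take a new LWLock so that only
    		 * one backend does the expensive catalog heapscans; later
    		 * arrivals sleep here instead of piling on.
    		 */
    		LWLockAcquire(RelCacheInitRebuildLock, LW_EXCLUSIVE);

    		/* someone may have rebuilt the file while we were waiting */
    		if (!load_relcache_init_file())
    		{
    			RebuildCoreRelcacheEntries();	/* the heapscans from point 1 */
    			write_relcache_init_file();
    		}

    		LWLockRelease(RelCacheInitRebuildLock);
    	}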

3. Having now spent a good deal of time poking at this, I think that the
syncscan logic is in need of more tuning, and I am wondering whether we
should even have it turned on by default.  It appears to be totally
useless for fully-cached-in-RAM scenarios, even if most of the relation
is out in kernel buffers rather than in shared buffers.  The best case
I saw was less than 2X speedup compared to N-times-the-single-client
case, and that wasn't very reproducible, and it didn't happen at all
unless I hacked BAS_BULKREAD mode to use a ring buffer size many times
larger than the current 256K setting (otherwise the timing requirements
are too tight for multiple backends to stay in sync --- a seqscan can
blow through that much data in a fraction of a millisecond these days,
if it's reading from kernel buffers).  The current tuning may be all
right for cases where you're actually reading from spinning rust, but
that seems to be a decreasing fraction of real-world use cases.
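
(For reference, the 256K figure mentioned above comes from the ring sizing
in GetAccessStrategy() in storage/buffer/freelist.c; paraphrased below.
The experiment just meant enlarging that constant.)

    	case BAS_BULKREAD:
    		/*
    		 * 256K: with the default 8K BLCKSZ that is only 32 buffers,
    		 * which is roughly the window concurrent scans have to stay
    		 * inside to keep benefiting from each other's reads.
    		 */
    		ring_size = 256 * 1024 / BLCKSZ;
    		break;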

Anyway, I think we definitely need to fix systable_beginscan to not use
syncscans; that's about a one-line change and seems plenty safe to
backpatch.  I also intend to look at avoiding concurrent relcache
rebuilds, which I think should also be simple enough if we are willing
to introduce an additional LWLock.  (That would prevent concurrent init
file rebuilds in different databases, but it's not clear that very many
people care about such scenarios.)  I am inclined to back-patch that as
well; it's a bit riskier than the first change, but the first change is
apparently not going to fix the two cases reported from the field.
Comments?
        regards, tom lane

