Re: ATTACH/DETACH PARTITION CONCURRENTLY

From: David Rowley
Subject: Re: ATTACH/DETACH PARTITION CONCURRENTLY
Msg-id: CAKJS1f-VCEKPZy1tj8U67ja7+-F9=7FJYYQ9vK-t0tmWKzpBYw@mail.gmail.com
In response to: Re: ATTACH/DETACH PARTITION CONCURRENTLY  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On 14 August 2018 at 04:00, Robert Haas <robertmhaas@gmail.com> wrote:
> I've thought about similar things, but I think there's a pretty deep
> can of worms.  For instance, if you built a relcache entry using the
> transaction snapshot, you might end up building a seemingly-valid
> relcache entry for a relation that has been dropped or rewritten.
> When you try to access the relation data, you'll be attempting to access
> a relfilenode that's not there any more.  Similarly, if you use an
> older snapshot to build a partition descriptor, you might think that
> relation OID 12345 is still a partition of that table when in fact
> it's been detached - and, maybe, altered in other ways, such as
> changing column types.

Hmm, I guess for that to work correctly we'd need some way to allow
older snapshots to see the changes if they've not already taken a lock
on the table. If the lock had already been obtained, then the ALTER
TABLE changing the column's type would be blocked by the existing
lock. That rather blows a hole in applying the change only to
snapshots newer than the ATTACH/DETACH's snapshot.
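
To illustrate the interleaving I have in mind (a sketch only; the
table, partition and column names are invented, and it assumes DETACH
no longer takes ACCESS EXCLUSIVE up front):

    -- Session 1
    BEGIN ISOLATION LEVEL REPEATABLE READ;
    SELECT count(*) FROM parent;  -- takes ACCESS SHARE on parent and
                                  -- on partition p1, held until commit

    -- Session 2
    ALTER TABLE parent DETACH PARTITION p1;   -- suppose this succeeds
    ALTER TABLE p1 ALTER COLUMN a TYPE text;  -- blocks: session 1
                                              -- still holds ACCESS
                                              -- SHARE on p1

So a snapshot that has already locked the partition is protected from
the column type change; the troublesome case is a snapshot older than
the DETACH that only locks the table afterwards.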

> It seems to me that overall you're not really focusing on the right
> set of issues here.  I think the very first thing we need to worry
> about is how we're going to keep the executor from following a bad
> pointer and crashing.  Any time the partition descriptor changes, the
> next relcache rebuild is going to replace rd_partdesc and free the old
> one, but the executor may still have the old pointer cached in a
> structure or local variable; the next attempt to dereference it will
> be looking at freed memory, and kaboom.  Right now, we prevent this by
> not allowing the partition descriptor to be modified while there are
> any queries running against the partition, so while there may be a
> rebuild, the old pointer will remain valid (cf. keep_partdesc).  I
> think that whatever scheme you/we choose here should be tested with a
> combination of CLOBBER_CACHE_ALWAYS and multiple concurrent sessions
> -- one of them doing DDL on the table while the other runs a long
> query.

I did focus on that, and wrote a patch to solve the issue. After
writing it I discovered another problem: if the PartitionDesc differs
between planning and execution, then run-time pruning does the wrong
thing (see find_matching_subplans_recurse). The PartitionPruneInfo is
built on the assumption that the PartitionDesc matches between
planning and execution, so I moved on from the dangling pointer issue
to trying to figure out a way to ensure these are the same at both
times.
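
A contrived sketch of the sort of thing that breaks (names invented;
today the plan would be invalidated and the ATTACH would block on the
lock, which is exactly the guarantee a reduced lock level must
preserve):

    CREATE TABLE parent (a int) PARTITION BY LIST (a);
    CREATE TABLE p1 PARTITION OF parent FOR VALUES IN (1);
    CREATE TABLE p2 PARTITION OF parent FOR VALUES IN (2);

    PREPARE q(int) AS SELECT * FROM parent WHERE a = $1;
    EXECUTE q(1);  -- the plan's PartitionPruneInfo maps partition
                   -- indexes 0 (p1) and 1 (p2) to their subplans

    -- Meanwhile, another session does:
    --   CREATE TABLE p0 (a int);
    --   ALTER TABLE parent ATTACH PARTITION p0 FOR VALUES IN (0);
    -- If a later EXECUTE ran with the new PartitionDesc (p0 now at
    -- index 0) but the old plan, the partition indexes would no
    -- longer line up with the subplans and pruning would pick the
    -- wrong one.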

> Once we keep it from blowing up, the second question is what the
> semantics are supposed to be.  It seems inconceivable to me that the
> set of partitions that an in-progress query will scan can really be
> changed on the fly.  I think we're going to have to rule that if you
> add or remove partitions while a query is running, we're going to scan
> exactly the set we had planned to scan at the beginning of the query;
> anything else would require on-the-fly plan surgery to a degree that
> seems unrealistic.

Trying to do that for in-progress queries would be pretty insane. I'm
not planning on doing anything there.

> That means that when a new partition is attached,
> already-running queries aren't going to scan it.  If they did, we'd
> have big problems, because the transaction snapshot might see rows in
> those tables from an earlier time period during which that table
> wasn't attached.  There's no guarantee that the data at that time
> conformed to the partition constraint, so it would be pretty
> problematic to let users see it.  Conversely, when a partition is
> detached, there may still be scans from existing queries hitting it
> for a fairly arbitrary length of time afterwards.  That may be
> surprising from a locking point of view or something, but it's correct
> as far as MVCC goes.  Any changes made after the DETACH operation
> can't be visible to the snapshot being used for the scan.
>
> Now, what we could try to change on the fly is the set of partitions
> that are used for tuple routing.  For instance, suppose we're
> inserting a long stream of COPY data.  At some point, we attach a new
> partition from another session.  If we then encounter a row that
> doesn't route to any of the partitions that existed at the time the
> query started, we could - instead of immediately failing - go and
> reload the set of partitions that are available for tuple routing and
> see if the new partition which was concurrently added happens to be
> appropriate to the tuple we've got.  If so, we could route the tuple
> to it.  But all of this looks optional.  If new partitions aren't
> available for insert/update tuple routing until the start of the next
> query, that's not a catastrophe.  The reverse direction might be more
> problematic: if a partition is detached, I'm not sure how sensible it
> is to keep routing tuples into it.  On the flip side, what would
> break, really?

I'm unsure about that. I don't really see what it would buy us, so
presumably you're just noting that it might not be a road-blocking
side effect. However, I think the PartitionDesc must not change
between planning and execution due to the run-time pruning
requirements, and if that's the case then what you're describing here
is probably not an issue we need to think about.
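
For reference, the tuple-routing case you describe would look
something like this (hypothetical; the file path and partition names
are invented):

    -- Session 1: a long-running bulk load
    COPY parent FROM '/path/to/data.csv' (FORMAT csv);

    -- Session 2, while the COPY is still running:
    CREATE TABLE p3 (a int);
    ALTER TABLE parent ATTACH PARTITION p3 FOR VALUES IN (3);

    -- Under your proposal, when the COPY hits a row with a = 3 it
    -- would re-fetch the set of routable partitions and send the row
    -- to p3, instead of failing with "no partition of relation ...
    -- found for row".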

> Given the foregoing, I don't see why you need something like
> PartitionIsValid() at all, or why you need an algorithm similar to
> CREATE INDEX CONCURRENTLY.  The problem seems mostly different.  In
> the case of CREATE INDEX CONCURRENTLY, the issue is that any new
> tuples that get inserted while the index creation is in progress need
> to end up in the index, so you'd better not start building the index
> on the existing tuples until everybody who might insert new tuples
> knows about the index.  I don't see that we have the same kind of
> problem in this case.  Each partition is basically a separate table
> with its own set of indexes; as long as queries don't end up with one
> notion of which tables are relevant and a different notion of which
> indexes are relevant, we shouldn't end up with any table/index
> inconsistencies.  And it's not clear to me what other problems we
> actually have here.  To put it another way, if we've got the notion of
> "isvalid" for a partition, what's the difference between a partition
> that exists but is not yet valid and one that exists and is valid?  I
> can't think of anything, and I have a feeling that you may therefore
> be inventing a lot of infrastructure that is not necessary.

Well, the problem is that you want REPEATABLE READ transactions to be
exactly that: a concurrent attach/detach should not change the output
of a query. I don't know for sure that some isvalid flag is required,
but we do need something to ensure we don't change the results of
queries run inside a repeatable read transaction. I did try to start
moving away from the isvalid flag in favour of having the
PartitionDesc simply not change within the same snapshot, but you've
pointed out a few problems with what I came up with for that.
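
The invariant I mean is just this (a sketch; names and row counts
invented):

    BEGIN ISOLATION LEVEL REPEATABLE READ;
    SELECT count(*) FROM parent;   -- returns, say, 100

    -- Another session now runs:
    --   ALTER TABLE parent DETACH PARTITION p1;   (p1 holds 40 rows)

    SELECT count(*) FROM parent;   -- must still return 100, not 60
    COMMIT;

Today the DETACH simply blocks until the first session commits; with
a reduced lock level, something else has to guarantee that the second
SELECT sees the same partition set as the first.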

> I'm inclined to think that we could drop the name CONCURRENTLY from
> this feature altogether and recast it as work to reduce the lock level
> associated with partition attach/detach.  As long as we have a
> reasonable definition of what the semantics are for already-running
> queries, and clear documentation to go with those semantics, that
> seems fine.  If a particular user finds the concurrent behavior too
> strange, they can always perform the DDL in a transaction that uses
> LOCK TABLE first, removing the concurrency.

I did have similar thoughts, but that seems like something to think
about once the semantics are determined, not before.
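
(For completeness, the escape hatch you mention would be just, e.g.:

    BEGIN;
    LOCK TABLE parent IN ACCESS EXCLUSIVE MODE;
    ALTER TABLE parent DETACH PARTITION p1;
    COMMIT;

which restores today's blocking behaviour regardless of the lock
level the DETACH itself takes.)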

Thanks for your input on this. I clearly don't have all the answers
here, so your thoughts are very valuable.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

