Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

Поиск
Список
Период
Сортировка
От Michail Nikolaev
Тема Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
Дата
Msg-id CANtu0oiT9SPFhs=h8DR4YVPox7TJ6jkfR9JgqT-0L+=uy=Lxng@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements  (Michail Nikolaev <michail.nikolaev@gmail.com>)
Ответы Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
Список pgsql-hackers
Hello, Matthias and others!

Realized new horizon was applied only during validation phase (once index is marked as ready).
Now it applied if index is not marked as valid yet.

Updated version in attach.

--------------------------------------------------

> I think the best way for this to work would be an index method that
> exclusively stores TIDs, and of which we can quickly determine new
> tuples, too. I was thinking about something like GIN's format, but
> using (generation number, tid) instead of ([colno, colvalue], tid) as
> key data for the internal trees, and would be unlogged (because the
> data wouldn't have to survive a crash). Then we could do something
> like this for the second table scan phase:

Regarding that approach to dealing with validation phase and resetting of snapshot:

I was thinking about it and realized: once we go for an additional index - we don't need the second heap scan at all!

We may do it this way:

* create target index, not marked as indisready yet
* create a temporary unlogged index with the same parameters to store tids (optionally with the indexes columns data, see below), marked as indisready (but not indisvalid)
* commit them both in a single transaction
* wait for other transaction to know about them and honor in HOT constraints and new inserts (for temporary index)
* now our temporary index is filled by the tuples inserted to the table
* start building out target index, resetting snapshot every so often (if it is "safe" index)
* finish target index building phase
* mark target index as indisready
* now, start validation of the index:
    * take the reference snapshot    
    * take a visibility snapshot of the target index, sort it (as it done currently)
    * take a visibility snapshot of our temporary index, sort it
    * start merging loop using two synchronized cursors over both visibility snapshots
        * if we encountered tid which is not present in target visibility snapshot
            * insert it to target index
                * if a temporary index contains the column's data - we may even avoid the tuple fetch
                * if temporary index is tid-only - we fetch tuple from the heap, but as plus we are also skipping dead tuples from insertion to the new index (I think it is better option)
    * commit everything, release reference snapshot  
* wait for transactions older than reference snapshot (as it done currently)
* mark target index as indisvalid, drop temporary index
* done


So, pros:
* just a single heap scan
* snapshot is reset periodically

Cons:
* we need to maintain the additional index during the main building phase
* one more tuplesort

If the temporary index is unlogged, cheap to maintain (just append-only mechanics) this feels like a perfect tradeoff for me.

This approach will work perfectly with low amount of tuple inserts during the building phase. And looks like even in the worst case it still better than the current approach.

What do you think? Have I missed something?

Thanks,
Michail.
Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Bertrand Drouvot
Дата:
Сообщение: Re: Avoid orphaned objects dependencies, take 3
Следующее
От: Bruce Momjian
Дата:
Сообщение: Re: First draft of PG 17 release notes