[HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

From: Pavan Deolasee
Subject: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
Date:
Msg-id: CABOikdM2MUq9cyZJi1KyLmmkCereyGp5JQ4fuwKoyKEde_mzkQ@mail.gmail.com
Responses: Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY  (Amit Kapila <amit.kapila16@gmail.com>)
           Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY  (Pavan Deolasee <pavan.deolasee@gmail.com>)
List: pgsql-hackers
Hello All,

While stress testing WARM, I encountered a data consistency problem. It happens when an index is built concurrently. I initially took it for a WARM-induced bug, and it took me a significant amount of investigation to finally conclude that it's a bug in master. In fact, we tested all the way back to the 9.2 release and the bug exists in all releases.

In my test setup, I keep a UNIQUE index on an "aid" column and then have several other "aid[1-4]" columns on which indexes are built and dropped. The script then updates a randomly selected row using one of the aid[1-4] columns. These columns are themselves updated during the test to force WARM updates, but only within narrow, non-overlapping ranges of values, so we can always find the unique row using some kind of range condition. The reproduction case is much simpler and I'll explain it below.
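To make that concrete, the schema and workload look roughly like this (an illustrative sketch from memory; column types and exact index names may differ from the actual script):

CREATE TABLE cic_tab_accounts (
    aid  int,
    aid1 int,
    aid2 int,
    aid3 int,
    aid4 int
);
CREATE UNIQUE INDEX cic_tab_accounts_aid ON cic_tab_accounts (aid);

-- one of these is repeatedly dropped and rebuilt during the test
CREATE INDEX CONCURRENTLY pgb_a_aid1 ON cic_tab_accounts (aid1);

-- updates locate a single row through one of the aidN columns, using a
-- narrow range known not to overlap any other row
UPDATE cic_tab_accounts
   SET aid1 = aid1 + 1
 WHERE aid2 BETWEEN 4000 AND 4009;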

In the reproduction scenario (attached), it happens fairly quickly, maybe after 3-10 rounds of DROP/CREATE. You may need to adjust the -j value if it does not happen for you even after a few rounds. Just for the record, I haven't checked past reports to see whether the bug explains any of the index corruption issues we might have received.

To give a short summary of how CIC works with HOT, it consists of three phases (see the pg_index sketch right after this list):
1. Create the index entry in the catalog, mark it not ready for inserts, commit the transaction, start a new transaction and then wait for all potential lock holders on the table to finish.
2. Take an MVCC snapshot, build the index, mark the index ready for inserts, commit and then once again wait for potential lock holders on the table to finish.
3. Do a validation phase where we look at MVCC-visible tuples and add missing entries to the index. We only look at the root line pointer of each tuple while deciding whether a particular tuple already exists in the index or not. IOW we look at HOT chains and not at the tuples themselves.
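You can watch these transitions from another session by polling pg_index; a rough sketch (the index name is the one from my test, substitute whatever your script creates):

SELECT indisready, indisvalid
  FROM pg_index
 WHERE indexrelid = 'pgb_a_aid1'::regclass;

-- after phase 1 commits:  indisready = f, indisvalid = f  (catalog entry only)
-- after phase 2 commits:  indisready = t, indisvalid = f  (new inserts maintain the index)
-- after phase 3 commits:  indisready = t, indisvalid = t  (the planner may now use it)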

The interesting cases are phases 1 and 2. Phase 2 works on the premise that every transaction started after we finished waiting for all existing transactions will definitely see the new index. All these new transactions then look at the columns used by the new index and take them into account when deciding the HOT-safety of updates. Now, for reasons I don't fully understand, there exists a path where a transaction starts after CIC has waited and taken its snapshot at the start of the second phase, yet still fails to see the new index. Or maybe it sees the index, but fails to update its rd_indexattr list to take into account the columns of the new index. That leads to wrong decisions and we end up with a broken HOT chain, something the CIC algorithm is not prepared to handle. This in turn leaves the new index missing entries for the updated tuples, i.e. the index may say aid1 = 10 while the value in the heap is aid1 = 11.
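(As an aside, if you want to look at the damaged HOT chains directly, the pageinspect extension is handy. A sketch, assuming the table name from my logs and looking at block 0 only:

CREATE EXTENSION IF NOT EXISTS pageinspect;

SELECT lp,
       lp_flags,
       t_ctid,
       (t_infomask2 & 16384) <> 0 AS hot_updated,  -- HEAP_HOT_UPDATED
       (t_infomask2 & 32768) <> 0 AS heap_only     -- HEAP_ONLY_TUPLE
  FROM heap_page_items(get_raw_page('cic_tab_accounts', 0));

Following t_ctid from a HOT-updated tuple should stay within the same page, and no indexed column is supposed to change anywhere along the chain; that is the invariant that gets broken here.)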

Based on my investigation so far and the evidence that I've collected, what probably happens is that when a relcache invalidation arrives at the new transaction, it recomputes rd_indexattr, but based on the old, cached rd_indexlist. So rd_indexattr does not include the columns of the new index. Later rd_indexlist is updated, but rd_indexattr remains what it was.

There is definite evidence of this sequence of events (collected after sprinkling the code with elog messages), but it's not yet clear to me why it affects only a small percentage of transactions. One possibility is that it depends on the stage at which an invalidation is received and processed by the backend. I find it a bit confusing that even though rd_indexattr is tightly coupled to rd_indexlist, we don't invalidate or recompute them together. Shouldn't we do that? For example, in the attached patch we invalidate rd_indexattr whenever we recompute rd_indexlist (i.e. whenever we see that rd_indexvalid is 0). This patch fixes the problem for me (I've only tried master), but I am not sure if it is the correct fix.

Finally, here is the reproduction scenario. Run "cic_bug.sh" and observe the output. The script runs a continuous UPDATE on the table while running DROP INDEX and CREATE INDEX CONCURRENTLY on the aid1 column in a loop. As soon as the index is built, we run a validation query such that rows are fetched via an index-only scan (IOS) on the new index and compared against the rows obtained via a SeqScan on the table. If things are OK, the rows obtained via both methods should match, but every so often you'll see a row or two differ. I must say that IOS itself is not the problem; I had to set things up that way to ensure that the index access does not go to the heap. Otherwise it would just return whatever is in the heap and things would look fine. Maybe some index consistency check utility would be handy for detecting the corruption, but I think the query does the job for now.
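The validation amounts to something along these lines (a sketch, not the exact query from the script; note that an index-only scan will still visit the heap for pages that are not all-visible, which is exactly what the real script has to avoid, otherwise the comparison proves nothing):

BEGIN ISOLATION LEVEL REPEATABLE READ;

-- fetch via the freshly built index (index-only scan)
SET LOCAL enable_seqscan = off;
SET LOCAL enable_bitmapscan = off;
CREATE TEMP TABLE via_index ON COMMIT DROP AS
    SELECT aid1, count(*) AS cnt
      FROM cic_tab_accounts
     GROUP BY aid1;

-- fetch the same data via a sequential scan of the heap
SET LOCAL enable_seqscan = on;
SET LOCAL enable_indexscan = off;
SET LOCAL enable_indexonlyscan = off;
CREATE TEMP TABLE via_heap ON COMMIT DROP AS
    SELECT aid1, count(*) AS cnt
      FROM cic_tab_accounts
     GROUP BY aid1;

-- any rows returned here mean the index and the heap disagree
(SELECT * FROM via_index EXCEPT SELECT * FROM via_heap)
UNION ALL
(SELECT * FROM via_heap EXCEPT SELECT * FROM via_index);

COMMIT;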

Some additional server logs are attached too. I've filtered them to include only three sessions: one (PID 73119) which is running CIC, a second (PID 73090) which processed the relcache invalidation correctly, and a third (PID 73091) which got it wrong. Specifically, look at line #77 where the snapshot is obtained for the second phase:

2017-01-29 10:44:31.238 UTC [73119] [xid=0]LOG:  CIC (pgb_a_aid1) building first phase with snapshot (4670891:4670893)

At line 108, you'll see a new transaction with XID 4670908 still using the old attribute list. 

2017-01-29 10:44:31.889 UTC [73091] [xid=4670908]LOG:  heap_update (cic_tab_accounts), hot_attrs ((b 9))

The other session, which I included for comparison, sees the new index and recomputes its rd_indexattr in time, and hence its update works as expected.


Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachments
