Re: [HACKERS] [WIP] Zipfian distribution in pgbench
From: Peter Geoghegan
Subject: Re: [HACKERS] [WIP] Zipfian distribution in pgbench
Date:
Msg-id: CAH2-Wzmf6intNY1ggiNzOziiO5Eq=DsXfeptODGxO=2j-i1NGQ@mail.gmail.com
In reply to: Re: [HACKERS] [WIP] Zipfian distribution in pgbench (Alik Khilazhev <a.khilazhev@postgrespro.ru>)
List: pgsql-hackers
On Fri, Jul 14, 2017 at 6:37 AM, Alik Khilazhev <a.khilazhev@postgrespro.ru> wrote:
> I am attaching results of tests for 32 and 128 clients that were
> running for 10 minutes, and TPS remains 305 and 115 ktps respectively.
>
> Tests was executed with configuration set for YCSB. And there is very
> aggressively autovacuum, this can be reason why there is no decline
> appears with increasing working time.
> Autovacuum config:
>
> autovacuum_max_workers = 8
> autovacuum_naptime = 10s
> autovacuum_vacuum_scale_factor = 0.1
> autovacuum_vacuum_cost_delay = 0ms
> autovacuum_vacuum_cost_limit = 10000

I think that what this probably comes down to, more than anything else, is that you have leftmost hot/bloated leaf pages like this:

          idx           | level | l_item | blkno | btpo_prev | btpo_next | btpo_flags | type | live_items | dead_items | avg_item_size | page_size | free_size |         highkey
------------------------+-------+--------+-------+-----------+-----------+------------+------+------------+------------+---------------+-----------+-----------+-------------------------
 pgbench_accounts_pkey  |     0 |      1 |     1 |         0 |      2751 |         65 | l    |        100 |         41 |            16 |      8192 |      5328 | 11 00 00 00 00 00 00 00
 pgbench_accounts_pkey  |     0 |      2 |  2751 |         1 |      2746 |         65 | l    |         48 |         90 |            16 |      8192 |      5388 | 32 00 00 00 00 00 00 00
 ...

(Per-page statistics like these can be gathered with the pageinspect contrib extension; a query sketch appears below, after the two numbered points.)

The high key for the leftmost leaf page shows that only values below 0x11 (decimal 17) belong on the first page. That is about 16 or 17 possible distinct values, and yet the page has 100 live items and 41 dead items; in total, there is room for 367 items. It's only slightly better with other nearby pages. This is especially bad because once the keyspace gets split up this finely, it's *impossible* to reverse it -- it's more or less a permanent problem, at least until a REINDEX. You cannot merge pages back together until one is completely empty, which in this case, and in many other cases, will in fact never happen. Aggressive vacuuming is probably helpful in part because it prevents the problem from ever getting completely out of hand. That doesn't seem like a very practical solution, though.

We should probably be treating unique indexes in a special way, since inserting a new conflicting tuple necessarily supersedes whatever it conflicted with. Postgres holds on to the old tuple today, but only because the transaction might abort, or because an old snapshot might be interested in old tuple versions. However, the general pattern with unique indexes is that there must be one tuple visible to new snapshots, and old versions are almost certainly going to become garbage very quickly. Unique indexes really are quite special, which nbtree doesn't take advantage of at all.

If you read the coverage of B-Trees within "Transaction Processing: Concepts and Techniques", and many other publications, the general trend seems to be that unique indexes have truly unique keys, based only on the user-visible key values. They make a sharp distinction between primary and secondary indexes, which doesn't really exist in Postgres (at least, not at the access method level). I suspect that the best long term solution is to add GIN-style duplicate handling within nbtree for unique indexes, with special pruning-style optimizations to the posting list structure. This would probably only happen with unique indexes. The useful thing about this approach is that it separates these two problems:

1. Representing what values are in the index for lookup, and their latest row version.

2. Finding old row versions, in the less common case where you have an old snapshot.
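As mentioned above, per-page output in that style can be reproduced with the pageinspect contrib extension. A minimal sketch, assuming (as in the output above) that the leftmost leaf is block 1; pageinspect requires superuser:

CREATE EXTENSION IF NOT EXISTS pageinspect;

-- Per-page statistics: live/dead item counts, free space, and the
-- sibling links (btpo_prev/btpo_next) used to walk the leaf level:
SELECT * FROM bt_page_stats('pgbench_accounts_pkey', 1);

-- On any non-rightmost page, item 1 is the high key; its "data" column
-- holds the raw key bytes -- "11 00 00 00 00 00 00 00" above, which
-- decodes (little-endian) to 17, i.e. 0x11:
SELECT itemoffset, ctid, itemlen, data
FROM bt_page_items('pgbench_accounts_pkey', 1)
WHERE itemoffset = 1;

Following btpo_next (2751 for block 1 above) visits the next leaf page, and so on across the leaf level.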
With a posting list design like that, nbtree would never "cut up the keyspace too finely" because of a temporary surge of UPDATE insertions. You still get bloat, but you add it to a place where garbage collection can be much better targeted. Under this scheme, it doesn't really matter so much if garbage collection doesn't happen very frequently, because there could be LP_DEAD-style hints for the auxiliary posting list structure. That could probably work based on atomic ops, and would greatly reduce the number of pages that UPDATE workloads like this one dirty. It probably wouldn't make sense to add things like GIN's compression of item pointers, since most data within the auxiliary posting list structure is removed very quickly.

I wouldn't expect btree_gin to be faster for this workload today, because it doesn't have kill_prior_tuple/LP_DEAD support, and because it doesn't support unique indexes, and so cannot exploit the special situation that exists with unique indexes.
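For what it's worth, this is all that trying btree_gin here could amount to. A minimal sketch; the index name is made up, and since GIN cannot back a PRIMARY KEY or UNIQUE constraint, it could only ever be a redundant secondary index:

-- btree_gin supplies GIN operator classes for scalar types like integer.
-- GIN only supports bitmap scans, which is why there is no
-- kill_prior_tuple/LP_DEAD path to exploit:
CREATE EXTENSION IF NOT EXISTS btree_gin;

-- Hypothetical name; this cannot replace pgbench_accounts_pkey, because
-- GIN indexes cannot enforce uniqueness:
CREATE INDEX pgbench_accounts_aid_gin
    ON pgbench_accounts USING gin (aid);

--
Peter Geoghegan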