Re: plan_create_index_workers doesn't account for TOAST

Поиск
Список
Период
Сортировка
От Jonathan S. Katz
Тема Re: plan_create_index_workers doesn't account for TOAST
Дата
Msg-id 950a26ba-ed90-6b24-3335-05e401739873@postgresql.org
обсуждение исходный текст
Ответ на plan_create_index_workers doesn't account for TOAST  ("Jonathan S. Katz" <jkatz@postgresql.org>)
Список pgsql-hackers
On 6/29/23 10:12 AM, Jonathan S. Katz wrote:
> Hi,
> 
> plan_create_index_workers[1] does not consider the amount of tuples 
> existing in TOAST pages when determining the number of parallel workers 
> to use for a build. The estimation comes from estimate_rel_size[2], 
> which in this case, will just take the value from rel->rd_rel->relpages.
> 
> We probably don't notice this much with B-trees, given a B-tree is 
> typically used for data that does not require toasting. However, this 
> becomes more visible when working on custom index access methods that 
> implement their own parallel build strategy.
> 
> For example, pgvector[3] provides its own data types and index access 
> method for indexing vector data. Vectors can get quite large fairly 
> quickly, e.g. a 768-dimensional vector takes up 8 + 4*768 = 3080 bytes 
> on disk, which quickly clears the default TOAST tuple threshold.
> 
> In a recent patch proposal to allow for building indexes in parallel[4], 
> I performed a few experiments on how many parallel workers would be 
> spawned when indexing 1,000,000 (1MM) 768-dim vectors, both with 
> EXTEDNED (default) and PLAIN storage. In all cases, I allowed for leader 
> participation, but the leader is not considered in 
> plan_create_index_workers.
> 
> With EXTENDED, plan_create_index_workers recommended 2 workers. The 
> build time was ~2x faster than the serial build.
> 
> With PLAIN, plan_create_index_workers recommended 4 workers. The build 
> time was **~3X faster** than the serial build.
> 
> (I've been doing more detailed, less hand-waivy performance testing, but 
> I wanted to provide directional numbers here)
> 
> It seems like we're leaving some performance for columns with TOASTed 
> data that require indexing, so I wanted to propose allowing the pages in 
> TOASTed tables to be considered when we're trying to index a column with 
> TOASTed attributes.

Just to add to this: there is a lever to get more parallel workers by 
setting "min_parallel_table_scan_size" to a lower value, which does help 
in this case. However, it does mask the fact that a large chunk of the 
data required to build the index exists in the TOAST table, which is not 
intuitive to a user who rarely has to use tuning parameters.

Thanks,

Jonathan


Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: "Tristan Partin"
Дата:
Сообщение: Re: Meson build updates
Следующее
От: Nathan Bossart
Дата:
Сообщение: Re: pgsql: Fix search_path to a safe value during maintenance operations.