Re: Fast insertion indexes: why no developments

From: Jeff Janes
Subject: Re: Fast insertion indexes: why no developments
Date:
Msg-id: CAMkU=1xtdHLwLQHJ2NOQEaNGPFjw2FtNiAX9MZ2n1i+YtphCrA@mail.gmail.com
In response to: Re: Fast insertion indexes: why no developments  (Leonardo Francalanci <m_lists@yahoo.it>)
List: pgsql-hackers
On Tue, Nov 5, 2013 at 9:52 AM, Leonardo Francalanci <m_lists@yahoo.it> wrote:
Jeff Janes wrote
> Some experiments I did a few years ago showed that applying sorts to the
> data to be inserted could be helpful even when the sort batch size was as
> small as one tuple per 5 pages of existing index.  Maybe even less.

Cool!!! Do you have any idea/hint on how I could try and replicate that?
Do you remember how you did it?

I can't find my notes but I remember more or less how I did it.

Since we don't yet have an insertion buffer that allows the rows to be sorted in a different order for each index, I had to simulate it using a table with a single index and hope that the results would extrapolate.

create table foo (x bigint);

To speed things up, you may want to prepopulate this with random data so that the size of the index-to-be will exceed shared_buffers, or physical RAM, before making the index.  Also, the effectiveness might depend on how much the index has grown since its creation, since leaf pages are initially correlated between physical order and logical order, but that correlation decays over time.  So you may want to try different initial seed sizes.
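A minimal seeding step (my sketch, not from the original post) might look like the following; the row count is purely illustrative, so tune it until the index built in the next step outgrows shared_buffers (or RAM) on your machine:

```sql
-- Illustrative seed: the value range matches the perl generator below
-- (random() * 1e8); pick a row count large enough that the index
-- created next exceeds shared_buffers, or physical RAM.
insert into foo
select (random() * 1e8)::bigint
from generate_series(1, 100000000);
```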

create index on foo (x);

Then I use perl to make run-sorted data with different run sizes, and load it via \copy.  I put all the data points in memory up front rather than generating them per run on the fly, so that perl consumes about the same amount of memory regardless of the run size.  You would want to use more than 1..1e6 if you are on a machine with very large RAM.


Something like: 

for run_size in 1 10 100 1000 10000 100000; do
  perl -le 'my @x; push @x, int(rand()*1e8) foreach 1..1e6; while (@x) {print foreach sort {$a<=>$b} splice @x,0,'$run_size'; }' | time psql -c '\copy foo from stdin'
done
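As a quick sanity check (my addition, not part of the original procedure), you can confirm the generator really emits internally sorted runs without touching the database; here the element count is shrunk to 1e5 and the run size fixed at 1000 to keep it fast:

```shell
# Hypothetical check: emit 1e5 values in sorted runs of 1000,
# then verify that every run is ascending (no database needed).
perl -le 'my @x; push @x, int(rand()*1e8) foreach 1..1e5;
          while (@x) { print foreach sort {$a<=>$b} splice @x,0,1000; }' |
awk 'NR % 1000 == 1 { prev = $1 + 0; next }
     $1 + 0 < prev  { bad = 1; exit 1 }
                    { prev = $1 + 0 }
     END { print (bad ? "run ordering violated" : "all runs sorted") }'
```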

But you probably want another, inner loop so that the \copy gets executed multiple times per run_size, so that each run_size runs for at least a couple of checkpoint cycles.

Cheers,

Jeff
