Re: Parallel copy

From: Dilip Kumar
Subject: Re: Parallel copy
Date:
Msg-id: CAFiTN-vM-X6hU1H2m=zK98pjVv-2tGqQ1ZTiuxFX8P1N24FzPw@mail.gmail.com
In response to: Re: Parallel copy  (Ants Aasma <ants@cybertec.at>)
List: pgsql-hackers
On Wed, Feb 26, 2020 at 8:47 PM Ants Aasma <ants@cybertec.at> wrote:
>
> On Tue, 25 Feb 2020 at 18:00, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> > Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> > so I still think we need to do some measurements first. I'm willing to
> > do that, but (a) I doubt I'll have time for that until after 2020-03,
> > and (b) it'd be good to agree on some set of typical CSV files.
>
> I agree that getting a nice varied dataset would be nice. Including
> things like narrow integer only tables, strings with newlines and
> escapes in them, extremely wide rows.
>
> I tried to capture a quick profile just to see what it looks like.
> Grabbed a random open data set from the web, about 800MB of narrow
> rows CSV [1].
>
> Script:
> CREATE TABLE census (year int,age int,ethnic int,sex int,area text,count text);
> COPY census FROM '.../Data8277.csv' WITH (FORMAT 'csv', HEADER true);
>
> Profile:
> # Samples: 59K of event 'cycles:u'
> # Event count (approx.): 57644269486
> #
> # Overhead  Command   Shared Object       Symbol
> # ........  ........  ..................  .......................................
> #
>     18.24%  postgres  postgres            [.] CopyReadLine
>      9.23%  postgres  postgres            [.] NextCopyFrom
>      8.87%  postgres  postgres            [.] NextCopyFromRawFields
>      5.82%  postgres  postgres            [.] pg_verify_mbstr_len
>      5.45%  postgres  postgres            [.] pg_strtoint32
>      4.16%  postgres  postgres            [.] heap_fill_tuple
>      4.03%  postgres  postgres            [.] heap_compute_data_size
>      3.83%  postgres  postgres            [.] CopyFrom
>      3.78%  postgres  postgres            [.] AllocSetAlloc
>      3.53%  postgres  postgres            [.] heap_form_tuple
>      2.96%  postgres  postgres            [.] InputFunctionCall
>      2.89%  postgres  libc-2.30.so        [.] __memmove_avx_unaligned_erms
>      1.82%  postgres  libc-2.30.so        [.] __strlen_avx2
>      1.72%  postgres  postgres            [.] AllocSetReset
>      1.72%  postgres  postgres            [.] RelationPutHeapTuple
>      1.47%  postgres  postgres            [.] heap_prepare_insert
>      1.31%  postgres  postgres            [.] heap_multi_insert
>      1.25%  postgres  postgres            [.] textin
>      1.24%  postgres  postgres            [.] int4in
>      1.05%  postgres  postgres            [.] tts_buffer_heap_clear
>      0.85%  postgres  postgres            [.] pg_any_to_server
>      0.80%  postgres  postgres            [.] pg_comp_crc32c_sse42
>      0.77%  postgres  postgres            [.] cstring_to_text_with_len
>      0.69%  postgres  postgres            [.] AllocSetFree
>      0.60%  postgres  postgres            [.] appendBinaryStringInfo
>      0.55%  postgres  postgres            [.] tts_buffer_heap_materialize.part.0
>      0.54%  postgres  postgres            [.] palloc
>      0.54%  postgres  libc-2.30.so        [.] __memmove_avx_unaligned
>      0.51%  postgres  postgres            [.] palloc0
>      0.51%  postgres  postgres            [.] pg_encoding_max_length
>      0.48%  postgres  postgres            [.] enlargeStringInfo
>      0.47%  postgres  postgres            [.] ExecStoreVirtualTuple
>      0.45%  postgres  postgres            [.] PageAddItemExtended
>
> So that confirms that the parsing is a huge chunk of overhead with
> current splitting into lines being the largest portion. Amdahl's law
> says that splitting into tuples needs to be made fast before
> parallelizing makes any sense.
>

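Putting rough numbers on the Amdahl's-law point above: if CopyReadLine (~18% of cycles in the quoted profile) stays serial while everything else parallelizes perfectly, total speedup is bounded around 5.5x no matter how many workers are used. A minimal sketch, using only the 18.24% figure from the perf output (the function itself is illustrative, not PostgreSQL code):

```python
# Amdahl's law: speedup with serial fraction s and n parallel workers.
# serial = CopyReadLine's share of cycles in the profile quoted above.

def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Overall speedup when a fixed fraction of the work stays serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)

serial = 0.1824  # CopyReadLine share of cycles
for n in (2, 4, 8, 16):
    print(f"{n:2d} workers: {amdahl_speedup(serial, n):.2f}x speedup")

# The limit as workers grow is 1 / serial.
print(f"upper bound: {1.0 / serial:.1f}x")
```

This is why making the line-splitting step itself fast (or parallel) matters before adding workers elsewhere.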
I have run a very simple case on a table with 2 indexes, and I can see that a
lot of the time is spent in index insertion.  I agree that a good amount of
time is spent in tokenizing, but it is not huge compared to index insertion.

I have expanded the time spent in the CopyFrom function from my perf
report and pasted it here.  We can see that a large share of the time is
spent in ExecInsertIndexTuples (77%).  I agree that we need to further
evaluate how much of that is I/O vs. CPU.  But the point I want to make is
that it is not true for all cases that parsing takes the maximum amount
of time.

   - 99.50% CopyFrom
      - 82.90% CopyMultiInsertInfoFlush
         - 82.85% CopyMultiInsertBufferFlush
            + 77.68% ExecInsertIndexTuples
            + 3.74% table_multi_insert
            + 0.89% ExecClearTuple
      - 12.54% NextCopyFrom
         - 7.70% NextCopyFromRawFields
            - 5.72% CopyReadLine
                 3.96% CopyReadLineText
               + 1.49% pg_any_to_server
              1.86% CopyReadAttributesCSV
         + 3.68% InputFunctionCall
      + 2.11% ExecMaterializeSlot
      + 0.94% MemoryContextReset

My test:
-- Prepare:
CREATE TABLE t (a int, b int, c varchar);
insert into t select i,i, 'aaaaaaaaaaaaaaaaaaaaaaaa' from
generate_series(1,10000000) as i;
copy t to '/home/dilipkumar/a.csv'  WITH (FORMAT 'csv', HEADER true);
truncate table t;
create index idx on t(a);
create index idx1 on t(c);

-- Test CopyFrom and measure with perf:
copy t from '/home/dilipkumar/a.csv'  WITH (FORMAT 'csv', HEADER true);

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


