Re: Parallel copy
От | Ants Aasma |
---|---|
Тема | Re: Parallel copy |
Дата | |
Msg-id | CANwKhkM+PvRTYjyUgc2YETbpB10cFkDazGhTqf65MVfZ+tGOyw@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: Parallel copy (Tomas Vondra <tomas.vondra@2ndquadrant.com>) |
Ответы |
Re: Parallel copy
(Dilip Kumar <dilipbalaut@gmail.com>)
Re: Parallel copy (vignesh C <vignesh21@gmail.com>) |
Список | pgsql-hackers |
On Tue, 25 Feb 2020 at 18:00, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > Perhaps. I guess it'll depend on the CSV file (number of fields, ...), > so I still think we need to do some measurements first. I'm willing to > do that, but (a) I doubt I'll have time for that until after 2020-03, > and (b) it'd be good to agree on some set of typical CSV files. I agree that getting a nice varied dataset would be nice. Including things like narrow integer only tables, strings with newlines and escapes in them, extremely wide rows. I tried to capture a quick profile just to see what it looks like. Grabbed a random open data set from the web, about 800MB of narrow rows CSV [1]. Script: CREATE TABLE census (year int,age int,ethnic int,sex int,area text,count text); COPY census FROM '.../Data8277.csv' WITH (FORMAT 'csv', HEADER true); Profile: # Samples: 59K of event 'cycles:u' # Event count (approx.): 57644269486 # # Overhead Command Shared Object Symbol # ........ ........ .................. ....................................... # 18.24% postgres postgres [.] CopyReadLine 9.23% postgres postgres [.] NextCopyFrom 8.87% postgres postgres [.] NextCopyFromRawFields 5.82% postgres postgres [.] pg_verify_mbstr_len 5.45% postgres postgres [.] pg_strtoint32 4.16% postgres postgres [.] heap_fill_tuple 4.03% postgres postgres [.] heap_compute_data_size 3.83% postgres postgres [.] CopyFrom 3.78% postgres postgres [.] AllocSetAlloc 3.53% postgres postgres [.] heap_form_tuple 2.96% postgres postgres [.] InputFunctionCall 2.89% postgres libc-2.30.so [.] __memmove_avx_unaligned_erms 1.82% postgres libc-2.30.so [.] __strlen_avx2 1.72% postgres postgres [.] AllocSetReset 1.72% postgres postgres [.] RelationPutHeapTuple 1.47% postgres postgres [.] heap_prepare_insert 1.31% postgres postgres [.] heap_multi_insert 1.25% postgres postgres [.] textin 1.24% postgres postgres [.] int4in 1.05% postgres postgres [.] tts_buffer_heap_clear 0.85% postgres postgres [.] pg_any_to_server 0.80% postgres postgres [.] pg_comp_crc32c_sse42 0.77% postgres postgres [.] cstring_to_text_with_len 0.69% postgres postgres [.] AllocSetFree 0.60% postgres postgres [.] appendBinaryStringInfo 0.55% postgres postgres [.] tts_buffer_heap_materialize.part.0 0.54% postgres postgres [.] palloc 0.54% postgres libc-2.30.so [.] __memmove_avx_unaligned 0.51% postgres postgres [.] palloc0 0.51% postgres postgres [.] pg_encoding_max_length 0.48% postgres postgres [.] enlargeStringInfo 0.47% postgres postgres [.] ExecStoreVirtualTuple 0.45% postgres postgres [.] PageAddItemExtended So that confirms that the parsing is a huge chunk of overhead with current splitting into lines being the largest portion. Amdahl's law says that splitting into tuples needs to be made fast before parallelizing makes any sense. Regards, Ants Aasma [1] https://www3.stats.govt.nz/2018census/Age-sex-by-ethnic-group-grouped-total-responses-census-usually-resident-population-counts-2006-2013-2018-Censuses-RC-TA-SA2-DHB.zip
В списке pgsql-hackers по дате отправления:
Следующее
От: Alex MalekДата:
Сообщение: Re: bad wal on replica / incorrect resource manager data checksum inrecord / zfs