Perform COPY FROM encoding conversions in larger chunks

From: Heikki Linnakangas
Subject: Perform COPY FROM encoding conversions in larger chunks
Date:
Msg-id: e7861509-3960-538a-9025-b75a61188e01@iki.fi
Replies: Re: Perform COPY FROM encoding conversions in larger chunks  (Bruce Momjian <bruce@momjian.us>)
Re: Perform COPY FROM encoding conversions in larger chunks  (Heikki Linnakangas <hlinnaka@iki.fi>)
Re: Perform COPY FROM encoding conversions in larger chunks  (John Naylor <john.naylor@enterprisedb.com>)
List: pgsql-hackers
I've been looking at the COPY FROM parsing code, trying to refactor it 
so that parallel COPY would be easier to implement. I haven't touched 
parallelism itself, just looked for ways to smooth the path, and for 
ways to speed up COPY in general.

Currently, COPY FROM parses the input one line at a time. Each line is 
converted to the database encoding separately, or if the file encoding 
matches the database encoding, we just check that the input is valid 
for that encoding. It would be more efficient to do the encoding 
conversion/verification in larger chunks. At least potentially; the 
current conversion/verification implementations work one byte at a 
time, so it doesn't matter too much yet, but there are faster 
algorithms out there that use SIMD instructions or lookup tables and 
benefit from larger inputs.

So I'd like to change it so that the encoding conversion/verification 
is done before splitting the input into lines. The problem is that the 
conversion and verification functions throw an error on incomplete 
input, so we can't pass them a chunk of N raw bytes if we don't know 
where the character boundaries are. The first step in this effort is 
to change the conversion and verification routines to allow that. 
Attached patches 0001-0004 do that:

For encoding conversions, change the signature of the conversion 
function, by adding a "bool noError" argument and making them return the 
number of input bytes successfully converted. That way, the conversion 
function can be called in a streaming fashion: load a buffer with raw 
input without caring about the character boundaries, call the conversion 
function to convert it except for the few bytes at the end that might be 
an incomplete character, load the buffer with more data, and repeat.
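
To illustrate the pattern, here's a minimal, self-contained C sketch. 
It is not the patch itself: convert_buf(), its signature, and 
RAW_BUF_SIZE are hypothetical stand-ins, and the stub just copies 
whole UTF-8 characters rather than transcoding. The driver loop is 
the point: convert what you can, carry the incomplete tail to the 
front of the buffer, and refill.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define RAW_BUF_SIZE 65536

/*
 * Hypothetical stand-in for a conversion function with the proposed
 * signature.  A real one would transcode; this stub just copies whole
 * UTF-8 characters.  With noError, it stops at a truncated character
 * at the end instead of throwing, and returns the number of input
 * bytes consumed.
 */
static int
convert_buf(const char *src, int srclen, char *dst, int dstlen,
            bool noError)
{
    int     i = 0;

    while (i < srclen)
    {
        unsigned char b = (unsigned char) src[i];
        int     charlen;

        if (b < 0x80)
            charlen = 1;
        else if ((b & 0xE0) == 0xC0)
            charlen = 2;
        else if ((b & 0xF0) == 0xE0)
            charlen = 3;
        else
            charlen = 4;

        if (i + charlen > srclen || i + charlen > dstlen)
            break;              /* truncated char, or out of output space */
        memcpy(dst + i, src + i, charlen);
        i += charlen;
    }
    (void) noError;             /* a real proc would report errors if false */
    return i;
}

static void
stream_convert(FILE *in)
{
    char    raw[RAW_BUF_SIZE];
    char    converted[RAW_BUF_SIZE * 4];    /* conversion may expand */
    int     leftover = 0;

    for (;;)
    {
        int     nread = (int) fread(raw + leftover, 1,
                                    sizeof(raw) - leftover, in);
        int     avail = leftover + nread;
        int     done;

        if (avail == 0)
            break;

        /* Convert all but a possibly-incomplete trailing character. */
        done = convert_buf(raw, avail, converted, sizeof(converted), true);

        /* ... hand the 'done' converted bytes to the line splitter ... */

        /* Move the unconverted tail to the front and fetch more input. */
        leftover = avail - done;
        memmove(raw, raw + done, leftover);

        if (nread == 0)
            break;              /* EOF: any leftover is a truncated char */
    }
}

int
main(void)
{
    stream_convert(stdin);
    return 0;
}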

For encoding verification, add a new function that works similarly. It 
takes N bytes of raw input, verifies as much of it as possible, and 
returns the number of input bytes that were valid. In principle, this 
could have been implemented by calling the existing pg_encoding_mblen() 
and pg_encoding_verifymb() functions in a loop, but that would be too 
slow, so this adds encoding-specific functions instead. The UTF-8 
implementation is slightly optimized by basically inlining the 
pg_utf8_mblen() call; the other implementations are pretty naive.
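
To make that contract concrete, here's a rough sketch of what such a 
verification function could look like for UTF-8. It is simplified (it 
accepts overlong encodings and surrogates, unlike the real checks); 
the interface is the point: verify from the start of the buffer and 
return the length of the valid prefix, treating a truncated character 
at the end as "not yet verified" rather than as an error.

#include <stddef.h>

/*
 * Rough sketch of a chunked verifier for UTF-8: scan whole characters
 * from the start of 's' and return the number of bytes that were
 * valid.  A truncated character at the end of the buffer is not an
 * error; it's simply not counted, so the caller can read more input
 * and retry from there.  (Simplified: unlike the real checks, this
 * accepts overlong encodings and surrogates.)
 */
size_t
utf8_verify_chunk(const unsigned char *s, size_t len)
{
    size_t  i = 0;

    while (i < len)
    {
        unsigned char b = s[i];
        size_t  charlen;

        if (b < 0x80)
            charlen = 1;        /* ASCII */
        else if ((b & 0xE0) == 0xC0)
            charlen = 2;
        else if ((b & 0xF0) == 0xE0)
            charlen = 3;
        else if ((b & 0xF8) == 0xF0)
            charlen = 4;
        else
            return i;           /* invalid first byte */

        if (i + charlen > len)
            return i;           /* incomplete character at buffer end */

        for (size_t j = 1; j < charlen; j++)
        {
            if ((s[i + j] & 0xC0) != 0x80)
                return i;       /* bad continuation byte */
        }
        i += charlen;
    }
    return i;
}

With this shape, the caller can tell "invalid input" apart from "need 
more data": if the unverified tail might be a truncated character and 
more input is coming, read more and call again; at end of input, any 
remainder is a genuine error.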

- Heikki

