Re: New "raw" COPY format

Поиск
Список
Период
Сортировка
От Joel Jacobson
Тема Re: New "raw" COPY format
Дата
Msg-id 2dde695a-4a30-4aea-8978-aa78c5c1648e@app.fastmail.com
обсуждение исходный текст
Ответ на Re: New "raw" COPY format  (Tatsuo Ishii <ishii@postgresql.org>)
Список pgsql-hackers
On Fri, Oct 18, 2024, at 19:24, Joel Jacobson wrote:
> Attachments:
> * v11-0001-Refactor-ProcessCopyOptions-introduce-CopyFormat-enu.patch
> * v11-0002-Add-raw-format-to-COPY-command.patch

Here is a demo of a importing a decently sized real text file,
that can't currently be imported without the CSV hack:

$ head /var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64
.package-cache-mutate                        devel/cargo
bin                                admin/base-files
bin/archdetect                            admin/ubiquity
bin/autopartition                        admin/ubiquity
bin/autopartition-crypto                    admin/ubiquity
bin/autopartition-loop                        admin/ubiquity
bin/autopartition-lvm                        admin/ubiquity
bin/block-attr                            admin/ubiquity
bin/blockdev-keygen                        admin/ubiquity
bin/blockdev-wipe                        admin/ubiquity

This file uses a combination of tabs and spaces, in between the two columns,
so none of the existing formats are suitable to deal with this file.

$ ls -lah /var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64
-rw-r--r-- 1 root root 791M Apr 24 02:07 /var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64

To import using the CSV hack, we first have find two bytes that don't exist anyway,
which can be done using e.g. ripgrep. The below command verifies \x01 and \x02
don't exist anywhere:

$ rg -uuu --multiline '(?-u)[\x01|\x02]' /var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64
$

Knowing these bytes don't exist anywhere,
we can then safely use these as delimiter and quote characters,
as a hack to disable these features:

CREATE TABLE package_contents (raw_line text);

COPY package_contents FROM '/var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64' (FORMAT CSV,
DELIMITERE'\x01', QUOTE E'\x02');
 
COPY 8443588
Time: 3882.100 ms (00:03.882)
Time: 3552.991 ms (00:03.553)
Time: 3748.038 ms (00:03.748)
Time: 3775.947 ms (00:03.776)
Time: 3729.020 ms (00:03.729)

I tested writing a Rust program that would read the file line-by-line and INSERT each line instead.
This is of course a lot slower, since it has to execute each insert separately:

$ cargo run --release
   Compiling insert_package_contents v0.1.0 (/home/joel/insert_package_contents)
    Finished `release` profile [optimized] target(s) in 0.70s
     Running `target/release/insert_package_contents`
Connecting to the PostgreSQL database...
Successfully connected to the database.
Starting to insert lines from the file...
Successfully inserted 8443588 lines into package_contents in 134.65s.

New approach using the RAW format:

COPY package_contents FROM '/var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64' (FORMAT RAW,
DELIMITERE'\n');
 
COPY 8443588
Time: 2918.489 ms (00:02.918)
Time: 3020.372 ms (00:03.020)
Time: 3336.589 ms (00:03.337)
Time: 3067.268 ms (00:03.067)
Time: 3343.694 ms (00:03.344)

Apart from the convenience improvement,
it seems to be somewhat faster already.

/Joel



В списке pgsql-hackers по дате отправления: