Re: Loading 500m json files to database

From: Christopher Browne
Subject: Re: Loading 500m json files to database
Date:
Msg-id CAFNqd5UWCrnh6o4R1-sQ=Kr-HLoK0M5d16HJ=y-9vM+Je-7oqQ@mail.gmail.com
In reply to: Loading 500m json files to database  (pinker <pinker@onet.eu>)
Responses: Re: Loading 500m json files to database
List: pgsql-general


On Mon, 23 Mar 2020 at 06:24, pinker <pinker@onet.eu> wrote:
Hi, do you maybe have an idea how to make the loading process faster?

I have 500 million json files (1 json document per file) that I need to load
into the db.
My test set is "only" 1 million files.

What I came up with now is:

time for i in datafiles/*; do
  psql -c "\copy json_parts(json_data) FROM $i"&
done

which is the fastest so far, but it's not what I expected. Loading 1m files
takes me ~3h, so loading 500 times more is just unacceptable.

Some facts:
* the target db is in the cloud, so there is no option to do tricks like
turning fsync off
* the version is PostgreSQL 11
* I can spin up a huge Postgres instance if necessary in terms of cpu/ram
* I already tried hash partitioning (writing to 10 different tables instead
of 1)


Any ideas?


Well, you're paying for a lot of overhead in that approach: launching a psql
command, connecting to the database, spawning a backend process, starting a
transaction, committing the transaction, closing the backend process,
disconnecting from the database, and cleaning up after the psql command exits.
And you're doing all of that 500 million times.

The one thing I left out of that list is the loading of a single tuple into json_parts.

What you could do to improve things quite a lot would be to group some number
of those files together, so that each time you pay for the overhead, you at least
get the benefit of loading several entries into json_parts.

So, loosely, I'd recommend using /bin/cat (or similar) to assemble several files together
into one, and then \copy that one file in.
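
A minimal sketch of that batching, assuming each json file holds its whole
document on a single line (so plain concatenation gives COPY one tuple per
line) and that no filenames contain whitespace; the batch size of 1000 and
the /tmp paths are arbitrary choices, not anything you have to use:

# split the full file list into chunks of ~1000 names each
# (-a 4 widens the suffix so split can produce more than 676 chunks)
find datafiles -maxdepth 1 -type f | split -a 4 -l 1000 - /tmp/batch_

for list in /tmp/batch_*; do
  # concatenate this batch of files into one input file ...
  xargs cat < "$list" > "$list.data"
  # ... and load it with a single connection/transaction/COPY
  psql -c "\copy json_parts(json_data) FROM '$list.data'"
  rm "$list.data" "$list"
done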

Having 2 tuples loaded at once drops overhead by 50%
Having 10 tuples loaded at once drops overhead by 90%
Having 100 tuples loaded at once drops overhead by 99%
Having 1000 tuples loaded at once drops overhead by 99.9%

There probably isn't too much real value in going past 1000 tuples per batch; by
that point, the overhead is getting pretty immaterial.

Reducing that overhead is the single most important thing you can do.

It is also quite likely that you could run such streams in parallel, although
it would require quite a bit more information about the I/O capabilities of your
hardware to know if that would do any good.
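
To make that concrete in an equally rough way: something like the following
could replace the serial loop in the sketch above, using GNU xargs -P to keep
a few loads going at once. The choice of 4 streams is entirely arbitrary, and
only measurement will tell whether any parallelism helps at all on your instance:

# run up to 4 batch loads concurrently; each one concatenates its file
# list, loads it over its own connection with \copy, then cleans up
ls /tmp/batch_* | xargs -P 4 -I{} sh -c '
  xargs cat < {} > {}.data &&
  psql -c "\copy json_parts(json_data) FROM {}.data" &&
  rm {}.data {}
'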
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"
