Re: Loading 500m json files to database

From: Christopher Browne
Subject: Re: Loading 500m json files to database
Date:
Msg-id CAFNqd5UWCrnh6o4R1-sQ=Kr-HLoK0M5d16HJ=y-9vM+Je-7oqQ@mail.gmail.com
In reply to: Loading 500m json files to database  (pinker <pinker@onet.eu>)
Responses: Re: Loading 500m json files to database
List: pgsql-general


On Mon, 23 Mar 2020 at 06:24, pinker <pinker@onet.eu> wrote:
Hi, do you maybe have an idea how to make the loading process faster?

I have 500 million json files (1 json document per file) that I need to load
into the db.
My test set is "only" 1 million files.

What I came up with now is:

time for i in datafiles/*; do
  psql -c "\copy json_parts(json_data) FROM $i"&
done

which is the fastest so far, but it's not what I expected. Loading 1m files
takes me ~3h, so loading 500 times more is just unacceptable.

Some facts:
* the target db is in the cloud, so there is no option to do tricks like
turning fsync off
* the version is PostgreSQL 11
* I can spin up a huge Postgres instance if necessary in terms of cpu/ram
* I already tried hash partitioning (writing to 10 different tables instead
of 1)


Any ideas?


Well, you're paying for a lot of overhead in that approach: launching a psql
command, connecting to the database, spawning a backend process, starting a
transaction, committing the transaction, closing the backend process,
disconnecting from the database, and cleaning up after the psql command exits.
And you're doing all of that 500 million times.

The one thing I left out of that list is the loading of a single tuple into json_parts.

What you could do to improve things quite a lot would be to group some number
of those files together, so that each time you pay for the overhead, you at least
get the benefit of loading several entries into json_parts.

So, loosely, I'd recommend using /bin/cat (or similar) to assemble several files together
into one, and then \copy that one file in.
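
A minimal sketch of that batching, assuming each json file holds its whole
document on a single line (so plain concatenation gives COPY one tuple per
line) and that no filenames contain whitespace; the batch size of 1000 and
the /tmp paths are arbitrary choices, not anything you have to use:

# split the full file list into chunks of ~1000 names each
# (-a 4 widens the suffix so split can produce more than 676 chunks)
find datafiles -maxdepth 1 -type f | split -a 4 -l 1000 - /tmp/batch_

for list in /tmp/batch_*; do
  # concatenate this batch of files into one input file ...
  xargs cat < "$list" > "$list.data"
  # ... and load it with a single connection/transaction/COPY
  psql -c "\copy json_parts(json_data) FROM '$list.data'"
  rm "$list.data" "$list"
done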

Having 2 tuples loaded at once drops overhead by 50%
Having 10 tuples loaded at once drops overhead by 90%
Having 100 tuples loaded at once drops overhead by 99%
Having 1000 tuples loaded at once drops overhead by 99.9%

There probably isn't too much real value in going past 1000 tuples per batch; by
that point, the overhead is getting pretty immaterial.

Reducing that overhead is the single most important thing you can do.

It is also quite likely that you could run such streams in parallel, although
it would require quite a bit more information about the I/O capabilities of your
hardware to know if that would do any good.
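
To make that concrete in an equally rough way: something like the following
could replace the serial loop in the sketch above, using GNU xargs -P to keep
a few loads going at once. The choice of 4 streams is entirely arbitrary, and
only measurement will tell whether any parallelism helps at all on your instance:

# run up to 4 batch loads concurrently; each one concatenates its file
# list, loads it over its own connection with \copy, then cleans up
ls /tmp/batch_* | xargs -P 4 -I{} sh -c '
  xargs cat < {} > {}.data &&
  psql -c "\copy json_parts(json_data) FROM {}.data" &&
  rm {}.data {}
'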
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"
