Fwd: Import large data set into a table and resolve duplicates?

From Francisco Olarte
Subject Fwd: Import large data set into a table and resolve duplicates?
Date
Msg-id CA+bJJbwP7uRYb8e8j=FEPW2Wy7VCG9=2Og0Zxo1VAQO7CNY1jg@mail.gmail.com
In reply to Import large data set into a table and resolve duplicates?  (Eugene Dzhurinsky <jdevelop@gmail.com>)
List pgsql-general
Forwarding to the list, as I forgot ( as usual; I think I did the same with my original message ). Sorry for the duplicates.
---------- Forwarded message ----------
From: Francisco Olarte <folarte@peoplecall.com>
Date: Sun, Feb 15, 2015 at 6:48 PM
Subject: Re: [GENERAL] Import large data set into a table and resolve duplicates?
To: Eugene Dzhurinsky <jdevelop@gmail.com>


Hi Eugene:

On Sun, Feb 15, 2015 at 6:29 PM, Eugene Dzhurinsky <jdevelop@gmail.com> wrote:
On Sun, Feb 15, 2015 at 01:06:02PM +0100, Francisco Olarte wrote:
> You state below 200k rows, 50k lines per path. That is not huge unless
> "series" is really big, is it?

series data is between 100 and 4096 chars

Well, then it depends on the average length. With a 1k average you'll have 200Mb for the series. Not having stats on your setup, that is easily manageable by today's standard machines.

One thing that strikes me is that you are either at the beginning of your usage of this, or you have A LOT of lines already present in the path ( I mean, the path has one fourth of the lines of the dictionary ). Which is the case?

That sounds cool, but I'm a bit worried about the performance of a lookup over the series column and the time it takes to create an index on the "series" column of the "temp" table. But perhaps it's better to try this and, if performance gets really bad, then do some optimizations, like partitioning etc.

Who said create an index on the temp tables? They are small. And, anyway, if I were doing this with files I'd NEVER do it by indexing; I'd sort, merge and go on ( this is why I suggested temp files, postgres does sorts and merges really well if it needs to, and the optimizer has other tricks available ).
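
To make that concrete, here is a minimal sketch of what I have in mind ( the dictionary(id, series) layout, the patches_in name and the file path are just assumptions for illustration, adapt them to your schema ):

    -- load the incoming batch into an un-indexed temp table and let the
    -- planner choose a hash or merge join against the dictionary
    create temp table patches_in ( series text );
    copy patches_in ( series ) from '/tmp/patches.txt';  -- or \copy from psql if the file lives on the client
    analyze patches_in;  -- fresh stats so the planner knows how small the batch is
    select d.id, d.series from dictionary d join patches_in p using ( series );

With a batch of ~50k rows the planner will most likely just hash the smaller side rather than probe any index.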

When doing bulk operations ( like this one, with a record count of 25% of the whole file ) indexing is nearly always slower than sorting and merging. As I said, I do not know the distribution of your data sizes, but assuming the worst case ( a dict of 200k 4k entries, which is 800Mb of data, and 50k 4k patch lines, which is 200Mb of data ), you can do it with text files in 5 minutes by sorting and merging in 8Mb of RAM ( I've done similar things in that kind of time, when disks were much slower; I've even done this kind of thing with half-inch tapes and it wasn't that slow ). You just sort the dictionary by series, sort the patch by series, read both comparing keys, and write a result file and a new dictionary; as the new dictionary comes out already sorted, you do not need to sort it again the next time. It was the usual way of updating accounts in the tape days.

And, as they always say: measure, then optimize. I would try it ( use some little tables, say a 20-entry dict and a 5-line patch, to debug the queries ), then measure ( if you don't mind hogging the server, or have a test one, launch the whole thing and measure the times; you may be surprised, sometimes simple algorithms are fast and get it done in a reasonable time ).
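
For the debugging step, a throw-away setup along these lines should be enough ( the id + unique series layout is an assumption, tweak it to match the real dictionary; note the temp table shadows the real one for this session ):

    -- tiny toy tables just to exercise the queries
    create temp table dictionary ( id serial primary key, series text unique );
    create temp table patches_in ( series text );
    insert into dictionary ( series ) select 'series_' || g from generate_series(1, 20) g;
    insert into patches_in ( series ) values ('series_3'), ('series_7'), ('new_a'), ('new_b'), ('new_c');

Run the combined query below against these and eyeball the result before pointing it at the real data.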

The only other thing I would try is maybe doing the whole thing without the temp tables, something like:

with new_rows as (
    insert into dictionary ( series )
      select p.series from patches_in p where p.series not in ( select series from dictionary )
      returning series, id )
select d.series, d.id from dictionary d join patches_in using ( series )
union all select series, id from new_rows  -- faster, as you know there will be no duplicates, do not bother checking for them

( you may need to experiment a little with the queries and use some with expressions, but you see the logic ).

Francisco Olarte.

Thank you!

--
Eugene Dzhurinsky

