Re: Import large data set into a table and resolve duplicates?

From: Francisco Olarte
Subject: Re: Import large data set into a table and resolve duplicates?
Date:
Msg-id: CA+bJJbz5oxoq2GO62--ZpmrQc0cLaGQZgxnqUs7Wz4MydkgO0g@mail.gmail.com
In response to: Re: Import large data set into a table and resolve duplicates?  (Eugene Dzhurinsky <jdevelop@gmail.com>)
List: pgsql-general


On Sun, Feb 15, 2015 at 7:11 PM, Eugene Dzhurinsky <jdevelop@gmail.com> wrote:
On Sun, Feb 15, 2015 at 06:48:36PM +0100, Francisco Olarte wrote:
> One thing that strikes me is you are either at the beginning of your usage
> of this or you have A LOT of already present lines in the patch ( I mean,
> the patch has one fourth of the lines of the dictionary ). Which is the case?

Right now I'm at the very beginning stage, yes, and I expect a "cache miss"
ratio for the dictionary of at least 70%, which will eventually drop
down to 5-10%.

Well, that gives one important data point ( the other being the expected median/average size of your series entries ).

And this means the size of the problem is completely different, so the solutions should be too. I mean, at a 70% miss rate with a 50k-line patch you are expecting 35k insertions, and you still expect 2.5k-5k insertions even once it drops. Not knowing any other number, I'll urge you to DISCARD all of my proposed solutions: i.e., if you have 20 runs at 35k followed by 100 runs at 5k, you will end up with a 200+20*35+100*5 = 200+700+500 = 1400k-row table receiving 5k new rows from a 50k-line file, which is a completely different problem than the original one. Not having any expected problem size, I will not propose anything more.


> When doing bulk operations ( like this, with a record count of 25% of the
> whole file ) indexing is nearly always slower than sorting and merging. As
.....

So you suggest taking this out of Postgres?

As stated above, I do not presently suggest anything, as I'm unable to do so without further knowledge of the problem ( which, apart from the great unknown of the problem size, would also require knowing how the dictionary is going to be used, among other things ).

That's interesting. Simply put:
I'll dump the dictionary, sorted by series, to some file. Then sort
the patch file by series. Then merge the dictionary (left) and the patch
(right) files, and during the merge, if a (right) line doesn't have a
corresponding (left) line, put a nextval expression for the sequence as its ID.
Then truncate the existing dictionary table and COPY the data from the
merged file into it.
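
Just to make the mechanics concrete, here is a rough sketch of that merge in Python; the tab-separated id/series layout of the dump, the file handling and the sequence name dictionary_id_seq are assumptions for illustration, not anything from your setup. One caveat: COPY loads literal data, so the rows carrying a nextval() placeholder would have to be loaded with INSERT ... SELECT ( or get their ids assigned beforehand ) rather than a straight COPY.

# Merge a sorted dictionary dump ("id<TAB>series" per line) with a sorted
# patch file (one series term per line); both must be sorted by series
# with the same collation.  Names here are illustrative assumptions.
def merge_dictionary_and_patch(dict_path, patch_path, out_path):
    with open(dict_path) as d, open(patch_path) as p, open(out_path, "w") as out:
        dict_line = d.readline()
        patch_line = p.readline()
        while patch_line:
            patch_term = patch_line.rstrip("\n")
            if dict_line:
                dict_id, dict_term = dict_line.rstrip("\n").split("\t", 1)
            if dict_line and dict_term <= patch_term:
                # existing dictionary row: keep its id
                out.write(dict_line)
                dict_line = d.readline()
                if dict_term == patch_term:
                    patch_line = p.readline()
            else:
                # patch-only term: no id yet, let the sequence assign one
                out.write("nextval('dictionary_id_seq')\t%s\n" % patch_term)
                patch_line = p.readline()
        # copy through whatever is left of the dictionary
        if dict_line:
            out.write(dict_line)
        out.writelines(d)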

I would not truncate the dictionary, just put insert-returning expressions in the data file. Anyway, I would not presently suggest doing that; with the data I have, its performance is a great unknown. With the original problem size it can be made highly performant, and I know batch processes ( sort, merge, etc. ) easily beat ''interactive'' ones ( index lookups etc. ) throughput-wise ( and so total-time-wise ) at some sizes, but you need to know the problem.
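
For illustration only ( a sketch assuming a dictionary(series) table and naive literal quoting; real code should use the driver's parameter binding ), the insert-only variant boils down to emitting something like:

# Build one multi-row INSERT ... RETURNING for just the terms that did not
# match during the merge, instead of truncating and reloading everything.
def build_insert_returning(new_terms):
    if not new_terms:
        return None
    def quote(t):
        # naive quoting, for the sketch only
        return "'" + t.replace("'", "''") + "'"
    values = ",\n    ".join("(%s)" % quote(t) for t in new_terms)
    return ("INSERT INTO dictionary (series)\nVALUES\n    " + values +
            "\nRETURNING id, series;")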

Also, when doing this, it depends on who updates the table and how. If only the patches do it, you can keep the table on the server and a sorted copy of it outside, and use the sorted copy plus the patch to prepare a set of insert-returnings, which you then run and whose return values you use to generate a new sorted copy. This is normally very fast, can be made robust and, most important, does not hit the server too much: you use the sorted copy plus the patch to generate a 'to_insert' file, then go to the server, copy that into a to_insert temp table, insert it into the dictionary with RETURNING, capture the result ( with IDs ) in an 'inserted' temp table, copy that table out, and with the copied-out data ( already sorted, or just sort it again ) you patch the local dictionary copy. If you ever think you've lost sync, you just copy out the dictionary again and sort it. This is batch-processing 101, but you need knowledge of the problem; with the current data, I cannot recommend anything.
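
A sketch of that round trip in Python, using psycopg2 purely as an example driver; the dictionary(id, series) table, the file names and the temp-table layout are assumptions for illustration, and the RETURNING rows are fetched through the driver rather than routed into an 'inserted' temp table, which amounts to the same thing here:

import psycopg2  # example driver; anything that supports COPY would do

def apply_patch(conn, to_insert_path, dict_dump_path):
    # One patch run: load the new terms, insert them with RETURNING,
    # then dump the dictionary again, sorted, to refresh the local copy.
    with conn, conn.cursor() as cur:
        cur.execute("CREATE TEMP TABLE to_insert (series text) ON COMMIT DROP")
        with open(to_insert_path) as f:
            cur.copy_expert("COPY to_insert (series) FROM STDIN", f)

        cur.execute("""
            INSERT INTO dictionary (series)
            SELECT series FROM to_insert
            RETURNING id, series
        """)
        inserted = cur.fetchall()  # (id, series) pairs for the new rows

        # Refresh the sorted local copy; merging `inserted` into the old
        # copy also works and avoids re-dumping the whole table.
        with open(dict_dump_path, "w") as out:
            cur.copy_expert(
                "COPY (SELECT id, series FROM dictionary"
                " ORDER BY series) TO STDOUT", out)
    return inserted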

I've been doing that sort of thing for decades, and it works quite well, but I normally only bother doing it for ''big'' tasks, meaning multi-hour at least. The thing is, indexes seem quite fast because they give you a row very fast, while the sort spends hours preparing; but after that the sorted rows go in really fast, much faster than with the index, and the sorted approach finishes first. As I say, it depends on the problem. If you have a 5-minute runtime doing straight updates/inserts ( not unreasonable for 200k/50k ) to be done daily, it does not make sense to make it more complex just to cut the runtime to 15 seconds. OTOH, if you have a weekly 20-hour process which you can cut to 6 hours to fit into a nightly window, it does ( and I'm not exaggerating: in the early 90s, changing from indexes to sorting and batching allowed me to cut a 4.5-DAY process to 8 hours. It did not seem to do anything for the first 7 hours, but then it went really fast ).

Anyway, design it according to your knowledge of the problem.

Regards.
    Francisco Olarte.





Is that what you meant?

Thank you!

--
Eugene Dzhurinsky
