Populating huge tables each day

From: Ben-Nes Yonatan
Subject: Populating huge tables each day
Msg-id: 42C05DD6.1000508@canaan.co.il
List: pgsql-general
Hi All,

First, I apologize for the length of this email; I'm just afraid that my
problem is quite complicated to explain, and also for my English, which
is not my native language... sorry :)

I'm currently building a site for a client which will need to update its
database on a daily routine with about 3.5 million rows from data files
(the files vary in size and can reach up to 1 million rows per file).
That data will be retrieved from several files (text or XML, by my
choice); part of the rows will need to be INSERTed because they don't
have any previous record, and some will need to UPDATE previous records,
or just be INSERTed while the previous records are DELETEd.
Besides that, the new data also carries information that tells how to
populate another table (2 fields, and the result is a lot fewer rows),
and the two tables are connected by a foreign key which is written in
each row of the main table (the one with the 3.5 million rows).
The site must work 24/7, and it will probably have 2 servers running
PostgreSQL (working as 1 database); the scripting language I'll use is
PHP, if that changes anything.

I have thought about how to accomplish this, and I would like to receive
comments and ideas.
I'm mostly worried about the stress the process will put on the server,
and about making sure the site displays working information the whole
time.

A brief explanation of the definitions I'll use below (a rough sketch of
these schemas follows the list):
1. New data - the data retrieved from the files (text or XML).
2. Main table - the final table, which needs to hold about 3.5 million
rows. It can be empty before the process, or it can already hold
information, some of which needs to be updated using the "New data";
the rest of the "New data" needs to be inserted into it, while the
previous data which didn't get updated needs to be deleted.
3. Second table - the table holding the information that the data in the
"Main table" needs to be connected to using foreign keys.
4. Temp table - a temporary table which will hold the "New data" until it
is ready to INSERT/UPDATE the data in the "Main table" (it has exactly
the same columns as the "Main table").
5. Temp table2 - a temporary table created with CREATE TEMPORARY
TABLE AS (the former SELECT INTO).
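
To make these definitions concrete, here is a rough guess at the schemas;
the table and column names (item_id, descr, field1, field2, second_id)
are placeholders, since I haven't listed the real columns here:

  -- "Second table": built from 2 fields of the new data, far fewer rows.
  CREATE TABLE second_table (
      id     serial PRIMARY KEY,
      field1 text,
      field2 text,
      UNIQUE (field1, field2)
  );

  -- "Main table": ~3.5 million rows, each pointing at a second_table row.
  CREATE TABLE main_table (
      item_id   text PRIMARY KEY,
      descr     text,
      second_id integer REFERENCES second_table (id)
  );

  -- "Temp table": staging area for the new data as it arrives from the
  -- files, so here it carries the raw field1/field2 values (the matching
  -- second_table id is looked up later, in step E of the plan).  It could
  -- just as well be a TEMPORARY table, as long as the COPY and the
  -- function call happen in the same session.
  CREATE TABLE temp_table (
      item_id text,
      descr   text,
      field1  text,
      field2  text
  );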

My plan (a rough SQL sketch of steps 1 and 2 follows the list):
1. With COPY FROM, I'll insert the data into the "Temp table" (all of the
files in text format).
2. Run a PL/pgSQL function: {

A. Start a transaction.
B. DELETE the contents of the currently existing "Second table".
C. INSERT data into the "Second table" using the "Temp table" - each
row's values will be checked (SELECT) against the "Second table" to see
if they already exist, and if not, an INSERT will create the row in the
"Second table" - I'm afraid this process will be extremely heavy on the
server.
D. DELETE the "Main table" rows.
E. With CREATE TEMPORARY TABLE AS, I'll create "Temp table2", which will
contain all of the information of the "Temp table" plus a subquery that
retrieves the ID from the "Second table" for the foreign key - quite a
heavy process, I suspect.
F. INSERT the data from "Temp table2" into the "Main table".
G. End the transaction and quit PL/pgSQL. }

3. Delete all the files.
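
To show what I mean, here is a rough SQL sketch of steps 1 and 2 using
the placeholder schemas above (the file path and the function name
refresh_tables are made up). A few liberties: step C is written as one
set-oriented INSERT ... SELECT DISTINCT instead of a per-row check,
steps E and F are folded into a single INSERT ... SELECT so "Temp
table2" isn't strictly needed, and since the "Main table" references the
"Second table" its rows are deleted first. A PL/pgSQL function body
already runs as a single transaction, so A and G don't appear
explicitly.

  -- Step 1: bulk-load each data file into the staging table (use psql's
  -- \copy instead if the file sits on the client side).
  COPY temp_table (item_id, descr, field1, field2)
      FROM '/path/to/datafile1.txt';

  -- Step 2 (A-G):
  CREATE OR REPLACE FUNCTION refresh_tables() RETURNS void AS $$
  BEGIN
      -- D, then B: main_table rows point at second_table rows, so they
      -- have to be removed first.
      DELETE FROM main_table;
      DELETE FROM second_table;

      -- C: insert every distinct pair of values found in the new data.
      INSERT INTO second_table (field1, field2)
          SELECT DISTINCT field1, field2 FROM temp_table;

      -- E + F: join the new data to second_table to pick up the foreign
      -- key and load the result straight into main_table.
      INSERT INTO main_table (item_id, descr, second_id)
          SELECT t.item_id, t.descr, s.id
          FROM temp_table t
          JOIN second_table s
            ON s.field1 = t.field1 AND s.field2 = t.field2;
  END;
  $$ LANGUAGE plpgsql;

  -- After all the files are loaded:
  SELECT refresh_tables();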

Thanks a lot in advance, and again I'm sorry for the length of the mail :)

Ben-Nes Yonatan
Canaan Surfing ltd.
http://www.canaan.net.il

