Thread: bulk loading table via join of 2 large staging tables


bulk loading table via join of 2 large staging tables

From: Seb
Date:
Hi,

I have two large CSV files that need to be merged and loaded into a
single table in a PostgreSQL 9.3 database.  I thought I'd do this by
first staging the data from these files in two temporary tables:

---<--------------------cut here---------------start------------------->---
CREATE TEMPORARY TABLE rmc (
    utc_year character varying(6),
    utc_month character varying(4),
    utc_day character varying(4),
    utc_hour character varying(4),
    utc_minute character varying(4),
    utc_second character varying(8),
    latitude numeric,
    longitude numeric,
    sog numeric,
    cog numeric);

CREATE TEMPORARY TABLE gyro (
    utc_year character varying(6),
    utc_month character varying(4),
    utc_day character varying(4),
    utc_hour character varying(4),
    utc_minute character varying(4),
    utc_second character varying(8),
    heading numeric);
---<--------------------cut here---------------end--------------------->---

And the target table in the database looks like this:

---<--------------------cut here---------------start------------------->---
                                                   Table "public.navigation_series"
        Column        |            Type             |                                    Modifiers
               

----------------------+-----------------------------+----------------------------------------------------------------------------------
 navigation_record_id | integer                     | not null default
nextval('navigation_series_navigation_record_id_seq'::regclass)
 project_id           | integer                     |
 platform_id          | integer                     |
 supplier_id          | integer                     |
 time                 | timestamp without time zone | not null
 longitude            | numeric                     |
 latitude             | numeric                     |
 speed_over_ground    | numeric                     |
 course_over_ground   | numeric                     |
 heading              | numeric                     |
Indexes:
    "navigation_series_pkey" PRIMARY KEY, btree (navigation_record_id)
    "navigation_series_project_id_platform_id_supplier_id_time_key" UNIQUE CONSTRAINT, btree (project_id, platform_id,
supplier_id,"time") 
Foreign-key constraints:
    "navigation_project_id_fkey" FOREIGN KEY (project_id) REFERENCES projects(project_id) ON UPDATE CASCADE ON DELETE
RESTRICT
    "navigation_series_platform_id_fkey" FOREIGN KEY (platform_id) REFERENCES platforms(platform_id) ON UPDATE CASCADE
ONDELETE RESTRICT 
    "navigation_series_supplier_id_fkey" FOREIGN KEY (supplier_id) REFERENCES suppliers(supplier_id) ON UPDATE CASCADE
ONDELETE RESTRICT 
---<--------------------cut here---------------end--------------------->---

Loading the temporary tables was very quick (about 3 min; the input
files were 580 MB and 3.5 GB) in psql, using:

\copy gyro FROM 'gyro.csv' CSV
\copy rmc FROM 'rmc.csv' CSV

I then created a temporary view with:

CREATE TEMPORARY VIEW rmc_gyro AS
SELECT DISTINCT ON (project_id, platform_id, supplier_id, "time")
    2 AS project_id,
    1 AS platform_id,
    6 AS supplier_id,
    (utc_year || '-' || utc_month || '-' || utc_day || ' ' ||
    utc_hour || ':' || utc_minute || ':' || utc_second)::timestamp AS "time",
    longitude, latitude,
    sog AS speed_over_ground,
    cog AS course_over_ground,
    heading
FROM rmc
FULL JOIN gyro USING (utc_year, utc_month, utc_day, utc_hour, utc_minute, utc_second)
ORDER BY project_id, platform_id, supplier_id, "time";

But at this point even selecting just a few rows of data from the view
is too slow (I haven't seen any output after many hours).

Given that the process involves a full join, I'm not sure I can do this
in chunks (say breaking down the files into smaller pieces). Any
suggestions would be greatly appreciated.

Cheers,

--
Seb


Re: bulk loading table via join of 2 large staging tables

From: David Johnston
Date:
Quick thoughts:

On both tables:

Convert your date-time varchar fields into a single epoch/integer field.
Create an index on that epoch/integer field.
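
In SQL terms that might look something like this (a sketch; the
utc_epoch column and index names here are illustrative, not from the
thread):

ALTER TABLE rmc ADD COLUMN utc_epoch bigint;
UPDATE rmc
   SET utc_epoch = extract(epoch FROM
       (utc_year || '-' || utc_month || '-' || utc_day || ' ' ||
        utc_hour || ':' || utc_minute || ':' || utc_second)::timestamp)::bigint;
CREATE INDEX rmc_utc_epoch_idx ON rmc (utc_epoch);
-- ...and likewise for gyro, after which the two tables can be joined
-- on utc_epoch instead of the six varchar columns.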

David J.


Re: bulk loading table via join of 2 large staging tables

From: Andrew Sullivan
Date:
On Mon, Dec 30, 2013 at 07:53:06PM -0600, Seb wrote:
> Given that the process involves a full join, I'm not sure I can do this
> in chunks (say breaking down the files into smaller pieces). Any
> suggestions would be greatly appreciated.

First, what I would probably do is merge the two files outside
Postgres, run sort|uniq > outfile over the whole thing, and then just
bulk import that.  It's probably going to be faster.  But if the files
are too big, you'll run out of memory.

Are there possibly duplicates _within_ each file, or just between them?

If not within, then load the first file into the target table (well,
probably with the staging table just so you can get the timestamp
sorted out), then create the staging table as you suggest for the
second file, but create some indexes and do a WHERE NOT EXISTS to get
just the subset from that second table.  (This might be faster if you
update the staging table with the timestamp first, then create the
relevant multicolumn index.)

If there are dupes within the file, you can do the same thing except
that in the first step, you do SELECT DISTINCT instead.  Again, I
suspect a multicolumn index is going to be your friend.
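
A rough sketch of that second approach in SQL (assuming the rmc rows
were loaded into navigation_series first, and that a timer timestamp
column has been added to the staging table as in Brent Wood's reply
below; the constants match the view definition above):

CREATE INDEX gyro_timer_idx ON gyro (timer);

INSERT INTO navigation_series
    (project_id, platform_id, supplier_id, "time", heading)
SELECT DISTINCT 2, 1, 6, g.timer, g.heading  -- DISTINCT drops dupes within the file
FROM gyro g
WHERE NOT EXISTS (                           -- skip times already loaded from rmc
    SELECT 1
    FROM navigation_series n
    WHERE n.project_id = 2
      AND n.platform_id = 1
      AND n.supplier_id = 6
      AND n."time" = g.timer);

The target's existing unique index on (project_id, platform_id,
supplier_id, "time") is the multicolumn index that serves the NOT
EXISTS probe here.  (Timestamps present in both files would still need
an UPDATE to fill in heading on the already-loaded rows.)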

Anyway, those are two ways I've done this sort of thing in the past.

Best regards,

A

--
Andrew Sullivan
ajs@crankycanuck.ca


Re: bulk loading table via join of 2 large staging tables

From: Brent Wood
Date:
This should help...

In each temporary table convert the time parts to a timestamp, then create an index on each of these, then join on the
timestamp.

ALTER table rmc add column timer timestamp without time zone;
UPDATE rmc set timer = (utc_year || '-' || utc_month || '-' || utc_day || ' ' ||
        utc_hour || ':' || utc_minute || ':' || utc_second)::timestamp;
CREATE index rmc_timer_idx on rmc(timer);

ALTER table gyro add column timer timestamp without time zone;
UPDATE gyro set timer =  (utc_year || '-' || utc_month || '-' || utc_day || ' ' ||
        utc_hour || ':' || utc_minute || ':' || utc_second)::timestamp;
CREATE index gyro_timer_idx on gyro(timer);

So something like this should work if you use PostGIS, which I recommend for GPS data:

SELECT DISTINCT ON (project_id, platform_id, supplier_id, timer)
    2 AS project_id,
    1 AS platform_id,
    6 AS supplier_id,
    m.timer,
    m.latitude,
    m.longitude,
    ST_SetSRID(ST_MAKEPOINT(m.longitude, m.latitude),4326) as location,
    m.sog AS speed_over_ground,
    m.cog AS course_over_ground,
    g.heading
FROM rmc m, gyro g
WHERE m.timer = g.timer;
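
To perform the actual load, that SELECT can feed an INSERT directly.
A sketch (leaving out the location column, since navigation_series as
shown above has no such column; note that, like the query above, this
inner join keeps only timestamps present in both staging tables,
unlike the FULL JOIN in the original view):

INSERT INTO navigation_series
    (project_id, platform_id, supplier_id, "time", longitude, latitude,
     speed_over_ground, course_over_ground, heading)
SELECT DISTINCT ON (m.timer)
       2, 1, 6, m.timer, m.longitude, m.latitude, m.sog, m.cog, g.heading
FROM rmc m, gyro g
WHERE m.timer = g.timer
ORDER BY m.timer;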

One comment: if either table has times recorded at better than 1-second
precision (i.e. more than one value per second), you might join with the
avg() value and GROUP BY to bring the output down to 1-second values.
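
For instance, a sketch of that averaging step (date_trunc collapses
sub-second timestamps to whole seconds; be aware that a naive avg() of
compass headings misbehaves where values wrap around 0/360):

SELECT date_trunc('second', timer) AS timer,
       avg(heading) AS heading
FROM gyro
GROUP BY date_trunc('second', timer);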


Cheers

Brent Wood

Programme leader: Environmental Information Delivery
NIWA
DDI:  +64 (4) 3860529

Re: bulk loading table via join of 2 large staging tables

From: Seb
Date:
On Tue, 31 Dec 2013 02:23:30 +0000,
Brent Wood <Brent.Wood@niwa.co.nz> wrote:

> This should help...  In each temporary table convert the time parts to
> a timestamp, then create an index on each of these, then join on the
> timestamp.

[...]

Thank you, these were very useful suggestions, and they led to very
efficient table loading.

Cheers,

--
Seb