Thread: [NOVICE] Normalizing Unnormalized Input


[NOVICE] Normalizing Unnormalized Input

From:
Stephen Froehlich
Date:

I have been building a PostgreSQL database for about a year and (shame) when I first built it, I built it unnormalized.  I am in the process of normalizing it at the moment (thankfully it isn’t live for anyone but me, so I can freeze it until I get it all normalized).

 

The part of the problem that I haven’t solved conceptually yet is how to normalize the incoming data.  Historically, I have been pulling a csv into R, performing whatever cleaning is necessary, and then writing that unnormalized table straight out to PostgreSQL using the cumbersome and slow “caroline::dbWriteTable2” function.

 

I have seen:

 

https://stackoverflow.com/questions/23708278/how-to-normalize-data-efficently-while-inserting-into-sql-table-postgres

 

… but given the batch nature of my process, I was thinking of something a little different from this line-by-line function.

 

Instead, I was thinking about the following algorithm:

  • Have an empty, unindexed unnormalized staging table that R writes into in batches.
  • Have a function, triggered by an update to that table, that will:
    • Rename the table to a temp table name
    • Create a fresh blank table
    • “UPSERT” (INSERT INTO … ON CONFLICT DO NOTHING) into each of the foreign-key tables to capture any new values
    • Build indexes on each of the text fields of the unnormalized table, then do an INSERT INTO the main normalized table with a SELECT containing all of the proper joins
    • Drop the temp table
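In SQL, the core of those steps might look roughly like the following sketch (all table and column names here — raw_in, hosts, readings, hostname — are hypothetical placeholders, not the actual schema):

```sql
BEGIN;

-- capture any new dimension values (one statement per foreign-key table);
-- requires a unique constraint on hosts(hostname)
INSERT INTO hosts (hostname)
SELECT DISTINCT r.hostname
FROM raw_in r
ON CONFLICT (hostname) DO NOTHING;

-- write into the normalized table, joining text values back to their ids
INSERT INTO readings (host_id, ts, value)
SELECT h.host_id, r.ts, r.value
FROM raw_in r
JOIN hosts h USING (hostname);

-- clear the staging table for the next batch
TRUNCATE raw_in;

COMMIT;
```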

 

Does this sound like a good method?  How will it behave if I am doing two simultaneous writes to the unnormalized input table?

 

One thing I’m not sure of yet is how, inside a function, to keep the new temp table’s name in a variable, but I haven’t Googled that yet.
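(For the record, the usual tool for holding a table name in a variable inside a PL/pgSQL function is dynamic SQL via EXECUTE and format(); a minimal sketch, where raw_in is a placeholder staging-table name:)

```sql
DO $$
DECLARE
    -- generate a unique-ish name from the current timestamp
    tmp_name text := 'raw_in_' || to_char(clock_timestamp(), 'YYYYMMDDHH24MISSUS');
BEGIN
    -- rename the staging table out of the way under the generated name
    EXECUTE format('ALTER TABLE raw_in RENAME TO %I', tmp_name);
    -- recreate a fresh, empty staging table with the same definition
    EXECUTE format('CREATE TABLE raw_in (LIKE %I INCLUDING ALL)', tmp_name);
END
$$;
```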

 

I am running PostgreSQL 9.6 and given that I both benefit from parallelization and also don’t run anything “live”, I’ll likely be upgrading to PostgreSQL 10 in August when it is final. I do share the server with a couple of colleagues, but that’s all the coordination I need.

 

 

Stephen Froehlich
Sr. Strategist, CableLabs®


s.froehlich@cablelabs.com

Tel: +1 (303) 661-3708

 

Re: [NOVICE] Normalizing Unnormalized Input

From:
"David G. Johnston"
Date:
On Tue, Jun 20, 2017 at 3:50 PM, Stephen Froehlich
<s.froehlich@cablelabs.com> wrote:
> The part of the problem that I haven’t solved conceptually yet is how to
> normalize the incoming data.

The specifics of the data matter but...if at all possible I do something like:

BEGIN;
CREATE TEMP TABLE tt (LIKE t);   -- same shape as the target table t
COPY tt FROM STDIN;
-- insert new records into t from tt (one statement per target table)
INSERT INTO t
  SELECT tt.* FROM tt LEFT JOIN t USING (key) WHERE t.key IS NULL;
-- update existing records in t using tt (one statement per target table)
UPDATE t SET val = tt.val FROM tt WHERE t.key = tt.key;
COMMIT;   -- key/val are placeholder column names

I don't get why (or how) you'd "rename the table into a temp table"...

It's nice that we've added upsert, but it seems more useful for streaming
than for batch.  At scale you should try to avoid collisions in the
first place.

Temporary table names only need to be unique within the session.

The need for indexes on the temporary table is usually limited, since
the goal is to move large subsets of it around all at once.
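For instance, rather than indexing, you can refresh planner statistics on the freshly loaded table and check the plan of the set-based move (names here — readings, hosts, hostname — are placeholders):

```sql
-- after the COPY, give the planner statistics instead of building indexes
ANALYZE tt;

-- confirm the set-based insert gets a sensible plan (typically a hash join)
EXPLAIN
INSERT INTO readings (host_id, ts, value)
SELECT h.host_id, t.ts, t.value
FROM tt t
JOIN hosts h USING (hostname);
```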

David J.


Re: [NOVICE] Normalizing Unnormalized Input

From:
Stephen Froehlich
Date:
I hadn't even thought of temp tables being limited in scope ... that helps tremendously.

The main caveat is that it is a good idea to drop tt at the end as well.
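(If tt is only needed within a single transaction, it can even be made to clean itself up; a sketch, with placeholder columns:)

```sql
BEGIN;
-- the temp table disappears by itself at transaction end
CREATE TEMP TABLE tt (hostname text, ts timestamptz, value numeric)
    ON COMMIT DROP;
-- ... COPY tt FROM STDIN, then the INSERT/UPDATE statements ...
COMMIT;  -- tt is gone here; no explicit DROP needed
```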

As for "I don't get why" ... confused questions from novices like me are why this mailing list exists :) ... I clearly wasn't thinking right from a Postgres PoV.
 

One thing that baffles me: when performing the write to STDIN from R, does one just call a function as one would a table?
 

Once I get into it in the next couple of days, I'm sure I'll have specific code questions, but that is a big help.

Thanks again,
Stephen



Re: [NOVICE] Normalizing Unnormalized Input

From:
Joe Conway
Date:
On 06/20/2017 06:19 PM, Stephen Froehlich wrote:
> One thing that baffles me is that does one just call a function as one would a table when performing the write to STDIN from R?

Not sure I understand that question, but in any case you might want to
look at PL/R:

https://www.joeconway.com/doc/doc.html
https://github.com/postgres-plr/plr


--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development


Attachments

Re: [NOVICE] Normalizing Unnormalized Input

From:
Stephen Froehlich
Date:
One other follow-up question.

The incoming csv files range in size from 100k lines to 14 million.  Does it make sense to build indexes on the to-be-foreign-key columns of the temp table before doing the final set of joins to write to the primary table, or is it faster to leave them unindexed for this one-time use?
 

Some columns have as few as 2 or 3 values.

--Stephen
