Discussion: large numbers of inserts out of memory strategy


large numbers of inserts out of memory strategy

From:
Ted Toth
Date:
I'm writing a migration utility to move data from non-rdbms data
source to a postgres db. Currently I'm generating SQL INSERT
statements involving 6 related tables for each 'thing'. With 100k or
more 'things' to migrate I'm generating a lot of statements and when I
try to import using psql postgres fails with 'out of memory' when
running on a Linux VM with 4G of memory. If I break into smaller
chunks, say ~50K statements, then the import succeeds. I can change my
migration utility to generate multiple files each with a limited
number of INSERTs to get around this issue but maybe there's
another/better way?

Ted
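
As a rough illustration of the chunked-file workaround described above (the
function name, file naming and the assumption that the utility can yield one
SQL statement at a time are all hypothetical):

    def write_chunked_sql(statements, chunk_size=50000, prefix="migration"):
        """Write generated statements into numbered .sql files, chunk_size per file."""
        out, count, chunk = None, 0, 0
        for stmt in statements:
            if out is None or count >= chunk_size:
                if out is not None:
                    out.close()
                chunk += 1
                count = 0
                out = open("%s_%04d.sql" % (prefix, chunk), "w")
            out.write(stmt.rstrip().rstrip(";") + ";\n")
            count += 1
        if out is not None:
            out.close()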


Re: large numbers of inserts out of memory strategy

From:
Rob Sargent
Date:
> On Nov 28, 2017, at 10:17 AM, Ted Toth <txtoth@gmail.com> wrote:
> 
> I'm writing a migration utility to move data from non-rdbms data
> source to a postgres db. Currently I'm generating SQL INSERT
> statements involving 6 related tables for each 'thing'. With 100k or
> more 'things' to migrate I'm generating a lot of statements and when I
> try to import using psql postgres fails with 'out of memory' when
> running on a Linux VM with 4G of memory. If I break into smaller
> chunks, say ~50K statements, then the import succeeds. I can change my
> migration utility to generate multiple files each with a limited
> number of INSERTs to get around this issue but maybe there's
> another/better way?
> 
> Ted
> 
what tools / languages are you using?


Re: large numbers of inserts out of memory strategy

From:
Tomas Vondra
Date:
Hi,

On 11/28/2017 06:17 PM, Ted Toth wrote:
> I'm writing a migration utility to move data from non-rdbms data
> source to a postgres db. Currently I'm generating SQL INSERT
> statements involving 6 related tables for each 'thing'. With 100k or
> more 'things' to migrate I'm generating a lot of statements and when I
> try to import using psql postgres fails with 'out of memory' when
> running on a Linux VM with 4G of memory. If I break into smaller
> chunks, say ~50K statements, then the import succeeds. I can change my
> migration utility to generate multiple files each with a limited
> number of INSERTs to get around this issue but maybe there's
> another/better way?
> 

The question is what exactly runs out of memory, and how did you modify
the configuration (particularly related to memory).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: large numbers of inserts out of memory strategy

From:
Ted Toth
Date:
On Tue, Nov 28, 2017 at 11:19 AM, Rob Sargent <robjsargent@gmail.com> wrote:
>
>> On Nov 28, 2017, at 10:17 AM, Ted Toth <txtoth@gmail.com> wrote:
>>
>> I'm writing a migration utility to move data from non-rdbms data
>> source to a postgres db. Currently I'm generating SQL INSERT
>> statements involving 6 related tables for each 'thing'. With 100k or
>> more 'things' to migrate I'm generating a lot of statements and when I
>> try to import using psql postgres fails with 'out of memory' when
>> running on a Linux VM with 4G of memory. If I break into smaller
>> chunks, say ~50K statements, then the import succeeds. I can change my
>> migration utility to generate multiple files each with a limited
>> number of INSERTs to get around this issue but maybe there's
>> another/better way?
>>
>> Ted
>>
> what tools / languages are you using?

I'm using python to read binary source files and create the text files
containing the SQL. Then I'm running psql -f <file containing SQL>.


Re: large numbers of inserts out of memory strategy

From:
Ted Toth
Date:
On Tue, Nov 28, 2017 at 11:22 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Hi,
>
> On 11/28/2017 06:17 PM, Ted Toth wrote:
>> I'm writing a migration utility to move data from non-rdbms data
>> source to a postgres db. Currently I'm generating SQL INSERT
>> statements involving 6 related tables for each 'thing'. With 100k or
>> more 'things' to migrate I'm generating a lot of statements and when I
>> try to import using psql postgres fails with 'out of memory' when
>> running on a Linux VM with 4G of memory. If I break into smaller
>> chunks, say ~50K statements, then the import succeeds. I can change my
>> migration utility to generate multiple files each with a limited
>> number of INSERTs to get around this issue but maybe there's
>> another/better way?
>>
>
> The question is what exactly runs out of memory, and how did you modify
> the configuration (particularly related to memory).
>
> regards
>
> --
> Tomas Vondra                  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

I'm pretty new to postgres so I haven't changed any configuration
setting and the log is a bit hard for me to make sense of :(

Attachments

Re: large numbers of inserts out of memory strategy

From:
Rob Sargent
Date:
On 11/28/2017 10:50 AM, Ted Toth wrote:
> On Tue, Nov 28, 2017 at 11:19 AM, Rob Sargent <robjsargent@gmail.com> wrote:
>>> On Nov 28, 2017, at 10:17 AM, Ted Toth <txtoth@gmail.com> wrote:
>>>
>>> I'm writing a migration utility to move data from non-rdbms data
>>> source to a postgres db. Currently I'm generating SQL INSERT
>>> statements involving 6 related tables for each 'thing'. With 100k or
>>> more 'things' to migrate I'm generating a lot of statements and when I
>>> try to import using psql postgres fails with 'out of memory' when
>>> running on a Linux VM with 4G of memory. If I break into smaller
>>> chunks, say ~50K statements, then the import succeeds. I can change my
>>> migration utility to generate multiple files each with a limited
>>> number of INSERTs to get around this issue but maybe there's
>>> another/better way?
>>>
>>> Ted
>>>
>> what tools / languages are you using?
> I'm using python to read binary source files and create the text files
> containing the SQL. Then I'm running psql -f <file containing SQL>.
If you're going out to the file system, I would use COPY of csv files 
(if number of records per table is non-trivial).  Any bulk loading 
python available?
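
For illustration, a psycopg2 copy_from call along the lines Rob suggests might
look like the sketch below; the connection string, file name and column list
are assumptions (the columns mirror the 'thing' table Ted posts later in the
thread):

    import psycopg2

    conn = psycopg2.connect("dbname=migrate")   # hypothetical connection string
    cur = conn.cursor()
    # Tab-separated file, one row per line, columns in the order given below.
    with open("thing.tsv") as f:
        cur.copy_from(f, "thing", sep="\t",
                      columns=("ltn", "classification", "machine", "source",
                               "thgrec", "flags", "serial", "type"))
    conn.commit()
    conn.close()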



Re: large numbers of inserts out of memory strategy

From:
Tomas Vondra
Date:

On 11/28/2017 06:54 PM, Ted Toth wrote:
> On Tue, Nov 28, 2017 at 11:22 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Hi,
>>
>> On 11/28/2017 06:17 PM, Ted Toth wrote:
>>> I'm writing a migration utility to move data from non-rdbms data
>>> source to a postgres db. Currently I'm generating SQL INSERT
>>> statements involving 6 related tables for each 'thing'. With 100k or
>>> more 'things' to migrate I'm generating a lot of statements and when I
>>> try to import using psql postgres fails with 'out of memory' when
>>> running on a Linux VM with 4G of memory. If I break into smaller
>>> chunks, say ~50K statements, then the import succeeds. I can change my
>>> migration utility to generate multiple files each with a limited
>>> number of INSERTs to get around this issue but maybe there's
>>> another/better way?
>>>
>>
>> The question is what exactly runs out of memory, and how did you modify
>> the configuration (particularly related to memory).
>>
>> regards
>>
>> --
>> Tomas Vondra                  http://www.2ndQuadrant.com
>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> 
> I'm pretty new to postgres so I haven't changed any configuration
> setting and the log is a bit hard for me to make sense of :(
> 

The most interesting part of the log is this:
    SPI Proc: 2464408024 total in 279 blocks; 1672 free (1 chunks); 2464406352 used
      PL/pgSQL function context: 537911352 total in 74 blocks; 2387536 free (4 chunks); 535523816 used


That is, most of the memory is allocated for SPI (2.4GB) and PL/pgSQL
procedure (500MB). How do you do the load? What libraries/drivers?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: large numbers of inserts out of memory strategy

From:
Steven Lembark
Date:
On Tue, 28 Nov 2017 11:17:07 -0600
Ted Toth <txtoth@gmail.com> wrote:

> I'm writing a migration utility to move data from non-rdbms data
> source to a postgres db. Currently I'm generating SQL INSERT
> statements involving 6 related tables for each 'thing'. With 100k or
> more 'things' to migrate I'm generating a lot of statements and when I
> try to import using psql postgres fails with 'out of memory' when
> running on a Linux VM with 4G of memory. If I break into smaller
> chunks, say ~50K statements, then the import succeeds. I can change my
> migration utility to generate multiple files each with a limited
> number of INSERTs to get around this issue but maybe there's
> another/better way?

Chunking the loads has a number of advantages other than avoiding
OOM errors, not the least of which are possibly parallel loading
and being able to restart after other failures without re-processing
all of the input data. 

Note that 4GiB of core is not all that much by today's standards.
You might want to run top/vmstat and ask if the PG daemons are using
all/most of the available memory. If PG is sucking up all of the core
as it is then tuning the database may not have much of an effect; if
there is lots of spare memory then it'll be worth looking at ways to
tune PG.

Note also that "out of memory" frequently means virtual memory.

Q: Does the VM have swap configured?

If not then add 8GiB and see if that solves your problem; if so then 
how much swap is in use when you get the OOM error?


-- 
Steven Lembark                                       1505 National Ave
Workhorse Computing                                 Rockford, IL 61103
lembark@wrkhors.com                                    +1 888 359 3508


Re: large numbers of inserts out of memory strategy

From:
Ted Toth
Date:
On Tue, Nov 28, 2017 at 12:04 PM, Steven Lembark <lembark@wrkhors.com> wrote:
> On Tue, 28 Nov 2017 11:17:07 -0600
> Ted Toth <txtoth@gmail.com> wrote:
>
>> I'm writing a migration utility to move data from non-rdbms data
>> source to a postgres db. Currently I'm generating SQL INSERT
>> statements involving 6 related tables for each 'thing'. With 100k or
>> more 'things' to migrate I'm generating a lot of statements and when I
>> try to import using psql postgres fails with 'out of memory' when
>> running on a Linux VM with 4G of memory. If I break into smaller
>> chunks, say ~50K statements, then the import succeeds. I can change my
>> migration utility to generate multiple files each with a limited
>> number of INSERTs to get around this issue but maybe there's
>> another/better way?
>
> Chunking the loads has a number of advantages other than avoiding
> OOM errors, not the least of which are possibly parallel loading
> and being able to restart after other failures without re-processing
> all of the input data.
>
> Note that 4GiB of core is not all that much by today's standards.
> You might want to run top/vmstat and ask if the PG daemons are using
> all/most of the available memory. If PG is sucking up all of the core
> as it is then tuning the database may not have much of an effect; if
> there is lots of spare memory then it'll be worth looking at ways to
> tune PG.
>
> Note also that "out of memory" frequently means virtual memory.
>
> Q: Does the VM have swap configured?
>
> If not then add 8GiB and see if that solves your problem; if so then
> how much swap is in use when you get the OOM error?
>
>
> --
> Steven Lembark                                       1505 National Ave
> Workhorse Computing                                 Rockford, IL 61103
> lembark@wrkhors.com                                    +1 888 359 3508
>

I understand that 4G is not much ... Yeah, in top I see the postmaster
process RES grow until it fails. The VM is basically a CentOS 6 box
with 4G of swap.


Re: large numbers of inserts out of memory strategy

From:
Ted Toth
Date:
On Tue, Nov 28, 2017 at 12:01 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
>
> On 11/28/2017 06:54 PM, Ted Toth wrote:
>> On Tue, Nov 28, 2017 at 11:22 AM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>> Hi,
>>>
>>> On 11/28/2017 06:17 PM, Ted Toth wrote:
>>>> I'm writing a migration utility to move data from non-rdbms data
>>>> source to a postgres db. Currently I'm generating SQL INSERT
>>>> statements involving 6 related tables for each 'thing'. With 100k or
>>>> more 'things' to migrate I'm generating a lot of statements and when I
>>>> try to import using psql postgres fails with 'out of memory' when
>>>> running on a Linux VM with 4G of memory. If I break into smaller
>>>> chunks, say ~50K statements, then the import succeeds. I can change my
>>>> migration utility to generate multiple files each with a limited
>>>> number of INSERTs to get around this issue but maybe there's
>>>> another/better way?
>>>>
>>>
>>> The question is what exactly runs out of memory, and how did you modify
>>> the configuration (particularly related to memory).
>>>
>>> regards
>>>
>>> --
>>> Tomas Vondra                  http://www.2ndQuadrant.com
>>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>>
>> I'm pretty new to postgres so I haven't changed any configuration
>> setting and the log is a bit hard for me to make sense of :(
>>
>
> The most interesting part of the log is this:
>
>     SPI Proc: 2464408024 total in 279 blocks; 1672 free (1 chunks); 2464406352 used
>       PL/pgSQL function context: 537911352 total in 74 blocks; 2387536 free (4 chunks); 535523816 used
>
>
> That is, most of the memory is allocated for SPI (2.4GB) and PL/pgSQL
> procedure (500MB). How do you do the load? What libraries/drivers?
>
> regards
>
> --
> Tomas Vondra                  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

I'm doing the load with 'psql -f'. I'm using the 9.6 el6 rpms (downloaded
from the postgres repo) on a CentOS VM.


Re: large numbers of inserts out of memory strategy

From:
Tomas Vondra
Date:
On 11/28/2017 07:26 PM, Ted Toth wrote:
> On Tue, Nov 28, 2017 at 12:01 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>>
>> ...
>>
>> That is, most of the memory is allocated for SPI (2.4GB) and PL/pgSQL
>> procedure (500MB). How do you do the load? What libraries/drivers?
>>
> 
> I'm doing the load with 'psql -f'. I'm using the 9.6 el6 rpms (downloaded
> from the postgres repo) on a CentOS VM.
> 

So what does the script actually do? Because psql certainly is not
running pl/pgsql procedures on its own. We need to understand why
you're getting OOM in the first place - just inserts alone should not
cause failures like that. Please give us a more detailed explanation of
what the load actually does, so that we can try reproducing it.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: large numbers of inserts out of memory strategy

From:
Brian Crowell
Date:
On Tue, Nov 28, 2017 at 12:38 PM, Tomas Vondra wrote:

> So what does the script actually do? Because psql certainly is not
> running pl/pgsql procedures on its own. We need to understand why
> you're getting OOM in the first place - just inserts alone should not
> cause failures like that. Please give us a more detailed explanation of
> what the load actually does, so that we can try reproducing it.

Perhaps the script is one giant insert statement?

—Brian

Re: large numbers of inserts out of memory strategy

From:
Tom Lane
Date:
Brian Crowell <brian@fluggo.com> writes:
> On Tue, Nov 28, 2017 at 12:38 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com
>> wrote:
>> So what does the script actually do? Because psql certainly is not
>> running pl/pgsql procedures on its own. We need to understand why
>> you're getting OOM in the first place - just inserts alone should not
>> cause failures like that. Please give us a more detailed explanation of
>> what the load actually does, so that we can try reproducing it.

> Perhaps the script is one giant insert statement?

It's pretty clear from the memory map that the big space consumption
is inside a single invocation of a plpgsql function:
    SPI Proc: 2464408024 total in 279 blocks; 1672 free (1 chunks); 2464406352 used
      PL/pgSQL function context: 537911352 total in 74 blocks; 2387536 free (4 chunks); 535523816 used

So whatever's going on here, there's more to it than a giant client-issued
INSERT (or COPY), or for that matter a large number of small ones.  What
would seem to be required is a many-megabyte-sized plpgsql function body
or DO block.

Actually, the truly weird thing about that map is that the "PL/pgSQL
function context" seems to be a child of a "SPI Proc" context, whereas
it's entirely clear from the code that it ought to be a direct child of
TopMemoryContext.  I have no idea how this state of affairs came to be,
and am interested to find out.
        regards, tom lane


Re: large numbers of inserts out of memory strategy

From:
Ted Toth
Date:
On Tue, Nov 28, 2017 at 9:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Brian Crowell <brian@fluggo.com> writes:
>> On Tue, Nov 28, 2017 at 12:38 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com
>>> wrote:
>>> So what does the script actually do? Because psql certainly is not
>>> running pl/pgsql procedures on its own. We need to understand why
>>> you're getting OOM in the first place - just inserts alone should not
>>> cause failures like that. Please give us a more detailed explanation of
>>> what the load actually does, so that we can try reproducing it.
>
>> Perhaps the script is one giant insert statement?
>
> It's pretty clear from the memory map that the big space consumption
> is inside a single invocation of a plpgsql function:
>
>     SPI Proc: 2464408024 total in 279 blocks; 1672 free (1 chunks); 2464406352 used
>       PL/pgSQL function context: 537911352 total in 74 blocks; 2387536 free (4 chunks); 535523816 used
>
> So whatever's going on here, there's more to it than a giant client-issued
> INSERT (or COPY), or for that matter a large number of small ones.  What
> would seem to be required is a many-megabyte-sized plpgsql function body
> or DO block.
>
> Actually, the truly weird thing about that map is that the "PL/pgSQL
> function context" seems to be a child of a "SPI Proc" context, whereas
> it's entirely clear from the code that it ought to be a direct child of
> TopMemoryContext.  I have no idea how this state of affairs came to be,
> and am interested to find out.
>
>                         regards, tom lane


Yes I did generate 1 large DO block:

DO $$
DECLARE thingid bigint; thingrec bigint; thingdataid bigint;
BEGIN
INSERT INTO thing
(ltn,classification,machine,source,thgrec,flags,serial,type) VALUES
('T007336','THING',0,1025,7336,7,'XXX869977564',1) RETURNING id,thgrec
INTO thingid,thingrec;
INSERT INTO recnum_thing (recnum,thing_id) VALUES (thingrec,thingid);
INSERT INTO thingstatus
(thing_id,nrpts,rmks_cs,force_type_id,ftn_cs,force_code,arr_cs,mask,toi_state,plot_id,signa_cs,lastchange,des_cs,rig_cs,ownship,correlation,maxrpts,rtn_cs,ctc_cs,group_mask,dep_cs)
VALUES (thingid,121,'{0,0,0,0}',440,0,23,0,0,0,'{23,-1,3803,3805,-1,-1,0,6}',0,1509459164,0,0,0,0,1000,0,0,0,0);
INSERT INTO thinger
(thing_id,spe_key,cse_unc,lat_spd,cov,dtg,lng,spd,ave_spd,cse,tol,nrpts,lat,alpha,sigma,spd_unc,lng_spd)
VALUES
(thingid,-1,0.0,-6.58197336634e-08,'{4.27624291532e-09,0.0,3.07802916488e-09,0.0,4.27624291532e-09,0.0,3.07802916488e-09,4.16110417234e-08,0.0,4.16110417234e-08}',1509459163,2.21596980095,0.000226273215958,1.0,0.0,0.10000000149,121,0.584555745125,10.0,4.23079740131e-08,0.0,-2.49999881907e-10);
INSERT INTO thingdata
(thing_id,category,db_num,xref,org_type,trademark,shortname,fcode,platform,callsign,type,orig_xref,shipclass,home_base,uic,service,di,lngfixed,hull,precision,alert,flag,besufx,name,mmsi,catcode,ntds,imo,pn_num,chxref,threat,sconum,latfixed,db_type,pif,echelon,jtn,quantity,overwrite)
VALUES
(thingid,'XXX','','','','','004403704','23','','','','','UNEQUATED','','','','',0.0,'','{0,0,0,0,0}','','KS','','UNKNOWN','004403704','','','','','','AFD','',0.0,3,'','',0,0,0)
RETURNING id INTO thingdataid;
INSERT INTO thingnum (thingdata_id,thgnum,state,dtg,cmd) VALUES
(thingdataid,'013086',0,1502970401,'FOO');

<repeated for each thing>

END $$;

Should I limit the number of 'thing' inserts within a DO block, or
wrap each 'thing' insert in its own DO block?
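
For what it's worth, the per-'thing' chunking asked about here could look
roughly like the generator sketch below, where each element of insert_blocks
is assumed to be the text of the INSERT statements for one 'thing' (the
replies that follow suggest dropping the DO block entirely):

    # Sketch: emit one small DO block per 'thing' so each block is parsed,
    # executed and freed on its own instead of as one giant statement.
    DO_TEMPLATE = """DO $$
    DECLARE thingid bigint; thingrec bigint; thingdataid bigint;
    BEGIN
    %s
    END $$;
    """

    def emit_do_blocks(insert_blocks, out):
        for block in insert_blocks:
            out.write(DO_TEMPLATE % block)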


Re: large numbers of inserts out of memory strategy

From:
Tom Lane
Date:
Ted Toth <txtoth@gmail.com> writes:
> On Tue, Nov 28, 2017 at 9:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So whatever's going on here, there's more to it than a giant client-issued
>> INSERT (or COPY), or for that matter a large number of small ones.  What
>> would seem to be required is a many-megabyte-sized plpgsql function body
>> or DO block.

> Yes I did generate 1 large DO block:

Apparently by "large" you mean "hundreds of megabytes".  Don't do that,
at least not on a machine that hasn't got hundreds of megabytes to spare.
The entire thing has to be sucked into memory and parsed before anything
will happen.
        regards, tom lane


Re: large numbers of inserts out of memory strategy

From:
Steven Lembark
Date:
> > what tools / languages are you using?
> 
> I'm using python to read binary source files and create the text files
> containing the SQL. Then I'm running psql -f <file containing SQL>.

Then chunking the input should be trivial.
There are a variety of techniques you can use, such as disabling
indexes during loading. Maybe load the data into temp tables and
then insert from the temps into the destination tables. The point is to
amortize the memory load over the entire load period.

-- 
Steven Lembark                                       1505 National Ave
Workhorse Computing                                 Rockford, IL 61103
lembark@wrkhors.com                                    +1 888 359 3508
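
A rough sketch of the temp-table staging mentioned above, using psycopg2
(the table names, data file and LIKE-based staging definition are
assumptions):

    import psycopg2

    conn = psycopg2.connect("dbname=migrate")   # hypothetical connection string
    cur = conn.cursor()
    # Stage the raw rows first, then move them in one set-based statement.
    cur.execute("CREATE TEMP TABLE thing_stage (LIKE thing INCLUDING DEFAULTS)")
    with open("thing.tsv") as f:                # tab-separated dump of all columns
        cur.copy_from(f, "thing_stage", sep="\t")
    cur.execute("INSERT INTO thing SELECT * FROM thing_stage")
    conn.commit()
    conn.close()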


Re: large numbers of inserts out of memory strategy

From:
Steven Lembark
Date:
> I'm pretty new to postgres so I haven't changed any configuration
> setting and the log is a bit hard for me to make sense of :(

Diving into the shark tank is a helluva way to learn how to swim :-)

Are you interested in finding docs on how to deal with the tuning?


-- 
Steven Lembark                                       1505 National Ave
Workhorse Computing                                 Rockford, IL 61103
lembark@wrkhors.com                                    +1 888 359 3508


Re: large numbers of inserts out of memory strategy

From:
Rory Campbell-Lange
Date:
On 28/11/17, Rob Sargent (robjsargent@gmail.com) wrote:
> 
> On 11/28/2017 10:50 AM, Ted Toth wrote:
> > On Tue, Nov 28, 2017 at 11:19 AM, Rob Sargent <robjsargent@gmail.com> wrote:
> > > > On Nov 28, 2017, at 10:17 AM, Ted Toth <txtoth@gmail.com> wrote:
> > > > 
> > > > I'm writing a migration utility to move data from non-rdbms data
> > > > source to a postgres db. Currently I'm generating SQL INSERT
> > > > statements involving 6 related tables for each 'thing'. With 100k or
> > > > more 'things' to migrate I'm generating a lot of statements and when I
> > > > try to import using psql postgres fails with 'out of memory' when
> > > > running on a Linux VM with 4G of memory. If I break into smaller
> > > > chunks, say ~50K statements, then the import succeeds. I can change my
> > > > migration utility to generate multiple files each with a limited
> > > > number of INSERTs to get around this issue but maybe there's
> > > > another/better way?

> > > what tools / languages are you using?

> > I'm using python to read binary source files and create the text files
> > containing the SQL. Then I'm running psql -f <file containing SQL>.

> If you're going out to the file system, I would use COPY of csv files (if
> number of records per table is non-trivial).  Any bulk loading python
> available?

psycopg2 has a copy_from function and (possibly pertinent in this case) a
copy_expert function which allows the read buffer size to be specified.
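
A minimal sketch of the copy_expert variant (the COPY statement, file name
and 64 KB buffer size are arbitrary choices for illustration):

    import psycopg2

    conn = psycopg2.connect("dbname=migrate")   # hypothetical connection string
    cur = conn.cursor()
    with open("thing.csv") as f:
        # copy_expert takes a full COPY ... FROM STDIN statement plus an
        # optional read buffer size.
        cur.copy_expert("COPY thing FROM STDIN WITH (FORMAT csv)", f,
                        size=64 * 1024)
    conn.commit()
    conn.close()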


Re: large numbers of inserts out of memory strategy

From:
"Peter J. Holzer"
Date:
On 2017-11-29 08:32:02 -0600, Ted Toth wrote:
> Yes I did generate 1 large DO block:
>
> DO $$
> DECLARE thingid bigint; thingrec bigint; thingdataid bigint;
> BEGIN
> INSERT INTO thing
> (ltn,classification,machine,source,thgrec,flags,serial,type) VALUES
> ('T007336','THING',0,1025,7336,7,'XXX869977564',1) RETURNING id,thgrec
> INTO thingid,thingrec;
> INSERT INTO recnum_thing (recnum,thing_id) VALUES (thingrec,thingid);
> INSERT INTO thingstatus
>
(thing_id,nrpts,rmks_cs,force_type_id,ftn_cs,force_code,arr_cs,mask,toi_state,plot_id,signa_cs,lastchange,des_cs,rig_cs,ownship,correlation,maxrpts,rtn_cs,ctc_cs,group_mask,dep_cs)
> VALUES (thingid,121,'{0,0,0,0}',440,0,23,0,0,0,'{23,-1,3803,3805,-1,-1,0,6}',0,1509459164,0,0,0,0,1000,0,0,0,0);
> INSERT INTO thinger
> (thing_id,spe_key,cse_unc,lat_spd,cov,dtg,lng,spd,ave_spd,cse,tol,nrpts,lat,alpha,sigma,spd_unc,lng_spd)
> VALUES
(thingid,-1,0.0,-6.58197336634e-08,'{4.27624291532e-09,0.0,3.07802916488e-09,0.0,4.27624291532e-09,0.0,3.07802916488e-09,4.16110417234e-08,0.0,4.16110417234e-08}',1509459163,2.21596980095,0.000226273215958,1.0,0.0,0.10000000149,121,0.584555745125,10.0,4.23079740131e-08,0.0,-2.49999881907e-10);
> INSERT INTO thingdata
>
(thing_id,category,db_num,xref,org_type,trademark,shortname,fcode,platform,callsign,type,orig_xref,shipclass,home_base,uic,service,di,lngfixed,hull,precision,alert,flag,besufx,name,mmsi,catcode,ntds,imo,pn_num,chxref,threat,sconum,latfixed,db_type,pif,echelon,jtn,quantity,overwrite)
> VALUES
(thingid,'XXX','','','','','004403704','23','','','','','UNEQUATED','','','','',0.0,'','{0,0,0,0,0}','','KS','','UNKNOWN','004403704','','','','','','AFD','',0.0,3,'','',0,0,0)
> RETURNING id INTO thingdataid;
> INSERT INTO thingnum (thingdata_id,thgnum,state,dtg,cmd) VALUES
> (thingdataid,'013086',0,1502970401,'FOO');
>
> <repeated for each thing>
>
> END $$;
>
> Should I limit the number of 'thing' inserts within a DO block, or
> wrap each 'thing' insert in its own DO block?

I would suggest getting rid of the do block entirely if that is
possible. Just create lots of insert statements. You can get the current
value of a sequence with currval('sequence_name').

Alternately or in addition, since you are using python, you might want
to insert directly into the database from python using psycopg2. For
separate insert statements that should have about the same performance.
(It is usually much faster to write to a csv file and load that with
copy than to insert each row, but you don't do that and it might be
difficult in your case).
       hp

--
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>
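
As a hedged illustration of the suggestion above, the generator could emit
plain INSERT statements and let currval() pick up the id assigned by the
preceding insert. The sequence name assumes the default thing_id_seq naming,
and no value quoting/escaping is done in this sketch:

    def emit_thing(out, ltn, thgrec, serial):
        # First insert lets the id column take its default from the sequence.
        out.write(
            "INSERT INTO thing (ltn,classification,machine,source,thgrec,flags,serial,type) "
            "VALUES ('%s','THING',0,1025,%d,7,'%s',1);\n" % (ltn, thgrec, serial))
        # The child row references the id just generated; no DO block or
        # variables are needed.
        out.write(
            "INSERT INTO recnum_thing (recnum, thing_id) "
            "VALUES (%d, currval('thing_id_seq'));\n" % thgrec)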

Re: large numbers of inserts out of memory strategy

From:
Ted Toth
Date:
On Thu, Nov 30, 2017 at 4:22 AM, Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
> On 2017-11-29 08:32:02 -0600, Ted Toth wrote:
>> Yes I did generate 1 large DO block:
>>
>> DO $$
>> DECLARE thingid bigint; thingrec bigint; thingdataid bigint;
>> BEGIN
>> INSERT INTO thing
>> (ltn,classification,machine,source,thgrec,flags,serial,type) VALUES
>> ('T007336','THING',0,1025,7336,7,'XXX869977564',1) RETURNING id,thgrec
>> INTO thingid,thingrec;
>> INSERT INTO recnum_thing (recnum,thing_id) VALUES (thingrec,thingid);
>> INSERT INTO thingstatus
>>
(thing_id,nrpts,rmks_cs,force_type_id,ftn_cs,force_code,arr_cs,mask,toi_state,plot_id,signa_cs,lastchange,des_cs,rig_cs,ownship,correlation,maxrpts,rtn_cs,ctc_cs,group_mask,dep_cs)
>> VALUES (thingid,121,'{0,0,0,0}',440,0,23,0,0,0,'{23,-1,3803,3805,-1,-1,0,6}',0,1509459164,0,0,0,0,1000,0,0,0,0);
>> INSERT INTO thinger
>> (thing_id,spe_key,cse_unc,lat_spd,cov,dtg,lng,spd,ave_spd,cse,tol,nrpts,lat,alpha,sigma,spd_unc,lng_spd)
>> VALUES
(thingid,-1,0.0,-6.58197336634e-08,'{4.27624291532e-09,0.0,3.07802916488e-09,0.0,4.27624291532e-09,0.0,3.07802916488e-09,4.16110417234e-08,0.0,4.16110417234e-08}',1509459163,2.21596980095,0.000226273215958,1.0,0.0,0.10000000149,121,0.584555745125,10.0,4.23079740131e-08,0.0,-2.49999881907e-10);
>> INSERT INTO thingdata
>>
(thing_id,category,db_num,xref,org_type,trademark,shortname,fcode,platform,callsign,type,orig_xref,shipclass,home_base,uic,service,di,lngfixed,hull,precision,alert,flag,besufx,name,mmsi,catcode,ntds,imo,pn_num,chxref,threat,sconum,latfixed,db_type,pif,echelon,jtn,quantity,overwrite)
>> VALUES
(thingid,'XXX','','','','','004403704','23','','','','','UNEQUATED','','','','',0.0,'','{0,0,0,0,0}','','KS','','UNKNOWN','004403704','','','','','','AFD','',0.0,3,'','',0,0,0)
>> RETURNING id INTO thingdataid;
>> INSERT INTO thingnum (thingdata_id,thgnum,state,dtg,cmd) VALUES
>> (thingdataid,'013086',0,1502970401,'FOO');
>>
>> <repeated for each thing>
>>
>> END $$;
>>
>> Should I limit the number of 'thing' inserts within a DO block, or
>> wrap each 'thing' insert in its own DO block?
>

Thanks for the specific suggestions.

> I would suggest getting rid of the do block entirely if that is
> possible. Just create lots of insert statements. You can get the current
> value of a sequence with currval('sequence_name').

What is the downside of using a DO block? I'd have to do a nextval on
each sequence before I could use currval, right? Or I could do 'select
last_value from <sequence>'.

One thing that is unclear to me is when commits occur while using psql.
Would you know where in the docs I can find information on this
subject?

>
> Alternately or in addition, since you are using python, you might want
> to insert directly into the database from python using psycopg2. For
> separate insert statements that should have about the same performance.
> (It is usually much faster to write to a csv file and load that with
> copy than to insert each row, but you don't do that and it might be
> difficult in your case).

Yes, I thought about generating csv files but didn't see a way to deal
with the foreign keys.

>
>         hp
>
> --
>    _  | Peter J. Holzer    | we build much bigger, better disasters now
> |_|_) |                    | because we have much more sophisticated
> | |   | hjp@hjp.at         | management tools.
> __/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>


Re: large numbers of inserts out of memory strategy

From:
"Peter J. Holzer"
Date:
On 2017-11-30 08:43:32 -0600, Ted Toth wrote:
> Date: Thu, 30 Nov 2017 08:43:32 -0600
> From: Ted Toth <txtoth@gmail.com>
> To: "Peter J. Holzer" <hjp-pgsql@hjp.at>
> Cc: pgsql-general@lists.postgresql.org
> Subject: Re: large numbers of inserts out of memory strategy
>
> On Thu, Nov 30, 2017 at 4:22 AM, Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
> > On 2017-11-29 08:32:02 -0600, Ted Toth wrote:
> >> Yes I did generate 1 large DO block:
> >>
> >> DO $$
> >> DECLARE thingid bigint; thingrec bigint; thingdataid bigint;
> >> BEGIN
> >> INSERT INTO thing
> >> (ltn,classification,machine,source,thgrec,flags,serial,type) VALUES
> >> ('T007336','THING',0,1025,7336,7,'XXX869977564',1) RETURNING id,thgrec
> >> INTO thingid,thingrec;
> >> INSERT INTO recnum_thing (recnum,thing_id) VALUES (thingrec,thingid);
> >> INSERT INTO thingstatus
> >>
(thing_id,nrpts,rmks_cs,force_type_id,ftn_cs,force_code,arr_cs,mask,toi_state,plot_id,signa_cs,lastchange,des_cs,rig_cs,ownship,correlation,maxrpts,rtn_cs,ctc_cs,group_mask,dep_cs)
> >> VALUES (thingid,121,'{0,0,0,0}',440,0,23,0,0,0,'{23,-1,3803,3805,-1,-1,0,6}',0,1509459164,0,0,0,0,1000,0,0,0,0);
[...]
> >>
> >> <repeated for each thing>
> >>
> >> END $$;
> >>
> >> Should I limit the number of 'thing' inserts within a DO block, or
> >> wrap each 'thing' insert in its own DO block?
> >
>
> Thanks for the specific suggestions.
>
> > I would suggest getting rid of the do block entirely if that is
> > possible. Just create lots of insert statements. You can get the current
> > value of a sequence with currval('sequence_name').
>
> What is the downside of using a DO block?

As you discovered, the downside is that the whole do block needs
to be parsed before it can be executed. If you generate a huge do block,
that can be an issue. If you create lots of little do blocks, the
overhead is probably not noticable.

But I'd like to pose the question the other way around: What are the
advantages of using a do block? If there aren't any, I wouldn't use it
(keep it simple). One advantage is that you can use variables. But I
don't think you need that. Other possible advantages could be speed or
transactions: I don't know if these exist.

> I'd have to do a nextval on each sequence before I could use currval,
> right?

That happens automatically if you insert the default value into a serial
column. You seem to do that here:

> >> INSERT INTO thing (ltn,classification,machine,source,thgrec,flags,serial,type)
> >> VALUES ('T007336','THING',0,1025,7336,7,'XXX869977564',1)
> >> RETURNING id,thgrec INTO thingid,thingrec;

The column id is probably defined as serial (or explicitly associated
with a sequence). So you can get the value you just inserted with
currval('thing_id_seq') (Check the table definition to get the sequence
name, but unless you explicitly specified the sequence name that's
it).

So you could ditch the "RETURNING ..." clause and replace the next
statement with:
   INSERT INTO recnum_thing (recnum, thing_id) VALUES (7336, currval('thing_id_seq'));

And similarly for the other statements where you use thingid and
thingdataid (thingrec is already known as far as I can see).

Note that this doesn't work if you need two values from the same
sequence. So you can't replace
    insert into thing(..) values(...) returning id as parent_id;
    insert into thing(..) values(...) returning id as child_id;
    insert into structure(parent, child) values(parent_id, child_id);

in the same way. But you don't seem to need anything like that.


> One thing that is unclear to me is when commits occur while using psql
> would you know where in the docs I can find information on this
> subject?

By default psql enables autocommit which causes an implicit commit after
every statement. With a do block I'm not sure whether that means after
the do block or after each statement within the do block. I'd just turn
autocommit off and add explicit commits wherever I wanted them.


> > Alternately or in addition, since you are using python, you might want
> > to insert directly into the database from python using psycopg2. For
> > separate insert statements that should have about the same performance.
> > (It is usually much faster to write to a csv file and load that with
> > copy than to insert each row, but you don't do that and it might be
> > difficult in your case).
>
> Yes, I thought about generating csv files but didn't see a way to deal
> with the foreign keys.

Still: Is there a reason why you use a python script to create an sql
script instead of directly issuing the sql queries from your python
script?
       hp

--
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>
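
One way to act on the explicit-commit advice from the generator side is
sketched below: wrap every batch of 'things' in BEGIN/COMMIT so the file
commits in batches when fed to psql (the batch size and the per_thing_sql
iterable are hypothetical):

    def write_batched(per_thing_sql, out, batch=1000):
        in_tx = False
        for i, sql in enumerate(per_thing_sql):
            if i % batch == 0:
                if in_tx:
                    out.write("COMMIT;\n")
                out.write("BEGIN;\n")
                in_tx = True
            out.write(sql)               # SQL text for one 'thing'
        if in_tx:
            out.write("COMMIT;\n")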

Re: large numbers of inserts out of memory strategy

From:
Christopher Browne
Date:
On 30 November 2017 at 05:22, Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
> On 2017-11-29 08:32:02 -0600, Ted Toth wrote:
>> Yes I did generate 1 large DO block:
>>
>> DO $$
>> DECLARE thingid bigint; thingrec bigint; thingdataid bigint;
>> BEGIN
>> INSERT INTO thing
>> (ltn,classification,machine,source,thgrec,flags,serial,type) VALUES
>> ('T007336','THING',0,1025,7336,7,'XXX869977564',1) RETURNING id,thgrec
>> INTO thingid,thingrec;
>> INSERT INTO recnum_thing (recnum,thing_id) VALUES (thingrec,thingid);
>> INSERT INTO thingstatus
>>
(thing_id,nrpts,rmks_cs,force_type_id,ftn_cs,force_code,arr_cs,mask,toi_state,plot_id,signa_cs,lastchange,des_cs,rig_cs,ownship,correlation,maxrpts,rtn_cs,ctc_cs,group_mask,dep_cs)
>> VALUES (thingid,121,'{0,0,0,0}',440,0,23,0,0,0,'{23,-1,3803,3805,-1,-1,0,6}',0,1509459164,0,0,0,0,1000,0,0,0,0);
>> INSERT INTO thinger
>> (thing_id,spe_key,cse_unc,lat_spd,cov,dtg,lng,spd,ave_spd,cse,tol,nrpts,lat,alpha,sigma,spd_unc,lng_spd)
>> VALUES
(thingid,-1,0.0,-6.58197336634e-08,'{4.27624291532e-09,0.0,3.07802916488e-09,0.0,4.27624291532e-09,0.0,3.07802916488e-09,4.16110417234e-08,0.0,4.16110417234e-08}',1509459163,2.21596980095,0.000226273215958,1.0,0.0,0.10000000149,121,0.584555745125,10.0,4.23079740131e-08,0.0,-2.49999881907e-10);
>> INSERT INTO thingdata
>>
(thing_id,category,db_num,xref,org_type,trademark,shortname,fcode,platform,callsign,type,orig_xref,shipclass,home_base,uic,service,di,lngfixed,hull,precision,alert,flag,besufx,name,mmsi,catcode,ntds,imo,pn_num,chxref,threat,sconum,latfixed,db_type,pif,echelon,jtn,quantity,overwrite)
>> VALUES
(thingid,'XXX','','','','','004403704','23','','','','','UNEQUATED','','','','',0.0,'','{0,0,0,0,0}','','KS','','UNKNOWN','004403704','','','','','','AFD','',0.0,3,'','',0,0,0)
>> RETURNING id INTO thingdataid;
>> INSERT INTO thingnum (thingdata_id,thgnum,state,dtg,cmd) VALUES
>> (thingdataid,'013086',0,1502970401,'FOO');
>>
>> <repeated for each thing>
>>
>> END $$;
>>
>> Should I limit the number of 'thing' inserts within a DO block, or
>> wrap each 'thing' insert in its own DO block?
>
> I would suggest getting rid of the do block entirely if that is
> possible. Just create lots of insert statements. You can get the current
> value of a sequence with currval('sequence_name').
>
> Alternately or in addition, since you are using python, you might want
> to insert directly into the database from python using psycopg2. For
> separate insert statements that should have about the same performance.
> (It is usually much faster to write to a csv file and load that with
> copy than to insert each row, but you don't do that and it might be
> difficult in your case).

Once upon a time, Slony-I used to use INSERT/UPDATE/DELETE to
do all of its work, so that the replication stream consisted of gigantic
numbers of INSERT/UPDATE/DELETE statements that had to be
parsed and processed.


--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: large numbers of inserts out of memory strategy

From:
Christopher Browne
Date:
On 30 November 2017 at 05:22, Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
> On 2017-11-29 08:32:02 -0600, Ted Toth wrote:
>> Yes I did generate 1 large DO block:
>>
>> DO $$
>> DECLARE thingid bigint; thingrec bigint; thingdataid bigint;
>> BEGIN
>> INSERT INTO thing
>> (ltn,classification,machine,source,thgrec,flags,serial,type) VALUES
>> ('T007336','THING',0,1025,7336,7,'XXX869977564',1) RETURNING id,thgrec
>> INTO thingid,thingrec;
>> INSERT INTO recnum_thing (recnum,thing_id) VALUES (thingrec,thingid);
>> INSERT INTO thingstatus
>>
(thing_id,nrpts,rmks_cs,force_type_id,ftn_cs,force_code,arr_cs,mask,toi_state,plot_id,signa_cs,lastchange,des_cs,rig_cs,ownship,correlation,maxrpts,rtn_cs,ctc_cs,group_mask,dep_cs)
>> VALUES (thingid,121,'{0,0,0,0}',440,0,23,0,0,0,'{23,-1,3803,3805,-1,-1,0,6}',0,1509459164,0,0,0,0,1000,0,0,0,0);
>> INSERT INTO thinger
>> (thing_id,spe_key,cse_unc,lat_spd,cov,dtg,lng,spd,ave_spd,cse,tol,nrpts,lat,alpha,sigma,spd_unc,lng_spd)
>> VALUES
(thingid,-1,0.0,-6.58197336634e-08,'{4.27624291532e-09,0.0,3.07802916488e-09,0.0,4.27624291532e-09,0.0,3.07802916488e-09,4.16110417234e-08,0.0,4.16110417234e-08}',1509459163,2.21596980095,0.000226273215958,1.0,0.0,0.10000000149,121,0.584555745125,10.0,4.23079740131e-08,0.0,-2.49999881907e-10);
>> INSERT INTO thingdata
>>
(thing_id,category,db_num,xref,org_type,trademark,shortname,fcode,platform,callsign,type,orig_xref,shipclass,home_base,uic,service,di,lngfixed,hull,precision,alert,flag,besufx,name,mmsi,catcode,ntds,imo,pn_num,chxref,threat,sconum,latfixed,db_type,pif,echelon,jtn,quantity,overwrite)
>> VALUES
(thingid,'XXX','','','','','004403704','23','','','','','UNEQUATED','','','','',0.0,'','{0,0,0,0,0}','','KS','','UNKNOWN','004403704','','','','','','AFD','',0.0,3,'','',0,0,0)
>> RETURNING id INTO thingdataid;
>> INSERT INTO thingnum (thingdata_id,thgnum,state,dtg,cmd) VALUES
>> (thingdataid,'013086',0,1502970401,'FOO');
>>
>> <repeated for each thing>
>>
>> END $$;
>>
>> Should I limit the number of 'thing' inserts within a DO block, or
>> wrap each 'thing' insert in its own DO block?
>
> I would suggest getting rid of the do block entirely if that is
> possible. Just create lots of insert statements. You can get the current
> value of a sequence with currval('sequence_name').
>
> Alternately or in addition, since you are using python, you might want
> to insert directly into the database from python using psycopg2. For
> separate insert statements that should have about the same performance.
> (It is usually much faster to write to a csv file and load that with
> copy than to insert each row, but you don't do that and it might be
> difficult in your case).

A long time ago (pre version 2.2), Slony-I used to do this, having a big
stream of INSERT, UPDATE, and DELETE statements.

This seemed pretty OK when tuples weren't very large, but we did get
some reports in from people with applications with very large tuples
that would run into out-of-memory conditions if a bunch of INSERTs
involving very large tuples were processed together.

The "fix" we initially did was to have the bulk processing stream
hold off on any Very Large Tuples, so that when we'd encounter
them, we'd process the big tuples one by one.  This worked
relatively OK, but meant that we'd be switching logic every time
there was big data, and there was a pretty painful amount of
configuration to force people to worry about.

The "2.2" change was to switch to COPY-based streaming.  In
our case, we put a trigger onto the single log table and have that
initiate tossing data as it came in into the appropriate target table.

That introduced an Amount Of Complexity (e.g. - a rather complex
stored function in C/SPI), but it's notable that we got a pretty big
performance boost as well as complete elimination of memory
allocation worries out of turning the process into COPY streaming.

It may be challenging to get a suitable set of COPY requests to do
what is necessary.  (It sure would be cool if it could just be one...)
But there are three crucial things I'd observe:
a) Loading data via COPY is *way* faster,
b) Parsing INSERT statements is a *big* slowdown,
c) Parsing INSERT statements means that those statements must
be drawn into memory, and that chews a lot of memory if the query
has very large attributes.  (COPY doesn't chew that memory!)

At one point, Jan Wieck was looking into an alternative COPY
protocol that would have allowed more actions, notably:

- You could specify multiple tables to stream into, and switch between them on a tuple by tuple basis.

- You could specify actions of INSERT, UPDATE, or DELETE.

It seemed like a really interesting idea at the time; it was intended
to be particularly useful for Slony, but some folks designing data
warehouse ETL systems observed that it would be useful to them,
too.
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: large numbers of inserts out of memory strategy

From:
Ted Toth
Date:
On Thu, Nov 30, 2017 at 11:40 AM, Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
> On 2017-11-30 08:43:32 -0600, Ted Toth wrote:
>> Date: Thu, 30 Nov 2017 08:43:32 -0600
>> From: Ted Toth <txtoth@gmail.com>
>> To: "Peter J. Holzer" <hjp-pgsql@hjp.at>
>> Cc: pgsql-general@lists.postgresql.org
>> Subject: Re: large numbers of inserts out of memory strategy
>>
>> On Thu, Nov 30, 2017 at 4:22 AM, Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
>> > On 2017-11-29 08:32:02 -0600, Ted Toth wrote:
>> >> Yes I did generate 1 large DO block:
>> >>
>> >> DO $$
>> >> DECLARE thingid bigint; thingrec bigint; thingdataid bigint;
>> >> BEGIN
>> >> INSERT INTO thing
>> >> (ltn,classification,machine,source,thgrec,flags,serial,type) VALUES
>> >> ('T007336','THING',0,1025,7336,7,'XXX869977564',1) RETURNING id,thgrec
>> >> INTO thingid,thingrec;
>> >> INSERT INTO recnum_thing (recnum,thing_id) VALUES (thingrec,thingid);
>> >> INSERT INTO thingstatus
>> >>
(thing_id,nrpts,rmks_cs,force_type_id,ftn_cs,force_code,arr_cs,mask,toi_state,plot_id,signa_cs,lastchange,des_cs,rig_cs,ownship,correlation,maxrpts,rtn_cs,ctc_cs,group_mask,dep_cs)
>> >> VALUES (thingid,121,'{0,0,0,0}',440,0,23,0,0,0,'{23,-1,3803,3805,-1,-1,0,6}',0,1509459164,0,0,0,0,1000,0,0,0,0);
> [...]
>> >>
>> >> <repeated for each thing>
>> >>
>> >> END $$;
>> >>
>> >> Should I limit the number of 'thing' inserts within a DO block, or
>> >> wrap each 'thing' insert in its own DO block?
>> >
>>
>> Thanks for the specific suggestions.
>>
>> > I would suggest getting rid of the do block entirely if that is
>> > possible. Just create lots of insert statements. You can get the current
>> > value of a sequence with currval('sequence_name').
>>
>> What is the downside of using a DO block?
>
> As you discovered, the downside is that the whole do block needs
> to be parsed before it can be executed. If you generate a huge do block,
> that can be an issue. If you create lots of little do blocks, the
> overhead is probably not noticable.
>
> But I'd like to pose the question the other way around: What are the
> advantages of using a do block? If there aren't any, I wouldn't use it
> (keep it simple). One advantage is that you can use variables. But I
> don't think you need that. Other possible advantages could be speed or
> transactions: I don't know if these exist.
>
>> I'd have to do a nextval on each sequence before I could use currval,
>> right?
>
> That happens automatically if you insert the default value into a serial
> column. You seem to do that here:
>
>> >> INSERT INTO thing (ltn,classification,machine,source,thgrec,flags,serial,type)
>> >> VALUES ('T007336','THING',0,1025,7336,7,'XXX869977564',1)
>> >> RETURNING id,thgrec INTO thingid,thingrec;
>
> The column id is probably defined as serial (or explicitely associated
> with a sequence). So you can get the value you just inserted with
> currval('thing_id_seq') (Check the table definition to get the sequence
> name, but unless you explicitely specified the sequence name that's
> it).
>
> So you could ditch the "RETURNING ..." clause and replace the next
> statement with:
>
>     INSERT INTO recnum_thing (recnum, thing_id) VALUES (7336, currval('thing_id_seq'));

I only did the DO to get the DECLARE of the variables so I'll change
it to use this method instead, thanks.

>
> And similar for the other statements where you use thingid and
> thingdataid (thingrec is already known as far as i can see).
>
> Note that this doesn't work if you need two values from the same
> sequence. So you can't replace
>
>     insert into thing(..) values(...) returning id as parent_id;
>     insert into thing(..) values(...) returning id as child_id;
>     insert into structure(parent, child) values(parent_id, child_id);
>
> in the same way. But you don't seem to need anything like that.
>
>
>> One thing that is unclear to me is when commits occur while using psql
>> would you know where in the docs I can find information on this
>> subject?
>
> By default psql enables autocommit which causes an implicit commit after
> every statement. With a do block I'm not sure whether that means after
> the do block or after each statement within the do block. I'd just turn
> autocommit off and add explicit commits wherever I wanted them.

So you think I can turn off autocommit and put BEGIN/COMMITs in a
large file and then postmaster won't have to parse the whole thing
when I feed it to it via psql? I did change my generator to create
multiple files with limited numbers of inserts. Somebody had mentioned
that by chunking it this way it would be easier to know when and where
something went wrong, which I thought was a good idea.

>
>
>> > Alternately or in addition, since you are using python, you might want
>> > to insert directly into the database from python using psycopg2. For
>> > separate insert statements that should have about the same performance.
>> > (It is usually much faster to write to a csv file and load that with
>> > copy than to insert each row, but you don't do that and it might be
>> > difficult in your case).
>>
>> Yes, I thought about generating csv files but didn't see a way to deal
>> with the foreign keys.
>
> Still: Is there a reason why you use a python script to create an sql
> script instead of directly issuing the sql queries from your python
> script?

I already had code that generated JSON so it was relatively easy to
add code and a cmd line arg to generate SQL instead. Theoretically
this will be a one time offline operation and performance is not a big
factor. Also my target machine will be significantly beefier.

>
>         hp
>
> --
>    _  | Peter J. Holzer    | we build much bigger, better disasters now
> |_|_) |                    | because we have much more sophisticated
> | |   | hjp@hjp.at         | management tools.
> __/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>


Re: large numbers of inserts out of memory strategy

From:
Steven Lembark
Date:
On Thu, 30 Nov 2017 08:43:32 -0600
Ted Toth <txtoth@gmail.com> wrote:

> What is the downside of using a DO block? I'd have to do a nextval on
> each sequence before I could use currval, right? Or I could do 'select
> last_value from <sequence>'.

You are creating a piece of code that has to be parsed, tokenized,
and compiled prior to execution. What's biting you is that you've
created a function the size of your dataset.

If you like do-blocks then write a short block to insert one record
using placeholders and call it a few zillion times.

That or (in DBI-speak):


    eval
    {
        $dbh->{ RaiseError  } = 1;
        $dbh->{ AutoCommit  } = 0;

        my $sth = $dbh->prepare
        (
            'insert into yourtable ( field1, field2 ) values ( $1, $2 )'
        );

        # @rows is assumed to hold one arrayref of values per record.
        $sth->execute( @$_ ) for @rows;

        $dbh->commit
    }
    or die "Failed execution: $@";

which will be nearly as efficient in the long run.

That or just import the data from a csv/tsv (there are good 
examples of data import available in the PG docs).

-- 
Steven Lembark                                       1505 National Ave
Workhorse Computing                                 Rockford, IL 61103
lembark@wrkhors.com                                    +1 888 359 3508
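
A rough psycopg2 counterpart to the DBI sketch above, with placeholder
connection string, table and rows:

    import psycopg2

    rows = [("a", 1), ("b", 2)]          # placeholder data

    conn = psycopg2.connect("dbname=migrate")
    with conn:                           # commits on success, rolls back on error
        with conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO yourtable (field1, field2) VALUES (%s, %s)",
                rows)
    conn.close()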


Re: large numbers of inserts out of memory strategy

From:
"Peter J. Holzer"
Date:
On 2017-11-30 14:27:58 -0600, Ted Toth wrote:
> On Thu, Nov 30, 2017 at 11:40 AM, Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
> > On 2017-11-30 08:43:32 -0600, Ted Toth wrote:
> >> One thing that is unclear to me is when commits occur while using psql
> >> would you know where in the docs I can find information on this
> >> subject?
> >
> > By default psql enables autocommit which causes an implicit commit after
> > every statement. With a do block I'm not sure whether that means after
> > the do block or after each statement within the do block. I'd just turn
> > autocommit off and add explicit commits wherever I wanted them.
>
> So you think I can turn off autocommit and put BEGIN/COMMITs in a
> large file and then postmaster won't have to parse the whole thing
> when I feed it to it via psql?

No, if you still had one giant do block, it would still have to be parsed
completely. I was thinking of the case where you have many small do
blocks, each with a few insert statements (one do block per "thing"). In
this case it would parse and execute each do block before moving on to
the next. But I wasn't sure whether the default autocommit would mean
one commit after each do block or one commit after each insert statement
(David answered this - thanks), and in any case you probably wouldn't
want to commit after each "thing", so I suggested turning autocommit off
and adding an explicit commit at the end or possibly after every nth
thing.


> > Still: Is there a reason why you use a python script to create an sql
> > script instead of directly issuing the sql queries from your python
> > script?
>
> I already had code that generated JSON so it was relatively easy to
> add code and a cmd line arg to generate SQL instead.

Ok. that sounds like a good reason.

        hp

--
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>