Discussion: Using COPY to import large xml file
Hello Anto,

On 06/24/2018 05:25 PM, Anto Aravinth wrote:
> Hello Everyone,
>
> I have downloaded the Stackoverflow posts xml (contains all SO questions till
> date).. the file is around 70GB.. I wanna import the data in those xml to my
> table.. is there a way to do so in postgres?
>
>
> Thanks,
> Anto.
I used this tool :
https://github.com/Networks-Learning/stackexchange-dump-to-postgres
Regards,
--
Adrien NAYRAT
https://blog.anayrat.info
On 06/24/2018 06:07 PM, Anto Aravinth wrote:
> Thanks for the response. I'm not sure, how long does this tool takes for the
> 70GB data.

In my memory it took several hours. I can't remember whether the XML
conversion or the inserts took longer.

> I used node to stream the xml files into inserts.. which was very slow..
> Actually the xml contains 40 million records, out of which 10Million took
> around 2 hrs using nodejs. Hence, I thought will use COPY command, as
> suggested on the internet.
>
> Definitely, will try the code and let you know.. But looks like it uses the
> same INSERT, not copy.. interesting if it runs quick on my machine.

Yes, it uses INSERT; it should not be difficult to change the code to use
COPY instead.

--
Adrien NAYRAT
https://blog.anayrat.info
On 06/24/2018 08:25 AM, Anto Aravinth wrote:
> Hello Everyone,
>
> I have downloaded the Stackoverflow posts xml (contains all SO questions
> till date).. the file is around 70GB.. I wanna import the data in those
> xml to my table.. is there a way to do so in postgres?

It is going to require some work. You will need to deal with:

1) The row schema inside the XML, documented here (see posts.xml):
https://ia800107.us.archive.org/27/items/stackexchange/readme.txt

2) The rows are inside a <posts> tag.

Seems to me you have two options:

1) Drop each row into a single XML field and deal with extracting the row
components in the database.

2) Break down the row into column components before entering them into the
database.

Adrien has pointed you at a Python program that covers the above:
https://github.com/Networks-Learning/stackexchange-dump-to-postgres

If you are comfortable in Python you can take a look at:
https://github.com/Networks-Learning/stackexchange-dump-to-postgres/blob/master/row_processor.py
to see how the rows are broken down into elements.

I would try this out first on one of the smaller datasets found here:
https://archive.org/details/stackexchange

I personally took a look at:
https://archive.org/download/stackexchange/beer.stackexchange.com.7z
because why not?

--
Adrian Klaver
adrian.klaver@aklaver.com
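[Editor's note: to make the streaming idea concrete, here is a minimal sketch (not from the thread) of option 2 using only the Python standard library. `xml.etree.ElementTree.iterparse` walks the `<row>` elements one at a time, so a 70GB Posts.xml never has to fit in memory. The attribute names (`Id`, `Body`) are illustrative; the real set is defined in the readme.txt linked above.]

```python
# Sketch: stream <row> elements of a Stack Exchange dump without loading
# the whole file. Attribute names here are illustrative only.
import io
import xml.etree.ElementTree as ET

def iter_rows(xml_file):
    """Yield one attribute dict per <row> element, clearing each element
    after use so memory stays flat regardless of file size."""
    for event, elem in ET.iterparse(xml_file, events=("end",)):
        if elem.tag == "row":
            yield dict(elem.attrib)
            elem.clear()  # release the parsed element

# Tiny demonstration on an in-memory document:
sample = io.StringIO(
    '<posts><row Id="1" Body="hello"/><row Id="2" Body="world"/></posts>')
rows = list(iter_rows(sample))
```

Each yielded dict can then be fed to whatever loader you choose (INSERT batches or COPY).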
## Adrien Nayrat (adrien.nayrat@anayrat.info):
> I used this tool :
> https://github.com/Networks-Learning/stackexchange-dump-to-postgres

That will be awfully slow: this tool commits each INSERT on its own, see the
loop in
https://github.com/Networks-Learning/stackexchange-dump-to-postgres/blob/master/load_into_pg.py#L83

With only small changes - prepare the INSERT, execute it for all (or at
least a huge lot of) rows, COMMIT at the end - you can save quite a lot of
time (500 rows is not "a huge lot").

And when you do that, for heaven's sake, do not try to create the INSERT
statement as a string with the values - Bobby Tables will eat you. See the
psycopg documentation on how it's done (especially watch the warnings):
http://initd.org/psycopg/docs/usage.html#passing-parameters-to-sql-queries

On prepared statements with psycopg2, see
http://initd.org/psycopg/articles/2012/10/01/prepared-statements-psycopg/
(python makes stuff rather hard, but I'll spare you the snark and wait until
the language has matured to at least version 5).

Using the COPY protocol with psycopg2 seems to require some hoop-jumping,
but could improve matters even more.

Regards,
Christoph

--
Spare Space.
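[Editor's note: a sketch of the batching Christoph describes — parameters passed separately from the SQL string, many rows per round trip, one commit at the end. It assumes psycopg2 and an illustrative `posts (id, title, body)` table; neither is taken from the dump schema.]

```python
# Sketch: batched, parameterized inserts instead of commit-per-row.
def chunked(iterable, size):
    """Yield lists of at most `size` items; one multi-row INSERT per chunk
    keeps round trips (and commits) rare."""
    chunk = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

def load(rows, dsn, batch_size=10000):
    """Hypothetical loader: values travel as parameters, never interpolated
    into the SQL string, exactly as the psycopg2 docs warn."""
    import psycopg2
    from psycopg2.extras import execute_values
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for batch in chunked(rows, batch_size):
            execute_values(
                cur, "INSERT INTO posts (id, title, body) VALUES %s", batch)
        # the connection context manager commits once on exit
```

`execute_values` builds a single multi-row VALUES list per batch, which already removes most of the per-statement overhead before switching to COPY.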
Anto Aravinth <anto.aravinth.cse@gmail.com> writes:
> Thanks for the response. I'm not sure, how long does this tool takes for
> the 70GB data.
>
> I used node to stream the xml files into inserts.. which was very slow..
> Actually the xml contains 40 million records, out of which 10Million took
> around 2 hrs using nodejs. Hence, I thought will use COPY command, as
> suggested on the internet.
>
> Definitely, will try the code and let you know.. But looks like it uses the
> same INSERT, not copy.. interesting if it runs quick on my machine.
>
> On Sun, Jun 24, 2018 at 9:23 PM, Adrien Nayrat <adrien.nayrat@anayrat.info>
> wrote:
>
>> On 06/24/2018 05:25 PM, Anto Aravinth wrote:
>> > Hello Everyone,
>> >
>> > I have downloaded the Stackoverflow posts xml (contains all SO questions
>> till
>> > date).. the file is around 70GB.. I wanna import the data in those xml
>> to my
>> > table.. is there a way to do so in postgres?
>> >
>> >
>> > Thanks,
>> > Anto.
>>
>> Hello Anto,
>>
>> I used this tool :
>> https://github.com/Networks-Learning/stackexchange-dump-to-postgres
>>
If you are using nodejs, then you can easily use the pg-copy-streams
module to insert the records into your database. I've been using this
for inserting large numbers of records from NetCDF files. Takes between
40 to 50 minutes to insert 60 Million+ records and we are doing
additional calculations on the values, not just inserting them,
plus we are inserting into a database over the network and into a database which is
also performing other processing.
We found a significant speed improvement with COPY over blocks of insert
transactions, which was faster than just individual inserts. The only
downside with using COPY is that it either completely works or
completely fails and when it fails, it can be tricky to work out which
record is causing the failure. A benefit of using blocks of transactions
is that you have more fine grained control, allowing you to recover from
some errors or providing more specific detail regarding the cause of the
error.
Be wary of what indexes you're defining on your table. Depending on the
type and number, these can have significant impact on insert times as
well.
--
Tim Cross
On Mon, Jun 25, 2018 at 3:44 AM, Tim Cross <theophilusx@gmail.com> wrote:
Anto Aravinth <anto.aravinth.cse@gmail.com> writes:
> Thanks for the response. I'm not sure, how long does this tool takes for
> the 70GB data.
>
> I used node to stream the xml files into inserts.. which was very slow..
> Actually the xml contains 40 million records, out of which 10Million took
> around 2 hrs using nodejs. Hence, I thought will use COPY command, as
> suggested on the internet.
>
> Definitely, will try the code and let you know.. But looks like it uses the
> same INSERT, not copy.. interesting if it runs quick on my machine.
>
> On Sun, Jun 24, 2018 at 9:23 PM, Adrien Nayrat <adrien.nayrat@anayrat.info>
> wrote:
>
>> On 06/24/2018 05:25 PM, Anto Aravinth wrote:
>> > Hello Everyone,
>> >
>> > I have downloaded the Stackoverflow posts xml (contains all SO questions
>> till
>> > date).. the file is around 70GB.. I wanna import the data in those xml
>> to my
>> > table.. is there a way to do so in postgres?
>> >
>> >
>> > Thanks,
>> > Anto.
>>
>> Hello Anto,
>>
>> I used this tool :
>> https://github.com/Networks-Learning/stackexchange-dump-to-postgres
>>
If you are using nodejs, then you can easily use the pg-copy-streams
module to insert the records into your database. I've been using this
for inserting large numbers of records from NetCDF files. Takes between
40 to 50 minutes to insert 60 Million+ records and we are doing
additional calculations on the values, not just inserting them,
plus we are inserting into a database over the network and into a database which is
also performing other processing.
We found a significant speed improvement with COPY over blocks of insert
transactions, which was faster than just individual inserts. The only
downside with using COPY is that it either completely works or
completely fails and when it fails, it can be tricky to work out which
record is causing the failure. A benefit of using blocks of transactions
is that you have more fine grained control, allowing you to recover from
some errors or providing more specific detail regarding the cause of the
error.

Sure, let me try that.. I have a question here, COPY usually works when you
move data from files to your postgres instance, right? Now in node.js,
processing the whole file, can I use COPY programmatically, like COPY
Stackoverflow <calculated value at run time>? Because from the doc:

https://www.postgresql.org/docs/9.2/static/sql-copy.html

I don't see it's possible. Maybe I need to convert the files to something
COPY can understand first?

Anto.
## Anto Aravinth (anto.aravinth.cse@gmail.com):
> Sure, let me try that.. I have a question here, COPY usually works when you
> move data from files to your postgres instance, right? Now in node.js,
> processing the whole file, can I use COPY
> programmatically like COPY Stackoverflow <calculated value at run time>?
> Because from doc:
>
> https://www.postgresql.org/docs/9.2/static/sql-copy.html
>
> I don't see its possible. May be I need to convert the files to copy
> understandable first?

"COPY ... FROM STDIN"

    STDIN
        Specifies that input comes from the client application.

It's on the page...

Regards,
Christoph

--
Spare Space.
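[Editor's note: in Python this is what `COPY ... FROM STDIN` looks like from the client side — psycopg2's `copy_expert` reads any file-like object, so run-time-generated data can be fed through an `io.StringIO`. Table and column names below are illustrative.]

```python
# Sketch: render rows in COPY's default text format (tab-separated, one
# line per row) and feed them from memory instead of a file on disk.
import io

def rows_to_copy_buffer(rows):
    """Build a file-like buffer that copy_expert can consume."""
    buf = io.StringIO()
    for row in rows:
        buf.write("\t".join(str(col) for col in row) + "\n")
    buf.seek(0)
    return buf

buf = rows_to_copy_buffer([(1, "first post"), (2, "second post")])
# With a live psycopg2 cursor one would then run (names are assumptions):
#   cur.copy_expert("COPY posts (id, title) FROM STDIN", buf)
```

So no intermediate file is required; the client streams the data itself.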
Yes. Essentially what you do is create a stream and feed whatever
information you want to copy into that stream. PG sees the data as if it
was seeing each line in a file, so you push data onto the stream where each
item is separated by a tab (or whatever). Here is the basic low level
function I use (don't know how the formatting will go!)

On Mon, 25 Jun 2018 at 11:38, Anto Aravinth <anto.aravinth.cse@gmail.com> wrote:
> On Mon, Jun 25, 2018 at 3:44 AM, Tim Cross <theophilusx@gmail.com> wrote:
Anto Aravinth <anto.aravinth.cse@gmail.com> writes:
> Thanks for the response. I'm not sure, how long does this tool takes for
> the 70GB data.
>
> I used node to stream the xml files into inserts.. which was very slow..
> Actually the xml contains 40 million records, out of which 10Million took
> around 2 hrs using nodejs. Hence, I thought will use COPY command, as
> suggested on the internet.
>
> Definitely, will try the code and let you know.. But looks like it uses the
> same INSERT, not copy.. interesting if it runs quick on my machine.
>
> On Sun, Jun 24, 2018 at 9:23 PM, Adrien Nayrat <adrien.nayrat@anayrat.info>
> wrote:
>
>> On 06/24/2018 05:25 PM, Anto Aravinth wrote:
>> > Hello Everyone,
>> >
>> > I have downloaded the Stackoverflow posts xml (contains all SO questions
>> till
>> > date).. the file is around 70GB.. I wanna import the data in those xml
>> to my
>> > table.. is there a way to do so in postgres?
>> >
>> >
>> > Thanks,
>> > Anto.
>>
>> Hello Anto,
>>
>> I used this tool :
>> https://github.com/Networks-Learning/stackexchange-dump-to-postgres
>>
If you are using nodejs, then you can easily use the pg-copy-streams
module to insert the records into your database. I've been using this
for inserting large numbers of records from NetCDF files. Takes between
40 to 50 minutes to insert 60 Million+ records and we are doing
additional calculations on the values, not just inserting them,
plus we are inserting into a database over the network and into a database which is
also performing other processing.
We found a significant speed improvement with COPY over blocks of insert
transactions, which was faster than just individual inserts. The only
downside with using COPY is that it either completely works or
completely fails and when it fails, it can be tricky to work out which
record is causing the failure. A benefit of using blocks of transactions
is that you have more fine grained control, allowing you to recover from
some errors or providing more specific detail regarding the cause of the
error.

Sure, let me try that.. I have a question here, COPY usually works when you
move data from files to your postgres instance, right? Now in node.js,
processing the whole file, can I use COPY programmatically, like COPY
Stackoverflow <calculated value at run time>? Because from the doc I don't
see it's possible. Maybe I need to convert the files to something COPY can
understand first?

Anto.

// (requires, besides a pg connection pool with getClient/releaseClient:
//  const { Readable } = require("stream");
//  const copyFrom = require("pg-copy-streams").from;
//  const VError = require("verror");)

async function copyInsert(sql, stringifyFN, records) {
  const logName = `${moduleName}.copyInsert`;
  var client;

  assert.ok(Array.isArray(records), "The records arg must be an array");
  assert.ok(typeof(stringifyFN) === "function",
            "The stringifyFN arg must be a function");
  return getClient()
    .then(c => {
      client = c;
      return new Promise(function(resolve, reject) {
        var stream, rs;
        var idx = 0;

        function done() {
          releaseClient(client);
          client = undefined;
          resolve(idx + 1);
        }

        function onError(err) {
          if (client !== undefined) {
            releaseClient(client);
          }
          reject(new VError(err, `${logName}: COPY failed at record ${idx}`));
        }

        function arrayRead() {
          if (idx === records.length) {
            rs.push(null);
          } else {
            let rec = records[idx];
            rs.push(stringifyFN(rec));
            idx += 1;
          }
        }

        rs = new Readable;
        rs._read = arrayRead;
        rs.on("error", onError);
        stream = client.query(copyFrom(sql));
        stream.on("error", onError);
        stream.on("end", done);
        rs.pipe(stream);
      });
    })
    .catch(err => {
      throw new VError(err, `${logName} Failed COPY insert`);
    });
}

and I will call it like

copyInsert(sql, stringifyClimateRecord, records)

where the sql and stringifyClimateRecord arguments are

const sql = `COPY access_s.climate_data_ensemble_${ensemble} `
      + "(forecast_dt,awap_id,rain,temp_min,temp_max,rsds,"
      + "vprp_09,vprp_15,wind_speed) FROM STDIN";

function stringifyClimateRecord(rec) {
  return `'${rec[0].format("YYYY-MM-DD")}'\t${rec[2]}\t${rec[3]}\t`
    + `${rec[4]}\t${rec[5]}\t${rec[6]}\t${rec[7]}\t${rec[8]}\t${rec[9]}\n`;
}

The stringifyClimateRecord returns a record to be inserted as a 'line' into
the stream with values separated by tabs. Records is an array of data
records where each record is an array.

--
regards,

Tim

--
Tim Cross
Thanks a lot. But I did hit a lot of challenges! Looks like SO data contains
a lot of tabs within itself.. so the tab delimiter didn't work for me. I
thought I could give a special delimiter, but it looks like PostgreSQL COPY
allows only one character as the delimiter :(

Sad, I guess the only way is to insert, or do a thorough serialization of my
data into something that COPY can understand.
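[Editor's note: embedded tabs are not actually fatal to COPY's text format — the delimiter may occur inside a value as long as it is escaped. A sketch of the escaping COPY expects (backslash first, then tab/newline/carriage return; `\N` is the default NULL marker):]

```python
# Sketch: escape a value for COPY's default text format so embedded tabs
# and newlines survive. Order matters: escape backslashes first.
def escape_copy_text(value):
    if value is None:
        return "\\N"  # COPY's default NULL marker
    return (value.replace("\\", "\\\\")
                 .replace("\t", "\\t")
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))

# One output line for a row whose middle field contains a real tab:
line = "\t".join(escape_copy_text(v) for v in ["1", "a\ttab inside", None])
```

With values escaped this way, the tab separating columns is unambiguous even when the data itself contains tabs.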
2018-06-25 16:25 GMT+02:00 Anto Aravinth <anto.aravinth.cse@gmail.com>:
> Thanks a lot. But I do got lot of challenges! Looks like SO data contains
> lot of tabs within itself.. So tabs delimiter didn't work for me. I
> thought I can give a special demiliter but looks like Postrgesql copy
> allow only one character as delimiter :(
>
> Sad, I guess only way is to insert or do a through serialization of my
> data into something that COPY can understand.

The easiest way would be:

xml -> csv -> \copy

By csv, I mean regular quoted csv (simply wrap each csv field in double
quotes, and escape any quotes contained in a field with another double
quote).
On Mon, Jun 25, 2018 at 8:20 PM, Nicolas Paris <niparisco@gmail.com> wrote:
> easiest way would be:
>
> xml -> csv -> \copy
>
> by csv, I mean regular quoted csv (simply wrap csv field with double
> quote, and escape eventually contained quotes with an other double quote).

I tried but no luck. Here is the sample csv I wrote from my xml convertor:

1 "Are questions about animations or comics inspired by Japanese culture or styles considered on-topic?" "<p>Examples include <a href=""http://www.imdb.com/title/tt0417299/"" rel=""nofollow"">Avatar</a>, <a href=""http://www.imdb.com/title/tt1695360/"" rel=""nofollow"">Korra</a> and, to some extent, <a href=""http://www.imdb.com/title/tt0278238/"" rel=""nofollow"">Samurai Jack</a>. They're all widely popular American cartoons, sometimes even referred to as <em><a href=""https://en.wikipedia.org/wiki/Anime-influenced_animation"" rel=""nofollow"">Amerime</a></em>.</p>
<p>Are questions about these series on-topic?</p>
" "<p>Examples include <a href=""http://www.imdb.com/title/tt0417299/"" rel=""nofollow"">Avatar</a>, <a href=""http://www.imdb.com/title/tt1695360/"" rel=""nofollow"">Korra</a> and, to some extent, <a href=""http://www.imdb.com/title/tt0278238/"" rel=""nofollow"">Samurai Jack</a>. They're all widely popular American cartoons, sometimes even referred to as <em><a href=""https://en.wikipedia.org/wiki/Anime-influenced_animation"" rel=""nofollow"">Amerime</a></em>.</p>
<p>Are questions about these series on-topic?</p>
" "null"
The schema of my table is:

CREATE TABLE so2 (
  id INTEGER NOT NULL PRIMARY KEY,
  title varchar(1000) NULL,
  posts text,
  body TSVECTOR,
  parent_id INTEGER NULL,
  FOREIGN KEY (parent_id) REFERENCES so1(id)
);

and when I run:

COPY so2 from '/Users/user/programs/js/node-mbox/file.csv';

I get:

ERROR: missing data for column "body"
CONTEXT: COPY so2, line 1: "1 "Are questions about animations or comics inspired by Japanese culture or styles considered on-top..."

Not sure what I'm missing. Not sure if the above csv is breaking because I
have newlines within my content. But the error message is very hard to
debug.
PostgreSQL's COPY csv parser is one of the most robust I have ever tested.
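[Editor's note: the "missing data for column body" error above arises because the file carries four fields while the table has six columns. Two things fix it: write real CSV (Python's `csv` module handles the doubled quotes and embedded newlines automatically) and name the columns actually present in the file. A sketch:]

```python
# Sketch: produce CSV that COPY's csv mode accepts, with quoting handled
# by the standard library (column set matches Anto's so2 table).
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL)
# id, title, posts, parent_id -- embedded quotes and newlines are fine:
writer.writerow([1, 'Title with "quotes"', "body line 1\nline 2", ""])
# Loading would then name the columns present in the file, e.g.:
#   \copy so2 (id, title, posts, parent_id) FROM 'file.csv'
#         WITH (FORMAT csv, NULL '')
```

The `body TSVECTOR` column is left out of the column list and populated afterwards (e.g. with an UPDATE using to_tsvector), since it is derived data anyway.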
On 06/25/2018 07:25 AM, Anto Aravinth wrote:
> Thanks a lot. But I do got lot of challenges! Looks like SO data
> contains lot of tabs within itself.. So tabs delimiter didn't work for
> me. I thought I can give a special demiliter but looks like Postrgesql
> copy allow only one character as delimiter :(

I use | as it is rarely found in data itself.

> Sad, I guess only way is to insert or do a through serialization of my
> data into something that COPY can understand.

--
Adrian Klaver
adrian.klaver@aklaver.com
Anto Aravinth <anto.aravinth.cse@gmail.com> writes:
> Thanks a lot. But I do got lot of challenges! Looks like SO data contains
> lot of tabs within itself.. So tabs delimiter didn't work for me. I thought
> I can give a special demiliter but looks like Postrgesql copy allow only
> one character as delimiter :(
>
> Sad, I guess only way is to insert or do a through serialization of my data
> into something that COPY can understand.
>
The COPY command has a number of options, including setting what is used
as the delimiter - it doesn't have to be tab. You need to also look at
the logs/output to see exactly why the copy fails.
I'd recommend first pre-processing your input data to make sure it is
'clean' and all the fields actually match with whatever DDL you have
used to define your db tables etc. I'd then select a small subset and
try different parameters to the copy command until you get the right
combination of data format and copy definition.
It may take some effort to get the right combination, but the result is
probably worth it given your data set size, i.e. the difference between
hours and days.
--
Tim Cross
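[Editor's note: one way to follow Tim's pre-processing advice is to validate a small sample before running the full COPY. A sketch (names are illustrative) that flags records whose field count doesn't match the table, with the offending position — detail COPY itself won't give you:]

```python
# Sketch: pre-flight check for a sample of records before COPY.
def check_records(records, expected_fields):
    """Return (record_number, record) for every record whose field count
    is wrong; an empty list means the sample is clean."""
    bad = []
    for recno, rec in enumerate(records, start=1):
        if len(rec) != expected_fields:
            bad.append((recno, rec))
    return bad

# Second record is short one field, so it is reported:
bad = check_records([("1", "title", "body"), ("2", "title")],
                    expected_fields=3)
```

Running this over a few thousand rows of the converted output catches most format mismatches cheaply, before committing hours to the 70GB load.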