Discussion: COPY from STDIN vs file with large CSVs

COPY from STDIN vs file with large CSVs

From: Wells Oliver
Date:
I have a CSV that's ~30GB. Some 400m rows. Would there be a meaningful performance difference to run COPY from STDIN using: cat f.csv | psql "COPY .. FROM STDIN WITH CSV" versus just doing "COPY ... FROM 'f.csv' WITH CSV"?

Thanks. It took about four hours to copy one and I felt that was a little much.
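
For concreteness, the two invocations being compared would look roughly like this (the database and table names below are placeholders, not from the thread):

    # client side: psql streams the data over the connection with COPY FROM STDIN
    cat f.csv | psql -d mydb -c "COPY my_table FROM STDIN WITH CSV"

    # server side: the backend opens the file itself; it must be readable by the
    # server process and requires superuser (or pg_read_server_files) privileges
    psql -d mydb -c "COPY my_table FROM '/path/to/f.csv' WITH CSV"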

--

RE: COPY from STDIN vs file with large CSVs

From: Indirani Venkopa Chetty
Date:

Hi Wells,

I am trying to import a CSV file into Amazon RDS PostgreSQL.

I am getting the following error; on our internal PostgreSQL it works.

Here is the psql script. Can you also help me with a PowerShell command to execute it?

COPY product."tblProductLicenseType" FROM E'//XXXXXXXXXXX/Product_Licensing_Orders/Development/tblProductLicenseType.csv'  DELIMITER E',' NULL '\N'  QUOTE '"' HEADER CSV ;

COPY product."tblProductLicense" FROM E'//XXXXXXXXXXX/Product_Licensing_Orders/Development/tblProductLicense.csv' WITH (format csv, DELIMITER ',', header true, force_null ("LicenseId", "Season", "ProductCode", "Account", "UserId", "StatusId", "CreateDate", "UpdateDate", "SourceId", ord_num, item_num, "lType"), ENCODING 'UTF8' );

COPY product."tblCobranderUserSeasonProduct" FROM E'//XXXXXXXXXXX/Product_Licensing_Orders/Development/tblCobranderUserSeasonProduct.csv' WITH (format csv, DELIMITER ',', header true, force_null ("uspId", "CobranderUserLinkId", season, prod_cd, "statusId", "statusDate", "createdBy", "createdDate", "modifiedBy", "modifiedDate"), ENCODING 'UTF8' );

COPY product."tblDailyFundedBankProducts" FROM E'//XXXXXXXXXXX/Product_Licensing_Orders/Development/tblDailyFundedBankProducts.csv' WITH (format csv, DELIMITER ',', header true, force_null ("Account", "UserID", "BankProductSeason", "TransmitterPaymentDate", "FundedBankProducts", "ProtectionPlusFunded"), ENCODING 'UTF8' );

COPY product."tblProductSoftwarePackage"("ProductCode", "Season", "SoftwarePackageId", "CreateDate", "UpdateDate" , "createdBy", "updatedBy") FROM E'//XXXXXXXXXXX/Product_Licensing_Orders/Development/tblProductSoftwarePackage.csv' WITH (format csv, DELIMITER ',', header true, force_null ("ProductCode", "Season", "SoftwarePackageId", "CreateDate", "UpdateDate", "createdBy", "updatedBy"), ENCODING 'UTF8' );

COPY product."tblUserProductEval" FROM E'//XXXXXXXXXXX/Product_Licensing_Orders/Development/tblUserProductEval.csv' WITH (format csv, DELIMITER ',', header true, force_null ("evalId", "userId", "prodCd", "licenseId", "evalStartDate", "evalEndDate", "createdBy", "createdDate", "modifiedBy", "modifiedDate"), ENCODING 'UTF8' );

 

COPY orders."OrderType" FROM E'//XXXXXXXXXXX/Product_Licensing_Orders/Development/OrderType.csv' WITH (format csv, DELIMITER ',', header true, force_null ("OrderTypeCode", "Description"), ENCODING 'UTF8' );

COPY orders.orders(ord_num, ord_stat, account, user_id, season, county, tax_rate, tax_amt, disc_pct, disc_amt, freight, ord_tot, bal_due, pmt1_amt, pmt1_dt, pmt2_amt, pmt2_dt, note1, note2, sales_exec, split, chk_num1, chk_name1, chk_amt1, exp_mon1, exp_yr1, auth1, ref1, chk_num2, chk_name2, chk_amt2, exp_mon2, exp_yr2, auth2, ref2, b_fname, b_lname, b_company, b_addr1, b_addr2, b_city, b_state, b_zip, s_phone, s_fname, s_lname, s_company, s_addr1, s_addr2, s_city, s_state, s_zip, orig_by, orig_dt, edit_by, edit_dt, security_cd1, security_cd2, export_date, exec_email, s_fax, do_not_ship, ship_email, "CommissionType", "OrderType", "PaymentPlan", "sfOrderId")

FROM E'//XXXXXXXXXXX/Product_Licensing_Orders/Development/orders.csv' WITH (format csv, DELIMITER ',', header true, force_null (ord_num, ord_stat, account, user_id, season, county, tax_rate, tax_amt, disc_pct, disc_amt, freight, ord_tot, bal_due, pmt1_amt, pmt1_dt, pmt2_amt, pmt2_dt, note1, note2, sales_exec, split, chk_num1, chk_name1, chk_amt1, exp_mon1, exp_yr1, auth1, ref1, chk_num2, chk_name2, chk_amt2, exp_mon2, exp_yr2, auth2, ref2, b_fname, b_lname, b_company, b_addr1, b_addr2, b_city, b_state, b_zip, s_phone, s_fname, s_lname, s_company, s_addr1, s_addr2, s_city, s_state, s_zip, orig_by, orig_dt, edit_by, edit_dt, security_cd1, security_cd2, export_date, exec_email, s_fax, do_not_ship, ship_email, "CommissionType", "OrderType", "PaymentPlan", "sfOrderId"), ENCODING 'UTF8' );

COPY orders.ord_items FROM E'//XXXXXXXXXXX/Product_Licensing_Orders/Development/ord_items.csv' WITH (format csv, DELIMITER ',', header true, force_null (item_num, ord_num, prod_cd, qty, uprice, dt, bank_id, ship_via, bank_stat, "sfOrderLineId"), ENCODING 'UTF8' );

COPY orders.ord_pkg FROM E'//XXXXXXXXXXX/Product_Licensing_Orders/Development/ord_pkg.csv' WITH (format csv, DELIMITER ',', header true, force_null ("rowID", parent_cd, child_cd, season, sequence_number), ENCODING 'UTF8' );

COPY orders.ord_stock FROM E'//XXXXXXXXXXX/Product_Licensing_Orders/Development/ord_stock.csv' WITH (format csv, DELIMITER ',', header true, force_null ("rowID", prod_cd, season, idescr, qty, uprice, backorder, shippable, bank_dependent, derived, taxable, pkg_id1, pkg_id2, act_code, "Discountable", "UserActivationTypeID", "eFileActive"), ENCODING 'UTF8' );

COPY orders.orders_ext FROM E'//XXXXXXXXXXX/Product_Licensing_Orders/Development/orders_ext.csv' WITH (format csv, DELIMITER ',', header true, force_null (ord_num, svc_fee, rest_fee, ref_credit), ENCODING 'UTF8' );

COPY orders.sls_comm4ord FROM E'//XXXXXXXXXXX/Product_Licensing_Orders/Development/sls_comm4ord.csv' WITH (format csv, DELIMITER ',', header true, force_null (ord_comm_id, ord_num, sales_exec, amount, upd_by, upd_dt, com_dt), ENCODING 'UTF8' );

RETURN True;

 

 

Thanks,

 

chetty

 

From: Wells Oliver <wells.oliver@gmail.com>
Sent: Wednesday, January 8, 2020 8:54 AM
To: pgsql-admin <pgsql-admin@postgresql.org>
Subject: COPY from STDIN vs file with large CSVs

 

I have a CSV that's ~30GB. Some 400m rows. Would there be a meaningful performance difference to run COPY from STDIN using: cat f.csv | psql "COPY .. FROM STDIN WITH CSV" versus just doing "COPY ... FROM 'f.csv' WITH CSV"?

 

Thanks. It took about four hours to copy one and I felt that was a little much.


--

Attachments

Re: COPY from STDIN vs file with large CSVs

From: Ron
Date:
On 1/8/20 10:54 AM, Wells Oliver wrote:
> I have a CSV that's ~30GB. Some 400m rows. Would there be a meaningful 
> performance difference to run COPY from STDIN using: cat f.csv | psql 
> "COPY .. FROM STDIN WITH CSV" versus just doing "COPY ... FROM 'f.csv' 
> WITH CSV"?
>
> Thanks. It took about four hours to copy one and I felt that was a little 
> much.

catting the file starts another process, and opens a pipe.  That can't be 
faster than "COPY ... FROM ... WITH CSV".

pg_bulkload (which might be in your repository) is probably what you really 
want.

-- 
Angular momentum makes the world go 'round.



Re: COPY from STDIN vs file with large CSVs

From: Scott Ribe
Date:
> On Jan 8, 2020, at 10:05 AM, Indirani Venkopa Chetty <ivchetty@crosslinktax.com> wrote:
>
> I am trying to import a CSV file into Amazon RDS PostgreSQL.
> I am getting the following error; on our internal PostgreSQL it works.
> <image001.jpg>
>

The error message is telling you that you will have to connect to postgres as a user with pg superuser privileges.




Re: COPY from STDIN vs file with large CSVs

From: Wells Oliver
Date:
Thanks, I had looked into pg_bulkload a bit but it does not seem to be available for PG 12? It's not in the extension directory, it's not available through apt-cache search, and I have all kinds of issues with it finding pgcommon and pgport when I try to build from source. Using Ubuntu 18 LTS...
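
In case it helps, a hedged sketch of the build prerequisites that usually clear up missing libpgcommon/libpgport references on Ubuntu 18.04; it assumes the PGDG apt repository for PostgreSQL 12 is configured, and the package names and paths follow the usual PGDG layout rather than anything confirmed in this thread:

    # headers, pg_config, and the static libpgcommon/libpgport for PG 12
    sudo apt-get install build-essential libpq-dev postgresql-server-dev-12

    # make sure the PG 12 pg_config is found first, then build from the source tree
    export PATH=/usr/lib/postgresql/12/bin:$PATH
    make
    sudo make install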

On Wed, Jan 8, 2020 at 9:09 AM Ron <ronljohnsonjr@gmail.com> wrote:
On 1/8/20 10:54 AM, Wells Oliver wrote:
> I have a CSV that's ~30GB. Some 400m rows. Would there be a meaningful
> performance difference to run COPY from STDIN using: cat f.csv | psql
> "COPY .. FROM STDIN WITH CSV" versus just doing "COPY ... FROM 'f.csv'
> WITH CSV"?
>
> Thanks. It took about four hours to copy one and I felt that was a little
> much.

catting the file starts another process, and opens a pipe.  That can't be
faster than "COPY ... FROM ... WITH CSV".

pg_bulkload (which might be in your repository) is probably what you really
want.

--
Angular momentum makes the world go 'round.




--

Re: COPY from STDIN vs file with large CSVs

From: David G. Johnston
Date:
On Wed, Jan 8, 2020 at 9:31 AM Indirani Venkopa Chetty <ivchetty@crosslinktax.com> wrote:

Hi Wells,

I am trying to import a CSV file into Amazon RDS PostgreSQL.

I am getting the following error; on our internal PostgreSQL it works.

"COPY with a file name instructs the PostgreSQL server to directly read from or write to a file." -  i.e., Even had the server accepted your command it wasn't going to find your file.

Use either psql's "\copy" meta command or "COPY FROM STDIN" (search around for the needed syntax to make that work).
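
For illustration, the first COPY in the script above rewritten as a client-side \copy; a sketch only, with the file path kept exactly as posted (the file must be readable from the machine where psql runs, and the whole command must stay on one line):

    \copy product."tblProductLicenseType" from '//XXXXXXXXXXX/Product_Licensing_Orders/Development/tblProductLicenseType.csv' with (format csv, header true, delimiter ',', null '\N', quote '"')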

David J.

Attachments

Re: COPY from STDIN vs file with large CSVs

From: bricklen
Date:


On Wed, Jan 8, 2020 at 8:55 AM Wells Oliver <wells.oliver@gmail.com> wrote:
I have a CSV that's ~30GB. Some 400m rows. Would there be a meaningful performance difference to run COPY from STDIN using: cat f.csv | psql "COPY .. FROM STDIN WITH CSV" versus just doing "COPY ... FROM 'f.csv' WITH CSV"?

If you're looking to speed up the loading - and your disk subsystem is decent - consider running your csv through the "split" command to decompose it into smaller CSV files. You can then load them in parallel using multiple psql sessions.
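
As a minimal sketch of that approach (chunk size, database, and table names are placeholders; note that a naive line-based split assumes no newlines embedded inside quoted CSV fields, and any header row should be stripped first):

    # break the big CSV into ~50M-line pieces named chunk_aa, chunk_ab, ...
    split -l 50000000 f.csv chunk_

    # load the pieces through parallel client-side sessions
    for f in chunk_*; do
        psql -d mydb -c "\copy my_table from '$f' with csv" &
    done
    wait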