Обсуждение: BUG #16300: Text line order corruption with COPY command
The following bug has been logged on the website: Bug reference: 16300 Logged by: Hans Buschmann Email address: buschmann@nidsa.net PostgreSQL version: 12.2 Operating system: Windows Server 2019 64bit Description: A reproducable line order corruption occurs when copying a quite large test file into Postgres. I was trying to import and parse a big .xml file (about 41 MB, 643407 lines) into a simple import table using the following sequence: create database x86db template=template0 encoding 'UTF8' lc_collate='C'; \c x86db create table uops_imp2 ( cline varchar ) ; copy uops_imp2 from 'N:/downloads/uops_info_instructions_200226.xml'; or copy uops_imp2 from '/usr/local/hb/uops_info_instructions_200226.xml'; This was tested on different machines under Windows Server 2019 64bit and Fedora 31 x86-64 under Postgres 12.2 respective 12.1: x86db=# select version (); version ------------------------------------------------------------ PostgreSQL 12.2, compiled by Visual C++ build 1914, 64-bit (1 row) x86db=# select version (); version -------------------------------------------------------------------------------------------------------- PostgreSQL 12.1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 9.2.1 20190827 (Red Hat 9.2.1-1), 64-bit (1 row) The original order of the input lines from the original file was verified under 2 different editors under Windows: notepad++ 7.8.5 x64 notepad (as build in), with status line turned on to show line numbers Here are shown the line 627365 til 627392: (the correct original) <doc TP="1.0"/> </architecture> </instruction> <instruction asm="VPMADDWD" category="AVX512" cpl="3" evex="1" extension="AVX512EVEX" iclass="VPMADDWD" iform="VPMADDWD_ZMMi32_MASKmskw_ZMMi16_MEMi16_AVX512" isa-set="AVX512BW_512" mask="0" string="VPMADDWD (ZMM, ZMM, M512)" zeroing="0"> <operand idx="1" name="REG0" type="reg" w="1" width="512" xtype="i32">ZMM0,ZMM1,ZMM2,ZMM3,ZMM4,ZMM5,ZMM6,ZMM7,ZMM8,ZMM9,ZMM10,ZMM11,ZMM12,ZMM13,ZMM14,ZMM15,ZMM16,ZMM17,ZMM18,ZMM19,ZMM20,ZMM21,ZMM22,ZMM23,ZMM24,ZMM25,ZMM26,ZMM27,ZMM28,ZMM29,ZMM30,ZMM31</operand> <operand idx="2" name="REG2" r="1" type="reg" width="512" xtype="i16">ZMM0,ZMM1,ZMM2,ZMM3,ZMM4,ZMM5,ZMM6,ZMM7,ZMM8,ZMM9,ZMM10,ZMM11,ZMM12,ZMM13,ZMM14,ZMM15,ZMM16,ZMM17,ZMM18,ZMM19,ZMM20,ZMM21,ZMM22,ZMM23,ZMM24,ZMM25,ZMM26,ZMM27,ZMM28,ZMM29,ZMM30,ZMM31</operand> <operand idx="3" memory-prefix="zmmword ptr" name="MEM0" r="1" type="mem" width="512" xtype="i16"/> <architecture name="SKX"> <IACA TP="0.50" TP_ports="0.50" fusion_occurred="1" ports="1*p05+1*p23" uops="2" version="2.3"/> <IACA TP="0.50" TP_ports="0.50" fusion_occurred="1" ports="1*p05+1*p23" uops="2" version="3.0"/> <measurement TP="0.54" TP_ports="0.50" ports="1*p05+1*p23" uops="2" uops_retire_slots="1"> <latency cycles="5" start_op="2" target_op="1"/> <latency cycles_addr="13" cycles_addr_is_upper_bound="1" cycles_addr_same_reg="14" cycles_addr_same_reg_is_upper_bound="1" cycles_mem="10" cycles_mem_is_upper_bound="1" start_op="3" target_op="1"/> </measurement> </architecture> <architecture name="CNL"> <measurement TP="1.00" TP_ports="1.00" ports="1*p0+1*p23" uops="2" uops_retire_slots="1"> <latency cycles="5" start_op="2" target_op="1"/> <latency cycles_addr="13" cycles_addr_is_upper_bound="1" cycles_mem="10" cycles_mem_is_upper_bound="1" start_op="3" target_op="1"/> </measurement> </architecture> <architecture name="ICL"> <measurement TP="1.00" TP_ports="1.00" ports="1*p0+1*p23" uops="2" uops_retire_slots="1"> <latency cycles="5" start_op="2" target_op="1"/> <latency cycles_addr="13" cycles_addr_is_upper_bound="1" cycles_mem="10" cycles_mem_is_upper_bound="1" start_op="3" target_op="1"/> </measurement> <doc TP="1.0"/> </architecture> </instruction> when querying the table by select * from uops_imp2 offset 627365 limit 27; I get a different part from the original lines with another line mangled in between (see ###) x86db=# x86db=# select * from uops_imp2 offset 627365 limit 27; cline ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- <latency cycles="5" start_op="4" target_op="1"/> </measurement> <doc TP="1.0"/> </architecture> </instruction> <instruction asm="VPMADDWD" category="AVX512" cpl="3" evex="1" extension="AVX512EVEX" iclass="VPMADDWD" iform="VPMADDWD_ZMMi32_MASKmskw_ZMMi16_MEMi16_AVX512" isa-set="AVX512BW_512" mask="0" string="VPMADDWD (ZMM, ZMM, M512)" zeroing="0"> <operand idx="1" name="REG0" type="reg" w="1" width="512" xtype="i32">ZMM0,ZMM1,ZMM2,ZMM3,ZMM4,ZMM5,ZMM6,ZMM7,ZMM8,ZMM9,ZMM10,ZMM11,ZMM12,ZMM13,ZMM14,ZMM15,ZMM16,ZMM17,ZMM18,ZMM19,ZMM20,ZMM21,ZMM22,ZMM23,ZMM24,ZMM25,ZMM26,ZMM27,ZMM28,ZMM29,ZMM30,ZMM31</operand> ### <latency cycles="6" start_op="2" target_op="1"/> <operand idx="2" name="REG2" r="1" type="reg" width="512" xtype="i16">ZMM0,ZMM1,ZMM2,ZMM3,ZMM4,ZMM5,ZMM6,ZMM7,ZMM8,ZMM9,ZMM10,ZMM11,ZMM12,ZMM13,ZMM14,ZMM15,ZMM16,ZMM17,ZMM18,ZMM19,ZMM20,ZMM21,ZMM22,ZMM23,ZMM24,ZMM25,ZMM26,ZMM27,ZMM28,ZMM29,ZMM30,ZMM31</operand> <operand idx="3" memory-prefix="zmmword ptr" name="MEM0" r="1" type="mem" width="512" xtype="i16"/> <architecture name="SKX"> <IACA TP="0.50" TP_ports="0.50" fusion_occurred="1" ports="1*p05+1*p23" uops="2" version="2.3"/> <IACA TP="0.50" TP_ports="0.50" fusion_occurred="1" ports="1*p05+1*p23" uops="2" version="3.0"/> <measurement TP="0.54" TP_ports="0.50" ports="1*p05+1*p23" uops="2" uops_retire_slots="1"> <latency cycles="5" start_op="2" target_op="1"/> <latency cycles_addr="13" cycles_addr_is_upper_bound="1" cycles_addr_same_reg="14" cycles_addr_same_reg_is_upper_bound="1" cycles_mem="10" cycles_mem_is_upper_bound="1" start_op="3" target_op="1"/> </measurement> </architecture> <architecture name="CNL"> <measurement TP="1.00" TP_ports="1.00" ports="1*p0+1*p23" uops="2" uops_retire_slots="1"> <latency cycles="5" start_op="2" target_op="1"/> <latency cycles_addr="13" cycles_addr_is_upper_bound="1" cycles_mem="10" cycles_mem_is_upper_bound="1" start_op="3" target_op="1"/> </measurement> </architecture> <architecture name="ICL"> <measurement TP="1.00" TP_ports="1.00" ports="1*p0+1*p23" uops="2" uops_retire_slots="1"> <latency cycles="5" start_op="2" target_op="1"/> (27 rows) In all cases i tried the original order of the lines was not preserved and the disorder was the same. The count of all lines seems correct: x86db=# select count(*) from uops_imp2; count -------- 643407 (1 row) The same error occurred when using \copy on the psql client side. To reproduce, the XML-file is directly downloadable under the following address: https://uops.info/xml.html and choosing the file instructions.xml I have not further analyzed other regions of line order corruption because it is very difficult when you cant rely on postgres COPY. I fear similar problems could occur when restoring a pg_dump file, which also relies on copy commands. Thanks in advance Hans Buschmann
The following bug has been logged on the website:
Bug reference: 16300
Logged by: Hans Buschmann
Email address: buschmann@nidsa.net
PostgreSQL version: 12.2
Operating system: Windows Server 2019 64bit
Description:
A reproducable line order corruption occurs when copying a quite large test
file into Postgres.
Gesendet: Donnerstag, 12. März 2020 21:42
An: Hans Buschmann; PostgreSQL mailing lists
Betreff: Re: BUG #16300: Text line order corruption with COPY command
Thank you for the quick reply.
When looking into the documentation I find under SQL COPY command:
"If a column list is specified, COPY TO copies only the data in the specified columns to the file.
For COPY FROM, each field in the file is inserted, in order, into the specified column.
Table columns not specified in the COPY FROM column list will receive their default values. "
So I expected the insertion in order as said above.
In my opinion it is essential to preserve the order of textfile input in COPY FROM. This also holds true when copying from another source like a program, where often it is not practicable to add an orderable column on input.
Not preserving (and rendering on a select) the order makes the COPY FROM mostly unusable for cases where the order must be preserved. (Think of XML, JSON, Source code, Log files, Disassembly etc.).
The size of such kind of files can it make also impractible to take the mentioned circumventing methods you provided.
I have not inspected the resulting tuple orders on the physical file.
I will try to nail down the first occurence of the disordering during the import by comparing import and export.
My concern also goes to pg_dump and friends, which could change such not natural orderable tables through pg_dump/pg_restore.
Hans Buschmann
Hans: On Fri, Mar 13, 2020 at 10:05 AM Hans Buschmann <buschmann@nidsa.net> wrote: ... > When looking into the documentation I find under SQL COPY command: > "If a column list is specified, COPY TO copies only the data in the specified columns to the file. > For COPY FROM, each field in the file is inserted, in order, into the specified column. > Table columns not specified in the COPY FROM column list will receive their default values. " > So I expected the insertion in order as said above. And the column order ( which is what you have quoted here ) is preserved ( easily, since you only have one). > In my opinion it is essential to preserve the order of textfile input in COPY FROM. This also holds true when copying fromanother source like a program, where often it is not practicable to add an orderable column on input. You are talking about ROW order. This have been discussed countless time for all sort of sql databases. SQL TABLES are SETS of TUPLES (rows), without order. > Not preserving (and rendering on a select) the order makes the COPY FROM mostly unusable for cases where the order mustbe preserved. (Think of XML, JSON, Source code, Log files, Disassembly etc.). I think you are not using SQL correctly. When you put "files" into the database ( XML, JSON, Source Code, Disassembly above), the correct behaviour is normally to put each file in a single field in a single row. For Log files the thing is different. I normally do not want to preserve import order on these, I send an explicit order, typically by the timestamp for similar field, when reading, so that even if I import the weekend files in sunday-saturday order I get them right, and I can mix logs from several sources and see them ordered ( that's why I import them when I (rarely) do ). If you want to model TEXT files as a table of lines you need to use a line number of similar thing, and then you have to deal with several problems ( in your XML example, I can delete some clines in the midle of your xml, you cannot detect it. If I preserve insertion order and want to INSERT a line in the middle of the document, how do O do it? ). Having a text file is generally bad idea. > The size of such kind of files can it make also impractible to take the mentioned circumventing methods you provided. Putting big text files in the database is not generally a good idea. The only thing you can do with those is read the lines sequentally, the plain file excels at this. > I have not inspected the resulting tuple orders on the physical file. > I will try to nail down the first occurence of the disordering during the import by comparing import and export. You can do these, but it will be useless. Tuples can move, actually they only do as a result of some operations, but nothing guarantees they are not going to do it on their own for some reason in the future. AFAIK, except for updates, they do not move, and I think vacuum full preserves ordering, but nothing guarantees you postgres 42 is not going to gain a "repack" function which moves tuples to increase the fill factor in the first pages of the files destroying your order. You do not have a command to ask an sql database for the tuples in insertion order, orderless select just mean any order. Even if they were ordered in the file you could have something like this: - Connection 1 ask for all the lines of a fairly huge table ( much bigger than caches), the engine is free to return them in any order so it peeks the easier, file order, and starts returning them. - When c1 is 20% into the job, c2 ask for the same rows. Engine notices c1 is already doing the same and starts sending the rows from 20-100% of the file to both connections. - c1 gets all of them, engine rereads rows in the 0-20% range and sends them to c2. Huge win, only 120% of data read, instead of 200% ( remember the "much bigger than cache" note ) ( engine could have stopped c1 and reread 0-20% to let them sync, but we will have a very unhappy c1 them ). I'm not saying postgres does this, but one of the reason sql language does not impose any unrequested order is to let it do it. > My concern also goes to pg_dump and friends, which could change such not natural orderable tables through pg_dump/pg_restore. There is not such thing as a natural orderable table. Number them or you are in for a lot of pain. Or use some kind of database built for text line processing, not a relational one. Francisco Olarte.
Von: David G. Johnston <david.g.johnston@gmail.com>
Gesendet: Donnerstag, 12. März 2020 21:42
An: Hans Buschmann; PostgreSQL mailing lists
Betreff: Re: BUG #16300: Text line order corruption with COPY command
Thank you for the quick reply.
When looking into the documentation I find under SQL COPY command:
"If a column list is specified, COPY TO copies only the data in the specified columns to the file.
For COPY FROM, each field in the file is inserted, in order, into the specified column.
Table columns not specified in the COPY FROM column list will receive their default values. "
So I expected the insertion in order as said above.
In my opinion it is essential to preserve the order of textfile input in COPY FROM. This also holds true when copying from another source like a program, where often it is not practicable to add an orderable column on input.
Not preserving (and rendering on a select) the order makes the COPY FROM mostly unusable for cases where the order must be preserved. (Think of XML, JSON, Source code, Log files, Disassembly etc.).
My concern also goes to pg_dump and friends, which could change such not natural orderable tables through pg_dump/pg_restore.