Discussion: Copy/delete issue
I have a strange situation that occurs every now and again.
We have a reporting system that gathers all the data from our various production systems during the night, so that we can run heavy reports on it without loading the production databases.
I have two shell scripts that do this nightly transfer of data. The production database is Sybase. So I have a shell script that scans a list of tables and databases and dumps them into a format suitable for postgres COPY. After it dumps everything, another shell script scans the same list, and loads each dump file into the proper table.
The shell script first runs psql with a DELETE command. For transaction tables (ones where data accumulates by date) the records for two days are deleted, and for non-transaction tables (ones that have records that might change but don't accumulate based on time) it's DELETE without WHERE.
I run psql with ON_ERROR_STOP and check the exit status. If the DELETE failed, I should get an error status, so I do not proceed to the copy.
Then I run psql again, with ON_ERROR_STOP, and run a \copy command that loads the data to the same table.
For some reason, once in a while, that fails. Always on the same table - violating the unique constraint of the primary key. Now, this is impossible because there was a successful delete beforehand, as I said, and the data comes from a database where that same primary key is enforced. Moreover, when I re-run the script, everything runs fine.
This happens at least once a week - always with the same table.
Can anybody think of a reason why psql will not report an error on deletion? Or why it would tell me that a constraint has been violated when loading the same data 5 minutes later works fine?
Thanks,
Herouth
Here is the relevant shell code (the table in question has "*" in the file for datefield):
# The names of the tables are stored in a text file
exec 4<$TABLES_FILE

dstamp N "Starting postgres load" >> $LOAD_LOG

while read -u 4 ignored1 ignored2 local_table datefield
do
    dstamp N "Now loading $local_table" >> $LOAD_LOG
    filename="$DUMPDIR/$local_table.tsv"

    # Stop if the dump file does not exist.
    if [ ! -f "$filename" ]
    then
        errexit "Dump file not found for table: $local_table" 1 >> $LOAD_LOG
    fi

    # If the datefield contains "*", it means the table contents are fully
    # replaced; otherwise use it as the field on which to limit the deletion.
    if [ "$datefield" = "*" ]
    then
        CMD="DELETE FROM $local_table"
    else
        CMD="DELETE FROM $local_table WHERE $datefield >= current_date - 2"
    fi

    # Run the deletion command
    echo -e "\\\\set ON_ERROR_STOP\\n$CMD;" | $PSQLCMD -q -f - > $TMPFILE 2>&1

    # Report errors and stop the loop if any occurred
    rc=$?
    if [ "$rc" != "0" ]
    then
        # Copy the error output, properly formatted, to the log file
        sed "s/^/$(date +%Y-%m-%d%t%T) E /" $TMPFILE >> $LOAD_LOG
        # Send mail message about the failure
        rm -f $TMPFILE
        errexit "Deletion failed with status $rc on table: $local_table" $rc >> $LOAD_LOG
    fi

    # Now run the load command
    echo -e "\\\\set ON_ERROR_STOP\\n\\\\copy $local_table from $filename" | $PSQLCMD -q -f - > $TMPFILE 2>&1
    rc=$?

    # Check for errors and report
    if [ "$rc" != "0" ]
    then
        # Copy the error output, properly formatted, to the log file
        sed "s/^/$(date +%Y-%m-%d%t%T) E /" $TMPFILE >> $LOAD_LOG
        # Send mail message about the failure
        rm -f $TMPFILE
        errexit "Copy failed with status $rc on table: $local_table" $rc >> $LOAD_LOG
    fi

    # Remove the dump file, as well as the output file from the psql command
    rm -f "$filename"

    # Update statistics with the ANALYZE command
    dstamp N "Updating statistics for $local_table" >> $LOAD_LOG
    echo -e "\\\\set ON_ERROR_STOP\\nANALYZE $local_table;" | $PSQLCMD -q -f - > $TMPFILE 2>&1

    # Report errors and stop the loop if any occurred
    rc=$?
    if [ "$rc" != "0" ]
    then
        # Copy the error output, properly formatted, to the log file
        sed "s/^/$(date +%Y-%m-%d%t%T) E /" $TMPFILE >> $LOAD_LOG
        # Send mail message about the failure
        rm -f $TMPFILE
        errexit "ANALYZE failed with status $rc on table: $local_table" $rc >> $LOAD_LOG
    fi
done
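[For reference, the exit-status check the script relies on can be seen in isolation roughly like this. This is a minimal stand-alone sketch, not part of the original script; "reports" and "some_table" are placeholders. With ON_ERROR_STOP set, psql exits with a non-zero status when a statement fails, so the shell can abort before the \copy step.]

    printf '\\set ON_ERROR_STOP\nDELETE FROM some_table;\n' | psql -q -f - reports
    rc=$?
    if [ "$rc" != "0" ]
    then
        # psql reported a failure; do not go on to load the dump file
        echo "DELETE failed (psql exit status $rc); not running \copy" >&2
        exit "$rc"
    fi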
On Wednesday 17 December 2008 12:38:40 am Herouth Maoz wrote:
> I have a strange situation that occurs every now and again.
> <Snip>
> Can anybody think of a reason why psql will not report an error on
> deletion? Or why it would tell me that a constraint has been violated
> when loading the same data 5 minutes later works fine?
>
> Here is the relevant shell code (the table in question has "*" in the
> file for datefield):
> <Snip>
>     if [ "$datefield" = "*" ]
>     then
>         CMD="DELETE FROM $local_table"
>     else
>         CMD="DELETE FROM $local_table WHERE $datefield >= current_date - 2"
>     fi
>
>     # Run the deletion command
>     echo -e "\\\\set ON_ERROR_STOP\\n$CMD;" | $PSQLCMD -q -f - > $TMPFILE 2>&1
> <Snip>

Are you sure the problem is not in "$datefield" = "*", i.e. that the script
that formats the data file is not correctly adding "*" to the right file?
It seems almost as if sometimes the second CMD is being run against the
table that the first CMD should be run on. In other words, it is not doing
a complete delete but a date-based one, and you then import duplicate
records.

--
Adrian Klaver
aklaver@comcast.net
Adrian Klaver wrote:
> <Snip>
> Are you sure the problem is not in "$datefield" = "*", i.e. that the
> script that formats the data file is not correctly adding "*" to the
> right file? It seems almost as if sometimes the second CMD is being run
> against the table that the first CMD should be run on. In other words,
> it is not doing a complete delete but a date-based one, and you then
> import duplicate records.

Thanks for your reply. The file containing the tables list is static - it
doesn't change from one run to the next (unless I edit it personally).

Herouth
(Sorry for the forward, I forgot to CC the list)

On Wed, Dec 17, 2008 at 9:38 AM, Herouth Maoz <herouth@unicell.co.il> wrote:
> and for non-transaction tables (ones that have records that might
> change but don't accumulate based on time) it's DELETE without WHERE.

In that case, you are better off using TRUNCATE instead.
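[For illustration only, the change Dennis is suggesting would amount to something like this in the loader script's terms. This is a sketch, not the original code; note that TRUNCATE takes a stronger lock on the table than DELETE and reclaims the space immediately.]

    # Use TRUNCATE for fully-replaced tables, keep the dated DELETE otherwise
    if [ "$datefield" = "*" ]
    then
        CMD="TRUNCATE TABLE $local_table"
    else
        CMD="DELETE FROM $local_table WHERE $datefield >= current_date - 2"
    fi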
On Sunday 21 December 2008 1:49:18 am Herouth Maoz wrote:
> Adrian Klaver wrote:
> > <Snip>
> Thanks for your reply. The file containing the tables list is static -
> it doesn't change from one run to the next (unless I edit it personally).
>
> Herouth

Well something is not static :) You mentioned this happens only with one
table. Have you tried running your procedure against that table only?

Just because a DELETE did not error does not mean it succeeded in the way
you wanted. You might want to throw a count() in the mix to see if you are
really clearing out the table the way you want to.

Also is the actual data file static from one run to the next?

Would also help to see the schema for the table involved and maybe a sample
of the data, if that is possible.

--
Adrian Klaver
aklaver@comcast.net
Adrian Klaver wrote:
> <Snip>
> Well something is not static :) You mentioned this happens only with one
> table. Have you tried running your procedure against that table only?

Well, every time this happens, I re-run the procedure, with all the lines
in the data files up to the given table deleted. And it works. Then I
restore the original data file. And the next day it works. It only happens
once in a while.
> Just because a DELETE did not error does not mean it succeeded in the way
> you wanted. You might want to throw a count() in the mix to see if you
> are really clearing out the table the way you want to.

I wonder if there is a way to use the result of "count()" in \echo...
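[An alternative to pushing the result through \echo would be to capture the count in the shell. This is a rough sketch, assuming $PSQLCMD is a plain psql invocation that also accepts the -A, -t and -c options; the warning log line is only illustrative.]

    # For the fully-replaced table, the count should be 0 right after the DELETE
    remaining=$($PSQLCMD -A -t -c "SELECT count(*) FROM $local_table")
    if [ "$remaining" != "0" ]
    then
        dstamp N "WARNING: $remaining rows still in $local_table after DELETE" >> $LOAD_LOG
    fi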
> Also is the actual data file static from one run to the next?

If you mean the data file that contains the list of tables, then yes. If
you mean the data in the table itself, then no, the data changes - new
records are added and old ones are updated.
> Would also help to see the schema for the table involved and maybe a
> sample of the data, if that is possible.

A sample of the data would be a bit tricky, as this is customers' private
information. But the table schema is:
CREATE TABLE web1010.users
(
    user_id                    CHAR(32) PRIMARY KEY NOT NULL,
    whitelabel_id              NUMERIC(21) NOT NULL,
    username                   VARCHAR(30) NOT NULL,
    password                   CHAR(32) NOT NULL,
    perms                      VARCHAR(255) NOT NULL,
    first_name                 VARCHAR(40) NULL,
    last_name                  VARCHAR(40) NULL,
    total_points               INTEGER DEFAULT 0 NOT NULL,
    date_created               TIMESTAMP NOT NULL,
    date_birth                 TIMESTAMP NULL,
    gender                     INTEGER NULL,
    city_id                    NUMERIC(21) NULL,
    is_active                  SMALLINT NOT NULL,
    email                      VARCHAR(255) NULL,
    subscriptin_id             NUMERIC(21) NULL,
    subscriptin_num_of_msg     INTEGER NULL,
    subscriptin_date_start     TIMESTAMP NULL,
    subscriptin_sent_datetime  TIMESTAMP NULL,
    subscriptin_credit_left    INTEGER NULL,
    subscriptin_status         INTEGER NULL,
    subscriptin_sent_reference NUMERIC(21) NULL,
    first_time_subscribed      VARCHAR(10) NULL,
    sms_credit                 INTEGER NULL,
    reg_pid                    NUMERIC(21) NULL,
    spam_fl                    SMALLINT NULL,
    constraint PK_USERS unique (whitelabel_id, username)
);
I suppose this doesn't happen with other tables in the process, because most other tables don't have two unique constraints in them - most only have the primary key. But still, if everything is deleted from the table, this should not be an issue...
I might take Dennis Brakhane's advice and replace the DELETE command with TRUNCATE, as I see no harm in doing so. Nevertheless, DELETE should either work or fail saying "could not delete because...". Otherwise PostgreSQL is not a very reliable...
Thanks,
Herouth
On Tuesday 23 December 2008 6:43:56 am Herouth Maoz wrote:
> Well, every time this happens, I re-run the procedure, with all the
> lines in the data files up to the given table deleted. And it works.
> Then I restore the original data file. And the next day it works. It
> only happens once in a while.

See next comment.

> > Also is the actual data file static from one run to the next?
>
> If you mean the data file that contains the list of tables, then yes. If
> you mean the data in the table itself, then no, the data changes - new
> records are added and old ones are updated.

I should have been more specific. You mentioned you repeat the procedure 5
minutes or so after a failure. Is there a change in the actual data between
those runs?

> > Would also help to see the schema for the table involved and maybe a
> > sample of the data, if that is possible.
>
> A sample of the data would be a bit tricky, as this is customers'
> private information. But the table schema is:
> <Snip>
>
> I suppose this doesn't happen with other tables in the process, because
> most other tables don't have two unique constraints in them - most only
> have the primary key. But still, if everything is deleted from the
> table, this should not be an issue...

In the original post you said the constraint violation was on the PK. Is
that the case, or is it on PK_USERS?

> I might take Dennis Brakhane's advice and replace the DELETE command
> with TRUNCATE, as I see no harm in doing so. Nevertheless, DELETE should
> either work or fail saying "could not delete because...". Otherwise
> PostgreSQL is not a very reliable...

Worth trying. However it does not answer the question of what is going on.
While it is possible that there is a DELETE bug, I still believe it is a
case of DELETE working in a way you are not expecting.

--
Adrian Klaver
aklaver@comcast.net