Thread: Disk filled-up issue after a lot of inserts and drop schema


Disk filled-up issue after a lot of inserts and drop schema

From:
Pietro Pugni
Date:
I’m running PostgreSQL 9.5.4 on a virtual machine for production purposes. It runs Ubuntu 16.04.1 LTS 64bit, 32GB RAM, 461GB disk space and 4 x logical CPUs.

Postgres executes the following activities:
- many INSERTS for ETL
- a lot of read and write operations for the main OLTP application

The ETL job is still under development, so I’m launching several sequential “tries” in order to get the whole thing working. The ETL procedure consists of a lot of inserts packed inside transactions. At the moment each transaction consists of 100k inserts, so for a 90-million-row table I get 90 million inserts packed into 900 transactions. I know it’s not ideal, but the JDBC driver combined with Pentaho doesn’t seem to pack multiple inserts into one statement, so I get a lot of overhead: I can see PARSE and BIND called for each insert. I think it’s Pentaho embedding each INSERT in a parameterized query. I hate Pentaho, but anyway.
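For what it’s worth, one workaround for that per-row overhead is to pack many rows into a single multi-row INSERT (or use COPY), so the parse/bind cost is amortized over a batch. A minimal shell sketch; the table and column names are made up for illustration:

```shell
# Pack several rows into one multi-row INSERT instead of one statement
# per row. "data_schema.facts" and its columns are hypothetical names.
rows="(1,'a') (2,'b') (3,'c')"
# $rows is deliberately unquoted so the shell splits it into one row per word
values=$(printf '%s\n' $rows | paste -sd, -)   # join the rows with commas
echo "INSERT INTO data_schema.facts (id, label) VALUES $values;"
```

Feeding statements like this through psql, or better still using COPY FROM STDIN, pays the statement overhead once per batch instead of once per row.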

The ETL procedure does the following:
1) DROP SCHEMA IF EXISTS data_schema CASCADE;
2) creates the “data_schema” schema and populates it with tables and rows using INSERTs as described before;
3) if an error occurs, drops the schema.

I’m repeating the previous steps many times because of some Pentaho errors the team is still working through. This stresses the WAL, because killing the process interrupts the current transaction and is followed by a DROP SCHEMA .. CASCADE.

A few days after we began debugging the ETL process, the disk filled up and the last ETL job aborted automatically. Note that the DB data directory is located on the same root disk, at /var/lib/postgresql/9.5/main

What shocked me was that the data directory of Postgres was just 815MB in size ($ du -h /var/lib/postgresql/9.5/main ) and pg_xlog was 705MB, but the entire disk was full ("df -h" returned a disk usage of 100%).

I looked for any postgres activity and only noticed a checkpoint writer process that was writing at low speeds (IO usage was about 5%).
Also, "SELECT * FROM pg_stat_activity" returned nothing, and the most shocking part was that "du -h /" reported 56GB as the total size of all files stored on the whole disk! The same was true for "du -ha /", which returns the apparent size.

The total disk size is 461GB, so how is it possible that "df -h" showed the 461GB disk as fully occupied while "du -h /" reported just 56GB?

After executing:
$ service postgresql stop
$ service postgresql start

the disk was freed and “df -h” returned a usage of just 16%!

The other questions are:
- how can I prevent the disk from filling up? I’m using the default configuration for the WAL (1GB max size).
- how can I tune Postgres to speed up the INSERTs?
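On the two questions above, the relevant knobs live in postgresql.conf; the fragment below is an illustrative sketch, not a recommendation for this workload:

```
# Illustrative postgresql.conf fragment -- example values, not advice.
max_wal_size = 1GB               # the 9.5 default; a soft target past which WAL is recycled
checkpoint_completion_target = 0.9
synchronous_commit = off         # speeds bulk inserts, at the risk of losing
                                 # the last few commits on a crash (no corruption)
```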

The actual configuration is the following:
listen_addresses = 'localhost'
max_connections = 32
shared_buffers = 16GB
work_mem = 128MB
maintenance_work_mem = 512MB
effective_io_concurrency = 10
checkpoint_completion_target = 0.9
cpu_tuple_cost = 0.02
cpu_index_tuple_cost = 0.01
cpu_operator_cost = 0.005
effective_cache_size = 24GB
default_statistics_target = 1000

Could it be that some of these parameters cause this strange behavior? checkpoint_completion_target, maybe?

Thanks to everyone for the support.

Best regards,
 Pietro Pugni



Re: Disk filled-up issue after a lot of inserts and drop schema

From:
Rick Otten
Date:
In Unix/Linux with many of the common file system types, if you delete a file, but a process still has it open, it will continue to "own" the disk space until that process closes the file descriptor or dies.  If you try "ls" or other file system commands, you won't actually see the file there, yet it really is, and it still has exclusive control of a portion of the disk.  The file is "unlinked" but the data blocks for the file are still reserved.

Like 'ls', the 'du' command only looks at files that still exist and adds up the disk space for those files.  It does not know about these files that have been unlinked, but still reserve a large portion of the disk.
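The effect is easy to reproduce. A small sketch (Linux, any writable temp directory) showing that `du` stops counting a file the moment it is unlinked, even though the blocks stay reserved while a descriptor is open:

```shell
# Demonstrate the unlinked-but-open effect: du stops counting the file
# as soon as it is removed, but the blocks stay reserved until fd 3 closes.
tmp=$(mktemp -d)
dd if=/dev/zero of="$tmp/big" bs=1024 count=8192 2>/dev/null   # an 8MB file
exec 3<"$tmp/big"      # this shell now holds the file open
rm "$tmp/big"          # unlink it: ls and du no longer see it
du -sk "$tmp"          # reports only the near-empty directory, not 8MB
exec 3<&-              # closing the descriptor finally frees the blocks
rmdir "$tmp"
```

`df`, by contrast, asks the filesystem for its free-block count, so it keeps charging for the unlinked file until the descriptor is closed, which is exactly the du/df discrepancy described in the original message.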

I don't know why something still has an open file descriptor on something you believe has been removed, but at least that explains why you are experiencing the discrepancy between "du" and the real available space on the disk.


On Wed, Sep 14, 2016 at 9:53 AM, Pietro Pugni <pietro.pugni@gmail.com> wrote:




Re: Disk filled-up issue after a lot of inserts and drop schema

From:
Tom Lane
Date:
Rick Otten <rottenwindfish@gmail.com> writes:
> I don't know why something still has an open file descriptor on something
> you believe has been removed, but at least that explains why you are
> experiencing the discrepancy between "du" and the real available space on
> the disk.

Yeah, the reported behavior clearly indicates that some PG process is
holding open files that should have been dropped (and were unlinked).
That's a bug, but there's not enough info here to find and fix it.

If we're really lucky, this is the same bug that Andres found and fixed
last week:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=26ce63ce76f91eac7570fcb893321ed0233d62ff

but that guess is probably too optimistic, especially if it's a background
process (such as the checkpointer process) that is holding the open files.

If you can reproduce this, which I'm guessing you can, please use
"lsof" or similar tool to see which Postgres process is holding open
references to lots of no-longer-there files.
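A sketch of that diagnostic, assuming Linux: `lsof +L1` lists open files whose link count has dropped to zero, and the same information is visible under /proc without lsof. The `scan_deleted_fds` helper name is made up for illustration:

```shell
# List descriptors pointing at unlinked files for the given pids
# (the same information "lsof +L1" reports). scan_deleted_fds is a
# hypothetical helper name, not a standard tool.
scan_deleted_fds() {
    for pid in "$@"; do
        ls -l "/proc/$pid/fd" 2>/dev/null | grep '(deleted)' \
            && echo "pid $pid holds unlinked files"
    done
}
# Against a live cluster you would run: scan_deleted_fds $(pgrep postgres)
# Self-demonstration: unlink a file this shell still has open.
f=$(mktemp); exec 3<"$f"; rm "$f"
scan_deleted_fds $$
exec 3<&-
```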

            regards, tom lane


Re: Disk filled-up issue after a lot of inserts and drop schema

From:
Pietro Pugni
Date:
Thank you guys.
I’ve jsut discovered the issue.. I set "logging_collector=off” in the previous email but didn’t comment the other log*
parameters,so Postgres was logging every single INSERT! This was caused the disk to fill up. 

The strange issue is that the log file didn’t exists when the disk filled up. I personally looked for it but it wasn’t
whereit should have been ( /var/log/postgesql/ ), so I can’t exactly confirm that the issue was the log file getting
biggerand bigger. 

Now, after writing the previous mail and rebooting postgres, I run several ETL jobs and the disk space was filling up.
Thelog file reached 110GB in size. 
After disabling *ALL* the log options in postgresql.conf, the log file does just the essential and default information.
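For reference, the misconfiguration described above (collector off but statement logging still enabled) looks roughly like this in postgresql.conf; this is a reconstructed sketch, not the actual file:

```
# Reconstructed sketch of the misconfiguration -- not the actual file.
logging_collector = off        # no collector: log lines go to stderr,
                               # which the init script redirects to a file
log_statement = 'all'          # left enabled: every INSERT gets written out
# The fix: return statement logging to its default
#log_statement = 'none'
```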


I’m sorry for the false alarm; we can consider the issue solved.

Thank you again

Best regards,
 Pietro Pugni




> On 14 Sep 2016, at 16:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:



Re: Disk filled-up issue after a lot of inserts and drop schema

From:
Tom Lane
Date:
Pietro Pugni <pietro.pugni@gmail.com> writes:
> I’ve just discovered the issue: I set logging_collector = off in the previous email but didn’t comment out the
> other log* parameters, so Postgres was logging every single INSERT! This is what caused the disk to fill up.

Ah.

> The strange thing is that the log file didn’t exist when the disk filled up. I personally looked for it but it
> wasn’t where it should have been (/var/log/postgresql/), so I can’t confirm for certain that the issue was the
> log file getting bigger and bigger.

Seems like the log file must have gotten unlinked while still active,
or at least, *something* had an open reference to it.  It's hard to
speculate about the cause for that without more info about how you've got
the logging set up.  (Are you using the log collector?  Are you rotating
logs?)  But I seriously doubt it represents a Postgres bug.  Unlike the
situation with data files, it's very hard to see how PG could be holding
onto a reference to an unused log file.  It only ever writes to one log
file at a time.

            regards, tom lane


Re: Disk filled-up issue after a lot of inserts and drop schema

From:
Pietro Pugni
Date:
Log rotation was active, set to 5MB or 1 day.
I don’t know if it is a bug, but Postgres was logging even though logging_collector was set to “off”.
Also, that big log file wasn’t visible to me; in fact, “ls” and “du” didn’t detect it.

Thanks again

Best regards,
 Pietro Pugni

> On 14 Sep 2016, at 19:55, Tom Lane <tgl@sss.pgh.pa.us> wrote: