Discussion: [GENERAL] Bad planning data resulting in OOM killing of postgres

[GENERAL] Bad planning data resulting in OOM killing of postgres

From: David Hinkle
Date:
I'm having trouble with purges related to a large table.  The delete
query consumes RAM until postgres crashes due to OOM.  I have a very
large table called log_raw, plus half a dozen related side tables such
as 'urls' and 'titles'.  For example, log_raw.urlid = urls.urlid, and
urls contains the text of the various URLs.

Each time I try to purge these side tables, the unit OOMs.

psql:postgres@cipafilter = explain DELETE FROM titles WHERE NOT EXISTS
( SELECT 1 FROM log_raw WHERE log_raw.titleid = titles.titleid );
                                       QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────
 Delete on titles  (cost=22166473.44..24850954.67 rows=870382 width=12)
   ->  Hash Anti Join  (cost=22166473.44..24850954.67 rows=870382 width=12)
         Hash Cond: (titles.titleid = log_raw.titleid)
         ->  Seq Scan on titles  (cost=0.00..17871.64 rows=870664 width=10)
         ->  Hash  (cost=12744792.64..12744792.64 rows=542011264 width=10)
               ->  Seq Scan on log_raw  (cost=0.00..12744792.64 rows=542011264 width=10)
(6 rows)

psql:postgres@cipafilter = select count(*) from (select titleid from
log_raw group by titleid) as a;
 count
────────
 872210
(1 row)

cipafilter=# select n_distinct from pg_stats where tablename =
'log_raw' and attname = 'titleid';
 n_distinct
------------
        282
(1 row)

The planning data is wildly low for each of these fields, and I wonder
if, because of that error, the planner thinks it can keep all these IDs
in RAM while it works.  ANALYZE doesn't fix it.  Increasing the
statistics target improves the n_distinct figure, but not considerably:
increasing it 3- or 4-fold still leaves the estimate wildly off.
ALTER TABLE ... SET (n_distinct = ...) doesn't seem to be used by the
planner either, as it doesn't change any of the plans I've generated or
seem to be taken into account in the row estimates.  I'm out of ideas.
Anybody have any ideas?
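
For reference, the n_distinct override I tried looks roughly like this
(the value here is illustrative; as far as I know the setting is only
consulted by the next ANALYZE):

    ALTER TABLE log_raw ALTER COLUMN titleid SET (n_distinct = 870000);
    ANALYZE log_raw;  -- the override is only read when the table is re-analyzed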


--
David Hinkle

Senior Software Developer

Phone:  800.243.3729x3000

Email:  hinkle@cipafilter.com

Hours:  Mon-Fri   8:00AM-5:00PM (CT)


Re: [GENERAL] Bad planning data resulting in OOM killing of postgres

From: Pavel Stehule
Date:
Hi

what is your work_mem setting?

Regards

Pavel
 

Re: [GENERAL] Bad planning data resulting in OOM killing of postgres

From: Jeff Janes
Date:
Your data on log_raw.titleid is probably clustered, so that all the rows on any given page of the table share the same value of titleid.  That really messes up the sampling algorithm used by ANALYZE.  To overcome it, you would have to increase the statistics target by 3 or 4 orders of magnitude, not a factor of 3 or 4.
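
If you want to experiment with that anyway, the per-column knob looks
roughly like this (10000 is the maximum allowed statistics target;
untested against your schema):

    ALTER TABLE log_raw ALTER COLUMN titleid SET STATISTICS 10000;
    ANALYZE log_raw;  -- resample titleid with the larger target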

However, that doesn't seem to be the actual problem.  Surprisingly enough, a hash anti-join doesn't automatically de-duplicate the hash table as it is being built.  So n_distinct correctly has no influence on the estimated RAM usage, because it doesn't influence the actual RAM usage either.

It sounds like your work_mem is set way too high.  What is it set to?  And what version of PostgreSQL are you using?

Cheers,

Jeff

Re: [GENERAL] Bad planning data resulting in OOM killing of postgres

From: David Hinkle
Date:
Thanks guys, here's the information you requested:

psql:postgres@cipafilter = show work_mem;
 work_mem
──────────
 10MB
(1 row)

psql:postgres@cipafilter = select version();
                                                version
───────────────────────────────────────────────────────────────────────────────────────────────────────
 PostgreSQL 9.4.10 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2, 64-bit
(1 row)



Re: [GENERAL] Bad planning data resulting in OOM killing of postgres

From: Jeff Janes
Date:
OK, new theory then.  Do you have triggers on, or foreign key constraints pointing at, the table you are deleting from?  PostgreSQL queues up each deleted row, in memory, so it can go back and fire the trigger or validate the constraint at the end of the statement.  You might need to drop the constraint, or delete in smaller batches by adding some kind of dummy condition to the WHERE clause which you progressively move, as in the sketch below.
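
A rough sketch of the batched variant (the key range and the 100000
step are made up; adjust them to your data):

    -- delete orphaned titles in bounded slices of the titleid key space
    DELETE FROM titles t
     WHERE t.titleid >= 0 AND t.titleid < 100000
       AND NOT EXISTS (SELECT 1 FROM log_raw l WHERE l.titleid = t.titleid);
    -- then 100000 .. 200000, and so on, until the key space is covered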

Or select the rows you want to keep into a new table, and then drop the old one, rename the new one, and rebuild any constraints, indexes, and other dependencies.  This can be pretty annoying if there are a lot of them.
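
That route would look something like this (a sketch only; indexes, the
titleid sequence default, grants, and anything else hanging off the old
table would have to be recreated by hand):

    CREATE TABLE titles_new AS
      SELECT t.* FROM titles t
       WHERE EXISTS (SELECT 1 FROM log_raw l WHERE l.titleid = t.titleid);
    -- rebuild indexes/constraints on titles_new here, then swap:
    BEGIN;
    DROP TABLE titles;
    ALTER TABLE titles_new RENAME TO titles;
    COMMIT;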

Cheers,

Jeff

Re: [GENERAL] Bad planning data resulting in OOM killing of postgres

From: David Hinkle
Date:
Thanks Jeff,

No triggers or foreign key constraints:

psql:postgres@cipafilter = \d+ titles
                                                     Table "public.titles"
 Column  │       Type        │                        Modifiers                         │ Storage  │ Stats target │ Description
─────────┼───────────────────┼──────────────────────────────────────────────────────────┼──────────┼──────────────┼─────────────
 title   │ character varying │                                                          │ extended │              │
 titleid │ integer           │ not null default nextval('titles_titleid_seq'::regclass) │ plain    │              │
Indexes:
    "titles_pkey" PRIMARY KEY, btree (titleid)
    "titles_md5_title_idx" btree (md5(title::text))

Do you see anything in there that would be problematic?


Re: [GENERAL] Bad planning data resulting in OOM killing of postgres

From: Jeff Janes
Date:
I'm out of ideas here.  What happens if you just select the rows, rather than deleting them?  Does it have memory problems then?  If not, can you post the explain (analyze, buffers) of doing that?

Cheers,

Jeff

Re: [GENERAL] Bad planning data resulting in OOM killing of postgres

From: David Hinkle
Date:
psql:postgres@cipafilter = EXPLAIN (ANALYZE, BUFFERS) select titleid
from titles WHERE NOT EXISTS ( SELECT 1 FROM log_raw WHERE
log_raw.titleid = titles.titleid );
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

Nope, that pops too.  The query runs for a long time at a somewhat
normal rate of RAM consumption, using ~1G of RSS, then suddenly spikes
to about 6G, at which point the OOM killer kills it.  The box has 8G of
RAM and 4G of swap.


Re: [GENERAL] Bad planning data resulting in OOM killing of postgres

From: David Hinkle
Date:
I managed to get this version to finish:

psql:postgres@cipafilter = explain (ANALYZE, BUFFERS) select count(*)
from (select titleid from log_raw group by titleid) as a;
                                                               QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Aggregate  (cost=14099827.15..14099827.16 rows=1 width=0) (actual time=248805.549..248805.549 rows=1 loops=1)
   Buffers: shared hit=598 read=7324082 dirtied=34
   ->  HashAggregate  (cost=14099820.80..14099823.62 rows=282 width=4) (actual time=248504.756..248760.382 rows=874750 loops=1)
         Group Key: log_raw.titleid
         Buffers: shared hit=598 read=7324082 dirtied=34
         ->  Seq Scan on log_raw  (cost=0.00..12744792.64 rows=542011264 width=4) (actual time=0.002..145554.907 rows=544654818 loops=1)
               Buffers: shared hit=598 read=7324082 dirtied=34
 Planning time: 0.072 ms
 Execution time: 248807.285 ms
(9 rows)


Re: [GENERAL] Bad planning data resulting in OOM killing of postgres

From: David Hinkle
Date:
I manually updated the pg_statistic data, literally setting it to an
appropriate value, and the planner picked a new plan, and the new plan
worked.  Any idea what I should do about this?  Is manually updating
these values my best bet?

psql:daemon@cipafilter = update pg_statistic set stadistinct = 850000
where starelid = 16678;

psql:daemon@cipafilter = EXPLAIN (ANALYZE, BUFFERS) delete from titles
WHERE NOT EXISTS ( SELECT 1 FROM log_raw WHERE log_raw.titleid =
titles.titleid );
                                                                      QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Delete on titles  (cost=111982485.81..117653772.60 rows=26291 width=12) (actual time=726777.769..726777.769 rows=0 loops=1)
   Buffers: shared hit=769918 read=7323762 dirtied=1, temp read=1944017 written=2177366
   ->  Merge Anti Join  (cost=111982485.81..117653772.60 rows=26291 width=12) (actual time=726777.766..726777.766 rows=0 loops=1)
         Merge Cond: (titles.titleid = log_raw.titleid)
         Buffers: shared hit=769918 read=7323762 dirtied=1, temp read=1944017 written=2177366
         ->  Index Scan using titles_pkey on titles  (cost=0.42..44451.84 rows=876291 width=10) (actual time=0.006..316.375 rows=874829 loops=1)
               Buffers: shared hit=769000 dirtied=1
         ->  Materialize  (cost=109899462.67..112609518.99 rows=542011264 width=10) (actual time=545806.339..687770.786 rows=425297005 loops=1)
               Buffers: shared hit=918 read=7323762, temp read=1944017 written=2177366
               ->  Sort  (cost=109899462.67..111254490.83 rows=542011264 width=10) (actual time=545806.336..652402.716 rows=425297005 loops=1)
                     Sort Key: log_raw.titleid
                     Sort Method: external merge  Disk: 11013224kB
                     Buffers: shared hit=918 read=7323762, temp read=1944017 written=2177366
                     ->  Seq Scan on log_raw  (cost=0.00..12744792.64 rows=542011264 width=10) (actual time=0.003..274920.970 rows=544670242 loops=1)
                           Buffers: shared hit=918 read=7323762
 Planning time: 0.158 ms
 Execution time: 727533.213 ms
(17 rows)


Re: [GENERAL] Bad planning data resulting in OOM killing of postgres

From: Tom Lane
Date:
David Hinkle <hinkle@cipafilter.com> writes:
> Thanks guys, here's the information you requested:
> psql:postgres@cipafilter = show work_mem;
>  work_mem
> ──────────
>  10MB
> (1 row)

[ squint... ]  It should absolutely not have tried to hash a 500M-row
table if it thought work_mem was only 10MB.  I wonder if there's an
integer-overflow problem or something like that.

            regards, tom lane


Re: [GENERAL] Bad planning data resulting in OOM killing of postgres

From: Tom Lane
Date:
I wrote:
> David Hinkle <hinkle@cipafilter.com> writes:
>> Thanks guys, here's the information you requested:
>> psql:postgres@cipafilter = show work_mem;
>> work_mem
>> ──────────
>> 10MB
>> (1 row)

> [ squint... ]  It should absolutely not have tried to hash a 500M-row
> table if it thought work_mem was only 10MB.  I wonder if there's an
> integer-overflow problem or something like that.

Ah, nah, scratch that: I was thinking of the hash aggregation case,
where there's no run-time ability to spill to disk so the planner
will not risk using hash aggregation if it estimates the hash table
would exceed work_mem.  Hash joins do have the ability to restrict
memory consumption by increasing the number of batches, so the planner
doesn't worry about it in that case.

I think what must be happening is that there's some one value of
log_raw.titleid that occurs a ridiculous number of times, so that
the executor is unable to split up that particular hash bucket,
leading to OOM when it tries to load all those rows to process
the hash bucket.

The planner does attempt to estimate the worst-case bucket size,
but in what now seems like brain fade, it doesn't do more with that
information than charge an appropriate number of tuple comparisons.
That would somewhat discourage use of a hash join, but a merge join
on this many tuples would be pretty expensive too, so it's not overly
surprising that it went with hashing anyway.

I am thinking we ought to fix it so it rejects (or at least heavily
penalizes) a hash join if it estimates that the rows containing
the inner side's most common value wouldn't all fit in work_mem.

What's not completely certain however is whether this diagnosis is
accurate for your case, or whether the proposed remedy would fix it.
It would be useful to see the contents of pg_stats.most_common_freqs
for log_raw.titleid.
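
Something along these lines should show it (standard pg_stats columns):

    SELECT n_distinct, most_common_vals, most_common_freqs
      FROM pg_stats
     WHERE tablename = 'log_raw' AND attname = 'titleid';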

            regards, tom lane


Re: [GENERAL] Bad planning data resulting in OOM killing of postgres

From: Karsten Hilbert
Date:
On Mon, Feb 13, 2017 at 03:47:08PM -0600, David Hinkle wrote:


> Nope, that pops too.  The query runs for a long time at a somewhat
> normal rate of ram consumption, using ~1G of RSS then suddenly spikes
> to about 6G, at which point the OOM killer pops it.  Box has 8G of ram
> and 4G of swap.

By any chance:

- when it happens has the kernel considered using swap ?

- which kernel are you running ?

There have been (for some workloads) massive problems with RAM
exhaustion / swapping / the OOM killer going wild on
4.7/4.8/some 4.9 kernels.

Just a shot in the dark,
Karsten
--
GPG key ID E4071346 @ eu.pool.sks-keyservers.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346


Re: [GENERAL] Bad planning data resulting in OOM killing of postgres

From: Karsten Hilbert
Date:
I guess what I'm trying to say is that it may actually not be
PostgreSQL's fault but rather the kernel invoking the OOM
killer way prematurely.

Karsten


Re: [GENERAL] Bad planning data resulting in OOM killing of postgres

From: David Hinkle
Date:
Thanks for your help!

Karsten: The system does fill up swap before it blows up.  This
particular model has 8G of RAM and 4G of swap, and runs kernel
4.4.0-53-generic #74~14.04.1-Ubuntu.

Tom, there are three columns in this table that exhibit the problem.
Here is the statistics data after an ANALYZE, and the real data to
compare it to.

cipafilter=# select attname, n_distinct, most_common_freqs from
pg_stats where tablename = 'log_raw' and (attname = 'urlid' or attname
= 'titleid' or attname = 'hostid');
 attname | n_distinct | most_common_freqs
---------+------------+-------------------
 urlid   |      38963 | {0.0188,0.009,0.00853333,0.0079,0.00686667,0.0068,0.00616667,0.0061,0.00606667,0.005,0.0049,0.00483333,0.00463333,0.00456667,0.0044,0.0044,0.0039,0.0032,0.00263333,0.00263333,0.00256667,0.00256667,0.00246667,0.0023,0.0023,0.00223333,0.00203333,0.002,0.002,0.002,0.0019,0.00186667,0.00183333,0.0018,0.0018,0.00176667,0.00176667,0.00176667,0.00176667,0.00173333,0.00173333,0.00173333,0.0017,0.0017,0.00166667,0.0016,0.00156667,0.00153333,0.0015,0.00146667,0.00143333,0.0014,0.0014,0.0014,0.0014,0.0014,0.0014,0.00136667,0.00133333,0.0013,0.0013,0.0013,0.0013,0.00126667,0.00126667,0.00123333,0.00123333,0.00123333,0.00123333,0.0012,0.0012,0.00113333,0.0011,0.00106667,0.00106667,0.001,0.001,0.001,0.001,0.000966667,0.000966667,0.000966667,0.000966667,0.000966667,0.000933333,0.000933333,0.000933333,0.0009,0.0009,0.0009,0.0009,0.0009,0.0009,0.0009,0.0009,0.0009,0.000866667,0.000866667,0.000866667,0.000866667}
 hostid  |       2478 | {0.2109,0.0330667,0.0321667,0.0245,0.0223,0.0177667,0.0165,0.0152,0.0151333,0.014,0.0132,0.0115667,0.00993333,0.00963333,0.00873333,0.00853333,0.00853333,0.00746667,0.0074,0.00653333,0.0065,0.0065,0.00646667,0.00513333,0.00506667,0.00503333,0.00496667,0.00493333,0.0049,0.00486667,0.00483333,0.00453333,0.0044,0.00433333,0.00426667,0.00413333,0.00396667,0.00386667,0.00383333,0.00363333,0.00363333,0.00363333,0.00346667,0.00323333,0.00323333,0.0032,0.00316667,0.00303333,0.00303333,0.0029,0.00286667,0.00273333,0.00273333,0.00266667,0.00263333,0.00263333,0.0026,0.0026,0.0025,0.00243333,0.00243333,0.0024,0.00236667,0.00226667,0.00223333,0.00216667,0.00213333,0.0021,0.0021,0.00206667,0.00206667,0.00203333,0.002,0.002,0.00193333,0.0019,0.0019,0.00186667,0.00186667,0.00186667,0.0018,0.0018,0.0018,0.00176667,0.00173333,0.0017,0.0017,0.00166667,0.00166667,0.00166667,0.00163333,0.00163333,0.00153333,0.0015,0.0015,0.0015,0.00146667,0.00146667,0.00146667,0.00146667}
 titleid |        292 | {0.767167}
(3 rows)

I have to patch the pg_statistic table to get postgres to run the
following queries without crashing:

cipafilter=#     UPDATE pg_statistic AS s
cipafilter-#         SET stadistinct = (select reltuples from pg_class
where relname = 'titles')
cipafilter-#     FROM pg_class c, pg_attribute a where c.oid =
s.starelid and c.oid = a.attrelid and a.attnum = s.staattnum and
cipafilter-#         relname = 'log_raw' and attname = 'titleid';
UPDATE 1
cipafilter=#     UPDATE pg_statistic AS s
cipafilter-#         SET stadistinct = (select reltuples from pg_class
where relname = 'urls')
cipafilter-#     FROM pg_class c, pg_attribute a where c.oid =
s.starelid and c.oid = a.attrelid and a.attnum = s.staattnum and
cipafilter-#         relname = 'log_raw' and attname = 'urlid';
UPDATE 1
cipafilter=#     UPDATE pg_statistic AS s
cipafilter-#         SET stadistinct = (select reltuples from pg_class
where relname = 'hosts')
cipafilter-#     FROM pg_class c, pg_attribute a where c.oid =
s.starelid and c.oid = a.attrelid and a.attnum = s.staattnum and
cipafilter-#         relname = 'log_raw' and attname = 'hostid';
UPDATE 1

cipafilter=#  select attname, n_distinct from pg_stats where tablename
= 'log_raw' and (attname = 'urlid' or attname = 'titleid' or attname =
'hostid');
 attname | n_distinct
---------+-------------
 urlid   | 1.51452e+08
 hostid  |      303756
 titleid |      879485

cipafilter=# select titleid, count(titleid) from log_raw group by
titleid order by count(titleid) desc limit 10;
 titleid |   count
---------+-----------
       1 | 423848049
   49547 |    403432
     238 |    188151
      12 |    151640
       5 |    149524
    6196 |    139445
   32014 |    123828
     200 |     88682
   58921 |     86451
      10 |     84264
(10 rows)

cipafilter=# select urlid, count(urlid) from log_raw group by urlid
order by count(urlid) desc limit 10;
 urlid  |  count
--------+----------
 129991 | 10843088
      1 |  4953757
     21 |  4345503
   2765 |  4266981
     12 |  3654127
    920 |  3609054
   1135 |  3562185
     20 |  3495023
 283567 |  3019675
   2494 |  2655301
(10 rows)

cipafilter=# select hostid, count(hostid) from log_raw group by hostid
order by count(hostid) desc limit 10;
 hostid |   count
--------+-----------
      7 | 117342686
      5 |  18016481
     53 |  17408992
     57 |  12947564
    543 |  12698269
      1 |  10068246
 127544 |   8746204
     27 |   8618595
     40 |   8507278
     36 |   7424412
(10 rows)

Fun fact, hostid 7 is 'google.com'.



Re: [GENERAL] Bad planning data resulting in OOM killing of postgres

From: David Hinkle
Date:
Maybe also useful to know:

cipafilter=# select reltuples from pg_class where relname = 'log_raw';
  reltuples
-------------
 5.40531e+08
(1 row)



Re: [GENERAL] Bad planning data resulting in OOM killing of postgres

From: Tom Lane
Date:
David Hinkle <hinkle@cipafilter.com> writes:
> Tom, there are three columns in this table that exhibit the problem,
> here is the statistics data after an analyze, and the real data to
> compare it to.

>  attname | n_distinct |  most_common_freqs

>  titleid |        292 | {0.767167}

Ouch.  That's saying there's some single value of titleid that accounts
for more than three-quarters of the entries ... does that square with
reality?  That'd certainly explain why a hash join goes nuts.

            regards, tom lane


Re: [GENERAL] Bad planning data resulting in OOM killing of postgres

From: David Hinkle
Date:
Yep, 420-ish million records out of 540 million have a titleid of 1.
There are about 880,000 other unique values, but most of the records
are 1.  Of course, n_distinct is only 292.  I'm surprised it's not
eliminating the duplicates while it builds that hash table.
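
(Back-of-envelope: those ~424 million tuples all hash to the same key,
and a bucket holding a single value can't be split across batches.
Assuming something like 50 bytes per tuple in the in-memory hash table
— an assumption; the plan only shows width=10 — that's 424e6 × ~50
bytes ≈ 20 GB that would have to be resident at once, against a 10MB
work_mem.)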

This is what I'm doing as a workaround right now.  Getting n_distinct
right seems to be preventing the system from breaking.  It will be
executed once a week during the weekly maintenance, and it sets the
n_distinct of each column to the number of rows in the associated
table.

    CREATE OR REPLACE FUNCTION patch_ndistinct(_table varchar, _column varchar, _string_table varchar)
      RETURNS real AS
    $$
    DECLARE
        _cnt REAL;
    BEGIN
        -- row count of the side table, as estimated at its last ANALYZE/VACUUM
        SELECT reltuples INTO _cnt FROM pg_class WHERE relname = _string_table;
        -- per-column n_distinct override; only consulted by the next ANALYZE
        EXECUTE 'ALTER TABLE ' || _table || ' ALTER COLUMN ' || _column
                || ' SET (n_distinct=' || _cnt || ')';
        RETURN _cnt;
    END
    $$ LANGUAGE plpgsql;
    select patch_ndistinct('log_raw', 'titleid', 'titles');
    select patch_ndistinct('log_raw', 'urlid', 'urls');
    select patch_ndistinct('log_raw', 'hostid', 'hosts');
    ANALYZE log_raw;  -- required for the overrides to take effect



Re: [GENERAL] Bad planning data resulting in OOM killing of postgres

From: David Hinkle
Date:
I guess this doesn't work; the latest test run crashed.  It still uses
the bad plan for the hostid column even after n_distinct is updated.

cipafilter=# select attname, n_distinct from pg_stats where tablename
cipafilter-# = 'log_raw' and (attname = 'urlid' or attname = 'titleid'
or attname =
cipafilter(# 'hostid');
 attname | n_distinct
---------+-------------
 urlid   | 1.51625e+08
 hostid  |      304710
 titleid |      886499
(3 rows)

cipafilter=# explain DELETE FROM hosts WHERE NOT EXISTS ( SELECT
log_raw.hostid FROM log_raw WHERE log_raw.hostid = hosts.hostid );
                                       QUERY PLAN
-----------------------------------------------------------------------------------------
 Delete on hosts  (cost=22249475.67..74844813.47 rows=1 width=12)
   ->  Hash Anti Join  (cost=22249475.67..74844813.47 rows=1 width=12)
         Hash Cond: (hosts.hostid = log_raw.hostid)
         ->  Seq Scan on hosts  (cost=0.00..5017.10 rows=304710 width=10)
         ->  Hash  (cost=12799395.52..12799395.52 rows=543645052 width=10)
               ->  Seq Scan on log_raw  (cost=0.00..12799395.52 rows=543645052 width=10)
(6 rows)

I guess I will also try throwing in 'set enable_hashjoin = false;' and
see if that gets these purges to go.
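
Roughly what I have in mind (the setting is session-local, so it resets
on disconnect):

    SET enable_hashjoin = false;  -- steer the planner away from the hash anti-join
    DELETE FROM hosts WHERE NOT EXISTS
      ( SELECT 1 FROM log_raw WHERE log_raw.hostid = hosts.hostid );
    RESET enable_hashjoin;        -- restore the default planner behavior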



Re: [GENERAL] Bad planning data resulting in OOM killing of postgres

From: Jeff Janes
Date:
Another option would be to force the de-dup to happen, with:

EXPLAIN
WITH t AS (SELECT DISTINCT hostid FROM log_raw)
DELETE FROM hosts
 WHERE NOT EXISTS (SELECT 1 FROM t WHERE t.hostid = hosts.hostid);

That way you can use the hash join without running out of memory, in case the hash join is actually faster than the merge join; on 9.4 the CTE is an optimization fence, so the de-duplication is guaranteed to happen before the anti-join sees the rows.  Also, it just seems cleaner than fiddling with enable_* parameters and then having to remember to reset them when done.

Cheers,

Jeff