Re: Automatically setting work_mem

From: Simon Riggs
Subject: Re: Automatically setting work_mem
Date:
Msg-id: 1142971550.24487.499.camel@localhost.localdomain
In response to: Re: Automatically setting work_mem  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Automatically setting work_mem  (Martijn van Oosterhout <kleptog@svana.org>)
List: pgsql-hackers
On Fri, 2006-03-17 at 09:46 -0500, Tom Lane wrote:
> "Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:
> > So what's the difference between these two strategy?
> > (1) Running time: do they use the same amount of memory? Why option 2 is
> > better than 1?
> > (2) Idle time: after sort done, option 1 will return all 1024 to the OS and
> > 2 will still keep 512?
> 
> Point 2 is actually a serious flaw in Simon's proposal, because there
> is no portable way to make malloc return freed memory to the OS.  Some
> mallocs will do that ... in some cases ... but many simply don't ever
> move the brk address down.  It's not an easy thing to do when the arena
> gets cluttered with a lot of different alloc chunks and only some of
> them get freed.

I'm aware of that objection and agree it's an issue...

One of the situations I am looking at is larger queries with multiple
sorts in them. I'm getting reasonable results for the final merge even
after releasing most of the memory (say 50-90%). The memtuples array is
not required at all for randomAccess sorts (a 100% reduction).

The largest memory requirement is run building during performsort. That
portion of the code is not executed concurrently within the same query.
If we can reduce memory usage once that phase completes, we stand a
chance of avoiding the situation where a big query over-allocates memory
that we are then unable to reclaim.
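
To make that concrete, here is a minimal standalone sketch (not the
actual tuplesort.c changes; the 64MB figure and the shrink-to-10% ratio
are just placeholders) of grabbing the full workspace for run building
and then shrinking it once "performsort done" is reached. It also shows
the caveat Tom raised: realloc()/free() may not hand pages back to the
OS, although on glibc you can at least ask with malloc_trim().

#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#ifdef __GLIBC__
#include <malloc.h>             /* malloc_trim() is glibc-specific */
#endif

/*
 * Hypothetical illustration: take work_mem worth of space for run
 * building, then shrink it to ~10% for the final merge phase.
 */
int
main(void)
{
    size_t  work_mem = 64UL * 1024 * 1024;  /* pretend work_mem = 64MB */
    char   *space = malloc(work_mem);

    if (space == NULL)
        return 1;
    memset(space, 0, work_mem);             /* ... run building uses it all ... */

    /*
     * "performsort done": the final merge only needs a fraction of the
     * space, so give the rest back.  Whether the allocator actually
     * returns the pages to the OS is platform-dependent.
     */
    space = realloc(space, work_mem / 10);
    if (space == NULL)
        return 1;

#ifdef __GLIBC__
    malloc_trim(0);                         /* ask glibc to release free pages */
#endif

    printf("kept %zu bytes for final merge\n", work_mem / 10);
    free(space);
    return 0;
}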

So overall memory usage could be as low as work_mem + (numsorts * 0.1 *
work_mem), which is a lot less than numsorts * work_mem (e.g. 130% of
work_mem rather than 300% of work_mem).

> So the semantics we'd have to adopt is that once a backend claims some
> "shared work mem", it keeps it until process exit.  I don't think that
> makes the idea worthless, because there's usually a clear distinction
> between processes doing expensive stuff and processes doing cheap
> stuff.  But it's definitely a limitation.  

...Hopefully it will be less of a limitation with the memory reduction
changes.

The other way is of course to allocate *all* sort/hash/other large
workspace out of shared memory and then let the backends fight it out
(somehow...) to see who gets access to it. That way backends stay small
and we get well-bounded memory usage.

> Also, if you've got a process
> doing expensive stuff, it's certainly possible to expect the user to
> just increase work_mem locally.

Doing that is fine, but you have to retune the system every time the
memory usage changes for any reason. So most of the time manual tuning
has to be very conservative to avoid getting it wrong.

> My own thoughts about the problems with our work_mem arrangement are
> that the real problem is the rule that we can allocate work_mem per sort
> or hash operation; this makes the actual total memory use per backend
> pretty unpredictable for nontrivial queries.  I don't know how to fix
> this though.  The planner needs to know the work_mem that will be used
> for any one of these operations in order to estimate costs, so simply
> trying to divide up work_mem among the operations of a completed plan
> tree is not going to improve matters.

We spent about 5 hours discussing that, and the best answer was "use
queuing"...

= = = = 

Anyway, thinking just about sort, I've got the following concrete
suggestions (the first two of which are coded and tested):

1. I originally picked MERGE_BUFFER_SIZE at 32 blocks as a guess. Better
test results show that is indeed the optimum when we take into account
both intermediate and final merging. However, the preferred buffer size
would be about 256 blocks in the case where Nruns << Ntapes, i.e. when
work_mem is set high. In that case memory is reallocated to reduce the
overall usage after "performsort done" has happened.

2. When a tape runs out of tuples, its memory is reallocated to the
remaining tapes to increase their I/O efficiency. This should help
performance for smaller work_mem settings with large sorts, or anything
where the final merge Nruns is close to Ntapes. You can see this
occurring in the example below. The reallocation is done either
uniformly or all onto a single tape, depending upon the preread pattern.
This mostly doesn't occur with well-sorted output, so there is little
overhead from doing this in the general case. (A rough sketch of the
redistribution follows at the end of this list.)

Right now, large sort performance is very good, whereas smaller sort
performance is still fairly bad.

3. We implement the new merge algorithm as Tom suggested...
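
For what it's worth, here is a rough standalone sketch of the
bookkeeping behind point 2 (not the real patch; NTAPES, the helper name
and the 32-block starting size are only illustrative): an exhausted
tape's preread blocks are handed either uniformly to the still-active
tapes or entirely to a single favoured tape, depending on the preread
pattern.

#include <stdio.h>

#define NTAPES 8
#define BLCKSZ 8192

/* each input tape owns some number of preread buffer blocks */
static int  tape_blocks[NTAPES];
static int  tape_active[NTAPES];

/*
 * Tape t has run out of tuples: give its blocks away.  If favoured_tape
 * is a live tape, it gets everything; otherwise spread uniformly.
 */
static void
redistribute_exhausted_tape(int t, int favoured_tape)
{
    int     freed = tape_blocks[t];
    int     nactive = 0;

    tape_active[t] = 0;
    tape_blocks[t] = 0;

    for (int i = 0; i < NTAPES; i++)
        if (tape_active[i])
            nactive++;
    if (nactive == 0)
        return;

    if (favoured_tape >= 0 && tape_active[favoured_tape])
    {
        /* preread pattern says one tape will need it all */
        tape_blocks[favoured_tape] += freed;
    }
    else
    {
        /* otherwise spread uniformly; remainder goes to first active tape */
        int     share = freed / nactive;
        int     rest = freed - share * nactive;

        for (int i = 0; i < NTAPES; i++)
            if (tape_active[i])
                tape_blocks[i] += share;
        for (int i = 0; i < NTAPES; i++)
            if (tape_active[i])
            {
                tape_blocks[i] += rest;
                break;
            }
    }
}

int
main(void)
{
    for (int i = 0; i < NTAPES; i++)
    {
        tape_blocks[i] = 32;            /* e.g. MERGE_BUFFER_SIZE = 32 blocks */
        tape_active[i] = 1;
    }

    redistribute_exhausted_tape(3, -1); /* tape 3 exhausted, spread uniformly */
    redistribute_exhausted_tape(5, 0);  /* tape 5 exhausted, all to tape 0 */

    for (int i = 0; i < NTAPES; i++)
        printf("tape %d: %d blocks (%d KB)\n",
               i, tape_blocks[i], tape_blocks[i] * BLCKSZ / 1024);
    return 0;
}

Either way the total number of preread blocks stays constant, so the
merge never uses more memory than the sort was already given.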

Best Regards, Simon Riggs



