Re: Cost limited statements RFC

From: Greg Smith
Subject: Re: Cost limited statements RFC
Date:
Msg-id: 51B1FDB6.7080901@2ndQuadrant.com
In reply to: Re: Cost limited statements RFC  (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Cost limited statements RFC  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On 6/7/13 10:14 AM, Robert Haas wrote:
>> If the page hit limit goes away, the user with a single-core server who is
>> used to having autovacuum only pillage shared_buffers at 78MB/s might
>> complain if it became unbounded.
>
> Except that it shouldn't become unbounded, because of the ring-buffer
> stuff.  Vacuum can pillage the OS cache, but the degree to which a
> scan of a single relation can pillage shared_buffers should be sharply
> limited.

I wasn't talking about disruption of the data that's in the buffer 
cache.  The only time the scenario I was describing plays out is when 
the data is already in shared_buffers.  The concern is damage done to 
the CPU's data cache by this activity.  Right now you can't even reach 
100MB/s of damage to your CPU caches in an autovacuum process.  Ripping 
out the page hit cost will eliminate that cap.  Autovacuum could 
introduce gigabytes per second of memory -> L1 cache transfers.  That's 
what all my details about memory bandwidth were trying to put into 
context.  I don't think it really matters much, because the new 
bottleneck will be the processing speed of a single core, and that's 
still a decent cap for most people now.
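
To put a number on where that cap sits today, here's a back-of-envelope 
sketch, assuming the stock cost settings (vacuum_cost_limit = 200, a 
20ms cost delay, vacuum_cost_page_hit = 1) and 8KB pages:

# Where the current page hit ceiling comes from, assuming defaults:
# vacuum_cost_limit = 200, 20ms cost delay, vacuum_cost_page_hit = 1
cost_limit = 200                  # cost units allowed per delay interval
delay_s = 0.020                   # autovacuum cost delay, in seconds
page_hit_cost = 1
block_size = 8192                 # bytes per page

units_per_sec = cost_limit / delay_s           # 10,000 cost units/second
pages_per_sec = units_per_sec / page_hit_cost  # 10,000 buffer hits/second
print(pages_per_sec * block_size / 2**20)      # ~78MB/s, the figure quoted above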

> I think you're missing my point here, which is that we shouldn't
> have any such things as a "cost limit".  We should limit reads and
> writes *completely separately*.  IMHO, there should be a limit on
> reading, and a limit on dirtying data, and those two limits should not
> be tied to any common underlying "cost limit".  If they are, they will
> not actually enforce precisely the set limit, but some other composite
> limit which will just be weird.

I see the distinction you're making now; I don't need a mock-up to 
follow you.  The main challenge of moving this way is that read and write rates 
never end up being completely disconnected from one another.  A read 
will only cost some fraction of what a write does, but they shouldn't be 
completely independent.

Just because I'm comfortable doing 10MB/s of reads and 5MB/s of writes 
doesn't mean I'd be happy with the server doing 9MB/s of reads + 5MB/s 
of writes = 14MB/s of I/O in an implementation where they float 
independently.  It's certainly possible to disconnect the two like 
that, and people will be able to work something out anyway.  I 
personally would prefer not to lose the ability to specify how 
expensive read and write operations should be considered relative to 
one another.
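
To make that concrete with a toy model (the rates are made up for 
illustration, not proposed GUCs): with a single shared budget the two 
rates trade off against each other, while fully independent limits let 
both run flat out at the same time:

# Toy model: reads and writes drawing from one shared budget versus
# two independent limits.  Rates in MB/s are made up for illustration.
read_max, write_max = 10.0, 5.0

def write_budget_shared(read_rate):
    # Shared budget: read_rate/read_max + write_rate/write_max <= 1
    return max(0.0, (1 - read_rate / read_max) * write_max)

print(write_budget_shared(9.0))   # 0.5MB/s of writes left alongside 9MB/s of reads
# Independent limits would allow the full 9 + 5 = 14MB/s at that same moment.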

Related aside:  shared_buffers is becoming a decreasing fraction of 
total RAM with each release, because it's stuck at a rough 8GB limit 
right now.  As the OS cache becomes a larger multiple of the 
shared_buffers size, reads become more likely to hit the OS cache but 
miss shared_buffers, which makes the average cost of any one read 
shrink.  But writes are as expensive as ever.

Real-world tunings I'm doing now reflect that; on servers with >128GB 
of RAM, they've typically gone this far in that direction:

vacuum_cost_page_hit = 0
vacuum_cost_page_miss = 2
vacuum_cost_page_dirty = 20

That's 4MB/s of writes, 40MB/s of reads, or some blended mix that 
considers writes 10X as expensive as reads.  The blend is a feature.
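
The arithmetic behind those figures, assuming the stock 
vacuum_cost_limit = 200 and 20ms cost delay (10,000 cost units per 
second) and 8KB pages:

# How the settings above turn into throughput, assuming defaults of
# vacuum_cost_limit = 200 and a 20ms cost delay, with 8KB pages.
units_per_sec = 200 / 0.020       # 10,000 cost units/second
block_size = 8192

read_mb  = units_per_sec / 2  * block_size / 2**20   # page_miss = 2
write_mb = units_per_sec / 20 * block_size / 2**20   # page_dirty = 20
print(read_mb, write_mb)          # ~39MB/s of reads, ~3.9MB/s of writes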

The logic here is starting to remind me of how the random_page_cost 
default has been justified.  Real-world random reads are actually close 
to 50X as expensive as sequential ones.  But the average read from the 
executor's perspective is effectively discounted by OS cache hits, so 
4.0 is still working OK.  In large memory servers, random reads keep 
getting cheaper via better OS cache hit odds, and it's increasingly 
becoming something important to tune for.
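
As a rough illustration of that discounting (my assumed numbers, not 
measurements): if an uncached random read really costs ~50X a 
sequential one, and a read satisfied by the OS cache costs about the 
same as a sequential one, a ~94% OS cache hit rate is what lands you 
near the 4.0 default:

# Effective random read cost as a blend of cached and uncached reads.
# The 50X figure is from above; the 94% hit rate is an assumed example.
raw_random_cost = 50.0            # uncached random read vs. sequential
cached_cost = 1.0                 # OS-cached read, roughly sequential speed

def effective_cost(hit_rate):
    return hit_rate * cached_cost + (1 - hit_rate) * raw_random_cost

print(effective_cost(0.94))       # ~3.9, close to the 4.0 default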

Some of this mess would go away if we could crack the shared_buffers 
scaling issues for 9.4.  There's finally enough dedicated hardware 
around to see the issue and work on it, but I haven't gotten a clear 
picture of any reproducible test workload that gets slower with large 
buffer cache sizes.  If anyone has a public test case that gets slower 
when shared_buffers goes from 8GB to 16GB, please let me know; I've got 
two systems set up that I could chase that down on now.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


