Thread: synchronize_seqscans' description is a bit misleading


synchronize_seqscans' description is a bit misleading

From: Gurjeet Singh
Date:
If I'm reading the code right [1], this GUC does not actually *synchronize* the scans, but instead just makes sure that a new scan starts from a block that was reported by some other backend performing a scan on the same relation.
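
Abridged, the relevant part of initscan() looks roughly like this (9.2-era sources; unrelated details such as the buffer-access-strategy setup are elided):

    /* initscan(), src/backend/access/heap/heapam.c (abridged) */

    /* Only bother with the shared position hint for relations big
     * enough that synchronized scanning can matter. */
    if (!RelationUsesLocalBuffers(scan->rs_rd) &&
        scan->rs_nblocks > NBuffers / 4)
        allow_sync = scan->rs_allow_sync;
    else
        allow_sync = false;

    if (allow_sync && synchronize_seqscans)
    {
        /* Start wherever some other scan of this relation last
         * reported being; see ss_get_location() in syncscan.c. */
        scan->rs_syncscan = true;
        scan->rs_startblock = ss_get_location(scan->rs_rd, scan->rs_nblocks);
    }
    else
    {
        /* Otherwise every scan starts at block 0. */
        scan->rs_syncscan = false;
        scan->rs_startblock = 0;
    }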

Since the backends scanning the relation may be processing it at different speeds, they may end up being out of sync with each other even though each one took the hint when starting its scan. Even in a single query, there may be different scan nodes scanning different parts of the same relation, and even they don't synchronize with each other (and for good reason).

Imagining that all scans on a table are always synchronized may make some wrongly believe that adding more backends scanning the same table will not incur any extra I/O; that is, that only one stream of blocks will be read from disk no matter how many backends you add to the mix. I noticed this when I was creating partition tables, each populated with a CREATE TABLE AS SELECT FROM original_table (to avoid WAL generation), and running more than 3 such transactions caused the disk read throughput to behave unpredictably, sometimes even dipping below 1 MB/s for a few seconds at a stretch.

Please note that I am not complaining about the implementation, which I think is the best we can do without making backends wait for each other. It's just that the documentation [2] implies that the scans are synchronized through the entire run, which is clearly not the case. So I'd like the docs to be improved to reflect that.

How about something like:

<doc>
synchronize_seqscans (boolean)
    This allows sequential scans of large tables to start from a point in the table that is already being read by another backend. This increases the probability that concurrent scans read the same block at about the same time and hence share the I/O workload. Note that, because backends may process the table at different speeds, they can eventually get out of sync and hence stop sharing the I/O workload.

    When this is enabled, ... The default is on.
</doc>

Best regards,

[1] src/backend/access/heap/heapam.c
[2] http://www.postgresql.org/docs/9.2/static/runtime-config-compatible.html#GUC-SYNCHRONIZE-SEQSCANS

--
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.

Re: [DOCS] synchronize_seqscans' description is a bit misleading

From: Tom Lane
Date:
Gurjeet Singh <gurjeet@singh.im> writes:
> If I'm reading the code right [1], this GUC does not actually *synchronize*
> the scans, but instead just makes sure that a new scan starts from a block
> that was reported by some other backend performing a scan on the same
> relation.

Well, that's the only *direct* effect, but ...

> Since the backends scanning the relation may be processing it at
> different speeds, they may end up being out of sync with each other
> even though each one took the hint when starting its scan.

The point you're missing is that the synchronization is self-enforcing:
whichever backend gets ahead of the others will be the one forced to
request (and wait for) the next physical I/O.  This will naturally slow
down the lower-CPU-cost-per-page scans.  The other ones tend to catch up
during the I/O operation.

The feature is not terribly useful unless I/O costs are high compared to
the CPU cost-per-page.  But when that is true, it's actually rather
robust.  Backends don't have to have exactly the same per-page
processing cost, because pages stay in shared buffers for a while after
the current scan leader reads them.
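
For reference, the position hint is re-reported as a scan advances, so it tracks whichever scan is currently in front; roughly, from heapgettup() (abridged):

    /* heapgettup(), src/backend/access/heap/heapam.c (abridged):
     * advance a forward scan to the next page, wrapping at the end. */
    page++;
    if (page >= scan->rs_nblocks)
        page = 0;                             /* circular scan: wrap to 0 */
    finished = (page == scan->rs_startblock); /* full circle = done */

    /* Report the new position, so that scans starting (or trailing)
     * behind us converge on the pages the front-runner is reading. */
    if (scan->rs_syncscan)
        ss_report_location(scan->rs_rd, page);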

> Imagining that all scans on a table are always synchronized may make some
> wrongly believe that adding more backends scanning the same table will not
> incur any extra I/O; that is, that only one stream of blocks will be read
> from disk no matter how many backends you add to the mix. I noticed this
> when I was creating partition tables, each populated with a CREATE TABLE
> AS SELECT FROM original_table (to avoid WAL generation), and running more
> than 3 such transactions caused the disk read throughput to behave
> unpredictably, sometimes even dipping below 1 MB/s for a few seconds at a
> stretch.

It's not really the scans that are causing that to be unpredictable; it's
the write I/O from the output side, which is forcing highly
nonsequential behavior (or at least I suspect so ... how many disk units
were involved in this test?)

            regards, tom lane


Re: [DOCS] synchronize_seqscans' description is a bit misleading

From: Gurjeet Singh
Date:
On Wed, Apr 10, 2013 at 11:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Gurjeet Singh <gurjeet@singh.im> writes:
>> If I'm reading the code right [1], this GUC does not actually *synchronize*
>> the scans, but instead just makes sure that a new scan starts from a block
>> that was reported by some other backend performing a scan on the same
>> relation.

> Well, that's the only *direct* effect, but ...

>> Since the backends scanning the relation may be processing it at
>> different speeds, they may end up being out of sync with each other
>> even though each one took the hint when starting its scan.

> The point you're missing is that the synchronization is self-enforcing:
> whichever backend gets ahead of the others will be the one forced to
> request (and wait for) the next physical I/O.  This will naturally slow
> down the lower-CPU-cost-per-page scans.  The other ones tend to catch up
> during the I/O operation.

Got it. So far, so good.

Let's consider a pathological case where a scan is performed by a user-controlled cursor, whose scan speed depends on how fast the user presses the "Next" button; this scan is quickly going to fall out of sync with other scans. Moreover, if a new scan happens to pick up the block reported by this slow scan, then that new scan may have to read blocks off the disk afresh.
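
And as far as I can tell from syncscan.c, the hint handed to a new scan is simply the last position anyone reported, with no check on how stale it is; roughly (abridged):

    /* ss_get_location(), src/backend/access/heap/syncscan.c (abridged) */
    BlockNumber
    ss_get_location(Relation rel, BlockNumber relnblocks)
    {
        BlockNumber startloc;

        /* Fetch the last position reported for this relation, if any. */
        LWLockAcquire(SyncScanLock, LW_EXCLUSIVE);
        startloc = ss_search(rel->rd_node, 0, false);
        LWLockRelease(SyncScanLock);

        /* The saved position may be invalid, e.g. if VACUUM truncated
         * the relation meanwhile; fall back to block 0.  Note there is
         * no test for how long ago the position was reported. */
        if (startloc >= relnblocks)
            startloc = 0;

        return startloc;
    }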

So, again, it is not guaranteed that all the scans on a relation will synchronize with each other. Hence my proposal to include the term 'probability' in the definition.


> The feature is not terribly useful unless I/O costs are high compared to
> the CPU cost-per-page.  But when that is true, it's actually rather
> robust.  Backends don't have to have exactly the same per-page
> processing cost, because pages stay in shared buffers for a while after
> the current scan leader reads them.

Agreed. Even if the buffer has been evicted from shared_buffers, there's a high likelihood that a scan following close on the heels of the others will fetch it from the FS cache.

>> Imagining that all scans on a table are always synchronized may make some
>> wrongly believe that adding more backends scanning the same table will not
>> incur any extra I/O; that is, that only one stream of blocks will be read
>> from disk no matter how many backends you add to the mix. I noticed this
>> when I was creating partition tables, each populated with a CREATE TABLE
>> AS SELECT FROM original_table (to avoid WAL generation), and running more
>> than 3 such transactions caused the disk read throughput to behave
>> unpredictably, sometimes even dipping below 1 MB/s for a few seconds at a
>> stretch.

> It's not really the scans that are causing that to be unpredictable; it's
> the write I/O from the output side, which is forcing highly
> nonsequential behavior (or at least I suspect so ... how many disk units
> were involved in this test?)

You may be right. I don't have access to the system anymore, and I don't remember the disk layout, but it's quite possible that write operations were causing the read throughput to drop. I did try to reproduce the behaviour on my laptop with up to 6 backends doing pure reads on a table several times the size of system RAM, but I could not get them to fall out of sync.

--
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.

Re: [DOCS] synchronize_seqscans' description is a bit misleading

From: Tom Lane
Date:
Gurjeet Singh <gurjeet@singh.im> writes:
> On Wed, Apr 10, 2013 at 11:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The point you're missing is that the synchronization is self-enforcing:

> Let's consider a pathological case where a scan is performed by a
> user-controlled cursor, whose scan speed depends on how fast the user
> presses the "Next" button; this scan is quickly going to fall out of
> sync with other scans. Moreover, if a new scan happens to pick up the
> block reported by this slow scan, then that new scan may have to read
> blocks off the disk afresh.

Sure --- if a backend stalls completely, it will fall out of the
synchronized group.  And that's a good thing; we'd surely not want to
block the other queries while waiting for a user who just went to lunch.

> So, again, it is not guaranteed that all the scans on a relation will
> synchronize with each other. Hence my proposal to include the term
> 'probability' in the definition.

Yeah, it's definitely not "guaranteed" in any sense.  But I don't really
think your proposed wording is an improvement.  The existing wording
isn't promising guaranteed sync either, to my eyes.

Perhaps we could compromise on, say, changing "so that concurrent scans
read the same block at about the same time" to "so that concurrent scans
tend to read the same block at about the same time", or something like
that.  I don't mind making it sound a bit more uncertain, but I don't
think that we need to emphasize the probability of failure.

            regards, tom lane


Re: [DOCS] synchronize_seqscans' description is a bit misleading

From: Gurjeet Singh
Date:
On Wed, Apr 10, 2013 at 11:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Gurjeet Singh <gurjeet@singh.im> writes:
>> So, again, it is not guaranteed that all the scans on a relation will
>> synchronize with each other. Hence my proposal to include the term
>> 'probability' in the definition.

> Yeah, it's definitely not "guaranteed" in any sense.  But I don't really
> think your proposed wording is an improvement.  The existing wording
> isn't promising guaranteed sync either, to my eyes.

Given Postgres' track record of delivering what it promises, I expect casual readers to take that phrase as a definitive guide to what is happening internally.
 

> Perhaps we could compromise on, say, changing "so that concurrent scans
> read the same block at about the same time" to "so that concurrent scans
> tend to read the same block at about the same time",

Given that, on first read, the word "about" did not deter me from assuming the best, I don't think adding "tend" would make much difference to a reader's (mis)understanding. Perhaps we can spare a few more words to make it clearer.
 
> or something like
> that.  I don't mind making it sound a bit more uncertain, but I don't
> think that we need to emphasize the probability of failure.

I agree we don't want to stress the failure case too much, especially since even when scans fall out of sync the performance is no worse than it would be without the feature. But we don't want the reader to get the wrong idea either.

In addition to the slight doc improvement being suggested, perhaps a wiki.postgresql.org entry would allow us to explain the behaviour in more detail.

--
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.