Re: Sequential scans

Поиск
Список
Период
Сортировка
От Heikki Linnakangas
Тема Re: Sequential scans
Дата
Msg-id 4638ED62.4090308@enterprisedb.com
обсуждение исходный текст
Ответ на Re: Sequential scans  (Jeff Davis <pgsql@j-davis.com>)
Ответы Re: Sequential scans  (Jeff Davis <pgsql@j-davis.com>)
Список pgsql-hackers
Jeff Davis wrote:
> On Wed, 2007-05-02 at 14:26 +0100, Heikki Linnakangas wrote:
>> Let's use a normal hash table instead, and use a lock to protect it. If 
>> we only update it every 10 pages or so, the overhead should be 
>> negligible. To further reduce contention, we could modify ReadBuffer to 
>> let the caller know if the read resulted in a physical read or not, and 
>> only update the entry when a page is physically read in. That way all 
>> the synchronized scanners wouldn't be updating the same value, just the 
>> one performing the I/O. And while we're at it, let's use the full 
>> relfilenode instead of just the table oid in the hash.
> 
> What should be the maximum size of this hash table? 

Good question. And also, how do you remove entries from it?

I guess the size should somehow be related to number of backends. Each 
backend will realistically be doing just 1 or max 2 seq scan at a time.  It also depends on the number of large tables
inthe databases, but we 
 
don't have that information easily available. How about using just 
NBackends? That should be plenty, but wasting a few hundred bytes of 
memory won't hurt anyone.

I think you're going to need an LRU list and counter of used entries in 
addition to the hash table, and when all entries are in use, remove the 
least recently used one.

The thing to keep an eye on is that it doesn't add too much overhead or 
lock contention in the typical case when there's no concurrent scans.

For the locking, use a LWLock.

> Is there already-
> existing hash table code that I should use to be consistent with the
> rest of the code?

Yes, see utils/hash/dynahash.c, and ShmemInitHash (in 
storage/ipc/shmem.c) since it's in shared memory. There's plenty of 
examples that use hash tables, see for example 
storage/freespace/freespace.c.

> I'm still trying to understand the effect of using the full relfilenode.
> Do you mean using the entire relation _segment_ as the key? That doesn't
> make sense to me. Or do you just mean using the relfilenode (without the
> segment) as the key?

No, not the segment. RelFileNode consists of tablespace oid, database 
oid and relation oid. You can find it in scan->rs_rd->rd_node. The 
segmentation works at a lower level.

> Linux with CFQ I/O scheduler performs very poorly and inconsistently
> with concurrent sequential scans regardless of whether the scans are
> synchronized or not. I suspect the reason for this is that CFQ is
> designed to care more about the process issuing the request than any
> other factor.
> 
> Every other I/O system performed either ideally (no interference between
> scans) or had some interference but still much better than current
> behavior.

Hmm. Should we care then? CFG is the default on Linux, and an average 
sysadmin is unlikely to change it.

What we could do quite easily is:
- when ReadBuffer is called, let the caller know if the read did 
physical I/O.
- when the previous ReadBuffer didn't result in physical I/O, assume 
that we're not the pack leader. If the next buffer isn't already in 
cache, wait a few milliseconds before initiating the read, giving the 
pack leader a chance to do it instead.

Needs testing, of course..

>> 4. It fails regression tests. You get an assertion failure on the portal 
>> test. I believe that changing the direction of a scan isn't handled 
>> properly; it's probably pretty easy to fix.
>>
> 
> I will examine the code more carefully. As a first guess, is it possible
> that test is failing because of the non-deterministic order in which
> tuples are returned?

No, it's an assertion failure, not just different output than expected. 
But it's probably quite simple to fix..

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


В списке pgsql-hackers по дате отправления:

Предыдущее
От: "JEAN-PIERRE PELLETIER"
Дата:
Сообщение: Re: temporal variants of generate_series()
Следующее
От: "Andrew Dunstan"
Дата:
Сообщение: Re: Feature freeze progress report