Hi,
In reference to the seq scans roadmap, I have just submitted a patch
that addresses some of the concerns.
The patch does this:
1. for small relations (smaller than 60% of the buffer pool), use the
   current logic
2. for big relations (a standalone sketch of this policy follows the
   file list below):
   - use a ring buffer in the heap scan
   - pin the first 12 pages when the scan starts
   - on consumption of every 4 pages, read and pin the next 4 pages
   - invalidate used pages of the scan so they do not force out
     other useful pages
4 files changed:
bufmgr.c, bufmgr.h, heapam.c, relscan.h
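To make the ring logic concrete, here is a minimal standalone sketch
of the policy in item 2 above. All names (ScanRing, RING_INIT_PAGES,
READAHEAD_CHUNK, ring_next_page) are hypothetical, and the printf
calls stand in for the actual buffer manager reads, pins, and unpins;
this is not code from the patch itself.

/* Standalone model of the ring-buffer readahead policy. */
#include <stdio.h>
#include <stdbool.h>

#define RING_INIT_PAGES  12   /* pages pinned when the scan starts */
#define READAHEAD_CHUNK   4   /* pages read+pinned per refill */

typedef struct ScanRing
{
    int next_to_read;   /* next block number to issue a read for */
    int next_to_use;    /* next block number the scan will consume */
    int nblocks;        /* total blocks in the relation */
    int consumed;       /* pages consumed since the last refill */
} ScanRing;

/* Issue reads for (and pin) up to n pages; stop at end of relation. */
static void ring_readahead(ScanRing *ring, int n)
{
    while (n-- > 0 && ring->next_to_read < ring->nblocks)
        printf("read+pin block %d\n", ring->next_to_read++);
}

static void ring_init(ScanRing *ring, int nblocks)
{
    ring->next_to_read = 0;
    ring->next_to_use = 0;
    ring->nblocks = nblocks;
    ring->consumed = 0;
    ring_readahead(ring, RING_INIT_PAGES);  /* pin first 12 pages up front */
}

/* Consume one page; refill the ring every READAHEAD_CHUNK pages. */
static bool ring_next_page(ScanRing *ring)
{
    if (ring->next_to_use >= ring->nblocks)
        return false;
    printf("scan block %d, then unpin/invalidate it\n", ring->next_to_use++);
    if (++ring->consumed == READAHEAD_CHUNK)
    {
        ring->consumed = 0;
        ring_readahead(ring, READAHEAD_CHUNK);
    }
    return true;
}

int main(void)
{
    ScanRing ring;
    ring_init(&ring, 20);   /* e.g. a 20-block relation */
    while (ring_next_page(&ring))
        ;
    return 0;
}

The point of the chunked refill is that the reads issued to the OS
stay contiguous and larger than a single PG page, while the
unpin-after-use step keeps the scan's footprint in the buffer cache
bounded.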
If there is interest, I can submit another scan patch that returns
N tuples at a time, instead of the current one-at-a-time interface
(sketched below). This improves code locality and further improves
performance by another 10-20%.
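For the curious, a rough interface sketch of what such a batched
variant could look like; every name here (TupleBatch,
heap_getnext_batch, SCAN_BATCH_SIZE, the opaque typedefs) is an
illustrative placeholder, not the patch's actual API:

/* Illustrative batched-scan interface (header-style fragment). */
typedef struct HeapScanDescData *HeapScanDesc;  /* opaque scan state */
typedef struct HeapTupleData *HeapTuple;        /* opaque tuple handle */

#define SCAN_BATCH_SIZE 64      /* max tuples returned per call */

typedef struct TupleBatch
{
    int       ntuples;                  /* tuples filled in this call */
    HeapTuple tuples[SCAN_BATCH_SIZE];  /* filled by the scan */
} TupleBatch;

/*
 * Fill up to SCAN_BATCH_SIZE tuples per call instead of returning one
 * tuple at a time; returns the number of tuples filled, 0 at end of
 * scan. Amortizing the per-call overhead over N tuples is what buys
 * the code-locality win described above.
 */
extern int heap_getnext_batch(HeapScanDesc scan, TupleBatch *batch);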
For the TPC-H 1G tables, we are seeing more than 20% improvement in
scans on the same hardware.
-------------------------------------------------------------------------
----- PATCHED VERSION
-------------------------------------------------------------------------
gptest=# select count(*) from lineitem;
  count
---------
6001215
(1 row)
Time: 2117.025 ms
-------------------------------------------------------------------------
----- ORIGINAL CVS HEAD VERSION
-------------------------------------------------------------------------
gptest=# select count(*) from lineitem;
  count
---------
6001215
(1 row)
Time: 2722.441 ms
Suggestions for improvement are welcome.
Regards,
-cktan
Greenplum, Inc.
On May 8, 2007, at 5:57 AM, Heikki Linnakangas wrote:
> Luke Lonergan wrote:
>>> What do you mean with using readahead inside the heapscan?
>>> Starting an async read request?
>> Nope - just reading N buffers ahead for seqscans. Subsequent
>> calls use
>> previously read pages. The objective is to issue contiguous reads to
>> the OS in sizes greater than the PG page size (which is much smaller
>> than what is needed for fast sequential I/O).
>
> Are you filling multiple buffers in the buffer cache with a single
> read-call? The OS should be doing readahead for us anyway, so I
> don't see how just issuing multiple ReadBuffers one after each
> other helps.
>
>> Yes, I think the ring buffer strategy should be used when the
>> table size
>> is > 1 x bufcache and the ring buffer should be of a fixed size
>> smaller
>> than L2 cache (32KB - 128KB seems to work well).
>
> I think we want to let the ring grow larger than that for updating
> transactions and vacuums, though, to avoid the WAL flush problem.
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com