Discussion: contrib/cache_scan (Re: What's needed for cache-only table scan?)


contrib/cache_scan (Re: What's needed for cache-only table scan?)

From: Kohei KaiGai
Hello,

The attached patch is what we discussed just before the November commit-fest.

It implements an alternative way to scan a particular table using an in-memory
cache instead of the usual heap access method. Unlike the buffer cache, this
mechanism caches only a limited number of columns in memory, so memory
consumption per tuple is much smaller than with the regular heap access method,
which allows a much larger number of tuples to be kept in memory.

I'd like to extend this idea into a feature that caches data in a
column-oriented data structure, to utilize parallel processors such as
CPU SIMD operations or simple GPU cores. (It probably makes sense to
evaluate multiple records with a single vector instruction if the contents of
a particular column are stored as a large array.)
However, this patch still keeps all the tuples in row-oriented format,
because row <=> column translation would make this patch bigger than its
current form (about 2K lines), and GPU integration needs to link a proprietary
library (CUDA or OpenCL), which I thought is not preferable for the upstream
code.

Also note that this patch needs the part-1 ~ part-3 patches of the CustomScan
APIs as prerequisites, because it is implemented on top of those APIs.

One thing I have to apologize for is the lack of documentation and source code
comments around the contrib/ code. Please give me a couple of days to
clean up the code.
Aside from the extension code, I put two enhancements on the core code,
as follows. I'd like to have a discussion about the adequacy of these enhancements.

The first enhancement is a hook on heap_page_prune() to synchronize the
internal state of an extension with changes to the heap image on disk.
The cache unavoidably accumulates garbage over time, so it needs to be
cleaned up just as the vacuum process does for heap pages. The best timing
to do so is when dead tuples are reclaimed, because at that point it is
certain that nobody will reference those tuples any more.

diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
    bool        marked[MaxHeapTuplesPerPage + 1];
 } PruneState;

+/* Callback for each page pruning */
+heap_page_prune_hook_type heap_page_prune_hook = NULL;
+
 /* Local functions */
 static int heap_prune_chain(Relation relation, Buffer buffer,
                 OffsetNumber rootoffnum,
@@ -294,6 +297,16 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
     * and update FSM with the remaining space.
     */

+   /*
+    * This callback allows extensions to synchronize their own state with
+    * the heap image on disk when this buffer page is vacuumed.
+    */
+   if (heap_page_prune_hook)
+       (*heap_page_prune_hook)(relation,
+                               buffer,
+                               ndeleted,
+                               OldestXmin,
+                               prstate.latestRemovedXid);
    return ndeleted;
 }
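
For reference, here is a minimal sketch of how an extension might use the
proposed hook. The typedef is inferred from the call site in the diff above;
ccache_on_page_prune, ccache_invalidate_page, and the registration details are
hypothetical, not part of the patch.

/* Sketch only; assumes the hook signature implied by the call site above. */
typedef void (*heap_page_prune_hook_type)(Relation relation,
                                          Buffer buffer,
                                          int ndeleted,
                                          TransactionId OldestXmin,
                                          TransactionId latestRemovedXid);
extern heap_page_prune_hook_type heap_page_prune_hook;

static heap_page_prune_hook_type prev_prune_hook = NULL;

static void
ccache_on_page_prune(Relation relation, Buffer buffer, int ndeleted,
                     TransactionId OldestXmin, TransactionId latestRemovedXid)
{
    /* Drop cached entries that came from the pruned heap page. */
    ccache_invalidate_page(RelationGetRelid(relation),    /* hypothetical */
                           BufferGetBlockNumber(buffer));

    if (prev_prune_hook)
        (*prev_prune_hook)(relation, buffer, ndeleted,
                           OldestXmin, latestRemovedXid);
}

void
_PG_init(void)
{
    prev_prune_hook = heap_page_prune_hook;
    heap_page_prune_hook = ccache_on_page_prune;
}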


The second enhancement makes SetHintBits() accept InvalidBuffer and skip
all its work. We need to check the visibility of cached tuples when the
custom-scan node scans the cached table instead of the heap.
Even though we can use an MVCC snapshot to check tuple visibility,
HeapTupleSatisfiesVisibility() may internally set hint bits on tuples, so
we always need to give it a valid buffer pointer. Unfortunately, having to
load the heap buffer associated with a cached tuple would kill all the
benefit of the table cache.
So, I'd like to have special-case handling in SetHintBits() that makes it a
no-op when InvalidBuffer is given.

diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index f626755..023f78e 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -103,11 +103,18 @@ static bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
  *
  * The caller should pass xid as the XID of the transaction to check, or
  * InvalidTransactionId if no check is needed.
+ *
+ * If the supplied HeapTuple is not associated with a particular buffer,
+ * this function just returns without doing anything. That can happen when
+ * an extension caches tuples in its own way.
  */
 static inline void
 SetHintBits(HeapTupleHeader tuple, Buffer buffer,
            uint16 infomask, TransactionId xid)
 {
+   if (BufferIsInvalid(buffer))
+       return;
+
    if (TransactionIdIsValid(xid))
    {
        /* NB: xid must be known committed here! */
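
For illustration, a hedged sketch (not part of the patch) of how a cache-scan
node could then test visibility of a cached tuple, assuming the 9.4-era
HeapTupleSatisfiesVisibility(tuple, snapshot, buffer) interface; the function
name below is hypothetical:

static bool
cached_tuple_is_visible(HeapTuple tuple, Snapshot snapshot)
{
    /*
     * With the proposed change, passing InvalidBuffer turns any hint-bit
     * update inside the visibility check into a no-op, so no heap buffer
     * has to be loaded for a tuple that lives only in the cache.
     */
    return HeapTupleSatisfiesVisibility(tuple, snapshot, InvalidBuffer);
}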

Thanks,

2013/11/13 Kohei KaiGai <kaigai@kaigai.gr.jp>:
> 2013/11/12 Tom Lane <tgl@sss.pgh.pa.us>:
>> Kohei KaiGai <kaigai@kaigai.gr.jp> writes:
>>> So, are you thinking it is a feasible approach to focus on custom-scan
>>> APIs during the upcoming CF3, then table-caching feature as use-case
>>> of this APIs on CF4?
>>
>> Sure.  If you work on this extension after CF3, and it reveals that the
>> custom scan stuff needs some adjustments, there would be time to do that
>> in CF4.  The policy about what can be submitted in CF4 is that we don't
>> want new major features that no one has seen before, not that you can't
>> make fixes to previously submitted stuff.  Something like a new hook
>> in vacuum wouldn't be a "major feature", anyway.
>>
> Thanks for this clarification.
> Three days are too short to write a patch; however, two months may be
> sufficient to develop a feature on top of the scheme discussed in the
> previous commitfest.
>
> Best regards,
> --
> KaiGai Kohei <kaigai@kaigai.gr.jp>


--
KaiGai Kohei <kaigai@kaigai.gr.jp>

Attachments

Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)

From: KaiGai Kohei
Hello,

I revisited the patch for the contrib/cache_scan extension.
The previous one had a problem that crashed when a T-tree node had to be
rebalanced and its nodes were merged.

Even though the contrib/cache_scan portion has more than 2K lines of code,
what I'd like to discuss first is the portion of core enhancements needed
to run MVCC snapshot checks on cached tuples, and to get a callback on
vacuumed pages for cache synchronization.

Any comments are welcome.

Thanks,


--
OSS Promotion Center / The PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

Attachments

Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)

From: Kohei KaiGai
Hello,

Because of time pressure in the January commit-fest, I tried to simplify the
patch for cache-only scan into three portions: (1) add a hook on
heap_page_prune() for cache invalidation when a particular page is vacuumed,
(2) add a check to accept InvalidBuffer in SetHintBits(), and (3) a
proof-of-concept module for cache-only scan.

(1) pgsql-v9.4-heap_page_prune_hook.v1.patch
Once the on-memory columnar cache is constructed, it needs to be invalidated
when the heap pages backing the cache are modified. In usual DML cases, an
extension can get control using row-level trigger functions for invalidation;
however, we currently have no way to get control when a page is vacuumed,
which is usually handled by the autovacuum process.
This patch adds a callback to heap_page_prune(), to allow extensions to prune
dead entries in their caches, not only in heap pages.
I'd also like to hear about any other scenario in which we would need to
invalidate columnar cache entries, if one exists. It seems to me
object_access_hook makes sense to cover the DDL and VACUUM FULL scenarios...
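
As an illustration of that last idea, here is a sketch (not from the patch) of
covering DDL via object_access_hook; ccache_drop_by_relid is hypothetical:

#include "catalog/objectaccess.h"
#include "catalog/pg_class.h"

static object_access_hook_type prev_object_access = NULL;

static void
ccache_object_access(ObjectAccessType access, Oid classId,
                     Oid objectId, int subId, void *arg)
{
    if (prev_object_access)
        (*prev_object_access)(access, classId, objectId, subId, arg);

    /* Drop the whole cache of a relation when it is dropped or altered. */
    if (classId == RelationRelationId &&
        (access == OAT_DROP || access == OAT_POST_ALTER))
        ccache_drop_by_relid(objectId);     /* hypothetical */
}

void
_PG_init(void)
{
    prev_object_access = object_access_hook;
    object_access_hook = ccache_object_access;
}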


(2) pgsql-v9.4-HeapTupleSatisfies-accepts-InvalidBuffer.v1.patch
In case we want to check the visibility of tuples on cache entries (which
have no particular shared buffer associated) using
HeapTupleSatisfiesVisibility, it internally tries to update the hint bits of
tuples. However, that does not make sense for tuples not associated with a
particular shared buffer.
By definition, tuple entries in the cache are not connected with a particular
shared buffer. If we had to load the whole buffer page just to set hint bits,
it would be nonsense, because the purpose of the in-memory cache is to reduce
disk accesses.
This patch adds an exceptional condition to SetHintBits() to skip everything
if the given buffer is InvalidBuffer. It allows tuple visibility to be checked
using the regular visibility-check functions, without extensions re-inventing
the wheel themselves.


(3) pgsql-v9.4-contrib-cache-scan.v1.patch
Unlike (1) and (2), this patch is just a proof of concept that implements
cache-only scan on top of the custom-scan interface.
It offers an alternative scan path on tables that have row-level triggers for
cache invalidation, if the total width of the referenced columns is less than
30% of the total width of the table definition. Thus, it can keep a larger
number of records, containing the meaningful portion, in main memory.
This cache is invalidated according to updates of the main heap: first by
row-level triggers, second by object_access_hook on DDL, and third by the
heap_page_prune hook. Once a column-reduced tuple gets cached, it is copied
into the cache memory from the shared buffer, so the visibility-check
functions need the feature to ignore InvalidBuffer.

Please volunteer to review the patches, especially (1) and (2), which are a
very small portion.

Thanks,


--
KaiGai Kohei <kaigai@kaigai.gr.jp>

Attachments

Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)

From: Haribabu Kommi
I reviewed all three patches. The first two patches, for core PostgreSQL, are fine.
And I have comments on the third patch, related to cache scan.

1. +# contrib/dbcache/Makefile

   The Makefile header comment does not match the file's actual location.

2.+   /*
+   * Estimation of average width of cached tuples - it does not make
+   * sense to construct a new cache if its average width is more than
+   * 30% of the raw data.
+   */

   Move the estimation of the average width of cached tuples into the case
   where the cache is not found; otherwise it is an overhead for the
   cache-hit scenario.

3. + if (old_cache)
+ attrs_used = bms_union(attrs_used, &old_cache->attrs_used);

   Don't we need a check that the average width is not more than 30%? During
   estimation it doesn't include the existing other attributes.

4. + lchunk->right = cchunk;
+ lchunk->l_depth = TTREE_DEPTH(lchunk->right);

   I think lchunk->r_depth needs to be set in a clockwise rotation.

5. Can you add some comments in the code on how a block is used?

6. In the do_insert_tuple function I felt that moving the tuples and rearranging
   their addresses is a little costly. How about the following way?

   Always insert the tuple from the bottom of the block, where the empty space
   starts, and store the corresponding reference pointers at the start of the
   block in an array. As new tuples are inserted, this array grows from the
   block start and the tuples from the block end. We just need to sort this
   array based on item pointers; there is no need to update the reference
   pointers.

   In this case movement is required only when a tuple is moved from one block
   to another, and also whenever continuous free space is not available to
   insert the new tuple. You can decide based on how frequently the sorting
   will happen in general.

7. In the ccache_find_tuple function, this Assert(i_min + 1 < cchunk->ntups); can
   go wrong when only one tuple is present in the block, with an item pointer
   equal to the one we are searching for in the forward scan direction.

8. I am not able to find a protection mechanism for insert/delete etc. of a tuple in the T-tree. As this is shared memory, it can cause problems.

9. + /* merge chunks if this chunk has enough space to merge */
+ ccache_merge_chunk(ccache, cchunk);

   Calling merge chunks on every heap-page-prune callback is an overhead for
   vacuum. The merge may then again lead to node splits because of new data.

10. "columner" is present in some places of the patch. correct it.

11. In the cache_scan_next function, in case the cache insert fails because of shared memory, the tuple pointer is not reset and cache is NULL.
    Because of this, the next record fetch leads to an assertion failure on cache != NULL.

12. + if (ccache->status != CCACHE_STATUS_IN_PROGRESS)
      + cs_put_ccache(ccache);

    The cache is created with refcnt 2, and sometimes put cache is called twice
    to eliminate it, and sometimes a different approach is used.
    It is a little confusing; can you explain in comments why 2 is required and
    how it is maintained?

13. A performance report is required to see how much impact the cache synchronizer causes on insert/delete/update operations.

14. The GUC variable "cache_scan_disabled" is missing from the docs description.

Please let me know if you need any support.
 
Regards,
Hari Babu
Fujitsu Australia

Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)

From: Kohei KaiGai
2014-02-12 14:59 GMT+09:00 Haribabu Kommi <kommi.haribabu@gmail.com>:
> I reviewed all three patches. The first two patches, for core PostgreSQL,
> are fine.
> And I have comments on the third patch, related to cache scan.
>
Thanks for volunteering.

> 1. +# contrib/dbcache/Makefile
>
>    The Makefile header comment does not match the file's actual location.
>
Ahh, it's an old name from when I started implementing.

> 2.+   /*
> +   * Estimation of average width of cached tuples - it does not make
> +   * sense to construct a new cache if its average width is more than
> +   * 30% of the raw data.
> +   */
>
>    Move the estimation of the average width of cached tuples into the case
> where the cache is not found; otherwise it is an overhead for the cache-hit
> scenario.
>
You are right. If and when an existing cache is found and matches, its width
is obviously less than 30% of the total width.

> 3. + if (old_cache)
> + attrs_used = bms_union(attrs_used, &old_cache->attrs_used);
>
>    Don't we need a check that the average width is not more than 30%?
> During estimation it doesn't include the existing other attributes.
>
Indeed. It should drop some attributes from the existing cache if the total
average width grows beyond the threshold. Probably, we need a statistical
variable to track how many times or how recently each column is referenced.

> 4. + lchunk->right = cchunk;
> + lchunk->l_depth = TTREE_DEPTH(lchunk->right);
>
>   I think lchunk->r_depth needs to be set in a clockwise rotation.
>
Oops, nice catch.

> 5. Can you add some comments in the code on how a block is used?
>
Sorry, I'll add it. A block is consumed from the head to store pointers to
tuples, and from the tail to store the contents of the tuples. A block can
hold multiple tuples as long as the tuple pointers growing from the head do
not cross into the area for tuple contents. Anyway, I'll put this in the
source code.
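
To illustrate that layout, here is a sketch of the chunk structure. Only
usage, ntups, and tuples[] appear elsewhere in this thread; the exact types
and comments are guesswork, not the patch's definition:

typedef struct ccache_chunk
{
    uint32      usage;  /* offset where the tuple-contents area begins;
                         * it decreases as contents grow from the tail */
    uint32      ntups;  /* number of tuple pointers stored from the head */
    HeapTuple   tuples[FLEXIBLE_ARRAY_MEMBER];  /* grows toward the tail */
    /*
     * Tuple bodies are copied at the tail of the chunk and grow toward the
     * head; the chunk is full when tuples[ntups] would cross 'usage'.
     */
} ccache_chunk;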

> 6. In the do_insert_tuple function I felt that moving the tuples and
> rearranging their addresses is a little costly. How about the following way?
>
>    Always insert the tuple from the bottom of the block, where the empty
> space starts, and store the corresponding reference pointers at the start of
> the block in an array. As new tuples are inserted, this array grows from the
> block start and the tuples from the block end. We just need to sort this
> array based on item pointers; there is no need to update the reference
> pointers.
>
>    In this case movement is required only when a tuple is moved from one
> block to another, and also whenever continuous free space is not available
> to insert the new tuple. You can decide based on how frequently the sorting
> will happen in general.
>
It seems to me a reasonable suggestion.
Probably, an easier implementation is to replace an old block containing
dead space with a new block that contains only valid tuples, if and when the
dead space grows beyond a threshold of block usage.

> 7. In the ccache_find_tuple function, this Assert(i_min + 1 < cchunk->ntups);
> can go wrong when only one tuple is present in the block, with an item
> pointer equal to the one we are searching for in the forward scan direction.
>
It shouldn't happen, because the first or second ItemPointerCompare will
handle that condition. Please assume cchunk->ntups == 1. In this case, any
given ctid matches one of the branches, because any ctid is less than, equal
to, or greater than the single cached tuple; thus, it moves to the right or
left node according to the scan direction.

> 8. I am not able to find a protection mechanism for insert/delete etc. of a
> tuple in the T-tree. As this is shared memory, it can cause problems.
>
For design simplification, I put a giant lock on each columnar cache.
So, the routines in cscan.c acquire an exclusive lwlock prior to invoking
ccache_insert_tuple / ccache_delete_tuple.

> 9. + /* merge chunks if this chunk has enough space to merge */
> + ccache_merge_chunk(ccache, cchunk);
>
>   Calling merge chunks on every heap-page-prune callback is an overhead for
> vacuum. The merge may then again lead to node splits because of new data.
>
OK, I'll check the conditions for merging chunks, to prevent overly frequent
merges / splits.

> 10. "columner" is present in some places of the patch. correct it.
>
Ahh, it should be "columnar".

> 11. In the cache_scan_next function, in case the cache insert fails because
> of shared memory, the tuple pointer is not reset and cache is NULL.
>     Because of this, the next record fetch leads to an assertion failure on
> cache != NULL.
>
You are right. I had to reset the scan state as if it were the normal seqscan
path, not just set csstate->ccache to NULL.
I had left an older manner from trial and error during implementation.

> 12. + if (ccache->status != CCACHE_STATUS_IN_PROGRESS)
>       + cs_put_ccache(ccache);
>
>     The cache is created with refcnt 2, and sometimes put cache is called
> twice to eliminate it, and sometimes a different approach is used.
>     It is a little confusing; can you explain in comments why 2 is required
> and how it is maintained?
>
I thought 2 is the same as create + get, so putting the cache at the end of a
scan does not release it. However, it might be confusing, as you pointed out.
The process that creates the cache knows it is the creator. So, all it needs
to do is exit the scan without putting the cache, if it successfully created
the cache and wants to leave it for later scans.

> 13. A performance report is required to see how much impact the cache
> synchronizer causes on insert/delete/update operations.
>
OK, I'll try to measure the difference between them in the next patch
submission.

> 14. The GUC variable "cache_scan_disabled" is missing from the docs description.
>
OK,

Thanks, I'll submit a revised one within a couple of days.
-- 
KaiGai Kohei <kaigai@kaigai.gr.jp>



Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)

From: Haribabu Kommi
On Thu, Feb 13, 2014 at 2:42 AM, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:
> 2014-02-12 14:59 GMT+09:00 Haribabu Kommi <kommi.haribabu@gmail.com>:
>> 7. In the ccache_find_tuple function, this Assert(i_min + 1 < cchunk->ntups);
>> can go wrong when only one tuple is present in the block, with an item
>> pointer equal to the one we are searching for in the forward scan direction.
>>
> It shouldn't happen, because the first or second ItemPointerCompare will
> handle that condition. Please assume cchunk->ntups == 1. In this case, any
> given ctid matches one of the branches, because any ctid is less than, equal
> to, or greater than the single cached tuple; thus, it moves to the right or
> left node according to the scan direction.

Yes, you are correct. Sorry for the noise.
 
>> 8. I am not able to find a protection mechanism for insert/delete etc. of a
>> tuple in the T-tree. As this is shared memory, it can cause problems.
>>
> For design simplification, I put a giant lock on each columnar cache.
> So, the routines in cscan.c acquire an exclusive lwlock prior to invoking
> ccache_insert_tuple / ccache_delete_tuple.

Correct. But this lock can be a bottleneck for concurrency. Better to analyze
it once we have the performance report.

Regards,
Hari Babu
Fujitsu Australia

Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)

From: Kouhei Kaigai
>>> 8. I am not able to find a protection mechanism for insert/delete etc. of a
>>> tuple in the T-tree. As this is shared memory, it can cause problems.
>>>
>> For design simplification, I put a giant lock on each columnar cache.
>> So, the routines in cscan.c acquire an exclusive lwlock prior to invoking
>> ccache_insert_tuple / ccache_delete_tuple.
>>
> Correct. But this lock can be a bottleneck for concurrency. Better to analyze
> it once we have the performance report.
>
Well, concurrent updates toward a particular table may cause lock contention
due to the giant lock.
On the other hand, one headache of mine is how to avoid deadlocks if we try to
implement it with finer-grained locking. Consider per-chunk locking: it also
needs to take a lock on the neighbor nodes when a record is moved out.
Concurrently, some other process may try to move another record in the inverse
order. That is a recipe for deadlock.

Is there an idea or reference for implementing concurrent updates on a tree
structure?

Anyway, it is a good idea to measure the impact of concurrent updates on
cached tables, to find out the significance of lock splitting.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>





Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)

From: Haribabu Kommi
On Thu, Feb 13, 2014 at 3:27 PM, Kouhei Kaigai wrote:
> Well, concurrent updates toward a particular table may cause lock contention
> due to the giant lock.
> On the other hand, one headache of mine is how to avoid deadlocks if we try
> to implement it with finer-grained locking. Consider per-chunk locking: it
> also needs to take a lock on the neighbor nodes when a record is moved out.
> Concurrently, some other process may try to move another record in the
> inverse order. That is a recipe for deadlock.
>
> Is there an idea or reference for implementing concurrent updates on a tree
> structure?
>
> Anyway, it is a good idea to measure the impact of concurrent updates on
> cached tables, to find out the significance of lock splitting.

We can do some of the following things; a locking-order sketch follows below.
1. Let only inserts take the exclusive lock.
2. Always follow the locking order from the root to the children.
3. For deletes, take an exclusive lock only on the exact node where the delete
   is happening.
We will identify more such options based on the performance data.
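
Here is a sketch of point 2, root-to-child lock coupling; every name is
hypothetical, assuming one lwlock per chunk:

static ccache_chunk *
ttree_descend_locked(ccache_head *ccache, ItemPointer ctid)
{
    ccache_chunk *cur = ccache->root_chunk;

    LWLockAcquire(cur->lock, LW_EXCLUSIVE);
    for (;;)
    {
        /* hypothetical: pick the left/right child by comparing ctid */
        ccache_chunk *next = ttree_choose_child(cur, ctid);

        if (next == NULL)
            return cur;         /* caller releases cur->lock when done */

        /*
         * Lock coupling: acquire the child before releasing the parent,
         * so concurrent walkers can never lock in the inverse order.
         */
        LWLockAcquire(next->lock, LW_EXCLUSIVE);
        LWLockRelease(cur->lock);
        cur = next;
    }
}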

And one more interesting document I found on the net while searching for
concurrency in T-trees says that a B-tree can outperform a T-tree as an
in-memory index as well, at the small expense of more memory usage. The
document is attached to this mail.
 
Regards,
Hari Babu
Fujitsu Australia
Attachments

Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)

From: Kohei KaiGai
Hello,

The attached patch is a revised one for the cache-only scan module on top of
the custom-scan interface. Please check it.
The points I changed are below.

Also, the custom-scan patch, the basis of this functionality, is still looking
for folks who can volunteer to review it:
  https://commitfest.postgresql.org/action/patch_view?id=1282

> 1. +# contrib/dbcache/Makefile
>
>    The Makefile header comment does not match the file's actual location.
>
Fixed.

> 2.+   /*
> +   * Estimation of average width of cached tuples - it does not make
> +   * sense to construct a new cache if its average width is more than
> +   * 30% of the raw data.
> +   */
>
>    Move the estimation of the average width of cached tuples into the case
> where the cache is not found; otherwise it is an overhead for the cache-hit
> scenario.
>
This logic was moved into the block used when no existing cache is hit.
So, the usual path does not fall into the average-width calculation.

In addition, I added one more GUC (cache_scan.width_threshold) to specify the
threshold that determines usage of cache-scan.

> 3. + if (old_cache)
> + attrs_used = bms_union(attrs_used, &old_cache->attrs_used);
>
>    Don't we need a check that the average width is not more than 30%?
> During estimation it doesn't include the existing other attributes.
>
See the new ccache_new_attribute_set(). It works almost like bms_union if the
total width of the union set of attributes is less than the threshold.
Otherwise, it drops attributes that are included in the existing cache but
not required by this scan. Ideally, I would want statistical information
about how frequently/recently each column is referenced, but no such
information exists right now. So, I put in logic that drops larger columns
first.
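
A rough sketch of that policy: ccache_new_attribute_set() is the function name
from the patch, but the body below, including both helper functions, is an
assumption for illustration only:

static Bitmapset *
ccache_new_attribute_set(Oid relid, Bitmapset *required, Bitmapset *existing,
                         int32 width_threshold)
{
    Bitmapset  *result = bms_union(required, existing);

    /* total_attrs_width() is hypothetical: sum of average attribute widths */
    while (total_attrs_width(relid, result) > width_threshold)
    {
        /* drop the widest attribute that is cached but no longer required;
         * widest_member() is hypothetical */
        int     attnum = widest_member(relid, result, required);

        if (attnum < 0)
            break;
        result = bms_del_member(result, attnum);
    }
    return result;
}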

> 4. + lchunk->right = cchunk;
> + lchunk->l_depth = TTREE_DEPTH(lchunk->right);
>
>   I think lchunk->r_depth needs to be set in a clockwise rotation.
>
Fixed. Also, I found a bug that didn't update the upper pointer of child
nodes when a rotation was kicked off.

> 5. Can you add some comments in the code on how a block is used?
>
I added a source code comment around the definition of shmseg_head.
Is it understandable?

> 6. In the do_insert_tuple function I felt that moving the tuples and
> rearranging their addresses is a little costly. How about the following way?
>
>    Always insert the tuple from the bottom of the block, where the empty
> space starts, and store the corresponding reference pointers at the start of
> the block in an array. As new tuples are inserted, this array grows from the
> block start and the tuples from the block end. We just need to sort this
> array based on item pointers; there is no need to update the reference
> pointers.
>
>    In this case movement is required only when a tuple is moved from one
> block to another, and also whenever continuous free space is not available
> to insert the new tuple. You can decide based on how frequently the sorting
> will happen in general.
>
I newly added a "deadspace" field in ccache_chunk. It shows the payload
consumed by tuples that have already been deleted. Because of the format of
ccache_chunk, the actual free space is
cchunk->usage - offsetof(ccache_chunk, tuples[cchunk->ntups]), but deadspace
is potential free space. So, I adjusted do_insert to kick off the
chunk-compaction code if the target chunk has enough "potential" free space
but no "actual" free space available to store the new tuple.
This keeps the overhead of tuple insertion low, since all we need to move are
the pointers within cchunk->tuples, not the contents themselves.
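
A sketch of that insert-time decision; usage, ntups, deadspace, and tuples
come from this thread, while ccache_compaction() and the exact arithmetic are
assumptions:

static bool
ccache_chunk_has_room(ccache_chunk *cchunk, Size required)
{
    /* head area: the pointer array, plus room for one more pointer */
    Size    head_end = offsetof(ccache_chunk, tuples[cchunk->ntups + 1]);
    Size    actual_free = cchunk->usage - head_end;

    if (actual_free >= required)
        return true;

    /* enough "potential" space? compact the dead space, then retry */
    if (actual_free + cchunk->deadspace >= required)
    {
        ccache_compaction(cchunk);      /* hypothetical */
        return true;
    }
    return false;
}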

> 8. I am not able to find a protection mechanism for insert/delete etc. of a
> tuple in the T-tree. As this is shared memory, it can cause problems.
>
I didn't change the existing logic (a giant lock per ccache_head) yet,
because I still wonder whether this kind of optimization would lose the
simplicity of an implementation meant as a proof of concept for the hooks.
Is it really needed for this purpose?

In addition, it seems to me we don't have an intent-lock feature, even though
the paper you suggested gave me that impression...

> 9. + /* merge chunks if this chunk has enough space to merge */
> + ccache_merge_chunk(ccache, cchunk);
>
>   Calling merge chunks on every heap-page-prune callback is an overhead for
> vacuum. The merge may then again lead to node splits because of new data.
>
I adjusted the logic a little. Now ccache_merge_chunk() is called only if the
vacuumed cache chunk is different from the previous one. This reduces the
number of (trial) merge operations.

> 10. "columner" is present in some places of the patch. correct it.
>
Fixed.

> 11. In the cache_scan_next function, in case the cache insert fails because
> of shared memory, the tuple pointer is not reset and cache is NULL.
>     Because of this, the next record fetch leads to an assertion failure on
> cache != NULL.
>
Fixed.

> 12. + if (ccache->status != CCACHE_STATUS_IN_PROGRESS)
>       + cs_put_ccache(ccache);
>
>     The cache is created with refcnt 2, and sometimes put cache is called
> twice to eliminate it, and sometimes a different approach is used.
>     It is a little confusing; can you explain in comments why 2 is required
> and how it is maintained?
>
I adjusted the logic around the reference count. When a cache is created, the
creator process does not decrease the reference count if creation was
successful. If some error happens during creation, it is decreased, so we can
drop the cache with one "put" operation.

> 13. A performance report is required to see how much impact the cache
> synchronizer causes on insert/delete/update operations.
>
I tried assigning the synchronizer trigger to "pgbench_accounts", then ran
pgbench against the table with and without the columnar cache.
It seems to me there is no difference larger than the run-to-run variance.

* with columnar-cache

$ pgbench -T 15 postgres
tps = 113.586430 (including connections establishing)
tps = 113.597228 (excluding connections establishing)

* without columnar-cache
$ pgbench -T 15 postgres
tps = 110.765080 (including connections establishing)
tps = 110.775890 (excluding connections establishing)

> 14. The GUC variable "cache_scan_disabled" is missing from the docs
> description.
>
Its description has been added.

Thanks,

--
KaiGai Kohei <kaigai@kaigai.gr.jp>

Attachments

Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)

From: Haribabu Kommi
On Fri, Feb 21, 2014 at 2:19 AM, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:
> The attached patch is a revised one for the cache-only scan module on top of
> the custom-scan interface. Please check it.

Thanks for the revised patch.  Please find some minor comments.

1. memcpy(dest, tuple, HEAPTUPLESIZE);
+ memcpy((char *)dest + HEAPTUPLESIZE,
+   tuple->t_data, tuple->t_len);

   For a normal tuple these two addresses are different, but in the case of ccache it is continuous memory.
   Better to write a comment that even though it is continuous memory, it is treated as separate.

2. + uint32 required = HEAPTUPLESIZE + MAXALIGN(tuple->t_len);

   t_len is already max-aligned. There is no problem in aligning it again, but the required-length calculation differs from function to function.
   For example, in the part of the same function below, the same t_len is used directly. It doesn't cause any problem, but it may give some confusion.

3. + cchunk = ccache_vacuum_tuple(ccache, ccache->root_chunk, &ctid);
+ if (pchunk != NULL && pchunk != cchunk)
+ ccache_merge_chunk(ccache, pchunk);
+ pchunk = cchunk;

   merge_chunk is called only when the heap tuples are spread across two cache chunks. Actually, one cache chunk can accommodate one or more heap pages; it needs some other way of handling.

4. for (i=0; i < 20; i++)

   Better to replace this magic number with a meaningful macro.

5. "columner" is present in sgml file. correct it.

6. "max_cached_attnum" value in the document saying as 128 by default but in the code it set as 256.

I will start regression and performance tests. I will inform you once I finish.
 
Regards,
Hari Babu
Fujitsu Australia

Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)

From: Haribabu Kommi
On Mon, Feb 24, 2014 at 2:41 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
> On Fri, Feb 21, 2014 at 2:19 AM, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:
>> The attached patch is a revised one for the cache-only scan module on top
>> of the custom-scan interface. Please check it.
>
> I will start regression and performance tests. I will inform you once I
> finish.

Getting some compilation warnings while compiling the extension, and also I am not able to load the extension because of an undefined symbol, "get_restriction_qual_cost".

cscan.c: In function ‘cs_estimate_costs’:
cscan.c:163: warning: implicit declaration of function ‘get_restriction_qual_cost’
cscan.c: In function ‘cs_add_scan_path’:
cscan.c:437: warning: implicit declaration of function ‘bms_to_string’
cscan.c:437: warning: passing argument 1 of ‘makeString’ makes pointer from integer without a cast
cscan.c: In function ‘cs_begin_custom_scan’:
cscan.c:493: warning: implicit declaration of function ‘bms_from_string’
cscan.c:493: warning: assignment makes pointer from integer without a cast

FATAL:  could not load library "/postgresql/cache_scan.so": /postgresql/cache_scan.so: undefined symbol: get_restriction_qual_cost
 
Regards,
Hari Babu
Fujitsu Australia

Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)

From: Kouhei Kaigai
Thanks for your testing,

> Getting some compilation warnings while compiling the extension, and also
> I am not able to load the extension because of an undefined symbol,
> "get_restriction_qual_cost".
>
It seems to me you applied only the part-1 portion of the custom-scan patches.

get_restriction_qual_cost() is redefined as an extern function, instead of a
static one, in the part-2 (ctidscan) portion, so that extensions can estimate
the cost of qualifiers. Also, bms_to_string() and bms_from_string() are added
in the part-3 (postgres_fdw) portion, to carry a bitmapset according to the
copyObject manner.

It may make sense to include the above supplemental changes in the cache-scan
feature as well, until either of them gets upstreamed.
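
For context, here is a sketch of how those two helpers (added by the
custom-scan patch series) are typically used to carry a Bitmapset through
copyObject-able private data; the wrapper names are hypothetical:

#include "postgres.h"
#include "nodes/bitmapset.h"
#include "nodes/pg_list.h"
#include "nodes/value.h"

/* pack a Bitmapset into a copyObject-safe private list */
static List *
pack_attrs_used(Bitmapset *attrs_used)
{
    return list_make1(makeString(bms_to_string(attrs_used)));
}

/* unpack it again, e.g. at executor-begin time */
static Bitmapset *
unpack_attrs_used(List *custom_private)
{
    return bms_from_string(strVal(linitial(custom_private)));
}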

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>





Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)

From: Haribabu Kommi
On Tue, Feb 25, 2014 at 10:44 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> It seems to me you applied only the part-1 portion of the custom-scan
> patches.
>
> get_restriction_qual_cost() is redefined as an extern function, instead of a
> static one, in the part-2 (ctidscan) portion, so that extensions can
> estimate the cost of qualifiers. Also, bms_to_string() and bms_from_string()
> are added in the part-3 (postgres_fdw) portion, to carry a bitmapset
> according to the copyObject manner.
>
> It may make sense to include the above supplemental changes in the
> cache-scan feature as well, until either of them gets upstreamed.

Thanks for the information. I will apply the other patches as well and start testing.

Regards,
Hari Babu
Fujitsu Australia

Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)

From: Haribabu Kommi
On Tue, Feb 25, 2014 at 11:13 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
> Thanks for the information. I will apply the other patches as well and start
> testing.

When trying to run the pgbench test, by default the cache-scan plan is not chosen because of its higher cost. So I increased cpu_index_tuple_cost to a very large value, or turned off index scans, so that the planner chooses cache_scan as the least-cost plan.

The configuration parameters changed during the test are,

shared_buffers - 2GB, cache_scan.num_blocks - 1024
wal_buffers - 16MB, checkpoint_segments - 255
checkpoint_timeout - 15 min, cpu_index_tuple_cost - 100000 or enable_indexscan=off

Test procedure:
1. Initialize the database with pgbench with 75 scale factor.
2. Create the triggers on pgbench_accounts
3. Use a select query to load all the data into cache.
4. Run a simple update pgbench test.

Plan details of pgbench simple update queries:

postgres=# explain update pgbench_accounts set abalance = abalance where aid = 100000;
                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Update on pgbench_accounts  (cost=0.43..100008.44 rows=1 width=103)
   ->  Index Scan using pgbench_accounts_pkey on pgbench_accounts  (cost=0.43..100008.44 rows=1 width=103)
         Index Cond: (aid = 100000)
 Planning time: 0.045 ms
(4 rows)

postgres=# explain select abalance from pgbench_accounts where aid = 100000;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Custom Scan (cache scan) on pgbench_accounts  (cost=0.00..99899.99 rows=1 width=4)
   Filter: (aid = 100000)
 Planning time: 0.042 ms
(3 rows)

I am observing a large delay in the performance results. The performance test script is attached to this mail.
Please let me know if you find any problem in the test.
 
Regards,
Hari Babu
Fujitsu Australia
Attachments

Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)

From
Kouhei Kaigai
Date:
Thanks for your reviewing.

According to the discussion in the Custom-Scan API thread, I moved
all the supplemental facilities (like bms_to/from_string) into the
main patch. So, you don't need to apply the ctidscan and postgres_fdw
patches for testing any more.
(I'll submit the revised one later)

> 1. memcpy(dest, tuple, HEAPTUPLESIZE);
> + memcpy((char *)dest + HEAPTUPLESIZE,
>
> +   tuple->t_data, tuple->t_len);
>
>   For a normal tuple these two addresses are different, but in the case of
> ccache, it is one contiguous memory area.
>   Better to write a comment saying that even though the memory is
> contiguous, it is still treated as two separate areas.
>
OK, I put a source code comment as follows:

        /*
         * Even though we usually put the body of HeapTupleHeaderData
         * just after HeapTupleData, there is no guarantee that both
         * data structures are located at contiguous memory addresses.
         * So, we explicitly adjust tuple->t_data to point to the area
         * just behind itself, to reference the HeapTuple on the
         * columnar-cache like regular ones.
         */
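
In effect, the copy sequence becomes the following (a sketch; dest is
assumed to point at HEAPTUPLESIZE + MAXALIGN(tuple->t_len) bytes reserved
inside the cache chunk):

    memcpy(dest, tuple, HEAPTUPLESIZE);
    memcpy((char *) dest + HEAPTUPLESIZE, tuple->t_data, tuple->t_len);
    /* re-point t_data at the body placed just behind the header */
    dest->t_data = (HeapTupleHeader) ((char *) dest + HEAPTUPLESIZE);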

> 2. + uint32 required = HEAPTUPLESIZE + MAXALIGN(tuple->t_len);
>
>   t_len is already MAXALIGNed, so there is no problem in applying it again,
> but the required-length calculation differs from function to function.
>   For example, in a later part of the same function, the same t_len is used
> directly. It doesn't cause any problem, but it may cause some confusion.
>
I once tried to trust that t_len is aligned well; however, an Assert()
macro told me that is not a correct assumption. See heap_compute_data_size():
it computes the length of the tuple body and adjusts alignment according
to the "attalign" value of pg_attribute, which is not always the same as
sizeof(Datum).
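
In other words, the defensive calculation stays as this sketch:

    /*
     * heap_compute_data_size() aligns each attribute by its attalign
     * ('c', 's', 'i' or 'd'), so t_len is not guaranteed to be a multiple
     * of MAXIMUM_ALIGNOF; apply MAXALIGN() before carving chunk space.
     */
    uint32      required = HEAPTUPLESIZE + MAXALIGN(tuple->t_len);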

> 3. + cchunk = ccache_vacuum_tuple(ccache, ccache->root_chunk, &ctid);
> + if (pchunk != NULL && pchunk != cchunk)
>
> + ccache_merge_chunk(ccache, pchunk);
>
> + pchunk = cchunk;
>
>
>   ccache_merge_chunk() is called only when the heap tuples are spread
> across two cache chunks. Actually, one cache chunk can accommodate one or
> more heap pages, so it needs some other way of handling.
>
I adjusted the logic to merge the chunks as follows:

Once a tuple is vacuumed from a chunk, it also checks whether the chunk can
be merged with its child leafs. A chunk has up to two child leafs; the left
one has a smaller ctid than the parent, and the right one has a greater ctid.
It means a chunk without a right child in the left sub-tree, or a chunk
without a left child in the right sub-tree, is a neighbor of the chunk being
vacuumed. In addition, if the vacuumed chunk does not have either (or both)
of its children, it can be merged with the parent node.
I modified ccache_vacuum_tuple() to merge chunks during the t-tree walk-down,
if the vacuumed chunk has enough free space.
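
Roughly, the walk-down now behaves like the sketch below (field and constant
names here are hypothetical, not the patch verbatim):

    cchunk = ccache_vacuum_tuple(ccache, ccache->root_chunk, &ctid);

    /*
     * The left child that has no right child of its own is the in-order
     * neighbor of cchunk; merge if the pair fits into one chunk.
     */
    if (cchunk->left && cchunk->left->right == NULL &&
        cchunk->usage + cchunk->left->usage <= CCACHE_CHUNK_SIZE)
        ccache_merge_chunk(ccache, cchunk->left);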

> 4. for (i=0; i < 20; i++)
>
>    Better to replace this magic number with a meaningful macro.
>
I rethought this: there is no good reason why we should construct multiple
ccache_entries at once. So, I adjusted the code to create a new
ccache_entry on demand, to track a columnar-cache being acquired.

> 5. "columner" is present in the sgml file; correct it.
>
Sorry, fixed it.

> 6. The "max_cached_attnum" value is documented as 128 by default, but
> in the code it is set to 256.
>
Sorry, fixed it.


Also, I tried to run a benchmark of cache_scan in the case where this module
performs most effectively.

The table t1 is declared as follows:
  create table t1 (a int, b float, c float, d text, e date, f char(200));
Each record is roughly 256 bytes wide, and the table contains 4 million
records, so the total table size is almost 1GB.

* 1st trial - it takes longer than a sequential scan because of columnar-
              cache construction
postgres=# explain analyze select count(*) from t1 where a % 10 = 5;
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=200791.62..200791.64 rows=1 width=0) (actual time=63105.036..63105.037 rows=1 loops=1)
   ->  Custom Scan (cache scan) on t1  (cost=0.00..200741.62 rows=20000 width=0) (actual time=7.397..62832.728 rows=400000 loops=1)
         Filter: ((a % 10) = 5)
         Rows Removed by Filter: 3600000
 Planning time: 214.506 ms
 Total runtime: 64629.296 ms
(6 rows)

* 2nd trial - it is much faster than a sequential scan because there is no disk access
postgres=# explain analyze select count(*) from t1 where a % 10 = 5;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=67457.53..67457.54 rows=1 width=0) (actual time=7833.313..7833.313 rows=1 loops=1)
   ->  Custom Scan (cache scan) on t1  (cost=0.00..67407.53 rows=20000 width=0) (actual time=0.154..7615.914 rows=400000 loops=1)
         Filter: ((a % 10) = 5)
         Rows Removed by Filter: 3600000
 Planning time: 1.019 ms
 Total runtime: 7833.761 ms
(6 rows)

* 3rd trial - turn off cache_scan, so the planner chooses the built-in SeqScan.
postgres=# set cache_scan.enabled = off;
SET
postgres=# explain analyze select count(*) from t1 where a % 10 = 5;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=208199.08..208199.09 rows=1 width=0) (actual time=59700.810..59700.810 rows=1 loops=1)
   ->  Seq Scan on t1  (cost=0.00..208149.08 rows=20000 width=0) (actual time=715.489..59518.095 rows=400000 loops=1)
         Filter: ((a % 10) = 5)
         Rows Removed by Filter: 3600000
 Planning time: 0.630 ms
 Total runtime: 59701.104 ms
(6 rows)

The reason for such an extreme result: I constrained the operating system's
page cache usage to suppress disk cache hits at the OS level, so the
sequential scan is dominated by disk access performance in this case. On the
other hand, the columnar cache was able to hold all of the records because
it omits caching of unreferenced columns.

* GUCs
shared_buffers = 512MB
shared_preload_libraries = 'cache_scan'
cache_scan.num_blocks = 400

[kaigai@iwashi backend]$ free -m
             total       used       free     shared    buffers     cached
Mem:          7986       7839        146          0          2        572
-/+ buffers/cache:       7265        721
Swap:         8079        265       7814


Please don't throw stones at me. :-)
The primary purpose of this extension is to demonstrate usage of the
custom-scan interface and heap_page_prune_hook().

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>



Attachments

Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)

From
Kouhei Kaigai
Date:
Sorry, the previous one still has "columner" in the sgml files.
Please see the attached one instead.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>



Attachments

Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)

From
Haribabu Kommi
Date:

On Tue, Mar 4, 2014 at 3:07 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> I adjusted the logic to merge the chunks as follows:
>
> Once a tuple is vacuumed from a chunk, it also checks whether the chunk can
> be merged with its child leafs. A chunk has up to two child leafs; the left
> one has a smaller ctid than the parent, and the right one has a greater
> ctid. It means a chunk without a right child in the left sub-tree, or a
> chunk without a left child in the right sub-tree, is a neighbor of the
> chunk being vacuumed. In addition, if the vacuumed chunk does not have
> either (or both) of its children, it can be merged with the parent node.
> I modified ccache_vacuum_tuple() to merge chunks during the t-tree
> walk-down, if the vacuumed chunk has enough free space.
>

Patch looks good.

Regarding merging of the nodes: instead of checking whether a merge is
possible for every tuple that is vacuumed, can we put in some kind of
threshold, such as whenever a node becomes 50% free, try to merge it with
leaf nodes until it is 90% full? The remaining 10% would be left for the
next inserts into the node. What do you say?

I will update you later regarding the performance test results.

Regards,
Hari Babu
Fujitsu Australia

Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)

From
Kohei KaiGai
Date:
2014-03-06 18:17 GMT+09:00 Haribabu Kommi <kommi.haribabu@gmail.com>:
>
> On Tue, Mar 4, 2014 at 3:07 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
>
> Patch looks good.
>
Thanks for volunteering.

> Regarding merging of the nodes: instead of checking whether a merge is
> possible for every tuple that is vacuumed, can we put in some kind of
> threshold, such as whenever a node becomes 50% free, try to merge it with
> leaf nodes until it is 90% full? The remaining 10% would be left for the
> next inserts into the node. What do you say?
>
Hmm, indeed, that makes sense. How about an idea that kicks off chunk
merging if the "expected" free space of the merged chunk is less than 50%?
If the threshold depends on the (expected) usage of the merged chunk, it
can avoid over-merging.
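
In code form, the condition being discussed might look like this sketch
(hypothetical names; it simply mirrors the proposal as worded):

    Size    merged_usage = cchunk->usage + neighbor->usage;
    Size    merged_free  = CCACHE_CHUNK_SIZE - merged_usage;

    /* kick merging only if the expected free space stays below 50% */
    if (merged_free < CCACHE_CHUNK_SIZE / 2)
        ccache_merge_chunk(ccache, neighbor);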

> I will update you later regarding the performance test results.
>
Thanks,

Also, I'll rebase the patch on top of the new custom-scan interfaces
according to Tom's suggestion, even though the main logic of cache_scan
is not changed.

Best regards,
-- 
KaiGai Kohei <kaigai@kaigai.gr.jp>



Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)

From
Haribabu Kommi
Date:
On Thu, Mar 6, 2014 at 10:15 PM, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:
> 2014-03-06 18:17 GMT+09:00 Haribabu Kommi <kommi.haribabu@gmail.com>:
>> I will update you later regarding the performance test results.
>>

I ran the performance test on the cache scan patch and below are the readings.

Configuration:

shared_buffers - 512MB
cache_scan.num_blocks - 600
checkpoint_segments - 255

Machine:
OS - centos - 6.4
CPU - 4 cores, 2.5 GHz
Memory - 4GB

                 Head       Patched     Diff
Select - 500K    772ms      2659ms      -200%
Insert - 400K    3429ms     1948ms      +43%  (I am not sure how it improved in this case)
Delete - 200K    2066ms     3978ms      -92%
Update - 200K    3915ms     5899ms      -50%

This patch shows very well how a custom scan can be used, but the patch
itself has some performance problems that need to be investigated.

I attached the test script file used for the performance test.

Regards,
Hari Babu
Fujitsu Australia

Attachments

Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)

From
Kouhei Kaigai
Date:
Thanks for your efforts!
>                  Head       Patched     Diff
> Select - 500K    772ms      2659ms      -200%
> Insert - 400K    3429ms     1948ms      +43%  (I am not sure how it improved in this case)
> Delete - 200K    2066ms     3978ms      -92%
> Update - 200K    3915ms     5899ms      -50%
>
> This patch shows very well how a custom scan can be used, but the patch
> itself has some performance problems that need to be investigated.
>
> I attached the test script file used for the performance test.
>
First of all, it seems to me your test case has too small a data set,
which allows all the data to be held in memory - roughly, 500K records of
200 bytes each consume about 100MB. Your configuration allocates 512MB of
shared_buffers, and about 3GB of OS-level page cache is available.
(Note that Linux uses free memory as disk cache adaptively.)

This cache is designed to hide the latency of disk accesses, so this test
case does not fit its intention.
(Also, the primary purpose of this module is a demonstration of using
heap_page_prune_hook to hook vacuuming, so simple code was preferred over
a more complicated implementation with better performance.)

I could reproduce the overall trend: the plain scan is faster than the
cache scan when the buffer is already in memory. Probably, this comes from
the cost of walking down the T-tree index by ctid for every reference.
The performance penalty around UPDATE and DELETE likely comes from the
per-row trigger invocation.
I could observe a small performance gain on INSERT.
That is strange to me, also. :-(
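
For context, the per-reference descent is essentially a range search by
ctid, roughly like the sketch below (ip_min/ip_max and the child links are
hypothetical field names):

    static ccache_chunk *
    ccache_find_chunk(ccache_chunk *cchunk, ItemPointer ctid)
    {
        while (cchunk != NULL)
        {
            if (ItemPointerCompare(ctid, &cchunk->ip_min) < 0)
                cchunk = cchunk->left;      /* below this chunk's range */
            else if (ItemPointerCompare(ctid, &cchunk->ip_max) > 0)
                cchunk = cchunk->right;     /* above this chunk's range */
            else
                break;                      /* ctid falls inside */
        }
        return cchunk;
    }

This descent runs once per fetched tuple, which is where the overhead shows
up relative to a plain buffer read.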

On the other hand, the discussion around the custom-plan interface
affects this module because it uses that API as its foundation.
Please wait a few days for me to rebase the cache_scan module onto
the newer custom-plan interface, which I submitted just a moment
ago.

Also, is it really necessary to tune the performance of this example
module for the heap_page_prune_hook?
Even though I have a few ideas to improve the cache performance,
such as inserting multiple rows at once or using a local chunk copy
instead of the t-tree walk-down, I'm not sure whether that is
productive within the current v9.4 timeframe. ;-(

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>





Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)

From
Haribabu Kommi
Date:
On Wed, Mar 12, 2014 at 5:26 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> Thanks for your efforts!
> First of all, it seems to me your test case has too small a data set,
> which allows all the data to be held in memory - roughly, 500K records of
> 200 bytes each consume about 100MB. Your configuration allocates 512MB of
> shared_buffers, and about 3GB of OS-level page cache is available.
> (Note that Linux uses free memory as disk cache adaptively.)

Thanks for the information, and a small correction: the total number of
records is 5 million.
The select operation retrieves 500K records. The total table size
is around 1GB.

Once I get your new patch re-based on the custom-scan patch, I will
test the performance again after increasing my database size beyond
the RAM size. I will also make sure that less memory is available
for the disk cache.

Regards,
Hari Babu
Fujitsu Australia



Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)

From
Kouhei Kaigai
Date:
Hello,

The attached patches are revised ones according to the latest custom-plan
interface patch (v11).
The cache-scan module was re-implemented on the newer interface. I also
noticed the extension did not handle redirected tuples correctly, so I
revised the logic in ccache_vacuum_page() entirely. It now synchronizes
the cached tuples per page, not per tuple, and also tries to merge t-tree
chunks on a per-page basis.

Also, I split the patches again because the *demonstration* part is much
larger than the patches to the core backend. That should help reviewing.
* pgsql-v9.4-vacuum_page_hook.v11.patch
 -> It adds a hook invoked for each page being vacuumed; this is needed to
    synchronize the status of the in-memory cache managed by an extension.
* pgsql-v9.4-mvcc_allows_cache.v11.patch
 -> It allows HeapTupleSatisfiesVisibility() to run on tuples held in the
    in-memory cache, not on the heap.
* pgsql-v9.4-example-cache_scan.v11.patch
 -> It demonstrates the usage of the above two patches. It allows a
    relation to be scanned without storage access when possible.
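
For reviewers' convenience, here is a sketch of how the extension side is
expected to consume these two pieces (the hook signature follows the part-1
patch posted earlier in this thread; the extension-side names are
hypothetical):

    static heap_page_prune_hook_type prev_prune_hook = NULL;

    static void
    ccache_on_page_prune(Relation relation, Buffer buffer, int ndeleted,
                         TransactionId OldestXmin,
                         TransactionId latestRemovedXid)
    {
        if (prev_prune_hook)
            (*prev_prune_hook)(relation, buffer, ndeleted,
                               OldestXmin, latestRemovedXid);

        /* drop the reclaimed tuples of this page from the columnar cache */
    }

    void
    _PG_init(void)
    {
        prev_prune_hook = heap_page_prune_hook;
        heap_page_prune_hook = ccache_on_page_prune;
    }

And the second patch is what lets the scan side check visibility of a
cached tuple without a buffer, e.g.:

    if (HeapTupleSatisfiesVisibility(ctup, snapshot, InvalidBuffer))
        return ctup;    /* visible; no hint-bit update on InvalidBuffer */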

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>



Attachments

Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)

From
Haribabu Kommi
Date:
On Mon, Mar 17, 2014 at 11:45 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> Hello,
>
> The attached patches are revised ones according to the latest custom-plan
> interface patch (v11).
> The cache-scan module was re-implemented on the newer interface. I also
> noticed the extension did not handle redirected tuples correctly, so I
> revised the logic in ccache_vacuum_page() entirely. It now synchronizes
> the cached tuples per page, not per tuple, and also tries to merge t-tree
> chunks on a per-page basis.
>
> Also, I split the patches again because the *demonstration* part is much
> larger than the patches to the core backend. That should help reviewing.
> * pgsql-v9.4-vacuum_page_hook.v11.patch
>  -> It adds a hook invoked for each page being vacuumed; this is needed to
>     synchronize the status of the in-memory cache managed by an extension.
> * pgsql-v9.4-mvcc_allows_cache.v11.patch
>  -> It allows HeapTupleSatisfiesVisibility() to run on tuples held in the
>     in-memory cache, not on the heap.
> * pgsql-v9.4-example-cache_scan.v11.patch
>  -> It demonstrates the usage of the above two patches. It allows a
>     relation to be scanned without storage access when possible.

All the patches are good. The cache scan extension patch may need further
refinement in terms of performance, but that can be handled later as well.
So I am marking the patch as "ready for committer". Thanks for the patch.

Regards,
Hari Babu
Fujitsu Australia



Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)

From
Kouhei Kaigai
Date:
> All the patches are good. The cache scan extension patch may need further
> refinement in terms of performance, but that can be handled later as well.
> So I am marking the patch as "ready for committer". Thanks for the patch.
>
Thanks for your dedicated efforts on these patches.

The smaller portions of the above submission can be applied independently
of the custom-plan interface, and their scale is much smaller than the
contrib/ portion (about 30 lines in total).
It would also make sense for someone to pick them up separately from the
extension portion. I intended to implement the extension portion as simply
as I could, for demonstration purposes rather than performance; even so,
its scale is about 2.5KL. :-(
Yes, I know the time pressure towards the v9.4 final feature freeze....

Best regards,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

