Обсуждение: GIST optimization to limit calls to operator on sub nodes

Поиск
Список
Период
Сортировка

GIST optimization to limit calls to operator on sub nodes

От
Pujol Mathieu
Дата:
Hello,
I already post my question in the General Mailing list, but without
succeed so I try this one that seems to me more specialized.
My question is about GIST index.
I made my own index to handle specific data and operators. It works
pretty fine but I wonder if it was possible to optimize it.
When I run my operator on a GIST node (in the method
gist_range_consistent) it returns "NotConsistent" / "MaybeConsistent" /
"FullyConsistent".
NotConsistent -> means that all subnodes could be ignored,
gist_range_consistent return false
MaybeConsistent -> means that at least one subnode/leaf will be
consistent, gist_range_consistent return true
FullyConsistent -> means that all subnodes/leaves will be consistent,
gist_range_consistent return true

So like with the "recheck flag" I would like to know if there is a way
to notify postgres that it is not necessary to rerun my operator on
subnodes, to speedup the search.

For example, consider the following gist tree
           R
     /         \
    Na          Nb
  /   \       /    \
La1  La2    Lb1  Lb2

If all nodes return FullyConsistent, postgres will run tests in that
Order : R, Na, Nb, La1, La2, Lb1, Lb2, thanks to recheck flag it will
not test rows associated to leaves Lxx.
My goal is that postgres run test on R and then skip tests on other
nodes. So is there a way to do that in the GIST API ? Or could I share
data from R to Nx and then From Na to Lax and Nb to Lbx ?
Thanks,
Mathieu



Re: GIST optimization to limit calls to operator on sub nodes

От
Emre Hasegeli
Дата:
Pujol Mathieu <mathieu.pujol@realfusio.com>:
> Hello,
> I already post my question in the General Mailing list, but without
> succeed so I try this one that seems to me more specialized.
> My question is about GIST index.
> I made my own index to handle specific data and operators. It works
> pretty fine but I wonder if it was possible to optimize it.
> When I run my operator on a GIST node (in the method
> gist_range_consistent) it returns "NotConsistent" /
> "MaybeConsistent" / "FullyConsistent".
> NotConsistent -> means that all subnodes could be ignored,
> gist_range_consistent return false
> MaybeConsistent -> means that at least one subnode/leaf will be
> consistent, gist_range_consistent return true
> FullyConsistent -> means that all subnodes/leaves will be
> consistent, gist_range_consistent return true
>
> So like with the "recheck flag" I would like to know if there is a
> way to notify postgres that it is not necessary to rerun my operator
> on subnodes, to speedup the search.

I do not think it is possible at the moment.  The GiST framework can
be extended to support this use case.  I am not sure about the
speedup.  Most of the consistent functions do not seem very expensive
compared to other operations of the GiST framework.  I would be
happy to test it, if you would implement.


Re: GIST optimization to limit calls to operator on sub nodes

От
Tom Lane
Дата:
Emre Hasegeli <emre@hasegeli.com> writes:
> Pujol Mathieu <mathieu.pujol@realfusio.com>:
>> I made my own index to handle specific data and operators. It works
>> pretty fine but I wonder if it was possible to optimize it.
>> When I run my operator on a GIST node (in the method
>> gist_range_consistent) it returns "NotConsistent" /
>> "MaybeConsistent" / "FullyConsistent".
>> NotConsistent -> means that all subnodes could be ignored,
>> gist_range_consistent return false
>> MaybeConsistent -> means that at least one subnode/leaf will be
>> consistent, gist_range_consistent return true
>> FullyConsistent -> means that all subnodes/leaves will be
>> consistent, gist_range_consistent return true
>>
>> So like with the "recheck flag" I would like to know if there is a
>> way to notify postgres that it is not necessary to rerun my operator
>> on subnodes, to speedup the search.

> I do not think it is possible at the moment.  The GiST framework can
> be extended to support this use case.  I am not sure about the
> speedup.  Most of the consistent functions do not seem very expensive
> compared to other operations of the GiST framework.  I would be
> happy to test it, if you would implement.

I don't actually understand what's being requested here that the
NotConsistent case doesn't already cover.

            regards, tom lane


Re: GIST optimization to limit calls to operator on sub nodes

От
Pujol Mathieu
Дата:
Le 29/06/2014 22:14, Emre Hasegeli a écrit :
> Pujol Mathieu <mathieu.pujol@realfusio.com>:
>> Hello,
>> I already post my question in the General Mailing list, but without
>> succeed so I try this one that seems to me more specialized.
>> My question is about GIST index.
>> I made my own index to handle specific data and operators. It works
>> pretty fine but I wonder if it was possible to optimize it.
>> When I run my operator on a GIST node (in the method
>> gist_range_consistent) it returns "NotConsistent" /
>> "MaybeConsistent" / "FullyConsistent".
>> NotConsistent -> means that all subnodes could be ignored,
>> gist_range_consistent return false
>> MaybeConsistent -> means that at least one subnode/leaf will be
>> consistent, gist_range_consistent return true
>> FullyConsistent -> means that all subnodes/leaves will be
>> consistent, gist_range_consistent return true
>>
>> So like with the "recheck flag" I would like to know if there is a
>> way to notify postgres that it is not necessary to rerun my operator
>> on subnodes, to speedup the search.
> I do not think it is possible at the moment.  The GiST framework can
> be extended to support this use case.  I am not sure about the
> speedup.  Most of the consistent functions do not seem very expensive
> compared to other operations of the GiST framework.  I would be
> happy to test it, if you would implement.
>
>
Thanks for your reply.
I am not sure to have time to develop inside the framework, but if I try
I'll let you know my results. In my case the consistent function is
costly and the number of row important so this optimization could save
several hundred tests on a single request.

--
Mathieu PUJOL
Ingénieur Réalité Virtuelle
REAL FUSIO - 3D Computer Graphics
10, rue des arts - 31000 TOULOUSE - FRANCE
mathieu.pujol@realfusio.com - http://www.realfusio.com



Re: GIST optimization to limit calls to operator on sub nodes

От
Pujol Mathieu
Дата:
Le 29/06/2014 22:30, Tom Lane a écrit :
> Emre Hasegeli <emre@hasegeli.com> writes:
>> Pujol Mathieu <mathieu.pujol@realfusio.com>:
>>> I made my own index to handle specific data and operators. It works
>>> pretty fine but I wonder if it was possible to optimize it.
>>> When I run my operator on a GIST node (in the method
>>> gist_range_consistent) it returns "NotConsistent" /
>>> "MaybeConsistent" / "FullyConsistent".
>>> NotConsistent -> means that all subnodes could be ignored,
>>> gist_range_consistent return false
>>> MaybeConsistent -> means that at least one subnode/leaf will be
>>> consistent, gist_range_consistent return true
>>> FullyConsistent -> means that all subnodes/leaves will be
>>> consistent, gist_range_consistent return true
>>>
>>> So like with the "recheck flag" I would like to know if there is a
>>> way to notify postgres that it is not necessary to rerun my operator
>>> on subnodes, to speedup the search.
>> I do not think it is possible at the moment.  The GiST framework can
>> be extended to support this use case.  I am not sure about the
>> speedup.  Most of the consistent functions do not seem very expensive
>> compared to other operations of the GiST framework.  I would be
>> happy to test it, if you would implement.
> I don't actually understand what's being requested here that the
> NotConsistent case doesn't already cover.
>
>             regards, tom lane
>
>
Hi,
The NotConsistent case is correctly covered, the sub nodes are not
tested because I know that no child could pass the consistent_test.
The MaybeConsistent case is also correctly covered, all sub nodes are
tested because I don't know which sub nodes will pass the consistent_test.
My problem is with the FullyConsistent, because when I test a node I can
know that all it's childs nodes and leaves will pass the test, so I want
to notify GIST framework that it can't skip consistent test on those
nodes. Like we can notify it when testing a leaf that it could skip
consistent test on the row. Maybe I miss something on the API to do
that. On my tests, the "recheck_flag" works only for leaves.
Thanks
Mathieu



Re: GIST optimization to limit calls to operator on sub nodes

От
Tom Lane
Дата:
Pujol Mathieu <mathieu.pujol@realfusio.com> writes:
> Le 29/06/2014 22:30, Tom Lane a �crit :
>> I don't actually understand what's being requested here that the
>> NotConsistent case doesn't already cover.

> The NotConsistent case is correctly covered, the sub nodes are not
> tested because I know that no child could pass the consistent_test.
> The MaybeConsistent case is also correctly covered, all sub nodes are
> tested because I don't know which sub nodes will pass the consistent_test.
> My problem is with the FullyConsistent, because when I test a node I can
> know that all it's childs nodes and leaves will pass the test, so I want
> to notify GIST framework that it can't skip consistent test on those
> nodes. Like we can notify it when testing a leaf that it could skip
> consistent test on the row. Maybe I miss something on the API to do
> that. On my tests, the "recheck_flag" works only for leaves.

Hm ... that doesn't seem like a case that'd come up often enough to be
worth complicating the APIs for, unless maybe you are expecting a lot
of exact-duplicate index entries.  If you are, you might find that GIN
is a better fit for your problem than GIST --- it's designed to be
efficient for lots-of-duplicates.

Another view of this is that if you can make exact satisfaction checks
at upper-page entries, you're probably storing too much information in
the index entries (and thereby bloating the index).  The typical tradeoff
in GIST indexes is something like storing bounding boxes for geometric
objects --- which is necessarily lossy, but it results in small indexes
that are fast to search.  It's particularly important for upper-page
entries to be small, so that fanout is high and you have a better chance
of keeping all the upper pages in cache.

If you've got a compelling example where this actually makes sense,
I'd be curious to hear the details.

            regards, tom lane


Re: GIST optimization to limit calls to operator on sub nodes

От
Pujol Mathieu
Дата:
Le 30/06/2014 16:04, Tom Lane a écrit :
> Pujol Mathieu <mathieu.pujol@realfusio.com> writes:
>> Le 29/06/2014 22:30, Tom Lane a écrit :
>>> I don't actually understand what's being requested here that the
>>> NotConsistent case doesn't already cover.
>> The NotConsistent case is correctly covered, the sub nodes are not
>> tested because I know that no child could pass the consistent_test.
>> The MaybeConsistent case is also correctly covered, all sub nodes are
>> tested because I don't know which sub nodes will pass the consistent_test.
>> My problem is with the FullyConsistent, because when I test a node I can
>> know that all it's childs nodes and leaves will pass the test, so I want
>> to notify GIST framework that it can't skip consistent test on those
>> nodes. Like we can notify it when testing a leaf that it could skip
>> consistent test on the row. Maybe I miss something on the API to do
>> that. On my tests, the "recheck_flag" works only for leaves.
> Hm ... that doesn't seem like a case that'd come up often enough to be
> worth complicating the APIs for, unless maybe you are expecting a lot
> of exact-duplicate index entries.  If you are, you might find that GIN
> is a better fit for your problem than GIST --- it's designed to be
> efficient for lots-of-duplicates.
>
> Another view of this is that if you can make exact satisfaction checks
> at upper-page entries, you're probably storing too much information in
> the index entries (and thereby bloating the index).  The typical tradeoff
> in GIST indexes is something like storing bounding boxes for geometric
> objects --- which is necessarily lossy, but it results in small indexes
> that are fast to search.  It's particularly important for upper-page
> entries to be small, so that fanout is high and you have a better chance
> of keeping all the upper pages in cache.
>
> If you've got a compelling example where this actually makes sense,
> I'd be curious to hear the details.
>
>             regards, tom lane
>
>
Hi,
I have a table containing several millions of rows, and each row
contains an unique integer as identifier. My goal is to select all rows
which have an identifier that is contained into at least one range of a
list.
CREATE TABLE t (id int4 UNIQUE, ...)
SELECT * FROM t WHERE id @@> ARRAY[...]::range[]
I use a custom operator to check if an integer is contained in an array
of ranges (a custom structure defined by my plugin).
I build my own GIST to speedup those requests. So each node of my GIST
is represented by a range (like a BVH or octree of 3D boxes). I have no
duplicated keys in my index.
When I run my tests I am able to quickly discard entire portions of the
index which leads to great performance improvements.
During GIST traversal when I test consistency of a node
(isRangeOverlap(range,range[]) the test return Fully, Partially, No. So
when the tests return Fully I know for sure that all subnodes will also
return Fully.
In practice my operator is not free in execution time, so if I could
propagate the information on subnodes it will allow to save lot of
computation time.
This optimization could also be benefical for cube extension.
I don't think that it would complicate the API, we could use existing
recheck_flag. Today this value is used only for leaves node. Maybe it
could be propagated by GIST API from a node to its subnodes. So if a
node set it to false, it will be false for it's subnodes allowing client
to use it or no. So the API could remain the same without changes for
existing plugins and need only small memory to propagate this boolean.
I already achieve great performance improvements with my GIST, my
request is to optimize it in use cases that select several rows to limit
overhead of my consistent_operator.

regards,
mathieu pujol