Re: GIST optimization to limit calls to operator on sub nodes
От | Pujol Mathieu |
---|---|
Тема | Re: GIST optimization to limit calls to operator on sub nodes |
Дата | |
Msg-id | 53B2676A.5040001@realfusio.com обсуждение исходный текст |
Ответ на | Re: GIST optimization to limit calls to operator on sub nodes (Tom Lane <tgl@sss.pgh.pa.us>) |
Список | pgsql-performance |
Le 30/06/2014 16:04, Tom Lane a écrit : > Pujol Mathieu <mathieu.pujol@realfusio.com> writes: >> Le 29/06/2014 22:30, Tom Lane a écrit : >>> I don't actually understand what's being requested here that the >>> NotConsistent case doesn't already cover. >> The NotConsistent case is correctly covered, the sub nodes are not >> tested because I know that no child could pass the consistent_test. >> The MaybeConsistent case is also correctly covered, all sub nodes are >> tested because I don't know which sub nodes will pass the consistent_test. >> My problem is with the FullyConsistent, because when I test a node I can >> know that all it's childs nodes and leaves will pass the test, so I want >> to notify GIST framework that it can't skip consistent test on those >> nodes. Like we can notify it when testing a leaf that it could skip >> consistent test on the row. Maybe I miss something on the API to do >> that. On my tests, the "recheck_flag" works only for leaves. > Hm ... that doesn't seem like a case that'd come up often enough to be > worth complicating the APIs for, unless maybe you are expecting a lot > of exact-duplicate index entries. If you are, you might find that GIN > is a better fit for your problem than GIST --- it's designed to be > efficient for lots-of-duplicates. > > Another view of this is that if you can make exact satisfaction checks > at upper-page entries, you're probably storing too much information in > the index entries (and thereby bloating the index). The typical tradeoff > in GIST indexes is something like storing bounding boxes for geometric > objects --- which is necessarily lossy, but it results in small indexes > that are fast to search. It's particularly important for upper-page > entries to be small, so that fanout is high and you have a better chance > of keeping all the upper pages in cache. > > If you've got a compelling example where this actually makes sense, > I'd be curious to hear the details. > > regards, tom lane > > Hi, I have a table containing several millions of rows, and each row contains an unique integer as identifier. My goal is to select all rows which have an identifier that is contained into at least one range of a list. CREATE TABLE t (id int4 UNIQUE, ...) SELECT * FROM t WHERE id @@> ARRAY[...]::range[] I use a custom operator to check if an integer is contained in an array of ranges (a custom structure defined by my plugin). I build my own GIST to speedup those requests. So each node of my GIST is represented by a range (like a BVH or octree of 3D boxes). I have no duplicated keys in my index. When I run my tests I am able to quickly discard entire portions of the index which leads to great performance improvements. During GIST traversal when I test consistency of a node (isRangeOverlap(range,range[]) the test return Fully, Partially, No. So when the tests return Fully I know for sure that all subnodes will also return Fully. In practice my operator is not free in execution time, so if I could propagate the information on subnodes it will allow to save lot of computation time. This optimization could also be benefical for cube extension. I don't think that it would complicate the API, we could use existing recheck_flag. Today this value is used only for leaves node. Maybe it could be propagated by GIST API from a node to its subnodes. So if a node set it to false, it will be false for it's subnodes allowing client to use it or no. So the API could remain the same without changes for existing plugins and need only small memory to propagate this boolean. I already achieve great performance improvements with my GIST, my request is to optimize it in use cases that select several rows to limit overhead of my consistent_operator. regards, mathieu pujol
В списке pgsql-performance по дате отправления: