Re: [SQL] "SELECT IN" Still Broken in 7.4b

Поиск
Список
Период
Сортировка
От Shridhar Daithankar
Тема Re: [SQL] "SELECT IN" Still Broken in 7.4b
Дата
Msg-id 3F467FF5.8984.281BB3@localhost
обсуждение исходный текст
Ответ на Re: [SQL] "SELECT IN" Still Broken in 7.4b  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-hackers
On 21 Aug 2003 at 16:42, Tom Lane wrote:

> Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
> > Within the scope of the new hashed IN stuff I believe so in at least some
> > cases.  I have a few million row table of integers where searching for
> > values IN (~10000 values) takes longer than creating the temp table,
> > copying into it and doing the in subquery.
> 
> I did some profiling and soon realized that the main problem is the
> executor's trick for not returning the same row multiple times in a
> multiple-indexscan plan node.  The point is that given
>     WHERE a = 1 OR b = 1
> you could create a plan that first indexscans on a, then indexscans on
> b --- but you mustn't return any tuples in the second scan that you
> already returned in the first.  IndexNext solves this by evaluating the
> prior-scan index conditions to see if they are true.  Which is okay if
> there aren't too many of them.  But when you have an N-element IN list
> this means you are going to do O(N^2) index expression evaluations.
> In the 10000-element IN-list test case, ExecQual gets called almost
> 50 million times :-(
> 
> I'm toying with the notion of ripping out that logic and instead
> building an in-memory hashtable of already-returned TIDs.  This could
> theoretically run out of memory if the multiple indexscan returns enough
> tuples, but I think in practice that wouldn't happen because the planner
> wouldn't choose an indexscan when very large numbers of tuples are being
> selected.

If I understand it correctly, we are essentially looking at ~10000 individual 
index scans, in above example, isn't it? So if planner takes this in account, 
it probably would not opt for seq. scan.

Other idea could be, index the in list first and search the index using 
locality of values in the in list. If the 10K values in the list are between 
10K-20K, they should be pretty easy to locate with single index sweep, assuming 
uniform distribution. May be planner could build this IN list index before 
drawing plan to determine cardinality and distribution of individual values.

On the least side, it could draw a plan for fetching a tuple range between min 
and max of IN values and building in memory hash of these values for 
comparison. That's would surely be cheaper than scanning entire table/result 
set over ad over 

Frankly, I would say a temp table is far better solution..:-)

Just a thought...

ByeShridhar

--
Youth doesn't excuse everything.        -- Dr. Janice Lester (in Kirk's body), 
"Turnabout Intruder",           stardate 5928.5.



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Jan Wieck
Дата:
Сообщение: Re: [GENERAL] Buglist
Следующее
От: Jan Wieck
Дата:
Сообщение: Re: [GENERAL] Buglist