Thread: Limit Heap Fetches / Rows Removed by Filter in Index Scans


Limit Heap Fetches / Rows Removed by Filter in Index Scans

From: Victor Blomqvist
Date:
Hi,

Is it possible to break/limit a query so that it returns whatever results it has found after checking X rows in an index scan?

For example:
create table a(id int primary key);
insert into a select * from generate_series(1,100000);

select * from a
where id%2 = 0
order by id limit 10

In this case the query will "visit" 20 rows and filter out 10 of them. We can see that in the query plan:
"Rows Removed by Filter: 10"
"Heap Fetches: 20"
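(For completeness, those counters come from running the query under EXPLAIN (ANALYZE); a sketch of the output, abridged and with illustrative numbers:)

```sql
-- Sketch of reproducing the counters above (output abridged;
-- exact numbers depend on visibility-map/vacuum state and version):
explain (analyze, buffers)
select * from a
where id%2 = 0
order by id limit 10;

--  Limit (actual rows=10 ...)
--    ->  Index Only Scan using a_pkey on a (actual rows=10 ...)
--          Filter: ((id % 2) = 0)
--          Rows Removed by Filter: 10
--          Heap Fetches: 20
```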

Is it somehow possible to limit this query so that it performs at most X heap fetches? In my example, if we limited it to 10 heap fetches, the query would return the first 5 rows.


My use case is a table with 35 million rows and a geo index, where I want to do a KNN search but also restrict the query on some other parameters. In some cases the other parameters restrict the query so much that Heap Fetches grows to several hundred thousand or more, and in those cases I would like to put a limit on my query.
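(For context, the real query has roughly this shape; the table, column, and filter names below are hypothetical, not the actual schema:)

```sql
-- Hypothetical shape of the real query: KNN ordering via a GiST geo index,
-- plus an extra filter that can force many heap fetches before LIMIT is met:
select id
from places
where category = 'cafe'   -- the restrictive extra parameter
order by geom <-> st_setsrid(st_makepoint(103.85, 1.29), 4326)
limit 10;
```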

Thanks!
/Victor

Re: Limit Heap Fetches / Rows Removed by Filter in Index Scans

From: Sameer Kumar
Date:


On Fri, 19 Aug 2016, 1:07 p.m. Victor Blomqvist, <vb@viblo.se> wrote:
Hi,

Is it possible to break/limit a query so that it returns whatever results it has found after checking X rows in an index scan?

For example:
create table a(id int primary key);
insert into a select * from generate_series(1,100000);

select * from a
where id%2 = 0
order by id limit 10

In this case the query will "visit" 20 rows and filter out 10 of them. We can see that in the query plan:
"Rows Removed by Filter: 10"
"Heap Fetches: 20"

Is it somehow possible to limit this query so that it performs at most X heap fetches? In my example, if we limited it to 10 heap fetches, the query would return the first 5 rows.


My use case is a table with 35 million rows and a geo index, where I want to do a KNN search but also restrict the query on some other parameters. In some cases the other parameters restrict the query so much that Heap Fetches grows to several hundred thousand or more, and in those cases I would like to put a limit on my query.

Have you checked the TABLESAMPLE clause in v9.5?



Thanks!
/Victor
--
Best Regards
Sameer Kumar | DB Solution Architect 
ASHNIK PTE. LTD.

101 Cecil Street, #11-11 Tong Eng Building, Singapore 069 533

T: +65 6438 3504 | M: +65 8110 0350 

Skype: sameer.ashnik | www.ashnik.com

Re: Limit Heap Fetches / Rows Removed by Filter in Index Scans

From: Victor Blomqvist
Date:

On Fri, Aug 19, 2016 at 1:31 PM, Sameer Kumar <sameer.kumar@ashnik.com> wrote:


On Fri, 19 Aug 2016, 1:07 p.m. Victor Blomqvist, <vb@viblo.se> wrote:
[original question snipped]

Have you checked the TABLESAMPLE clause in v9.5?



Unless I misunderstand what you mean or how it works, I can't really see how it would help.

I want my query to still return the "best" results, and I want it to use the index for that. Randomly sampling from the whole table would either cover too small a subset of the rows or be too slow.

So, given my query above, in the normal ("slow") case I would find the first 10 even rows:
2,4,6,8,10,12,14,16,18,20
If I could restrict the heap fetches to 10 I would find
2,4,6,8,10
However, with TABLESAMPLE I might end up with, for example, these rows:
15024,71914,51682,7110,61802,63390,98278,8022,34256,49220

In my use case I want the best rows (according to the order by), so just a random sample is not good enough.

/Victor
 



Re: Limit Heap Fetches / Rows Removed by Filter in Index Scans

From: Sameer Kumar
Date:


On Fri, Aug 19, 2016 at 2:25 PM Victor Blomqvist <vb@viblo.se> wrote:
On Fri, Aug 19, 2016 at 1:31 PM, Sameer Kumar <sameer.kumar@ashnik.com> wrote:


On Fri, 19 Aug 2016, 1:07 p.m. Victor Blomqvist, <vb@viblo.se> wrote:
[original question snipped]

Have you checked the TABLESAMPLE clause in v9.5?



Unless I misunderstand what you mean or how it works, I can't really see how it would help.


I stand corrected. TABLESAMPLE will not help you.
 
I want my query to still return the "best" results, and I want it to use the index for that. Randomly sampling from the whole table would either cover too small a subset of the rows or be too slow.

So, given my query above, in the normal ("slow") case I would find the first 10 even rows:
2,4,6,8,10,12,14,16,18,20
If I could restrict the heap fetches to 10 I would find
2,4,6,8,10
However, with TABLESAMPLE I might end up with, for example, these rows:
15024,71914,51682,7110,61802,63390,98278,8022,34256,49220


How about using LIMIT?

SELECT column_1, column_2, ... FROM my_table
WHERE <<expression>>
ORDER BY my_column
LIMIT 10;

 
In my use case I want the best rows (according to the order by), so just a random sample is not good enough.

/Victor
 



Re: Limit Heap Fetches / Rows Removed by Filter in Index Scans

From: Francisco Olarte
Date:
Hi Victor:

On Fri, Aug 19, 2016 at 7:06 AM, Victor Blomqvist <vb@viblo.se> wrote:
> Is it possible to break/limit a query so that it returns whatever results
> found after having checked X rows in an index scan?
>
> For example:
> create table a(id int primary key);
> insert into a select * from generate_series(1,100000);
>
> select * from a
> where id%2 = 0
> order by id limit 10
>
> In this case the query will "visit" 20 rows and filter out 10 of them. We
> can see that in the query plan:
> "Rows Removed by Filter: 10"
> "Heap Fetches: 20"
>
> Is it somehow possible to limit this query so that it only fetches X amount,
> in my example if we limited it to 10 Heap Fetches the query could return the
> first 5 rows?
>
>
> My use case is I have a table with 35 million rows with a geo index, and I
> want to do a KNN search but also limit the query on some other parameters.
> In some cases the other parameters restrict the query so much that Heap
> Fetches becomes several 100k or more, and in those cases I would like to
> have a limit to my query.

Well, if you accept more abstract limits (i.e. you do not depend on
heap fetches, you just want up to 5 even IDs from the first 10 IDs )
you could try:

with base as (select * from a order by id limit 10)
select * from base where id %2 = 0 order by id limit 5;

( Or do it with a subquery instead of a CTE ).

In general

select * from table where common_condition and filter_condition order
by xx limit N

becomes

with base as (select * from table where common_condition order by xx
limit base_fetches)
select * from base where filter_condition order by xx limit N;

In the example the common_condition is nonexistent; treat it as true and optimize after transforming.
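(Concretely, instantiated for the toy table `a` from the original post, with base_fetches = 10 and N = 5, and no common_condition:)

```sql
-- The generic rewrite, applied to the example table `a`:
select *
from (select * from a order by id limit 10) base  -- caps rows visited (base_fetches)
where id%2 = 0                                    -- filter_condition
order by id
limit 5;                                          -- N
```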

Francisco Olarte.


Re: Limit Heap Fetches / Rows Removed by Filter in Index Scans

From: Victor Blomqvist
Date:


On Fri, Aug 19, 2016 at 6:01 PM, Francisco Olarte <folarte@peoplecall.com> wrote:
[quoted message snipped]

Sorry if that wasn't clear in my first email. This whole problem is part of a 100-line function, so I tried to simplify it as much as possible for my question, but maybe I went a bit too far. I will try again:

The problem with this approach is that I "waste" the number of rows in those cases when it doesn't need to "loop" 10 times to fetch my result. Let me show a slightly more complicated example:

create table b(id serial primary key, age_preference int);
insert into b (age_preference)
  select (random()*100)::integer from generate_series(1,1000000);

select * from b
where age_preference%10 = x
order by id limit 10;


So here I have a table with id and age_preference, where age_preference is the preferred age of users in my database. The distribution of age preferences is not uniform in real life (only in my mock data here). It's also changing, so I can't know it beforehand, only make some very rough estimations.

What I do know is that for most values of x it's enough to visit the first 1k rows. However, sometimes I need more rows, maybe 100k. And for some special x (such as 11 in my stupid example), it's not enough even with 1M fetches.

What I want to avoid is my query visiting the whole 1M rows to get a result, because on my real table that can take 100 seconds. At the same time I want the queries that only need to visit 1k rows to finish quickly, and the queries that visit 100k rows to at least get some result back.


Thank you for your patience so far!
/Victor

Re: Limit Heap Fetches / Rows Removed by Filter in Index Scans

From: Francisco Olarte
Date:
Hi Victor:


On Fri, Aug 19, 2016 at 7:02 PM, Victor Blomqvist <vb@viblo.se> wrote:
> What I want to avoid is my query visiting the whole 1m rows to get a result,
> because in my real table that can take 100sec. At the same time I want the
> queries that only need to visit 1k rows finish quickly, and the queries that
> visit 100k rows at least get some result back.

You are going to have problems with that. If you just want to cap it at 100k rows visited and 10 results, my solution works, probably better as nested selects than CTEs, but someone more knowledgeable about the optimizer will need to weigh in (or tests will be needed). This is because "the queries that visit 100k rows at least get some result back" may be false: you may need to visit the whole 1M rows to get the first result if you are unlucky. Just set ap=999 where id=1M and ask for ap>=999 and you've got that degenerate case, which can only be saved if you have an index on ap (even with statistics, you would need a full table scan to find it).

If you are positive some results are in the first 100k rows, then my method works fine; how fast will need to be tested with the real data. You can even retry with 10x, 100x, 1000x the real limit until you have enough results, if you want to time-limit your queries.
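(One possible sketch of that escalation, as a plpgsql function over the `b` table from earlier in the thread; the function name and starting cap are made up, and this could be combined with statement_timeout for a hard time bound:)

```sql
-- Hypothetical sketch: retry with a 10x larger inner limit each round,
-- so cheap cases finish fast and expensive ones stay capped at 1M visits.
create or replace function find_best(x int, want int)
returns setof b
language plpgsql stable as $$
declare
  fetch_cap int := 1000;   -- starting budget of rows to visit
  n bigint;
begin
  loop
    select count(*) into n
    from (select * from b order by id limit fetch_cap) sub
    where sub.age_preference % 10 = x;
    exit when n >= want or fetch_cap >= 1000000;
    fetch_cap := fetch_cap * 10;
  end loop;
  return query
    select *
    from (select * from b order by id limit fetch_cap) sub
    where sub.age_preference % 10 = x
    order by sub.id
    limit want;
end;
$$;
```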


Francisco Olarte.


Re: Limit Heap Fetches / Rows Removed by Filter in Index Scans

From: Victor Blomqvist
Date:

On Sat, Aug 20, 2016 at 1:13 AM, Francisco Olarte <folarte@peoplecall.com> wrote:
[quoted message snipped]


Thanks! A subselect seems to do it.

I didn't think of it before; I guess I got blinded by the CTE, since usually it's the other way around and the CTE is the answer to the problem. But it seems the easy solution with a good old subselect fixes it. Now I feel a bit embarrassed by such an easy answer :)


Checking these two queries, I can see that the first one visits the max 50 rows it's allowed to and returns 5 rows, while the second one finishes after 13 rows fetched and returns the full 10 rows.

select *
    from (select * from b order by id limit 50) x
    where age_preference%10 < 1
    order by id limit 10

select *
    from (select * from b order by id limit 50) x
    where age_preference%10 < 5
    order by id limit 10

/Victor
 

Re: Limit Heap Fetches / Rows Removed by Filter in Index Scans

From: Francisco Olarte
Date:
Victor:

On Fri, Aug 19, 2016 at 8:01 PM, Victor Blomqvist <vb@viblo.se> wrote:

> Thanks! A sub select seems to do it.

I suspected it would be needed; that's why I suggested it.

I prefer to write the queries as CTEs because they are much easier to read, but IIRC they act as a kind of 'optimization fence'. Sometimes I end up using psql macros to write queries in CTE-like order in my scripts and compose them into subselects when I have these problems.
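(A side note for later readers: from PostgreSQL 12 on, plain CTEs are no longer unconditional optimization fences; the planner may inline them, and the old fencing behavior has to be requested explicitly:)

```sql
-- PostgreSQL 12+: MATERIALIZED forces the old fence behavior;
-- NOT MATERIALIZED (or the default, when safe) lets the planner inline the CTE.
with base as materialized (
  select * from b order by id limit 50
)
select *
from base
where age_preference % 10 < 5
order by id
limit 10;
```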

> Checking these two queries I can see that the first one visits the
> max 50 rows its allowed to and returns 5 rows, while the second one
> finish off after 13 rows fetched and returns the full 10 rows.

Good. The only problem is that you are not guaranteed a result, as in the contrived example I gave, but if this is what you want, it is a way to go.

Francisco Olarte.