Обсуждение: maximum columns for brin bloom indexes

Поиск
Список
Период
Сортировка

maximum columns for brin bloom indexes

От
Jaime Casanova
Дата:
Hi everyone,

When testing brin bloom indexes I noted that we need to reduce the
PAGES_PER_RANGE parameter of the index to allow more columns on it.

Sadly, this could be a problem if you create the index before the table
grows, once it reaches some number of rows (i see the error as early as
1000 rows) it starts error out.

    create table t1(i int, j int);
    
    -- uses default PAGES_PER_RANGE=128
    create index on t1 using brin(i int4_bloom_ops, j int4_bloom_ops ) ;
    
    insert into t1 
        select random()*1000, random()*1000 from generate_series(1, 1000);
    ERROR:  index row size 8968 exceeds maximum 8152 for index "t1_i_j_idx"

if instead you create the index with a minor PAGES_PER_RANGE it goes
fine, in this case it works once you reduce it to at least 116

    create index on t1 using brin(i int4_bloom_ops, j int4_bloom_ops ) 
        with (pages_per_range=116);


so, for having:
two int columns - PAGES_PER_RANGE should be max 116
three int columns - PAGES_PER_RANGE should be max 77
one int and one timestamp - PAGES_PER_RANGE should be max 121 

and so on

-- 
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL



Re: maximum columns for brin bloom indexes

От
Tomas Vondra
Дата:
On 4/8/21 9:08 AM, Jaime Casanova wrote:
> Hi everyone,
> 
> When testing brin bloom indexes I noted that we need to reduce the
> PAGES_PER_RANGE parameter of the index to allow more columns on it.
> 
> Sadly, this could be a problem if you create the index before the table
> grows, once it reaches some number of rows (i see the error as early as
> 1000 rows) it starts error out.
> 
>     create table t1(i int, j int);
>     
>     -- uses default PAGES_PER_RANGE=128
>     create index on t1 using brin(i int4_bloom_ops, j int4_bloom_ops ) ;
>     
>     insert into t1 
>         select random()*1000, random()*1000 from generate_series(1, 1000);
>     ERROR:  index row size 8968 exceeds maximum 8152 for index "t1_i_j_idx"
> 
> if instead you create the index with a minor PAGES_PER_RANGE it goes
> fine, in this case it works once you reduce it to at least 116
> 
>     create index on t1 using brin(i int4_bloom_ops, j int4_bloom_ops ) 
>         with (pages_per_range=116);
> 
> 
> so, for having:
> two int columns - PAGES_PER_RANGE should be max 116
> three int columns - PAGES_PER_RANGE should be max 77
> one int and one timestamp - PAGES_PER_RANGE should be max 121 
> 
> and so on
> 

No, because this very much depends on the number if distinct values in
the page page range, which determines how well the bloom filter
compresses. You used 1000, but that's just an arbitrary value and the
actual data might have any other value. And it's unlikely that all three
columns will have the same number of distinct values.

Of course, this also depends on the false positive rate.

FWIW I doubt people are using multi-column BRIN indexes very often.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: maximum columns for brin bloom indexes

От
Jaime Casanova
Дата:
On Thu, Apr 08, 2021 at 12:18:36PM +0200, Tomas Vondra wrote:
> On 4/8/21 9:08 AM, Jaime Casanova wrote:
> > Hi everyone,
> > 
> > When testing brin bloom indexes I noted that we need to reduce the
> > PAGES_PER_RANGE parameter of the index to allow more columns on it.
> > 
> > Sadly, this could be a problem if you create the index before the table
> > grows, once it reaches some number of rows (i see the error as early as
> > 1000 rows) it starts error out.
> > 
> >     create table t1(i int, j int);
> >     
> >     -- uses default PAGES_PER_RANGE=128
> >     create index on t1 using brin(i int4_bloom_ops, j int4_bloom_ops ) ;
> >     
> >     insert into t1 
> >         select random()*1000, random()*1000 from generate_series(1, 1000);
> >     ERROR:  index row size 8968 exceeds maximum 8152 for index "t1_i_j_idx"
> > 
> > if instead you create the index with a minor PAGES_PER_RANGE it goes
> > fine, in this case it works once you reduce it to at least 116
> > 
> >     create index on t1 using brin(i int4_bloom_ops, j int4_bloom_ops ) 
> >         with (pages_per_range=116);
> > 
> > 
> > so, for having:
> > two int columns - PAGES_PER_RANGE should be max 116
> > three int columns - PAGES_PER_RANGE should be max 77
> > one int and one timestamp - PAGES_PER_RANGE should be max 121 
> > 
> > and so on
> > 
> 
> No, because this very much depends on the number if distinct values in
> the page page range, which determines how well the bloom filter
> compresses. You used 1000, but that's just an arbitrary value and the
> actual data might have any other value. And it's unlikely that all three
> columns will have the same number of distinct values.
>

Ok, that makes sense. Still I see a few odd things: 

    """
    drop table if exists t1;
    create table t1(i int, j int);
    create index on t1 using brin(i int4_bloom_ops, j int4_bloom_ops ) ;

    -- This one will succeed, I guess because it has less different
    -- values
    insert into t1
    select random()*20, random()*100 from generate_series(1, 1000);

    -- succeed
    insert into t1
    select random()*20, random()*100 from generate_series(1, 100000);

    -- succeed
    insert into t1
    select random()*200, random()*1000 from generate_series(1, 1000);

    -- succeed
    insert into t1
    select random()*200, random()*1000 from generate_series(1, 1000);

    -- succeed? This is the case it has been causing problems before
    insert into t1
    select random()*1000, random()*1000 from generate_series(1, 1000);
    """

Maybe this makes sense, but it looks random to me. If it makes sense
this is something we should document better. 

Let's try another combination:

    """
    drop table if exists t1;
    create table t1(i int, j int);
    create index on t1 using brin(i int4_bloom_ops, j int4_bloom_ops ) ;

    -- this fails again
    insert into t1
    select random()*1000, random()*1000 from generate_series(1, 1000);

    -- and this starts to fail now, but this worked before
    insert into t1
    select random()*20, random()*100 from generate_series(1, 1000);
    """

> Of course, this also depends on the false positive rate.
> 

How the false positive rate work?

> FWIW I doubt people are using multi-column BRIN indexes very often.
> 

true. 

Another question, should we allow to create a brin multi column index
that uses different opclasses?

CREATE INDEX ON t1 USING brin (i int4_bloom_ops, j int4_minmax_ops);

-- 
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL



Re: maximum columns for brin bloom indexes

От
Tomas Vondra
Дата:

On 4/8/21 4:49 PM, Jaime Casanova wrote:
> On Thu, Apr 08, 2021 at 12:18:36PM +0200, Tomas Vondra wrote:
>> On 4/8/21 9:08 AM, Jaime Casanova wrote:
>>> Hi everyone,
>>>
>>> When testing brin bloom indexes I noted that we need to reduce the
>>> PAGES_PER_RANGE parameter of the index to allow more columns on it.
>>>
>>> Sadly, this could be a problem if you create the index before the table
>>> grows, once it reaches some number of rows (i see the error as early as
>>> 1000 rows) it starts error out.
>>>
>>>     create table t1(i int, j int);
>>>     
>>>     -- uses default PAGES_PER_RANGE=128
>>>     create index on t1 using brin(i int4_bloom_ops, j int4_bloom_ops ) ;
>>>     
>>>     insert into t1 
>>>         select random()*1000, random()*1000 from generate_series(1, 1000);
>>>     ERROR:  index row size 8968 exceeds maximum 8152 for index "t1_i_j_idx"
>>>
>>> if instead you create the index with a minor PAGES_PER_RANGE it goes
>>> fine, in this case it works once you reduce it to at least 116
>>>
>>>     create index on t1 using brin(i int4_bloom_ops, j int4_bloom_ops ) 
>>>         with (pages_per_range=116);
>>>
>>>
>>> so, for having:
>>> two int columns - PAGES_PER_RANGE should be max 116
>>> three int columns - PAGES_PER_RANGE should be max 77
>>> one int and one timestamp - PAGES_PER_RANGE should be max 121 
>>>
>>> and so on
>>>
>>
>> No, because this very much depends on the number if distinct values in
>> the page page range, which determines how well the bloom filter
>> compresses. You used 1000, but that's just an arbitrary value and the
>> actual data might have any other value. And it's unlikely that all three
>> columns will have the same number of distinct values.
>>
> 
> Ok, that makes sense. Still I see a few odd things: 
> 
>     """
>     drop table if exists t1;
>     create table t1(i int, j int);
>     create index on t1 using brin(i int4_bloom_ops, j int4_bloom_ops ) ;
> 
>     -- This one will succeed, I guess because it has less different
>     -- values
>     insert into t1
>     select random()*20, random()*100 from generate_series(1, 1000);
> 
>     -- succeed
>     insert into t1
>     select random()*20, random()*100 from generate_series(1, 100000);
> 
>     -- succeed
>     insert into t1
>     select random()*200, random()*1000 from generate_series(1, 1000);
> 
>     -- succeed
>     insert into t1
>     select random()*200, random()*1000 from generate_series(1, 1000);
> 
>     -- succeed? This is the case it has been causing problems before
>     insert into t1
>     select random()*1000, random()*1000 from generate_series(1, 1000);
>     """
> 
> Maybe this makes sense, but it looks random to me. If it makes sense
> this is something we should document better. 
> 

Presumably it's about where exactly are the new rows added, and when we
summarize the page range.

> Let's try another combination:
> 
>     """
>     drop table if exists t1;
>     create table t1(i int, j int);
>     create index on t1 using brin(i int4_bloom_ops, j int4_bloom_ops ) ;
> 
>     -- this fails again
>     insert into t1
>     select random()*1000, random()*1000 from generate_series(1, 1000);
> 
>     -- and this starts to fail now, but this worked before
>     insert into t1
>     select random()*20, random()*100 from generate_series(1, 1000);
>     """
> 
>> Of course, this also depends on the false positive rate.
>>
> 
> How the false positive rate work?
> 

The lower the false positive rate, the more "accurate" the index is,
because fewer page ranges not containing the value will be added to the
bitmap. The bloom filter however has to be larger.

>> FWIW I doubt people are using multi-column BRIN indexes very often.
>>
> 
> true. 
> 
> Another question, should we allow to create a brin multi column index
> that uses different opclasses?
> 
> CREATE INDEX ON t1 USING brin (i int4_bloom_ops, j int4_minmax_ops);
> 

Why not? Without that you couldn't create index on (int, bigint) because
those are in principle different opclasses too. I don't see what would
this restriction give us.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company