Обсуждение: PG 9.3 materialized view VS Views, indexes, shared memory

Поиск

Список

Период

Сортировка

PG 9.3 materialized view VS Views, indexes, shared memory

От

Nicolas Paris

Дата:

20 февраля 2015 г., 13:28:39

Hello !

I have a huge table, 1 bilion rows, with many indexes.

I have many materialysed view (MV), subsets of this huge table, with same kind indexes.

I have many users, querying thoses MV.

I have a storage problem, because of MV physical memory use.

I wonder :

If I replace MV with classical Views, the only indexes that will be used will be the huge table's one. As all users will query on the same indexes, is will always be loaded in memory, right ? This will be shared, I mean if 10 users query the same time, will it use 10*ram memory for indexes or juste 1 time that ram ?

I terms of performances, will MV better than simple Views in my case ?

Thanks for explanation by advance

Nicolas PARIS

Re: PG 9.3 materialized view VS Views, indexes, shared memory

От

Matheus de Oliveira

Дата:

20 февраля 2015 г., 15:37:06

On Fri, Feb 20, 2015 at 8:28 AM, Nicolas Paris <niparisco@gmail.com> wrote:

If I replace MV with classical Views, the only indexes that will be used will be the huge table's one. As all users will query on the same indexes, is will always be loaded in memory, right ? This will be shared, I mean if 10 users query the same time, will it use 10*ram memory for indexes or juste 1 time that ram ?

Once one user load pages into the shared_buffer (or even OS memory cache), subsequent users that requests the same pages will read from there (from the memory), it is valid from pages of any kind of relation (MVs, tables, indexes, etc.). So if 10 users use the same index, then the pages read from it will be loaded in memory only once (unless it doesn't fit ram/shared_buffer, of course).

I terms of performances, will MV better than simple Views in my case ?

We'd need a lot more of information to answer this question. I tend to recommend people to try simpler approaches (in your case "simple views") and only move to more robust ones if the performance of this one is bad.

By the little information you gave, looks like the queries gets a well defined subset of this big table, so you should also consider:

- Creating partial indexes for the subsets, or at least the most accessed ones;

- Partitioning the table (be really careful with that and make sure you actually use the partition keys).

Regards,

Matheus de Oliveira
Analista de Banco de Dados
Dextra Sistemas - MPS.Br nível F!
www.dextra.com.br/postgres

Re: PG 9.3 materialized view VS Views, indexes, shared memory

От

Nicolas Paris

Дата:

20 февраля 2015 г., 16:06:32

Thanks,

I like the idea of partial indexes mixed with simple Views

So question :

huge_table{

id,

field

}

CREATE INDEX idx_huge_table ON huge_table(id)

CREATE INDEX idx_huge_table_for_view1 ON huge_table(id) WHERE id IN (1,2,3)

CREATE VIEW view1 AS SELECT * FROM huge_table WHERE id IN (1,2,3)

Do the following query uses idx_huge_table_for_view1 ?

SELECT * FROM view1

WHERE field LIKE 'brillant idea'

In other words, do all queries on view1 will use the partial index (and never the idx_hute_table ) ?

Nicolas PARIS

2015-02-20 13:36 GMT+01:00 Matheus de Oliveira <matioli.matheus@gmail.com>:

On Fri, Feb 20, 2015 at 8:28 AM, Nicolas Paris <niparisco@gmail.com> wrote:
If I replace MV with classical Views, the only indexes that will be used will be the huge table's one. As all users will query on the same indexes, is will always be loaded in memory, right ? This will be shared, I mean if 10 users query the same time, will it use 10*ram memory for indexes or juste 1 time that ram ?

Once one user load pages into the shared_buffer (or even OS memory cache), subsequent users that requests the same pages will read from there (from the memory), it is valid from pages of any kind of relation (MVs, tables, indexes, etc.). So if 10 users use the same index, then the pages read from it will be loaded in memory only once (unless it doesn't fit ram/shared_buffer, of course).

I terms of performances, will MV better than simple Views in my case ?

We'd need a lot more of information to answer this question. I tend to recommend people to try simpler approaches (in your case "simple views") and only move to more robust ones if the performance of this one is bad.

By the little information you gave, looks like the queries gets a well defined subset of this big table, so you should also consider:

- Creating partial indexes for the subsets, or at least the most accessed ones;
- Partitioning the table (be really careful with that and make sure you actually use the partition keys).

Regards,
--
Matheus de Oliveira
Analista de Banco de Dados
Dextra Sistemas - MPS.Br nível F!
www.dextra.com.br/postgres

Re: PG 9.3 materialized view VS Views, indexes, shared memory

От

Matheus de Oliveira

Дата:

20 февраля 2015 г., 17:45:22

On Fri, Feb 20, 2015 at 11:06 AM, Nicolas Paris <niparisco@gmail.com> wrote:

Thanks,

I like the idea of partial indexes mixed with simple Views
So question :

huge_table{
id,
field
}
CREATE INDEX idx_huge_table ON huge_table(id)
CREATE INDEX idx_huge_table_for_view1 ON huge_table(id) WHERE id IN (1,2,3)

CREATE VIEW view1 AS SELECT * FROM huge_table WHERE id IN (1,2,3)

Do the following query uses idx_huge_table_for_view1 ?
SELECT * FROM view1
WHERE field LIKE 'brillant idea'

In other words, do all queries on view1 will use the partial index (and never the idx_hute_table ) ?

You can try that pretty easily:

    postgres=# CREATE TEMP TABLE huge_table(id int, field text);
    CREATE TABLE
    postgres=# CREATE INDEX huge_table_id_idx ON huge_table(id);
    CREATE INDEX
    postgres=# CREATE INDEX huge_table_id_partial_idx ON huge_table(id) WHERE id IN (1,2,3);
    CREATE INDEX
    postgres=# CREATE TEMP VIEW view1 AS SELECT * FROM huge_table WHERE id IN (1,2);
    CREATE VIEW
    postgres=# SET enable_seqscan TO off;
    SET
    postgres=# SET enable_bitmapscan To off;
    SET
    postgres=# EXPLAIN SELECT * FROM view1 WHERE field LIKE 'foo%';
                                              QUERY PLAN
    ----------------------------------------------------------------------------------------------
     Index Scan using huge_table_id_partial_idx on huge_table (cost=0.12..36.41 rows=1 width=36)
       Index Cond: (id = ANY ('{1,2}'::integer[]))
       Filter: (field ~~ 'foo%'::text)
    (3 rows)

I expect that to happen always, unless you have another index that matches better the filter from outside the view.

Regards,
--

Matheus de Oliveira
Analista de Banco de Dados
Dextra Sistemas - MPS.Br nível F!
www.dextra.com.br/postgres

Re: PG 9.3 materialized view VS Views, indexes, shared memory

От

Nicolas Paris

Дата:

20 февраля 2015 г., 19:19:53

It appears that in the predicate close (WHERE id IN (foo)), foo cannot depend on other table (join or other). It must be a list. I anderstand why (this must be static).

I can build a string value, but in some case, I will have a milion key list.

Postgresql do not have limitation in query size, and IN(...) keys number.

But creating a partial index, with a query of bilion character length is not an issue ? It looks like a little dirty, not ?

Thanks for all

Nicolas PARIS

2015-02-20 15:44 GMT+01:00 Matheus de Oliveira <matioli.matheus@gmail.com>:

On Fri, Feb 20, 2015 at 11:06 AM, Nicolas Paris <niparisco@gmail.com> wrote:
Thanks,

I like the idea of partial indexes mixed with simple Views
So question :

huge_table{
id,
field
}
CREATE INDEX idx_huge_table ON huge_table(id)
CREATE INDEX idx_huge_table_for_view1 ON huge_table(id) WHERE id IN (1,2,3)

CREATE VIEW view1 AS SELECT * FROM huge_table WHERE id IN (1,2,3)

Do the following query uses idx_huge_table_for_view1 ?
SELECT * FROM view1
WHERE field LIKE 'brillant idea'

In other words, do all queries on view1 will use the partial index (and never the idx_hute_table ) ?

You can try that pretty easily:

    postgres=# CREATE TEMP TABLE huge_table(id int, field text);
    CREATE TABLE
    postgres=# CREATE INDEX huge_table_id_idx ON huge_table(id);
    CREATE INDEX
    postgres=# CREATE INDEX huge_table_id_partial_idx ON huge_table(id) WHERE id IN (1,2,3);
    CREATE INDEX
    postgres=# CREATE TEMP VIEW view1 AS SELECT * FROM huge_table WHERE id IN (1,2);
    CREATE VIEW
    postgres=# SET enable_seqscan TO off;
    SET
    postgres=# SET enable_bitmapscan To off;
    SET
    postgres=# EXPLAIN SELECT * FROM view1 WHERE field LIKE 'foo%';
                                              QUERY PLAN
    ----------------------------------------------------------------------------------------------
     Index Scan using huge_table_id_partial_idx on huge_table (cost=0.12..36.41 rows=1 width=36)
       Index Cond: (id = ANY ('{1,2}'::integer[]))
       Filter: (field ~~ 'foo%'::text)
    (3 rows)

I expect that to happen always, unless you have another index that matches better the filter from outside the view.

Regards,
--
Matheus de Oliveira
Analista de Banco de Dados
Dextra Sistemas - MPS.Br nível F!
www.dextra.com.br/postgres

Re: PG 9.3 materialized view VS Views, indexes, shared memory

От

Nicolas Paris

Дата:

20 февраля 2015 г., 21:09:47

Well it seems that max query size for CREATE INDEX is 8160 character in my 9.3 postgresql version.

Then the only solution see is to add a new boolean field : huge_table.view1

and change predicat to "WHERE view1=1 "

But I may have 800 views.. adding 800 new fields indexed to the huge table is actually not a good idea. Too bad

Any idea to solve that partial view limitation?

Nicolas PARIS

2015-02-20 17:19 GMT+01:00 Nicolas Paris <niparisco@gmail.com>:

It appears that in the predicate close (WHERE id IN (foo)), foo cannot depend on other table (join or other). It must be a list. I anderstand why (this must be static).
I can build a string value, but in some case, I will have a milion key list.
Postgresql do not have limitation in query size, and IN(...) keys number.

But creating a partial index, with a query of bilion character length is not an issue ? It looks like a little dirty, not ?

Thanks for all

Nicolas PARIS

2015-02-20 15:44 GMT+01:00 Matheus de Oliveira <matioli.matheus@gmail.com>:

On Fri, Feb 20, 2015 at 11:06 AM, Nicolas Paris <niparisco@gmail.com> wrote:
Thanks,

I like the idea of partial indexes mixed with simple Views
So question :

huge_table{
id,
field
}
CREATE INDEX idx_huge_table ON huge_table(id)
CREATE INDEX idx_huge_table_for_view1 ON huge_table(id) WHERE id IN (1,2,3)

CREATE VIEW view1 AS SELECT * FROM huge_table WHERE id IN (1,2,3)

Do the following query uses idx_huge_table_for_view1 ?
SELECT * FROM view1
WHERE field LIKE 'brillant idea'

In other words, do all queries on view1 will use the partial index (and never the idx_hute_table ) ?

You can try that pretty easily:

    postgres=# CREATE TEMP TABLE huge_table(id int, field text);
    CREATE TABLE
    postgres=# CREATE INDEX huge_table_id_idx ON huge_table(id);
    CREATE INDEX
    postgres=# CREATE INDEX huge_table_id_partial_idx ON huge_table(id) WHERE id IN (1,2,3);
    CREATE INDEX
    postgres=# CREATE TEMP VIEW view1 AS SELECT * FROM huge_table WHERE id IN (1,2);
    CREATE VIEW
    postgres=# SET enable_seqscan TO off;
    SET
    postgres=# SET enable_bitmapscan To off;
    SET
    postgres=# EXPLAIN SELECT * FROM view1 WHERE field LIKE 'foo%';
                                              QUERY PLAN
    ----------------------------------------------------------------------------------------------
     Index Scan using huge_table_id_partial_idx on huge_table (cost=0.12..36.41 rows=1 width=36)
       Index Cond: (id = ANY ('{1,2}'::integer[]))
       Filter: (field ~~ 'foo%'::text)
    (3 rows)

I expect that to happen always, unless you have another index that matches better the filter from outside the view.

Regards,
--
Matheus de Oliveira
Analista de Banco de Dados
Dextra Sistemas - MPS.Br nível F!
www.dextra.com.br/postgres

Re: PG 9.3 materialized view VS Views, indexes, shared memory

От

Jim Nasby

Дата:

06 марта 2015 г., 04:41:14

On 2/20/15 12:09 PM, Nicolas Paris wrote:
> Well it seems that max  query size for CREATE INDEX is 8160 character in
> my 9.3 postgresql version.
> Then the only solution see is to add  a new boolean field : huge_table.view1
> and change predicat to "WHERE view1=1 "
> But I may have 800 views.. adding 800 new fields indexed to the huge
> table is actually not a good idea. Too bad
>
> Any idea to solve that partial view limitation?

If you have that many different views I doubt you want that many indexes
anyway.

Have you tried just hitting the base table and indexes directly, either
through plain views or just direct SQL?

Also, how frequently does data change in the huge table? This sounds
like a case where the visibility map could make a huge difference.

By the way, if all the Mat Views are in one schema that's already in the
search path, a very easy way to test this would be to create an
equivalent set of regular views in a different schema (which you can
probably do programmatically via pg_get_viewdef()) and then change the
search_path to put the new schema before the old.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

Re: PG 9.3 materialized view VS Views, indexes, shared memory

От

Nicolas Paris

Дата:

06 марта 2015 г., 11:17:06

If you have that many different views I doubt you want that many indexes anyway.

It's a datawarehouse, then each view is used by many user for each query.

Those views must be subset of the huge material table. All indexes are needed

Have you tried just hitting the base table and indexes directly, either through plain views or just direct SQL?

I have tried each. The performances are worst querying on a subset (the views) than querying on whole huge table when using the huge indexes

=> this is the solution I am implementing. (800 is not true, but in 10 years it maybe will be)

Actually, I have added a boolean column on the huge table for each views. This is the way each view is a subset of huge table (Create View as Select * FROM hugeTable WHERE columnX is true --etc 800 times). Then I create 800partials indexes on that column(create index...WHERE columnX is TRUE), for each view.

This works great as the query planer chooses the partials indexes when querying the little subset of the terrific table (potential 20bilion rows)

This is better than material views for some reasons :

- saves places on hard drive (columnX is boolean +same indexes - data for MatViews)

- saves time generating materialised views

This is quite more complicated because in the project, the number of view is increasing, and dynamic then :

- then adding new mat views is simple

- adding new views => adding new column on the huge table. It can take long time to update boolean for each tuple. Then I need to truncate/bulk load all data each time I add a new View. Other problem is dynamic number column table was a bit tricky to implement in an ETL soft such Talend, but the benefits are I hope great.

Nicolas PARIS

2015-03-06 2:40 GMT+01:00 Jim Nasby <Jim.Nasby@bluetreble.com>:

On 2/20/15 12:09 PM, Nicolas Paris wrote:
Well it seems that max query size for CREATE INDEX is 8160 character in
my 9.3 postgresql version.
Then the only solution see is to add a new boolean field : huge_table.view1
and change predicat to "WHERE view1=1 "
But I may have 800 views.. adding 800 new fields indexed to the huge
table is actually not a good idea. Too bad

Any idea to solve that partial view limitation?

If you have that many different views I doubt you want that many indexes anyway.

Have you tried just hitting the base table and indexes directly, either through plain views or just direct SQL?

Also, how frequently does data change in the huge table? This sounds like a case where the visibility map could make a huge difference.

By the way, if all the Mat Views are in one schema that's already in the search path, a very easy way to test this would be to create an equivalent set of regular views in a different schema (which you can probably do programmatically via pg_get_viewdef()) and then change the search_path to put the new schema before the old.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

Re: PG 9.3 materialized view VS Views, indexes, shared memory

От

Jim Nasby

Дата:

06 марта 2015 г., 12:32:31

On 3/6/15 2:16 AM, Nicolas Paris wrote:
>     If you have that many different views I doubt you want that many
>     indexes anyway.
>
> It's a datawarehouse, then each view is used by many user for each query.
> Those views must be subset of the huge material table. All indexes are
> needed

Yes, but they don't have to be partial.

>     Have you tried just hitting the base table and indexes directly,
>     either through plain views or just direct SQL?
>
>  I have tried each. The performances are worst querying on a subset
> (the views) than querying on whole huge table when using the huge indexes

You mean the materialized views, right? If so, that makes sense: Instead
of having all your users hitting one common set of data (your fact
table) you had them hitting a bunch of other data (the mat views). But
you still had other stuff hitting the fact table. So now you were
dealing with a lot more data than if you just stuck to the single fact
table.

> => this is the solution I am implementing. (800 is not true, but in 10
> years it maybe will be)

In 10 years we'll all be using quantum computers anyway... ;P

>  Actually, I have added a boolean column on the huge table for each
> views. This is the way each view is a subset of huge table (Create View
> as Select  * FROM hugeTable WHERE columnX is true --etc 800 times). Then
> I create 800partials indexes on that column(create index...WHERE columnX
> is TRUE), for each view.
> This works great as the query planer chooses the partials indexes when
> querying the little subset of the terrific table (potential 20bilion rows)
>
> This is better than material views for some reasons :
> - saves places on hard drive (columnX is boolean +same indexes - data
> for MatViews)
> - saves time generating materialised views

But this isn't better than the mat views because of a bunch of booleans;
it's better because it means less stain on the disk cache.

> This is quite more complicated because in the project, the number of
> view is increasing, and dynamic then :
> - then adding new mat views is simple
> - adding new views => adding new column on the huge table. It can take
> long time to update boolean for each tuple. Then I need to truncate/bulk
> load all data each time I add a new View. Other problem is dynamic
> number column table was a bit tricky to implement in an ETL soft such
> Talend, but the benefits are I hope great.

I think you'll ultimately be unhappy trying to go down this route, for
the reasons you mention, plus the very large amount of extra space
you'll be using. 800 booleans is 800 extra bytes for every row in your
fact table. That's a lot. Even if you used a bitmap instead (which means
you have to mess around with tracking which bit means what and probably
other problems as well) you're still looking at 100 bytes per row.
That's nothing to sneeze at.

My suggestion is to test using nothing but plain views and plain indexes
on the base table. I expect that some of those views will not perform
adequately, but many (or most) of them will be fine. For the views that
are too slow, look at what the expensive part of the view and
materialize *only that*. I suspect you'll find that when you do that
you'll discover that several views are slow because of the same thing,
so if you materialize that one thing one time you can then use it to
speed up several views.

Using that approach means you'll have a lot less data that you have to read.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

Re: PG 9.3 materialized view VS Views, indexes, shared memory

От

Nicolas Paris

Дата:

06 марта 2015 г., 13:25:15

Thanks Jim,

My suggestion is to test using nothing but plain views and plain indexes on the base table

Actualy the way I choose subset rows for views is complicated in terms of query. Then using simple views without partial indexes is terrible in terms of performance (I have tested that).

You mean the materialized views, right?

Well I have tested matviews, views without partial indexes, views with hashjoin on a key, ..

I think you'll ultimately be unhappy trying to go down this route, for the reasons you mention, plus the very large amount of extra space you'll be using. 800 booleans is 800 extra bytes for every row in your fact table. That's a lot. Even if you used a bitmap instead (which means you have to mess around with tracking which bit means what and probably other problems as well) you're still looking at 100 bytes per row. That's nothing to sneeze at.

Since each subset is about 5% (this number is decreasing when number of views increase) of the fact table, most boolean rows are null. This means 5% of 800 extra bytes, right ? I have choosen smallint, because of bulk load decrease csv size (true VS 1).

For the views that are too slow, look at what the expensive part of the view and materialize *only that*.

It would be great if I could, but all will be automatic, then It will be difficult to apply such rules that demands human analyse, and manual database modification, for one subset

Hope I have well anderstand you

Nicolas PARIS

2015-03-06 10:32 GMT+01:00 Jim Nasby <Jim.Nasby@bluetreble.com>:

On 3/6/15 2:16 AM, Nicolas Paris wrote:
If you have that many different views I doubt you want that many
indexes anyway.

It's a datawarehouse, then each view is used by many user for each query.
Those views must be subset of the huge material table. All indexes are
needed

Yes, but they don't have to be partial.

Have you tried just hitting the base table and indexes directly,
either through plain views or just direct SQL?

I have tried each. The performances are worst querying on a subset
(the views) than querying on whole huge table when using the huge indexes

You mean the materialized views, right? If so, that makes sense: Instead of having all your users hitting one common set of data (your fact table) you had them hitting a bunch of other data (the mat views). But you still had other stuff hitting the fact table. So now you were dealing with a lot more data than if you just stuck to the single fact table.

=> this is the solution I am implementing. (800 is not true, but in 10
years it maybe will be)

In 10 years we'll all be using quantum computers anyway... ;P

Actually, I have added a boolean column on the huge table for each
views. This is the way each view is a subset of huge table (Create View
as Select * FROM hugeTable WHERE columnX is true --etc 800 times). Then
I create 800partials indexes on that column(create index...WHERE columnX
is TRUE), for each view.
This works great as the query planer chooses the partials indexes when
querying the little subset of the terrific table (potential 20bilion rows)

This is better than material views for some reasons :
- saves places on hard drive (columnX is boolean +same indexes - data
for MatViews)
- saves time generating materialised views

But this isn't better than the mat views because of a bunch of booleans; it's better because it means less stain on the disk cache.

This is quite more complicated because in the project, the number of
view is increasing, and dynamic then :
- then adding new mat views is simple
- adding new views => adding new column on the huge table. It can take
long time to update boolean for each tuple. Then I need to truncate/bulk
load all data each time I add a new View. Other problem is dynamic
number column table was a bit tricky to implement in an ETL soft such
Talend, but the benefits are I hope great.

I think you'll ultimately be unhappy trying to go down this route, for the reasons you mention, plus the very large amount of extra space you'll be using. 800 booleans is 800 extra bytes for every row in your fact table. That's a lot. Even if you used a bitmap instead (which means you have to mess around with tracking which bit means what and probably other problems as well) you're still looking at 100 bytes per row. That's nothing to sneeze at.

My suggestion is to test using nothing but plain views and plain indexes on the base table. I expect that some of those views will not perform adequately, but many (or most) of them will be fine. For the views that are too slow, look at what the expensive part of the view and materialize *only that*. I suspect you'll find that when you do that you'll discover that several views are slow because of the same thing, so if you materialize that one thing one time you can then use it to speed up several views.

Using that approach means you'll have a lot less data that you have to read.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

Re: PG 9.3 materialized view VS Views, indexes, shared memory

От

Nicolas Paris

Дата:

06 марта 2015 г., 23:26:33

According to this link http://postgresql.nabble.com/NULL-saves-disk-space-td4344106.html

NULL values do not take place if only one other column are null for that row.

Boolean takes 1 byte wheras smallint 2bytes.

Then the space problem is not anymore a problem with boolean columns 95% empty

One thing that is really great with postgresql is transaction for drop table cascade, that allow te restore all stuf index, views on a rollback if problem in loading appears.

I hope using one transaction to drop/load many table is not a performance issue ?

Nicolas PARIS

2015-03-06 11:25 GMT+01:00 Nicolas Paris <niparisco@gmail.com>:

Thanks Jim,

My suggestion is to test using nothing but plain views and plain indexes on the base table
Actualy the way I choose subset rows for views is complicated in terms of query. Then using simple views without partial indexes is terrible in terms of performance (I have tested that).

You mean the materialized views, right?
Well I have tested matviews, views without partial indexes, views with hashjoin on a key, ..

I think you'll ultimately be unhappy trying to go down this route, for the reasons you mention, plus the very large amount of extra space you'll be using. 800 booleans is 800 extra bytes for every row in your fact table. That's a lot. Even if you used a bitmap instead (which means you have to mess around with tracking which bit means what and probably other problems as well) you're still looking at 100 bytes per row. That's nothing to sneeze at.

Since each subset is about 5% (this number is decreasing when number of views increase) of the fact table, most boolean rows are null. This means 5% of 800 extra bytes, right ? I have choosen smallint, because of bulk load decrease csv size (true VS 1).

For the views that are too slow, look at what the expensive part of the view and materialize *only that*.
It would be great if I could, but all will be automatic, then It will be difficult to apply such rules that demands human analyse, and manual database modification, for one subset

Hope I have well anderstand you

Nicolas PARIS

2015-03-06 10:32 GMT+01:00 Jim Nasby <Jim.Nasby@bluetreble.com>:
On 3/6/15 2:16 AM, Nicolas Paris wrote:
If you have that many different views I doubt you want that many
indexes anyway.

It's a datawarehouse, then each view is used by many user for each query.
Those views must be subset of the huge material table. All indexes are
needed

Yes, but they don't have to be partial.

Have you tried just hitting the base table and indexes directly,
either through plain views or just direct SQL?

I have tried each. The performances are worst querying on a subset
(the views) than querying on whole huge table when using the huge indexes

You mean the materialized views, right? If so, that makes sense: Instead of having all your users hitting one common set of data (your fact table) you had them hitting a bunch of other data (the mat views). But you still had other stuff hitting the fact table. So now you were dealing with a lot more data than if you just stuck to the single fact table.

=> this is the solution I am implementing. (800 is not true, but in 10
years it maybe will be)

In 10 years we'll all be using quantum computers anyway... ;P

Actually, I have added a boolean column on the huge table for each
views. This is the way each view is a subset of huge table (Create View
as Select * FROM hugeTable WHERE columnX is true --etc 800 times). Then
I create 800partials indexes on that column(create index...WHERE columnX
is TRUE), for each view.
This works great as the query planer chooses the partials indexes when
querying the little subset of the terrific table (potential 20bilion rows)

This is better than material views for some reasons :
- saves places on hard drive (columnX is boolean +same indexes - data
for MatViews)
- saves time generating materialised views

But this isn't better than the mat views because of a bunch of booleans; it's better because it means less stain on the disk cache.

This is quite more complicated because in the project, the number of
view is increasing, and dynamic then :
- then adding new mat views is simple
- adding new views => adding new column on the huge table. It can take
long time to update boolean for each tuple. Then I need to truncate/bulk
load all data each time I add a new View. Other problem is dynamic
number column table was a bit tricky to implement in an ETL soft such
Talend, but the benefits are I hope great.

I think you'll ultimately be unhappy trying to go down this route, for the reasons you mention, plus the very large amount of extra space you'll be using. 800 booleans is 800 extra bytes for every row in your fact table. That's a lot. Even if you used a bitmap instead (which means you have to mess around with tracking which bit means what and probably other problems as well) you're still looking at 100 bytes per row. That's nothing to sneeze at.

My suggestion is to test using nothing but plain views and plain indexes on the base table. I expect that some of those views will not perform adequately, but many (or most) of them will be fine. For the views that are too slow, look at what the expensive part of the view and materialize *only that*. I suspect you'll find that when you do that you'll discover that several views are slow because of the same thing, so if you materialize that one thing one time you can then use it to speed up several views.

Using that approach means you'll have a lot less data that you have to read.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

Re: PG 9.3 materialized view VS Views, indexes, shared memory

От

Nicolas Paris

Дата:

09 марта 2015 г., 16:17:55

(sorry for top-posting, gmail does not help.)

Thanks to your advice Jim, I have done an other test :

No partial indexes, just a partial index on boolean columns does the job. (I get same perfs as MV)

CREATE INDEX ..ON (BoolColumnX) WHERE BoolColumnX IS TRUE

Then VIEW =

SELECT colA....colZ

FROM huge_table

WHERE BoolColumnX IS TRUE

Then this only index is used 800times (for each bool col) and saves place as it does'nt indexes NULL values, and does no replicate. subsets. Moreover the huge indexes are allways loaded in cache memory.

Nicolas PARIS

2015-03-06 21:26 GMT+01:00 Nicolas Paris <niparisco@gmail.com>:

According to this link http://postgresql.nabble.com/NULL-saves-disk-space-td4344106.html
NULL values do not take place if only one other column are null for that row.
Boolean takes 1 byte wheras smallint 2bytes.
Then the space problem is not anymore a problem with boolean columns 95% empty

One thing that is really great with postgresql is transaction for drop table cascade, that allow te restore all stuf index, views on a rollback if problem in loading appears.
I hope using one transaction to drop/load many table is not a performance issue ?

Nicolas PARIS

2015-03-06 11:25 GMT+01:00 Nicolas Paris <niparisco@gmail.com>:
Thanks Jim,

My suggestion is to test using nothing but plain views and plain indexes on the base table
Actualy the way I choose subset rows for views is complicated in terms of query. Then using simple views without partial indexes is terrible in terms of performance (I have tested that).

You mean the materialized views, right?
Well I have tested matviews, views without partial indexes, views with hashjoin on a key, ..

I think you'll ultimately be unhappy trying to go down this route, for the reasons you mention, plus the very large amount of extra space you'll be using. 800 booleans is 800 extra bytes for every row in your fact table. That's a lot. Even if you used a bitmap instead (which means you have to mess around with tracking which bit means what and probably other problems as well) you're still looking at 100 bytes per row. That's nothing to sneeze at.

Since each subset is about 5% (this number is decreasing when number of views increase) of the fact table, most boolean rows are null. This means 5% of 800 extra bytes, right ? I have choosen smallint, because of bulk load decrease csv size (true VS 1).

For the views that are too slow, look at what the expensive part of the view and materialize *only that*.
It would be great if I could, but all will be automatic, then It will be difficult to apply such rules that demands human analyse, and manual database modification, for one subset

Hope I have well anderstand you

Nicolas PARIS

2015-03-06 10:32 GMT+01:00 Jim Nasby <Jim.Nasby@bluetreble.com>:
On 3/6/15 2:16 AM, Nicolas Paris wrote:
If you have that many different views I doubt you want that many
indexes anyway.

It's a datawarehouse, then each view is used by many user for each query.
Those views must be subset of the huge material table. All indexes are
needed

Yes, but they don't have to be partial.

Have you tried just hitting the base table and indexes directly,
either through plain views or just direct SQL?

I have tried each. The performances are worst querying on a subset
(the views) than querying on whole huge table when using the huge indexes

You mean the materialized views, right? If so, that makes sense: Instead of having all your users hitting one common set of data (your fact table) you had them hitting a bunch of other data (the mat views). But you still had other stuff hitting the fact table. So now you were dealing with a lot more data than if you just stuck to the single fact table.

=> this is the solution I am implementing. (800 is not true, but in 10
years it maybe will be)

In 10 years we'll all be using quantum computers anyway... ;P

Actually, I have added a boolean column on the huge table for each
views. This is the way each view is a subset of huge table (Create View
as Select * FROM hugeTable WHERE columnX is true --etc 800 times). Then
I create 800partials indexes on that column(create index...WHERE columnX
is TRUE), for each view.
This works great as the query planer chooses the partials indexes when
querying the little subset of the terrific table (potential 20bilion rows)

This is better than material views for some reasons :
- saves places on hard drive (columnX is boolean +same indexes - data
for MatViews)
- saves time generating materialised views

But this isn't better than the mat views because of a bunch of booleans; it's better because it means less stain on the disk cache.

This is quite more complicated because in the project, the number of
view is increasing, and dynamic then :
- then adding new mat views is simple
- adding new views => adding new column on the huge table. It can take
long time to update boolean for each tuple. Then I need to truncate/bulk
load all data each time I add a new View. Other problem is dynamic
number column table was a bit tricky to implement in an ETL soft such
Talend, but the benefits are I hope great.

I think you'll ultimately be unhappy trying to go down this route, for the reasons you mention, plus the very large amount of extra space you'll be using. 800 booleans is 800 extra bytes for every row in your fact table. That's a lot. Even if you used a bitmap instead (which means you have to mess around with tracking which bit means what and probably other problems as well) you're still looking at 100 bytes per row. That's nothing to sneeze at.

My suggestion is to test using nothing but plain views and plain indexes on the base table. I expect that some of those views will not perform adequately, but many (or most) of them will be fine. For the views that are too slow, look at what the expensive part of the view and materialize *only that*. I suspect you'll find that when you do that you'll discover that several views are slow because of the same thing, so if you materialize that one thing one time you can then use it to speed up several views.

Using that approach means you'll have a lot less data that you have to read.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

Re: PG 9.3 materialized view VS Views, indexes, shared memory

От

Jim Nasby

Дата:

10 марта 2015 г., 11:31:48

On 3/9/15 8:17 AM, Nicolas Paris wrote:
> (sorry for top-posting, gmail does not help.)

*shakes fist at gmail*

> Thanks to your advice Jim, I have done an other test :
> No partial indexes, just a partial index on boolean columns does the
> job.  (I get same perfs as MV)
> CREATE INDEX ..ON (BoolColumnX) WHERE BoolColumnX IS TRUE
>
> Then VIEW =
> SELECT colA....colZ
> FROM huge_table
> WHERE BoolColumnX IS TRUE
>
> Then this only index is used 800times (for each bool col)  and saves
> place as it does'nt indexes NULL values, and does no replicate. subsets.
> Moreover the huge indexes are allways loaded in cache memory.

Cool. :)

>     According to this link
>     http://postgresql.nabble.com/NULL-saves-disk-space-td4344106.html
>     NULL values do not take place if only one other column are null for
>     that row.
>     Boolean takes 1 byte wheras smallint 2bytes.
>     Then the space problem is not anymore a problem with boolean columns
>     95% empty
>
>     One thing that is really great with postgresql is transaction for
>     drop table cascade, that allow te restore all stuf index, views on a
>     rollback if problem in loading appears.
>     I hope using one transaction to drop/load many table is not a
>     performance issue ?

Why are you dropping and re-loading? You mentioned it before and it
sounded like it had something to do with adding columns, but you don't
have to drop and reload to add a column.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

Re: PG 9.3 materialized view VS Views, indexes, shared memory

От

Nicolas Paris

Дата:

10 марта 2015 г., 11:53:29

Why are you dropping and re-loading? You mentioned it before and it sounded like it had something to do with adding columns, but you

don't have to drop and reload to add a column.

Adding a NULL column is fast. Dropping one too. I need to set some row as TRUE. I can do it with an update, but in postgresql update is done by delete then insert with copy of the row. This is really slow. A drop cascade, then bulk load is better.

This is not the only reason. Drop & load simplify all the ETL process. No question of delta changes and no "fuck brain" when a problem occurs or a modification of the table. I've tested, it loads 20milion rows in 5 min (without time for reindexing and time to retrieve datas)

2015-03-10 9:31 GMT+01:00 Jim Nasby <Jim.Nasby@bluetreble.com>:

On 3/9/15 8:17 AM, Nicolas Paris wrote:
(sorry for top-posting, gmail does not help.)

*shakes fist at gmail*

Thanks to your advice Jim, I have done an other test :
No partial indexes, just a partial index on boolean columns does the
job. (I get same perfs as MV)
CREATE INDEX ..ON (BoolColumnX) WHERE BoolColumnX IS TRUE

Then VIEW =
SELECT colA....colZ
FROM huge_table
WHERE BoolColumnX IS TRUE

Then this only index is used 800times (for each bool col) and saves
place as it does'nt indexes NULL values, and does no replicate. subsets.
Moreover the huge indexes are allways loaded in cache memory.

Cool. :)

According to this link
http://postgresql.nabble.com/NULL-saves-disk-space-td4344106.html
NULL values do not take place if only one other column are null for
that row.
Boolean takes 1 byte wheras smallint 2bytes.
Then the space problem is not anymore a problem with boolean columns
95% empty

One thing that is really great with postgresql is transaction for
drop table cascade, that allow te restore all stuf index, views on a
rollback if problem in loading appears.
I hope using one transaction to drop/load many table is not a
performance issue ?

Why are you dropping and re-loading? You mentioned it before and it sounded like it had something to do with adding columns, but you don't have to drop and reload to add a column.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: PG 9.3 materialized view VS Views, indexes, shared memory