Discussion: Batch update of indexes


Batch update of indexes

From:
Konstantin Knizhnik
Date:
Hi hackers,

I would like to know the opinion of the community about possible ways of solving a quite common problem:
increasing insert speed while still providing indexes for efficient execution of queries.

Many applications have to deal with a high input stream of data. Most of the time spent inserting a record
into the database is taken by updating indexes, and without indexes we are not able to execute queries efficiently.
So in many cases it is desirable to have "batch" or "concurrent" index updates, and it is acceptable that an index
is slightly behind the current state of the table.

One interesting approach to solving this problem is discussed in this article:

https://mark.zealey.org/2016/01/08/how-we-tweaked-postgres-upsert-performance-to-be-2-3-faster-than-mongodb

They are using materialized views to build indexes in the background.
An interesting idea, but copying the content of the whole table just to be able to build an index concurrently
seems like overkill.

I thought about more straightforward ways of solving this problem. It would be nice if we could preserve one of
the main postulates of Postgres and other RDBMSes: indexes are just an optimization, and the result of a query
should not depend on the presence of indexes.

The first idea is to use inheritance. I have investigated different ways of splitting a table into "archival" and
"operational" parts, but all of them require physically copying data from one table to another.

Another idea is to use partial indexes (http://www.postgresql.org/docs/current/static/indexes-partial.html).
Assume that we have a stream of input data where each record has an increasing timestamp:

   create table t(
      ts timestamp primary key,
      c1 real,
      c2 integer,
      c3 varchar,
      ...
      cN char(5)
   );

We want to provide the highest insert speed for "t" but still have indexes on the c1..cN fields.
We can declare partial indexes:

   create index idx1 on t(c1) where ts < '20/01/2016';
   create index idx2 on t(c2) where ts < '20/01/2016';
   ...
   create index idxN on t(cN) where ts < '20/01/2016';

As long as these indexes do not cover the current date, they will not be affected by insert operations.
But we can still efficiently run queries like

   select * from t where c1 > 100 and ts < '20/01/2016';

Then, in the background, maybe at night, we can do

   alter index idx1 where ts < '21/01/2016';
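For comparison, the closest thing stock PostgreSQL offers today is to build a replacement partial index with a
wider predicate and swap it in. A rough sketch (illustrative only, reusing the index, table and column names from
the example above; note that this rebuilds the index from scratch, which is exactly the cost the proposed
ALTER INDEX ... WHERE is meant to avoid):

   -- workaround with existing commands, not the proposed feature
   create index concurrently idx1_new on t(c1) where ts < '21/01/2016';
   begin;
   drop index idx1;                        -- takes a brief exclusive lock on t
   alter index idx1_new rename to idx1;
   commit;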
Please notice that such an alter statement, changing the condition of a partial index, is not supported now.
But I do not see any principal problems with supporting such a construction.
We should just include in the index all records which match the new condition and do not match the old one:

    ts < '21/01/2016' and not (ts < '20/01/2016')

If there is an index on the "ts" field this can be done quite efficiently.
This approach doesn't contradict the concept of indexes in an RDBMS.

But there is one more problem with this approach which I think should be addressed.
Right now the optimizer builds the following execution plan for a query with partial indexes:

  postgres=# explain select * from t where c1 < 10 and ts < '20/01/2016'::timestamp;
                                                  QUERY PLAN
  ---------------------------------------------------------------------------------------------------------------
   Bitmap Heap Scan on t  (cost=7.20..732.14 rows=12263 width=12)
     Recheck Cond: ((c1 < '10'::double precision) AND (ts < '2016-01-20 00:00:00'::timestamp without time zone))
     ->  Bitmap Index Scan on idx1  (cost=0.00..4.13 rows=12263 width=0)
           Index Cond: (c1 < '10'::double precision)
  (4 rows)

As you can see, the optimizer inserts a recheck into the query execution plan although it is not needed in this
case: the search condition is exactly the same as the partial index condition.
The optimal plan should be:

    Index Scan using idx1 on t  (cost=0.00..4.13 rows=12263 width=0)
      Index Cond: (c1 < '10'::double precision)

What do you think about this approach? Would it be useful to work in this direction?
Or are there better solutions to the problem?

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Batch update of indexes

From:
Anastasia Lubennikova
Date:
On 20.01.2016 12:28, Konstantin Knizhnik wrote:
> Hi hackers,
>
> I would like to know the opinion of the community about possible ways of solving a quite common problem:
> increasing insert speed while still providing indexes for efficient execution of queries.
>
> Many applications have to deal with a high input stream of data. Most of the time spent inserting a record
> into the database is taken by updating indexes, and without indexes we are not able to execute queries efficiently.
> So in many cases it is desirable to have "batch" or "concurrent" index updates, and it is acceptable that an index
> is slightly behind the current state of the table.

Hi, I am glad to see that you are interested in this too.
I think this is a good feature and it will be very useful to have.
I have already mentioned some related problems and possible improvements in my presentation:
http://www.slideshare.net/AnastasiaLubennikova/indexes-dont-mean-slow-inserts
The last two slides concern this thread. Briefly, I've suggested thinking about an insertion buffer. Something very
similar is already implemented in BRIN: it does not index the last data from the heap while the number of trailing
pages is less than pages_per_block.
The next point I've thought about is bulk update. The problem is that an update like "UPDATE mytable SET a = a + 1;"
causes N searches from the root of the B-tree. That looks very strange to me, and I'd like to fix it somehow. The
obvious solution is to update all tuples on a page at a time and keep the number of the last updated page. But maybe
that's a bit off-topic here.

> One interesting approach to solving this problem is discussed in this article:
>
> https://mark.zealey.org/2016/01/08/how-we-tweaked-postgres-upsert-performance-to-be-2-3-faster-than-mongodb
>
> They are using materialized views to build indexes in the background.
> An interesting idea, but copying the content of the whole table just to be able to build an index concurrently
> seems like overkill.

This approach seems like a tricky crutch to me. And I agree that it requires a lot of extra work.

> I thought about more straightforward ways of solving this problem. It would be nice if we could preserve one of
> the main postulates of Postgres and other RDBMSes: indexes are just an optimization, and the result of a query
> should not depend on the presence of indexes.
>
> The first idea is to use inheritance. I have investigated different ways of splitting a table into "archival" and
> "operational" parts, but all of them require physically copying data from one table to another.
>
> Another idea is to use partial indexes (http://www.postgresql.org/docs/current/static/indexes-partial.html).
> Assume that we have a stream of input data where each record has an increasing timestamp:
>
>   create table t(
>      ts timestamp primary key,
>      c1 real,
>      c2 integer,
>      c3 varchar,
>      ...
>      cN char(5)
>   );
>
> We want to provide the highest insert speed for "t" but still have indexes on the c1..cN fields.
> We can declare partial indexes:
>
>   create index idx1 on t(c1) where ts < '20/01/2016';
>   create index idx2 on t(c2) where ts < '20/01/2016';
>   ...
>   create index idxN on t(cN) where ts < '20/01/2016';
>
> As long as these indexes do not cover the current date, they will not be affected by insert operations.
> But we can still efficiently run queries like
>
>   select * from t where c1 > 100 and ts < '20/01/2016';
>
> Then, in the background, maybe at night, we can do
>
>   alter index idx1 where ts < '21/01/2016';

This idea sounds very interesting to me.

> Please notice that such an alter statement, changing the condition of a partial index, is not supported now.

Don't you think that this feature could be used in a very wrong way? Do not take it as criticism, just a bit of thought.

> But I do not see any principal problems with supporting such a construction.
> We should just include in the index all records which match the new condition and do not match the old one:
>
>    ts < '21/01/2016' and not (ts < '20/01/2016')
>
> If there is an index on the "ts" field this can be done quite efficiently.
> This approach doesn't contradict the concept of indexes in an RDBMS.
>
> But there is one more problem with this approach which I think should be addressed.
> Right now the optimizer builds the following execution plan for a query with partial indexes:
>
>   postgres=# explain select * from t where c1 < 10 and ts < '20/01/2016'::timestamp;
>                                                   QUERY PLAN
>   ---------------------------------------------------------------------------------------------------------------
>    Bitmap Heap Scan on t  (cost=7.20..732.14 rows=12263 width=12)
>      Recheck Cond: ((c1 < '10'::double precision) AND (ts < '2016-01-20 00:00:00'::timestamp without time zone))
>      ->  Bitmap Index Scan on idx1  (cost=0.00..4.13 rows=12263 width=0)
>            Index Cond: (c1 < '10'::double precision)
>   (4 rows)
>
> As you can see, the optimizer inserts a recheck into the query execution plan although it is not needed in this
> case: the search condition is exactly the same as the partial index condition.
> The optimal plan should be:
>
>    Index Scan using idx1 on t  (cost=0.00..4.13 rows=12263 width=0)
>      Index Cond: (c1 < '10'::double precision)

There was a discussion of a patch for partial indexes:
http://postgresql.nabble.com/PATCH-index-only-scans-with-partial-indexes-td5857568.html
Since I haven't watched it closely, it seems to be still open. I think it will be interesting to you.

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Batch update of indexes

From:
Konstantin Knizhnik
Date:
Hi,

> Hi, I am glad to see that you are interested in this too.
> I think this is a good feature and it will be very useful to have.
> I have already mentioned some related problems and possible
> improvements in my presentation.
> http://www.slideshare.net/AnastasiaLubennikova/indexes-dont-mean-slow-inserts
> The last two slides concern this thread. Briefly, I've suggested
> thinking about an insertion buffer. Something very similar is already
> implemented in BRIN. It does not index the last data from the heap while
> the number of trailing pages is less than pages_per_block.

Do you mean GIN-like usage of an insertion buffer (there it is called a
"pending list")?
So that we have to combine a search in the main tree and in the insert buffer?
Actually this is what I want to avoid (because, at least in the case of GIN,
the pending list causes significant degradation of performance,
while an up-to-date state of a full-text index is rarely required).


> The next point I've thought about is bulk update. The problem is that an
> update like "UPDATE mytable SET a = a + 1;" causes N searches from the
> root of the B-tree. That looks very strange to me, and I'd like to fix it
> somehow. The obvious solution is to update all tuples on a page at a
> time and keep the number of the last updated page. But maybe that's a bit
> off-topic here.

Bulk update is a second question (but a very important one).
First I just want to be able to append to an index concurrently, without
blocking inserts.

>
>> One interesting approach to solving this problem is discussed in this
>> article:
>>
>> https://mark.zealey.org/2016/01/08/how-we-tweaked-postgres-upsert-performance-to-be-2-3-faster-than-mongodb
>>
>> They are using materialized views to build indexes in the background.
>> An interesting idea, but copying the content of the whole table just to be
>> able to build an index concurrently seems like overkill.
>
> This approach seems like a tricky crutch to me. And I agree that it 
> requires a lot of extra work.

It would be very interesting to know how people are using materialized views.
Delayed building of indexes seems to be one of the popular use cases,
although it involves a large overhead, first of all storage overhead.

>
>>
>> Please notice that such an alter statement, changing the condition of a
>> partial index, is not supported now.
>
> Don't you think that this feature could be used in a very wrong way?
> Do not take it as criticism, just a bit of thought.
>

Everything which can be misused will be misused. :)
But I do not worry much about it...
If it can address real challenges, then it will be a good thing in any case.

Ideally we should be able to alter everything. A naive implementation of
such an alter clause can just build a new index with a temporary name, then
drop the old index and rename the new one.


>
> There was a discussion of a patch for partial indexes:
> http://postgresql.nabble.com/PATCH-index-only-scans-with-partial-indexes-td5857568.html
> Since I haven't watched it closely, it seems to be still open. I
> think it'll be interesting to you.
>

Such a small patch...
Why was it not accepted?
I do not see any problems with it...


> -- 
> Anastasia Lubennikova
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Batch update of indexes

From:
Simon Riggs
Date:
On 20 January 2016 at 14:55, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
> Hi,
>
>> Hi, I am glad to see that you are interested in this too.
>> I think this is a good feature and it will be very useful to have.
>> I have already mentioned some related problems and possible improvements in my presentation.
>> http://www.slideshare.net/AnastasiaLubennikova/indexes-dont-mean-slow-inserts
>> The last two slides concern this thread. Briefly, I've suggested thinking about an insertion buffer. Something very similar is already implemented in BRIN. It does not index the last data from the heap while the number of trailing pages is less than pages_per_block.
>
> Do you mean GIN-like usage of an insertion buffer (there it is called a "pending list")?
> So that we have to combine a search in the main tree and in the insert buffer?
> Actually this is what I want to avoid (because, at least in the case of GIN, the pending list causes significant degradation of performance,
> while an up-to-date state of a full-text index is rarely required).

The degradation in performance is because a scan of the pending list is O(N).

If we did the same thing for monotonic inserts into a btree, the cost of ruling out any contents in the pending list would be O(1), so it is more feasible than you say.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Batch update of indexes

From:
konstantin knizhnik
Date:

On Jan 21, 2016, at 5:14 AM, Simon Riggs wrote:

> On 20 January 2016 at 14:55, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
>> Hi,
>>
>>> Hi, I am glad to see that you are interested in this too.
>>> I think this is a good feature and it will be very useful to have.
>>> I have already mentioned some related problems and possible improvements in my presentation.
>>> http://www.slideshare.net/AnastasiaLubennikova/indexes-dont-mean-slow-inserts
>>> The last two slides concern this thread. Briefly, I've suggested thinking about an insertion buffer. Something very similar is already implemented in BRIN. It does not index the last data from the heap while the number of trailing pages is less than pages_per_block.
>>
>> Do you mean GIN-like usage of an insertion buffer (there it is called a "pending list")?
>> So that we have to combine a search in the main tree and in the insert buffer?
>> Actually this is what I want to avoid (because, at least in the case of GIN, the pending list causes significant degradation of performance,
>> while an up-to-date state of a full-text index is rarely required).
>
> The degradation in performance is because a scan of the pending list is O(N).
>
> If we did the same thing for monotonic inserts into a btree, the cost of ruling out any contents in the pending list would be O(1), so it is more feasible than you say.

Sorry, I didn't get it.
Certainly for a B-tree we can organize the insert buffer (or pending list) as a sorted array or as a tree.
But in both cases the complexity of a search in this buffer will be O(log(N)), not O(1).
O(1) is possible only if we use a hash table for the pending list and are lucky with the hash function.
But in that case it will not be possible to support range queries and proper ordering.

In any case, the necessity to perform two searches instead of one and to combine the results will significantly complicate the index implementation.
But this solution is definitely more convenient and transparent for users...


 

Re: Batch update of indexes

From:
Simon Riggs
Date:
On 21 January 2016 at 06:41, konstantin knizhnik <k.knizhnik@postgrespro.ru> wrote:

> Certainly for a B-tree we can organize the insert buffer (or pending list) as a sorted array or as a tree.
> But in both cases the complexity of a search in this buffer will be O(log(N)), not O(1).

You are right; search within this buffer is O(log(N)). But we can test whether we need to search in the buffer at all in O(1).

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Batch update of indexes

From:
Anastasia Lubennikova
Date:
On 20.01.2016 17:55, Konstantin Knizhnik wrote:
> Hi,
>
>> Hi, I am glad to see that you are interested in this too.
>> I think this is a good feature and it will be very useful to
>> have.
>> I have already mentioned some related problems and possible
>> improvements in my presentation.
>> http://www.slideshare.net/AnastasiaLubennikova/indexes-dont-mean-slow-inserts
>>
>> The last two slides concern this thread. Briefly, I've suggested
>> thinking about an insertion buffer. Something very similar is already
>> implemented in BRIN. It does not index the last data from the heap while
>> the number of trailing pages is less than pages_per_block.
>
> Do you mean GIN-like usage of an insertion buffer (there it is called a
> "pending list")?
> So that we have to combine a search in the main tree and in the insert
> buffer?
> Actually this is what I want to avoid (because, at least in the case of
> GIN, the pending list causes significant degradation of performance,
> while an up-to-date state of a full-text index is rarely required).
>

What I meant is more like a BRIN-like combination of an index scan and a
heap scan.
Maybe it could be called "deferred inserts" or a "temporary read-only index".
Maybe it's similar to the MySQL insert buffer:
http://dev.mysql.com/doc/refman/5.7/en/innodb-insert-buffering.html
I think it'll be clearer with an example. Please don't mind the syntax.

CREATE TABLE tbl (c1 int);
CREATE INDEX idx on tbl(c1);

SET enable_deferred_insert(idx) = on;
At this moment, we save the last_indexed_item (its TID) somewhere in the
index metapage.

From that moment on, the data inserted into the table doesn't touch the index.
We perform some heavy insert and then go back to the normal index behavior.

SET enable_deferred_insert(idx) = off;
This command takes all the data between the last_indexed_item and the
end of the table, and inserts it into the index at once.

Of course there are new problems to deal with, but it's really useful
for the use case of balancing an irregular heavy write load, isn't it?
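The closest blunt equivalent available today is to drop the index for the
duration of the heavy load and rebuild it afterwards, which loses the index
entirely in the meantime instead of merely letting it lag. A sketch for
comparison (illustrative only, using the table and index names from the
example above):

DROP INDEX idx;
-- ... heavy insert into tbl ...
CREATE INDEX CONCURRENTLY idx ON tbl (c1);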

BTW, could you explain what the reason is to copy data into the pending
list and then copy it again while flushing the pending list into the index?
Why not read this data directly from the table? I feel that I've missed
something important here.

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Batch update of indexes

From:
Konstantin Knizhnik
Date:


On 21.01.2016 10:14, Simon Riggs wrote:
> On 21 January 2016 at 06:41, konstantin knizhnik <k.knizhnik@postgrespro.ru> wrote:
>> Certainly for a B-tree we can organize the insert buffer (or pending list) as a sorted array or as a tree.
>> But in both cases the complexity of a search in this buffer will be O(log(N)), not O(1).
>
> You are right; search within this buffer is O(log(N)). But we can test whether we need to search in the buffer at all in O(1).

Only if records are inserted in ascending key order...
But usually we need to maintain more than one index, and for the others this is not true.



-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: Batch update of indexes

From:
Konstantin Knizhnik
Date:

On 21.01.2016 19:09, Anastasia Lubennikova wrote:
> What I meant is more like a BRIN-like combination of an index scan and a
> heap scan.
> Maybe it could be called "deferred inserts" or a "temporary read-only
> index".
> Maybe it's similar to the MySQL insert buffer:
> http://dev.mysql.com/doc/refman/5.7/en/innodb-insert-buffering.html
> I think it'll be clearer with an example. Please don't mind the syntax.
>
> CREATE TABLE tbl (c1 int);
> CREATE INDEX idx on tbl(c1);
>
> SET enable_deferred_insert(idx) = on;
> At this moment, we save the last_indexed_item (its TID) somewhere in the
> index metapage.
>
> From that moment on, the data inserted into the table doesn't touch the
> index.
> We perform some heavy insert and then go back to the normal index
> behavior.
>
> SET enable_deferred_insert(idx) = off;
> This command takes all the data between the last_indexed_item and the
> end of the table, and inserts it into the index at once.
>
> Of course there are new problems to deal with, but it's really useful
> for the use case of balancing an irregular heavy write load, isn't it?
>
> BTW, could you explain what the reason is to copy data into the
> pending list and then copy it again while flushing the pending list into
> the index? Why not read this data directly from the table? I feel that
> I've missed something important here.
>
No, I do not think that inserted data should be placed in a pending list
and then copied to the main table.
It should be stored directly in the main table, and the "pending list" is
just some fast, transient index.
From my point of view there are two possibilities:
1. Preserve strict table-index consistency: query results should not
depend on the presence of the index.
2. Support out-of-date or deferred indexes, which can be updated in the
background.

The second approach is certainly more efficient, and IMHO it is acceptable
for most users.
But we would be violating one of the fundamental properties of RDBMSes...
So I am not sure which approach to choose.

The first case is also harder to implement, because we have to somehow merge
two indexes during an index scan
and provide proper recovery of the main index in case of failure (assuming
that the pending list is maintained in memory and is lost after a fault).


-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Batch update of indexes

From:
Torsten Zühlsdorff
Date:
On 21.01.2016 18:47, Konstantin Knizhnik wrote:

>
> On 21.01.2016 19:09, Anastasia Lubennikova wrote:
>> What I meant is more like a BRIN-like combination of an index scan and a
>> heap scan.
>> Maybe it could be called "deferred inserts" or a "temporary read-only
>> index".
>> Maybe it's similar to the MySQL insert buffer:
>> http://dev.mysql.com/doc/refman/5.7/en/innodb-insert-buffering.html
>> I think it'll be clearer with an example. Please don't mind the syntax.
>>
>> CREATE TABLE tbl (c1 int);
>> CREATE INDEX idx on tbl(c1);
>>
>> SET enable_deferred_insert(idx) = on;
>> At this moment, we save the last_indexed_item (its TID) somewhere in the
>> index metapage.
>>
>> From that moment on, the data inserted into the table doesn't touch the
>> index.
>> We perform some heavy insert and then go back to the normal index
>> behavior.
>>
>> SET enable_deferred_insert(idx) = off;
>> This command takes all the data between the last_indexed_item and the
>> end of the table, and inserts it into the index at once.
>>
>> Of course there are new problems to deal with, but it's really useful
>> for the use case of balancing an irregular heavy write load, isn't it?
>>
>> BTW, could you explain what the reason is to copy data into the
>> pending list and then copy it again while flushing the pending list into
>> the index? Why not read this data directly from the table? I feel that
>> I've missed something important here.
>>
> No, I do not think that inserted data should be placed in a pending list
> and then copied to the main table.
> It should be stored directly in the main table, and the "pending list" is
> just some fast, transient index.
>
> From my point of view there are two possibilities:
> 1. Preserve strict table-index consistency: query results should not
> depend on the presence of the index.
> 2. Support out-of-date or deferred indexes, which can be updated in the
> background.
>
> The second approach is certainly more efficient, and IMHO it is acceptable
> for most users.
> But we would be violating one of the fundamental properties of RDBMSes...
> So I am not sure which approach to choose.
>
> The first case is also harder to implement, because we have to somehow merge
> two indexes during an index scan
> and provide proper recovery of the main index in case of failure (assuming
> that the pending list is maintained in memory and is lost after a fault).

We already have unlogged tables, which lose their data when there was an
unclean shutdown. Would it be acceptable to create something like that
for this "deferred index"? When they are in sync everything is fine.
When there was an error and the index is not up to date, a reindex is
needed. This would be a bad thing for very big indexes, but for most
indexes this could be fine. Thoughts?

Greetings,
Torsten




Re: Batch update of indexes

From:
Konstantin Knizhnik
Date:
Hi hackers,

I have implemented an "ALTER INDEX ... WHERE ..." clause allowing the condition of a partial index to be changed.
It effectively allows us to append to an index without fully rebuilding it.
As I explained in the previous mails, partial indexes can be used to increase insert speed.
Right now I get the following results (with one insert stream):

Insert with 1 index (primary key, monotonically ascending):             324275 TPS
Insert with 9 indexes (primary key + 8 indexes with random keys):        52495 TPS
Insert with primary key and 8 concurrently updated partial indexes:     194458 TPS
Insert with primary key and 8 "frozen" partial indexes:                 278446 TPS

So, as you can see, insert with indexes is about 6 times slower than insert without indexes,
and partial indexes allow this gap to be eliminated.
When the partial indexes are not affected (assuming that they will be reconstructed "at night"),
performance is almost the same as without indexes.
And if "ALTER INDEX" is done concurrently with inserts, it certainly decreases insert speed,
but it is still 4 times faster than with normal indexes.

Such high TPS values were obtained using "insert from select" to bypass libpq overhead.
With libpq (when each insert is sent as an independent statement) the results are less impressive:

Insert with 1 index (primary key, monotonically ascending):              37892 TPS
Insert with 9 indexes (primary key + 8 indexes with random keys):        20231 TPS
Insert with primary key and 8 concurrently updated partial indexes:      26934 TPS
Insert with primary key and 8 "frozen" partial indexes:                  28863 TPS

But partial indexes still almost eliminate the twofold difference...
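For reference, "insert from select" here means generating the rows server-side in a single statement, roughly like
the sketch below (illustrative only; the actual measurements were made with the insbench utility mentioned below,
not with this query):

   insert into t (ts, c1, c2, c3)
   select now() + (i || ' microsecond')::interval,  -- monotonically increasing key
          random()::real,
          i,
          md5(i::text)
   from generate_series(1, 1000000) as i;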
These results can be reproduced using our public repository:
https://github.com/postgrespro/postgres_cluster

Most of the code related to the support of "ALTER INDEX .. WHERE" is in the AlterIndex function in
postgres_cluster/src/backend/commands/indexcmds.c
I have also added the insbench utility, with which these results were obtained, for measuring insert performance.
It is located in the postgres_cluster/src/bin/insbench directory.

Known issues:
1. I do not handle the case when the new condition for the partial index is more restrictive than the original one.
There is no way in Postgres to exclude records from an index (except VACUUM), so in this case the index has to be
reconstructed from scratch.
2. Currently I am using SPI to locate the records which should be included in the index.
3. I am not completely sure that there are no synchronization/isolation problems in the AlterIndex function.

If this approach is considered interesting by the community, I will try to address these issues.
-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Batch update of indexes

From:
Robert Haas
Date:
On Wed, Jan 20, 2016 at 4:28 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
> Please notice that such an alter statement, changing the condition of a
> partial index, is not supported now.
> But I do not see any principal problems with supporting such a construction.
> We should just include in the index all records which match the new condition
> and do not match the old one:
>
>    ts < '21/01/2016' and not (ts < '20/01/2016')

You'd also need to remove any rows from the index that match the old
condition but not the new one.  In your example, that's impossible,
but in general, it's definitely possible.
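To make that concrete, a predicate change that shrinks the index would look like this (hypothetical syntax from
this thread, using the table from the earlier example):

   create index idx1 on t(c1) where ts < '21/01/2016';
   alter index idx1 where ts < '20/01/2016';
   -- entries with ts >= '20/01/2016' and ts < '21/01/2016' match the old
   -- predicate but not the new one, so they would have to be removed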

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Batch update of indexes

From:
Konstantin Knizhnik
Date:
Attached please find a patch for the "ALTER INDEX ... WHERE ..." clause.
It is now able to handle all three possible situations:
1. Making an index partial (adding a WHERE condition to an ordinary index)
2. Extending the partial index range (a less restrictive index predicate)
3. An arbitrary change of the partial index predicate

In case 2, new records are added to the index.
In the other two cases the index is completely reconstructed.
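For illustration, the three cases might look like this (sketch only, using the proposed syntax and the table from
the earlier example):

   -- 1. make an ordinary index partial
   alter index idx1 where ts < '20/01/2016';
   -- 2. extend the range (less restrictive predicate): only newly matching rows are added
   alter index idx1 where ts < '21/01/2016';
   -- 3. arbitrary predicate change: the index is rebuilt from scratch
   alter index idx1 where c2 > 0;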

This patch includes the src/bin/insbench utility for testing insert
performance. It can easily be excluded from the patch to reduce its size.
Also, it is better to apply this patch together with the "index-only scans
with partial indexes" patch:

http://www.postgresql.org/message-id/560C7213.3010203@2ndquadrant.com

as only in that case will the regression test produce the expected output.


On 27.01.2016 23:15, Robert Haas wrote:
> On Wed, Jan 20, 2016 at 4:28 AM, Konstantin Knizhnik
> <k.knizhnik@postgrespro.ru> wrote:
>> Please notice that such an alter statement, changing the condition of a
>> partial index, is not supported now.
>> But I do not see any principal problems with supporting such a construction.
>> We should just include in the index all records which match the new condition
>> and do not match the old one:
>>
>>     ts < '21/01/2016' and not (ts < '20/01/2016')
> You'd also need to remove any rows from the index that match the old
> condition but not the new one.  In your example, that's impossible,
> but in general, it's definitely possible.
>

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachments

Re: Batch update of indexes

From:
Jim Nasby
Date:
On 1/21/16 11:47 AM, Konstantin Knizhnik wrote:
>> BTW, could you explain what the reason is to copy data into the
>> pending list and then copy it again while flushing the pending list into
>> the index? Why not read this data directly from the table? I feel that
>> I've missed something important here.
>>
> No, I do not think that inserted data should be placed in a pending list
> and then copied to the main table.
> It should be stored directly in the main table, and the "pending list" is
> just some fast, transient index.

That sounds similar to what we would need to support referencing OLD and 
NEW in per-statement triggers: a good way to find everything that was 
changed in a statement.

Or if you will, s/statement/transaction/.

Having that is probably a prerequisite for doing incremental refresh of
materialized views.

My suspicion is that it would be useful to pre-order the new data before 
trying to apply it to the indexes.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Batch update of indexes

From:
konstantin knizhnik
Date:
On Feb 4, 2016, at 2:00 AM, Jim Nasby wrote:

>
> My suspicion is that it would be useful to pre-order the new data before trying to apply it to the indexes.

Sorry, but ALTER INDEX is expected to work for all indexes, not only B-tree,
and for those sorting may not be possible...
But for B-tree, presorting the inserted data should certainly increase performance.
I will think about it.







Re: Batch update of indexes

From:
Jim Nasby
Date:
On 2/4/16 1:37 AM, konstantin knizhnik wrote:
>> My suspicion is that it would be useful to pre-order the new data before trying to apply it to the indexes.
>
> Sorry, but ALTER INDEX is expected to work for all indexes, not only B-tree, and for those sorting may not be
> possible...
> But for B-tree, presorting the inserted data should certainly increase performance.
> I will think about it.

I wasn't talking about ALTER INDEX.

My theory is that if you're doing a large DML operation it might be more 
efficient to update an index as a single bulk operation, instead of 
doing it for each tuple.

If you want to do that, then you need an efficient method for finding 
everything that a DML statement changed. That's the exact same thing we 
need to support statement-level triggers being able to reference NEW and 
OLD. It's probably also what we need to support incrementally updated matviews.
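As a purely hypothetical sketch of what such a trigger might look like (this syntax did not exist at the time of
writing, and bulk_index_maintenance() is an imaginary function; the point is only that a statement-level trigger
could see every row the statement changed):

CREATE TRIGGER tbl_bulk_maintenance
    AFTER INSERT ON tbl
    REFERENCING NEW TABLE AS inserted_rows
    FOR EACH STATEMENT
    EXECUTE PROCEDURE bulk_index_maintenance();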

If we had such a capability then we could add options to the AM 
infrastructure to allow indexes to support doing bulk maintenance as 
well as per-tuple maintenance (or even support only bulk maintenance...)

I don't think any of that has anything to do with ALTER INDEX.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Batch update of indexes

From:
David Steele
Date:
Hi Konstantin,

On 2/3/16 11:47 AM, Konstantin Knizhnik wrote:
> Attached please find a patch for the "ALTER INDEX ... WHERE ..." clause.
> It is now able to handle all three possible situations:
> 1. Making an index partial (adding a WHERE condition to an ordinary index)
> 2. Extending the partial index range (a less restrictive index predicate)
> 3. An arbitrary change of the partial index predicate
>
> In case 2, new records are added to the index.
> In the other two cases the index is completely reconstructed.
>
> This patch includes the src/bin/insbench utility for testing insert
> performance. It can easily be excluded from the patch to reduce its size.
> Also, it is better to apply this patch together with the "index-only scans
> with partial indexes" patch:

This patch no longer applies on master:

$ git apply ../other/alter-index.patch
../other/alter-index.patch:27: trailing whitespace.
static void
../other/alter-index.patch:62: indent with spaces.    SPIPlanPtr plan;
../other/alter-index.patch:63: indent with spaces.    Portal portal;
../other/alter-index.patch:90: trailing whitespace.

../other/alter-index.patch:99: trailing whitespace.

error: patch failed: src/test/regress/expected/aggregates.out:831
error: src/test/regress/expected/aggregates.out: patch does not apply

Please provide a new patch for review.  Meanwhile I am marking this 
"waiting on author".

Thanks,
-- 
-David
david@pgmasters.net



Re: Batch update of indexes

From:
Konstantin Knizhnik
Date:
Hi David,

Rebased patch is attached.

On 14.03.2016 15:09, David Steele wrote:
> Hi Konstantin,
>
> On 2/3/16 11:47 AM, Konstantin Knizhnik wrote:
>> Attached please find a patch for the "ALTER INDEX ... WHERE ..." clause.
>> It is now able to handle all three possible situations:
>> 1. Making an index partial (adding a WHERE condition to an ordinary index)
>> 2. Extending the partial index range (a less restrictive index predicate)
>> 3. An arbitrary change of the partial index predicate
>>
>> In case 2, new records are added to the index.
>> In the other two cases the index is completely reconstructed.
>>
>> This patch includes the src/bin/insbench utility for testing insert
>> performance. It can easily be excluded from the patch to reduce its size.
>> Also, it is better to apply this patch together with the "index-only scans
>> with partial indexes" patch:
>
> This patch no longer applies on master:
>
> $ git apply ../other/alter-index.patch
> ../other/alter-index.patch:27: trailing whitespace.
> static void
> ../other/alter-index.patch:62: indent with spaces.
>     SPIPlanPtr plan;
> ../other/alter-index.patch:63: indent with spaces.
>     Portal portal;
> ../other/alter-index.patch:90: trailing whitespace.
>
> ../other/alter-index.patch:99: trailing whitespace.
>
> error: patch failed: src/test/regress/expected/aggregates.out:831
> error: src/test/regress/expected/aggregates.out: patch does not apply
>
> Please provide a new patch for review.  Meanwhile I am marking this
> "waiting on author".
>
> Thanks,

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachments

Re: Batch update of indexes

From:
David Steele
Date:
On 3/14/16 10:28 AM, Konstantin Knizhnik wrote:

> Rebased patch is attached.

Thanks for the quick turnaround!

Marko, you are signed up to review this patch.  Do you have an idea of 
when you'll be able to do that?

-- 
-David
david@pgmasters.net



Re: Batch update of indexes

From:
David Steele
Date:
On 3/14/16 10:37 AM, David Steele wrote:
> On 3/14/16 10:28 AM, Konstantin Knizhnik wrote:
>
>> Rebased patch is attached.
>
> Thanks for the quick turnaround!
>
> Marko, you are signed up to review this patch.  Do you have an idea of
> when you'll be able to do that?

Bump.

Since it looks like Marko has not had time to look at this, would anyone
else care to do a review?

Thanks,
-- 
-David
david@pgmasters.net



Re: Batch update of indexes

From:
Tom Lane
Date:
Konstantin Knizhnik <k.knizhnik@postgrespro.ru> writes:
> Attached please find a patch for the "ALTER INDEX ... WHERE ..." clause.
> It is now able to handle all three possible situations:
> 1. Making an index partial (adding a WHERE condition to an ordinary index)
> 2. Extending the partial index range (a less restrictive index predicate)
> 3. An arbitrary change of the partial index predicate

I've not been following this thread previously, but this proposal
scares me quite a lot.  I am certain there are places in our code
that assume that the properties of an index don't change after it's
been created.  One area that this almost certainly breaks is HOT updates:
adding a previously-unindexed column to an index predicate might break
existing HOT chains, and I see nothing in this patch that could deal
with that.  I seem to recall there are other places that would be
broken by changing an index's DDL definition after creation, but can't
recall specifics right now.

I am also, frankly, not seeing a use-case for this functionality that
would justify trying to find and remove those assumptions.

There's a lot of things I don't care for about the way the patch is
written, in particular its willingness to use SPI (which opens a lot of
potential for search-path problems, failure to see uncommitted tuples,
etc).  But we need not get to that if we don't believe the functionality
can work.

> This patch includes the src/bin/insbench utility for testing insert
> performance. It can easily be excluded from the patch to reduce its size.

C++ is not likely to get accepted into our tree ...
        regards, tom lane



Re: Batch update of indexes

From:
Konstantin Knizhnik
Date:
On 04/02/2016 09:57 PM, Tom Lane wrote:
> Konstantin Knizhnik <k.knizhnik@postgrespro.ru> writes:
>> Attached please find a patch for the "ALTER INDEX ... WHERE ..." clause.
>> It is now able to handle all three possible situations:
>> 1. Making an index partial (adding a WHERE condition to an ordinary index)
>> 2. Extending the partial index range (a less restrictive index predicate)
>> 3. An arbitrary change of the partial index predicate
> I've not been following this thread previously, but this proposal
> scares me quite a lot.  I am certain there are places in our code
> that assume that the properties of an index don't change after it's
> been created.  One area that this almost certainly breaks is HOT updates:
> adding a previously-unindexed column to an index predicate might break
> existing HOT chains, and I see nothing in this patch that could deal
> with that.  I seem to recall there are other places that would be
> broken by changing an index's DDL definition after creation, but can't
> recall specifics right now.
>
> I am also, frankly, not seeing a use-case for this functionality that
> would justify trying to find and remove those assumptions.
>
> There's a lot of things I don't care for about the way the patch is
> written, in particular its willingness to use SPI (which opens a lot of
> potential for search-path problems, failure to see uncommitted tuples,
> etc).  But we need not get to that if we don't believe the functionality
> can work.

Thank you for the review, Tom.

I completely agree with all your arguments against this patch.
I have proposed this patch mostly as a proof of concept.
Yes, I have not taken HOT updates into account, and maybe there are other possible issues which I have not considered.

The main question is whether the proposed way of doing batch updates of indexes is viable, or whether it is a
conceptually wrong approach
(because it breaks the assumption that index properties can't be changed, or because it is not convenient to use...).

I hope that everybody agrees that maintaining indexes is the main limiting factor for insert speed.
If a table has no indexes, then the insert speed can be as high as the disk write speed (100MB/sec, or 1000 for SSD).
So if the size of a record is about 10 bytes, then we can get about 10 million TPS.
But the presence of indexes dramatically changes this picture: if the database is large enough that even an index
cannot fit in memory,
and records are inserted in random key order, then each insert into the index will require reading 3-4 pages from
random locations on the disk.
With an average HDD positioning time of 10 msec, we get 100 reads per second and ... 20-30 TPS. That is with just
one index.
If we have 10 indexes, then TPS can be less than the fingers on one hand.

Certainly this is a very pessimistic estimation.
But it is still true that we cannot provide good insert speed if we have to update indexes immediately.
And without indexes we cannot efficiently execute most queries.

I do not see any way in Postgres to solve this problem now. The hack with creating materialized views requires a lot
of extra time and space.
It will not work for a really large table.

So we need some way to postpone the insertion of new records into the index. Then we can do such insertion in the
background or in idle time (at night), and try to use bulk insert if the index implementation supports it (for
example, sorting records by key before insert can
significantly increase locality and so improve the speed of insertion into the index). But the principal point here
is that such a delayed update of the index violates the main RDBMS rule that the results of query execution with and
without indexes should be the same. The trick
with partial indexes allows this contradiction to be eliminated, but it requires more actions from the user. So are
users ready to do some extra job just for "ideological" reasons? Because if a user wants delayed updates of indexes,
then he actually
accepts that query results may not include some of the most recent updates.

Another aspect is which database objects are allowed to be altered and which are not. Right now with tables we can
alter almost everything.
With indexes - almost nothing. It is assumed that an index can always be reconstructed. But for a very big table,
reconstruction of indexes from scratch will take an unacceptable amount of time. So should we make it possible to
alter some index characteristics
which do not require rebuilding the index from scratch (and this is definitely true for the partial index predicate)?
Or is the price of supporting it so high that it cannot be compensated by the obtained benefits?

So what do you think?
1. Should I continue to work in this direction and fix all possible issues with HOT updates, ... to make it possible
to alter partial index predicates and support batch inserts in this way?
2. Or is it better to just add an extra option to the index, allowing it to be slightly out of sync? It would allow,
for example, eliminating the pending list for GIN, which can cause very significant degradation of query speed, while
for most full-text search
engines it is acceptable that changes are not immediately visible.
Certainly much more work is required here than just adding a new index option...
3. Or are both approaches wrong and we should leave everything as it is?
4. Or maybe there is some other approach which is more acceptable?





>
>> This patch includes the src/bin/insbench utility for testing insert
>> performance. It can easily be excluded from the patch to reduce its size.
> C++ is not likely to get accepted into our tree ...
>
>             regards, tom lane


-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Batch update of indexes

From:
David Steele
Date:
On 4/2/16 4:21 PM, Konstantin Knizhnik wrote:

> Thank you for the review, Tom.
>
> I completely agree with all your arguments against this patch.
> I have proposed this patch mostly as a proof of concept.

I have marked this "returned with feedback".  Hopefully you can work on
the concept and resubmit for 9.7.

Thanks,
-- 
-David
david@pgmasters.net