Thread: Btree Index on PostgreSQL and Wiredtiger (MongoDB3.2)

Btree Index on PostgreSQL and Wiredtiger (MongoDB3.2)

From: Kisung Kim
Date:
Hi,

I've run YCSB (Yahoo! Cloud Serving Benchmark) against PostgreSQL and against MongoDB with WiredTiger,
and I found some interesting results and some possible issues with PostgreSQL's btree index.

Here are my experiments and results.
YCSB is a document-store benchmark, so I built the following schema in PG:

CREATE TABLE usertable (
    YCSB_KEY varchar(30) primary key,
    FIELDS jsonb);

The benchmark generated JSON documents of about 300 bytes on average and loaded 100M rows into both PG and Mongo.

First, I found a size difference between PG and Mongo
(I configured Mongo not to use any compression for either storage or indexes):

MongoDB index size: 2.1 GB
PostgreSQL index size: 5.5 GB

When I used the index bloat estimation script from https://github.com/ioguix/pgsql-bloat-estimation,
the result was as follows:
 current_database | schemaname |     tblname      |              idxname              | real_size  | extra_size |    extra_ratio    | fillfactor | bloat_size |    bloat_ratio    | is_na 
------------------+------------+------------------+-----------------------------------+------------+------------+-------------------+------------+------------+-------------------+-------
ycsb             | public     | usertable        | usertable_pkey                    | 5488852992 | 2673713152 |  48.7116917850949 |         90 | 2362122240 |  43.0348971532448 | f

It says the bloat_ratio is about 43% for the index.

So I rebuilt the index, and the result changed:

 current_database | schemaname |     tblname      |              idxname              | real_size  | extra_size |    extra_ratio    | fillfactor | bloat_size |    bloat_ratio    | is_na 
------------------+------------+------------------+-----------------------------------+------------+------------+-------------------+------------+------------+-------------------+-------
 ycsb             | public     | usertable        | usertable_pkey                    | 3154264064 |  339132416 |  10.7515543758863 |         90 |   27533312 | 0.872891788428275 | f


I am curious about these results:
1) Why was the index bloated even though rows were only inserted, never removed or updated?
2) And why is the bloat removed when the index is rebuilt?

I guess that btree page splits during row insertion caused the bloat, but I don't know the exact reason.
And given that MongoDB's index is much smaller than PG's after handling the same (insert-only) workload,
is there any chance to improve PG's index behavior?

Thank you very much. 


--
Bitnine Global Inc., Kisung Kim, Ph.D
https://sites.google.com/site/kisungresearch/
E-mail: kskim@bitnine.net
Office phone: 070-4800-5890, 408-606-8602
US mobile phone: 408-805-2192

Re: Btree Index on PostgreSQL and Wiredtiger (MongoDB3.2)

From: Lukas Fittl
Date:
On Wed, Aug 10, 2016 at 4:24 PM, Kisung Kim <kskim@bitnine.net> wrote:
> When I used the index bloat estimation script from https://github.com/ioguix/pgsql-bloat-estimation,
> the result was as follows:

Regardless of the issue at hand, it might make sense to verify these statistics using pgstattuple - those bloat estimation queries can be wildly off at times.


Best,
Lukas

--
Lukas Fittl

Skype: lfittl
Phone: +1 415 321 0630

Re: Btree Index on PostgreSQL and Wiredtiger (MongoDB3.2)

From: Kisung Kim
Date:
Thank you for the information.
Here is the result:

After insertions:

ycsb=# select * from pgstatindex('usertable_pkey');
 version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
       2 |          3 | 5488721920 |         44337 |           4464 |     665545 |           0 |             0 |               52 |                 11
(1 row)

After rebuild:


ycsb=# select * from pgstatindex('usertable_pkey');
 version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation 
---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
       2 |          3 | 3154296832 |         41827 |           1899 |     383146 |           0 |             0 |            90.08 |                  0


It seems that rebuilding reduces the number of internal and leaf pages and makes the leaf pages denser.



On Wed, Aug 10, 2016 at 6:47 PM, Lukas Fittl <lukas@fittl.com> wrote:
> Regardless of the issue at hand, it might make sense to verify these statistics using pgstattuple - those bloat estimation queries can be wildly off at times.




Re: Btree Index on PostgreSQL and Wiredtiger (MongoDB3.2)

From: Mark Kirkwood
Date:
After examining the benchmark design - I see we are probably not being 
helped by the repeated insertion of keys all of the form 'userxxxxxxx' 
leading to some page splitting.

However your index rebuild gets you from 5 to 3 GB - does that really 
help performance significantly?

regards

Mark

On 11/08/16 16:08, Kisung Kim wrote:
> [...]
> It seems that rebuilding reduces the number of internal and leaf pages
> and makes the leaf pages denser.




Re: Btree Index on PostgreSQL and Wiredtiger (MongoDB3.2)

From: Kisung Kim
Date:
You're right. Reindexing improves the performance of the benchmark workloads dramatically.
I'm gathering results and will share them.

But I think we should note that the results before reindexing are poorer than MongoDB's.
It seems that this is because of btree bloat (perhaps not the exact term).
Lookup performance of the btree is most crucial for these results,
because the workload is selects by primary key.
So a larger btree could mean fewer cache hits and slower query performance.
I think PG doesn't pack leaf pages to 90% but only to about half, to leave room for future insertions,
and because of this PG's btree is larger than MongoDB's
(I turned off WiredTiger's index prefix compression and the storage block compression).
MongoDB (actually WiredTiger) seems to behave differently.

Is my speculation right? I'm not sure, because I haven't reviewed PG's btree code yet.

I also want to point out that MongoDB's loading performance is better than PG's.
If PG leaves more space for future insertions, could we at least get faster loading performance?
And can we conclude that there are more chances to improve PG's btree?

Best Regards, 



On Fri, Aug 12, 2016 at 5:40 PM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
> After examining the benchmark design - I see we are probably not being
> helped by the repeated insertion of keys all of the form 'userxxxxxxx'
> leading to some page splitting.
>
> However your index rebuild gets you from 5 to 3 GB - does that really
> help performance significantly?
> [...]








Re: Btree Index on PostgreSQL and Wiredtiger (MongoDB3.2)

From: Jeff Janes
Date:
On Fri, Aug 12, 2016 at 1:40 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:
> After examining the benchmark design - I see we are probably not being
> helped by the repeated insertion of keys all of form 'userxxxxxxx' leading
> to some page splitting.

But shouldn't that still leave us with a 75% full index, rather than
slightly over 50% full?

The leaf pages start at 50%, grow to 100%, then split back to 50%,
then grow back to 100%.  So the average should be about 75%.


> However your index rebuild gets you from 5 to 3 GB - does that really help
> performance significantly?

It can make a big difference, depending on how much RAM you have.

Cheers,

Jeff



Re: Btree Index on PostgreSQL and Wiredtiger (MongoDB3.2)

From: Andrew Gierth
Date:
>>>>> "Jeff" == Jeff Janes <jeff.janes@gmail.com> writes:

 Jeff> But shouldn't that still leave us with a 75% full index, rather
 Jeff> than slightly over 50% full?

Average is usually about 67%-70%. (For capacity estimation I always
assume 66% for a non-sequentially-filled btree.)

 Jeff> The leaf pages start at 50%, grow to 100%, then split back to
 Jeff> 50%, then grow back to 100%.  So the average should be about 75%.

No, because as the pages split, they fill more slowly (because there are
now more pages). So on average in a large randomly filled index, pages
spend more time nearer 50% full than 100% full. This is easy to
demonstrate by creating a table with an indexed float8 column and adding
batches of random() values to it, checking with pgstatindex at intervals -
the average leaf density will rarely exceed 70%.

However, worst case conditions can give lower leaf densities; obviously
the worst case is if the data is loaded in an order and quantity that
just happens to leave every leaf page recently split.

-- 
Andrew (irc:RhodiumToad)



Re: Btree Index on PostgreSQL and Wiredtiger (MongoDB3.2)

From: Greg Stark
Date:
On Fri, Aug 12, 2016 at 8:13 PM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
> No, because as the pages split, they fill more slowly (because there are
> now more pages). So on average in a large randomly filled index, pages
> spend more time nearer 50% full than 100% full. This is easy to
> demonstrate by creating a table with an indexed float8 column and adding
> batches of random() values to it, checking with pgstatindex at intervals -
> the average leaf density will rarely exceed 70%.
>
> However, worst case conditions can give lower leaf densities; obviously
> the worst case is if the data is loaded in an order and quantity that
> just happens to leave every leaf page recently split.


btree pages don't split 50/50 either. They split biased to assume the
greater side of the split will receive more inserts -- iirc 70/30. So
if they're loaded sequentially you should get a btree that's 70% full,
but the worst case is in theory closer to 30%, though I think the
insert order would have to be pretty perverse to be worse than 50%.

-- 
greg



Re: Btree Index on PostgreSQL and Wiredtiger (MongoDB3.2)

From: Andrew Gierth
Date:
>>>>> "Greg" == Greg Stark <stark@mit.edu> writes:

 >> No, because as the pages split, they fill more slowly (because there
 >> are now more pages). So on average in a large randomly filled index,
 >> pages spend more time nearer 50% full than 100% full. This is easy
 >> to demonstrate by creating a table with an indexed float8 column and
 >> adding batches of random() values to it, checking with pgstatindex
 >> at intervals - the average leaf density will rarely exceed 70%.
 >>
 >> However, worst case conditions can give lower leaf densities;
 >> obviously the worst case is if the data is loaded in an order and
 >> quantity that just happens to leave every leaf page recently split.

 Greg> btree pages don't split 50/50 either. They split biased to assume
 Greg> the greater side of the split will receive more inserts -- iirc
 Greg> 70/30.

Hmm? The code in _bt_findsplitloc and _bt_checksplitloc doesn't seem to
agree with this.

(Inserting on the high leaf page is a special case, which is where the
fillfactor logic kicks in; that's why sequentially filled indexes are
(by default) 90% full rather than 100%. But other pages split into
roughly equal halves.)

-- 
Andrew (irc:RhodiumToad)



Re: Btree Index on PostgreSQL and Wiredtiger (MongoDB3.2)

From: Greg Stark
Date:
On Sat, Aug 13, 2016 at 1:18 AM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
>
> Hmm? The code in _bt_findsplitloc and _bt_checksplitloc doesn't seem to
> agree with this.
>
> (Inserting on the high leaf page is a special case, which is where the
> fillfactor logic kicks in; that's why sequentially filled indexes are
> (by default) 90% full rather than 100%. But other pages split into
> roughly equal halves.)

Hm, I was going from this lore. I didn't realize it was only for
inserts near the end of the index. That's cleverer than I realized.

commit 1663f3383849968415d29965ef9bfdf5aac4d358
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Sat Sep 29 23:49:51 2001 +0000

    Tweak btree page split logic so that when splitting a page that is
    rightmost on its tree level, we split 2/3 to the left and 1/3 to the
    new right page, rather than the even split we use elsewhere.  The idea
    is that when faced with a steadily increasing series of inserted keys
    (such as sequence or timestamp values), we'll end up with a btree that's
    about 2/3ds full not 1/2 full, which is much closer to the desired
    steady-state load for a btree.  Per suggestion from Ann Harrison of
    IBPhoenix.
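The effect of that tweak can be illustrated with a toy model of an append-only load. This is an illustration under simplifying assumptions (a single active rightmost page, a fixed key capacity, and no leaf fillfactor), not the real _bt_findsplitloc logic:

```python
def sequential_density(n_keys, capacity=100, left_fraction=2/3):
    """Model strictly increasing inserts: keys always land on the
    rightmost leaf; when it overflows, `left_fraction` of its keys stay
    on the finished left page and the rest start a new rightmost page."""
    finished = 0          # filled-and-frozen left pages
    right = 0             # keys on the current rightmost page
    for _ in range(n_keys):
        right += 1
        if right > capacity:
            keep = int(capacity * left_fraction)
            finished += 1                 # freeze a page with `keep` keys
            right -= keep
    return n_keys / ((finished + 1) * capacity)

print(round(sequential_density(100_000, 100, 1/2), 2))   # even split: ~0.5 full
print(round(sequential_density(100_000, 100, 2/3), 2))   # 2/3-1/3 split: ~0.66 full
```

With an even split every frozen page ends half full; the 2/3-1/3 rightmost split leaves pages about two-thirds full, matching the commit message.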
 


-- 
greg



Re: Btree Index on PostgreSQL and Wiredtiger (MongoDB3.2)

From: Mark Kirkwood
Date:
On 13/08/16 05:44, Jeff Janes wrote:
> On Fri, Aug 12, 2016 at 1:40 AM, Mark Kirkwood
>> However your index rebuild gets you from 5 to 3 GB - does that really help
>> performance significantly?
> It can make a big difference, depending on how much RAM you have.
>

Yeah - I suspect this is the issue - loading up a similar type of schema 
with records with a primary key of the form 'userxxxxx' for (uniformly) 
randomly distributed xxxxx... (I was going to use the Yahoo benchmark but 
it is so slow...). Also I'm using 10000000 rows instead of 100000000 
to avoid waiting a long time (10000000 should be enough to show the point):

prefix=# \d prefix
           Table "public.prefix"
 Column |         Type          | Modifiers
--------+-----------------------+-----------
 uid    | character varying(30) | not null
 filler | character(255)        |
Indexes:
    "prefix_pkey" PRIMARY KEY, btree (uid)

Doing an uncached indexed read by forcing a buffer cache clear:

# echo 3 > /proc/sys/vm/drop_caches

prefix=# SELECT relfilenode, relname, reltuples,
         pg_relation_size(oid)/1024/1024 AS mb
         FROM pg_class WHERE relname LIKE 'prefix%';
 relfilenode |   relname   | reltuples | mb
-------------+-------------+-----------+-----
     6017817 | prefix      |     1e+07 | 422
     6017819 | prefix_pkey |     1e+07 | 391
(2 rows)

prefix=# EXPLAIN ANALYZE SELECT count(*) FROM prefix WHERE uid='user10000';
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=8.46..8.46 rows=1 width=0) (actual time=3.408..3.408 rows=1 loops=1)
   ->  Index Only Scan using prefix_pkey on prefix  (cost=0.43..8.45 rows=1 width=0) (actual time=3.406..3.406 rows=0 loops=1)
         Index Cond: (uid = 'user10000'::text)
         Heap Fetches: 0
 Planning time: 19.362 ms
 Execution time: 3.429 ms
(6 rows)

Repeating this after REINDEX:

# echo 3 > /proc/sys/vm/drop_caches

prefix=# SELECT relfilenode, relname, reltuples,
         pg_relation_size(oid)/1024/1024 AS mb
         FROM pg_class WHERE relname LIKE 'prefix%';
 relfilenode |   relname   | reltuples | mb
-------------+-------------+-----------+-----
     6017817 | prefix      |     1e+07 | 422
     6017819 | prefix_pkey |     1e+07 | 300
(2 rows)

prefix=# EXPLAIN ANALYZE SELECT count(*) FROM prefix WHERE uid='user10000';
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=8.46..8.46 rows=1 width=0) (actual time=3.868..3.868 rows=1 loops=1)
   ->  Index Only Scan using prefix_pkey on prefix  (cost=0.43..8.45 rows=1 width=0) (actual time=3.866..3.866 rows=0 loops=1)
         Index Cond: (uid = 'user10000'::text)
         Heap Fetches: 0
 Planning time: 19.366 ms
 Execution time: 3.889 ms
(6 rows)

So it is certainly not significantly *slower* with the physically bigger 
index here. This suggests that Jeff's analysis was spot on - in the 
original benchmark the larger index likely didn't fit in RAM.

Cheers

Mark



Re: Btree Index on PostgreSQL and Wiredtiger (MongoDB3.2)

From: Andrew Borodin
Date:
> So on average in a large randomly filled index, pages spend more time
> nearer 50% full than 100% full.

I think we can make this number more... controllable.
Before a split, we could check whether the left and right sibling pages
are in shared buffers and, if one of them is seriously under a certain
fillfactor, say under 75%, move some of the data there instead of
splitting. This would slow down insertions a bit, but we could have any
fillfactor we want for random inserts. For sure, someone can construct
bad input to force a low fillfactor, as with qsort
( http://www.cs.dartmouth.edu/~doug/mdmspe.pdf ), but no realistic
scenario will trigger this behavior.

But then we have to think about locks, WAL, etc.
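As a rough feasibility check, the idea can be tried in a toy single-level model of leaf pages. Everything here is a simplification of my own (the 75% threshold, the move-half-the-difference heuristic, no locks, WAL, or internal pages), not a patch against PostgreSQL:

```python
import bisect
import random

def simulate(n_keys=150_000, capacity=100, share_below=0.75, seed=1):
    """Random inserts into model leaf pages. Before splitting a full
    page, optionally shift keys into a sibling that is under
    `share_below` full; pass share_below=None for plain 50/50 splits."""
    random.seed(seed)
    leaves, highs = [[]], [1.0]       # sorted key lists + routing bounds
    for _ in range(n_keys):
        k = random.random()
        i = bisect.bisect_left(highs, k)
        leaf = leaves[i]
        bisect.insort(leaf, k)
        if len(leaf) <= capacity:
            continue
        if share_below is not None and i > 0 and \
                len(leaves[i - 1]) < capacity * share_below:
            # shift our lowest keys onto the underfull left sibling
            give = min((len(leaf) - len(leaves[i - 1])) // 2,
                       capacity - len(leaves[i - 1]))
            leaves[i - 1].extend(leaf[:give])
            del leaf[:give]
            highs[i - 1] = leaves[i - 1][-1]
        elif share_below is not None and i + 1 < len(leaves) and \
                len(leaves[i + 1]) < capacity * share_below:
            # shift our highest keys onto the underfull right sibling
            give = min((len(leaf) - len(leaves[i + 1])) // 2,
                       capacity - len(leaves[i + 1]))
            leaves[i + 1][0:0] = leaf[-give:]
            del leaf[-give:]
            highs[i] = leaf[-1]
        else:
            mid = len(leaf) // 2                      # plain page split
            leaves[i:i + 1] = [leaf[:mid], leaf[mid:]]
            highs[i:i + 1] = [leaf[mid - 1], highs[i]]
    return n_keys / (len(leaves) * capacity)

print(round(simulate(share_below=None), 2), round(simulate(), 2))
```

In this model, sharing with a sibling before splitting does lift the average leaf density above the plain-split baseline, at the cost of extra key movement per overflow - which is exactly the insertion-speed trade-off mentioned above.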


Best regards, Andrey Borodin, Octonica & Ural Federal University.