Discussion: [PERFORM] Speeding up JSON + TSQUERY + GIN
Hello everyone,
I am currently evaluating the possibility of using PostgreSQL for storing and querying jsonb documents with tsvector full-text search. Let's consider this setup:
create table docs (id serial primary key, meta jsonb);
# generate 10M entries, cf. appendix
create index docs_meta_idx ON docs using gin (meta jsonb_path_ops);
create index docs_name_idx ON docs using gin (to_tsvector('english', meta->>'name'));
create index docs_address_idx ON docs using gin (to_tsvector('english', meta->>'address'));
Testing with smaller datasets, it's great functionality-wise. However, when increasing to 10M rows, things slow down (using PostgreSQL 9.5):
explain analyze select id from docs where meta @> '{"age": 20}';
Planning time: 0.121 ms
Execution time: 4873.507 ms
explain analyze select id from docs where meta @> '{"age": 20}';
Planning time: 0.122 ms
Execution time: 206.289 ms
explain analyze select id from docs where meta @> '{"age": 30}';
Planning time: 0.109 ms
Execution time: 7496.886 ms
explain analyze select id from docs where meta @> '{"age": 30}';
Planning time: 0.114 ms
Execution time: 1169.649 ms
explain analyze select id from docs where to_tsvector('english', meta->>'name') @@ to_tsquery('english', 'john');
Planning time: 0.179 ms
Execution time: 10109.375 ms
explain analyze select id from docs where to_tsvector('english', meta->>'name') @@ to_tsquery('english', 'john');
Planning time: 0.188 ms
Execution time: 238.854 ms
Running "select pg_prewarm('docs');", and likewise on any of the indexes, doesn't help either.
After a "systemctl stop postgresql.service && sync && echo 3 > /proc/sys/vm/drop_caches && systemctl start postgresql.service" the age=20, 30 or name=john queries are slow again.
Is there a way to speed up or to warm up things permanently?
Regards,
Sven
Appendix I:
example json:
{"age": 20, "name": "Michelle Hernandez", "birth": "1991-08-16", "address": "94753 Tina Bridge Suite 318\\nEmilyport, MT 75302"}
Appendix II:
The Python script to generate fake json data. Needs "pip install faker".
$ python fake_json.py > test.json  # generates 2M entries; takes some time
$ cat test.json | psql -c 'copy docs (meta) from stdin'
$ cat test.json | psql -c 'copy docs (meta) from stdin'
$ cat test.json | psql -c 'copy docs (meta) from stdin'
$ cat test.json | psql -c 'copy docs (meta) from stdin'
$ cat test.json | psql -c 'copy docs (meta) from stdin'
-- fake_json.py --
import faker
import json

fake = faker.Faker()
for i in range(2 * 10**6):
    # escape the backslash of JSON's \n so COPY's text format preserves it
    print(json.dumps({"name": fake.name(), "birth": fake.date(), "address": fake.address(), "age": fake.random_int(0, 100)}).replace('\\n', '\\\\n'))
On Sun, Feb 26, 2017 at 4:28 PM, Sven R. Kunze <srkunze@mail.de> wrote:
Hello everyone,
I am currently evaluating the possibility of using PostgreSQL for storing and querying jsonb documents with tsvector full-text search. Let's consider this setup:
create table docs (id serial primary key, meta jsonb);
# generate 10M entries, cf. appendix
create index docs_meta_idx ON docs using gin (meta jsonb_path_ops);
create index docs_name_idx ON docs using gin (to_tsvector('english', meta->>'name'));
create index docs_address_idx ON docs using gin (to_tsvector('english', meta->>'address'));
A functional index tends to be slow; better to use a separate column (or columns) for the tsvector.
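[Editor's note: a minimal sketch of this separate-column approach; the column, function, and index names below are illustrative, not from the thread.]

```sql
-- Maintain the tsvector in a real column instead of a functional index.
ALTER TABLE docs ADD COLUMN name_tsv tsvector;
UPDATE docs SET name_tsv = to_tsvector('english', meta->>'name');

-- Keep the column in sync on writes.
CREATE FUNCTION docs_name_tsv_update() RETURNS trigger AS $$
BEGIN
  NEW.name_tsv := to_tsvector('english', NEW.meta->>'name');
  RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER docs_name_tsv_trg BEFORE INSERT OR UPDATE ON docs
  FOR EACH ROW EXECUTE PROCEDURE docs_name_tsv_update();

CREATE INDEX docs_name_tsv_idx ON docs USING gin (name_tsv);

-- Queries then match against the plain column:
-- SELECT id FROM docs WHERE name_tsv @@ to_tsquery('english', 'john');
```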
Testing with smaller datasets, it's great functionality-wise. However, when increasing to 10M rows, things slow down (using PostgreSQL 9.5):
explain analyze select id from docs where meta @> '{"age": 20}';
Planning time: 0.121 ms
Execution time: 4873.507 ms
explain analyze select id from docs where meta @> '{"age": 20}';
Planning time: 0.122 ms
Execution time: 206.289 ms
explain analyze select id from docs where meta @> '{"age": 30}';
Planning time: 0.109 ms
Execution time: 7496.886 ms
explain analyze select id from docs where meta @> '{"age": 30}';
Planning time: 0.114 ms
Execution time: 1169.649 ms
explain analyze select id from docs where to_tsvector('english', meta->>'name') @@ to_tsquery('english', 'john');
Planning time: 0.179 ms
Execution time: 10109.375 ms
explain analyze select id from docs where to_tsvector('english', meta->>'name') @@ to_tsquery('english', 'john');
Planning time: 0.188 ms
Execution time: 238.854 ms
What is the full output from EXPLAIN ANALYZE?
Thanks Oleg for your reply.
On 26.02.2017 21:13, Oleg Bartunov wrote:
On Sun, Feb 26, 2017 at 4:28 PM, Sven R. Kunze <srkunze@mail.de> wrote:
create index docs_meta_idx ON docs using gin (meta jsonb_path_ops);
create index docs_name_idx ON docs using gin (to_tsvector('english', meta->>'name'));
create index docs_address_idx ON docs using gin (to_tsvector('english', meta->>'address'));
A functional index tends to be slow; better to use a separate column (or columns) for the tsvector.
Why? Don't we have indexes to make them faster?
The idea is to accelerate all operations as specified (cf. the table schema below) without adding more and more columns.
What is the full output from EXPLAIN ANALYZE?
Okay, let's stick to the GIN index and the @> operator for now, before we tackle the functional index issue.
Maybe I did something wrong when defining the GIN indexes:
explain analyze select id from docs where meta @> '{"age": 40}';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on docs (cost=86.50..9982.50 rows=10000 width=4) (actual time=97.443..8073.983 rows=98385 loops=1)
Recheck Cond: (meta @> '{"age": 40}'::jsonb)
Heap Blocks: exact=79106
-> Bitmap Index Scan on docs_meta_idx (cost=0.00..84.00 rows=10000 width=0) (actual time=66.878..66.878 rows=98385 loops=1)
Index Cond: (meta @> '{"age": 40}'::jsonb)
Planning time: 0.118 ms
Execution time: 8093.533 ms
(7 rows)
explain analyze select id from docs where meta @> '{"age": 40}';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on docs (cost=86.50..9982.50 rows=10000 width=4) (actual time=99.527..3349.001 rows=98385 loops=1)
Recheck Cond: (meta @> '{"age": 40}'::jsonb)
Heap Blocks: exact=79106
-> Bitmap Index Scan on docs_meta_idx (cost=0.00..84.00 rows=10000 width=0) (actual time=68.503..68.503 rows=98385 loops=1)
Index Cond: (meta @> '{"age": 40}'::jsonb)
Planning time: 0.113 ms
Execution time: 3360.773 ms
(7 rows)
explain analyze select id from docs where meta @> '{"age": 40}';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on docs (cost=86.50..9982.50 rows=10000 width=4) (actual time=64.928..168.311 rows=98385 loops=1)
Recheck Cond: (meta @> '{"age": 40}'::jsonb)
Heap Blocks: exact=79106
-> Bitmap Index Scan on docs_meta_idx (cost=0.00..84.00 rows=10000 width=0) (actual time=45.340..45.340 rows=98385 loops=1)
Index Cond: (meta @> '{"age": 40}'::jsonb)
Planning time: 0.121 ms
Execution time: 171.098 ms
(7 rows)
explain analyze select id from docs where meta @> '{"age": 40}';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on docs (cost=86.50..9982.50 rows=10000 width=4) (actual time=86.118..215.755 rows=98385 loops=1)
Recheck Cond: (meta @> '{"age": 40}'::jsonb)
Heap Blocks: exact=79106
-> Bitmap Index Scan on docs_meta_idx (cost=0.00..84.00 rows=10000 width=0) (actual time=54.535..54.535 rows=98385 loops=1)
Index Cond: (meta @> '{"age": 40}'::jsonb)
Planning time: 0.127 ms
Execution time: 219.746 ms
(7 rows)
explain analyze select id from docs where meta @> '{"age": 40}';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on docs (cost=86.50..9982.50 rows=10000 width=4) (actual time=83.197..211.840 rows=98385 loops=1)
Recheck Cond: (meta @> '{"age": 40}'::jsonb)
Heap Blocks: exact=79106
-> Bitmap Index Scan on docs_meta_idx (cost=0.00..84.00 rows=10000 width=0) (actual time=53.036..53.036 rows=98385 loops=1)
Index Cond: (meta @> '{"age": 40}'::jsonb)
Planning time: 0.127 ms
Execution time: 215.753 ms
(7 rows)
Regards,
Sven
Table Schema:
Table "public.docs"
Column | Type | Modifiers
--------+---------+---------------------------------------------------
id | integer | not null default nextval('docs_id_seq'::regclass)
meta | jsonb |
Indexes:
"docs_pkey" PRIMARY KEY, btree (id)
"docs_address_idx" gin (to_tsvector('english'::regconfig, meta ->> 'address'::text))
"docs_address_trgm_idx" gin ((meta ->> 'address'::text) gin_trgm_ops)
"docs_birth_idx" btree ((meta ->> 'birth'::text))
"docs_meta_idx" gin (meta jsonb_path_ops)
"docs_name_idx" gin (to_tsvector('english'::regconfig, meta ->> 'name'::text))
On Sun, Feb 26, 2017 at 5:28 AM, Sven R. Kunze <srkunze@mail.de> wrote:
Running "select pg_prewarm('docs');", and likewise on any of the indexes, doesn't help either.
After a "systemctl stop postgresql.service && sync && echo 3 > /proc/sys/vm/drop_caches && systemctl start postgresql.service" the age=20, 30 or name=john queries are slow again.
Is there a way to speed up or to warm up things permanently?
If by 'permanently', you mean even when you intentionally break things, then no. You will always be able to intentionally break things. There is ongoing discussion of an auto-prewarm feature. But that doesn't yet exist; and once it does, a superuser will always be able to break it.
Presumably you have a use-case in mind other than intentional sabotage of your caches by root. But, what is it? If you reboot the server frequently, maybe you can just throw 'select pg_prewarm...' into an init script?
Cheers,
Jeff
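[Editor's note: Jeff's init-script idea could be sketched as follows; the relation names are taken from the thread's schema, and the pg_prewarm extension (contrib) is assumed to be installed.]

```sql
-- Run once at server startup, e.g. from an init script via psql.
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('docs');
SELECT pg_prewarm('docs_pkey');
SELECT pg_prewarm('docs_meta_idx');
SELECT pg_prewarm('docs_name_idx');
SELECT pg_prewarm('docs_address_idx');
```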
On 27.02.2017 19:22, Jeff Janes wrote:
If by 'permanently', you mean even when you intentionally break things, then no. You will always be able to intentionally break things. There is ongoing discussion of an auto-prewarm feature. But that doesn't yet exist; and once it does, a superuser will always be able to break it.

Presumably you have a use-case in mind other than intentional sabotage of your caches by root. But, what is it? If you reboot the server frequently, maybe you can just throw 'select pg_prewarm...' into an init script?
I didn't express myself well enough. pg_prewarm doesn't help to speed up those queries at all.
Looking at these numbers, I wonder why it takes ~5 secs to answer?
Best,
Sven
On Tue, Feb 28, 2017 at 12:27 AM, Sven R. Kunze <srkunze@mail.de> wrote:
I didn't express myself well enough. pg_prewarm doesn't help to speed up those queries at all.
Oh. In my hands, it works very well. I get 70 seconds to do the {age: 20} query from pure cold caches, versus 1.4 seconds from cold caches which was followed by pg_prewarm('docs','prefetch').
How much RAM do you have? Maybe you don't have enough to hold the table in RAM. What kind of IO system? And what OS?
Cheers,
Jeff
On 28.02.2017 17:49, Jeff Janes wrote:
Oh. In my hands, it works very well. I get 70 seconds to do the {age: 20} query from pure cold caches, versus 1.4 seconds from cold caches which was followed by pg_prewarm('docs','prefetch').

How much RAM do you have? Maybe you don't have enough to hold the table in RAM. What kind of IO system? And what OS?
On my test system:
RAM: 4GB
IO: SSD (random_page_cost = 1.0)
OS: Ubuntu 16.04
Regards,
Sven
On Wed, Mar 1, 2017 at 6:02 AM, Sven R. Kunze <srkunze@mail.de> wrote:
On my test system:
RAM: 4GB
IO: SSD (random_page_cost = 1.0)
OS: Ubuntu 16.04
4GB is not much RAM to be trying to pre-warm this amount of data into. Towards the end of the pg_prewarm, it is probably evicting data read in by the earlier part of it.
What is shared_buffers?
Cheers,
Jeff
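[Editor's note: shared_buffers is set in postgresql.conf; a common rule of thumb for a dedicated box is roughly 25% of RAM, so on a 4GB machine something like the following. This is a generic guideline, not a recommendation made in this thread.]

```
# postgresql.conf (change requires a server restart)
shared_buffers = 1GB
```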
On 01.03.2017 18:04, Jeff Janes wrote:
942MB.
But I see where you are coming from. How come these queries need a Recheck Cond? I gather this would require reading not only the index data but also the table itself, which could be huge, right?
Sven
On Thu, Mar 2, 2017 at 1:19 PM, Sven R. Kunze <srkunze@mail.de> wrote:
But I see where you are coming from. How come these queries need a Recheck Cond? I gather this would require reading not only the index data but also the table itself, which could be huge, right?
Bitmaps can overflow and drop the row-level information, tracking only the blocks which need to be inspected. So it has to have a recheck in case that happens (although in your case it is not actually overflowing--but it still needs to be prepared for that). Also, I think that jsonb_path_ops indexes the hashes of the paths, so it can deliver false positives which need to be rechecked. And you are selecting `id`, which is not in the index so it would have to consult the table anyway to retrieve that. Even if it could get all the data from the index itself, I don't think GIN indexes support that feature.
Cheers,
Jeff
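[Editor's note: one way to observe this distinction is EXPLAIN with the BUFFERS option, available in 9.5; the sketch below is illustrative.]

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM docs WHERE meta @> '{"age": 40}';
-- "Heap Blocks: exact=79106" in the plans above means the bitmap stayed exact
-- (row-level); a line like "lossy=..." would indicate overflow, in which case
-- raising work_mem could help.
```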
On 06.03.2017 05:25, Jeff Janes wrote:
Bitmaps can overflow and drop the row-level information, tracking only the blocks which need to be inspected. So it has to have a recheck in case that happens (although in your case it is not actually overflowing--but it still needs to be prepared for that).
Good to know.
Also, I think that jsonb_path_ops indexes the hashes of the paths, so it can deliver false positives which need to be rechecked.
Wow, that's a very important piece of information. It explains a lot. Thanks a lot.
And you are selecting `id`, which is not in the index so it would have to consult the table anyway to retrieve that. Even if it could get all the data from the index itself, I don't think GIN indexes support that feature.
Yes, I see. I was actually a bit sloppy with the query; what really matters here is counting the number of rows. However, from what I can see, this is the best PostgreSQL can do right now.
Or do you have any more ideas on how to speed up counting?
Best,
Sven
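[Editor's note: if an approximate count is acceptable, a common workaround is to read the planner's row estimate instead of visiting the heap. A sketch follows; the function name is illustrative, and the estimate is only as good as the table's statistics.]

```sql
CREATE FUNCTION count_estimate(q text) RETURNS bigint AS $$
DECLARE
  plan json;
BEGIN
  -- Ask the planner for its estimate rather than executing the query.
  EXECUTE 'EXPLAIN (FORMAT json) ' || q INTO plan;
  RETURN (plan->0->'Plan'->>'Plan Rows')::bigint;
END
$$ LANGUAGE plpgsql;

SELECT count_estimate($$SELECT 1 FROM docs WHERE meta @> '{"age": 20}'$$);
```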