Обсуждение: performance problems with bulk inserts/updates on tsrange with gist-based exclude constrains
performance problems with bulk inserts/updates on tsrange with gist-based exclude constrains
От
Chris Withers
Дата:
Hi All, I have quite a few tables that follow a pattern like this: Table "public.my_model" Column | Type | Modifiers --------+-------------------+----------- period | tsrange | not null key | character varying | not null value | integer | Indexes: "my_model_pkey" PRIMARY KEY, btree (period, key) "my_model_period_key_excl" EXCLUDE USING gist (period WITH &&, key WITH =) Check constraints: "my_model_period_check" CHECK (period <> 'empty'::tsrange) So, a primary key of a period column and one or more other columns (usually int or string) and an exclude constraint to prevent overlaps, and a check constraint to prevent empty ranges. However, I'm hitting performance problems on moderate bulk inserts and updates, with ~700k rows taking around 13 minutes. Profiling my python code suggests that most of the time is being taken by Postgres (9.4 in this case...) What can I do to speed things up? Is there a different type of index I can use to achieve the same exclude constraint? Is there something I can do to have the index changes only done on the commit of the bulk batches? cheers, Chris
On 9/16/2016 2:01 AM, Chris Withers wrote: > Hi All, > > I have quite a few tables that follow a pattern like this: > > Table "public.my_model" > Column | Type | Modifiers > --------+-------------------+----------- > period | tsrange | not null > key | character varying | not null > value | integer | > Indexes: > "my_model_pkey" PRIMARY KEY, btree (period, key) > "my_model_period_key_excl" EXCLUDE USING gist (period WITH &&, > key WITH =) > Check constraints: > "my_model_period_check" CHECK (period <> 'empty'::tsrange) > > So, a primary key of a period column and one or more other columns > (usually int or string) and an exclude constraint to prevent overlaps, > and a check constraint to prevent empty ranges. > > However, I'm hitting performance problems on moderate bulk inserts and > updates, with ~700k rows taking around 13 minutes. Profiling my python > code suggests that most of the time is being taken by Postgres (9.4 in > this case...) > > What can I do to speed things up? Is there a different type of index I > can use to achieve the same exclude constraint? Is there something I > can do to have the index changes only done on the commit of the bulk > batches? if (period,key) is unique, by virtue of being the primary key, then whats the point of the exclusion ?? I'm curious, how fast do your insert/updates run if you remove the key exclusion and check constraint ? tsvector operations are a lot more complicated than simple matches in indexing.... -- john r pierce, recycling bits in santa cruz
On 9/16/2016 2:12 AM, John R Pierce wrote: > Column | Type | Modifiers > --------+-------------------+----------- > period | tsrange | not null wait, what is a tsrange? the standard textsearch data types in postgres are tsvector and tsquery, -- john r pierce, recycling bits in santa cruz
On 9/16/2016 2:23 AM, John R Pierce wrote: > > wait, what is a tsrange? the standard textsearch data types in > postgres are tsvector and tsquery, never mind, I should have known, its a timestamp range. ... when you do updates, are you changing any of the indexed fields, or just "value" ?
On 16/09/2016 10:26, John R Pierce wrote: > On 9/16/2016 2:23 AM, John R Pierce wrote: >> >> wait, what is a tsrange? the standard textsearch data types in >> postgres are tsvector and tsquery, > > never mind, I should have known, its a timestamp range. ... > > > when you do updates, are you changing any of the indexed fields, or > just "value" ? Yeah, it's a temporal table, so "updates" involve modifying the period column for a row to set its end ts, and then inserting a new row with a start ts running on from that. Of course, the adds are just inserting new rows. cheers, Chris
On 9/16/2016 3:46 AM, Chris Withers wrote:
Yeah, it's a temporal table, so "updates" involve modifying the period column for a row to set its end ts, and then inserting a new row with a start ts running on from that.
when you do updates, are you changing any of the indexed fields, or
just "value" ?
thats expensive, as it has to reindex that row. and range indexes are more expensive than timestamp indexes
modifiyng the primary key is kind of a violation of one of the basic rules of relational databases as it means the row can't be referenced by another table.
I expect the expensive one is the constraint that ensures no periods overlap for the given key. I'm not sure how that can be done short of a full scan for each update/insert. it might actually perform better if you write the index with the key first as presumably the key is invariant ?
-- john r pierce, recycling bits in santa cruz
On 16/09/2016 12:00, John R Pierce wrote: > On 9/16/2016 3:46 AM, Chris Withers wrote: >>> >>> when you do updates, are you changing any of the indexed fields, or >>> just "value" ? >> Yeah, it's a temporal table, so "updates" involve modifying the period >> column for a row to set its end ts, and then inserting a new row with >> a start ts running on from that. > > thats expensive, as it has to reindex that row. and range indexes are > more expensive than timestamp indexes > > modifiyng the primary key is kind of a violation of one of the basic > rules of relational databases as it means the row can't be referenced by > another table. Right, but these rows have no natural primary key. Would it help if I just added an auto-incrementing integer key? Would that make a positive difference or would it just be a wasted column? > I expect the expensive one is the constraint that ensures no periods > overlap for the given key. I'm not sure how that can be done short of > a full scan for each update/insert. Indeed, I wonder if making the constraint deferrable might help for the bulk case? > it might actually perform better > if you write the index with the key first as presumably the key is > invariant ? You mean: PRIMARY KEY, btree (key1, key2, period) as opposed to PRIMARY KEY, btree (period, key) Interesting, I'd assumed postgres would optimise that under the covers... Chris
-----Original Message----- From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Chris Withers Sent: Friday, September 16, 2016 6:47 AM To: John R Pierce <pierce@hogranch.com>; pgsql-general@postgresql.org Subject: Re: [GENERAL] performance problems with bulk inserts/updates on tsrange with gist-based exclude constrains On 16/09/2016 10:26, John R Pierce wrote: > On 9/16/2016 2:23 AM, John R Pierce wrote: >> >> wait, what is a tsrange? the standard textsearch data types in >> postgres are tsvector and tsquery, > > never mind, I should have known, its a timestamp range. ... > > > when you do updates, are you changing any of the indexed fields, or > just "value" ? Yeah, it's a temporal table, so "updates" involve modifying the period column for a row to set its end ts, and then insertinga new row with a start ts running on from that. Of course, the adds are just inserting new rows. cheers, Chris ____________________________________________________________________________________________________ So, what is the value for "end ts", when the record is inserted (the range just started)? Regards, Igor Neyman
On 16/09/2016 14:54, Igor Neyman wrote: > > -----Original Message----- > From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Chris Withers > Sent: Friday, September 16, 2016 6:47 AM > To: John R Pierce <pierce@hogranch.com>; pgsql-general@postgresql.org > Subject: Re: [GENERAL] performance problems with bulk inserts/updates on tsrange with gist-based exclude constrains > > On 16/09/2016 10:26, John R Pierce wrote: >> On 9/16/2016 2:23 AM, John R Pierce wrote: >>> >>> wait, what is a tsrange? the standard textsearch data types in >>> postgres are tsvector and tsquery, >> >> never mind, I should have known, its a timestamp range. ... >> >> >> when you do updates, are you changing any of the indexed fields, or >> just "value" ? > Yeah, it's a temporal table, so "updates" involve modifying the period column for a row to set its end ts, and then insertinga new row with a start ts running on from that. > > Of course, the adds are just inserting new rows. > > So, what is the value for "end ts", when the record is inserted (the range just started)? It's open ended, so the period is [start_ts, ) Chris
Chris Withers <chris@simplistix.co.uk> writes: > On 16/09/2016 14:54, Igor Neyman wrote: >> So, what is the value for "end ts", when the record is inserted (the range just started)? > It's open ended, so the period is [start_ts, ) I've not looked at the GiST range opclass, but I would not be surprised if having lots of those is pretty destructive to the index's ability to be selective about && searches. regards, tom lane
On 16/09/2016 12:00, John R Pierce wrote: > On 9/16/2016 3:46 AM, Chris Withers wrote: >>> >>> when you do updates, are you changing any of the indexed fields, or >>> just "value" ? >> Yeah, it's a temporal table, so "updates" involve modifying the period >> column for a row to set its end ts, and then inserting a new row with >> a start ts running on from that. > > thats expensive, as it has to reindex that row. and range indexes are > more expensive than timestamp indexes > > modifiyng the primary key is kind of a violation of one of the basic > rules of relational databases as it means the row can't be referenced by > another table. Right, but these rows have no natural primary key. Would it help if I just added an auto-incrementing integer key? Would that make a positive difference or would it just be a wasted column? > I expect the expensive one is the constraint that ensures no periods > overlap for the given key. I'm not sure how that can be done short of > a full scan for each update/insert. Indeed, I wonder if making the constraint deferrable might help for the bulk case? > it might actually perform better > if you write the index with the key first as presumably the key is > invariant ? You mean: PRIMARY KEY, btree (period, key) as opposed to > > > > -- > john r pierce, recycling bits in santa cruz >
On 16/09/2016 10:26, John R Pierce wrote: > On 9/16/2016 2:23 AM, John R Pierce wrote: >> >> wait, what is a tsrange? the standard textsearch data types in >> postgres are tsvector and tsquery, > > never mind, I should have known, its a timestamp range. ... > > > when you do updates, are you changing any of the indexed fields, or > just "value" ? Yeah, it's a temporal table, so "updates" involve modifying the period column for a row to set its end ts, and then inserting a new row with a start ts running on from that. Of course, the adds are just inserting new rows. cheers, Chris
On 16/09/2016 15:29, Tom Lane wrote: > Chris Withers <chris@simplistix.co.uk> writes: >> On 16/09/2016 14:54, Igor Neyman wrote: >>> So, what is the value for "end ts", when the record is inserted (the range just started)? > >> It's open ended, so the period is [start_ts, ) > > I've not looked at the GiST range opclass, but I would not be surprised if > having lots of those is pretty destructive to the index's ability to be > selective about && searches. If that's so, that's a little disappointing... (I'd have thought the special case end value (open ended) and the ending type (inclusive/exclusive) would just be sentinel values) How would I verify your suspicions? cheers, Chris
On Fri, Sep 16, 2016 at 2:01 AM, Chris Withers <chris@simplistix.co.uk> wrote:
Hi All,
I have quite a few tables that follow a pattern like this:
Table "public.my_model"
Column | Type | Modifiers
--------+-------------------+-----------
period | tsrange | not null
key | character varying | not null
value | integer |
Indexes:
"my_model_pkey" PRIMARY KEY, btree (period, key)
"my_model_period_key_excl" EXCLUDE USING gist (period WITH &&, key WITH =)
Check constraints:
"my_model_period_check" CHECK (period <> 'empty'::tsrange)
Try swapping the order of the columns in the exclude constraint. You want the more selective criterion to appear first in the index/constraint. Presumably "key with =" is the most selective, especially if many of your periods are unbounded.
Cheers,
Jeff
Jeff Janes wrote > Try swapping the order of the columns in the exclude constraint. You want > the more selective criterion to appear first in the index/constraint. > Presumably "key with =" is the most selective, especially if many of your > periods are unbounded. I would not be so sure with that: http://use-the-index-luke.com/sql/myth-directory/most-selective-first -- View this message in context: http://postgresql.nabble.com/performance-problems-with-bulk-inserts-updates-on-tsrange-with-gist-based-exclude-constrains-tp5921498p5922219.html Sent from the PostgreSQL - general mailing list archive at Nabble.com.
On Wed, Sep 21, 2016 at 2:18 PM, pinker <pinker@onet.eu> wrote:
Jeff Janes wrote
> Try swapping the order of the columns in the exclude constraint. You want
> the more selective criterion to appear first in the index/constraint.
> Presumably "key with =" is the most selective, especially if many of your
> periods are unbounded.
I would not be so sure with that:
As a rule, I generally don't spout random nonsense. Or at least, not without including a disclaimer. I didn't test it on the OPs exact case, because he has need blessed us with his data or his scripts. But I have tested it on other data, and it does work.
http://use-the-index-luke.com/sql/myth-directory/most- selective-first
I don't see how anything there applies to GiST indexes. Indeed, there doesn't seem to be much there worth reading at all. The only thing vaguely informative, other than trivia about other RDBMS, is "It’s useless to have the most selective column of the index on the left if very few queries filter on it", which is rather obvious, but also obviously does not apply to this case.
Cheers,
Jeff
Hey Tom, I appreciate you're busy, but did you ever get a chance to look at this? On 19/09/2016 08:40, Chris Withers wrote: > On 16/09/2016 15:29, Tom Lane wrote: >> Chris Withers <chris@simplistix.co.uk> writes: >>> On 16/09/2016 14:54, Igor Neyman wrote: >>>> So, what is the value for "end ts", when the record is inserted (the >>>> range just started)? >> >>> It's open ended, so the period is [start_ts, ) >> >> I've not looked at the GiST range opclass, but I would not be >> surprised if >> having lots of those is pretty destructive to the index's ability to be >> selective about && searches. > > If that's so, that's a little disappointing... > (I'd have thought the special case end value (open ended) and the ending > type (inclusive/exclusive) would just be sentinel values) > > How would I verify your suspicions? cheers, Chris
Hey Tom, I appreciate you're busy, but did you ever get a chance to look at this? On 19/09/2016 08:40, Chris Withers wrote: > On 16/09/2016 15:29, Tom Lane wrote: >> Chris Withers <chris@simplistix.co.uk> writes: >>> On 16/09/2016 14:54, Igor Neyman wrote: >>>> So, what is the value for "end ts", when the record is inserted (the >>>> range just started)? >> >>> It's open ended, so the period is [start_ts, ) >> >> I've not looked at the GiST range opclass, but I would not be >> surprised if >> having lots of those is pretty destructive to the index's ability to be >> selective about && searches. > > If that's so, that's a little disappointing... > (I'd have thought the special case end value (open ended) and the ending > type (inclusive/exclusive) would just be sentinel values) > > How would I verify your suspicions? cheers, Chris