Discussion: advice sought - general approaches to optimizing queries around "event streams"
From:
Jonathan Vanasco
Date:
I have a growing database with millions of rows that track resources against an event stream.

I have a few handfuls of queries that interact with this stream in a variety of ways, and I have managed to drop things down from 70s to 3.5s on full scans and offer .05s partial scans.

No matter how I restructure queries, I can't seem to get around a few bottlenecks, and I wanted to know if there were any tips/tricks from the community on how to approach them.

A simple form of my database would be:

    -- 1k of
    create table stream (
        id int not null primary key
    );

    -- 1MM of
    create table resource (
        id int not null primary key,
        col_a bool,
        col_b bool,
        col_c text
    );

    -- 10MM of
    create table streamevent (
        id int not null,
        event_timestamp timestamp not null,
        stream_id int not null references stream(id)
    );

    -- 10MM of
    create table resource_2_stream_event (
        resource_id int not null references resource(id),
        streamevent_id int not null references streamevent(id)
    );

Everything is running off of indexes; there are no seq scans.

I've managed to optimize my queries by avoiding joins against tables and turning the stream interaction into a subquery or CTE. Better performance has come from limiting the number of "stream events" (which are only the timestamp and resource_id off a joined table).

The bottlenecks I've encountered have primarily been:

1. When interacting with a stream, the ordering of event_timestamp and the deduplication of resources become an issue. I've figured out a novel way to work with the most recent events, but distant events are troublesome:

    using no limit, the query takes 3500 ms
    using a limit of 10000, the query takes 320 ms
    using a limit of 1000, the query takes 20 ms

There is a dedicated index on event_timestamp (desc), and it is being used according to the planner. Finding all the records is fine; merging and sorting the aggregate to handle the deduplication of records in a stream seems to be the issue (either with DISTINCT or max + group by).

2. I can't figure out an effective way to search for a term against an entire stream (using a tsquery/gin-based search). I thought about limiting the query by finding matching resources first, then locking it to an event stream, but scanning the entire table for a term takes about 10 seconds on an initial hit; subsequent queries for the same terms end up using the cache and complete within 20 ms.

I get better search performance by calculating the event stream, then searching it for matching documents, but I still have the performance issues related to limiting the window of events.

I didn't include example queries, because I'm more concerned with general approaches and ideas for dealing with large data sets than I am with raw SQL right now.

I'm hoping someone can enlighten me into looking at new ways to solve these problems. I think I've learned more about postgres/sql in the past 48 hours than I have in the past 15 years, and I'm pretty sure that the improvements I need will come from new ways of querying data, rather than from optimizing the current queries.
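For bottleneck 1, one common Postgres idiom is to let DISTINCT ON pick a single newest event per resource, backed by a composite index. This is a sketch against the simplified schema above, not the poster's actual query; the index name and stream id are made up:

```sql
-- hypothetical composite index: lets the planner walk one stream's
-- events already ordered by timestamp
CREATE INDEX idx_streamevent_stream_ts
    ON streamevent (stream_id, event_timestamp DESC);

-- newest event per resource within a single stream
SELECT DISTINCT ON (r2s.resource_id)
       r2s.resource_id,
       se.event_timestamp
  FROM streamevent se
  JOIN resource_2_stream_event r2s
    ON r2s.streamevent_id = se.id
 WHERE se.stream_id = 123          -- example stream id
 ORDER BY r2s.resource_id, se.event_timestamp DESC;
```

Note that DISTINCT ON requires the leading ORDER BY key to be resource_id; to present the deduplicated rows newest-first, wrap this in an outer SELECT with ORDER BY event_timestamp DESC.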
Re: advice sought - general approaches to optimizing queries around "event streams"
From:
Gavin Flower
Date:
On 27/09/14 09:02, Jonathan Vanasco wrote:
> [snip]

Minor point: when specifying PRIMARY KEY, you don't need to also put NOT NULL (this should make no change to performance).

I notice that the 'id' of 'streamevent' is not marked as a PRIMARY KEY, so it will not have an index associated with it - hence referencing it as a foreign key might be slower than expected.

Cheers,
Gavin
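Gavin's observation translates directly into DDL; a sketch against the schema above (index names are made up):

```sql
-- give streamevent.id a primary key, which also creates the index
-- that foreign-key lookups against it can use
ALTER TABLE streamevent ADD PRIMARY KEY (id);

-- the join table usually wants indexes in both directions as well
CREATE INDEX idx_r2s_resource ON resource_2_stream_event (resource_id);
CREATE INDEX idx_r2s_event    ON resource_2_stream_event (streamevent_id);
```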
We have a similar timeseries database approaching 500m records.
We partition the main tables (much like your events) into one year subsets, with a clustered index on timestamp for all but the live year.
https://blog.engineyard.com/2013/scaling-postgresql-performance-table-partitioning
http://www.postgresql.org/docs/9.3/static/ddl-partitioning.html
http://www.postgresql.org/docs/9.3/static/sql-cluster.html
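In the 9.3-era docs linked above, partitioning is done manually with table inheritance plus CHECK constraints. A minimal yearly-partition sketch for a table like streamevent (table names and dates are illustrative):

```sql
-- one child table per year; the CHECK constraint lets the planner
-- skip partitions when constraint_exclusion = partition
CREATE TABLE streamevent_2013 (
    CHECK (event_timestamp >= DATE '2013-01-01'
       AND event_timestamp <  DATE '2014-01-01')
) INHERITS (streamevent);

CREATE INDEX idx_streamevent_2013_ts
    ON streamevent_2013 (event_timestamp DESC);

-- physically order a closed (non-live) year by its timestamp index
CLUSTER streamevent_2013 USING idx_streamevent_2013_ts;
```

Inserts then need to be routed to the correct child table, typically via a trigger on the parent or by inserting into the children directly.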
As discussed here previously, you can also improve performance using hardware - SSDs vs spindles. Also note that tablespaces, with indexes on your faster drives and data on slower ones, can improve performance.
http://www.postgresql.org/docs/9.3/static/manage-ag-tablespaces.html
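As a sketch of that split (the tablespace name and filesystem path are placeholders):

```sql
-- a tablespace on the fast (SSD) volume; bulk table data stays on
-- the default tablespace on the slower drives
CREATE TABLESPACE fast_ssd LOCATION '/mnt/ssd/pg_tblspc';

CREATE INDEX idx_streamevent_ts
    ON streamevent (event_timestamp DESC)
    TABLESPACE fast_ssd;
```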
Also make sure your db server is optimised for the database size and hardware configuration - perhaps allow fewer concurrent users but more resources per user, or see what pgtune recommends.
This should help in terms of underlying db efficiency and performance, rather than tweaking your actual queries.
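For illustration, these are the kinds of postgresql.conf knobs pgtune adjusts; the values below are placeholders showing the shape of such tuning, not recommendations for any particular machine:

```
# postgresql.conf fragment - illustrative values only
max_connections      = 40     # fewer concurrent users, more resources each
shared_buffers       = 4GB    # a common starting point is ~25% of RAM
effective_cache_size = 12GB   # what the planner assumes the OS caches
work_mem             = 64MB   # per sort/hash; helps large ORDER BY / GROUP BY
```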
Brent Wood
Programme leader: Environmental Information Delivery
NIWA
DDI: +64 (4) 3860529
________________________________________
From: pgsql-general-owner@postgresql.org <pgsql-general-owner@postgresql.org> on behalf of Jonathan Vanasco <postgres@2xlp.com>
Sent: Saturday, September 27, 2014 9:02 AM
To: PostgreSQL general
Subject: [GENERAL] advice sought - general approaches to optimizing queries around "event streams"