advice sought - general approaches to optimizing queries around "event streams"

From: Jonathan Vanasco
Date: 27 September 2014, 09:02
I have a growing database with millions of rows that track resources against an event stream.

I have a few handfuls of queries that interact with this stream in a variety of ways, and I have managed to drop things down from 70s to 3.5s on full scans, with partial scans at 0.05s.

No matter how I restructure queries, I can't seem to get around a few bottlenecks, and I wanted to know if there were any tips/tricks from the community on how to approach them.

A simplified form of my database would be:

    -- ~1k rows
    create table stream (
        id int not null primary key
    );

    -- ~1MM rows
    create table resource (
        id int not null primary key,
        col_a bool,
        col_b bool,
        col_c text
    );

    -- ~10MM rows
    create table streamevent (
        id int not null,
        event_timestamp timestamp not null,
        stream_id int not null references stream(id)
    );

    -- ~10MM rows
    create table resource_2_stream_event (
        resource_id int not null references resource(id),
        streamevent_id int not null references streamevent(id)
    );

Everything is running off of indexes; there are no seq scans.
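
(For reference, an illustrative check against the simplified schema above; the stream id and limit are placeholders, not my actual query.)

    -- illustrative: confirm the planner uses the timestamp index
    EXPLAIN ANALYZE
    SELECT id, event_timestamp
    FROM streamevent
    WHERE stream_id = 1
    ORDER BY event_timestamp DESC
    LIMIT 1000;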

I've managed to optimize my queries by avoiding joins against tables and turning the stream interaction into a subquery or CTE. Better performance has come from limiting the number of "stream events" (which are only the timestamp and resource_id off a joined table).
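
Roughly the shape that takes (an illustrative sketch only; the stream id and LIMIT are placeholders):

    -- sketch: pull a bounded window of (resource_id, timestamp) pairs
    -- for one stream without touching the resource table itself
    WITH recent_events AS (
        SELECT se.id, se.event_timestamp
        FROM streamevent se
        WHERE se.stream_id = 1
        ORDER BY se.event_timestamp DESC
        LIMIT 1000
    )
    SELECT r2s.resource_id, re.event_timestamp
    FROM recent_events re
    JOIN resource_2_stream_event r2s
        ON r2s.streamevent_id = re.id;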

The bottlenecks I've encountered have primarily been:

1.    When interacting with a stream, ordering by event_timestamp and deduplicating resources becomes an issue.
    I've figured out a novel way to work with the most recent events, but distant events are troublesome:

    using no limit, the query takes 3500ms
    using a limit of 10000, the query takes 320ms
    using a limit of 1000, the query takes 20ms

    There is a dedicated index on event_timestamp (desc), and according to the planner it is being used.
    Finding all the records is fine; merging and sorting the aggregate to handle the deduplication of records in a stream seems to be the issue (either with DISTINCT or max+group_by; see the first sketch after this list).


2.    I can't figure out an effective way to search for a term against an entire stream (using a tsquery/gin-based search).

    I thought about limiting the query by finding matching resources first, then locking it to an event stream, but:
        - scanning the entire table for a term takes about 10 seconds on an initial hit; subsequent queries for the same terms end up using the cache and complete within 20ms.

    I get better search performance by calculating the event stream, then searching it for matching documents (the second sketch after this list), but I still have the performance issues related to limiting the window of events.
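
For illustration only (not my actual queries), the two shapes referenced above. First, deduplication via DISTINCT ON, one alternative Postgres offers to max()+group_by for keeping the newest event per resource:

    -- sketch: newest event per resource within one stream
    SELECT DISTINCT ON (r2s.resource_id)
           r2s.resource_id, se.event_timestamp
    FROM streamevent se
    JOIN resource_2_stream_event r2s
        ON r2s.streamevent_id = se.id
    WHERE se.stream_id = 1
    ORDER BY r2s.resource_id, se.event_timestamp DESC;

Second, the two-stage search (compute the stream's resources, then search within them); this assumes a GIN index on to_tsvector('english', col_c), which is not part of the simplified schema above:

    -- assumed index (not in the schema above):
    --   CREATE INDEX resource_fts_idx ON resource
    --       USING gin (to_tsvector('english', col_c));

    -- sketch: restrict to one stream's resources, then text-search them
    WITH stream_resources AS (
        SELECT DISTINCT r2s.resource_id
        FROM streamevent se
        JOIN resource_2_stream_event r2s
            ON r2s.streamevent_id = se.id
        WHERE se.stream_id = 1
    )
    SELECT r.id
    FROM resource r
    JOIN stream_resources sr ON sr.resource_id = r.id
    WHERE to_tsvector('english', r.col_c)
          @@ to_tsquery('english', 'term');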

I didn't include example queries because I'm more concerned with the general approaches and ideas behind dealing with large data sets than I am with raw SQL right now.

I'm hoping someone can enlighten me into looking at new ways to solve these problems. I think I've learned more about postgres/sql in the past 48 hours than I have in the past 15 years, and I'm pretty sure that the improvements I need will come from new ways of querying data, rather than optimizing the current queries.

Re: advice sought - general approaches to optimizing queries around "event streams"

From: Gavin Flower
On 27/09/14 09:02, Jonathan Vanasco wrote:
> [...]
> A simplified form of my database would be:
>
>     -- ~1k rows
>     create table stream (
>         id int not null primary key
>     );
>
>     -- ~1MM rows
>     create table resource (
>         id int not null primary key,
>         col_a bool,
>         col_b bool,
>         col_c text
>     );
>
>     -- ~10MM rows
>     create table streamevent (
>         id int not null,
>         event_timestamp timestamp not null,
>         stream_id int not null references stream(id)
>     );
>
>     -- ~10MM rows
>     create table resource_2_stream_event (
>         resource_id int not null references resource(id),
>         streamevent_id int not null references streamevent(id)
>     );
> [...]
Minor point: when specifying PRIMARY KEY, you don't need to also put NOT NULL (this should make no difference to performance).

I notice that the 'id' of 'streamevent' is not marked as a PRIMARY KEY, so it will not have an index associated with it; hence referencing it as a foreign key might be slower than expected.
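
A sketch of the sort of fix I mean (untested; note also that Postgres does not automatically index the referencing side of a foreign key, so the join table may want explicit indexes too):

    -- give streamevent.id an index (and uniqueness) via a primary key
    ALTER TABLE streamevent ADD PRIMARY KEY (id);

    -- the referencing columns are not indexed automatically
    CREATE INDEX ON resource_2_stream_event (streamevent_id);
    CREATE INDEX ON resource_2_stream_event (resource_id);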


Cheers,
Gavin


Re: advice sought - general approaches to optimizing queries around "event streams"

From: Brent Wood

We have a similar timeseries database approaching 500m records.

We partition the main tables (much like your events) into one-year subsets, with a clustered index on timestamp for all but the live year.

https://blog.engineyard.com/2013/scaling-postgresql-performance-table-partitioning
http://www.postgresql.org/docs/9.3/static/ddl-partitioning.html
http://www.postgresql.org/docs/9.3/static/sql-cluster.html
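
By way of illustration only (the inheritance-style partitioning from the 9.3 docs above; table names and date ranges are placeholders):

    -- one child table per closed-out year, with a CHECK constraint
    -- so constraint_exclusion can skip irrelevant partitions
    CREATE TABLE streamevent_2013 (
        CHECK (event_timestamp >= '2013-01-01'
           AND event_timestamp <  '2014-01-01')
    ) INHERITS (streamevent);

    CREATE INDEX streamevent_2013_ts_idx
        ON streamevent_2013 (event_timestamp DESC);

    -- physically order the closed-out year by its timestamp index
    CLUSTER streamevent_2013 USING streamevent_2013_ts_idx;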

As discussed here previously, you can also improve performance using hardware: SSDs vs spindles. Also note that tablespaces, with indexes on your faster drives and data on slower ones, can improve performance.

http://www.postgresql.org/docs/9.3/static/manage-ag-tablespaces.html
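
For example (a sketch only; the path and names are placeholders):

    -- put indexes on the faster disk via a dedicated tablespace
    CREATE TABLESPACE fastdisk LOCATION '/mnt/ssd/pg_tblspc';
    CREATE INDEX streamevent_ts_idx
        ON streamevent (event_timestamp DESC)
        TABLESPACE fastdisk;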

Also make sure your db server is optimised for the database size and hardware configuration: perhaps allow fewer concurrent users but more resources per user, or see what pgtune recommends.
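
Purely as a sketch of the direction (these numbers are placeholders; pgtune's suggestions for your actual RAM and workload should win):

    # postgresql.conf sketch: fewer connections, more memory each
    max_connections = 40
    shared_buffers = 4GB
    work_mem = 64MB
    effective_cache_size = 12GB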

This should help in terms of underlying db efficiency, rather than tweaking your actual queries.

Brent Wood
Principal Technician - GIS and Spatial Data Management
Programme Leader - Environmental Information Delivery
+64-4-386-0529 | 301 Evans Bay Parade, Greta Point, Wellington | www.niwa.co.nz
NIWA