Re: [GENERAL] Large data and slow queries

Поиск

Список

Период

Сортировка

От	Andrew Staller
Тема	Re: [GENERAL] Large data and slow queries
Дата	22 апреля 2017 г. 02:41:25
Msg-id	CAEsM1FuNwJNUBUBk5pwrQqP_w5D7MWJ+YaR-mRG-r=0Q-Z-y_Q@mail.gmail.com обсуждение исходный текст
Ответ на	Re: [GENERAL] Large data and slow queries (Samuel Williams <space.ship.traveller@gmail.com>)
Ответы	Re: [GENERAL] Large data and slow queries
Список	pgsql-general

Дерево обсуждения

Samuel,

Short answer to your questions: (1) TimescaleDB and CitusDB are focusing on solving different problems, and (2) TimescaleDB is an Apache 2-licensed extension to run in your Postgres database, not a fork or different system.

Longer answer to your first question:

From what we've read and the conversations we've had with Citus (awesome folks, btw), they're primarily solving a different problem -- OLAP queries and distributed transactions -- while we are focused on time-series data analysis. As such, we haven't benchmarked against Citus and if we were to, it would require some unrealistic finagling in order to make it an apples-to-apples comparison.

As an example, their partitioning is based on a single primary key, while all writes in their clustered version go to a cluster master (although their hosted version of Citus MX changes this a bit). We perform two-dimensional partitioning using your primary key and time, with the partitions automatically created/closed on time intervals and data volume in order to keep the most recent partitioned chunks of data (and their B-trees) in memory to support fast ingest for time-series data. (***) We have only released our single-node version, but the clustering will allow you to write directly to any node. Citus has also decided only to support the SQL queries they are optimized for, while we've made the decision to support "full SQL" (even if some queries might not be fast).

*** We blogged yesterday about TimescaleDB's partitioning design choices in more depth, if you are interested:

https://blog.timescale.com/time-series-data-why-and-how-to-use-a-relational-database-instead-of-nosql-d0cd6975e87c

On Thu, Apr 20, 2017 at 5:43 PM, Samuel Williams <space.ship.traveller@gmail.com> wrote:

Andrew, how would timescaledb compare to citus - and is timescaledb an
extension to postgres or is it an entirely separate system?

On 21 April 2017 at 02:44, Andrew Staller <andrew@timescale.com> wrote:
> Awesome thread.
>
> Samuel,
>
> Just wanted you to be aware of the work we're doing at TimescaleDB
> (http://www.timescale.com/), a time-series database extension for
> PostgreSQL.
>
> Some of how we might help you:
> - automatic partitioning by space (primary key - like country_id, for
> instance) and time. This creates "chunks" of your data, right-sized by
> volume and time constraints (which makes inserts fast at much greater scale
> than Vanilla PostgreSQL - not sure if this is a problem for you)
> - this will also help if your queries are selective on time and country_id
> (per this example)
> - the partitioning by time allows you to DROP old chunks without the need
> for vacuums
>
> On Thu, Apr 20, 2017 at 8:30 AM, Vick Khera <vivek@khera.org> wrote:
>>
>> I'm curious why you have so many partial indexes. Are you trying to make
>> custom indexes per query? It seems to me you might want to consider making
>> the indexes general, and remove the redundant ones (that have the same
>> prefix list of indexed fields).
>>
>> Secondly your table is 102Gb. Clearly there's a lot of data here. How many
>> rows does that take? I would further suggest that you partition this table
>> such that there are no more than about 10 million rows per partition (I've
>> done this by using a id % 100 computation). Maybe in your case it makes
>> sense to partition it based on the "what" field, because it appears you are
>> trying to do that with your partial indexes already.
>>
>> On Wed, Apr 19, 2017 at 10:11 PM, Samuel Williams
>> <space.ship.traveller@gmail.com> wrote:
>>>
>>> Okay, so after changing longitude/latitude to float4, and
>>> re-organizing the table a bit, I got the query down from about 8
>>> minutes to 40 seconds.
>>>
>>> The details are in the gist comments:
>>> https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121
>>>
>>> Now, just need to get performance another 2 orders of magnitude
>>> better. Can we make the index a bit more coarse grained, perhaps
>>> convert long/lat to integers or something, use a hilbert index, use a
>>> postgis index... ideas?
>>>
>>>
>>> --
>>> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
>>> To make changes to your subscription:
>>> http://www.postgresql.org/mailpref/pgsql-general
>>
>>
>
>
>
> --
> TimescaleDB | Growth & Developer Evangelism
> c: 908.581.9509
>
> 335 Madison Ave.
> New York, NY 10017
> www.timescale.com
> https://github.com/timescale/timescaledb

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general