Re: [GENERAL] Large data and slow queries

From: Samuel Williams
Subject: Re: [GENERAL] Large data and slow queries
Date:
Msg-id: CAHkN8V_v9b6sWPsgsseKOuQxGoCg6ert3NBG_iN56y+m8MQcuw@mail.gmail.com
In reply to: Re: [GENERAL] Large data and slow queries  (Alban Hertroys <haramrae@gmail.com>)
Responses: Re: [GENERAL] Large data and slow queries  (Alban Hertroys <haramrae@gmail.com>)
List: pgsql-general
Thanks Alban, I appreciate your ideas and thoughts.

I'm a little reluctant to go down the partitioning route, as I think
we'll probably end up sharding horizontally using Citus in the near
future, and naive Postgres partitioning may hamper that effort.

It's interesting that you mention using multiple indexes. In
MySQL/MariaDB and derivatives, I've never seen the query planner
consider more than one index per table. So is it possible that
Postgres will use multiple indexes when that saves time? Or do you
mean doing something like manually joining the data and leveraging
the different indexes explicitly?
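
If Postgres can combine separate indexes, I'd expect the plan to show
a BitmapAnd over two index scans. A sketch of what I'm imagining (the
single-column indexes below are hypothetical, not our actual schema):

    -- hypothetical indexes, for illustration only
    CREATE INDEX idx_user_event_created_at ON user_event (created_at);
    CREATE INDEX idx_user_event_lat_long ON user_event (latitude, longitude);

    -- if the planner combines them, EXPLAIN should show a BitmapAnd:
    EXPLAIN SELECT user_id
      FROM user_event
     WHERE created_at >= '2016-04-19 01:23:55'
       AND latitude BETWEEN -37.03 AND -36.67
       AND longitude BETWEEN 174.63 AND 175.08;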

The correlation between user_id and location... well, it's somewhat
temporal in nature.



On 19 April 2017 at 22:50, Alban Hertroys <haramrae@gmail.com> wrote:
>
>> On 19 Apr 2017, at 6:01, Samuel Williams <space.ship.traveller@gmail.com> wrote:
>>
>> Hi.
>>
>> We have 400,000,000 records in a table (soon to be 800,000,000), here
>> is the schema (\d+)
>>
>> https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121
>>
>> We want the following kinds of query to be fast:
>>
>> SELECT DISTINCT "user_event"."user_id" FROM "user_event" WHERE
>> "user_event"."what" IN ('poll', 'location_change',
>> 'midnight_location') AND ("user_event"."created_at" >= '2016-04-19
>> 01:23:55') AND (latitude > -37.03079375089291 AND latitude <
>> -36.67086424910709 AND longitude > 174.6307139779924 AND longitude <
>> 175.0805140220076);
>>
>> We have a btree index and it appears to be working. However, it's
>> still pretty slow.
>>
>> EXPLAIN ANALYZE gives the following:
>> https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121#gistcomment-2065314
>>
>> I'm thinking that I need to do the following to help:
>>
>> CLUSTER user_event ON index_user_event_for_visits_3 followed by
>> analyze... Our data is mostly time series but sometimes we get some
>> dumps with historical records.
>
> It seems to me that a large part of the problem is that the server
> has to scan all index entries from that date and within those
> location bounds to find that the distinct user id's in that set are
> about 114,000 out of 1.7M rows matching the selection criteria. If
> it could stop at the first location for each user, it would have to
> scan less than a 10th of the index entries that it has to scan now...
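
That early-exit behaviour sounds like the "loose index scan" that the
PostgreSQL wiki describes emulating with a recursive CTE. A rough
sketch, assuming an index leading on user_id (the what / created_at /
location filters would have to be repeated in both branches):

    WITH RECURSIVE t AS (
        (SELECT user_id FROM user_event ORDER BY user_id LIMIT 1)
        UNION ALL
        SELECT (SELECT u.user_id FROM user_event u
                 WHERE u.user_id > t.user_id
                 ORDER BY u.user_id LIMIT 1)
          FROM t
         WHERE t.user_id IS NOT NULL
    )
    SELECT user_id FROM t WHERE user_id IS NOT NULL;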
>
> How high is the correlation between user id and location? That can probably be leveraged…
> Perhaps you'd get better performance if you'd use multiple indices
> instead of stuffing everything into a single purpose-specific one? I
> would suggest one on (user_id, latitude, longitude) or (latitude,
> longitude, user_id) and one on (created_at, user_id), or perhaps
> (created_at, latitude, longitude). That approach could also reduce
> the number of indices you have on that table, as well as their
> sizes, making it all fit into cache a little easier. Then again,
> additional operations will be required to combine them.
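
For concreteness, I read that suggestion as something like this (the
index names are mine):

    CREATE INDEX index_user_event_user_location
        ON user_event (user_id, latitude, longitude);
    CREATE INDEX index_user_event_created_at_user
        ON user_event (created_at, user_id);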
>
> For a different approach: it may be possible to enrich your data
> with something that is easy to index and query, with a high
> correlation to (latitude, longitude). That could also be used to
> partition over. Country seems a good candidate here, unless all your
> data-points are in New Zealand like the above?
> Then again, some countries are a lot larger, with a higher
> population, than others. And populations can be highly concentrated
> (Delhi, Moscow, to name a few).
> Another option is to divide the location space up into squares of a
> fixed size, with a partition for each square. About 80% of those
> squares are unpopulated though, being at sea.
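
A fixed grid could be as simple as a derived column; a sketch
assuming one-degree squares (the column and index names are made up):

    ALTER TABLE user_event ADD COLUMN grid_cell int;

    -- map each (latitude, longitude) onto a one-degree cell id
    UPDATE user_event
       SET grid_cell = (floor(latitude)::int + 90) * 360
                     + (floor(longitude)::int + 180);

    CREATE INDEX index_user_event_grid ON user_event (grid_cell);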
>
> Created_at is a very good candidate for partitioning too, especially
> if you don't intend to keep data older than a certain age.
> Truncating or dropping a partition that you no longer need is quick
> and easy.
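
On 9.x that would presumably be inheritance-style partitioning; a
rough sketch of a single monthly child table:

    -- one month of rows, constrained for constraint exclusion
    CREATE TABLE user_event_2016_04 (
        CHECK (created_at >= '2016-04-01' AND created_at < '2016-05-01')
    ) INHERITS (user_event);

    -- retiring old data is then just:
    DROP TABLE user_event_2016_04;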
>
> With data-sets this large, I'd think you would want to partition on
> multiple dimensions, creating a matrix of partitions under a single
> master table. I don't think PG has a simple way of doing that (yet)
> though; perhaps it's possible by abusing multiple levels of
> inheritance, but that sounds like a bad idea.
>
> And of course, create your partitions sufficiently coarse to prevent
> overburdening the system tables, which would slow down the query
> planner.
>
> Hopefully there's something useful in my ramblings!
>
> Alban Hertroys
> --
> If you can't see the forest for the trees,
> cut the trees and you'll find there is no forest.
>

