Thread: Recommendations for partitioning?


Recommendations for partitioning?

From
Dave Johansen
Date:
I'm managing a database that is adding about 10-20M records per day to a table, and time is a core part of most queries, so I've been looking into whether I need to start partitioning based on the time column. I've found these general guidelines:

Don't use more than about 50 partitions ( http://www.postgresql.org/message-id/17147.1271696670@sss.pgh.pa.us )
The only data I found fell in line with what you'd expect (i.e. partitioning speeds up selects but slows down inserts/updates: http://www.if-not-true-then-false.com/2009/performance-testing-between-partitioned-and-non-partitioned-postgresql-tables-part-3/ )

So I was thinking of partitioning by month to keep the number of partitions low, which would mean about 0.5G records in each partition. Does that seem like a reasonable number of records per partition? Is there anything else that I should consider or be aware of?
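Concretely, the monthly scheme I have in mind would look roughly like this under the inheritance-based partitioning of this era of PostgreSQL (all table and column names are only illustrative):

```sql
-- Parent table; the child tables hold the actual rows.
CREATE TABLE events (
    id        bigserial,
    recorded  timestamptz NOT NULL,
    payload   bytea
);

-- One child per month, with a CHECK constraint so the planner
-- can use constraint exclusion to skip irrelevant partitions.
CREATE TABLE events_2013_12 (
    CHECK (recorded >= '2013-12-01' AND recorded < '2014-01-01')
) INHERITS (events);

CREATE INDEX events_2013_12_recorded_idx ON events_2013_12 (recorded);

-- Routing trigger on the parent, for writers that don't know
-- which partition a row belongs in.
CREATE OR REPLACE FUNCTION events_insert_router() RETURNS trigger AS $$
BEGIN
    IF NEW.recorded >= '2013-12-01' AND NEW.recorded < '2014-01-01' THEN
        INSERT INTO events_2013_12 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'no partition for timestamp %', NEW.recorded;
    END IF;
    RETURN NULL;  -- the row has already been stored in the child
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER events_insert_trg
    BEFORE INSERT ON events
    FOR EACH ROW EXECUTE PROCEDURE events_insert_router();
```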

Thanks,
Dave

Re: Recommendations for partitioning?

From
desmodemone
Date:
Hi Dave,
Regarding the number of partitions: I haven't had much trouble with hundreds of partitions (e.g. 360 daily partitions in a year).
Moreover, you can bypass the trigger overhead with a direct insert into the partition, which also allows parallel inserts without firing the trigger so heavily. Remember to enable the check constraints.
In my opinion it's better to aim for fewer rows per partition. What is the average row length in bytes? If you ever have to rebuild indexes and the partition is too big, maintenance_work_mem may not be enough and the sort will spill to disk.
I think you should also consider spreading the partitions across different tablespaces, so as to distribute the I/O over different storage types, and managing the indexes with different strategies (the searches on "historical" partitions will likely differ from those on "live" partitions).
Another strategy is not only to create partitions, but to shard the data across multiple nodes.
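For example, a direct insert just names the child table, and with the check constraints in place the planner can still prune partitions on reads (names are only illustrative):

```sql
-- Bulk loaders that know the target month can skip the parent's
-- routing trigger entirely by inserting into the child directly:
INSERT INTO events_2013_12 (recorded, payload)
VALUES ('2013-12-05 10:00:00', decode('deadbeef', 'hex'));

-- For partition pruning on time-filtered reads, constraint
-- exclusion must be enabled and the CHECK constraints valid:
SET constraint_exclusion = partition;

SELECT count(*)
FROM events
WHERE recorded >= '2013-12-01' AND recorded < '2014-01-01';
```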


Bye

Mat


2013/12/5 Dave Johansen <davejohansen@gmail.com>
I'm managing a database that is adding about 10-20M records per day to a table, and time is a core part of most queries, so I've been looking into whether I need to start partitioning based on the time column. I've found these general guidelines:

Don't use more than about 50 partitions ( http://www.postgresql.org/message-id/17147.1271696670@sss.pgh.pa.us )
The only data I found fell in line with what you'd expect (i.e. partitioning speeds up selects but slows down inserts/updates: http://www.if-not-true-then-false.com/2009/performance-testing-between-partitioned-and-non-partitioned-postgresql-tables-part-3/ )

So I was thinking of partitioning by month to keep the number of partitions low, which would mean about 0.5G records in each partition. Does that seem like a reasonable number of records per partition? Is there anything else that I should consider or be aware of?

Thanks,
Dave

Re: Recommendations for partitioning?

From
Jeff Janes
Date:
On Thu, Dec 5, 2013 at 7:36 AM, Dave Johansen <davejohansen@gmail.com> wrote:
I'm managing a database that is adding about 10-20M records per day to a table and time is a core part of most queries,


What is the nature of how the time column is used in the queries?   Depending on how it is used, you might not get much improvement at all, or you might get N fold improvement, or you might find that re-designing your indexes could get you the same query improvement that partitioning would, but with less headache.
 
so I've been looking into seeing if I need to start using partitioning based on the time column and I've found these general guidelines:

Don't use more than about 50 partitions ( http://www.postgresql.org/message-id/17147.1271696670@sss.pgh.pa.us )

Using triggers slows INSERTs down by a lot (unless they were already slow due to the need to hit disk to maintain the indexes or something like that).  Are you sure you can handle that slow down, given your insertion rate?  You could get the best of both worlds by having your bulk loaders target the correct partition directly, but also have the triggers on the parent table for any programs that don't get the message.
 

The only data I found fell in line with what you'd expect (i.e. speeds up selects but slows down inserts/updates: http://www.if-not-true-then-false.com/2009/performance-testing-between-partitioned-and-non-partitioned-postgresql-tables-part-3/ )


One of the big benefits of partitioning can be to speed up insertions tremendously, by keeping the hot part of the indices that need to be maintained upon insertion together in shared_buffers.
 

So I was thinking of partitioning by month to keep the number of partitions low, which would mean about 0.5G records in each partition. Does that seem like a reasonable number of records per partition? Is there anything else that I should consider or be aware of?

How will data be expired?  How does the size of one of your intended partitions compare to your RAM and shared_buffers?
 
Cheers,

Jeff

Re: Recommendations for partitioning?

From
Scott Marlowe
Date:
On Sat, Dec 7, 2013 at 10:09 AM, desmodemone <desmodemone@gmail.com> wrote:
> Hi Dave,
> Regarding the number of partitions: I haven't had much trouble with
> hundreds of partitions (e.g. 360 daily partitions in a year).
> Moreover, you can bypass the trigger overhead with a direct insert into
> the partition, which also allows parallel inserts without firing the
> trigger so heavily. Remember to enable the check constraints.
> In my opinion it's better to aim for fewer rows per partition. What is
> the average row length in bytes? If you ever have to rebuild indexes
> and the partition is too big, maintenance_work_mem may not be enough
> and the sort will spill to disk.
> I think you should also consider spreading the partitions across
> different tablespaces, so as to distribute the I/O over different
> storage types, and managing the indexes with different strategies (the
> searches on "historical" partitions will likely differ from those on
> "live" partitions).
> Another strategy is not only to create partitions, but to shard the
> data across multiple nodes.

I agree on the number of partitions. I've run a stats db with daily
partitions holding about two years of data with no real problems due
to the high number of partitions. Somewhere around 1,000 partitions,
things start to get slower.

I'll add that you can use asymmetric partitioning if you tend to do a
lot of fine-grained queries on recent data and bigger roll-ups on
older data. I.e. partition by month, except for the last 30 days,
which you partition by day. Then at the end of the month, roll all
the days into a month partition and delete the day partitions.


Re: Recommendations for partitioning?

From
Dave Johansen
Date:
Sorry for the delayed response. We had some hardware/configuration issues that appear to be solved now, so we're starting to actually play with modifying the database.

On Sat, Dec 7, 2013 at 1:29 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Thu, Dec 5, 2013 at 7:36 AM, Dave Johansen <davejohansen@gmail.com> wrote:
I'm managing a database that is adding about 10-20M records per day to a table and time is a core part of most queries,


What is the nature of how the time column is used in the queries?   Depending on how it is used, you might not get much improvement at all, or you might get N fold improvement, or you might find that re-designing your indexes could get you the same query improvement that partitioning would, but with less headache.

The time column is usually used to calculate statistics, find/analyze duplicates, analyze data contents, etc. over a specific time window. So there will be queries with GROUP BY and window functions, with a specific time filter in the WHERE clause.
 

so I've been looking into seeing if I need to start using partitioning based on the time column and I've found these general guidelines:

Don't use more than about 50 partitions ( http://www.postgresql.org/message-id/17147.1271696670@sss.pgh.pa.us )

Using triggers slows INSERTs down by a lot (unless they were already slow due to the need to hit disk to maintain the indexes or something like that).  Are you sure you can handle that slow down, given your insertion rate?  You could get the best of both worlds by having your bulk loaders target the correct partition directly, but also have the triggers on the parent table for any programs that don't get the message.

Inserting directly into the correct partition whenever possible and leaving the trigger on the parent table seems like the best option.
 

The only data I found fell in line with what you'd expect (i.e. speeds up selects but slows down inserts/updates: http://www.if-not-true-then-false.com/2009/performance-testing-between-partitioned-and-non-partitioned-postgresql-tables-part-3/ )


One of the big benefits of partitioning can be to speed up insertions tremendously, by keeping the hot part of the indices that need to be maintained upon insertion together in shared_buffers.

We insert lots of new data, but rarely modify existing data once it's in the database, so it sounds like this would be a big benefit for us.
 
 
So I was thinking of partitioning by month to keep the number of partitions low, which would mean about 0.5G records in each partition. Does that seem like a reasonable number of records per partition? Is there anything else that I should consider or be aware of?

How will data be expired?  How does the size of one of your intended partitions compare to your RAM and shared_buffers?

We add about 10-20 million records per day to the table, each about 200 bytes in size (there's a bytea in there; that's the average size including it), and the machine has 64 GB of RAM.
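To put rough numbers on that (taking the midpoint of our insert rate; these are back-of-the-envelope figures, ignoring index and tuple overhead):

```python
# Rough sizing from the figures in the message above.
rows_per_day = 15_000_000   # midpoint of the stated 10-20M records/day
row_bytes = 200             # stated average row size, bytea included

monthly_rows = rows_per_day * 30
monthly_bytes = monthly_rows * row_bytes
daily_bytes = rows_per_day * row_bytes

print(f"monthly partition: {monthly_rows:,} rows, ~{monthly_bytes / 1e9:.0f} GB")
print(f"daily partition:   ~{daily_bytes / 1e9:.1f} GB")
# A ~90 GB monthly partition exceeds the 64 GB of RAM, while ~3 GB
# daily partitions would fit in memory far more comfortably.
```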
 
 
Cheers,

Jeff

On Sat, Dec 7, 2013 at 1:37 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> I'll add that you can use asymmetric partitioning if you tend to do a
> lot of fine-grained queries on recent data and bigger roll-ups on
> older data. I.e. partition by month, except for the last 30 days,
> which you partition by day. Then at the end of the month, roll all
> the days into a month partition and delete the day partitions.

This sounds like a great solution for us. Is there some trick to roll the records from one partition to another? Or is the only way just a SELECT INTO followed by a DELETE?

Thanks,
Dave

Re: Recommendations for partitioning?

From
Scott Marlowe
Date:
On Thu, Dec 19, 2013 at 9:53 AM, Dave Johansen <davejohansen@gmail.com> wrote:
>>
> On Sat, Dec 7, 2013 at 1:37 PM, Scott Marlowe <scott.marlowe@gmail.com>
> wrote:
>> I'll add that you can use asymmetric partitioning if you tend to do a
>> lot of fine-grained queries on recent data and bigger roll-ups on
>> older data. I.e. partition by month, except for the last 30 days,
>> which you partition by day. Then at the end of the month, roll all
>> the days into a month partition and delete the day partitions.
>
> This sounds like a great solution for us. Is there some trick to roll the
> records from one partition to another? Or is the only way just a SELECT INTO
> followed by a DELETE?

That's pretty much it. What I did was to create the new month table
and day tables, alter my triggers to reflect this, then move the data
with an INSERT INTO / SELECT FROM query for each old day partition.
Once their data is moved you can just drop them. Since you changed the
triggers first, those tables are no longer taking input, so it's
usually safe to drop them at that point.


Re: Recommendations for partitioning?

From
Dave Johansen
Date:
On Thu, Dec 19, 2013 at 10:27 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
On Thu, Dec 19, 2013 at 9:53 AM, Dave Johansen <davejohansen@gmail.com> wrote:
>>
> On Sat, Dec 7, 2013 at 1:37 PM, Scott Marlowe <scott.marlowe@gmail.com>
> wrote:
>> I'll add that you can use asymmetric partitioning if you tend to do a
>> lot of fine-grained queries on recent data and bigger roll-ups on
>> older data. I.e. partition by month, except for the last 30 days,
>> which you partition by day. Then at the end of the month, roll all
>> the days into a month partition and delete the day partitions.
>
> This sounds like a great solution for us. Is there some trick to roll the
> records from one partition to another? Or is the only way just a SELECT INTO
> followed by a DELETE?

That's pretty much it. What I did was to create the new month table
and day tables, alter my triggers to reflect this, then move the data
with an INSERT INTO / SELECT FROM query for each old day partition.
Once their data is moved you can just drop them. Since you changed the
triggers first, those tables are no longer taking input, so it's
usually safe to drop them at that point.

It would be nice if there were just a "move" command, but that seems like the type of model that we want, so we'll probably move to that.

On a semi-related note, I was trying to move from the single large table to the partitions, and INSERT INTO ... SELECT * FROM ... WHERE ... was running very slowly (I believe because of the same index issue that we've been running into). So I tried creating a BEFORE INSERT trigger that was working, and used pg_restore on an -Fc dump. The documentation says that triggers are executed as part of a COPY FROM ( http://www.postgresql.org/docs/8.4/static/sql-copy.html ), but it doesn't appear that the trigger was honored, because all of the data was put into the base table and all of the partitions are empty.

Is there a way that I can run pg_restore that will properly honor the trigger? Or do I just have to create a new INSERTs dump?

Thanks,
Dave

Re: Recommendations for partitioning?

From
bricklen
Date:

On Fri, Dec 20, 2013 at 7:52 AM, Dave Johansen <davejohansen@gmail.com> wrote:
It would be nice if there was just a "move command", but that seems like the type of model that we want and we'll probably move to that.

I haven't been following this thread, but this comment caught my eye. Are you after the "NO INHERIT" clause?
http://www.postgresql.org/docs/current/static/sql-altertable.html Search for the "NO INHERIT" clause -- it will allow you to detach a child table from an inherited parent, which you can then archive or copy into another table. Inserting into the rolled-up partition was already mentioned upthread, I see.
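A minimal sketch of that detach-and-archive approach (table names are only illustrative):

```sql
-- Detach the old partition from the parent; its rows immediately
-- stop appearing in queries against the parent table.
ALTER TABLE events_2013_01 NO INHERIT events;

-- The detached table can now be dumped, archived elsewhere, or
-- dropped without touching the live partition set, e.g.:
--   pg_dump -t events_2013_01 dbname > events_2013_01.sql
DROP TABLE events_2013_01;
```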

Re: Recommendations for partitioning?

From
Alvaro Herrera
Date:
Dave Johansen wrote:
> On Thu, Dec 19, 2013 at 10:27 AM, Scott Marlowe <scott.marlowe@gmail.com>wrote:

> > That's pretty much it. What I did was to create the new month table
> > and day tables, alter my triggers to reflect this, then move the
> > data with an INSERT INTO / SELECT FROM query for each old day
> > partition. Once their data is moved you can just drop them. Since
> > you changed the triggers first, those tables are no longer taking
> > input, so it's usually safe to drop them at that point.
>
> It would be nice if there was just a "move command", but that seems like
> the type of model that we want and we'll probably move to that.

Eh.  Why can't you just do something like

WITH moved AS (
    DELETE FROM src WHERE ..
    RETURNING *
) INSERT INTO dst SELECT * FROM moved;

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: Recommendations for partitioning?

From
Shaun Thomas
Date:
On 12/20/2013 09:59 AM, Alvaro Herrera wrote:

> WITH moved AS (
>     DELETE FROM src WHERE ..
>     RETURNING *
> ) INSERT INTO dst SELECT * FROM moved;

I know that's effectively an atomic action, but I'd feel a lot more
comfortable reversing that logic so the delete is based on the results
of the insert.

WITH saved AS (
     INSERT INTO dst
     SELECT * FROM src WHERE ...
     RETURNING *
)
DELETE FROM src
  WHERE ...;

I'll admit yours is cleaner, though. :)

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com



Re: Recommendations for partitioning?

From
Dave Johansen
Date:
On Fri, Dec 20, 2013 at 9:18 AM, Shaun Thomas <sthomas@optionshouse.com> wrote:
On 12/20/2013 09:59 AM, Alvaro Herrera wrote:

WITH moved AS (
        DELETE FROM src WHERE ..
        RETURNING *
) INSERT INTO dst SELECT * FROM moved;

I know that's effectively an atomic action, but I'd feel a lot more comfortable reversing that logic so the delete is based on the results of the insert.

WITH saved AS (
    INSERT INTO dst
    SELECT * FROM src WHERE ...
    RETURNING *
)
DELETE FROM src
 WHERE ...;

I'll admit yours is cleaner, though. :)

That is a good idea. I didn't even realize that such a construct existed, so I'll definitely use that approach.

Re: Recommendations for partitioning?

From
Dave Johansen
Date:
On Fri, Dec 20, 2013 at 8:52 AM, Dave Johansen <davejohansen@gmail.com> wrote:
On Thu, Dec 19, 2013 at 10:27 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
On Thu, Dec 19, 2013 at 9:53 AM, Dave Johansen <davejohansen@gmail.com> wrote:
>>
> On Sat, Dec 7, 2013 at 1:37 PM, Scott Marlowe <scott.marlowe@gmail.com>
> wrote:
>> I'll add that you can use asymmetric partitioning if you tend to do a
>> lot of fine-grained queries on recent data and bigger roll-ups on
>> older data. I.e. partition by month, except for the last 30 days,
>> which you partition by day. Then at the end of the month, roll all
>> the days into a month partition and delete the day partitions.
>
> This sounds like a great solution for us. Is there some trick to roll the
> records from one partition to another? Or is the only way just a SELECT INTO
> followed by a DELETE?

That's pretty much it. What I did was to create the new month table
and day tables, alter my triggers to reflect this, then move the data
with an INSERT INTO / SELECT FROM query for each old day partition.
Once their data is moved you can just drop them. Since you changed the
triggers first, those tables are no longer taking input, so it's
usually safe to drop them at that point.

It would be nice if there were just a "move" command, but that seems like the type of model that we want, so we'll probably move to that.

On a semi-related note, I was trying to move from the single large table to the partitions, and INSERT INTO ... SELECT * FROM ... WHERE ... was running very slowly (I believe because of the same index issue that we've been running into). So I tried creating a BEFORE INSERT trigger that was working, and used pg_restore on an -Fc dump. The documentation says that triggers are executed as part of a COPY FROM ( http://www.postgresql.org/docs/8.4/static/sql-copy.html ), but it doesn't appear that the trigger was honored, because all of the data was put into the base table and all of the partitions are empty.

Is there a way that I can run pg_restore that will properly honor the trigger? Or do I just have to create a new INSERTs dump?

It turns out that this was an error on my part. I was using an old script to do the restore, and it passed --disable-triggers (to prevent the foreign keys from being checked), which was the actual source of my problem.

Re: Recommendations for partitioning?

From
Sergey Konoplev
Date:
On Fri, Dec 20, 2013 at 7:59 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Dave Johansen escribió:
>> On Thu, Dec 19, 2013 at 10:27 AM, Scott Marlowe <scott.marlowe@gmail.com>wrote:
>
>> > That's pretty much it. What I did was to create the new month table
>> > and day tables, alter my triggers to reflect this, then move the
>> > data with an INSERT INTO / SELECT FROM query for each old day
>> > partition. Once their data is moved you can just drop them. Since
>> > you changed the triggers first, those tables are no longer taking
>> > input, so it's usually safe to drop them at that point.
>>
>> It would be nice if there were just a "move" command, but that seems
>> like the type of model that we want, so we'll probably move to that.
>
> Eh.  Why can't you just do something like
>
> WITH moved AS (
>         DELETE FROM src WHERE ..
>         RETURNING *
> ) INSERT INTO dst SELECT * FROM moved;

Alvaro, I think it could be cheaper to do it as shown below, correct?

psql dbname -c 'copy src to stdout' | \
psql dbname -c 'copy dst from stdin; truncate src;'

Dave, in case you need to archive old partitions to compressed files
outside your database, you can use this tool [1]. Consult the
configuration example [2] and look at the ARCHIVE_* parameters.

[1] https://github.com/grayhemp/pgcookbook/blob/master/bin/archive_tables.sh
[2] https://github.com/grayhemp/pgcookbook/blob/master/bin/config.sh.example

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA

http://www.linkedin.com/in/grayhemp
+1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979
gray.ru@gmail.com


Re: Recommendations for partitioning?

From
Alvaro Herrera
Date:
Sergey Konoplev wrote:
> On Fri, Dec 20, 2013 at 7:59 AM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:

> > Eh.  Why can't you just do something like
> >
> > WITH moved AS (
> >         DELETE FROM src WHERE ..
> >         RETURNING *
> > ) INSERT INTO dst SELECT * FROM moved;
>
> Alvaro, I think it could be cheaper to do it as shown below, correct?
>
> psql dbname -c 'copy src to stdout' | \
> psql dbname -c 'copy dst from stdin; truncate src;'

Yes, if you can get rid of the old records by removing or emptying a
partition (or de-inheriting it, as suggested elsewhere in the thread),
that's better than DELETE because that way you don't create dead rows to
vacuum later.
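To make the contrast concrete (table names are only illustrative):

```sql
-- Cheap: removes a whole month's data as a metadata operation,
-- creating no dead rows.
DROP TABLE events_2013_01;

-- Or keep the empty table and its structure:
TRUNCATE events_2013_01;

-- By contrast, this marks every matching row dead and leaves the
-- cleanup work to VACUUM later:
DELETE FROM events
WHERE recorded >= '2013-01-01' AND recorded < '2013-02-01';
```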

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services