Обсуждение: Postgresql partitioning - single hot table or distributed

Поиск
Список
Период
Сортировка

Postgresql partitioning - single hot table or distributed

От
sam mulube
Дата:
Hi,

we are considering database partitioning as a possible solution to
some performance issues we are having with our database, and we are
trying to decide on a partitioning scheme. We have a moderately write
heavy application (approx 50 inserts per second, with writes
outnumbering reads by roughly 5:1), and the table in question looks
something like this:

-------------------------------------------------------------------------------
column name :   id            |  value   |  server_id   |   created_at
column type    :  integer   |  string   |    integer     |   timestamp
with time zone
other info         :  pk           |               | fk, indexed |
indexed
-------------------------------------------------------------------------------

Or initial thoughts on partitioning was to partition by date using the
created_at column, with a separate partition for each month; however
the vast majority of our inserts would be for 'now', so we would be
almost entirely writing to the partition for the current month. Other
month partitions might get occasional updates, but this would be a
relatively infrequent occurrence.

Alternatively we wondered about partitioning by the server_id foreign
key, using for example the modulo of the foreign key id. This would
give us a finite number of partitions (rather than the potentially
unbounded date option), and would likely cause writes to be much more
evenly distributed between the partitions.

Does anyone have any likely idea which would be the better choice. The
single hot table getting most of the inserts, which might mean any
indexes are fully in memory, or dividing the writes more evenly over
all of our partitions?

Many thanks for any advice.

Sam

Re: Postgresql partitioning - single hot table or distributed

От
Vick Khera
Дата:
On Tue, Jun 29, 2010 at 4:00 PM, sam mulube <sam.mulube@gmail.com> wrote:
> Alternatively we wondered about partitioning by the server_id foreign
> key, using for example the modulo of the foreign key id. This would
> give us a finite number of partitions (rather than the potentially
> unbounded date option), and would likely cause writes to be much more
> evenly distributed between the partitions.

Do you ever delete?  If so, what is the delete pattern?

I have several large tables split into 100 partitions which keeps them
in the O(10million) row size range each, and that has proven to be
ideal for our use.  We insert into the partition directly, and select
from it whenever possible avoiding the constraint exclusion step.  The
only time we get bad performance is when doing a join that cannot run
constraint exclusion, and that turns out to be pretty expensive.

Re: Postgresql partitioning - single hot table or distributed

От
sam mulube
Дата:
Hi Vick,

Currently we aren't deleting anything due to business requirements
though at some point we will have to start deleting out some data. I
suspect when we do it won't be as simple as just dropping the oldest
data; some customers will have data that we want to keep permanently,
while others will be ok to be dropped after 30 days.

Inserting directly into the specific partition is interesting, but if
you're going to go down that route then aren't you starting to
implement the partitioning yourself in application code. In that case
what benefit does keeping the Postgresql partitioning in place
actually give you?

Thanks for the reply.

Sam

On 30 June 2010 02:39, Vick Khera <vivek@khera.org> wrote:
> On Tue, Jun 29, 2010 at 4:00 PM, sam mulube <sam.mulube@gmail.com> wrote:
>> Alternatively we wondered about partitioning by the server_id foreign
>> key, using for example the modulo of the foreign key id. This would
>> give us a finite number of partitions (rather than the potentially
>> unbounded date option), and would likely cause writes to be much more
>> evenly distributed between the partitions.
>
> Do you ever delete?  If so, what is the delete pattern?
>
> I have several large tables split into 100 partitions which keeps them
> in the O(10million) row size range each, and that has proven to be
> ideal for our use.  We insert into the partition directly, and select
> from it whenever possible avoiding the constraint exclusion step.  The
> only time we get bad performance is when doing a join that cannot run
> constraint exclusion, and that turns out to be pretty expensive.
>

Re: Postgresql partitioning - single hot table or distributed

От
Vick Khera
Дата:
On Wed, Jun 30, 2010 at 10:20 AM, sam mulube <sam.mulube@gmail.com> wrote:
> Inserting directly into the specific partition is interesting, but if
> you're going to go down that route then aren't you starting to
> implement the partitioning yourself in application code. In that case
> what benefit does keeping the Postgresql partitioning in place
> actually give you?
>

Your benefit comes in very fast lookups when constraint exclusion (or
index query) are working for you.  You also get simplified syntax for
doing joins against the entire data set.  If however you always only
search on the ID, then there is really not much benefit other than
having smaller indexes.  Smaller indexes are great when you have to
re-index.  Instead of taking say 15 minutes to reindex the one big
table, you take 10-15seconds per partition, which means your
applications can move along with only a few seconds delay while you
reindex instead of being blocked.

For one of my tables I could have gone the trigger method for routing
inserts because those are onsies-twosies, but the main table I have it
is much more efficient to compute the partition up front since I do a
*lot* of inserts in a big batch.

The drawback to partitioning by an ID number using modulo is that for
constraint exclusion to work you have to actually add something like
"AND (my_id % 42) = 0" to match the constraint.  The exclusion is not
done by executing the constraint, but by proving the constraint will
hold true for the WHERE clause.  My reasoning was if I'm going to add
that to my select queries, I might as well just pick the table by
doing that arithmetic in my application up front.  It is faster than
having the DB do the constraint proof for each of the 100 partitions.

Re: Postgresql partitioning - single hot table or distributed

От
Greg Smith
Дата:
Vick Khera wrote:
> The drawback to partitioning by an ID number using modulo is that for
> constraint exclusion to work you have to actually add something like
> "AND (my_id % 42) = 0" to match the constraint.  The exclusion is not
> done by executing the constraint, but by proving the constraint will
> hold true for the WHERE clause.

Vick's "Case study of partitioning a large table in Postgres 8"
presentation at http://www.mailermailer.com/labs/presentations/index.rwp
covers this topic in more detail.  The important thing to realize is
that your partitioning scheme must respect the limitations of the
constraint exclusion code in the query planner to be most useful.  You
really need to consider not just the structure of the data, but what the
queries against it look like, to make that call.  If you can hack the
application to include the modulo bit in every query that might be a
sufficient breakdown for you.  But you have to consider what that looks
like from the query side, given what the optimizer knows how to do, not
just the perspective of the table structure.  Building a quick prototype
and using EXPLAIN ANALYZE of common queries is invaluable here to do an
early investigation of any potential redesign.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us


Re: Postgresql partitioning - single hot table or distributed

От
Scott Marlowe
Дата:
On Tue, Jun 29, 2010 at 4:00 PM, sam mulube <sam.mulube@gmail.com> wrote:
> Hi,
>
> we are considering database partitioning as a possible solution to
> some performance issues we are having with our database, and we are
> trying to decide on a partitioning scheme. We have a moderately write
> heavy application (approx 50 inserts per second, with writes
> outnumbering reads by roughly 5:1), and the table in question looks
> something like this:
>
> -------------------------------------------------------------------------------
> column name :   id            |  value   |  server_id   |   created_at
> column type    :  integer   |  string   |    integer     |   timestamp
> with time zone
> other info         :  pk           |               | fk, indexed |
> indexed
> -------------------------------------------------------------------------------
>
> Or initial thoughts on partitioning was to partition by date using the
> created_at column, with a separate partition for each month; however
> the vast majority of our inserts would be for 'now', so we would be
> almost entirely writing to the partition for the current month. Other
> month partitions might get occasional updates, but this would be a
> relatively infrequent occurrence.
>
> Alternatively we wondered about partitioning by the server_id foreign
> key, using for example the modulo of the foreign key id. This would
> give us a finite number of partitions (rather than the potentially
> unbounded date option), and would likely cause writes to be much more
> evenly distributed between the partitions.
>
> Does anyone have any likely idea which would be the better choice. The
> single hot table getting most of the inserts, which might mean any
> indexes are fully in memory, or dividing the writes more evenly over
> all of our partitions?

Are most of your selects for now to now - 1 day or so as well?  If so,
then look at having one big partition for historical data and one
small one for the last day.  Every x hours run a cron job that moves
everything in the current partition to the old archive partition(s).