Обсуждение: partitioning an existing table

Поиск
Список
Период
Сортировка

partitioning an existing table

От
Robert Blayzor
Дата:
The docs claim that the master table “should” be empty. It it possible to just create child tables off an existing
mastertable with data, then just inserting data into the new child tables. 

THe plan would be to keep data in the master table and purge it over time until it’s eventually empty, then drop the
indexesas well. 

Fully understanding that data needs to be placed in the right child tables. Data outside of those child ranges would
remainas “old data” in the master table. 


Just trying to grab if that’s an acceptable migration of live data from a single large table and move into
partitioning.Think of it as a very large table of cyclic data that ages out. New data in child tables while removing
datafrom the master table over time. 

--
inoc.net!rblayzor
XMPP: rblayzor.AT.inoc.net
PGP:  https://inoc.net/~rblayzor/
















Re: partitioning an existing table

От
Justin Pryzby
Дата:
On Fri, Dec 29, 2017 at 11:37:56PM -0500, Robert Blayzor wrote:
> The docs claim that the master table “should” be empty. It it possible to just create child tables off an existing
mastertable with data, then just inserting data into the new child tables.
 
> 
> THe plan would be to keep data in the master table and purge it over time until it’s eventually empty, then drop the
indexesas well.
 
> 
> Fully understanding that data needs to be placed in the right child tables. Data outside of those child ranges would
remainas “old data” in the master table.
 
> 
> Just trying to grab if that’s an acceptable migration of live data from a single large table and move into
partitioning.Think of it as a very large table of cyclic data that ages out. New data in child tables while removing
datafrom the master table over time.
 

For PG10 "partitions" (as in relkind='p') the parent is defined as empty
(actually has no underlying storage).

For inheritance (available in and before PG10), the parent may be nonempty,
which works fine, although someone else might find it unintuitive.  (Does the
doc actually say "should" somewhere ?)

You almost certainly want child tables to have constraints, to allow
constraint_exclusion (which is the only reason one child table is more "right"
than any other, besides the associated pruning/retention schedule).

Since you'll be running DELETE rather than DROP on the parent, you might
consider DELETE ONLY..  but it won't matter if your children's constraints are
usable with DELETE's WHERE condition.

Also, note that autoanalyze doesn't know to analyze the PARENT's statistics
when its children are INSERTED/DROPPED/etc.  So I'd suggest to consider ANALYZE
each parent following DROP of its children (or maybe on some more frequent
schedule to handle inserted rows, too).  Perhaps that should be included as a
CAVEAT?
https://www.postgresql.org/docs/10/static/ddl-inherit.html#DDL-INHERIT-CAVEATS

Just curious: are your constraints/indices on starting time or ending time?

BTW depending on your requirements, it may be possible to make pg_dump much
more efficient.  For our data, it's reasonable to assume that a table is
"final" if its constraints exclude data older than a few days ago, and it can
be permanently dumped and excluded from future, daily backups, which makes the
backups smaller and faster, and probably causes less cache churn, etc.  But I
imagine you might have different requirements, so that may be infeasible, or
you'd maybe have to track insertions, either via pg_stat_user_tables, or at the
application layer, and redump the relevant table.

Justin


Re: partitioning an existing table

От
Robert Blayzor
Дата:
On Dec 30, 2017, at 12:38 AM, Justin Pryzby <pryzby@telsasoft.com> wrote:
> For inheritance (available in and before PG10), the parent may be nonempty,
> which works fine, although someone else might find it unintuitive.  (Does the
> doc actually say "should" somewhere ?)

Well it doesn’t say should, but says “normally”..

"The parent table itself is normally empty; it exists just to represent the entire data set. …


> Just curious: are your constraints/indices on starting time or ending time?

Yes, the child tables will be strictly on a months worth of data.

CREATE TABLE table_201801
  (CHECK (ts >= DATE ‘2018-01-01' AND ts < DATE ‘2018-02-01'))
  INHERITS …


The application will insert directly into the child tables, so no need for triggers or rules.


> BTW depending on your requirements, it may be possible to make pg_dump much
> more efficient.  For our data, it's reasonable to assume that a table is
> "final" if its constraints exclude data older than a few days ago, and it can
> be permanently dumped and excluded from future, daily backups, which makes the
> backups smaller and faster, and probably causes less cache churn, etc.  But I
> imagine you might have different requirements, so that may be infeasible, or
> you'd maybe have to track insertions, either via p

The idea is only only keep a # of months available for searching over a period of months. Those months could be 3 or
more,up to a year, etc. But being able to just drop and entire child table for pruning is very attractive. Right now
theaverage months data is about 2-3 million rows each. Data is just inserted and then only searched. Never updated… 

I also like the idea of skipping all this older data from a PGdump. We archive records inserted into these tables daily
intocold storage. ie: export and compressed. So the data is saved cold. We dump the DB nightly also, but probably would
makesense to skip anything outside of the newest child table. Just not sure how to make that happen, yet…. 






Re: partitioning an existing table - efficient pg_dump

От
Justin Pryzby
Дата:
On Sat, Dec 30, 2017 at 09:19:05AM -0500, Robert Blayzor wrote:
> On Dec 30, 2017, at 12:38 AM, Justin Pryzby <pryzby@telsasoft.com> wrote:

> > BTW depending on your requirements, it may be possible to make pg_dump much
> > more efficient.  For our data, it's reasonable to assume that a table is
> > "final" if its constraints exclude data older than a few days ago, and it can
> > be permanently dumped and excluded from future, daily backups, which makes the
> > backups smaller and faster, and probably causes less cache churn, etc.  But I
> > imagine you might have different requirements, so that may be infeasible, or
> > you'd maybe have to track insertions, either via p
> 
> The idea is only only keep a # of months available for searching over a period of months. Those months could be 3 or
more,up to a year, etc. But being able to just drop and entire child table for pruning is very attractive. Right now
theaverage months data is about 2-3 million rows each. Data is just inserted and then only searched. Never updated…
 
> 
> I also like the idea of skipping all this older data from a PGdump. We archive records inserted into these tables
dailyinto cold storage. ie: export and compressed. So the data is saved cold. We dump the DB nightly also, but probably
wouldmake sense to skip anything outside of the newest child table. Just not sure how to make that happen, yet….
 

For us, I classify the tables as "partitioned" or "not partitioned" and
subdivide "partitioned" into "recent" or "historic" based on table names; but
if you design it from scratch then you'd have the opportunity to keep a list of
partitioned tables, their associated date range, date of most recent insertion,
and most recent "final" backup.

This is the essence of it:
snap= ... SELECT pg_export_snapshot();
pg_dump --snap "$snap" -T "$ptnreg" -f nonpartitioned.new
pg_dump --snap "$snap" -t "$recent" -f recent.new
loop around historic partitioned tables and run "final" pg_dump if it's been
  INSERTed more recently than it's been dumped.
remove any "final" pg_dump not included in any existing backup (assuming you
  keep multiple copies on different rotation).

Note that pg_dump -t/-T is different from "egrep" in a few special ways..

Justin