Обсуждение: pg_dump is broken for partition tablespaces

Поиск
Список
Период
Сортировка

pg_dump is broken for partition tablespaces

От
David Rowley
Дата:
Over on [1] Andres pointed out that the pg_dump support for the new to
PG12 tablespace inheritance feature is broken.  This is the feature
added in ca4103025dfe26 to allow a partitioned table to have a
tablespace that acts as the default tablespace for newly attached
partitions. The idea being that you can periodically change the
default tablespace for new partitions to a tablespace that sits on a
disk partition with more free space without affecting the default
tablespace for normal non-partitioned tables. Anyway...

pg_dump is broken with this. Consider:

create tablespace newts location '/tmp/newts';
create table listp (a int) partition by list(a) tablespace newts;
create table listp1 partition of listp for values in(1) tablespace pg_default;
create table listp2 partition of listp for values in(2);

select relname,relkind,reltablespace from pg_class where relname like
'listp%' and relkind in('r','p') order by relname;
produces:

 relname | relkind | reltablespace
---------+---------+---------------
 listp   | p       |         16384
 listp1  | r       |             0
 listp2  | r       |         16384
(3 rows)

after dump/restore:

 relname | relkind | reltablespace
---------+---------+---------------
 listp   | p       |         16384
 listp1  | r       |         16384
 listp2  | r       |         16384
(3 rows)

Here the tablespace for listp1 was inherited from listp, but we really
should have restored this to use pg_default like was specified.

The reason this occurs is that in pg_dump we do:

SET default_tablespace = '';

CREATE TABLE public.listp1 PARTITION OF public.listp
FOR VALUES IN (1);

so, since we're creating the table initially as a partition the logic
that applies the default partition from the parent kicks in.

If we instead did:

CREATE TABLE public.listp1 (a integer
);

ALTER TABLE public.list1 ATTACH PARTITION public.listp FOR VALUES IN (1);

then we'd have no issue, as tablespace will be set to whatever
default_tablespace is set to.

Partitioned indexes have this similar inherit tablespace from parent
feature, so ca4103025dfe26 was intended to align the behaviour of the
two. Partitioned indexes happen not to suffer from the same issue as
the indexes are attached after their creation similar to what I
propose above.

Can anyone see any fundamental reason that we should not create a
partitioned table by doing CREATE TABLE followed by ATTACH PARTITION?
 If not, I'll write a patch that fixes it that way.

As far as I can see, the biggest fundamental difference with doing
things this way will be that the column order of partitions will be
preserved, where before it would inherit the order of the partitioned
table.  I'm a little unsure if doing this column reordering was an
intended side-effect or not.

[1] https://www.postgresql.org/message-id/20190305060804.jv5mz4slrnelh3jy@alap3.anarazel.de

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: pg_dump is broken for partition tablespaces

От
Michael Paquier
Дата:
On Wed, Mar 06, 2019 at 07:45:06PM +1300, David Rowley wrote:
> Partitioned indexes have this similar inherit tablespace from parent
> feature, so ca4103025dfe26 was intended to align the behaviour of the
> two. Partitioned indexes happen not to suffer from the same issue as
> the indexes are attached after their creation similar to what I
> propose above.
>
> Can anyone see any fundamental reason that we should not create a
> partitioned table by doing CREATE TABLE followed by ATTACH PARTITION?
>  If not, I'll write a patch that fixes it that way.

The part for partitioned indexes is already battle-proven, so if the
part for partitioned tables can be consolidated the same way that
would be really nice.

> As far as I can see, the biggest fundamental difference with doing
> things this way will be that the column order of partitions will be
> preserved, where before it would inherit the order of the partitioned
> table.  I'm a little unsure if doing this column reordering was an
> intended side-effect or not.

I don't see any direct issues with that to be honest thinking about
it..
--
Michael

Вложения

Re: pg_dump is broken for partition tablespaces

От
David Rowley
Дата:
On Wed, 6 Mar 2019 at 20:19, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Mar 06, 2019 at 07:45:06PM +1300, David Rowley wrote:
> > Can anyone see any fundamental reason that we should not create a
> > partitioned table by doing CREATE TABLE followed by ATTACH PARTITION?
> >  If not, I'll write a patch that fixes it that way.
>
> The part for partitioned indexes is already battle-proven, so if the
> part for partitioned tables can be consolidated the same way that
> would be really nice.

I think Andres is also going to need it to work this way for the
pluggable storage patch too.

Looking closer at this, I discovered that when pg_dump is in binary
upgrade mode that it crafts the pg_dump output in this way anyway...
Obviously, the column orders can't go changing magically in that case
since we're about to plug the old heap into the new table.  Due to the
removal of the special case, it means this patch turned out to remove
more code than it adds.

Patch attached.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

Re: pg_dump is broken for partition tablespaces

От
Tom Lane
Дата:
David Rowley <david.rowley@2ndquadrant.com> writes:
> As far as I can see, the biggest fundamental difference with doing
> things this way will be that the column order of partitions will be
> preserved, where before it would inherit the order of the partitioned
> table.  I'm a little unsure if doing this column reordering was an
> intended side-effect or not.

Well, if the normal behavior results in changing the column order,
it'd be necessary to do things differently in --binary-upgrade mode
anyway, because there we *must* preserve column order.  I don't know
if what you're describing represents a separate bug for pg_upgrade runs,
but it might.  Is there any test case for the situation left behind by
the core regression tests?

            regards, tom lane


Re: pg_dump is broken for partition tablespaces

От
Andres Freund
Дата:
Hi,

On 2019-03-06 19:45:06 +1300, David Rowley wrote:
> Over on [1] Andres pointed out that the pg_dump support for the new to
> PG12 tablespace inheritance feature is broken.  This is the feature
> added in ca4103025dfe26 to allow a partitioned table to have a
> tablespace that acts as the default tablespace for newly attached
> partitions. The idea being that you can periodically change the
> default tablespace for new partitions to a tablespace that sits on a
> disk partition with more free space without affecting the default
> tablespace for normal non-partitioned tables. Anyway...

I'm also concerned that the the current catalog representation isn't
right. As I said:

> I also find it far from clear that:
>     <listitem>
>      <para>
>       The <replaceable class="parameter">tablespace_name</replaceable> is the name
>       of the tablespace in which the new table is to be created.
>       If not specified,
>       <xref linkend="guc-default-tablespace"/> is consulted, or
>       <xref linkend="guc-temp-tablespaces"/> if the table is temporary.  For
>       partitioned tables, since no storage is required for the table itself,
>       the tablespace specified here only serves to mark the default tablespace
>       for any newly created partitions when no other tablespace is explicitly
>       specified.
>      </para>
>     </listitem>
> is handled correctly. The above says that the *specified* tablespaces -
> which seems to exclude the default tablespace - is what's going to
> determine what partitions use as their default tablespace. But in fact
> that's not true, the partitioned table's pg_class.retablespace is set to
> what default_tablespaces was at the time of the creation.

I still think the feature as is doesn't seem to have very well defined
behaviour.



> If we instead did:
> 
> CREATE TABLE public.listp1 (a integer
> );
> 
> ALTER TABLE public.list1 ATTACH PARTITION public.listp FOR VALUES IN (1);

Isn't that a bit more expensive, because now the table needs to be
scanned for maching the value? That's probably neglegible though, given
it'd probably always empty.


Greetings,

Andres Freund


Re: pg_dump is broken for partition tablespaces

От
David Rowley
Дата:
On Thu, 7 Mar 2019 at 03:36, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <david.rowley@2ndquadrant.com> writes:
> > As far as I can see, the biggest fundamental difference with doing
> > things this way will be that the column order of partitions will be
> > preserved, where before it would inherit the order of the partitioned
> > table.  I'm a little unsure if doing this column reordering was an
> > intended side-effect or not.
>
> Well, if the normal behavior results in changing the column order,
> it'd be necessary to do things differently in --binary-upgrade mode
> anyway, because there we *must* preserve column order.  I don't know
> if what you're describing represents a separate bug for pg_upgrade runs,
> but it might.  Is there any test case for the situation left behind by
> the core regression tests?

After having written the patch, I noticed that binary upgrade mode
does the CREATE TABLE then ATTACH PARTITION in order to preserve the
order.

After changing it nothing failed in make check-world with tap tests enabled.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: pg_dump is broken for partition tablespaces

От
David Rowley
Дата:
On Thu, 7 Mar 2019 at 05:17, Andres Freund <andres@anarazel.de> wrote:
> I'm also concerned that the the current catalog representation isn't
> right. As I said:
>
> > I also find it far from clear that:
> >     <listitem>
> >      <para>
> >       The <replaceable class="parameter">tablespace_name</replaceable> is the name
> >       of the tablespace in which the new table is to be created.
> >       If not specified,
> >       <xref linkend="guc-default-tablespace"/> is consulted, or
> >       <xref linkend="guc-temp-tablespaces"/> if the table is temporary.  For
> >       partitioned tables, since no storage is required for the table itself,
> >       the tablespace specified here only serves to mark the default tablespace
> >       for any newly created partitions when no other tablespace is explicitly
> >       specified.
> >      </para>
> >     </listitem>
> > is handled correctly. The above says that the *specified* tablespaces -
> > which seems to exclude the default tablespace - is what's going to
> > determine what partitions use as their default tablespace. But in fact
> > that's not true, the partitioned table's pg_class.retablespace is set to
> > what default_tablespaces was at the time of the creation.

Do you think it's fine to reword the docs to make this point more
clear, or do you see this as a fundamental problem with the patch?

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: pg_dump is broken for partition tablespaces

От
Andres Freund
Дата:
Hi,

On 2019-03-07 11:31:15 +1300, David Rowley wrote:
> On Thu, 7 Mar 2019 at 05:17, Andres Freund <andres@anarazel.de> wrote:
> > I'm also concerned that the the current catalog representation isn't
> > right. As I said:
> >
> > > I also find it far from clear that:
> > >     <listitem>
> > >      <para>
> > >       The <replaceable class="parameter">tablespace_name</replaceable> is the name
> > >       of the tablespace in which the new table is to be created.
> > >       If not specified,
> > >       <xref linkend="guc-default-tablespace"/> is consulted, or
> > >       <xref linkend="guc-temp-tablespaces"/> if the table is temporary.  For
> > >       partitioned tables, since no storage is required for the table itself,
> > >       the tablespace specified here only serves to mark the default tablespace
> > >       for any newly created partitions when no other tablespace is explicitly
> > >       specified.
> > >      </para>
> > >     </listitem>
> > > is handled correctly. The above says that the *specified* tablespaces -
> > > which seems to exclude the default tablespace - is what's going to
> > > determine what partitions use as their default tablespace. But in fact
> > > that's not true, the partitioned table's pg_class.retablespace is set to
> > > what default_tablespaces was at the time of the creation.
> 
> Do you think it's fine to reword the docs to make this point more
> clear, or do you see this as a fundamental problem with the patch?

Hm, both? I mean I wouldn't necessarily characterize it as "fundamental"
problem, but ...

I don't think the argument that the user intended to explicitly set a
tablespace holds much water if it was just set via default_tablespace,
rather than an explicit TABLESPACE.  I think iff you really want
something like this feature, you'd have to mark a partition's
reltablespace as 0 unless an *explicit* assignment of the tablespace
happened. In which case you also would need to explicitly emit a
TABLESPACE for the partitioned table in pg_dump, to restore that.

Greetings,

Andres Freund


Re: pg_dump is broken for partition tablespaces

От
David Rowley
Дата:
On Thu, 7 Mar 2019 at 11:37, Andres Freund <andres@anarazel.de> wrote:
>
> On 2019-03-07 11:31:15 +1300, David Rowley wrote:
> > Do you think it's fine to reword the docs to make this point more
> > clear, or do you see this as a fundamental problem with the patch?
>
> Hm, both? I mean I wouldn't necessarily characterize it as "fundamental"
> problem, but ...

Okay, so if I understand you correctly, you're complaining about the
fact that if the user does:

CREATE TABLE p (a int) PARTITION BY LIST(a) TABLESPACE pg_default;

that the user intended that all future partitions go to pg_default and
not whatever default_tablespace is set to at the time?

If so, that seems like a genuine concern.

I see in heap_create() we do;

/*
* Never allow a pg_class entry to explicitly specify the database's
* default tablespace in reltablespace; force it to zero instead. This
* ensures that if the database is cloned with a different default
* tablespace, the pg_class entry will still match where CREATE DATABASE
* will put the physically copied relation.
*
* Yes, this is a bit of a hack.
*/
if (reltablespace == MyDatabaseTableSpace)
reltablespace = InvalidOid;

which will zero pg_class.reltablespace if the specified tablespace
happens to match pg_database.dattablespace. This causes future
partitions to think that no tablespace was specified and therefore
DefineRelation() consults the default_tablespace.

I see that this same problem exists for partitioned indexes too:

create table listp (a int) partition by list(a);
create index on listp (a) tablespace pg_default;
set default_Tablespace = n;
create table listp1 partition of listp for values in(1);
\d listp1
               Table "public.listp1"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
Partition of: listp FOR VALUES IN (1)
Indexes:
    "listp1_a_idx" btree (a), tablespace "n"
Tablespace: "n"

If I understand what you're saying correctly, then the listp1_a_idx
should have been created in pg_default since that's what the default
partitioned index tablespace was set to.

> I don't think the argument that the user intended to explicitly set a
> tablespace holds much water if it was just set via default_tablespace,
> rather than an explicit TABLESPACE.  I think iff you really want
> something like this feature, you'd have to mark a partition's
> reltablespace as 0 unless an *explicit* assignment of the tablespace
> happened. In which case you also would need to explicitly emit a
> TABLESPACE for the partitioned table in pg_dump, to restore that.

I think emitting an explicit tablespace in pg_dump for partitioned
tables (when non-zero) might have issues for pg_restore's
--no-tablespaces option.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: pg_dump is broken for partition tablespaces

От
Andres Freund
Дата:
On 2019-03-07 15:25:34 +1300, David Rowley wrote:
> On Thu, 7 Mar 2019 at 11:37, Andres Freund <andres@anarazel.de> wrote:
> >
> > On 2019-03-07 11:31:15 +1300, David Rowley wrote:
> > > Do you think it's fine to reword the docs to make this point more
> > > clear, or do you see this as a fundamental problem with the patch?
> >
> > Hm, both? I mean I wouldn't necessarily characterize it as "fundamental"
> > problem, but ...
> 
> Okay, so if I understand you correctly, you're complaining about the
> fact that if the user does:
> 
> CREATE TABLE p (a int) PARTITION BY LIST(a) TABLESPACE pg_default;
> 
> that the user intended that all future partitions go to pg_default and
> not whatever default_tablespace is set to at the time?

Correct. And also, conversely, if default_tablespace was set to frakbar
at the time of CREATE TABLE, but no explicit TABLESPACE was provided,
that that should *not* be the default for new partitions; rather
default_tablespace should be consulted again.


> If I understand what you're saying correctly, then the listp1_a_idx
> should have been created in pg_default since that's what the default
> partitioned index tablespace was set to.

That's how I understand the intent of the user yes.


> > I don't think the argument that the user intended to explicitly set a
> > tablespace holds much water if it was just set via default_tablespace,
> > rather than an explicit TABLESPACE.  I think iff you really want
> > something like this feature, you'd have to mark a partition's
> > reltablespace as 0 unless an *explicit* assignment of the tablespace
> > happened. In which case you also would need to explicitly emit a
> > TABLESPACE for the partitioned table in pg_dump, to restore that.
> 
> I think emitting an explicit tablespace in pg_dump for partitioned
> tables (when non-zero) might have issues for pg_restore's
> --no-tablespaces option.

Yea, there'd probably need to be handling in pg_backup_archiver.c or
such, not sure how to do that best.

Greetings,

Andres Freund


Re: pg_dump is broken for partition tablespaces

От
Alvaro Herrera
Дата:
On 2019-Mar-06, Andres Freund wrote:


> > I also find it far from clear that:
> >     <listitem>
> >      <para>
> >       The <replaceable class="parameter">tablespace_name</replaceable> is the name
> >       of the tablespace in which the new table is to be created.
> >       If not specified,
> >       <xref linkend="guc-default-tablespace"/> is consulted, or
> >       <xref linkend="guc-temp-tablespaces"/> if the table is temporary.  For
> >       partitioned tables, since no storage is required for the table itself,
> >       the tablespace specified here only serves to mark the default tablespace
> >       for any newly created partitions when no other tablespace is explicitly
> >       specified.
> >      </para>
> >     </listitem>
> > is handled correctly. The above says that the *specified* tablespaces -
> > which seems to exclude the default tablespace - is what's going to
> > determine what partitions use as their default tablespace. But in fact
> > that's not true, the partitioned table's pg_class.retablespace is set to
> > what default_tablespaces was at the time of the creation.
> 
> I still think the feature as is doesn't seem to have very well defined
> behaviour.

I have just started looking into this issue.  I'm not sure yet what's
the best fix; maybe for the specific case of partitioned tables and
indexes we should deviate from the ages-old behavior of storing zero
tablespace, if the tablespace is specified as the default one.  But I
haven't written the code yet.

In the meantime, here's David's patch, rebased to current master and
with the pg_upgrade and pg_dump tests fixed to match the new partition
creation behavior.

> > If we instead did:
> > 
> > CREATE TABLE public.listp1 (a integer
> > );
> > 
> > ALTER TABLE public.list1 ATTACH PARTITION public.listp FOR VALUES IN (1);
> 
> Isn't that a bit more expensive, because now the table needs to be
> scanned for maching the value? That's probably neglegible though, given
> it'd probably always empty.

I think it's always empty.  In the standard case, there are two
transactions rather than one, so yeah it's a little bit more expensive.
Maybe we should make this conditional on there actually being an
important tablespace distinction to preserve.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

Re: pg_dump is broken for partition tablespaces

От
Alvaro Herrera
Дата:
On 2019-Mar-07, David Rowley wrote:

> On Thu, 7 Mar 2019 at 11:37, Andres Freund <andres@anarazel.de> wrote:
> >
> > On 2019-03-07 11:31:15 +1300, David Rowley wrote:
> > > Do you think it's fine to reword the docs to make this point more
> > > clear, or do you see this as a fundamental problem with the patch?
> >
> > Hm, both? I mean I wouldn't necessarily characterize it as "fundamental"
> > problem, but ...
> 
> Okay, so if I understand you correctly, you're complaining about the
> fact that if the user does:
> 
> CREATE TABLE p (a int) PARTITION BY LIST(a) TABLESPACE pg_default;
> 
> that the user intended that all future partitions go to pg_default and
> not whatever default_tablespace is set to at the time?
> 
> If so, that seems like a genuine concern.

So, you don't need a partition in order to see a problem here: pg_dump
doesn't do the right thing for the partitioned table.  It does this:

SET default_tablespace = pg_default;
CREATE TABLE public.p (
    a integer
)
PARTITION BY LIST (a);

which naturally has a different effect on partitions.

Now, I think the problem for partitions can be solved in a simple
manner: we can have this code David quoted ignore the
MyDatabaseTableSpace exception for partitioned rels:

> /*
> * Never allow a pg_class entry to explicitly specify the database's
> * default tablespace in reltablespace; force it to zero instead. This
> * ensures that if the database is cloned with a different default
> * tablespace, the pg_class entry will still match where CREATE DATABASE
> * will put the physically copied relation.
> *
> * Yes, this is a bit of a hack.
> */
> if (reltablespace == MyDatabaseTableSpace)
> reltablespace = InvalidOid;

This gives the right effect AFAICS, namely to store the specified
tablespace regardless of what it is; and there is no problem with the
"physically copied relation" as the comment says, because there *isn't*
a physical relation in the first place.

However, in order to fix the pg_dump behavior for the partitioned rel,
we would need to emit the tablespace differently, i.e. not use SET
default_tablespace, but instead attach the tablespace clause to the
CREATE TABLE line.

I'll go have a look at what problems are there with doing that.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pg_dump is broken for partition tablespaces

От
Alvaro Herrera
Дата:
On 2019-Apr-09, Alvaro Herrera wrote:

> However, in order to fix the pg_dump behavior for the partitioned rel,
> we would need to emit the tablespace differently, i.e. not use SET
> default_tablespace, but instead attach the tablespace clause to the
> CREATE TABLE line.
> 
> I'll go have a look at what problems are there with doing that.

I tried with the attached tbspc-partitioned.patch (on top of the
previous patch).  It makes several additional cases work correctly, but
causes much more pain than it fixes:

1. if a partitioned table is created without a tablespace spec, then
pg_dump dumps it with a clause like TABLESPACE "".

2. If you fix that to make pg_dump omit the tablespace clause if the
tablespace is default, then there's no SET nor TABLESPACE clauses, so
the table is created in the tablespace of the previously dumped table.
Useless.

3. You can probably make the above work anyway with enough hacks, but
by the time you're finished, the code stinks like months-only fish.

4. Even if you ignore the odor, all the resulting CREATE TABLE commands
will fail if you run them in a database that doesn't contain the
tablespaces in question.  That is, it negates one of the main points of
using "SET default_tablespace" in the first place.

I therefore decree that this approach is not a solution and never will
be.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

Re: pg_dump is broken for partition tablespaces

От
Alvaro Herrera
Дата:
On 2019-Mar-06, Andres Freund wrote:

> I don't think the argument that the user intended to explicitly set a
> tablespace holds much water if it was just set via default_tablespace,
> rather than an explicit TABLESPACE.  I think iff you really want
> something like this feature, you'd have to mark a partition's
> reltablespace as 0 unless an *explicit* assignment of the tablespace
> happened. In which case you also would need to explicitly emit a
> TABLESPACE for the partitioned table in pg_dump, to restore that.

Thinking more about this, I think you're wrong about the behavior under
nonempty default_tablespace.  Quoth the fine manual:

default_tablespace:
    [...]
    This variable specifies the default tablespace in which to create
    objects (tables and indexes) when a CREATE command does not explicitly specify
    a tablespace.
    The value is either the name of a tablespace, or an empty string to
    specify using the default tablespace of the current database. [...]
    https://www.postgresql.org/docs/11/runtime-config-client.html#RUNTIME-CONFIG-CLIENT-STATEMENT

What this says to me, if default_tablespace is set, and there is no
TABLESPACE clause, we should regard the default_tablespace just as if it
were an explicitly named tablespace.  Note that the default setting of
default_tablespace is empty, meaning that tables are created in the
database tablespace.

Emerging behavior: default_tablespace is set to A, then partitioned
table T is created, then default_tablespace is changed to B.  Any
partitions of T created afterwards still appear in tablespace A.

If you really intended for new partitions to be created in
default_tablespace (following future changes to that option), then you
should just leave default_tablespace as empty when creating T.

There is one deficiency that needs to be solved in order for this to
work fully: currently there is no way to reset "reltablespace" to 0.

Does that make sense?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pg_dump is broken for partition tablespaces

От
Alvaro Herrera
Дата:
On 2019-Apr-09, Alvaro Herrera wrote:

> There is one deficiency that needs to be solved in order for this to
> work fully: currently there is no way to reset "reltablespace" to 0.

Therefore I propose to add
ALTER TABLE tb ... RESET TABLESPACE;
which sets reltablespace to 0, and it would work only for partitioned
tables and indexes.

That, together with the initial proposal by David, seems to me to solve
the issue at hand.

If no objections, I'll try to come up with a patch tomorrow.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pg_dump is broken for partition tablespaces

От
David Rowley
Дата:
On Wed, 10 Apr 2019 at 11:05, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> On 2019-Apr-09, Alvaro Herrera wrote:
>
> > There is one deficiency that needs to be solved in order for this to
> > work fully: currently there is no way to reset "reltablespace" to 0.
>
> Therefore I propose to add
> ALTER TABLE tb ... RESET TABLESPACE;
> which sets reltablespace to 0, and it would work only for partitioned
> tables and indexes.
>
> That, together with the initial proposal by David, seems to me to solve
> the issue at hand.
>
> If no objections, I'll try to come up with a patch tomorrow.

I'm starting to wonder if maintaining two separate behaviours here
isn't just to complex.

For example, if I do:

CREATE TABLE a (a INT PRIMARY KEY) TABLESPACE mytablespace;

then a_pkey goes into the default_tablespace, not mytablespace.

Also, is it weird that CLUSTER can move a table into another
tablespace if the database's tablespace has changed?

postgres=# CREATE TABLE a (a INT PRIMARY KEY) TABLESPACE pg_default;
CREATE TABLE
postgres=# SELECT pg_relation_filepath('a'::regclass);
 pg_relation_filepath
----------------------
 base/12702/16444
(1 row)

postgres=# \c n
n=# ALTER DATABASE postgres TABLESPACE mytablespace;
ALTER DATABASE
n=# \c postgres
postgres=# CLUSTER a USING a_pkey;
CLUSTER
postgres=# SELECT pg_relation_filepath('a'::regclass);
            pg_relation_filepath
---------------------------------------------
 pg_tblspc/16415/PG_12_201904072/12702/16449
(1 row)

This one seems very strange to me.

I think to make it work we'd need to modify heap_create() and
heap_create_with_catalog() to add a new bool argument that controls if
the TABLESPACE was defined the calling command then only set the
reltablespace to InvalidOid if the tablespace was not defined and it
matches the database's tablespace.  If we want to treat table
partitions and index partitions in a special way then we'll need to
add a condition to not set InvalidOid if the relkind is one of those.
That feels a bit dirty, but if the above two cases were also deemed
wrong then we wouldn't need the special case.

Another option would be instead of adding a new bool flag, just pass
InvalidOid for the tablespace to heap_create() when TABLESPACE was not
specified then have it lookup GetDefaultTablespace() but keep
pg_class.reltablespace set to InvalidOId. Neither of these would be a
back-patchable fix for index partitions in PG11. Not sure what to do
about that...

Making constraints follow the tablespace specified during CREATE TABLE
would require a bit more work.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: pg_dump is broken for partition tablespaces

От
Alvaro Herrera
Дата:
On 2019-Apr-09, Alvaro Herrera wrote:

> There is one deficiency that needs to be solved in order for this to
> work fully: currently there is no way to reset "reltablespace" to 0.

Actually, there is one -- just need to
  ALTER TABLE .. SET TABLESPACE <the database default tablespace>
this took me a bit by surprise, but actually it's quite expected.

So I think that apart from David's patch, we should just document all
these things carefully.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pg_dump is broken for partition tablespaces

От
Andres Freund
Дата:
Hi,

On 2019-04-10 09:28:21 -0400, Alvaro Herrera wrote:
> So I think that apart from David's patch, we should just document all
> these things carefully.

Yea, I think that's the most important part.

I'm not convinced that we should have any inheriting behaviour btw - it
seems like there's a lot of different ways to think this should behave,
with different good reason each.

Greetings,

Andres Freund



Re: pg_dump is broken for partition tablespaces

От
Alvaro Herrera
Дата:
On 2019-Apr-10, Andres Freund wrote:

> Hi,
> 
> On 2019-04-10 09:28:21 -0400, Alvaro Herrera wrote:
> > So I think that apart from David's patch, we should just document all
> > these things carefully.
> 
> Yea, I think that's the most important part.
> 
> I'm not convinced that we should have any inheriting behaviour btw - it
> seems like there's a lot of different ways to think this should behave,
> with different good reason each.

So, I ended up with the attached patch.  I think it works pretty well,
and it passes all my POLA tests.

But it doesn't pass pg_upgrade tests!  And investigating closer, it
seems closely related to what David was complaining elsewhere about the
tablespace being improperly set by some rewrite operations.  Here's the
setup as created by regress' create_table.sql:

create table at_partitioned (a int, b text) partition by range (a);
create table at_part_1 partition of at_partitioned for values from (0) to (1000);
insert into at_partitioned values (512, '0.123');
create table at_part_2 (b text, a int);
insert into at_part_2 values ('1.234', 1024);
create index on at_partitioned (b);
create index on at_partitioned (a);

If you examine state at this point, it's all good:
alvherre=# select relname, reltablespace from pg_class where relname like 'at_partitioned%';
       relname        | reltablespace 
----------------------+---------------
 at_partitioned       |             0
 at_partitioned_a_idx |             0
 at_partitioned_b_idx |             0

but the test immediately does this:

alter table at_partitioned alter column b type numeric using b::numeric;

and watch what happens!  (1663 is pg_default)

alvherre=# select relname, reltablespace from pg_class where relname like 'at_partitioned%';
       relname        | reltablespace 
----------------------+---------------
 at_partitioned       |             0
 at_partitioned_a_idx |             0
 at_partitioned_b_idx |          1663
(3 filas)

Outrageous!

I'm going to have a look at this behavior now.  IMO it's a separate bug,
but with that obviously we cannot fix the other one.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pg_dump is broken for partition tablespaces

От
Alvaro Herrera
Дата:
On 2019-Apr-10, Alvaro Herrera wrote:

> but the test immediately does this:
> 
> alter table at_partitioned alter column b type numeric using b::numeric;
> 
> and watch what happens!  (1663 is pg_default)
> 
> alvherre=# select relname, reltablespace from pg_class where relname like 'at_partitioned%';
>        relname        | reltablespace 
> ----------------------+---------------
>  at_partitioned       |             0
>  at_partitioned_a_idx |             0
>  at_partitioned_b_idx |          1663
> (3 filas)
> 
> Outrageous!

This is because ruleutils.c attaches a TABLESPACE clause when asked to
dump an index definition; and tablecmds.c uses ruleutils to deparse the
index definition into something that can be replayed via CREATE INDEX
commands (or ALTER TABLE ADD CONSTRAINT UNIQUE/PRIMARY KEY, if that's
the case.)

This patch (PoC quality) fixes that behavior, but I'm looking to see
what else it breaks.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

Re: pg_dump is broken for partition tablespaces

От
Alvaro Herrera
Дата:
On 2019-Apr-10, Alvaro Herrera wrote:

> This is because ruleutils.c attaches a TABLESPACE clause when asked to
> dump an index definition; and tablecmds.c uses ruleutils to deparse the
> index definition into something that can be replayed via CREATE INDEX
> commands (or ALTER TABLE ADD CONSTRAINT UNIQUE/PRIMARY KEY, if that's
> the case.)

Found a "solution" to this -- namely, to set the GUC default_tablespace
to the empty string temporarily, and teach ruleutils.c to attach
TABLESPACE clauses on index/constraint definitions only if they
are not in the database tablespace.  That makes everything works
correctly.  (I did have to patch psql to show tablespace for partitioned
indexes.)

However, because the tablespace to use for an index is determined at
phase 3 execution time (i.e. inside DefineIndex), look what happens in
certain weird cases:

create tablespace foo location '/tmp/foo';
set default_tablespace to foo;
alter table t add unique (b) ;
create index on t (a);

at this point, the indexes for "a" and "b" is in tablespace foo, which
is correct because that's the default tablespace.

However, if we do a type change *and add an index in the same command*,
then that index ends up in the wrong tablespace (namely the database
tablespace instead of default_tablespace):

alter table t alter a type bigint, add unique (c);

I'm not seeing any good way to fix this; I need the default tablespace
reset to only affect the index creations caused by the rewrite, but not
the ones used by different commands.  I suppose forbidding ADD
CONSTRAINT subcommands together with ALTER COLUMN SET DATA TYPE would
not fly very far.  (I'm not sure that's complete either: if you change
datatype so that a toast table is created, perhaps this action will
affect the location of said new toast table, also.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pg_dump is broken for partition tablespaces

От
Alvaro Herrera
Дата:
On 2019-Mar-06, David Rowley wrote:

> Over on [1] Andres pointed out that the pg_dump support for the new to
> PG12 tablespace inheritance feature is broken.  This is the feature
> added in ca4103025dfe26 to allow a partitioned table to have a
> tablespace that acts as the default tablespace for newly attached
> partitions. The idea being that you can periodically change the
> default tablespace for new partitions to a tablespace that sits on a
> disk partition with more free space without affecting the default
> tablespace for normal non-partitioned tables. Anyway...
> 
> pg_dump is broken with this.

Here's a patch to fix the reported problems.  It's somewhat invasive,
and I've spent a long time staring at it, so I very much appreciate eyes
on it.

Guiding principles:

* Partitioned tables can have the database tablespace in their
  reltablespace column, in contrast with non-partitioned relations.
  This is okay and doesn't cause permission checks, since nothing
  is put in the tablespace just yet.

* When creating a partition, the parent's tablespace takes precedence
  over default_tablespace.  This sounds a bit weird at first, but I
  think it's correct.  If you really want your partition to override the
  parent's tablespace specification, you need to add a tablespace
  clause.  (I was initially opposed to this, but on further reflection
  I think it's the right thing to do.)

With these things in mind, I have introduced some behavior changes.  In
particular, both bugs mentioned in this thread have been fixed.

* pg_dump now correctly reproduces the state, still without using any
  TABLESPACE clauses.  (I suppose this is important for portability of
  dumps, as well as the --no-tablespaces option).

* When a partitioned table is created specifying the database tablespace
  in the TABLESPACE clause, and later a partition is created under
  a different non-empty default_tablespace setting, the partition honors
  the parent's tablespace.  (Fixing this bug became one of the
  principles here.)

* When an index is rewritten because of an ALTER TABLE, it no longer
  moves magically to another tablespace.  This worked fine for
  non-partitioned indexes (commit bd673e8e864a fixed it), but it was
  broken for partitioned ones.

* pg_dump was patched to emit straight CREATE TABLE plus ALTER TABLE
  ATTACH for partitions (just like the binary upgrade code does) instead
  of CREATE TABLE PARTITION OF.  This is what lets the partition use a
  straight default_tablespace setting instead of having to conditionally
  attach TABLESPACE clauses, which would create quite a mess.

Making this work required somewhat unusual hacks:

* Nodes IndexStmt and Constraint have gained a new member
  "reset_default_tblspc".  This is not set in the normal code paths, but
  when ALTER TABLE wants to recreate an index, it sets this flag; the
  flag tells index creation to reset default_tablespace to empty.  This
  is only necessary because ALTER TABLE execution resolves (at execution
  time) the tablespace of the artifically-generated SQL command to
  recreate the index.  If default_tablespace is left there, it
  interferes with that.

* GetDefaultTablespace now behaves differently for partitioned tables,
  precisely because we want it not to return InvalidOid when the
  tablespace is specifically selected.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

Re: pg_dump is broken for partition tablespaces

От
David Rowley
Дата:
On Sat, 13 Apr 2019 at 11:36, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Here's a patch to fix the reported problems.  It's somewhat invasive,
> and I've spent a long time staring at it, so I very much appreciate eyes
> on it.

I think it's a bit strange that don't store the pg_default's oid in
reltablespace for objects other than partitioned tables and indexes.
For documents [1] say:

"When default_tablespace is set to anything but an empty string, it
supplies an implicit TABLESPACE clause for CREATE TABLE and CREATE
INDEX commands that do not have an explicit one."

I'd say the fact that we populate reltablespace with 0 is a bug as
it's not going to do what they want after a dump/restore.

If that's ok to change then maybe the attached is an okay fix. Rather
nicely it gets rid of the code that's commented with "Yes, this is a
bit of a hack." and also changes the contract with heap_create() so
that we just pass InvalidOid to mean use MyDatabaseTableSpace.  I've
not really documented that in the patch yet.  It also does not need
the pg_dump change to have it use ATTACH PARTITION instead of
PARTITION OF, although perhaps that's an okay change to make
regardless of this bug.

On Wed, 10 Apr 2019 at 10:58, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> There is one deficiency that needs to be solved in order for this to
> work fully: currently there is no way to reset "reltablespace" to 0.

Yeah, I noticed that too.  My patch makes that a consistent problem
with all object types that allow tablespaces. Perhaps we can allow
ALTER ... <name> SET TABLESPACE DEFAULT; since "DEFAULT" is fully
reserved it can't be the name of a tablespace.

[1] https://www.postgresql.org/docs/devel/manage-ag-tablespaces.html

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

Re: pg_dump is broken for partition tablespaces

От
Tom Lane
Дата:
David Rowley <david.rowley@2ndquadrant.com> writes:
> I'd say the fact that we populate reltablespace with 0 is a bug as
> it's not going to do what they want after a dump/restore.

Well, it's not really nice perhaps, but you cannot just put in some
other concrete tablespace OID instead.  What a zero there means is
"use the database's default tablespace", and the point of it is that
it still means that after the DB has been cloned with a different
default tablespace.  If we don't store 0 then we break
"CREATE DATABASE ... TABLESPACE = foo".

You could imagine using some special tablespace OID that has these
semantics (*not* pg_default, but some new row in pg_tablespace).
I'm not sure that that'd provide any functional improvement over
using zero, but we could certainly entertain such a change if
partitioned tables seem to need it.

            regards, tom lane



Re: pg_dump is broken for partition tablespaces

От
David Rowley
Дата:
On Mon, 15 Apr 2019 at 02:16, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <david.rowley@2ndquadrant.com> writes:
> > I'd say the fact that we populate reltablespace with 0 is a bug as
> > it's not going to do what they want after a dump/restore.
>
> Well, it's not really nice perhaps, but you cannot just put in some
> other concrete tablespace OID instead.  What a zero there means is
> "use the database's default tablespace", and the point of it is that
> it still means that after the DB has been cloned with a different
> default tablespace.  If we don't store 0 then we break
> "CREATE DATABASE ... TABLESPACE = foo".
>
> You could imagine using some special tablespace OID that has these
> semantics (*not* pg_default, but some new row in pg_tablespace).
> I'm not sure that that'd provide any functional improvement over
> using zero, but we could certainly entertain such a change if
> partitioned tables seem to need it.

The patch only changes that behaviour when the user does something like:

set default_tablespace = 'pg_default';
create table ... (...);

or:

create table ... (...) tablespace pg_default;

The 0 value is still maintained when the tablespace is not specified
or default_tablespace is an empty string.

The CREATE TABLE docs mention:

"The tablespace_name is the name of the tablespace in which the new
table is to be created. If not specified, default_tablespace is
consulted, or temp_tablespaces if the table is temporary. For
partitioned tables, since no storage is required for the table itself,
the tablespace specified here only serves to mark the default
tablespace for any newly created partitions when no other tablespace
is explicitly specified."

and the default_tablespace docs say:

"When default_tablespace is set to anything but an empty string, it
supplies an implicit TABLESPACE clause for CREATE TABLE and CREATE
INDEX commands that do not have an explicit one."

so the change just seems to be altering the code to follow the documents.

Alvaro is proposing to change this behaviour for partitioned tables
and indexes. I'm proposing not having that special case and just
changing it.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: pg_dump is broken for partition tablespaces

От
Tom Lane
Дата:
David Rowley <david.rowley@2ndquadrant.com> writes:
> On Mon, 15 Apr 2019 at 02:16, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Well, it's not really nice perhaps, but you cannot just put in some
>> other concrete tablespace OID instead.  What a zero there means is
>> "use the database's default tablespace", and the point of it is that
>> it still means that after the DB has been cloned with a different
>> default tablespace.  If we don't store 0 then we break
>> "CREATE DATABASE ... TABLESPACE = foo".

> [ quotes documents ]

I think those documentation statements are probably wrong in detail,
or at least you're misreading them if you think they are justification
for this patch.  *This change will break CREATE DATABASE*.

(And, apparently, the comment you tried to remove isn't sufficiently
clear about that.)

> Alvaro is proposing to change this behaviour for partitioned tables
> and indexes. I'm proposing not having that special case and just
> changing it.

It's possible that Alvaro's patch is also broken, but I haven't had time
to review it.  The immediate question is what happens when somebody makes
a partitioned table in template1 and then does CREATE DATABASE with a
tablespace option.  Does the partitioned table end up in the same
tablespace as ordinary tables do?

It's entirely possible BTW that this whole business of inheriting
tablespace from the partitioned table is broken and should be thrown
out.  I certainly don't see any compelling reason for partitions to
act differently from regular tables in this respect, and the more
problems we find with the idea, the less attractive it seems.

            regards, tom lane



Re: pg_dump is broken for partition tablespaces

От
Andres Freund
Дата:
Hi,

On 2019-04-14 10:38:05 -0400, Tom Lane wrote:
> It's entirely possible BTW that this whole business of inheriting
> tablespace from the partitioned table is broken and should be thrown
> out.  I certainly don't see any compelling reason for partitions to
> act differently from regular tables in this respect, and the more
> problems we find with the idea, the less attractive it seems.

Indeed. After discovering during the tableam work, and trying to write
tests for the equivalent feature for tableam, I decided that just not
allowing AM specifications for partitioned tables is the right call - at
least until the desired behaviour is clearer. The discussion of the last
few days makes me think so even more.

Greetings,

Andres Freund



Re: pg_dump is broken for partition tablespaces

От
Alvaro Herrera
Дата:
On 2019-Apr-14, Andres Freund wrote:

> On 2019-04-14 10:38:05 -0400, Tom Lane wrote:
> > It's entirely possible BTW that this whole business of inheriting
> > tablespace from the partitioned table is broken and should be thrown
> > out.  I certainly don't see any compelling reason for partitions to
> > act differently from regular tables in this respect, and the more
> > problems we find with the idea, the less attractive it seems.
> 
> Indeed. After discovering during the tableam work, and trying to write
> tests for the equivalent feature for tableam, I decided that just not
> allowing AM specifications for partitioned tables is the right call - at
> least until the desired behaviour is clearer. The discussion of the last
> few days makes me think so even more.

To be honest, when doing that feature I neglected to pay attention to
(read: forgot about) default_tablespace, and it's mostly the
interactions with that feature that makes this partitioned table stuff
so complicated.  I'm not 100% convinced yet that we need to throw it out
completely, but I'm less sure now about it than I was before.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pg_dump is broken for partition tablespaces

От
David Rowley
Дата:
On Mon, 15 Apr 2019 at 05:32, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> On 2019-Apr-14, Andres Freund wrote:
>
> > On 2019-04-14 10:38:05 -0400, Tom Lane wrote:
> > > It's entirely possible BTW that this whole business of inheriting
> > > tablespace from the partitioned table is broken and should be thrown
> > > out.  I certainly don't see any compelling reason for partitions to
> > > act differently from regular tables in this respect, and the more
> > > problems we find with the idea, the less attractive it seems.
> >
> > Indeed. After discovering during the tableam work, and trying to write
> > tests for the equivalent feature for tableam, I decided that just not
> > allowing AM specifications for partitioned tables is the right call - at
> > least until the desired behaviour is clearer. The discussion of the last
> > few days makes me think so even more.
>
> To be honest, when doing that feature I neglected to pay attention to
> (read: forgot about) default_tablespace, and it's mostly the
> interactions with that feature that makes this partitioned table stuff
> so complicated.  I'm not 100% convinced yet that we need to throw it out
> completely, but I'm less sure now about it than I was before.

FWIW, I was trying to hint in [1] that this all might be more trouble
than its worth.

To be honest, if I'd done a better job of thinking through the
implications of this tablespace inheritance in ca4103025d, then I'd
probably have not bothered submitting a patch for it.  We could easily
revert that, but we'd still be left with the same behaviour in
partitioned indexes, which is in PG11.

[1] https://www.postgresql.org/message-id/CAKJS1f-52x3o16fsd4=tBPKct9_E0uEg0LmzOgxBqLiuZsj-SA@mail.gmail.com

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: pg_dump is broken for partition tablespaces

От
Alvaro Herrera
Дата:
On 2019-Apr-15, David Rowley wrote:

> To be honest, if I'd done a better job of thinking through the
> implications of this tablespace inheritance in ca4103025d, then I'd
> probably have not bothered submitting a patch for it.  We could easily
> revert that, but we'd still be left with the same behaviour in
> partitioned indexes, which is in PG11.

Well, I suppose if we do decide to revert it for tables, we should do it
for both tables and indexes.  But as I said, I'm not yet convinced that
this is the best way forward.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pg_dump is broken for partition tablespaces

От
David Rowley
Дата:
On Mon, 15 Apr 2019 at 15:26, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> On 2019-Apr-15, David Rowley wrote:
>
> > To be honest, if I'd done a better job of thinking through the
> > implications of this tablespace inheritance in ca4103025d, then I'd
> > probably have not bothered submitting a patch for it.  We could easily
> > revert that, but we'd still be left with the same behaviour in
> > partitioned indexes, which is in PG11.
>
> Well, I suppose if we do decide to revert it for tables, we should do it
> for both tables and indexes.  But as I said, I'm not yet convinced that
> this is the best way forward.

Ok.  Any ideas or suggestions on how we move on from here?  It seems
like a bit of a stalemate.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: pg_dump is broken for partition tablespaces

От
Alvaro Herrera
Дата:
On 2019-Apr-17, David Rowley wrote:

> On Mon, 15 Apr 2019 at 15:26, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> >
> > On 2019-Apr-15, David Rowley wrote:
> >
> > > To be honest, if I'd done a better job of thinking through the
> > > implications of this tablespace inheritance in ca4103025d, then I'd
> > > probably have not bothered submitting a patch for it.  We could easily
> > > revert that, but we'd still be left with the same behaviour in
> > > partitioned indexes, which is in PG11.
> >
> > Well, I suppose if we do decide to revert it for tables, we should do it
> > for both tables and indexes.  But as I said, I'm not yet convinced that
> > this is the best way forward.
> 
> Ok.  Any ideas or suggestions on how we move on from here?  It seems
> like a bit of a stalemate.

Well, here's my proposed patch.  I'm now fairly happy with how this
looks now, concerning partitioned tables.

This is mostly what was already discussed:

1. pg_dump now uses regular CREATE TABLE followed by ALTER TABLE / ATTACH
   PARTITION when creating partitions, rather than CREATE TABLE
   PARTITION OF.  pg_dump --binary-upgrade was already doing that, so
   this part mostly removes some code.  In order to make the partitions
   reach the correct tablespace, the "default_tablespace" GUC is used.
   No TABLESPACE clause is added to the dump.  This is David's patch
   near the start of the thread.

2. When creating a partition using the CREATE TABLE PARTITION OF syntax,
   the TABLESPACE clause has highest precedence; if that is not given,
   the partitioned table's tablespace is used; if that is set to 0 (the
   default), default_tablespace is used; if that's set to empty or a
   nonexistant tablespace, the database's default tablespace is used.
   This is (I think) what Andres proposed in
   https://postgr.es/m/20190306223741.lolaaimhkkp4kict@alap3.anarazel.de

3. Partitioned relations can have the database tablespace in
   pg_class.reltablespace, as opposed to storage-bearing relations which
   cannot.  This is useful to be able to put partitions in the database
   tablespace even if the default_tablespace is set to something else.

4. For partitioned tables, ALTER TABLE .. SET TABLESPACE DEFAULT is
   available as suggested by David, which makes future partition
   creations target default_tablespace or the database's tablespace.  

5. Recreating indexes during table-rewriting ALTER TABLE resulted in
   broken indexes.  We already had some adhesive tape in place to make
   that work for regular indexes (commit bd673e8e864a); my approach to
   fix it for partitioned indexes is to temporarily reset
   default_tablespace to empty.


As for Tom's question in https://postgr.es/m/12678.1555252685@sss.pgh.pa.us :

> It's possible that Alvaro's patch is also broken, but I haven't had time
> to review it.  The immediate question is what happens when somebody makes
> a partitioned table in template1 and then does CREATE DATABASE with a
> tablespace option.  Does the partitioned table end up in the same
> tablespace as ordinary tables do?

Note that partitioned don't have any files, so they don't end up
anywhere; when a partition is created, the target tablespace is
determined using four rules instead of three (see #2 above) so yes, they
do end up in the same places as ordinary tables.  Note that even if you
do put a partitioned table in some tablespace, you will not later run
afoul of this check:
                ereport(ERROR,
                        (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                         errmsg("cannot assign new default tablespace \"%s\"",
                                tablespacename),
                         errdetail("There is a conflict because database \"%s\" already has some tables in this
tablespace.",
                                   dbtemplate)));
src/backend/commands/dbcommands.c:435

because that check uses ReadDir() and raise an error if any entry is
found; but partitioned tables don't have files in directory, so nothing
happens.  (Of course, it will hit if you have a partition in that
tablespace, but that's also the case for regular tables.)

(I propose to commit both 0002 and 0003 as a single unit that fixes the
whole problem, rather than attacking backend and pg_dump separately.
0001 appears logically separate and I would push on its own.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

Re: pg_dump is broken for partition tablespaces

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> 1. pg_dump now uses regular CREATE TABLE followed by ALTER TABLE / ATTACH
>    PARTITION when creating partitions, rather than CREATE TABLE
>    PARTITION OF.  pg_dump --binary-upgrade was already doing that, so
>    this part mostly removes some code.  In order to make the partitions
>    reach the correct tablespace, the "default_tablespace" GUC is used.
>    No TABLESPACE clause is added to the dump.  This is David's patch
>    near the start of the thread.

This idea seems reasonable independently of all else, simply on the grounds
of reducing code duplication.  It also has the advantage that if you try
to do a selective restore of just a partition, and the parent partitioned
table isn't around, you can still do it (with an ignorable error).

> 2. When creating a partition using the CREATE TABLE PARTITION OF syntax,
>    the TABLESPACE clause has highest precedence; if that is not given,
>    the partitioned table's tablespace is used; if that is set to 0 (the
>    default), default_tablespace is used; if that's set to empty or a
>    nonexistant tablespace, the database's default tablespace is used.
>    This is (I think) what Andres proposed in
>    https://postgr.es/m/20190306223741.lolaaimhkkp4kict@alap3.anarazel.de

Hmm.  The precedence order between the second and third options seems
pretty arbitrary and hence unrememberable.  I don't say this choice is
wrong, but it's not clear that it's right, either.

> 3. Partitioned relations can have the database tablespace in
>    pg_class.reltablespace, as opposed to storage-bearing relations which
>    cannot.  This is useful to be able to put partitions in the database
>    tablespace even if the default_tablespace is set to something else.

I still feel that this is a darn bad idea.  It goes against the rule
that's existed for pg_class.reltablespace since its beginning, and
I think it's inevitable that that's going to break something.

            regards, tom lane



Re: pg_dump is broken for partition tablespaces

От
Alvaro Herrera
Дата:
On 2019-Apr-17, Tom Lane wrote:

> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > 1. pg_dump now uses regular CREATE TABLE followed by ALTER TABLE / ATTACH
> >    PARTITION when creating partitions, rather than CREATE TABLE
> >    PARTITION OF.  pg_dump --binary-upgrade was already doing that, so
> >    this part mostly removes some code.  In order to make the partitions
> >    reach the correct tablespace, the "default_tablespace" GUC is used.
> >    No TABLESPACE clause is added to the dump.  This is David's patch
> >    near the start of the thread.
> 
> This idea seems reasonable independently of all else, simply on the grounds
> of reducing code duplication.  It also has the advantage that if you try
> to do a selective restore of just a partition, and the parent partitioned
> table isn't around, you can still do it (with an ignorable error).

I'll get this part pushed, then.

> > 2. When creating a partition using the CREATE TABLE PARTITION OF syntax,
> >    the TABLESPACE clause has highest precedence; if that is not given,
> >    the partitioned table's tablespace is used; if that is set to 0 (the
> >    default), default_tablespace is used; if that's set to empty or a
> >    nonexistant tablespace, the database's default tablespace is used.
> >    This is (I think) what Andres proposed in
> >    https://postgr.es/m/20190306223741.lolaaimhkkp4kict@alap3.anarazel.de
> 
> Hmm.  The precedence order between the second and third options seems
> pretty arbitrary and hence unrememberable.  I don't say this choice is
> wrong, but it's not clear that it's right, either.

Well, I see it as the default_tablespace being a global setting whereas
the parent is "closer" to the partition definition, which is why it has
higher priority.  I don't have a strong opinion however (and I think the
patch would be shorter if default_tablespace had higher precedence.)

Maybe others care to comment?

> > 3. Partitioned relations can have the database tablespace in
> >    pg_class.reltablespace, as opposed to storage-bearing relations which
> >    cannot.  This is useful to be able to put partitions in the database
> >    tablespace even if the default_tablespace is set to something else.
> 
> I still feel that this is a darn bad idea.  It goes against the rule
> that's existed for pg_class.reltablespace since its beginning, and
> I think it's inevitable that that's going to break something.

Yes, this deviates from current practice, and while I tested this in as
many ways as I could think of, I cannot deny that it might break
something unexpectedly.


Here's a v4 btw, which is just some adjustments to the regress test
script and expected file.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

Re: pg_dump is broken for partition tablespaces

От
Alvaro Herrera
Дата:
On 2019-Apr-17, Alvaro Herrera wrote:

> On 2019-Apr-17, Tom Lane wrote:
> 
> > Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > > 1. pg_dump now uses regular CREATE TABLE followed by ALTER TABLE / ATTACH
> > >    PARTITION when creating partitions, rather than CREATE TABLE
> > >    PARTITION OF.  pg_dump --binary-upgrade was already doing that, so
> > >    this part mostly removes some code.  In order to make the partitions
> > >    reach the correct tablespace, the "default_tablespace" GUC is used.
> > >    No TABLESPACE clause is added to the dump.  This is David's patch
> > >    near the start of the thread.
> > 
> > This idea seems reasonable independently of all else, simply on the grounds
> > of reducing code duplication.  It also has the advantage that if you try
> > to do a selective restore of just a partition, and the parent partitioned
> > table isn't around, you can still do it (with an ignorable error).
> 
> I'll get this part pushed, then.

After looking at it again, I found that there's no significant
duplication reduction -- the patch simply duplicates one block in a
different location, putting half of the original code in each. And if we
reject the idea of separating tablespaces, there's no reason to do
things that way.  So ISTM if we don't want the tablespace thing, we
should not apply this part.  FWIW, we got quite a few positive votes for
handling tablespaces this way for partitioned tables [1] [2], so I
resist the idea that we have to revert the initial commit, as some seem
to be proposing.


After re-reading the thread one more time, I found one more pretty
reasonable point that Andres was complaining about, and I made things
work the way he described.  Namely, if you do this:

SET default_tablespace TO 'foo';
CREATE TABLE part (a int) PARTITION BY LIST (a);
SET default_tablespace TO 'bar';
CREATE TABLE part1 PARTITION OF part FOR VALUES IN (1);

then the partition must end up in tablespace bar, not in tablespace foo:
the reason is that the default_tablespace is not "strong enough" to
stick with the partitioned table.  The partition would only end up in
tablespace foo in this case:

CREATE TABLE part (a int) PARTITION BY LIST (a) TABLESPACE foo;
CREATE TABLE part1 PARTITION OF part FOR VALUES IN (1);

i.e. when the tablespace is explicitly indicated in the CREATE TABLE
command for the partitioned table.  Of course, you can still add a
TABLESPACE clause to the partition to override it (and you can change
the parent table's tablespace later.)

So here's a proposed v5.

I would appreciate others' eyes on this patch.

[1] https://postgr.es/m/CAKJS1f9SxVzqDrGD1teosFd6jBMM0UEaa14_8mRvcWE19Tu0hA@mail.gmail.com
[2]
https://www.postgresql.org/message-id/flat/CAKJS1f9PXYcT%2Bj%3DoyL-Lquz%3DScNwpRtmD7u9svLASUygBdbN8w%40mail.gmail.com

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

Re: pg_dump is broken for partition tablespaces

От
Robert Haas
Дата:
On Wed, Apr 17, 2019 at 6:06 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > > 3. Partitioned relations can have the database tablespace in
> > >    pg_class.reltablespace, as opposed to storage-bearing relations which
> > >    cannot.  This is useful to be able to put partitions in the database
> > >    tablespace even if the default_tablespace is set to something else.
> >
> > I still feel that this is a darn bad idea.  It goes against the rule
> > that's existed for pg_class.reltablespace since its beginning, and
> > I think it's inevitable that that's going to break something.
>
> Yes, this deviates from current practice, and while I tested this in as
> many ways as I could think of, I cannot deny that it might break
> something unexpectedly.

Like Tom, I think this has got to be broken.

Suppose that you have a partitioned table which has reltablespace =
dattablespace.  It has a bunch of children with reltablespace = 0.
So far so good: new children of the partitioned table go into the
database tablespace regardless of default_tablespace.

Now somebody changes the default tablespace using ALTER DATABASE ..
SET TABLESPACE.  All the existing children end up in the new default
tablespace, but new children of the partitioned table end up going
into the tablespace that used to be the default but is no longer.
That's pretty odd, because the whole point of setting a tablespace on
the children was to get all of the children into the same tablespace.

The same thing happens if you clone the database using CREATE DATABASE
.. TEMPLATE .. TABLESPACE, as Tom mentioned.

PostgreSQL has historically and very deliberately *not made a
distinction* between "this object is in the default tablespace" and
"this object is in tablespace X which happens to be the default."  I
think that it's too late to invent such a distinction for reasons of
backward compatibility -- and if we were going to do it, surely it
would need to exist for both partitioned tables and the partitions
themselves. Otherwise it just produces more strange inconsistencies.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pg_dump is broken for partition tablespaces

От
Alvaro Herrera
Дата:
On 2019-Apr-22, Robert Haas wrote:

> PostgreSQL has historically and very deliberately *not made a
> distinction* between "this object is in the default tablespace" and
> "this object is in tablespace X which happens to be the default."  I
> think that it's too late to invent such a distinction for reasons of
> backward compatibility -- and if we were going to do it, surely it
> would need to exist for both partitioned tables and the partitions
> themselves. Otherwise it just produces more strange inconsistencies.

Yeah, this is probably right.  (I don't think it's the same thing that
Tom was saying, though, or at least I didn't understand his argument
this way.)

I think we can get out of this whole class of problems by forbidding the
TABLESPACE clause for partitioned rels from mentioning the database
tablespace -- that is, users either mention some *other* tablespace, or
partitions follow default_tablespace like everybody else.  AFAICS with
that restriction this whole problem does not arise, and the patch may
become simpler.  I'll give it a spin.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pg_dump is broken for partition tablespaces

От
Andres Freund
Дата:
Hi,

On 2019-04-22 14:16:28 -0400, Alvaro Herrera wrote:
> On 2019-Apr-22, Robert Haas wrote:
> 
> > PostgreSQL has historically and very deliberately *not made a
> > distinction* between "this object is in the default tablespace" and
> > "this object is in tablespace X which happens to be the default."  I
> > think that it's too late to invent such a distinction for reasons of
> > backward compatibility -- and if we were going to do it, surely it
> > would need to exist for both partitioned tables and the partitions
> > themselves. Otherwise it just produces more strange inconsistencies.
> 
> Yeah, this is probably right.  (I don't think it's the same thing that
> Tom was saying, though, or at least I didn't understand his argument
> this way.)
> 
> I think we can get out of this whole class of problems by forbidding the
> TABLESPACE clause for partitioned rels from mentioning the database
> tablespace -- that is, users either mention some *other* tablespace, or
> partitions follow default_tablespace like everybody else.  AFAICS with
> that restriction this whole problem does not arise, and the patch may
> become simpler.  I'll give it a spin.

Why is the obvious answer is to not just remove the whole tablespace
inheritance behaviour? It's obviously ambiguous and hard to get right.
I still don't see any usecase that even comes close to making the
inheritance useful enough to justify the amount of work (code, tests,
bugfixes) and docs that are required.

Greetings,

Andres Freund



Re: pg_dump is broken for partition tablespaces

От
Tom Lane
Дата:
Andres Freund <andres@anarazel.de> writes:
> On 2019-04-22 14:16:28 -0400, Alvaro Herrera wrote:
>> I think we can get out of this whole class of problems by forbidding the
>> TABLESPACE clause for partitioned rels from mentioning the database
>> tablespace -- that is, users either mention some *other* tablespace, or
>> partitions follow default_tablespace like everybody else.  AFAICS with
>> that restriction this whole problem does not arise, and the patch may
>> become simpler.  I'll give it a spin.

> Why is the obvious answer is to not just remove the whole tablespace
> inheritance behaviour? It's obviously ambiguous and hard to get right.
> I still don't see any usecase that even comes close to making the
> inheritance useful enough to justify the amount of work (code, tests,
> bugfixes) and docs that are required.

Yeah, that's where I'm at as well.  Alvaro's proposal could be made
to work perhaps, but I think it would still end up with some odd
corner-case behaviors.  One example is that "TABLESPACE X" would
be allowed if the database's default tablespace is Y, but if you
try to dump and restore into a database whose default is X, it'd be
rejected (?).  The results after ALTER DATABASE ... SET TABLESPACE X
are unclear too.

            regards, tom lane



Re: pg_dump is broken for partition tablespaces

От
Alvaro Herrera
Дата:
On 2019-Apr-22, Andres Freund wrote:

> Why is the obvious answer is to not just remove the whole tablespace
> inheritance behaviour?

Because it was requested by many, and there were plenty of people
surprised that things didn't work that way.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pg_dump is broken for partition tablespaces

От
Alvaro Herrera
Дата:
On 2019-Apr-22, Tom Lane wrote:

> Yeah, that's where I'm at as well.  Alvaro's proposal could be made
> to work perhaps, but I think it would still end up with some odd
> corner-case behaviors.  One example is that "TABLESPACE X" would
> be allowed if the database's default tablespace is Y, but if you
> try to dump and restore into a database whose default is X, it'd be
> rejected (?).

Hmm, I don't think so, because dump uses default_tablespace on a plain
table instead of TABLESPACE clauses, and the table is attached
afterwards.

> The results after ALTER DATABASE ... SET TABLESPACE X
> are unclear too.

Currently we disallow SET TABLESPACE X if you have any table in that
tablespace, and we do that by searching for files.  A partitioned table
would not have a file that would cause it to fail, so this is something
to study.


(BTW I think these tablespace behaviors are not tested very much.  The
tests we have are intra-database operations only, and there's only a
single non-default tablespace.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pg_dump is broken for partition tablespaces

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2019-Apr-22, Andres Freund wrote:
>> Why is the obvious answer is to not just remove the whole tablespace
>> inheritance behaviour?

> Because it was requested by many, and there were plenty of people
> surprised that things didn't work that way.

There are lots of things in SQL that people find surprising.
In this particular case, "we can't do it because it conflicts with
ancient decisions about how PG tablespaces work" seems like a
defensible answer, even without getting into the question of
whether "partitions inherit their tablespace from the parent"
is really any less surprising than "partitions work exactly like
normal tables as far as tablespace selection goes".

            regards, tom lane



Re: pg_dump is broken for partition tablespaces

От
Robert Haas
Дата:
On Mon, Apr 22, 2019 at 3:08 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> On 2019-Apr-22, Andres Freund wrote:
> > Why is the obvious answer is to not just remove the whole tablespace
> > inheritance behaviour?
>
> Because it was requested by many, and there were plenty of people
> surprised that things didn't work that way.

On the other hand, that behavior worked correctly on its own terms,
and this behavior seems to be broken, and you've been through a whole
series of possible designs trying to figure out how to fix it, and
it's still not clear that you've got a working solution.  I don't know
whether that shows that it is impossible to make this idea work
sensibly, but at the very least it proves that the whole area needed a
lot more thought than it got before this code was committed (a
complaint that I also made at the time, if you recall).  "Surprising"
is not great, but it is clearly superior to "broken."

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pg_dump is broken for partition tablespaces

От
Alvaro Herrera
Дата:
On 2019-Apr-22, Robert Haas wrote:

> On Mon, Apr 22, 2019 at 3:08 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > On 2019-Apr-22, Andres Freund wrote:
> > > Why is the obvious answer is to not just remove the whole tablespace
> > > inheritance behaviour?
> >
> > Because it was requested by many, and there were plenty of people
> > surprised that things didn't work that way.
> 
> On the other hand, that behavior worked correctly on its own terms,
> and this behavior seems to be broken, and you've been through a whole
> series of possible designs trying to figure out how to fix it, and
> it's still not clear that you've got a working solution.

Well, frequently when people discuss ideas on this list, others discuss
and provide further ideas to try help to find a working solution, rather
than throw every roadblock they can think of (though roadblocks are
indeed thrown now and then).  If I've taken a long time to find a
working solution, maybe it's because I have no shoulders of giants to
stand on, and I'm a pretty short guy, so I need to build me a ladder.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pg_dump is broken for partition tablespaces

От
Robert Haas
Дата:
On Mon, Apr 22, 2019 at 4:43 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Well, frequently when people discuss ideas on this list, others discuss
> and provide further ideas to try help to find a working solution, rather
> than throw every roadblock they can think of (though roadblocks are
> indeed thrown now and then).  If I've taken a long time to find a
> working solution, maybe it's because I have no shoulders of giants to
> stand on, and I'm a pretty short guy, so I need to build me a ladder.

What exactly do you mean by throwing up roadblocks?  I don't have a
basket full of great ideas for how to solve this problem that I'm
failing to suggest out of some sort of perverse desire to see you
fail.  I'm not quite as convinced as Tom and Andres that this whole
idea is fatally flawed and can't ever be made to work correctly, but I
think it's quite possible that they are right, both because their
objections sound to me like they are target and because they are
pretty smart people.  But that's also because I haven't spent a lot of
time on this issue, which I think is pretty fair, because it seems
like it would be unfair to complain that I am not spending enough time
helping fix code that I advised against committing in the first place.
How much time should I spent giving you advice if my previous advice
was ignored?

But FWIW, it seems to me that a good place to start solving this
problem would be to think hard about what Andres said here:

http://postgr.es/m/20190306161744.22jdkg37fyi2zyke@alap3.anarazel.de

Specifically, this complaint: "I still think the feature as is doesn't
seem to have very well defined behaviour."

If we know what the feature is supposed to do, then it should be
possible to look at each relevant piece of code and decides whether it
implements the specification correctly or not.  But if we don't have a
specification -- that is, we don't know precisely what the feature is
supposed to do -- then we'll just end up whacking the behavior around
trying to get some combination that makes sense, and the whole effort
is probably doomed.

I think that the large quote block from David Rowley in the middle of
the above-linked email gets at the definitional problem pretty
clearly: the documentation seems to be intending to say -- although it
is not 100% clear -- that if TABLESPACE is specified it has effect on
all future children, and if not specified then those children get the
tablespace they would have gotten anyway.  But that behavior is
impossible to implement correctly unless there is a way to distinguish
between a partitioned table for which TABLESPACE was not specified and
where it was specified to be the same as the default tablespace for
the database.  And we know, per previous conversation, that the
catalog representation admits of no way to make that distinction.
Using reltablespace = dattablespace is clearly the wrong answer,
because that breaks stuff.  Tom earlier suggested, I believe, that
some fixed OID could be used, e.g. reltablespace = 1 means "tablespace
not specified" and reltablespace = 0 means dattablespace.  That should
be safe enough, since I don't think OID 1 can ever be the OID of a
real tablespace.  I think this is probably the only way forward if
this definition captures the desired behavior.

The other option is to decide that we want some other behavior.  In
that case, the way forward depends on what we want the behavior to be.
Your proposal upthread is to disallow the case where the user provides
an explicit TABLESPACE setting whose value matches the default
tablespace for the database.  But that definition seems to have an
obvious problem: just because that's not true when the partitioned
table is defined doesn't mean it won't become true later, because the
default tablespace for the database can be changed, or the database
can be copied and the copy be assigned a different default tablespace.
Even if there were no issue, that definition doesn't sound very clean,
because it means that reltablespace = 0 for a regular relation means
dattablespace and for a partitioned relation it means none.

Now that's not the only way we could go either, I suppose.  There must
be a variety of other possible behaviors.  But the one Tom and Andres
are proposing -- go back to the way it worked in v10 -- is definitely
not crazy.  You are right that a lot of people didn't like that, but
the definition was absolutely clear, because it was the exact same
definition we use for normal tables, with the straigthforward
extension that for relations with no storage it meant nothing.

Going back to the proposal of making OID = 0 mean TABLESPACE
dattablespace, OID = 1 meaning no TABLESPACE clause specified for this
partitioned table, and OID = whatever meaning that particular
non-default tablespace, I do see a couple of problems with that idea
too:

- I don't have any idea how to salvage things in v11, where the
catalog representation is irredeemably ambiguous.  We'd probably have
to be satisfied with fairly goofy behavior in v11.

- It's not enough to have a good catalog representation.  You also
have to be able to recreate those catalog states.  I think you
probably need a way to set the reltablespace to whatever value you
need it to have without changing the tables under it.  Like ALTER
TABLE ONLY blah {TABLESPACE some_tablespace_name | NO TABLESPACE} or
some such thing.  That exact idea may be wrong, but I think we are not
going to have much luck getting to a happy place without something of
this sort.

If we have a clear definition of what the feature does, a catalog
representation to match, and syntax that can recreate any given
catalog representation, then this should be an SMOP.

But again, rip it all out and try it again someday is not a crazy
proposal, IMHO.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pg_dump is broken for partition tablespaces

От
Alvaro Herrera
Дата:
On 2019-Mar-06, Tom Lane wrote:

> David Rowley <david.rowley@2ndquadrant.com> writes:
> > As far as I can see, the biggest fundamental difference with doing
> > things this way will be that the column order of partitions will be
> > preserved, where before it would inherit the order of the partitioned
> > table.  I'm a little unsure if doing this column reordering was an
> > intended side-effect or not.
> 
> Well, if the normal behavior results in changing the column order,
> it'd be necessary to do things differently in --binary-upgrade mode
> anyway, because there we *must* preserve column order.  I don't know
> if what you're describing represents a separate bug for pg_upgrade runs,
> but it might.  Is there any test case for the situation left behind by
> the core regression tests?

Now that I re-read this complaint once again, I wonder if a mismatching
column order in partitions isn't a thing we ought to preserve anyhow.
Robert, Amit -- is it by design that pg_dump loses the original column
order for partitions, when not in binary-upgrade mode?  To me, it sounds
unintuitive to accept partitions that don't exactly match the order of
the parent table; but it's been supported all along.  In the statu quo,
if users dump and restore such a database, the restored partition ends
up with the column order of the parent instead of its own column order
(by virtue of being created as CREATE TABLE PARTITION OF).  Isn't that
wrong?  It'll cause an INSERT/COPY direct to the partition that worked
prior to the restore to fail after the restore, if the column list isn't
specified.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pg_dump is broken for partition tablespaces

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Now that I re-read this complaint once again, I wonder if a mismatching
> column order in partitions isn't a thing we ought to preserve anyhow.
> Robert, Amit -- is it by design that pg_dump loses the original column
> order for partitions, when not in binary-upgrade mode?

I haven't looked at the partitioning code, but I am quite sure that that's
always happened for old-style inheritance children, and I imagine pg_dump
is just duplicating that old behavior.

Wasn't there already a patch on the table to change this, though,
by removing the code path that uses inheritance rather than the
binary-upgrade-like solution?

            regards, tom lane



Re: pg_dump is broken for partition tablespaces

От
Amit Langote
Дата:
On 2019/04/23 7:51, Alvaro Herrera wrote:
> On 2019-Mar-06, Tom Lane wrote:
>> David Rowley <david.rowley@2ndquadrant.com> writes:
>>> As far as I can see, the biggest fundamental difference with doing
>>> things this way will be that the column order of partitions will be
>>> preserved, where before it would inherit the order of the partitioned
>>> table.  I'm a little unsure if doing this column reordering was an
>>> intended side-effect or not.
>>
>> Well, if the normal behavior results in changing the column order,
>> it'd be necessary to do things differently in --binary-upgrade mode
>> anyway, because there we *must* preserve column order.  I don't know
>> if what you're describing represents a separate bug for pg_upgrade runs,
>> but it might.  Is there any test case for the situation left behind by
>> the core regression tests?
> 
> Now that I re-read this complaint once again, I wonder if a mismatching
> column order in partitions isn't a thing we ought to preserve anyhow.
> Robert, Amit -- is it by design that pg_dump loses the original column
> order for partitions, when not in binary-upgrade mode?

I do remember being too wary initially about letting partitions devolve
into a state of needing tuple conversion during DML execution, which very
well may have been a reason to write the pg_dump support code the way it
is now.  pg_dump chooses to emit partitions with the CREATE TABLE
PARTITION OF syntax because, as it seems has been correctly interpreted on
this thread, it allows partitions to end up with same TupleDesc as the
parent and hence not require tuple conversion in DML execution, unless of
course it's run with --binary-upgrade mode.

Needing tuple conversion is still an overhead but maybe there aren't that
many cases where TupleDescs differ among tables in partition trees, so the
considerations for emitting PARTITION OF syntax may not be all that
relevant.  Also, we've made DML involving partitions pretty efficient
these days by reducing most other overheads, even though nothing has been
done to prevent tuple conversion in the cases it is needed anyway.

> To me, it sounds
> unintuitive to accept partitions that don't exactly match the order of
> the parent table; but it's been supported all along.

You might know it already, but even though column sets of two tables may
appear identical, their TupleDescs still may not match due to dropped
columns being different in the two tables.

> In the statu quo,
> if users dump and restore such a database, the restored partition ends
> up with the column order of the parent instead of its own column order
> (by virtue of being created as CREATE TABLE PARTITION OF).  Isn't that
> wrong?  It'll cause an INSERT/COPY direct to the partition that worked
> prior to the restore to fail after the restore, if the column list isn't
> specified.

That's true, although there is a workaround as you mentioned -- specify
column names to match the input data.  pg_dump itself specifies them, so
the dumped output can be loaded unchanged.

Anyway, I don't see a problem with changing pg_dump to *always* emit
CREATE TABLE followed by ATTACH PARTITION, not just in --binary-upgrade
mode, if it lets us deal with the tablespace-related issues smoothly.

Thanks,
Amit




Re: pg_dump is broken for partition tablespaces

От
David Rowley
Дата:
On Tue, 23 Apr 2019 at 13:49, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
> On 2019/04/23 7:51, Alvaro Herrera wrote:
> > To me, it sounds
> > unintuitive to accept partitions that don't exactly match the order of
> > the parent table; but it's been supported all along.
>
> You might know it already, but even though column sets of two tables may
> appear identical, their TupleDescs still may not match due to dropped
> columns being different in the two tables.

I think that's the most likely reason that the TupleDescs would differ
at all.  For RANGE partitions on time series data, it's quite likely
that new partitions are periodically created to store new data.  If
the partitioned table those belong to evolved over time, gaining new
columns and dropping columns that are no longer needed then some
translation work will end up being required.  From my work on
42f70cd9c, I know tuple conversion is not free, so it's pretty good
that pg_dump will remove the need for maps in this case even with the
proposed change.

I imagine users randomly specifying columns for partitions in various
random orders is a less likely scenario, although, it's entirely
possible.

Tom's point about being able to pg_dump a single partition and restore
it somewhere without the parent (apart from an error in ATTACH
PARTITION) seems like it could be useful too, so I'd say that the
pg_dump change is a good one regardless.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: pg_dump is broken for partition tablespaces

От
Amit Langote
Дата:
On 2019/04/23 14:45, David Rowley wrote:
> On Tue, 23 Apr 2019 at 13:49, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>
>> On 2019/04/23 7:51, Alvaro Herrera wrote:
>>> To me, it sounds
>>> unintuitive to accept partitions that don't exactly match the order of
>>> the parent table; but it's been supported all along.
>>
>> You might know it already, but even though column sets of two tables may
>> appear identical, their TupleDescs still may not match due to dropped
>> columns being different in the two tables.
> 
> I think that's the most likely reason that the TupleDescs would differ
> at all.  For RANGE partitions on time series data, it's quite likely
> that new partitions are periodically created to store new data.  If
> the partitioned table those belong to evolved over time, gaining new
> columns and dropping columns that are no longer needed then some
> translation work will end up being required.  From my work on
> 42f70cd9c, I know tuple conversion is not free, so it's pretty good
> that pg_dump will remove the need for maps in this case even with the
> proposed change.

Maybe I'm missing something, but if you're talking about pg_dump changes
proposed in the latest patch that Alvaro posted on April 18, which is to
emit partitions as two steps, then I don't see how that will always
improves things in terms of whether maps are needed or not (regardless of
whether that's something to optimize for or not.)  If partitions needed a
map in the old database, this patch means that they will *continue* to
need it in the new database.  With HEAD, they won't, because partitions
created with CREATE TABLE PARTITION OF will have the same descriptor as
parent, provided the parent is also created afresh in the new database,
which is true in the non-binary-upgrade mode.  The current arrangement, as
I mentioned in my previous email, is partly inspired from the fact that
creating the parent and partition afresh in the new database will lead
them to have the same TupleDesc and hence won't need maps.

Thanks,
Amit




Re: pg_dump is broken for partition tablespaces

От
David Rowley
Дата:
On Tue, 23 Apr 2019 at 18:18, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
> If partitions needed a
> map in the old database, this patch means that they will *continue* to
> need it in the new database.

That's incorrect.  My point was about dropped columns being removed
after a dump / reload.  Only binary upgrade mode preserves
pg_attribute entries for dropped columns. Normal mode does not, so the
maps won't be needed after the reload if they were previously only
needed due to dropped columns.  This is the case both with and without
the pg_dump changes I proposed.  The case the patch does change is if
the columns were actually out of order, which I saw as an unlikely
thing to happen in the real world.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: pg_dump is broken for partition tablespaces

От
Alvaro Herrera
Дата:
Thanks for taking the time to think through this.

On 2019-Apr-22, Robert Haas wrote:

> If we know what the feature is supposed to do, then it should be
> possible to look at each relevant piece of code and decides whether it
> implements the specification correctly or not.  But if we don't have a
> specification -- that is, we don't know precisely what the feature is
> supposed to do -- then we'll just end up whacking the behavior around
> trying to get some combination that makes sense, and the whole effort
> is probably doomed.

Clearly this is now the crux of the issue ... the specification was
unclear, as I cobbled it together without thinking about the overall
implications, taking pieces of advice from various posts and my own
ideas of how it should work.

I am trying to split things in a way that makes the most sense to offer
the best possible combination of functionality.  I think there are three
steps to getting this done:

1. for pg10 and pg11, make pg_dump work correctly.  I think having
pg_dump use CREATE TABLE PARTITION OF is not correct when the partition
has a mismatching column definition.  So my proposal is to back-patch
David's pg_dump patch to pg10 and pg11.  Thread:
https://postgr.es/m/20190423185007.GA27954@alvherre.pgsql
Note that this changes pg_dump behavior in back branches.

2. for pg12, try to keep as much of the tablespace inheritance
functionality as possible (more below on how this works), without
running into weird cases.  I think if we just forbid the case of
the tablespace being defined to the database tablespace, all the
weird cases disappear from the code.

3. For pg13, we can try to put back the functionality of the database
tablespace as default for a partition.  You suggest using the value 1,
and Tom suggests adding a new predefined row in pg_tablespace that has
the meaning of "whatever the default tablespace is for the current
database".  I have a slight preference towards having the additional
magical row(s).

Maybe people would like to see #3 put it pg12.  I don't oppose that, but
it seems to be new functionality development; RMT would have to exempt
this from feature freeze.

Robert wrote:

> - I don't have any idea how to salvage things in v11, where the
> catalog representation is irredeemably ambiguous.  We'd probably have
> to be satisfied with fairly goofy behavior in v11.

This behavior is in use only for indexes in pg11, and does not allow
setting the database tablespace anyway (it just gets set back to 0 if
you do that), so I don't think there's anything *too* goofy about it.
(But whatever we do, pg11 behavior for tables would differ from pg12;
and it would differ for indexes too, depending on how we handle the
default_tablespace thing.)

> - It's not enough to have a good catalog representation.  You also
> have to be able to recreate those catalog states.  I think you
> probably need a way to set the reltablespace to whatever value you
> need it to have without changing the tables under it.  Like ALTER
> TABLE ONLY blah {TABLESPACE some_tablespace_name | NO TABLESPACE} or
> some such thing.  That exact idea may be wrong, but I think we are not
> going to have much luck getting to a happy place without something of
> this sort.

David proposed ALTER TABLE .. TABLESPACE DEFAULT as a way to put back
the database tablespace behavior (and that's implemented in my previous
patch), which seems good to me.  I'm not sure I like "NO TABLESPACE"
better than DEFAULT, but if others do, I'm not really wedded to it.


Now I discuss how tablespace inheritance works.  To define the
tablespace of a regular table, there's a simple algorithm:

 a1. if there's a TABLESPACE clause, use that.
 b1. otherwise, if there's a default_tablespace, use that.
 c1. otherwise, use the database tablespace.
 d1. if we end up with the database tablespace, overwrite with 0.

If creating a partition, there is the additional rule that parent's
tablespace overrides default_tablespace:

 a2. if there's a TABLESPACE clause, use that.
 b2. otherwise, if the parent has a tablespace, use that.
 c2. otherwise, if there's a default_tablespace, use that.
 d2. otherwise, use the database tablespace.
 e2. if we end up with the database tablespace, overwrite with 0.

Finally, there's the question of what defines the parent's tablespace.
I think the best algorithm to determine the tablespace when creating a
partitioned relation is the same as for partitions:

 a3. if there's a TABLESPACE clause, use that.
 b3. otherwise, if the parent has a tablespace, use that.  (We have this
      because partitions might be partitioned.)
 c3. otherwise, if there's default_tablespace, use that.
 d3. otherwise, use the database tablespace.
 e3. if we end up with the database tablespace, overwrite with 0.

Tracing through the sets of rules, they are all alike -- the only
difference is that regular tables lack the rule about parent's
tablespace.

Some notes on this:

1. We have rule c3, that is, we still honor default_tablespace when
   creating partitioned tables.  Andres said, upthread, that we
   shouldn't do that -- if the tablespace isn't *explicitly* mentioned,
   he said, the partitions have no business landing on that tablespace.
   https://postgr.es/m/20190306223741.lolaaimhkkp4kict@alap3.anarazel.de
   However I think it would be more inconsistent to not have that than
   to have it, for the case when you create a partitioned partition.
   I tried to implement things the way he suggests and ended up with
   more warts.
   Would it be surprising for users that all subsequent partitions are
   created in that partition?  Perhaps, but in Tom's words, "there are
   many things [in this area] that are surprising"; we should be sure to
   display the situation clearly, and if they don't like it, they can
   simply do ALTER TABLE SET TABLESPACE and things are good again.

   default_tablespace does not seem to be a particularly popular
   feature; I suspect pg_dump is by far its biggest user, and we make it
   not fall afoul of this problem with patch #1, so it never creates any
   partitions anyway.

2. we have rule b2 ahead of c2; however Tom said users are unlikely to
   remember which rule comes first.  I don't care much for that
   argument; surely they can just read the docs.  This is not the first
   not-completely-intuitive thing for a DBA, so.

3. If you really want a partitioned rel to end up in the database
   tablespace overriding default_tablespace, you have two options:
   i) reset default_tablespace ahead of time (SET LOCAL), or
   ii) use ALTER TABLE ... TABLESPACE afterwards.
   Maybe we should have a iii) "CREATE TABLE ... TABLESPACE DEFAULT"
   clause that does it, but I'm not sure it's necessary.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pg_dump is broken for partition tablespaces

От
David Rowley
Дата:
On Wed, 24 Apr 2019 at 10:26, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> If creating a partition, there is the additional rule that parent's
> tablespace overrides default_tablespace:
>
>  a2. if there's a TABLESPACE clause, use that.
>  b2. otherwise, if the parent has a tablespace, use that.
>  c2. otherwise, if there's a default_tablespace, use that.
>  d2. otherwise, use the database tablespace.
>  e2. if we end up with the database tablespace, overwrite with 0.

Wouldn't it just take the proposed pg_dump change to get that?  rule
e2 says we'll store 0 in reltablespace, even if the user does
TABLESPACE pg_default, so there's no requirement to adjust the hack in
heap_create to put any additional conditions on when we set
reltablespace to 0, so it looks like none of the patching work you did
would be required to implement this. Right?

The good part about that is that its consistent with what happens if
the user does TABLESPACE pg_default for any other object type that
supports tablespaces. i.e we always store 0.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: pg_dump is broken for partition tablespaces

От
Alvaro Herrera
Дата:
On 2019-Apr-24, David Rowley wrote:

> On Wed, 24 Apr 2019 at 10:26, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > If creating a partition, there is the additional rule that parent's
> > tablespace overrides default_tablespace:
> >
> >  a2. if there's a TABLESPACE clause, use that.
> >  b2. otherwise, if the parent has a tablespace, use that.
> >  c2. otherwise, if there's a default_tablespace, use that.
> >  d2. otherwise, use the database tablespace.
> >  e2. if we end up with the database tablespace, overwrite with 0.
> 
> Wouldn't it just take the proposed pg_dump change to get that?  rule
> e2 says we'll store 0 in reltablespace, even if the user does
> TABLESPACE pg_default, so there's no requirement to adjust the hack in
> heap_create to put any additional conditions on when we set
> reltablespace to 0, so it looks like none of the patching work you did
> would be required to implement this. Right?

I'm not sure yet that 100% of the patch is gone, but yes much of it
would go away thankfully.  I do suggest we should raise an error if rule
a3 hits and it mentions the database tablespace (I stupidly forgot
this critical point in the previous email).  I think astonishment is
lesser that way.

> The good part about that is that its consistent with what happens if
> the user does TABLESPACE pg_default for any other object type that
> supports tablespaces. i.e we always store 0.

Yeah, it makes the whole thing a lot simpler.  Note my note for further
development of a feature (modelled after Robert's proposal) to allow the
database tablespace to be specified, using either a separate pg_tablespace
entry that means "use the database tablespace whatever that is" (Tom's
suggestion), or a magic not-a-real-tablespace-OID number known to the
code, such as 1 (Robert's).

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pg_dump is broken for partition tablespaces

От
Amit Langote
Дата:
On 2019/04/23 20:03, David Rowley wrote:
> On Tue, 23 Apr 2019 at 18:18, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>
>> If partitions needed a
>> map in the old database, this patch means that they will *continue* to
>> need it in the new database.
> 
> That's incorrect.

Not completely though, because...

> My point was about dropped columns being removed
> after a dump / reload.  Only binary upgrade mode preserves
> pg_attribute entries for dropped columns. Normal mode does not, so the
> maps won't be needed after the reload if they were previously only
> needed due to dropped columns.  This is the case both with and without
> the pg_dump changes I proposed.  The case the patch does change is if
> the columns were actually out of order, which I saw as an unlikely
> thing to happen in the real world.

This is the case I was talking about, which I agree is very rare.  Sorry
for being unclear.

I think your proposed patch is fine and I don't want to argue that the way
things are now has some very sound basis.

Also, as you and Alvaro have found, the existing arrangement makes pg_dump
emit partitions in a way that's not super helpful (insert/copy failing
unintuitively), but it's not totally broken either.  That said, I don't
mean to oppose back-patching any fix you think is appropriate.

Thank you for working on this.

Regards,
Amit




Re: pg_dump is broken for partition tablespaces

От
Alvaro Herrera
Дата:
On 2019-Apr-23, Alvaro Herrera wrote:

> I'm not sure yet that 100% of the patch is gone, but yes much of it
> would go away thankfully.

Of course, the part that fixes the bug that indexes move tablespace when
recreated by a rewriting ALTER TABLE is still necessary.  Included in
the attached patch.

(I think it would be good to have the relation being complained about in
the error message, though that requires passing the name to
GetDefaultTablespace.)

> I do suggest we should raise an error if rule a3 hits and it mentions
> the database tablespace (I stupidly forgot this critical point in the
> previous email).  I think astonishment is lesser that way.

As in the attached.  When pg_default is the database tablespace, these
cases fail with the patch, as expected:

alvherre=# CREATE TABLE q (a int PRIMARY KEY) PARTITION BY LIST (a) TABLESPACE pg_default;
psql: ERROR:  cannot specify default tablespace for partitioned relations

alvherre=# CREATE TABLE q (a int PRIMARY KEY USING INDEX TABLESPACE pg_default) PARTITION BY LIST (a);
psql: ERROR:  cannot specify default tablespace for partitioned relations


alvherre=# SET default_tablespace TO 'pg_default';

alvherre=# CREATE TABLE q (a int PRIMARY KEY) PARTITION BY LIST (a) ;
psql: ERROR:  cannot specify default tablespace for partitioned relations

alvherre=# CREATE TABLE q (a int PRIMARY KEY) PARTITION BY LIST (a) TABLESPACE foo;
psql: ERROR:  cannot specify default tablespace for partitioned relations

alvherre=# CREATE TABLE q (a int PRIMARY KEY USING INDEX TABLESPACE foo) PARTITION BY LIST (a);
psql: ERROR:  cannot specify default tablespace for partitioned relations


These cases work:

alvherre=# CREATE TABLE q (a int PRIMARY KEY USING INDEX TABLESPACE foo) PARTITION BY LIST (a) TABLESPACE foo;

alvherre=# SET default_tablespace TO '';    -- the default
alvherre=# CREATE TABLE q (a int PRIMARY KEY) PARTITION BY LIST (a);

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

Re: pg_dump is broken for partition tablespaces

От
Alvaro Herrera
Дата:
I have pushed this now, after putting back a few of the tests I had
proposed earlier, as well as a couple of sentences in the docs to
hopefully make it clearer how it works.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services