Обсуждение: Schema version management

Поиск
Список
Период
Сортировка

Schema version management

От
Joel Jacobson
Дата:
Hi,

I just read a very interesting post about "schema version management".

Quote: "You could set it up so that every developer gets their own
test database, sets up the schema there, takes a dump, and checks that
in. There are going to be problems with that, including that dumps
produced by pg_dump are ugly and optimized for restoring, not for
developing with, and they don't have a deterministic output order." (
http://petereisentraut.blogspot.com/2012/05/my-anti-take-on-database-schema-version.html
)

Back in December 2010, I suggested a new option to pg_dump, --split,
which would write the schema definition of each object in separate
files:

http://archives.postgresql.org/pgsql-hackers/2010-12/msg02285.php

Instead of a huge plain text schema file, impossible to version
control, all tables/sequences/views/functions are written to separate
files, allowing the use of a version control software system, such as
git, to do proper version controlling.

The "deterministic output order" problem mentioned in the post above,
is not a problem if each object (table/sequence/view/function/etc) is
written to the same filename everytime.
No matter the order, the tree of files and their content will be
identical, no matter the order in which they are dumped.

I remember a lot of hackers were very positive about this option, but
we somehow failed to agree on the naming of files in the tree
structure. I'm sure we can work that out though.

I use this feature in production, I have a cronjob which does a dump
of the schema every hour, committing any eventual changes to a
separate git branch for each database installation, such as
production, development and test.
If no changes to the schema have been made, nothing will be committed
to git since none of the files have changed.

It is then drop-dead simple to diff two different branches of the
database schema, such as development or production, or diffing
different revisions allowing point-in-time comparison of the schema.

This is an example of the otuput of a git log --summary for one of the
automatic commits to our production database's git-repo:

--
commit 18c31f8162d851b0dac3bad7e80529ef2ed18be3
Author: Production Database <production.database@trustly.com>
Date:   Fri May 4 15:00:04 2012 +0200

    Update of database schema Linux DB0 2.6.26-2-amd64 #1 SMP Wed Aug
19 22:33:18 UTC 2009 x86_64 GNU/Linux Fri, 04 May 2012 15:00:04 +0200

 create mode 100644
gluepay-split/public/CONSTRAINT/openingclosingbalances_pkey.sql
 create mode 100644
gluepay-split/public/CONSTRAINT/openingclosingbalances_source_key.sql
 create mode 100644 gluepay-split/public/SEQUENCE/seqopeningclosingbalance.sql
 create mode 100644 gluepay-split/public/TABLE/openingclosingbalances.sql
--

Here we can see we apparently deployed a new table,
"openingclosingbalances" around Fri May 4 15:00:04.

Without any manual work, I'm able to follow all changes actually
_deployed_ in each database.

At my company, a highly database-centric stored-procedure intensive
business dealing with mission-critical monetary transactions, we've
been using this technique to successfully do schema version management
without any hassle for the last two years.

Hopefully this can add to the list of various possible _useful_ schema
version management methods.

Best regards,

Joel Jacobson


Re: Schema version management

От
Daniel Farina
Дата:
On Sun, May 20, 2012 at 12:41 PM, Joel Jacobson <joel@trustly.com> wrote:
> Hi,
>
> I just read a very interesting post about "schema version management".
>
> Quote: "You could set it up so that every developer gets their own
> test database, sets up the schema there, takes a dump, and checks that
> in. There are going to be problems with that, including that dumps
> produced by pg_dump are ugly and optimized for restoring, not for
> developing with, and they don't have a deterministic output order." (
> http://petereisentraut.blogspot.com/2012/05/my-anti-take-on-database-schema-version.html
> )

I think you are absolutely right, but I'm not sure if teaching pg_dump
a new option is the best idea.  It's a pretty complex program as-is.
I've also heard some people who really wish pg knew how to self-dump
for valid reasons.

It sounds like some of the catalog wrangling and cycle-breaking
properties of pg_dump could benefit from being exposed stand-alone,
but unfortunately that's not a simple task, especially if you want to
do The Right Thing and have pg_dump link that code, given pg_dump's
criticality.

pg_extractor is a new/alternative take on the database copying
problem, maybe you could have a look at that?

--
fdr


Re: Schema version management

От
Joel Jacobson
Дата:
On Mon, May 21, 2012 at 8:08 AM, Daniel Farina <daniel@heroku.com> wrote:
> I think you are absolutely right, but I'm not sure if teaching pg_dump
> a new option is the best idea.  It's a pretty complex program as-is.
> I've also heard some people who really wish pg knew how to self-dump
> for valid reasons.

Complex program? Yes, pg_dump it is extremely complex, I wouldn't want
to touch any of the code. A rewrite is probably close to impossible.

Complex patch? No. It's 102 lines of code and doesn't change any of
the existing code in pg_dump, it simply adds some lines writing out
the objects to separate files. Have a look at the patch, it's super
simple.

> It sounds like some of the catalog wrangling and cycle-breaking
> properties of pg_dump could benefit from being exposed stand-alone,
> but unfortunately that's not a simple task, especially if you want to
> do The Right Thing and have pg_dump link that code, given pg_dump's
> criticality.

I agree it's not a simple task, and it's probably not something anyone
will fix in the near future.
The --split option doesn't aim to solve this problem either. That's a
different problem, and it's not a problem I have.

> pg_extractor is a new/alternative take on the database copying
> problem, maybe you could have a look at that?

It's just sad realizing people need to some up with hacks and
work-arounds to solve a obvious real-life problem, easily fixed inside
pg_dump with 102 lines of drop-dead simple code, not touching any of
the logics or flows in pg_dump.

I can't even image how many hours coders have wasted hacking together
tools like pg_extractor just to circumvent the stupid fact pg_dump
can't do this natively.

The pg_extractor is way more complex than my suggested patch, it's 974
lines of perl codes, as opposed to 102 lines of simple code in the
patch.
The pg_extractor also does a lot more than simply splitting objects
into separate files, like executing svn commands.

The splitting of objects into separate files should clearly be the
responsibility of pg_dump.
It would allow you to easily version control the schema files your
self with any version control software system, such as svn, git, etc.

I'm sure pg_extractor does it best to achieve the objective, but even
if it does, I would never trust it for production usage, version
controlling your production schema is far too important to trust any
tool not part of the mainline distribution of postgres. And personally
I don't have any problem, I've been using the --split option for two
years, I just feel sorry for the rest of the postgres community,
unaware of how to solve this problem, having to hack together their
own little tools, or be "lucky" finding some existing hack.


Re: Schema version management

От
Daniel Farina
Дата:
On Sun, May 20, 2012 at 7:36 PM, Joel Jacobson <joel@trustly.com> wrote:
> On Mon, May 21, 2012 at 8:08 AM, Daniel Farina <daniel@heroku.com> wrote:
>> I think you are absolutely right, but I'm not sure if teaching pg_dump
>> a new option is the best idea.  It's a pretty complex program as-is.
>> I've also heard some people who really wish pg knew how to self-dump
>> for valid reasons.
>
> Complex program? Yes, pg_dump it is extremely complex, I wouldn't want
> to touch any of the code. A rewrite is probably close to impossible.

I wouldn't be so sure about that...

> Complex patch? No. It's 102 lines of code and doesn't change any of
> the existing code in pg_dump, it simply adds some lines writing out
> the objects to separate files. Have a look at the patch, it's super
> simple.

Ah. I did not know there was a patch already out there -- I did not
somehow get that , as it then can be audited in its precise functionality.

>> It sounds like some of the catalog wrangling and cycle-breaking
>> properties of pg_dump could benefit from being exposed stand-alone,
>> but unfortunately that's not a simple task, especially if you want to
>> do The Right Thing and have pg_dump link that code, given pg_dump's
>> criticality.
>
> It's just sad realizing people need to some up with hacks and
> work-arounds to solve a obvious real-life problem, easily fixed inside
> pg_dump with 102 lines of drop-dead simple code, not touching any of
> the logics or flows in pg_dump.
>
> I can't even image how many hours coders have wasted hacking together
> tools like pg_extractor just to circumvent the stupid fact pg_dump
> can't do this natively.

My next question would be how this might relate to the directory dump
format.  For example, is it an embellishment of that?  It seems at
fist glance that whatever this patch might be a cousin of that
feature.  Or, is it superseded? The documentation is clear that tables
are given their own files, but doesn't say much about how other schema
objects are stored, so they may or may not be useful to your needs.

Also, now that I look more carefully, there was a lot of conversation
about this patch; it seems like what you are doing now is reporting
its successful use, and I did not understand that by reading the
abstract of your email.  And, beyond that, do we have a summary of the
open questions that prevented it from being committed?

> I'm sure pg_extractor does it best to achieve the objective, but even
> if it does, I would never trust it for production usage, version
> controlling your production schema is far too important to trust any
> tool not part of the mainline distribution of postgres. And personally
> I don't have any problem, I've been using the --split option for two
> years, I just feel sorry for the rest of the postgres community,
> unaware of how to solve this problem, having to hack together their
> own little tools, or be "lucky" finding some existing hack.

My thinking is that confidence would be increased if there was a piece
of code that handled a lot of the catalog munging et al that is part
of pg_dump that *is* maintained by postgres so other projects can more
convincingly add a correct veneer.

As a meta-comment, all I did was ask some polite questions.  You could
have politely disqualified pg_extractor and spared some of the
language without having gotten anything less done.

--
fdr


Re: Schema version management

От
Joel Jacobson
Дата:
On Mon, May 21, 2012 at 10:06 AM, Daniel Farina <daniel@heroku.com> wrote:
> Also, now that I look more carefully, there was a lot of conversation
> about this patch; it seems like what you are doing now is reporting
> its successful use, and I did not understand that by reading the
> abstract of your email.  And, beyond that, do we have a summary of the
> open questions that prevented it from being committed?

Good idea. Here is an attempt to a summary:

http://archives.postgresql.org/pgsql-hackers/2010-12/msg02301.php
The initial feedback was on the usage of OIDs as file names.
This was indeed a bad idea and was changed, see
http://archives.postgresql.org/pgsql-hackers/2010-12/msg02314.php
Tom Lane also pointed out it doesn't solve the "randomly different
ordering of rows within a table"-problem.
The rows within a table are not part of the schema. The patch doesn't
attempt to solve that problem.

http://archives.postgresql.org/pgsql-hackers/2010-12/msg02318.php
Gurjeet Singh pointed out the problem with functions sharing the same
name but having different arguments.
As of now, it's not certain they will always be dumped into the same files.
This is a valid point, and needs to be solved in an elegant way.
The arguments needs to be made part of the path somehow.

http://archives.postgresql.org/pgsql-hackers/2010-12/msg02327.php
Another idea Gurjeet had was to hash the object identifier and use
that in the file's name.
Not a bad idea, would look nicer if functions have very many arguments.
Perhaps /[schema]/[object type]/[name]/[hash of arguments].sql

http://archives.postgresql.org/pgsql-hackers/2010-12/msg02329.php
David Wilson suggested placing all overloaded functions within the same file.
"Then, assuming you order them deterministically within that file, we
sidestep the
file naming issue and maintain useful diff capabilities, since a diff of the
function's file will show additions or removals of various overloaded
versions."
This would be a good idea, but falls on pg_dump not outputting the
functions in a deterministic order.

http://archives.postgresql.org/pgsql-hackers/2010-12/msg02496.php
Robert Treat: "I've both enjoyed reading this thread and seeing this wheel
reinvented yet again, and wholeheartedly +1 the idea of building this
directly into pg_dump. (The only thing better would be to make everything
thing sql callable, but that's a problem for another day)."

> My thinking is that confidence would be increased if there was a piece
> of code that handled a lot of the catalog munging et al that is part
> of pg_dump that *is* maintained by postgres so other projects can more
> convincingly add a correct veneer.

I totally agree, the most easy place to handle it is within pg_dump.

> As a meta-comment, all I did was ask some polite questions.  You could
> have politely disqualified pg_extractor and spared some of the
> language without having gotten anything less done.

I very much appreciated your answer, and I did in no way mean to be impolite.

Best regards,

Joel


Re: Schema version management

От
Benedikt Grundmann
Дата:
On Mon, May 21, 2012 at 5:03 AM, Joel Jacobson <joel@trustly.com> wrote:
>
> http://archives.postgresql.org/pgsql-hackers/2010-12/msg02301.php
> The initial feedback was on the usage of OIDs as file names.
> This was indeed a bad idea and was changed, see
> http://archives.postgresql.org/pgsql-hackers/2010-12/msg02318.php
> Gurjeet Singh pointed out the problem with functions sharing the same
> name but having different arguments.
> As of now, it's not certain they will always be dumped into the same files.
> This is a valid point, and needs to be solved in an elegant way.
> The arguments needs to be made part of the path somehow.

This is interesting at Jane Street we actually have a small tool
that "parses" the output of pg_dump.  (Well applies a set of regular
expressions plus a little bit guesswork).  We use this to do three things
all of which I would love to see supported by postgres tool chain proper:

1) split the output into one file per thing (basically as per this  thread) each file named
<type>_<name>_<running-integer>for use  with a VCS.  So if we have an overloaded function foo we end up with  several
function_foo_1.sqlfunction_foo_2.sql ...  The order of the  enumeration is just the order the functions occurred in the
pg_dump which seems to be stable and therefore good enough.
 

2) extract a patch.  You give the tool the name of one or more roots  (e.g. a table or set of tables you want to
modify). It finds all  things that depend on it (well sort of just turn the body of each  definition into a list of
wordsand a depends on b if the name of b  occurrs in a).  Do a topological sort (if there are cycles because  of the
hackdependency check break them but continue and produce a  warning). Output a file that first drops the definitions in
inverse dependency order and then recreates them (in dependency order).  The file starts with a begin but does NOT end
witha commit so you  are forced to enter it yourself.
 
  This tool is fantastic if you have a big set of plpgsql functions as  it is otherwise hard to make sure that you have
modifiedall places  when refactoring, changing a column, etc...
 

3) Find all leaves.  E.g. do the topsort on the whole pg_dump and list  the names of all things nothing depends on.
Thisis mostly useful if  you want to make sure that you are not accumulating cruft that isn't  used by anything.  Of
courseyou separately need a list or knowledge  about the entry points of your application(s).
 

Cheers,

Bene


Re: Schema version management

От
Daniel Farina
Дата:
On Sun, May 20, 2012 at 9:03 PM, Joel Jacobson <joel@trustly.com> wrote:
> On Mon, May 21, 2012 at 10:06 AM, Daniel Farina <daniel@heroku.com> wrote:
>> Also, now that I look more carefully, there was a lot of conversation
>> about this patch; it seems like what you are doing now is reporting
>> its successful use, and I did not understand that by reading the
>> abstract of your email.  And, beyond that, do we have a summary of the
>> open questions that prevented it from being committed?
>
> Good idea. Here is an attempt to a summary:

Thank you, that's very informative.  I'd like to reiterate one
question, though, which is something like:

"How do you feel that the since-committed directory-output/input
support in pg_dump/pg_restore could or should influence your patch, if
at all?"

It seems like now that there is support for spitting out a bunch of
files in a directory for pg_dump that is now going to be supported for
a long time that a new feature like yours might be more cohesive if it
somehow played with that.  I must confess I haven't read the patch in
detail, especially if it has been updated, but back then there was no
multi-file output mode from pg_dump, and now there is one.  My naive
understanding is this would be adding a second one as-is, but I wonder
if that is strictly necessary to fulfill the use case.

--
fdr


Re: Schema version management

От
Joel Jacobson
Дата:
On Tue, May 22, 2012 at 3:30 AM, Daniel Farina <daniel@heroku.com> wrote:
> Thank you, that's very informative.  I'd like to reiterate one
> question, though, which is something like:
>
> "How do you feel that the since-committed directory-output/input
> support in pg_dump/pg_restore could or should influence your patch, if
> at all?"

The directory format fulfills a different purpose. The tables are
split into files, where each file name gets a number. Functions are
not split into files, they are defined in the table of content file,
toc.dat.

Example:

joel@Joel-Jacobsons-MacBook-Pro ~ $ pg_dump -F d -f /Users/joel/test
joel@Joel-Jacobsons-MacBook-Pro ~ $ ls -la test
total 24
drwx------    5 joel  staff   170 May 22 07:16 .
drwx------+ 130 joel  staff  4488 May 22 07:16 ..
-rw-r--r--    1 joel  staff    38 May 22 07:16 2116.dat.gz
-rw-r--r--    1 joel  staff    39 May 22 07:16 2117.dat.gz
-rw-r--r--    1 joel  staff  2265 May 22 07:16 toc.dat

This is a good feature for its purpose, but doesn't provide a solution
for the schema version management problem.

> It seems like now that there is support for spitting out a bunch of
> files in a directory for pg_dump that is now going to be supported for
> a long time that a new feature like yours might be more cohesive if it
> somehow played with that.  I must confess I haven't read the patch in
> detail, especially if it has been updated, but back then there was no
> multi-file output mode from pg_dump, and now there is one.  My naive
> understanding is this would be adding a second one as-is, but I wonder
> if that is strictly necessary to fulfill the use case.

If one want to reuse the splitting to files-code of the directory
format, maybe the existing option -F d could be tweaked to output in
both a a machine-readable format (current way), and also a
human-friendly tree of files and content (like suggested by my patch).

I wonder what the option would be called then, having two chars
options is not an option I guess, maybe -F t for "tree" instead of
"directory", as the -F d option only dumps to a single directory and
not a tree-structure?


Re: Schema version management

От
Andrew Dunstan
Дата:

On 05/21/2012 08:25 PM, Joel Jacobson wrote:
> On Tue, May 22, 2012 at 3:30 AM, Daniel Farina<daniel@heroku.com>  wrote:
>> Thank you, that's very informative.  I'd like to reiterate one
>> question, though, which is something like:
>>
>> "How do you feel that the since-committed directory-output/input
>> support in pg_dump/pg_restore could or should influence your patch, if
>> at all?"
> The directory format fulfills a different purpose. The tables are
> split into files, where each file name gets a number. Functions are
> not split into files, they are defined in the table of content file,
> toc.dat.
>
> Example:
>
> joel@Joel-Jacobsons-MacBook-Pro ~ $ pg_dump -F d -f /Users/joel/test
> joel@Joel-Jacobsons-MacBook-Pro ~ $ ls -la test
> total 24
> drwx------    5 joel  staff   170 May 22 07:16 .
> drwx------+ 130 joel  staff  4488 May 22 07:16 ..
> -rw-r--r--    1 joel  staff    38 May 22 07:16 2116.dat.gz
> -rw-r--r--    1 joel  staff    39 May 22 07:16 2117.dat.gz
> -rw-r--r--    1 joel  staff  2265 May 22 07:16 toc.dat
>
> This is a good feature for its purpose, but doesn't provide a solution
> for the schema version management problem.
>
>> It seems like now that there is support for spitting out a bunch of
>> files in a directory for pg_dump that is now going to be supported for
>> a long time that a new feature like yours might be more cohesive if it
>> somehow played with that.  I must confess I haven't read the patch in
>> detail, especially if it has been updated, but back then there was no
>> multi-file output mode from pg_dump, and now there is one.  My naive
>> understanding is this would be adding a second one as-is, but I wonder
>> if that is strictly necessary to fulfill the use case.
> If one want to reuse the splitting to files-code of the directory
> format, maybe the existing option -F d could be tweaked to output in
> both a a machine-readable format (current way), and also a
> human-friendly tree of files and content (like suggested by my patch).
>
> I wonder what the option would be called then, having two chars
> options is not an option I guess, maybe -F t for "tree" instead of
> "directory", as the -F d option only dumps to a single directory and
> not a tree-structure?



I have a little utility to write out the data in the TOC as separate 
files. It works with both a custom format dump, and the toc.dat file 
from a directory format or unpacked tar format dump.

It was originally written as a debugging aid, but  can be used to some 
extent for schema management as well. It's far from complete, but still 
might be useful. See <https://github.com/adunstan/DumpToc>

cheers

andrew



Re: Schema version management

От
Daniel Farina
Дата:
On Mon, May 21, 2012 at 5:25 PM, Joel Jacobson <joel@trustly.com> wrote:
> If one want to reuse the splitting to files-code of the directory
> format, maybe the existing option -F d could be tweaked to output in
> both a a machine-readable format (current way), and also a
> human-friendly tree of files and content (like suggested by my patch).
>
> I wonder what the option would be called then, having two chars
> options is not an option I guess, maybe -F t for "tree" instead of
> "directory", as the -F d option only dumps to a single directory and
> not a tree-structure?

Is there a reason why the current directory format could not be
adjusted to become more human-readable friendly for mechanical
reasons?  I realize there is a backwards compatibility problem, but it
may be better than bloating a new option.

Andrew's approach of reading the TOC also be good...as so pg_dump can
avoid serving the additional master of schema versioning and
development usability in addition to dumping.  The TOC is the closest
thing we have to the library-ification of pg_dump in the near-term.
But I don't see how making the directory output format more
human-friendly could be seen as a bad thing overall, except in the
notable axis of implementation complexity.  Silly issues like naming
files on different platforms, case sensitivity, and file length
restrictions may rear their ugly head.

I think about this because in addition to the data types and operators
defined in the development process, there are often small tables that
need to be loaded with content and version controlled as well, rather
like userland-equivalents pg_enum entries.

-- 
fdr


Re: Schema version management

От
Joel Jacobson
Дата:
On Wed, May 23, 2012 at 3:24 AM, Daniel Farina <daniel@heroku.com> wrote:
> Is there a reason why the current directory format could not be
> adjusted to become more human-readable friendly for mechanical
> reasons?  I realize there is a backwards compatibility problem, but it
> may be better than bloating a new option.

I like your idea, then the format would be directory, while the option
would be something like, --human-friendly?

Currently, the directory format only dumps the data of tables
into separate files. Everything else goes into the toc.dat file.
To make it work, also the stuff written to the toc.dat file must
be written to separate files.

> But I don't see how making the directory output format more
> human-friendly could be seen as a bad thing overall, except in the
> notable axis of implementation complexity.  Silly issues like naming
> files on different platforms, case sensitivity, and file length
> restrictions may rear their ugly head.

If the entire function identity arguments would be included in the filename,
two dumps of the same schema in two different databases
would be guaranteed to produce the same dump.

This would render some very long filenames for functions with many arguments,
but this problem could at least be reduced by using the shorter aliases for each
data type, as "varchar" instead of "character varying" and "timestamptz"
instead of "timestamp with time zone", etc.

http://www.postgresql.org/docs/devel/static/datatype.html#DATATYPE-TABLE

Also, to get even more space, as the name of the function can be long too,
the function name could be made a directory, and the different overloaded types
different files, e.g:

/public/FUNCTION/myfunc/int.sql
/public/FUNCTION/myfunc/int_timestamptz.sql

And functions with no arguments are written to a single file
(suffic .sql to avoid conflict with eventual directory name for function):

/public/FUNCTION/myfunc.sql

> I think about this because in addition to the data types and operators
> defined in the development process, there are often small tables that
> need to be loaded with content and version controlled as well, rather
> like userland-equivalents pg_enum entries.

Is there a term for such tables? I use the term "lookup tables", but perhaps
there is a better one?

In my schema, they typically maps statusids, stateids,
etc to human friendly names.

E.g., if Orders is a huge table for all orders, I might have a
OrderStatuses table to
lookup all the OrderStatusID columns in Orders.
Orders.OrderStatusID -fk-> OrderStatuses.OrderStatusID
OrderStatuses.Name is unqiue and contains the human friendly name of the status.

These small lookup tables also needs to be version controlled of course.

This is a tricky one though, because you might have small tables with base data,
but with references to other huge tables, which you don't want to
include in your
automatically version controlled schema dump.

I solved this problem by creating a quite complex recursive plpgsql function,
resolving all dependencies and joining only the rows from each table required,
allowing you to specify a regex matching a list of tables, which in
turn resolves
to all tables they have references to, and dumps these tables too, but only the
required rows. The result is a dump of each such table into a separate file,
in a restorable order not causing any dependency problems.
Then I have a similar function to do the restoring.

I use this approach to build a restorable clean test database of any version of
the system, may it be the production or some developer's local version of it.

And also, not to forget, to make it work all the sequences also needs
to be restarted
to the same values as in the original database after the dump is restored.


Re: Schema version management

От
Tom Lane
Дата:
Joel Jacobson <joel@trustly.com> writes:
> If the entire function identity arguments would be included in the filename,
> two dumps of the same schema in two different databases
> would be guaranteed to produce the same dump.

> This would render some very long filenames for functions with many arguments,

Thus, not implausibly, causing the dump to fail entirely on some
filesystems.  Case sensitivity, encoding issues, and special characters
in names (eg slashes or backslashes, depending on platform) are
additional pain points.  This does not sound like a good plan from here.

Taking a step or two back, it seems to me that the thrust of your
proposal is essentially to throw away all dump ordering information,
which does not seem like a particularly good idea either.  It certainly
will not lead to a dump that can be restored reliably.  If the use-case
for this is database comparisons, I think we'd be a lot better off to
write a postprocessing tool for regular dumps to perform such
comparisons, rather than whacking pg_dump around to the point where it's
unable to perform its primary function.
        regards, tom lane


Re: Schema version management

От
Joel Jacobson
Дата:
On Wed, May 23, 2012 at 9:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thus, not implausibly, causing the dump to fail entirely on some
> filesystems.  Case sensitivity, encoding issues, and special characters
> in names (eg slashes or backslashes, depending on platform) are
> additional pain points.  This does not sound like a good plan from here.

This is true, which means some users won't be able to use the feature,
because they are using an ancient OS or have function names with slashes,
hm, is it even possible to have function names with slashes?

The maximum length of tables, functions etc in postgres is 63 characters.
A function in postgres can have at most 100 arguments.
The absolute majority of users run operating systems allowing
at least 255 characters, http://en.wikipedia.org/wiki/Comparison_of_file_systems

I suppose you have a lot more experience of what postgres installations exists
in the world. Do you think it's common databases have non-ascii problematic
characters in object names?

Is it a project policy all features of all standard tools must be
useful for all users
on all platforms on all databases? Or is it acceptable if some features are only
useable for, say, 90% of the users?

> Taking a step or two back, it seems to me that the thrust of your
> proposal is essentially to throw away all dump ordering information,
> which does not seem like a particularly good idea either.  It certainly
> will not lead to a dump that can be restored reliably.  If the use-case
> for this is database comparisons, I think we'd be a lot better off to
> write a postprocessing tool for regular dumps to perform such
> comparisons, rather than whacking pg_dump around to the point where it's
> unable to perform its primary function.

Not at all, the ordering information is not thrown away, it is preserved
in the dump file specified by the -f option, from which each split file
is included using \i


Example, this is an extract of the -f dump file in my database:

--
-- PostgreSQL database dump
--

SET statement_timeout = 0;
SET client_encoding = 'SQL_ASCII';
SET standard_conforming_strings = off;
SET check_function_bodies = false;
SET client_min_messages = warning;
SET escape_string_warning = off;

--
-- Name: pgx_diag; Type: SCHEMA; Schema: -; Owner: postgres
--

CREATE SCHEMA pgx_diag;


ALTER SCHEMA pgx_diag OWNER TO postgres;

-- ... some more schemas, languages etc ...
-- ... and then all the included files:

\i /home/postgres/database/gluepay-split/public/TYPE/dblink_pkey_results.sql
\i /home/postgres/database/gluepay-split/public/TYPE/r_matchedwithdrawal.sql
\i /home/postgres/database/gluepay-split/public/TYPE/r_unapprovedwithdrawal.sql
\i /home/postgres/database/gluepay-split/public/TYPE/ukaccountvalidationchecktype.sql
\i /home/postgres/database/gluepay-split/aml/FUNCTION/check_name.sql
\i /home/postgres/database/gluepay-split/aml/FUNCTION/describe_entityid.sql
\i /home/postgres/database/gluepay-split/aml/FUNCTION/get_linkid.sql
\i /home/postgres/database/gluepay-split/aml/FUNCTION/set_address.sql
-- ... all the objects ..
\i /home/postgres/database/gluepay-split/public/FK_CONSTRAINT/workershistory_workerid_fkey.sql
\i /home/postgres/database/gluepay-split/public/FK_CONSTRAINT/workershistory_workerstatusid_fkey.sql
\i /home/postgres/database/gluepay-split/public/FK_CONSTRAINT/workershistory_workertypeid_fkey.sql


-- .. and after all the included files comes permissions and stuff:

--
-- Name: public; Type: ACL; Schema: -; Owner: postgres
--

REVOKE ALL ON SCHEMA public FROM PUBLIC;
REVOKE ALL ON SCHEMA public FROM postgres;
GRANT ALL ON SCHEMA public TO postgres;
GRANT ALL ON SCHEMA public TO PUBLIC;


Re: Schema version management

От
Joel Jacobson
Дата:
On the topic on fixing pg_dump to dump in a predictable order, can
someone please update me on the current state of the problem?

I've read though pg_dump_sort.c, and note objects are first sorted in
type/name-based ordering, then topologically sorted in a way which
"minimize unnecessary rearrangement".

How come this not always generates a predictable order? Any ideas on
how to fix the problem? If someone gives me a hint I might make an
effort trying to implement the idea.

If pg_dump would dump in a predictable order, it would make sense to
dump all overloaded versions of functions sharing the same name in the
same file.

Then it would be _guaranteed_ two different databases committing their
schema to a shared VCS commit exactly the same files if the schema is
the same, which is not guaranteed unless the dump order is
predictable.

Having thought about it, I agree the idea with arguments in filenames
is, probably possible, but suboptimal.
Much better writing all overloaded functions to the same file and
fixing the predictable dump order problem.


Re: Schema version management

От
"Kevin Grittner"
Дата:
Joel Jacobson  wrote:
> hm, is it even possible to have function names with slashes?
test=# create function "is/""it""\even
possible?"() returns void language plpgsql as $$begin end;$$;
CREATE FUNCTION
test=# select "is/""it""\even
possible?"();is/"it"\even+ possible?   
--------------
(1 row)
-Kevin



Re: Schema version management

От
Robert Haas
Дата:
On Tue, May 22, 2012 at 11:31 PM, Joel Jacobson <joel@trustly.com> wrote:
> This is true, which means some users won't be able to use the feature,
> because they are using an ancient OS or have function names with slashes,
> hm, is it even possible to have function names with slashes?

Sure.  If you quote the function name, you can put anything you want
in there.  Note that Windows disallows a whole bunch of special
characters in filenames, while UNIX-like systems tend to disallow only
slash.

> I suppose you have a lot more experience of what postgres installations exists
> in the world. Do you think it's common databases have non-ascii problematic
> characters in object names?
>
> Is it a project policy all features of all standard tools must be
> useful for all users
> on all platforms on all databases? Or is it acceptable if some features are only
> useable for, say, 90% of the users?

There are cases where we permit features that only work on some
platforms, but it's rare.  Usually, we do this only when the platform
lacks some API that exists elsewhere.  For example, collations and
prefetching are not supported on Windows because the UNIX APIs we use
don't exist there.

In this case, it seems like you could work around the problem by, say,
URL-escaping any characters that can't be used in an unquoted
identifier.  Of course that might make the file name long enough to
hit the platform-specific file name limit.  Not sure what to do about
that.  The basic idea you're proposing here has been proposed a number
of times before, but it's always fallen down over questions of (1)
what do do with very long object names or those containing special
characters and (2) objects (like functions) for which schema+name is
not a unique identifier.

I don't think either of these problems ought to be a complete
show-stopper.  It seems to me that the trade-off is that when object
names are long, contain special characters, or are overloaded, we'll
have to munge the names in some way to prevent collisions.  That could
mean that the names are not 100% stable, which would possibly produce
some annoyance if you're using a VCS to track changes, but maybe
that's an acceptable trade-off, because it shouldn't happen very
often.  If we could guararantee that identifiers less than 64
characters which are not overloaded and contain no special characters
requiring quoting end up in an eponymous file, I think that would be
good enough to make most of our users pretty happy.  In other cases, I
think the point would just be to make it work (with a funny name)
rather than fail.

> \i /home/postgres/database/gluepay-split/public/TYPE/dblink_pkey_results.sql
> \i /home/postgres/database/gluepay-split/public/TYPE/r_matchedwithdrawal.sql
> \i /home/postgres/database/gluepay-split/public/TYPE/r_unapprovedwithdrawal.sql
> \i /home/postgres/database/gluepay-split/public/TYPE/ukaccountvalidationchecktype.sql
> \i /home/postgres/database/gluepay-split/aml/FUNCTION/check_name.sql
> \i /home/postgres/database/gluepay-split/aml/FUNCTION/describe_entityid.sql
> \i /home/postgres/database/gluepay-split/aml/FUNCTION/get_linkid.sql
> \i /home/postgres/database/gluepay-split/aml/FUNCTION/set_address.sql
> -- ... all the objects ..
> \i /home/postgres/database/gluepay-split/public/FK_CONSTRAINT/workershistory_workerid_fkey.sql
> \i /home/postgres/database/gluepay-split/public/FK_CONSTRAINT/workershistory_workerstatusid_fkey.sql
> \i /home/postgres/database/gluepay-split/public/FK_CONSTRAINT/workershistory_workertypeid_fkey.sql

It would be better to use \ir here rather than hard-code path names, I
think.  Then you'd only need to require that all the files be in the
same directory, rather than requiring them to be at a certain
hard-coded place in the filesystem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Schema version management

От
Joel Jacobson
Дата:
Robert, thank you for keeping this thread alive.

Hopefully some more will join the discussion.

I'm still hopeful the community can manage to agree upon acceptable
tradeoffs and work-arounds to make this possible.

I think the benefits clearly outweighs the minor issues of filenames,
dumping order, etc.

On Tue, Jun 26, 2012 at 6:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I don't think either of these problems ought to be a complete
show-stopper.  It seems to me that the trade-off is that when object
names are long, contain special characters, or are overloaded, we'll
have to munge the names in some way to prevent collisions.  That could
mean that the names are not 100% stable, which would possibly produce
some annoyance if you're using a VCS to track changes, but maybe
that's an acceptable trade-off, because it shouldn't happen very
often.  If we could guararantee that identifiers less than 64
characters which are not overloaded and contain no special characters
requiring quoting end up in an eponymous file, I think that would be
good enough to make most of our users pretty happy.  In other cases, I
think the point would just be to make it work (with a funny name)
rather than fail.

I agree. It's not a problem if the filename is not identical to the name of
the object, as long as the same name generates the same filename on
all architectures. Url escaping would work, but converting all non-ascii
characters to ascii would be nicer, and dropping any problematic characters,
or replacing them with "_" or any other suitable character.

For the small fraction of users how have managed to find a good reason
to name a function "this/is\a/good.name/of/a\function.." the filename
of such a function would be "this_is_a_good_name_of_a_function".

As long as the objects are dumped in the same order, there will be no
merge problems when two developers commit changes of the same
file. I think pg_dump does a reasonable job already making sure the order is
always the same. How big is the problem, really?

It would of course be a little easier to keep track of changes and do merging
if all overloaded functions would be kept in separate files, but I see that as a
minor feature request. As long as all objects with the same name are kept in
separate files, that's good enough for my needs, and I have _a lot_ of functions,
whereof quite a few are overloaded.

 

> \i /home/postgres/database/gluepay-split/aml/FUNCTION/get_linkid.sql
> \i /home/postgres/database/gluepay-split/aml/FUNCTION/set_address.sql

It would be better to use \ir here rather than hard-code path names, I
think.  Then you'd only need to require that all the files be in the
same directory, rather than requiring them to be at a certain
hard-coded place in the filesystem.

I fully agree!
I didn't know about the \ir feature.

Best regards,

Joel Jacobson

Re: Schema version management

От
Peter Eisentraut
Дата:
On ons, 2012-06-27 at 10:02 +0200, Joel Jacobson wrote:
> Robert, thank you for keeping this thread alive.

> Hopefully some more will join the discussion.

> I'm still hopeful the community can manage to agree upon acceptable
> tradeoffs and work-arounds to make this possible.

I think this idea has merit.  Prepare a patch and put it into the next
commit fest.
> 
> I think the benefits clearly outweighs the minor issues of filenames,
> dumping order, etc.

I see the problem that since the dump order is in general not
deterministic, this will cause random reordering in your master file
that includes all the individual files.

Then again, making the dump order deterministic is a problem that can be
solved (I suppose), so maybe starting there would be a good step.  But
it will require a small amount of in-depth pg_dump hacking.





Re: Schema version management

От
Joel Jacobson
Дата:
On Tue, Jul 3, 2012 at 7:49 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
I think this idea has merit.  Prepare a patch and put it into the next
commit fest. 

Glad to hear, I'm on it!
 
I see the problem that since the dump order is in general not
deterministic, this will cause random reordering in your master file
that includes all the individual files. 

Then again, making the dump order deterministic is a problem that can be
solved (I suppose), so maybe starting there would be a good step.  But
it will require a small amount of in-depth pg_dump hacking.

I just made a test, where I created objects in different order and compared the dumps.
It appears pg_dump dumps objects in alphabetically sorted order.
This works fine for most objects, but not for overloaded functions, in which case
they are dumped in oid order.

Are there any other cases than overloaded functions, where the dump order isn't deterministic?

While waiting for your reply, I'll be working on fixing the problem with overloaded functions.

Re: Schema version management

От
Robert Haas
Дата:
On Wed, Jul 4, 2012 at 9:02 AM, Joel Jacobson <joel@trustly.com> wrote:
> On Tue, Jul 3, 2012 at 7:49 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
>>
>> I think this idea has merit.  Prepare a patch and put it into the next
>> commit fest.
>
> Glad to hear, I'm on it!
>
>>
>> I see the problem that since the dump order is in general not
>> deterministic, this will cause random reordering in your master file
>> that includes all the individual files.
>>
>>
>> Then again, making the dump order deterministic is a problem that can be
>> solved (I suppose), so maybe starting there would be a good step.  But
>> it will require a small amount of in-depth pg_dump hacking.
>
>
> I just made a test, where I created objects in different order and compared
> the dumps.
> It appears pg_dump dumps objects in alphabetically sorted order.
> This works fine for most objects, but not for overloaded functions, in which
> case
> they are dumped in oid order.
>
> Are there any other cases than overloaded functions, where the dump order
> isn't deterministic?
>
> While waiting for your reply, I'll be working on fixing the problem with
> overloaded functions.

My vote is - when there's an overloaded function, put each version in
its own file.  And name the files something like
functionname_something.sql.  And just document that something may not
be entirely stable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Schema version management

От
Joel Jacobson
Дата:
On Thu, Jul 5, 2012 at 2:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:
My vote is - when there's an overloaded function, put each version in
its own file.  And name the files something like
functionname_something.sql.  And just document that something may not
be entirely stable.

I would agree that's better if the dump order isn't deterministic.

However, it looks like an easy fix to make the dump order deterministic:

If the dump order is deterministic, I think its cleaner to put all versions in the same file.

Benefits:
+ Pretty looking filename
+ Same file structure for all object types, no special exception for functions

Re: Schema version management

От
Gurjeet Singh
Дата:
On Thu, Jul 5, 2012 at 3:15 AM, Joel Jacobson <joel@trustly.com> wrote:
On Thu, Jul 5, 2012 at 2:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:
My vote is - when there's an overloaded function, put each version in
its own file.  And name the files something like
functionname_something.sql.  And just document that something may not
be entirely stable.

I would agree that's better if the dump order isn't deterministic.

However, it looks like an easy fix to make the dump order deterministic:

If the dump order is deterministic, I think its cleaner to put all versions in the same file.

Benefits:
+ Pretty looking filename
+ Same file structure for all object types, no special exception for functions

I think there's a merit to keeping all overloaded variations of a function in a single file, apart from the simplicity and benefits noted above. A change in one variation of the function may also be applicable to other variations, say in bug-fixes or enhancements. So keeping all variations in one file would make sense, since it is logically one object.

Best regards,
--
Gurjeet Singh
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Re: Schema version management

От
Andrew Dunstan
Дата:


On Thu, Jul 5, 2012 at 4:04 AM, Gurjeet Singh <singh.gurjeet@gmail.com> wrote:
On Thu, Jul 5, 2012 at 3:15 AM, Joel Jacobson <joel@trustly.com> wrote:
On Thu, Jul 5, 2012 at 2:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:
My vote is - when there's an overloaded function, put each version in
its own file.  And name the files something like
functionname_something.sql.  And just document that something may not
be entirely stable.

I would agree that's better if the dump order isn't deterministic.

However, it looks like an easy fix to make the dump order deterministic:

If the dump order is deterministic, I think its cleaner to put all versions in the same file.

Benefits:
+ Pretty looking filename
+ Same file structure for all object types, no special exception for functions

I think there's a merit to keeping all overloaded variations of a function in a single file, apart from the simplicity and benefits noted above. A change in one variation of the function may also be applicable to other variations, say in bug-fixes or enhancements. So keeping all variations in one file would make sense, since it is logically one object.


No they are not necessarily one logical unit. You could have a bunch of functions called, say, "equal" which have pretty much nothing to do with each other, since they refer to different types. 

+1 from me for putting one function definition per file.

cheers

andrew 

Re: Schema version management

От
Michael Glaesemann
Дата:
On Jul 5, 2012, at 9:21, Andrew Dunstan wrote:

> No they are not necessarily one logical unit. You could have a bunch of
> functions called, say, "equal" which have pretty much nothing to do with
> each other, since they refer to different types.
>
> +1 from me for putting one function definition per file.

+1. It might make sense to include some sort of argument type information. The function signature is
really its identifier. The function name is only part of it.

Michael Glaesemann
grzm seespotcode net





Re: Schema version management

От
Vik Reykja
Дата:
On Thu, Jul 5, 2012 at 3:32 PM, Michael Glaesemann <grzm@seespotcode.net> wrote:

On Jul 5, 2012, at 9:21, Andrew Dunstan wrote:

> No they are not necessarily one logical unit. You could have a bunch of
> functions called, say, "equal" which have pretty much nothing to do with
> each other, since they refer to different types.
>
> +1 from me for putting one function definition per file.

+1. It might make sense to include some sort of argument type information. The function signature is
really its identifier. The function name is only part of it.

I'll go against the flow here.  I would prefer to have all overloaded functions in the same file.

Re: Schema version management

От
Joel Jacobson
Дата:
Maybe it could be made an option to pg_dump?

Some users and their systems might not even have overloaded functions,
and these users will of course prefer a nice looking filename, i.e. all functions
having the same name kept in the same file. Which for them will mean only
one function per file anyway.

pg_dump --split --overloaded-functions-to-same-file

Other users and their systems might have a lot of overloaded functions,
like the equal() example mentioned, they will of course prefer to keep
all functions in separate files.

pg_dump --split --overloaded-functions-to-separate-files

Then, one can discuss which one should be the default option for --split,
I would prefer the same file variant, and think most other users would too,
except for users with a lot of overloaded functions.

Re: Schema version management

От
Tom Lane
Дата:
Joel Jacobson <joel@trustly.com> writes:
> Maybe it could be made an option to pg_dump?

Ick.  Then we have to deal with all the downsides of *both* methods.

pg_dump is already a bloated, nearly unmaintainable mess.  The very
last thing it needs is even more options.
        regards, tom lane


Re: Schema version management

От
Joel Jacobson
Дата:
On Thu, Jul 5, 2012 at 4:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Ick.  Then we have to deal with all the downsides of *both* methods.

pg_dump is already a bloated, nearly unmaintainable mess.  The very
last thing it needs is even more options.

When you say bloated, are you referring to the code or the command line interface?

If you are referring to the code, I don't think that's a good argument against implementing new good features.
The important ratio is the value of a feature compared to the increased complexity.
In this case, it's very simple to implement both the --split option and the fixing of dump order.
I'm not even a C coder and managed to implement it within less of an hour effective coding.
We are talking ~100 lines of code, with comments and everything.

If you are referring to the command line interface and think it is bloated, maybe the options should be hidden in the normal --help.
We could create a new --help-advanced text, where we could put these options, and all other existing less common options.
I think this is a quite common and good way to handle the situation for UNIX command line tools.

Re: Schema version management

От
Alvaro Herrera
Дата:
Excerpts from Tom Lane's message of jue jul 05 10:46:52 -0400 2012:
> Joel Jacobson <joel@trustly.com> writes:
> > Maybe it could be made an option to pg_dump?
>
> Ick.  Then we have to deal with all the downsides of *both* methods.
>
> pg_dump is already a bloated, nearly unmaintainable mess.  The very
> last thing it needs is even more options.

Agreed.

However I am also against what seems to be the flow.  Normally, you
don't write overloaded plpgsql functions such as "equal".  Case in
point, the equality functions in core have funny names like "int4eq" and
so on.  Instead, at least in my experience, the overloaded functions
people seem to have in their databases are like do_stuff_to_foobars()
and you have one version for foos and another one for bars.

If you're doing lots of equality functions, surely it would make more
sense to package them up as an extension anyway along with all the other
thingies you need for the type you're supposedly writing.  So it's a
completely different market than what we're aiming at here.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Schema version management

От
Joel Jacobson
Дата:
On Thu, Jul 5, 2012 at 5:17 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote:
Agreed.

However I am also against what seems to be the flow.  Normally, you
don't write overloaded plpgsql functions such as "equal".  Case in
point, the equality functions in core have funny names like "int4eq" and
so on.  Instead, at least in my experience, the overloaded functions
people seem to have in their databases are like do_stuff_to_foobars()
and you have one version for foos and another one for bars.

If you're doing lots of equality functions, surely it would make more
sense to package them up as an extension anyway along with all the other
thingies you need for the type you're supposedly writing.  So it's a
completely different market than what we're aiming at here.

True, very true, I didn't think about that, you are right, I fully agree.
My vote is therefore on the "put all overloaded functions in the same file" variant.

Re: Schema version management

От
Michael Glaesemann
Дата:
On Jul 5, 2012, at 11:17, Alvaro Herrera wrote:

>
> Excerpts from Tom Lane's message of jue jul 05 10:46:52 -0400 2012:
>> Joel Jacobson <joel@trustly.com> writes:
>>> Maybe it could be made an option to pg_dump?
>>
>> Ick.  Then we have to deal with all the downsides of *both* methods.
>>
>> pg_dump is already a bloated, nearly unmaintainable mess.  The very
>> last thing it needs is even more options.
>
> Agreed.
>
> However I am also against what seems to be the flow.  Normally, you
> don't write overloaded plpgsql functions such as "equal".

I often write functions that perform fetches based on different criteria.
For example,

-- returns count of all orders for the given customer
int function order_count(in_customer_name text)

-- returns count of all orders for the given customer since the given timestamp
int function order_count(in_customer_name text, in_since timestamp with time zone)

-- returns count of orders for the given customer during a given interval
int function order_count(in_customer_name text, in_from timestamp with time zone, in_through timestamp with time zone)


Or, I'll write overloaded functions, one of which provides default values.

-- returns the set of members whose birthday is today. Calls birthday_members(CURRENT_DATE)
setof record function birthday_members()

-- returns the set of members whose birthday is on the given date, which makes testing a lot easier
setof record function birthday_members(in_date DATE)

Some may disagree that this is a "proper" usage of function overloading.
Some may even argue that function names shouldn't be overloaded at all.
However, I find this usage of function name overloading useful, especially
for keeping function names relatively short.

If we're dumping objects (tables, views, functions, what-have-you) into separate files,
each of these functions is a separate object and should be in its own file.

Michael Glaesemann
grzm seespotcode net





Re: Schema version management

От
Alvaro Herrera
Дата:
Excerpts from Michael Glaesemann's message of jue jul 05 11:36:51 -0400 2012:
>
> On Jul 5, 2012, at 11:17, Alvaro Herrera wrote:

> > However I am also against what seems to be the flow.  Normally, you
> > don't write overloaded plpgsql functions such as "equal".
>
> I often write functions that perform fetches based on different criteria.
> For example,
>
> -- returns count of all orders for the given customer
> int function order_count(in_customer_name text)
>
> -- returns count of all orders for the given customer since the given timestamp
> int function order_count(in_customer_name text, in_since timestamp with time zone)
>
> -- returns count of orders for the given customer during a given interval
> int function order_count(in_customer_name text, in_from timestamp with time zone, in_through timestamp with time
zone) 

Isn't this a perfect example of stuff that, since it does much the same
thing, should be in the same file so that you remember to fix them all
together if you find a bug in one?

> Or, I'll write overloaded functions, one of which provides default values.
>
> -- returns the set of members whose birthday is today. Calls birthday_members(CURRENT_DATE)
> setof record function birthday_members()
>
> -- returns the set of members whose birthday is on the given date, which makes testing a lot easier
> setof record function birthday_members(in_date DATE)

Same.  This seems particularly the case if one implementation calls
another, more general one.

> Some may disagree that this is a "proper" usage of function overloading.
> Some may even argue that function names shouldn't be overloaded at all.
> However, I find this usage of function name overloading useful, especially
> for keeping function names relatively short.

I completely agree.

> If we're dumping objects (tables, views, functions, what-have-you) into separate files,
> each of these functions is a separate object and should be in its own file.

Clearly there is no consensus here.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Schema version management

От
Michael Glaesemann
Дата:
On Jul 5, 2012, at 11:52, Alvaro Herrera wrote:

> Isn't this a perfect example of stuff that, since it does much the same
> thing, should be in the same file so that you remember to fix them all
> together if you find a bug in one?

That's what tests are for.

Michael Glaesemann
grzm seespotcode net





Re: Schema version management

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Excerpts from Michael Glaesemann's message of jue jul 05 11:36:51 -0400 2012:
>> If we're dumping objects (tables, views, functions, what-have-you) into separate files,
>> each of these functions is a separate object and should be in its own file.

> Clearly there is no consensus here.

FWIW, I'm attracted to the all-similarly-named-functions-together
method, mainly because it dodges the problem of how to encode a
function's argument list into a filename.  However, we're being
short-sighted to only think of functions here.  What about operators?
Or casts?  Those don't have simple names either.
        regards, tom lane


Re: Schema version management

От
Joel Jacobson
Дата:
On Thu, Jul 5, 2012 at 5:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
FWIW, I'm attracted to the all-similarly-named-functions-together
method, mainly because it dodges the problem of how to encode a
function's argument list into a filename.  However, we're being
short-sighted to only think of functions here.  What about operators?
Or casts?  Those don't have simple names either.

Someone suggested to urlencode them. I think that's a quite good solution.

Personally, I don't have any user-defined operators or casts. Don't know
how common it is in general, but it must of course work for these as well.

Re: Schema version management

От
Tom Lane
Дата:
Joel Jacobson <joel@trustly.com> writes:
> On Thu, Jul 5, 2012 at 4:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> pg_dump is already a bloated, nearly unmaintainable mess.  The very
>> last thing it needs is even more options.

> If you are referring to the code, I don't think that's a good argument
> against implementing new good features.
> The important ratio is the value of a feature compared to the increased
> complexity.

Well, to be perfectly frank, I already doubt that this entire feature
passes the complexity-versus-value test, because pg_dump is not a
substitute for an SCM --- people who have got enough functions to need
this sort of thing need to be keeping them somewhere else than in dump
files.  Complicating things more by supporting multiple ways of doing it
will make that worse.  I think you need to pick one design and stick
with it, not try to paint the bikeshed every color suggested by anybody.
        regards, tom lane


Re: Schema version management

От
Christopher Browne
Дата:
On Thu, Jul 5, 2012 at 11:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> Excerpts from Michael Glaesemann's message of jue jul 05 11:36:51 -0400 2012:
>>> If we're dumping objects (tables, views, functions, what-have-you) into separate files,
>>> each of these functions is a separate object and should be in its own file.
>
>> Clearly there is no consensus here.
>
> FWIW, I'm attracted to the all-similarly-named-functions-together
> method, mainly because it dodges the problem of how to encode a
> function's argument list into a filename.  However, we're being
> short-sighted to only think of functions here.  What about operators?
> Or casts?  Those don't have simple names either.

If you stow them all together, that still leaves a question as to
whether or not they get stored in a deterministic order.

I was recently working on something of the same issue as part of a
schema differencing tool.  It was pointedly *not* sufficient to use
the internal name (e.g. - information_schema.routines.specific_name),
as I wanted to compare things between databases, and it's pretty
certain that oids will differ.

I wound up expanding the function arguments and using function + args
as the name.  That leads to a risk of rather long names for functions,
but there aren't many other ways possible.

Note that pg_autodoc <http://www.rbt.ca/autodoc/> takes a similar
approach; it attaches function labels based on function + args.

Here's an expanded example in the Slony docs:
<http://slony.info/documentation/2.1/function.ddlscript-prepare-int-p-only-on-node-integer-p-set-id-integer.html>

I wouldn't mind stowing functions together in one file, and I'd
actually not get too bent out of shape if the order was somewhat
nondeterministic.  But something like the autodoc naming seems like
the unambiguous answer.  Long, but unambiguous...
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: Schema version management

От
Joel Jacobson
Дата:
On Thu, Jul 5, 2012 at 6:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Well, to be perfectly frank, I already doubt that this entire feature
passes the complexity-versus-value test, because pg_dump is not a
substitute for an SCM --- people who have got enough functions to need
this sort of thing need to be keeping them somewhere else than in dump
files.  Complicating things more by supporting multiple ways of doing it
will make that worse.  I think you need to pick one design and stick
with it, not try to paint the bikeshed every color suggested by anybody.

I agree it should be one option only, and again I think the one file variant is best.

This is indeed not a substitute for an SCM, but a nice complement.

Personally, I use this feature already to commit the schema for all versions
of my databases (production, test, development) into a git repo every minute.
It only commits if something has changed.

This makes it super easy to compare the schema of the actual production database
between different points in time.

This would not be possible if only manually committing stuff to the normal git repo,
where I also have all the functions, which I modify when developing and testing.

pg_dump -> git means you can be 100% certain version X of the schema was
active in the production database at date/time T.

Re: Schema version management

От
Alvaro Herrera
Дата:
Excerpts from Christopher Browne's message of jue jul 05 12:10:09 -0400 2012:

> I wound up expanding the function arguments and using function + args
> as the name.  That leads to a risk of rather long names for functions,
> but there aren't many other ways possible.

Well, maybe not many, but you don't need many, only some.  You could
stringify the list of arguments and use a hash of the string.  That's
also unambiguous and the length is constrained, regardless of the number
of args.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Schema version management

От
"David E. Wheeler"
Дата:
On Jul 5, 2012, at 3:21 PM, Andrew Dunstan wrote:

> No they are not necessarily one logical unit. You could have a bunch of
> functions called, say, "equal" which have pretty much nothing to do with
> each other, since they refer to different types.
>
> +1 from me for putting one function definition per file.

+1 for an option (I prefer one file for my projects, but might need multiple files for other projects).

David



Re: Schema version management

От
Aidan Van Dyk
Дата:
On Thu, Jul 5, 2012 at 12:57 PM, David E. Wheeler <david@justatheory.com> wrote:
> On Jul 5, 2012, at 3:21 PM, Andrew Dunstan wrote:
>
>> No they are not necessarily one logical unit. You could have a bunch of
>> functions called, say, "equal" which have pretty much nothing to do with
>> each other, since they refer to different types.
>>
>> +1 from me for putting one function definition per file.
>
> +1 for an option (I prefer one file for my projects, but might need multiple files for other projects).

-1

I'd rather have the few overloaded-functions in one file (hopefully
with deterministic ordering) and a sane, simple filename, than have
every function in every database in a separate file with some strange
mess in the filename that makes me cringe every time I see it.

a.

-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


Re: Schema version management

От
Josh Berkus
Дата:
> I'd rather have the few overloaded-functions in one file (hopefully
> with deterministic ordering) and a sane, simple filename, than have
> every function in every database in a separate file with some strange
> mess in the filename that makes me cringe every time I see it.

Having tried it both ways, for an existing production project, I agree
with Adrian.  One file, multiple functions.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: Schema version management

От
Dimitri Fontaine
Дата:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> However I am also against what seems to be the flow.  Normally, you
> don't write overloaded plpgsql functions such as "equal".  Case in
> point, the equality functions in core have funny names like "int4eq" and
> so on.  Instead, at least in my experience, the overloaded functions
> people seem to have in their databases are like do_stuff_to_foobars()
> and you have one version for foos and another one for bars.

+1

I too want to have my overloaded functions all in the same file, as much
as to have made that the only behavior in getddl.py:
 https://github.com/dimitri/getddl

> If you're doing lots of equality functions, surely it would make more
> sense to package them up as an extension anyway along with all the other
> thingies you need for the type you're supposedly writing.  So it's a
> completely different market than what we're aiming at here.

+1

Tom Lane <tgl@sss.pgh.pa.us> writes:
> FWIW, I'm attracted to the all-similarly-named-functions-together
> method, mainly because it dodges the problem of how to encode a
> function's argument list into a filename.  However, we're being
> short-sighted to only think of functions here.  What about operators?
> Or casts?  Those don't have simple names either.

I would argue like Álvaro that when dealing with operators and casts
you're probably writing an extension already, and we're providing
another way to deal with that.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support


Re: Schema version management

От
Christopher Browne
Дата:
On Thu, Jul 5, 2012 at 5:52 PM, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> However I am also against what seems to be the flow.  Normally, you
>> don't write overloaded plpgsql functions such as "equal".  Case in
>> point, the equality functions in core have funny names like "int4eq" and
>> so on.  Instead, at least in my experience, the overloaded functions
>> people seem to have in their databases are like do_stuff_to_foobars()
>> and you have one version for foos and another one for bars.
>
> +1
>
> I too want to have my overloaded functions all in the same file, as much
> as to have made that the only behavior in getddl.py:

That seems pretty appropriate to me.

The converse makes my head hurt...

If I have a bunch of overloaded functions, whose definitions *aren't*
really related, are we competing for the "obfuscated PostgreSQL"
contest?

In practice, that sounds like something I'd want to add to my list of
"fire people that do this!" Bad Practices.

>> If you're doing lots of equality functions, surely it would make more
>> sense to package them up as an extension anyway along with all the other
>> thingies you need for the type you're supposedly writing.  So it's a
>> completely different market than what we're aiming at here.
>
> +1
+1

> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> FWIW, I'm attracted to the all-similarly-named-functions-together
>> method, mainly because it dodges the problem of how to encode a
>> function's argument list into a filename.  However, we're being
>> short-sighted to only think of functions here.  What about operators?
>> Or casts?  Those don't have simple names either.
>
> I would argue like Álvaro that when dealing with operators and casts
> you're probably writing an extension already, and we're providing
> another way to deal with that.

Indeed.  Is this something we ought to document as a recommendation?
It's not exactly reference material, but if it's a good practice,
perhaps it should be in the manuals somewhere...
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: Schema version management

От
Tom Lane
Дата:
Christopher Browne <cbbrowne@gmail.com> writes:
> On Thu, Jul 5, 2012 at 5:52 PM, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
>> Tom Lane <tgl@sss.pgh.pa.us> writes:
>>> FWIW, I'm attracted to the all-similarly-named-functions-together
>>> method, mainly because it dodges the problem of how to encode a
>>> function's argument list into a filename.  However, we're being
>>> short-sighted to only think of functions here.  What about operators?
>>> Or casts?  Those don't have simple names either.

>> I would argue like �lvaro that when dealing with operators and casts
>> you're probably writing an extension already, and we're providing
>> another way to deal with that.

> Indeed.  Is this something we ought to document as a recommendation?

This argument seems a bit irrelevant to me.  pg_dump doesn't get to pick
and choose what will be in the database it's told to dump.  If we're
going to do something like what Joel wants, we have to have file naming
conventions for operator and cast objects.  So we can't just leave them
out of the conversation (or if we do, we shouldn't be surprised when the
ensuing design sucks).
        regards, tom lane


Re: Schema version management

От
Dimitri Fontaine
Дата:
Tom Lane <tgl@sss.pgh.pa.us> writes:
> This argument seems a bit irrelevant to me.  pg_dump doesn't get to pick
> and choose what will be in the database it's told to dump.  If we're

Sure.

> going to do something like what Joel wants, we have to have file naming
> conventions for operator and cast objects.  So we can't just leave them
> out of the conversation (or if we do, we shouldn't be surprised when the
> ensuing design sucks).

I guess what we're saying is that at this point we can pick non user
friendly naming rules, like pg_operator/<oid>.sql or something like
that, OID based. Impacted users might as well learn about extensions.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support


Re: Schema version management

От
Robert Haas
Дата:
On Fri, Jul 6, 2012 at 8:23 AM, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> This argument seems a bit irrelevant to me.  pg_dump doesn't get to pick
>> and choose what will be in the database it's told to dump.  If we're
>
> Sure.
>
>> going to do something like what Joel wants, we have to have file naming
>> conventions for operator and cast objects.  So we can't just leave them
>> out of the conversation (or if we do, we shouldn't be surprised when the
>> ensuing design sucks).
>
> I guess what we're saying is that at this point we can pick non user
> friendly naming rules, like pg_operator/<oid>.sql or something like
> that, OID based. Impacted users might as well learn about extensions.

I think that would defeat some of the human-readability goals that
people have for this feature, not to mention that it would lose the
ability to do diff -r between a dump produced on cluster A and a dump
produced on cluster B.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Schema version management

От
"Marc Mamin"
Дата:
Hello,

Is it imaginable to additionally generate an index file that map the
"half friendly" file names to a cleartext object signature ?

This would allow user to possibly postprocess the output while merging
overloaded functions to single files or renaming the files according to
their needs and preferences...

best regards,

Marc Mamin



> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
> owner@postgresql.org] On Behalf Of Robert Haas
> Sent: Freitag, 6. Juli 2012 15:02
> To: Dimitri Fontaine
> Cc: Tom Lane; Christopher Browne; Pg Hackers
> Subject: Re: [HACKERS] Schema version management
>
> On Fri, Jul 6, 2012 at 8:23 AM, Dimitri Fontaine
> <dimitri@2ndquadrant.fr> wrote:
> > Tom Lane <tgl@sss.pgh.pa.us> writes:
> >> This argument seems a bit irrelevant to me.  pg_dump doesn't get to
> pick
> >> and choose what will be in the database it's told to dump.  If
we're
> >
> > Sure.
> >
> >> going to do something like what Joel wants, we have to have file
> naming
> >> conventions for operator and cast objects.  So we can't just leave
> them
> >> out of the conversation (or if we do, we shouldn't be surprised
when
> the
> >> ensuing design sucks).
> >
> > I guess what we're saying is that at this point we can pick non user
> > friendly naming rules, like pg_operator/<oid>.sql or something like
> > that, OID based. Impacted users might as well learn about
extensions.
>
> I think that would defeat some of the human-readability goals that
> people have for this feature, not to mention that it would lose the
> ability to do diff -r between a dump produced on cluster A and a dump
> produced on cluster B.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


Re: Schema version management

От
Peter Eisentraut
Дата:
On tor, 2012-07-05 at 23:52 +0200, Dimitri Fontaine wrote:
> I would argue like Álvaro that when dealing with operators and casts
> you're probably writing an extension already, and we're providing
> another way to deal with that.

I have code in the wild that defines new operators and casts and has no
C code and is not in an extension and has no business being in an
extension.



Re: Schema version management

От
Aidan Van Dyk
Дата:
On Fri, Jul 6, 2012 at 4:50 PM, Peter Eisentraut <peter_e@gmx.net> wrote:

> I have code in the wild that defines new operators and casts and has no
> C code and is not in an extension and has no business being in an
> extension.

Nobody is claiming that pgdump shouldn't dump it.

But, since you're using operators, what would you think is an
appropriate name for the file the operator is dumped into?

a.


-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


Re: Schema version management

От
Alvaro Herrera
Дата:
Excerpts from Aidan Van Dyk's message of sáb jul 07 11:32:33 -0400 2012:
>
> On Fri, Jul 6, 2012 at 4:50 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
>
> > I have code in the wild that defines new operators and casts and has no
> > C code and is not in an extension and has no business being in an
> > extension.
>
> Nobody is claiming that pgdump shouldn't dump it.
>
> But, since you're using operators, what would you think is an
> appropriate name for the file the operator is dumped into?

I was thinking that it might make sense to group operators according to
the type(s) they operate on, somehow.  Using funny chars for names is
guaranteed to cause problems somewhere.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Schema version management

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Excerpts from Aidan Van Dyk's message of sáb jul 07 11:32:33 -0400 2012:
>> But, since you're using operators, what would you think is an
>> appropriate name for the file the operator is dumped into?

> I was thinking that it might make sense to group operators according to
> the type(s) they operate on, somehow.  Using funny chars for names is
> guaranteed to cause problems somewhere.

Sure.  You need not look further than "/" to find an operator name that
absolutely *will* cause trouble if it's dumped into a filename
literally.

I'm not especially thrilled by the idea of using url-encoding or
something like that for operator names, though.  Seems like it loses on
readability.

If we think that operators outside of extensions will be an infrequent
special case, what about just dumping all of them into a single file
named "operators"?  And similarly for casts?
        regards, tom lane


Re: Schema version management

От
Joel Jacobson
Дата:


On Saturday, July 7, 2012, Tom Lane wrote:
If we think that operators outside of extensions will be an infrequent
special case, what about just dumping all of them into a single file
named "operators"?  And similarly for casts?

                        regards, tom lane

 
+1

Re: Schema version management

От
Peter Eisentraut
Дата:
On lör, 2012-07-07 at 11:32 -0400, Aidan Van Dyk wrote:
> But, since you're using operators, what would you think is an
> appropriate name for the file the operator is dumped into?

The name of the operator, just like for any other object.  (Assuming
we're using the name of a table for the file for the table etc.)




Re: Schema version management

От
Peter Eisentraut
Дата:
On lör, 2012-07-07 at 17:18 -0400, Tom Lane wrote:
> Sure.  You need not look further than "/" to find an operator name that
> absolutely *will* cause trouble if it's dumped into a filename
> literally.

But that problem applies to all object names.

> If we think that operators outside of extensions will be an infrequent
> special case, what about just dumping all of them into a single file
> named "operators"?  And similarly for casts?

If we think they are an infrequent case, why make a fuss about it?  Just
treat them like any other object.

In practical terms, I dislike the particular solution proposed here.
For one thing, it would undermine the original purpose of this whole
thread, namely insulating dump output files from ordering differences.




Re: Schema version management

От
Tom Lane
Дата:
Peter Eisentraut <peter_e@gmx.net> writes:
> On lör, 2012-07-07 at 17:18 -0400, Tom Lane wrote:
>> Sure.  You need not look further than "/" to find an operator name that
>> absolutely *will* cause trouble if it's dumped into a filename
>> literally.

> But that problem applies to all object names.

In principle, yes, but in practice it's far more likely that operators
will have names requiring some sort of encoding than that objects with
SQL-identifier names will.

>> If we think that operators outside of extensions will be an infrequent
>> special case, what about just dumping all of them into a single file
>> named "operators"?  And similarly for casts?

> If we think they are an infrequent case, why make a fuss about it?  Just
> treat them like any other object.

> In practical terms, I dislike the particular solution proposed here.
> For one thing, it would undermine the original purpose of this whole
> thread, namely insulating dump output files from ordering differences.

That's a good point.  However, I think that there are no cases where
we'd have dependencies between operators (or between casts), so that
as long as the initial sort is well-defined for them, it shouldn't
really be an issue in practice.
        regards, tom lane


Re: Schema version management

От
Peter Eisentraut
Дата:
On sön, 2012-07-08 at 18:52 -0400, Tom Lane wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
> > On lör, 2012-07-07 at 17:18 -0400, Tom Lane wrote:
> >> Sure.  You need not look further than "/" to find an operator name
> that
> >> absolutely *will* cause trouble if it's dumped into a filename
> >> literally.
> 
> > But that problem applies to all object names.
> 
> In principle, yes, but in practice it's far more likely that operators
> will have names requiring some sort of encoding than that objects with
> SQL-identifier names will.

I'm not sure.  The only character that's certainly an issue is "/".  Are
there any others on file systems that we want to support?



Re: Schema version management

От
Andrew Dunstan
Дата:
On 07/10/2012 05:39 PM, Peter Eisentraut wrote:
> On sön, 2012-07-08 at 18:52 -0400, Tom Lane wrote:
>> Peter Eisentraut <peter_e@gmx.net> writes:
>>> On lör, 2012-07-07 at 17:18 -0400, Tom Lane wrote:
>>>> Sure.  You need not look further than "/" to find an operator name
>> that
>>>> absolutely *will* cause trouble if it's dumped into a filename
>>>> literally.
>>> But that problem applies to all object names.
>> In principle, yes, but in practice it's far more likely that operators
>> will have names requiring some sort of encoding than that objects with
>> SQL-identifier names will.
> I'm not sure.  The only character that's certainly an issue is "/".  Are
> there any others on file systems that we want to support?
>
>

In general, NTFS forbids the use of these printable ASCII chars in
filenames (see
<http://en.wikipedia.org/wiki/Filename#Comparison_of_filename_limitations>:


" * : < > ? \ / |


Many of these could be used in operators.

cheers

andrew


Re: Schema version management

От
Magnus Hagander
Дата:
On Tue, Jul 10, 2012 at 11:39 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On sön, 2012-07-08 at 18:52 -0400, Tom Lane wrote:
>> Peter Eisentraut <peter_e@gmx.net> writes:
>> > On lör, 2012-07-07 at 17:18 -0400, Tom Lane wrote:
>> >> Sure.  You need not look further than "/" to find an operator name
>> that
>> >> absolutely *will* cause trouble if it's dumped into a filename
>> >> literally.
>>
>> > But that problem applies to all object names.
>>
>> In principle, yes, but in practice it's far more likely that operators
>> will have names requiring some sort of encoding than that objects with
>> SQL-identifier names will.
>
> I'm not sure.  The only character that's certainly an issue is "/".  Are
> there any others on file systems that we want to support?

\ and : if we care at all about windows....

-- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/


Re: Schema version management

От
Merlin Moncure
Дата:
On Sun, May 20, 2012 at 2:41 PM, Joel Jacobson <joel@trustly.com> wrote:
> Hi,
>
> I just read a very interesting post about "schema version management".
>
> Quote: "You could set it up so that every developer gets their own
> test database, sets up the schema there, takes a dump, and checks that
> in. There are going to be problems with that, including that dumps
> produced by pg_dump are ugly and optimized for restoring, not for
> developing with, and they don't have a deterministic output order." (
> http://petereisentraut.blogspot.com/2012/05/my-anti-take-on-database-schema-version.html
> )
>
> Back in December 2010, I suggested a new option to pg_dump, --split,
> which would write the schema definition of each object in separate
> files:
>
> http://archives.postgresql.org/pgsql-hackers/2010-12/msg02285.php
>
> Instead of a huge plain text schema file, impossible to version
> control, all tables/sequences/views/functions are written to separate
> files, allowing the use of a version control software system, such as
> git, to do proper version controlling.
>
> The "deterministic output order" problem mentioned in the post above,
> is not a problem if each object (table/sequence/view/function/etc) is
> written to the same filename everytime.
> No matter the order, the tree of files and their content will be
> identical, no matter the order in which they are dumped.
>
> I remember a lot of hackers were very positive about this option, but
> we somehow failed to agree on the naming of files in the tree
> structure. I'm sure we can work that out though.
>
> I use this feature in production, I have a cronjob which does a dump
> of the schema every hour, committing any eventual changes to a
> separate git branch for each database installation, such as
> production, development and test.
> If no changes to the schema have been made, nothing will be committed
> to git since none of the files have changed.
>
> It is then drop-dead simple to diff two different branches of the
> database schema, such as development or production, or diffing
> different revisions allowing point-in-time comparison of the schema.
>
> This is an example of the otuput of a git log --summary for one of the
> automatic commits to our production database's git-repo:
>
> --
> commit 18c31f8162d851b0dac3bad7e80529ef2ed18be3
> Author: Production Database <production.database@trustly.com>
> Date:   Fri May 4 15:00:04 2012 +0200
>
>     Update of database schema Linux DB0 2.6.26-2-amd64 #1 SMP Wed Aug
> 19 22:33:18 UTC 2009 x86_64 GNU/Linux Fri, 04 May 2012 15:00:04 +0200
>
>  create mode 100644
> gluepay-split/public/CONSTRAINT/openingclosingbalances_pkey.sql
>  create mode 100644
> gluepay-split/public/CONSTRAINT/openingclosingbalances_source_key.sql
>  create mode 100644 gluepay-split/public/SEQUENCE/seqopeningclosingbalance.sql
>  create mode 100644 gluepay-split/public/TABLE/openingclosingbalances.sql
> --
>
> Here we can see we apparently deployed a new table,
> "openingclosingbalances" around Fri May 4 15:00:04.
>
> Without any manual work, I'm able to follow all changes actually
> _deployed_ in each database.
>
> At my company, a highly database-centric stored-procedure intensive
> business dealing with mission-critical monetary transactions, we've
> been using this technique to successfully do schema version management
> without any hassle for the last two years.
>
> Hopefully this can add to the list of various possible _useful_ schema
> version management methods.

What does your patch do that you can't already do with pg_restore?

create function foo(a int, b int, c text) returns int as $$ select 0;
$$ language sql;
CREATE FUNCTION

pg_dump -Fc postgres -s > postgres.dump
pg_restore -l postgres.dump  | grep FUNCTION
196; 1255 32939 FUNCTION public foo(integer, integer, text) merlin

pg_restore -P "foo(integer, integer, text)" postgres.dump
<function body follows>

it's fairly easy to wrap pg_restore with a smalls script that extracts
function bodies and writes them out to file names.  this is a great
and underused feature, so I'd argue that if you wanted to formalize
per object file extraction you should be looking at expanding
pg_restore, not pg_dump.

merlin


Re: Schema version management

От
Joel Jacobson
Дата:
On Wed, Jul 11, 2012 at 12:24 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
What does your patch do that you can't already do with pg_restore?

Please read through the entire thread, think question has already been answered.

Re: Schema version management

От
Merlin Moncure
Дата:
On Tue, Jul 10, 2012 at 5:24 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Sun, May 20, 2012 at 2:41 PM, Joel Jacobson <joel@trustly.com> wrote:
>> Hi,
>>
>> I just read a very interesting post about "schema version management".
>>
>> Quote: "You could set it up so that every developer gets their own
>> test database, sets up the schema there, takes a dump, and checks that
>> in. There are going to be problems with that, including that dumps
>> produced by pg_dump are ugly and optimized for restoring, not for
>> developing with, and they don't have a deterministic output order." (
>> http://petereisentraut.blogspot.com/2012/05/my-anti-take-on-database-schema-version.html
>> )
>>
>> Back in December 2010, I suggested a new option to pg_dump, --split,
>> which would write the schema definition of each object in separate
>> files:
>>
>> http://archives.postgresql.org/pgsql-hackers/2010-12/msg02285.php
>>
>> Instead of a huge plain text schema file, impossible to version
>> control, all tables/sequences/views/functions are written to separate
>> files, allowing the use of a version control software system, such as
>> git, to do proper version controlling.
>>
>> The "deterministic output order" problem mentioned in the post above,
>> is not a problem if each object (table/sequence/view/function/etc) is
>> written to the same filename everytime.
>> No matter the order, the tree of files and their content will be
>> identical, no matter the order in which they are dumped.
>>
>> I remember a lot of hackers were very positive about this option, but
>> we somehow failed to agree on the naming of files in the tree
>> structure. I'm sure we can work that out though.
>>
>> I use this feature in production, I have a cronjob which does a dump
>> of the schema every hour, committing any eventual changes to a
>> separate git branch for each database installation, such as
>> production, development and test.
>> If no changes to the schema have been made, nothing will be committed
>> to git since none of the files have changed.
>>
>> It is then drop-dead simple to diff two different branches of the
>> database schema, such as development or production, or diffing
>> different revisions allowing point-in-time comparison of the schema.
>>
>> This is an example of the otuput of a git log --summary for one of the
>> automatic commits to our production database's git-repo:
>>
>> --
>> commit 18c31f8162d851b0dac3bad7e80529ef2ed18be3
>> Author: Production Database <production.database@trustly.com>
>> Date:   Fri May 4 15:00:04 2012 +0200
>>
>>     Update of database schema Linux DB0 2.6.26-2-amd64 #1 SMP Wed Aug
>> 19 22:33:18 UTC 2009 x86_64 GNU/Linux Fri, 04 May 2012 15:00:04 +0200
>>
>>  create mode 100644
>> gluepay-split/public/CONSTRAINT/openingclosingbalances_pkey.sql
>>  create mode 100644
>> gluepay-split/public/CONSTRAINT/openingclosingbalances_source_key.sql
>>  create mode 100644 gluepay-split/public/SEQUENCE/seqopeningclosingbalance.sql
>>  create mode 100644 gluepay-split/public/TABLE/openingclosingbalances.sql
>> --
>>
>> Here we can see we apparently deployed a new table,
>> "openingclosingbalances" around Fri May 4 15:00:04.
>>
>> Without any manual work, I'm able to follow all changes actually
>> _deployed_ in each database.
>>
>> At my company, a highly database-centric stored-procedure intensive
>> business dealing with mission-critical monetary transactions, we've
>> been using this technique to successfully do schema version management
>> without any hassle for the last two years.
>>
>> Hopefully this can add to the list of various possible _useful_ schema
>> version management methods.
>
> What does your patch do that you can't already do with pg_restore?
>
> create function foo(a int, b int, c text) returns int as $$ select 0;
> $$ language sql;
> CREATE FUNCTION
>
> pg_dump -Fc postgres -s > postgres.dump
> pg_restore -l postgres.dump  | grep FUNCTION
> 196; 1255 32939 FUNCTION public foo(integer, integer, text) merlin
>
> pg_restore -P "foo(integer, integer, text)" postgres.dump
> <function body follows>
>
> it's fairly easy to wrap pg_restore with a smalls script that extracts
> function bodies and writes them out to file names.  this is a great
> and underused feature, so I'd argue that if you wanted to formalize
> per object file extraction you should be looking at expanding
> pg_restore, not pg_dump.

After extensive off-list discussion with Joel it became clear that
per-object dumping ability really belongs in pg_restore.  It already
has some machinery for that, and has the nice property that you can
pull objects out of dumps long after the fact, not just when the dump
happens.  It then remains to be worked out of pg_restore should be
given the ability to write directly to files as Joel was gunning for
or simply extended to improve upon the current TOC based facilities,
or both.

As noted, choosing a reversible unambiguous filename based on the
database primitive is nastiness of the first order.  For example,
besides the mentioned issues, some filesystems (windows) use case
insensitive entries.  What do you do about that?  Given that all the
reasonable escape characters are reserved or unusable for some reason,
pretty soon you'll arrive to the point of view that you need some sort
of proxy identifier in the filename to give uniqueness so that you can
ditch all the unpleasantness, just like as is done with relfilenode
(maybe using oid, I don't know).  I've worked on the exact same
problem as Joel, but in my case I was able to leverage known naming
conventions in the code in-core solution should not do that,
especially when it comes to backup/restore.

So, just thinking out loud here, maybe the way forward is to try and
tackle stuff in controversy order:
1) add more options to pg_restore to dump other stuff besides tables
and functions
2) add smarter object selectors for extraction to pg_restore ('all
functions', 'objects in list <list>', 'all schema objects') -- thereby
eliminating some of the necessary surrounding scripting, especially if
you can direct output to a program which handles the writing out, for
example something along the lines of
pg_restore --objects=ALL --handler="dumper.sh" schema.dump

Where dumper.sh is a user supplied program that consumes the output
and takes object name, class, etc as arguments.  Just thinking out
loud here, but maybe that's cleaner than hacking specific filename
conventions directly into pg_restore -- this punts the absolute
minimum to userland (what the filename is and where the file is to be
stored).

3) figure out a way to dump those selected objects in filenames not
supplied by the user.  It's tempting to relax  #3 so that filenames
are only guaranteed unique within a particular extraction but that
would foil SCM interaction I think.

Maybe #3 is not really solvable and a hybrid solution needs to be
worked out -- for example hashing the object signature for the
filename and dumping a TOC file along with the extraction:

__restore_toc.sql:
function foo(a int, b int) -> ABCDEF.sql
function "Foo"(a int, b int) -> 132456.sql

etc.  Not very pleasant, but at least internally consistent and SCM friendly.

merlin


Re: Schema version management

От
Peter Eisentraut
Дата:
On tis, 2012-07-10 at 17:54 -0400, Andrew Dunstan wrote:
> In general, NTFS forbids the use of these printable ASCII chars in 
> filenames (see 
> <http://en.wikipedia.org/wiki/Filename#Comparison_of_filename_limitations>:
> 
> " * : < > ? \ / |

> Many of these could be used in operators.

Yeah, that's a bummer.  Then I guess some escape mechanism would be OK.
I could imagine an operator < on a custom data type being dumped into a
file named operator_%3C.sql.  Still better than putting them all in one
file.

Of course, argument types need to be dealt with as well, just like with
functions (plus prefix/postfix).



Re: Schema version management

От
Alvaro Herrera
Дата:
Excerpts from Peter Eisentraut's message of mié jul 11 17:03:03 -0400 2012:
>
> On tis, 2012-07-10 at 17:54 -0400, Andrew Dunstan wrote:
> > In general, NTFS forbids the use of these printable ASCII chars in
> > filenames (see
> > <http://en.wikipedia.org/wiki/Filename#Comparison_of_filename_limitations>:
> >
> > " * : < > ? \ / |
>
> > Many of these could be used in operators.
>
> Yeah, that's a bummer.  Then I guess some escape mechanism would be OK.
> I could imagine an operator < on a custom data type being dumped into a
> file named operator_%3C.sql.  Still better than putting them all in one
> file.
>
> Of course, argument types need to be dealt with as well, just like with
> functions (plus prefix/postfix).

operator_!___numeric.sql (postfix, name does not need escape)
operator_%7C%2F_integer__.sql (prefix)
operator_%3C_bit_varying__bit_varying.sql (type name with spaces,
changed to _)

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Schema version management

От
Joel Jacobson
Дата:
On Wed, Jul 11, 2012 at 8:01 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
After extensive off-list discussion with Joel it became clear that
per-object dumping ability really belongs in pg_restore.

The only benefit I could see in putting it in pg_restore is you would then
be able to do a --split on already existing historical dumps.

On the other hand, it would require you to use both pg_dump and pg_restore,
instead of only pg_dump, which makes it a bit less user-friendly.

I haven't looked at how it could be implemented in pg_restore, if its even just
a little more complex, it's probably better to let pg_dump handle the task.

Re: Schema version management

От
Joel Jacobson
Дата:
On Wed, Jul 11, 2012 at 11:20 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote:
operator_!___numeric.sql (postfix, name does not need escape)
operator_%7C%2F_integer__.sql (prefix)
operator_%3C_bit_varying__bit_varying.sql (type name with spaces,
changed to _)


I think the directory structure [schema]/[type]/[name] should be the same
for all object types. I don't like "operator" being part of the filename,
it should be the directory name.

Re: Schema version management

От
Tom Lane
Дата:
Joel Jacobson <joel@trustly.com> writes:
> I think the directory structure [schema]/[type]/[name] should be the same
> for all object types. I don't like "operator" being part of the filename,
> it should be the directory name.

What are you going to do with objects that don't have schemas?
(Including, but not restricted to, the schemas themselves.)
        regards, tom lane


Re: Schema version management

От
Joel Jacobson
Дата:
On Thursday, July 12, 2012, Tom Lane wrote:
What are you going to do with objects that don't have schemas?
(Including, but not restricted to, the schemas themselves.)

Good question. Maybe something like this?

For objects without schema:
/global/[type]/[name].sql

For objects with schema:
/schema/[schema]/[type]/[name].sql


Re: Schema version management

От
Tom Lane
Дата:
Joel Jacobson <joel@trustly.com> writes:
> On Thursday, July 12, 2012, Tom Lane wrote:
>> What are you going to do with objects that don't have schemas?
>> (Including, but not restricted to, the schemas themselves.)

> Good question. Maybe something like this?

> For objects without schema:
> /global/[type]/[name].sql

> For objects with schema:
> /schema/[schema]/[type]/[name].sql

FWIW, I think you could save a level of naming if you were willing to
put the type first, since the type would imply whether the object
lives in a schema or not:
[type]/[name].sql[type]/[schema]/[name].sql
        regards, tom lane


Re: Schema version management

От
Joel Jacobson
Дата:
On Thu, Jul 12, 2012 at 4:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
FWIW, I think you could save a level of naming if you were willing to
put the type first, since the type would imply whether the object
lives in a schema or not:

        [type]/[name].sql
        [type]/[schema]/[name].sql


Could work. But I think it's more relevant and useful to keep all objects
in a schema in its own directory.

That way it's easier to get an overview of what's in a schema,
simply by looking at the file structure of the schema directory.

I think its more common you want to "show all objects within schema X"
than "show all schemas of type X".

PS.

I was thinking -- the guys back in the 70s must have spent a lot of time
thinking about the UNIX directory structure -- before they decided upon it.

I did some googling and found found this explanation which was quite
amusing to say the least :-)


Re: Schema version management

От
Andrew Dunstan
Дата:
On 07/12/2012 10:01 AM, Tom Lane wrote:
>
> FWIW, I think you could save a level of naming if you were willing to
> put the type first, since the type would imply whether the object
> lives in a schema or not:
>
>     [type]/[name].sql
>     [type]/[schema]/[name].sql
>
>             



That will destroy the property of having everything for a given schema 
collected together.

Arguably we should make a special case for the create statement of a 
schema, but I'm not even sure about that.

cheers

andrew


Re: Schema version management

От
Peter Eisentraut
Дата:
On ons, 2012-07-11 at 17:20 -0400, Alvaro Herrera wrote:
> operator_!___numeric.sql (postfix, name does not need escape)
> operator_%7C%2F_integer__.sql (prefix)
> operator_%3C_bit_varying__bit_varying.sql (type name with spaces,
> changed to _)

I'm not sure if it makes things better to escape some operator names and
some not.  It could easily become confusing.



Re: Schema version management

От
Peter Eisentraut
Дата:
On tor, 2012-07-12 at 16:14 +0200, Joel Jacobson wrote:
> Could work. But I think it's more relevant and useful to keep all objects
> in a schema in its own directory.

Personally, I hate this proposed nested directory structure.  I would
like to have all objects in one directory.

But there is a lot of "personally" in this thread, of course.

> I think its more common you want to "show all objects within schema X"
> than "show all schemas of type X".

Or maybe it isn't ...




Re: Schema version management

От
Joel Jacobson
Дата:
On Fri, Jul 13, 2012 at 9:41 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
Personally, I hate this proposed nested directory structure.  I would
like to have all objects in one directory.

But there is a lot of "personally" in this thread, of course.

Why do you hate it?

It's a bit like saying,
 - I hate database normalization, better to keep all rows in one single table.
or even,
 - I hate directories.

I have thousands of objects, it would be a total mess to keep them all in a single directory.

Using a normalized directory structure makes sense for the SCM use-case,
I haven't seen any projects where all the files are kept in one directory.

Re: Schema version management

От
Peter Eisentraut
Дата:
On lör, 2012-07-14 at 10:41 +0200, Joel Jacobson wrote:
> On Fri, Jul 13, 2012 at 9:41 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> 
> > Personally, I hate this proposed nested directory structure.  I would
> > like to have all objects in one directory.
> >
> > But there is a lot of "personally" in this thread, of course.
> 
> 
> Why do you hate it?
> 
> It's a bit like saying,
>  - I hate database normalization, better to keep all rows in one single
> table.
> or even,
>  - I hate directories.

To a certain extent, yes, I hate (excessive use of) directories.

> I have thousands of objects, it would be a total mess to keep them all in a
> single directory.

Thousands of objects could be a problem, in terms of how the typical
file system tools scale.  But hundreds or a few thousand not
necessarily.  It's easy to browse, filter, and sort using common tools,
for example.

> Using a normalized directory structure makes sense for the SCM use-case,

If there is a theory of "normalization" for hierarchical databases, I
don't know it but would like to learn about it.

> I haven't seen any projects where all the files are kept in one directory.

Well, of course everyone uses directories in moderation.  But you might
want to take a look at the gcc source code.  You'll love it. ;-)




Re: Schema version management

От
Joel Jacobson
Дата:
On Sat, Jul 14, 2012 at 11:25 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
> Well, of course everyone uses directories in moderation.  But you might
> want to take a look at the gcc source code.  You'll love it. ;-)

Yes, but GCC was also created by someone who picks stuff from his bare
foot and eats it. ;-)


Re: Schema version management

От
Joel Jacobson
Дата:
On Sat, Jul 14, 2012 at 12:34 PM, Joel Jacobson <joel@trustly.com> wrote:
> On Sat, Jul 14, 2012 at 11:25 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
>> Well, of course everyone uses directories in moderation.  But you might
>> want to take a look at the gcc source code.  You'll love it. ;-)

[505][joel.Joel-Jacobsons-iMac-2: /Users/joel/gcc]$ find . -type d | wc -l  41895
[506][joel.Joel-Jacobsons-iMac-2: /Users/joel/gcc]$ find . -type f | wc -l 167183
[507][joel.Joel-Jacobsons-iMac-2: /Users/joel/gcc]$

Not that bad actually, only 4 files per directory on average.