Обсуждение: Add --include-table-data-where option to pg_dump, to export only asubset of table data

Поиск
Список
Период
Сортировка

Add --include-table-data-where option to pg_dump, to export only asubset of table data

От
Carter Thaxton
Дата:
Many times I've wanted to export a subset of a database, using some sort of row filter condition on some of the large tables.  E.g. copying a production database to a staging environment, but with some time series data only from the past month.

We have the existing options:
  --include-table=table    (and its -t synonym)
  --exclude-table=table
  --exclude-table-data=table

I propose a new option:
  --include-table-data-where=table:filter_clause

One would use this option as follows:

  pg_dump --include-table-data-where=largetable:"created_at >= '2018-05-01'" database_name

The filter_clause is used as the contents of a WHERE clause when querying the data to generate the COPY statement produced by pg_dump.

I've prepared a proposed patch for this, which is attached.  The code changes are rather straightforward.  I did have to add the ability to carry around an extra pointer-sized object to the simple_list implementation, in order to allow the filter clause to be associated to the matching oids of the table pattern.  It seemed the best way to augment the existing simple_list implementation, but change as little as possible elsewhere in the codebase.  (Note that SimpleOidList is actually only used by pg_dump).

Feel free to review and propose any amendments.

Вложения

Re: Add --include-table-data-where option to pg_dump, to export onlya subset of table data

От
Stephen Frost
Дата:
Greetings,

* Carter Thaxton (carter.thaxton@gmail.com) wrote:
> Many times I've wanted to export a subset of a database, using some sort of
> row filter condition on some of the large tables.  E.g. copying a
> production database to a staging environment, but with some time series
> data only from the past month.
>
> We have the existing options:
>   --include-table=table    (and its -t synonym)
>   --exclude-table=table
>   --exclude-table-data=table
>
> I propose a new option:
>   --include-table-data-where=table:filter_clause
>
> One would use this option as follows:
>
>   pg_dump --include-table-data-where=largetable:"created_at >=
> '2018-05-01'" database_name
>
> The filter_clause is used as the contents of a WHERE clause when querying
> the data to generate the COPY statement produced by pg_dump.

I've wanted something similar to this in the past as well, and, as
you've seen, we have some support for this kind of thing in pg_dump
already and what you're doing is exposing that.

> I've prepared a proposed patch for this, which is attached.  The code
> changes are rather straightforward.  I did have to add the ability to carry
> around an extra pointer-sized object to the simple_list implementation, in
> order to allow the filter clause to be associated to the matching oids of
> the table pattern.  It seemed the best way to augment the existing
> simple_list implementation, but change as little as possible elsewhere in
> the codebase.  (Note that SimpleOidList is actually only used by pg_dump).
>
> Feel free to review and propose any amendments.

I've only taken a quick look but I don't see any regression tests, for
starters, and it's not clear if this can be passed multiple times for
one pg_dump run (I'd certainly hope that it could be...).

Also, if you haven't already, this should be registered on the
commitfest app, so we don't lose track of it.

Thanks!

Stephen

Вложения

Re: Add --include-table-data-where option to pg_dump, to export onlya subset of table data

От
Carter Thaxton
Дата:
Hello,

I've only taken a quick look but I don't see any regression tests, for
starters, and it's not clear if this can be passed multiple times for
one pg_dump run (I'd certainly hope that it could be...).

Yes, this will absolutely accept multiple options for one run, which is how I'd imagine it would typically be used.

In fact, for each table_pattern:filter_clause you provide as an option, it will apply a corresponding WHERE clause for *every* table that matches the table_pattern.
So if you happened to use a wildcard in the table_pattern, you could actually end up with multiple tables filtered by the same WHERE clause.

For example:
  pg_dump --include-table-data-where="table_*:created_at >= '2018-05-01'" --include-table-data-where="other_table:id < 100"  db_name

This will filter every table named "table_*", e.g. ["table_0", "table_1", "table_2", "table_associated"], each with "WHERE created_at >= '2018-05-01'", and it will also filter "other_table" with "WHERE id < 100".

Not sure how useful the wildcard feature is, but it matches the behavior of the other pg_dump options that specify tables, and came along for free by reusing that implementation.


Also, if you haven't already, this should be registered on the
commitfest app, so we don't lose track of it.

Done!

Re: Add --include-table-data-where option to pg_dump, to export onlya subset of table data

От
Thomas Munro
Дата:
On Tue, May 22, 2018 at 4:05 AM, Stephen Frost <sfrost@snowman.net> wrote:
> * Carter Thaxton (carter.thaxton@gmail.com) wrote:
>>   pg_dump --include-table-data-where=largetable:"created_at >=
>> '2018-05-01'" database_name
>
> I've wanted something similar to this in the past as well, and, as
> you've seen, we have some support for this kind of thing in pg_dump
> already and what you're doing is exposing that.

+1

>> I've prepared a proposed patch for this, which is attached.  The code
>> changes are rather straightforward.  I did have to add the ability to carry
>> around an extra pointer-sized object to the simple_list implementation, in
>> order to allow the filter clause to be associated to the matching oids of
>> the table pattern.  It seemed the best way to augment the existing
>> simple_list implementation, but change as little as possible elsewhere in
>> the codebase.  (Note that SimpleOidList is actually only used by pg_dump).
>>
>> Feel free to review and propose any amendments.
>
> I've only taken a quick look but I don't see any regression tests, for
> starters, and it's not clear if this can be passed multiple times for
> one pg_dump run (I'd certainly hope that it could be...).
>
> Also, if you haven't already, this should be registered on the
> commitfest app, so we don't lose track of it.

Thanks for doing that.  Unfortunately the patch seems to be corrupted
in some way, maybe ANSI control characters or something... perhaps you
set colour.ui = always in your git config, instead of auto?  You might
also consider using git format-patch so you can include a brief commit
message that explains the feature.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Add --include-table-data-where option to pg_dump, to export onlya subset of table data

От
Euler Taveira
Дата:
2018-05-20 20:48 GMT-03:00 Carter Thaxton <carter.thaxton@gmail.com>:
> Many times I've wanted to export a subset of a database, using some sort of
> row filter condition on some of the large tables.  E.g. copying a production
> database to a staging environment, but with some time series data only from
> the past month.
>
How would you handle foreign keys? It seems easier to produce a dump
that won't restore.

> We have the existing options:
>   --include-table=table    (and its -t synonym)
>   --exclude-table=table
>   --exclude-table-data=table
>
> I propose a new option:
>   --include-table-data-where=table:filter_clause
>
I remembered an old thread [1]. At that time pg_dump was not so
decoupled from the backend. We are far from being decoupled in a way
that someone can write his own pg_dump using only calls from a
library. I'm not sure pg_dump is the right place to add another ETL
parameter. We already have too much parameters that could break a
restore (flexibility is always welcome but too much is not so good).

> One would use this option as follows:
>
>   pg_dump --include-table-data-where=largetable:"created_at >= '2018-05-01'"
> database_name
>
How would you check that that expression is correct? Every parameter
could quote its value. It means that your parameter have to escape the
quote in '2018-05-01'. Another problem is that your spec does not show
us how you would handle tables like Foo.Bar or "foo:bar" (colon have
to be escaped)?

> The filter_clause is used as the contents of a WHERE clause when querying
> the data to generate the COPY statement produced by pg_dump.
>
You are forgetting about --inserts parameter. Could I use
--include-table-data-where and --inserts?


[1] https://www.postgresql.org/message-id/1212299813.17810.17.camel%40ubuntu


-- 
   Euler Taveira                                   Timbira -
http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento


Re: Add --include-table-data-where option to pg_dump, to export onlya subset of table data

От
Carter Thaxton
Дата:

How would you handle foreign keys? It seems easier to produce a dump
that won't restore.

This proposal will not attempt to be smart about foreign keys or anything like that.  I don't believe that would even be expected.

> We have the existing options:
>   --include-table=table    (and its -t synonym)
>   --exclude-table=table
>   --exclude-table-data=table
>
> I propose a new option:
>   --include-table-data-where=table:filter_clause
>
I remembered an old thread [1]. At that time pg_dump was not so
decoupled from the backend. We are far from being decoupled in a way
that someone can write his own pg_dump using only calls from a
library. I'm not sure pg_dump is the right place to add another ETL
parameter. We already have too much parameters that could break a
restore (flexibility is always welcome but too much is not so good).

In general, I agree with your sentiment that we don't want too much flexibility in this tool.  However, this just seems like a very obvious missing feature to me.  I was frankly surprised that pg_dump didn't already have it.

I've designed this feature so that it behaves like a more flexible version between --exclude-table-data and --include-table.  Instead of dumping the schema and zero rows, or the schema and all of the rows, it dumps the schema and some specific rows.

Providing "--include-table-data-where=table:false" behaves exactly like --exclude-table-data, and "--include-table-data-where=table:true" behaves exactly like --include-table.
It does no more or less to prevent a restore.  Given that --exclude-table-data already exists, this seems to introduce no new issues with restore.

 
>   pg_dump --include-table-data-where=largetable:"created_at >= '2018-05-01'"
> database_name
>
How would you check that that expression is correct?
 
The patch as already provided produces an error message and appropriate exit code during the dump process, presenting the invalid SQL that is produced as part of the WHERE clause.
I could see some value in refactoring it to provide error messages earlier in the process, but it's actually not bad as is.


Every parameter could quote its value. It means that your parameter have to escape the
quote in '2018-05-01'.

I don't understand.  The double quotes in my example are bash shell quotes.  There is no special quote parsing in this patch.  The single quotes are part of the WHERE clause.
Note that pg_dump already uses getopt_long, so it's not required to use the = symbol to separate option from its associated value.  So, it would also be fine to call as follows:

  pg_dump --include-table-data-where "largetable:created_at >= '2018-05-01'" database_name


Another problem is that your spec does not show
us how you would handle tables like Foo.Bar or "foo:bar" (colon have
to be escaped)?

Using a dot to separate the schema works just fine.  My proposal uses the same mechanism as --include-table, --exclude-table, and --exclude-table-data.  In fact, it even supports wildcards in those patterns.

Your point about a colon in the table name is interesting.  In all my years of working with PostgreSQL and other databases, I've never encountered a table name that contained a colon.  Perhaps an escape character, like \: could work.  Is there another separator character you would suggest, which is illegal in table names, but also intuitive as a separator?  Maybe a comma?

 
> The filter_clause is used as the contents of a WHERE clause when querying
> the data to generate the COPY statement produced by pg_dump.
>
You are forgetting about --inserts parameter. Could I use
--include-table-data-where and --inserts?

Yes, the --inserts parameter works just fine.  Perhaps I should have said "the COPY statement or INSERT statements".

Re: Add --include-table-data-where option to pg_dump, to export onlya subset of table data

От
Carter Thaxton
Дата:
Ah yes, thanks.  I did in fact have colors enabled.
I've attached a new patch generated by `git format-patch`.  Hopefully that's correct.


On Mon, May 21, 2018 at 4:00 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
On Tue, May 22, 2018 at 4:05 AM, Stephen Frost <sfrost@snowman.net> wrote:
> * Carter Thaxton (carter.thaxton@gmail.com) wrote:
>>   pg_dump --include-table-data-where=largetable:"created_at >=
>> '2018-05-01'" database_name
>
> I've wanted something similar to this in the past as well, and, as
> you've seen, we have some support for this kind of thing in pg_dump
> already and what you're doing is exposing that.

+1

>> I've prepared a proposed patch for this, which is attached.  The code
>> changes are rather straightforward.  I did have to add the ability to carry
>> around an extra pointer-sized object to the simple_list implementation, in
>> order to allow the filter clause to be associated to the matching oids of
>> the table pattern.  It seemed the best way to augment the existing
>> simple_list implementation, but change as little as possible elsewhere in
>> the codebase.  (Note that SimpleOidList is actually only used by pg_dump).
>>
>> Feel free to review and propose any amendments.
>
> I've only taken a quick look but I don't see any regression tests, for
> starters, and it's not clear if this can be passed multiple times for
> one pg_dump run (I'd certainly hope that it could be...).
>
> Also, if you haven't already, this should be registered on the
> commitfest app, so we don't lose track of it.

Thanks for doing that.  Unfortunately the patch seems to be corrupted
in some way, maybe ANSI control characters or something... perhaps you
set colour.ui = always in your git config, instead of auto?  You might
also consider using git format-patch so you can include a brief commit
message that explains the feature.

--
Thomas Munro
http://www.enterprisedb.com

Вложения

Re: Add --include-table-data-where option to pg_dump, to export onlya subset of table data

От
Thomas Munro
Дата:
On Wed, May 23, 2018 at 5:18 PM, Carter Thaxton
<carter.thaxton@gmail.com> wrote:
> Ah yes, thanks.  I did in fact have colors enabled.
> I've attached a new patch generated by `git format-patch`.  Hopefully that's
> correct.

pg_dump.c:2323:2: warning: ISO C90 forbids mixed declarations and code
[-Wdeclaration-after-statement]
  char *filter_clause = NULL;
  ^

You need to declare this variable at the top of its scope.  If you're
using GCC or Clang you might consider building with COPT=-Werror so
that any compiler warnings will stop the build from succeeding.

This doesn't build on Windows[1], probably for the same reason.

 /*
  * Is OID present in the list?
+ * Also return extra pointer-sized data by setting extra_data paramter
  */
 bool
-simple_oid_list_member(SimpleOidList *list, Oid val)
+simple_oid_list_member2(SimpleOidList *list, Oid val, void **extra_data)

I feel like that isn't in the spirit of Lisp "member".  It's now a
kind of association list.  I wonder if we are really constrained to
use the cave-man facilities in fe_utils anyway.  Though I suppose this
list is never going to be super large so maybe the data structure
doesn't matter too much (famous last words).

+ char *where_clause = pg_malloc(strlen(filter_clause) + 8 + 1);
+ strcpy(where_clause, "WHERE (");
+ strcat(where_clause, filter_clause);
+ strcat(where_clause, ")");

pg_dump.c seems to be allowed to use psprintf() which'd be less
fragile than the above code.

+ /* When match_data is set, split the pattern on the ':' chararcter,

typo

+ * Also return extra pointer-sized data by setting extra_data paramter

typo

[1] https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.311

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Add --include-table-data-where option to pg_dump, to export onlya subset of table data

От
Carter Thaxton
Дата:
pg_dump.c:2323:2: warning: ISO C90 forbids mixed declarations and code
[-Wdeclaration-after-statement]
  char *filter_clause = NULL;
  ^

You need to declare this variable at the top of its scope.  If you're
using GCC or Clang you might consider building with COPT=-Werror so
that any compiler warnings will stop the build from succeeding.

This doesn't build on Windows[1], probably for the same reason.

Done.  And thanks for the tip about COPT=-Werror

 
 /*
  * Is OID present in the list?
+ * Also return extra pointer-sized data by setting extra_data paramter
  */
 bool
-simple_oid_list_member(SimpleOidList *list, Oid val)
+simple_oid_list_member2(SimpleOidList *list, Oid val, void **extra_data)

I feel like that isn't in the spirit of Lisp "member".  It's now a
kind of association list.  I wonder if we are really constrained to
use the cave-man facilities in fe_utils anyway.  Though I suppose this
list is never going to be super large so maybe the data structure
doesn't matter too much (famous last words).

Yeah, I'm just trying to fit into the surrounding code as much as possible.  If you have a specific recommendation, I'm all ears.
SimpleOidList is only used by pg_dump, so if we want to rename or refactor this data structure, it won't have much widespread impact.

And you're right that the list is not going to be particularly large.  Consider that it's already a simple linked-list, and not some more complex hashtable, for the use cases that it already covers in pg_dump.  For all of these uses, it will only be as large as the number of options provided on the command-line.

 
+ char *where_clause = pg_malloc(strlen(filter_clause) + 8 + 1);
+ strcpy(where_clause, "WHERE (");
+ strcat(where_clause, filter_clause);
+ strcat(where_clause, ")");

pg_dump.c seems to be allowed to use psprintf() which'd be less
fragile than the above code.

Done.  Didn't realize psprintf() was available here.

 
typo
And fixed typos.

Thanks for the review!
Вложения

Re: Add --include-table-data-where option to pg_dump, to export onlya subset of table data

От
Carter Thaxton
Дата:
After some consideration, I've created a new patch that addresses even more of the various concerns.

Most notably, the command-line option is shortened to simply --where, which is much easier to read and understand,
and matches the earlier proposal [1].

 
 bool
-simple_oid_list_member(SimpleOidList *list, Oid val)
+simple_oid_list_member2(SimpleOidList *list, Oid val, void **extra_data)

I feel like that isn't in the spirit of Lisp "member".  It's now a
kind of association list.

My new patch has this function named simple_oid_list_find_data, to indicate that it's working with some extra data,
and the corresponding append is called simple_oid_list_append_data.


> Another problem is that your spec does not show
> us how you would handle tables like Foo.Bar or "foo:bar" (colon have
> to be escaped)?

This is now addressed in the v4 patch.  I've added a new function to string_utils,
called findUnquotedChar, which is used to find the colon character in the command-line argument,
which separates the table name from the filter clause.

For example, if you have a table called "foo:bar", then you would use the --where option as follows:

  pg_dump --where '"foo:bar":created_at >= '2018-05-01'" dbname

If you have a table with a double-quote in the name (gah!), as well as colons, like:  has"quote:and:colons
then this still works, because such quotes in the name are escaped by doubling the quote char,
e.g. "has""quote:and:colons", and also works with this patch:

  pg_dump --where '"has""quote:and:colons":created_at >= '2018-05-1'" dbname



Вложения

Re: Add --include-table-data-where option to pg_dump, to export onlya subset of table data

От
Surafel Temesgen
Дата:
hey,
i am reviewing this patch
On Thu, May 31, 2018 at 4:49 AM, Carter Thaxton <carter.thaxton@gmail.com> wrote:

  pg_dump --where '"foo:bar":created_at >= '2018-05-01'" dbname
 
it would be more sqlish if it specified like: 
--table=foo --where ="bar created_at >= 2018-05-01"
and i don't like the idea of duplicating the existing --table behavior it may confuse user
i rather recommend extending it. And when i test it with --table option the content of dump
file depend on the option specified first.
  
Regards
Surafel

Re: Add --include-table-data-where option to pg_dump, to export onlya subset of table data

От
Robert Haas
Дата:
On Fri, Jun 29, 2018 at 8:09 AM, Surafel Temesgen <surafel3000@gmail.com> wrote:
> hey,
> i am reviewing this patch
> On Thu, May 31, 2018 at 4:49 AM, Carter Thaxton <carter.thaxton@gmail.com>
> wrote:
>>
>>
>>   pg_dump --where '"foo:bar":created_at >= '2018-05-01'" dbname
>
> it would be more sqlish if it specified like:
> --table=foo --where ="bar created_at >= 2018-05-01"
> and i don't like the idea of duplicating the existing --table behavior it
> may confuse user
> i rather recommend extending it. And when i test it with --table option the
> content of dump
> file depend on the option specified first.

But you can specify multiple tables.  You wouldn't want the same WHERE
clause to apply to all of them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Add --include-table-data-where option to pg_dump, to export onlya subset of table data

От
Carter Thaxton
Дата:
The whole reason for the colon in the --where option is to indicate which table the WHERE clause should refer to, so that one can dump less than all of the rows.
The --table option is totally different.  It specifies which tables to dump at all.

If I provide a --where option, and no --table option, I want the WHERE clause to apply to the given table, and otherwise dump all tables.
If one supplies a --table option, it won't dump all tables - it will only dump the one specified.  I don't want to have to specify all the tables with --table, just to use the --where option.

Also, there may be some misunderstanding about "foo:bar" above.  That's an example of using a namespaced table, where "bar" is a table in the namespace "foo".  Normally, assuming your table is named "bar" in the default namespace, you would just say something like:

  pg_dump --where "bar:created_at >= 2018-05-01'"


On Mon, Jul 2, 2018 at 11:27 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jun 29, 2018 at 8:09 AM, Surafel Temesgen <surafel3000@gmail.com> wrote:
> hey,
> i am reviewing this patch
> On Thu, May 31, 2018 at 4:49 AM, Carter Thaxton <carter.thaxton@gmail.com>
> wrote:
>>
>>
>>   pg_dump --where '"foo:bar":created_at >= '2018-05-01'" dbname
>
> it would be more sqlish if it specified like:
> --table=foo --where ="bar created_at >= 2018-05-01"
> and i don't like the idea of duplicating the existing --table behavior it
> may confuse user
> i rather recommend extending it. And when i test it with --table option the
> content of dump
> file depend on the option specified first.

But you can specify multiple tables.  You wouldn't want the same WHERE
clause to apply to all of them.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Add --include-table-data-where option to pg_dump, to export onlya subset of table data

От
Michael Paquier
Дата:
On Mon, Jul 02, 2018 at 03:11:46PM -0700, Carter Thaxton wrote:
> Also, there may be some misunderstanding about "foo:bar" above.  That's an
> example of using a namespaced table, where "bar" is a table in the
> namespace "foo".  Normally, assuming your table is named "bar" in the
> default namespace, you would just say something like:
>
>   pg_dump --where "bar:created_at >= 2018-05-01'"

I am wondering how this works at parsing if the table name, or one of
the columns includes a colon character :)
--
Michael

Вложения

Re: Add --include-table-data-where option to pg_dump, to export onlya subset of table data

От
Carter Thaxton
Дата:
 
>   pg_dump --where "bar:created_at >= 2018-05-01'"

I am wondering how this works at parsing if the table name, or one of
the columns includes a colon character :)

The proposed patch will handle quoted identifiers.  E.g. the following will work just fine:

  pg_dump --where 'table:"column:with:colons" = 5'

Note the use of single quotes in the shell, and then double quotes in the WHERE clause.  There are also many other options for quoting in the shell, of course.


 
Please don't top-post on the PostgreSQL lists. See <http://idallen.com/topposting.html>

Sorry.  Thanks for the reminder.

Re: Add --include-table-data-where option to pg_dump, to export onlya subset of table data

От
Surafel Temesgen
Дата:


On Mon, Jul 2, 2018 at 9:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:

But you can specify multiple tables.  You wouldn't want the same WHERE
clause to apply to all of them.

also with this new --where option you can specify multiple table using wildcard and it
try to apply the same where clause to each table. may be its a desirable feature
because such kind of table can be structurally similar too.

regards
Surafel  

Re: Add --include-table-data-where option to pg_dump, to export onlya subset of table data

От
Robert Haas
Дата:
On Tue, Jul 3, 2018 at 6:31 AM, Surafel Temesgen <surafel3000@gmail.com> wrote:
> On Mon, Jul 2, 2018 at 9:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> But you can specify multiple tables.  You wouldn't want the same WHERE
>> clause to apply to all of them.
>>
> also with this new --where option you can specify multiple table using
> wildcard and it
> try to apply the same where clause to each table. may be its a desirable
> feature
> because such kind of table can be structurally similar too.

I don't think that's likely to be very useful.  I think Carter Thaxton
has the right idea, although using foo:bar to mean foo.bar doesn't
seem like a great plan.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Add --include-table-data-where option to pg_dump, to export onlya subset of table data

От
Surafel Temesgen
Дата:


On Tue, Jul 3, 2018 at 1:11 AM, Carter Thaxton <carter.thaxton@gmail.com> wrote:
The whole reason for the colon in the --where option is to indicate which table the WHERE clause should refer to, so that one can dump less than all of the rows.
The --table option is totally different.  It specifies which tables to dump at all.

Thank you for explaining,

I just have one comment . I found the error message generated on incorrect where clause specification strange for pg_dump. I think query result status check needed to handle it and generate more friendly error message.


regards

Surafel

Re: Add --include-table-data-where option to pg_dump, to export onlya subset of table data

От
Jeremy Finzel
Дата:


On Mon, May 21, 2018 at 6:34 AM Carter Thaxton <carter.thaxton@gmail.com> wrote:
Many times I've wanted to export a subset of a database, using some sort of row filter condition on some of the large tables.  E.g. copying a production database to a staging environment, but with some time series data only from the past month.

We have the existing options:
  --include-table=table    (and its -t synonym)
  --exclude-table=table
  --exclude-table-data=table

I propose a new option:
  --include-table-data-where=table:filter_clause

One would use this option as follows:

  pg_dump --include-table-data-where=largetable:"created_at >= '2018-05-01'" database_name

The filter_clause is used as the contents of a WHERE clause when querying the data to generate the COPY statement produced by pg_dump.

I've prepared a proposed patch for this, which is attached.  The code changes are rather straightforward.  I did have to add the ability to carry around an extra pointer-sized object to the simple_list implementation, in order to allow the filter clause to be associated to the matching oids of the table pattern.  It seemed the best way to augment the existing simple_list implementation, but change as little as possible elsewhere in the codebase.  (Note that SimpleOidList is actually only used by pg_dump).

Feel free to review and propose any amendments.


Why not simply use \copy (select * from largetable where created_at >= '2018-05-01') to stdout? That is what I’ve always done when I need something like this and have not found it particularly bothersome but rather quite powerful. And here you have tons of flexibility because you can do joins and whatever else.

FWIW. Thanks,
Jeremy 

Re: Add --include-table-data-where option to pg_dump, to export onlya subset of table data

От
"David G. Johnston"
Дата:
On Thu, Sep 6, 2018 at 8:40 AM, Jeremy Finzel <finzelj@gmail.com> wrote:
Why not simply use \copy (select * from largetable where created_at >= '2018-05-01') to stdout? That is what I’ve always done when I need something like this and have not found it particularly bothersome but rather quite powerful. And here you have tons of flexibility because you can do joins and whatever else.

Just skimming the thread but I'd have to say being able to leverage pg_dump's dependency resolution is a major reason for adding features to it instead sticking to writing psql scripts.  This feature in a multi-tenant situation is something with, I suspect, reasonably wide appeal.

David J.

Re: Add --include-table-data-where option to pg_dump, to export onlya subset of table data

От
Jeremy Finzel
Дата:
Just skimming the thread but I'd have to say being able to leverage pg_dump's dependency resolution is a major reason for adding features to it instead sticking to writing psql scripts.  This feature in a multi-tenant situation is something with, I suspect, reasonably wide appeal.

That I would agree with if in fact it's true people want that, but that wasn't how the problem trying to be solved was communicated.  From what I read in the initial examples given, just using psql is more than sufficient in those cases.

I will grant that copying the structure and data at the same time would be much easier, however.  Because using psql, you need pg_dump to create your structure then a separate psql script to copy the data.

But again with --data-only examples given, it's so easy to do that with psql copy I just don't understand the value of the feature unless you really are saying you require the dependency resolution.

I agree with some of the hesitation of complicating the syntax and allowing too much customization for what pg_dump is designed for.  Really, if you need more customization, copy gives you that.  So I don't personally consider it a missing feature because both tools have different uses and I haven't found that any of this disrupts my workflow.  FWIW...

Thanks,
Jeremy

Re: Add --include-table-data-where option to pg_dump, to export onlya subset of table data

От
Dmitry Dolgov
Дата:
Hi,

Unfortunately, current version of this patch has some conflicts, could you
rebase it?


Re: Add --include-table-data-where option to pg_dump, to export onlya subset of table data

От
Andres Freund
Дата:
Hi,

On 2018-11-29 13:50:21 +0100, Dmitry Dolgov wrote:
> Unfortunately, current version of this patch has some conflicts, could you
> rebase it?

Given this patch has not been updated, I'm marking this as returned with feedback.

Greetings,

Andres Freund


Re: Add --include-table-data-where option to pg_dump, to export onlya subset of table data

От
Carter Thaxton
Дата:

> Unfortunately, current version of this patch has some conflicts, could you
> rebase it?

Given this patch has not been updated, I'm marking this as returned with feedback.

Here's a rebased patch that works with the latest master branch.  Very straightforward.

Вложения

Re: Add --include-table-data-where option to pg_dump, to export onlya subset of table data

От
Surafel Temesgen
Дата:


On Thu, Jan 31, 2019 at 10:22 PM Carter Thaxton <carter.thaxton@gmail.com> wrote:


Here's a rebased patch that works with the latest master branch.  Very straightforward.
 

You forget to resubmit it to the next commitfest and  the error message on incorrect
where clause specification is still the same

regards
Surafel