[GENERAL] Run statements before pg_dump in same transaction?

From: François Beausoleil
Hi all!

For development purposes, we dump the production database to local. It's fine because the DB is small enough. The
company's growing and we want to reduce risks. To that end, we'd like to anonymize the data before it leaves the
database server.

One solution we thought of would be to run statements prior to pg_dump, but within the same transaction, something like
this:

BEGIN;
UPDATE users SET email = 'dev+' || id || '@example.com', password_hash = '/* hash of "password" */', ...;
-- launch pg_dump as usual, ensuring a ROLLBACK at the end
-- pg_dump must run with the *same* connection, obviously

-- if not already done by pg_dump
ROLLBACK;

Is there a ready-made solution for this? Our DB is hosted on Heroku, and we don't have 100% flexibility in how we dump.

I searched for "postgresql anonymize data dump before download"[1] and variations, but I didn't see anything highly
relevant.

Thanks!
François

PS: Cross-posted to http://dba.stackexchange.com/q/168023/3935

  [1]: https://duckduckgo.com/?q=postgresql+anonymize+data+dump+before+download



Re: [GENERAL] Run statements before pg_dump in same transaction?

From: John R Pierce
On 3/23/2017 12:06 PM, François Beausoleil wrote:
> BEGIN;
> UPDATE users SET email = 'dev+' || id || '@example.com', password_hash = '/* hash of "password" */', ...;
> -- launch pg_dump as usual, ensuring a ROLLBACK at the end
> -- pg_dump must run with the *same* connection, obviously
>
> -- if not already done by pg_dump
> ROLLBACK;

There is no way to get pg_dump to use the same connection as your script... it's a standalone program, so it will open its own connection and hence run its own transactions.


-- 
john r pierce, recycling bits in santa cruz

Re: [GENERAL] Run statements before pg_dump in same transaction?

From: John R Pierce
On 3/23/2017 12:06 PM, François Beausoleil wrote:
> For development purposes, we dump the production database to local. It's fine because the DB is small enough. The
> company's growing and we want to reduce risks. To that end, we'd like to anonymize the data before it leaves the
> database server.

how about piping the /output/ of pg_dump to GPG or a similar file/stream encryption module?
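
For instance, a minimal sketch (the recipient key and the DATABASE_URL variable are placeholders):

  # encrypt the dump as it is produced, so only ciphertext ever hits disk
  pg_dump "$DATABASE_URL" | gpg --encrypt --recipient dev@example.com > prod.sql.gpg

  # developers decrypt and restore locally
  gpg --decrypt prod.sql.gpg | psql devdb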


--
john r pierce, recycling bits in santa cruz



Re: [GENERAL] Run statements before pg_dump in same transaction?

From: Alban Hertroys
> On 23 Mar 2017, at 20:23, John R Pierce <pierce@hogranch.com> wrote:
>
> On 3/23/2017 12:06 PM, François Beausoleil wrote:
>> For development purposes, we dump the production database to local. It's fine because the DB is small enough. The
>> company's growing and we want to reduce risks. To that end, we'd like to anonymize the data before it leaves the
>> database server.
>
> how about piping the /output/ of pg_dump to GPG or a similar file/stream encryption module?

John, anonymising data is not about encrypting, but about making data anonymous. This is usually done to create data
sets useful for development or statistical analysis, whilst preventing people from obtaining sensitive information from
the data set. For example, companies want to know how many Facebook users are interested in buying drones, but those
companies should not know who exactly those people are.

ISTR that there are some tools for this purpose, but the details escape me.

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.



Re: [GENERAL] Run statements before pg_dump in same transaction?

From: Adrian Klaver
On 03/23/2017 12:06 PM, François Beausoleil wrote:
> Hi all!
>
> For development purposes, we dump the production database to local. It's fine because the DB is small enough. The
> company's growing and we want to reduce risks. To that end, we'd like to anonymize the data before it leaves the
> database server.
>
> One solution we thought of would be to run statements prior to pg_dump, but within the same transaction, something
> like this:
>
> BEGIN;
> UPDATE users SET email = 'dev+' || id || '@example.com', password_hash = '/* hash of "password" */', ...;
> -- launch pg_dump as usual, ensuring a ROLLBACK at the end
> -- pg_dump must run with the *same* connection, obviously
>
> -- if not already done by pg_dump
> ROLLBACK;
>
> Is there a ready-made solution for this? Our DB is hosted on Heroku, and we don't have 100% flexibility in how we
> dump.

The only thing I could find is:

https://devcenter.heroku.com/articles/heroku-postgres-backups

Direct database-to-database copies

So copy your production database to a second database on Heroku, anonymize
the data on that second database, and then dump it.
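
Roughly, and assuming a second database attached to the app (the app name and the
HEROKU_POSTGRESQL_SILVER_URL attachment are placeholders for whatever you provision):

  # server-to-server copy of production into the scratch database
  # (heroku asks you to confirm overwriting the target)
  heroku pg:copy DATABASE_URL HEROKU_POSTGRESQL_SILVER_URL --app myapp

  # run the anonymizing statements on the scratch copy
  psql "$(heroku config:get HEROKU_POSTGRESQL_SILVER_URL --app myapp)" -f sanitize.sql

  # dump the sanitized copy for local use
  pg_dump "$(heroku config:get HEROKU_POSTGRESQL_SILVER_URL --app myapp)" > dev_dump.sql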

>
> I searched for "postgresql anonymize data dump before download"[1] and variations, but I didn't see anything highly
> relevant.
>
> Thanks!
> François
>
> PS: Cross-posted to http://dba.stackexchange.com/q/168023/3935
>
>   [1]: https://duckduckgo.com/?q=postgresql+anonymize+data+dump+before+download
>
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: Run statements before pg_dump in same transaction?

From: Greg Sabino Mullane

François Beausoleil asked:
> To that end, we'd like to anonymize the data before it leaves the database server.
>
> One solution we thought of would be to run statements prior to pg_dump, but within
> the same transaction, something like this:
>
> BEGIN;
> UPDATE users SET email = 'dev+' || id || '@example.com', password_hash = '/* hash of "password" */', ...;
> -- launch pg_dump as usual, ensuring a ROLLBACK at the end
...
> Is there a ready-made solution for this?

No - at least not one that generates a dump with the data scrubbed *before* it leaves
the server. Some other ideas:

* Periodically do a full dump to another database under your control, sanitize the
data, and make all dev dumps come from *that* database. Process roughly becomes:

* pg_dump herokudb | psql dev_temp
* <sanitize dev_temp>
* drop existing dev_old; rename devdb to dev_old; rename dev_temp to devdb
* Devs can pg_dump devdb at will
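
In shell terms that might look like this (a sketch: the Heroku URL variable and
sanitize.sql are placeholders, and ALTER DATABASE ... RENAME fails if anyone is
still connected to devdb or dev_temp at that moment):

  createdb dev_temp
  pg_dump "$HEROKU_DATABASE_URL" | psql dev_temp
  psql dev_temp -f sanitize.sql    # your anonymizing UPDATEs

  # rotate, keeping one generation back
  # (on the very first run devdb does not exist yet, so the first rename errors)
  psql postgres -c 'DROP DATABASE IF EXISTS dev_old'
  psql postgres -c 'ALTER DATABASE devdb RENAME TO dev_old'
  psql postgres -c 'ALTER DATABASE dev_temp RENAME TO devdb'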

That still moves your sensitive data to another server though, even temporarily.
Another approach is to use the -T flag of pg_dump to exclude certain tables.
Make modified copies of them on the server, then rename them after the dump
(or simply put them in a new namespace):

* (create a dev.users identical (including indexes, etc.) to public.users)
* truncate table dev.users;
* insert into dev.users select * from public.users;
* update dev.users set email = 'dev+' || ...etc. <e.g. sanitize data>
* pg_dump -d <Heroku's DATABASE_URL> -T public.users > devs_use_this.pg

Then devs could do: set search_path = dev, public;

Or you could simply move the sanitized table back:

alter table dev.users set schema public;

If you are going to rename, it may be simpler to not make an identical copy
of the affected tables (i.e. with indexes) but just a data-only copy:

create table dev.users as select * from public.users;
<sanitize dev.users>
pg_dump herokudb --schema-only | psql devdb
pg_dump herokudb --data-only -T public.users | psql devdb
psql devdb -c 'insert into public.users select * from dev.users'


--
Greg Sabino Mullane greg@turnstep.com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201703240911
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8