Обсуждение: Staging Database

Поиск
Список
Период
Сортировка

Staging Database

От
BladeOfLight16
Дата:
The company I work for has a client who has expressed interest in having a staging database for their data. Staging as in they add data, do some QCing, then push to a database with an identical schema to make it public. Fortunately, we're not doing that right now, but it's something they may want later on. So I wanted to at least start thinking about it. What are some ways of going about it? Some kind of replication set up comes to mind, but it would need to be on demand, not constant. It also could be as infrequent as quarterly, if that makes a difference. Preferably, it would only require one server with PostgreSQL installed, but I'm not ruling out multiple servers.

Thanks for any ideas.

Re: Staging Database

От
Luca Ferrari
Дата:
On Wed, Aug 7, 2013 at 6:00 AM, BladeOfLight16 <bladeoflight16@gmail.com> wrote:
> The company I work for has a client who has expressed interest in having a
> staging database for their data. Staging as in they add data, do some QCing,
> then push to a database with an identical schema to make it public.
> Fortunately, we're not doing that right now, but it's something they may
> want later on. So I wanted to at least start thinking about it. What are
> some ways of going about it? Some kind of replication set up comes to mind,
> but it would need to be on demand, not constant. It also could be as
> infrequent as quarterly, if that makes a difference. Preferably, it would
> only require one server with PostgreSQL installed, but I'm not ruling out
> multiple servers.
>

Not really helpful, but here are my considerations.
The low frequency and the preference for a single server suggest me a
dump and restore cycle on two databases, assuming this is possible due
to not high volume data.
I would also consider some way of data partitioning in order to
isolate the data that has to be pushed from staging into the master
(you say data is only added or queried).
The problem for replication is that both the staging and the master
would be in read-write mode, so sounds to me like a multi-master
setup.

Luca


Re: Staging Database

От
BladeOfLight16
Дата:
On Wed, Aug 7, 2013 at 4:43 AM, Luca Ferrari <fluca1978@infinito.it> wrote:
Not really helpful, but here are my considerations.
The low frequency and the preference for a single server suggest me a
dump and restore cycle on two databases, assuming this is possible due
to not high volume data.
I would also consider some way of data partitioning in order to
isolate the data that has to be pushed from staging into the master
(you say data is only added or queried).
The problem for replication is that both the staging and the master
would be in read-write mode, so sounds to me like a multi-master
setup

I wasn't very careful with my wording. Sorry about that. There will be updates and possibly deletions as well as additions. Furthermore, the public version would be read only, I believe. The client would be modifying data, not end users. (It's a catalog site; the client is a non-profit that's publishing information in their field.) You're right in guessing that the data is not high volume.

A dump/restore does seem pretty logical. Replication might be complete overkill. The part that makes it a little more difficult is managing getting rid of the old database before restoring (and preferably, not getting rid of the old DB unless the restore succeeds). The best way I can think of to keep from doing all those steps manually would just be to write a bash script, unless someone has better ideas.

Also, do not degrade your contributions. =) This is very helpful, even if it's kind of vague due to the fact I'm not actually doing this yet. I'd much rather already have a vague idea in mind when this comes up again than to be totally unprepared. Thank you.

Re: Staging Database

От
Luca Ferrari
Дата:
On Wed, Aug 7, 2013 at 12:38 PM, BladeOfLight16
<bladeoflight16@gmail.com> wrote:

> The client would be modifying data,
> not end users. (It's a catalog site; the client is a non-profit that's
> publishing information in their field.) You're right in guessing that the
> data is not high volume.
>

In such case this means that the master node will be the staging one,
with the hot standby the one publicly available. Replication could be
the right and easiest solution assuming that
- updates and addition can be "available" immediatly to the public
(that is they are cut/filtered from your web application)
- you don't change the schema or the structure in a way that could
damage your existing front-end application

Another idea that comes into my mind is to use three database and a
pgbouncer/pgpool approach: one node is the current indipendent node
and is accessed via pgpool from the web application, the staging
database is the master and is replicated in the third slave node. Once
the staging is completed the third node is the new pgpool target, so
that the web application start seeing the new database, the old
indipendent node becomes the new slave of the staging that is always
the master. But as you can imagine, this is a more complex approach.

Please consider that in the case changes involve some damage to the
application you will probably need a full re-deploy, and therefore a
dump-restore could be the easiest approach (and you can create very
complex scripts for managing that).

Hope this helps.

Luca


Re: Staging Database

От
David Johnston
Дата:
BladeOfLight16 wrote
> The company I work for has a client who has expressed interest in having a
> staging database for their data. Staging as in they add data, do some
> QCing, then push to a database with an identical schema to make it public.
> Fortunately, we're not doing that right now, but it's something they may
> want later on. So I wanted to at least start thinking about it. What are
> some ways of going about it? Some kind of replication set up comes to
> mind,
> but it would need to be on demand, not constant. It also could be as
> infrequent as quarterly, if that makes a difference. Preferably, it would
> only require one server with PostgreSQL installed, but I'm not ruling out
> multiple servers.

My first option to evaluate would be simple schema delineation.  Have one or
more schemas inside the same database where you place "staging" data.  Have
specific "staging" users that can only see those schemas and can perform the
work on them.  Once all the staged data has been
validation/modified/whatever you have a function that will migrate the
staging data from the staging schemas to the production/live schemas.

If you want to use only one server than whatever you are going to do with
the staging data cannot overly tax said box otherwise that requirement is
bogus.  Otherwise two boxes means at least a second postgreSQL installation
on the second box.  I'm out of my experience here but to then get the data
from one box to the other you can use something like dbLink/FDW or even
dumping the data to CSV and re-importing it in production (ideally still
using the same staging tables; you have the second server with the staging
tables also so you can perform resource-intensive actions).

Normally you are only altering a small portion of the data inside the
database and so should attempt to only "manage" that subset.  One question
is once all of the data has been placed into production what happens to it?
Would the "new" staging area contain ALL information even that previously
imported and unchanged or does it only reflect changes from the last
production load?  How and what kind of QC is going to be performed and will
you need to build interfaces to support those activities?  Would you end up
QCing the same data time-after-time even if it was unchanged.

David J.




--
View this message in context: http://postgresql.1045698.n5.nabble.com/Staging-Database-tp5766603p5766665.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: Staging Database

От
Steve Crawford
Дата:
On 08/07/2013 03:38 AM, BladeOfLight16 wrote:
On Wed, Aug 7, 2013 at 4:43 AM, Luca Ferrari <fluca1978@infinito.it> wrote:
...

I wasn't very careful with my wording. Sorry about that. There will be updates and possibly deletions as well as additions. Furthermore, the public version would be read only, I believe. The client would be modifying data, not end users. (It's a catalog site; the client is a non-profit that's publishing information in their field.)

In addition to other suggestions posted here, the fact that it's read-only when live leads to some possibilities. I'll just toss them out as a brain-dump in no particular order - many may not be practical for your use.

1. Rename the databases. Have the staged and live databases in the same database instance then when time to go live just "alter database live rename to old;" followed by "alter database staged rename to live;"

2. Have the app query through views and have a script that repoints the views to the new data.

3. Similar to #2, make all the primary tables empty parent tables with data in the child tables. Put the new data into tables then have a script drop the existing children and alter-table to make the new tables children of your primary tables.

4. Run a second instance of PostgreSQL on a different port number to get the staged database set up the way you want. When time to go live, stop the servers, swap port numbers and restart.

5. Use the "Connection Service File", pg_service.conf, file to name/route your connections and point your apps to the currently live database (i.e. have staged and live in your cluster and point to the desired db). This avoids needing to use pg_bouncer at the expense of needing to update the pg_service.conf file on all clients.

6. Put the data into a new schema then just rename schemas.

Perhaps more will come to me later.

Cheers,
Steve