Обсуждение: Design decision advice

Поиск
Список
Период
Сортировка

Design decision advice

От
"William Temperley"
Дата:
Dear all

I'd really appreciate a little advice here - I'm designing a PG
database to manage a scientific dataset.
I've these fairly clear requirements:

1. Multiple users of varying skill will input data.
2. Newly inserted data will be audited and marked good / bad
3. We must have a dataset that is frozen or "known good" to feed into
various models.

This, as far as I can see, leaves me with three options:
A. Two databases, one for transaction processing and one for
modelling. At arbitrary intervals (days/weeks/months) all "good" data
will be moved to the modelling database.
B. One database, where all records will either be marked "in" or
"out". The application layer has to exclude all data that is out.
C. Sandbox tables for all tables updated by the application.

I prefer option A, this gives me the flexibility to run heavy
modelling queries on a separate server, but I'm not sure how best to
deal with the replication issues when moving to the modelling db.
Option B makes me think of hard to diagnose bugs with queries looking
at different datasets, for example.
With option C, if both tables tX and tY have sandbox tables sX and sY,
there could be problems where sX needs to reference data in sY, but
has a foreign key referencing tY

What would you guys do?  Have I missed a better option here?

Thanks

Will T

Re: Design decision advice

От
Craig Ringer
Дата:
William Temperley wrote:
> Dear all
>
> I'd really appreciate a little advice here - I'm designing a PG
> database to manage a scientific dataset.
> I've these fairly clear requirements:
>
> 1. Multiple users of varying skill will input data.
> 2. Newly inserted data will be audited and marked good / bad
> 3. We must have a dataset that is frozen or "known good" to feed into
> various models.
>
> This, as far as I can see, leaves me with three options:
> A. Two databases, one for transaction processing and one for
> modelling. At arbitrary intervals (days/weeks/months) all "good" data
> will be moved to the modelling database.
> B. One database, where all records will either be marked "in" or
> "out". The application layer has to exclude all data that is out.

You could also exclude "out" data at the database level with appropriate
use of (possibly updatable) views.

If you put your raw tables in one schema and put your valid-data-only
query views in another schema, you can set your schema search path so
applications cannot see the raw tables containing not-yet-validated data.

You also have the option of using materialized views, where a trigger
maintains the "good" tables by pushing data over from the raw tables
when it's approved.

That gives you something between your options "A" and "B" to consider,
at least.

--
Craig Ringer


Re: Design decision advice

От
"William Temperley"
Дата:
On Thu, Aug 14, 2008 at 2:55 AM, Craig Ringer
<craig@postnewspapers.com.au> wrote:
> William Temperley wrote:

>> A. Two databases, one for transaction processing and one for
>> modelling. At arbitrary intervals (days/weeks/months) all "good" data
>> will be moved to the modelling database.
>> B. One database, where all records will either be marked "in" or
>> "out". The application layer has to exclude all data that is out.
>
> You could also exclude "out" data at the database level with appropriate
> use of (possibly updatable) views.
>
> If you put your raw tables in one schema and put your valid-data-only
> query views in another schema, you can set your schema search path so
> applications cannot see the raw tables containing not-yet-validated data.
>
> You also have the option of using materialized views, where a trigger
> maintains the "good" tables by pushing data over from the raw tables
> when it's approved.
>
> That gives you something between your options "A" and "B" to consider,
> at least.
>
> --
> Craig Ringer
>
>

Thanks Craig -

I didn't know about the search_path setting - a gem of knowlege. I'd
overlooked views too.

I'm using Django btw, which is great except for limited support for
multiple DBs, so the single DB option will be much easier.

Search_path gives me quite an elegant solution - I can direct my
read-only modelling users to their schema ($user), the modelling
schema, where views are kept.
Admin users get directed to their $user schema.

This leaves me with the views/materialised views question.
Oh yeah, and hacking Django to allow different DB users in one project.

Cheers,

Will