Discussion: Big image tables maintenance


Big image tables maintenance

From: still Learner
Date:
Hi,

I have a 10 TB size table with multiple bytea columns (image & doc) and it makes 20TB of DB size. I have a couple of issues maintaining the DB.

1. I would like to separate the image column from the 10TB size table and place it in a separate schema. The change should not result in any query change in the application.  Is it possible? Doing this should not affect the performance.

2. I can't maintain files on the file system as the count is huge, so I am thinking of using a NoSQL store, most likely MongoDB. Is it recommended? Or can PostgreSQL itself handle it?

3. Taking a backup of 20TB of data is a big task. Is there any more feasible solution other than online backup/pg_dump?

Each image retrieval is 
Currently, we are on pg 9.4 and moving to 10.5 soon.
 
Thanks,
GJ.

Re: Big image tables maintenance

From: Ron
Date:
On 09/17/2018 07:38 AM, still Learner wrote:
> Hi,
>
> I have a 10 TB size table with multiple bytea columns (image & doc) and makes 20TB of DB size. I have a couple of issues to maintain the DB.
>
> 1. I Would like to separate the image column from the 10TB size table, place it in a separate schema. The change should not result in any query change in the application.  Is it possible? Doing this it should not affect the performance.

That's called "vertical partitioning", which I don't think Postgres supports.


> 2. I can't maintain files on File system as the count is huge,

Eh?  *You* aren't supposed to maintain the files on the filesystem; *Postgres* is.

> so thinking of using any no-sql mostly mongo-DB, is it recommended? Or PostgreSQL itself can handle?

> 3. Taking the backup of 20TB data, is big task. Any more feasible solution other than online backup/pg_dump?

pgbackrest and barman are popular options.

(We have a database like yours, though only 3TB, and have found that pg_dump runs a lot faster with "--compress=0".  The backups are 2.25x larger than the database, though...)
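
For reference, the invocation is roughly like this (database name and output path are placeholders; adjust for your environment):

    # custom-format dump with compression turned off; compression was our bottleneck
    pg_dump --format=custom --compress=0 --file=/backup/mydb.dump mydb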


> Each image retrieval is
> Currently, we are on pg 9.4 and moving to 10.5 soon.
>
> Thanks,
> GJ.

--
Angular momentum makes the world go 'round.

Re: Big image tables maintenance

From: Stephen Frost
Date:
Greetings,

(limiting this to -admin, cross-posting this to a bunch of different
lists really isn't helpful)

* still Learner (stilllearner23@gmail.com) wrote:
> I have a 10 TB size table with multiple bytea columns (image & doc) and
> makes 20TB of DB size. I have a couple of issues to maintain the DB.

*What* are those issues..?  That's really the first thing to discuss
here but you don't ask any questions about it or state what the issue is
(except possibly for backups, but we have solutions for that, as
mentioned below).

> 1. I Would like to separate the image column from the 10TB size table,
> place it in a separate schema. The change should not result in any query
> change in the application.  Is it possible? Doing this it should not affect
> the performance.

How large are these images?  PostgreSQL will already pull out large
column values and put them into a side-table for you, behind the scenes,
using a technique called TOAST.  Documentation about TOAST is available
here:

https://www.postgresql.org/docs/current/static/storage-toast.html

> 2. I can't maintain files on File system as the count is huge, so thinking
> of using any no-sql mostly mongo-DB, is it recommended? Or PostgreSQL
> itself can handle?

I suspect you'd find that your data size would end up being much, much
larger if you tried to store it as JSON or in a similar system, and
you're unlikely to get any performance improvement (much more likely the
opposite, in fact).

> 3. Taking the backup of 20TB data, is big task. Any more feasible solution
> other than online backup/pg_dump?

Absolutely, I'd recommend using pgBackRest which supports parallel
online backup and restore.  Using pg_dump for a large system like this
is really not a good idea- your restore time would likely be
particularly terrible and you have no ability to do point-in-time
recovery.  Using pgBackRest and a capable system, you'd be able to get a
complete backup of 20TB in perhaps 6-12 hours, with similar time on the
recovery side.  If you wish to be able to recover faster, running a
replica (as well as doing backups) may be a good idea, perhaps even a
time-delayed one.
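
As a rough sketch of what that looks like (the stanza name, paths and job count below are just placeholders; see the pgBackRest docs for stanza-create and the matching archive_command setting):

    # /etc/pgbackrest.conf (illustrative values only)
    [global]
    repo1-path=/var/lib/pgbackrest
    process-max=8                    # parallel workers for backup/restore

    [mydb]
    pg1-path=/var/lib/pgsql/10/data

    # full backup, and a restore onto an empty data directory
    pgbackrest --stanza=mydb --type=full backup
    pgbackrest --stanza=mydb restore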

> Each image retrieval is

Unfinished thought here..?

> Currently, we are on pg 9.4 and moving to 10.5 soon.

That's definitely a good plan.

Thanks!

Stephen


Re: Big image tables maintenance

From: Scott Ribe
Date:
On 09/17/2018 07:38 AM, still Learner wrote:
> Hi ,
>
> I have a 10 TB size table with multiple bytea columns (image & doc) and makes 20TB of DB size. I have a couple of issues to maintain the DB.
>
> 1. I Would like to separate the image column from the 10TB size table, place it in a separate schema. The change should not result in any query change in the application.  Is it possible? Doing this it should not affect the performance.

They're automatically stored separately; see https://www.postgresql.org/docs/current/static/storage-toast.html.

> 2. I can't maintain files on File system as the count is huge,

So? I've stored millions of documents on a Mac mini. Real server hardware & OS should have no problem--10TB is really not all that much.

> so thinking of using any no-sql mostly mongo-DB, is it recommended? Or PostgreSQL itself can handle?

Only if all you need is document storage, and none of the other things PG offers.

> 3. Taking the backup of 20TB data, is big task. Any more feasible solution other than online backup/pg_dump?

That's an argument for keeping the presumably immutable files on the file system. (There are arguments against as
well.)



Re: Big image tables maintenance

From: Stephen Frost
Date:
Greetings,

* Ron (ronljohnsonjr@gmail.com) wrote:
> On 09/17/2018 07:38 AM, still Learner wrote:
> >I have a 10 TB size table with multiple bytea columns (image & doc) and
> >makes 20TB of DB size. I have a couple of issues to maintain the DB.
> >
> >1. I Would like to separate the image column from the 10TB size table,
> >place it in a separate schema. The change should not result in any query
> >change in the application.  Is it possible? Doing this it should not
> >affect the performance.
>
> That's called "vertical partitioning", which I don't think Postgres supports.

As mentioned, PostgreSQL will already do this for you with TOAST, but
even without that, you could certainly create a simple view..
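
A very rough sketch of that, with made-up table and column names (note that writes through the view would need INSTEAD OF triggers or application changes):

    -- assumes doc_id is the primary key of the existing table
    CREATE SCHEMA blobs;

    CREATE TABLE blobs.doc_image (
        doc_id  bigint PRIMARY KEY REFERENCES public.documents (doc_id),
        image   bytea
    );

    INSERT INTO blobs.doc_image (doc_id, image)
        SELECT doc_id, image FROM public.documents;

    ALTER TABLE public.documents DROP COLUMN image;
    ALTER TABLE public.documents RENAME TO documents_base;

    -- a view under the old name keeps existing SELECTs unchanged
    CREATE VIEW public.documents AS
        SELECT b.*, i.image
          FROM public.documents_base b
          LEFT JOIN blobs.doc_image i USING (doc_id);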

> >2. I can't maintain files on File system as the count is huge,
>
> Eh? *You* aren't supposed to maintain the files on the filesystem;
> *Postgres* is.

I believe the point being made here is that pushing the images out of PG
and on to the filesystem would result in a huge number of files and that
would be difficult for the filesystem to handle and generally difficult
to work with.

> (We have a database like yours, though only 3TB, and have found that pg_dump
> runs a *lot* faster with "--compress=0".  The backups are 2.25x larger than
> the database, though...)

Unfortunately, your restore time with a pg_dump-based backup is very
high and that's something that I don't think enough people think about.

Having both pgBackRest-based physical backups and pg_dump-based backups
is nice as it allows you to do selective restore when you need it, and
fast full restore when needed.  Of course, that requires additional
storage.

Note that pg_dump/pg_restore also support parallelism, which can help
with how long they take to run.
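
For instance (directory format is required for a parallel dump; names and job counts are illustrative):

    # parallel dump into a directory-format archive
    pg_dump --format=directory --jobs=8 --file=/backup/mydb.dir mydb

    # parallel restore; the target database must already exist (or add --create)
    pg_restore --jobs=8 --dbname=mydb /backup/mydb.dir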

Thanks!

Stephen


Re: Big image tables maintenance

From: Stephen Frost
Date:
Greetings,

* Scott Ribe (scott_ribe@elevated-dev.com) wrote:
> On 09/17/2018 07:38 AM, still Learner wrote:
> > 3. Taking the backup of 20TB data, is big task. Any more feasible solution other than online backup/pg_dump?
>
> That's an argument for keeping the presumably immutable files on the file system. (There are arguments against as well.)

While I'm not generally against the idea of keeping files on the
filesystem, I'm not sure how that really changes things when it comes to
backup..?  If anything, having the files on the filesystem makes backing
things up much more awkward, since you don't have the transactional
guarantees on the filesystem that you do in the database and if you push
the files out but keep the metadata and indexes in the database then you
have to deal with reconciling the two, in general and particularly when
performing a backup/restore.

Thanks!

Stephen


Re: Big image tables maintenance

From: still Learner
Date:


On Mon, Sep 17, 2018, 19:28 Stephen Frost <sfrost@snowman.net> wrote:
> Greetings,
>
> (limiting this to -admin, cross-posting this to a bunch of different
> lists really isn't helpful)
>
> * still Learner (stilllearner23@gmail.com) wrote:
> > I have a 10 TB size table with multiple bytea columns (image & doc) and
> > makes 20TB of DB size. I have a couple of issues to maintain the DB.
>
> *What* are those issues..?  That's really the first thing to discuss
> here but you don't ask any questions about it or state what the issue is
> (except possibly for backups, but we have solutions for that, as
> mentioned below).
>
> > 1. I Would like to separate the image column from the 10TB size table,
> > place it in a separate schema. The change should not result in any query
> > change in the application.  Is it possible? Doing this it should not affect
> > the performance.
>
> How large are these images?  PostgreSQL will already pull out large
> column values and put them into a side-table for you, behind the scenes,
> using a technique called TOAST.  Documentation about TOAST is available
> here:
>
> https://www.postgresql.org/docs/current/static/storage-toast.html

Image size is restricted to two digit KBs only, but we have a very large volume of data. The main reason to split the images into a different schema is to avoid data loss in the future if corruption occurs on the table. Also, maintenance would be easier compared to now. The DB growth is quite fast, I can say 1 TB per quarter.


> > 2. I can't maintain files on File system as the count is huge, so thinking
> > of using any no-sql mostly mongo-DB, is it recommended? Or PostgreSQL
> > itself can handle?
>
> I suspect you'd find that your data size would end up being much, much
> larger if you tried to store it as JSON or in a similar system, and
> you're unlikely to get any performance improvement (much more likely the
> opposite, in fact)


We are also considering document management tools. 

For this type of huge amount of data, is it advisable to keep the images in bytea type only, or is JSONB (which I haven't used yet) also an option?

> > 3. Taking the backup of 20TB data, is big task. Any more feasible solution
> > other than online backup/pg_dump?
>
> Absolutely, I'd recommend using pgBackRest which supports parallel
> online backup and restore.  Using pg_dump for a large system like this
> is really not a good idea- your restore time would likely be
> particularly terrible and you have no ability to do point-in-time
> recovery.  Using pgBackRest and a capable system, you'd be able to get a
> complete backup of 20TB in perhaps 6-12 hours, with similar time on the
> recovery side.  If you wish to be able to recover faster, running a
> replica (as well as doing backups) may be a good idea, perhaps even a
> time-delayed one.

Yeah, I will try pgBackRest. We already have a time-delayed replica.



> > Each image retrieval is
>
> Unfinished thought here..?

Sorry, somehow I missed completing it. I meant to say the image retrieval ratio would be about 1:10, meaning once each image is inserted it would be retrieved by the application about 10 more times, for verification, prints, etc.

Given the current data growth, how long, or rather up to what size, can I survive with this type of flow? In other words, I don't just want to survive, I would like to build a robust environment.


> > Currently, we are on pg 9.4 and moving to 10.5 soon.
>
> That's definitely a good plan.
>
> Thanks!
>
> Stephen

Re: Big image tables maintenance

From: Stephen Frost
Date:
Greetings,

* still Learner (stilllearner23@gmail.com) wrote:
> On Mon, Sep 17, 2018, 19:28 Stephen Frost <sfrost@snowman.net> wrote:
> > (limiting this to -admin, cross-posting this to a bunch of different
> > lists really isn't helpful)
> >
> > * still Learner (stilllearner23@gmail.com) wrote:
> > > I have a 10 TB size table with multiple bytea columns (image & doc) and
> > > makes 20TB of DB size. I have a couple of issues to maintain the DB.
> >
> > *What* are those issues..?  That's really the first thing to discuss
> > here but you don't ask any questions about it or state what the issue is
> > (except possibly for backups, but we have solutions for that, as
> > mentioned below).
>
> > > 1. I Would like to separate the image column from the 10TB size table,
> > > place it in a separate schema. The change should not result in any query
> > > change in the application.  Is it possible? Doing this it should not affect
> > > the performance.
> >
> > How large are these images?  PostgreSQL will already pull out large
> > column values and put them into a side-table for you, behind the scenes,
> > using a technique called TOAST.  Documentation about TOAST is available
> > here:
> >
> > https://www.postgresql.org/docs/current/static/storage-toast.html
>
> Image size is restricted to two digit KBs only, but we have a very large
> volume of data. The main reason to split the images into a different schema
> is to avoid data loss in the future if corruption occurs on the table. Also,
> maintenance would be easier compared to now. The DB growth is quite fast, I
> can say 1 TB per quarter.

"two digit KBs" doesn't actually provide much enlightenment.  I'd
suggest you check for and look at the size of the TOAST table for your
environment.
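
Something along these lines will show it (substitute your table name; sizes reflect TOAST compression):

    SELECT reltoastrelid::regclass AS toast_table,
           pg_size_pretty(pg_relation_size(reltoastrelid)) AS toast_size,
           pg_size_pretty(pg_total_relation_size(oid))     AS total_size
      FROM pg_class
     WHERE oid = 'public.your_big_table'::regclass;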

As for growth, you'd probably be best off looking at partitioning the
large data set once you've gotten the system up to 10.5, but that's
mostly to make it easier to manage the data and to do things like expire
out old data.
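
On 10 that could look roughly like this (hypothetical names, range-partitioned on an insert date so that whole quarters can be detached or dropped later):

    CREATE TABLE doc_image (
        doc_id      bigint  NOT NULL,
        created_on  date    NOT NULL,
        image       bytea
    ) PARTITION BY RANGE (created_on);

    CREATE TABLE doc_image_2018q3 PARTITION OF doc_image
        FOR VALUES FROM ('2018-07-01') TO ('2018-10-01');
    CREATE TABLE doc_image_2018q4 PARTITION OF doc_image
        FOR VALUES FROM ('2018-10-01') TO ('2019-01-01');

    -- later, expire a whole quarter at once
    ALTER TABLE doc_image DETACH PARTITION doc_image_2018q3;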

> > > 2. I can't maintain files on File system as the count is huge, so thinking
> > > of using any no-sql mostly mongo-DB, is it recommended? Or PostgreSQL
> > > itself can handle?
> >
> > I suspect you'd find that your data size would end up being much, much
> > larger if you tried to store it as JSON or in a similar system, and
> > you're unlikely to get any performance improvement (much more likely the
> > opposite, in fact)
>
> We are also considering document management tools.

Not really sure what that changes here, but doesn't seem like much.

> For this type of huge amount of data, is it advisable to keep the images in
> bytea type only, or is JSONB (which I haven't used yet) also an option?

If you go to JSONB then you'd likely end up seriously increasing the
size, so, no, I wouldn't suggest going there for binary image data.
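
If you want to see the difference on your own data, a rough comparison on one row looks like this (hypothetical table and column names; to_jsonb needs 9.5 or later, so after your upgrade):

    SELECT pg_column_size(image)                             AS bytea_bytes,
           pg_column_size(to_jsonb(encode(image, 'base64'))) AS jsonb_bytes
      FROM doc_image
     LIMIT 1;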

> > > 3. Taking the backup of 20TB data, is big task. Any more feasible solution
> > > other than online backup/pg_dump?
> >
> > Absolutely, I'd recommend using pgBackRest which supports parallel
> > online backup and restore.  Using pg_dump for a large system like this
> > is really not a good idea- your restore time would likely be
> > particularly terrible and you have no ability to do point-in-time
> > recovery.  Using pgBackRest and a capable system, you'd be able to get a
> > complete backup of 20TB in perhaps 6-12 hours, with similar time on the
> > recovery side.  If you wish to be able to recover faster, running a
> > replica (as well as doing backups) may be a good idea, perhaps even a
> > time-delayed one.
>
> Yeah, I will try pgBackRest. We already have a time-delayed replica.

That's good.

> > > Each image retrieval is
> >
> > Unfinished thought here..?
>
> Sorry, somehow I missed completing it. I meant to say the image retrieval
> ratio would be about 1:10, meaning once each image is inserted it would be
> retrieved by the application about 10 more times, for verification, prints, etc.

If there's an issue with the load associated with retrieving the images
then I would suggest that you stand up a physical replica or two and
then move the read load to those systems.
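
On 10, a streaming replica built from a base backup would have a recovery.conf along these lines (hostname and credentials are placeholders, and you'd set hot_standby = on in postgresql.conf to serve reads):

    # recovery.conf on the standby
    standby_mode = 'on'
    primary_conninfo = 'host=primary.example.com user=replicator password=secret'
    # optional: keep the replica deliberately behind the primary
    recovery_min_apply_delay = '4h'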

> Given the current data growth, how long, or rather up to what size, can I
> survive with this type of flow? In other words, I don't just want to survive,
> I would like to build a robust environment.

PostgreSQL is quite robust and can handle a very large amount of data.
You can improve on that by having physical replicas which are available
for failover and handling high read load.  At 1TB/quarter, it seems very
unlikely that you'll run into any serious limitations in PostgreSQL any
time soon.

Thanks!

Stephen
