Обсуждение: Database versus filesystem for storing images
A web application will store hundreds of thousands images. The images meta data will be stored in postgresql but I'm in doubt about where to store the images themselves, if in the database as byte strings or in the file system. Would one of them have a faster image search than the other? In the database the images would be indexed by a serial id, that of the file submitter, and an image number. They would always be searched by the same key. In the file system there would be a two or three level directory structure and the directory names would be hexadecimal numbers like '/images_directory/f/0/d' taken from the 2/3 first letters of the file name, which would be a hash. Whenever I want to know the hash name of an image so I can find it in the file system I would have to search it first in the database by the index. So I suppose the database will always be faster. But there is an advantage in using the file system. It is that the Apache's HTTP file caching management is there for free. If I store the images in the database I will have to handle the request and response HTTP headers myself if I want the images to be cached by the clients. What is the best practice in this situation? I mean not only the two options as above but any. Regards, -- Clodoaldo Pinto Neto
Clodoaldo wrote: > A web application will store hundreds of thousands images. <snip> > What is the best practice in this situation? I mean not only the two > options as above but any. This discussion comes up regularly. See the archives for a thread titled "Storing images in PostgreSQL databases (again)" for the latest. -- Guy Rouillier
Personally, I'd put them on the file system, because then backup software can perform incremental backups. In the database, that becomes more of a difficulty. One suggestion, don't use a file name from a hash to store the image, just use the serial id, and break them up by hundreds or thousands, iow image 1123 might be in images/000/000001/000001123. -- Scott Ribe scott_ribe@killerbytes.com http://www.killerbytes.com/ (303) 722-0567 voice
Scott Ribe wrote: > Personally, I'd put them on the file system, because then backup software > can perform incremental backups. In the database, that becomes more of a > difficulty. One suggestion, don't use a file name from a hash to store the > image, just use the serial id, and break them up by hundreds or thousands, > iow image 1123 might be in images/000/000001/000001123. > > -- > Scott Ribe > scott_ribe@killerbytes.com > http://www.killerbytes.com/ > (303) 722-0567 voice I think I know the answer, but if you don't have an "application server" - ie a webserver, etc, and many of the workstations/clients that need access to the images but may not have access to a network share, isn't the database the only choice ? - or is there a postgresql function/utility that will "server" the file from the file system based on the reference/link embeded in the database ?? Geoff.
5 Jan 2007 06:59:18 -0800, imageguy <imageguy1206@gmail.com>: > > I think I know the answer, If you know the answer please tell it as I have read some discussions on the web and although I have decided on a solution I'm still not sure about the best answer, if there is a best answer after all. > but if you don't have an "application > server" - ie a webserver, etc, Yes I have an application server, the Apache server. > and many of the workstations/clients > that need access to the images but may not have access to a network > share, network share? I don't understand. The images will be loaded by html pages with the img tag like in <img src="http://domain.com/images/xxx.jpg"> > isn't the database the only choice ? No. It is one of the choices. The other is to store the images in the file system, in a directory readable by Apache. > - or is there a postgresql function/utility that will "server" the > file from the file system based on the reference/link embeded in the > database ?? I think some procedure languages can read files. In this case what would be the gain in introducing a middle man, the db server? Regards, -- Clodoaldo Pinto Neto
It's almost always better to store the images on the file system and just store the filename or relative path in the database. This is more efficient, doesn't bloat the database by storing files in it, and is easier to get proper browser caching behavior (depending on how your app is setup). I try to avoid BLOBs whenever possible. Cheers, Jeremy Haile On Fri, 5 Jan 2007 17:18:10 -0200, "Clodoaldo" <clodoaldo.pinto.neto@gmail.com> said: > 5 Jan 2007 06:59:18 -0800, imageguy <imageguy1206@gmail.com>: > > > > I think I know the answer, > > If you know the answer please tell it as I have read some discussions > on the web and although I have decided on a solution I'm still not > sure about the best answer, if there is a best answer after all. > > > but if you don't have an "application > > server" - ie a webserver, etc, > > Yes I have an application server, the Apache server. > > > and many of the workstations/clients > > that need access to the images but may not have access to a network > > share, > > network share? I don't understand. The images will be loaded by html > pages with the img tag like in <img > src="http://domain.com/images/xxx.jpg"> > > > isn't the database the only choice ? > > No. It is one of the choices. The other is to store the images in the > file system, in a directory readable by Apache. > > > - or is there a postgresql function/utility that will "server" the > > file from the file system based on the reference/link embeded in the > > database ?? > > I think some procedure languages can read files. In this case what > would be the gain in introducing a middle man, the db server? > > Regards, > -- > Clodoaldo Pinto Neto > > ---------------------------(end of broadcast)--------------------------- > TIP 1: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly
Don't store your images in the database. Store them on the filesystem and store their path in the database. Anyone that tells you otherwise is a stark raving madman :) My system is very heavily used, and our pg_dump is only a few gigs. Meanwhile our images/documents storage is well over a hundred gigs. I'd hate to think that I'd have to dump and restore 100 gigs every time I wanted to dump the newest data to the development database. As far as how they actually get to the client machine, typically these days people use web servers for this sort of thing. Clodoaldo wrote: > 5 Jan 2007 06:59:18 -0800, imageguy <imageguy1206@gmail.com>: > >> >> I think I know the answer, > > > If you know the answer please tell it as I have read some discussions > on the web and although I have decided on a solution I'm still not > sure about the best answer, if there is a best answer after all. > >> but if you don't have an "application >> server" - ie a webserver, etc, > > > Yes I have an application server, the Apache server. > >> and many of the workstations/clients >> that need access to the images but may not have access to a network >> share, > > > network share? I don't understand. The images will be loaded by html > pages with the img tag like in <img > src="http://domain.com/images/xxx.jpg"> > >> isn't the database the only choice ? > > > No. It is one of the choices. The other is to store the images in the > file system, in a directory readable by Apache. > >> - or is there a postgresql function/utility that will "server" the >> file from the file system based on the reference/link embeded in the >> database ?? > > > I think some procedure languages can read files. In this case what > would be the gain in introducing a middle man, the db server? > > Regards,
John McCawley <nospam@hardgeus.com> writes: > Don't store your images in the database. Store them on the filesystem and > store their path in the database. Anyone that tells you otherwise is a stark > raving madman :) > > My system is very heavily used, and our pg_dump is only a few gigs. Meanwhile > our images/documents storage is well over a hundred gigs. I'd hate to think > that I'd have to dump and restore 100 gigs every time I wanted to dump the > newest data to the development database. How do you plan your backup routine and how do you guarantee that on a failure all needed data is restored? I mean, how do you handle integrity with data outside the database? -- Jorge Godoy <jgodoy@gmail.com>
 >> Don't store your images in the database.  Store them on the filesystem and
 >> store their path in the database
I 100% agree.  Use the database as a lookup into the filesystem.  Don't load the
database up with terabytes of non-searchable binary data?  not sure how that
would help you?
Here is one idea:
Have a 64-bit sequence that you use to generate an image_id (becomes file name).
  Hash that id out over a 3-level deep directory structure that allows 4096
entries per directory.  Should give you 64-bit worth of files.
CREATE OR REPLACE FUNCTION get_image_path(image_id BIGINT)
RETURNS TEXT AS $$
   DECLARE
     -- 40963, avoids "integer out of range"
     dir1_val BIGINT := 68719476736;
     dir1 BIGINT;
     dir2 BIGINT;
     dir3 BIGINT;
     path TEXT;
   BEGIN
     dir1 := (image_id / dir1_val) % 4096;
     dir2 := (image_id / (4096 * 4096)) % 4096;
     dir3 := (image_id / 4096) % 4096;
     RETURN '/BASE_PATH/' || dir1 || '/' || dir2 ||
            '/' || dir3 || '/' || image_id;
   END;
$$ LANGUAGE PLPGSQL;
test=# select get_image_path(200399322222);
            get_image_path
-------------------------------------
  /BASE_PATH/2/3752/2991/200399322222
(1 row)
 >I mean, how do you handle integrity with data
 > outside the database?
You don't, the file system handles integrity of the stored data.  Although, one
must careful to avoid db and fs orphans.  Meaning, a record with no
corresponding file or a file with no corresponging record.  Always
write()/insert an image file to the system within a transaction, including
writing the image out to the fs.  Make sure to unlink any paritally written
image files.
 >>How do you plan your backup routine
In regards to backup, backup the files one-by-one.  Grab the lastest image file
refs from the database and start backing up those images.  Each successfully
backed up image should be followed by inserting that file's database record into
a remote db server.  If anything fails, cleanup the partial image file (to avoid
orphaned data) and rollout the transaction.
just one idea.  i'm sure there are other ways of doing it.  point is, this is
completely possible to do reliably.
andrew
Jorge Godoy wrote:
> John McCawley <nospam@hardgeus.com> writes:
>
>> Don't store your images in the database.  Store them on the filesystem and
>> store their path in the database.  Anyone that tells you otherwise is a stark
>> raving madman :)
>>
>> My system is very heavily used, and our pg_dump is only a few gigs.  Meanwhile
>> our images/documents storage is well over a hundred gigs.  I'd hate to think
>> that I'd have to dump and restore 100 gigs every time I wanted to dump the
>> newest data to the development database.
>
> How do you plan your backup routine and how do you guarantee that on a failure
> all needed data is restored?  I mean, how do you handle integrity with data
> outside the database?
>
			
		"... and Moses said unto them, 'The eleventh commandment : thou shalt store images in a database!'..." What if you had another database where you stored just the images and not back it up if you don't want to? As an application developer, I like the idea of storing files and images in the database because it makes it much easier to control access and security from an application standpoint. I think Microsoft SQL Server stores blobs in a separate file, and only retains pointers in the actually database field for that blob. So when you SELECT on that blob MS SQL reads the external file for you as if it lived in the database. I don't know if Postgres does the same thing, but if it did, you wouldn't have to worry about "bloating" database files. Sounds like this is for an Apache web application. Think about how web sites like Flickr and Webshots store their images in a database. You could write a cool Apache mod so that the url: "http://mycompany.com/images/01234.jpg" would go through this module, pull the appropriate image from the database and send it back; all the while the client is none-the-wiser. Just a thought. I think its one of those things where there's not right or wrong answer. Instead you just have to do the minimum of what your application requires. If you don't need application-level control over the files, then by all means store them on the file system. But if you need to control security than you have to prevent physical access to the file (which means no file system storage) and pull the image from the database through the application. My two cents, James John McCawley wrote: > Don't store your images in the database. Store them on the filesystem > and store their path in the database. Anyone that tells you otherwise > is a stark raving madman :) > > My system is very heavily used, and our pg_dump is only a few gigs. > Meanwhile our images/documents storage is well over a hundred gigs. > I'd hate to think that I'd have to dump and restore 100 gigs every > time I wanted to dump the newest data to the development database. > > > As far as how they actually get to the client machine, typically these > days people use web servers for this sort of thing. > Clodoaldo wrote: > >> 5 Jan 2007 06:59:18 -0800, imageguy <imageguy1206@gmail.com>: >> >>> >>> I think I know the answer, >> >> >> If you know the answer please tell it as I have read some discussions >> on the web and although I have decided on a solution I'm still not >> sure about the best answer, if there is a best answer after all. >> >>> but if you don't have an "application >>> server" - ie a webserver, etc, >> >> >> Yes I have an application server, the Apache server. >> >>> and many of the workstations/clients >>> that need access to the images but may not have access to a network >>> share, >> >> >> network share? I don't understand. The images will be loaded by html >> pages with the img tag like in <img >> src="http://domain.com/images/xxx.jpg"> >> >>> isn't the database the only choice ? >> >> >> No. It is one of the choices. The other is to store the images in the >> file system, in a directory readable by Apache. >> >>> - or is there a postgresql function/utility that will "server" the >>> file from the file system based on the reference/link embeded in the >>> database ?? >> >> >> I think some procedure languages can read files. In this case what >> would be the gain in introducing a middle man, the db server? >> >> Regards, > > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Have you searched our list archives? > > http://archives.postgresql.org/ >
This is a web app, so in my example all of the images live on a web server, and our data lives on a separate database server. We have a completely duplicated setup offsite, and mirror images of every server at the backup site. Every night we use rsync to duplicate everything offsite. Also, a cron job pg_dumps every night and copies the dump over to the backup DB server. And before anybody gives me any guff, our office is in New Orleans, and we went through Katrina with less than an hour of downtime, and without losing anything. So there ;) Jorge Godoy wrote: >John McCawley <nospam@hardgeus.com> writes: > > > >>Don't store your images in the database. Store them on the filesystem and >>store their path in the database. Anyone that tells you otherwise is a stark >>raving madman :) >> >>My system is very heavily used, and our pg_dump is only a few gigs. Meanwhile >>our images/documents storage is well over a hundred gigs. I'd hate to think >>that I'd have to dump and restore 100 gigs every time I wanted to dump the >>newest data to the development database. >> >> > >How do you plan your backup routine and how do you guarantee that on a failure >all needed data is restored? I mean, how do you handle integrity with data >outside the database? > > >
We use WebDAV and Apache's Slide to store our images and, as someone pointed out earlier, store the links to the images inour database. WebDAV has provided us with excellent access control and security... http://www.webdav.org/ http://jakarta.apache.org/slide/index.html Just my 1/2 cents, -Jeanna -----Original Message----- From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org]On Behalf Of James Neff Sent: Friday, January 05, 2007 2:27 PM To: John McCawley Cc: Clodoaldo; imageguy; pgsql-general@postgresql.org Subject: Re: [GENERAL] Database versus filesystem for storing images "... and Moses said unto them, 'The eleventh commandment : thou shalt store images in a database!'..." What if you had another database where you stored just the images and not back it up if you don't want to? As an application developer, I like the idea of storing files and images in the database because it makes it much easier to control access and security from an application standpoint. I think Microsoft SQL Server stores blobs in a separate file, and only retains pointers in the actually database field for that blob. So when you SELECT on that blob MS SQL reads the external file for you as if it lived in the database. I don't know if Postgres does the same thing, but if it did, you wouldn't have to worry about "bloating" database files. Sounds like this is for an Apache web application. Think about how web sites like Flickr and Webshots store their images in a database. You could write a cool Apache mod so that the url: "http://mycompany.com/images/01234.jpg" would go through this module, pull the appropriate image from the database and send it back; all the while the client is none-the-wiser. Just a thought. I think its one of those things where there's not right or wrong answer. Instead you just have to do the minimum of what your application requires. If you don't need application-level control over the files, then by all means store them on the file system. But if you need to control security than you have to prevent physical access to the file (which means no file system storage) and pull the image from the database through the application. My two cents, James John McCawley wrote: > Don't store your images in the database. Store them on the filesystem > and store their path in the database. Anyone that tells you otherwise > is a stark raving madman :) > > My system is very heavily used, and our pg_dump is only a few gigs. > Meanwhile our images/documents storage is well over a hundred gigs. > I'd hate to think that I'd have to dump and restore 100 gigs every > time I wanted to dump the newest data to the development database. > > > As far as how they actually get to the client machine, typically these > days people use web servers for this sort of thing. > Clodoaldo wrote: > >> 5 Jan 2007 06:59:18 -0800, imageguy <imageguy1206@gmail.com>: >> >>> >>> I think I know the answer, >> >> >> If you know the answer please tell it as I have read some discussions >> on the web and although I have decided on a solution I'm still not >> sure about the best answer, if there is a best answer after all. >> >>> but if you don't have an "application >>> server" - ie a webserver, etc, >> >> >> Yes I have an application server, the Apache server. >> >>> and many of the workstations/clients >>> that need access to the images but may not have access to a network >>> share, >> >> >> network share? I don't understand. The images will be loaded by html >> pages with the img tag like in <img >> src="http://domain.com/images/xxx.jpg"> >> >>> isn't the database the only choice ? >> >> >> No. It is one of the choices. The other is to store the images in the >> file system, in a directory readable by Apache. >> >>> - or is there a postgresql function/utility that will "server" the >>> file from the file system based on the reference/link embeded in the >>> database ?? >> >> >> I think some procedure languages can read files. In this case what >> would be the gain in introducing a middle man, the db server? >> >> Regards, > > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Have you searched our list archives? > > http://archives.postgresql.org/ > ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
My comment about stark raving madmen was somewhat tongue-in-cheek. There is certainly a case to be made for images in a database under some circumstances. However, for the average Joe web developer, and certainly someone new to the whole problem, I think storing them on the filesystem is going to be by far the most painless and efficient approach. James Neff wrote: > "... and Moses said unto them, 'The eleventh commandment : thou shalt > store images in a database!'..." > > What if you had another database where you stored just the images and > not back it up if you don't want to? > > As an application developer, I like the idea of storing files and > images in the database because it makes it much easier to control > access and security from an application standpoint. > > I think Microsoft SQL Server stores blobs in a separate file, and only > retains pointers in the actually database field for that blob. So > when you SELECT on that blob MS SQL reads the external file for you as > if it lived in the database. I don't know if Postgres does the same > thing, but if it did, you wouldn't have to worry about "bloating" > database files. > > Sounds like this is for an Apache web application. Think about how > web sites like Flickr and Webshots store their images in a database. > You could write a cool Apache mod so that the url: > "http://mycompany.com/images/01234.jpg" would go through this module, > pull the appropriate image from the database and send it back; all the > while the client is none-the-wiser. Just a thought. > > I think its one of those things where there's not right or wrong > answer. Instead you just have to do the minimum of what your > application requires. If you don't need application-level control > over the files, then by all means store them on the file system. But > if you need to control security than you have to prevent physical > access to the file (which means no file system storage) and pull the > image from the database through the application. > > My two cents, > James
How does it make it easier to control access and security? If your web app makes a decision about allowing access to the database, it can just as easily make a decision about allowing access to the filesystem. Storing the images on the file system doesn't mean that there isn't a piece of code that determines whether or not users can access a particular file. I see security and access as a non-issue in making this decision. Either way, it's got to be controlled at the application level (if it is even necessary). On Fri, 05 Jan 2007 15:26:45 -0500, "James Neff" <jneff@tethyshealth.com> said: > "... and Moses said unto them, 'The eleventh commandment : thou shalt > store images in a database!'..." > > What if you had another database where you stored just the images and > not back it up if you don't want to? > > As an application developer, I like the idea of storing files and images > in the database because it makes it much easier to control access and > security from an application standpoint. > > I think Microsoft SQL Server stores blobs in a separate file, and only > retains pointers in the actually database field for that blob. So when > you SELECT on that blob MS SQL reads the external file for you as if it > lived in the database. I don't know if Postgres does the same thing, > but if it did, you wouldn't have to worry about "bloating" database > files. > > Sounds like this is for an Apache web application. Think about how web > sites like Flickr and Webshots store their images in a database. You > could write a cool Apache mod so that the url: > "http://mycompany.com/images/01234.jpg" would go through this module, > pull the appropriate image from the database and send it back; all the > while the client is none-the-wiser. Just a thought. > > I think its one of those things where there's not right or wrong > answer. Instead you just have to do the minimum of what your > application requires. If you don't need application-level control over > the files, then by all means store them on the file system. But if you > need to control security than you have to prevent physical access to the > file (which means no file system storage) and pull the image from the > database through the application. > > My two cents, > James > > > > John McCawley wrote: > > Don't store your images in the database. Store them on the filesystem > > and store their path in the database. Anyone that tells you otherwise > > is a stark raving madman :) > > > > My system is very heavily used, and our pg_dump is only a few gigs. > > Meanwhile our images/documents storage is well over a hundred gigs. > > I'd hate to think that I'd have to dump and restore 100 gigs every > > time I wanted to dump the newest data to the development database. > > > > > > As far as how they actually get to the client machine, typically these > > days people use web servers for this sort of thing. > > Clodoaldo wrote: > > > >> 5 Jan 2007 06:59:18 -0800, imageguy <imageguy1206@gmail.com>: > >> > >>> > >>> I think I know the answer, > >> > >> > >> If you know the answer please tell it as I have read some discussions > >> on the web and although I have decided on a solution I'm still not > >> sure about the best answer, if there is a best answer after all. > >> > >>> but if you don't have an "application > >>> server" - ie a webserver, etc, > >> > >> > >> Yes I have an application server, the Apache server. > >> > >>> and many of the workstations/clients > >>> that need access to the images but may not have access to a network > >>> share, > >> > >> > >> network share? I don't understand. The images will be loaded by html > >> pages with the img tag like in <img > >> src="http://domain.com/images/xxx.jpg"> > >> > >>> isn't the database the only choice ? > >> > >> > >> No. It is one of the choices. The other is to store the images in the > >> file system, in a directory readable by Apache. > >> > >>> - or is there a postgresql function/utility that will "server" the > >>> file from the file system based on the reference/link embeded in the > >>> database ?? > >> > >> > >> I think some procedure languages can read files. In this case what > >> would be the gain in introducing a middle man, the db server? > >> > >> Regards, > > > > > > ---------------------------(end of broadcast)--------------------------- > > TIP 4: Have you searched our list archives? > > > > http://archives.postgresql.org/ > > > > > ---------------------------(end of broadcast)--------------------------- > TIP 3: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq
John McCawley <nospam@hardgeus.com> writes: > This is a web app, so in my example all of the images live on a web server, > and our data lives on a separate database server. We have a completely > duplicated setup offsite, and mirror images of every server at the backup > site. Every night we use rsync to duplicate everything offsite. Also, a cron > job pg_dumps every night and copies the dump over to the backup DB server. > > And before anybody gives me any guff, our office is in New Orleans, and we > went through Katrina with less than an hour of downtime, and without losing > anything. So there ;) Anyway, you have no guarantee that all your images exist on file and that all existing files have a corresponding entry in your database. -- Jorge Godoy <jgodoy@gmail.com>
Andrew Chernow <pg-job@esilo.com> writes:
>>I mean, how do you handle integrity with data
>> outside the database?
> You don't, the file system handles integrity of the stored data.  Although,
> one must careful to avoid db and fs orphans.  Meaning, a record with no
> corresponding file or a file with no corresponging record.  Always
> write()/insert an image file to the system within a transaction, including
> writing the image out to the fs.  Make sure to unlink any paritally written
> image files.
And how do you guarantee that after a failure?  You're restoring two
different sets of data here:
          - backup from your database
          - backup from your files
How do you link them together on that specific operation?  Or even on a daily
basis, if you get corrupted data...
>>>How do you plan your backup routine
> In regards to backup, backup the files one-by-one.  Grab the lastest image
> file refs from the database and start backing up those images.  Each
> successfully backed up image should be followed by inserting that file's
> database record into a remote db server.  If anything fails, cleanup the
> partial image file (to avoid orphaned data) and rollout the transaction.
>
> just one idea.  i'm sure there are other ways of doing it.  point is, this is
> completely possible to do reliably.
Wouldn't replication with, e.g., Slony be easier?  And wouldn't letting the
database handle all the integrity be easier?  I mean, create an "images" table
and then make your record depends on this table, so if there's no record with
the image, you won't have any references to it left.
It would also make the backup plan easier: backup the database.
Not counting that depending on your choice of filesystem and image size you
might get a very poor performance.
--
Jorge Godoy      <jgodoy@gmail.com>
			
		> And how do you guarantee that after a failure? You're restoring two > different sets of data here: > How do you link them together on that specific operation? Or even on a daily > basis, if you get corrupted data... I answered that already. > Not counting that depending on your choice of filesystem and image size you > might get a very poor performance. apache has very good page and image caching. You could take advantage of that using this technique. Another nice feature is the database and images can be handled spearately. Some people have seen this as a disadvantage on this thread, I personally don't see it that why. I guess it depends on access needs, many files and how much data you have. What if you had 3 billion files across a few hundred terabytes? Can you say with experience how the database would hold up in this situation? andrew Jorge Godoy wrote: > Andrew Chernow <pg-job@esilo.com> writes: > >>> I mean, how do you handle integrity with data >>> outside the database? >> You don't, the file system handles integrity of the stored data. Although, >> one must careful to avoid db and fs orphans. Meaning, a record with no >> corresponding file or a file with no corresponging record. Always >> write()/insert an image file to the system within a transaction, including >> writing the image out to the fs. Make sure to unlink any paritally written >> image files. > > And how do you guarantee that after a failure? You're restoring two > different sets of data here: > > - backup from your database > - backup from your files > > How do you link them together on that specific operation? Or even on a daily > basis, if you get corrupted data... > >>>> How do you plan your backup routine >> In regards to backup, backup the files one-by-one. Grab the lastest image >> file refs from the database and start backing up those images. Each >> successfully backed up image should be followed by inserting that file's >> database record into a remote db server. If anything fails, cleanup the >> partial image file (to avoid orphaned data) and rollout the transaction. >> >> just one idea. i'm sure there are other ways of doing it. point is, this is >> completely possible to do reliably. > > Wouldn't replication with, e.g., Slony be easier? And wouldn't letting the > database handle all the integrity be easier? I mean, create an "images" table > and then make your record depends on this table, so if there's no record with > the image, you won't have any references to it left. > > It would also make the backup plan easier: backup the database. > > Not counting that depending on your choice of filesystem and image size you > might get a very poor performance. > > >
On Fri, Jan 05, 2007 at 15:26:45 -0500, James Neff <jneff@tethyshealth.com> wrote: > "... and Moses said unto them, 'The eleventh commandment : thou shalt > store images in a database!'..." > > What if you had another database where you stored just the images and > not back it up if you don't want to? I think the main reason to keep images in the database is if you need transactional semantics. If you are updating images and transactions that started before the update, need to see the old version you are going to want them in the database. I suspect this need isn't very common though.
Yeah - it can make it easier to implement transactional semantics by storing them in the database, although for simple operations it wouldn't be hard to replicate this manually. And you are going to incur a performance penalty by storing them in the database. Another thing to consider is that storing them in the file system makes it much easier to browse the images using third-party tools, update them, archive them (by gzipping or whatever). This is much more difficult if they are stored in the database. On Fri, 5 Jan 2007 15:51:59 -0600, "Bruno Wolff III" <bruno@wolff.to> said: > On Fri, Jan 05, 2007 at 15:26:45 -0500, > James Neff <jneff@tethyshealth.com> wrote: > > "... and Moses said unto them, 'The eleventh commandment : thou shalt > > store images in a database!'..." > > > > What if you had another database where you stored just the images and > > not back it up if you don't want to? > > I think the main reason to keep images in the database is if you need > transactional semantics. If you are updating images and transactions that > started before the update, need to see the old version you are going to > want them in the database. I suspect this need isn't very common though. > > ---------------------------(end of broadcast)--------------------------- > TIP 3: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq
On Fri, 2007-01-05 at 15:54, Jeremy Haile wrote: > Yeah - it can make it easier to implement transactional semantics by > storing them in the database, although for simple operations it wouldn't > be hard to replicate this manually. And you are going to incur a > performance penalty by storing them in the database. > > Another thing to consider is that storing them in the file system makes > it much easier to browse the images using third-party tools, update > them, archive them (by gzipping or whatever). This is much more > difficult if they are stored in the database. The biggest performance penalty from them being in the same database as your data is that they're going to be the majority of your kernel / database buffers. So, as mentioned earlier, it's almost a necessity (for performance reasons) to put them into their own db server.
"Jeremy Haile" <jhaile@fastmail.fm> writes: > Another thing to consider is that storing them in the file system makes > it much easier to browse the images using third-party tools, update > them, archive them (by gzipping or whatever). This is much more > difficult if they are stored in the database. If you touch the files with third-party tools how are you going to prevent that they aren't missing when the database say they are there? If you're referencing them somehow, you have to guarantee that they are there... Or am I the only one that is thinking about referential integrity with those files? -- Jorge Godoy <jgodoy@gmail.com>
If you end up storing the pictures on a separate database server for performance reasons, Jorge's argument regarding integrity becomes moot...And his argument so far is the strongest case I've seen for storing the files in the database. Scott Marlowe wrote: >On Fri, 2007-01-05 at 15:54, Jeremy Haile wrote: > > >>Yeah - it can make it easier to implement transactional semantics by >>storing them in the database, although for simple operations it wouldn't >>be hard to replicate this manually. And you are going to incur a >>performance penalty by storing them in the database. >> >>Another thing to consider is that storing them in the file system makes >>it much easier to browse the images using third-party tools, update >>them, archive them (by gzipping or whatever). This is much more >>difficult if they are stored in the database. >> >> > >The biggest performance penalty from them being in the same database as >your data is that they're going to be the majority of your kernel / >database buffers. So, as mentioned earlier, it's almost a necessity >(for performance reasons) to put them into their own db server. > >---------------------------(end of broadcast)--------------------------- >TIP 5: don't forget to increase your free space map settings > > ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings
>> Or am >>I the only one that is thinking about referential integrity with those files? Not at all. I'm not sure how 3rd party tools like apache, `ls`, `gzip`, `find`, nfs, etc... are breaking integrity. Any php, jsp, C or shell script you write would be doing the same thing, accessing the data. All your doing is making your system more accessible to a wider range of tools, other than your own. Just like you are cautious about not deleting the pg_data folder, big no-no, you need to be cautious about not deleting or modifying these image files. Basically, the image files are an extension of the database that you would glue together. I think there is a clear separation of tasks here. I think this is required if you were handling any sizeable amounts of data. The other thing is the original poster needs apache to access these images. This is a requirement of his/her project. Probably a good idea to meet those requirements. It is far more effecient to have apache access them directly then pounding your database with web requests for image file data. It is good design, and distribution of tasks, to get the image paths from the database and and have apache server the data; select images paths from php or something. Now you can have the data anywhere, on a different server, over an nfs mount, gfs, wherever. Much more flexible and distributed. Andrew Jorge Godoy wrote: > "Jeremy Haile" <jhaile@fastmail.fm> writes: > >> Another thing to consider is that storing them in the file system makes >> it much easier to browse the images using third-party tools, update >> them, archive them (by gzipping or whatever). This is much more >> difficult if they are stored in the database. > > If you touch the files with third-party tools how are you going to prevent > that they aren't missing when the database say they are there? If you're > referencing them somehow, you have to guarantee that they are there... Or am > I the only one that is thinking about referential integrity with those files? >
Andrew Chernow <pg-job@esilo.com> writes: >> And how do you guarantee that after a failure? You're restoring two >> different sets of data here: > >> How do you link them together on that specific operation? Or even on a daily >> basis, if you get corrupted data... > > I answered that already. I'm sorry. It must be the flu, the pain or something else, but I really don't remember reading your message about how you can be 100% sure that all references to the filesystem have their corresponding files present and also all present files have their respective database entry. I've seen HA measures (I don't imagine anyone sacrificing their customers copying 3 billion files and a few hundred terabytes while still maintaining an adequate service rate with part of its infra-structure down, just to use your example to that answer...), ideas about requiring an answer from the filesystem before considering the transaction done DB-wise (who grants you that the image really went to the disk and is not on cache when the machine has a power failure and shuts down abruptly?)... I might have missed your message, though. Would you be gentle to quote that again, please? > Another nice feature is the database and images can be handled spearately. What might be bad. > Some people have seen this as a disadvantage on this thread, I personally > don't see it that why. I am questioning two points that show two situations where it is bad. Specially if those images are important to the records (e.g. product failure images, prize winning images, product specs, prototype images, blueprints -- after all, we don't need to restrict our files to images, right? --, agreements, spreadsheets with the last years of company account movements, documents received from lawyers, etc.). > I guess it depends on access needs, many files and how much data you have. > What if you had 3 billion files across a few hundred terabytes? Can you say > with experience how the database would hold up in this situation? I'd have partitioning if I had a case like that. Part of those would be delegated to one machine, part to another and so on. Even if that solution -- partitioning -- makes the overall MTBF lower... And I still can't imagine how you guarantee that all 3 billion files have their corresponding entries on the database. Couting them is not enough since I can have one file with the wrong "name" present on the filesystem or some duplicate record on the DB... -- Jorge Godoy <jgodoy@gmail.com>
Referential integrity would be one positive for storing the files in the database. I wasn't responding to that. I'm simply saying that browsing them with third-party tools, updating, compressing/archiving, etc. is easier if they are not in the database. Those are all actions that I've found useful on other projects when storing user-uploaded images. Depending upon the number of images on disk, it might not be hard to verify that all the images referenced from the DB are there. You could have a small program that steps through each record and verifies its existence on disk. That could easily be run after a restore or as a batch-process. If you're talking about trillions of images - sure that'd be tough. If these images are extremely important - maybe that solution is unacceptable. But this is just a case of too little information to make a decision. There are pros and cons to both approach - in every project I've worked on that faced this decision, I felt the pros of storing it in the file system outweighed the pros of storing it in the DB. But there is no right or wrong answer to the question (unless of course you had special circumstances that made one option clearly superior - but I don't think we know enough details to make that call) My two cents, Jeremy Haile On Fri, 05 Jan 2007 20:24:05 -0200, "Jorge Godoy" <jgodoy@gmail.com> said: > "Jeremy Haile" <jhaile@fastmail.fm> writes: > > > Another thing to consider is that storing them in the file system makes > > it much easier to browse the images using third-party tools, update > > them, archive them (by gzipping or whatever). This is much more > > difficult if they are stored in the database. > > If you touch the files with third-party tools how are you going to > prevent > that they aren't missing when the database say they are there? If you're > referencing them somehow, you have to guarantee that they are there... > Or am > I the only one that is thinking about referential integrity with those > files? > > -- > Jorge Godoy <jgodoy@gmail.com>
Scott Ribe wrote: > Personally, I'd put them on the file system, because then backup software > can perform incremental backups. In the database, that becomes more of a > difficulty. One suggestion, don't use a file name from a hash to store the > image, just use the serial id, and break them up by hundreds or thousands, > iow image 1123 might be in images/000/000001/000001123. > > -- > Scott Ribe > scott_ribe@killerbytes.com > http://www.killerbytes.com/ > (303) 722-0567 voice I think I know the answer, but if you don't have an "application server" - ie a webserver, etc, and many of the workstations/clients that need access to the images but may not have access to a network share, isn't the database the only choice ? - or is there a postgresql function/utility that will "server" the file from the file system based on the reference/link embeded in the database ?? Geoff.
On fös, 2007-01-05 at 15:49 -0500, Andrew Chernow wrote: > I 100% agree. Use the database as a lookup into the filesystem. Don't load the > database up with terabytes of non-searchable binary data? not sure how that > would help you? > > >I mean, how do you handle integrity with data > > outside the database? > You don't, the file system handles integrity of the stored data. Although, one > must careful to avoid db and fs orphans. Meaning, a record with no > corresponding file or a file with no corresponging record. Always > write()/insert an image file to the system within a transaction, including > writing the image out to the fs. Make sure to unlink any paritally written > image files. what happens if you rollback a transaction that just updated an image file? for that matter, what happens if one transaction is using or even reading an image while another is updating it? gnari
> what happens if you rollback a transaction that just updated > an image file? > > for that matter, what happens if one transaction is using or > even reading an image while another is updating it? One thing I mentioned was about a point in time backup, not updating the image. This would rollback the transaction on the backup server, not the live one. I was also proposing to never update an image. Just generate a new file, you got 64-bits worth of ids. If you need to rollback an update to image on the live server, unlink() the file and rollback. If all is good, unlink the old version. andrew Ragnar wrote: > On fös, 2007-01-05 at 15:49 -0500, Andrew Chernow wrote: >> I 100% agree. Use the database as a lookup into the filesystem. Don't load the >> database up with terabytes of non-searchable binary data? not sure how that >> would help you? >> > >> >I mean, how do you handle integrity with data >> > outside the database? >> You don't, the file system handles integrity of the stored data. Although, one >> must careful to avoid db and fs orphans. Meaning, a record with no >> corresponding file or a file with no corresponging record. Always >> write()/insert an image file to the system within a transaction, including >> writing the image out to the fs. Make sure to unlink any paritally written >> image files. > > what happens if you rollback a transaction that just updated > an image file? > > for that matter, what happens if one transaction is using or > even reading an image while another is updating it? > > gnari > > > > ---------------------------(end of broadcast)--------------------------- > TIP 1: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly > >
> copying 3 billion files and a few hundred terabytes while still maintaining an > adequate service rate with part of its infra-structure down, just to use your I wasn't saying to do this each time you run a backup, geez that would be horrible. Pickup from where you left off the last time you backed up data/records. How many images and how much data is being generated in a 60 second period? I dought 3 billion files and hundreds of terabytes. When you know what your data generation is, you know what resources you need to replicate this information to a backup server (local or remote). How is this any different than db replication. It would have to backup the same amount of information? You would require the same horse power and bandwidth. andrew Jorge Godoy wrote: > Andrew Chernow <pg-job@esilo.com> writes: > >>> And how do you guarantee that after a failure? You're restoring two >>> different sets of data here: >>> How do you link them together on that specific operation? Or even on a daily >>> basis, if you get corrupted data... >> I answered that already. > > I'm sorry. It must be the flu, the pain or something else, but I really don't > remember reading your message about how you can be 100% sure that all > references to the filesystem have their corresponding files present and also > all present files have their respective database entry. > > I've seen HA measures (I don't imagine anyone sacrificing their customers > copying 3 billion files and a few hundred terabytes while still maintaining an > adequate service rate with part of its infra-structure down, just to use your > example to that answer...), ideas about requiring an answer from the > filesystem before considering the transaction done DB-wise (who grants you > that the image really went to the disk and is not on cache when the machine > has a power failure and shuts down abruptly?)... > > I might have missed your message, though. Would you be gentle to quote that > again, please? > >> Another nice feature is the database and images can be handled spearately. > > What might be bad. > >> Some people have seen this as a disadvantage on this thread, I personally >> don't see it that why. > > I am questioning two points that show two situations where it is bad. > Specially if those images are important to the records (e.g. product failure > images, prize winning images, product specs, prototype images, blueprints -- > after all, we don't need to restrict our files to images, right? --, > agreements, spreadsheets with the last years of company account movements, > documents received from lawyers, etc.). > >> I guess it depends on access needs, many files and how much data you have. >> What if you had 3 billion files across a few hundred terabytes? Can you say >> with experience how the database would hold up in this situation? > > I'd have partitioning if I had a case like that. Part of those would be > delegated to one machine, part to another and so on. Even if that solution -- > partitioning -- makes the overall MTBF lower... > > And I still can't imagine how you guarantee that all 3 billion files have > their corresponding entries on the database. Couting them is not enough since > I can have one file with the wrong "name" present on the filesystem or some > duplicate record on the DB... > >
Ragnar <gnari@hive.is> writes: > for that matter, what happens if one transaction is using or > even reading an image while another is updating it? I believe that this also depends on how the file is updated. Some applications create a temporary file with the new (or changed) content and then replace the old file with this one. There are also cases where you can keep a file descriptor open to one file that has no name (it existed once) and that was replaced. It all depends on how / who's generating and how / who's consuming the information. -- Jorge Godoy <jgodoy@gmail.com>
Andrew Chernow <pg-job@esilo.com> writes:
>>> Or am
>>>I the only one that is thinking about referential integrity with those files?
> Not at all.  I'm not sure how 3rd party tools like apache, `ls`, `gzip`,
> `find`, nfs, etc... are breaking integrity.  Any php, jsp, C or shell script
For gzip, for example:
    - DB record contains "/some/dir/file.ext"
    - Filesystem contains "/some/dir/file.ext.gz"
NFS can also be guilty if it fails or the server goes down.  If I have a share
mounted as "/some/remote/dir" and I say that the file is at
"/some/remote/dir/file.ext" but the NFS server is down then it is the same as
if the file didn't exist at all since it can't be reached.
For both cases, if the file is inside the database and I am referencing it
then I know that it *is* there.  Referential integrity takes care of that for
me with no cost or any other action of mine.
> you write would be doing the same thing, accessing the data.  All your doing
> is making your system more accessible to a wider range of tools, other than
> your own.
And introducint more failing points.
> Just like you are cautious about not deleting the pg_data folder, big no-no,
> you need to be cautious about not deleting or modifying these image
> files. Basically, the image files are an extension of the database that you
> would glue together.  I think there is a clear separation of tasks here.  I
> think this is required if you were handling any sizeable amounts of data.
So you have added the possibility of manipulating (which is different from
reading or accessing) the files directly but you say "don't touch them!".
> The other thing is the original poster needs apache to access these
> images. This is a requirement of his/her project.  Probably a good idea to
And nothing prevents those files from being served from the database.
> meet those requirements.  It is far more effecient to have apache access
> them
Where weren't we meeting his/her requirements?  All the discussion is around
available means to do that.  One option is having the files on the database,
the other is on the filesystem.  From my understanding we're discussing the
benefits of each one.  Aren't we?
> directly then pounding your database with web requests for image file data.
It might be.  If you can be certain that the image *is* there when it tries to
access it.  Both examples above -- gzip + NFS -- show two ways of having
different things inside the DB and on the FS.
> It is good design, and distribution of tasks, to get the image paths from the
> database and and have apache server the data; select images paths from php or
> something.  Now you can have the data anywhere, on a different server, over an
> nfs mount, gfs, wherever.  Much more flexible and distributed.
And also more uncertain that the referred data is there at all.
--
Jorge Godoy      <jgodoy@gmail.com>
			
		> And introducint more failing points. depends on how you do it. not everything has to go in a database to be reliable. Part of good engineering is realizing when to use a tool and when not to. I think a 10K view of the issue is in order. The bigger picture is being missed, or I am not communicating well. I bet if I gave you a million dollars, you could implement what I proposed; in a reliable, redundant fasion. You could probably convince anyone of its merits, making my case for me. All you would have to do is entertain the idea ... I mean you sound smart enough ... probably smarter than me :) andrew Jorge Godoy wrote: > Andrew Chernow <pg-job@esilo.com> writes: > >>>> Or am >>>> I the only one that is thinking about referential integrity with those files? >> Not at all. I'm not sure how 3rd party tools like apache, `ls`, `gzip`, >> `find`, nfs, etc... are breaking integrity. Any php, jsp, C or shell script > > For gzip, for example: > > - DB record contains "/some/dir/file.ext" > - Filesystem contains "/some/dir/file.ext.gz" > > NFS can also be guilty if it fails or the server goes down. If I have a share > mounted as "/some/remote/dir" and I say that the file is at > "/some/remote/dir/file.ext" but the NFS server is down then it is the same as > if the file didn't exist at all since it can't be reached. > > For both cases, if the file is inside the database and I am referencing it > then I know that it *is* there. Referential integrity takes care of that for > me with no cost or any other action of mine. > >> you write would be doing the same thing, accessing the data. All your doing >> is making your system more accessible to a wider range of tools, other than >> your own. > > And introducint more failing points. > >> Just like you are cautious about not deleting the pg_data folder, big no-no, >> you need to be cautious about not deleting or modifying these image >> files. Basically, the image files are an extension of the database that you >> would glue together. I think there is a clear separation of tasks here. I >> think this is required if you were handling any sizeable amounts of data. > > So you have added the possibility of manipulating (which is different from > reading or accessing) the files directly but you say "don't touch them!". > >> The other thing is the original poster needs apache to access these >> images. This is a requirement of his/her project. Probably a good idea to > > And nothing prevents those files from being served from the database. > >> meet those requirements. It is far more effecient to have apache access >> them > > Where weren't we meeting his/her requirements? All the discussion is around > available means to do that. One option is having the files on the database, > the other is on the filesystem. From my understanding we're discussing the > benefits of each one. Aren't we? > >> directly then pounding your database with web requests for image file data. > > It might be. If you can be certain that the image *is* there when it tries to > access it. Both examples above -- gzip + NFS -- show two ways of having > different things inside the DB and on the FS. > >> It is good design, and distribution of tasks, to get the image paths from the >> database and and have apache server the data; select images paths from php or >> something. Now you can have the data anywhere, on a different server, over an >> nfs mount, gfs, wherever. Much more flexible and distributed. > > And also more uncertain that the referred data is there at all. >
Andrew Chernow <pg-job@esilo.com> writes: > I wasn't saying to do this each time you run a backup, geez that would be > horrible. Pickup from where you left off the last time you backed up > data/records. How many images and how much data is being generated in a 60 > second period? I dought 3 billion files and hundreds of terabytes. When you > know what your data generation is, you know what resources you need to > replicate this information to a backup server (local or remote). I'm not talking about backups. I'm talking about restores. > How is this any different than db replication. It would have to backup the > same amount of information? You would require the same horse power and > bandwidth. The difference is that I'd restore the data and then sync the difference from some point in time to "now". The referential integrity would be guaranteed by the database itself and I won't have any pointers to files that doesn't exist or files without pointers to it. -- Jorge Godoy <jgodoy@gmail.com>
Andrew Chernow <pg-job@esilo.com> writes: > depends on how you do it. not everything has to go in a database to be > reliable. Part of good engineering is realizing when to use a tool and when > not to. > > I think a 10K view of the issue is in order. The bigger picture is being > missed, or I am not communicating well. > > I bet if I gave you a million dollars, you could implement what I proposed; > in a reliable, redundant fasion. You could probably convince anyone of its > merits, making my case for me. All you would have to do is entertain the > idea ... I mean you sound smart enough ... probably smarter than me :) LOL. :-) This proposal was made to Demi Moore and her marriage ended in a disaster... ;-) My point is: if I need to be 100% sure that what is referenced on the database is accessible all the time when the reference is, then I need to have this on the database; if it can be missed sometimes then using the filesystem is acceptable. I'd include on the "100%" category fiscal records, client's documents, company projects, etc. On the second the drawing the user made of his grandmother, the picture of the pretty neighbor in the shower, all reports generated automatically from data existing on the database itself, etc. I have systems with both solutions and the ones that give me more trouble to manage are the ones that use external files. If nobody has access to the server, then it is a piece of cake. But if there are other people touching things -- moving / creating / deleting / changing files and directories -- then things get more complicated to manage. Inside the DB they usually don't dare to touch things because they know that they might destroy very important data. It is something that is inherent to people that everything the board of directors use might cost their job. Even if it is just the coffee plate... :-) Anyway, do you need my account number? ;-) You can leave the transfer charges to me, after all friends do that ;-) And I am not smarter than anyone that listen to and learn with other people. I am still learning here since I'm really a newbie in this database world. Thanks for that, though :-) -- Jorge Godoy <jgodoy@gmail.com>
Clodoaldo wrote: > 5 Jan 2007 06:59:18 -0800, imageguy <imageguy1206@gmail.com>: > > > > I think I know the answer, > > If you know the answer please tell it as I have read some discussions > on the web and although I have decided on a solution I'm still not > sure about the best answer, if there is a best answer after all. Sorry, didn't mean to sound like and expert on this, I am actually quite a newbie. From all of the discussions I have read and even the ones in this thread, including your own comments below, it would seem that to store the files in the files system you need some sort of application erver or :middleware - like a webserver - to handle the retreiving and serving of the files. My organization is developing a commercial application for "document tracking". It is not a Browser application, but rather a more traditional "windows" thick client app. At the present time we do not intend to deploy any sort of "application server" - web server, ftp server, and not all of the workstations will have access to a consistent network share. So in this case, it is my understanding that our only real choice is to store the documents and images in the database itself. ... unless someone knows of a postgresql function that would allow us to "server" the file from the filesystem via the dbserver ?? > > > but if you don't have an "application > > server" - ie a webserver, etc, > > Yes I have an application server, the Apache server. > > > and many of the workstations/clients > > that need access to the images but may not have access to a network > > share, > > network share? I don't understand. The images will be loaded by html > pages with the img tag like in <img > src="http://domain.com/images/xxx.jpg"> > > > isn't the database the only choice ? > > No. It is one of the choices. The other is to store the images in the > file system, in a directory readable by Apache. See above. WE are trying to reduce the dependancies on other applications to ensure a simple deployment of the application.
2007/1/5, Jorge Godoy <jgodoy@gmail.com>: > Andrew Chernow <pg-job@esilo.com> writes: > > meet those requirements. It is far more effecient to have apache access > > them > > Where weren't we meeting his/her requirements? All the discussion is around > available means to do that. One option is having the files on the database, > the other is on the filesystem. From my understanding we're discussing the > benefits of each one. Aren't we? Yes, although I suggested two solutions I asked for anything that would be considered the best practice. Now I think there is not a best practice or better, there should be one best practice for each of the solutions. I have done an intranet application that stored images in the database. It worked perfectly and I used the same engine in another intranet application to store not only images but any document which also worked perfectly. The decision to go the dabatase only route was easy: The filesystem space would have to be negotiated while the space occupied by the databases were not controlled and used an advanced storage solution that gave lots of terabytes to be used at will. Also the any document application should not loose a single document and access control should be strictly enforced which was much easier to do with the database since I had no control over the webserver and even if I had I think the database access is still easier to control than the filesystem access. That was in a corporate intranet. What I'm doing now is an internet application. While the FS x DB synchronicity is very important in some kinds of document management, it is not in this application. Indeed if a few images are lost each day it has no meaning in a 500K to 1M inventory. The offended clients just upload them again. No one will be sued. The images are all public. No need to control the access. But the main factor to push me in the file system direction is the HTTP cache management. I want the internet web clients and proxies to cache the images. The Apache web server has it ready and easy. If the images where to be stored in the DB I would have to handle the HTTP cache headers myself. Another code layer. Not too big a deal, but if Apache give me it for free... I wonder why this HTTP cache headers argument didn't surface in this heated debate. Aren't DB developers/admins aware of the internet client's bandwidth limitations? Or they just assume the application would handle the HTTP cache headers? In the applications I created for intranet bandwidth was almost a non issue and I didn't care to make them bandwidth efficient, but for the internet the problem is there and it is big. Regards, -- Clodoaldo Pinto Neto
2007/1/6, Andrew Chernow <andrew@esilo.com>: > >>>apache has very good page and image caching. You could take advantage > >>>of that using this technique. > > > I wonder why this HTTP cache headers argument didn't surface in this > > heated debate. > > I did other up this argument by the way. Sorry, I understood you were talking about server side caching while what I refer to is client side caching. Regards, -- Clodoaldo Pinto Neto
On 1/6/07, Jorge Godoy <jgodoy@gmail.com> wrote: > Andrew Chernow <pg-job@esilo.com> writes: > > >>I mean, how do you handle integrity with data > >> outside the database? > > You don't, the file system handles integrity of the stored data. Although, > > one must careful to avoid db and fs orphans. Meaning, a record with no > > corresponding file or a file with no corresponging record. Always > > write()/insert an image file to the system within a transaction, including > > writing the image out to the fs. Make sure to unlink any paritally written > > image files. > > And how do you guarantee that after a failure? You're restoring two > different sets of data here: > > - backup from your database > - backup from your files you have a point -- keeping two sets of data in sync is more difficult than working of a single interface (you have to implement your own referential integrity of sorts), but that has to be balanced against the fact that postgresql is not great (yet) at storing and retrieving huge numbers of large binary objects. then again, neither are most filesystems in my opinion. also, and this is coming from a sql junky, sql is not really that good at handling binary data, you have to go right to binary prepared statements before things become even remotely reasonable. taking your side for a moment, the backup argument (against databases) is a little bit disingenuous because while dumping a 10 terabyte database is a pain, backing up a 10 terabyte filesystem is no picnic either, rsync will get seizures...you have to implement some smart logic that takes advantage of the fact that the information rarely if ever changes. andrew's posts hint at a way to do that that could be implemented directly in a database or a filesystem...one would choose a filesystem for this because the overhead is lower but there are other ways... > How do you link them together on that specific operation? Or even on a daily > basis, if you get corrupted data... if you take this approach, you have to layer a middleware over the filesystem and use that always. it's a programming challenge but it can be done... > >>>How do you plan your backup routine > > In regards to backup, backup the files one-by-one. Grab the lastest image > > file refs from the database and start backing up those images. Each > > successfully backed up image should be followed by inserting that file's > > database record into a remote db server. If anything fails, cleanup the > > partial image file (to avoid orphaned data) and rollout the transaction. > > > > just one idea. i'm sure there are other ways of doing it. point is, this is > > completely possible to do reliably. > > Wouldn't replication with, e.g., Slony be easier? And wouldn't letting the > database handle all the integrity be easier? I mean, create an "images" table > and then make your record depends on this table, so if there's no record with > the image, you won't have any references to it left. I think if you were to replicate a really big database, for something like this, a log based replication approach (pitr, or a modified flavor of it) would be a better bet. pg_dump should not even enter into your vocabulary, unless you did some hackery like storing data in progressive tables. > It would also make the backup plan easier: backup the database. backing up big *anythings* can be tough, pg_dump is not a scalable tool. > Not counting that depending on your choice of filesystem and image size you > might get a very poor performance. performance of storing large blobs in the database is not great...if you can get 50mb/sec writing to a filesystem, count yourself lucky if you get 10mb/sec writing to the database, and this is only if you make your client code very efficient. I think at some point in the future postgresql might be a useful tool in the way you are describing. currently, there are some performance issues, more flexibility in backup options would be required, and maybe some other things. merlin
> I wonder why this HTTP cache headers argument didn't surface in this > heated debate. I mentioned this earlier as well. Although you could do it in the app layer - it would be easier to just let the web server handle it.
Is there any overwhelming reason you can't just stick an apache server on your DB server? Unless you expect this thing to get hit *hard*, the performance of having them both on the same machine is pretty acceptable (I know, everyone's opinion about what constitutes heavy usage differs). If this is a simple intranet application with around 100 users, the performance should be OK. If I had to write a client-side app to pull images off of a server, I'd much rather code HTTP GETs into my app than deal with binary data from the database. I've done both, and HTTP GETs are much easier. Bear in mind I haven't pulled binaries from a database in about 5 years, so things might be easier now. imageguy wrote: >Clodoaldo wrote: > > >>5 Jan 2007 06:59:18 -0800, imageguy <imageguy1206@gmail.com>: >> >> >>>I think I know the answer, >>> >>> >>If you know the answer please tell it as I have read some discussions >>on the web and although I have decided on a solution I'm still not >>sure about the best answer, if there is a best answer after all. >> >> > >Sorry, didn't mean to sound like and expert on this, I am actually >quite a newbie. From all of the discussions I have read and even the >ones in this thread, including your own comments below, it would seem >that to store the files in the files system you need some sort of >application erver or :middleware - like a webserver - to handle the >retreiving and serving of the files. >My organization is developing a commercial application for "document >tracking". It is not a Browser application, but rather a more >traditional "windows" thick client app. > >At the present time we do not intend to deploy any sort of "application >server" - web server, ftp server, and not all of the workstations will >have access to a consistent network share. > >So in this case, it is my understanding that our only real choice is to >store the documents and images in the database itself. > >... unless someone knows of a postgresql function that would allow us >to "server" the file from the filesystem via the dbserver ?? > > > > >>>but if you don't have an "application >>>server" - ie a webserver, etc, >>> >>> >>Yes I have an application server, the Apache server. >> >> >> >>>and many of the workstations/clients >>>that need access to the images but may not have access to a network >>>share, >>> >>> >>network share? I don't understand. The images will be loaded by html >>pages with the img tag like in <img >>src="http://domain.com/images/xxx.jpg"> >> >> >> >>>isn't the database the only choice ? >>> >>> >>No. It is one of the choices. The other is to store the images in the >>file system, in a directory readable by Apache. >> >> > >See above. WE are trying to reduce the dependancies on other >applications to ensure a simple deployment of the application. > > >---------------------------(end of broadcast)--------------------------- >TIP 4: Have you searched our list archives? > > http://archives.postgresql.org/ > >
On 1/5/07, Jorge Godoy <jgodoy@gmail.com> wrote:
> Andrew Chernow <pg-job@esilo.com> writes:
>
> >> And how do you guarantee that after a failure?  You're restoring two
> >> different sets of data here:
> >
> >> How do you link them together on that specific operation?  Or even on a daily
> >> basis, if you get corrupted data...
> >
> > I answered that already.
>
> I'm sorry.  It must be the flu, the pain or something else, but I really don't
> remember reading your message about how you can be 100% sure that all
> references to the filesystem have their corresponding files present and also
> all present files have their respective database entry.
By designing the system so that you can be sure.  For instance delegate
removing data from filesystem to a dedicated queue table within database,
and carefully check that code.  Let no other software delete data.
If you need PITR, you can stop the remover utility during backups (think of
it as a VACUUM for filesystem ;)).
> I've seen HA measures (I don't imagine anyone sacrificing their customers
> copying 3 billion files and a few hundred terabytes while still maintaining an
> adequate service rate with part of its infra-structure down, just to use your
> example to that answer...), ideas about requiring an answer from the
> filesystem before considering the transaction done DB-wise (who grants you
> that the image really went to the disk and is not on cache when the machine
> has a power failure and shuts down abruptly?)...
And who grants you that the WAL log file really went to the disk and is not on
cache when the machine has a power failure and shuts down abruptly?
Use a trustworthy hardware.  You would have to do it anyway, if you wanted
to go with "all-in-DB" approach.
> > Some people have seen this as a disadvantage on this thread, I personally
> > don't see it that why.
>
> I am questioning two points that show two situations where it is bad.
> Specially if those images are important to the records (e.g. product failure
> images, prize winning images, product specs, prototype images, blueprints --
> after all, we don't need to restrict our files to images, right? --,
> agreements, spreadsheets with the last years of company account movements,
> documents received from lawyers, etc.).
I don't think noone is saying that storing images in DB isn't better from
data integrity point of view.  But it has drawbacks, which sometimes make
pepople store their images elsewhere in real life.
Of course if one had infinite budget...  But other than that, if you are storing
lots of data, and you can afford a trade-off between data safety (that you
can loose some data or that your data can be bloated by dead data) and
costs, then you have to consider storing data otherwise.
> > I guess it depends on access needs, many files and how much data you have.
> > What if you had 3 billion files across a few hundred terabytes?  Can you say
> > with experience how the database would hold up in this situation?
>
> I'd have partitioning if I had a case like that.  Part of those would be
> delegated to one machine, part to another and so on.  Even if that solution --
> partitioning -- makes the overall MTBF lower...
And how do you handle data integrity between many machines? The answer
is of 2PC, I guess.  But still, managing integrity between many machines is
also hard -- remember, you don't have RI constraints for remote data.
> And I still can't imagine how you guarantee that all 3 billion files have
> their corresponding entries on the database.  Couting them is not enough since
> I can have one file with the wrong "name" present on the filesystem or some
> duplicate record on the DB...
Depends what you mean by integrity.
For my needs it is sufficient that I can guarantee that every object mentioned
in a database is present on the filesystem.  Orphaned files are not a problem
and can be removed (if there really is need to) by a monthly cron job ("find
old files not mentioned in DB", "find not accessed old files and check if they
are in DB", etc. etc.).
More important still, is that I can make a snapshot of data.  With filesystem
it would be just:
  stop remover utility
  pg_dump  &  tar (or whatever) the files
  start remover utility
With DB, one would have to start pg_dump and wait until it finishes.
And observe
how your tables start to bloat as the data needs to be preserved for
data integrity
reasons.  How your database is more and mroe filled with dead tuples.  And how,
after it finishes VACUUM struggles to clean up the tables.  The more data you,
have, the more frequently it changes, the bigger problem it is.
Filesystem is simply
more efficient at storing data (the non-transactionness, and limited
metadata being
the tradeoff).
I don't say "all in DB" is fundamentally wrong.  It's just that its
niches are "low
bandwidth services", like intranets or rarely visited data, some forms of "very
important data store" services, where one must be absolutely sure about
safety and integrity, and finally, let's call it "academic/hobbyst research" ;-)
files outside of DB are where one must squeeze performance out of hardware
and it comes at a price.  If you can afford the price, you get the prize. ;)
   Regards,
        Dawid
			
		2007/1/6, Maurice Aubrey <maurice.aubrey@gmail.com>: > Clodoaldo wrote: > > But the main factor to push me in the file system direction is the > > HTTP cache management. I want the internet web clients and proxies to > > cache the images. The Apache web server has it ready and easy. If > > the images where to be stored in the DB I would have to handle the > > HTTP cache headers myself. Another code layer. Not too big a deal, > > but if Apache give me it for free... > > There's a hybrid approach which has worked well for us. > > You store the binary data in the database along with a signature. > > On the Apache side, you write a 404 handler that, based on the request, > fetches the binary from the database and writes it locally to the > filesystem based on the signature (using a multi-level hashing scheme > possibly as detailed in previous posts). > > When a request comes in to Apache, if the file exists it is served > directly without any db interaction. OTOH, if it's missing, your 404 > handler kicks in to build it and you get a single trip to the db. > > You get the benefits of keeping the data in the db (transaction > semantics, etc.) but also get the scalability and caching benefits > of having the front-end webservers handle delivery. > > If you lose the locally cached data it's not an issue. They'll be > faulted back into existence on demand. > > With multiple webservers, you can just allow the data to be cached on > each machine, or if there's too much data for that, have your load > balancer divide the requests to different webserver pools based on the > signature. > > As an extension, if you need different versions of the data (like > different sizes of an image, etc.), you can modify your URLs to indicate > the version wanted and have the 404 handler take that into account when > building them. You only store the original content in the database but > could have any number of transformed versions on the webservers. Again, > losing those versions is not an issue and do not require backup. Very interesting approach. And I think it is also original as I have not seen any mention of it. Thanks for sharing it. -- Clodoaldo Pinto Neto
>>>apache has very good page and image caching. You could take advantage >>>of that using this technique. > I wonder why this HTTP cache headers argument didn't surface in this > heated debate. I did other up this argument by the way. Andrew Clodoaldo wrote: > 2007/1/5, Jorge Godoy <jgodoy@gmail.com>: >> Andrew Chernow <pg-job@esilo.com> writes: >> > meet those requirements. It is far more effecient to have apache >> access >> > them >> >> Where weren't we meeting his/her requirements? All the discussion is >> around >> available means to do that. One option is having the files on the >> database, >> the other is on the filesystem. From my understanding we're >> discussing the >> benefits of each one. Aren't we? > > Yes, although I suggested two solutions I asked for anything that > would be considered the best practice. Now I think there is not a best > practice or better, there should be one best practice for each of the > solutions. > > I have done an intranet application that stored images in the > database. It worked perfectly and I used the same engine in another > intranet application to store not only images but any document which > also worked perfectly. The decision to go the dabatase only route was > easy: The filesystem space would have to be negotiated while the space > occupied by the databases were not controlled and used an advanced > storage solution that gave lots of terabytes to be used at will. Also > the any document application should not loose a single document and > access control should be strictly enforced which was much easier to do > with the database since I had no control over the webserver and even > if I had I think the database access is still easier to control than > the filesystem access. That was in a corporate intranet. > > What I'm doing now is an internet application. While the FS x DB > synchronicity is very important in some kinds of document management, > it is not in this application. Indeed if a few images are lost each > day it has no meaning in a 500K to 1M inventory. The offended clients > just upload them again. No one will be sued. The images are all > public. No need to control the access. > > But the main factor to push me in the file system direction is the > HTTP cache management. I want the internet web clients and proxies to > cache the images. The Apache web server has it ready and easy. If the > images where to be stored in the DB I would have to handle the HTTP > cache headers myself. Another code layer. Not too big a deal, but if > Apache give me it for free... > > I wonder why this HTTP cache headers argument didn't surface in this > heated debate. Aren't DB developers/admins aware of the internet > client's bandwidth limitations? Or they just assume the application > would handle the HTTP cache headers? In the applications I created for > intranet bandwidth was almost a non issue and I didn't care to make > them bandwidth efficient, but for the internet the problem is there > and it is big. > > Regards,
Clodoaldo wrote: > But the main factor to push me in the file system direction is the > HTTP cache management. I want the internet web clients and proxies to > cache the images. The Apache web server has it ready and easy. If > the images where to be stored in the DB I would have to handle the > HTTP cache headers myself. Another code layer. Not too big a deal, > but if Apache give me it for free... There's a hybrid approach which has worked well for us. You store the binary data in the database along with a signature. On the Apache side, you write a 404 handler that, based on the request, fetches the binary from the database and writes it locally to the filesystem based on the signature (using a multi-level hashing scheme possibly as detailed in previous posts). When a request comes in to Apache, if the file exists it is served directly without any db interaction. OTOH, if it's missing, your 404 handler kicks in to build it and you get a single trip to the db. You get the benefits of keeping the data in the db (transaction semantics, etc.) but also get the scalability and caching benefits of having the front-end webservers handle delivery. If you lose the locally cached data it's not an issue. They'll be faulted back into existence on demand. With multiple webservers, you can just allow the data to be cached on each machine, or if there's too much data for that, have your load balancer divide the requests to different webserver pools based on the signature. As an extension, if you need different versions of the data (like different sizes of an image, etc.), you can modify your URLs to indicate the version wanted and have the 404 handler take that into account when building them. You only store the original content in the database but could have any number of transformed versions on the webservers. Again, losing those versions is not an issue and do not require backup. Maurice
Clodoaldo wrote: > 5 Jan 2007 06:59:18 -0800, imageguy <imageguy1206@gmail.com>: > > > > I think I know the answer, > > If you know the answer please tell it as I have read some discussions > on the web and although I have decided on a solution I'm still not > sure about the best answer, if there is a best answer after all. Sorry, didn't mean to sound like and expert on this, I am actually quite a newbie. From all of the discussions I have read and even the ones in this thread, including your own comments below, it would seem that to store the files in the files system you need some sort of application erver or :middleware - like a webserver - to handle the retreiving and serving of the files. My organization is developing a commercial application for "document tracking". It is not a Browser application, but rather a more traditional "windows" thick client app. At the present time we do not intend to deploy any sort of "application server" - web server, ftp server, and not all of the workstations will have access to a consistent network share. So in this case, it is my understanding that our only real choice is to store the documents and images in the database itself. ... unless someone knows of a postgresql function that would allow us to "server" the file from the filesystem via the dbserver ?? > > > but if you don't have an "application > > server" - ie a webserver, etc, > > Yes I have an application server, the Apache server. > > > and many of the workstations/clients > > that need access to the images but may not have access to a network > > share, > > network share? I don't understand. The images will be loaded by html > pages with the img tag like in <img > src="http://domain.com/images/xxx.jpg"> > > > isn't the database the only choice ? > > No. It is one of the choices. The other is to store the images in the > file system, in a directory readable by Apache. See above. WE are trying to reduce the dependancies on other applications to ensure a simple deployment of the application.
If you end up storing the pictures on a separate database server for performance reasons, Jorge's argument regarding integrity becomes moot...And his argument so far is the strongest case I've seen for storing the files in the database. Scott Marlowe wrote: >On Fri, 2007-01-05 at 15:54, Jeremy Haile wrote: > > >>Yeah - it can make it easier to implement transactional semantics by >>storing them in the database, although for simple operations it wouldn't >>be hard to replicate this manually. And you are going to incur a >>performance penalty by storing them in the database. >> >>Another thing to consider is that storing them in the file system makes >>it much easier to browse the images using third-party tools, update >>them, archive them (by gzipping or whatever). This is much more >>difficult if they are stored in the database. >> >> > >The biggest performance penalty from them being in the same database as >your data is that they're going to be the majority of your kernel / >database buffers. So, as mentioned earlier, it's almost a necessity >(for performance reasons) to put them into their own db server. > >---------------------------(end of broadcast)--------------------------- >TIP 5: don't forget to increase your free space map settings > > ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings
Clodoaldo wrote: > But the main factor to push me in the file system direction is the > HTTP cache management. I want the internet web clients and proxies to > cache the images. The Apache web server has it ready and easy. If > the images where to be stored in the DB I would have to handle the > HTTP cache headers myself. Another code layer. Not too big a deal, > but if Apache give me it for free... There's a hybrid approach which has worked well for us. You store the binary data in the database along with a signature. On the Apache side, you write a 404 handler that, based on the request, fetches the binary from the database and writes it locally to the filesystem based on the signature (using a multi-level hashing scheme possibly as detailed in previous posts). When a request comes in to Apache, if the file exists it is served directly without any db interaction. OTOH, if it's missing, your 404 handler kicks in to build it and you get a single trip to the db. You get the benefits of keeping the data in the db (transaction semantics, etc.) but also get the scalability and caching benefits of having the front-end webservers handle delivery. If you lose the locally cached data it's not an issue. They'll be faulted back into existence on demand. With multiple webservers, you can just allow the data to be cached on each machine, or if there's too much data for that, have your load balancer divide the requests to different webserver pools based on the signature. As an extension, if you need different versions of the data (like different sizes of an image, etc.), you can modify your URLs to indicate the version wanted and have the 404 handler take that into account when building them. You only store the original content in the database but could have any number of transformed versions on the webservers. Again, losing those versions is not an issue and do not require backup. Maurice
> My point is: if I need to be 100% sure that what is referenced on the database > is accessible all the time when the reference is, then I need to have this on > the database Not necessarily. It does take carefully controlling access, with a good deal of thought and error-checking on the part of the code that has write access to the files, but it can certainly be done. > But if there are other people touching > things -- moving / creating / deleting / changing files and directories -- > then things get more complicated to manage. Absolutely. But allowing the kinds of tools mentioned earlier for examining files does not require giving anyone write access ;-) -- Scott Ribe scott_ribe@killerbytes.com http://www.killerbytes.com/ (303) 722-0567 voice