Thread: Problem with the numbers I reported yesterday
I ran my performance tests a few more times, and it seems the numbers are not
really comparable. When I run PostgreSQL without -F, I get a sync after every
insert. With -F, I get no sync at all, since all the inserts fit well into the
buffer. Oracle, in comparison, does sync: just from listening to the disk
access, it seems to sync every two or three seconds.

Does anyone know a way to really check both DBMSs?

Michael

-- 
Dr. Michael Meskes, Project-Manager    | topsystem Systemhaus GmbH
meskes@topsystem.de                    | Europark A2, Adenauerstr. 20
meskes@debian.org                      | 52146 Wuerselen
Go SF49ers! Go Rhein Fire!             | Tel: (+49) 2405/4670-44
Use Debian GNU/Linux!                  | Fax: (+49) 2405/4670-10
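One way to see how often each server really forces data to disk is to
interpose on fsync() itself. The sketch below is purely illustrative and not
from this thread: the file name, the build command, and the idea of preloading
it under the server binary are assumptions, and it only catches explicit
fsync() calls, so a DBMS writing to a raw device or through O_SYNC file
descriptors would not show up.

/*
 * fsync_count.c -- hypothetical LD_PRELOAD shim that counts fsync() calls.
 *
 * Assumed build and use on Linux/glibc:
 *     gcc -shared -fPIC -o fsync_count.so fsync_count.c -ldl
 *     LD_PRELOAD=./fsync_count.so <server binary> ...
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

static long fsync_calls = 0;

int fsync(int fd)
{
    static int (*real_fsync)(int) = NULL;

    /* Look up the real fsync() the first time through. */
    if (real_fsync == NULL)
        real_fsync = (int (*)(int)) dlsym(RTLD_NEXT, "fsync");

    fsync_calls++;
    fprintf(stderr, "fsync #%ld on fd %d\n", fsync_calls, fd);

    return real_fsync(fd);
}

Running a backend under such a shim (or simply under strace) makes the
difference between -F and the default mode visible as a plain count of fsync()
calls per insert, which is easier to compare across systems than listening to
the disk.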
> I ran my performance tests a few more times, and it seems the numbers are
> not really comparable. When I run PostgreSQL without -F, I get a sync after
> every insert. With -F, I get no sync at all, since all the inserts fit well
> into the buffer. Oracle, in comparison, does sync: just from listening to
> the disk access, it seems to sync every two or three seconds.
>
> Does anyone know a way to really check both DBMSs?

Many DBMSs do buffered logging, that is, they sync after the buffer gets full
or after a minute or so. We have the logic to add buffered logging to
PostgreSQL and will be doing it later. Right now we have only non-buffered
logging, and no logging.

-- 
Bruce Momjian
maillist@candle.pha.pa.us
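To make "buffered logging" concrete: log records accumulate in a small
in-memory buffer, and only the log file is fsync'ed, either when the buffer
fills or when a timer expires. The following sketch uses entirely hypothetical
names and sizes and is not the PostgreSQL code being referred to; it assumes
each record fits in the buffer and ignores error handling.

#include <string.h>
#include <time.h>
#include <unistd.h>

#define LOG_BUF_SIZE   8192
#define FLUSH_INTERVAL 60          /* seconds -- "after a minute or so" */

static char   log_buf[LOG_BUF_SIZE];
static size_t log_used = 0;
static time_t last_flush = 0;

static void flush_log(int log_fd)
{
    if (log_used > 0)
    {
        (void) write(log_fd, log_buf, log_used);
        fsync(log_fd);             /* the only sync the log needs */
        log_used = 0;
    }
    last_flush = time(NULL);
}

/* Append one record; flush first if the buffer is full or the timer expired. */
void append_log_record(int log_fd, const void *rec, size_t len)
{
    if (log_used + len > LOG_BUF_SIZE ||
        time(NULL) - last_flush >= FLUSH_INTERVAL)
        flush_log(log_fd);

    memcpy(log_buf + log_used, rec, len);
    log_used += len;
}

The trade-off is that a crash can lose whatever is still sitting in the
unflushed buffer, which is why the flush interval is kept to seconds or a
minute rather than hours.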
Hi,

perhaps a stupid question:

What will happen if you start PostgreSQL with -F and write a script which is
a loop that fsyncs every 2 seconds? Does this lead to a database which is
"almost" correct? Does this seem like a good compromise?

> > I ran my performance tests a few more times, and it seems the numbers are
> > not really comparable. When I run PostgreSQL without -F, I get a sync
> > after every insert. With -F, I get no sync at all, since all the inserts
> > fit well into the buffer. Oracle, in comparison, does sync: just from
> > listening to the disk access, it seems to sync every two or three seconds.
> >
> > Does anyone know a way to really check both DBMSs?
>
> Many DBMSs do buffered logging, that is, they sync after the buffer gets
> full or after a minute or so. We have the logic to add buffered logging to
> PostgreSQL and will be doing it later. Right now we have only non-buffered
> logging, and no logging.
>
> --
> Bruce Momjian
> maillist@candle.pha.pa.us

Ciao

Das Boersenspielteam.

---------------------------------------------------------------------------
                        http://www.boersenspiel.de

        The stock-market game on the Internet
        *Realistic*  *Free*  *Over 6000 players*
---------------------------------------------------------------------------
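For what it is worth, the proposed script could be as small as the sketch
below; the two-second interval is the one from the question, and it has to use
sync() rather than fsync() because a separate process holds no descriptors for
the backend's files.

#include <unistd.h>

int main(void)
{
    for (;;)
    {
        sync();     /* schedule all dirty OS buffers for writing */
        sleep(2);   /* then wait two seconds and do it again     */
    }
}

As the next message points out, though, sync() gives no ordering guarantee
between the data pages and pg_log, so this narrows the window of inconsistency
without closing it.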
> Hi,
>
> perhaps a stupid question:
>
> What will happen if you start PostgreSQL with -F and write a script which
> is a loop that fsyncs every 2 seconds? Does this lead to a database which
> is "almost" correct? Does this seem like a good compromise?

Well, you really have to sync the data pages BEFORE syncing pg_log. Our
buffered-logging idea for post-6.3 will do exactly that. I don't think the
script is a workaround: you could get pg_log to disk saying a transaction is
complete without getting all the data pages to disk if the crash happened
during the sync.

-- 
Bruce Momjian
maillist@candle.pha.pa.us
Bruce Momjian wrote:
>
> > Hi,
> >
> > perhaps a stupid question:
> >
> > What will happen if you start PostgreSQL with -F and write a script
> > which is a loop that fsyncs every 2 seconds? Does this lead to a
> > database which is "almost" correct? Does this seem like a good
> > compromise?
>
> Well, you really have to sync the data pages BEFORE syncing pg_log.

Why should this be necessary? If the transaction is considered committed once
the log has been written, and the system crashes before the data are written,
then Postgres can look at the data and the log when it is next started up and
apply all the transactions that were logged but never made it to the data
pages.

Am I missing something? It seems to me that if you sync the data pages first,
then what good is the log? (Other than being able to retrace your steps.)

Ocie Mitchell
> Why should this be necessary? If the transaction is considered committed
> once the log has been written, and the system crashes before the data are
> written, then Postgres can look at the data and the log when it is next
> started up and apply all the transactions that were logged but never made
> it to the data pages.

No, on restart you can't identify the old/new data. Remember, pg_log is just
the transaction id and a flag. The superseded and new rows are mixed together
on the data pages, with transaction ids as markers.

> Am I missing something? It seems to me that if you sync the data pages
> first, then what good is the log? (Other than being able to retrace your
> steps.)

Again, the log is just a list of transaction ids and their statuses.

-- 
Bruce Momjian
maillist@candle.pha.pa.us
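In other words, because pg_log records only "transaction N committed" or
"aborted", a safe commit path has to force the data pages out before the
status flag. The helper for updating pg_log below is hypothetical (it is not
an actual PostgreSQL function); the ordering is the point.

#include <unistd.h>

/* Hypothetical helper: set the status bits for xid in pg_log. */
void mark_committed_in_pg_log(int pg_log_fd, long xid);

void commit_transaction(int *data_fds, int ndata, int pg_log_fd, long xid)
{
    /* 1. Force every data file touched by this transaction to disk. */
    for (int i = 0; i < ndata; i++)
        fsync(data_fds[i]);

    /* 2. Only now mark the transaction committed in pg_log ...      */
    mark_committed_in_pg_log(pg_log_fd, xid);

    /* 3. ... and force that status flag to disk as well.            */
    fsync(pg_log_fd);
}

If the crash happens before step 3, the transaction simply looks uncommitted
and its rows are ignored, which is the safe outcome; doing the steps in the
opposite order opens exactly the window Bruce describes.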
>>>>> "ocie" == ocie <ocie@paracel.com> writes:

> Bruce Momjian wrote:
>> Well, you really have to sync the data pages BEFORE syncing pg_log.

> Why should this be necessary? If the transaction is considered committed
> once the log has been written, and the system crashes before the data are
> written, then Postgres can look at the data and the log when it is next
> started up and apply all the transactions that were logged but never made
> it to the data pages.

> Am I missing something? It seems to me that if you sync the data pages
> first, then what good is the log? (Other than being able to retrace your
> steps.)

I do not think that pg_log is used like a normal 'log' device in other
databases. My quick look at the code suggests that pg_log holds only a list
of transactions, not the actual data blocks. Notice that TransRecover is
commented out in backend/access/transam/transam.c.

Most database logs contain the before and after images of any page that has
been modified in a transaction, followed by a commit/abort record. This
allows only this one file to have to be synced; the rest of the database can
float (checkpoints are generally done every so often to reduce recovery
time). The method of recovering from a crash is to replay the log from the
last checkpoint to the end, applying the before/after images (as needed,
depending on whether the transaction committed) to the actual database
relations.

I would appreciate anyone correcting any mistakes in my understanding of how
postgres works.

> Ocie Mitchell

Kent S. Gordon
Architect
iNetSpace Co.
voice: (972)851-3494  fax: (972)702-0384  e-mail: kgor@inetspace.com
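For contrast, here is a rough sketch of the checkpoint-and-replay scheme
described above. It is not how PostgreSQL worked at the time, and every type
and helper in it is illustrative only.

#include <stdbool.h>

enum RecKind { REC_PAGE_IMAGE, REC_COMMIT, REC_ABORT };

typedef struct LogRecord
{
    long         xid;           /* transaction id                     */
    enum RecKind kind;
    long         page_no;       /* which relation page was modified   */
    char         before[8192];  /* page contents before the change    */
    char         after[8192];   /* page contents after the change     */
} LogRecord;

/* Assumed helpers: read records starting at the last checkpoint, look up
 * whether a transaction's commit record was seen, and rewrite a page.  */
bool read_next_record(LogRecord *rec);
bool transaction_committed(long xid);
void write_page(long page_no, const char *image);

void recover(void)
{
    LogRecord rec;

    while (read_next_record(&rec))
    {
        if (rec.kind != REC_PAGE_IMAGE)
            continue;                             /* commit/abort records */

        if (transaction_committed(rec.xid))
            write_page(rec.page_no, rec.after);   /* redo the change      */
        else
            write_page(rec.page_no, rec.before);  /* undo the change      */
    }
}

Because both images are in the log, only the log file needs to be synced at
commit time; the data files themselves can be written back lazily and repaired
from the log after a crash.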
This is 100% correct. See my other posting describing the issues.

> I do not think that pg_log is used like a normal 'log' device in other
> databases. My quick look at the code suggests that pg_log holds only a list
> of transactions, not the actual data blocks. Notice that TransRecover is
> commented out in backend/access/transam/transam.c.
>
> Most database logs contain the before and after images of any page that has
> been modified in a transaction, followed by a commit/abort record. This
> allows only this one file to have to be synced; the rest of the database
> can float (checkpoints are generally done every so often to reduce recovery
> time). The method of recovering from a crash is to replay the log from the
> last checkpoint to the end, applying the before/after images (as needed,
> depending on whether the transaction committed) to the actual database
> relations.
>
> I would appreciate anyone correcting any mistakes in my understanding of
> how postgres works.
>
> Kent S. Gordon
> Architect
> iNetSpace Co.
> voice: (972)851-3494  fax: (972)702-0384  e-mail: kgor@inetspace.com

-- 
Bruce Momjian
maillist@candle.pha.pa.us
DB logging (was: Problem with the numbers I reported yesterday)

From: jwieck@debis.com (Jan Wieck)
Kent wrote:
>
> I do not think that pg_log is used like a normal 'log' device in other
> databases. My quick look at the code suggests that pg_log holds only a list
> of transactions, not the actual data blocks. Notice that TransRecover is
> commented out in backend/access/transam/transam.c.
>
> Most database logs contain the before and after images of any page that has
> been modified in a transaction, followed by a commit/abort record. This
> allows only this one file to have to be synced; the rest of the database
> can float (checkpoints are generally done every so often to reduce recovery
> time). The method of recovering from a crash is to replay the log from the
> last checkpoint to the end, applying the before/after images (as needed,
> depending on whether the transaction committed) to the actual database
> relations.
>
> I would appreciate anyone correcting any mistakes in my understanding of
> how postgres works.

Totally right, PostgreSQL doesn't have a log mechanism that collects all the
information needed to recover a corrupted database from a backup.

I hacked around on that a little bit. When doing complete after-image
logging, that is, logging all the tuples that are stored on insert/update,
the tuple ids of deletes, plus the information about which transaction ids
commit, the regression tests produce log data that is larger than the final
regression database. The performance increase when syncing only the log and
control files (two control files on different devices, and the log file on a
different device from the database files) and running the backends with -F
is about 15-20% for the regression test.

I thought this was far too much logging data, so I didn't spend much time
trying to implement a recovery. But as far as I got, I can tell that the
updates to system catalogs and keeping the indices up to date will be really
tricky.

Another possible log mechanism I'll try sometime after the v6.3 release is
to log the queries and the data from COPY commands, along with information
about Oid and Tid allocations.

Until later, Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#======================================== jwieck@debis.com (Jan Wieck) #
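Reconstructed from the description above (Jan's code is not available here,
so the layout is a guess), an after-image log record might look roughly like
this:

typedef enum
{
    LOG_INSERT,     /* record carries the complete new tuple      */
    LOG_UPDATE,     /* record carries the complete new tuple      */
    LOG_DELETE,     /* record carries only the tuple id           */
    LOG_COMMIT      /* this transaction id committed              */
} LogRecType;

typedef struct AfterImageRecord
{
    LogRecType    type;
    long          xid;            /* transaction id                   */
    long          relation_oid;   /* which table                      */
    long          tuple_id;       /* tid of the affected tuple        */
    unsigned long tuple_len;      /* 0 for LOG_DELETE and LOG_COMMIT  */
    /* tuple_len bytes of tuple data follow this fixed header */
} AfterImageRecord;

Every inserted or updated tuple is effectively written twice, once to the
heap and once to the log, which is consistent with the observation that the
regression-test log grew larger than the resulting database.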
>>>>> "jwieck" == Jan Wieck <jwieck@debis.com> writes:

> Kent wrote:

[description of the current logging method deleted]

> Totally right, PostgreSQL doesn't have a log mechanism that collects all
> the information needed to recover a corrupted database from a backup.
>
> I hacked around on that a little bit. When doing complete after-image
> logging, that is, logging all the tuples that are stored on insert/update,
> the tuple ids of deletes, plus the information about which transaction ids
> commit, the regression tests produce log data that is larger than the final
> regression database. The performance increase when syncing only the log and
> control files (two control files on different devices, and the log file on
> a different device from the database files) and running the backends with
> -F is about 15-20% for the regression test.

Log files do get very big with image logging. I would not expect a huge win
in performance unless the log device is a raw device. On a cooked device (a
file system) the buffer-cache effects are very large: all disk data is being
buffered both by PostgreSQL and by the OS buffer cache. The buffer cache is
actually harmful in this case, since the data is not reused and the writes
are synced; the writes to the log also flush other buffers from the cache,
leading to even more I/O. If a system does not have raw devices (Linux, NT),
it would be very useful if a flag existed to tell the OS that the file will
be read sequentially, like the madvise() call for mmap. Is your code
available anywhere?

> I thought this was far too much logging data, so I didn't spend much time
> trying to implement a recovery. But as far as I got, I can tell that the
> updates to system catalogs and keeping the indices up to date will be
> really tricky.

I have not looked at this area of the code. Do the system catalogs have a
separate storage manager? I do not see why they could not be handled like
any other data, except for keeping the buffer in the cache.

Kent S. Gordon
Architect
iNetSpace Co.
voice: (972)851-3494  fax: (972)702-0384  e-mail: kgor@inetspace.com
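The madvise() hint mentioned above can be expressed like this for an mmap'ed
log file; the function name and the choice to map the whole file are
assumptions made for the example (madvise() applies only to mappings, not to
plain read() loops).

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a log file read-only and tell the kernel we will scan it once,
 * front to back, so it can read ahead and drop pages behind us.       */
void *map_log_sequential(const char *path, size_t *len_out)
{
    struct stat st;
    void       *p;
    int         fd = open(path, O_RDONLY);

    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) < 0)
    {
        close(fd);
        return NULL;
    }

    p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping stays valid */
    if (p == MAP_FAILED)
        return NULL;

    madvise(p, st.st_size, MADV_SEQUENTIAL);

    *len_out = st.st_size;
    return p;
}

On systems without such a hint, the fallback is simply to keep the log on its
own spindle, as described above, so its sequential writes do not compete with
the random data-page traffic.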
Kent S. Gordon wrote:

[SNIP]

> Log files do get very big with image logging. I would not expect a huge win
> in performance unless the log device is a raw device. On a cooked device (a
> file system) the buffer-cache effects are very large: all disk data is
> being buffered both by PostgreSQL and by the OS buffer cache. The buffer
> cache is actually harmful in this case, since the data is not reused and
> the writes are synced; the writes to the log also flush other buffers from
> the cache, leading to even more I/O. If a system does not have raw devices
> (Linux, NT), it would be [...]

What exactly do you mean by "raw devices" that are not offered by Linux? If I
take a partition of one of my hard drives and don't make a filesystem on it,
I can perform reads and writes on the "raw device" /dev/hd?? or /dev/sd??. I
didn't think these writes were buffered (if that's what you're referring to),
but I could be wrong.

Ocie Mitchell.
> What exactly do you mean by "raw devices" that are not offered by Linux? If
> I take a partition of one of my hard drives and don't make a filesystem on
> it, I can perform reads and writes on the "raw device" /dev/hd?? or
> /dev/sd??. I didn't think these writes were buffered (if that's what you're
> referring to), but I could be wrong.

Your /dev/hd* goes through the buffer cache; the raw versions, /dev/rhd*, do
not.

-- 
Bruce Momjian
maillist@candle.pha.pa.us
Bruce Momjian wrote:
>
> > What exactly do you mean by "raw devices" that are not offered by Linux?
> > If I take a partition of one of my hard drives and don't make a
> > filesystem on it, I can perform reads and writes on the "raw device"
> > /dev/hd?? or /dev/sd??. I didn't think these writes were buffered (if
> > that's what you're referring to), but I could be wrong.
>
> Your /dev/hd* goes through the buffer cache; the raw versions, /dev/rhd*,
> do not.

Actually, on Linux there is no raw/cooked drive distinction as in Solaris. In
Solaris, the /dev/dsk/ devices are buffered by the OS, while their
counterparts in /dev/rdsk are not. Linux has only the one interface to the
partition, which is raw; code that uses these raw devices (the filesystem
code) must supply its own buffering.

Anyway, I don't want to prolong this tangential topic. Linux does provide raw
devices, but it does not (as Solaris does) also provide buffered or cooked
access to disks.

Ocie Mitchell
Re: [HACKERS] Re: DB logging (was: Problem with the numbers I reported yesterday)

From: jwieck@debis.com (Jan Wieck)
Kent wrote:
>
> >>>>> "jwieck" == Jan Wieck <jwieck@debis.com> writes:
>
> > When doing complete after-image logging, that is, logging all the tuples
> > that are stored on insert/update, the tuple ids of deletes, plus the
> > information about which transaction ids commit, the regression tests
> > produce log data that is larger than the final regression database. The
> > performance increase when syncing only the log and control files (two
> > control files on different devices, and the log file on a different
> > device from the database files) and running the backends with -F is
> > about 15-20% for the regression test.
>
> Log files do get very big with image logging. I would not expect a huge win
> in performance unless the log device is a raw device. On a cooked device (a
> file system) the buffer-cache effects are very large: all disk data is
> being buffered both by PostgreSQL and by the OS buffer cache. The buffer
> cache is actually harmful in this case, since the data is not reused and
> the writes are synced; the writes to the log also flush other buffers from
> the cache, leading to even more I/O. If a system does not have raw devices
> (Linux, NT), it would be very useful if a flag existed to tell the OS that
> the file will be read sequentially, like the madvise() call for mmap. Is
> your code available anywhere?

I don't have that code any more. It wasn't that much, so I can redo it if
you would like to help on that topic. But since this will be a new feature,
we should wait for the 6.3 release before touching anything.

> > I thought this was far too much logging data, so I didn't spend much time
> > trying to implement a recovery. But as far as I got, I can tell that the
> > updates to system catalogs and keeping the indices up to date will be
> > really tricky.
>
> I have not looked at this area of the code. Do the system catalogs have a
> separate storage manager? I do not see why they could not be handled like
> any other data, except for keeping the buffer in the cache.

I just had some problems with the system catalogs (maybe due to the system
caching). I think it can be handled somehow.

There are other details of the logging we should take care of when we
implement it. The logging should be configurable per database, so some
databases have logging enabled while others are unprotected. It must be able
to do point-in-time recovery (restore the database from a backup and recover
up to an absolute time or transaction ID). Those two requirements create a
problem for shared system relations: if a backend running on an unlogged
database updates pg_user, for example, that change must still go into the
log!

We should give query logging, instead of image logging, a try.

Until later, Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#======================================== jwieck@debis.com (Jan Wieck) #