Discussion: Create and drop temp table in 8.3.4
We recently upgraded the databases for our circuit court applications from PostgreSQL 8.2.5 to 8.3.4. The application software didn't change. Most software runs fine, and our benchmarks prior to the update tended to show a measurable, if not dramatic, performance improvement overall.

We have found one area where jobs are running much longer and having a greater impact on concurrent jobs -- those where the programmer creates and drops many temporary tables (thousands) within a database transaction. We are looking to mitigate the problem while we look into rewriting queries where the temporary table usage isn't really needed (probably most of them, but the rewrites are not trivial).

I'm trying to quantify the issue, and would appreciate any suggestions, either for mitigation or for collecting useful data to find the cause of the performance regression. I created a script which brackets 1000 pairs of lines like the following within a single begin/commit:

create temporary table tt (c1 int not null primary key, c2 text, c3 text);
drop table tt;

I run this repeatedly, to get a "steady state", with just enough settling time between runs to spot the boundaries of the runs in the vmstat 1 output (5 to 20 seconds between runs). I'm surprised at how much disk output there is for this, in either version. In 8.2.5 a typical run has about 156,000 disk writes in the vmstat output, while 8.3.4 has about 172,000 writes. During the main part of the run, 8.2.5 ranges between 0 and 15 percent of cpu time in I/O wait, averaging around 10%; while 8.3.4 ranges between 15 and 25 percent of cpu time in I/O wait, averaging around 18%, with occasional outliers on both sides, down to 5% and up to 55%.
For both, there's a period of time at the end of the transaction where the COMMIT seems to be doing disk output without any disk wait, suggesting that the BBU RAID controller is either able to write these faster because there are multiple updates to the same sectors which get combined, or that they can be written sequentially.

The time required for psql to run the script varies little in 8.2.5 -- from 4m43.843s to 4m49.388s. Under 8.3.4 this bounces around from run to run -- from 1m28.270s to 5m39.327s.

I can't help wondering why creating and dropping a temporary table requires over 150 disk writes. I also wonder whether there is something in 8.3.4 which directly causes more writes, or whether it is the result of the new checkpoint and background writer hitting some pessimal usage pattern where "just in time" writes become "just too late" to be efficient.

Most concerning is that the 8.3.4 I/O wait time results in slow performance for interactive tasks and results in frustrated users calling the support line complaining of slowness. I can confirm that queries which normally run in 10 to 20 ms are running for several seconds in competition with the temporary table creation/drop queries, which wasn't the case before.

I'm going to get the latest snapshot to see if the issue has changed for 8.4devel, but I figured I should post the information I have so far to get feedback.

-Kevin
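The benchmark described above can be reproduced with a small shell sketch. The output path, database name, and psql invocation below are placeholders, not details from the original post:

```shell
# Generate a script with 1000 create/drop pairs inside one transaction,
# matching the test case described above.
OUT=/tmp/temp_table_bench.sql
echo "begin;" > "$OUT"
for i in $(seq 1 1000); do
  echo "create temporary table tt (c1 int not null primary key, c2 text, c3 text);" >> "$OUT"
  echo "drop table tt;" >> "$OUT"
done
echo "commit;" >> "$OUT"

# Run it while watching "vmstat 1" in another terminal, e.g.:
#   time psql -f "$OUT" mydb
```

Because every iteration creates and drops the same table name, the script exercises only catalog and file churn, not data volume.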
>>> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:
> We have found one area where jobs are running much longer and having a
> greater impact on concurrent jobs -- those where the programmer creates
> and drops many temporary tables (thousands) within a database
> transaction.

I forgot to include the standard information about the environment and configuration.

ccsa@COUNTY2-PG:~> cat /proc/version
Linux version 2.6.16.60-0.31-smp (geeko@buildhost) (gcc version 4.1.2 20070115 (SUSE Linux)) #1 SMP Tue Oct 7 16:16:29 UTC 2008
ccsa@COUNTY2-PG:~> cat /etc/SuSE-release
SUSE Linux Enterprise Server 10 (x86_64)
VERSION = 10
PATCHLEVEL = 2
ccsa@COUNTY2-PG:~> uname -a
Linux COUNTY2-PG 2.6.16.60-0.31-smp #1 SMP Tue Oct 7 16:16:29 UTC 2008 x86_64 x86_64 x86_64 GNU/Linux

Two dual-core Xeon 3 GHz processors. 4 GB system RAM. BBU RAID controller with 256 MB RAM. RAID 5 on 5 spindles.

8.2.5:

ccsa@COUNTY2-PG:~> /usr/local/pgsql-8.2.5-64/bin/pg_config
BINDIR = /usr/local/pgsql-8.2.5-64/bin
DOCDIR = /usr/local/pgsql-8.2.5-64/doc
INCLUDEDIR = /usr/local/pgsql-8.2.5-64/include
PKGINCLUDEDIR = /usr/local/pgsql-8.2.5-64/include
INCLUDEDIR-SERVER = /usr/local/pgsql-8.2.5-64/include/server
LIBDIR = /usr/local/pgsql-8.2.5-64/lib
PKGLIBDIR = /usr/local/pgsql-8.2.5-64/lib
LOCALEDIR =
MANDIR = /usr/local/pgsql-8.2.5-64/man
SHAREDIR = /usr/local/pgsql-8.2.5-64/share
SYSCONFDIR = /usr/local/pgsql-8.2.5-64/etc
PGXS = /usr/local/pgsql-8.2.5-64/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--prefix=/usr/local/pgsql-8.2.5-64' '--enable-integer-datetimes' '--enable-debug' '--disable-nls'
CC = gcc
CPPFLAGS = -D_GNU_SOURCE
CFLAGS = -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Winline -Wdeclaration-after-statement -Wendif-labels -fno-strict-aliasing -g
CFLAGS_SL = -fpic
LDFLAGS = -Wl,-rpath,'/usr/local/pgsql-8.2.5-64/lib'
LDFLAGS_SL =
LIBS = -lpgport -lz -lreadline -lcrypt -ldl -lm
VERSION = PostgreSQL 8.2.5

max_connections = 50
shared_buffers = 256MB
temp_buffers = 10MB
max_prepared_transactions = 0
work_mem = 16MB
maintenance_work_mem = 400MB
max_fsm_pages = 1000000
bgwriter_lru_percent = 20.0
bgwriter_lru_maxpages = 200
bgwriter_all_percent = 10.0
bgwriter_all_maxpages = 600
wal_buffers = 256kB
checkpoint_segments = 50
archive_command = '/bin/true'
archive_timeout = 3600
seq_page_cost = 0.1
random_page_cost = 0.1
effective_cache_size = 3GB
geqo = off
from_collapse_limit = 20
join_collapse_limit = 20
redirect_stderr = on
log_line_prefix = '[%m] %p %q<%u %d %r> '
autovacuum_naptime = 1min
autovacuum_vacuum_threshold = 10
autovacuum_analyze_threshold = 10
datestyle = 'iso, mdy'
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'
escape_string_warning = off
sql_inheritance = off
standard_conforming_strings = on

8.3.4:

ccsa@COUNTY2-PG:~> /usr/local/pgsql-8.3.4-64/bin/pg_config
BINDIR = /usr/local/pgsql-8.3.4-64/bin
DOCDIR = /usr/local/pgsql-8.3.4-64/doc
INCLUDEDIR = /usr/local/pgsql-8.3.4-64/include
PKGINCLUDEDIR = /usr/local/pgsql-8.3.4-64/include
INCLUDEDIR-SERVER = /usr/local/pgsql-8.3.4-64/include/server
LIBDIR = /usr/local/pgsql-8.3.4-64/lib
PKGLIBDIR = /usr/local/pgsql-8.3.4-64/lib
LOCALEDIR =
MANDIR = /usr/local/pgsql-8.3.4-64/man
SHAREDIR = /usr/local/pgsql-8.3.4-64/share
SYSCONFDIR = /usr/local/pgsql-8.3.4-64/etc
PGXS = /usr/local/pgsql-8.3.4-64/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--prefix=/usr/local/pgsql-8.3.4-64' '--enable-integer-datetimes' '--enable-debug' '--disable-nls' '--with-libxml'
CC = gcc
CPPFLAGS = -D_GNU_SOURCE -I/usr/include/libxml2
CFLAGS = -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Winline -Wdeclaration-after-statement -Wendif-labels -fno-strict-aliasing -fwrapv -g
CFLAGS_SL = -fpic
LDFLAGS = -Wl,-rpath,'/usr/local/pgsql-8.3.4-64/lib'
LDFLAGS_SL =
LIBS = -lpgport -lxml2 -lz -lreadline -lcrypt -ldl -lm
VERSION = PostgreSQL 8.3.4

max_connections = 50
shared_buffers = 256MB
temp_buffers = 10MB
max_prepared_transactions = 0
work_mem = 16MB
maintenance_work_mem = 400MB
max_fsm_pages = 1000000
bgwriter_lru_maxpages = 1000
bgwriter_lru_multiplier = 4.0
wal_buffers = 256kB
checkpoint_segments = 50
archive_mode = on
archive_command = '/bin/true'
archive_timeout = 3600
seq_page_cost = 0.1
random_page_cost = 0.1
effective_cache_size = 3GB
geqo = off
from_collapse_limit = 20
join_collapse_limit = 20
logging_collector = on
log_checkpoints = on
log_connections = on
log_disconnections = on
log_line_prefix = '[%m] %p %q<%u %d %r> '
autovacuum_naptime = 1min
autovacuum_vacuum_threshold = 10
autovacuum_analyze_threshold = 10
datestyle = 'iso, mdy'
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'
default_text_search_config = 'pg_catalog.english'
escape_string_warning = off
sql_inheritance = off
standard_conforming_strings = on
>>> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:
> I'm going to get the latest snapshot to see if the issue has changed
> for 8.4devel

In testing under today's snapshot, it seemed to take 150,000 writes to create and drop 1,000 temporary tables within a database transaction. The numbers for the various versions might be within the sampling noise, since the testing involved manual steps and required saturating the queues in PostgreSQL, the OS, and the RAID controller to get meaningful numbers.

It seems like the complaints of slowness result primarily from these writes saturating the bandwidth when a query generates a temporary table in a loop, with the increased impact in later releases resulting from it getting through the loop faster. I've started a thread on the hackers' list to discuss a possible PostgreSQL enhancement to help such workloads. In the meantime, I think I know which knobs to try turning to mitigate the issue, and I'll suggest rewrites to some of these queries, to avoid the temporary tables. If I find a particular tweak to the background writer or some such is particularly beneficial, I'll post again.

-Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> I'm trying to quantify the issue, and would appreciate any
> suggestions, either for mitigation or collecting useful data to find
> the cause of the performance regression. I create a script which
> brackets 1000 lines like the following within a single begin/commit:
> create temporary table tt (c1 int not null primary key, c2 text, c3
> text); drop table tt;

I poked at this a little bit. The test case is stressing the system more than might be apparent: there's an index on c1 because of the PRIMARY KEY, and the text columns force a toast table to be created, which has its own index. So that means four separate filesystem files get created for each iteration, and then dropped at the end of the transaction. (The different behavior you notice at COMMIT must be the cleanup phase where the unlink()s get issued.) Even though nothing ever gets put in the indexes, their metapages get created immediately, so we also allocate and write 8K per index.

So there are three cost components:

1. Filesystem overhead to create and eventually delete all those thousands of files.

2. Write traffic for the index metapages.

3. System catalog tuple insertions and deletions (and the ensuing WAL log traffic).

I'm not too sure which of these is the dominant cost --- it might well vary from system to system anyway, depending on what filesystem you use. But I think it's not #2, since that one would only amount to 16MB over the length of the transaction. As far as I can tell with strace, the filesystem overhead ought to be the same in 8.2 and 8.3, because pretty much the same series of syscalls occurs. So I suspect that the slowdown you saw comes from making a larger number of catalog updates in 8.3; though I can't think what that would be offhand.

A somewhat worrisome point is that the filesystem overhead is going to essentially double in CVS HEAD, because of the addition of per-relation FSM files. (In fact, Heikki is proposing to triple the overhead by also adding DSM files ...) If cost #1 is significant then that could really hurt.

			regards, tom lane
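Cost #1 can be sanity-checked in isolation with a rough sketch like the one below: create and unlink 4,000 empty files, four per "temp table", mirroring the heap, primary-key index, toast table, and toast index. The directory path and file names are invented for illustration; this only mimics the file count, not PostgreSQL's actual on-disk layout:

```shell
# Create four empty files per simulated temp table, 1000 tables total.
# Run the whole script under "time" to see the filesystem-metadata cost.
DIR=/tmp/tt_files_probe
rm -rf "$DIR"
mkdir -p "$DIR"
for i in $(seq 1 1000); do
  for part in heap pkey toast toastidx; do
    : > "$DIR/tt_${i}_${part}"   # ": >" creates an empty file
  done
done
echo "created $(ls "$DIR" | wc -l | tr -d ' ') files in $DIR"
# Clean up with: rm -rf "$DIR"
```

On a filesystem mounted with write barriers, each create forces journal ordering, so this micro-benchmark also exposes the barrier overhead discussed later in the thread.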
>>> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:
> If I find a particular tweak to the background writer or some such is
> particularly beneficial, I'll post again.

It turns out that it was not the PostgreSQL version which was primarily responsible for the performance difference. We updated the kernel at the same time we rolled out 8.3.4, and the new kernel defaulted to using write barriers, while the old kernel didn't. Since we have a BBU RAID controller, we will add nobarrier to the fstab entries. This makes file creation and unlink each about 20 times faster.

-Kevin
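For reference, a hedged illustration of the mitigation Kevin describes. The device, mount point, and filesystem below are invented examples, not the actual servers' layout; on XFS the option is `nobarrier`, while on ext3 the equivalent is `barrier=0`:

```shell
# See whether barrier-related options appear in the current mounts:
grep -i barrier /proc/mounts || echo "no explicit barrier option listed"

# What an fstab entry looks like with barriers disabled (example line):
FSTAB_LINE='/dev/sdb1 /var/pgsql/data xfs noatime 0 0'
echo "$FSTAB_LINE" | sed 's/noatime/noatime,nobarrier/'

# After editing /etc/fstab, apply without a reboot:
#   mount -o remount /var/pgsql/data
```

As the rest of the thread stresses, this is only safe when a battery-backed cache (or equivalent) stands in for the ordering guarantee that barriers provide.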
On Thu, 2008-11-06 at 13:02 -0600, Kevin Grittner wrote:
> >>> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:
> > If I find a particular tweak to the background writer or some such is
> > particularly beneficial, I'll post again.
>
> It turns out that it was not the PostgreSQL version which was
> primarily responsible for the performance difference. We updated the
> kernel at the same time we rolled out 8.3.4, and the new kernel
> defaulted to using write barriers, while the old kernel didn't. Since
> we have a BBU RAID controller, we will add nobarrier to the fstab
> entries. This makes file creation and unlink each about 20 times
> faster.

Woah... which version of the kernel was old and new?

> -Kevin
--
>>> "Joshua D. Drake" <jd@commandprompt.com> wrote:
> On Thu, 2008-11-06 at 13:02 -0600, Kevin Grittner wrote:
>> the new kernel defaulted to using write barriers, while the old
>> kernel didn't. Since we have a BBU RAID controller, we will add
>> nobarrier to the fstab entries. This makes file creation and unlink
>> each about 20 times faster.
>
> Woah... which version of the kernel was old and new?

old:

kgrittn@DBUTL-PG:/var/pgsql/data/test> cat /proc/version
Linux version 2.6.5-7.287.3-bigsmp (geeko@buildhost) (gcc version 3.3.3 (SuSE Linux)) #1 SMP Tue Oct 2 07:31:36 UTC 2007
kgrittn@DBUTL-PG:/var/pgsql/data/test> uname -a
Linux DBUTL-PG 2.6.5-7.287.3-bigsmp #1 SMP Tue Oct 2 07:31:36 UTC 2007 i686 i686 i386 GNU/Linux
kgrittn@DBUTL-PG:/var/pgsql/data/test> cat /etc/SuSE-release
SUSE LINUX Enterprise Server 9 (i586)
VERSION = 9
PATCHLEVEL = 3

new:

kgrittn@SAWYER-PG:~> cat /proc/version
Linux version 2.6.16.60-0.27-smp (geeko@buildhost) (gcc version 4.1.2 20070115 (SUSE Linux)) #1 SMP Mon Jul 28 12:55:32 UTC 2008
kgrittn@SAWYER-PG:~> uname -a
Linux SAWYER-PG 2.6.16.60-0.27-smp #1 SMP Mon Jul 28 12:55:32 UTC 2008 x86_64 x86_64 x86_64 GNU/Linux
kgrittn@SAWYER-PG:~> cat /etc/SuSE-release
SUSE Linux Enterprise Server 10 (x86_64)
VERSION = 10
PATCHLEVEL = 2

To be clear, file create and unlink speeds are almost the same between the two kernels without write barriers; the difference is that they were in effect by default in the newer kernel.

-Kevin
To others that may stumble upon this thread:
Note that Write Barriers can be very important for data integrity when power loss or hardware failure are a concern. Only disable them if you know the consequences are mitigated by other factors (such as a BBU + db using the WAL log with sync writes), or if you accept the additional risk of data loss. Also note that LVM prevents the possibility of using write barriers, and lowers data reliability as a result. The consequences are application dependent and also highly file system dependent.
On Temp Tables:
I am a bit ignorant on the temp table relationship to file creation -- it makes no sense to me at all that a file would even be created for a temp table unless it spills out of RAM or is committed. Inside of a transaction, shouldn't they be purely in-memory if there is space? Is there any way to prevent the file creation? This seems like a total waste of time for many temp table use cases, and explains why they were so slow in some exploratory testing we did a few months ago.
>>> "Scott Carey" <scott@richrelevance.com> wrote:
> Note that Write Barriers can be very important for data integrity when
> power loss or hardware failure are a concern. Only disable them if you
> know the consequences are mitigated by other factors (such as a BBU + db
> using the WAL log with sync writes), or if you accept the additional
> risk to data loss.

For those using xfs, this link may be useful:

http://oss.sgi.com/projects/xfs/faq.html#wcache

> On Temp Tables:
> I am a bit ignorant on the temp table relationship to file creation --
> it makes no sense to me at all that a file would even be created for a
> temp table unless it spills out of RAM or is committed. Inside of a
> transaction, shouldn't they be purely in-memory if there is space? Is
> there any way to prevent the file creation? This seems like a total
> waste of time for many temp table use cases, and explains why they were
> so slow in some exploratory testing we did a few months ago.

As I learned today, creating a temporary table in PostgreSQL can easily create four files and do dozens of updates to system tables; that's all before you start actually inserting any data into the temporary table.

-Kevin
On Thu, Nov 6, 2008 at 2:05 PM, Scott Carey <scott@richrelevance.com> wrote:
> To others that may stumble upon this thread:
> Note that Write Barriers can be very important for data integrity when
> power loss or hardware failure are a concern. Only disable them if you
> know the consequences are mitigated by other factors (such as a BBU + db
> using the WAL log with sync writes), or if you accept the additional
> risk to data loss. Also note that LVM prevents the possibility of using
> write barriers, and lowers data reliability as a result. The
> consequences are application dependent and also highly file system
> dependent.

I am pretty sure that with no write barriers that even a BBU hardware caching raid controller cannot guarantee your data.
>>> "Scott Marlowe" <scott.marlowe@gmail.com> wrote:
> I am pretty sure that with no write barriers that even a BBU hardware
> caching raid controller cannot guarantee your data.

That seems at odds with this:

http://oss.sgi.com/projects/xfs/faq.html#wcache_persistent

What evidence do you have that the SGI XFS team is wrong?

It does seem fairly bizarre to me that we can't configure our system to enforce write barriers within the OS and file system without having it enforced all the way past the BBU RAID cache onto the hard drives themselves. We consider that once it hits the battery-backed cache, it is persisted. Reality has only contradicted that once so far (with a RAID controller failure), and our backups have gotten us past that with no sweat.

-Kevin
On Thu, Nov 6, 2008 at 3:33 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
>>>> "Scott Marlowe" <scott.marlowe@gmail.com> wrote:
>>> I am pretty sure that with no write barriers that even a BBU hardware
>>> caching raid controller cannot guarantee your data.
>
> That seems at odds with this:
>
> http://oss.sgi.com/projects/xfs/faq.html#wcache_persistent
>
> What evidence do you have that the SGI XFS team is wrong?

Logic? Without write barriers in my file system, an fsync request will be immediately returned true, correct? That means that writes can happen out of order, and a system crash could corrupt the file system. Just seems kind of obvious to me.
>>> "Scott Marlowe" <scott.marlowe@gmail.com> wrote:
> Without write barriers in my file system an fsync request will
> be immediately returned true, correct?

Not as I understand it; although it will be pretty fast if it all fits into the battery backed cache.

-Kevin
On Thu, Nov 6, 2008 at 4:04 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
>>> Without write barriers in my file system an fsync request will
>>> be immediately returned true, correct?
>
> Not as I understand it; although it will be pretty fast if it all fits
> into the battery backed cache.

OK, thought exercise time. There's a limited size for the cache. Let's assume it's much smaller, say 16 megabytes. We turn off write barriers. We start writing data to the RAID array faster than the disks can write it. At some point, the data flowing into the cache is backing up into the OS. Without write barriers, the second we call an fsync it returns true. But the data's not in the cache yet, or on the disk. Machine crashes, data is incoherent.

But that's assuming write barriers work as I understand them.
On Thu, Nov 6, 2008 at 4:03 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> OK, thought exercise time. There's a limited size for the cache.
> Let's assume it's much smaller, say 16 megabytes. We turn off write
> barriers. We start writing data to the RAID array faster than the
> disks can write it. At some point, the data flowing into the cache is
> backing up into the OS. Without write barriers, the second we call an
> fsync it returns true. But the data's not in the cache yet, or on the
> disk. Machine crashes, data is incoherent.
>
> But that's assuming write barriers work as I understand them.

Let's try to clear up a couple things:

1. We are talking about 3 different memory caches, in order from low to high: disk cache, controller cache (BBU in this case), and OS cache.

2. A write barrier instructs the lower level hardware that commands issued before the barrier must be written to disk before commands issued after the barrier. Write barriers are used to ensure that data written to disk is written in such a way as to maintain filesystem consistency, without losing all the benefits of a write cache.

3. An fsync call forces data to be synced to the controller. This means that whenever you call fsync, at the very minimum, the data will have made it to the controller. How much further down the line will depend on whether or not the controller is in WriteBack or WriteThrough mode, and whether or not the disk is also caching writes.
So in your example, if the OS is caching some writes and fsync is called, it won't return until, at a minimum, the controller has accepted all the data, regardless of whether or not write barriers are enabled. In theory, it should be safe to disable write barriers if you have a BBU, because the BBU should guarantee that all writes will eventually make it to disk (or at least reduce the risk of that not happening to an acceptable level).

-Dave
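The blocking behavior of a forced sync described in point 3 can be felt directly with a rough probe like the one below. The file path is a placeholder; timings are entirely hardware dependent, and on a BBU controller the synced run should still be fast, because "down to the device" means the controller's cache:

```shell
# Write 8 MB twice: once letting the OS cache absorb it, once forcing
# the data to the device with GNU dd's conv=fdatasync before dd exits.
time dd if=/dev/zero of=/tmp/fsync_probe bs=8k count=1000 2>/dev/null
time dd if=/dev/zero of=/tmp/fsync_probe bs=8k count=1000 conv=fdatasync 2>/dev/null
# Clean up with: rm -f /tmp/fsync_probe
```

The gap between the two timings is the cost that fsync-heavy workloads (such as WAL flushes, or the file churn earlier in this thread) pay on every call.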
On Thu, 6 Nov 2008, Scott Marlowe wrote:
> Without write barriers, the second we call an fsync it returns true.
>
> But that's assuming write barriers work as I understand them.

Write barriers do not work as you understand them. Calling fsync always blocks until all the data has made it to safe storage, and always has (barring broken systems).

Write barriers are meant to be a way to speed up fsync-like operations. Before write barriers, all the system could do was call fsync, and that would cause the operating system to wait for a response from the disc subsystem that the data had been written before it could start writing some more stuff. Write barriers provide an extra way of telling the disc "Write everything before the barrier before you write anything after the barrier", which allows the operating system to keep stuffing data into the disc queue without having to wait for a response.

So fsync should always work right, unless the system is horribly broken, on all systems going back many years.

Matthew

--
I'd try being a pessimist, but it probably wouldn't work anyway.
Seems like this didn't make it through to the list the first time...

* Aidan Van Dyk <aidan@highrise.ca> [081106 22:19]:
> * David Rees <drees76@gmail.com> [081106 21:22]:
>> 2. A write barrier instructs the lower level hardware that commands
>> issued before the barrier must be written to disk before commands
>> issued after the barrier. Write barriers are used to ensure that data
>> written to disk is written in such a way as to maintain filesystem
>> consistency, without losing all the benefits of a write cache.
>>
>> 3. A fsync call forces data to be synced to the controller.
>>
>> This means that whenever you call fsync, at the very minimum, the data
>> will have made it to the controller. How much further down the line
>> will depend on whether or not the controller is in WriteBack or
>> WriteThrough mode and whether or not the disk is also caching writes.
>>
>> So in your example, if the OS is caching some writes and fsync is
>> called, it won't be returned until at a minimum the controller has
>> accepted all the data, regardless of whether or not write barriers are
>> enabled.
>>
>> In theory, it should be safe to disable write barriers if you have a
>> BBU because the BBU should guarantee that all writes will eventually
>> make it to disk (or at least reduce the risk of that not happening to
>> an acceptable level).
>
> All that's correct, but note that fsync doesn't guarantee that a
> *coherent* filesystem state has been made to the controller. And fsync
> *can* carry "later" writes to the controller.
>
> I believe the particular case that prompted write barriers to become
> the default was ext3 + journals, where in certain (rare) cases, upon
> recovery, things were out of sync. What was happening was that ext3 was
> syncing the journal, but "extra" writes were getting carried to the
> controller during the sync operation, and if something crashed at the
> right time, "new" data was on the disk where the "old journal" (because
> the new journal hadn't finished making it to the controller) didn't
> expect it.
>
> The write barriers give the FS the semantics to say "all previously
> queued writes" [BARRIER] flush to controller [BARRIER] "any new
> writes", and thus guarantee the ordering of certain operations to disk,
> and guarantee coherency of the FS at all times.
>
> Of course, that guaranteed FS consistency comes at a cost. As to its
> necessity with the way PG uses the FS w/ WAL... or its necessity with
> xfs...
>
> a.

--
Aidan Van Dyk                  Create like a god,
aidan@highrise.ca              command like a king,
http://www.highrise.ca/        work like a slave.