Thread: Create and drop temp table in 8.3.4

Create and drop temp table in 8.3.4

From: "Kevin Grittner"
Date:
We recently upgraded the databases for our circuit court applications
from PostgreSQL 8.2.5 to 8.3.4.  The application software didn't
change.  Most software runs fine, and our benchmarks prior to the
update tended to show a measurable, if not dramatic, performance
improvement overall.  We have found one area where jobs are running
much longer and having a greater impact on concurrent jobs -- those
where the programmer creates and drops many temporary tables
(thousands) within a database transaction.  We are looking to mitigate
the problem while we look into rewriting queries where the temporary
table usage isn't really needed (probably most of them, but the
rewrites are not trivial).

I'm trying to quantify the issue, and would appreciate any
suggestions, either for mitigation or collecting useful data to find
the cause of the performance regression.  I create a script which
brackets 1000 lines like the following within a single begin/commit:

create temporary table tt (c1 int not null primary key, c2 text, c3
text);  drop table tt;
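
A script of this shape can be generated rather than typed by hand. The following is a minimal sketch in Python; the function name and the idea of redirecting its output to a file for psql are illustrative assumptions, not part of the original test setup.

```python
def build_script(iterations: int = 1000) -> str:
    """Return SQL that creates and drops a temp table `iterations` times
    inside a single begin/commit."""
    stmt = (
        "create temporary table tt "
        "(c1 int not null primary key, c2 text, c3 text);  drop table tt;"
    )
    lines = ["begin;"]
    lines.extend([stmt] * iterations)
    lines.append("commit;")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Redirect stdout to a file, then time the run with psql's -f option.
    print(build_script(), end="")
```

Keeping the iteration count as a parameter makes it easy to check whether the write count scales linearly with the number of tables.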

I run this repeatedly, to get a "steady state", with just enough
settling time between runs to spot the boundaries of the runs in the
vmstat 1 output (5 to 20 seconds between runs).  I'm surprised at how
much disk output there is for this, in either version.  In 8.2.5 a
typical run has about 156,000 disk writes in the vmstat output, while
8.3.4 has about 172,000 writes.

During the main part of the run 8.2.5 ranges between 0 and 15 percent
of cpu time in I/O wait, averaging around 10%; while 8.3.4 ranges
between 15 and 25 percent of cpu time in I/O wait, averaging around
18%, with occasional outliers on both sides, down to 5% and up to 55%.
For both, there's a period of time at the end of the transaction
where the COMMIT seems to be doing disk output without any disk wait,
suggesting that the BBU RAID controller is either able to write these
faster because there are multiple updates to the same sectors which
get combined, or that they can be written sequentially.

The time required for psql to run the script varies little in 8.2.5 --
from 4m43.843s to 4m49.388s.  Under 8.3.4 this bounces around from run
to run -- from 1m28.270s to 5m39.327s.
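
To compare the run-to-run spread, the elapsed times above can be converted to seconds. This is a small sketch; the helper name is hypothetical and the only data used are the four timings quoted in this message.

```python
import re


def to_seconds(elapsed: str) -> float:
    """Convert a shell `time` value such as '4m43.843s' to seconds."""
    m = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", elapsed)
    if m is None:
        raise ValueError(f"unrecognized elapsed time: {elapsed!r}")
    return int(m.group(1)) * 60 + float(m.group(2))


# (fastest, slowest) run per version, as reported above.
runs = {
    "8.2.5": ("4m43.843s", "4m49.388s"),
    "8.3.4": ("1m28.270s", "5m39.327s"),
}
for version, (fastest, slowest) in runs.items():
    lo, hi = to_seconds(fastest), to_seconds(slowest)
    print(f"{version}: {lo:.1f}s to {hi:.1f}s, spread {hi - lo:.1f}s")
```

The spread is roughly 5.5 seconds under 8.2.5 and roughly 251 seconds under 8.3.4, which is the variability described above.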

I can't help wondering why creating and dropping a temporary table
requires over 150 disk writes.  I also wonder whether there is
something in 8.3.4 which directly causes more writes, or whether it is
the result of the new checkpoint and background writer hitting some
pessimal usage pattern where "just in time" writes become "just too
late" to be efficient.
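
The "over 150" figure follows directly from the vmstat totals. A quick sketch of the arithmetic, using only the approximate numbers reported above (variable names are illustrative):

```python
TABLES_PER_RUN = 1000  # create/drop pairs in the test script

# Approximate disk writes per run, as read from the vmstat output.
writes_per_run = {"8.2.5": 156_000, "8.3.4": 172_000}

per_table = {v: w / TABLES_PER_RUN for v, w in writes_per_run.items()}
for version, n in per_table.items():
    print(f"{version}: about {n:.0f} writes per create/drop pair")

increase = writes_per_run["8.3.4"] / writes_per_run["8.2.5"] - 1
print(f"8.3.4 issues about {increase:.0%} more writes")
```

These are rough figures from manual runs, so a difference of this size may be within measurement noise.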

Most concerning is that the 8.3.4 I/O wait time results in slow
performance for interactive tasks and results in frustrated users
calling the support line complaining of slowness.  I can confirm that
queries which normally run in 10 to 20 ms are running for several
seconds in competition with the temporary table creation/drop queries,
which wasn't the case before.

I'm going to get the latest snapshot to see if the issue has changed
for 8.4devel, but I figured I should post the information I have so
far to get feedback.

-Kevin

Re: Create and drop temp table in 8.3.4

From: "Kevin Grittner"
Date:
>>> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:
> We have found one area where jobs are running
> much longer and having a greater impact on concurrent jobs -- those
> where the programmer creates and drops many temporary tables
> (thousands) within a database transaction.

I forgot to include the standard information about the environment and
configuration.

ccsa@COUNTY2-PG:~> cat /proc/version
Linux version 2.6.16.60-0.31-smp (geeko@buildhost) (gcc version 4.1.2
20070115 (SUSE Linux)) #1 SMP Tue Oct 7 16:16:29 UTC 2008
ccsa@COUNTY2-PG:~> cat /etc/SuSE-release
SUSE Linux Enterprise Server 10 (x86_64)
VERSION = 10
PATCHLEVEL = 2
ccsa@COUNTY2-PG:~> uname -a
Linux COUNTY2-PG 2.6.16.60-0.31-smp #1 SMP Tue Oct 7 16:16:29 UTC 2008
x86_64 x86_64 x86_64 GNU/Linux

Two dual-core Xeon 3 GHz processors.
4 GB system RAM.
BBU RAID controller with 256 MB RAM.
RAID 5 on 5 spindles.


8.2.5:

ccsa@COUNTY2-PG:~> /usr/local/pgsql-8.2.5-64/bin/pg_config
BINDIR = /usr/local/pgsql-8.2.5-64/bin
DOCDIR = /usr/local/pgsql-8.2.5-64/doc
INCLUDEDIR = /usr/local/pgsql-8.2.5-64/include
PKGINCLUDEDIR = /usr/local/pgsql-8.2.5-64/include
INCLUDEDIR-SERVER = /usr/local/pgsql-8.2.5-64/include/server
LIBDIR = /usr/local/pgsql-8.2.5-64/lib
PKGLIBDIR = /usr/local/pgsql-8.2.5-64/lib
LOCALEDIR =
MANDIR = /usr/local/pgsql-8.2.5-64/man
SHAREDIR = /usr/local/pgsql-8.2.5-64/share
SYSCONFDIR = /usr/local/pgsql-8.2.5-64/etc
PGXS = /usr/local/pgsql-8.2.5-64/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--prefix=/usr/local/pgsql-8.2.5-64'
'--enable-integer-datetimes' '--enable-debug' '--disable-nls'
CC = gcc
CPPFLAGS = -D_GNU_SOURCE
CFLAGS = -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Winline
-Wdeclaration-after-statement -Wendif-labels -fno-strict-aliasing -g
CFLAGS_SL = -fpic
LDFLAGS = -Wl,-rpath,'/usr/local/pgsql-8.2.5-64/lib'
LDFLAGS_SL =
LIBS = -lpgport -lz -lreadline -lcrypt -ldl -lm
VERSION = PostgreSQL 8.2.5

max_connections = 50
shared_buffers = 256MB
temp_buffers = 10MB
max_prepared_transactions = 0
work_mem = 16MB
maintenance_work_mem = 400MB
max_fsm_pages = 1000000
bgwriter_lru_percent = 20.0
bgwriter_lru_maxpages = 200
bgwriter_all_percent = 10.0
bgwriter_all_maxpages = 600
wal_buffers = 256kB
checkpoint_segments = 50
archive_command = '/bin/true'
archive_timeout = 3600
seq_page_cost = 0.1
random_page_cost = 0.1
effective_cache_size = 3GB
geqo = off
from_collapse_limit = 20
join_collapse_limit = 20
redirect_stderr = on
log_line_prefix = '[%m] %p %q<%u %d %r> '
autovacuum_naptime = 1min
autovacuum_vacuum_threshold = 10
autovacuum_analyze_threshold = 10
datestyle = 'iso, mdy'
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'
escape_string_warning = off
sql_inheritance = off
standard_conforming_strings = on


8.3.4:

ccsa@COUNTY2-PG:~> /usr/local/pgsql-8.3.4-64/bin/pg_config
BINDIR = /usr/local/pgsql-8.3.4-64/bin
DOCDIR = /usr/local/pgsql-8.3.4-64/doc
INCLUDEDIR = /usr/local/pgsql-8.3.4-64/include
PKGINCLUDEDIR = /usr/local/pgsql-8.3.4-64/include
INCLUDEDIR-SERVER = /usr/local/pgsql-8.3.4-64/include/server
LIBDIR = /usr/local/pgsql-8.3.4-64/lib
PKGLIBDIR = /usr/local/pgsql-8.3.4-64/lib
LOCALEDIR =
MANDIR = /usr/local/pgsql-8.3.4-64/man
SHAREDIR = /usr/local/pgsql-8.3.4-64/share
SYSCONFDIR = /usr/local/pgsql-8.3.4-64/etc
PGXS = /usr/local/pgsql-8.3.4-64/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--prefix=/usr/local/pgsql-8.3.4-64'
'--enable-integer-datetimes' '--enable-debug' '--disable-nls'
'--with-libxml'
CC = gcc
CPPFLAGS = -D_GNU_SOURCE -I/usr/include/libxml2
CFLAGS = -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Winline
-Wdeclaration-after-statement -Wendif-labels -fno-strict-aliasing
-fwrapv -g
CFLAGS_SL = -fpic
LDFLAGS = -Wl,-rpath,'/usr/local/pgsql-8.3.4-64/lib'
LDFLAGS_SL =
LIBS = -lpgport -lxml2 -lz -lreadline -lcrypt -ldl -lm
VERSION = PostgreSQL 8.3.4

max_connections = 50
shared_buffers = 256MB
temp_buffers = 10MB
max_prepared_transactions = 0
work_mem = 16MB
maintenance_work_mem = 400MB
max_fsm_pages = 1000000
bgwriter_lru_maxpages = 1000
bgwriter_lru_multiplier = 4.0
wal_buffers = 256kB
checkpoint_segments = 50
archive_mode = on
archive_command = '/bin/true'
archive_timeout = 3600
seq_page_cost = 0.1
random_page_cost = 0.1
effective_cache_size = 3GB
geqo = off
from_collapse_limit = 20
join_collapse_limit = 20
logging_collector = on
log_checkpoints = on
log_connections = on
log_disconnections = on
log_line_prefix = '[%m] %p %q<%u %d %r> '
autovacuum_naptime = 1min
autovacuum_vacuum_threshold = 10
autovacuum_analyze_threshold = 10
datestyle = 'iso, mdy'
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'
default_text_search_config = 'pg_catalog.english'
escape_string_warning = off
sql_inheritance = off
standard_conforming_strings = on
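
With two long listings like these, the differences are easier to see mechanically. The sketch below diffs two sets of `name = value` lines; the helper names are hypothetical, the parsing is deliberately naive (a quoted value containing `#` would be mangled), and only an excerpt of the settings above is used as sample input.

```python
def parse_conf(text: str) -> dict:
    """Parse 'name = value' lines into a dict, skipping blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # naive comment stripping
        if "=" in line:
            name, value = line.split("=", 1)
            settings[name.strip()] = value.strip()
    return settings


def diff_conf(old: dict, new: dict) -> dict:
    """Report settings removed, added, or changed between two configurations."""
    common = sorted(old.keys() & new.keys())
    return {
        "removed": sorted(old.keys() - new.keys()),
        "added": sorted(new.keys() - old.keys()),
        "changed": {k: (old[k], new[k]) for k in common if old[k] != new[k]},
    }


# Excerpts from the 8.2.5 and 8.3.4 listings above.
OLD = """
shared_buffers = 256MB
bgwriter_lru_percent = 20.0
bgwriter_lru_maxpages = 200
bgwriter_all_percent = 10.0
bgwriter_all_maxpages = 600
checkpoint_segments = 50
redirect_stderr = on
"""
NEW = """
shared_buffers = 256MB
bgwriter_lru_maxpages = 1000
bgwriter_lru_multiplier = 4.0
checkpoint_segments = 50
archive_mode = on
logging_collector = on
log_checkpoints = on
"""
result = diff_conf(parse_conf(OLD), parse_conf(NEW))
print(result)
```

On this excerpt the only changed value is bgwriter_lru_maxpages; the remaining differences are settings that exist in one listing but not the other.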


Re: Create and drop temp table in 8.3.4

From: "Kevin Grittner"
Date:
>>> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:
> I'm going to get the latest snapshot to see if the issue has changed
> for 8.4devel

In testing under today's snapshot, it seemed to take 150,000 writes to
create and drop 1,000 temporary tables within a database transaction.
The numbers for the various versions might be within the sampling
noise, since the testing involved manual steps and required saturating
the queues in PostgreSQL, the OS, and the RAID controller to get
meaningful numbers.  It seems like the complaints of slowness result
primarily from these writes saturating the bandwidth when a query
generates a temporary table in a loop, with the increased impact in
later releases resulting from it getting through the loop faster.

I've started a thread on the hackers' list to discuss a possible
PostgreSQL enhancement to help such workloads.  In the meantime, I
think I know which knobs to try turning to mitigate the issue, and
I'll suggest rewrites to some of these queries, to avoid the temporary
tables.

If I find a particular tweak to the background writer or some such is
particularly beneficial, I'll post again.

-Kevin

Re: Create and drop temp table in 8.3.4

From: Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> I'm trying to quantify the issue, and would appreciate any
> suggestions, either for mitigation or collecting useful data to find
> the cause of the performance regression.  I create a script which
> brackets 1000 lines like the following within a single begin/commit:

> create temporary table tt (c1 int not null primary key, c2 text, c3
> text);  drop table tt;

I poked at this a little bit.  The test case is stressing the system
more than might be apparent: there's an index on c1 because of the
PRIMARY KEY, and the text columns force a toast table to be created,
which has its own index.  So that means four separate filesystem
files get created for each iteration, and then dropped at the end of
the transaction.  (The different behavior you notice at COMMIT must
be the cleanup phase where the unlink()s get issued.)

Even though nothing ever gets put in the indexes, their metapages get
created immediately, so we also allocate and write 8K per index.

So there are three cost components:

1. Filesystem overhead to create and eventually delete all those
thousands of files.

2. Write traffic for the index metapages.

3. System catalog tuple insertions and deletions (and the ensuing
WAL log traffic).

I'm not too sure which of these is the dominant cost --- it might
well vary from system to system anyway depending on what filesystem
you use.  But I think it's not #2 since that one would only amount
to 16MB over the length of the transaction.
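
The file count and the 16MB estimate can be checked with a few lines of arithmetic. This is only a sketch of the reasoning above; the relation labels are descriptive names, not catalog names.

```python
ITERATIONS = 1000      # create/drop pairs in the test script
PAGE_SIZE = 8 * 1024   # 8K, the size quoted above for one index metapage

# Relations behind each temp table in this test, per the analysis above.
relations = ["heap", "primary key index", "toast table", "toast index"]
indexes = [r for r in relations if r.endswith("index")]

files_created = ITERATIONS * len(relations)
metapage_bytes = ITERATIONS * len(indexes) * PAGE_SIZE

print(f"files created and later unlinked: {files_created}")
print(f"index metapage writes: {metapage_bytes / 1e6:.1f} MB")
```

Two metapages of 8K per iteration over 1000 iterations comes to about 16 MB, which is small next to the write volume reported for the whole transaction.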

As far as I can tell with strace, the filesystem overhead ought to be
the same in 8.2 and 8.3 because pretty much the same series of syscalls
occurs.  So I suspect that the slowdown you saw comes from making a
larger number of catalog updates in 8.3; though I can't think what that
would be offhand.

A somewhat worrisome point is that the filesystem overhead is going to
essentially double in CVS HEAD, because of the addition of per-relation
FSM files.  (In fact, Heikki is proposing to triple the overhead by also
adding DSM files ...)  If cost #1 is significant then that could really
hurt.

            regards, tom lane

Re: Create and drop temp table in 8.3.4

From: "Kevin Grittner"
Date:
>>> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:
> If I find a particular tweak to the background writer or some such
> is particularly beneficial, I'll post again.

It turns out that it was not the PostgreSQL version which was
primarily responsible for the performance difference.  We updated the
kernel at the same time we rolled out 8.3.4, and the new kernel
defaulted to using write barriers, while the old kernel didn't.  Since
we have a BBU RAID controller, we will add nobarrier to the fstab
entries.  This makes file creation and unlink each about 20 times
faster.
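
The thread does not say how the file creation and unlink speeds were measured. One way to get a rough number is a loop like the sketch below; the function name, file names, and count are assumptions, and whether the barrier cost shows up in such a loop depends on the filesystem and how it commits its journal.

```python
import os
import tempfile
import time


def time_create_unlink(directory: str, count: int = 1000) -> tuple:
    """Return (create_seconds, unlink_seconds) for `count` empty files."""
    paths = [os.path.join(directory, f"probe_{i}") for i in range(count)]

    start = time.perf_counter()
    for path in paths:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
        os.close(fd)
    created = time.perf_counter() - start

    start = time.perf_counter()
    for path in paths:
        os.unlink(path)
    unlinked = time.perf_counter() - start
    return created, unlinked


if __name__ == "__main__":
    # Pass dir=... to TemporaryDirectory to test a specific filesystem;
    # the default temp location may be a RAM-backed filesystem.
    with tempfile.TemporaryDirectory() as d:
        c, u = time_create_unlink(d)
        print(f"create: {c:.3f}s  unlink: {u:.3f}s")
```

Running it once with barriers enabled and once with them disabled on the same volume gives a like-for-like comparison.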

-Kevin

Re: Create and drop temp table in 8.3.4

From: "Joshua D. Drake"
Date:
On Thu, 2008-11-06 at 13:02 -0600, Kevin Grittner wrote:
> >>> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:
> > If I find a particular tweak to the background writer or some such
> > is particularly beneficial, I'll post again.
>
> It turns out that it was not the PostgreSQL version which was
> primarily responsible for the performance difference.  We updated the
> kernel at the same time we rolled out 8.3.4, and the new kernel
> defaulted to using write barriers, while the old kernel didn't.  Since
> we have a BBU RAID controller, we will add nobarrier to the fstab
> entries.  This makes file creation and unlink each about 20 times
> faster.

Woah... which version of the kernel was old and new?

>
> -Kevin
>

Re: Create and drop temp table in 8.3.4

From: "Kevin Grittner"
Date:
>>> "Joshua D. Drake" <jd@commandprompt.com> wrote:
> On Thu, 2008-11-06 at 13:02 -0600, Kevin Grittner wrote:
>> the new kernel
>> defaulted to using write barriers, while the old kernel didn't.  Since
>> we have a BBU RAID controller, we will add nobarrier to the fstab
>> entries.  This makes file creation and unlink each about 20 times
>> faster.
>
> Woah... which version of the kernel was old and new?

old:

kgrittn@DBUTL-PG:/var/pgsql/data/test> cat /proc/version
Linux version 2.6.5-7.287.3-bigsmp (geeko@buildhost) (gcc version 3.3.3
(SuSE Linux)) #1 SMP Tue Oct 2 07:31:36 UTC 2007
kgrittn@DBUTL-PG:/var/pgsql/data/test> uname -a
Linux DBUTL-PG 2.6.5-7.287.3-bigsmp #1 SMP Tue Oct 2 07:31:36 UTC 2007
i686 i686 i386 GNU/Linux
kgrittn@DBUTL-PG:/var/pgsql/data/test> cat /etc/SuSE-release
SUSE LINUX Enterprise Server 9 (i586)
VERSION = 9
PATCHLEVEL = 3

new:

kgrittn@SAWYER-PG:~> cat /proc/version
Linux version 2.6.16.60-0.27-smp (geeko@buildhost) (gcc version 4.1.2
20070115 (SUSE Linux)) #1 SMP Mon Jul 28 12:55:32 UTC 2008
kgrittn@SAWYER-PG:~> uname -a
Linux SAWYER-PG 2.6.16.60-0.27-smp #1 SMP Mon Jul 28 12:55:32 UTC 2008
x86_64 x86_64 x86_64 GNU/Linux
kgrittn@SAWYER-PG:~> cat /etc/SuSE-release
SUSE Linux Enterprise Server 10 (x86_64)
VERSION = 10
PATCHLEVEL = 2

To be clear, file create and unlink speeds are almost the same between
the two kernels when write barriers are off; the difference is that
write barriers were in effect by default in the newer kernel.
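
To see which mounts have barriers explicitly switched on or off, the mount options can be inspected. The sketch below parses text in /proc/mounts format. The option names (`nobarrier`, `barrier`, `barrier=0`, `barrier=1`) are those used by XFS and ext3 on kernels of that era, and should be checked against the filesystem's own documentation; the sample input is hypothetical.

```python
def barrier_setting(options: str) -> str:
    """Classify a comma-separated mount option string as 'off', 'on', or 'default'."""
    opts = options.split(",")
    if "nobarrier" in opts or "barrier=0" in opts:
        return "off"
    if "barrier" in opts or "barrier=1" in opts:
        return "on"
    return "default"  # nothing explicit; the kernel/filesystem default applies


def report(mounts_text: str) -> dict:
    """Map mount point -> barrier setting for /proc/mounts-style text."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()  # device, mount point, fs type, options, ...
        if len(fields) >= 4:
            result[fields[1]] = barrier_setting(fields[3])
    return result


# Hypothetical sample; on a live system, pass open("/proc/mounts").read().
SAMPLE = """\
/dev/sda2 / ext3 rw,data=ordered 0 0
/dev/sdb1 /var/pgsql/data xfs rw,nobarrier 0 0
"""
print(report(SAMPLE))
```

A result of "default" means only that no option was given, so the effective behavior is whatever that kernel and filesystem default to, which is exactly what changed between the two systems above.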

-Kevin

Re: Create and drop temp table in 8.3.4

From: "Scott Carey"
Date:
To others that may stumble upon this thread:
Note that Write Barriers can be very important for data integrity when power loss or hardware failure are a concern.  Only disable them if you know the consequences are mitigated by other factors (such as a BBU + db using the WAL log with sync writes), or if you accept the additional risk of data loss.  Also note that LVM prevents the possibility of using write barriers, and lowers data reliability as a result.  The consequences are application dependent and also highly file system dependent.

On Temp Tables:
I am a bit ignorant on the temp table relationship to file creation -- it makes no sense to me at all that a file would even be created for a temp table unless it spills out of RAM or is committed.  Inside of a transaction, shouldn't they be purely in-memory if there is space?  Is there any way to prevent the file creation?  This seems like a total waste of time for many temp table use cases, and explains why they were so slow in some exploratory testing we did a few months ago.


Re: Create and drop temp table in 8.3.4

From: "Kevin Grittner"
Date:
>>> "Scott Carey" <scott@richrelevance.com> wrote:
> Note that Write Barriers can be very important for data integrity when
> power loss or hardware failure are a concern.  Only disable them if you
> know the consequences are mitigated by other factors (such as a BBU + db
> using the WAL log with sync writes), or if you accept the additional
> risk of data loss.

For those using xfs, this link may be useful:

http://oss.sgi.com/projects/xfs/faq.html#wcache

> On Temp Tables:
> I am a bit ignorant on the temp table relationship to file creation --
> it makes no sense to me at all that a file would even be created for a
> temp table unless it spills out of RAM or is committed.  Inside of a
> transaction, shouldn't they be purely in-memory if there is space?  Is
> there any way to prevent the file creation?  This seems like a total
> waste of time for many temp table use cases, and explains why they were
> so slow in some exploratory testing we did a few months ago.

As I learned today, creating a temporary table in PostgreSQL can
easily create four files and do dozens of updates to system tables;
that's all before you start actually inserting any data into the
temporary table.

-Kevin

Re: Create and drop temp table in 8.3.4

From: "Scott Marlowe"
Date:
On Thu, Nov 6, 2008 at 2:05 PM, Scott Carey <scott@richrelevance.com> wrote:
> To others that may stumble upon this thread:
> Note that Write Barriers can be very important for data integrity when power
> loss or hardware failure are a concern.  Only disable them if you know the
> consequences are mitigated by other factors (such as a BBU + db using the
> WAL log with sync writes), or if you accept the additional risk of data
> loss.  Also note that LVM prevents the possibility of using write barriers,
> and lowers data reliability as a result.   The consequences are application
> dependent and also highly file system dependent.

I am pretty sure that with no write barriers that even a BBU hardware
caching raid controller cannot guarantee your data.

Re: Create and drop temp table in 8.3.4

From: "Kevin Grittner"
Date:
>>> "Scott Marlowe" <scott.marlowe@gmail.com> wrote:
> I am pretty sure that with no write barriers that even a BBU hardware
> caching raid controller cannot guarantee your data.

That seems at odds with this:

http://oss.sgi.com/projects/xfs/faq.html#wcache_persistent

What evidence do you have that the SGI XFS team is wrong?

It does seem fairly bizarre to me that we can't configure our system
to enforce write barriers within the OS and file system without having
it enforced all the way past the BBU RAID cache onto the hard drives
themselves.  We consider that once it hits the battery-backed cache,
it is persisted.  Reality has only contradicted that once so far (with
a RAID controller failure), and our backups have gotten us past that
with no sweat.

-Kevin

Re: Create and drop temp table in 8.3.4

From: "Scott Marlowe"
Date:
On Thu, Nov 6, 2008 at 3:33 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
>>>> "Scott Marlowe" <scott.marlowe@gmail.com> wrote:
>> I am pretty sure that with no write barriers that even a BBU hardware
>> caching raid controller cannot guarantee your data.
>
> That seems at odds with this:
>
> http://oss.sgi.com/projects/xfs/faq.html#wcache_persistent
>
> What evidence do you have that the SGI XFS team is wrong?

Logic?  Without write barriers in my file system an fsync request will
be immediately returned true, correct?  That means that writes can
happen out of order, and a system crash could corrupt the file system.
Just seems kind of obvious to me.

Re: Create and drop temp table in 8.3.4

From: "Kevin Grittner"
Date:
>>> "Scott Marlowe" <scott.marlowe@gmail.com> wrote:
> On Thu, Nov 6, 2008 at 3:33 PM, Kevin Grittner
> <Kevin.Grittner@wicourts.gov> wrote:
>>>>> "Scott Marlowe" <scott.marlowe@gmail.com> wrote:
>>> I am pretty sure that with no write barriers that even a BBU hardware
>>> caching raid controller cannot guarantee your data.
>>
>> That seems at odds with this:
>>
>> http://oss.sgi.com/projects/xfs/faq.html#wcache_persistent
>>
>> What evidence do you have that the SGI XFS team is wrong?
>
> Without write barriers in my file system an fsync request will
> be immediately returned true, correct?

Not as I understand it; although it will be pretty fast if it all fits
into the battery backed cache.

-Kevin

Re: Create and drop temp table in 8.3.4

From: "Scott Marlowe"
Date:
On Thu, Nov 6, 2008 at 4:04 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
>>>> "Scott Marlowe" <scott.marlowe@gmail.com> wrote:
>> On Thu, Nov 6, 2008 at 3:33 PM, Kevin Grittner
>> <Kevin.Grittner@wicourts.gov> wrote:
>>>>>> "Scott Marlowe" <scott.marlowe@gmail.com> wrote:
>>>> I am pretty sure that with no write barriers that even a BBU hardware
>>>> caching raid controller cannot guarantee your data.
>>>
>>> That seems at odds with this:
>>>
>>> http://oss.sgi.com/projects/xfs/faq.html#wcache_persistent
>>>
>>> What evidence do you have that the SGI XFS team is wrong?
>>
>> Without write barriers in my file system an fsync request will
>> be immediately returned true, correct?
>
> Not as I understand it; although it will be pretty fast if it all fits
> into the battery backed cache.

OK, thought exercise time.  There's a limited size for the cache.
Let's assume it's much smaller, say 16 megabytes.  We turn off write
barriers.  We start writing data to the RAID array faster than the
disks can write it.  At some point, the data flowing into the cache is
backing up into the OS.  Without write barriers, the second we call an
fsync it returns true.  But the data's not in the cache yet, or on the
disk.  Machine crashes, data is incoherent.

But that's assuming write barriers work as I understand them.

Re: Create and drop temp table in 8.3.4

From: "David Rees"
Date:
On Thu, Nov 6, 2008 at 4:03 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Thu, Nov 6, 2008 at 4:04 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
>> "Scott Marlowe" <scott.marlowe@gmail.com> wrote:
>>> Without write barriers in my file system an fsync request will
>>> be immediately returned true, correct?
>>
>> Not as I understand it; although it will be pretty fast if it all fits
>> into the battery backed cache.
>
> OK, thought exercise time.  There's a limited size for the cache.
> Let's assume it's much smaller, say 16 megabytes.  We turn off write
> barriers.  We start writing data to the RAID array faster than the
> disks can write it.  At some point, the data flowing into the cache is
> backing up into the OS.  Without write barriers, the second we call an
> fsync it returns true.  But the data's not in the cache yet, or on the
> disk.  Machine crashes, data is incoherent.
>
> But that's assuming write barriers work as I understand them.

Let's try to clear up a couple things:

1. We are talking about 3 different memory caches in order from low to high:
Disk cache, Controller cache (BBU in this case) and OS cache.

2. A write barrier instructs the lower level hardware that commands
issued before the barrier must be written to disk before commands
issued after the barrier. Write barriers are used to ensure that data
written to disk is written in such a way as to maintain filesystem
consistency, without losing all the benefits of a write cache.

3. An fsync call forces data to be synced to the controller.

This means that whenever you call fsync, at the very minimum, the data
will have made it to the controller. How much further down the line
will depend on whether or not the controller is in WriteBack or
WriteThrough mode and whether or not the disk is also caching writes.

So in your example, if the OS is caching some writes and fsync is
called, it won't be returned until at a minimum the controller has
accepted all the data, regardless of whether or not write barriers are
enabled.

In theory, it should be safe to disable write barriers if you have a
BBU because the BBU should guarantee that all writes will eventually
make it to disk (or at least reduce the risk of that not happening to
an acceptable level).

-Dave

Re: Create and drop temp table in 8.3.4

From: Matthew Wakeling
Date:
On Thu, 6 Nov 2008, Scott Marlowe wrote:
> Without write barriers, the second we call an fsync it returns true.
>
> But that's assuming write barriers work as I understand them.

Write barriers do not work as you understand them.

Calling fsync always blocks until all the data has made it to safe
storage, and always has (barring broken systems). Write barriers are meant
to be a way to speed up fsync-like operations. Before write barriers, all
the system could do was call fsync, and that would cause the operating
system to wait for a response from the disc subsystem that the data had
been written before it could start writing some more stuff. Write
barriers provide an extra way of telling the disc "Write everything before
the barrier before you write anything after the barrier", which allows the
operating system to keep stuffing data into the disc queue without having
to wait for a response.

So fsync should always work right, unless the system is horribly broken,
on all systems going back many years.
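
From an application's point of view, the behavior described here looks like the sketch below: the call does not return until the operating system reports the data as flushed, whatever the barrier setting. This is a minimal illustration using Python's os.fsync, which wraps the system call; the function name is hypothetical, and the directory sync assumes a POSIX system.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write `data` to `path` and return only after fsync has completed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)  # may be a partial write
            view = view[written:]
        os.fsync(fd)  # blocks until the OS reports the data as flushed
    finally:
        os.close(fd)

    # Sync the containing directory too, so the new file name is durable.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        durable_write(os.path.join(d, "example.bin"), b"payload")
```

How far down the storage stack "flushed" reaches still depends on the controller and disk cache settings discussed earlier in the thread.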

Matthew

--
I'd try being a pessimist, but it probably wouldn't work anyway.

Re: Create and drop temp table in 8.3.4

From: Aidan Van Dyk
Date:
Seems like this didn't make it through to the list the first time...

* Aidan Van Dyk <aidan@highrise.ca> [081106 22:19]:
> * David Rees <drees76@gmail.com> [081106 21:22]:
>
> > 2. A write barrier instructs the lower level hardware that commands
> > issued before the barrier must be written to disk before commands
> > issued after the barrier. Write barriers are used to ensure that data
> > written to disk is written in such a way as to maintain filesystem
> > consistency, without losing all the benefits of a write cache.
> >
> > 3. An fsync call forces data to be synced to the controller.
> >
> > This means that whenever you call fsync, at the very minimum, the data
> > will have made it to the controller. How much further down the line
> > will depend on whether or not the controller is in WriteBack or
> > WriteThrough mode and whether or not the disk is also caching writes.
> >
> > So in your example, if the OS is caching some writes and fsync is
> > called, it won't be returned until at a minimum the controller has
> > accepted all the data, regardless of whether or not write barriers are
> > enabled.
> >
> > In theory, it should be safe to disable write barriers if you have a
> > BBU because the BBU should guarantee that all writes will eventually
> > make it to disk (or at least reduce the risk of that not happening to
> > an acceptable level).
>
> All that's "correct", but note that fsync doesn't guarantee that a *coherent*
> filesystem state has made it to the controller.  And fsync *can* carry "later"
> writes to the controller.
>
> I believe the particular case that prompted write barriers to become the
> default was ext3 + journals, where in certain (rare) cases, upon recovery,
> things were out of sync.  What was happening was that ext3 was syncing the
> journal, but "extra" writes were getting carried to the controller during the
> sync operation, and if something crashed at the right time, "new" data was on
> the disk where the "old journal" (because the new journal hadn't finished
> making it to the controller) didn't expect it.
>
> The write barriers give the FS the semantics to say "all previously queued
> writes" [BARRIER] flush to controller [BARRIER] "any new writes", and thus
> guarantee the ordering of certain operations to disk, and guarantee coherency
> of the FS at all times.
>
> Of course, that guaranteed FS consistency comes at a cost.  As to its
> necessity with the way PG uses the FS w/ WAL....  or its necessity with
> xfs...
>
> a.
>
> --
> Aidan Van Dyk                                             Create like a god,
> aidan@highrise.ca                                       command like a king,
> http://www.highrise.ca/                                   work like a slave.


