Обсуждение: problem with large maintenance_work_mem settings and CREATE INDEX
Hi all! while playing on a new box i noticed that postgresql does not seem to be able to cope with very large settings for maintenance_work_mem. For a test I created a single table with 5 integer columns containing about 1,8B rows 8(about 300M distinct values in the column I want to index): foo=# select count(*) from testtable; count------------ 1800201755(1 row) I tried to create an index on one of the columns: foo=# SET maintenance_work_mem to 4000000; SET foo=# CREATE INDEX a_idx ON testtable(a); ERROR: invalid memory alloc request size 1073741824 foo=# SET maintenance_work_mem to 3000000; SET foo=# CREATE INDEX a_idx ON testtable(a); ERROR: invalid memory alloc request size 1073741824 the error is generated pretty fast (a few seconds into the create index) however: foo=# SET maintenance_work_mem to 2000000; SET foo=# CREATE INDEX a_idx ON testtable(a); is running now for about 10 hours with nearly no IO but pegging the CPU-core it is running on at a constent 100%. watching the process while this happens seems to indicate that the above error occures after the backend exceeds about 3,1GB in resident size. The box in question is a Dual Opteron 275 (4 cores @2,2Ghz) with 16GB ofRAM and 24GB of swap. OS is Debian Sarge/AMD64 witha pure 64bit userland. Stefan
Stefan Kaltenbrunner wrote: > Hi all! > > while playing on a new box i noticed that postgresql does not seem to be > able to cope with very large settings for maintenance_work_mem. > > For a test I created a single table with 5 integer columns containing > about 1,8B rows 8(about 300M distinct values in the column I want to index): > > > foo=# select count(*) from testtable; > count > ------------ > 1800201755 > (1 row) > > > I tried to create an index on one of the columns: > > foo=# SET maintenance_work_mem to 4000000; > SET > foo=# CREATE INDEX a_idx ON testtable(a); > ERROR: invalid memory alloc request size 1073741824 > > foo=# SET maintenance_work_mem to 3000000; > SET > foo=# CREATE INDEX a_idx ON testtable(a); > ERROR: invalid memory alloc request size 1073741824 > > the error is generated pretty fast (a few seconds into the create index) > > however: > > foo=# SET maintenance_work_mem to 2000000; > SET > foo=# CREATE INDEX a_idx ON testtable(a); > > is running now for about 10 hours with nearly no IO but pegging the > CPU-core it is running on at a constent 100%. > > watching the process while this happens seems to indicate that the above > error occures after the backend exceeds about 3,1GB in resident size. > > The box in question is a Dual Opteron 275 (4 cores @2,2Ghz) with 16GB of > RAM and 24GB of swap. OS is Debian Sarge/AMD64 with a pure 64bit userland. forgot to mention that this is 8.1.3 compiled from source. Further testing shows that not only CREATE INDEX has some issue with large maintenance_work_mem settings : foo=# set maintenance_work_mem to 2000000; SET foo=# VACUUM ANALYZE verbose; INFO: vacuuming "information_schema.sql_features" ERROR: invalid memory alloc request size 2047999998 Stefan
On 3/4/06, Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote: > forgot to mention that this is 8.1.3 compiled from source. Further > testing shows that not only CREATE INDEX has some issue with large > maintenance_work_mem settings : what does it show: cat /proc/sys/kernel/shmmax ? depesz
hubert depesz lubaczewski wrote: > On 3/4/06, Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote: > >>forgot to mention that this is 8.1.3 compiled from source. Further >>testing shows that not only CREATE INDEX has some issue with large >>maintenance_work_mem settings : > > > what does it show: > cat /proc/sys/kernel/shmmax 1421326592 not that I think it is related to the problem at all. It looks like I'm hitting the MaxAllocSize Limit in src/include/utils/memutils.h. Stefan
Stefan Kaltenbrunner wrote: > hubert depesz lubaczewski wrote: >> On 3/4/06, Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote: >> >>>forgot to mention that this is 8.1.3 compiled from source. Further >>>testing shows that not only CREATE INDEX has some issue with large >>>maintenance_work_mem settings : >> >> what does it show: >> cat /proc/sys/kernel/shmmax > > 1421326592 > > not that I think it is related to the problem at all. I can second that. Maintenance work mem is not allocated in shared memory. > It looks like I'm > hitting the MaxAllocSize Limit in src/include/utils/memutils.h. There are two issues you have mentioned. This one is more obvious: the limitation of the memory that can be allocated. The other one is the very bad performance for index creation. I can only guess, but is sound like this is related to the recent discussion on hackers about issues with the qsort performance. If the theory is true, your index creation should be much faster with a much lower setting for maintenance_work_mem, so that it uses external sort. See the discussion starting here: http://archives.postgresql.org/pgsql-hackers/2006-02/msg00590.php Best Regards, Michael Paesold
Stefan Kaltenbrunner wrote: > hubert depesz lubaczewski wrote: > >>On 3/4/06, Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote: >> >> >>>forgot to mention that this is 8.1.3 compiled from source. Further >>>testing shows that not only CREATE INDEX has some issue with large >>>maintenance_work_mem settings : >> >> >>what does it show: >>cat /proc/sys/kernel/shmmax > > > 1421326592 > > not that I think it is related to the problem at all. It looks like I'm > hitting the MaxAllocSize Limit in src/include/utils/memutils.h. just tried to increase this limit to 4GB (from the default 1GB) and this seems to help a fair bit. (ie CREATE INDEX and VACUUM do seem to work with much higher maintainance_work_mem settings now. BUT: VACUUM VERBOSE (or VACUUM ANALYZE VERBOSE) breaks(with a rather gigantic allocation request ;-)): foo=# VACUUM VERBOSE; INFO: vacuuming "information_schema.sql_features" ERROR: invalid memory alloc request size 18446744073709551615 Stefan
Stefan Kaltenbrunner wrote: > foo=# set maintenance_work_mem to 2000000; > SET > foo=# VACUUM ANALYZE verbose; > INFO: vacuuming "information_schema.sql_features" > ERROR: invalid memory alloc request size 2047999998 Just an FYI, I reported a similar problem on my 8.0.0 database a few weeks ago. I upgraded to 8.1.3 and just recetly got that message again in a nighly maintenance email. vacuumdb: vacuuming database "postgres" vacuumdb: vacuuming of database "postgres" failed: ERROR: out of memory DETAIL: Failed on request of size 167772156. The interesting thing is that there aren't any tables in the postgres database. This happened Thursday night, but vacuum ran fine on Friday night. I'm on vacation right now, so I can't really look into it much deeper at the moment, but I thought I would mention it. Matt
Michael Paesold wrote: > Stefan Kaltenbrunner wrote: > >> hubert depesz lubaczewski wrote: >> >>> On 3/4/06, Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote: >>> >>>> forgot to mention that this is 8.1.3 compiled from source. Further >>>> testing shows that not only CREATE INDEX has some issue with large >>>> maintenance_work_mem settings : >>> >>> >>> what does it show: >>> cat /proc/sys/kernel/shmmax >> >> >> 1421326592 >> >> not that I think it is related to the problem at all. > > > I can second that. Maintenance work mem is not allocated in shared memory. > >> It looks like I'm >> hitting the MaxAllocSize Limit in src/include/utils/memutils.h. > > > There are two issues you have mentioned. This one is more obvious: the > limitation of the memory that can be allocated. yes > > The other one is the very bad performance for index creation. I can only > guess, but is sound like this is related to the recent discussion on > hackers about issues with the qsort performance. If the theory is true, > your index creation should be much faster with a much lower setting for > maintenance_work_mem, so that it uses external sort. > > See the discussion starting here: > http://archives.postgresql.org/pgsql-hackers/2006-02/msg00590.php I was following this thread - and it was partly a reason why I'm playing with that(the CREATE INDEX on that table finished after about 12 hours with a bit less 2GB for maintenance_work_mem(for comparision it took me only about 2,5hours to create this table) . I'm currently testing who long it takes with very low settings to force an external sort. Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> not that I think it is related to the problem at all. It looks like I'm
>> hitting the MaxAllocSize Limit in src/include/utils/memutils.h.
> just tried to increase this limit to 4GB (from the default 1GB) and this
> seems to help a fair bit.
s/help a fair bit/break a whole lot of stuff/
There are reasons for that limit, and you can't just arbitrarily
rejigger it.
The sorting code probably needs a defense to keep it from trying to
exceed MaxAllocSize for the SortObject array; AFAIR there is no such
consideration there now, but it's easily added.  I'm not sure where your
VACUUM failure is coming from though --- can you get a back trace from
the errfinish call in that case?
        regards, tom lane
			
		Tom Lane wrote: > Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: > >>>not that I think it is related to the problem at all. It looks like I'm >>>hitting the MaxAllocSize Limit in src/include/utils/memutils.h. > > >>just tried to increase this limit to 4GB (from the default 1GB) and this >>seems to help a fair bit. > > > s/help a fair bit/break a whole lot of stuff/ > > There are reasons for that limit, and you can't just arbitrarily > rejigger it. heh - sure this is just a testbox so it was worth a try and I don't care for the data anyway ... > > The sorting code probably needs a defense to keep it from trying to > exceed MaxAllocSize for the SortObject array; AFAIR there is no such > consideration there now, but it's easily added. I'm not sure where your > VACUUM failure is coming from though --- can you get a back trace from > the errfinish call in that case? like(with maintenance_work_mem set to 2000000): (gdb) bt #0 errfinish (dummy=0) at elog.c:310 #1 0x00000000005c6c93 in elog_finish (elevel=-4145840, fmt=0x84da50 "invalid memory alloc request size %lu") at elog.c:931 #2 0x00000000005d96a0 in MemoryContextAlloc (context=0x8d9c58, size=2047999998) at mcxt.c:505 #3 0x00000000004db947 in lazy_space_alloc (vacrelstats=0x8de5b0, relblocks=6) at vacuumlazy.c:963 #4 0x00000000004dab33 in lazy_scan_heap (onerel=0x2ad69a589cc8, vacrelstats=0x8de5b0, Irel=0x0, nindexes=0) at vacuumlazy.c:240 #5 0x00000000004da9d1 in lazy_vacuum_rel (onerel=0x2ad69a589cc8, vacstmt=0x8c0fd0) at vacuumlazy.c:157 #6 0x00000000004d7325 in vacuum_rel (relid=2589498568, vacstmt=0x8c0fd0, expected_relkind=-27 'å') at vacuum.c:1077 #7 0x00000000004d6990 in vacuum (vacstmt=0x8c0fd0, relids=0x0) at vacuum.c:449 #8 0x000000000055e871 in PortalRunUtility (portal=0x8e0360, query=0x8c0e00, dest=0x8c1050, completionTag=0x7fffffc0c410 "") at pquery.c:987 #9 0x000000000055eb07 in PortalRunMulti (portal=0x8e0360, dest=0x8c1050, altdest=0x8c1050, completionTag=0x7fffffc0c410 "") at pquery.c:1054 #10 0x000000000055e28f in PortalRun (portal=0x8e0360, count=9223372036854775807, dest=0x8c1050, altdest=0x8c1050, completionTag=0x7fffffc0c410 "") at pquery.c:665 #11 0x000000000055a3a1 in exec_simple_query (query_string=0x8c0cf0 "VACUUM VERBOSE;") at postgres.c:1002 #12 0x000000000055cc2c in PostgresMain (argc=4, argv=0x84c078, username=0x84c040 "postgres") at postgres.c:3217 #13 0x0000000000538a71 in BackendRun (port=0x86add0) at postmaster.c:2853 #14 0x0000000000538550 in BackendStartup (port=0x86add0) at postmaster.c:2497 #15 0x0000000000536b4d in ServerLoop () at postmaster.c:1230 #16 0x0000000000535fcf in PostmasterMain (argc=3, argv=0x8493c0) at postmaster.c:941 #17 0x00000000004fcaa5 in main (argc=3, argv=0x8493c0) at main.c:265 Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> Tom Lane wrote:
>> The sorting code probably needs a defense to keep it from trying to
>> exceed MaxAllocSize for the SortObject array; AFAIR there is no such
>> consideration there now, but it's easily added.  I'm not sure where your
>> VACUUM failure is coming from though --- can you get a back trace from
>> the errfinish call in that case?
> like(with maintenance_work_mem set to 2000000):
> (gdb) bt
> #0  errfinish (dummy=0) at elog.c:310
> #1  0x00000000005c6c93 in elog_finish (elevel=-4145840, fmt=0x84da50
> "invalid memory alloc request size %lu")
>     at elog.c:931
> #2  0x00000000005d96a0 in MemoryContextAlloc (context=0x8d9c58,
> size=2047999998) at mcxt.c:505
> #3  0x00000000004db947 in lazy_space_alloc (vacrelstats=0x8de5b0,
> relblocks=6) at vacuumlazy.c:963
Oh, OK, it's lazy_space_alloc's fault --- we need to fix that to not try
to allocate chunks larger than MaxAllocSize.  Will do ... thanks for the
report.
        regards, tom lane
			
		Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: >>> forgot to mention that this is 8.1.3 compiled from source. >> >> See the discussion starting here: >> http://archives.postgresql.org/pgsql-hackers/2006-02/msg00590.php > I was following this thread - and it was partly a reason why I'm playing > with that(the CREATE INDEX on that table finished after about 12 hours > with a bit less 2GB for maintenance_work_mem(for comparision it took me > only about 2,5hours to create this table) . It would be interesting to try the same test with CVS tip to see if the sorting improvements Simon and I made over the past few weeks help much. regards, tom lane
Tom Lane wrote: > Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: > >>>>forgot to mention that this is 8.1.3 compiled from source. >>> >>>See the discussion starting here: >>>http://archives.postgresql.org/pgsql-hackers/2006-02/msg00590.php > > >>I was following this thread - and it was partly a reason why I'm playing >>with that(the CREATE INDEX on that table finished after about 12 hours >>with a bit less 2GB for maintenance_work_mem(for comparision it took me >>only about 2,5hours to create this table) . > > > It would be interesting to try the same test with CVS tip to see if the > sorting improvements Simon and I made over the past few weeks help much. playing with CVS tip right now, it is a bit faster for both the initial bulkloading (about 5%) and for the CREATE INDEX itself (11h30min vs 11h54min) though not a dramatic improvement. Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> samples  %        symbol name
> 24915704 96.2170  ltsReleaseBlock
> 387265    1.4955  LogicalTapeRead
> 168725    0.6516  inlineApplySortFunction
Hmm ... the comment in ltsReleaseBlock sez
   /*    * Insert blocknum into array, preserving decreasing order (so that    * ltsGetFreeBlock returns the lowest
availableblock number). This could    * get fairly slow if there were many free blocks, but we don't expect    * there
tobe very many at one time.    */
 
We probably need to tweak things so this doesn't get called during the
"final merge" pass.  Looking at it now.
        regards, tom lane
			
		On Sun, 2006-03-05 at 15:15 -0500, Tom Lane wrote: > Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: > > samples % symbol name > > 24915704 96.2170 ltsReleaseBlock > > 387265 1.4955 LogicalTapeRead > > 168725 0.6516 inlineApplySortFunction > > Hmm ... the comment in ltsReleaseBlock sez > > /* > * Insert blocknum into array, preserving decreasing order (so that > * ltsGetFreeBlock returns the lowest available block number). This could > * get fairly slow if there were many free blocks, but we don't expect > * there to be very many at one time. > */ > > We probably need to tweak things so this doesn't get called during the > "final merge" pass. Looking at it now. OK. I also had a report of poor performance, just isolated to the final merge pass and sucked quite badly; sounds like you've located the cause. Best Regards, Simon Riggs
I wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> samples  %        symbol name
>> 24915704 96.2170  ltsReleaseBlock
> We probably need to tweak things so this doesn't get called during the
> "final merge" pass.  Looking at it now.
I've committed a fix for this into CVS HEAD --- please try it out.
        regards, tom lane
			
		Tom Lane wrote: > I wrote: > >>Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: >> >>>samples % symbol name >>>24915704 96.2170 ltsReleaseBlock > > >>We probably need to tweak things so this doesn't get called during the >>"final merge" pass. Looking at it now. > > > I've committed a fix for this into CVS HEAD --- please try it out. just tried that with CVS HEAD (includes the second fix too): CREATE INDEX on a 1,8B row table (5 int columns - index created on the first row about 300M distinct values): before: 11h 51min after: 3h 11min(!) Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> CREATE INDEX on a 1,8B row table (5 int columns - index created on the 
> first row about 300M distinct values):
> before: 11h 51min
> after: 3h 11min(!)
Cool.  Does it seem to be I/O bound now?  Would you be willing to do it
over with oprofile turned on?
        regards, tom lane
			
		On Wed, 2006-03-08 at 10:45 -0500, Tom Lane wrote: > Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: > > CREATE INDEX on a 1,8B row table (5 int columns - index created on the > > first row about 300M distinct values): > > > before: 11h 51min > > after: 3h 11min(!) > > Cool. Does it seem to be I/O bound now? Would you be willing to do it > over with oprofile turned on? Very. Any chance of trying it with different maintenance_work_mem settings? Did you try this with trace_sort=on? If so could we get the logs for that? [Results welcome from other hackers...particularly with regard to queries with sort steps in rather than CREATE INDEX.] Best Regards, Simon Riggs
Tom Lane wrote: > Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: > >>CREATE INDEX on a 1,8B row table (5 int columns - index created on the >>first row about 300M distinct values): > > >>before: 11h 51min >>after: 3h 11min(!) > > > Cool. Does it seem to be I/O bound now? Would you be willing to do it > over with oprofile turned on? while it now does a fair amount of IO during the whole operation, it is not yet IObound afaiks. profile: samples % symbol name 103520432 47.9018 inlineApplySortFunction 33382738 15.4471 comparetup_index 25296438 11.7054 tuplesort_heap_siftup 10089122 4.6685 btint4cmp 8395676 3.8849 LogicalTapeRead 2873556 1.3297 tuplesort_heap_insert 2796545 1.2940 tuplesort_gettuple_common 2752345 1.2736 AllocSetFree 2233889 1.0337 IndexBuildHeapScan 2035265 0.9418 heapgettup 1571035 0.7270 LWLockAcquire 1498800 0.6935 readtup_index 1213587 0.5616 index_form_tuple 1097172 0.5077 AllocSetAlloc 1056964 0.4891 heap_fill_tuple 1041172 0.4818 btbuildCallback 990005 0.4581 LWLockRelease 897662 0.4154 slot_deform_tuple 858527 0.3973 LogicalTapeWrite 806849 0.3734 PageAddItem 764136 0.3536 LockBuffer trace_sort: LOG: begin index sort: unique = f, workMem = 2048000, randomAccess = f LOG: switching to external sort with 7315 tapes: CPU 4.07s/13.70u sec elapsed 17.79 sec LOG: finished writing run 1 to tape 0: CPU 240.07s/3926.66u sec elapsed 4498.49 sec LOG: performsort starting: CPU 535.66s/8138.92u sec elapsed 9435.11 sec LOG: finished writing final run 2 to tape 1: CPU 538.54s/8242.23u sec elapsed 9541.55 sec LOG: performsort done (except final merge): CPU 539.39s/8254.83u sec elapsed 9559.75 sec LOG: external sort ended, 4398827 disk blocks used: CPU 768.38s/10027.39u sec elapsed 11884.63 sec Stefan
Simon Riggs wrote: > On Wed, 2006-03-08 at 10:45 -0500, Tom Lane wrote: > >>Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: >> >>>CREATE INDEX on a 1,8B row table (5 int columns - index created on the >>>first row about 300M distinct values): >> >>>before: 11h 51min >>>after: 3h 11min(!) >> >>Cool. Does it seem to be I/O bound now? Would you be willing to do it >>over with oprofile turned on? > > > Very. > > Any chance of trying it with different maintenance_work_mem settings? interesting: with maintenance_work_mem = 2048000: for 3 runs runtime between 3h11min and 3h25min with maintenance_work_mem = 10240: for 3 runs runtime between 2h40min and 2h50min will try other settings as time permits Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> samples  %        symbol name
> 103520432 47.9018  inlineApplySortFunction
> 33382738 15.4471  comparetup_index
> 25296438 11.7054  tuplesort_heap_siftup
> 10089122  4.6685  btint4cmp
> 8395676   3.8849  LogicalTapeRead
> 2873556   1.3297  tuplesort_heap_insert
> 2796545   1.2940  tuplesort_gettuple_common
> 2752345   1.2736  AllocSetFree
> 2233889   1.0337  IndexBuildHeapScan
Interesting.  What's the platform, and what compiler exactly?  For me,
gcc seems to inline inlineApplySortFunction successfully, but your
compiler evidently is not doing that.
I get a significantly different oprofile result from CVS HEAD:
CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
samples  %        symbol name
194439   37.8399  tuplesort_heap_siftup
70803    13.7790  comparetup_index
35030     6.8172  btint4cmp
25139     4.8923  AllocSetAlloc
23965     4.6638  writetup_index
18057     3.5141  GetMemoryChunkSpace
17384     3.3831  tuplesort_gettuple_common
14356     2.7938  mergepreread
12740     2.4793  LWLockAcquire
11175     2.1748  LWLockRelease
6986      1.3596  tuplesort_heap_insert
6659      1.2959  PageAddItem
6387      1.2430  index_form_tuple
5225      1.0168  LogicalTapeRead
LOG:  begin index sort: unique = f, workMem = 65536, randomAccess = f
LOG:  switching to external sort with 234 tapes: CPU 0.23s/0.81u sec elapsed 1.82 sec
LOG:  finished writing run 1 to tape 0: CPU 0.68s/6.80u sec elapsed 9.59 sec
LOG:  finished writing run 2 to tape 1: CPU 1.23s/13.54u sec elapsed 17.88 sec
LOG:  finished writing run 3 to tape 2: CPU 1.62s/20.43u sec elapsed 25.56 sec
LOG:  finished writing run 4 to tape 3: CPU 2.08s/27.48u sec elapsed 34.64 sec
LOG:  finished writing run 5 to tape 4: CPU 2.56s/34.50u sec elapsed 43.06 sec
LOG:  performsort starting: CPU 2.66s/35.81u sec elapsed 45.00 sec
LOG:  finished writing run 6 to tape 5: CPU 2.74s/38.11u sec elapsed 47.38 sec
LOG:  finished writing final run 7 to tape 6: CPU 2.74s/38.15u sec elapsed 47.43 sec
LOG:  performsort done (except 7-way final merge): CPU 2.74s/38.56u sec elapsed 47.84 sec
LOG:  external sort ended, 24434 disk blocks used: CPU 4.10s/51.98u sec elapsed 65.33 sec
This is creating an index on a single integer column, 10 million rows,
random data, with maintenance_work_mem set to 65536 (64MB).  Platform
is 64-bit Xeon, Fedora Core 4.  I tried a larger test case (100M rows,
640MB maintenance_work_mem) and got similar results, with if anything
a few more percent in tuplesort_heap_siftup.  What's even more
interesting is that an instruction-level dump of the oprofile counts
shows that almost all the tuplesort_heap_siftup time is spent at
the "((tup1)->tupindex != (tup2)->tupindex)" part of the first
HEAPCOMPARE macro.  I interpret this as memory fetch delay: that's the
first touch of the SortTuple array elements, and if that part of the
array wasn't in L2 cache this is where the CPU would stall waiting
for it.
Your machine seems not to be subject to nearly the same amount of memory
delay.  Which is interesting because most of the argument for changing
sort algorithms seems to hinge on the assumption that main-memory delay
is the main thing we need to worry about.  That looks to be valid on the
Xeon I'm testing but not on your machine ...
        regards, tom lane
			
		Tom Lane wrote: > Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: > >>samples % symbol name >>103520432 47.9018 inlineApplySortFunction >>33382738 15.4471 comparetup_index >>25296438 11.7054 tuplesort_heap_siftup >>10089122 4.6685 btint4cmp >>8395676 3.8849 LogicalTapeRead >>2873556 1.3297 tuplesort_heap_insert >>2796545 1.2940 tuplesort_gettuple_common >>2752345 1.2736 AllocSetFree >>2233889 1.0337 IndexBuildHeapScan > > > Interesting. What's the platform, and what compiler exactly? For me, > gcc seems to inline inlineApplySortFunction successfully, but your > compiler evidently is not doing that. Debian Sarge/AMD64 with gcc version 3.3.5 (Debian 1:3.3.5-13) running on a Dual AMD Opteron 280 (so 4 cores @2,4GHz) with 16GB of RAM and a very recent Kernel. Debian has gcc 3.4 as an optional package in Sarge too so I certainly can try that one. [...] > Your machine seems not to be subject to nearly the same amount of memory > delay. Which is interesting because most of the argument for changing > sort algorithms seems to hinge on the assumption that main-memory delay > is the main thing we need to worry about. That looks to be valid on the > Xeon I'm testing but not on your machine ... hmm very interesting, Opterons are known for there very high memory bandwidth and some (rather limited) testing using various benchmarktools against a 3,2Ghz DP Xeon with 2MB L2 cache shows that the Opteron box really has a significant advantage here ... Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> LOG:  begin index sort: unique = f, workMem = 8024000, randomAccess = f
> LOG:  switching to external sort with 28658 tapes: CPU 4.18s/13.96u sec 
> elapsed 32.43 sec
> LOG:  finished writing run 1 to tape 0: CPU 173.56s/3425.85u sec elapsed 
> 3814.82 sec
> LOG:  performsort starting: CPU 344.17s/7013.20u sec elapsed 7715.45 sec
> LOG:  finished writing final run 2 to tape 1: CPU 347.19s/7121.78u sec 
> elapsed 7827.25 sec
> LOG:  performsort done (except 2-way final merge): CPU 348.25s/7132.99u 
> sec elapsed 7846.47 sec
> after that the postmaster is now consuming 99% CPU for about 22 hours(!) 
I'll look into it, but I was already wondering if we shouldn't bound the
number of tapes somehow.  It's a bit hard to believe that 28000 tapes is
a sane setting.
        regards, tom lane
			
		Tom Lane wrote: > Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: > >>LOG: begin index sort: unique = f, workMem = 8024000, randomAccess = f >>LOG: switching to external sort with 28658 tapes: CPU 4.18s/13.96u sec >>elapsed 32.43 sec >>LOG: finished writing run 1 to tape 0: CPU 173.56s/3425.85u sec elapsed >>3814.82 sec >>LOG: performsort starting: CPU 344.17s/7013.20u sec elapsed 7715.45 sec >>LOG: finished writing final run 2 to tape 1: CPU 347.19s/7121.78u sec >>elapsed 7827.25 sec >>LOG: performsort done (except 2-way final merge): CPU 348.25s/7132.99u >>sec elapsed 7846.47 sec > > >>after that the postmaster is now consuming 99% CPU for about 22 hours(!) > > > I'll look into it, but I was already wondering if we shouldn't bound the > number of tapes somehow. It's a bit hard to believe that 28000 tapes is > a sane setting. heh - don't think it is a sane setting either (and it doesn't look like that pg is using more than 2GB anyway). If this testing helps with defining appropriate upper bounds to prevent bad behaviour like this (not responding to signals any more and eating CPU like mad) I'm more than happy. And the ltsReleaseBlock-fix already reduced dump&restore times for one of our production databases by at about 15% which is already quite an impressive improvment on its own ;-) Stefan
On Fri, 2006-03-10 at 09:31 -0500, Tom Lane wrote: > Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: > > LOG: begin index sort: unique = f, workMem = 8024000, randomAccess = f > > LOG: switching to external sort with 28658 tapes: CPU 4.18s/13.96u sec > > elapsed 32.43 sec > > LOG: finished writing run 1 to tape 0: CPU 173.56s/3425.85u sec elapsed > > 3814.82 sec > > LOG: performsort starting: CPU 344.17s/7013.20u sec elapsed 7715.45 sec > > LOG: finished writing final run 2 to tape 1: CPU 347.19s/7121.78u sec > > elapsed 7827.25 sec > > LOG: performsort done (except 2-way final merge): CPU 348.25s/7132.99u > > sec elapsed 7846.47 sec > > > after that the postmaster is now consuming 99% CPU for about 22 hours(!) > > I'll look into it, but I was already wondering if we shouldn't bound the > number of tapes somehow. It's a bit hard to believe that 28000 tapes is > a sane setting. Well, since they are not actually tapes, why not? I thought you had changed the memory settings so that the 28658 was a maximum, not the actual number it used. In the above example we are only using 2 runs (tapes)... so it should just read in run 1 and then run 2 into memory and complete the final merge. Seems reasonable to limit tapes to say 10,000. 28000 tapes allows you to sort 448 TB without any merging... Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> On Fri, 2006-03-10 at 09:31 -0500, Tom Lane wrote:
>> I'll look into it, but I was already wondering if we shouldn't bound the
>> number of tapes somehow.  It's a bit hard to believe that 28000 tapes is
>> a sane setting.
> I thought you had changed the memory settings so that the 28658 was a
> maximum, not the actual number it used.
Well, it does set up the control structures with 28658 entries, but the
I/O buffers and so on are not allocated unless used (in this instance
only two will get used).  logtape.c itself does not look like it has any
serious problem with too many tapes, but maybe tuplesort.c does.  Or the
problem Stefan has stumbled across might be unrelated to number of
tapes, anyway --- we still need to dig.
> Seems reasonable to limit tapes to say 10,000. 28000 tapes allows you to
> sort 448 TB without any merging...
Yeah, I was thinking MAXTAPES = 10000 might not be a bad idea.
        regards, tom lane
			
		Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> samples  %        symbol name
> 350318533 98.8618  mergepreread
> 971822    0.2743  tuplesort_gettuple_common
> 413674    0.1167  tuplesort_heap_siftup
I'm not immediately seeing why mergepreread would have such a problem
with lots of tapes.  Could you reproduce this in a debug build, then
attach to the process with gdb and print out the contents of the "state"
struct?
        regards, tom lane
			
		Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> samples  %        symbol name
> 350318533 98.8618  mergepreread
> 971822    0.2743  tuplesort_gettuple_common
> 413674    0.1167  tuplesort_heap_siftup
I don't have enough memory to really reproduce this, but I've come close
enough that I believe I see what's happening.  It's an artifact of the
code I added recently to prevent the SortTuple array from growing during
the merge phase, specifically the "mergeslotsfree" logic.  You can get
into a state where mergeslotsfree is at the lower limit of what the code
will allow, and then if there's a run of tuples that should come from a
single tape, mergepreread ends up sucking just one tuple per call from
that tape --- and with the outer loop over 28000 tapes that aren't doing
anything, each call is pretty expensive.  I had mistakenly assumed that
the mergeslotsfree limit would be a seldom-hit corner case, but it seems
it's not so hard to get into that mode after all.  The code really needs
to do a better job of sharing the available array slots among the tapes.
Probably the right answer is to allocate so many free array slots to each
tape, similar to the per-tape limit on memory usage --- I had thought
that the memory limit would cover matters but it doesn't.
Another thing I am wondering about is the code's habit of prereading
from all tapes when one goes empty.  This is clearly pretty pointless in
the final-merge-pass case: we might as well just reload from the one
that went empty, and not bother scanning the rest.  However, in the
scenario where we are rewriting the data to tape, I think we still need
the preread-from-all behavior in order to keep things efficient in
logtape.c.  logtape likes it if you alternate a lot of reads with a lot
of writes, so once you've started reading you really want to refill
memory completely.
It might also be worth remembering the index of the last active tape so
that we don't iterate over thousands of uninteresting tapes in
mergepreread.
        regards, tom lane
			
		I wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> samples  %        symbol name
>> 350318533 98.8618  mergepreread
>> 971822    0.2743  tuplesort_gettuple_common
>> 413674    0.1167  tuplesort_heap_siftup
> I don't have enough memory to really reproduce this, but I've come close
> enough that I believe I see what's happening.  It's an artifact of the
> code I added recently to prevent the SortTuple array from growing during
> the merge phase, specifically the "mergeslotsfree" logic.
I've committed a fix for this; if you're still interested, please
retest.
        regards, tom lane
			
		Tom Lane wrote: > I wrote: > >>Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes: >> >>>samples % symbol name >>>350318533 98.8618 mergepreread >>>971822 0.2743 tuplesort_gettuple_common >>>413674 0.1167 tuplesort_heap_siftup > > >>I don't have enough memory to really reproduce this, but I've come close >>enough that I believe I see what's happening. It's an artifact of the >>code I added recently to prevent the SortTuple array from growing during >>the merge phase, specifically the "mergeslotsfree" logic. > > > I've committed a fix for this; if you're still interested, please > retest. This seems to have fixed the problem - mergepreread is gone from the profile and the CREATE INDEX finished in about 2h 37 minutes: LOG: begin index sort: unique = f, workMem = 8024000, randomAccess = f LOG: switching to external sort with 28658 tapes: CPU 4.68s/11.67u sec elapsed 16.36 sec LOG: finished writing run 1 to tape 0: CPU 303.18s/3324.51u sec elapsed 3858.55 sec LOG: performsort starting: CPU 611.37s/6829.75u sec elapsed 7830.90 sec LOG: finished writing final run 2 to tape 1: CPU 614.15s/6928.92u sec elapsed 7933.38 sec LOG: performsort done (except 2-way final merge): CPU 615.26s/6940.77u sec elapsed 7951.58 sec LOG: external sort ended, 4398827 disk blocks used: CPU 827.45s/8519.86u sec elapsed 10046.31 sec profile for this run looks like: CPU: AMD64 processors, speed 2405.5 MHz (estimated) Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000 samples % symbol name 77645525 39.6040 comparetup_index 47024448 23.9854 btint4cmp 22508630 11.4808 tuplesort_heap_siftup 12657874 6.4563 LogicalTapeRead 4049912 2.0657 tuplesort_heap_insert 3025537 1.5432 tuplesort_gettuple_common 2922149 1.4905 AllocSetFree 2499669 1.2750 readtup_index 1725984 0.8804 AllocSetAlloc 1318344 0.6724 LWLockAcquire 1298245 0.6622 PageAddItem 1271657 0.6486 heapgettup Stefan