open/lseek overhead with many partitions (including with "fasterpartitioning pruning")

Поиск
Список
Период
Сортировка
От Justin Pryzby
Тема open/lseek overhead with many partitions (including with "fasterpartitioning pruning")
Дата
Msg-id 20180403194613.GY28454@telsasoft.com
обсуждение исходный текст
Ответы Re: open/lseek overhead with many partitions (including with "fasterpartitioning pruning")  (Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>)
Re: open/lseek overhead with many partitions (including with "fasterpartitioning pruning")  (David Rowley <david.rowley@2ndquadrant.com>)
Список pgsql-hackers
TLDR: even with "faster pruning" patch, pg11dev open()+lseek() every file
backing every table being queried, even for those partitions eventually
"excluded".

One of our customers (with the largest number of child tables, currently of
relkind='r') takes a long time to plan query, and I wondered whether we'd
benefit from ongoing work, if I should plan to convert to relkind='p', and if I
could start testing that for our use cases.

I found by stracing that the backend is open()ing + lseek(SEEK_END) each file
in each relation being UNIONed in this view,  including relations which were
excluded by constraints.  Their tables are partitioned by day (some are still
40+GB), so there's only 10-20 tables being included in the above query, total
size perhaps ~50GB and under 50 backing files, but accessing a LOT more files
than needed:

cold backend:
[pryzbyj@database ~]$ grep open /tmp/strace-pg12-0|cut -d/ -f1,2 |sort |uniq -c |sort -nr |head
  26593 open("pg_tblspc/16400
   6894 open("pg_tblspc/16401
    482 open("base/16402

warm backend:
[pryzbyj@database ~]$ grep open /tmp/strace-pg12-1|cut -d/ -f1,2 |sort |uniq -c |sort -nr |head
   6545 open("pg_tblspc/16401
   1338 open("pg_tblspc/16400
    254 open("base/16402

I was curious if that was improved in pg11dev, and if it applied to relkind='r'
or only to relkind='p'.  I applied v47 patches for "path toward faster
partition pruning", and was able to quickly create synthetic "huge table" using
truncate (and verified that smgr logic actually opens all 1e5 files).

$ psql -d postgres -h /tmp --port 5678 -c "CREATE TABLE t(i int) PARTITION BY range(i)" 
$ for a in `seq 1 999`; do psql -d postgres -h /tmp --port 5678 -c "CREATE TABLE t$a PARTITION OF t FOR VALUES FROM
($a)TO($((1+a)))";done
 
$ time for a in ./test11dev/base/13236/1???[0-9]; do echo "$a..."; truncate -s 1G $a; done # make existing, empty files
1GB..
$ time for a in ./test11dev/base/13236/1???[0-9]; do echo "$a..."; for b in `seq 1 99`; do truncate -s 1G $a.$b; done;
done# make new 1GB tables with .n extension
 

postgres=# explain SELECT COUNT(1) FROM t WHERE i=-99 LIMIT 1;
 Limit  (cost=0.00..0.01 rows=1 width=8)
   ->  Aggregate  (cost=0.00..0.01 rows=1 width=8)
         ->  Result  (cost=0.00..0.00 rows=0 width=0)
               One-Time Filter: false
Time: 63268.474 ms (01:03.268)

=> zfs/fuse is certainly exacerbating the issue here due to context switches; but..

[pryzbyj@database postgresql]$ time cut -d\( -f1 /tmp/strace-pg11dev-faster-15|sort |uniq -c |sort -nr |head
 100899 open
  99900 lseek
  98918 close
    118 brk

postgres is still open()+lseek() on 100k files, most of which are for relations
which were clearly excluded:

(gdb) bt
#0  0x00000032b2adb580 in __open_nocancel () from /lib64/libc.so.6
#1  0x0000000000711827 in BasicOpenFilePerm (fileName=0x1e2c9d0 "base/13236/16759.10", fileFlags=2, fileMode=384) at
fd.c:965
#2  0x00000000007121ea in PathNameOpenFilePerm (fileName=0x1e2c9d0 "base/13236/16759.10", fileFlags=2, fileMode=384) at
fd.c:1394
#3  0x0000000000733f6e in _mdfd_openseg (reln=0x1e14d98, forknum=MAIN_FORKNUM, segno=10, oflags=0) at md.c:1783
#4  0x0000000000734026 in mdnblocks (reln=0x1e14d98, forknum=MAIN_FORKNUM) at md.c:918
#5  0x00000000006b26bc in estimate_rel_size (rel=0x7f3f707f4b98, attr_widths=0x1e2cd24, pages=<value optimized out>,
tuples=0x1e2cb98,allvisfrac=<value optimized out>) at plancat.c:946
 
#6  0x00000000006b3a47 in get_relation_info (root=0x1bd9030, relationObjectId=16759, inhparent=false, rel=0x1e2cae0) at
plancat.c:144
#7  0x00000000006b78fa in build_simple_rel (root=0x1bd9030, relid=126, parent=0x1bda578) at relnode.c:185
#8  0x00000000006b7990 in build_simple_rel (root=0x1bd9030, relid=1, parent=0x0) at relnode.c:251
#9  0x0000000000691be3 in add_base_rels_to_query (root=0x1bd9030, jtnode=<value optimized out>) at initsplan.c:121
#10 0x00000000006924af in query_planner (root=<value optimized out>, tlist=0x1d34fd8, qp_callback=0x693940
<standard_qp_callback>,qp_extra=0x7ffe0a077290) at planmain.c:138
 
#11 0x000000000069777e in grouping_planner (root=<value optimized out>, inheritance_update=false, tuple_fraction=<value
optimizedout>) at planner.c:1892
 
#12 0x0000000000698ef7 in subquery_planner (glob=<value optimized out>, parse=<value optimized out>, parent_root=<value
optimizedout>, hasRecursion=<value optimized out>, tuple_fraction=0) at planner.c:966
 
#13 0x0000000000699d97 in standard_planner (parse=0x1bd9778, cursorOptions=256, boundParams=<value optimized out>) at
planner.c:405
#14 0x0000000000739d2a in pg_plan_query (querytree=<value optimized out>, cursorOptions=<value optimized out>,
boundParams=<valueoptimized out>) at postgres.c:808
 
#15 0x00000000005a0bd6 in ExplainOneQuery (query=0x1bd9778, cursorOptions=<value optimized out>, into=0x0,
es=0x1bd8f58,queryString=0x1ae0fb0 "explain SELECT COUNT(1) FROM t WHERE i=-99 LIMIT 1;", params=0x0, queryEnv=0x0)
 
    at explain.c:365
#16 0x00000000005a0e5d in ExplainQuery (pstate=0x1bd8d80, stmt=0x1ae1fa0, queryString=0x1ae0fb0 "explain SELECT
COUNT(1)FROM t WHERE i=-99 LIMIT 1;", params=0x0, queryEnv=0x0, dest=0x1bd8e90) at explain.c:254
 
#17 0x00000000007408f2 in standard_ProcessUtility (pstmt=0x1ae2050, queryString=0x1ae0fb0 "explain SELECT COUNT(1) FROM
tWHERE i=-99 LIMIT 1;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x1bd8e90, 
 
    completionTag=0x7ffe0a077720 "") at utility.c:670
#18 0x000000000073c789 in PortalRunUtility (portal=0x1b46780, pstmt=0x1ae2050, isTopLevel=<value optimized out>,
setHoldSnapshot=<valueoptimized out>, dest=0x1bd8e90, completionTag=<value optimized out>) at pquery.c:1178
 
#19 0x000000000073d9aa in FillPortalStore (portal=0x1b46780, isTopLevel=true) at pquery.c:1038
#20 0x000000000073de80 in PortalRun (portal=0x1b46780, count=9223372036854775807, isTopLevel=true, run_once=true,
dest=0x1ae2618,altdest=0x1ae2618, completionTag=0x7ffe0a077920 "") at pquery.c:768
 
#21 0x000000000073a39d in exec_simple_query (query_string=0x1ae0fb0 "explain SELECT COUNT(1) FROM t WHERE i=-99 LIMIT
1;")at postgres.c:1121
 
#22 0x000000000073b341 in PostgresMain (argc=<value optimized out>, argv=<value optimized out>, dbname=0x1b0ac80
"postgres",username=<value optimized out>) at postgres.c:4149
 
#23 0x00000000006cd4f3 in BackendRun (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4409
#24 BackendStartup (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4081
#25 ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1754
#26 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1362
#27 0x00000000006477d0 in main (argc=7, argv=0x1adba20) at main.c:228

I haven't been following the development patch, so don't know if that's
expected to be improved, or a possible future improvement, or ...

Also, I wondered if it'd be preferable for mdnblocks/estimate_rel_size to
l/stat() rather than open+lseek, which can require opening large number of
rarely-queried files and mess up fd.c LRU cache.

Justin


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Alvaro Herrera
Дата:
Сообщение: Re: Foreign keys and partitioned tables
Следующее
От: Tomas Vondra
Дата:
Сообщение: Re: WIP: BRIN multi-range indexes