Re: [HACKERS] WAL logging problem in 9.4.3?

Поиск
Список
Период
Сортировка
От	Kyotaro Horiguchi
Тема	Re: [HACKERS] WAL logging problem in 9.4.3?
Дата	3 декабря 2019 г. 11:51:46
Msg-id	20191203.205146.1521643852457054060.horikyota.ntt@gmail.com обсуждение исходный текст
Ответ на	Re: [HACKERS] WAL logging problem in 9.4.3? (Noah Misch <noah@leadboat.com>)
Ответы	Re: [HACKERS] WAL logging problem in 9.4.3? Re: [HACKERS] WAL logging problem in 9.4.3?
Список	pgsql-hackers
Дерево обсуждения
Hello.

At Thu, 28 Nov 2019 17:23:19 -0500, Noah Misch <noah@leadboat.com> wrote in 
> On Thu, Nov 28, 2019 at 09:35:08PM +0900, Kyotaro Horiguchi wrote:
> > I measured the performance with the latest patch set.
> > 
> > > 1. Determine $DDL_COUNT, a number of DDL transactions that take about one
> > >    minute when done via syncs.
> > > 2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
> > > 3. Wait 10s.
> > > 4. Start one DDL backend that runs $DDL_COUNT transactions.
> > > 5. Save DDL start timestamp, DDL end timestamp, and pgbench output.
> 
> If you have the raw data requested in (5), please share them here so folks
> have the option to reproduce your graphs and calculations.

Sorry, I forgot to attach the scripts. The raw data was vanished into
unstable connection and the steps was quite crude. I prioritized on
showing some numbers at the time. I revised the scripts into more
automated way and will take numbers again.

> > > 2. Start server with wal_level = replica (all other variables are not
> > > changed) then run the attached ./bench.sh
> > 
> > The bench.sh attachment was missing; please attach it.  Please give the output
> > of this command:
> > 
> >   select name, setting from pg_settings where setting <> boot_val;

(I intentionally show all the results..)
=# select name, setting from pg_settings where setting<> boot_val;
            name            |      setting       
----------------------------+--------------------
 application_name           | psql
 archive_command            | (disabled)
 client_encoding            | UTF8
 data_directory_mode        | 0700
 default_text_search_config | pg_catalog.english
 lc_collate                 | en_US.UTF-8
 lc_ctype                   | en_US.UTF-8
 lc_messages                | en_US.UTF-8
 lc_monetary                | en_US.UTF-8
 lc_numeric                 | en_US.UTF-8
 lc_time                    | en_US.UTF-8
 log_checkpoints            | on
 log_file_mode              | 0600
 log_timezone               | Asia/Tokyo
 max_stack_depth            | 2048
 max_wal_senders            | 0
 max_wal_size               | 10240
 server_encoding            | UTF8
 shared_buffers             | 16384
 TimeZone                   | Asia/Tokyo
 unix_socket_permissions    | 0777
 wal_buffers                | 512
 wal_level                  | minimal
(23 rows)

The result for "replica" setting in the benchmark script are used as
base numbers (or the denominator of the percentages).

> > 3. Restart server with wal_level = replica then run the bench.sh
> > twice.
> 
> I assume this is wal_level=minimal, not wal_level=replica.

Oops! It's wrong I ran once with replica, then twice with minimal.


Anyway, I revised the benchmarking scripts and attached them.  The
parameters written in benchmain.sh were decided as ./bench2.pl 5
<count> <pages> s with wal_level=minimal server takes around 60
seconds.

I'll send the complete data tomorrow (in JST). The attached f.txt is
the result of preliminary test only with pages=100 and 250 (with HDD).

The attached files are:
  benchmain.sh    - main script
  bench2.sh       - run a benchmark with a single set of parameters
  bench1.pl       - behchmark client program
  summarize.pl    - script to summarize benchmain.sh's output
  f.txt.gz        - result only for pages=100, DDL count = 2200 (not 2250)

How to run:

$ /..unpatched_path../initdb -D <unpatched_datadir>
 (wal_level=replica, max_wal_senders=0, log_checkpoints=yes, max_wal_size=10GB)
$ /..patched_path../initdb -D <patched_datadir>
 (wal_level=minimal, max_wal_senders=0, log_checkpoints=yes, max_wal_size=10GB)
$./benchmain.sh > <result_file>   # output raw data
$./summarize.pl [-v] < <result_file>   # show summary


With the attached f.txt, summarize.pl gives the following output.
WAL wins with the that pages.

$ cat f.txt | ./summarize.pl
## params: wal_level=replica mode=none pages=100 count=353 scale=20
(% are relative to "before")
before: tps  262.3 (100.0%), lat    39.840 ms (100.0%) (29 samples)
during: tps  120.7 ( 46.0%), lat   112.508 ms (282.4%) (35 samples)
 after: tps  106.3 ( 40.5%), lat   163.492 ms (410.4%) (86 samples)
DDL time:  34883 ms ( 100.0% relative to mode=none)
## params: wal_level=minimal mode=sync pages=100 count=353 scale=20
(% are relative to "before")
before: tps  226.3 (100.0%), lat    48.091 ms (100.0%) (29 samples)
during: tps   83.0 ( 36.7%), lat   184.942 ms (384.6%) (100 samples)
 after: tps   82.6 ( 36.5%), lat   196.863 ms (409.4%) (21 samples)
DDL time:  99239 ms ( 284.5% relative to mode=none)
## params: wal_level=minimal mode=WAL pages=100 count=353 scale=20
(% are relative to "before")
before: tps  240.3 (100.0%), lat    44.686 ms (100.0%) (29 samples)
during: tps  129.6 ( 53.9%), lat   113.585 ms (254.2%) (31 samples)
 after: tps  124.5 ( 51.8%), lat   141.992 ms (317.8%) (90 samples)
DDL time:  30392 ms (  87.1% relative to mode=none)
## params: wal_level=replica mode=none pages=250 count=258 scale=20
(% are relative to "before")
before: tps  266.3 (100.0%), lat    45.884 ms (100.0%) (29 samples)
during: tps   87.9 ( 33.0%), lat   148.433 ms (323.5%) (54 samples)
 after: tps  105.6 ( 39.6%), lat   153.216 ms (333.9%) (67 samples)
DDL time:  53176 ms ( 100.0% relative to mode=none)
## params: wal_level=minimal mode=sync pages=250 count=258 scale=20
(% are relative to "before")
before: tps  225.1 (100.0%), lat    47.705 ms (100.0%) (29 samples)
during: tps   93.7 ( 41.6%), lat   143.231 ms (300.2%) (83 samples)
 after: tps   93.8 ( 41.7%), lat   186.097 ms (390.1%) (38 samples)
DDL time:  82104 ms ( 154.4% relative to mode=none)
## params: wal_level=minimal mode=WAL pages=250 count=258 scale=20
(% are relative to "before")
before: tps  230.2 (100.0%), lat    48.472 ms (100.0%) (29 samples)
during: tps   90.3 ( 39.2%), lat   183.365 ms (378.3%) (48 samples)
 after: tps  123.9 ( 53.8%), lat   131.129 ms (270.5%) (73 samples)
DDL time:  47660 ms (  89.6% relative to mode=none)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
#! /usr/bin/bash

PGBENCH_SCALE=20
export PATH=/home/horiguti/bin/pgsql_pendsync/bin:$PATH
function run_one_param()
{
    # bench2.sh <ddl_count> <insert_pages> <method (s:sync, w:WAL, n:not-set)> <pgbench scale>
    ./bench2.sh $1 $2 n ${PGBENCH_SCALE} 2>&1
    ./bench2.sh $1 $2 s ${PGBENCH_SCALE} 2>&1
    ./bench2.sh $1 $2 w ${PGBENCH_SCALE} 2>&1
}

killall -9 postgres
sleep 1

#run_one_param <ddl_count> <insert_pages>

# On slow HDD
export PGDATA=/home/horiguti/storage/hdd/data/data_pendsync
run_one_param  353   100
run_one_param  258   250
run_one_param  185   500
run_one_param  118  1000
run_one_param   58  2500
run_one_param   32  5000
run_one_param   18 10000

# On M.2 SSD
export PGDATA=/home/horiguti/storage/ssd/data/data_pendsync
run_one_param 2250   100
run_one_param 1162   250
run_one_param  564   500
run_one_param  297  1000
run_one_param  123  2500
run_one_param   63  5000
run_one_param   32 10000
#! /usr/bin/bash

if [ "$PGDATA" == "" ]; then
    echo "\$PGDATA should be set"
    exit;
fi;

rm -r $PGDATA/*

initdb
if [ "$3" == "n" ]; then
    wal_level=replica
else
    wal_level=minimal
fi    
cat <<EOF >> $PGDATA/postgresql.conf
wal_level=$wal_level
max_wal_senders=0
log_checkpoints=yes
max_wal_size = 10GB
EOF
    
binary=`which postgres`
scale=$4
echo "## params: count=$1 pages=$2 mode=$3 binary=$binary scale=$scale wal_level=$wal_level"
pg_ctl stop -m f
pg_ctl start 2>&1
pgbench -i -s ${scale}
psql -c 'checkpoint;'
((sleep 30; echo "START"; ./bench1.pl 5 $1 $2 $3; echo "END") & pgbench -rP1 --progress-timestamp -T150 -c10 -j10)
2>&1
pg_ctl stop -m i
#! /usr/bin/perl

use strict;
use IPC::Open2;
use Time::HiRes qw (gettimeofday tv_interval);

my $tupperpage = 226;
my $large_size = 100000000;
my @time = ();

sub bench {
    my ($header, $nprocs, $ntups, $threshold, $ddlcount) = @_;
    my @result = ();
    my @rds = ();
    
    for (my $ip = 0 ; $ip < $nprocs ; $ip++)
    {
        pipe(my $rd, my $wr);
        $rds[$ip] = $rd;
        
        my $pid = fork();

        die "fork failed: $!\n" if ($pid < 0);
        if ($pid == 0)
        {
            close($rd);
            
            my $pid = open2(my $psqlrd, my $psqlwr, "psql postgres > /dev/null");
            if ($threshold >= 0) {
                print $psqlwr "SET wal_skip_threshold to $threshold;\n";
            }
            print $psqlwr "DROP TABLE IF EXISTS t$ip;";
            print $psqlwr "CREATE TABLE t$ip (a int);\n";

            my @st = gettimeofday();
            for (my $i = 0 ; $i < $ddlcount ; $i++)
            {
                print $psqlwr "BEGIN;";
                print $psqlwr "TRUNCATE t$ip;";
                print $psqlwr "INSERT INTO t$ip (SELECT a FROM generate_series(1, $ntups) a);";
                print $psqlwr "COMMIT;";
            }
            close($psqlwr);
            waitpid($pid, 0);

            print $wr $ip, " ", 1000 * tv_interval(\@st, [gettimeofday]), "\n";
            exit;
        }
        close($wr);
    }

    my $rpid;
    while (($rpid = wait()) == 0) {}

    my $sum = 0;
    for (my $ip = 0 ; $ip < $nprocs ; $ip++)
    {
        my $ret = readline($rds[$ip]);
        die "format? $ret\n" if ($ret !~ /^([0-9]+) ([0-9.]+)$/);

        $sum += $2;
    }

    printf "$header: procs $nprocs: time %.0f\n", $sum / $nprocs;
}


sub log10 { return log($_[0]) / log(10); }

# benchmark for wal_level = replica, the third parameter of bench
# doesn't affect
sub bench1
{
    my $ddlcount = 5;
    $ddlcount = $ARGV[1] if ($#ARGV > 0);

    print "benchmark for wal_level = replica\n";
    for (my $s = 0 ; $s <= 4 ; $s += 0.25)
    {
        my $ss = int(10 ** $s);
        bench("size $ss", 1, $ss * $tupperpage, 0, $ddlcount);
    }
}

# benchmark for wal_level = minimal.
sub bench2
{
    my $ddlcount = 5;
    $ddlcount = $ARGV[1] if ($#ARGV > 0);

    print "benchmark for wal_level = minimal\n";
    for (my $s = 0 ; $s <= 4.5 ; $s += 0.25)
    {
        my $ss = int(10 ** $s);
        bench("size $ss: SYNC ", 1, $ss * $tupperpage,           0, $ddlcount);
        bench("size $ss: WAL  ", 1, $ss * $tupperpage, $large_size, $ddlcount);
    }
}

# find crossing point of WAL and SYNC by bisecting
sub bench3
{
    my $ddlcount = 5;
    $ddlcount = $ARGV[1] if ($#ARGV > 0);

    print "find crossing point of WAL and SYNC by bisecting\n";
    bench("SYNC: size 0", 1, 1, 8);
    bench("WAL : size 0", 1, 1, 16);
    my $s = 1;
    my $st = 10000;
    while (1)
    {
        my $ts = bench("SYNC: size $s",
                       $tupperpage * $s,           0, $ddlcount);
        my $tw = bench("WAL : size $s",
                       $tupperpage * $s, $large_size, $ddlcount);

        if ($st < 1.0){
            print "DONE\n";
            exit(0);
        }
        if ($ts > $tw)
        {
            $s += $st; $st /= 2;
        }
        else
        {
            $s -= $st; $st /= 2;
        }
    }
}

# benchmark with multiple processes
sub bench4
{
    my $ddlcount = 5;
    my $nprocs = 10;

    $nprocs = $ARGV[1] if ($#ARGV > 0);
    $ddlcount = $ARGV[2] if ($#ARGV > 1);
    
    print "benchmark for wal_level = minimal, $nprocs processes\n";
    print "bench 4: nprocs = $nprocs, DDL count = $ddlcount\n";
    
    for (my $s = 1.0 ; $s <= 3.5 ; $s += 0.25)
    {
        my $ss = int(10 ** $s);
        bench("pages $ss: SYNC ", $nprocs, $ss * $tupperpage,           0, 5);
        bench("pages $ss: WAL  ", $nprocs, $ss * $tupperpage, $large_size, 5);
    }
}

sub bench5
{
    my $ddlcount = 5;
    my $pages = 100;
    my $mode = "s";
    my $threshold = 0;

    $ddlcount = $ARGV[1] if ($#ARGV > 0);
    $pages = $ARGV[2] if ($#ARGV > 1);
    $mode = $ARGV[3] if ($#ARGV > 2);
    if ($mode eq 's') {
        $threshold = 0;
    } elsif ($mode eq 's') {
        $threshold = $large_size;
    } elsif ($mode eq 'n') {
        $threshold = -1;
    } elsif ($mode eq 'w') {
        $threshold = $large_size;
    } else {
        die "mode must be s, w or n\n";
    }
    

    print "bench 5: mode = $mode, DDL count = $ddlcount, pages = $pages\n";
    bench("size $pages", 1, $pages * $tupperpage, $threshold, $ddlcount);
}

bench1() if ($ARGV[0] == 1);
bench2() if ($ARGV[0] == 2);
bench3() if ($ARGV[0] == 3);
bench4() if ($ARGV[0] == 4);
bench5() if ($ARGV[0] == 5);




#! /usr/bin/perl

use strict;
use POSIX qw(floor);

my $state = 0;
my $wal_level = '';
my $pages = 0;
my $binary = '';
my $scale = 0;
my $paramcount = 0;
my $mode = '';
my $sumtps = 0;
my $sumlat = 0;
my $count = 0;
my $trig = 0;
my $title = "(undef)";
my $ddltime = 0;
my @lines = ();
my $trailstr = '';
my $verbose = 1 if ($ARGV[0] eq '-v');
my %tps = ();
my %lat = ();
my %ddltime = ();
my %modestr=("n", "none", "s", "sync", "w", "WAL");

while (<STDIN>) {
    push(@lines, $_);
    chomp;
    next if (/^(END|START)$/);
    next if (/NOTICE:  /);

#    print "$state: $_\n";
    if ($state == 0) {
        if (/^## params: count=([0-9.]+) pages=([0-9.]+) mode=(.) binary=([^ ]+) scale=([0-9.]+) wal_level=([a-z]+)/)
{
            $paramcount = $1;
            $pages = $2;
            $mode = $3;
            $binary = $4;
            $scale = $5;
            $wal_level = $6;
            my $modestr = $modestr{$mode};

            print "## params: wal_level=$wal_level mode=$modestr pages=$pages count=$paramcount scale=$scale\n";
            print "(% are relative to \"before\")\n";
            $state = 1;
            next;
        } else {
            next;
        }
    } elsif ($state == 1) {
        if (/^starting vacuum/) {
            $state = 2;
        }
        next;
    } elsif ($state == 2) {
        if (/^bench.*/) {
            $trig = 1;
            $title = "before";
            $state = 3;
        }
    } elsif ($state == 3) {
        if (/^size ([0-9]+): procs ([0-9]+): time ([0-9]+)$/) {
            $ddltime{$mode} = $3;
            $trig = 1;
            $title = "during";
            $trailstr = '';
            $state = 4;
        }
    } elsif ($state == 4) {
        if (/^transaction type: /) {
            $trig = 1;
            $title = "after";
            $trailstr = "# $_\n";
            $state = 5;
        }
    } elsif ($state == 5) {
        if (!/^statement latencies /) {
            $trailstr .= "# $_\n";
            next;
        }
        printf "DDL time: %6.0f ms (%6.1f%% relative to mode=%s)\n",
            $ddltime{$mode},
            floor(1000.0 * $ddltime{$mode} / $ddltime{n} + 0.5) / 10,
            $modestr{n};
        $trailstr .= "# $_\n";
        $state = 6;
        next;
    } elsif ($state == 6) {
        if (/^ {8}/) {
            $trailstr.= "# $_\n";
            next;
        }
        if ($verbose) {
            print $trailstr;
        }
        $state = 0;
        next;
    }

    if ($trig) {
        die "count 0?\n" if ($count == 0);
        $tps{$title} = $sumtps / $count;
        $lat{$title} = $sumlat / $count;
        printf "%6s: tps %6.1f (%5.1f%%), lat %9.3f ms (%5.1f%%) (%d samples)\n",
            $title,
            $tps{$title}, floor(1000.0 * $tps{$title} / $tps{before} + 0.5)/10,
            $lat{$title}, floor(1000.0 * $lat{$title} / $lat{before} + 0.5)/10,
            $count;
        $sumtps = $sumlat = $count = 0;
        $trig = 0;
        next;
    }

    if (!/^progress: ([0-9.]+) s, ([0-9.]+) tps, lat ([0-9.]+) ms stddev ([0-9.]+|NaN)$/) {
        last;
    }
    $sumtps += $2;
    $sumlat += $3;
    $count++;
}

if ($state != 0) {
    print "Wrong state after EOF: state = $state\n";
    print "=====================================\n";
    foreach (-10 .. -1) {
        printf "%d: %s", ($. + $_), $lines[$. + $_];
    }
    print "=====================================\n";
    exit(1);
}

die "uncounted lines?\n" if ($count > 0);
Вложения

f.txt.gz
В списке pgsql-hackers по дате отправления:
Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: [HACKERS] WAL logging problem in 9.4.3?

Вложения