Обсуждение: DBMirror.pl performance change

Поиск
Список
Период
Сортировка

DBMirror.pl performance change

От
Achilleus Mantzios
Дата:
I discovered a problem in DBMirror.pl, performance wise.

pending.c stores data in a way
very similar to the PgSQL input "\" escaped format.

When the field is of type bytea, and the source of data is binary, then
this produces 2 additional backslashes for every unprintable
char.

The performance in function extractData in DBMirror.pl, really suffers
from this condition, since it breaks data in chunks of "\" delimited
strings.

Informally speaking, performance tends to be O(n) where n is the size
of the data.

This can be remedied if we break data in chunks of "'" rather than "\".
"'" happens much more infrequently in common binary files (bz2, tiff, jpg,
pdf etc..), and if we notice that odd number of contained "\", signals an
intermidiate "'", whereas even number of "\" signals the final "'",
then we can make this routine run much faster.

I attach the new extractData function.

Now replicating a 400 k tiff takes 3 seconds instead of 12 minutes
it used to do.

I am wondering about the state of
http://www.whitebeam.org/library/guide/TechNotes/replicate.rhtm

Please feel free for any comments.

Pete could you test this new DBMirror.pl, to see how it behaves
in comparison with your C++ solution?

--
-Achilleus

Вложения

Re: DBMirror.pl performance change

От
Peter Wilson
Дата:
The Whitebeam implementation of DBMirror.pl :
    http://www.whitebeam.org/library/guide/TechNotes/replicate.rhtm

is a complete re-write in 'C' which avoids a lot of the text processing, and what text processing is required is done
usinga state machine rather  
than repeated regular expressions. Before I wrote the 'C' implementation I did look at optimising the Perl version. One
ofmy big concerns was the  
time taking escaping and re-escaping the strings. I can't remember the details now but as far as I can remember a lot
ofthat is unnecessary. There  
seemed to be an unescape of the data then a re-escape to the target database. In practice the data was in the correct
format.

We make quite heavy use of both BYTEA and large varchar fields in our database. I did some load testing at the time and
foundthe new version could  
replicate 10s of file objects per second - where the Perl version took 10 minutes to replicate a 120K BYTEA field (both
ona slowish machine, but the  
Perl version wasn't much better on a fast machine *and* took 97% CPU).

I also took the opportunity to make the 'C' version much more tolerant to lost DB connections without having to restart
andadded a few other tweaks  
to make it more flexible.

It's released under the BSD license now as well

Pete
--
http://www.whitebeam.org
http://www.yellowhawk.co.uk
--------

Achilleus Mantzios wrote:
> I discovered a problem in DBMirror.pl, performance wise.
>
> pending.c stores data in a way
> very similar to the PgSQL input "\" escaped format.
>
> When the field is of type bytea, and the source of data is binary, then
> this produces 2 additional backslashes for every unprintable
> char.
>
> The performance in function extractData in DBMirror.pl, really suffers
> from this condition, since it breaks data in chunks of "\" delimited
> strings.
>
> Informally speaking, performance tends to be O(n) where n is the size
> of the data.
>
> This can be remedied if we break data in chunks of "'" rather than "\".
> "'" happens much more infrequently in common binary files (bz2, tiff, jpg,
> pdf etc..), and if we notice that odd number of contained "\", signals an
> intermidiate "'", whereas even number of "\" signals the final "'",
> then we can make this routine run much faster.
>
> I attach the new extractData function.
>
> Now replicating a 400 k tiff takes 3 seconds instead of 12 minutes
> it used to do.
>
> I am wondering about the state of
> http://www.whitebeam.org/library/guide/TechNotes/replicate.rhtm
>
> Please feel free for any comments.
>
> Pete could you test this new DBMirror.pl, to see how it behaves
> in comparison with your C++ solution?
>
>
>
> ------------------------------------------------------------------------
>
> sub extractData($$) {
>   my $pendingResult = $_[0];
>   my $currentTuple = $_[1];
>   my $fnumber;
>   my %valuesHash;
>   $fnumber = 4;
>   my $dataField = $pendingResult->getvalue($currentTuple,$fnumber);
>   my $numofbs;
>
>   while(length($dataField)>0) {
>     # Extract the field name that is surronded by double quotes
>     $dataField =~ m/(\".*?\")/s;
>     my $fieldName = $1;
>     $dataField = substr $dataField ,length($fieldName);
>     $fieldName =~ s/\"//g; #Remove the surronding " signs.
>
>     if($dataField =~ m/(^= )/s) {
>       #Matched null
>     $dataField = substr $dataField , length($1);
>       $valuesHash{$fieldName}=undef;
>     }
>     elsif ($dataField =~ m/(^=\')/s) {
>       #Has data.
>       my $value;
>       $dataField = substr $dataField ,2; #Skip the ='
>     LOOP: {  #This is to allow us to use last from a do loop.
>          #Recommended in perlsyn manpage.
>       do {
>     my $matchString;
>     my $matchString2;
>     #Find the substring ending with the first ' or first \
>     $dataField =~ m/(.*?[\'])?/s;
>     $matchString = $1;
>
>     $numofbs = ($matchString =~ tr/\\//) % 2;
>
>     if ($numofbs == 1) { #// odd number of \, i.e. intermediate '
>         $matchString2 = substr $matchString,0, length($matchString)-2;
>         $matchString2 =~ s/\\\\/\\/g;
>         $value .= ($matchString2 . "\'");
>         $dataField = substr $dataField,length($matchString);
>     }
>     else { #// even number of \, i.e. found end of data
>         $matchString2 = substr $matchString,0, length($matchString)-1;
>         $matchString2 =~ s/\\\\/\\/g;
>         $value .= $matchString2;
>         $dataField = substr $dataField,length($matchString)+1;
>         last;
>     }
>
>
>       } until(length($dataField)==0);
>   }
>       $valuesHash{$fieldName} = $value;
>
>
>       }#else if
>       else {
>
>         logErrorMessage "Error in PendingData Sequence Id " .
>         $pendingResult->getvalue($currentTuple,0);
>         die;
>       }
>
>
>
>   } #while
>   return %valuesHash;
>
> }
>
>
> ------------------------------------------------------------------------
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>        choose an index scan if your joining column's datatypes do not
>        match

Re: DBMirror.pl performance change

От
Achilleus Mantzios
Дата:
Peter,
It is much more convinient for you to make a test,
(just change the last function in DBmirror.pl), than for me
(grab whitebeam, compile for FreeBSD, etc...)

Of course you would need to use the original .conf format
than the one you are using now.

It would be interesting to see some numbers.

P.S.

Please include my address explicitly, pgsql-general comes
to me in digest mode.
--
-Achilleus


Re: DBMirror.pl performance change

От
Peter Wilson
Дата:
Achilleus Mantzios wrote:
> Peter,
> It is much more convinient for you to make a test,
> (just change the last function in DBmirror.pl), than for me
> (grab whitebeam, compile for FreeBSD, etc...)
>
> Of course you would need to use the original .conf format
> than the one you are using now.
>
> It would be interesting to see some numbers.
>
> P.S.
>
> Please include my address explicitly, pgsql-general comes
> to me in digest mode.

I'll take a look into this when I get a chance. Right now the only replicated systems I have are for live commercial
clients- my development systems  
aren't replicated, just backed-up periodically.

It is worth looking through the Perl version some more though. I'm pretty sure I worked around most of the
escaping/unescapingwhen I looked at the  
'C' version. I'm pretty sure some of the same approach could be used to improve performance of the Perl version. The
mainthing I found was that the  
data table is un-escaped when read from the table and then re-escaped before being sent to the slave database. In
practicethe data doesn't have to be  
touched.

My own preference right now is to stick with the C version now I have it. Replication is just about simultaneous with
negligibleCPU usage. When I get  
a chance, I'm intending decoupling the 'C' version from the whole of Whitebeam so it can be built by itself. At the
timeI needed a solution quickly  
so making use of a few Whitebeam utility classes got me there.

Pete