Thread: BDR problem

BDR problem

From
Charles Lynch
Date:
For about a month now, we've been preparing to use a BDR cluster in a production, multi-region setup on AWS. Our initial testing produced some absolutely fantastic results, with replication delays of less than 150ms between Singapore, Ireland, and North Virginia, and this is with SSL encryption.

We have just recently run into a problem. I created a test cluster within NV only, and after about a week of working without any problems we got an error: Unexpected EOF on SSL connection. I had seen something like this before, but on initial cluster join, and chalked it up to me doing something wrong. This time it happened after a week of working without issue. I wasn't sure what to do next. Restarting the database started producing errors like this:

LOG:  starting background worker process "bdr (6188205071755053119,1,16385,)->bdr (6188203625564571611,1,"
FATAL:  mismatch in worker state, got 3, expected 1
LOG:  starting background worker process "bdr (6188205071755053119,1,16385,)->bdr (6188203625564571611,1,"
FATAL:  mismatch in worker state, got 3, expected 1
FATAL:  mismatch in worker state, got 3, expected 1
LOG:  starting background worker process "bdr (6188205071755053119,1,16385,)->bdr (6188203625564571611,1,"
LOG:  worker process: bdr (6188205071755053119,1,16385,)->bdr (6188203625564571611,1, (PID 20300) exited with exit code 1

This would repeat. So I removed this node from the cluster using the proper bdr commands and tried re-joining, but that just changed the state reported in the error from 3 to 0, and the same errors kept repeating. I have BDR completely automated and orchestrated using Chef, so I simply fired up a new cluster and started over.
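
For anyone who wants to check their own logs for the same restart loop, here is a minimal sketch that counts the FATAL lines and lists the distinct got/expected pairs. The function names and the log path argument are made up for illustration; only the message text comes from the log excerpt above.

```shell
#!/bin/sh
# Count "mismatch in worker state" FATAL lines in a PostgreSQL log file.
# $1 is the path to the log (hypothetical; use your own log location).
count_worker_state_mismatches() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # there are no matches, so "|| true" keeps that from looking like an error.
    grep -c 'FATAL: *mismatch in worker state' "$1" || true
}

# List each distinct got/expected pair seen in the log, one per line.
distinct_worker_states() {
    sed -n 's/.*mismatch in worker state, got \([0-9][0-9]*\), expected \([0-9][0-9]*\).*/got=\1 expected=\2/p' "$1" | sort -u
}
```

Calling `count_worker_state_mismatches /path/to/postgresql.log` before and after a restart shows whether the count keeps growing, which is what a worker stuck in this loop would look like.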

My problem is I don't know what caused this and, more importantly, I'm not sure how to fix it / prevent it and I can't launch this into production without figuring this out.

One other thing: I've seen a lot of conflicting information on how to set up BDR on Ubuntu (using PPAs, which package to install, and where to get the source). I'm now wondering whether I'm running an older version and whether this issue has already been fixed in a newer one. Here are my build steps; if anyone has any comments on how to set up BDR better, please let me know.

I grab postgres 9.4.4 from here:
and compile it with "./configure --prefix=/opt/psql --with-openssl && make -j4 -s install"

then I compile and install the btree_gist module

then I get the BDR plugin from here:
and compile it with "./configure && make -j4 -s all && make install"

then init the db and set everything with config, ssl certs, and cluster creation and joining.
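
Put together, the build steps above might look roughly like the sketch below. The source directories (`PG_SRC`, `BDR_SRC`) and the `run` and `build_bdr_stack` helpers are hypothetical names; the download locations are not filled in here. The `contrib/btree_gist` path and the `PATH` prefix for the plugin's configure step are assumptions about how the trees are laid out, so adjust as needed. Paths containing spaces are not handled.

```shell
#!/bin/sh
# Sketch of the build steps described above. PG_SRC and BDR_SRC are
# hypothetical paths to already-unpacked source trees.
PREFIX="${PREFIX:-/opt/psql}"
PG_SRC="${PG_SRC:-./postgres-src}"
BDR_SRC="${BDR_SRC:-./bdr-plugin-src}"

# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

build_bdr_stack() {
    # 1. PostgreSQL with OpenSSL support, installed under $PREFIX
    run sh -c "cd $PG_SRC && ./configure --prefix=$PREFIX --with-openssl && make -j4 -s install" || return 1
    # 2. btree_gist, assumed to live in the contrib directory of the same tree
    run make -C "$PG_SRC/contrib/btree_gist" -s install || return 1
    # 3. BDR plugin; assumption: its configure finds pg_config via PATH
    run sh -c "cd $BDR_SRC && PATH=$PREFIX/bin:\$PATH ./configure && make -j4 -s all && make install" || return 1
}
```

Setting `DRY_RUN=1` before calling `build_bdr_stack` prints the three commands without running anything, which is handy for checking the paths first.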

Any help on this would be really appreciated.

Thanks guys

Charles

Re: BDR problem

From
Giovanni Maruzzelli
Date:

On Fri, Sep 11, 2015 at 11:21 PM, Charles Lynch <charleslynchpostgresql@gmail.com> wrote:
[original message quoted in full]


--
Sincerely,

Giovanni Maruzzelli
Cell : +39-347-2665618

Re: BDR problem

From
Craig Ringer
Date:
On 12 September 2015 at 05:21, Charles Lynch
<charleslynchpostgresql@gmail.com> wrote:

> We have, just recently, ran into a problem. I created a test cluster only
> within NV and after about a week of working without any problems, we got an
> error: Unexpected EOF on SSL connection. I had seen something like this
> before but on initial cluster join and chalked it up to me doing something
> wrong.

That's generally network level, though it could also occur if a worker
exits unexpectedly.

> This was after a week of working without issue. I wasn't sure what to
> do next. restarting the database started producing errors like this:
>
> LOG:  starting background worker process "bdr
> (6188205071755053119,1,16385,)->bdr (6188203625564571611,1,"
> FATAL:  mismatch in worker state, got 3, expected 1

That's ... very odd. It's violating a sanity check that shouldn't
really ever be triggered.

How exactly did you restart the database? Can you send more info on
your configuration via direct mail to me?

> This would repeat. So I removed this node from the cluster using the proper
> bdr commands and tried re-joining

You can't just re-join a removed node. Once it's removed, it's removed
forever. You have to drop the database (or re-initdb), create a new
blank database, and join it as a new node.

The reason for this is that when you remove the node the replication
slots on other nodes get dropped, so there's no record of what catchup
work needs to be done. It's not really possible to resync the node
with the rest after that. That's the point of node removal, to free
the resources from those slots when a node is retired, otherwise you'd
just switch it off.
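
As a rough illustration of that procedure (drop, recreate blank, join as a new node), the sketch below only prints the SQL so it can be reviewed before being fed to psql. The helper name, database name, node name and DSNs are all hypothetical. The `bdr.bdr_group_join` and `bdr.bdr_node_join_wait_for_ready` calls are written as I understand the BDR 0.9 interface; treat them as an assumption and check them against the documentation for your version.

```shell
#!/bin/sh
# Print the SQL for bringing a removed node back as a brand-new node.
# Arguments (all hypothetical): database name, new node name,
# DSN of this node, DSN of an existing cluster member.
rejoin_as_new_node_sql() {
    db=$1
    node=$2
    local_dsn=$3
    peer_dsn=$4
    cat <<EOF
-- Run on the removed node while connected to a maintenance database.
-- DROP DATABASE fails while other connections to it are still open.
DROP DATABASE IF EXISTS $db;
CREATE DATABASE $db;
\connect $db
CREATE EXTENSION btree_gist;
CREATE EXTENSION bdr;
-- Assumed BDR 0.9 join call; verify the signature for your version.
SELECT bdr.bdr_group_join(
    local_node_name   := '$node',
    node_external_dsn := '$local_dsn',
    join_using_dsn    := '$peer_dsn'
);
SELECT bdr.bdr_node_join_wait_for_ready();
EOF
}
```

Something like `rejoin_as_new_node_sql appdb node2b 'host=node2 dbname=appdb' 'host=node1 dbname=appdb' | psql -d postgres` would then run it. The point is that the node starts from an empty database and gets a fresh join, rather than trying to resume from slots that no longer exist.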

> My problem is I don't know what caused this and, more importantly, I'm not
> sure how to fix it / prevent it and I can't launch this into production
> without figuring this out.

The "mismatch in worker state" error is very likely a bug. The
trick will be figuring out how you triggered it.

Did you retain the malfunctioning cluster, or have you deleted it?

> One other thing: I've seen a lot of conflicting information on how to setup
> BDR on ubuntu (using ppas, what pkg to install, and where to get source) I'm
> curious now if I don't have a younger version and that this issue is all but
> fixed now. Here are my build steps if anyone has any comments on how to
> setup bdr better, please let me know.

You should use the apt repository referenced at:
http://bdr-project.org/docs/stable/installation-packages.html#INSTALLATION-PACKAGES-DEBIAN

Support is focused mainly on RHEL/CentOS/Fedora, but Debian/Ubuntu
packages are also produced. We're a little behind at the moment and
haven't got 0.9.2 packages out. I'll be pushing 0.9.3 soon and will
produce 0.9.3 packages for Debian/Ubuntu as well as for
Fedora/RHEL/CentOS.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: BDR problem

From
Martín Marqués
Date:
El 14/09/15 a las 06:37, Craig Ringer escribió:
>
> Support is focused mainly on RHEL/CentOS/Fedora, but Debian/Ubuntu
> packages are also produced. We're a little behind at the moment and
> haven't got 0.9.2 packages out. I'll be pushing 0.9.3 soon and will
> produce 0.9.3 packages for Debian/Ubuntu as well as for
> Fedora/RHEL/CentOS.

We (well, actually mostly you ;)) have pushed 0.9.2 bdr packages in rpm
and deb format.

$ rpm -qa | grep bdr94-bdr
postgresql-bdr94-bdr-debuginfo-0.9.2-1_2ndQuadrant.el7.centos.x86_64
postgresql-bdr94-bdr-0.9.2-1_2ndQuadrant.el7.centos.x86_64

Regards,

--
Martín Marqués                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: BDR problem

From
Florin Andrei
Date:
On 2015-09-14 13:32, Martín Marqués wrote:
>
> We (well, actually mostly you ;)) have pushed 0.9.2 bdr packages in rpm
> and deb format.
>
> $ rpm -qa | grep bdr94-bdr
> postgresql-bdr94-bdr-debuginfo-0.9.2-1_2ndQuadrant.el7.centos.x86_64
> postgresql-bdr94-bdr-0.9.2-1_2ndQuadrant.el7.centos.x86_64

Yup, I'm using .deb packages from
http://packages.2ndquadrant.com/bdr/apt/ on Ubuntu 14.04:

# dpkg -l | grep postgresql-bdr | awk '{print $2"\t"$3}'
postgresql-bdr-9.4    9.4.4-1trusty
postgresql-bdr-9.4-bdr-plugin    0.9.2-1trusty
postgresql-bdr-client-9.4    9.4.4-1trusty
postgresql-bdr-contrib-9.4    9.4.4-1trusty
postgresql-bdr-server-dev-9.4    9.4.4-1trusty

It's very useful to have these packages available, helps a lot with
testing, kudos to everyone involved.

--
Florin Andrei
http://florin.myip.org/