Abstracting Binlog Servers and MySQL Master Promotion without Reconfiguring all Slaves

In a MySQL replication deployment, the master is a single point of
failure. To recover after the failure of this critical component,
a common solution is to promote a slave to be the new master.
However, when doing so using classic methods,
the slaves need to be reconfigured.
This is a tedious operation in which many things can go
wrong. We found a simpler way to achieve master promotion
using Binlog Servers. Read on for more details.

When a master fails in a MySQL replication deployment, the classic way to
promote a slave to be the new master is the following:

  1. Find the most up-to-date slave.
  2. If the most up-to-date slave is not a good candidate master, level
    a suitable candidate with the most up-to-date slave [1].
  3. Repoint the remaining slaves to the new master.

The procedure above needs to contact all slaves in step #1, and
to reconfigure all slaves in step #3. This becomes increasingly complex in
Booking.com environments where we have very wide,
and still growing,
replication topologies; it is not uncommon to have more than fifty
(and sometimes more than a hundred) slaves replicating from the
same master. Many things can go wrong when tens of slaves need to be
contacted and reconfigured:

  • some slaves might be down for maintenance or for taking a backup,
  • some slaves could be temporarily unreachable for other reasons,
  • and a few slaves could be processing a big backlog of relay logs
    (including delayed slaves), which will make them
    hard/unsuitable to reconfigure.

A way to reduce the complexity of master promotion
is presented below, but to get there, we must first give some context
about Binlog Servers and abstract them into a service.

Reminders about Binlog Servers

In a previous post, I described how to take advantage of
Binlog Servers to perform master promotion without GTIDs and without
log-slave-updates, while still requiring all slaves to be
reconfigured. To do this, the slaves must
replicate through a Binlog Server. This gives us the following
deployment with a single Binlog Server:

+---+
| A |
+---+
  |
 / \
/ X \
-----
  |
  +----------+----------+----------+----------+----------+
  |          |          |          |          |          |
+---+      +---+      +---+      +---+      +---+      +---+
| B |      | C |      | D |      | E |      | F |      | G |
+---+      +---+      +---+      +---+      +---+      +---+

or with redundant Binlog Servers:

+---+
| A |
+---+
  |
  +--------------------------------+
  |                                |
 / \                              / \
/ X \                            / Y \
-----                            -----
  |                                |
  +----------+----------+          +----------+----------+
  |          |          |          |          |          |
+---+      +---+      +---+      +---+      +---+      +---+
| B |      | C |      | D |      | E |      | F |      | G |
+---+      +---+      +---+      +---+      +---+      +---+

or with more than one site with redundant Binlog Servers:

  +---+
  | A |
  +---+
    |
    +-----------+------------------------+
    |           |                        |
   / \         / \                      / \         / \
  / X \       / Y \                    / Z \------>/ W \
  -----       -----                    -----       -----
    |           |                        |           |
  +-+-----------+-+                    +-+-----------+-+
  |               |                    |               |
+---+           +---+                +---+           +---+
| S1|    ...    | Sn|                | T1|    ...    | Tm|
+---+           +---+                +---+           +---+

These diagrams are becoming increasingly complex – let’s simplify them
by abstracting away the Binlog Servers.

Binlog Server Abstraction

By hiding the Binlog Servers in an abstracted layer, which I call
the Distributed Binlog Serving Server (DBSS), a deployment on three
sites becomes the following:

   +---+
   | M |
   +---+
     |
+----+----------------------------------------------------------+
|                                                               |
+----+---------+-----------+---------+-----------+---------+----+
     |         |           |         |           |         |
   +---+     +---+       +---+     +---+       +---+     +---+
   | S1| ... | Sn|       | T1| ... | Tm|       | U1| ... | Uo|
   +---+     +---+       +---+     +---+       +---+     +---+

Of course, the DBSS is built from many Binlog Servers.
One way to build this layer, while minimizing the number of
connections served directly by the master,
is described below. Other ways to build this layer can be
imagined [2], but let’s stick to this one for now.

+----|----------------------------------------------------------+
|    +---------------------+---------------------+              |
|    |                     |                     |              |
|   / \                   / \                   / \             |
|  / X1\----->/ \        / X2\----->/ \        / X3\----->/ \   |
|  -----     / Y1\       -----     / Y2\       -----     / Y3\  |
|    |       -----         |       -----         |       -----  |
+----|---------|-----------|---------|-----------|---------|----+

In the deployment above, using one DNS A record per site
resolving to both Xi and Yi, if a Binlog Server fails,
its slaves will reconnect to the other one. If the Yi
Binlog Server fails, nothing more needs to be done. If the Xi
Binlog Server fails, the corresponding Yi must be
repointed to the master. This repointing is easy, as, by design,
a Binlog Server is identical to its master. Only the host to replicate
from must be changed; the binary log filename and position stay
the same.
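
As an illustration, and assuming Binlog Servers that accept the
CHANGE MASTER TO-style commands described later in this post (the
hostname and coordinates below are made up), repointing Y1 to the
master after X1 fails could look like this:

  -- On Y1, after X1 has failed (hypothetical hostname and coordinates):
  STOP SLAVE;
  CHANGE MASTER TO
    MASTER_HOST = 'master.example.com',  -- was X1; only the host changes
    MASTER_LOG_FILE = 'binlog.000100',   -- same file as before the failure
    MASTER_LOG_POS = 1234;               -- same position as before the failure
  START SLAVE;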

When the Master Fails…

Equipped with the above implementation of the DBSS, in a situation
when the master fails, we end up with the state below;
each site might be at a different position in the
binary log stream of the failed master.

+---------------------------------------------------------------+
|                                                               |
|   / \                   / \                   / \             |
|  / X1\----->/ \        / X2\----->/ \        / X3\----->/ \   |
|  -----     / Y1\       -----     / Y2\       -----     / Y3\  |
|    |       -----         |       -----         |       -----  |
+----|---------|-----------|---------|-----------|---------|----+

The first step of master promotion is to level the Binlog Servers
in the DBSS. To do so, the most up-to-date Binlog Server must be found
and all other Binlog Servers must be chained to it. In the deployment
above, only three servers must be contacted, which is much easier than
tens of slaves. If the most up-to-date Binlog Server is X2,
levelling the Binlog Servers results in the temporary replication
architecture shown below.

+---------------------------------------------------------------+
|                                                               |
|    +<--------------------+-------------------->+              |
|    |                     |                     |              |
|   / \                   / \                   / \             |
|  / X1\----->/ \        / X2\----->/ \        / X3\----->/ \   |
|  -----     / Y1\       -----     / Y2\       -----     / Y3\  |
|    |       -----         |       -----         |       -----  |
+----|---------|-----------|---------|-----------|---------|----+
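
As a rough sketch, and again assuming Binlog Servers that accept
CHANGE MASTER TO (hostnames and coordinates are made up), levelling
amounts to finding the most advanced Binlog Server and chaining the
two others to it:

  -- On X1, X2 and X3: compare positions; say X2 is the most advanced.
  SHOW MASTER STATUS;
  -- On X1 and on X3: replicate from X2, resuming from their own
  -- current binary log coordinates (made-up values below):
  STOP SLAVE;
  CHANGE MASTER TO
    MASTER_HOST = 'x2.example.com',
    MASTER_LOG_FILE = 'binlog.000163',
    MASTER_LOG_POS = 330;
  START SLAVE;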

Levelling should happen very quickly (if it does not,
one of the Binlog Servers is lagging, which should not happen).
After that, the slaves will quickly
follow. Once a slave is up to date, master promotion can be performed
(strictly speaking, this does not need levelling: a slave of X2 or Y2
could have been promoted before levelling).
Shown below, a slave from the third site on the right has been chosen
to be the new master, but any slave on any of the three sites could have
been used.

+------------------------------------------------|--------------+
|    +---------------------+---------------------+              |
|    |                     |                     |              |
|   / \                   / \                   / \             |
|  / X1\----->/ \        / X2\----->/ \        / X3\----->/ \   |
|  -----     / Y1\       -----     / Y2\       -----     / Y3\  |
|    |       -----         |       -----         |       -----  |
+----|---------|-----------|---------|-----------|---------|----+

Note that the other slaves have not been touched: they are still connected
to their Binlog Server. This means that this solution works well even if
one of the slaves is unavailable during master promotion. This solution
also works very well with delayed or lagging slaves, as those slaves are
simply not good candidates for becoming the new master. For some time,
the lagging slaves will process the binary logs of the old master
that are still stored on the Binlog Servers.

The Trick for not Reconfiguring every Slave

Promoting a slave to be the new master in a DBSS deployment
requires working some magic on a slave to make its binary log position
(SHOW MASTER STATUS) match what is expected by
the Binlog Servers. Let’s take an example: if the last binary log
stored on the levelled Binlog Servers is binlog.000163,
we could repoint the Binlog Servers to a new master if the
SHOW MASTER STATUS of this new master is at the beginning
of binary log filename binlog.000164.
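
In other words, before the DBSS is repointed, SHOW MASTER STATUS on the
candidate master should report something like the following (the
position value is only illustrative and depends on the MySQL version;
other columns are omitted):

  SHOW MASTER STATUS;
  -- +---------------+----------+
  -- | File          | Position |
  -- +---------------+----------+
  -- | binlog.000164 |      120 |
  -- +---------------+----------+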

When doing that promotion, from the point of view of the Binlog Servers,
their master is simply restarted with a different server_id and server_uuid.
From the point of view of the
slaves, they are processing the binary logs of the old master
(up to and including binlog.000163) followed by the binary logs
of the new master (starting at binlog.000164).

So, the trick is to have our candidate master at the right binary log
position. This can be made possible by:

  1. configuring all nodes with binary logging enabled,
  2. using an identical log-bin value on all of them (binlog in the
    example above),
  3. and not enabling log-slave-updates.
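
A quick way to check these three settings on a node (a sketch using the
stock MySQL 5.6 variable names):

  SHOW GLOBAL VARIABLES LIKE 'log_bin';            -- ON everywhere
  SHOW GLOBAL VARIABLES LIKE 'log_bin_basename';   -- same basename (binlog) everywhere
  SHOW GLOBAL VARIABLES LIKE 'log_slave_updates';  -- OFF everywhere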

Configuration #3 above allows us to assume that the master will consume
binary log filenames much faster than the slaves. This way, the slaves
will always be behind the master in their binary log filenames [3].
As such, bringing a slave to the right binary log filename is as simple as
doing FLUSH BINARY LOGS in a loop until the slave is in the correct
position. To keep this loop from taking too long, we can
run a cron job on our slaves that makes sure they never fall too far
behind their master (at most ten binary logs away, for example).
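
A minimal sketch of both the loop and the cron check follows; the
filenames and the ten-file threshold are just the example values from
above.

  -- On the candidate master, at promotion time: rotate until
  -- SHOW MASTER STATUS reports binlog.000164 (the file following the
  -- last one stored on the levelled Binlog Servers).
  FLUSH BINARY LOGS;
  SHOW MASTER STATUS;

  -- From cron, on every slave: compare the local file (SHOW MASTER STATUS)
  -- with the master's file (Master_Log_File in SHOW SLAVE STATUS) and
  -- run FLUSH BINARY LOGS when the gap exceeds ten files.
  SHOW SLAVE STATUS;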

Summary of Master Promotion

In the following replication deployment, with log-bin=binlog and
with log-slave-updates disabled:

   +---+
   | M |
   +---+
     |
+----+----------------------------------------------------------+
|                                                               |
+----+---------+-----------+---------+-----------+---------+----+
     |         |           |         |           |         |
   +---+     +---+       +---+     +---+       +---+     +---+
   | S1| ... | Sn|       | T1| ... | Tm|       | U1| ... | Uo|
   +---+     +---+       +---+     +---+       +---+     +---+

If M fails, we first level the Binlog Servers in the DBSS.

Once this is done, and let’s take T1 as our candidate master,
we need to perform the following on it:

  1. FLUSH BINARY LOGS until the binary log filename follows the last
    one from the levelled DBSS,
  2. PURGE BINARY LOGS TO <latest binary log file>,
  3. RESET SLAVE ALL.

Step #2 above drops all binary logs on the
new master that could conflict with the ones from the previous master.
The binary logs of the old master are stored on the DBSS,
and we must avoid having similar, but misleading, data on
the new master.
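
Putting the three steps together on T1, and reusing the filenames from
the example above (T1's own binary logs are assumed to stop at
binlog.000150 before the promotion):

  -- 1. Repeat until SHOW MASTER STATUS reports binlog.000164:
  FLUSH BINARY LOGS;
  SHOW MASTER STATUS;
  -- 2. Drop all of T1's own binary logs up to binlog.000163, which could
  --    be confused with the old master's files stored on the DBSS:
  PURGE BINARY LOGS TO 'binlog.000164';
  -- 3. Forget the failed master:
  RESET SLAVE ALL;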

We now have this:

   +\-/+
   | X |
   +/-\+

+---------------------------------------------------------------+
|                                                               |
+----+---------+---------------------+-----------+---------+----+
     |         |                     |           |         |
   +---+     +---+       +---+     +---+       +---+     +---+
   | S1| ... | Sn|       | T1| ... | Tm|       | U1| ... | Uo|
   +---+     +---+       +---+     +---+       +---+     +---+

where we repoint the DBSS to T1 to get the following:

   +\-/+                 +---+
   | X |                 | T1|
   +/-\+                 +---+
                           |
+--------------------------+------------------------------------+
|                                                               |
+----+---------+-----------+---------+-----------+---------+----+
     |         |           |         |           |         |
   +---+     +---+       +---+     +---+       +---+     +---+
   | S1| ... | Sn|       | T2| ... | Tm|       | U1| ... | Uo|
   +---+     +---+       +---+     +---+       +---+     +---+

and we have achieved master promotion without reconfiguring all slaves.

A Cleaner Way

The trick above works well, but performing FLUSH BINARY LOGS in a loop is not
the cleanest of solutions. It would be much better if there were a way
to set the binary log to the desired filename in a single
operation. With this idea in mind, we created two feature requests,
one for MySQL and one for MariaDB.

MariaDB 10.1.6 already implements a
RESET MASTER TO
syntax. Let’s hope that Oracle will provide something similar in
MySQL 5.7.
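
With such a syntax, the FLUSH BINARY LOGS loop (and the PURGE BINARY
LOGS TO step) collapses into a single statement; a sketch with the
example filenames from above:

  -- MariaDB 10.1.6+: delete the local binary logs and restart the
  -- sequence directly at binlog.000164.
  RESET MASTER TO 164;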

What about the Software?

This idea and procedure are all well and good,
but they are not very useful if you cannot apply them yourself.
The currently available version of the Binlog Server, the
MaxScale Binlog Router plugin,
does not yet implement all the configuration hooks needed to
make this procedure easy. Booking.com is currently working with
MariaDB to implement the missing hooks in a
new version of MaxScale. We are in the last testing phase of
a Binlog Router plugin that supports the following:

  • STOP SLAVE, START SLAVE, SHOW MASTER STATUS, SHOW SLAVE STATUS,
    CHANGE MASTER TO:
    these new commands allow easier configuration of the Binlog Server.
  • The CHANGE MASTER TO command not only makes it easy to chain Binlog
    Servers, but also to bootstrap a Binlog Server without
    editing the configuration file. Moreover, this command allows
    repointing MaxScale to a new master at binary log filename N+1,
    effectively enabling master promotion (a sketch is shown after this list).
  • Transaction safety: when the master fails, the Binlog Server could
    have downloaded a partial transaction. If we replace the master
    with a slave, this transaction should not be sent to slaves. So this
    feature of the next version of MaxScale will make sure such partial
    transactions are not sent downstream.
  • DBSS identity: the initial design of the Binlog Server was intended to
    impersonate the master, and did not consider swapping the master
    at the top of the hierarchy. In a
    DBSS deployment, swapping the master should not be made visible to slaves,
    so the Binlog Servers should present slaves with a server_id and
    server_uuid different from those of the master. The next version of
    the MaxScale Binlog Router supports that virtual master feature.
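
The sketch referenced in the CHANGE MASTER TO bullet above: a
hypothetical session against one of the Binlog Servers, repointing it
to the new master T1 at the binary log file following the last one it
has (the option names below are assumptions; the exact syntax accepted
by the Binlog Router may differ from stock MySQL):

  STOP SLAVE;
  CHANGE MASTER TO
    MASTER_HOST = 't1.example.com',     -- hypothetical address of T1
    MASTER_LOG_FILE = 'binlog.000164',  -- first file of the new master
    MASTER_LOG_POS = 4;                 -- start of that file
  START SLAVE;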

This next version of the MaxScale Binlog Router will be generally available once we are
done with the testing. Stay tuned on the MariaDB web site for the
announcement and the failover procedure. In the meantime, you can still
experiment with master promotion without reconfiguring all slaves by using
the current version of MaxScale and following this
proof of concept
procedure.

If you are interested in this topic and would like to learn more,
I am giving a talk about
Binlog Servers
at Percona Live
Amsterdam. Feel free to grab me after the talk, catch me at the Booking.com
booth (#205) or share a drink with me at the
Community Dinner,
to exchange thoughts on this subject. (You can also post a comment below.)

I will also be giving a talk about
Binlog Servers
at Oracle Open World in
San Francisco at the end of October.

One last thing: if you want to know more about other cool things we
do at Booking.com, I suggest you come to our other talks at
Percona Live
Amsterdam in September.

[1] Slave levelling can be done with
MHA, with
MySQL 5.6
or MariaDB 10.0
GTIDs, or with
Pseudo-GTIDs
when using earlier versions of MySQL and MariaDB.

[2] If we were not concerned about WAN bandwidth, all
Binlog Servers could be directly connected to the master. Another
solution could be to connect all master-local Binlog Servers
directly to the master and to use the chained strategy for remote
Binlog Servers. (This hybrid deployment could be well-suited to a
semi-sync deployment, but I am diverging from the subject of this
post.)

[3] The same can be achieved when using
log-slave-updates, by using a smaller
max_binlog_size
on the master than on all the slaves.
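
For instance (the values below are only illustrative, and the setting
should also go in the configuration files to survive restarts):

  -- On the master: rotate binary logs much more often...
  SET GLOBAL max_binlog_size = 134217728;   -- 128M
  -- ...than on the slaves:
  SET GLOBAL max_binlog_size = 1073741824;  -- 1G, the maximum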