A ‘Simple’ Protocol for Manual MySQL Slave Promotion to Master

We’ve been working on the design of a protocol which would enable promotion of a slave to a master in a MySQL replication cluster.

Right now, if a MySQL master fails, most people just deal with a temporary outage. They bring the box back up, run REPAIR TABLEs if necessary, and generally take a few hours of downtime.

Google, Flickr, and Friendster have protocols in place for handling master failure but for the most part these are undocumented.

One solution would be to use a system like DRBD to keep a synchronous copy of the data on a backup DB. This would work, of course, but it requires more hardware and a custom kernel.

You could also run a second master with multi-master replication, but that requires more hardware as well and complicates matters, since multi-master replication has a few technical issues of its own.

A simpler approach is to just take a slave and promote it to master. If this were possible you’d be able to start writing to the new master almost immediately after the old master fails. You’d lose a few transactions, but if you have any critical code that depends on data insertion you can have it assert that the write reached at least one slave before moving forward.
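As a rough illustration, that “reached at least one slave” assertion could look something like the following. This is only a sketch in Python using the pymysql driver; the hostnames, credentials, and database name are made up.

```python
# Hedged sketch: block a critical write until at least one slave has applied it.
import pymysql

MASTER = dict(host="db-master", user="app", password="secret", database="app")
SLAVES = [
    dict(host="db-slave1", user="app", password="secret", database="app"),
    dict(host="db-slave2", user="app", password="secret", database="app"),
]

def write_and_confirm(sql, args, timeout=5):
    master = pymysql.connect(**MASTER)
    try:
        with master.cursor() as cur:
            cur.execute(sql, args)
            master.commit()
            # Binary log coordinates of the transaction we just committed.
            cur.execute("SHOW MASTER STATUS")
            binlog_file, binlog_pos = cur.fetchone()[:2]
    finally:
        master.close()

    # MASTER_POS_WAIT() blocks on a slave until its SQL thread has executed
    # up to the given position; it returns -1 on timeout and NULL if
    # replication is not running.
    for conf in SLAVES:
        slave = pymysql.connect(**conf)
        try:
            with slave.cursor() as cur:
                cur.execute("SELECT MASTER_POS_WAIT(%s, %s, %s)",
                            (binlog_file, int(binlog_pos), timeout))
                (result,) = cur.fetchone()
            if result is not None and result >= 0:
                return True  # the write is on at least one slave
        finally:
            slave.close()
    raise RuntimeError("write did not reach any slave within %ds" % timeout)
```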

The biggest problem is trying to figure out where the current slaves should start reading in the binary log on the new master.

This is actually a really difficult problem with an amazingly trivial solution.

Just make sure the master and all the slaves start from, and write to, the same positions in the binary log. This way the file names and byte positions are always identical across the cluster, and a slave can start reading from any other server as long as that server’s replication position is at or beyond its own.
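Under that shared-binlog scheme (every box running log-bin with the same base name, and slaves running log-slave-updates so their binlogs mirror the master’s), re-pointing a slave at a newly promoted master is mechanical. A hedged sketch, with hostnames and credentials made up:

```python
# Sketch only: point one slave at the newly promoted master. The coordinates
# the slave already holds for the dead master are valid on the new master,
# because the binlog file names and byte positions are identical everywhere.
import pymysql

def repoint_slave(slave_conf, new_master, repl_user, repl_password):
    conn = pymysql.connect(**slave_conf)
    try:
        with conn.cursor(pymysql.cursors.DictCursor) as cur:
            cur.execute("STOP SLAVE")
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone()
            # Last position this slave executed in the old master's binlog.
            log_file = status["Relay_Master_Log_File"]
            log_pos = int(status["Exec_Master_Log_Pos"])
            cur.execute(
                "CHANGE MASTER TO MASTER_HOST=%s, MASTER_USER=%s, "
                "MASTER_PASSWORD=%s, MASTER_LOG_FILE=%s, MASTER_LOG_POS=%s",
                (new_master, repl_user, repl_password, log_file, log_pos))
            cur.execute("START SLAVE")
    finally:
        conn.close()
```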

Now you can have a new master up and online within a few minutes and worry about repairing the old master later. When it’s ready to go back online you can just add it as another slave (assuming the hardware is similar).

Clients would need a few modifications. All clients would need to block when INSERTs start failing and test whether a new master has been selected. This can be done with a DNS lookup or by fetching a key from something like memcached. Once a new master is selected, the code would just connect to the new master and start writing there.
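A minimal sketch of that client behaviour, using memcached as the lookup. The key name “mysql.master”, the cache host, and the credentials are all assumptions; a DNS lookup would slot in the same way.

```python
# Hypothetical sketch of client-side failover: when a write fails, block until
# a new master has been published, then reconnect and retry the write.
import time
import memcache
import pymysql

mc = memcache.Client(["cache1:11211"])

def wait_for_master():
    # Block until the operator (or promotion script) publishes a master.
    while True:
        host = mc.get("mysql.master")
        if host:
            return host
        time.sleep(1)

def connect_to_master():
    return pymysql.connect(host=wait_for_master(), user="app",
                           password="secret", database="app", autocommit=True)

def insert_with_failover(sql, args):
    # A real client would hold the connection open; reconnecting per write
    # keeps the sketch short.
    while True:
        try:
            conn = connect_to_master()
            with conn.cursor() as cur:
                cur.execute(sql, args)
            conn.close()
            return
        except pymysql.MySQLError:
            # Master is gone or we are mid-failover: pause, then look up the
            # (possibly new) master and retry the write against it.
            time.sleep(1)
```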

For the promotion to work, the clients would need a way to select the master without ending up in a transient state where 50% of the cluster thinks the master is master1 and the other 50% thinks the master is master2.

This can be accomplished by the following protocol.

1. Shoot the current master in the head and/or shut down MySQL. If the box is dead but the port is still up, this means killing power to the box. You can also SSH in and shut down mysqld if necessary. Note that if MySQL is set up to start automatically when the box is rebooted, that will need to be disabled (at least on masters).

2. Set up all the slaves to read from the new master and make sure replication works.

3. Update the DNS entry for the master hostname to point to a new IP.

4. Verify that all clients have reconnected to the cluster and are functioning correctly. This can be done by setting up a PING table and making sure values written on the new master show up on all the slaves.
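A rough sketch of that step-4 check. The ping table, hostnames, and credentials are illustrative only.

```python
# Hedged sketch: write a heartbeat row on the new master and verify it shows
# up on every slave. Assumes a table like:
#   CREATE TABLE ping (id INT PRIMARY KEY, ts BIGINT)
import time
import pymysql

def verify_replication(master_conf, slave_confs, wait=10):
    token = int(time.time())
    master = pymysql.connect(**master_conf)
    with master.cursor() as cur:
        cur.execute("REPLACE INTO ping (id, ts) VALUES (1, %s)", (token,))
    master.commit()
    master.close()

    time.sleep(wait)  # give the slaves a moment to catch up

    for conf in slave_confs:
        slave = pymysql.connect(**conf)
        with slave.cursor() as cur:
            cur.execute("SELECT ts FROM ping WHERE id = 1")
            row = cur.fetchone()
        slave.close()
        if row is None or int(row[0]) != token:
            raise RuntimeError("%s has not replicated the ping row" % conf["host"])
    return True
```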

At this point all the clients should reconnect to the new master and continue. Again, if you need to make sure you don’t lose ANY data, you have to verify that each write made it to at least one slave before continuing.

There’s more work to be done here. I’d like to work on a consensus protocol (probably Paxos) within the lbpool framework to have the slaves automatically detect when a master has failed, promote one of the existing slaves, and start using the new master (all without any human intervention). This is obviously a hard task, which is why I’m trying to do this in two stages.


  1. For info on electing a new master between a set of slaves/replicas:
    http://www.oracle.com/technology/documentation/berkeley-db/db/ref/rep/elect.html
    (The source code is open source, etc.)

    For how to ‘find’ the new master, you really want something like Neptune:
    http://www.cs.ucsb.edu/projects/neptune/

    I don’t think there is a ‘good’ open source set of tools for high performance service location, which is why I proposed the Apache dislocate project:
    http://mail-archives.apache.org/mod_mbox/labs-labs/200612.mbox/%3c4583132B.7090304@apache.org%3e

    Sadly, I haven’t had much time to hack on it.

  2. Btw, I forgot to harp on this in my first comment, but I strive to remove the use of DNS from production systems and configurations. It often ends up as another possible point of failure in an already complicated system, and it’s not that often that you really need to move IP addresses.

  3. Kevin, thanks for writing this up. It’s an interesting approach. I’m currently trying to learn about how to promote a new master too. I have some thoughts/questions, and I’m trying to understand the weaknesses of this approach.

    If the master goes down, not all the slaves may be at the same point. Say M is the master and it dies at entry 100. S1 has read up to 100, but S2 and S3 read up to 99. S1 gets elected to be the master. This is safe, as S2 and S3 can connect to it and execute from 100 on.

    But if S2 gets promoted to master and a client program connects and issues a new write, S1 and S3 are going to get different results. S1 will try to read entry 101, which is likely at a different byte position, so it will probably die horribly. If not, there will be silent differences between the slaves; S1 will not have applied entry 100, and S3 will have.

    Therefore the new master should be chosen based on its binlog position. I think it is safe to just say “choose one from among those with the maximum position.” Are there any edge cases I’m missing?
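    One way to apply that rule in code, as a hedged sketch (hostnames are hypothetical, and it assumes binlog file names sort lexically, as mysql-bin.000001, mysql-bin.000002, … do):

    ```python
    # Sketch only: pick the slave whose SQL thread has executed the furthest
    # into the dead master's binary log.
    import pymysql

    def most_up_to_date(slave_confs):
        best, best_pos = None, None
        for conf in slave_confs:
            conn = pymysql.connect(**conf)
            with conn.cursor(pymysql.cursors.DictCursor) as cur:
                cur.execute("SHOW SLAVE STATUS")
                status = cur.fetchone()
            conn.close()
            pos = (status["Relay_Master_Log_File"],
                   int(status["Exec_Master_Log_Pos"]))
            if best_pos is None or pos > best_pos:
                best, best_pos = conf, pos
        return best
    ```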

    If the DNS name of the master is a second name pointing to the machine — something like “master” — the slaves will reconnect after something starts responding to it again, so there’ll be no need to change anything on the slaves. To Paul’s point, DNS can be a single point of failure, but if you provide redundancy for it, that can be mitigated. And DNS is designed for the kind of thing you might have to do manually otherwise, like moving IP addresses.

    True, it doesn’t have to happen that often, but if you need it… which you do for this scheme, I think, unless you want to stop and restart the slave processes. I’m most interested in avoiding that, as I’m trying to figure out how to get the machines to fix themselves.

  4. Sean Chighizola

    Xaprb,

    I’m in the same boat… at my last job we only had two database servers (master and slave) so choosing the most up-to-date slave wasn’t an issue. As for automating this process, a shell script was used.

    However I’m in a new job where there are many slaves.

    As you mentioned, which slave is most up-to-date? One solution would be to use the following scheme: Master -> Slave/Master -> X # of Slaves. Now you’ve isolated the most up-to-date slave, because there is only one. The downside to this scheme is the added latency on reads from the Slaves, because every change has to pass through both the Master and the Slave/Master before it reaches them.

    Another idea is to create a complex set of scripts that run on each slave server. At the time of a failure, each slave could determine its relay log position and compare it with the other slaves’. Then the most up-to-date slave would be promoted. I haven’t tried this, and it definitely introduces complexity.

  5. Thanks for the comments guys.

    “But if S2 gets promoted to master and a client program connects and issues a new write, S1 and S3 are going to get different results. S1 will try to read entry 101, which is likely at a different byte position, so it will probably die horribly. If not, there will be silent differences between the slaves; S1 will not have applied entry 100, and S3 will have.”

    Yeah… the solution is not to promote S2. There are two ways to resolve this. One is manual, in which case you would just promote S1.

    In the case of automatic promotion, the clients would use a consensus protocol to pick a master-promotion client. That client would then be given a few minutes to promote a new master and report status.

    This way if the master promotion client fails or dumps core during this process the whole cluster doesn’t remain offline.

    I think you’re right in saying that the new master can just be the one with the most up-to-date position.

    There are no edge cases that I can think of, other than another slave being at the same position, but that would just be a tie and wouldn’t cause a failure.

  6. Paul.

    Regarding DNS. This is just one method that could be selected. The client could be written to use any method to determine the new master URL.

    Off the top of my head, LDAP, DNS, and memcached (with enough memory to prevent LRU eviction) could be used to provide this functionality.

    You could even have the clients exchange the new master URL among themselves, but then newly started clients would still try to connect to the old master.

    Kevin

  7. Sean…..

    Your protocol for using two masters could just be done with multi-master replication.

    You have two masters that replicate into each other. We actually developed a protocol to handle single-master promotion in the event of a failure, but it yielded some edge conditions that could have led to database inconsistency.

    At the end of the day we just decided that promoting a slave to master was the easiest way to go forward.

  8. Sean Chighizola

    Kevin,

    I think I need to add clarification: my proposal wasn’t a Master/Master replication ring. Rather, the Master has a slave, and that slave is in turn the master of another slave (A -> B -> C), where B is guaranteed to be the most up-to-date slave. This makes promotion easy to automate, but as I mentioned earlier it doesn’t scale well for reads.





