Eavesdropping on a September 2015 discussion about Chain Replication

Silly me, one day I wrote a tweet that said:

Just wrote 1,400 words about chain replication in 3 email messages to [@**cmeik**](https://twitter.com/cmeik) to make certain he's occupied when he wakes up in the morning.

It got two retweets and 13 favorites and a few replies, including:

[@**slfritchie**](https://twitter.com/slfritchie) Publish it? You do have a larger audience :) [@**cmeik**](https://twitter.com/cmeik)

[@**slfritchie**](https://twitter.com/slfritchie) [@**swvist**](https://twitter.com/swvist) [@**cmeik**](https://twitter.com/cmeik) I'd be interested, and I imagine [@**zeit_geist**](https://twitter.com/zeit_geist) (who has also worked on a chain replication system) would be too!

[@**slfritchie**](https://twitter.com/slfritchie) [@**swvist**](https://twitter.com/swvist) [@**cmeik**](https://twitter.com/cmeik) Yes, you ought, and worry less about the cleaning-up than about the conversation.

Hrm, well.  First I open my big mouth, then some other folks want to eavesdrop on the discussion.  Well, OK, but regardless of my restless fan club’s desire for raw & unedited bits, these bits are a bit edited. Sorry about the delay. Reach toward the stream of consciousness tap, grasp its handle, and open the valve.

Message 1 of 3

Hi, Chris.  Thanks for taking the time to chat this afternoon.  Attached is my whiteboard stuff from Monday JST.  If you have difficulty with the text, I wrote most of it up, starting at the line “TODO fitness_mon, the fitness monitor”, inside of https://raw.githubusercontent.com/basho/machi/slf/chain-manager/remove-inner/TODO-shortterm.org [SLF: Note that the whiteboard’s description has a bug: “spam merged dict to All – [Me, R]” should not exclude R from the spam.]

My original intent was that the DownList (whiteboard) / UnfitList (TODO version) would be a boolean thing.  It could instead be a list of phi values as measured by ObservingServerName.  Props is there because, you know, expansion, I dunno, just in case.

My goal, with some code started yesterday on the “slf/chain-manager/remove-inner” branch, is to use riak_dt for the map & lwwreg types. Every so often, a riak_dt map would be broadcast to everyone else.  (The whiteboard uses the word “spam”.)  The CRDT makes it easy to mash updates together regardless of receipt order.  Each participant can then figure out its own phi value (done separately) and then compare phi values as observed & calculated by everyone else.  Everyone else’s opinion of phi for some remote server could feed into my next local phi calculation, shrug, go crazy?
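For intuition, here is a Python sketch of the LWW-map merge idea — this is *not* the riak_dt API, and the `lww_merge` name and key/value shapes are my invention for illustration:

```python
def lww_merge(map_a, map_b):
    """Merge two observation maps.  Keys are (observer, subject)
    pairs; values are (timestamp, phi) pairs.  The larger
    (timestamp, phi) tuple wins, so the merge result is the same
    regardless of receipt order."""
    merged = dict(map_a)
    for key, value in map_b.items():
        if key not in merged or value > merged[key]:
            merged[key] = value
    return merged
```

Because the merge is order-independent, every participant that has seen the same set of spam messages computes the same map, no matter what order the messages arrived in.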

Machi will be using this information to create a digraph of known pairwise communications problems.  This code has existed for a few weeks in the older “slf/chain-manager/cp-mode4” branch and is brought forward into “slf/chain-manager/remove-inner”, because it worked pretty well in the older branch.  The digraph is created from a list of {X,problem_with,Y} tuples, where X is reporting a problem communicating directly with Y.  We don’t know which direction the actual problem is.  (My partition simulator can do asymmetric partitions!)  But we assume that if X reports a problem with Y, then Y has a problem.

Then we build a digraph of all reports.  Then we start counting the number of inbound edges, and use that count to decide which node(s) are down for the purposes of chain membership.  Oops, code is commented out in https://github.com/basho/machi/blob/21015efcbb837f89e2a2f7c34614993105ce1c97/src/machi_chain_manager1.erl#L2493-L2518 but I will bring it back today.  (“Magic” because at the time I didn’t believe that this simple technique might work.)

  * X -> Y ............... Y=1, so Y is considered down by all.
  * X <-> Y .............. X=Y=1; X<Y, so X is considered down by all.  (Sort order is implicit in calc_magic_down() arg #1 list.)
  * X <-> Y <-> Z ........ Y=2 (>1), so Y is definitely considered down by all.  X=Z=1, but each is on an “end” of the line, so both are considered ok.
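A toy Python version of that counting rule — my own sketch, with a simple “smallest name goes down” tie-break that matches the bullets above but glosses over the details of the real calc_magic_down():

```python
from collections import Counter

def magic_down(members, reports):
    """Greedily mark nodes down.  A report (x, y) is an inbound
    problem edge on y; repeatedly pick the node with the most
    inbound edges (smallest name on ties), mark it down, drop its
    edges, and stop when no problem edges remain."""
    edges = set(reports)
    down = []
    while edges:
        inbound = Counter(y for _, y in edges)
        victim = min(members, key=lambda m: (-inbound[m], m))
        down.append(victim)
        edges = {(x, y) for x, y in edges if victim not in (x, y)}
    return sorted(down)
```

Run against the three bullet cases (lowercased names), it yields ["y"], ["x"], and ["y"] respectively.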

In Machi chain manager style, all chain managers are independent.  But if they get the same up/down/phi/whatever info (via my spam-everyone implementation, to be finished today) or epidemic gossip (Plumtree or other … I will simply spam because a single Machi chain is too small/short to bother optimizing) or whatever, then they can independently make the same decisions about down-ness.

More code to follow.  ^_^  I decided that I liked the word “fitness”, so I created https://github.com/basho/machi/blob/21015efcbb837f89e2a2f7c34614993105ce1c97/src/machi_fitness.erl.  Right now, it only holds the local boolean-down server list.  But the communication/spamming of the CRDT map will be done by the machi_fitness server (one gen_server proc per chain member), and the get_unfit_list() call will do the CRDT merge & digraph “magic” calculation to deterministically decide (limited by the latest gossip/CRDT merge, naturally) who ought to be considered down globally.

[SLF: The basic draft of this new code is finished and merged to the “master” branch, see https://github.com/basho/machi/blob/master/src/machi_fitness.erl.]

Message 2 of 3

Oh, so just to follow up on my “SWIM isn’t the right thing exactly for Machi”.  If there are 3 servers in the chain, and there’s a partition between A&B but only between A&B, then the SWIM-style indirect pings will report to everyone that all three are up.

But in this case, A&B still need to make a decision.  What to do, in general??  I’m still not sure.

In this specific case, A knows (by local experience) that B is unreachable.  SWIM says that B is up.  So A can make a choice, for example, my name sorts smaller than B’s name, so I’ll make myself ineligible.  (B can make the same choice, from its point of view & local experience.)

But in general, what to do?  (/me searches for old paper notes & photos of those notes & whiteboards…..) OK, found one.  But I’ll ASCII’fy it.

Connectivity *problem* reports among servers A&B&C&D&E:
     A <--> C <--> B     A=1, B=1, C=2, D=0, E=0
         D     E

So, D & E report no problems talking to any participant.  A & B & C have varying difficulties with each other.
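The tallies in the diagram can be checked mechanically; here’s a throwaway Python calculation (lowercased names, nothing Machi-specific):

```python
from collections import Counter

# Both directions of each <--> pair in the diagram above:
reports = [("a", "c"), ("c", "a"), ("c", "b"), ("b", "c")]
inbound = Counter({m: 0 for m in "abcde"})
for _, target in reports:
    inbound[target] += 1
# Only c collects more than one inbound problem edge.
```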

If D or E start having problems (in addition to the problems shown above with A&B&C), but if their problem is only with C, then … the digraph method says “no change: C is the only one unfit enough to declare down”.  If D/E have a problem with A or B, then that extra edge causes A or B to fall into down status.  If the D/E problem is only among D & E, then choose D (because of its smaller name) to be added to the considered-down set.

Back to SWIM.  In the diagram above, SWIM considers all servers healthy.  So, CR cannot use SWIM alone to make decisions.

Perhaps SWIM + local connectivity info is enough?  No, I don’t think so, because pure SWIM info + local connectivity info isn’t sufficient to make the digraph.  The digraph, or any other approach, needs more communication of more information.  I suppose it would be possible to use SWIM’s infrastructure to disseminate that extra info (in place of Plumtree or just-spam-everybody), but such dissemination is out of SWIM’s scope (based on my fuzzy recollection, sorry).

There’s another wrinkle in this.  Right now, Machi is only considering comm paths between servers, despite the fact that (at this moment) Machi doesn’t actually use direct server-to-server comm for file replication.  Machi is still using CORFU-style, 100% client-side-implemented CR.

To be most effective, IMHO chain management should consider all comm paths, including client<->server.  For Machi today, that would mean having the chain managers run on each of the clients, ha.  That would be weird, but it would be completely doable.  Instead, Machi will have to go with the regular Client->Head->Middle->Tail->Client asymmetric comms, but that requires more changes to Machi’s plumbing, and I’m too lazy to do that.

However, even when Machi’s plumbing can accommodate asymmetric client comm patterns, perhaps clients should be able to inform The Chain Management Oracle that they have a problem with a chain member.

Problem reports:     Client1 <--> A <--> C <--> B
Digraph incoming edge count:  A=2, B=1, C=2, Client1=1

So, we ought to whittle the chain down to [B] or [C], right?  There’s no sense in a [C,B] chain, because C & B are having comms problems.
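Same scratch tally, with the client included as a first-class node in the graph (again just a throwaway Python check):

```python
from collections import Counter

# Both directions of each <--> pair in the report above:
reports = [("client1", "a"), ("a", "client1"),
           ("a", "c"), ("c", "a"),
           ("c", "b"), ("b", "c")]
inbound = Counter(target for _, target in reports)
# a and c each collect two problem edges; b and client1 only one.
```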

[SLF: Right about now, if your graph theory brain is active, you probably are asking yourself a question.  Why not invert the graph from a “problem with” edges to “works great” edges?  And then find the longest path through the “works great” graph and define the chain according to that longest path?  Yeah, good question.]
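For the curious, that inverted “works great” longest-path idea can be brute-forced, which is perfectly fine at chain sizes.  A hypothetical helper, not anything in Machi:

```python
from itertools import permutations

def longest_ok_path(members, problem_pairs):
    """Invert the problem graph: any pair of members NOT named in
    problem_pairs 'works great'.  Brute-force the longest simple
    path through the works-great graph."""
    bad = {frozenset(p) for p in problem_pairs}
    for n in range(len(members), 0, -1):
        for perm in permutations(sorted(members), n):
            if all(frozenset(pair) not in bad
                   for pair in zip(perm, perm[1:])):
                return list(perm)
    return []
```

For the A <--> C <--> B problem reports, the only works-great pair is A-B, so the longest path (and the proposed chain) is [A,B].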

Another question is, then, when should The Chain Management Oracle consider client reports?  How many clients need to report, and how frequently, before the problem involves enough clients to consider a server down?  After all, it’s clear:

Problem reports:     All Clients <--> A (where chain members are A,B,C)

… it’s clear that there’s little advantage to keeping A at the head or tail of the chain: no client can talk to it.  But, what if 80% of clients can talk to A?  50%?  20%?  …. [SLF: I’ve since pondered this a bit more, see my comment at the end of message 3.]
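One hypothetical way to turn those percentages into a rule — the 50% threshold below is a number I picked purely for illustration, not anything principled, and the function is my invention:

```python
def unfit_for_endpoints(server, client_reports, total_clients,
                        threshold=0.5):
    """Keep `server` out of the head/tail (endpoint) roles if more
    than `threshold` of the known clients report problems reaching
    it.  A report (client, target) counts against `target`."""
    complaints = sum(1 for _, target in client_reports
                     if target == server)
    return complaints / total_clients > threshold
```

The hard part, of course, is everything this sketch assumes away: knowing the total client population, and deciding how stale a complaint can be before it stops counting.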

Whee, this is fun!

Message 3 of 3

Damn, I’m still writing English instead of writing Erlang.  Ha, well, it’s good to think about problems before writing code. Dijkstra, yo!

Right now, your head is much closer to failure detector papers than my memory, almost certainly.  So I trust your recollection better than mine.  My flawed recollection of phi & SWIM papers is that they worry about failure detection of servers.  But for many, many distributed systems, clients are part of “the service”.  In Machi, at the moment, all the CR is done on the client, so yeah, clients are the server, in part.  The clients are inside the perimeter.

Question: is it worth writing more (or first?) about including clients in the detection scheme?  If clients are inside the perimeter but they have no say in failure detection, is that good?  Are there systems that don’t care about the client’s opinion (vs. those that really do (or really ought to) care)?  I think I just argued in the last message that CR systems definitely ought to be caring a lot about client opinions.

What does a client’s opinion look like?  Hrm.  If a client crashes, who cares?  (Ha, there’s probably a taxonomy of yes-I-care/no-I-don’t-care systems out there, lurking in the dark?) But I propose that there is a larger subset of systems that really ought to care about network partitions between “client” and “server”. First example: my previous email about CR & client problem reporting.


[SLF: In the intervening week & more, I think that this problem doesn’t exist in many commonly-deployed systems because of use of proxies & middleware. For example, pervasive use of HAProxy for both load balancing and masking faults in back end services has the side effect of creating multiple paths between a client and a back end server … if you really stretch & abuse the word “path” and squint very hard at it.]