
Self Healing Wide Area Network Services

Presentation Transcript


  1. Self Healing Wide Area Network Services Bhavjit S Walha Ganesh Venkatesh

  2. Layout • Introduction • Previous Work • Issues • Solution • Preliminary results • Problems & Future Extensions • Conclusion

  3. Motivation • Companies may have servers distributed over a wide area network • Akamai Content Distribution Network. • Distributed web-servers • Manual monitoring may not be feasible • Centralized control – may lead to problems in case of a network partition • Typical server applications • May crash due to software bugs • Little state is retained • Simple restart is thus sufficient

  4. Motivation … • What if peers monitored each other's health? • In case a crash is detected - try to restart. • No central monitoring station involved. • Loosely based on a worm • Resilient to sporadic failures • Spreads to uninfected nodes • But • No backdoor involved • May not always shift to new nodes

  5. Introduction • Previous Work • Issues • Solution • Preliminary results • Problems & Future Extensions • Conclusion

  6. Medusa • All nodes are part of a multicast group • Each node is thus in touch with all other nodes through Heartbeat messages. • Nodes send regular updates to the multicast tree • All communication through reliable multicast • In case a node goes down • Other nodes try to restart it • Request for service sent to multicast group

  7. Medusa Problems • Scalability • Assumptions of reliable packet delivery • State information shared with all nodes. • Reliable Multicast • Assumes reliable delivery of packets to all nodes • No explicit ACKs • The kill operations fail in case of a temporary break in the multicast tree. • Security • No way of authenticating packets

  8. Introduction • Previous Work • Issues • Solution • Preliminary results • Problems & Future Extensions • Conclusion

  9. Proposed solution • Nodes form peering relationships with only a subset of other nodes. • Exchange Hello packets • Scalable as the degree is fixed • No central control • No dependence on reliable multicast • Distributed communication protocol • Explicit ACKs for packets • Some super-nodes required to be up when booted • Power of randomly-connected graphs

  10. Design • Each node continually sends Hello Packets to its peer nodes. • Indicates everything is up and working • A timeout indicates something is wrong • Application crash • Network Partition • Aim at application crashes • Application should be stateless • No code transfer • Remotely restartable • SSH needed – A login account and distributed keys.
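The Hello/timeout mechanism above can be illustrated with a minimal sketch. Only the idea (periodic Hello packets, a timeout that triggers a restart attempt) comes from the slides; the UDP message format, the peer bookkeeping, and the callback are assumptions, and the intervals are the values quoted later on the performance slide.

```python
# Hypothetical sketch of the Hello/timeout loop; message format and
# bookkeeping are assumptions, not the authors' implementation.
import time

HELLO_INTERVAL = 5   # seconds between Hello packets (value from slide 17)
HELLO_TIMEOUT = 22   # silence after which a peer is presumed down

def hello_loop(sock, peers, on_peer_timeout):
    """peers maps (ip, port) -> time a Hello was last heard from that peer."""
    sock.setblocking(False)
    now = time.time()
    for addr in peers:                       # treat join time as the first Hello
        peers[addr] = now
    last_sent = 0.0
    while True:
        now = time.time()
        if now - last_sent >= HELLO_INTERVAL:
            for addr in peers:
                sock.sendto(b"HELLO", addr)  # "everything is up and working"
            last_sent = now
        try:
            data, addr = sock.recvfrom(1500)
            if data == b"HELLO" and addr in peers:
                peers[addr] = now            # refresh liveness
        except BlockingIOError:
            pass
        for addr, last_heard in list(peers.items()):
            if now - last_heard > HELLO_TIMEOUT:
                on_peer_timeout(addr)        # crash or partition suspected
                peers[addr] = now            # back off before flagging again
        time.sleep(0.2)
```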

  11. Initialization • 3-5 super-nodes form a fully-connected graph. • Are expected to be up all the time • All nodes have information about their IPs • May be under manual supervision • May have information about the topology • Responsible for forwarding join requests to other nodes

  12. Remote start • SSH to a remote node to restart • Remote (re)start attempted after Hello timeout. • Current implementation requires keys to be distributed beforehand • Starts a small watchdog program which immediately returns • Checks if there is another copy already running • Current implementation uses ps • In case the application start fails, do nothing – wait for the next retry to restart • Possible extension: allow the service to spread
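The remote (re)start step could look roughly like the sketch below. The account name, service command, and process pattern are placeholders; pgrep stands in for the "uses ps" check the slide mentions, and SSH keys are assumed to have been distributed beforehand as described.

```python
# Hypothetical sketch of the SSH-based remote (re)start; user, service
# command and pattern are placeholders, not the authors' actual setup.
import subprocess

def restart_remote(host, user="svc", service="./service", pattern="service"):
    # The small watchdog command checks whether another copy is already
    # running and starts the service only if it is not, returning immediately.
    watchdog = f"pgrep -f {pattern} > /dev/null || nohup {service} > /dev/null 2>&1 &"
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", f"{user}@{host}", watchdog],
        capture_output=True,
        timeout=30,
    )
    # If the start fails we do nothing here; the next Hello timeout triggers a retry.
    return result.returncode == 0
```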

  13. New node comes up… • Waits for others to contact it • After timeout: • Send JoinRequest to a super-node with the number of peers needed. • Super-node forwards this request to other nodes • AddRequest • Some node may ask the new node to become its peer • Add to neighbourList and send AddACK • Hello • Can add to neighbourList if unsolicited Hello received • Beneficial in case of short temporary failures • After Request-timeout: • Contact another super-node with another JoinRequest. • Timeout can be dynamically specified in JoinRequestACK.
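A rough sketch of the new-node side of this exchange follows. Only the packet names (JoinRequest, AddRequest, AddACK, Hello) come from the slide; the single-line wire format and the send/recv helpers are made up for illustration.

```python
# Hypothetical sketch of the new-node side of the join protocol.  The wire
# format and the send/recv helpers (recv returns (None, None) on timeout)
# are assumptions; only the packet names come from the slides.
import random
import time

def join(super_nodes, want_peers, send, recv, request_timeout=20):
    neighbours = set()
    remaining = list(super_nodes)
    while len(neighbours) < want_peers and remaining:
        sn = remaining.pop(random.randrange(len(remaining)))
        send(sn, f"JOINREQ {want_peers - len(neighbours)}")
        deadline = time.time() + request_timeout
        while time.time() < deadline and len(neighbours) < want_peers:
            msg, addr = recv(timeout=1)
            if msg is None:
                continue
            if msg == "ADDREQ":                  # some node offers to peer with us
                neighbours.add(addr)
                send(addr, "ADDACK")
            elif msg == "HELLO" and addr not in neighbours:
                neighbours.add(addr)             # unsolicited Hello after a short outage
            elif msg.startswith("JOINACK"):
                parts = msg.split()
                if len(parts) > 1:               # timeout dynamically specified in the ACK
                    deadline = time.time() + int(parts[1])
        # On request timeout: fall through and contact another super-node.
    return neighbours
```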

  14. New node comes up… Random Walk • Request forwarded by super-node to 3 random nodes on behalf of the new node • Each node forwards it to others • Decrease hop count by 1 each time • If hop count = 0, check if it can support more nodes • YES! • Send AddRequest to the new node • Add to neighbourList on receiving AddACK. • NO! • Ignore the request • New node may already have found neighbours • Due to a duplicate JoinRequest or repair of a network partition • New node thus replies to AddRequest with Die packet.
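The forwarding side of the random walk might look like this sketch. Fan-out, hop-count handling, and packet names follow the slide; the node object (neighbours, max_degree) and the send helper are assumptions.

```python
# Hypothetical sketch of random-walk forwarding of a JoinRequest.
import random

FANOUT = 3  # super-node forwards to 3 random nodes on behalf of the new node

def handle_join_request(node, new_node_addr, hops, send):
    if hops > 0:
        peers = list(node.neighbours)
        for peer in random.sample(peers, min(FANOUT, len(peers))):
            send(peer, f"JOINREQ_FWD {new_node_addr} {hops - 1}")  # spend one hop
        return
    # hops == 0: decide locally whether we can support one more peer.
    if len(node.neighbours) < node.max_degree:
        send(new_node_addr, "ADDREQ")  # add to neighbourList once AddACK arrives
    # Otherwise ignore the request.  The new node may already have enough
    # neighbours (duplicate JoinRequest or healed partition) and will then
    # answer our AddRequest with a Die packet.
```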

  15. Shutdown • Critical to ensure that all nodes go down • 3-way protocol • Send kill to target node • Target node replies with die • Send dieACK to target node. • kill • used when multiple copies detected • Possibly to balance load • die • Reply to unsolicited Hello • No perfect solution in case of a network partition
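The kill/die/dieACK handshake, seen from the node requesting the shutdown, could be sketched as below; the send/recv helpers and the retry policy are assumptions, only the packet names come from the slide.

```python
# Hypothetical sketch of the 3-way shutdown handshake (kill -> die -> dieACK).
def kill_peer(target, send, recv, timeout=5):
    """Ask `target` to shut down, e.g. when a duplicate copy is detected."""
    send(target, "KILL")
    msg, addr = recv(timeout=timeout)
    if msg == "DIE" and addr == target:
        send(target, "DIEACK")   # target only exits after this final ACK
        return True
    return False                 # no reply: possibly a partition; retry later
```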

  16. Global Shutdown… • Secret killAll packet • Sent by an external program for complete system shutdown • Forwarded to all neighbours • Node does not die until it receives a killACK from everyone • Stops sending hellos immediately • No further restart attempts • Reply only to die, kill and killAll • May send unnecessary traffic • Eventually time out on seeing zero neighbours.
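A sketch of how a node might handle killAll under the rules above: the packet names and the ordering (stop Hellos and restarts, exit only after a killACK from every neighbour) come from the slide, while the node object, the authentication of the secret packet, and the helpers are assumptions.

```python
# Hypothetical sketch of killAll handling on a single node.
def handle_kill_all(node, sender, send):
    node.hellos_enabled = False        # stop sending Hellos immediately
    node.restarts_enabled = False      # no further restart attempts
    pending = set(node.neighbours) - {sender}
    for peer in pending:
        send(peer, "KILLALL")          # forward to all neighbours
    return pending                     # caller waits for a KILLACK from each

def handle_kill_ack(node, pending, sender):
    pending.discard(sender)
    if not pending:                    # everyone acknowledged (or, elsewhere,
        node.shutdown()                # a global timeout fires on zero neighbours)
```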

  17. Performance • Tested on 6 nodes in GradLab • Hello interval: 5s • Hello timeout: 22s • Wait before joinRequest: 10s • joinRequest timeout: 20s • Hop count: 2 • Initial degree request: 3 • Super-nodes: 3 • Preliminary tests on PlanetLab
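The same parameters, gathered into a hypothetical configuration block: the key names are made up, the values are the ones quoted on the slide.

```python
# Test parameters from the slide, as a hypothetical config dict.
CONFIG = {
    "hello_interval_s": 5,
    "hello_timeout_s": 22,
    "join_wait_s": 10,            # wait before sending a JoinRequest
    "join_request_timeout_s": 20,
    "random_walk_hops": 2,
    "initial_degree": 3,          # peers requested on join
    "super_nodes": 3,
}
```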

  18. Results • LAN • No timeouts or packet losses observed • No duplicate copies • killAll works perfectly • Re-start latency: 22s • Decreases after a number of restarts • Join latency: 15s • PlanetLab • Re-start latency: 27s • Join latency: 21s

  19. Introduction • Previous Work • Issues • Solution • Preliminary results • Problems and Future Extensions • Conclusion

  20. Limitations • Security • The packets are not authenticated • Stray copies • After a killAll there may be stray copies • Harmless as they do not try to spread • But: prevents another copy from running • No new nodes • Node discovery • Why should they be idle in the first place? • What to do when the original nodes come back up? • Solution • Send regular updates to super-nodes • Extra servers can be killed easily

  21. Parameter tweaking • Hop count for Random Walk • Connectivity • Min-degree to ensure connectivity • Max-degree to spread the failure probability • Timeouts • Request timeout • Depends on hop-count • Hello timeout • Different for WAN & LAN • Global timeout • In case of network partition • Loss of killACK packets

  22. Conclusion • Maintaining High Availability does not always require central control • Achieving a global shutdown is problematic • Need to explore connectivity requirements to ensure a connected graph at all times.

  23. Thank You !
