
Experiment: Step by Step



Presentation Transcript


  1. Experiment: Step by Step
     Author: Anna Bekkerman, abekkerm@ecs.umass.edu

  2. Setup
     [Diagram; labels: Client, Server, Node, LMM, Target system, Control signals, Data]

  3. Configuration File
     • Describes an experiment
       • Nodes
         • IP addresses, types (SOCC node/radar node), etc.
         • Commands to start/stop involved processes
         • Collected metrics (CPU/memory utilization, etc.)
         • Monitored processes
       • Net control parameters
         • Delays, drop rates
       • Refresh rates
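The per-node part of such a configuration could be modelled roughly as follows. This is only a sketch: the slides do not show the real file format, so `NodeConfig`, all of its field names, and the validation rules are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class NodeConfig:
    """Hypothetical per-node settings; field names are not from RAPIDS."""
    ip: str
    node_type: str                                   # "SOCC" or "radar"
    start_commands: list = field(default_factory=list)
    stop_commands: list = field(default_factory=list)
    metrics: list = field(default_factory=list)      # e.g. "cpu", "memory"
    monitored_processes: list = field(default_factory=list)
    delay_ms: int = 0                                # net control: added delay
    drop_rate: float = 0.0                           # net control: drop fraction
    refresh_s: float = 1.0                           # collection refresh rate

    def validate(self):
        """Return a list of human-readable problems (empty if none)."""
        problems = []
        if self.node_type not in ("SOCC", "radar"):
            problems.append("unknown node type: " + self.node_type)
        if not 0.0 <= self.drop_rate <= 1.0:
            problems.append("drop_rate must be within [0, 1]")
        if self.refresh_s <= 0:
            problems.append("refresh_s must be positive")
        return problems
```

Checking the settings before any LMM is started would let a bad experiment description fail early rather than halfway through the set-up phase.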

  4. Start LMMs
     • When started, RAPIDS server:
       • Grabs two ports:
         • 49162 - to communicate with LMMs
         • 8888 - to communicate with RAPIDS clients
       • Reads a configuration file
       • Starts LMMs on all nodes through SSH connections
       • Waits for ack signals from all LMMs
       • Starts setting LMMs up according to the configuration file
     FIXME: Server will wait indefinitely for the acks from all LMMs. A time-out mechanism should be introduced.
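One way the time-out requested in the FIXME could look is sketched below. It assumes that incoming acks are placed on an in-process queue as node names; `wait_for_acks`, the queue transport, and the node names in the usage note are hypothetical and not part of RAPIDS.

```python
import queue
import time


def wait_for_acks(ack_queue, expected_nodes, timeout_s):
    """Collect acks until every node has answered or the deadline passes.

    Returns the set of nodes that never acknowledged (empty on success).
    """
    pending = set(expected_nodes)
    deadline = time.monotonic() + timeout_s
    while pending:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                      # overall deadline reached
        try:
            node = ack_queue.get(timeout=remaining)
        except queue.Empty:
            break                      # nothing arrived before the deadline
        pending.discard(node)
    return pending
```

A caller might invoke `wait_for_acks(acks, ["radar-1", "socc-1"], 10.0)` and abort the experiment, naming the silent nodes, if the returned set is not empty.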

  5. Set LMMs Up
     • Home-made protocol is used to set up LMM parameters
     • Examples of commands sent from the server to LMMs:
       • STM: set metric
       • STP: set monitored process
       • STE: set start-up command
       • STT: start
       • SPP: stop
     • When a parameter is set, LMM sends an ack signal back to the server
     • At the end of each step, server waits for acks from all LMMs
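The LMM side of this exchange might be sketched as below. Only the three-letter command codes come from the slide; the text wire format (code, space, argument), the `ACK`/`ERR` replies, and the `LmmState` class are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class LmmState:
    """Hypothetical container for the parameters an LMM is given."""
    metrics: list = field(default_factory=list)
    processes: list = field(default_factory=list)
    startup_commands: list = field(default_factory=list)
    running: bool = False


def handle_command(state, line):
    """Apply one server command to the LMM state and return the reply."""
    code, _, arg = line.strip().partition(" ")
    if code == "STM":        # set metric
        state.metrics.append(arg)
    elif code == "STP":      # set monitored process
        state.processes.append(arg)
    elif code == "STE":      # set start-up command
        state.startup_commands.append(arg)
    elif code == "STT":      # start
        state.running = True
    elif code == "SPP":      # stop
        state.running = False
    else:
        return "ERR unknown command"
    return "ACK"             # assumed reply text; the real ack format is not shown
```

Replying to every command, including unknown ones, keeps the server from waiting on an ack that will never come.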

  6. Start Monitoring
     • When LMM receives the start command:
       • If needed, network control application is started
         • Network control application runs only if iptables are turned on.
         • iptables select IP packets (as specified in iptables rules) and queue them for processing by the application.
         • The application introduces delays and/or drops packets according to the settings in the configuration file.
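The per-packet decision made by the network control application could be sketched like this. It is a simplified model only: it does not talk to iptables or any packet queue, and `decide`, its parameters, and the verdict tuples are hypothetical names.

```python
import random


def decide(drop_rate, delay_ms, rng=random):
    """Return ("drop", 0) or ("accept", delay_ms) for one queued packet.

    drop_rate is the fraction of packets to discard (0.0 to 1.0);
    delay_ms is the delay applied to every packet that is let through.
    """
    if rng.random() < drop_rate:
        return ("drop", 0)
    return ("accept", delay_ms)
```

Passing a seeded `random.Random` instance as `rng` makes a lossy-link experiment repeatable.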

  7. Start Monitoring
     • When LMM receives the start command:
       • If needed, network control application is started
       • RAPIDS Message Queues (RMQ) are initialized
         • A mechanism used for communication between RAPIDS and monitored applications.
         • See more in the “RMQ” section.

  8. Start Monitoring
     • When LMM receives the start command:
       • If needed, network control application is started
       • RAPIDS Message Queues (RMQ) are initialized
       • Heartbeat applications are started
         • Send “I’m alive” signals from radar nodes to SOCC nodes.
         • If a signal has not been received, RAPIDS reports link failure.
         • FIXME: Timeout mechanism should be added to minimize false alarms.
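The time-out suggested in the FIXME could take the following shape: a link is reported as failed only after several consecutive heartbeats are missed. This is a sketch; `HeartbeatMonitor`, the `allowed_misses` threshold, and the node names are assumptions, not RAPIDS code.

```python
class HeartbeatMonitor:
    """Tracks the last "I'm alive" time per radar node (hypothetical helper)."""

    def __init__(self, interval_s, allowed_misses=3):
        # Tolerate a few late or lost heartbeats before raising an alarm.
        self.timeout_s = interval_s * allowed_misses
        self.last_seen = {}

    def beat(self, node, now):
        """Record a heartbeat from node at time now (seconds)."""
        self.last_seen[node] = now

    def failed_links(self, now):
        """Return the nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

With a one-second heartbeat and `allowed_misses=3`, a single dropped packet no longer triggers a link-failure report; only more than three seconds of silence does.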

  9. Start Monitoring
     • When LMM receives the start command:
       • If needed, network control application is started
       • RAPIDS Message Queues (RMQ) are initialized
       • Heartbeat applications are started
       • Processes are started
         • Commands are specified by user in the configuration file

  10. Start Monitoring
      • When LMM receives the start command:
        • If needed, network control application is started
        • RAPIDS Message Queues (RMQ) are initialized
        • Heartbeat applications are started
        • Processes are started
          • Commands are specified by user in the configuration file
        • “Collection sessions” are started every t seconds
          • According to the refresh rates provided by user in the configuration file

  11. Collection Session
      • During each collection session LMM:
        • Collects metrics
        • Reads events accumulated in RMQ
        • Sends the metrics and events to the RAPIDS server
      • More details in the “LMM” section
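A collection session and the loop that launches one every t seconds could be sketched as follows. The in-process `queue.Queue` stands in for RMQ, and `collection_session`, `run_sessions`, and the callback parameters are hypothetical names chosen for illustration.

```python
import queue
import threading


def collection_session(read_metrics, events, send):
    """Run one session: sample metrics, drain queued events, ship both."""
    drained = []
    while True:
        try:
            drained.append(events.get_nowait())
        except queue.Empty:
            break                      # queue is empty, nothing more to read
    send({"metrics": read_metrics(), "events": drained})


def run_sessions(session, refresh_s, stop):
    """Call session() every refresh_s seconds until the stop event is set."""
    while not stop.is_set():
        session()
        # Event.wait doubles as an interruptible sleep: setting the event
        # ends the pause early, so a stop command takes effect promptly.
        stop.wait(refresh_s)
```

Using a `threading.Event` for the pause, rather than a plain sleep, means no further session is launched once the stop command arrives.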

  12. Stop Monitoring
      • When the server is stopped, it sends stop commands to all LMMs
      • Upon receiving the stop signal, LMM:
        • Stops launching collection sessions
        • Stops processes
          • Using the commands specified by user in the configuration file
        • Stops heartbeat applications
        • Deletes RMQ
        • Stops network control applications

  13. What Might Go Wrong?
      • When the server is stopped, it sends stop commands to all LMMs
      • Upon receiving the stop signal, LMM:
        • Stops launching collection sessions
        • Stops processes
          • Using the commands specified by user in the configuration file
        • Heartbeat applications are stopped
        • RMQ is deleted
        • Network control applications are stopped
      If “untrappable” signals (SIGKILL and SIGSTOP) are used to kill the server, the shut-down procedures will not be executed!
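The difference between trappable and untrappable signals can be illustrated with a small sketch. It assumes a Python process and a hypothetical `shutdown` callback; how the actual RAPIDS server installs its handlers is not shown in the slides.

```python
import signal


def install_shutdown_handlers(shutdown, signals=(signal.SIGINT, signal.SIGTERM)):
    """Run shutdown() when one of the given signals arrives.

    Returns the list of signals that were hooked. SIGKILL and SIGSTOP are
    deliberately absent: the operating system does not allow a process to
    catch them, so no clean-up code can run when they are used.
    """
    def handler(signum, frame):
        shutdown()

    hooked = []
    for sig in signals:
        signal.signal(sig, handler)    # must be called from the main thread
        hooked.append(sig)
    return hooked
```

Stopping the server with a plain `kill <pid>` (SIGTERM) or Ctrl-C (SIGINT) therefore leaves room for the stop commands to be sent, whereas `kill -9` does not.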

  14. What Might Go Wrong?
      • If commands provided by user do not stop all processes, LMM will hang waiting for their termination.
      • While an LMM is hanging, the port used for communication with the server remains unreleased, which means that the new experiment cannot be started until LMMs are stopped and all necessary clean-up procedures have been completed.
      • When the server is stopped, it sends stop commands to all LMMs
      • Upon receiving the stop signal, LMM:
        • Stops launching collection sessions
        • Stops processes
          • Using the commands specified by user in the configuration file
        • Heartbeat applications are stopped
        • RMQ is deleted
        • Network control applications are stopped
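One possible guard against this hang is to bound the wait and escalate. The sketch below swaps in a signal-based stop for a process launched with `subprocess.Popen`; it is not the mechanism described in the slides, where the user's own stop commands are used, and `stop_with_deadline` and `grace_s` are hypothetical names.

```python
import subprocess
import sys


def stop_with_deadline(proc, grace_s):
    """Ask proc to exit, then force-kill it if it outlives the grace period."""
    proc.terminate()                   # polite request (SIGTERM on POSIX)
    try:
        proc.wait(timeout=grace_s)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()                    # forced stop (SIGKILL on POSIX)
        proc.wait()                    # collect the exit status of the child
        return "killed"
```

For example, `stop_with_deadline(subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"]), 5.0)` returns within the grace period instead of blocking the LMM, so its port can be released.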

  15. What Might Go Wrong?
      • When the server is stopped, it sends stop commands to all LMMs
      • Upon receiving the stop signal, LMM:
        • Stops launching collection sessions
        • Stops processes
          • Using the commands specified by user in the configuration file
        • Heartbeat applications are stopped
        • RMQ is deleted
        • Network control applications are stopped
          • FIXME:
            • These applications do not always react to the termination signal properly.
            • Symptom: sometimes a number of zombie processes appear
