System Ranger is a rewrite of Patrol, a monitoring tool that originated in 1994. Patrol was originally written only to renice processes and was later extended to monitor filesystems and daemons and to perform notifications and repair actions. The rewrite moves to Perl5 and introduces extensible data collectors, a new rules language that can express complex rules, and a config section for local customizations. Patrol has been downloaded by over 300 sites and is in production use at about 20 known sites. Ranger is offered as a small alternative to complex, expensive commercial monitoring packages.
Patrol/Ranger Update
Chuck Boeheim
Assistant Director, SLAC Computer Services
History
• Patrol originated in 1994
• Originally only to renice processes
• Extended to monitor filesystems, daemons, and to perform more notifications/repairs
• Downloaded by over 300 sites, in production use in about 20 known sites
Limitations
• Original rules language was simple and columnar:
    PC afs[0-9]* 50 log,mail(unix-admin)
• Difficult to extend to express complexities
• E.g., renice processes using more than 20% of the CPU if the load average is over 3
• Written in Perl4, limited by not having complex data structures
The Rewrite
• Update to Perl5
• Introduce new rules language
• Introduce extensible data collectors
• Rename to System Ranger
Rules file structure
• Config section supplies local customizations
• Ruleset sections define data collectors and the set of rules to be applied to them
• Message section defines message texts
Config section
• Supplies the common customizations made at other sites
    config {
        optsfile(/etc/tailor.opts)
        path(/usr/ucb:/bin:/usr/bin)
        mailfrom('The System Ranger <root>')
        mailreply('Unix Admins <unix-admin>')
    }
Rulesets
• Rulesets name a set of rules and associate them with a data collector
    Ruleset(anyname) collector(process) {
        list of rules...
    }
• Builtin data collectors are: System, Process, Daemon, User, Filesystem, File, Service
• Custom collectors are planned
Rules
• A rule is a set of function calls in braces
    Rule { cpu(gt,50) kill() log() }
• Functions return SUCCESS or FAILURE
• FAILURE causes the remainder of the rule not to be executed; execution passes to the next rule
• A rule that succeeds ends processing of the ruleset unless the CONTINUE function appears in it
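The control flow described on this slide can be sketched in Python. This is a hypothetical model of the semantics only, not Ranger's Perl5 implementation: the names `run_rule`, `run_ruleset`, `cpu_gt`, and `act` are invented for illustration, SUCCESS and FAILURE are modelled as `True` and `False`, and the second rule in the example is made up.

```python
from typing import Callable, Dict, List

# Hypothetical model, not Ranger's internals: each rule function takes the
# record produced by a data collector and returns True (SUCCESS) or
# False (FAILURE).
Func = Callable[[Dict], bool]


def CONTINUE(record: Dict) -> bool:
    """Stand-in for the CONTINUE function: always succeeds."""
    return True


def run_rule(rule: List[Func], record: Dict) -> bool:
    """Call functions left to right; the first FAILURE skips the rest."""
    for func in rule:
        if not func(record):
            return False
    return True


def run_ruleset(rules: List[List[Func]], record: Dict) -> int:
    """Apply rules in order and return how many rules were tried."""
    tried = 0
    for rule in rules:
        tried += 1
        if run_rule(rule, record) and CONTINUE not in rule:
            break  # a successful rule without CONTINUE ends the ruleset
    return tried


# Illustrative helpers: a threshold test and an action that records its name.
actions: List[str] = []


def cpu_gt(limit: float) -> Func:
    return lambda record: record["cpu"] > limit


def act(name: str) -> Func:
    def run(record: Dict) -> bool:
        actions.append(name)
        return True
    return run


rules = [
    [cpu_gt(50), act("kill"), act("log")],  # Rule { cpu(gt,50) kill() log() }
    [cpu_gt(25), act("log")],               # made-up second rule
]
```

With this model, a record at 80% CPU is handled by the first rule and the ruleset stops there, while a record at 30% CPU fails the first rule's threshold and falls through to the second.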
Rules
• The word OR may connect functions
    Rule { cpu(gt,50) or size(gt,20M) kill() }
• A sequence of functions in braces returns SUCCESS or FAILURE for the entire sequence
    Rule { {cpu(gt,50) kill()} or cpu(gt,25) log }
• A sequence of functions in brackets always returns SUCCESS
    Rule { cpu(gt,50) [size(gt,10M) kill] log }
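One way to picture OR, braces, and brackets is as three small combinators. The Python sketch below is an illustration under assumptions, not Ranger's parser: `seq`, `either`, `always`, `gt`, and `act` are hypothetical names, and it assumes that OR joins only the items immediately on either side of it, which the slides do not state.

```python
from typing import Callable, Dict, List

Func = Callable[[Dict], bool]  # True models SUCCESS, False models FAILURE


def seq(*funcs: Func) -> Func:
    """Braces: SUCCESS only if every function succeeds, left to right."""
    return lambda record: all(f(record) for f in funcs)


def either(*alternatives: Func) -> Func:
    """OR: try alternatives in order and stop at the first SUCCESS."""
    return lambda record: any(f(record) for f in alternatives)


def always(*funcs: Func) -> Func:
    """Brackets: run the sequence, then report SUCCESS regardless."""
    def run(record: Dict) -> bool:
        seq(*funcs)(record)
        return True
    return run


# Illustrative helpers (hypothetical): a threshold test and a recording action.
actions: List[str] = []


def gt(field: str, limit: float) -> Func:
    return lambda record: record[field] > limit


def act(name: str) -> Func:
    def run(record: Dict) -> bool:
        actions.append(name)
        return True
    return run


# Rule { cpu(gt,50) [size(gt,10M) kill] log }   (size given in megabytes here)
bracket_rule = seq(gt("cpu", 50), always(gt("size", 10), act("kill")), act("log"))

# Rule { {cpu(gt,50) kill()} or cpu(gt,25) log }
# Assumed grouping: the OR joins the braced sequence and cpu(gt,25).
tiered_rule = seq(either(seq(gt("cpu", 50), act("kill")), gt("cpu", 25)), act("log"))
```

In the bracket rule, a process at 60% CPU and 5 MB is logged but not killed, because the bracketed sequence fails internally yet still reports SUCCESS to the enclosing rule.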
Selection Functions
• Apply to specific machines:
    • host
    • option
    • arch
    • test
• Apply to specific instances:
    • user
    • group
    • name
• All tests may be negative or positive, e.g., host(icarus) or user(!root)
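Positive and negative selection can be sketched as a single matching helper. This Python example is an assumption-laden illustration: the function name `selects` is invented, and it assumes that arguments are regular expressions matched against the whole value and that a leading `!` negates the test. The slides show patterns such as `www.*` and `afs[0-9]+` but do not say how they are anchored.

```python
import re


def selects(pattern: str, value: str) -> bool:
    """Return True if value passes the selection test.

    Assumptions (not confirmed by the slides): the pattern is a regular
    expression that must match the whole value, and a leading "!" inverts
    the result.
    """
    negate = pattern.startswith("!")
    if negate:
        pattern = pattern[1:]
    matched = re.fullmatch(pattern, value) is not None
    return matched != negate
```

Under these assumptions `selects("!root", "root")` fails and `selects("!root", "alice")` succeeds, which is the behaviour the `user(!root)` example suggests.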
Comparison Functions
• Determine when thresholds are crossed
    • cpu - percent of CPU
    • size - memory or file size or rate of change
    • time - total CPU time
• Or test global values
    • loadavg, numusers, numprocs, uptime
• Have an optional first argument specifying the comparison: gt, lt, eq, etc.
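A comparison function with an optional comparison argument might look like the Python sketch below. Everything here is hypothetical: the names `compare` and `parse_threshold`, the choice of `gt` as the default when the comparison is omitted, and the unit meanings (`M` as 1024 × 1024 bytes, `h` as 3600 seconds) are assumptions. The slides show only `gt`, `lt`, `eq`, and `ne` and the suffixes `M` and `h`.

```python
import operator

# Comparison names that appear in the slides.
COMPARATORS = {
    "gt": operator.gt,
    "lt": operator.lt,
    "eq": operator.eq,
    "ne": operator.ne,
}

# Assumed unit meanings; the slides use the suffixes but do not define them.
UNITS = {"M": 1024 * 1024, "h": 3600}


def parse_threshold(text: str) -> float:
    """Convert a threshold such as "50", "20M", or "6h" to a plain number."""
    if text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)


def compare(observed: float, *args: str, default: str = "gt") -> bool:
    """Model cpu(gt,50) as compare(cpu, "gt", "50"); the comparison is optional."""
    if len(args) == 2:
        op, threshold = args
    else:
        op, threshold = default, args[0]  # assumed default comparison
    return COMPARATORS[op](observed, parse_threshold(threshold))
```

For example, `compare(7 * 3600, "gt", "6h")` models `time(gt,6h)` for a process that has used seven hours of CPU time.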
Action Functions
• Specify some action to perform
    • log
    • mail
    • page
    • kill, signal (by pid or name)
    • nice
Sample Process Rules
    Rule {
        host(www.*) pct(gt,10) or size(gt,20M)
        mail(PROC_REPORT,www-monitor) mcons(info) log
    }
    Rule {
        {time(gt,6h) kill mail(OVERLIM, $user)} or
        {time(gt,4h) mail(WARN2, $user)} or
        {time(gt,2h) mail(WARN1, $user)}
    }
    Message OVERLIM <<EOF
    The CPU limit for $host is 6 hours.
    Your process $pid $cmd has been terminated
    for exceeding the limit.
    EOF
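The `$host`, `$pid`, and `$cmd` placeholders in the message text suggest per-record variable substitution. The Python sketch below illustrates that idea with the standard library's `string.Template`, which happens to use the same `$name` syntax. It is not how Ranger itself expands messages, and the `render` helper and the sample record are made up.

```python
from string import Template

# Message text taken from the OVERLIM example; the record values are made up.
OVERLIM = Template(
    "The CPU limit for $host is 6 hours. "
    "Your process $pid $cmd has been terminated for exceeding the limit."
)


def render(message: Template, record: dict) -> str:
    """Fill in placeholders from the record; unknown names are left as-is."""
    return message.safe_substitute(record)


text = render(OVERLIM, {"host": "icarus", "pid": 1234, "cmd": "sim"})
```

Using `safe_substitute` rather than `substitute` is a design choice for this sketch: a message that references a variable the collector did not supply is still sent, with the placeholder left visible.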
Sample Filesystem Rules
    Rule { name(/u[0-9]) pct(gt,99,90+1) page(admin) }
    Rule {
        host(afs[0-9]+) name(/vicep.*)
        { host(afs07) name(/vicepg) } or
        { host(afs08) name(/vicepf) } or
        { pct(gt,98) mail(FSFULL, admin) }
    }
    Message FSFULL <<EOF
    File system $name is $pct% full, grew by $delta%.
    EOF
Sample File Rules
    Rule { name(/var/adm*) size(gt,1M) page(admin) }
    Rule { name(/etc/passwd) md5() mail(PSWDCHG, admin) }
    Message PSWDCHG <<EOF
    File $name has been changed!
    EOF
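The `md5()` rule appears to detect changes to a file's contents. A minimal Python sketch of checksum-based change detection follows. It assumes that the function succeeds when the current MD5 digest differs from the one recorded on the previous check, which the slides imply but do not spell out. The name `md5_changed` and the in-memory `known` dictionary are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict


def md5_changed(path: str, known: Dict[str, str]) -> bool:
    """Return True if the file's MD5 digest differs from the last one seen.

    The first check of a file only records its digest and reports no change.
    """
    digest = hashlib.md5(Path(path).read_bytes()).hexdigest()
    previous = known.get(path)
    known[path] = digest  # remember the digest for the next check
    return previous is not None and previous != digest
```

A real monitor would need to keep the recorded digests somewhere that survives restarts; the dictionary here only stands in for that store.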
Sample Daemon Rules
    Rule { name(nfsd) number(ne,8) page(admin) }
    Rule { name(pud) number(lt,1) restart(pud) }
    Rule { name(amd) number(gt,1) page(admin) }
Sample User Rules
• Still somewhat experimental
    Rule { user(!root) number(gt,3) pct(gt,50) mail(CPUHOG, admin) }
    Message CPUHOG <<EOF
    User $user has $number processes using
    $pct% of the CPU on $host.
    EOF
Why Ranger?
• Some automatic monitoring is needed
• Commercial packages are complex and expensive
• Ranger does a lot in a small package
• Because it’s cool
Availability
• Needs a bit more shakedown at SLAC before distribution
• Look for it at http://www.slac.stanford.edu/~boeheim
• Will be starting a mailing list; send email to be included