Patrol/Ranger Update
Chuck Boeheim, Assistant Director, SLAC Computer Services

History
• Patrol originated in 1994
• Originally only reniced processes
• Extended to monitor filesystems and daemons, and to perform more notifications and repairs
• Downloaded by over 300 sites; in production use at about 20 known sites

Limitations
• The original rules language was simple and columnar:
    PC afs[0-9]* 50 log,mail(unix-admin)
• Difficult to extend to express more complex conditions
• E.g., renice processes using more than 20% of the CPU if the load average is over 3
• Written in Perl4, and limited by its lack of complex data structures

The Rewrite
• Update to Perl5
• Introduce new rules language
• Introduce extensible data collectors
• Rename to System Ranger

Rules file structure
• Config section supplies local customizations
• Ruleset sections define data collectors and the set of rules to be applied to them
• Message section defines message texts

Config section
• Supplies the common customizations made at other sites
    config {
        optsfile(/etc/tailor.opts)
        path(/usr/ucb:/bin:/usr/bin)
        mailfrom('The System Ranger <root>')
        mailreply('Unix Admins <unix-admin>')
    }

Rulesets
• Rulesets name a set of rules and associate them with a data collector
    Ruleset(anyname) collector(process) {
        list of rules...
    }
• Builtin data collectors are: System, Process, Daemon, User, Filesystem, File, Service
• Custom collectors are planned

Rules
• A rule is a set of function calls in braces
    Rule { cpu(gt,50) kill() log() }
• Functions return SUCCESS or FAILURE
• FAILURE skips the remainder of the rule; execution passes to the next rule
• A rule that succeeds ends processing of the ruleset unless the CONTINUE function appears in it

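The evaluation order described above can be sketched in Python. This is only a model of the stated semantics, not Ranger's Perl5 implementation; the `Rule` class, the `continues` flag (standing in for the CONTINUE function), and the helper names are all hypothetical.

```python
from typing import Callable, Dict, List

SUCCESS, FAILURE = True, False

# A "function" takes the item being examined and returns SUCCESS or FAILURE.
Func = Callable[[Dict], bool]


class Rule:
    def __init__(self, funcs: List[Func], continues: bool = False):
        self.funcs = funcs
        self.continues = continues  # hypothetical stand-in for CONTINUE

    def run(self, item: Dict) -> bool:
        for func in self.funcs:
            if not func(item):
                return FAILURE  # the remainder of the rule is skipped
        return SUCCESS


def run_ruleset(rules: List[Rule], item: Dict) -> int:
    """Apply rules in order and return how many rules were tried."""
    tried = 0
    for rule in rules:
        tried += 1
        if rule.run(item) and not rule.continues:
            break  # a successful rule ends processing of the ruleset
    return tried


def cpu_gt(limit: float) -> Func:
    """Model of cpu(gt,N): succeed when the item's CPU percent exceeds N."""
    return lambda item: item["cpu"] > limit


def action(name: str, journal: List) -> Func:
    """Model of an action such as kill or log: record it and succeed."""
    def run(item: Dict) -> bool:
        journal.append((name, item["pid"]))
        return SUCCESS
    return run
```

With the rules `[cpu_gt(50), kill]` and `[cpu_gt(25), log]`, a process at 80% CPU stops after the first rule, while one at 30% falls through to the second.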
Rules
• The word OR may connect functions
    Rule { cpu(gt,50) or size(gt,20M) kill() }
• A sequence of functions in braces returns SUCCESS or FAILURE for the entire sequence
    Rule { {cpu(gt,50) kill()} or cpu(gt,25) log }
• A sequence of functions in brackets always returns SUCCESS
    Rule { cpu(gt,50) [size(gt,10M) kill] log }

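The three grouping forms can be modeled as small combinators. This is a sketch under assumptions: that `or` joins only its neighboring terms and stops at the first success, and that a bracketed sequence still stops at its first failing function before reporting SUCCESS. All function names below are hypothetical.

```python
MB = 1024 ** 2  # assumes "M" means binary megabytes


def seq(*funcs):
    """Braces: succeed only if every function succeeds, stopping at the first failure."""
    return lambda item: all(f(item) for f in funcs)


def either(*funcs):
    """OR: succeed as soon as one alternative succeeds (assumed short-circuit)."""
    return lambda item: any(f(item) for f in funcs)


def always(*funcs):
    """Brackets: run the sequence, then report SUCCESS regardless of its result."""
    def run(item):
        seq(*funcs)(item)
        return True
    return run


def cpu_gt(limit):
    return lambda item: item["cpu"] > limit


def size_gt(limit):
    return lambda item: item["size"] > limit


def act(name, journal):
    """Model of an action function: record the action name and succeed."""
    def run(item):
        journal.append(name)
        return True
    return run


def bracket_rule(journal):
    # Rule { cpu(gt,50) [size(gt,10M) kill] log }
    return seq(cpu_gt(50),
               always(size_gt(10 * MB), act("kill", journal)),
               act("log", journal))


def or_rule(journal):
    # Rule { {cpu(gt,50) kill()} or cpu(gt,25) log }
    return seq(either(seq(cpu_gt(50), act("kill", journal)), cpu_gt(25)),
               act("log", journal))
```

In `bracket_rule`, a small process over 50% CPU is logged but not killed, because the failed size test inside the brackets does not fail the rule.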
Selection Functions
• Apply to specific machines:
    • host
    • option
    • arch
    • test
• Apply to specific instances:
    • user
    • group
    • name
All tests may be negative or positive, e.g., host(icarus) or user(!root)

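A selection test with optional negation might look like the following sketch. It assumes arguments are regular expressions matched against the whole value (suggested by examples such as `afs[0-9]+` elsewhere in the slides) and that a leading `!` inverts the result; the slides do not spell out either detail, and `selects` is a hypothetical name.

```python
import re


def selects(pattern: str, value: str) -> bool:
    """Return True when value passes the selection test.

    Assumptions: a leading "!" negates the test, and the rest of the
    pattern is a regular expression that must match the entire value.
    """
    negate = pattern.startswith("!")
    if negate:
        pattern = pattern[1:]
    matched = re.fullmatch(pattern, value) is not None
    return matched != negate
```

For example, `selects("!root", "alice")` passes while `selects("!root", "root")` does not.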
Comparison Functions
• Determine when thresholds are crossed
    • cpu - percent of CPU
    • size - memory or file size or rate of change
    • time - total CPU time
• Or test global values
    • loadavg, numusers, numprocs, uptime
• Take an optional first argument specifying the comparison: gt, lt, eq, etc.

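A threshold comparison with an optional operator argument and unit suffixes such as `20M` or `6h` could be sketched as below. Several details are assumptions, not taken from the slides: the default operator when none is given (`gt` here), the operators beyond gt, lt, eq and ne, the binary meaning of `K`/`M`/`G`, and the set of time suffixes. `compare` and `parse_quantity` are hypothetical names.

```python
import operator

OPS = {
    "gt": operator.gt, "lt": operator.lt, "eq": operator.eq,
    "ne": operator.ne, "ge": operator.ge, "le": operator.le,  # ge/le assumed
}

SIZE_UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}   # assumed binary units
TIME_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}      # assumed suffixes


def parse_quantity(text, units):
    """Convert '20M' or '6h' to a number using the given suffix table."""
    text = str(text).strip()
    if text and text[-1] in units:
        return float(text[:-1]) * units[text[-1]]
    return float(text)


def compare(value, *args, units=None):
    """compare(v, 'gt', '20M') or compare(v, '20M'); the default operator is assumed to be gt."""
    if len(args) == 2:
        op_name, limit = args
    elif len(args) == 1:
        op_name, limit = "gt", args[0]
    else:
        raise TypeError("expected (limit) or (operator, limit)")
    return OPS[op_name](value, parse_quantity(limit, units or {}))
```

Keeping separate suffix tables for sizes and times avoids the clash between `M` (megabytes) and `m` (minutes).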
Action Functions
• Specify some action to perform
    • log
    • mail
    • page
    • kill, signal (by pid or name)
    • nice

Sample Process Rules
    Rule { host(www.*) pct(gt,10) or size(gt,20M)
           mail(PROC_REPORT,www-monitor) mcons(info) log }
    Rule { {time(gt,6h) kill mail(OVERLIM, $user)} or
           {time(gt,4h) mail(WARN2, $user)} or
           {time(gt,2h) mail(WARN1, $user)} }
    Message OVERLIM <<EOF
    The CPU limit for $host is 6 hours. Your process
    $pid $cmd has been terminated for exceeding the limit.
    EOF

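The `$host`, `$pid` and `$cmd` placeholders in the message text can be illustrated with Python's `string.Template`, which happens to use the same `$name` syntax. This is only an analogy for how such a message might be filled in; how Ranger itself performs the substitution is not described in the slides, and `render` and the `MESSAGES` table are hypothetical.

```python
from string import Template

# Message text copied from the OVERLIM sample above.
MESSAGES = {
    "OVERLIM": Template(
        "The CPU limit for $host is 6 hours. Your process "
        "$pid $cmd has been terminated for exceeding the limit."
    ),
}


def render(message_id, **fields):
    """Fill in a named message; unknown placeholders are left untouched."""
    return MESSAGES[message_id].safe_substitute(**fields)
```

Calling `render("OVERLIM", host="icarus", pid=4242, cmd="a.out")` yields the notice with the three fields filled in.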
Sample Filesystem Rules
    Rule { name(/u[0-9]) pct(gt,99,90+1) page(admin) }
    Rule { host(afs[0-9]+) name(/vicep.*)
           { host(afs07) name(/vicepg) } or
           { host(afs08) name(/vicepf) } or
           { pct(gt,98) mail(FSFULL, admin) } }
    Message FSFULL <<EOF
    File system $name is $pct% full, grew by $delta%.
    EOF

Sample File Rules
    Rule { name(/var/adm*) size(gt,1M) page(admin) }
    Rule { name(/etc/passwd) md5() mail(PSWDCHG, admin) }
    Message PSWDCHG <<EOF
    File $name has been changed!
    EOF

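The `md5()` rule above presumably succeeds when a file's checksum differs from the one seen on a previous run; the slides do not state this, so treat it as an assumption. A minimal sketch of that idea, with the hypothetical function `md5_changed` and a plain dictionary standing in for whatever state Ranger keeps between runs:

```python
import hashlib


def md5_changed(name: str, content: bytes, known: dict) -> bool:
    """Return True when content's MD5 differs from the digest recorded for name.

    The first sighting of a file only records its digest and reports no
    change (an assumption). Reading the file is left to the caller so the
    sketch stays free of filesystem access.
    """
    digest = hashlib.md5(content).hexdigest()
    previous = known.get(name)
    known[name] = digest
    return previous is not None and previous != digest
```

A caller would read `/etc/passwd` in binary mode on each pass and send the PSWDCHG message whenever the function returns True.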
Sample Daemon Rules
    Rule { name(nfsd) number(ne,8) page(admin) }
    Rule { name(pud) number(lt,1) restart(pud) }
    Rule { name(amd) number(gt,1) page(admin) }

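The daemon rules above compare the number of running instances of a named process against an expected count. A rough Python model follows; it takes a list of process names as input rather than querying the system, and `check_daemons` and `DAEMON_RULES` are hypothetical names.

```python
from collections import Counter
import operator

OPS = {"ne": operator.ne, "lt": operator.lt, "gt": operator.gt}

# (daemon name, comparison, count, action), transcribed from the sample rules.
DAEMON_RULES = [
    ("nfsd", "ne", 8, "page"),
    ("pud", "lt", 1, "restart"),
    ("amd", "gt", 1, "page"),
]


def check_daemons(process_names):
    """Return the (daemon, action) pairs triggered by the running processes."""
    counts = Counter(process_names)  # a daemon that is not running counts as 0
    actions = []
    for name, op, expected, action in DAEMON_RULES:
        if OPS[op](counts[name], expected):
            actions.append((name, action))
    return actions
```

Using a `Counter` matters for the `pud` rule: a daemon that is absent from the process list must count as zero instances so that `number(lt,1)` can trigger the restart.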
Sample User Rules
• Still somewhat experimental
    Rule { user(!root) number(gt,3) pct(gt,50) mail(CPUHOG, admin) }
    Message CPUHOG <<EOF
    User $user has $number processes using $pct% of the CPU on $host.
    EOF

Why Ranger?
• Some automatic monitoring is needed
• Commercial packages are complex and expensive
• Ranger does a lot in a small package
• Because it’s cool

Availability
• Needs a bit more shakedown at SLAC before distribution
• Look for it via http://www.slac.stanford.edu/~boeheim
• A mailing list will be started; send email to be included