The Near Earth Asteroid Rendezvous (NEAR) Rendezvous Burn Anomaly

Susan C. Lee, The Johns Hopkins University Applied Physics Laboratory

Disclaimer • The NEAR Mission ended in February 2001 and some documentation has since been lost. Some of this presentation relies on memory, but is basically accurate. The Lessons Learned represent my own opinions, not necessarily those of the JHUAPL Space Department, where I have not worked since January 1998.

Overview • NEAR Overview • Anomaly Description • Investigation Findings • Lessons Learned

Mission Description • Three-year cruise to the Near-Earth Asteroid Eros • Up to 12-day solar transit • Round-trip light times up to 40 min. • Numerous small Trajectory Correction Maneuvers (TCMs) • 2 Large TCMs using bi-propellant large velocity adjust thruster • Deep Space Maneuver • Eros Rendezvous Burn • No critical time windows for TCMs • One-year science mission orbiting Eros • TCMs planned for once a week • Frequent momentum dumps

Spacecraft Description • Mechanically simple • Fixed solar panels (after one-time deployment) • Fixed antennas • High Gain (1° BW) • Fan Beam (40° x 8° BW) • Dual hemispherical low gain • Electrically simple • Direct energy transfer power system • 1553 bus/discrete line communications • Computationally complex • 3-axis active guidance and control using thrusters and momentum wheels • Careful power management

Spacecraft Block Diagram

Safing Design • Goal of safing: keep the S/C viable and make ground contact 1. Keep the solar panels pointed at the Sun and the load below the solar panel output 2. Point the fan beam antenna at the Earth and swap redundant RF systems • Joint function of the C&T Processor and G&C system • Coordinated via housekeeping telemetry and discrete lines

Spacecraft Mode Descriptions • Operational • Under ground command (either real time or uploaded command sequences) • Solar panel normal near Sun line • Earth Safe • Solar panel normal on Sun line • Earth in fan beam antenna • Downlink at 10 bps • Sun Safe - rotate • Solar panel normal on Sun line • Rotate about Sun line at 1 rev/3 hours • Beacon on fan beam downlink • Sun Safe - freeze • Same as Earth Safe

Safing Implementation: Command and Telemetry Processor • Simple, rule-based autonomy system • Rules checked flags, relay states, housekeeping telemetry, discrete lines • Triggered rules point to associated command sequence • No loops or jumps • Priority dictated by order in the list of rules • Single Software Mode • Check for commands at the uplink interface/Execute • Check autonomy commands needed/Execute • Check for commands in uploaded command sequence/Execute • Place telemetry on the downlink according to commanded format and rate • S/C Safing Modes implemented by executing autonomy commands to set desired power state, RF state, formats, etc.
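The rule-checking scheme on this slide can be sketched roughly as follows. This is an illustration only, not the flight implementation: the rule names, telemetry keys, command names and the 26 V threshold are all hypothetical, and the sketch assumes that every triggered rule fires in list order (the slide says only that priority follows the order of the list).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Telemetry = Dict[str, float]


@dataclass
class Rule:
    name: str
    condition: Callable[[Telemetry], bool]  # reads flags/telemetry only
    sequence: List[str]                     # straight-line commands: no loops or jumps


def check_rules(rules: List[Rule], telemetry: Telemetry) -> List[str]:
    """Return commands from every triggered rule, in list (priority) order."""
    commands: List[str] = []
    for rule in rules:  # position in the list is the priority
        if rule.condition(telemetry):
            commands.extend(rule.sequence)
    return commands


# Hypothetical rules and thresholds, for illustration only.
RULES = [
    Rule("low_bus_voltage",
         lambda t: t["bus_voltage_v"] < 26.0,
         ["SHED_NONESSENTIAL_LOADS"]),
    Rule("aiu_heartbeat_lost",
         lambda t: t["aiu_heartbeat_age_s"] > 5.0,
         ["DISABLE_THRUSTERS", "ENTER_SAFE_MODE"]),
]

nominal = {"bus_voltage_v": 30.0, "aiu_heartbeat_age_s": 0.1}
faulted = {"bus_voltage_v": 24.0, "aiu_heartbeat_age_s": 6.0}
print(check_rules(RULES, nominal))
print(check_rules(RULES, faulted))
```

Because each rule points at a fixed command list and the checker makes a single pass, the behavior for any telemetry snapshot can be read directly off the rule table, which is the appeal of this style of autonomy.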

Safing Implementation: Guidance and Control (G&C) Processors • Many, many possible combinations of functions performed on Flight Computer (FC) and Attitude Interface Unit (AIU) using combinations of sensors, actuators and guidance algorithms • Implemented as in-line code triggered by IF-THEN-ELSE sequences • Multiple flags, timers and parameters • NOMINALLY, FC controls the S/C, using the AIU simply as an interface unit • Guidance algorithms based on stored orbit • Attitude based on Star Camera and IMU input • Nominal wheel control; thruster control during TCMs only • NOMINALLY, AIU checks for FC problems (e.g., Sun-pointing keep-in violated) • Can ask C&T Processor to switch to backup FC • Can take over S/C control for Safe Modes

Thruster Use Safing - Clementine Prevention • Thruster hardware enable/disable • ‘Open’ commands must be sent every 40 ms, else thruster valves close automatically • Separation of thruster use function between C&T Processor and G&C • C&T processor controls enable/disable • Faulty CTP can only enable thrusters, not open valves • G&C (AIU) controls thruster valve open/close • Faulty AIU can try to open valves, but won’t succeed unless CTP also enables thruster hardware
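The two-key arrangement described above can be modeled in a few lines. This is a simplified sketch under stated assumptions, not the NEAR hardware logic: the class and method names are hypothetical, and only the 40 ms refresh interval comes from the slide.

```python
from dataclasses import dataclass
from typing import Optional

OPEN_TIMEOUT_S = 0.040  # from the slide: 'open' must be re-sent every 40 ms


@dataclass
class ThrusterValve:
    enabled: bool = False                    # key 1: set by the C&T Processor only
    last_open_cmd_s: Optional[float] = None  # key 2: time of last 'open' from the AIU

    def set_enable(self, enabled: bool) -> None:
        """C&T Processor side: enable or disable the thruster hardware."""
        self.enabled = enabled

    def command_open(self, now_s: float) -> None:
        """AIU side: request that the valve be open at time now_s."""
        self.last_open_cmd_s = now_s

    def is_open(self, now_s: float) -> bool:
        """Valve is open only if enabled AND an 'open' arrived recently."""
        if not self.enabled or self.last_open_cmd_s is None:
            return False
        return (now_s - self.last_open_cmd_s) <= OPEN_TIMEOUT_S


valve = ThrusterValve()
valve.command_open(0.000)      # faulty AIU alone: no effect
print(valve.is_open(0.010))
valve.set_enable(True)         # both keys present
print(valve.is_open(0.030))
print(valve.is_open(0.050))    # 'open' not refreshed: valve has closed
```

Neither processor can fire a thruster by itself, and a processor that stops sending commands leaves the valve closed rather than open.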

Thruster Use Safing - Trajectory Correction Maneuvers • TCM thruster use under ground control only • Parameters loaded on G&C in advance and verified prior to burn • Timetagged commands on C&T Processor initiate and terminate burn

Burn Abort Conditions • NEAR burn philosophy: Better safe than sorry • No burns with critical timing • Better to shut down, correct problem if any, and try again • G&C Burn Shutdown Criteria (partial list) • Attitude Keep-in violated • Acceleration Keep-in violated • Anomalous pressure reading on fuel or oxidizer tanks • C&T Processor signals a Safe Mode • C&T Processor Burn Shutdown Criteria • Loss of AIU Heartbeat for 5 seconds • Occurrence of any condition normally causing a Safe Mode entry • G&C signals a Safe Mode
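A minimal sketch of the G&C shutdown check listed above, showing how a keep-in limit set too tight turns an ordinary transient into an abort. The field names and every numeric value are hypothetical; only the list of criteria comes from the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BurnSample:
    attitude_error_deg: float
    accel_m_s2: float
    tank_pressure_ok: bool
    ctp_safe_mode: bool


@dataclass
class KeepIns:
    attitude_deg: float
    accel_m_s2: float


def gc_abort_reasons(sample: BurnSample, limits: KeepIns) -> List[str]:
    """Return every shutdown criterion the sample violates (empty = keep burning)."""
    reasons = []
    if abs(sample.attitude_error_deg) > limits.attitude_deg:
        reasons.append("attitude keep-in violated")
    if abs(sample.accel_m_s2) > limits.accel_m_s2:
        reasons.append("acceleration keep-in violated")
    if not sample.tank_pressure_ok:
        reasons.append("anomalous tank pressure")
    if sample.ctp_safe_mode:
        reasons.append("C&T Processor signalled Safe Mode")
    return reasons


# Hypothetical start-up transient, checked against two hypothetical limits.
startup = BurnSample(attitude_error_deg=1.0, accel_m_s2=0.15,
                     tank_pressure_ok=True, ctp_safe_mode=False)
tight = KeepIns(attitude_deg=5.0, accel_m_s2=0.10)
loose = KeepIns(attitude_deg=5.0, accel_m_s2=0.20)
print(gc_abort_reasons(startup, tight))
print(gc_abort_reasons(startup, loose))
```

The same sample passes or aborts depending only on the stored limit, which is why the limit itself needs to be checked against measured data from earlier burns.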

Thruster Use Safing - Autonomous Momentum Dumps • Autonomous use of the thrusters for emergency momentum dump only

Preparation - Burn Scripts • Work began in November 1998, with DSM1 scripts as model • Significant deviation from nominal safing practice • Final review December 7, 1998 • System Engineer not present • Brassboard testing of the nominal burn only • Brassboard configuration deviated significantly from the S/C • Burn abort not tested at all

The Anomaly: Just the Facts, Ma’am • Burn Command Script • Uploaded December 16, 1998 • Burn • On-board initiation at specified time • Normal execution of 200-sec settling burn • Initiation of bi-propellant burn as expected • Anomaly • Burn abort within fraction of a second from bi-propellant initiation • S/C signal lost 37 seconds following abort • DSN acquired Sun Safe beacon 27 hours after LoS • Freeze command stopped rotation with Earth in the fan beam, and telemetry downlink was commanded.

Recovery and Outcome • When reacquired, S/C in stable Sun Safe mode controlled by the backup AIU (AIU 2) • Mission Operations recovered to Operational Mode • Interrupted Command History/Autonomy History downlink • Faulty procedure used in first attempt resulted in immediate demotion back to Safe Mode and AIU switch • S/C state assessed and immediate cause of burn abort ascertained in two days • New burn planned and executed on Jan. 3, 1999 allowed completion of NEAR mission • Used up fuel margin • Additional 13 months of cruise prior to mission start • Some contamination of imager optics • Degradation of some thrusters due to cold firings

Data Sources for Diagnosis

Burn Abort Early Sequence of Events (1)

Why did the burn abort?

Early Sequence of Events (2) • Transition to Earth Safe initiates burn shutdown command sequence and high-rate slew to Sun pointing using thrusters only. • Command script error causes abrupt transition to wheels-only control. • NEAR goes to Sun Safe before the wheels can overcome the high rate.

What was the script error? • The script did not return control to the wheels at all, but did disable the thrusters. • Without enabled thrusters, the G&C autonomously forced wheel control, but without the controlled transition. • Because thruster-only control was commanded, the G&C used thrusters each time they were enabled. • Need for a carefully timed sequence for returning control to the wheels was established prior to the first TCM one week after launch. • Brassboard testing showed that the S/C was very likely to receive a “kick” from the thrusters without this controlled procedure. • Brassboard testing of the script reproduced the early anomaly events perfectly • The DSM1 burn script DID contain the wheels-only command (but not the right sequence), and the missing safing rules.

Early Sequence of Events (3) • “Kick” leaves high system momentum and initiates a momentum dump. • Command script error causes new “kick” to attitude and momentum.

Simulated Early Sequence of Events

Complete Timeline (00:00 - 03:00)

NEAR Anomaly Investigation • NEAR Anomaly Review Board established on 6 January 1999 • Assess APL efforts to understand and correct causes of anomaly, and recommend additional efforts • Determine most probable cause of the anomaly • Review NEAR program and recommend improvements for future missions • Timeline reconstruction from available data • Determination of probable cause • Fact of and reason for burn abort recorded in snapshot data • Script error obvious to knowledgeable engineers • Impact of such an error known prior to the event • Impact confirmed by brassboard simulation of the burn abort event • Fault tree developed for anomalous momentum dumps • Analysis and 128 brassboard simulations of potential scenarios

Findings “The investigation established a good understanding of the events during approximately the first 47 min after the abort, but no explanation for the failure of onboard autonomy to quickly correct the problem. The Board found no evidence that any hardware fault or single-event upset contributed to the failure. Although software errors were found that could prolong and exacerbate the recovery, they by no means fully explain it.” • No explanation for the long-term behavior of the S/C • Only remaining branch in the fault tree is “two or more failures”

Hardware Faults • All hardware functioned nominally before, after and, as far as can be seen with limited data, during the anomaly • Most hardware failure modes failed to reproduce known events when simulated • Only gyro noise gave results close to observed behavior • Required noise levels 10x higher than measured on ground or in flight, before or after the anomaly • With high gyro noise, simulations show NEAR never recovers, so noise would have to go up and down and up and down (…) • No credible mechanism for this phenomenon was ever suggested by APL or the gyro manufacturer (Litton)

Software Failures • An independent review team found 17 errors in AIU or FC software • 9 in compiled code • 8 in data structures or other parameters (in addition to the acceleration limit that precipitated the anomaly) • One error caused a high momentum wheel rate to be set to zero • Known to have occurred at least once during the anomaly • Can cause high momentum to be calculated as low OR low momentum to be calculated high • In simulation, S/C always recovered in 20 minutes or less • Software error eliminated as the total cause, because: • Simulator running flight code did not exhibit anomaly • No repeat of anomaly for remainder of the mission • But there were parameter changes and software uploads
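Why zeroing one wheel rate can push the momentum estimate in either direction is easiest to see with a one-axis toy model. The sketch below is not the NEAR flight algorithm: the function, the wheel inertia and the wheel rates are hypothetical, and it assumes the estimate is simply body momentum plus the sum of wheel inertia times wheel rate along a single axis.

```python
from typing import List, Optional

WHEEL_INERTIA = 0.01  # kg*m^2, hypothetical value for illustration


def system_momentum(body_h: float,
                    wheel_rates: List[float],
                    zeroed_wheel: Optional[int] = None) -> float:
    """One-axis estimate: body momentum plus each wheel's inertia * rate (rad/s).

    If zeroed_wheel is given, that wheel's rate is read as zero,
    mimicking the reported error.
    """
    total = body_h
    for i, rate in enumerate(wheel_rates):
        if i == zeroed_wheel:
            rate = 0.0
        total += WHEEL_INERTIA * rate
    return total


# Case A: two fast wheels nearly cancel, so true momentum is low.
opposed = [400.0, -380.0]
print(system_momentum(0.0, opposed), system_momentum(0.0, opposed, zeroed_wheel=0))

# Case B: one fast wheel dominates, so true momentum is high.
aligned = [400.0, 20.0]
print(system_momentum(0.0, aligned), system_momentum(0.0, aligned, zeroed_wheel=0))
```

In case A the faulty estimate is far larger in magnitude than the truth, which could trigger an unnecessary dump; in case B it is far smaller, which could suppress a needed one.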

Can Software/SEUs be eliminated? • Many G&C data structures not downloaded until after certain parameters were changed • Data structures were not verified prior to burn • SEU, upload error, or configuration management error possible • FC1 program memory not downloaded and verified after anomaly • February 24, 1999, FC1 spontaneously re-booted (first and only time) • Could be SEU or still unknown software error • When the anomaly review began, it found two different versions of the FC code, both labeled 1.11. Which was on the S/C? Was either? • 80,000 lines of highly convoluted code • The brassboard simulates the physics; the S/C lives it. More about this later.
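The kind of pre-burn verification that was skipped can be sketched as a checksum comparison between the configuration-managed ground image and a memory dump from the spacecraft. This is a generic illustration, not the NEAR ground software: the structure names and byte values are hypothetical, and CRC-32 (Python's `zlib.crc32`) stands in for whatever check the real system would use.

```python
import zlib
from typing import Dict, List


def expected_crcs(ground_image: Dict[str, bytes]) -> Dict[str, int]:
    """CRC-32 of each data structure in the configuration-managed ground copy."""
    return {name: zlib.crc32(data) for name, data in ground_image.items()}


def unverified_structures(expected: Dict[str, int],
                          dump: Dict[str, bytes]) -> List[str]:
    """Names whose on-board dump is missing or does not match the ground copy."""
    bad = []
    for name in sorted(expected):
        data = dump.get(name)
        if data is None or zlib.crc32(data) != expected[name]:
            bad.append(name)
    return bad


# Hypothetical structures: one changed byte on board, one never dumped.
ground = {"burn_params": bytes([1, 2, 3, 4]), "orbit_table": bytes([9, 9, 9])}
good_dump = dict(ground)
bad_dump = {"burn_params": bytes([1, 2, 3, 5])}

print(unverified_structures(expected_crcs(ground), good_dump))
print(unverified_structures(expected_crcs(ground), bad_dump))
```

A check like this does not say whether the parameters are right, only that what is on board is what was reviewed. It would separate the "SEU, upload error, or configuration management error" branch from the others.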

First Observation • Apparently, it took four seemingly independent errors to cause the NEAR rendezvous burn anomaly • Burn abort caused by a threshold set too low • Data was available to set it properly • Serious errors in a script that should have been under configuration management, reviewed and tested • Two or more unknown errors that caused continued control problems, even after the autonomy actions corrected the configuration error • NARB concluded that no single error could produce the known behavior • Was NEAR just unbelievably unlucky, or is there something to be learned here? • Examine the patterns

Q: Why didn’t the G&C use data from DSM1 to set the acceleration limit? • Consider the following: • Less than a week prior to spacecraft/rocket mating, the System Engineer checked the alignment of the Star Camera and found it to be 90° out. • Immediately following launch, the Star Camera was unable to find guide stars, because the on-board star map of the southern hemisphere of the celestial sphere was incorrect. • The first TCM was poorly controlled, because the control law used an inaccurate model of the thruster valve action. (Manufacturer’s data was located in the G&C engineer’s file cabinet.) • The burn anomaly was not the only or first time the NEAR G&C team failed to measure or use data to check their models • S/C testing failed to uncover any of these errors, including the faulty acceleration limit

Testing G&C Algorithms • Lacking a zero-gravity environment, a wrap-around simulation with a ‘truth model’ is the only way to test G&C • Meticulous attention to modeling of physical phenomena • Independence between the flight algorithms and the truth model • The NEAR ‘truth model’ was written by the flight team and mirrored all the incorrect physical models used to design the S/C G&C algorithms • Although NEAR had an Independent V&V team for G&C, the flight team GAVE them all the models • MSX (the program prior to NEAR) had an independent team build the truth model • NEAR flight G&C team opposed an independent team for NEAR • “50% of the errors found on MSX were in the truth model, not the S/C”
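The danger of a truth model that shares the flight team's assumptions can be shown with a toy burn-sizing example. Everything here is hypothetical (function names, mass, thrust values and tolerance); it is a sketch of the testing principle, not of the NEAR simulation.

```python
def burn_duration_s(target_dv_m_s: float, mass_kg: float,
                    assumed_thrust_n: float) -> float:
    """Flight algorithm: size the burn from the design model of the thruster."""
    return target_dv_m_s * mass_kg / assumed_thrust_n


def truth_delta_v(duration_s: float, mass_kg: float,
                  truth_thrust_n: float) -> float:
    """Truth model: delta-v actually delivered (constant-mass approximation)."""
    return truth_thrust_n * duration_s / mass_kg


TARGET_DV = 10.0         # m/s, hypothetical
MASS = 800.0             # kg, hypothetical
DESIGN_THRUST = 450.0    # N, value assumed by the flight algorithm (hypothetical)
MEASURED_THRUST = 470.0  # N, value an independent team measured (hypothetical)
TOLERANCE = 0.1          # m/s, hypothetical pass/fail threshold

duration = burn_duration_s(TARGET_DV, MASS, DESIGN_THRUST)

# Truth model built from the same design assumption: the error cancels out.
shared_error = abs(truth_delta_v(duration, MASS, DESIGN_THRUST) - TARGET_DV)

# Truth model built independently from hardware measurements: the error shows.
independent_error = abs(truth_delta_v(duration, MASS, MEASURED_THRUST) - TARGET_DV)

print(shared_error <= TOLERANCE, independent_error <= TOLERANCE)
```

With the shared model the test passes by construction, whatever the real thruster does. Only the independently built model can fail, and a test that cannot fail verifies nothing.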

Lesson Number 1 • Always have an independent team build the simulation that will be used to test the G&C algorithms. • Different approaches by different teams can uncover biases on the part of either team • Use measurements on real flight hardware as much as possible • Accept the time spent on errors in the truth model to find the errors in flight algorithms. You can’t fool Mother Nature.

Q: How did such an obviously flawed script escape notice? • Consider the following: • At launch, most of the scripts in use by Mission Operations were last-minute adaptations of S/C-level test scripts • Dangerous test commands still in place in the rendezvous burn script • In just the first 8 months of operations, there were 7 entries into Safe Mode caused by Mission Operations errors • Many script errors that could have been found by brassboard testing • Lessons of the first TCM • At the time of the rendezvous burn, Mission Operations still had no set procedure for recovery from Safe Mode • This was an Action Item from Critical Design Review • 2 months after the burn anomaly, two new Safe Mode entries were caused by operations errors in loading orbits • Loading new orbits was a routine operation for three years of cruise • Such events were almost accepted as an inevitable part of operations

Preparation for Operations • Most operations on spacecraft can be planned, scripted and tested on the ground before launch • A Concept of Operations that reflects the actual S/C design • Contingency planning, as well as nominal operations • Scripts that can be used in flight • Pre-launch NEAR Mission Operations concentrated on ground system acquisition and DSN connectivity • The NEAR CONOPs was essentially generic - no thinking about how the operations would be conducted with NEAR • No thinking about contingencies • No practice of operations with significant round-trip light times • A function of better, cheaper, faster? • Nope; function of inexperience with professional operations

Mission Operations Professionalism • Professional Mission Operations requires discipline • Configuration management of scripts, code, parameters, etc. • Following a process • Review process • Test requirements • Script sign-off • Use of proven procedures to perform routine tasks • Using Problem Failure Reporting as an opportunity to learn • Change or institute process to avoid repeat of errors • NEAR approach: Conduct operations with a team of engineers who would become experts on the spacecraft and mission • Resulted in a ‘heroic’ mode of operations - CMM Level 1 of ops • Configuration management, reviews, sign-off on scripts were not the interesting part of operations for the NEAR team • Didn’t acquire the degree of knowledge required for hero status

Specialized Technical Knowledge • Very few people were truly capable of reviewing scripts • G&C engineers didn’t understand the scripting language • Mission Operations team didn’t understand the spacecraft • Running the brassboard simulator took knowledge and patience • Setting up the ‘truth model’ simulation • Maintenance to keep the brassboard synchronized with the S/C • Ran in real-time, so simulations took time • “Half the time, errors are in the brassboard setup, not the script”

Lesson Number 2 • Mission Operations requires a team of experienced, dedicated professionals with a unique set of skills. • Planning, preparation, process control, configuration management are as important as detailed technical knowledge • Practice pre-flight the way you plan to operate in flight and then don’t deviate unless absolutely necessary • Accept the time spent on errors in the ground simulator to find the errors in scripts. Being a hero means never having to say “I’m sorry”.

Q: Why didn’t the S/C recover after autonomy corrected the G&C mis-configuration? • Consider the following: • Pre-flight, G&C code had more SW PFRs than the other 5 processors combined • New versions were loaded ~ 10 times during S/C-level testing, in Maryland and at the Cape • The first three versions wouldn’t boot • Three separate problems that caused FC commands to be ignored were discovered pre-flight and a fourth after launch. • Telemetry was a particular problem • The G&C used the ground simulator, not telemetry, to test their software • Prior to the anomaly, FC code was uploaded three times and the AIU once to correct major problems in flight • 17 additional errors were found during the rendezvous burn investigation • The existence of undiscovered G&C code errors is not unlikely, based on the continued high rate of fault discovery

Q: Do we even need an undiscovered error to explain the anomaly? • G&C software error caused a high momentum wheel rate to be set to zero • Can cause high momentum to be calculated as low OR low momentum to be calculated high • In simulation, S/C always recovered in 20 minutes or less • Limited brassboard simulation of this scenario • Other failures of the simulation to catch errors in flight code have been caused by mismatches between the truth model and reality • How accurate is the wheel model? • How would the behavior change if the wheel model is changed? • An attractive hypothesis • Requires only a known error in the G&C code, plus an unrealistic wheel model • Code containing the error known to be invoked at least once during the anomaly

Lesson Number 3 • If the software error discovery rate is still high, keep testing, even if the S/C has already been launched. • Use the ground simulator • Use an independent team, like the NARB did following the anomaly • Remember Lesson 1: Take every opportunity to adjust the truth model to match S/C performance in flight. It ain’t over ‘til it’s over.

Q: Was NEAR just really unlucky? • Consider the following: • G&C code was known to be buggy pre-flight • G&C code continued to be buggy during flight • The fact that there had been no true independent look at the G&C truth model was known within a week of launch • Mission Operations preparation was known to be inadequate prior to launch • Mission Operations were fault-prone throughout cruise • Mission Operations was never asked for an accounting of their process prior to the burn anomaly • Each event was treated individually, rather than as a pattern that had a high probability of converging into disaster • The NEAR burn anomaly represented a Management failure

Lesson Number 4 • Management must stay informed and involved before there is a serious problem • Look for trends and patterns • When there is serious disagreement on the cause or meaning of events, look closer • Get an independent opinion. Ultimately, leadership is responsible.
