On August 18, 2010, a firmware bug caused both controllers of the Sun StorageTek 6540 array at SARA to reboot within 90 minutes of each other. All attempts to restore and recover the database on the original hardware failed, and the database was corrupted during the recovery process. The database was successfully restored to alternate hardware on September 1, 2010, and resynchronization from RAL, which copied the LHCb and ATLAS data to SARA, was prepared from the following day. These slides, presented at the 3D DBA Workshop in November 2010, note that the procedure has been documented for future use and acknowledge the assistance received from RAL and CERN. The exact cause of the corruption was never determined; a storage firmware upgrade and a rebuild of the LUNs resolved the problem, and the slides recommend enabling block checking to detect logical corruption early.
A brief timeline
• Due to a firmware bug, both controllers of the Sun StorageTek 6540 array reboot within 90 minutes of each other on 18/08/2010.
• All attempts to restore and recover the data on the original hardware fail. The database is corrupted during the recovery process.
• On 01/09/2010 the database is successfully restored to alternate hardware.
• On 02/09/2010 preparations start to synchronize the database from RAL.
• RAL starts to copy LHCb and ATLAS data to SARA.
3D DBA Workshop 16-17 November 2010
• The data is imported into the database on 08/09/2010.
• On the same day CERN brings the streams up.
Some minor issues
• At SARA the COMPATIBLE parameter had to be changed from 10.2.0.3 to 10.2.0.4 to match the one at RAL.
• There is an error in the Oracle Database Administrator's Guide on page 8-37 regarding the syntax of the parameter file of the impdp command.
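As an illustration of the COMPATIBLE change mentioned above, a minimal sketch follows. It assumes the instance uses a server parameter file (spfile) and is run from SQL*Plus as a privileged user; the slides do not show the commands actually used at SARA. COMPATIBLE is a static parameter, so the new value only takes effect after a restart, and once raised it cannot simply be lowered again.

```sql
-- Check the current setting before changing anything.
SELECT name, value
  FROM v$parameter
 WHERE name = 'compatible';

-- COMPATIBLE is static: record the new value in the spfile (assumed to be in use).
ALTER SYSTEM SET compatible = '10.2.0.4' SCOPE = SPFILE;

-- SQL*Plus commands: restart the instance so the new value takes effect.
SHUTDOWN IMMEDIATE
STARTUP
```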
Conclusions on the resynchronization
• The resynchronization process went rather smoothly (at least from SARA’s point of view).
• For SARA this was a learning opportunity. The procedure has been documented for possible future use.
• The assistance we received from both RAL and CERN was amazing.
Conclusions on data corruption
• We’ve been unable to determine the exact cause of the corruption.
• An upgrade of the storage firmware and a rebuild of the LUNs solved the problem (but for how long?).
• Always use "db_block_checking='TRUE'" in combination with db_block_checksum to detect logical corruption at a very early stage.
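The last recommendation can be sketched as follows. The slides give the value for db_block_checking but not for db_block_checksum, so TYPICAL below is an assumption, as is the use of an spfile (required for SCOPE = BOTH). Both parameters can be changed on a running instance; block checking adds some CPU overhead, so it is worth measuring on the actual workload.

```sql
-- Show the current settings of both parameters.
SELECT name, value
  FROM v$parameter
 WHERE name IN ('db_block_checking', 'db_block_checksum');

-- Enable logical block checking, as recommended on the slide.
ALTER SYSTEM SET db_block_checking = 'TRUE' SCOPE = BOTH;

-- Enable block checksums; TYPICAL is an assumed value, not taken from the slides.
ALTER SYSTEM SET db_block_checksum = 'TYPICAL' SCOPE = BOTH;
```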