Tuesday, December 22, 2009

Case Study: Recovering from Memory Faults















In this case study, the system administrator, Bill Landis, is responsible for keeping the Silicon Valley Hospital's billing system available 24 hours a day, 7 days a week. Based on past experience, he wants a system whose components, such as CPU and memory, can be replaced without bringing the system down. Bill has a Sun Enterprise 10000 server, which has dynamic reconfiguration capabilities. To take advantage of Sun's dynamic reconfiguration, Bill is configuring his system memory so that it can be taken offline and replaced in the event of a board failure.


Verifying Configuration


The system's real memory is divided into memory banks, which become ineligible for dynamic reconfiguration when they contain kernel pages. With Sun's dynamic reconfiguration features, you can restrict kernel pages to certain memory banks. Once this is configured, you can use the Dynamic Reconfiguration screen (shown in Figure 4-13) in SyMON to verify that the remaining banks aren't "permanent" and are available to be unconfigured.


Figure 4-13. Using SyMON dynamic reconfiguration to replace a failed memory board.



As Figure 4-13 indicates, the memory board in slot 2 is configured, but it isn't assigned to a permanent memory bank. As a result, Bill can use this screen in SyMON to take the memory offline. In contrast, slot 0 is associated with a permanent memory bank and can't be disconnected while the system is running.
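
The eligibility rule described above can be sketched in a few lines of Python. This is only an illustrative model, not SyMON's API: the Board class, its fields, and can_disconnect are hypothetical names, and the two boards mirror the slots discussed for Figure 4-13.

```python
from dataclasses import dataclass


@dataclass
class Board:
    """Hypothetical model of a memory board as shown on the DR screen."""
    slot: int
    configured: bool
    permanent: bool  # True when a bank on this board holds kernel pages


def can_disconnect(board: Board) -> bool:
    """A configured board is eligible only if it has no permanent bank."""
    return board.configured and not board.permanent


# Assumed values matching the text: slot 0 is permanent, slot 2 is not.
boards = [
    Board(slot=0, configured=True, permanent=True),
    Board(slot=2, configured=True, permanent=False),
]

removable = [b.slot for b in boards if can_disconnect(b)]
print(removable)
```

Under these assumptions, only slot 2 passes the check, which matches what Bill reads off the screen.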



Setting Up Monitoring and Reconfiguration


From the Enterprise SyMON console, Bill loads the Config-Reader and Dynamic Reconfiguration modules so that he is notified of hardware faults and can handle them without taking the system down. The Config-Reader module is located under Hardware and the Dynamic Reconfiguration module is located under Local Applications (shown in Figure 4-7, earlier in this chapter).



Memory Board Failure Occurs


When a critical memory fault occurs, the icon for the system on the SyMON console indicates the alarm. Bill opens the Alarm window to see more details about the event and finds that a memory board has failed. Using the Logical View, like the one shown in Figure 4-14, he identifies the failed memory board. Using the Physical View, like the one shown earlier in Figure 4-5, he pinpoints where the board physically sits in the system.


Figure 4-14. Using the SyMON Logical View to locate a failed memory board.




Fixing the Failure and Restoring Service


Bill accesses the Dynamic Reconfiguration screen from the SyMON console. First, he selects the failed memory slot and clicks the Disconnect button to unconfigure and disconnect the board. Next, he replaces the failed board and then connects the board by clicking the Connect button; he leaves it temporarily unconfigured, however, while he performs a memory test using the Test Memory button, to ensure that the new board is functional. Finally, Bill clicks the Configure button to make these memory resources available to the system.
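
The order of Bill's steps can be viewed as a small state machine. The sketch below is a hypothetical model, not SyMON code: the state names, action names, and the run helper are invented for illustration. It also treats the memory test as mandatory before configuring, which reflects Bill's practice in this case study rather than a restriction the tool is known to enforce.

```python
from enum import Enum


class State(Enum):
    CONFIGURED = "configured"
    DISCONNECTED = "disconnected"
    CONNECTED = "connected but unconfigured"
    TESTED = "connected and tested"


# (current state, action) -> next state. Action names loosely follow the
# buttons mentioned in the text; the physical board swap happens while
# the slot is in the DISCONNECTED state.
TRANSITIONS = {
    (State.CONFIGURED, "disconnect"): State.DISCONNECTED,
    (State.DISCONNECTED, "connect"): State.CONNECTED,
    (State.CONNECTED, "test_memory"): State.TESTED,
    (State.TESTED, "configure"): State.CONFIGURED,
}


def apply(state: State, action: str) -> State:
    """Return the next state, or raise if the action is out of order."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(
            f"cannot {action} a board that is {state.value}"
        ) from None


def run(actions, state=State.CONFIGURED):
    """Apply a sequence of actions, starting from a configured board."""
    for action in actions:
        state = apply(state, action)
    return state


final = run(["disconnect", "connect", "test_memory", "configure"])
print(final.value)
```

Modeling the sequence this way makes the ordering explicit: in this sketch, trying to configure a board that has been connected but not yet tested raises an error instead of silently bringing untested memory online.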


Bill was able to handle a failed memory board in this environment with very little impact on the system. Sun's dynamic reconfiguration capabilities, available from SyMON, allow failed memory, CPU, and I/O resources to be replaced without bringing down the system.











