Collecting System Performance Data

Users call their IT department when they experience delays in accessing data or applications. Good tools are needed to help an operator pinpoint the source of the problem. This section covers some of the interesting performance and resource-utilization metrics, and the tools available to collect data about these metrics.

A wide range of conditions may result in resource and performance problems. Running out of available memory may be caused by a failure of a memory component or by a memory leak in an application. A sudden rise in CPU utilization could indicate a processor failure or the introduction of a CPU-intensive application on the system. Analysis is needed to determine whether resource problems can be fixed with a configuration change, hardware repair, or other techniques.

Many important system resources have configured limits. The following system resource metrics are important to monitor:
Earlier, this chapter discussed some of the tools that can be used to check system resource usage. The sar and sysdef commands can compare current usage to configured limits. An EMS monitor is available to detect thresholds being exceeded for the following resources:
The performance tools discussed in this section can also detect resource usage problems. Some system performance monitoring is available from the SAM Performance Monitors, with which an administrator can obtain information on system, disk, and virtual memory activity, for example. Text-based information is displayed in a Motif window when one of the desired metrics is selected.

Historical information is important for understanding how system performance has varied over time. Knowing how your system behaves under normal conditions helps when troubleshooting system performance problems. Note that the performance tools themselves affect the performance of the system, so you need to find a tool with low overhead.

This section describes some common tools for measuring and monitoring system performance. Here are some of the key metrics discussed in this section:
Performance tools, such as BMC PATROL and MeasureWare, don't always provide the same set of metrics on all platforms. For simplicity, this section focuses on the Sun Solaris and HP-UX platforms only. Also, these products are continually being enhanced, so the actual metrics available for use in your environment may not precisely match the information presented in this section.

MeasureWare

HP MeasureWare Agent is a Hewlett-Packard product that collects and logs resource and performance metrics. MeasureWare agents run and collect data on the individual server systems being monitored. Agents exist for many platforms and operating systems, including HP-UX, Solaris, and AIX.

The MeasureWare agents collect data, summarize it, timestamp it, log it, and send alarms when appropriate. The agents collect and report on a wide variety of system resources, performance metrics, and user-defined data. The information can then be exported to spreadsheets or to performance analysis programs, such as PerfView. The data can be used by these programs to generate alarms to warn of potential performance problems. By using historical data, trends can be discovered. This can help address resource issues before they affect system performance.

MeasureWare agents collect data at three different levels: global system metrics, application metrics, and process metrics. Global and application data is summarized at five-minute intervals, whereas process data is summarized at one-minute intervals. An administrator can define important applications by listing the processes that make up each application in a configuration file.
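The difference between the five-minute and one-minute summarization intervals described above can be illustrated with a small Python sketch. This is not MeasureWare code and does not use any MeasureWare interface: the function name summarize, the sample data, and the choice of a simple average as the summary are all assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import mean


def summarize(samples, interval_seconds):
    """Average (timestamp, value) samples into fixed-width time buckets.

    Returns a list of (bucket_start, mean_value) pairs sorted by time.
    Averaging is an assumption; a real agent may summarize differently.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Round each timestamp down to the start of its interval.
        bucket_start = timestamp - (timestamp % interval_seconds)
        buckets[bucket_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Hypothetical CPU utilization samples: (seconds since start, percent busy).
cpu_samples = [(0, 20.0), (60, 40.0), (240, 60.0), (300, 80.0), (360, 100.0)]

GLOBAL_INTERVAL = 300   # five minutes, as for global and application data
PROCESS_INTERVAL = 60   # one minute, as for process data

global_summary = summarize(cpu_samples, GLOBAL_INTERVAL)
process_summary = summarize(cpu_samples, PROCESS_INTERVAL)

print(global_summary)
print(process_summary)
```

With the longer interval, the five samples collapse into two summary records, while the one-minute interval keeps each sample in its own record. This is the usual trade-off between log size and detail when choosing a summarization interval.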
The basic categories of MeasureWare data are listed in Table 4-4. Also included are optional modules for database and networking support. MeasureWare agents also collect data provided through the DSI interface. The following lists the global system metrics that are available from MeasureWare on HP-UX and Sun Solaris. Additional metrics provided by MeasureWare are covered in other chapters.
Note that, in addition to performance metrics, MeasureWare provides useful configuration information, such as number of processors and the number of disk devices. The following additional global system metrics are available on HP-UX:
You can have alarms sent based on conditions that involve a combination of metrics. For example, a CPU bottleneck alarm can be based on the CPU use and CPU run queue length. MeasureWare agents provide these alarms to PerfView for analysis, and to the IT/O management console. SNMP traps can also be sent at the time a threshold condition is met. Automated actions can be taken, or the operator can choose to take a suggested action.

MeasureWare's extract command can be used to export data to other tools, such as spreadsheet programs. Additionally, Application Response Measurement (ARM) APIs (described in detail in Chapter 7) can be used to instrument applications so that response times can be measured. The application response time information can be passed along to MeasureWare agents for analysis.

Although MeasureWare provides extensive performance and resource information, it provides limited configuration information and no data about system faults.

For further information, visit the HP Resource and Performance Management Web site at http://www.openview.hp.com/solutions/application/.

GlancePlus

GlancePlus is a real-time, graphical performance monitoring tool from Hewlett-Packard. It is used to monitor the performance and system resource utilization of a single system. Both Motif-based and character-based interfaces are available. The product can be used on HP-UX, Sun Solaris, and many other operating systems. GlancePlus collects information similar to the information collected by MeasureWare, and samples data more frequently than MeasureWare. GlancePlus can be used to graphically view the following:
GlancePlus is also capable of setting and receiving performance-related alarms. Customizable rules determine when a system performance problem should be sent as an alarm. The rules are managed by the GlancePlus Adviser. The Adviser menu gives you the option to Edit Adviser Syntax. When you select this option, all the alarm conditions are shown, and you can then modify them.

Listing 4-13. Defining alarms in GlancePlus.
Alarms result in onscreen notification, with the color representing the criticality of the alarm. An alarm can also trigger a command or script to be executed automatically. Instead of sending an alarm, GlancePlus can print messages or notify you by executing a UNIX command, such as mailx, using its EXEC feature. To configure events, you need to edit a configuration file. The GlancePlus Adviser syntax file (/var/opt/perf/adviser.syntax) contains symptom and alarm configuration. Additional syntax files can also be used. A condition for an alarm to be sent can be based on rules involving different symptoms. Listing 4-13 shows an example of how you can set up an alarm for CPU bottlenecks that is based on CPU utilization and the size of the run queue. You can also execute scripts in command mode. To execute a script, type:
In this example, a yellow alert is sent to the GlancePlus Alarm screen if a CPU bottleneck is suspected. As a bottleneck becomes more likely, the alarm changes to red. You can define the threshold for when the alarm should be sent. The symptoms are re-evaluated at every time interval. Here is a sampling of some of the useful system metrics that can be monitored with GlancePlus:
More than 600 metrics are accessible from GlancePlus. Some of these metrics are discussed in other chapters. The complete list of metrics can be found by using the online help facility. This information can also be found in the directory /opt/perf/paperdocs/gp/C.

GlancePlus allows filters to be used to reduce the amount of information shown. For example, you can set up a filter in the Process view to show only the more active system processes. GlancePlus can also show short-term historical information. When selected, the alarm buttons, visible on the main GlancePlus screen, show a history of alarms that have occurred. GlancePlus also shows Process Resource Manager (PRM) behavior, if PRM is installed, and allows the PRM process group entitlements to be changed.

For further information, visit the HP Resource and Performance Management Web site at http://www.openview.hp.com/solutions/application/.

PerfView

PerfView is a graphical performance analysis tool from Hewlett-Packard. It is used to graphically display performance and system resource utilization for one system or multiple systems simultaneously, so that comparisons can be made. A variety of performance graphs can be displayed. The graphs are based on data collected over a period of time, unlike the real-time graphs of GlancePlus. This tool runs on HP-UX or NT systems and works with data collected by MeasureWare agents. PerfView has the following three main components:
PerfView can be used to monitor critical system resources. Figure 4-10 shows the PerfView Analyzer graphing memory utilization and paging rates. Other predefined graphs exist for history, CPU, memory, and queue information. For example, the history graph shows CPU, active processes, disk utilization, memory pageout rates, and swapout rates.

Figure 4-10. PerfView graph showing memory utilization and paging rates.

The PerfView Analyzer graph shown in Figure 4-11 compares the performance of two systems simultaneously. Up to eight systems can be compared in one graph. Comparing system utilization can be useful when determining where to deploy new applications, or when adding new users.

Figure 4-11. PerfView graph comparing two systems.

PerfView's ability to show history and trend information can be helpful in diagnosing system problems. Graphing performance information can help you to understand whether a persistent problem exists or whether an anomaly is simply a momentary spike of activity. To diagnose a problem further, PerfView Monitor allows users to change time intervals, to try to find the specific time a problem occurred. The graph is redrawn showing the new time period.

PerfView is integrated with several other monitoring tools. You can launch GlancePlus from within PerfView by accessing the Tools menu. PerfView can be launched from the IT/O Applications Bank as well. When troubleshooting an event in the IT/O Message Browser window, you can launch PerfView to see a related performance graph. PerfView Monitor is not used with IT/O. Instead, the IT/O Message Browser is used. When an alarm is received in IT/O, the operator can click the alarm and a related PerfView graph can be shown.

PerfView can show information collected from multiple systems in a single performance graph. The PerfView and ClusterView products have also been integrated to enable the operator to select a cluster symbol on an HP OpenView submap and launch the PerfView application.
This quickly shows a performance comparison between all systems in the cluster.

For further information, visit the HP Resource and Performance Management Web site at http://www.openview.hp.com/solutions/application/.

BMC PATROL for UNIX

BMC Software provides monitoring capabilities through its PATROL software suite. PATROL is a system, application, and event management suite for system and database administrators. PATROL provides the basic framework for defining thresholds, sending and translating events, and so forth. Optional products, called Knowledge Modules (KMs), are capable of monitoring specific components. For example, BMC PATROL includes KMs for UNIX, SAP R/3, Oracle, Informix, and other applications. In fact, more than 40 KMs are available from BMC for use with PATROL.

With the PATROL KM for UNIX, managed components include the CPU, memory, users, kernel, processes, printers, security, and filesystems. These components are discovered automatically and represented on the PATROL console with status icons. System utilization can be shown as graphs, to capture trends, and data can either be displayed in real time or saved in log files. Like other graphical monitoring tools, PATROL provides an Event Manager window, which can show received events. Figure 4-12 highlights disk and NFS events received at the console.

Figure 4-12. PATROL Event Manager showing disk and NFS events.

For memory and swap resources, PATROL can show total real memory available, total virtual memory available, a list of swap devices, the number of processes swapped, and swap space utilization. For the CPU, PATROL can show bottlenecks and utilization information, along with a variety of statistics, such as CPU idle time, run queue length, and swap queue length. Information about the operating system itself is also maintained, such as the name, version, and creation date. PATROL can display the total number of processes, the number of zombie processes, and heavy CPU users.
Through the PATROL console, you can perform administrative tasks, such as reprioritizing processes. PATROL also can display the total number of users and sessions, and can check security by monitoring the number of failed user and privileged logins. You can check the printer queue to see how many jobs are in the queue and to determine the state of the printer.

PATROL can monitor the filesystem and can automatically determine the effectiveness of the buffer cache. Regular reports can be generated to check disk usage per user, to create a list of the largest files, or to list files that have not been accessed in a long time. Corrective actions, such as removing core files, can also be configured.

In addition to the system metrics monitored by PATROL, the KM for UNIX includes a set of tools to provide additional system monitoring: tools to monitor CPU usage, paging activity, I/O caching, swap activity, and system log files; tools to check filesystem and kernel file resources; and tools to monitor printer queues. The following list shows some of the parameters available for monitoring from the PATROL KM for UNIX:
The BMC PATROL KM for UNIX is supported on Bull, DG AViiON, DEC Alpha, DEC Ultra, Hewlett-Packard, NCR, Olivetti, OSF/1, Pyramid, RS/6000, SCO, Sequent, SGI, Sun Solaris, SunOS, Unisys, and UNIXWare systems.

Candle

The Candle Corporation provides software for mainframes and distributed systems. The Availability Command Center is a suite of integrated performance monitors and availability management solutions. The Candle Command Center for Distributed Systems is used to manage the performance and availability of computer systems and applications. Command Center solutions are available for UNIX, NT, IBM AIX, and MVS platforms.

The Command Center for Distributed Systems can monitor many systems from a single console. Candle's management agents provide detailed performance and availability metrics. The OMEGAMON Monitoring Agent for UNIX provides system information standardized across multiple UNIX platforms (IBM AIX, HP-UX, Sun Solaris, and SunOS). Available metrics include OS and CPU performance, process status, and disk performance. Disk performance is expressed as kilobytes per second, percent busy, and transfers per second. Disk performance and other tools can be launched from the Command Center console.

The Command Center provides some predefined threshold conditions for sending alerts. You also can change these conditions. If you decide to change the threshold conditions, they are automatically redistributed to the appropriate systems. Different alarm severity levels can be used. The Command Center's event correlation engine and Visual Policy Editor can be used to create rules that automatically recognize the symptoms of problems and develop automated responses.

Candle has performed additional testing of the Command Center with MC/ServiceGuard to ensure that its Command Center for Distributed Systems product runs in that environment. More information about Candle Corporation's products can be found on the Web at http://www.candle.com.
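The idea of threshold conditions with different severity levels, applied to the three disk measures mentioned above, can be sketched in Python as follows. This is an illustration only and does not reflect how the Candle Command Center is configured or implemented: the Threshold class, the evaluate function, the metric names, the severity labels, and every threshold value are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    """A hypothetical two-level threshold for one metric."""
    metric: str
    warning: float
    critical: float


# Assumed values for illustration; real limits depend on the hardware
# and workload and would be tuned by the administrator.
THRESHOLDS = [
    Threshold("kb_per_sec", warning=5000.0, critical=10000.0),
    Threshold("percent_busy", warning=70.0, critical=90.0),
    Threshold("transfers_per_sec", warning=200.0, critical=400.0),
]


def evaluate(sample, thresholds=THRESHOLDS):
    """Return (metric, severity, value) for each threshold that is exceeded."""
    alerts = []
    for threshold in thresholds:
        value = sample.get(threshold.metric)
        if value is None:
            continue  # metric not present in this sample
        if value >= threshold.critical:
            alerts.append((threshold.metric, "critical", value))
        elif value >= threshold.warning:
            alerts.append((threshold.metric, "warning", value))
    return alerts


# Hypothetical reading for one disk.
disk_sample = {
    "kb_per_sec": 1200.0,
    "percent_busy": 93.5,
    "transfers_per_sec": 250.0,
}

alerts = evaluate(disk_sample)
for metric, severity, value in alerts:
    print(f"{severity.upper()}: {metric} = {value}")
```

Keeping the thresholds in a data structure separate from the evaluation logic mirrors the point made above that conditions can be changed centrally and then redistributed to the monitored systems.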