Collecting System Performance Data
Users call their IT department when they have delays in accessing data or applications. Good tools are needed to help an operator pinpoint the source of the problem. This section covers some of the interesting performance and resource-utilization metrics, and the tools available to collect data about these metrics.
A wide range of conditions may result in resource and performance problems. Running out of available memory may be caused by a failure of a memory component or by a memory leak in an application. A sudden rise in CPU utilization could indicate a processor failure or the introduction of a CPU-intensive application on the system. Analysis is needed to determine whether resource problems can be fixed with a configuration change, hardware repair, or other techniques.
Many important system resources have configured limits. The following system resource metrics are important to monitor:
Earlier, this chapter discussed some of the tools that can be used to check system resource usage. The sar and sysdef commands can compare current usage to configured limits. An EMS monitor is available to detect thresholds being exceeded for the following resources:
The performance tools discussed in this section can also detect resource usage problems.
Some system performance monitoring is available from the SAM Performance Monitors, with which an administrator can obtain information on system, disk, and virtual memory activity, among other things. Text-based information is displayed in a Motif window when one of the desired metrics is selected.
Historical information is important for understanding how system performance has varied over time. Knowing how your system behaves under normal conditions helps when you troubleshoot system performance problems. Note that the performance tools themselves affect the performance of the system, so you need to find a tool with low overhead.
This section describes some common tools for measuring and monitoring system performance. Here are some of the key metrics discussed in this section:
Buffer cache queue length: Refers to the number of processes blocked that are waiting for updates to the buffer cache. If this value is high, it could be an indication of a memory bottleneck.
Context switches: How often the CPU switches from running one process to running another.
CPU utilization: Expressed as a percentage of time spent in various execution states. Low utilization indicates that the CPU spent the majority of its time in the idle state.
CPU run queue length: The average number of processes in the run state waiting to be scheduled.
Memory utilization: Usually expressed as a ratio of the amount of memory in use versus the total memory available.
Paging: Refers to the transfer of data between virtual memory (disks) and physical memory.
Swapping: Refers to the transfer of data between physical memory and a special virtual memory area reserved for swapping.
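The two utilization metrics above are simple ratios, and a short sketch can make the arithmetic concrete. The following Python example is illustrative only: the state names, counter values, and function names are assumptions, not output from any of the tools discussed in this section. It derives CPU utilization from two snapshots of cumulative tick counters, and memory utilization from the amount in use versus the total.

```python
def cpu_utilization(prev, curr):
    """Return the percentage of time spent in each state between two snapshots.

    prev and curr map a state name (here "user", "system", "idle"; the
    names are illustrative) to a cumulative tick count.
    """
    deltas = {state: curr[state] - prev[state] for state in curr}
    total = sum(deltas.values())
    if total <= 0:
        # No ticks elapsed between the snapshots; report zero everywhere.
        return {state: 0.0 for state in curr}
    return {state: 100.0 * delta / total for state, delta in deltas.items()}


def memory_utilization(used_kb, total_kb):
    """Return memory in use as a percentage of total memory."""
    return 100.0 * used_kb / total_kb


# Hypothetical counter snapshots taken at the start and end of an interval.
prev = {"user": 1000, "system": 400, "idle": 8600}
curr = {"user": 1300, "system": 500, "idle": 9200}

cpu = cpu_utilization(prev, curr)
busy = 100.0 - cpu["idle"]          # everything that is not idle time
mem = memory_utilization(768, 1024)
```

Working from counter deltas rather than absolute values is what ties the percentages to a specific interval.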
Performance tools, such as BMC PATROL and MeasureWare, don't always provide the same set of metrics on all platforms. For simplicity, this section focuses on the Sun Solaris and HP-UX platforms only. Also, these products are continually being enhanced, so the actual metrics available for use in your environment may not precisely match the information presented in this section.
MeasureWare
HP MeasureWare Agent is a Hewlett-Packard product that collects and logs resource and performance metrics. MeasureWare agents run and collect data on the individual server systems being monitored. Agents exist for many platforms and operating systems, including HP-UX, Solaris, and AIX.
The MeasureWare agents collect data, summarize it, timestamp it, log it, and send alarms when appropriate. The agents collect and report on a wide variety of system resources, performance metrics, and user-defined data. The information can then be exported to spreadsheets or to performance analysis programs, such as PerfView. These programs can use the data to generate alarms that warn of potential performance problems. Historical data also reveals trends, which can help you address resource issues before they affect system performance.
MeasureWare agents collect data at three different levels: global system, application, and process. Global and application data is summarized at five-minute intervals, whereas process data is summarized at one-minute intervals. An administrator can define important applications by listing the processes that make up each application in a configuration file.
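The idea of summarizing samples at fixed intervals can be sketched as follows. This Python example is an illustration, not MeasureWare's implementation: it assumes that a summary is a simple mean of the samples in each interval, which the text does not specify, and the sample data is invented.

```python
from collections import defaultdict


def summarize(samples, interval_seconds):
    """Average (timestamp, value) samples into fixed-length intervals.

    Returns a list of (interval_start, mean_value) tuples sorted by time.
    Using the mean as the summary is an assumption made for illustration.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Align each sample to the start of the interval that contains it.
        start = timestamp - (timestamp % interval_seconds)
        buckets[start].append(value)
    return [(start, sum(values) / len(values))
            for start, values in sorted(buckets.items())]


# Hypothetical CPU-use samples: (seconds since start, percent busy).
samples = [(0, 20.0), (60, 30.0), (120, 40.0), (180, 30.0), (240, 30.0),
           (300, 80.0), (360, 90.0)]

five_minute = summarize(samples, 300)   # interval used for global and application data
one_minute = summarize(samples, 60)     # interval used for process data
```

The longer interval smooths out short bursts, which is one reason finer-grained summaries are useful when looking at individual processes.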
Table 4-4. Categories of MeasureWare Agent Information
System | CPU, disk, networking, memory, process queue depths, user/process information, and summary information
Application | CPU, disk, memory, process count, average process wait states, and summary information
Process | CPU, disk, memory, average process wait states, overall process lifetime, and summary information
Transaction | Transaction count, average response time, distribution of response time metrics, and aborted transactions
The basic categories of MeasureWare data are listed in Table 4-4. Also included are optional modules for database and networking support. MeasureWare agents also collect data provided through the DSI interface.
The following lists the global system metrics that are available from MeasureWare on HP-UX and Sun Solaris. Additional metrics provided by MeasureWare are covered in other chapters.
CPU use during interval
Number and rate of physical disk inputs/outputs
Maximum percent full of all disk file sets
System CPU use during interval
User CPU use during interval
CPU use at nice priorities
CPU idle time during interval
Rate of system procedure calls during interval
Main memory use
Swap space use on disk
Number and rate of memory page faults during interval
Number of process swaps during interval
Percentage of virtual memory currently in active use
Number of processes in run queue during interval
Number of processes waiting for a disk during interval
Number of processes waiting for memory during interval
Number of processes currently in sleep state during interval
Number of processes waiting for other reasons during interval
Number of user sessions during interval
Number of processes alive during interval
Number of processes active during interval
Number of processes started during interval
Number of processes completed during interval
Average runtime of completing process during interval
Operating system version
Number of processors in the system
Number of disk devices and their device IDs
Main memory size
Swapping space allocated
Disk I/O information (see Chapter 5)
Networking statistics (see Chapter 6)
Note that, in addition to performance metrics, MeasureWare provides useful configuration information, such as number of processors and the number of disk devices.
The following additional global system metrics are available on HP-UX:
CPU use at real-time priorities
CPU use for context switching during interval
CPU use for interrupt handling during interval
Number of processes waiting for interprocess communications during interval
Number of processes waiting on network transfers during interval
Number and rate of terminal transactions during interval
Average terminal transaction "think" time
Average terminal transaction first response time
Average terminal response to prompt time
Distribution of transaction first response times
Distribution of transaction response to prompt times
You can have alarms sent based on conditions that involve a combination of metrics. For example, a CPU bottleneck alarm can be based on the CPU use and CPU run queue length.
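A condition that combines metrics can be sketched in a few lines. The Python example below is a minimal illustration: the function name and both threshold values are hypothetical, and the actual alarm definitions used by MeasureWare are not shown in this section.

```python
def cpu_bottleneck(cpu_util_pct, run_queue_len, util_limit=90.0, queue_limit=3):
    """Return True only when CPU use and run queue length both exceed limits.

    The default limits are placeholders chosen for illustration.
    """
    return cpu_util_pct > util_limit and run_queue_len > queue_limit


# A busy CPU with few waiting processes is not flagged as a bottleneck...
busy_but_keeping_up = cpu_bottleneck(95.0, 1)
# ...but a busy CPU with a long run queue is.
busy_and_backed_up = cpu_bottleneck(95.0, 5)
```

Requiring both conditions reduces false alarms from a system that is heavily used but still keeping up with its workload.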
MeasureWare agents provide these alarms to PerfView for analysis, and to the IT/O management console. SNMP traps can also be sent at the time a threshold condition is met. Automated actions can be taken, or the operator can choose to take a suggested action.
MeasureWare's extract command can be used to export data to other tools, such as spreadsheet programs. Additionally, Application Resource Measurement (ARM) APIs (described in detail in Chapter 7) can be used to instrument applications so that response times can be measured. The application response time information can be passed along to MeasureWare agents for analysis.
Although MeasureWare provides extensive performance and resource information, it provides limited configuration information and no data about system faults. For further information, visit the HP Resource and Performance Management Web site at http://www.openview.hp.com/solutions/application/.
GlancePlus
GlancePlus is a real-time, graphical performance monitoring tool from Hewlett-Packard. It is used to monitor the performance and system resource utilization of a single system. Both Motif-based and character-based interfaces are available. The product can be used on HP-UX, Sun Solaris, and many other operating systems.
GlancePlus collects information similar to the information collected by MeasureWare, and samples data more frequently than MeasureWare. GlancePlus can be used to graphically view the following:
Current CPU, memory, swap, and disk activity and utilization (see Figure 4-9)
Application and process information
Transaction information, if the MeasureWare Agent is installed and active
Alarm information, color-coded to reflect severity
CPU utilization, with per-processor information available for multiprocessor systems
Memory utilization, split among cache, user, and system memory
Disk utilization, with the I/O paths of the top disk users indicated
I/O activity, by filesystem or logical volume
GlancePlus is also capable of setting and receiving performance-related alarms. Customizable rules determine when a system performance problem should be sent as an alarm. The rules are managed by the GlancePlus Adviser. The Adviser menu gives you the option to Edit Adviser Syntax. When you select this option, all the alarm conditions are shown, and you can then modify them.
Listing 4-13 Defining alarms in GlancePlus.
alarm CPU_Bottleneck > 50 for 2 minutes
  start
    if CPU_Bottleneck > 90 then
      red alert "CPU Bottleneck probability= ", CPU_Bottleneck, "%"
    else
      yellow alert "CPU Bottleneck probability= ", CPU_Bottleneck, "%"
  repeat every 10 minutes
    if CPU_Bottleneck > 90 then
      red alert "CPU Bottleneck probability= ", CPU_Bottleneck, "%"
    else
      yellow alert "CPU Bottleneck probability= ", CPU_Bottleneck, "%"
  end
    reset alert "End of CPU Bottleneck Alert"
Alarms result in onscreen notification, with the color representing the criticality of the alarm. An alarm can also trigger a command or script to be executed automatically. Instead of sending an alarm, GlancePlus can print messages or notify you by executing a UNIX command, such as mailx, using its EXEC feature.
To configure events, you need to edit a configuration file. The GlancePlus Adviser syntax file (/var/opt/perf/adviser.syntax) contains symptom and alarm configuration. Additional syntax files can also be used. A condition for an alarm to be sent can be based on rules involving different symptoms. Listing 4-13 shows an example of how you can set up an alarm for CPU bottlenecks that is based on CPU utilization and the size of the run queue.
You can also execute scripts in command mode. To execute a script, type:
glance -adviser_only -syntax <script file name>
In the example in Listing 4-13, a yellow alert is sent to the GlancePlus Alarm screen if a CPU bottleneck is suspected. As a bottleneck becomes more likely, the alarm changes to red. You can define the threshold for when the alarm should be sent. The symptoms are re-evaluated at every time interval.
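The timing behavior of the alarm in Listing 4-13 can be modeled with a small simulation. The Python sketch below is a simplified model under stated assumptions, not the GlancePlus Adviser itself: it assumes one symptom value per minute, and the function and parameter names are invented for illustration. The alarm starts once the symptom has stayed above the start level for the hold period, repeats at a fixed period while the condition persists, and resets when the symptom drops back.

```python
def run_alarm(samples, start_level=50, red_level=90,
              hold_minutes=2, repeat_minutes=10):
    """Simulate the alarm over one symptom value per minute.

    Returns a list of (minute, message) tuples, where message is
    "red", "yellow", or "reset".
    """
    events = []
    above_since = None   # minute at which the condition first became true
    active = False       # True once the start alert has been issued
    last_alert = None    # minute of the most recent alert
    for minute, value in enumerate(samples):
        if value > start_level:
            if above_since is None:
                above_since = minute
            color = "red" if value > red_level else "yellow"
            if not active and minute - above_since >= hold_minutes:
                active = True
                last_alert = minute
                events.append((minute, color))      # "start" section
            elif active and minute - last_alert >= repeat_minutes:
                last_alert = minute
                events.append((minute, color))      # "repeat" section
        else:
            if active:
                events.append((minute, "reset"))    # "end" section
            above_since = None
            active = False
    return events


# Hypothetical bottleneck probabilities, one per minute.
samples = [40, 60, 70, 95] + [60] * 10 + [30]
events = run_alarm(samples)
```

In this run, the alarm starts red at minute 3 after the symptom has been above 50 for two minutes, repeats as yellow ten minutes later, and resets when the value falls to 30.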
Here is a sampling of some of the useful system metrics that can be monitored with GlancePlus:
CPU utilization
CPU run queue length
Number of processors
Filesystem buffer cache queue length
Disk utilization and queue length
Physical memory capacity
Amount of physical memory available
Memory page fault rate
Total swap space
Amount of swap space available
Filesystem I/O rates
Amount of buffer cache available
Available shared memory
Available file table entries
Available process table entries
Most active processes
Wait states
System table resources
Open file information
More than 600 metrics are accessible from GlancePlus. Some of these metrics are discussed in other chapters. The complete list of metrics can be found by using the online help facility. This information can also be found in the directory /opt/perf/paperdocs/gp/C.
GlancePlus allows filters to be used to reduce the amount of information shown. For example, you can set up a filter in the Process view to show only the more active system processes.
GlancePlus can also show short-term historical information. When selected, the alarm buttons, visible on the main GlancePlus screen, show a history of alarms that have occurred.
GlancePlus also shows Process Resource Manager behavior, if PRM is installed, and allows the PRM process group entitlements to be changed.
For further information, visit the HP Resource and Performance Management Web site at http://www.openview.hp.com/solutions/application/.
PerfView
PerfView is a graphical performance analysis tool from Hewlett-Packard. It is used to graphically display performance and system resource utilization for one system or multiple systems simultaneously, so that comparisons can be made. A variety of performance graphs can be displayed. The graphs are based on data collected over a period of time, unlike the real-time graphs of GlancePlus. This tool runs on HP-UX or NT systems and works with data collected by MeasureWare agents.
PerfView has the following three main components:
PerfView Monitor: Provides the ability to receive alarms. A textual description of an alarm can be displayed. Alarms can be filtered by severity, type, or source system. Also, after an alarm is received, the alarm can be selected to display a graph of related metrics. An operator can monitor trends leading to failures and then take proactive actions to avoid problems. Graphs can be used for comparison between systems and to show a history of resource consumption. An internal database is maintained that keeps a history of alarm notification messages.
PerfView Analyzer: Provides resource and performance analyses for disks and other resources. System metrics can be shown at three different levels: process, application (configured by the user as a set of processes), and global system information. It relies on data received from MeasureWare agents on managed nodes. Data can be analyzed from up to eight systems concurrently. All MeasureWare data sources are supported. PerfView Analyzer is required by both PerfView Monitor and PerfView Planner.
PerfView Planner: Provides forecasting capability. Graphs can be extrapolated into the future. A variety of graphs (such as linear, exponential, s-curve, and smoothed) can be shown for forecasted data.
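The simplest of the forecasting styles mentioned above, linear extrapolation, can be sketched with an ordinary least-squares fit. This Python example illustrates the general technique only: it is not PerfView Planner's algorithm, and the function names and utilization figures are made up.

```python
def linear_fit(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(xs, ys, future_x):
    """Extrapolate the fitted line to a future x value."""
    slope, intercept = linear_fit(xs, ys)
    return slope * future_x + intercept


# Hypothetical history: disk utilization (percent) at the end of each week.
weeks = [1, 2, 3, 4]
disk_pct = [50.0, 54.0, 58.0, 62.0]

slope, intercept = linear_fit(weeks, disk_pct)
week_8 = forecast(weeks, disk_pct, 8)
```

A linear fit suits steady growth. The exponential and s-curve styles would need a different model fitted to the same history.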
PerfView can be used to monitor critical system resources. Figure 4-10 shows the PerfView Analyzer graphing memory utilization and paging rates. Other predefined graphs exist for history, CPU, memory, and queue information. For example, the history graph shows CPU, active processes, disk utilization, memory pageout rates, and swapout rates.
The PerfView Analyzer graph shown in Figure 4-11 compares the performance of two systems simultaneously. Up to eight systems can be compared in one graph. Comparing system utilization can be useful when determining where to deploy new applications, or when adding new users.
PerfView's ability to show history and trend information can be helpful in diagnosing system problems. Graphing performance information can help you to understand whether a persistent problem exists or if an anomaly is simply a momentary spike of activity.
To diagnose a problem further, PerfView Monitor lets users change the time interval to find the specific time a problem occurred. The graph is then redrawn to show the new time period.
PerfView is integrated with several other monitoring tools. You can launch GlancePlus from within PerfView by accessing the Tools menu. PerfView can be launched from the IT/O Applications Bank as well. When troubleshooting an event in the IT/O Message Browser window, you can launch PerfView to see a related performance graph.
PerfView Monitor is not used with IT/O. Instead, the IT/O Message Browser is used. When an alarm is received in IT/O, the operator can click the alarm and a related PerfView graph can be shown.
PerfView can show information collected from multiple systems in a single performance graph. The PerfView and ClusterView products have also been integrated to enable the operator to select a cluster symbol on an HP OpenView submap and launch the PerfView application. This quickly shows a performance comparison between all systems in the cluster.
For further information, visit the HP Resource and Performance Management Web site at http://www.openview.hp.com/solutions/application/.
BMC PATROL for UNIX
BMC Software provides monitoring capabilities through its PATROL software suite. PATROL is a system, application, and event management suite for system and database administrators. PATROL provides the basic framework for defining thresholds, sending and translating events, and so forth. Optional products, called Knowledge Modules (KMs), are capable of monitoring specific components. For example, BMC PATROL includes KMs for UNIX, SAP R/3, Oracle, Informix, and other applications. In fact, more than 40 KMs are available from BMC for use with PATROL.
With the PATROL KM for UNIX, managed components include the CPU, memory, users, kernel, processes, printers, security, and filesystems. These components are discovered automatically and represented on the PATROL console with status icons. System utilization can be shown as graphs, to capture trends, and data can either be displayed in real time or saved in log files.
Like other graphical monitoring tools, PATROL provides an Event Manager window, which can show received events. Figure 4-12 highlights disk and NFS events received at the console.
For memory and swap resources, PATROL can show total real memory available, total virtual memory available, a list of swap devices, the number of processes swapped, and swap space utilization.
For the CPU, PATROL can show bottlenecks and utilization information, along with a variety of statistics, such as CPU idle time, run queue length, and swap queue length. Information about the operating system itself is also maintained, such as the name, version, and creation date.
PATROL can display the total number of processes, the number of zombie processes, and heavy CPU users. Through the PATROL console, you can perform administrative tasks, such as reprioritizing processes.
PATROL also can display the total number of users and sessions, and can check security by monitoring the number of failed user and privileged logins. You can check the printer queue to see how many jobs are in the queue and to determine the state of the printer.
PATROL can monitor the filesystem and can automatically determine the effectiveness of the buffer cache. Regular reports can be generated to check disk usage per user, to create a list of the largest files, or to list files that have not been accessed in a long time. Corrective actions, such as removing core files, can also be configured.
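Reports such as "largest files" and "files not accessed in a long time" can be approximated with a directory walk. The Python sketch below is a generic illustration that uses only the standard library. It does not reflect how PATROL gathers this data, and the function names, file names, and age limit are assumptions.

```python
import os
import tempfile
import time


def scan_files(root):
    """Return a list of (path, size_bytes, last_access_time) under root."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            found.append((path, info.st_size, info.st_atime))
    return found


def largest_files(files, count):
    """Return the count largest entries, biggest first."""
    return sorted(files, key=lambda entry: entry[1], reverse=True)[:count]


def stale_files(files, max_age_days, now=None):
    """Return entries whose last access is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return [entry for entry in files if entry[2] < cutoff]


# Demonstration on a throwaway directory with three hypothetical files.
with tempfile.TemporaryDirectory() as root:
    for name, size in (("small.log", 10), ("big.dat", 5000), ("core", 200)):
        with open(os.path.join(root, name), "wb") as handle:
            handle.write(b"x" * size)
    # Pretend the core file was last touched 400 days ago.
    old = time.time() - 400 * 86400
    os.utime(os.path.join(root, "core"), (old, old))

    files = scan_files(root)
    biggest = [os.path.basename(p) for p, _s, _a in largest_files(files, 2)]
    stale = [os.path.basename(p) for p, _s, _a in stale_files(files, 365)]
```

Note that access times are only meaningful on filesystems that record them. Filesystems mounted with access-time updates disabled would make every file look stale.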
In addition to the system metrics monitored by PATROL, the KM for UNIX includes a set of tools to provide additional system monitoring, including tools to monitor CPU usage, paging activity, I/O caching, swap activity, and system log files, tools to check filesystem and kernel file resources, and tools to monitor printer queues.
The following list shows some of the parameters available for monitoring from the PATROL KM for UNIX:
CPUCpuUtil
CPUIdleTime
CPUInt
CPULoad
CPUProcsWaiting
CPUProcSwch
CPURunQSize
CPUSysTime
CPUUserTime
KERSysCall
MEMActiveVirPage
MEMFreeMem
MEMPageAnticipated
MEMPageFreed
MEMPageIn
MEMPageOut
MEMPageScanned
PRNQlength
PROCAvgUsrProc
PROCCpuHogs
PROCNoZombies
PROCNumProcs
PROCProcWait
PROCUserProcs
SWPSwapFreeSpace
SWPSwapIn
SWPSwapOut
SWPSwapSize
SWPSwapUsedPercent
USRNoSession
USRNoUser
The BMC PATROL KM for UNIX is supported on Bull, DG AViiON, DEC Alpha, DEC Ultrix, Hewlett-Packard, NCR, Olivetti, OSF/1, Pyramid, RS/6000, SCO, Sequent, SGI, Sun Solaris, SunOS, Unisys, and UnixWare systems.
Candle
The Candle Corporation provides software for mainframes and distributed systems. The Availability Command Center is a suite of integrated performance monitors and availability management solutions. The Candle Command Center for Distributed Systems is used to manage the performance and availability of computer systems and applications. Command Center solutions are available for UNIX, NT, IBM AIX, and MVS platforms. The Command Center for Distributed Systems can monitor many systems from a single console.
Candle's management agents provide detailed performance and availability metrics. The OMEGAMON Monitoring Agent for UNIX provides system information standardized across multiple UNIX platforms (IBM AIX, HP-UX, Sun Solaris, and SunOS). Available metrics include OS and CPU performance, process status, and disk performance. Disk performance is expressed as kilobytes per second, percent busy, and transfers per second. Disk performance and other tools can be launched from the Command Center console.
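The three disk performance figures named above are rates derived from counters sampled at two points in time. The following Python sketch shows the arithmetic under assumed inputs: the counter names and values are hypothetical and do not represent the OMEGAMON agent's data format.

```python
def disk_rates(prev, curr, elapsed_seconds):
    """Derive per-second disk rates from two cumulative counter snapshots.

    Each snapshot holds kilobytes transferred, transfers completed, and
    seconds the device was busy. The field names are illustrative.
    """
    kb_per_sec = (curr["kbytes"] - prev["kbytes"]) / elapsed_seconds
    xfers_per_sec = (curr["transfers"] - prev["transfers"]) / elapsed_seconds
    busy_pct = 100.0 * (curr["busy_seconds"] - prev["busy_seconds"]) / elapsed_seconds
    return {
        "kb_per_sec": kb_per_sec,
        "transfers_per_sec": xfers_per_sec,
        "percent_busy": busy_pct,
    }


# Hypothetical snapshots taken 60 seconds apart.
prev = {"kbytes": 10000, "transfers": 500, "busy_seconds": 20.0}
curr = {"kbytes": 16000, "transfers": 800, "busy_seconds": 35.0}

rates = disk_rates(prev, curr, 60)
```

Looking at the three figures together is more informative than any one alone. For example, a high percent busy with a low transfer rate may point to slow individual operations rather than sheer volume.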
The Command Center provides some predefined threshold conditions for sending alerts. You also can change these conditions. If you decide to change the threshold conditions, they are automatically redistributed to the appropriate systems. Different alarm severity levels can be used.
The Command Center's event correlation engine and Visual Policy Editor can be used to create rules that automatically recognize the symptoms of problems and develop automated responses.
Candle has performed additional testing of the Command Center with MC/ServiceGuard to ensure that its Command Center for Distributed Systems product runs in that environment. More information about Candle Corporation's products can be found on the Web at http://www.candle.com.