4.4 Consumers of Memory
Memory is consumed by four
things: the kernel, filesystem caches, processes, and intimately
shared memory. When the system starts, the kernel takes a small amount
of memory (generally less than 4 MB) for itself. As it dynamically
loads modules and requires additional memory, it claims pages from
the free list. These pages are locked in physical memory and cannot
be paged out except in the most severe of memory shortages.
Sometimes, on a system that is very short of memory, you can hear a
pop from the speaker. This is actually the speaker being turned off
as the audio device driver is being unloaded from the kernel.
However, a module won't be unloaded while a process is
actually using the device; otherwise, a critical module such as the
disk driver could be unloaded out from under the system, causing
obvious difficulties. Occasionally, however, a system will
experience a kernel memory allocation error.
Although there is a limit on the size of kernel memory, the problem
is usually not that this limit has been reached; rather, the kernel
has tried to obtain memory while the free list was completely
exhausted. Since the kernel cannot always wait for memory
to become available, such requests can fail rather than be
delayed. One of the subsystems that cannot wait for memory is the
STREAMS facility; if a large number of users try to log into a system
at the same time, some logins may fail. Starting with Solaris 2.5.1,
changes were made to expand the free list on large systems, which
helps prevent the free list from ever being totally empty.
Processes have private memory to
hold their stack space, heap, and data areas. The only way to see how
much memory a process is actively using is to use
/usr/proc/bin/pmap -x process-id, which is
available in Solaris 2.6 and later releases.
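For example, a short Perl wrapper in the same spirit as the scripts later in this chapter (a sketch only; the process IDs are whatever you choose to pass it) can print the breakdown for several processes at once:
#!/usr/bin/perl
# Show the per-mapping memory breakdown for each process ID given on
# the command line; assumes Solaris 2.6 or later, where pmap supports -x.
die "usage: $0 process-id ...\n" unless @ARGV;
for my $pid (@ARGV) {
    print "=== pmap -x $pid ===\n";
    system ("/usr/proc/bin/pmap", "-x", $pid);
}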
Intimately shared memory is a technique
for allowing the sharing of low-level kernel information about pages,
rather than by sharing the memory pages themselves. This is a
significant optimization in that it removes a great deal of redundant
mapping information. It is of primary use in database applications
such as Oracle, which benefit from having a very large shared memory
cache. There are three special things worth noting about intimately
shared memory. First, all the intimately shared memory is locked, and
cannot ever be paged out. Second, the memory management structures
that are usually created independently for each process are only
created once, and shared between all processes. Third, the kernel
tries to find large pieces of contiguous physical memory (4 MB) that
can be used as large pages, which substantially reduces MMU overhead.
4.4.1 Filesystem Caching
The single largest consumer of
memory is usually the filesystem-caching mechanism. In order for a
process to read from or write to a file, the file needs to be
buffered in memory. When this is happening, these pages are locked in
memory. After the operation completes, the pages are unlocked and
placed at the bottom of the free list. The kernel remembers the pages
that store valid cached data. If the data is needed again, it is
readily available in memory, which saves the system an expensive trip
to disk. When a file is deleted or truncated, or if the kernel
decides to stop caching a particular inode, any pages caching that
data are placed at the head of the free list for immediate reuse.
Most files, however, only become uncached upon the action of the page
scanner. Data that has been modified in the memory caches is
periodically written to disk by fsflush on Solaris
and bdflush on Linux, which we'll
discuss a little later.
The amount of space used for this behavior is
not tunable in Solaris; if you want to cache a
large amount of filesystem data in memory, you simply need to buy a
system with a lot of physical memory. Furthermore, since Solaris
handles all its filesystem I/O by means of the paging mechanism, a
large number of observed page-ins and page-outs is completely normal.
In the Linux 2.2 kernel, this caching behavior is tunable: only a
specific amount of memory is available for filesystem buffering. The
min_percent variable controls the minimum
percentage of system memory available for caching. The upper bound is
not tunable. This variable can be found in the
/proc/sys/vm/buffermem file. The format of that
file is min_percent max_percent borrow_percent;
note that max_percent and
borrow_percent are not used.
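Here is a simple Perl script, in the same style as the bdflush scripts later in this chapter, for displaying the current values in that file (the field names simply follow the format described above):
#!/usr/bin/perl
# Display the Linux 2.2 buffer-cache tunables from /proc/sys/vm/buffermem.
my $buffermem = `cat /proc/sys/vm/buffermem`;
chomp $buffermem;
my ($min_percent, $max_percent, $borrow_percent) = split (/\s+/, $buffermem);
print "Current settings of buffermem kernel variables:\n";
print "min_percent\t$min_percent\n";
print "max_percent\t$max_percent\t(not used)\n";
print "borrow_percent\t$borrow_percent\t(not used)\n";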
4.4.2 Filesystem Cache Writes: fsflush and bdflush
Of course, the caching of files in memory is a huge performance
boost; it often allows us to access main memory (a few hundred
nanoseconds) when we would otherwise have to go all the way to disk
(tens of milliseconds). Since the contents of a file can be operated
upon in memory via the filesystem cache, it is important for
data-reliability purposes to regularly write changed data to disk.
Older Unix operating systems, like SunOS 4, would write the modified
contents of memory to disk every 30 seconds. Solaris and Linux both
implement a mechanism to spread this workload out, which is
implemented by the fsflush and
bdflush processes, respectively.
This mechanism can have substantial impacts on a
system's performance. It also explains some unusual
disk statistics.
4.4.2.1 Solaris: fsflush
The maximum age of any memory-resident
modified page is set by the autoup variable, which
is thirty seconds by default. It can be increased safely to several
hundred seconds if necessary. Every
tune_t_fsflushr seconds (by default, every five
seconds), fsflush wakes up and checks a fraction
of the total memory equal to tune_t_fsflushr
divided by autoup (that is, by default,
five-thirtieths, or one-sixth, of the system's total
physical memory). It then flushes any modified entries it finds from
the inode cache to disk; this inode-cache flushing can be disabled by
setting doiflush to zero. The page-flushing mechanism can
be totally disabled by setting dopageflush to
zero, but this can have serious repercussions on data reliability in
the event of a crash. Note that dopageflush and
doiflush are complementary, not mutually
exclusive.
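These variables are normally set in /etc/system and take effect at the next reboot. Here is a sketch of what such entries might look like; the values are purely illustrative, not recommendations:
* Maximum age, in seconds, of a modified page before fsflush writes it out
set autoup=240
* How often, in seconds, fsflush wakes up
set tune_t_fsflushr=5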
4.4.2.2 Linux: bdflush
Linux implements a slightly different
mechanism, which is tuned via the values in the
/proc/sys/vm/bdflush file. Unfortunately, the
tunable behavior of the bdflush daemon has
changed significantly from the 2.2 kernels to the 2.4 kernels. I
discuss each in turn.
In Linux 2.2, if the percentage of the filesystem buffer cache that
is "dirty" (that is, changed and
needs to be flushed) exceeds bdflush.nfract, then
bdflush wakes up. Setting this variable to a high
value means that cache flushing can be delayed for quite a while, but
it also means that when it does occur, a lot of disk I/O will happen
at once. A lower value spreads out disk activity more evenly.
bdflush will write out a number of buffer entries
equal to bdflush.ndirty; a high value here causes
sporadic, bursting I/O, but a small value can lead to a memory
shortage, since bdflush isn't
being woken up frequently enough. The system will wait for
bdflush.age_buffer or
bdflush.age_super, in hundredths of a second,
before writing a dirty data block or dirty filesystem metadata block
to disk. Here's a simple Perl script for displaying,
in a pretty format, the values of the bdflush
configuration file:
#!/usr/bin/perl
my ($nfract, $ndirty, $nrefill, $nref_dirt, undef, $age_buffer, $age_super) =
    split (/\s+/, `cat /proc/sys/vm/bdflush`, 9);
print "Current settings of bdflush kernel variables:\n";
print "nfract\t\t$nfract\tndirty\t\t$ndirty\tnrefill\t\t$nrefill\n";
print "nref_dirt\t$nref_dirt\tage_buffer\t$age_buffer\tage_super\t$age_super\n";
In Linux 2.4, about the only thing that didn't
change was the fact that bdflush still wakes up if
the percentage of the filesystem buffer cache that is dirty exceeds
bdflush.nfract. The default value of
bdflush.nfract (the first in the file) is 30%; the
range is from 0 to 100%. The minimum interval between wakeups and
flushes is determined by the bdflush.interval
parameter (the fifth in the file), which is expressed in clock
ticks. The default value is 5 seconds; the
minimum is 0 and the maximum is 600. The
bdflush.age_buffer tunable (the sixth in the file)
governs the maximum amount of time, in clock ticks, that the kernel
will wait before flushing a dirty buffer to disk. The default value
is 30 seconds, the minimum is 1 second, and the maximum is 6,000
seconds. The final parameter, bdflush.nfract_sync
(the seventh in the file), governs the percentage of the buffer cache
that must be dirty before bdflush will activate
synchronously; in other words, it is the hard limit after which
bdflush will force buffers to disk. The default is
60%. Here's a script to extract values for these
bdflush parameters in Linux 2.4:
#!/usr/bin/perl
my ($nfract, undef, undef, undef, $interval, $age_buffer, $nfract_sync) =
    split (/\s+/, `cat /proc/sys/vm/bdflush`, 9);
print "Current settings of bdflush kernel variables:\n";
print "nfract $nfract\tinterval $interval\tage_buffer $age_buffer\tnfract_sync $nfract_sync\n";
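To change one of these values, you write all nine fields back to the file. The following sketch (which assumes you are running as root, and that lowering nfract to 20% is actually what you want) adjusts only the first field and leaves the others alone:
#!/usr/bin/perl
# Lower bdflush.nfract (the first field) to 20%, preserving the other
# eight fields; must be run as root on a Linux 2.4 system.
my $settings = `cat /proc/sys/vm/bdflush`;
chomp $settings;
my @fields = split (/\s+/, $settings);
$fields[0] = 20;
open (BDFLUSH, "> /proc/sys/vm/bdflush")
    or die "cannot write /proc/sys/vm/bdflush: $!\n";
print BDFLUSH join (" ", @fields), "\n";
close (BDFLUSH);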
If the system has a very large amount of physical memory,
fsflush and bdflush
(we'll refer to them generically as
flushing daemons) will have a lot of work to
do every time they are woken up. However, most files that would have
been written out by the flushing daemon have already been closed by
the time they're marked for flushing. Furthermore,
writes over NFS are always performed synchronously, so the flushing
daemon isn't required. In cases where the system is
performing lots of I/O but not using direct I/O or synchronous
writes, the performance of the flushing daemons becomes important. A
general rule for Solaris systems is that if
fsflush has consumed more than five percent of the
system's cumulative nonidle processor time,
autoup should be increased.
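One rough way to check this (a sketch that assumes fsflush is running as process ID 3, its customary slot on Solaris) is to look at the cumulative CPU time that ps reports for it and compare that against the system's total nonidle processor time:
#!/usr/bin/perl
# Print the cumulative CPU time accumulated by fsflush, which normally
# runs as process ID 3 on Solaris; adjust the PID if necessary.
my @output = `/usr/bin/ps -o time -p 3`;
chomp @output;
print "fsflush cumulative CPU time: $output[-1]\n";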
4.4.3 Interactions Between the Filesystem Cache and Memory
Because Solaris has an untunable filesystem caching mechanism, it can
encounter problems in some specific circumstances. The source of the
problem is that the kernel allows the filesystem cache to grow to the
point where it begins to steal memory pages from user applications.
This behavior not only shortchanges other potential consumers of
memory, but it means that the filesystem performance becomes
dominated by the rate at which the virtual memory subsystem can free
memory.
There are two solutions to this problem: priority paging and the
cyclic cache.
4.4.3.1 Priority paging
In order to address this issue, Sun
introduced a new paging algorithm in Solaris 7, called
priority paging, which places a boundary
around the filesystem cache. A new kernel variable,
cachefree, is created, which scales with
minfree, desfree, and
lotsfree. The system attempts to keep
cachefree pages of memory available, but frees
filesystem cache pages only when the size of the free list is between
cachefree and lotsfree.
The effect is generally excellent. Desktop systems and on-line
transaction processing (OLTP) environments tend to feel much more
responsive, and much of the swap device activity is eliminated;
computational codes that do a great deal of filesystem writing may
see as much as a 300% performance increase. By default, priority
paging has been disabled until sufficient end-user feedback on its
performance is gathered. It will likely become the new algorithm in
future Solaris releases. In order to use this new mechanism, you need
Solaris 7, or 2.6 with kernel patch 105181-09. To enable the
algorithm, set the priority_paging variable to 1.
You can also implement the change on a live 32-bit system by setting
the cachefree tunable to twice the value of
lotsfree.
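If you go the /etc/system route, the entry is a one-liner (a sketch; remember that a reboot is required, and that Solaris 2.6 needs kernel patch 105181-09 first):
* Enable the priority paging algorithm
set priority_paging=1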
4.4.3.2 Cyclic caching
A more technically elegant solution to
this problem has been implemented in Solaris 8, primarily due to
efforts by Richard McDougall, a senior engineer at Sun Microsystems.
No special procedures need be followed to enable it. At the heart of
this mechanism is a straightforward rule: nondirty pages that are not
mapped anywhere should be on the free list. This rule means that the
free list now contains all the filesystem cache pages, which has
far-reaching consequences:
Application startup (or other heavy memory consumption in a short
period of time) can occur much faster, because the page scanner is
not required to wake up and free memory.
Filesystem I/O has very little impact on other applications on the
system.
Paging activity is reduced to zero and the page scanner is idle when
sufficient memory is available.
As a result, analyzing a Solaris 8 system for a memory shortage is
simple: if the page scanner reclaims any pages at
all, there is a memory shortage. The mere activity of the
page scanner means that memory is tight.
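A quick way to perform this check is to watch the scan rate ("sr") column that vmstat reports. The following sketch runs vmstat for a single five-second interval and prints the scan rate, locating the column by name rather than by position since the exact layout of vmstat output varies:
#!/usr/bin/perl
# Report the page scanner rate (the "sr" column of vmstat). On Solaris 8,
# any sustained nonzero value indicates a memory shortage. The last
# sample is used because the first reflects averages since boot.
my @lines = `vmstat 5 2`;
chomp @lines;
my ($header) = grep { /\bsr\b/ } @lines;
my @columns  = split ' ', $header;
my ($index)  = grep { $columns[$_] eq 'sr' } 0 .. $#columns;
my @sample   = split ' ', $lines[-1];
print "page scanner rate (sr): $sample[$index] pages/second\n";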
4.4.4 Interactions Between the Filesystem Cache and Disk
When data is being
pushed out from memory to disk via fsflush,
Solaris will try to gather modified pages that are adjacent to each
other on disk, so that they can be written out in one continuous
piece. This is governed by the maxphys kernel
parameter. Set this parameter to a reasonably large value (1,048,576
is a good choice; it is the largest value that makes sense for a
modern UFS filesystem). As we'll discuss in Section 6.6 in Chapter 6, a maxphys value of
1,048,576 with a 64 KB interlace size is sufficient to drive a
16-disk RAID 0 array to nearly full speed with a single file.
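maxphys is set in /etc/system like the other kernel tunables in this chapter; a sketch of the entry, using the value suggested above, looks like this:
* Largest single physical I/O transfer, in bytes
set maxphys=1048576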
There is another case where memory and disk interact to give
suboptimal performance. If your applications are constantly writing
and rewriting files that are cached in memory, the in-memory
filesystem cache is very effective. Unfortunately, the filesystem
flushing process is regularly attempting to purge data out to disk,
which may not be a good thing. For example, if your working set is 40
GB and fits entirely in available memory, with the default
autoup value of 30, fsflush is
attempting to synchronize up to 40 GB of data to disk every 30
seconds. Most disk subsystems cannot sustain 1.3 GB/second, which
will mean that the application is throttled and waiting for disk I/O
to complete, despite the fact that all of the working set is in
memory!
There are three telltale signs for this case:
vmstat -p shows very low filesystem activity.
iostat -xtc shows constant disk write activity.
The application has a high wait time for file operations.
Increasing autoup (to, say, 840) and
tune_t_fsflushr (to 120) will decrease the amount
of data sent to disk, improving the chances of issuing a single
larger I/O (rather than many smaller I/Os). You will also improve
your chances of seeing write cancellation, in which not every
modification to a file ends up being written to disk. The flip side
is that you run a higher risk of losing data in the event of a
server failure.
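Expressed as /etc/system entries, the settings described above would look something like this (a sketch; tune the numbers to your own workload and tolerance for data loss):
* Flush a smaller fraction of memory, less frequently
set autoup=840
set tune_t_fsflushr=120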