4.4 Consumers of Memory









Memory is consumed by four things: the kernel, filesystem caches, processes, and intimately shared memory. When the system starts, it takes a small amount (generally less than 4 MB) of memory for itself. As it dynamically loads modules and requires additional memory, it claims pages from the free list. These pages are locked in physical memory, and cannot be paged out except in the most severe of memory shortages. Sometimes, on a system that is very short of memory, you can hear a pop from the speaker. This is actually the speaker being turned off as the audio device driver is being unloaded from the kernel. However, a module won't be unloaded if a process is actually using the device; otherwise, the disk driver could be paged out, causing difficulties. Occasionally, however, a system will experience a kernel memory allocation error. While there is a limit on the size of kernel memory,[7] the problem is caused by the kernel trying to get memory when the free list is completely exhausted. Since the kernel cannot always wait for memory to become available, this can cause operations to fail rather than be delayed. One of the subsystems that cannot wait for memory is the streams facility; if a large number of users try to log into a system at the same time, some logins may fail. Starting with Solaris 2.5.1, changes were made to expand the free list on large systems, which helps prevent the free list from ever being totally empty.

[7] This number is typically very large. On UltraSPARC-based Solaris systems, it is about 3.75 GB.







Processes have private memory to hold their stack space, heap, and data areas. The only way to see how much memory a process is actively using is to use /usr/proc/bin/pmap -x process-id, which is available in Solaris 2.6 and later releases.
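
As a quick illustration (not part of the original text), here is a minimal Perl sketch that prints the pmap -x summary line for each process ID given on the command line. It assumes the summary line begins with the word "total"; the exact column layout varies between Solaris releases, so treat the output as a starting point rather than a precise accounting.

#!/usr/bin/perl
# Sketch: print the pmap -x summary line for each PID given on the command
# line. Assumes /usr/proc/bin/pmap exists (Solaris 2.6 and later) and that
# the summary line begins with "total"; column layout varies by release.
foreach my $pid (@ARGV) {
    my @lines = `/usr/proc/bin/pmap -x $pid`;
    my ($total) = grep { /^\s*total/ } @lines;
    print "PID $pid: ", defined $total ? $total : "no pmap output\n";
}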







Intimately shared memory is a technique for sharing the low-level kernel information about pages, rather than just the memory pages themselves. This is a significant optimization in that it removes a great deal of redundant mapping information. It is of primary use in database applications such as Oracle, which benefit from having a very large shared memory cache. There are three special things worth noting about intimately shared memory. First, all intimately shared memory is locked, and cannot ever be paged out. Second, the memory management structures that are usually created independently for each process are created only once and shared between all processes. Third, the kernel tries to find large pieces of contiguous physical memory (4 MB) that can be used as large pages, which substantially reduces MMU overhead.







4.4.1 Filesystem Caching







The single largest consumer of memory is usually the filesystem-caching mechanism. In order for a process to read from or write to a file, the file needs to be buffered in memory. While this is happening, those pages are locked in memory. After the operation completes, the pages are unlocked and placed at the bottom of the free list. The kernel remembers the pages that store valid cached data. If the data is needed again, it is readily available in memory, which saves the system an expensive trip to disk. When a file is deleted or truncated, or if the kernel decides to stop caching a particular inode, any pages caching that data are placed at the head of the free list for immediate reuse. Most files, however, only become uncached through the action of the page scanner. Data that has been modified in the memory caches is periodically written to disk by fsflush on Solaris and bdflush on Linux, which we'll discuss a little later.





The amount of space used for this behavior is not tunable in Solaris; if you want to cache a large amount of filesystem data in memory, you simply need to buy a system with a lot of physical memory. Furthermore, since Solaris handles all its filesystem I/O by means of the paging mechanism, a large number of observed page-ins and page-outs is completely normal. In the Linux 2.2 kernel, this caching behavior is tunable: only a specific amount of memory is available for filesystem buffering. The min_percent variable controls the minimum percentage of system memory available for caching. The upper bound is not tunable. This variable can be found in the /proc/sys/vm/buffermem file. The format of that file is min_percent max_percent borrow_percent; note that max_percent and borrow_percent are not used.
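
In the same spirit as the bdflush scripts later in this section, a small sketch for inspecting this file might look like the following; the field names follow the format described above.

#!/usr/bin/perl
# Sketch for Linux 2.2: display /proc/sys/vm/buffermem. Only min_percent is
# honored by the kernel; max_percent and borrow_percent are ignored.
chomp(my $line = `cat /proc/sys/vm/buffermem`);
my ($min_percent, $max_percent, $borrow_percent) = split /\s+/, $line;
print "min_percent $min_percent\tmax_percent $max_percent\tborrow_percent $borrow_percent\n";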









4.4.2 Filesystem Cache Writes: fsflush and bdflush





Of course, the caching of files in memory is a huge performance boost; it often allows us to access main memory (a few hundred nanoseconds) when we would otherwise have to go all the way to disk (tens of milliseconds). Since the contents of a file can be operated upon in memory via the filesystem cache, it is important for data-reliability purposes to regularly write changed data to disk. Older Unix operating systems, like SunOS 4, would write the modified contents of memory to disk every 30 seconds. Solaris and Linux both implement a mechanism to spread this workload out, handled by the fsflush and bdflush processes, respectively.





This mechanism can have a substantial impact on a system's performance. It also explains some unusual disk statistics.







4.4.2.1 Solaris: fsflush








The maximum age of any memory-resident modified page is set by the autoup variable, which is thirty seconds by default; it can be increased safely to several hundred seconds if necessary. Every tune_t_fsflushr seconds (by default, every five seconds), fsflush wakes up and checks a fraction of total memory equal to tune_t_fsflushr divided by autoup (that is, by default, five-thirtieths, or one-sixth, of the system's total physical memory). It then flushes any modified entries it finds in the inode cache to disk; this inode flushing can be disabled by setting doiflush to zero. The page-flushing mechanism can be disabled entirely by setting dopageflush to zero, but this can have serious repercussions on data reliability in the event of a crash. Note that dopageflush and doiflush are complementary, not mutually exclusive.
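
To make the arithmetic concrete, here is a small sketch (not from the original text) that reports what fraction of physical memory fsflush examines on each wakeup; the values shown are the defaults, so substitute your own settings as appropriate.

#!/usr/bin/perl
# Worked example of the fsflush arithmetic above: each wakeup, fsflush
# examines tune_t_fsflushr/autoup of physical memory. Defaults shown;
# substitute the values from your own configuration if you have changed them.
my $autoup          = 30;   # maximum age of a modified page, in seconds
my $tune_t_fsflushr = 5;    # fsflush wakeup interval, in seconds
printf "fsflush examines %.1f%% of physical memory every %d seconds\n",
       100 * $tune_t_fsflushr / $autoup, $tune_t_fsflushr;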











4.4.2.2 Linux: bdflush








Linux implements a slightly different mechanism, which is tuned via the values in the /proc/sys/vm/bdflush file. Unfortunately, the tunable behavior of the bdflush daemon has changed significantly from the 2.2 kernels to the 2.4 kernels. I discuss each in turn.





In Linux 2.2, if the percentage of the filesystem buffer cache that is "dirty" (that is, changed and needs to be flushed) exceeds bdflush.nfract, then bdflush wakes up. Setting this variable to a high value means that cache flushing can be delayed for quite a while, but it also means that when it does occur, a lot of disk I/O will happen at once. A lower value spreads out disk activity more evenly. bdflush will write out a number of buffer entries equal to bdflush.ndirty; a high value here causes sporadic, bursting I/O, but a small value can lead to a memory shortage, since bdflush isn't being woken up frequently enough. The system will wait for bdflush.age_buffer or bdflush.age_super, in hundredths of a second, before writing a dirty data block or dirty filesystem metadata block to disk. Here's a simple Perl script for displaying, in a pretty format, the values of the bdflush configuration file:





#!/usr/bin/perl
# Display the Linux 2.2 bdflush tunables in a readable form.
my ($nfract, $ndirty, $nrefill, $nref_dirt, undef, $age_buffer, $age_super) =
    split (/\s+/, `cat /proc/sys/vm/bdflush`, 9);
print "Current settings of bdflush kernel variables:\n";
print "nfract\t\t$nfract\tndirty\t\t$ndirty\tnrefill\t\t$nrefill\n";
print "nref_dirt\t$nref_dirt\tage_buffer\t$age_buffer\tage_super\t$age_super\n";




In Linux 2.4, about the only thing that didn't change was the fact that bdflush still wakes up if the percentage of the filesystem buffer cache that is dirty exceeds bdflush.nfract. The default value of bdflush.nfract (the first in the file) is 30%; the range is from 0 to 100%. The minimum interval between wakeups and flushes is determined by the bdflush.interval parameter (the fifth in the file), which is expressed in clock ticks.[8] The default value is 5 seconds; the minimum is 0 and the maximum is 600. The bdflush.age_buffer tunable (the sixth in the file) governs the maximum amount of time, in clock ticks, that the kernel will wait before flushing a dirty buffer to disk. The default value is 30 seconds, the minimum is 1 second, and the maximum is 6,000 seconds. The final parameter, bdflush.nfract_sync (the seventh in the file), governs the percentage of the buffer cache that must be dirty before bdflush will activate synchronously; in other words, it is the hard limit after which bdflush will force buffers to disk. The default is 60%.

[8] There are typically 100 clock ticks per second.

Here's a script to extract values for these bdflush parameters in Linux 2.4:





#!/usr/bin/perl
# Display the Linux 2.4 bdflush tunables in a readable form.
my ($nfract, undef, undef, undef, $interval, $age_buffer, $nfract_sync) =
    split (/\s+/, `cat /proc/sys/vm/bdflush`, 9);
print "Current settings of bdflush kernel variables:\n";
print "nfract $nfract\tinterval $interval\tage_buffer $age_buffer\tnfract_sync $nfract_sync\n";




If the system has a very large amount of physical memory, fsflush and bdflush (we'll refer to them generically as flushing daemons) will have a lot of work to do every time they are woken up. However, most files that would have been written out by the flushing daemon have already been closed by the time they're marked for flushing. Furthermore, writes over NFS are always performed synchronously, so the flushing daemon isn't required. In cases where the system is performing lots of I/O but not using direct I/O or synchronous writes, the performance of the flushing daemons becomes important. A general rule for Solaris systems is that if fsflush has consumed more than five percent of the system's cumulative nonidle processor time, autoup should be increased.
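
As a rough aid for checking this rule (a sketch, not from the original text), the following prints fsflush's accumulated CPU time. It assumes fsflush is running as PID 3, which is typical on Solaris; the result still has to be compared against the system's cumulative nonidle processor time since boot.

#!/usr/bin/perl
# Rough check of the five-percent rule described above. Assumes fsflush is
# PID 3 (typical on Solaris); prints its name and accumulated CPU time so it
# can be compared against the system's cumulative nonidle processor time.
chomp(my $comm = `ps -o comm= -p 3`);
chomp(my $time = `ps -o time= -p 3`);
print "$comm has accumulated $time of CPU time since boot\n";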











4.4.3 Interactions Between the Filesystem Cache and Memory





Because Solaris has an untunable filesystem caching mechanism, it can encounter problems in some specific instances. The source of the problem is that the kernel allows the filesystem cache to grow to the point where it begins to steal memory pages from user applications. This behavior not only shortchanges other potential consumers of memory, but it means that filesystem performance becomes dominated by the rate at which the virtual memory subsystem can free memory.





There are two solutions to this problem: priority paging and the cyclic cache.







4.4.3.1 Priority paging








In order to address this issue, Sun introduced a new paging algorithm in Solaris 7, called priority paging, which places a boundary around the filesystem cache.[9] A new kernel variable, cachefree, is created, which scales with minfree, desfree, and lotsfree. The system attempts to keep cachefree pages of memory available; when the size of the free list is between lotsfree and cachefree, only filesystem cache pages are freed.

[9] The algorithm was later backported to Solaris 2.6.





The effect is generally excellent. Desktop systems and on-line transaction processing (OLTP) environments tend to feel much more responsive, and much of the swap device activity is eliminated; computational codes that do a great deal of filesystem writing may see as much as a 300% performance increase. By default, priority paging has been disabled until sufficient end-user feedback on its performance is gathered. It will likely become the new algorithm in future Solaris releases. In order to use this new mechanism, you need Solaris 7, or 2.6 with kernel patch 105181-09. To enable the algorithm, set the priority_paging variable to 1. You can also implement the change on a live 32-bit system by setting the cachefree tunable to twice the value of lotsfree.
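
For reference, a minimal sketch of the corresponding /etc/system entry (which takes effect at the next boot) might look like this; the live-system cachefree adjustment mentioned above is a separate step.

* Enable priority paging (Solaris 7, or 2.6 with kernel patch 105181-09)
set priority_paging=1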











4.4.3.2 Cyclic caching








A more technically elegant solution to this problem has been implemented in Solaris 8, primarily due to efforts by Richard McDougall, a senior engineer at Sun Microsystems. No special procedures need be followed to enable it. At the heart of this mechanism is a straightforward rule: nondirty pages that are not mapped anywhere should be on the free list. This rule means that the free list now contains all the filesystem cache pages, which has far-reaching consequences:





  • Application startup (or other heavy memory consumption in a short period of time) can occur much faster, because the page scanner is not required to wake up and free memory.
  • Filesystem I/O has very little impact on other applications on the system.
  • Paging activity is reduced to zero, and the page scanner is idle, when sufficient memory is installed.



As a result, analyzing a Solaris 8 system for a memory shortage is simple: if the page scanner reclaims any pages at all, there is a memory shortage. The mere activity of the page scanner means that memory is tight.
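
As an illustration (not part of the original text), here is a small sketch that samples vmstat and reports whether the page scanner is active. It assumes the scan rate (sr) is the twelfth column of vmstat output, as on stock Solaris; check your system's vmstat header before relying on it.

#!/usr/bin/perl
# Sketch: sample vmstat and report the scan rate (sr). On Solaris 8, any
# sustained nonzero scan rate indicates a memory shortage. Assumes sr is the
# twelfth column of vmstat output; verify against your vmstat header line.
my @lines = `vmstat 5 2`;
my @fields = split ' ', $lines[-1];   # last line is the second (live) sample
my $sr = $fields[11];
print $sr > 0 ? "page scanner active (sr = $sr): memory is tight\n"
              : "page scanner idle (sr = 0): no memory shortage\n";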











4.4.4 Interactions Between the Filesystem Cache and Disk







When data is being pushed out from memory to disk via fsflush, Solaris will try to gather modified pages that are adjacent to each other on disk, so that they can be written out in one continuous piece. This is governed by the maxphys kernel parameter. Set this parameter to a reasonably large value (1,048,576 is a good choice; it is the largest value that makes sense for a modern UFS filesystem). As we'll discuss in Section 6.6 in Chapter 6, a maxphys value of 1,048,576 with a 64 KB interlace size is sufficient to drive a 16-disk RAID 0 array to nearly full speed with a single file.
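
A common way to make this setting persistent is an /etc/system entry along the following lines (a sketch; the change takes effect at the next boot):

* Allow physical I/O transfers of up to 1 MB
set maxphys=1048576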





There is another case where memory and disk interact to give suboptimal performance. If your applications are constantly writing and rewriting files that are cached in memory, the in-memory filesystem cache is very effective. Unfortunately, the filesystem flushing process is regularly attempting to purge data out to disk, which may not be a good thing. For example, if your working set is 40 GB and fits entirely in available memory, with the default autoup value of 30, fsflush is attempting to synchronize up to 40 GB of data to disk every 30 seconds. Most disk subsystems cannot sustain 1.3 GB/second, which will mean that the application is throttled and waiting for disk I/O to complete, despite the fact that all of the working set is in memory!





There are three telltale signs for this case:





  • vmstat -p shows very low filesystem activity.

  • iostat -xtc shows constant disk write activity.

  • The application has a high wait time for file operations.



Increasing autoup (to, say, 840) and tune_t_fsflushr (to 120) will decrease the amount of data sent to disk, improving the chances of issuing a single larger I/O (rather than many smaller I/Os). You will also improve your chances of seeing write cancellation, in which not every modification to a file ends up being written to disk. The flip side is that you run a higher risk of losing data in the case of a server failure.
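
As a sketch, the corresponding /etc/system entries for the values suggested above would look like this (they take effect at the next boot):

* Spread out fsflush work: wake every 120 seconds, age dirty pages up to 840 seconds
set tune_t_fsflushr=120
set autoup=840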

















