Monday, October 26, 2009

Section 18.5.  Checksums


















A checksum is a redundant field used by network protocols to recognize transmission errors. Some checksums can not only detect errors but also automatically correct errors of certain types.


The idea behind a checksum is simple. Before transmitting a packet, the sender computes a small, fixed-length field (the checksum) containing a sort of hash of the data. If a few bits of the data were to change during transit, it is likely that the corrupted data would produce a different checksum. Different checksum functions provide different levels of reliability. The checksum used by the IP protocol is a simple one involving sums and one's complements, which is too weak to be considered reliable. For a more reliable sanity check, you must rely on L2 CRCs or SSL/IPsec Message Authentication Codes (MACs).


Different protocols can use different checksum algorithms. The IP protocol checksum covers only the IP header. Most L4 protocols' checksums cover both their header and the data.


It may seem redundant to have a checksum at L2 (e.g., Ethernet), another one at L3 (e.g., IP), and another one at L4 (e.g., TCP), because they often all apply to overlapping portions of data, but the checks are valuable. Errors can occur not only during transmission, but also while moving data between layers. Moreover, each protocol is responsible for ensuring its own correct transmission, and cannot assume that layers above or below it take on that task.


As an example of the complex scenarios that can arise, imagine that PC A in LAN1 sends data over the Internet to PC B in LAN2. Let's also suppose that the L2 protocol used in LAN1 uses a checksum but that the one on LAN2 doesn't. It's important for at least one higher layer to provide some form of checksum to reduce the likelihood of accepting corrupted data.


The use of a checksum is recommended in every protocol definition, although it is not required. Nevertheless, one has to admit that a better design of related protocols could remove some of the overhead imposed by features that overlap in the protocols at different layers. Because most L2 and L4 protocols provide checksums, having one at L3 as well is not strictly necessary. For exactly this reason, the checksum has been removed from IPv6.


In IPv4, the IP checksum is a 16-bit field that covers the entire IP header, options included. The checksum is first computed by the source of the packet, and is updated hop by hop all the way to its destination to reflect changes to the header applied by each router. Before updating the checksum, each hop first has to check the sanity of the packet by comparing the checksum included in the packet with the one computed locally. A packet is discarded if the sanity check fails, but no ICMP is generated: the L4 protocol will take care of it (for example, with a timer that will force a retransmission if no acknowledgment is received within a given amount of time).


Here are some cases that trigger the need to update the checksum:



Decrementing the TTL


A router has to decrement a packet's TTL in its IP header before forwarding it. Since the IP checksum also covers that field, the original checksum is no longer valid. You will see in the section "ip_forward Function" in Chapter 20 that the TTL is decreased with ip_decrease_ttl, which takes care of the checksum, too.


Packet mangling (including NAT)


Any feature that changes one or more of the IP header fields forces the checksum to be recomputed. NAT is probably the best-known example.


IP options handling


Since the options are part of the header, they are covered by the checksum. Therefore, every time they are processed in a way that requires adding to or modifying the IP header (e.g., the addition of a timestamp), the checksum has to be recomputed.


Fragmentation


When a packet is fragmented, each fragment has a different header. Most of the fields remain unchanged, but the ones that have to do with fragmentation, such as offset, are different. Therefore, the checksum has to be recomputed.


Since the checksum used by the IP protocol is computed using the same simple algorithm that is used by TCP, UDP, and ICMP, a general set of functions has been written to be used by all of them. There is also a specialized function optimized for the IP checksum. According to the definition of the IP checksum algorithm, the header is split into 16-bit words that are summed, and the sum is then one's-complemented. Figure 18-13 shows an example of checksum computation on only two 16-bit words for simplicity. Linux does not sum 16-bit words; instead, it sums 32-bit words and even 64-bit longs, which results in faster computation (this requires an extra step between the computation of the sum and its one's complement; see the description of csum_fold in the next section). The function that implements the algorithm, called ip_fast_csum, is written directly in assembly language on most architectures.



Figure 18-13. IP checksum computation
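The summation just described can be sketched in a few lines of portable C. This is only an illustration, not the kernel's code: the function names below are hypothetical, and the real ip_fast_csum is hand-written assembly on most architectures. The sketch adds a buffer once as 16-bit words and once as 32-bit words, to show that the wider sum yields the same checksum once the extra folding step is applied.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fold a 32-bit one's-complement sum into 16 bits
 * (end-around carry) and complement the result. */
uint16_t fold_and_complement(uint32_t sum)
{
    sum = (sum & 0xFFFF) + (sum >> 16);
    sum = (sum & 0xFFFF) + (sum >> 16);   /* absorb the carry of the first fold */
    return (uint16_t)~sum;
}

/* Textbook form: add the buffer as 16-bit words in network byte order.
 * Assumes len is even, as it always is for an IP header. */
uint16_t checksum_by_16bit_words(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    return fold_and_complement(sum);
}

/* Wider form: add 32-bit words into a 64-bit accumulator, then fold
 * 64 -> 32 -> 16 bits. Assumes len is a multiple of 4. */
uint16_t checksum_by_32bit_words(const uint8_t *buf, size_t len)
{
    uint64_t acc = 0;

    for (size_t i = 0; i + 3 < len; i += 4)
        acc += ((uint32_t)buf[i] << 24) | ((uint32_t)buf[i + 1] << 16) |
               ((uint32_t)buf[i + 2] << 8) | buf[i + 3];
    acc = (acc & 0xFFFFFFFFu) + (acc >> 32);
    acc = (acc & 0xFFFFFFFFu) + (acc >> 32);
    return fold_and_complement((uint32_t)acc);
}
```

For a 20-byte sample header (45 00 00 73 00 00 40 00 40 11 00 00 c0 a8 00 01 c0 a8 00 c7, with the checksum field zeroed), both functions return 0xB861.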




18.5.1. APIs for Checksum Computation










The L3 (IP) checksum is much faster to compute than the L4 checksum, because it covers only the IP header. Because it's a cheap operation, it is often computed in software.


The set of general functions used to compute checksums is placed in the per-architecture files include/asm-xxx/checksum.h. (The one for the i386 platform, for instance, is include/asm-i386/checksum.h.) Each protocol calls the general function directly using the right input parameters, or defines a wrapper that calls the general functions. The checksumming algorithm allows a protocol to simply update a checksum, instead of recomputing it from scratch, when changing a previously checksummed piece of data such as the IP header.


The prototype for one IP-specific function in checksum.h, ip_fast_csum, is shown here. The function takes as parameters the pointer to the IP header (iph), and its length (ihl). The latter can change due to IP options. The return value is the checksum. This function takes advantage of the fact that the IP header is always a multiple of 4 bytes in length to streamline some of the processing.



static inline
unsigned short ip_fast_csum(unsigned char * iph, unsigned int ihl)



When computing the checksum of an IP header on a packet to be transmitted, the value of iphdr->check should first be zeroed out because the checksum should not reflect the checksum itself. In this algorithm, because it uses simple summing, a zero-value field is effectively excluded from the resulting checksum. This is why in different places in the code you can see that this field is zeroed right before the call to ip_fast_csum.


The checksum algorithm has an interesting property that may initially confuse people who read the source code for packet forwarding and reception. If the checksum is correct, and the forwarding or receiving node runs the algorithm over the entire header (leaving the original iphdr->check field in place), a result of zero is obtained. If you look at the function ip_rcv, you can see that this is exactly how input packets are validated against the checksum. This way of checking for corruption is faster than the more intuitive way of zeroing out the iphdr->check field and recomputing.
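The zero-result property follows from one's-complement arithmetic: the stored checksum is the complement of the sum of all the other words, so adding it back yields 0xFFFF, whose complement is zero. The sketch below illustrates both sides under simple assumptions: the header is a raw byte array, the checksum field sits at bytes 10 and 11 (its position in the IPv4 header), and the function names are hypothetical rather than kernel APIs.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: complemented one's-complement sum over an
 * even-length header, reading 16-bit words in network byte order. */
uint16_t hdr_csum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Sender side: zero the checksum field first, so that it does not
 * contribute to the sum, then store the result. */
void hdr_set_check(uint8_t *hdr, size_t len)
{
    uint16_t check;

    hdr[10] = 0;
    hdr[11] = 0;
    check = hdr_csum(hdr, len);
    hdr[10] = (uint8_t)(check >> 8);
    hdr[11] = (uint8_t)(check & 0xFF);
}

/* Receiver side: run the same sum with the checksum field left in
 * place; a result of zero means the header is intact. */
int hdr_check_ok(const uint8_t *hdr, size_t len)
{
    return hdr_csum(hdr, len) == 0;
}
```

A receiver built this way never needs to save, zero, and restore the checksum field: it only tests the result against zero.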


Here are the main functions used to compute or update an IP checksum:



ip_compute_csum


A general-purpose function that computes a checksum. It simply receives as input a buffer of an arbitrary size.


ip_fast_csum


Given an IP header and length, computes and returns the IP checksum. It can be used both to validate an input packet and to compute the checksum of an outgoing packet.


You can consider ip_fast_csum a variation of ip_compute_csum optimized for IP headers.


ip_send_check


Computes the IP checksum of an outgoing packet. It is a simple wrapper to ip_fast_csum that zeros iphdr->check beforehand.


ip_decrease_ttl


When changing a single field of an IP header, it is faster to apply an incremental update to the IP checksum than to compute it from scratch. This is possible thanks to the simple algorithm used to compute the checksum. A common example is a packet that is forwarded and therefore gets its iphdr->ttl field decremented. ip_decrease_ttl is called within ip_forward.
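The incremental update can be illustrated with the general formula from RFC 1624, HC' = ~(~HC + ~m + m'), where HC is the old checksum, m the old value of a 16-bit word, and m' its new value. The sketch below is not the kernel's ip_decrease_ttl; the function names are hypothetical, and the header is assumed to be a raw byte array with the TTL at byte 8, the protocol at byte 9, and the checksum at bytes 10 and 11.

```c
#include <stdint.h>

/* Hypothetical helper: patch a checksum after one 16-bit word of the
 * covered data changes from old_word to new_word (RFC 1624, eqn. 3). */
uint16_t csum_update16(uint16_t check, uint16_t old_word, uint16_t new_word)
{
    uint32_t sum = (uint16_t)~check;

    sum += (uint16_t)~old_word;
    sum += new_word;
    sum = (sum & 0xFFFF) + (sum >> 16);
    sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Decrement the TTL and patch the checksum without touching the rest
 * of the header. Returns the new TTL. */
uint8_t decrease_ttl_sketch(uint8_t *hdr)
{
    uint16_t check = (uint16_t)((hdr[10] << 8) | hdr[11]);
    uint16_t old_word = (uint16_t)((hdr[8] << 8) | hdr[9]);
    uint16_t new_word;

    hdr[8]--;                                  /* TTL is the high byte of the word */
    new_word = (uint16_t)((hdr[8] << 8) | hdr[9]);

    check = csum_update16(check, old_word, new_word);
    hdr[10] = (uint8_t)(check >> 8);
    hdr[11] = (uint8_t)(check & 0xFF);
    return hdr[8];
}
```

Only three additions and a fold are needed, regardless of the header length, which is why the incremental form is attractive on the forwarding path.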


There are several other general support routines in the previously mentioned checksum.h file, but they are mostly used by L4 protocols. For instance:



skb_checksum


Defined in net/core/skbuff.c, it is a general-purpose checksumming function used by several wrappers (including some of the functions listed earlier), and used mostly by L4 protocols for specific situations.


csum_fold


Folds the 16 most-significant bits of a 32-bit value into the 16 least-significant bits and then complements the output value. This operation is normally the last stage of a checksum computation.


csum_partial[_xxx]


This family of functions computes a checksum that lacks the final folding done by csum_fold. L4 protocols can call one of the csum_partial functions to compute the checksum on the L4 data, then invoke a function such as csum_tcpudp_magic that computes the checksum on a pseudoheader (described in the following section), and finally sums the two partial checksums and folds the result.


csum_partial and some of its variations are written in assembly language on most architectures.


csum_block_add



csum_block_sub


Sum and subtract two checksums, respectively. The first one is useful when the checksum over a block of data is computed incrementally. The second one might be needed when a piece of data is removed from one whose checksum had already been computed. Many of the other functions use these two internally.


skb_checksum_help


This function has two different behaviors, depending on whether it is passed an ingress IP packet or an egress IP packet.


On ingress packets, it invalidates the L4 hardware checksum.


On egress packets, it computes the L4 checksum. It is used, for example, when the hardware checksumming capabilities of the egress device cannot be used (see dev_queue_xmit in Chapter 11), or when the L4 hardware checksum has been invalidated and therefore needs to be recomputed. A checksum can be invalidated, for example, by a NAT operation from Netfilter, or when the transformation protocols of the IPsec suite mangle the L4 payload by inserting additional headers between the original IP header and the L4 header. Note also that if a device could compute the L4 checksum in hardware and store it in the L4 header, it would end up modifying the L3 payload, which is not possible when the latter has been digested or encrypted by the IPsec suite, because it would invalidate the data.


csum_tcpudp_magic


Computes the checksum on the TCP and UDP pseudoheader (see Figure 18-14).
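The way partial checksums are combined, as described above for csum_partial, csum_block_add, and csum_block_sub, can be illustrated with a small sketch. The functions below are hypothetical stand-ins, not the kernel routines, and they assume that every block has an even length and starts at an even offset. A partial sum is kept unfolded in 32 bits, two partial sums are added with an end-around carry, and subtraction is performed by adding the complement.

```c
#include <stddef.h>
#include <stdint.h>

/* Unfolded 32-bit sum of the big-endian 16-bit words in buf, starting
 * from a previous partial sum (seed). Assumes len is even. */
uint32_t partial_sum(const uint8_t *buf, size_t len, uint32_t seed)
{
    uint64_t acc = seed;

    for (size_t i = 0; i + 1 < len; i += 2)
        acc += ((uint32_t)buf[i] << 8) | buf[i + 1];
    acc = (acc & 0xFFFFFFFFu) + (acc >> 32);
    acc = (acc & 0xFFFFFFFFu) + (acc >> 32);
    return (uint32_t)acc;
}

/* One's-complement addition of two partial sums. */
uint32_t sum_add(uint32_t a, uint32_t b)
{
    uint64_t s = (uint64_t)a + b;

    s = (s & 0xFFFFFFFFu) + (s >> 32);   /* end-around carry */
    return (uint32_t)s;
}

/* One's-complement subtraction: add the complement of b. */
uint32_t sum_sub(uint32_t a, uint32_t b)
{
    return sum_add(a, ~b);
}

/* Final stage: fold to 16 bits and complement. */
uint16_t sum_finish(uint32_t sum)
{
    sum = (sum & 0xFFFF) + (sum >> 16);
    sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}
```

With these helpers, summing two blocks separately and adding the results gives the same final checksum as summing the concatenated data in one pass, and subtracting the sum of a trailing block recovers the checksum of the leading one.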


Newer NICs can provide both the IP and L4 checksum computations in hardware. While Linux takes advantage of the L4 hardware checksumming capabilities of most modern NICs, it does not take advantage of the IP hardware checksumming capabilities because it's not worth the extra complexity (i.e., the software computation is already fast enough given the limited size of the IP header). Hardware checksumming is only one example of CPU offloading that allows the kernel to process packets faster; most modern NICs provide some L4 (mainly TCP) offloading, too. Hardware checksumming is briefly described in Chapter 19.




18.5.2. Changes to the L4 Checksum









The TCP and UDP protocols compute a checksum that covers their header, their payloads, and what is known as the pseudoheader, which is basically a block whose fields are taken from the IP header for convenience (see Figure 18-14). In other words, some information that appears in the IP header ends up being incorporated in the L4 checksum. Note that the pseudoheader is defined only for computing the checksum; it does not exist in the packet on the wire.



Figure 18-14. Pseudoheader used by TCP and UDP while computing the checksum
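As a rough illustration of how the pseudoheader enters the computation, the sketch below sums the standard IPv4 pseudoheader fields (source address, destination address, a zero byte, the protocol number, and the L4 length) together with the L4 header and payload. The function names and the byte-array interface are assumptions made for this example; the kernel splits this work between csum_partial and csum_tcpudp_magic instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Add the big-endian 16-bit words of buf to sum; an odd trailing byte
 * is padded with a zero byte on the right. */
uint32_t add_words(const uint8_t *buf, size_t len, uint32_t sum)
{
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (i < len)
        sum += (uint32_t)buf[i] << 8;
    return sum;
}

/* Hypothetical helper: checksum of a TCP or UDP segment, pseudoheader
 * included. seg points to the L4 header followed by the payload, and
 * its checksum field must be zero when computing a new checksum. */
uint16_t l4_checksum_sketch(const uint8_t src[4], const uint8_t dst[4],
                            uint8_t proto, const uint8_t *seg,
                            uint16_t seg_len)
{
    uint32_t sum = 0;

    /* Pseudoheader: never transmitted, only summed. */
    sum = add_words(src, 4, sum);
    sum = add_words(dst, 4, sum);
    sum += proto;      /* zero byte and protocol form one 16-bit word */
    sum += seg_len;    /* length of the L4 header plus payload */

    /* Real L4 header and payload. */
    sum = add_words(seg, seg_len, sum);

    sum = (sum & 0xFFFF) + (sum >> 16);
    sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}
```

For a 10-byte UDP datagram from 192.168.0.1 to 192.168.0.199 (ports 1234 and 53, payload "hi"), this sketch returns 0x1051. Running it again with that value stored in the checksum field returns zero, the same validation property described for the IP header.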



Unfortunately, the IP layer sometimes needs to change some of the IP header fields, for NAT or other activities, that were used by TCP and UDP in their pseudoheaders. The change at the IP level invalidates the L4 checksums. If the checksum is left in place, none of the nodes at the IP layer will detect any error because they validate only the IP checksum. However, the TCP layer of the destination host will believe the packet is corrupted. This case therefore has to be handled by the kernel.
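One way to handle this case is to patch the L4 checksum incrementally for the address that changed, rather than summing the whole segment again. The sketch below applies the RFC 1624 update formula to each 16-bit half of an IPv4 address. It is an illustration under stated assumptions, not the Netfilter code: the function names are hypothetical, and addresses are passed as 4-byte arrays in network byte order.

```c
#include <stdint.h>

/* Hypothetical helper: patch a checksum after one covered 16-bit word
 * changes from old_word to new_word. */
uint16_t csum_replace16_sketch(uint16_t check, uint16_t old_word,
                               uint16_t new_word)
{
    uint32_t sum = (uint16_t)~check;

    sum += (uint16_t)~old_word;
    sum += new_word;
    sum = (sum & 0xFFFF) + (sum >> 16);
    sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Patch an L4 checksum after an address in the pseudoheader has been
 * rewritten, for example by NAT. */
uint16_t csum_replace_addr_sketch(uint16_t check,
                                  const uint8_t old_addr[4],
                                  const uint8_t new_addr[4])
{
    for (int i = 0; i < 4; i += 2) {
        uint16_t old_word = (uint16_t)((old_addr[i] << 8) | old_addr[i + 1]);
        uint16_t new_word = (uint16_t)((new_addr[i] << 8) | new_addr[i + 1]);

        check = csum_replace16_sketch(check, old_word, new_word);
    }
    return check;
}
```

For example, if a UDP datagram carries the checksum 0x1051 and its source address is rewritten from 192.168.0.1 to 10.0.0.1, the patched checksum is 0xC6F9, the same value a full recomputation over the new pseudoheader and the unchanged segment would give.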


Furthermore, there are routine cases where L4 checksums computed in hardware on received frames are invalidated. Here are the most common ones:


  • When an input L2 frame includes some padding to reach the minimum frame size, but the NIC was not smart enough to leave the padding out when computing the checksum. In this case, the hardware checksum won't match the one computed by the receiving L4 layer. You will see in the section "Processing Input IP Packets" in Chapter 19 that to be on the safe side, the ip_rcv function always invalidates the checksum in this case. In Part IV, you will see that the bridging code can do something similar.

  • When an input IP fragment overlaps with a previously received fragment. See Chapter 22.

  • When an input IP packet uses any of the IPsec suite's protocols. In such cases, the L4 checksum cannot have been computed correctly by the NIC because the L4 header and payload are compressed, digested, or encrypted. For an example, see esp_input in net/ipv4/esp4.c.

  • When the checksum needs to be recomputed because of NAT or some similar intervention at the IP layer. See, for instance, ip_nat_fn in net/ipv4/netfilter/ip_nat_standalone.c.


Although the name might prove confusing, the field skb->ip_summed has to do with the L4 checksum (more details in Chapter 19). Its value is manipulated by the IP layer when it knows that something has invalidated the L4 checksum, such as a change in a field that is part of the pseudoheader.


I will not cover the details of how the checksum is computed for locally generated packets. But we will briefly see in the section "Copying data into the fragments: getfrag" in Chapter 21 how it can be computed incrementally while creating fragments.












