Thursday, October 22, 2009

Section 22.1.  IP Fragmentation





















As shown in Figure 18-1 in Chapter 18, the dst_output function is called by both locally generated and forwarded packets, so the ip_fragment function in the area below dst_output can run in both situations. Thus, the input to ip_fragment can be:


  • Forwarded packets that are whole

  • Forwarded packets that the originating host or a router along the way has fragmented

  • Buffers created by local functions that, as described in the previous chapter, have started the fragmentation process but have not added the headers that are required for transmission as packets


In particular, ip_fragment must be able to handle both of the following cases:



Big chunks of data that need to be split into smaller parts.


Splitting the big buffer requires the allocation of new buffers and memory copies from the big buffer to the small ones. This, of course, impacts performance.


A list or array of data fragments that do not need to be fragmented further.


If the buffers were allocated such that they have room to allow the addition of lower-layer L3 and L2 headers, ip_fragment can handle them without a memory copy. All the IP layer needs to do is add an IP header to each fragment and handle the checksum.
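
To make the no-copy case concrete, here is a minimal user-space sketch of a buffer that reserves headroom so that a header can be prepended without moving the payload. The pktbuf type and the pkt_reserve, pkt_put, and pkt_push helpers are hypothetical names invented for this illustration; they only mimic the idea and are not the kernel's sk_buff API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, much-simplified stand-in for a packet buffer. */
struct pktbuf {
    unsigned char storage[256];
    unsigned char *data;   /* start of the bytes currently in use */
    size_t len;            /* number of bytes currently in use */
};

/* Leave `headroom` free bytes in front of the (still empty) data area. */
void pkt_reserve(struct pktbuf *b, size_t headroom)
{
    assert(headroom <= sizeof(b->storage));
    b->data = b->storage + headroom;
    b->len = 0;
}

/* Append payload bytes at the tail of the data area. */
void pkt_put(struct pktbuf *b, const void *src, size_t n)
{
    assert((size_t)(b->data - b->storage) + b->len + n <= sizeof(b->storage));
    memcpy(b->data + b->len, src, n);
    b->len += n;
}

/*
 * Prepend a header into the headroom. The payload is not moved.
 * Returns the new start of data, or NULL if the headroom is too small.
 */
unsigned char *pkt_push(struct pktbuf *b, const void *hdr, size_t n)
{
    if ((size_t)(b->data - b->storage) < n)
        return NULL;
    b->data -= n;
    b->len += n;
    memcpy(b->data, hdr, n);
    return b->data;
}
```

With 36 bytes reserved, a 20-byte header fits in front of the payload and leaves 16 bytes for a lower-layer header. A second 20-byte push would fail, which is the situation that forces a copy into a new buffer.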


Previous kernel versions used to handle IP fragmentation entirely at the IP layer. The IP functions used to transmit a packet could receive a payload of any size between 0 and 64 KB, and had to split that payload into multiple IP fragments when the size of the packet exceeded the PMTU. We saw this in the section "Packet Fragmentation/Defragmentation" in Chapter 18.


The approach used by newer kernels is to make the L4 protocols aid in the fragmentation task in advance: instead of passing to the IP layer a single buffer that will have to be fragmented, they can pass a set of buffers appropriate to the PMTU. This way, the IP fragmentation handled at the IP layer consists simply of creating an IP header for each data fragment already formed. This does not mean that the L4 protocols implement IP fragmentation; it simply means that since L4 protocols are aware of IP fragmentation, they try to cooperate and make life easier for the IP layer. The L4 protocols do not touch the IP headers.


Before the introduction of the ip_append_data/ip_append_page functions discussed in Chapter 21, IP fragmentation used to be simpler than IP defragmentation. Now both processes are equally complex.


Fragmentation can currently be done in two ways: the so-called fast (or efficient) way, and the slow (or old-style) way. Both of them are taken care of by ip_fragment. Before seeing how those two approaches differ, let's review the main tasks required to fragment an IP packet:


  1. Split the L3 payload into smaller pieces to fit within the MTU associated with the route being used to send the packet (PMTU). As we will see in a moment, this task may or may not involve some memory copies. If the size of the IP payload is not an exact multiple of the fragment size, the last fragment is smaller than the others. Also, since the fragment offset field of the IP header is measured in units of 8 bytes, the fragment size is rounded down to a multiple of 8 bytes. Every fragment, with the possible exception of the last one, has this size. See Figure 18-10 in Chapter 18.

  2. Initialize each fragment's IP header, taking into account that not all of the options have to be replicated into all of the fragments. ip_options_fragment, introduced in section "IP Options" in Chapter 18, does this job.

  3. Compute the IP checksum. Each fragment has a different IP header, so the checksum has to be recomputed for each one.

  4. Ask Netfilter, the Linux filtering system, for permission to complete the transmission.

  5. Update all the necessary kernel and SNMP statistics (such as IPSTATS_MIB_FRAGCREATES, IPSTATS_MIB_FRAGOKS, and IPSTATS_MIB_FRAGFAILS).
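
The first task in the list above can be sketched in user space as follows. This is only an illustration of the size arithmetic, not kernel code: the frag_plan structure and the plan_fragments function are hypothetical names, and max_payload is assumed to be the PMTU minus the IP header length.

```c
#include <assert.h>
#include <stddef.h>

/* One planned piece: where its payload starts and how long it is. */
struct frag_plan {
    unsigned int offset;  /* byte offset within the L3 payload */
    unsigned int len;     /* payload bytes carried by this piece */
};

/*
 * Split payload_len bytes into pieces of at most max_payload bytes.
 * Every piece except the last is rounded down to a multiple of 8.
 * Returns the number of pieces written, or 0 if out[] is too small.
 */
size_t plan_fragments(unsigned int payload_len, unsigned int max_payload,
                      struct frag_plan *out, size_t max_out)
{
    unsigned int left = payload_len;
    unsigned int offset = 0;
    size_t n = 0;

    assert(max_payload >= 8);  /* otherwise no aligned piece would fit */

    while (left > 0) {
        unsigned int len = left;

        if (len > max_payload)
            len = max_payload;
        if (len < left)
            len &= ~7u;        /* not the last piece: align to 8 bytes */

        if (n == max_out)
            return 0;
        out[n].offset = offset;
        out[n].len = len;
        n++;

        offset += len;
        left -= len;
    }
    return n;
}
```

For a 4,000-byte payload and a 1,480-byte limit, the plan is three pieces of 1,480, 1,480, and 1,040 bytes. With a 1,006-byte limit, a 2,000-byte payload is split into two 1,000-byte pieces because 1,006 is rounded down to 1,000.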


In kernel versions prior to 2.4, a function named ip_build_xmit_slow created and transmitted IP fragments for locally generated packets in reverse order: last to first. This approach had a couple of advantages:


  • The last fragment is the only one that can tell the receiver the size of the original, unfragmented packet. To know this as soon as possible could help the defragmenter handle its memory better.

  • It makes it more likely that the defragmenter can build up a packet faster. As described in the section "IP Defragmentation," fragments are added into a list (ipq) in increasing order of offset. If each fragment arrives after the fragment that comes after it, fragments can be added speedily at the head of the list.


While this sort of optimization works when the receiver is a Linux box, it might have no effect or even be a drawback if the receiver uses some other operating system that makes different assumptions.[*] Therefore, starting with 2.4, the Linux kernel transmits fragments in forward order.

[*] For example, the PIX firewall from Cisco Systems has an option that lets the administrator prevent IP fragments from passing through unless they are received in order from first to last.



22.1.1. Functions Involved with IP Fragmentation






The previous chapter, which described the functions that transmit data at the IP layer, covered the ip_append_data/ip_append_page set of functions that do a lot of the groundwork for fragmentation. The rest of this section focuses on ip_fragment, which turns the buffers waiting for transmission into actual packets.


Here are a couple of support routines used by the fragmentation code:



ip_dont_fragment


Decides whether the IP packet can be fragmented, based on Path MTU discovery configuration (see the section "Path MTU Discovery" in Chapter 18).


ip_options_fragment


Modifies the IP header of the first fragment so that it can be recycled by the following ones. See the section "IP Options" in Chapter 19.


ip_dont_fragment and ip_options_fragment are defined in include/net/ip.h and net/ipv4/ip_options.c, respectively.




22.1.2. The ip_fragment Function


We already mentioned in the previous section that ip_fragment can take care of fragmentation in two different ways. Let's first see what the common part does. In the next two sections, we will analyze the two cases separately.



int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff*))



Here are the meanings of the function's input parameters:



skb


Buffer containing the IP packet to fragment. The packet includes an already initialized IP header, which will have to be adapted and replicated into all the fragments. See Figure 21-12(b) in Chapter 21 for an example of what skb may look like.


output


Function to use to transmit the fragments. In Figure 18-1 in Chapter 18, you can see some of the places where ip_fragment is called. You can check them to see what function is used as output (for example, ip_output uses ip_finish_output).


ip_fragment begins by initializing a few key variables that will be used later. It extracts their values from the device and IP header structures that are obtained via the input skb parameter. The egress device dev and the PMTU mtu are extracted from the routing entry used to transmit the packet (rt). You will see in Chapter 36 what other parameters are kept in that data structure.


If the input IP packet cannot be fragmented because the source has set the DF flag, ip_fragment sends an ICMP packet back to the source to notify it of the problem, and then drops the packet. The local_df flag shown in the if condition is set mainly by the Virtual Server code when it does not want the condition just described to generate an ICMP message.



dev = rt->u.dst.dev;
iph = skb->nh.iph;

if (unlikely((iph->frag_off & htons(IP_DF)) && !skb->local_df)) {
    icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
              htonl(dst_pmtu(&rt->u.dst)));
    kfree_skb(skb);
    return -EMSGSIZE;
}

hlen = iph->ihl * 4;
mtu = dst_mtu(&rt->u.dst) - hlen;
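
The DF test in the if condition can be tried in isolation with the following user-space sketch. The function name df_blocks_fragmentation and the DEMO_IP_DF constant are hypothetical; the constant is assumed to carry the standard position of the DF bit within the 16-bit fragment field.

```c
#include <stdint.h>
#include <arpa/inet.h>   /* htons */

#define DEMO_IP_DF 0x4000  /* "don't fragment" bit, host byte order */

/*
 * frag_off_net: the fragment field exactly as stored in the IP header
 *               (network byte order).
 * local_df:     nonzero when the sender asked for DF to be ignored locally.
 * Returns nonzero when the packet must be rejected instead of fragmented.
 */
int df_blocks_fragmentation(uint16_t frag_off_net, int local_df)
{
    return (frag_off_net & htons(DEMO_IP_DF)) && !local_df;
}
```

The mask is converted to network byte order once, so the header field never has to be converted. This is the same trick used by the kernel excerpt above.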



Fast fragmentation is used when ip_fragment receives an sk_buff whose data is already fragmented. This is possible, for example, for packets locally generated by an L4 protocol that uses the ip_append_data and ip_push_pending_frames functions. It is also possible for packets generated by L4 protocols that use the ip_queue_xmit function, because they take care of creating fragments themselves. See Chapter 21.


The slow path is used in all the other cases, among which we have:


  • Packets being forwarded

  • Locally generated traffic that has not been fragmented before reaching dst_output

  • All of those cases where fast fragmentation was disabled due to a sanity check on the buffers (see the beginning of ip_fragment)


Even if ip_fragment was given a buffer whose data was already broken into fragment-size buffers as input, it may not be possible to use the fast path due to an error in the organization of the fragments. An error could be caused by a broken feature that performs a faulty buffer manipulation, or by the transformers used by the IPsec protocols.


In both cases (slow and fast), if the transmission of any fragment fails, ip_fragment returns immediately with an error code and the following fragments are not transmitted. When this happens, the destination host will receive only a subset of the IP fragments and therefore will fail to reassemble them.




22.1.3. Slow Fragmentation


Unlike the fast fragmentation done in collaboration with ip_append_page/ip_append_data, slow fragmentation does not need to keep any state information (such as the list of fragments, etc.). The process simply consists of splitting the IP packet into fragments whose size is given by the MTU of the outgoing interface, or by the MTU associated with the route used if path MTU discovery is enabled.


Before entering the loop, the function needs to initialize a few local variables.


ptr is the offset into the packet about to be fragmented; it will be moved as fragmentation proceeds. left is initialized to the length of the IP packet. In calculating left, the ip_fragment function subtracts hlen (the IP header length, computed earlier as iph->ihl * 4) because that component is not part of the IP payload and the function must leave room for it because it will be copied into each fragment.


The IP header places the fragment offset and the DF and MF flags together in a single 16-bit field. The formula in the following code extracts the offset field from it.


The local variable not_last_frag, as the name suggests, is true when more data is supposed to follow the current fragment in the packet; that is, when the packet handed to ip_fragment is itself a fragment that has the MF flag set. This is an important bit of data because the last fragment in the packet indicates the size of the packet, which is valuable information for allocating memory efficiently; the function acts on this information later. The not_last_frag variable is not set, however, when the input is an original, unfragmented packet. If a packet is fragmented into two pieces, for example, and the first piece is later fragmented again, all the fragments created from that first piece will have MF set because not_last_frag is set.



left = skb->len - hlen;
ptr = raw + hlen;

offset = (ntohs(iph->frag_off) & IP_OFFSET) << 3;
not_last_frag = iph->frag_off & htons(IP_MF);
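
The following user-space sketch decodes the same 16-bit field into its three parts. It assumes the standard IPv4 layout (one reserved bit, DF, MF, then a 13-bit offset in 8-byte units); the frag_info structure, the decode_frag_off function, and the DEMO_ constants are hypothetical names used only here.

```c
#include <stdint.h>
#include <arpa/inet.h>   /* ntohs */

#define DEMO_IP_DF     0x4000  /* don't fragment */
#define DEMO_IP_MF     0x2000  /* more fragments */
#define DEMO_IP_OFFSET 0x1FFF  /* 13-bit offset, in units of 8 bytes */

struct frag_info {
    unsigned int offset_bytes;  /* offset converted to bytes */
    int dont_fragment;
    int more_fragments;
};

/* frag_off_net is the field as found in the header (network byte order). */
struct frag_info decode_frag_off(uint16_t frag_off_net)
{
    uint16_t v = ntohs(frag_off_net);
    struct frag_info fi;

    fi.offset_bytes = (unsigned int)(v & DEMO_IP_OFFSET) << 3;
    fi.dont_fragment = (v & DEMO_IP_DF) != 0;
    fi.more_fragments = (v & DEMO_IP_MF) != 0;
    return fi;
}
```

A field holding MF plus an offset of 185 units decodes to byte offset 1,480 with more fragments to follow.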



ip_fragment next starts a loop to create a new buffer for each fragment (skb2). The input parameter skb contains the original IP packet.



while (left > 0) {
    len = left;



For each fragment, the length is set to the MTU value defined earlier through the PMTU field. The size of the fragment is also aligned to an 8-byte boundary, as imposed by the IP RFC. The only cases where the following condition is not met are when we are transmitting the last fragment or when fragmentation is not needed. But the second case should never occur because if fragmentation were not needed, the function would not execute in the first place.



if (len > mtu)
    len = mtu;

if (len < left) {
    len &= ~7;
}



The size of the buffer allocated to hold a fragment is the sum of:


  • The size of the IP payload

  • The size of the IP header

  • The size of the L2 header


The last of those values is initialized just before the while loop and is retrieved from the routing table cache. The IP layer can learn, from the routing table, the L2 device to be used to transmit the fragments. The ip_fragment function can extract the size of the header associated with the device's protocol from the associated net_device data structure. This value is aligned to a 16-byte boundary by the LL_RESERVED_SPACE[_EXTRA] macros and is stored in the local variable ll_rs (Link Layer Reserved Space). This alignment has nothing to do with the 8-byte alignment just performed on the payload. When the kernel is compiled with support for L2 firewalling (i.e., the CONFIG_BRIDGE_NETFILTER kernel option), ll_rs and mtu are updated accordingly to accommodate a possible 802.1Q header.



if ((skb2 = alloc_skb(len+hlen+ll_rs,
                      GFP_ATOMIC)) == NULL) {
    NETDEBUG(printk(KERN_INFO "IP: frag: no memory for new fragment!\n"));
    err = -ENOMEM;
    goto fail;
}
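
As a rough illustration of the size computation, the sketch below adds the three components after rounding the link-layer header length up to a multiple of 16. The helper names ll_reserved and frag_buffer_size are hypothetical, and the plain round-up is an assumption made for the example; the kernel macros may reserve a slightly different amount in edge cases.

```c
#include <stddef.h>

/* Round the link-layer header length up to a multiple of 16 bytes. */
size_t ll_reserved(size_t hard_header_len)
{
    return (hard_header_len + 15u) & ~(size_t)15u;
}

/* Bytes to allocate for one fragment: payload + IP header + L2 reserve. */
size_t frag_buffer_size(size_t payload_len, size_t ip_hlen,
                        size_t hard_header_len)
{
    return payload_len + ip_hlen + ll_reserved(hard_header_len);
}
```

For a 1,480-byte payload, a 20-byte IP header, and a 14-byte Ethernet header, this gives 1480 + 20 + 16 = 1,516 bytes.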



Now the function needs to copy into the newly allocated buffer skb2 the value of a few fields from the sk_buff structure (the original IP packet) being replicated. Some of them are copied here, and others are taken care of by ip_copy_metadata, which also may copy some fields based on whether specific features (such as Traffic Control and Netfilter) are built into the kernel. The pointers to the L3 (nh.raw) and L4 (h.raw) headers are also initialized.



ip_copy_metadata(skb2, skb);
skb_reserve(skb2, ll_rs);
skb_put(skb2, len + hlen);
skb2->nh.raw = skb2->data;
skb2->h.raw = skb2->data + hlen;



The newly allocated buffer is associated with the socket attempting the transmission, if any. (This is the case, for instance, when the transmission was requested with the functions on the left side of Figure 18-1 in Chapter 18.)



if (skb->sk)
    skb_set_owner_w(skb2, skb->sk);



Now it is time to fill in the new buffer skb2 with some real data. (So far the function has taken care of only the management fields of the sk_buff structure.) This is done in two parts:


  • The IP header is copied with a simple memcpy.

  • Then a piece of payload from the original packet is copied into the fragment.


The latter task cannot use a simple memcpy, because the data may be stored in skb in a variety of ways using a list of fragments or memory page extensions (see Chapter 21). The slow path could be invoked when a packet contains all its data in the memory area pointed to by skb->data (see Figure 21-2 in Chapter 21), or when data has already been fragmented before reaching ip_fragment but one of the sanity checks described earlier rules out the fast path. The logic to handle the various possibilities for data layout is in the helper function skb_copy_bits, which ip_fragment calls.



memcpy(skb2->nh.raw, skb->data, hlen);

if (skb_copy_bits(skb, ptr, skb2->h.raw, len))
    BUG();

left -= len;
iph = skb2->nh.iph;
iph->frag_off = htons((offset >> 3));



The first fragment (where offset is 0) is special from the IP options point of view because it is the only one that includes a full copy of the options from the original IP packet. Not all the options have to be replicated into all of the fragments; only the first fragment will include all of them.



if (offset == 0)
    ip_options_fragment(skb);



ip_options_fragment, described in Chapter 19, cleans up the content of the ip_opt structure associated with the original IP packet so that fragments following the first one will not include options they do not need. Therefore, ip_options_fragment is called only during the processing of the first fragment (which is the one with offset=0).


The MF flag (for More Fragments) is set if either of the following conditions is met:


  • The packet being fragmented is not a fragment itself, and the fragment created in this loop is not the last one (left>0).

  • The packet being fragmented is a fragment itself, but is not the last one, and therefore all of its fragments must have MF set (not_last_frag=1).


    if (left > 0 || not_last_frag)
        iph->frag_off |= htons(IP_MF);



The following two statements update two offsets. It is easy to confuse the two. offset is maintained because the packet currently being fragmented may be a fragment of a larger packet; if so, offset represents the offset of the current fragment within the original packet (otherwise, it is simply 0). ptr is an offset within the packet we are fragmenting and changes as the loop progresses. The two variables have the same value in two cases: where the packet we are fragmenting is not a fragment itself, and where this fragment is the very first fragment.



ptr += len;
offset += len;
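
The difference between the two offsets is easiest to see when the input is itself a fragment. The following user-space sketch mirrors the loop's bookkeeping only; no data is copied. The out_frag structure and the refragment function are hypothetical names. As an assumption for the example, ptr starts at hlen, that is, the IP header is taken to sit at the very start of the input buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_IP_MF 0x2000  /* more fragments, host byte order */

struct out_frag {
    unsigned int ptr;   /* where the payload is read from in the input packet */
    unsigned int len;   /* payload bytes in this fragment */
    uint16_t frag_off;  /* host byte order: offset/8 plus the MF flag */
};

/*
 * payload_len: L3 payload bytes in the input packet
 * hlen:        IP header length of the input packet
 * in_offset:   byte offset of the input within the original datagram
 * in_mf:       nonzero if the input already had MF set (not_last_frag)
 * Returns the number of fragments described in out[] (at most max_out).
 */
size_t refragment(unsigned int payload_len, unsigned int hlen,
                  unsigned int in_offset, int in_mf,
                  unsigned int max_payload,
                  struct out_frag *out, size_t max_out)
{
    unsigned int left = payload_len;
    unsigned int ptr = hlen;          /* position inside the input packet */
    unsigned int offset = in_offset;  /* position inside the original datagram */
    size_t n = 0;

    assert(max_payload >= 8 && in_offset % 8 == 0);

    while (left > 0 && n < max_out) {
        unsigned int len = left > max_payload ? max_payload : left;

        if (len < left)
            len &= ~7u;
        left -= len;

        out[n].ptr = ptr;
        out[n].len = len;
        out[n].frag_off = (uint16_t)(offset >> 3);
        if (left > 0 || in_mf)
            out[n].frag_off |= DEMO_IP_MF;

        ptr += len;
        offset += len;
        n++;
    }
    return n;
}
```

Refragmenting a 1,480-byte fragment that sat at offset 1,480 with MF set, over a link that allows 556 payload bytes, yields pieces read from ptr 20, 572, and 1,124. They are advertised at offsets 1,480, 2,032, and 2,584, and all three carry MF.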



Finally, the slow path needs to update the total length field of the header (tot_len, which covers the payload plus the IP header, options included), compute the checksum with ip_send_check, and transmit the fragment using the output function passed as a parameter. The output function used by IPv4 is ip_finish_output (see Figure 18-1 in Chapter 18).



iph->tot_len = htons(len + hlen);
ip_send_check(iph);

err = output(skb2);
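
Because each fragment's header differs in at least the length and offset fields, its checksum has to be recomputed. The sketch below shows the standard Internet checksum (one's-complement sum of 16-bit words) applied to an IPv4 header in user space. The function names ip_header_checksum and set_ip_checksum are hypothetical. It illustrates the arithmetic, not the kernel's optimized routine.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * One's-complement checksum over an IPv4 header given as raw bytes.
 * len is the header length in bytes (always even for IPv4).
 */
uint16_t ip_header_checksum(const unsigned char *hdr, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];

    while (sum >> 16)                  /* fold the carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;
}

/* Zero the checksum field (bytes 10 and 11), then store a fresh value. */
void set_ip_checksum(unsigned char *hdr, size_t hlen)
{
    uint16_t c;

    hdr[10] = 0;
    hdr[11] = 0;
    c = ip_header_checksum(hdr, hlen);
    hdr[10] = (unsigned char)(c >> 8);
    hdr[11] = (unsigned char)(c & 0xFF);
}
```

A header with a valid checksum sums to zero when passed through ip_header_checksum again, which is how a receiver can verify it.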





22.1.4. Fast Fragmentation



ip_fragment tries the fast path when it sees that the frag_list pointer of the input skb buffer is not NULL. However, as described earlier in this chapter, it must make sure that the fragments are suitable for the fast path. Here are the sanity checks related to protocol requirements:


  • The size of each fragment should not exceed the PMTU.

  • Only the last fragment can have an L3 payload whose size is not a multiple of eight bytes.

  • Each fragment must have enough space at the head to allow the addition of the IP header later (the code below compares skb_headroom(frag) against hlen).


And there are some other buffer management checks as well:


  • The fragment cannot be shared, because that would forbid ip_fragment from editing it to add the IP header. It is acceptable for ip_fragment to receive a shared buffer when using the slow path because the buffer is going to be copied into many other new buffers, but it is not acceptable for the fast path.


    if (skb_shinfo(skb)->frag_list) {
        struct sk_buff *frag;
        int first_len = skb_pagelen(skb);

        if (first_len - hlen > mtu ||
            ((first_len - hlen) & 7) ||
            (iph->frag_off & htons(IP_MF|IP_OFFSET)) ||
            skb_cloned(skb))
            goto slow_path;

        for (frag = skb_shinfo(skb)->frag_list; frag; frag = frag->next) {
            if (frag->len > mtu ||
                ((frag->len & 7) && frag->next) ||
                skb_headroom(frag) < hlen)
                goto slow_path;

            if (skb_shared(frag))
                goto slow_path;
            ...
        }
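
The per-fragment checks in the for loop can be modeled in user space with a plain array. The sketch below is a simplified model under stated assumptions: the frag_desc structure and the fast_path_ok function are hypothetical, each entry describes one buffer from the fragment list before its IP header is added, and the separate checks on the first buffer are left out.

```c
#include <stddef.h>

/* Simplified description of one buffer in the fragment list. */
struct frag_desc {
    unsigned int len;       /* payload bytes (no IP header yet) */
    unsigned int headroom;  /* free bytes in front of the payload */
    int shared;             /* nonzero if someone else also uses the buffer */
};

/*
 * mtu is the largest payload allowed per fragment (PMTU minus hlen),
 * hlen is the IP header length. Returns 1 if every buffer qualifies for
 * the fast path, 0 if the slow path must be used instead.
 */
int fast_path_ok(const struct frag_desc *frags, size_t n,
                 unsigned int mtu, unsigned int hlen)
{
    size_t i;

    for (i = 0; i < n; i++) {
        int is_last = (i + 1 == n);

        if (frags[i].len > mtu)
            return 0;               /* too big for the path */
        if ((frags[i].len & 7) && !is_last)
            return 0;               /* only the last may be unaligned */
        if (frags[i].headroom < hlen)
            return 0;               /* no room to prepend the header */
        if (frags[i].shared)
            return 0;               /* must not edit a shared buffer */
    }
    return 1;
}
```

A single failing buffer is enough to send the whole packet down the slow path, which matches the all-or-nothing goto slow_path in the excerpt above.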



The initialization of the IP header of the first fragment is completed outside the loop because it can be optimized slightly. For instance, when this function runs, it knows there are at least two fragments, and therefore it does not need to check frag->next on the first fragment to initialize iph->frag_off: as the first fragment, this fragment must have the IP_MF flag set and the offset set to 0 (iph->frag_off = htons(IP_MF)). The following fragments receive their own offset, and all of them except the last one also get the IP_MF bit set in frag_off without disturbing the rest of the value (iph->frag_off |= htons(IP_MF)).


Let's suppose the fast path can be used. The rest of the code is pretty simple, and to some extent it is similar to the code seen for the slow path. After the first fragment has been sent (i.e., after the first iteration of the for block), the IP header is modified with ip_options_fragment so that it can be recycled by the following fragments. If we exclude that special case, all we need to do to transmit a fragment is:


  • Copy the (modified) header from the first IP fragment into the current fragment.

  • Initialize those fields of the IP header that may differ. Among them are the offset and the IP checksum, which is computed with ip_send_check. Also, if the fragment is not the last one, set the MF flag.

  • Copy from the first fragment to the current fragment the rest of the sk_buff fields, using ip_copy_metadata. These fields are management parameters; they do not have anything to do with the content of the IP fragment.

  • Transmit the fragment with the function output passed as a parameter.


In case of errors, memory for all the subsequent fragments in frag_list is freed (not shown in the following snapshot). Note that the code inside the if (frag) {...} block prepares the fragment that will be transmitted in the following loop iteration, and the call to output transmits the current one.



skb->data_len = first_len - skb_headlen(skb);
skb->len = first_len;
iph->tot_len = htons(first_len);
iph->frag_off = htons(IP_MF);
ip_send_check(iph);

for (;;) {
    if (frag) {
        frag->ip_summed = CHECKSUM_NONE;
        frag->h.raw = frag->data;
        frag->nh.raw = __skb_push(frag, hlen);
        memcpy(frag->nh.raw, iph, hlen);
        iph = frag->nh.iph;
        iph->tot_len = htons(frag->len);

        ip_copy_metadata(frag, skb);
        if (offset == 0)
            ip_options_fragment(frag);
        offset += skb->len - hlen;
        iph->frag_off = htons(offset>>3);
        if (frag->next != NULL)
            iph->frag_off |= htons(IP_MF);
        ip_send_check(iph);
    }

    err = output(skb);

    if (err || !frag)
        break;

    skb = frag;
    frag = skb->next;
    skb->next = NULL;
}
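
The "prepare the next one, then transmit the current one" pattern of the loop can be reproduced with an ordinary linked list. The sketch below is a user-space model under simplifying assumptions. The node structure, the send_chain function, and the mark_sent callback are hypothetical names. The head's next pointer plays the role of the fragment list. Nothing is freed on error, whereas the kernel releases the remaining fragments.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_IP_MF 0x2000  /* more fragments, host byte order */

struct node {
    unsigned int payload_len;  /* L3 payload bytes in this fragment */
    uint16_t frag_off;         /* host byte order: offset/8 plus MF */
    struct node *next;
    int sent;
};

typedef int (*output_fn)(struct node *);

/* Hypothetical output routine for the sketch: just marks the node. */
int mark_sent(struct node *n)
{
    n->sent = 1;
    return 0;
}

/* Transmit head and every node chained after it, in order. */
int send_chain(struct node *head, output_fn output)
{
    struct node *cur = head;
    struct node *frag = head->next;
    unsigned int offset = 0;
    int err;

    cur->next = NULL;
    cur->frag_off = frag ? DEMO_IP_MF : 0;  /* first piece: offset 0 */

    for (;;) {
        if (frag) {
            /* Prepare the fragment that goes out in the next iteration. */
            offset += cur->payload_len;
            frag->frag_off = (uint16_t)(offset >> 3);
            if (frag->next != NULL)
                frag->frag_off |= DEMO_IP_MF;
        }

        err = output(cur);          /* transmit the current one */
        if (err || !frag)
            break;

        cur = frag;                 /* advance and detach from the list */
        frag = cur->next;
        cur->next = NULL;
    }
    return err;
}
```

Each node is detached from its successor before being handed to the output routine, so the routine always sees a standalone packet, mirroring the skb->next = NULL assignment in the kernel excerpt.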













