22.1. IP Fragmentation
As shown in Figure 18-1 in Chapter 18, the dst_output function is called by both locally generated and forwarded packets, so the ip_fragment function in the area below dst_output can run in both situations. Thus, the input to ip_fragment can be:
In particular, ip_fragment must be able to handle both of the following cases:
Previous kernel versions used to handle IP fragmentation entirely at the IP layer. The IP functions used to transmit a packet could receive a payload of any size between 0 and 64 KB, and had to split that payload into multiple IP fragments when the size of the packet exceeded the PMTU. We saw this in the section "Packet Fragmentation/Defragmentation" in Chapter 18.
The approach used by newer kernels is to make the L4 protocols aid in the fragmentation task in advance: instead of passing to the IP layer a single buffer that will have to be fragmented, they can pass a set of buffers appropriate to the PMTU. This way, the IP fragmentation handled at the IP layer consists simply of creating an IP header for each data fragment already formed. This does not mean that the L4 protocols implement IP fragmentation; it simply means that since L4 protocols are aware of IP fragmentation, they try to cooperate and make life easier for the IP layer. The L4 protocols do not touch the IP headers.
Before the introduction of the ip_append_data/ip_append_page functions discussed in Chapter 21, IP fragmentation used to be simpler than IP defragmentation. Now both processes are equally complex. Fragmentation can currently be done in two ways: the so-called fast (or efficient) way, and the slow (or old-style) way. Both of them are taken care of by ip_fragment. Before seeing how those two approaches differ, let's review the main tasks required to fragment an IP packet:
In kernel versions prior to 2.4, a function named ip_build_xmit_slow created and transmitted IP fragments for locally generated packets in reverse order: last to first. This approach had a couple of advantages:
While this sort of optimization works when the receiver is a Linux box, it might have no effect or even be a drawback if the receiver uses some other operating system that makes different assumptions.[*] Therefore, starting with 2.4, the Linux kernel transmits fragments in forward order.
22.1.1. Functions Involved with IP Fragmentation
The previous chapter, which described the functions that transmit data at the IP layer, covered the ip_append_data/ip_append_page set of functions that do a lot of the groundwork for fragmentation. The rest of this section focuses on ip_fragment, which turns the buffers waiting for transmission into actual packets. Here are a couple of support routines used by the fragmentation code:
ip_dont_fragment and ip_options_fragment are defined in include/net/ip.h and net/ipv4/ip_options.c, respectively.
22.1.2. The ip_fragment Function
We already mentioned in the previous section that ip_fragment can take care of fragmentation in two different ways. Let's first see what the common part does. In the next two sections, we will analyze the two cases separately.
Here are the meanings of the function's input parameters:
ip_fragment begins by initializing a few key variables that will be used later. It extracts their values from the device and IP header structures that are obtained via the input skb parameter. The egress device dev and the PMTU mtu are extracted from the routing entry used to transmit the packet (rt). You will see in Chapter 36 what other parameters are kept in that data structure. If the input IP packet cannot be fragmented because the source has set the DF flag, ip_fragment sends an ICMP packet back to the source to notify it of the problem, and then drops the packet. The local_df flag shown in the if condition is set mainly by the Virtual Server code when it does not want the condition just described to generate an ICMP message.
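The DF decision just described can be sketched in a few lines of C. This is only an illustration, not the kernel's code: the function name check_df, the verdict enum, and the choice of passing the flags field in host byte order are assumptions made for the example. In the kernel, the notification sent to the source is an ICMP Destination Unreachable message with the "fragmentation needed" code.

```c
#include <stdbool.h>
#include <stdint.h>

#define IP_DF 0x4000  /* "don't fragment" bit, host byte order */

/* Hypothetical verdicts, used only in this sketch. */
enum frag_verdict {
    FRAG_PROCEED,        /* go on and fragment the packet */
    FRAG_DROP_WITH_ICMP  /* notify the source, then drop the packet */
};

/*
 * frag_off_host: the 16-bit flags+offset field of the IP header,
 *                already converted to host byte order.
 * local_df:      models the local_df flag; when true, the DF bit
 *                does not prevent fragmentation.
 */
enum frag_verdict check_df(uint16_t frag_off_host, bool local_df)
{
    if ((frag_off_host & IP_DF) && !local_df)
        return FRAG_DROP_WITH_ICMP;
    return FRAG_PROCEED;
}
```

A packet with DF set is therefore dropped unless local_df overrides the flag; any packet without DF proceeds to fragmentation.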
Fast fragmentation is used when ip_fragment receives an sk_buff whose data is already fragmented. This is possible, for example, for packets locally generated by an L4 protocol that uses the ip_append_data and ip_push_pending_frames functions. It is also possible for packets generated by L4 protocols that use the ip_queue_xmit function, because they take care of creating fragments themselves. See Chapter 21. The slow path is used in all the other cases, among which we have:
Even if ip_fragment was given a buffer whose data was already broken into fragment-size buffers as input, it may not be possible to use the fast path due to an error in the organization of the fragments. An error could be caused by a broken feature that performs a faulty buffer manipulation, or by the transformers used by the IPsec protocols.
In both cases (slow and fast), if the transmission of any fragment fails, ip_fragment returns immediately with an error code and the remaining fragments are not transmitted. When this happens, the destination host will receive only a subset of the IP fragments and therefore will fail to reassemble them.
22.1.3. Slow Fragmentation
Unlike the fast fragmentation done in collaboration with ip_append_page/ip_append_data, slow fragmentation
Before entering the loop, the function needs to initialize a few local variables. ptr is the offset into the packet about to be fragmented; it will be moved as fragmentation proceeds. left is initialized to the length of the IP packet. In calculating left, the ip_fragment function subtracts hlen (the IP header length) because that component is not part of the IP payload; the function must also leave room for it, because the header will be copied into each fragment.
The IP header places the fragment offset and the DF and MF flags together in a single 16-bit field. The formula in the following code extracts the offset field from it. The local variable not_last_frag, as the name suggests, is true when more data is supposed to follow the current fragment in the packet. This is an important bit of data because the last fragment in the packet indicates the size of the packet, which is valuable information for allocating memory efficiently; the function acts on this information later.
The not_last_frag variable is not set, however, on the first fragment within the packet (that is, the original packet; if a packet is fragmented into two pieces, for example, and the second piece is later fragmented, all fragments in the second piece will have the not_last_frag variable set).
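To make the layout of that 16-bit field concrete, here is a minimal sketch of how it can be decoded. The struct frag_info type and the decode_frag_off function are hypothetical names introduced for this example, and the value is assumed to be already in host byte order (kernel code would convert it with ntohs first). The top three bits hold the flags, and the low 13 bits hold the offset in units of 8 bytes.

```c
#include <stdint.h>

#define IP_DF     0x4000  /* don't fragment */
#define IP_MF     0x2000  /* more fragments */
#define IP_OFFSET 0x1FFF  /* 13-bit fragment offset, in 8-byte units */

/* Hypothetical container for the decoded field. */
struct frag_info {
    unsigned int offset;  /* offset of this fragment, in bytes */
    int df;               /* nonzero if DF is set */
    int mf;               /* nonzero if MF is set */
};

/* frag_off_host: flags+offset field in host byte order. */
struct frag_info decode_frag_off(uint16_t frag_off_host)
{
    struct frag_info fi;

    /* The offset is stored divided by 8, so shift it back into bytes. */
    fi.offset = (unsigned int)(frag_off_host & IP_OFFSET) << 3;
    fi.df = (frag_off_host & IP_DF) != 0;
    fi.mf = (frag_off_host & IP_MF) != 0;
    return fi;
}
```

For example, the value 0x20B9 decodes to MF set, DF clear, and an offset of 185 * 8 = 1480 bytes.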
ip_fragment next starts a loop to create a new buffer for each fragment (skb2). The input parameter skb contains the original IP packet.
For each fragment, the length is set to the MTU value defined earlier through the PMTU field. The size of the fragment is also aligned to an 8-byte boundary, as required by RFC 791, which expresses fragment offsets in units of 8 bytes. The only cases where the following condition is not met are when we are transmitting the last fragment or when fragmentation is not needed. But the second case should never occur because if fragmentation were not needed, the function would not execute in the first place.
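The length computation can be sketched as follows. The function name next_frag_len and the parameter mtu_payload are assumptions for the example; mtu_payload stands for the room left for data once the IP header has been subtracted from the PMTU.

```c
/*
 * left:        bytes of payload still to be sent.
 * mtu_payload: room for payload in one fragment (assumed to be the
 *              PMTU minus the IP header length).
 * Returns the payload length of the next fragment.
 */
unsigned int next_frag_len(unsigned int left, unsigned int mtu_payload)
{
    unsigned int len = left;

    /* Never put more than one MTU's worth of data in a fragment. */
    if (len > mtu_payload)
        len = mtu_payload;

    /* Every fragment except the last must carry a multiple of 8 bytes. */
    if (len < left)
        len &= ~7u;

    return len;
}
```

With 3980 bytes left and 1480 bytes of room, the next fragment carries 1480 bytes. With 1020 bytes left, the whole remainder fits and no alignment is applied, because this is the last fragment.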
The size of the buffer allocated to hold a fragment is the sum of:
The last of those values is initialized just before the while loop and is retrieved from the routing table cache. The IP layer can learn, from the routing table, the L2 device to be used to transmit the fragments. The ip_fragment function can extract the size of the header associated with the device's protocol from the associated net_device data structure. This value is aligned to a 16-byte boundary by the LL_RESERVED_SPACE[_EXTRA] macros and is stored in the local variable ll_rs (Link Layer Reserved Space). This alignment has nothing to do with the 8-byte alignment just performed on the payload. When the kernel is compiled with support for L2 firewalling (i.e., the CONFIG_BRIDGE_NETFILTER kernel option), ll_rs and mtu are updated accordingly to accommodate a possible 802.1Q header.
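As a rough illustration of the arithmetic, the sketch below rounds an L2 header length the way a macro of this kind could, and then adds up a buffer size. The function names are hypothetical, the rounding formula is an assumption that may differ between kernel versions, and the three terms of the sum (payload, IP header, and reserved L2 space) are inferred from the surrounding text.

```c
#define HH_DATA_MOD 16  /* alignment unit for the reserved L2 space */

/*
 * Round the device's L2 header length down to a multiple of 16 and
 * then add 16, so the result is a multiple of 16 that is strictly
 * larger than the header. (Assumed formula.)
 */
unsigned int ll_reserved_space(unsigned int hard_header_len)
{
    return (hard_header_len & ~(HH_DATA_MOD - 1u)) + HH_DATA_MOD;
}

/*
 * Size of the buffer for one fragment: payload, IP header, and the
 * space reserved for the L2 header. (Assumed composition.)
 */
unsigned int frag_buf_size(unsigned int len, unsigned int hlen,
                           unsigned int ll_rs)
{
    return len + hlen + ll_rs;
}
```

For an Ethernet device with a 14-byte header, this sketch reserves 16 bytes, so a fragment with 1480 bytes of payload and a 20-byte IP header would need a 1516-byte buffer.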
Now the function needs to copy into the newly allocated buffer skb2 the value of a few fields from the sk_buff structure (the original IP packet) being replicated. Some of them are copied here, and others are taken care of by ip_copy_metadata, which also may copy some fields based on whether specific features (such as Traffic Control and Netfilter) are built into the kernel. The pointers to the L3 (nh.raw) and L4 (h.raw) headers are also initialized.
The newly allocated buffer is associated with the socket attempting the transmission, if any. (This is the case, for instance, when the transmission was requested with the functions on the left side of Figure 18-1 in Chapter 18.)
Now it is time to fill in the new buffer skb2 with some real data. (So far the function has taken care of only the management fields of the sk_buff structure.) This is done in two parts:
The latter task cannot use a simple memcpy, because the data may be stored in skb in a variety of ways using a list of fragments or memory page extensions (see Chapter 21). The slow path could be invoked when a packet contains all its data in the memory area pointed to by skb->data (see Figure 21-2 in Chapter 21), or when data has already been fragmented before reaching ip_fragment but one of the sanity checks described earlier rules out the fast path. The logic to handle the various possibilities for data layout is in the helper function skb_copy_bits, which ip_fragment calls.
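To see why a plain memcpy is not enough, consider data spread over several memory areas. The sketch below is not skb_copy_bits; it is a simplified stand-in that uses a hypothetical struct chunk array in place of the real sk_buff layouts, and it shows only the idea of copying a byte range that may span several areas.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one piece of scattered packet data. */
struct chunk {
    const unsigned char *data;
    size_t len;
};

/*
 * Copy len bytes, starting at logical offset `offset` within the
 * concatenation of all chunks, into dst.
 * Returns 0 on success, -1 if the range runs past the available data.
 */
int copy_bits(const struct chunk *chunks, size_t nchunks,
              size_t offset, unsigned char *dst, size_t len)
{
    size_t i;

    for (i = 0; i < nchunks && len > 0; i++) {
        size_t avail, n;

        if (offset >= chunks[i].len) {
            offset -= chunks[i].len;  /* range starts in a later chunk */
            continue;
        }
        avail = chunks[i].len - offset;
        n = len < avail ? len : avail;
        memcpy(dst, chunks[i].data + offset, n);
        dst += n;
        len -= n;
        offset = 0;  /* later chunks are read from their beginning */
    }
    return len == 0 ? 0 : -1;
}
```

Given the chunks "abc", "defg", and "hi", a request for 5 bytes at offset 2 yields "cdefg", which crosses a chunk boundary that a single memcpy could not handle.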
The first fragment (where offset is 0) is special from the IP options point of view because it is the only one that includes a full copy of the options from the original IP packet. Not all the options have to be replicated into all of the fragments; only the first fragment will include all of them.
ip_options_fragment, described in Chapter 19, cleans up the content of the ip_opt structure associated with the original IP packet so that fragments following the first one will not include options they do not need. Therefore, ip_options_fragment is called only during the processing of the first fragment (which is the one with offset=0). The MF flag (for More Fragments) is set if either of the following conditions is met:
The following two statements update two offsets. It is easy to confuse the two. offset is maintained because the packet currently being fragmented may be a fragment of a larger packet; if so, offset represents the offset of the current fragment within the original packet (otherwise, it is simply 0). ptr is an offset within the packet we are fragmenting and changes as the loop progresses. The two variables have the same value in two cases: where the packet we are fragmenting is not a fragment itself, and where this fragment is the very first fragment.
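The interplay between ptr, offset, and the MF flag is easier to follow with a small simulation of the loop. This is a sketch under stated assumptions, not the kernel's code: plan_fragments and struct frag_plan are hypothetical names, mtu_payload is the room for data in each fragment, and the MF rule used here (more data remains, or the input was itself a non-final fragment) is inferred from the description of not_last_frag.

```c
#include <stddef.h>

/* Hypothetical record describing one fragment produced by the loop. */
struct frag_plan {
    unsigned int ptr;     /* where the data starts inside the input packet */
    unsigned int offset;  /* offset within the original, unfragmented packet */
    unsigned int len;     /* payload bytes carried by this fragment */
    int mf;               /* nonzero if the MF flag must be set */
};

/*
 * payload_len:   payload bytes of the input packet.
 * offset:        offset of the input packet within the original packet
 *                (0 if the input is not a fragment itself).
 * not_last_frag: nonzero if the input packet had MF set.
 * Returns the number of entries written to out.
 */
size_t plan_fragments(unsigned int payload_len, unsigned int mtu_payload,
                      unsigned int offset, int not_last_frag,
                      struct frag_plan *out, size_t max_out)
{
    unsigned int left = payload_len;
    unsigned int ptr = 0;
    size_t n = 0;

    if (mtu_payload < 8)
        return 0;  /* no room for even one aligned unit of data */

    while (left > 0 && n < max_out) {
        unsigned int len = left;

        if (len > mtu_payload)
            len = mtu_payload;
        if (len < left)
            len &= ~7u;  /* all but the last fragment: multiple of 8 */
        left -= len;

        out[n].ptr = ptr;
        out[n].offset = offset;
        out[n].len = len;
        out[n].mf = (left > 0 || not_last_frag);
        n++;

        ptr += len;     /* position inside the packet being fragmented */
        offset += len;  /* position inside the original packet */
    }
    return n;
}
```

When a 3980-byte payload is split with 1480 bytes of room, ptr and offset stay equal (0, 1480, 2960). When the input is itself a fragment that started at offset 1480, ptr still starts at 0 while offset starts at 1480, which is exactly the distinction the two variables are meant to capture.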
Finally, the slow path needs to update the header length (taking into account the size of the options), compute the checksum with ip_send_check, and transmit the fragment using the output function passed as a parameter. The output function used by IPv4 is ip_finish_output (see Figure 18-1 in Chapter 18).
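Because each fragment gets its own length, offset, and flags, its header checksum has to be recomputed. The following sketch shows the standard Internet checksum over an IP header. It is not the kernel's ip_send_check; ip_header_checksum is a hypothetical name, and the function assumes the caller has zeroed the checksum field before the call and that the header length is even (IP headers are always a multiple of 4 bytes).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * One's complement sum of the header taken as big-endian 16-bit words,
 * folded to 16 bits and then complemented.
 * hdr:     header bytes as they appear on the wire, checksum field zeroed.
 * hdr_len: header length in bytes (assumed even).
 */
uint16_t ip_header_checksum(const uint8_t *hdr, size_t hdr_len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < hdr_len; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;
}
```

A useful property of this checksum is that running the same function over a header whose checksum field is already filled in correctly returns 0, which is how a receiver can verify it.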
22.1.4. Fast Fragmentation
ip_fragment tries the fast path when it sees that the frag_list pointer of the input skb buffer is not NULL. However, as described earlier in this chapter, it must make sure that the fragments are suitable for the fast path. Here are the sanity checks related to protocol requirements:
And there are some other buffer management checks as well:
The initialization of the IP header of the first fragment is completed outside the loop because it can be optimized slightly. For instance, when this function runs, it knows there are at least two fragments, and therefore it does not need to check frag->next on the first fragment to initialize iph->frag_off: as the first fragment, this fragment must have the IP_MF flag set and the rest of the offset set to 0 (iph->frag_off = IP_MF). The other fragments, except the last one, must have the IP_MF bit set in frag_off without disturbing the rest of the value (iph->frag_off |= IP_MF).
Let's suppose the fast path can be used. The rest of the code is pretty simple, and to some extent it is similar to the code seen for the slow path. After the first fragment has been sent (i.e., after the first iteration of the for loop), the IP header is modified with ip_options_fragment so that it can be recycled by the following fragments. If we exclude that special case, all we need to do to transmit a fragment is:
In case of errors, memory for all the subsequent fragments in frag_list is freed (not shown in the following snapshot). Note that the code inside the if (frag) {...} block prepares the fragment that will be transmitted in the following loop iteration, and the call to output transmits the current one.
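The way frag_off evolves across the fragments of the fast path can be sketched as follows. This is a simplified model rather than the kernel's loop: fast_path_frag_off is a hypothetical name, the fragments are represented only by their payload lengths, and the values are kept in host byte order. The first fragment needs no special case here because its offset is 0, so the general rule yields a plain IP_MF for it.

```c
#include <stddef.h>
#include <stdint.h>

#define IP_MF 0x2000  /* more fragments, host byte order */

/*
 * lens:     payload length of each already formed fragment, in order.
 * n:        number of fragments.
 * frag_off: output, one flags+offset value per fragment.
 */
void fast_path_frag_off(const unsigned int *lens, size_t n,
                        uint16_t *frag_off)
{
    unsigned int offset = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        /* The offset field is expressed in units of 8 bytes. */
        frag_off[i] = (uint16_t)(offset >> 3);

        /* Every fragment that has a successor carries the MF flag. */
        if (i + 1 < n)
            frag_off[i] |= IP_MF;

        offset += lens[i];
    }
}
```

For fragments of 1480, 1480, and 1020 bytes, the resulting values are 0x2000, 0x20B9, and 0x0172: only the last one has MF clear.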
Thursday, October 22, 2009