35.7. Input Routing

Ingress IP packets for which no route can be found in the cache by ip_route_input are checked against the routing tables by ip_route_input_slow, which is defined in net/ipv4/route.c and whose logic is shown in Figures 35-5(a) and 35-5(b). In this section, we describe the internals of this routine in detail.

Figure 35-5a. ip_route_input_slow function

Figure 35-5b. ip_route_input_slow function

The function starts with a few sanity checks on the source and destination addresses; for instance, the source IP address must not be a multicast address. I already listed most of those checks in the section "Verbose Monitoring" in Chapter 31. More sanity checks are done later in the function.

The routing table lookup is done with fib_lookup, the routine introduced in the section "fib_lookup Function." If fib_lookup cannot find a matching route, the packet is dropped; additionally, if the receiving interface is configured with forwarding enabled, an ICMP_UNREACHABLE message is sent back to the source. Note that the ICMP message is sent not by ip_route_input_slow but by its caller, who takes care of it upon seeing a return value of RTN_UNREACHABLE.

In case of success, ip_route_input_slow distinguishes the following three cases:
In the first two cases, the packet is to be delivered locally, and in the third, it needs to be forwarded. The details of how local delivery and forwarding are handled can be found in the sections "Local Delivery" and "Forwarding." Here are some of the tasks they both need to take care of:
35.7.1. Creation of a Cache Entry

I said already in the section "Cache Lookup" in Chapter 33 that ip_route_input (and therefore ip_route_input_slow, in case of a cache miss) can be called just to consult the routing table, not necessarily to route an ingress packet. Because of that, ip_route_input_slow does not always create a new cache entry. When invoked from IP or an L4 protocol (such as IP over IP), the function always creates a cache entry. Currently, the only other possibility is invocation by ARP. Routes generated by ARP are cached only when they would be valid for proxy ARP. See the section "Processing ARPOP_REQUEST Packets" in Chapter 28.

The new entry is allocated with dst_alloc. Of particular importance are the following initializations for the new cache entry:
35.7.2. Preferred Source Address Selection

The route added to the routing cache is unidirectional, meaning that it will not be used to route traffic in the reverse direction toward the source IP address of the packet being routed. However, in some cases, the reception of a packet can trigger an action that requires the local host to choose a source IP address that it can use when transmitting a packet back to the sender. This address, the preferred source IP address, must be saved with the routing cache entry that routed the ingress packet. Here are two cases where that address, which is saved in a field called rt_spec_dst, comes in handy:
The preferred source is selected through the fib_validate_source function mentioned in the section "Helper Routines" and called by ip_route_input_slow. ip_route_input_slow initializes the preferred source IP address rt_spec_dst based on the destination address of the packet being routed:
In this example, when transmitting packets to the hosts of the 10.0.1.0/24 subnet, the kernel will use 10.0.3.100 as the source IP address. Of course, only locally configured addresses are accepted: this means that for the previous command to be accepted, 10.0.3.100 must have been configured on one of the local interfaces, but not necessarily on the same device used to reach the 10.0.1.0/24 subnet. (Remember that in Linux, addresses belong to the host, not to the devices; see the section "Responding from Multiple Interfaces" in Chapter 28.) An administrator normally provides a source address when she does not want to use the one that would be picked by default from the egress device.

Figure 35-6 summarizes how rt_spec_dst is selected.

35.7.3. Local Delivery

The following types of packets are delivered locally by initializing dst->input appropriately, as we saw in the section "Initialization of Function Pointers for Ingress Traffic":
Figure 35-6. Selection of rt_spec_dst

ip_route_input_slow recognizes two kinds of broadcasts:
ip_route_input_slow accepts broadcasts only if they are generated by the IP protocol. You might think that this is a superfluous check, given that ip_route_input_slow is called to route IP packets. However, as I said in the section "Cache Lookup" in Chapter 33, the input buffer to ip_route_input (and therefore to ip_route_input_slow in case of a cache miss) does not necessarily represent a packet to be routed.

If everything goes fine, a new cache entry, rtable, is created, initialized, and inserted into the routing cache. Note that there is no need to handle Multipath for packets that are delivered locally.

35.7.4. Forwarding

If the packet is to be forwarded but the configuration of the ingress device has disabled forwarding, the packet cannot be forwarded and is dropped.

If the matching route returned by fib_lookup includes more than one next hop, fib_select_multipath is used to choose among them. When multipath caching is supported, the selection is taken care of differently. The section "Effects of Multipath on Next Hop Selection" describes the algorithm used for the selection.

The source address is validated with fib_validate_source. Then, based on the factors we saw in the section "Transmitting ICMP_REDIRECT Messages" in Chapter 31, the kernel may decide to send an ICMP_REDIRECT to the source. In that case, the ICMP message is sent not by ip_route_input_slow directly, but by ip_forward, which takes care of it upon seeing the RTCF_DOREDIRECT flag.

As we saw in the section "Creation of a Cache Entry," the result of a routing lookup is not always cached.

35.7.5. Routing Failure

When a packet cannot be routed, either because of host configuration or because no route matches, the new route is added to the cache with dst->input initialized to ip_error. This means that all the ingress packets matching this route will be processed by ip_error. That function, when invoked by dst_input, will generate the proper ICMP_UNREACHABLE message depending on why the packet cannot be routed, and will drop the packet.
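The function-pointer mechanism described above can be illustrated with a minimal user-space sketch. This is not kernel code: struct err_route, struct sketch_packet, and every sketch_* function are hypothetical stand-ins, and the two-argument handler signature is a simplification chosen for the example. Only the idea comes from the text: a cached route whose input handler is the error routine makes every later packet that hits the entry take the error path.

```c
#include <errno.h>

/* Hypothetical stand-in for an ingress packet; records what happened to it. */
struct sketch_packet {
    int dropped;
    int icmp_unreachable_sent;
};

/* Hypothetical stand-in for a cached route with an input handler. */
struct err_route {
    int error;  /* why the packet cannot be routed (an errno value here) */
    int (*input)(struct err_route *rt, struct sketch_packet *pkt);
};

/* Plays the role of ip_error: report the failure and drop the packet. */
static int sketch_ip_error(struct err_route *rt, struct sketch_packet *pkt)
{
    pkt->icmp_unreachable_sent = 1;  /* subtype would be derived from rt->error */
    pkt->dropped = 1;
    return rt->error;
}

/* Plays the role of dst_input: dispatch through the cached handler. */
static int sketch_dst_input(struct err_route *rt, struct sketch_packet *pkt)
{
    return rt->input(rt, pkt);
}

/* What the slow path does, in this sketch, when routing fails. */
static void sketch_mark_unroutable(struct err_route *rt, int error)
{
    rt->error = error;
    rt->input = sketch_ip_error;
}
```

Because the failure is recorded in the cache entry itself, the dispatch routine needs no special case: it calls whatever handler the entry carries.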
Adding the erroneous route to the cache is useful because it can speed up the error processing of further packets sent to the same incorrect address.

ICMP messages are rate limited by ip_error. We already saw in the section "Egress ICMP REDIRECT Rate Limiting" in Chapter 33 that ICMP_REDIRECT messages are also rate limited by the DST. The rate limiting discussed here is independent of the other, but is enforced using the same fields of the dst_entry. This is possible because given any route, these two forms of rate limiting are mutually exclusive: one applies to ICMP_REDIRECT messages and the other one applies to ICMP_UNREACHABLE messages.

Here is how rate limiting is implemented by ip_error with a simple token bucket algorithm.

The timestamp dst.rate_last is updated every time ip_error is invoked to generate an ICMP message. dst.rate_tokens specifies how many ICMP messages (also known as the number of tokens, or the budget) can be sent before the rate limiting kicks in and new ICMP_UNREACHABLE transmission requests will be ignored. The budget is decremented each time an ICMP_UNREACHABLE message is sent, and is incremented by ip_error itself. The budget cannot exceed the maximum number ip_rt_error_burst, which represents, as its name suggests, the maximum number of ICMP messages a host can send in 1 second (i.e., the burst). Its value is expressed in Hz so that it is easy to add tokens based on the difference between the local time jiffies and dst.rate_last.

Figure 35-7. ip_mkroute_input function

When ip_error is invoked and at least one token is available, the function is allowed to transmit an ICMP_UNREACHABLE message. The ICMP subtype is derived from dst.error, which was initialized by ip_route_input_slow when fib_lookup failed to find a route.
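The token bucket just described can be modeled in a few lines of user-space C. This is a sketch under stated assumptions, not the kernel routine: struct rate_state and sketch_error_allowed are hypothetical names, the tick rate of 100 per second is arbitrary, and the choice of a per-message cost of one second's worth of ticks with a cap of five times that is an assumption made for the example. The field names rate_last and rate_tokens mirror the ones in the text.

```c
#include <stdbool.h>

/* Illustrative constants; the values are assumptions for this sketch. */
#define SKETCH_HZ          100UL              /* ticks per second */
#define SKETCH_ERROR_COST  SKETCH_HZ          /* budget spent per ICMP message */
#define SKETCH_ERROR_BURST (5UL * SKETCH_HZ)  /* cap on the accumulated budget */

/* Hypothetical stand-in for the two rate-limiting fields of a route. */
struct rate_state {
    unsigned long rate_last;    /* tick at which the limiter last ran */
    unsigned long rate_tokens;  /* current budget, measured in ticks */
};

/* Returns true if an error message may be sent at tick 'now'. */
static bool sketch_error_allowed(struct rate_state *st, unsigned long now)
{
    /* Refill: one token unit per tick elapsed since the previous call. */
    st->rate_tokens += now - st->rate_last;
    if (st->rate_tokens > SKETCH_ERROR_BURST)
        st->rate_tokens = SKETCH_ERROR_BURST;
    st->rate_last = now;

    /* Spend: a message goes out only if the budget covers its cost. */
    if (st->rate_tokens >= SKETCH_ERROR_COST) {
        st->rate_tokens -= SKETCH_ERROR_COST;
        return true;
    }
    return false;
}
```

Measuring the budget in ticks is what makes the refill step a single subtraction of timestamps. With the constants assumed here, a route that has been idle for a long time can answer five packets back to back, after which it answers at most one per second.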
Wednesday, October 21, 2009