Wednesday, October 21, 2009

Section 35.7.  Input Routing










35.7. Input Routing


Ingress IP packets for which no route can be found in the cache by ip_route_input are checked against the routing tables by ip_route_input_slow, which is defined in net/ipv4/route.c and whose logic is shown in Figures 35-5(a) and 35-5(b). In this section, we describe the internals of this routine in detail.



Figure 35-5a. ip_route_input_slow function




Figure 35-5b. ip_route_input_slow function



The function starts with a few sanity checks on the source and destination addresses; for instance, the source IP address must not be a multicast address. I already listed most of those checks in the section "Verbose Monitoring" in Chapter 31. More sanity checks are done later in the function.
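As a rough illustration of the kind of source address check described above, here is a minimal user-space sketch. The helper names are hypothetical and addresses are kept in host byte order for readability. Only the multicast test is stated in the text; the reserved-class and loopback tests are assumptions about which other sources a router would typically refuse.

```c
#include <stdbool.h>
#include <stdint.h>

/* Addresses are in host byte order to keep the masks readable. */

/* 224.0.0.0/4: multicast (class D). */
static bool is_multicast(uint32_t addr)
{
    return (addr & 0xf0000000u) == 0xe0000000u;
}

/* 240.0.0.0/4: reserved (class E); this range includes 255.255.255.255.
 * Assumption: treated as an illegal source in this sketch. */
static bool is_reserved_class(uint32_t addr)
{
    return (addr & 0xf0000000u) == 0xf0000000u;
}

/* 127.0.0.0/8: loopback.
 * Assumption: treated as an illegal source for packets from the wire. */
static bool is_loopback(uint32_t addr)
{
    return (addr & 0xff000000u) == 0x7f000000u;
}

/* Hypothetical helper: true if saddr should be rejected as a source. */
static bool source_is_illegal(uint32_t saddr)
{
    return is_multicast(saddr) || is_reserved_class(saddr) ||
           is_loopback(saddr);
}
```

Note that an all-zero source is deliberately not rejected here, because a null source address is legitimate for packets sent to the limited broadcast address.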


The routing table lookup is done with fib_lookup, the routine introduced in the section "fib_lookup Function." If fib_lookup cannot find a matching route, the packet is dropped; additionally, if the receiving interface is configured with forwarding enabled, an ICMP_UNREACHABLE message is sent back to the source. Note that the ICMP message is sent not by ip_route_input_slow but by its caller, which takes care of it upon seeing a return value of RTN_UNREACHABLE.


In case of success, ip_route_input_slow distinguishes the following three cases:


  • Packet addressed to a broadcast address

  • Packet addressed to a local address

  • Packet addressed to a remote address


In the first two cases, the packet is to be delivered locally, and in the third, it needs to be forwarded. The details of how local delivery and forwarding are handled can be found in the sections "Local Delivery" and "Forwarding." Here are some of the tasks they both need to take care of:



Sanity checks, especially on the source address


Source addresses are checked against illegal values and are run through fib_validate_source to detect spoofing attempts.


Creation and initialization of a new cache entry (the local variable rth)


See the following section, "Creation of a Cache Entry."



35.7.1. Creation of a Cache Entry





I said already in the section "Cache Lookup" in Chapter 33 that ip_route_input (and therefore ip_route_input_slow, in case of a cache miss) can be called just to consult the routing table, not necessarily to route an ingress packet. Because of that, ip_route_input_slow does not always create a new cache entry. When invoked from IP or an L4 protocol (such as IP over IP), the function always creates a cache entry. Currently, the only other possibility is invocation by ARP. Routes generated by ARP are cached only when they would be valid for proxy ARP. See the section "Processing ARPOP_REQUEST Packets" in Chapter 28.


The new entry is allocated with dst_alloc. Of particular importance are the following initializations for the new cache entry:



rth->u.dst.input



rth->u.dst.output


These two virtual functions are invoked respectively by dst_input and dst_output to complete the processing of ingress and egress packets, as shown in Figure 18-1 in Chapter 18. We already saw in the section "Setting Functions for Reception and Transmission" how these two routines can be initialized depending on whether a packet is to be forwarded, delivered locally, or dropped.


rth->fl


This flowi structure is used as a search key by cache lookups. It is important to note that rth->fl's fields are initialized to the input parameters received by ip_route_input_slow: this ensures that the next time a lookup is done with the same parameters, ip_route_input will be able to satisfy it with a cache lookup.
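The idea of storing the lookup parameters as the search key can be sketched as follows. This is a simplified user-space model, not the kernel's flowi structure: the struct names, the helper names, and the exact set of key fields (destination, source, ingress interface, TOS) are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the flowi search key. */
struct flow_key {
    uint32_t daddr;  /* destination address of the lookup */
    uint32_t saddr;  /* source address of the lookup */
    int      iif;    /* ingress interface index */
    uint8_t  tos;    /* type of service */
};

/* Hypothetical, simplified stand-in for a routing cache entry. */
struct cache_entry {
    struct flow_key fl;  /* copy of the parameters that produced the entry */
};

/* Save the lookup parameters verbatim as the entry's search key. */
static void cache_entry_init(struct cache_entry *e, uint32_t daddr,
                             uint32_t saddr, int iif, uint8_t tos)
{
    e->fl.daddr = daddr;
    e->fl.saddr = saddr;
    e->fl.iif = iif;
    e->fl.tos = tos;
}

/* A later lookup hits the entry only if every parameter is identical. */
static bool cache_entry_matches(const struct cache_entry *e, uint32_t daddr,
                                uint32_t saddr, int iif, uint8_t tos)
{
    return e->fl.daddr == daddr && e->fl.saddr == saddr &&
           e->fl.iif == iif && e->fl.tos == tos;
}
```

Because the key is a copy of the inputs rather than of the lookup result, a second request with the same parameters matches the entry regardless of which route the slow path selected.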


rth->rt_spec_dst


This is the preferred source address. See the following section, "Preferred Source Address Selection."




35.7.2. Preferred Source Address Selection






The route added to the routing cache is unidirectional, meaning that it will not be used to route traffic in the reverse direction toward the source IP address of the packet being routed. However, in some cases, the reception of a packet can trigger an action that requires the local host to choose a source IP address that it can use when transmitting a packet back to the sender.[*] This address, the preferred source IP address,[**] must be saved with the routing cache entry that routed the ingress packet. Here are two cases where that address, which is saved in a field called rt_spec_dst, comes in handy:

[*] The preferred source IP address to use for traffic generated locally (i.e., packets whose transmission is not triggered or influenced by the reception of another packet) may be different. See the section "Selecting the Source IP Address."

[**] RFC 1122 calls it the "specific destination."



ICMP


When a host receives an ICMP ECHO REQUEST message (popularly known as a "ping," from the name of the command that usually generates it), the host returns an ICMP ECHO REPLY unless it is explicitly configured not to. The rt_spec_dst of the route used for the ingress ICMP ECHO REQUEST is used as the source address for the routing lookup made to route the ICMP ECHO REPLY. See icmp_reply in net/ipv4/icmp.c, and see Chapter 25. The ip_send_reply routine in net/ipv4/ip_output.c does something similar.


IP options


A couple of IP options require the intermediate hosts between the source and the destination to write the IP addresses of their receiving interfaces into the IP header. The address that Linux writes is rt_spec_dst. See the description of ip_options_compile in Chapter 19.



The preferred source is selected through the fib_validate_source function mentioned in the section "Helper Routines" and called by ip_route_input_slow.


ip_route_input_slow initializes the preferred source IP address rt_spec_dst based on the destination address of the packet being routed:



Packet addressed to a local address


In this case, the local address to which the packet was addressed becomes the preferred source address. (The ICMP example previously cited falls into this case.)


Broadcast packet


A broadcast address cannot be used as a source address for egress packets, so in this case, ip_route_input_slow does more investigation with the help of two other routines: inet_select_addr and fib_validate_source (see the section "Helper Routines").


When the source IP address is not set in the received packet (that is, when it is all zeroes), inet_select_addr selects the first address with scope RT_SCOPE_LINK on the device the packet was received from. This is because packets are sent with a null source address when addressed to the limited broadcast address, which is an address with scope RT_SCOPE_LINK. An example is a DHCP discovery message.


When the source address is not all zeroes, fib_validate_source takes care of it.


Forwarded packet


In this case, the choice is left to fib_validate_source. (The IP options example previously cited falls into this case.)


The preferred source IP to use for packets matching a given route can be explicitly configured by the user with a command like this:



ip route add 10.0.1.0/24 via 10.0.0.1 src 10.0.3.100



In this example, when transmitting packets to the hosts of the 10.0.1.0/24 subnet, the kernel will use 10.0.3.100 as the source IP address. Of course, only locally configured addresses are accepted: this means that for the previous command to be accepted, 10.0.3.100 must have been configured on one of the local interfaces, but not necessarily on the same device used to reach the 10.0.1.0/24 subnet. (Remember that in Linux, addresses belong to the host, not to the devices; see the section "Responding from Multiple Interfaces" in Chapter 28.) An administrator normally provides a source address when she does not want to use the one that would be picked by default from the egress device.


Figure 35-6 summarizes how rt_spec_dst is selected.
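The selection rules described in this section can be condensed into a small decision function. The following is only a sketch: the enum, the struct, and the function name are hypothetical, and the results that the kernel would obtain from inet_select_addr and fib_validate_source are passed in as plain values instead of being computed.

```c
#include <stdint.h>

/* The three outcomes of a successful lookup, as seen by this sketch. */
enum dest_kind { DEST_LOCAL, DEST_BROADCAST, DEST_FORWARD };

/* Hypothetical inputs; the last two stand in for helper results. */
struct spec_dst_inputs {
    uint32_t daddr;           /* destination of the ingress packet */
    uint32_t saddr;           /* source of the ingress packet, 0 if unset */
    uint32_t link_scope_addr; /* stand-in for the inet_select_addr result */
    uint32_t validated_addr;  /* stand-in for the fib_validate_source result */
};

/* Pick the preferred source address according to the destination type. */
static uint32_t select_spec_dst(enum dest_kind kind,
                                const struct spec_dst_inputs *in)
{
    switch (kind) {
    case DEST_LOCAL:
        /* The local address the packet was sent to. */
        return in->daddr;
    case DEST_BROADCAST:
        /* A broadcast address cannot be used as a source address. */
        return in->saddr == 0 ? in->link_scope_addr : in->validated_addr;
    case DEST_FORWARD:
        return in->validated_addr;
    }
    return 0; /* not reached for valid enum values */
}
```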




35.7.3. Local Delivery




The following types of packets are delivered locally by initializing dst->input appropriately, as we saw in the section "Initialization of Function Pointers for Ingress Traffic":


  • Packets addressed to locally configured addresses, including multicast addresses

  • Packets addressed to broadcast addresses



Figure 35-6. Selection of rt_spec_dst



ip_route_input_slow recognizes two kinds of broadcasts:



Limited broadcasts


This is an address consisting of all ones: 255.255.255.255.[*] It can be recognized easily without a call to fib_lookup. Limited broadcasts are delivered to any host on the link, regardless of the subnet the host is configured on. No table lookup is required.

[*] There is an obsolete form of limited broadcast that consists of all zeros: 0.0.0.0.


Subnet broadcasts


These broadcasts are directed at hosts configured on a specific subnet. If hosts are configured on different subnets reachable via the same device (see Figure 30-4(c) in Chapter 30), only the right ones will receive a subnet broadcast. Unlike limited broadcasts, subnet broadcasts cannot be recognized without involving the routing table with fib_lookup. For example, the address 10.0.1.127 might be a subnet broadcast in 10.0.1.0/25, but not in 10.0.1.0/24.
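The 10.0.1.127 example comes down to simple mask arithmetic: whether an address is a subnet broadcast depends on the prefix length, which only the routing configuration knows. A minimal sketch follows, with hypothetical helper names, addresses in host byte order, and a prefix length assumed to be at most 32. The kernel does not compute this on the fly; it obtains the answer from the routing table.

```c
#include <stdbool.h>
#include <stdint.h>

/* Netmask for a prefix length between 0 and 32 (host byte order). */
static uint32_t prefix_mask(unsigned prefix_len)
{
    /* Shifting a 32-bit value by 32 is undefined, so handle 0 apart. */
    return prefix_len == 0 ? 0 : 0xffffffffu << (32 - prefix_len);
}

/* Broadcast address of a subnet: network bits kept, host bits all ones. */
static uint32_t subnet_broadcast(uint32_t network, unsigned prefix_len)
{
    uint32_t mask = prefix_mask(prefix_len);

    return (network & mask) | ~mask;
}

/* True if addr is the broadcast address of network/prefix_len. */
static bool is_subnet_broadcast(uint32_t addr, uint32_t network,
                                unsigned prefix_len)
{
    return addr == subnet_broadcast(network, prefix_len);
}
```

With 10.0.1.0 as the network, 10.0.1.127 matches for a /25 prefix but not for a /24 prefix, whose broadcast address is 10.0.1.255.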


ip_route_input_slow accepts broadcasts only if they are generated by the IP protocol. You might think that this is a superfluous check, given that ip_route_input_slow is called to route IP packets. However, as I said in the section "Cache Lookup" in Chapter 33, the input buffer to ip_route_input (and therefore to ip_route_input_slow in case of a cache miss) does not necessarily represent a packet to be routed.


If everything goes fine, a new cache entry, rtable, is created, initialized, and inserted into the routing cache.


Note that there is no need to handle Multipath for packets that are delivered locally.




35.7.4. Forwarding


If the packet is to be forwarded but the configuration of the ingress device has disabled forwarding, the packet cannot be transmitted and must be dropped. The forwarding status of the device is checked with IN_DEV_FORWARD. Figure 35-7 shows the internals of ip_mkroute_input; in particular, it shows what that function looks like when there is no support for multipath caching (i.e., when ip_mkroute_input ends up being an alias to ip_mkroute_input_def). In the section "Multipath Caching," you will see how the other case differs.


If the matching route returned by fib_lookup includes more than one next hop, fib_select_multipath is used to choose among them. When multipath caching is supported, the selection is taken care of differently. The section "Effects of Multipath on Next Hop Selection" describes the algorithm used for the selection.


The source address is validated with fib_validate_source. Then, based on the factors we saw in the section "Transmitting ICMP_REDIRECT Messages" in Chapter 31, the kernel may decide to send an ICMP_REDIRECT to the source. In that case, the ICMP message is sent not by ip_route_input_slow directly, but by ip_forward, which takes care of it upon seeing the RTCF_DOREDIRECT flag.


As we saw in the section "Creation of a Cache Entry," the result of a routing lookup is not always cached.




35.7.5. Routing Failure







When a packet cannot be routed, either because of host configuration or because no route matches, the new route is added to the cache with dst->input initialized to ip_error. This means that all the ingress packets matching this route will be processed by ip_error. That function, when invoked by dst_input, will generate the proper ICMP_UNREACHABLE message depending on why the packet cannot be routed, and will drop the packet. Adding the erroneous route to the cache is useful because it can speed up the error processing of further packets sent to the same incorrect address.
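To make "depending on why the packet cannot be routed" more concrete, here is a hedged sketch of how a saved error value could be mapped to an ICMP destination-unreachable code. The function name and the SKETCH_ constants are hypothetical, and the particular errno-to-code pairing is an assumption; check net/ipv4/route.c for the mapping your kernel version actually uses. The numeric code values themselves (0, 1, and 13) are the standard ICMP destination-unreachable codes.

```c
#include <errno.h>

/* Standard ICMP destination-unreachable codes, under sketch-local names. */
enum {
    SKETCH_ICMP_NET_UNREACH  = 0,   /* network unreachable */
    SKETCH_ICMP_HOST_UNREACH = 1,   /* host unreachable */
    SKETCH_ICMP_PKT_FILTERED = 13   /* administratively prohibited */
};

/*
 * Hypothetical helper: translate the error saved with the cached route
 * into an ICMP code. Returns -1 when no ICMP message should be sent.
 * The pairing below is an assumption made for illustration.
 */
static int unreach_code_for_error(int error)
{
    switch (error) {
    case EHOSTUNREACH:
        return SKETCH_ICMP_HOST_UNREACH;
    case ENETUNREACH:
        return SKETCH_ICMP_NET_UNREACH;
    case EACCES:
        return SKETCH_ICMP_PKT_FILTERED;
    default:
        return -1; /* drop silently */
    }
}
```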


ICMP messages are rate limited by ip_error. We already saw in the section "Egress ICMP REDIRECT Rate Limiting" in Chapter 33 that ICMP_REDIRECT messages are also rate limited by the DST. The rate limiting discussed here is independent of the other, but is enforced using the same fields of the dst_entry. This is possible because given any route, these two forms of rate limiting are mutually exclusive: one applies to ICMP_REDIRECT messages and the other one applies to ICMP_UNREACHABLE messages.


Here is how rate limiting is implemented by ip_error with a simple token bucket algorithm.


The timestamp dst.rate_last is updated every time ip_error is invoked to generate an ICMP message. dst.rate_tokens specifies how many ICMP messages (also known as the number of tokens, or the budget) can be sent before the rate limiting kicks in and new ICMP_UNREACHABLE transmission requests will be ignored. The budget is decremented each time an ICMP_UNREACHABLE message is sent, and is incremented by ip_error itself. The budget cannot exceed the maximum number ip_rt_error_burst, which represents, as its name suggests, the maximum number of ICMP messages a host can send in 1 second (i.e., the burst). Its value is expressed in Hz so that it is easy to add tokens based on the difference between the local time jiffies and dst.rate_last.



Figure 35-7. ip_mkroute_input function



When ip_error is invoked and at least one token is available, the function is allowed to transmit an ICMP_UNREACHABLE message. The ICMP subtype is derived from dst.error, which was initialized by ip_route_input_slow when fib_lookup failed to find a route.
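The token bucket just described can be modeled in a few lines of user-space C. This is a sketch under stated assumptions, not the kernel's code: the struct, the function name, and all constants are hypothetical, the budget is tracked in clock ticks, each message is assumed to cost one second's worth of ticks, and the burst is assumed to be five messages.

```c
#include <stdbool.h>

#define TICKS_PER_SEC 1000UL               /* hypothetical stand-in for HZ */
#define ERROR_COST    TICKS_PER_SEC        /* assumed cost of one message */
#define ERROR_BURST   (5 * TICKS_PER_SEC)  /* assumed cap on the budget */

/* Hypothetical stand-in for the two rate-limiting fields of dst_entry. */
struct rate_state {
    unsigned long rate_last;    /* tick of the previous invocation */
    unsigned long rate_tokens;  /* current budget, in ticks */
};

/* Returns true if an ICMP_UNREACHABLE message may be sent at tick now. */
static bool error_rate_allow(struct rate_state *st, unsigned long now)
{
    /* Refill: one token per tick elapsed since the last invocation. */
    st->rate_tokens += now - st->rate_last;
    if (st->rate_tokens > ERROR_BURST)
        st->rate_tokens = ERROR_BURST;
    st->rate_last = now;

    /* Spend: a message is allowed only if the budget covers its cost. */
    if (st->rate_tokens >= ERROR_COST) {
        st->rate_tokens -= ERROR_COST;
        return true;
    }
    return false;
}
```

With these values, a route that has been idle for a long time can emit five messages back to back, after which further requests are ignored until enough ticks have elapsed to refill the budget.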












