Thursday, October 22, 2009

Section 25.8.  Transmitting ICMP Messages





























The two classes of ICMP messages introduced in the section "ICMP Header," errors and queries, are transmitted using two different routines:



icmp_send


Used by the kernel to transmit ICMP error messages when specific conditions are detected.


icmp_reply


Used by the ICMP protocol to reply to ingress ICMP request messages that require a response.


Both routines receive an skb buffer as input. However, the one passed to icmp_send represents the ingress IP packet that triggered the transmission of the ICMP message, whereas the one passed to icmp_reply represents an ingress ICMP request message that requires a response.


The code in net/ipv4/icmp.c processes incoming ICMP messages, and therefore always uses icmp_reply to transmit an ICMP message in response to one it has received. Other kernel network subsystems (e.g., routing and IP) use icmp_send when they need to generate ICMP messages, as shown in Figure 25-8.



Figure 25-8. Subsystems using icmp_send/icmp_reply



In both cases:


  • ip_route_output_key is used to find the route to the destination (see Chapter 33).

  • The two routines ip_append_data and ip_push_pending_frames are used to request a transmission to the IP layer. These routines are described in Chapter 21.

  • ICMP messages generated in kernel space are rate limited (if the kernel has been configured to do it via /proc) with icmpv4_xrlim_allow (see the section "Rate Limiting").

  • Transmissions are serialized with a per-CPU spin lock through icmp_xmit_lock and icmp_xmit_unlock. The per-CPU spin locks are accessed via the per-CPU ICMP sockets (see the section "Protocol Initialization"). When the spin lock cannot be acquired because it is already held, transmission fails (but neither of the routines returns an error code).
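The last point, a transmission that is silently abandoned when the lock is busy, can be illustrated with a minimal user-space sketch. It is not the kernel code: the names sketch_xmit_trylock, sketch_xmit_unlock, sketch_icmp_transmit, and sent_count are hypothetical, and a C11 atomic_flag stands in for the spin lock reached through the per-CPU ICMP socket.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the spin lock of one per-CPU ICMP socket. */
static atomic_flag xmit_lock = ATOMIC_FLAG_INIT;

/* Counts the messages that were actually handed to the "IP layer". */
static int sent_count;

/* Try to take the lock without waiting; true means we own it now. */
static bool sketch_xmit_trylock(void)
{
    return !atomic_flag_test_and_set(&xmit_lock);
}

static void sketch_xmit_unlock(void)
{
    atomic_flag_clear(&xmit_lock);
}

/*
 * Like the routines described in the text, this returns nothing:
 * when the lock is already held the message is simply not sent,
 * and the caller is not told.
 */
static void sketch_icmp_transmit(void)
{
    if (!sketch_xmit_trylock())
        return;                 /* lock busy: give up silently */
    sent_count++;               /* placeholder for the real transmission */
    sketch_xmit_unlock();
}
```

A caller that needs to know whether the message left the host cannot learn it from the return value; in this sketch the only evidence is the sent_count counter.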


Tables 25-6, 25-7, and 25-8 show where the ICMP types in Table 25-1 are generated by the kernel. For those subsystems covered in this book, the tables also include a reference to the routines where the ICMP messages are generated.


Table 25-6. Network subsystems that generate ICMP messages

Type  Name                 Generated by
0     ICMP_ECHOREPLY       ICMP (icmp_echo)
3     ICMP_DEST_UNREACH    See Table 25-7
5     ICMP_REDIRECT        Routing (ip_rt_send_redirect)
11    ICMP_TIME_EXCEEDED   See Table 25-8
12    ICMP_PARAMETERPROB   IPv4 (ip_options_compile, ip_options_rcv_srr)
14    ICMP_TIMESTAMPREPLY  ICMP (icmp_timestamp)



Table 25-7. Network subsystems that generate variants of the ICMP_DEST_UNREACH message type

Code  Kernel symbol      Generated by
0     ICMP_NET_UNREACH   Routing (ip_error), Netfilter
1     ICMP_HOST_UNREACH  Routing (ip_error, ipv4_link_failure), Netfilter, GRE, IPIP
2     ICMP_PROT_UNREACH  IPv4 (ip_local_deliver_finish), Netfilter, GRE
3     ICMP_PORT_UNREACH  Netfilter, GRE, IPIP, UDP
4     ICMP_FRAG_NEEDED   IPv4 (ip_fragment), GRE, IPIP, Virtual Server
5     ICMP_SR_FAILED     IPv4 (ip_forward)
9     ICMP_NET_ANO       Netfilter
10    ICMP_HOST_ANO      Netfilter
13    ICMP_PKT_FILTERED  Routing (ip_error), Netfilter



Netfilter generates ICMP_DEST_UNREACH messages when it drops ingress IP packets according to the configuration applied, for instance, with iptables. The --reject-with option for the REJECT target allows the user to select which ICMP message type to use when rejecting ingress IP packets that match a given rule.


Tunneling protocols such as IPIP and GRE, defined in net/ipv4/ipip.c and net/ipv4/ip_gre.c, respectively, need to handle ICMP messages according to the rules in RFC 2003, section 4.


Table 25-8. Network subsystems that generate variants of the ICMP_TIME_EXCEEDED message type

Code  Kernel symbol      Generated by
0     ICMP_EXC_TTL       IPv4 (ip_forward)
1     ICMP_EXC_FRAGTIME  IPv4 (ip_expire)




25.8.1. Transmitting ICMP Error Messages


















Figures 25-9(a) and 25-9(b) show the internals of icmp_send. Here are its input parameters:



skb_in


Input IP packet the error is associated with.


type



code


Type and code fields to use in the ICMP header.


info


Additional information: an MTU for ICMP_FRAG_NEEDED messages, a gateway address for ICMP_REDIRECT messages, and an offset for ICMP_PARAMETERPROB messages.



Figure 25-9a. icmp_send function




Figure 25-9b. icmp_send function


icmp_send starts with a few sanity checks to filter out illegal requests. The following conditions cause it to abort:


  • The IP datagram is received as broadcast or multicast. This case is detected by checking the RTCF_BROADCAST and RTCF_MULTICAST flags of the routing cache entry associated with skb_in.

  • The IP datagram is received encapsulated in a broadcast link layer frame. This case is detected by comparing the packet type skb_in->pkt_type against PACKET_HOST.

  • The IP datagram is a fragment, and it is not the first one of the original packet. This case can be detected by reading the offset field of the IP header (see Chapter 22).

  • The IP datagram carries an ICMP error message. You must not use an error message to reply to an error message.
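The four conditions above can be sketched as a single predicate. This is only an illustration, not the kernel routine: struct pkt_info and sketch_may_send_error are hypothetical names that flatten what icmp_send reads from skb_in, its routing cache entry, and its IP header. The 0x1FFF mask isolates the 13-bit fragment offset of the IP header, here assumed to be already in host byte order.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of what icmp_send inspects in skb_in. */
struct pkt_info {
    bool rt_broadcast;    /* RTCF_BROADCAST set on the routing cache entry */
    bool rt_multicast;    /* RTCF_MULTICAST set on the routing cache entry */
    bool pkt_type_host;   /* skb_in->pkt_type == PACKET_HOST */
    uint16_t frag_off;    /* IP flags + fragment offset, host byte order */
    bool is_icmp_error;   /* payload is itself an ICMP error message */
};

/* The low 13 bits of the field hold the fragment offset. */
#define SKETCH_IP_OFFSET 0x1FFF

/* Returns true when none of the abort conditions applies. */
static bool sketch_may_send_error(const struct pkt_info *p)
{
    if (p->rt_broadcast || p->rt_multicast)
        return false;     /* received as broadcast or multicast */
    if (!p->pkt_type_host)
        return false;     /* link layer frame not addressed to this host */
    if (p->frag_off & SKETCH_IP_OFFSET)
        return false;     /* a fragment other than the first one */
    if (p->is_icmp_error)
        return false;     /* never answer an error with an error */
    return true;
}
```

Note that a first fragment (offset 0 with the More Fragments flag set) passes the test, because only a nonzero offset is rejected.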


It is not the responsibility of the ICMP layer to initialize the IP header. However, a couple of IP header fields will be initialized by the IP layer according to the requirements of ICMP. In particular:



Source IP address


When the target of the ICMP message is not a locally configured IP address (i.e., RTCF_LOCAL), the source IP address to place in the encapsulating header is selected according to the sysctl_icmp_errors_use_inbound_ifaddr configuration (see the section "Tuning via /proc Filesystem").


Type of Service (TOS)


The TOS is copied from the TOS of skb_in. In addition, when the ICMP message is classified as an error (see Table 25-1), the precedence component of the TOS is initialized to IPTOS_PREC_INTERNETCONTROL (i.e., this message has higher precedence). See Chapter 18 for more information on TOS.


IP options


The IP options are copied and reversed from skb_in with ip_options_echo. See the section "IP Options" in Chapter 19.
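As an illustration of the TOS rule described above, here is a minimal sketch. The function name sketch_icmp_tos is hypothetical; the two constants are defined locally so that the example is self-contained, with the values that IPTOS_TOS_MASK (0x1E) and IPTOS_PREC_INTERNETCONTROL (0xC0) have in the standard headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies of IPTOS_TOS_MASK and IPTOS_PREC_INTERNETCONTROL. */
#define SKETCH_TOS_MASK             0x1E
#define SKETCH_PREC_INTERNETCONTROL 0xC0

/*
 * TOS for the outgoing ICMP message: queries keep the TOS of the
 * ingress packet, errors keep only its TOS bits and get the
 * "internetwork control" precedence.
 */
static uint8_t sketch_icmp_tos(uint8_t in_tos, bool is_error)
{
    if (is_error)
        return (uint8_t)((in_tos & SKETCH_TOS_MASK) |
                         SKETCH_PREC_INTERNETCONTROL);
    return in_tos;
}
```

For instance, an ingress TOS of 0x10 becomes 0xD0 in an error message, while a reply to a query would keep 0x10.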


Next, the function finds the route to the destination with ip_route_output_key, which is a cache lookup routine introduced in Chapter 33.


Note that, as shown in Figure 25-8, transmissions are rate limited with a token bucket algorithm via the icmpv4_xrlim_allow routine. When the ICMP message is not suppressed by the token bucket algorithm, the transmission ends with a call to icmp_push_reply, which ends up calling the two IP routines shown in Figure 25-8.




25.8.2. Replying to Ingress ICMP Messages




As mentioned in the section "ICMP Header," a subset of the ICMP message types comes in pairs: a request message and a response message. For example, an ICMP_ECHOREPLY message is sent in answer to an ingress ICMP_ECHO message. The transmission of response messages is done as follows:


  1. The header of the response message is first copied from the ingress request ICMP message.

  2. The type field of the ICMP header is updated (for example, ICMP_ECHO is replaced with ICMP_ECHOREPLY).

  3. icmp_reply is called to complete the transmission (i.e., to compute the checksum on the ICMP header, find the route to the destination, fill in the IP header, etc.).
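The three steps can be sketched in user space on a raw ICMP message held in a byte array. This is an illustration under simplifying assumptions, not what icmp_reply does internally: the names sketch_inet_checksum and sketch_build_echo_reply are hypothetical, the checksum is the standard Internet checksum of RFC 1071 computed over the whole ICMP message, and routing and the IP header are left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_ICMP_ECHOREPLY 0
#define SKETCH_ICMP_ECHO      8

/* RFC 1071 checksum: one's complement of the one's complement sum
 * of the data taken as big-endian 16-bit words. */
static uint16_t sketch_inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((data[i] << 8) | data[i + 1]);
    if (i < len)                        /* odd length: pad with a zero byte */
        sum += (uint32_t)(data[i] << 8);
    while (sum >> 16)                   /* fold the carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Builds the reply into out, which must hold len bytes.
 * Returns 0 on success, -1 if req is not a usable ICMP_ECHO. */
static int sketch_build_echo_reply(const uint8_t *req, size_t len,
                                   uint8_t *out)
{
    uint16_t csum;

    if (len < 8 || req[0] != SKETCH_ICMP_ECHO)
        return -1;
    memcpy(out, req, len);              /* 1. copy the request */
    out[0] = SKETCH_ICMP_ECHOREPLY;     /* 2. update the type field */
    out[2] = 0;                         /* 3. recompute the checksum */
    out[3] = 0;
    csum = sketch_inet_checksum(out, len);
    out[2] = (uint8_t)(csum >> 8);
    out[3] = (uint8_t)(csum & 0xFF);
    return 0;
}
```

Because the identifier, sequence number, and data are copied untouched, the sender of the request can match the reply to its query.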




25.8.3. Rate Limiting



ICMP messages are rate limited in two places:



By the routing code


The routing code rate limits only the outgoing ICMP_DEST_UNREACH and ICMP_REDIRECT message types. See the section "Routing Failure" in Chapter 35 and the section "Egress ICMP REDIRECT Rate Limiting" in Chapter 33.


By the ICMP code


The ICMP code can rate limit all outgoing ICMP message types (with only the few exceptions listed later in this section), including the types that are also rate limited by the routing code.


The two types of rate limiting differ in an important way: the routing code rate limits ICMP messages per destination IP address, and the ICMP code rate limits per source IP address. This means that the types that are rate limited by both ICMP and the routing code are rate limited twice.


Let me clarify this point. The kernel keeps the rate-limiting information needed to apply the token bucket algorithm in the dst_entry entries of the routing cache. Each dst_entry instance is associated with a destination IP address (more details in Chapter 33). This alone tells us that rate limiting is applied on a per-IP-address basis, not on a per-ICMP-message-type basis, but let's see exactly how per-source and per-destination rate limiting differ:


  • When a kernel subsystem, such as the IPv4 protocol, processes an input IP packet that meets certain error conditions, it sends an ICMP error message back to the source of the ingress IP packet. The ICMP code consults the routing table, the routing lookup returns a cache entry, and the cache entry is used to store the rate limiting information. This cache entry is associated with the route from the local host to the source of the faulty IP packet, that is, to the source IP address of the faulty IP packet. This is called per-source IP address rate limiting.

  • When the routing code cannot route an ingress IP packet, it generates an ICMP_HOST_UNREACH message, whereas it generates an ICMP_REDIRECT message when the destination IP address of the ingress IP packet is better reached via another gateway. In both cases, the routing code adds an entry to the cache whose associated destination IP address is the destination IP address of the ingress IP packet. This is why this is called per-destination IP address rate limiting. Chapter 35 explains how such cache entries will be used by subsequent matching IP packets.




25.8.4. Implementation of Rate Limiting


Let's see now how the ICMP code applies its rate limiting. As shown in Figure 25-10, any time an ICMP message is transmitted and rate limiting is configured in the kernel, the icmpv4_xrlim_allow function is called to enforce rate limiting. Both the ICMP message types to rate limit (sysctl_icmp_ratemask) and the rate limit's rate (sysctl_icmp_ratelimit) can be configured via /proc (see the section "Tuning via /proc Filesystem").



Figure 25-10. icmpv4_xrlim_allow function



icmpv4_xrlim_allow does not apply any rate limiting in the following cases:


  • ICMP messages whose type is not known to the kernel (they could be important ones).

  • ICMP messages used by the PMTU protocol described in RFC 1191 (i.e., type ICMP_DEST_UNREACH and code ICMP_FRAG_NEEDED).[*] PMTU is briefly described in Chapter 18.

    [*] Note that the policy used by the kernel has nothing to do with the one used by the firewall. It is common, for instance, for firewalls to drop all but a few ICMP messages. Sometimes the ones used by PMTU are dropped too, even though it goes against the RFC recommendations.

  • ICMPs sent out on the loopback device.


icmpv4_xrlim_allow is a wrapper for a more general-purpose function, xrlim_allow, which does the real job. It is called if, according to the sysctl_icmp_ratemask bitmap, the ICMP message is to be rate limited.
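The decision about whether a message goes through the rate limiter at all can be sketched as follows. The function sketch_subject_to_rate_limit and its parameters are hypothetical, and the sketch assumes that the highest ICMP type known to the kernel is 18 (the last entry of Table 25-9) and that bit n of the bitmap selects ICMP type n.

```c
#include <stdbool.h>

#define SKETCH_NR_ICMP_TYPES     18  /* assumed highest known type */
#define SKETCH_ICMP_DEST_UNREACH 3
#define SKETCH_ICMP_FRAG_NEEDED  4

/*
 * Returns true when the message must be submitted to the token
 * bucket, false when it is always allowed through.
 * ratemask plays the role of sysctl_icmp_ratemask.
 */
static bool sketch_subject_to_rate_limit(int type, int code,
                                         bool out_dev_is_loopback,
                                         unsigned int ratemask)
{
    if (type < 0 || type > SKETCH_NR_ICMP_TYPES)
        return false;   /* unknown type: could be important */
    if (type == SKETCH_ICMP_DEST_UNREACH && code == SKETCH_ICMP_FRAG_NEEDED)
        return false;   /* never slow down PMTU discovery */
    if (out_dev_is_loopback)
        return false;   /* no limit on the loopback device */
    return ((1u << type) & ratemask) != 0;
}
```

With a bitmap that selects only types 3 and 11, for example, an ICMP_ECHOREPLY (type 0) is never rate limited, whereas an ICMP_DEST_UNREACH with any code other than ICMP_FRAG_NEEDED is.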



#define XRLIM_BURST_FACTOR 6
int xrlim_allow(struct dst_entry *dst, int timeout)
{
    unsigned long now;
    int rc = 0;

    now = jiffies;
    /* Earn one token for each jiffy elapsed since the last call. */
    dst->rate_tokens += now - dst->rate_last;
    dst->rate_last = now;
    /* Cap the bucket: a burst is at most XRLIM_BURST_FACTOR messages. */
    if (dst->rate_tokens > XRLIM_BURST_FACTOR * timeout)
        dst->rate_tokens = XRLIM_BURST_FACTOR * timeout;
    /* Each transmitted message costs timeout tokens. */
    if (dst->rate_tokens >= timeout) {
        dst->rate_tokens -= timeout;
        rc = 1;
    }
    return rc;
}



xrlim_allow applies a simple token bucket algorithm. Whenever it is called, it updates the available dst->rate_tokens tokens (measured in jiffies), makes sure that the accumulated tokens do not exceed a predefined maximum (XRLIM_BURST_FACTOR * timeout, that is, enough for a burst of XRLIM_BURST_FACTOR messages), and allows the transmission of the ICMP message if the available tokens are sufficient. The input parameter timeout represents the rate to enforce, expressed as the minimum interval, in jiffies, between two messages (for example, 1*HZ would mean a rate limit of one ICMP message per second).
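To see the algorithm at work outside the kernel, the following sketch reproduces the same arithmetic with the current time passed in explicitly instead of being read from jiffies. The names struct sketch_bucket and sketch_xrlim_allow are hypothetical; the two fields play the role of dst->rate_tokens and dst->rate_last.

```c
#define SKETCH_BURST_FACTOR 6

/* Stand-in for the two rate-limiting fields of a dst_entry. */
struct sketch_bucket {
    unsigned long rate_tokens;  /* accumulated tokens, in ticks */
    unsigned long rate_last;    /* time of the last call, in ticks */
};

/* Same logic as xrlim_allow, with "now" supplied by the caller. */
static int sketch_xrlim_allow(struct sketch_bucket *b,
                              unsigned long now, unsigned long timeout)
{
    b->rate_tokens += now - b->rate_last;
    b->rate_last = now;
    if (b->rate_tokens > SKETCH_BURST_FACTOR * timeout)
        b->rate_tokens = SKETCH_BURST_FACTOR * timeout;
    if (b->rate_tokens >= timeout) {
        b->rate_tokens -= timeout;
        return 1;               /* transmission allowed */
    }
    return 0;                   /* transmission suppressed */
}
```

Suppose timeout is 100 ticks and the bucket starts empty at time 0. Ten calls at time 1000 find the bucket capped at 600 tokens, so exactly six of them are allowed. A call at time 1050 has earned only 50 tokens and is refused; a call at time 1100 has 100 tokens and is allowed again.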


Note that since xrlim_allow is a generic routine shared by different protocols, it operates on protocol-independent routing cache entries (dst_entry structures), and icmpv4_xrlim_allow is an IPv4 routine and therefore operates on rtable data structures. For more details on the dst_entry and rtable data structures, please refer to Chapter 36.




25.8.5. Receiving ICMP Messages























icmp_rcv is the function called by ip_local_deliver_finish to process ingress ICMP messages.


The ICMP protocol registers its receiving routine icmp_rcv in net/ipv4/protocol.c, as described in Chapter 24. See Chapter 20 for more details on local delivery of ingress IP packets.


First, the ICMP message's checksum is verified. Note that even when the receiving NIC is able to compute the L4 checksum in hardware (which would be the ICMP checksum in this case) and that checksum says the ICMP message is corrupted, icmp_rcv verifies the checksum once more in software. You can refer to the section "sk_buff structure" in Chapter 19 for more details on L4 checksumming support by NICs.


Not all ICMP message types can be sent to a multicast IP address: only ICMP_ECHO, ICMP_TIMESTAMP, ICMP_ADDRESS, and ICMP_ADDRESSREPLY. icmp_rcv filters out those messages that do not respect this rule. In particular, ingress broadcast ICMP_ECHO messages are dropped if the system has been configured to do so. See the section "Tuning via /proc Filesystem."
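The filter just described can be sketched as a small predicate applied to messages received as broadcast or multicast. The function sketch_accept_bcast_mcast and its echo_ignore_broadcasts parameter are hypothetical; the parameter stands for the /proc setting mentioned above, and the type values are the ones listed in Table 25-9. The sketch follows the text literally, so the setting affects only ICMP_ECHO.

```c
#include <stdbool.h>

#define SKETCH_ICMP_ECHO         8
#define SKETCH_ICMP_TIMESTAMP    13
#define SKETCH_ICMP_ADDRESS      17
#define SKETCH_ICMP_ADDRESSREPLY 18

/*
 * Decide whether an ICMP message that arrived as broadcast or
 * multicast may be processed further.
 */
static bool sketch_accept_bcast_mcast(int type, bool echo_ignore_broadcasts)
{
    switch (type) {
    case SKETCH_ICMP_ECHO:
        /* Accepted unless the administrator asked to ignore them. */
        return !echo_ignore_broadcasts;
    case SKETCH_ICMP_TIMESTAMP:
    case SKETCH_ICMP_ADDRESS:
    case SKETCH_ICMP_ADDRESSREPLY:
        return true;
    default:
        return false;   /* every other type is filtered out */
    }
}
```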


When all sanity checks are satisfied, icmp_rcv passes the ingress ICMP message to the right helper routine. The latter is accessed via the icmp_pointers vector that is initialized at the end of net/ipv4/icmp.c. icmp_pointers is an array of icmp_control data structures. Table 25-9 summarizes part of icmp_pointers's initialization. See the section "icmp_control Structure" for the exact meaning of the handler and error fields. Any types not in the table are obsolete, unsupported, or not supposed to be processed in kernel space. For all these types, handler is initialized to icmp_discard.


Table 25-9. Initialization of handler and error

Type  Kernel symbol       Handler             Error
3     ICMP_DEST_UNREACH   icmp_unreach        1
4     ICMP_SOURCE_QUENCH  icmp_unreach        1
5     ICMP_REDIRECT       icmp_redirect       1
8     ICMP_ECHO           icmp_echo           0
11    ICMP_TIME_EXCEEDED  icmp_unreach        1
12    ICMP_PARAMETERPROB  icmp_unreach        1
13    ICMP_TIMESTAMP      icmp_timestamp      0
17    ICMP_ADDRESS        icmp_address        0
18    ICMP_ADDRESSREPLY   icmp_address_reply  0
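The dispatching through icmp_pointers can be sketched with a table of function pointers indexed by ICMP type. This is an illustration only: the sketch_ names are hypothetical, the handlers merely return the name of the kernel routine they stand for, and unlisted types are mapped to the discard handler at lookup time rather than in the initializer, which is a shortcut of the sketch.

```c
#include <stddef.h>

#define SKETCH_NR_ICMP_TYPES 18

typedef const char *(*sketch_handler_t)(void);

/* Dummy handlers: each returns the name of the routine it models. */
static const char *h_discard(void)       { return "icmp_discard"; }
static const char *h_unreach(void)       { return "icmp_unreach"; }
static const char *h_redirect(void)      { return "icmp_redirect"; }
static const char *h_echo(void)          { return "icmp_echo"; }
static const char *h_timestamp(void)     { return "icmp_timestamp"; }
static const char *h_address(void)       { return "icmp_address"; }
static const char *h_address_reply(void) { return "icmp_address_reply"; }

/* Simplified counterpart of the icmp_control structure. */
struct sketch_icmp_control {
    sketch_handler_t handler;
    int error;              /* 1 for error messages, 0 for queries */
};

/* One slot per ICMP type; the rows mirror Table 25-9. */
static const struct sketch_icmp_control
sketch_pointers[SKETCH_NR_ICMP_TYPES + 1] = {
    [3]  = { h_unreach,       1 },
    [4]  = { h_unreach,       1 },
    [5]  = { h_redirect,      1 },
    [8]  = { h_echo,          0 },
    [11] = { h_unreach,       1 },
    [12] = { h_unreach,       1 },
    [13] = { h_timestamp,     0 },
    [17] = { h_address,       0 },
    [18] = { h_address_reply, 0 },
};

/* Invoke the handler registered for the given ICMP type. */
static const char *sketch_dispatch(int type)
{
    if (type < 0 || type > SKETCH_NR_ICMP_TYPES ||
        sketch_pointers[type].handler == NULL)
        return h_discard();
    return sketch_pointers[type].handler();
}
```

Indexing by type keeps the receive path free of long switch statements: adding support for a type means filling in one more slot of the array.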



Figure 25-11 shows the internals of icmp_rcv.


Note that neither ICMP_ADDRESS nor ICMP_ADDRESSREPLY is supported; the two handlers that are registered against them are just placeholders or apply some kind of logging.



Figure 25-11. icmp_rcv function



Note also that the icmp_unreach handler takes care of different ICMP message types, not just ICMP_DEST_UNREACH.


Figure 25-12(a) shows how some of skb's pointers are initialized when icmp_rcv is invoked, and Figure 25-12(b) shows how they are initialized when the handlers of Table 25-9 are called. This figure can be useful when analyzing the routines in Table 25-9, especially icmp_unreach.



Figure 25-12. (a) skb at the beginning of icmp_rcv; (b) skb as it is passed to the handler





25.8.6. Processing ICMP_ECHO and ICMP_ECHOREPLY Messages




ICMP_ECHO messages are processed according to the generic model described in the section "Replying to Ingress ICMP Messages":



static void icmp_echo(struct sk_buff *skb)
{
if (!sysctl_icmp_echo_ignore_all) {
struct icmp_bxm icmp_param;

icmp_param.data.icmph = *skb->h.icmph;
icmp_param.data.icmph.type = ICMP_ECHOREPLY;
icmp_param.skb = skb;
icmp_param.offset = 0;
icmp_param.data_len = skb->len;
icmp_param.head_len = sizeof(struct icmphdr);
icmp_reply(&icmp_param, skb);
}
}



ICMP_ECHOREPLY messages are not processed by the kernel, but by the applications that generated the associated ICMP_ECHO messages. See the section "Raw Sockets and Raw IP" in Chapter 24 for an example involving ping.




25.8.7. Processing the Common ICMP Messages









icmp_unreach is used as a handler for multiple ICMP types, as shown in Table 25-9. The function starts with some common sanity checks, continues with some processing based on the particular message type, and concludes with another common part.


The internals of the routine are shown in Figure 25-13.


The per-type processing is minimal:


  • It prints a warning message for ICMP_SR_FAILED ICMPs.

  • It updates the routing cache when it receives an ICMP of type ICMP_DEST_UNREACH and code ICMP_FRAG_NEEDED. The cache is updated with ip_rt_frag_needed, but only if PMTU discovery is enabled (i.e., if ipv4_config.no_pmtu_disc is zero). When PMTU discovery is not enabled, the kernel simply logs a warning.

  • It extracts the pointer field from the ICMP header when the message is of type ICMP_PARAMETERPROB. pointer is an offset relative to the beginning of the IP header in the ICMP payload. The field will be passed to the transport protocol.

  • ICMP_SOURCE_QUENCH does not require any specific treatment in icmp_unreach, so it is completely up to the transport protocols to handle it when notified via the err_handler routines. Currently, all transport protocols ignore this type of ICMP message.


For both ICMP_FRAG_NEEDED and ICMP_SR_FAILED, the logging is rate limited via LIMIT_NETDEBUG, which is a generic routine that rate limits networking-related messages to five per second.


The last part of icmp_unreach is again common to all ICMP types that use it as a handler, and consists of the following tasks:


  • When the sysctl_icmp_ignore_bogus_error_messages variable is set (by default, it is not), the ICMP message is discarded if it is received with a broadcast IP packet.

  • The function makes sure the ICMP payload includes the whole IP header of the IP packet that triggered the generation of the ICMP message, plus 64 bits from the transport payload of the same IP packet. This information is necessary to allow the transport protocol to identify a local socket (i.e., the application). When this condition is not met, the ICMP message is dropped. Note that the 64-bit requirement comes from RFC 792, but RFC 1812 changed the requirement (see the section "ICMP Payload").

    Figure 25-13. icmp_unreach function

  • The function notifies the transport protocol about this ICMP message via the err_handler function. The right transport protocol is identified using the protocol field of the IP header in the ICMP payload. See the section "Passing Error Notifications to the Transport Layer" and Figure 25-2.
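The last two tasks, checking that the ICMP payload is long enough and reading the protocol field of the embedded IP header, can be sketched on a raw byte buffer. The names sketch_payload_ok and sketch_inner_protocol are hypothetical. The sketch relies only on the standard IPv4 header layout: the header length, in 32-bit words, is in the low four bits of the first byte, and the protocol is in the tenth byte.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * payload points to the data that follows the ICMP header, that is,
 * to the IP header of the packet that triggered the error.
 * Returns true if the whole IP header plus 64 bits (8 bytes) of
 * transport payload are present.
 */
static bool sketch_payload_ok(const uint8_t *payload, size_t len)
{
    size_t ihl_bytes;

    if (len < 20)                       /* not even a minimal IP header */
        return false;
    ihl_bytes = (size_t)(payload[0] & 0x0F) * 4;
    if (ihl_bytes < 20)                 /* malformed header length */
        return false;
    return len >= ihl_bytes + 8;
}

/* Protocol of the embedded IP header (6 for TCP, 17 for UDP, ...).
 * Call it only after sketch_payload_ok has returned true. */
static int sketch_inner_protocol(const uint8_t *payload)
{
    return payload[9];
}
```

The first 8 bytes of a TCP or UDP header contain the source and destination ports, which is what allows the transport protocol to find the socket the error refers to.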




25.8.8. Processing ICMP_REDIRECT Messages








icmp_redirect, the function used to process incoming ICMP_REDIRECT messages, is a wrapper around ip_rt_redirect with some additional sanity checks. The logic used by the latter function is described in the section "Processing Ingress ICMP_REDIRECT Messages" in Chapter 31. ip_rt_redirect adds an entry to the routing cache with rt_intern_hash, which is described in Chapter 33. The route is initialized with the RTCF_REDIRECTED flag toggled on, to be distinguished from the other routes. For example, we will see in the section "Examples of eligible cache victims" in Chapter 30 how the routing code uses this information when it is forced to delete entries from the routing cache.


The system administrator can also influence when ICMP redirects are generated. Through the /proc filesystem, it is possible to specify for each interface whether to send and accept ICMP redirects (see the section "The /proc/sys/net/ipv4/conf Directory" in Chapter 36). Using the firewall capabilities, as well, the administrator can specify from whom to accept particular types of ICMP packets and therefore whose ICMP_REDIRECT messages to trust.




25.8.9. Processing ICMP_TIMESTAMP and ICMP_TIMESTAMPREPLY Messages


Ingress ICMP_TIMESTAMP messages are handled by replying with an ICMP_TIMESTAMPREPLY message, using the scheme discussed in the section "Replying to Ingress ICMP Messages." The second and third timestamps are not initialized according to the rules we saw in the section "ICMP_TIMESTAMP and ICMP_TIMESTAMPREPLY": they are initialized to the same timestamp with do_gettimeofday.


Note that head_len is initialized to include not only the default ICMP header length, but also the three 32-bit timestamps.
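A minimal sketch of how the reply's timestamps and head_len could be computed follows. It assumes the RFC 792 convention of expressing timestamps in milliseconds since midnight UT and an 8-byte ICMP header; the sketch_ names are hypothetical, the time is passed in as seconds and microseconds rather than read with do_gettimeofday, and the conversion to network byte order is omitted.

```c
#include <stdint.h>

#define SKETCH_ICMP_HDR_LEN 8   /* type, code, checksum, id, sequence */

/* Milliseconds elapsed since midnight UT, per RFC 792. */
static uint32_t sketch_ms_since_midnight(long sec, long usec)
{
    return (uint32_t)((sec % 86400) * 1000 + usec / 1000);
}

/* The three timestamps carried after the ICMP header. */
struct sketch_ts_reply {
    uint32_t times[3];  /* originate, receive, transmit */
};

/* Receive and transmit are filled from the same clock reading. */
static void sketch_fill_timestamps(struct sketch_ts_reply *r,
                                   uint32_t originate,
                                   long sec, long usec)
{
    r->times[0] = originate;    /* copied from the request */
    r->times[1] = sketch_ms_since_midnight(sec, usec);
    r->times[2] = r->times[1];
}

/* Header length covering the ICMP header and the three timestamps. */
static unsigned int sketch_head_len(void)
{
    return SKETCH_ICMP_HDR_LEN + 3 * (unsigned int)sizeof(uint32_t);
}
```

With these assumptions the header portion of the reply is 20 bytes long, and a request received one and a half seconds after midnight would be answered with both timestamps set to 1500.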




25.8.10. Processing ICMP_ADDRESS and ICMP_ADDRESSREPLY Messages


Because the Linux kernel does not generate ICMP_ADDRESS messages, ingress ICMP_ADDRESSREPLY messages cannot be answers to queries generated locally (not in kernel space, at least). However, when forwarding and logging of Martian addresses[*] are enabled on the ingress device, Linux listens to ICMP_ADDRESSREPLY messages with icmp_address_reply. The latter function checks whether the mask advertised with the message is correct with regard to the IP addresses configured on the receiving interface: if the receiving interface does not have any IP address configured on the same subnet of the source IP address used by the ICMP message sender (which also implies the exact same netmask), the kernel logs a warning.

[*] See the definition of log_martians in the section "File descriptions" in Chapter 36.


The sanity check on the received reply is not done when the routing cache has the RTCF_DIRECTSRC flag set. This flag is set only when the destination address is reachable by the local host via a next hop that has local scope (i.e., that exists only internally to the Linux box).












