Tuesday, October 20, 2009

Section 5.4.  Interaction Between Devices and Kernel










5.4. Interaction Between Devices and Kernel















Nearly all devices (including NICs) interact with the kernel in one of two ways:



Polling


Driven on the kernel side. The kernel checks the device status at regular intervals to see if it has anything to say.


Interrupt


Driven on the device side. The device sends a hardware signal (by generating an interrupt) to the kernel when it needs the kernel's attention.


In Chapter 9, you can find a detailed discussion of NIC driver design alternatives as well as software interrupts. You will also see how Linux can use a combination of polling and interrupts to increase performance. In this chapter, we will look only at the interrupt-based case.


I won't go into detail on how interrupts are reported by the hardware, the difference between hardware exceptions and device interrupts, how the driver and bus kernel infrastructures are designed, etc.; you can refer to Linux Device Drivers and Understanding the Linux Kernel for those topics. But I'll give a brief overview on interrupts to help you understand how device drivers initialize and register the devices they are responsible for, with special attention to the networking aspect.



5.4.1. Hardware Interrupts


You do not need to know the low-level background about how hardware interrupts

are handled. However, there are details worth mentioning because they can make it easier to understand how NIC device drivers are written, and therefore how they interact with the upper networking layers.


Every interrupt runs a function called an interrupt handler, which must be tailored to the device and therefore is installed by the device driver. Typically, when a device driver registers an NIC, it requests and assigns an IRQ. It then registers and (if the driver is unloaded) unregisters a handler for a given IRQ with the following two architecture-dependent functions. They are defined in kernel/irq/manage.c and are overridden by architecture-specific functions in arch/XXX/kernel/irq.c, where XXX is the architecture-specific directory:



int request_irq(unsigned int irq, void (*handler)(int, void*, struct pt_regs*), unsigned long irqflags, const char * devname, void *dev_id)


This function registers a handler, first making sure that the requested interrupt is a valid one, and that it is not already allocated to another device unless both devices understand shared IRQs (see the later section "Interrupt sharing").


void free_irq(unsigned_int irq, void *dev_id)


Given the device identified by dev_id, this function removes the handler and disables the IRQ line if no more devices are registered for that IRQ. Note that to identify the handler, the kernel needs both the IRQ number and the device identifier. This is especially important with shared IRQs, as explained in the later section "Interrupt sharing."


When the kernel receives an interrupt notification, it uses the IRQ number to find out the driver's handler and then executes this handler. To find handlers, the kernel stores the associations between IRQ numbers and function handlers in a global table. The association can be either one-to-one or one-to-many, because the Linux kernel allows multiple devices to use the same IRQ, a feature described in the later section "Interrupt sharing."


In the following sections, you will see common examples of the information exchanged between devices and drivers by means of interrupts, and how an IRQ can be shared by multiple devices under some conditions.



5.4.1.1. Interrupt types










With an interrupt, an NIC can tell its driver several different things. Among them are:



Reception of a frame


This is the most common and standard situation.


Transmission failure


This kind of notification is generated on Ethernet devices only after a feature called exponential binary backoff has failed (this feature is implemented at the hardware level by the NIC). Note that the driver will not relay this notification to higher network layers; they will come to know about the failure by other means (timer timeouts, negative ACKs, etc.).


DMA transfer has completed successfully


Given a frame to send, the buffer that holds it is released by the driver once the frame has been uploaded into the NIC's memory for transmission on the medium. With synchronous transmissions (no DMA), the driver knows right away when the frame has been uploaded on the NIC. But with DMA, which uses asynchronous transmissions, the device driver needs to wait for an explicit interrupt from the NIC. You can find an example of each case at points where dev_kfree_skb[*] is called within the driver code drivers/net/3c59x.c (DMA) and drivers/net/3c509.c (non-DMA).

[*] Chapter 11 describes this function in detail.


Device has enough memory to handle a new transmission


It is common for an NIC device driver to disable transmissions by stopping the egress queue when that queue does not have sufficient free space to hold a frame of maximum size (e.g., 1,536 bytes for an Ethernet NIC). The queue is then re-enabled when memory becomes available. The rest of this section goes into this case in more detail.


The final case in the previous list covers a sophisticated way of throttling transmissions in a manner that can improve efficiency if done properly. In this system, a device driver disables transmissions for lack of queuing space, asks the NIC to issue an interrupt when the available memory is bigger than a given amount (typically the device's Maximum Transmission Unit, or MTU), and then re-enables transmissions when the interrupt comes.


A device driver can also disable the egress queue before a transmission (to prevent the kernel from generating another transmission request on the device), and re-enable it only if there is enough free memory on the NIC; if not, the device asks for an interrupt that allows it to resume transmission at a later time. Here is an example of this logic, taken from the el3_start_xmit routine, which the drivers/net/3c509.c driver installs as its hard_start_xmit[] function in its net_device structure:

[] The hard_start_xmit virtual function is described in Chapter 11.



static int
el3_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
... ... ...
netif_stop_queue (dev);
... ... ...
if (inw(ioaddr + TX_FREE) > 1536)
netif_start_queue(dev);
else
outw(SetTxThreshold + 1536, ioaddr + EL3_CMD);
... ... ...
}



The driver stops the device queue with netif_stop_queue, thus inhibiting the kernel from submitting further transmission requests. The driver then checks whether the device's memory has enough free space for a packet of 1,536 bytes. If so, the driver starts the queue to allow the kernel once again to submit transmission requests; otherwise, it instructs the device (by writing to a configuration register with an outw call) to generate an interrupt when that condition will be met. An interrupt handler will then re-enable the device queue with netif_start_queue so that the kernel can restart transmissions.


The netif_xxx_queue routines are described in the section "Enabling and Disabling Transmissions" in Chapter 11.




5.4.1.2. Interrupt sharing




IRQ lines are a limited resource. A simple way to increase the number of devices a system can host is to allow multiple devices to share a common IRQ. Normally, each driver registers its own handler to the kernel for that IRQ. Instead of having the kernel receive the interrupt notification, find the right device, and invoke its handler, the kernel simply invokes all the handlers of those devices that registered for the same shared IRQ. It is up to the handlers to filter spurious invocations, such as by reading a registry on their devices.


For a group of devices to share an IRQ line, all of them must have device drivers capable of handling shared IRQs. In other words, each time a device registers for an IRQ line, it needs to explicitly say whether it supports interrupt sharing. For example, the first device that registers for one IRQ, saying something like "assign me IRQ n and use this routine fn as the handler," must also specify whether it is willing to share the IRQ with other devices. When another device driver tries to register the same IRQ number, it is refused if either it, or the driver to which the IRQ is currently assigned, is incapable of sharing IRQs.




5.4.1.3. Organization of IRQs to handler mappings






The mapping of IRQs to handlers is stored in a vector of lists, one list of handlers for each IRQ (see Figure 5-2). A list includes more than one element only when multiple devices share the same IRQ. The size of the vector (i.e., the number of possible IRQ numbers) is architecture dependent and can vary from 15 (on an x86) to more than 200. With the introduction of interrupt sharing, even more devices can be supported on a system at once.


The section "Hardware Interrupts" already introduced the two functions provided by the kernel to register and unregister a handler, respectively. Let's now see the data structure used to store the mappings.


Mappings are defined with irqaction data structures. The request_irq function introduced in the earlier section "Hardware Interrupts" is a wrapper around setup_irq, which takes an irqaction structure as input and inserts it into the global irq_desc vector. irq_desc is defined in kernel/irq/handler.c and can be overridden in the per-architecture files arch/XXX/kernel/irq.c. setup_irq is defined in kernel/irq/manage.c and can be overridden in the per-architecture files arch/XXX/kernel/irq.c.


The kernel function that handles interrupts and passes them to drivers is architecture dependent. It is called handle_IRQ_event on most architectures.


Figure 5-2 shows how irqaction instances are stored: there is an instance of irq_desc for each possible IRQ and an instance of irqaction for each successfully registered IRQ handler. The vector of irq_desc instances is called irq_desc as well, and its size is given by the architecture-dependent symbol NR_IRQS.


Note that when you have more than one irqaction instance for a given IRQ number (that is, for a given element of the irq_desc vector), interrupt sharing is required (each structure must have the SA_SHIRQ flag set).



Figure 5-2. Organization of IRQ handlers



Let's see now what information is stored about IRQ handlers in the fields of an irqaction data structure:



void (*handler)(int irq, void *dev_id, struct pt_regs *regs)


Function provided by the device driver to handle notifications of interrupts: whenever the kernel receives an interrupt on line irq, it invokes handler. Here are the function's input parameters:



int irq


IRQ number that generated the notification. Most of the time it is not used by the NICs' device drivers to accomplish their job; the device ID is sufficient.


void *dev_id


Device identifier. The same driver can be responsible for different devices at the same time, so it needs the device ID to process the notification correctly.


struct pt_regs *regs


Structure used to save the content of the processor's registers at the moment the interrupt interrupted the current process. It is normally not used by the interrupt handler.


unsigned long flags


Set of flags. The possible values SA_XXX are defined in include/asm-XXX/signal.h. Here are the main ones from the x86 architecture file:



SA_SHIRQ


When set, the device driver can handle shared IRQs.


SA_SAMPLE_RANDOM


When set, the device is making itself available as a source of random events. This can be useful to help the kernel generate random numbers for internal use, and is called contributing to system entropy. This is further described in the later section "Initializing the Device Handling Layer: net_dev_init."


SA_INTERRUPT


When set, the handler runs with interrupts disabled on the local processor. This should be specified only for handlers that can get done very quickly. See one of the handle_IRQ_event instances for an example (for instance, /kernel/irq/handle.c).


There are other values, but they are either obsolete or used only by particular architectures.


void *dev_id


Pointer to the net_device data structure associated with the device. The reason it is declared void * is that NICs are not the only devices to use IRQs. Because various device types use different data structures to identify and represent device instances, a generic type declaration is used.


struct irqaction *next


All the devices sharing the same IRQ number are linked together in a list with this pointer.


const char *name


Device name. You can read it by dumping the contents of /proc/interrupts.














No comments: