Monday, December 21, 2009

14.1. Increasing Code Efficiency


By the time your program is working, you might already know, or have a pretty good idea, which functions and modules are the most critical for overall code efficiency. ISRs, high-priority tasks, calculations with real-time deadlines, and functions that are either compute-intensive or frequently called are all likely candidates.


A tool called a profiler, included with some software development suites, can be used to narrow your focus to those routines in which the program spends most (or too much) of its time. A profiler collects and reports execution statistics for a program, including the number of calls to, and the total time spent in, each routine.


Once you've identified the routines that require greater code efficiency, you can use the following techniques to reduce their execution time. Note that these techniques are highly compiler-dependent; in most cases, no general rule applies in all situations. The best way to determine whether a technique will actually provide an improvement is to examine the compiler's assembly output.




Inline functions





In C99, the keyword inline can be added to any function declaration. This keyword asks the compiler to replace each call to the function with a copy of the function body. Inlining eliminates the runtime overhead associated with the function call and is most effective when the function is called frequently but contains only a few lines of code.


Inline functions provide a perfect example of how execution speed and code size are sometimes inversely linked. Each inlined copy increases the size of your program in direct proportion to the number of call sites, and, obviously, the larger the function, the more significant the size increase will be. In exchange, you eliminate the overhead of setting up a stack frame and passing parameters at each call. The resulting program runs faster but requires more code memory.
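
As a minimal sketch (the function name and use are hypothetical), a small, frequently called routine like this is a good inlining candidate; static inline is the most portable C99 spelling:

#include <stdint.h>

/* Small and frequently called: a good inlining candidate. The compiler
 * is asked to substitute this body at each call site instead of
 * generating a call. */
static inline uint16_t clamp_to_byte(uint16_t value)
{
    return (value > 255) ? 255 : value;
}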





Table lookups



A switch statement is one common programming technique that should be used with care. Each test and jump that makes up the machine language implementation uses up valuable processor time simply deciding what work should be done next. To speed things up, try to put the individual cases in order by their relative frequency of occurrence. In other words, put the most likely cases first and the least likely cases last. This will reduce the average execution time, though it will not improve at all upon the worst-case time.


If there is a lot of work to be done within each case, it might be more efficient to replace the entire switch statement with a table of pointers to functions. For example, the following block of code is a candidate for this improvement:


enum NodeType {NODE_A, NODE_B, NODE_C};

switch (getNodeType())
{
    case NODE_A:
        /* ... */

    case NODE_B:
        /* ... */

    case NODE_C:
        /* ... */
}



To speed things up, replace this switch statement with the following alternative. The first part is the setup: the creation of an array of function pointers. The second part is a one-line replacement for the switch statement that executes more efficiently.


int processNodeA(void);
int processNodeB(void);
int processNodeC(void);

/* Establish a table of pointers to functions, indexed by node type. */
int (*nodeFunctions[])(void) = {processNodeA, processNodeB, processNodeC};

/* ... */

/* The entire switch statement is replaced by the next line. */
status = nodeFunctions[getNodeType()]();






Hand-coded assembly




Some software modules are best written in assembly language. This gives the programmer an opportunity to make them as efficient as possible. Though most C compilers produce much better machine code than the average programmer, a skilled and experienced assembly programmer might do better work than the compiler for a given function.


For example, on one of our past projects, a digital filtering algorithm was implemented in C and targeted to a TI TMS320C30 DSP. The compiler was unable to take advantage of a special instruction that performed exactly the mathematical operations needed. By manually replacing one for loop of the C program with inline assembly instructions that did the same thing, overall computation time decreased by more than a factor of 10.
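
The DSP code itself isn't shown here, but as a general illustration of the mechanism (assuming a GCC-compatible compiler targeting x86; the function is hypothetical), inline assembly can be mixed into C like this:

/* A minimal sketch of GCC extended inline assembly (x86 assumed).
 * Adds b into a with a single instruction; "+r" marks a as both an
 * input and an output, "r" marks b as an input. */
static inline int add_asm(int a, int b)
{
    __asm__ ("addl %1, %0" : "+r" (a) : "r" (b));
    return a;
}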





Register variables




The keyword register can be used when declaring local variables. This asks the compiler to place the variable into a general-purpose register rather than on the stack. Used judiciously, this technique provides hints to the compiler about the most frequently accessed variables and will somewhat enhance the performance of the function. The more frequently the function is called, the more likely it is that such a change will improve the code's performance. But some compilers ignore the register keyword.
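
A minimal sketch (the function and variable names are hypothetical): locals accessed on every loop iteration are typical candidates:

int buffer_checksum(const int buffer[], int length)
{
    /* Hint that the most frequently accessed locals belong in
     * registers; the compiler is free to ignore the request. */
    register int idx;
    register int sum = 0;

    for (idx = 0; idx < length; idx++)
    {
        sum += buffer[idx];
    }

    return sum;
}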





Global variables





It is sometimes more efficient to use a global variable than to pass a parameter to a function. This eliminates the need to push the parameter onto the stack before the function call and pop it back off once the function is completed. In fact, the most efficient implementation of any subroutine would have no parameters at all. However, the decision to use a global variable can also have some negative effects on the program. The software engineering community generally discourages the use of global variables in an effort to promote the goals of modularity and reentrancy, which are also important considerations.
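
A minimal sketch of the trade-off (all names are hypothetical); the second version avoids pushing and popping a parameter at each call, at the cost of modularity and reentrancy:

/* Parameter version: the argument is passed (often via the stack)
 * on every call. */
void updateMotor(int speed);

/* Global version: the caller sets gSpeed before the call, so no
 * parameter is passed at all. */
int  gSpeed;
void updateMotorGlobal(void);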





Polling




ISRs are often used to improve a program's responsiveness. However, there are some rare cases in which the overhead associated with the interrupts actually causes inefficiency. These are cases in which the average time between interrupts is of the same order of magnitude as the interrupt latency. In such cases, it might be better to use polling to communicate with the hardware device. But this too can lead to a less modular software design.
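
For illustration only (the addresses and bit mask are hypothetical; such details are always device-specific), a busy-wait poll of a status register might look like this:

#include <stdint.h>

#define UART_STATUS_REG (*(volatile uint8_t *) 0x40001000) /* hypothetical */
#define UART_DATA_REG   (*(volatile uint8_t *) 0x40001004) /* hypothetical */
#define RX_READY        0x01                               /* hypothetical */

/* Spin until the device reports that a byte has arrived, then read it.
 * No interrupt latency is incurred, but the processor does no other
 * work while it waits. */
uint8_t uartReadByte(void)
{
    while ((UART_STATUS_REG & RX_READY) == 0)
        ;   /* poll */

    return UART_DATA_REG;
}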





Fixed-point arithmetic




Unless your target platform features a floating-point processor, you'll pay a very large penalty for manipulating float data in your program. The compiler-supplied floating-point library contains a set of software subroutines that emulate the floating-point instructions. Many of these functions take a long time to execute relative to their integer counterparts and also might not be reentrant.


If you are using floating-point for only a few calculations, it might be better to implement the calculations themselves using fixed-point arithmetic. For example, two fractional bits representing a value of 0.00, 0.25, 0.50, or 0.75 are easily stored in any integer by merely multiplying the real value by 4 (i.e., shifting it left by 2 bits). Addition and subtraction can be accomplished via the ordinary integer instructions, as long as both values have the same implied binary point. Multiplication and division can be accomplished similarly, if the other operand is a whole integer.
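
A minimal sketch of this scheme with two fractional bits (the macro and variable names are hypothetical); every value is simply stored scaled by 4:

#include <stdint.h>

#define FRAC_BITS 2   /* two fractional bits: resolution of 0.25 */

typedef int16_t fixed_t;

/* 0.75 becomes 3, 1.25 becomes 5, and so on. */
#define TO_FIXED(x) ((fixed_t) ((x) * (1 << FRAC_BITS)))

fixed_t a = TO_FIXED(1.25);   /* stored as 5 */
fixed_t b = TO_FIXED(0.50);   /* stored as 2 */

fixed_t sum    = a + b;       /* 7, which represents 1.75 */
fixed_t triple = a * 3;       /* 15, which represents 3.75
                               * (multiply by a whole integer) */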


It is theoretically possible to perform any floating-point calculation with fixed-point arithmetic. (After all, that's how the floating-point software library does it, right?) Your biggest advantage is that you probably don't need to implement the entire IEEE 754 standard just to perform one or two calculations. If you do need that kind of complete functionality, stick with the compiler's floating-point library and look for other ways to speed up your program.





Variable size



It is typically best to use the processor's native register width for variables whenever possible (whether it is 8, 16, or 32 bits). This allows the compiler to produce code that takes advantage of the fast registers built into the processor's machine opcodes. Obviously, you need to ensure that the variable size accommodates the number range that the variable represents. For example, if you need a count that goes from 0 to 512, you can't use an 8-bit variable.


A variable size tailored to the processor can also speed up processing by limiting the number of external memory accesses. If a processor has a 16-bit data bus and it needs to access a 32-bit variable in external RAM, two data fetches must be performed for the processor to get the variable.


C99 defines the integer types int_fastN_t and uint_fastN_t (where N is the bit width) in stdint.h. These types are meant to be used when you need at least N bits (e.g., N = 16) to store your data but don't care if the field is wider, allowing the compiler to choose the fastest width to access. These "fast" integer types are thus no good for use with peripheral registers, which always have a fixed width that can be neither larger nor smaller than N.
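
A minimal sketch: the counter below needs at least 16 bits, but the compiler may widen it to the native register size if that is faster, whereas the peripheral register must use an exact-width type (the address is hypothetical):

#include <stdint.h>

/* At least 16 bits, but possibly wider (e.g., 32 bits on a 32-bit
 * processor) if that makes access faster. */
uint_fast16_t packetCount;

/* A memory-mapped peripheral register must be exactly 16 bits wide,
 * so a fixed-width type is required here. */
#define TIMER_COUNT_REG (*(volatile uint16_t *) 0x40002000) /* hypothetical */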





Loop unrolling



In some cases, repetitive loop code can be optimized by performing loop unrolling. In loop unrolling, the loop overhead at the start and end of a loop is eliminated. Here's an example of a for loop:


for (idx = 0; idx < 5; idx++)
{
    value[idx] = incomingData[idx];
}



Here's the unrolled version without the loop overhead:


value[0] = incomingData[0];
value[1] = incomingData[1];
value[2] = incomingData[2];
value[3] = incomingData[3];
value[4] = incomingData[4];



Some compilers offer loop unrolling as an optimization; in other cases, it might be better for the developer to code it. It is helpful to check the assembly output from the compiler to see whether efficiency has actually been improved.


The amount of unrolling that you (or the compiler) choose to do must balance the gain in speed against the increased size of the code. Loop unrolling increases code size: another situation where you must trade code size for speed. Also, loop unrolling can be used only when the number of iterations through the loop is fixed. One example of an optimized implementation of loop unrolling is the coding technique known as Duff's device (http://en.wikipedia.org/wiki/Duff's_device).
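
For reference, here is a sketch of Duff's device adapted to the array copy above (it assumes count is greater than zero); the switch jumps into the middle of the unrolled loop to dispose of the leftover iterations:

void copyUnrolled(int value[], const int incomingData[], int count)
{
    int n = (count + 7) / 8;   /* passes through the 8-way unrolled loop */
    int i = 0;

    /* The switch enters the loop body partway through to handle the
     * count % 8 leftover copies; the do/while then runs in batches
     * of 8. Assumes count > 0. */
    switch (count % 8)
    {
        case 0: do { value[i] = incomingData[i]; i++;
        case 7:      value[i] = incomingData[i]; i++;
        case 6:      value[i] = incomingData[i]; i++;
        case 5:      value[i] = incomingData[i]; i++;
        case 4:      value[i] = incomingData[i]; i++;
        case 3:      value[i] = incomingData[i]; i++;
        case 2:      value[i] = incomingData[i]; i++;
        case 1:      value[i] = incomingData[i]; i++;
                } while (--n > 0);
    }
}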














