Monday, January 4, 2010

10.5 Roll Your Own Tool





Sometimes none of the tools at your disposal will handle a mundane and obviously automatable code-reading task. Do not be afraid to create your own code-reading tools. The Unix shell with its tools and modern interpreted programming languages such as Perl, Python, Ruby, and Visual Basic are particularly suited for creating customized code-reading tools. Each environment has its own particular strengths.


  • The Unix shell provides a plethora of tools such as sed, awk, sort, uniq, and diff that can be incrementally combined to obtain the required functionality.

  • Perl, Python, and Ruby have strong support for strings, regular expressions, and associative arrays, simplifying many code-parsing tasks that can be achieved at a lexical level.


  • Through its object model, Visual Basic can access code that is not normally available to text-based tools. Examples include Microsoft Access validation functions and recursive walks through deep object hierarchies.
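
As a minimal sketch of the second point, the following Python fragment tallies identifier-like words using a regular expression and an associative array (a Counter). The function name identifier_counts and the pattern are assumptions made for illustration; the approach is purely lexical, so keywords and words inside comments or strings are counted as well.

```python
import re
from collections import Counter

# Hypothetical pattern: anything that looks like a C identifier or keyword.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def identifier_counts(source: str) -> Counter:
    """Return a tally of identifier-like tokens found in the source text."""
    return Counter(IDENTIFIER.findall(source))

if __name__ == "__main__":
    sample = "for (i = 0; i < n; i++)\n\ttotal += a[i];\n"
    # List the most frequent names first.
    for name, count in identifier_counts(sample).most_common():
        print(name, count)
```

A tally like this is often enough to spot the dominant names in an unfamiliar file before reading it in detail.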


Consider one particular example: locating code lines whose indentation does not match that of the surrounding code. Such code is misleading at best and often indicates a serious program bug. When we wanted to search the existing source code base for examples of such code (see Section 2.3), none of the tools at our disposal could perform the task, so we proceeded to roll our own. Figure 10.6 contains the implementation we used. As you can see, the tool depends on lexical heuristics and makes a number of assumptions about the code. It detects suspect code by locating two consecutive indented lines that follow an if, while, or for statement that does not end with an opening brace. The tool can get confused by reserved words and tokens inside comments or strings, statements that span more than one line, and the use of spaces instead of tabs for indentation. However, its 23 lines of code were written in less than an hour by gradually improving on the original idea until we could live with the signal-to-noise ratio of its output. (One of the previous versions did not handle the very common occurrence of the first line introducing a new braced block.) The way we implemented our indentation verification tool is applicable to the development of similar tools.


Figure 10.6. Locating code blocks with incorrect indentation.


#!/usr/bin/perl
use File::Find;                                         # (a)
find(\&process, $ARGV[0]);

sub process
{
    return unless -f;                                   # (b)
    return unless (/\.c$/i);
    open(IN, $fname = $_) || die "Unable to open $_:$!\n";
    while (<IN>) {                                      # (c)
        chop;
        if (/^(\t+)(if|for|while)/ && !/\{/) {          # (d)
            $tab = $1;
            $n = <IN>;                                  # (e)
            $n1 = <IN>;
            if ($n =~ m/^$tab\t.*;/ &&                  # (f)
                $n1 =~ m/^$tab\t.*;/ &&
                $n !~ m/\t(if|for|while|switch)/) {
                print "$File::Find::name\n$_\n$n$n1\n"; # (g)
            }
        }
    }
}


(a)
Process all files in the tree specified


(b)
Process only C files


(c)
For every source code line


(d)
Is it an if/for/while without a brace?


(e)
Get the next two lines


(f)
Are they ;-terminated plain statements starting with an additional tab?


(g)
Then we found a problem. Print the file location and the lines.
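
The same heuristic can be sketched in Python as a function over a list of lines, which makes it easy to try on small samples. This is an illustrative rendition, not the implementation used for the book: the name suspect_blocks and the tuple it returns are assumptions, while the regular expressions are carried over from the Perl version. Like that version, it consumes the two lines it reads ahead and assumes tab indentation.

```python
import re

# Same lexical patterns as the Perl tool.
HEAD = re.compile(r"^(\t+)(if|for|while)")
NESTED = re.compile(r"\t(if|for|while|switch)")

def suspect_blocks(lines):
    """Return (line_number, head, first, second) for each suspect triple."""
    found = []
    i = 0
    while i < len(lines):
        head = lines[i]
        m = HEAD.match(head)
        if not m or "{" in head:
            i += 1
            continue
        lineno = i + 1
        following = lines[i + 1:i + 3]
        i += 3  # the two lines read ahead are consumed, as in the Perl loop
        if len(following) < 2:
            break  # fewer than two lines left in the file
        first, second = following
        # A plain ;-terminated statement indented one tab deeper than the head.
        deeper = re.compile("^" + re.escape(m.group(1)) + r"\t.*;")
        if deeper.match(first) and deeper.match(second) and not NESTED.search(first):
            found.append((lineno, head, first, second))
    return found

if __name__ == "__main__":
    code = "\tif (x)\n\t\ta = 1;\n\t\tb = 2;\n\tc = 3;\n".splitlines()
    for lineno, head, first, second in suspect_blocks(code):
        print(lineno, head, first, second, sep="\n")
```

Returning the matches instead of printing them keeps the heuristic separate from the file traversal, so the same function can be driven by any directory walker.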


Cunningham [Cun01] describes how he wrote two 40-line CGI Perl scripts to summarize and amplify large Java source code collections. Both scripts create HTML output so that a Web browser can be used to navigate over megabyte-large source collections. You can see an example of the summarization tool's output[16] in Figure 10.7. The summarization script condenses each Java method into a single line consisting of the {, }, ;, and " characters that occur in the method's body. This method signature can reveal a surprising amount of information; the size and structure of the method are readily apparent from the line's length and the placement of braces. Hyperlinks allow you to navigate to each particular method to examine it in detail, but in practice the tool's strength lies in its ability to succinctly visualize huge code collections.

[16] jt4/catalina/src/share/org/apache/catalina


Figure 10.7. A signature survey of Java code.
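
The condensing step of such a survey is small enough to sketch in a few lines of Python. This is only an illustration of the idea, not Cunningham's script: the name signature is an assumption, and splitting a Java file into methods and generating the HTML are left out.

```python
# The punctuation kept by the survey: braces, semicolons, and double quotes.
SIGNATURE_CHARS = set('{};"')

def signature(text: str) -> str:
    """Reduce a piece of source text to its signature characters."""
    return "".join(ch for ch in text if ch in SIGNATURE_CHARS)

if __name__ == "__main__":
    method = '''
    void greet(String name) {
        if (name == null) {
            name = "world";
        }
        System.out.println("Hello, " + name);
    }
    '''
    print(signature(method))  # prints {{"";}"";}
```

Even on this tiny method the result shows one nested block and two statements that involve string literals.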


The rules we follow when building tools can be summarized as follows.


  • Exploit the capabilities of modern rapid-prototyping languages.

  • Start with a simple design, gradually improving it as needed.

  • Use heuristics that exploit the lexical structure of the code.

  • Tolerate some output noise or silence (extraneous or missing output), but remember to take this noise or silence into account when using the tool.

  • Use other tools to preprocess your input or postprocess your output.
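
As one possible illustration of the last rule, a small preprocessing pass can blank out comments and literals in C-like source before a lexical heuristic runs, which removes reserved words inside comments or strings as a source of noise. The sketch below is an assumption-laden example rather than a complete lexer: the names TOKEN, blank_out, and strip_comments_and_strings are hypothetical, and constructs such as trigraphs or line-continuation backslashes inside // comments are ignored.

```python
import re

# Comments and literals are matched in one pattern so that a quote inside
# a comment, or a comment opener inside a string, is not misread.
TOKEN = re.compile(
    r'/\*.*?\*/'               # block comment
    r'|//[^\n]*'               # line comment
    r'|"(?:\\.|[^"\\\n])*"'    # string literal
    r"|'(?:\\.|[^'\\\n])*'",   # character literal
    re.DOTALL,
)

def blank_out(match):
    text = match.group(0)
    if text[0] in "\"'":
        return text[0] * 2  # leave an empty literal in place
    # Replace comment text with spaces but keep newlines, so that line
    # numbers in the output still match the original file.
    return "".join(ch if ch == "\n" else " " for ch in text)

def strip_comments_and_strings(source: str) -> str:
    """Return the source with comments blanked and literals emptied."""
    return TOKEN.sub(blank_out, source)

if __name__ == "__main__":
    src = 'x = 1; /* if (a) */ s = "while {"; // for\ny = 2;\n'
    print(strip_comments_and_strings(src))
```

Because the line structure is preserved, the output of this pass can be fed to a line-oriented tool and its reports still point at the right lines of the original file.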


Exercise 10.14
Identify code-reading tasks that can benefit from a custom-built tool. Briefly describe how each such tool could be built. Implement one of the tools you described.


Exercise 10.15
A pretty-printer typesets source code in a way that makes it easy to read (see Section 10.7). Typically, different fonts or font styles are used for comments, variables, and reserved words. Implement a simple pretty-printer for the language of your choice based on lexical heuristics. To typeset the output you can use PostScript or drive another program: under Unix you can output LaTeX or troff commands; under Windows you can output RTF (rich text format) code or drive Microsoft Word using OLE automation.


Exercise 10.16
Write a tool that indexes source code files to simplify their browsing. The index entries can be function definitions and declarations. You can present the index as an alphabetically sorted list or an HTML page with hypertext links.




