Monday, January 4, 2010

1.2 How to Read This Book





In this book we demonstrate important code-reading techniques and outline common programming concepts in the form they appear in practice, striving to improve your code-reading ability. Although you will find in the following chapters discussions of many important computer science and computing practice concepts such as data and control structures, coding standards, and software architectures, their treatment is by necessity cursory since the purpose of the book is to get you to examine the use of these ideas in the context of production code, rather than to introduce the ideas themselves. We have arranged the material in an order that will let you progress from the basic to the more sophisticated elements. However, the book is a reader, not a detective novel, so feel free to read it in the sequence that suits your interests.


1.2.1 Typographical Conventions


Figure 1.1. Example of an annotated listing.


main(argc, argv) <-- a
[...] <-- b
{
    if (argc > 1)
        for(;;)
            (void)puts(argv[1]);
    else for (;;)
        (void)puts("y");
}


(a) Simple annotation

(b) Omitted code

Annotation referenced from the text


All code listings and text references to program elements (for example, function names, keywords, operators) are set in typewriter font. Some of our examples refer to command sequences executed in a Unix or Windows shell. We display the shell command prompt $ to denote Unix shell commands and the DOS command prompt C:\> to denote the Windows console prompt. Unix shell commands can span more than one line; we use > as the continuation line symbol.



$ grep -l malloc *.c |
> wc -l
8
C:\>grep -l malloc *.c | wc -l
8

The prompts and the continuation line symbol are displayed only to distinguish your input from the system output; you type only the commands after the prompt.


In some places we discuss unsafe coding practices or common pitfalls. These are identified on the margin with a danger symbol. You should be alert for such code when conducting a code walkthrough or just reading code to look for a bug. Text marked on the margin with an i identifies common coding idioms. When we read text we tend to recognize whole words rather than letters; similarly, recognizing these idioms in code will allow you to read code faster and more effectively and to understand programs at a higher level.


The code examples we use in this book come from real-world programs. We identify the programs we use (such as the one appearing in Figure 1.1) in a footnote[3] giving the precise location of the program in the directory tree of the book's companion source code and the line numbers covered by the specific fragment. When a figure includes parts of different source code files (as is the case in Figure 5.17, page 169) the footnote will indicate the directory where these files reside.[4]

[3] netbsdsrc/usr.bin/yes/yes.c:53–64

[4] netbsdsrc/distrib/utils/more
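If you want to jump from a footnote to the code it names, the footnote format is simple enough to split mechanically. The following is a minimal sketch, not part of the book's tools: the structure code_ref and the function parse_code_ref are hypothetical names, and the sketch assumes the line range is typed with a plain ASCII hyphen (for example, netbsdsrc/usr.bin/yes/yes.c:53-64).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for a footnote reference of the form path:first-last. */
struct code_ref {
    char path[256];   /* file location inside the source tree */
    int first;        /* first line of the fragment */
    int last;         /* last line of the fragment */
};

/*
 * Split a reference such as "dir/file.c:53-64" into its parts.
 * Returns 1 on success and 0 if the string has no usable line range
 * (for instance, a footnote that names only a directory).
 */
int
parse_code_ref(const char *s, struct code_ref *ref)
{
    const char *colon = strrchr(s, ':');
    size_t len;

    if (colon == NULL)
        return 0;
    len = (size_t)(colon - s);
    if (len == 0 || len >= sizeof ref->path)
        return 0;
    /* Assumption: the range uses an ASCII hyphen between the numbers. */
    if (sscanf(colon + 1, "%d-%d", &ref->first, &ref->last) != 2)
        return 0;
    if (ref->first < 1 || ref->last < ref->first)
        return 0;
    memcpy(ref->path, s, len);
    ref->path[len] = '\0';
    return 1;
}
```

A directory-only footnote, such as the one given for figures that draw on several files, makes the function return 0, so a caller can fall back to listing that directory instead.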


Sometimes we omit parts from the code we list; we indicate those with an ellipsis sign [...]. In those cases the line numbers represent the entire range covered by the listed code. Other changes you may notice when referring back to the original code are changes of most C declarations from the old "Kernighan and Ritchie" style to ANSI C and the omission of some comments, white space, and program licensing information. We hope that these changes enhance the readability of the examples we provide without overly affecting the realism of the original examples. Nontrivial code samples are graphically annotated with comments using a custom-built software application. The use of the annotation software ensures that the examples remain correct and can be machine-verified. Sometimes we expand on an annotation in the narrative text. In those cases (Figure 1.1:1) the annotation starts with a number printed in a box; the same number, following a colon, is used to refer to the annotation from the text.
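To make the change of declaration style concrete, here is a small illustration. The function larger is a hypothetical example of ours and does not come from the book's source code base; the old-style form appears only in a comment because recent compilers may reject it.

```c
/*
 * Old "Kernighan and Ritchie" style, as you may find it in the
 * original sources: the parameter types follow the parameter list.
 *
 *     int
 *     larger(a, b)
 *         int a, b;
 *     {
 *         return a > b ? a : b;
 *     }
 */

/* The same (hypothetical) function with an ANSI C prototype-style header. */
int
larger(int a, int b)
{
    return a > b ? a : b;
}
```

The two forms describe the same function; only the ANSI form lets the compiler check the argument types at each call.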


1.2.2 Diagrams


We chose UML for our design diagrams because it is the de facto industry standard. In preparing this book, we found it useful to develop an open-source declarative language for generating UML diagrams,[5] and we also made some small improvements to the code base underlying the GraphViz[6] tool. We hope you find that the resulting UML diagrams help you better understand the code we analyze.

[5] http://www.spinellis.gr/sw/umlgraph

[6] http://www.graphviz.org


Figure 1.2 shows examples of the notation we use in our diagrams. Keep in mind the following.


  • We draw processes (for example, filter-style programs) using UML's active class notation: a class box with a bold frame (for example, see Figure 6.14, page 213).

  • We depict pointers between data elements using an association navigation relationship: a solid line with an open arrow. We also split each data structure into horizontal or vertical compartments to better depict its internal organization (for example, see Figure 4.10, page 121).

  • We show the direction of associations (for example, to illustrate the flow of data) with a solid arrow located on the association line, rather than on top of it as prescribed by the UML (for example, see Figure 9.3, page 274).


Figure 1.2. UML-based diagram notation.


All other relationships use standard UML notation.


  • Class inheritance is drawn using a generalization relationship: a solid line with an empty arrow (for example, see Figure 9.6, page 277).

  • An interface implementation is drawn as a realization relationship: a dashed line with an empty arrow (for example, see Figure 9.7, page 278).

  • A dependency between two elements (for example, between files of a build process) is shown with a dashed line and an open arrow (for example, see Figure 6.8, page 191).

  • Compositions (for example, a library consisting of various modules) are depicted through an aggregation association: a line ending in a diamond-like shape (for example, see Figure 9.24, page 321).


1.2.3 Exercises


The exercises you will find at the end of most sections aim to provide you with an incentive to apply the techniques we described and to further research particularly interesting issues, or they may be starting points for in-depth discussions. In most instances you can use references to the book's CD-ROM and to "code in your environment" interchangeably. What is important is to read and examine code from real-world, nontrivial systems. If you are currently working on such a system (be it in a proprietary development effort or an open-source project), it will be more productive to target the code-reading exercises toward that system instead of the book's CD-ROM.


Many exercises begin by asking you to locate particular code sequences. This task can be automated. First, express the code you are looking for as a regular expression. (Read more about regular expressions in Chapter 10.) Then, search through the code base using a command such as the following in the Unix environment:



find /cdrom -name '*.c' -print | xargs grep 'malloc.*NULL'

or using the Perl script codefind.pl[7] in the Windows environment. (Some of the files in the source code base have the same name as old MS-DOS devices, causing some Windows implementations to hang when trying to access them; the Perl script explicitly codes around this problem.)

[7] tools/codefind.pl
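The two ingredients of such a search can be sketched in C. This is an illustration under stated assumptions, not a description of how codefind.pl works: the function names is_dos_device_name and line_matches are hypothetical, the device list is the commonly documented set of reserved MS-DOS names, and the matching relies on the POSIX <regex.h> interface being available.

```c
#include <ctype.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if a file name (without its directory part) collides with an
 * old MS-DOS device name, for example "aux.c" or "CON", and 0 otherwise.
 * Only the part before the first dot is compared, ignoring case.
 */
int
is_dos_device_name(const char *name)
{
    char stem[5];
    size_t len = strcspn(name, ".");
    size_t i;

    if (len < 3 || len > 4)
        return 0;
    for (i = 0; i < len; i++)
        stem[i] = (char)toupper((unsigned char)name[i]);
    stem[len] = '\0';
    if (len == 3)
        return strcmp(stem, "CON") == 0 || strcmp(stem, "PRN") == 0 ||
            strcmp(stem, "AUX") == 0 || strcmp(stem, "NUL") == 0;
    /* Four characters: COM1 to COM9 and LPT1 to LPT9. */
    return (strncmp(stem, "COM", 3) == 0 || strncmp(stem, "LPT", 3) == 0) &&
        stem[3] >= '1' && stem[3] <= '9';
}

/*
 * Return 1 if the line matches the POSIX basic regular expression,
 * 0 if it does not, and -1 if the pattern cannot be compiled.
 */
int
line_matches(const char *pattern, const char *line)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, line, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

A search program would skip every file for which is_dos_device_name returns 1 and call line_matches with a pattern such as malloc.*NULL on each line of the remaining files.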


1.2.4 Supplementary Material


All the examples you will find in this book are based on existing open-source software code. The source code base comprises more than 53,000 files occupying over 540 MB. All references to code examples are unambiguously identified in footnotes so you can examine the referenced code in its context. In addition, you can coordinate your exploration of the source code base with the book's text in three different ways.


  1. You can look up the file name (the last component of the complete file path) of each referenced source file in the Index.

  2. You can browse Appendix A, which provides an overview of the source code base.

  3. You can search Appendix C, which contains a list of referenced source code files sorted according to the code directory structure.


1.2.5 Tools


Some of the examples we provide depend on the availability of programs found under Unix-type operating systems, such as grep and find. A number of such systems (for example, FreeBSD, GNU/Linux, NetBSD, OpenBSD, and Solaris) are now freely available to download and install on a wide variety of hardware. If you do not have access to such a system, you can still benefit from these tools by using ports that have been made to other operating systems such as Windows. (Section 10.9 contains further details on tool availability.)


1.2.6 Outline


In Chapter 2 we present two complete programs and examine their workings in a step-by-step fashion. In doing so we outline some basic strategies for code reading and identify common C control structures, building blocks, idioms, and pitfalls. We leave some more advanced (and easily abused) elements of the C language to be discussed in Chapters 3 and 5. Chapter 4 examines how to read code embodying common data structures. Chapter 6 deals with code found in really large projects: geographically distributed team efforts comprising thousands of files and millions of lines of code. Large projects typically adopt common coding standards and conventions (discussed in Chapter 7) and may include formal documentation (presented in Chapter 8). Chapter 9 provides background information and advice on viewing the forest rather than the trees: the system's architecture rather than its code details. When reading code you can use a number of tools. These are the subject of Chapter 10. Finally, Chapter 11 contains a complete worked-out example: the code-reading and code-understanding techniques presented in the rest of the book are applied for locating and extracting a phase of the moon algorithm from the NetBSD source code base and adding it as an SQL function in the Java-based HSQL database engine.


In the form of appendices you will find an overview of the code that we used in the examples and that accompanies this book (Appendix A), a list of individuals and organizations whose code appears in the book's text (Appendix B), a list of all referenced source files ordered by the directory in which they occur (Appendix C), the source code licenses (Appendix D), and a list of maxims for reading code with references to the page where each one is introduced (Appendix E).


1.2.7 The Great Language Debate


Most examples in the book are based on C programs running on a POSIX character terminal environment. The reasons behind this choice have to do with the abundance of portable open-source C code and the conciseness of the examples we found compared to similar ones written in C++ or Java. (The reasons behind this phenomenon are probably mostly related to the code's age or the prevalent coding style rather than particular language characteristics.) It is unfortunate that programs based on graphical user interfaces (GUIs) are poorly represented in our samples, but reading and reasoning about such programs really deserves a separate book volume. In all cases where we mention Microsoft Windows API functions we refer to the Win32 SDK API rather than the .NET platform.


Table 1.1. The Ten Most-Used Languages in Open-Source Projects

Language        Number of Projects    % of Projects
C               8,393                 21.2
C++             7,632                 19.2
Java            5,970                 15.1
PHP             4,433                 11.2
Perl            3,618                  9.1
Python          1,765                  4.5
Visual Basic      916                  2.3
Unix Shell        835                  2.1
Assembly          745                  1.9
JavaScript        738                  1.9


We have been repeatedly asked about the languages used to write open-source software. Table 1.1 summarizes the number of projects using each of the top ten most-used languages in the SourceForge.net repository.[8] The C language features at the top of the list and is probably underrepresented because many very large C open-source projects such as FreeBSD and GNU/Linux are independently hosted, and many projects claiming to use C++ are in fact written in C, making very little use of the C++ features. On the other hand, keep in mind that a similar list compiled for non-open-source projects would be entirely different, probably featuring COBOL, Ada, Fortran, and assorted 4GLs at the top of the list. Furthermore, if you are maintaining code (a very likely if unfashionable reason for reading code), the language you will be reading would very likely have been adopted (if you are lucky) five or ten years ago, reflecting the programming language landscape of that era.

[8] http://sourceforge.net/softwaremap/trove_list.php?form_cat=160


The code structures that our examples represent apply in most cases equally well to Java and C++ programs; many also apply to Perl, Python, and PHP. However, you can safely skip


  • Chapter 3 if you never encounter C (or C++ as a better C) code

  • Section 9.3.4 if C++ and Ada are not your cup of tea

  • Section 9.1.3 if you have managed to avoid object-oriented approaches, C++, Java, Python, Smalltalk, and the object-oriented features of Perl

  • Section 5.2 if you are not into Java or C++

  • Section 2.8 if you are a structured programming zealot or a devoted Java fan




