
Unix Power Tools

16.8. Looking for Closure

A common problem in text processing is making sure that items that need to occur in pairs actually do so.

Most Unix text editors include support for making sure that elements of C syntax such as parentheses and braces are closed properly. Some editors, such as Emacs (Section 19.1) and vim (Section 17.1), also support syntax coloring and checking for text documents -- HTML and SGML, for instance. There's much less support in command-line utilities for making sure that textual documents have the proper structure. For example, HTML documents that start a list with <UL> need a closing </UL>.

Unix provides a number of tools that might help you to tackle this problem. Here's a gawk script written by Dale Dougherty that makes sure <UL> and </UL> tags come in pairs:

gawk Section 20.11

#! /usr/local/bin/gawk -f
BEGIN {
    IGNORECASE = 1
    inList = 0
    LSlineno = 0
    LElineno = 0
    prevFile = ""
}
# when a new file starts, check for an unclosed list in the previous file
FILENAME != prevFile {
    if (inList)
        printf ("%s: found <UL> at line %d without </UL> before end of file\n",
            prevFile, LSlineno)
    inList = 0
    prevFile = FILENAME
}
# match <UL> and see if we are in list
/^<UL>/ {
    if (inList) {
        printf("%s: nested list starts: line %d and %d\n",
            FILENAME, LSlineno, FNR)
    }
    inList = 1
    LSlineno = FNR
}
/^<\/UL>/ {
    if (! inList)
        printf("%s: too many list ends: line %d and %d\n",
            FILENAME, LElineno, FNR)
    else
        inList = 0
    LElineno = FNR
}
# this catches end of input
END {
    if (inList)
        printf ("%s: found <UL> at line %d without </UL> before end of file\n",
            FILENAME, LSlineno)
}

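You can adapt this type of script to any place where you need to check that an item has both a start and a finish. As written, it recognizes only tags that begin a line, and it reports a nested list rather than tracking it. Note, though, that not all systems have gawk preinstalled. The script relies on gawk's IGNORECASE variable for case-insensitive matching, so you'll need to install gawk on your system to run it unchanged.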
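As one possible adaptation, the sketch below wraps the same start/finish check in a shell function that takes the opening and closing markers as arguments. It is written for POSIX awk rather than gawk, so it substitutes toupper() for IGNORECASE. The function name check_pairs and all of its variable names are invented for illustration, and like the script above it looks only at markers that begin a line.

```sh
# check_pairs OPEN CLOSE [file ...]
# Hypothetical helper, not part of any standard toolkit: reports OPEN
# markers that are never closed and CLOSE markers that have no OPEN.
check_pairs() {
    otag=$1
    ctag=$2
    shift 2
    awk -v otag="$otag" -v ctag="$ctag" '
    # Compare in upper case so that <ul> and <UL> are treated alike.
    BEGIN { otag = toupper(otag); ctag = toupper(ctag) }

    # First line of a new file: report an item left open in the previous one.
    FNR == 1 {
        if (open_line)
            printf("%s: %s at line %d never closed\n", prev, otag, open_line)
        open_line = 0
        prev = FILENAME
    }

    { line = toupper($0) }

    # index() == 1 means the line begins with the literal marker.
    index(line, otag) == 1 {
        if (open_line)
            printf("%s: nested %s: line %d and %d\n",
                FILENAME, otag, open_line, FNR)
        open_line = FNR
    }

    index(line, ctag) == 1 {
        if (!open_line)
            printf("%s: %s at line %d has no matching %s\n",
                FILENAME, ctag, FNR, otag)
        open_line = 0
    }

    # End of input: the last file may still have an open item.
    END {
        if (open_line)
            printf("%s: %s at line %d never closed\n",
                FILENAME, otag, open_line)
    }
    ' "$@"
}
```

For example, check_pairs '<OL>' '</OL>' *.html would apply the same test to ordered lists. Because the comparison uses index() rather than a regular expression, the markers are treated as literal strings and need no escaping, although awk's -v option would still interpret any backslashes in them.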

A more complete syntax-checking program could be written with the help of a lexical analyzer like lex. lex is normally used by experienced C programmers, but it can be used profitably by someone who has mastered awk and is just beginning with C, since it combines an awk-like pattern-matching process using regular-expression syntax with actions written in the more powerful and flexible C language. (See O'Reilly & Associates' lex & yacc.)
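Short of reaching for lex, part of that extra coverage can be approximated in awk itself. The sketch below is not the lex approach and makes several assumptions of its own: it tracks only the UL, OL, and DL tags, it keeps open tags on a stack so that properly nested lists are accepted, and it finds tags anywhere on a line with match(). The function name check_nesting and every variable name are hypothetical. Tags split across lines, comments, and embedded scripts are not handled.

```sh
# check_nesting [file ...]
# Hypothetical helper: verifies that list tags are closed in the
# reverse of the order in which they were opened.
check_nesting() {
    awk '
    # Assumption: only these tags are tracked; extend the string as needed.
    BEGIN {
        n = split("UL OL DL", t, " ")
        for (i = 1; i <= n; i++)
            paired[t[i]] = 1
    }

    # Report every tag still on the stack, innermost first, and empty it.
    function flush_stack(file) {
        while (depth > 0) {
            printf("%s: <%s> at line %d never closed\n",
                file, name[depth], where[depth])
            depth--
        }
    }

    # A new file starts with an empty stack.
    FNR == 1 { flush_stack(prev); prev = FILENAME }

    {
        rest = $0
        # Pull out each <TAG ...> or </TAG> in turn, left to right.
        while (match(rest, /<\/?[A-Za-z][^>]*>/)) {
            tag = substr(rest, RSTART, RLENGTH)
            rest = substr(rest, RSTART + RLENGTH)
            closing = (substr(tag, 2, 1) == "/")

            # Strip the delimiters and any attributes to get the bare name.
            sub(/^<\/?/, "", tag)
            sub(/[^A-Za-z0-9].*$/, "", tag)
            tag = toupper(tag)
            if (!(tag in paired))
                continue

            if (!closing) {
                # Opening tag: push its name and line number.
                depth++
                name[depth] = tag
                where[depth] = FNR
            } else if (depth > 0 && name[depth] == tag) {
                # Closing tag that matches the innermost open tag: pop.
                depth--
            } else {
                printf("%s: unexpected </%s> at line %d\n",
                    FILENAME, tag, FNR)
            }
        }
    }

    END { flush_stack(FILENAME) }
    ' "$@"
}
```

A mismatched closing tag is reported without popping the stack, so a single stray tag does not trigger a cascade of follow-on errors. A real lexical analyzer would still be the better choice once comments, quoting, or multi-line tags matter.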

Of course, this kind of problem could be very easily tackled with the information in Chapter 41.

--TOR and SP




Copyright © 2003 O'Reilly & Associates. All rights reserved.