CHAPTER 7: LEXICAL ANALYSIS AND STOPLISTS

7.1 INTRODUCTION

Lexical analysis is the process of converting an input stream of characters into a stream of words or tokens. Tokens are groups of characters with collective significance. Lexical analysis is the first stage of automatic indexing, and of query processing. Automatic indexing is the process of algorithmically examining information items to generate lists of index terms. The lexical analysis phase produces candidate index terms that may be further processed, and eventually added to indexes (see Chapter 1 for an outline of this process). Query processing is the activity of analyzing a query and comparing it to indexes to find relevant items. Lexical analysis of a query produces tokens that are parsed and turned into an internal representation suitable for comparison with indexes.

In automatic indexing, candidate index terms are often checked to see whether they are in a stoplist, or negative dictionary. Stoplist words are known to make poor index terms, and they are immediately removed from further consideration as index terms when they are identified.

This chapter discusses the design and implementation of lexical analyzers and stoplists for information retrieval. These topics go well together because, as we will see, one of the most efficient ways to implement stoplists is to incorporate them into a lexical analyzer.

7.2 LEXICAL ANALYSIS

7.2.1 Lexical Analysis for Automatic Indexing

The first decision that must be made in designing a lexical analyzer for an automatic indexing system is: What counts as a word or token in the indexing scheme? At first, this may seem an easy question, and there are some easy answers to it--for example, terms consisting entirely of letters should be tokens. Problems soon arise, however. Consider the following:

 Digits--Most numbers do not make good index terms, so often digits are not included as tokens. However, certain numbers in some kinds of databases may be important (for example, case numbers in a legal database). Also, digits are often included in words that should be index terms, especially in databases containing technical documents. For example, a database about vitamins would contain important tokens like "B6" and "B12." One partial (and easy) solution to the last problem is to allow tokens to include digits, but not to begin with a digit.

 Hyphens--Another difficult decision is whether to break hyphenated words into their constituents, or to keep them as a single token. Breaking hyphenated terms apart helps with inconsistent usage (e.g., "state-of-the-art" and "state of the art" are treated identically), but loses the specificity of a hyphenated phrase. Also, the hyphen character is often used in place of an em dash, and to mark a single word broken into syllables at the end of a line. Treating dashes used in these ways as hyphens does not work. On the other hand, hyphens are often part of a name, such as "Jean-Claude," "F-16," or "MS-DOS."

 Other Punctuation--Like the dash, other punctuation marks are often used as parts of terms. For example, periods are commonly used as parts of file names in computer systems (e.g., "COMMAND.COM" in DOS), or as parts of section numbers; slashes may appear as part of a name (e.g., "OS/2"). If numbers are regarded as legitimate index terms, then numbers containing commas and decimal points may need to be recognized. The underscore character is often used in terms in programming languages (e.g., "max_size" is an identifier in Ada, C, Prolog, and other languages).

 Case--The case of letters is usually not significant in index terms, and typically lexical analyzers for information retrieval systems convert all characters to either upper or lower case. Again, however, case may be important in some situations. For example, case distinctions are important in some programming languages, so an information retrieval system for source code may need to preserve case distinctions in generating index terms.

There is no technical difficulty in solving any of these problems, but information system designers must think about them carefully when setting lexical analysis policy. Recognizing numbers as tokens adds many terms with poor discrimination value to an index, but may be a good policy if exhaustive searching is important. Breaking up hyphenated terms increases recall but decreases precision, and may be inappropriate in some fields (like an author field). Preserving case distinctions enhances precision but decreases recall.

Commercial information systems differ somewhat in their lexical analysis policies, but are alike in usually taking a conservative (recall enhancing) approach. For example, Chemical Abstracts Service, ORBIT Search Service, and Mead Data Central's LEXIS/NEXIS all recognize numbers and words containing digits as index terms, and all are case insensitive. None has special provisions for most punctuation marks in most indexed fields. However, Chemical Abstracts Service keeps hyphenated words as single tokens, while the ORBIT Search Service and LEXIS/NEXIS break them apart (if they occur in title or abstract fields).

The example we use to illustrate our discussion is deliberately simple, both because it is easy to explain and because the simplest solution often turns out to be best. Modifications to it based on the considerations discussed above are easy to make. In the example, any nonempty string of letters and digits, not beginning with a digit, is regarded as a token. All letters are converted to lower case. All punctuation, spacing, and control characters are treated as token delimiters.
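To make the policy concrete, here is a minimal sketch of such an indexing tokenizer in C. It is ours, not part of the analyzers developed later in this chapter: the function name and buffer limit are illustrative, it relies on the standard ctype.h classification macros rather than the table-driven classification used below, and it simply skips over leading digits (so a string like "9am" yields the term "am"), which is only one way of honoring the "not beginning with a digit" rule.

#include <ctype.h>
#include <stdio.h>

#define MAX_TERM_LEN 255   /* illustrative limit, not taken from the chapter's code */

/* Scan the next candidate index term from stream into term. A term is  */
/* a string of letters and digits beginning with a letter; letters are  */
/* folded to lower case; all other characters are delimiters.           */
/* Returns 1 if a term was found, 0 at end of input.                    */
static int
next_index_term( FILE *stream, char *term )
{
   int ch;      /* current input character */
   int i = 0;   /* number of characters collected so far */

   while ( EOF != (ch = getc(stream)) )
   {
      if ( isalpha(ch) || (0 < i && isdigit(ch)) )
      {
         if ( i < MAX_TERM_LEN ) term[i++] = tolower(ch);
      }
      else if ( 0 < i )
         break;           /* a delimiter ends the term */
   }

   term[i] = '\0';
   return( 0 < i );
}

A driver that repeatedly calls such a routine, looks each term up in the stoplist, and passes the survivors on to the indexer corresponds to the first stage of automatic indexing outlined in the introduction.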

7.2.2 Lexical Analysis for Query Processing

Designing a lexical analyzer for query processing is like designing one for automatic indexing. It also depends on the design of the lexical analyzer for automatic indexing: since query search terms must match index terms, the same tokens must be distinguished by the query lexical analyzer as by the indexing lexical analyzer. In addition, however, the query lexical analyzer must usually distinguish operators (like the Boolean operators, stemming or truncating operators, and weighting function operators), and grouping indicators (like parentheses and brackets). A lexical analyzer for queries should also process certain characters, like control characters and disallowed punctuation characters, differently from one for automatic indexing. Such characters are best treated as delimiters in automatic indexing, but in query processing, they indicate an error. Hence, a query lexical analyzer should flag illegal characters as unrecognized tokens.

The example query lexical analyzer presented below recognizes left and right parentheses (as grouping indicators), ampersand, bar, and caret (as Boolean operators), and any alphanumeric string beginning with a letter (as search terms). Spacing characters are treated as delimiters, and other characters are returned as unrecognized tokens. All uppercase characters are converted to lowercase.

7.2.3 The Cost of Lexical Analysis

Lexical analysis is expensive because it requires examination of every input character, while later stages of automatic indexing and query processing do not. Although no studies of the cost of lexical analysis in information retrieval systems have been done, lexical analysis has been shown to account for as much as 50 percent of the computational expense of compilation (Waite 1986). Thus, it is important for lexical analyzers, particularly for automatic indexing, to be as efficient as possible.

7.2.4 Implementing a Lexical Analyzer

Lexical analysis for information retrieval systems is the same as lexical analysis for other text processing systems; in particular, it is the same as lexical analysis for program translators. This problem has been studied thoroughly, so we ought to adopt the solutions in the program translation literature (Aho, Sethi, and Ullman 1986). There are three ways to implement a lexical analyzer:

 Use a lexical analyzer generator, like the UNIX tool lex (Lesk 1975), to generate a lexical analyzer automatically;

 Write a lexical analyzer by hand ad hoc; or

 Write a lexical analyzer by hand as a finite state machine.

The first approach, using a lexical analyzer generator, is best when the lexical analyzer is complicated; if the lexical analyzer is simple, it is usually easier to implement it by hand. In our discussion of stoplists below, we present a special purpose lexical analyzer generator for automatic indexing that produces efficient lexical analyzers that filter stoplist words. Consequently, we defer further discussion of this alternative.

The second alternative is the worst. An ad hoc algorithm, written just for the problem at hand in whatever way the programmer can think to do it, is likely to contain subtle errors. Furthermore, finite state machine algorithms are extremely fast, so ad hoc algorithms are likely to be less efficient.

The third approach is the one we present in this section. We assume some knowledge of finite state machines (also called finite automata), and their use in program translation systems. Readers unfamiliar with these topics can consult Hopcroft and Ullman (1979), and Aho, Sethi, and Ullman (1986). Our example is an implementation of a query lexical analyzer as described above.

The easiest way to begin a finite state machine implementation is to draw a transition diagram for the target machine. A transition diagram for a machine recognizing tokens for our example query lexical analyzer is pictured in Figure 7.1.

In this diagram, characters fall into ten classes: space characters, letters, digits, the left and right parentheses, ampersand, bar, caret, the end of string character, and all other characters. The first step in implementing this finite state machine is to build a mechanism for classifying characters. The easiest and fastest way to do this is to preload an array with the character classes for the character set. Assuming the ASCII character set, such an array would contain 128 elements with the character classes for the corresponding ASCII characters. If such an array is called char_class, for example, then the character class for character 'c' is simply char_class [c]. The character classes themselves form a distinct data type best declared as an enumeration in C. Figure 7.2 contains C declarations for a character class type and array. (Note that the end of file character requires special treatment in C because it is not part of ASCII).

 

Figure 7.1: Transition diagram for a query lexical analyzer

The same technique is used for fast case conversion. In Figure 7.2, an array of 128 characters called convert_case is preloaded with the printing characters, with lowercase characters substituted for uppercase characters. Nonprinting character positions will not be used, and are set to 0.

/**************  Character Classification  *****************/

/* Tokenizing requires that ASCII be broken into character */

/* classes distinguished for tokenizing. White space       */

/* characters separate tokens. Digits and letters make up  */

/* the body of search terms. Parentheses group sub-        */

/* expressions. The ampersand, bar, and caret are          */

/* operator symbols.                                       */

typedef enum {

WHITE_CH,             /* whitespace characters */

DIGIT_CH,             /* the digits */

LETTER_CH,            /* upper and lower case */

LFT_PAREN_CH,         /* the "(" character */

RGT_PAREN_CH,         /* the ")" character */

AMPERSAND_CH,         /* the "&" character */

BAR_CH,               /* the "|" character */

CARET_CH,             /* the "^" character */

EOS_CH,               /* the end of string character */

OTHER_CH,             /* catch-all for everything else */

} CharClassType;

static CharClassType char_class[128] = {

/* ^@ */  EOS_CH,      /* ^A */  OTHER_CH,    /* ^B */  OTHER_CH,

/* ^C */  OTHER_CH,    /* ^D */  OTHER_CH,    /* ^E */  OTHER_CH,

/* ^F */  OTHER_CH,    /* ^G */  OTHER_CH,    /* ^H */  WHITE_CH,

/* ^I */  WHITE_CH,    /* ^J */  WHITE_CH,    /* ^K */  WHITE_CH,

/* ^L */  WHITE_CH,    /* ^M */  WHITE_CH,    /* ^N */  OTHER_CH,

/* ^O */  OTHER_CH,    /* ^P */  OTHER_CH,    /* ^Q */  OTHER_CH,

/* ^R */  OTHER_CH,    /* ^S */  OTHER_CH,    /* ^T */  OTHER_CH,

/* ^U */  OTHER_CH,    /* ^V */  OTHER_CH,    /* ^W */  OTHER_CH,

/* ^X */  OTHER_CH,    /* ^Y */  OTHER_CH,    /* ^Z */  OTHER_CH,

/* ^[ */  OTHER_CH,    /* ^\ */  OTHER_CH,    /* ^] */  OTHER_CH,

/* ^^ */  OTHER_CH,    /* ^_ */  OTHER_CH,    /*    */  WHITE_CH,

/*  ! */  OTHER_CH,    /*  " */  OTHER_CH,    /*  # */  OTHER_CH,

/*  $ */  OTHER_CH,    /*  % */  OTHER_CH,    /*  & */  AMPERSAND_CH,

/*  ' */  OTHER_CH,    /*  ( */  LFT_PAREN_CH, /*  ) */  RGT_PAREN_CH,

/*  * */  OTHER_CH,    /*  + */  OTHER_CH,    /*  , */  OTHER_CH,

/*  - */  OTHER_CH,    /*  . */  OTHER_CH,    /*  / */  OTHER_CH,

/*  0 */  DIGIT_CH,    /*  1 */  DIGIT_CH,    /*  2 */  DIGIT_CH,

/*  3 */  DIGIT_CH,    /*  4 */  DIGIT_CH,    /*  5 */  DIGIT_CH,

/*  6 */  DIGIT_CH,    /*  7 */  DIGIT_CH,    /*  8 */  DIGIT_CH,

/*  9 */  DIGIT_CH,    /*  : */  OTHER_CH,    /*  ; */  OTHER_CH,

/*  < */  OTHER_CH,    /*  = */  OTHER_CH,    /*  > */  OTHER_CH,

/*  ? */  OTHER_CH,    /*  @ */  OTHER_CH,    /*  A */  LETTER_CH,

/*  B */  LETTER_CH,   /*  C */  LETTER_CH,   /*  D */  LETTER_CH,

/*  E */  LETTER_CH,   /*  F */  LETTER_CH,   /*  G */  LETTER_CH,

/*  H */  LETTER_CH,   /*  I */  LETTER_CH,   /*  J */  LETTER_CH,

/*  K */  LETTER_CH,   /*  L */  LETTER_CH,   /*  M */  LETTER_CH,

/*  N */  LETTER_CH,   /*  O */  LETTER_CH,   /*  P */  LETTER_CH,

/*  Q */  LETTER_CH,   /*  R */  LETTER_CH,   /*  S */  LETTER_CH,

/*  T */  LETTER_CH,   /*  U */  LETTER_CH,   /*  V */  LETTER_CH,

/*  W */  LETTER_CH,   /*  X */  LETTER_CH,   /*  Y */  LETTER_CH,

/*  Z */  LETTER_CH,   /*  [ */  OTHER_CH,    /*  \ */  OTHER_CH,

/*  ] */  OTHER_CH,    /*  ^ */  CARET_CH,    /*  _ */  OTHER_CH,

/*  ` */  OTHER_CH,    /*  a */  LETTER_CH,   /*  b */  LETTER_CH,

/*  c */  LETTER_CH,   /*  d */  LETTER_CH,   /*  e */  LETTER_CH,

/*  f */  LETTER_CH,   /*  g */  LETTER_CH,   /*  h */  LETTER_CH,

/*  i */  LETTER_CH,   /*  j */  LETTER_CH,   /*  k */  LETTER_CH,

/*  l */  LETTER_CH,   /*  m */  LETTER_CH,   /*  n */  LETTER_CH,

/*  o */  LETTER_CH,   /*  p */  LETTER_CH,   /*  q */  LETTER_CH,

/*  r */  LETTER_CH,   /*  s */  LETTER_CH,   /*  t */  LETTER_CH,

/*  u */  LETTER_CH,   /*  v */  LETTER_CH,   /*  w */  LETTER_CH,

/*  x */  LETTER_CH,   /*  y */  LETTER_CH,   /*  z */  LETTER_CH,

/*  { */  OTHER_CH,    /*  | */  BAR_CH,      /*  } */  OTHER_CH,

/*  ~ */  OTHER_CH,    /* ^? */  OTHER_CH,                           };

/**************   Character Case Conversion   *************/

/* Term text must be accumulated in a single case. This   */

/* array is used to convert letter case but otherwise     */

/* preserve characters.                                   */

static char convert_case[128] = {

/* ^@ */    0,    /* ^A */    0,    /* ^B */    0,    /* ^C */     0,

/* ^D */    0,    /* ^E */    0,    /* ^F */    0,    /* ^G */     0,

/* ^H */    0,    /* ^I */    0,    /* ^J */    0,    /* ^K */     0,

/* ^L */    0,    /* ^M */    0,    /* ^N */    0,    /* ^O */     0,

/* ^P */    0,    /* ^Q */    0,    /* ^R */    0,    /* ^S */     0,

/* ^T */    0,    /* ^U */    0,    /* ^V */    0,    /* ^W */     0,

/* ^X */    0,    /* ^Y */    0,    /* ^Z */    0,    /* ^[ */     0,

/* ^\ */    0,    /* ^] */    0,    /* ^^ */    0,    /* ^_ */     0,

/*    */  ' ',    /*  ! */  '!',    /*  " */  '"',    /*  # */   '#',

/*  $ */  '$',    /*  % */  '%',    /*  & */  '&',    /*  ' */  '\'',

/*  ( */  '(',    /*  ) */  ')',    /*  * */  '*',    /*  + */   '+',

/*  , */  ',',    /*  - */  '-',    /*  . */  '.',    /*  / */   '/',

/*  0 */  '0',    /*  1 */  '1',    /*  2 */  '2',    /*  3 */   '3',

/*  4 */  '4',    /*  5 */  '5',    /*  6 */  '6',    /*  7 */   '7',

/*  8 */  '8',    /*  9 */  '9',    /*  : */  ':',    /*  ; */   ';',

/*  < */  '<',    /*  = */  '=',    /*  > */  '>',    /*  ? */   '?',

/*  @  */ '@',    /*  A */  'a',    /*  B */  'b',    /*  C */   'c',

/*  D */  'd',    /*  E */  'e',    /*  F */  'f',    /*  G */   'g',

/*  H */   'h',   /*  I */  'i',    /*  J */  'j',    /*  K */   'k',

/*  L */   'l',   /*  M */  'm',    /*  N */  'n',    /*  O */   'o',

/*  P */   'p',   /*  Q */  'q',    /*  R */  'r',    /*  S */   's',

/*  T */   't',   /*  U */  'u',    /*  V */  'v',    /*  W */   'w',

/*  X */   'x',   /*  Y */  'y',    /*  Z */  'z',    /*  [ */   '[',

/*  \ */  '\\',   /*  ] */  ']',    /*  ^ */  '^',    /*  _ */   '_',

/*  ` */   '`',   /*  a */  'a',    /*  b */  'b',    /*  c */   'c',

/*  d */   'd',   /*  e */  'e',    /*  f */  'f',    /*  g */   'g',

/*  h */   'h',   /*  i */  'i',    /*  j */  'j',    /*  k */   'k',

/*  l */   'l',   /*  m */  'm',    /*  n */  'n',    /*  o */   'o',

/*  p */   'p',   /*  q */  'q',    /*  r */  'r',    /*  s */   's',

/*  t */   't',   /*  u */  'u',    /*  v */  'v',    /*  w */   'w',

/*  x */   'x',   /*  y */  'y',    /*  z */  'z',    /*  { */   '{' ,

/*  | */   '|',   /*  } */  '}',    /*  ~ */  '~',    /* ^? */     0, };

/********************   Tokenizing   ***********************/

/* The lexer distinguishes terms, parentheses, the and, or */

/* and not operators, the unrecognized token, and the end  */

/* of the input.                                           */

typedef enum {

TERM_TOKEN      = 1,    /* a search term */

LFT_PAREN_TOKEN = 2,    /* left parenthesis */

RGT_PAREN_TOKEN = 3,    /* right parenthesis */

AND_TOKEN       = 4,    /* set intersection connective */

OR_TOKEN        = 5,    /* set union connective */

NOT_TOKEN       = 6,    /* set difference connective */

END_TOKEN       = 7,    /* end of the query */

NO_TOKEN        = 8,    /* the token is not recognized */

} TokenType;

Figure 7.2: Declarations for a simple query lexical analyzer

There also needs to be a type for tokens. An enumeration type is best for this as well. This type will have an element for each of the tokens: term, left parenthesis, right parenthesis, ampersand, bar, caret, end of string, and the unrecognized token. Processing is simplified by matching the values of the enumeration type to the final states of the finite state machine. The declaration of the token type also appears in Figure 7.2.

The code for the finite state machine must keep track of the current state, and have a way of changing from state to state on input. A state change is called a transition. Transition information can be encoded in tables, or in flow of control. When there are many states and transitions, a tabular encoding is preferable; in our example, a flow of control encoding is probably clearest. Our example implementation reads characters from an input stream supplied as a parameter. The routine returns the next token from the input each time it is called. If the token is a term, the text of the term (in lowercase) is written to a term buffer supplied as a parameter. Our example code appears in Figure 7.3.
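For comparison, a tabular encoding of the same two-state machine is sketched below. The array name is ours, not part of Figure 7.3, and the actions that accumulate term text, reset on white space, and push back the final character would still have to be attached to the appropriate entries or handled as in the flow of control version.

/* A table-driven alternative to the nested switch of Figure 7.3.      */
/* Rows are indexed by the current state, columns by CharClassType in  */
/* its declaration order: WHITE, DIGIT, LETTER, (, ), &, |, ^, EOS,    */
/* OTHER. Negative entries are final states.                           */
static int next_state[2][10] = {
/* state 0 */  {  0, -8,  1, -2, -3, -4, -5, -6, -7, -8 },
/* state 1 */  { -1,  1,  1, -1, -1, -1, -1, -1, -1, -1 },
};

With such a table, the transition step in the scanning loop reduces to the single assignment state = next_state[state][char_class[next_ch]].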

/*FN************************************************************************

GetToken( stream, term )

Returns: TokenType -- the next token scanned from the input stream

Purpose: Get the next token from an input stream

Plan:    Part 1: Run a state machine on the input

Part 2: Coerce the final state to return the token type

Notes:   Run a finite state machine on an input stream, collecting

the text of the token if it is a term. The transition table

for this DFA is the following (negative states are final):

State  |  White  Letter   (    )    &    |    ^   EOS  Digit  Other

---------------------------------------------------------------------

  0    |    0      1     -2   -3   -4   -5   -6   -7    -8     -8

  1    |   -1      1     -1   -1   -1   -1   -1   -1     1     -1

See the token type above to see what is recognized in the

various final states.

**/

static TokenType

GetToken( stream, term )

FILE *stream;  /* in: where to grab input characters */

char *term;    /* out: the token text if the token is a term */

{

int next_ch;  /* from the input stream */

int state;    /* of the tokenizer DFA */

int i;        /* for scanning through the term buffer */

/* Part 1: Run a state machine on the input */

state = 0;

i = 0;

while ( 0 <= state )

{

if ( EOF == (next_ch = getc(stream)) ) next_ch = '\0';

term[i++] = convert_case[next_ch];

switch( state )

{

case 0 :

switch(char_class[next_ch] )

{

case WHITE_CH :     i = 0; break;

case LETTER_CH :    state =  1; break;

case LFT_PAREN_CH : state = -2; break;

case RGT_PAREN_CH : state = -3; break;

case AMPERSAND_CH : state = -4; break;

case BAR_CH :       state = -5; break;

case CARET_CH :     state = -6; break;

case EOS_CH :       state = -7; break;

case DIGIT_CH :     state = -8; break;

case OTHER_CH :     state = -8; break;

default :           state =-8; break;

}

break;

case 1 :

if ( (DIGIT_CH != char_class[next_ch])

&& (LETTER_CH != char_class[next_ch]) )

{

ungetc(next_ch, stream );

term[i-1] = '\0';

state = -1;

}

break;

default : state = -8; break;

}

}

/* Part 2: Coerce the final state to return the token type */

return( (TokenType) (-state) );

} /* GetToken */

Figure 7.3: Code for a simple query lexical analyzer

The algorithm begins in state 0. As each input character is consumed, a switch on the state determines the transition. Input is consumed until a final state (indicated by a negative state number) is reached. When recognizing a term, the algorithm keeps reading until some character other than a letter or a digit is found. Since this character may be part of another token, it must be pushed back on the input stream. The final state is translated to a token type value by changing its sign and coercing it to the correct type (this was the point of matching the token type values to the final machine states).

/*FN***********************************************************************

main(argc, argv )

Returns: int -- 0 on success, 1 on failure

Purpose: Program main function

Plan:    Part 1: Open a file named on the command line

Part 2: List all the tokens found in the file

Part 3: Close the file and return
