For information extraction a full-scale syntax analysis is
normally not required, but given the domain and variety of text to be analyzed,
having a broad coverage parser is useful. The Link Grammar Parser by Sleator
and Temperly at Carnegie-Mellon University is based on a formal grammar system
that has equivalent power to context-free grammars and is as powerful as most
modern natural language parsers. The system operates by forming links between
words, hence the name. The parser can handle: noun-verb agreement, questions,
imperatives, complex and irregular verbs, past-participles, present
participles, punctuation, different adjectives, prepositions, adverbs, relative
clauses, possessives and other linguistic phenomena. It can robustly skip
elements that it cannot process and continue to produce meaningful results for
the portions it does understand. The algorithm performance is O(n3) where n is
the number of words in a sentence. Tests in 1995 showed 80% coverage for Wall
Street Journal sentences. A sentence is recognized by the link grammar if:
The linking requirement of each word is stored in the dictionary.
Each word has links it requires and links it can fill. Connecting the links
between words corresponds to a partial parse of a sentence. A number of
heuristics are used to order and weight the possible parses of ambiguous
sentences:
The Link Grammar Parser has features that make it
desirable for a first phase information extraction tool. The Link Grammar parser
has parameters to specify time and space boundaries on analysis. Once the
boundaries are exceeded the parser goes into "panic" mode, executing
a "light" analysis and the best result is delivered in a short time.
If detailed analysis is required, one option is to send individual sentences of
interest to the parser with boundaries set very high. This allows the handling
of exceptions without penalizing normal processing. System performance can be
improved with sub-language restrictions on the dictionary using overall context
to select the proper sub-language to process an article.
The TABARI sparse parser is an open source high-speed
parser used to encode and extract information about political events. It codes
events as