Natural Language Parsers

The CMU Link Grammar Parser

For information extraction a full-scale syntax analysis is normally not required, but given the domain and variety of text to be analyzed, having a broad coverage parser is useful. The Link Grammar Parser by Sleator and Temperly at Carnegie-Mellon University is based on a formal grammar system that has equivalent power to context-free grammars and is as powerful as most modern natural language parsers. The system operates by forming links between words, hence the name. The parser can handle: noun-verb agreement, questions, imperatives, complex and irregular verbs, past-participles, present participles, punctuation, different adjectives, prepositions, adverbs, relative clauses, possessives and other linguistic phenomena. It can robustly skip elements that it cannot process and continue to produce meaningful results for the portions it does understand. The algorithm performance is O(n3) where n is the number of words in a sentence. Tests in 1995 showed 80% coverage for Wall Street Journal sentences. A sentence is recognized by the link grammar if:

  • The local requirements of each word is satisfied
  • The links do not cross
  • The words form a connected graph

The linking requirement of each word is stored in the dictionary. Each word has links it requires and links it can fill. Connecting the links between words corresponds to a partial parse of a sentence. A number of heuristics are used to order and weight the possible parses of ambiguous sentences:

  • The sum of the length of connections
  • The sum of the cost of disjunctions
  • The cost of the imbalances or unevenness of parse

The Link Grammar Parser has features that make it desirable for a first phase information extraction tool. The Link Grammar parser has parameters to specify time and space boundaries on analysis. Once the boundaries are exceeded the parser goes into "panic" mode, executing a "light" analysis and the best result is delivered in a short time. If detailed analysis is required, one option is to send individual sentences of interest to the parser with boundaries set very high. This allows the handling of exceptions without penalizing normal processing. System performance can be improved with sub-language restrictions on the dictionary using overall context to select the proper sub-language to process an article.

The TABARI Sparse Parser

The TABARI sparse parser is an open source high-speed parser used to encode and extract information about political events. It codes events as triplets by processing unmodified news reports to monitor international relations. The original system can encode 3000 events per second, which should translate into a reading rate of 10000 events per second per node. This capacity forms the ability to quickly skimming the large quantity of information expected, extract key relations, and flag passages for more detailed analysis by either the Link Grammar system or by human means. The sparse parser already contains an extensive dictionary, and has the ability to flag sentences too complex for it to handle.