Information Extraction

Information Extraction aims to pull specific data out of large volumes of text using robust methods. It is a very large field with hundreds of active researchers. An excellent tutorial, from which much of this summary is drawn, is available online at: http://www.ai.sri.com/~appelt/ie-tutorial.

DARPA's MUC (Message Understanding Conference) and TIPSTER programs advanced information extraction technology.

Information Extraction is difficult even for a trained analyst. For various aspects of information extraction, inter-annotator agreement of between 60% and 80% has been reported. Current state-of-the-art systems reach around 60% of human performance. The general consensus is that this 60% figure reflects the proportion of information that is expressed in a relatively straightforward way and can be extracted without complex syntactic analysis or domain-specific knowledge.

Two broad approaches exist for Information Extraction:

Knowledge Engineering Approach
A person hand-writes grammars and rules that encode the knowledge needed to extract information. This requires skill, labor, and familiarity with both the domain and the tools. System building is iterative, following a loop: write rules, test them against a data set, examine the results, and propose changes. Given a properly designed system, a good undergraduate can write extraction rules after about a week of training, which bodes well for the Knowledge Engineering approach. However, the approach requires the time, inclination, and ability to write rules, along with lexicons and source data to test against.
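
The write, test, examine loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rule, the "appointment" event type, the function names, and the two labeled sentences are all invented here and do not come from any particular system.

```python
import re

# Hypothetical hand-written rule for sentences shaped like
# "<Person> was appointed <title> of <Organization>".
APPOINTMENT_RULE = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)+) was (?:appointed|named) "
    r"(?P<title>[a-z ]+?) of "
    r"(?P<org>[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*)"
)

def extract_appointments(text):
    """Apply the rule and return one dict of slots per match."""
    return [m.groupdict() for m in APPOINTMENT_RULE.finditer(text)]

# Invented test data: (sentence, set of expected (person, title, org) tuples).
LABELED = [
    ("Jane Smith was appointed chief executive of Acme Corp.",
     {("Jane Smith", "chief executive", "Acme Corp")}),
    ("The board named John Doe president of Widget Inc.",
     {("John Doe", "president", "Widget Inc")}),
]

def evaluate(examples):
    """Return (precision, recall) of the rule over labeled examples."""
    tp = fp = fn = 0
    for text, expected in examples:
        found = {(d["person"], d["title"], d["org"])
                 for d in extract_appointments(text)}
        tp += len(found & expected)   # correct extractions
        fp += len(found - expected)   # spurious extractions
        fn += len(expected - found)   # missed extractions
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

if __name__ == "__main__":
    print(evaluate(LABELED))
```

On this toy data the rule finds the first appointment but misses the second, because "The board named ..." does not fit the "was appointed" pattern. Examining that miss and proposing a second rule is exactly the next turn of the loop.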

Automatic Training Approach
This method collects many example sentences in which the data to be extracted has been marked, and runs a learning procedure over them to generate extraction rules. This requires someone who knows what information to extract and a large quantity of example text to mark up. Automatic Training focuses on producing training data and only requires people familiar with the domain to annotate text.
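
As a rough illustration of the idea, the sketch below learns a trivial kind of extraction rule from marked-up sentences: the word that most often appears immediately before an annotated organization. The `<org>` markup, the function names, the `min_count` threshold, and the training sentences are assumptions made for this example; real learning procedures are considerably more sophisticated.

```python
import re
from collections import Counter

TAG = re.compile(r"<org>(.+?)</org>")

def learn_left_contexts(annotated, min_count=2):
    """Keep words seen at least min_count times right before an annotated span."""
    counts = Counter()
    for sentence in annotated:
        for m in TAG.finditer(sentence):
            before = sentence[:m.start()].split()
            if before:
                counts[before[-1].lower()] += 1
    return {word for word, c in counts.items() if c >= min_count}

def extract(sentence, contexts):
    """Extract runs of capitalized words that follow a learned context word."""
    tokens = sentence.split()
    results = []
    i = 0
    while i < len(tokens):
        if tokens[i].lower() in contexts:
            j = i + 1
            span = []
            while j < len(tokens) and tokens[j][0].isupper():
                span.append(tokens[j].strip(".,;"))
                j += 1
            if span:
                results.append(" ".join(span))
            i = j
        else:
            i += 1
    return results

# Invented annotated examples, standing in for a domain expert's markup.
TRAINING = [
    "She joined <org>Acme Corp</org> in 1998.",
    "He joined <org>Widget Inc</org> last year.",
    "Analysts at <org>Globex</org> disagreed.",
    "The plant owned by <org>Initech</org> closed.",
]

if __name__ == "__main__":
    contexts = learn_left_contexts(TRAINING)
    print(contexts)
    print(extract("Maria Lopez joined Northwind Traders on Monday.", contexts))
```

With only four examples, "joined" is the only context seen often enough to become a rule, which hints at why this approach needs a large quantity of annotated text.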

Bigger is not always better. Larger dictionaries or lexicons tend to include rarer senses of words and phrases, which can throw systems off. There is still a place for large lists of specific terms, such as names and places: one component of a system will be the identification of proper names and locations, and here the system can leverage existing data sources.
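
A small longest-match list lookup shows both points. The gazetteer entries, labels, and function name below are made up for illustration, and the lookup is deliberately case-insensitive, a simplification that exaggerates the rare-sense problem.

```python
# Hypothetical gazetteers: keys are lowercased token tuples.
SMALL = {
    ("new", "york"): "LOCATION",
    ("new", "york", "times"): "ORGANIZATION",
    ("paris",): "LOCATION",
}
# A "bigger" list that also knows the French city of Nice.
LARGE = {**SMALL, ("nice",): "LOCATION"}

def tag_entities(text, gazetteer):
    """Greedy longest-match lookup of token sequences in the gazetteer."""
    tokens = [t.strip(".,;:") for t in text.split()]
    lowered = [t.lower() for t in tokens]
    max_len = max(len(key) for key in gazetteer)
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so "New York Times" beats "New York".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(lowered[i:i + n])
            if key in gazetteer:
                found.append((" ".join(tokens[i:i + n]), gazetteer[key]))
                i += n
                break
        else:
            i += 1
    return found

if __name__ == "__main__":
    print(tag_entities("The New York Times opened an office in Paris.", SMALL))
    print(tag_entities("They found a nice office in Paris.", SMALL))
    print(tag_entities("They found a nice office in Paris.", LARGE))
```

With the larger list, the adjective "nice" in the last sentence is tagged as a location, the kind of rare-sense error the paragraph above warns about.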

Since extraction systems tend to look for simple relationships and are confined to a finite set of domain events and relationships, shallow analysis is adequate most of the time. Shallow parsing makes mistakes, but multiple sources help fill gaps and make corrections.
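
To make "shallow analysis" concrete, here is a minimal sketch of a noun-group recognizer that scans part-of-speech-tagged tokens for the pattern optional determiner, any adjectives, one or more nouns. The tag names (DET, ADJ, NOUN, VERB), the function name, and the pre-tagged sentence are assumptions for this example; no full parse tree is built.

```python
def noun_groups(tagged):
    """Find DET? ADJ* NOUN+ sequences in a list of (word, tag) pairs."""
    groups = []
    i = 0
    n = len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DET":          # optional determiner
            j += 1
        while j < n and tagged[j][1] == "ADJ":   # any number of adjectives
            j += 1
        first_noun = j
        while j < n and tagged[j][1] == "NOUN":  # one or more nouns
            j += 1
        if j > first_noun:
            groups.append(" ".join(word for word, _ in tagged[i:j]))
            i = j
        else:
            i += 1                          # no noun group starts here
    return groups

# Invented, hand-tagged example sentence.
SENTENCE = [
    ("the", "DET"), ("rebel", "ADJ"), ("group", "NOUN"),
    ("attacked", "VERB"),
    ("a", "DET"), ("power", "NOUN"), ("station", "NOUN"),
]

if __name__ == "__main__":
    print(noun_groups(SENTENCE))
```

A domain rule can then look for a simple relationship such as "noun group, attack verb, noun group" without ever resolving the full syntactic structure of the sentence.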