Information Extraction is designed to pull specific data out of high volumes
of text using robust means. It is a very large field with hundreds of
researchers. An excellent tutorial, from which much of this summary is drawn,
is available online at: http://www.ai.sri.com/~appelt/ie-tutorial.
The DARPA-sponsored MUC (Message Understanding Conference) and TIPSTER
programs advanced information extraction technology.
Information Extraction is difficult even for a trained analyst. For various
aspects of information extraction, inter-annotator agreement of between 60% and
80% has been reported. Current state-of-the-art systems reach around 60% of
human performance. The general consensus is that this 60% corresponds to the
proportion of information that is expressed in a relatively straightforward
way and does not require complex syntactic analysis or domain-specific
knowledge.
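
To make the agreement figures concrete, here is a minimal sketch of one way to
score how closely two annotators agree. The source does not say which measure
was used in the reported studies; the overlap score, the function name
`agreement`, and the sample annotations are all assumptions for illustration.

```python
def agreement(annotator_a, annotator_b):
    """Return the share of extracted facts on which two annotators agree.

    Each argument is a collection of (slot, value) pairs. The score is the
    number of shared pairs divided by the number of distinct pairs produced
    by either annotator (Jaccard overlap). This is one simple choice among
    several possible agreement measures, not necessarily the one behind the
    figures quoted above.
    """
    a, b = set(annotator_a), set(annotator_b)
    if not a and not b:
        return 1.0  # nothing to extract, so nothing to disagree about
    return len(a & b) / len(a | b)


# Hypothetical annotations of the same sentence by two analysts.
analyst_1 = {("company", "Acme"), ("position", "CEO"),
             ("person", "J. Smith"), ("date", "May")}
analyst_2 = {("company", "Acme"), ("position", "CEO"),
             ("person", "John Smith"), ("date", "May")}

score = agreement(analyst_1, analyst_2)
print(f"agreement: {score:.0%}")
```

The two analysts differ only in how they wrote the person's name, yet the
score drops well below 100%, which suggests why even trained analysts rarely
agree completely.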
Two broad approaches exist for Information Extraction:
Knowledge Engineering Approach
A person hand-writes the knowledge needed to extract information, in the form
of grammars and rules. This requires skill, labor, and familiarity with both
the domain and the tools. System building is iterative, following a loop:
write rules, test them against a data set, examine the results, and propose
changes. Given a properly designed system, a good undergraduate can write
extraction rules after about a week of training, which bodes well for the
Knowledge Engineering approach. However, it requires the time, inclination,
and ability to write rules, along with lexicons and source data to test
against.
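
The write, test, examine, and revise loop described above can be sketched as
follows. This is only an illustration under stated assumptions: the rule
format (a regular expression), the management-succession example, and the
names `RULES`, `extract`, `evaluate`, and `TEST_SET` are hypothetical and do
not come from any particular extraction system.

```python
import re

# Hypothetical hand-written rule:
#   "<Person> was named|appointed <position> of <Company>"
RULES = [
    re.compile(
        r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)+) was (?:named|appointed) "
        r"(?P<position>[a-z ]+?) of (?P<company>[A-Z][A-Za-z]+)"
    ),
]


def extract(sentence):
    """Apply each rule in order and return the first match as a dict."""
    for rule in RULES:
        match = rule.search(sentence)
        if match:
            return match.groupdict()
    return None


def evaluate(test_set):
    """Return the cases where the rules disagree with the expected answer."""
    failures = []
    for sentence, expected in test_set:
        got = extract(sentence)
        if got != expected:
            failures.append((sentence, expected, got))
    return failures


# A tiny hand-annotated data set to test the rules against.
TEST_SET = [
    ("Jane Doe was named president of Acme.",
     {"person": "Jane Doe", "position": "president", "company": "Acme"}),
    ("Acme appointed John Roe as chairman.",
     {"person": "John Roe", "position": "chairman", "company": "Acme"}),
]

failures = evaluate(TEST_SET)
for sentence, expected, got in failures:
    print("MISSED:", sentence, "| expected:", expected, "| got:", got)
```

The single rule handles the first sentence but misses the second, which is
phrased in the active voice. Examining that failure is what prompts the rule
writer to propose the next rule, and the loop repeats.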
Automatic Training Approach
This method collects many example sentences containing the data to be
extracted and runs a learning procedure over them to generate extraction
rules. It requires someone who knows what information to extract and a large
quantity of example text to mark up. Automatic Training focuses on producing
training data and only requires people familiar with the domain to annotate
text.
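
As a rough illustration of generating rules from annotated examples, the toy
learner below records the word that immediately precedes each annotated value
and keeps the words seen often enough to be trusted as triggers. It is a
deliberately simplified stand-in, not the learning procedure of any real
system; `learn_triggers`, `apply_triggers`, the `min_count` threshold, and the
training sentences are assumptions made for this sketch.

```python
from collections import Counter


def learn_triggers(examples, min_count=2):
    """Learn trigger words from annotated examples.

    Each example is (tokens, target_index), where target_index marks the
    token a domain expert annotated as the value to extract. A word becomes
    a trigger if it directly precedes an annotated value at least
    min_count times.
    """
    counts = Counter()
    for tokens, target_index in examples:
        if target_index > 0:
            counts[tokens[target_index - 1].lower()] += 1
    return {word for word, n in counts.items() if n >= min_count}


def apply_triggers(tokens, triggers):
    """Extract every token that directly follows a learned trigger word."""
    return [tokens[i + 1] for i in range(len(tokens) - 1)
            if tokens[i].lower() in triggers]


# Hypothetical training data: the annotator marked the location in each sentence.
TRAINING = [
    ("The firm is based in Boston".split(), 5),
    ("She opened an office in Chicago".split(), 5),
    ("Talks were held near Geneva".split(), 4),
]

triggers = learn_triggers(TRAINING)
found = apply_triggers("The plant in Denver closed".split(), triggers)
print(triggers, found)
```

With only three examples, "near" is seen once and is discarded, which hints
at why this approach needs a large quantity of marked-up text.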
Bigger is not always better. Larger dictionaries or lexicons tend to include
rarer senses of words and phrases, which can throw systems off. There is still
a place for recognizing large lists of specific terms, such as names and
places: one component will identify proper names and locations, and here the
system can leverage existing data sources.
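
The trade-off can be seen in a minimal list-lookup sketch. The two word lists,
the case-insensitive matching, and the function name `find_terms` are
assumptions chosen for illustration. Reading, Mobile, and Nice are real place
names that are far more often ordinary words.

```python
def find_terms(text, gazetteer):
    """Return the tokens of text whose lowercase form is in the gazetteer."""
    tokens = [t.strip(".,;:!?") for t in text.split()]
    return [t for t in tokens if t.lower() in gazetteer]


# A small list of well-known places, and a larger one that also covers
# places whose names are usually ordinary English words.
SMALL = {"paris", "london", "boston"}
LARGE = SMALL | {"reading", "mobile", "nice"}

text = "Reading the Boston report on a mobile phone was nice."

small_hits = find_terms(text, SMALL)
large_hits = find_terms(text, LARGE)
print(small_hits)
print(large_hits)
```

The larger list finds the same true place, Boston, but adds three false hits
because it admits rarer senses of common words.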
Since extraction systems tend to look for simple relationships and are
confined to a finite set of domain events and relationships, shallow analysis
is adequate most of the time. Shallow parsing makes mistakes, but multiple
sources help fill gaps and make corrections.
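
One possible reading of "multiple sources" is several reports that describe
the same event. Under that assumption, the sketch below merges partial
extractions slot by slot with a simple majority vote. The template layout, the
voting scheme, and the name `merge_templates` are hypothetical and are not
taken from the source.

```python
from collections import Counter


def merge_templates(templates):
    """Combine partial extractions of one event into a single template.

    For each slot, empty (None) values are ignored and the value reported
    most often across the sources is kept. Missing slots are thereby filled
    from other sources, and an isolated error is outvoted.
    """
    votes = {}
    for template in templates:
        for slot, value in template.items():
            if value is not None:
                votes.setdefault(slot, Counter())[value] += 1
    return {slot: counter.most_common(1)[0][0]
            for slot, counter in votes.items()}


# Hypothetical shallow-parse output from three reports of the same event.
reports = [
    {"person": "Jane Doe", "position": "president", "company": None},
    {"person": "Jane Doe", "position": None, "company": "Acme"},
    {"person": "Jane Dole", "position": "president", "company": "Acme"},
]

merged = merge_templates(reports)
print(merged)
```

No single report yields a complete and correct template here, but the merged
result fills the empty slots and drops the misread name.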