2.3 NLP PIPELINES
An NLP pre-processing pipeline, as shown in Figure 2.1, typically consists of the following components:
• Tokenization.
• Sentence splitting.
• Part-of-speech tagging.
• Morphological analysis.
• Parsing and chunking.
Figure 2.1: A typical linguistic pre-processing pipeline.
The first task is typically tokenization, followed by sentence splitting, to chop the text into tokens (typically words, numbers, punctuation, and spaces) and sentences respectively. Part-of-speech (POS) tagging assigns a syntactic category to each token. When dealing with multilingual text such as tweets, an additional step of language identification may be added before these take place, as discussed in Chapter 8. Morphological analysis is not compulsory, but is frequently used in a pipeline, and essentially consists of finding the root form of each word (a slightly more sophisticated form of stemming or lemmatization). Finally, parsing and/or chunking tools may be used to analyze the text syntactically, identifying things like noun and verb phrases in the case of chunking, or performing a more detailed analysis of grammatical structure in the case of parsing.
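To make these stages concrete, the following is a minimal sketch of such a pipeline using NLTK (one of the toolkits introduced below). It assumes the relevant NLTK data packages have already been downloaded; the example sentence and the simple noun-phrase grammar are purely illustrative.

```python
import nltk
from nltk.stem import WordNetLemmatizer

# Requires NLTK data packages, e.g.:
#   nltk.download('punkt'); nltk.download('averaged_perceptron_tagger'); nltk.download('wordnet')

text = "The suspect committed two offences. He was fined heavily."

# Sentence splitting and tokenization (NLTK conventionally splits sentences first,
# then tokenizes each sentence)
sentences = nltk.sent_tokenize(text)
tokenized = [nltk.word_tokenize(s) for s in sentences]

# Part-of-speech tagging
tagged = [nltk.pos_tag(tokens) for tokens in tokenized]

# Morphological analysis, roughly approximated here by WordNet lemmatization
lemmatizer = WordNetLemmatizer()
lemmas = [[lemmatizer.lemmatize(word.lower()) for word, tag in sent] for sent in tagged]
print(lemmas)

# Chunking: a hand-written grammar for simple noun phrases
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
for sent in tagged:
    print(chunker.parse(sent))
```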
Concerning toolkits, GATE [4] provides a number of open-source linguistic preprocessing components under the LGPL license. It contains a ready-made pipeline for Information Extraction, called ANNIE, and also a large number of additional linguistic processing tools such as a selection of different parsers. While GATE does provide functionality for machine learning-based components, ANNIE is mostly knowledge-based, making for easy adaptation. Additional resources can be added via the plugin mechanism, including components from other pipelines such as the Stanford CoreNLP Tools. GATE components are all Java-based, which makes for easy integration and platform independence.
Stanford CoreNLP [5] is another open-source annotation pipeline framework, available under the GPL license, which can perform all the core linguistic processing described in this section, via a simple Java API. One of its main advantages is that it can be used on the command line without having to understand more complex frameworks such as GATE or UIMA, and this simplicity, along with the generally high quality of results, makes it widely used where simple linguistic information such as POS tags is required. Like ANNIE, most of the components other than the POS tagger are rule-based.
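Although CoreNLP itself is Java-based, it can also be reached from Python. The sketch below uses NLTK's CoreNLPParser wrapper rather than CoreNLP's native Java API, and it presumes a CoreNLP server has been downloaded and started separately on the default port 9000; it is an illustrative sketch only.

```python
from nltk.parse.corenlp import CoreNLPParser

# Assumes a CoreNLP server is already running, started separately with something like:
#   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
parser = CoreNLPParser(url='http://localhost:9000')

sentence = "The suspect committed two offences."

# Tokenization performed by the server
print(list(parser.tokenize(sentence)))

# Constituency parse of the same sentence
tree = next(parser.raw_parse(sentence))
tree.pretty_print()
```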
OpenNLP1 is an open-source machine learning-based toolkit for language processing, which uses maximum entropy and Perceptron-based classifiers. It is freely available under the Apache license. Like Stanford CoreNLP, it can be run from the command line or via a simple Java API. As in most other pipelines, the components further down the pipeline rely mainly on tokens and sentences; slightly unusually, however, the sentence splitter can be run either before or after the tokenizer.
NLTK [6] is an open-source Python-based toolkit, available under the Apache license. It is also very popular due to its simplicity and its command-line interface, particularly where Java-based tools are not a requirement. It provides a number of different variants for some components, both rule-based and learning-based.
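As an illustration of these variants, the sketch below contrasts three of NLTK's tokenizers and its default trained POS tagger on the same sentence; the example sentence and the hand-written regular expression are purely illustrative.

```python
import nltk
from nltk.tokenize import TreebankWordTokenizer, RegexpTokenizer, TweetTokenizer

sentence = "Don't worry, the fine was only $3,000."

# Rule-based tokenizer variants
print(TreebankWordTokenizer().tokenize(sentence))                       # Penn Treebank conventions
print(RegexpTokenizer(r"\$\d+(?:,\d{3})*|\w+|\S").tokenize(sentence))   # hand-written regular expression
print(TweetTokenizer().tokenize(sentence))                              # tuned for social media text

# The default pos_tag function, by contrast, uses a trained (perceptron) model
print(nltk.pos_tag(TreebankWordTokenizer().tokenize(sentence)))
```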
In the rest of this chapter, we will describe the individual pipeline components in more detail, using the relevant tools from these pipelines as examples.
2.4 TOKENIZATION
Tokenization is the task of splitting the input text into very simple units, called tokens, which generally correspond to words, numbers, and symbols, and are typically separated by white space in English. Tokenization is a required step in almost any linguistic processing application, since more complex algorithms such as part-of-speech taggers mostly require tokens as their input, rather than using the raw text. Consequently, it is important to use a high-quality tokenizer, as errors are likely to affect the results of all subsequent NLP components in the pipeline. Commonly distinguished types of tokens are numbers, symbols (e.g., $, %), punctuation, and words of different kinds, e.g., uppercase, lowercase, mixed case. A representation of a tokenized sentence is shown in Figure 2.2, where each pink rectangle corresponds to a token.
Figure 2.2: Representation of a tokenized sentence.
Tokenizers may add a number of features describing the token. These include details of orthography (e.g., whether the token is capitalized or not), and more information about the kind of token (whether it is a word, number, punctuation, etc.). Other components may also add features to the existing token annotations, such as their syntactic category, details of their morphology, and any cleaning or normalization (such as correcting a misspelled word). These will be described in subsequent sections and chapters. Figure 2.3 shows a token for the word offences in the previous example with some features added: the kind of token is a word, it is 8 characters long, and the orthography is lowercase.
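The sketch below computes a feature record of this kind for a single token. The feature names and values are modeled loosely on those produced by GATE's tokenizer, but the classification logic is a simplified assumption rather than the actual implementation.

```python
import re

PUNCT = set(".,;:!?'\"()[]-")

def token_features(token):
    """Build a GATE-style feature map for a single token (illustrative only)."""
    if token.isdigit():
        kind = "number"
    elif re.fullmatch(r"\w+", token):
        kind = "word"
    elif all(ch in PUNCT for ch in token):
        kind = "punctuation"
    elif re.fullmatch(r"\W+", token):
        kind = "symbol"
    else:
        kind = "other"

    # Orthography is only meaningful for word tokens
    orth = None
    if kind == "word":
        if token.isupper():
            orth = "allCaps"
        elif token.istitle():
            orth = "upperInitial"
        elif token.islower():
            orth = "lowercase"
        else:
            orth = "mixedCaps"

    return {"string": token, "kind": kind, "length": len(token), "orth": orth}

print(token_features("offences"))
# {'string': 'offences', 'kind': 'word', 'length': 8, 'orth': 'lowercase'}
```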
Tokenizing well-written text is generally reliable and reusable, since it tends to be domain-independent. However, such general-purpose tokenizers typically need to be adapted to work correctly with things like chemical formulae, Twitter messages, and other more specific text types. Another non-standard case is hyphenated words in English, which some tools treat as a single token and others treat as three (the two words, plus the hyphen itself). Some systems also perform a more complex tokenization that takes into account number combinations such as dates and times (for example, treating 07:56 as a single token), while other tools leave this to later processing stages, such as a Named Entity Recognition component. Another issue is the apostrophe: for example, where an apostrophe denotes a missing letter and effectively joins two words without a space between them, as in it's, or in French l'homme. German compound nouns suffer the opposite problem, since many words can be written together without a space. For German tokenizers, an extra module which splits compounds into their constituent parts can therefore be very useful, in particular for retrieval purposes. Such a segmentation module is also critical for defining word boundaries in many East Asian languages such as Chinese, which have no notion of white space between words.
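The sketch below illustrates a few of these cases by comparing NLTK's general-purpose word_tokenize with its Twitter-aware TweetTokenizer; the example sentences are invented, and the exact output depends on the NLTK version installed.

```python
from nltk.tokenize import word_tokenize, TweetTokenizer

examples = [
    "It's 07:56 and the state-of-the-art tokenizer isn't ready.",
    "@some_user loving the #NLP tutorial :-)",
]

tweet_tokenizer = TweetTokenizer()
for text in examples:
    # Compare how each tokenizer handles clitics, times, hyphens, hashtags, and emoticons
    print(word_tokenize(text))
    print(tweet_tokenizer.tokenize(text))
```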
Figure 2.3: Representation of a token with features.
Because tokenization generally follows a rigid set of constraints about what constitutes a token, pattern-based rule matching approaches are frequently used for these tools, although some tools do use other approaches. The OpenNLP TokenizerME,2 for example, is a trainable maximum entropy tokenizer. It uses a statistical model, based on a training corpus, and can be re-trained on a new corpus.
GATE’s ANNIE Tokenizer3 relies on a set of regular expression rules which are then compiled into a finite-state machine. It differs somewhat from most other tokenizers in that it maximizes efficiency by doing only very light processing, and enabling greater flexibility by placing the burden of deeper processing on other components later in the pipeline, which are more adaptable. The generic version of the ANNIE tokenizer is based on Unicode4 and can be used for any language which has similar notions of token and white space to English (i.e., most Western languages). The tokenizer can be adapted for different languages either by modifying the existing rules, or by adding some extra post-processing rules. For English, a specialized set of rules is available, dealing mainly with use of apostrophes in words such as don’t.
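As an illustration of this rule-based style, the toy tokenizer below compiles a handful of regular expression rules into a single pattern. The rules are invented for the example and are far simpler than the actual ANNIE grammar; in particular, a real English rule set would treat clitics such as don't more carefully.

```python
import re

# Illustrative only: a toy set of regex rules in the spirit of a rule-based tokenizer
TOKEN_RULES = [
    ("WORD_WITH_APOSTROPHE", r"[A-Za-z]+'[A-Za-z]+"),   # don't, it's, l'homme
    ("WORD",                 r"[A-Za-z]+"),
    ("NUMBER",               r"\d+(?:[.,:]\d+)*"),       # 3,000 or 07:56
    ("SPACE",                r"\s+"),
    ("PUNCT_OR_SYMBOL",      r"[^\w\s]"),
]
TOKENIZER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_RULES))

def tokenize(text):
    """Yield (kind, string) pairs for every match, skipping white space."""
    for match in TOKENIZER.finditer(text):
        if match.lastgroup != "SPACE":
            yield match.lastgroup, match.group()

print(list(tokenize("Don't fine him $3,000 at 07:56!")))
```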