Natural Language Processing for Social Media. Diana InkpenЧитать онлайн книгу.
for Tokenizers
Horsmann and Zesch [2016] evaluated a method for dealing with token boundaries consisting of three steps. First, the researchers split the text according to the white space characters. Then they employed regular expressions to refine the splitting of alpha-numerical text segments from punctuation characters in special character sequences such as similes. Finally, these sequences of punctuation are reassembled. They merge the most common combinations of characters into a single token using the training data, and use word lists to merge abbreviations with their following dot character. They increase accuracy in the experiment using more in-domain training data.
Table 2.2: Examples of tokenization
Evaluation Measures for Tokenizers
Accuracy is a simple measure that calculates how many correct decisions a tool makes. When not all the expected tokens are retrieved, precision and recall are the measure to report. The precision of the tokens recognition measures how many tokens are correct out of how many were found. Recall measures the coverage (from the tokens that should have been retrieved, how many were found). F-measure (or F-score) is often reported when one single number is needed, because F-measure is the harmonic mean of the precision and recall, and it is high only when both the precision and the recall are high.2 Evaluation measures are rarely reported for tokenizers, one exception being the CleanEval shared task which focused on tokenizing text from web pages [Baroni et al., 2008].
Many NLP projects tend to not mention what kind of tokenization they used, and focus more on higher-level processing. Tokenization, however, can have a large effect on the results obtained at the next levels. For example, Fokkens et al. [2013] replicated two high-level tasks from previous work and obtained very different results, when using the same settings but different tokenization.
Adapting Tokenizers to Social Media Texts
Tokenizers need to deal with the specifics of social media texts. Emoticons need to be detected as tokens. For Twitter messages, user names (starting with @), hashtags (starting with #), and URLs (links to web pages) should be treated as tokens, without separating punctuation or other symbols that are part of the token. Some shallow normalization can be useful at this stage. Derczynski et al. [2013b] tested a tokenizer on Twitter data, and its F-measure was around 80%. By using regular expressions designed specifically for Twitter messages, they were able to increase the F-measure to 96%. More about such regular expressions can be found in [O’Connor et al., 2010].
2.4 PART-OF-SPEECH TAGGERS
Part-of-speech (POS) taggers determine the part of speech of each word in a sentence. They label nouns, verbs, adjectives, adverbs, interjections, conjunctions, etc. Often they use finer-grained tagsets, such as singular nouns, plural nouns, proper nouns, etc. Different tagsets exist, one of the most popular being the Penn TreeBank tagset3 [Marcus et al., 1993]. See Table 2.3 for one of its more popular lists of the tags. The models embedded in the POS taggers are often complex, based on Hidden Markov Models [Baum and Petrie, 1966], Conditional Random Fields [Lafferty et al., 2001], etc. They need annotated training data in order to learn probabilities and other parameters of the models.
Methods for Part-of-speech Taggers
Horsmann and Zesch [2016] trained a CRF classifier [Lafferty et al., 2001] using the FlexTag tagger [Zesch and Horsmann, 2016] There are two adaptations involved in this method. The first is a general domain adaptation. The researchers applied a domain adaption strategy, which they proposed as a competitive model to improve the accuracy for tagging social media texts. To train their model, they used the CMC and Web corpora subsets from the EmpiriST shared task and some additional 100,000 tokens of newswire text from the Tiger corpus. The second adaptation is specific to the EmpiriST shared task. Because some PoS tags are too rare to be learned from training data, the researchers utilized a post-processing step that leveraged heuristics. This step involved the use of regular expressions and word lists from Wikipedia and Wiktionary to improve named entity recognition and case-insensitive matching. Selecting tags from the larger Tiger corpus introduced bias, so the researchers added extra Boolean features to their model.
Evaluation Measures for Part-of-speech Taggers
The accuracy of the tagging is usually measured as the number of tags correctly assigned out of the total number of words/tokens being tagged.
Adapting Part-of-speech Taggers
POS taggers clearly need re-training in order to be usable on social media data. Even the set of POS tags used must be extended in order to adapt to the needs of this kind of text. Ritter et al. [2011] used the Penn TreeBank tagset (Table 2.3) to annotate 800 Twitter messages. They added a few new tags for the Twitter-specific phenomena: retweets, @usernames, #hashtags, and URLs. Words in these categories can be tagged with very high accuracy using simple regular expressions, but they still need to be taken into consideration as features in the re-training of the taggers (for example as tags of the previous word to be tagged). In Ritter et al. [2011], the POS tagging accuracy drops from about 97% on newspaper text to 80% on the 800 tweets. These numbers are reported for the Stanford POS tagger [Toutanova et al., 2003]. Their POS tagger T-POS—based on a Conditional Random Field classifier and on the clustering of out-of-vocabulary (OOV) words—also obtained low performance on Twitter data (81%). By retraining the T-POS tagger on the annotated Twitter data (which is rather small), the accuracy increases to 85%. The best accuracy raises to 88% when the size of the training data is increased by adding to the Twitter data the initial Penn TreeBank training data, plus 40,000 tokens of annotated Internet Relay Chat (IRC) data [Forsyth and Martell, 2007], which is similar in style to Twitter data. Similar numbers are reported by Derczynski et al. [2013b] on a part of the same Twitter dataset.
Table 2.3: Penn TreeBank tagset
A key reason for the drop in accuracy on Twitter data is that the data contains far more OOV words than grammatical text. Many of these OOV words come from spelling variation, e.g., the use of the word n for in in Example 3 from Table 2.1 The tag for proper nouns (NNP) is the most frequent tag for OOV words, while in fact only about one third are proper nouns.
Gimpel et al. [2011] developed a new POS tagset for Twitter (see Table 2.4), that is more coarse-grained, and it pays particular attention to punctuation, emoticons, and Twitter-specific tags (@usernames, #hashtags, URLs). They manually tagged 1,827 tweets with the new tagset; then, they trained a POS tagging model that uses features geared toward Twitter text. The experiments conducted to evaluate the model showed 90% accuracy for the POS tagging task. Owoputi et al. [2013] improved on the model by using word clustering techniques and trained the POS tagger on a better dataset of tweets and chat messages.4
2.5 CHUNKERS AND PARSERS
A chunker detects noun phrases, verb phrases, adjectival phrases, and adverbial phrases, by determining the start point and the end point of every such phrase. Chunkers are often referred to as shallow parsers because they do not attempt to connect the phrases in order to detect the syntactic structure of the whole sentence.
A parser performs the syntactic analysis of a sentence, and usually produces a parse tree. The trees are often used in future processing stages, toward semantic analysis or information extraction.
A dependency parser extracts pairs of words that are in a syntactic dependency relation, rather than a parse tree. Relations can be verb-subject, verb-object, noun-modifier, etc.