Natural Language Processing for Social Media. Diana InkpenЧитать онлайн книгу.
the input texts can be fed into classifiers, each text needs to be transformed into a set of features. For some linguistic tools, extracting the words is sufficient while for semantic tasks and applications the texts need to be transformed into vectors of numeric or discrete values. The simplest way to represent texts is using the Bag-of-Words model (BOW) (a word is present or not, possibility with with frequency information), or more advanced and less sparse vectors called word embeddings [Mikolov et al., 2013]. Linguistic features can be used in addition or instead of the word-based features. These representations are important especially for the semantic tasks and applications discussed in the next two chapters.
The remainder of this chapter is structured as follows. Section 2.2 discusses generic methods of adapting NLP tools to social media texts. The next five sections discuss NLP tools of interest: tokenizers, part-of-speech taggers, chunkers, parsers, and named entity recognizers, as well as adaptation techniques for each. Section 2.7 enumerates some of the existing toolkits that were adapted to social media texts in English. Section 2.8 discusses multi-lingual aspects and language identification issues in social media. Section 2.9 summarizes this chapter.
2.2 GENERIC ADAPTATION TECHNIQUES FOR NLP TOOLS
NLP tools are important because they need to be used before we can build any applications that aim to understand texts or extract useful information from texts. Many NLP tools are now available, with acceptable levels of accuracy on texts that are similar to the types of texts used for training the models embedded in these tools. Most of the tools are trained on carefully edited texts, usually newspaper texts, due to the wide availability of these kinds of texts. For example, the Penn TreeBank corpus, consisting of 4.5 million words of American English [Marcus et al., 1993], was manually annotated with part-of-speech tags and parse trees, and it is often the main resource used to train part-of-speech taggers and parsers.
Current NLP tools tend to work poorly on social media texts, because these texts are informal, not carefully edited, and they contain grammatical errors, misspellings, new types of abbreviations, emoticons, etc. They are very different than the types of texts used for training the NLP tools. Therefore, the tools need to be adapted in order to achieve reasonable levels of performance on social media texts.
Table 2.1 shows three examples of Twitter messages, taken from Ritter et al. [2011], just to illustrate how noisy the texts can be.
Table 2.1: Three examples of Twitter texts
No. | Example |
1 | The Hobbit has FINALLY started filming! I cannot wait! |
2 | @c@ Yess! Yess! It’s official Nintendo announced today that theyWill release the Nintendo 3DS in north America march 27 for $250 |
3 | Government confirms blast n #nuclear plants n #japan…don’t knw wht s gona happen nw… |
There are two ways to adapt NLP tools to social media texts. The first one is to perform text normalization so that the informal language becomes closer to the type of texts on which the tools were trained. The second one is to re-train the models inside the tool on annotated social media texts. Depending on the goal of the NLP application, a combination of the two techniques could be used, since both have their own limitations, as discussed below (see Eisenstein [2013b] for a more detailed discussion).
2.2.1 TEXT NORMALIZATION
Text normalization is a possible solution for overcoming or reducing linguistic noise. The task can be approached in two stages: first, the identification of orthographic errors in an input text, and second, the correction of these errors. Normalization approaches typically include a dictionary of known correctly spelled terms, and detects in-vocabulary and out-of-vocabulary (OOV) terms with respect to this dictionary. The normalization can be basic or more advanced. Basic normalization deals with the errors detected at the POS tagging stage, such as unknown words, misspelled words, etc. Advanced normalization is more flexible, taking a lightly supervised automatic approach trained on an external dataset (annotated with short forms vs. their equivalent long or corrected forms).
For social media texts, the normalization that can be done is rather shallow. Because of its informal and conversational nature, social media text cannot become carefully edited English. Similar issues appear in SMS text messages on phones, where short forms and phonetic abbreviations are often used to save the typing time. According to Derczynski et al. [2013b], text normalization in Twitter messages did not help too much in the named entity recognition task.
Twitter text normalization into traditional written English [Han and Baldwin, 2011] is not only difficult, but it can be viewed as a “lossy” translation task. For example, many of Twitter’s unique linguistic phenomena are due not only to its informal nature, but also to a set of authors that is heavily skewed toward younger ages and minorities, with heavy usage of dialects that are different than standard English [Eisenstein, 2013a, Eisenstein et al., 2011].
Figure 2.1: Methodology for tweet normalization. The dotted horizontal line separates the two steps (detecting the text to be normalized and applying normalization rules) [Akhtar et al., 2015].
Demir [2016] describes a method of context-tailored text normalization. The method considers contextual and lexical similarities between standard and non-standard words, in order to reduce noise. The non-standard words in the input context in a given sentence are tailored into a direct match, if there are possible shared contexts. A morphological parser is used to analyze all the words in each sentence. Turkish social media texts were used to evaluate the performance of the system. The dataset contains tweets (~11 GB) and clean Turkish texts (~ 6 GB). The system achieved state-of-the-art results on the 715 Turkish tweets.
Akhtar et al. [2015] proposed a hybrid approach for text normalization for tweets. Their methodology proceeds in two phases: the first one detects noisy text, and the second one uses various heuristic-based rules for normalization. The researchers trained a supervised learning model, using 3-fold cross validation to determine the best feature set. Figure 2.1 depicts a schematic diagram of the proposed approach. Their system yielded precision, recall, and F-measure values of 0.90, 0.72, and 0.80, respectively, for their test dataset.
Most practical applications leverage the simpler approach of replacing non-standard words with their standard counterparts as a “one size fits all” task. Baldwin and Li [2015] devised a method that uses a taxonomy of normalization edits. The researchers evaluated this method on three different downstream applications: dependency parsing, named entity recognition, and text-to-speech synthesis. The taxonomy of normalization edits is shown in Figure 2.2. The method categorizes edits at three levels of granularity and its results demonstrate that the targeted application of the taxonomy is an efficient approach to normalization.
Figure 2.2: Taxonomy of normalization edits [Baldwin and Li, 2015].
The effect of manual vs. automatic lexical normalization for dependency parsing was analyzed by van der Goot [2019]. They showed that for most categories, automatic normalization scores are close to manual normalization but the small differences are important to take into consideration when exploiting normalization in a pipeline setup.