Natural Language Processing for Social Media. Diana Inkpen
also considered the accuracy for the dialect groups. Figure 2.6 shows the results of the three character n-gram Markov language models when classifying into the six groups of divisions defined in Figure 2.4. As in Figure 2.5, the bigram and trigram character Markov language models performed almost the same, although the F-measure of the bigram model was higher than that of the trigram model for every dialect group except Egyptian. On average over all dialects, therefore, the character-based bigram language model performed better than the character-based unigram and trigram models.
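To make the idea of a character n-gram Markov language model classifier concrete, here is a minimal sketch: one model is trained per dialect label, and a new text is assigned to the label whose model gives it the highest log-probability. The class name `CharNgramLM`, the add-one smoothing, the padding symbols, and the toy strings with labels "A" and "B" are all assumptions made for illustration; they are not the setup or data of Sadat et al. [2014a].

```python
import math
from collections import Counter


class CharNgramLM:
    """Character n-gram Markov language model with add-one smoothing (illustrative)."""

    def __init__(self, n):
        self.n = n
        self.ngrams = Counter()    # counts of n-character sequences
        self.contexts = Counter()  # counts of the (n-1)-character histories
        self.vocab = set()         # characters seen in training

    def _pad(self, text):
        # Pad so that the first characters also have a full history.
        return "^" * (self.n - 1) + text + "$"

    def train(self, texts):
        for text in texts:
            padded = self._pad(text)
            self.vocab.update(padded)
            for i in range(len(padded) - self.n + 1):
                gram = padded[i:i + self.n]
                self.ngrams[gram] += 1
                self.contexts[gram[:-1]] += 1

    def log_prob(self, text):
        padded = self._pad(text)
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen characters
        total = 0.0
        for i in range(len(padded) - self.n + 1):
            gram = padded[i:i + self.n]
            total += math.log((self.ngrams[gram] + 1) / (self.contexts[gram[:-1]] + v))
        return total


def train_models(labelled_texts, n):
    """Train one language model per label from (label, text) pairs."""
    by_label = {}
    for label, text in labelled_texts:
        by_label.setdefault(label, []).append(text)
    models = {}
    for label, texts in by_label.items():
        models[label] = CharNgramLM(n)
        models[label].train(texts)
    return models


def classify(models, text):
    """Return the label whose model assigns the text the highest log-probability."""
    return max(models, key=lambda label: models[label].log_prob(text))


# Hypothetical toy data: made-up strings standing in for two dialects.
toy_data = [
    ("A", "abab abba baba"),
    ("A", "abba abab"),
    ("B", "xyxy yxxy xyyx"),
    ("B", "yxyx xyxy"),
]
models = train_models(toy_data, n=2)  # n=1, 2, 3 give unigram, bigram, trigram models
print(classify(models, "abba"))
```

Changing `n` in `train_models` switches between the unigram, bigram, and trigram variants compared in the text.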
Figure 2.7 shows the results of the n-gram models with Naïve Bayes classifiers for the different countries, while Figure 2.8 shows the results of the same models for the six divisions according to Figure 2.4. The Naïve Bayes classifiers based on character unigrams, bigrams, and trigrams achieve better results than the corresponding character-based unigram, bigram, and trigram Markov language models. An overall F-measure of 72% and an accuracy of 97% were observed for the 18 Arabic dialects. Furthermore, the Naïve Bayes classifier based on the bigram model has an overall F-measure of 80% and an accuracy of 98%, except for the Palestinian dialect, because of the small size of the data. The Naïve Bayes classifier based on the trigram model showed an overall F-measure of 78% and an accuracy of 98%, except for the Palestinian and Bahraini dialects. This classifier could not distinguish between the Bahraini and the Emirati dialects because of the similarities in their three affixes. In addition, the Naïve Bayes classifier based on character bigrams performed better than the classifier based on character trigrams, according to Figure 2.7. Also, as shown in Figure 2.8, for the dialect groups the Naïve Bayes classifier based on the character bigram model was more accurate than the other two models (unigrams and trigrams).
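The following is a minimal sketch of a Naïve Bayes classifier over character n-grams, to show how it differs from the Markov language models: each text is treated as a bag of n-gram features, and a class prior is combined with smoothed per-class n-gram probabilities. The class name `CharNgramNaiveBayes`, the multinomial formulation, the smoothing parameter `alpha`, and the toy strings are assumptions for illustration only; the source does not specify which Naïve Bayes variant or toolkit was used.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n):
    """Return the list of overlapping character n-grams in a text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class CharNgramNaiveBayes:
    """Multinomial Naive Bayes over character n-gram counts (illustrative)."""

    def __init__(self, n, alpha=1.0):
        self.n = n
        self.alpha = alpha                         # additive smoothing constant
        self.feature_counts = defaultdict(Counter)  # label -> n-gram counts
        self.doc_counts = Counter()                 # label -> number of training texts
        self.vocab = set()                          # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.feature_counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        v = len(self.vocab)
        grams = [g for g in char_ngrams(text, self.n) if g in self.vocab]
        best_label, best_score = None, -math.inf
        for label, counts in self.feature_counts.items():
            # Log prior plus the sum of smoothed log likelihoods of known n-grams.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + self.alpha * v
            for gram in grams:
                score += math.log((counts[gram] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data: made-up strings standing in for two dialects.
texts = ["abab abba baba", "abba abab", "xyxy yxxy xyyx", "yxyx xyxy"]
labels = ["A", "A", "B", "B"]
nb = CharNgramNaiveBayes(n=2).fit(texts, labels)
print(nb.predict("abba"))
```

Unlike the Markov model, this sketch ignores the order in which the n-grams occur and conditions only on the class, which is the usual Naïve Bayes independence assumption.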
Figure 2.7: Accuracies on the character-based n-gram Naïve Bayes classifiers for 18 countries [Sadat et al., 2014a].
Recently, Zaidan and Callison-Burch [2014] created a large monolingual data set rich in dialectal Arabic content, called the Arabic Online Commentary Dataset. They used crowdsourcing to annotate the texts with dialect labels. They also presented experiments on the automatic classification of the dialects in this data set, using similar word-based and character-based language models. The best results were around 85% accuracy for distinguishing MSA from dialectal data, with lower accuracies for identifying the correct dialect in the latter case. They then applied the classifiers to discover new dialectal data in a large Web crawl consisting of 3.5 million pages mined from online Arabic newspapers.
Figure 2.8: Accuracies on the character-based n-gram Naïve Bayes classifiers for the six divisions/groups [Sadat et al., 2014a].
Several other projects focused on Arabic dialects: classification [Tillmann et al., 2014], code switching [Elfardy and Diab, 2013], and collecting a Twitter corpus for several dialects [Mubarak and Darwish, 2014].
2.9 SUMMARY
This chapter discussed the issue of adapting NLP tools to social media texts. One way is to use text normalization techniques in order to make the text closer to the standard, carefully edited texts on which NLP tools are usually trained. The normalization that can be achieved in practice is rather shallow, and it does not seem to help much in improving the performance of the tools. The second way of adapting the tools is to re-train them on annotated social media data. This significantly improves the performance, although the amount of annotated data available for re-training is still small. Further development of annotated data sets for social media is needed in order to reach very high levels of performance.
In the next chapter, we will look at advanced methods for various NLP tasks for social media texts. These tasks use as components some of the tools discussed in this chapter.
1 https://www.cs.waikato.ac.nz/ml/weka/
2 https://scikit-learn.org/stable/
4 https://www.tensorflow.org/
5 https://sites.google.com/site/empirist2015/
6 The F-score usually gives the same weight to precision and to recall, but it can weight one of them more when needed for an application.
7 http://www.comp.leeds.ac.uk/ccalas/tagsets/upenn.html
8 This data set is available at http://code.google.com/p/ark-tweet-nlp/downloads/list.
9 https://deepai.org/machine-learning-model/parseymcparseface
10 A bracketing is a pair of matching opening and closing brackets in a linearized tree structure.
11 http://www.ark.cs.cmu.edu/TweetNLP/#tweeboparser_tweebank
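Footnote 6 notes that the F-score can weight precision or recall more heavily. One common way to do this is the F-beta measure, sketched below; the footnote itself does not name a formula, so the choice of F-beta, the function name `f_beta`, and the example values are assumptions for illustration. With `beta=1` precision and recall count equally; `beta > 1` favors recall and `beta < 1` favors precision.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F-beta measure)."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0.0:
        # Defined as 0 here to avoid dividing by zero.
        return 0.0
    return (1.0 + b2) * precision * recall / denominator


# Hypothetical precision and recall values, chosen only to show the effect of beta.
p, r = 0.8, 0.4
print(f_beta(p, r))            # balanced F1
print(f_beta(p, r, beta=2.0))  # recall weighted more heavily
print(f_beta(p, r, beta=0.5))  # precision weighted more heavily
```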