Evaluation of statistical text normalisation techniques for Twitter
Authors
Sosamphan, P.
Liesaputra, Veronica
Yongchareon, Dr. Sira
Mohaghegh, Dr. Mahsa
Date
2016-11
Type
Conference Contribution - Paper in Published Proceedings
Keyword
Twitter
micro-blogs
noisy tweets
tweets
data normalisation
normalisation
big data
text cleansers
social media
statistical language models
lexical normalisation
Citation
Sosamphan, P., Liesaputra, V., Yongchareon, S., & Mohaghegh, M. (2016, November). Evaluation of Statistical Text Normalisation Techniques for Twitter. In A. Fred, J. Dietz, D. Aveiro, K. Liu, J. Bernardino, & J. Filipe (Eds.), Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management: KDIR (pp. 413-418). 1.
Abstract
One of the major challenges in the era of big data is how to ‘clean’ the vast amount of data produced, particularly on micro-blogging websites such as Twitter. Twitter messages, called tweets, are commonly written in ill-formed language that includes abbreviations, repeated characters, and misspelled words. These ‘noisy tweets’ require text normalisation techniques that detect the ill-formed words and convert them into standard English. Several techniques have been proposed to solve these issues; however, each has limitations and therefore cannot achieve good overall results on its own. This paper evaluates existing statistical normalisation methods, individually and in combination, in order to find the combination that can most efficiently clean noisy tweets at the character level, that is, tweets containing abbreviations, repeated letters, and misspelled words. Tested on our Twitter sample dataset, the best combination achieves a Bilingual Evaluation Understudy (BLEU) score of 88% and a Word Error Rate (WER) of 7%, both of which are better than the baseline model.
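For readers unfamiliar with the Word Error Rate reported in the abstract, the following is a minimal sketch of how WER is commonly computed: the word-level edit distance between a reference sentence and a system output, divided by the number of reference words. It is not the authors' evaluation code. The function name `word_error_rate` and the example tweets are assumptions made purely for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative sketch only; the paper's own evaluation script may
    tokenise or normalise text differently.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into nothing
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: a noisy tweet scored against its clean form.
clean = "see you tomorrow"
noisy = "c u tomorrow"
print(word_error_rate(clean, noisy))
```

A lower WER is better: a perfectly normalised tweet scores 0, whereas the unnormalised example above has two of its three reference words wrong.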
Publisher
IC3K
Copyright holder
Authors
Copyright notice
All rights reserved
