
    Evaluation of statistical text normalisation techniques for Twitter

    Sosamphan, P.; Liesaputra, Veronica; Yongchareon, Dr. Sira; Mohaghegh, Dr Mahsa

    KDIR_SNET_paper.pdf (187.8Kb)
    Date
    2016-11
    Citation:
    Sosamphan, P., Liesaputra, V., Yongchareon, S., & Mohaghegh, M. (2016, November). Evaluation of Statistical Text Normalisation Techniques for Twitter. In A. Fred, J. Dietz, D. Aveiro, K. Liu, J. Bernardino, & J. Filipe (Eds.), Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management: KDIR (pp. 413-418). 1.
    Permanent link to Research Bank record:
    https://hdl.handle.net/10652/3808
    Abstract
    One of the major challenges in the era of big data is how to ‘clean’ vast amounts of data, particularly from micro-blog websites like Twitter. Twitter messages, called tweets, are commonly written in ill-formed language, including abbreviations, repeated characters, and misspelled words. These ‘noisy tweets’ require text normalisation techniques to detect the ill-formed words and convert them into more accurate English sentences. Several existing techniques have been proposed to solve these issues; however, each technique has some limitations and therefore cannot achieve good overall results. This paper evaluates individual existing statistical normalisation methods and their possible combinations in order to find the best combination that can efficiently clean noisy tweets at the character level, where the noise consists of abbreviations, repeated letters and misspelled words. Tested on our Twitter sample dataset, the best combination achieves a Bilingual Evaluation Understudy (BLEU) score of 88% and a Word Error Rate (WER) of 7%, both of which improve on the baseline model.
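To make two ideas from the abstract concrete, the sketch below shows one simple way to shorten repeated letters and one common way to compute Word Error Rate (word-level edit distance divided by the number of reference words). It is a minimal illustration, not the method evaluated in the paper: the function names `collapse_repeats` and `word_error_rate`, the `max_run` parameter, and the sample tweets are all assumptions introduced here.

```python
import re


def collapse_repeats(text: str, max_run: int = 2) -> str:
    """Shorten any run of one repeated character to at most max_run copies.

    Hypothetical helper for illustration; a real normaliser would still
    need a dictionary or language model to choose between e.g. "soo" and "so".
    """
    pattern = re.compile(r"(.)\1{%d,}" % max_run)
    return pattern.sub(lambda m: m.group(1) * max_run, text)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(collapse_repeats("sooooo goooood"))
    print(word_error_rate("see you tomorrow", "c u tomorrow"))
```

Under this definition a lower WER is better, while a higher BLEU score is better, which is why the abstract reports a high BLEU figure alongside a low WER figure.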
    Keywords:
    Twitter, micro-blogs, noisy tweets, tweets, data normalisation, normalisation, big data, text cleansers, social media, statistical language models, lexical normalisation
    ANZSRC Field of Research:
    080109 Pattern Recognition and Data Mining, 150502 Marketing Communications
    Copyright Holder:
    Authors

    Copyright Notice:
    All rights reserved
    Rights:
    This digital work is protected by copyright. It may be consulted by you, provided you comply with the provisions of the Act and the following conditions of use. These documents or images may be used for research or private study purposes. Whether they can be used for any other purpose depends upon the Copyright Notice above. You will recognise the authors' and publishers' rights and give due acknowledgement where appropriate.
    This item appears in
    • Computing Conference Papers [150]

    Te Pūkenga

    Research Bank is part of Te Pūkenga - New Zealand Institute of Skills and Technology

    • About Te Pūkenga
    • Privacy Notice

    Copyright ©2022 Te Pūkenga
