• Login
    View Item 
    •   Research Bank Home
    • Unitec Institute of Technology
    • Study Areas
    • Computing
    • Computing Dissertations and Theses
    • View Item
    •   Research Bank Home
    • Unitec Institute of Technology
    • Study Areas
    • Computing
    • Computing Dissertations and Theses
    • View Item
    JavaScript is disabled for your browser. Some features of this site may not work without it.

    SNET : a statistical normalisation method for Twitter

    Sosamphan, Phavanh

    Thumbnail
    Share
    View fulltext online
    Phavanh Sosamphan +.pdf (1.858Mb)
    Date
    2016
    Citation:
    Sosamphan, P. (2016). SNET: A statistical normalisation method for Twitter. An unpublished thesis submitted for the degree of Master of Computing, Unitec Institute of Technology, New Zealand.
    Permanent link to Research Bank record:
    https://hdl.handle.net/10652/3508
    Abstract
    One of the major problems in the era of big data use is how to ‘clean’ the vast amount of data on the Internet, particularly data in the micro-blog website Twitter. Twitter enables people to connect with their friends, colleagues, or even new people who they have never met before. Twitter, one of the world’s biggest social media networks, has around 316 million users, and 500 million tweets posted per day (Twitter, 2016). Undoubtedly, social media networks create huge opportunities in helping businesses build relationships with customers, gain more insights into their customers, and deliver more value to them. Despite all the advantages of Twitter use, comments – called tweets - posted on social media networks may not be all that useful if they contain irrelevant and incomprehensible information, therefore making it difficult to analyse. Tweets are commonly written in ‘ill-forms’, such as abbreviations, repeated characters, and misspelled words. These ‘noisy tweets’ become text normalisation challenges in terms of selecting the proper methods to detect and convert them into the most accurate English sentences. There are several existing text cleaning techniques which are proposed to solve the issues, however they possess some limitations and still do not achieve good results overall. In this research, our aim is to propose the SNET, a statistical normalisation method for cleaning noisy tweets at character-level (which contain abbreviations, repeated letters, and misspelled words) that combines different techniques to achieve more accurate and clean data. To clean noisy tweets, existing techniques are evaluated in order to find the best solution by combining techniques so as to solve all problems with high accuracy. This research proposes that abbreviations are converted to their standard form by using abbreviations dictionary lookup, while repeated characters are normalised by the Natural Language Toolkit (NLTK) platform and a dictionary based approach. Besides the NLTK, the edit distance algorithm is also utilised as a means of solving misspelling problems, while “Enchant” dictionary can be used to access the spell checking library. Furthermore, existing models, such as a spell corrector, can be deployed for conversion purposes, while text cleanser is advanced as superior for comparing the SNET with a baseline model. With experiments on a Twitter sample dataset, our results show that the SNET satisfies 88% accuracy in the Bilingual Evaluation Understudy (BLEU) score and 7% in the word error rate (WER) score, both of which are better than the baseline model. Devising such a method to clean tweets can make a great contribution in terms of its adoption in brand sentiment analysis or opinion mining, political analysis, and other applications seeking to make sound predictions.
    Keywords:
    Twitter, micro-blogs, noisy tweets, tweets, data normalisation, normalisation, big data, text cleansers, spell checkers, Natural Language Toolkit (NLTK), abbreviations, social media
    ANZSRC Field of Research:
    080109 Pattern Recognition and Data Mining, 150502 Marketing Communications
    Degree:
    Master of Computing, Unitec Institute of Technology
    Supervisors:
    Yongchareon, Dr. Sira; Liesaputra, Veronica; Mohaghegh, Dr Mahsa
    Copyright Holder:
    Author

    Copyright Notice:
    All rights reserved
    Rights:
    This digital work is protected by copyright. It may be consulted by you, provided you comply with the provisions of the Act and the following conditions of use. These documents or images may be used for research or private study purposes. Whether they can be used for any other purpose depends upon the Copyright Notice above. You will recognise the author's and publishers rights and give due acknowledgement where appropriate.
    Metadata
    Show detailed record
    This item appears in
    • Computing Dissertations and Theses [90]

    Te Pūkenga

    Research Bank is part of Te Pūkenga - New Zealand Institute of Skills and Technology

    • About Te Pūkenga
    • Privacy Notice

    Copyright ©2022 Te Pūkenga

    Usage

    Downloads, last 12 months
    20
     
     

    Usage Statistics

    For this itemFor the Research Bank

    Share

    About

    About Research BankContact us

    Help for authors  

    How to add research

    Register for updates  

    LoginRegister

    Browse Research Bank  

    EverywhereInstitutionsStudy AreaAuthorDateSubjectTitleType of researchSupervisorCollaboratorThis CollectionStudy AreaAuthorDateSubjectTitleType of researchSupervisorCollaborator

    Te Pūkenga

    Research Bank is part of Te Pūkenga - New Zealand Institute of Skills and Technology

    • About Te Pūkenga
    • Privacy Notice

    Copyright ©2022 Te Pūkenga