REAL TIME FIRST STORY DETECTION IN TWITTER USING A MODIFIED TF-IDF ALGORITHM

Document Type : Original Article

Authors

Computer Science Department, Faculty of Computer and Information Sciences, Mansoura University, Egypt

Abstract

Twitter is a social microblogging service that limits each tweet to a maximum of 140 characters. Even with so few characters per tweet, analyzing the tweets of billions of users poses the challenges of real-time data processing. One important aspect of social behavior is that it lets us detect the significance of events and the way people react to them. In this paper, we focus on First Story Detection (FSD), that is, detecting bursts of tweets that refer to a particular topic. A first story is defined as the first document in a given series of documents to discuss a specific event, which occurred at a particular time and place. Term frequency–inverse document frequency (TF-IDF) is an algorithm traditionally used in most text similarity applications, including FSD. We embed a modified version of the TF-IDF algorithm in an existing open-source FSD implementation to improve its accuracy. That implementation runs on the Storm platform and benefits from its scalability, efficiency and robustness when analyzing tweets in real time. The empirical results show significant improvements in detection accuracy without a noticeable effect on performance.
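For orientation, the general idea of TF-IDF-based first story detection can be sketched as follows. This is a minimal illustration that uses the standard, unmodified TF-IDF weighting with a smoothed IDF and a cosine-similarity threshold; it is not the modified algorithm proposed in this paper, and it does not use Storm. The function names, the whitespace tokenizer and the threshold value of 0.5 are assumptions made for the example only.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenizer, for illustration only.
    return text.lower().split()


def tfidf_vector(tokens, doc_freq, n_docs):
    # Standard TF-IDF with a smoothed IDF (not the paper's modified variant).
    counts = Counter(tokens)
    vector = {}
    for term, count in counts.items():
        tf = count / len(tokens)
        idf = math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1.0
        vector[term] = tf * idf
    return vector


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_first_stories(tweets, threshold=0.5):
    # A tweet is flagged as a first story when its best similarity to every
    # earlier tweet is below the (hypothetical) threshold.
    doc_freq = Counter()   # number of tweets seen so far that contain each term
    seen = []              # token lists of earlier tweets
    first_stories = []
    for index, text in enumerate(tweets):
        tokens = tokenize(text)
        if not tokens:
            continue
        doc_freq.update(set(tokens))
        n_docs = len(seen) + 1
        vector = tfidf_vector(tokens, doc_freq, n_docs)
        best = max(
            (cosine(vector, tfidf_vector(prev, doc_freq, n_docs)) for prev in seen),
            default=0.0,
        )
        if best < threshold:
            first_stories.append(index)
        seen.append(tokens)
    return first_stories


if __name__ == "__main__":
    stream = [
        "earthquake hits city centre",
        "earthquake hits city centre today",
        "football final tonight",
    ]
    print(detect_first_stories(stream))
```

The sketch compares each tweet with every earlier tweet, which grows linearly with the stream; a real-time system would need to bound that search and distribute the work, which is the role a platform such as Storm plays in the implementation discussed here.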

Keywords