Social media moves fast, and nowhere faster than on Twitter. Tweets may be short, but the Twitter community is large: the platform has 320 million monthly active users. Detecting which tweets are related to events and classifying them into categories is a challenging task due to the peculiarities of Twitter language and the lack of contextual information. The capability to understand and analyze the stream of messages on Twitter is an effective way to monitor what people think, what trending topics are emerging, and which main events are affecting people's lives.
In this paper we detect events within the Twitter stream using machine learning. Our main approach is semantic linking against DBpedia, YAGO and other ontologies, which enriches the textual information within a tweet and yields more generic named entities. We use several supervised machine learning approaches to detect events and to determine the effect of semantically linking the named entities in tweets.
The ability to understand and analyse messages on Twitter is an efficient way to monitor what people think and which trending events are affecting their lives. Event detection in tweets is a non-trivial task. Because there is a need to monitor trending topics and the ways in which events affect people's lives, automated topic and event detection is an emerging field of research. Event detection is therefore an important task, and many new approaches to it have been proposed in the literature. Moreover, processing Twitter messages is challenging because tweets are short (at most 280 characters), contain little contextual information, and often contain data that is not useful for analysis, such as stop words, emoji and misspelled words.
In this paper we perform Named Entity Recognition, link tweets to related events and analyse them. We then use the related events to make individual tweets semantically meaningful. To this end we apply several machine learning approaches, which demonstrate the benefits of our approach with high precision and accuracy.
Event detection on Twitter is a difficult task because tweets lack relevant context and most tweets are not related to events.
Furthermore, conventional text mining strategies are not appropriate, due to the short length of tweets, the large number of spelling and grammatical errors, and the frequent use of informal language and of multiple languages.
We have chosen a supervised learning method for our project, and we treat the task as multiclass classification of tweets, with one class per event type. Our aim is to train machine learning algorithms such as Naive Bayes (NB) and Support Vector Machine (SVM) on a tweet dataset that we have created, and to validate our results with a confusion matrix. Furthermore, we replace the named entities in the data with their semantic types from different ontologies and measure the impact of this semantic linking on the accuracy of different classifiers such as Naive Bayes (NB).
- Remove URLs
Many URLs in the tweet text carry little information about the semantics of the tweet, so we remove them.
- Remove Duplicate Tweets
Mining tweets through tweepy yields many duplicates, so we remove the duplicate tweets from the dataset before any further preprocessing.
- Remove Mentions
- Remove Unicode
UTF-8 Unicode special characters add a little extra character to the text, but this extra information is not useful for our analysis, so we remove these characters.
- Remove Unwanted special symbol
- Word Segmentation
- Remove Emoticons
- Remove Open or Empty lines
- Remove Stop Words
Stop words, which occur with high frequency (such as "is" and "at"), are removed from the dataset.
- Spelling Correction
- Remove Extra Spacing
- Lemmatization
After preprocessing, the dataset can be analysed and further machine learning approaches can be applied.
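The cleaning steps listed above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the pipeline actually used in this work: the function names, the regular expressions and the tiny stop-word list are all assumptions, and the steps that need external resources (word segmentation, spelling correction, lemmatization) are left out. For simplicity, duplicates are dropped after cleaning rather than before.

```python
import re

# Small illustrative stop-word list (an assumption); a real pipeline would use a fuller one.
STOP_WORDS = {"is", "at", "the", "a", "an", "in", "on", "of", "to", "and"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_ASCII_RE = re.compile(r"[^\x00-\x7F]+")
SYMBOL_RE = re.compile(r"[^a-z0-9\s]")


def clean_tweet(text):
    """Apply a subset of the cleaning steps to a single tweet."""
    text = URL_RE.sub(" ", text)        # remove URLs
    text = MENTION_RE.sub(" ", text)    # remove @mentions
    text = NON_ASCII_RE.sub(" ", text)  # remove non-ASCII characters, including emoji
    text = text.lower()
    text = SYMBOL_RE.sub(" ", text)     # remove remaining special symbols
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)             # split/join also collapses extra spacing


def preprocess(tweets):
    """Clean every tweet, dropping duplicates and empty lines."""
    seen = set()
    cleaned = []
    for tweet in tweets:
        result = clean_tweet(tweet)
        if result and result not in seen:
            seen.add(result)
            cleaned.append(result)
    return cleaned
```

Note that stripping all non-ASCII characters also removes accented letters, which may be too aggressive for multilingual tweets.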
We label each tweet with an event category. If a tweet is extracted from an arts event such as the Amsterdam Dance Event, we assign the label Arts in the dataset. This rests on our strong assumption that any tweet crawled from an arts event belongs to the Arts category.
Named Entity Recognition, Linking and Replacement
After the preprocessing step, we have to recognise the named entities (NEs), replace them, and link them to the NERD API. Several extractors are available for semantic analysis. We have used the Dandelion API, TextRazor and DBpedia for this project.
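To illustrate the replacement step, the sketch below substitutes each recognised entity with its semantic type. It assumes that an extractor has already returned character offsets and a type for every entity; the function name, the annotation keys ("start", "end", "type") and the type names are hypothetical and do not reflect the response format of any particular extractor.

```python
def replace_entities(text, annotations):
    """Replace each annotated span of `text` with its semantic type.

    `annotations` is a list of dicts with the hypothetical keys
    "start" and "end" (character offsets) and "type".
    """
    # Work from right to left so that earlier offsets stay valid.
    for ann in sorted(annotations, key=lambda a: a["start"], reverse=True):
        text = text[:ann["start"]] + ann["type"] + text[ann["end"]:]
    return text
```

For example, with suitable annotations the tweet "Armin van Buuren plays in Amsterdam tonight" would become "MusicalArtist plays in City tonight", which is the more generic form the classifiers are then trained on.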
The Dandelion APIs follow a general schema for requesting data. All requests must be sent, either by GET or POST, to the API endpoint, which follows this structure: https://api.dandelion.eu/api/product/methodpath/api-version
Every request must be authenticated. The Dandelion API implements authentication through a single token parameter which identifies the caller. Currently only one token is given to each user.
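A request of this kind could be assembled as sketched below. The concrete endpoint path (`datatxt/nex/v1`, Dandelion's entity extraction service) and the `text` parameter name are assumptions that should be checked against the official Dandelion documentation; only the `token` parameter is described above. The helper name is hypothetical, and the function only builds the URL without sending it.

```python
from urllib.parse import urlencode

# Assumed entity-extraction endpoint; verify against the official Dandelion docs.
DANDELION_NEX_URL = "https://api.dandelion.eu/datatxt/nex/v1"


def build_entity_request(text, token, base_url=DANDELION_NEX_URL):
    """Return a GET URL carrying the tweet text and the authentication token."""
    params = {"text": text, "token": token}
    return base_url + "?" + urlencode(params)
```

The resulting URL can then be fetched with any HTTP client; for long texts a POST request with the same parameters is the more suitable choice.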
Word2Vec is a two-layer neural network that processes text. It takes a text corpus as input and generates word vectors as output. It first builds a vocabulary from the training text and then learns vector representations of the words. The resulting word vector file can be used as features in many natural language processing (NLP) and machine learning methods.
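One common way to turn word vectors into a fixed-length feature vector for a whole tweet is to average the vectors of its tokens. The sketch below shows this under the assumption that the word vectors are available as a plain dictionary; whether averaging is the aggregation used in this work is not stated, and the function name is hypothetical.

```python
def tweet_vector(tokens, word_vectors, dim):
    """Average the vectors of the known tokens; return zeros if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]
```

Tokens missing from the vocabulary are simply skipped, so a tweet consisting only of unknown words maps to the zero vector.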
After feature extraction we perform classification in order to predict and analyse our dataset. For classification we have used a supervised machine learning approach. In supervised learning we have input variables (X) and an output variable (Y), and we use an algorithm to learn the mapping function from the input to the output: Y = f(X). The goal is to approximate the mapping function so well that, given new input data (X), we can predict the output variable (Y) for that data. We have used the Naive Bayes classifier, the Decision Tree classifier and the Support Vector Machine (SVM) classifier.
Naive Bayes Classifier
Naive Bayes classifiers are a collection of classification algorithms based on Bayes' theorem. Naive Bayes is not a single algorithm but a family of algorithms that share a common principle: every pair of features being classified is independent of each other, given the class. This conditional independence assumption rarely holds true in real-world applications, hence the characterization as naive, yet the algorithm tends to perform well and learn rapidly in various supervised classification problems. It is a simple probabilistic classifier which calculates probabilities by counting frequencies in the given dataset. The Naive Bayes classifier makes two fundamental assumptions:
- No pair of features is dependent; that is, the features are assumed to be independent.
- Each feature is given the same weight: no attribute is irrelevant, and all are assumed to contribute equally to the outcome.
5 http://bio.nlplab.org/word-vectors
The Naive Bayes classifier assumes that the effect of the value of a predictor A on a given class B is independent of the values of the other predictors. Bayes' theorem states
$$P(A \mid B) = \frac{P(A)\,P(B \mid A)}{P(B)}$$
where A and B are events and $P(B) \neq 0$.
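The frequency-counting view of Naive Bayes can be made concrete with a small multinomial variant written in plain Python. This is a sketch under stated assumptions rather than the implementation used in the experiments: the function names are hypothetical, tweets are assumed to be token lists, and add-one (Laplace) smoothing is added so that unseen words do not produce zero probabilities.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Count class frequencies and per-class word frequencies."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(tokens, model):
    """Return the class with the highest log posterior (up to a constant)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total_docs)  # log prior of the class
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            # Add-one smoothing avoids zero probabilities for unseen words.
            p = (word_counts[label][tok] + 1) / (total_words + len(vocab))
            score += math.log(p)  # independence assumption: log-likelihoods add up
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow when tweets contain many tokens.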
Decision Tree Classifier
The decision tree is a powerful and popular tool for classification and prediction. Decision trees are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A decision tree is a flowchart-like tree structure where:
• Each internal node denotes a test on an attribute.
• Each branch represents an outcome of the test.
• Each leaf node holds a class label.
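The structure described by these three points can be illustrated with a tiny hand-written tree. The tree below is purely hypothetical (it was not learned from the dataset), and the attribute tests simply check whether a word occurs in the tweet; a learned tree would choose its tests from the training data.

```python
# Hypothetical hand-written tree over binary "tweet contains word" attributes.
TREE = {
    "attribute": "goal",
    "branches": {
        True: {"label": "Sports"},
        False: {
            "attribute": "concert",
            "branches": {True: {"label": "Arts"}, False: {"label": "Other"}},
        },
    },
}


def classify(node, tokens):
    """Walk from the root to a leaf, following the outcome of each test."""
    while "label" not in node:
        outcome = node["attribute"] in tokens  # test on an attribute at an internal node
        node = node["branches"][outcome]       # follow the branch for that outcome
    return node["label"]                       # the leaf holds the class label
```

Calling `classify(TREE, {"concert", "tonight"})` first fails the "goal" test, then passes the "concert" test, and ends in the leaf labelled Arts.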
Support Vector Machine
A Support Vector Machine (SVM) is a supervised learning algorithm that can be used for both classification and regression. It is a discriminative classifier formally defined by a separating hyperplane: given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples. An SVM represents the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. Using the kernel trick, SVMs can also efficiently perform non-linear classification, implicitly mapping their inputs into high-dimensional feature spaces.
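For linearly separable data, the idea of a gap that is as wide as possible corresponds to the standard hard-margin formulation. The notation is introduced here for illustration and does not appear elsewhere in this paper: $x_i$ are the training examples, $y_i \in \{-1, +1\}$ their labels, and $w$, $b$ the normal vector and offset of the hyperplane.

```latex
\min_{w,\,b} \ \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,(w^{\top} x_i + b) \ge 1, \qquad i = 1, \dots, n
```

The width of the gap between the two classes is $2 / \lVert w \rVert$, so minimizing $\lVert w \rVert$ maximizes the margin. A new example $x$ is then assigned the class $\operatorname{sign}(w^{\top} x + b)$.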
After applying a classifier we have to evaluate it. For this we use a confusion matrix to calculate the accuracy of each classifier.
A confusion matrix is used to measure the performance of a binary classifier. It is a square matrix consisting of 2 rows and 2 columns, such that for each class the numbers of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) are calculated, as shown in figure 6.
- Positive: Observation is positive.
- Negative: Observation is not positive.
- True Positive: Observation is positive, and is predicted to be positive.
- False Negative: Observation is positive, but is predicted negative.
- True Negative: Observation is negative, and is predicted to be negative.
- False Positive: Observation is negative, but is predicted positive.
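The four counts defined above, and the accuracy derived from them, can be computed as in the following sketch. The function names are hypothetical; for a multiclass problem the same counting can be repeated with each class in turn treated as the positive one.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, TN, FP and FN with one class treated as positive."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1  # positive, predicted positive
        elif truth == positive:
            fn += 1  # positive, predicted negative
        elif pred == positive:
            fp += 1  # negative, predicted positive
        else:
            tn += 1  # negative, predicted negative
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of observations that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)
```

Accuracy is the share of the diagonal entries (TP and TN) among all observations in the matrix.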