SoutheastCon 2015, April 9 - 12, 2015 - Fort Lauderdale, Florida

Hybrid Classification for Tweets Related to Infection with Influenza

Xiangfeng Dai and Marwan Bikdash

Department of Computational Science and Engineering

North Carolina A&T State University

Greensboro, USA

Abstract—Traditional public health surveillance methods such as those employed by the CDC (United States Centers for Disease Control and Prevention) rely on regular clinical reports, which are almost always manual and labor intensive. Twitter, a popular micro-blogging service, offers the possibility of automated public health surveillance. Tweets, however, are limited to 140 characters and do not provide sufficient word occurrences for conventional classification methods to work reliably. Moreover, natural language is complex, which makes health-related classification more challenging. In this study, we use flu-related classification as a demonstration to propose a hybrid classification method that combines two approaches: classification with manually-defined features and classification with features auto-generated by machine learning. Preprocessing based on Natural Language Processing (NLP) is used to extract useful information and to eliminate noise features. Our simulations show an improved accuracy.

Keywords—Public Health, Twitter, Social Network, Big data, Machine learning, Classification, Natural Language Processing

I. INTRODUCTION

A. Background

Epidemics of seasonal influenza are a major public health concern. Approximately 5%–10% of adults and 20%–30% of children get influenza each year, which results in about 3 to 5 million cases of severe illness and about 250,000 to 500,000 deaths in the world every year [19]. Influenza surveillance systems have been established, for instance, by the United States Centers for Disease Control and Prevention (CDC), but they are almost always entirely manual [17]. This introduces delays of about two weeks for clinical data acquisition, and requires clinical encounters with health professionals [6]. Twitter is one of the most popular social network websites and has been growing at a very fast pace. It has 284 million monthly active users, and 500 million tweets are sent per day [7]. Users often share activities, opinions or feelings about everything on Twitter. It provides a low-cost alternative source for public health surveillance. The ultimate goal in automatic public health surveillance is to be able to detect the onset, progress, extent and geography of epidemics. To this end, time-stamped and geolocation-stamped data must be available, in addition to multilayered classification and reasoning algorithms [21]. Health-related classification plays an important role in public health surveillance.

B. Related Work

Several studies on classification techniques for public health surveillance from social media have already been published. Ginsberg et al. [17] showed evidence of correlation between the occurrence of search queries containing flu-related words and ILI (influenza-like illness) rates to detect flu epidemics. Corley et al. [16] evaluated flu trends in blog posts by analyzing the ILI search queries on Google.

Most recent studies rely on the analysis of tweet contents, such as keywords or related words analysis, word occurrences, correlation analysis and word frequency, etc. Lampos and Cristianini [5] detected flu-related keywords in the U.K. and used them to track influenza rates. They used a set of flu-related keywords to learn a flu-score for each document by learning the weights for each keyword. Culotta et al. [1] performed a similar study, and tracked flu-related keywords and analyzed the correlation with national health statistics. Achrekar et al. [2] selected keywords automatically to follow on Twitter, and then counted the number of tweets at each time step per keyword to predict flu trends.

Kanhabua et al. [14] used clustering methods to determine important topics of Twitter data, and then constructed time series for matched keywords.

Parker et al. [6] found frequent health-related word sets from health-related tweets and from Wikipedia medically-related articles and used them to detect public health trends. Chew et al. [13] studied the dynamics of change in circulated tweets for H1N1 virus, and focused on the diversity in keyword lists. Similar studies [2, 11, 12] focused on keywords match, and illustrated the difficulties in finding powerful feature words. Polysemy for instance was found to cause problems.

Moreover, some studies have applied machine-learning methods to tweet classification. Sankaranarayanan et al. [10] used a Naive Bayes classifier to distinguish news from junk. Other classification approaches based on machine-learning algorithms appear in [6, 10, 11].

Tweets, however, do not provide enough word co-occurrence because of their short length. Most existing approaches rely on content analysis, such as keyword lists or more plentiful text, to identify relevant information [9], and do not work reliably. In addition, typical machine learning methods, which rely on simple content analysis such as word frequency or co-occurrence, are often not effective.

C. Domain

Fig. 1 shows three categories of tweets about flu. The classes C1 and C2 contain flu keywords, such as words referring to flu symptoms. C3 contains tweets that are not related to flu. Most existing approaches focus on the domain of distinguishing “related” (C1 and C2) from “unrelated” (C3) tweets.

Fig. 1. The Categories of the tweets

C1: Related to Flu Detection, C2: Unrelated to Flu Detection, C3: Unrelated to Flu

The tweets in Table I are related to flu, but do not imply flu infection. They talk about news of flu, stomach flu, and flu-like symptoms that are not due to flu (e.g., due to alcohol use). The tweets in C2 are noisy data from the point of view of flu detection.

TABLE I. C2: UNRELATED TO INFECTION WITH FLU

4th day of summer and i have the stomach flu. god dammit.

Australia has many cases of swine flu :( in 24 hours it should have doubled! they believe a million people will get it!? scary! :(

Realizing how stoked I really am to go to Norway on the 18th. Really Really stoked. Do they still have bird flu over there?

Pondering finding out if whiskey is more effective than cold & flu medicine at ridding oneself of said cold and/or flu

For better flu-detection, one must be able to distinguish C2 from C3. In this study, we propose a hybrid classification method for a smaller domain: distinguishing C1 from C2. This hybrid method improves the classification process because it takes advantage of multiple approaches.

II. THE ARCHITECTURE OF HYBRID CLASSIFICATION

This hybrid classification process has three steps (Fig. 2): NLP preprocessing, classification using manually-defined features, and classification based on a Naive Bayes model that uses auto-generated features.

1. The first step normalizes the tweets using NLP (Natural Language Processing) techniques, for instance by removing stopwords.

2. In the second step, we manually predefine feature lists based on common sense. These words are often uncommon and cannot be generated automatically by machine learning algorithms. The classifier in this step predicts effectively on tweets containing the pre-defined features, but it cannot evaluate tweets that do not contain them.

3. The tweets left undecided by the previous classifiers are passed to a Naive Bayes classifier whose features are generated automatically using machine learning. It takes care of all tweets that the previous classifiers cannot evaluate.
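The three-step dispatch above can be sketched as follows. This is an illustrative outline, not the authors' code; the three component callables are hypothetical placeholders for the modules described in Section III:

```python
def hybrid_classify(tweet, normalize, rule_classify, nb_classify):
    """Hybrid pipeline: (1) NLP normalization, (2) manually-defined
    rules, (3) Naive Bayes fallback for tweets the rules leave undecided."""
    tokens = normalize(tweet)
    label = rule_classify(tokens)    # returns None when no rule fires
    if label is None:
        label = nb_classify(tokens)  # machine-learning fallback
    return label

# Toy stand-ins, used only to exercise the control flow:
result = hybrid_classify(
    "Too much whisky, headache",
    normalize=lambda t: t.lower().replace(",", "").split(),
    rule_classify=lambda toks: "C2" if "whisky" in toks else None,
    nb_classify=lambda toks: "C1",
)
print(result)
```

The key design point is that the rule stage is allowed to abstain: only tweets it cannot decide reach the statistical classifier.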

Fig. 2. Hybrid Classification Architecture

III. THE DESIGN OF CLASSIFIERS

A. NLP Approaches

One of the major problems in utilizing NLP techniques is Twitter's noisy content, such as hashtags, emoticons, slang, abbreviations, links, etc. This module is responsible for normalizing the tweets. It performs the following operations:

• Filter out stopwords - for example: a, is, the, with, etc. These words carry no information that helps identify C1 or C2, so they are removed.

• Lower case - convert all words to lower case.

• Remove punctuation and additional spaces - such as commas, single/double quotes, and question marks at the start and end of each word.

• Filter by length - discard words whose length falls outside set bounds.
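A minimal sketch of this normalization step is shown below. The stopword list is a small illustrative subset (the study uses NLTK, whose stopword corpus is much larger), and the length bounds are assumed values:

```python
import string

# Illustrative stopword subset; an assumption standing in for NLTK's corpus.
STOPWORDS = {"a", "an", "is", "the", "with", "so", "has", "for", "and", "of", "to", "i"}

def normalize_tweet(text, min_len=2, max_len=20):
    """Normalize a tweet: lower-case, strip leading/trailing punctuation,
    drop stopwords, and filter tokens by length."""
    cleaned = []
    for tok in text.lower().split():
        # Remove punctuation (commas, quotes, question marks, ...) at the
        # start and end of each word.
        tok = tok.strip(string.punctuation)
        if not tok or tok in STOPWORDS:
            continue
        if min_len <= len(tok) <= max_len:
            cleaned.append(tok)
    return cleaned

print(normalize_tweet("Having a flu, Feeling so miserable... headache"))
```

On the first tweet of Table IV this yields `['having', 'flu', 'feeling', 'miserable', 'headache']`.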

B. Classification Using Manually-defined Features

We manually predefined feature words for this classifier based on common sense. These words are often uncommon and cannot be generated automatically by machine learning algorithms. Normal machine learning methods rely on word frequency or sufficient word co-occurrence, so they usually fail to evaluate uncommon and special words [3]. In this study, we predefined feature words from 3 lists:

• Alcohol list (e.g., “beer”, “schwarzbier”, “scotch”, “whiskey”, “barleywine”, “rum”, “tequila”, “ale”, “gin”, “maotai”, etc.). These are negatively correlated with flu infection. The rationale is that an alcohol hangover produces symptoms, and hence words, that are noise features when one seeks to detect infection with flu.

• Medicine and symptom lists (e.g., “tamiflu”, “tylenol”, “coricidin”, “fever”, “sneezing”, “headache”, “cough”, etc.). These are positively correlated with flu infection.

• Special word combinations (e.g., “stomach flu”, etc.). These can be negatively or positively correlated with flu infection.

The manually-defined feature classifier cannot classify most of the tweets. However, this method is very effective when tweets contain uncommon words characteristic of other categories. The only rules applied here concern the uncommon words found in the text.

For example, the tweets in Table II are related to infection with flu; they mention the associated symptoms (sneezing, fever, etc.) and treatments (Tylenol). Previous experience with flu indicates that Coricidin, Tylenol and Tamiflu are typically used to treat the flu. If a tweet contains flu symptoms and treatments, the classifier classifies it as related (C1).

TABLE II. FEATURED WORDS (SYMPTOMS AND TREATMENTS)

Feeling much better after lots of water, coricidin, and sleep. Still sneezing and congested tho.

Just woke up, headaches back; need tylenol... Sore throat is back, need cough drops. Ughh!

fever down to 100F, started tamiflu, watching her like a hawk, Moms worry!

The tweets in Table III have symptoms (such as headache) together with uncommon words like alcohol, whisky, etc. These words indicate that the tweets are unrelated to infection with flu. Therefore, if a tweet contains flu symptoms and alcohol words, the classifier classifies it as unrelated (C2).

TABLE III. FEATURED WORDS (SYMPTOMS AND ALCOHOL)

Has been making himself be in a good mood (kinds works)... But STILL has a headache! Must be the alcohol... Darn it!

Too much whisky last night. Result: a v expensive taxi, a slight headache and a craving for Irn Bru....
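The rules illustrated by Tables II and III can be sketched as below. The keyword lists are short illustrative excerpts, not the study's full lists, and the exact precedence among the rules is an assumption:

```python
# Illustrative excerpts of the paper's manually-defined feature lists.
ALCOHOL = {"beer", "whiskey", "whisky", "scotch", "rum", "tequila", "gin", "ale"}
MEDICINE = {"tamiflu", "tylenol", "coricidin"}
SYMPTOMS = {"fever", "sneezing", "headache", "cough", "congested"}

def rule_classify(tokens):
    """Return 'C1' (flu infection), 'C2' (flu-related but not infection),
    or None when no manually-defined rule fires."""
    words = set(tokens)
    has_symptom = bool(words & SYMPTOMS)
    # Symptoms together with alcohol words suggest a hangover, not flu (C2).
    if has_symptom and words & ALCOHOL:
        return "C2"
    # "stomach flu" is a special combination unrelated to influenza infection.
    if "stomach" in words and "flu" in words:
        return "C2"
    # Symptoms together with flu treatments indicate infection (C1).
    if has_symptom and words & MEDICINE:
        return "C1"
    return None  # undecided: pass the tweet on to the Naive Bayes step

print(rule_classify(["fever", "started", "tamiflu"]))           # -> "C1"
print(rule_classify(["too", "much", "whisky", "headache"]))     # -> "C2"
print(rule_classify(["looks", "like", "flu", "come", "back"]))  # -> None
```

As the last call shows, most tweets fall through all rules, which is why the hybrid design hands them to the Naive Bayes classifier.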

C. Naive Bayes Approach

The machine-learning classifier is the last step in the classification process. It takes care of all tweets that cannot be evaluated by the previous classifiers. In this study, we use the Naive Bayes model. As shown in Fig. 3, during the training phase, pairs of features and categories are fed into the naive Bayes algorithm to train a classifier.

In the prediction phase, the features of a tweet are input to the trained Naive Bayes classifier to determine the predicted category (C1 or C2) [4].

Fig. 3. Auto-generated features classification by machine learning

D. Feature selection

Feature selection [20] is the process of selecting a subset of the vocabulary from the training set. It is a very important concept in implementing a classifier. The feature words capture the distinctive characteristics of the tweets. We use NLP techniques to filter out stopwords and remove duplicate words during preprocessing. Table IV shows the tweets and the corresponding words. Stopwords, such as: “a, so, has, for, the”, have been removed.

astCon 2015, April 9 - 12, 2015 - Fort Lauderdale, Florida

TABLE IV. CONSTRUCTING THE VOCABULARY

Tweet: Having a flu, Feeling so miserable… headache
Words: “flu”, “feel”, “miserable”, “headache”

Tweet: Looks like flu has come back for a visit. Miserable :(
Words: “Looks”, “like”, “flu”, “come”, “back”, “visit”

Tweet: Headaches.. Suffering the flu.
Words: “headache”, “Suffering”, “flu”

The words remaining after NLP preprocessing are collected and combined in a vocabulary. For example, the vocabulary from the above 3 tweets in Table IV consists of:

{“flu”, “miserable”, “headache”, “feel”, “look”, “like”, “come”, “back”, “visit”, “suffer”}

In this study, we use frequency-based feature selection. In this approach, the most common words in the category are chosen as features: the words in the vocabulary are ordered by their total frequency in the training set [4]. In our study, we selected the top 300 most frequent feature words from the vocabulary (579 words) to train the classifier.
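Frequency-based feature selection as described can be sketched as follows, on the three tokenized tweets of Table IV (with `k` shrunk from the paper's 300 to 3 for the toy data):

```python
from collections import Counter

def select_features(tokenized_tweets, k=300):
    """Order vocabulary words by total frequency in the training set and
    keep the top k as features (the paper keeps 300 of 579 words)."""
    counts = Counter(tok for tweet in tokenized_tweets for tok in tweet)
    return [word for word, _ in counts.most_common(k)]

# Tokenized tweets of Table IV (after NLP preprocessing).
training = [
    ["flu", "feel", "miserable", "headache"],
    ["look", "like", "flu", "come", "back", "visit"],
    ["headache", "suffer", "flu"],
]
print(select_features(training, k=3))
```

Here "flu" (3 occurrences) and "headache" (2) head the list; ties among the remaining single-occurrence words are broken arbitrarily.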

Since tweet texts are very short (limited to 140 characters), a Boolean representation suffices for our purpose. We use the Bernoulli model [20], which generates an indicator for each term of the feature set: “1” indicates presence of the term in the document and “0” indicates absence. Therefore, a tweet d is represented by a binary vector:

$F = \langle F_1, F_2, \dots, F_n \rangle$    (1)

where $F_t = 1$ if word t is present in tweet d and $F_t = 0$ if t is not present in d. F is called the feature vector representation of tweet d, or simply the feature vector.

Therefore, the tweets in Table IV are represented by the vectors shown in Table V.

TABLE V. FEATURE VECTORS

Tweet: Having a flu, Feeling so miserable… headache
Feature vector: [1,1,1,1,0,0,0,0,0,0]

Tweet: Looks like flu has come back for a visit. Miserable :(
Feature vector: [1,1,0,0,1,1,1,1,1,0]

Tweet: Headaches.. Suffering the flu.
Feature vector: [1,0,1,0,0,0,0,0,0,1]
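Building the Bernoulli feature vector of Eq. (1) is straightforward; the sketch below reproduces the first row of Table V from the vocabulary given above:

```python
def to_feature_vector(tokens, vocabulary):
    """Bernoulli representation: 1 if the vocabulary word occurs in the
    tweet, 0 otherwise (word frequency is deliberately ignored)."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

# Vocabulary constructed from the three tweets of Table IV.
vocab = ["flu", "miserable", "headache", "feel", "look",
         "like", "come", "back", "visit", "suffer"]
print(to_feature_vector(["flu", "feel", "miserable", "headache"], vocab))
```

The output matches the first feature vector in Table V: [1,1,1,1,0,0,0,0,0,0].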

In anticipation of using the naive Bayes model of classification, we estimate the probability that feature i occurs in a tweet tagged as category j. Using the training data, this probability can be estimated as:

$P_{ij} = P(U = F_i \mid C_j) \approx \dfrac{N_{ij}}{N_j}$    (2)

where $N_{ij}$ is the number of tweets tagged as category j in the training set that contain feature i, and $N_j$ is the number of tweets tagged as category j.

The probability of a given category j can be estimated from the training set as follows:

$P_j = P(C_j) = \dfrac{N_j}{N}$    (3)

where $N = \sum_j N_j$ is the total number of tweets in the training set.
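Estimating these quantities from tagged training vectors can be sketched as follows, on small synthetic data. Note that a practical implementation would usually add Laplace smoothing to avoid zero probabilities, a detail the paper does not discuss:

```python
def estimate_parameters(vectors, labels, categories=("C1", "C2")):
    """Estimate P_ij = N_ij / N_j (feature i given category j, Eq. (2))
    and P_j = N_j / N (Eq. (3)) from Bernoulli feature vectors."""
    n = len(vectors)
    p_feature, p_cat = {}, {}
    for j in categories:
        rows = [v for v, y in zip(vectors, labels) if y == j]
        n_j = len(rows)
        p_cat[j] = n_j / n
        # N_ij: number of category-j tweets containing feature i.
        p_feature[j] = [sum(r[i] for r in rows) / n_j
                        for i in range(len(vectors[0]))]
    return p_feature, p_cat

# Synthetic 3-feature training data for illustration.
vectors = [[1, 1, 0], [1, 0, 1], [0, 1, 1], [0, 0, 1]]
labels = ["C1", "C1", "C2", "C2"]
p_feature, p_cat = estimate_parameters(vectors, labels)
print(p_cat["C1"], p_feature["C1"])
```

With this data each category holds half the tweets ($P_{C1} = 0.5$), and feature 0 appears in every C1 tweet, so $P_{0,C1} = 1.0$.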

E. Prediction using Naive Bayes

The decision of the Naive Bayes classifier that a tweet d was generated from category $C_j$ maximizes the conditional probability:

$P(C_j \mid d) = \dfrac{P(d \mid C_j)\,P(C_j)}{P(d)}$    (4)

Because $P(d)$ is the same for any category, it can be dropped. Hence:

$C^{*} = \arg\max_j P(d \mid C_j)\,P(C_j)$    (5)

The naive Bayes assumption is that the features are all independent. Hence:

$P(d \mid C_j) = P(F_1, F_2, \dots, F_n \mid C_j) = \prod_i P(U = F_i \mid C_j)$    (6)

The needed probabilities have been estimated above.
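A sketch of the resulting decision rule follows. Working in log space for numerical stability, and clamping probabilities away from 0 and 1 with a small epsilon, are implementation choices not specified in the paper:

```python
import math

def predict(vector, p_feature, p_cat):
    """Pick the category maximizing P(d|C_j)P(C_j) (Eq. (5)) under the
    naive independence assumption (Eq. (6)), using a Bernoulli model:
    a feature contributes p when present and 1-p when absent."""
    eps = 1e-9  # clamp to avoid log(0); smoothing is left implicit in the paper
    best, best_score = None, -math.inf
    for j, prior in p_cat.items():
        score = math.log(prior)
        for present, p in zip(vector, p_feature[j]):
            p = min(max(p, eps), 1 - eps)
            score += math.log(p if present else 1 - p)
        if score > best_score:
            best, best_score = j, score
    return best

# Parameters matching the synthetic example used for Eqs. (2)-(3).
p_feature = {"C1": [1.0, 0.5, 0.5], "C2": [0.0, 0.5, 1.0]}
p_cat = {"C1": 0.5, "C2": 0.5}
print(predict([1, 1, 0], p_feature, p_cat))
```

A tweet containing feature 0, which never occurs in C2 tweets, is assigned to C1.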

IV. EVALUATION & RESULTS

We implemented this study with Python and the NLTK toolkit. The Independent Test Sample Method [18] was used for evaluating the performance of the hybrid classifier. Since this study focuses on distinguishing C1 from C2, we collected 500 tweets from C1 and another 500 tweets from C2. The N tweets were then randomly divided into a training set (75% of N) and a testing set (25% of N). Denote these numbers as $N_{training}$ and $N_{test}$, where $N_{training} + N_{test} = N$.

We use the training set to build the classifier and then classify the observations in the test set using the classification rules. We also count the number ($N_{cc}$) of the tweets that are successfully classified. The rate at which observations are correctly classified is:

$P(cc) = \dfrac{N_{cc}}{N_{test}}$    (7)

The proportion of correctly classified observations is the estimated classification rate. The higher this proportion, the better the classifier.
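The evaluation protocol (a random 75/25 split, then the accuracy of Eq. (7) on the held-out quarter) can be sketched generically. The classifier below is a trivial majority-class stand-in used only to exercise the protocol, not any classifier from the study:

```python
import random

def split_and_accuracy(samples, labels, build, train_frac=0.75, seed=0):
    """Independent test sample method: random 75%/25% split, then report
    P(cc) = N_cc / N_test (Eq. (7)). `build` takes training samples and
    labels and returns a prediction function."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    cut = int(train_frac * len(idx))
    train, test = idx[:cut], idx[cut:]
    model = build([samples[i] for i in train], [labels[i] for i in train])
    n_cc = sum(model(samples[i]) == labels[i] for i in test)  # N_cc
    return n_cc / len(test)

# Toy demonstration on synthetic, evenly balanced labels.
samples = list(range(100))
labels = ["C1"] * 50 + ["C2"] * 50

def majority(train_x, train_y):
    top = max(set(train_y), key=train_y.count)
    return lambda x: top  # always predict the majority training class

acc = split_and_accuracy(samples, labels, majority)
print(acc)
```

Fixing the shuffle seed makes the split reproducible; the paper does not say how its single split was drawn.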

TABLE VI. RESULTS

Classifier                   Accuracy
Manually-defined features    0.163
Auto-generated features      0.7
Hybrid Approach              0.796
Hybrid Approach with NLP     0.842

Table VI shows that the auto-generated feature approach achieves good results (0.7), which is better than the manually-defined features approach (0.163). After combining the two approaches with NLP, the hybrid approach yielded the best result (0.842).

V. CONCLUSIONS

In this study, we present a hybrid classification method focusing on a small domain: distinguishing tweets indicating flu infection from tweets that are connected with flu but irrelevant to flu infection. This method improves the classification process because it takes advantage of multiple approaches.

The experimental results show that the hybrid classification approach achieved better results than any single approach. The approach shown here, however, suffers from the limitation that the machine-learning classifier is supervised, which requires experts to read the tweets and ascertain the category to which they belong. In other words, it can be labor intensive. As a result, it is not easy to readily modify this classifier so as to make it applicable to another disease or to another question. This approach will also suffer when applied to new or rare diseases for which the available data are scarce.

In future work, we will consider further and more elaborate use of NLP to improve feature selection and to reduce the labor needed to tag tweets. Less supervision in learning may be achieved by computing the PMI (Pointwise Mutual Information) between frequent words and flu symptoms or medicines, and by similar correlation measures.

REFERENCES

[1] A. Culotta, “Detecting influenza outbreaks by analyzing Twitter messages,” in Proc. 2010 Conf. on Knowledge Discovery and Data Mining, 2010.

[2] H. Achrekar, A. Gandhe, R. Lazarus, S. Yu and B. Liu, “Predicting Flu Trends using Twitter Data,” in Computer Communications Workshops (INFOCOM WKSHPS), 2011 IEEE Conference on, pp. 702–707, 2011.

[3] P. Rafeeque and S. Sendhilkumar, “A survey on Short text analysis in Web,” in Advanced Computing (ICoAC), 2011 Third International Conference on, IEEE, pp. 365–371, 2011.

[4] S. Bird, E. Klein, and E. Loper, Natural Language Processing with Python, O'Reilly Media, 2009.

[5] V. Lampos and N. Cristianini, “Tracking the flu pandemic by monitoring the Social Web,” in 2nd IAPR Workshop on Cognitive Information Processing (CIP 2010), pp. 411–416, 2010.

[6] J. Parker, Y. Wei, A. Yates, O. Frieder and N. Goharian, “A Framework for Detecting Public Health Trends with Twitter,” in 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 556–563, 2013.

[7] Twitter, Inc., “Twitter usage,” [online] 2015, https://about.twitter.com/company (Accessed: 7 Jan 2015).

[8] G. Song, Y. Ye, X. Du, X. Huang and S. Bie, “Short Text Classification: A Survey,” Journal of Multimedia, vol. 9, no. 5, pp. 635–643, 2014.

[9] M. Sofean and K. Denecke, “Medical Case-Driven Classification of Microblogs: Characteristics and Annotation,” IHI '12, Miami, Florida, USA, 2012.

[10] J. Sankaranarayanan and H. Samet, “TwitterStand: News in Tweets,” ACM GIS '09, pp. 42–51, 2009.

[11] A. Signorini, A. Segre and P. Polgreen, “The Use of Twitter to Track Levels of Disease Activity and Public Concern in the U.S. during the Influenza A H1N1 Pandemic,” PLoS ONE, 2011.

[12] Q. Yuan, E. Nsoesie, B. Lv, G. Peng, R. Chunara and J. Brownstein, “Monitoring influenza epidemics in China with search query from Baidu,” PLoS ONE, 2013.

[13] C. Chew and G. Eysenbach, “Pandemics in the age of Twitter: Content analysis of tweets during the 2009 H1N1 outbreak,” PLoS ONE, 2013.

[14] N. Kanhabua and W. Nejdl, “Understanding the diversity of tweets in the time of outbreaks,” WWW '13, pp. 1335–1342, 2013.

[15] A. Go, R. Bhayani and L. Huang, “Twitter Sentiment Classification using Distant Supervision,” CS224N Project Report, Stanford, 2009.

[16] C. Corley, A. Mikler, K. Singh and D. Cook, “Monitoring influenza trends through mining social media,” in Proceedings of the International Conference on Bioinformatics and Computational Biology (ICBCB), pp. 340–346, 2009.

[17] J. Ginsberg, M. Mohebbi, R. Patel, L. Brammer, M. Smolinski and L. Brilliant, “Detecting influenza epidemics using search engine query data,” Nature, 457(7232):1012–1014, 2008.

[18] W. Martinez and A. Martinez, Computational Statistics Handbook with MATLAB, 2nd ed., Chapman and Hall/CRC Press, 2007.

[19] World Health Organization, “Influenza fact sheet,” [online] March 2014, http://www.who.int/mediacentre/factsheets/fs211/en/ (Accessed: 11 Jan 2015).

[20] C. Manning, P. Raghavan and H. Schütze, An Introduction to Information Retrieval, online edition, Cambridge University Press, 2009.

[21] S. Thacker and J. Qualters, “Public Health Surveillance in the United States: Evolution and Challenges,” Morbidity and Mortality Weekly Report, Supplement vol. 61, 2012.
