Examining the Twitterverse with Content Analysis: A First Look
Analyzing Twitter data is becoming increasingly popular. Within the computational linguistics community, tweets are particularly challenging and interesting. The limitation of 140 characters would seem to make tasks easier, since sentences would be relatively short (e.g., compared to long sentences in newspaper articles). However, this limitation has brought with it some rather fundamental changes in the way we communicate, primarily in the lexicon, with novel creations (e.g., “l8” for “late”). In addition, tweets are full of non-standard use of punctuation marks, particularly in creating emoticons, further complicating analysis.

A recent paper by Kyle Dent and Sharoda Paul, “Through the Twitter Glass: Detecting Questions in Micro-text”, took on the natural language processing (NLP) challenges (described briefly in a Scientific American article), developing NLP techniques to deal specifically with issues in tokenization, the lexicon, and parsing. They built a system to classify 2304 tweets into “real” questions and “not” questions (which had a superficial resemblance to questions).

Tweets share a property with Likert scales, namely, that they are both short. The content analysis program MCCA (Minnesota Contextual Content Analysis) has been applied to an examination of Likert items in an attempt to improve the coherence of an entire scale. I modified MCCA slightly so that it would perform a classification task, applied it to the Twitter data used by Dent & Paul, and achieved results almost as good, without having to deal with all the NLP issues. This suggests that MCCA can provide an initial classification tool as a first step in the analysis of Twitter data. The MCCA analysis also showed that the tweets in this data set are extremely emotional, anti-practical, and anti-analytic.
MCCA is a content analysis program designed to characterize texts based on the relative frequency with which words in categories are used, compared to norms determined from general usage statistics for the English language. It has been used in over 1500 studies since the early 1970s, primarily via a mainframe program at the University of Minnesota (with statistics on results used to determine any trends in general usage). The texts can range in size from short answers given to open-ended questions in questionnaires, through newspaper articles and books, to multi-person transcripts such as focus groups or plays. MCCA takes about 4 seconds to analyze the 30,000 words in Hamlet. There are two primary sets of statistics used to characterize texts: (1) emphasis scores, showing the relative frequency of words in 116 categories (such as Feeling, Quantities, Spatial Sense, and Human Roles), and (2) context scores, profiling texts along four social context dimensions. The dimensions are traditional (judicial or religious texts), practical (newspaper articles, goal-oriented how-to texts), emotional (focus on personal involvement, such as leisure or recreation), and analytic (objective, research-oriented texts). The contextual analysis is a distinguishing characteristic of MCCA; the dimensions were determined from a principal components analysis of texts, and each dimension has a set of weights for each of the 116 emphasis categories. Typically, multiple texts are processed together to permit a comparison among them.
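The emphasis-score idea can be sketched as follows. The mini-dictionary, category names, and norm values below are illustrative stand-ins of my own; the real MCCA dictionary covers roughly 11,000 words and 116 categories:

```python
from collections import Counter

# Illustrative stand-ins; MCCA's actual dictionary and norms are far larger.
CATEGORY = {"happy": "Feeling", "sad": "Feeling", "three": "Quantities",
            "near": "Spatial Sense", "teacher": "Human Roles"}
NORMS = {"Feeling": 0.02, "Quantities": 0.01,
         "Spatial Sense": 0.01, "Human Roles": 0.01}

def emphasis_scores(tokens):
    """Relative category frequency in the text minus the general-usage norm."""
    counts = Counter(CATEGORY[t] for t in tokens if t in CATEGORY)
    total = len(tokens)
    return {cat: counts[cat] / total - norm for cat, norm in NORMS.items()}

scores = emphasis_scores("i am so happy and sad near the teacher".split())
# "Feeling" comes out over-emphasized relative to the norm; "Quantities" under.
```

A positive score means the text uses a category more heavily than general English does, which is the signal the contextual analysis builds on.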
Underlying MCCA is a dictionary of 11,000 words, each of which has been assigned one or more categories (i.e., allowing ambiguity). As a text is processed, words with multiple categories are disambiguated using a running context score (so that the selected category is closest to the running context). The various statistics characterizing texts are essentially based on the words that can be categorized. Unknown words are relegated to a “leftover” category and generally do not participate in the analyses. Two principal statistics are distance matrices, one for emphasis scores and one for context scores, allowing an examination of the distances among the texts being analyzed (or the characters in a play).
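A minimal sketch of that disambiguation step, with made-up context weights and an invented ambiguous word (MCCA's actual weights come from its principal components analysis):

```python
# Sketch: an ambiguous word resolves to the candidate category whose
# context-weight profile (traditional, practical, emotional, analytic)
# lies closest to the running context of the text so far.
# All weights and words here are illustrative, not MCCA's actual data.
CONTEXT_WEIGHTS = {
    "Feeling":    (-1.0, -0.5,  2.0, -1.0),
    "Quantities": ( 0.0,  1.0, -1.0,  1.5),
}
CANDIDATES = {"mean": ["Feeling", "Quantities"]}  # hypothetical ambiguous word

def disambiguate(word, running_context):
    """Pick the candidate category nearest the running context score."""
    return min(CANDIDATES[word],
               key=lambda c: sum((w - r) ** 2
                                 for w, r in zip(CONTEXT_WEIGHTS[c],
                                                 running_context)))
```

In a strongly emotional running context, for example, “mean” would resolve to Feeling rather than Quantities.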
As currently implemented, MCCA is not a classifier. However, it was straightforward to modify it so that it can be used as a classifier. To do this, a file containing multiple “texts” is processed to serve as the reference or marker set. (In the case of the Twitter data, the file was divided into two texts, one for instances deemed to be real questions and one for instances deemed not to be real questions. See the Dent & Paul paper for a more complete description of these two sets.) Then, instances to be classified are processed one by one, with each instance first analyzed as an ordinary text, with emphasis and context scoring, and next compared to the reference texts, using the nearest distance as the criterion for the classification.
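The nearest-distance modification can be sketched like this; the toy dictionary and labels are my own, and the real distances are computed over all 116 emphasis categories rather than two:

```python
import math

# Toy stand-ins for MCCA's dictionary and category list.
CATEGORY_OF = {"who": "Who-Where", "what": "Who-Where", "go": "Move-in-Space"}
CATEGORIES = ["Who-Where", "Move-in-Space"]

def emphasis_vector(tokens):
    """Relative frequency of each category among the tokens."""
    total = max(len(tokens), 1)
    return [sum(CATEGORY_OF.get(t) == c for t in tokens) / total
            for c in CATEGORIES]

def classify(tokens, references):
    """Label an instance by the reference text with the nearest emphasis vector."""
    v = emphasis_vector(tokens)
    return min(references,
               key=lambda label: math.dist(v, emphasis_vector(references[label])))

refs = {"question": "who what who".split(), "not-question": "go go go".split()}
```

The same scheme works with context-score vectors in place of emphasis vectors, which is the variant compared later in the post.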
In this first look at the Twitter data, I simply put the two files (questions.txt and notquestions.txt) into one file, with separators to reflect the two sets, and processed this file (in about 4 seconds). The size of this file is comparable to Hamlet (about 30,000 tokens), about 10 times as large as a smaller demonstration file of five texts. The first observation about the data is the percentage of words that could be categorized. For the demonstration file, about 90 percent of the words were classified; this is roughly what occurs for modern texts. For Hamlet, 83 percent are classified; this reflects the change in English over 400 years. For the Twitter data, only 77 percent of the words were classified; this is a clear indication that, with Twitter, a significant change in the language is occurring.
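The coverage figure is just the share of tokens found in the dictionary; novel spellings like “l8” fall into the leftover category. A minimal sketch with a toy dictionary of my own:

```python
DICTIONARY = {"you", "are", "late", "where"}  # toy stand-in for MCCA's 11,000 words

def coverage(tokens):
    """Percent of tokens that the dictionary can categorize."""
    return 100.0 * sum(t in DICTIONARY for t in tokens) / len(tokens)

# Standard spelling is fully covered; tweet-style spellings mostly are not.
standard = coverage("where are you you are late".split())
tweetish = coverage("where r u u r l8".split())
```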
I next examined the various statistics generated by MCCA to attempt to discern any differences between the Questions and ¬Questions. There were non-zero distances between the two sets for both emphasis scores and context scores; it was not immediately clear whether these differences were important. One of the result sets is a difference analysis that shows the emphasis categories that are most different between the two sets. In this case, two categories stood out as most different: Move-in-Space (forward, close, side) and Who-Where (who, which, someone, something).
The next step was to classify the instances. In this initial examination, I used the full Twitter data as the reference set and then classified each instance. I did not create a subset of the data to use as a “training set”, against which to classify the remaining instances as the “test set”. This would have been more rigorous, but I’m not sure that it would have been necessary in this first look. In this first test, I used the emphasis score distance as the criterion for classification. The results are shown in the following table:
| MCCA Results | MT Questions | MT ¬Questions |
| --- | --- | --- |
For comparison, the Dent-Paul results are shown in the following table:
| Dent-Paul Results | MT Questions | MT ¬Questions |
| --- | --- | --- |
The MCCA results have a precision of 0.62050, a recall of 0.61458, and an accuracy of 0.61902. These compare with the Dent-Paul results, with a precision of 0.64484, a recall of 0.77951, and an accuracy of 0.67881.
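For reference, the three measures relate to the confusion-matrix counts as follows; the counts in the usage line are invented for illustration, not the study's actual cell values:

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

p, r, a = metrics(tp=6, fp=2, fn=2, tn=10)  # illustrative counts only
```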
My conclusion is that the MCCA results are quite comparable and were achieved with much less effort. I performed the same test using the context score distance, the five categories with the highest differences, and the single Who-Where category. In all these cases, the results were not as good, with only the top five categories achieving an accuracy of 0.61033, but with much lower precision. I’m not sure if any better results can be achieved with MCCA. It’s possible that use of various machine learning classifiers might optimize the results, but I don’t think this would be worth the effort on this data set, which may not be typical of Twitter data. I think this classification task is very difficult, particularly since the principal criterion for selecting tweets in the Dent & Paul study was the presence of a question mark.
One very interesting aspect of the Twitter data is the overall contextual characterization of the two sets. As indicated above, the reference file was separated into two sets: Questions and ¬Questions. One of the statistics produced by MCCA is a table of the weighted context scores. Each context is normalized on a 50-point scale, from -25 to +25. The results in this case are shown in the following table:
In papers available through the MCCA link above, the point is made that in analyzing texts, the social context scores are almost never pure. Thus, these context scores are very surprising. They suggest that, at least for this set of Twitter data, the instances are almost purely emotional, with a strong anti-practical and anti-analytic bias, and a neutral score on traditional values. These results raise the question of whether this data is representative of the question universe and whether real questions that, for example, might be asked in a more practical and analytical context are being missed. Notwithstanding, the results make further examination of Twitter data worthwhile; different contexts (e.g., dealing with natural disasters or situations like the Arab Spring) would likely show different profiles. Clearly, this kind of analysis might prove to be very useful and interesting.