Posted on September 27, 2011 at 4:38 pm by Ken
Analyzing Twitter data is becoming increasingly popular. Within the computational linguistics community, tweets are particularly challenging and interesting. The 140-character limit would seem to make tasks easier, since sentences are relatively short (e.g., compared to the long sentences in newspaper articles). However, this limit has brought with it some rather fundamental changes in the way we communicate, primarily in the lexicon, with novel creations (e.g., “l8” for “late”). In addition, tweets are full of non-standard punctuation, particularly in emoticons, which further complicates analysis.
A recent paper by Kyle Dent and Sharoda Paul, “Through the Twitter Glass: Detecting Questions in Micro-text”, took on the natural language processing (NLP) challenges (described briefly in a Scientific American article), developing NLP techniques to deal specifically with issues in tokenization, the lexicon, and parsing. They built a system to classify 2304 tweets into “real” questions and “not” questions (tweets that only superficially resemble questions).
Tweets share a property with Likert items: both are short. The content analysis program MCCA (Minnesota Contextual Content Analysis) has been applied to an examination of Likert items in an attempt to improve the coherence of an entire scale. I modified MCCA slightly so that it would perform a classification task, applied it to the Twitter data used by Dent & Paul, and achieved results almost as good, without having to deal with all the NLP issues. This suggests that MCCA can serve as an initial classification tool, a first step in the analysis of Twitter data. The MCCA analysis also showed that the tweets in this data set are extremely emotional, anti-practical, and anti-analytic.
(more…)
Posted on March 10, 2011 at 2:19 pm by Ken
Recently, I made a request on the ACL SIGLEX mailing list for tools that might help in analyzing preposition lexical samples. In this request, I indicated a need for software that would specifically provide enhanced word sketch analysis. I received only a couple of replies, one of which asked what I meant by this term. I responded somewhat vaguely, but the exchange sparked some thoughts that are worth exploring further. In particular, the discussion raised questions about the amount of information in preposition dictionary entries and what might help in expanding these entries. I’d like to expand on this, particularly on the relation between current approaches to word-sense disambiguation (primarily statistical in nature) and what ends up in the dictionary. I think there is still something of a disconnect between the computational community and the lexicographers.
(more…)
Posted on January 15, 2011 at 5:30 pm by Ken
In a recent posting to CORPORA on the topic of semantic primitives, John Sowa says,
“The so-called primitives are the result of analysis by adults who have learned how to write dissertations about language. I believe there are no primitives that are truly primitive in the sense that they cannot be analyzed in different ways by different adults with different biases.”
While I won’t argue with John, I do believe such statements can have a discouraging effect on useful research. Throughout the 1970s and 1980s, research on machine-readable dictionaries (MRDs) was quite the rage. However, in 1991, Jean Veronis and Nancy Ide wrote a paper, “An Assessment of Semantic Information Automatically Extracted from Machine Readable Dictionaries.” They concluded that 55 to 70 percent of the data was garbled in some way. This paper had a similarly discouraging effect on MRD research. I have been engaged in MRD research for 40 years and would like to suggest that the search for primitives is not without value.
(more…)
Posted on December 4, 2010 at 4:02 pm by Ken
I recently developed an overview of the tasks in SemEval (the series of semantic evaluations conducted under the auspices of the ACL SIGLEX). The nice thing about this exercise was that it put semantic analysis into a larger perspective, making it clearer where things are lacking. The overview groups the tasks into dictionary issues and issues involving how sentence and textual elements fit together, the fruits of which are then available for application areas. After the first Senseval (the precursor to SemEval) was conducted, with a focus on word-sense disambiguation (WSD), the question was raised as to what purpose WSD served. The same question can be asked about all the other tasks. Attempting to answer this question may help not only to identify further tasks needed in SemEval, but also to identify how the various pieces of information may be used in different application areas. In what follows, I offer some opinions, particularly trying to identify other research that is relevant to the SemEval tasks.
(more…)