SENSEVAL: The CL Research Experience

(Preliminary draft- September 28, 1998; to be revised after SENSEVAL debugging round)

Ken Litkowski

CL Research

9208 Gue Road

Damascus, MD 20872-1025 USA

http://www.clres.com

email: ken@clres.com

 

Abstract: CL Research achieved a reasonable level of performance in the final SENSEVAL word-sense disambiguation evaluation, with an overall fine-grained score of 52 percent for recall and 56 percent for precision on 93 percent of the 8,448 texts. These results were significantly affected by time constraints; results from the training data and initial perusal of the submitted answers strongly suggest that an additional 15 percent for recall, 10 percent for precision, and coverage of nearly 100 percent could have been achieved without looking at the answer keys. These results were achieved with an almost complete reliance on syntactic behavior, as time constraints severely limited the opportunity for incorporation of various semantic disambiguation strategies. The results were achieved primarily through the performance of (1) a robust and fast ATN-style parser producing parse trees with annotations on nodes, (2) the use of the DIMAP dictionary creation and maintenance software (via conversion of the HECTOR dictionary files), and (3) the strategy for analyzing the parse trees with the dictionary data. Several potential avenues for increasing performance were investigated briefly during development of the system and suggest the likelihood of further improvements. SENSEVAL has provided an excellent testbed for the development of practical strategies for analyzing text. These strategies are now being expanded to include (1) parsing of dictionary definitions in MRDs to create entries like those used in SENSEVAL (and simultaneously, creating semantic network links), (2) analysis of corpora to extract dictionary information to create entries, and (3) extraction of information for creation of knowledge bases.


 

  1. Introduction
  2. Description of CL Research System
    1. Parser
    2. Dictionary Component
    3. Analysis Strategy
  3. Developmental Process
  4. Results
  5. Examination of Results and Possible Improvements
    1. Overall Assessment
    2. Oops! and Darn!
    3. Further Exploitation of System Capabilities
    4. Paths for Future Exploration
  6. Discussion
  7. Conclusions and Future Directions
  8. Bibliography


1. Introduction

To be completed.

2. Description of CL Research System

The CL Research system consists of a parser, various dictionaries, and routines to analyze the parser output in light of dictionary entries. The system was developed entirely to respond to the SENSEVAL call for participation, with the effort beginning in March 1998 and consuming about 3-1/2 person months, with 2-1/4 months spent on integrating the parser into the DIMAP dictionary maintenance software and 1-1/4 months focusing on the development of the analysis routines and running the program against the SENSEVAL training and evaluation data.

In the SENSEVAL system categorization, the CL Research system is an "All-words" system (nominally capable of "disambiguating all content words or, at least, all content words of a given grammatical category in a text"). This is to distinguish it from the other types of systems, which need training data (known disambiguated uses of the tagged words for setting parametric values) or require hand-crafting of definitions. We did not actually attempt to disambiguate all content words, only assigning parts of speech to these other words at this time. We did not use the training data to set parameters, but did make use of it to gauge the developmental progress of our system. The system relies on successful conversion of the HECTOR dictionary data, the use of which evolved during development; since this conversion is now almost automatic, our system could in theory proceed to disambiguate any word for which HECTOR-style dictionary information is available.

2.1. Parser

The parser used in SENSEVAL was provided by Proximity Technology, Inc. and is a prototype for a grammar checker. It has never been fully developed; an unsuccessful attempt was made in 1993 to use an earlier version in MUC-5. It was provided essentially as a black box, requiring some time for understanding its functionality and results, a process that can still be regarded as only in its initial phases. About two-thirds of the SENSEVAL effort was devoted to this familiarization and some fine-tuning.

The parser is based on an ATN-style grammar of 350 productions. Each production consists of a start state, a condition to be satisfied (either a non-terminal or a lexical category), and an end state. The condition may have an associated annotation which is made to the growing parse tree. The satisfaction of the condition frequently evokes an "action" program which may make further annotations or grow the parse tree. One unique feature of the parser is that, at virtually every point in parsing the input, allowance is made for interrupting the flow ("dynamic" parsing) to handle constituents like adverbial clauses, subordinate clauses, and appositives.
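The shape of a production described above can be illustrated with a minimal sketch. It is not the Proximity parser's actual data structure; the names Production and step, the state labels, and the use of a flat list as the "parse tree" are all assumptions made for illustration, and dynamic parsing is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Production:
    """One ATN-style arc: start state, condition, end state (hypothetical layout)."""
    start: str
    condition: str                      # a non-terminal or a lexical category
    end: str
    annotation: Optional[str] = None    # note attached to the growing tree
    action: Optional[Callable[[List[str]], None]] = None  # optional "action" program


def step(productions, state, category, tree):
    """Follow the first production leaving `state` whose condition equals `category`.

    Returns the end state, or None when no production applies.
    """
    for p in productions:
        if p.start == state and p.condition == category:
            if p.annotation is not None:
                tree.append(p.annotation)
            if p.action is not None:
                p.action(tree)
            return p.end
    return None


GRAMMAR = [
    Production("NP0", "det", "NP1", annotation="det-seen"),
    Production("NP1", "noun", "NP2", action=lambda t: t.append("head-noun")),
]
```

A real ATN would also push and pop sub-networks for non-terminal conditions; the sketch only shows how a condition, an annotation, and an action hang off a single arc.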

The parser is accompanied by a dictionary containing the parts of speech associated with each lexical entry. The dictionary information allows for the recognition of phrases (as single lexical entities) and uses 36 different verb government patterns to create additional paths for dynamic parsing and to recognize particles associated with the verbs. These government patterns follow those used in the Oxford Advanced Learner's Dictionary. The dictionary is easily extended; indeed, several additions were made during SENSEVAL.

The output of the parser is in the form of bracketed parse trees, with constituents down to leaf nodes consisting of the part of speech and lexical entry. Annotations, such as number and tense information, may be included at any node. The parser does not always produce a correct parse, but it is very robust since the parse tree is constructed starting from the leaf nodes. When the constituents cannot be put together in a way that corresponds to the grammar, STUB nodes are generated and become part of the overall parse tree for the input text. This makes it possible to examine the local context of a word even when the parse tree is wildly complex (such as might occur when parsing a list of items from a catalog). The parser produced viable output for almost all the texts in the evaluation corpora, 8443 out of 8448 items (99.94 percent); even the remaining five failures may have been due to a length constraint of 1200 characters and 400 words.

The parser is also reasonably fast. It parsed the 8448 sentences containing the target words in 94 minutes, about 90 sentences per minute, on a Pentium II 266 MHz machine with 64 MB of RAM. This included displaying the parse tree for each sentence in a pretty-printed format.

2.2. Dictionary Component

The design of the dictionary to support the word-sense disambiguation (WSD) was viewed as a critical component. In the end, however, this turned into something of an ad-hoc process, guided primarily by practical considerations with little attention to any theoretical constructs. The overall design of the CL Research SENSEVAL system was based on the use of the DIMAP dictionary creation and maintenance software. To respond to SENSEVAL, it was necessary to integrate the Proximity parser into DIMAP, use the existing DIMAP functionality to create dictionary entries from the HECTOR data, and create an analysis strategy for examining the parser output in light of the DIMAP entries.

The parser integration was straightforward, involving only compilation and linking of the C source modules into the C++ Windows environment used for DIMAP. This was done in such a way as to facilitate processing and examination of results from parsing the SENSEVAL corpora, specifically enabling a user to select which texts to examine, including dry run, training, and evaluation texts, from a single sentence to an entire file to all files. Initially, this system was very brittle, but by the final run, it was very robust, so that all problematic texts could be processed without system failure, even when no parse was produced or no sense selection could be made.

Of course, the key to any success of the CL Research system depended on the effective utilization of the HECTOR data. Several days were spent poring over this data and trying to find how to structure it in DIMAP. The underlying structure of a DIMAP dictionary is oriented around an entry with multiple senses, with each sense having its own part of speech. While it is possible to encode homograph information, the focus is on a single string as the entry, since this is where any recognition begins. The DIMAP structure also envisions phrases as an entry, so that there is a separate entry for a phrase such as in the betting; DIMAP encodes this by creating "idiom starters" for phrasal entries, so that there would be additional entries in the dictionary for in and in the, both with a "nil" part of speech (although, of course, eventually the entry for in would be populated with its prepositional senses).
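The "idiom starter" convention can be sketched as follows. This is only an illustration under assumed names (add_phrase, part_of_speech, and a plain mapping from entry string to a list of senses); it does not reproduce the DIMAP storage format.

```python
def add_phrase(entries, phrase, pos):
    """Add a phrasal entry plus an idiom starter for every proper prefix.

    `entries` maps an entry string to its list of senses (hypothetical layout).
    A starter created here has no senses yet, which stands for a "nil" part of speech.
    """
    words = phrase.split()
    for i in range(1, len(words)):
        starter = " ".join(words[:i])
        entries.setdefault(starter, [])
    entries.setdefault(phrase, []).append({"pos": pos})
    return entries


def part_of_speech(entries, key):
    """Return None for a missing entry, "nil" for a bare starter, else the first sense's pos."""
    senses = entries.get(key)
    if senses is None:
        return None
    return senses[0]["pos"] if senses else "nil"
```

Adding in the betting in this way creates entries for in and in the as well; when the prepositional senses of in are added later, the same entry simply stops being "nil".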

A DIMAP sense can hold an unlimited amount of information. There are specific fields to hold the usual types of information found in an ordinary dictionary, specifically, sense numbers (the "ord" of HECTOR data), sense usage labels (the "field"), definitions, and usage notes (from the HECTOR data, cases where a definition might say "used as a noun modifier", but not the parenthesized phrases like "of people" which constitute selectional restrictions). A DIMAP sense also contains fields for linking the instant sense to others, either directly through specific fields for hypernymic and hyponymic links or more generally, through a user-defined link, such as a meronymic or substance-of link. Any number of such links is allowed so that multiple inheritance is supported. A DIMAP sense also has fields that allow specification of syntactic patterns and logical forms to be associated with the sense. Finally, and most importantly for this task, a DIMAP sense can hold any number of features, where the user specifies attributes and their values.

The DIMAP functionality includes the capability for uploading dictionary data from other sources after it is put into a specified format. A small program was written to convert the HECTOR data into this format; basically, this program recognizes the data-type definition of the HECTOR data and writes the appropriate formatting information into a file that can be uploaded into DIMAP entries and senses. This program was used to convert virtually all of the HECTOR data. Some modifications were made later by hand, either in the files to be uploaded or directly inside the DIMAP dictionaries (that is, using the maintenance functionality of DIMAP). As experience was gained in using the HECTOR data, it became clear that this conversion program can be refined further to eliminate hand modifications and to increase the efficiency with which the dictionary data are used. We suspect this characteristic will be true of dealing with any dictionary data, whether from ordinary dictionaries or computational lexicons.

For the most part, the HECTOR data were encoded using the feature field (excepting the information detailed above). Usually, noun, verb, and adjective subtypes were encoded as type attributes with values like nu, vt, v-absol, and attrib. Some feature attributes, like sing, were given + or - values. Features were created for attributes like ord, UID, clues, dfrm, kind, and ex(amples), with values corresponding to the HECTOR data. These are discussed more fully in the next section, describing the analysis strategy.

By pulling in all the HECTOR data, we intended to enable a capability for analyzing that data in order to create additional information in the entry that would be usable in the WSD task. Because of time limitations, only a few limited forays along this path were made, none of which paid off in time for the final run. Of most importance was the parsing of the definition. We were able to use the same parser to begin analysis of the semantic content of the definition. Of particular interest here was the analysis of parenthesized expressions within verb and adjective definitions. The idea was to use such expressions to build further selectional restrictions for these senses. The DIMAP functionality includes integrated access to WordNet data, so we were able to identify the WordNet synsets to which the selectional restrictions belonged. However, since we were unable to see a clear path to making use of this information, further exploration was deferred. We also wanted to parse the verb and noun definitions to identify positions within the WordNet hierarchies; again, time constraints ruled out further exploration.

We also explored the possibility of making use of the examples that accompanied each sense in the HECTOR data. During development, we did not try to parse the examples, although this is certainly a viable path for further exploration. Instead, we examined the possibility that the contexts provided by the examples would leave something of a characteristic trace. Another part of the DIMAP functionality is an integrated content analysis module known as Minnesota Contextual Content Analysis (MCCA). MCCA characterizes textual material in two ways, based on a dictionary which places words into one or more of 116 categories. The first characterization is a "context" score (along pragmatic, analytic, emotional, and traditional dimensions); the second is an "emphasis" score (showing which of the 116 categories are more or less emphasized compared with a norm established using the Brown corpus). MCCA produces two vectors, one of size 4 and the other of size 116, for each text that is analyzed. We produced these scores for each sense's example usages and then determined which sense was the closest (using Euclidean distance) to the scores for a set of about 100 training texts. There was no improvement in scores using the context scores, but as much as 20 percent using the emphasis scores. However, again, time constraints did not allow for integration of this finding into the final analysis.
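The nearest-sense comparison just described can be sketched in a few lines. The function names and the toy four-element vectors are assumptions for illustration; real MCCA context vectors have 4 elements and emphasis vectors 116, and the values below are invented.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length score vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def closest_sense(text_vector, sense_vectors):
    """Return the sense whose example-based vector lies nearest the text's vector.

    `sense_vectors` maps a sense identifier to the vector computed from that
    sense's example usages (hypothetical layout).
    """
    return min(sense_vectors, key=lambda s: euclidean(text_vector, sense_vectors[s]))


# Invented toy scores, one vector per sense.
SENSES = {
    "sense-1": [0.9, 0.1, 0.0, 0.0],
    "sense-2": [0.1, 0.8, 0.1, 0.0],
}
```

The same function applies unchanged to the 116-element emphasis vectors, since only the vector length differs.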

One final point about the dictionary design has to do with the ordering of the HECTOR definitions. As will be seen in the description of the analysis strategy, the order of the senses in DIMAP is crucial. The HECTOR description indicates that some attempt was made to order the senses according to frequency. In general, we found these orders to be consistent with the training data, but there were a few cases where the order was very inconsistent and so, we changed the order in which these definitions were uploaded into DIMAP. We discuss this point more fully below.

2.3. Analysis Strategy

The CL Research response to SENSEVAL is intended to be part of a larger discourse analysis processing system based on observations first described in Litkowski & Harris (1997). Basically, this system entails (1) parsing each sentence of a discourse; (2) identifying and maintaining a list of discourse entities; and (3) identifying and maintaining a list of discourse eventualities. The most significant part of this system, from the standpoint of WSD, is a lexical cohesion module that takes advantage of the observation that, even within a text of 5 or 6 sentences, many of the words were found to be related to one another in the WordNet hierarchy and also to form other relations such as typical subject of a verb (i.e., a semantic network containing a larger set of relations than used in WordNet). Subsequent to these observations, we implemented functionality in DIMAP following Litkowski (1978) to identify semantic primitives from an analysis of the directed graph formed by hypernymic links from the heads of dictionary definitions. After this implementation, as a lark, we created a sublexicon by extracting from WordNet only the words in the text used in Litkowski & Harris (this extraction included all the WordNet links for the relevant synsets). We then subjected this set of synsets to the primitive finding algorithm and found that the sublexicon induced by the text yielded a set of primitives itself, specifically, the set of words that had no hypernymic links within the sublexicon. We interpreted this induced sublexicon as constituting a reduced ontological commitment necessary for understanding a particular text. Further, such a sublexicon automatically pares the set of available senses to those with overt links and hence constitutes an automatic disambiguation. The CL Research participation in SENSEVAL was thus guided by the objective of determining the generalizability, power, and value of these observations.
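The notion of primitives induced by a text can be made concrete with a small sketch: a word is a primitive of the sublexicon if none of its hypernyms is itself in the sublexicon. The function name and the toy hypernym table are assumptions; this is not the DIMAP primitive-finding algorithm of Litkowski (1978), which analyzes the full directed graph.

```python
def induced_primitives(text_words, hypernyms):
    """Return the words of the sublexicon that have no hypernymic link inside it.

    `hypernyms` maps a word to the words it links to as hypernyms (hypothetical
    layout); links leading outside the sublexicon are ignored.
    """
    sublexicon = set(text_words)
    return {
        word
        for word in sublexicon
        if not (set(hypernyms.get(word, ())) & sublexicon)
    }


# Invented toy hierarchy.
HYPERNYMS = {
    "dog": ["animal"],
    "cat": ["animal"],
    "animal": ["organism"],
}
```

With the words dog, cat, and animal in the text, only animal is a primitive, because its hypernym organism lies outside the sublexicon.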

The actual implementation in SENSEVAL fell far short of the principal objective, but nonetheless has provided considerable insights into the importance of a lexical cohesion module and opened up a large number of avenues for further exploration. In addition, it is expected that examination of the results and methods of other SENSEVAL participants will provide still further avenues.

The principal components of the CL Research system are:

 

2.3.1. Preprocessing the SENSEVAL texts

 Each SENSEVAL text was read and broken down into sentences, of which the last contained the tagged word to be disambiguated. Since we did not have time to develop methods for making use of the context-providing sentences, only the final sentence of each text was analyzed. We found it necessary to perform some "cleaning" of the sentence before submitting it to the parser. For the most part, this consisted of removing the entity references, frequently changing them back to what the original text contained (e.g., changing "&dash" to "-"), since the Proximity parser is generally capable of handling the original form. We also took the opportunity to remove extraneous spaces and the tags surrounding the target word. This preparation of the text was thus quite innocuous.

One step of the cleaning process, however, led to a degradation of overall performance. Many of the sentences in the SENSEVAL corpora had a closing quote mark without a corresponding opening quote mark. We attempted to remove this extraneous character by erasing it and all subsequent spaces. We didn't notice that there could be an opening quote mark without a closing one. The routine we used erased all material from the single quote mark and frequently erased the target word, so that we unwittingly did not submit results for 195 sentences (2.3 percent).
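A safer version of this cleaning step removes only the unmatched quote character and leaves the rest of the sentence intact, whether the lone mark is an opening or a closing one. The following is a sketch under assumptions, not the routine used in the final run; the function name and the choice to collapse whitespace afterwards are ours.

```python
def remove_lone_quote(sentence, quote='"'):
    """Drop a single unmatched quote mark without touching the surrounding words.

    Sentences with zero or several quote marks are returned unchanged.
    """
    if sentence.count(quote) != 1:
        return sentence
    cleaned = sentence.replace(quote, "", 1)
    # Collapse any doubled spaces left behind by the removal.
    return " ".join(cleaned.split())
```

Because only one character is removed, the target word survives regardless of which side of the quote mark it falls on.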

2.3.2. Parsing a sentence

After cleaning the sentence, it was submitted to the parser. The parse could fail (not returning a parse tree and shutting down the program), but we incorporated functionality to catch such failures (and return no answer for the text) and continue to the next sentence. Failures occurred often during development and we made some changes to the parser to improve its robustness (one of which was changing the maximum number of words in a sentence from 100 to 400 and the maximum number of characters from 400 to 1200). In the final run, there were only five such failures and we have not yet analyzed the reasons for them.

2.3.3. Identification of the appropriate DIMAP entry

After obtaining the parse tree result for the sentence, we first had to locate the tagged word within the tree. Making this link proved somewhat problematic, so we returned empty answers for such cases. In the final run, we returned empty answers for somewhere between one and four percent of the sentences. (An empty answer was scored in SENSEVAL as "not attempted.")

After locating the tagged word, we next had to determine whether it was an inflected form and to establish some basic characteristics of the tagged word. For example, if the parse tree indicated that the tagged word was an adjective but had an underlying verb base, we could set a flag indicating that we had a present or past participle. For a noun, we could examine the parse tree to determine if it was used as a noun modifier or if the node was labeled as singular or plural. For a verb, we could determine whether there was a noun phrase identified as modifying the verb and hence, could be characterized as its object.

It is important to note at this point that establishment of the basic characteristics and virtually all the processing that follows focuses on the local context of the tagged word. The parse tree output from the Proximity parser contains several pieces of information about the tagged word, particularly for verbs, where the information includes the root form, its syntactic patterns, and inflectional information. Having located the tagged word (both its ordinal position in the sentence and its position in the parse tree), we moved to the left and right to examine properties of preceding and succeeding words and constituents. Correct answers in SENSEVAL are thus a reflection of successful identification of the part of speech (particularly since the DIMAP dictionary was not separated into senses for a single part of speech) and examination of the properties of very local context. It is difficult to assess the relative importance of the part of speech. Early results with the training data, before very much of the local context examination was incorporated, were about 48 percent overall, with many of the errors still stemming from basic bugs in the program. So, perhaps it is possible to achieve 30 to 40 percent correct disambiguation with part of speech alone. The remaining success can largely be attributed to examination of the local context.

We would next look up the base form of the tagged word in the DIMAP dictionary and retrieve its entry. (If an entry was not available, we would not make an assignment and would return an empty answer.) For some HECTOR words (such as band), there were separate lexical items (e.g., brass band); when these were converted to DIMAP format, they were made instances of a general sense of band and separate DIMAP entries were also created. (These are different from kinds of bands, such as rock band.) So, at this point in the processing, we iterated through the DIMAP senses for those with instances. If the prior word in the input combined with the base form of the tagged word constituted a DIMAP entry, that entry was retrieved and the base form entry was not considered further. We performed a similar operation with the base form of the word following the tagged word, if the DIMAP entry for the tagged word indicated that it might be an idiom or phrase starter (e.g., resulting in the retrieval of a DIMAP entry for band saw).

At this point, for adjectives derived from verbs, we would also retrieve a distinct entry created when the HECTOR data indicated that a particular sense was labeled as a dfrm.

There were many problematic cases of obtaining the correct DIMAP entry because of difficulties in interpreting the parse results, particularly for derived forms of verbs and the word steering. In many cases (probably around one percent), these resulted in an empty answer. In an unknown number of cases, this difficulty in retrieving the appropriate DIMAP entry resulted in an incorrect answer.

2.3.4. Filtering out non-viable senses and adding preference points

This component is the largest part of the CL Research system and is where the essential sense selection is made. In this step, we iterate over the senses of the DIMAP entry, adding to an array of possible senses, each of which has an accompanying score. The score is not a critical part; for many words, score-adding routines do not come into play, but when they do, the results are used. Most of the work is accomplished by examining the features associated with a sense.

A sense is excluded if it does not match the part of speech returned by the parser for the tagged word. (Note that we did not separate parts of speech for words that fall into more than one category. The CL Research results would have been the same if all the SENSEVAL corpora had been combined in one file.) With a part of speech match, we initialize several flags. We make note that we have to find a sense identified as a noun modifier if the parse results indicated that a noun was used as such. Similarly, if we have identified an object of a verb, we note that we have to find a verb sense that has a type that allows an object (i.e., not a sense marked as only intransitive). We next iterate over the feature set associated with the sense, skipping those not used in this phase.

There are certain attributes that we took as absolute. If an adjective sense was marked as superlative (e.g., slightest) and an adjective was not so marked by the parser, we excluded the sense. Conversely, if the sense was marked as superlative and we had a non-empty set of viable senses, we dumped this set and started anew, thus showing a preference for such marked senses. Similarly, if a noun sense was marked as singular or plural and the parse tree showed the opposite or the noun was marked as uncountable and the parse tree showed a plural, we excluded the sense.
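The number test for noun senses can be written as a small predicate. The labels "singular", "plural", and "uncountable" and the function name are assumptions for illustration; the HECTOR data encode these markings differently (for example, as sing with a + or - value).

```python
def noun_sense_excluded(marked, parsed_number):
    """Return True when the sense's number marking contradicts the parse tree.

    `marked` is the sense's marking ("singular", "plural", "uncountable", or None);
    `parsed_number` is what the parser reported ("singular" or "plural").
    """
    if marked == "singular":
        return parsed_number == "plural"
    if marked == "plural":
        return parsed_number == "singular"
    if marked == "uncountable":
        return parsed_number == "plural"
    return False  # unmarked senses are never excluded on this ground
```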

We did not do much with clues as specified in the HECTOR data, although we did convert some clues into another feature, kind, discussed below. We used the collocate ("c/") clue followed by a single word. For this type of clue, we merely allowed the collocate to be present anywhere in the sentence, and if so, we made sure the sense was included in the viable set and added a score of 5 points. We treated the with operator in the grammar field of HECTOR data in a similar manner, except requiring this to be the succeeding word in the sentence; we only used this if a literal was specified and added a score of 3 points. We treated the after operator in the grammar field of HECTOR data in a similar manner, except requiring this to be the preceding word in the sentence; we only used this if a literal was specified or an adjective was required as the preceding word and added a score of 3 points for a literal and 2 points for an adjective.
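The point values given above can be collected into one scoring function. This is a sketch only: the dictionary keys "collocate", "with", and "after", the "[adj]" marker, and the prev_is_adjective flag (which would come from the parse tree) are all hypothetical names, not HECTOR or DIMAP notation.

```python
ADJECTIVE = "[adj]"  # hypothetical marker meaning "any adjective must precede"


def clue_points(sense, words, target_index, prev_is_adjective=False):
    """Score one sense against the sentence using the three clue types.

    collocate anywhere in the sentence: 5 points;
    `with` literal as the next word: 3 points;
    `after` literal as the previous word: 3 points, or 2 for a required adjective.
    """
    lowered = [w.lower() for w in words]
    prev_word = lowered[target_index - 1] if target_index > 0 else None
    next_word = lowered[target_index + 1] if target_index + 1 < len(lowered) else None

    points = 0
    collocate = sense.get("collocate")
    if collocate is not None and collocate in lowered:
        points += 5
    with_literal = sense.get("with")
    if with_literal is not None and with_literal == next_word:
        points += 3
    after = sense.get("after")
    if after == ADJECTIVE:
        if prev_is_adjective:
            points += 2
    elif after is not None and after == prev_word:
        points += 3
    return points
```

A sense earning any points here would also be forced into the viable set; that bookkeeping is left out of the sketch.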

For nouns identified by the parser as being noun modifiers, we did not require that there be a sense marked as mod+. However, if there was such a sense and the tagged word was so used, we dumped the existing set of viable senses and started anew, unless we already had multiple senses marked as noun modifiers.

For verbs, the presence of an object phrase (a noun phrase, an infinitive clause, or a 'that' clause) was key. It was first necessary to find the verb type among the features for the sense. If found, we could then determine whether the presence or absence of an object was consistent with the type. If consistent, we added the sense to the set of viable senses. At this time, the verb processing is very rudimentary. If a verb was marked as ergative or absolute, we viewed this as a match. If there was an object and the sense was marked transitive, ditransitive, or reflexive, we viewed this as a match. If there was no object and the sense was marked intransitive, we viewed this as a match. If we had already identified a verb type, but had not yet made a match and we came upon the feature passive, we recorded a match if the verb was a past participle form and the HECTOR data indicated "rarely," "often," "usually," or "only" for this sense. Finally, if we had flagged the present participle form, found a verb type but not yet a match, and dfrm data from the HECTOR sense matched the tagged token, we recorded a match. If, after processing all features for the sense, we had recorded a verb match, we added the sense to the set of viable senses.
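The core of the verb test, matching a sense's type against the presence or absence of an object, reduces to the following sketch. The spelled-out type labels and the function name are assumptions (the HECTOR codes are values such as vt and v-absol), and the passive and dfrm rules are deliberately omitted.

```python
WITH_OBJECT = {"transitive", "ditransitive", "reflexive"}
ALWAYS_MATCH = {"ergative", "absolute"}


def verb_type_matches(verb_type, has_object):
    """Return True when the verb type is consistent with the parsed object status."""
    if verb_type in ALWAYS_MATCH:
        return True
    if has_object:
        return verb_type in WITH_OBJECT
    return verb_type == "intransitive"
```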

The kind field in the HECTOR data emerged as perhaps the most significant recognition device; several clues in the HECTOR data were rewritten into kind features as it became apparent that this field could serve very efficiently. In the HECTOR data, the kind field is designed primarily to list compounds and combinations in which a sense appears. For the most part, this applies to nouns and provides "kinds" that may frequently be separate headwords (e.g., brass band, cocktail onion, and football shirt). As initially implemented, the presence of a kind feature simply resulted in examining the word immediately prior to the tagged word. It was then observed that, in noun phrases, the kind could be separated from the tagged word by intervening words; as a result, we allowed the recognition of the kind if the value of this feature appeared in the same noun phrase. It was next observed that the functionality to accomplish this could be generalized to handle many of the HECTOR clue equations (i.e., where the target word in a clue was represented by an equals sign).

Satisfaction of a kind feature is deemed absolute in sense selection. That is, if the value of the kind attribute is matched, the current set of viable senses for the tagged word is discarded and we only allow further senses to be added to the set if they similarly match the value of a kind attribute. (There are only a few words where an identical kind equation appears under more than one sense.)

As currently implemented, a kind specification has the following format: (1) if there is no equals sign, the value is interpreted as specifying what must appear prior to the target word; (2) if there is an equals sign, then what appears before the equals sign must appear prior to the target word and what appears after the equals sign must appear after the target word; (3) any word appearing in the equation gives rise to a search for the word in its root form; (4) if a word is quoted, it must appear exactly as given; (5) a word type may be specified in brackets (e.g., "[prpos]" indicates that a possessive pronoun must appear in that position, so that a phrase like on one's knees would be recognized for any possessive pronoun before the word knees); (6) an asterisk in the equation allows from zero to three intervening words; and (7) the presence of a (non-quoted) verb in the equation requires an object (e.g., to handle cases like give a promise or cover one's bets). We have not yet implemented the more elaborate equations available from the HECTOR clues such as those specifying grammatical codes and lexical sets. There are some bugs in the kind processing and we have not yet examined cases in the training or evaluation data where phrases are present but not accurately recognized, but it is clear that this is a very powerful and efficient mechanism for recognizing senses that have such specifications.
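A matcher for rules (1) through (6) can be sketched as below; rule (7), the object requirement for verbs, is not modeled. Everything here is an assumption made for illustration: the function names, the whitespace-separated equation syntax, the WORD_TYPES table, the toy root-form lookup, and the choice to require that the matched words be adjacent to the target unless an asterisk intervenes.

```python
WORD_TYPES = {"prpos": {"my", "your", "his", "her", "its", "our", "their", "one's"}}
TOY_ROOTS = {"gave": "give", "knees": "knee"}  # stand-in for real morphology


def toy_root(word):
    """Return an assumed root form; unknown words are simply lower-cased."""
    return TOY_ROOTS.get(word.lower(), word.lower())


def match_tokens(pattern, words, root):
    """True when `pattern` consumes exactly the whole of `words`."""
    if not pattern:
        return not words
    head, rest = pattern[0], pattern[1:]
    if head == "*":  # rule (6): zero to three intervening words
        limit = min(3, len(words))
        return any(match_tokens(rest, words[k:], root) for k in range(limit + 1))
    if not words:
        return False
    word = words[0]
    if len(head) > 1 and head[0] == '"' and head[-1] == '"':
        ok = word == head[1:-1]                                   # rule (4)
    elif head[0] == "[" and head[-1] == "]":
        ok = word.lower() in WORD_TYPES.get(head[1:-1], set())    # rule (5)
    else:
        ok = root(word) == root(head)                             # rule (3)
    return ok and match_tokens(rest, words[1:], root)


def kind_matches(spec, words, target_index, root=toy_root):
    """Rules (1) and (2): text before "=" precedes the target, text after follows it."""
    before, _, after = spec.partition("=")
    left, right = words[:target_index], words[target_index + 1:]
    left_ok = any(match_tokens(before.split(), left[i:], root)
                  for i in range(len(left) + 1))
    right_ok = any(match_tokens(after.split(), right[:j], root)
                   for j in range(len(right) + 1))
    return left_ok and right_ok
```

Under these assumptions, the specification "on [prpos] =" recognizes on her knees, and "give * =" recognizes gave a promise but not a phrase with four words between the verb and the target.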

After all features of a sense have been examined, we perform some default processing with respect to the sense. If the sense has not yet been added to the set of viable senses, we add it unless we have set a flag that indicates it should not be added. This attempts to ensure that the set of viable senses is not empty and that senses are not excluded unless we have had some good reason for discarding them. There are still some bugs in this process resulting in empty answers for one to three percent of the evaluation texts.

2.3.5. Selection of an answer

After adding to the set of viable senses and giving points for certain characteristics, the set of viable senses is sorted and the one with the highest score is returned. The sorting preserves the sense order present in the DIMAP entry for the tagged word, so that the most frequent sense is selected from among those that have not otherwise been filtered out. In most cases, however, the ordering does not reflect the HECTOR order, but rather the frequency pattern present in the dry-run data. Further investigation of the effect of this reordering is necessary.

This is the point in the processing where it was intended to incorporate semantic or other criteria for sense selection. However, time constraints precluded any developments in this area. Indeed, it may be said that the fruitfulness of the prior steps lowered the priority for attending to this area.

Finally, if the set of viable senses was empty at this point, an empty answer was returned (resulting in the SENSEVAL scoring of the answer as "not attempted"). In the final run, empty answers were returned for 4.1 percent of the texts, although some of these arose from earlier parts of the program. We need to separate out which of these are due to an empty set of viable answers. As of the date of this paper, we have now reduced the empty answers to 2.3 percent.

3. Developmental Process

The development of the CL Research system has been a fascinating and a satisfying experience, primarily because of its incremental nature, which provided immediate feedback on changes with demonstrable improvements in success with the SENSEVAL training data. The effort can be divided into two segments: (1) an unfocused phase, consisting primarily of familiarization with the Proximity parser and the HECTOR dry-run data and development of ancillary programs to facilitate gauging of progress, and (2) a focused phase, working with the HECTOR training data, implementing functionality to perform the parsing and return answers that could be scored. All efforts were focused on familiarization with the Proximity parser until mid-May, particularly with a new version of the parser received near the end of April. We began working with the HECTOR dry-run data in mid-May, primarily focusing on familiarization with HECTOR dictionary entries for conversion to DIMAP entries, but with continued examination of the details of the Proximity parser code and data structures for the purpose of identifying how to make use of its parsing output, along with development of the infrastructure programs (primarily, the integration within DIMAP). These efforts continued until mid-July, when the training data and the final set of HECTOR definitions were obtained. This marked the beginning of the second phase and the development and implementation of the analysis strategy described in the previous section.

The second, more focused phase was marked by concentration on the exigencies of the moment: what worked and where was the next most likely source of the greatest improvement in the results. To this end, we might focus on any of the three major components of the system: the parser, the DIMAP dictionary, or the analysis strategy. The development revolved around the 29 training corpora that were provided. The objectives were (1) to create DIMAP dictionaries from each HECTOR dictionary, (2) to ensure that the dictionary entries used by the parser were properly configured, and (3) to test algorithms against the training corpora. To gauge progress, we maintained a set of statistics on the performance of the system against the individual corpora. These statistics included not only the percent of correct assignments as measured by the assignments included in the training data, but also a characterization of the failures (the empty assignments described in the previous section). This enabled us to see where the system seemed to be experiencing the greatest failures.

We began with the smaller training corpora, covering words with small numbers of definitions in the three major parts of speech (onion, invade, and wooden). We focused first on simply getting the system to run against the training corpora, since at first there were many difficulties in interacting properly with the parsing output. We gradually improved this interaction up until the final evaluation run. Once over the first hurdle, this became less and less of a problem, but there were still difficulties in the final run, some of which have since been removed.

Once beyond the parsing difficulties, we were able to focus on implementing the functionality constituting the analysis strategy. We worked first on ensuring a connection between the parse output and the basic parts of speech. For nouns and adjectives, this was generally straightforward, except in cases when they were capitalized and part of proper noun phrases. For verbs, this posed more of a difficulty. For example, invading might be identified by the parser as either an adjective or a verb. An important next step for verbs was the recognition of whether there was an object.

We moved on to words that had somewhat more complex sets of definitions, but ones that were generally unambiguous once appropriate recognitive devices had been set in place. The word shirt was particularly useful, since it contained a number of senses that called simply for the recognition of shirt used in set phrases or as a noun modifier. We were able to implement some basic functionality to recognize phrases, making use of all three components of the system. The parser has the ability to recognize phrases as a single unit; our implementation involved merely ensuring that the parser dictionary included such phrases. Similarly, DIMAP has mechanisms for indicating that a word is part of a phrasal entry; our implementation ensured that conversion of HECTOR dictionary data to DIMAP format included the creation of such phrasal entries. Finally, in the analysis functionality, we made sure that the Proximity results for phrases were properly used in conjunction with the DIMAP entries and then added part of the functionality to make use of a kind feature to recognize phrases such as sweat shirt. (But, we still had not developed a mechanism for handling phrases like in one's shirtsleeves.)

At this point (the last day of July), we had most of the mechanisms in place for making complete passes through the training data. We were still experiencing some substantial difficulties in getting complete runs, but had managed to process all the training data, failing on assignments for 12 percent of the 13,000 texts, but with an overall initial recall of 48 percent. It was at this point that we obtained the evaluation data and could begin to determine what difficulties we would have in processing it.

An important part of the development process was maintaining the statistics for each system run against the training and evaluation corpora. These statistics were useful not only for identifying what to do next, but also for measuring the effect of any changes. One of the first surprises from analysis of the changes came from discovering that a change to improve the results for one word could lead to a degradation in the results for another word. Examination of the changes, particularly when a degradation occurred, proved to be an excellent debugging tool, enabling us to focus in on problematic parts of the code. On some occasions, a degradation in performance came from more rigorous and proper use of the HECTOR dictionary data. For example, better routines for finding the object of a verb (making sure that the parser identified the NP as the object rather than just using a following NP) led to a disheartening reduction in performance, but correctly so. Maintaining these statistics was also extremely useful for testing hypotheses or different methods of handling some aspect of the disambiguation. Thus, for example, it was found that the HECTOR ordering of definitions did not correspond to the frequencies present in the training data; changes were made for some of the HECTOR dictionaries to accord with these frequencies, in some cases resulting in significant improvement in results.
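The run-to-run comparison described here can be sketched as follows. The data layout (a mapping from word file to a `(correct, total)` pair) and the function names are assumptions made for illustration, not a description of the ancillary programs actually used.

```python
def compare_runs(previous, current):
    """Return per-word changes in percent correct between two runs.

    Each argument maps a word file (e.g. "shirt-n") to a (correct, total)
    pair; this layout is an assumption made for the sketch.
    """
    changes = {}
    for word, (correct, total) in current.items():
        if word not in previous:
            continue
        prev_correct, prev_total = previous[word]
        if total == 0 or prev_total == 0:
            continue
        delta = 100.0 * correct / total - 100.0 * prev_correct / prev_total
        changes[word] = round(delta, 1)
    return changes


def degraded(changes):
    """List the words whose results got worse, as candidates for debugging."""
    return sorted(word for word, delta in changes.items() if delta < 0)
```

Running such a comparison after every change makes the trade-offs visible at once: a word that improves and another that degrades appear side by side in the same report.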

We continued to work with the training data until the last minutes on the due date for the final evaluation results. We had essentially only 10 days for development of the analysis strategy (described in the previous section) from the time when we first achieved an overall successful pass through the training data. During this time, we moved the results to failing on assignments for 3.9 percent of the training texts and an overall precision of 60.5 percent. These results are based on fine-grained scoring (and are actually low since we did not take into account ambiguous assignments by the coders). Curiously, for some of the words (we think now that these were words having hyphenated forms), we had significant reductions in performance right at the end.

We were able to spend very little time in examining the results from the training data and none at all on the final evaluation corpora, other than to ensure that we were producing output (and finding out later that we had missed 2.3 percent of the texts and produced empty results for another two percent because of some problem with hyphenated words in the DIMAP dictionary used in the final run). We also spent almost no time in looking at the texts in either the training or the evaluation corpora. The little time that was spent on the training texts was primarily to examine the parse tree output in order to develop the next general changes in the analysis strategy and to find out general sources of problems. We did not have the opportunity to systematically examine incorrect assignments.

4. Results

To be completed. In Tables 1 and 2, we present the overall results and the results by task (i.e., major part of speech), first for recall (percent correct of total) and then for precision (percent correct of total attempted), as scored by the SENSEVAL coordinators for the coarse, mixed, and fine-grained assessments. The precision results are based on the SENSEVAL scoring routine determination of whether a given text had been attempted. In Tables 3 and 4, we present analogous results by task (i.e., individual word files). In presenting these tables, we provide some initial discussion, primarily for the purpose of highlighting significant aspects in the jumble of numbers. We go into more detail in the next section.

Table 1. Recall for major tasks (percent correct of total number of texts)

                               ------- Grain --------
Task          Number of Texts  Coarse   Mixed    Fine  % Attempted
Overall                  8448    59.3    56.8    51.9        92.74
Noun                     2756    67.8    63.0    58.1        91.11
Verb                     2501    50.7    49.7    44.3        94.56
Adjective                1406    60.1    56.4    53.2        95.59
Indeterminate            1785    57.5    57.3    51.8        90.48

 

While we believe that the percent correct of the total number of texts is the best reflection of overall performance, we think the results in Table 1 may slightly overstate the performance for the CL Research system in comparison with other systems as of the final evaluation run (slightly above the median position). We believe this reflects primarily the fact that the CL Research system was able to "attempt" more texts than most of the other systems.

Of special note here are (1) the relatively low recall for the verbs (perhaps a reflection of their greater ambiguity), (2) the lower successful attempts for nouns (perhaps reflecting an inability to successfully deal with noun phrases), and (3) the mirroring of the overall results in the corpora for which the part of speech first had to be determined (since there was no special processing to deal with corpora in individual parts of speech).

 

Table 2. Precision for major tasks (percent correct of attempted texts)

                               ------- Grain --------
Task          Number of Texts  Coarse   Mixed    Fine  % Attempted
Overall                  7835    63.9    61.2    55.9        92.74
Noun                     2511    74.4    69.1    63.7        91.11
Verb                     2365    53.7    52.6    46.9        94.56
Adjective                1344    62.9    59.0    55.7        95.59
Indeterminate            1615    63.5    63.3    57.2        90.48

 

In the CL Research system, as described in the previous section, empty answers were returned for texts for which the system had particular problems. These answers were treated by the SENSEVAL scoring program as if no attempt had been made. Strictly speaking, this is inaccurate for the CL Research system and such results should have been scored as incorrect. However, as will be discussed in the next section, the 613 texts characterized as not attempted are largely due to minor bugs that had not been examined at the time of the final evaluation run. The precision results shown in Table 2, therefore, are more reflective of the CL Research system performance that would occur when the bugs are corrected, extrapolated to when the percent attempted moves closer to 100 percent (as is occurring from modifications made since the submission of results).
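Since recall and precision share the same numerator (the number of correct answers), the overall figures in Tables 1 and 2 can be related directly. The short Python check below uses only numbers from those tables; the function and variable names are illustrative. Because the published recall figures are rounded, the derived precision values agree with Table 2 only to within about a tenth of a percentage point.

```python
def precision_from_recall(recall_pct, total_texts, attempted_texts):
    """Convert recall (percent correct of all texts) into precision
    (percent correct of attempted texts); both share the same numerator."""
    return recall_pct * total_texts / attempted_texts


TOTAL_TEXTS = 8448       # Table 1, overall
ATTEMPTED_TEXTS = 7835   # Table 2, overall

not_attempted = TOTAL_TEXTS - ATTEMPTED_TEXTS          # the 613 empty answers
attempted_pct = 100.0 * ATTEMPTED_TEXTS / TOTAL_TEXTS  # about 92.74 percent

overall_recall = {"coarse": 59.3, "mixed": 56.8, "fine": 51.9}  # Table 1
derived_precision = {
    grain: precision_from_recall(value, TOTAL_TEXTS, ATTEMPTED_TEXTS)
    for grain, value in overall_recall.items()
}
```

As the share of attempted texts approaches 100 percent, the ratio of total to attempted texts approaches 1 and the two measures converge.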

Table 3. Recall for individual words (percent correct of all texts)

                               ------- Grain --------
Task          Number of Texts  Coarse   Mixed    Fine  % Attempted
Accident-n                267    88.8    82.8    80.1        93.63
Amaze-v                    70    90.0    90.0    90.0        90.00
Band-p                    302    82.8    82.8    82.8        93.71
Behaviour-n               279    92.8    92.8    84.2        96.77
Bet-n                     274    43.1    43.1    31.8        94.53
Bet-v                     117    51.3    48.7    47.9        94.02
Bitter-p                  373    44.2    44.2    44.0        88.47
Bother-v                  209    40.2    40.2    36.4        94.74
Brilliant-a               229    49.3    49.3    39.7        94.32
Bury-v                    201    29.9    29.6    22.4        93.03
Calculate-v               218    53.7    53.7    44.0        97.71
Consume-v                 186    44.1    40.1    35.5        93.01
Deaf-a                    122    73.0    73.0    56.6        88.52
Derive-v                  217    47.5    47.5    47.5        90.32
Disability-n              160    90.6    90.6    83.7        96.25
Excess-n                  186    74.7    49.7    41.4        90.86
Floating-a                 47    55.3    55.3    55.3        95.74
Float-n                    75    40.0    34.7    34.7       100.00
Float-v                   229    35.8    32.3    28.8        94.76
Generous-a                227    37.0    37.0    37.0        98.24
Giant-a                    97     1.0     1.0     0.0        96.91
Giant-n                   118    72.0    56.8    49.2       100.00
Hurdle-p                  323    36.5    36.5     9.3        95.98
Invade-v                  207    49.3    48.6    30.4        93.72
Knee-n                    251    67.7    61.2    57.8        88.05
Modest-a                  270    65.9    64.9    64.4        97.04
Onion-n                   214    82.2    82.2    82.2        97.20
Promise-n                 113    72.6    63.3    60.2        95.58
Promise-v                 224    68.3    67.2    56.7        97.32
Rabbit-n                  221    80.1    79.4    78.7        85.52
Sack-n                     82    57.3    57.3    57.3        75.61
Sack-v                    178    82.6    82.6    82.6        91.01
Sanction-p                431    70.3    70.3    70.3        90.72
Scrap-n                   156    54.5    48.4    38.5        94.23
Scrap-v                   186    79.6    79.6    73.1        98.39
Seize-v                   259    26.3    25.1    25.1        96.91
Shake-p                   356    53.4    52.5    49.7        84.55
Shirt-n                   184    59.2    54.1    48.4        59.78
Slight-a                  218    77.5    54.6    54.6        91.74
Steering-n                176     5.1     5.1     5.1        97.16
Wooden-a                  196    94.4    94.4    94.4       100.00

 

In the results above, we particularly note the low percentage attempted for shirt, shake, bitter, knee, sack-n, deaf, and rabbit. We also note the particularly low fine-grained scores for giant-a, hurdle, and steering. Of these, we attribute the poor showing for deaf, rabbit, hurdle, and steering in part to the fact that there were no training data for these words; we simply devoted little time to consideration of the HECTOR dictionary entries for these words. For giant-a, we note that the Proximity parser dictionary did not have an adjective sense for giant, so that the parser force-fit all parses into other interpretations, primarily a noun modifier sense. For the remaining words of special note, we found that the DIMAP dictionary entries for hyphenated forms of these words were not operating correctly. This particularly affected recognition of T-shirt in the shirt corpus, accounting for the low percentage attempted.

Table 4. Precision for individual words (percent correct of attempted texts)

                               ------- Grain --------
Task          Number of Texts  Coarse   Mixed    Fine  % Attempted
Accident-n                250    94.8    88.4    85.6        93.63
Amaze-v                    63   100.0   100.0   100.0        90.00
Band-p                    283    88.3    88.3    88.3        93.71
Behaviour-n               270    95.9    95.9    87.0        96.77
Bet-n                     259    45.6    45.6    33.6        94.53
Bet-v                     110    54.5    51.8    50.9        94.02
Bitter-p                  330    50.0    50.0    49.7        88.47
Bother-v                  198    42.4    42.4    38.4        94.74
Brilliant-a               216    52.3    52.3    42.1        94.32
Bury-v                    187    32.1    31.8    24.1        93.03
Calculate-v               213    54.9    54.9    45.1        97.71
Consume-v                 173    47.4    43.1    38.2        93.01
Deaf-a                    108    82.4    82.4    63.9        88.52
Derive-v                  196    52.6    52.6    52.6        90.32
Disability-n              154    94.2    94.2    87.0        96.25
Excess-n                  169    82.2    54.7    45.6        90.86
Floating-a                 45    57.8    57.8    57.8        95.74
Float-n                    75    40.0    34.7    34.7       100.00
Float-v                   217    37.8    34.1    30.4        94.76
Generous-a                223    37.7    37.7    37.7        98.24
Giant-a                    94     1.1     1.1     0.0        96.91
Giant-n                   118    72.0    56.8    49.2       100.00
Hurdle-p                  310    38.1    38.1     9.7        95.98
Invade-v                  194    52.6    51.8    32.5        93.72
Knee-n                    221    76.9    69.5    66.1        88.05
Modest-a                  262    67.9    66.9    66.4        97.04
Onion-n                   208    84.6    84.6    84.6        97.20
Promise-n                 108    75.9    66.2    63.0        95.58
Promise-v                 218    70.2    69.0    58.3        97.32
Rabbit-n                  189    93.7    92.9    92.1        85.52
Sack-n                     62    75.8    75.8    75.8        75.61
Sack-v                    162    90.7    90.7    90.7        91.01
Sanction-p                391    77.5    77.5    77.5        90.72
Scrap-n                   147    57.8    51.4    40.8        94.23
Scrap-v                   183    80.9    80.9    74.3        98.39
Seize-v                   251    27.1    25.9    25.9        96.91
Shake-p                   301    63.1    62.1    58.8        84.55
Shirt-n                   110    99.1    90.5    80.9        59.78
Slight-a                  200    84.5    59.5    59.5        91.74
Steering-n                171     5.3     5.3     5.3        97.16
Wooden-a                  196    94.4    94.4    94.4       100.00

 

In general, the precision results reflect the recall results, with an expected increase whenever there was less than 100 percent attempted. However, the change was much more dramatic for words where there was a relatively lower percent attempted, particularly for shirt, deaf, knee, rabbit, sack-n, and shake. We also note the significant grain effect for several words: giant-n, hurdle, slight, shirt, scrap-n, and promise-n.

5. Examination of Results and Possible Improvements

In this section, we first describe the CL Research system's performance against other systems and relative to the best baseline results. We then examine some of the reasons for our failures and examine the prospects for improving our scores. As discussed in section 3, describing the development process, our examination of the submitted results provides the basis for making improvements. In this regard, it is important to distinguish between immediate changes that can be made to improve performance and those that require further research. The immediate changes are essentially bug fixes that would have been made without recourse to the answers; making such changes is important to a more accurate assessment of the CL Research system. In this section, we discuss (1) bug fixes and immediate changes, (2) changes that appear viable from a more complete exploitation of the available resources, and (3) research efforts that would almost assuredly lead to improved performance.

5.1. Overall Assessment

 The precision scores (as provided for the CL Research system in Table 2) were used as the basis for comparing the SENSEVAL systems. Although the SENSEVAL documentation suggests comparisons only with systems of the same type, we first present such an overall comparison. With respect to all systems, the CL Research system performed better than the average at the fine-grained level and below average at the mixed- and coarse-grained levels. Our system performed at levels below the best baseline for all systems at all grains. Table 5 shows our system's performance on the individual SENSEVAL tasks in comparison to all systems.

Table 5. Performance of CL Research System Compared to All Other Systems on Individual Tasks (n=41)

                  ------- Grain --------
Relative Score    Coarse   Mixed    Fine
Best                   3       2       2
Above average         19      19      20
Below average         19      20      17
Worst                  0       0       2

 

The CL Research system is an "all-words" system; that is, it does not set parameters based on a required set of training data and it does not involve hand-crafting of definitions. In addition, the system is theoretically designed to scale up without any training data or hand-crafted definitions. In the practicalities involved in the system's development, neither of these conditions was satisfied in an absolute sense, since we attempted to order definitions according to the frequencies observed in the training data (although this is thought to have actually degraded our performance) and we did some hand-crafting of definitions (primarily since we had not perfected our program for automatic conversion of HECTOR dictionary data). Thus, we believe that our system is most appropriately compared to other "all-words" systems.

The CL Research system attained the best score at the fine-grained level among "all-words" systems and was above average at the mixed- and coarse-grained levels. At the mixed-grained level, our system was at 61.2 percent compared to the best system at 61.6 percent, with our recall at 56.8 percent compared to 3.1 percent for the best system. At the coarse-grained level, our system was at 63.9 percent compared to the best system at 65.0 percent, with our recall at 59.3 percent compared to 19.8 percent for the best system. Table 6 shows our performance on the 41 individual SENSEVAL tasks.

Table 6. Performance of CL Research System Compared to "All-Words" Systems on Individual Tasks (n=41)

                  ------- Grain --------
Relative Score    Coarse   Mixed    Fine
Best                  16      18      19
Above average         16      14      12
Below average          7       7       6
Worst                  2       2       4

 

We believe that the relative comparisons may not provide the best indicator of our system's performance. We suspect that a more absolute assessment may be provided by comparing our performance against the best baselines. Overall, with respect to all systems, our system performed below the precision levels of the best baseline. By contrast, with respect to "all-words" systems, our system performed better than the best baseline. In Tables 7 and 8, we present our comparison against the best baseline for the 41 individual tasks. (Note that even though, in Table 8, we performed below the best baseline on a majority of the individual tasks, our total scores were still above the baseline.)

Table 7. Performance of CL Research System Compared to Best Baseline for All Systems on Individual Tasks (n=41)

                  ------- Grain --------
Relative Score    Coarse   Mixed    Fine
Above baseline        11      10      10
Below baseline        30      31      31

 

Table 8. Performance of CL Research System Compared to Best Baseline for "All-Words" Systems on Individual Tasks (n=41)

                  ------- Grain --------
Relative Score    Coarse   Mixed    Fine
Above baseline        19      19      16
Below baseline        22      22      25

 

We have taken very little time to examine the reasons for the successes of the CL Research system. We have not parceled out the contributions of individual steps and we have hardly looked at the texts themselves or the parse trees, only focusing primarily on failures. Our observations at this time are therefore somewhat scanty on the reasons for success, with more insights provided in the analyses of the failures. We focus here primarily on the fine-grained recall, since as will become clear, we expect improvements to eliminate almost all SENSEVAL instances of "not attempted" and because the CL Research system does not yet exploit the HECTOR hierarchy.

As would be expected, the CL Research system succeeds best when a HECTOR entry has few senses of a given part of speech and the first sense occurs much more frequently than the other senses. This is the case for accident, amaze, band, behaviour, disability, onion, rabbit, sack-v, scrap-v, and wooden. The system does well discriminating the part of speech without requiring special handling of words which have multiple parts of speech (the p, or indeterminate files, and bet, float, giant, promise, sack, and scrap), although still with difficulties when there are several senses of the same part of speech. The system also seems to do well in picking up phrasal entries, even those that involve inflected forms and interpolated elements not considered part of the phrase.

The system seems to have a design that can handle syntactic discriminators of senses reasonably well when they are available. However, many of these (particularly those available in the HECTOR clues) have not yet been incorporated into the analysis; this is reflected in the overall lower performance for verbs, which would be even lower were it not for relatively high recall percentages for a few verbs.

The system has several intervention points where semantic analysis can be performed. This enabled some of the explorations described in the analysis strategy, but nothing was implemented in time for the final evaluation run. Connections were made to WordNet, but were not used in analyzing the parse trees for selectional restrictions available in the HECTOR clues and definitions. We also explored the possibility of building and using a set of semantic relations based on parsing the HECTOR definitions (in the manner of MindNet, described generally in Richardson et al. 1998 and specifically in Richardson 1997, or as in Barriere 1997), but were unable to implement the necessary steps at this time.

By virtue of the development process and not as a conscious design, the CL Research system seems to demonstrate the viability of using local information for making sense selections. This was definitely an emergent property. The system also suggests that a "sieve" approach works, allowing only viable senses to pass and considering all possible parts of speech; we had hoped to achieve a more organized hierarchization of the senses, but were unable to implement it in sufficient time.

5.2. Oops! and Darn!

In this section, we provide details of bugs that have been fixed and of other immediately obvious changes that have been made since the final evaluation results were submitted. These details are provided to make it clear that these changes are not merely speculative. Clearly, this examination process can continue for some time, so we provide those complete as of the date of this paper. All discussion in this section describes only results in the fine-grained analysis and focuses on the recall percentage, since all indications are that we will be able to eliminate almost all of the 613 "not attempted" assessments, in which case, there will be no difference between the recall and precision results.

  1. The first problem that was observed was the fact that the CL Research system missed 195 texts. Most of these cases (191) were due to a bug in the preprocessing phase. It was observed that many of the texts contained an extraneous closing quotation mark after the sentence. We attempted to remove it even though it gave rise to no difficulties in parsing. The routine for doing this did a forward search for a quotation mark and then a backward search, erasing all text from the quotation mark to the end of the text when the forward and backward searches resulted in the same character in the sentence. This assumed that a text would have balanced quotation marks and did not envision an opening quotation mark without a closing one. As a result, when there was an opening mark somewhere early in the sentence without a balancing closing mark, we erased all material from the opening mark to the end of the string. For the 191 cases, this meant that there was no tagged word and thus, not even an empty answer for the text. Eliminating this erasure increased our overall "attempted" rate by 2.4 percent to 95.1 percent; we obtained correct answers for 110 of these now analyzed texts (57.6 percent), with a resultant overall increase in recall of 1.2 percent.
  2. There were also 9 cases where the first text of a SENSEVAL file had not been analyzed. We found this was due in part to the fact that these corpus files (such as bury-v) had a first line that identified the part of speech for that file. When we removed these lines, we were successfully able to process 8 of those cases (one still remains mysterious).
  3. In examining results for excess, we wondered why instances involving the prepositional phrase in excess of and the adverbial phrase to excess were being assigned empty answers, despite the fact that both the parser and the relevant DIMAP dictionary performed appropriately. We found that the relevant sense was not being added to the set of possible answers despite its not being excluded by any of the tests. We modified the code to ensure that senses not excluded were added to the set. For excess, the effect was to eliminate 9 empty answers, all of which were now correctly assigned, increasing the recall by 4.8 percent. We examined some other words with relatively large numbers of empty assignments and obtained some modest improvements for a few of them. But, we also observed some degradation of results for other words, indicating once again that interactions among the processing steps (improving some results, degrading others) can be significant. Many of the negative effects arising here were reversed in subsequent changes.
  4. In examining why the hyphenated word shake-up was assigned an empty answer, we found that the DIMAP dictionary which contained all entries was not recognizing hyphenated forms correctly. For some reason (as yet undetermined), the functionality in DIMAP for merging dictionaries did not operate properly for these word forms. So, we turned to the individual dictionaries that had been created and found that they worked properly in looking up the hyphenated forms. Such forms are found in the dictionaries for shake, shirt, scrap, knee, bitter, and band. In the submitted results, all instances of hyphenated forms were given empty answers and thus scored as "not attempted." When we ran these corpora against the individual dictionaries, we eliminated a large number of these empty answers. The most dramatic effect was for shirt (primarily because of T-shirt) where we reduced the large number of empty answers from 66 to 1, with a change in recall from 48.9 to 85.3 percent. For shake, we eliminated 25 empty answers with a recall improvement from 50.6 to 57.9 percent. We eliminated the same number of empty answers for bitter, but with an improvement of only 2.5 percent in recall. We had smaller recall improvements for the remaining words in this set, but still with reductions in the empty answers. Overall, we eliminated about 150 empty answers, increasing the "attempted" rate to 96.9 percent and increasing recall by about 1.0 percent.
  5. When we looked at our 0 percent recall for giant-a, we found that we were always obtaining a noun modifier sense of the noun giant. When we examined this a little further, we found that giant was not identified as having an adjective sense in the Proximity parsing dictionary. When we added it to the dictionary, our results for giant-a went from 0.0 to 88.7 percent and our overall recall percentage increased by 1.0 percent (with 86 answers now correct).
  6. We were puzzled by our inability to recognize particular phrases (such as wooden spoon, rabbit warren, and steering wheel) in which the SENSEVAL word was first and for which there were separate HECTOR (and hence, DIMAP) entries. Closely related to this problem was our failure to recognize a feature after when it had a specified literal (such as the for wooden spoon). We found that these problems were due to copying code but not changing key aspects: (1) for phrases, we had copied and modified a recursive function, but had left the recursive call to the original function, and (2) for processing the after feature, we had copied code which looked at the word succeeding the tagged word, not changing it to look instead at the word preceding. This had some dramatic effects for several tasks. For steering, this resulted in a recall change from 5.1 to 31.8 percent. We had an improvement in bother of 43 now correct assignments, a recall improvement of over 20 percent (although many of these may have been due to previous changes whose effects on bother were not examined). We observed 10 percent recall improvements for seize, scrap-v, and scrap-n, 6 percent improvements for rabbit and sack-n, and a 3 percent improvement for wooden.
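The quotation-mark bug in item 1 can be reproduced in a few lines of Python. This is a reconstruction from the description above, not the original routine; the function names and the `<tag>` markup in the usage note are hypothetical. The second function shows one possible repair, whereas the fix actually made was simply to eliminate the erasure.

```python
def strip_trailing_quote_buggy(text):
    """Reconstruction of the described bug (an assumption, not the original code)."""
    first = text.find('"')   # forward search
    last = text.rfind('"')   # backward search
    if first != -1 and first == last:
        # A single quotation mark is assumed to be an extraneous closing mark,
        # so everything from it to the end of the text is erased.
        return text[:first]
    return text


def strip_trailing_quote_safer(text):
    """One possible repair: remove a lone quotation mark only at the very end."""
    stripped = text.rstrip()
    if stripped.count('"') == 1 and stripped.endswith('"'):
        return stripped[:-1].rstrip()
    return text
```

For a text such as `He said "we will <tag>invade</tag> tomorrow.`, the buggy version returns only `He said `, so the tagged word is lost and no answer can be produced, while the safer version leaves the text untouched.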

To date, we have reduced the number of empty assignments, viewed as "not attempted" by the SENSEVAL scoring program, from 613 to 258, increasing the percent attempted from 92.74 to 96.95 percent. (These empty assignments are a fruitful area for further improvements.) We have improved our overall recall percentage at the fine-grained level by 5.5 percent (from 51.9 to 57.4). This would improve our coarse-grained recall at least from 59.3 to 64.8 and mixed-grain from 56.8 to 62.3 percent. The effect on precision is 3.3 percent, with current scores for fine, mixed, and coarse grain at 59.2, 64.5, and 67.2 percent, respectively.
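The attempted percentages and the fine-grained precision quoted above follow from the counts of empty answers. The Python check below is a sketch with illustrative function names; it uses only figures given in the text and checks only the fine-grained precision.

```python
TOTAL_TEXTS = 8448  # number of evaluation texts


def attempted_percent(empty_answers, total_texts=TOTAL_TEXTS):
    """Percent of texts counted as attempted by the SENSEVAL scorer."""
    return 100.0 * (total_texts - empty_answers) / total_texts


def precision_from_recall(recall_pct, empty_answers, total_texts=TOTAL_TEXTS):
    """Precision implied by a recall figure and a count of empty answers."""
    return recall_pct * total_texts / (total_texts - empty_answers)


before = attempted_percent(613)   # as submitted
after = attempted_percent(258)    # after the fixes described above
fine_precision_after = precision_from_recall(57.4, 258)
```

Rounded, these give 92.74 and 96.95 percent attempted and a fine-grained precision of 59.2 percent.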

The question now is how much further we can go with this type of "bug" fix. Extrapolating the improvement from the reduction of empty answers suggests an additional 4 percent in fine-grained recall from that source alone. At the end of the development process leading to the final evaluation run, we had made one final run against the training data (making sure that we could process all the training data for a word immediately prior to making the evaluation run). Overall, we had a 61.3 percent recall with 4 percent empty answers for this data, without any of the changes described above. For a large number of words, this final training run showed degradations in performance, several of around 10 percent. We believe that application of the changes described above would enable us to achieve nearly 70 percent success on the training data at the fine-grained level and 75 percent at the coarse-grained level.

In the final days leading up to the submission, there were several specific steps ("bug" fixes) we were trying to implement and having difficulty doing so, leading us to abandon them. For example, in the kind processing, we had implemented the Kleene operator for examining words prior to the tagged word, but encountered bugs in handling this operator for words after the tagged word and thus removed the offending code. We did this because it did not seem likely that these various small steps would yield significant gains.
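As an illustration of the kind of matching involved, the following is a minimal sketch of a pattern matcher with a Kleene (zero-or-more) operator that is applied in both directions from the tagged word. It is not the CL Research code: the pattern representation, the function names, and the word tests are all assumptions made for the example. The point of the design is that one matcher serves both directions, with the "before" case handled by reversing the pattern and the words.

```python
def match(pattern, words):
    """True if pattern matches a prefix of words.

    pattern is a list of (test, starred) pairs; a starred element may match
    zero or more consecutive words (Kleene operator).
    """
    if not pattern:
        return True
    (test, starred), rest = pattern[0], pattern[1:]
    if starred:
        # Either skip the starred element, or consume one word and retry it.
        return match(rest, words) or (
            bool(words) and test(words[0]) and match(pattern, words[1:])
        )
    return bool(words) and test(words[0]) and match(rest, words[1:])


def match_after(pattern, tokens, i):
    """Match pattern (in reading order) against the words following token i."""
    return match(pattern, tokens[i + 1:])


def match_before(pattern, tokens, i):
    """Match pattern (in reading order) against the words ending just before token i."""
    return match(pattern[::-1], tokens[:i][::-1])


# Hypothetical word tests, for illustration only.
is_det = lambda w: w in {"the", "a"}
is_adj = lambda w: w in {"big", "old"}
is_house = lambda w: w == "house"

np_pattern = [(is_det, False), (is_adj, True), (is_house, False)]
tokens = "the big old house collapsed".split()
print(match_before(np_pattern, tokens, 4))
```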

There are a few other avenues that we put on our "to-do" list and that are at this level of processing. For example, in interacting with the DIMAP dictionary for phrasal entries, we only coded to recognize 2-word phrases and did not generalize this to looking for n-word phrases; this is something that is not too difficult, but requires some tricky programming and was not pursued for lack of time. We also want to investigate the effect of submitting weighted answers. As noted, our system acts as a sieve, eliminating some senses, but perhaps leaving more than one in the set of possible senses, each with a score. We did not choose to submit all the senses. Certainly, this would improve our overall scores for those texts where our answers were wrong. At the same time, this might degrade our performance for those which were correct.
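The trade-off in submitting weighted answers can be made concrete with a small sketch. It assumes that the credit for a text is the weight the answer places on the correct sense, and that the sieve's scores are positive and can simply be normalized into weights; both are assumptions for illustration, and the sense names and scores are invented.

```python
def weights_from_scores(scores: dict[str, float]) -> dict[str, float]:
    """Normalize the scores of the surviving senses into weights that sum to 1."""
    total = sum(scores.values())
    return {sense: score / total for sense, score in scores.items()}


def credit(answer: dict[str, float], correct: str) -> float:
    """Credit for one text: the weight placed on the correct sense (assumed rule)."""
    return answer.get(correct, 0.0)


# Hypothetical sieve output: two senses survive with different scores.
surviving = {"sense_a": 3.0, "sense_b": 1.0}

top = max(surviving, key=surviving.get)
single = {top: 1.0}                        # submit only the best-scoring sense
weighted = weights_from_scores(surviving)  # submit every surviving sense

for correct in ("sense_a", "sense_b"):
    print(correct, credit(single, correct), credit(weighted, correct))
```

When the top-scoring sense is correct, the single answer earns full credit and the weighted answer earns less; when it is wrong, the single answer earns nothing while the weighted answer keeps partial credit. Whether the net effect is positive depends on how often the top-scoring sense is the correct one.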

Finally, at this level of difficulty, but not something we could have done prior to submission of results, is an examination of reasons for the differences on runs between the training and the evaluation data. In other words, why was there a discrepancy of almost 10 percent at the fine-grained level? As noted above, we reordered the HECTOR definitions to accord with the frequency order of the training data. We need to examine whether this reordering resulted in lower scores.

In summary, we believe that the CL Research system can achieve at least a 15 percent improvement in recall at the fine-grained level, to about 67 percent, from what are essentially bug fixes. We would expect this to result in about 75 percent recall at the coarse-grained level. We expect that these levels will be achieved in part by reaching nearly 100 percent in texts "attempted," so that there will be essentially no difference between our recall and precision results.

5.3. Further Exploitation of System Capabilities

The next level of improvement requires thinking, rather than the 15-minute patches typical of the bug fixes. We would characterize these improvements as involving a couple of hours to a couple of days, but making use of data already available in the system, that is, a tighter integration among the components. These changes fall into the following categories: (1) further use of the Proximity parser results and its dictionary information; (2) better structuring of the DIMAP dictionaries; and (3) more complete use of directly available HECTOR dictionary data.

As indicated above, the parser dictionary contains considerable information about verb phrase patterns. These government patterns are used by the parser to dynamically expand the 347 grammar rules directly specified. The dynamic parsing already allows for adverbs, appositives, and subordinate clauses at any point in the sentence; the grammar then keys off the verb phrase patterns to add further expectations, for example, looking for a THAT-clause or a double noun phrase (for ditransitives). There is some pattern specification for words in other parts of speech as well. There are also verb patterns listing particles or particular prepositional phrases associated with the verb. In summary, the set of information associated with the parser dictionary is very similar to, and as extensive as, the information available in COMLEX. Although this information is available, we have not used it up to this point.

We have made considerable use of the parse tree results in our implementation, but we have barely scratched the surface of what is available. The parser consists of 8000 lines of C code and we are not yet fully familiar with all its nuances and functionality. In particular, keeping with our theme of branching out from the tagged word for disambiguation, we have not yet looked beyond the few words before and after the tagged word into the full parse tree to test for context appropriate to the tagged word. In general, then, further improvements can be expected from more complete use of the parse results.

The most difficult part of our implementation was the structuring of the DIMAP dictionaries. We spent considerable time experimenting with alternative representations for capturing and structuring the HECTOR dictionary data, but we are by no means satisfied with the current state of these representations. In particular, because we rushed to workable solutions, we were not able to exploit the hierarchical sense and subsense ordering of the HECTOR dictionaries. As we proceeded through the developmental process, we frequently had to alter our representations in our attempts to deal with particular classes of problems. We expect this to continue as we examine our empty answers and incorrect guesses, and we expect that this will provide some improvements in our overall scores.

As noted, our improvements are very much driven by our failures. Many of these failures are the result of not having implemented routines to deal with particular aspects of the HECTOR data. For example, we have not yet made use of clues that specify syntactic constituents (such as NPs). More significantly, many of our failures stem from our poor understanding and implementation of basic linguistic phenomena. Our inexperience in these areas is considerable. As a result, it is clear that our improved treatment of these phenomena will improve the performance of the CL Research system.

The path to a fuller exploitation of these available resources is clear. Our development process and the flexibility of the system design, that is, using our failures to trigger modifications, are well geared to result in improved performance. We expect that the modifications in these areas will emerge primarily from the HECTOR data, leading to fuller exploitation of the parser dictionary and output and improved DIMAP representations. It is more difficult to assess how much improvement will occur from efforts under this heading. We expect that the distinction between bug fixes and exploitation of available resources will blur and that changes nominally under this heading will contribute to meeting our estimate of 67 percent success at the fine-grained level and 75 percent at the coarse-grained level. We remain hopeful that these modifications will provide an additional 10 to 15 percent improvement.

5.4. Paths for Future Exploration

The flexibility of the CL Research system provides clear opportunities for exploration of many techniques that might improve performance. We briefly describe the possibilities that came to mind and, in some cases, were briefly examined during the system implementation. These include (1) further development of the Proximity parser, (2) using results of parsing definitions into relational structures, (3) making direct use of WordNet and other hierarchical data, (4) using CL Research primitive-finding techniques, (5) exploring utility of techniques involving creation of supercategories from hierarchies, (6) use of underspecification techniques for hierarchizing senses, (7) use of content analysis techniques, and (8) exploring opportunities from other SENSEVAL systems.

  1. The Proximity parser is not a completed product, perhaps only at a pre-alpha stage. There is a considerable number of functions that have been sketched out but not yet integrated into the parsing and ones that require some refinement. Many of these are directly relevant to the kinds of processing that have been implemented in the CL Research system. There is also considerable opportunity for integrating some semantic processing into the parsing.
  2. We experimented a little with parsing HECTOR definitions, particularly attempting to make use of parenthetical material in verb and adjective definitions for creating selectional restrictions. We were not able to incorporate the results of such parsing. But, we look toward continued work along these lines, with the use of definition parses to create semantic relations, similar to what has been done in the development of MindNet. We also wanted to parse the example uses contained in the HECTOR data, again along the lines of MindNet. We expand on this point in the section on future directions.
  3. As indicated above, the DIMAP software contains integrated access to WordNet. We attempted to make use of the WordNet data based on selectional restrictions identified by parsing definitions or the clues collocational data in HECTOR. We were unable to devise our own methods for using WordNet in the time available, nor did we have an opportunity to systematically examine the WordNet literature for appropriate techniques.
  4. In describing the analysis strategy, we indicated our intent to examine the benefit of primitive-finding techniques developed by CL Research. Our system did not examine the contexts provided in the evaluation data. Although the report on the lexicographer's experience (Krishnamurthy & Nicholls) indicated that this context was sometimes insufficient, we fully expect that the integration of a discourse component will provide some benefit for disambiguation, particularly in connection with primitive-finding functionality available in DIMAP.
  5. We expect that lexical semantic techniques for dynamically creating supercategories from hierarchical lexicons can be used with the CL Research primitive-finding techniques in paring the set of viable senses. Such techniques, as described in Hearst, Burstein, Basili et al., and Buitelaar, clearly indicate the possibility of reducing the number of possible senses. We will explore these approaches and attempt to include them in the CL Research system.
  6. In the last few years, a considerable body of research on "underspecification" has been completed. Our notion of underspecification may differ somewhat, but is similar to the work described in Sanfilippo. That is, we view the senses of a lemma as being hierarchically structured and the process of disambiguation as one that proceeds down the (inverted) tree making use of additional pieces of information to reach some node, hopefully a leaf. However, we believe it is quite appropriate to stop at interior nodes and allow the ambiguity to remain, that is, with the meaning "underspecified." As indicated above, the CL Research system has not yet made use of the HECTOR hierarchy or considered the possibilities for further hierarchization.
  7. As indicated above, we found that applying the MCCA content analysis techniques included in DIMAP resulted in some apparently significant improvement in sense selection. At the time in the development process when these explorations were made, we were achieving low success rates for the words tested (about 10 to 20 percent) and obtained a 10 to 20 percent improvement from the use of the content analysis. Whether we will still be able to achieve this level of improvement when success rates from other techniques are at a 50 or 60 percent level is an open question.
  8. We expect to find many methods from other SENSEVAL participants to be useful for improving the performance of the CL Research system. For example, we see a clear opportunity to make use of the memory-based WSD techniques described in Daelemans et al. We envision that this technique can be applied after we have allowed symbolic techniques to filter the senses.
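The view of underspecification in item 6 can be sketched as a descent through a sense tree that stops at an interior node whenever the evidence does not single out one child. This is only an illustration of the idea, not part of the current system, which does not yet use the HECTOR hierarchy; the node structure, the feature names, and the sense labels are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable

Context = set[str]  # hypothetical: features observed around the tagged word


@dataclass
class SenseNode:
    label: str
    accepts: Callable[[Context], bool]  # does the context support this node?
    children: list["SenseNode"] = field(default_factory=list)


def disambiguate(node: SenseNode, ctx: Context) -> SenseNode:
    """Descend while exactly one child is supported; otherwise stop at this node."""
    while node.children:
        viable = [child for child in node.children if child.accepts(ctx)]
        if len(viable) != 1:
            break  # evidence is missing or ambiguous: leave the sense underspecified
        node = viable[0]
    return node


# Invented hierarchy for a lemma "w" with verb and noun senses.
tree = SenseNode("w", lambda c: True, [
    SenseNode("w.v", lambda c: "verb" in c, [
        SenseNode("w.v.1", lambda c: "to-infinitive" in c),
        SenseNode("w.v.2", lambda c: "object-np" in c),
    ]),
    SenseNode("w.n", lambda c: "noun" in c),
])

print(disambiguate(tree, {"verb"}).label)               # stops at an interior node
print(disambiguate(tree, {"verb", "object-np"}).label)  # reaches a leaf
```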

We expect that further developments along the lines described in this section will be integrated in processing against a remaining set of viable senses after filtering. Some, though, will be integrated into the other components (the parser, the DIMAP dictionaries, or the feature analysis phase).

The extent to which our results will be improved by these techniques is, of course, unknown. It will be important to maintain the discipline of examining changes from run to run: the number assigned correctly and whether prior results are degraded. It seems clear that the improvements will be at the margins for most words. Notwithstanding, we view these explorations as important to further understanding of the human disambiguation process and a reduction in the reliance on frequency ordering of senses.
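The run-to-run discipline can be supported by a simple comparison of two answer sets against the key. The sketch below is an assumption about how such a check might be written; the data layout (dictionaries from text identifier to sense) and the identifiers are invented for the example.

```python
def compare_runs(key, previous, current):
    """Return (gained, lost): ids newly correct and ids that were correct but no longer are."""
    gained = sorted(
        i for i in key if current.get(i) == key[i] and previous.get(i) != key[i]
    )
    lost = sorted(
        i for i in key if previous.get(i) == key[i] and current.get(i) != key[i]
    )
    return gained, lost


def count_correct(key, answers):
    """Number of texts whose answer agrees with the key."""
    return sum(1 for i in key if answers.get(i) == key[i])


# Hypothetical answer key and two runs; a missing id means an empty answer.
key = {"t1": "s1", "t2": "s2", "t3": "s1", "t4": "s3"}
previous = {"t1": "s1", "t2": "s1", "t3": "s1"}
current = {"t1": "s1", "t2": "s2", "t3": "s2", "t4": "s3"}

gained, lost = compare_runs(key, previous, current)
print("correct:", count_correct(key, previous), "->", count_correct(key, current))
print("gained:", gained, "lost:", lost)
```

Reporting the lost items separately matters because a net gain in the number correct can hide degradations for individual texts.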

6. Discussion

During our development process, when we first started getting results for a reasonably large portion of the training corpora, we realized that a significant level of performance could be achieved by reliance on the frequency ordering of the definitions. As we continued further and our results began to improve with additions to the code, we were also struck by the extent to which correct assignments could be made on syntactic considerations alone (that is, simple extensions of part of speech tagging). As a result, we expected that frequency considerations alone would provide a floor for the performance of all systems participating in SENSEVAL. We also came to believe that the SENSEVAL experience would give substantial support to Wilks & Stevenson's thesis that disambiguation could for the most part ignore considerations of meaning. Our experience in the development process, particularly our inability to incorporate any substantial semantic processing, has borne out these expectations.

Upon further consideration, however, we find these developments a little bit unsatisfying for at least two reasons. First, the reliance on frequency will always require human intervention to tag corpus samples against a set of senses. Second, it is certain that the number of senses for a given word will always be in a state of flux, and usually increasing. As noted in the lexicographer's paper, there was considerable criticism of the HECTOR sense distinctions, particularly as not being "sufficiently fine to reflect the corpus contexts." Kilgarriff has somewhat exhaustively enumerated and considered lexical sense extension processes. To us, this seems to warrant some further evolution of the WSD task and its evaluation.

As our system evolved and took its current state, we saw the disambiguation process separate into distinct stages: (1) homing in on the relevant lexical entry (headword), particularly in the case of phrasal entries; (2) a coarse-grained sieving process based on syntactic considerations; (3) a finer-grained sieving process based on more semantic considerations (at least in so far as contextual and collocational information is viewed as semantic); (4) a still more fine-grained process weighing the evidence for particular sense selection; and (5) final reliance on frequency considerations.
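These stages can be laid out as a small pipeline. The sketch below is a toy illustration under stated assumptions, not the CL Research implementation: the `Sense` fields, the two-word phrase check, the use of overlapping collocates as "semantic" evidence, and the lexicon entries are all invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Sense:
    sense_id: str
    pos: str                                      # part of speech the sense requires
    collocates: set = field(default_factory=set)  # context words supporting the sense
    rank: int = 0                                 # frequency order, 0 = most frequent


def pick_entry(lexicon, tokens, i):
    """Stage 1: prefer a two-word phrasal entry starting at the tagged word."""
    if i + 1 < len(tokens):
        phrase = f"{tokens[i]} {tokens[i + 1]}"
        if phrase in lexicon:
            return lexicon[phrase]
    return lexicon[tokens[i]]


def disambiguate(lexicon, tokens, i, pos):
    senses = pick_entry(lexicon, tokens, i)
    # Stage 2: coarse syntactic sieve; keep all senses if none survives.
    syntactic = [s for s in senses if s.pos == pos] or senses
    context = set(tokens[:i] + tokens[i + 1:])
    # Stage 3: finer sieve on collocational evidence; keep all if none matches.
    semantic = [s for s in syntactic if s.collocates & context] or syntactic
    # Stages 4 and 5: most evidence wins; frequency order breaks ties.
    return min(semantic, key=lambda s: (-len(s.collocates & context), s.rank))


# Invented lexicon entries, for illustration only.
lexicon = {
    "sack": [
        Sense("sack.bag", "n", {"potatoes", "coal"}, 0),
        Sense("sack.dismiss", "v", {"manager", "staff"}, 1),
        Sense("sack.plunder", "v", {"city", "army"}, 2),
    ],
    "steering": [Sense("steering.1", "n")],
    "steering wheel": [Sense("steering_wheel.1", "n")],
}

print(disambiguate(lexicon, "they will sack the city".split(), 2, "v").sense_id)
print(disambiguate(lexicon, "they will sack him".split(), 2, "v").sense_id)
```

Keeping the stages separate in this way also makes it possible to measure how many assignments are decided at each stage, and in particular how many fall through to the frequency fallback.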

These observations suggest to us the need to parcel out the contribution and reliance of system components so that we can assess the extent to which we can reduce reliance on human intervention for frequency determination and the extent to which we can handle sense extension processes. Although we may be able to disambiguate to a great extent based on frequency and syntactic considerations, as Wilks suggests, we may need to achieve the same results with other methods so that we can use the results of our disambiguation in other tasks. Disambiguation is not an end in itself.

7. Conclusions and Future Directions

SENSEVAL has provided the motivation for CL Research to put together the beginnings of a system that would be capable of disambiguating all words in unrestricted text. At the same time, it is clear that we are far from this goal. We are greatly encouraged by our level of performance, even though it is currently much lower than we would like. Our encouragement stems from the many avenues for improvement that have been opened and for which there are relatively clear paths for future development. We extend an open invitation to any who would like to participate in this development, particularly noting that the source code for the parser is available for use by interested researchers. We believe that the flexibility of the CL Research system provides opportunities for many interesting explorations.

However, we believe that there are important gaps that must be filled to scale up this system so that the output of the disambiguation process will be usable in applications requiring NLP. In particular, we feel that it is necessary to be able to (1) parse ordinary dictionary definitions to create HECTOR-style information, (2) develop mechanisms for acquiring lexical information for unknown words, and (3) integrate discourse processing to a level capable of handling information extraction tasks.

  1. Dictionary Parsing: We have demonstrated the importance of the lexicon to the achievements of our system. It is clear, however, that such information is not readily available without considerable human intervention. We would certainly welcome any offers of lexical resources that would advance the capabilities of our system. However, we suspect that a more systematic approach is necessary and have undertaken steps for automatic creation of appropriate lexical entries. CL Research has initiated efforts to create publicly available lexical resources of the type used in our system by parsing and analysis of definitions in the publicly available Webster's 2nd International Dictionary. For further information, see details of the Dictionary Parsing Project.
  2. Lexical Acquisition: The Proximity parser is capable of handling unknown words, frequently assigning a part of speech that seems warranted by the context and always indicating in the parse tree the fact that a word is not in the dictionary. The parse tree developed during this process contains significant amounts of information for the surrounding context and thus provides an opportunity for use of such features in tentatively assigning features to the unknown words. We anticipate that use of machine learning techniques will provide some fruitful avenues for this type of lexical acquisition. In addition, we have begun to explore techniques for recognizing phrases and their variations through incremental extension of the lexicon (applying our phrase-finding techniques).
  3. Information Extraction: By the very nature in which our system was developed, that is, focusing on individual words with their own separate dictionaries, we see an opportunity for creating sublexicons that can handle information extraction needs. We have indicated where we expect to integrate discourse processing to assist in disambiguation. The Proximity parser identifies proper noun phrases. So, we expect that it will be possible to integrate techniques used in MUC, where reduced lexicons have been shown to be important.

In summary, we see a rich set of possibilities stemming from our participation in SENSEVAL.

8. Bibliography

Barrière, C. 1997. From a Children's First Dictionary to a Lexical Knowledge Base of Conceptual Graphs. PhD dissertation, Simon Fraser University. (Available at http://by.genie.uottawa.ca/profs/barriere/personal_info.html.)

Basili, R., M. D. Rocca, and M. T. Pazienza. 1997. "Towards a bootstrapping framework for corpus semantic tagging," Tagging Text with Lexical Semantics: Why, What, and How? Proceedings of the ACL SIGLEX Workshop, Washington: pp. 66-73.

Buitelaar, P. 1997. "A lexicon for underspecified semantic tagging," Tagging Text with Lexical Semantics: Why, What, and How? Proceedings of the ACL SIGLEX Workshop, Washington: pp. 25-33.

Burstein, J. and R. M. Kaplan. 1997. "An automatic scoring system for advanced placement biology essays," Proceedings of the 5th Conference on Applied Natural Language Processing. Association for Computational Linguistics, Washington: pp. 174-181.

Daelemans, W., A. van den Bosch, S. Buchholz, J. Veenstra, and J. Zavrel. 1998. "Memory-based words sense disambiguation for SENSEVAL," Proceedings of the SENSEVAL Workshop. ACL SIGLEX, Herstmonceux Castle, Sussex, England. (http://www.itri.brighton.ac.uk/events/senseval/PROCEEDINGS/)

Hearst, M. A. and H. Schutze. 1996. "Customizing a lexicon to better suit a computational task," in B. Boguraev and J. Pustejovsky (eds.), Corpus Processing for Lexical Acquisition. MIT Press, Cambridge, Massachusetts: pp. 77-96.

Kilgarriff, A. 1997. "'I Don't Believe in Word Senses'," Computers and the Humanities, 31(2), pp. 91-113.

Krishnamurthy, R. and D. Nicholls. 1998. "Peeling an onion: the lexicographer's experience of manual sense-tagging," Proceedings of the SENSEVAL Workshop. ACL SIGLEX, Herstmonceux Castle, Sussex, England. (http://www.itri.brighton.ac.uk/events/senseval/PROCEEDINGS/)

Litkowski, K. C. 1978. "Models of the Semantic Structure of Dictionaries," American Journal of Computational Linguistics, Microfiche 81, Frames 25-74. (Available at http://www.clres.com/. Also see Analysis of Subordinating Conjunctions for details on the use of this approach.)

Litkowski, K. C. and M. D. Harris. 1997. "Category Development Using Complete Semantic Networks," Technical Report 97-01. Gaithersburg, MD: CL Research. (Available at http://www.clres.com/.)

Macleod, C., R. Grishman, and A. Meyers. 1997/8. "COMLEX syntax," Computers and the Humanities, 31(6), pp. 459-81.

Richardson, S. 1997. Determining Similarity and Inferring Relations in a Lexical Knowledge Base. PhD dissertation, City University of New York. (Available as MSR-TR-97-02 at http://www.research.microsoft.com/research/nlp/.)

Richardson, S., Dolan, W. B., and Vanderwende, L. 1998. "MindNet: acquiring and structuring semantic information from text," Proceedings of the 17th International Conference on Computational Linguistics (COLING-ACL), Montreal, Quebec, Canada. (Also available at http://www.research.microsoft.com/research/nlp/, as Technical Report MSR-TR-98-23.)

Sanfilippo, A. 1995. "Lexical polymorphism and word disambiguation," Representation and Acquisition of Lexical Knowledge: Polysemy, Ambiguity, and Generativity. Working Notes. AAAI Spring Symposium Series. Stanford University, pp. 158-62.

Wilks, Y. and M. Stevenson. 1997. "Sense tagging: semantic tagging with a lexicon," Tagging Text with Lexical Semantics: Why, What, and How? Proceedings of the ACL SIGLEX Workshop, Washington: pp. 47-51.