- Current print dictionaries contain only a limited amount of the data in the underlying electronic files. Frequently, this data is not fully exploited for online use and not optimized for licensees of the data. In addition, the data content does not take advantage of publicly available content, particularly in the development of richer and deeper content to provide additional monetization opportunities and enriched user experiences. The core content of this data can often be expanded with semantic connections.
- CL Research has considerable experience in augmenting existing data structures with additional semantic tagging. CL Research also has experience in developing web-based software for interactive user experiences. In creating alphabetic versions of many publicly available lexical databases, CL Research has developed many insights into the expansion of additional layers of semantic content.
- CL Research uses the following general paradigm for augmenting dictionary databases:
- Augmenting the data format: Working with existing data, developing new structures to label all content, ensuring that any resultant XML tagging and content management fit within the appropriate level of editorial review.
- Integrating the data: Using multiple resources (public domain, crawlable material, copyrighted material) and combining these with existing data to produce a next-generation dictionary resource, enriched with semantic content, particularly by connecting individual sub-senses to these data sources.
- Producing an enhanced content database: Developing demonstration scripts for use with newly integrated data, exploiting data on user activity, and identifying potential monetization avenues.
- In performing superdictionary projects, CL Research will frequently use DIMAP to perform many of the data manipulation functions. To see some of the results,
- In performing superdictionary projects, CL Research has partnered with lexicographers, software developers with experience in electronic dictionaries, and business development specialists in emerging digital markets.
- For more details on a superdictionary project, please contact Ken Litkowski
- A thesaurus contains synonyms, "broader than," and "narrower than" terms. With DIMAP, you or CL Research can parse a dictionary, or a set of dictionary definitions, to identify how different entries relate to one another. The amount of effort depends, of course, on the size of the dictionary. As a guide, processing of Webster's 2nd International Dictionary containing 120,000 headwords and 270,000 definitions took approximately 40 hours, much of which was background processing. Familiarization may require additional time.
- To create the thesaurus yourself, you will need to put your dictionary entries into the format used for uploading them into DIMAP. The file format is described in the help file provided with the experimental DMP3A. If you are unable to create the entries directly, CL Research will provide the C source code for a conversion program (applicable against a marked-up ASCII file). Alternatively, CL Research will modify the program to meet your format for $200.
- Once the data are in the proper format for uploading into DIMAP dictionaries, the experimental DMP3A can be used with a couple of menu selections to create the dictionaries. Parsing the definitions and creating the thesaurus require only a few more menu and dialog selections.
- After DIMAP dictionaries are created, they will then be suitable for more extended thesaural and semantic relations as DMP3A is developed further.
- If you require further assistance, CL Research can customize DMP3A to meet your needs. Please inquire.
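The idea of deriving "broader than" and "narrower than" terms from definitions can be illustrated with a minimal Python sketch. It is not DIMAP's parser or file format: the `BOUNDARY_WORDS` list, the function names, and the toy definitions are all assumptions made for illustration. The sketch treats the last word of a definition's head phrase (its genus term) as the broader term and inverts those links to obtain narrower terms; synonyms are not handled.

```python
import re
from collections import defaultdict

# Hypothetical heuristics (not taken from DIMAP): words assumed to end
# the head phrase of a noun definition, and articles to be skipped.
BOUNDARY_WORDS = {"that", "which", "who", "of", "with", "used", "for", "having"}
ARTICLES = {"a", "an", "the"}


def genus_term(definition):
    """Return the last non-article word before the first boundary word."""
    words = re.findall(r"[a-z][a-z-]*", definition.lower())
    head = []
    for word in words:
        if word in BOUNDARY_WORDS:
            break
        if word not in ARTICLES:
            head.append(word)
    return head[-1] if head else None


def build_thesaurus(definitions):
    """Map each headword to a broader term and invert for narrower terms."""
    broader = {}
    narrower = defaultdict(list)
    for headword, definition in definitions.items():
        genus = genus_term(definition)
        if genus and genus != headword:
            broader[headword] = genus
            narrower[genus].append(headword)
    return broader, dict(narrower)


# Toy definitions, invented for this example.
DEFINITIONS = {
    "pony": "a small horse",
    "horse": "a large hoofed mammal used for riding",
    "mammal": "a warm-blooded animal that nurses its young",
}

broader, narrower = build_thesaurus(DEFINITIONS)
print(broader)
print(narrower)
```

A real dictionary needs proper parsing of each definition; a word-list heuristic like this one breaks down quickly on verb and adjective definitions.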
- An ontology is an organization of concepts in relation to one another, most specifically, a categorization of entities and actions. A full ontology may deal with all knowledge, but it is possible to construct an ontology for a single field of study.
- The main organizing principle of an ontology is the ISA backbone ("a horse is an animal"). A richer ontology contains additional relations between concepts. These relations may include the thesaural relations of synonyms and antonyms, but typically would include a breakdown of the general "related-to" thesaural relation into many semantic relations. At a minimum, the semantic relations would include "part" relations that identify conceptual entities which are construed as parts, constituents, or substances making up another entity. A more elaborate system would identify semantic relations such as "agent", "instrument", "purpose", "location", "result", "cause", "manner", and "entailment". There is no general agreement on the set of semantic relations and it is possible that a set may be somewhat arbitrary and depend on a user's needs.
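As a rough illustration of an ISA backbone enriched with other relations, the following Python sketch stores typed links between concepts and follows the "isa" links transitively. The `Ontology` class, its method names, and the sample concepts are hypothetical; they do not reflect how DIMAP stores relations.

```python
from collections import defaultdict


class Ontology:
    """Toy store of typed relations between concepts (illustrative only)."""

    def __init__(self):
        # relations[name][source] is the set of targets for that relation.
        self.relations = defaultdict(lambda: defaultdict(set))

    def add(self, source, relation, target):
        """Record that `source` stands in `relation` to `target`."""
        self.relations[relation][source].add(target)

    def related(self, source, relation):
        """Return the direct targets of `relation` for `source`."""
        return set(self.relations[relation].get(source, ()))

    def ancestors(self, concept):
        """Follow "isa" links transitively and return every ancestor."""
        seen = set()
        stack = [concept]
        while stack:
            current = stack.pop()
            for parent in self.relations["isa"].get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


ontology = Ontology()
ontology.add("horse", "isa", "mammal")
ontology.add("mammal", "isa", "animal")
ontology.add("hoof", "part", "horse")
ontology.add("saddle", "purpose", "riding")

print(ontology.ancestors("horse"))
print(ontology.related("hoof", "part"))
```

Keeping the relation name as data rather than as a fixed field makes it easy to change the set of relations, which suits the point that no single set is generally agreed upon.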
- The DMP3A experimental version has now been extended to enable a user to define and identify many semantic relations from parsing of a word's definitions. These relations can be encoded as part of the dictionary used to parse the definitions by specifying "defining patterns" associated with individual words. (For details, see the discussion of semantic relations in the Dictionary Parsing Project.)
- DMP3A can be used directly to add such relations to dictionary entries based on parsing definitions.
- Please inquire if you wish assistance in developing an ontology and set of relations specific to your needs.
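The notion of a "defining pattern" that signals a semantic relation can be sketched in Python with regular expressions. The patterns, relation names, and sample definitions below are assumptions chosen for illustration; they are not the pattern syntax used by DMP3A, which is described in the Dictionary Parsing Project.

```python
import re

# Hypothetical defining patterns: each pairs a relation name with a phrase
# that is assumed to signal that relation inside a definition.
PATTERNS = [
    ("part", re.compile(r"\bpart of (?:a |an |the )?(\w+)")),
    ("purpose", re.compile(r"\bused (?:for|to) (\w+)")),
    ("location", re.compile(r"\bfound in (?:a |an |the )?(\w+)")),
    ("cause", re.compile(r"\bcaused by (?:a |an |the )?(\w+)")),
]


def extract_relations(headword, definition):
    """Return (headword, relation, target) triples found in a definition."""
    text = definition.lower()
    relations = []
    for name, pattern in PATTERNS:
        for match in pattern.finditer(text):
            relations.append((headword, name, match.group(1)))
    return relations


# Toy definitions, invented for this example.
print(extract_relations("hoof", "the horny part of the foot of a horse"))
print(extract_relations("saddle", "a leather seat used for riding a horse"))
```

Surface patterns like these capture only the first word after the cue phrase; parsing the definition, as DMP3A does, is what allows the full target phrase to be identified.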
- When hierarchical relations have been entered into DIMAP dictionaries, it is possible to create a conceptual organization for a dictionary. This organization will identify the more basic and the more complex concepts within a field (perhaps an entire general vocabulary, but preferably within a smaller sublanguage area).
- Using the hierarchical relations, DIMAP can analyze the underlying dictionary graph to identify the primitive elements and the ordering of concepts based on complexity. This can be accomplished through the menu selection for analyzing the dictionary digraph.
- If you require assistance in interpreting this digraph, CL Research can show you how to interpret the results. Please inquire.
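One way to picture the digraph analysis is the following Python sketch, which orders concepts by complexity once the hierarchical relations are known. It is an approximation under stated assumptions, not DIMAP's algorithm: the `defined_by` mapping and the function name are hypothetical, the graph is assumed to contain no cycles, and a "primitive" is taken to be any concept that is not defined in terms of another. It relies on `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def complexity_levels(defined_by):
    """Assign each concept a level: 0 for primitives, otherwise one more
    than the highest level among the concepts it is defined in terms of.

    Raises graphlib.CycleError if the definitions are circular.
    """
    # static_order() yields every concept after the concepts it depends on.
    order = TopologicalSorter(defined_by).static_order()
    level = {}
    for concept in order:
        parents = defined_by.get(concept, ())
        level[concept] = 1 + max((level[p] for p in parents), default=-1)
    return level


# Toy graph, invented for this example: each concept maps to the
# concepts used to define it.
DEFINED_BY = {
    "pony": {"horse"},
    "horse": {"mammal"},
    "mammal": {"animal"},
    "stable": {"horse", "building"},
}

levels = complexity_levels(DEFINED_BY)
primitives = sorted(c for c, n in levels.items() if n == 0)
print(primitives)
print(sorted(levels.items(), key=lambda item: item[1]))
```

Real dictionaries contain definitional cycles, so a fuller treatment would first have to detect them, for example by finding strongly connected components, before any ordering by complexity is attempted.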
This document maintained by Ken Litkowski
Copyright © 2010 CL Research