Parsing in CL Research Products

Definition parsing, document and text parsing, and parsing in support of question-answering use the Proximity Parser (© Proximity Technology, Inc. 2000). This page describes the detailed steps involved in the parsing process. The output of parsing some input is a parse tree, starting with a single non-terminal node and breaking this node into its constituents, which may be other non-terminal nodes and continuing until leaf nodes (representing the words in the input) are reached. The steps in the process are:

Parsing Intialization

Initialization consists of loading the dictionary, initializing the list structure mechanism used for keeping track of information, and initializing many variables used during the parsing to various literals and lists. All this information is part of the permanent list space, whereas subsequent list space is temporary and used for the input that is being processed.

Parsing

Parsing consists of two steps: (1) initial preparation of the input and (2) building the parse tree.

Preparation of the input

Creating leaves of the parse tree

Building the parse tree

The parse tree is based on a grammar and is constructed as the input is consumed.

Grammar

Parsing acts according to the grammar contained in the file "st", which is compiled into an array of "finite state machines". Each element of this array contains the following pieces of information

  1. Start state
  2. Condition necessary to make a transition (either a non-terminal, a terminal part of speech, or a literal, but sometimes there is no condition, indicated by NIL in a grammar rule)
  3. Information attached to the growing parse, used for testing (appearing in parentheses in the grammar rules)
  4. Actions taken when the condition has been met (function names preceded by an @ sign, frequently used as tests of the properties of terminal part of speech, which may result in a penalty for pursuing such a transition when a property is not met), and
  5. End state.

Consuming the input

The parsing process builds parse list nodes (a list node consisting of a "value" field, a "content" field, and a "successor" field) to guide the process and, when the constituents for a non-terminal have been identified, constructs parse tree nodes. The general parse list node corresponds to a linked list of constituents and gives a special interpretation to its members: the "value" field identifies a rule number which identifies a non-terminal parsing goal in the grammar, the "content" field contains a list of the constituents that make up the non-terminal, and the "successor" field points to the next element in the list.

Active parses in the form of parse list nodes are maintained in the array hits, whose index identifies how many items of the input have been consumed. Elements of the array are linked lists of active parses. The length of any array element of hits indicates how many parses are active in trying to consume the next item of the input. The array is seeded with an initial parse list node of (287,0,0), which is added as the first item in the array element hits[0]. This is equivalent to the list {SEN}.

Processing takes place in a loop over the number of elements of hits, usually the length of the input. In this loop, the first step is to prune the active parses to a maximum number (usually 20) based on the score for each parse. The second step is another loop that continues as long as there is an active parse at this position in the input. In this loop, hits[hitl] (where hitl is the current position) is set to the next active parse. Then, we attempt to extend the parse just taken off the list. If there are no active parses at a position, we then move to the next position in the input. (Extending the parse at a given position may mean either not moving in the input, in which case additional active parses are added to the same position, or consuming an item in the input, in which case an active parse is added to a subsequent position in hits. If all attempts to extend the parse at a given position have failed, there will be no active parses at subsequent positions, so that we will immediately cease processing the input. In this case, if there is still more in the input, we will enter a "cleanup" phase and restart with a new parse node of (q_SEN,b,0), where b represents the best we were able to do with the previous input.)

Extending the parse is the key step in the parsing. This consists of examining every grammar rule with the same start state and seeing whether we can satisfy its condition. There are three types of conditions and one major operation: (1) pushing a subgoal, (2) a NIL transition, (3) consuming an item in the input, and (4) storing constituents.

Pushing a Subgoal
When a non-terminal in the grammar is the condition of a transition, it is pushed onto a stack of outstanding non-terminals. For example, we have the grammar rule SEN ---PHP---> SEN1, where PHP is the subgoal. In processing the parse list node (287,0,0), we create another parse list node (199,0,p) representing the new subgoal, with p pointing to the node (287,0,0). This is equivalent to the list {PHP SEN}. When we attempt to extend a parse (based on the parse list nodes in hits[0], we remove the first node before the attempt. When we successfully push a subgoal, we add the new parse list node to the array hits in the same position (since we have not consumed an item of the input). Thus, the node (287,0,0) is removed and the node (199,0,p) is added. Note that p provides the link to where we started (and the subgoal, PHP, from which we started is also retained in the "definition" for the SEN rule).
NIL Transitions
When NIL is the condition of a transition, there is no real condition to be satisfied. For example, we have the grammar rule PHP ---NIL---> PH. In this case, the transition is performed. Again, this is done in place; that is, we still make a new parse list node at the same position in the hits array, but in these cases, we effectively only change the rule number in the list, so that (199,0,p) becomes (164,0,p), equivalent to the list {PH SEN}. Note that, despite this change in state, the subgoal from which we started is unchanged and is still accessible in the "definition" for the SEN rule.
Matching a Part of Speech or Literal
When a part of speech or a literal is the condition, there are two cases: (1) no match and (2) match. In the case of the start state PH, there are 15 grammar rules, many of which have literals, such as PH ---how---> PH8. If this is not the word at the current position in the input, the rule is merely discarded. In parsing "Parse this sentence", where the first word may be either a noun or a verb, we will first go through the rule PH ---NIL---> VP with a NIL transition and then encounter the rule VP ---verb---> PSV (rule 324). Reaching this point, we will have the stack list {VP SEN}. When we recognize that "parse" is a verb, we will make the transition to the next state PSV (rule 245), but we will also grab the leaf parse tree node for the verb part of speech of "parse". We create a new parse list node, (245,t,p), where p continues to point to the next item on the stack, and where t points to the parse tree node. This is equivalent to the stack list {PSV (verb) SEN}, where it is important to note that the "verb" is a sublist attached to PSV (that is, in the "content" field of the parse list node). This marks it as a constituent of SEN (with the possibility that the PSV may find other constituents). It is also important to note that when we record this parse list node, we do so by placing it at the next element in the hits array, where it becomes an active parse at that position in the input.
Storing a Constituent
The final important operation of creating an overall parse is the "store" operation. You will find @(store) as part of many grammar rules. This refers to the "action" function store. This function also is part of many other action functions, usually as their final step. The store function leads to the aggregation of leaf parse tree nodes into non-terminal constituents. For example, the following two grammar rules lead to a store operation in the action function thanchk: In satisfying the rule SBC as SBC3, the literal "as" in the form of a parse tree node is recorded as a constituent of an SBC phrase. If an NP is then recognized (this will also lead to a "store" operation by itself), we want to recapitulate the success of the entire "SBC" goal. The "store" operation accomplishes this by creating a parse tree node labeled SBC, having two constituents, an "as" and an NP. Note that the second rule has no end state (which would come after the "@(store)" if there were one). Because of this, the two constituents would then be recorded as part of an SBC parse tree node. A parse list node is created with the rule number that created them as the "value" field, the parse tree node as its "content" field, and a pointer to the parse list node of which it is a constituent as the "successor" field.