Natural Language Processing in Cyc

Cyc offers sophisticated natural language understanding and generation functionality, using the power of Cyc knowledge to address semantics and pragmatics rather than just syntactic or statistical processing.

The Lexicon

The lexicon is the backbone of the NL system. It contains syntactic and semantic information about English words. Each word is represented as a Cyc constant. For example, the constant #$Light-TheWord is used to represent the English word "light". Assertions in the lexicon specify that #$Light-TheWord has noun, verb, adjective, and adverb forms (as in "a bright light", "light a fire", "a light meal", and "touching someone lightly", respectively). Further lexical assertions specify which syntactic patterns the various forms of "light" can appear in (for example, "light" can be a transitive verb, as in "he lit a fire"; it can also appear with certain prepositions, as in "the whole house was lit up").

Most importantly, the lexicon is where links between English words and Cyc constants are stored. The noun "light", for example, has denotation links to two Cyc constants: #$LightEnergy and #$LightingDevice. The other parts of speech of #$Light-TheWord have denotation links to Cyc constants as well.

When Cyc-NL processes an input sentence, it first checks the lexicon to assign possible parts of speech to words in the string. The lexicon (along with our generative morphology component) would assign these parts of speech to the following input string:

[Figure: NL part-of-speech illustration for the input string "the man saw the light with the telescope"]

Notice that many of the words are ambiguous as to part of speech. It is the job of the syntactic parser to decide which part-of-speech assignments are appropriate, and to build a structure from the sentence which can be passed along to the semantic component for interpretation.
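The lookup step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary contents, the tag names, and the function names are hypothetical and hand-written for one sentence, whereas the real lexicon stores this information as assertions in the Cyc KB. The only denotation links shown are the two the text names for the noun "light".

```python
from typing import Dict, List, Tuple

# Hypothetical miniature lexicon, hand-written for one sentence. In Cyc this
# information lives in KB assertions about constants such as #$Light-TheWord.
PARTS_OF_SPEECH: Dict[str, List[str]] = {
    "the": ["Determiner"],
    "man": ["Noun", "Verb"],
    "saw": ["Noun", "Verb"],
    "light": ["Noun", "Verb", "Adjective"],
    "with": ["Preposition"],
    "telescope": ["Noun", "Verb"],
}

# Denotation links named in the text. Links for other entries are left out
# rather than guessed.
DENOTATIONS: Dict[Tuple[str, str], List[str]] = {
    ("light", "Noun"): ["#$LightEnergy", "#$LightingDevice"],
}


def candidate_tags(sentence: str) -> List[Tuple[str, List[str]]]:
    """Pair each word with every part of speech the toy lexicon allows."""
    return [(w, PARTS_OF_SPEECH.get(w, [])) for w in sentence.lower().split()]


def ambiguous_words(sentence: str) -> List[str]:
    """Return the words (deduplicated, in order) with more than one tag."""
    found: List[str] = []
    for word, tags in candidate_tags(sentence):
        if len(tags) > 1 and word not in found:
            found.append(word)
    return found


if __name__ == "__main__":
    for word, tags in candidate_tags("The man saw the light with the telescope"):
        print(f"{word:10s} {', '.join(tags)}")
```

The sketch only enumerates candidates. Choosing among them is left to the parser, as the paragraph above describes.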

Understanding Natural Language

The Syntactic Parser

The syntactic parser uses a phrase-structure grammar loosely based on Government and Binding principles. Using a number of context-free rules, the parser builds tree structures, bottom-up, over the input string. The parser outputs all trees allowed by the rule system, so multiple parses are possible in cases of syntactic ambiguity. For the sentence "the man saw the light with the telescope", the parser generates two tree structures:

{:SENTENCE
  {:NP 
    {:DETP  {#$Determiner  [the]}}
    {:N-BAR {#$SimpleNoun  [man]}}}
  {:VP
    {#$Verb  [saw]}
    {:NP {:DETP {#$Determiner  [the]}}
         {:N-BAR  {#$SimpleNoun  [light]}}}
    {:PP {#$Preposition  [with]}
         {:NP {:DETP {#$Determiner  [the]}}
              {:N-BAR {#$SimpleNoun  [telescope]}}}}}}
  
{:SENTENCE
  {:NP
    {:DETP  {#$Determiner  [the]}}
    {:N-BAR {#$SimpleNoun  [man]}}}
  {:VP
    {#$Verb  [saw]}
    {:NP {:DETP {#$Determiner  [the]}}
         {:N-BAR
                {:N-BAR {#$SimpleNoun  [light]}}
                {:PP {#$Preposition  [with]}
                     {:NP {:DETP {#$Determiner  [the]}}
                          {:N-BAR {#$SimpleNoun  [telescope]}}}}}}}}

In the first tree, the prepositional phrase "with the telescope" attaches to the verb phrase, corresponding to the interpretation "the man used the telescope to see the light". In the second tree, the prepositional phrase attaches to the noun phrase, corresponding to the interpretation "the man saw the light that had the telescope". These structures are then passed to the semantic component, where they are translated into CycL, and spurious parses are discarded.
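To make the bottom-up, all-parses behavior concrete, here is a minimal chart-parser sketch in Python. It is not Cyc's parser: the rule set and lexicon are assumptions reverse-engineered from the two trees shown above, and the names (RULES, LEXICON, parse, render) are invented for the illustration. Trees are plain tuples, and render prints them in the brace notation used above.

```python
from itertools import product
from typing import Dict, Iterator, List, Tuple

# Hypothetical lexicon and context-free rules, inferred from the trees above.
LEXICON: Dict[str, List[str]] = {
    "the": ["Determiner"], "man": ["SimpleNoun", "Verb"],
    "saw": ["SimpleNoun", "Verb"], "light": ["SimpleNoun", "Verb"],
    "with": ["Preposition"], "telescope": ["SimpleNoun"],
}
RULES: List[Tuple[str, Tuple[str, ...]]] = [
    ("SENTENCE", ("NP", "VP")),
    ("NP", ("DETP", "N-BAR")),
    ("DETP", ("Determiner",)),
    ("N-BAR", ("SimpleNoun",)),
    ("N-BAR", ("N-BAR", "PP")),      # PP attached inside the noun phrase
    ("VP", ("Verb", "NP")),
    ("VP", ("Verb", "NP", "PP")),    # PP attached to the verb phrase
    ("PP", ("Preposition", "NP")),
]


def splits(i: int, j: int, parts: int) -> Iterator[List[Tuple[int, int]]]:
    """Yield every way to cut span [i, j) into `parts` non-empty pieces."""
    if parts == 1:
        yield [(i, j)]
        return
    for k in range(i + 1, j - parts + 2):
        for rest in splits(k, j, parts - 1):
            yield [(i, k)] + rest


def parse(words: List[str]) -> List[tuple]:
    """Build all trees bottom-up, shortest spans first; return SENTENCE trees."""
    n = len(words)
    chart: Dict[Tuple[int, int], List[tuple]] = {}
    for i, word in enumerate(words):
        chart[(i, i + 1)] = [(pos, word) for pos in LEXICON.get(word, [])]
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            cell = chart.setdefault((i, j), [])
            changed = True
            while changed:  # repeat so unary rules can fire on new entries
                changed = False
                for lhs, rhs in RULES:
                    for pieces in splits(i, j, len(rhs)):
                        options = [[t for t in chart.get(p, []) if t[0] == sym]
                                   for p, sym in zip(pieces, rhs)]
                        for children in product(*options):
                            tree = (lhs, *children)
                            if tree not in cell:
                                cell.append(tree)
                                changed = True
    return [t for t in chart.get((0, n), []) if t[0] == "SENTENCE"]


def render(tree: tuple) -> str:
    """Print a tree in the brace notation used in this document."""
    if len(tree) == 2 and isinstance(tree[1], str):
        return "{#$%s [%s]}" % tree
    return "{:%s %s}" % (tree[0], " ".join(render(c) for c in tree[1:]))


if __name__ == "__main__":
    for t in parse("the man saw the light with the telescope".split()):
        print(render(t))
```

With this toy grammar the example sentence yields exactly two SENTENCE trees, one per attachment site, even though several words are lexically ambiguous: the rules leave no place for "man" or "light" as verbs in this string.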

The Semantic Interpreter

Cyc-NL's semantic component transforms syntactic parses into CycL formulas. The output of the semantic component is "pure" CycL: a parsed sentence can immediately be asserted into the KB, for example, or a parsed question can be presented to the SQL generator in order to pose a database query.

Cyc's semantic interpreter incorporates principles of Montague semantics. Semantic structures are built up piece-by-piece and combined into larger structures. For each syntactic rule, there is a corresponding semantic procedure which applies.

Cyc-NL's clausal semantics is basically "verb-driven". Verbs are stored in the lexicon with "templates" for their translation into CycL. For example, the template for "believe" when followed by a that-clause might look like this: (#$believes :SUBJECT :CLAUSE). In translating a sentence like "Mary believes that the blue hat is pretty", we retrieve the appropriate template for "believe", then build up the interpretations of the arguments which will fill the :SUBJECT and :CLAUSE slots.

Cyc-NL's semantic component makes use of knowledge in the KB at virtually every level of the interpretation process. In the example "the man saw the light with the telescope", the semantic component would consult the KB to find out whether telescopes are typically used as instruments in seeing, and whether lights are the kinds of things that usually have telescopes. Based on the results of asking the KB, the semantic component would reject the second parse as invalid, and produce a CycL translation of the first parse. Using commonsense knowledge to guide the interpretation process allows us to deal with the ever-present problem of ambiguity in natural language without having to rely solely on statistical techniques.
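The two ideas in this section, filling a verb template and using KB knowledge to discard an implausible parse, can be sketched as follows. Everything except the template (#$believes :SUBJECT :CLAUSE) is an assumption made for illustration: the angle-bracket terms are placeholders rather than real CycL, and the toy KB with its relation names ("typical-instrument-of", "typically-has") is invented and says nothing about how Cyc actually stores or queries this knowledge.

```python
from typing import Dict, List, Set, Tuple, Union

Formula = Union[str, tuple]

# The template quoted in the text, written as a nested tuple.
BELIEVE_THAT: Formula = ("#$believes", ":SUBJECT", ":CLAUSE")


def fill_template(template: Formula, bindings: Dict[str, Formula]) -> Formula:
    """Replace slot keywords such as :SUBJECT with their interpretations."""
    if isinstance(template, tuple):
        return tuple(fill_template(part, bindings) for part in template)
    return bindings.get(template, template)


def to_cycl(formula: Formula) -> str:
    """Render a nested tuple as a parenthesized, CycL-style string."""
    if isinstance(formula, tuple):
        return "(" + " ".join(to_cycl(part) for part in formula) + ")"
    return formula


# Hypothetical stand-in for the KB: one commonsense fact, as a plain tuple.
TOY_KB: Set[Tuple[str, str, str]] = {
    ("typical-instrument-of", "telescope", "seeing"),
}


def plausible_attachments(event: str, obj: str, pp_noun: str,
                          kb: Set[Tuple[str, str, str]]) -> List[str]:
    """Keep only the PP attachments that the toy KB supports."""
    readings: List[str] = []
    if ("typical-instrument-of", pp_noun, event) in kb:
        readings.append("verb-phrase")   # first tree: instrument of seeing
    if ("typically-has", obj, pp_noun) in kb:
        readings.append("noun-phrase")   # second tree: light that has it
    return readings


if __name__ == "__main__":
    # Placeholder interpretations of the arguments (not real CycL terms).
    bindings = {":SUBJECT": "<Mary>", ":CLAUSE": ("<isPretty>", "<theBlueHat>")}
    print(to_cycl(fill_template(BELIEVE_THAT, bindings)))
    print(plausible_attachments("seeing", "light", "telescope", TOY_KB))
```

In the sketch, the noun-phrase reading is dropped simply because the toy KB has no supporting fact for it, which mirrors the rejection of the second parse described above.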

Generating Natural Language

Cyc is able to generate natural language output from its internal CycL knowledge representation. Using lexical information associated with Cyc concepts, as well as models of grammar (also represented in the Cyc KB) and extensible NL-generation templates, Cyc can dynamically generate natural-sounding text for concepts, sentences, query responses, and justifications, enabling applications to display Cyc content or results without requiring the user to understand CycL. For generating more complex outputs, such as nested explanations of its conclusions or "fact sheets" about people, places, organizations, events, etc., Cyc can sort, aggregate, and filter its textual output in order to create more useful and natural-sounding documents.

Cyc's natural language generation capabilities include:

  • Multilingual support
  • Alternative paraphrasing based on desired verbosity, register, etc.
  • Selective inclusion/exclusion of text based on expectations of the user's information needs
  • Extensible lexifications for individual concepts
  • Extensible generation templates for predicates and sentences
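
As a rough illustration of how lexifications for concepts and generation templates for predicates might combine, here is a minimal Python sketch. It is an assumption-laden toy, not Cyc's generation machinery: the template strings, the table names, and the upper-case terms (MARY, THE-BLUE-HAT, IS-PRETTY) are hypothetical placeholders rather than real Cyc constants. Only #$believes comes from the text above.

```python
from typing import Dict, Union

Formula = Union[str, tuple]

# Hypothetical lexifications: term -> English phrase.
LEXIFICATIONS: Dict[str, str] = {
    "MARY": "Mary",
    "THE-BLUE-HAT": "the blue hat",
}

# Hypothetical generation templates: predicate -> pattern over its arguments.
TEMPLATES: Dict[str, str] = {
    "#$believes": "{0} believes that {1}",
    "IS-PRETTY": "{0} is pretty",
}


def generate(formula: Formula) -> str:
    """Turn a nested-tuple formula into English, innermost arguments first."""
    if isinstance(formula, str):
        # Fall back to the raw term when no lexification is known.
        return LEXIFICATIONS.get(formula, formula)
    predicate, *args = formula
    phrases = [generate(arg) for arg in args]
    template = TEMPLATES.get(predicate)
    if template is None:
        # No template: fall back to a plain listing of predicate and arguments.
        return " ".join([predicate] + phrases)
    return template.format(*phrases)


if __name__ == "__main__":
    print(generate(("#$believes", "MARY", ("IS-PRETTY", "THE-BLUE-HAT"))))
```

Because both tables are ordinary dictionaries, adding a lexification or a template extends what the sketch can express without touching generate, which is the sense of "extensible" this example tries to show.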