the Cyc Knowledge Base™

The Cyc knowledge base (KB) is a formalized representation of a vast quantity of fundamental human knowledge: facts, rules of thumb, and heuristics for reasoning about the objects and events of everyday life. The medium of representation is the formal language CycL, described below. The KB consists of terms--which constitute the vocabulary of CycL--and assertions which relate those terms. These assertions include both simple ground assertions and rules. Cyc is not a frame-based system: the Cyc team thinks of the KB instead as a sea of assertions, with each assertion being no more "about" one of the terms involved than another.
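
As a rough illustration of the "sea of assertions" idea, the Python sketch below stores each assertion as a plain tuple and indexes it under every term it mentions, so that no single term owns the assertion. The class name, the tuple encoding, and the sample terms are assumptions made for this example; they are not Cyc's internal representation or its API.

```python
from collections import defaultdict


class AssertionSea:
    """Toy store in which no term 'owns' an assertion."""

    def __init__(self):
        self.assertions = []               # every assertion, in insertion order
        self.by_term = defaultdict(list)   # term -> assertions that mention it

    def add(self, *formula):
        assertion = tuple(formula)
        self.assertions.append(assertion)
        # Index the assertion under each distinct term it contains.
        for term in dict.fromkeys(assertion):
            self.by_term[term].append(assertion)
        return assertion

    def mentioning(self, term):
        """Return all assertions that involve the given term."""
        return list(self.by_term.get(term, []))


sea = AssertionSea()
sea.add("isa", "Fido", "Dog")        # sample ground assertion (illustrative)
sea.add("genls", "Dog", "Mammal")    # sample ground assertion (illustrative)

# The same assertion is reachable from any of its terms.
print(sea.mentioning("Dog"))
print(sea.mentioning("Fido"))
```

Looking up "Dog" returns both assertions, while looking up "Fido" returns only the first; neither lookup is privileged over the other.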

The Cyc KB is divided into many (currently thousands of) "microtheories", each of which is essentially a bundle of assertions that share a common set of assumptions; some microtheories are focused on a particular domain of knowledge, a particular level of detail, a particular interval in time, etc. The microtheory mechanism allows Cyc to independently maintain assertions which are prima facie contradictory, and enhances the performance of the Cyc system by focusing the inferencing process.
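
The way microtheories keep apparently contradictory assertions apart can be sketched as follows. This is a minimal model that assumes each microtheory is just a named set of assertions; the microtheory names and the Dracula example are hypothetical, and the sketch ignores any relationships between microtheories that a real system may maintain.

```python
from collections import defaultdict


class MicrotheoryKB:
    """Toy KB in which every assertion lives in a named microtheory."""

    def __init__(self):
        self.by_mt = defaultdict(set)   # microtheory name -> set of assertions

    def assert_in(self, mt, formula):
        self.by_mt[mt].add(formula)

    def holds(self, mt, formula):
        # Only the named microtheory is consulted, which also narrows the search.
        return formula in self.by_mt.get(mt, set())


kb = MicrotheoryKB()
is_vampire = ("isa", "Dracula", "Vampire")

# Hypothetical microtheories holding assertions that would clash if pooled.
kb.assert_in("DraculaStoryMt", is_vampire)
kb.assert_in("RealWorldMt", ("not", is_vampire))

print(kb.holds("DraculaStoryMt", is_vampire))         # True
print(kb.holds("RealWorldMt", is_vampire))            # False
print(kb.holds("RealWorldMt", ("not", is_vampire)))   # True
```

Because a query names the context it is asked in, the two assertions never meet, and each lookup touches only one bundle of assertions.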

At the present time, the Cyc KB contains nearly five hundred thousand terms, including about fifteen thousand types of relations, and about five million facts (assertions) relating these terms. New assertions are continually added to the KB through a combination of automated and manual means. Additionally, term-denoting functions allow for the automatic creation of millions of non-atomic terms, such as (LiquidFn Nitrogen); and Cyc adds a vast number of assertions to the KB by itself as a product of the inferencing process.
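
To see how a term-denoting function can yield new terms without anyone adding them by hand, consider the sketch below. It models a non-atomic term as a function name applied to arguments and prints it in the parenthesized form used above. The class `NonAtomicTerm` and the helper `apply_fn` are names invented for this example, and the substances other than Nitrogen are sample values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NonAtomicTerm:
    """A term built by applying a term-denoting function to arguments."""
    functor: str
    args: tuple

    def __str__(self):
        # Render in the parenthesized prefix style, e.g. (LiquidFn Nitrogen).
        return "(" + " ".join([self.functor, *map(str, self.args)]) + ")"


def apply_fn(functor, *args):
    """Build the non-atomic term denoted by functor applied to args."""
    return NonAtomicTerm(functor, tuple(args))


liquid_nitrogen = apply_fn("LiquidFn", "Nitrogen")
print(liquid_nitrogen)

# One function applied across many existing terms yields many new terms.
substances = ["Nitrogen", "Oxygen", "Helium"]   # sample values
generated = {apply_fn("LiquidFn", s) for s in substances}
print(len(generated))
```

Since the dataclass is frozen and compared by value, building the same term twice gives equal objects, so generated terms can be used as set members or dictionary keys.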

natural-language processing

Natural-language (NL) processing is among the most studied -- and most intractable -- outstanding challenges of software engineering. Many teams have attempted to produce NL systems capable of reading and making sense of plain English text, but none have succeeded to any significant degree outside of narrow, pre-conceived domains. As shown in the examples below, Cyc-like common sense is a prerequisite for human-level competence at this task.

Consider the following pair of sentences:

  • Fred saw the plane flying over Zurich.
  • Fred saw the mountains flying over Zurich.

Although the sentences are very similar, humans have little difficulty in recognizing that in the first sentence, "flying" probably refers to the plane, while in the second sentence, "flying" almost certainly refers to Fred. Traditional NL systems will have difficulty resolving this syntactic ambiguity, but because Cyc knows that planes fly and mountains do not, it will be able to reject nonsensical interpretations. It's difficult to see how this could be done without relying on a large database of common sense.
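
The filtering step described above can be sketched in a few lines of Python. The two small tables stand in for common-sense knowledge and are assumptions of this example, as is the function name; a real system would consult the KB rather than hard-coded sets.

```python
# Toy common-sense facts (assumed for this sketch).
# A person counts as able to fly in the sense of travelling by air.
CAN_FLY = {"plane", "person"}
KIND_OF = {
    "Fred": "person",
    "the plane": "plane",
    "the mountains": "mountain",
}


def who_is_flying(subject, obj):
    """Return the readings of 'flying' that common sense allows, nearest first."""
    # Syntactically, the participle could attach to either noun phrase.
    candidates = [obj, subject]
    return [c for c in candidates if KIND_OF[c] in CAN_FLY]


print(who_is_flying("Fred", "the plane"))       # both readings survive
print(who_is_flying("Fred", "the mountains"))   # only Fred survives
```

For the first sentence both readings remain and the nearer noun phrase is listed first; for the second, the reading in which the mountains fly is rejected, leaving Fred.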

Here are a couple more examples; these involve pronoun disambiguation:

  • The police arrested the demonstrators because they feared violence.
  • The police arrested the demonstrators because they advocated violence.
  • Mary saw the dog in the store window and wanted it.
  • Mary saw the dog in the store window and pressed her nose up against it.

In the first pair, most readers take "they" to mean the police when the verb is "feared" and the demonstrators when it is "advocated"; in the second pair, "it" is the dog that Mary wants but the window that she presses her nose against. In each case the choice rests on common-sense knowledge rather than on syntax.

The Cyc-NL system has three components: the lexicon, the syntactic parser, and the semantic interpreter. These are described in greater detail on the natural language understanding page. At the moment, we are focusing our efforts on broadening the coverage of all three components of Cyc-NL. We are currently able to correctly parse many different sentence types, including ambiguous and syntactically complex inputs. Cyc is capable of handling negation, modals, and nested quantifiers. We are developing interfaces that will allow people to make assertions and query Cyc using English instead of CycL. We are also working on a generation component, which will produce English strings from CycL formulas.

Cyc's NL capabilities form the foundation for applications in knowledge-enhanced searching of captioned information, and for user-friendly interfaces to other applications, including the database integration application.

Future directions for Cyc-NL will include:

  • exploring the role Cyc could play in machine translation
  • using Cyc-NL to post-process output of speech recognition systems
  • harnessing Cyc-NL to enhance user interfaces.

Other potential applications are myriad.

For more information, see the more detailed description of the Cyc NL subsystem.

Semantic Integration Bus™

Computer-based information is stored in many forms, including data that is structured (databases), semi-structured (spreadsheets, web pages), and unstructured (text files and text fields). Cyc can turn some of this information into usable knowledge, and the remainder can be annotated for easier access by humans.

Cyc treats each database record as if it were an implicit assertion in the knowledge base. These implicit assertions are then available during inference. Similarly, text fields can be read using the natural language processor to see if they contain any useful implicit assertions. Sometimes the assertions describe what the text is "about". Cyc can use this information to locate and report information resources which the user may employ to answer a particular query.
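
One way to picture a database record as a set of implicit assertions is to read each non-key column as a predicate relating the record to a value. The sketch below assumes exactly that mapping; the function name, the subject-naming scheme, and the sample row are hypothetical and are not taken from Cyc's database integration module.

```python
def record_to_assertions(table, key_column, row):
    """Turn one row into (predicate, subject, value) triples, one per filled column."""
    # Name the record after its table and key, e.g. "employee-7" (assumed scheme).
    subject = f"{table}-{row[key_column]}"
    return [
        (column, subject, value)
        for column, value in row.items()
        if column != key_column and value is not None   # skip the key and empty fields
    ]


row = {"id": 7, "name": "Fred", "city": "Zurich", "phone": None}   # sample record
for assertion in record_to_assertions("employee", "id", row):
    print(assertion)
```

The triples are computed on demand from the row, so nothing has to be copied into the knowledge base ahead of time.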

[Figure: Semantic Integration Bus]

In the above diagram, information stored in a database or on the web is made available to the inference engine as virtual assertions. These sets of virtual assertions are managed by heuristic level (HL) modules. For example, the inference engine "broadcasts" a query on the bus. An HL module recognizes that the request asks for an assertion which maps into its virtual knowledge space. The HL module intercepts the request, communicates with the database, web site or other knowledge source, and returns bindings to the inference engine. Inference then continues, combining information from multiple sources.
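
The broadcast-and-intercept flow just described might be sketched as below. Each module declares which predicates map into its virtual knowledge space, and the bus collects bindings from every module that recognizes a query. The class names, method names, and sample data are assumptions of this sketch; plain dictionaries stand in for a database and a web site.

```python
class HLModule:
    """Toy module that answers queries for the predicates it covers."""

    def __init__(self, predicates, source):
        self.predicates = set(predicates)
        self.source = source   # stand-in for a database or web site

    def handles(self, predicate):
        return predicate in self.predicates

    def answer(self, predicate, subject):
        # A real module would talk to the external source here.
        return list(self.source.get((predicate, subject), []))


class Bus:
    """Toy bus that broadcasts a query to every registered module."""

    def __init__(self):
        self.modules = []

    def register(self, module):
        self.modules.append(module)

    def ask(self, predicate, subject):
        bindings = []
        for module in self.modules:
            if module.handles(predicate):          # module recognizes the request
                bindings.extend(module.answer(predicate, subject))
        return bindings


database = {("city", "Fred"): ["Zurich"]}          # sample data
web_site = {("employer", "Fred"): ["Acme"]}        # sample data

bus = Bus()
bus.register(HLModule(["city"], database))
bus.register(HLModule(["employer"], web_site))

# Inference can combine bindings that came from different sources.
print(bus.ask("city", "Fred"), bus.ask("employer", "Fred"))
```

A query that no module recognizes simply returns no bindings, so the caller does not need to know which sources exist.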

developer toolsets

The Cyc system also includes a variety of interface tools that permit the user to browse, edit, and extend the Cyc KB, to pose queries to the inference engine, and to interact with the natural-language and database integration modules.

The most commonly used tool, our HTML browser, allows the user to view the KB as hypertext. HTML pages describing Cyc terms are generated on the fly by the Cyc system. Each page describes a Cyc term by showing all the assertions in which it is involved, organized according to a standard schema. Every occurrence of a Cyc term is an HTML link to a (dynamically-generated) HTML page describing that term, so that it is easy to surf around the KB following a network of relationships. The HTML browser also includes facilities for searching and editing the KB and for posing queries to the inference engine.

Other HTML interface tools include:

  • A hierarchy browser, which displays any desired subtree of the Cyc subset tree in outline format.
  • A lexicon editor, which provides a user-friendly way to edit and extend the Cyc lexicon.
  • An English-to-CycL parser, which lets users experiment with Cyc's natural language facilities by parsing arbitrary English strings.
  • A database tool interface, which provides an interface to Cyc's semantic integration module.
  • A WordNet browser, which allows users to view WordNet in relation to the Cyc ontology.
  • An English generator, which restates Cyc rules in English.
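
As a small illustration of what the hierarchy browser's outline format might look like, the sketch below prints a subtree with one level of indentation per generation. The function name and the sample hierarchy are hypothetical; they only show the display idea, not the tool's actual output.

```python
def outline(children, root, depth=0):
    """Return the subtree under root as indented outline lines."""
    lines = ["  " * depth + root]
    for child in children.get(root, []):
        lines.extend(outline(children, child, depth + 1))
    return lines


# Sample hierarchy (illustrative): parent -> list of direct children.
children = {"Animal": ["Mammal", "Bird"], "Mammal": ["Dog"]}

print("\n".join(outline(children, "Animal")))
```

Passing a different root, such as "Mammal", displays only that subtree.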


Copyright © 2002-2010 Cycorp, Inc. All Rights Reserved.

what's in Cyc?