
knowledge-enhanced searching of captioned information

Big organizations frequently possess large libraries of "opaque" information sources; that is, data which do not lend themselves to traditional text-based searching methods. A news agency may possess a library of thousands of news photos; a movie studio thousands of film clips; a software help desk thousands of text articles too unwieldy to index directly. When such libraries must be searched, a common solution is to attach to each item a short text caption describing its contents. Thus a news photo of a soldier holding a gun to a woman's head might be captioned "a soldier holding a gun to a woman's head", plus a few tags for time and place, and then could be retrieved by querying for "soldier" or "gun".

This solution, while certainly adequate, is far from ideal. It would be nice if the photo could also be retrieved by queries for "someone in danger", or "a frightened person", or "a man threatening a woman". Such an achievement, however, lies far beyond the abilities of even the most sophisticated of traditional text-searching tools, all of which are fundamentally based on simple string matching and synonyms. Most search tools lack the ability to handle natural-language queries, and even those that do have some NL capability lack the background of commonsense knowledge required to make a connection between having a gun to one's head and feeling frightened.

Cyc is not crippled by such a liability. Cyc knows that guns shoot bullets and are designed to kill people; that having a gun to one's head therefore threatens one's life; that those whose lives are threatened feel fear; and that the vast majority of soldiers are men. Cyc can therefore conclude that the image in question is, in all likelihood, a good match for each of the queries above.

A major focus of the Cyc team's efforts in 1994 was the creation of an image-retrieval application for a major corporate partner which possessed a library of hundreds of thousands of captioned images, but no fully satisfactory way to search them. Building on the Cyc inference engine and Cyc's NL capabilities, we developed a system along the lines described above.

First, the content of each image in the library is described to Cyc by converting the English captions to CycL and adding the resulting formulas to the knowledge base. Cyc's NL tools permit the English-to-CycL translation to be mostly automatic; human intervention is required only in unusual cases.

Once the target images have been described to Cyc, the system is ready to accept queries, which can be issued in plain English. Cyc begins by converting the English queries to CycL, again using Cyc's NL tools. For example, the English query "a frightened person" might be parsed as:

(#$and
   (#$isa ?x #$Person)
   (#$feelsEmotion ?x #$Fear #$High))

After asking the user to confirm its parse (or, occasionally, to choose one of two or more equally valid parses), Cyc begins to backchain from the query expression, using the image descriptions and other knowledge in the KB. When it is able to unify all the free variables with the elements of a picture (in our example, ?x would unify with the woman in the picture), Cyc knows it has found a match.
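The backchaining step can be illustrated with a toy example. The following is a minimal Python sketch, not Cyc's inference engine: it assumes the caption has already been translated into a handful of facts, and it uses a few hand-written rules in place of Cyc's commonsense knowledge. Apart from isa, Person, feelsEmotion, Fear and High, every predicate and constant (holdsToHeadOf, lifeThreatened, Soldier1, Woman1, Gun1) is hypothetical and chosen only for illustration.

```python
from itertools import count

# Facts a caption translator might assert for the photo (hypothetical names).
FACTS = [
    ("isa", "Soldier1", "Soldier"),
    ("isa", "Woman1", "Woman"),
    ("isa", "Gun1", "Gun"),
    ("holdsToHeadOf", "Soldier1", "Gun1", "Woman1"),
]

# Rules as (head, body): the head holds if every literal in the body holds.
# These stand in for background knowledge and are assumptions of this sketch.
RULES = [
    (("isa", "?p", "Person"), [("isa", "?p", "Woman")]),
    (("isa", "?p", "Person"), [("isa", "?p", "Soldier")]),
    (("lifeThreatened", "?p"),
     [("holdsToHeadOf", "?a", "?g", "?p"), ("isa", "?g", "Gun")]),
    (("feelsEmotion", "?p", "Fear", "High"), [("lifeThreatened", "?p")]),
]

_fresh = count()


def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def walk(term, bindings):
    """Follow variable bindings until reaching a constant or a free variable."""
    while is_var(term) and term in bindings:
        term = bindings[term]
    return term


def unify(x, y, bindings):
    """Unify two flat literals; return extended bindings, or None on failure."""
    if len(x) != len(y):
        return None
    bindings = dict(bindings)
    for s, t in zip(x, y):
        s, t = walk(s, bindings), walk(t, bindings)
        if s == t:
            continue
        if is_var(s):
            bindings[s] = t
        elif is_var(t):
            bindings[t] = s
        else:
            return None
    return bindings


def rename(rule):
    """Give a rule fresh variable names so separate uses do not clash."""
    n = next(_fresh)

    def ren(literal):
        return tuple(f"{t}_{n}" if is_var(t) else t for t in literal)

    head, body = rule
    return ren(head), [ren(literal) for literal in body]


def prove(goals, bindings):
    """Backchain from the goals, yielding each set of bindings that satisfies them."""
    if not goals:
        yield bindings
        return
    first, rest = goals[0], goals[1:]
    for fact in FACTS:
        b = unify(first, fact, bindings)
        if b is not None:
            yield from prove(rest, b)
    for rule in RULES:
        head, body = rename(rule)
        b = unify(first, head, bindings)
        if b is not None:
            yield from prove(body + rest, b)


def answers(query, var):
    return sorted({walk(var, b) for b in prove(list(query), {})})


# The parsed query for "a frightened person".
FRIGHTENED_PERSON = [("isa", "?x", "Person"),
                     ("feelsEmotion", "?x", "Fear", "High")]
print(answers(FRIGHTENED_PERSON, "?x"))  # prints ['Woman1']
```

In this sketch the soldier is also a person, but nothing in the facts or rules lets the system conclude that he is frightened, so only the woman unifies with ?x.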

Cycorp has generalized this approach to image retrieval, extending it to other opaque-information-source retrieval applications. For example, document retrieval from large libraries of text documents described by short abstracts (analogous to image captions) is a task nearly identical in structure to that of retrieving captioned images. In fact, any database of captions, summaries, or abstracts could be handled similarly, whether the corresponding library contained images, sounds, video, text, or anything else.

www information retrieval

The explosion of the World Wide Web during the last two years has created a tremendous opportunity. The WWW is home to a vast quantity of information, much of which could, in principle, be used to make Cyc more intelligent, while shortcutting the laborious process of manual knowledge entry.

This could happen in one of two ways: either online information could be extracted, converted to CycL, and incorporated directly into the knowledge base, or Cyc could be taught to treat external information sources as extensions of the KB, without directly incorporating their contents.

Cycorp explored the first approach during 1996. Currently Cyc is nearing the critical mass required for the reading and assimilation of online texts (news stories, encyclopedia articles, etc.). In this scenario, Cyc's natural language tools are used to process online texts, converting them to CycL for inclusion in the knowledge base.

Cycorp is also pursuing the second approach. In this scenario, we write gateways which disguise WWW information sources as Cyc agents (with a very limited domain of expertise), available to operate in a distributed Cyc architecture. There is a large and ever-growing number of information sources on the WWW which might fill this role. All, however, share the following characteristics:

  1. They embody a large corpus of knowledge,
  2. they can respond to HTTP queries for specific knowledge, and
  3. they present their knowledge in an HTML format which, while generally not as regularly structured as a database, is sufficiently structured that it can be parsed by a relatively simple algorithm.

An example is the Internet Movie Database (IMD), a truly stupendous compendium of movie knowledge. A quick browse of the IMD demonstrates that it embodies virtually everything there is to know about movies; that it can respond to queries for particular actors, movies, etc.; and that it displays the results in a fairly regular format, even though the layout varies somewhat from one actor or movie to another, depending on what information is available.

We can effectively annex the contents of the IMD to the Cyc KB by creating a gateway which, on the one hand, interacts with Cyc agents exactly as a Cyc agent operating in a distributed Cyc architecture would, and, on the other hand, simulates the interaction of a WWW browser with the IMD HTTP server.

This gateway is advertised to Cyc as an expert in the movie domain, so that whenever Cyc receives a query in that domain, it turns to the gateway for assistance. For example, let's say a user asks Cyc, "What movies did Ronald Reagan act in?" This might be represented in CycL as:

(#$actedInMovie #$RonaldReagan ?x)

Cyc hands this CycL query to the gateway, which understands enough about movie-related CycL vocabulary to translate this query into an HTTP request to the IMD server for a page on Ronald Reagan. When the IMD server returns the page, the gateway parses the HTML, extracts a list of the movies in which Reagan appeared, converts the result to CycL, and then constructs a suitable reply to the Cyc agent making the request. (For reasons not worth explaining, the reply contains more than just the answer.)
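The round trip just described (CycL query in, HTTP request out, HTML parsed, CycL reply back) might be sketched in Python as follows. This is an illustration under stated assumptions, not the actual gateway: the URL, the query parameter, the page layout in SAMPLE_PAGE, and the rule for turning a movie title into a CycL constant are all hypothetical, and the reply here contains only the answer sentences. The page fetcher is passed in as a function so the sketch runs without network access.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlencode

# Hypothetical endpoint; the real server's URLs are not described in the text.
BASE_URL = "http://movies.example/person"

# A made-up page in an assumed layout, standing in for the server's response.
SAMPLE_PAGE = """
<html><body>
<h1>Ronald Reagan</h1>
<ul class="filmography">
  <li><a href="/title/1">Kings Row</a> (1942)</li>
  <li><a href="/title/2">Bedtime for Bonzo</a> (1951)</li>
</ul>
</body></html>
"""

# The only query shape this toy gateway understands.
QUERY_RE = re.compile(r"^\(#\$actedInMovie\s+#\$(\w+)\s+(\?\w+)\)$")


def query_to_url(cycl_query):
    """Translate a CycL query into a request URL for the movie server."""
    m = QUERY_RE.match(cycl_query.strip())
    if m is None:
        raise ValueError("outside this gateway's domain: " + cycl_query)
    actor_constant, variable = m.groups()
    # Split the constant at lower-to-upper boundaries: RonaldReagan -> Ronald Reagan
    name = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", actor_constant)
    return BASE_URL + "?" + urlencode({"name": name}), actor_constant, variable


class FilmographyParser(HTMLParser):
    """Collect link text inside <ul class="filmography"> (assumed layout)."""

    def __init__(self):
        super().__init__()
        self.in_list = False
        self.in_link = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "ul" and ("class", "filmography") in attrs:
            self.in_list = True
        elif tag == "a" and self.in_list:
            self.in_link = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "ul":
            self.in_list = False
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.titles[-1] += data


def title_to_constant(title):
    """Hypothetical naming rule: 'Kings Row' -> '#$KingsRow'."""
    words = re.findall(r"[A-Za-z0-9]+", title)
    return "#$" + "".join(w[:1].upper() + w[1:] for w in words)


def answer(cycl_query, fetch):
    """Run the round trip; fetch is any function mapping a URL to HTML text."""
    url, actor, _variable = query_to_url(cycl_query)
    parser = FilmographyParser()
    parser.feed(fetch(url))
    return [f"(#$actedInMovie #${actor} {title_to_constant(t)})"
            for t in parser.titles]


for sentence in answer("(#$actedInMovie #$RonaldReagan ?x)",
                       lambda url: SAMPLE_PAGE):
    print(sentence)
```

Because the fetcher is injected, the same code could be pointed at a live server by supplying a function that performs the HTTP request; only the parser would need to change if the page layout differed from the one assumed here.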

To the user interacting with Cyc, this transaction is entirely transparent. It appears to the user as if Cyc now knows everything there is to know about movies, and yet the KB still fits on the hard disk!

Enhancing Cyc's cinematic erudition may be a fairly frivolous application of the techniques described above, but the WWW contains plenty of large sources of semi-structured information on weightier topics (stock quotes, company profiles, WWW indexes, resumes, the CIA World Factbook, etc.). The ability to effectively incorporate vast portions of the WWW into a "virtual" knowledge base is a compelling possibility. Not only would it greatly expand the effective scope of Cyc's knowledge, but it would do so at little cost to the Cyc development team.



Copyright © 2002-2012 Cycorp, Inc. All Rights Reserved.
