Cyc gets its name from “Encyclopedia”, and it has an enormous knowledge base that models the real world (as well as some fictional ones).
As such, when considering whether to use Cyc, it makes sense to ask what it knows about your particular domain. For instance, folks in the healthcare industry ask us what Cyc knows about hospitals, medical procedures, and insurance. We can answer this question, and we do, but it turns out this isn’t the best question to ask.
Problem 1: Burying the Lede
The first problem with “what does Cyc know about <topic>?” is that it obscures the more fundamental question: “how long will it take to teach Cyc all that is necessary for my application?” These two questions are indeed related: if Cyc already knew everything about your domain, then no time would be needed to add knowledge. However, the current state of the knowledge base is only part of the answer. Because of the expressiveness of Cyc’s representation language, the generality at which things are asserted, and the tools available to ontologize concepts, adding knowledge to Cyc is an efficient process, so a gap in coverage today need not mean a long time to solution.
The generality point bears further explanation. Take some claim like “All horses have heads.” While this is true, we would not assert it outright in Cyc. Instead, it is better to teach Cyc one piece of reusable information, such as “All mammals have heads,” from which Cyc can conclude the same of horses, dogs, and every other mammal it knows about. For another example, we can teach Cyc that whenever you pinch a fluid conduit while some fluid is flowing, pressure builds upstream and decreases downstream. This single piece of knowledge can be applied to straws, veins, hoses, and standpipes in oil drilling operations. Because the knowledge base is filled with such reusable bits, adding new domain knowledge is often very quick: a few assertions can provide hooks to leverage a great deal of already present knowledge.
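To make the generality point concrete, here is a minimal Python sketch of asserting a fact once at a general level and reusing it for more specific types. It is a toy illustration only: the `ToyKB` class, its method names, and the type and property strings are hypothetical, and none of this is Cyc's actual representation language or API.

```python
from collections import defaultdict


class ToyKB:
    """A toy knowledge base: types, supertypes, and properties asserted on types."""

    def __init__(self):
        self.parents = defaultdict(set)     # type -> its direct supertypes
        self.properties = defaultdict(set)  # type -> properties asserted directly on it

    def add_subtype(self, child, parent):
        self.parents[child].add(parent)

    def assert_property(self, type_name, prop):
        self.properties[type_name].add(prop)

    def ancestors(self, type_name):
        """Return type_name plus every supertype reachable from it."""
        seen, stack = set(), [type_name]
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.parents[current])
        return seen

    def has_property(self, type_name, prop):
        """True if prop was asserted on type_name or on any of its supertypes."""
        return any(prop in self.properties[t] for t in self.ancestors(type_name))


kb = ToyKB()

# Two general assertions, each made exactly once.
kb.assert_property("Mammal", "has a head")
kb.assert_property("FluidConduit", "pinching raises upstream pressure")

# Adding a new concept is a single hook into what is already known.
kb.add_subtype("Horse", "Mammal")
for conduit in ("Straw", "Vein", "Hose", "Standpipe"):
    kb.add_subtype(conduit, "FluidConduit")

print(kb.has_property("Horse", "has a head"))
print(kb.has_property("Vein", "pinching raises upstream pressure"))
print(kb.has_property("Straw", "has a head"))
```

The first two queries succeed even though nothing was ever asserted directly about horses or veins, and the third fails because straws are not mammals. The cost of adding each new type is one subtype link, which is the sense in which a few assertions can leverage a great deal of existing knowledge.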
One way to get insight into how quickly Cyc can learn is to find something in your domain that Cyc doesn’t know about, have us teach Cyc the concept, and have us report how long it took. The sample size of a POC/POV/Phase Zero of a project is sufficient to demonstrate scalability. However, we are also happy to take a very small sample by simply adding some concept that Cyc previously did not know about and transparently recording how long the knowledge addition took.
Problem 2: Measuring Virtues
The second problem with “what does Cyc know about X?” is that it focuses on one virtue: having lots of knowledge. There are many other virtues: having few falsehoods, being able to reason quickly over a knowledge base, having knowledge with high utility, having knowledge that is internally consistent, etc. Some of these virtues can conflict: the more that one knows, often the more difficult it is to draw out all of the implications of the knowledge.
So, how do you know when knowledge of some domain is complete (enough)? Should we prioritize one of the virtues over others? Internally, we attempt to balance these virtues with test-based development. This means that prior to beginning a particular task, we lay out a series of things in plain English that we expect Cyc to be able to conclude. For example, if we are ontologizing the rules of the road, we may want to ensure that Cyc can answer at least the following questions:
- In what direction should you turn your wheels when parking facing uphill without a curb?
- On a one-way street, what color is the broken lane marker?
- What color is a yield sign?
- What shape is a stop sign?
These questions may have different answers in different contexts: not every country has the same set of road signs, for instance. We then teach Cyc about the necessary underlying concepts to answer these questions correctly in any context. However, we don’t just “teach to the test”: as discussed above, we teach Cyc at a general level, and the tests are meant to be a representative sample of that general knowledge. Once we have done this and all of the tests are passing, we create new tests to see whether our coverage was general enough. Ideally, we ask folks who are not on the existing project to come up with questions that they would expect Cyc to answer if it had mastery of the given domain. Sometimes there are third-party tests that serve as a good basis for evaluating knowledge. In the case of driving, we might use review materials for state driver’s licensing tests.
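The test-first workflow described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not Cycorp's tooling: the `FACTS` table, the `ask` and `run_tests` functions, and the way contexts are modeled as plain strings are all hypothetical. The point is only that expected answers are written down per context before the knowledge is added, and that failing tests show where coverage is still missing.

```python
# Toy stand-in for a knowledge base: answers keyed by (context, question).
# A real system would derive answers by inference rather than direct lookup.
FACTS = {
    ("United States", "shape of a stop sign"): "octagon",
    ("Japan", "shape of a stop sign"): "inverted triangle",
}


def ask(context, question):
    """Return the answer that holds in the given context, or None if unknown."""
    return FACTS.get((context, question))


# Tests are written up front as (context, question, expected answer).
TESTS = [
    ("United States", "shape of a stop sign", "octagon"),
    ("Japan", "shape of a stop sign", "inverted triangle"),
    ("United States", "color of a yield sign", "red and white"),
]


def run_tests(tests):
    """Return a list of (context, question, expected, actual) for each failing test."""
    failures = []
    for context, question, expected in tests:
        actual = ask(context, question)
        if actual != expected:
            failures.append((context, question, expected, actual))
    return failures


failures = run_tests(TESTS)
for context, question, expected, actual in failures:
    print(f"FAIL [{context}] {question}: expected {expected!r}, got {actual!r}")
print(f"{len(TESTS) - len(failures)} of {len(TESTS)} tests passing")
```

Here the same question has different correct answers in the two contexts, and the yield sign test fails because nothing has been taught about yield signs yet. In this toy the fix would be a single new entry in `FACTS`; in the workflow described above, the fix would instead be general knowledge about road signs, with the test serving as one sample of it.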
To be clear, even though we do not solely value quantity of knowledge, Cyc still does quite well by that measure. The knowledge base contains over 25 million assertions, and our inference engines enable us to efficiently conclude trillions of bits of knowledge.
Problem 3: Knowledge Versus Data
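The third problem with “what does Cyc know about X?” is that it blurs the distinction between knowledge and data. At a first pass, knowledge involves general, reusable truths about types of things, whereas data involves specific claims about individuals. A few examples of knowledge: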
- A birthday is the calendar date when an animal was born. Humans often celebrate the anniversary of birth with parties.
- Stocks can be bought or sold in specialized markets called stock markets.
Contrast with some similar examples of data:
- Casey Hart’s birthday is August 2, 1986.
- The stock price of Amazon as of March 28 at 7:30 AM was $1,765.70.
Cycorp is in the business of knowledge. Data is cheap and ubiquitous: we can call out to Wikidata or other external sources to find such individual facts. This is not to say that data is not important, just that one shouldn’t evaluate the quality of our knowledge base by reference to whether Cyc knows various bits of trivia. To compare: given that your cell phone can store all of the phone numbers on your contact list, we would not expect you to have all of your friends’ and family’s numbers memorized. To the contrary, it might be considered a waste of intellectual resources to remember Aunt Kathy’s number when the task is so easily and cheaply farmed out to your contact list.
This is not to say that there isn’t any data in Cyc. There are some pieces of trivia that are referenced frequently enough that storing them directly in Cyc rather than needing to call out to a database is more efficient. To return to the analogy, you probably have a few frequently used phone numbers memorized, even if you could look them up on your contact list as well.
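The contact-list analogy can be sketched as a small lookup layer that defers to an external source by default and keeps a local copy only of facts that are requested often. This is an illustrative sketch, not a description of how Cyc decides what to store: `ToyFactStore`, the `promote_after` threshold, and the `EXTERNAL_SOURCE` dictionary standing in for a service such as Wikidata are all assumptions made for the example.

```python
# Hypothetical stand-in for an external data source such as Wikidata.
EXTERNAL_SOURCE = {
    ("Casey Hart", "birthday"): "August 2, 1986",
}


class ToyFactStore:
    """Looks facts up externally, keeping local copies of frequently requested ones."""

    def __init__(self, external_lookup, promote_after=3):
        self.external_lookup = external_lookup  # callable: key -> value or None
        self.promote_after = promote_after      # assumed threshold for storing locally
        self.local = {}                         # facts "memorized" locally
        self.request_counts = {}                # how often each fact was fetched externally
        self.external_calls = 0

    def get(self, entity, attribute):
        key = (entity, attribute)
        if key in self.local:
            return self.local[key]  # the memorized phone number

        # Otherwise consult the "contact list".
        self.external_calls += 1
        value = self.external_lookup(key)
        self.request_counts[key] = self.request_counts.get(key, 0) + 1

        # Store the fact locally once it has been requested often enough.
        if value is not None and self.request_counts[key] >= self.promote_after:
            self.local[key] = value
        return value


store = ToyFactStore(EXTERNAL_SOURCE.get)
for _ in range(5):
    birthday = store.get("Casey Hart", "birthday")

print(birthday)
print(store.external_calls)
```

With the assumed threshold of three, the first three requests go to the external source and the remaining two are answered locally. Rarely requested facts never take up local space, which mirrors the argument that storing trivia is worthwhile only when it is referenced frequently.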
Cyc probably knows quite a bit about whatever domain you are interested in, but we should be careful not to focus on the wrong question. Instead, we make sure we 1) target time to solution rather than the current state of the knowledge base, 2) focus on successful inference rather than just quantity of knowledge, and 3) appreciate the difference between knowledge and data.