We helped identify which concepts are likely to appear when people talk about, or watch videos of, common events. We expanded what Cyc already knows about events such as baby showers, birthday parties, and working around the house. We then leveraged Cyc’s NL understanding capabilities to predict which concepts would be good search targets given a short description. For example, if a user wants to find pictures of a man smiling, Cyc can put forward pictures with captions such as “Father watches his daughter take first steps,” because Cyc knows this is a momentous, happy event in a child’s life that parents are likely to enjoy. To further improve Cyc’s predictions in this effort, we also experimented with mapping Word2Vec to OpenCyc: by identifying nearby terms in the Word2Vec embedding space, we could identify nearby Cyc constants that are likely to occur in relation to a given event.
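The Word2Vec-to-OpenCyc idea can be sketched as follows. This is a minimal illustration, not the actual Cycorp mapping: the toy vectors, vocabulary, and word-to-constant alignment table below are all invented for the example, and a real system would use trained Word2Vec embeddings.

```python
import math

# Toy 3-dimensional vectors standing in for real Word2Vec embeddings.
# Words and values are illustrative only.
embeddings = {
    "birthday": [0.9, 0.1, 0.2],
    "cake":     [0.8, 0.2, 0.3],
    "candle":   [0.7, 0.3, 0.1],
    "wrench":   [0.1, 0.9, 0.8],
}

# Hypothetical alignment from vocabulary words to OpenCyc constants.
word_to_cyc = {
    "cake": "#$Cake",
    "candle": "#$Candle",
    "wrench": "#$Wrench",
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearby_cyc_constants(word, k=2):
    """Cyc constants aligned to the k nearest embedding-space neighbors of `word`."""
    target = embeddings[word]
    neighbors = sorted(
        (w for w in embeddings if w != word),
        key=lambda w: cosine(embeddings[w], target),
        reverse=True,
    )
    return [word_to_cyc[w] for w in neighbors[:k] if w in word_to_cyc]

print(nearby_cyc_constants("birthday"))  # → ['#$Cake', '#$Candle']
```

With real embeddings, the same nearest-neighbor lookup would surface Cyc constants for objects and actions that co-occur with a given event, even when they never appear in its description.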
We generated a corpus of approximately 1.2 million CycL sentences with an average of 11 English paraphrases each for use by Deep Learning approaches such as Distant Supervision. The data set consisted of three types of CycL sentences:
55k already-asserted or forward-derived CycL sentences
(eventOccursAt (SummerOlympicsFn (YearFn 2012)) CityOfLondonEngland)
“The 2012 Summer Olympics happened in London, England.”
114k questions based on those known sentences
(eventOccursAt (SummerOlympicsFn (YearFn 2012)) ?VAR)
“Where did the 2012 Summer Olympics take place?”
~1 million backward-inferred sentences that Cyc can readily derive with its inference engine, which is optimized to perform very deep proofs using powerful, general relationships that were manually asserted earlier:
(commonSoundTypeFromActionType (CharacteristicSoundTypeFromActionTypeFn Speaking) (WorkOfFn NewsShowHost))
“Speech is commonly heard during the work of news show host.”
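The second category above, questions derived from known sentences, amounts to replacing one argument of an asserted CycL sentence with a variable such as ?VAR. A minimal sketch of that transformation follows; the top-level argument splitter is a simplification of real CycL parsing, which handles nested functional terms like (SummerOlympicsFn (YearFn 2012)) by tracking parenthesis depth.

```python
def split_args(sentence):
    """Split '(pred arg1 arg2 ...)' into [pred, arg1, arg2, ...],
    splitting only on spaces at the top nesting level."""
    inner = sentence.strip()[1:-1]
    parts, depth, cur = [], 0, ""
    for ch in inner:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == " " and depth == 0:
            if cur:
                parts.append(cur)
            cur = ""
        else:
            cur += ch
    if cur:
        parts.append(cur)
    return parts

def question_variants(sentence):
    """Generate one question per argument by substituting ?VAR for it."""
    pred, *args = split_args(sentence)
    variants = []
    for i in range(len(args)):
        new_args = args[:i] + ["?VAR"] + args[i + 1:]
        variants.append("(" + " ".join([pred] + new_args) + ")")
    return variants

s = "(eventOccursAt (SummerOlympicsFn (YearFn 2012)) CityOfLondonEngland)"
for q in question_variants(s):
    print(q)
```

Run on the Olympics example, this yields both the “where” question shown above and a “which event” question, (eventOccursAt ?VAR CityOfLondonEngland), which helps explain how 55k assertions expanded to 114k questions.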
All of these sentences were generated from a small set of CycL templates that specified relationships such as “X directed the movie Y”. With only 100 templates, we quickly generated over one million CycL sentences, along with the multiple ways Cyc knows to express each of them in English. The resulting corpus provides equivalence classes of different ways to say the same thing in English, which Deep Learning approaches can use to train their models. It also makes explicit which relationship each sentence describes, at a level of detail often missing from Distant Supervision experiments: while Distant Supervision may tell us only that X and Y are related, Cyc knows exactly what that relationship is.
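The template expansion can be sketched roughly as follows. The template format, the movieDirector predicate, and the fillers are all hypothetical stand-ins for illustration; in the real pipeline the fillers come from the Cyc KB and the English renderings from Cyc's NL generation, not from format strings.

```python
# Each hypothetical template pairs one CycL pattern with several English
# paraphrase patterns. Fillers carry both a CycL constant form and an
# English rendering for each slot.
templates = [
    {
        "cycl": "(movieDirector {Y} {X})",
        "english": [
            "{X_en} directed the movie {Y_en}.",
            "{Y_en} was directed by {X_en}.",
            "The movie {Y_en} is a {X_en} film.",
        ],
    },
]

# Illustrative filler; a real run would pull thousands from the KB.
fillers = [
    {"X": "AlfredHitchcock", "Y": "Vertigo-TheMovie",
     "X_en": "Alfred Hitchcock", "Y_en": "Vertigo"},
]

def expand(templates, fillers):
    """Cross each template with each filler, yielding a CycL sentence
    plus its equivalence class of English paraphrases."""
    corpus = []
    for t in templates:
        for f in fillers:
            cycl = t["cycl"].format(**f)
            paraphrases = [p.format(**f) for p in t["english"]]
            corpus.append((cycl, paraphrases))
    return corpus

for cycl, paraphrases in expand(templates, fillers):
    print(cycl)
    for p in paraphrases:
        print("  " + p)
```

Because every paraphrase in a class is generated from the same CycL sentence, a learner trained on the corpus sees not just that two strings are related, but exactly which formal relationship they both express.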
This project demonstrates that Cyc would be a strong partner for Deep Learning companies and research groups, which could use similar equivalence classes, potentially trillions of sentence-sized ones, to train their learners.