Semantic Knowledge Source Integration (FAQ)

One of the central values of an ontology is its ability to add meaning to your data. Suppose you look at a spreadsheet that has columns for “car_make”, “car_model”, “year”, and “base_price”. You might naively think this table contains all the information you need to identify the base price of a 2017 Nissan Versa (provided there is a row with those values in it). However, this is not quite right. What you need is not entirely in the spreadsheet: strictly speaking, the sheet only contains certain strings and numbers (“car_make”, “Nissan”, “2017”, etc.), and these require our understanding of what they mean in a broader context. When humans use data like this, we are our own interpretation engines, linking the information in the database to its meaning. If we want computers to leverage our data the way we do, we need AI that can serve as a similar interpretation engine.

SKSI

Cycorp connects the Knowledge Base to data sources through a capability we call Semantic Knowledge Source Integration (SKSI). In this FAQ, we give a quick, approachable description of how this works. First, we discuss why we use SKSI and what its benefits are. Then, we lay out the basic architecture for how we represent data sources so that Cyc can leverage them where they natively live.

The Problems

Data Without Meaning

As the introduction to this page pointed out, spreadsheets do not wear their meanings on their sleeves. Rather, they relate a variety of strings and numbers (and a few other data types, depending on format). These relations are tremendously powerful, but only when the data can be properly interpreted.

Above you saw an example where the field names mapped very closely to their meanings. But field names are often non-obvious: a financial database may have a field named “ma_cpg”. What does this mean? It is not at all obvious unless someone tells you that it refers to the minimum average cost, in cents per gallon, of the fuel referred to on that row.

Cycorp solves this problem by explicitly representing the meaning of your data source and connecting it to the deepest, broadest, and most expressive knowledge base in the world.

Information Silos

You might try to solve the problem of missing meaning by creating additional documents that provide the context for a table. For instance, we could create another table, .pdf, or other document that explains to users that “base_price” relates a type of car to a type of character string that appears in the data source. But this merely generates more documentation that users have to find in order to piece together the meaning; the data and the meaning are still separated.

Contrast this approach with the Cycorp solution: we allow your data and meaning to be processed at the same time by our AI platform. This means that users can simply ask the plain questions they have of the data.

Unnecessary Technical Barriers

If your data is stored in a specialized format, then you require your data analysts to be proficient in that format in order to extract any value from your data. But this is a waste: why should, say, supply chain analysts need to be master SQL programmers in order to look at the data? Data should be accessible to everyone so that your supply chain experts can interact with the data straightforwardly.

Cycorp solves this problem by providing natural language interfaces for users to query their data. Cyc can generate the SPARQL, Gremlin, or other such queries for you. Your experts simply need to be good analysts. And Cyc can help with that part, too!

“Bad” Data

Everyone cringes when outside parties see their data. We are all self-conscious about the problems with our data: it’s gappy, messy, in different formats from one place to another, and so on. This can be a serious problem for computer systems that take any data inputs as perfectly representative of the world. However, human users are not so easily fooled: we know that some values simply don’t make sense, and we can therefore keep bad data from corrupting our reasoning.

Cycorp solves this problem by first understanding the world. This means that we do not derive our understanding of the world from data, but instead know how the world works and use data when appropriate. Cyc can therefore spot bad or unreliable data and know when to throw out suspicious data, just as you would.

Small Data

In the age of big data, there are still many areas where the data is too sparse to generate a sufficient training base for machine-learning AI technologies. What can AI do in cases where the sample size is only about one oil well, or two hospitals, or the stock prices of five companies? These cases are opaque to a data-only statistical approach, but Cyc’s symbolic reasoning can fall back on first principles to draw meaningful conclusions even when the sample size is one.

Cycorp solves this problem in the same way we deal with bad data. Since we start with an understanding of the world, we can apply general principles to reason about novel cases.

SKSI Architecture

Connection

Cycorp can derive value from data sources without needing to migrate them into some sort of data lake. Instead, we represent the nature of a given knowledge source. For example, if you had a database called SampleDB, we could create a concept for that database in Cyc; call it #$SampleDB. We can then tell Cyc where #$SampleDB lives by providing connection information. This enables Cyc to access that knowledge source whenever necessary. But before Cyc can meaningfully query the database, we need to further characterize SampleDB.
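To make the idea of "telling Cyc where a source lives" concrete, here is a minimal Python sketch. It is not Cyc's API or CycL: the KnowledgeSource class, the REGISTRY dictionary, and the register and connect functions are hypothetical names invented for illustration, and an in-memory SQLite database stands in for SampleDB.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class KnowledgeSource:
    """Hypothetical stand-in for a concept such as #$SampleDB."""
    term: str      # name of the concept that represents the source
    location: str  # where the source natively lives (here, a SQLite path)


# Maps concept names to their connection details (illustrative only).
REGISTRY = {}


def register(source):
    """Record where a knowledge source lives so it can be reached later."""
    REGISTRY[source.term] = source


def connect(term):
    """Open the source in place; nothing is copied into a data lake."""
    return sqlite3.connect(REGISTRY[term].location)


# Assumption: SampleDB is reachable as a SQLite database.
register(KnowledgeSource(term="SampleDB", location=":memory:"))
```

The point of the sketch is only that the source is described once and then opened on demand, rather than being migrated somewhere else first.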

Database Structure

The first component of representation is the ‘physical’ information about the source’s structure. In the case of a .csv, this may include noting the number of rows and fields, as well as the datatypes for each field (e.g. string, float). Cyc will then know all of the non-semantic facts about the data source, such as the primary key and all of the field names. What is still missing is the semantics: what does everything mean?
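As a rough illustration of what such ‘physical’ information might look like for a .csv, the following Python sketch collects the row count, field names, inferred datatypes, and primary key. The function names and the shape of the result are assumptions made for this example, not Cyc's actual representation, and the base_price values in the sample are made up.

```python
import csv
import io

# Sample data; the base_price values are invented for illustration.
SAMPLE_CSV = """id_no,car_make,car_model,year,base_price
1234,Nissan,Versa,2017,10000.0
1235,Nissan,Versa,2018,10500.0
"""


def infer_type(values):
    """Return the narrowest of 'integer', 'float', 'string' that fits all values."""
    for name, cast in (("integer", int), ("float", float)):
        try:
            for value in values:
                cast(value)
            return name
        except ValueError:
            continue
    return "string"


def physical_schema(text, primary_key):
    """Collect the non-semantic facts about a CSV: size, fields, types, key."""
    rows = list(csv.DictReader(io.StringIO(text)))
    fields = list(rows[0].keys())
    return {
        "row_count": len(rows),
        "fields": {f: infer_type([r[f] for r in rows]) for f in fields},
        "primary_key": primary_key,
    }


schema = physical_schema(SAMPLE_CSV, primary_key="id_no")
```

Note that nothing in the result says what a “base_price” is; the sketch captures structure only, which is exactly the gap the next step fills.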

Translation

Having the structure is good, but it requires translation into Cyc’s ontology. This so-called “logical schema” representation of a knowledge source is where we connect the terms in the physical table with the concepts in Cyc. Sometimes the connections are relatively obvious; we might specify that the string “convertible” refers to the term #$ConvertibleCar in Cyc. But the strings need not bear any resemblance to the CycL terms that we use to represent the relevant concepts.
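A toy version of this translation step can be sketched in Python as a lookup from raw strings to concept names. Only the pairing of “convertible” with #$ConvertibleCar comes from the text above; the extra codes “cv” and “cabrio”, the TERM_FOR_STRING table, and the translate function are hypothetical and are there to show that a string need not resemble the term it denotes.

```python
# Illustrative mapping from raw field values to concept names.
# "cv" and "cabrio" are invented codes for this example.
TERM_FOR_STRING = {
    "convertible": "#$ConvertibleCar",
    "cv": "#$ConvertibleCar",
    "cabrio": "#$ConvertibleCar",
}


def translate(raw):
    """Return the concept a raw field value denotes, or None if unmapped."""
    return TERM_FOR_STRING.get(raw.strip().lower())
```

For example, translate(" CV ") returns "#$ConvertibleCar", while an unmapped string such as "sedan" returns None rather than a guess.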

Meaning Sentences

The logical schema is also where we can specify the relations that are characterized by linking together the information in various fields. So, when you see a row that contains 1234, 2017, and “Versa” beneath the fields “id_no”, “year”, and “car_model”, you know that car model 1234 is a 2017 Versa. We empower Cyc to draw this conclusion by making explicit the relationship that these fields bear to one another.
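One way to picture a meaning sentence is as a template that says which fields of a row stand in which relation to one another. The Python sketch below applies such templates to the row from the example. The predicate names modelYear and modelName, the MEANING_TEMPLATES list, and the meaning_sentences function are invented for illustration and are not actual CycL predicates or Cyc functions.

```python
# Each template is (predicate, subject field, object field).
# Predicate names are hypothetical, not real CycL vocabulary.
MEANING_TEMPLATES = [
    ("modelYear", "id_no", "year"),
    ("modelName", "id_no", "car_model"),
]


def meaning_sentences(row):
    """Turn one row into explicit (predicate, subject, object) statements."""
    return [
        (predicate, row[subject], row[obj])
        for predicate, subject, obj in MEANING_TEMPLATES
    ]


row = {"id_no": 1234, "year": 2017, "car_model": "Versa"}
facts = meaning_sentences(row)
```

Here facts holds two statements about car model 1234: that its year is 2017 and that its model name is “Versa”. The templates are written once and then apply to every row in the source.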

Schema Modeling Tool

Cycorp facilitates efficient data connection. First, you only need to map a data source once, and then Cyc will never forget what that source looks like or how to access it. Data mapping is therefore a one-time task that returns value for the life of that data source. Second, we make this one-time mapping painless by either providing professional services for the mapping or giving you access to our Schema Modeling Tool (SMT). The SMT is a semi-automated tool for generating mappings and meaning sentences, automatically building the physical and logical schemas for your new data sources. Demonstrations of the SMT are available upon request.
