How Semantics Can Make Data Analysis Work Like A Google Search

The interfaces used in business intelligence and data analytics are becoming smarter, conversational, and more powerful because, at long last, computational semantics are starting to be applied.

In this article, I’m going to look at how a startup called DataRPM uses an elegant computational semantic model to present an extremely pleasing experience. Other companies are also using semantics in ways I will explore later, including Microsoft through Power BI, Paxata through its platform for data transformation, and Metric Insights through its idea of a KPI warehouse.

Semantics 101

The idea of semantics is that instead of just moving bits around, the computer program attempts to understand what you are looking for and provides a richer response, usually either a better answer or suggestions about what to do next.

For most of the computing we use, the semantic model is actually in the head of the user. Most BI technology is not really aware of what it is showing end users. Data is manipulated and graphs are presented, but the BI technology is really more of a digital blender than an intelligent prep cook who can actually anticipate what you might want.

Contrast this with technology like IBM’s Watson or Deep Blue. These systems have ornate semantic models of their target domains and in a powerful sense know something. The test, of course, for whether such systems really know anything is how well they perform. Watson and Deep Blue have passed those tests with flying colors.

For my purposes, I’m going to define semantics as an attempt to model the information being addressed and then put that model to use.

Natural Language Questioning and Answering

Google and Bing are powerfully interested in semantics. Search results increasingly know what you are looking for when the clues are strong and can present not just a keyword match, but information organized in a way that helps meet a need. The movie and flight information you often see as results are the most widespread example of this sort of semantic enhancement.

The simple keyword pattern match is used in all sorts of BI systems like QlikView and Microsoft Power BI. When a new data set is added to the mix, these systems suggest fields that can be used for joins. While this sort of semantic processing is simple, it can help quite a bit.
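
To make the idea of field pattern matching concrete, here is a minimal sketch of how a tool might suggest join fields by comparing names. It is not how QlikView or Power BI actually work; the function names and sample field lists are assumptions made up for illustration.

```python
def normalize(name):
    """Lowercase a field name and drop separators, so 'Customer_ID' and 'customerId' compare equal."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def suggest_join_fields(left_fields, right_fields):
    """Return (left, right) pairs whose normalized names are identical."""
    right_by_key = {}
    for field in right_fields:
        right_by_key.setdefault(normalize(field), []).append(field)

    pairs = []
    for field in left_fields:
        for match in right_by_key.get(normalize(field), []):
            pairs.append((field, match))
    return pairs


# Hypothetical field lists from two data sets.
orders = ["Order_ID", "Customer_ID", "Amount"]
customers = ["customerId", "Region", "Signup Date"]

print(suggest_join_fields(orders, customers))
```

The matcher knows nothing about what the fields mean. It only notices that two names look alike, which is exactly why this kind of processing counts as simple.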

But the new collection of products I’ve been looking at demonstrates that even simple semantic modeling that goes a few steps beyond pattern matching can go a long way.

DataRPM, a startup founded by Sundeep Sanghavi, CEO, Shyamantak Gautam, CTO, and Ruban Phukan, chief product officer, is focused on the problem of making BI easy for the widest possible set of end users. All three co-founders have combined their experiences from big companies (Arthur Andersen, IBM, and Yahoo, respectively) and have built significant startups.

“Our goal is to enable discovery and visualization of big data in every enterprise in the most natural way,” said Sanghavi.

The founding inspiration emerged from Sanghavi’s curiosity about why BI implementations so often failed. To better understand this, he left Razorsight, the company he founded, and formed a consulting firm, SearchRidge, which provided BI and analytics consulting.

“What we learned at SearchRidge is that the data warehouse is where data goes to die. Manual data modeling is what kills most modern BI projects because it results in very high cost of ownership, especially where data resides in multiple different data sources. It is extremely difficult to keep pace with dynamically changing business requirements,” said Sanghavi.

Sanghavi realized that to succeed, BI must become like Google, which replaced the manual web directories of the early ‘90s with algorithms that can discover any information in real time. Similarly, successful data intelligence from multiple big data sources requires a new platform that uses algorithms to replace manual data modeling and warehousing. Also, for BI to become truly pervasive, the interface should be conversational and based on natural language. He calls this approach Natural Language Question and Answering (NLQA).

Here’s how it works:

  • DataRPM automatically indexes the data and creates statistical data models from disparate sources using proprietary algorithms.
  • The user is presented with a Google-like text box in which a question can be asked.
  • DataRPM uses natural language processing technology to map the words of the question to the model of all of the available data.
  • DataRPM presents the results using visualizations that the algorithms select. The user can override and switch to different visualizations dynamically.
  • Along with the results, DataRPM offers suggestions for additional matches, filters, and further questions, and provides interactive drill-downs without any preconfiguration.
  • The conversation proceeds as the end-user asks a new question or takes a suggestion.
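
The steps above can be sketched in a few lines. This is a toy illustration under stated assumptions, not DataRPM's proprietary algorithms: the catalog, the word matching, the chart rules, and every function name here are hypothetical.

```python
import re

# Hypothetical model of the available data: field name -> role.
CATALOG = {
    "revenue": "measure",
    "region": "dimension",
    "order_date": "time",
}


def map_question(question, catalog):
    """Return catalog fields with a name part that appears as a word in the question."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    return [
        field
        for field in catalog
        if any(part in words for part in field.split("_"))
    ]


def choose_chart(roles):
    """Pick a visualization from the roles of the matched fields (assumed rules)."""
    if "measure" in roles and "time" in roles:
        return "line"
    if "measure" in roles and "dimension" in roles:
        return "bar"
    if "measure" in roles:
        return "single value"
    return "table"


def answer(question, catalog):
    """Map the question to fields, pick a chart, and suggest unused fields."""
    fields = map_question(question, catalog)
    roles = {catalog[field] for field in fields}
    return {
        "fields": fields,
        "chart": choose_chart(roles),
        "suggestions": [field for field in catalog if field not in fields],
    }


print(answer("Show revenue by region", CATALOG))
```

Each suggestion can be fed back in as part of the next question, which is what turns a single query into a conversation.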

The whole value of this approach rests on two things:

  • The accuracy of the results returned based on the query, in other words, the fit of the response.
  • The quality of the suggestions for new questions that could be asked and new information that could be added to the query.

My experience in playing with DataRPM comes through its integration with MicroPact, a platform for developing case management applications. What struck me was how fast you could move from result to result and how the Google-like experience of using language to make requests felt very natural.

The way DataRPM delivers this experience is by creating a model of the available data that categorizes each element as a measure, a date/time series, or a dimension. The model of the relationships among the fields, together with these categories, is enough to drive really high quality suggestions.
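
As a rough illustration of that categorization, the sketch below labels each column from its sample values. The rules are my own assumptions (a single ISO date format, anything numeric treated as a measure), and the function names are hypothetical; a real system would need far more careful inference.

```python
from datetime import datetime


def is_date(value):
    """True if the string parses as a YYYY-MM-DD date."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def is_number(value):
    """True if the string parses as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def classify_column(values):
    """Label a column 'time', 'measure', or 'dimension' from its string values."""
    if values and all(is_date(v) for v in values):
        return "time"
    if values and all(is_number(v) for v in values):
        return "measure"
    return "dimension"


# Hypothetical sample data, one list of string values per column.
table = {
    "order_date": ["2014-01-05", "2014-02-11"],
    "region": ["East", "West"],
    "revenue": ["1200.50", "980"],
}

model = {name: classify_column(values) for name, values in table.items()}
print(model)
```

One obvious weakness of these rules: a numeric code such as a ZIP code would be labeled a measure, which is the kind of case where configuration or learning has to step in.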

The DataRPM models can be configured to accommodate company slang and connect those words to the correct field names.
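
A minimal way to picture that configuration is a lookup table from slang to field names, applied before the question is matched against the model. The slang terms, field names, and function below are invented for illustration and say nothing about how DataRPM stores such mappings.

```python
import re

# Hypothetical company slang mapped to the real field names.
SLANG = {"bookings": "revenue", "sales": "revenue", "turf": "region"}


def resolve_fields(question, slang, field_names):
    """Translate slang words to field names and keep the ones that exist in the model."""
    resolved = []
    for word in re.findall(r"[a-z_]+", question.lower()):
        field = slang.get(word, word)
        if field in field_names and field not in resolved:
            resolved.append(field)
    return resolved


fields = {"revenue", "region", "order_date"}
print(resolve_fields("Bookings by turf", SLANG, fields))
```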

DataRPM has filed patents on aspects of its Computational Search Engine. Sanghavi says the semantic modeling scales by using horizontally distributed bitmap indexes that can index data from any relational database or warehouse, Hadoop, CSV files, and streaming data. As you would expect, the semantic modeling has learning capabilities, so results improve as users start discovering the data and asking questions.
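
For readers unfamiliar with bitmap indexes, here is a single-machine sketch of the general technique, using Python integers as bit sets. It does not attempt the horizontal distribution Sanghavi describes, and the sample rows and function names are assumptions.

```python
def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to an int whose bit i is set when row i has that value."""
    index = {}
    for i, row in enumerate(rows):
        value = row[column]
        index[value] = index.get(value, 0) | (1 << i)
    return index


def matching_rows(bitmap):
    """Return the row numbers whose bits are set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical case records.
rows = [
    {"region": "East", "status": "open"},
    {"region": "West", "status": "open"},
    {"region": "East", "status": "closed"},
    {"region": "East", "status": "open"},
]

region_index = build_bitmap_index(rows, "region")
status_index = build_bitmap_index(rows, "status")

# A filter on two columns becomes a single bitwise AND of two bitmaps.
hits = region_index["East"] & status_index["open"]
print(matching_rows(hits))  # [0, 3]
```

The appeal is that combining filters costs one bitwise operation per pair of bitmaps rather than a scan of the rows, which is why the structure suits interactive drill-downs.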

While this sort of technology is not in the same league as Watson or Deep Blue, it doesn’t matter. It really helps make BI more usable by a wide group of people, which is a crucial problem.

“We admire the more advanced semantic systems, but we realized we could solve two major gaps that make BI, analytics, and data warehousing user interfaces overly complex,” said Sanghavi. “It’s great to have a beautiful graph, report, and dashboard but if it is not dynamic both with respect to the modeling of data and the presentation of the results, we’ll continue to fail. Using computational semantics to automate the modeling of data and adding natural language to express the desires of the user for information closes the gap that has caused so many BI implementations to fall short of expectations.”

I was surprised when I saw a demo of Microsoft’s Power BI recently. The entire attempt at making suggestions for extending queries seems to be based on field pattern matching. I’m eager to find out if that’s all they plan on doing.

A Tool for the Top of the BI Funnel

It is important to remember that DataRPM is not intended as a replacement for advanced dashboards and applications. To use a marketing analogy, DataRPM is a top of the funnel tool, intended to bring people into the practice of using data.

DataRPM thrives in two use cases: first, where multiple sources or massive amounts of data need to be correlated and scaled, and second, where it is embedded in applications at key points where users can ask and answer questions. Because DataRPM keeps track of the types of questions that are asked, product managers and application designers of BI systems can use this audit trail both to improve the application and to provide better and more advanced dashboards that will actually help.

The penetration of BI is estimated to be between 30 and 40 percent. While BI may never reach everyone in a company, it is likely that “top of the funnel” approaches like DataRPM’s will be the key to achieving 60 to 80 percent adoption.

Dan Woods is CTO and editor of CITO Research, a publication that seeks to advance the craft of technology leadership.