Knowledge Mining: A New Paradigm for Big Data Analytics

Author: Donald Thompson

At Microsoft, we used an environment called Cosmos to manage big data for online services such as Bing, Office 365, Windows, Skype, and Xbox Live.  It was, for many of us, a preview of the problems associated with today’s data lake approach: it was nearly impossible to find data, much less find any analysis that had already been performed on it.  Such environments are fine when you have known workloads on known, high-volume data, but they fall well short of providing a 360-degree view of your business, one that lets you uncover the sources of waste, inefficiency, failure, and fraud and then turn those discoveries into real solutions delivered to your workforce.  That’s Maana’s goal.

Maana is a layer that sits on top of a data lake, either behind your firewall or hosted in the cloud, and uses many of the same underlying technologies, such as HDFS and Spark, to deliver a big data experience.  Maana provides comprehensive data profiling and cataloging; an extensible data enrichment and transformation pipeline; a modern and powerful search, exploration, and analysis user experience; advanced machine learning capabilities built directly into that experience; support for common programming languages (such as Python and R) and ecosystem tools (such as Tableau and Spotfire); and interfaces for operationalizing results in new or existing line-of-business applications, where the entire workforce benefits.

Together, this technology powers an end-to-end experience that, for the first time, unifies in a single platform all of the activities involved in delivering business value from analytic efforts.  Maana supports the activities and needs of each of the roles involved:

  • data managers and analysts
  • business analysts and subject-matter experts
  • data scientists building and testing analytic models
  • enterprise architects who need to implement and integrate solutions
  • compliance officers who need to audit the lineage of and access controls on data at a granular level
  • knowledge workers, through a familiar browser-based search and exploration experience or through the new and existing line-of-business applications they use daily

More fundamental, however, is the shift Maana is introducing from data mining to knowledge mining.

Consider the evolution of document search.  Almost 60 years ago, Salton pioneered the field of Information Retrieval by introducing and integrating the fundamental concepts of an index (to speed up finding documents), a similarity function (to compare the query to each document), and a ranking function (to determine the best documents among the results).  This is the model that is still predominantly used in enterprise search today (and why the search experience is largely unsatisfying).  Of the three, the ranking function is the most important, as Google showed with its PageRank algorithm (which uses a form of popularity, derived from the links pointing to a page, to boost document results).  This innovation distinguished Google and allowed it to take the lead in consumer search.  But that was 20 years ago and, unlike enterprise search, consumer search has seen massive improvement and innovation since those early breakthroughs.  Today, we have a much more entity-centric experience through projects like Microsoft’s Satori (disclaimer: I am the founder of Satori) and Google’s Knowledge Graph.
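To make the three Salton-style pieces concrete, here is a minimal sketch in Python.  It is an illustration only, not Maana's or any search engine's actual implementation: the function names, the whitespace tokenizer, the smoothed TF-IDF weighting, and the three sample documents are all assumptions chosen for brevity.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (deliberately naive)."""
    return text.lower().split()


def build_index(docs):
    """Index: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(query, docs, index):
    """Return (doc_id, score) pairs, best match first."""
    n_docs = len(docs)
    # Rare terms get a higher weight than common ones.
    idf = {t: math.log(n_docs / len(p)) + 1.0 for t, p in index.items()}

    # Squared length of each document's TF-IDF vector.
    doc_norm = defaultdict(float)
    for term, postings in index.items():
        for doc_id, tf in postings.items():
            doc_norm[doc_id] += (tf * idf[term]) ** 2

    # Similarity function: dot product of query and document vectors.
    dot = defaultdict(float)
    query_norm = 0.0
    for term, q_tf in Counter(tokenize(query)).items():
        if term not in index:
            continue  # terms absent from the index contribute nothing
        q_weight = q_tf * idf[term]
        query_norm += q_weight ** 2
        for doc_id, tf in index[term].items():
            dot[doc_id] += q_weight * tf * idf[term]

    # Ranking function: order by cosine similarity, highest first.
    scored = [
        (doc_id, d / math.sqrt(doc_norm[doc_id] * query_norm))
        for doc_id, d in dot.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample documents, for illustration only.
docs = {
    "d1": "tom hanks stars in forrest gump",
    "d2": "tom cruise stars in top gun",
    "d3": "weather report for houston",
}
index = build_index(docs)
results = search("tom hanks", docs, index)
```

Here `results` ranks `d1` ahead of `d2` and omits `d3` entirely.  Note what the model returns: documents ordered by term overlap, with no notion of who Tom Hanks is.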

A search for “tom hanks,” for example, should not just yield a set of documents that contain the terms “tom” and “hanks” in some prioritized order (that you have little to no control over).  Instead, we want to know various things about the disambiguated concept Tom Hanks: information about his movies, biographical details, the latest news or gossip, similar actors to compare him with, and so on.  This is an entity: it has properties and participates in relationships and events with other entities.  This is a simple, yet powerful view of the world through data that we refer to as knowledge.  We wish to capture both the structural and the dynamic qualities of a domain, whether that domain is the universe of movies and actors or oil & gas, manufacturing, financial services, insurance, health care, travel, fraud, cyber security, and so on.  We want to be able to fluidly and effortlessly model and remodel these domains, populate them with continuously fresh data, and surface them for discovery, exploration, analysis, and use.
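The entity-centric view can be sketched as a small data structure.  This is a hedged illustration, not Maana's data model: the `Entity` and `KnowledgeGraph` classes, the `acted_in` relation name, and the identifiers are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A disambiguated concept with a kind and a bag of properties."""
    id: str
    kind: str
    properties: dict = field(default_factory=dict)


class KnowledgeGraph:
    """Entities plus directed, labeled relationships between them."""

    def __init__(self):
        self.entities = {}
        self.edges = []  # (subject_id, relation, object_id) triples

    def add_entity(self, entity):
        self.entities[entity.id] = entity

    def relate(self, subject_id, relation, object_id):
        # Both ends of a relationship must already be known entities.
        for entity_id in (subject_id, object_id):
            if entity_id not in self.entities:
                raise KeyError(f"unknown entity: {entity_id}")
        self.edges.append((subject_id, relation, object_id))

    def related(self, subject_id, relation=None):
        """Entities reachable from subject_id, optionally by one relation."""
        return [
            self.entities[obj]
            for subj, rel, obj in self.edges
            if subj == subject_id and (relation is None or rel == relation)
        ]


graph = KnowledgeGraph()
graph.add_entity(Entity("tom_hanks", "Person", {"name": "Tom Hanks"}))
graph.add_entity(Entity("forrest_gump", "Movie", {"title": "Forrest Gump", "year": 1994}))
graph.add_entity(Entity("cast_away", "Movie", {"title": "Cast Away", "year": 2000}))
graph.relate("tom_hanks", "acted_in", "forrest_gump")
graph.relate("tom_hanks", "acted_in", "cast_away")

movies = [m.properties["title"] for m in graph.related("tom_hanks", "acted_in")]
```

A query now starts from the entity and follows its relationships, rather than matching the strings “tom” and “hanks” against document text.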

Maana, then, represents this paradigm shift from the world of data lakes and Salton-style information retrieval into the world of knowledge.  Maana provides you with a fundamentally new and powerful information asset: a knowledge graph of interrelated concepts.  It is all about the care and feeding of this graph.  Raw data is acquired from its myriad sources and in its myriad formats (databases and warehouses, log files, documents and media, sensor streams) and is indexed and automatically represented in the graph.  A series of user-guided and machine-assisted steps (using advanced proprietary algorithms) transforms and enriches the data to produce better knowledge.  Over time, as more analysis is performed for various projects and efforts, the graph grows to reflect the organization’s evolving understanding.

For example, we recently completed two projects for the same customer in different, but related, business units.  The problem to be solved in department A was to understand, from high-volume (continuous) equipment sensor data, the causes of alarms and trips, and ultimately to build a predictive failure model.  As part of this data set, we also received contextual information about the equipment, such as its repair and operational history and notes, and its operating locations and environments.  The problem for department B, a call center, was to understand all the reasons their customers contacted them, how the cases were handled, and what made for successful projects and people, and to help drive improvements to the right neighboring departments (engineering, documentation, training, services) to reduce maintenance costs and customer dissatisfaction.  We successfully completed both projects using the Maana platform.  But we received a nice surprise when department A’s project connected to department B’s project.  It turned out that the same product lines were involved, and this provided a bridge between the two knowledge subgraphs.  From this, we were further able to identify a software change that went out to all the devices in a line (from department A) that resulted in a sharp increase in call volume (from department B).  This result simply could not have been found in a traditional data lake using traditional data mining unless it had been the focus of a particular effort.
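The bridging effect can be sketched with two small edge lists.  Everything here is hypothetical: the node names, relation labels, and data are invented for illustration and are not the customer's data or Maana's algorithms.  The sketch only shows that a shared node lets a path run from one project's subgraph into the other's.

```python
from collections import defaultdict, deque

# Hypothetical subgraphs as (subject, relation, object) triples.
dept_a_edges = [
    ("software_update_X", "deployed_to", "product_line_P"),
    ("sensor_alarm_17", "observed_on", "product_line_P"),
]
dept_b_edges = [
    ("call_case_901", "concerns", "product_line_P"),
    ("call_case_902", "concerns", "product_line_Q"),
]


def nodes(edges):
    """All node names mentioned in a list of triples."""
    return {name for subj, _, obj in edges for name in (subj, obj)}


def bridge_nodes(edges_a, edges_b):
    """Nodes that appear in both subgraphs."""
    return nodes(edges_a) & nodes(edges_b)


def shortest_path(edges, start, goal):
    """Breadth-first search, treating relationships as undirected."""
    adjacent = defaultdict(set)
    for subj, _, obj in edges:
        adjacent[subj].add(obj)
        adjacent[obj].add(subj)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in sorted(adjacent[path[-1]]):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no connection between start and goal


shared = bridge_nodes(dept_a_edges, dept_b_edges)
isolated = shortest_path(dept_a_edges, "software_update_X", "call_case_901")
connected = shortest_path(
    dept_a_edges + dept_b_edges, "software_update_X", "call_case_901"
)
```

Within department A's edges alone there is no path from the software update to a call case, so `isolated` is `None`.  Once the two edge lists are combined, the shared product line connects them.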

There is much more that can be said, but I leave you to consider the implications of such a resource to your business.  Having a lot of data is not enough.  Having a place to put it and run jobs on it is not enough.  What is needed is an environment for supporting the activities of:

  • Connecting, enriching, and transforming
  • Searching and exploring
  • Analyzing
  • Testing and simulating
  • Collaborating
  • Reporting
  • Auditing and securing
  • Integrating with existing systems and investments

And all of this should be orchestrated around a shared, living, evolving knowledge graph.  We refer to this entire effort as Knowledge Mining.
