Unlock the Textual Content in Your Data Lake

Wouldn’t it be great if you could analyze all of your customer interactions and learn which parts of your services or sales are better than others? Or analyze the textual descriptions of all your service requests and infer the call volume drivers? Understand the main topics of a chat session? Use the same data to understand how customers are actually using your products? Or go beyond customer interactions and identify the common bugs in your code by analyzing the text engineers type into a bug tracking system such as Jira or Bugzilla?

Liberating your data is not enough if a big chunk of it remains locked in human-generated text.

EMC’s Data Science as a Service team has created an advanced text analytics technology that can help your organization unlock the value in human-generated text.

Why a machine-learning solution is needed

Free text analysis in the corporate world has become almost synonymous with Sentiment Analysis, the task of identifying and measuring the emotion (negative or positive) expressed in a sentence or text. This may be very useful for a retailer like Amazon analyzing product reviews, or a hotel chain analyzing its TripAdvisor reviews. However, most corporations don’t have these types of reviews and are simply not discussed on Twitter by anyone who is not an employee. (See “Bank of England Analyzes Twitter, Discovers Minnesota Vikings”.) More importantly, sentiment analysis does not reveal the content of your data, and it typically applies to but a small drop in the huge ocean of data a business has.

Even worse, Natural Language Processing models (which analyze human-generated text) do not transfer well, so you usually need to build a new model for each of your data sets. And that is before mentioning that Sentiment Analysis models aren’t that accurate to begin with (https://twitter.com/lowlandscph/status/630649790154543104).

So what do you do? You have liberated the data, it is now available in a Big Data platform or Business Data Lake (BDL), and you have a kickass business intelligence (BI) team that is great at analyzing structured data. But instead of analyzing sentiment, you should be analyzing content.

Content Modeling is the task of automatically adding labels to text, converting the unstructured text to structured labels specific to your data. For example:

  • When analyzing IT tickets, the sentence “need help killing them” would get the “process management” and “os” labels.
  • In the same dataset, “blue screen on restart” and “won’t boot” would both get the “system crash” label.
  • In a medical genetics data set, “suffers from recurrent episodes of RLQ pain” would be labeled as “Cystic Fibrosis”.

Having these labels attached to our existing data allows us to leverage both the structured data and the free text to answer all the questions we described above and many more.
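
As a rough illustration (not the team’s actual pipeline), here is what such labeled records might look like once the content tags are joined with existing structured fields; the pandas usage and the column and label names below are assumptions made for the sake of the example.

# A minimal sketch: content-model labels attached to existing structured
# data, so BI tools can slice the free text like any other column.
# (pandas assumed; ticket_id/region and the labels are hypothetical.)
import pandas as pd

tickets = pd.DataFrame({
    "ticket_id": [101, 102, 103],
    "region": ["EMEA", "AMER", "EMEA"],
    "text": ["need help killing them",
             "blue screen on restart",
             "won't boot"],
})

# Output of a content model: one or more labels per record.
labels = pd.DataFrame({
    "ticket_id": [101, 101, 102, 103],
    "label": ["process management", "os", "system crash", "system crash"],
})

labeled = tickets.merge(labels, on="ticket_id")
print(labeled.groupby(["region", "label"]).size())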

The traditional drawback is that you need to construct one of these content models for each type of data you have. Sales data differs from support data, and IT tickets are very different from engineering tickets. Even within engineering, different products, or teams within the same product, may discuss extremely different topics. The old approach of building rule-based content models (e.g., hand-crafted rules and keyword lists that decide which predefined label applies) requires months of analyst work per domain, and the models often become useless and inaccurate within a few quarters as vocabulary or products change.
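
For contrast, here is a toy sketch of that rule-based style (the keyword lists are invented for illustration). Every label needs a hand-maintained keyword list, and any phrasing the list does not anticipate is silently missed, which is why these models decay as vocabulary changes.

RULES = {
    "system crash": ["blue screen", "bsod", "won't boot", "kernel panic"],
    "process management": ["kill", "zombie process", "hung process"],
}

def label_with_rules(text):
    # A rule fires when any of its keywords appears in the (lowercased) text.
    text = text.lower()
    return [label for label, keywords in RULES.items()
            if any(kw in text for kw in keywords)]

print(label_with_rules("Blue screen on restart"))  # ['system crash']
print(label_with_rules("screen flickers"))         # [] -- unanticipated vocabulary is missed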

EMC’s Data Science as a Service team has encountered demand for this kind of text analytics across the organization. We have been hearing similar stories from EMC’s customers as they journey to the third platform (for example, “I’d like to segment my customers, but most of what I have are free text notes from the CRM” or “We need to enable our research team to use the free text parts of the EHR, as they are much more informative than the codes, which the physicians neglect to maintain.”) This has led us on a journey to create a solution addressing these needs.

The main requirement we set for our solution was quick onboarding; the customer is not going to embrace a system that requires months of their analysts’ time. Since nearly every business unit has its own data with its own specific text fields, a basic requirement is the ability to build many specific content models. The results should be accessible to BI tools and to downstream applications that will monetize them. Finally, any output we generate should be as accurate as possible.

Before we dive into our technical machine-learning solution, let’s discuss the results. Our system takes a generic text repository as input, identifies the main themes, and lets the business unit’s analysts interact with and tune the resulting model until its accuracy exceeds 90 percent. This takes less than a week and is twenty times faster than performing the same task using the old, rule-based methods.

[Figure: topic analysis of a chat session]

Furthermore, some of the topics discovered are not what the analysts had in mind when they started and would otherwise have been overlooked (for example, the term “pleasantries” when analyzing chat sessions). In this analysis of a chat session, we can immediately detect that the problem was related to licensing, and that a reasonable amount of the conversation was dedicated to pleasantries.

The topic labels generated by the system can be consumed in multiple ways:

  • Added to the analysts’ BI tool of choice (such as Tableau or Qlik) as a data column
  • An exploration screen for analysts to ask complex questions, such as: provide all the records of a licensing problem (content tag) in Europe where the customer waited more than three days (a query sketch follows this list)
  • Chat session analysis
  • Call volume drivers
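
A minimal sketch of that first example query, assuming the labeled records live in a pandas DataFrame with hypothetical columns content_tag, region and days_waited:

import pandas as pd

# Hypothetical labeled records: content tags combined with structured fields.
records = pd.DataFrame({
    "record_id": [1, 2, 3],
    "content_tag": ["licensing", "licensing", "system crash"],
    "region": ["EMEA", "AMER", "EMEA"],
    "days_waited": [5, 1, 4],
})

# "All records of a licensing problem in Europe where the customer
# waited more than three days."
answer = records[(records.content_tag == "licensing")
                 & (records.region == "EMEA")
                 & (records.days_waited > 3)]
print(answer)
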
[Figure: web-based exploration tool for analyzing IT incidents]

The example above shows a web-based tool for analyzing IT incidents. In this example, we analyzed the breakdown of laptop problems based on the textual content of the incidents. This is unlike most traditional systems, which map incidents to a list of predefined problems an expert would expect to see in the data. The topics extracted from the text fields are combined with other structured data, making it easy to search through, correlate, and match topics with geographies, teams, severity and any other available information.

Bootstrapping a Supervised Classification Model

Achieving a high accuracy labeling model, with quick onboarding, is no easy task. Topic modeling algorithms jump to mind, but these suffer from low accuracy, and labeling the topic clusters by looking at a list of words is difficult at best. There is a pretty big gap between the level of accuracy required for a business application and the output of topic modeling algorithms. In our experiments, the topical clusters created are usually 40-60 percent accurate, and there are always a few topics which are just not interesting.

We solve this by adding a human element to supervise and edit the results, while maintaining the one-week onboarding objective we have set for ourselves. We have identified five key areas in the process.

Preprocess: Normalize the text (fun data cleaning; this may be domain specific), remove stopwords (commonly used words that act as noise in the analysis), and lemmatize (group together variants of the same word).
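
A minimal preprocessing sketch, using NLTK as one possible toolchain (the post does not name the libraries actually used):

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())            # normalize
    tokens = [t for t in tokens if t not in STOPWORDS]      # drop stopwords
    return [lemmatizer.lemmatize(t) for t in tokens]        # group word variants

print(preprocess("The drives were replaced after the disks failed"))
# -> ['drive', 'replaced', 'disk', 'failed']  (verbs need POS tags to lemmatize fully)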

Unsupervised Clustering: Latent Dirichlet Allocation (LDA) comes into play here. This is the state-of-the-art model for identifying themes in a collection of documents. While all Big Data machine learning platforms have some implementation of LDA, to get good results it is best to use the asymmetric-hyperparameter flavor introduced by Dr. Hanna Wallach. This is easy to do with the Mallet package:

First, import your data into the Mallet format:
bin/mallet import-file --input your_file --output your_file.mallet --token-regex '[\p{L}\p{N}-\p{P}]+' --keep-sequence --remove-stopwords

Then run the topic modeling:
bin/vectors2topics --input your_file.mallet --num-topics 12 --num-iterations 1000 --optimize-interval 10 --output-state topic-state.gz
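
For readers who prefer to stay in Python, a comparable sketch with gensim (an alternative to the Mallet commands above, not the toolchain the post uses) can learn an asymmetric document-topic prior by setting alpha="auto":

from gensim import corpora, models

# Hypothetical, trivially small corpus; in practice these would be the
# preprocessed token lists from the previous step.
docs = [
    ["disk", "drive", "replace", "hot", "spare"],
    ["blue", "screen", "restart", "crash"],
    ["license", "key", "expired", "activation"],
]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# alpha="auto" asks gensim to learn an asymmetric document-topic prior.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=3,
                      passes=10, iterations=1000, alpha="auto")
for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)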

Another trick is to remove duplicate documents, as redundancy is not good for count-based methods (see these academic papers on the topic: http://www.biomedcentral.com/1471-2105/14/S2/10/ and http://dx.plos.org/10.1371/journal.pone.0087555).
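
One simple way to apply that trick is to hash a normalized form of each document and keep only the first occurrence (a rough sketch; more sophisticated near-duplicate detection is of course possible):

import hashlib

def dedupe(texts):
    # Hash whitespace-normalized, lowercased text; keep first occurrence only.
    seen, unique = set(), []
    for t in texts:
        key = hashlib.md5(" ".join(t.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(t)
    return unique

print(len(dedupe(["Won't boot", "won't  boot", "blue screen on restart"])))  # -> 2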

Labeling (subject matter expert annotation): The output of topic modeling is not labels. At this point, we have divided the data into different clusters (topics), each defined by its affinity to specific words: that is, a distribution of words per topic and a distribution of topics per document.

As an example, the “Disk Drive Failure” topic has high probability for words such as “replace”, “disk”, “drive”, “hot”, “spare.” Other topics may still have their own affinity to those words. Some of these topics are important for the business while others can be discarded.

At this step we present the analyst with the different topics at different levels of granularity. We use the Termite package (http://vis.stanford.edu/papers/termite) to visualize single words and the differences between topics, and for each topic we extract bi-grams, important noun phrases, and sentences based on KL divergence with the underlying cluster. These different views help the analyst assign labels and choose the salient topics in just 2-3 hours. The big upside is that this process is data-driven rather than driven by prior misconceptions.
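
As a rough sketch of the sentence-extraction idea (the post does not spell out the exact scoring, so the smoothing and ranking choices below are ours), candidate sentences can be ranked by the KL divergence between the topic's word distribution and each sentence's word distribution:

import math
from collections import Counter

def kl_divergence(p, q, vocab, eps=1e-9):
    # KL(p || q) with a small epsilon standing in for unseen words.
    return sum(p.get(w, eps) * math.log(p.get(w, eps) / q.get(w, eps)) for w in vocab)

def rank_sentences(topic_dist, sentences):
    # topic_dist: {word: probability}; sentences: list of token lists.
    scored = []
    for tokens in sentences:
        counts = Counter(tokens)
        total = sum(counts.values())
        sent_dist = {w: c / total for w, c in counts.items()}
        vocab = set(topic_dist) | set(sent_dist)
        scored.append((kl_divergence(topic_dist, sent_dist, vocab), tokens))
    return sorted(scored)  # lowest divergence = most representative of the topic

topic = {"disk": 0.4, "drive": 0.3, "replace": 0.2, "spare": 0.1}   # toy topic
candidates = [["replace", "disk", "drive"], ["reset", "password"]]
for score, tokens in rank_sentences(topic, candidates):
    print(round(score, 2), tokens)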

Post processing (tuning): After labeling, we have a method that can tag different sentences with labels. Due to the unsupervised nature of LDA, however, the accuracy is not sufficient for business purposes (only about 40-60 percent).

Our solution was to let the analyst edit each topic using the same statistically important sentences extracted earlier. By marking which ones are correct (relevant to the topic) and which are false positives, we can train a brand new classifier using scikit-learn (see Text Analytics: Easy Classification for Routing Service Requests). Results are usually around 90 percent accurate after 1-2 hours of tuning.
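
A hedged sketch of this tuning step with scikit-learn (the post names the library but not the specific model; TF-IDF features with logistic regression are one reasonable choice, and the example sentences are invented):

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Sentences the analyst marked as belonging (1) or not belonging (0)
# to the "licensing" topic -- illustrative data only.
sentences = [
    "my license key expired yesterday",
    "cannot activate the product license",
    "the fan is making a loud noise",
    "screen flickers after the update",
]
labels = [1, 1, 0, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(sentences, labels)

# The resulting lightweight classifier object can be applied retrospectively
# to old records or dropped into any new pipeline.
print(clf.predict(["need a new activation key for the license"]))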

Actionable Insights: Once we have all these classifiers, we can apply them retrospectively to our textual data or to any new data we collect. These lightweight classifier objects can also be used in any new pipeline we create. Using them we have built the exploration screen, which combines content labels with the structured data and a search engine, as well as the chat session analysis. By creating a customized machine-learning solution with the added accuracy of minimal human supervision, we can finally leverage the mounting free text being gathered across every business to gain competitive value.

Summary

Free text is being gathered by our business applications. It is now available for analysis, but analyzing free text is not an easy task, and it usually carries no sentiment (no one calls IT to say they love the new Exchange version, for example). Models that automatically add labels to our text data can provide a lot of value by transforming the text into structured data, which the BI organization and analysts are already masters at using. Such a system can bring a lot of value, but only if it provides the accuracy and quick onboarding that business users need.

Dr. Alon Grubshtein contributed to this post.

About the Author: Raphael Cohen