Data mining expert lays out some useful tools and techniques from sentiment analysis to topic modeling and natural language processing
There’s a proliferation of unstructured data on the Internet and coming into customer call centres. But manually going through the haystack to find the needle is an unrealistic task.
Speaking at the recent Big Data TechCon event in Boston, data mining expert Dan Sullivan of Cambia Health Solutions discussed several tools and techniques to get you started on mining text data effectively and extracting the rich insights it can bring.
Analysing the opinion or tone of what people are saying about your company on social media or through your call centre can help you respond to issues faster, see how your products and services are performing in the market, find out what customers are saying about competitors, and so on.
There are three ways of going about this kind of sentiment analysis, Sullivan said. The first is polarity analysis, where you simply identify if the tone of communications is positive or negative. The second is categorisation, where tools get more fine-grained and identify if someone’s confused or angry, for example. The third is scaling, where emotion is rated on a range, for example from ‘sad’ to ‘happy’ on a scale of 0 to 10.
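Polarity analysis, the simplest of the three, can be illustrated with a minimal sketch. The word lists below are a hypothetical toy lexicon invented for this example, not taken from any of the tools Sullivan mentioned, and real systems handle negation, intensity and context far better than a word count does.

```python
import re

# Hypothetical toy lexicon; a real project would use a curated resource.
POSITIVE = {"love", "great", "good", "happy", "excellent"}
NEGATIVE = {"hate", "bad", "poor", "angry", "terrible", "sucks"}


def polarity(text: str) -> str:
    """Label text as positive, negative or neutral by counting lexicon hits."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(polarity("I really hate company X and their product sucks"))
```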
Sullivan said Affective Norms for English Words (ANEW) is useful for emotional ratings, and ranks words in terms of their pleasure, arousal and dominance. This allows communication to be identified in more detail, such as mild concern or somewhat angry.
WordNet is another tool. It links related words, such as synonyms and antonyms, and lets users build classification schemes from that semantic information.
“This is to do with semantic or concept-based classifications and was developed by linguists. It’s basically an ontology of English words,” Sullivan said.
Other tools to get started with include the Natural Language Toolkit and TextBlob in Python, which are free to use. Commercial tools for sentiment analysis include RapidMiner, ViralHeat and many others, Sullivan said.
Picking up on sarcasm and irony in sentiment analysis, however, remains a challenge. “That’s a problem, especially with things like tweets and social media where people are ironic and sarcastic because that’s a way to get a message across,” Sullivan said.
“There can be an opposite sentiment, where one is very negative and the other very positive. That’s usually an indication of sarcasm. For example, coffee was watery, I really love the new blend at Starbucks.”
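The opposite-sentiment signal Sullivan describes can be sketched as a crude heuristic: score each clause separately and flag text that contains both a positive and a negative clause. The lexicon and function names are hypothetical, and the rule will also flag honest mixed reviews, so it is at most a hint that a message deserves a closer look.

```python
import re

# Hypothetical toy lexicon, chosen only to cover the example sentence.
POSITIVE = {"love", "great", "excellent"}
NEGATIVE = {"watery", "hate", "terrible", "awful"}


def clause_scores(text):
    """Split text on punctuation and score each clause by lexicon hits."""
    clauses = [c for c in re.split(r"[,.;!?]+", text.lower()) if c.strip()]
    scores = []
    for clause in clauses:
        words = re.findall(r"[a-z']+", clause)
        scores.append(
            sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        )
    return scores


def maybe_sarcastic(text):
    """Flag text whose clauses pull in opposite directions."""
    scores = clause_scores(text)
    return any(s > 0 for s in scores) and any(s < 0 for s in scores)


print(maybe_sarcastic("Coffee was watery, I really love the new blend at Starbucks."))
```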
Companies also need to pay attention to the context of social media posts and other forms of customer communication, Sullivan said.
“You don’t want to, for example, just capture a tweet that just says, ‘I really hate company X and their product sucks.’ Contextual information is really important to help us understand why the tone might be positive or negative,” he said. “So metadata is crucial. It’s not just the physical 140 characters you want to keep track of.
“Was the person replying to another negative tweet? Was this the original composition? What was the geographic location?”
Topic modeling is a useful technique for identifying dominant themes in a vast array of documents and for dealing with a large corpus of text. Legal firms, for example, might have to go through millions of documents used in big litigation cases. This is where topic modeling can come in handy, Sullivan said.
There are a couple of ways to go about topic modeling. One is latent Dirichlet allocation, where words are automatically clustered into topics, with a mixture of topics in each document. The other is probabilistic latent semantic indexing, which models co-occurrence data using probability.
“The basic idea is we have these documents about topics. You can figure out what the topics are based on the words that are used in the document,” Sullivan explained. “Given a particular document, what is the probability that a certain topic is covered in that document? And given a certain topic, what is the likelihood that a particular word would be used about that?
“The way these algorithms work is kind of iterative. There are many iterations of taking guesses about what words were associated with what topics and the algorithms basically hone the best set of combinations of words for topics and topics for documents. It works really well.”
Sullivan used the homepage of the New York Times website on 27 April to show how topic modeling could go through and identify that one article was about student debt, law and graduation; another was on government debt, EU and the Euro; and a third discussed Greece, political negotiation and Greek finance ministers.
Topic modeling can also assign a weight to the importance of each topic in each article. For example, the first article might be 50 per cent about student debt, 30 per cent about graduation and 20 per cent about law.
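The iterative guessing and the per-article topic weights can be sketched with a small collapsed Gibbs sampler for latent Dirichlet allocation. This is an illustrative toy, not the implementation used by any of the tools named in this article: the function name, the `alpha` and `beta` smoothing values and the four tiny documents are all assumptions made for the example.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenised documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # Start from a random guess: one topic per word occurrence.
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic = [Counter(z) for z in assignments]        # topic counts per document
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics
    for doc, z in zip(docs, assignments):
        for word, topic in zip(doc, z):
            topic_word[topic][word] += 1
            topic_total[topic] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                old = assignments[d][i]
                # Withdraw the current guess for this word occurrence.
                doc_topic[d][old] -= 1
                topic_word[old][word] -= 1
                topic_total[old] -= 1
                # Weight = (how much the document uses the topic)
                #        * (how much the topic uses the word).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][word] += 1
                topic_total[new] += 1

    # Per-document topic mixture; each row sums to 1.
    mixtures = [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    return mixtures, topic_word


# Hypothetical mini-corpus, loosely echoing the themes in the example above.
docs = [
    "student debt loan graduation student law".split(),
    "graduation student debt law school".split(),
    "greece euro debt finance minister negotiation".split(),
    "euro greece government debt finance".split(),
]
mixtures, topic_word = lda_gibbs(docs, num_topics=2)
for row in mixtures:
    print([round(weight, 2) for weight in row])
```

Each printed row is one document's topic mixture, the same kind of weighting as the 50/30/20 split described above. On a corpus this small the topics found are not guaranteed to be meaningful.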
Some useful tools for topic modeling include the Stanford Topic Modeling Toolbox, Mallet (UMass Amherst), the R package ‘topicmodels’ and the Python package Gensim, Sullivan said.
“The Stanford and Mallet are Java-based tools. You don’t have to be a Java programmer, you can use these tools by running in the command line [versions],” he added.
One downside of topic modeling is that it’s not easily scalable, Sullivan said.
“If you are doing large document sets, one of the things you might want to do is use topic modeling for subsets or samples that have good representation of the entire set,” he said. “That might give you a sense of the topics, then you can do clustering to break them down into smaller topics and do more detailed analysis on each of the clusters. That’s one way to deal with the scalability issue of topic modeling.”
TF–IDF (term frequency–inverse document frequency) looks at how frequently a word appears in a document relative to the whole set of documents.
“Words that appear frequently in a lot of documents may not be very useful, like ‘the’, ‘a’. But if there are words that show up frequently in stories about the Greek debt crisis but not about something else like the elections, for example, then those are useful words to keep track of. And that’s what TF–IDF captures,” Sullivan explained.
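A minimal sketch of the calculation, assuming the plain textbook form (a word's share of the document multiplied by the log of total documents over documents containing the word). Libraries use several smoothed variants, and the three short documents here are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: weight} dict per tokenised document."""
    n_docs = len(docs)
    # Number of documents each word appears in.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


# Hypothetical headlines, tokenised by whitespace.
docs = [
    "the greek debt crisis".split(),
    "the election results".split(),
    "the greek finance minister".split(),
]
weights = tf_idf(docs)
print(weights[0])
```

Because ‘the’ appears in every document, its weight is zero, while words confined to one story score highest.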
This can be used to build classifiers or predictive models, he said. For example, a company that has about 10 years’ worth of customer call centre dialogue that has been transcribed into text could tap into this data and figure out what it all says. To do this, Sullivan said the calls could be classified into ‘conversations with customers about to leave’, ‘conversations with customers downgrading their service’, ‘conversations with customers upgrading their service’.
“We adjust weights based on how frequently terms appear in a particular document set. Then we take those features, so the words that show up frequently in a certain set, and they will be good indicators,” he said. “These weights of different words is what the machine learning algorithm uses to do the classification.
“We can take our new input and push it through, so we train our machine learning algorithm to classify these calls, then push it through this classifier model so we can identify them.”
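The train-then-classify flow can be sketched with a nearest-centroid classifier over TF–IDF-weighted word counts. The choice of algorithm, the smoothed IDF formula, the labels and the six short transcripts are all assumptions made for this example; Sullivan did not specify them.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train(labelled_calls):
    """labelled_calls: list of (transcript, label). Returns (idf, centroids)."""
    docs = [tokenize(text) for text, _ in labelled_calls]
    n_docs = len(docs)
    doc_freq = Counter(w for doc in docs for w in set(doc))
    # Smoothed IDF (an assumed variant) so every known word keeps some weight.
    idf = {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}
    sums, sizes = defaultdict(Counter), Counter()
    for doc, (_, label) in zip(docs, labelled_calls):
        for w, count in Counter(doc).items():
            sums[label][w] += count * idf[w]
        sizes[label] += 1
    # Centroid = average weighted vector of the calls carrying that label.
    centroids = {
        label: {w: v / sizes[label] for w, v in vec.items()}
        for label, vec in sums.items()
    }
    return idf, centroids


def cosine(a, b):
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def classify(text, idf, centroids):
    """Assign a new transcript to the label with the most similar centroid."""
    vec = {w: c * idf[w] for w, c in Counter(tokenize(text)).items() if w in idf}
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical labelled transcripts standing in for years of call data.
calls = [
    ("i want to cancel my account and leave", "leaving"),
    ("please cancel the contract i am switching provider", "leaving"),
    ("can i move to a cheaper plan", "downgrading"),
    ("i need a cheaper smaller plan", "downgrading"),
    ("i would like to upgrade to the premium plan", "upgrading"),
    ("upgrade me to faster premium service", "upgrading"),
]
idf, centroids = train(calls)
print(classify("i want to cancel and leave today", idf, centroids))
```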
Several machine learning algorithms can be used for classification, but Sullivan said ensemble methods, or combinations of algorithms, are effective.
“The idea there is you train a bunch of classifiers and then essentially they vote on it and you take the majority answer,” he said. “Heterogeneous ensemble algorithms use a combination of different kinds. So you might take an SVM [support vector machine], a perceptron, a nearest centroid, a naïve Bayes [classifier] and combine those results.
“They can give you better results because each of these algorithms have their own strengths and weaknesses. You want algorithms that complement each other – where one is weak the other is strong.”
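The voting step can be sketched in a few lines. The three rule-based functions below are hypothetical stand-ins for trained models such as an SVM or a perceptron; only the majority-vote logic is the point of the example.

```python
from collections import Counter


def majority_vote(classifiers, item):
    """Ask every classifier for a label and return the most common answer.

    On a tie, the label that was voted for first wins.
    """
    votes = Counter(clf(item) for clf in classifiers)
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for trained models, each with its own blind spot.
def by_keyword(text):
    return "leaving" if "cancel" in text else "staying"


def by_length(text):
    return "leaving" if len(text.split()) < 4 else "staying"


def by_politeness(text):
    return "staying" if "thanks" in text else "leaving"


ensemble = [by_keyword, by_length, by_politeness]
print(majority_vote(ensemble, "cancel my account now please"))
```

Here `by_length` votes ‘staying’ but is outvoted by the other two, which is the complementary behaviour Sullivan describes.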
This can also guard a classifier against making too many wrong predictions, Sullivan said. As a “rough” guide, precision and recall should each be 0.8 or above, he said. Below this, the model parameters need tweaking, or feature selection and data quality need to be revisited.
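Precision and recall can be computed directly from a list of true labels and a list of predictions. The function name and the ten labelled calls below are made up to illustrate the 0.8 threshold.

```python
def precision_recall(actual, predicted, positive):
    """Precision and recall for one label treated as the positive class."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(a == positive and p == positive for a, p in pairs)
    false_pos = sum(a != positive and p == positive for a, p in pairs)
    false_neg = sum(a == positive and p != positive for a, p in pairs)
    # Precision: of the calls flagged as positive, how many really were.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: of the truly positive calls, how many were flagged.
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical results: 'L' = customer leaving, 'S' = customer staying.
actual = list("LLLLLSSSSS")
predicted = list("LLLLSLSSSS")
print(precision_recall(actual, predicted, positive="L"))  # (0.8, 0.8)
```

With four of five leavers caught and one false alarm, both scores sit exactly on the 0.8 mark.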
Sullivan cautioned that TF–IDF can be a “crude” analytics practice because it throws away a lot of information about syntax and about words that are semantically similar, such as hotel and motel.