By: Eric Siegel, Founder, Predictive Analytics World

In anticipation of his upcoming conference presentation, Semantic Natural Language Understanding with Spark, Machine-Learned Annotators & Deep-Learned Ontologies, at Predictive Analytics World San Francisco, May 14-18, 2017, we asked David Talby, Chief Technology Officer at Atigeo, a few questions about his work in predictive analytics.

Q: In your work with predictive analytics, what behavior or outcome do your models predict?

A: Over the past five years I've worked on a wide range of predictive analytics projects in the healthcare space. Clients were mostly healthcare providers – where models were built for patient risk prediction, population health management, forecasting clinical & financial metrics, automated clinical coding and other specialty-specific challenges. For payers, the main application of machine learning was around fraud, waste & abuse – both to augment human experts investigating claims, and to automate the review of free-text clinical notes.

Q: How does predictive analytics deliver value at your organization – what is one specific way in which it actively drives decisions or operations?

A: In healthcare, the two most common goals are to save lives and save money. Quite a few projects do both. Uncovering fraudulent clinicians & pharmacists, for example, is often justified due to its high financial ROI, but also provides major benefits by finding cases where patients are harmed, mistreated or subjected to wasteful procedures.

Q: Can you describe a quantitative result, such as the predictive lift of your model or the ROI of an analytics initiative?

A: Most of the work we've done is confidential, but one published example from 2013 concerned a readmissions prediction model we built at the time. We were able to build a completely automated model, which did not apply any curated medical domain expertise and was based solely on our automated feature engineering algorithms, that beat the best-performing academically published model at the time by 20% (in terms of AUC improvement). We were then able to beat that model by an additional 45% by building an ensemble of that model and other models that our data science team built. We've since further improved both the core algorithms and the scalable training pipelines around them, and it is often surprising how much lift can be achieved in a fairly short amount of time over commonly accepted benchmarks.
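
For readers unfamiliar with the terms in the answer above, here is a minimal Python sketch of what "AUC improvement" and "building an ensemble" can mean in practice. It is not the model described in the interview: the labels and scores are invented toy values, the ensemble is a plain average of scores (one simple technique among many), and the function names `auc`, `average_ensemble` and `relative_improvement` are hypothetical helpers written for this illustration. It also assumes the quoted percentages are relative gains over a baseline, which the interview does not spell out.

```python
def auc(labels, scores):
    """Area under the ROC curve, computed from its pairwise definition:
    the share of (positive, negative) pairs in which the positive example
    receives the higher score. Ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_ensemble(*score_lists):
    """Combine several models by averaging their scores per example."""
    return [sum(column) / len(column) for column in zip(*score_lists)]


def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline` (0.20 means 20% better)."""
    return (new - baseline) / baseline


# Toy data (invented): 1 = readmitted, 0 = not readmitted.
labels = [1, 1, 1, 0, 0, 0]
model_a = [0.9, 0.8, 0.3, 0.4, 0.2, 0.1]  # misranks the third patient
model_b = [0.9, 0.4, 0.8, 0.2, 0.5, 0.1]  # misranks the second patient

ensemble = average_ensemble(model_a, model_b)

auc_a = auc(labels, model_a)
auc_b = auc(labels, model_b)
auc_ensemble = auc(labels, ensemble)
gain = relative_improvement(auc_ensemble, auc_a)
```

In this toy case each model misranks a different patient, so averaging their scores corrects both mistakes and the ensemble's AUC is higher than either model's alone. Real ensembles rarely behave this neatly, and the gains reported in the interview came from the team's own models and data, not from anything shown here.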

Q: What surprising discovery or insight have you unearthed in your data?

A: One surprising discovery, for me at least, was the huge variety of clinical language, guidelines and practices across different doctors and hospitals. We have found that while human biology is the same across the US, and doctors supposedly follow similar best practices, the effects of healthcare being 'hyper local' are far greater. This has direct implications when applying machine learning – models transfer very poorly across hospitals, provider groups and geographic locations, whether they are built on structured or unstructured data. This happens in healthcare to an extent that's far greater than what I've seen in e-commerce, web search and financial systems, which are other verticals in which I've worked before.

Q: Sneak preview: Please tell us a take-away that you will provide during your talk at Predictive Analytics World.

A: The talk describes the three key tasks that you must perform to build a natural language understanding pipeline: building an annotations pipeline, training machine-learned annotators, and expanding your ontology via deep learning. The talk comes with full source code, available as free Jupyter notebooks that rely only on open source libraries, so that anyone can download and hack away after the talk. The example we walk through is from the healthcare space, but the design and tasks are general and apply to natural language in any domain-specific setting – understanding patents, SEC filings, academic papers, tweets, emails or transcribed phone calls. It's a technical talk and should be fun and useful for people looking to learn how to get this done for their own projects.


Don't miss David’s conference presentation, Semantic Natural Language Understanding with Spark, Machine-Learned Annotators & Deep-Learned Ontologies, on Tuesday, May 16, 2017 from 3:55 to 4:40 pm at Predictive Analytics World San Francisco.
