May 17th 2016
By: Steven Ramirez, Conference Co-Chair, Text Analytics World Chicago
In anticipation of his upcoming conference presentation, Tips and Tricks on Developing High-performance Fuzzy Name Search Engine to Prevent Terrorism Financing at Text Analytics World Chicago, June 21-22, 2016, we asked Emrah Budur, Senior Software Engineer at Garanti Technology, a few questions about his work in text analytics.
Q: What is your topic mainly about?
A: Financial institutions and US businesses working with partners overseas must follow strict government regulations that prevent them from doing business with terrorist organizations or other sanctioned entities. In particular, banks are required to detect, precisely and in real time, inappropriate transactions with sanctioned entities among large volumes of legitimate money transfers. However, detecting the names of sanctioned entities is challenging because misspellings produce a large number of name variations.
In this session, we will explore the tips and tricks of developing a fuzzy name search engine model with high precision and recall that detects the names of sanctioned entities more accurately than search engine market leaders.
Q: In your application with text analytics, what behavior or outcome do your models predict?
A: Our model detects the names of sanctioned entities in free-format financial text with a higher degree of accuracy than search engine market leaders. The expected output is the set of names that appear fully or partially in the free-format input text, either as exact matches or as highly similar ones.
Q: Can you share a concrete example about the topic?
A: Yes. When you search for “Barrak Obama” on Google (with the intentionally incorrect spelling), you will see a panel on the right-hand side featuring some information related to Barack Obama. This means Google applied fuzzy search to our query and figured out the exact identity we intended to find. When you search instead for “Barrak Obama elections” on Google, the identity panel on the right-hand side disappears. Although the identity we are looking for is exactly the same, Google fails to show the identity panel for the second query. Banks are required to extract the intended identity from both queries.
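The two queries above can be illustrated with a minimal sketch that slides a token window over free text and compares each window with a watchlist name. It is an illustration only, not the model described in this interview: the function names, the 0.85 threshold, and the use of Python's standard-library `difflib.SequenceMatcher` as the similarity measure are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase the text, replace punctuation with spaces, split into tokens."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def best_window_similarity(name, text):
    """Best similarity between `name` and any token window of equal length in `text`."""
    name_tokens = normalize(name)
    text_tokens = normalize(text)
    width = len(name_tokens)
    target = " ".join(name_tokens)
    best = 0.0
    # Slide a window of `width` tokens across the text (at least one pass).
    for start in range(max(1, len(text_tokens) - width + 1)):
        window = " ".join(text_tokens[start:start + width])
        best = max(best, SequenceMatcher(None, target, window).ratio())
    return best


def find_names(text, watchlist, threshold=0.85):
    """Return watchlist names that appear in `text` with similarity >= threshold."""
    return [n for n in watchlist if best_window_similarity(n, text) >= threshold]


# Hypothetical one-entry watchlist, used only for illustration.
watchlist = ["Barack Obama"]
print(find_names("Barrak Obama", watchlist))
print(find_names("Barrak Obama elections", watchlist))
```

Because the comparison is made per window rather than against the whole query, the extra word “elections” does not dilute the match in this sketch. A production system would need far more than this, for example transliteration handling and partial-name matching.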
Q: How does text analytics deliver value at your application – what is one specific way in which it actively drives decisions or operations?
A: Text analytics is helpful in various decision-making processes. For example, it can help you determine the correlation between the number of false positive predictions and the level of dissimilarity your model tolerates. You can then make a data-driven decision about how much dissimilarity the model should tolerate in order to avoid an excessive number of false positives.
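As a rough sketch of that kind of analysis, the snippet below counts true positives, false positives, and false negatives at several similarity thresholds. The scored pairs are made-up toy values, and the function name `sweep_thresholds` is hypothetical; a real study would use similarity scores from the actual model on a labeled dataset.

```python
def sweep_thresholds(scored_pairs, thresholds):
    """For each threshold, count (true positives, false positives, false negatives).

    `scored_pairs` is a list of (similarity, is_true_match) tuples.
    A candidate is reported as a match when similarity >= threshold.
    """
    rows = []
    for t in thresholds:
        tp = sum(1 for s, y in scored_pairs if s >= t and y)
        fp = sum(1 for s, y in scored_pairs if s >= t and not y)
        fn = sum(1 for s, y in scored_pairs if s < t and y)
        rows.append((t, tp, fp, fn))
    return rows


# Toy, hand-made data for illustration only (not results from the interview).
toy_pairs = [
    (0.97, True), (0.92, True), (0.88, False),
    (0.84, True), (0.80, False), (0.71, False),
]

for t, tp, fp, fn in sweep_thresholds(toy_pairs, [0.7, 0.8, 0.9]):
    print(f"threshold={t:.2f}  TP={tp}  FP={fp}  FN={fn}")
```

On this toy data, raising the threshold removes false positives but eventually starts to miss true matches, which is the trade-off the threshold decision has to balance.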
Q: Can you describe a quantitative result, such as the predictive lift of your model or the ROI of an analytics initiative?
A: One of the most popular measures of success when evaluating search engines is the F1-Score (and also the F5-Score, which weights recall more heavily than precision). F-Scores range from 0 to 1 (worst = 0, best = 1). We benchmarked our model against the domain leaders on a hand-curated training dataset. We achieved an F5-Score of 0.91, whereas the scores of the domain leaders remained under 0.70.
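For readers unfamiliar with the metric, the general F-beta score is F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where P is precision and R is recall. The helper below is a small sketch of that standard formula; the name `f_beta` and the sample precision and recall values are illustrative and are not figures from the benchmark.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score: beta > 1 weights recall more heavily than precision."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0  # Convention: score is 0 when precision and recall are both 0.
    return (1 + b2) * precision * recall / denominator


# Illustrative values only: the same pair of numbers, swapped between P and R.
print(f_beta(0.5, 1.0, beta=5))  # high recall, low precision
print(f_beta(1.0, 0.5, beta=5))  # high precision, low recall
```

With beta = 5, the high-recall case scores much higher than the high-precision case, which suits a screening task where missing a sanctioned name costs more than raising a false alarm.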
Q: What surprising discovery or insight have you unearthed in your data?
A: We were surprised to find that some algorithms that are well known in other domains are quite applicable to the search engine domain. For example, we were surprised to discover that the market basket analysis algorithm can be applied to find frequent and infrequent co-occurrences of multiple stop words. The terms “engineering”, “cooperation”, and “limited”, for instance, can be considered stop words and may be filtered out of search queries because they return vast amounts of unnecessary information. The phrase “Cooperation Limited” may also be frequent in your dataset. However, the expression “Engineering Cooperation Limited” may refer to a unique organization, so ignoring all of these stop words completely will lead to a false negative match. On the other hand, identifying the frequent co-occurrence of “Cooperation Limited” by means of market basket analysis lets you recognize the importance of the term “Engineering” in this context, and hence produce a true positive match instead.
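The core counting step of market basket analysis can be sketched as follows: treat the stop words in each name as a basket and measure the support (relative frequency) of every stop-word combination. This is a simplified illustration under stated assumptions, not the speaker's implementation. The toy names, the stop-word set, and the function name are hypothetical, and a full analysis would typically use an algorithm such as Apriori on a much larger corpus.

```python
from collections import Counter
from itertools import combinations

# Hypothetical stop-word list, taken from the example terms in the interview.
STOP_WORDS = {"engineering", "cooperation", "limited"}


def stop_word_itemset_support(names, stop_words, max_size=3):
    """Support of each stop-word combination: fraction of names containing it."""
    counts = Counter()
    for name in names:
        # The "basket" is the set of stop words present in this name.
        present = sorted(set(name.lower().split()) & stop_words)
        for size in range(1, min(max_size, len(present)) + 1):
            for itemset in combinations(present, size):
                counts[itemset] += 1
    total = len(names)
    return {itemset: count / total for itemset, count in counts.items()}


# Toy corpus for illustration only.
toy_names = [
    "Acme Cooperation Limited",
    "Delta Cooperation Limited",
    "Omega Cooperation Limited",
    "Nova Engineering",
    "Engineering Cooperation Limited",
]

support = stop_word_itemset_support(toy_names, STOP_WORDS)
for itemset, value in sorted(support.items(), key=lambda kv: -kv[1]):
    print(itemset, round(value, 2))
```

In this toy corpus, the pair ("cooperation", "limited") has high support while the triple that includes "engineering" is rare. That contrast is the signal suggesting "Engineering" is the distinguishing term and should not be discarded as a stop word in that context.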
Q: Sneak preview: Please tell us a take-away that you will provide during your talk at Text Analytics World.
A: Running name searches to avoid doing business with sanctioned entities is a highly sophisticated and critical application of text analytics in the global banking industry. With the help of the tips and tricks that we will present in this session, a tailor-made solution can be implemented that provides more accurate and timely results than the solutions of search engine market leaders.
Q: Is it possible to share this case and comment or ask a question even before the session?
A: Yes! You are more than welcome to share your questions and comments on this Q&A platform and on Twitter with the hashtag #detectivenamesearch.
Don't miss Emrah’s conference presentation, Tips and Tricks on Developing High-performance Fuzzy Name Search Engine to Prevent Terrorism Financing on Wednesday, June 22, 2016 from 2:20 to 3:05 pm at Text Analytics World Chicago.